## Splitting Sentences into Chunks

The input to this notebook is a dataframe of sentences that mention a given menu item. The output is a list of sentence chunks that mention that menu item. `get_chunks()` extracts these chunks from a sentence: for each occurrence of the search term, it keeps the term plus a window of surrounding words. The size of the window, measured in words, can be customized by changing the arguments `n_before` and `n_after`.

For example,

```
s = 'The French onion soup is out of this world and the skate fish entree is terrific'
get_chunks(s, 'onion soup', 7, 7)
```

returns

`['the french onion soup is out of this world and the']`


### Import Libraries

In [65]:
import pandas as pd
import numpy as np
from time import time
import pickle
import string

In [66]:
onion_soup_sentences = pd.read_csv('../data/interim/onion_soup_sentences.csv')
with open("../data/interim/mon_ami_gabi_menu.pk", "rb") as f:
    menu = pickle.load(f)

In [67]:
onion_soup_sentences.head()

|   | text | tags |
|---|---|---|
| 0 | Our table ordered Bordelaise Steak Frites (... | onion soup au gratin, scallops gratinees, bord... |
| 1 | The steak frites and onion soup were the be... | onion soup au gratin, prime steak frites, frites |
| 2 | Onion soup was also a nice, big portion, but ... | onion soup au gratin |
| 3 | French onion soup was watery with little taste | onion soup au gratin |
| 4 | We ate almost everything on the menu - altho... | onion soup au gratin, baked goat cheese |


In [68]:
onion_soup_sentences.shape

(1007, 2)

In [69]:
menu.head()

|   | id | name | variations |
|---|---|---|---|
| 0 | onion_soup_au_gratin | onion soup au gratin | [french onion soup, onion soup, french onion, ... |
| 1 | steamed_artichoke | steamed artichoke | [steamed artichoke] |
| 2 | smoked_salmon | smoked salmon | [smoked salmon] |
| 3 | baked_goat_cheese | baked goat cheese | [goat cheese] |
| 4 | duck_confit | duck confit | [duck confit] |


In [70]:
def find_term(word_list, term):
    '''
    Arguments:
    word_list : List of words or a string
    term      : List or string of words to search for
    
    Finds the start and end word indices of a search term in a sentence.
    `start` is the index in word_list of the first word of `term`,
    `end` is the index in word_list of the last word of `term`.
    
    Return:
    results : List of tuples (start, end)
    '''
    # Check if word_list is a string or list
    if type(word_list) is str:
        word_list = word_list.lower().split()
    elif type(word_list) is not list:
        print('Error: word_list must be a list or string.')
        return None

    # Check if term is a string or list    
    if type(term) is str:
        term = term.lower().split()
    elif type(term) is not list:
        print('Error: term must be a list or string.')
        return None

    results = []
    term_length = len(term)

    # Find indices of term[0] in sentence
    for ind in (i for i, word in enumerate(word_list) if word == term[0]):
        # Check if rest of the term matches
        if word_list[ind:ind + term_length] == term:
            results.append((ind, ind+term_length-1))

    return results

In [71]:
# test that find_term works
find_term('The onion soup is at index (1,2). The onion soup is also at index (8,9).', 'onion soup')

[(1, 2), (8, 9)]
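
Note that the returned indices are word positions in the whitespace-split sentence, not character offsets. As a rough illustration, the same sliding comparison can be written as a single comprehension. The helper name `word_spans()` is hypothetical and is not used elsewhere in this notebook.

```python
def word_spans(sentence, term):
    # Hypothetical helper for illustration only; not part of the notebook.
    words = sentence.lower().split()
    target = term.lower().split()
    n = len(target)
    # Compare every window of n consecutive words against the term
    return [(i, i + n - 1)
            for i in range(len(words) - n + 1)
            if words[i:i + n] == target]

spans = word_spans(
    'The onion soup is at index (1,2). The onion soup is also at index (8,9).',
    'onion soup')
print(spans)  # [(1, 2), (8, 9)]
```

Unlike `find_term()`, this sketch accepts strings only and compares every window, rather than first locating the term's first word.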

In [72]:
def get_chunks(word_list, term, n_before = 5, n_after = 5):
    '''
    Arguments:
    word_list : List or string of words
    term      : List or string of words to search for
    n_before  : Number of words to span before term
    n_after   : Number of words to span after term
    
    Gets a list of sentence fragments containing term in word_list.
    Each sentence fragment spans n_before words to the left
    or until the start of the word_list
    and n_after words to the right
    or until the end of the word_list.
    
    Return:
    chunks : List of chunks
    
    '''
    # Check if word_list is a string or list
    if type(word_list) is str:
        word_list = word_list.lower().split()
    elif type(word_list) is not list:
        print('Error: word_list must be a list or string.')
        return None
    
    # Check if term is a string or list    
    if type(term) is str:
        term = term.lower().split()
    elif type(term) is not list:
        print('Error: term must be a list or string.')
        return None    
    
    indices = find_term(word_list, term)
    chunks = []

    for start, end in indices:
        # Clamp the window so it stays within word_list at both ends
        before = min(n_before, start)
        after = min(n_after, len(word_list) - 1 - end)
            
        chunks.append(' '.join(word_list[start-before : end+after+1]))
        
    return chunks
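
The window logic in `get_chunks()` amounts to clamping a slice to the bounds of the word list. A minimal sketch of that idea, using a hypothetical helper `clamp_window()` that is not part of this notebook and assuming the term's word indices are already known:

```python
def clamp_window(words, start, end, n_before, n_after):
    # Hypothetical helper for illustration only; not part of the notebook.
    lo = max(0, start - n_before)             # do not run past the first word
    hi = min(len(words), end + n_after + 1)   # do not run past the last word
    return words[lo:hi]

words = ('the french onion soup is out of this world and '
         'the skate fish entree is terrific').split()

# 'onion soup' occupies word indices 2 and 3
chunk = ' '.join(clamp_window(words, start=2, end=3, n_before=7, n_after=7))
print(chunk)  # the french onion soup is out of this world and the
```

Only two words precede the term, so the left side of the window is cut short, while the right side keeps the full seven words.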



In [73]:
# test get_chunks() with punctuation
test = 'I got the the onion soup, which was great my wife also enjoyed the onion soup, my children do not like onion soup'
get_chunks(test, 'onion soup')

['my children do not like onion soup']

Punctuation must be removed from the input string before running `get_chunks()`. Otherwise a term that is attached to punctuation (for example `soup,`) does not match, and that occurrence is skipped.
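
To illustrate why the first two occurrences were missed, here is a small sketch comparing the tokens produced with and without punctuation. The helper `strip_punctuation()` is a hypothetical name; it uses the same `str.translate` approach as the `remove_punctuation()` function defined later in this notebook.

```python
import string

def strip_punctuation(text):
    # Hypothetical helper for illustration; drops all ASCII punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

sentence = 'I got the onion soup, which was great'

raw_tokens = sentence.lower().split()
clean_tokens = strip_punctuation(sentence).lower().split()

print('soup' in raw_tokens)    # False: the token is 'soup,'
print('soup' in clean_tokens)  # True
```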

In [74]:
# test get_chunks() without punctuation
test = 'I got the the onion soup which was great my wife also enjoyed the onion soup my children do not like onion soup'
get_chunks(test, 'onion soup')

['i got the the onion soup which was great my wife',
 'my wife also enjoyed the onion soup my children do not like',
 'my children do not like onion soup']

In [75]:
def flatten(superlist): 
    '''
    Arguments: 
    superlist : A list of lists of strings.

    Requirements: 
    Each element in superlist must be a list.
    
    Return:
    A flattened list of strings.

    ex: 
    flatten([['a'], ['b', 'c'], ['d', 'e', 'f']])
    >> ['a', 'b', 'c', 'd', 'e', 'f']
    '''    
    return [item
            for sublist in superlist
            for item in sublist]
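
As an alternative sketch, the standard library's `itertools.chain.from_iterable` performs the same flattening. The sample data below is made up for illustration. Empty sublists, which is what `get_chunks()` returns for a sentence with no match, simply contribute nothing to the result.

```python
from itertools import chain

# Made-up sample: the middle sentence produced no chunks
nested = [['a'], [], ['b', 'c']]

flat = list(chain.from_iterable(nested))
print(flat)  # ['a', 'b', 'c']
```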

In [76]:
def remove_punctuation(s):
    '''Returns s with all ASCII punctuation characters removed.'''
    return s.translate(str.maketrans('', '', string.punctuation))

In [158]:
# Strip punctuation from each sentence, then extract chunks of up to 7 words on either side of the term
onion_soup_chunks = onion_soup_sentences['text'].apply(lambda sentence: get_chunks(remove_punctuation(sentence), 'onion soup', 7, 7))
onion_soup_chunks = pd.Series(flatten(onion_soup_chunks))


In [159]:
onion_soup_chunks.shape

(1005,)

In [163]:
onion_soup_chunks.head()

0    crepe with scallops shrimp peas and cream onio...
1    the steak frites and onion soup were the best ...
2           onion soup was also a nice big portion but
3       french onion soup was watery with little taste
4    everything on the menu although their french o...
dtype: object

In [166]:
pd.DataFrame(onion_soup_chunks, columns=['text']).to_csv('../data/interim/onion_soup_chunks.csv', index = False)