## Splitting Reviews into Sentences

The input to this notebook is a dataframe of reviews that mention a given menu item. The output is a dataframe of the individual sentences that mention that item. As examples, I use `onion soup` and `eggs benedict`.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from time import time

# Display full content of column
# pd.set_option('display.max_colwidth', -1)

import spacy
import textblob

# For reading from Postgres
from sqlalchemy import create_engine

# For pickling
import pickle

# For tracking progress
from IPython.display import clear_output

# For reading and writing to postgres
from odo import odo

# For detecting language of document
from langdetect import detect
from langdetect import DetectorFactory 

# for consistent results
DetectorFactory.seed = 42 

import nltk

# The NLP workhorse in Python is Natural Language Toolkit (NLTK)
# Tokenizing, lemmatizing
from nltk import word_tokenize, pos_tag, ne_chunk

# Preprocessing packages used in class
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer


# For loading secret environment variables, e.g. postgres username and password
import os
from dotenv import load_dotenv, find_dotenv




### Create Postgres table

In [2]:
sql = '''
CREATE TABLE reviews (
    date          date, 
    stars         integer NOT NULL,
    text          varchar(5000), 
    review_id     varchar(22),
    business_id   varchar(22),
    business_name varchar(64)  
);
'''

### Import CSV file to Postgres

In [3]:
raw_data_directory     = os.path.join('..', 'data', 'raw')
interim_data_directory = os.path.join('..', 'data', 'interim')

review_filepath            = os.path.join(raw_data_directory, 'yelp_academic_dataset_review.csv')
business_filepath          = os.path.join(raw_data_directory, 'yelp_academic_dataset_business.csv')
restaurant_review_filepath = os.path.join(interim_data_directory, 'restaurant_review.csv')
restaurant_filepath        = os.path.join(interim_data_directory, 'restaurant.csv')

In [5]:
sql = f'''
COPY reviews(date, stars, text, review_id, business_id, business_name)
FROM '{restaurant_review_filepath}' DELIMITER ',' CSV HEADER;
'''
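
PostgreSQL's `COPY ... FROM 'file'` reads the file on the database server, so a relative path such as `../data/interim/restaurant_review.csv` is resolved by the server process rather than by this notebook. A minimal sketch, assuming the server runs on the same machine as the notebook and is allowed to read the file, is to pass an absolute path instead. The helper `build_copy_statement` and the variable `copy_sql` are hypothetical names, not part of the notebook.

```python
import os

def build_copy_statement(csv_path, table='reviews'):
    """Return a COPY statement that references csv_path by its absolute path."""
    absolute_path = os.path.abspath(csv_path)
    # Double any single quotes so the path stays a valid SQL string literal
    quoted_path = absolute_path.replace("'", "''")
    return (
        f"COPY {table}(date, stars, text, review_id, business_id, business_name)\n"
        f"FROM '{quoted_path}' DELIMITER ',' CSV HEADER;"
    )

copy_sql = build_copy_statement(
    os.path.join('..', 'data', 'interim', 'restaurant_review.csv')
)
print(copy_sql)
```

If the database runs on another host, the file would have to be sent from the client instead (for example with `psql`'s `\copy`).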

### Set Environment Variables

In [6]:
# Find .env
dotenv_path = find_dotenv()

# Load entries as environment variables
load_dotenv(dotenv_path)

public_ip = os.environ.get("PUBLIC_IP")
username = os.environ.get("USERNAME")
password = os.environ.get("PASSWORD")
port = os.environ.get("PORT")

# Construct database URL from environment variables
uri = f'postgresql://{username}:{password}@{public_ip}:{port}'
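
A password that contains characters such as `@` or `/` would break a URI assembled by plain string formatting. The sketch below percent-encodes the credentials with the standard library before building the URI. The helper `build_postgres_uri` and the example credentials are hypothetical and only illustrate the idea.

```python
from urllib.parse import quote

def build_postgres_uri(username, password, host, port, database=None):
    """Build a postgresql:// URI, percent-encoding the credentials."""
    # safe='' also encodes '/' so it cannot be mistaken for a path separator
    user = quote(username, safe='')
    secret = quote(password, safe='')
    uri = f'postgresql://{user}:{secret}@{host}:{port}'
    if database:
        uri += f'/{database}'
    return uri

# Made-up credentials, used only to show the encoding
example_uri = build_postgres_uri('postgres', 'p@ss/word', 'localhost', '5432', 'yelp')
print(example_uri)  # postgresql://postgres:p%40ss%2Fword@localhost:5432/yelp
```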

### Load Data 
#### Option 1: Local Postgres

In [7]:
public_ip = 'localhost'
username = 'postgres'
password = 'password'
port = '5432'
database = 'yelp'

# Construct database URL from the local settings above
uri = f'postgresql://{username}:{password}@{public_ip}:{port}/{database}'

# uri = 'postgresql://pos:lilo8catfood@127.0.0.0:5432'
# reviews = odo(uri+'::reviews', pd.DataFrame)

# Connection to Postgres database
engine = create_engine(uri)


In [30]:
%%time

SQL = '''
SELECT r.*
FROM reviews AS r
WHERE r.business_name ILIKE 'Mon Ami Gabi'
  AND r.text LIKE '%%onion soup%%'
'''

onion_soup_reviews = pd.read_sql(SQL, con = engine)

CPU times: user 6.34 ms, sys: 3.81 ms, total: 10.1 ms
Wall time: 6.93 s


In [31]:
onion_soup_reviews.shape

(674, 7)

In [10]:
onion_soup_reviews.head()

| | date | stars | text | review_id | business_id | business_name | text_tsv |
|---|---|---|---|---|---|---|---|
| 0 | 2015-02-10 | 2 | Other than being right across the Fountains of... | uczUlWIWuO-KzoUiLhICNw | 4JNXUYY8wbaaDmk3BPzlWw | Mon Ami Gabi | |
| 1 | 2010-07-11 | 5 | This review is long overdue! I have been eat... | l0Lm7Dx69s6aH7a-5dwKDg | 4JNXUYY8wbaaDmk3BPzlWw | Mon Ami Gabi | |
| 2 | 2017-04-24 | 3 | French onion soup was watery with little taste... | 185E0cpQpDRUO4JRGu3fXQ | 4JNXUYY8wbaaDmk3BPzlWw | Mon Ami Gabi | |
| 3 | 2010-12-04 | 4 | Charming resturant that looks like it would be... | nth_q-GqOy_Ly8sxsREIwA | 4JNXUYY8wbaaDmk3BPzlWw | Mon Ami Gabi | |
| 4 | 2014-12-30 | 5 | Brunch with the family was out of this world. ... | EL1LCOPj40kQjLweA81Uww | 4JNXUYY8wbaaDmk3BPzlWw | Mon Ami Gabi | |


#### Option 2: Local CSV Files

In [32]:
onion_soup_reviews = pd.read_csv('../data/interim/onion_soup_reviews.csv')
onion_soup_reviews.shape

(868, 6)

In [33]:
eggs_benedict_reviews = pd.read_csv('../data/interim/eggs_benedict_reviews.csv')
eggs_benedict_reviews.shape

(610, 6)

In [34]:
# Use a context manager so the file handle is closed after loading
with open("../data/interim/mon_ami_gabi_menu.pk", "rb") as f:
    menu = pickle.load(f)

In [35]:
menu.head()

| | id | name | variations |
|---|---|---|---|
| 0 | onion_soup_au_gratin | onion soup au gratin | [french onion soup, onion soup, french onion, ... |
| 1 | steamed_artichoke | steamed artichoke | [steamed artichoke] |
| 2 | smoked_salmon | smoked salmon | [smoked salmon] |
| 3 | baked_goat_cheese | baked goat cheese | [goat cheese] |
| 4 | duck_confit | duck confit | [duck confit] |


### Drop rows that are not English

In [36]:
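def get_english_reviews(df):
    '''Return only the rows of df whose `text` is detected as English.'''
    language = df['text'].apply(detect)

    # Keep the English rows; equivalent to dropping every non-English row
    return df[language == 'en']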
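
`langdetect.detect` raises a `LangDetectException` when a text has no usable features (for example an empty string or digits only), which would abort the whole `apply`. The sketch below wraps detection so such rows get a fallback label instead. `safe_detect` and `toy_detector` are hypothetical names; the detector is passed in as an argument so the example runs without `langdetect` installed.

```python
def safe_detect(text, detect_fn, fallback='unknown'):
    """Return detect_fn(text), or fallback when the text cannot be classified."""
    if not isinstance(text, str) or not text.strip():
        return fallback
    try:
        return detect_fn(text)
    except Exception:
        # langdetect raises LangDetectException here; a broad catch keeps the sketch generic
        return fallback

def toy_detector(text):
    # Stand-in for langdetect.detect, used only to keep this sketch self-contained
    if not any(ch.isalpha() for ch in text):
        raise ValueError('No features in text.')
    return 'en'

print(safe_detect('The onion soup was great.', toy_detector))  # en
print(safe_detect('12345', toy_detector))                      # unknown
print(safe_detect(None, toy_detector))                         # unknown
```

In the notebook this could be used as `df['text'].apply(lambda t: safe_detect(t, detect))`, with `detect` imported from `langdetect` as above.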
    

In [37]:
%%time
onion_soup_reviews = get_english_reviews(onion_soup_reviews)
print(onion_soup_reviews.shape[0])

867
CPU times: user 3.99 s, sys: 113 ms, total: 4.1 s
Wall time: 4.1 s


In [38]:
%%time
eggs_benedict_reviews = get_english_reviews(eggs_benedict_reviews)
print(eggs_benedict_reviews.shape[0])

608
CPU times: user 2.74 s, sys: 82.1 ms, total: 2.82 s
Wall time: 2.82 s


### Tokenize & Extract Named Entities

[Penn Part of Speech Tags](https://cs.nyu.edu/grishman/jet/guide/PennPOS.html)

In [51]:
def preprocess2(sentence):
    '''
    Return the named entities that NLTK finds in `sentence`
    as a list of (entity_name, entity_type) tuples.
    '''
    # Tokenize doc
    tokens = word_tokenize(sentence)
    
    # Tag each token with its part of speech
    tagged_tokens = pos_tag(tokens)

    # Named entity chunker; binary = True labels every entity simply as 'NE'
    ne_chunked_tokens = ne_chunk(tagged_tokens, binary = True)

    # Extract all named entities
    named_entities = []
    
    for tagged_tree in ne_chunked_tokens:
        # Only subtrees (chunks) have a label; plain tokens are tuples
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label() # category
            named_entities.append((entity_name, entity_type))
    
    return named_entities
    

In [68]:
# Use .iloc so this still works if the row labelled 0 was dropped as non-English
processed_review = preprocess2(onion_soup_reviews['text'].iloc[0])

In [69]:
processed_review

[('Bellagio', 'NE'),
 ('Bordelaise Steak Frites', 'NE'),
 ('Chicken', 'NE'),
 ('Mushroom Crepe', 'NE'),
 ('Seafood Crepe', 'NE'),
 ('Eggs Benedict', 'NE'),
 ('Canadian', 'NE'),
 ('Steak', 'NE')]

NLTK's named-entity chunker fails to detect "onion soup" as an entity in this review, so I will continue with a more manual approach: matching each sentence against the name variations listed in `menu`.

### Separate each review document into individual sentences

In [70]:
def flatten(superlist): 
    '''
    Arguments: 
    superlist : A list of list of strings.

    Requirements: 
    Each element in superlist must be a list.
    
    Return:
    A flattened list of strings.

    ex: 
    flatten([['a'], ['b', 'c'], ['d', 'e', 'f']])
    >> ['a', 'b', 'c', 'd', 'e', 'f']
    '''    
    return [item \
            for sublist in superlist \
            for item in sublist]
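
The same flattening can be sketched with the standard library's `itertools.chain.from_iterable`. `flatten_chain` is a hypothetical name, not part of the notebook. The second call shows why every element must be a list: a bare string is itself iterable, so it is broken up into characters.

```python
from itertools import chain

def flatten_chain(superlist):
    """Flatten a list of lists into a single list."""
    return list(chain.from_iterable(superlist))

print(flatten_chain([['a'], ['b', 'c'], ['d', 'e', 'f']]))
# ['a', 'b', 'c', 'd', 'e', 'f']

# Pitfall: a string in place of a list is split into characters
print(flatten_chain(['ab', ['c']]))
# ['a', 'b', 'c']
```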

In [131]:
def get_sentences(doc, menu_item):
    '''
    Arguments: 
    doc : pd.Series of review texts
    menu_item : row(s) of `menu` for the target item,
                example: menu[menu['id'] == 'onion_soup_au_gratin'].
                Currently unused: every sentence is tagged against the full
                global `menu`, and the caller filters on `tags` afterwards.
    
    Splits each review into sentences on '.' and tags every sentence
    with the names of all menu items whose variations it contains.
    
    Return:
    DataFrame with one row per sentence and the columns `text` and `tags`
    '''
    # Naive split on periods, then flatten the per-review lists into one list
    sentences = flatten(doc.apply(lambda text : text.split('.')))
    
    sentences_tags = []
    n_sentences = len(sentences)
    
    for i, s in enumerate(sentences):
        # Lowercase only for matching; the output keeps the original casing
        s = s.lower()
        
        if (i+1)%1000==0:
            clear_output(wait = True)
            print(f'Finding tags in {i+1}/{n_sentences} sentences')

        tags = []
        # `_` avoids shadowing the sentence counter `i`
        for _, row in menu.iterrows():
            for variation in row['variations']:
                if variation in s:
                    tags.append(row['name'])
                    break  # one tag per menu item is enough
        sentences_tags.append(', '.join(tags))
    
    return pd.DataFrame(list(zip(sentences, sentences_tags)), columns = ['text', 'tags'])
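
Splitting on every `.` also cuts inside decimals such as `$9.50` and never splits on `!` or `?`. A minimal sketch of a slightly more careful splitter, assuming that a sentence ends with `.`, `!` or `?` followed by whitespace, is shown below. `split_sentences` is a hypothetical helper and the review text is made up.

```python
import re

# Split after '.', '!' or '?' only when whitespace follows
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]

review = 'The onion soup was $9.50. Worth it! Would I order it again? Yes.'

print(review.split('.'))
# ['The onion soup was $9', '50', ' Worth it! Would I order it again? Yes', '']

print(split_sentences(review))
# ['The onion soup was $9.50.', 'Worth it!', 'Would I order it again?', 'Yes.']
```

This still splits after abbreviations such as "Dr. Smith". NLTK's `sent_tokenize` is one option that handles many of those cases, at the cost of downloading its tokenizer data.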
    

### Find all sentences that mention "onion soup"

In [162]:
%%time
onion_soup_sentences = get_sentences(onion_soup_reviews['text'], menu[menu['id'] == 'onion_soup_au_gratin'])

Finding tags in 10000/10720 sentences
CPU times: user 45.2 s, sys: 95.1 ms, total: 45.3 s
Wall time: 45.4 s
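
Most of this time is likely spent in `menu.iterrows()`, which is re-run for every sentence. One possible alternative, sketched here under the assumption that whole-word matches are wanted, is to compile one regular expression per menu item once and reuse it for all sentences. `toy_menu`, `compile_menu_patterns` and `tag_sentence` are hypothetical names, and I have not timed this against the loop above.

```python
import re

# Toy stand-in for the `menu` dataframe: (name, variations) pairs
toy_menu = [
    ('onion soup au gratin', ['french onion soup', 'onion soup', 'french onion']),
    ('baked goat cheese', ['goat cheese']),
]

def compile_menu_patterns(menu_items):
    """Compile one case-insensitive, word-bounded pattern per menu item."""
    patterns = []
    for name, variations in menu_items:
        # Longest variation first so the alternation tries the most specific text first
        ordered = sorted(variations, key=len, reverse=True)
        alternatives = '|'.join(re.escape(v) for v in ordered)
        patterns.append((name, re.compile(rf'\b(?:{alternatives})\b', re.IGNORECASE)))
    return patterns

def tag_sentence(sentence, patterns):
    """Return the names of all menu items mentioned in the sentence."""
    return [name for name, pattern in patterns if pattern.search(sentence)]

patterns = compile_menu_patterns(toy_menu)
print(tag_sentence('French Onion Soup was watery with little taste', patterns))
# ['onion soup au gratin']
print(tag_sentence('We shared the goat cheese and the onion soup', patterns))
# ['onion soup au gratin', 'baked goat cheese']
print(tag_sentence('I love onion soups', patterns))
# []
```

Unlike the substring test in `get_sentences`, the word boundaries mean that plurals such as "onion soups" no longer match, so the tag counts could differ from the ones reported here.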


In [163]:
onion_soup_sentences = onion_soup_sentences[onion_soup_sentences['tags'].str.contains('onion soup au gratin')]
onion_soup_sentences.shape


(1007, 2)

In [164]:
onion_soup_sentences.head()

| | text | tags |
|---|---|---|
| 2 | Our table ordered Bordelaise Steak Frites (... | onion soup au gratin, scallops gratinees, bord... |
| 3 | The steak frites and onion soup were the be... | onion soup au gratin, prime steak frites, frites |
| 5 | Onion soup was also a nice, big portion, but ... | onion soup au gratin |
| 12 | French onion soup was watery with little taste | onion soup au gratin |
| 20 | We ate almost everything on the menu - altho... | onion soup au gratin, baked goat cheese |


In [165]:
onion_soup_sentences.to_csv('../data/interim/onion_soup_sentences.csv', index = False)


### Find all sentences that mention "eggs benedict" 

In [166]:
eggs_benedict_sentences = get_sentences(eggs_benedict_reviews['text'], menu[menu['id'] == 'classic_eggs_benedict'])

Finding tags in 6000/6464 sentences


In [167]:
eggs_benedict_sentences = eggs_benedict_sentences[eggs_benedict_sentences['tags'].str.contains('classic eggs benedict')]
eggs_benedict_sentences.shape


(546, 2)

In [168]:
eggs_benedict_sentences.head()

| | text | tags |
|---|---|---|
| 1 | Eggs Benedict for me was fab and some other g... | classic eggs benedict |
| 5 | The French toast and eggs Benedict with duck ... | classic eggs benedict, french toast |
| 11 | Our table ordered Bordelaise Steak Frites (... | onion soup au gratin, scallops gratinees, bord... |
| 18 | Eggs benedict here is definitely not a stand-... | classic eggs benedict |
| 27 | The Salmon Eggs Benedict is scrumptious, and ... | salmon, classic eggs benedict, waffle |


In [169]:
eggs_benedict_sentences.to_csv('../data/interim/eggs_benedict_sentences.csv', index = False)
