# Build Yelp Review Corpus

This notebook outlines the steps to load, preprocess, and clean Yelp review text.

It was heavily inspired by: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [1]:
import psycopg2
import nltk
import unicodedata
import pandas as pd
import pprint
import pickle
import re
import os
import copy

pp = pprint.PrettyPrinter(indent=4).pprint

data_dir = '/home/tlappas/data_science/Yelp-Ratings/data/processed'

## Move to ./Yelp-Ratings/data/corpus folder

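Known issue: this cell needs to be fixed. It changes the working directory with `os.chdir`, so its result depends on the current directory, and it breaks if run more than once.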

path = os.getcwd()
print('Notebook path: {}'.format(path))

os.chdir('..')
data_path = os.path.join(os.getcwd(), 'corpus')
print('Corpus path: {}'.format(data_path))

# Create the corpus folder if it is missing; no error if it already exists
os.makedirs(data_path, exist_ok=True)

os.chdir(data_path)
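
One way to avoid depending on the working directory is to build the corpus path from a fixed base directory and never call `os.chdir`. The sketch below is only an illustration: `ensure_corpus_dir` is a hypothetical helper, and using `os.path.dirname(data_dir)` as the base is an assumption about the project layout.

```python
import os
import tempfile

def ensure_corpus_dir(base_dir, name='corpus'):
    """Create <base_dir>/<name> if it is missing and return its absolute path."""
    corpus_path = os.path.join(os.path.abspath(base_dir), name)
    # exist_ok=True makes repeated calls harmless
    os.makedirs(corpus_path, exist_ok=True)
    return corpus_path

# Demo in a throwaway directory. In the notebook, base_dir might be
# os.path.dirname(data_dir) (an assumption about the project layout).
with tempfile.TemporaryDirectory() as base:
    first = ensure_corpus_dir(base)
    second = ensure_corpus_dir(base)  # second call: no error, same path
    same_path = first == second
    created = os.path.isdir(first)
```

Because the working directory never changes, rerunning the cell gives the same path every time.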

## Get Data

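The review text is stored in the `yelp` PostgreSQL database. The query keeps only reviews of restaurants written by users with a non-empty `elite` field, and the result is stored in a pandas DataFrame.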

In [2]:
conn = psycopg2.connect('dbname=yelp user=tlappas host=/var/run/postgresql')
cur = conn.cursor()
cur.execute("""
    select review.review_text 
    from review, business, user_info
    where review.user_id = user_info.user_id
    and business.business_id = review.business_id
    and business.categories LIKE '%Restaurants%'
    and length(user_info.elite) != 0 
    limit 1
""")

cols = ['review_text']

data = pd.DataFrame(cur.fetchall(), columns=cols)
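
To make the filter easier to see, here is a small self-contained sketch of the same query shape written with explicit joins. It runs against an in-memory SQLite database rather than the real PostgreSQL `yelp` database. The table and column names come from the query above; the sample rows and the format of the `elite` values are made up for illustration.

```python
import sqlite3

def elite_restaurant_reviews(conn):
    """Return review text for restaurant reviews written by elite users."""
    return conn.execute("""
        select review.review_text
        from review
        join user_info on review.user_id = user_info.user_id
        join business on business.business_id = review.business_id
        where business.categories like '%Restaurants%'
        and length(user_info.elite) != 0
    """).fetchall()

# Hypothetical sample data, only for demonstration
conn = sqlite3.connect(':memory:')
conn.executescript("""
    create table review (review_text text, user_id text, business_id text);
    create table business (business_id text, categories text);
    create table user_info (user_id text, elite text);
    insert into business values ('b1', 'Restaurants, Sandwiches'), ('b2', 'Auto Repair');
    insert into user_info values ('u1', '2017,2018'), ('u2', '');
    insert into review values
        ('great sandwich', 'u1', 'b1'),
        ('fixed my car', 'u1', 'b2'),
        ('meh', 'u2', 'b1');
""")
rows = elite_restaurant_reviews(conn)
conn.close()
print(rows)  # only the elite user's restaurant review remains
```

Only one of the three sample reviews passes both conditions: it is for a restaurant and its author has a non-empty `elite` field.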

### Example Instance

print('DataFrame shape: {}\n'.format(data.shape))
print('First instance: \n')
print(data.loc[0])

## Example Review

# The query above uses `limit 1`, so row 0 is the only row available
print(data.loc[0, 'review_text'])

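## Basic Text Preprocessing

1. Remove any non-ASCII characters (accented letters are decomposed first, so the base letter is kept)
2. Remove any characters that aren't alphanumeric, whitespace, or an apostrophe (')
3. Convert all letters to lowercase
4. Replace each newline or carriage return with a space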

In [3]:
for i, text in enumerate(data.loc[:,'review_text']):

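    # Decompose accented characters, then drop anything that isn't ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Remove everything except letters, digits, whitespace, and apostrophes
    text = re.sub(r"[^A-Za-z0-9\s']", '', text)
    text = text.lower()
    # Replace each newline or carriage return with a space
    text = re.sub(r'[\r\n]', ' ', text)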
    data.loc[i,'review_text'] = text
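
The four steps can also be packaged as one function, which makes them easy to check on a small input. This is a minimal sketch: `clean_text` is a hypothetical name and the sample strings are made up.

```python
import re
import unicodedata

def clean_text(text):
    """Apply the four cleaning steps listed above to a single string."""
    # 1. Decompose accented characters, then drop anything non-ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # 2. Keep only letters, digits, whitespace, and apostrophes
    text = re.sub(r"[^A-Za-z0-9\s']", '', text)
    # 3. Lowercase
    text = text.lower()
    # 4. Replace each newline or carriage return with a space
    return re.sub(r'[\r\n]', ' ', text)

sample = "Crème brûlée!\nIt's GREAT"
cleaned = clean_text(sample)
print(cleaned)  # creme brulee it's great
```

One caveat: a typographic apostrophe (U+2019) is not ASCII and has no ASCII decomposition, so step 1 drops it. For example, `clean_text("don\u2019t")` returns `dont`.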

In [4]:
print(data.loc[0,'review_text'])

get the a wreck  it has everything roast beef turkey and ham oh my  i like to add oil and italian seasoning with hot giardiniera if you really want to thrill your tastebuds   love the selection of zapp's chips here goes great with your sammie op


## Save Corpus - Cleaned Text

In [5]:
processing = ['ascii-only', 'whitespace alphanumeric only', 'lowercase', 'remove whitespace chars']
with open(os.path.join(data_dir, 'reviews-clean.pkl'), 'wb') as f:
    pickle.dump([data, processing], f)
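
Each saved file holds a two-item list: the data and the list of processing steps applied to it. A later notebook can unpack both in one step. The round trip below is a sketch in which a plain list of strings stands in for the DataFrame, and `save_corpus` and `load_corpus` are hypothetical helper names.

```python
import os
import pickle
import tempfile

def save_corpus(path, data, processing):
    """Write [data, processing] to a pickle file."""
    with open(path, 'wb') as f:
        pickle.dump([data, processing], f)

def load_corpus(path):
    """Read the file back and unpack the data and its processing steps."""
    with open(path, 'rb') as f:
        data, processing = pickle.load(f)
    return data, processing

# Demo in a throwaway directory with stand-in data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'reviews-demo.pkl')
    save_corpus(demo_path, ['great sandwich'], ['ascii-only', 'lowercase'])
    loaded_data, loaded_steps = load_corpus(demo_path)
```

Keeping the processing list next to the data means each file records how its text was produced.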

## Remove Stopwords

In [6]:
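# Use a set for fast membership tests
stops = set(nltk.corpus.stopwords.words('english'))

# Series.items() yields (index, value) pairs; a Series has no iterrows()
for i, text in data['review_text'].items():
    text = [word for word in text.split() if word not in stops]
    text = ' '.join(text)
    data.at[i,'review_text'] = text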

In [7]:
print(data.at[0,'review_text'])

get wreck everything roast beef turkey ham oh like add oil italian seasoning hot giardiniera really want thrill tastebuds love selection zapp's chips goes great sammie op


## Save Corpus - Cleaned + No Stopwords

In [8]:
processing.append('no stopwords')
with open(os.path.join(data_dir, 'reviews-clean-nostop.pkl'), 'wb') as f:
    pickle.dump([data, processing], f)

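## Remove Neutral Stopwords

Since we're attempting to differentiate negative reviews from positive reviews, negations may be important (for example 'not', 'no', and 'hasn't'). This version of the corpus removes only the non-negation stopwords and keeps the negations in the text.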
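
A tiny example shows why this matters. The stopword and negation sets below are small made-up samples, not the NLTK list, and `remove_stopwords` is a hypothetical helper.

```python
# Small illustrative samples, not the real NLTK stopword list
SAMPLE_STOPS = {'the', 'is', 'was', 'a', 'not', 'no', "wasn't"}
SAMPLE_NEGATIONS = {'not', 'no', "wasn't"}

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated word that appears in stopwords."""
    return ' '.join(word for word in text.split() if word not in stopwords)

review = "the food was not good"

# Removing every stopword also removes the negation
all_removed = remove_stopwords(review, SAMPLE_STOPS)

# Removing only the non-negation stopwords keeps it
negations_kept = remove_stopwords(review, SAMPLE_STOPS - SAMPLE_NEGATIONS)

print(all_removed)     # food good
print(negations_kept)  # food not good
```

With the full list, a negative review ends up reading like a positive one; keeping the negations preserves the signal.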

In [9]:
neg_stops = [
    'no', 'not', 'nor', 'don', 'don\'t', 'ain', 
    'ain\'t', 'aren\'t', 'aren', 'couldn', 'couldn\'t', 
    'didn', 'didn\'t', 'doesn', 'doesn\'t', 'hadn', 
    'hadn\'t', 'hasn', 'hasn\'t', 'haven', 'haven\'t',
    'isn', 'isn\'t', 'mightn', 'mightn\'t', 'mustn', 
    'mustn\'t', 'needn', 'needn\'t', 'shan', 'shan\'t',
    'shouldn', 'shouldn\'t', 'wasn', 'wasn\'t', 'weren',
    'weren\'t', 'won', 'won\'t', 'wouldn', 'wouldn\'t'
]

# Stopwords to remove: everything in the stopword list except the negations
no_neg_stops = set(stops) - set(neg_stops)

# Reload the cleaned corpus (saved before stopword removal) so the negations are still present
with open(os.path.join(data_dir, 'reviews-clean.pkl'), 'rb') as f:
    data_neg_stops, processing_neg_stops = pickle.load(f)

# Filter with no_neg_stops and write the result back to data_neg_stops, not data
for i, text in data_neg_stops['review_text'].items():
    text = [word for word in text.split() if word not in no_neg_stops]
    text = ' '.join(text)
    data_neg_stops.at[i,'review_text'] = text

In [10]:
print(data_neg_stops.at[0,'review_text'])


## Save Corpus - Cleaned + No Neutral Stopwords

In [11]:
processing_neg_stops.append('Remove all pos/neutral stopwords')
with open(os.path.join(data_dir, 'reviews-clean-neg-stops.pkl'), 'wb') as f:
    pickle.dump([data_neg_stops, processing_neg_stops], f)

## Lemmatize Text

The WordNet lemmatizer handles a single part of speech (POS) per call. The default `pos` parameter is 'n' (noun). [Set of POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

[Here's an example](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizerwithappropriatepostag) that uses the NLTK POS tagger to identify each word's POS and then lemmatizes the word for just that POS. This should save a lot of time, since the lemmatizer won't need to try every POS for every word.

[Wordnet documentation](http://www.nltk.org/howto/wordnet.html)
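
The tag-then-lemmatize idea can be sketched as follows. The tagger returns Penn Treebank tags, while the lemmatizer expects 'n', 'v', 'a', or 'r', so a small mapping is needed. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and falling back to noun for unrecognized tags is an assumption. The lemmatizer is passed in as a callable so the sketch runs without any NLTK data installed.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun as fallback)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # nouns and everything else

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs once each, using the POS the tag implies."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Stand-in lemmatizer that just records which POS was requested
def fake_lemmatize(word, pos='n'):
    return '{}/{}'.format(word, pos)

tagged = [('running', 'VBG'), ('dogs', 'NNS'), ('quickly', 'RB'), ('better', 'JJR')]
result = lemmatize_tagged(tagged, fake_lemmatize)
print(result)  # ['running/v', 'dogs/n', 'quickly/r', 'better/a']
```

In the notebook, the pairs would come from `nltk.pos_tag(text.split())` and the callable would be `wnl.lemmatize`. Both need the NLTK tagger model and the WordNet corpus to be downloaded first.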

In [12]:
wnl = nltk.WordNetLemmatizer()

# Series.items() yields (index, value) pairs; a Series has no iterrows()
for i, text in data['review_text'].items():
    text = text.split()
    text = [wnl.lemmatize(word, pos='n') for word in text]
    text = [wnl.lemmatize(word, pos='v') for word in text]
    text = [wnl.lemmatize(word, pos='a') for word in text]
    text = [wnl.lemmatize(word, pos='r') for word in text]
    text = ' '.join(text)
    data.at[i,'review_text'] = text

In [13]:
print(data.at[0,'review_text'])



## Save Corpus - Cleaned + No Stopwords + Lemmatized

In [14]:
processing.append('lemmatize: NVJR')
with open(os.path.join(data_dir, 'reviews-clean-nostop-lemmed.pkl'), 'wb') as f:
    pickle.dump([data, processing], f)