# Build Yelp Review Corpus

This notebook outlines the steps to load, preprocess, and clean Yelp review text.

It was heavily inspired by: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [None]:
import psycopg2
import nltk
import unicodedata
import pandas as pd
import pprint
import pickle
import re
import os

pp = pprint.PrettyPrinter(indent=4).pprint

## Move to ./Yelp-Ratings/data/corpus folder

TODO: This needs to be fixed. It breaks if run more than once.

path = os.getcwd()
print('Notebook path: {}'.format(path))

os.chdir('..')
data_path = os.path.join(os.getcwd(), 'corpus')
print('Corpus path: {}'.format(data_path))

if not os.path.exists(data_path):
    os.mkdir(data_path)

os.chdir(data_path)
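
One possible way to make this step safe to re-run is to compute the corpus folder from a starting directory instead of relying on the current working directory having a particular value. The sketch below is only an illustration: `resolve_corpus_dir` is a hypothetical helper, and it assumes, as the cell above does, that `corpus` sits next to the notebook's folder. It is exercised on a temporary directory rather than the real project tree.

```python
import tempfile
from pathlib import Path


def resolve_corpus_dir(start, folder_name='corpus'):
    """Return the sibling `folder_name` directory of `start`, creating it if needed.

    If `start` is already that folder, it is returned unchanged, so calling
    this repeatedly always yields the same path.
    """
    start = Path(start).resolve()
    if start.name == folder_name:
        return start
    target = start.parent / folder_name
    target.mkdir(exist_ok=True)
    return target


# Demonstrate with a throwaway directory tree instead of the real project.
with tempfile.TemporaryDirectory() as root:
    notebooks = Path(root) / 'notebooks'
    notebooks.mkdir()
    first = resolve_corpus_dir(notebooks)
    second = resolve_corpus_dir(first)  # simulates re-running the cell
    same_path = first == second
    print(same_path)
```

In the notebook this could be called as `resolve_corpus_dir(os.getcwd())`, with the result passed to `os.chdir` or joined onto the file names used in the save cells.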

## Get Data

The review text is stored in the `yelp` PostgreSQL database. Query it and store the result in a pandas DataFrame.

In [None]:
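# Connect over the local PostgreSQL Unix socket
conn = psycopg2.connect('dbname=yelp user=tlappas host=/var/run/postgresql')
cur = conn.cursor()
cur.execute("""
    SELECT * FROM review LIMIT 10
""")

# These names must match the column order of the review table
cols = ['review_id', 'user_id', 'business_id', 'stars', 'review_date', 'review_text', 'useful', 'funny', 'cool']

data = pd.DataFrame(cur.fetchall(), columns=cols)

# Release database resources once the rows are in memory
cur.close()
conn.close()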
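
Because the query uses `SELECT *`, the hard-coded `cols` list has to stay in sync with the table definition. As a rough alternative, DB-API cursors expose the result column names through `cursor.description`. The sketch below uses an in-memory SQLite database purely as a stand-in for the yelp database (the three-column table and its single row are made up for illustration); psycopg2 cursors provide the same attribute.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the yelp database.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE review (review_id TEXT, stars INTEGER, review_text TEXT)')
cur.execute("INSERT INTO review VALUES ('r1', 5, 'Great tacos!')")

cur.execute('SELECT * FROM review LIMIT 10')
# DB-API cursors describe the result columns; the name is the first field.
cols = [col[0] for col in cur.description]
rows = cur.fetchall()
conn.close()

print(cols)
print(rows)
```

The resulting `rows` and `cols` can be passed to `pd.DataFrame(rows, columns=cols)`. Note that names read this way are the database's own column names, which may differ from the ones hard-coded above.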

### Example Instance

In [None]:
print('DataFrame shape: {}\n'.format(data.shape))
print('First instance: \n')
print(data.loc[0])

### Example Review

In [None]:
print(data.loc[1, 'review_text'])

## Basic Text Preprocessing

1. Strip accents and remove any remaining non-ASCII characters
2. Remove any characters that aren't alphanumeric, whitespace, or an apostrophe (')
3. Convert all letters to lowercase
4. Replace whitespace, such as newlines, with ' '

In [None]:
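for i, text in enumerate(data.loc[:,'review_text']):

    # Decompose accented characters, then drop anything that is not ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Remove everything except letters, digits, whitespace, and apostrophes
    text = re.sub(r"[^A-Za-z0-9\s']", '', text)
    text = text.lower()
    # Replace each run of whitespace (newlines included) with a single space
    text = re.sub(r'\s+', ' ', text)
    data.loc[i,'review_text'] = text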

In [None]:
print(data.loc[1,'review_text'])
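
To see what each of the four steps contributes, they can be traced on a single string. The sample sentence below is made up for illustration and is not taken from the Yelp data.

```python
import re
import unicodedata

# Hypothetical review text containing accents, punctuation, and a line break.
sample = "Café Déjà Vu:\nBEST crêpes, hands-down! Don't miss it."

# 1. Decompose accented characters, then drop whatever is not ASCII.
step1 = unicodedata.normalize('NFKD', sample).encode('ascii', 'ignore').decode('ascii')
# 2. Remove everything except letters, digits, whitespace, and apostrophes.
step2 = re.sub(r"[^A-Za-z0-9\s']", '', step1)
# 3. Convert to lowercase.
step3 = step2.lower()
# 4. Replace whitespace, here the line break, with a single space.
step4 = re.sub(r'\s+', ' ', step3)

print(step4)  # cafe deja vu best crepes handsdown don't miss it
```

Two side effects are visible here: accented letters survive as their unaccented base letters, and removing the hyphen fuses "hands-down" into "handsdown".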

## Save Corpus - Cleaned Text

In [None]:
with open('reviews-clean.pkl', 'wb') as f:
    pickle.dump(data, f)

## Remove Stopwords

In [None]:
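# Requires the NLTK stopword list: nltk.download('stopwords')
# A set makes each membership test constant time
stops = set(nltk.corpus.stopwords.words('english'))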

for i, text in enumerate(data.loc[:,'review_text']):
    text = [word for word in text.split() if word not in stops]
    text = ' '.join(text)
    data.loc[i,'review_text'] = text

In [None]:
print(data.loc[1,'review_text'])
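
The filtering step can be sketched on one sentence. The small `stops` set below is a hand-picked stand-in for the NLTK list, and `remove_stopwords` is a hypothetical helper, so the sketch runs without any NLTK data.

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
stops = {'the', 'was', 'and', 'i', 'it', 'not', "don't", 'a'}


def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in `stopwords`."""
    return ' '.join(word for word in text.split() if word not in stopwords)


cleaned = remove_stopwords("the food was not good and i don't recommend it", stops)
print(cleaned)  # food good recommend
```

The NLTK English list also contains negations such as "not" and "don't", so a negative review can end up reading as a positive one after this step. Whether that matters depends on how the corpus is used downstream.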

## Save Corpus - Cleaned + No Stopwords

In [None]:
with open('reviews-clean-nostop.pkl', 'wb') as f:
    pickle.dump(data, f)

## Lemmatize Text

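The WordNet lemmatizer only lemmatizes a single part of speech (POS) at a time. The default `pos` parameter is 'n' (noun). It accepts WordNet POS letters ('n', 'v', 'a', 'r', 's'), not Penn Treebank tags. The Penn Treebank tags, which the NLTK POS tagger produces, are listed here: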

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Here's an example that uses the NLTK POS tagger to identify each word's part of speech, then lemmatizes the word only as that type. This should save a lot of time, since the lemmatizer won't need to try every part of speech for each word.

https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizerwithappropriatepostag

http://www.nltk.org/howto/wordnet.html
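
A minimal sketch of the tag-then-lemmatize idea is shown below. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and the prefix-based mapping is a common simplification rather than anything prescribed by NLTK. The tokens are tagged by hand so the sketch runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs with one lookup per word.

    `lemmatize` is any callable taking (word, pos), for example
    nltk.WordNetLemmatizer().lemmatize.
    """
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Hand-tagged tokens; nltk.pos_tag returns pairs in this same (word, tag) shape.
tagged = [('dogs', 'NNS'), ('were', 'VBD'), ('barking', 'VBG'), ('loudly', 'RB')]
pos_letters = [penn_to_wordnet(tag) for _, tag in tagged]
print(pos_letters)  # ['n', 'v', 'v', 'r']
```

With the NLTK data installed, the pieces would be combined as `lemmatize_tagged(nltk.pos_tag(text.split()), wnl.lemmatize)`, so each word is looked up once instead of once per part of speech.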

In [None]:
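# Requires the WordNet data: nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()

for i, text in enumerate(data.loc[:,'review_text']):
    # First pass: lemmatize every word as a noun (the default pos)
    text = [wnl.lemmatize(word) for word in text.split()]
    # Second pass: lemmatize as a verb; WordNet expects 'v', not the Penn tag 'VB'
    text = [wnl.lemmatize(word, pos='v') for word in text]
    text = ' '.join(text)
    data.loc[i,'review_text'] = text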

In [None]:
print(data.loc[1,'review_text'])

## Save Corpus - Cleaned + No Stopwords + Lemmatized

In [None]:
with open('reviews-clean-nostop-lemmatized.pkl', 'wb') as f:
    pickle.dump(data, f)