# Build Yelp Review Corpus

This notebook outlines the steps to load, preprocess, and clean Yelp review text.

It was heavily inspired by: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [1]:
import psycopg2
import nltk
import unicodedata
import pandas as pd
import pprint
import pickle
import re
import os

pp = pprint.PrettyPrinter(indent=4).pprint

data_dir = '/home/tlappas/data_science/Yelp-Ratings/data/processed'

## Move to ./Yelp-Ratings/data/corpus folder

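The corpus path is built from `data_dir` instead of the current working directory, so this cell gives the same result if it is run more than once.

# Working directory at the time this cell runs
path = os.getcwd()
print('Notebook path: {}'.format(path))

# Absolute path to the 'corpus' folder that sits next to data_dir
data_path = os.path.join(os.path.dirname(data_dir), 'corpus')
print('Corpus path: {}'.format(data_path))

# Create the folder if it is missing; do nothing if it already exists
os.makedirs(data_path, exist_ok=True)

os.chdir(data_path)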

## Get Data

The review text is stored in the `yelp` PostgreSQL database. Query it and load the result into a pandas DataFrame.

In [2]:
conn = psycopg2.connect('dbname=yelp user=tlappas host=/var/run/postgresql')
cur = conn.cursor()
cur.execute("""
    SELECT * FROM review LIMIT 10
""")

cols = ['review_id', 'user_id', 'business_id', 'stars', 'review_date', 'review_text', 'useful', 'funny', 'cool']

data = pd.DataFrame(cur.fetchall(), columns=cols)
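
The column names above are hard-coded, so they must match the order that `SELECT *` returns. As a possible alternative, DB-API cursors (psycopg2's included) expose the column names of the last query through `cursor.description`. The sketch below uses an in-memory SQLite table as a stand-in for the `yelp` database; the table layout and the single row are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the yelp database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical, cut-down version of the review table
cur.execute('CREATE TABLE review (review_id TEXT, stars INTEGER, review_text TEXT)')
cur.execute("INSERT INTO review VALUES ('abc123', 5, 'Great blowout!')")

cur.execute('SELECT * FROM review LIMIT 10')

# cursor.description has one entry per result column; item 0 of each entry is the name
cols = [desc[0] for desc in cur.description]
rows = cur.fetchall()

print(cols)
print(rows)

conn.close()
```

With psycopg2, the same list comprehension could supply the `columns=` argument to `pd.DataFrame`.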

### Example Instance

In [3]:
print('DataFrame shape: {}\n'.format(data.shape))
print('First instance: \n')
print(data.loc[0])

DataFrame shape: (10, 9)

First instance: 

review_id                                 Q1sbwvVQXV2734tPgoKj4Q
user_id                                   hG7b0MtEbXx5QzbzE6C_VA
business_id                               ujmEBvifdJM6h6RLv4wQIg
stars                                                          1
review_date                                           2013-05-07
review_text    Total bill for this horrible service? Over $8G...
useful                                                         6
funny                                                          1
cool                                                           0
Name: 0, dtype: object


### Example Review

In [4]:
print(data.loc[1, 'review_text'])

I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  

Travis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure.  That was superb!  Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen.  The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement.  It was so much 

## Basic Text Preprocessing

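1. Remove any non-ASCII characters (after NFKD normalization, so accented letters keep their base letter).
2. Remove any characters that aren't alphanumeric, whitespace, or an apostrophe (`'`).
3. Convert all letters to lowercase.
4. Replace each newline or carriage-return character with a space (`' '`).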

In [5]:
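for i, text in enumerate(data.loc[:,'review_text']):

    # Decompose accented characters (NFKD), then drop anything outside ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf8', 'ignore')
    # Delete everything except letters, digits, whitespace, and apostrophes
    text = re.sub(r"[^A-Za-z0-9\s']", '', text)
    text = text.lower()
    # Replace each newline or carriage return with a single space
    text = re.sub(r'[\r\n]', ' ', text)
    data.loc[i,'review_text'] = text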

In [6]:
print(data.loc[1,'review_text'])

i adore travis at the hard rock's new kelly cardenas salon  i'm always a fan of a great blowout and no stranger to the chains that offer this service however travis has taken the flawless blowout to a whole new level    travis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a vegasworthy rockstar outfit  next comes the most relaxing and incredible shampoo  where you get a full head message that could cure even the very worst migraine in minutes  and the scented shampoo room  travis has freakishly strong fingers in a good way and use the perfect amount of pressure  that was superb  then starts the glorious blowout where not one not two but three people were involved in doing the best roundbrush action my hair has ever seen  the team of stylists clearly gets along extremely well as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement  it was so much fun to be there   next trav
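
As the output shows, punctuation is deleted rather than replaced with a space, so hyphenated words are joined ("Vegas-worthy" becomes "vegasworthy"), and runs of spaces are not collapsed. The same four steps can be wrapped in a function and checked on a short string. This is a minimal sketch; `clean_text` and the sample string are hypothetical and not part of the notebook.

```python
import re
import unicodedata

def clean_text(text):
    """Apply the four cleaning steps to a single string."""
    # 1. Decompose accented characters, then drop anything outside ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # 2. Delete everything except letters, digits, whitespace, and apostrophes
    text = re.sub(r"[^A-Za-z0-9\s']", '', text)
    # 3. Lowercase
    text = text.lower()
    # 4. Replace each newline or carriage return with a space
    text = re.sub(r'[\r\n]', ' ', text)
    return text

# "caf\u00e9" is "cafe" with an accented final letter
sample = "Vegas-worthy caf\u00e9!\nGreat"
cleaned = clean_text(sample)
print(cleaned)  # vegasworthy cafe great
```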

## Save Corpus - Cleaned Text

In [7]:
processing = ['ascii-only', 'whitespace alphanumeric only', 'lowercase', 'remove whitespace chars']
with open(os.path.join(data_dir, 'reviews-clean.pkl'), 'wb') as f:
    pickle.dump([data, processing], f)
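
Each pickle holds a two-item list: the data and the list of processing steps applied so far. A minimal sketch of writing one and reading it back follows. It uses a temporary directory and a plain list of strings in place of the DataFrame so that it runs without the database; both are stand-ins, not the notebook's data.

```python
import os
import pickle
import tempfile

# Stand-ins for the notebook's DataFrame and processing log
reviews = ['i adore travis', 'total bill for this horrible service']
processing = ['ascii-only', 'lowercase']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'reviews-clean.pkl')

    # Same layout as the cell above: [data, processing]
    with open(path, 'wb') as f:
        pickle.dump([reviews, processing], f)

    # Unpack the two items in the order they were saved
    with open(path, 'rb') as f:
        loaded_reviews, loaded_processing = pickle.load(f)

print(loaded_processing)
```

Keeping the processing list next to the data means a later notebook can check which steps a corpus file has already been through.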

## Remove Stopwords

In [8]:
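# Requires the NLTK stopwords corpus: nltk.download('stopwords')
# A set makes the membership test in the loop below a constant-time lookup
stops = set(nltk.corpus.stopwords.words('english'))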

for i, text in enumerate(data.loc[:,'review_text']):
    text = [word for word in text.split() if word not in stops]
    text = ' '.join(text)
    data.loc[i,'review_text'] = text

In [9]:
print(data.loc[1,'review_text'])

adore travis hard rock's new kelly cardenas salon i'm always fan great blowout stranger chains offer service however travis taken flawless blowout whole new level travis's greets perfectly green swoosh otherwise perfectly styled black hair vegasworthy rockstar outfit next comes relaxing incredible shampoo get full head message could cure even worst migraine minutes scented shampoo room travis freakishly strong fingers good way use perfect amount pressure superb starts glorious blowout one two three people involved best roundbrush action hair ever seen team stylists clearly gets along extremely well evident way talk help one another really genuine corporate requirement much fun next travis started flat iron way flipped wrist get volume around without overdoing making look like texas pagent girl admirable also worth noting fry hair something i've happen less skilled stylists end blowout style hair perfectly bouncey looked terrific thing better awesome blowout lasted days travis see every

## Save Corpus - Cleaned + No Stopwords

In [10]:
processing.append('no stopwords')
with open(os.path.join(data_dir, 'reviews-clean-nostop.pkl'), 'wb') as f:
    pickle.dump([data, processing], f)

## Lemmatize Text

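The WordNet lemmatizer handles only one part of speech (POS) per call. The default `pos` parameter is `'n'` (noun). The cell below therefore runs four passes over every word: noun, verb, adjective, and adverb. For reference, here is the [Penn Treebank set of POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

[Here's an example](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizerwithappropriatepostag) that uses the nltk POS tagger to identify each word's part of speech, then lemmatizes the word with just that POS. That may save time, since each word needs one lemmatizer call instead of one per part of speech. It may also reduce mistakes from applying the wrong part of speech: in the output below, "stranger" becomes "strange" and "less" becomes "le".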

[Wordnet documentation](http://www.nltk.org/howto/wordnet.html)
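
The tagger-based approach described above can be sketched as a mapping from Penn Treebank tags to WordNet's single-letter POS codes, followed by one lemmatizer call per word. This is only a sketch: `penn_to_wordnet`, `lemmatize_tagged`, and `toy_lemmatize` are hypothetical names, and the hand-written tags and toy lemmatizer stand in for nltk so the block runs without any downloaded corpora.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet uses."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # nouns and everything else fall back to the noun default

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with one lemmatizer call per word."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Hand-written (word, tag) pairs in the shape a POS tagger would return
tagged = [('stylists', 'NNS'), ('flipped', 'VBD'), ('less', 'RBR'), ('skilled', 'JJ')]

def toy_lemmatize(word, pos='n'):
    # Stand-in for a real lemmatizer: just records which POS was requested
    return '{}/{}'.format(word, pos)

result = lemmatize_tagged(tagged, toy_lemmatize)
print(result)  # ['stylists/n', 'flipped/v', 'less/r', 'skilled/a']
```

With nltk and its tagger and WordNet data installed, `tagged` would come from `nltk.pos_tag(text.split())` and `lemmatize` would be `wnl.lemmatize`. Tagging quality may suffer on text that has already been lowercased and stripped of stopwords, so it may be worth tagging earlier in the pipeline.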

In [11]:
wnl = nltk.WordNetLemmatizer()

for i, text in enumerate(data.loc[:,'review_text']):
    text = text.split()
    text = [wnl.lemmatize(word, pos='n') for word in text]
    text = [wnl.lemmatize(word, pos='v') for word in text]
    text = [wnl.lemmatize(word, pos='a') for word in text]
    text = [wnl.lemmatize(word, pos='r') for word in text]
    text = ' '.join(text)
    data.loc[i,'review_text'] = text

In [12]:
print(data.loc[1,'review_text'])

adore travis hard rock's new kelly cardenas salon i'm always fan great blowout strange chain offer service however travis take flawless blowout whole new level travis's greet perfectly green swoosh otherwise perfectly style black hair vegasworthy rockstar outfit next come relax incredible shampoo get full head message could cure even bad migraine minute scent shampoo room travis freakishly strong finger good way use perfect amount pressure superb start glorious blowout one two three people involve best roundbrush action hair ever see team stylist clearly get along extremely well evident way talk help one another really genuine corporate requirement much fun next travis start flat iron way flip wrist get volume around without overdo make look like texas pagent girl admirable also worth note fry hair something i've happen le skilled stylist end blowout style hair perfectly bouncey look terrific thing good awesome blowout last day travis see every single time i'm vega make feel beauuuutif

## Save Corpus - Cleaned + No Stopwords + Lemmatized

In [13]:
processing.append('lemmatize: NVJR')
with open(os.path.join(data_dir, 'reviews-clean-nostop-lemmed.pkl'), 'wb') as f:
    pickle.dump([data, processing], f)