# Yelp Data Challenge - NLP

BitTiger DS501

Jun 2017


- Load, visualize data
- Define positive/negative reviews
- Extract Tf-Idf feature vectors from review data
- Build review classifiers using supervised ML models
- Use cross-validation and grid search to tune parameters and select models



In [1]:
import pandas as pd

In [4]:
df = pd.read_csv('clean_data/last_2_years_restaurant_reviews.csv')

In [5]:
df.head()

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
0,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"Steakhouses, Restaurants, Cajun/Creole",4.0,0,2017-02-14,0,VETXTwMw6qxzOVDlXfe6Tg,5,went for dinner tonight. Amazing my husband ha...,0,ymlnR8UeFvB4FZL56tCZsA
1,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"Steakhouses, Restaurants, Cajun/Creole",4.0,0,2017-12-04,0,S8-8uZ7fa5YbjnEtaW15ng,5,This was an amazing dinning experience! ORDER ...,0,9pSSL6X6lFpY3FCRLEH3og
2,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"Steakhouses, Restaurants, Cajun/Creole",4.0,0,2016-08-22,1,1nK5w0VNfDlnR3bOz13dJQ,5,My husband and I went there for lunch on a Sat...,1,gm8nNoA3uB4In5o_Hxpq3g
3,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"Steakhouses, Restaurants, Cajun/Creole",4.0,0,2016-09-13,0,N1Z93BthdJ7FT2p5S22jIA,3,Went for a nice anniversary dinner. Researched...,0,CEtidlXNyQzgJSdF1ubPFw
4,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"Steakhouses, Restaurants, Cajun/Creole",4.0,0,2015-02-02,0,_Uwp6FO1X-avE9wqTMC59w,5,This place is first class in every way. Lobste...,0,-Z7Nw2UF7NiBSAzfXNA_XA


### Define your feature variable: here, the text of the review

In [6]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df['text'].values

In [7]:
# inspect your documents, e.g. check the size, take a peek at elements of the numpy array
documents[:3]

array(["went for dinner tonight. Amazing my husband had lobster bisque and the T bone both were delish.I had the French onion soup and the pan seared duck. Cooked to perfection and I'm still raving about the flavor. If you are ever in Vegas this is a must try.",
       'This was an amazing dinning experience! ORDER THE PORK CHOP! I will definitely return.',
       "My husband and I went there for lunch on a Saturday. We had a physically exhausting week so we decided to treat ourselves. But it hasn't always been easy for our allergy whenever we ate out. So we called Delmonico ahead to see if they can accommodate our special needs. The lady who answered our call was very courteous and we felt comfortable to try after having some answers from her.\r\nAs we arrived, the restaurant has a comfortable ambience. I wouldn't say it is grand or special but just comfortable. When it was time to order, the server was courteous regarding our allergy too and I believe the one who took care of us was 

### Define your target variable (any categorical variable that may be meaningful)

#### For example, I am interested in perfect (5 stars) and imperfect (1-4 stars) rating

In [8]:
# Add a boolean column that is True for perfect (5-star) reviews
df['favorable'] = df['stars']>4

#### You may want to look at the statistics of the target variable

In [9]:
# Take the values of the new column, save to a variable named "target"
target = df['favorable'].values

In [10]:
target.mean()

0.4741461922405801
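The mean of a boolean target is the share of positive examples, so the value above says that roughly 47% of the reviews have 5 stars. A minimal sketch of the same check on a made-up frame (the `toy` DataFrame and its star values are illustrative assumptions, not the Yelp data):

```python
import pandas as pd

# Hypothetical toy ratings standing in for df['stars']
toy = pd.DataFrame({"stars": [5, 4, 5, 1, 3, 5, 2, 5]})
toy["favorable"] = toy["stars"] > 4

# Mean of a boolean column = fraction of True values
share_favorable = toy["favorable"].mean()

# Absolute counts per class, useful for spotting class imbalance
class_counts = toy["favorable"].value_counts()

print(share_favorable)
print(class_counts)
```

A share close to 0.5 means the two classes are nearly balanced, so plain accuracy is a reasonable first metric.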

## Let's create training dataset and test dataset

In [47]:
from sklearn.model_selection import train_test_split

In [48]:
# Documents is your X, target is your y
# Now split the data to training set and test set

In [14]:
# Split to documents_train, documents_test, target_train, target_test
documents_train, documents_test, target_train, target_test = train_test_split(
    documents, target, test_size = 0.95, random_state=0
)
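Note that `test_size = 0.95` leaves only about 5% of the reviews for training, presumably to keep model fitting fast on a large dataset. The sketch below shows the effect on made-up data (the arrays `X_toy` and `y_toy` are assumptions for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for documents and target
X_toy = np.arange(100)
y_toy = X_toy % 2 == 0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.95, random_state=0
)

# Roughly 5% of the samples end up in the training set, 95% in the test set
print(len(X_tr), len(X_te))
```

Passing `stratify=y_toy` would additionally keep the class proportions the same in both parts, which may matter when the training part is this small.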

## Get NLP representation of the documents

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [49]:
# Create TfidfVectorizer, and name it vectorizer
vectorizer = TfidfVectorizer(stop_words='english',max_features=1000)

In [50]:
# Fit the vectorizer on the training data and transform it into Tf-Idf vectors
vectors_train = vectorizer.fit_transform(documents_train).toarray()

In [51]:
# Get the vocab of your tfidf
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
words = vectorizer.get_feature_names()

In [52]:
# Use the trained model to transform your test data
vectors_test = vectorizer.transform(documents_test).toarray()
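The difference between `fit_transform` on the training data and `transform` on the test data is easiest to see on a tiny corpus. This is a minimal sketch with made-up sentences (`toy_train_docs`, `toy_test_docs` and `toy_vectorizer` are hypothetical names, not part of the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train_docs = ["great pizza and great wings", "slow service and bland pizza"]
toy_test_docs = ["great service, unknown dessert"]

toy_vectorizer = TfidfVectorizer(stop_words='english')

# fit_transform learns the vocabulary and idf weights from the training docs
toy_train = toy_vectorizer.fit_transform(toy_train_docs)

# transform reuses that vocabulary; words unseen in training are ignored
toy_test = toy_vectorizer.transform(toy_test_docs)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_train.shape, toy_test.shape)
```

Both matrices have one column per vocabulary word learned from the training documents, which is what lets a model trained on `vectors_train` score `vectors_test`.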

## Similar review search engine

In [74]:
import numpy as np

# We will need these helper methods pretty soon

def get_top_values(lst, n, labels):
    '''
    INPUT: LIST, INTEGER, LIST
    OUTPUT: LIST

    Given a list of values, find the indices with the highest n values.
    Return the labels for each of these indices.

    e.g.
    lst = [7, 3, 2, 4, 1]
    n = 2
    labels = ["cat", "dog", "mouse", "pig", "rabbit"]
    output: ["cat", "pig"]
    '''
    return [labels[i] for i in np.argsort(lst)[::-1][:n]]  # np.argsort by default sorts values in ascending order

def get_bottom_values(lst, n, labels):
    '''
    INPUT: LIST, INTEGER, LIST
    OUTPUT: LIST

    Given a list of values, find the indices with the lowest n values.
    Return the labels for each of these indices.

    e.g.
    lst = [7, 3, 2, 4, 1]
    n = 2
    labels = ["cat", "dog", "mouse", "pig", "rabbit"]
    output: ["rabbit", "mouse"]
    '''
    return [labels[i] for i in np.argsort(lst)[:n]]
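Both helpers rely on `np.argsort`, which returns the indices that would sort the values in ascending order. A short sketch using the docstring example shows why the top-n helper reverses that order first (variable names here are illustrative):

```python
import numpy as np

lst = [7, 3, 2, 4, 1]
labels = ["cat", "dog", "mouse", "pig", "rabbit"]

# Indices ordered from the smallest value to the largest
order = np.argsort(lst)

# Reverse to get largest first, then keep the first two
top2 = [labels[i] for i in order[::-1][:2]]

# The ascending order already starts with the smallest values
bottom2 = [labels[i] for i in order[:2]]

print(order.tolist())  # [4, 2, 1, 3, 0]
print(top2)            # ['cat', 'pig']
print(bottom2)         # ['rabbit', 'mouse']
```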


In [54]:
# Let's use cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [55]:
# Draw an arbitrary review from test (unseen in training) documents
search_query = documents_test[55]
search_queries = [search_query]
print(search_query)
print(search_queries)


Really dig this joint. The first time I was here, I was a bit confused as far as ordering food and pizza from separate places. It was crowded but it did not take long to get our order of pizza and wings. I've been several times since and have been very satisfied with the atmosphere and the people. I would recommend to anyone who wants a chill drinking spot.
["Really dig this joint. The first time I was here, I was a bit confused as far as ordering food and pizza from separate places. It was crowded but it did not take long to get our order of pizza and wings. I've been several times since and have been very satisfied with the atmosphere and the people. I would recommend to anyone who wants a chill drinking spot."]


In [56]:
# Transform the drawn review(s) to vector(s)
vectors_search_queries = vectorizer.transform(search_queries).toarray()

In [57]:
# Calculate the similarity score(s) between vector(s) and training vectors
similarity_scores= cosine_similarity(vectors_search_queries,vectors_train)
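Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The following is a hand-rolled sketch of that formula on made-up vectors; `cosine_sim` is a hypothetical helper written for illustration, not the scikit-learn implementation:

```python
import numpy as np

def cosine_sim(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Rows with zero length get a similarity of 0 instead of a division error
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

toy_query = [1.0, 0.0]
toy_docs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.0, 0.0]]
toy_scores = cosine_sim(toy_query, toy_docs)
print(toy_scores)
```

The first toy document points in the same direction as the query and scores 1, the second is orthogonal and scores 0, and the third sits in between.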

In [58]:
# Let's find top 5 similar reviews
n = 5
top5_review= get_top_values(similarity_scores[0],n,documents_train)

In [59]:
print('Our search query:')
print(search_queries[0]) 

Our search query:
Really dig this joint. The first time I was here, I was a bit confused as far as ordering food and pizza from separate places. It was crowded but it did not take long to get our order of pizza and wings. I've been several times since and have been very satisfied with the atmosphere and the people. I would recommend to anyone who wants a chill drinking spot.


In [60]:
print('Most %s similar reviews:' % n)
for i, review in enumerate(top5_review):
    print('#%s:' % i) 
    print(review) 
    print("="*60)

Most 5 similar reviews:
#0:
Boom dot com! I've had the pizza, wings, fingers and fries; everything is incredibly delicious most definitely my new favorite pizza joint
#1:
Two stars for the naked chicken wings because those were good. I thought i'd try something new by ordering the Skinny Up pizza this time. The description did say that it would be less...smaller portions. I had no idea it would be so minimalistic! It barely had anything on it. On top of that it was scorched! Thinnest pizza ever. Like cardboard. Super dissapointing. Never ordering this pizza ever again.
#2:
I rarely eat pizza, so I gave in to ordering through the phone App because I was too lazy to go out and get a real pizza.  So basically I got what I deserved...a nasty, under cooked pizza.  I usually order from Domino's, but recalled Pizza Hut being better.  I was wrong!  Next time I'm lazy and want a cheap pizza, I won't be ordering here.  


I did order the dessert brownie.  That was the only thing that was dece

## Classifying positive/negative review

#### Naive-Bayes Classifier

In [62]:
# Build a Naive-Bayes Classifier

from sklearn.naive_bayes import MultinomialNB

model_nb = MultinomialNB()

model_nb.fit(vectors_train,target_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [63]:
# Get score for training set
model_nb.score(vectors_train,target_train)

0.8010301233026378

In [65]:
# Get score for test set
model_nb.score(vectors_test,target_test)

0.7962650509378445

#### Logistic Regression Classifier

In [68]:
# Build a Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression()

model_lr.fit(vectors_train,target_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [69]:
# Get score for training set
model_lr.score(vectors_train,target_train)

0.8279381926018418

In [70]:
# Get score for test set
model_lr.score(vectors_test,target_test)

0.8132952620658044

#### Q: What are the key features(words) that make the positive prediction?

In [72]:
# Let's find it out by ranking
n = 20 
get_top_values(model_lr.coef_[0],n,words)

['amazing',
 'best',
 'perfect',
 'fantastic',
 'awesome',
 'delicious',
 'excellent',
 'favorite',
 'outstanding',
 'incredible',
 'great',
 'love',
 'thank',
 'phenomenal',
 'gem',
 'die',
 'greeted',
 'owner',
 'fabulous',
 'wonderful']

A: Words that define a good review: amazing, best, perfect, fantastic, awesome, delicious, excellent.

#### Q: What are the key features(words) that make the negative prediction?

In [75]:
# Let's find it out by ranking
n = 20
get_bottom_values(model_lr.coef_[0],n,words)

['worst',
 'ok',
 'horrible',
 'rude',
 'mediocre',
 'slow',
 'terrible',
 'overpriced',
 'bland',
 'disappointing',
 'average',
 'decent',
 'unfortunately',
 'okay',
 'dry',
 'reason',
 'wasn',
 'overall',
 'worse',
 'lacking']

A: Words that define a terrible review: overpriced, rude, slow.

#### Random Forest Classifier

In [82]:
# Build a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier


model_rfc = RandomForestClassifier(min_samples_leaf=10, n_estimators = 5)

model_rfc.fit(vectors_train,target_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [83]:
# Get score for training set
model_rfc.score(vectors_train,target_train)

0.8199157171843297

In [84]:
# Get score for test set
model_rfc.score(vectors_test,target_test)

0.7625085635708571

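#### Q: What do you see from the training score and the test score?

A: The test score (0.763) is worse than that of both logistic regression (0.813) and naive Bayes (0.796). The gap between the training score (0.820) and the test score is also the widest of the three models, which suggests the random forest overfits more.

#### Q: What features (words) are important by inspecting the RFC model?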

In [88]:
n = 20
get_top_values(model_rfc.feature_importances_,n,words)


['amazing',
 'delicious',
 'best',
 'great',
 'minutes',
 'love',
 'wasn',
 'like',
 'awesome',
 'favorite',
 'vegas',
 'place',
 'friendly',
 'good',
 'excellent',
 'didn',
 'bad',
 'ok',
 'definitely',
 'perfect']

## Use cross validation to evaluate classifiers

[sklearn cross validation](http://scikit-learn.org/stable/modules/cross_validation.html)

In [None]:
# To be implemented
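One possible way to fill in this step is `cross_val_score`, which fits and scores a model on several train/validation folds. The sketch below is self-contained and uses a tiny made-up corpus (`toy_docs`, `toy_target` and `pipe` are assumptions); in the notebook you would more likely pass `vectors_train` and `target_train` with one of the models built above. Wrapping the vectorizer and classifier in a pipeline means the Tf-Idf vocabulary is refit inside each fold.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 10 positive and 10 negative two-word "reviews"
pos_words = ["amazing", "great", "best", "excellent", "awesome"]
neg_words = ["worst", "rude", "bland", "terrible", "horrible"]
nouns = ["pizza", "service"]
toy_docs = np.array(
    [f"{w} {n}" for n in nouns for w in pos_words]
    + [f"{w} {n}" for n in nouns for w in neg_words]
)
toy_target = np.array([True] * 10 + [False] * 10)

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 5-fold cross validation; for classifiers the folds are stratified by class
cv_scores = cross_val_score(pipe, toy_docs, toy_target, cv=5, scoring="accuracy")
print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean together with the standard deviation across folds gives a steadier comparison between classifiers than a single train/test split.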


## Use grid search to find best predictable classifier


[sklearn grid search tutorial (with cross validation)](http://scikit-learn.org/stable/modules/grid_search.html#grid-search)

[sklearn grid search documentation (with cross validation)](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

In [None]:
# To be implemented
pass
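A possible starting point for this step is `GridSearchCV`, which runs cross validation for every combination in a parameter grid and keeps the best one. This is only a sketch on a made-up corpus: the names `toy_docs`, `toy_target`, `search_pipe` and the grid values are assumptions chosen for illustration, not tuned settings for the Yelp data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 10 positive and 10 negative two-word "reviews"
pos_words = ["amazing", "great", "best", "excellent", "awesome"]
neg_words = ["worst", "rude", "bland", "terrible", "horrible"]
nouns = ["pizza", "service"]
toy_docs = np.array(
    [f"{w} {n}" for n in nouns for w in pos_words]
    + [f"{w} {n}" for n in nouns for w in neg_words]
)
toy_target = np.array([True] * 10 + [False] * 10)

search_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

grid = GridSearchCV(search_pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(toy_docs, toy_target)

print(grid.best_params_)
print(grid.best_score_)
```

After fitting, `grid.best_estimator_` is the pipeline refit on all the data passed to `fit`, and it can be scored on held-out test documents in the same way as the individual models above.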