# Tutorial Exercise: Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher number of stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [1]:
import pandas as pd

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [4]:
data = pd.read_csv('data/yelp.csv')
data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

In [8]:
data_subset = data[(data['stars'] == 1) | (data['stars'] == 5)]
data_subset.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4
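
An equivalent way to write this kind of filter is `Series.isin`, which can be easier to read when there are several values to match. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from `yelp.csv`):

```python
import pandas as pd

# made-up stand-in for the real data
toy = pd.DataFrame({
    'stars': [5, 3, 1, 4, 5, 1],
    'text': ['great', 'ok', 'awful', 'good', 'love it', 'never again'],
})

# boolean OR of two conditions, as in the cell above
subset_or = toy[(toy['stars'] == 1) | (toy['stars'] == 5)]

# the same rows selected with isin
subset_isin = toy[toy['stars'].isin([1, 5])]

print(subset_isin)
print(subset_or.equals(subset_isin))
```

Both approaches keep the original row labels, so the index of the result has gaps wherever rows were dropped.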


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [15]:
X = data_subset['text']
y = data_subset['stars']
print(X.shape)
print(y.shape)

(4086,)
(4086,)


In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

print(X_train_dtm.shape)
print(X_test_dtm.shape)

(3064, 16825)
(1022, 16825)
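
Both matrices have 16825 columns because the vocabulary is learned from the training set only (`fit_transform`) and then reused for the testing set (`transform`). A minimal sketch on a made-up corpus (the documents and the name `toy_vect` are illustrative assumptions) shows that a token seen only in the testing data is simply ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['great food great service', 'awful service']
test_docs = ['great pizza']  # 'pizza' never appears in the training documents

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)  # learn the vocabulary, then count
test_dtm = toy_vect.transform(test_docs)        # count using the learned vocabulary only

print(sorted(toy_vect.vocabulary_))
print(train_dtm.toarray())
print(test_dtm.toarray())
```

The testing matrix has one column per training token, and `'pizza'` contributes nothing because it has no column.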


## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [18]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

In [20]:
nb.fit(X_train_dtm, y_train)

MultinomialNB()

In [21]:
y_pred = nb.predict(X_test_dtm)

In [26]:
from sklearn import metrics

accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)

0.9187866927592955


In [27]:
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

[[126  58]
 [ 25 813]]
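
In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, with the labels in sorted order (1 star, then 5 stars). As a quick sanity check, the accuracy can be recovered from the matrix as the diagonal total divided by the grand total. The sketch below copies the numbers printed above into a plain list (`cm_binary` is an illustrative name):

```python
# confusion matrix printed above: rows are actual classes, columns are predicted classes
cm_binary = [[126, 58],
             [25, 813]]

# correct predictions sit on the diagonal
correct = sum(cm_binary[i][i] for i in range(len(cm_binary)))
total = sum(sum(row) for row in cm_binary)

accuracy_from_cm = correct / total
print(accuracy_from_cm)
```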


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [28]:
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64

In [30]:
null_accuracy = 838 / (838 + 184)
print(null_accuracy)

0.8199608610567515
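
One way to avoid typing the counts by hand is to let pandas compute the class proportions and take the largest one. This is a sketch, not the only valid approach; `y_toy` is a hypothetical Series rebuilt from the counts shown above rather than the real `y_test`:

```python
import pandas as pd

# rebuild a Series with the same class counts as y_test above
y_toy = pd.Series([5] * 838 + [1] * 184, name='stars')

# normalize=True returns proportions instead of counts
null_accuracy_toy = y_toy.value_counts(normalize=True).max()
print(null_accuracy_toy)
```

Because it relies only on the class proportions, the same expression works for any number of classes.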


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
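
For a binary problem, scikit-learn sorts the labels, and the usual `tn, fp, fn, tp` unpacking of the confusion matrix treats the larger label (5 stars here) as the positive class. Under that convention, a false positive is a 1-star review predicted as 5 stars, and a false negative is a 5-star review predicted as 1 star. A small sketch with made-up labels (the `_toy` names and values are illustrative):

```python
from sklearn import metrics

# made-up labels using the same two classes as the exercise
y_true_toy = [1, 1, 5, 5, 5]
y_pred_toy = [1, 5, 5, 5, 1]

# labels are sorted, so row/column 0 is class 1 and row/column 1 is class 5
tn, fp_count, fn_count, tp = metrics.confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(tn, fp_count, fn_count, tp)
```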

In [34]:
# false positives: actual 1-star reviews that were predicted as 5-star
fp = X_test[(y_test < 2) & (y_pred > 2)]
fp.head()

2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
Name: text, dtype: object

In [35]:
fp[2175]

'This has to be the worst restaurant in terms of hygiene. Two of my friends had food -poisoning after having dinner here. The food is just unhealthy with tons of oil floating on the top of curries, and I am not sure if any health/hygiene code is followed here. \nThe service is poor and the information on its website is incorrect, the owner does not allow dine-in after 9 or 10 even though it says that the restaurant is open till 11. \n\nOne night I saw the owner cleaning the place without gloves and she was nice enough to give us a to-go parcel without cleaning her hands (great example to the servers!). I had a peek inside the kitchen when the door was ajar, and it definitely looked dirty.\n\nI have been a lot of hole-in-the-wall places around this restaurant, including Haji Baba, the Vietnamese place and others, but neither any of my friends nor I have fallen sick coz of the food. If you need a spicy-food fix, i strongly recommend you do not try this place, lest you want a visit to the

In [36]:
# false negatives: actual 5-star reviews that were predicted as 1-star
fn = X_test[(y_test > 2) & (y_pred < 2)]
fn.head()

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
Name: text, dtype: object

In [37]:
fn[7148]

"I now consider myself an Arizonian. If you drive a lot on the 101 or 51 like I do, you'll get your fair share of chips on your windshield. You'll also have to replace a windshield like I had to do just recently. Apparently, chips and cracking windshields  is common in Arizona. In fact, I seem to recall my insurance agent telling me that insurance companies must provide this coverage in Arizona.\n\nI had a chip repaired about a year ago near the very bottom of the windshield. Just recently a small, very fine crack started traveling north on the windshield from the repaired chip (a different vendor repaired the chip). I called these guys over to my house and they said it was too long to fix, so they replaced the whole windshield the next day.\n\nWhat great service, they come out to your residence or place of business to repair or replace your windshield."

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [46]:
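# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
X_train_tokens = vect.get_feature_names_out()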

In [44]:
feature_counts = nb.feature_count_
feature_counts.shape

(2, 16825)

In [45]:
one_star = feature_counts[0]
five_star = feature_counts[1]

In [50]:
counts = pd.DataFrame({'tokens' : X_train_tokens, '1-star' : one_star, '5-star' : five_star}).set_index('tokens')
counts.head()

tokens,1-star,5-star
00,26.0,39.0
000,4.0,5.0
00a,1.0,0.0
00am,3.0,2.0
00pm,1.0,4.0


In [51]:
# add 1 to every count so that no token has a zero count (avoids dividing by zero later)
counts['1-star'] = counts['1-star'] + 1
counts['5-star'] = counts['5-star'] + 1

In [52]:
# convert the counts to frequencies by dividing by the number of reviews in each class
counts['1-star'] = counts['1-star'] / nb.class_count_[0]
counts['5-star'] = counts['5-star'] / nb.class_count_[1]

counts.head()

tokens,1-star,5-star
00,0.047788,0.016006
000,0.00885,0.002401
00a,0.00354,0.0004
00am,0.00708,0.0012
00pm,0.00354,0.002001


In [58]:
counts['5-star-ratio'] = counts['5-star']/counts['1-star']
counts.head()

tokens,1-star,5-star,5-star-ratio
00,0.047788,0.016006,0.334949
000,0.00885,0.002401,0.271309
00a,0.00354,0.0004,0.113045
00am,0.00708,0.0012,0.169568
00pm,0.00354,0.002001,0.565226


In [64]:
# top 10 most predictive tokens for 5-star review
counts['5-star-ratio'].sort_values(ascending = False)[:10]

tokens
fantastic      21.817727
perfect        18.464052
yum            14.017607
favorite       11.143029
outstanding    11.078431
brunch          9.495798
gem             9.043617
mozzarella      8.817527
pasty           8.817527
amazing         8.723323
Name: 5-star-ratio, dtype: float64

In [63]:
# top 10 most predictive tokens for 1-star review
counts['5-star-ratio'].sort_values()[:10]

tokens
staffperson       0.013299
refused           0.016149
disgusting        0.018841
filthy            0.020554
unprofessional    0.025121
unacceptable      0.025121
acknowledge       0.025121
ugh               0.026599
fuse              0.028261
boca              0.028261
Name: 5-star-ratio, dtype: float64
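
The same smoothing, normalizing, and ratio steps can also be sketched directly with NumPy arrays. The tokens and counts below are made up for illustration (the `toy_` names stand in for `vect` and `nb` and are not part of the exercise data):

```python
import numpy as np

# made-up tokens and counts standing in for vect and nb
toy_tokens = np.array(['awful', 'great', 'food'])
toy_feature_count = np.array([[3.0, 0.0, 1.0],   # class 0: 1-star
                              [1.0, 4.0, 1.0]])  # class 1: 5-star
toy_class_count = np.array([2.0, 4.0])           # number of reviews per class

# add 1 so no token has a zero count, then divide by the number of reviews per class
toy_freq = (toy_feature_count + 1) / toy_class_count[:, np.newaxis]

# a ratio above 1 leans 5-star, a ratio below 1 leans 1-star
toy_ratio = toy_freq[1] / toy_freq[0]

order = np.argsort(toy_ratio)
print(toy_tokens[order[::-1]])  # most 5-star-like first
print(toy_tokens[order])        # most 1-star-like first
```

Dividing by the class counts matters because the classes are imbalanced: without it, almost every token would look more common in 5-star reviews simply because there are more of them.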

## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
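
As an alternative to running these steps one by one, the vectorizer and the model can be chained with scikit-learn's `make_pipeline`. This is only a sketch of that alternative, using made-up reviews (all `toy_` names and texts are hypothetical); the cells that follow use the step-by-step approach instead:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn import metrics

# made-up reviews; the real exercise uses data.text and data.stars
toy_train_text = ['great food great service', 'awful service rude staff',
                  'amazing great place', 'awful food terrible']
toy_train_stars = [5, 1, 5, 1]
toy_test_text = ['great place', 'rude awful']
toy_test_stars = [5, 1]

# the pipeline fits the vectorizer on the training text only, then fits the model
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(toy_train_text, toy_train_stars)

toy_pred = pipe.predict(toy_test_text)
print(metrics.accuracy_score(toy_test_stars, toy_pred))
print(metrics.confusion_matrix(toy_test_stars, toy_pred))
```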

In [65]:
data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [66]:
X = data.text
y = data.stars

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

In [80]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [81]:
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)

In [82]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)

0.4712


In [91]:
# Null accuracy
null_accuracy = y_test.value_counts().head(1) / len(y_test)
null_accuracy

4    0.3536
Name: stars, dtype: float64

The accuracy (0.4712) looks low, but the model still outperforms the null accuracy (0.3536), which is what you would get by always predicting 4 stars.

In [93]:
cm = metrics.confusion_matrix(y_test, y_pred)
cm

array([[ 55,  14,  24,  65,  27],
       [ 28,  16,  41, 122,  27],
       [  5,   7,  35, 281,  37],
       [  7,   0,  16, 629, 232],
       [  6,   4,   6, 373, 443]])

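- Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but the model has trouble telling the two apart: 232 of the 884 4-star reviews are predicted as 5 stars, and 373 of the 832 5-star reviews are predicted as 4 stars.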
- 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data.

In [97]:
classification_report = metrics.classification_report(y_test, y_pred)
print(classification_report)

              precision    recall  f1-score   support

           1       0.54      0.30      0.38       185
           2       0.39      0.07      0.12       234
           3       0.29      0.10      0.14       365
           4       0.43      0.71      0.53       884
           5       0.58      0.53      0.55       832

    accuracy                           0.47      2500
   macro avg       0.45      0.34      0.35      2500
weighted avg       0.46      0.47      0.43      2500



In [98]:
# Precision (ratio of correct positive predictions to the total predicted positives) for class 1

precision_1 = 55 / (55 + 28 + 5 + 7 + 6)
precision_1

0.5445544554455446

In [99]:
# Recall (ratio of correct positive predictions to the total number of actual positives) for class 1

recall_1 = 55 / (55 + 14 + 24 + 65 + 27)
recall_1

0.2972972972972973

In [100]:
# f1-score (harmonic mean of precision and recall) for class 1

f1_1 = (2 * precision_1 * recall_1) / (precision_1 + recall_1)
f1_1

0.38461538461538464

In [101]:
# support (total number of observations for which the class is true) for class 1

support_1 = 55 + 14 + 24 + 65 + 27
support_1

185
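
The manual calculations above for class 1 generalize to every class: precision divides the diagonal entry by its column total, and recall divides it by its row total. The sketch below copies the confusion matrix printed earlier into a plain list (`cm5` and `rows` are illustrative names) and reproduces the per-class lines of the classification report:

```python
# confusion matrix from the 5-class model above (rows: actual, columns: predicted)
cm5 = [[55, 14, 24, 65, 27],
       [28, 16, 41, 122, 27],
       [5, 7, 35, 281, 37],
       [7, 0, 16, 629, 232],
       [6, 4, 6, 373, 443]]

n = len(cm5)
rows = []
for k in range(n):
    true_positives = cm5[k][k]
    predicted_k = sum(cm5[i][k] for i in range(n))  # column total: predicted as class k
    support_k = sum(cm5[k])                         # row total: actually class k
    precision_k = true_positives / predicted_k
    recall_k = true_positives / support_k
    f1_k = 2 * precision_k * recall_k / (precision_k + recall_k)
    rows.append((k + 1, round(precision_k, 2), round(recall_k, 2), round(f1_k, 2), support_k))

for row in rows:
    print(row)
```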

- Class 1 has low recall (0.30), meaning that the model has a hard time detecting the 1-star reviews, but moderate precision (0.54), meaning that when the model predicts a review is 1-star, it is correct slightly more than half the time.
- Class 5 has the highest precision (0.58) and f1-score (0.55), and the second-highest recall (0.53, after class 4), probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from.