# Homework with Yelp reviews data

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the course repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.
- The **cool** column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
- The **useful** and **funny** columns are similar to the **cool** column.

**Goal:** Predict the star rating of a review using **only** the review text. (We will not be using the cool, funny, or useful columns.)

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [78]:
#read the csv
import pandas as pd

df = pd.read_csv('/Users/SamK/Desktop/MLTextEx/data/yelp.csv')
df.head(5)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** You will need to filter the DataFrame using an OR condition. [Working with DataFrames](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/) has an example of this, and this [code snippet](http://chrisalbon.com/python/pandas_select_rows_multiple_filters.html) may also be helpful.

In [158]:
# create a new DataFrame with only the 1-star and 5-star reviews
ydf = pd.DataFrame(df[(df.stars == 5) | (df.stars == 1)])
ydf.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4
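
As a side note on the OR condition, the same filter can also be written with `isin`. Here is a minimal sketch on a made-up five-row DataFrame (the `toy` frame and its values are hypothetical, not taken from `yelp.csv`):

```python
import pandas as pd

# hypothetical toy data, standing in for the real reviews
toy = pd.DataFrame({'stars': [5, 3, 1, 4, 5],
                    'text': ['great', 'ok', 'awful', 'good', 'love it']})

# OR condition: each comparison needs its own parentheses because | binds tighter than ==
by_or = toy[(toy.stars == 5) | (toy.stars == 1)]

# equivalent filter using isin
by_isin = toy[toy.stars.isin([1, 5])]

print(by_or.index.tolist())
print(by_or.equals(by_isin))
```

Both filters keep the original index labels, which is why the filtered Yelp DataFrame above skips index 2 (a 4-star review).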


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [80]:
# define X and y for new data frame
X = ydf.text
y = ydf.stars

# import train_test_split and split the data
# (sklearn.cross_validation was removed in scikit-learn 0.20; sklearn.model_selection replaces it)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)

(3064,)
(1022,)


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [81]:
# use CountVectorizer to create document-term matrices for X_train and X_test
# vect.fit(X_train) learns the vocabulary of the training set
# vect.transform(X_train) creates a document-term matrix for the training set using the fitted vocabulary
# vect.transform(X_test) creates a document-term matrix for the test set using the vocabulary from X_train

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# vect.fit(X_train)
# X_train_dtm = vect.transform(X_train)

X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<3064x16825 sparse matrix of type '<type 'numpy.int64'>'
	with 237720 stored elements in Compressed Sparse Row format>
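
The "stored elements" figure in this output reflects how sparse the matrix is, since only the non-zero counts are kept. A quick back-of-the-envelope check, using only the numbers printed above:

```python
# figures copied from the X_train_dtm output above
n_rows, n_cols = 3064, 16825
stored = 237720

density = stored / float(n_rows * n_cols)        # share of cells that are non-zero
avg_tokens_per_review = stored / float(n_rows)   # distinct tokens per review, on average

print(round(density * 100, 2), round(avg_tokens_per_review, 1))
```

Fewer than half a percent of the cells are non-zero, which is why scikit-learn returns a sparse matrix rather than a dense array.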

In [82]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1022x16825 sparse matrix of type '<type 'numpy.int64'>'
	with 77006 stored elements in Compressed Sparse Row format>
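
Both matrices have 16825 columns because the test set is transformed with the vocabulary learned from the training set. The following toy sketch illustrates that behaviour (the documents and the `toy_vect` name are made up for illustration and are not part of the assignment data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy documents, not taken from yelp.csv
train_docs = ['great food', 'terrible service', 'great service']
test_docs = ['great pizza']  # 'pizza' never appears in the training documents

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)  # learn the vocabulary, then count tokens
test_dtm = toy_vect.transform(test_docs)        # count tokens using the training vocabulary only

print(sorted(toy_vect.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
print(test_dtm.toarray())
```

The unseen token 'pizza' is silently dropped from the test matrix, so the model only ever sees features it was trained on.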

## Task 5

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [83]:
# import multinomial naive bayes and instantiate
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [84]:
# fit the model to the training document-term matrix and predict the star ratings
# (the model needs numeric token counts, so it is fit on X_train_dtm rather than the raw text in X_train)
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
y_pred_class # predict returns a NumPy array with one predicted label per test review

array([5, 5, 5, ..., 5, 1, 5])

In [85]:
# print the accuracy score
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.91878669275929548

In [86]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

# rows are the true ratings and columns are the predicted ratings, with labels in sorted order (1, then 5)
# with that layout, the second label (5) plays the role of the positive class:
# true negatives (rating was 1, predicted 1): 126
# false positives (rating was 1, predicted 5): 58
# false negatives (rating was 5, predicted 1): 25
# true positives (rating was 5, predicted 5): 813

array([[126,  58],
       [ 25, 813]])

In [87]:
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64
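
To tie the confusion matrix back to the accuracy score, here is a small sketch that unpacks the four cells and recomputes the metrics by hand. It assumes 5 stars is treated as the positive class; the variable names are my own.

```python
# counts copied from the confusion matrix printed above
confusion = [[126, 58],
             [25, 813]]

# rows are true ratings (1, then 5); columns are predicted ratings (1, then 5)
# treating 5 stars as the positive class:
(tn, fp), (fn, tp) = confusion

total = tn + fp + fn + tp
accuracy = (tp + tn) / float(total)
sensitivity = tp / float(tp + fn)   # share of 5-star reviews predicted as 5 stars
specificity = tn / float(tn + fp)   # share of 1-star reviews predicted as 1 star

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

The row totals (184 and 838) match `y_test.value_counts()`, and the model is noticeably better at recognising 5-star reviews than 1-star reviews.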

## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [88]:
# each distinct rating is a class, so here the two classes are 1 and 5

y_test.value_counts().head(1) / len(y_test)

# NB is a better predictor of rating than always predicting the most frequent class (about 91.9% vs. 82.0% accuracy)

5    0.819961
Name: stars, dtype: float64
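
The same number can be reproduced without pandas. The sketch below uses the class counts shown earlier; the names are illustrative. A shortcut such as `max(y_test.mean(), 1 - y_test.mean())` assumes the labels are coded 0 and 1, so it would not work with ratings of 1 and 5.

```python
# class counts copied from y_test.value_counts() above
test_counts = {5: 838, 1: 184}

n_test = sum(test_counts.values())
most_frequent = max(test_counts, key=test_counts.get)   # class with the largest count
null_accuracy = test_counts[most_frequent] / float(n_test)

print(most_frequent, round(null_accuracy, 6))
```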

## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [90]:
# scikit-learn's confusion matrix sorts the labels, so 5 (the second label) is treated as the positive class and 1 as the negative class
# the model tends to misclassify reviews that are very long and mix positive and negative words/tokens

# false positives (rating was 1, predicted 5)
X_testfp = X_test[y_test < y_pred_class]
X_testfp


2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
8283    Don't know where I should start. Grand opening...
2765    Went last week, and ordered a dozen variety. I...
2839    Never Again,\nI brought my Mountain Bike in (w...
321     My wife and I live around the corner, hadn't e...
1919                                         D-scust-ing.
2490    Lazy Q CLOSED in 2010.  New Owners cleaned up ...
9125    La Grande Orange Grocery has a problem. It can...
9185    For frozen yogurt quality, I give this place a...
436     this another place that i would give no stars ...
2051    Sadly with new owners comes changes on menu.  ...
1721    This is the closest to a New York hipster styl...
3447    If you want a school that cares more about you...
842     Boy is

In [91]:
X_testfp[4311]

'Donuts are really good, if they have any when you get there!!!  Went in on a Tuesday morning at 1030, and they only had a total of 10 donuts. Drove out of my way to go there and still ended up at Dunkin Donuts. Very disappointed!!'

In [92]:
# false negatives (rating was 5, predicted 1)

X_testfn = X_test[y_test > y_pred_class]
X_testfn

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
3448    I was there last week with my sisters and whil...
6050    I went to sears today to check on a layaway th...
2504    I've passed by prestige nails in walmart 100s ...
2475    This place is so great! I am a nanny and had t...
241     I was sad to come back to lai lai's and they n...
3149    I was told to see Greg after a local shop diag...
423     These guys helped me out with my rear windshie...
763     Here's the deal. I said I was done with OT, bu...
8956    I took my computer to RedSeven recently when m...
750     This store has the most pleasant employees of ...
9765    You can't give anything less than 5 stars to a...
6334    I came here today for a manicure and pedicure....
1282    Loved 

In [95]:
X_testfn[7903]

'First, I\'m sorry this review is lengthy, but i really want people to understand how far a little kindness can go. \n\nI entered Mimi\'s Cafe at the end of possibly worst day ever. I came in looking to order food for takeout and get home to drown my sorrows in comfort food. i was directed by the hostess up front to go stand by the bakery and i would be able to order take out. i waited not 15 seconds and was greeted with a huge smile from a lovely girl named Danielle. She gave me a menu and told me to ask if i had questions and inquired if i had been there before. i looked at the menu for some time, unable to clear my mind enough to decide. She told me i was welcome to take a seat, and when i apologized for taking so long she responded with a smiley "where\'s the fire? take your time, can i get you a glass of wine while you decide?" wine was a perfect idea! I finally ordered and sat drinking my wine. i waited, and then waited a little longer, Danielle the young girl doing take out info
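
One possible explanation for reviews like this one: Multinomial Naive Bayes adds up a log-likelihood contribution for every token occurrence, so many mildly negative tokens can outweigh a few strongly positive ones. This is a rough sketch with invented per-token probabilities. All numbers are hypothetical, and the class prior is ignored for simplicity.

```python
import math

# hypothetical P(token | class) values, invented for illustration only
p_given_5 = {'kindness': 0.004, 'waited': 0.001, 'sorry': 0.001, 'worst': 0.0005}
p_given_1 = {'kindness': 0.001, 'waited': 0.003, 'sorry': 0.003, 'worst': 0.004}

def log_odds(tokens):
    """Sum of log(P(token|5) / P(token|1)); a positive total favours 5 stars."""
    return sum(math.log(p_given_5[t] / p_given_1[t]) for t in tokens)

short_review = ['kindness']
long_review = ['kindness', 'sorry', 'worst', 'waited', 'waited']

print(log_odds(short_review) > 0)
print(log_odds(long_review) > 0)
```

Because each token is scored independently of its context, a phrase such as "worst day ever" counts against the business even when the reviewer is describing their own day.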

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [105]:
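# get the feature names for the training tokens
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide vect.get_feature_names_out())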

X_train_tokens = vect.get_feature_names()
X_train_tokens[-5:]

[u'zwiebel', u'zzed', u'\xe9clairs', u'\xe9cole', u'\xe9m']

In [110]:
# use feature_count_ to get the number of times each token appears in class 1 and class 5
# rows follow the order of nb.classes_ (sorted labels), so row 0 is class 1 and row 1 is class 5
token_count_1 = nb.feature_count_[0,:]
token_count_1

array([ 26.,   4.,   1., ...,   0.,   0.,   0.])

In [111]:
token_count_5 = nb.feature_count_[1,:]
token_count_5

array([ 39.,   5.,   0., ...,   1.,   1.,   1.])

In [112]:
nb.feature_count_

array([[ 26.,   4.,   1., ...,   0.,   0.,   0.],
       [ 39.,   5.,   0., ...,   1.,   1.,   1.]])

In [130]:
tokens = pd.DataFrame({'tokens':X_train_tokens, '1':token_count_1, '5':token_count_5}).set_index('tokens')
tokens.head()

tokens,1,5
00,26.0,39.0
000,4.0,5.0
00a,1.0,0.0
00am,3.0,2.0
00pm,1.0,4.0


In [131]:
nb.class_count_

array([  565.,  2499.])

In [132]:
# add 1 to every count so that no token ends up with a frequency of zero
tokens['5_freq'] = tokens['5'] + 1
tokens['1_freq'] = tokens['1'] + 1
tokens.head()

tokens,1,5,5_freq,1_freq
00,26.0,39.0,40.0,27.0
000,4.0,5.0,6.0,5.0
00a,1.0,0.0,1.0,2.0
00am,3.0,2.0,3.0,4.0
00pm,1.0,4.0,5.0,2.0


In [140]:
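# convert the smoothed counts into frequencies by dividing by the number of reviews in each class
# computing from the raw counts keeps this cell safe to re-run; dividing 5_freq and 1_freq in place
# shrinks the values on every run, which would explain the very small magnitudes in the output below
tokens['5_freq'] = (tokens['5'] + 1) / nb.class_count_[1]
tokens['1_freq'] = (tokens['1'] + 1) / nb.class_count_[0]
tokens.head()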

tokens,1,5,5_freq,1_freq
00,26.0,39.0,7.264073999999999e-19,8.5e-05
000,4.0,5.0,1.0896109999999998e-19,1.6e-05
00a,1.0,0.0,1.816019e-20,6e-06
00am,3.0,2.0,5.448056e-20,1.3e-05
00pm,1.0,4.0,9.080092999999999e-20,6e-06


In [141]:
tokens.sort_values('5_freq', ascending=False).head(10)

tokens,1,5,5_freq,1_freq
the,4175.0,14008.0,2.54406e-16,0.013082
and,2659.0,10429.0,1.894107e-16,0.008333
to,2409.0,6333.0,1.150266e-16,0.00755
of,1294.0,4557.0,8.277412000000001e-17,0.004057
is,770.0,4390.0,7.974137e-17,0.002415
it,1328.0,4259.0,7.736239e-17,0.004163
was,1484.0,3534.0,6.419625000000001e-17,0.004652
in,941.0,3486.0,6.332457e-17,0.002951
for,970.0,3315.0,6.021917000000001e-17,0.003042
you,637.0,2902.0,5.2719020000000007e-17,0.001999


In [142]:
tokens.sort_values('1_freq', ascending=False).head(10)

tokens,1,5,5_freq,1_freq
the,4175.0,14008.0,2.54406e-16,0.013082
and,2659.0,10429.0,1.894107e-16,0.008333
to,2409.0,6333.0,1.150266e-16,0.00755
was,1484.0,3534.0,6.419625000000001e-17,0.004652
it,1328.0,4259.0,7.736239e-17,0.004163
of,1294.0,4557.0,8.277412000000001e-17,0.004057
for,970.0,3315.0,6.021917000000001e-17,0.003042
that,958.0,2591.0,4.7071200000000006e-17,0.003004
in,941.0,3486.0,6.332457e-17,0.002951
my,841.0,2689.0,4.8850900000000006e-17,0.002638
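
Sorting by frequency alone surfaces common words such as "the" and "and" for both classes. A token is more predictive of a class when its frequency in that class is high relative to the other class, so one option is to rank tokens by the ratio of the two frequencies. The sketch below does this for just the five tokens shown in `tokens.head()`. The `five_ratio` name is my own, and the add-one smoothing mirrors the step above.

```python
# raw counts copied from tokens.head() above: token -> (count in 1-star, count in 5-star)
counts = {'00': (26, 39), '000': (4, 5), '00a': (1, 0), '00am': (3, 2), '00pm': (1, 4)}

# number of training reviews per class, copied from nb.class_count_
n_one, n_five = 565.0, 2499.0

five_ratio = {}
for token, (one, five) in counts.items():
    one_freq = (one + 1) / n_one      # add 1 so that no frequency is zero
    five_freq = (five + 1) / n_five
    five_ratio[token] = five_freq / one_freq

# highest ratios lean towards 5 stars, lowest ratios lean towards 1 star
ranked = sorted(five_ratio, key=five_ratio.get, reverse=True)
print(ranked)
```

On the full table, the equivalent step would be to add a ratio column (for example `tokens['5_freq'] / tokens['1_freq']`) and sort by it in both directions to get the top 10 tokens for each class.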


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results.
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [164]:
# define X and y
X = df.text
y = df.stars

In [165]:
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [166]:
# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [167]:
# calculate testing accuracy
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred_class)

0.47120000000000001

In [168]:
metrics.confusion_matrix(y_test, y_pred_class) # rows are true ratings 1-5 (top to bottom), columns are predicted ratings 1-5 (left to right)

array([[ 55,  14,  24,  65,  27],
       [ 28,  16,  41, 122,  27],
       [  5,   7,  35, 281,  37],
       [  7,   0,  16, 629, 232],
       [  6,   4,   6, 373, 443]])

In [169]:
y_test.value_counts()

4    884
5    832
3    365
2    234
1    185
Name: stars, dtype: int64

In [175]:
# compare testing accuracy with null accuracy
y_test.value_counts().head(1) / len(y_test)

# Even though the 47% classification accuracy is quite low, it is still better than the 35% achieved by always predicting the most frequently occurring class

4    0.3536
Name: stars, dtype: float64
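
The last step of the task, the classification report, can be approximated by hand from the confusion matrix above. The following sketch computes per-class precision, recall, and F1 using only the printed counts (the variable names are my own). `metrics.classification_report(y_test, y_pred_class)` should report the same quantities.

```python
# rows: true ratings 1-5, columns: predicted ratings 1-5 (copied from the output above)
cm = [[55, 14, 24, 65, 27],
      [28, 16, 41, 122, 27],
      [5, 7, 35, 281, 37],
      [7, 0, 16, 629, 232],
      [6, 4, 6, 373, 443]]
labels = [1, 2, 3, 4, 5]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(5)) / float(total)

report = {}
for i, label in enumerate(labels):
    true_count = sum(cm[i])                  # reviews whose true rating is `label`
    pred_count = sum(row[i] for row in cm)   # reviews predicted as `label`
    precision = cm[i][i] / float(pred_count)
    recall = cm[i][i] / float(true_count)
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)

print(round(accuracy, 4))
for label in labels:
    print(label, [round(v, 2) for v in report[label]])
```

Reading the matrix this way shows that the model rarely predicts 2 or 3 stars: most 3-star reviews (281 of 365) are predicted as 4 stars, and the 4-star and 5-star classes are frequently confused with each other.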