# Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher star count is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.
- A **Multinomial Naive Bayes** model is used because it trains quickly and is well suited to the high-dimensional, highly sparse word-count matrices that `CountVectorizer` produces from text.

**Goal:** Predict the star rating of a review using **only** the review text.

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [1]:
import pandas as pd 
import numpy as np

In [2]:
# A forward slash avoids the backslash being read as a string escape and works on every OS
data=pd.read_csv("data/yelp.csv")

In [40]:
data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

In [4]:
df=data[(data.stars==5) | (data.stars==1)]

In [41]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4
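
The same filter can also be written with `Series.isin`, which scales more cleanly if more star values are ever included. The snippet below is a minimal sketch on an invented toy DataFrame (the rows are assumptions for illustration, not Yelp data):

```python
import pandas as pd

# Toy stand-in for the Yelp data; these rows are invented for illustration.
toy = pd.DataFrame({
    "stars": [5, 1, 4, 3, 5, 2, 1],
    "text": ["great", "awful", "good", "ok", "love it", "meh", "never again"],
})

# Keep only the 1-star and 5-star rows.
# Equivalent to toy[(toy.stars == 5) | (toy.stars == 1)].
extremes = toy[toy["stars"].isin([1, 5])]

# Confirm that only the two extreme ratings remain.
print(extremes["stars"].value_counts().sort_index())
```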


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

In [6]:
y=df["stars"]
X=df["text"]

In [7]:
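# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split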
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
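
By default `train_test_split` holds out 25% of the rows for testing. Because the 1-star class is much smaller than the 5-star class, it may be worth keeping the class proportions equal in both splits. The sketch below shows this on invented toy data; the `stratify=y` argument is an optional addition that the notebook itself does not use:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels: 16 five-star and 4 one-star "reviews" (invented).
y = pd.Series([5] * 16 + [1] * 4)
X = pd.Series([f"review {i}" for i in range(20)])

# Default test_size is 0.25; stratify=y preserves the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

print(len(X_train), len(X_test))
print(y_test.value_counts().sort_index())
```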



## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
vect=CountVectorizer()

In [10]:
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
X_train1=vect.transform(X_train)

In [12]:
X_test1=vect.transform(X_test)
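
To make the fit/transform distinction concrete, here is a minimal sketch on an invented three-document corpus. The vocabulary is learned from the training text only, so a word that appears only in the test text is silently dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents for illustration.
train_docs = ["great food great service", "terrible service"]
test_docs = ["great pizza"]  # "pizza" never appears in the training text

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)  # learn the vocabulary, then count
test_dtm = vect.transform(test_docs)        # count using the learned vocabulary only

# Columns follow the alphabetically sorted vocabulary.
print(sorted(vect.vocabulary_))  # ['food', 'great', 'service', 'terrible']
print(train_dtm.toarray())       # [[1 2 1 0], [0 0 1 1]]
print(test_dtm.toarray())        # [[0 1 0 0]]; "pizza" is ignored
```

Both matrices are stored in SciPy sparse format, which is why `.toarray()` is needed to view them as dense arrays.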

## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
nb = MultinomialNB()

In [14]:
nb.fit(X_train1,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
y_pred=nb.predict(X_test1)

In [16]:
confusion_matrix(y_test,y_pred)

array([[117,  77],
       [ 18, 810]], dtype=int64)
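
The accuracy asked for in this task can be read directly off the confusion matrix: correct predictions sit on the diagonal. A short sketch using the matrix printed above (rows are true labels and columns are predicted labels, in the order 1-star, 5-star):

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[117, 77],
               [18, 810]])

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall = diagonal / row totals (the true counts per class).
recall = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))  # 0.907
print(recall.round(4))
```

The same number should come out of `accuracy_score(y_test, y_pred)` from `sklearn.metrics`. The per-class recall shows that most of the errors are 1-star reviews predicted as 5-star.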

## Task 6 

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

In [17]:
print(len(y_test[y_test==5])/len(y_test))
print(len(y_test[y_test==1])/len(y_test))

0.8101761252446184
0.1898238747553816
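
The null accuracy is the larger of the two shares, about 0.81 (always predicting 5 stars). A compact way to get it for any number of classes is sketched below; `null_accuracy` is a hypothetical helper name and the labels are invented:

```python
import pandas as pd


def null_accuracy(y):
    """Share of the most frequent class, i.e. the accuracy of always predicting it."""
    return pd.Series(y).value_counts(normalize=True).max()


# Invented toy labels: three 5-star reviews and one 1-star review.
toy_y = [5, 5, 5, 1]
print(null_accuracy(toy_y))  # 0.75
```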


## Task 7

Browse through the review text of some of the **false positives** and **false negatives**.

In [18]:
print("False Negative:\n")
print(X_test[(y_test==5) & (y_pred==1)].iloc[11])
print("\nFalse Positive\n")
print(X_test[(y_test==1) & (y_pred==5)].iloc[20])

False Negative:

This is the only auto repair place I've ever seen with 5 stars solid, and they totally deserve it.

My boyfriend has a 1969 Pontiac Firebird that hasn't run in several years. After sinking lots of his own time into tinkering with it and $1000 to the awful Meineke down the street, he gave up... until my car's transmission went for good, and he decided to fix the Bird (yeah!) so I could drive it (yes, I am super lucky!)

Whitey's diagnosed it as the carburetor dumping fuel (which is what Meineke charged $500 to replace.. hmmm...). Upon hearing the story of Meineke, Whitey's told him to order the part himself to avoid the markup and they'd put it in for just labor.  Awesome!

We took it home, but it wasn't running quite right still, and then failed emissions. Back to the shop it went.

Whitey's took it back, spent ALL WEEKEND on it doing various minor things. At one point, they called and said "We've done this, this and this, and it's still not quite right, so we'
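
The selection logic used above can be checked on a small invented example. It treats 5 stars as the positive class, so a false positive is a 1-star review predicted as 5 stars. The boolean mask combines a pandas Series (`y_test`) with a NumPy array (`y_pred`) element by element, which works as long as both have the same length and order:

```python
import numpy as np
import pandas as pd

# Invented toy test set; the index deliberately does not start at 0,
# as happens after train_test_split shuffles the rows.
X_test = pd.Series(
    ["loved it", "not good at all", "worst ever", "fine"],
    index=[10, 11, 12, 13],
)
y_test = pd.Series([5, 1, 1, 5], index=[10, 11, 12, 13])
y_pred = np.array([5, 5, 1, 1])

# 1-star reviews predicted as 5-star, and 5-star reviews predicted as 1-star.
false_positives = X_test[(y_test == 1) & (y_pred == 5)]
false_negatives = X_test[(y_test == 5) & (y_pred == 1)]

print(false_positives.tolist())  # ['not good at all']
print(false_negatives.tolist())  # ['fine']
```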

## Task 8

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

In [19]:
nb.feature_count_.shape

(2, 16796)

In [20]:
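# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tokens=vect.get_feature_names_out()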

In [21]:
star_5=nb.feature_count_[1,:]

In [22]:
star_1=nb.feature_count_[0,:]

In [23]:
token=pd.DataFrame({"tokens":tokens,"Star_5":star_5,"Star_1":star_1}).set_index('tokens')

In [42]:
token.head()

tokens,Star_1,Star_5,ratio
00,0.057658,0.017935,0.311068
000,0.009009,0.002391,0.265444
00a,0.003604,0.000399,0.110602
00am,0.007207,0.000797,0.110602
00pm,0.003604,0.001993,0.553009


In [25]:
token.sample(5)

tokens,Star_1,Star_5
rootbeer,0.0,1.0
locales,0.0,1.0
220,0.0,1.0
pissed,6.0,2.0
recent,13.0,13.0


In [26]:
token["Star_1"]=token.Star_1+1
token["Star_5"]=token.Star_5+1

In [27]:
token["Star_1"]=token.Star_1/nb.class_count_[0]
token["Star_5"]=token.Star_5/nb.class_count_[1]

In [28]:
token["ratio"]=token.Star_5/token.Star_1


In [29]:
token.sort_values('ratio',ascending=False).head(10)

tokens,Star_1,Star_5,ratio
fantastic,0.001802,0.07214,40.037864
perfect,0.005405,0.100837,18.654843
roasted,0.001802,0.023515,13.051016
yum,0.001802,0.023515,13.051016
favorite,0.010811,0.134316,12.424273
awesome,0.010811,0.121164,11.207652
stuffed,0.001802,0.01953,10.83898
yummy,0.003604,0.036668,10.175369
mozzarella,0.001802,0.017537,9.732961
pasty,0.001802,0.017138,9.511758
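
The task also asks for the tokens most predictive of 1-star reviews. These are the ones with the smallest 5-star to 1-star ratio, so sorting in ascending order should surface them. The sketch below repeats the add-one smoothing and normalisation on a tiny invented table: the counts for "pissed" and "recent" echo the sample shown earlier, while the "yum" counts and both class sizes are assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in for the token table.
counts = pd.DataFrame(
    {"Star_1": [0.0, 6.0, 13.0], "Star_5": [9.0, 2.0, 13.0]},
    index=pd.Index(["yum", "pissed", "recent"], name="tokens"),
)
n_one_star, n_five_star = 10.0, 40.0  # hypothetical numbers of reviews per class

# Add 1 to every count so no ratio divides by zero, then normalise by class size.
smoothed = pd.DataFrame(index=counts.index)
smoothed["Star_1"] = (counts["Star_1"] + 1) / n_one_star
smoothed["Star_5"] = (counts["Star_5"] + 1) / n_five_star
smoothed["ratio"] = smoothed["Star_5"] / smoothed["Star_1"]

# The smallest ratios are the most 1-star-leaning tokens; the largest are 5-star-leaning.
print(smoothed.sort_values("ratio").head(10))
```

On the real table, the equivalent call would be `token.sort_values('ratio').head(10)`.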


## Task 9

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy.
- Print the confusion matrix
- Print the classification report
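
The steps above can be sketched as a single helper. This is one possible arrangement rather than the notebook's own code: `evaluate_text_classifier` is a hypothetical name, `make_pipeline` is used to chain the vectorizer and the model, the toy texts and labels are invented, and `zero_division=0` assumes scikit-learn 0.22 or later.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate_text_classifier(texts, stars, random_state=0):
    """Split, vectorize, fit Multinomial Naive Bayes, and collect the metrics."""
    labels = sorted(set(stars))
    X_train, X_test, y_train, y_test = train_test_split(
        texts, stars, random_state=random_state
    )
    # The pipeline fits the vocabulary on the training text only.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        # Share of the most frequent class in the test labels.
        "null_accuracy": Counter(y_test).most_common(1)[0][1] / len(y_test),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=labels),
        "report": classification_report(
            y_test, y_pred, labels=labels, zero_division=0
        ),
    }


# Invented toy data: one distinctive word per star rating.
toy_texts = ["terrible", "bad", "okay", "good", "excellent"] * 8
toy_stars = [1, 2, 3, 4, 5] * 8

results = evaluate_text_classifier(toy_texts, toy_stars)
print(results["accuracy"], results["null_accuracy"])
print(results["confusion_matrix"])
```

With the real data, the call would be `evaluate_text_classifier(data["text"], data["stars"])`.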

In [30]:
X=data["text"]
y=data["stars"]

In [31]:
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)

In [32]:
vector=CountVectorizer()

In [33]:
vector.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [34]:
X_train=vector.transform(X_train)
X_test=vector.transform(X_test)

In [35]:
nb.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [36]:
y_pred=nb.predict(X_test)

In [37]:
print(len(y_test[y_test==1])/len(y_test))
print(len(y_test[y_test==2])/len(y_test))
print(len(y_test[y_test==3])/len(y_test))
print(len(y_test[y_test==4])/len(y_test))
print(len(y_test[y_test==5])/len(y_test))

0.0752
0.088
0.1424
0.3528
0.3416


In [38]:
from sklearn.metrics import accuracy_score,classification_report
# Accuracy is symmetric, but scikit-learn's convention is (y_true, y_pred)
accuracy_score(y_test,y_pred)

0.4824
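
To compare this with the null accuracy, take the largest of the five class shares printed earlier. A short sketch using only the numbers already shown in the outputs:

```python
# Class shares in the test set and the test accuracy, copied from the outputs above.
class_shares = {1: 0.0752, 2: 0.088, 3: 0.1424, 4: 0.3528, 5: 0.3416}
test_accuracy = 0.4824

# Null accuracy: always predict the most frequent class.
most_frequent = max(class_shares, key=class_shares.get)
null_accuracy = class_shares[most_frequent]

print(most_frequent, null_accuracy)               # 4 0.3528
print(round(test_accuracy - null_accuracy, 4))    # 0.1296
```

The model beats the always-predict-4-stars baseline by roughly 13 percentage points, a much smaller margin than in the two-class problem.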

In [39]:
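# scikit-learn expects (y_true, y_pred). The report printed below came from the
# reversed call, so its precision and recall columns are swapped and its
# support column counts predicted labels rather than true labels.
print(classification_report(y_test,y_pred))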

             precision    recall  f1-score   support

          1       0.28      0.68      0.39        77
          2       0.08      0.27      0.12        62
          3       0.09      0.26      0.14       126
          4       0.72      0.44      0.54      1444
          5       0.55      0.60      0.58       791

avg / total       0.60      0.48      0.52      2500
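
The step list also calls for a confusion matrix. scikit-learn's metric functions take the true labels first and the predictions second, and passing `labels` fixes the row and column order. The sketch below uses invented labels rather than the notebook's predictions:

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted star ratings for eight reviews.
y_true = [1, 2, 3, 4, 4, 5, 5, 5]
y_hat = [1, 4, 4, 4, 5, 5, 5, 4]
labels = [1, 2, 3, 4, 5]

# Rows are true ratings, columns are predicted ratings, both in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Row sums give the number of true reviews per class (the "support").
print(cm.sum(axis=1))  # [1 1 1 2 3]
```

On the real data, the equivalent call would be `confusion_matrix(y_test, y_pred)`. A matrix like this makes it easy to see whether errors fall mostly on neighbouring star ratings.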

