## Yelp Restaurant Reviews and Ratings Analysis
### Sprint2 - Modelling and Predictive analysis
#### By Steven Too Heng Kwee  - 304449


### (1). Importing all the necessary modules:

In [2]:
# IMPORTING ALL THE NECESSARY LIBRARIES AND PACKAGES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
import string
import math
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve
%matplotlib inline

### (2). Loading and seeing the dataset details:

#### Use file 'review_TO_R.csv' generated in sprint1 

In [3]:
# LOADING THE DATASET AND SEEING THE DETAILS
data = pd.read_csv('review_TO_R.csv')
# SHAPE OF THE DATASET
print("Shape of the dataset:")
print(data.shape)
# COLUMN NAMES
print("Column names:")
print(data.columns)
# DATATYPE OF EACH COLUMN
print("Datatype of each column:")
print(data.dtypes)
# SEEING FEW OF THE ENTRIES
print("Few dataset entries:")
print(data.head())
# DATASET SUMMARY
data.describe(include='all')

Shape of the dataset:
(57047, 24)
Column names:
Index(['_id_x', 'business_id', 'cool', 'date', 'funny', 'review_id', 'stars_x',
       'text', 'useful', 'user_id', '_id_y', 'address', 'attributes',
       'categories', 'city', 'hours', 'is_open', 'latitude', 'longitude',
       'name', 'postal_code', 'review_count', 'stars_y', 'state'],
      dtype='object')
Datatype of each column:
_id_x            object
business_id      object
cool              int64
date             object
funny             int64
review_id        object
stars_x         float64
text             object
useful            int64
user_id          object
_id_y            object
address          object
attributes       object
categories       object
city             object
hours            object
is_open           int64
latitude        float64
longitude       float64
name             object
postal_code      object
review_count      int64
stars_y         float64
state            object
dtype: object
Few dataset entries:
   

Unnamed: 0,_id_x,business_id,cool,date,funny,review_id,stars_x,text,useful,user_id,...,city,hours,is_open,latitude,longitude,name,postal_code,review_count,stars_y,state
count,57047,57047,57047.0,57047,57047.0,57047,57047.0,57047,57047.0,57047,...,57047,53487,57047.0,57047.0,57047.0,57047,57028,57047.0,57047.0,57047
unique,57047,4912,,56950,,57047,,56958,,22254,...,1,2778,,,,4060,2459,,,2
top,5d35d0a305c8a038ca4171ab,r_BrIgzYcwo1NAuG9dLbpg,,2018-05-23 01:34:02,,v85RkCG6h7mV-6-yAHK31A,,This small and unassuming place blew me away! ...,,iRQ_YKpCBdaCwvc2X8_3NQ,...,Toronto,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",,,,Pai Northern Thai Kitchen,M6A 2T9,,,ON
freq,1,600,,2,,1,,3,,123,...,57047,1111,,,,600,642,,,57038
mean,,,0.52888,,0.229758,,3.705646,,0.955493,,...,,,0.971234,43.680277,-79.388998,,,188.349624,3.729074,
std,,,3.379325,,1.801161,,1.30762,,3.964542,,...,,,0.167149,0.049278,0.050575,,,284.562989,0.570702,
min,,,0.0,,0.0,,1.0,,0.0,,...,,,0.0,43.592327,-79.680563,,,3.0,1.0,
25%,,,0.0,,0.0,,3.0,,0.0,,...,,,1.0,43.649166,-79.41076,,,41.5,3.5,
50%,,,0.0,,0.0,,4.0,,0.0,,...,,,1.0,43.657648,-79.391842,,,101.0,4.0,
75%,,,0.0,,0.0,,5.0,,1.0,,...,,,1.0,43.684581,-79.378093,,,216.0,4.0,


#### ALERT: The vectorization process takes multiple hours. The data is reduced further for testing/debugging and to produce a modelling base for the project.
#### Current path: showcase with stars 1, 3 and 5 only. Alternatively, if keeping this code, test on a reduced, class-balanced set of records covering all stars.

### (3). Classifying the dataset and splitting it into the reviews and stars:
 

In [8]:
# CLASSIFICATION
data_classes = data[(data['stars_x']==1) | (data['stars_x']==3) | (data['stars_x']==5)]
data_classes.head()
print(data_classes.shape)

# Separate the dataset into X and Y for prediction
x = data_classes['text']
y = data_classes['stars_x']
print(x.head())
print(y.head())

(35096, 25)
4    All stars go to the decor and atmosphere of th...
5    Let's be honest, everyone's here for the photo...
6    This place is BEAUTIFUL! And the restaurant is...
8    If you are looking for a beautiful place to di...
9    Very great and unique experience dining at a f...
Name: text, dtype: object
4    3.0
5    3.0
6    5.0
8    5.0
9    5.0
Name: stars_x, dtype: float64


### (4). Data Cleaning for modelling:
We now define a function that cleans each review by removing punctuation and stopwords.

In [9]:
# CLEANING THE REVIEWS - REMOVAL OF STOPWORDS AND PUNCTUATION
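# Build the stopword set once; calling stopwords.words('english') for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def text_process(text):
    # Drop punctuation characters, then rejoin into a single string
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # Keep the words that are not stopwords (comparison is case-insensitive)
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]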
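
To make the cleaning step concrete, here is a minimal, standalone sketch of the same idea applied to one sentence. The stopword list `SAMPLE_STOPWORDS` and the function name `tokenize` are hypothetical stand-ins chosen so the example runs without downloading the nltk corpus; the notebook itself relies on `stopwords.words('english')`.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook uses nltk's stopwords.words('english') instead.
SAMPLE_STOPWORDS = frozenset({"the", "is", "and", "a", "was", "but", "this"})

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text, stop_words=SAMPLE_STOPWORDS):
    """Strip punctuation, split on whitespace, drop stopwords (case-insensitive)."""
    no_punct = text.translate(_PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]


sample = "The food was AMAZING, but the service is slow!"
tokens = tokenize(sample)
print(tokens)
```

Note that the surviving tokens keep their original case, so "AMAZING" and "amazing" would count as different words. Holding the stopwords in a set makes each membership test cheap, which matters when the function is called once per review.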

### (5). Vectorization of the whole review set and checking the sparse matrix:


In [12]:
# CONVERTING THE WORDS INTO A VECTOR
vocab = CountVectorizer(analyzer=text_process).fit(x)

In [15]:
print(len(vocab.vocabulary_))

60581


In [16]:
x = vocab.transform(x)
# SHAPE OF THE MATRIX
print("Shape of the sparse matrix: ", x.shape)
# NON-ZERO OCCURRENCES
print("Non-Zero occurrences: ",x.nnz)

# DENSITY OF THE MATRIX
density = (x.nnz/(x.shape[0]*x.shape[1]))*100
print("Density of the matrix = ",density)

Shape of the sparse matrix:  (35096, 60581)
Non-Zero occurrences:  1694056
Density of the matrix =  0.07967713386663411
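
The density figure is the percentage of cells in the document-term matrix that hold a non-zero count. As a rough illustration, the sketch below builds a tiny bag-of-words by hand on three made-up reviews (not taken from the dataset) and applies the same formula, then re-applies it to the numbers printed above.

```python
from collections import Counter

# Three toy "reviews", already tokenised (illustrative data, not from the dataset).
docs = [
    ["great", "food", "great", "service"],
    ["slow", "service"],
    ["food", "cold"],
]

# Vocabulary: one column per distinct token, in sorted order.
vocabulary = sorted({token for doc in docs for token in doc})

# Sparse rows: store only the non-zero counts, keyed by column index.
column = {token: i for i, token in enumerate(vocabulary)}
rows = [{column[t]: c for t, c in Counter(doc).items()} for doc in docs]

non_zero = sum(len(row) for row in rows)
density = non_zero / (len(docs) * len(vocabulary)) * 100

print("Shape:", (len(docs), len(vocabulary)))
print("Non-zero entries:", non_zero)
print("Density (%):", round(density, 2))

# The same formula applied to the figures reported above.
reported = 1694056 / (35096 * 60581) * 100
print("Reported density (%):", round(reported, 4))
```

A density below 0.1% means that almost every cell is zero, which is why the matrix is stored in sparse form.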


### (6). Modelling:

- Splitting the dataset X into training and testing sets:

In [17]:
# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=101)
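
As a quick sanity check on the split sizes, the sketch below assumes the fractional test size is rounded up to a whole number of rows. That assumption is consistent with the support total of 7020 shown in the classification reports. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 35096 reviews remain after filtering to 1-, 3- and 5-star ratings.
n_train, n_test = split_sizes(35096, 0.2)
print(n_train, n_test)
```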

- Using multiple machine learning algorithms to see which gives the best performance.

(1). Multinomial Naive Bayes - We use Multinomial rather than Gaussian Naive Bayes because the features are sparse word counts. Gaussian Naive Bayes assumes each feature is normally distributed, which does not hold for count data, so it is a poor fit in this case.
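
To show why word counts suit the multinomial model, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. It is a teaching sketch under simplifying assumptions (toy data, hypothetical function names `train_multinomial_nb` and `predict`), not the scikit-learn implementation used in the next cell.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed per-class token log-likelihoods."""
    vocabulary = {token for doc in docs for token in doc}
    token_counts = defaultdict(Counter)   # class -> token -> count
    class_counts = Counter(labels)        # class -> number of documents
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                token: math.log((token_counts[label][token] + alpha) / denom)
                for token in vocabulary
            },
        }
    return model


def predict(model, doc):
    """Return the class with the highest log-posterior; unseen tokens are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][t] for t in doc if t in params["log_likelihood"]
        )
    return max(model, key=score)


# Toy, already-tokenised reviews (illustrative only).
train_docs = [
    ["great", "food", "great", "service"],
    ["amazing", "food"],
    ["terrible", "service", "cold", "food"],
    ["terrible", "wait"],
]
train_labels = [5.0, 5.0, 1.0, 1.0]

nb = train_multinomial_nb(train_docs, train_labels)
print(predict(nb, ["great", "service"]))
print(predict(nb, ["terrible", "cold", "food"]))
```

Each class is described only by how often each word appears in it, so the model works directly on sparse counts and needs no assumption about a bell-shaped feature distribution.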

In [18]:
# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(x_train,y_train)
predmnb = mnb.predict(x_test)
print("Confusion Matrix for Multinomial Naive Bayes:")
print(confusion_matrix(y_test,predmnb))
print("Score:",round(accuracy_score(y_test,predmnb)*100,2))
print("Classification Report:",classification_report(y_test,predmnb))

Confusion Matrix for Multinomial Naive Bayes:
[[ 899  247   37]
 [ 153 1359  296]
 [  52  267 3710]]
Score: 85.01
Classification Report:              precision    recall  f1-score   support

        1.0       0.81      0.76      0.79      1183
        3.0       0.73      0.75      0.74      1808
        5.0       0.92      0.92      0.92      4029

avg / total       0.85      0.85      0.85      7020
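
The score and the per-class precision and recall can be recomputed by hand from the confusion matrix, where rows are the true classes and columns are the predicted classes. The sketch below does this for the Multinomial Naive Bayes matrix printed above.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions.
labels = [1.0, 3.0, 5.0]
cm = [
    [899, 247, 37],
    [153, 1359, 296],
    [52, 267, 3710],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total
print("Accuracy:", round(accuracy * 100, 2))

for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actually_label = sum(cm[i])                                     # row sum
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / actually_label
    print(label, round(precision, 2), round(recall, 2))
```

Most of the errors sit in the cells next to the diagonal: 1-star and 5-star reviews are mainly confused with 3-star reviews rather than with each other.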



(2). Random Forest Classifier

In [19]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rmfr = RandomForestClassifier()
rmfr.fit(x_train,y_train)
predrmfr = rmfr.predict(x_test)
print("Confusion Matrix for Random Forest Classifier:")
print(confusion_matrix(y_test,predrmfr))
print("Score:",round(accuracy_score(y_test,predrmfr)*100,2))
print("Classification Report:",classification_report(y_test,predrmfr))

Confusion Matrix for Random Forest Classifier:
[[ 689  220  274]
 [ 199  842  767]
 [  63  323 3643]]
Score: 73.7
Classification Report:              precision    recall  f1-score   support

        1.0       0.72      0.58      0.65      1183
        3.0       0.61      0.47      0.53      1808
        5.0       0.78      0.90      0.84      4029

avg / total       0.73      0.74      0.72      7020



(3). Decision Tree

In [20]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
preddt = dt.predict(x_test)
print("Confusion Matrix for Decision Tree:")
print(confusion_matrix(y_test,preddt))
print("Score:",round(accuracy_score(y_test,preddt)*100,2))
print("Classification Report:",classification_report(y_test,preddt))

Confusion Matrix for Decision Tree:
[[ 702  288  193]
 [ 275  895  638]
 [ 172  537 3320]]
Score: 70.04
Classification Report:              precision    recall  f1-score   support

        1.0       0.61      0.59      0.60      1183
        3.0       0.52      0.50      0.51      1808
        5.0       0.80      0.82      0.81      4029

avg / total       0.70      0.70      0.70      7020



(4). Support Vector Machines

In [21]:
# Support Vector Machine
from sklearn.svm import SVC
svm = SVC(random_state=101)
svm.fit(x_train,y_train)
predsvm = svm.predict(x_test)
print("Confusion Matrix for Support Vector Machines:")
print(confusion_matrix(y_test,predsvm))
print("Score:",round(accuracy_score(y_test,predsvm)*100,2))
print("Classification Report:",classification_report(y_test,predsvm))

Confusion Matrix for Support Vector Machines:
[[   5    1 1177]
 [   1    1 1806]
 [   0    0 4029]]
Score: 57.48
Classification Report:              precision    recall  f1-score   support

        1.0       0.83      0.00      0.01      1183
        3.0       0.50      0.00      0.00      1808
        5.0       0.57      1.00      0.73      4029

avg / total       0.60      0.57      0.42      7020
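
The confusion matrix shows the SVM predicting 5 stars for almost every review. A useful reference point is the accuracy of a baseline that always predicts the most frequent class, computed below from the support column. One possible cause, stated here as an assumption rather than a verified diagnosis, is that the default kernel and parameters suit raw word counts poorly.

```python
# Test-set class sizes, taken from the "support" column of the reports above.
support = {1.0: 1183, 3.0: 1808, 5.0: 4029}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print("Majority class:", majority_class)
print("Baseline accuracy:", round(baseline_accuracy * 100, 2))

# Accuracy implied by the SVM confusion matrix (5 + 1 + 4029 correct predictions).
svm_accuracy = (5 + 1 + 4029) / total
print("SVM accuracy:", round(svm_accuracy * 100, 2))
```

The SVM is only about 0.1 percentage points above this baseline, so its score reflects the class imbalance rather than anything learned from the text.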



(5). Multilayer Perceptron Classifier

In [22]:
# MULTILAYER PERCEPTRON CLASSIFIER
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()
mlp.fit(x_train,y_train)
predmlp = mlp.predict(x_test)
print("Confusion Matrix for Multilayer Perceptron Classifier:")
print(confusion_matrix(y_test,predmlp))
print("Score:",round(accuracy_score(y_test,predmlp)*100,2))
print("Classification Report:")
print(classification_report(y_test,predmlp))

Confusion Matrix for Multilayer Perceptron Classifier:
[[ 945  182   56]
 [ 156 1306  346]
 [  40  266 3723]]
Score: 85.1
Classification Report:
             precision    recall  f1-score   support

        1.0       0.83      0.80      0.81      1183
        3.0       0.74      0.72      0.73      1808
        5.0       0.90      0.92      0.91      4029

avg / total       0.85      0.85      0.85      7020



### RESULTS - From the models above, the test-set accuracy scores are:
- Multilayer Perceptron = 85.1%
- Multinomial Naive Bayes = 85.01%
- Random Forest Classifier = 73.7%
- Decision Tree = 70.04%
- Support Vector Machine  = 57.48%

### The Multilayer Perceptron Classifier has the best score (narrowly ahead of Multinomial Naive Bayes), so let us use it to predict a random positive review, a random average review and a random negative review!

### (7) Predicting rating for sample reviews

In [42]:
#Locate review samples
data.head(100)

Unnamed: 0,_id_x,business_id,cool,date,funny,review_id,stars_x,text,useful,user_id,...,hours,is_open,latitude,longitude,name,postal_code,review_count,stars_y,state,length
0,5d34af2f05c8a038caf1cb40,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-08-14 04:13:05,0,PB4wv1eNEGXh8QFQCMnd1g,4.0,The service here is impeccable. Our waitress w...,0,CZqHG0JtP6pxK2ox7zpwVQ,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,630
1,5d34af3605c8a038caf1dfe5,3a7Qby_IX7sU7O6ZsQZeOQ,1,2018-01-22 00:02:19,0,mNjd9B7dVBGH5f_rKIRofg,4.0,Decided to come check out the furniture and se...,3,TFxeEvpjMNQ3AWL49iMwtA,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,846
2,5d34af2505c8a038caf1a198,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-03-15 19:09:56,2,n9B9XWtqYy1s-nyV6qnGwQ,4.0,My best friend and I came here to see what the...,4,HxkWE8b1bJbSc4Ihmgy5dQ,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,1453
3,5d34af5205c8a038caf20309,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-10-03 23:13:59,0,hjSFsF9bxiRqf1OM73Gc7A,2.0,I went there for lunch today. They've changed...,0,BV2TQDbbgC5Mwgyoh3CHDA,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,238
4,5d34af6c05c8a038caf22ae0,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-02-02 02:41:52,0,nxzG0S6v2hkBB9iYIFSVoQ,3.0,All stars go to the decor and atmosphere of th...,3,Fsl7fnXttgugpoyuCJ0zkg,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,348
5,5d34afc805c8a038caf262c5,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-01-10 23:04:12,0,7BS4ndznS8Yx9AfkxG4M6A,3.0,"Let's be honest, everyone's here for the photo...",2,T5BOAvuPsNAhjYTFkRqZ1Q,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,333
6,5d34b00405c8a038caf28272,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-01-22 16:18:53,1,lC9dCZTPAfpry1S_4N_JBg,5.0,This place is BEAUTIFUL! And the restaurant is...,1,l55_yghqjkJQ2TzRMbNsag,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,534
7,5d34b03905c8a038caf2ac22,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-03-05 03:46:33,0,TNiXjddNS_Z-NMUdrCU6Qw,4.0,"Located at Yorkdale Mall, the Restoration Hard...",0,wdeWt5VqTW26PAeQsVg73g,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,1168
8,5d34b04f05c8a038caf2ccf4,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-03-04 19:39:15,0,w0iE1udHzKhpGSr7472igA,5.0,If you are looking for a beautiful place to di...,0,LtyBdOAFYXa4NC9l_WWn1A,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,799
9,5d34b05305c8a038caf2d8d5,3a7Qby_IX7sU7O6ZsQZeOQ,0,2018-08-25 18:30:19,0,J1qM8B1RKWhQkxxw4L13mw,5.0,Very great and unique experience dining at a f...,0,pae5DbzPtBfFQlceFwo1QQ,...,"{'Monday': '0:0-0:0', 'Tuesday': '10:0-20:0', ...",1,43.724684,-79.454173,RH Courtyard Cafe,M6A 2T9,82,4.0,ON,402


In [39]:
# POSITIVE REVIEW
pr = data['text'][73]
print(pr)
print("Actual Rating: ",data['stars_x'][73])
pr_t = vocab.transform([pr])
print("Predicted Rating:")
mlp.predict(pr_t)[0]

This is your neighbourhood greasy spoon diner. It gets busy on weekends so be prepared to wait. Excellent and personable service with a cozy and old school vibe. I had the gyro omlette which was tasty and had a unique flavour as it was served with tzaziki. Large portions, bottomless coffee as you would expect. A solid 4 stars from me and worth checking out over chains like eggsmart and cora's.
Actual Rating:  4.0
Predicted Rating:


5.0

In [38]:
# AVERAGE REVIEW
ar = data['text'][14]
print(ar)
print("Actual Rating: ",data['stars_x'][14])
ar_t = vocab.transform([ar])
print("Predicted Rating:")
mlp.predict(ar_t)[0]

This is a small restaurant at the front of the new restoration hardware store in Yorkdale mall. We came for lunch and there was about a 20-30 min wait so we put down our number and just walked around the mall. I'd suggest making a reservation in the future as this is the second time I've been confronted with a wait. 

The food choices are roughly Western style-esque brunch with sandwiches, salads, fries, etc. I wasn't too pleased to see that all the salads needed meat added to them- with the lowest starting at $8 for chicken. 

We both got a salad, I got the arugula with chicken and we had fries to split. The food came out promptly and it tasted amazing!! I didn't get a lot of chicken but my friend who added smoked salmon and avocado to her salad got a heaping of both! 

Overall the decorum of the place is so peaceful and beautiful. The seats range from single chairs to sofas for larger groups. There's lots of natural lighting. However, I would hesitate to come back again for the

3.0

In [36]:
# NEGATIVE REVIEW
nr = data['text'][90]
print(nr)
print("Actual Rating: ",data['stars_x'][90])
nr_t = vocab.transform([nr])
print("Predicted Rating:")
mlp.predict(nr_t)[0]

This place is a complete hit and miss depending on when you visit. I just finished throwing out a chicken wrap consisting of dry, inedible chicken scraps. The soup was good, as usual, but the crappy wraps ruined the entire experience.  Weekends are generally a bad time to visit. If you are curious to try this place out, best time to go in terms of food quality is lunchtime during weekdays.
Actual Rating:  1.0
Predicted Rating:


1.0

In [41]:
count = data['stars_x'].value_counts()
print(count)

5.0    20102
4.0    16898
3.0     9097
1.0     5897
2.0     5053
Name: stars_x, dtype: int64


### Conclusion and Observation
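- From the above, we can see that predictions for positive reviews lean towards 5 stars (the 4-star sample was predicted as 5). The model was trained only on 1-, 3- and 5-star reviews, so it can never predict 4 stars. The lean may also be due to the dataset having many more positive reviews than negative ones.
- Balancing the dataset so that each star class has an equal number of reviews might correct this.
- We are able to predict the user's star rating from the review text with about 85% accuracy on the test set (1-, 3- and 5-star reviews only).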
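
One way to balance the classes is random downsampling: keep as many reviews of each star rating as there are in the smallest class. The sketch below illustrates this on made-up `(text, stars)` pairs. The helper `downsample_to_balance` is hypothetical, and reading "normalizing" as downsampling is an assumption; oversampling or class weights are alternatives.

```python
import random
from collections import Counter, defaultdict


def downsample_to_balance(rows, label_of, seed=101):
    """Randomly keep the same number of rows per class (the smallest class size)."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy (text, stars) pairs with a skew similar in spirit to the dataset.
rows = (
    [("good %d" % i, 5.0) for i in range(20)]
    + [("ok %d" % i, 3.0) for i in range(9)]
    + [("bad %d" % i, 1.0) for i in range(6)]
)

balanced = downsample_to_balance(rows, label_of=lambda row: row[1])
class_sizes = Counter(stars for _, stars in balanced)
print(class_sizes)
```

Downsampling discards data from the larger classes, so it trades some training signal for an even class distribution.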

### Next steps
- Review the vectorization process, which currently takes too long.
- Review the whole modelling concept. Star rating prediction is currently based on existing ratings; research is needed on rating based on review sentiment alone.
- Wrap everything up. Create detailed document and presentation brief.
