# Restaurant Reviews - NLP

In [1]:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# For text cleaning
import re # Regular expression library 
import nltk # Natural Language Toolkit

In [2]:
# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset.shape

(1000, 2)

In [None]:
dataset.head()

|   | Review                                             | Liked |
|---|----------------------------------------------------|-------|
| 0 | Wow... Loved this place.                           | 1     |
| 1 | Crust is not good.                                 | 0     |
| 2 | Not tasty and the texture was just nasty.          | 0     |
| 3 | Stopped by during the late May bank holiday of...  | 1     |
| 4 | The selection on the menu was great and so wer...  | 1     |


## Cleaning the text

Because we will be building a sparse bag-of-words matrix, we need to remove any words that do not
contribute to whether a review is positive or negative.


We will do this by:
1. Removing stop words: commonly used words (such as 'i', 'me', 'my')
2. Stemming: reducing inflected words to their root form (such as 'loved' to 'love')
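
The first step can be sketched with the standard library alone. This is only an illustration: `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the handful of stop words stands in for the much longer NLTK list used in the next cell.

```python
import re

# Hypothetical handful of stop words, for illustration only.
TOY_STOP_WORDS = {'i', 'me', 'my', 'this', 'the', 'is'}

def remove_stop_words(review, stop_words=TOY_STOP_WORDS):
    # Replace anything that is not a letter with a space, then lower-case.
    letters_only = re.sub('[^a-zA-Z]', ' ', review).lower()
    # Split into words and keep only those outside the stop-word set.
    return [word for word in letters_only.split() if word not in stop_words]

print(remove_stop_words('Wow... Loved this place.'))  # ['wow', 'loved', 'place']
```

Stemming would then be applied to each remaining word, for instance reducing 'loved' to 'love' as described in the list.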

In [None]:
# download the stopwords list from nltk
nltk.download('stopwords') 
# import the downloaded stopwords
from nltk.corpus import stopwords
# PorterStemmer is used for stemming
from nltk.stem.porter import PorterStemmer 

ps = PorterStemmer()
# Stopwords contains lists for different languages, so we must specify English.
# Build the set once: membership tests on a set are faster than on a list.
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(0, len(dataset)):
    # replace everything except a-zA-Z with a ' '
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) 
    review = review.lower()
    review = review.split() # split review into list of words
    # drop stop words (not relevant to the review) and stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review) # go back to a string for each review
    corpus.append(review)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gega_PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# let's take a look at what this has done to our first 5 reviews:
corpus[0:5]

We can see that we still have names in our reviews, like 'rick steve'. These will make our sparse matrix needlessly large later on.

We can cap the maximum number of features in our matrix; since names aren't as common as other words, they will be dropped.
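
To see why a feature cap drops rare words, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation; `bag_of_words` and `toy_corpus` are hypothetical names, and the three short documents are made up for illustration.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    """Build count vectors over the max_features most frequent tokens."""
    counts = Counter(token for doc in docs for token in doc.split())
    # Keep only the most frequent tokens; rare ones (such as names) drop out.
    vocab = sorted(token for token, _ in counts.most_common(max_features))
    index = {token: i for i, token in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.split():
            if token in index:
                row[index[token]] += 1
        rows.append(row)
    return vocab, rows

toy_corpus = ['love place', 'love food', 'rick steve love food place']
vocab, rows = bag_of_words(toy_corpus, max_features=3)
print(vocab)  # ['food', 'love', 'place']
print(rows)   # [[0, 1, 1], [1, 1, 0], [1, 1, 1]]
```

'rick' and 'steve' each appear only once, so they fall outside the three most frequent tokens and never become columns.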


In [None]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500) # create object of the CountVectorizer class
X = cv.fit_transform(corpus).toarray() # tokenisation, need toarray() to create the matrix
y = dataset.iloc[:, 1].values # Get our dependent variable from 'dataset' 

NOTE: because all our values are small word counts (mostly 0 or 1), there is no need for feature scaling here.

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

## Model Selection

Now that our data is ready, which model do we use?

I will try all the classification models I know and look at the accuracy of each:

In [None]:
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [None]:
# prepare models
models = []
models.append(('LR', LogisticRegression(random_state = 0))) 
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier(criterion = 'entropy')))
models.append(('RNDFRST', RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(kernel = 'rbf')))

In [None]:
# Making the Confusion Matrix for each and evaluating accuracy
from sklearn.metrics import confusion_matrix

# evaluate each model in turn
for name, classifier in models:
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # accuracy = correct predictions (the diagonal) / size of the test set
    acc = (cm[0][0] + cm[1][1]) / cm.sum()
    msg = f'{name} : {acc}'
    print(msg)

Gaussian Naive Bayes offers the highest accuracy of the models tried.
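
The accuracy figure used in the loop is the diagonal of the confusion matrix divided by the total number of predictions. A small sketch of that arithmetic, using a made-up matrix (the numbers in `toy_cm` are hypothetical, not results from this notebook):

```python
def accuracy_from_confusion_matrix(cm):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical 2x2 matrix for a 200-review test set:
# rows are actual classes, columns are predicted classes.
toy_cm = [[60, 40],
          [10, 90]]
print(accuracy_from_confusion_matrix(toy_cm))  # 0.75
```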

In [None]:
# Fitting Naive Bayes to the Training set
classifier = GaussianNB()
classifier.fit(X_train, y_train)

In [None]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# pd.DataFrame(y_pred).head()

In [None]:
# Making the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
cm

### Improvements
One thing I noticed earlier was that the stop words include 'not'. I feel that within reviews this word
carries high importance, and therefore should carry high weight, when looking at bad reviews.

The second review turned from 'Crust is not good.' to 'crust good'. 

With 'not' removed, there is no difference between a review like 'Food is not good' and one like 'Food is good'. This would 'blur the lines', as it were, in the training stage, and I conjecture that keeping this word in our corpus could increase the accuracy of our models.
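
The collision can be reproduced with a small standalone sketch. `STOP_WORDS` here is a hypothetical mini list rather than the NLTK one, and stemming is left out since it does not affect the point.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list,
# which also includes 'not'.
STOP_WORDS = {'is', 'the', 'a', 'not'}

def clean(review, stop_words):
    """Lower-case, keep letters only, and drop the given stop words."""
    words = re.sub('[^a-zA-Z]', ' ', review).lower().split()
    return ' '.join(w for w in words if w not in stop_words)

negative, positive = 'Food is not good', 'Food is good'

# With 'not' treated as a stop word the two reviews become identical.
print(clean(negative, STOP_WORDS) == clean(positive, STOP_WORDS))  # True

# Removing 'not' from the stop-word set keeps them distinct.
keep_not = STOP_WORDS - {'not'}
print(clean(negative, keep_not))  # food not good
print(clean(positive, keep_not))  # food good
```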

In [None]:
ps = PorterStemmer()
# English stop words, minus 'not', which we want to keep
stop_words_keep_not = set(stopwords.words('english')) - {'not'}

corpus_amended = []
for i in range(0, len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) 
    review = review.lower()
    review = review.split() 
    review = [ps.stem(word) for word in review if word not in stop_words_keep_not]
    review = ' '.join(review)
    corpus_amended.append(review)

In [None]:
# we can see the review now has the word 'not'
corpus_amended[0:5]

In [None]:
# Recreate the Bag of Words model with our amended corpus
cv = CountVectorizer(max_features = 1500) 
X = cv.fit_transform(corpus_amended).toarray() 
y = dataset.iloc[:, 1].values 

In [None]:
# Splitting the amended corpus dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
# re-run all the models on the amended data
for name, classifier in models:
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # accuracy = correct predictions (the diagonal) / size of the test set
    acc = (cm[0][0] + cm[1][1]) / cm.sum()
    msg = f'{name} : {acc}'
    print(msg)

Let's compare these to our previous results:

| Classifier       | Before | After | Difference |
|------------------|--------|-------|------------|
| LR               | 0.71   | 0.775 | 0.065      |
| KNN              | 0.61   | 0.66  | 0.05       |
| CART             | 0.725  | 0.745 | 0.02       |
| RNDFRST          | 0.72   | 0.725 | 0.005      |
| NB               | 0.73   | 0.73  | 0          |
| SVM              | 0.485  | 0.485 | 0          |

We can see that across the board we have done better or at least stayed the same.

Not only that, but the accuracy of Logistic Regression has increased so that it is now more accurate than Gaussian Naive Bayes.
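
As a quick check of the arithmetic, the Difference column and the best model before and after can be recomputed from the table. The variable names below are just for illustration.

```python
# Accuracies copied from the table above: (before, after).
results = {
    'LR': (0.71, 0.775),
    'KNN': (0.61, 0.66),
    'CART': (0.725, 0.745),
    'RNDFRST': (0.72, 0.725),
    'NB': (0.73, 0.73),
    'SVM': (0.485, 0.485),
}

# Round to 3 decimals to hide floating-point noise.
differences = {name: round(after - before, 3)
               for name, (before, after) in results.items()}
best_before = max(results, key=lambda name: results[name][0])
best_after = max(results, key=lambda name: results[name][1])

print(differences)
print(best_before, best_after)  # NB LR
```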



### Next steps:

1. I would like to understand better why some of the classifiers changed accuracy, and why others didn't, after I made my changes. 

2. A nice feature would be to add a bit of code so that a user can write their own review and the classifier comes back with a prediction of either a positive or a negative review.
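
A minimal sketch of the second step, assuming a toy training set and a simplified `clean` helper (both hypothetical; stop-word removal and stemming are left out to keep it short). The main point is that a new review should go through the already-fitted vectoriser with `transform`, not `fit_transform`, so that it is mapped onto the same columns the classifier was trained on.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

def clean(text):
    """Simplified cleaning: letters only, lower-cased (no stop words or stemming)."""
    return ' '.join(re.sub('[^a-zA-Z]', ' ', text).lower().split())

# Toy training data (hypothetical), standing in for the 1000 reviews.
train_reviews = ['Loved this place', 'Great food and great service',
                 'Not tasty at all', 'Crust is not good']
train_labels = [1, 1, 0, 0]

cv = CountVectorizer()
X_train = cv.fit_transform([clean(r) for r in train_reviews]).toarray()
classifier = GaussianNB().fit(X_train, train_labels)

def predict_review(text):
    """Return 'positive' or 'negative' for a single user-written review."""
    # transform (not fit_transform) reuses the vocabulary learned in training
    features = cv.transform([clean(text)]).toarray()
    return 'positive' if classifier.predict(features)[0] == 1 else 'negative'

print(predict_review('The food was great'))
```

In the notebook itself, the same function could reuse the fitted `cv` and `classifier` along with the full cleaning loop, and the review text could come from `input()`.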
