#### We have 1,000 real, raw, labeled restaurant reviews from a TSV file

Main task: create and train the most suitable and efficient model to detect whether a review is positive or negative

Main components: Bag of Words algorithm, Scikit-learn models, NLP libraries, model evaluation

Author: Voitishyn Mykyta

In [56]:
# import necessary libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [57]:
# quoting=3 is csv.QUOTE_NONE: quote characters inside the reviews are kept as plain text
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)

In [58]:
dataset.head(10)

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
5,Now I am getting angry and I want my damn pho.,0
6,Honeslty it didn't taste THAT fresh.),0
7,The potatoes were like rubber and you could te...,0
8,The fries were great too.,1
9,A great touch.,1


In [59]:
dataset.shape

(1000, 2)

In [60]:
# clean the text - an essential step for NLP: the reviews should be cleaned as much as possible
# import the necessary tools:

import re
import nltk

# nltk.download('stopwords') # stop words are non-relevant words to exclude; already downloaded!

from nltk.corpus import stopwords

# stemming keeps only the root of each word, which simplifies a review (loved -> love)

from nltk.stem.porter import PorterStemmer

corpus = []

ps = PorterStemmer()

# keep 'not' in the reviews: removing it flips the meaning (e.g. 'not good')
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
# first improvement after evaluating: keep 'any' and 'no' as well
all_stopwords.remove('any')
all_stopwords.remove('no')
all_stopwords = set(all_stopwords) # build the set once for fast lookups

for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) # replace every non-letter character with a space

    review = review.lower() # convert to lowercase
    review = review.split() # split the review into separate words

    # drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)

    corpus.append(review)

In [61]:
# wow, looks better
corpus[0:10]

['wow love place',
 'crust not good',
 'not tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch']
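
To make the cleaning steps concrete, here is a minimal standard-library sketch of the same idea applied to single reviews. It is an illustration only: `TINY_STOPWORDS` and `clean_review` are hypothetical names, the stop word set is a hand-picked subset rather than NLTK's English list, and stemming is left out, so the results differ from the stemmed corpus (for example `tasty` instead of `tasti`).

```python
import re

# Hypothetical, hand-picked stop words for illustration only;
# the notebook uses NLTK's English list minus 'not', 'any' and 'no'.
TINY_STOPWORDS = {"is", "the", "and", "was", "this", "just"}


def clean_review(text, stop_words=TINY_STOPWORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(clean_review("Crust is not good."))
print(clean_review("Not tasty and the texture was just nasty."))
```

Because `not` is absent from the stop word set, the negation survives cleaning, which is the reason the notebook removes it from NLTK's list.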

In [62]:
# just checking what the stop word list looks like
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [63]:
# creating a Bag of Words model

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,-1].values

In [64]:
len(X[0]) # result of tokenization: 1568 distinct words; we can also keep only the 1500 most frequent words

1568

In [65]:
cv = CountVectorizer(max_features = 1500)

X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,-1].values

In [66]:
len(X[0]) # the vocabulary is now limited to the 1500 most frequent words

1500
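
For intuition about what the vectorizer produces, the sketch below builds a Bag of Words matrix by hand with the standard library. It is a simplified stand-in, not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, it tokenizes on whitespace only (by default `CountVectorizer` also drops single-character tokens), and ties between equally frequent words are broken alphabetically, which is a choice made here for determinism.

```python
from collections import Counter


def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Rank tokens by total frequency, breaking ties alphabetically.
    ranked = sorted(totals, key=lambda tok: (-totals[tok], tok))
    vocab = sorted(ranked[:max_features])  # max_features=None keeps every token
    index = {tok: j for j, tok in enumerate(vocab)}

    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in index:
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix


docs = ["wow love place", "crust not good", "fri great", "great touch"]
vocab, X_small = bag_of_words(docs)
print(vocab)
print(X_small[0])

# Keeping only the single most frequent word mirrors the role of max_features.
top_vocab, X_top = bag_of_words(docs, max_features=1)
print(top_vocab, X_top)
```

Each row has one column per vocabulary word, so limiting the vocabulary shrinks every row to the same smaller width.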

In [67]:
# split data into train and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [68]:
# create and train a Naive Bayes model

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

In [69]:
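# make predictions on the test set
# print the first 15 rows side by side: left column = predicted label, right column = actual label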
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)[0:15])

[[1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 0]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [1 1]]


In [70]:
# create a confusion matrix and print the accuracy of the model

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# 55 correct predictions of negative reviews
# 91 correct predictions of positive reviews

# the accuracy in our case is 73%
# not the best option, so let's try different classification models
# it is also worth checking our list of stop words more carefully and probably removing some words from it that matter in our case

[[55 42]
 [12 91]]


0.73
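
As a reading aid for the matrix layout, the sketch below counts a 2x2 confusion matrix by hand, following scikit-learn's convention of true labels on rows and predicted labels on columns, i.e. `[[TN, FP], [FN, TP]]`. The helper `confusion_counts` is hypothetical, and the input is only the 15 (predicted, actual) pairs printed earlier, not the full test set.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels (rows: true, columns: predicted)."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


# The first 15 (predicted, actual) pairs printed by the prediction cell.
pairs = [(1, 0), (1, 0), (1, 0), (0, 0), (0, 0), (1, 0), (1, 1), (1, 0),
         (1, 0), (1, 1), (1, 1), (1, 1), (1, 0), (1, 1), (1, 1)]
y_pred_small = [p for p, _ in pairs]
y_true_small = [t for _, t in pairs]

cm_small = confusion_counts(y_true_small, y_pred_small)
print(cm_small)

# Accuracy is the diagonal (correct predictions) divided by the total.
accuracy_small = (cm_small[0][0] + cm_small[1][1]) / len(pairs)
print(accuracy_small)
```

On this small slice the model already shows the pattern visible in the full matrix: most errors are negative reviews predicted as positive.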

In [71]:
# all_stopwords # candidates to keep in the reviews: any, no, very, too, isn't, aren't

In [72]:
# try another classification model - logistic regression classification

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [73]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# all right, about 5.5 percentage points more accuracy - sounds nice

[[81 16]
 [27 76]]


0.785

In [74]:
# try another classification model - K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

In [75]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# not so nice: only 64.5% accuracy, not far from a coin flip

[[71 26]
 [45 58]]


0.645
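
To put a number on the coin-flip comparison, the sketch below derives a majority-class baseline from the K-Nearest Neighbors confusion matrix. Each row of the matrix sums to the number of test reviews in that true class, so no extra data is needed. The variable names are illustrative.

```python
cm_knn = [[71, 26], [45, 58]]  # rows: true class, columns: predicted class

# Each row sums to the number of test reviews that truly belong to that class.
class_totals = [sum(row) for row in cm_knn]
n_test = sum(class_totals)

# A model that always predicts the larger class gets this accuracy for free.
baseline_accuracy = max(class_totals) / n_test
knn_accuracy = (cm_knn[0][0] + cm_knn[1][1]) / n_test

print(class_totals, baseline_accuracy, knn_accuracy)
```

The test set holds 97 negative and 103 positive reviews, so always answering "positive" would already score 51.5%, and K-Nearest Neighbors improves on that by 13 percentage points.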

In [76]:
# try another classification model - Support Vector Machine Classification
from sklearn.svm import SVC

classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

SVC(kernel='linear', random_state=0)

In [77]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# very nice: 81% accuracy, the best model so far, but we still have a couple of
# classification models to test

[[81 16]
 [22 81]]


0.81

In [78]:
# try another classification model - Kernel Support Vector Machine 
from sklearn.svm import SVC

classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

SVC(random_state=0)

In [79]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# 78.5% accuracy

[[90  7]
 [36 67]]


0.785

In [80]:
# try another classification model - Decision Tree Classification model

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

In [81]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# 75.5% accuracy; an ensemble that averages many decision trees - Random Forest classification - might
# perform better, let's see

[[78 19]
 [30 73]]


0.755

In [82]:
# try another classification model - Random Forest Classification model

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators = 400, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=400, random_state=0)

In [83]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# 77% accuracy; increasing the number of estimators further does not seem to change the accuracy

[[88  9]
 [37 66]]


0.77
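
The seven runs can be compared in one place by recomputing the accuracy from each confusion matrix reported in the cells above. This is a sketch for summarizing the results, not part of the original workflow. The dictionary keys are labels chosen here, and the numbers are copied from the printed outputs.

```python
# Confusion matrices copied from the outputs above (rows: true, columns: predicted).
results = {
    "Gaussian Naive Bayes": [[55, 42], [12, 91]],
    "Logistic Regression": [[81, 16], [27, 76]],
    "K-Nearest Neighbors": [[71, 26], [45, 58]],
    "Linear SVM": [[81, 16], [22, 81]],
    "RBF SVM": [[90, 7], [36, 67]],
    "Decision Tree": [[78, 19], [30, 73]],
    "Random Forest": [[88, 9], [37, 66]],
}


def accuracy(cm):
    """Share of correct predictions: diagonal sum over the total count."""
    return (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)


# Sort model names from most to least accurate.
ranking = sorted(results, key=lambda name: accuracy(results[name]), reverse=True)
for name in ranking:
    print(f"{name:22s} {accuracy(results[name]):.3f}")
```

Logistic Regression and the RBF SVM tie at 78.5% accuracy but with different error profiles: the RBF SVM makes only 7 false positives and 36 false negatives, while Logistic Regression makes 16 and 27.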

### Final results

#### So far, the linear Support Vector Machine classifier performs best, with 81% accuracy.

81 correct predictions of negative reviews (TN)
81 correct predictions of positive reviews (TP)
16 negative reviews incorrectly predicted as positive (FP)
22 positive reviews incorrectly predicted as negative (FN)

Accuracy = (81 + 81) / (81 + 81 + 16 + 22) = 162 / 200 = 0.81
Precision = 81 / (81 + 16) = 0.835
Recall = 81 / (81 + 22) = 0.786

F1 Score = 2 * 0.835 * 0.786 / (0.835 + 0.786) = 1.313 / 1.621 = 0.81
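
The arithmetic can be checked with a few lines of Python. The sketch assumes scikit-learn's layout for `confusion_matrix`, where rows are true labels and columns are predicted labels, so the linear SVM output `[[81 16], [22 81]]` reads as TN = 81, FP = 16, FN = 22, TP = 81, with "positive" meaning a liked review (label 1).

```python
# Linear SVM confusion matrix, read as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 81, 16, 22, 81

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the reviews predicted positive, how many really are
recall = tp / (tp + fn)     # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Precision comes out higher than recall here because the model misses more positive reviews (22) than it wrongly flags negative ones as positive (16).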

Further changes to the all_stopwords list did not change the performance much.
Having more reviews would probably help to train the model more successfully.