In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
dataset = pd.read_csv("Restaurant_Reviews.tsv",delimiter='\t',quoting=3)

In [6]:
dataset

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
5,Now I am getting angry and I want my damn pho.,0
6,Honeslty it didn't taste THAT fresh.),0
7,The potatoes were like rubber and you could te...,0
8,The fries were great too.,1
9,A great touch.,1


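Using a TSV file (rather than a CSV) because reviews can themselves contain commas but are very unlikely to contain tabs.

`quoting=3` tells the parser to ignore quotes, i.e. to treat quote characters as ordinary text, because quotes inside reviews would otherwise add an extra layer of complexity.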
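As a minimal sketch of why the tab delimiter and `quoting=3` matter, the snippet below parses two rows with the standard-library `csv` module instead of pandas. The constant `csv.QUOTE_NONE` has the value 3, which is what `quoting=3` refers to. The second review is a made-up example, not a row from the dataset.

```python
import csv
import io

# Two tab-separated rows; the second review is hypothetical and contains
# both commas and double quotes.
raw = (
    "Review\tLiked\n"
    "Wow... Loved this place.\t1\n"
    'The "special", honestly, was bland.\t0\n'
)

# delimiter="\t": commas inside a review are just text.
# quoting=csv.QUOTE_NONE (== 3): quote characters are just text as well.
rows = list(csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))

for row in rows:
    print(row)
```

Each row still splits into exactly two fields, and the quotes and commas stay inside the review text.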

Goal: create a model that predicts whether a new review is positive or negative.

### Cleaning the text
Preparing the text for use in the model.

The first review is cleaned below as an example; the same steps are later applied to all the reviews in a loop.

In [12]:
import re
# the regular expression operations module of python

In [14]:
dataset['Review'][0]

'Wow... Loved this place.'

In [15]:
review = re.sub('[^a-zA-Z]',' ',dataset['Review'][0])

`re.sub` replaces every character that is not a letter with a space here; read the docstring of the `re.sub` function for details.

In [16]:
review

'Wow    Loved this place '
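A small sketch of what the character class does, using the first review as input. It also shows a pitfall worth watching for: writing the range as `A-z` instead of `A-Z` silently keeps a few punctuation characters, because that range also covers `[`, `\`, `]`, `^`, `_` and the backtick. The string `"good_food [sic]"` is a made-up input for illustration.

```python
import re

text = "Wow... Loved this place."

# Replace every character that is NOT a letter with a space.
cleaned = re.sub(r"[^a-zA-Z]", " ", text)
print(repr(cleaned))

# Pitfall: the range A-z spans ASCII 65..122, which includes [ \ ] ^ _ `
tricky = "good_food [sic]"  # hypothetical input
with_typo = re.sub(r"[^a-zA-z]", " ", tricky)
correct = re.sub(r"[^a-zA-Z]", " ", tricky)
print(repr(with_typo))
print(repr(correct))
```

The runs of spaces left behind are harmless, since a later `split()` collapses them.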

In [18]:
review = review.lower()

In [19]:
review

'wow    loved this place '

A sparse matrix will be built from the words, so unnecessary words that carry no sentiment ("this" in this case) are removed first.

In [20]:
import nltk

In [21]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Raj
[nltk_data]     Patil\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [26]:
from nltk.corpus import stopwords

Splitting the review into individual words so that it becomes a list we can iterate over.

In [23]:
review = review.split()

In [24]:
review

['wow', 'loved', 'this', 'place']

In [31]:
stop_words = set(stopwords.words('english'))  # build the set once
review = [word for word in review if word not in stop_words]

In [32]:
review

['wow', 'loved', 'place']

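A set of stop words is used rather than a list because a membership test on a set takes constant time on average, while a list has to be scanned element by element. Building the set once, outside the comprehension, avoids recreating it for every word.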
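A rough way to see the difference is to time the same membership test against a list and a set. The stop-word list below is a small made-up stand-in (repeated to make it longer), not the NLTK list, and the timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the NLTK stop words, padded to 200 entries.
stop_list = ["i", "me", "my", "this", "that", "is", "was", "the", "a", "an"] * 20
stop_set = set(stop_list)  # built once, reused for every lookup

words = ["wow", "loved", "this", "place"]
filtered = [w for w in words if w not in stop_set]
print(filtered)

# "place" is absent, so the list version scans all 200 entries every time.
list_time = timeit.timeit(lambda: "place" in stop_list, number=10_000)
set_time = timeit.timeit(lambda: "place" in stop_set, number=10_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```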

### Stemming the review

In essence, "loved" will be recorded as "love": different forms derived from the same root give the model the same idea, so merging them keeps the vocabulary, and therefore the sparse matrix, smaller.

In [36]:
from nltk.stem.porter import PorterStemmer

In [37]:
ps = PorterStemmer()

Stemming is done in the same step as stop-word removal: each word is checked against the stop words and, if kept, stemmed.

Hence resetting the review and running the regex cleaning, stop-word filtering and stemming in the desired order.

In [47]:
review = re.sub('[^a-zA-Z]',' ',dataset['Review'][0])  # A-Z, not A-z, so that [ \ ] ^ _ ` are removed too
review = review.lower()
review = review.split()
stop_words = set(stopwords.words('english'))  # build the set once
review = [ps.stem(word) for word in review if word not in stop_words]

In [48]:
review

['wow', 'love', 'place']

Now we're good.

Now re-joining the list into a single string.

In [49]:
review = " ".join(review)

In [50]:
review

'wow love place'

Now doing this for all the reviews by iterating through the dataset.

In [54]:
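corpus = []  # stores the cleaned reviews
stop_words = set(stopwords.words('english'))  # build the set once, not per word
for i in range(len(dataset)):  # 1000 reviews in this dataset
    # keep letters only, then lowercase and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # drop stop words and stem the words that remain
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)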

    

In [58]:
pd.DataFrame(corpus)

Unnamed: 0,0
0,wow love place
1,crust good
2,tasti textur nasti
3,stop late may bank holiday rick steve recommen...
4,select menu great price
5,get angri want damn pho
6,honeslti tast fresh
7,potato like rubber could tell made ahead time ...
8,fri great
9,great touch


### The bag of words model 

<i>A matrix that consists mostly of zeros is a sparse matrix.</i>

The sparse matrix is created by tokenisation: each distinct word in the corpus becomes a column, and each review becomes a row of word counts.

Rare words in the corpus (e.g. names) can be removed to reduce the sparsity of the matrix.

IN ESSENCE<br>
This is a classification problem: the cleaned and stemmed words are our independent variables (one column per word), and we are predicting whether the output is a 1 or a 0.
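To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words matrix built from four of the cleaned reviews. The helper `bag_of_words` is a hypothetical name for illustration and is not how `CountVectorizer` is implemented; for instance, `CountVectorizer` by default ignores single-character tokens, which this sketch does not.

```python
from collections import Counter

def bag_of_words(corpus):
    """Return (vocabulary, matrix): one column per token, one row per document."""
    vocabulary = sorted({token for doc in corpus for token in doc.split()})
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for doc in corpus:
        row = [0] * len(vocabulary)
        for token, count in Counter(doc.split()).items():
            row[index[token]] = count
        matrix.append(row)
    return vocabulary, matrix

toy_corpus = ["wow love place", "crust good", "fri great", "great touch"]
vocabulary, matrix = bag_of_words(toy_corpus)

print(vocabulary)
for row in matrix:
    print(row)
```

Even in this tiny example most entries are zero, which is why the full matrix is called sparse.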


### creating the model now

In [62]:
from sklearn.feature_extraction.text import CountVectorizer

In [63]:
cv = CountVectorizer()

Note that we are not passing any parameters for now, but have a look at them: most of the manual text cleaning can be done by setting some of these parameters (e.g. `lowercase`, `stop_words`, `token_pattern`).
<br><br>
But manual cleaning is preferable, as it gives us more control over what we want to clean or include.


### creating the sparse matrix

In [66]:
X = cv.fit_transform(corpus).toarray()

In [68]:
X.shape

(1000, 1565)

So the corpus comes down to a total of 1565 distinct words (tokens).

If there are too many words, we can adjust some of the parameters of the CountVectorizer, such as `max_features`, which keeps only the most frequent tokens. See for yourself.

In [71]:
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus)

In [79]:
X.shape

(1000, 1500)
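As a rough illustration of what capping the vocabulary means, the sketch below keeps only the most frequent tokens of a toy corpus using `collections.Counter`. The function `top_tokens` is a hypothetical helper, and the way it breaks ties between equally frequent tokens is not guaranteed to match `CountVectorizer`.

```python
from collections import Counter

def top_tokens(corpus, max_features):
    """Keep the max_features tokens with the highest total count in the corpus."""
    totals = Counter(token for doc in corpus for token in doc.split())
    return sorted(token for token, _ in totals.most_common(max_features))

toy_corpus = ["great fri great", "great touch", "fri good"]
kept = top_tokens(toy_corpus, max_features=2)
print(kept)
```

Tokens that appear only once, like "touch" and "good" here, are the ones that get dropped, which is how rare words such as names leave the matrix.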

In [80]:
X = X.toarray()

this is a classification problem now

In [82]:
Y = dataset.iloc[:,1].values

In [83]:
from sklearn.model_selection import train_test_split

In [84]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.20,random_state=0)


### Some models commonly used for NLP classification are:
<ul>
    <li>Naïve Bayes</li>
    <li>Decision Trees</li>
    <li>Random Forests</li>
    </ul>

I'm using Naïve Bayes here.

In [85]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train,Y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
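For intuition, here is a simplified pure-Python sketch of the Gaussian Naïve Bayes idea: estimate a prior, and a per-feature mean and variance, for each class, then pick the class with the highest log-posterior. The functions `fit_gaussian_nb` and `predict_one` are hypothetical names, the data is a toy example, and the constant `eps` is only a crude stand-in for scikit-learn's `var_smoothing`, not the same computation.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Return {label: (log_prior, means, variances)} estimated from the data."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps variances strictly positive (simplified smoothing)
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_one(model, row):
    """Pick the class with the largest log prior plus Gaussian log-likelihood."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Toy count features: column 0 ~ a "positive" word, column 1 ~ a "negative" word.
X_toy = [[2, 0], [1, 0], [0, 2], [0, 1]]
y_toy = [1, 1, 0, 0]
model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(model, [1, 0]), predict_one(model, [0, 2]))
```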

In [87]:
Y_pred = classifier.predict(X_test)

In [88]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test,Y_pred)

In [91]:
cm

array([[55, 42],
       [12, 91]], dtype=int64)

In [94]:
accuracy = (55+91)/200
accuracy*100

73.0
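The same confusion matrix also gives precision and recall. The sketch below reads the four cells using scikit-learn's layout (rows are true labels, columns are predicted labels, with label 0 first) and recomputes the accuracy alongside them.

```python
# Confusion matrix from above: rows = true label, columns = predicted label.
cm = [[55, 42],
      [12, 91]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # of reviews predicted positive, how many were
recall = tp / (tp + fn)        # of truly positive reviews, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The 42 false positives stand out: the model labels many negative reviews as positive, which accuracy alone does not show.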

Not bad for such a small dataset (800 training reviews and 200 test reviews).
This can be improved later using other models and parameter tuning.