### Assignment: Natural Language Processing

In this assignment, you will work with a data set of restaurant reviews. You will use a Naive Bayes model to classify each review as positive or negative based on the words it contains. The main objective is to gauge the performance of the Naive Bayes model using a confusion matrix. To assess the model's efficacy, you will first train it on a portion (70%) of the data set and then test it against the remainder. Before you can train the model, you will need to work through a sequence of steps to prepare the data.

Steps you may need to perform:

**1)** Read in the list of restaurant reviews

**2)** Convert the reviews into a list of tokens

**3)** You will most likely have to eliminate stop words

**4)** You may have to use stemming or lemmatization to determine the base form of the words

**5)** You will have to vectorize the data (i.e. construct a document-term matrix), wherein selected words from the reviews constitute the columns of the matrix and the individual reviews constitute the rows

**6)** Create 'Train' and 'Test' data sets (i.e. 70% of the underlying data set will constitute the training set and the remaining 30% will constitute the test set)

**7)** Train a Naive Bayes model on the Train data set and test it against the Test data set

**8)** Construct a confusion matrix to gauge the performance of the model

**Dataset**: https://www.dropbox.com/s/yl5r7kx9nq15gmi/Restaurant_Reviews.tsv?raw=1
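
To make step 5 concrete, here is a minimal sketch of a document-term matrix built by hand. The three token lists and the names `docs`, `vocab`, and `matrix` are illustrative assumptions, not part of the data set; in the notebook itself, scikit-learn's `CountVectorizer` does this work.

```python
from collections import Counter

# Toy, already-tokenized reviews (hypothetical examples, not from the data set)
docs = [
    ["love", "place"],
    ["crust", "good"],
    ["love", "crust", "love"],
]

# Vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({token for doc in docs for token in doc})

# One row per review, one column per vocabulary word, cell = word count
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so a real review collection produces a wide and mostly zero matrix.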




In [0]:
import pandas as pd

In [10]:

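# Read the tab-separated reviews. on_bad_lines='skip' replaces the older
# error_bad_lines=False argument, which was removed in pandas 2.0.
data = pd.read_csv('https://www.dropbox.com/s/yl5r7kx9nq15gmi/Restaurant_Reviews.tsv?raw=1', sep='\t', on_bad_lines='skip')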
data.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [13]:
print(data.shape)           # (rows, columns)
print(data.isnull().any())  # True for any column that has missing values

(1000, 2)
Review    False
Liked     False
dtype: bool


In [0]:
# Import the NLTK package
from nltk import word_tokenize
import nltk
from nltk.corpus import stopwords


In [23]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [26]:
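stop_words = set(stopwords.words('english'))
ps = nltk.PorterStemmer()

# Preprocess the data set: tokenize, drop stop words and
# non-alphabetic tokens, then stem what remains
text = data['Review']

def preprocess_text(text):
    words = word_tokenize(text)
    sentence = []
    for w in words:
        # The stop-word check is case-sensitive, so capitalized words such as
        # 'The' and 'Not' are kept; the stemmer lowercases them afterwards
        if w not in stop_words and w.isalpha():
            sentence.append(ps.stem(w))
    return sentence

clean_data = data.copy()
clean_data['Review'] = text.apply(preprocess_text)
clean_data.head()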

Unnamed: 0,Review,Liked
0,"[wow, love, place]",1
1,"[crust, good]",0
2,"[not, tasti, textur, nasti]",0
3,"[stop, late, may, bank, holiday, rick, steve, ...",1
4,"[the, select, menu, great, price]",1
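
Rows 2 and 4 above still contain `not` and `the` because the stop-word test is case-sensitive and the NLTK list is lowercase. A minimal sketch of the usual remedy, lowercasing before the lookup, follows. It uses a tiny hand-written stop-word set and a regex tokenizer as stand-ins (both are assumptions made so the example runs without NLTK downloads), and it leaves out stemming.

```python
import re

# Stand-in for stopwords.words('english'); a tiny hypothetical subset
STOP_WORDS = {"the", "and", "was", "not", "is", "on", "so"}

def tokenize_and_filter(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize_and_filter("Not tasty and the texture was just nasty."))
print(tokenize_and_filter("The selection on the menu was great"))
```

Applying this change in the notebook would alter the vocabulary size and the scores reported in the later cells, so it is shown here only as an option. Dropping `not` can also hide negation, which matters for sentiment, so it may be worth keeping that word out of the stop-word set.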


In [28]:
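from sklearn.feature_extraction.text import CountVectorizer

# Passing a callable as analyzer makes CountVectorizer use preprocess_text
# for tokenizing, stop-word removal, and stemming
count_vector = CountVectorizer(analyzer=preprocess_text)
count = count_vector.fit_transform(text)
print(count.shape)  # (number of reviews, vocabulary size)
pd.DataFrame(count.toarray())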

(1000, 1614)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1604,1605,1606,1607,1608,1609,1610,1611,1612,1613
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
y = clean_data['Liked'].values
print(y.shape)

(1000,)


In [0]:
from sklearn.model_selection import train_test_split
import numpy as np

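# Seed NumPy's global RNG so the 70/30 split is reproducible; train_test_split
# falls back to that RNG when random_state is not given. The sparse matrix is
# converted to a dense array because GaussianNB does not accept sparse input.
np.random.seed(41)
X_train, X_test, y_train, y_test = train_test_split(count.toarray(), y, test_size=0.3)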

In [0]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

In [79]:
model = GaussianNB()
model.fit(X_train,y_train)

GaussianNB(priors=None)

In [80]:
yhat = model.predict(X_test)
accuracy = np.mean(y_test==yhat)
print('Model Accuracy: {:.1f}%'.format(accuracy*100))

Model Accuracy: 72.0%


In [81]:
cnf_matrix = confusion_matrix(y_test, yhat)
print('Confusion Matrix:\n' + str(cnf_matrix))

Confusion Matrix:
[[ 89  66]
 [ 18 127]]
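
The confusion matrix can be read further than the single accuracy figure. The sketch below recomputes accuracy and derives precision, recall, and specificity from the four counts printed above, assuming scikit-learn's default layout (rows are actual labels, columns are predicted labels, in the order 0 then 1). The variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (rows = actual, columns = predicted, label order [0, 1])
tn, fp = 89, 66    # actual negative: predicted negative, predicted positive
fn, tp = 18, 127   # actual positive: predicted negative, predicted positive

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of reviews predicted positive, share truly positive
recall = tp / (tp + fn)       # of truly positive reviews, share predicted positive
specificity = tn / (tn + fp)  # of truly negative reviews, share predicted negative

print(f"accuracy:    {accuracy:.3f}")
print(f"precision:   {precision:.3f}")
print(f"recall:      {recall:.3f}")
print(f"specificity: {specificity:.3f}")
```

With these counts, accuracy is 0.720, precision 0.658, recall 0.876, and specificity 0.574. The model finds most positive reviews but labels 66 of the 155 negative test reviews as positive, which the overall accuracy alone does not reveal.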
