[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_11-NLP/blob/master/M11_A_DJ_NLP_Assignment.ipynb)

### Assignment: Natural Language Processing

In this assignment, you will work with a data set that contains restaurant reviews. You will use a Naive Bayes model to classify each review as positive or negative based on the words it contains. The main objective is to gauge the performance of the Naive Bayes model using a confusion matrix. To assess the model's efficacy, you will first train it on a portion (i.e. 70%) of the underlying data set and then test it against the remainder. Before you can train the model, you will have to go through a sequence of steps to get the data ready for training.
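As a reminder of what the classifier computes, the decision rule can be written as follows. The notation is introduced here for illustration and is not part of the assignment text: $c$ is the label (1 for liked, 0 for not liked), $w_1, \dots, w_n$ are the tokens of one review, and the "naive" assumption is that tokens are conditionally independent given the label.

$$
\hat{c} = \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

In practice the probabilities are estimated from token counts in the training set, usually with smoothing so that a token unseen in one class does not force the whole product to zero.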

Steps you may need to perform:

**1)** Read in the list of restaurant reviews

**2)** Convert the reviews into a list of tokens

**3)** You will most likely have to eliminate stop words

**4)** You may have to utilize stemming or lemmatization to determine the base form of the words

**5)** You will have to vectorize the data (i.e. construct a document-term matrix), wherein selected words from the reviews constitute the columns of the matrix and the individual reviews constitute the rows

**6)** Create 'Train' and 'Test' data sets (i.e. 70% of the underlying data set will constitute the training set and 30% will constitute the test set)

**7)** Train a Naive Bayes model on the Train data set and test it against the Test data set

**8)** Construct a confusion matrix to gauge the performance of the model

**Dataset**: https://www.dropbox.com/s/yl5r7kx9nq15gmi/Restaurant_Reviews.tsv?raw=1




**1)** Read in the list of restaurant reviews

In [1]:
#%%time
#!wget -c https://www.dropbox.com/s/yl5r7kx9nq15gmi/Restaurant_Reviews.tsv?raw=1 && mv Restaurant_Reviews.tsv?raw=1 Restaurant_Reviews.tsv
!ls -lh *tsv

-rw-r--r--@ 1 darwinm  staff    60K Jun 13 17:43 Restaurant_Reviews.tsv


In [2]:
%%time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import re
import string
import nltk
nltk.download('all')


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nl

In [3]:
df = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')
df.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [4]:
df.tail()

Unnamed: 0,Review,Liked
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0
999,"Then, as if I hadn't wasted enough of my life ...",0
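Before tokenizing, it can help to confirm that the file parsed into the expected columns and to see how the two labels are distributed. The sketch below is illustrative only: it reads three rows copied from the `head()` output above out of an in-memory string rather than the real file, it assumes the header names match the DataFrame columns shown above, and `sample_tsv`, `sample_df` and `label_counts` are names made up for this example.

```python
import io
import pandas as pd

# Three rows taken from the head() output above; the real file has 1000 reviews.
sample_tsv = (
    "Review\tLiked\n"
    "Wow... Loved this place.\t1\n"
    "Crust is not good.\t0\n"
    "Not tasty and the texture was just nasty.\t0\n"
)
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)           # (rows, columns)
print(sample_df.isna().sum())    # missing values per column

# How many reviews carry each label (1 = liked, 0 = not liked)
label_counts = sample_df["Liked"].value_counts().to_dict()
print(label_counts)
```

On the full data set the same three calls can be run on `df` directly.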


**2)** Convert the reviews into a list of tokens

In [5]:
review = df['Review'] # keep only the review text; the 'Liked' label is left out here
print(review)
len(review)

0                               Wow... Loved this place.
1                                     Crust is not good.
2              Not tasty and the texture was just nasty.
3      Stopped by during the late May bank holiday of...
4      The selection on the menu was great and so wer...
5         Now I am getting angry and I want my damn pho.
6                  Honeslty it didn't taste THAT fresh.)
7      The potatoes were like rubber and you could te...
8                              The fries were great too.
9                                         A great touch.
10                              Service was very prompt.
11                                    Would not go back.
12     The cashier had no care what so ever on what I...
13     I tried the Cape Cod ravoli, chicken, with cra...
14     I was disgusted because I was pretty sure that...
15     I was shocked because no signs indicate cash o...
16                                   Highly recommended.
17                Waitress was 

1000

**3)** You will most likely have to eliminate stop words

**4)** You may have to utilize stemming or lemmatization to determine the base form of the words

In [6]:
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()


# Tokenize by splitting on runs of non-word characters (this also discards punctuation)
# Remove stop words (the comparison is case-sensitive)
# Stem the remaining tokens
def process_text(txt):
    # The raw string avoids an invalid-escape warning for \W.
    # Note: a review that ends in punctuation yields a trailing empty token,
    # and capitalized stop words such as "The" do not match the lowercase list.
    tokens = re.split(r'\W+', txt)
    txt = [ps.stem(word) for word in tokens if word not in stopwords]
    return txt

df['clean_review'] = df['Review'].apply(process_text)

df.head()

Unnamed: 0,Review,Liked,clean_review
0,Wow... Loved this place.,1,"[wow, love, place, ]"
1,Crust is not good.,0,"[crust, good, ]"
2,Not tasty and the texture was just nasty.,0,"[not, tasti, textur, nasti, ]"
3,Stopped by during the late May bank holiday of...,1,"[stop, late, may, bank, holiday, rick, steve, ..."
4,The selection on the menu was great and so wer...,1,"[the, select, menu, great, price, ]"
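The output above still contains empty tokens and the stop word "the", because the text is never lowercased before the stop-word check. A minimal sketch of one way to tighten the cleaning step follows. It is an illustration, not the notebook's method: `process_text_fixed` and `STOP_WORDS_SUBSET` are hypothetical names, and the tiny hard-coded stop-word set stands in for `nltk.corpus.stopwords.words('english')` so that the example runs without downloading NLTK data.

```python
import re
from nltk.stem import PorterStemmer

# Assumption: a small illustrative subset; in the notebook, pass the full
# NLTK English stop-word list (ideally converted to a set) instead.
STOP_WORDS_SUBSET = {"this", "is", "not", "and", "the", "was", "just"}

stemmer = PorterStemmer()

def process_text_fixed(text, stop_words=STOP_WORDS_SUBSET):
    # Lowercase first so "The" and "the" are treated the same,
    # then keep only runs of word characters (no empty tokens).
    tokens = re.findall(r"\w+", text.lower())
    # Drop stop words, then reduce each remaining token to its stem.
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]

first = process_text_fixed("Wow... Loved this place.")
third = process_text_fixed("Not tasty and the texture was just nasty.")
print(first)
print(third)
```

Note that a standard stop-word list removes "not", so "Crust is not good." reduces to the same tokens as "Crust is good."; whether to keep negation words is a modelling choice worth considering for sentiment data.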


In [13]:
import gensim

# Use gensim to build a dictionary, which maps every unique token to an integer id
dictionary = gensim.corpora.Dictionary(df['clean_review'])
# Examine the length of the dictionary
num_of_words = len(dictionary)
print("# of words in dictionary: {}".format(num_of_words))
#for index,word in dictionary.items():
#    print(index,word)
print(dictionary)

# of words in dictionary: 1668
Dictionary(1668 unique tokens: ['', 'love', 'place', 'wow', 'crust']...)


In [15]:
#print(dictionary.token2id)

**5)** You will have to vectorize the data (i.e. construct a document-term matrix), wherein selected words from the reviews constitute the columns of the matrix and the individual reviews constitute the rows
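To make the shape of a document-term matrix concrete, here is a small sketch on two made-up, already-tokenized reviews. It assumes scikit-learn's `CountVectorizer` with a pass-through `analyzer`, which is one way to feed token lists instead of raw strings; `toy_docs`, `toy_vectorizer`, `toy_matrix` and `feature_names` are names invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative reviews, already tokenized and stemmed.
toy_docs = [
    ["food", "delici"],
    ["food", "bad", "food"],
]

# A callable analyzer receives each document as-is, so the token lists
# are counted directly instead of being re-tokenized from a string.
toy_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Column order: vocabulary terms sorted by their column index.
feature_names = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(feature_names)
print(toy_matrix.toarray())
```

Each row is one review and each column is one term; the entry is the number of times that term occurs in that review.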

In [11]:
from pprint import pprint

In [None]:
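# doc2bow expects one tokenized document at a time, so convert each review separately
corpus = [dictionary.doc2bow(tokens) for tokens in df['clean_review']]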

In [12]:
%%time

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

def cv(data):
    # The reviews are already lists of tokens, so pass them through unchanged
    # instead of letting CountVectorizer tokenize raw strings.
    count_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)

    emb = count_vectorizer.fit_transform(data)

    return emb, count_vectorizer

list_corpus = df["clean_review"].tolist()
list_labels = df["Liked"].tolist()

# 70/30 split with a fixed seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels, test_size=0.3, random_state=42)

# Learn the vocabulary from the training reviews only, then reuse it for the test reviews
X_train_counts, count_vectorizer = cv(X_train)
X_test_counts = count_vectorizer.transform(X_test)
pprint(X_train)

[['We',
  'wait',
  'thirti',
  'minut',
  'seat',
  'although',
  '8',
  'vacant',
  'tabl',
  'folk',
  'wait',
  ''],
 ['I', 'take', 'busi', 'dinner', 'dollar', 'elsewher', ''],
 ['seafood',
  'limit',
  'boil',
  'shrimp',
  'crab',
  'leg',
  'crab',
  'leg',
  'definit',
  'tast',
  'fresh',
  ''],
 ['furthermor', 'even', 'find', 'hour', 'oper', 'websit', ''],
 ['My', 'girlfriend', 'veal', 'bad', ''],
 ['mayb',
  'vegetarian',
  'fare',
  'I',
  'twice',
  'I',
  'thought',
  'averag',
  'best',
  ''],
 ['I', 'love', 'place', ''],
 ['I', 'swung', 'give', 'tri', 'deepli', 'disappoint', ''],
 ['the', 'place', 'fairli', 'clean', 'food', 'simpli', 'worth', ''],
 ['food', 'delici', ''],
 ['It', 'pale', 'color', 'instead', 'nice', 'char', 'NO', 'flavor', ''],
 ['everyth', 'good', 'tasti', ''],
 ['over', 'rate', ''],
 ['would', 'recommend', 'other', ''],
 ['the',
  'food',
  'delici',
  'spici',
  'enough',
  'sure',
  'ask',
  'spicier',
  'prefer',
  'way',
  ''],
 ['I', 'insult', '']

**6)** Create 'Train' and 'Test' data sets (i.e. 70% of the underlying data set will constitute the training set and 30% will constitute the test set)
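A 70/30 split can be sketched with `train_test_split` as below. The ten one-token "reviews" and alternating labels are placeholders invented for the example, and the `stratify` argument is an optional addition (not used elsewhere in this notebook) that keeps the share of each label similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten tokenized "reviews" with alternating labels.
toy_reviews = [[f"tok{i}"] for i in range(10)]
toy_labels = [0, 1] * 5

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_reviews,
    toy_labels,
    test_size=0.3,       # 30% of the rows go to the test set
    random_state=42,     # fixed seed for a reproducible split
    stratify=toy_labels, # optional: preserve the label balance
)

print(len(toy_X_train), len(toy_X_test))
```

Applied to the 1000 reviews in this data set, the same call yields 700 training reviews and 300 test reviews.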

**7)** Train a Naive Bayes model on the Train data set and test it against the test data set
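One way to carry out this step is sketched below on a handful of made-up token lists. The assignment does not say which Naive Bayes variant to use; `MultinomialNB` is assumed here because it is a common choice for word-count features. All variable names and the toy reviews are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training reviews (already tokenized and stemmed) with labels:
# 1 = liked, 0 = not liked.
nb_train_docs = [
    ["love", "place"], ["food", "delici"], ["great", "servic"],
    ["bad", "food"], ["terribl", "servic"], ["never", "return"],
]
nb_train_labels = [1, 1, 1, 0, 0, 0]
nb_test_docs = [["love", "food"], ["bad", "servic"]]

# Build the vocabulary from the training reviews only.
nb_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
nb_train_counts = nb_vectorizer.fit_transform(nb_train_docs)

# Fit the classifier on the training counts.
nb_model = MultinomialNB()
nb_model.fit(nb_train_counts, nb_train_labels)

# Reuse the same vocabulary for the test reviews, then predict.
nb_test_counts = nb_vectorizer.transform(nb_test_docs)
nb_predictions = nb_model.predict(nb_test_counts)
print(nb_predictions)
```

In the notebook, the same `fit` and `predict` calls would take the count matrices built from `X_train` and `X_test`, with `y_train` as the training labels.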


**8)** Construct a confusion matrix to gauge the performance of the model
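A confusion matrix compares true labels with predicted labels. The sketch below uses eight made-up label pairs rather than real model output, so the numbers say nothing about this data set; `toy_true` and `toy_pred` are hypothetical names. In the notebook they would be `y_test` and the model's predictions on the test set.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for illustration: 1 = liked, 0 = not liked.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true labels, columns are predicted labels, both ordered [0, 1].
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)

# Accuracy is the share of predictions on the diagonal.
accuracy = accuracy_score(toy_true, toy_pred)
print("accuracy:", accuracy)
```

The off-diagonal cells show which kind of mistake the model makes more often, which a single accuracy figure hides.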