## Natural language processing (Bag of words)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

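## Importing the dataset
The file is tab-separated, so we pass `delimiter='\t'`. Setting `quoting=3` tells pandas to treat double quotes as ordinary characters; the review text itself contains double quotes, which would otherwise cause parsing errors.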

In [2]:
data = pd.read_csv(r"C:\Users\sanid\OneDrive\Desktop\machine-learning-full\NLP\Restaurant_Reviews.tsv", delimiter='\t',quoting= 3)
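
As a rough illustration of what `quoting=3` does, the sketch below uses Python's standard `csv` module instead of pandas; the value 3 is `csv.QUOTE_NONE`. The sample text is made up for the example and is not taken from the dataset.

```python
import csv
import io

# Made-up tab-separated sample whose review text contains double quotes
sample = 'Review\tLiked\n"Great" food, they said.\t1\nCrust is not good.\t0\n'

# QUOTE_NONE (numeric value 3) keeps quote characters as plain text
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

for row in rows:
    print(row)
```

With this setting the quotes around "Great" stay part of the review text instead of being interpreted as field delimiters.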

In [3]:
data.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [4]:
data.columns

Index(['Review', 'Liked'], dtype='object')

In [5]:
data.nunique(axis=0)

Review    996
Liked       2
dtype: int64

## Cleaning texts
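Stemming is the process of reducing words to a root form by stripping their endings, so that "loved" and "love" are counted as the same token. It is a crude but effective method, and it reduces the number of distinct words the model has to handle. This notebook uses the Porter stemming algorithm.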
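
To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm (which applies many more rules; for example, the corpus output in this notebook shows Porter turning "loved" into "love", while this toy returns "lov"). The function name `toy_stem` and the suffix list are assumptions made for illustration.

```python
# Toy stemmer: strip one common suffix if enough of the word remains.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["recommend", "recommended", "recommending", "recommends"]
stems = {toy_stem(w) for w in words}
print(stems)
```

All four variants collapse to a single stem, which is why stemming shrinks the vocabulary.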

### Sparse matrices
A sparse matrix is a data structure that only stores the non-zero values and their locations (row and column index). It ignores all the zeros.
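
A minimal sketch of the idea in plain Python, storing (row, column, value) triples for the non-zero cells only. Libraries such as SciPy use more elaborate layouts; the helper name `to_dense` and the sample matrix are made up for the example.

```python
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]

# Keep only the non-zero cells together with their positions
triples = [
    (r, c, value)
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix from the stored triples."""
    out = [[0] * n_cols for _ in range(n_rows)]
    for r, c, value in triples:
        out[r][c] = value
    return out

restored = to_dense(triples, 3, 4)
print(triples)
```

Only 3 of the 12 cells are stored here, and the original matrix can still be rebuilt exactly.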

In [6]:
import re
import nltk  # used to download the stop-word list
nltk.download('stopwords')
from nltk.corpus import stopwords  # stop words ('the', 'a', 'and', ...) do not help the prediction
from nltk.stem.porter import PorterStemmer  # maps e.g. 'loved' to 'love', which reduces the final dimension of the sparse matrix
ps = PorterStemmer()
all_stop_words = stopwords.words('english')
all_stop_words.remove('not')  # keep 'not', since it changes the sentiment of a review
stop_word_set = set(all_stop_words)  # build the set once instead of on every iteration
corpus = []
for i in range(len(data)):
    # replace every character that is not a letter, whitespace, '!' or '?' with a space
    review = re.sub(r'[^a-zA-Z\s!?]', ' ', data['Review'][i])
    review = review.lower()
    review = review.split()  # split into words so each one can be stemmed
    review = [ps.stem(word) for word in review if word not in stop_word_set]
    review = ' '.join(review)
    corpus.append(review)






[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sanid\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
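
To see what the regular expression in the cleaning loop does on its own, the sketch below applies it to the first review from `data.head()` and to a made-up string containing an exclamation mark. The helper name `tokens` is an assumption; stop-word removal and stemming are left out.

```python
import re

pattern = r'[^a-zA-Z\s!?]'  # same pattern as in the cleaning loop

def tokens(text):
    # Replace unwanted characters with spaces, lowercase, then split on whitespace
    return re.sub(pattern, ' ', text).lower().split()

first = tokens('Wow... Loved this place.')
second = tokens('Burrittos Blah!')
print(first)
print(second)
```

Periods are dropped, while `!` stays attached to its word, which is why entries such as 'blah!' appear in the corpus.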


In [7]:
print(corpus)

['wow love place', 'crust not good', 'not tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price', 'get angri want damn pho', 'honeslti tast fresh', 'potato like rubber could tell made ahead time kept warmer', 'fri great', 'great touch', 'servic prompt', 'would not go back', 'cashier care ever say still end wayyy overpr', 'tri cape cod ravoli chicken cranberri mmmm!', 'disgust pretti sure human hair', 'shock sign indic cash', 'highli recommend', 'waitress littl slow servic', 'place not worth time let alon vega', 'not like', 'burritto blah!', 'food amaz', 'servic also cute', 'could care less interior beauti', 'perform', 'right red velvet cake ohhh stuff good', 'never brought salad ask', 'hole wall great mexican street taco friendli staff', 'took hour get food tabl restaur food luke warm sever run around like total overwhelm', 'worst salmon sashimi', 'also combo like burger fri beer decent deal', 'like final blow!', 'found place accid could 

## Creating the bag-of-words model
CountVectorizer is used to convert a collection of text documents (like all your reviews) into a numerical matrix of token counts. This matrix is often called the "Bag-of-Words" (BoW) model.

Some words that are not stop words still add little to the prediction because they are rare. The `max_features` parameter of CountVectorizer removes them by keeping only the most frequent tokens in the corpus, here the top 1500.
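
The counting step can be sketched in plain Python. This is a simplified stand-in, not CountVectorizer itself: it splits on whitespace only and breaks frequency ties alphabetically, which is an assumption made to keep the example deterministic. The function name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(corpus, max_features):
    # Count every token across the whole corpus
    totals = Counter(tok for doc in corpus for tok in doc.split())
    # Keep the most frequent tokens; ties are broken alphabetically
    top = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    vocabulary = sorted(top)
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in corpus:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            if tok in index:  # tokens outside the vocabulary are ignored
                row[index[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix

toy_corpus = ["wow love place", "crust not good", "love food love place"]
vocabulary, matrix = bag_of_words(toy_corpus, max_features=3)
print(vocabulary)
print(matrix)
```

Each row is one review and each column one vocabulary word; most cells are zero, which is why the real result is stored as a sparse matrix.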

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
Y = data.iloc[:,-1].values


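`fit_transform` returns a sparse matrix. Calling `.toarray()` converts it to a dense NumPy array, which is needed here because `len()` does not work on a sparse matrix.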

In [9]:
X[0].shape

(1500,)

## Training and splitting data


In [10]:
from sklearn.model_selection import train_test_split
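# Note: train_size=0.2 trains on only 20% of the reviews; test_size=0.2 is the more common choice.
# The results below were produced with this split.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0, train_size=0.2)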
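
With 1,000 reviews, `train_size=0.2` puts 200 rows in the training set and 800 in the test set, which is consistent with the confusion matrices later in the notebook (their entries sum to 800). The sketch below shows the arithmetic with a plain shuffle; it does not reproduce scikit-learn's exact shuffling, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_samples, train_size, seed=0):
    # Shuffle row indices reproducibly, then cut at the requested fraction
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_size)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(1000, train_size=0.2)
print(len(train_idx), len(test_idx))
```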

In [11]:
from sklearn.naive_bayes import MultinomialNB,GaussianNB
model = MultinomialNB()
model.fit(X_train,Y_train)
model_2 = GaussianNB()
model_2.fit(X_train,Y_train)

In [15]:
y_pred = model.predict(X_test)
y_pred_2 = model_2.predict(X_test)

## Confusion matrix and accuracy
GaussianNB cannot handle sparse matrices. It models each feature as a continuous, normally distributed value and requires a dense array, which is why X was converted with `.toarray()`.

MultinomialNB can handle sparse matrices. It is designed for count data such as the output of text vectorization.
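
Why a multinomial model suits word counts can be shown with a small from-scratch sketch. This is a simplified illustration with add-one (Laplace) smoothing, not scikit-learn's implementation; the function names and the toy reviews are made up.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs is a list of token lists; labels holds one class per doc."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    # Log prior: share of training documents in each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # Log likelihood: smoothed share of each token within a class
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    scores = {}
    for c in log_prior:
        # Tokens never seen in training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

docs = [["love", "place"], ["great", "food"], ["not", "good"], ["bad", "food"]]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
prediction = predict(["love", "food"], log_prior, log_likelihood)
print(prediction)
```

The model only ever needs token counts per class, so zero entries contribute nothing and a sparse representation is sufficient.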

In [14]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(Y_test, y_pred)
print(cm)
accuracy_score(Y_test, y_pred)

[[288 110]
 [135 267]]


0.69375
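
The accuracy can be recomputed by hand from the confusion matrix above. In scikit-learn's layout, rows are the true classes and columns the predicted classes, so the diagonal holds the correct predictions.

```python
# Confusion matrix printed above for MultinomialNB
cm = [[288, 110],
      [135, 267]]

(tn, fp), (fn, tp) = cm  # true negatives, false positives, false negatives, true positives

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)
```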

In [16]:
cm_2 = confusion_matrix(Y_test, y_pred_2)
print(cm_2)
accuracy_score(Y_test, y_pred_2)

[[191 207]
 [ 74 328]]


0.64875

## Predicting if a single review is positive or negative
### Positive review
Use our model to predict whether the following review is positive or negative:

"I love this restaurant so much"

Solution: we repeat the same text preprocessing as before, this time on a single review.

In [18]:
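new_review = 'I love this restaurant so much'
# This pattern also strips '!' and '?', unlike the cleaning loop above; the prediction is
# unaffected because CountVectorizer's default tokenizer ignores punctuation.
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
new_review = [ps.stem(word) for word in new_review if word not in set(all_stopwords)]
new_review = ' '.join(new_review)
new_corpus = [new_review]
new_X_test = cv.transform(new_corpus)  # transform only, so the vocabulary learned earlier is reused
new_y_pred = model.predict(new_X_test)  # model is the MultinomialNB classifier
print(new_y_pred)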

[1]
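
Since the same preprocessing is repeated for every new review, it could be wrapped in one function. The sketch below is one possible shape for such a helper; `clean_review` is a hypothetical name, and the stemmer and stop-word set are passed in as arguments. In the notebook these would be `ps.stem` and the NLTK stop-word list; the stand-ins used here are made up so the example runs on its own.

```python
import re

def clean_review(text, stem, stop_words, pattern=r"[^a-zA-Z\s!?]"):
    # Same steps as the cleaning loop: filter characters, lowercase, split,
    # drop stop words, stem, and join back into one string
    text = re.sub(pattern, " ", text).lower()
    return " ".join(stem(word) for word in text.split() if word not in stop_words)

# Stand-ins for illustration only (not the NLTK stemmer or stop-word list)
toy_stop_words = {"i", "this", "so"}

def identity_stem(word):
    return word

cleaned = clean_review("I love this restaurant so much", identity_stem, toy_stop_words)
print(cleaned)
```

Using one function for both the training corpus and new reviews keeps the two preprocessing paths from drifting apart.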


## Other classification models

In [19]:
print(X_train)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


## SVM

In [20]:
from sklearn.svm import SVC
classifier = SVC(kernel='rbf',random_state=0)
classifier.fit(X_train,Y_train)

In [22]:
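y_pred_SVM = classifier.predict(X_test)
# Note: the line below pairs y_pred (the MultinomialNB predictions) with Y_test;
# pass y_pred_SVM instead to inspect the SVM predictions.
print(np.concatenate((y_pred.reshape(len(y_pred),1), Y_test.reshape(len(Y_test),1)),1))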

[[0 0]
 [0 0]
 [0 0]
 ...
 [1 1]
 [0 0]
 [0 0]]


In [23]:
cm_svm = confusion_matrix(Y_test,y_pred_SVM)
print(cm_svm)
accuracy_score(Y_test, y_pred_SVM)

[[331  67]
 [154 248]]


0.72375

## Random forest classification

In [24]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 500, criterion = 'entropy', random_state = 0)
rf.fit(X_train, Y_train)

In [25]:
y_pred_rf = rf.predict(X_test)

In [26]:
cm_rf = confusion_matrix(Y_test,y_pred_rf)
print(cm_rf)
accuracy_score(Y_test, y_pred_rf)

[[348  50]
 [176 226]]


0.7175
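
The four confusion matrices printed in this notebook can be compared side by side. The sketch below derives accuracy, precision and recall for the positive class from those printed values; the helper name `summarize` and the dictionary labels are assumptions.

```python
# Confusion matrices copied from the outputs above
results = {
    "MultinomialNB": [[288, 110], [135, 267]],
    "GaussianNB": [[191, 207], [74, 328]],
    "SVC (rbf)": [[331, 67], [154, 248]],
    "RandomForest": [[348, 50], [176, 226]],
}

def summarize(cm):
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),  # share of predicted positives that are correct
        "recall": tp / (tp + fn),     # share of actual positives that are found
    }

summary = {name: summarize(cm) for name, cm in results.items()}
best = max(summary, key=lambda name: summary[name]["accuracy"])

for name, metrics in summary.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
print("highest accuracy:", best)
```

On these numbers the SVC has the highest accuracy, while GaussianNB finds the largest share of the positive reviews at the cost of many false positives.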