<a href="https://colab.research.google.com/github/vaishnavipatil29/LetsUpgrade-AI-ML-/blob/master/Restaurant_Sentiment_Analysis_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The aim of this project is to analyse the reviews given by customers and classify each one as positive or negative.

In [2]:
#Dataset
import pandas as pd
url='https://raw.githubusercontent.com/vaishnavipatil29/LetsUpgrade-AI-ML-/master/Restaurant_Reviews.tsv'
df = pd.read_csv(url, delimiter="\t", quoting=3)
df.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


Data Preprocessing

In [3]:
#Check if the dataset is balanced
df.groupby('Liked').size()

Liked
0    500
1    500
dtype: int64

The dataset is well-balanced.

Work through every preprocessing step on the first review, then apply the same steps to the whole dataset in a loop.

In [5]:
#remove the extra characters from the reviews: #,$,% etc.
import re
print(df['Review'][0])
review = re.sub('[^a-zA-Z]',' ',df['Review'][0]) #Replace every character that is not a letter with a space
print(review)
#Convert all uppercase letters to lower case
review=review.lower()
print(review)
#Remove all the stopwords such as the, an, then, if etc. 
review=review.split()
import nltk
nltk.download('stopwords')  #download stopwords library
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  #build the stopword set once
review1 = [word for word in review if word not in stop_words] #keep only the words of review that are not stopwords
print(review1)
#Stemming -> get the root of the word e.g: loved-->love, playing-->play
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
review1 = [ps.stem(word) for word in review1]
print(review1)
#Finally join all the words
review=' '.join(review1)
print("Final Review: ",review)

Wow... Loved this place.
Wow    Loved this place 
wow    loved this place 
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['wow', 'loved', 'place']
['wow', 'love', 'place']
Final Review:  wow love place
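
A note on the character class in the cleaning step: `A-Z` and `A-z` are not equivalent. In ASCII, the range from `A` to `z` also covers the punctuation characters that sit between the upper-case and lower-case letters (square brackets, backslash, caret, underscore, backtick), so a pattern built on `A-z` leaves them in the text. A small check on a made-up string (not taken from the dataset) illustrates the difference:

```python
import re

sample = "good_food ^_^ [5/5]"  # made-up string, not taken from the dataset

# A-z spans ASCII 65..122, which includes [ \ ] ^ _ ` as well as the letters
loose = re.sub('[^a-zA-z]', ' ', sample)
# A-Z spans only the 26 upper-case letters
strict = re.sub('[^a-zA-Z]', ' ', sample)

print(repr(loose))     # underscore, carets and brackets survive
print(strict.split())  # ['good', 'food']
```

With `A-Z`, only letters remain, which is what the later stopword and stemming steps expect.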


Problem: The data above is still unstructured text, but machine learning algorithms generally need structured, numeric input (arrays, matrices, tables etc.) to work with. So, the next step is to convert the text into that form.

In [6]:
#CountVectorizer is used to form organised data
from sklearn.feature_extraction.text import CountVectorizer
corpus=[]
cv = CountVectorizer(max_features=3)  #Keep at most the 3 most frequent words in the corpus
print(review)
corpus.append(review) #Add the cleaned review to the corpus list
print(corpus)
X = (cv.fit_transform(corpus)).toarray()  #text converted to a matrix of word counts (0 if the word is not present)
print(X)

wow love place
['wow love place']
[[1 1 1]]


Now a standard ML algorithm can be applied to X.
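
To make the text-to-matrix step concrete, the sketch below builds a tiny bag-of-words matrix by hand in plain Python. It is an illustration of the idea rather than scikit-learn's actual implementation; the two-review corpus and the helper name `bag_of_words` are made up for the example. Each column corresponds to one vocabulary word and each cell holds how many times that word occurs in the review, so a word that appears twice gives 2 rather than 1.

```python
from collections import Counter

def bag_of_words(corpus):
    """Return (vocabulary, matrix) for a list of space-separated reviews.

    Hypothetical helper for illustration; columns are sorted alphabetically.
    """
    vocabulary = sorted({word for review in corpus for word in review.split()})
    matrix = []
    for review in corpus:
        counts = Counter(review.split())  # word -> occurrences in this review
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up corpus of already-cleaned reviews
vocabulary, X_demo = bag_of_words(['wow love place', 'love love food'])
print(vocabulary)  # ['food', 'love', 'place', 'wow']
print(X_demo)      # [[0, 1, 1, 1], [1, 2, 0, 0]]
```

`CountVectorizer` does more than this sketch: it also tokenises and lower-cases the text and, when `max_features` is set, keeps only the most frequent terms.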

**Apply On Whole Dataset**

In [10]:
#-------------------------Importing libraries-------------------------#
import re     #regular expressions for substitution
import nltk   #Natural Language Toolkit
# nltk.download('stopwords')  #download the stopword lists (only needed once)
from nltk.corpus import stopwords           #stopword lists
from nltk.stem.porter import PorterStemmer  #PorterStemmer for stemming

ps = PorterStemmer()  #create the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))  #build the stopword set once
corpus = [] #list that will hold all the cleaned reviews
for i in range(len(df)):
    review = re.sub('[^a-zA-Z]',' ',df['Review'][i])  #substitute all characters other than letters with a space
    review = review.lower() #convert to lowercase
    review = review.split() #split into words
    review = [ps.stem(word) for word in review if word not in stop_words]  #drop the stopwords and stem the rest
    review = ' '.join(review) #join the words back together
    corpus.append(review) #append to corpus
  
# Create a Bag of Words Model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500) #keep at most the 1500 most frequent words as features
X = cv.fit_transform(corpus).toarray()  #dataset feature matrix

In [11]:
#Dataset
print(X)
y = df.iloc[:,1].values #target labels y (the Liked column)
print(y)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[1 0 0 1 1 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1
 0 1 0 1 0 0 1 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0 0
 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0
 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0
 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 0 0 0 0 1
 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0
 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1
 0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1
 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1
 0 1 1 0 0 1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1
 0 0 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 1
 1 1 0 1 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 

The dataset is ready!!

Assuming the word features are approximately independent of one another given the class, I have chosen a Naive Bayes classifier.
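
For reference, the decision rule behind this choice can be written out as follows. This is the standard textbook form of Naive Bayes, with $x_1, \dots, x_n$ the word-count features of a review and $y \in \{0, 1\}$ the `Liked` label; the symbols are notation introduced here, not taken from the notebook.

```latex
\hat{y} = \arg\max_{y \in \{0,1\}} \; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right)
```

The product over features is where the independence assumption enters. The Gaussian form of $P(x_i \mid y)$ is the likelihood used by `GaussianNB`, with a mean $\mu_{iy}$ and variance $\sigma_{iy}^{2}$ estimated for each feature and class from the training data.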

In [13]:
#Train-test split: hold out 20% of the reviews for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [14]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

Metrics and prediction

In [15]:
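from sklearn.metrics import confusion_matrix,accuracy_score
y_pred = classifier.predict(X_test)  #predict labels for the held-out reviews
print(confusion_matrix(y_test,y_pred))  #rows: actual class, columns: predicted class
print(accuracy_score(y_test,y_pred))  #fraction of correct predictions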

[[55 42]
 [12 91]]
0.73
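
The accuracy can be reproduced by hand from the confusion matrix, which also gives per-class figures that the single accuracy number hides. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, in label order 0 then 1); the variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows = actual, columns = predicted
tn, fp = 55, 42   # actual negative: predicted negative / predicted positive
fn, tp = 12, 91   # actual positive: predicted negative / predicted positive

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all reviews classified correctly
precision = tp / (tp + fp)        # of reviews predicted positive, share truly positive
recall = tp / (tp + fn)           # of truly positive reviews, share found
specificity = tn / (tn + fp)      # of truly negative reviews, share found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
# accuracy=0.73 precision=0.68 recall=0.88 specificity=0.57
```

The classifier finds most of the positive reviews but labels 42 of the 97 negative test reviews as positive, so most of its errors are false positives.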


Testing

In [17]:
Review = "bad service"
input1 = [Review]

input_data = cv.transform(input1).toarray()

input_pred = classifier.predict(input_data)

if input_pred[0]==1:
    print("Review is Positive")
else:
    print("Review is Negative")

Review is Negative


In [20]:
Review = "nice food"
input1 = [Review]

input_data = cv.transform(input1).toarray()

input_pred = classifier.predict(input_data)

if input_pred[0]==1:
    print("Review is Positive")
else:
    print("Review is Negative")

Review is Positive
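
One thing to watch in these tests: the vectorizer was fitted on cleaned, stemmed text, while the strings above are passed in raw. Any input word that does not exactly match a vocabulary entry is ignored by `cv.transform`. A minimal sketch of a helper that applies the same cleaning to new input is shown below; the function name `preprocess` and its parameters are hypothetical, and the tiny stop-word set and identity stemmer are stand-ins so the block runs without NLTK.

```python
import re

def preprocess(text, stop_words, stem):
    """Clean one review the same way the training loop does.

    stop_words: set of words to drop (hypothetical parameter)
    stem: callable mapping a word to its stem, e.g. PorterStemmer().stem
    """
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stem(word) for word in words if word not in stop_words)

# Stand-ins so the sketch runs on its own (assumptions, not NLTK's real list or stemmer)
demo_stop_words = {'the', 'was', 'this'}

def identity_stem(word):
    return word

cleaned = preprocess("The service was BAD!!", demo_stop_words, identity_stem)
print(cleaned)  # service bad
```

In the notebook this could be used as `cv.transform([preprocess(Review, set(stopwords.words('english')), ps.stem)])`, reusing the `cv` and `ps` objects defined earlier, so that new reviews are vectorized in the same form as the training corpus.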
