
## **Problem Statement**
- Use case: Restaurant Reviews
- Goal: Create a model that predicts whether a given review is Positive or Negative.
- SL = 0.35, therefore CL = 1 - SL = 0.65 (used below as the minimum acceptable test accuracy)

## **Acquire Data**

In [2]:
!wget https://github.com/snikhil17/NLP_course_Simplilearn/blob/main/Class_Codes_and_Assignments/data/Restaurant_Reviews.tsv

--2021-11-22 08:33:32--  https://github.com/snikhil17/NLP_course_Simplilearn/blob/main/Class_Codes_and_Assignments/data/Restaurant_Reviews.tsv
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘Restaurant_Reviews.tsv’

Restaurant_Reviews.     [ <=>                ] 487.37K  --.-KB/s    in 0.04s   

2021-11-22 08:33:32 (10.8 MB/s) - ‘Restaurant_Reviews.tsv’ saved [499070]
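
The log above reports `Length: unspecified [text/html]`, which suggests that the `/blob/` URL returned GitHub's HTML page rather than the TSV itself. A minimal sketch of deriving the raw-file URL, assuming GitHub's usual `raw.githubusercontent.com/<owner>/<repo>/<branch>/<path>` layout; the helper name `github_blob_to_raw` is hypothetical:

```python
def github_blob_to_raw(url: str) -> str:
    """Turn a github.com '/blob/' page URL into the matching raw-file URL."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"not a GitHub blob URL: {url}")
    # 'owner/repo' sits before '/blob/', 'branch/path/to/file' after it
    owner_repo, branch_and_path = url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch_and_path}"


blob_url = ("https://github.com/snikhil17/NLP_course_Simplilearn/blob/main/"
            "Class_Codes_and_Assignments/data/Restaurant_Reviews.tsv")
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
```

`pd.read_csv(raw_url, sep='\t')` should then be able to read the file directly, without a separate download step.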



## **Loading Libraries**

In [None]:
import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import spacy
nlp = spacy.load('en_core_web_sm')
from nltk.stem.porter import PorterStemmer
from nltk import ngrams

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
import warnings
warnings.filterwarnings('ignore')
from sklearn import metrics
import statistics

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## **Loading Data**

In [None]:
data = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [None]:
data.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


## **Balanced or Unbalanced Data?**

In [None]:
data.Liked.value_counts()

1    500
0    500
Name: Liked, dtype: int64

#### **Observations:**

---


- Since the dataset is balanced (500 positive and 500 negative reviews), we will use ``accuracy`` as the metric for checking generalization.
- For quality we will use the ``F1-score``.
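
As a reminder of what the two metrics measure, here is a small sketch that computes both from raw confusion-matrix counts. The counts are made up for illustration and are not results from this notebook.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical balanced test set: 50 positive and 50 negative reviews
tp, fn = 40, 10   # positives predicted correctly / missed
tn, fp = 30, 20   # negatives predicted correctly / flagged as positive

acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")  # accuracy=0.70, F1=0.73
```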

## **For review analysis we need to keep the word ``not``**
- Remove ``not`` from the stopword list, so that negations such as "not good" survive preprocessing.

In [None]:
stop_words = stopwords.words('english')
stop_words.remove('not')
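
To see why this matters, compare the tokens that survive with and without `not` in the stopword list. The sketch uses a tiny hand-written stopword list (an assumption, standing in for NLTK's full English list) so that it runs without any downloads; `remove_stopwords` is a hypothetical helper.

```python
def remove_stopwords(sentence: str, stop_words: set) -> list:
    """Lowercase, split on whitespace and drop every stopword."""
    return [word for word in sentence.lower().split() if word not in stop_words]


# Hypothetical miniature stopword list that still contains "not"
small_stop_words = {"is", "the", "was", "and", "not"}
review = "Crust is not good"

negation_lost = remove_stopwords(review, small_stop_words)
negation_kept = remove_stopwords(review, small_stop_words - {"not"})

print(negation_lost)  # ['crust', 'good']
print(negation_kept)  # ['crust', 'not', 'good']
```

With `not` filtered out, a negative review is reduced to the same tokens as a positive one.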

## **Expanding English language contractions**
- you've -> you have
- he's -> he is

In [None]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    # Caution: this rule also rewrites possessives ("chef's" -> "chef is")
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

data['Review'] = data['Review'].apply(decontracted)
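
A quick way to check what these rules do is to run them on a few sample phrases. The sketch below restates the same substitutions as a table applied in the same order; the name `expand_contractions` and the sample phrases are made up for illustration.

```python
import re

# Same substitution rules as `decontracted`, in the same order
RULES = [
    (r"won\'t", "will not"),
    (r"can\'t", "can not"),
    (r"n\'t", " not"),
    (r"\'re", " are"),
    (r"\'s", " is"),
    (r"\'d", " would"),
    (r"\'ll", " will"),
    (r"\'t", " not"),
    (r"\'ve", " have"),
    (r"\'m", " am"),
]


def expand_contractions(phrase: str) -> str:
    """Apply every rule in turn; earlier (specific) rules win over later ones."""
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


samples = ["you've been here", "he's late", "I won't return",
           "The chef's special wasn't ready"]
expanded = [expand_contractions(text) for text in samples]
for before, after in zip(samples, expanded):
    print(f"{before!r} -> {after!r}")
```

The last sample comes out as "The chef is special was not ready": the `'s` rule cannot tell a possessive from a contraction.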

In [None]:
# Inspect the characters contained in string.punctuation
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### **Separate the data into features and label**

In [None]:
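# .values converts to NumPy arrays; the list-style column selectors ([0] and [1])
# keep both arrays 2-D, i.e. one row per review and a single column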

features = data.iloc[:,[0]].values
label = data.iloc[:,[1]].values

In [None]:
#  Creating an instance of PorterStemmer class, for stemming.
stemObject = PorterStemmer()

## **Performing Text Preprocessing**
- **We will create a text preprocessing function that performs the following steps:**
  1. Remove punctuation
  2. Extract words from the sentences
  3. Normalize the data (lowercase)
  4. Remove stopwords
  5. Apply stemming

In [None]:
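def textPreprocessing(document):
  # 1. Remove punctuation. The filter works character by character, so it expects
  #    `document` to be a plain string; if it receives an array of strings instead,
  #    each whole string is kept as-is, punctuation included.
  sentWithoutPunct = ''.join([char for char in document if char not in string.punctuation])
  # 2. Extract words out of the sentence
  words = sentWithoutPunct.split()
  # 3. Normalize the data (lowercase)
  wordNormalized = [word.lower() for word in words]
  # 4. Remove stopwords and re-join the remaining words into one string
  vocabulary = " ".join([word for word in wordNormalized if word not in stop_words])
  # 5. Apply stemming. `vocabulary` is a string here, so the loop visits single
  #    characters, which the stemmer returns as-is; words are not actually stemmed.
  stem_words = ''.join([stemObject.stem(word) for word in vocabulary])

  # 6. Extract words
  vocab= stem_words.split()

  return vocab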
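
Two details of the function as written are worth noting: the punctuation filter only works when `document` is a plain string, and the stemming step iterates over the characters of a joined string, so words come out unstemmed. This matches the vocabulary printed further down, which contains entries such as `good.` and `loved`. A possible word-level variant is sketched below under stated assumptions: the name `preprocess_text` and the short stopword list are hypothetical stand-ins, and NLTK must be installed (the Porter stemmer needs no extra download).

```python
import string
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical short list; the notebook uses NLTK's English list minus "not"
demo_stop_words = frozenset({"is", "this", "the", "a", "and", "was"})


def preprocess_text(document: str, stop_words=demo_stop_words) -> list:
    """Return the stemmed, lowercased, stopword-free tokens of one review."""
    # 1. Remove punctuation character by character (document must be a str)
    no_punct = "".join(ch for ch in document if ch not in string.punctuation)
    # 2-3. Split into words and lowercase them
    words = [word.lower() for word in no_punct.split()]
    # 4. Drop stopwords
    kept = [word for word in words if word not in stop_words]
    # 5. Stem word by word, not character by character
    return [stemmer.stem(word) for word in kept]


tokens = preprocess_text("Crust is not good.")
print(tokens)
```

Used as a `CountVectorizer` analyzer, a function like this expects each document to be a string, so the reviews would be passed as a 1-D sequence (for example `data['Review']`) rather than as a 2-D array.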

## **Creating BOW**

In [None]:
# Create BOW in SKlearn
wordVector = CountVectorizer(analyzer = textPreprocessing)

#Build the Vocabulary
finalWordVectorVocab = wordVector.fit(features)

# To create BOW
bagOfWords = finalWordVectorVocab.transform(features)


## **Vocabulary created**

In [None]:
# printing only first 10
{key: finalWordVectorVocab.vocabulary_[key] for key in finalWordVectorVocab.vocabulary_.keys() & \
 list(finalWordVectorVocab.vocabulary_.keys())[:10]}

{'crust': 564,
 'good.': 1009,
 'loved': 1341,
 'nasty.': 1496,
 'not': 1532,
 'place.': 1713,
 'stopped': 2173,
 'tasty': 2254,
 'texture': 2277,
 'wow...': 2555}

## **Apply TF-IDF to the BOW to create a feature set**

In [None]:
# Apply TFIDF Algo on BOW to create a feature set
tfidfObject = TfidfTransformer().fit(bagOfWords)  #Calc IDF Values

# Lets create Numeric Feature set
processedFeatures = tfidfObject.transform(bagOfWords)
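
With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfTransformer` weights each count by `idf = ln((1 + n) / (1 + df)) + 1` and then scales every row to unit length. The following sketch reproduces that arithmetic in plain Python on a made-up two-document count matrix, purely to illustrate the weighting:

```python
import math

# Hypothetical bag-of-words counts: rows are documents, columns are terms
counts = [
    [1, 1, 0],  # doc 0 contains term 0 and term 1
    [1, 0, 1],  # doc 1 contains term 0 and term 2
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])  # L2-normalise the row

for row in tfidf:
    print([round(v, 3) for v in row])
```

Term 0 occurs in both documents, so its IDF is exactly 1; the rarer terms 1 and 2 receive a larger weight within their rows.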

## **Model-Building and checking Generalization**

In [None]:
# CL_Decided = 1- SL = 1- 0.35 
CL_Decided = 1- 0.35 

# Create Train Test Split (90% training -10% testing)
X_train, X_test, y_train, y_test = train_test_split(processedFeatures, label,  test_size= 0.1, random_state = 27)

# Building Model
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=27)
model.fit(X_train, y_train)

# Checking Accuracy Using Training and Testing set.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Printing CL_decided, Training and testing scores to check Generalization of the model.
print(f"Decided CL: \n{CL_Decided}\n")
print(f"Score of Training set: \n{train_score}\n")
print(f"Score of Testing set: \n{test_score}\n")


Decided CL: 
0.65

Score of Training set: 
0.6988888888888889

Score of Testing set: 
0.76



#### **Observation:**

---
- The model generalizes: the test score (0.76) exceeds both the training score (about 0.70) and the CL (0.65). **Since the data is balanced, we can check the quality of the model using ``accuracy``.**


## **Comparing accuracy of Training and Testing set**

In [None]:
# CL_Decided = 1- SL = 1- 0.35 
CL_Decided = 1- 0.35 

# Create Train Test Split (90% training -10% testing)
X_train, X_test, y_train, y_test = train_test_split(processedFeatures, label,  test_size= 0.1, random_state = 27)

# Building Model
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=27)
model.fit(X_train, y_train)

# Checking Accuracy Using Training and Testing set.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# Printing CL_decided, Training and testing scores to check Generalization of the model.
print(f"Decided CL: \n{CL_Decided}\n")
print(f"Accuracy Score of Training set: \n{metrics.accuracy_score(y_train, train_pred)}\n")
print(f"Accuracy Score of Testing set: \n{metrics.accuracy_score(y_test, test_pred)}\n")


Decided CL: 
0.65

Accuracy Score of Training set: 
0.69

Accuracy Score of Testing set: 
0.76



### **Observation:**


---
- Since the data is balanced, we use ``accuracy`` as the quality check for the model.
- ``accuracy`` on the training data (0.69) < ``accuracy`` on the testing data (0.76).
- ``accuracy`` on the test set (0.76) >= the decided CL (0.65).
- Therefore the model can be considered for deployment.
- Let's also run the model on the whole dataset and check ``accuracy``.
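
The acceptance rule used in these observations can be written as a small check. This is only a sketch of the rule as stated in this notebook; the function name `is_acceptable` and the second set of scores are hypothetical.

```python
def is_acceptable(train_score: float, test_score: float, cl: float) -> bool:
    """Notebook's rule: test accuracy above training accuracy and at least CL."""
    return test_score > train_score and test_score >= cl


CL_DECIDED = 1 - 0.35

# Scores reported above
print(is_acceptable(train_score=0.69, test_score=0.76, cl=CL_DECIDED))  # True

# Hypothetical overfitted model: high training score, lower test score
print(is_acceptable(train_score=0.95, test_score=0.70, cl=CL_DECIDED))  # False
```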


## **Checking the performance on the whole dataset**

In [None]:
pred = model.predict(processedFeatures)
print(metrics.classification_report(pred, label))

              precision    recall  f1-score   support

           0       0.52      0.81      0.63       321
           1       0.88      0.65      0.74       679

    accuracy                           0.70      1000
   macro avg       0.70      0.73      0.69      1000
weighted avg       0.76      0.70      0.71      1000



### **Observation:**


---
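- Accuracy on the whole dataset (0.70) > CL (0.65). Note that 900 of these 1000 reviews were used for training, so this is not an independent check.
- The report was called as ``(pred, label)``, so its support column (321 / 679) counts predicted labels rather than true labels (500 / 500), and the precision and recall columns are swapped. Accuracy and the per-class F1-scores are unaffected.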
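
scikit-learn's metric functions take the true labels first (`y_true`, `y_pred`), whereas `classification_report(pred, label)` passes the predictions in the `y_true` position. A toy example (labels made up for illustration) shows how the argument order swaps precision and recall:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: the model over-predicts the positive class
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1]

# Documented order: true labels first
precision_ok = precision_score(y_true, y_pred)   # 3 of 5 predicted positives are right
recall_ok = recall_score(y_true, y_pred)         # all 3 true positives are found

# Swapped order: predictions are treated as the ground truth
precision_swapped = precision_score(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)

print(precision_ok, recall_ok)            # 0.6 1.0
print(precision_swapped, recall_swapped)  # 1.0 0.6
```

Passing `label` first, as in `metrics.classification_report(label, pred)`, would report support per true class.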

## **Code for Deployment:**
- Take the review from the user as input.
- Perform all preprocessing steps on the given review.
- Pass the preprocessed review to the model and get the prediction.

In [None]:
print(data['Review'].iloc[34])
print(data['Liked'].iloc[34])

Overall, I like this place a lot.
1


In [None]:
# Take the review from the user as input.
query_review = input("Enter a review for the Restaurant: ")

# Perform all preprocessing steps on the given review.
decon_query = decontracted(query_review)
preprocessed_query = textPreprocessing(decon_query)  
# transform() treats every element of the token list as a separate document,
# so the model returns one prediction per token rather than one per review.
bagOfWords = finalWordVectorVocab.transform(preprocessed_query)
processedquery = tfidfObject.transform(bagOfWords)
print(f"\nPredictions: {model.predict(processedquery)}\n")
print(f"Selecting first value as final prediction : {model.predict(processedquery)[0]}\n")
print(f"Selecting mode as final prediction : {statistics.mode(model.predict(processedquery))}\n")

Enter a review for the Restaurant: Overall, I like this place a lot.

Predictions: [1 1 1 1]

Selecting first value as final prediction : 1

Selecting mode as final prediction : 1
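
The output shows four predictions because `transform` receives a list of four tokens and treats each one as its own document. To get a single prediction per review, the whole review can be passed as a one-element list. A self-contained sketch on a toy corpus (the corpus, the labels and the default tokenizer are assumptions, not this notebook's data or its `textPreprocessing` analyzer):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical miniature training set
toy_reviews = ["loved this place", "great food and service",
               "not good at all", "food was nasty"]
toy_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer().fit(toy_reviews)
tfidf = TfidfTransformer().fit(vectorizer.transform(toy_reviews))
clf = RandomForestClassifier(n_estimators=10, random_state=27)
clf.fit(tfidf.transform(vectorizer.transform(toy_reviews)), toy_labels)

query = "overall i loved this place"

# A list of words: one row (and later one prediction) per word
per_token = tfidf.transform(vectorizer.transform(query.split()))
# A one-element list: one row for the whole review
per_review = tfidf.transform(vectorizer.transform([query]))

print(per_token.shape[0], per_review.shape[0])  # 5 1
prediction = clf.predict(per_review)
print(prediction.shape)  # (1,)
```

With one row per review there is no need to pick the first value or the mode of several predictions.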



## **More work to be done:**
- The model fails on the following case: the review is labelled negative (0) but is predicted positive (1), because the model did not capture its sentiment.


In [None]:
print(data['Review'].iloc[33])
print(data['Liked'].iloc[33])

seems like a good quick place to grab a bite of some familiar pub food, but do yourself a favor and look elsewhere.
0


In [None]:
# Take the review from the user as input.
query_review = input("Enter a review for the Restaurant: ")

# Perform all preprocessing steps on the given review.
decon_query = decontracted(query_review)
preprocessed_query = textPreprocessing(decon_query)  
# Passing the token list yields one prediction per token, not one per review.
bagOfWords = finalWordVectorVocab.transform(preprocessed_query)
processedquery = tfidfObject.transform(bagOfWords)
print(f"\nPredictions: {model.predict(processedquery)}\n")
print(f"Selecting first value as final prediction : {model.predict(processedquery)[0]}\n")
print(f"Selecting mode as final prediction : {statistics.mode(model.predict(processedquery))}\n")

Enter a review for the Restaurant: seems like a good quick place to grab a bite of some familiar pub food, but do yourself a favor and look elsewhere.

Predictions: [1 1 1 1 1 1 1 1 1 1 1 1 1]

Selecting first value as final prediction : 1

Selecting mode as final prediction : 1



## **Let's see if VADER can help predict this review correctly.**

In [None]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentimentAnalyser = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [None]:
# Label each review "1" when VADER's compound score exceeds 0.5, else "0"
data['labelFromVADER'] = data['Review'].apply(
    lambda review: "1" if sentimentAnalyser.polarity_scores(review)['compound'] > 0.5 else "0")

# Select the review text and the VADER label (columns 0 and 2) as a 2-D NumPy array
features = data.iloc[:,[0,2]].values
label = data.iloc[:,[1]].values

# Create BOW in SKlearn
wordVector = CountVectorizer(analyzer = textPreprocessing)

#Build the Vocabulary
finalWordVectorVocab = wordVector.fit(features)

# To create BOW
bagOfWords = finalWordVectorVocab.transform(features)

# Apply TFIDF Algo on BOW to create a feature set
tfidfObject = TfidfTransformer().fit(bagOfWords)  #Calc IDF Values

# Lets create Numeric Feature set
processedFeatures = tfidfObject.transform(bagOfWords)
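
Because `textPreprocessing` joins the elements of each row with an empty string, the VADER label here appears to end up glued to the end of the review text (the last token becomes something like `place.1`) rather than acting as a feature of its own. One alternative, sketched below with made-up numbers, is to append the VADER score as an extra column to the TF-IDF matrix with `scipy.sparse.hstack`. This is a swapped-in technique for illustration, not what the notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical TF-IDF matrix: 3 reviews x 4 vocabulary terms
tfidf_features = csr_matrix(np.array([
    [0.7, 0.7, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.86],
]))

# Hypothetical VADER compound scores, one per review
vader_compound = np.array([0.8, -0.6, 0.1])

# Reshape to a single column and append it to the text features
vader_column = csr_matrix(vader_compound.reshape(-1, 1))
combined = hstack([tfidf_features, vader_column]).tocsr()

print(combined.shape)  # (3, 5)
```

The combined matrix could then be passed to `train_test_split` in place of `processedFeatures`; a new review would need its own VADER score appended in the same way at prediction time.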

In [None]:
# CL_Decided = 1- SL = 1- 0.35 
CL_Decided = 1- 0.35 

# Create Train Test Split (90% training -10% testing)
X_train, X_test, y_train, y_test = train_test_split(processedFeatures, label,  test_size= 0.1, random_state = 27)

# Building Model
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=27)
model.fit(X_train, y_train)

# Checking Accuracy Using Training and Testing set.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Printing CL_decided, Training and testing scores to check Generalization of the model.
print(f"Decided CL: \n{CL_Decided}\n")
print(f"Score of Training set: \n{train_score}\n")
print(f"Score of Testing set: \n{test_score}\n")


Decided CL: 
0.65

Score of Training set: 
0.69

Score of Testing set: 
0.76



In [None]:
# CL_Decided = 1- SL = 1- 0.35 
CL_Decided = 1- 0.35 

# Create Train Test Split (90% training -10% testing)
X_train, X_test, y_train, y_test = train_test_split(processedFeatures, label,  test_size= 0.1, random_state = 27)

# Building Model
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=27)
model.fit(X_train, y_train)

# Checking Accuracy Using Training and Testing set.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# Printing CL_decided, Training and testing scores to check Generalization of the model.
print(f"Decided CL: \n{CL_Decided}\n")
print(f"Classification Report of Training set: \n{metrics.classification_report(y_train, train_pred)}\n")
print(f"Classification Report of Testing set: \n{metrics.classification_report(y_test, test_pred)}\n")


Decided CL: 
0.65

Classification Report of Training set: 
              precision    recall  f1-score   support

           0       0.81      0.51      0.63       461
           1       0.63      0.88      0.73       439

    accuracy                           0.69       900
   macro avg       0.72      0.69      0.68       900
weighted avg       0.72      0.69      0.68       900


Classification Report of Testing set: 
              precision    recall  f1-score   support

           0       0.74      0.59      0.66        39
           1       0.77      0.87      0.82        61

    accuracy                           0.76       100
   macro avg       0.76      0.73      0.74       100
weighted avg       0.76      0.76      0.75       100




In [None]:
pred = model.predict(processedFeatures)
print(metrics.classification_report(pred, label))



              precision    recall  f1-score   support

           0       0.52      0.81      0.63       321
           1       0.88      0.65      0.74       679

    accuracy                           0.70      1000
   macro avg       0.70      0.73      0.69      1000
weighted avg       0.76      0.70      0.71      1000



In [None]:
print(data['Review'].iloc[33])
print(data['Liked'].iloc[33])

seems like a good quick place to grab a bite of some familiar pub food, but do yourself a favor and look elsewhere.
0


In [None]:
# Take the review from the user as input.
query_review = input("Enter a review for the Restaurant: ")

# Perform all preprocessing steps on the given review.
decon_query = decontracted(query_review)
preprocessed_query = textPreprocessing(decon_query)  
# Passing the token list yields one prediction per token, not one per review.
bagOfWords = finalWordVectorVocab.transform(preprocessed_query)
processedquery = tfidfObject.transform(bagOfWords)
print(f"\nPredictions: {model.predict(processedquery)}\n")
print(f"Selecting first value as final prediction : {model.predict(processedquery)[0]}\n")
print(f"Selecting mode as final prediction : {statistics.mode(model.predict(processedquery))}\n")

Enter a review for the Restaurant:   seems like a good quick place to grab a bite of some familiar pub food, but do yourself a favor and look elsewhere.

Predictions: [1 0 1 1 1 1 1 1 1 1 1 1 1]

Selecting first value as final prediction : 1

Selecting mode as final prediction : 1

