<a href="https://colab.research.google.com/github/snikhil17/NLP_course_Simplilearn/blob/main/Class_Codes_and_Assignments/Notebooks/4.%20Sentimental_Analysis_Amazon_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Assignment**
- **Create a model that can predict positive and negative sentiment.**
- **SL = 0.3**
- **Dataset: Amazon reviews**

## **Acquiring Data**

In [1]:
!wget https://raw.githubusercontent.com/snikhil17/NLP_course_Simplilearn/main/Class_Codes_and_Assignments/data/amazonreviews.tsv

--2021-11-22 08:35:16--  https://raw.githubusercontent.com/snikhil17/NLP_course_Simplilearn/main/Class_Codes_and_Assignments/data/amazonreviews.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4448100 (4.2M) [text/plain]
Saving to: ‘amazonreviews.tsv’


2021-11-22 08:35:16 (45.3 MB/s) - ‘amazonreviews.tsv’ saved [4448100/4448100]



## **Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import spacy
nlp = spacy.load('en_core_web_sm')
from nltk.stem.porter import PorterStemmer
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentimentAnalyser = SentimentIntensityAnalyzer()
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
import warnings
warnings.filterwarnings('ignore')
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statistics

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...




## **Loading data**

In [None]:
df = pd.read_csv('amazonreviews.tsv', delimiter= '\t')
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


## **Expanding English language contractions**
- you've -> you have
- he's -> he is

In [None]:
# ref: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python/47091490#47091490
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

df['review'] = df['review'].apply(lambda x: decontracted(x))
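As a quick check of the expansion rules, the sketch below restates the same substitutions as an ordered table and applies them to a few sample strings. The name `expand_contractions` and the sample strings are illustrative assumptions, not part of the notebook. The last example shows a limitation of the regex approach: a possessive `'s` is expanded to " is" as well.

```python
import re

# Same substitutions as `decontracted`, in the same order:
# specific cases first, then the general suffix rules.
RULES = [
    (r"won\'t", "will not"),
    (r"can\'t", "can not"),
    (r"n\'t", " not"),
    (r"\'re", " are"),
    (r"\'s", " is"),
    (r"\'d", " would"),
    (r"\'ll", " will"),
    (r"\'t", " not"),
    (r"\'ve", " have"),
    (r"\'m", " am"),
]

def expand_contractions(phrase):
    """Apply every rule in order and return the expanded phrase."""
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(expand_contractions("you've"))            # you have
print(expand_contractions("I won't"))           # I will not
print(expand_contractions("the seller's box"))  # the seller is box
```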

## **Missing Values**
- **When dealing with text data, drop rows with missing values.**

In [None]:
df.isnull().sum()

label     0
review    0
dtype: int64

In [None]:
df.label.value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

- Since the dataset is slightly unbalanced (5097 ``neg`` vs 4903 ``pos``), we will use **ACCURACY** as the metric for generalization
- For quality we will use **f1-Score**

## **Review analysis needs the word ``not``**
- **Remove ``not`` from the stopword list so that negations survive preprocessing**

In [None]:
stop_words = stopwords.words('english')
stop_words.remove('not')
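To see why this matters, the sketch below filters one sentence twice: once with "not" treated as a stopword and once with it kept. The miniature stopword list and the helper `drop_stop_words` are hypothetical stand-ins for illustration; the notebook itself uses NLTK's English list.

```python
# Hypothetical miniature stopword list (the notebook uses NLTK's English list).
toy_stop_words = ["this", "is", "a", "not", "the"]

def drop_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in stop_words]

tokens = "this is not a good product".split()

# With "not" treated as a stopword, the negation disappears.
with_not_removed = drop_stop_words(tokens, toy_stop_words)

# With "not" taken out of the stopword list, the negation is preserved.
kept_list = [w for w in toy_stop_words if w != "not"]
with_not_kept = drop_stop_words(tokens, kept_list)

print(with_not_removed)  # ['good', 'product']
print(with_not_kept)     # ['not', 'good', 'product']
```

Without the negation, "not a good product" and "a good product" reduce to the same tokens, which is why ``not`` is kept.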

## **Using ``SentimentIntensityAnalyzer`` from VADER to create a new column ``labelFromVADER``**

In [None]:
df['labelFromVADER'] = df['review'].apply(lambda review: "pos" if sentimentAnalyser.polarity_scores(review)['compound'] > 0.5 else "neg")

In [None]:
df.head()

Unnamed: 0,label,review,labelFromVADER
0,pos,Stuning even for the non-gamer: This sound tra...,pos
1,pos,The best soundtrack ever to anything.: I am re...,pos
2,pos,Amazing!: This soundtrack is my favorite music...,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...",pos
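The labelling rule above is a single threshold on VADER's compound score, which lies in the range -1 to 1. A minimal sketch of that rule as a standalone function follows; the name `label_from_compound` and the sample scores are assumptions for illustration, not values from the dataset.

```python
def label_from_compound(compound, threshold=0.5):
    """Map a compound score in [-1, 1] to 'pos' or 'neg' using a strict threshold."""
    return "pos" if compound > threshold else "neg"

# Hypothetical compound scores, not taken from the dataset.
for score in (0.92, 0.5, 0.2, -0.7):
    print(score, label_from_compound(score))
```

With a threshold of 0.5, mildly positive scores such as 0.2 are labelled ``neg``, and a score of exactly 0.5 is also ``neg`` because the comparison is strict. The threshold is a modelling choice; a lower cut-off could be tried by passing a different `threshold` value.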


## **Checking accuracy_score using VADER:**
- **Split the data, then use accuracy_score to check generalization and f1-score to check the quality of the VADER labels.**

In [None]:
SL = 0.3
CL = 1- SL
X_train, X_test, y_train, y_test = train_test_split(df['labelFromVADER'].values, df['label'].values , stratify=df['label'].values, test_size= 0.1, random_state = 4)
print(f"\n*****Generalization Check*****\n")
print(f"Decided CL:{CL}")
# X_* holds the VADER-predicted labels and y_* the true labels;
# sklearn metrics expect (y_true, y_pred) in that order.
print(f"Accuracy score on training_set: {accuracy_score(y_train, X_train)}")
print(f"Accuracy score on test_set: {accuracy_score(y_test, X_test)}\n")

print(f"\n*****Quality Test*****\n")
print(f"Decided CL: {CL}")
print(f"f1-score score on training_set: {metrics.f1_score(y_train, X_train, pos_label='pos')}")
print(f"f1-score score on test_set: {metrics.f1_score(y_test, X_test, pos_label='pos')}")
print(f"f1-score score on whole data: {metrics.f1_score(df['label'].values, df['labelFromVADER'].values, pos_label='pos')}")



*****Generalization Check*****

Decided CL:0.7
Accuracy score on training_set: 0.742
Accuracy score on test_set: 0.775


*****Quality Test*****

Decided CL: 0.7
f1-score score on training_set: 0.7570621468926554
f1-score score on test_set: 0.7859181731684111
f1-score score on whole data: 0.7599208219436326
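The f1-score is the harmonic mean of precision and recall for the chosen positive label. The pure-Python sketch below computes all three from counts, using made-up labels and a hypothetical helper `precision_recall_f1`. It also illustrates that swapping the true and predicted labels leaves the f1-score unchanged but exchanges precision and recall, which matters when reading a classification report.

```python
def precision_recall_f1(y_true, y_pred, pos_label="pos"):
    """Compute precision, recall and F1 for one label from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Hypothetical labels, not taken from the dataset.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "pos", "neg"]

p, r, f1 = precision_recall_f1(y_true, y_pred)
p_swapped, r_swapped, f1_swapped = precision_recall_f1(y_pred, y_true)

print(p, r, f1)                          # precision 0.5, recall 2/3, F1 4/7
print(p_swapped, r_swapped, f1_swapped)  # precision and recall exchanged, same F1
```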


### **Observations:**
- Accuracy score of test >  Accuracy score of train
- Accuracy score of test >  CL Decided

#### **Quality Test: f1-score**
- f1-score score of test >  f1-score score of train
- f1-score score of test >  CL Decided
- f1-score score on whole data > CL decided.
- **VADER can be used for predictions**
- Next, we apply ``textPreprocessing`` to the ``review`` column and use that column for prediction as well.


### **Separate the data into features and label**

In [None]:
# Separate the data into features and label
# .values ensures they are NumPy arrays

features_vader = df.iloc[:,[1,2]].values   # columns: review, labelFromVADER
label_vader = df.iloc[:,[0]].values        # column: label

In [None]:
# Creating an instance of PorterStemmer class, for stemming.
stemObject = PorterStemmer()

## **Performing Text Preprocessing**
- **We will create a text preprocessing function that can perform the following:**
  1. Remove punctuation
  2. Extract words out of the sentences
  3. Normalize the data (lowercase)
  4. Remove Stopwords
  5. Apply Stemming

In [None]:
def textPreprocessing(document):
  #1. Remove Punctuations
  sentWithoutPunct = ''.join([char for char in document if char not in string.punctuation])
  #2. Extract words out of the sentences
  words = sentWithoutPunct.split()
  #3. Normalize the data (lowercase | Uppercase | NormalCase)
  wordNormalized = [word.lower() for word in words]
  # 4. Remove Stopwords
  vocabulary = " ".join([word for word in wordNormalized if word not in stop_words])
  # 5. Apply Stemming
  stem_words = ''.join([stemObject.stem(word) for word in vocabulary])

  # 6. Extract words
  vocab= stem_words.split()

  return vocab
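The five steps can also be sketched at token level, so that stemming is applied to each word instead of to the characters of a joined string. This is a minimal variant under stated assumptions: each document reaches the function as a single `str`, and the names `preprocess` and `toy_stop_words`, plus the pluggable `stem` argument, are hypothetical. NLTK's `PorterStemmer().stem` could be passed as `stem`.

```python
import string

def preprocess(document, stop_words, stem=lambda word: word):
    """Tokenise one review (a single str) following the five steps above."""
    # 1. Remove punctuation characters.
    no_punct = "".join(ch for ch in document if ch not in string.punctuation)
    # 2. and 3. Split into words and lowercase them.
    words = [w.lower() for w in no_punct.split()]
    # 4. Remove stopwords.
    kept = [w for w in words if w not in stop_words]
    # 5. Stem each remaining word (the default `stem` leaves words unchanged).
    return [stem(w) for w in kept]

# Hypothetical miniature stopword set; "not" is deliberately left out of it.
toy_stop_words = {"this", "is", "a", "the"}

tokens = preprocess("This is NOT a great soundtrack!", toy_stop_words)
print(tokens)  # ['not', 'great', 'soundtrack']
```

Because the function returns a list of tokens, it can be used as the `analyzer` of a `CountVectorizer` that is fitted on a one-dimensional sequence of review strings.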

## **Creating BOW**

In [None]:
# Create BOW in SKlearn
wordVector_vader = CountVectorizer(analyzer = textPreprocessing)

#Build the Vocabulary
finalWordVectorVocab_vader = wordVector_vader.fit(features_vader)

# To create BOW
bagOfWords_vader = finalWordVectorVocab_vader.transform(features_vader)

## **Vocabulary created**

In [None]:
# Printing only the first 10 entries
dict(list(finalWordVectorVocab_vader.vocabulary_.items())[:10])

{'beautiful!': 10519,
 'even': 24862,
 'mind': 41670,
 'non-gamer:': 44313,
 'paints': 46883,
 'senery': 56360,
 'sound': 59059,
 'stuning': 60838,
 'track': 64542,
 'well': 68683}

## **Apply TF-IDF on BOW to create a feature set**

In [None]:
# Apply TFIDF Algo on BOW to create a feature set
tfidfObject_vader = TfidfTransformer().fit(bagOfWords_vader)  #Calc IDF Values

# Lets create Numeric Feature set
processedFeatures_vader = tfidfObject_vader.transform(bagOfWords_vader)
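To make the weighting concrete, here is a pure-Python sketch of TF-IDF on a tiny tokenised corpus. It follows the smoothed formula that scikit-learn documents for `TfidfTransformer` with default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row. The helper `tfidf_rows` and the toy documents are assumptions for illustration.

```python
import math

def tfidf_rows(tokenised_docs):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per document."""
    vocab = sorted({tok for doc in tokenised_docs for tok in doc})
    n_docs = len(tokenised_docs)
    # Document frequency: in how many documents each term appears.
    df = {term: sum(1 for doc in tokenised_docs if term in doc) for term in vocab}
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for doc in tokenised_docs:
        weights = [doc.count(term) * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Hypothetical toy corpus, already tokenised.
docs = [["good", "good", "movie"], ["bad", "movie"]]
vocab, rows = tfidf_rows(docs)

print(vocab)  # ['bad', 'good', 'movie']
for row in rows:
    print([round(w, 3) for w in row])
```

In the second document, "bad" receives a larger weight than "movie" because "movie" appears in every document and therefore has the smallest IDF.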

## **Model-Building and checking Generalization**
- **Model with VADER variable**
  - using ``stratify = label_vader`` in train_test_split
  - searching over ``random_state`` values to obtain a generalized model
- **Model without VADER variable**
  - searching over ``random_state`` values to obtain a generalized model
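The ``stratify`` argument keeps the class proportions roughly equal in the training and test parts. The pure-Python sketch below illustrates that idea with a hypothetical helper `stratified_split` and made-up labels. It is a simplified illustration; scikit-learn's own implementation differs in detail.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=0):
    """Split indices so that each class contributes about `test_size` of its rows to the test part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels with a mild imbalance, similar in spirit to the dataset.
labels = ["neg"] * 51 + ["pos"] * 49
train_idx, test_idx = stratified_split(labels, test_size=0.1, seed=35)

print(len(train_idx), len(test_idx))                  # 90 10
print(sum(1 for i in test_idx if labels[i] == "neg"))  # 5
```

Changing `seed` changes which rows land in the test part, but not how many rows each class contributes.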

## **Model with VADER Variable**

**Searching over ``random_state`` values to find a split on which the model generalizes.**

In [None]:
for i in range(0,50,1):
  # Create Train Test Split (90% training -10% testing)
  X_train_vader, X_test_vader, y_train_vader, y_test_vader = train_test_split(processedFeatures_vader, label_vader, stratify = label_vader , test_size= 0.1, random_state = i)

  model_vader = RandomForestClassifier(n_estimators = 9, max_depth = 9, random_state=i)
  model_vader.fit(X_train_vader, y_train_vader)
  train_score_vader = model_vader.score(X_train_vader, y_train_vader)
  test_score_vader = model_vader.score(X_test_vader, y_test_vader)
  if test_score_vader > train_score_vader and test_score_vader > CL:
    print(f"RandomState: {i}")
    print(f"Score of Training set: \n{train_score_vader}")
    print(f"Score of Testing set: \n{test_score_vader}")
    print("="*50)

RandomState: 35
Score of Training set: 
0.7318888888888889
Score of Testing set: 
0.734


**Using ``random_state = 35`` to train a generalized model created by including ``vader_variable``**

In [None]:
X_train_vader, X_test_vader, y_train_vader, y_test_vader = train_test_split(processedFeatures_vader, label_vader, stratify = label_vader , test_size= 0.1, random_state = 35)

model_vader = RandomForestClassifier(n_estimators = 9, max_depth = 9, random_state=35)
model_vader.fit(X_train_vader, y_train_vader)
train_score_vader = model_vader.score(X_train_vader, y_train_vader)
test_score_vader = model_vader.score(X_test_vader, y_test_vader)
print(f"Score of Training set: \n{train_score_vader}")
print(f"Score of Testing set: \n{test_score_vader}")
print("="*50)

Score of Training set: 
0.7318888888888889
Score of Testing set: 
0.734


#### **Observation:**

---
- The model generalizes by the criterion used here (test_score > train_score and test_score > CL).
- **We can therefore check the quality of the model using ``f1-score``, since the data is slightly unbalanced.**


**Quality check using ``f1-score`` (data is unbalanced)**

In [None]:
pred_tr_vader = model_vader.predict(X_train_vader)
pred_test_vader = model_vader.predict(X_test_vader)

print(f"Classification Report of Training Set: \n{metrics.classification_report(pred_tr_vader, y_train_vader)}\n")
print(f"Classification Report of Test Set: \n{metrics.classification_report(pred_test_vader, y_test_vader)}\n")

Classification Report of Training Set: 
              precision    recall  f1-score   support

         neg       0.77      0.72      0.74      4854
         pos       0.70      0.74      0.72      4146

    accuracy                           0.73      9000
   macro avg       0.73      0.73      0.73      9000
weighted avg       0.73      0.73      0.73      9000


Classification Report of Test Set: 
              precision    recall  f1-score   support

         neg       0.76      0.73      0.74       532
         pos       0.71      0.74      0.72       468

    accuracy                           0.73      1000
   macro avg       0.73      0.73      0.73      1000
weighted avg       0.74      0.73      0.73      1000




### **Observation:**


---
- Since the data is slightly unbalanced, we used ``f1-score`` as the quality check of the model.
- ``f1-score`` on the training data equals that on the test data for both classes at two decimals (``neg``: 0.74, ``pos``: 0.72).
- ``f1-score`` on the test set >= the decided CL (0.7) for both classes.
- Therefore the model can be considered for deployment.
- We will also run the model on the whole dataset, check ``f1-score``, and compare it with the model without the VADER column.


## **Model without VADER variable**

**Excluding the vader_variable and preprocessing whole data again**

In [None]:
# Separate the data into features and label
# .values ensures they are NumPy arrays

features_no_vader = df.iloc[:,[1]].values
label_no_vader = df.iloc[:,[0]].values

# Create BOW in SKlearn
wordVector_no_vader = CountVectorizer(analyzer = textPreprocessing)

#Build the Vocabulary
finalWordVectorVocab_no_vader = wordVector_no_vader.fit(features_no_vader)

# To create BOW
bagOfWords_no_vader = finalWordVectorVocab_no_vader.transform(features_no_vader)

# Apply TFIDF Algo on BOW to create a feature set
tfidfObject_no_vader = TfidfTransformer().fit(bagOfWords_no_vader)  #Calc IDF Values

# Lets create Numeric Feature set
processedFeatures_no_vader = tfidfObject_no_vader.transform(bagOfWords_no_vader)

**Using different values of ``random_state`` to obtain a generalized model.**

In [None]:
CL = 0.7
for i in range(0,50,1):
  # Create Train Test Split (90% training -10% testing)
  X_train_no_vader, X_test_no_vader, y_train_no_vader, y_test_no_vader = train_test_split(processedFeatures_no_vader, label_no_vader , test_size= 0.1, random_state = i)

  model_no_vader = RandomForestClassifier(n_estimators = 11, max_depth = 7, random_state=i)
  model_no_vader.fit(X_train_no_vader, y_train_no_vader)
  train_score_no_vader = model_no_vader.score(X_train_no_vader, y_train_no_vader)
  test_score_no_vader = model_no_vader.score(X_test_no_vader, y_test_no_vader)
  # Checking Generalization
  if test_score_no_vader > train_score_no_vader and test_score_no_vader > CL  :
    print(f"random State:{i}")
    print(f"Score of Training set: \n{train_score_no_vader}")
    print(f"Score of Testing set: \n{test_score_no_vader}")
    print("="*50)

random State:1
Score of Training set: 
0.7002222222222222
Score of Testing set: 
0.71
random State:44
Score of Training set: 
0.728
Score of Testing set: 
0.733


**Using ``random_state = 44`` to train a generalized model created by excluding ``vader_variable``**

In [None]:
X_train_no_vader, X_test_no_vader, y_train_no_vader, y_test_no_vader = train_test_split(processedFeatures_no_vader, label_no_vader , test_size= 0.1, random_state = 44)

model_no_vader = RandomForestClassifier(n_estimators = 11, max_depth = 7, random_state=44)
model_no_vader.fit(X_train_no_vader, y_train_no_vader)
train_score_no_vader = model_no_vader.score(X_train_no_vader, y_train_no_vader)
test_score_no_vader = model_no_vader.score(X_test_no_vader, y_test_no_vader)
print(f"Score of Training set: \n{train_score_no_vader}")
print(f"Score of Testing set: \n{test_score_no_vader}")
print("="*50)

Score of Training set: 
0.728
Score of Testing set: 
0.733


#### **Observation:**

---
- The model generalizes by the criterion used here (test_score > train_score and test_score > CL).
- **We can therefore check the quality of the model using ``f1-score``, since the data is slightly unbalanced.**


**Quality check using ``f1-score`` (data is unbalanced)**

In [None]:
pred_tr_no_vader = model_no_vader.predict(X_train_no_vader)
pred_test_no_vader = model_no_vader.predict(X_test_no_vader)

print(f"Classification Report of Training Set: \n{metrics.classification_report(pred_tr_no_vader, y_train_no_vader)}\n")
print(f"Classification Report of Test Set: \n{metrics.classification_report(pred_test_no_vader, y_test_no_vader)}\n")

Classification Report of Training Set: 
              precision    recall  f1-score   support

         neg       0.77      0.72      0.74      4892
         pos       0.69      0.74      0.71      4108

    accuracy                           0.73      9000
   macro avg       0.73      0.73      0.73      9000
weighted avg       0.73      0.73      0.73      9000


Classification Report of Test Set: 
              precision    recall  f1-score   support

         neg       0.76      0.73      0.74       536
         pos       0.70      0.74      0.72       464

    accuracy                           0.73      1000
   macro avg       0.73      0.73      0.73      1000
weighted avg       0.73      0.73      0.73      1000




### **Observation:**


---
- Since the data is slightly unbalanced, we used ``f1-score`` as the quality check of the model.
- ``f1-score`` on the training data <= that on the test data for both classes (``neg``: 0.74 vs 0.74, ``pos``: 0.71 vs 0.72).
- ``f1-score`` on the test set >= the decided CL (0.7) for both classes.
- Therefore the model can be considered for deployment.
- We will also run the model on the whole dataset, check ``f1-score``, and compare it with the model with the VADER column.


## **Let's Compare two models on whole dataset:**
- Model with VADER column
- Model without VADER column

In [None]:
pred_no_vader = model_no_vader.predict(processedFeatures_no_vader)
pred_vader = model_vader.predict(processedFeatures_vader)

print(f"Classification Report of NON-VADER: \n{metrics.classification_report(pred_no_vader, label_no_vader)}\n")
print(f"Classification Report of VADER: \n{metrics.classification_report(pred_vader, label_vader)}\n")

Classification Report of NON-VADER: 
              precision    recall  f1-score   support

         neg       0.77      0.72      0.74      5428
         pos       0.69      0.74      0.71      4572

    accuracy                           0.73     10000
   macro avg       0.73      0.73      0.73     10000
weighted avg       0.73      0.73      0.73     10000


Classification Report of VADER: 
              precision    recall  f1-score   support

         neg       0.77      0.72      0.74      5386
         pos       0.70      0.74      0.72      4614

    accuracy                           0.73     10000
   macro avg       0.73      0.73      0.73     10000
weighted avg       0.73      0.73      0.73     10000




## **Observation:**
- The model with the VADER column is marginally better (``pos`` f1-score 0.72 vs 0.71).
- Both models generalize and are of satisfactory quality.
- Either model can be deployed.

## **Code for Deployment:**
**We will check the prediction of both models on a given review**
- Take the review from the user as input.
- Perform all preprocessing steps on the given review.
- Pass the preprocessed review to the model and get the prediction.

In [None]:
print(df.iloc[33])
print()
print(df['review'].iloc[33])

label                                                           pos
review            Is this great TV??? You bet it is: Hotel Babyl...
labelFromVADER                                                  pos
Name: 33, dtype: object

Is this great TV??? You bet it is: Hotel Babylon is not just good TV...it is great TV!!!! The show features some incredible acting from Tamzin Outhwaite (formerly of EastEnders, a BBC soap) and Max Beesley (from the ill-fated movie "Glitter" starring Mariah Carey). The show could make for a great drama series, but I felt that it is a mix of a drama, comedy, and soap opera all mixed into a great BBC show. The show aired on BBC America for a while but did not get around to seeing it. I can now say that I got the DVD set and all the episodes are great. The season finale was an interesting to watch.The show reminds me of Hotel which aired on ABC from 1983 to 1988. The reason...Hotel was set at a fictional San Francisco hotel as Hotel Babylon was set at a luxury fiv

**Deployment using ``model_vader`` i.e. model with VADER col**

In [None]:
# Take the review from user as input.
query_review = input("Enter an Amazon review: ")

# Perform all preprocessing steps on the given review.
decon_query = decontracted(query_review)
preprocessed_query = textPreprocessing(decon_query)  
bagOfWords = finalWordVectorVocab_vader.transform(preprocessed_query)
processedquery = tfidfObject_vader.transform(bagOfWords)
print(f"\nPredictions: {model_vader.predict(processedquery)}\n")
print(f"Selecting first value as final prediction : {model_vader.predict(processedquery)[0]}\n")
print(f"Selecting mode as final prediction : {statistics.mode(model_vader.predict(processedquery))}\n")

Enter an Amazon review: Is this great TV??? You bet it is: Hotel Babylon is not just good TV...it is great TV!!!! The show features some incredible acting from Tamzin Outhwaite (formerly of EastEnders, a BBC soap) and Max Beesley (from the ill-fated movie "Glitter" starring Mariah Carey). The show could make for a great drama series, but I felt that it is a mix of a drama, comedy, and soap opera all mixed into a great BBC show. The show aired on BBC America for a while but did not get around to seeing it. I can now say that I got the DVD set and all the episodes are great. The season finale was an interesting to watch.The show reminds me of Hotel which aired on ABC from 1983 to 1988. The reason...Hotel was set at a fictional San Francisco hotel as Hotel Babylon was set at a luxury five-star hotel in England.I recommend this DVD to anyone who is willing to watch a great show from the BBC.

Predictions: ['pos' 'pos' 'pos' 'pos' 'pos' 'neg' 'pos' 'pos' 'pos' 'pos' 'pos' 'pos'
 'pos' 'pos'

**Deployment using ``model_no_vader`` i.e. model without VADER col**

In [None]:
# Take the review from user as input.
query_review = input("Enter an Amazon review: ")

# Perform all preprocessing steps on the given review.
decon_query = decontracted(query_review)
preprocessed_query = textPreprocessing(decon_query)  
bagOfWords = finalWordVectorVocab_no_vader.transform(preprocessed_query)
processedquery = tfidfObject_no_vader.transform(bagOfWords)
print(f"\nPredictions: {model_no_vader.predict(processedquery)}\n")
print(f"Selecting first value as final prediction : {model_no_vader.predict(processedquery)[0]}\n")
print(f"Selecting mode as final prediction : {statistics.mode(model_no_vader.predict(processedquery))}\n")

Enter an Amazon review: Is this great TV??? You bet it is: Hotel Babylon is not just good TV...it is great TV!!!! The show features some incredible acting from Tamzin Outhwaite (formerly of EastEnders, a BBC soap) and Max Beesley (from the ill-fated movie "Glitter" starring Mariah Carey). The show could make for a great drama series, but I felt that it is a mix of a drama, comedy, and soap opera all mixed into a great BBC show. The show aired on BBC America for a while but did not get around to seeing it. I can now say that I got the DVD set and all the episodes are great. The season finale was an interesting to watch.The show reminds me of Hotel which aired on ABC from 1983 to 1988. The reason...Hotel was set at a fictional San Francisco hotel as Hotel Babylon was set at a luxury five-star hotel in England.I recommend this DVD to anyone who is willing to watch a great show from the BBC.

Predictions: ['pos' 'pos' 'pos' 'pos' 'pos' 'neg' 'pos' 'pos' 'pos' 'pos' 'pos' 'pos'
 'pos' 'pos'
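A fitted vectorizer's ``transform`` treats each element of its input as one document, so wrapping a review in a one-element list yields a single prediction for the whole review. The sketch below shows this with a small scikit-learn pipeline. The training reviews, their labels, and the name `sentiment_pipeline` are made up for illustration, and the default word analyzer is used in place of ``textPreprocessing``.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, not taken from amazonreviews.tsv.
train_reviews = [
    "great show, I loved every episode",
    "excellent soundtrack and great acting",
    "terrible quality, do not buy",
    "boring and bad, a waste of money",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Bag of words -> TF-IDF -> random forest, chained so that new text
# goes through exactly the same steps as the training text.
sentiment_pipeline = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    RandomForestClassifier(n_estimators=9, max_depth=9, random_state=35),
)
sentiment_pipeline.fit(train_reviews, train_labels)

new_review = "Is this great TV? You bet it is."
# A one-element list means one document, hence one row and one prediction.
prediction = sentiment_pipeline.predict([new_review])

print(prediction.shape)  # (1,)
print(prediction[0])
```

Keeping the vectorizer, the TF-IDF step and the classifier in one pipeline object also means a single object can be saved and reused at prediction time.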