## Plan of Action


1. We use the **Amazon Alexa Reviews dataset (3150 reviews)**, which contains **customer reviews, a rating out of 5**, the review date, and the Alexa variant.
2. First, we **generate sentiment labels (positive/negative)**, marking *reviews with a rating above 3 as positive and all others as negative*.
3. Then we **turn the cleaned review text into numeric features with TF-IDF vectorization**, a popular feature-engineering technique.
4. Next, we **fit a Support Vector Classifier** and check model performance (*accuracy is above 90%*).
5. Finally, we use the model to make **predictions on real Amazon reviews**, first a simple way and then a fancy way.



## Import libraries and dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
!pip install spacy
import spacy



In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
nlp = spacy.load('en_core_web_sm')

Set your working directory here.

If you are using Google Colab, use this code snippet:

```
from google.colab import drive
drive.mount('/content/drive')
```

```
%cd /content/drive/My Drive/Project6_SentimentAnalysis_with_Pipeline
```

If you are working locally, keep the training data in the same directory as this notebook.

In [5]:
#Loading the dataset
dump = pd.read_csv('alexa_reviews_dataset.tsv',sep='\t') 

dump

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1
...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1


## Data Preparation

In [6]:
dataset = dump[['verified_reviews','rating']]
dataset.columns = ['Review', 'Sentiment']

dataset.head()

Unnamed: 0,Review,Sentiment
0,Love my Echo!,5
1,Loved it!,5
2,"Sometimes while playing a game, you can answer...",4
3,I have had a lot of fun with this thing. My 4 ...,5
4,Music,5


In [7]:
# Map each rating to a binary sentiment label:
# 1 (positive) if the rating is above 3, otherwise 0 (negative)
def compute_sentiments(labels):
  return [1 if label > 3.0 else 0 for label in labels]

In [8]:
# Take an explicit copy first: assigning into a slice of `dump` triggers pandas' SettingWithCopyWarning
dataset = dataset.copy()
dataset['Sentiment'] = compute_sentiments(dataset['Sentiment'])


In [9]:
dataset.head()

Unnamed: 0,Review,Sentiment
0,Love my Echo!,1
1,Loved it!,1
2,"Sometimes while playing a game, you can answer...",1
3,I have had a lot of fun with this thing. My 4 ...,1
4,Music,1


In [10]:
# check distribution of sentiments

dataset['Sentiment'].value_counts()

1    2741
0     409
Name: Sentiment, dtype: int64
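
The classes are heavily imbalanced, which matters when reading the accuracy figure later on. As a rough illustration (plain arithmetic on the counts printed above; `majority_baseline` is just a name chosen for this sketch), a trivial model that always predicts the positive class would already be right about 87% of the time:

```python
# Class counts taken from the value_counts() output above
counts = {1: 2741, 0: 409}

total = sum(counts.values())
# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = max(counts.values()) / total
negative_share = counts[0] / total

print(f"total reviews: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
print(f"share of negative reviews: {negative_share:.1%}")
```

Any accuracy reported for the classifier is best compared against this baseline rather than against 50%.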

In [11]:
# check for null values
dataset.isnull().sum()

# no null values in the data

Review       0
Sentiment    0
dtype: int64

### Data Cleaning

In [12]:
x = dataset['Review']
y = dataset['Sentiment']

In [13]:
# Create a function to clean the data
# We remove stop words and punctuation, and apply lemmatization

In [14]:
# import string
# from spacy.lang.en.stop_words import STOP_WORDS

In [15]:
from custom_tokenizer_function import CustomTokenizer

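# Inline version of the data-cleaning function, kept commented out for reference.
# `stopwords` and `punct` are not defined in this notebook; they are assumed to be
# spaCy's STOP_WORDS and string.punctuation from the commented-out imports above.

# def text_data_cleaning(sentence):
#   doc = nlp(sentence)                         # spaCy tokenizes the text and runs the pipeline components in order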

#   tokens = [] # list of tokens
#   for token in doc:
#     if token.lemma_ != "-PRON-":
#       temp = token.lemma_.lower().strip()
#     else:
#       temp = token.lower_
#     tokens.append(temp)
 
#   cleaned_tokens = []
#   for token in tokens:
#     if token not in stopwords and token not in punct:
#       cleaned_tokens.append(token)
#   return cleaned_tokens

In [16]:
# if the lemma of a token is not the pronoun placeholder "-PRON-", we take the lower-cased lemma;
# if the token is a pronoun, we take its lower-cased text directly,
# because spaCy v2 returns "-PRON-" instead of a real lemma for pronouns

In [17]:
# let's do a test
custom_tokenizer = CustomTokenizer()
custom_tokenizer.text_data_cleaning("Hello all, It's a beautiful day outside there!")
# stop words and punctuation are removed

['hello', 'beautiful', 'day', 'outside']
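
To make the filtering step concrete without depending on spaCy, here is a simplified, standard-library-only sketch. It is not the `CustomTokenizer` used in this notebook: it skips lemmatization, and both `simple_text_cleaning` and the tiny `STOP_WORDS` set are hypothetical names invented for this illustration.

```python
import re
import string

# Tiny, hypothetical stop-word list for illustration only;
# spaCy's STOP_WORDS set is much larger.
STOP_WORDS = {"a", "all", "it", "'s", "is", "the", "there"}
PUNCT = set(string.punctuation)

def simple_text_cleaning(sentence):
    """Lowercase, tokenize, then drop stop words and punctuation (no lemmatization)."""
    # A token is a clitic such as 's, a run of word characters, or a single punctuation mark
    tokens = re.findall(r"'\w+|\w+|[^\w\s]", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in PUNCT]

print(simple_text_cleaning("Hello all, It's a beautiful day outside there!"))
```

On this test sentence the sketch keeps the same four words as the spaCy-based tokenizer above. On other inputs the results will differ, because words are not reduced to their lemmas.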

### Vectorization Feature Engineering (TF-IDF)

In [18]:
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [19]:
tfidf = TfidfVectorizer(tokenizer=custom_tokenizer.text_data_cleaning)
# tokenizer=custom_tokenizer.text_data_cleaning: TfidfVectorizer calls this function to tokenize each review
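
For intuition about what the vectorizer produces, the following is a minimal pure-Python sketch of TF-IDF weighting. It assumes the scheme that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2 normalization). The toy corpus and the helper name `tfidf_vector` are made up for this example.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned token lists (hypothetical examples)
docs = [
    ["love", "echo"],
    ["love", "music"],
    ["sound", "quality", "bad"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_vector(docs[0])
print(weights)
```

In the first document, "echo" receives a higher weight than "love" because "love" also appears in another document and is therefore less distinctive.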

## Train the model

### Train/ Test Split

In [20]:
from sklearn.model_selection import train_test_split
# stratify=y keeps the positive/negative ratio the same in the training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)

In [21]:
x_train.shape, x_test.shape
# 2520 samples in training dataset and 630 in test dataset

((2520,), (630,))
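
The idea behind a stratified split can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's actual implementation; `stratified_split` is a hypothetical helper, and the labels are synthetic, with only the class counts taken from this dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), sampling the same fraction from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Synthetic labels with the same class counts as this dataset
labels = [1] * 2741 + [0] * 409
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] == 0 for i in test_idx), "negative reviews in the test split")
```

Sampling 20% from each class separately keeps the share of negative reviews at roughly 13% in both splits, which a purely random split does not guarantee.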

### Fit x_train and y_train

In [22]:
classifier = LinearSVC()

In [23]:
pipeline = Pipeline([('tfidf',tfidf), ('clf',classifier)])
# the pipeline first vectorizes the text with TF-IDF, then passes the features to the classifier

In [24]:
pipeline.fit(x_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<bound method CustomTokenizer.text_data_cleaning of <custom_tokenizer_function.CustomTokenizer object at 0x000001DED36C4040>>)),
                ('clf', LinearSVC())])

In [25]:
# with a pipeline, x_test needs no separate preprocessing:
# predict() applies the fitted TF-IDF step to the raw text automatically

In [26]:
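import joblib
# note: the pickled pipeline references CustomTokenizer, so custom_tokenizer_function
# must be importable wherever this file is loaded again
joblib.dump(pipeline, 'sentiment_model.pkl')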

['sentiment_model.pkl']

## Check Model Performance

In [27]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [28]:
y_pred = pipeline.predict(x_test)

In [29]:
# confusion_matrix
confusion_matrix(y_test, y_pred)

array([[ 39,  43],
       [ 10, 538]], dtype=int64)

In [30]:
# classification_report
print(classification_report(y_test, y_pred))
# accuracy is about 92% (91.59%), but recall for the negative class (0) is only 0.48

              precision    recall  f1-score   support

           0       0.80      0.48      0.60        82
           1       0.93      0.98      0.95       548

    accuracy                           0.92       630
   macro avg       0.86      0.73      0.77       630
weighted avg       0.91      0.92      0.91       630



In [31]:
round(accuracy_score(y_test, y_pred)*100,2)

91.59
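
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below uses only the four counts printed above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels. The variable names are chosen for this illustration.

```python
# Confusion matrix from above: rows are true labels (0, 1), columns are predictions (0, 1)
cm = [[39, 43],
      [10, 538]]

tn, fp = cm[0]   # true negatives, negative reviews predicted as positive
fn, tp = cm[1]   # positive reviews predicted as negative, true positives

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_neg = tn / (tn + fn)   # of the reviews predicted negative, how many really are
recall_neg = tn / (tn + fp)      # of the truly negative reviews, how many were found
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

print(f"accuracy: {accuracy:.4f}")
print(f"class 0: precision {precision_neg:.2f}, recall {recall_neg:.2f}")
print(f"class 1: precision {precision_pos:.2f}, recall {recall_pos:.2f}")
```

The low recall for class 0 means that more than half of the negative reviews in the test set (43 of 82) were classified as positive, which the overall accuracy figure hides.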

## Predict Sentiments using Model

### Simple way

In [32]:
# prediction = pipeline.predict(["Alexa is bad"])

# if prediction == 1:
#   print("Result: This review is positive")
# else:
#   print("Result: This review is negative")

### Fancy way

In [33]:
# new_review = []
# pred_sentiment = []

# while True:
  
#   # ask for a new amazon alexa review
#   review = input("Please type an Alexa review (Type 'skip' to exit) - ")

#   if review == 'skip':
#     print("See you soon!")
#     break
#   else:
#     prediction = pipeline.predict([review])

#     if prediction == 1:
#       result = 'Positive'
#       print("Result: This review is positive\n")
#     else:
#       result = 'Negative'
#       print("Result: This review is negative\n")
  
#   new_review.append(review)
#   pred_sentiment.append(result)

In [34]:
# Results_Summary = pd.DataFrame(
#     {'New Review': new_review,
#      'Sentiment': pred_sentiment,
#     })

# Results_Summary.to_csv("./predicted_sentiments.tsv", sep='\t', encoding='UTF-8', index=False)
# Results_Summary