## Plan of Action


1. We use the **Amazon reviews dataset in `sentiment.csv` (45,912 reviews)**, which contains **customer reviews and ratings out of 5**, along with the review date, company, and product
2. First, we **generate sentiment labels (positive/negative)** by marking reviews with *rating > 3 as positive and all others as negative*
3. Then, we **clean the text and apply vectorization feature engineering** (TF-IDF), a popular technique
4. After that, we use a **Support Vector Classifier for model fitting** and check model performance (*we get >90% accuracy*)
5. Last, we use our model to make **predictions on real Amazon reviews**, first a simple way and then a fancy way



## Import datasets

In [36]:
import pandas as pd

In [37]:
!pip install spacy
import spacy



In [38]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)


In [39]:
nlp = spacy.load('en_core_web_sm')

Set your working directory here.

If you are using Google Colab, use this code snippet:

```
from google.colab import drive
drive.mount('/content/drive')
```

```
%cd /content/drive/My Drive/Project6_SentimentAnalysis_with_Pipeline
```

If you are working locally on a PC, keep the training data in the same directory as this code file.

In [40]:
#Loading the dataset
dump = pd.read_csv('sentiment.csv',sep=',')

dump

|  | id | Company | Product | Title | Date | Rating | Review | Clean_Title | Clean_Review | cleaned_title_review | Positive | Negative | Neutral | Compound | Sentiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Apple | Apple iPhone XS, US Version, 64GB, Space Gray | Honestly, it was worth it | 22-Jun-19 | 5 | I was very hesitant about buying an iPhone off... | honestly worth | hesitant buying iphone amazon disappoint come ... | honestly worth hesitant buying iphone amazon d... | 0.246 | 0.197 | 0.556 | 0.6901 | positive |
| 1 | 1 | Apple | Apple iPhone XS, US Version, 64GB, Space Gray | Eh wouldn’t buy again | 30-Jun-19 | 1 | One - It comes in a weird boxTwo it had more s... | eh wouldnt buy | one come weird boxtwo scuff scratch id like pr... | eh wouldnt buy one come weird boxtwo scuff scr... | 0.173 | 0.279 | 0.548 | -0.3536 | negative |
| 2 | 2 | Apple | Apple iPhone XS, US Version, 64GB, Space Gray | Beautiful, lovely, practically brand new iPhon... | 04-Feb-20 | 5 | I absolutely love my new iPhone XS! It arrived... | beautiful lovely practically brand new iphone x | absolutely love new iphone x arrive time pract... | beautiful lovely practically brand new iphone ... | 0.325 | 0.033 | 0.641 | 0.9896 | positive |
| 3 | 3 | Apple | Apple iPhone XS, US Version, 64GB, Space Gray | Phone not working | 25-Dec-18 | 4 | The phone is froze up and unable to use. Very ... | phone work | phone froze unable use poor productedit within... | phone work phone froze unable use poor product... | 0.266 | 0.133 | 0.601 | 0.2732 | positive |
| 4 | 4 | Apple | Apple iPhone XS, US Version, 64GB, Space Gray | May be defective one | 02-Jul-19 | 1 | Suddenly Wifi is not working properly and i co... | may defective one | suddenly wifi work properly could see phone fe... | may defective one suddenly wifi work properly ... | 0.120 | 0.232 | 0.648 | -0.3415 | negative |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45907 | 45907 | Samsung | Samsung Galaxy Note 10+, 256GB, Aura WhiteFully | The phone was as described on amazon. | 16-May-20 | 5 | It was a Birthday gift for my wife and she lov... | phone describe amazon | birthday gift wife love | phone describe amazon birthday gift wife love | 0.688 | 0.000 | 0.312 | 0.8316 | positive |
| 45908 | 45908 | Samsung | Samsung Galaxy Note 10+, 256GB, Aura WhiteFully | The phone is great | 14-Feb-20 | 5 | The phone itself is great. There are no issues... | phone great | phone great issue whatsoever mine come perfect... | phone great phone great issue whatsoever mine ... | 0.339 | 0.095 | 0.567 | 0.9123 | positive |
| 45909 | 45909 | Samsung | Samsung Galaxy Note 10+, 256GB, Aura WhiteFully | Beautiful. | 14-May-21 | 5 | Like new condition. No scratches. Boots up as ... | beautiful | like new condition scratch boot att say fully ... | beautiful like new condition scratch boot att ... | 0.348 | 0.000 | 0.652 | 0.7506 | positive |
| 45910 | 45910 | Samsung | Samsung Galaxy Note 10+, 256GB, Aura WhiteFully | Juvenile Star Wars model | 17-Jun-21 | 3 | This particular model, the Galaxy Note 10+, is... | juvenile star war model | particular model galaxy note 10 heavy compare ... | juvenile star war model particular model galax... | 0.099 | 0.148 | 0.753 | -0.7882 | negative |


## Data Preparation

In [41]:
dataset = dump[['Review','Rating']]
dataset.columns = ['Review', 'Sentiment']

dataset.head()

|  | Review | Sentiment |
| --- | --- | --- |
| 0 | I was very hesitant about buying an iPhone off... | 5 |
| 1 | One - It comes in a weird boxTwo it had more s... | 1 |
| 2 | I absolutely love my new iPhone XS! It arrived... | 5 |
| 3 | The phone is froze up and unable to use. Very ... | 4 |
| 4 | Suddenly Wifi is not working properly and i co... | 1 |


In [42]:
# Map each rating to a binary sentiment label:
# 1 (positive) for ratings above 3, 0 (negative) for everything else
def compute_sentiments(labels):
  sentiments = []
  for label in labels:
    if label > 3.0:
      sentiment = 1
    else:
      sentiment = 0
    sentiments.append(sentiment)
  return sentiments

In [43]:
# dataset was created as a slice of dump, so take an explicit copy first;
# assigning to the slice directly triggers pandas' SettingWithCopyWarning
dataset = dataset.copy()
dataset['Sentiment'] = compute_sentiments(dataset.Sentiment)


In [44]:
dataset.head()

|  | Review | Sentiment |
| --- | --- | --- |
| 0 | I was very hesitant about buying an iPhone off... | 1 |
| 1 | One - It comes in a weird boxTwo it had more s... | 0 |
| 2 | I absolutely love my new iPhone XS! It arrived... | 1 |
| 3 | The phone is froze up and unable to use. Very ... | 1 |
| 4 | Suddenly Wifi is not working properly and i co... | 0 |


In [45]:
# check distribution of sentiments

dataset['Sentiment'].value_counts()

1    31540
0    14372
Name: Sentiment, dtype: int64
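
The counts above show that the classes are imbalanced, so the accuracy we report later is best read against a majority-class baseline. A minimal sketch of that arithmetic, using only the two counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above
counts = {1: 31540, 0: 14372}

total = sum(counts.values())
# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}
# Accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy = max(counts.values()) / total

print(f"total reviews: {total}")
print(f"positive share: {shares[1]:.1%}, negative share: {shares[0]:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

A useful classifier should clearly beat this baseline, not just 50%.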

In [46]:
# check for null values
dataset.isnull().sum()

# no null values in the data

Review       0
Sentiment    0
dtype: int64

### Data Cleaning

In [47]:
x = dataset['Review']
y = dataset['Sentiment']

In [48]:
# Create a function to clean the data
# We remove stopwords and punctuation and apply lemmatization

In [49]:
# import string
# from spacy.lang.en.stop_words import STOP_WORDS

In [50]:
from custom_tokenizer_function import CustomTokenizer

# # creating a function for data cleaning

# def text_data_cleaning(sentence):
#   doc = nlp(sentence)                         # spaCy tokenize text & call doc components, in order

#   tokens = [] # list of tokens
#   for token in doc:
#     if token.lemma_ != "-PRON-":
#       temp = token.lemma_.lower().strip()
#     else:
#       temp = token.lower_
#     tokens.append(temp)
 
#   cleaned_tokens = []
#   for token in tokens:
#     if token not in stopwords and token not in punct:
#       cleaned_tokens.append(token)
#   return cleaned_tokens

In [51]:
# if the lemma (root form) of a word is not the pronoun placeholder "-PRON-", we use the lowercased lemma
# and if the word is a pronoun, we take its lowercased text directly,
# because a pronoun tagged "-PRON-" has no real lemma

In [52]:
# let's do a test
custom_tokenizer = CustomTokenizer()
custom_tokenizer.text_data_cleaning("Hello all, It's a beautiful day outside there!")
# stopwords and punctuation are removed

['hello', 'beautiful', 'day', 'outside']
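
The module `custom_tokenizer_function` is not shown in this notebook. The sketch below is a guess at what `CustomTokenizer` could look like, assuming it simply wraps the commented-out `text_data_cleaning` function above together with spaCy's English stop words and `string.punctuation`. The constructor parameters `nlp`, `stopwords`, and `punct` are hypothetical and added only so the class can be tried with stand-ins.

```python
import string


class CustomTokenizer:
    """Hypothetical reconstruction; the real class lives in custom_tokenizer_function.py."""

    def __init__(self, nlp=None, stopwords=None, punct=string.punctuation):
        if nlp is None or stopwords is None:
            # Imported lazily so the class can also be exercised with stand-ins
            import spacy
            from spacy.lang.en.stop_words import STOP_WORDS
            nlp = nlp or spacy.load("en_core_web_sm")
            stopwords = stopwords if stopwords is not None else STOP_WORDS
        self.nlp = nlp
        self.stopwords = set(stopwords)
        self.punct = set(punct)

    def text_data_cleaning(self, sentence):
        doc = self.nlp(sentence)  # tokenize the text and run the pipeline components
        tokens = []
        for token in doc:
            if token.lemma_ != "-PRON-":
                tokens.append(token.lemma_.lower().strip())
            else:
                # pronoun placeholder: fall back to the lowercased text
                tokens.append(token.lower_)
        # drop empty strings, stopwords, and single punctuation characters
        return [t for t in tokens
                if t and t not in self.stopwords and t not in self.punct]
```

Keeping the tokenizer in an importable module rather than in a notebook cell is probably deliberate: pickled objects refer to their classes by module path, so whatever process later loads the saved pipeline needs to be able to import the same class.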

### Vectorization Feature Engineering (TF-IDF)

In [53]:
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [54]:
tfidf = TfidfVectorizer(tokenizer=custom_tokenizer.text_data_cleaning)
# tokenizer=custom_tokenizer.text_data_cleaning: tokenization is done by this function,
# which removes stopwords and punctuation, lemmatizes, and returns a list of tokens (words) ready for vectorization
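
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed IDF formula and L2 normalization that `TfidfVectorizer` applies with its default settings; the function name `tfidf_vectors` and the toy documents are illustrative and not part of the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts in this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [["good", "phone"], ["bad", "phone"], ["good", "battery"]]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in fewer documents (here `bad` and `battery`) end up with a larger weight than terms shared across documents, which is the point of the IDF factor.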

## Train the model

### Train/ Test Split

In [55]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = dataset.Sentiment, random_state = 0)

In [56]:
x_train.shape, x_test.shape
# 36729 samples in training dataset and 9183 in test dataset

((36729,), (9183,))
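
The `stratify` argument keeps the positive/negative ratio the same in both splits. The sketch below illustrates the idea in plain Python on toy labels; it is not scikit-learn's actual algorithm, and `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class contributes the same fraction to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 70/30 class imbalance
labels = [1] * 70 + [0] * 30
train_idx, test_idx = stratified_split(labels)

test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_pos / len(test_idx))
```

Without stratification, a random split of an imbalanced dataset can leave the minority class under-represented in the test set, which makes the reported metrics noisier.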

### Fit x_train and y_train

In [57]:
classifier = LinearSVC()

In [58]:
pipeline = Pipeline([('tfidf',tfidf), ('clf',classifier)])
# it will first do vectorization and then it will do classification

In [59]:
pipeline.fit(x_train, y_train)

In [60]:
# with a pipeline we don't need to vectorize x_test ourselves:
# the fitted TF-IDF step is applied automatically at prediction time
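
To show why the test data needs no separate preparation, here is a toy sketch of the pipeline idea: every transformer is fitted during `fit`, and the same fitted transformers are reapplied during `predict`. The classes `CountVectorizerToy`, `CentroidClassifierToy`, and `MiniPipeline` are simplified stand-ins invented for this illustration, not scikit-learn's implementation.

```python
class CountVectorizerToy:
    """Toy transformer: learns a vocabulary, then maps texts to count vectors."""

    def fit_transform(self, texts):
        self.vocab = sorted({w for t in texts for w in t.lower().split()})
        return self.transform(texts)

    def transform(self, texts):
        return [[t.lower().split().count(w) for w in self.vocab] for t in texts]


class CentroidClassifierToy:
    """Toy estimator: predicts the class whose mean training vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda lab: dist(x, self.centroids[lab]))
                for x in X]


class MiniPipeline:
    def __init__(self, steps):
        # all steps but the last are transformers, the last one is the estimator
        *self.transformers, self.estimator = [step for _, step in steps]

    def fit(self, X, y):
        for t in self.transformers:
            X = t.fit_transform(X)   # learn from training data only
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)       # reuse what was learned during fit
        return self.estimator.predict(X)


pipe = MiniPipeline([("vec", CountVectorizerToy()), ("clf", CentroidClassifierToy())])
pipe.fit(["great phone", "love it", "bad phone", "terrible battery"], [1, 1, 0, 0])
predictions = pipe.predict(["love great phone", "terrible bad battery"])
print(predictions)
```

Because the vocabulary is learned only inside `fit`, nothing from the test reviews leaks into the features.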

In [61]:
import joblib
joblib.dump(pipeline,'sentiment_model.pkl')

['sentiment_model.pkl']

## Check Model Performance

In [62]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [63]:
y_pred = pipeline.predict(x_test)

In [64]:
# confusion_matrix
confusion_matrix(y_test, y_pred)

array([[2398,  477],
       [ 380, 5928]], dtype=int64)

In [65]:
# classification_report
print(classification_report(y_test, y_pred))
# we are getting almost 91% accuracy

              precision    recall  f1-score   support

           0       0.86      0.83      0.85      2875
           1       0.93      0.94      0.93      6308

    accuracy                           0.91      9183
   macro avg       0.89      0.89      0.89      9183
weighted avg       0.91      0.91      0.91      9183



In [66]:
round(accuracy_score(y_test, y_pred)*100,2)

90.67
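
The numbers in the classification report follow directly from the confusion matrix. A minimal sketch of that arithmetic, using only the matrix printed above (rows are true classes, columns are predicted classes; the variable names are illustrative):

```python
# Confusion matrix copied from the output above
cm = [[2398, 477],
      [380, 5928]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for cls in (0, 1):
    true_pos = cm[cls][cls]
    predicted = cm[0][cls] + cm[1][cls]  # column sum: everything predicted as cls
    actual = sum(cm[cls])                # row sum: everything truly cls (the support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[cls] = {"precision": precision, "recall": recall,
                    "f1": f1, "support": actual}

print(f"accuracy: {accuracy:.4f}")
for cls, m in metrics.items():
    print(cls, {k: round(v, 2) for k, v in m.items()})
```

The lower recall for class 0 shows that the model misses negative reviews more often than positive ones, which overall accuracy alone hides.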

## Predict Sentiments using Model

### Simple way

In [67]:
# prediction = pipeline.predict(["Alexa is bad"])

# if prediction == 1:
#   print("Result: This review is positive")
# else:
#   print("Result: This review is negative")

### Fancy way

In [68]:
# new_review = []
# pred_sentiment = []

# while True:
  
#   # ask for a new amazon alexa review
#   review = input("Please type an Alexa review (Type 'skip' to exit) - ")

#   if review == 'skip':
#     print("See you soon!")
#     break
#   else:
#     prediction = pipeline.predict([review])

#     if prediction == 1:
#       result = 'Positive'
#       print("Result: This review is positive\n")
#     else:
#       result = 'Negative'
#       print("Result: This review is negative\n")
  
#   new_review.append(review)
#   pred_sentiment.append(result)

In [69]:
# Results_Summary = pd.DataFrame(
#     {'New Review': new_review,
#      'Sentiment': pred_sentiment,
#     })

# Results_Summary.to_csv("./predicted_sentiments.tsv", sep='\t', encoding='UTF-8', index=False)
# Results_Summary

In [70]:
model = joblib.load('sentiment_model.pkl')

In [71]:
df = pd.read_csv('sentiment.csv')

In [72]:
# Apply the sentiment model to all reviews in the dataframe in a single call
predictions = model.predict(df['Review'])

# Add the predicted sentiments to the dataframe as readable labels
df['result'] = ['positive' if label == 1 else 'negative' for label in predictions]

In [73]:
# Save the updated dataframe as a new CSV file
df.to_csv('sentiment_with_results.csv', index=False)

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saura\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\saura\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True