# ![title](amazon.jpg)

# Sentiment Analysis on Amazon reviews

# ![title](senti.jpg)

# About the Data

This dataset contains over 1,500 consumer reviews of Amazon products such as the Kindle and Fire TV Stick, provided by Datafiniti's Product Database. It includes basic product information, the rating, the review text, and more for each product. The [dataset](https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products/data) is available on Kaggle.

##  EDA and Preprocessing 

Import the necessary libraries, pandas and NumPy.

In [277]:
import pandas as pd
import numpy as np

Load the data, keep only the review text, title, and rating columns, and rename them. Only these three columns are used from here on.

In [278]:
#loading the data
data = pd.read_csv("Amazon_reviews.csv")
data = data[["reviews.text","reviews.title","reviews.rating"]]
data.columns = ["review","title","rating"]

In [279]:
#removing rows with no rating
data = data[~data.rating.isnull()]
data.head(10)

| | review | title | rating |
|---|---|---|---|
| 0 | I initially had trouble deciding between the p... | Paperwhite voyage, no regrets! | 5.0 |
| 1 | Allow me to preface this with a little history... | One Simply Could Not Ask For More | 5.0 |
| 2 | I am enjoying it so far. Great for reading. Ha... | Great for those that just want an e-reader | 4.0 |
| 3 | I bought one of the first Paperwhites and have... | Love / Hate relationship | 5.0 |
| 4 | I have to say upfront - I don't like coroporat... | I LOVE IT | 5.0 |
| 13 | Had older model, that you could text to speech... | Liked the smaller size | 4.0 |
| 14 | This is a review of the Kindle Paperwhite laun... | Superb reading device - but which one's best f... | 5.0 |
| 15 | I love my kindle! I got one for my fiance on h... | I love it! | 5.0 |
| 16 | Vraiment bon petit appareil , lger et facile d... | Un plaisir | 4.0 |
| 17 | Exactly what it is supposed to be. Works great... | Works great and I love the built-in light | 5.0 |


Create a new label column that classifies each rating as positive (rating above 3) or negative (rating of 3 or below).

In [280]:
#Define Label
data["label"] = 0
data.loc[data["rating"]>3,"label"] = 1

#Delete the "rating" column 
del data["rating"]


In [281]:
data["label"].value_counts()

1    977
0    200
Name: label, dtype: int64

As the output shows, positive reviews far outnumber negative ones (977 vs. 200). To avoid biasing the classifier toward the majority class, we keep only 200 positive reviews so that both classes are the same size.

In [282]:
data = pd.concat([data[data.label==1][:200],data[data.label==0]])
data = data.sample(frac=1)
print(data.shape)
#print(data.head())

(400, 3)
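
The cell above keeps the first 200 positive rows. As a minimal sketch of the same balancing idea with random downsampling instead, the example below uses only the standard library; the toy `rows` list, the `downsample` helper, and the fixed seed are illustrative assumptions and not part of the notebook.

```python
import random

def downsample(rows, seed=0):
    """Return a shuffled list with equal numbers of label-0 and label-1 rows.

    `rows` is a list of (text, label) pairs; the larger class is randomly
    reduced to the size of the smaller one. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

# Toy data: 5 positive and 2 negative reviews.
rows = [("great", 1), ("love it", 1), ("works well", 1), ("nice", 1),
        ("superb", 1), ("broke quickly", 0), ("poor battery", 0)]
balanced = downsample(rows)
print(len(balanced), sum(label for _, label in balanced))  # 4 2
```

Sampling at random rather than slicing avoids depending on the order of rows in the file, at the cost of needing a seed for reproducibility.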


Split the data into training (70%) and test (30%) sets.

In [283]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data,test_size=0.3, random_state=42)
print(train.shape,test.shape)

(280, 3) (120, 3)


Define a function that builds a dictionary of positive and negative words from the opinion lexicon and counts how many of each appear in a review.

In [255]:
import csv
import nltk
from scipy.sparse import csr_matrix, hstack

In [256]:
def bing_lex_feat(data):
    """Count positive and negative opinion-lexicon words in each review."""
    wordDict = dict()

    # Skip the first 35 rows of each lexicon file (the file header).
    with open('positive-words.txt', 'r') as f:
        reader = csv.reader(f)
        for _ in range(35):
            next(reader)
        for word in reader:
            wordDict[word[0]] = 'positive'

    # FYI, I had to edit the word 'inimically' in the original file as there was a weird non utf-8 character
    with open('negative-words.txt', 'r', encoding='latin') as f:
        reader = csv.reader(f)
        for _ in range(35):
            next(reader)
        for word in reader:
            wordDict[word[0]] = 'negative'

    def calc_it(review):
        # Tally lexicon hits per polarity; tokens not in the lexicon are ignored.
        tokens = nltk.tokenize.word_tokenize(review)
        counts = {'positive': 0, 'negative': 0}
        for token in tokens:
            if token in wordDict:
                counts[wordDict[token]] += 1
        return counts

    data = data.apply(calc_it)
    # Two columns per review: positive count, then negative count.
    return csr_matrix(pd.concat([data.apply(lambda x: x["positive"]), data.apply(lambda x: x["negative"])], axis=1))
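
To show what the per-review counting does, here is a small standalone sketch. It assumes a four-word stand-in lexicon and a simple regex tokenizer in place of the full word lists and `nltk.tokenize.word_tokenize`; the name `count_polarity` is hypothetical.

```python
import re

# Tiny stand-in lexicon (assumption); the notebook loads the full word lists from file.
LEXICON = {"great": "positive", "love": "positive",
           "slow": "negative", "broken": "negative"}

def count_polarity(review, lexicon=LEXICON):
    """Count lexicon hits per polarity in one review, using exact token matches."""
    tokens = re.findall(r"[A-Za-z']+", review)  # simple regex tokenizer (assumption)
    counts = {"positive": 0, "negative": 0}
    for token in tokens:
        polarity = lexicon.get(token)
        if polarity is not None:
            counts[polarity] += 1
    return counts

print(count_polarity("I love the screen, great light, but the browser is slow"))
# {'positive': 2, 'negative': 1}
print(count_polarity("Great device"))
# {'positive': 0, 'negative': 0}
```

The second call returns no hits because the lookup is an exact match and "Great" is capitalized. The notebook's function looks tokens up the same way, so lowercasing reviews before counting may be worth considering.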


# Feature Extraction 

In [285]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.review)
print(X_train_counts.shape)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print( X_train_tf.shape)


(280, 3634)
(280, 3634)


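CountVectorizer extracts every unique token from the training reviews, assigns each one a column index, and counts how often it occurs in each review. These counts are then normalized with TfidfTransformer (with use_idf=False, so only term frequencies are used). The lexicon function defined earlier adds the counts of positive and negative words in each review.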
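
As a rough illustration of what these two steps compute, the sketch below counts tokens over a shared vocabulary and then scales each row to unit Euclidean length, which is what `TfidfTransformer(use_idf=False)` does with its default `norm='l2'`. The whitespace tokenizer and the helper names are simplifying assumptions; `CountVectorizer` applies its own tokenization rules.

```python
import math

def count_matrix(docs):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in doc.lower().split():
            counts[i][index[tok]] += 1
    return vocab, counts

def l2_normalize(rows):
    """Scale each row to unit Euclidean length (all-zero rows stay zero)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm else 0.0 for v in row])
    return out

docs = ["good good kindle", "bad screen"]
vocab, counts = count_matrix(docs)
tf = l2_normalize(counts)
print(vocab)   # ['bad', 'good', 'kindle', 'screen']
print(counts)  # [[0, 2, 1, 0], [1, 0, 0, 1]]
print(tf[0])   # first row scaled so its squared entries sum to 1
```

Normalizing by row length keeps long reviews from dominating short ones simply because they contain more words.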

In [286]:
train_lex_dict_feat = bing_lex_feat(train.review)
train_feat = hstack([X_train_tf,train_lex_dict_feat])
#print(X_train_tf)
#print(train_lex_dict_feat)
#print(train_feat)
print(train_feat.shape)


(280, 3636)


The lexicon function is applied to the training reviews, and hstack concatenates its two count columns with the normalized term-frequency matrix, giving 3634 + 2 = 3636 features per review.
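
A minimal sketch of the column-wise concatenation, assuming NumPy and SciPy are installed; the two small matrices are made-up stand-ins for the term-frequency matrix and the lexicon counts.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in term-frequency matrix: 2 reviews x 3 vocabulary terms (made-up values).
tf = csr_matrix(np.array([[0.6, 0.8, 0.0],
                          [0.0, 0.0, 1.0]]))
# Stand-in lexicon counts: 2 reviews x (positive count, negative count).
lex = csr_matrix(np.array([[2, 0],
                           [0, 1]]))

# hstack places the columns side by side; the number of rows must match.
combined = hstack([tf, lex]).tocsr()
print(combined.shape)      # (2, 5)
print(combined.toarray())
```

Note that the lexicon columns hold raw counts while the term-frequency columns are normalized, so the two feature groups are on different scales.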

# Data Modelling

In [287]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier().fit(train_feat, train.label)

X_test_counts = count_vect.transform(test.review)
X_test_tf = tf_transformer.transform(X_test_counts)
test_lex_dict_feat = bing_lex_feat(test.review)
test_feat = hstack([X_test_tf,test_lex_dict_feat])

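predicted1 = clf.predict(test_feat)
# The original cell printed and scored a stale `predicted` variable rather than
# predicted1, so the array and accuracy recorded below do not reflect this model.
print(predicted1)

print("Accuracy Of RandomForest:", np.mean(predicted1 == test.label))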



[0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 0
 1 0 1 1 0 0 0 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 0
 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 0 1 1 1]
Accuracy Of RandomForest: 0.466666666667


In [288]:
clf2 = SGDClassifier().fit(train_feat, train.label)

X_test_counts = count_vect.transform(test.review)
X_test_tf = tf_transformer.transform(X_test_counts)
test_lex_dict_feat = bing_lex_feat(test.review)
test_feat = hstack([X_test_tf,test_lex_dict_feat])

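predicted2 = clf2.predict(test_feat)
# The original cell printed a stale `predicted` variable; the array recorded below came from it.
print(predicted2)

print("Accuracy Of SGDC:", np.mean(predicted2 == test.label))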




[0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 0
 1 0 1 1 0 0 0 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 0
 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 0 1 1 1]
Accuracy Of SGDC: 0.6


In [289]:
clf3 = MultinomialNB().fit(train_feat, train.label)

X_test_counts = count_vect.transform(test.review)
X_test_tf = tf_transformer.transform(X_test_counts)
test_lex_dict_feat = bing_lex_feat(test.review)
test_feat = hstack([X_test_tf,test_lex_dict_feat])

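predicted3 = clf3.predict(test_feat)
# The original cell printed a stale `predicted` variable; the array recorded below came from it.
print(predicted3)

print("Accuracy Of Naive Bayes:", np.mean(predicted3 == test.label))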


[0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 0
 1 0 1 1 0 0 0 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 0
 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 0 1 1 1]
Accuracy Of Naive Bayes: 0.775


For the given random_state, Multinomial Naive Bayes gave the highest recorded accuracy (0.775). Let's explore its performance further using the metrics module.

In [290]:
from sklearn import metrics
print(metrics.classification_report(test.label, predicted3))

             precision    recall  f1-score   support

          0       0.88      0.63      0.74        60
          1       0.71      0.92      0.80        60

avg / total       0.80      0.78      0.77       120



The F1 score is the harmonic mean of precision and recall. Our model achieved an average F1 score of 0.77 (0.74 for negative reviews and 0.80 for positive reviews), which is a reasonable result for a model trained on 280 reviews.
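
To make the report concrete, the sketch below recomputes precision, recall, and F1 from confusion counts. The notebook does not print these counts; they are inferred here from the report (60 reviews per class, recall of 0.63 and 0.92) and should be read as an assumption that happens to reproduce the reported figures. The helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts (assumption): 38 of 60 negatives and 55 of 60 positives correct.
neg_correct, neg_missed = 38, 22
pos_correct, pos_missed = 55, 5

# A missed positive is a false positive for the negative class, and vice versa.
neg = prf(tp=neg_correct, fp=pos_missed, fn=neg_missed)
pos = prf(tp=pos_correct, fp=neg_missed, fn=pos_missed)
accuracy = (neg_correct + pos_correct) / 120

print([round(v, 2) for v in neg])  # [0.88, 0.63, 0.74]
print([round(v, 2) for v in pos])  # [0.71, 0.92, 0.8]
print(accuracy)                    # 0.775
```

Under these counts the model rarely mislabels a positive review but sends about a third of the negative reviews to the positive class, which is why recall for class 0 is the weakest number in the report.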

## Insights

 * From the labeled data we found that positive reviews outnumber negative ones (977 vs. 200) across the different products and services offered by Amazon.
 * We found that for the past two years the sentiment of customers towards Amazon, its products, and its services has been positive.
 * Our model can be useful for any business that wants to evaluate its products. Since our dataset had reviews for different products, running this model on a single product would help the product manager get better insight into how it is performing in the market.
 * This project uses a unigram model, but it could be extended to use bigrams.
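
As a sketch of what moving from unigrams to bigrams would mean, the hypothetical helper below lists the n-grams of a tokenized review. In scikit-learn the comparable change would be passing `ngram_range=(1, 2)` to `CountVectorizer`; this was not tried in the notebook.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list, in order, as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not good at all".split()
print(ngrams(tokens, 1))  # ['not', 'good', 'at', 'all']
print(ngrams(tokens, 2))  # ['not good', 'good at', 'at all']
```

With unigrams alone, "good" counts toward the positive class even here; the bigram "not good" gives the classifier a feature that carries the negation.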

## Recommendations

 * There are situations where feedback for a product is given only as review text, without ratings. On social media such as Twitter and Facebook, where customers also post reviews, it is difficult for a product manager to know which reviews are good or bad. It is cumbersome to read every review to understand how a product is doing in the market.
 * Sentiment analysis plays an integral role here in understanding the customer experience and suggesting suitable solutions for the business.
 * Future scope for this project involves further classifying emotions (angry, excited, etc.) through more feature extraction methods.

## References

1. Opinion Lexicon: a list of English positive and negative opinion words or sentiment words (around 6,800 words), compiled by Hu and Liu (KDD 2004). Link - https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

2. Feature Extraction (CountVectorizer and TfidfTransformer) - http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


Compiled By :
Sanketh Bhagavanthi,
Shahshidhar Channabasavaraj,
Vinay Murthy.