<h1>Amazon Reviews Sentiment Analysis - <span style='color: blue;'>Satadru Mallick</span></h1>

## Read the data

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv('Amazon_Unlocked_Mobile.csv')

## Drop incomplete rows

In [2]:
data.isnull().sum(axis = 0)

Product Name        0
Brand Name      65171
Price            5933
Rating              0
Reviews            62
Review Votes    12296
dtype: int64

Several columns contain NaN (missing) values, so we drop every row that has at least one.

In [3]:
data = data.dropna()
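Calling `dropna()` with no arguments removes a row if any column is missing, including columns the model never reads. As a hedged alternative, the sketch below compares it with `dropna(subset=...)` restricted to the columns used later. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are invented).
toy = pd.DataFrame({
    'Brand Name': ['Samsung', np.nan, 'Nokia', np.nan],
    'Rating': [5, 1, 4, 2],
    'Reviews': ['Very pleased', 'Stopped working', np.nan, 'Too slow'],
})

# Drops a row if ANY column is missing.
dropped_any = toy.dropna()

# Drops a row only if one of the columns the model uses is missing.
dropped_subset = toy.dropna(subset=['Rating', 'Reviews'])

print(len(dropped_any), len(dropped_subset))
```

Whether keeping the extra rows helps depends on the data; the notebook itself keeps the simpler `dropna()`.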

## Insert a 'Sentiment' column

The new column is 1 where 'Rating' > 2 and -1 otherwise.<br>1 denotes positive sentiment and -1 denotes negative sentiment.

In [4]:
data['Sentiment'] = np.where(data.Rating > 2, 1, -1)
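To make the threshold concrete, here is a minimal sketch that applies the same `np.where` rule to one row per possible rating. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# One row for each possible star rating.
toy = pd.DataFrame({'Rating': [1, 2, 3, 4, 5]})

# Same rule as above: ratings above 2 are positive, the rest negative.
toy['Sentiment'] = np.where(toy['Rating'] > 2, 1, -1)

print(toy)
```

Note that under this rule a 3-star review is labelled positive.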

## Display first 5 rows

In [5]:
data.head()

| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes | Sentiment |
|---|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 | 1 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 | 1 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 | 1 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 | 1 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 | 1 |


## Divide dataset for testing and training

Here 'Reviews' is our feature column and 'Sentiment' is the target column.<br>X = 'Reviews'<br>y = 'Sentiment'<br>We use train_test_split from sklearn.model_selection to split the dataset into training and test sets, holding out 20% of the rows for testing.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.Reviews, data.Sentiment, test_size=0.2, random_state=0)
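The split above is purely random. If the two sentiment classes turn out to be unevenly represented (not checked in this notebook), passing `stratify` keeps the class proportions similar in both parts. The sketch below illustrates this on invented data; `reviews` and `sentiment` are hypothetical stand-ins for the real columns.

```python
from sklearn.model_selection import train_test_split

# 20 made-up samples: 16 positive, 4 negative.
reviews = [f'review {i}' for i in range(20)]
sentiment = [1] * 16 + [-1] * 4

# stratify=sentiment asks for the same class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, sentiment, test_size=0.2, random_state=0, stratify=sentiment)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```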

## Prepare bag of words model

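We use CountVectorizer to count how often each term occurs in each review.<br>Setting min_df = 5 drops terms that are too rare: a term is kept only if it appears in at least 5 reviews (min_df counts documents, not total occurrences).<br>We also want bigrams in our bag of words, so we set ngram_range to (1, 2), which produces both single words and pairs of adjacent words.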

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df = 5, ngram_range = (1, 2)).fit(X_train)
vectorized = vectorizer.transform(X_train)
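The effect of `min_df` and `ngram_range` is easier to see on a tiny corpus. The following sketch uses four made-up documents and `min_df=2` (instead of 5, only so that the toy corpus is not emptied); `docs` and `toy_vectorizer` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four invented "reviews".
docs = ['not good', 'not bad', 'good phone', 'not good phone']

# Keep unigrams and bigrams that occur in at least 2 documents.
toy_vectorizer = CountVectorizer(min_df=2, ngram_range=(1, 2)).fit(docs)

# Surviving vocabulary, in column order.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# Count vector for a new text, one column per vocabulary entry.
row = toy_vectorizer.transform(['not good']).toarray()
print(row)
```

Here 'bad' and 'not bad' are discarded because each occurs in only one document, while the bigram 'not good' survives and gets its own column. This is what lets the model treat 'not good' differently from 'good'.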

## Perform Logistic Regression

In [8]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Calculate Model Accuracy Score

In [9]:
from sklearn.metrics import accuracy_score

pred = logreg.predict(vectorizer.transform(X_test))
accuracy_score(y_test, pred)

0.9643770469738436
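Accuracy alone can look good even when one class is predicted poorly, especially if the labels are imbalanced (this has not been verified for this dataset). As a hedged illustration on invented labels, the sketch below pairs `accuracy_score` with `confusion_matrix`; `y_true` and `y_pred` are hypothetical and unrelated to the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented ground truth: 8 positive, 2 negative.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1]

# A useless "model" that always predicts positive.
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels, in the order [-1, 1].
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

print(acc)
print(cm)
```

The same two calls can be applied to `y_test` and `pred` from the cell above to see how the errors are distributed between the two classes.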

## Helper function to make predictions easier to read

In [10]:
def prediction(review):
    # predict_proba columns follow logreg.classes_, i.e. [-1, 1]:
    # p[0][0] is P(negative) and p[0][1] is P(positive).
    p = logreg.predict_proba(vectorizer.transform([review]))
    if p[0][0] > 0.5:
        print('\nSentiment type: Negative, Probability: ', p[0][0] * 100,' %')
    else:
        print('\nSentiment type: Positive, Probability: ', p[0][1] * 100,' %')
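The function above relies on column 0 of `predict_proba` being the negative class. A possible variant, sketched below under the assumption of a small toy model, looks the label up in `classes_` and returns the result instead of printing it. The names `toy_vec`, `toy_clf` and `predict_sentiment`, and the six training texts, are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set, only to make the sketch runnable.
texts = ['great phone', 'love it', 'works great',
         'terrible phone', 'hate it', 'stopped working']
labels = [1, 1, 1, -1, -1, -1]

toy_vec = CountVectorizer().fit(texts)
toy_clf = LogisticRegression().fit(toy_vec.transform(texts), labels)

def predict_sentiment(review, clf=toy_clf, vec=toy_vec):
    """Return (label, probability) for the most likely class."""
    proba = clf.predict_proba(vec.transform([review]))[0]
    best = proba.argmax()
    # classes_ gives the label that belongs to each probability column.
    return clf.classes_[best], proba[best]

label, prob = predict_sentiment('great phone, love it')
print(label, prob)
```

Returning the values makes the helper easier to reuse, for example to score a whole list of reviews.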

## Test the trained model with other inputs

In [11]:
prediction('''Unable to get product support.''')


Sentiment type: Negative, Probability:  50.506803381645405  %


In [12]:
prediction('''Not bad. Quite useful.''')


Sentiment type: Positive, Probability:  91.15162495588062  %


In [13]:
prediction('''Not good.''')


Sentiment type: Negative, Probability:  72.39458374788644  %
