## Introduction

Customer reviews are one of the main things that attract shoppers to Amazon. They help shoppers understand almost every detail of a product. Since consumers cannot physically inspect a product when shopping online, reviews are what they rely on to judge it.

A study conducted by Dimensional Research claims that 90% of online consumers believe their purchasing choices are influenced by product reviews.

The dataset I'll be using was aggregated from two review websites, www.trustpilot.com and www.consumeraffairs.com, and contains each customer's comment and rating of a particular service.

In this project, I'll build a model that uses Natural Language Processing (NLP) to perform sentiment analysis (predicting whether a comment is positive or negative) on Amazon customers' comments about its services.

Specifically, I'll:

- Use Natural Language Processing to prepare the data for Machine Learning
- Train a classifier on the prepared data
- Evaluate the performance of the model

In [1]:
import pandas as pd
Amazon = pd.read_csv("Amazon_Reviews.csv")

In [4]:
Amazon.info()
Amazon

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 2 columns):
Customer_comments    292 non-null object
Customer_Rating      292 non-null float64
dtypes: float64(1), object(1)
memory usage: 4.6+ KB


Unnamed: 0,Customer_comments,Customer_Rating
0,"Sends defective items, bad support, condescend...",2.0
1,Amazon is normally good for shopping,4.0
2,There are thieves they asked for me to return ...,1.0
3,Amazon's customer review policy is fraught wit...,2.0
4,Convinence but unethical at it's best decline,2.0
5,It' s a good initiative that amazon is tie up ...,5.0
6,The customer service is useless if needed,2.0
7,Customer service doesn't raised ticket of desi...,1.0
8,Amazon is great example of a company that star...,2.0
9,Only purchased 2 items and this one is lost. N...,2.0


## Finding Missing Values

In [5]:
Amazon.isnull().sum()

Customer_comments    0
Customer_Rating      0
dtype: int64

## Data Cleaning & Engineering

I'll collapse the Customer_Rating column into two classes: positive (1) and negative (0).

In [6]:
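def sentiments(rating):
    # Ratings below 3 are negative (0), ratings above 3 are positive (1).
    # A rating of exactly 3 falls through and returns None.
    if rating < 3:
        return 0
    elif rating > 3:
        return 1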

In [8]:
Amazon["sentiments"] = Amazon["Customer_Rating"].apply(sentiments)
Amazon["sentiments"].value_counts(normalize=True) * 100

0    58.561644
1    41.438356
Name: sentiments, dtype: float64
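
The mapping above leaves a rating of exactly 3 unlabelled, because the function returns None in that case. Below is a minimal sketch of one way to make that explicit, assuming neutral reviews should simply be dropped. The toy `ratings` series and the name `label_sentiment` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy ratings, made up for illustration only.
ratings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

def label_sentiment(rating):
    """Return 0 for ratings below 3, 1 for ratings above 3, None otherwise."""
    if rating < 3:
        return 0
    if rating > 3:
        return 1
    return None  # neutral rating: no label

labels = ratings.apply(label_sentiment)
# Drop neutral rows so only 0/1 labels remain, then cast to int.
labels = labels.dropna().astype(int)
print(labels.tolist())
```

Dropping the neutral rows before training keeps missing labels out of the classifier's target column.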

## Tokenizing & Preprocessing the Customer Comments

In [9]:
tokenized_headlines = []
for item in Amazon["Customer_comments"]:
    tokenized_headlines.append(item.split())

- Some of the split words are closely related but spelled differently

In [10]:
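# Whole-word patterns: with regex=True, replace() matches inside each comment
# rather than only cells that equal the key exactly.
mapping_dict = {
    "Customer_comments": {
        r"\bAmazon's\b": "Amazon",
        r"\bsuck\b": "sucks",
        r"\btrusted\b": "trust"
    }
}

Amazon = Amazon.replace(mapping_dict, regex=True)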

In [12]:
tokenized_headlines = []
for item in Amazon["Customer_comments"]:
    tokenized_headlines.append(item.split())


- Punctuation is also attached to some of the split words, so the same word can show up as several different tokens

In [13]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")", "!"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
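
The same cleaning step can be sketched with a translation table instead of nested loops. This is an alternative under stated assumptions, not the notebook's method: it strips every character in `string.punctuation` (a broader set than the list above) plus the curly apostrophe, and the names `strip_table` and `clean_tokens` are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character,
# plus the curly apostrophe that appears in the comments.
strip_table = str.maketrans("", "", string.punctuation + "’")

def clean_tokens(comment):
    """Lower-case a comment, split on whitespace, and strip punctuation."""
    tokens = [tok.lower().translate(strip_table) for tok in comment.split()]
    # Discard tokens that were pure punctuation and are now empty.
    return [tok for tok in tokens if tok]

print(clean_tokens("Amazon's service is GOOD, but... slow!"))
```

Unlike the loop above, this version also drops tokens that become empty after stripping, so an empty string never ends up as a feature.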

In [14]:
import numpy as np

# single_tokens: tokens seen at least once.
# unique_tokens: tokens seen at least twice; only these become feature columns.
unique_tokens = []
single_tokens = []
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            unique_tokens.append(token)

# One row per comment, one column per repeated token
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # .loc avoids chained indexing, which may update a copy instead of counts
            counts.loc[i, token] += 1

In [15]:
counts

Unnamed: 0,for,and,they,return,it,is,i,a,the,review,...,loyal,dominant,display,complain,variety,satisfaction,favour,prefer,switching,thievesfraught
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,4,3,2,2,1,2,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,1,0,3,1,2,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,1,1,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7,0,4,4,0,0,2,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,1,0,0,0,2,0,2,1,0,...,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
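
The count matrix above is a bag-of-words table: one row per comment, one column per repeated token. A compact sketch of the same idea on toy data, assuming `collections.Counter` and pandas; `docs` and `counts_demo` are made-up names and the comments are invented for illustration:

```python
from collections import Counter
import pandas as pd

# Toy tokenized comments, made up for illustration.
docs = [
    ["good", "service"],
    ["bad", "service", "bad", "support"],
    ["good", "price"],
]

# One Counter per comment; pandas aligns the keys into columns.
counts_demo = pd.DataFrame([Counter(doc) for doc in docs]).fillna(0).astype(int)

# Keep only tokens that occur at least twice across all comments.
totals = counts_demo.sum(axis=0)
counts_demo = counts_demo.loc[:, totals >= 2]
print(counts_demo)
```

Here "support" and "price" occur only once, so they are dropped, mirroring how the notebook ignores tokens that never repeat.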


In [17]:
word_counts = counts.sum(axis=0)
word_counts

for                42
and               116
they               76
return              4
it                 28
is                 84
i                 157
a                 127
the               137
review              2
to                 94
me                 10
but                11
unethical           8
good               56
that               10
amazon            163
with              151
its                13
digital             2
customer           71
service           172
doesnt              2
on                 15
will               24
call                4
from               25
their              10
of                 64
up                  5
                 ... 
choose              4
describe            2
brick               9
mortar              9
stores              9
steals             10
angry              10
extremely           7
free                9
delayed             5
disgusted           5
unilateral          5
arrogant            5
problems            5
enjoyed   

### Removing stopwords


In [18]:
counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 90)]
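
The filter above treats very frequent tokens as stopwords and very rare tokens as noise. A small sketch using a handful of totals copied from the word_counts output shows what the two thresholds keep; the name `sample_counts` is only for illustration:

```python
import pandas as pd

# A few totals copied from the word_counts output above.
sample_counts = pd.Series(
    {"the": 137, "service": 172, "good": 56, "return": 4, "angry": 10}
)

# Same thresholds as the notebook: keep totals between 5 and 90 inclusive.
kept = sample_counts[(sample_counts >= 5) & (sample_counts <= 90)]
print(kept.index.tolist())
```

Note that a purely frequency-based cut-off also removes frequent content words such as "service", not only grammatical stopwords such as "the".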

In [19]:
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

## Selecting Relevant Features

In [25]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

all_X = counts
all_y = Amazon["sentiments"]

# Recursive feature elimination with 5-fold cross-validation
lr = LogisticRegression()
selector = RFECV(lr, cv=5)
selector.fit(all_X, all_y)

# Columns that the selector kept
Relevant_features = all_X.columns[selector.support_]

## Training a LogisticRegression Model

Because the training data (counts) is a sparse matrix of word counts, a LogisticRegression model is more suitable here than a decision tree.

In [28]:
X = counts[Relevant_features]
Y = Amazon["sentiments"]
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression(class_weight="balanced")
predictions = cross_val_predict(lr, X, Y, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (Amazon["sentiments"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (Amazon["sentiments"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (Amazon["sentiments"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (Amazon["sentiments"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print("True Positive Rate = {}".format(tpr))
print("False Positive Rate = {}".format(fpr))

True Positive Rate = 0.9752066115702479
False Positive Rate = 0.07017543859649122


- The model correctly identifies positive comments, given the set of words in each document (comment), at a rate of 97.5%
- The model also correctly identifies negative comments at a rate of about 93% (1 minus the false positive rate of 7%)
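
The rates reported above can also be read off a confusion matrix. A minimal sketch, assuming scikit-learn's `confusion_matrix` and toy labels made up for illustration (`y_true` and `y_pred` are not the notebook's data):

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)  # share of positive comments correctly identified
fpr = fp / (fp + tn)  # share of negative comments wrongly flagged as positive
tnr = tn / (tn + fp)  # share of negative comments correctly identified
print(tpr, fpr, tnr)
```

The true negative rate is always 1 minus the false positive rate, which is how a false positive rate translates into a rate for negative comments.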