# Analysis of E-Commerce Dataset

## Overview

- The aim of this project is to perform sentiment analysis on an E-Commerce dataset. The data is taken from https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews. This project will demonstrate sentiment analysis with two machine learning classifiers.

- The two classifiers used here are a Naive Bayes classifier and a Support Vector Machine.

- We will also use SentimentIntensityAnalyzer (VADER) from the nltk.sentiment.vader module.

#### Steps:

1. Import and process data
2. Exploratory Data Analysis and further data scraping
3. Sentiment Analysis
 - 3.1 Naive Bayes
 - 3.2 Support Vector Machine
 - 3.3 SentimentIntensityAnalyzer 



## 1. Imports and processing dataset

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('Clothing E-Commerce Reviews.csv')

In [5]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [6]:
len(df)

23486

### Removing NaN values 

In [8]:
# Keep only the columns that matter for this project; .copy() avoids a
# SettingWithCopyWarning when rows are dropped in place later
df = df[['Review Text', 'Recommended IND']].copy()

In [10]:
# Count the NaN values in each column:
df.isnull().sum()

Review Text        845
Recommended IND      0
dtype: int64

In [11]:
df.dropna(inplace = True)

In [12]:
# Only 845 of the 23486 entries were removed, so plenty of data remains for the analysis
len(df)

22641

In [13]:
df['Recommended IND'].value_counts()

1    18540
0     4101
Name: Recommended IND, dtype: int64

In [14]:
18540/(18540+4101) *100

81.88684245395521

As seen, the dataset is rather imbalanced between positive and negative reviews. If we simply predicted that every review is positive, we would already reach an accuracy of about 81.9%. Our models should therefore beat this baseline by a clear margin, and accuracy alone will not tell us how well negative reviews are detected.
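The majority-class baseline can be reproduced with a short sketch. It rebuilds a label array from the class counts reported above (18540 and 4101) and scores scikit-learn's `DummyClassifier` on it. The placeholder feature matrix is an assumption made only so that the estimator has something to fit; it is not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts reported above (1 = recommended).
y = np.array([1] * 18540 + [0] * 4101)
# Placeholder features (assumption): the dummy model ignores them entirely.
X = np.zeros((len(y), 1))

# Always predict the most frequent class and measure the resulting accuracy.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
```

Any real classifier should be compared against this figure rather than against 50%.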

## 2. Exploratory Data Analysis and further data scraping

### Using stemming and stopword removal to reduce reviews to root words

In [15]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

nltk.download('stopwords')

# Build the stemmer and the stop-word set once, outside the loop
s_stemmer = SnowballStemmer(language='english')
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')  # keep 'not' because it carries negation

corpus = []
for review in df['Review Text']:
    # Lowercase, split on whitespace, drop stop words, and stem each remaining word
    words = review.lower().split()
    words = [s_stemmer.stem(word) for word in words if word not in all_stopwords]
    corpus.append(' '.join(words))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jiajie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
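
To make the effect of this preprocessing concrete, here is a minimal sketch applied to a single made-up sentence. The helper `clean_review` and the tiny `demo_stopwords` set are hypothetical and exist only for illustration, so that no corpus download is needed; the notebook itself uses NLTK's full English stop-word list with 'not' removed.

```python
from nltk.stem.snowball import SnowballStemmer

# Hypothetical mini stop-word list (assumption), used only for this demo.
demo_stopwords = {"the", "were", "is", "a", "and"}
stemmer = SnowballStemmer(language="english")

def clean_review(text):
    """Lowercase, split on whitespace, drop stop words, and stem the rest."""
    tokens = text.lower().split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in demo_stopwords)

cleaned = clean_review("The dresses were not flattering")
print(cleaned)
```

Keeping 'not' matters here: without it, a negative sentence like this one would lose its only negation cue.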


In [16]:
x = pd.DataFrame(corpus, columns=['Review Text'])

### Train Test Split

In [17]:
from sklearn.model_selection import train_test_split

X = x['Review Text']
y = df['Recommended IND']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=900)
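
The split above is not stratified, so the class ratio in the test set can drift from that of the full data (the test set used later holds 6193 positive reviews out of 7472, about 82.9%, against 81.9% overall). As an optional alternative that the notebook does not use, the sketch below passes `stratify` to `train_test_split`. The labels are synthetic, rebuilt from the class counts, and `X_demo` is a stand-in for the review texts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the cleaned dataset.
y_demo = np.array([1] * 18540 + [0] * 4101)
X_demo = np.arange(len(y_demo))  # stand-in for the review texts (assumption)

# stratify=y_demo keeps the class proportions equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=900, stratify=y_demo
)

print(f"Positive share, full data: {y_demo.mean():.4f}")
print(f"Positive share, test set:  {y_te.mean():.4f}")
```

Using it in the notebook would change the reported numbers slightly, since a different set of reviews would land in the test set.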

## 3.1 Build pipelines to vectorize the data, then train a Naive Bayes model

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes: TF-IDF features fed into a multinomial Naive Bayes classifier
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# Linear SVC: the same TF-IDF features fed into a linear support vector classifier
text_clf_lsvc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

In [19]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
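
What the pipeline does when it is fitted can be seen on a toy example. The four reviews and their labels below are made up for illustration and are not taken from the dataset; the sketch only shows that the vectorizer learns a vocabulary during `fit` and reuses it when `predict` receives raw text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already-stemmed toy reviews (1 = recommended, 0 = not recommended).
toy_texts = [
    "love dress fit perfect",
    "great color love",
    "not fit return",
    "poor qualiti return",
]
toy_labels = [1, 1, 0, 0]

toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
toy_clf.fit(toy_texts, toy_labels)

# The fitted vectorizer holds the learned vocabulary; the pipeline reuses it
# to transform unseen text before the classifier predicts.
vocab_size = len(toy_clf.named_steps["tfidf"].vocabulary_)
toy_pred = toy_clf.predict(["love color"])
print(vocab_size, toy_pred)
```

Because vectorizer and classifier live in one object, the test texts never influence the vocabulary or the IDF weights.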

In [20]:
predictions_nb = text_clf_nb.predict(X_test)

In [21]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions_nb))

[[  45 1234]
 [   3 6190]]


In [22]:
# Print a classification report
print(metrics.classification_report(y_test,predictions_nb))

              precision    recall  f1-score   support

           0       0.94      0.04      0.07      1279
           1       0.83      1.00      0.91      6193

    accuracy                           0.83      7472
   macro avg       0.89      0.52      0.49      7472
weighted avg       0.85      0.83      0.77      7472



In [23]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions_nb))

0.834448608137045


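From the results, we can see that Naive Bayes with a TF-IDF vectorizer is a poor choice for identifying negative reviews: it catches only 45 of the 1279 negative reviews in the test set (recall of 0.04) and predicts almost everything as positive. Its overall accuracy of 83.4% is barely above the 82.9% share of positive reviews in the test set (6193 of 7472).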
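
The per-class figures in the classification report can be recomputed directly from the confusion matrix reported above, which makes the weakness on the negative class explicit. This is a small sketch using only the four numbers from that matrix; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes (0, 1),
# columns are predicted classes (0, 1).
cm_nb = np.array([[45, 1234],
                  [3, 6190]])

recall = cm_nb.diagonal() / cm_nb.sum(axis=1)      # correct / all true members of the class
precision = cm_nb.diagonal() / cm_nb.sum(axis=0)   # correct / all predictions of the class
accuracy = cm_nb.diagonal().sum() / cm_nb.sum()
majority_share = cm_nb.sum(axis=1).max() / cm_nb.sum()  # accuracy of always predicting class 1

print("recall per class:   ", recall.round(2))
print("precision per class:", precision.round(2))
print(f"accuracy: {accuracy:.4f}, majority share: {majority_share:.4f}")
```

The high precision on class 0 comes from only 48 reviews being predicted negative at all, 45 of them correctly.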

## 3.2 Train and fit a model using Support Vector Machine

In [24]:
text_clf_lsvc.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [25]:
# Form a prediction set
predictions_svm = text_clf_lsvc.predict(X_test)

In [26]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions_svm))

[[ 780  499]
 [ 311 5882]]


In [27]:
# Print a classification report
print(metrics.classification_report(y_test,predictions_svm))

              precision    recall  f1-score   support

           0       0.71      0.61      0.66      1279
           1       0.92      0.95      0.94      6193

    accuracy                           0.89      7472
   macro avg       0.82      0.78      0.80      7472
weighted avg       0.89      0.89      0.89      7472



In [28]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions_svm))

0.8915952890792291


## 3.3 Sentiment Analysis using SentimentIntensityAnalyzer

In [29]:
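import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # VADER needs its lexicon; skipped if already up to date
sid = SentimentIntensityAnalyzer()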

In [30]:
# Score every review, keep the compound score, and label it: compound >= 0 is 'pos', otherwise 'neg'
df['scores'] = df['Review Text'].apply(lambda review: sid.polarity_scores(review))
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')



In [31]:
df['actual_score'] = df['Recommended IND'].apply(lambda c: 'pos' if c ==1 else 'neg')

In [32]:
df.head()

| Unnamed: 0 | Review Text | Recommended IND | scores | compound | comp_score | actual_score |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Absolutely wonderful - silky and sexy and comf... | 1 | {'neg': 0.0, 'neu': 0.272, 'pos': 0.728, 'comp... | 0.8932 | pos | pos |
| 1 | Love this dress! it's sooo pretty. i happene... | 1 | {'neg': 0.0, 'neu': 0.664, 'pos': 0.336, 'comp... | 0.9729 | pos | pos |
| 2 | I had such high hopes for this dress and reall... | 0 | {'neg': 0.027, 'neu': 0.792, 'pos': 0.181, 'co... | 0.9427 | pos | neg |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | 1 | {'neg': 0.226, 'neu': 0.34, 'pos': 0.434, 'com... | 0.5727 | pos | pos |
| 4 | This shirt is very flattering to all due to th... | 1 | {'neg': 0.0, 'neu': 0.7, 'pos': 0.3, 'compound... | 0.9291 | pos | pos |
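
The labelling rule applied to the compound score can be checked on the five rows shown above without running VADER again. In this sketch the compound scores and true labels are copied from that output; the helper `label_from_compound` and its `threshold` parameter are hypothetical names introduced for illustration.

```python
# Compound scores and true labels copied from the five rows shown above.
compounds = [0.8932, 0.9729, 0.9427, 0.5727, 0.9291]
actual = ["pos", "pos", "neg", "pos", "pos"]

def label_from_compound(compound, threshold=0.0):
    """Map a VADER compound score in [-1, 1] to 'pos' or 'neg'."""
    return "pos" if compound >= threshold else "neg"

predicted = [label_from_compound(c) for c in compounds]
# Row positions where the rule disagrees with the customer's recommendation.
mismatches = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
print(predicted)
print("Misclassified rows:", mismatches)
```

Row 2 illustrates the main failure mode: a review that was not recommended still receives a strongly positive compound score of 0.9427.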


In [34]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [35]:
accuracy_score(df['actual_score'],df['comp_score'])

0.8375955125656994

In [36]:
print(classification_report(df['actual_score'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.65      0.23      0.34      4101
         pos       0.85      0.97      0.91     18540

    accuracy                           0.84     22641
   macro avg       0.75      0.60      0.62     22641
weighted avg       0.81      0.84      0.80     22641



In [37]:
# Print the first five reviews that VADER labelled as negative
neg_reviews = df.loc[df['comp_score'] == 'neg', 'Review Text']
print('Some negative reviews:')
print('')
for review in neg_reviews.head(5):
    print(review)
    print('\n')

Some negative reviews:

I ordered this in carbon for store pick up, and had a ton of stuff (as always) to try on and used this top to pair (skirts and pants). everything went with it. the color is really nice charcoal with shimmer, and went well with pencil skirts, flare pants, etc. my only compaint is it is a bit big, sleeves are long and it doesn't go in petite. also a bit loose for me, but no xxs... so i kept it and wil ldecide later since the light color is already sold out in hte smallest size...


I'm 5"5' and 125 lbs. i ordered the s petite to make sure the length wasn't too long. i typically wear an xs regular in retailer dresses. if you're less busty (34b cup or smaller), a s petite will fit you perfectly (snug, but not tight). i love that i could dress it up for a party, or down for work. i love that the tulle is longer then the fabric underneath.


3 tags sewn in, 2 small (about 1'' long) and 1 huge (about 2'' x 3''). very itchy so i cut them out. then the thread left behind

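Based on these findings, the support vector machine is the best of the three approaches. It reaches the highest accuracy (89.2%, against 83.4% for Naive Bayes and 83.8% for VADER) and by far the best recall (0.61) and F1-score (0.66) on negative reviews. Naive Bayes has a higher precision on negative reviews (0.94), but only because it labels almost no review as negative. Note that VADER was evaluated on all 22641 reviews, whereas the two classifiers were evaluated on the 7472-review test set.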
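
The comparison can be laid out side by side with a small sketch. All figures are copied from the three classification reports above; the column names are chosen here for illustration, and the VADER row is not strictly comparable because it was scored on the full dataset.

```python
import pandas as pd

# Figures copied from the classification reports above. VADER was scored on
# all 22641 reviews; the two classifiers on the 7472-review test set.
results = pd.DataFrame(
    {
        "accuracy": [0.83, 0.89, 0.84],
        "neg_precision": [0.94, 0.71, 0.65],
        "neg_recall": [0.04, 0.61, 0.23],
        "neg_f1": [0.07, 0.66, 0.34],
    },
    index=["Naive Bayes", "Linear SVC", "VADER"],
)

print(results)
# For each metric, report which approach scores highest.
best_by_metric = results.idxmax()
print(best_by_metric)
```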