# Content

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

**Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.

**Age**: Positive Integer variable of the reviewer's age.

**Title**: String variable for the title of the review.

**Review Text**: String variable for the review body.

**Rating**: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.

**Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

**Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.

**Division Name**: Categorical name of the product's high-level division.

**Department Name**: Categorical name of the product's department.

**Class Name**: Categorical name of the product's class.

![](https://c1.wallpaperflare.com/preview/191/476/40/fashion-clothing-shop-clothes.jpg)

The objective of our analysis today is to look through the reviews, determine whether they are positive or negative (sentiment analysis) and find out what the customers like and dislike about the clothing. We will also find out which items are the most and least popular, and look at the distribution of customers across age groups.

This will help us optimize the product and market strategy for this e-commerce store. Please upvote if you liked this notebook!

# Importing Libraries and Data

In [None]:
!pip install plotly
!pip install chart_studio
!pip install cufflinks
!pip install textblob

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import plotly as py
import cufflinks as cf
from plotly.offline import iplot

In [None]:
py.offline.init_notebook_mode(connected=True)
cf.go_offline()

In [None]:
df = pd.read_csv("../input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv", index_col=0)
df.head()

Seems like "Clothing ID" and "Title" are not that useful for us. So let's drop them from the dataset.

In [None]:
df.drop(["Clothing ID", "Title"], axis=1, inplace=True)
df.head()

# Data Cleaning and Preparation

Let's check for missing values.

In [None]:
df.isnull().sum()

Since we will be working on Sentiment Analysis based on the "Review Text", there is no way for us to fill in the missing "Review Text" data. Let's drop it from the dataframe.

Additionally, it seems that the missing values for "Division Name", "Department Name" and "Class Name" occur in the same rows. Let's test that hypothesis by dropping the rows with null values for "Division Name" only.

In [None]:
df.dropna(subset=["Review Text", "Division Name"], inplace=True)

In [None]:
df.isnull().sum()

Looks like our hypothesis is true. Next, let's convert our "Review Text" into one large corpus in the form of a list.

In [None]:
df["Review Text"].tolist()

Seems like there are many contractions in this corpus of text. Let's replace them in the next portion of our notebook.

# Text Cleaning

There are many ways to deal with contractions, and libraries such as NLTK offer tools for tokenization, stemming and lemmatization. But I will just use a modified version of the list of contractions from Wikipedia (https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions) for convenience's sake.

In [None]:
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [None]:
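def decontracted(x):
    # Leave non-string values (e.g. NaN) untouched.
    if not isinstance(x, str):
        return x
    x = x.replace('\\', '')
    # Expand each contraction; keys are tried in dictionary order.
    for key, value in contractions.items():
        x = x.replace(key, value)
    return x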

In [None]:
df["Review Text"] = df["Review Text"].apply(lambda x: decontracted(x))
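One caveat with a plain `str.replace` loop: it is case-sensitive, so "It's" is not matched by the key "it's", and because "can't" is tried before "can't've", the longer form ends up as "cannot've". Here is a minimal sketch of an alternative, assuming a regular expression that tries longer keys first and ignores case. The `sample_contractions` dictionary and the `expand_contractions` name are made up for illustration, and this is not what the rest of the notebook uses.

```python
import re

# A small, hypothetical subset of the contractions mapping (keys in lowercase).
sample_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
}

# Sort keys longest-first so "can't've" is tried before "can't".
_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(sample_contractions, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace every known contraction in a single pass, ignoring case."""
    if not isinstance(text, str):
        return text
    return _pattern.sub(lambda m: sample_contractions[m.group(0).lower()], text)

print(expand_contractions("It's great but I can't've worn it, and I DON'T regret it."))
```

The replacements come out in lowercase, which is fine here since we lowercase the whole review text in the next step anyway.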

In [None]:
import string
string.punctuation

We want to remove all punctuation except for full stops, commas, and exclamation marks.

In [None]:
punctuation = '"#$%&\'()*+-/:;<=>?@[\\]^_`{|}~'
numbers = "0123456789"

def clean_text(text):
    clean_list = [x for x in text if x not in punctuation]
    clean_list = [x for x in clean_list if x not in numbers]
    clean_list = [x.lower() for x in clean_list]
    cleaned_text = ''.join(clean_list)
    return cleaned_text

In [None]:
df["Review Text"] = df["Review Text"].apply(clean_text)
df["Review Text"].tolist()
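As a side note, the same cleaning can be done without building intermediate lists. The sketch below assumes the same two character sets as above and uses `str.translate` with a deletion table; `clean_text_fast` is a hypothetical name. For ordinary English text it should give the same result as `clean_text`.

```python
# Same character sets as in the cell above.
punctuation = '"#$%&\'()*+-/:;<=>?@[\\]^_`{|}~'
numbers = "0123456789"

# With a third argument, str.maketrans maps every listed character to None (deletion).
_delete_table = str.maketrans("", "", punctuation + numbers)

def clean_text_fast(text):
    """Drop the unwanted punctuation and digits, then lowercase the rest."""
    return text.translate(_delete_table).lower()

print(clean_text_fast("Hello, World! It's 100% GREAT."))
```

Note that full stops, commas and exclamation marks survive, because they are not in the `punctuation` string.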

Great. Now let's engineer some text features before moving on to data visualization.

# Feature Engineering

In [None]:
from textblob import TextBlob
df.head()

In [None]:
df["Polarity"] = df["Review Text"].apply(lambda x: TextBlob(x).sentiment.polarity)
df["Review Length"] = df["Review Text"].apply(lambda x: len(x))
df["Word Count"] = df["Review Text"].apply(lambda x: len(x.split()))

In [None]:
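def average_word_length(x):
    words = x.split()
    # Guard against reviews that are empty after cleaning (avoids division by zero).
    if not words:
        return 0
    word_length = 0
    for word in words:
        word_length += len(word)

    return word_length/len(words)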

In [None]:
df["Average Word Length"] = df["Review Text"].apply(lambda x: average_word_length(x))
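To make the length-based features concrete, here is a small worked example on a single made-up review. It is only a sketch: `text_features` is a hypothetical helper that bundles the three length features into one dictionary, and polarity is left out because it needs TextBlob.

```python
def text_features(text):
    """Compute the three length-based features for one review."""
    words = text.split()
    total_chars = sum(len(w) for w in words)
    return {
        "Review Length": len(text),    # characters, including spaces
        "Word Count": len(words),      # whitespace-separated tokens
        # An empty review has no words, so fall back to 0.0 instead of dividing by zero.
        "Average Word Length": total_chars / len(words) if words else 0.0,
    }

print(text_features("love this dress"))
```

For "love this dress" there are 15 characters and 3 words, and the word lengths 4, 4 and 5 average to about 4.33.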

In [None]:
df.head()

# Data Visualization

Let's first take a look at the distribution of sentiment polarity in this dataset. To be clear, a polarity of 1 is overwhelmingly positive, a polarity of -1 is overwhelmingly negative and a polarity of 0 is neutral. 

In [None]:
df["Polarity"].iplot(kind="hist", colors="blue", bins=50,
                    xTitle = "Sentiment Polarity",
                    yTitle = "Count",
                    title = "Sentiment Polarity Distribution")

We can see a roughly bell-shaped distribution centered around a polarity of 0.175. Most of the reviews were positive, and a small fraction were negative. Now let's explore the distribution of review ratings (how good or bad were the reviews exactly?) and the reviewers' ages (do certain age groups tend to have a better opinion of our clothes?).

In [None]:
df["Age"].iplot(kind="hist", colors="red", bins=50,
                xTitle = "Age",
                yTitle = "Count",
                title = "Age Distribution",
                linecolor = 'black')

From this, we can tell that the median age is around 38 years old, with younger people making up a larger proportion of our customers. However, a sizeable number of customers are still aged 40 and above.

In [None]:
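plt.figure(figsize=(8,8))
cmap = plt.get_cmap("tab20c")
# Derive the labels from the counts so each slice is named after its own rating,
# instead of assuming that the ratings are ordered 5, 4, 3, 2, 1 by frequency.
rating_counts = df["Rating"].value_counts()
labels = ["{} star{}".format(r, "" if r == 1 else "s") for r in rating_counts.index]
rating_counts.plot.pie(autopct='%1.1f%%', shadow=True, labels=labels, colors = cmap(np.arange(5)*2))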

In [None]:
rocket = plt.get_cmap("rocket")
fig, axes = plt.subplots(nrows=2, ncols=3,figsize=(12, 8))
one = df[df["Rating"] == 1]["Age"]
two = df[df["Rating"] == 2]["Age"]
three = df[df["Rating"] == 3]["Age"]
four = df[df["Rating"] == 4]["Age"]
five = df[df["Rating"] == 5]["Age"]

ax1 = sns.distplot(one, ax=axes[0][0], kde=False, bins=20, color=rocket(100))
ax1.set_title('One Star')

ax2 = sns.distplot(two, ax=axes[0][1], kde=False, bins=20, color=rocket(120))
ax2.set_title('Two Stars')

ax3 = sns.distplot(three, ax=axes[0][2], kde=False, bins=20, color=rocket(140))
ax3.set_title('Three Stars')

ax4 = sns.distplot(four, ax=axes[1][0], kde=False, bins=20, color=rocket(160))
ax4.set_title('Four Stars')

ax5 = sns.distplot(five, ax=axes[1][1], kde=False, bins=20, color=rocket(180))
ax5.set_title('Five Stars')

axes[-1, -1].axis("off")

plt.tight_layout()

The majority of our reviews are good, with over 77% of them giving 4 or 5 stars. It also seems that our reviews, whether positive or negative, are distributed similarly across age groups (i.e. no age group seems to favour our clothes more than the others).

# Analyzing Engineered Features

Next, we will look at the features we created. We have already analyzed the polarity, so let's focus on the review text length and word length.

In [None]:
df["Review Length"].iplot(kind="hist", colors="green",
                          xTitle = 'Review Length',
                          yTitle = "Count",
                          title = "Review Length Distribution")

We can see that the distribution peaks at around 500 characters, so many reviews run close to that length. This could be useful for telling authentic reviews apart from fake ones (bots).

In [None]:
df["Word Count"].iplot(kind="hist", colors="#B6E880",
                          xTitle = 'Word Count',
                          yTitle = "Count",
                          title = "Word Count Distribution")

Seems like the word counts also tend towards the high side (94+ words). Since most of our reviews are positive, this suggests that positive reviews tend to be long, although we have not compared review length across ratings directly.

# Distribution of Department, Division and Class

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(10, 5))
ax=df["Department Name"].value_counts().plot.pie(ax = axes[0], shadow=True, colors=rocket(np.arange(5)*50))
ax1=df["Division Name"].value_counts().plot.pie(ax = axes[1], shadow=True, colors=rocket(np.arange(5)*100))

In [None]:
df["Class Name"].value_counts().iplot(kind="bar", colors='rgb(95, 70, 144)',
                                           xTitle = 'Class',
                                           yTitle = "Count",
                                           title = "Class Distribution")

From here, we can see that the most popular type of item is dresses, followed by an assortment of tops.

# Unigram, Bigram and Trigram Analysis

Let's first create a function that can read in a list of words and return us the top n number of words and their frequencies.

In [None]:
x = ["This is a list of words, which are words that are in a list."]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer().fit(x)
bag_of_words = vectorizer.transform(x)
sum_of_words = bag_of_words.sum(axis=0)
word_frequency = [(key, sum_of_words[0, value]) for key, value in vectorizer.vocabulary_.items()]
word_frequency = sorted(word_frequency, key = lambda x: x[1], reverse=True)
word_frequency

And if we want the top 5 words:

In [None]:
word_frequency[:5]

Great, now let's put this into a function.

In [None]:
def top_n_words(x, n):
    vectorizer = CountVectorizer().fit(x)
    bag_of_words = vectorizer.transform(x)
    sum_of_words = bag_of_words.sum(axis=0)
    word_frequency = [(key, sum_of_words[0, value]) for key, value in vectorizer.vocabulary_.items()]
    word_frequency = sorted(word_frequency, key = lambda x: x[1], reverse=True)
    return word_frequency[:n]

Let's take a look at our top 20 words in the reviews.

In [None]:
top_n_words(df["Review Text"], 20)

Let's take a look at the top 20 bigrams and trigrams too.

In [None]:
def top_n_bigrams(x, n):
    vectorizer = CountVectorizer(ngram_range=(2,2)).fit(x)
    bag_of_words = vectorizer.transform(x)
    sum_of_words = bag_of_words.sum(axis=0)
    word_frequency = [(key, sum_of_words[0, value]) for key, value in vectorizer.vocabulary_.items()]
    word_frequency = sorted(word_frequency, key = lambda x: x[1], reverse=True)
    return word_frequency[:n]

In [None]:
top_n_bigrams(df["Review Text"], 20)

In [None]:
def top_n_trigrams(x, n):
    vectorizer = CountVectorizer(ngram_range=(3,3)).fit(x)
    bag_of_words = vectorizer.transform(x)
    sum_of_words = bag_of_words.sum(axis=0)
    word_frequency = [(key, sum_of_words[0, value]) for key, value in vectorizer.vocabulary_.items()]
    word_frequency = sorted(word_frequency, key = lambda x: x[1], reverse=True)
    return word_frequency[:n]

In [None]:
top_n_trigrams(df["Review Text"], 20)
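The three functions above differ only in their `ngram_range`. If you want to sanity-check the counts without scikit-learn, here is a rough sketch using only the standard library. It assumes the same tokenization as `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters), and `top_n_ngrams` is a hypothetical name rather than something used elsewhere in this notebook.

```python
import re
from collections import Counter

def top_n_ngrams(texts, n_top, ngram=1):
    """Count n-grams over all texts and return the n_top most frequent ones."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep tokens with at least two word characters.
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        # Zip the token list with shifted copies of itself to form n-grams.
        grams = zip(*(tokens[i:] for i in range(ngram)))
        counts.update(" ".join(g) for g in grams)
    return counts.most_common(n_top)

docs = ["This is a list of words, which are words that are in a list."]
print(top_n_ngrams(docs, 3))           # the three words that occur twice
print(top_n_ngrams(docs, 5, ngram=2))  # every bigram occurs once here
```

Note that the single-character word "a" is dropped by the tokenizer, so "in a list" contributes the bigram "in list".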

From here, we can see some useful keywords such as "dress", "material", "fabric", and "color". But this is mostly not that useful as there are too many stopwords (commonly occurring words that carry little meaning on their own, such as "this", "it", "the", "is" etc). Let's remove the stopwords and see if our analysis turns up something more useful.

In [None]:
def top_n_words(x, n):
    vectorizer = CountVectorizer(stop_words='english').fit(x)
    bag_of_words = vectorizer.transform(x)
    sum_of_words = bag_of_words.sum(axis=0)
    word_frequency = [(key, sum_of_words[0, value]) for key, value in vectorizer.vocabulary_.items()]
    word_frequency = sorted(word_frequency, key = lambda x: x[1], reverse=True)
    return word_frequency[:n]

In [None]:
def top_n_bigrams(x, n):
    vectorizer = CountVectorizer(ngram_range=(2,2), stop_words='english').fit(x)
    bag_of_words = vectorizer.transform(x)
    sum_of_words = bag_of_words.sum(axis=0)
    word_frequency = [(key, sum_of_words[0, value]) for key, value in vectorizer.vocabulary_.items()]
    word_frequency = sorted(word_frequency, key = lambda x: x[1], reverse=True)
    return word_frequency[:n]

In [None]:
def top_n_trigrams(x, n):
    vectorizer = CountVectorizer(ngram_range=(3,3), stop_words='english').fit(x)
    bag_of_words = vectorizer.transform(x)
    sum_of_words = bag_of_words.sum(axis=0)
    word_frequency = [(key, sum_of_words[0, value]) for key, value in vectorizer.vocabulary_.items()]
    word_frequency = sorted(word_frequency, key = lambda x: x[1], reverse=True)
    return word_frequency[:n]

In [None]:
top_unigrams = top_n_words(df["Review Text"], 20)
df_unigrams = pd.DataFrame(top_unigrams)
top_unigrams

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(background_color='white').generate_from_frequencies(df_unigrams.set_index(0)[1])
plt.figure(figsize=(14,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
top_bigrams = top_n_bigrams(df["Review Text"], 20)
df_bigrams = pd.DataFrame(top_bigrams)
top_bigrams

In [None]:
wordcloud = WordCloud(background_color='white').generate_from_frequencies(df_bigrams.set_index(0)[1])
plt.figure(figsize=(14,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
top_trigrams = top_n_trigrams(df["Review Text"], 20)
df_trigrams = pd.DataFrame(top_trigrams)
top_trigrams

In [None]:
wordcloud = WordCloud(background_color='white').generate_from_frequencies(df_trigrams.set_index(0)[1])
plt.figure(figsize=(14,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

This is much more useful. We can see that the dresses are very well liked, and many of the reviews praise the fitting of the clothes as well as the aesthetics.

# Building a Sentiment Classifier

Even though we can get a rough judgement on the polarity of reviews based on the TextBlob sentiment polarity function, let's create a classifier based on our own terms. For this purpose, I will classify reviews with 4 and 5 stars as positive reviews, 3 as neutral, and below 3 as negative. Let's reflect this in a new feature column.

In [None]:
positive = (df["Rating"] >= 4)
neutral = (df["Rating"] == 3)
negative = (df["Rating"] < 3)

df["Review Type"] = " "
# Use .loc so the assignment writes to df itself rather than to a temporary copy.
df.loc[positive, "Review Type"] = "Positive"
df.loc[neutral, "Review Type"] = "Neutral"
df.loc[negative, "Review Type"] = "Negative"

df.head()

Neat. Now let's map the Review Type from categorical data to numerical data. Positive = 2, Neutral = 1, and Negative = 0.

In [None]:
review_type = {"Positive": 2, "Neutral": 1, "Negative": 0}
df["Review Type"] = df["Review Type"].map(review_type)

In [None]:
df.head()

In [None]:
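# Select the columns by name rather than by position, which is easier to read and less fragile.
X = df["Review Text"].values
y = df["Review Type"].values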

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

Let's use a unigram vectorizer with no stopwords filtered as a baseline.

In [None]:
vect = CountVectorizer()
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

Now, let's try one of the most well known classifiers for text data, Naive Bayes. But before that, we will create a dummy classifier that predicts the most frequent value as a benchmark.

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
    
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train_vect, y_train)
y_dummy_predictions = dummy_majority.predict(X_test_vect)
print('Dummy Classifier Accuracy: {:.2f}'.format(accuracy_score(y_test, y_dummy_predictions)))

Here we can see that even a classifier that just predicts the most frequent value ("Positive"/2) has an accuracy of 77%. This is due to the imbalanced nature of the dataset where most of the reviews are positive. Therefore, accuracy alone is not a reliable metric, and we have to take into account precision and recall as well.

![](https://miro.medium.com/max/1872/1*pOtBHai4jFd-ujaNXPilRg.png)

What exactly are accuracy, precision and recall? These are metrics to measure the performance of a predictive model. In essence:

1. Precision --> For all labels that were predicted positive by our algorithm, what % of them are actually positive? (i.e. The algorithm may miss some true positives, but when it does predict a positive label, we can be confident that it is right.)

2. Recall --> For all labels that are actually positive, what % of them did our algorithm predict as positive? (i.e. We want an algorithm that rarely fails to detect true positive labels, thereby minimizing false negatives.)

3. Accuracy --> For all labels, what % of them did our algorithm predict correctly?

Precision is important in customer-facing cases, where people tend to remember the failure of an algorithm even if it performs well most of the time. For example, a query suggestion in a web search interface.

Recall is especially important in the healthcare industry, where we want to be sure that an actual tumor is not missed by the AI, even at the cost of some false alarms.
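To see why accuracy alone can mislead, here is a small worked example with made-up labels: 77 positive and 23 negative reviews, and a "classifier" that always answers positive (much like the dummy classifier). `binary_metrics` is a hypothetical helper written for this illustration; it treats an undefined precision or recall as 0.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, accuracy), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Made-up data: an always-positive classifier on a 77/23 split.
y_true = ["pos"] * 77 + ["neg"] * 23
y_pred = ["pos"] * 100

print(binary_metrics(y_true, y_pred, positive="pos"))  # precision 0.77, recall 1.0, accuracy 0.77
print(binary_metrics(y_true, y_pred, positive="neg"))  # precision 0.0, recall 0.0, accuracy 0.77
```

The accuracy looks respectable at 77%, yet the recall for the negative class is 0: not a single negative review is detected.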

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train_vect, y_train)

In [None]:
y_pred = classifier.predict(X_test_vect)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='macro')))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='macro')))

Seems like it didn't fare much better than the dummy classifier. The problem probably stems from the fact that we used a default unigram vectorizer. Let's tune the parameters a bit. We will tune the vectorizer to take into account only unigrams and bigrams that occur in 3 or more reviews.

In [None]:
vect = CountVectorizer(min_df=3, ngram_range=(1,2))
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)
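What does `min_df=3` with `ngram_range=(1,2)` keep, exactly? An integer `min_df` is a document count: a term survives only if it appears in at least that many reviews, no matter how often it repeats within one review. The sketch below mimics this with the standard library on four made-up reviews. `vocabulary_with_min_df` is a hypothetical helper, and the tokenization is only assumed to be close to `CountVectorizer`'s default.

```python
import re
from collections import Counter

def vocabulary_with_min_df(texts, min_df=3, max_n=2):
    """Return the unigrams and bigrams that appear in at least min_df documents."""
    doc_freq = Counter()
    for text in texts:
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        terms = set()  # a set, so each term counts once per document
        for n in range(1, max_n + 1):
            terms.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        doc_freq.update(terms)
    return sorted(term for term, count in doc_freq.items() if count >= min_df)

# Made-up reviews for illustration only.
reviews = [
    "love this dress",
    "love this top",
    "love this dress so much",
    "this dress runs small",
]
print(vocabulary_with_min_df(reviews))
# ['dress', 'love', 'love this', 'this', 'this dress']
```

Rare terms such as "top" or "runs small" are dropped, which shrinks the vocabulary and removes features the classifier would have seen too rarely to learn from.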

In [None]:
classifier.fit(X_train_vect, y_train)
y_pred = classifier.predict(X_test_vect)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred, average='macro')))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred, average='macro')))

Results improved, but only marginally. One likely cause is that neutral reviews have a mix of positive and negative sentiments, confusing the algorithm. In that case, let's focus on only the positive and negative reviews as they are generally more important.

# Modified Sentiment Analysis

In [None]:
# Take an explicit copy so that later column assignments do not act on a view of df.
df_modified = df[df["Review Type"] != 1].copy()
df_modified.head()

In [None]:
# Select the columns by name rather than by position.
X = df_modified["Review Text"]
y = df_modified["Review Type"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
vect = CountVectorizer(min_df=3, ngram_range=(1,2))
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
    
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train_vect, y_train)
y_dummy_predictions = dummy_majority.predict(X_test_vect)
print('Dummy Classifier Accuracy: {:.2f}'.format(accuracy_score(y_test, y_dummy_predictions)))

![](https://glassboxmedicine.files.wordpress.com/2019/02/confusion-matrix.png)

As you may have noticed, I already used a confusion matrix earlier. However, I did not explain what it was because I used it for multi-class classification, which does not fit this diagram (binary classification). That being said, once you understand the concept behind a confusion matrix, it is easy to extend your interpretation to as many classes as you want.

1. Top left of the confusion matrix: True Positives --> Labels which the algorithm predicted as Positive and are actually Positive.

2. Top right of the confusion matrix: False Positives --> Labels which the algorithm predicted as Positive but are actually Negative.

3. Bottom left of the confusion matrix: False Negatives --> Labels which the algorithm predicted as Negatives but are actually Positive.

4. Bottom right of the confusion matrix: True Negatives --> Labels which the algorithm predicted as Negative and are actually Negative.

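In this case, an ideal confusion matrix would be one where the off-diagonal values (False Positives and False Negatives) are 0. Note that the matrix printed by scikit-learn's `confusion_matrix` puts the actual labels on the rows and the predicted labels on the columns, with the labels in sorted order. With our encoding (0 = Negative, 2 = Positive), the top left cell of the printed output therefore counts Negative reviews that were predicted as Negative, not the True Positives of the diagram.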
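To make the layout of a printed confusion matrix concrete, here is a minimal sketch that builds one by hand with rows for the actual labels and columns for the predicted labels, sorted by label (the convention scikit-learn's `confusion_matrix` follows). The labels are made up, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Build a confusion matrix: rows are actual labels, columns are predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix

# Made-up labels using the notebook's encoding: 0 = Negative, 2 = Positive.
y_true = [2, 2, 2, 2, 0, 0, 2, 0]
y_pred = [2, 2, 0, 2, 0, 2, 2, 0]

labels, matrix = confusion_counts(y_true, y_pred)
print(labels)   # [0, 2]
for row in matrix:
    print(row)  # [2, 1] then [1, 4]
```

Reading this with "Positive" as the class of interest: the top left (2) holds the True Negatives, the top right (1) the False Positives, the bottom left (1) the False Negatives, and the bottom right (4) the True Positives.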

In [None]:
from sklearn.metrics import classification_report
classifier.fit(X_train_vect, y_train)
y_pred = classifier.predict(X_test_vect)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
val = ["The dress I ordered looked good online, but disappointing when I received it. Material is not bad but design needs improvement.", 
       "I had read bad reviews about their satin underwear, but it turned out to be great! Happy with my purchase."]
classifier.predict(vect.transform(val))

Great! The classifier reaches about 94% accuracy on the held-out test set, so we can expect new reviews of a similar kind to be classified reasonably well. Let's dive deeper into the positive and negative reviews.

# Positive/Negative Review Analysis

In [None]:
reverse_map = {2: "Positive", 0: "Negative"}
df_modified["Review Type"] = df_modified["Review Type"].map(reverse_map)
df_modified.head()

In [None]:
df_positive = df_modified[df_modified["Review Type"] == "Positive"]
df_negative = df_modified[df_modified["Review Type"] == "Negative"]

In [None]:
plt.figure(figsize=(10, 8))
sns.stripplot(x="Division Name", y="Polarity", data=df_modified, palette='coolwarm', hue='Review Type')
plt.tight_layout()

This makes sense: positive reviews tend to have greater polarity than negative reviews. However, it seems like there are more polarized negative reviews in the General Division as well as the General Petite Division. This is possibly because there is a larger sample size in those two divisions.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12, 8), sharex=True, sharey=True)
ax = sns.boxplot(x="Division Name", y="Polarity", data=df_positive, ax=axes[0], palette='rocket')
ax.set_title("Positive Reviews")
ax.set_xlabel(" ")

ax1 = sns.boxplot(x="Division Name", y="Polarity", data=df_negative, ax=axes[1], palette='rocket')
ax1.set_title("Negative Reviews")
ax1.set_xlabel(" ")
ax1.set_ylabel(" ")
plt.tight_layout()

Looking at the positive reviews, it seems that each Division performs almost equally well. The General Division performs marginally better, with higher quartiles and fewer outliers with negative polarity. On the other hand, the Intimate Division seems to perform worse than the others in terms of negative reviews. It has lower quartiles, with few outliers to offset this observation, so we can be fairly confident in this conclusion.

Let's do the same for Departments.

In [None]:
plt.figure(figsize=(10, 8))
sns.stripplot(x="Department Name", y="Polarity", data=df_modified)
plt.tight_layout()

As a recap, Tops and Dresses account for the majority of reviews here, followed closely by Bottoms. It seems they are distributed quite similarly. However, there is an apparent trend that as the number of reviews increases, the number of polarizing reviews (more positive and more negative) increases too.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12, 8), sharex=True, sharey=True)
ax = sns.boxplot(x="Department Name", y="Polarity", data=df_positive, ax=axes[0])
ax.set_title("Positive Reviews")
ax.set_xlabel(" ")

ax1 = sns.boxplot(x="Department Name", y="Polarity", data=df_negative, ax=axes[1])
ax1.set_title("Negative Reviews")
ax1.set_xlabel(" ")
ax1.set_ylabel(" ")
plt.tight_layout()

In terms of positive reviews, most departments score the same except for Jackets and Trend. They are also the departments with the least number of reviews, so this is unsurprising. In terms of negative reviews, we can see that the Intimate department attracts more negative reviews than the others, with lower quartiles compared to others, which is unexpected.

Finally, we will take a look at the 20 most frequently occurring words for positive and negative reviews. However, we only want to look at nouns and adjectives, so we will use the Natural Language Toolkit (NLTK) library to help us do this. Additionally, we will impose a filter that requires positive reviews to have a polarity of >= 0.25 and negative reviews to have a polarity of < 0.

In [None]:
import nltk
df_positive = df_positive[df_positive["Polarity"] >= 0.25]
df_negative = df_negative[df_negative["Polarity"] < 0]

In [None]:
# Keep only nouns (NN, NNP) and adjectives (JJ, JJR, JJS).
wanted_tags = {'NN', 'NNP', 'JJ', 'JJR', 'JJS'}
positive_words = []
for review in df_positive["Review Text"]:
    tokens = nltk.word_tokenize(review)
    for token, tag in nltk.pos_tag(tokens):
        if tag in wanted_tags:
            positive_words.append(token)

In [None]:
positive_unigrams = top_n_words(positive_words, 20)
df_unigrams = pd.DataFrame(positive_unigrams)

wordcloud = WordCloud(background_color="white").generate_from_frequencies(df_unigrams.set_index(0)[1])
plt.figure(figsize=(14,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

From here, we are able to identify several key features of the clothes which customers liked: for example, the quality and material of the clothes, how soft and comfortable they are, and the fabric and color.

In [None]:
# Keep only nouns (NN, NNP) and adjectives (JJ, JJR, JJS).
wanted_tags = {'NN', 'NNP', 'JJ', 'JJR', 'JJS'}
negative_words = []
for review in df_negative["Review Text"]:
    tokens = nltk.word_tokenize(review)
    for token, tag in nltk.pos_tag(tokens):
        if tag in wanted_tags:
            negative_words.append(token)

In [None]:
negative_unigrams = top_n_words(negative_words, 20)
df_unigrams = pd.DataFrame(negative_unigrams)

wordcloud = WordCloud(background_color="white").generate_from_frequencies(df_unigrams.set_index(0)[1])
plt.figure(figsize=(14,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Seems like the most common complaint revolves around the size and fitting of the clothes. Sometimes it's too big, sometimes it's too small. It doesn't look like it does on the model, and it's too short etc.

# Miscellaneous Features Analysis

In this final section, we will analyze the rest of the features we have not yet gone through. Let's start with the "Recommended IND" (whether customers recommended the product or not).

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
recommended = df[df["Recommended IND"] == 1]
not_recommended = df[df["Recommended IND"] == 0]

In [None]:
recommended_graph = go.Histogram(x=recommended["Polarity"], name="Recommended", opacity=0.8)
not_recommended_graph = go.Histogram(x=not_recommended["Polarity"], name="Not Recommended", opacity=0.8)

In [None]:
data = [recommended_graph, not_recommended_graph]
layout = go.Layout(barmode="overlay", title = "Distribution of Polarity based on Recommendations Ind")
fig = go.Figure(data=data, layout=layout)
fig.update_layout(
    autosize=False,
    width=1200,
    height=800,
    xaxis_title="Sentiment Polarity",
    yaxis_title="Count")
iplot(fig)

OK, so this is quite consistent with our findings. Better Sentiment Polarity is associated with higher ratings and more recommendations. Let's see how recommendations are related to review ratings next.

In [None]:
sns.set()
plt.figure(figsize=(10, 8))
sns.barplot(x="Rating", y="Recommended IND", data=df, palette="coolwarm", edgecolor=".2", ci=None)
plt.tight_layout()

Unsurprisingly, higher ratings go hand in hand with more instances of the product being recommended. However, there are still some products that have a low rating yet are still recommended. Let's explore some of these reviews.

In [None]:
recommended[recommended["Rating"] < 3]["Review Text"].tolist()[:3]

It's clear that some of these reviews are contradictory, which makes it difficult for our sentiment classifier to predict accurately. Let's remove them from the dataset. We will set a filter that if a review has a rating of 4 or more, it has to be recommended. If a review has a rating of less than 3, it cannot be recommended.

In [None]:
filtered = ((df_modified["Rating"] >= 4) & (df_modified["Recommended IND"] == 1)) | ((df_modified["Rating"] < 3) & (df_modified["Recommended IND"] == 0))
df_filtered = df_modified[filtered]
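To check which rows a filter like this keeps, here is a minimal sketch on a toy DataFrame. The data is made up for illustration and is not taken from the real dataset; only the two column names match the notebook.

```python
import pandas as pd

# Toy data (hypothetical), one row per review
toy = pd.DataFrame({
    "Rating":          [5, 4, 3, 2, 1, 2, 5],
    "Recommended IND": [1, 1, 1, 0, 0, 1, 0],
})

# Same two conditions as the filter above, named for readability
consistent_positive = (toy["Rating"] >= 4) & (toy["Recommended IND"] == 1)
consistent_negative = (toy["Rating"] < 3) & (toy["Recommended IND"] == 0)
keep = consistent_positive | consistent_negative

toy_filtered = toy[keep]
print(toy_filtered)
print("dropped rows:", toy.index[~keep].tolist())
```

In this toy frame the rows at index 2 (rating 3), 5 (low rating but recommended) and 6 (high rating but not recommended) are the ones removed.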

Let's use the filtered data to optimize the performance of our Sentiment Classifier.

In [None]:
X = df_filtered.iloc[:, 1]
y = df_filtered.iloc[:, -1]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
vect = CountVectorizer(min_df=3, ngram_range=(1,2))
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
classifier.fit(X_train_vect, y_train)
y_pred = classifier.predict(X_test_vect)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
test = ['i got this top to wear with shorts as the color goes with a lot of different prints. the quality is excellent. this top runs very large, as in three  sizes too large.',
 'i loved this dress when i saw it. however the fit was way off. i am   lbs and the small was way too big from the waist down. when the xs arrived i was sure it would be perfect. unfortunately the waist hit way too high, above my rib cage and the dress was too short. it was as if it was a petite size. i was very disappointed as this is such a pretty, easy dress to just throw on for school. unfortunately neither size looked right on me and i had to return both.']
classifier.predict(vect.transform(test))

Now, there are two ways we can optimize our Sentiment Classifier even further: by increasing the Precision or by increasing the Recall of our "Negative" review predictions. Remember, we want higher Precision if we want to be sure that when the classifier predicts a "Negative" review, it really is a "Negative" review. We want higher Recall if we want the classifier to catch more of the reviews that are actually "Negative".
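As a quick refresher, the sketch below computes both metrics by hand from a confusion matrix. The labels are made up, and the encoding (1 for a "Negative" review, 0 for a "Positive" one) is an assumption for this example only, not necessarily how the notebook encodes its target.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = "Negative" review (the class we care about), 0 = "Positive"
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # of the reviews predicted "Negative", how many really are
recall = tp / (tp + fn)     # of the truly "Negative" reviews, how many were caught

print(f"precision={precision:.2f}, recall={recall:.2f}")

# scikit-learn's helpers give the same numbers
assert np.isclose(precision, precision_score(y_true, y_hat, pos_label=1))
assert np.isclose(recall, recall_score(y_true, y_hat, pos_label=1))
```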

However, there is always something to keep in mind: the precision-recall tradeoff. If we optimize one value, the other value will typically drop. That is, past a certain point there is no way to increase both precision and recall at the same time. You can see this phenomenon in the diagram below.

![](https://bbsmax.ikafan.com/static/L3Byb3h5L2h0dHBzL2ltZzIwMTguY25ibG9ncy5jb20vYmxvZy8xMDEyNTkwLzIwMTkwMy8xMDEyNTkwLTIwMTkwMzI3MTIyMTEwNjE4LTk5MzY2Nzg4OS5wbmc=.jpg)
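The same effect can be reproduced numerically by moving the decision threshold of a classifier. This is only a sketch: the scores below are invented and stand in for a model's predicted probability that a review is "Negative" (encoded as 1 here by assumption).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predicted probabilities of the "Negative" class (1)
y_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.50, 0.60, 0.70, 0.80, 0.95])

results = []
for threshold in (0.25, 0.55, 0.75):
    # Predict "Negative" only when the score reaches the threshold
    y_hat = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_hat)
    r = recall_score(y_true, y_hat)
    results.append((threshold, p, r))
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold makes the classifier more cautious about predicting "Negative", so precision goes up while recall goes down.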

As you can see, our algorithm is quite balanced with a Precision of 76% and a Recall of 80% for "Negative" review predictions. Pushing either Precision or Recall any further will come at the expense of the other. This is demonstrated below.

First, we will make use of a neat process known as Random Oversampling. This process randomly duplicates the "Negative" reviews in our dataset until the number of "Negative" reviews is the same as the number of "Positive" reviews. The rationale behind this is that an imbalanced dataset might lead to poorer predictive performance. Hence, equalizing the dataset might improve performance.

![](https://miro.medium.com/max/2246/1*o_KfyMzF7LITK2DlYm_wHw.png)
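To make the idea concrete, here is a hand-rolled sketch of random oversampling using only NumPy. It illustrates the principle on made-up labels and is not how `imblearn` implements its `RandomOverSampler`; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical imbalanced labels: 8 "Positive" (0) and 2 "Negative" (1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)  # stand-in features: just the row index

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_res, y_res = X[resampled_idx], y[resampled_idx]

print(np.bincount(y))      # class counts before oversampling
print(np.bincount(y_res))  # class counts after oversampling
```

Oversampling should only ever touch the training data, since duplicated rows in the test set would distort the evaluation. Putting the sampler inside an `imblearn` `Pipeline` takes care of this, because samplers are applied during `fit` only.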

In [None]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(min_df=3, ngram_range=(1,2))),
                     ('ROS', RandomOverSampler()),
                     ('clf', MultinomialNB())])

text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.neural_network import MLPClassifier
text_clf = Pipeline([('vect', CountVectorizer(min_df=3, ngram_range=(1,2))),
                     ('ROS', RandomOverSampler()),
                     ('clf', MLPClassifier(hidden_layer_sizes=(100, 3), verbose=True, early_stopping=True))])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

As you can see, for both models, when either precision or recall increased, the other decreased. That's it! Thanks for reading through this notebook and don't forget to upvote if you liked it!