# Predicting dress reviews 

In this notebook we use the Women’s E-Commerce Clothing Reviews data set to predict whether dress reviews are positive or negative/ neutral. Positive reviews have a rating above 3 (4 or 5 points); negative/ neutral reviews have a rating below 4 (3 points or fewer). To predict this class we use a Naïve Bayes classifier.

In [65]:
import pandas as pd
import seaborn as sns
import sklearn as sk
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

### Importing and pre-processing the data

In [66]:
df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
df = df[['Clothing ID', 'Review Text', 'Rating', 'Class Name']]  # keep only the columns we need
df.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Class Name
0,767,Absolutely wonderful - silky and sexy and comf...,4,Intimates
1,1080,Love this dress! it's sooo pretty. i happene...,5,Dresses
2,1077,I had such high hopes for this dress and reall...,3,Dresses
3,1049,"I love, love, love this jumpsuit. it's fun, fl...",5,Pants
4,847,This shirt is very flattering to all due to th...,5,Blouses


The Class Name column shows that the data set contains more than just dresses. Therefore, we need to drop the reviews of items that are not dresses. We also need to drop rows with NaN values.

In [67]:
df_dresses = df[df["Class Name"] == "Dresses"].copy()  # keep only dresses; .copy() avoids the SettingWithCopyWarning
df_dresses = df_dresses.dropna()  # drop rows with any NaN value

df_dresses.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Class Name
1,1080,Love this dress! it's sooo pretty. i happene...,5,Dresses
2,1077,I had such high hopes for this dress and reall...,3,Dresses
5,1080,"I love tracy reese dresses, but this one is no...",2,Dresses
8,1077,I love this dress. i usually get an xs but it ...,5,Dresses
9,1077,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,Dresses


Now we need to determine whether a review is positive or negative/ neutral and create a new column for this.

In [68]:
def rate(x):
    """Label ratings above 3 as positive and all other ratings as negative/ neutral."""
    if x > 3:
        return 'Positive'
    return 'Negative/ neutral'

# create a column with positive or negative/ neutral; assign() returns a new DataFrame, so no copy warning is raised
df_dresses = df_dresses.assign(Rate=df_dresses['Rating'].apply(rate))
df_dresses.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Class Name,Rate
1,1080,Love this dress! it's sooo pretty. i happene...,5,Dresses,Positive
2,1077,I had such high hopes for this dress and reall...,3,Dresses,Negative/ neutral
5,1080,"I love tracy reese dresses, but this one is no...",2,Dresses,Negative/ neutral
8,1077,I love this dress. i usually get an xs but it ...,5,Dresses,Positive
9,1077,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,Dresses,Positive


### Document feature matrix

Now that the data set is ready to be used, we can create the document feature matrix. The document feature matrix consists of rows (one per document, here one per review) and columns (one per distinct word). Before we can create the matrix, we first need to build the vocabulary of words.

In [69]:
text = df_dresses['Review Text'].values.astype('U')  # take the review text from df_dresses and convert it to Unicode strings
vect = CountVectorizer(stop_words='english')  # create the CountVectorizer object, dropping English stop words
vect = vect.fit(text)  # learn the vocabulary from the Review Text
feature_names = vect.get_feature_names_out()  # the words in the vocabulary (use get_feature_names() on scikit-learn < 1.0)

Now that we have the vocabulary, we can count the occurrences of every word in each review.

In [70]:
docu_feat = vect.transform(text)  # create the sparse document feature matrix
print(docu_feat[0:50, 0:50])  # print the non-zero counts for the first 50 reviews and the first 50 vocabulary words

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (25, 40)	1
  (34, 12)	2
  (38, 31)	1
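
Each printed line reads `(row, column)` followed by a count: the review in that row contains the word in that column that many times, and every cell not listed is zero. The following is a minimal plain-Python sketch of the same idea. The three toy reviews and the names `docs`, `vocabulary` and `sparse_counts` are made up for illustration, and the sketch skips the stop-word removal and tokenisation rules that `CountVectorizer` applies.

```python
from collections import Counter

# Toy reviews, invented for illustration (not taken from the data set)
docs = [
    "love this dress",
    "dress runs large",
    "love love the color",
]

# Vocabulary: every distinct word, sorted alphabetically and mapped to a column index
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: col for col, word in enumerate(all_words)}

# Sparse representation: only the non-zero cells are stored, keyed by (row, column)
sparse_counts = {}
for row, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        sparse_counts[(row, vocabulary[word])] = count

# Print in the same "(row, column) count" layout as the output above
for (row, col), count in sorted(sparse_counts.items()):
    print(f"  ({row}, {col})\t{count}")
```

In this sketch the third review contains "love" twice, so its cell in the "love" column holds a 2, in the same way that review 34 above has a count of 2 in column 12.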


### Splitting the data set into a training and a test set and fitting the model

In [71]:
X = docu_feat # creating x
y = df_dresses['Rate'] # creating y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # split the data set into a training set (70%) and a test set (30%)

nb = MultinomialNB() # creating the Naïve Bayes model
nb = nb.fit(X_train, y_train) # fitting the Naïve Bayes model
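
To give an intuition for what the fitted model does, here is a rough plain-Python sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the helper names `train_nb` and `predict_nb` and the four toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Denominator: all word occurrences in the class plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; words outside the vocabulary are ignored."""
    scores = {}
    for c in log_prior:
        word_scores = (log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c])
        scores[c] = log_prior[c] + sum(word_scores)
    return max(scores, key=scores.get)

# Toy training data, invented for illustration
train_docs = [
    "love this dress",
    "beautiful fit love it",
    "runs large returned it",
    "cheap fabric disappointed",
]
train_labels = ["Positive", "Positive", "Negative/ neutral", "Negative/ neutral"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb("love the fit", log_prior, log_likelihood))   # Positive
print(predict_nb("cheap fabric", log_prior, log_likelihood))   # Negative/ neutral
```

The classifier scores each class by adding the log prior of the class to the log likelihood of every word in the review, treating words as independent of each other and of their order.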

### Evaluating the performance of the model

Now we need to evaluate the performance of the model. We do this by using a confusion matrix.

In [72]:
y_pred = nb.predict(X_test)  # predicted labels for the test set
cm = confusion_matrix(y_test, y_pred)  # rows are actual classes, columns are predicted classes, both in alphabetical label order
cm = pd.DataFrame(cm, index=["Negative (actual)", "Positive (actual)"], columns=["Negative (predicted)", "Positive (predicted)"])
cm

Unnamed: 0,Negative (predicted),Positive (predicted)
Negative (actual),301,168
Positive (actual),107,1268


The same predictions also give us the accuracy, recall and precision, which the classification report summarises:

In [73]:
print(classification_report(y_test, y_pred))

                   precision    recall  f1-score   support

Negative/ neutral       0.74      0.64      0.69       469
         Positive       0.88      0.92      0.90      1375

         accuracy                           0.85      1844
        macro avg       0.81      0.78      0.79      1844
     weighted avg       0.85      0.85      0.85      1844
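
The figures in the report can be reproduced by hand from the four counts in the confusion matrix. The sketch below does that arithmetic; the variable names are illustrative. It also computes a majority-class baseline (always predicting "Positive"), which is not part of the original analysis and is included here only as a point of comparison for the accuracy.

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted)
tn, fp = 301, 168    # actual Negative/ neutral: predicted negative, predicted positive
fn, tp = 107, 1268   # actual Positive: predicted negative, predicted positive

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Precision: share of predictions for a class that are correct
# Recall: share of actual members of a class that are found
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Assumption for comparison only: a model that always predicts "Positive"
majority_baseline = (fn + tp) / total

print(f"accuracy:          {accuracy:.2f}")
print(f"Negative/ neutral: precision {precision_neg:.2f}, recall {recall_neg:.2f}, f1 {f1(precision_neg, recall_neg):.2f}")
print(f"Positive:          precision {precision_pos:.2f}, recall {recall_pos:.2f}, f1 {f1(precision_pos, recall_pos):.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

Because about three quarters of the test reviews are positive, always predicting "Positive" would already reach an accuracy of roughly 0.75, so the recall of 0.64 on the negative/ neutral class is the more telling number.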



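From the classification report we can tell that the model reaches an accuracy of 85% on the test set. Now we can apply the fitted model to the whole data set (training and test rows alike) and keep only the reviews it misclassifies.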

In [74]:
df_dresses = df_dresses.assign(Rate_Prediction=nb.predict(X))  # predict a label for every dress review
df_dresses = df_dresses[df_dresses["Rate"] != df_dresses["Rate_Prediction"]]  # keep only the misclassified reviews
df_dresses.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Class Name,Rate,Rate_Prediction
23,1077,Cute little dress fits tts. it is a little hig...,3,Dresses,Negative/ neutral,Positive
52,1104,"Love the color and style, but material snags e...",3,Dresses,Negative/ neutral,Positive
311,1089,Looks beautiful online but has too much materi...,3,Dresses,Negative/ neutral,Positive
383,1104,This dress is not what i expected. the bottom ...,3,Dresses,Negative/ neutral,Positive
417,1083,"I love byron lars dresses, and this design is ...",2,Dresses,Negative/ neutral,Positive


Now we can inspect the first three reviews that the model misclassifies.

In [77]:
for i in range(3):
    row = df_dresses.iloc[i]  # select the i-th misclassified review by position
    print(f'Item number: {i+1}\n')
    print(f'Review text:\n{row["Review Text"]}\n')
    print(f'Review score:\n{row["Rating"]}\n')
    print(f'Rating Category:\n{row["Rate"]}\n')
    print(f'Predicted Category:\n{row["Rate_Prediction"]}\n\n')

Item number: 1

Review text:
Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured.

Review score:
3

Rating Category:
Negative/ neutral

Predicted Category:
Positive


Item number: 2

Review text:
Love the color and style, but material snags easily

Review score:
3

Rating Category:
Negative/ neutral

Predicted Category:
Positive


Item number: 3

Review text:
Looks beautiful online but has too much material and the zipper catches on the lace. also runs very large, i am normally a small but would need and xs in this dress

Review score:
3

Rating Category:
Negative/ neutral

Predicted Category:
Positive


