In [1]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk

In [2]:
df = pd.read_csv("data clothing reviews.csv")
#keep only the rows in the 'Dresses' class
df = df.loc[(df['Class Name'] == 'Dresses')]
df = df[["Review Text" , "Class Name", "Rating"]].dropna()
df.head()

Unnamed: 0,Review Text,Class Name,Rating
1,Love this dress! it's sooo pretty. i happene...,Dresses,5
2,I had such high hopes for this dress and reall...,Dresses,3
5,"I love tracy reese dresses, but this one is no...",Dresses,2
8,I love this dress. i usually get an xs but it ...,Dresses,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",Dresses,5


In [3]:
#Map the rating onto two values, positive (4-5 stars) and negative (1-3 stars),
#so that a Naïve Bayes model can be trained on positive and negative reviews

def rating_pos_neg(x):
    if(x  >= 4): 
        return "positve"
    elif (x <= 3): 
        return "negative"
    else: 
        return x
    
df["Rating"] = df["Rating"].apply(rating_pos_neg)
df.head() #The head() of the resulting dataframe.

Unnamed: 0,Review Text,Class Name,Rating
1,Love this dress! it's sooo pretty. i happene...,Dresses,positve
2,I had such high hopes for this dress and reall...,Dresses,negative
5,"I love tracy reese dresses, but this one is no...",Dresses,negative
8,I love this dress. i usually get an xs but it ...,Dresses,positve
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",Dresses,positve


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
review_text = df["Review Text"].values.astype('U') #convert to unicode, so it's possible to work with it

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english') #filter out English stop words, to make the analysis easier
#fit() learns the vocabulary; the text is later converted to a matrix of token counts,
#because machine learning models cannot work with raw text directly
vect = vect.fit(review_text)


The bag-of-words model discards all information about word order and keeps only how often each word occurs in a document. This is exactly what happens when the vectors are built: the structure of the text is thrown away and, for each document, only the number of times each word appears is counted. This works well in combination with Naive Bayes, because Naive Bayes also treats the features as independent of each other (given the class). Neither model uses any information about how words relate to one another.
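To make the loss of word order concrete, here is a minimal sketch using only the standard library. The `bag_of_words` helper and its simple regular-expression tokenizer are illustrative assumptions; `CountVectorizer` uses its own tokenization rules and also removes stop words here.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two made-up sentences with the same words in a different order.
a = bag_of_words("not good, really bad")
b = bag_of_words("not bad, really good")

print(a)
print(a == b)  # the two bags are identical, although the sentences mean opposite things
```

The two sentences express opposite opinions, yet a model that only sees the counts cannot tell them apart.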

In [6]:
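matrix = vect.transform(review_text) #document-term matrix: one row per review, one column per vocabulary word
#Note: df keeps its filtered row labels (1, 2, 5, ...), while the new DataFrame is indexed 0..n-1,
#so concat aligns on the index and fills the unmatched rows with NaN (see rows 0, 3 and 4 in the output).
#Passing index=df.index to pd.DataFrame would line the counts up with their reviews.
df_words = pd.concat([df, pd.DataFrame(matrix.toarray())], axis=1)
df_words.head(5)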

Unnamed: 0,Review Text,Class Name,Rating,0,1,2,3,4,5,6,...,8069,8070,8071,8072,8073,8074,8075,8076,8077,8078
0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Love this dress! it's sooo pretty. i happene...,Dresses,positve,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,I had such high hopes for this dress and reall...,Dresses,negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

#Split the data into a training set (70%) and a test set (30%).

nb = MultinomialNB() #create the model, which learns how strongly each word is associated with each class

X = matrix
y = df['Rating'] #the positive/negative labels that the word counts are related to

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1)

#Train a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars)

nb = nb.fit(X_train, y_train)

y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8508676789587852

The accuracy on the test set is about 85%.

In [8]:
df["Rating"].value_counts() #there are more positive ratings than negative ones

positve     4634
negative    1511
Name: Rating, dtype: int64
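Because the classes are imbalanced, it helps to compare the accuracy with a trivial baseline that always predicts the majority class. The sketch below uses the counts printed above, so it describes the full data set and not the test split, and it is only meant as a rough reference point.

```python
# Class counts copied from the value_counts() output (label spelled as in the dataframe).
counts = {"positve": 4634, "negative": 1511}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.3f}")
```

Always answering with the majority label would already be right about 75% of the time, so the 85% accuracy is an improvement of roughly ten percentage points over that baseline.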

In [9]:
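#Evaluate the performance of your model on the test set.
from sklearn.metrics import classification_report
#labels is passed by keyword, because recent scikit-learn versions no longer accept it positionally
print(classification_report(y_test, y_test_p, labels=nb.classes_)) #checking the precision and recall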


             precision    recall  f1-score   support

   negative       0.74      0.64      0.69       469
    positve       0.88      0.92      0.90      1375

avg / total       0.85      0.85      0.85      1844



For the negative reviews the precision is 74%, which means that 74% of the reviews predicted as negative are indeed negative. The recall for the negative reviews is 64%, which means that the model failed to classify 36% of the actual negative cases as negative.

For the positive reviews the precision is 88%, so 88% of the reviews predicted as positive are actually positive. The recall for positive reviews is 92%, which means that only 8% of the actually positive reviews were not classified as positive.
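The numbers in the report are related to each other, which can be checked with a few lines of arithmetic. The sketch below takes the rounded precision and recall values from the report and recomputes the share of missed cases and the F1 score. The helper name `f1_score_from` is made up for this illustration, and because the inputs are rounded the result could in general differ from the report in the last digit.

```python
def f1_score_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the classification report above.
report = {
    "negative": {"precision": 0.74, "recall": 0.64},
    "positve": {"precision": 0.88, "recall": 0.92},
}

for label, m in report.items():
    missed = 1 - m["recall"]  # share of true cases of this class that the model did not find
    f1 = f1_score_from(m["precision"], m["recall"])
    print(f"{label}: missed {missed:.0%} of true cases, F1 = {f1:.2f}")
```

For both classes the recomputed F1 agrees with the f1-score column of the report (0.69 and 0.90).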

In [11]:
for i in range(5):
    prob = nb.predict_proba(X[i])
    print(f"review_text: {i}")
    print(f"{df.iloc[i,0]}")
    print(f"Negative: {prob[0,0]}, Postive: {prob[0,1]}")
    print(f"Actual rating in dataframe: {df.iloc[i,2]}")
    print (" ")

review_text: 0
Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
Negative: 0.00011898638596510512, Postive: 0.9998810136140351
Actual rating in dataframe: positve
 
review_text: 1
I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
Negative: 0.9999999660610689, Postive: 3.39389189824013e-08
Actua

Check out 3 cases where your model is off target. Inspect the associated texts. Do you understand why your model trips up? Explain:


review_text: 90
This dress is really cute in person. however, it did not fit me like it does the model in the pic at all. first of all i'm 5 feet 1 and it was wayyy too short on me. i didn't have the petit on either-- i had the regular xs. it just hits a couple of inches too short for me. i am 50. it would be adorable if i were more comfortable in shorter dresses. i wear short things a lot, but this was just too high on me. it was probably a good 8 inches above my knee. also it flared too dramatically at the wa
Negative: 0.16007499752483886, Postive: 0.839925002475152
negative

--- review_text 90 --- In this review she's not happy, but she uses positive words to describe what should have been different ---

review_text: 141
I love retailer and fell in love as soon as i saw this dress online. being 5'10" i love a quality maxi dress and this one did not disappoint. however, being 5'10" also means the top sweater overlay hits me way short. it looks ridiculous. i also thought that this was very heavy for a maxi dress and could not imagine wearing it in 80 degree weather. unfortunately after waiting so long for it's arrival this is going back.
Negative: 0.18637915881188302, Postive: 0.8136208411881104
negative

--- review_text 141 --- The word "love" is used 3 times while she describes her expectations of the dress, but she is disappointed afterwards ---

review_text: 143
I was so in love with this dress when i saw it in the store but so disappointed when i put it on. i am 5'10" with curves and usually buy a large in dresses. this dress looked like a sack on me. the top was way too big and loose making the dress a boxy cut rather than a maxi cut like i was expecting. the dress is lovely to look at on the hanger and feels good on but i don't think it flatters hourglass figures. it was the right length for me unlike many other reviewers. 
obviously, based off the
Negative: 0.2438478720855015, Postive: 0.7561521279145202
negative

--- review_text 143 --- The words "love" and "lovely" are used for how the dress looked in the store and on the hanger, but it turns out to be disappointing when she actually puts it on ---
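The pattern in these three cases can be reproduced on a tiny scale. The sketch below is a hand-written multinomial Naive Bayes with add-one smoothing, trained on four made-up sentences (hypothetical data, not taken from the review set). It is a simplified illustration of the idea and not scikit-learn's `MultinomialNB` implementation.

```python
import math
from collections import Counter

# Hypothetical toy training data, not taken from the review dataset.
train = [
    ("love this dress love the fit", "positive"),
    ("lovely dress love it", "positive"),
    ("too short going back", "negative"),
    ("disappointed poor fit going back", "negative"),
]

# Count how often each word occurs per class, and how many documents each class has.
word_counts = {"positive": Counter(), "negative": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = set(word_counts["positive"]) | set(word_counts["negative"])

def log_posterior(text, label):
    """Unnormalised log P(label | text) with add-one (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score

def predict(text):
    """Return the class with the highest log posterior."""
    return max(word_counts, key=lambda label: log_posterior(text, label))

# A made-up review that praises the dress first and then sends it back.
review = "love love lovely dress but disappointed going back"
print(predict(review))
```

Every occurrence of "love" adds evidence for the positive class, regardless of the "but" that follows, so the repeated praise outweighs "disappointed going back" and the toy model labels the review as positive.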
