## Weekly Assignment 6 - Text Mining

### Brief explanation of BoW & Naïve Bayes
The Bag-of-Words (BoW) model extracts features from a piece of text by counting how often each word occurs, ignoring word order. These counts can then be used as features for training machine learning algorithms.
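
As a minimal sketch of the idea, the snippet below builds a Bag-of-Words matrix by hand for two toy reviews. The reviews are invented for illustration and the tokenisation is a plain whitespace split, so this is a simplification of what `CountVectorizer` does (no stop word removal, no punctuation handling).

```python
from collections import Counter

# Two toy reviews (invented for illustration, not from the dataset)
docs = ["love this dress love the fit", "this dress runs small"]

# Tokenise by lowercasing and splitting on whitespace
tokenised = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all distinct words
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# Each document becomes a vector of word counts in vocabulary order
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenised]

print(vocabulary)
for row in bow:
    print(row)
```

Each row is one document and each column one word, which is the same layout as the document matrix built later in this notebook.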

Naïve Bayes is a machine learning model that can be used on large amounts of data. With it we can calculate the probability that a text belongs to a certain category.
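
To make the probability calculation concrete, here is a small hand-written sketch of a multinomial Naïve Bayes classifier with Laplace smoothing. The training reviews, the function name `class_probabilities` and the smoothing value `alpha=1.0` are assumptions chosen for illustration; the notebook itself relies on scikit-learn's `MultinomialNB` rather than this code.

```python
import math
from collections import Counter

# Toy training data (invented): tokenised reviews with labels 1 = positive, 0 = negative
train = [
    (["love", "this", "dress"], 1),
    (["perfect", "fit", "love", "it"], 1),
    (["runs", "small", "returned", "it"], 0),
]

vocabulary = {word for tokens, _ in train for word in tokens}
labels = sorted({label for _, label in train})

# Count documents per class and word occurrences per class
doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in labels}
for tokens, label in train:
    word_counts[label].update(tokens)

def class_probabilities(tokens, alpha=1.0):
    """Return P(class | tokens) for each class, using Laplace smoothing."""
    log_scores = {}
    for label in labels:
        total = sum(word_counts[label].values())
        # Log prior: share of training documents in this class
        score = math.log(doc_counts[label] / len(train))
        for word in tokens:
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Smoothed log likelihood of the word given the class
            score += math.log(
                (word_counts[label][word] + alpha)
                / (total + alpha * len(vocabulary))
            )
        log_scores[label] = score
    # Normalise so the class probabilities sum to 1
    norm = math.log(sum(math.exp(s) for s in log_scores.values()))
    return {label: math.exp(s - norm) for label, s in log_scores.items()}

probs = class_probabilities(["love", "the", "fit"])
print(probs)
```

The "naïve" part is the assumption that words occur independently given the class, which is why the per-word likelihoods can simply be multiplied (added in log space).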

In [98]:
#Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import numpy as np

In [99]:
df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


### Pre-processing

In [57]:
df1 = df[df['Class Name'] == 'Dresses']

In [58]:
df1 = df1.drop(['Unnamed: 0', 'Clothing ID'], axis=1)

In [100]:
df1 = df1.dropna() #drop the empty values

In [102]:
#Separate the ratings: positive (rating 4-5) = 1, neutral/negative (rating 1-3) = 0
df1.loc[df1['Rating'] < 4, 'Positive/Negative'] = '0' 
df1.loc[df1['Rating'] > 3, 'Positive/Negative'] = '1' 

In [103]:
df1.head(10)

Unnamed: 0,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Positive/Negative
2,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,0
5,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,0
8,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses,1
9,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses,1
10,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses,0
12,53,Perfect!!!,More and more i find myself reliant on the rev...,5,1,2,General Petite,Dresses,Dresses,1
14,50,Pretty party dress with some issues,This is a nice choice for holiday gatherings. ...,3,1,1,General,Dresses,Dresses,0
19,47,Stylish and comfortable,I love the look and feel of this tulle dress. ...,5,1,0,General,Dresses,Dresses,1
21,55,I'm torn!,"I'm upset because for the price of the dress, ...",4,1,14,General,Dresses,Dresses,1
22,31,Not what it looks like,"First of all, this is not pullover styling. th...",2,0,7,General,Dresses,Dresses,0


### Document matrix

In [104]:
text = df1['Review Text'].values.astype('U') #Take the review text from the df and convert it to Unicode strings
vect = CountVectorizer(stop_words='english') #Create the CountVectorizer object, with English stop words removed
vect = vect.fit(text) #Fit the vectorizer: learn the vocabulary from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
#feature_names
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 7747 words in the vocabulary. A selection: ['allusion', 'allusione', 'almsot', 'alr', 'alright', 'als', 'altar', 'alter', 'alteration', 'alterations', 'altered', 'altering', 'alternate', 'alternations', 'alternative', 'althetic', 'altho', 'altogether', 'am5', 'amadi']


In [105]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat) #Print the sparse matrix as (document, word index) count entries; the output is truncated

  (0, 1353)	1
  (0, 1585)	1
  (0, 2074)	1
  (0, 2151)	1
  (0, 2292)	1
  (0, 2642)	1
  (0, 2782)	1
  (0, 2824)	1
  (0, 3244)	2
  (0, 3376)	1
  (0, 3443)	1
  (0, 3547)	1
  (0, 3619)	1
  (0, 3785)	1
  (0, 3921)	2
  (0, 3924)	1
  (0, 4179)	1
  (0, 4282)	1
  (0, 4554)	2
  (0, 4569)	1
  (0, 4681)	1
  (0, 4737)	1
  (0, 4776)	1
  (0, 4785)	1
  (0, 4977)	2
  :	:
  (5369, 4606)	1
  (5369, 4954)	1
  (5369, 4957)	1
  (5369, 6108)	1
  (5369, 6401)	1
  (5369, 6684)	1
  (5369, 6801)	1
  (5369, 6807)	1
  (5369, 7270)	1
  (5369, 7427)	1
  (5369, 7446)	1
  (5369, 7484)	1
  (5369, 7502)	1
  (5369, 7649)	1
  (5370, 1589)	1
  (5370, 2292)	1
  (5370, 2364)	1
  (5370, 2733)	1
  (5370, 2783)	1
  (5370, 3385)	1
  (5370, 4115)	1
  (5370, 4957)	1
  (5370, 5073)	1
  (5370, 5483)	1
  (5370, 7488)	1


In [106]:
#Final steps for the matrix

#rev_words = pd.concat([df1, pd.DataFrame(docu_feat.toarray())], axis=1)
#rev_words.head(10)

### Splitting the data into training and test set

In [108]:
X = docu_feat
y = df1['Positive/Negative']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

### Using Naïve Bayes classifier for predicting

In [109]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB()

In [110]:
#Predicting the score of the first 6 reviews

print(clf.predict(X_test[0:6]))

['1' '0' '1' '0' '1' '0']


In [111]:
# Create a value for all predictions

prediction = clf.predict(X_test)
prediction

array(['1', '0', '1', ..., '1', '1', '1'], dtype='<U1')

### Evaluating the model

In [112]:
#Getting the accuracy
y_test_p = clf.predict(X_test)
clf.score(X_test, y_test)

0.8542183622828784

The accuracy is about 85%. Because there are only two categories, this number only means something when compared with the share of the largest category.

In [113]:
#checking what happens if we guess the same category over and over again
df1['Positive/Negative'].value_counts(normalize=True)

1    0.753119
0    0.246881
Name: Positive/Negative, dtype: float64

These results show that always guessing 'positive' would be correct about 75% of the time. The model's accuracy of 85% is therefore roughly 10 percentage points better than this baseline.
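
A majority-class baseline like this can also be computed with scikit-learn's `DummyClassifier`. The sketch below uses invented toy labels with a 75/25 class balance (`X_toy` and `y_toy` are placeholders, not the notebook's data), so it only illustrates the mechanism.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (invented) with roughly the same 75/25 class balance as above
y_toy = np.array(['1'] * 75 + ['0'] * 25)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

In this notebook the same comparison could be made by fitting the baseline on `X_train, y_train` and scoring it on `X_test, y_test`, so that baseline and model are evaluated on the same reviews.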

### The Confusion Matrix

In [114]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Neutral/Negative', 'Positive'], columns=['Neutral/Negative predictions', 'Positive predictions'])
cm

Unnamed: 0,Neutral/Negative predictions,Positive predictions
Neutral/Negative,238,168
Positive,67,1139


In [115]:
#checking the classes
clf.classes_

array(['0', '1'], dtype='<U1')

In [116]:
# Printing out the classification report

print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.78      0.59      0.67       406
           1       0.87      0.94      0.91      1206

    accuracy                           0.85      1612
   macro avg       0.83      0.77      0.79      1612
weighted avg       0.85      0.85      0.85      1612



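The precision for 'Positive' is 87%: of all reviews predicted as positive, 87% are actually positive. The recall for 'Positive' is 94%, which means that 94% of the actual positive reviews are predicted as 'positive'. The model is weaker on the neutral/negative class, where the recall is only 59%.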
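
These figures can be checked by hand from the confusion matrix shown earlier. The sketch below copies the four counts from that matrix; the variable names `tn`, `fp`, `fn` and `tp` are chosen here for illustration.

```python
# Counts taken from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 238, 168   # actual neutral/negative: predicted negative, predicted positive
fn, tp = 67, 1139   # actual positive: predicted negative, predicted positive

precision_pos = tp / (tp + fp)   # share of positive predictions that are correct
recall_pos = tp / (tp + fn)      # share of actual positives that are found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"positive: precision={precision_pos:.2f}, recall={recall_pos:.2f}")
print(f"negative: precision={precision_neg:.2f}, recall={recall_neg:.2f}")
print(f"accuracy={accuracy:.2f}")
```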

### Checking where the model is off-target

In [117]:
#Compare the prediction with the actual data

df2 = pd.DataFrame({'Pred': prediction, 'Actual': y_test})
df2["Comparison"] = df2["Pred"] == df2["Actual"] #True where the prediction matches the actual label
df2 = df2.sort_values(by = "Comparison") #Misclassified reviews (False) come first
df2

Unnamed: 0,Pred,Actual,Comparison
15037,1,0,False
10669,1,0,False
5070,1,0,False
4403,0,1,False
18366,0,1,False
...,...,...,...
1448,1,1,True
5499,1,1,True
6247,1,1,True
10400,1,1,True


In [133]:
#The actual data for the first three misclassified reviews in df2
#Case 1
df.iloc[15037, :]

Unnamed: 0                                                             15037
Clothing ID                                                             1087
Age                                                                       25
Title                                                               Runs big
Review Text                This dress looks so cute in the pictures-i lov...
Rating                                                                     2
Recommended IND                                                            0
Positive Feedback Count                                                    0
Division Name                                                        General
Department Name                                                      Dresses
Class Name                                                           Dresses
Positive/Negative                                                          0
Name: 15037, dtype: object

In [122]:
#not matching

df.iloc[15037, 4]

'This dress looks so cute in the pictures-i love the style. ordered typical size and it was huge-felt like many sizes too big.'

This review is negative (2 stars) but was predicted as positive. Words like 'cute' and 'love' could have pushed the model towards the positive category.

In [132]:
#case 2
df.iloc[10669, :]

Unnamed: 0                                                             10669
Clothing ID                                                             1083
Age                                                                       37
Title                                                      Beautiful idea...
Review Text                I ordered my normal size in this dress. i am 6...
Rating                                                                     3
Recommended IND                                                            1
Positive Feedback Count                                                    0
Division Name                                                        General
Department Name                                                      Dresses
Class Name                                                           Dresses
Positive/Negative                                                          0
Name: 10669, dtype: object

In [130]:
df.iloc[10669, 4]

"I ordered my normal size in this dress. i am 6 foot tall, but the regular sizes were too large and too long (mid-calf). i returned the dress for a size smaller in petite for a more flattering hemline. the dress is lovely, especially on the models in the pictures, but didn't quite work out for me. also, it feels like there are hundreds of closure hooks that make putting on/taking off the dress seem to take an unusually long time!"

The model predicted positive, while the review is labelled neutral/negative. Words like 'lovely' might have contributed to this. The rating of 3 stars also suggests the review is not clearly negative, which makes it harder to classify.

In [131]:
#case 3
df.iloc[5070, :]

Unnamed: 0                                                              5070
Clothing ID                                                             1095
Age                                                                       52
Title                                                             Runs small
Review Text                This dress is very cute and is made well.  i b...
Rating                                                                     3
Recommended IND                                                            1
Positive Feedback Count                                                    0
Division Name                                                 General Petite
Department Name                                                      Dresses
Class Name                                                           Dresses
Positive/Negative                                                          0
Name: 5070, dtype: object

In [134]:
df.iloc[5070, 4]

"This dress is very cute and is made well.  i bought up a size from what i usually where as one reviewer mentioned the dress is small in the bust. this is always a problem for me therefore i ordered a 14. i was surprised that the 14 was sung in the bust. and then understood why the 16 was sold out. i was 30/40 lbs heavier when i wore a 16..never would have thought i'd have to order up that high. i've been running a 10 or 12 depending on the level of my activity. i'm keeping the dress because i kn"

This neutral/negative review was also predicted as positive. The rating is 3 stars and the wording is neither clearly negative nor clearly positive, which may explain the confusion.