# Text modeling

A Naive Bayes model assumes that the value of a particular feature is independent of the value of any other feature, given the class variable.<br>
In the bag-of-words model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.<br>
Naive Bayes and bag-of-words work together because the frequency of each word within each class determines the probability that a text belongs to that class.
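
To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier on top of bag-of-words counts, written in plain Python. The four training reviews, the function names and the add-one smoothing value are all made up for illustration; this is not how `sklearn` implements it internally, only the same idea in a few lines.

```python
import math
from collections import Counter

# Hypothetical toy training data: (review text, star rating)
train = [
    ("love this dress love it", 5),
    ("pretty dress great fit", 5),
    ("too small poor fit", 1),
    ("cheap fabric too small", 1),
]

# Bag of words per class: word order is dropped, counts are kept
word_counts = {}           # class -> Counter of word frequencies
class_counts = Counter()   # class -> number of training reviews
for text, rating in train:
    word_counts.setdefault(rating, Counter()).update(text.split())
    class_counts[rating] += 1

vocab = {word for counts in word_counts.values() for word in counts}

def log_score(text, rating, alpha=1.0):
    """log P(class) + sum of log P(word | class), with add-alpha smoothing."""
    counts = word_counts[rating]
    total = sum(counts.values())
    score = math.log(class_counts[rating] / sum(class_counts.values()))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        score += math.log((counts[word] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    """Pick the class with the highest log score."""
    return max(word_counts, key=lambda rating: log_score(text, rating))

print(predict("love the fit"))
print(predict("too cheap"))
```

Each word contributes its own term to the score, independently of the other words. That is the "naive" independence assumption at work.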

In [95]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

First, let's read in the data file and filter it to show only the reviews for dresses.

In [96]:
reviews = pd.read_csv('clothing-reviews.csv')
dresses = reviews['Department Name']=='Dresses' 
reviews_dresses = reviews[dresses]
reviews_dresses.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


To read the text and use it for our analysis, we need an object from sklearn called a `CountVectorizer`. Essentially, it builds a vocabulary from a series of texts. It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words. I use a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.

We also need to convert the text column to Unicode strings, which is the format the vectorizer expects. We do so by using `.values.astype('U')`.
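
The steps described above (lowercasing, tokenizing, dropping stop words, building a vocabulary) can be sketched in plain Python. The stop word list below is a tiny hand-picked assumption, and the regular expression is modeled on what I understand to be sklearn's default token pattern (words of two or more characters). Treat it as an approximation, not the library's exact behavior.

```python
import re

# Assumption: a tiny hand-picked stop word list, only for illustration.
# sklearn's built-in English list is much longer.
STOP_WORDS = {"i", "it", "this", "the", "is", "so", "and", "a"}

def tokenize(text):
    """Lowercase, split into word tokens of 2+ characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def build_vocabulary(texts):
    """Map each distinct token to a column index, in alphabetical order."""
    words = sorted({token for text in texts for token in tokenize(text)})
    return {word: index for index, word in enumerate(words)}

# Hypothetical sample sentences in the style of the reviews
reviews_sample = [
    "Love this dress! It's sooo pretty.",
    "The dress is so pretty, and airy.",
]
print(tokenize(reviews_sample[0]))
print(build_vocabulary(reviews_sample))
```

Note how the punctuation disappears and how the lone `s` left over from "It's" is dropped because it is a single character.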

In [97]:
text = reviews_dresses['Review Text'].values.astype('U') #Take the text from the df and convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #Fit the vectorizer: learn the vocabulary from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")


There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


Now that we have the dictionary, we can count the occurrences of each word for each review. This way, we can create a document-feature matrix, with documents (reviews) in the rows, and features (words) in the columns.

In [98]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1


As you can see, the printed output contains no 0's. Because the matrix is mostly zeroes, they are left out to save memory. Instead, only the positions of the cells that _don't_ hold a zero are listed, together with their values. This is a so-called _sparse matrix_.
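
A small sketch may help to see the difference between the dense and the sparse form. The vocabulary and documents below are made up, and the dictionary of coordinates is only a simplified stand-in for the sparse format sklearn returns, not the real data structure.

```python
from collections import Counter

# Hypothetical toy data: a 3-word vocabulary and three short documents
vocabulary = {"dress": 0, "love": 1, "small": 2}
documents = ["love love dress", "small", "dress"]

# Dense form: one row per document, one column per word, zeroes included
dense = []
for doc in documents:
    counts = Counter(doc.split())
    dense.append([counts.get(word, 0) for word in vocabulary])

# Sparse (coordinate) form: only non-zero cells, keyed by (row, column)
sparse = {
    (row, col): value
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value != 0
}

print(dense)
for (row, col), value in sparse.items():
    print(f"  ({row}, {col})\t{value}")
```

With 3 words the saving is small, but with a vocabulary of thousands of words almost every cell of a review's row is zero, so storing only the non-zero cells pays off quickly.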

## Building the model ##

Now, we will use the Naïve Bayes classifier from `sklearn`.

In [99]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = reviews_dresses['Rating'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model: X=features, y=ratings

## Evaluating the model ##

In [100]:
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.5775316455696202

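The accuracy is 57.8%. With 5 categories, random guessing would score about 20%, so this looks pretty good. But the ratings are not evenly distributed, so a fairer benchmark is how often the most common rating occurs, which the next cell checks.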

In [101]:
reviews_dresses['Rating'].value_counts(normalize=True)

5    0.537585
4    0.220763
3    0.132616
2    0.072955
1    0.036082
Name: Rating, dtype: float64
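
These shares give a simple baseline to compare the model against. The sketch below reuses the numbers printed in this notebook. Note that the shares come from all dress reviews rather than from the test set only, so the comparison is approximate.

```python
# Rating shares copied from the value_counts output above
rating_share = {5: 0.537585, 4: 0.220763, 3: 0.132616, 2: 0.072955, 1: 0.036082}
model_accuracy = 0.5775316455696202  # from nb.score on the test set

# A classifier that always predicts the most common rating is right
# as often as that rating occurs
majority_rating = max(rating_share, key=rating_share.get)
baseline_accuracy = rating_share[majority_rating]

print(f"Always predicting {majority_rating} stars: {baseline_accuracy:.1%}")
print(f"Naive Bayes: {model_accuracy:.1%}")
print(f"Gain over baseline: {model_accuracy - baseline_accuracy:.1%}")
```

Seen this way, the model is only a few percentage points better than always answering "5 stars".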

In [102]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        67
           2       0.37      0.05      0.09       135
           3       0.41      0.22      0.29       272
           4       0.32      0.28      0.30       420
           5       0.67      0.91      0.77      1002

    accuracy                           0.58      1896
   macro avg       0.35      0.29      0.29      1896
weighted avg       0.51      0.58      0.52      1896



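The precision for Rating 1 is 0.00, and so is the recall: none of the 67 one-star reviews in the test set was predicted correctly. Either every 1-star prediction was wrong, or the model never predicted 1 star at all, in which case the precision is undefined and reported as 0.00. Precision does not simply rise with the number of stars: it is 0.37 for 2 stars, 0.41 for 3 stars and only 0.32 for 4 stars. The clear winner is 5 stars (0.67), the rating that also makes up more than half of the data.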
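
To see where these numbers come from, precision and recall for a single rating can be computed by hand. The labels below are hypothetical and the helper function is my own. It returns `None` when the value would be 0/0, whereas `classification_report`, as far as I know, prints 0.00 (with a warning) in that situation.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one label; None where 0/0 is undefined."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # how often the model said `label`
    actual = sum(t == label for t in y_true)      # how often `label` was the truth
    precision = true_pos / predicted if predicted else None
    recall = true_pos / actual if actual else None
    return precision, recall

# Hypothetical labels: the model never predicts 1 star
y_true = [1, 1, 3, 5, 5, 5]
y_pred = [3, 5, 3, 5, 5, 4]

for label in (1, 3, 5):
    print(label, precision_recall(y_true, y_pred, label))
```

For label 1 the precision is `None` here: there are no 1-star predictions to judge, which is a different situation from making 1-star predictions that are all wrong.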

## Predicting probabilities instead of classes ##

In [103]:
df = reviews_dresses[['Rating', 'Review Text']]
df.head()

Unnamed: 0,Rating,Review Text
1,5,Love this dress! it's sooo pretty. i happene...
2,3,I had such high hopes for this dress and reall...
5,2,"I love tracy reese dresses, but this one is no..."
8,5,I love this dress. i usually get an xs but it ...
9,5,"I'm 5""5' and 125 lbs. i ordered the s petite t..."


In [104]:
print(df.iloc[0,1])
print(nb.predict_proba(X[0]))

Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
[[9.04901645e-15 7.79333681e-09 1.23105973e-05 2.04774510e-02
  9.79510231e-01]]
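
The probabilities are very close to either 0 or 1. A likely reason is that Naive Bayes adds one log-probability term per word, so the class scores drift far apart for longer reviews. The sketch below shows the general normalization step with made-up log scores. It illustrates the idea and is not sklearn's exact code.

```python
import math

def normalize_log_scores(log_scores):
    """Turn per-class log scores into probabilities that sum to 1."""
    top = max(log_scores)
    # Subtract the largest score first so exp() cannot underflow to all zeroes
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    return [weight / total for weight in weights]

# Hypothetical log scores for 1 to 5 stars: each word in a review adds one
# log-probability term, so long reviews push the classes far apart
log_scores = [-232.0, -218.0, -211.0, -203.5, -199.7]
probs = normalize_log_scores(log_scores)

for stars, p in enumerate(probs, start=1):
    print(f"{stars} star: {p:.6f}")
```

A difference of a few units in log score already turns into a probability ratio of tens or hundreds, which is why one class usually takes almost all of the probability.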


In [105]:
for i in range(10):
    prob = nb.predict_proba(X[i])
    print(f"Review text: {i}. {df.iloc[i,1]}")
    print(f"1 star: {prob[0,0]}, 2 star: {prob[0,1]}, 3 star: {prob[0,2]}, 4 star: {prob[0,3]}, 5 star: {prob[0,4]}")
    print(f"---------------------------------------------------------------------------------------------------------")


Review text: 0. Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
1 star: 9.049016453606603e-15, 2 star: 7.793336811100728e-09, 3 star: 1.2310597337969297e-05, 4 star: 0.020477451028513994, 5 star: 0.9795102305807923
---------------------------------------------------------------------------------------------------------
Review text: 1. I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over

At first the numbers looked super high to me, and I suspected the line that prints them: `{prob[0,0]}, 2 star: {prob[0,1]}, 3 star: {prob[0,2]}, 4 star: {prob[0,3]}, 5 star: {prob[0,4]}`. That line is fine. The values are printed in scientific notation, so `9.049016453606603e-15` means 9.05 × 10⁻¹⁵, which is practically zero, not a huge number.<br>
<br>
Review texts 0, 3 and 9 use the word love a couple of times, and my assumption is that if you leave a review with the word love you'll rate it 4 or 5 stars. Read correctly, the output for review 0 agrees: 5 stars gets a probability of about 0.98, 4 stars about 0.02, and the `e-15` value belongs to 1 star.<br>
<br>
I also wondered whether shortened words (like bc for because in review 0) or the word 'but' (reviews 3 and 9) were throwing the model off. For review 0 that explanation is not needed; for reviews 3 and 9 the exponents should be checked the same way before drawing conclusions.
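
One way to check how large these numbers really are is to print them with a fixed number of decimals. The sketch below uses the five probabilities for review 0, copied from the output in this notebook.

```python
# Probabilities for review 0, copied from the output above
probs = [
    9.049016453606603e-15,   # 1 star
    7.793336811100728e-09,   # 2 star
    1.2310597337969297e-05,  # 3 star
    0.020477451028513994,    # 4 star
    0.9795102305807923,      # 5 star
]

# "e-15" means "times 10 to the power -15", so the first value is tiny
for stars, p in enumerate(probs, start=1):
    print(f"{stars} star: {p:.4f}")

best = probs.index(max(probs)) + 1
print(f"Most likely rating: {best} stars, total probability: {sum(probs):.4f}")
```

Using a format such as `{prob[0,0]:.4f}` in the loop above would make the output for all ten reviews easier to read in the same way.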