# Assignment 6 - Text Mining

In this assignment I use text mining to predict whether dress reviews are positive (>3 stars) or neutral/negative (<4 stars).

In [191]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from string import punctuation
import os

## Reading the data
First I'm only selecting the dresses. 

In [192]:
df = pd.read_csv("dataset.csv")
df = df.loc[df['Class Name'] == 'Dresses']
del df['Unnamed: 0']
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


I'm going to work with the Review text and the recommendation data.

In [210]:
data = data[['Review Text','Recommended IND']]
data

Unnamed: 0,Review Text,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1
...,...,...
23481,I was very happy to snag this dress at such a ...,1
23482,"It reminds me of maternity clothes. soft, stre...",1
23483,"This fit well, but the top was very see throug...",0
23484,I bought this dress for a wedding i have this ...,1


## Data cleaning

I'm replacing all the NaN values in the review text with the placeholder string "No Review".

In [211]:
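# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not modify `data` and triggers a pandas warning.
data['Review Text'] = data['Review Text'].fillna("No Review")
data['Review Text'].isna().sum()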

0

In [223]:
import nltk
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer=ToktokTokenizer()
stopword_list=nltk.corpus.stopwords.words('english')

review_text = data['Review Text']
target_rec = data['Recommended IND']

print(len(review_text))
print(len(target_rec))
target_rec

23486
23486


0        1
1        1
2        0
3        1
4        1
        ..
23481    1
23482    1
23483    0
23484    1
23485    1
Name: Recommended IND, Length: 23486, dtype: int64

## Textmining

### Uppercase > lowercase

Convert all uppercase characters to lowercase. For example, "BEAUTY" becomes "beauty".


In [214]:
from nltk.tokenize import word_tokenize

# Lowercase every token and rejoin the tokens with single spaces
def to_lower(text):
    return ' '.join([w.lower() for w in word_tokenize(text)])

review_text = review_text.apply(to_lower)

### Removing all special characters like .?/@# etc.


In [215]:
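import re

def remove_special_characters(text, remove_digits=True):
    # Keep only letters, digits and whitespace.
    # Note: the range must be A-Z; A-z would also match [ \ ] ^ _ and `.
    # (remove_digits is currently unused.)
    pattern=r'[^a-zA-Z0-9\s]'
    text=re.sub(pattern,'',text)
    return text
def strip_punctuation(s):
    # Drop every character listed in string.punctuation
    return ''.join(c for c in s if c not in punctuation)

review_text = review_text.apply(remove_special_characters)
review_text = review_text.apply(strip_punctuation)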
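
A pitfall with this kind of pattern is that the range `A-z` is not the same as `A-Z`: in ASCII it also covers the characters between `Z` and `a`. The small check below, using only the standard library and a made-up sample string, shows which characters each variant lets through. In this pipeline `strip_punctuation` would remove those leftovers anyway, so the difference mostly matters if the regex is used on its own.

```python
import re

# Hypothetical sample string containing characters that sit between 'Z' and 'a' in ASCII.
sample = "nice_dress [size^s] `ok`"

loose = re.sub(r'[^a-zA-z0-9\s]', '', sample)   # range A-z also keeps [ \ ] ^ _ `
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)  # keeps letters, digits and whitespace only

print(loose)
print(strict)
```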

### Replace all elongated words with appropriate words. 

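For example, "looooong" becomes "long". Because the function is applied to each whole review rather than to individual words, the WordNet check rarely matches, so correctly spelled words with double letters are shortened too: "dress" becomes "dres". Since the same transformation is applied to every review, this should not matter much for training the model.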

In [232]:
from nltk.corpus import wordnet

def replaceElongated(word):
    repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
    repl = r'\1\2\3'
    if wordnet.synsets(word):
        return word
    repl_word = repeat_regexp.sub(repl, word)
    if repl_word != word:      
        return replaceElongated(repl_word)
    else:       
        return repl_word
review_text = review_text.apply(replaceElongated)
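
If intact spellings such as "dress" are wanted, one option is to run the dictionary check word by word. The sketch below is a variant under stated assumptions, not the method used in this notebook: `KNOWN_WORDS` is a hypothetical stand-in for a dictionary lookup such as `wordnet.synsets`, and `shorten_word` and `shorten_text` are illustrative names.

```python
import re

# Hypothetical mini dictionary; a real run would use a lookup such as wordnet.synsets(word).
KNOWN_WORDS = {"dress", "long", "so", "pretty", "this", "is"}

_REPEAT = re.compile(r'(\w*)(\w)\2(\w*)')

def shorten_word(word, is_known=KNOWN_WORDS.__contains__):
    """Drop one repeated letter at a time until the word is known or has no repeats left."""
    while not is_known(word):
        shorter = _REPEAT.sub(r'\1\2\3', word)
        if shorter == word:
            break
        word = shorter
    return word

def shorten_text(text):
    """Apply shorten_word to each whitespace-separated token."""
    return ' '.join(shorten_word(w) for w in text.split())

print(shorten_text("this dress is sooo looooong"))
```

Checking each word before shortening it keeps dictionary words untouched, at the cost of one lookup per token.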

### Tokenization

Splitting sentences into smaller units, such as terms or words. I'm using NLTK for this.


In [217]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

review_text = review_text.apply(lambda x: tokenizer.tokenize(x))

### Removing Stopwords

Remove stopwords such as "is" and "with", since they don't give useful insights.

In [218]:
def remove_stopwords(text):
    words = [w for w in text if w not in stopword_list]
    return words

review_text = review_text.apply(lambda x : remove_stopwords(x))

### Drop numbers

Remove numbers because they don't provide useful information about the review.

In [219]:
def drop_numbers(list_text):
    list_text_new = []
    for i in list_text:
        if not re.search(r'\d', i):
            list_text_new.append(i)
    return ' '.join(list_text_new)
review_text = review_text.apply(drop_numbers)
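
To see the combined effect of tokenization, stopword removal and number dropping on one review, here is a standard-library sketch. It makes some assumptions: `STOPWORDS` is a tiny hypothetical set standing in for NLTK's English list, `re.findall(r'\w+', ...)` approximates `RegexpTokenizer(r'\w+')`, and `preprocess` is an illustrative name.

```python
import re

# Tiny hypothetical stopword set; the notebook uses NLTK's much longer English list.
STOPWORDS = {"i", "is", "this", "and", "a", "with", "am"}

def preprocess(text):
    tokens = re.findall(r'\w+', text.lower())                # split into word tokens
    tokens = [t for t in tokens if t not in STOPWORDS]       # remove stopwords
    tokens = [t for t in tokens if not re.search(r'\d', t)]  # drop tokens containing digits
    return ' '.join(tokens)

print(preprocess("I am 5ft4 and 125 lbs, this dress is perfect"))
```

Note that a mixed token such as "5ft4" is dropped entirely, just as in `drop_numbers`, because the check looks for any digit inside the token.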

### Combining the data.

In [224]:
df = pd.concat([review_text,target_rec],axis=1)
df.head()

Unnamed: 0,Review Text,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1


In [225]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Review Text      23486 non-null  object
 1   Recommended IND  23486 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 367.1+ KB


### Training the model

In [226]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import log_loss,confusion_matrix,classification_report, accuracy_score
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import re

Splitting the data into test and training data:

In [227]:
cv=CountVectorizer()


train_data,test_data = train_test_split(df,train_size=0.8,random_state=0)

X_train = cv.fit_transform(train_data['Review Text'])
y_train = train_data['Recommended IND']
X_test = cv.transform(test_data['Review Text'])
y_test = test_data['Recommended IND']
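
The vocabulary is fitted on the training reviews only and then reused for the test reviews. The sketch below illustrates that idea in plain Python. It is a simplification, not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helper names, the documents are made up, and the result is a dense list of counts rather than a sparse matrix.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocab):
    """Turn each document into a list of token counts; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

train_docs = ["love dress love fit", "poor fit"]  # made-up training reviews
test_docs = ["love color"]                        # 'color' never appears in training

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))
```

Words that occur only in the test set, like "color" here, get no column, which is why the vectorizer is fitted on the training data and only applied to the test data.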

## Applying Naive Bayes


In [233]:
model = MultinomialNB()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
predictions

array([0, 1, 1, ..., 0, 1, 1])
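
For intuition about what the classifier does with the word counts, here is a minimal multinomial Naive Bayes sketch in plain Python with add-one (Laplace) smoothing. It is an illustration under assumptions, not scikit-learn's code: `train_nb` and `predict_nb` are hypothetical names and the four reviews are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from whitespace-tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log score; words outside the vocabulary are skipped."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

# Made-up training reviews: 1 = recommended, 0 = not recommended
docs = ["love dress perfect", "love fit", "poor quality returned", "poor fit"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, "love perfect"))
print(predict_nb(model, "poor quality"))
```

Each class score is the log prior plus the summed log probabilities of the words in the review, so words that appear mostly in one class pull the prediction towards it.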

## Evaluation

In [231]:
conf_matrix=confusion_matrix(y_test,predictions)
conf_matrix

array([[ 577,  251],
       [ 242, 3628]])

In [234]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.70      0.70      0.70       828
           1       0.94      0.94      0.94      3870

    accuracy                           0.90      4698
   macro avg       0.82      0.82      0.82      4698
weighted avg       0.89      0.90      0.89      4698



**Precision** 

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. High precision means few false positives. The macro-averaged precision of the model is 0.82: 0.94 for the recommended class and 0.70 for the not recommended class. In this context the precision is quite good.

**Recall** 

Recall is the ratio of correctly predicted positive observations to all observations that actually belong to that class. The model reaches a high recall for the recommended dresses (0.94), but it is weaker at predicting the not recommended category (0.70).
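
Both metrics can be recomputed by hand from the confusion matrix printed earlier, where rows are the true classes and columns are the predicted classes. A short sketch, with `precision` and `recall` as illustrative helper names:

```python
# Confusion matrix from the evaluation output: rows = true class, columns = predicted class.
conf = [[577, 251],
        [242, 3628]]

def precision(conf, cls):
    """Correct predictions of cls divided by everything predicted as cls (column sum)."""
    predicted_as_cls = sum(row[cls] for row in conf)
    return conf[cls][cls] / predicted_as_cls

def recall(conf, cls):
    """Correct predictions of cls divided by everything that truly is cls (row sum)."""
    return conf[cls][cls] / sum(conf[cls])

for cls in (0, 1):
    print(cls, round(precision(conf, cls), 2), round(recall(conf, cls), 2))
```

The rounded values match the classification report: about 0.70 for class 0 and 0.94 for class 1.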

**Accuracy** 

In [230]:
nb_report = accuracy_score(y_test, predictions)
print('Accuracy:', nb_report)

Accuracy: 0.8950617283950617


The accuracy is quite high: the model classifies 89.5% of the test reviews correctly. That still means about 10% are predicted incorrectly. I think this is because some people write reviews like "I love this dress, but..." and then don't recommend the dress. By analyzing the text more thoroughly, I think the algorithm could achieve a better result.
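
One way to put the accuracy in perspective is to compare it with a baseline that always predicts the majority class. The sketch below derives both numbers from the confusion matrix above. The baseline is my own addition for comparison, not something computed in the notebook.

```python
# Confusion matrix from the evaluation output: rows = true class, columns = predicted class.
conf = [[577, 251],
        [242, 3628]]

total = sum(sum(row) for row in conf)

# Share of test reviews on the diagonal, i.e. predicted correctly
accuracy = (conf[0][0] + conf[1][1]) / total

# Accuracy of always predicting the most frequent true class
majority_baseline = max(sum(row) for row in conf) / total

print(round(accuracy, 4))
print(round(majority_baseline, 4))
```

Because most reviews in the test set are recommendations, the baseline is already fairly high, so the per-class precision and recall are a useful complement to the overall accuracy.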