# **Women Cloth Reviews Prediction with Multi Nomial Naïve Bayes**

-------------

## **Objective**

The objective of this project is to predict women's clothing reviews using the Multi-Nomial Naïve Bayes algorithm. The goal is to analyze the sentiment and categorize reviews into positive, neutral, or negative classes to help customers make informed purchasing decisions.

## **Data Source**

## **Import Library**

In [243]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as snb

## **Import Data**

In [245]:
df = pd.read_csv("https://raw.githubusercontent.com/KriAga/Women-Clothing-Review/master/Womens%20Clothing%20E-Commerce%20Reviews.csv")

## **Describe Data**

In [246]:
df.head()

Unnamed: 0,ID,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


## **Data Visualization**

In [247]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   ID                       23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


In [248]:
df.shape

(23486, 11)

## **Data Preprocessing**

In [249]:
df.isna().sum()

ID                            0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [269]:
df[df['Review Text']==""]= np.NaN

In [251]:
df['Review Text'].fillna("N0 Reviews", inplace=True )

In [252]:
df.isna().sum()

ID                            0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                   0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [253]:
df['Review Text']

0        Absolutely wonderful - silky and sexy and comf...
1        Love this dress!  it's sooo pretty.  i happene...
2        I had such high hopes for this dress and reall...
3        I love, love, love this jumpsuit. it's fun, fl...
4        This shirt is very flattering to all due to th...
                               ...                        
23481    I was very happy to snag this dress at such a ...
23482    It reminds me of maternity clothes. soft, stre...
23483    This fit well, but the top was very see throug...
23484    I bought this dress for a wedding i have this ...
23485    This dress in a lovely platinum is feminine an...
Name: Review Text, Length: 23486, dtype: object

## **Define Target Variable (y) and Feature Variables (X)**

In [254]:
df.columns

Index(['ID', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')

In [255]:
X = df['Review Text']

In [256]:
y = df['Rating']

In [257]:
df['Rating'].value_counts()

Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

## **Train Test Split**

In [258]:
!pip install scikit-learn




[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [259]:
from sklearn.model_selection import train_test_split

In [270]:
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, stratify=y,random_state=2529)

In [271]:
X_train.shape,  X_test.shape, y_train.shape, y_test.shape

((16440,), (7046,), (16440,), (7046,))

## **Modeling**

## Get feature text conversion to tockens

In [262]:
from sklearn.feature_extraction.text import CountVectorizer

In [272]:
cv =CountVectorizer(lowercase= True, analyzer="word", ngram_range=(2,3), stop_words="english", max_features=5000)

In [275]:
X_train = cv.fit_transform(X_train)

In [276]:
cv.get_feature_names_out()

array(['10 12', '10 bought', '10 fit', ..., 'yellow color', 'yoga pants',
       'zipper little'], dtype=object)

In [281]:
X_test.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## **Get Model Train**

In [282]:
from sklearn.naive_bayes import MultinomialNB

In [283]:
model = MultinomialNB()

In [288]:
model.fit(X_train,y_train)

## **Get Model Prediction**

In [289]:
y_pred = model.predict(X_test)

In [290]:
y_pred.shape

(7046,)

In [292]:
y_pred

array([5, 5, 5, ..., 5, 5, 1], dtype=int64)

## **Get Probability of each predicted class**

In [293]:
model.predict_proba(X_test)

array([[2.45646547e-02, 1.07285711e-02, 6.03014898e-02, 2.56061164e-01,
        6.48344121e-01],
       [6.41928797e-02, 1.35375794e-01, 2.49363795e-02, 1.26535067e-01,
        6.48959880e-01],
       [1.69128313e-02, 2.05072981e-02, 2.38492979e-02, 4.04110394e-01,
        5.34620178e-01],
       ...,
       [9.22282813e-02, 9.51421232e-02, 2.96086884e-02, 5.06176841e-02,
        7.32403223e-01],
       [8.63116367e-02, 1.31897845e-03, 5.76827436e-03, 2.79103223e-04,
        9.06322007e-01],
       [5.72750679e-01, 2.84258959e-02, 1.23573556e-02, 7.80184066e-02,
        3.08447663e-01]])

## **Model Evaluation**

In [294]:
from sklearn.metrics import confusion_matrix, classification_report

In [295]:
print(confusion_matrix(y_test,y_pred))

[[  28   27   19   49  130]
 [  47   52   57   99  215]
 [  99   97  100  172  393]
 [ 173  118  182  267  783]
 [ 436  295  365  574 2269]]


In [296]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.04      0.11      0.05       253
           2       0.09      0.11      0.10       470
           3       0.14      0.12      0.13       861
           4       0.23      0.18      0.20      1523
           5       0.60      0.58      0.59      3939

    accuracy                           0.39      7046
   macro avg       0.22      0.22      0.21      7046
weighted avg       0.41      0.39      0.40      7046



## **Ratings as poor(0) and good(1)**

In [299]:
df['Rating'].value_counts()

Rating
1    18208
0     5278
Name: count, dtype: int64

#### **Re-Rating  1,2,3 as 0 and 4,5 as 1**

In [297]:
df.replace({'Rating': {1:0, 2:0, 3:0, 4:1, 5:1}}, inplace= True)

In [298]:
y = df['Rating']

In [300]:
X = df['Review Text']

## **Train test Split**

In [301]:
from sklearn.model_selection import train_test_split

In [302]:
x_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, stratify=y,random_state=2529)

In [303]:
x_train.shape,  X_test.shape, y_train.shape, y_test.shape

((16440,), (7046,), (16440,), (7046,))

## Get feature text conversion to tockens

In [215]:
from sklearn.feature_extraction.text import CountVectorizer

In [304]:
cv =CountVectorizer(lowercase= True, analyzer="word", ngram_range=(2,3), stop_words="english", max_features=5000)

In [305]:
x_train = cv.fit_transform(x_train)

In [306]:
X_test = cv.fit_transform(X_test)

## **Get Model Re Train**

In [307]:
from sklearn.naive_bayes import MultinomialNB

In [308]:
model = MultinomialNB()

In [309]:
model.fit(x_train, y_train)

## **Get Model Prediction**

In [310]:
y_pred = model.predict(X_test)

In [311]:
y_pred.shape

(7046,)

In [312]:
y_pred

array([0, 1, 1, ..., 1, 1, 1], dtype=int64)

## **Get Model Evaluation**

In [313]:
from sklearn.metrics import confusion_matrix, classification_report

In [314]:
print(confusion_matrix(y_test,y_pred))

[[ 414 1169]
 [ 948 4515]]


In [315]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.30      0.26      0.28      1583
           1       0.79      0.83      0.81      5463

    accuracy                           0.70      7046
   macro avg       0.55      0.54      0.55      7046
weighted avg       0.68      0.70      0.69      7046



## **Explaination**

The creation of a prediction model is the main goal of this research. A test dataset and the necessary libraries are first imported. A subset of the dataset was kept for testing while the remainder was used to train the model once it had been reviewed and pre-processed. To obtain certain prediction datasets, the model was utilised. Finally, the model was retrained for greater accuracy after the prediction accuracy was compared to the test dataset and certain adjustments were made.