<a href="https://colab.research.google.com/github/Sreenathkk00/Ecommerce_project/blob/main/Prodect_Reviews_Prediction_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hand Written Digit Classification**

---



## **Objective**

The objective of this project is to develop a machine learning model using Multinomial Naive Bayes to accurately predict the sentiment (positive or negative) of product reviews. This will help businesses understand customer feedback and improve their products and services.

### Goals

1. **Data Preprocessing**:
   - Load and clean the dataset.
   - Transform textual data into numerical features.

2. **Model Development**:
   - Train a Multinomial Naive Bayes model.
   - Optimize the model and handle class imbalance.

3. **Model Evaluation**:
   - Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
   - Analyze the confusion matrix.

4. **Visualization and Reporting**:
   - Visualize performance with plots.
   - Generate a classification report.

### Expected Outcomes

- An accurate sentiment prediction model.
- Insights into features influencing sentiment.
- Recommendations for leveraging sentiment analysis to enhance customer satisfaction.

## **Data Source**

* The data source is a CSV file hosted on GitHub.
* The URL provided in the code snippet is:
* https://github.com/Sreenathkk00/Machine-Learning-Projects/raw/b41dfe79ce7709e5e9bd052bf22ff92dd02ac971/Dataset/Womens%20Clothing%20E-Commerce%20Reviews.csv

* This URL points to a repository called "Machine-Learning-Projects" owned by the user "Sreenathkk00".
* Within this repository, there is a folder called "Dataset" containing a file named "Womens Clothing E-Commerce Reviews.csv".

* GitHub is a popular platform for hosting code and data.
* It allows users to store and share their projects with others.
* The data source in this case is a CSV file that has been uploaded to a GitHub repository.

* To access the data, we use the `pd.read_csv()` function from the pandas library.
* This function takes a URL as an argument and reads the data from the specified location.
* In this case, the URL points to the CSV file on GitHub.

* Once the data has been read, it is stored in a pandas DataFrame object called `df`.
* This DataFrame can then be used for further analysis and processing.


## **Import Library**

In [117]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB

## **Import Data**

In [162]:
# Import Data
df = pd.read_csv('https://github.com/Sreenathkk00/Machine-Learning-Projects/raw/b41dfe79ce7709e5e9bd052bf22ff92dd02ac971/Dataset/Womens%20Clothing%20E-Commerce%20Reviews.csv')



## **Describe Data**

In [161]:
# Displays the first 5 rows
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,1,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,1,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,0,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",1,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,1,1,6,General,Tops,Blouses


In [120]:
# Clean the dataset
df = df.drop(['Unnamed: 0'], axis=1)

In [121]:
# df Information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              23486 non-null  int64 
 1   Age                      23486 non-null  int64 
 2   Title                    19676 non-null  object
 3   Review Text              22641 non-null  object
 4   Rating                   23486 non-null  int64 
 5   Recommended IND          23486 non-null  int64 
 6   Positive Feedback Count  23486 non-null  int64 
 7   Division Name            23472 non-null  object
 8   Department Name          23472 non-null  object
 9   Class Name               23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.8+ MB


In [122]:
# describe df
df.describe()

Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,23486.0,23486.0,23486.0,23486.0,23486.0
mean,918.118709,43.198544,4.196032,0.822362,2.535936
std,203.29898,12.279544,1.110031,0.382216,5.702202
min,0.0,18.0,1.0,0.0,0.0
25%,861.0,34.0,4.0,1.0,0.0
50%,936.0,41.0,5.0,1.0,1.0
75%,1078.0,52.0,5.0,1.0,3.0
max,1205.0,99.0,5.0,1.0,122.0


## **Data Preprocessing**

### **Missing Values**

In [123]:
# Checking null values count
df.isnull().sum()

Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

**Replace empty reviews with NaN**

In [124]:
# Replace empty reviews with NaN
df[df['Review Text']==" "]=np.NaN

**Fill missing reviews with "No Review**

In [125]:
# Fill missing reviews with "No Review
df['Review Text'].fillna('No Review',inplace=True)

In [126]:
# Checking null values count
df.isnull().sum()

Clothing ID                   0
Age                           0
Title                      3810
Review Text                   0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

## **Definde Target(y) Feature(x)**

In [127]:
# Define target(y) and feature (x)
y = df['Rating']
x = df['Review Text']

In [128]:
# Calculate value counts
df['Rating'].value_counts()

Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

## **Train Test Split**

In [129]:
# split train test
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,  random_state=2529)

In [130]:
# shape of the train and test
x_train.shape, x_test.shape, y_train.shape, y_test.shape


((16440,), (7046,), (16440,), (7046,))

## **Get Feature Text Coverstion To Tokens**

In [131]:
#TextFeatureExtraction
cv = CountVectorizer(lowercase=True, analyzer='word', ngram_range=(2,3),stop_words='english',max_features=5000)

In [132]:
# Vectorizing training data
x_train = cv.fit_transform(x_train)

In [133]:
# Retrieving feature names after vectorization
cv.get_feature_names_out()

array(['00 petite', '0p fit', '10 12', ..., 'years old', 'yellow color',
       'yoga pants'], dtype=object)

In [134]:
# Converting sparse matrix to  array for inspection
x_train.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [135]:
# Vectorizing test data
x_test = cv.fit_transform(x_test)

In [136]:
# Retrieving feature names after vectorization
cv.get_feature_names_out()

array(['10 12', '10 12 took', '10 12 tops', ..., 'years old',
       'yesterday love', 'yoga pants'], dtype=object)

In [137]:
# Converting sparse matrix to  array for inspection
x_test.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## **Get Model Training**

In [138]:
# Initializing Multinomial Naive Bayes model
model = MultinomialNB()

In [139]:
model.fit(x_train,y_train)

## **Model Predictio**

In [140]:
# Generate predictions
y_pred = model.predict(x_test)

In [141]:
y_pred.shape

(7046,)

In [142]:
y_pred

array([2, 3, 3, ..., 5, 4, 5])

## **Probability Of Each Prediction**

In [143]:
# Predict probabilities
model.predict_proba(x_test)

array([[0.26692373, 0.35957477, 0.30830537, 0.04730174, 0.01789439],
       [0.09725869, 0.09370016, 0.48762968, 0.26285916, 0.05855232],
       [0.21322638, 0.0344866 , 0.6775651 , 0.01713515, 0.05758676],
       ...,
       [0.22922895, 0.0288237 , 0.07354507, 0.02725008, 0.6411522 ],
       [0.01591788, 0.00115913, 0.00471111, 0.76076036, 0.21745151],
       [0.06184667, 0.09982129, 0.01834076, 0.04535696, 0.77463432]])

## **Model Evaluation**

In [144]:
# Print confusion matrix
print(confusion_matrix(y_test,y_pred))

[[  29   26   35   32  141]
 [  48   43   91   75  186]
 [  78   96  152  172  326]
 [ 125  129  212  301  742]
 [ 327  330  394  648 2308]]


In [145]:
# Print classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.05      0.11      0.07       263
           2       0.07      0.10      0.08       443
           3       0.17      0.18      0.18       824
           4       0.25      0.20      0.22      1509
           5       0.62      0.58      0.60      4007

    accuracy                           0.40      7046
   macro avg       0.23      0.23      0.23      7046
weighted avg       0.43      0.40      0.42      7046



## **Recategories Rating as Poor(0) and Good(1)**

In [146]:
# Calculate value counts
df['Rating'].value_counts()

Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

### **Replace the values in the 'Rating' column**

In [147]:
# Re-Rating 1,2,3 as 0 and 4,5 as 1
df.replace({'Rating':{1:0,2:0,3:0,4:1,5:1}},inplace=True)

In [148]:
# Calculate value counts
df['Rating'].value_counts()

Rating
1    18208
0     5278
Name: count, dtype: int64

In [149]:
# Define Target(Y) and Feature(X)
X = df['Review Text']
Y = df['Rating']

## **Re-Train Test split**

In [150]:
# split to train test data
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3,random_state=2529)

In [151]:
# train-test model shape
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((16440,), (7046,), (16440,), (7046,))

## **Re-Feature Text Conversion to Tokens**

In [152]:
# Train data conversion to token
X_train = cv.fit_transform(X_train)

In [153]:
# Test data convertion to token
X_test = cv.fit_transform(X_test)

## **Re-Training Model**

In [154]:
# model
model = MultinomialNB()

In [155]:
# Model  training
model.fit(X_train,Y_train)

## **Re-Model Prediction**

In [156]:
# Model Prediction
Y_pred = model.predict(X_test)

In [157]:
# shape
Y_pred.shape

(7046,)

In [158]:
Y_pred

array([0, 0, 0, ..., 1, 1, 1])

## **Re-Model Evaluation**

In [159]:
# TP,TN  and FN,FP |  Print confusion matrix report
print(confusion_matrix(Y_test,Y_pred))

[[ 482 1048]
 [ 973 4543]]


In [160]:
# Print classification report
print(classification_report(Y_test,Y_pred))

              precision    recall  f1-score   support

           0       0.33      0.32      0.32      1530
           1       0.81      0.82      0.82      5516

    accuracy                           0.71      7046
   macro avg       0.57      0.57      0.57      7046
weighted avg       0.71      0.71      0.71      7046



## **Explanation**
This project aims to create a machine learning model using Multinomial Naive Bayes to predict the sentiment (positive or negative) of product reviews. By analyzing customer reviews, businesses can gain insights into customer satisfaction and areas for improvement.

The project involves:
1. **Data Preprocessing**: Loading and cleaning the dataset, then converting text reviews into numerical features using CountVectorizer.
2. **Model Development**: Training a Multinomial Naive Bayes model on the processed data and optimizing it by tuning hyperparameters and addressing any class imbalance.
3. **Model Evaluation**: Evaluating the model’s performance using metrics like accuracy, precision, recall, and F1-score, and analyzing the confusion matrix.
4. **Visualization and Reporting**: Visualizing the model’s performance and generating a comprehensive classification report.

The expected outcome is an accurate sentiment prediction model that provides valuable insights for businesses to enhance customer satisfaction and improve their products.