**Business Problem: E-Commerce Text Classification**

**Context**
- E-commerce websites list thousands of products across categories such as Electronics, Household, Books, and Clothing & Accessories. Accurate categorization is essential for search, filtering, and the overall user experience. Traditionally, this classification is done manually, which is time-consuming and error-prone.

**Problem Statement**
- **The goal is to build an automated text classification model that accurately assigns products to their respective categories based on product descriptions.**
- This will help e-commerce businesses:
  - Improve search and filtering: customers can find relevant products quickly.
  - Enhance recommendation systems: proper categorization leads to better recommendations.
  - Reduce manual effort: automated product tagging saves time and cost.
  - Optimize inventory management: businesses can track product categories efficiently.

**Business Impact**
- Faster product listing on e-commerce platforms.
- Better customer experience with accurate categorization.
- Increased revenue by improving product discoverability.

**DATA UNDERSTANDING**

In [1]:
import pandas as pd
import warnings
warnings.simplefilter('ignore')

In [2]:
df = pd.read_excel(r"C:\Users\L.RAMYA\OneDrive\Desktop\text classification.xlsx", names=['label', 'message'])
df

| | label | message |
|---|---|---|
| 0 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 1 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 2 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 3 | Household | Incredible Gifts India Wooden Happy Birthday U... |
| 4 | Household | Pitaara Box Romantic Venice Canvas Painting 6m... |
| ... | ... | ... |
| 3995 | Electronics | Strontium MicroSD Class 10 8GB Memory Card (Bl... |
| 3996 | Electronics | CrossBeats Wave Waterproof Bluetooth Wireless ... |
| 3997 | Electronics | Karbonn Titanium Wind W4 (White) Karbonn Titan... |
| 3998 | Electronics | Samsung Guru FM Plus (SM-B110E/D, Black) Colou... |


In [3]:
df.shape

(4000, 2)

**The dataset has 4000 rows and 2 columns.**

In [4]:
df['message'].nunique()

2592

**The message column contains 2592 unique values, so 1408 of the 4000 rows repeat a message that already appears elsewhere in the dataset.**
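
Because repeated messages can end up on both sides of a later train/test split, it may be worth counting and optionally dropping them first. The following is a minimal sketch on a hypothetical toy frame (`toy` is illustrative and not the notebook's data), using pandas' `duplicated` and `drop_duplicates`:

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's df (label, message).
toy = pd.DataFrame({
    "label": ["Books", "Books", "Electronics", "Electronics"],
    "message": ["a novel", "a novel", "usb cable", "memory card"],
})

# Count rows whose message repeats an earlier row.
n_duplicates = int(toy["message"].duplicated().sum())

# Keep only the first occurrence of each message.
deduped = toy.drop_duplicates(subset="message").reset_index(drop=True)

print(n_duplicates)   # number of repeated rows
print(deduped.shape)  # shape after dropping them
```

Whether to deduplicate is a judgment call; the accuracies reported later in this notebook were computed on all 4000 rows.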

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    4000 non-null   object
 1   message  4000 non-null   object
dtypes: object(2)
memory usage: 62.6+ KB


**The dataset has 4000 entries and 2 columns (label and message), with no missing values.
Both columns are of object data type, indicating they contain text or categorical data.**

In [6]:
df['label'].unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

**The label column contains four unique categories: 'Household', 'Books', 'Clothing & Accessories', and 'Electronics'.
These represent the classes for product classification in the dataset.**

In [7]:
df['label'].value_counts()

label
Household                 1000
Books                     1000
Clothing & Accessories    1000
Electronics               1000
Name: count, dtype: int64

**The dataset is perfectly balanced with 1000 entries for each category: Household, Books, Clothing & Accessories, and Electronics.**

In [8]:
df['label'].value_counts()/len(df)

label
Household                 0.25
Books                     0.25
Clothing & Accessories    0.25
Electronics               0.25
Name: count, dtype: float64

**Each category in the dataset contributes equally, with 25% representation.
This confirms the dataset is evenly distributed across all four classes.**

# Text Cleaning

In [9]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer


nltk.download('stopwords', quiet=True)  # Fetch the stop-word list if it is not already installed
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # Load stopwords once

corpus = []
for text in df['message']:  # Loop directly over the column
    rp = re.sub('[^a-zA-Z]', " ", text)  # Replace every non-letter (digits, punctuation) with a space
    rp = rp.lower().split()  # Lowercase & split
    rp = [ps.stem(word) for word in rp if word not in stop_words]  # Remove stopwords & stem
    corpus.append(" ".join(rp))  # Join back to a sentence

print(corpus[:5])  # Print first 5 preprocessed messages


['saf floral frame paint wood inch x inch special effect uv print textur sao paint made synthet frame uv textur print give multi effect attract toward special seri paint make wall beauti give royal touch perfect gift special one', 'saf uv textur modern art print frame paint synthet cm x cm x cm set color multicolor size cm x cm x cm overview beauti paint involv action skill use paint right manner henc end product pictur speak thousand word say art trend quit time give differ viewer differ mean style design saf wood matt abstract paint frame quit abstract mysteri beauti paint nice frame gift famili friend paint variou form certain figur seen imag add good set light place paint decor give differ feel look place qualiti durabl paint matt finish includ good qualiti frame last long period howev includ glass along frame specif purchas saf wood matt abstract paint frame amazon custom friendli platform wide rang product choos shop click away', 'saf flower print frame paint synthet inch x inch 

**The text is cleaned by replacing every non-alphabetic character (including digits) with a space, converting to lowercase, removing English stopwords, and stemming each remaining word with the Porter stemmer.
This preprocessing step prepares the text for feature extraction and model training.**
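
Since the same cleaning steps are needed again when predicting on new text, they could be wrapped in one function. This is a minimal sketch, not the notebook's code: `clean_text` and `demo_stop_words` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and the default `stem` argument does no stemming at all.

```python
import re

def clean_text(text, stop_words, stem=lambda word: word):
    """Keep letters only, lowercase, drop stop words, then stem each word."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stem(w) for w in words if w not in stop_words)

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's list.
demo_stop_words = {"the", "a", "is", "for"}

cleaned = clean_text("The 6.5-inch AMOLED display is perfect for work!", demo_stop_words)
print(cleaned)  # inch amoled display perfect work
```

In the notebook's setting, calling `clean_text(text, stop_words, ps.stem)` should apply the same steps as the loop above.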

# Vectorization

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

X = pd.DataFrame(cv.fit_transform(corpus).toarray())

In [11]:
X.shape

(4000, 12782)

In [12]:
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12772,12773,12774,12775,12776,12777,12778,12779,12780,12781
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**The CountVectorizer converts the cleaned text data into a numerical matrix using the bag-of-words approach.
Each row in X represents a message, and each column corresponds to the frequency of a unique word.**

In [13]:
label_codes = {'Books': 0, 'Clothing & Accessories': 1, 'Electronics': 2, 'Household': 3}
df['label'] = df['label'].map(label_codes)  # Assign the result back to the column
y = df['label']

**The categorical labels are converted to integer codes (0 to 3) with map(), and the result is assigned back to the label column.
The variable y now contains the target values for classification.**
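
An alternative to a hand-written mapping is scikit-learn's `LabelEncoder`, which also remembers how to convert codes back to names. This is a sketch on a short hypothetical list of labels, not the notebook's approach. `LabelEncoder` numbers classes in sorted order, which for these four categories happens to give the same codes as the mapping above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the four category names.
labels = ["Household", "Books", "Clothing & Accessories", "Electronics", "Books"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(list(encoder.classes_))             # class names, index = code
print(codes.tolist())                     # [3, 0, 1, 2, 0]
print(encoder.inverse_transform([2])[0])  # Electronics
```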

# Train-Test Split

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [15]:
print(X.shape)  # Should be (4000, num_features)
print(y.shape)  # Should be (4000,)


(4000, 12782)
(4000,)
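
The split above is purely random, so the class proportions in the test set can drift slightly from 25% each. If exact proportions matter, `train_test_split` accepts a `stratify` argument. The sketch below uses hypothetical toy data (`X_toy`, `y_toy`) and is not what the notebook does:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy labels: 10 samples for each of 4 classes.
y_toy = [c for c in range(4) for _ in range(10)]
X_toy = [[i] for i in range(len(y_toy))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1
)

test_counts = Counter(y_te)
print(test_counts)  # each class appears twice in the 8-sample test set
```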


# Modelling
**Naive Bayes Classifier**

In [16]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

model.fit(X_train,y_train)


# Evaluation

In [17]:
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

from sklearn.metrics import accuracy_score
print("train accuracy:",accuracy_score(y_train,y_pred_train))
print("test accuracy:",accuracy_score(y_test,y_pred_test))

from sklearn.model_selection import cross_val_score
print('cvscore:',cross_val_score(model,X_train,y_train,cv=5,scoring='accuracy').mean())

train accuracy: 0.9790625
test accuracy: 0.97
cvscore: 0.9637499999999999
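
Accuracy alone does not show which categories get confused with each other. A confusion matrix and a per-class report can fill that gap. The sketch below uses small hypothetical label lists (`y_true`, `y_hat`) rather than the notebook's results:

```python
from sklearn.metrics import classification_report, confusion_matrix

names = ["Books", "Clothing & Accessories", "Electronics", "Household"]

# Hypothetical true and predicted codes (index into `names`).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_hat = [0, 0, 1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_true, y_hat, target_names=names))
```

In the notebook, passing `y_test` and `y_pred_test` instead of the toy lists would give the same views for the trained model.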


# Prediction On New Data

In [18]:
message ="The latest smartphone features a 6.5-inch AMOLED display, 128GB of internal storage, and a powerful 48MP camera, making it the perfect device for both work and entertainment."


# Load Data

In [19]:
df_test = pd.DataFrame({'message': [message]})
df_test

| | message |
|---|---|
| 0 | The latest smartphone features a 6.5-inch AMOL... |


In [20]:

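corpus = []  # Reuses the name of the training corpus

for i in range(len(df_test)):
    rp = re.sub('[^a-zA-Z]', " ", df_test['message'][i])  # Replace non-letters with spaces
    rp = rp.lower().split()  # Lowercase & split into words
    rp = [ps.stem(word) for word in rp if word not in stop_words]  # Remove stopwords & stem
    corpus.append(' '.join(rp))  # Join back to a sentence

print(corpus)

# Vectorize with the CountVectorizer already fitted on the training corpus
X = cv.transform(corpus).toarray()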

['latest smartphon featur inch amol display gb intern storag power mp camera make perfect devic work entertain']


In [21]:
X

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [22]:
# Predict the category
pred = model.predict(X)

# Map the predicted code back to its category name
label_names = {0: 'Books', 1: 'Clothing & Accessories', 2: 'Electronics', 3: 'Household'}
print(label_names.get(pred[0], 'Unknown Category'))


Electronics
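
Predicting on new text requires repeating the vectorization step by hand. One way to bundle the steps is a scikit-learn `Pipeline`. The sketch below is an illustration under stated assumptions, not the notebook's method: the training texts are hypothetical toy data, and the stop-word removal and stemming steps are omitted for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data.
texts = [
    "paperback novel story author",
    "hardcover book author chapters",
    "bluetooth speaker battery usb",
    "smartphone camera display storage",
]
labels = ["Books", "Books", "Electronics", "Electronics"]

# The vectorizer is fitted only on the data passed to fit().
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

prediction = pipeline.predict(["smartphone with large display and camera"])[0]
print(prediction)  # Electronics
```

Because scikit-learn classifiers accept string labels, the pipeline returns the category name directly. Passing a pipeline like this to `cross_val_score` would also refit the vocabulary inside each fold instead of once on the full corpus.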
