Importing the dataset into df

In [None]:
import pandas as pd

df = pd.read_csv("Ecommerce_data.csv")

In [21]:
df.head()

Unnamed: 0,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,3
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,3


Checking the class balance of the dataset

In [12]:
df['label'].value_counts()

Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: label, dtype: int64
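
The same balance check can also be expressed as proportions. The sketch below is only an illustration: the toy Series is made up and stands in for df['label'], and `is_balanced` is a hypothetical helper variable, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df['label'] (hypothetical data, three rows per class)
labels = pd.Series(
    ["Household", "Books", "Electronics", "Clothing & Accessories"] * 3,
    name="label",
)

# normalize=True returns the share of each class instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares)

# In a perfectly balanced dataset every class has the same share
is_balanced = shares.max() == shares.min()
print(is_balanced)
```

On the real dataset, four classes of 6000 rows each would give a share of 0.25 per class.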

Mapping each label to a numeric value

In [13]:
df['label_num'] = df.label.map({
    'Household' : 0, 
    'Books': 1, 
    'Electronics': 2, 
    'Clothing & Accessories': 3
})
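
One thing worth checking after a dictionary-based map is whether any label was left unmapped, since `Series.map` yields NaN for keys missing from the dictionary. A minimal sketch, assuming a toy Series in place of df['label']; the label "Toys" and the name `unmapped` are invented for illustration:

```python
import pandas as pd

label_to_num = {
    "Household": 0,
    "Books": 1,
    "Electronics": 2,
    "Clothing & Accessories": 3,
}

# Toy labels; "Toys" is deliberately absent from the mapping
labels = pd.Series(["Household", "Books", "Toys"])
mapped = labels.map(label_to_num)

# Labels that are missing from the dictionary become NaN
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean every label received a numeric value.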

Training a model without preprocessing the text

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.Text, 
    df.label_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.label_num
)
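
To see what `stratify` does, the sketch below runs the same kind of split on toy data and counts the classes on each side. The texts and labels are placeholders invented for this example, and the variable names are chosen so they do not clash with the notebook's own split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 40 placeholder texts, 10 per class (hypothetical)
texts = [f"sample text {i}" for i in range(40)]
labels = [i % 4 for i in range(40)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,       # 8 of the 40 samples go to the test set
    random_state=2022,
    stratify=labels,     # keep the class proportions in both parts
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```

With stratification, each class keeps its 25% share in both the training and the test part, which matches the support of 1200 per class in the reports of this notebook.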

Creating a pipeline with TF-IDF and a KNN classifier

In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

classification_line = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("Knn", KNeighborsClassifier())
])

classification_line.fit(X_train, y_train)

y_pred = classification_line.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1200
           1       0.97      0.95      0.96      1200
           2       0.97      0.97      0.97      1200
           3       0.97      0.98      0.97      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800
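
For intuition about the TF-IDF step in the pipeline above, here is a small sketch on three made-up product snippets (not taken from the dataset). It fits a `TfidfVectorizer` with default settings and compares the learned inverse document frequency of a term that occurs in two documents with one that occurs in a single document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented product snippets, used only for illustration
docs = [
    "usb mouse black",
    "usb speaker wireless",
    "silk saree women",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Map each vocabulary term to its learned inverse document frequency
idf = {
    term: vectorizer.idf_[idx]
    for term, idx in vectorizer.vocabulary_.items()
}

# One row per document, one column per distinct term
print(matrix.shape)

# "usb" occurs in two documents, "mouse" in one, so "usb" gets the lower IDF
print(idf["usb"] < idf["mouse"])
```

Terms that appear in many documents carry less weight, which is why frequent, uninformative words contribute little to the distance computed by the KNN classifier.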



In [17]:
X_test[:10]

20706    Lal Haveli Designer Handmade Patchwork Decorat...
19166    GOTOTOP Classical Retro Cotton & PU Leather Ne...
15209    FabSeasons Camouflage Polyester Multi Function...
2462     Indian Superfoods: Change the Way You Eat Revi...
6621     Milton Marvel Insulated Steel Casseroles, Juni...
21839    HP X900 USB Mouse (Black) Dependable quality s...
3048     Gupta Fancy Store Wind Chime Feng Shui Om Vast...
21834    Wuthering Heights Intermediate Level Reader Ma...
15364    Gizga Essentials Gz-Ck-102 Professional Cleani...
19017    JBL GO Portable Wireless Bluetooth Speaker wit...
Name: Text, dtype: object

In [18]:
X_test[:10][21839]

"HP X900 USB Mouse (Black) Dependable quality shouldn't come at a cost and with the HP optical mouse you get great comfort and awesome features at an irresistible price. A contoured shape designed for all-day comfort in either hand, Powerful 1000 DPI optical sensor for precise movement on most surfaces, Strict HP standards and guidelines ensure long-lasting quality, 3 button solution and a built-in scroll wheel for optimized productivity, Supports Windows 7 and above and Mac OS X 10.x and above."

In [19]:
classification_line.predict(X_test[:10])

array([0, 2, 3, 1, 0, 2, 0, 1, 2, 2], dtype=int64)
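
The numeric predictions are easier to read once translated back to label names. A minimal sketch, assuming the label mapping defined earlier in this notebook and the ten predicted values printed above; `num_to_label` and `predicted_names` are names introduced here for illustration.

```python
# Mapping used earlier in the notebook to create label_num
label_to_num = {
    "Household": 0,
    "Books": 1,
    "Electronics": 2,
    "Clothing & Accessories": 3,
}

# Invert the mapping so numbers can be decoded back to names
num_to_label = {num: label for label, num in label_to_num.items()}

# The ten predictions shown in the output above
predicted = [0, 2, 3, 1, 0, 2, 0, 1, 2, 2]
predicted_names = [num_to_label[num] for num in predicted]

for name in predicted_names:
    print(name)
```

For example, the sixth prediction (index 21839, the HP X900 USB Mouse description) decodes to Electronics.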

Now let's preprocess the text before training the machine learning models

In [20]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [22]:
# Create a new column "preprocessed_comment" holding the cleaned text returned by preprocess()

df['preprocessed_comment'] = df['Text'].apply(preprocess) 

In [24]:
# Split the preprocessed text into 80% training and 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    df.preprocessed_comment, 
    df.label_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.label_num
)

In the pipeline below I used a count vectorizer (unigrams and bigrams) and a random forest classifier

In [27]:
#1. create a pipeline object
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer_bi_grams', CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams via ngram_range
    ('random_forest', RandomForestClassifier())
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      1200
           1       0.97      0.98      0.97      1200
           2       0.98      0.97      0.98      1200
           3       0.98      0.99      0.98      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800
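
To make the effect of `ngram_range=(1, 2)` concrete, the sketch below fits a `CountVectorizer` on a single made-up phrase and lists the features it extracts. The phrase is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts single words and pairs of adjacent words
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["usb mouse black"])

# vocabulary_ maps each extracted n-gram to its column index
features = sorted(vectorizer.vocabulary_)
print(features)
```

Besides the three single words, the vocabulary contains the two adjacent pairs "usb mouse" and "mouse black", so the feature space grows quickly on a real corpus.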



Next I used TF-IDF and a random forest classifier in the pipeline

In [28]:
#1. create a pipeline object
clf = Pipeline([
     ('vectorizer_tfidf', TfidfVectorizer()),  # TF-IDF weights with default settings (unigrams only)
     ('Random Forest', RandomForestClassifier())
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred)) 

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      1200
           1       0.98      0.98      0.98      1200
           2       0.98      0.97      0.98      1200
           3       0.98      0.99      0.99      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800



Conclusion: The pipelines reached about 97% accuracy in classifying the product text using NLP and machine learning models. Comparing the results before and after text preprocessing (96% with TF-IDF and KNN on the raw text, 97% with random forest on the preprocessed text), the impact of preprocessing on this dataset was marginal. On a noisier real-world dataset, the preprocessing steps would be expected to make a larger difference.