# Data Preparation

The `E-Commerce Product Classification` project aims to predict the most appropriate category for each product uploaded to an e-commerce platform. The challenge is to assign each product to the category that actually matches its features.

In e-commerce, accurate product classification is essential to user experience and operational efficiency. The main concern is that each product ends up in its most relevant category. We therefore prioritize Recall as the business metric, as is common in churn prediction.

In this context, Recall measures the model's ability to assign products to the categories they truly belong to, even when their features are complex or ambiguous. By prioritizing Recall, we aim to minimize False Negatives: products that belong to a category but are not assigned to it.

The trade-off is a potential increase in False Positives (products wrongly assigned to a category). In this context, those errors are considered more acceptable than False Negatives, where products that belong to a category are not assigned to it.

By prioritizing Recall in this `E-Commerce Product Classification` project, we hope to improve product categorization, which in turn should enhance user experience and the overall operational efficiency of the e-commerce business. The metric puts more weight on capturing the products that truly belong to a category (avoiding False Negatives) than on avoiding products wrongly assigned to it (False Positives).
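To make the metric concrete, here is a minimal sketch of per-class Recall, TP / (TP + FN), and its unweighted (macro) average. The helper `per_class_recall` and the label lists are hypothetical and only illustrate the definition; they are not taken from the project's modelling code.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {class: TP / (TP + FN)} for every class present in y_true."""
    support = Counter(y_true)  # TP + FN for each class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: true_pos[label] / n for label, n in support.items()}


# Hypothetical labels, for illustration only
y_true = ["Bread", "Bread", "Vegetable", "Vegetable", "Vegetable", "Eggs"]
y_pred = ["Bread", "Vegetable", "Vegetable", "Vegetable", "Bread", "Eggs"]

recall = per_class_recall(y_true, y_pred)

# Macro average: every category counts equally, regardless of its size
macro_recall = sum(recall.values()) / len(recall)
```

In practice, `sklearn.metrics.recall_score(y_true, y_pred, average="macro")` computes the same macro average. Macro averaging is one reasonable choice when category sizes differ a lot, because small categories are not drowned out by large ones.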

# Import Libraries

In [81]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# pickle, json and joblib for saving column lists and model files
import pickle
import json
import joblib
import copy

# Import train test split for splitting data
from sklearn.model_selection import train_test_split
import yaml
from tqdm import tqdm
import os

In [82]:
data = pd.read_csv("./../../dataset/1 - raw data/product_category.csv")

In [83]:
data

Unnamed: 0,title,category
0,Farm Gold Australia Carrot,Vegetable
1,China Broccoli,Vegetable
2,Segar2go Small Pak Choy,Vegetable
3,Segar2go Japanese Cucumber 2pcs,Vegetable
4,Segar2go Tomato In Pack,Vegetable
...,...,...
14255,Mission Wrap Salt Reduced Wholemeal,Bread
14256,Mission Wrap Protein Wholemeal,Bread
14257,Mission Mini Wraps Wholemeal,Bread
14258,Mission Naan - Plain,Bread


# Data Definition

The data used in this analysis includes information about product names and product categories, sourced from the following dataset:

Data Source: [Predict Categories of Items using NLP](https://www.kaggle.com/datasets/shivam1298/predict-categories-of-items-using-nlp)

This dataset is stated to consist of 20,188 rows; note that the file loaded in this notebook contains 14,260 rows according to `data.info()`. It has two main columns:

1. **title**: This column contains the names of the products sold on the e-commerce platform.
2. **category**: This column contains the categories or classifications that correspond to the products.

# Data Validation

In [84]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14260 entries, 0 to 14259
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     14260 non-null  object
 1   category  14260 non-null  object
dtypes: object(2)
memory usage: 222.9+ KB


In [51]:
data.isnull().sum()

Product     0
Category    0
dtype: int64

In [52]:
data.duplicated().sum()

0

In [53]:
data['category'].value_counts()

Sauce & Paste        764
Stationery           653
Frozen food          636
Chocolate & Candy    626
Noodles & Pasta      500
                    ... 
Tofu                  32
Cutlery               31
Eggs                  25
Water                 25
Noodles               24
Name: Category, Length: 77, dtype: int64
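The checks above (missing values, duplicates, class counts) can also be gathered into one summary. The following is a minimal sketch on a small hypothetical frame; `validation_report` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample with one missing title and one duplicated row
df = pd.DataFrame({
    "title": ["China Broccoli", "Mission Naan - Plain", "China Broccoli",
              None, "Farm Gold Australia Carrot"],
    "category": ["Vegetable", "Bread", "Vegetable", "Bread", "Vegetable"],
})


def validation_report(frame, label_col="category"):
    """Summarize missing values, duplicate rows, and class balance."""
    counts = frame[label_col].value_counts()
    return {
        "n_rows": len(frame),
        "missing": frame.isnull().sum().to_dict(),
        "n_duplicates": int(frame.duplicated().sum()),
        "n_classes": int(counts.size),
        "smallest_class": (counts.idxmin(), int(counts.min())),
    }


report = validation_report(df)
```

The size of the smallest class matters for the next step: a stratified split needs at least two rows per class in the data being split.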

# Data Splitting
Split the dataset into training, validation, and test sets, with the features stored in `X` and the target in `y`.

In [55]:
data.shape

(14260, 2)

In [86]:
X = data.drop(columns = "category")
y = data["category"]

In [87]:
# Split data: 80% training, 20% holdout (divided into test and validation in the next cell)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 42)

In [88]:
# Split the 20% holdout into test (60% of it, about 12% of all rows) and validation (40%, about 8%)
X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, 
                                                    test_size=0.4, 
                                                    random_state=42,
                                                    stratify = y_test
                                                   )
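In the cells above, only the second split is stratified. As a possible variation, the sketch below stratifies both splits so that all three sets keep the class proportions of the full data. It runs on a hypothetical toy frame (10 categories of 100 rows each), and the name `X_hold` is introduced here for illustration; the ratios are the same as in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 categories with 100 products each
labels = [f"cat_{i}" for i in range(10) for _ in range(100)]
toy = pd.DataFrame({
    "title": [f"product {n}" for n in range(len(labels))],
    "category": labels,
})
X = toy.drop(columns="category")
y = toy["category"]

# 80% training, 20% holdout, stratified on the target
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Holdout -> 60% test, 40% validation, stratified again
X_test, X_valid, y_test, y_valid = train_test_split(
    X_hold, y_hold, test_size=0.4, random_state=42, stratify=y_hold
)

sizes = {"train": len(X_train), "test": len(X_test), "valid": len(X_valid)}
```

Stratifying the first split is an assumption about what is desirable here, not something the original cells do. It requires every category to have at least two rows, and enough rows for each split to receive some.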

## Final Result - Data Preparation

Export the results of data preparation as pickle files, one per split.

In [89]:
joblib.dump(X_train, "C:\\Users\\penguin\\code\\product-classification\\dataset\\2 - processed\\X_train.pkl")
joblib.dump(y_train, "C:\\Users\\penguin\\code\\product-classification\\dataset\\2 - processed\\y_train.pkl")
joblib.dump(X_valid, "C:\\Users\\penguin\\code\\product-classification\\dataset\\2 - processed\\X_valid.pkl")
joblib.dump(y_valid, "C:\\Users\\penguin\\code\\product-classification\\dataset\\2 - processed\\y_valid.pkl")
joblib.dump(X_test, "C:\\Users\\penguin\\code\\product-classification\\dataset\\2 - processed\\X_test.pkl")
joblib.dump(y_test, "C:\\Users\\penguin\\code\\product-classification\\dataset\\2 - processed\\y_test.pkl")

['C:\\Users\\penguin\\code\\product-classification\\dataset\\2 - processed\\y_test.pkl']
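The export above uses absolute paths tied to one machine. A more portable alternative could build the paths with `pathlib` and write each split in a loop. This is a minimal sketch: `save_splits` is a hypothetical helper, the toy objects stand in for the real splits, and a temporary directory replaces the project's processed-data folder.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd


def save_splits(splits, out_dir):
    """Dump each object in `splits` to <out_dir>/<name>.pkl and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, obj in splits.items():
        path = out_dir / f"{name}.pkl"
        joblib.dump(obj, path)
        paths[name] = path
    return paths


# Hypothetical stand-ins for the real splits
toy_splits = {
    "X_train": pd.DataFrame({"title": ["China Broccoli"]}),
    "y_train": pd.Series(["Vegetable"], name="category"),
}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_splits(toy_splits, Path(tmp) / "processed")
    # Reload inside the context manager, before the directory is removed
    reloaded = {name: joblib.load(path) for name, path in saved.items()}
```

In the notebook, the dictionary would hold the six real objects (`X_train`, `y_train`, `X_valid`, `y_valid`, `X_test`, `y_test`) and `out_dir` would point at the processed-data folder relative to the project root.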