
# Product Category Classification – Exploratory Data Analysis (EDA)

This notebook is the first step in a machine learning project aimed at
automatically predicting product categories based on product titles.

In this phase, we:
- load and inspect the dataset,
- explore category distribution,
- identify potential data quality issues,
- prepare insights for feature engineering and modeling.

Dataset: `products.csv`


In [2]:
import pandas as pd

url = "https://raw.githubusercontent.com/rakosdonja/product-category-classifier/main/data/products.csv"
df = pd.read_csv(url)

print("Shape (rows, cols):", df.shape)
display(df.head())


Shape (rows, cols): (35311, 8)


Unnamed: 0,product ID,Product Title,Merchant ID,Category Label,_Product Code,Number_of_Views,Merchant Rating,Listing Date
0,1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024
1,2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024
2,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,Mobile Phones,FP-8086-IE,3092.0,3.9,11/10/2024
3,4,apple iphone 8 plus 64gb space grey,4,Mobile Phones,YI-0086-US,466.0,3.4,5/2/2022
4,5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,5,Mobile Phones,NZ-3586-WP,4426.0,1.6,4/12/2023


## Initial Data Overview

In this step, we examine the basic structure of the dataset:
- number of rows and columns
- data types of each column
- presence of missing values

This helps us understand data quality and identify potential issues before further analysis.


In [3]:
print("Info:")
df.info()

print("\nMissing values per column:")
print(df.isna().sum())


Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB

Missing values per column:
product ID           0
Product Title      172
Merchant ID          0
 Category Label     44
_Product Code       95
Number_of_Views     14
Merchant Rating    170
 Listing Date       59
dtype: int64


## Data Quality Assessment

The dataset contains **35,311 products** with **8 columns**, combining textual, numerical, and categorical data.

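### Key observations:
- Six of the eight columns contain missing values.
- The most affected columns are:
  - `Product Title` (172 missing)
  - `Merchant Rating` (170 missing)
  - `_Product Code` (95 missing)
  - `Listing Date` (59 missing)
  - `Category Label` (44 missing)
- The target variable (`Category Label`) has 44 missing values. Rows without a label cannot be used for supervised training, so they must be handled carefully.
- Two column names (` Category Label` and ` Listing Date`) carry a leading space in the raw file, so they need to be stripped before the columns are accessed by name.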

### Initial conclusion:
- The dataset is sufficiently large, so removing rows with missing values is acceptable.
- All missing values must be addressed before training the model to avoid errors and bias.
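
The counts above are easier to compare when expressed as a share of all rows. The following is a minimal sketch of such a summary; the helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Product Title": ["iphone 8", None, "bosch fridge", "lg tv"],
    " Category Label": ["Mobile Phones", "Fridges", None, None],
    "Merchant ID": [1, 2, 3, 4],
})

print(missing_summary(toy))
```

On the real data, the same helper could be called as `missing_summary(df)` right after loading.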


## Handling Missing Values

Before training the model, it is necessary to handle missing values in the dataset.
Many machine learning algorithms cannot handle missing values directly, and rows with a
missing target value cannot be used for supervised training at all.

Given the large size of the dataset, rows with missing values will be removed.
This approach simplifies preprocessing while preserving a sufficient amount of data
for training a robust classification model.

In the following step, we remove all rows that contain at least one missing value
and evaluate how much data is lost as a result.


In [4]:
# Rows before cleaning
rows_before = len(df)

# Remove rows with missing values
df_clean = df.dropna()

# Rows after cleaning
rows_after = len(df_clean)

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning: {rows_after}")
print(f"Removed rows: {rows_before - rows_after}")
print(f"Percentage removed: {((rows_before - rows_after) / rows_before) * 100:.2f}%")


Rows before cleaning: 35311
Rows after cleaning: 34760
Removed rows: 551
Percentage removed: 1.56%
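
The step above drops a row if any column is missing. A narrower alternative, shown here only as a hedged sketch, is to drop rows that lack a value in the columns the model actually needs. The choice of `required` columns and the toy data are assumptions for illustration; the notebook itself uses the plain `dropna()` call.

```python
import pandas as pd

# Toy frame with the raw column names (note the leading space in " Category Label")
toy = pd.DataFrame({
    "Product Title": ["iphone 8", None, "bosch fridge", "lg tv"],
    " Category Label": ["Mobile Phones", "Fridges", None, "TVs"],
    "Merchant Rating": [4.5, 3.0, 2.0, None],
})

# Hypothetical choice: only the title (input) and the label (target) are required
required = ["Product Title", " Category Label"]

all_dropped = toy.dropna()                    # drops a row if ANY column is missing
subset_dropped = toy.dropna(subset=required)  # drops a row only if a required column is missing

print("Original rows:", len(toy))
print("Rows after dropna():", len(all_dropped))
print("Rows after dropna(subset=required):", len(subset_dropped))
```

Restricting the check to required columns keeps rows whose only gap is in a field such as the merchant rating, at the cost of leaving those gaps to be handled later if the field is used as a feature.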


### Column and Text Standardization

In this step, we perform basic standardization of the dataset to ensure
consistency and improve downstream text processing for machine learning models.

What we do:
- normalize column names (lowercase, underscores instead of spaces),
- convert text values to lowercase,
- remove leading and trailing whitespace from text fields.

This step is important because text-based models treat values such as
"Phone" and "phone" as different tokens, which can negatively impact
model performance.
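
The three normalization steps listed above can be illustrated on a small example. The column names are taken from the `df.info()` output earlier in this notebook; the `labels` series is made-up data.

```python
import pandas as pd

# Raw column names as reported by df.info(), including leading spaces
raw_columns = pd.Index([
    "product ID", "Product Title", " Category Label", "_Product Code", " Listing Date",
])

# Strip whitespace, lowercase, and replace inner spaces with underscores
normalized = raw_columns.str.strip().str.lower().str.replace(" ", "_")
print(list(normalized))

# Made-up label values that differ only in case and surrounding whitespace
labels = pd.Series(["Phone", "phone ", " PHONE"])
print("Unique values before:", labels.nunique())
print("Unique values after: ", labels.str.lower().str.strip().nunique())
```

Note that `_Product Code` keeps its leading underscore (`_product_code`), because only whitespace is stripped.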


In [5]:
# Rename columns for easier handling
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

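# Normalize text columns
# Note: these steps run on `df`, not on the `df_clean` frame created in the previous cell,
# and `.astype(str)` turns any remaining missing values into the string "nan".
text_columns = ["product_title", "category_label"]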

for col in text_columns:
    df[col] = (
        df[col]
        .astype(str)
        .str.lower()
        .str.strip()
    )

# Preview cleaned text columns
df[text_columns].head()


Unnamed: 0,product_title,category_label
0,apple iphone 8 plus 64gb silver,mobile phones
1,apple iphone 8 plus 64 gb spacegrau,mobile phones
2,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,mobile phones
3,apple iphone 8 plus 64gb space grey,mobile phones
4,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,mobile phones


## EDA: Target Distribution (`category_label`)

Before training any model, we need to understand the target variable (`category_label`):
- How many unique categories exist?
- Which categories dominate the dataset?
- Are there rare categories with too few samples (potential issue for training and evaluation)?


In [10]:
import pandas as pd

# Basic target overview
print("Number of unique categories:", df["category_label"].nunique())

# Distribution of categories
category_counts = df["category_label"].value_counts()
print("\nTop 20 categories by count:")
# rename_axis + reset_index(name=...) yields the columns "category_label" and "count"
# regardless of how the installed pandas version names the value_counts() result
display(category_counts.head(20).rename_axis("category_label").reset_index(name="count"))

print("\nBottom 20 categories by count:")
display(category_counts.tail(20).rename_axis("category_label").reset_index(name="count"))

# Optional: how many categories have very few samples?
threshold = 20
rare_categories = (category_counts < threshold).sum()
print(f"\nCategories with fewer than {threshold} samples:", rare_categories)


Number of unique categories: 10

Top 20 categories by count:


Unnamed: 0,category_label,count
0,fridge freezers,5495
1,mobile phones,4075
2,washing machines,4036
3,cpus,3855
4,fridges,3580
5,tvs,3564
6,dishwashers,3418
7,digital cameras,2696
8,microwaves,2338
9,freezers,2210



Bottom 20 categories by count:


Unnamed: 0,category_label,count
0,fridge freezers,5495
1,mobile phones,4075
2,washing machines,4036
3,cpus,3855
4,fridges,3580
5,tvs,3564
6,dishwashers,3418
7,digital cameras,2696
8,microwaves,2338
9,freezers,2210



Categories with fewer than 20 samples: 0
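
To judge how strongly some categories dominate, the counts printed above can be turned into percentage shares and a largest-to-smallest ratio. This is a small sketch that hard-codes the counts from the output above so that it runs on its own; in the notebook, `category_counts` would be used directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
category_counts = pd.Series({
    "fridge freezers": 5495, "mobile phones": 4075, "washing machines": 4036,
    "cpus": 3855, "fridges": 3580, "tvs": 3564, "dishwashers": 3418,
    "digital cameras": 2696, "microwaves": 2338, "freezers": 2210,
})

# Share of each category in percent
shares = (category_counts / category_counts.sum() * 100).round(2)

# Ratio between the most and least frequent category
imbalance_ratio = category_counts.max() / category_counts.min()

print(shares)
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```

A ratio of roughly 2.5 points to a moderate imbalance, which may still justify a stratified train/test split later on.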


## Cleaning and Standardizing Category Labels

The target variable contains semantically identical categories written in different forms
(e.g. singular vs. plural forms, inconsistent naming, and the string "nan" produced when
missing labels are cast to text).

Before modeling, we standardize category labels to:
- avoid duplicate classes,
- reduce noise in the target variable,
- improve model learning and evaluation reliability.


In [9]:
# Standardize category labels
df["category_label"] = df["category_label"].astype(str).str.lower().str.strip()

# Remove invalid "nan" labels (.copy() avoids a SettingWithCopyWarning on the assignment below)
df = df[df["category_label"] != "nan"].copy()

# Manual normalization of known duplicates
category_mapping = {
    "mobile phone": "mobile phones",
    "cpu": "cpus",
    "fridge": "fridges"
}

df["category_label"] = df["category_label"].replace(category_mapping)

# Check result
print("Unique categories after standardization:", df["category_label"].nunique())
# rename_axis + reset_index(name=...) yields the columns "category_label" and "count"
display(
    df["category_label"]
    .value_counts()
    .rename_axis("category_label")
    .reset_index(name="count")
)


Unique categories after standardization: 10


Unnamed: 0,category_label,count
0,fridge freezers,5495
1,mobile phones,4075
2,washing machines,4036
3,cpus,3855
4,fridges,3580
5,tvs,3564
6,dishwashers,3418
7,digital cameras,2696
8,microwaves,2338
9,freezers,2210
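
The manual mapping above covers three singular/plural pairs. As a hedged sketch, candidate pairs of this kind could also be listed automatically before the mapping is written. The helper `plural_variant_pairs` is hypothetical and only catches labels that differ by a trailing "s".

```python
def plural_variant_pairs(labels):
    """Return sorted (singular, plural) pairs where both forms occur in `labels`."""
    unique = set(labels)
    return sorted(
        (label, label + "s") for label in unique if label + "s" in unique
    )


# Toy label list mirroring the variants handled by category_mapping
toy_labels = [
    "mobile phone", "mobile phones", "cpu", "cpus",
    "fridge", "fridges", "tvs", "fridge freezers",
]

for singular, plural in plural_variant_pairs(toy_labels):
    print(f"{singular!r} -> {plural!r}")
```

The output is only a list of candidates. Each pair still needs a manual check, since a simple suffix rule cannot tell true duplicates from distinct categories.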


## Text EDA: Product Title Length Analysis

Before building features or training models, we analyze the structure of the product titles.
Title length can carry signal for classification when categories differ systematically in how their titles are written.

In this step, we:
- compute the length of each product title (number of characters)
- analyze overall statistics
- inspect how title length varies across product categories

This helps justify whether text-length-based features are meaningful for the model.


In [8]:
# Create title length feature (number of characters)
df["title_length"] = df["product_title"].astype(str).str.len()

# Overall statistics
print("Overall title length statistics:")
display(df["title_length"].describe())

# Statistics by category
print("\nTitle length statistics by category:")
display(
    df.groupby("category_label")["title_length"]
      .describe()
      .sort_values("mean", ascending=False)
)


Overall title length statistics:


Unnamed: 0,title_length
count,35267.0
mean,52.74245
std,22.652185
min,3.0
25%,38.0
50%,52.0
75%,65.0
max,200.0



Title length statistics by category:


category_label,count,mean,std,min,25%,50%,75%,max
cpus,3855.0,66.622827,20.767135,3.0,54.0,65.0,76.0,190.0
fridge freezers,5495.0,54.999818,20.722124,3.0,42.0,54.0,68.0,170.0
tvs,3564.0,54.385241,23.683904,3.0,39.0,53.0,65.0,195.0
washing machines,4036.0,52.782458,19.146128,3.0,42.0,53.0,65.0,149.0
microwaves,2338.0,51.60308,19.027306,3.0,39.0,52.0,65.0,164.0
dishwashers,3418.0,50.027209,18.189395,3.0,39.0,50.0,61.0,192.0
digital cameras,2696.0,49.992953,24.633167,3.0,34.0,44.0,61.0,180.0
freezers,2210.0,48.583258,18.404971,3.0,37.0,47.0,60.0,168.0
fridges,3580.0,48.243017,18.638954,3.0,36.0,46.0,62.0,173.0
mobile phones,4075.0,46.049816,30.824701,3.0,26.0,40.0,57.0,200.0
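
Every category in the table above has a minimum title length of exactly 3. One plausible explanation, stated here as an assumption rather than a verified finding, is that missing titles were cast to the string "nan" by the earlier `.astype(str)` step. The sketch below uses made-up titles to show that the text form of a missing value has length 3, and that computing the length without the cast keeps the value missing instead.

```python
import numpy as np
import pandas as pd

# The text form of a missing value is "nan", which has three characters
print(repr(str(np.nan)), len(str(np.nan)))

# Made-up titles with one missing entry, as read_csv would produce it
titles = pd.Series(["apple iphone 8", np.nan, "bosch fridge"], dtype=object)

# Without a string cast, the missing title stays missing and is ignored by describe()
lengths = titles.str.len()
print(lengths)
print("Missing lengths:", lengths.isna().sum())
print("Minimum length: ", lengths.min())
```

If this explanation holds for the real data, dropping or masking the "nan" titles before computing `title_length` would keep them from distorting the minimum and the lower quartiles.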


## Text EDA: Product Title Word Count Analysis

In addition to character length, the number of words in a product title
can provide meaningful signal for category classification.

In this step, we:
- compute the number of words in each product title
- analyze overall word count statistics
- compare word count distributions across product categories

This helps determine whether word-based length features
should be included in the final model.


In [11]:
# Create word count feature
df["title_word_count"] = df["product_title"].astype(str).str.split().str.len()

# Overall statistics
print("Overall title word count statistics:")
display(df["title_word_count"].describe())

# Statistics by category
print("\nTitle word count statistics by category:")
display(
    df.groupby("category_label")["title_word_count"]
      .describe()
      .sort_values("mean", ascending=False)
)


Overall title word count statistics:


Unnamed: 0,title_word_count
count,35267.0
mean,8.628973
std,4.188479
min,1.0
25%,6.0
50%,8.0
75%,11.0
max,37.0



Title word count statistics by category:


category_label,count,mean,std,min,25%,50%,75%,max
cpus,3855.0,12.346044,3.898504,1.0,10.0,12.0,15.0,29.0
tvs,3564.0,10.189675,4.51976,1.0,7.0,10.0,12.0,31.0
digital cameras,2696.0,9.603858,4.696224,1.0,6.0,9.0,12.0,36.0
mobile phones,4075.0,8.519264,5.62372,1.0,5.0,7.0,11.0,37.0
fridge freezers,5495.0,8.154504,3.310688,1.0,6.0,8.0,10.0,24.0
washing machines,4036.0,7.789148,3.157319,1.0,6.0,8.0,10.0,22.0
microwaves,2338.0,7.615483,3.183674,1.0,5.0,8.0,10.0,26.0
fridges,3580.0,7.448603,3.139201,1.0,5.0,7.0,9.0,28.0
freezers,2210.0,7.183258,3.036132,1.0,5.0,7.0,9.0,24.0
dishwashers,3418.0,6.789936,2.814797,1.0,5.0,7.0,8.0,29.0
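
Before including both length features, it may help to check how much they overlap. The sketch below compares the per-category means from the two tables above (rounded to two decimals) and is only a rough, category-level check. A row-level check would use `df["title_length"].corr(df["title_word_count"])` instead.

```python
import pandas as pd

# Per-category means copied from the two describe() tables above, rounded
means = pd.DataFrame(
    {
        "mean_chars": [66.62, 55.00, 54.39, 52.78, 51.60, 50.03, 49.99, 48.58, 48.24, 46.05],
        "mean_words": [12.35, 8.15, 10.19, 7.79, 7.62, 6.79, 9.60, 7.18, 7.45, 8.52],
    },
    index=[
        "cpus", "fridge freezers", "tvs", "washing machines", "microwaves",
        "dishwashers", "digital cameras", "freezers", "fridges", "mobile phones",
    ],
)

# Pearson correlation between the two category-level means
correlation = means["mean_chars"].corr(means["mean_words"])
print(f"Correlation of category means: {correlation:.2f}")

# Ratio of means: a rough proxy for characters per word (not a mean of per-title ratios)
means["chars_per_word"] = (means["mean_chars"] / means["mean_words"]).round(2)
print(means.sort_values("chars_per_word", ascending=False))
```

A clearly positive correlation suggests the two features are partly redundant, while differences in characters per word hint that they are not interchangeable.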
