
# Product Category Classification – Exploratory Data Analysis (EDA)

This notebook is the first step in a machine learning project aimed at
automatically predicting product categories based on product titles.

In this phase, we:
- load and inspect the dataset,
- explore category distribution,
- identify potential data quality issues,
- prepare insights for feature engineering and modeling.

Dataset: `products.csv`


In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/rakosdonja/product-category-classifier/main/data/products.csv"
df = pd.read_csv(url)

print("Shape (rows, cols):", df.shape)
display(df.head())


Shape (rows, cols): (35311, 8)


|   | product ID | Product Title | Merchant ID | Category Label | _Product Code | Number_of_Views | Merchant Rating | Listing Date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | Mobile Phones | QA-2276-XC | 860.0 | 2.5 | 5/10/2024 |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | Mobile Phones | KA-2501-QO | 3772.0 | 4.8 | 12/31/2024 |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | Mobile Phones | FP-8086-IE | 3092.0 | 3.9 | 11/10/2024 |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | Mobile Phones | YI-0086-US | 466.0 | 3.4 | 5/2/2022 |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | Mobile Phones | NZ-3586-WP | 4426.0 | 1.6 | 4/12/2023 |


## Initial Data Overview

In this step, we examine the basic structure of the dataset:
- number of rows and columns
- data types of each column
- presence of missing values

This helps us understand data quality and identify potential issues before further analysis.


In [2]:
print("Info:")
df.info()

print("\nMissing values per column:")
print(df.isna().sum())


Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB

Missing values per column:
product ID           0
Product Title      172
Merchant ID          0
 Category Label     44
_Product Code       95
Number_of_Views     14
Merchant Rating    170
 Listing Date       59
dtype: int64


## Data Quality Assessment

The dataset contains **35,311 products** with **8 columns**, combining textual, numerical, and categorical data.

### Key observations:
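- Six of the eight columns contain missing values; `product ID` and `Merchant ID` are complete.
- Missing values per column, from most to least affected:
  - `Product Title` (172 missing)
  - `Merchant Rating` (170 missing)
  - `_Product Code` (95 missing)
  - `Listing Date` (59 missing)
  - `Category Label` (44 missing)
  - `Number_of_Views` (14 missing)
- The target variable (`Category Label`) is missing in 44 rows. These rows cannot be used for supervised training as they stand, so they must be handled carefully.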

### Initial conclusion:
- Even the most affected column (`Product Title`) is missing in fewer than 0.5% of rows, so removing rows with missing values is acceptable for a dataset of this size.
- All missing values must be addressed before training the model to avoid errors and bias.
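
To put the raw counts in proportion, the missing values can also be expressed as a percentage of all rows. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the small `sample` frame holds made-up values purely for illustration.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Tiny illustrative frame (made-up values, not taken from products.csv)
sample = pd.DataFrame({
    "Product Title": ["iphone 8", None, "galaxy s9", "fridge"],
    "Category Label": ["Mobile Phones", "Mobile Phones", None, None],
    "Merchant ID": [1, 2, 3, 4],
})

summary = missing_summary(sample)
print(summary)
```

In the notebook, the same helper could be called as `missing_summary(df)`. Given the counts reported above, `Product Title` would come out at roughly 0.49% of the 35,311 rows.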


## Handling Missing Values

Before training the model, it is necessary to handle missing values in the dataset.
Many machine learning algorithms do not accept missing values as input, and rows
without a target label cannot be used for supervised training at all.

Given the large size of the dataset, rows with missing values will be removed.
This approach simplifies preprocessing while preserving a sufficient amount of data
for training a robust classification model.

In the following step, we remove all rows that contain at least one missing value
and evaluate how much data is lost as a result.


In [3]:
# Rows before cleaning
rows_before = len(df)

# Remove rows with missing values
df_clean = df.dropna()

# Rows after cleaning
rows_after = len(df_clean)

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning: {rows_after}")
print(f"Removed rows: {rows_before - rows_after}")
print(f"Percentage removed: {((rows_before - rows_after) / rows_before) * 100:.2f}%")


Rows before cleaning: 35311
Rows after cleaning: 34760
Removed rows: 551
Percentage removed: 1.56%
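
Losing 1.56% of rows is small overall, but it may still be worth confirming that the removed rows are not concentrated in a few categories. The sketch below is one possible check under stated assumptions: `category_share_shift` is a hypothetical helper, and the `raw` / `cleaned` frames are made-up stand-ins for `df` and `df_clean`.

```python
import pandas as pd


def category_share_shift(before: pd.DataFrame, after: pd.DataFrame,
                         label_col: str) -> pd.DataFrame:
    """Compare each category's share of labeled rows before and after cleaning."""
    share_before = before[label_col].value_counts(normalize=True)
    share_after = after[label_col].value_counts(normalize=True)
    out = pd.DataFrame({
        "share_before": share_before,
        "share_after": share_after,
    }).fillna(0.0)  # a category that vanished entirely gets share 0
    out["shift"] = out["share_after"] - out["share_before"]
    return out.sort_values("shift")


# Made-up example: one "phones" row is dropped because its title is missing
raw = pd.DataFrame({
    "label": ["phones", "phones", "tvs", "tvs"],
    "title": ["a", None, "c", "d"],
})
cleaned = raw.dropna()

shift = category_share_shift(raw, cleaned, "label")
print(shift)
```

In the notebook this would be called with the real frames and the label column name exactly as it is stored at this stage. Note that the `df.info()` output suggests the name carries a leading space (`" Category Label"`). Shifts close to zero for every category would indicate that dropping rows did not noticeably distort the class distribution.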


### Column and Text Standardization

In this step, we perform basic standardization of the dataset to ensure
consistency and improve downstream text processing for machine learning models.

What we do:
- normalize column names (lowercase, underscores instead of spaces),
- convert text values to lowercase,
- remove leading and trailing whitespace from text fields.

This step is important because text-based models treat values such as
"Phone" and "phone" as different tokens, which can negatively impact
model performance.
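
As a small illustration of the effect, the sketch below applies the same three operations to made-up values. The sample strings are assumptions chosen for demonstration; the column names mirror those shown in the `df.info()` output.

```python
import pandas as pd

# Four spellings that a case- and whitespace-sensitive model sees as different
labels = pd.Series(["Phone", "phone", " phone ", "PHONE"])
print(labels.nunique())      # distinct raw strings

normalized = labels.str.lower().str.strip()
print(normalized.nunique())  # distinct strings after normalization

# The same idea applied to column names: trim, lowercase, spaces -> underscores
raw_cols = pd.Index([" Category Label", "_Product Code", "product ID"])
clean_cols = raw_cols.str.strip().str.lower().str.replace(" ", "_")
print(list(clean_cols))
```

Here the four raw spellings collapse into a single value, and the column names become `category_label`, `_product_code`, and `product_id`.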


In [4]:
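# Work on a copy of the cleaned frame so the raw `df` stays untouched
df_clean = df_clean.copy()

# Rename columns for easier handling: trim, lowercase, spaces -> underscores
df_clean.columns = (
    df_clean.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

# Normalize text columns (no NaN remains after dropna, so astype(str) is safe)
text_columns = ["product_title", "category_label"]

for col in text_columns:
    df_clean[col] = (
        df_clean[col]
        .astype(str)
        .str.lower()
        .str.strip()
    )

# Preview cleaned text columns
df_clean[text_columns].head()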


|   | product_title | category_label |
|---|---|---|
| 0 | apple iphone 8 plus 64gb silver | mobile phones |
| 1 | apple iphone 8 plus 64 gb spacegrau | mobile phones |
| 2 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | mobile phones |
| 3 | apple iphone 8 plus 64gb space grey | mobile phones |
| 4 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | mobile phones |
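
A preview of five rows cannot confirm that every value was standardized, so a quick programmatic check may be useful. One thing to watch for: depending on the pandas version, casting a column that still contains missing values with `.astype(str)` can turn them into the literal string `"nan"`, which then looks like a real title or category. The sketch below is an assumption-laden example rather than part of the original notebook: `text_quality_report` and the `PLACEHOLDERS` list are hypothetical, and the sample values are made up.

```python
import pandas as pd

# Assumed set of strings that usually indicate a disguised missing value
PLACEHOLDERS = ["nan", "none", ""]


def text_quality_report(series: pd.Series) -> dict:
    """Count leftover issues in a text column after standardization."""
    values = series.dropna().astype(str)
    trimmed = values.str.strip()
    return {
        # literal placeholder strings such as "nan" or an empty string
        "placeholder_strings": int(trimmed.str.lower().isin(PLACEHOLDERS).sum()),
        # values that still carry leading or trailing whitespace
        "untrimmed": int((values != trimmed).sum()),
        # values that still contain uppercase characters
        "not_lowercase": int((values != values.str.lower()).sum()),
    }


# Made-up column with one example of each problem
sample = pd.Series(["mobile phones", "nan", " tvs", "Fridges", ""])
report = text_quality_report(sample)
print(report)
```

Running the same function on the standardized `product_title` and `category_label` columns should ideally report zero for all three counters.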
