<a href="https://colab.research.google.com/github/rakosdonja/product-category-classifier/blob/main/notebooks/02_product_category_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Product Category Classification – Modeling

This notebook focuses on building and evaluating machine learning models for automatic product category prediction based on product titles.

In this phase, we:
- define features and target variables,
- split the data into training and test sets,
- apply appropriate text and numeric preprocessing,
- train multiple machine learning models,
- evaluate their performance and select the best-performing approach.

This notebook builds directly on the cleaned and prepared dataset from the EDA phase.


## Load dataset from GitHub

In this step, we load the `products.csv` file directly from the GitHub repository using its raw URL.
This keeps the notebook reproducible (no manual uploads), and everyone on the team loads the same file. Because the URL points at the `main` branch, the data will change if the file is updated there.


In [1]:
import pandas as pd

# Load dataset from GitHub (raw link)
url = "https://raw.githubusercontent.com/rakosdonja/product-category-classifier/main/data/products.csv"
df = pd.read_csv(url)

print("Shape (rows, cols):", df.shape)
display(df.head())


Shape (rows, cols): (35311, 8)


Unnamed: 0,product ID,Product Title,Merchant ID,Category Label,_Product Code,Number_of_Views,Merchant Rating,Listing Date
0,1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024
1,2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024
2,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,Mobile Phones,FP-8086-IE,3092.0,3.9,11/10/2024
3,4,apple iphone 8 plus 64gb space grey,4,Mobile Phones,YI-0086-US,466.0,3.4,5/2/2022
4,5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,5,Mobile Phones,NZ-3586-WP,4426.0,1.6,4/12/2023


## Basic data cleaning and column selection

Before training any machine learning model, we must ensure that:
- only relevant columns are used,
- missing values are handled consistently,
- column names are standardized for easier processing.

In this step, we:
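- standardize column names (trimmed, lowercase, underscores instead of spaces),
- keep only the two columns needed for modeling (`product_title` and `category_label`),
- remove rows with a missing value in either of those columns.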


In [2]:
# Make a copy to avoid modifying the original dataframe
df_model = df.copy()

# Standardize column names
df_model.columns = (
    df_model.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

# Keep only relevant columns
df_model = df_model[[
    "product_title",
    "category_label"
]]

# Drop rows with missing values
rows_before = len(df_model)
df_model = df_model.dropna()
rows_after = len(df_model)

print(f"Rows before cleaning: {rows_before}")
print(f"Rows after cleaning: {rows_after}")
print(f"Removed rows: {rows_before - rows_after}")

display(df_model.head())


Rows before cleaning: 35311
Rows after cleaning: 35096
Removed rows: 215


Unnamed: 0,product_title,category_label
0,apple iphone 8 plus 64gb silver,Mobile Phones
1,apple iphone 8 plus 64 gb spacegrau,Mobile Phones
2,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,Mobile Phones
3,apple iphone 8 plus 64gb space grey,Mobile Phones
4,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,Mobile Phones


## Feature engineering for product titles

Machine learning models cannot work directly with raw text.
Before training, we need to extract meaningful numerical signals from the product titles.

In this step, we create additional features from the product title that may help the model distinguish between categories:
- title length (number of characters),
- number of words in the title,
- number of digits (often model numbers, sizes, capacities),
- number of special characters (any character that is not an ASCII letter, a digit, or whitespace, such as /, -, +, or .).

These features capture structural patterns commonly used in product naming.


In [3]:
# Create feature: title length (number of characters)
df_model["title_length"] = df_model["product_title"].str.len()

# Create feature: number of words
df_model["title_word_count"] = df_model["product_title"].str.split().str.len()

# Create feature: number of digits in the title
df_model["title_digit_count"] = df_model["product_title"].str.count(r"\d")

# Create feature: number of special characters
df_model["title_special_char_count"] = df_model["product_title"].str.count(r"[^a-zA-Z0-9\s]")

display(df_model.head())


Unnamed: 0,product_title,category_label,title_length,title_word_count,title_digit_count,title_special_char_count
0,apple iphone 8 plus 64gb silver,Mobile Phones,31,6,3,0
1,apple iphone 8 plus 64 gb spacegrau,Mobile Phones,35,7,3,0
2,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,Mobile Phones,70,13,9,2
3,apple iphone 8 plus 64gb space grey,Mobile Phones,35,7,3,0
4,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,Mobile Phones,54,11,6,1
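To make the four features concrete, the sketch below recomputes them for a single title with plain Python. The helper name `title_features` is hypothetical and not part of the notebook; the regular expressions are the same ones used in the cell above.

```python
import re


def title_features(title: str) -> dict:
    """Compute the four structural features for one product title."""
    return {
        "title_length": len(title),
        "title_word_count": len(title.split()),
        "title_digit_count": len(re.findall(r"\d", title)),
        # Anything that is not an ASCII letter, digit, or whitespace counts as special
        "title_special_char_count": len(re.findall(r"[^a-zA-Z0-9\s]", title)),
    }


# First row of the dataset
example = title_features("apple iphone 8 plus 64gb silver")
print(example)
```

For this title the values match row 0 of the table above: 31 characters, 6 words, 3 digits, and 0 special characters. A decimal point, as in `5.5`, counts as one special character.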


## Preparing data for modeling

In this step, we define:
- the input features (X) that the model will learn from
- the target variable (y) that the model must predict

We then split the dataset into training and test sets.
This allows us to estimate model performance on unseen data and to detect overfitting. The split is stratified, so each category keeps the same proportion in both sets.


In [4]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df_model[
    [
        "product_title",
        "title_length",
        "title_word_count",
        "title_digit_count",
        "title_special_char_count",
    ]
]

y = df_model["category_label"]

# Train-test split (80% train, 20% test, stratified by category)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (28076, 5)
Test shape: (7020, 5)
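The effect of `stratify` is easiest to see on a small example. The sketch below uses a hypothetical toy dataset with an 80/20 label ratio (not the product data) and checks that both splits keep that ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 rows labelled "A", 20 rows labelled "B"
toy = pd.DataFrame({"x": range(100), "label": ["A"] * 80 + ["B"] * 20})

train, test = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],
)

# Share of each label in the two splits
train_share = train["label"].value_counts(normalize=True)
test_share = test["label"].value_counts(normalize=True)
print(train_share)
print(test_share)
```

Both splits keep the 80/20 ratio. Without `stratify`, a small class could end up over- or under-represented in the test set purely by chance.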


## Text vectorization and numeric feature scaling

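The product title is still raw text, and the count features sit on different scales.
In this step, we convert everything into a single numeric feature matrix:

- Product titles are converted into TF-IDF vectors (with English stop words removed)
- Numeric features are scaled to the [0, 1] range with `MinMaxScaler`

We combine these transformations using a `ColumnTransformer`.
It is fitted on the training set only and then applied unchanged to the test set, so no information from the test set leaks into preprocessing.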


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

# Define preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("title_tfidf", TfidfVectorizer(stop_words="english"), "product_title"),
        ("length_scaler", MinMaxScaler(), ["title_length"]),
        ("word_count_scaler", MinMaxScaler(), ["title_word_count"]),
        ("digit_count_scaler", MinMaxScaler(), ["title_digit_count"]),
        ("special_char_scaler", MinMaxScaler(), ["title_special_char_count"]),
    ]
)

# Fit and transform training data
X_train_processed = preprocessor.fit_transform(X_train)

# Transform test data
X_test_processed = preprocessor.transform(X_test)

print("Processed train shape:", X_train_processed.shape)
print("Processed test shape:", X_test_processed.shape)


Processed train shape: (28076, 17302)
Processed test shape: (7020, 17302)
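The following sketch shows the same fit-on-train, transform-on-test pattern on a hypothetical three-row dataset. It uses one `MinMaxScaler` for a list of numeric columns; the scaler treats each column independently, so this is equivalent to declaring one scaler per column as in the cell above. The toy titles and the `toy_` names are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with two of the engineered features
toy_train = pd.DataFrame({
    "product_title": ["apple iphone 8 64gb", "bosch dishwasher 60cm", "samsung tv 55 inch"],
    "title_length": [19, 21, 18],
    "title_word_count": [4, 3, 4],
})
toy_test = pd.DataFrame({
    "product_title": ["sony tv 65 inch oled"],
    "title_length": [20],
    "title_word_count": [5],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # A single column name (not a list) passes the 1-D text column TfidfVectorizer expects
        ("title_tfidf", TfidfVectorizer(), "product_title"),
        ("numeric_scaler", MinMaxScaler(), ["title_length", "title_word_count"]),
    ]
)

# Vocabulary and min/max values are learned from the training rows only
train_matrix = toy_preprocessor.fit_transform(toy_train)
test_matrix = toy_preprocessor.transform(toy_test)

vocab_size = len(toy_preprocessor.named_transformers_["title_tfidf"].vocabulary_)
print(train_matrix.shape, test_matrix.shape, vocab_size)
```

The output width is the vocabulary size plus the number of numeric columns. Words that appear only in the test title (here `sony`, `65`, `oled`) are ignored, because the vocabulary was fixed during fitting.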


## Train baseline models (first benchmark)

Now that all features are numeric, we can train multiple classifiers.
We start with a few reliable baseline models and compare them using:

- accuracy
- precision / recall / F1-score per class
- macro-averaged and weighted F1-score

This gives us a clear benchmark before deciding which model is best.


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd

models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Linear SVC": LinearSVC(),
    "Multinomial Naive Bayes": MultinomialNB()
}

results = []

for name, model in models.items():
    print("=" * 70)
    print(f"MODEL: {name}")
    print("=" * 70)

    # Train
    model.fit(X_train_processed, y_train)

    # Predict
    y_pred = model.predict(X_test_processed)

    # Metrics
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {acc:.4f}\n")

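    print("Classification report:")
    # zero_division=0 scores labels that are never predicted as 0 without emitting a warning
    print(classification_report(y_test, y_pred, digits=4, zero_division=0))

    # Save summary for comparison
    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)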
    results.append({
        "model": name,
        "accuracy": acc,
        "macro_f1": report["macro avg"]["f1-score"],
        "weighted_f1": report["weighted avg"]["f1-score"]
    })

results_df = pd.DataFrame(results).sort_values(by="weighted_f1", ascending=False)
results_df


MODEL: Logistic Regression
Accuracy: 0.9553

Classification report:
                  precision    recall  f1-score   support

             CPU     0.0000    0.0000    0.0000        17
            CPUs     0.9752    0.9973    0.9861       749
 Digital Cameras     0.9926    0.9907    0.9916       538
     Dishwashers     0.9303    0.9604    0.9451       681
        Freezers     0.9927    0.9227    0.9564       440
 Fridge Freezers     0.9520    0.9424    0.9472      1094
         Fridges     0.8693    0.9098    0.8890       687
      Microwaves     0.9933    0.9571    0.9749       466
    Mobile Phone     0.0000    0.0000    0.0000        11
   Mobile Phones     0.9624    0.9913    0.9766       801
             TVs     0.9764    0.9929    0.9846       708
Washing Machines     0.9481    0.9552    0.9516       803
          fridge     0.0000    0.0000    0.0000        25

        accuracy                         0.9553      7020
       macro avg     0.7379    0.7400    0.7387      7020
  

MODEL: Linear SVC


Accuracy: 0.9664

Classification report:
                  precision    recall  f1-score   support

             CPU     0.0000    0.0000    0.0000        17
            CPUs     0.9777    0.9960    0.9868       749
 Digital Cameras     0.9944    0.9963    0.9954       538
     Dishwashers     0.9509    0.9677    0.9592       681
        Freezers     0.9884    0.9659    0.9770       440
 Fridge Freezers     0.9642    0.9589    0.9615      1094
         Fridges     0.8964    0.9316    0.9136       687
      Microwaves     0.9934    0.9764    0.9848       466
    Mobile Phone     0.0000    0.0000    0.0000        11
   Mobile Phones     0.9756    0.9963    0.9858       801
             TVs     0.9805    0.9958    0.9881       708
Washing Machines     0.9698    0.9601    0.9650       803
          fridge     0.0000    0.0000    0.0000        25

        accuracy                         0.9664      7020
       macro avg     0.7455    0.7496    0.7475      7020
    weighted avg     0.9598  

MODEL: Multinomial Naive Bayes


Accuracy: 0.9229

Classification report:
                  precision    recall  f1-score   support

             CPU     0.0000    0.0000    0.0000        17
            CPUs     0.9701    0.9973    0.9835       749
 Digital Cameras     0.9871    0.9944    0.9907       538
     Dishwashers     0.9893    0.9530    0.9708       681
        Freezers     0.9951    0.4659    0.6347       440
 Fridge Freezers     0.7270    0.9909    0.8387      1094
         Fridges     0.9068    0.7933    0.8463       687
      Microwaves     0.9934    0.9635    0.9782       466
    Mobile Phone     0.0000    0.0000    0.0000        11
   Mobile Phones     0.9778    0.9913    0.9845       801
             TVs     0.9929    0.9873    0.9901       708
Washing Machines     0.9822    0.9614    0.9717       803
          fridge     0.0000    0.0000    0.0000        25

        accuracy                         0.9229      7020
       macro avg     0.7324    0.6999    0.7069      7020
    weighted avg     0.9296  



Unnamed: 0,model,accuracy,macro_f1,weighted_f1
1,Linear SVC,0.966382,0.747477,0.963
0,Logistic Regression,0.955271,0.738707,0.951757
2,Multinomial Naive Bayes,0.922934,0.706863,0.916619
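The training cell imports `confusion_matrix` but never calls it. A confusion matrix shows which categories a model mixes up, which per-class scores alone do not reveal. The sketch below uses short hypothetical label lists in place of `y_test` and `y_pred`; the values are made up for illustration and are not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, standing in for y_test and y_pred
y_true = ["Fridges", "Fridges", "Freezers", "Freezers", "Fridge Freezers", "Fridge Freezers"]
y_hat = ["Fridges", "Fridge Freezers", "Fridge Freezers", "Freezers", "Fridge Freezers", "Fridge Freezers"]

# Fix the label order so rows and columns line up
labels = sorted(set(y_true))

cm = pd.DataFrame(
    confusion_matrix(y_true, y_hat, labels=labels),
    index=[f"true: {label}" for label in labels],
    columns=[f"pred: {label}" for label in labels],
)
print(cm)
```

Rows are true categories and columns are predictions, so off-diagonal cells count the mistakes. In the notebook, the same call with `y_test` and each model's `y_pred` would show where the refrigeration categories get confused with one another.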


## 🧠 Model Evaluation and Final Selection

In this phase of the project, multiple machine learning models were trained and evaluated to determine the most effective approach for automatic product category prediction based on product titles.

### Models evaluated
The following algorithms were trained using the same feature set and preprocessing pipeline to ensure a fair comparison:
- **Multinomial Naive Bayes**
- **Logistic Regression**
- **Linear Support Vector Classifier (Linear SVC)**

Each model was evaluated on a held-out test set using:
- Accuracy  
- Precision, Recall, and F1-score (per class)  
- Macro-averaged F1 score  
- Weighted F1 score  

---

### Key observations

- **Multinomial Naive Bayes** had the lowest accuracy (0.9229) and weighted F1 (0.9166). Its main weakness was the refrigeration categories: recall for Freezers dropped to 0.4659 while precision for Fridge Freezers fell to 0.7270, which suggests that many freezers were predicted as fridge freezers. This is consistent with the model's feature-independence assumption, since these categories share much of their vocabulary.

- **Logistic Regression** achieved strong overall performance (accuracy 0.9553, weighted F1 0.9518), with per-class F1 above 0.94 for every well-represented category except Fridges (0.8890).

- **Linear SVC** delivered the **best overall results**, achieving:
  - the highest accuracy (0.9664),
  - the highest weighted F1 (0.9630) and macro F1 (0.7475),
  - and a per-class F1 of at least 0.91 on all ten well-represented categories.

---

### Final model choice

**Linear Support Vector Classifier (Linear SVC)** was selected as the final model for this task.

This choice is justified by:
- the highest accuracy, weighted F1, and macro F1 of the three models,
- strong handling of high-dimensional sparse text features,
- and the long track record of linear SVMs in text classification.

Linear SVC does not solve the class-imbalance problem: like the other two models, it scored zero on the three rare labels.

---

### Notes on limitations

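Three labels (`CPU`, `Mobile Phone`, and `fridge`) have only 11 to 25 test samples each, and every model scored zero precision and recall on them. This pulls macro F1 down to roughly 0.71 to 0.75 even though accuracy stays above 0.92. These labels look like spelling variants of `CPUs`, `Mobile Phones`, and `Fridges` rather than separate categories. In a production environment, such categories could be merged with semantically equivalent labels or excluded based on business requirements.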
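Merging the rare labels could be done before the train-test split. The sketch below is one possible approach; the mapping is an assumption based on the label names in the classification reports (in particular, whether `fridge` belongs with `Fridges` should be checked against the data), and `normalize_labels` is a hypothetical helper.

```python
import pandas as pd

# Assumed mapping from rare label variants to their canonical categories
LABEL_FIXES = {
    "CPU": "CPUs",
    "Mobile Phone": "Mobile Phones",
    "fridge": "Fridges",
}


def normalize_labels(labels: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and map known variants to canonical labels."""
    return labels.str.strip().replace(LABEL_FIXES)


# Hypothetical sample of raw labels
raw = pd.Series(["CPU", "CPUs", "fridge", "Fridges", "Mobile Phone", "TVs"])
clean = normalize_labels(raw)
print(clean.value_counts())
```

In the notebook, this would be applied as `df_model["category_label"] = normalize_labels(df_model["category_label"])` before defining `y`.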

---

### Next steps

The selected model will be retrained on the full dataset, packaged together with the preprocessing pipeline, and saved as a reusable `.pkl` file for integration into downstream applications.
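A minimal sketch of that packaging step, assuming a scikit-learn `Pipeline` and `joblib` for serialization. The toy data, the reduced feature set, and the file name `product_category_model.pkl` are placeholders, not the project's actual artifacts.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Tiny hypothetical dataset standing in for the full df_model
toy = pd.DataFrame({
    "product_title": [
        "apple iphone 8 64gb",
        "samsung galaxy s9 64gb",
        "bosch dishwasher 60cm",
        "beko dishwasher white",
    ],
    "title_length": [19, 22, 21, 21],
    "category_label": ["Mobile Phones", "Mobile Phones", "Dishwashers", "Dishwashers"],
})
feature_columns = ["product_title", "title_length"]

# Preprocessing and classifier travel together as one object
pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("title_tfidf", TfidfVectorizer(), "product_title"),
        ("length_scaler", MinMaxScaler(), ["title_length"]),
    ])),
    ("classifier", LinearSVC()),
])
pipeline.fit(toy[feature_columns], toy["category_label"])

# Save and reload; the path is a placeholder in a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "product_category_model.pkl")
joblib.dump(pipeline, model_path)
restored = joblib.load(model_path)

new_items = pd.DataFrame({"product_title": ["apple iphone 7 32gb"], "title_length": [19]})
print(restored.predict(new_items))
```

Saving the whole pipeline rather than the classifier alone means downstream code passes in a DataFrame with the same feature columns and never has to rebuild the TF-IDF vocabulary or the scaler.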

This concludes the model selection phase and prepares the project for deployment.
