In this notebook, we train a model on the dataset created in the Transforming Meta-Categories notebook. The dataset contains product category labels (X) and a labeled Meta-Category (y). The model should recognize a product and predict which Meta-Category it belongs to. We try two models: TF-IDF with Naive Bayes and TF-IDF with Logistic Regression. Following the results from v1 of this notebook, we also use stratified cross-validation so that every fold preserves the class distribution of the full dataset.
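
To make the idea of stratified folds concrete, here is a minimal pure-Python sketch. It is a simplified illustration and not scikit-learn's `StratifiedKFold` algorithm: the helper `stratified_folds` is hypothetical, and the toy labels are made up, except that the two rare class sizes (6 and 3) mirror the class distribution printed later in this notebook.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample to a fold by dealing every class out round-robin.

    Hypothetical helper for illustration only; scikit-learn's
    StratifiedKFold uses a different (shuffled) allocation.
    """
    seen = defaultdict(int)  # how many samples of each class were placed so far
    folds = []
    for label in labels:
        folds.append(seen[label] % n_splits)
        seen[label] += 1
    return folds


# Toy labels (assumption): one large class plus two very small ones.
labels = (
    ["Portable Electronics"] * 20
    + ["Pet Products"] * 6
    + ["Kitchen Storage"] * 3
)
folds = stratified_folds(labels, n_splits=5)

# Count how many samples of each class land in each fold.
per_fold = defaultdict(Counter)
for label, fold in zip(labels, folds):
    per_fold[fold][label] += 1

for fold in range(5):
    print(fold, dict(per_fold[fold]))

# A class with fewer samples than folds cannot appear in every fold.
folds_missing_kitchen = [f for f in range(5) if per_fold[f]["Kitchen Storage"] == 0]
print("Folds without 'Kitchen Storage':", folds_missing_kitchen)
```

The large class is spread evenly, but a class with only 3 samples cannot be represented in all 5 folds. In folds where a class is missing or never predicted, per-class precision and recall become unreliable, which is worth keeping in mind when reading the macro-averaged metrics further down.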

In [3]:
import pandas as pd
from pathlib import Path

# Resolve paths relative to the notebook's working directory
current_dir = Path.cwd()
parent_dir = current_dir.parent

# Load the training data from the CSV file in the sibling "Spreadsheets" directory
training_data_file = parent_dir / "Spreadsheets" / "merged_dataset_with_meta_category.csv"
training_data = pd.read_csv(training_data_file)



In [4]:
# Visualize the first few rows of the training data
print("Training Data Sample:")
print(training_data.head())

# Drop the name_y and Meta-Category2 columns (errors='ignore' skips any that are missing)
training_data = training_data.drop(columns=['name_y', 'Meta-Category2'], errors='ignore')

print("Shape of the training data after dropping columns:")
print(training_data.shape)

Training Data Sample:
                                              name_x  \
0  AmazonBasics AA Performance Alkaline Batteries...   
1  AmazonBasics AA Performance Alkaline Batteries...   
2  AmazonBasics AA Performance Alkaline Batteries...   
3  AmazonBasics AA Performance Alkaline Batteries...   
4  AmazonBasics AA Performance Alkaline Batteries...   

                                            category primary_category name_y  \
0  AA,AAA,Electronics Features,Health,Electronics...  Health & Beauty    NaN   
1  AA,AAA,Electronics Features,Health,Electronics...  Health & Beauty    NaN   
2  AA,AAA,Electronics Features,Health,Electronics...  Health & Beauty    NaN   
3  AA,AAA,Electronics Features,Health,Electronics...  Health & Beauty    NaN   
4  AA,AAA,Electronics Features,Health,Electronics...  Health & Beauty    NaN   

  Meta-Category Meta-Category2  
0     Batteries      Batteries  
1     Batteries      Batteries  
2     Batteries      Batteries  
3     Batteries      Batteri

In [5]:
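# Preprocess the data via tokenization and lemmatization
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Fetch the NLTK resources used below (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)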


# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()


def text_preprocessing_pipeline(text):
    """Tokenize, keep alphabetic characters only, and lemmatize each token."""
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)

    # Step 2: Strip every non-letter character (punctuation, digits) from each token
    tokens = [re.sub(r'[^a-zA-Z]', '', word) for word in tokens]
    tokens = [word for word in tokens if word]  # Remove empty strings

    # Step 3: Lemmatize each token (treated as a noun by default)
    return ' '.join(lemmatizer.lemmatize(word) for word in tokens)

# Create a local copy of the training data
training_data_copy = training_data.copy()

# Apply the text preprocessing pipeline to the 'category' and 'Meta-Category' columns.
# Vectorization is not done here; with cross-validation it is handled inside the pipelines below.
training_data_copy['category'] = training_data_copy['category'].apply(text_preprocessing_pipeline)
training_data_copy['Meta-Category'] = training_data_copy['Meta-Category'].apply(text_preprocessing_pipeline)


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline

# Assuming your data is in training_data_copy
X = training_data_copy['category']  # Features
y = training_data_copy['Meta-Category']  # Target

# Create pipelines (vectorization + model in one step)
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

logreg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

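# Use StratifiedKFold to maintain class distribution in each fold
# With 2,886,008 samples, 5 folds gives ~577,000 samples per fold
# Note: the smallest class has only 3 samples, fewer than n_splits, so it cannot appear in every fold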
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validate both models
print("Cross-Validation Results:")
print("=" * 40)

# Naive Bayes
nb_scores = cross_val_score(nb_pipeline, X, y, cv=cv_strategy, scoring='accuracy')
print("Naive Bayes:")
print(f"  Mean Accuracy: {nb_scores.mean():.4f} (+/- {nb_scores.std() * 2:.4f})")
print(f"  Individual Fold Scores: {nb_scores}")

# Logistic Regression  
logreg_scores = cross_val_score(logreg_pipeline, X, y, cv=cv_strategy, scoring='accuracy')
print("\nLogistic Regression:")
print(f"  Mean Accuracy: {logreg_scores.mean():.4f} (+/- {logreg_scores.std() * 2:.4f})")
print(f"  Individual Fold Scores: {logreg_scores}")

# Compare the two
print(f"\nDifference in Mean Accuracy: {abs(nb_scores.mean() - logreg_scores.mean()):.4f}")

# Optional: More detailed cross-validation with multiple metrics
from sklearn.model_selection import cross_validate

# Get multiple metrics at once
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

print("\n" + "=" * 50)
print("DETAILED CROSS-VALIDATION RESULTS")
print("=" * 50)

# Naive Bayes detailed results
nb_detailed = cross_validate(nb_pipeline, X, y, cv=cv_strategy, scoring=scoring)
print("Naive Bayes:")
for metric in scoring:
    scores = nb_detailed[f'test_{metric}']
    print(f"  {metric.capitalize()}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# Logistic Regression detailed results  
logreg_detailed = cross_validate(logreg_pipeline, X, y, cv=cv_strategy, scoring=scoring)
print("\nLogistic Regression:")
for metric in scoring:
    scores = logreg_detailed[f'test_{metric}']
    print(f"  {metric.capitalize()}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# Check class distribution
print("\nClass Distribution:")
print(y.value_counts())
print(f"\nTotal samples: {len(y)}")

Cross-Validation Results:
========================================
Naive Bayes:
  Mean Accuracy: 0.9984 (+/- 0.0001)
  Individual Fold Scores: [0.9983524  0.99834547 0.99833507 0.99837318 0.99851871]

Logistic Regression:
  Mean Accuracy: 0.9996 (+/- 0.0001)
  Individual Fold Scores: [0.99950451 0.99959806 0.99953916 0.99956514 0.99961019]

Difference in Mean Accuracy: 0.0012

DETAILED CROSS-VALIDATION RESULTS


Naive Bayes:
  Accuracy: 0.9984 (+/- 0.0001)
  Precision_macro: 0.6335 (+/- 0.1132)
  Recall_macro: 0.8987 (+/- 0.1627)
  F1_macro: 0.6730 (+/- 0.1209)


Logistic Regression:
  Accuracy: 0.9996 (+/- 0.0001)
  Precision_macro: 0.7121 (+/- 0.1288)
  Recall_macro: 0.6941 (+/- 0.1252)
  F1_macro: 0.7021 (+/- 0.1268)

Class Distribution:
Meta-Category
Portable Electronics          2855705
Connected Home Electronics      14000
Batteries                       12071
Office Supplies                  4223
Pet Products                        6
Kitchen Storage                     3
Name: count, dtype: int64

Total samples: 2886008
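
With a class distribution this skewed, accuracy alone says little. As a rough sanity check, the sketch below computes what a hypothetical baseline that always predicts the majority class would score, using only the class counts printed above. The baseline itself is an assumption added for comparison; it is not one of the models trained in this notebook.

```python
# Class counts copied from the "Class Distribution" output above.
class_counts = {
    "Portable Electronics": 2855705,
    "Connected Home Electronics": 14000,
    "Batteries": 12071,
    "Office Supplies": 4223,
    "Pet Products": 6,
    "Kitchen Storage": 3,
}

total = sum(class_counts.values())
majority = max(class_counts, key=class_counts.get)

# Hypothetical baseline: predict the majority class for every sample.
accuracy = class_counts[majority] / total

# Recall is 1 for the majority class and 0 for every other class.
macro_recall = 1 / len(class_counts)

# Precision for the majority class equals its share of the data;
# all other classes are never predicted, so their F1 counts as 0.
precision_majority = class_counts[majority] / total
f1_majority = 2 * precision_majority / (precision_majority + 1)
macro_f1 = f1_majority / len(class_counts)

print(f"Majority class: {majority}")
print(f"Baseline accuracy:     {accuracy:.4f}")
print(f"Baseline macro recall: {macro_recall:.4f}")
print(f"Baseline macro F1:     {macro_f1:.4f}")
```

The baseline already reaches roughly 99% accuracy without learning anything, so the macro-averaged precision, recall, and F1 reported above are the more informative numbers for comparing the two models.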


In [8]:
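# Fit both pipelines on the full dataset, then save them via joblib for later use.
# cross_val_score and cross_validate fit clones, so the pipelines above are still unfitted.
import joblib

# Define the directory to save the models and create it if it does not exist
models_dir = current_dir / "Models joblib"
models_dir.mkdir(parents=True, exist_ok=True)

nb_pipeline.fit(X, y)
logreg_pipeline.fit(X, y)

joblib.dump(nb_pipeline, models_dir / 'naive_bayes_model_cv.joblib')
joblib.dump(logreg_pipeline, models_dir / 'logistic_regression_model_cv.joblib')
joblib.dump(cv_strategy, models_dir / 'cv_strategy.joblib')
print("Models and cross-validation strategy saved successfully.")

Models and cross-validation strategy saved successfully.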


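From the results above, we can see that, unfortunately, cross-validation does not yield better results. Accuracy is very high for both models (0.9984 for Naive Bayes, 0.9996 for Logistic Regression), but this mostly reflects the dominant class: Portable Electronics accounts for 2,855,705 of the 2,886,008 samples (about 99%). The macro-averaged F1 scores, which weight every class equally, are much lower (0.6730 and 0.7021), and the smallest classes (Pet Products with 6 samples, Kitchen Storage with 3) are too small to train or evaluate on reliably. By its nature, this dataset is too limited to properly train a classification model on. As next steps, we will consider a different approach of using an LLM to 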