#### 06 - BERT baseline for fashion vs non-fashion

This notebook fine-tunes a transformer-based classifier (BERT) on the
same cleaned dataset and time-aware splits used for the TF-IDF + logistic baseline:

- Input text: `product_text_norm`
- Labels: `label` (0 = non-fashion, 1 = fashion)
- Splits: `train`, `val`, `test` from `products_with_splits.parquet`

We will:
1. Load the processed dataset and splits.
2. Tokenize texts with a pre-trained transformer tokenizer.
3. Fine-tune the model with Hugging Face `Trainer`.
4. Plot training and validation loss curves to check for overfitting.
5. Evaluate on the validation and test splits, then (in a later step) tune a probability threshold, as we did for logistic regression.
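
Step 5 leaves the threshold choice for later. As a rough sketch of what that tuning could look like (not necessarily the procedure used for logistic regression), the snippet below turns two-class logits into a fashion probability and sweeps a threshold grid for the best F1. The helper names, the grid, and the toy logits are hypothetical; in this notebook the logits would presumably come from something like `trainer.predict(val_ds_tok).predictions`.

```python
import math

def fashion_probability(logit_pair):
    """Softmax probability of class 1 from a [non-fashion, fashion] logit pair."""
    z0, z1 = logit_pair
    diff = z1 - z0
    # For two classes, softmax reduces to a sigmoid of the logit difference.
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def f1_at_threshold(probs, labels, threshold):
    """F1 for the positive class (label 1) when predicting 1 iff prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels, grid=None):
    """Return the first grid threshold that maximizes F1 (hypothetical grid: 0.05..0.95)."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    return max(grid, key=lambda t: f1_at_threshold(probs, labels, t))

# Toy logits and labels, made up purely for illustration.
toy_logits = [[2.0, -1.0], [0.5, 0.2], [-0.3, 0.4], [-2.0, 3.0], [1.0, 1.2], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1, 0, 1]
toy_probs = [fashion_probability(pair) for pair in toy_logits]

chosen = best_threshold(toy_probs, toy_labels)
print("threshold:", chosen, "F1:", f1_at_threshold(toy_probs, toy_labels, chosen))
```

The threshold should be chosen on the validation split only and then applied unchanged to the test split, so the test metrics stay an honest estimate.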

In [1]:
# Set up project paths, imports, and load products_with_splits

from pathlib import Path
import sys
import pandas as pd

# 1) Find project root (folder containing "src") and add src to sys.path
cwd = Path.cwd()
project_root = None
for path in [cwd, *cwd.parents]:
    if (path / "src").is_dir():
        project_root = path
        break

if project_root is None:
    raise FileNotFoundError("Could not find project root containing 'src' folder.")

SRC_DIR = project_root / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

from config import PROCESSED_DATA_DIR

print("Project root:", project_root)
print("PROCESSED_DATA_DIR:", PROCESSED_DATA_DIR)

# 2) Load processed dataset with splits
data_path = PROCESSED_DATA_DIR / "products_with_splits.parquet"
df = pd.read_parquet(data_path)

print("Full dataset shape:", df.shape)
print("\nSplits:")
print(df["split"].value_counts().sort_index())

print("\nLabels:")
print(df["label"].value_counts().sort_index())

df[["product_text_raw", "product_text_norm", "label", "split"]].head(10)

Project root: /Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation
PROCESSED_DATA_DIR: /Users/ramana/Documents/Homework/1st class ML opt/Project 1/Product-Classifcation/data/processed
Full dataset shape: (19767, 10)

Splits:
split
test      2544
train    13961
val       3262
Name: count, dtype: int64

Labels:
label
0     3165
1    16602
Name: count, dtype: int64


Unnamed: 0,product_text_raw,product_text_norm,label,split
0,001B 3000A Car Jump Starter Battery Pack (up t...,001b 3000a car jump starter battery pack up to...,0,train
1,"012 Jump Starter Battery Pack, 4000A Peak Car ...",012 jump starter battery pack 4000a peak car b...,0,train
2,1/2 Ct Diamond Stud Earrings 14k Yellow Gold F...,1 2 ct diamond stud earrings 14k yellow gold f...,1,test
3,1-2 Pairs 925 Sterling Silver Mens Earrings Cu...,1 2 pairs 925 sterling silver mens earrings cu...,1,train
4,"1/2"""" x 18"""" Zirconia Sanding Belts for Belt S...",1 2 x 18 zirconia sanding belts for belt sande...,0,train
5,"1.5 Gram x 50 Vial Convenient Super Glue, Wegl...",1 5 gram x 50 vial convenient super glue wegla...,0,train
6,"1/6 Scale Female Clothes, Female Black Leather...",1 6 scale female clothes female black leather ...,1,train
7,"1/6 Scale Female Clothes, Female Sports Underw...",1 6 scale female clothes female sports underwe...,1,train
8,"1/6 Scale Female Clothes, Female Sports Underw...",1 6 scale female clothes female sports underwe...,1,train
9,1.75mm Normal PLA 4 Most Basic Colors Bundle P...,1 75mm normal pla 4 most basic colors bundle p...,0,train


In [2]:
# Create train / val / test DataFrames for BERT fine-tuning

df_train = df[df["split"] == "train"].copy()
df_val   = df[df["split"] == "val"].copy()
df_test  = df[df["split"] == "test"].copy()

print("Shapes:")
print("  train:", df_train.shape)
print("  val  :", df_val.shape)
print("  test :", df_test.shape)

def print_class_balance(name, subdf):
    counts = subdf["label"].value_counts().sort_index()
    pct = subdf["label"].value_counts(normalize=True).sort_index() * 100
    print(f"\nClass balance in {name} (0=non-fashion, 1=fashion):")
    for k in counts.index:
        print(f"  label={k}: {counts[k]} rows ({pct[k]:.2f}%)")

print_class_balance("train", df_train)
print_class_balance("val", df_val)
print_class_balance("test", df_test)

df_train[["product_text_raw", "product_text_norm", "label"]].head(5)

Shapes:
  train: (13961, 10)
  val  : (3262, 10)
  test : (2544, 10)

Class balance in train (0=non-fashion, 1=fashion):
  label=0: 2246 rows (16.09%)
  label=1: 11715 rows (83.91%)

Class balance in val (0=non-fashion, 1=fashion):
  label=0: 474 rows (14.53%)
  label=1: 2788 rows (85.47%)

Class balance in test (0=non-fashion, 1=fashion):
  label=0: 445 rows (17.49%)
  label=1: 2099 rows (82.51%)


Unnamed: 0,product_text_raw,product_text_norm,label
0,001B 3000A Car Jump Starter Battery Pack (up t...,001b 3000a car jump starter battery pack up to...,0
1,"012 Jump Starter Battery Pack, 4000A Peak Car ...",012 jump starter battery pack 4000a peak car b...,0
3,1-2 Pairs 925 Sterling Silver Mens Earrings Cu...,1 2 pairs 925 sterling silver mens earrings cu...,1
4,"1/2"""" x 18"""" Zirconia Sanding Belts for Belt S...",1 2 x 18 zirconia sanding belts for belt sande...,0
5,"1.5 Gram x 50 Vial Convenient Super Glue, Wegl...",1 5 gram x 50 vial convenient super glue wegla...,0


#### BERT model and tokenizer

We use the `bert-base-uncased` checkpoint from Hugging Face. The tokenizer will turn
`product_text_norm` strings into input IDs and attention masks for BERT. Product
titles are short, so we cap sequence length at 64 tokens.
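
One way to check that a 64-token cap is reasonable is to look at how many titles would actually be truncated. The sketch below summarizes a list of token lengths; the function name and the toy lengths are hypothetical. In this notebook the lengths could be obtained with `[len(ids) for ids in tokenizer(texts, truncation=False)["input_ids"]]`, where `texts` is a list of `product_text_norm` strings.

```python
import math

def truncation_report(token_lengths, max_length):
    """Summarize how a max_length cap interacts with a list of token counts."""
    n = len(token_lengths)
    ordered = sorted(token_lengths)
    # Nearest-rank 95th percentile.
    p95 = ordered[min(n - 1, math.ceil(0.95 * n) - 1)]
    truncated = sum(1 for length in token_lengths if length > max_length)
    return {
        "n": n,
        "max": ordered[-1],
        "p95": p95,
        "truncated_share": truncated / n,
    }

# Toy token counts, made up for illustration (not measured on the dataset).
toy_lengths = [12, 18, 25, 31, 40, 47, 52, 58, 66, 71]
report = truncation_report(toy_lengths, max_length=64)
print(report)
```

If the truncated share on the real data turns out to be non-trivial, raising `MAX_LENGTH` is a cheap experiment, at the cost of more memory and time per batch.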

In [3]:
from transformers import AutoTokenizer

# Choose the BERT checkpoint
MODEL_NAME = "bert-base-uncased"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

MAX_LENGTH = 64  # max tokens per product title (truncates very long titles)

print("Using model:", MODEL_NAME)
print("Tokenizer vocab size:", tokenizer.vocab_size)
print("Max length:", MAX_LENGTH)

# Quick sanity check on a couple of sample texts
for txt in df_train["product_text_norm"].head(3).tolist():
    enc = tokenizer(
        txt,
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
    )
    print("\nSample text:", txt)
    print("input_ids length:", len(enc["input_ids"]))



Using model: bert-base-uncased
Tokenizer vocab size: 30522
Max length: 64

Sample text: 001b 3000a car jump starter battery pack up to 9 0l gas and 7 0l diesel engine 12v car battery charger jump box with usb 3 0 power bank
input_ids length: 64

Sample text: 012 jump starter battery pack 4000a peak car battery charger jump starter for up to 10 0l gas or 8 0l diesel engine 12v car jumper starter portable with full lcd screen led light usb
input_ids length: 64

Sample text: 1 2 pairs 925 sterling silver mens earrings cubic zirconia halo stud earrings for men 18k gold plated heart round square cut cz stud earrings set for women men
input_ids length: 64


The next cell:

- Converts each split into a Hugging Face `Dataset` with `text` and `label` columns.
- Runs the BERT tokenizer on all texts with padding/truncation to `MAX_LENGTH`.
- Renames `label` to `labels` because `Trainer` expects that column name.
- Shows a sample tokenized item so you can see what BERT will receive.
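
To make "padding/truncation to `MAX_LENGTH`" concrete, here is a simplified pure-Python illustration of what the tokenizer produces for a single title. It is not the tokenizer's implementation; `pad_or_truncate` is a hypothetical helper, and the content token IDs in the example are arbitrary. The special-token IDs are the ones in the `bert-base-uncased` vocabulary.

```python
# Special-token IDs in the bert-base-uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_or_truncate(token_ids, max_length):
    """Wrap content tokens in [CLS] ... [SEP], then truncate or pad to max_length."""
    # Reserve two positions for [CLS] and [SEP]; drop content tokens beyond that.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID, *body, SEP_ID]
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": attention_mask + [0] * n_pad,
    }

short = pad_or_truncate([7, 8, 9], max_length=8)
long = pad_or_truncate(list(range(1000, 1010)), max_length=8)
print(short)
print(long)
```

Every example ends up with exactly `max_length` positions, which is why each `input_ids length` printed in this notebook is 64 regardless of the title.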

In [4]:
# Build Hugging Face Datasets from the pandas splits and tokenize them

from datasets import Dataset

# Create base datasets with 'text' and 'label' columns
train_ds = Dataset.from_pandas(
    df_train[["product_text_norm", "label"]].rename(columns={"product_text_norm": "text"})
)
val_ds = Dataset.from_pandas(
    df_val[["product_text_norm", "label"]].rename(columns={"product_text_norm": "text"})
)
test_ds = Dataset.from_pandas(
    df_test[["product_text_norm", "label"]].rename(columns={"product_text_norm": "text"})
)

print("Raw HF datasets:")
print("  train:", train_ds)
print("  val  :", val_ds)
print("  test :", test_ds)

def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
    )

# Apply tokenizer
train_ds_tok = train_ds.map(tokenize_batch, batched=True, remove_columns=["text", "__index_level_0__"])
val_ds_tok   = val_ds.map(tokenize_batch, batched=True, remove_columns=["text", "__index_level_0__"])
test_ds_tok  = test_ds.map(tokenize_batch, batched=True, remove_columns=["text", "__index_level_0__"])

# Rename 'label' to 'labels', the column name Trainer passes to the model as targets
train_ds_tok = train_ds_tok.rename_column("label", "labels")
val_ds_tok   = val_ds_tok.rename_column("label", "labels")
test_ds_tok  = test_ds_tok.rename_column("label", "labels")

print("\nTokenized HF datasets:")
print("  train_tok:", train_ds_tok)
print("  val_tok  :", val_ds_tok)
print("  test_tok :", test_ds_tok)

# Inspect one tokenized example
first_example = train_ds_tok[0]
print("\nExample tokenized item keys:", first_example.keys())
print("input_ids length:", len(first_example["input_ids"]))
print("label:", first_example["labels"])

Raw HF datasets:
  train: Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 13961
})
  val  : Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 3262
})
  test : Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 2544
})


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Map: 100%|██████████| 13961/13961 [00:00<00:00, 32237.57 examples/s]
Map: 100%|██████████| 3262/3262 [00:00<00:00, 34406.34 examples/s]
Map: 100%|██████████| 2544/2544 [00:00<00:00, 39503.72 examples/s]


Tokenized HF datasets:
  train_tok: Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 13961
})
  val_tok  : Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3262
})
  test_tok : Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2544
})

Example tokenized item keys: dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask'])
input_ids length: 64
label: 0





#### BERT sequence classification setup

We now:
- load `bert-base-uncased` as a sequence classification model with 2 labels,
- define evaluation metrics (accuracy, precision, recall, F1 for "fashion" = label 1),
- configure Hugging Face `Trainer` to fine-tune BERT on the train split and evaluate on the val split.
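
Because roughly 85% of validation rows are fashion, accuracy and fashion-class F1 can look strong even for a model that learns nothing. The sketch below computes the metrics by hand for an "always predict fashion" baseline, using the validation class counts printed earlier (474 non-fashion, 2788 fashion). `binary_metrics` is a hypothetical helper written for illustration; the notebook itself relies on scikit-learn.

```python
def binary_metrics(labels, preds, pos_label=1):
    """Accuracy plus precision/recall/F1 for the chosen positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if p == pos_label and y == pos_label)
    fp = sum(1 for y, p in zip(labels, preds) if p == pos_label and y != pos_label)
    fn = sum(1 for y, p in zip(labels, preds) if p != pos_label and y == pos_label)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Validation class counts taken from the class-balance output above.
val_labels = [0] * 474 + [1] * 2788
always_fashion = [1] * len(val_labels)

baseline = binary_metrics(val_labels, always_fashion, pos_label=1)
baseline_non_fashion = binary_metrics(val_labels, always_fashion, pos_label=0)
print("fashion as positive:    ", baseline)
print("non-fashion as positive:", baseline_non_fashion)
```

The trivial baseline already reaches about 0.85 accuracy and about 0.92 fashion F1 while never identifying a single non-fashion item, so fine-tuned results are best read against these floors, and it may be worth also reporting metrics with label 0 as the positive class.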

In [5]:
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os

# 1) Load BERT classifier (2 labels)
id2label = {0: "non-fashion", 1: "fashion"}
label2id = {"non-fashion": 0, "fashion": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# 2) Metrics: treat label 1 ("fashion") as the positive class
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    acc = accuracy_score(labels, preds)
    prec, rec, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", pos_label=1
    )

    return {
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1,
    }

# 3) Data collator (inputs are already padded to MAX_LENGTH, so it effectively just batches them into tensors)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 4) Training arguments
bert_output_dir = project_root / "models" / "bert_fashion"
os.makedirs(bert_output_dir, exist_ok=True)

training_args = TrainingArguments(
    output_dir=str(bert_output_dir),
    evaluation_strategy="epoch",      # run eval at end of each epoch (renamed eval_strategy in newer transformers releases)
    save_strategy="epoch",            # save checkpoint each epoch
    learning_rate=2e-5,               # standard BERT fine-tuning LR
    per_device_train_batch_size=16,   # adjust if you hit memory issues
    per_device_eval_batch_size=32,
    num_train_epochs=3,               # start with 2-3 epochs
    weight_decay=0.01,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    save_total_limit=2,               # keep at most 2 checkpoints (the best one is always retained)
    report_to="none",                 # no wandb/tensorboard by default
)

# 5) Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds_tok,
    eval_dataset=val_ds_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Trainer set up. Ready to fine-tune BERT.")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainer set up. Ready to fine-tune BERT.


In [None]:
# Fine-tune BERT on the training set, evaluating on the validation set

train_result = trainer.train()

# Save final model and training state
trainer.save_model(bert_output_dir)        # saves the best model (because load_best_model_at_end=True)
trainer.save_state()

print("\nTraining finished.")
print("Best model saved in:", bert_output_dir)

# Evaluate on validation and test splits
print("\n=== Evaluation on validation split ===")
eval_val = trainer.evaluate(eval_dataset=val_ds_tok)
print(eval_val)

print("\n=== Evaluation on test split ===")
eval_test = trainer.evaluate(eval_dataset=test_ds_tok)
print(eval_test)
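
For the loss curves mentioned in the plan, the training and evaluation losses recorded during `trainer.train()` are available afterwards in `trainer.state.log_history`, a list of dicts. The following is a minimal sketch of separating the two series; `split_loss_history` is a hypothetical helper, and the sample log entries are invented to show the shape of the data, not results from this run.

```python
def split_loss_history(log_history):
    """Split Trainer-style log entries into (step, loss) series for train and eval."""
    train_points, eval_points = [], []
    for entry in log_history:
        step = entry.get("step")
        if "loss" in entry:           # periodic training log (every logging_steps)
            train_points.append((step, entry["loss"]))
        if "eval_loss" in entry:      # evaluation log (once per epoch here)
            eval_points.append((step, entry["eval_loss"]))
    return train_points, eval_points

# Invented entries that only mimic the structure of trainer.state.log_history.
sample_log = [
    {"loss": 0.30, "learning_rate": 1.9e-5, "epoch": 0.11, "step": 100},
    {"loss": 0.12, "learning_rate": 1.8e-5, "epoch": 0.23, "step": 200},
    {"eval_loss": 0.10, "eval_f1": 0.98, "epoch": 1.0, "step": 873},
    {"train_runtime": 600.0, "train_loss": 0.15, "epoch": 1.0, "step": 873},
]

train_points, eval_points = split_loss_history(sample_log)
print("train:", train_points)
print("eval: ", eval_points)
```

In the notebook, the call would be `split_loss_history(trainer.state.log_history)`, and the two series can then be drawn on the same axes with any plotting library. Evaluation loss that rises across epochs while training loss keeps falling is the usual sign of overfitting.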

