<sub>Developed by SeongKu Kang, August 2025. Do not distribute.</sub>

# 📘 Task 1: Product category classification with no labels (Fixed BERT embeddings)

In this notebook, we consider a **realistic but challenging scenario**: what if we have **no labeled data at all**?

In many real-world applications, collecting labeled product-category data is expensive and slow.  
Here, we explore how to bootstrap a classification system without any human-provided labels.  
Your task is to fill in the blanks and design solutions for this "zero-label" setting.

---

## Key Ideas (Guidelines)

1. **Constructing Silver Labels**  
   Since we have no ground-truth labels, we must create *weak supervision signals*.  
   Possible strategies include:
   - **Lexical similarity:** Compare product titles/descriptions with category names using sparse term vectors (e.g., TF-IDF).  
   - **Embedding similarity:** Compare the BERT embeddings of products with those of category names (e.g., by cosine similarity).  
   - **Ensemble approaches:** Combine multiple weak signals (e.g., weighted voting between lexical-based and embedding-based similarity).

2. **Learning with Silver Labels**  
   Once silver labels are generated, train a classifier as if they were real labels.  
   To improve robustness, consider the techniques we have covered, including (but not limited to):
   - **Self-training:** Train an initial model on the silver labels, then use it to pseudo-label the unlabeled examples it predicts with high confidence, and retrain on them.  
   - **Label embedding models:** Instead of treating labels as arbitrary IDs, use semantic embeddings of the label names to guide classification (e.g., an inner-product classifier).
   - **Consistency regularization:** Encourage the model to produce stable predictions under input perturbations (e.g., dropout noise, data augmentation). This helps prevent overfitting to noisy silver labels and promotes smoother decision boundaries.
   - **Ensembling for stable predictions:** Mitigate the noise from weak or unstable supervision by averaging predictions (e.g., a temporal ensemble via an exponential moving average (EMA), or an ensemble of independently trained models).
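
The silver-label strategies in item 1 can be sketched as follows. This is a minimal illustration, not a reference solution: the function names, the mixing weight `alpha`, the temperature `temp`, and the toy texts and tensors are all hypothetical stand-ins, and the TF-IDF weighting is one of several reasonable lexical choices.

```python
import math
from collections import Counter

import torch
import torch.nn.functional as F


def tfidf_vectors(texts, vocab, idf):
    """Build L2-normalized TF-IDF row vectors for a list of texts."""
    mat = torch.zeros(len(texts), len(vocab))
    for row, text in enumerate(texts):
        for tok, cnt in Counter(text.lower().split()).items():
            if tok in vocab:
                mat[row, vocab[tok]] = cnt * idf[tok]
    return F.normalize(mat, dim=1)


def silver_labels(texts, label_names, text_emb, label_emb, alpha=0.5, temp=0.1):
    """Blend lexical and embedding similarity; alpha weights the lexical part."""
    docs = [t.lower().split() for t in texts + label_names]
    df = Counter(tok for d in docs for tok in set(d))
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    idf = {tok: math.log((1 + len(docs)) / (1 + n)) + 1.0 for tok, n in df.items()}

    # Cosine similarity between products and category names in each space.
    lexical = tfidf_vectors(texts, vocab, idf) @ tfidf_vectors(label_names, vocab, idf).T
    semantic = F.normalize(text_emb, dim=1) @ F.normalize(label_emb, dim=1).T

    # Softmax puts both signals on a comparable scale before mixing.
    scores = alpha * F.softmax(lexical / temp, dim=1) + (1 - alpha) * F.softmax(semantic / temp, dim=1)
    confidence, labels = scores.max(dim=1)
    return labels, confidence


# Toy data for illustration only (not taken from the assignment dataset).
texts = ["wireless bluetooth headphones", "cotton running shirt", "stainless steel kitchen knife"]
label_names = ["headphones audio", "shirt clothing", "kitchen knife"]
torch.manual_seed(0)
toy_label_emb = torch.randn(3, 8)
toy_text_emb = toy_label_emb + 0.1 * torch.randn(3, 8)
labels, confidence = silver_labels(texts, label_names, toy_text_emb, toy_label_emb)
```

The returned `confidence` can be used to keep only the most reliable silver labels for the first training round.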

---

## Your Tasks

1. Generate silver labels.
2. Train a classifier using these silver labels and various learning strategies.
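
As one possible shape for step 2, the sketch below trains an inner-product classifier over frozen label embeddings with cross-entropy on silver labels plus a dropout-based consistency term. The class name, hidden size, temperature, loss weight `lam`, and the synthetic tensors are assumptions made for illustration; they are not prescribed by the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelEmbeddingClassifier(nn.Module):
    """Scores a product against each category by similarity to its label embedding."""

    def __init__(self, label_emb, hidden=64, p_drop=0.1, temp=0.1):
        super().__init__()
        dim = label_emb.size(1)
        self.proj = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(p_drop), nn.Linear(hidden, dim)
        )
        self.temp = temp
        # Label vectors are stored as a buffer, so they stay frozen during training.
        self.register_buffer("label_emb", F.normalize(label_emb, dim=1))

    def forward(self, x):
        z = F.normalize(self.proj(x), dim=1)
        return z @ self.label_emb.T / self.temp  # temperature-scaled cosine logits


def train_step(model, opt, x, y_silver, lam=1.0):
    """Cross-entropy on silver labels plus a dropout-consistency penalty."""
    model.train()
    logits_a, logits_b = model(x), model(x)  # two passes, different dropout masks
    ce = F.cross_entropy(logits_a, y_silver)
    consistency = F.kl_div(
        F.log_softmax(logits_a, dim=1),
        F.softmax(logits_b, dim=1).detach(),
        reduction="batchmean",
    )
    loss = ce + lam * consistency
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Synthetic stand-ins for product embeddings and silver labels.
torch.manual_seed(0)
toy_label_emb = torch.randn(4, 16)
y_silver = torch.randint(0, 4, (128,))
x = toy_label_emb[y_silver] + 0.5 * torch.randn(128, 16)

clf = LabelEmbeddingClassifier(toy_label_emb)
opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
losses = [train_step(clf, opt, x, y_silver) for _ in range(100)]

clf.eval()
with torch.no_grad():
    acc = (clf(x).argmax(dim=1) == y_silver).float().mean().item()
```

Keeping the label embeddings frozen ties the classifier to the semantics of the category names, which may help when the silver labels are noisy.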
  
💡 *Hint:* Think of this as "bootstrapping" the learning process: even noisy initial signals can become useful when combined with iterative refinement and stabilization techniques.
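
One way to picture iterative refinement with stabilization is a self-training round driven by an EMA teacher, sketched below. It assumes the student has already been trained on silver labels; the function names, the confidence `threshold`, the `decay` value, and the toy linear models are hypothetical. Note that `parameters()` does not cover buffers such as BatchNorm statistics, so a model with such layers would need extra handling.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the student: t = decay * t + (1 - decay) * s."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)


@torch.no_grad()
def pseudo_label(teacher, x, threshold=0.9):
    """Return a mask of confident examples and the teacher's predicted labels."""
    teacher.eval()
    conf, pred = F.softmax(teacher(x), dim=1).max(dim=1)
    return conf >= threshold, pred


def self_training_round(student, teacher, opt, x, threshold=0.9, decay=0.99):
    """One refinement step: the teacher labels confident examples, the student fits them."""
    mask, pred = pseudo_label(teacher, x, threshold)
    if mask.any():
        student.train()
        loss = F.cross_entropy(student(x[mask]), pred[mask])
        opt.zero_grad()
        loss.backward()
        opt.step()
    ema_update(teacher, student, decay)
    return int(mask.sum())  # number of examples that passed the threshold


# Toy models and data for illustration only.
torch.manual_seed(0)
student = nn.Linear(16, 4)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is updated only through the EMA
opt = torch.optim.SGD(student.parameters(), lr=0.1)
x_unlabeled = torch.randn(256, 16)
n_confident = [
    self_training_round(student, teacher, opt, x_unlabeled, threshold=0.5) for _ in range(5)
]
```

Because the teacher changes slowly, its pseudo-labels tend to fluctuate less from step to step than those of the student itself.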


⚠️ **Note**: Do **NOT** use the labeled training set provided in the previous notebook.  
In this notebook, you must assume that **no labeled data exists**. Only the following resources are allowed:
- Product metadata (titles, descriptions, etc.)
- Category names

In [1]:
import json
from tqdm import tqdm
from pathlib import Path
from utils import * 
import copy

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split, ConcatDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

In [2]:
# Default paths
ROOT = Path("dataset") # Root dataset directory
CORPUS_PATH = ROOT / "corpus.jsonl" # Product corpus file (JSON Lines): Each line contains a product ID and its associated text description.
EMB_PATH = ROOT / "corpus_bert_mean.pt"

# Task 1: Product category classification
LABEL_MAP_PATH = ROOT / "category_classification" 
LABEL2ID_PATH = LABEL_MAP_PATH / "label2labelid.json" 
ID2LABEL_PATH = LABEL_MAP_PATH / "labelid2label.json" 
PID2LABEL_TEST_PATH = LABEL_MAP_PATH / "pid2labelids_test.json" 
LABEL_EMB_PATH = LABEL_MAP_PATH / "category_labels_bert_mean.pt"

In [3]:
pid2text = load_corpus(CORPUS_PATH) # load corpus

label2id = load_json(LABEL2ID_PATH)
id2label = load_json(ID2LABEL_PATH)
pid2label_test = load_json(PID2LABEL_TEST_PATH)

# loading pre-trained embeddings
corpus_data = torch.load(EMB_PATH)  # {"ids": [...], "embeddings": Tensor}
pid_list = corpus_data["ids"]
pid2idx = {pid: i for i, pid in enumerate(pid_list)}
embeddings = corpus_data["embeddings"]

label_data = torch.load(LABEL_EMB_PATH)
label_emb = label_data["embeddings"].to(device)

In [None]:
# ==========================================================
# Your Task: Do your magic below 
# ==========================================================

## Prepare Kaggle submission
Modify the code as needed to fit your solution.
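
The submission code expects a `model` that maps a batch of embeddings to per-category logits. As a sanity check before plugging in a trained classifier, a training-free baseline could look like the sketch below. The class name is hypothetical, and it assumes that row `i` of the label embedding tensor corresponds to label ID `i`, which should be verified against the label files.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroShotCosineModel(nn.Module):
    """Training-free baseline: logits are cosine similarities to the label embeddings."""

    def __init__(self, label_emb):
        super().__init__()
        # Assumption: row i of label_emb is the embedding of label ID i.
        self.register_buffer("label_emb", F.normalize(label_emb.float(), dim=1))

    def forward(self, x):
        return F.normalize(x.float(), dim=1) @ self.label_emb.T


# Stand-in tensor for illustration; in the notebook, pass the loaded label embeddings.
demo_label_emb = torch.randn(5, 8)
demo_model = ZeroShotCosineModel(demo_label_emb)
demo_logits = demo_model(torch.randn(3, 8))  # shape: (batch, num_labels)
```

In the notebook, something like `model = ZeroShotCosineModel(label_emb).to(device)` would give the submission cell a baseline to compare your trained model against.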

In [None]:
import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from pathlib import Path

# === 1. Load test IDs ===
ROOT = Path("dataset") # Root dataset directory
LABEL_MAP_PATH = ROOT / "category_classification"
TEST_IDS_PATH = LABEL_MAP_PATH / "task1_test_ids.csv"

test_ids_df = pd.read_csv(TEST_IDS_PATH)  # has column "id"
test_ids = test_ids_df["id"].tolist()

# === 2. Custom Dataset (no labels) ===
class ProductCategoryTestDataset(Dataset):
    def __init__(self, pids, pid2idx, embeddings):
        self.pids = pids
        self.indices = [pid2idx[pid] for pid in self.pids]
        self.vecs = embeddings 
        
    def __len__(self):
        return len(self.pids)

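    def __getitem__(self, idx):
        # The stored embeddings are already tensors, so cast instead of
        # re-wrapping with torch.tensor(), which copies and raises a warning.
        emb = self.vecs[self.indices[idx]]
        return {"X": torch.as_tensor(emb, dtype=torch.float)}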

# === 3. Build dataset and loader ===
test_dataset_kaggle = ProductCategoryTestDataset(test_ids, pid2idx, embeddings)
test_loader_kaggle = DataLoader(test_dataset_kaggle, batch_size=64)

# === 4. Run predictions ===
model.eval()  # assumes `model` is the classifier you trained in the task cell above
all_preds = []

with torch.no_grad():
    for batch in test_loader_kaggle:
        X = batch["X"].to(device)   # move the batch to the same device as the model
        logits = model(X)
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().tolist())

# === 5. Build submission file ===
submission = pd.DataFrame({
    "id": test_ids,
    "label": all_preds
})

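SUBMISSION_PATH = ROOT / "submission/P3_submission.csv"
SUBMISSION_PATH.parent.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories
submission.to_csv(SUBMISSION_PATH, index=False)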

print(f"Submission file saved to {SUBMISSION_PATH}")
print(submission.head())