# Semantic Taxonomy Discovery (Icecat)

**Interactive Notebook for E-commerce Product Clustering**

This notebook demonstrates unsupervised taxonomy discovery on 489,898 products from the Icecat dataset.

---

## 1. Setup and Imports

In [14]:
import pandas as pd
import numpy as np
import sys

# Import project modules
from src import config, data_loader, features, clustering, evaluation, visualization, tuning, supervised

print(f"Config: MAX_ROWS={config.MAX_ROWS}, LABEL_COL={config.LABEL_COL}")

Config: MAX_ROWS=None, LABEL_COL=Category.Name.Value


## 2. Load Data

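Load the Icecat dataset (1.2 GB JSON). The raw file has 489,902 rows; preprocessing in Section 3 drops 4 of them, leaving the 489,898 products used in the rest of the notebook.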

In [15]:
df = data_loader.load_icecat_data()
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns[:10])}...")
df.head(3)

Loading data from /Users/dev/Downloads/icecat_data_train.json...
Data loaded. Shape: (489902, 45)
Dataset shape: (489902, 45)
Columns: ['Brand', 'BrandInfo.BrandLocalName', 'BrandInfo.BrandLogo', 'BrandInfo.BrandName', 'BrandLogo', 'BrandPartCode', 'BulletPoints', 'Category.CategoryID', 'Category.Name.Language', 'Category.Name.Value']...


| | Brand | BrandInfo.BrandLocalName | BrandInfo.BrandLogo | BrandInfo.BrandName | BrandLogo | BrandPartCode | BulletPoints | Category.CategoryID | Category.Name.Language | Category.Name.Value | ... | ProductSeries.Language | ProductSeries.Value | BulletPoints.BulletPointsId | BulletPoints.Language | BulletPoints.Updated | BulletPoints.Values | VirtualCategory | SummaryDescription | pathlist_ids | pathlist_names |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1072689 | ASUS | | https://images.icecat.biz/img/brand/thumb/161_... | ASUS | https://images.icecat.biz/img/brand/thumb/161_... | K31CD-IT049T | [] | 153 | EN | PCs/Workstations | ... | | | | | | | [{'VirtualCategoryID': 195, 'UNCATID': '431718... | | 2833>150>153 | Computers & Electronics>Computers>PCs/Workstat... |
| 906402 | HP | | https://images.icecat.biz/img/brand/thumb/1_91... | HP | https://images.icecat.biz/img/brand/thumb/1_91... | 686915-A41 | [] | 2509 | EN | Notebook Spare Parts | ... | | | | | | | | | 2833>150>8355>2509 | Computers & Electronics>Computers>Notebook Par... |
| 411281 | C2G | | https://images.icecat.biz/img/brand/thumb/2834... | C2G | https://images.icecat.biz/img/brand/thumb/2834... | 37745 | [] | 953 | EN | Fibre Optic Cables | ... | | | | | | | | | 2833>830>953 | Computers & Electronics>Computer Cables>Fibre ... |


## 3. Feature Engineering

- **HTML Cleaning**: Remove `<b>`, `<br>`, `<div>` tags from descriptions
- **Smart Imputation**: Fill empty descriptions using Title/ProductName/Brand
- **Sentence-BERT Embeddings**: Convert text to 384-dim dense vectors
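
The first two steps can be sketched with the standard library alone. This is a minimal illustration, not the code in `features.create_text_features` (which the log shows uses Beautiful Soup). The helper names `strip_html` and `build_cluster_text` and the split into primary and fallback columns are assumptions made for this example.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Drop tags such as <b>, <br>, <div> and collapse whitespace."""
    if not isinstance(text, str) or not text:
        return ""  # None, NaN and other non-strings count as empty
    parser = _TagStripper()
    parser.feed(text)
    parser.close()
    # Tags become word boundaries, so "<b>a</b>b" yields "a b".
    return " ".join(" ".join(parser.parts).split())


def build_cluster_text(row,
                       primary=("Description", "LongDesc"),
                       fallback=("Title", "ProductName", "Brand")):
    """Use cleaned descriptions; fall back to title-like fields if empty.

    The primary/fallback column split is hypothetical.
    """
    cleaned = [strip_html(row.get(col)) for col in primary]
    text = " ".join(part for part in cleaned if part)
    if text:
        return text
    backups = [strip_html(row.get(col)) for col in fallback]
    return " ".join(part for part in backups if part)
```

Because `row.get` works on both plain dicts and pandas Series, the same function could be applied row-wise with `df.apply(build_cluster_text, axis=1)`.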

In [16]:
# Create text features with preprocessing
df = features.create_text_features(df)
print(f"After preprocessing: {len(df)} rows")
print(f"Sample text: {df['cluster_text'].iloc[0][:200]}...")

Creating features from: ['Title', 'ProductName', 'Brand', 'Description', 'LongDesc']
   > Applying HTML Cleaning...





   > Applying Smart Imputation (fallback for empty text)...
Rows after cleaning: 489898 (Dropped 4, 0.0%)
After preprocessing: 489898 rows
Sample text: ASUS K31CD-IT049T PC 6th gen Intel® Core™ i7 i7-6700 16 GB DDR4-SDRAM 1000 GB HDD Black Tower...


In [17]:
# Generate embeddings (cached after first run)
embeddings = features.generate_embeddings(df)
print(f"Embeddings shape: {embeddings.shape}")

Loading embeddings from cache: /Users/dev/Downloads/icecat-taxonomy-clustering/outputs/cache/embeddings_all-MiniLM-L6-v2_489898.npy
Embeddings shape: (489898, 384)
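
The cache file name in the log suggests embeddings are keyed by model name and row count. A sketch of that pattern follows. The function `cached_embeddings` and its `encode_fn` parameter are hypothetical and are not taken from `features.generate_embeddings`; with the sentence-transformers package, `SentenceTransformer("all-MiniLM-L6-v2").encode` could play the role of `encode_fn`.

```python
from pathlib import Path

import numpy as np


def cached_embeddings(texts, encode_fn, cache_dir,
                      model_name="all-MiniLM-L6-v2"):
    """Return embeddings for `texts`, reusing a .npy cache when present.

    `encode_fn` is any callable mapping a list of strings to a 2-D array.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path = cache_dir / f"embeddings_{model_name}_{len(texts)}.npy"

    if cache_path.exists():
        return np.load(cache_path)

    embeddings = np.asarray(encode_fn(texts), dtype=np.float32)
    np.save(cache_path, embeddings)
    return embeddings
```

A key built only from the model name and row count will not notice when the text changes but the number of rows stays the same. If preprocessing is modified, deleting the cache file is the safe option.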


## 4. Dimensionality Reduction

Reduce the 384-dim embeddings to 50 dimensions with PCA to speed up clustering.

In [18]:
embeddings_low = visualization.reduce_dimensions(embeddings, method='pca', n_components=50)
print(f"Reduced shape: {embeddings_low.shape}")

Reducing dimensions with PCA...
Reduced shape: (489898, 50)
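
The notebook does not show the body of `visualization.reduce_dimensions`. A plausible equivalent with scikit-learn's `PCA` is sketched below on synthetic data. It also prints how much variance the 50 components retain, which the notebook does not report and which is worth checking before trusting downstream clusters.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the real (489898, 384) embedding matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384)).astype(np.float32)

pca = PCA(n_components=50, random_state=0)
X_low = pca.fit_transform(X)

# Fraction of total variance kept by the 50 retained components.
retained = float(pca.explained_variance_ratio_.sum())
print(X_low.shape)
print(f"variance retained: {retained:.1%}")
```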


## 5. Clustering Experiments

Run BIRCH and MiniBatchKMeans on the reduced embeddings and compare their purity and NMI against the true category labels.

In [19]:
y_true = df[config.LABEL_COL] if config.LABEL_COL in df.columns else None
print(f"Number of true categories: {y_true.nunique() if y_true is not None else 'N/A'}")

Number of true categories: 370


In [20]:
# BIRCH Clustering (Best Performer)
best_params, _, _ = tuning.tune_hyperparameters(embeddings_low, 'BIRCH', {'threshold': [0.3, 0.5], 'n_clusters': [None]})
labels_birch = clustering.run_birch(embeddings_low, **best_params)

metrics_birch = evaluation.compute_metrics(embeddings_low, labels_birch, y_true)
print(f"\nBIRCH Results:")
print(f"  Purity: {metrics_birch['purity']:.2%}")
print(f"  NMI: {metrics_birch['nmi']:.2%}")


--- Tuning BIRCH (2 combinations) ---
   [2/2] Testing params: {'threshold': 0.5, 'n_clusters': None}...
   > Best Score: 0.1543 with {'threshold': 0.5, 'n_clusters': None}

--- BIRCH (n_clusters=None, threshold=0.5) ---
   > Input data shape: (489898, 50)
   > BIRCH completed in 7.46 seconds.

BIRCH Results:
  Purity: 85.07%
  NMI: 71.92%
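
For reference, purity and NMI can be computed as follows. This sketch uses the standard definitions, with the arithmetic-mean normalization that is also the default in scikit-learn's `normalized_mutual_info_score`. The notebook does not show `evaluation.compute_metrics`, so treat this as an assumption about what it measures.

```python
from collections import Counter
from math import log


def purity(cluster_labels, true_labels):
    """Fraction of points that belong to their cluster's majority class."""
    per_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        per_cluster.setdefault(c, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(true_labels)


def _entropy(labels):
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def nmi(cluster_labels, true_labels):
    """Mutual information divided by the mean of the two entropies."""
    n = len(true_labels)
    joint = Counter(zip(cluster_labels, true_labels))
    c_counts = Counter(cluster_labels)
    t_counts = Counter(true_labels)
    mi = sum((k / n) * log(k * n / (c_counts[c] * t_counts[t]))
             for (c, t), k in joint.items())
    denom = (_entropy(cluster_labels) + _entropy(true_labels)) / 2
    return mi / denom if denom > 0 else 1.0


# Over-segmentation: every point in its own cluster.
truth = ["x", "x", "y", "y"]
singletons = [0, 1, 2, 3]
print(purity(singletons, truth))          # 1.0
print(f"{nmi(singletons, truth):.3f}")    # 0.667
```

The toy case shows why the two metrics should be read together: splitting into more clusters can only keep or raise purity, while NMI penalizes the extra clusters.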


In [21]:
# MiniBatchKMeans (Scalable)
labels_mbk = clustering.run_minibatch_kmeans(embeddings_low, n_clusters=150, batch_size=2048)

metrics_mbk = evaluation.compute_metrics(embeddings_low, labels_mbk, y_true)
print(f"\nMiniBatchKMeans Results:")
print(f"  Purity: {metrics_mbk['purity']:.2%}")
print(f"  NMI: {metrics_mbk['nmi']:.2%}")


--- MiniBatchKMeans (k=150, batch=2048) ---
   > Input data shape: (489898, 50)
   > Streaming batches for scalable clustering...
   > MiniBatchKMeans completed in 1.01 seconds.

MiniBatchKMeans Results:
  Purity: 78.88%
  NMI: 69.16%
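
The wrappers `clustering.run_birch` and `clustering.run_minibatch_kmeans` are project code whose bodies are not shown here. Assuming they delegate to scikit-learn, the underlying calls would look roughly like this sketch on synthetic blobs.

```python
import numpy as np
from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic 50-dim data standing in for the PCA-reduced embeddings.
X, _ = make_blobs(n_samples=2000, centers=5, n_features=50,
                  cluster_std=0.5, random_state=0)

# threshold caps the radius of each subcluster; with n_clusters=None the
# subclusters are returned as-is, so the cluster count is data-driven.
birch = Birch(threshold=0.5, n_clusters=None)
labels_b = birch.fit_predict(X)

# MiniBatchKMeans needs k up front and updates centroids from small batches.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3, random_state=0)
labels_k = mbk.fit_predict(X)

print("BIRCH clusters:", len(np.unique(labels_b)))
print("MiniBatchKMeans clusters:", len(np.unique(labels_k)))
```

This difference in how the number of clusters is set is consistent with the notebook's results, where BIRCH ends up with 1,412 clusters while MiniBatchKMeans is fixed at 150.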


## 6. Supervised Baseline (Scientific Control)

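Train a Logistic Regression classifier on the same 50-dim features, using an 80/20 train/test split, to estimate an upper bound on what the embeddings can achieve when labels are available.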

In [10]:
baseline_metrics = supervised.run_baseline(embeddings_low, y_true)
print(f"\nSupervised Baseline:")
# Test accuracy is returned under the 'purity' key (see the summary table in Section 8)
print(f"  Accuracy: {baseline_metrics['purity']:.2%}")


--- Supervised Baseline (Logistic Regression) ---
   > Splitting data (Train=80%, Test=20%)...
   > Training Classifier (max_iter=1000)...




   > Predicting on Test Set...
   > Supervised Baseline Completed in 247.30s.
   > Test Accuracy: 93.87%

Supervised Baseline:
  Accuracy: 93.87%
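
A minimal version of such a baseline, assuming scikit-learn and using the split ratio and `max_iter` from the log above, could look like the sketch below. The data is synthetic, and this is not the body of `supervised.run_baseline`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 50 features, 5 classes.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=5, random_state=0)

# 80/20 split; stratify keeps class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {accuracy:.2%}")
```

On the real data, `stratify=y` raises an error if any category has fewer than two products, so rare categories would need to be merged or dropped first.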


## 7. Visualization

In [11]:
# Sample for visualization (UMAP is slow on large data)
sample_size = 10000
rng = np.random.default_rng(42)  # fixed seed so the sampled subset is reproducible
idx = rng.choice(len(embeddings_low), sample_size, replace=False)

emb_viz = embeddings_low[idx]
true_viz = y_true.iloc[idx] if y_true is not None else None
labels_viz = {'BIRCH': labels_birch[idx], 'MiniBatchKMeans': labels_mbk[idx]}

embeddings_2d = visualization.reduce_dimensions(emb_viz, method='umap', n_components=2)
print(f"2D projection shape: {embeddings_2d.shape}")

Reducing dimensions with UMAP...




2D projection shape: (10000, 2)


In [12]:
# Generate comparison panel
visualization.plot_comparison_panel(embeddings_2d, labels_viz, true_labels=true_viz)

Saved Comparison Panel to outputs/clustering_comparison_panel.png


## 8. Results Summary

In [13]:
results = {
    'BIRCH': metrics_birch,
    'MiniBatchKMeans': metrics_mbk,
    'Supervised Baseline': baseline_metrics
}

df_results = pd.DataFrame(results).T
df_results[['purity', 'nmi', 'n_clusters']]

| | purity | nmi | n_clusters |
|---|---|---|---|
| BIRCH | 0.850742 | 0.719219 | 1412.0 |
| MiniBatchKMeans | 0.788758 | 0.691612 | 150.0 |
| Supervised Baseline | 0.938743 | 0.918582 | 370.0 |


---

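**Key Finding**: BIRCH reaches 85.07% purity without using any labels, about 91% of the supervised baseline's 93.87% accuracy. The comparison is generous to BIRCH, because purity tends to rise with the number of clusters and BIRCH produces 1,412 clusters against 370 true categories. On NMI the gap is wider: 0.719 versus 0.919, or about 78% of the baseline.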
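
The ratios behind this comparison can be recomputed from the summary table in Section 8. The snippet below only restates those reported numbers; the `ratios` dictionary is introduced for this example.

```python
# Values copied from the results table in Section 8.
results = {
    "BIRCH": {"purity": 0.850742, "nmi": 0.719219},
    "MiniBatchKMeans": {"purity": 0.788758, "nmi": 0.691612},
    "Supervised Baseline": {"purity": 0.938743, "nmi": 0.918582},
}

baseline = results["Supervised Baseline"]
ratios = {}
for name in ("BIRCH", "MiniBatchKMeans"):
    ratios[name] = {
        metric: results[name][metric] / baseline[metric]
        for metric in ("purity", "nmi")
    }
    print(f"{name}: purity {ratios[name]['purity']:.1%} of baseline, "
          f"NMI {ratios[name]['nmi']:.1%} of baseline")

# BIRCH: purity 90.6% of baseline, NMI 78.3% of baseline
# MiniBatchKMeans: purity 84.0% of baseline, NMI 75.3% of baseline
```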