# 📓 Model Training Notebook — Semantic Recommender

This notebook builds the **content-based product recommender** used in the app.

**Pipeline overview**
1. Load and preprocess the dataset (mirroring the app).
2. Build a **semantic index** with SentenceTransformer embeddings + FAISS.
3. Implement a **baseline** using TF‑IDF + k‑NN for comparison.
4. Provide a lightweight **evaluation**: *Precision@K / Recall@K* with category-based relevance, plus a self-retrieval sanity check.
5. Save artifacts for serving (`faiss.index`, `embeddings.npy`).

> Why embeddings?
>
> TF‑IDF relies on exact word overlap; semantic embeddings capture meaning (e.g., *“study chair”* ≈ *“desk chair”*). FAISS makes vector search fast and scalable.
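
To make the word-overlap limitation concrete, here is a dependency-free sketch that scores phrases with plain term-count cosine similarity. It stands in for TF-IDF only loosely (no IDF weighting, whitespace tokenization), and `bow_cosine` is a hypothetical helper, not part of the notebook.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Counter returns 0 for missing terms, so only shared words contribute.
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

print(bow_cosine("study chair", "desk chair"))  # non-zero: the phrases share "chair"
print(bow_cosine("study chair", "work seat"))   # zero: no shared words at all
```

A purely lexical scorer gives the second pair no credit even though the phrases are close in meaning; that gap is what the embedding model is meant to close.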


## 1) Setup

In [1]:
import os
import numpy as np
import pandas as pd
from pathlib import Path
import faiss

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

DATA_PATH = Path('../backend/data/products_clean.csv')
if not DATA_PATH.exists():
    DATA_PATH = Path('../backend/data/products.csv')  # fallback

INDEX_DIR = Path('../backend/vectorstore')
INDEX_DIR.mkdir(parents=True, exist_ok=True)

MODEL_ID_REMOTE = 'sentence-transformers/all-MiniLM-L6-v2'
MODEL_PATH_LOCAL = Path('../backend/models_cache/all-MiniLM-L6-v2')  # if you cached locally



## 2) Load & Preprocess (must match the app)

In [3]:
# Load from DATA_PATH (defined in Setup) so the notebook runs on any machine
df = pd.read_csv(DATA_PATH)

# text fields filled
for col in ['title','brand','description','categories','material','color']:
    if col not in df.columns:
        df[col] = ''
    df[col] = df[col].fillna('')

# price cleaned (USD float)
if 'price' in df.columns:
    df['price'] = df['price'].astype(str).str.replace(r'[^\d\.]', '', regex=True)
    df['price'] = pd.to_numeric(df['price'], errors='coerce')
    if df['price'].notna().any():
        df['price'] = df['price'].fillna(df['price'].median())
    else:
        df['price'] = 0.0

# build single text field for retrieval
df['text'] = (df['title'] + ' ' + df['brand'] + ' ' + df['categories'] + ' ' +
              df['material'] + ' ' + df['color'] + ' ' + df['description'])

# function to extract clean category
def pretty_category(s: str, keep_last=True) -> str:
    if not isinstance(s, str):
        return ''
    s = s.strip().strip('[]').replace("'", '')
    parts = [p.strip() for p in s.split(',') if p.strip()]
    if not parts:
        return ''
    return parts[-1] if keep_last else ', '.join(parts)

# add simplified category column
df['cat_short'] = df['categories'].map(lambda x: pretty_category(x, keep_last=True))

len(df), df.head(2)


(312,
                                                title    brand  \
 0  GOYMFK 1pc Free Standing Shoe Rack, Multi-laye...   GOYMFK   
 1  subrtex Leather ding Room, Dining Chairs Set o...  subrtex   
 
                                          description  price  \
 0  multiple shoes, coats, hats, and other items E...  24.99   
 1                     subrtex Dining chairs Set of 2  53.99   
 
                                           categories  \
 0  ['Home & Kitchen', 'Storage & Organization', '...   
 1  ['Home & Kitchen', 'Furniture', 'Dining Room F...   
 
                                               images           manufacturer  \
 0  ['https://m.media-amazon.com/images/I/416WaLx1...                 GOYMFK   
 1  ['https://m.media-amazon.com/images/I/31SejUEW...  Subrtex Houseware INC   
 
          package_dimensions country_of_origin material  color  \
 0  2.36"D x 7.87"W x 21.6"H             China    Metal  White   
 1      18.5"D x 16"W x 35"H               NaN   Spon
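
`pretty_category` above splits on commas and strips every apostrophe, which can alter a category name that itself contains an apostrophe or a comma. As a hedged alternative, and assuming the `categories` column always holds stringified Python lists like those in the preview, the string can be parsed with `ast.literal_eval`. The helpers `parse_categories` and `leaf_category` and the sample string are hypothetical.

```python
import ast

def parse_categories(s) -> list:
    """Parse a stringified Python list of category names; return [] on failure."""
    if not isinstance(s, str) or not s.strip():
        return []
    try:
        parsed = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(p).strip() for p in parsed if str(p).strip()]

def leaf_category(s) -> str:
    """Return the last (most specific) category, or '' if none can be parsed."""
    parts = parse_categories(s)
    return parts[-1] if parts else ''

# Hypothetical value containing an apostrophe inside a category name
raw = "['Home & Kitchen', 'Furniture', \"Kids' Furniture\"]"
print(leaf_category(raw))
```

Unlike `pretty_category`, this sketch returns `''` for a plain, non-list string, so it is only a drop-in replacement if the column format is consistent.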

## 3) Embeddings + FAISS (main model)

In [4]:
# Load model (prefer local path to avoid network timeouts)
if MODEL_PATH_LOCAL.exists():
    model = SentenceTransformer(str(MODEL_PATH_LOCAL))
else:
    model = SentenceTransformer(MODEL_ID_REMOTE)

texts = df['text'].tolist()
emb = np.array(model.encode(texts, show_progress_bar=True), dtype='float32')

# cosine similarity via L2-normalized inner product
faiss.normalize_L2(emb)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

# persist for serving
faiss.write_index(index, str(INDEX_DIR / 'faiss.index'))
np.save(str(INDEX_DIR / 'embeddings.npy'), emb)

index.ntotal

Batches: 100%|██████████| 10/10 [00:03<00:00,  3.30it/s]


312
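
The `normalize_L2` + `IndexFlatIP` pairing above works because the inner product of two unit-length vectors equals their cosine similarity. A small dependency-free check of that identity, with made-up vectors and hypothetical helper names:

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length (returned unchanged if it is the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(ip_after_norm, cosine(a, b))  # the two values agree up to float rounding
```

This is also why the query vector in the search helper has to be normalized before `index.search`: without that step the scores would be plain inner products, not cosines.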

### Helper: Semantic Search

In [5]:
def semantic_search(query: str, k: int = 5):
    v = model.encode([query]).astype('float32')
    faiss.normalize_L2(v)
    D, I = index.search(v, k)
    rows = df.iloc[I[0]][['title','brand','cat_short','price']].copy()
    rows['score'] = D[0]
    return rows.reset_index(drop=True)

semantic_search('cozy wooden chair for study table', 5)

Unnamed: 0,title,brand,cat_short,price,score
0,"Flash Furniture Webb Commercial Grade 24"" Roun...",Flash Furniture Store,Tables,140.0,0.540516
1,Flash Furniture Walker Small Rustic Natural Ho...,Flash Furniture Store,Home Office Desks,111.0,0.534974
2,Karl home Accent Chair Mid-Century Modern Chai...,Karl home Store,Chairs,149.99,0.532338
3,Amazon Basics Kids Adjustable Mesh Low-Back Sw...,Amazon Basics Store,Desk Chairs,53.99,0.525546
4,VECELO Modern Industrial Style 3-Piece Dining ...,VECELO Store,Table & Chair Sets,53.99,0.514563


## 4) Baseline: TF‑IDF + k‑NN

This is a purely lexical baseline. It helps illustrate the benefit of semantic embeddings on paraphrased queries.

In [6]:
tfidf = TfidfVectorizer(min_df=2, ngram_range=(1,2))
X = tfidf.fit_transform(df['text'])

knn = NearestNeighbors(n_neighbors=10, metric='cosine')
knn.fit(X)

def knn_search(query: str, k: int = 5):
    qv = tfidf.transform([query])
    dist, idx = knn.kneighbors(qv, n_neighbors=k, return_distance=True)
    rows = df.iloc[idx[0]][['title','brand','cat_short','price']].copy()
    rows['score'] = 1 - dist[0]  # cosine similarity
    return rows.reset_index(drop=True)

knn_search('cozy wooden chair for study table', 5)

Unnamed: 0,title,brand,cat_short,price,score
0,MoNiBloom Round Folding Faux Fur Saucer Chair ...,MoNiBloom Store,Folding Chairs,53.99,0.240508
1,"Tiita Comfy Saucer Chair, Soft Faux Fur Oversi...",Tiita Store,Chairs,79.99,0.209013
2,"BOOSDEN Padded Folding Chair 2 Pack, Foldable ...",BOOSDEN Store,Folding Chairs,119.0,0.162735
3,BYOOTIQUE Makeup Chair Folding Camping Stool C...,BYOOTIQUE Store,Vanities & Vanity Benches,39.9,0.148176
4,"klotski Kids Table and 2 Chair Set, Wood Activ...",klotski,Table & Chair Sets,125.99,0.143491


## 5) Evaluation

We use two lightweight, dataset‑appropriate tests:

1. **Self‑retrieval@K** — using each product title as a query, can the system retrieve the **same item** in the top‑K? (Sanity check for the index.)
2. **Category Precision/Recall@K** — a prediction is considered relevant if it shares the same **`cat_short`** as the query item.

> These are not perfect offline metrics, but they provide a quantitative signal without user‑interaction logs.
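
As a worked example of the second metric, the sketch below computes Precision@K and Recall@K for one query using the same definitions as the evaluation cell (`tp / k` and `tp / R`). The item ids are made up and `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for a single query.

    retrieved: ranked list of item ids returned by the search
    relevant:  set of item ids considered relevant (e.g. same category)
    """
    top_k = retrieved[:k]
    tp = sum(1 for item in top_k if item in relevant)
    precision = tp / k
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = [7, 2, 9, 4, 1]   # top-5 results for one query
relevant = {2, 4, 5, 8}       # four items share the query's category
p, r = precision_recall_at_k(retrieved, relevant, k=5)
print(p, r)  # 0.4 0.5
```

Two of the five results are relevant (precision 2/5), and two of the four relevant items were found (recall 2/4). With `k=5` and categories larger than five items, recall cannot reach 1.0, which is worth keeping in mind when reading the averages.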


In [7]:
import random

def self_retrieval_at_k(model_search_fn, k=5, sample_size=50):
    idxs = random.sample(range(len(df)), min(sample_size, len(df)))
    hits = 0
    for i in idxs:
        q = df.iloc[i]['title']
        res = model_search_fn(q, k)
        # check if exact same title appears (strict, but OK for sanity)
        titles = set(res['title'].astype(str).tolist())
        if str(df.iloc[i]['title']) in titles:
            hits += 1
    return hits / len(idxs) if idxs else 0.0

def category_precision_recall_at_k(model_search_fn, k=5, sample_size=50):
    idxs = random.sample(range(len(df)), min(sample_size, len(df)))
    prec_list, rec_list = [], []
    for i in idxs:
        q_cat = df.iloc[i]['cat_short']
        q_text = df.iloc[i]['title']
        # relevant set = any item with same category (excluding the query itself)
        relevant_idx = set(df.index[df['cat_short'] == q_cat].tolist()) - {i}
        R = len(relevant_idx)
        if R == 0:
            continue

        res = model_search_fn(q_text, k)
        # The *_aligned search functions keep the original df index on their results,
        # so res.index can be compared with relevant_idx directly; drop the query item itself.
        pred_idx = set(res.index) - {i}

        tp = len(pred_idx & relevant_idx)
        prec = tp / k
        rec = tp / R
        prec_list.append(prec)
        rec_list.append(rec)

    return (np.mean(prec_list) if prec_list else 0.0,
            np.mean(rec_list) if rec_list else 0.0)

# Wrap search fns to return rows aligned with df indices
def sem_search_aligned(query, k=5):
    v = model.encode([query]).astype('float32')
    faiss.normalize_L2(v)
    D, I = index.search(v, k)
    rows = df.iloc[I[0]][['title','brand','cat_short','price']].copy()
    rows['score'] = D[0]
    rows.index = df.iloc[I[0]].index  # preserve original df indices
    return rows

def knn_search_aligned(query, k=5):
    qv = tfidf.transform([query])
    dist, idx = knn.kneighbors(qv, n_neighbors=k, return_distance=True)
    rows = df.iloc[idx[0]][['title','brand','cat_short','price']].copy()
    rows['score'] = 1 - dist[0]
    rows.index = df.iloc[idx[0]].index
    return rows

# Run metrics
random.seed(42)
sem_self = self_retrieval_at_k(sem_search_aligned, k=5, sample_size=60)
knn_self = self_retrieval_at_k(knn_search_aligned, k=5, sample_size=60)

sem_p, sem_r = category_precision_recall_at_k(sem_search_aligned, k=5, sample_size=60)
knn_p, knn_r = category_precision_recall_at_k(knn_search_aligned, k=5, sample_size=60)

print(f'Self-Retrieval@5  — Semantic: {sem_self:.2f} | TF-IDF kNN: {knn_self:.2f}')
print(f'Category  P@5/R@5 — Semantic: {sem_p:.2f}/{sem_r:.2f} | TF-IDF kNN: {knn_p:.2f}/{knn_r:.2f}')

Self-Retrieval@5  — Semantic: 0.95 | TF-IDF kNN: 1.00
Category  P@5/R@5 — Semantic: 0.43/0.31 | TF-IDF kNN: 0.45/0.35
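
One caveat when reading these numbers: `random.seed(42)` is set once and every metric call then draws its own sample, so the semantic and TF-IDF scores appear to be computed on different items. A minimal sketch of a paired comparison, where the sample is drawn once and reused for both systems, is below. `sample_indices`, `hit_rate`, and the toy search functions are hypothetical stand-ins, not the notebook's functions.

```python
import random

def sample_indices(n_items, sample_size, seed=42):
    """Draw a reproducible sample of item positions from a private RNG."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), min(sample_size, n_items))

def hit_rate(search_fn, queries, idxs, k=5):
    """Fraction of sampled items whose own id appears in their query's top-k ids."""
    hits = sum(1 for i in idxs if i in search_fn(queries[i], k))
    return hits / len(idxs) if idxs else 0.0

# Toy stand-ins: item ids are 0..9 and each "query" is the id as text.
queries = [str(i) for i in range(10)]
perfect = lambda q, k: [int(q)]             # always returns the queried item
broken = lambda q, k: [(int(q) + 1) % 10]   # never returns the queried item

idxs = sample_indices(len(queries), sample_size=4)  # drawn once, shared by both
print(hit_rate(perfect, queries, idxs), hit_rate(broken, queries, idxs))  # 1.0 0.0
```

With a shared sample, any gap between two systems reflects the systems and not which items happened to be drawn.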


## 6) Qualitative Examples

Human-in-the-loop inspection remains valuable, especially for spotting semantic matches that go beyond exact word overlap.

In [8]:
queries = [
    'cozy wooden chair for study table',
    'small folding computer desk',
    'blue metal cafe table',
    'upholstered lounge arm chair'
]

for q in queries:
    print('—'*80)
    print('Query:', q)
    print('\nSemantic (embeddings + FAISS)')
    display(semantic_search(q, 5))
    print('\nTF-IDF + kNN')
    display(knn_search(q, 5))

————————————————————————————————————————————————————————————————————————————————
Query: cozy wooden chair for study table

Semantic (embeddings + FAISS)


Unnamed: 0,title,brand,cat_short,price,score
0,"Flash Furniture Webb Commercial Grade 24"" Roun...",Flash Furniture Store,Tables,140.0,0.540516
1,Flash Furniture Walker Small Rustic Natural Ho...,Flash Furniture Store,Home Office Desks,111.0,0.534974
2,Karl home Accent Chair Mid-Century Modern Chai...,Karl home Store,Chairs,149.99,0.532338
3,Amazon Basics Kids Adjustable Mesh Low-Back Sw...,Amazon Basics Store,Desk Chairs,53.99,0.525546
4,VECELO Modern Industrial Style 3-Piece Dining ...,VECELO Store,Table & Chair Sets,53.99,0.514563



TF-IDF + kNN


Unnamed: 0,title,brand,cat_short,price,score
0,MoNiBloom Round Folding Faux Fur Saucer Chair ...,MoNiBloom Store,Folding Chairs,53.99,0.240508
1,"Tiita Comfy Saucer Chair, Soft Faux Fur Oversi...",Tiita Store,Chairs,79.99,0.209013
2,"BOOSDEN Padded Folding Chair 2 Pack, Foldable ...",BOOSDEN Store,Folding Chairs,119.0,0.162735
3,BYOOTIQUE Makeup Chair Folding Camping Stool C...,BYOOTIQUE Store,Vanities & Vanity Benches,39.9,0.148176
4,"klotski Kids Table and 2 Chair Set, Wood Activ...",klotski,Table & Chair Sets,125.99,0.143491


————————————————————————————————————————————————————————————————————————————————
Query: small folding computer desk

Semantic (embeddings + FAISS)


Unnamed: 0,title,brand,cat_short,price,score
0,Flash Furniture Walker Small Rustic Natural Ho...,Flash Furniture Store,Home Office Desks,111.0,0.656388
1,"Wellynap Computer Desk,31.5 inches Folding Tab...",Wellynap Store,Folding Tables,69.99,0.609491
2,"ODK Small Computer Desk, 27.5 Inch, Compact Ti...",ODK Store,Home Office Desks,79.99,0.575102
3,"ODK Small Computer Desk, 27.5 inch Desk for Sm...",ODK Store,Home Office Desks,79.99,0.537782
4,It's_Organized Gaming Desk 55 inch PC Computer...,It's_Organized Store,Home Office Desks,139.99,0.535763



TF-IDF + kNN


Unnamed: 0,title,brand,cat_short,price,score
0,"Wellynap Computer Desk,31.5 inches Folding Tab...",Wellynap Store,Folding Tables,69.99,0.564429
1,Flash Furniture Walker Small Rustic Natural Ho...,Flash Furniture Store,Home Office Desks,111.0,0.312078
2,"ODK Small Computer Desk, 27.5 inch Desk for Sm...",ODK Store,Home Office Desks,79.99,0.308125
3,"ODK Small Computer Desk, 27.5 Inch, Compact Ti...",ODK Store,Home Office Desks,79.99,0.28351
4,Lufeiya Small Computer Desk with 2 Drawers for...,Lufeiya Store,Home Office Desks,67.9,0.275331


————————————————————————————————————————————————————————————————————————————————
Query: blue metal cafe table

Semantic (embeddings + FAISS)


Unnamed: 0,title,brand,cat_short,price,score
0,"Flash Furniture Webb Commercial Grade 24"" Roun...",Flash Furniture Store,Tables,140.0,0.746392
1,"danpinera Side Table Round Metal, Outdoor Side...",danpinera Store,End Tables,35.99,0.569669
2,Kate and Laurel Celia Round Metal Foldable Acc...,Kate and Laurel Store,End Tables,53.99,0.568221
3,HomePop Metal Accent Table Triangle Base Round...,HomePop Store,End Tables,53.99,0.548546
4,"LOKKHAN Industrial Bar Table 38.6""-48.4"" Heigh...",LOKKHAN Store,Tables,126.4,0.544969



TF-IDF + kNN


Unnamed: 0,title,brand,cat_short,price,score
0,"Flash Furniture Webb Commercial Grade 24"" Roun...",Flash Furniture Store,Tables,140.0,0.264464
1,"danpinera Side Table Round Metal, Outdoor Side...",danpinera Store,End Tables,35.99,0.187105
2,itbe Easy Fit Ready-to-Assemble Multipurpose O...,itbe Store,Storage Cabinets,53.99,0.186486
3,"Gnyonat Accent Chair with Ottoman,Living Room ...",Gnyonat,Chairs,219.0,0.176419
4,"LOKKHAN Industrial Bar Table 38.6""-48.4"" Heigh...",LOKKHAN Store,Tables,126.4,0.151078


————————————————————————————————————————————————————————————————————————————————
Query: upholstered lounge arm chair

Semantic (embeddings + FAISS)


Unnamed: 0,title,brand,cat_short,price,score
0,Karl home Accent Chair Mid-Century Modern Chai...,Karl home Store,Chairs,149.99,0.639928
1,"Armen Living Julius 30"" Cream Faux Leather and...",Armen Living Store,Barstools,53.99,0.590125
2,Adeco Euro Style Fabric Arm Bench Chair Footst...,Adeco Store,Furniture,53.99,0.572777
3,Boss Office Products Any Task Mid-Back Task Ch...,Boss Office Products Store,Home Office Desk Chairs,53.99,0.557837
4,MoNiBloom Round Folding Faux Fur Saucer Chair ...,MoNiBloom Store,Folding Chairs,53.99,0.544226



TF-IDF + kNN


Unnamed: 0,title,brand,cat_short,price,score
0,Karl home Accent Chair Mid-Century Modern Chai...,Karl home Store,Chairs,149.99,0.319317
1,"Lazy Chair with Ottoman, Modern Lounge Accent ...",WARMGIFT WM,Chairs,139.99,0.255332
2,"Pekokavo Sofa Arm Clip Tray, Side Table for Re...",NDL Store,TV Trays,24.99,0.195405
3,Adeco Euro Style Fabric Arm Bench Chair Footst...,Adeco Store,Furniture,53.99,0.184368
4,"Tiita Comfy Saucer Chair, Soft Faux Fur Oversi...",Tiita Store,Chairs,79.99,0.16799


## 7) Save Small Build Report

In [9]:
REPORT = INDEX_DIR / 'build_report.txt'  # same directory as faiss.index and embeddings.npy
with REPORT.open('w', encoding='utf-8') as f:
    f.write('Vector index build report\n')
    f.write(f'Total items: {index.ntotal}\n')
    f.write(f'Embeddings shape: {emb.shape}\n')
    f.write(f'Data source: {DATA_PATH}\n')
print('Wrote', REPORT)

Wrote ..\backend\vectorstore\build_report.txt
