# H&M Personalized Fashion Recommendations — Plan to Medal

Objective: Achieve MAP@12 ≥ 0.02394 (bronze+) within 24h by building a strong, fast, GPU-accelerated recommender with robust temporal CV.

Key constraints and approach:
- Use GPU (A10) for any heavy modeling (XGBoost/CatBoost, FAISS).
- Lock temporal CV mirroring test week. Avoid leakage.
- Fast, proven candidate generation (co-visitation + popularity + simple content) → train a re-ranker.
- Iterate on features with ablations; cache intermediates.

Data expectations (Kaggle H&M dataset):
- transactions_train.csv, customers.csv, articles.csv, sample_submission.csv, images/.
- In this repo, images/ exists; locate the CSVs (root or subdirs). If they are not present, locate prepared parquet/csv artifacts.

Validation protocol:
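- Time-based split: hold out the last week(s) as the validation target. Standard practice on the full Kaggle data (which ends 2020-09-22):
  - Train range: up to 2020-09-15
  - Valid target: 2020-09-16 to 2020-09-22 (one week), predicting next-week purchases per customer.
  - If the local transactions end earlier, derive the windows from the max t_dat instead (the copy in this repo ends 2020-09-14, giving train up to 2020-09-07 and validation 2020-09-08 to 2020-09-14).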
- Reproduce LB-like CV by creating validation queries for customers with activity up to cutoff and labels in next week.
- Metrics: MAP@12; use exact submission formatting for validation scoring.
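
The split and metric above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's scorer: the function names (`cv_windows`, `ap_at_k`, `map_at_k`) and the toy ids are hypothetical, predictions are assumed to contain no duplicates, and the mean is taken only over customers with at least one purchase in the target week, which is my understanding of how the competition scores submissions.

```python
from datetime import date, timedelta

def cv_windows(max_day, val_days=7):
    """Hold out the last `val_days` days; everything before is training history."""
    val_end = max_day
    val_start = val_end - timedelta(days=val_days - 1)
    train_end = val_start - timedelta(days=1)
    return train_end, val_start, val_end

def ap_at_k(actual, predicted, k=12):
    """Average precision at k for one customer (`actual` is a non-empty set)."""
    hits, total = 0, 0.0
    for i, aid in enumerate(predicted[:k]):
        if aid in actual:
            hits += 1
            total += hits / (i + 1)
    return total / min(len(actual), k)

def map_at_k(labels, preds, k=12):
    """Mean AP@k over customers with at least one purchase in the target week."""
    scores = [ap_at_k(items, preds.get(cid, []), k)
              for cid, items in labels.items() if items]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data (hypothetical ids): c3 has predictions but no purchases, so it is not scored.
labels = {"c1": {1, 2}, "c2": {3}}
preds = {"c1": [1, 9, 2], "c2": [8, 3], "c3": [1, 2, 3]}
print(cv_windows(date(2020, 9, 14)))
print(round(map_at_k(labels, preds), 4))
```

Averaging over every customer in sample_submission.csv instead (scoring inactive customers as 0) gives a much smaller number that is not directly comparable to the leaderboard.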

System checks (immediately):
- Verify GPU availability.
- Inspect directory structure; confirm presence/format of data artifacts.
- If large CSVs: convert to feather/parquet once; load with dtypes to save RAM.

Baseline (fast) deliverable (target ≥ 0.020 CV):
1) Popularity per week (last-N days) and overall popularity fallback.
2) Co-visitation matrix (item-item) using last 3-4 weeks window; generate top-100 candidates/customer.
3) Blend candidates: co-visitation (weights by recency) + global recent popularity.
4) Rank by blended scores; produce 12 per customer.
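
A rough sketch of steps 3 and 4, assuming a precomputed neighbor table and a popularity list. The function name `blend_candidates` and the `decay` and `pop_weight` values are illustrative placeholders, not tuned settings from this notebook.

```python
from collections import defaultdict

def blend_candidates(history, neighbors, popular, k=12, decay=0.8, pop_weight=0.01):
    """Score candidates for one customer.

    history:   the customer's article ids, most recent first
    neighbors: {article_id: [(neighbor_id, covis_score), ...]}
    popular:   globally popular article ids, best first
    """
    scores = defaultdict(float)
    for rank, aid in enumerate(history):
        w = decay ** rank  # more recent history items get a larger weight
        for nb, s in neighbors.get(aid, []):
            scores[nb] += w * s
    for rank, aid in enumerate(popular):
        # small popularity prior: fills the list when co-vis yields too few items
        scores[aid] += pop_weight / (rank + 1)
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:k]

# Toy inputs (hypothetical ids and scores)
neighbors = {10: [(11, 2.0), (12, 1.0)], 20: [(12, 2.0), (13, 0.5)]}
popular = [99, 11]
print(blend_candidates([10, 20], neighbors, popular, k=3))
print(blend_candidates([], neighbors, popular, k=3))  # cold customer: popularity only
```

Keeping the popularity weight small means it mostly acts as a tie-breaker and filler, so customers with history are ranked by their co-visitation evidence first.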

Reranker (to push to medal range):
- Model: XGBoost (GPU) or CatBoost (GPU).
- Training data: (customer, candidate article, label if bought in target week).
- Features:
  Customer:
  - Age, FN, Active, club_member_status, fashion_news_frequency, avg basket size, recency (days since last purchase), frequency counts, price spend stats.
  Article:
  - product_type_no, product_group_name, graphical_appearance_no, color_group, perceived_colour, section, garment_group, price, season/year, text length if description available.
  Interactions:
  - Days since customer last bought this article/product_type/product_group.
  - Customer x article co-vis scores, session co-counts, recency-weighted counts.
  - Similarity via simple TF-IDF of product_type/colour or target-encoding means.
  - Price relativity: article price vs customer's median price.
- Negative sampling: for each customer, sample from candidate pool not purchased in target window.
- Loss: binary logistic; predict probability; rank top-12.
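
One possible way to assemble the (customer, candidate, label) rows with negative sampling is sketched below. The helper name `build_training_rows`, the `neg_per_pos` ratio, and the choice to skip customers with no positive among their candidates are all assumptions made for illustration.

```python
import random

def build_training_rows(candidates, labels, neg_per_pos=5, seed=0):
    """Turn candidate lists into labeled rows for a binary reranker.

    candidates: {customer_id: [article_id, ...]} from the candidate generators
    labels:     {customer_id: set(article_id)} bought in the target week
    Returns a list of (customer_id, article_id, label) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for cid, cands in candidates.items():
        bought = labels.get(cid, set())
        pos = [a for a in cands if a in bought]
        neg = [a for a in cands if a not in bought]
        if not pos:
            continue  # assumption: customers without a positive add no ranking signal
        n_neg = min(len(neg), neg_per_pos * len(pos))
        rows += [(cid, a, 1) for a in pos]
        rows += [(cid, a, 0) for a in rng.sample(neg, n_neg)]
    return rows

# Toy inputs (hypothetical ids)
candidates = {"c1": [1, 2, 3, 4], "c2": [5, 6]}
labels = {"c1": {2}, "c2": {9}}
rows = build_training_rows(candidates, labels, neg_per_pos=2)
print(rows)
```

Sampling negatives only from the candidate pool keeps the training distribution close to what the reranker sees at inference time, since it only ever scores generated candidates.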

Candidate generation details:
- Build multiple co-vis matrices:
  1) Standard co-vis (pairs within customer sessions/window).
  2) Time-decayed co-vis (exponential decay).
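  3) Buy2buy only (pairs where both are purchases). Note that transactions_train.csv contains purchases only, so on this dataset every pair already qualifies and the variants differ only in window and weighting.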
- Optionally add simple content candidates:
  - Top items per product_type within last week.
  - FAISS on image embeddings (later, if time permits) to fill long-tail.
- Cap candidates per source (e.g., 30 + 30 + 30 + popularity 50 → unique top ~150).
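
The time-decayed co-visitation and the per-source cap can be illustrated with a small pure-Python sketch. It assumes customer-day baskets and a weekly decay of 0.9; `build_covis` and `merge_sources` are hypothetical names, and a production version would use Polars or pandas joins rather than Python loops.

```python
from collections import defaultdict
from datetime import date
from itertools import permutations

def build_covis(transactions, ref_day, weekly_decay=0.9, topn=100):
    """transactions: iterable of (customer_id, day, article_id).

    Items bought by the same customer on the same day form a basket. Each ordered
    pair (i, j) with i != j adds weekly_decay ** (days_ago / 7) to score[i][j].
    """
    baskets = defaultdict(set)  # a set, so repeated items in a basket count once
    for cid, day, aid in transactions:
        baskets[(cid, day)].add(aid)
    scores = defaultdict(lambda: defaultdict(float))
    for (_, day), items in baskets.items():
        w = weekly_decay ** ((ref_day - day).days / 7.0)
        for i, j in permutations(sorted(items), 2):
            scores[i][j] += w
    return {i: sorted(nb.items(), key=lambda kv: (-kv[1], kv[0]))[:topn]
            for i, nb in scores.items()}

def merge_sources(sources, caps, limit=150):
    """Take at most caps[n] ids from the n-th ranked list, dropping duplicates."""
    out, seen = [], set()
    for ids, cap in zip(sources, caps):
        for aid in ids[:cap]:
            if aid not in seen:
                out.append(aid)
                seen.add(aid)
    return out[:limit]

# Toy transactions (hypothetical ids): one basket on the reference day, one a week earlier
tx = [("u1", date(2020, 9, 7), 1), ("u1", date(2020, 9, 7), 2),
      ("u2", date(2020, 8, 31), 1), ("u2", date(2020, 8, 31), 3)]
covis = build_covis(tx, ref_day=date(2020, 9, 7))
print(covis[1])
print(merge_sources([[1, 2, 3], [3, 4, 5]], caps=[2, 2]))
```

Earlier sources win ties in `merge_sources`, so the order in which sources are passed doubles as a priority order.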

Efficiency & caching:
- Persist co-vis matrices and candidate lists to parquet.
- Persist features matrices (train/valid/test) and predictions for quick iteration.
- Log timings per step; subsample to debug.
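
A minimal build-or-load helper along these lines is sketched below. To stay dependency-free it uses pickle rather than the parquet files planned above, and the name `cached` and the file layout are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def cached(path, build_fn):
    """Return the artifact stored at `path`, building and saving it on first use."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    obj = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return obj

calls = []

def build_neighbors():
    calls.append(1)  # count how often the expensive step actually runs
    return {101: [(102, 1.0)]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cache" / "neighbors.pkl"
    first = cached(target, build_neighbors)
    second = cached(target, build_neighbors)  # served from disk, no rebuild
print(first == second, len(calls))
```

Because the cache key is only the file path, the path should encode anything that changes the result (for example the train window and the decay setting); otherwise a stale artifact is silently reused.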

Milestones and Expert reviews:
M1) Repo scan + GPU check + Data load plan (request expert review).
M2) Temporal CV + baseline popularity and co-vis candidates; local CV MAP@12 (expert review).
M3) Reranker v1 features + XGB GPU; OOF MAP@12 and speed (expert review).
M4) Feature ablations (recency, price-rel, TE), tune early stopping; ensure stability with seeds (expert review).
M5) Final train on full window, generate submission.csv, sanity-check top-12 formatting.

Risk controls:
- Prevent leakage: candidates/feats only from train window.
- Handle cold-start customers: fallback to recent popularity by age/club segment, else global top-12.
- Memory: cast dtypes (int32/int16, category), chunk reads.
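
The cold-start fallback could look roughly like this: segment-level popularity first, then global popularity to fill the remaining slots. The age buckets and the function names are illustrative assumptions; club_member_status could be added to the segment key in the same way.

```python
from collections import Counter, defaultdict

def age_bucket(age):
    """Coarse, hypothetical age buckets; None means the age is missing."""
    if age is None:
        return "unknown"
    return "<25" if age < 25 else "25-39" if age < 40 else "40+"

def segment_popularity(purchases, ages, k=12):
    """purchases: (customer_id, article_id) pairs from the recent training window."""
    counts = defaultdict(Counter)
    for cid, aid in purchases:
        counts[age_bucket(ages.get(cid))][aid] += 1
    return {seg: [a for a, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
            for seg, c in counts.items()}

def cold_start_preds(cid, ages, seg_top, global_top, k=12):
    """Segment top items first, then global top items, without duplicates."""
    out = list(seg_top.get(age_bucket(ages.get(cid)), []))[:k]
    seen = set(out)
    for aid in global_top:
        if len(out) >= k:
            break
        if aid not in seen:
            out.append(aid)
            seen.add(aid)
    return out

# Toy inputs (hypothetical ids): "new" has an age but no purchases, "ghost" has neither
ages = {"a": 22, "b": 23, "c": 50, "new": 21}
purchases = [("a", 1), ("b", 1), ("a", 2), ("c", 3)]
seg_top = segment_popularity(purchases, ages)
global_top = [3, 1, 2, 4]
print(cold_start_preds("new", ages, seg_top, global_top, k=3))
print(cold_start_preds("ghost", ages, seg_top, global_top, k=3))
```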

Next steps (in order):
1) System/GPU check and list files.
2) Load metadata heads; infer dtypes; plan parquet caching.
3) Implement CV splitter and local MAP@12 evaluator.
4) Build fast baseline candidates (co-vis + popularity) and produce a working submission.
5) Add reranker features and train GPU model.
6) Iterate features; finalize and submit.

In [1]:
import os, sys, subprocess, json, time, shutil, glob
from pathlib import Path
import pandas as pd

print('=== System & GPU Check ===')
start=time.time()
def run(cmd):
    try:
        out = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT, text=True)
        print(out)
    except subprocess.CalledProcessError as e:
        print(e.output)

print('Python:', sys.version)
print('CWD:', os.getcwd())
print('GPU env vars CUDA_VISIBLE_DEVICES=', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('\n[nvidia-smi]')
run('nvidia-smi || true')

print('\n[Check GPU via PyTorch import if available]')
try:
    import torch
    print('Torch version:', torch.__version__)
    print('GPU Available:', torch.cuda.is_available())
    if torch.cuda.is_available():
        print('GPU Count:', torch.cuda.device_count())
        print('GPU Name:', torch.cuda.get_device_name(0))
        props = torch.cuda.get_device_properties(0)
        print(f'GPU Memory: {props.total_memory/1024**3:.1f} GB')
except Exception as e:
    print('Torch not available or failed to import:', e)

print('\n=== Scan for data artifacts (csv/parquet/feather) ===')
root = Path('.')
patterns = ['**/*.csv','**/*.parquet','**/*.feather']
found = []
for pat in patterns:
    # collect every match; only the 50 largest are printed below
    for p in root.glob(pat):
        # drop anything under images/ from the results
        if 'images' in p.parts:
            continue
        try:
            size = p.stat().st_size
        except Exception:
            size = -1
        found.append((str(p), size))
found.sort(key=lambda x: (-x[1], x[0]))
for path,size in found[:50]:
    print(f'{size/1024/1024:8.1f} MB  {path}')
print(f'Total data-like files found: {len(found)}')

print('\n=== List top-level entries ===')
for p in sorted(root.iterdir()):
    try:
        info = f'{p.name}/' if p.is_dir() else p.name
        size = p.stat().st_size if p.is_file() else sum(f.stat().st_size for f in p.rglob('*') if f.is_file())
        print(f'{size/1024/1024:8.1f} MB  {info}')
    except Exception as e:
        print(p, e)

print(f'Completed in {time.time()-start:.1f}s')

=== System & GPU Check ===
Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
CWD: /var/lib/simon/agent_run_states/h-and-m-personalized-fashion-recommendations-20250921-042821
GPU env vars CUDA_VISIBLE_DEVICES= None

[nvidia-smi]
Sun Sep 21 04:48:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      De

  3268.5 MB  transactions_train.csv
   257.8 MB  sample_submission.csv
   257.8 MB  submission.csv
   197.5 MB  customers.csv
    34.5 MB  articles.csv
Total data-like files found: 5

=== List top-level entries ===
     0.0 MB  00_eda_and_planning.ipynb
     0.1 MB  agent_metadata/
    34.5 MB  articles.csv
   197.5 MB  customers.csv
     0.0 MB  description.md
     0.0 MB  docker_run.log
 29141.5 MB  images/
     0.0 MB  requirements.txt
   257.8 MB  sample_submission.csv
   257.8 MB  submission.csv
     0.0 MB  task.txt
  3268.5 MB  transactions_train.csv
Completed in 1.7s


In [3]:
# Install deps and convert CSV -> Parquet with optimized dtypes
import sys, subprocess, time, os
t0=time.time()
def pip_install(pkgs):
    print('Installing:', ' '.join(pkgs), flush=True)
    cmd=[sys.executable, '-m', 'pip', 'install', '-q'] + pkgs
    res = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    print(res.stdout)

need = ['polars', 'xgboost', 'catboost']
pip_install(need)

import polars as pl
from pathlib import Path

data_dir = Path('.')
out_dir = Path('parquet'); out_dir.mkdir(exist_ok=True)

print('Converting CSV to Parquet with downcast dtypes...', flush=True)

# Transactions: keep core columns; cast compact dtypes. Keep all rows; filter price>0 later in features/labels.
tx_csv = data_dir / 'transactions_train.csv'
tx_parq = out_dir / 'transactions_train.parquet'
if not tx_parq.exists():
    t=time.time()
    dtypes = {
        't_dat': pl.Utf8,
        'customer_id': pl.Utf8,
        'article_id': pl.Int64,
        'price': pl.Float64,
        'sales_channel_id': pl.Int64,
    }
    print('Reading transactions lazily...', flush=True)
    ldf = pl.scan_csv(str(tx_csv), schema_overrides=dtypes).select([
        pl.col('t_dat'),
        pl.col('customer_id'),
        pl.col('article_id').cast(pl.Int32),
        pl.col('price').cast(pl.Float32),
        pl.col('sales_channel_id').cast(pl.Int8),
    ]).with_columns([
        pl.col('t_dat').str.strptime(pl.Date, format='%Y-%m-%d', strict=False).alias('t_dat'),
        # customer_id stays as string; we can later map to int codes per pipeline step
    ])
    print('Writing transactions to Parquet...', flush=True)
    ldf.collect(streaming=True).write_parquet(str(tx_parq))
    print(f'transactions -> {tx_parq} done in {time.time()-t:.1f}s')
else:
    print('Skipping transactions; Parquet exists')

# Customers
cust_csv = data_dir / 'customers.csv'
cust_parq = out_dir / 'customers.parquet'
if not cust_parq.exists():
    t=time.time()
    print('Reading customers...', flush=True)
    ldf = pl.scan_csv(str(cust_csv)).with_columns([
        pl.col('FN').cast(pl.Int8, strict=False),
        pl.col('Active').cast(pl.Int8, strict=False),
        pl.col('age').cast(pl.Int16, strict=False),
    ])
    print('Writing customers to Parquet...', flush=True)
    ldf.collect(streaming=True).write_parquet(str(cust_parq))
    print(f'customers -> {cust_parq} done in {time.time()-t:.1f}s')
else:
    print('Skipping customers; Parquet exists')

# Articles
art_csv = data_dir / 'articles.csv'
art_parq = out_dir / 'articles.parquet'
if not art_parq.exists():
    t=time.time()
    print('Reading articles...', flush=True)
    ldf = pl.scan_csv(str(art_csv))
    print('Writing articles to Parquet...', flush=True)
    ldf.collect(streaming=True).write_parquet(str(art_parq))
    print(f'articles -> {art_parq} done in {time.time()-t:.1f}s')
else:
    print('Skipping articles; Parquet exists')

print(f'All conversions completed in {time.time()-t0:.1f}s', flush=True)

Installing: polars xgboost catboost

Converting CSV to Parquet with downcast dtypes...
Reading transactions lazily...
Writing transactions to Parquet...
transactions -> parquet/transactions_train.parquet done in 7.9s
Reading customers...
Writing customers to Parquet...
customers -> parquet/customers.parquet done in 1.3s
Reading articles...
Writing articles to Parquet...
articles -> parquet/articles.parquet done in 0.1s
All conversions completed in 32.3s


In [None]:
# CV dates, MAP@12 scorer, and helpers (no execution yet)
from datetime import date, timedelta
import numpy as np
import pandas as pd
from pathlib import Path

# Fixed CV window matching LB-like split (assumes data through 2020-09-22; derive from max t_dat if the local data ends earlier)
TRAIN_END = date(2020, 9, 15)
VAL_START = date(2020, 9, 16)
VAL_END   = date(2020, 9, 22)

def map12_score(y_true_dict, y_pred_dict):
    # y_true_dict: {customer_id: set(article_ids)}
    # y_pred_dict: {customer_id: [article_ids ordered length<=12]}
    K = 12
    aps = []
    for cid, preds in y_pred_dict.items():
        labels = y_true_dict.get(cid, set())
        if not isinstance(labels, set):
            labels = set(labels)
        hit = 0
        prec_sum = 0.0
        for i, aid in enumerate(preds[:K]):
            if aid in labels:
                hit += 1
                prec_sum += hit / (i + 1)
        if len(labels) == 0:
            # ALL-customer averaging: customers with no purchases contribute 0.
            # The competition metric excludes them, so this mean is diluted relative to the LB.
            aps.append(0.0)
        else:
            aps.append(prec_sum / min(len(labels), K))
    return float(np.mean(aps)) if aps else 0.0

def ensure_12(preds_list, filler_list):
    out = []
    seen = set()
    for a in preds_list:
        if a not in seen:
            out.append(a); seen.add(a)
        if len(out) == 12:
            return out
    for a in filler_list:
        if a not in seen:
            out.append(a); seen.add(a)
        if len(out) == 12:
            break
    if len(out) < 12 and filler_list:
        # not enough unique items: pad with the top filler item (duplicates) to reach 12
        out += [filler_list[0]] * (12 - len(out))
    return out

def load_sample_customers(sample_path='sample_submission.csv') -> pd.Series:
    ss = pd.read_csv(sample_path, usecols=['customer_id'])
    return ss['customer_id']

def read_transactions_parquet(path='parquet/transactions_train.parquet', cols=None):
    import polars as pl
    cols = cols or ['t_dat','customer_id','article_id','price','sales_channel_id']
    return pl.read_parquet(path, columns=cols)

def compute_global_popularity(pl_df, start_date, end_date, topk=20000):
    # pl_df: Polars DataFrame with t_dat (Date), article_id (int), price (float32)
    import polars as pl
    df = pl_df.filter((pl.col('t_dat') >= pl.lit(start_date)) & (pl.col('t_dat') <= pl.lit(end_date)) & (pl.col('price') > 0))
    pop = (df.group_by('article_id').len().sort('len', descending=True).select('article_id').head(topk))
    return pop['article_id'].to_list()

def to_submission_format(pred_dict, out_path='submission.csv'):
    # pred_dict: {customer_id: [aid,...12]} aids are ints
    rows = []
    for cid, aids in pred_dict.items():
        preds_str = ' '.join([str(a).zfill(10) for a in aids[:12]])
        rows.append((cid, preds_str))
    sub = pd.DataFrame(rows, columns=['customer_id','prediction'])
    sub.to_csv(out_path, index=False)
    return out_path

print('Helpers ready: MAP@12 scorer, ensure_12, global popularity, submission writer.')

In [4]:
# Fast baseline: recent 7-day popularity -> submission.csv
import time
t0=time.time()
import polars as pl

parq_tx = 'parquet/transactions_train.parquet'
print('Loading transactions parquet (columns: t_dat, article_id, price)...', flush=True)
tx = pl.read_parquet(parq_tx, columns=['t_dat','article_id','price'])
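from datetime import timedelta
max_day = tx.select(pl.col('t_dat').max()).item()  # Python date
start_day = max_day - timedelta(days=6)  # concrete date, so the f-string below prints a date rather than a Polars expression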
print(f'Max date in data: {max_day}; Popularity window: {start_day}..{max_day}', flush=True)

tx7 = tx.filter((pl.col('t_dat') >= start_day) & (pl.col('t_dat') <= max_day) & (pl.col('price') > 0))
pop7 = (tx7.group_by('article_id').len().sort('len', descending=True).select('article_id'))
topN = pop7['article_id'].to_list()
if not topN:
    # fallback if window empty (shouldn't happen)
    pop_all = tx.filter(pl.col('price') > 0).group_by('article_id').len().sort('len', descending=True).select('article_id')
    topN = pop_all['article_id'].to_list()
top12 = [int(a) for a in topN[:12]]
print('Top12 articles:', top12, flush=True)

import pandas as pd
ss = pd.read_csv('sample_submission.csv')
def ensure_12_fill(aids, filler):
    # Take unique ids from aids, then from filler, until 12 are collected.
    # A single pass avoids looping forever when fewer than 12 unique ids exist.
    out=[]; seen=set()
    for a in list(aids) + list(filler):
        if a not in seen:
            out.append(a); seen.add(a)
        if len(out)==12: break
    return out

pred_str = ' '.join([str(a).zfill(10) for a in ensure_12_fill(top12, topN[:1000])])
ss['prediction'] = pred_str  # same string for all customers as a smoke baseline
out_path = 'submission.csv'
ss.to_csv(out_path, index=False)
print(f'Wrote baseline submission to {out_path} in {time.time()-t0:.1f}s; size={Path(out_path).stat().st_size/1024/1024:.2f} MB', flush=True)

Loading transactions parquet (columns: t_dat, article_id, price)...
Max date in data: 2020-09-14; Popularity window: [(2020-09-14) - (6d.alias("duration"))]..2020-09-14
Top12 articles: [909370001, 865799006, 918522001, 448509014, 751471001, 924243001, 918292001, 762846027, 863646001, 809238001, 715624001, 673677002]
Wrote baseline submission to submission.csv in 4.5s; size=257.76 MB


In [6]:
# Temporal CV: popularity-only baseline MAP@12
import polars as pl
import pandas as pd
from datetime import date, timedelta
import numpy as np
import time

def map12_score(y_true_dict, y_pred_dict):
    K = 12
    aps = []
    for cid, preds in y_pred_dict.items():
        labels = y_true_dict.get(cid, set())
        if not isinstance(labels, set):
            labels = set(labels)
        hit = 0
        prec_sum = 0.0
        for i, aid in enumerate(preds[:K]):
            if aid in labels:
                hit += 1
                prec_sum += hit / (i + 1)
        if len(labels) == 0:
            # no purchases: scored 0 here, so this is an ALL-customer mean (diluted vs. the LB)
            aps.append(0.0)
        else:
            aps.append(prec_sum / min(len(labels), K))
    return float(np.mean(aps)) if aps else 0.0

t0=time.time()
tx = pl.read_parquet('parquet/transactions_train.parquet', columns=['t_dat','customer_id','article_id','price'])
max_day = tx.select(pl.col('t_dat').max()).item()  # python date
VAL_END = max_day
VAL_START = VAL_END - timedelta(days=6)
TRAIN_END = VAL_START - timedelta(days=1)
print(f'CV windows -> TRAIN_END: {TRAIN_END}, VAL: {VAL_START}..{VAL_END}', flush=True)

# Build labels for validation week (price>0), dedup customer-article in week
val_df = tx.filter((pl.col('t_dat') >= pl.lit(VAL_START)) & (pl.col('t_dat') <= pl.lit(VAL_END)) & (pl.col('price') > 0))
val_pairs = val_df.select(['customer_id','article_id']).unique(maintain_order=True)
labels = {}
for cid, grp in val_pairs.to_pandas().groupby('customer_id'):
    labels[cid] = set(grp['article_id'].astype(int).tolist())
print('Validation customers with purchases:', len(labels))

# Popularity computed on last 7 days of training window
hist_start = TRAIN_END - timedelta(days=6)
hist_end = TRAIN_END
print(f'Popularity window for training: {hist_start}..{hist_end}', flush=True)
hist = tx.filter((pl.col('t_dat') >= pl.lit(hist_start)) & (pl.col('t_dat') <= pl.lit(hist_end)) & (pl.col('price') > 0))
pop = hist.group_by('article_id').len().sort('len', descending=True).select('article_id')
pop_list = [int(a) for a in pop['article_id'].to_list()]
fallback = pop_list[:1000] if len(pop_list)>=12 else (tx.filter(pl.col('price')>0).group_by('article_id').len().sort('len', descending=True).select('article_id')['article_id'].to_list())
fallback = [int(a) for a in fallback][:1000]
top12 = fallback[:12]

# Predictions for all customers in sample_submission
ss_cust = pd.read_csv('sample_submission.csv', usecols=['customer_id'])['customer_id']
pred_dict = {}
for cid in ss_cust.values:
    pred_dict[cid] = top12  # same list for baseline

# Score MAP@12
score = map12_score(labels, pred_dict)
print(f'CV MAP@12 (popularity-only): {score:.5f}  | elapsed {time.time()-t0:.1f}s')

CV windows -> TRAIN_END: 2020-09-07, VAL: 2020-09-08..2020-09-14
Validation customers with purchases: 74575
Popularity window for training: 2020-09-01..2020-09-07
CV MAP@12 (popularity-only): 0.00037  | elapsed 6.0s


In [7]:
# Temporal CV: Repurchase + popularity baseline
import polars as pl
import pandas as pd
from datetime import timedelta
import numpy as np
import time

def map12_score(y_true_dict, y_pred_dict):
    K = 12
    aps = []
    for cid, preds in y_pred_dict.items():
        labels = y_true_dict.get(cid, set())
        if not isinstance(labels, set):
            labels = set(labels)
        hit = 0
        prec_sum = 0.0
        for i, aid in enumerate(preds[:K]):
            if aid in labels:
                hit += 1
                prec_sum += hit / (i + 1)
        if len(labels) == 0:
            aps.append(0.0)
        else:
            aps.append(prec_sum / min(len(labels), K))
    return float(np.mean(aps)) if aps else 0.0

t0=time.time()
tx = pl.read_parquet('parquet/transactions_train.parquet', columns=['t_dat','customer_id','article_id','price'])
max_day = tx.select(pl.col('t_dat').max()).item()
VAL_END = max_day
VAL_START = VAL_END - timedelta(days=6)
TRAIN_END = VAL_START - timedelta(days=1)
print(f'Windows -> TRAIN_END: {TRAIN_END}, VAL: {VAL_START}..{VAL_END}', flush=True)

# Labels for validation (price>0), unique per customer-article within the week
val_df = tx.filter((pl.col('t_dat') >= pl.lit(VAL_START)) & (pl.col('t_dat') <= pl.lit(VAL_END)) & (pl.col('price') > 0))
val_pairs = val_df.select(['customer_id','article_id']).unique(maintain_order=True).to_pandas()
labels = {cid: set(g.article_id.astype(int).tolist()) for cid, g in val_pairs.groupby('customer_id')}
print('Validation customers with purchases:', len(labels))

# Training slice (last 12 weeks) for repurchase + popularity
train_start = TRAIN_END - timedelta(days=7*12)
hist_start = TRAIN_END - timedelta(days=6)
hist_end = TRAIN_END
print(f'Train slice: {train_start}..{TRAIN_END} | Pop window: {hist_start}..{hist_end}', flush=True)
train_tx = tx.filter((pl.col('t_dat') >= pl.lit(train_start)) & (pl.col('t_dat') <= pl.lit(TRAIN_END)) & (pl.col('price') > 0))

# Global popularity (fallback)
pop = train_tx.filter((pl.col('t_dat') >= pl.lit(hist_start)) & (pl.col('t_dat') <= pl.lit(hist_end))).group_by('article_id').len().sort('len', descending=True).select('article_id')
pop_list = [int(a) for a in pop['article_id'].to_list()]
if len(pop_list) < 12:
    pop_all = train_tx.group_by('article_id').len().sort('len', descending=True).select('article_id')
    pop_list = [int(a) for a in pop_all['article_id'].to_list()]
fallback = pop_list[:5000]

# Repurchase candidates: last 20 distinct items per customer ordered by recency
train_tx_pd = train_tx.select(['customer_id','article_id','t_dat']).to_pandas()
train_tx_pd['article_id'] = train_tx_pd['article_id'].astype(int)
train_tx_pd = train_tx_pd.sort_values(['customer_id','t_dat'], ascending=[True, False])
repurchase = {}
for cid, g in train_tx_pd.groupby('customer_id', sort=False):
    seen=set(); lst=[]
    for aid in g['article_id'].values:
        if aid not in seen:
            lst.append(aid); seen.add(aid)
            if len(lst) >= 20:
                break
    repurchase[cid]=lst
print('Repurchase dict size:', len(repurchase))

# Build predictions for all customers in sample_submission
ss_cust = pd.read_csv('sample_submission.csv', usecols=['customer_id'])['customer_id']
pred = {}
for cid in ss_cust.values:
    base = list(repurchase.get(cid, []))  # copy so the cached repurchase list is not mutated
    # fill to 12 with popularity without duplicates
    seen=set(base)
    for a in fallback:
        if len(base) >= 12: break
        if a not in seen:
            base.append(a); seen.add(a)
    if len(base) < 12 and fallback:
        base += [fallback[0]] * (12-len(base))
    pred[cid] = base[:12]

cv = map12_score(labels, pred)
print(f'CV MAP@12 (repurchase + pop): {cv:.5f} | elapsed {time.time()-t0:.1f}s', flush=True)

Windows -> TRAIN_END: 2020-09-07, VAL: 2020-09-08..2020-09-14
Validation customers with purchases: 74575
Train slice: 2020-06-15..2020-09-07 | Pop window: 2020-09-01..2020-09-07
Repurchase dict size: 523693
CV MAP@12 (repurchase + pop): 0.00131 | elapsed 26.7s


In [9]:
# Co-visitation (buy2buy + decayed) candidates + CV MAP@12
import polars as pl
import pandas as pd
from datetime import timedelta
import numpy as np
import time

def map12_score(y_true_dict, y_pred_dict):
    K = 12
    aps = []
    for cid, preds in y_pred_dict.items():
        labels = y_true_dict.get(cid, set())
        if not isinstance(labels, set):
            labels = set(labels)
        hit = 0; prec_sum = 0.0
        for i, aid in enumerate(preds[:K]):
            if aid in labels:
                hit += 1; prec_sum += hit / (i + 1)
        aps.append(0.0 if len(labels)==0 else (prec_sum / min(len(labels), K)))
    return float(np.mean(aps)) if aps else 0.0

t0=time.time()
tx = pl.read_parquet('parquet/transactions_train.parquet', columns=['t_dat','customer_id','article_id','price'])
max_day = tx.select(pl.col('t_dat').max()).item()
VAL_END = max_day
VAL_START = VAL_END - timedelta(days=6)
TRAIN_END = VAL_START - timedelta(days=1)
print(f'Windows -> TRAIN_END: {TRAIN_END}, VAL: {VAL_START}..{VAL_END}', flush=True)

# Labels in validation week
val_df = tx.filter((pl.col('t_dat') >= pl.lit(VAL_START)) & (pl.col('t_dat') <= pl.lit(VAL_END)) & (pl.col('price') > 0))
val_pairs = val_df.select(['customer_id','article_id']).unique(maintain_order=True).to_pandas()
labels = {cid: set(g.article_id.astype(int).tolist()) for cid, g in val_pairs.groupby('customer_id')}
print('Validation customers with purchases:', len(labels))

# Training slice for co-vis (last 10 weeks), price>0 only
train_start = TRAIN_END - timedelta(days=7*10)
train_tx = tx.filter((pl.col('t_dat') >= pl.lit(train_start)) & (pl.col('t_dat') <= pl.lit(TRAIN_END)) & (pl.col('price') > 0))
print(f'Train slice for covis: {train_start}..{TRAIN_END}  | rows={train_tx.height}', flush=True)

# Global popularity fallback (last 7 days of train window)
hist_start = TRAIN_END - timedelta(days=6)
hist_end = TRAIN_END
pop = train_tx.filter((pl.col('t_dat') >= pl.lit(hist_start)) & (pl.col('t_dat') <= pl.lit(hist_end))).group_by('article_id').len().sort('len', descending=True).select('article_id')
pop_list = [int(a) for a in pop['article_id'].to_list()]
if len(pop_list) < 12:
    pop_all = train_tx.group_by('article_id').len().sort('len', descending=True).select('article_id')
    pop_list = [int(a) for a in pop_all['article_id'].to_list()]
fallback_pop = pop_list[:20000]

# Build customer-day baskets (cap 30 items/day), compute recency weight
print('Building co-vis pairs...', flush=True)
cov_t0 = time.time()
baskets = (
    train_tx.select([
        pl.col('customer_id'),
        pl.col('t_dat'),
        pl.col('article_id').alias('aid')
    ])
    .group_by(['customer_id','t_dat'])
    .agg(pl.col('aid').head(30).alias('items'))
)
print('Baskets:', baskets.height, 'elapsed', f'{time.time()-cov_t0:.1f}s', flush=True)

# Prepare for self-join
cov_t1 = time.time()
b1 = baskets.explode('items').rename({'items':'aid_i'})
b2 = baskets.explode('items').rename({'items':'aid_j'})
# Join on same customer_id & t_dat to get ordered pairs (both (i, j) and (j, i)), drop self-pairs
pairs = (
    b1.join(b2, on=['customer_id','t_dat'])
      .filter(pl.col('aid_i') != pl.col('aid_j'))
)
# Weight by recency: w = 0.9^(days_ago/7)
days_ago = (pl.lit(TRAIN_END) - pl.col('t_dat')).dt.total_days().cast(pl.Float32)
pairs = pairs.with_columns((0.9 ** (days_ago / 7.0)).alias('w'))
print('Pairs rows:', pairs.height, 'elapsed', f'{time.time()-cov_t1:.1f}s', flush=True)

# Aggregate scores for aid_i -> aid_j
cov_t2 = time.time()
covis = pairs.group_by(['aid_i','aid_j']).agg(pl.col('w').sum().alias('score'))
print('Covis pairs unique:', covis.height, 'elapsed', f'{time.time()-cov_t2:.1f}s', flush=True)

# Keep top 100 neighbors per aid_i
cov_t3 = time.time()
covis_pd = covis.sort(['aid_i','score'], descending=[False, True]).to_pandas()
top_neighbors = {}
for aid_i, grp in covis_pd.groupby('aid_i', sort=False):
    # rows are already sorted by score (descending) within each aid_i
    g = grp.head(100)
    top_neighbors[int(aid_i)] = list(zip(g['aid_j'].astype(int).tolist(), g['score'].astype(float).tolist()))
print('Neighbors built for aids:', len(top_neighbors), 'elapsed', f'{time.time()-cov_t3:.1f}s', flush=True)

# Build customer recent history (last 20 distinct aids up to TRAIN_END)
hist_t0 = time.time()
cust_hist_pd = (
    train_tx.select(['customer_id','article_id','t_dat'])
            .sort(['customer_id','t_dat'], descending=[False, True])
            .to_pandas()
)
cust_hist_pd['article_id'] = cust_hist_pd['article_id'].astype(int)
cust_recent = {}
for cid, g in cust_hist_pd.groupby('customer_id', sort=False):
    seen=set(); lst=[]
    for aid in g['article_id'].values:
        if aid not in seen:
            lst.append(aid); seen.add(aid)
            if len(lst) >= 20: break
    cust_recent[cid] = lst
print('Customer histories:', len(cust_recent), 'elapsed', f'{time.time()-hist_t0:.1f}s', flush=True)

# Generate candidates per customer by expanding neighbors of recent items
ss_cust = pd.read_csv('sample_submission.csv', usecols=['customer_id'])['customer_id']
pred = {}
gen_t0 = time.time()
for idx, cid in enumerate(ss_cust.values):
    base = []; seen = set()
    # repurchases first (strong recency bias)
    for a in cust_recent.get(cid, [])[:20]:
        if a not in seen:
            base.append(a); seen.add(a)
    # expand via neighbors
    for a in cust_recent.get(cid, [])[:10]:
        neigh = top_neighbors.get(a, [])
        for nb, sc in neigh[:30]:
            if nb not in seen:
                base.append(nb); seen.add(nb)
            if len(base) >= 150:
                break
        if len(base) >= 150:
            break
    # fill with popularity
    for a in fallback_pop:
        if len(base) >= 12: break
        if a not in seen:
            base.append(a); seen.add(a)
    if len(base) < 12 and fallback_pop:
        base += [fallback_pop[0]] * (12-len(base))
    pred[cid] = base[:12]
    if idx % 200000 == 0:
        print(f'Built preds for {idx} customers...', flush=True)
print('Candidate generation + preds done in', f'{time.time()-gen_t0:.1f}s', flush=True)

# Score MAP@12
cv = map12_score(labels, pred)
print(f'CV MAP@12 (covis buy2buy decayed + repurchase + pop): {cv:.5f} | total elapsed {time.time()-t0:.1f}s', flush=True)

Windows -> TRAIN_END: 2020-09-07, VAL: 2020-09-08..2020-09-14
Validation customers with purchases: 74575
Train slice for covis: 2020-06-29..2020-09-07  | rows=2969342
Building co-vis pairs...
Baskets: 902963 elapsed 0.2s
Pairs rows: 14812350 elapsed 0.6s
Covis pairs unique: 8308274 elapsed 0.2s
Neighbors built for aids: 37954 elapsed 20.2s
Customer histories: 450514 elapsed 12.7s
Built preds for 0 customers...
Built preds for 200000 customers...
Built preds for 400000 customers...
Built preds for 600000 customers...
Built preds for 800000 customers...
Built preds for 1000000 customers...
Built preds for 1200000 customers...
Candidate generation + preds done in 15.6s
CV MAP@12 (covis buy2buy decayed + repurchase + pop): 0.00137 | total elapsed 56.1s


In [None]:
# Co-vis scored blend (short-decayed + long-nondecayed) with ACTIVE/ALL MAP and recall diagnostics (DAILY RAW, no norm, no sym)
import polars as pl
import pandas as pd
from datetime import timedelta
import numpy as np
import time, os, pickle

def map12_all_and_active(y_true_dict, y_pred_dict):
    K=12
    # ALL
    aps_all=[]
    for cid, preds in y_pred_dict.items():
        labels = y_true_dict.get(cid, set())
        if not isinstance(labels, set): labels=set(labels)
        hit=0; prec=0.0
        for i,a in enumerate(preds[:K]):
            if a in labels:
                hit+=1; prec+=hit/(i+1)
        aps_all.append(0.0 if len(labels)==0 else (prec/min(len(labels),K)))
    all_map = float(np.mean(aps_all)) if aps_all else 0.0
    # ACTIVE
    aps_act=[]
    for cid, labels in y_true_dict.items():
        preds = y_pred_dict.get(cid, [])
        hit=0; prec=0.0
        for i,a in enumerate(preds[:K]):
            if a in labels:
                hit+=1; prec+=hit/(i+1)
        aps_act.append(prec/min(len(labels),K))
    act_map = float(np.mean(aps_act)) if aps_act else 0.0
    return all_map, act_map

def recall_at_k_active(y_true_dict, y_pred_dict, k=100):
    hits=0; total=0
    for cid, labels in y_true_dict.items():
        if not labels: continue  # skip customers without labels (check the per-customer set, not the running total)
        total += len(labels)
        preds = set(y_pred_dict.get(cid, [])[:k])
        hits += sum(1 for a in labels if a in preds)
    return hits/total if total>0 else 0.0

t0=time.time()
tx = pl.read_parquet('parquet/transactions_train.parquet', columns=['t_dat','customer_id','article_id','price'])
max_day = tx.select(pl.col('t_dat').max()).item()
VAL_END = max_day
VAL_START = VAL_END - timedelta(days=6)
TRAIN_END = VAL_START - timedelta(days=1)
print(f'Windows -> TRAIN_END: {TRAIN_END}, VAL: {VAL_START}..{VAL_END}', flush=True)

# Labels
val_df = tx.filter((pl.col('t_dat') >= pl.lit(VAL_START)) & (pl.col('t_dat') <= pl.lit(VAL_END)) & (pl.col('price') > 0))
val_pairs = val_df.select(['customer_id','article_id']).unique(maintain_order=True).to_pandas()
labels = {cid: set(g.article_id.astype(int).tolist()) for cid, g in val_pairs.groupby('customer_id')}
print('Active customers in VAL:', len(labels))

# Windows
short_start = TRAIN_END - timedelta(days=7*6)
long_start  = TRAIN_END - timedelta(days=7*12)
ultra_start = TRAIN_END - timedelta(days=7*2)
print(f'Co-vis ultra: {ultra_start}..{TRAIN_END} | short: {short_start}..{TRAIN_END} | long: {long_start}..{TRAIN_END}', flush=True)

# Popularity fallback (last 7 days of train window)
hist_start = TRAIN_END - timedelta(days=6)
hist_end = TRAIN_END
train_tx_all = tx.filter((pl.col('t_dat') >= pl.lit(long_start)) & (pl.col('t_dat') <= pl.lit(TRAIN_END)) & (pl.col('price') > 0))
pop = train_tx_all.filter((pl.col('t_dat') >= pl.lit(hist_start)) & (pl.col('t_dat') <= pl.lit(hist_end))).group_by('article_id').len().sort('len', descending=True).select('article_id')
fallback_pop = [int(a) for a in pop['article_id'].to_list()][:20000]
if len(fallback_pop) < 12:
    pop_all = train_tx_all.group_by('article_id').len().sort('len', descending=True).select('article_id')
    fallback_pop = [int(a) for a in pop_all['article_id'].to_list()][:20000]

def build_covis_daily_raw(start_date, end_date, decay=None, topk=200):
    t=time.time()
    df = tx.filter((pl.col('t_dat') >= pl.lit(start_date)) & (pl.col('t_dat') <= pl.lit(end_date)) & (pl.col('price') > 0))
    # daily baskets: unique items per customer-day, cap 30 (preserve order for stability)
    baskets = (df.select([pl.col('customer_id'), pl.col('t_dat'), pl.col('article_id').alias('aid')])
                 .group_by(['customer_id','t_dat'])
                 .agg(pl.col('aid').unique(maintain_order=True).head(30).alias('items')))
    b1 = baskets.explode('items').rename({'items':'aid_i'})
    b2 = baskets.explode('items').rename({'items':'aid_j'})
    pairs = (b1.join(b2, on=['customer_id','t_dat']).filter(pl.col('aid_i') != pl.col('aid_j')))
    # recency weight on day
    if decay is None:
        pairs = pairs.with_columns(pl.lit(1.0).alias('w'))
    else:
        days_ago = (pl.lit(end_date) - pl.col('t_dat')).dt.total_days().cast(pl.Float32)
        pairs = pairs.with_columns((decay ** (days_ago / 7.0)).alias('w'))
    covis = pairs.group_by(['aid_i','aid_j']).agg(pl.col('w').sum().alias('score'))
    covis = covis.filter(pl.col('score') > 1e-9)
    covis_pd = covis.sort(['aid_i','score'], descending=[False, True]).to_pandas()
    top = {}
    for aid_i, grp in covis_pd.groupby('aid_i', sort=False):
        g = grp.head(topk)  # rows are already sorted by score (desc) within each aid_i
        top[int(aid_i)] = dict(zip(g['aid_j'].astype(int).values, g['score'].astype(float).values))
    print(f'Built DAILY RAW covis for {len(top)} aids in {time.time()-t:.1f}s  [{start_date}..{end_date}] topk={topk}')
    return top

cache_dir='cache'; os.makedirs(cache_dir, exist_ok=True)
topk_items = 200
short_cache = os.path.join(cache_dir, f'covis_short_daily_raw_{short_start}_{TRAIN_END}_top{topk_items}.pkl')
long_cache  = os.path.join(cache_dir, f'covis_long_daily_raw_{long_start}_{TRAIN_END}_top{topk_items}.pkl')
ultra_cache = os.path.join(cache_dir, f'covis_ultra_daily_raw_{ultra_start}_{TRAIN_END}_top{topk_items}.pkl')

if os.path.exists(ultra_cache):
    with open(ultra_cache,'rb') as f: covis_ultra = pickle.load(f); print('Loaded covis_ultra from cache')
else:
    covis_ultra = build_covis_daily_raw(ultra_start, TRAIN_END, decay=0.9, topk=topk_items)
    with open(ultra_cache,'wb') as f: pickle.dump(covis_ultra, f)
if os.path.exists(short_cache):
    with open(short_cache,'rb') as f: covis_short = pickle.load(f); print('Loaded covis_short from cache')
else:
    covis_short = build_covis_daily_raw(short_start, TRAIN_END, decay=0.9, topk=topk_items)
    with open(short_cache,'wb') as f: pickle.dump(covis_short, f)
if os.path.exists(long_cache):
    with open(long_cache,'rb') as f: covis_long = pickle.load(f); print('Loaded covis_long from cache')
else:
    covis_long  = build_covis_daily_raw(long_start, TRAIN_END, decay=None, topk=topk_items)
    with open(long_cache,'wb') as f: pickle.dump(covis_long, f)

# Customer recent distinct history (last 20) from long window
t_hist=time.time()
hist_pd = (train_tx_all.select(['customer_id','article_id','t_dat'])
           .sort(['customer_id','t_dat'], descending=[False, True])
           .to_pandas())
hist_pd['article_id']=hist_pd['article_id'].astype(int)
cust_recent={}
for cid, g in hist_pd.groupby('customer_id', sort=False):
    seen=set(); lst=[]
    for aid in g['article_id'].values:
        if aid not in seen:
            lst.append(aid); seen.add(aid)
            if len(lst)>=20: break
    cust_recent[cid]=lst
print('Customer histories:', len(cust_recent), 'built in', f'{time.time()-t_hist:.1f}s', flush=True)

# Scored aggregation per customer (blend ultra/short/long + strong repeat)
w_ultra = 0.20
w_short = 0.70
w_long  = 0.10
repeat_self_base = 8.0  # multiplied by 0.8^p
repeat_mult_if_in_hist = 1.6  # multiplicative
per_seed_neigh_cap = 100
candidate_pool_cap = 400

ss_cust = pd.read_csv('sample_submission.csv', usecols=['customer_id'])['customer_id']
pred12 = {}
pred100 = {}  # for recall@100 diagnostics
t_pred=time.time()
for idx, cid in enumerate(ss_cust.values):
    hist = cust_recent.get(cid, [])
    cand_score = {}
    hist_set = set(hist)
    # seed direct repurchases with strong prior, decayed by position
    for p, h in enumerate(hist[:20]):
        w_hist = (0.8 ** p)
        cand_score[h] = cand_score.get(h, 0.0) + repeat_self_base * w_hist
    # neighbor expansion, per-seed cap
    for p, h in enumerate(hist[:20]):
        w_hist = (0.8 ** p)
        nu = covis_ultra.get(h, {})
        ns = covis_short.get(h, {})
        nl = covis_long.get(h, {})
        if nu or ns or nl:
            cnt=0
            keys = set(nu.keys()) | set(ns.keys()) | set(nl.keys())
            # precompute the blended score once per neighbor, then sort descending
            blended = {x: w_ultra*nu.get(x,0.0) + w_short*ns.get(x,0.0) + w_long*nl.get(x,0.0) for x in keys}
            for nb, sc in sorted(blended.items(), key=lambda kv: -kv[1]):
                if sc <= 0: continue
                cand_score[nb] = cand_score.get(nb, 0.0) + w_hist * sc
                cnt+=1
                if cnt>=per_seed_neigh_cap: break
    # multiplicative boost for items in recent history
    for nb in list(cand_score.keys()):
        if nb in hist_set:
            cand_score[nb] *= repeat_mult_if_in_hist
    # rank and cap candidate pools
    ranked = sorted(cand_score.items(), key=lambda x: -x[1])
    base_candidates = [nb for nb,_ in ranked[:candidate_pool_cap]]
    # fill with popularity up to 100 unique items (the duplicate padding below only triggers if fallback_pop runs out)
    seen=set(base_candidates)
    filled = list(base_candidates)
    for a in fallback_pop:
        if len(filled) >= 100: break
        if a not in seen:
            filled.append(a); seen.add(a)
    if len(filled) < 100 and fallback_pop:
        filled += [fallback_pop[0]] * (100-len(filled))
    pred100[cid] = filled[:100]
    pred12[cid] = filled[:12]
    if idx % 200000 == 0:
        print(f'Preds built for {idx} customers...', flush=True)
print('Predictions built in', f'{time.time()-t_pred:.1f}s', flush=True)

all_map, act_map = map12_all_and_active(labels, pred12)
recall100 = recall_at_k_active(labels, pred100, k=100)
recall12 = recall_at_k_active(labels, pred12, k=12)
print(f'MAP@12 ALL: {all_map:.5f} | ACTIVE: {act_map:.5f} | ACTIVE recall@100: {recall100:.3f} | ACTIVE recall@12: {recall12:.3f} | total elapsed {time.time()-t0:.1f}s', flush=True)

Windows -> TRAIN_END: 2020-09-07, VAL: 2020-09-08..2020-09-14


Active customers in VAL: 74575
Co-vis ultra: 2020-08-24..2020-09-07 | short: 2020-07-27..2020-09-07 | long: 2020-06-15..2020-09-07
