# Production plan: Time-aware, group-purged CV refit

Objective: Medal in RAOP by closing OOF-to-LB gap via robust time-aware training and simple, diverse ensembling.

Validation protocol:
- Forward-chaining, time-ordered, group-purged CV (group=requester_username).
- 4 chains by time after sorting; add a 3–5 day time gap before each validation window; no label stratification.
  - C1: train 0–60% → val 60–75%
  - C2: train 0–75% → val 75–87%
  - C3: train 0–87% → val 87–94%
  - C4: train 0–94% → val 94–100%
- Enforce no requester overlap between train and val per chain; ensure ≥50–70 positives per val window.
- Save fold indices and reuse across legs; deterministic seeds.

Data and features (strict leakage discipline):
- Text legs:
  1) Sentence-transformer embeddings (all-MiniLM-L6-v2, multi-qa-mpnet-base-dot-v1, intfloat/e5-base-v2) → XGBoost binary:logistic.
  2) TF-IDF (word+char) on title+body (+ optional subreddit TF-IDF) → LogisticRegression with C tuned on CV. Fit vectorizer within each train fold only; up-weight title if beneficial.
- Meta features (fold-safe only; no global ranks/relative_position; no future info):
  • lengths: title/body chars/words, unique word ratio; punctuation/!?/ALLCAPS rates; digit/currency/url flags; has_url, has_edit; title_to_body_len_ratio
  • calendar: month, weekday, quarter per-row; optional hour
  • user-safe at request time (computed per fold using train-only history):
    - days_since_account_creation = req_ts − account_creation_ts
    - raop_comment_ratio = requester_comments_in_raop_at_request / (requester_comments_at_request + 1)
    - raop_post_ratio = requester_posts_in_raop_at_request / (requester_posts_at_request + 1)
    - user_has_flair (binary), flair_len_chars
  • If using “days since start,” compute relative to the fold’s train min timestamp.

Modeling details:
- XGBoost (embeddings legs): tree_method=gpu_hist, objective=binary:logistic, eval_metric=auc,
  max_depth=5 (4–6 ok), eta=0.05 (0.05–0.08), subsample=0.8, colsample_bytree=0.8–0.9,
  min_child_weight=3–5, reg_lambda=2–4 (reg_alpha 0–0.5 optional), n_estimators=2000 with early_stopping_rounds=50–100;
  optionally set scale_pos_weight=neg/pos per chain.
- Logistic Regression (TF-IDF): solver=saga, penalty=L2, C∈[0.5,1,2,4], try class_weight='balanced'.
- Cache per-leg OOF and test preds to .npy; cache feature matrices/embeddings once and slice per fold.

Blending (robust, shift-aware):
- Rank space blending. Learn weights separately on last 2–3 chains; L2-normalize each weight vector and average them.
- Apply 15–20% shrink toward uniform after averaging; greedy prune legs with ~0 or harmful weights.
- Fallbacks: uniform rank-average of retained legs; backup = average of top-2 legs by forward-chain AUC.
- Optional light calibration: preds = 0.9*preds + 0.1*rank(preds); clip final preds to [0.01, 0.99].

Diagnostics:
- AV only for analysis; report AV AUC with/without time; if extreme shift persists, increase time gap to 5–7 days and increase blend shrink.
- Log per-chain AUC, elapsed time, and ensure zero requester overlap; print date ranges per split.
- Inspect XGB importances and LR coefficients for sanity.

Execution plan (milestones):
1) Env check (GPU, versions) and torch stack guard (no installation conflicts).
2) Load data; parse time; build groups; sort by time; quick sanity checks.
3) Implement fold builder: purged forward-chaining (4 chains) with 3–5 day gap; persist folds and validate positive counts.
4) Leg A: TF-IDF + LR (title+body+subs) under forward-chaining; cache OOF/test.
5) Legs B–D: ST embeddings (MiniLM, MPNet, E5) + XGB with GPU; 3-seed bag per leg; cache OOF/test.
6) Optional (time permitting): SVD+meta XGB for diversity or swap MPNet→bge-small-en-v1.5.
7) Blend via multi-chain weight learning with shrink and pruning; produce primary and fallback submissions.
8) Sanity-check submission distribution; clip and save submission.csv.

Expert review checkpoints:
- After folds implementation + smoke metrics.
- After first leg OOF/test cached.
- After all legs cached, before blending.
- After blend results, before submission.

Notes:
- Refit every leg under this exact forward-chaining, group-purged CV; do not reuse non-time-aware OOF.
- Always print progress per chain and flush; keep notebook small; cache artifacts.

In [35]:
# One-time setup: Create isolated .venv with cu121 torch and register Jupyter kernel
import sys, subprocess, os

def sh(cmd):
    print('$', cmd, flush=True)
    subprocess.run(cmd, shell=True, check=True, executable='/bin/bash')

# Create venv
sh(f"{sys.executable} -m venv .venv")
act = "source .venv/bin/activate &&"

# Upgrade basics + ipykernel
sh(f"{act} python -m pip install --upgrade pip wheel setuptools ipykernel")

# Torch cu121 stack
sh(f"{act} python -m pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1")

# NLP deps
sh(f"{act} python -m pip install --no-cache-dir transformers==4.44.2 sentence-transformers==3.0.1 accelerate==0.34.2 sentencepiece")

# Register kernel
sh(f"{act} python -m ipykernel install --user --name raop-venv-cu121 --display-name 'RAOP Venv (cu121)'")

print(">>> Switch kernel to 'RAOP Venv (cu121)', restart, then verify:", flush=True)
print("import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())", flush=True)
print("from sentence_transformers import SentenceTransformer; print('ST OK')", flush=True)

$ /usr/bin/python3.11 -m venv .venv


$ source .venv/bin/activate && python -m pip install --upgrade pip wheel setuptools ipykernel


Collecting pip
  Downloading pip-25.2-py3-none-any.whl (1.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 52.9 MB/s eta 0:00:00
Collecting wheel


  Downloading wheel-0.45.1-py3-none-any.whl (72 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.5/72.5 KB 399.4 MB/s eta 0:00:00
Collecting setuptools
  Downloading setuptools-80.9.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 264.4 MB/s eta 0:00:00


Collecting ipykernel
  Downloading ipykernel-6.30.1-py3-none-any.whl (117 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 117.5/117.5 KB 485.9 MB/s eta 0:00:00
Collecting debugpy>=1.6.5
  Downloading debugpy-1.8.17-cp311-cp311-manylinux_2_34_x86_64.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 243.4 MB/s eta 0:00:00


Collecting comm>=0.1.1
  Downloading comm-0.2.3-py3-none-any.whl (7.3 kB)
Collecting jupyter-client>=8.0.0
  Downloading jupyter_client-8.6.3-py3-none-any.whl (106 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 106.1/106.1 KB 466.1 MB/s eta 0:00:00


Collecting pyzmq>=25
  Downloading pyzmq-27.1.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (857 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 857.0/857.0 KB 389.5 MB/s eta 0:00:00
Collecting psutil>=5.7
  Downloading psutil-7.1.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291.2/291.2 KB 507.2 MB/s eta 0:00:00
Collecting matplotlib-inline>=0.1
  Downloading matplotlib_inline-0.1.7-py3-none-any.whl (9.9 kB)
Collecting tornado>=6.2
  Downloading tornado-6.5.2-cp39-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (443 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 443.9/443.9 KB 487.5 MB/s eta 0:00:00


Collecting packaging>=22
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 409.7 MB/s eta 0:00:00
Collecting traitlets>=5.4.0
  Downloading traitlets-5.14.3-py3-none-any.whl (85 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.4/85.4 KB 456.6 MB/s eta 0:00:00
Collecting jupyter-core!=5.0.*,>=4.12
  Downloading jupyter_core-5.8.1-py3-none-any.whl (28 kB)
Collecting nest-asyncio>=1.4
  Downloading nest_asyncio-1.6.0-py3-none-any.whl (5.2 kB)
Collecting ipython>=7.23.1
  Downloading ipython-9.5.0-py3-none-any.whl (612 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 612.4/612.4 KB 297.0 MB/s eta 0:00:00
Collecting typing_extensions>=4.6
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 328.5 MB/s eta 0:00:00


Collecting decorator
  Downloading decorator-5.2.1-py3-none-any.whl (9.2 kB)
Collecting jedi>=0.16
  Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 532.5 MB/s eta 0:00:00
Collecting prompt_toolkit<3.1.0,>=3.0.41
  Downloading prompt_toolkit-3.0.52-py3-none-any.whl (391 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 391.4/391.4 KB 524.0 MB/s eta 0:00:00
Collecting pexpect>4.3
  Downloading pexpect-4.9.0-py2.py3-none-any.whl (63 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.8/63.8 KB 443.2 MB/s eta 0:00:00
Collecting ipython-pygments-lexers
  Downloading ipython_pygments_lexers-1.1.1-py3-none-any.whl (8.1 kB)
Collecting stack_data
  Downloading stack_data-0.6.3-py3-none-any.whl (24 kB)
Collecting pygments>=2.4.0
  Downloading pygments-2.19.2-py3-none-any.whl (1.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 477.1 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.2
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 229.9/229.9 KB 482.7 MB/s eta 0:00:00
Collecting platformdirs>=2.5
  Downloading platformdirs-4.4.0-py3-none-any.whl (18 kB)
Collecting parso<0.9.0,>=0.8.4
  Downloading parso-0.8.5-py2.py3-none-any.whl (106 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 106.7/106.7 KB 445.1 MB/s eta 0:00:00
Collecting ptyprocess>=0.5
  Downloading ptyprocess-0.7.0-py2.py3-none-any.whl (13 kB)
Collecting wcwidth
  Downloading wcwidth-0.2.14-py2.py3-none-any.whl (37 kB)


Collecting six>=1.5
  Downloading six-1.17.0-py2.py3-none-any.whl (11 kB)
Collecting asttokens>=2.1.0
  Downloading asttokens-3.0.0-py3-none-any.whl (26 kB)
Collecting pure-eval
  Downloading pure_eval-0.2.3-py3-none-any.whl (11 kB)
Collecting executing>=1.2.0
  Downloading executing-2.2.1-py2.py3-none-any.whl (28 kB)


Installing collected packages: pure-eval, ptyprocess, wheel, wcwidth, typing_extensions, traitlets, tornado, six, setuptools, pyzmq, pygments, psutil, platformdirs, pip, pexpect, parso, packaging, nest-asyncio, executing, decorator, debugpy, comm, asttokens, stack_data, python-dateutil, prompt_toolkit, matplotlib-inline, jupyter-core, jedi, ipython-pygments-lexers, jupyter-client, ipython, ipykernel


  Attempting uninstall: setuptools
    Found existing installation: setuptools 59.6.0
    Uninstalling setuptools-59.6.0:
      Successfully uninstalled setuptools-59.6.0


  Attempting uninstall: pip
    Found existing installation: pip 22.0.2


    Uninstalling pip-22.0.2:
      Successfully uninstalled pip-22.0.2


Successfully installed asttokens-3.0.0 comm-0.2.3 debugpy-1.8.17 decorator-5.2.1 executing-2.2.1 ipykernel-6.30.1 ipython-9.5.0 ipython-pygments-lexers-1.1.1 jedi-0.19.2 jupyter-client-8.6.3 jupyter-core-5.8.1 matplotlib-inline-0.1.7 nest-asyncio-1.6.0 packaging-25.0 parso-0.8.5 pexpect-4.9.0 pip-25.2 platformdirs-4.4.0 prompt_toolkit-3.0.52 psutil-7.1.0 ptyprocess-0.7.0 pure-eval-0.2.3 pygments-2.19.2 python-dateutil-2.9.0.post0 pyzmq-27.1.0 setuptools-80.9.0 six-1.17.0 stack_data-0.6.3 tornado-6.5.2 traitlets-5.14.3 typing_extensions-4.15.0 wcwidth-0.2.14 wheel-0.45.1
$ source .venv/bin/activate && python -m pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1


Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/799.0 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K     [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.5/799.0 MB[0m [31m251.6 MB/s[0m eta [36m0:00:03[0m

[2K     [91m━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m102.2/799.0 MB[0m [31m256.8 MB/s[0m eta [36m0:00:03[0m

[2K     [91m━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.4/799.0 MB[0m [31m257.8 MB/s[0m eta [36m0:00:03[0m

[2K     [91m━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m204.5/799.0 MB[0m [31m255.4 MB/s[0m eta [36m0:00:03[0m

[2K     [91m━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m256.6/799.0 MB[0m [31m256.1 MB/s[0m eta [36m0:00:03[0m

[2K     [91m━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m309.3/799.0 MB[0m [31m258.1 MB/s[0m eta [36m0:00:02[0m

[2K     [91m━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m364.1/799.0 MB[0m [31m260.3 MB/s[0m eta [36m0:00:02[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━[0m [32m418.1/799.0 MB[0m [31m261.6 MB/s[0m eta [36m0:00:02[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━[0m [32m469.0/799.0 MB[0m [31m262.4 MB/s[0m eta [36m0:00:02[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m522.7/799.0 MB[0m [31m264.9 MB/s[0m eta [36m0:00:02[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━[0m [32m575.1/799.0 MB[0m [31m263.8 MB/s[0m eta [36m0:00:01[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━[0m [32m623.9/799.0 MB[0m [31m258.5 MB/s[0m eta [36m0:00:01[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━[0m [32m672.1/799.0 MB[0m [31m253.8 MB/s[0m eta [36m0:00:01[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m723.3/799.0 MB[0m [31m254.2 MB/s[0m eta [36m0:00:01[0m

[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━[0m [32m773.3/799.0 MB[0m [31m249.2 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m799.0/799.0 MB[0m [31m249.2 MB/s[0m  [33m0:00:03[0m
[?25h

Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/7.1 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m179.9 MB/s[0m  [33m0:00:00[0m
[?25h

Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/3.4 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.4/3.4 MB[0m [31m316.8 MB/s[0m  [33m0:00:00[0m
[?25h

Collecting filelock (from torch==2.4.1)
  Downloading filelock-3.19.1-py3-none-any.whl.metadata (2.1 kB)
Collecting sympy (from torch==2.4.1)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)


Collecting networkx (from torch==2.4.1)
  Downloading networkx-3.5-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch==2.4.1)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)


Collecting fsspec (from torch==2.4.1)
  Downloading fsspec-2025.9.0-py3-none-any.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.4.1)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)


Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.4.1)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.4.1)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.4.1)


  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.4.1)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)


Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.4.1)
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch==2.4.1)
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.4.1)


  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)


Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.4.1)
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-nccl-cu12==2.20.5 (from torch==2.4.1)
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.1.105 (from torch==2.4.1)
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)


Collecting triton==3.0.0 (from torch==2.4.1)
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)


Collecting numpy (from torchvision==0.19.1)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)


Collecting pillow!=8.3.*,>=5.3.0 (from torchvision==0.19.1)
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (9.0 kB)
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.4.1)
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)


Collecting MarkupSafe>=2.0 (from jinja2->torch==2.4.1)
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (2.7 kB)


Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.4.1)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/410.6 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m34.6/410.6 MB[0m [31m174.8 MB/s[0m eta [36m0:00:03[0m

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/410.6 MB[0m [31m214.4 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.7/410.6 MB[0m [31m231.4 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m191.4/410.6 MB[0m [31m238.0 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━[0m [32m230.7/410.6 MB[0m [31m231.1 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m274.7/410.6 MB[0m [31m223.6 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━[0m [32m312.0/410.6 MB[0m [31m228.0 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━[0m [32m353.4/410.6 MB[0m [31m223.7 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━[0m [32m391.1/410.6 MB[0m [31m210.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m410.6/410.6 MB[0m [31m198.9 MB/s[0m  [33m0:00:01[0m
[?25hDownloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/14.1 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.1/14.1 MB[0m [31m149.5 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/23.7 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.7/23.7 MB[0m [31m301.5 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/823.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m823.6/823.6 kB[0m [31m735.0 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/664.8 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.1/664.8 MB[0m [31m309.9 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m121.6/664.8 MB[0m [31m304.2 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.2/664.8 MB[0m [31m294.7 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m225.4/664.8 MB[0m [31m280.8 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━[0m [32m262.1/664.8 MB[0m [31m265.6 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━[0m [32m303.8/664.8 MB[0m [31m244.9 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━[0m [32m362.8/664.8 MB[0m [31m242.7 MB/s[0m eta [36m0:00:02[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━[0m [32m423.6/664.8 MB[0m [31m245.7 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m477.1/664.8 MB[0m [31m258.6 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━[0m [32m538.2/664.8 MB[0m [31m291.7 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m598.2/664.8 MB[0m [31m294.0 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━[0m [32m638.1/664.8 MB[0m [31m273.6 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m267.7 MB/s[0m  [33m0:00:02[0m
[?25hDownloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/121.6 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m57.9/121.6 MB[0m [31m289.1 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━[0m [32m115.3/121.6 MB[0m [31m288.7 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m121.6/121.6 MB[0m [31m260.9 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/56.5 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━[0m [32m52.4/56.5 MB[0m [31m262.3 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.5/56.5 MB[0m [31m262.2 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/124.2 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m63.2/124.2 MB[0m [31m315.1 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m124.2/124.2 MB[0m [31m323.3 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/196.0 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m37.2/196.0 MB[0m [31m185.8 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m98.6/196.0 MB[0m [31m253.8 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━[0m [32m157.8/196.0 MB[0m [31m261.8 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m196.0/196.0 MB[0m [31m258.6 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/176.2 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.0/176.2 MB[0m [31m268.7 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━[0m [32m102.0/176.2 MB[0m [31m253.9 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━[0m [32m166.7/176.2 MB[0m [31m276.9 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m176.2/176.2 MB[0m [31m276.0 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/209.4 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.7/209.4 MB[0m [31m289.3 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━[0m [32m117.4/209.4 MB[0m [31m292.9 MB/s[0m eta [36m0:00:01[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━[0m [32m173.5/209.4 MB[0m [31m287.9 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m209.4/209.4 MB[0m [31m287.6 MB/s[0m  [33m0:00:00[0m
[?25hDownloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/6.6 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.6/6.6 MB[0m [31m277.8 MB/s[0m  [33m0:00:00[0m
[?25hDownloading filelock-3.19.1-py3-none-any.whl (15 kB)
Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)
Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.0 MB[0m [

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.3/18.3 MB[0m [31m226.2 MB/s[0m  [33m0:00:00[0m
[?25hDownloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/39.7 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m39.7/39.7 MB[0m [31m211.4 MB/s[0m  [33m0:00:00[0m
[?25hDownloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/6.3 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m290.2 MB/s[0m  [33m0:00:00[0m
[?25hDownloading mpmath-1.3.0-py3-none-any.whl (536 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/536.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.2/536.2 kB[0m [31m925.5 MB/s[0m  [33m0:00:00[0m
[?25h

Installing collected packages: mpmath, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio
[?25l[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/25[0m [mpmath]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy]

[2K   [91m━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/25[0m [sympy][2K   [91m━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 2/25[0m [pillow]

[2K   [91m━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/25[0m [nvidia-nvjitlink-cu12][2K   [91m━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/25[0m [nvidia-nvjitlink-cu12]

[2K   [91m━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/25[0m [nvidia-nvjitlink-cu12][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 5/25[0m [nvidia-nccl-cu12]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 5/25[0m [nvidia-nccl-cu12][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 5/25[0m [nvidia-nccl-cu12]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 5/25[0m [nvidia-nccl-cu12][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 5/25[0m [nvidia-nccl-cu12]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 5/25[0m [nvidia-nccl-cu12][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 5/25[0m [nvidia-nccl-cu12]

[2K   [91m━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 6/25[0m [nvidia-curand-cu12][2K   [91m━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 6/25[0m [nvidia-curand-cu12]

[2K   [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 7/25[0m [nvidia-cufft-cu12][2K   [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 7/25[0m [nvidia-cufft-cu12]

[2K   [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 7/25[0m [nvidia-cufft-cu12][2K   [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 7/25[0m [nvidia-cufft-cu12]

[2K   [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 7/25[0m [nvidia-cufft-cu12][2K   [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 7/25[0m [nvidia-cufft-cu12]

[2K   [91m━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 9/25[0m [nvidia-cuda-nvrtc-cu12][2K   [91m━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 9/25[0m [nvidia-cuda-nvrtc-cu12]

[2K   [91m━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10/25[0m [nvidia-cuda-cupti-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12][2K   [91m━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━[0m [32m11/25[0m [nvidia-cublas-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m12/25[0m [numpy][2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m12/25[0m [numpy]

[2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m12/25[0m [numpy][2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m12/25[0m [numpy]

[2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m12/25[0m [numpy][2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m12/25[0m [numpy]

[2K   [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m12/25[0m [numpy][2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m13/25[0m [networkx]

[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m13/25[0m [networkx][2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m13/25[0m [networkx]

[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m13/25[0m [networkx][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━[0m [32m15/25[0m [fsspec]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m17/25[0m [triton]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━[0m [32m18/25[0m [nvidia-cusparse-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m19/25[0m [nvidia-cudnn-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━[0m [32m21/25[0m [nvidia-cusolver-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━[0m [32m21/25[0m [nvidia-cusolver-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━[0m [32m21/25[0m [nvidia-cusolver-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━[0m [32m21/25[0m [nvidia-cusolver-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━[0m [32m21/25[0m [nvidia-cusolver-cu12]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━[0m [32m21/25[0m [nvidia-cusolver-cu12][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━[0m [32m22/25[0m [torch]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━[0m [32m23/25[0m [torchvision][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━[0m [32m24/25[0m [torchaudio]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m25/25[0m [torchaudio]
[?25h[1A[2KSuccessfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0


$ source .venv/bin/activate && python -m pip install --no-cache-dir transformers==4.44.2 sentence-transformers==3.0.1 accelerate==0.34.2 sentencepiece


Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting sentence-transformers==3.0.1
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting accelerate==0.34.2
  Downloading accelerate-0.34.2-py3-none-any.whl.metadata (19 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (10 kB)


Collecting huggingface-hub<1.0,>=0.23.2 (from transformers==4.44.2)
  Downloading huggingface_hub-0.35.1-py3-none-any.whl.metadata (14 kB)
Collecting pyyaml>=5.1 (from transformers==4.44.2)
  Downloading pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (2.4 kB)


Collecting regex!=2019.12.17 (from transformers==4.44.2)
  Downloading regex-2025.9.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
Collecting requests (from transformers==4.44.2)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting safetensors>=0.4.1 (from transformers==4.44.2)
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)


Collecting tokenizers<0.20,>=0.19 (from transformers==4.44.2)
  Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting tqdm>=4.27 (from transformers==4.44.2)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting scikit-learn (from sentence-transformers==3.0.1)
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting scipy (from sentence-transformers==3.0.1)
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (62 kB)


Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub<1.0,>=0.23.2->transformers==4.44.2)
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Collecting charset_normalizer<4,>=2 (from requests->transformers==4.44.2)
  Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (36 kB)
Collecting idna<4,>=2.5 (from requests->transformers==4.44.2)
  Downloading idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting urllib3<3,>=1.21.1 (from requests->transformers==4.44.2)
  Downloading urllib3-2.5.0-py3-none-any.whl.metadata (6.5 kB)


Collecting certifi>=2017.4.17 (from requests->transformers==4.44.2)
  Downloading certifi-2025.8.3-py3-none-any.whl.metadata (2.4 kB)
Collecting joblib>=1.2.0 (from scikit-learn->sentence-transformers==3.0.1)
  Downloading joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn->sentence-transformers==3.0.1)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/9.5 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.5/9.5 MB[0m [31m125.2 MB/s[0m  [33m0:00:00[0m
[?25hDownloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Downloading accelerate-0.34.2-py3-none-any.whl (324 kB)
Downloading huggingface_hub-0.35.1-py3-none-any.whl (563 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/563.3 kB[0m [31m?[0m eta [36m-:--:

Downloading requests-2.32.5-py3-none-any.whl (64 kB)
Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (150 kB)
Downloading idna-3.10-py3-none-any.whl (70 kB)
Downloading urllib3-2.5.0-py3-none-any.whl (129 kB)
Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/9.7 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.7/9.7 MB[0m [31m162.2 MB/s[0m  [33m0:00:00[0m
[?25hDownloading joblib-1.5.2-py3-none-any.whl (308 kB)
Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/35.9 MB[0m [31m?[0m eta [36m-:--:--[0m

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━[0m [32m34.1/35.9 MB[0m [31m169.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m35.9/35.9 MB[0m [31m173.8 MB/s[0m  [33m0:00:00[0m
[?25hDownloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)


Installing collected packages: urllib3, tqdm, threadpoolctl, sentencepiece, scipy, safetensors, regex, pyyaml, joblib, idna, hf-xet, charset_normalizer, certifi, scikit-learn, requests, huggingface-hub, tokenizers, accelerate, transformers, sentence-transformers
[?25l[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3]

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 0/20[0m [urllib3][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm]

[2K   [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 1/20[0m [tqdm][2K   [91m━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 2/20[0m [threadpoolctl]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy]

[2K   [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 4/20[0m [scipy][2K   [91m━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 6/20[0m [regex]

[2K   [91m━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━[0m [32m 8/20[0m [joblib][2K   [91m━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━[0m [32m11/20[0m [charset_normalizer]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m13/20[0m [scikit-learn][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m13/20[0m [scikit-learn]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m13/20[0m [scikit-learn][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m13/20[0m [scikit-learn]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m13/20[0m [scikit-learn][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m13/20[0m [scikit-learn]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m13/20[0m [scikit-learn][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m15/20[0m [huggingface-hub]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━[0m [32m16/20[0m [tokenizers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━[0m [32m18/20[0m [transformers][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20/20[0m [sentence-transformers]
[?25h[1A[2KSuccessfully installed accelerate-0.34.2 certifi-2025.8.3 charset_normalizer-3.4.3 hf-xet-1.1.10 huggingface-hub-0.35.1 idna-3.10 joblib-1.5.2 pyyaml-6.0.3 regex-2025.9.18 requests-2.32.5 safetensors-0.6.2 scikit-learn-1.7.2 scipy-1.16.2 sentence-transformers-3.0.1 sentencepiece-0.2.1 threadpoolctl-3.6.0 tokenizers-0.19.1 tqdm-4.67.1 transformers-4.44.2 urllib3-2.5.0


$ source .venv/bin/activate && python -m ipykernel install --user --name raop-venv-cu121 --display-name 'RAOP Venv (cu121)'


Installed kernelspec raop-venv-cu121 in /app/.local/share/jupyter/kernels/raop-venv-cu121
>>> Switch kernel to 'RAOP Venv (cu121)', restart, then verify:


import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())


from sentence_transformers import SentenceTransformer; print('ST OK')


In [46]:
# Helper: Generate and cache E5 embeddings using the venv interpreter (works even if current kernel is base)
import os, sys, subprocess, textwrap
from pathlib import Path

script_path = Path('e5_gen.py')
script_code = textwrap.dedent('''
import os, sys
from pathlib import Path
import numpy as np, pandas as pd
import torch
from sentence_transformers import SentenceTransformer

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            import json
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def build_text_cols(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates)
    bcol = first_col(body_col_candidates)
    t = df[tcol].fillna('') if tcol else pd.Series(['' for _ in range(len(df))])
    b = df[bcol].fillna('') if bcol else pd.Series(['' for _ in range(len(df))])
    return t.astype(str), b.astype(str)

def main():
    cache_dir = Path('emb_cache'); cache_dir.mkdir(exist_ok=True)
    tr_path = cache_dir / 'emb_e5_train.npy'
    te_path = cache_dir / 'emb_e5_test.npy'
    if tr_path.exists() and te_path.exists():
        arr_tr = np.load(tr_path); arr_te = np.load(te_path)
        print('Embeddings already exist:', arr_tr.shape, arr_te.shape)
        return
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    time_col = 'unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    t_title, t_body = build_text_cols(tr)
    te_title, te_body = build_text_cols(te)
    tr_texts = ('passage: ' + (t_title + ' \\n ' + t_body)).tolist()
    te_texts = ('passage: ' + (te_title + ' \\n ' + te_body)).tolist()
    model_name = 'intfloat/e5-base-v2'
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print('Loading model:', model_name, 'on', device)
    model = SentenceTransformer(model_name, device=device)
    model.max_seq_length = 512
    def embed(texts, batch_size=128):
        return np.asarray(model.encode(texts, batch_size=batch_size, normalize_embeddings=True, show_progress_bar=True), dtype=np.float32)
    print('Encoding train ...'); emb_tr = embed(tr_texts)
    print('Encoding test ...'); emb_te = embed(te_texts)
    np.save(tr_path, emb_tr); np.save(te_path, emb_te)
    print('Saved:', tr_path, te_path, '| shapes:', emb_tr.shape, emb_te.shape)

if __name__ == '__main__':
    main()
''')
script_path.write_text(script_code)
print('Wrote helper script:', script_path)

# Execute with venv python to ensure proper torch/ST stack
venv_py = Path('.venv/bin/python')
assert venv_py.exists(), 'Venv python not found; run the venv setup cell first'

# Ensure pandas/numpy are installed in the venv (required by e5_gen.py)
print('Ensuring pandas/numpy in venv ...')
subprocess.run([str(venv_py), '-m', 'pip', 'install', '--upgrade', 'pip', 'wheel', 'setuptools'], check=True)
subprocess.run([str(venv_py), '-m', 'pip', 'install', 'pandas', 'numpy'], check=True)

print('Running embeddings generation via', venv_py)
proc = subprocess.run([str(venv_py), str(script_path)], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
print(proc.stdout)
if proc.returncode != 0:
    raise RuntimeError('Embedding generation failed; check logs above')
print('E5 embeddings generation complete. You can now run Cell 10 to train XGB on cached embeddings.')

Wrote helper script: e5_gen.py
Ensuring pandas/numpy in venv ...




Collecting pandas
  Downloading pandas-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/12.4 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.4/12.4 MB[0m [31m252.1 MB/s[0m  [33m0:00:00[0m
[?25hDownloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)


Installing collected packages: pytz, tzdata, pandas
[?25l[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas]

[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━[0m [32m2/3[0m [pandas][2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3/3[0m [pandas]
[?25h[1A[2KSuccessfully installed pandas-2.3.2 pytz-2025.2 tzdata-2025.2
Running embeddings generation via .venv/bin/python


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.

0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loading model: intfloat/e5-base-v2 on cuda
Encoding train ...

Batches:   0%|          | 0/23 [00:00<?, ?it/s]
Batches:   4%|▍         | 1/23 [00:01<00:31,  1.43s/it]
Batches:   9%|▊         | 2/23 [00:02<00:22,  1.09s/it]
Batches:  13%|█▎        | 3/23 [00:02<00:18,  1.10it/s]
Batches:  17%|█▋        | 4/23 [00:03<00:15,  1.23it/s]
Batches:  22%|██▏       | 5/23 [00:04<00:13,  1.36it/s]
Batches:  26%|██▌       | 6/23 [00:04<00:10,  1.58it/s]
Batches:  30%|███       | 7/23 [00:05<00:09,  1.77it/s]
Batches:  35%|███▍      | 8/23 [00:05<00:07,  1.98it/s]
Batches:  39%|███▉      | 9/23 [00:05<00:06,  2.19it/s]
Batches:  43%|████▎     | 10/23 [00:06<00:05,  2.42it/s]
Batches:  48%|████▊     | 11/23 [00:06<00:04,  2.69it/s]


In [25]:
# Torch repair v2: install cu121 stack into local vendor dir (read-only system fallback), guard sys.path
import os, sys, site, shutil, subprocess, glob
from pathlib import Path

VENDOR = Path('vendor_pkgs'); VENDOR.mkdir(exist_ok=True)

def sh(args):
    print('$', ' '.join(args), flush=True)
    subprocess.run(args, check=True)

# 0) Disable user-site/shadowing for this process
for k in ['PIP_TARGET','PYTHONPATH','PYTHONUSERBASE','PIP_USER']:
    os.environ.pop(k, None)
os.environ['PYTHONNOUSERSITE'] = '1'

# 1) Remove pip-target dirs and *.pth that inject them
for d in [os.path.expanduser('~/.pip-target'), '/app/.pip-target']:
    if os.path.exists(d):
        print('Removing', d); shutil.rmtree(d, ignore_errors=True)
for sp in set(site.getsitepackages() + [site.getusersitepackages()]):
    if isinstance(sp, str) and os.path.isdir(sp):
        for pth in glob.glob(os.path.join(sp, '*.pth')):
            try:
                txt = open(pth, 'r', encoding='utf-8', errors='ignore').read()
                if 'pip-target' in txt: print('Removing pth injector:', pth); os.remove(pth)
            except Exception: pass

# 2) Purge any torch/transformer remnants inside vendor (clean slate)
for pat in ['torch*','torchvision*','torchaudio*','sentence_transformers*','sentence-transformers*','transformers*','accelerate*','tokenizers*','safetensors*','nvidia_*','triton*']:
    for p in VENDOR.glob(pat):
        if p.is_dir(): print('Removing dir', p); shutil.rmtree(p, ignore_errors=True)
        elif p.is_file(): print('Removing file', p); p.unlink(missing_ok=True)

# 3) Install exact CUDA 12.1 torch stack into vendor (writeable) and then NLP deps with constraints
constraints = Path('constraints.txt'); constraints.write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')
try:
    sh([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--target', str(VENDOR),
        '--index-url', 'https://download.pytorch.org/whl/cu121', '--extra-index-url', 'https://pypi.org/simple',
        'torch==2.4.1', 'torchvision==0.19.1', 'torchaudio==2.4.1'])
except subprocess.CalledProcessError as e:
    print('Torch stack install failed:', e)
sh([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--target', str(VENDOR), '-c', str(constraints),
    'sentence-transformers==3.0.1', 'transformers==4.44.2', 'accelerate==0.34.2', 'sentencepiece'])

# 4) Runtime path guard: remove any pip-target paths; prepend vendor to sys.path
sys.path = [p for p in sys.path if 'pip-target' not in p]
if str(VENDOR) not in sys.path:
    sys.path.insert(0, str(VENDOR))

# 5) Sanity tests
try:
    import torch
    print('torch:', getattr(torch, '__version__', None), '| cuda:', getattr(torch.version, 'cuda', None), '| cuda_available:', torch.cuda.is_available())
    if torch.cuda.is_available():
        print('gpu:', torch.cuda.get_device_name(0))
    from sentence_transformers import SentenceTransformer
    print('sentence-transformers import OK')
except Exception as e:
    print('Sanity check failed:', repr(e))

print('Vendor path used:', VENDOR.resolve())
print('>>> If torch.cuda.is_available() is False, still proceed with embedding CPU fallback temporarily, but prefer GPU. <<<')

Removing /app/.pip-target
Removing /app/.pip-target
$ /usr/bin/python3.11 -m pip install --no-cache-dir --target vendor_pkgs --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1


Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 565.3 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 524.2 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 513.2 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 126.7 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 217.3 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 124.4 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 173.5 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 165.2 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 272.5 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 432.2 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 120.3 MB/s eta 0:00:00


Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 194.0 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 123.5 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 168.2 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 388.1 MB/s eta 0:00:00


Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 340.4 MB/s eta 0:00:00


Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 438.5 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 467.9 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 283.3 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 486.3 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 163.6 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 101.8 MB/s eta 0:00:00
Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 298.0 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)


Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 379.4 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


$ /usr/bin/python3.11 -m pip install --no-cache-dir --target vendor_pkgs -c constraints.txt sentence-transformers==3.0.1 transformers==4.44.2 accelerate==0.34.2 sentencepiece


Collecting sentence-transformers==3.0.1
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.1/227.1 KB 8.7 MB/s eta 0:00:00
Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.5/9.5 MB 69.9 MB/s eta 0:00:00
Collecting accelerate==0.34.2
  Downloading accelerate-0.34.2-py3-none-any.whl (324 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 324.4/324.4 KB 443.9 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 368.2 MB/s eta 0:00:00


Collecting scipy
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 220.5 MB/s eta 0:00:00


Collecting torch>=1.11.0
  Downloading torch-2.4.1-cp311-cp311-manylinux1_x86_64.whl (797.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.1/797.1 MB 106.9 MB/s eta 0:00:00


Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 247.1 MB/s eta 0:00:00
Collecting huggingface-hub>=0.15.1
  Downloading huggingface_hub-0.35.1-py3-none-any.whl (563 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 563.3/563.3 KB 510.7 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 210.3 MB/s eta 0:00:00


Collecting Pillow
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 173.3 MB/s eta 0:00:00
Collecting tqdm
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 KB 419.4 MB/s eta 0:00:00


Collecting safetensors>=0.4.1
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (485 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.8/485.8 KB 523.2 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (806 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 806.6/806.6 KB 504.3 MB/s eta 0:00:00


Collecting regex!=2019.12.17
  Downloading regex-2025.9.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 KB 521.4 MB/s eta 0:00:00


Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 193.9 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting packaging>=20.0
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 373.9 MB/s eta 0:00:00
Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.7/64.7 KB 423.7 MB/s eta 0:00:00


Collecting psutil
  Downloading psutil-7.1.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291.2/291.2 KB 486.3 MB/s eta 0:00:00
Collecting hf-xet<2.0.0,>=1.1.3
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 500.8 MB/s eta 0:00:00
Collecting fsspec>=2023.5.0
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 507.0 MB/s eta 0:00:00


Collecting typing-extensions>=3.7.4.3
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 396.6 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 198.9 MB/s eta 0:00:00
Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 196.9 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 443.8 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 190.1 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 192.9 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 202.7 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 261.3 MB/s eta 0:00:00
Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 376.7 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105


  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 203.5 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 174.0 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 202.7 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 567.4 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 397.2 MB/s eta 0:00:00
Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 216.0 MB/s eta 0:00:00
Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 500.2 MB/s eta 0:00:00
Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 263.5 MB/s eta 0:00:00


Collecting charset_normalizer<4,>=2
  Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (150 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.3/150.3 KB 493.1 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Downloading urllib3-2.5.0-py3-none-any.whl (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.8/129.8 KB 478.3 MB/s eta 0:00:00
Collecting certifi>=2017.4.17
  Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.2/161.2 KB 458.1 MB/s eta 0:00:00
Collecting idna<4,>=2.5
  Downloading idna-3.10-py3-none-any.whl (70 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 KB 413.9 MB/s eta 0:00:00
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 KB 490.2 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 502.4 MB/s eta 0:00:00


Installing collected packages: mpmath, urllib3, typing-extensions, tqdm, threadpoolctl, sympy, sentencepiece, safetensors, regex, pyyaml, psutil, Pillow, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, joblib, idna, hf-xet, fsspec, filelock, charset_normalizer, certifi, triton, scipy, requests, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, scikit-learn, nvidia-cusolver-cu12, huggingface-hub, torch, tokenizers, transformers, accelerate, sentence-transformers


Successfully installed MarkupSafe-3.0.3 Pillow-11.3.0 accelerate-0.34.2 certifi-2025.8.3 charset_normalizer-3.4.3 filelock-3.19.1 fsspec-2025.9.0 hf-xet-1.1.10 huggingface-hub-0.35.1 idna-3.10 jinja2-3.1.6 joblib-1.5.2 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 packaging-25.0 psutil-7.1.0 pyyaml-6.0.3 regex-2025.9.18 requests-2.32.5 safetensors-0.6.2 scikit-learn-1.7.2 scipy-1.16.2 sentence-transformers-3.0.1 sentencepiece-0.2.1 sympy-1.14.0 threadpoolctl-3.6.0 tokenizers-0.19.1 torch-2.4.1 tqdm-4.67.1 transformers-4.44.2 triton-3.0.0 typing-extensions-4.15.0 urllib3-2.5.0








Sanity check failed: AttributeError("module 'torch' has no attribute 'version'")
Vendor path used: /var/lib/simon/agent_run_states/random-acts-of-pizza-20250929-001459/vendor_pkgs
>>> If torch.cuda.is_available() is False, still proceed with embedding CPU fallback temporarily, but prefer GPU. <<<


In [26]:
# Torch diagnostics: identify import path and attributes
import sys
print('First 5 sys.path entries:')
for i,p in enumerate(sys.path[:5]):
    print(i, p)
try:
    import torch
    print('torch module:', torch)
    print('torch __file__:', getattr(torch, '__file__', None))
    print('has torch.__version__:', hasattr(torch, '__version__'))
    print('has torch.version:', hasattr(torch, 'version'))
    if hasattr(torch, 'version'):
        print('torch.version:', torch.version)
    print('has torch.cuda:', hasattr(torch, 'cuda'))
    if hasattr(torch, 'cuda'):
        print('torch.cuda.is_available():', getattr(torch.cuda, 'is_available', lambda: 'n/a')())
except Exception as e:
    print('Import torch failed:', repr(e))
try:
    from sentence_transformers import SentenceTransformer
    import sentence_transformers as st
    print('sentence_transformers __file__:', getattr(st, '__file__', None))
except Exception as e:
    print('Import sentence_transformers failed:', repr(e))

First 5 sys.path entries:
0 vendor_pkgs
1 /app
2 /app/src
3 /usr/lib/python311.zip
4 /usr/lib/python3.11
torch module: <module 'torch' (<_frozen_importlib_external.NamespaceLoader object at 0x7dbe59d048d0>)>
torch __file__: None
has torch.__version__: False
has torch.version: False
has torch.cuda: False
Import sentence_transformers failed: ImportError("cannot import name 'Tensor' from 'torch' (unknown location)")


In [27]:
# Torch fix v3: remove shadow namespace dirs, force-reinstall into vendor, verify CUDA
import os, sys, shutil, subprocess, glob, site
from pathlib import Path

VENDOR = Path('vendor_pkgs'); VENDOR.mkdir(exist_ok=True)

def sh(args):
    print('$', ' '.join(args), flush=True)
    subprocess.run(args, check=True)

# 1) Ensure vendor is first; drop any pip-target entries
sys.path = [p for p in sys.path if 'pip-target' not in p]
if str(VENDOR) not in sys.path:
    sys.path.insert(0, str(VENDOR))
print('sys.path[0:5]=', sys.path[:5])

# 2) Locate and remove shadow 'torch' namespace dirs not under vendor (common cause of NamespaceLoader import) 
cwd = os.getcwd()
removed = []
for p in list(sys.path):
    # Only consider writable, local paths (avoid system dirs)
    if not isinstance(p, str):
        continue
    if not (p.startswith(cwd) or p.startswith('/app')):
        continue
    tdir = os.path.join(p, 'torch')
    if os.path.isdir(tdir) and str(VENDOR) not in tdir:
        # Heuristic: remove if missing __init__.py (namespace dir) or obviously not a proper torch package
        has_init = os.path.exists(os.path.join(tdir, '__init__.py'))
        if not has_init:
            try:
                print('Removing shadow torch dir:', tdir)
                shutil.rmtree(tdir, ignore_errors=True)
                removed.append(tdir)
            except Exception as e:
                print('Failed to remove', tdir, e)
print('Removed shadow dirs:', removed)

# 3) Force reinstall GPU torch stack into vendor
constraints = Path('constraints.txt'); constraints.write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')
try:
    sh([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--target', str(VENDOR),
        '--upgrade', '--force-reinstall',
        '--index-url', 'https://download.pytorch.org/whl/cu121', '--extra-index-url', 'https://pypi.org/simple',
        'torch==2.4.1', 'torchvision==0.19.1', 'torchaudio==2.4.1'])
except subprocess.CalledProcessError as e:
    print('Torch stack reinstall error:', e)
try:
    sh([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--target', str(VENDOR), '-c', str(constraints),
        '--upgrade', '--force-reinstall',
        'sentence-transformers==3.0.1', 'transformers==4.44.2', 'accelerate==0.34.2', 'sentencepiece'])
except subprocess.CalledProcessError as e:
    print('NLP deps reinstall error:', e)

# 4) Final guard: vendor first on sys.path
sys.path = [p for p in sys.path if 'pip-target' not in p]
if sys.path[0] != str(VENDOR):
    sys.path.insert(0, str(VENDOR))
print('sys.path[0]=', sys.path[0])

# 5) Sanity import and report source files
try:
    import importlib, types
    torch = importlib.import_module('torch')
    print('torch module file:', getattr(torch, '__file__', None))
    print('torch has __version__?', hasattr(torch, '__version__'))
    print('torch has version attr?', hasattr(torch, 'version'))
    if hasattr(torch, '__version__'):
        print('torch.__version__ =', torch.__version__)
    if hasattr(torch, 'version') and hasattr(torch.version, 'cuda'):
        print('torch.version.cuda =', torch.version.cuda)
    if hasattr(torch, 'cuda') and hasattr(torch.cuda, 'is_available'):
        print('torch.cuda.is_available() =', torch.cuda.is_available())
    st = importlib.import_module('sentence_transformers')
    print('sentence_transformers file:', getattr(st, '__file__', None))
except Exception as e:
    print('Sanity import failed:', repr(e))

sys.path[0:5]= ['vendor_pkgs', '/app', '/app/src', '/usr/lib/python311.zip', '/usr/lib/python3.11']
Removed shadow dirs: []
$ /usr/bin/python3.11 -m pip install --no-cache-dir --target vendor_pkgs --upgrade --force-reinstall --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1


Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 243.0 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 513.1 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 260.0 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105


  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 222.9 MB/s eta 0:00:00


Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 236.6 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 207.3 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 194.3 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 552.1 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 231.9 MB/s eta 0:00:00


Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 364.5 MB/s eta 0:00:00
Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 469.1 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 240.0 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 468.2 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 244.9 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 505.4 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 225.1 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 488.2 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 210.4 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 239.5 MB/s eta 0:00:00


Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 256.3 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 397.8 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 177.1 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 131.2 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 502.4 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


$ /usr/bin/python3.11 -m pip install --no-cache-dir --target vendor_pkgs -c constraints.txt --upgrade --force-reinstall sentence-transformers==3.0.1 transformers==4.44.2 accelerate==0.34.2 sentencepiece


Collecting sentence-transformers==3.0.1
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.1/227.1 KB 10.2 MB/s eta 0:00:00
Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.5/9.5 MB 164.9 MB/s eta 0:00:00
Collecting accelerate==0.34.2
  Downloading accelerate-0.34.2-py3-none-any.whl (324 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 324.4/324.4 KB 483.2 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 369.6 MB/s eta 0:00:00


Collecting tqdm
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 KB 406.8 MB/s eta 0:00:00


Collecting Pillow
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 194.1 MB/s eta 0:00:00


Collecting scipy
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 232.6 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 210.7 MB/s eta 0:00:00
Collecting torch>=1.11.0
  Downloading torch-2.4.1-cp311-cp311-manylinux1_x86_64.whl (797.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.1/797.1 MB 152.8 MB/s eta 0:00:00


Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 334.1 MB/s eta 0:00:00
Collecting huggingface-hub>=0.15.1
  Downloading huggingface_hub-0.35.1-py3-none-any.whl (563 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 563.3/563.3 KB 496.7 MB/s eta 0:00:00


Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 187.1 MB/s eta 0:00:00


Collecting safetensors>=0.4.1
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (485 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.8/485.8 KB 329.4 MB/s eta 0:00:00


Collecting regex!=2019.12.17
  Downloading regex-2025.9.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 KB 255.3 MB/s eta 0:00:00
Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.7/64.7 KB 332.8 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (806 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 806.6/806.6 KB 506.3 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting packaging>=20.0
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 374.8 MB/s eta 0:00:00


Collecting psutil
  Downloading psutil-7.1.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291.2/291.2 KB 484.8 MB/s eta 0:00:00
Collecting typing-extensions>=3.7.4.3
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 325.2 MB/s eta 0:00:00
Collecting hf-xet<2.0.0,>=1.1.3
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 302.7 MB/s eta 0:00:00


Collecting fsspec>=2023.5.0
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 459.1 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 174.3 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 181.0 MB/s eta 0:00:00
Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 214.2 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 226.8 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 369.5 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 239.4 MB/s eta 0:00:00
Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 530.2 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 180.3 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 451.1 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 229.6 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 220.6 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 459.8 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 161.3 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 176.3 MB/s eta 0:00:00
Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 243.1 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 177.7 MB/s eta 0:00:00


Collecting charset_normalizer<4,>=2
  Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (150 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.3/150.3 KB 482.1 MB/s eta 0:00:00
Collecting certifi>=2017.4.17
  Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.2/161.2 KB 487.1 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Downloading urllib3-2.5.0-py3-none-any.whl (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.8/129.8 KB 437.3 MB/s eta 0:00:00
Collecting idna<4,>=2.5
  Downloading idna-3.10-py3-none-any.whl (70 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 KB 448.8 MB/s eta 0:00:00
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 KB 70.9 MB/s eta 0:00:00


Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)


Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 519.0 MB/s eta 0:00:00


Installing collected packages: mpmath, urllib3, typing-extensions, tqdm, threadpoolctl, sympy, sentencepiece, safetensors, regex, pyyaml, psutil, Pillow, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, joblib, idna, hf-xet, fsspec, filelock, charset_normalizer, certifi, triton, scipy, requests, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, scikit-learn, nvidia-cusolver-cu12, huggingface-hub, torch, tokenizers, transformers, accelerate, sentence-transformers


Successfully installed MarkupSafe-3.0.3 Pillow-11.3.0 accelerate-0.34.2 certifi-2025.8.3 charset_normalizer-3.4.3 filelock-3.19.1 fsspec-2025.9.0 hf-xet-1.1.10 huggingface-hub-0.35.1 idna-3.10 jinja2-3.1.6 joblib-1.5.2 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 packaging-25.0 psutil-7.1.0 pyyaml-6.0.3 regex-2025.9.18 requests-2.32.5 safetensors-0.6.2 scikit-learn-1.7.2 scipy-1.16.2 sentence-transformers-3.0.1 sentencepiece-0.2.1 sympy-1.14.0 threadpoolctl-3.6.0 tokenizers-0.19.1 torch-2.4.1 tqdm-4.67.1 transformers-4.44.2 triton-3.0.0 typing-extensions-4.15.0 urllib3-2.5.0


sys.path[0]= vendor_pkgs
torch module file: None
torch has __version__? False
torch has version attr? False
Sanity import failed: ImportError("cannot import name 'Tensor' from 'torch' (unknown location)")


In [28]:
# Runtime hardening: drop usersite/pip-target from sys.path, remove shadow torch, verify vendor torch import
import os, sys, site, shutil, glob
from pathlib import Path

VENDOR = Path('vendor_pkgs')
print('Pre sys.path[0:8]:', sys.path[:8])

# 1) Disable usersite for this process and strip any usersite/pip-target paths from sys.path
os.environ['PYTHONNOUSERSITE'] = '1'
user_site = site.getusersitepackages() if hasattr(site, 'getusersitepackages') else None
bad_substrings = ['.pip-target', str(user_site) if user_site else '']
sys.path = [p for p in sys.path if not any(bs and bs in str(p) for bs in bad_substrings)]
print('Post strip sys.path[0:8]:', sys.path[:8])
print('User site path:', user_site)

# 2) Physically remove ~/.pip-target to prevent accidental shadowing
pt = os.path.expanduser('~/.pip-target')
if os.path.exists(pt):
    print('Removing ~/.pip-target recursively')
    shutil.rmtree(pt, ignore_errors=True)

# 3) Ensure vendor_pkgs is first on sys.path
if str(VENDOR) not in sys.path:
    sys.path.insert(0, str(VENDOR))
else:
    # Move to front if not already
    sys.path.remove(str(VENDOR)); sys.path.insert(0, str(VENDOR))
print('Final sys.path[0:5]:', sys.path[:5])

# 4) Inspect vendor torch contents quickly
torch_dir = VENDOR / 'torch'
print('vendor torch exists:', torch_dir.exists(), '| has __init__:', (torch_dir / '__init__.py').exists())
if torch_dir.exists():
    sub = list(torch_dir.iterdir())[:10]
    print('torch top entries:', [p.name for p in sub])

# 5) Import torch and print details
try:
    import importlib
    torch = importlib.import_module('torch')
    print('torch __file__:', getattr(torch, '__file__', None))
    print('torch __version__:', getattr(torch, '__version__', None))
    has_cuda = hasattr(torch, 'cuda') and hasattr(torch.cuda, 'is_available') and torch.cuda.is_available()
    print('CUDA available:', has_cuda)
    if has_cuda:
        try:
            print('GPU:', torch.cuda.get_device_name(0))
        except Exception as e:
            print('GPU name fetch error:', repr(e))
except Exception as e:
    print('Import torch failed:', repr(e))

# 6) Try sentence_transformers import (should resolve from vendor if installed there) 
try:
    import sentence_transformers as st
    print('sentence_transformers __file__:', getattr(st, '__file__', None))
except Exception as e:
    print('Import sentence_transformers failed:', repr(e))

Pre sys.path[0:8]: ['vendor_pkgs', '/app', '/app/src', '/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '', '/usr/local/lib/python3.11/dist-packages']
Post strip sys.path[0:8]: ['vendor_pkgs', '/app', '/app/src', '/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '', '/usr/local/lib/python3.11/dist-packages']
User site path: /app/.pip-user/lib/python3.11/site-packages
Removing ~/.pip-target recursively
Final sys.path[0:5]: ['vendor_pkgs', '/app', '/app/src', '/usr/lib/python311.zip', '/usr/lib/python3.11']
vendor torch exists: True | has __init__: True
torch top entries: ['random.py', 'signal', 'distributed', 'utils', 'onnx', '_decomp', 'testing', '_C.cpython-311-x86_64-linux-gnu.so', '_dynamo', '__init__.py']
torch __file__: None
torch __version__: None
CUDA available: False
Import sentence_transformers failed: ImportError("cannot import name 'Tensor' from 'torch' (unknown location)")


In [117]:
# Env check: GPU, versions, basics
import os, sys, subprocess, time, json, platform
import numpy as np, pandas as pd
print("Python:", sys.version.split()[0], "| Platform:", platform.platform())
print("Working dir:", os.getcwd())

def run(cmd):
    print("$", " ".join(cmd), flush=True)
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

res = run(['bash','-lc','nvidia-smi || true'])
print(res.stdout)
gpu_ok = ('NVIDIA-SMI' in res.stdout) and ('Driver Version' in res.stdout)
print("GPU available:", gpu_ok)

import sklearn, xgboost
print("sklearn:", sklearn.__version__)
print("xgboost:", xgboost.__version__)

# Helper: timer context
from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time()
    print(f"[T0] {msg} ...", flush=True)
    try:
        yield
    finally:
        dt = time.time() - t0
        print(f"[T+] {msg} done in {dt:.2f}s", flush=True)

# Seeds and constants
SEEDS = [42, 1337, 2025]
# Revised chains to improve C3 positives; reduce purge gap to 3 days
CHAIN_SPLITS = [(0.0,0.60,0.80), (0.0,0.75,0.90), (0.0,0.80,1.00)]
PURGE_GAP_DAYS = 3
print("SEEDS:", SEEDS)
print("Chains:", CHAIN_SPLITS, "| Purge gap days:", PURGE_GAP_DAYS)

assert gpu_ok, "GPU not available. Per competition_best_practices.md, exit immediately to avoid wasting time."

Python: 3.11.0rc1 | Platform: Linux-6.8.0-1031-azure-x86_64-with-glibc2.35
Working dir: /var/lib/simon/agent_run_states/random-acts-of-pizza-20250929-001459
$ bash -lc nvidia-smi || true


Mon Sep 29 09:05:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     711MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [118]:
# Build purged forward-chaining, group-purged folds and persist indices
import os, json, numpy as np, pandas as pd
from pathlib import Path

with timer("Load train.json and basic parsing"):
    # Robust JSON loader: try JSON Lines first, then standard JSON array/object
    df = None
    try:
        df = pd.read_json("train.json", lines=True)
    except ValueError:
        try:
            df = pd.read_json("train.json", lines=False)
        except ValueError:
            # Final fallback: read whole file via json.load and normalize
            with open("train.json", "r", encoding="utf-8") as f:
                data = json.load(f)
            if isinstance(data, dict) and "data" in data:
                data = data["data"]
            df = pd.json_normalize(data)
    assert df is not None and len(df) > 0, "Failed to load train.json into a DataFrame"
    # Identify key columns with fallbacks
    time_col_candidates = [
        'unix_timestamp_of_request_utc', 'request_timestamp', 'created_utc', 'timestamp', 'time'
    ]
    user_col_candidates = ['requester_username', 'username', 'user']
    label_col_candidates = ['requester_received_pizza', 'label', 'target', 'y']

    def first_col(cols):
        for c in cols:
            if c in df.columns:
                return c
        raise KeyError(f"Missing expected columns; have: {list(df.columns)[:20]} ...")

    TIME_COL = first_col(time_col_candidates)
    GROUP_COL = first_col(user_col_candidates)
    LABEL_COL = first_col(label_col_candidates)
    print("Cols:", dict(time=TIME_COL, group=GROUP_COL, label=LABEL_COL))

    # Ensure numeric unix time (seconds) and datetime for readability
    ts = pd.to_numeric(df[TIME_COL], errors='coerce').astype('Int64')
    if ts.isna().any():
        # If stored as datetime string, coerce
        try:
            ts2 = pd.to_datetime(df[TIME_COL], utc=True, errors='coerce')
            ts = (ts2.view('int64') // 1_000_000_000).astype('Int64')
        except Exception:
            pass
    ts = ts.fillna(ts.dropna().median()).astype(np.int64)
    if ts.max() < 10_000_000_000:
        unix_s = ts.values
    else:
        unix_s = (ts.values // 1_000_000_000).astype(np.int64)
    df['_unix_s'] = unix_s
    df['_dt'] = pd.to_datetime(df['_unix_s'], unit='s', utc=True)
    # Clean label to 0/1
    y = df[LABEL_COL]
    if y.dtype == bool:
        y = y.astype(np.int8)
    elif y.dtype.name == 'object':
        y = y.map({True:1, False:0, 'True':1, 'False':0, 'yes':1, 'no':0}).fillna(pd.to_numeric(y, errors='coerce')).fillna(0).astype(int)
    else:
        y = pd.to_numeric(y, errors='coerce').fillna(0).astype(int)
    df['_y'] = y.clip(0,1).astype(np.int8)

with timer("Sort by time and compute chain boundaries"):
    df = df.sort_values('_unix_s', kind='mergesort').reset_index(drop=True)
    n = len(df)
    print("Rows:", n, "| Positives:", int(df['_y'].sum()))
    print("Time range:", df['_dt'].min(), "->", df['_dt'].max())
    # Pre-compute position indices for split fractions
    def frac_to_idx(f):
        return int(round(f * n))
    chains = []
    for (tr_s, tr_e, va_e) in CHAIN_SPLITS:
        tr_s_i, tr_e_i, va_e_i = frac_to_idx(tr_s), frac_to_idx(tr_e), frac_to_idx(va_e)
        va_s_i = tr_e_i  # initial val start at train end fraction
        chains.append((tr_s_i, tr_e_i, va_s_i, va_e_i))
    print("Chain idx (train_start, train_end, val_start, val_end):", chains)

with timer("Construct purged, group-purged folds"):
    purge_gap_sec = int(PURGE_GAP_DAYS * 86400)
    fold_dir = Path("folds"); fold_dir.mkdir(exist_ok=True)
    manifest = {"time_col": TIME_COL, "group_col": GROUP_COL, "label_col": LABEL_COL,
                "purge_gap_days": PURGE_GAP_DAYS, "chains": []}
    for ci, (tr_s_i, tr_e_i, va_s_i, va_e_i) in enumerate(chains, start=1):
        # Base masks by position window
        base_train_idx = np.arange(tr_s_i, tr_e_i)
        base_val_idx = np.arange(va_s_i, va_e_i)
        # Enforce purge gap: shift val start by time
        train_end_ts = int(df.iloc[tr_e_i - 1]['_unix_s']) if tr_e_i > tr_s_i else int(df.iloc[0]['_unix_s'])
        min_val_ts = train_end_ts + purge_gap_sec
        # Find first index in base_val_idx with ts >= min_val_ts
        val_ts = df.iloc[base_val_idx]['_unix_s'].values if len(base_val_idx) else np.array([])
        if len(val_ts):
            valid_mask_time = val_ts >= min_val_ts
            base_val_idx = base_val_idx[valid_mask_time]
        # Group purge: drop val rows whose group appears in train
        tr_groups = set(df.iloc[base_train_idx][GROUP_COL].astype(str).values.tolist())
        if len(base_val_idx):
            val_groups = df.iloc[base_val_idx][GROUP_COL].astype(str).values
            keep_mask = np.array([g not in tr_groups for g in val_groups], dtype=bool)
            val_idx = base_val_idx[keep_mask]
        else:
            val_idx = base_val_idx
        train_idx = base_train_idx
        # Safety: ensure non-empty and enough positives in val
        val_pos = int(df.iloc[val_idx]['_y'].sum()) if len(val_idx) else 0
        train_pos = int(df.iloc[train_idx]['_y'].sum()) if len(train_idx) else 0
        tr_start_dt = df.iloc[train_idx]['_dt'].min() if len(train_idx) else None
        tr_end_dt = df.iloc[train_idx]['_dt'].max() if len(train_idx) else None
        va_start_dt = df.iloc[val_idx]['_dt'].min() if len(val_idx) else None
        va_end_dt = df.iloc[val_idx]['_dt'].max() if len(val_idx) else None
        print(f"Chain {ci}: train {len(train_idx)} (pos {train_pos}) [{tr_start_dt} -> {tr_end_dt}] | val {len(val_idx)} (pos {val_pos}) [{va_start_dt} -> {va_end_dt}]")
        if len(val_idx) == 0 or val_pos < 50:
            print(f"Warning: Chain {ci} has low/zero positives in val ({val_pos}). Consider widening window or reducing purge gap.")
        # Persist
        np.save(fold_dir / f"fc_chain{ci}_train_idx.npy", train_idx)
        np.save(fold_dir / f"fc_chain{ci}_val_idx.npy", val_idx)
        chain_info = {
            "chain": ci,
            "train_count": int(len(train_idx)),
            "val_count": int(len(val_idx)),
            "train_pos": train_pos,
            "val_pos": val_pos,
            "train_end_dt": pd.to_datetime(train_end_ts, unit='s', utc=True).isoformat(),
            "min_val_dt": pd.to_datetime(min_val_ts, unit='s', utc=True).isoformat(),
            "val_start_dt": va_start_dt.isoformat() if va_start_dt is not None else None,
            "val_end_dt": va_end_dt.isoformat() if va_end_dt is not None else None
        }
        manifest["chains"].append(chain_info)

with timer("Save fold manifest"):
    (fold_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    print(json.dumps(manifest, indent=2)[:800] + ("..." if len(json.dumps(manifest))>800 else ""))

print("Folds saved under ./folds/. Reuse these indices across all legs.")

[T0] Load train.json and basic parsing ...


Cols: {'time': 'unix_timestamp_of_request_utc', 'group': 'requester_username', 'label': 'requester_received_pizza'}
[T+] Load train.json and basic parsing done in 0.11s


[T0] Sort by time and compute chain boundaries ...


Rows: 2878 | Positives: 715
Time range: 2011-05-23 20:29:10+00:00 -> 2013-10-09 18:51:12+00:00
Chain idx (train_start, train_end, val_start, val_end): [(0, 1727, 1727, 2302), (0, 2158, 2158, 2590), (0, 2302, 2302, 2878)]
[T+] Sort by time and compute chain boundaries done in 0.00s


[T0] Construct purged, group-purged folds ...


Chain 1: train 1727 (pos 470) [2011-05-23 20:29:10+00:00 -> 2012-10-17 20:30:50+00:00] | val 565 (pos 131) [2012-10-20 22:09:47+00:00 -> 2013-05-08 23:52:39+00:00]
Chain 2: train 2158 (pos 564) [2011-05-23 20:29:10+00:00 -> 2013-03-27 05:29:00+00:00] | val 427 (pos 103) [2013-03-30 10:03:22+00:00 -> 2013-08-05 04:54:13+00:00]
Chain 3: train 2302 (pos 603) [2011-05-23 20:29:10+00:00 -> 2013-05-08 23:52:39+00:00] | val 567 (pos 109) [2013-05-12 22:20:35+00:00 -> 2013-10-09 18:51:12+00:00]
[T+] Construct purged, group-purged folds done in 0.01s


[T0] Save fold manifest ...


{
  "time_col": "unix_timestamp_of_request_utc",
  "group_col": "requester_username",
  "label_col": "requester_received_pizza",
  "purge_gap_days": 3,
  "chains": [
    {
      "chain": 1,
      "train_count": 1727,
      "val_count": 565,
      "train_pos": 470,
      "val_pos": 131,
      "train_end_dt": "2012-10-17T20:30:50+00:00",
      "min_val_dt": "2012-10-20T20:30:50+00:00",
      "val_start_dt": "2012-10-20T22:09:47+00:00",
      "val_end_dt": "2013-05-08T23:52:39+00:00"
    },
    {
      "chain": 2,
      "train_count": 2158,
      "val_count": 427,
      "train_pos": 564,
      "val_pos": 103,
      "train_end_dt": "2013-03-27T05:29:00+00:00",
      "min_val_dt": "2013-03-30T05:29:00+00:00",
      "val_start_dt": "2013-03-30T10:03:22+00:00",
      "val_end_dt": "2013-08-05T04:...
[T+] Save fold manifest done in 0.00s


Folds saved under ./folds/. Reuse these indices across all legs.


In [36]:
# Leg A: TF-IDF (title x3 + body) + Logistic Regression under forward-chaining folds
import json, gc, time, glob
from pathlib import Path
import numpy as np, pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def build_text(df):
    title_col_candidates = ['request_title', 'title']
    # Leak-safe: never use edit-aware body text
    body_col_candidates = ['request_text', 'body', 'text']
    def first_col(cols):
        for c in cols:
            if c in df.columns:
                return c
        return None
    tcol = first_col(title_col_candidates)
    bcol = first_col(body_col_candidates)
    t = df[tcol].fillna("") if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("") if bcol else pd.Series(["" for _ in range(len(df))])
    # Up-weight title by repeating x3
    return (t + ' ' + t + ' ' + t + ' ' + b).astype(str)

with timer("Load train/test and prepare text"):
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    # Columns discovered earlier:
    time_col = 'unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]
    group_col = 'requester_username' if 'requester_username' in tr.columns else tr.columns[1]
    label_col = 'requester_received_pizza' if 'requester_received_pizza' in tr.columns else 'label'
    # Align order as in folds (sorted by time) to match saved indices
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    X_text_tr = build_text(tr)
    X_text_te = build_text(te)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values

fold_dir = Path('folds')
oof = np.zeros(len(tr), dtype=float)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_seeds = []
params = dict(C_grid=[0.5,1,2,4], class_weight='balanced', max_features_word=200000, max_features_char=300000)
print("Params:", params)

def fit_predict_fold(train_idx, val_idx, seed):
    # Word and char TF-IDF separate, then hstack
    word_vec = TfidfVectorizer(ngram_range=(1,2), analyzer='word', min_df=2, max_df=0.98,
                               max_features=params['max_features_word'],
                               dtype=np.float32)
    char_vec = TfidfVectorizer(ngram_range=(3,6), analyzer='char_wb', min_df=2,
                               max_features=params['max_features_char'],
                               dtype=np.float32)
    Xtr_word = word_vec.fit_transform(X_text_tr.iloc[train_idx])
    Xtr_char = char_vec.fit_transform(X_text_tr.iloc[train_idx])
    Xtr = sparse.hstack([Xtr_word, Xtr_char]).tocsr()
    Xva_word = word_vec.transform(X_text_tr.iloc[val_idx])
    Xva_char = char_vec.transform(X_text_tr.iloc[val_idx])
    Xva = sparse.hstack([Xva_word, Xva_char]).tocsr()
    Xte_word = word_vec.transform(X_text_te)
    Xte_char = char_vec.transform(X_text_te)
    Xte = sparse.hstack([Xte_word, Xte_char]).tocsr()
    best_auc, best_pred_va, best_pred_te = -1.0, None, None
    for C in params['C_grid']:
        clf = LogisticRegression(solver='saga', penalty='l2', C=C,
                                 class_weight=params['class_weight'],
                                 random_state=seed, max_iter=2000, n_jobs=-1, verbose=0)
        t0 = time.time()
        clf.fit(Xtr, y[train_idx])
        pva = clf.predict_proba(Xva)[:,1]
        auc = roc_auc_score(y[val_idx], pva)
        print(f"  C={C} | AUC={auc:.5f} | fit+pred {time.time()-t0:.1f}s", flush=True)
        if auc > best_auc:
            best_auc, best_pred_va = auc, pva
            best_pred_te = clf.predict_proba(Xte)[:,1]
    # Free memory
    del Xtr_word, Xtr_char, Xtr, Xva_word, Xva_char, Xva, Xte_word, Xte_char
    gc.collect()
    return best_pred_va, best_pred_te, best_auc

with timer("Train across forward-chaining folds"):
    # Use manifest.json to avoid stale chain4 files
    manifest_path = fold_dir / 'manifest.json'
    if manifest_path.exists():
        mf = json.loads(manifest_path.read_text())
        chain_ids = [c['chain'] for c in mf.get('chains', [])]
    else:
        val_files = sorted(fold_dir.glob('fc_chain*_val_idx.npy'))
        chain_ids = sorted(int(p.stem.split('chain')[1].split('_')[0]) for p in val_files)
    print("Detected chains (manifest):", chain_ids)
    for ci in chain_ids:
        tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy")
        va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
        if len(va_idx) == 0:
            print(f"Chain {ci}: empty val; skipping")
            continue
        print(f"Chain {ci}: train {len(tr_idx)} | val {len(va_idx)}", flush=True)
        seed = SEEDS[0]
        pva, pte, auc = fit_predict_fold(tr_idx, va_idx, seed)
        oof[va_idx] = pva
        val_mask[va_idx] = True
        test_preds_seeds.append(pte)
        print(f"Chain {ci}: AUC={auc:.5f}", flush=True)

with timer("Evaluate OOF and save artifacts"):
    if val_mask.any():
        oof_auc_val = roc_auc_score(y[val_mask], oof[val_mask])
        print("OOF AUC (on validation rows only):", round(oof_auc_val, 6))
    else:
        print("Warning: no validation rows in mask; OOF AUC not computed.")
    np.save('oof_lr_tfidf_fc.npy', oof)
    # Average test predictions across chains (vectorizers differ per chain; this is a smoke check only)
    if len(test_preds_seeds):
        test_pred = np.mean(np.vstack(test_preds_seeds), axis=0)
        np.save('test_lr_tfidf_fc.npy', test_pred)
        # Also write a submission for quick smoke check
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_lr_tfidf_fc.csv', index=False)
        print('Saved: oof_lr_tfidf_fc.npy, test_lr_tfidf_fc.npy, submission_lr_tfidf_fc.csv')
    else:
        print('Warning: no test preds collected.')

[T0] Load train/test and prepare text ...


[T+] Load train/test and prepare text done in 0.13s


Params: {'C_grid': [0.5, 1, 2, 4], 'class_weight': 'balanced', 'max_features_word': 200000, 'max_features_char': 300000}
[T0] Train across forward-chaining folds ...


Detected chains (manifest): [1, 2, 3]
Chain 1: train 1727 | val 562


  C=0.5 | AUC=0.59809 | fit+pred 1.3s


  C=1 | AUC=0.58855 | fit+pred 2.3s


  C=2 | AUC=0.57875 | fit+pred 3.6s


  C=4 | AUC=0.57216 | fit+pred 4.6s


Chain 1: AUC=0.59809


Chain 2: train 2302 | val 278


  C=0.5 | AUC=0.66337 | fit+pred 3.7s


  C=1 | AUC=0.66329 | fit+pred 4.5s


  C=2 | AUC=0.66042 | fit+pred 5.7s


  C=4 | AUC=0.65725 | fit+pred 6.9s


Chain 2: AUC=0.66337


Chain 3: train 2590 | val 268


  C=0.5 | AUC=0.55753 | fit+pred 3.4s


  C=1 | AUC=0.55519 | fit+pred 4.5s


  C=2 | AUC=0.55317 | fit+pred 5.9s


  C=4 | AUC=0.55317 | fit+pred 7.4s


Chain 3: AUC=0.55753


[T+] Train across forward-chaining folds done in 59.00s


[T0] Evaluate OOF and save artifacts ...


OOF AUC (on validation rows only): 0.603979
Saved: oof_lr_tfidf_fc.npy, test_lr_tfidf_fc.npy, submission_lr_tfidf_fc.csv
[T+] Evaluate OOF and save artifacts done in 0.01s


In [120]:
# Leg B: E5-base-v2 embeddings + XGBoost (GPU, xgb.train+ES), 3-seed bag, tiny grid, with fold-safe per-row meta features + robust kNN-rate (multi-k + softmax + cross-model + recency-decay + bayesian smoothing, per-chain std) [REVERT: drop label variance & gap, tighten XGB]
import os, sys, json, gc, math, shutil, subprocess, time, importlib, site
from pathlib import Path
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
import xgboost as xgb

os.environ['PYTHONNOUSERSITE'] = '1'
sys.path = [p for p in sys.path if ('vendor_pkgs' not in str(p)) and ('.pip-target' not in str(p))]

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def build_text_cols(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    def first_col(cols):
        for c in cols:
            if c in df.columns:
                return c
        return None
    tcol = first_col(title_col_candidates)
    bcol = first_col(body_col_candidates)
    t = df[tcol].fillna("") if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("") if bcol else pd.Series(["" for _ in range(len(df))])
    return t.astype(str), b.astype(str)

def compute_or_load_e5_embeddings(tr, te, cache_dir=Path("emb_cache")):
    cache_dir.mkdir(exist_ok=True)
    tr_path = cache_dir / "emb_e5_train.npy"
    te_path = cache_dir / "emb_e5_test.npy"
    if tr_path.exists() and te_path.exists():
        print("Loading cached E5 embeddings ...")
        emb_tr = np.load(tr_path)
        emb_te = np.load(te_path)
        return emb_tr, emb_te
    from sentence_transformers import SentenceTransformer
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_name = "intfloat/e5-base-v2"
    print("Loading model:", model_name, "on", device)
    model = SentenceTransformer(model_name, device=device)
    model.max_seq_length = 512
    t_title, t_body = build_text_cols(tr)
    te_title, te_body = build_text_cols(te)
    tr_texts = ("passage: " + (t_title.fillna("") + " \n " + t_body.fillna("")).astype(str)).tolist()
    te_texts = ("passage: " + (te_title.fillna("") + " \n " + te_body.fillna("")).astype(str)).tolist()
    def embed(texts, batch_size=128):
        out = model.encode(texts, batch_size=batch_size, normalize_embeddings=True, show_progress_bar=True)
        return np.asarray(out, dtype=np.float32)
    emb_tr = embed(tr_texts); emb_te = embed(te_texts)
    np.save(tr_path, emb_tr); np.save(te_path, emb_te)
    print("Saved embeddings:", tr_path, te_path)
    return emb_tr, emb_te

def per_row_meta(df):
    # Leak-safe base meta only
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    time_col_candidates = ['unix_timestamp_of_request_utc', 'request_timestamp', 'created_utc', 'timestamp', 'time']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates); bcol = first_col(body_col_candidates); time_col = first_col(time_col_candidates)
    t = df[tcol].fillna("").astype(str) if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("").astype(str) if bcol else pd.Series(["" for _ in range(len(df))])
    def wc(s): return s.str.split().apply(len).astype(np.int32)
    def cc(s): return s.str.len().astype(np.int32)
    title_wc = wc(t); body_wc = wc(b); title_cc = cc(t); body_cc = cc(b)
    ratio_tb = (title_cc / (body_cc + 1)).astype(np.float32)
    uniq_ratio = (b.str.lower().str.split().apply(lambda x: len(set(x)) / (len(x)+1e-6))).astype(np.float32)
    exclam = b.str.count('!').astype(np.int32); quest = b.str.count('\?').astype(np.int32)
    allcaps = b.apply(lambda s: sum(1 for w in s.split() if len(w)>=3 and w.isupper())).astype(np.int32)
    has_url = (b.str.contains('http', case=False, na=False) | b.str.contains('www\.', case=False, na=False)).astype(np.int8)
    has_dollar = b.str.contains('\u0024|\$', case=False, na=False).astype(np.int8)
    has_digit = b.str.contains('[0-9]', regex=True, na=False).astype(np.int8)
    ts = pd.to_numeric(df[time_col], errors='coerce').fillna(0).astype(np.int64) if time_col else pd.Series(np.zeros(len(df), dtype=np.int64))
    if ts.max() > 10_000_000_000: ts = (ts // 1_000_000_000).astype(np.int64)
    dt = pd.to_datetime(ts, unit='s', utc=True)
    month = dt.dt.month.astype(np.int16); wday = dt.dt.weekday.astype(np.int16); hour = dt.dt.hour.astype(np.int16)
    feats = np.vstack([
        title_wc, body_wc, title_cc, body_cc, ratio_tb, uniq_ratio,
        exclam, quest, allcaps, has_url, has_dollar, has_digit,
        month, wday, hour
    ]).T.astype(np.float32)
    return feats

def _topk_idx(sims, k):
    if k < sims.shape[1]:
        return np.argpartition(-sims, kth=k-1, axis=1)[:, :k]
    else:
        return np.argsort(-sims, axis=1)

def knn_rate_mean(pool_emb, pool_y, query_emb, k=50, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    return pool_y[topk].mean(axis=1).astype(np.float32)

def knn_rate_softmax(pool_emb, pool_y, query_emb, k=50, tau=0.12, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    row_idx = np.arange(topk.shape[0])[:, None]
    s_top = sims[row_idx, topk]
    w = np.exp(s_top / max(1e-6, tau))
    w /= (w.sum(axis=1, keepdims=True) + 1e-9)
    y_top = pool_y[topk].astype(np.float32)
    return (w * y_top).sum(axis=1).astype(np.float32)

def knn_topk(pool_emb, query_emb, k=50, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    return topk, sims

def knn_rate_bayes(pool_emb, pool_y, query_emb, k=50, alpha=20.0, p_train=0.5, self_exclude=False):
    topk, sims = knn_topk(pool_emb, query_emb, k=k, self_exclude=self_exclude)
    y_top = pool_y[topk].astype(np.float32)
    s = y_top.sum(axis=1)
    denom = (k + alpha)
    return ((s + alpha * p_train) / denom).astype(np.float32)

def knn_rate_recency_decay(pool_emb, pool_y, query_emb, pool_ts_days, query_ts_days, k=50, lam_days=75.0, self_exclude=False):
    topk, sims = knn_topk(pool_emb, query_emb, k=k, self_exclude=self_exclude)
    row_idx = np.arange(topk.shape[0])[:, None]
    qd = query_ts_days[row_idx[:,0]][:, None]
    pdays = pool_ts_days[topk]
    gaps = np.maximum(qd - pdays, 0.0)
    w = np.exp(-gaps / max(lam_days, 1e-6))
    y_top = pool_y[topk].astype(np.float32)
    w_sum = w.sum(axis=1, keepdims=True) + 1e-9
    return (w * y_top).sum(axis=1) / w_sum[:,0]

from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True)
    try: yield
    finally: print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

with timer("Load train/test and prepare inputs"):
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    time_col = 'unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]
    label_col = 'requester_received_pizza' if 'requester_received_pizza' in tr.columns else 'label'
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    # timestamps -> days (float)
    ts_tr = pd.to_numeric(tr[time_col], errors='coerce').fillna(0).values.astype(np.int64)
    ts_te = pd.to_numeric(te[time_col], errors='coerce').fillna(0).values.astype(np.int64) if time_col in te.columns else np.zeros(len(te), dtype=np.int64)
    if ts_tr.max() > 10_000_000_000: ts_tr = ts_tr // 1_000_000_000
    if ts_te.max() > 10_000_000_000: ts_te = ts_te // 1_000_000_000
    ts_tr_days = (ts_tr / 86400.0).astype(np.float32)
    ts_te_days = (ts_te / 86400.0).astype(np.float32)
    emb_tr, emb_te = compute_or_load_e5_embeddings(tr, te)
    # Cross-model pool (BGE) for cross kNN-rate
    bge_tr = np.load('emb_cache/emb_bge_train.npy').astype(np.float32)
    bge_te = np.load('emb_cache/emb_bge_test.npy').astype(np.float32)
    meta_tr = per_row_meta(tr)
    meta_te = per_row_meta(te)
    print("Shapes | emb:", emb_tr.shape, emb_te.shape, "meta:", meta_tr.shape, meta_te.shape, "| bge:", bge_tr.shape, bge_te.shape)

fold_dir = Path('folds')
manifest_path = fold_dir / 'manifest.json'
if manifest_path.exists():
    mf = json.loads(manifest_path.read_text())
    chain_ids = [c['chain'] for c in mf.get('chains', [])]
else:
    val_files = sorted(fold_dir.glob('fc_chain*_val_idx.npy'))
    chain_ids = sorted(int(p.stem.split('chain')[1].split('_')[0]) for p in val_files)
print("Chains detected (from manifest if available):", chain_ids)

oof = np.zeros(len(tr), dtype=np.float32)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_bag = []

# Tighter regularization per expert guidance
param_grid = [
    dict(max_depth=4, eta=0.08, min_child_weight=5),
    dict(max_depth=4, eta=0.05, min_child_weight=5),
    dict(max_depth=4, eta=0.08, min_child_weight=8),
]

def predict_with_best(bst, dmat):
    bi = getattr(bst, 'best_iteration', None)
    if bi is not None:
        return bst.predict(dmat, iteration_range=(0, int(bi)+1))
    return bst.predict(dmat)

def standardize_knn_feats(tr_mat, va_mat, te_mat):
    mu = tr_mat.mean(axis=0); sd = tr_mat.std(axis=0); sd = np.where(sd < 1e-6, 1.0, sd)
    tr_s = (tr_mat - mu) / sd
    va_s = (va_mat - mu) / sd
    te_s = (te_mat - mu) / sd
    return tr_s.astype(np.float32), va_s.astype(np.float32), te_s.astype(np.float32)

def train_one_chain_seed(ci, seed):
    tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    if len(va_idx) == 0:
        print(f"Chain {ci} seed {seed}: empty val; skip"); return None, None, None
    Xtr_emb, ytr = emb_tr[tr_idx], y[tr_idx]
    Xva_emb, yva = emb_tr[va_idx], y[va_idx]
    # Same-model kNN rates: k=20,50,100 mean; softmax-weighted (k=50, tau=0.12)
    kn_tr_k20 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=20, self_exclude=True).reshape(-1,1)
    kn_va_k20 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=20, self_exclude=False).reshape(-1,1)
    kn_tr_k50 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=50, self_exclude=True).reshape(-1,1)
    kn_va_k50 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=50, self_exclude=False).reshape(-1,1)
    kn_tr_k100 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=100, self_exclude=True).reshape(-1,1)
    kn_va_k100 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=100, self_exclude=False).reshape(-1,1)
    kn_tr_soft = knn_rate_softmax(Xtr_emb, ytr, Xtr_emb, k=50, tau=0.12, self_exclude=True).reshape(-1,1)
    kn_va_soft = knn_rate_softmax(Xtr_emb, ytr, Xva_emb, k=50, tau=0.12, self_exclude=False).reshape(-1,1)
    # Cross-model rate using BGE embeddings (aligned by rows)
    kn_tr_cross = knn_rate_mean(bge_tr[tr_idx], ytr, bge_tr[tr_idx], k=50, self_exclude=True).reshape(-1,1)
    kn_va_cross = knn_rate_mean(bge_tr[tr_idx], ytr, bge_tr[va_idx], k=50, self_exclude=False).reshape(-1,1)
    # Recency-decayed k=50 (lambda ~ 75 days)
    tr_days = ts_tr_days[tr_idx]; va_days = ts_tr_days[va_idx]
    kn_tr_dec = knn_rate_recency_decay(Xtr_emb, ytr, Xtr_emb, tr_days, tr_days, k=50, lam_days=75.0, self_exclude=True).reshape(-1,1)
    kn_va_dec = knn_rate_recency_decay(Xtr_emb, ytr, Xva_emb, tr_days, va_days, k=50, lam_days=75.0, self_exclude=False).reshape(-1,1)
    # Bayesian-smoothed k=50
    p_train = float(ytr.mean())
    kn_tr_bayes = knn_rate_bayes(Xtr_emb, ytr, Xtr_emb, k=50, alpha=20.0, p_train=p_train, self_exclude=True).reshape(-1,1)
    kn_va_bayes = knn_rate_bayes(Xtr_emb, ytr, Xva_emb, k=50, alpha=20.0, p_train=p_train, self_exclude=False).reshape(-1,1)
    # Test features against full train pool
    kn_te_k20  = knn_rate_mean(emb_tr, y, emb_te, k=20,  self_exclude=False).reshape(-1,1)
    kn_te_k50  = knn_rate_mean(emb_tr, y, emb_te, k=50,  self_exclude=False).reshape(-1,1)
    kn_te_k100 = knn_rate_mean(emb_tr, y, emb_te, k=100, self_exclude=False).reshape(-1,1)
    kn_te_soft = knn_rate_softmax(emb_tr, y, emb_te, k=50, tau=0.12, self_exclude=False).reshape(-1,1)
    kn_te_cross = knn_rate_mean(bge_tr, y, bge_te, k=50, self_exclude=False).reshape(-1,1)
    te_days = ts_te_days
    kn_te_dec = knn_rate_recency_decay(emb_tr, y, emb_te, ts_tr_days, te_days, k=50, lam_days=75.0, self_exclude=False).reshape(-1,1)
    p_full = float(y.mean())
    kn_te_bayes = knn_rate_bayes(emb_tr, y, emb_te, k=50, alpha=20.0, p_train=p_full, self_exclude=False).reshape(-1,1)
    # Assemble KNN features (REVERT: exclude label variance and density gap)
    K_tr = np.hstack([kn_tr_k20, kn_tr_k50, kn_tr_k100, kn_tr_soft, kn_tr_cross, kn_tr_dec, kn_tr_bayes])
    K_va = np.hstack([kn_va_k20, kn_va_k50, kn_va_k100, kn_va_soft, kn_va_cross, kn_va_dec, kn_va_bayes])
    K_te = np.hstack([kn_te_k20, kn_te_k50, kn_te_k100, kn_te_soft, kn_te_cross, kn_te_dec, kn_te_bayes])
    K_tr_s, K_va_s, K_te_s = standardize_knn_feats(K_tr, K_va, K_te)
    # Final features
    Xtr = np.hstack([Xtr_emb, meta_tr[tr_idx], K_tr_s]).astype(np.float32)
    Xva = np.hstack([Xva_emb, meta_tr[va_idx], K_va_s]).astype(np.float32)
    Xte = np.hstack([emb_te,   meta_te,        K_te_s]).astype(np.float32)
    dtr = xgb.DMatrix(Xtr, label=ytr); dva = xgb.DMatrix(Xva, label=yva); dte = xgb.DMatrix(Xte)
    pos = int((ytr==1).sum()); neg = int((ytr==0).sum()); spw = float(neg)/max(1.0, float(pos))
    base = dict(tree_method='gpu_hist', objective='binary:logistic', eval_metric='auc',
                subsample=0.8, colsample_bytree=0.8, reg_lambda=5.0, reg_alpha=0.2,
                scale_pos_weight=spw, seed=seed)
    best_auc = -1.0; best_pva = None; best_pte = None; best_desc = None
    for g in param_grid:
        params = base.copy(); params.update(g)
        t0 = time.time()
        bst = xgb.train(params, dtr, num_boost_round=4000, evals=[(dva,'val')], early_stopping_rounds=120, verbose_eval=False)
        pva = predict_with_best(bst, dva)
        auc = roc_auc_score(yva, pva)
        if auc > best_auc:
            best_auc, best_pva = auc, pva
            best_pte = predict_with_best(bst, dte)
            best_desc = g.copy(); best_desc['best_it'] = getattr(bst, 'best_iteration', None); best_desc['secs'] = time.time()-t0
    print(f"Chain {ci} seed {seed}: best={best_desc} AUC={best_auc:.5f}", flush=True)
    return va_idx, best_pva, best_pte

with timer("Train E5+XGB+robust kNN across chains and seeds"):
    SEEDS = [42, 1337, 2025]
    for seed in SEEDS:
        test_preds_per_chain = []
        for ci in chain_ids:
            res = train_one_chain_seed(ci, seed)
            if res is None: continue
            va_idx, pva, pte = res
            if seed == SEEDS[0]:
                oof[va_idx] = pva
            else:
                oof[va_idx] += pva
            val_mask[va_idx] = True
            test_preds_per_chain.append(pte)
        if len(test_preds_per_chain):
            test_preds_bag.append(np.mean(np.vstack(test_preds_per_chain), axis=0))

with timer("Evaluate and save E5+XGB artifacts (reverted extras)"):
    if val_mask.any():
        oof_avg = oof.copy(); oof_avg[val_mask] = oof_avg[val_mask] / 3.0
        oof_auc_val = roc_auc_score(y[val_mask], oof_avg[val_mask])
        print("E5+XGB OOF AUC (val rows only, seed-bag):", round(oof_auc_val, 6))
        np.save('oof_e5_xgb_fc.npy', oof_avg)
    else:
        print("Warning: no validation rows; OOF not computed.")
    if len(test_preds_bag):
        test_pred = np.mean(np.vstack(test_preds_bag), axis=0)
        np.save('test_e5_xgb_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_e5_xgb_fc.csv', index=False)
        print('Saved: oof_e5_xgb_fc.npy, test_e5_xgb_fc.npy, submission_e5_xgb_fc.csv')
    else:
        print('Warning: no test preds collected from seeds.')

[T0] Load train/test and prepare inputs ...


Loading cached E5 embeddings ...


Shapes | emb: (2878, 768) (1162, 768) meta: (2878, 15) (1162, 15) | bge: (2878, 384) (1162, 384)
[T+] Load train/test and prepare inputs done in 0.23s


Chains detected (from manifest if available): [1, 2, 3]
[T0] Train E5+XGB+robust kNN across chains and seeds ...


Chain 1 seed 42: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 305, 'secs': 1.4255635738372803} AUC=0.63259


Chain 2 seed 42: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 102, 'secs': 0.7586708068847656} AUC=0.67676


Chain 3 seed 42: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 97, 'secs': 0.7526278495788574} AUC=0.65312


Chain 1 seed 1337: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 149, 'secs': 0.9070234298706055} AUC=0.61906


Chain 2 seed 1337: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 60, 'secs': 0.6201951503753662} AUC=0.66900


Chain 3 seed 1337: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 96, 'secs': 0.7465951442718506} AUC=0.64619


Chain 1 seed 2025: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 8, 'best_it': 287, 'secs': 1.2900474071502686} AUC=0.63995


Chain 2 seed 2025: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 40, 'secs': 0.563866376876831} AUC=0.68393


Chain 3 seed 2025: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 8, 'best_it': 20, 'secs': 0.4767467975616455} AUC=0.66005


[T+] Train E5+XGB+robust kNN across chains and seeds done in 24.92s


[T0] Evaluate and save E5+XGB artifacts (reverted extras) ...


E5+XGB OOF AUC (val rows only, seed-bag): 0.610506
Saved: oof_e5_xgb_fc.npy, test_e5_xgb_fc.npy, submission_e5_xgb_fc.csv
[T+] Evaluate and save E5+XGB artifacts (reverted extras) done in 0.00s


In [55]:
# Leg C: TF-IDF -> TruncatedSVD + XGBoost (GPU) under forward-chaining folds (migrated to xgb.train + ES)
import json, gc, time
from pathlib import Path
import numpy as np, pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score
import xgboost as xgb

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def build_text_cols(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates)
    bcol = first_col(body_col_candidates)
    t = df[tcol].fillna("") if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("") if bcol else pd.Series(["" for _ in range(len(df))])
    return t.astype(str), b.astype(str)

def per_row_meta(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    time_col_candidates = ['unix_timestamp_of_request_utc', 'request_timestamp', 'created_utc', 'timestamp', 'time']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates); bcol = first_col(body_col_candidates)
    time_col = first_col(time_col_candidates)
    t = df[tcol].fillna("").astype(str) if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("").astype(str) if bcol else pd.Series(["" for _ in range(len(df))])
    def wc(s): return s.str.split().apply(len).astype(np.int32)
    def cc(s): return s.str.len().astype(np.int32)
    title_wc = wc(t); body_wc = wc(b); title_cc = cc(t); body_cc = cc(b)
    ratio_tb = (title_cc / (body_cc + 1)).astype(np.float32)
    uniq_ratio = (b.str.lower().str.split().apply(lambda x: len(set(x)) / (len(x)+1e-6))).astype(np.float32)
    exclam = b.str.count('!').astype(np.int32); quest = b.str.count('\?').astype(np.int32)
    allcaps = b.apply(lambda s: sum(1 for w in s.split() if len(w)>=3 and w.isupper())).astype(np.int32)
    has_url = (b.str.contains('http', case=False, na=False) | b.str.contains('www\.', case=False, na=False)).astype(np.int8)
    has_dollar = b.str.contains('\u0024|\$', case=False, na=False).astype(np.int8)
    has_digit = b.str.contains('[0-9]', regex=True, na=False).astype(np.int8)
    ts = pd.to_numeric(df[time_col], errors='coerce').fillna(0).astype(np.int64) if time_col else pd.Series(np.zeros(len(df), dtype=np.int64))
    if ts.max() > 10_000_000_000: ts = (ts // 1_000_000_000).astype(np.int64)
    dt = pd.to_datetime(ts, unit='s', utc=True)
    month = dt.dt.month.astype(np.int16); wday = dt.dt.weekday.astype(np.int16); hour = dt.dt.hour.astype(np.int16)
    feats = np.vstack([
        title_wc, body_wc, title_cc, body_cc, ratio_tb, uniq_ratio,
        exclam, quest, allcaps, has_url, has_dollar, has_digit,
        month, wday, hour
    ]).T.astype(np.float32)
    return feats

with timer("Load data and build text/meta"):
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    time_col = 'unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]
    label_col = 'requester_received_pizza' if 'requester_received_pizza' in tr.columns else 'label'
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    t_title, t_body = build_text_cols(tr)
    te_title, te_body = build_text_cols(te)
    text_tr = (t_title + ' ' + t_title + ' ' + t_title + ' ' + t_body).astype(str)
    text_te = (te_title + ' ' + te_body).astype(str)
    meta_tr = per_row_meta(tr)
    meta_te = per_row_meta(te)

params = {
    'svd_dim': 250,
    'max_features_word': 200000,
    'max_features_char': 200000
}
print("Params:", params)

fold_dir = Path('folds')
manifest_path = fold_dir / 'manifest.json'
if manifest_path.exists():
    mf = json.loads(manifest_path.read_text())
    chain_ids = [c['chain'] for c in mf.get('chains', [])]
else:
    val_files = sorted(fold_dir.glob('fc_chain*_val_idx.npy'))
    chain_ids = sorted(int(p.stem.split('chain')[1].split('_')[0]) for p in val_files)
print("Chains detected:", chain_ids)

oof = np.zeros(len(tr), dtype=np.float32)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_per_chain = []

param_grid = [
    dict(max_depth=4, eta=0.05, min_child_weight=3),
    dict(max_depth=4, eta=0.08, min_child_weight=3),
    dict(max_depth=5, eta=0.05, min_child_weight=3),
    dict(max_depth=5, eta=0.05, min_child_weight=5),
]

def predict_with_best(bst, dmat):
    bi = getattr(bst, 'best_iteration', None)
    if bi is not None:
        return bst.predict(dmat, iteration_range=(0, int(bi)+1))
    return bst.predict(dmat)

def fit_predict_chain(ci, include_meta=False):
    tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    if len(va_idx) == 0:
        print(f"Chain {ci}: empty val; skip"); return None
    assert len(set(tr_idx.tolist()).intersection(set(va_idx.tolist()))) == 0, f"Index overlap in chain {ci}"
    print(f"Chain {ci}: vectorizing ...", flush=True)
    word_vec = TfidfVectorizer(ngram_range=(1,2), analyzer='word', min_df=2, max_df=0.98, max_features=params['max_features_word'], dtype=np.float32)
    char_vec = TfidfVectorizer(ngram_range=(3,5), analyzer='char_wb', min_df=2, max_features=params['max_features_char'], dtype=np.float32)
    Xtr_w = word_vec.fit_transform(text_tr.iloc[tr_idx]); Xva_w = word_vec.transform(text_tr.iloc[va_idx]); Xte_w = word_vec.transform(text_te)
    Xtr_c = char_vec.fit_transform(text_tr.iloc[tr_idx]); Xva_c = char_vec.transform(text_tr.iloc[va_idx]); Xte_c = char_vec.transform(text_te)
    Xtr_tf = sparse.hstack([Xtr_w, Xtr_c]).tocsr()
    Xva_tf = sparse.hstack([Xva_w, Xva_c]).tocsr()
    Xte_tf = sparse.hstack([Xte_w, Xte_c]).tocsr()
    print(f"Chain {ci}: SVD ...", flush=True)
    svd = TruncatedSVD(n_components=params['svd_dim'], random_state=42)
    scaler = StandardScaler(with_mean=False)
    pipe = make_pipeline(svd, scaler)
    Xtr_s = pipe.fit_transform(Xtr_tf).astype(np.float32)
    Xva_s = pipe.transform(Xva_tf).astype(np.float32)
    Xte_s = pipe.transform(Xte_tf).astype(np.float32)
    if include_meta:
        Xtr = np.hstack([Xtr_s, meta_tr[tr_idx]]).astype(np.float32)
        Xva = np.hstack([Xva_s, meta_tr[va_idx]]).astype(np.float32)
        Xte = np.hstack([Xte_s, meta_te]).astype(np.float32)
    else:
        Xtr, Xva, Xte = Xtr_s, Xva_s, Xte_s
    ytr, yva = y[tr_idx], y[va_idx]
    pos = int((ytr==1).sum()); neg = int((ytr==0).sum()); spw = float(neg)/max(1.0, float(pos))
    dtr = xgb.DMatrix(Xtr, label=ytr); dva = xgb.DMatrix(Xva, label=yva); dte = xgb.DMatrix(Xte)
    base = dict(tree_method='gpu_hist', objective='binary:logistic', eval_metric='auc',
                subsample=0.85, colsample_bytree=0.9, reg_lambda=3.0, reg_alpha=0.1,
                scale_pos_weight=spw, seed=42)
    best_auc, best_pva, best_pte, best_desc = -1.0, None, None, None
    for g in param_grid:
        params_xgb = base.copy(); params_xgb.update(g)
        t0 = time.time()
        bst = xgb.train(params_xgb, dtr, num_boost_round=3000, evals=[(dva,'val')], early_stopping_rounds=75, verbose_eval=False)
        pva = predict_with_best(bst, dva)
        auc = roc_auc_score(yva, pva)
        if auc > best_auc:
            best_auc, best_pva = auc, pva
            best_pte = predict_with_best(bst, dte)
            best_desc = g.copy(); best_desc['best_it'] = getattr(bst,'best_iteration',None); best_desc['secs'] = time.time()-t0
    print(f"Chain {ci}: best={best_desc} AUC={best_auc:.5f}", flush=True)
    return va_idx, best_pva, best_pte

with timer("Train SVD+XGB across chains"):
    for ci in chain_ids:
        res = fit_predict_chain(ci, include_meta=False)
        if res is None: continue
        va_idx, pva, pte = res
        oof[va_idx] = pva
        val_mask[va_idx] = True
        test_preds_per_chain.append(pte)

with timer("Evaluate and save SVD+XGB artifacts"):
    if val_mask.any():
        oof_auc = roc_auc_score(y[val_mask], oof[val_mask])
        print("SVD+XGB OOF AUC (val rows only):", round(oof_auc, 6))
        np.save('oof_tfidf_svd_xgb_fc.npy', oof)
    else:
        print('Warning: no validation rows; OOF not computed.')
    if len(test_preds_per_chain):
        test_pred = np.mean(np.vstack(test_preds_per_chain), axis=0)
        np.save('test_tfidf_svd_xgb_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_tfidf_svd_xgb_fc.csv', index=False)
        print('Saved: oof_tfidf_svd_xgb_fc.npy, test_tfidf_svd_xgb_fc.npy, submission_tfidf_svd_xgb_fc.csv')
    else:
        print('Warning: no test preds collected.')

[T0] Load data and build text/meta ...


[T+] Load data and build text/meta done in 0.23s


Params: {'svd_dim': 250, 'max_features_word': 200000, 'max_features_char': 200000}
Chains detected: [1, 2, 3]
[T0] Train SVD+XGB across chains ...


Chain 1: vectorizing ...


Chain 1: SVD ...


Chain 2: SVD ...


Chain 2: best={'max_depth': 5, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 0, 'secs': 0.23083972930908203} AUC=0.69185


Chain 3: vectorizing ...


Chain 3: SVD ...


Chain 3: best={'max_depth': 5, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 209, 'secs': 0.8223862648010254} AUC=0.51725


[T+] Train SVD+XGB across chains done in 17.52s


[T0] Evaluate and save SVD+XGB artifacts ...


SVD+XGB OOF AUC (val rows only): 0.591481
Saved: oof_tfidf_svd_xgb_fc.npy, test_tfidf_svd_xgb_fc.npy, submission_tfidf_svd_xgb_fc.csv
[T+] Evaluate and save SVD+XGB artifacts done in 0.00s


In [153]:
# Blending v6: prune weak legs, heavier shrink/cap, auto-pick best by OOF AUC (rank space)
import json, numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def load_oof_test(prefix):
    oof = np.load(f"oof_{prefix}.npy").astype(np.float64)
    test = np.load(f"test_{prefix}.npy").astype(np.float64)
    return oof, test

def rank_array(x):
    order = np.argsort(x)
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(x))
    r = ranks / max(1.0, (len(x) - 1))
    return np.clip(r, 0.01, 0.99)

def simplex_grid(L, step=0.05):
    alphas = np.arange(0.0, 1.0 + 1e-9, step)
    if L == 1:
        yield np.array([1.0])
    elif L == 2:
        for a in alphas:
            yield np.array([a, 1.0 - a])
    elif L == 3:
        for a in alphas:
            for b in np.arange(0.0, 1.0 - a + 1e-9, step):
                c = 1.0 - a - b
                if c < -1e-9: continue
                yield np.array([a, b, max(0.0, c)])
    elif L == 4:
        for a in alphas:
            for b in np.arange(0.0, 1.0 - a + 1e-9, step):
                for c in np.arange(0.0, 1.0 - a - b + 1e-9, step):
                    d = 1.0 - a - b - c
                    if d < -1e-9: continue
                    yield np.array([a, b, c, max(0.0, d)])
    elif L == 5:
        for a in alphas:
            for b in np.arange(0.0, 1.0 - a + 1e-9, step):
                for c in np.arange(0.0, 1.0 - a - b + 1e-9, step):
                    for d in np.arange(0.0, 1.0 - a - b - c + 1e-9, step):
                        e = 1.0 - a - b - c - d
                        if e < -1e-9: continue
                        yield np.array([a, b, c, d, max(0.0, e)])
    else:
        yield np.ones(L) / L

def learn_simplex_weights(ranks_val, y_val, step=0.05):
    L = ranks_val.shape[0]
    best_auc, best_w = -1.0, None
    for w in simplex_grid(L, step=step):
        auc = roc_auc_score(y_val, np.dot(w, ranks_val))
        if auc > best_auc:
            best_auc, best_w = auc, w.copy()
    return best_w, best_auc

def learn_lr_on_ranks(ranks_val, y_val, C_grid=(0.5,1,2,5)):
    best_auc, best_w = -1.0, None
    X = ranks_val.T
    for C in C_grid:
        clf = LogisticRegression(penalty='l2', C=C, fit_intercept=False, solver='lbfgs', max_iter=1000)
        clf.fit(X, y_val)
        w = clf.coef_.ravel().astype(float)
        w = np.maximum(0.0, w)
        if w.sum() == 0: continue
        w = w / w.sum()
        auc = roc_auc_score(y_val, np.dot(w, ranks_val))
        if auc > best_auc:
            best_auc, best_w = auc, w.copy()
    return best_w, best_auc

def shrink_weights(w, shrink, L):
    uniform = np.ones(L) / L
    wf = (1.0 - shrink) * w + shrink * uniform
    s = wf.sum()
    return wf / (s if s > 0 else 1.0)

def cap_and_norm(w, cap=0.15):
    w = np.minimum(w, cap)
    s = w.sum()
    if s <= 0:
        return np.ones_like(w) / len(w)
    return w / s

fold_dir = Path('folds')
mf = json.loads((fold_dir / 'manifest.json').read_text())
chains = [c['chain'] for c in mf['chains']]
print('Chains (manifest):', chains)

n = len(np.load('oof_lr_tfidf_fc.npy'))
val_mask_all = np.zeros(n, dtype=bool)
val_mask_c2 = np.zeros(n, dtype=bool)
val_mask_c23 = np.zeros(n, dtype=bool)
val_mask_c3 = np.zeros(n, dtype=bool)
for ci in chains:
    va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    val_mask_all[va_idx] = True
    if ci == 2: val_mask_c2[va_idx] = True
    if ci in (2,3): val_mask_c23[va_idx] = True
    if ci == 3: val_mask_c3[va_idx] = True

# Pruned legs: drop mpnet_xgb_fc and tfidf_svd_xgb_fc per expert advice
# Include DeBERTa FT, RoBERTa FT, meta, embeddings, TF-IDF legs; include wordlr_fc and nbsvm_fc
all_leg_names = ['meta_xgb_fc', 'e5_xgb_fc', 'bge_xgb_fc', 'lr_tfidf_fc', 'wordlr_fc', 'charlr_fc', 'roberta_ft_fc', 'deberta_ft_fc', 'nbsvm_fc']
legs = []
for name in all_leg_names:
    try:
        oof, test = load_oof_test(name)
        assert len(oof) == n
        legs.append((name, oof, test))
        print('Loaded leg:', name)
    except Exception as e:
        print('Skip leg:', name, '|', repr(e))
assert len(legs) >= 3

# Labels
tr = load_json_df('train.json')
time_col = mf.get('time_col', 'unix_timestamp_of_request_utc')
label_col = mf.get('label_col', 'requester_received_pizza')
tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values

def prepare_rank_mats(cur_legs):
    ranks_c2 = []; ranks_c23 = []; ranks_c3 = []; ranks_all = []; ranks_test = []; probs_test = []
    for _, oof, test in cur_legs:
        ranks_c2.append(rank_array(oof[val_mask_c2]))
        ranks_c23.append(rank_array(oof[val_mask_c23]))
        ranks_c3.append(rank_array(oof[val_mask_c3]))
        r_all = np.zeros(n, dtype=np.float64); r_all[val_mask_all] = rank_array(oof[val_mask_all])
        ranks_all.append(r_all)
        ranks_test.append(rank_array(test))
        probs_test.append(test.astype(np.float64))
    return dict(
        ranks_c2=np.vstack(ranks_c2),
        ranks_c23=np.vstack(ranks_c23),
        ranks_c3=np.vstack(ranks_c3),
        ranks_all=np.vstack(ranks_all),
        ranks_test=np.vstack(ranks_test),
        probs_test=np.vstack(probs_test)
    )

legs_pruned = [(n,o,t) for (n,o,t) in legs]
legs_transformers = [(n,o,t) for (n,o,t) in legs_pruned if n in ('e5_xgb_fc','bge_xgb_fc','roberta_ft_fc','deberta_ft_fc')]

def optimize_simplex_on(mask_key, cur_legs, grid_step=0.05):
    mats = prepare_rank_mats(cur_legs)
    key_map = {'c23':'ranks_c23','c2':'ranks_c2','c3':'ranks_c3'}
    R = mats[key_map[mask_key]]
    w,_ = learn_simplex_weights(R, y[val_mask_c23 if mask_key=='c23' else (val_mask_c2 if mask_key=='c2' else val_mask_c3)], step=grid_step)
    return w, mats

def write_submission(tag, w, mats, cur_legs, shrink=0.38, cap=0.15):
    L = len(cur_legs)
    w = shrink_weights(w, shrink, L)
    w = cap_and_norm(w, cap=cap)
    ranks_all = mats['ranks_all']; ranks_test = mats['ranks_test']
    oof_blend = np.dot(w, ranks_all)
    oof_auc = roc_auc_score(y[val_mask_all], oof_blend[val_mask_all])
    test_rank = np.dot(w, ranks_test)
    test_final = test_rank
    np.save(f'oof_blend_{tag}.npy', oof_blend)
    np.save(f'test_blend_{tag}.npy', test_rank)
    sub = pd.read_csv('sampleSubmission.csv')
    sub['requester_received_pizza'] = test_final
    fn = f'submission_blend_fc_{tag}.csv'
    sub.to_csv(fn, index=False)
    print(f'{tag}: OOF AUC={oof_auc:.6f} -> wrote {fn}')
    return fn, oof_auc, w

cands = []  # (tag, auc, fn)

# A) Simplex on C2+3 (pruned legs), heavy shrink/cap
mats_base = prepare_rank_mats(legs_pruned)
wA,_ = learn_simplex_weights(mats_base['ranks_c23'], y[val_mask_c23], step=(0.02 if len(legs_pruned)<=4 else 0.05))
if wA is not None:
    fnA, aucA, _ = write_submission('A_simplex_c23_s38_c15', wA, mats_base, legs_pruned, shrink=0.38, cap=0.15)
    cands.append(('A_simplex_c23_s38_c15', aucA, fnA))

# B) LR on ranks C2+3 (pruned), heavy shrink/cap
wB,_ = learn_lr_on_ranks(mats_base['ranks_c23'], y[val_mask_c23])
if wB is not None:
    fnB, aucB, _ = write_submission('B_lr_c23_s38_c15', wB, mats_base, legs_pruned, shrink=0.38, cap=0.15)
    cands.append(('B_lr_c23_s38_c15', aucB, fnB))

# C) Time-decayed 0.7*C3 + 0.3*C2 (pruned), heavy shrink/cap
w2_td,_ = learn_simplex_weights(mats_base['ranks_c2'], y[val_mask_c2], step=(0.02 if len(legs_pruned)<=4 else 0.05))
w3_td,_ = learn_simplex_weights(mats_base['ranks_c3'], y[val_mask_c3], step=(0.02 if len(legs_pruned)<=4 else 0.05))
if (w2_td is not None) and (w3_td is not None):
    def l2(v): return v / (np.linalg.norm(v) + 1e-12)
    wC = 0.7*l2(w3_td) + 0.3*l2(w2_td)
    wC = np.maximum(0.0, wC); wC = wC / wC.sum()
    fnC, aucC, _ = write_submission('C_time_decay_70_30_s38_c15', wC, mats_base, legs_pruned, shrink=0.38, cap=0.15)
    cands.append(('C_time_decay_70_30_s38_c15', aucC, fnC))

# D) Transformers-only (E5, BGE, RoBERTa, DeBERTa) as a backup
if len(legs_transformers) >= 2:
    mats_tx = prepare_rank_mats(legs_transformers)
    wD,_ = learn_simplex_weights(mats_tx['ranks_c23'], y[val_mask_c23], step=(0.02 if len(legs_transformers)<=4 else 0.05))
    if wD is not None:
        fnD, aucD, _ = write_submission('D_tx_only_c23_s38_c15', wD, mats_tx, legs_transformers, shrink=0.38, cap=0.15)
        cands.append(('D_tx_only_c23_s38_c15', aucD, fnD))

cands.sort(key=lambda x: x[1], reverse=True)
print('Blend candidates (sorted by OOF AUC on all val rows after shrink/cap):')
for name, auc, fn in cands:
    print(' ', name, '| AUC=', round(auc,6), '| file=', fn)

# Auto-pick best by OOF AUC
primary = cands[0][2] if len(cands) else None
if primary is not None:
    pd.read_csv(primary).to_csv('submission_blend_fc.csv', index=False)
    print('Primary submission_blend_fc.csv <-', primary)
else:
    print('No candidates produced; keeping previous submission_blend_fc.csv if exists')

Chains (manifest): [1, 2, 3]
Loaded leg: meta_xgb_fc
Loaded leg: e5_xgb_fc
Loaded leg: bge_xgb_fc
Loaded leg: lr_tfidf_fc
Loaded leg: wordlr_fc
Loaded leg: charlr_fc
Loaded leg: roberta_ft_fc
Loaded leg: deberta_ft_fc
Loaded leg: nbsvm_fc
A_simplex_c23_s38_c15: OOF AUC=0.641767 -> wrote submission_blend_fc_A_simplex_c23_s38_c15.csv


B_lr_c23_s38_c15: OOF AUC=0.640458 -> wrote submission_blend_fc_B_lr_c23_s38_c15.csv
C_time_decay_70_30_s38_c15: OOF AUC=0.641767 -> wrote submission_blend_fc_C_time_decay_70_30_s38_c15.csv


D_tx_only_c23_s38_c15: OOF AUC=0.631174 -> wrote submission_blend_fc_D_tx_only_c23_s38_c15.csv
Blend candidates (sorted by OOF AUC on all val rows after shrink/cap):
  A_simplex_c23_s38_c15 | AUC= 0.641767 | file= submission_blend_fc_A_simplex_c23_s38_c15.csv
  C_time_decay_70_30_s38_c15 | AUC= 0.641767 | file= submission_blend_fc_C_time_decay_70_30_s38_c15.csv
  B_lr_c23_s38_c15 | AUC= 0.640458 | file= submission_blend_fc_B_lr_c23_s38_c15.csv
  D_tx_only_c23_s38_c15 | AUC= 0.631174 | file= submission_blend_fc_D_tx_only_c23_s38_c15.csv
Primary submission_blend_fc.csv <- submission_blend_fc_A_simplex_c23_s38_c15.csv


In [145]:
# Create submission.csv from the current primary blend selected by Cell 12
import pandas as pd, os
src = 'submission_blend_fc.csv'
assert os.path.exists(src), f"Primary blend file {src} not found; run Cell 12 first"
dst = 'submission.csv'
pd.read_csv(src).to_csv(dst, index=False)
print('Wrote submission.csv from', src)
print(pd.read_csv(dst).head())

Wrote submission.csv from submission_blend_fc.csv
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.336286
1   t3_roiuw                  0.573397
2   t3_mjnbq                  0.594069
3   t3_t8wd1                  0.546081
4  t3_1m4zxu                  0.556294


In [108]:
# Leg M: Meta-only XGBoost (GPU) with strict leak bans + minimal safe features + sentinels + fold-safe user history
import json, time, re
from pathlib import Path
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
import xgboost as xgb

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def drop_banned_columns(df: pd.DataFrame) -> pd.DataFrame:
    banned_tokens = [
        'requester_received_pizza','received_pizza','target','label','y',
        'request_text_edit_aware','edit','edited',
        'retrieval',
        'upvote','downvote','votes','karma','score',
        'giver_username_if_known','number_of_recipients',
        'account_age','days_since_first_post_on_raop','number_of_posts','number_of_comments','posts_on_raop','comments_in_raop',
        'success'
    ]
    low = [c.lower() for c in df.columns]
    keep = []
    for c, cl in zip(df.columns, low):
        if any(tok in cl for tok in banned_tokens):
            continue
        keep.append(c)
    df2 = df[keep].copy()
    assert 'request_text_edit_aware' not in df2.columns, 'edit-aware text present'
    assert all('edit' not in c.lower() for c in df2.columns), 'any *edit* column present'
    return df2

def meta_features_minimal(df: pd.DataFrame) -> np.ndarray:
    # Only from request_title and request_text + calendar + safe lexical flags. No requester_* fields.
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(['request_title','title'])
    bcol = first_col(['request_text'])  # hard-ban body/text aliases to avoid edit-aware traps
    # Guards
    if bcol is None:
        b = pd.Series(['']*len(df), index=df.index)
    else:
        assert 'edit' not in bcol.lower(), f'Body column is edit-aware: {bcol}'
        b = df[bcol].fillna('').astype(str)
    t = df[tcol].fillna('').astype(str) if tcol else pd.Series(['']*len(df), index=df.index)
    # Base lengths and simple punctuation
    def wc(s): return s.str.split().apply(len).astype(np.int32)
    def cc(s): return s.str.len().astype(np.int32)
    title_wc = wc(t); body_wc = wc(b); title_cc = cc(t); body_cc = cc(b)
    ratio_tb = (title_cc / (body_cc + 1)).astype(np.float32)
    uniq_ratio = (b.str.lower().str.split().apply(lambda x: len(set(x)) / (len(x)+1e-6))).astype(np.float32)
    exclam = b.str.count('!').astype(np.int32); quest = b.str.count('\?').astype(np.int32)
    allcaps = b.apply(lambda s: sum(1 for w in s.split() if len(w)>=3 and w.isupper())).astype(np.int32)
    # Calendar from unix_timestamp_of_request_utc (or similar)
    time_col = None
    for c in ['unix_timestamp_of_request_utc','request_timestamp','created_utc','timestamp','time']:
        if c in df.columns: time_col = c; break
    ts = pd.to_numeric(df[time_col], errors='coerce').fillna(0).astype(np.int64) if time_col else pd.Series(np.zeros(len(df), dtype=np.int64))
    if ts.max() > 10_000_000_000: ts = (ts // 1_000_000_000).astype(np.int64)
    dt = pd.to_datetime(ts, unit='s', utc=True)
    month = dt.dt.month.astype(np.int16); wday = dt.dt.weekday.astype(np.int16); hour = dt.dt.hour.astype(np.int16)
    # Safe lexical flags
    lower_b = b.str.lower()
    # URL patterns and counts
    url_pat = r'https?://\S+|www\.[^\s]+'
    url_count = b.str.count(url_pat, flags=re.IGNORECASE).astype(np.int16)
    has_url = (url_count > 0).astype(np.int8)
    has_imgur = lower_b.str.contains('imgur.com', na=False).astype(np.int8)
    tld_com = lower_b.str.contains('\u002ecom|\.com', na=False).astype(np.int8)
    tld_org = lower_b.str.contains('\u002eorg|\.org', na=False).astype(np.int8)
    tld_net = lower_b.str.contains('\u002enet|\.net', na=False).astype(np.int8)
    # Currency and numbers
    currency_pat = r'(\$|usd|dollar|dollars)'
    currency_count = lower_b.str.count(currency_pat, flags=re.IGNORECASE).astype(np.int16)
    has_usd = lower_b.str.contains('usd', na=False).astype(np.int8)
    has_dollar_word = lower_b.str.contains('dollar', na=False).astype(np.int8)
    has_dollar_symbol = b.str.contains('\u0024|\$', na=False).astype(np.int8)
    numbers_count = b.str.count(r'\d').astype(np.int16)
    # Politeness/thanks
    please_thank_pat = r'(please|thank|thanks|grateful|appreciate)'
    please_thank_count = lower_b.str.count(please_thank_pat, flags=re.IGNORECASE).astype(np.int16)
    # First-person pronouns rate
    fp_pat = r'\b(i|me|my|mine|i\'m|i\'ve|i\'d|i\'ll)\b'
    fp_count = lower_b.str.count(fp_pat, flags=re.IGNORECASE).astype(np.int16)
    first_person_rate = (fp_count / (body_wc.replace(0,1))).astype(np.float32)
    # Hardship/urgency cues
    hardship_pat = r'(rent|bill|bills|job|unemploy|hungry|broke|student|finals|paycheck|family|kids|today|tonight|asap|tldr)'
    hardship_count = lower_b.str.count(hardship_pat, flags=re.IGNORECASE).astype(np.int16)
    # Negations
    neg_pat = r"\b(no|not|never|n't)\b"
    negation_count = lower_b.str.count(neg_pat, flags=re.IGNORECASE).astype(np.int16)
    # Title question mark flag
    title_has_qmark = t.str.contains('\?', na=False).astype(np.int8)
    # Emoji count (basic range; broad approximation)
    emoji_pat = r'[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF\U0001F1E6-\U0001F1FF\u2600-\u26FF\u2700-\u27BF]'
    try:
        emoji_count = b.str.count(emoji_pat).astype(np.int16)
    except re.error:
        # Some engines may not support the full range; fallback to zero
        emoji_count = pd.Series([0]*len(b), index=b.index, dtype='int16')
    # Assemble features
    feats = np.vstack([
        title_wc, body_wc, title_cc, body_cc, ratio_tb, uniq_ratio,
        exclam, quest, allcaps,
        month, wday, hour,
        # lexical flags
        url_count, has_url, has_imgur, tld_com, tld_org, tld_net,
        currency_count, has_usd, has_dollar_word, has_dollar_symbol, numbers_count,
        please_thank_count, first_person_rate, hardship_count, negation_count,
        title_has_qmark, emoji_count
    ]).T.astype(np.float32)
    return feats

def build_user_history_fold_safe(tr_all_sorted: pd.DataFrame, te_df: pd.DataFrame, group_col: str, time_col: str, label_col: str, fold_dir: Path, chains: list, alpha: float = 20.0) -> tuple[np.ndarray, np.ndarray]:
    # Returns history features aligned to train rows (shape (n, F)) and test rows (shape (T, F))
    n = len(tr_all_sorted); T = len(te_df)
    H_tr = np.zeros((n, 4), dtype=np.float32)  # [log1p(prior_count), smoothed_rate, days_since_prev, log1p(days_since_first)]
    # Global prior for smoothing uses train-only per chain; compute per-chain then fill
    for ci in chains:
        tr_idx = np.load(fold_dir / f'fc_chain{ci}_train_idx.npy')
        va_idx = np.load(fold_dir / f'fc_chain{ci}_val_idx.npy')
        if len(tr_idx) == 0: continue
        sub_tr = tr_all_sorted.iloc[tr_idx].copy()
        sub_tr = sub_tr.sort_values(time_col, kind='mergesort')
        gp = sub_tr.groupby(group_col, sort=False, observed=True)
        # Train-row features (per-row cum stats within train window)
        prior_cnt = gp.cumcount().astype(np.int64).values  # count before
        succ = pd.to_numeric(sub_tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1)
        prior_succ = gp[succ.name].cumsum().shift(1).fillna(0).astype(np.int64).values
        prev_ts = gp[time_col].shift(1).fillna(np.nan).values.astype('float64')
        cur_ts = sub_tr[time_col].values.astype('float64')
        days_since_prev = np.where(np.isnan(prev_ts), 0.0, (cur_ts - prev_ts) / 86400.0).astype(np.float32)
        first_ts = gp[time_col].transform('min').values.astype('float64')
        days_since_first = np.maximum((cur_ts - first_ts) / 86400.0, 0.0).astype(np.float32)
        p_global = float(succ.mean()) if len(sub_tr) else 0.5
        rate_sm = ((prior_succ + alpha * p_global) / (prior_cnt + alpha)).astype(np.float32)
        H_tr_trwin = np.vstack([np.log1p(prior_cnt).astype(np.float32), rate_sm, days_since_prev, np.log1p(days_since_first)]).T.astype(np.float32)
        # Map back to H_tr at train indices order
        H_tr[tr_idx] = H_tr_trwin
        # Validation rows get aggregate stats from train window only
        if len(va_idx):
            agg = gp.agg({label_col:'sum', time_col:['min','max','count']})
            agg.columns = ['succ_sum','first_ts','last_ts','cnt']
            # Build maps
            succ_map = agg['succ_sum'].to_dict()
            first_map = agg['first_ts'].to_dict()
            last_map = agg['last_ts'].to_dict()
            cnt_map = agg['cnt'].to_dict()
            users_va = tr_all_sorted.iloc[va_idx][group_col].astype(str).values
            cur_ts_va = tr_all_sorted.iloc[va_idx][time_col].values.astype('float64')
            prior_cnt_va = np.array([cnt_map.get(u, 0) for u in users_va], dtype=np.int64)
            prior_succ_va = np.array([succ_map.get(u, 0) for u in users_va], dtype=np.int64)
            last_ts_va = np.array([last_map.get(u, cur_ts_va[i]) for i,u in enumerate(users_va)], dtype='float64')
            first_ts_va = np.array([first_map.get(u, cur_ts_va[i]) for i,u in enumerate(users_va)], dtype='float64')
            days_prev_va = np.maximum((cur_ts_va - last_ts_va)/86400.0, 0.0).astype(np.float32)
            days_first_va = np.maximum((cur_ts_va - first_ts_va)/86400.0, 0.0).astype(np.float32)
            rate_sm_va = ((prior_succ_va + alpha * p_global) / (prior_cnt_va + alpha)).astype(np.float32)
            H_tr_va = np.vstack([np.log1p(prior_cnt_va).astype(np.float32), rate_sm_va, days_prev_va, np.log1p(days_first_va)]).T.astype(np.float32)
            H_tr[va_idx] = H_tr_va
    # Test features: build aggregates from full training (all rows) and apply to test
    te_users = te_df.get('requester_username', None)
    if te_users is None and 'username' in te_df.columns:
        te_users = te_df['username']
    if te_users is None:
        te_users = pd.Series(['']*len(te_df))
    tr_full = tr_all_sorted.copy()
    gp_full = tr_full.groupby(group_col, sort=False, observed=True)
    succ_full = pd.to_numeric(tr_full[label_col], errors='coerce').fillna(0).astype(int).clip(0,1)
    agg_full = gp_full.agg({label_col:'sum', time_col:['min','max','count']})
    agg_full.columns = ['succ_sum','first_ts','last_ts','cnt']
    p_global_full = float(succ_full.mean()) if len(tr_full) else 0.5
    succ_map_f = agg_full['succ_sum'].to_dict()
    first_map_f = agg_full['first_ts'].to_dict()
    last_map_f = agg_full['last_ts'].to_dict()
    cnt_map_f = agg_full['cnt'].to_dict()
    cur_ts_te = pd.to_numeric(te_df[time_col], errors='coerce').fillna(0).values.astype('float64') if time_col in te_df.columns else np.zeros(len(te_df), dtype='float64')
    users_te = te_users.astype(str).values
    prior_cnt_te = np.array([cnt_map_f.get(u, 0) for u in users_te], dtype=np.int64)
    prior_succ_te = np.array([succ_map_f.get(u, 0) for u in users_te], dtype=np.int64)
    last_ts_te = np.array([last_map_f.get(u, cur_ts_te[i]) for i,u in enumerate(users_te)], dtype='float64')
    first_ts_te = np.array([first_map_f.get(u, cur_ts_te[i]) for i,u in enumerate(users_te)], dtype='float64')
    days_prev_te = np.maximum((cur_ts_te - last_ts_te)/86400.0, 0.0).astype(np.float32)
    days_first_te = np.maximum((cur_ts_te - first_ts_te)/86400.0, 0.0).astype(np.float32)
    rate_sm_te = ((prior_succ_te + alpha * p_global_full) / (prior_cnt_te + alpha)).astype(np.float32)
    H_te = np.vstack([np.log1p(prior_cnt_te).astype(np.float32), rate_sm_te, days_prev_te, np.log1p(days_first_te)]).T.astype(np.float32)
    return H_tr, H_te

from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True)
    try: yield
    finally: print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

fold_dir = Path('folds')
manifest_path = fold_dir / 'manifest.json'
mf = json.loads(manifest_path.read_text()) if manifest_path.exists() else None

with timer('Load data, apply hard bans, and build minimal-safe meta + user history (fold-safe)'):
    tr_raw = load_json_df('train.json')
    te_raw = load_json_df('test.json')
    tr = drop_banned_columns(tr_raw)
    te = drop_banned_columns(te_raw)
    time_col = (mf.get('time_col') if mf else ('unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]))
    label_col = (mf.get('label_col') if mf else ('requester_received_pizza' if 'requester_received_pizza' in tr_raw.columns else 'label'))
    group_col = (mf.get('group_col') if mf else ('requester_username' if 'requester_username' in tr_raw.columns else 'username'))
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    tr_raw_sorted = tr_raw.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr_raw_sorted[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    X_meta_tr_base = meta_features_minimal(tr)
    X_meta_te_base = meta_features_minimal(te)
    # Fold-safe user history features
    if mf is not None:
        chain_ids = [c['chain'] for c in mf.get('chains', [])]
    else:
        val_files = sorted(fold_dir.glob('fc_chain*_val_idx.npy'))
        chain_ids = sorted(int(p.stem.split('chain')[1].split('_')[0]) for p in val_files)
    H_tr, H_te = build_user_history_fold_safe(tr_raw_sorted, te_raw, group_col, time_col, label_col, fold_dir, chain_ids, alpha=20.0)
    X_meta_tr = np.hstack([X_meta_tr_base, H_tr]).astype(np.float32)
    X_meta_te = np.hstack([X_meta_te_base, H_te]).astype(np.float32)
    print('Meta+History shapes:', X_meta_tr.shape, X_meta_te.shape)

print('Chains detected for meta leg:', [c['chain'] for c in mf.get('chains', [])] if mf else 'unknown')

# Sanity A: zero-sentinel to ensure pipeline isn't leaking
def sanity_zero_check():
    from sklearn.metrics import roc_auc_score
    oof_zero = np.zeros(len(y), dtype=np.float32)
    val_mask = np.zeros(len(y), dtype=bool)
    if mf is not None:
        for ci in [c['chain'] for c in mf.get('chains', [])]:
            va_idx = np.load(fold_dir / f'fc_chain{ci}_val_idx.npy')
            val_mask[va_idx] = True
    try:
        auc = roc_auc_score(y[val_mask], oof_zero[val_mask]) if val_mask.any() else 0.5
    except Exception:
        auc = 0.5
    print('Sanity A (zeros) AUC on val rows:', round(auc, 6))
    return auc

# Sanity B: single-feature AUC sentinels
def sanity_single_feature_scan(X):
    from sklearn.metrics import roc_auc_score
    val_mask = np.zeros(len(y), dtype=bool)
    if mf is not None:
        for ci in [c['chain'] for c in mf.get('chains', [])]:
            va_idx = np.load(fold_dir / f'fc_chain{ci}_val_idx.npy')
            val_mask[va_idx] = True
    vmax = -1.0; vmax_j = -1
    for j in range(X.shape[1]):
        try:
            auc = roc_auc_score(y[val_mask], X[val_mask, j])
            if auc > vmax:
                vmax, vmax_j = auc, j
        except Exception:
            pass
    print('Sanity B: max single-feature AUC on val rows =', round(vmax,6), 'at feature', vmax_j)
    if vmax >= 0.95:
        raise RuntimeError(f'Feature leakage suspected: single feature AUC {vmax:.6f} >= 0.95')

zero_auc = sanity_zero_check()
sanity_single_feature_scan(X_meta_tr)

oof = np.zeros(len(y), dtype=np.float32)
val_mask = np.zeros(len(y), dtype=bool)
test_preds = []

param_grid = [
    dict(max_depth=3, eta=0.05, min_child_weight=3),
    dict(max_depth=4, eta=0.05, min_child_weight=3),
    dict(max_depth=4, eta=0.08, min_child_weight=3),
    dict(max_depth=5, eta=0.05, min_child_weight=5),
]

def predict_with_best(bst, dmat):
    bi = getattr(bst, 'best_iteration', None)
    if bi is not None:
        return bst.predict(dmat, iteration_range=(0, int(bi)+1))
    return bst.predict(dmat)

def train_chain(ci):
    tr_idx = np.load(fold_dir / f'fc_chain{ci}_train_idx.npy'); va_idx = np.load(fold_dir / f'fc_chain{ci}_val_idx.npy')
    if len(va_idx) == 0:
        print(f'Chain {ci}: empty val; skip'); return None
    Xtr = X_meta_tr[tr_idx].astype(np.float32); Xva = X_meta_tr[va_idx].astype(np.float32); Xte = X_meta_te.astype(np.float32)
    ytr_full, yva = y[tr_idx], y[va_idx]
    dtr = xgb.DMatrix(Xtr, label=ytr_full); dva = xgb.DMatrix(Xva, label=yva); dte = xgb.DMatrix(Xte)
    pos = int((ytr_full==1).sum()); neg = int((ytr_full==0).sum()); spw = float(neg)/max(1.0,float(pos))
    base = dict(tree_method='gpu_hist', objective='binary:logistic', eval_metric='auc',
                subsample=0.85, colsample_bytree=0.9, reg_lambda=3.0, reg_alpha=0.1,
                scale_pos_weight=spw, seed=42)
    best_auc, best_pva, best_pte, best_desc = -1.0, None, None, None
    for g in param_grid:
        params = base.copy(); params.update(g)
        t0 = time.time()
        bst = xgb.train(params, dtr, num_boost_round=3000, evals=[(dva,'val')], early_stopping_rounds=75, verbose_eval=False)
        pva = predict_with_best(bst, dva)
        auc = roc_auc_score(yva, pva)
        if auc > best_auc:
            best_auc, best_pva = auc, pva
            best_pte = predict_with_best(bst, dte)
            best_desc = g.copy(); best_desc['best_it'] = getattr(bst,'best_iteration',None); best_desc['secs'] = time.time()-t0
    print(f'Chain {ci}: best={best_desc} AUC={best_auc:.5f}', flush=True)
    return va_idx, best_pva, best_pte

with timer('Train Meta-XGB across chains (minimal + history + lexical)'):
    if mf is not None:
        chain_ids = [c['chain'] for c in mf.get('chains', [])]
    else:
        val_files = sorted(fold_dir.glob('fc_chain*_val_idx.npy'))
        chain_ids = sorted(int(p.stem.split('chain')[1].split('_')[0]) for p in val_files)
    for ci in chain_ids:
        res = train_chain(ci)
        if res is None: continue
        va_idx, pva, pte = res
        oof[va_idx] = pva
        val_mask[va_idx] = True
        test_preds.append(pte)

with timer('Evaluate and save Meta-XGB artifacts (minimal+history+lexical)'):
    if val_mask.any():
        oof_auc = roc_auc_score(y[val_mask], oof[val_mask])
        print('Meta-XGB (minimal+history+lexical) OOF AUC (val rows only):', round(oof_auc, 6))
        np.save('oof_meta_xgb_fc.npy', oof)
    else:
        print('Warning: no validation rows; OOF not computed.')
    if len(test_preds):
        test_pred = np.mean(np.vstack(test_preds), axis=0)
        np.save('test_meta_xgb_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_meta_xgb_fc.csv', index=False)
        print('Saved: oof_meta_xgb_fc.npy, test_meta_xgb_fc.npy, submission_meta_xgb_fc.csv')
    else:
        print('Warning: no test preds collected.')

print()

[T0] Load data, apply hard bans, and build minimal-safe meta + user history (fold-safe) ...


Meta+History shapes: (2878, 33) (1162, 33)
[T+] Load data, apply hard bans, and build minimal-safe meta + user history (fold-safe) done in 0.55s


Chains detected for meta leg: [1, 2, 3]
Sanity A (zeros) AUC on val rows: 0.5
Sanity B: max single-feature AUC on val rows = 0.606255 at feature 1
[T0] Train Meta-XGB across chains (minimal + history + lexical) ...


Chain 1: best={'max_depth': 3, 'eta': 0.05, 'min_child_weight': 3, 'best_it': 14, 'secs': 0.15033745765686035} AUC=0.63402


Chain 2: best={'max_depth': 5, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 31, 'secs': 0.19341516494750977} AUC=0.66163


Chain 3: best={'max_depth': 5, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 84, 'secs': 0.28555989265441895} AUC=0.53764


[T+] Train Meta-XGB across chains (minimal + history + lexical) done in 2.41s


[T0] Evaluate and save Meta-XGB artifacts (minimal+history+lexical) ...


Meta-XGB (minimal+history+lexical) OOF AUC (val rows only): 0.619577
Saved: oof_meta_xgb_fc.npy, test_meta_xgb_fc.npy, submission_meta_xgb_fc.csv
[T+] Evaluate and save Meta-XGB artifacts (minimal+history+lexical) done in 0.00s





In [57]:
# Helper: Generate and cache BGE-small embeddings via venv interpreter
import os, sys, subprocess, textwrap
from pathlib import Path

script_path = Path('bge_gen.py')
script_code = textwrap.dedent('''
import os, sys
from pathlib import Path
import numpy as np, pandas as pd
import torch
from sentence_transformers import SentenceTransformer

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            import json
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def build_text_cols(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates)
    bcol = first_col(body_col_candidates)
    t = df[tcol].fillna('') if tcol else pd.Series(['' for _ in range(len(df))])
    b = df[bcol].fillna('') if bcol else pd.Series(['' for _ in range(len(df))])
    return t.astype(str), b.astype(str)

def main():
    cache_dir = Path('emb_cache'); cache_dir.mkdir(exist_ok=True)
    tr_path = cache_dir / 'emb_bge_train.npy'
    te_path = cache_dir / 'emb_bge_test.npy'
    if tr_path.exists() and te_path.exists():
        arr_tr = np.load(tr_path); arr_te = np.load(te_path)
        print('BGE embeddings already exist:', arr_tr.shape, arr_te.shape)
        return
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    time_col = 'unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    t_title, t_body = build_text_cols(tr)
    te_title, te_body = build_text_cols(te)
    tr_texts = (t_title + ' \\n ' + t_body).tolist()
    te_texts = (te_title + ' \\n ' + te_body).tolist()
    model_name = 'BAAI/bge-small-en-v1.5'
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print('Loading model:', model_name, 'on', device)
    model = SentenceTransformer(model_name, device=device)
    model.max_seq_length = 512
    def embed(texts, batch_size=128):
        return np.asarray(model.encode(texts, batch_size=batch_size, normalize_embeddings=True, show_progress_bar=True), dtype=np.float32)
    print('Encoding train ...'); emb_tr = embed(tr_texts)
    print('Encoding test ...'); emb_te = embed(te_texts)
    np.save(tr_path, emb_tr); np.save(te_path, emb_te)
    print('Saved:', tr_path, te_path, '| shapes:', emb_tr.shape, emb_te.shape)

if __name__ == '__main__':
    main()
''')
script_path.write_text(script_code)
print('Wrote helper script:', script_path)

venv_py = Path('.venv/bin/python')
assert venv_py.exists(), 'Venv python not found; run the venv setup cell first'

print('Ensuring pandas/numpy in venv ...')
subprocess.run([str(venv_py), '-m', 'pip', 'install', '--upgrade', 'pip', 'wheel', 'setuptools'], check=True)
subprocess.run([str(venv_py), '-m', 'pip', 'install', 'pandas', 'numpy', 'sentence-transformers'], check=True)

print('Running BGE embeddings generation via', venv_py)
proc = subprocess.run([str(venv_py), str(script_path)], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
print(proc.stdout)
if proc.returncode != 0:
    raise RuntimeError('BGE embedding generation failed; check logs above')
print('BGE embeddings generation complete. Train the BGE+XGB leg next.')

Wrote helper script: bge_gen.py
Ensuring pandas/numpy in venv ...






Running BGE embeddings generation via .venv/bin/python


Loading model: BAAI/bge-small-en-v1.5 on cuda
Encoding train ...

Batches:   0%|          | 0/23 [00:00<?, ?it/s]
Batches:   4%|▍         | 1/23 [00:00<00:13,  1.66it/s]
Batches:   9%|▊         | 2/23 [00:00<00:09,  2.32it/s]
Batches:  13%|█▎        | 3/23 [00:01<00:06,  2.88it/s]
Batches:  17%|█▋        | 4/23 [00:01<00:05,  3.29it/s]
Batches:  22%|██▏       | 5/23 [00:01<00:04,  3.73it/s]
Batches:  26%|██▌       | 6/23 [00:01<00:03,  4.45it/s]
Batches:  30%|███       | 7/23 [00:01<00:03,  5.05it/s]
Batches:  35%|███▍      | 8/23 [00:02<00:02,  5.75it/s]
Batches:  39%|███▉      | 9/23 [00:02<00:02,  6.35it/s]
Batches:  43%|████▎     | 10/23 [00:02<00:01,  7.02it/s]
Batches:  52%|█████▏    | 12/23 [00:02<00:01,  8.63it/s]
Batches:  61%|██████    | 14/23 [00:02<00:00,  9.57it/s]
Batches:  70%|██████▉   | 16/23 [00:02<00:00, 10.51it/s]
Batches:  78%|███████▊  | 18/23 [00:02<00:00, 11.43it/s]
Batches:  87%|████████▋ | 20/23 [00:03<00:00, 12.91it/s]
Batches:  96%|█████████▌| 22/23 [00:03<0

In [121]:
# Leg B2: BGE-small-en-v1.5 embeddings + XGBoost (GPU), 3-seed bag, with robust kNN-rate (multi-k + softmax + cross-model + recency-decay + bayesian), per-chain std [REVERT EXTRAS; TIGHTER XGB]
import os, sys, json, time
from pathlib import Path
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
import xgboost as xgb

os.environ['PYTHONNOUSERSITE'] = '1'
sys.path = [p for p in sys.path if ('vendor_pkgs' not in str(p)) and ('.pip-target' not in str(p))]

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def per_row_meta(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    time_col_candidates = ['unix_timestamp_of_request_utc', 'request_timestamp', 'created_utc', 'timestamp', 'time']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates); bcol = first_col(body_col_candidates)
    time_col = first_col(time_col_candidates)
    t = df[tcol].fillna("").astype(str) if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("").astype(str) if bcol else pd.Series(["" for _ in range(len(df))])
    def wc(s): return s.str.split().apply(len).astype(np.int32)
    def cc(s): return s.str.len().astype(np.int32)
    title_wc = wc(t); body_wc = wc(b); title_cc = cc(t); body_cc = cc(b)
    ratio_tb = (title_cc / (body_cc + 1)).astype(np.float32)
    uniq_ratio = (b.str.lower().str.split().apply(lambda x: len(set(x)) / (len(x)+1e-6))).astype(np.float32)
    exclam = b.str.count('!').astype(np.int32); quest = b.str.count('\?').astype(np.int32)
    allcaps = b.apply(lambda s: sum(1 for w in s.split() if len(w)>=3 and w.isupper())).astype(np.int32)
    has_url = (b.str.contains('http', case=False, na=False) | b.str.contains('www\.', case=False, na=False)).astype(np.int8)
    has_dollar = b.str.contains('\u0024|\$', case=False, na=False).astype(np.int8)
    has_digit = b.str.contains('[0-9]', regex=True, na=False).astype(np.int8)
    ts = pd.to_numeric(df[time_col], errors='coerce').fillna(0).astype(np.int64) if time_col else pd.Series(np.zeros(len(df), dtype=np.int64))
    if ts.max() > 10_000_000_000: ts = (ts // 1_000_000_000).astype(np.int64)
    dt = pd.to_datetime(ts, unit='s', utc=True)
    month = dt.dt.month.astype(np.int16); wday = dt.dt.weekday.astype(np.int16); hour = dt.dt.hour.astype(np.int16)
    feats = np.vstack([
        title_wc, body_wc, title_cc, body_cc, ratio_tb, uniq_ratio,
        exclam, quest, allcaps, has_url, has_dollar, has_digit,
        month, wday, hour
    ]).T.astype(np.float32)
    return feats

def compute_or_load_bge_embeddings(cache_dir=Path("emb_cache")):
    tr_path = cache_dir / "emb_bge_train.npy"
    te_path = cache_dir / "emb_bge_test.npy"
    assert tr_path.exists() and te_path.exists(), "BGE caches missing; run Cell 15 first"
    return np.load(tr_path).astype(np.float32), np.load(te_path).astype(np.float32)

def _topk_idx(sims, k):
    if k < sims.shape[1]:
        return np.argpartition(-sims, kth=k-1, axis=1)[:, :k]
    else:
        return np.argsort(-sims, axis=1)

def knn_rate_mean(pool_emb, pool_y, query_emb, k=50, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    return pool_y[topk].mean(axis=1).astype(np.float32)

def knn_rate_softmax(pool_emb, pool_y, query_emb, k=50, tau=0.12, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    row_idx = np.arange(topk.shape[0])[:, None]
    s_top = sims[row_idx, topk]
    w = np.exp(s_top / max(1e-6, tau))
    w /= (w.sum(axis=1, keepdims=True) + 1e-9)
    y_top = pool_y[topk].astype(np.float32)
    return (w * y_top).sum(axis=1).astype(np.float32)

def knn_topk(pool_emb, query_emb, k=50, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    return topk, sims

def knn_rate_bayes(pool_emb, pool_y, query_emb, k=50, alpha=22.0, p_train=0.5, self_exclude=False):
    topk, sims = knn_topk(pool_emb, query_emb, k=k, self_exclude=self_exclude)
    y_top = pool_y[topk].astype(np.float32)
    s = y_top.sum(axis=1)
    denom = (k + alpha)
    return ((s + alpha * p_train) / denom).astype(np.float32)

def knn_rate_recency_decay(pool_emb, pool_y, query_emb, pool_ts_days, query_ts_days, k=50, lam_days=75.0, self_exclude=False):
    topk, sims = knn_topk(pool_emb, query_emb, k=k, self_exclude=self_exclude)
    row_idx = np.arange(topk.shape[0])[:, None]
    qd = query_ts_days[row_idx[:,0]][:, None]
    pdays = pool_ts_days[topk]
    gaps = np.maximum(qd - pdays, 0.0)
    w = np.exp(-gaps / max(lam_days, 1e-6))
    y_top = pool_y[topk].astype(np.float32)
    w_sum = w.sum(axis=1, keepdims=True) + 1e-9
    return ((w * y_top).sum(axis=1) / w_sum[:,0]).astype(np.float32)

from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True)
    try: yield
    finally: print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

with timer("Load train/test and prepare inputs (BGE)"):
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    time_col = 'unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]
    label_col = 'requester_received_pizza' if 'requester_received_pizza' in tr.columns else 'label'
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    # timestamps -> days
    ts_tr = pd.to_numeric(tr[time_col], errors='coerce').fillna(0).values.astype(np.int64)
    ts_te = pd.to_numeric(te[time_col], errors='coerce').fillna(0).values.astype(np.int64) if time_col in te.columns else np.zeros(len(te), dtype=np.int64)
    if ts_tr.max() > 10_000_000_000: ts_tr = ts_tr // 1_000_000_000
    if ts_te.max() > 10_000_000_000: ts_te = ts_te // 1_000_000_000
    ts_tr_days = (ts_tr / 86400.0).astype(np.float32)
    ts_te_days = (ts_te / 86400.0).astype(np.float32)
    bge_tr, bge_te = compute_or_load_bge_embeddings()
    # Cross-model pool for reverse cross kNN-rate (E5 space)
    e5_tr = np.load('emb_cache/emb_e5_train.npy').astype(np.float32)
    e5_te = np.load('emb_cache/emb_e5_test.npy').astype(np.float32)
    meta_tr = per_row_meta(tr)
    meta_te = per_row_meta(te)
    print("Shapes | bge:", bge_tr.shape, bge_te.shape, "meta:", meta_tr.shape, meta_te.shape, "| e5:", e5_tr.shape, e5_te.shape)

fold_dir = Path('folds')
manifest_path = fold_dir / 'manifest.json'
if manifest_path.exists():
    mf = json.loads(manifest_path.read_text())
    chain_ids = [c['chain'] for c in mf.get('chains', [])]
else:
    val_files = sorted(fold_dir.glob('fc_chain*_val_idx.npy'))
    chain_ids = sorted(int(p.stem.split('chain')[1].split('_')[0]) for p in val_files)
print("Chains detected (from manifest if available):", chain_ids)

oof = np.zeros(len(tr), dtype=np.float32)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_bag = []

# Tighter regularization per expert guidance
param_grid = [
    dict(max_depth=4, eta=0.08, min_child_weight=5),
    dict(max_depth=4, eta=0.05, min_child_weight=5),
    dict(max_depth=4, eta=0.08, min_child_weight=8),
]

def predict_with_best(bst, dmat):
    bi = getattr(bst, 'best_iteration', None)
    if bi is not None:
        return bst.predict(dmat, iteration_range=(0, int(bi)+1))
    return bst.predict(dmat)

def standardize_knn_feats(tr_mat, va_mat, te_mat):
    mu = tr_mat.mean(axis=0); sd = tr_mat.std(axis=0); sd = np.where(sd < 1e-6, 1.0, sd)
    tr_s = (tr_mat - mu) / sd
    va_s = (va_mat - mu) / sd
    te_s = (te_mat - mu) / sd
    return tr_s.astype(np.float32), va_s.astype(np.float32), te_s.astype(np.float32)

def train_one_chain_seed(ci, seed):
    tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    if len(va_idx) == 0:
        print(f"Chain {ci} seed {seed}: empty val; skip"); return None, None, None
    Xtr_emb, ytr = bge_tr[tr_idx], y[tr_idx]
    Xva_emb, yva = bge_tr[va_idx], y[va_idx]
    # Same-model kNN rates in BGE space: multi-k and softmax (tau=0.12)
    kn_tr_k20 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=20, self_exclude=True).reshape(-1,1)
    kn_va_k20 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=20, self_exclude=False).reshape(-1,1)
    kn_tr_k50 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=50, self_exclude=True).reshape(-1,1)
    kn_va_k50 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=50, self_exclude=False).reshape(-1,1)
    kn_tr_k100 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=100, self_exclude=True).reshape(-1,1)
    kn_va_k100 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=100, self_exclude=False).reshape(-1,1)
    kn_tr_soft = knn_rate_softmax(Xtr_emb, ytr, Xtr_emb, k=50, tau=0.12, self_exclude=True).reshape(-1,1)
    kn_va_soft = knn_rate_softmax(Xtr_emb, ytr, Xva_emb, k=50, tau=0.12, self_exclude=False).reshape(-1,1)
    # Cross-model (reverse) in E5 space
    kn_tr_cross = knn_rate_mean(e5_tr[tr_idx], ytr, e5_tr[tr_idx], k=50, self_exclude=True).reshape(-1,1)
    kn_va_cross = knn_rate_mean(e5_tr[tr_idx], ytr, e5_tr[va_idx], k=50, self_exclude=False).reshape(-1,1)
    # Recency-decayed k=50 (lambda ~75 days) in BGE space
    tr_days = ts_tr_days[tr_idx]; va_days = ts_tr_days[va_idx]
    kn_tr_dec = knn_rate_recency_decay(Xtr_emb, ytr, Xtr_emb, tr_days, tr_days, k=50, lam_days=75.0, self_exclude=True).reshape(-1,1)
    kn_va_dec = knn_rate_recency_decay(Xtr_emb, ytr, Xva_emb, tr_days, va_days, k=50, lam_days=75.0, self_exclude=False).reshape(-1,1)
    # Bayesian-smoothed k=50 in BGE space
    p_train = float(ytr.mean())
    kn_tr_bayes = knn_rate_bayes(Xtr_emb, ytr, Xtr_emb, k=50, alpha=22.0, p_train=p_train, self_exclude=True).reshape(-1,1)
    kn_va_bayes = knn_rate_bayes(Xtr_emb, ytr, Xva_emb, k=50, alpha=22.0, p_train=p_train, self_exclude=False).reshape(-1,1)
    # Test features using full train pool
    kn_te_k20  = knn_rate_mean(bge_tr, y, bge_te, k=20,  self_exclude=False).reshape(-1,1)
    kn_te_k50  = knn_rate_mean(bge_tr, y, bge_te, k=50,  self_exclude=False).reshape(-1,1)
    kn_te_k100 = knn_rate_mean(bge_tr, y, bge_te, k=100, self_exclude=False).reshape(-1,1)
    kn_te_soft = knn_rate_softmax(bge_tr, y, bge_te, k=50, tau=0.12, self_exclude=False).reshape(-1,1)
    kn_te_cross = knn_rate_mean(e5_tr, y, e5_te, k=50, self_exclude=False).reshape(-1,1)
    te_days = ts_te_days
    kn_te_dec = knn_rate_recency_decay(bge_tr, y, bge_te, ts_tr_days, te_days, k=50, lam_days=75.0, self_exclude=False).reshape(-1,1)
    p_full = float(y.mean())
    kn_te_bayes = knn_rate_bayes(bge_tr, y, bge_te, k=50, alpha=22.0, p_train=p_full, self_exclude=False).reshape(-1,1)
    # Assemble and standardize kNN features per chain (EXTRAS REMOVED)
    K_tr = np.hstack([kn_tr_k20, kn_tr_k50, kn_tr_k100, kn_tr_soft, kn_tr_cross, kn_tr_dec, kn_tr_bayes])
    K_va = np.hstack([kn_va_k20, kn_va_k50, kn_va_k100, kn_va_soft, kn_va_cross, kn_va_dec, kn_va_bayes])
    K_te = np.hstack([kn_te_k20, kn_te_k50, kn_te_k100, kn_te_soft, kn_te_cross, kn_te_dec, kn_te_bayes])
    K_tr_s, K_va_s, K_te_s = standardize_knn_feats(K_tr, K_va, K_te)
    # Final features
    Xtr = np.hstack([Xtr_emb, meta_tr[tr_idx], K_tr_s]).astype(np.float32)
    Xva = np.hstack([Xva_emb, meta_tr[va_idx], K_va_s]).astype(np.float32)
    Xte = np.hstack([bge_te,   meta_te,        K_te_s]).astype(np.float32)
    dtr = xgb.DMatrix(Xtr, label=ytr); dva = xgb.DMatrix(Xva, label=yva); dte = xgb.DMatrix(Xte)
    pos = int((ytr==1).sum()); neg = int((ytr==0).sum()); spw = float(neg)/max(1.0, float(pos))
    base = dict(tree_method='gpu_hist', objective='binary:logistic', eval_metric='auc',
                subsample=0.8, colsample_bytree=0.8, reg_lambda=5.0, reg_alpha=0.2,
                scale_pos_weight=spw, seed=seed)
    best_auc = -1.0; best_pva = None; best_pte = None; best_desc = None
    for g in param_grid:
        params = base.copy(); params.update(g)
        t0 = time.time()
        bst = xgb.train(params, dtr, num_boost_round=4000, evals=[(dva,'val')], early_stopping_rounds=120, verbose_eval=False)
        pva = predict_with_best(bst, dva)
        auc = roc_auc_score(yva, pva)
        if auc > best_auc:
            best_auc, best_pva = auc, pva
            best_pte = predict_with_best(bst, dte)
            best_desc = g.copy(); best_desc['best_it'] = getattr(bst, 'best_iteration', None); best_desc['secs'] = time.time()-t0
    print(f"Chain {ci} seed {seed}: best={best_desc} AUC={best_auc:.5f}", flush=True)
    return va_idx, best_pva, best_pte

with timer("Train BGE+XGB+robust kNN across chains and seeds"):
    SEEDS = [42, 1337, 2025]
    for seed in SEEDS:
        test_preds_per_chain = []
        for ci in chain_ids:
            res = train_one_chain_seed(ci, seed)
            if res is None: continue
            va_idx, pva, pte = res
            if seed == SEEDS[0]:
                oof[va_idx] = pva
            else:
                oof[va_idx] += pva
            val_mask[va_idx] = True
            test_preds_per_chain.append(pte)
        if len(test_preds_per_chain):
            test_preds_bag.append(np.mean(np.vstack(test_preds_per_chain), axis=0))

with timer("Evaluate and save BGE+XGB artifacts (reverted extras)"):
    if val_mask.any():
        oof_avg = oof.copy(); oof_avg[val_mask] = oof_avg[val_mask] / 3.0
        oof_auc_val = roc_auc_score(y[val_mask], oof_avg[val_mask])
        print("BGE+XGB OOF AUC (val rows only, seed-bag):", round(oof_auc_val, 6))
        np.save('oof_bge_xgb_fc.npy', oof_avg)
    else:
        print("Warning: no validation rows; OOF not computed.")
    if len(test_preds_bag):
        test_pred = np.mean(np.vstack(test_preds_bag), axis=0)
        np.save('test_bge_xgb_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_bge_xgb_fc.csv', index=False)
        print('Saved: oof_bge_xgb_fc.npy, test_bge_xgb_fc.npy, submission_bge_xgb_fc.csv')
    else:
        print('Warning: no test preds collected from seeds.')

[T0] Load train/test and prepare inputs (BGE) ...


Shapes | bge: (2878, 384) (1162, 384) meta: (2878, 15) (1162, 15) | e5: (2878, 768) (1162, 768)
[T+] Load train/test and prepare inputs (BGE) done in 0.23s


Chains detected (from manifest if available): [1, 2, 3]
[T0] Train BGE+XGB+robust kNN across chains and seeds ...


Chain 1 seed 42: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 111, 'secs': 0.5622589588165283} AUC=0.62390


Chain 2 seed 42: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 20, 'secs': 0.34802865982055664} AUC=0.70032


Chain 3 seed 42: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 8, 'best_it': 80, 'secs': 0.48722362518310547} AUC=0.63433


Chain 1 seed 1337: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 12, 'secs': 0.32511377334594727} AUC=0.61522


Chain 2 seed 1337: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 8, 'best_it': 39, 'secs': 0.3844156265258789} AUC=0.68168


Chain 3 seed 1337: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 58, 'secs': 0.45854854583740234} AUC=0.62802


Chain 1 seed 2025: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 19, 'secs': 0.3417055606842041} AUC=0.60752


Chain 2 seed 2025: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 18, 'secs': 0.3515007495880127} AUC=0.67218


Chain 3 seed 2025: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 8, 'best_it': 116, 'secs': 0.57157301902771} AUC=0.61422


[T+] Train BGE+XGB+robust kNN across chains and seeds done in 17.62s


[T0] Evaluate and save BGE+XGB artifacts (reverted extras) ...


BGE+XGB OOF AUC (val rows only, seed-bag): 0.614021
Saved: oof_bge_xgb_fc.npy, test_bge_xgb_fc.npy, submission_bge_xgb_fc.csv
[T+] Evaluate and save BGE+XGB artifacts (reverted extras) done in 0.00s


In [67]:
# kNN neighbor-rate features from E5 and BGE embeddings (per-chain, leak-safe); saves OOF/test features
import json, time
from pathlib import Path
import numpy as np, pandas as pd

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def timer_log(msg):
    print(f"[T0] {msg} ...", flush=True)
    return time.time()

def timer_done(t0, msg):
    print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

fold_dir = Path('folds')
mf = json.loads((fold_dir / 'manifest.json').read_text())
chains = [c['chain'] for c in mf['chains']]
print('Chains (manifest):', chains)

# Labels aligned to sorted-by-time order
tr = load_json_df('train.json')
time_col = mf.get('time_col', 'unix_timestamp_of_request_utc')
label_col = mf.get('label_col', 'requester_received_pizza')
tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values.astype(np.int8)
n = len(tr)

# Load normalized embeddings (already normalized in cache generation)
emb_dir = Path('emb_cache')
e5_tr = np.load(emb_dir / 'emb_e5_train.npy').astype(np.float32)
e5_te = np.load(emb_dir / 'emb_e5_test.npy').astype(np.float32)
bge_tr = np.load(emb_dir / 'emb_bge_train.npy').astype(np.float32)
bge_te = np.load(emb_dir / 'emb_bge_test.npy').astype(np.float32)
assert e5_tr.shape[0] == n and bge_tr.shape[0] == n
print('Emb shapes | e5:', e5_tr.shape, e5_te.shape, '| bge:', bge_tr.shape, bge_te.shape)

def knn_rate_foldsafe(emb_tr, emb_te, y, k=50):
    """Compute per-chain OOF neighbor success-rate and test neighbor success-rate.
    - For each chain c: use emb_tr[train_idx] as neighbor pool; for val_idx rows,
      compute cosine sims to pool, take top-k indices, average y of neighbors.
    - For test: use full training pool emb_tr (all rows) against emb_te.
    Returns:
      oof_rate: shape (n,), filled at validation indices only;
      test_rate: shape (len(test),).
    """
    oof_rate = np.zeros(n, dtype=np.float32)
    val_mask_all = np.zeros(n, dtype=bool)
    # Per-chain OOF
    for ci in chains:
        tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy")
        va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
        if len(va_idx) == 0:
            continue
        val_mask_all[va_idx] = True
        X_pool = emb_tr[tr_idx]  # (M, D)
        X_q = emb_tr[va_idx]     # (V, D)
        # Cosine sim via dot (embeddings are normalized)
        t0 = timer_log(f'Chain {ci} kNN-rate: pool {X_pool.shape}, queries {X_q.shape}, k={k}')
        sims = X_q @ X_pool.T  # (V, M)
        # top-k indices per row
        if k < sims.shape[1]:
            topk_idx = np.argpartition(-sims, kth=k-1, axis=1)[:, :k]
        else:
            topk_idx = np.argsort(-sims, axis=1)
        # Gather neighbor labels and average
        yn = y[tr_idx]
        rates = yn[topk_idx].mean(axis=1).astype(np.float32)
        oof_rate[va_idx] = rates
        timer_done(t0, f'Chain {ci} kNN-rate')
    # Test using full train as pool
    t0 = timer_log('Test kNN-rate using full train pool')
    sims_te = emb_te @ emb_tr.T  # (T, N)
    if k < sims_te.shape[1]:
        topk_te = np.argpartition(-sims_te, kth=k-1, axis=1)[:, :k]
    else:
        topk_te = np.argsort(-sims_te, axis=1)
    test_rate = y[topk_te].mean(axis=1).astype(np.float32)
    timer_done(t0, 'Test kNN-rate')
    return oof_rate, test_rate, val_mask_all

for name, (tr_emb, te_emb) in {
    'e5': (e5_tr, e5_te),
    'bge': (bge_tr, bge_te),
}.items():
    t0 = timer_log(f'Compute kNN-rate for {name}')
    oof_rate, test_rate, valmask = knn_rate_foldsafe(tr_emb, te_emb, y, k=50)
    # Save artifacts
    np.save(f'knnrate_{name}_oof.npy', oof_rate)
    np.save(f'knnrate_{name}_test.npy', test_rate)
    # Quick diagnostics
    try:
        from sklearn.metrics import roc_auc_score
        auc = roc_auc_score(y[valmask], oof_rate[valmask]) if valmask.any() else float('nan')
        print(f'{name} kNN-rate OOF AUC (val rows): {auc:.6f}')
    except Exception as e:
        print('AUC diag failed for', name, '|', repr(e))
    timer_done(t0, f'Compute kNN-rate for {name}')

print('Saved: knnrate_e5_oof.npy, knnrate_e5_test.npy, knnrate_bge_oof.npy, knnrate_bge_test.npy')
print('Next: append these features to E5/BGE/Meta legs and retrain, then reblend.')

Chains (manifest): [1, 2, 3]
Emb shapes | e5: (2878, 768) (1162, 768) | bge: (2878, 384) (1162, 384)
[T0] Compute kNN-rate for e5 ...


[T0] Chain 1 kNN-rate: pool (1727, 768), queries (562, 768), k=50 ...


[T+] Chain 1 kNN-rate done in 0.01s


[T0] Chain 2 kNN-rate: pool (2302, 768), queries (278, 768), k=50 ...


[T+] Chain 2 kNN-rate done in 0.01s


[T0] Chain 3 kNN-rate: pool (2590, 768), queries (268, 768), k=50 ...


[T+] Chain 3 kNN-rate done in 0.01s


[T0] Test kNN-rate using full train pool ...


[T+] Test kNN-rate done in 0.05s


e5 kNN-rate OOF AUC (val rows): 0.568997
[T+] Compute kNN-rate for e5 done in 0.08s


[T0] Compute kNN-rate for bge ...


[T0] Chain 1 kNN-rate: pool (1727, 384), queries (562, 384), k=50 ...


[T+] Chain 1 kNN-rate done in 0.01s


[T0] Chain 2 kNN-rate: pool (2302, 384), queries (278, 384), k=50 ...


[T+] Chain 2 kNN-rate done in 0.01s


[T0] Chain 3 kNN-rate: pool (2590, 384), queries (268, 384), k=50 ...


[T+] Chain 3 kNN-rate done in 0.01s


[T0] Test kNN-rate using full train pool ...


[T+] Test kNN-rate done in 0.03s


bge kNN-rate OOF AUC (val rows): 0.570418
[T+] Compute kNN-rate for bge done in 0.07s


Saved: knnrate_e5_oof.npy, knnrate_e5_test.npy, knnrate_bge_oof.npy, knnrate_bge_test.npy
Next: append these features to E5/BGE/Meta legs and retrain, then reblend.


In [122]:
# Leg B3: MPNet (multi-qa-mpnet-base-dot-v1) embeddings + XGBoost with robust kNN features (multi-k mean, softmax tau=0.12, recency lam=75, bayes, cross-model), per-chain std [REVERT EXTRAS; TIGHTER XGB]
import os, sys, json, time, subprocess, textwrap
from pathlib import Path
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
import xgboost as xgb

os.environ['PYTHONNOUSERSITE'] = '1'
sys.path = [p for p in sys.path if ('vendor_pkgs' not in str(p)) and ('.pip-target' not in str(p))]

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def per_row_meta(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    time_col_candidates = ['unix_timestamp_of_request_utc', 'request_timestamp', 'created_utc', 'timestamp', 'time']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates); bcol = first_col(body_col_candidates)
    time_col = first_col(time_col_candidates)
    t = df[tcol].fillna("").astype(str) if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("").astype(str) if bcol else pd.Series(["" for _ in range(len(df))])
    def wc(s): return s.str.split().apply(len).astype(np.int32)
    def cc(s): return s.str.len().astype(np.int32)
    title_wc = wc(t); body_wc = wc(b); title_cc = cc(t); body_cc = cc(b)
    ratio_tb = (title_cc / (body_cc + 1)).astype(np.float32)
    uniq_ratio = (b.str.lower().str.split().apply(lambda x: len(set(x)) / (len(x)+1e-6))).astype(np.float32)
    exclam = b.str.count('!').astype(np.int32); quest = b.str.count('\?').astype(np.int32)
    allcaps = b.apply(lambda s: sum(1 for w in s.split() if len(w)>=3 and w.isupper())).astype(np.int32)
    has_url = (b.str.contains('http', case=False, na=False) | b.str.contains('www\.', case=False, na=False)).astype(np.int8)
    has_dollar = b.str.contains('\u0024|\$', case=False, na=False).astype(np.int8)
    has_digit = b.str.contains('[0-9]', regex=True, na=False).astype(np.int8)
    ts = pd.to_numeric(df[time_col], errors='coerce').fillna(0).astype(np.int64) if time_col else pd.Series(np.zeros(len(df), dtype=np.int64))
    if ts.max() > 10_000_000_000: ts = (ts // 1_000_000_000).astype(np.int64)
    dt = pd.to_datetime(ts, unit='s', utc=True)
    month = dt.dt.month.astype(np.int16); wday = dt.dt.weekday.astype(np.int16); hour = dt.dt.hour.astype(np.int16)
    feats = np.vstack([
        title_wc, body_wc, title_cc, body_cc, ratio_tb, uniq_ratio,
        exclam, quest, allcaps, has_url, has_dollar, has_digit,
        month, wday, hour
    ]).T.astype(np.float32)
    return feats

def ensure_mpnet_embeddings():
    cache_dir = Path('emb_cache'); cache_dir.mkdir(exist_ok=True)
    tr_path = cache_dir / 'emb_mpnet_train.npy'
    te_path = cache_dir / 'emb_mpnet_test.npy'
    if tr_path.exists() and te_path.exists():
        return
    script_path = Path('mpnet_gen.py')
    code = textwrap.dedent('''
import json, numpy as np, pandas as pd, torch
from pathlib import Path
from sentence_transformers import SentenceTransformer

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def first_col(df, cols):
    for c in cols:
        if c in df.columns: return c
    return None

def main():
    cache_dir = Path('emb_cache'); cache_dir.mkdir(exist_ok=True)
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    tcol = first_col(tr, ['request_title','title'])
    bcol = first_col(tr, ['request_text','body','text'])
    t = tr[tcol].fillna('') if tcol else pd.Series(['']*len(tr))
    b = tr[bcol].fillna('') if bcol else pd.Series(['']*len(tr))
    tt = te[tcol].fillna('') if tcol in te.columns else pd.Series(['']*len(te))
    tb = te[bcol].fillna('') if bcol in te.columns else pd.Series(['']*len(te))
    tr_txt = (t + ' \n ' + b).astype(str).tolist()
    te_txt = (tt + ' \n ' + tb).astype(str).tolist()
    model = SentenceTransformer('multi-qa-mpnet-base-dot-v1', device='cuda' if torch.cuda.is_available() else 'cpu')
    model.max_seq_length = 512
    def enc(texts, bs=128):
        return np.asarray(model.encode(texts, batch_size=bs, normalize_embeddings=True, show_progress_bar=True), dtype=np.float32)
    emb_tr = enc(tr_txt); emb_te = enc(te_txt)
    np.save(cache_dir/'emb_mpnet_train.npy', emb_tr)
    np.save(cache_dir/'emb_mpnet_test.npy', emb_te)
    print('Saved MPNet embeddings:', emb_tr.shape, emb_te.shape)

if __name__ == '__main__':
    main()
''')
    script_path.write_text(code)
    venv_py = Path('.venv/bin/python')
    assert venv_py.exists(), 'Missing .venv python; run setup cell 1 first'
    subprocess.run([str(venv_py), '-m', 'pip', 'install', '--upgrade', 'pip', 'wheel', 'setuptools'], check=True)
    subprocess.run([str(venv_py), '-m', 'pip', 'install', 'pandas', 'numpy', 'sentence-transformers'], check=True)
    print('Generating MPNet embeddings via venv ...', flush=True)
    proc = subprocess.run([str(venv_py), str(script_path)], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    print(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError('MPNet embedding generation failed')

def _topk_idx(sims, k):
    if k < sims.shape[1]:
        return np.argpartition(-sims, kth=k-1, axis=1)[:, :k]
    else:
        return np.argsort(-sims, axis=1)

def knn_rate_mean(pool_emb, pool_y, query_emb, k=50, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    return pool_y[topk].mean(axis=1).astype(np.float32)

def knn_rate_softmax(pool_emb, pool_y, query_emb, k=50, tau=0.12, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    row_idx = np.arange(topk.shape[0])[:, None]
    s_top = sims[row_idx, topk]
    w = np.exp(s_top / max(1e-6, tau))
    w /= (w.sum(axis=1, keepdims=True) + 1e-9)
    y_top = pool_y[topk].astype(np.float32)
    return (w * y_top).sum(axis=1).astype(np.float32)

def knn_topk(pool_emb, query_emb, k=50, self_exclude=False):
    sims = query_emb @ pool_emb.T
    if self_exclude and query_emb.shape[0] == pool_emb.shape[0]:
        idx = np.arange(sims.shape[0]); sims[idx, idx] = -1e9
    topk = _topk_idx(sims, k)
    return topk, sims

def knn_rate_bayes(pool_emb, pool_y, query_emb, k=50, alpha=22.0, p_train=0.5, self_exclude=False):
    topk, sims = knn_topk(pool_emb, query_emb, k=k, self_exclude=self_exclude)
    y_top = pool_y[topk].astype(np.float32)
    s = y_top.sum(axis=1)
    denom = (k + alpha)
    return ((s + alpha * p_train) / denom).astype(np.float32)

def knn_rate_recency_decay(pool_emb, pool_y, query_emb, pool_ts_days, query_ts_days, k=50, lam_days=75.0, self_exclude=False):
    topk, sims = knn_topk(pool_emb, query_emb, k=k, self_exclude=self_exclude)
    row_idx = np.arange(topk.shape[0])[:, None]
    qd = query_ts_days[row_idx[:,0]][:, None]
    pdays = pool_ts_days[topk]
    gaps = np.maximum(qd - pdays, 0.0)
    w = np.exp(-gaps / max(lam_days, 1e-6))
    y_top = pool_y[topk].astype(np.float32)
    w_sum = w.sum(axis=1, keepdims=True) + 1e-9
    return ((w * y_top).sum(axis=1) / w_sum[:,0]).astype(np.float32)

from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True)
    try: yield
    finally: print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

with timer('Load train/test and prepare inputs (MPNet)'):
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    time_col = 'unix_timestamp_of_request_utc' if 'unix_timestamp_of_request_utc' in tr.columns else tr.columns[0]
    label_col = 'requester_received_pizza' if 'requester_received_pizza' in tr.columns else 'label'
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    ts_tr = pd.to_numeric(tr[time_col], errors='coerce').fillna(0).values.astype(np.int64)
    ts_te = pd.to_numeric(te[time_col], errors='coerce').fillna(0).values.astype(np.int64) if time_col in te.columns else np.zeros(len(te), dtype=np.int64)
    if ts_tr.max() > 10_000_000_000: ts_tr = ts_tr // 1_000_000_000
    if ts_te.max() > 10_000_000_000: ts_te = ts_te // 1_000_000_000
    ts_tr_days = (ts_tr / 86400.0).astype(np.float32)
    ts_te_days = (ts_te / 86400.0).astype(np.float32)
    ensure_mpnet_embeddings()
    mp_tr = np.load('emb_cache/emb_mpnet_train.npy').astype(np.float32)
    mp_te = np.load('emb_cache/emb_mpnet_test.npy').astype(np.float32)
    e5_tr = np.load('emb_cache/emb_e5_train.npy').astype(np.float32)
    e5_te = np.load('emb_cache/emb_e5_test.npy').astype(np.float32)
    bge_tr = np.load('emb_cache/emb_bge_train.npy').astype(np.float32)
    bge_te = np.load('emb_cache/emb_bge_test.npy').astype(np.float32)
    meta_tr = per_row_meta(tr); meta_te = per_row_meta(te)
    print('Shapes | mpnet:', mp_tr.shape, mp_te.shape, 'meta:', meta_tr.shape, meta_te.shape, '| e5:', e5_tr.shape, '| bge:', bge_tr.shape)

fold_dir = Path('folds')
manifest_path = fold_dir / 'manifest.json'
if manifest_path.exists():
    mf = json.loads(manifest_path.read_text())
    chain_ids = [c['chain'] for c in mf.get('chains', [])]
else:
    val_files = sorted(fold_dir.glob('fc_chain*_val_idx.npy'))
    chain_ids = sorted(int(p.stem.split('chain')[1].split('_')[0]) for p in val_files)
print('Chains detected (from manifest if available):', chain_ids)

oof = np.zeros(len(tr), dtype=np.float32)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_bag = []

# Tighter XGB regularization
param_grid = [
    dict(max_depth=4, eta=0.08, min_child_weight=5),
    dict(max_depth=4, eta=0.05, min_child_weight=5),
    dict(max_depth=4, eta=0.08, min_child_weight=8),
]

def predict_with_best(bst, dmat):
    bi = getattr(bst, 'best_iteration', None)
    if bi is not None:
        return bst.predict(dmat, iteration_range=(0, int(bi)+1))
    return bst.predict(dmat)

def standardize_knn_feats(tr_mat, va_mat, te_mat):
    mu = tr_mat.mean(axis=0); sd = tr_mat.std(axis=0); sd = np.where(sd < 1e-6, 1.0, sd)
    tr_s = (tr_mat - mu) / sd
    va_s = (va_mat - mu) / sd
    te_s = (te_mat - mu) / sd
    return tr_s.astype(np.float32), va_s.astype(np.float32), te_s.astype(np.float32)

def train_one_chain_seed(ci, seed):
    tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    if len(va_idx) == 0:
        print(f'Chain {ci} seed {seed}: empty val; skip'); return None, None, None
    Xtr_emb, ytr = mp_tr[tr_idx], y[tr_idx]
    Xva_emb, yva = mp_tr[va_idx], y[va_idx]
    # Same-space kNN features (k=20/50/100 mean, softmax tau=0.12, cross-model (E5), recency-decay 75d, Bayes)
    kn_tr_k20 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=20, self_exclude=True).reshape(-1,1)
    kn_va_k20 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=20, self_exclude=False).reshape(-1,1)
    kn_tr_k50 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=50, self_exclude=True).reshape(-1,1)
    kn_va_k50 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=50, self_exclude=False).reshape(-1,1)
    kn_tr_k100 = knn_rate_mean(Xtr_emb, ytr, Xtr_emb, k=100, self_exclude=True).reshape(-1,1)
    kn_va_k100 = knn_rate_mean(Xtr_emb, ytr, Xva_emb, k=100, self_exclude=False).reshape(-1,1)
    kn_tr_soft = knn_rate_softmax(Xtr_emb, ytr, Xtr_emb, k=50, tau=0.12, self_exclude=True).reshape(-1,1)
    kn_va_soft = knn_rate_softmax(Xtr_emb, ytr, Xva_emb, k=50, tau=0.12, self_exclude=False).reshape(-1,1)
    # Cross-model (E5 space)
    kn_tr_cross = knn_rate_mean(e5_tr[tr_idx], ytr, e5_tr[tr_idx], k=50, self_exclude=True).reshape(-1,1)
    kn_va_cross = knn_rate_mean(e5_tr[tr_idx], ytr, e5_tr[va_idx], k=50, self_exclude=False).reshape(-1,1)
    # Recency-decayed k=50 (lam_days=75.0)
    tr_days = ts_tr_days[tr_idx]; va_days = ts_tr_days[va_idx]
    kn_tr_dec = knn_rate_recency_decay(Xtr_emb, ytr, Xtr_emb, tr_days, tr_days, k=50, lam_days=75.0, self_exclude=True).reshape(-1,1)
    kn_va_dec = knn_rate_recency_decay(Xtr_emb, ytr, Xva_emb, tr_days, va_days, k=50, lam_days=75.0, self_exclude=False).reshape(-1,1)
    # Bayesian-smoothed k=50
    p_train = float(ytr.mean())
    kn_tr_bayes = knn_rate_bayes(Xtr_emb, ytr, Xtr_emb, k=50, alpha=22.0, p_train=p_train, self_exclude=True).reshape(-1,1)
    kn_va_bayes = knn_rate_bayes(Xtr_emb, ytr, Xva_emb, k=50, alpha=22.0, p_train=p_train, self_exclude=False).reshape(-1,1)
    # Test features with full train pool
    kn_te_k20  = knn_rate_mean(mp_tr, y, mp_te, k=20,  self_exclude=False).reshape(-1,1)
    kn_te_k50  = knn_rate_mean(mp_tr, y, mp_te, k=50,  self_exclude=False).reshape(-1,1)
    kn_te_k100 = knn_rate_mean(mp_tr, y, mp_te, k=100, self_exclude=False).reshape(-1,1)
    kn_te_soft = knn_rate_softmax(mp_tr, y, mp_te, k=50, tau=0.12, self_exclude=False).reshape(-1,1)
    kn_te_cross = knn_rate_mean(e5_tr, y, e5_te, k=50, self_exclude=False).reshape(-1,1)
    kn_te_dec = knn_rate_recency_decay(mp_tr, y, mp_te, ts_tr_days, ts_te_days, k=50, lam_days=75.0, self_exclude=False).reshape(-1,1)
    kn_te_bayes = knn_rate_bayes(mp_tr, y, mp_te, k=50, alpha=22.0, p_train=float(y.mean()), self_exclude=False).reshape(-1,1)
    # Stack and standardize kNN features (7 features)
    K_tr = np.hstack([kn_tr_k20, kn_tr_k50, kn_tr_k100, kn_tr_soft, kn_tr_cross, kn_tr_dec, kn_tr_bayes])
    K_va = np.hstack([kn_va_k20, kn_va_k50, kn_va_k100, kn_va_soft, kn_va_cross, kn_va_dec, kn_va_bayes])
    K_te = np.hstack([kn_te_k20, kn_te_k50, kn_te_k100, kn_te_soft, kn_te_cross, kn_te_dec, kn_te_bayes])
    K_tr_s, K_va_s, K_te_s = standardize_knn_feats(K_tr, K_va, K_te)
    # Final matrices
    Xtr = np.hstack([Xtr_emb, meta_tr[tr_idx], K_tr_s]).astype(np.float32)
    Xva = np.hstack([Xva_emb, meta_tr[va_idx], K_va_s]).astype(np.float32)
    Xte = np.hstack([mp_te,   meta_te,        K_te_s]).astype(np.float32)
    dtr = xgb.DMatrix(Xtr, label=ytr); dva = xgb.DMatrix(Xva, label=yva); dte = xgb.DMatrix(Xte)
    pos = int((ytr==1).sum()); neg = int((ytr==0).sum()); spw = float(neg)/max(1.0, float(pos))
    base = dict(tree_method='gpu_hist', objective='binary:logistic', eval_metric='auc',
                subsample=0.8, colsample_bytree=0.8, reg_lambda=5.0, reg_alpha=0.2,
                scale_pos_weight=spw, seed=seed)
    best_auc=-1.0; best_pva=None; best_pte=None; best_desc=None
    for g in param_grid:
        params = base.copy(); params.update(g)
        t0=time.time()
        bst = xgb.train(params, dtr, num_boost_round=4000, evals=[(dva,'val')], early_stopping_rounds=120, verbose_eval=False)
        pva = predict_with_best(bst, dva)
        auc = roc_auc_score(yva, pva)
        if auc > best_auc:
            best_auc, best_pva = auc, pva
            best_pte = predict_with_best(bst, dte)
            best_desc = g.copy(); best_desc['best_it']=getattr(bst,'best_iteration',None); best_desc['secs']=time.time()-t0
    print(f'Chain {ci} seed {seed}: best={best_desc} AUC={best_auc:.5f}', flush=True)
    return va_idx, best_pva, best_pte

with timer('Train MPNet+XGB+robust kNN across chains and seeds'):
    SEEDS=[42,1337,2025]
    for seed in SEEDS:
        test_preds_per_chain=[]
        for ci in chain_ids:
            res = train_one_chain_seed(ci, seed)
            if res is None: continue
            va_idx, pva, pte = res
            if seed == SEEDS[0]:
                oof[va_idx] = pva
            else:
                oof[va_idx] += pva
            val_mask[va_idx] = True
            test_preds_per_chain.append(pte)
        if len(test_preds_per_chain):
            test_preds_bag.append(np.mean(np.vstack(test_preds_per_chain), axis=0))

with timer('Evaluate and save MPNet+XGB artifacts (reverted extras)'):
    if val_mask.any():
        oof_avg = oof.copy(); oof_avg[val_mask] = oof_avg[val_mask]/3.0
        print('MPNet+XGB OOF AUC (val rows only, seed-bag):', round(roc_auc_score(y[val_mask], oof_avg[val_mask]),6))
        np.save('oof_mpnet_xgb_fc.npy', oof_avg)
    else:
        print('Warning: no validation rows; OOF not computed.')
    if len(test_preds_bag):
        test_pred = np.mean(np.vstack(test_preds_bag), axis=0)
        np.save('test_mpnet_xgb_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_mpnet_xgb_fc.csv', index=False)
        print('Saved: oof_mpnet_xgb_fc.npy, test_mpnet_xgb_fc.npy, submission_mpnet_xgb_fc.csv')
    else:
        print('Warning: no test preds collected.')

[T0] Load train/test and prepare inputs (MPNet) ...


Shapes | mpnet: (2878, 768) (1162, 768) meta: (2878, 15) (1162, 15) | e5: (2878, 768) | bge: (2878, 384)
[T+] Load train/test and prepare inputs (MPNet) done in 0.23s


Chains detected (from manifest if available): [1, 2, 3]
[T0] Train MPNet+XGB+robust kNN across chains and seeds ...


Chain 1 seed 42: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 8, 'best_it': 5, 'secs': 0.42093634605407715} AUC=0.61705


Chain 2 seed 42: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 8, 'secs': 0.4458937644958496} AUC=0.65417


Chain 3 seed 42: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 6, 'secs': 0.4651782512664795} AUC=0.61171


Chain 1 seed 1337: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 11, 'secs': 0.45186328887939453} AUC=0.63065


Chain 2 seed 1337: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 8, 'best_it': 6, 'secs': 0.4221508502960205} AUC=0.63841


Chain 3 seed 1337: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 19, 'secs': 0.5003876686096191} AUC=0.59252


Chain 1 seed 2025: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 14, 'secs': 0.46810150146484375} AUC=0.60003


Chain 2 seed 2025: best={'max_depth': 4, 'eta': 0.05, 'min_child_weight': 5, 'best_it': 16, 'secs': 0.4728114604949951} AUC=0.66742


Chain 3 seed 2025: best={'max_depth': 4, 'eta': 0.08, 'min_child_weight': 5, 'best_it': 31, 'secs': 0.5465712547302246} AUC=0.62047


[T+] Train MPNet+XGB+robust kNN across chains and seeds done in 20.28s


[T0] Evaluate and save MPNet+XGB artifacts (reverted extras) ...


MPNet+XGB OOF AUC (val rows only, seed-bag): 0.589321
Saved: oof_mpnet_xgb_fc.npy, test_mpnet_xgb_fc.npy, submission_mpnet_xgb_fc.csv
[T+] Evaluate and save MPNet+XGB artifacts (reverted extras) done in 0.00s


In [131]:
# New Leg: RoBERTa-base fine-tuning (title+body pairs) per forward chain via venv Trainer; caches OOF/test
import os, json, textwrap, subprocess
from pathlib import Path

# Set SMOKE=0 for full run (all chains, 3 epochs, max_len=384, seeds [42,1337]); set to '1' only for quick smoke.
os.environ['SMOKE'] = os.environ.get('SMOKE', '0')  # force default to full run unless explicitly set

script = Path('roberta_ft_fc.py')
code = textwrap.dedent('''
import os, json, numpy as np, pandas as pd, torch, math, random, glob
from pathlib import Path
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments,
                          DataCollatorWithPadding, EarlyStoppingCallback)
from sklearn.metrics import roc_auc_score

def set_seed(seed: int):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed);
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def load_json_df(path):
    try: return pd.read_json(path, lines=True)
    except ValueError:
        try: return pd.read_json(path, lines=False)
        except ValueError:
            with open(path,'r',encoding='utf-8') as f: data=json.load(f)
            if isinstance(data, dict) and 'data' in data: data=data['data']
            return pd.json_normalize(data)

def first_col(df, cols):
    for c in cols:
        if c in df.columns: return c
    return None

def extract_title_body(df):
    tcol = first_col(df, ['request_title','title'])
    bcol = first_col(df, ['request_text','body','text'])
    t = df[tcol].fillna('') if tcol else pd.Series(['']*len(df))
    b = df[bcol].fillna('') if bcol else pd.Series(['']*len(df))
    return t.astype(str).tolist(), b.astype(str).tolist()

class PairDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.enc = encodings
        self.labels = None if labels is None else np.array(labels, dtype=np.int64)
    def __len__(self):
        return len(self.enc['input_ids'])
    def __getitem__(self, i):
        item = {k: torch.tensor(self.enc[k][i]) for k in self.enc.keys()}
        if self.labels is not None:
            item['labels'] = torch.tensor(int(self.labels[i]))
        return item

def rank01(x: np.ndarray) -> np.ndarray:
    order = np.argsort(x)
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(x))
    r = ranks / max(1.0, (len(x) - 1))
    return np.clip(r, 0.0, 1.0).astype(np.float32)

def softmax_np(logits: np.ndarray) -> np.ndarray:
    if logits.ndim == 1:
        z = logits - np.max(logits)
        e = np.exp(z)
        return e / (e.sum() + 1e-12)
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / (e.sum(axis=1, keepdims=True) + 1e-12)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    if logits.ndim == 2 and logits.shape[1] == 2:
        probs1 = softmax_np(logits)[:,1]
    else:
        # binary single-logit case
        probs1 = 1.0/(1.0+np.exp(-logits.squeeze()))
    try: auc = roc_auc_score(labels, probs1)
    except Exception: auc = 0.5
    return {'auc': float(auc)}

def latest_checkpoint_dir(outdir: Path):
    cks = sorted([p for p in outdir.glob('checkpoint-*') if p.is_dir()], key=lambda p: int(p.name.split('-')[-1]))
    return cks[-1] if len(cks) else None

def main():
    fold_dir = Path('folds')
    mf = json.loads((fold_dir/'manifest.json').read_text())
    chains_all = [c['chain'] for c in mf['chains']]
    tr = load_json_df('train.json')
    te = load_json_df('test.json')
    time_col = mf.get('time_col','unix_timestamp_of_request_utc')
    label_col = mf.get('label_col','requester_received_pizza')
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    tr_titles, tr_bodies = extract_title_body(tr)
    te_titles, te_bodies = extract_title_body(te)

    smoke = os.environ.get('SMOKE','0') == '1'
    pred_only = os.environ.get('PRED_ONLY','0') == '1'
    if smoke:
        chains = [max(chains_all)]  # C3 only
        max_length = 256
        num_epochs = 1
        seeds = [42]
    else:
        chains = chains_all
        max_length = 384
        num_epochs = 3
        seeds = [42, 1337]

    model_name = 'roberta-base'
    tok = AutoTokenizer.from_pretrained(model_name)
    collate = DataCollatorWithPadding(tokenizer=tok)

    oof = np.zeros(len(tr), dtype=np.float32)
    val_mask = np.zeros(len(tr), dtype=bool)
    test_pred_seeds = []  # each entry: rank-avg across chains for a seed

    for seed in seeds:
        set_seed(seed)
        test_preds_per_chain = []
        for ci in chains:
            tr_idx = np.load(fold_dir / f'fc_chain{ci}_train_idx.npy')
            va_idx = np.load(fold_dir / f'fc_chain{ci}_val_idx.npy')
            if len(va_idx) == 0:
                continue
            val_mask[va_idx] = True
            enc_tr = tok([tr_titles[i] for i in tr_idx], [tr_bodies[i] for i in tr_idx], truncation=True, padding=False, max_length=max_length)
            enc_va = tok([tr_titles[i] for i in va_idx], [tr_bodies[i] for i in va_idx], truncation=True, padding=False, max_length=max_length)
            enc_te = tok(te_titles, te_bodies, truncation=True, padding=False, max_length=max_length)
            ds_tr = PairDataset(enc_tr, y[tr_idx])
            ds_va = PairDataset(enc_va, y[va_idx])
            ds_te = PairDataset(enc_te, None)
            out_dir = Path(f'./roberta_fc_c{ci}_s{seed}')
            out_dir.mkdir(exist_ok=True, parents=True)
            ckpt = latest_checkpoint_dir(out_dir)
            model = None
            if pred_only and (ckpt is not None):
                model = AutoModelForSequenceClassification.from_pretrained(str(ckpt))
            else:
                model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
                if hasattr(model, 'gradient_checkpointing_enable'):
                    model.gradient_checkpointing_enable()
                eff_bs = 16 * 2
                eval_steps = max(100, len(tr_idx) // max(1, eff_bs*2))
                args = TrainingArguments(
                    output_dir=str(out_dir),
                    learning_rate=2e-5,
                    weight_decay=0.01,
                    lr_scheduler_type='cosine',
                    warmup_ratio=0.06,
                    per_device_train_batch_size=16,
                    gradient_accumulation_steps=2,
                    per_device_eval_batch_size=64,
                    num_train_epochs=num_epochs,
                    evaluation_strategy='steps',
                    eval_steps=eval_steps,
                    save_strategy='steps',
                    save_steps=eval_steps,
                    save_total_limit=1,
                    logging_steps=max(50, eval_steps//2),
                    load_best_model_at_end=True,
                    metric_for_best_model='auc',
                    greater_is_better=True,
                    fp16=torch.cuda.is_available(),
                    disable_tqdm=True,
                    seed=seed
                )
                trainer = Trainer(
                    model=model,
                    args=args,
                    train_dataset=ds_tr,
                    eval_dataset=ds_va,
                    tokenizer=tok,
                    data_collator=collate,
                    compute_metrics=compute_metrics,
                    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]
                )
                trainer.train()
                ckpt = latest_checkpoint_dir(out_dir) or out_dir
                model.save_pretrained(str(ckpt))
            # Predict val/test with best/last checkpoint
            model.eval()
            trainer_pred = Trainer(model=model, tokenizer=tok, data_collator=collate)
            logits_va = trainer_pred.predict(ds_va).predictions
            if logits_va.ndim == 2 and logits_va.shape[1] == 2:
                pva = softmax_np(logits_va)[:,1].astype(np.float32)
            else:
                pva = (1.0/(1.0+np.exp(-logits_va.squeeze()))).astype(np.float32)
            oof[va_idx] = (oof[va_idx] + pva) if (seed != seeds[0]) else pva
            try:
                auc = roc_auc_score(y[va_idx], pva)
                print(f'Chain {ci} seed {seed}: val AUC={auc:.6f}')
            except Exception:
                pass
            logits_te = trainer_pred.predict(ds_te).predictions
            if logits_te.ndim == 2 and logits_te.shape[1] == 2:
                pte = softmax_np(logits_te)[:,1].astype(np.float32)
            else:
                pte = (1.0/(1.0+np.exp(-logits_te.squeeze()))).astype(np.float32)
            test_preds_per_chain.append(rank01(pte))
        if len(test_preds_per_chain):
            test_rank_seed = np.mean(np.vstack(test_preds_per_chain), axis=0)
            test_pred_seeds.append(test_rank_seed.astype(np.float32))

    if len(seeds) > 1:
        oof[val_mask] = oof[val_mask] / float(len(seeds))

    if val_mask.any():
        try:
            auc_oof = roc_auc_score(y[val_mask], oof[val_mask])
            print('RoBERTa FT OOF AUC (val rows):', round(float(auc_oof),6))
        except Exception:
            pass
    np.save('oof_roberta_ft_fc.npy', oof.astype(np.float32))
    if len(test_pred_seeds):
        test_rank_final = np.mean(np.vstack(test_pred_seeds), axis=0)
        np.save('test_roberta_ft_fc.npy', test_rank_final.astype(np.float32))
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_rank_final, 0.01, 0.99)
        sub.to_csv('submission_roberta_ft_fc.csv', index=False)
        print('Saved: oof_roberta_ft_fc.npy, test_roberta_ft_fc.npy, submission_roberta_ft_fc.csv')

if __name__ == '__main__':
    main()
''')
script.write_text(code)
print('Wrote roberta_ft_fc.py')

# Run via venv (has working torch/cu121 and HF stack)
venv_py = Path('.venv/bin/python')
assert venv_py.exists(), 'Missing .venv python; run setup cell 1 first'
env = os.environ.copy()
env['SMOKE'] = '0'  # force full run in subprocess regardless of parent env
env['PRED_ONLY'] = env.get('PRED_ONLY', '1')  # default to predict-only to quickly fix probability extraction
print('SMOKE mode =', env.get('SMOKE','0'), '| PRED_ONLY =', env.get('PRED_ONLY','0'))
print('Launching fine-tune/predict with', venv_py)
proc = subprocess.run([str(venv_py), str(script)], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, env=env)
print(proc.stdout)
if proc.returncode != 0:
    raise RuntimeError('RoBERTa FT/predict failed')
print('RoBERTa FT leg updated (softmax probs). Re-run Cell 12 to reblend.')

Wrote roberta_ft_fc.py
SMOKE mode = 0 | PRED_ONLY = 1
Launching fine-tune/predict with .venv/bin/python



  0%|          | 0/71 [00:00<?, ?it/s]
  7%|▋         | 5/71 [00:00<00:01, 39.16it/s]
 13%|█▎        | 9/71 [00:00<00:01, 33.02it/s]
 18%|█▊        | 13/71 [00:00<00:02, 27.47it/s]
 23%|██▎       | 16/71 [00:00<00:02, 27.17it/s]
 28%|██▊       | 20/71 [00:00<00:01, 29.24it/s]
 34%|███▍      | 24/71 [00:00<00:01, 30.71it/s]
 39%|███▉      | 28/71 [00:00<00:01, 31.13it/s]
 45%|████▌     | 32/71 [00:01<00:01, 32.60it/s]
 51%|█████     | 36/71 [00:01<00:01, 28.68it/s]
 55%|█████▍    | 39/71 [00:01<00:01, 28.02it/s]
 59%|█████▉    | 42/71 [00:01<00:01, 27.98it/s]
 63%|██████▎   | 45/71 [00:01<00:00, 26.68it/s]
 69%|██████▉   | 49/71 [00:01<00:00, 26.15it/s]
 75%|███████▍  | 53/71 [00:01<00:00, 26.43it/s]
 79%|███████▉  | 56/71 [00:01<00:00, 25.08it/s]
 83%|████████▎ | 59/71 [00:02<00:00, 25.20it/s]
 87%|████████▋ | 62/71 [00:02<00:00, 25.29it/s]
 92%|█████████▏| 65/71 [00:02<00:00, 24.01it/s]
 96%|█████████▌| 68/71 [00:02<00:00, 22.85it/s]
100%|██████████| 71/71 [00:02<00:00, 27.13it/s]
Ch

In [135]:
# New Leg: DeBERTa-v3-base fine-tuning (title+body pairs), forward-chaining, 2 seeds; caches OOF/test
import os, json, textwrap, subprocess
from pathlib import Path

script = Path('deberta_ft_fc.py')
code = textwrap.dedent('''
import os, json, numpy as np, pandas as pd, torch, random
from pathlib import Path
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments,
                          DataCollatorWithPadding, EarlyStoppingCallback)
from sklearn.metrics import roc_auc_score

def set_seed(seed: int):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

def load_json_df(path):
    try: return pd.read_json(path, lines=True)
    except ValueError:
        try: return pd.read_json(path, lines=False)
        except ValueError:
            with open(path,'r',encoding='utf-8') as f: data=json.load(f)
            if isinstance(data, dict) and 'data' in data: data=data['data']
            return pd.json_normalize(data)

def first_col(df, cols):
    for c in cols:
        if c in df.columns: return c
    return None

def extract_title_body(df):
    tcol = first_col(df, ['request_title','title'])
    bcol = first_col(df, ['request_text','body','text'])
    t = df[tcol].fillna('') if tcol else pd.Series(['']*len(df))
    b = df[bcol].fillna('') if bcol else pd.Series(['']*len(df))
    return t.astype(str).tolist(), b.astype(str).tolist()

class PairDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.enc = encodings
        self.labels = None if labels is None else np.array(labels, dtype=np.int64)
    def __len__(self):
        return len(self.enc['input_ids'])
    def __getitem__(self, i):
        item = {k: torch.tensor(self.enc[k][i]) for k in self.enc.keys()}
        if self.labels is not None:
            item['labels'] = torch.tensor(int(self.labels[i]))
        return item

def softmax_np(logits: np.ndarray) -> np.ndarray:
    if logits.ndim == 1:
        z = logits - np.max(logits); e = np.exp(z); return e / (e.sum() + 1e-12)
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / (e.sum(axis=1, keepdims=True) + 1e-12)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    p1 = softmax_np(logits)[:,1] if (logits.ndim==2 and logits.shape[1]==2) else 1.0/(1.0+np.exp(-logits.squeeze()))
    try: auc = roc_auc_score(labels, p1)
    except Exception: auc = 0.5
    return {'auc': float(auc)}

def latest_checkpoint_dir(outdir: Path):
    cks = sorted([p for p in outdir.glob('checkpoint-*') if p.is_dir()], key=lambda p: int(p.name.split('-')[-1]))
    return cks[-1] if len(cks) else None

def main():
    fold_dir = Path('folds')
    mf = json.loads((fold_dir/'manifest.json').read_text())
    chains_all = [c['chain'] for c in mf['chains']]
    tr = load_json_df('train.json'); te = load_json_df('test.json')
    time_col = mf.get('time_col','unix_timestamp_of_request_utc')
    label_col = mf.get('label_col','requester_received_pizza')
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    tr_titles, tr_bodies = extract_title_body(tr); te_titles, te_bodies = extract_title_body(te)

    max_length = int(os.environ.get('MAX_LEN','384'))
    seeds = [int(s) for s in os.environ.get('SEEDS','42,1337').split(',')]

    model_name = 'microsoft/deberta-v3-base'
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    collate = DataCollatorWithPadding(tokenizer=tok)

    oof = np.zeros(len(tr), dtype=np.float32); val_mask = np.zeros(len(tr), dtype=bool)
    test_pred_seeds = []

    for seed in seeds:
        set_seed(seed)
        test_preds_per_chain = []
        for ci in chains_all:
            tr_idx = np.load(fold_dir / f'fc_chain{ci}_train_idx.npy')
            va_idx = np.load(fold_dir / f'fc_chain{ci}_val_idx.npy')
            if len(va_idx)==0: continue
            val_mask[va_idx] = True
            enc_tr = tok([tr_titles[i] for i in tr_idx], [tr_bodies[i] for i in tr_idx], truncation=True, padding=False, max_length=max_length)
            enc_va = tok([tr_titles[i] for i in va_idx], [tr_bodies[i] for i in va_idx], truncation=True, padding=False, max_length=max_length)
            enc_te = tok(te_titles, te_bodies, truncation=True, padding=False, max_length=max_length)
            ds_tr = PairDataset(enc_tr, y[tr_idx]); ds_va = PairDataset(enc_va, y[va_idx]); ds_te = PairDataset(enc_te, None)
            out_dir = Path(f'./deberta_fc_c{ci}_s{seed}'); out_dir.mkdir(exist_ok=True, parents=True)
            model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
            if hasattr(model, 'gradient_checkpointing_enable'): model.gradient_checkpointing_enable()
            eff_bs = 16*2
            eval_steps = max(100, len(tr_idx)//max(1, eff_bs*2))
            args = TrainingArguments(
                output_dir=str(out_dir),
                learning_rate=float(os.environ.get('LR','1.5e-5')),
                weight_decay=0.01,
                lr_scheduler_type='cosine',
                warmup_ratio=float(os.environ.get('WARMUP','0.10')),
                per_device_train_batch_size=16,
                gradient_accumulation_steps=2,
                per_device_eval_batch_size=64,
                num_train_epochs=int(os.environ.get('EPOCHS','3')),
                evaluation_strategy='steps',
                eval_steps=eval_steps,
                save_strategy='steps',
                save_steps=eval_steps,
                save_total_limit=1,
                logging_steps=max(50, eval_steps//2),
                load_best_model_at_end=True,
                metric_for_best_model='auc',
                greater_is_better=True,
                fp16=torch.cuda.is_available(),
                disable_tqdm=True,
                seed=seed
            )
            trainer = Trainer(model=model, args=args, train_dataset=ds_tr, eval_dataset=ds_va, tokenizer=tok,
                              data_collator=collate, compute_metrics=compute_metrics,
                              callbacks=[EarlyStoppingCallback(early_stopping_patience=1)])
            trainer.train()
            # Predict best
            model.eval(); pred_trainer = Trainer(model=model, tokenizer=tok, data_collator=collate)
            logits_va = pred_trainer.predict(ds_va).predictions
            pva = softmax_np(logits_va)[:,1].astype(np.float32) if (logits_va.ndim==2 and logits_va.shape[1]==2) else (1.0/(1.0+np.exp(-logits_va.squeeze()))).astype(np.float32)
            oof[va_idx] = (oof[va_idx]+pva) if (seed!=seeds[0]) else pva
            try: print(f'Chain {ci} seed {seed}: val AUC={roc_auc_score(y[va_idx], pva):.6f}')
            except Exception: pass
            logits_te = pred_trainer.predict(ds_te).predictions
            pte = softmax_np(logits_te)[:,1].astype(np.float32) if (logits_te.ndim==2 and logits_te.shape[1]==2) else (1.0/(1.0+np.exp(-logits_te.squeeze()))).astype(np.float32)
            # rank per chain then average
            order = np.argsort(pte); ranks = np.empty_like(order, dtype=np.float64); ranks[order] = np.arange(len(pte)); ranks = ranks/ max(1,(len(pte)-1));
            test_preds_per_chain.append(ranks.astype(np.float32))
        if len(test_preds_per_chain):
            test_pred_seeds.append(np.mean(np.vstack(test_preds_per_chain), axis=0).astype(np.float32))

    if len(seeds)>1: oof[val_mask] = oof[val_mask]/float(len(seeds))
    if val_mask.any():
        try: print('DeBERTa FT OOF AUC (val rows):', round(float(roc_auc_score(y[val_mask], oof[val_mask])),6))
        except Exception: pass
    np.save('oof_deberta_ft_fc.npy', oof.astype(np.float32))
    if len(test_pred_seeds):
        test_rank_final = np.mean(np.vstack(test_pred_seeds), axis=0).astype(np.float32)
        np.save('test_deberta_ft_fc.npy', test_rank_final)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_rank_final, 0.01, 0.99)
        sub.to_csv('submission_deberta_ft_fc.csv', index=False)
        print('Saved: oof_deberta_ft_fc.npy, test_deberta_ft_fc.npy, submission_deberta_ft_fc.csv')

if __name__ == '__main__':
    main()
''')
script.write_text(code)
print('Wrote deberta_ft_fc.py')

# Launch training via venv
venv_py = Path('.venv/bin/python')
assert venv_py.exists(), 'Missing .venv python; run the venv setup cell 1 first'
env = os.environ.copy()
env.setdefault('EPOCHS','3'); env.setdefault('LR','1.5e-5'); env.setdefault('WARMUP','0.10'); env.setdefault('MAX_LEN','384'); env.setdefault('SEEDS','42,1337')
print('Launching DeBERTa FT with', venv_py, '| MAX_LEN=', env['MAX_LEN'], '| SEEDS=', env['SEEDS'])
proc = subprocess.run([str(venv_py), str(script)], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, env=env)
print(proc.stdout)
if proc.returncode != 0:
    raise RuntimeError('DeBERTa FT failed')
print('DeBERTa FT leg complete. Next: modify Cell 12 to include deberta_ft_fc and reblend.')

Wrote deberta_ft_fc.py
Launching DeBERTa FT with .venv/bin/python | MAX_LEN= 384 | SEEDS= 42,1337


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

In [141]:
# Leg: NB-SVM (word+char TF-IDF, title x3 + body), forward-chain, per-chain vectorizers, C in {0.5,1,2,4}
import json, re, time, gc
from pathlib import Path
import numpy as np, pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def load_json_df(path):
    try:
        return pd.read_json(path, lines=True)
    except ValueError:
        try:
            return pd.read_json(path, lines=False)
        except ValueError:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            if isinstance(data, dict) and 'data' in data:
                data = data['data']
            return pd.json_normalize(data)

def build_text(df):
    title_col_candidates = ['request_title', 'title']
    body_col_candidates = ['request_text', 'body', 'text']
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(title_col_candidates); bcol = first_col(body_col_candidates)
    t = df[tcol].fillna("") if tcol else pd.Series(["" for _ in range(len(df))])
    b = df[bcol].fillna("") if bcol else pd.Series(["" for _ in range(len(df))])
    # lowercase; normalize URLs and digits
    t = t.astype(str).str.lower(); b = b.astype(str).str.lower()
    url_pat = r'https?://\S+|www\.[^\s]+'
    t = t.str.replace(url_pat, ' URL ', regex=True).str.replace(r'\d', '0', regex=True)
    b = b.str.replace(url_pat, ' URL ', regex=True).str.replace(r'\d', '0', regex=True)
    # Up-weight title by repeating x3
    return (t + ' ' + t + ' ' + t + ' ' + b)

def nb_log_ratio_from_matrix(Xtr, ytr, alpha=1.0):
    # Compute log P(w|pos) - log P(w|neg) using per-class summed TF-IDF weights with additive smoothing.
    ytr = np.asarray(ytr).astype(np.int8)
    pos_idx = np.where(ytr == 1)[0]
    neg_idx = np.where(ytr == 0)[0]
    Xc = Xtr.tocsc()
    pos_counts = np.array(Xc[pos_idx, :].sum(axis=0)).ravel().astype(np.float64)
    neg_counts = np.array(Xc[neg_idx, :].sum(axis=0)).ravel().astype(np.float64)
    pos_counts += alpha; neg_counts += alpha
    pos_probs = pos_counts / (pos_counts.sum() + 1e-12)
    neg_probs = neg_counts / (neg_counts.sum() + 1e-12)
    r = np.log(pos_probs + 1e-12) - np.log(neg_probs + 1e-12)
    return r.astype(np.float32)

from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True)
    try: yield
    finally: print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

with timer('Load data and align by time'):
    tr = load_json_df('train.json'); te = load_json_df('test.json')
    mf = json.loads(Path('folds/manifest.json').read_text())
    time_col = mf.get('time_col','unix_timestamp_of_request_utc')
    label_col = mf.get('label_col','requester_received_pizza')
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    X_text_tr = build_text(tr)
    X_text_te = build_text(te)

fold_dir = Path('folds')
mf = json.loads((fold_dir / 'manifest.json').read_text())
chains = [c['chain'] for c in mf['chains']]
print('Chains:', chains)

oof = np.zeros(len(tr), dtype=np.float32)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_accum = []

params = dict(
    word_max_features=220000,
    char_max_features=280000,
    C_grid=[0.5, 1.0, 2.0, 4.0]
)
print('Params:', params)

def fit_chain(ci, seed=42):
    tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    if len(va_idx) == 0:
        print(f'Chain {ci}: empty val; skip'); return None
    print(f'Chain {ci}: train {len(tr_idx)} | val {len(va_idx)}', flush=True)
    # TF-IDF vectorizers (per-chain, fit on train only); strip_accents unicode
    word_vec = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=2, max_df=0.98,
                               max_features=params['word_max_features'], lowercase=False,
                               strip_accents='unicode', dtype=np.float32)
    char_vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,6), min_df=2,
                               max_features=params['char_max_features'], lowercase=False,
                               strip_accents='unicode', dtype=np.float32)
    with timer(f'Chain {ci}: vectorize'):
        Xtr_w = word_vec.fit_transform(X_text_tr.iloc[tr_idx])
        Xva_w = word_vec.transform(X_text_tr.iloc[va_idx])
        Xte_w = word_vec.transform(X_text_te)
        Xtr_c = char_vec.fit_transform(X_text_tr.iloc[tr_idx])
        Xva_c = char_vec.transform(X_text_tr.iloc[va_idx])
        Xte_c = char_vec.transform(X_text_te)
        Xtr = sparse.hstack([Xtr_w, Xtr_c]).tocsr()
        Xva = sparse.hstack([Xva_w, Xva_c]).tocsr()
        Xte = sparse.hstack([Xte_w, Xte_c]).tocsr()
    ytr, yva = y[tr_idx], y[va_idx]
    # NB-SVM transform via manual log ratio on TF-IDF matrix
    with timer(f'Chain {ci}: NB log-ratio transform'):
        r = nb_log_ratio_from_matrix(Xtr, ytr, alpha=1.0)
        Xtr_nb = Xtr.multiply(r)
        Xva_nb = Xva.multiply(r)
        Xte_nb = Xte.multiply(r)
    best_auc, best_pva, best_pte, bestC = -1.0, None, None, None
    for C in params['C_grid']:
        clf = LogisticRegression(solver='saga', penalty='l2', C=C,
                                 class_weight='balanced',
                                 random_state=seed, max_iter=3000, n_jobs=-1, verbose=0)
        t0 = time.time(); clf.fit(Xtr_nb, ytr); pva = clf.predict_proba(Xva_nb)[:,1]
        auc = roc_auc_score(yva, pva)
        print(f'  C={C} | AUC={auc:.6f} | secs={time.time()-t0:.2f}', flush=True)
        if auc > best_auc:
            best_auc, best_pva, best_pte, bestC = auc, pva, clf.predict_proba(Xte_nb)[:,1], C
    print(f'Chain {ci}: best C={bestC} | AUC={best_auc:.6f}', flush=True)
    del Xtr_w, Xva_w, Xte_w, Xtr_c, Xva_c, Xte_c, Xtr, Xva, Xte, Xtr_nb, Xva_nb, Xte_nb; gc.collect()
    return va_idx, best_pva.astype(np.float32), best_pte.astype(np.float32)

with timer('Train NB-SVM across forward chains'):
    for ci in chains:
        res = fit_chain(ci, seed=42)
        if res is None: continue
        va_idx, pva, pte = res
        oof[va_idx] = pva
        val_mask[va_idx] = True
        test_preds_accum.append(pte)

with timer('Evaluate and save NB-SVM artifacts'):
    if val_mask.any():
        oof_auc = roc_auc_score(y[val_mask], oof[val_mask])
        print('NB-SVM OOF AUC (val rows only):', round(oof_auc, 6))
    else:
        print('Warning: no validation rows; OOF not computed.')
    np.save('oof_nbsvm_fc.npy', oof.astype(np.float32))
    if len(test_preds_accum):
        test_pred = np.mean(np.vstack(test_preds_accum), axis=0).astype(np.float32)
        np.save('test_nbsvm_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_nbsvm_fc.csv', index=False)
        print('Saved: oof_nbsvm_fc.npy, test_nbsvm_fc.npy, submission_nbsvm_fc.csv')
    else:
        print('Warning: no test preds collected.')

[T0] Load data and align by time ...


[T+] Load data and align by time done in 0.16s


Chains: [1, 2, 3]
Params: {'word_max_features': 220000, 'char_max_features': 280000, 'C_grid': [0.5, 1.0, 2.0, 4.0]}
[T0] Train NB-SVM across forward chains ...


Chain 1: train 1727 | val 565


[T0] Chain 1: vectorize ...


[T+] Chain 1: vectorize done in 1.42s


[T0] Chain 1: NB log-ratio transform ...


[T+] Chain 1: NB log-ratio transform done in 0.04s


  C=0.5 | AUC=0.604830 | secs=0.46


  C=1.0 | AUC=0.599043 | secs=0.47


  C=2.0 | AUC=0.590301 | secs=0.46


  C=4.0 | AUC=0.579766 | secs=0.66


Chain 1: best C=0.5 | AUC=0.604830


Chain 2: train 2158 | val 427


[T0] Chain 2: vectorize ...


[T+] Chain 2: vectorize done in 1.57s


  C=0.5 | AUC=0.610632 | secs=0.55


  C=1.0 | AUC=0.600593 | secs=0.60


  C=2.0 | AUC=0.589716 | secs=0.65


  C=4.0 | AUC=0.580007 | secs=1.73


Chain 2: best C=0.5 | AUC=0.610632


Chain 3: train 2302 | val 567


[T0] Chain 3: vectorize ...


[T+] Chain 3: vectorize done in 1.70s


[T0] Chain 3: NB log-ratio transform ...


[T+] Chain 3: NB log-ratio transform done in 0.05s


  C=0.5 | AUC=0.591042 | secs=0.58


  C=1.0 | AUC=0.583530 | secs=0.61


  C=2.0 | AUC=0.573955 | secs=0.76


  C=4.0 | AUC=0.563759 | secs=1.81


Chain 3: best C=0.5 | AUC=0.591042


[T+] Train NB-SVM across forward chains done in 14.54s


[T0] Evaluate and save NB-SVM artifacts ...


NB-SVM OOF AUC (val rows only): 0.590052
Saved: oof_nbsvm_fc.npy, test_nbsvm_fc.npy, submission_nbsvm_fc.csv
[T+] Evaluate and save NB-SVM artifacts done in 0.00s


In [143]:
# NB-SVM via venv script: counts->MNB r, apply r to TF-IDF (same vocab), per-chain fit
import os, textwrap, subprocess, json
from pathlib import Path

script = Path('nbsvm_fc.py')
code = textwrap.dedent('''
import os, json, time, gc, numpy as np, pandas as pd
from pathlib import Path
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def load_json_df(path):
    try: return pd.read_json(path, lines=True)
    except ValueError:
        try: return pd.read_json(path, lines=False)
        except ValueError:
            with open(path,'r',encoding='utf-8') as f: data=json.load(f)
            if isinstance(data, dict) and 'data' in data: data=data['data']
            return pd.json_normalize(data)

def build_text(df):
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(['request_title','title'])
    bcol = first_col(['request_text','body','text'])
    t = df[tcol].fillna('') if tcol else pd.Series(['']*len(df))
    b = df[bcol].fillna('') if bcol else pd.Series(['']*len(df))
    t = t.astype(str).str.lower()
    b = b.astype(str).str.lower()
    url_pat = r'https?://\S+|www\.[^\s]+'
    t = t.str.replace(url_pat, ' URL ', regex=True).str.replace(r'\d', '0', regex=True)
    b = b.str.replace(url_pat, ' URL ', regex=True).str.replace(r'\d', '0', regex=True)
    return (t + ' ' + t + ' ' + t + ' ' + b)

def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True);
    return t0
def done(t0, msg):
    print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

def main():
    fold_dir = Path('folds')
    mf = json.loads((fold_dir/'manifest.json').read_text())
    chains = [c['chain'] for c in mf['chains']]
    tr = load_json_df('train.json'); te = load_json_df('test.json')
    time_col = mf.get('time_col','unix_timestamp_of_request_utc')
    label_col = mf.get('label_col','requester_received_pizza')
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    X_text_tr = build_text(tr); X_text_te = build_text(te)
    print('Chains:', chains)

    oof = np.zeros(len(tr), dtype=np.float32); val_mask = np.zeros(len(tr), dtype=bool)
    test_preds = []
    params = dict(word_max_features=240000, char_max_features=300000, C_grid=[0.25,0.5,1.0,2.0,4.0])
    print('Params:', params)

    for ci in chains:
        tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
        if len(va_idx) == 0:
            print(f'Chain {ci}: empty val; skip'); continue
        print(f'Chain {ci}: train {len(tr_idx)} | val {len(va_idx)}', flush=True)
        # 1) Count vectorizers (fit on train only) with binary=True for stability
        word_cv = CountVectorizer(analyzer='word', ngram_range=(1,2), min_df=2, max_df=0.98,
                                  max_features=params['word_max_features'], lowercase=False, strip_accents='unicode', binary=True)
        char_cv = CountVectorizer(analyzer='char_wb', ngram_range=(3,6), min_df=2,
                                  max_features=params['char_max_features'], lowercase=False, strip_accents='unicode', binary=True)
        t0 = timer(f'Chain {ci}: fit counts')
        Xtr_w_cnt = word_cv.fit_transform(X_text_tr.iloc[tr_idx])
        Xva_w_cnt = word_cv.transform(X_text_tr.iloc[va_idx])
        Xte_w_cnt = word_cv.transform(X_text_te)
        Xtr_c_cnt = char_cv.fit_transform(X_text_tr.iloc[tr_idx])
        Xva_c_cnt = char_cv.transform(X_text_tr.iloc[va_idx])
        Xte_c_cnt = char_cv.transform(X_text_te)
        Xtr_cnt = sparse.hstack([Xtr_w_cnt, Xtr_c_cnt]).tocsr()
        Xva_cnt = sparse.hstack([Xva_w_cnt, Xva_c_cnt]).tocsr()
        Xte_cnt = sparse.hstack([Xte_w_cnt, Xte_c_cnt]).tocsr()
        done(t0, f'Chain {ci}: fit counts')
        # 2) TF-IDF with same vocabularies (align columns), using sublinear_tf=True
        word_tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=2, max_df=0.98,
                                  vocabulary=word_cv.vocabulary_, lowercase=False, strip_accents='unicode', dtype=np.float32, sublinear_tf=True)
        char_tf = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,6), min_df=2,
                                  vocabulary=char_cv.vocabulary_, lowercase=False, strip_accents='unicode', dtype=np.float32, sublinear_tf=True)
        t0 = timer(f'Chain {ci}: fit/transform TF-IDF (vocab-aligned)')
        Xtr_w_tf = word_tf.fit_transform(X_text_tr.iloc[tr_idx])
        Xva_w_tf = word_tf.transform(X_text_tr.iloc[va_idx])
        Xte_w_tf = word_tf.transform(X_text_te)
        Xtr_c_tf = char_tf.fit_transform(X_text_tr.iloc[tr_idx])
        Xva_c_tf = char_tf.transform(X_text_tr.iloc[va_idx])
        Xte_c_tf = char_tf.transform(X_text_te)
        Xtr_tf = sparse.hstack([Xtr_w_tf, Xtr_c_tf]).tocsr()
        Xva_tf = sparse.hstack([Xva_w_tf, Xva_c_tf]).tocsr()
        Xte_tf = sparse.hstack([Xte_w_tf, Xte_c_tf]).tocsr()
        done(t0, f'Chain {ci}: TF-IDF')
        # 3) NB step on counts
        t0 = timer(f'Chain {ci}: NB fit (counts) and r compute')
        mnb = MultinomialNB(alpha=1.0); mnb.fit(Xtr_cnt, y[tr_idx])
        r = (mnb.feature_log_prob_[1] - mnb.feature_log_prob_[0]).astype(np.float32)
        r = np.clip(r, -8.0, 8.0)
        done(t0, f'Chain {ci}: NB fit')
        # 4) Apply r to TF-IDF features
        Xtr_nb = Xtr_tf.multiply(r); Xva_nb = Xva_tf.multiply(r); Xte_nb = Xte_tf.multiply(r)
        # 5) LR over NB-weighted features; grid over C
        best_auc, best_pva, best_pte, bestC = -1.0, None, None, None
        for C in params['C_grid']:
            clf = LogisticRegression(solver='saga', penalty='l2', C=C, class_weight=None,
                                     random_state=42, max_iter=4000, n_jobs=-1, verbose=0)
            tfit = time.time(); clf.fit(Xtr_nb, y[tr_idx]); pva = clf.predict_proba(Xva_nb)[:,1]
            auc = roc_auc_score(y[va_idx], pva)
            print(f'  C={C} | AUC={auc:.6f} | secs={time.time()-tfit:.2f}', flush=True)
            if auc > best_auc:
                best_auc, best_pva, best_pte, bestC = auc, pva, clf.predict_proba(Xte_nb)[:,1], C
        print(f'Chain {ci}: best C={bestC} | AUC={best_auc:.6f}', flush=True)
        oof = globals().setdefault('oof', np.zeros(len(tr), dtype=np.float32))
        val_mask = globals().setdefault('val_mask', np.zeros(len(tr), dtype=bool))
        test_preds = globals().setdefault('test_preds', [])
        oof[va_idx] = best_pva.astype(np.float32); val_mask[va_idx] = True; test_preds.append(best_pte.astype(np.float32))
        del Xtr_w_cnt, Xva_w_cnt, Xte_w_cnt, Xtr_c_cnt, Xva_c_cnt, Xte_c_cnt, Xtr_cnt, Xva_cnt, Xte_cnt
        del Xtr_w_tf, Xva_w_tf, Xte_w_tf, Xtr_c_tf, Xva_c_tf, Xte_c_tf, Xtr_tf, Xva_tf, Xte_tf, Xtr_nb, Xva_nb, Xte_nb
        gc.collect()

    oof = globals().get('oof')
    val_mask = globals().get('val_mask')
    test_preds = globals().get('test_preds', [])
    if val_mask is not None and val_mask.any():
        oof_auc = roc_auc_score(y[val_mask], oof[val_mask])
        print('NB-SVM (venv, counts->MNB r, tfidf LR, binary counts, sublinear TF) OOF AUC:', round(oof_auc, 6))
    np.save('oof_nbsvm_fc.npy', oof.astype(np.float32))
    if len(test_preds):
        test_pred = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
        np.save('test_nbsvm_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_nbsvm_fc.csv', index=False)
        print('Saved: oof_nbsvm_fc.npy, test_nbsvm_fc.npy, submission_nbsvm_fc.csv')

if __name__ == '__main__':
    main()
''')
script.write_text(code)
print('Wrote nbsvm_fc.py')

# Run with venv python to avoid base-kernel sklearn issues
venv_py = Path('.venv/bin/python'); assert venv_py.exists(), 'Missing .venv python; run setup first'
print('Launching NB-SVM via', venv_py)
proc = subprocess.run([str(venv_py), str(script)], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
print(proc.stdout)
if proc.returncode != 0:
    raise RuntimeError('nbsvm_fc.py failed; check logs above')
print('NB-SVM venv run complete. Re-run Cell 12 to reblend.')

Wrote nbsvm_fc.py
Launching NB-SVM via .venv/bin/python


Chains: [1, 2, 3]
Params: {'word_max_features': 240000, 'char_max_features': 300000, 'C_grid': [0.25, 0.5, 1.0, 2.0, 4.0]}
Chain 1: train 1727 | val 565
[T0] Chain 1: fit counts ...
[T+] Chain 1: fit counts done in 1.42s
[T0] Chain 1: fit/transform TF-IDF (vocab-aligned) ...
[T+] Chain 1: TF-IDF done in 1.34s
[T0] Chain 1: NB fit (counts) and r compute ...
[T+] Chain 1: NB fit done in 0.01s
  C=0.25 | AUC=0.578622 | secs=0.46
  C=0.5 | AUC=0.579256 | secs=0.76
  C=1.0 | AUC=0.578605 | secs=1.28
  C=2.0 | AUC=0.576336 | secs=1.98
  C=4.0 | AUC=0.571728 | secs=2.52
Chain 1: best C=0.5 | AUC=0.579256
Chain 2: train 2158 | val 427
[T0] Chain 2: fit counts ...
[T+] Chain 2: fit counts done in 1.54s
[T0] Chain 2: fit/transform TF-IDF (vocab-aligned) ...
[T+] Chain 2: TF-IDF done in 1.46s
[T0] Chain 2: NB fit (counts) and r compute ...
[T+] Chain 2: NB fit done in 0.01s
  C=0.25 | AUC=0.589716 | secs=1.09
  C=0.5 | AUC=0.590555 | secs=1.43
  C=1.0 | AUC=0.590165 | secs=2.18
  C=2.0 | AUC=0.58

In [146]:
# New Leg: Char-only TF-IDF + LR (char_wb 3-6), forward-chain per-chain fit
import json, time, gc
from pathlib import Path
import numpy as np, pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def load_json_df(path):
    try: return pd.read_json(path, lines=True)
    except ValueError:
        try: return pd.read_json(path, lines=False)
        except ValueError:
            import json as _json
            with open(path, 'r', encoding='utf-8') as f: data=_json.load(f)
            if isinstance(data, dict) and 'data' in data: data=data['data']
            return pd.json_normalize(data)

def build_text(df):
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(['request_title','title'])
    bcol = first_col(['request_text','body','text'])
    t = df[tcol].fillna('') if tcol else pd.Series(['']*len(df))
    b = df[bcol].fillna('') if bcol else pd.Series(['']*len(df))
    t = t.astype(str).str.lower()
    b = b.astype(str).str.lower()
    url_pat = r'https?://\S+|www\.[^\s]+'
    t = t.str.replace(url_pat, ' URL ', regex=True).str.replace(r'\d', '0', regex=True)
    b = b.str.replace(url_pat, ' URL ', regex=True).str.replace(r'\d', '0', regex=True)
    return (t + ' ' + t + ' ' + t + ' ' + b)

from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True)
    try: yield
    finally: print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

with timer('Load data and align by time'):
    tr = load_json_df('train.json'); te = load_json_df('test.json')
    mf = json.loads(Path('folds/manifest.json').read_text())
    time_col = mf.get('time_col','unix_timestamp_of_request_utc')
    label_col = mf.get('label_col','requester_received_pizza')
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    X_text_tr = build_text(tr)
    X_text_te = build_text(te)

fold_dir = Path('folds')
mf = json.loads((fold_dir / 'manifest.json').read_text())
chains = [c['chain'] for c in mf['chains']]
print('Chains:', chains)

oof = np.zeros(len(tr), dtype=np.float32)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_accum = []

params = dict(
    char_max_features=300000,
    C_grid=[1.0, 2.0, 4.0]
)
print('Params:', params)

def fit_chain(ci, seed=42):
    tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    if len(va_idx) == 0:
        print(f'Chain {ci}: empty val; skip'); return None
    print(f'Chain {ci}: train {len(tr_idx)} | val {len(va_idx)}', flush=True)
    char_vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,6), min_df=2,
                               max_features=params['char_max_features'], lowercase=False,
                               strip_accents='unicode', dtype=np.float32)
    with timer(f'Chain {ci}: vectorize char TF-IDF'):
        Xtr = char_vec.fit_transform(X_text_tr.iloc[tr_idx]).tocsr()
        Xva = char_vec.transform(X_text_tr.iloc[va_idx]).tocsr()
        Xte = char_vec.transform(X_text_te).tocsr()
    ytr, yva = y[tr_idx], y[va_idx]
    best_auc, best_pva, best_pte, bestC = -1.0, None, None, None
    for C in params['C_grid']:
        clf = LogisticRegression(solver='saga', penalty='l2', C=C,
                                 class_weight=None,
                                 random_state=seed, max_iter=3000, n_jobs=-1, verbose=0)
        t0 = time.time(); clf.fit(Xtr, ytr); pva = clf.predict_proba(Xva)[:,1]
        auc = roc_auc_score(yva, pva)
        print(f'  C={C} | AUC={auc:.6f} | secs={time.time()-t0:.2f}', flush=True)
        if auc > best_auc:
            best_auc, best_pva, best_pte, bestC = auc, pva, clf.predict_proba(Xte)[:,1], C
    print(f'Chain {ci}: best C={bestC} | AUC={best_auc:.6f}', flush=True)
    del Xtr, Xva, Xte; gc.collect()
    return va_idx, best_pva.astype(np.float32), best_pte.astype(np.float32)

with timer('Train Char-LR across forward chains'):
    for ci in chains:
        res = fit_chain(ci, seed=42)
        if res is None: continue
        va_idx, pva, pte = res
        oof[va_idx] = pva
        val_mask[va_idx] = True
        test_preds_accum.append(pte)

with timer('Evaluate and save Char-LR artifacts'):
    if val_mask.any():
        oof_auc = roc_auc_score(y[val_mask], oof[val_mask])
        print('Char-LR OOF AUC (val rows only):', round(oof_auc, 6))
    else:
        print('Warning: no validation rows; OOF not computed.')
    np.save('oof_charlr_fc.npy', oof.astype(np.float32))
    if len(test_preds_accum):
        test_pred = np.mean(np.vstack(test_preds_accum), axis=0).astype(np.float32)
        np.save('test_charlr_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_charlr_fc.csv', index=False)
        print('Saved: oof_charlr_fc.npy, test_charlr_fc.npy, submission_charlr_fc.csv')
    else:
        print('Warning: no test preds collected.')

[T0] Load data and align by time ...


[T+] Load data and align by time done in 0.17s


Chains: [1, 2, 3]
Params: {'char_max_features': 300000, 'C_grid': [1.0, 2.0, 4.0]}
[T0] Train Char-LR across forward chains ...


Chain 1: train 1727 | val 565


[T0] Chain 1: vectorize char TF-IDF ...


[T+] Chain 1: vectorize char TF-IDF done in 1.14s


  C=1.0 | AUC=0.594136 | secs=3.12


  C=2.0 | AUC=0.581507 | secs=3.63


  C=4.0 | AUC=0.569916 | secs=4.17


Chain 1: best C=1.0 | AUC=0.594136


Chain 2: train 2158 | val 427


[T0] Chain 2: vectorize char TF-IDF ...


[T+] Chain 2: vectorize char TF-IDF done in 1.26s


  C=1.0 | AUC=0.633255 | secs=4.29


  C=2.0 | AUC=0.624206 | secs=4.67


  C=4.0 | AUC=0.615007 | secs=5.20


Chain 2: best C=1.0 | AUC=0.633255


Chain 3: train 2302 | val 567


[T0] Chain 3: vectorize char TF-IDF ...


[T+] Chain 3: vectorize char TF-IDF done in 1.37s


  C=1.0 | AUC=0.621910 | secs=4.41


  C=2.0 | AUC=0.616682 | secs=4.97


  C=4.0 | AUC=0.609711 | secs=5.40


Chain 3: best C=1.0 | AUC=0.621910


[T+] Train Char-LR across forward chains done in 44.03s


[T0] Evaluate and save Char-LR artifacts ...


Char-LR OOF AUC (val rows only): 0.60305
Saved: oof_charlr_fc.npy, test_charlr_fc.npy, submission_charlr_fc.csv
[T+] Evaluate and save Char-LR artifacts done in 0.01s


In [150]:
# Leg: Word-only TF-IDF + LR (word 1-2, lowercase, sublinear), forward-chain per-chain fit
import json, time, gc
from pathlib import Path
import numpy as np, pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def load_json_df(path):
    try: return pd.read_json(path, lines=True)
    except ValueError:
        try: return pd.read_json(path, lines=False)
        except ValueError:
            import json as _json
            with open(path, 'r', encoding='utf-8') as f: data=_json.load(f)
            if isinstance(data, dict) and 'data' in data: data=data['data']
            return pd.json_normalize(data)

def build_text(df):
    def first_col(cols):
        for c in cols:
            if c in df.columns: return c
        return None
    tcol = first_col(['request_title','title'])
    bcol = first_col(['request_text','body','text'])
    t = df[tcol].fillna('') if tcol else pd.Series(['']*len(df))
    b = df[bcol].fillna('') if bcol else pd.Series(['']*len(df))
    t = t.astype(str).str.lower()
    b = b.astype(str).str.lower()
    url_pat = r'https?://\S+|www\.[^\s]+'
    t = t.str.replace(url_pat, ' url ', regex=True).str.replace(r'\d', '0', regex=True)
    b = b.str.replace(url_pat, ' url ', regex=True).str.replace(r'\d', '0', regex=True)
    return (t + ' ' + t + ' ' + t + ' ' + b)

from contextlib import contextmanager
@contextmanager
def timer(msg):
    t0 = time.time(); print(f"[T0] {msg} ...", flush=True)
    try: yield
    finally: print(f"[T+] {msg} done in {time.time()-t0:.2f}s", flush=True)

with timer('Load data and align by time'):
    tr = load_json_df('train.json'); te = load_json_df('test.json')
    mf = json.loads(Path('folds/manifest.json').read_text())
    time_col = mf.get('time_col','unix_timestamp_of_request_utc')
    label_col = mf.get('label_col','requester_received_pizza')
    tr = tr.sort_values(time_col, kind='mergesort').reset_index(drop=True)
    y = pd.to_numeric(tr[label_col], errors='coerce').fillna(0).astype(int).clip(0,1).values
    X_text_tr = build_text(tr)
    X_text_te = build_text(te)

fold_dir = Path('folds')
mf = json.loads((fold_dir / 'manifest.json').read_text())
chains = [c['chain'] for c in mf['chains']]
print('Chains:', chains)

oof = np.zeros(len(tr), dtype=np.float32)
val_mask = np.zeros(len(tr), dtype=bool)
test_preds_accum = []

params = dict(
    word_max_features=250000,
    min_df=2,
    max_df=0.995,
    C_grid=[1.0, 2.0, 4.0, 8.0, 12.0]
)
print('Params:', params)

def fit_chain(ci, seed=42):
    tr_idx = np.load(fold_dir / f"fc_chain{ci}_train_idx.npy"); va_idx = np.load(fold_dir / f"fc_chain{ci}_val_idx.npy")
    if len(va_idx) == 0:
        print(f'Chain {ci}: empty val; skip'); return None
    print(f'Chain {ci}: train {len(tr_idx)} | val {len(va_idx)}', flush=True)
    word_vec = TfidfVectorizer(analyzer='word', ngram_range=(1,2),
                               min_df=params['min_df'], max_df=params['max_df'],
                               max_features=params['word_max_features'],
                               lowercase=True, strip_accents='unicode',
                               dtype=np.float32, sublinear_tf=True)
    with timer(f'Chain {ci}: vectorize word TF-IDF (word-only)'):
        Xtr = word_vec.fit_transform(X_text_tr.iloc[tr_idx]).tocsr()
        Xva = word_vec.transform(X_text_tr.iloc[va_idx]).tocsr()
        Xte = word_vec.transform(X_text_te).tocsr()
    ytr, yva = y[tr_idx], y[va_idx]
    best_auc, best_pva, best_pte, bestC = -1.0, None, None, None
    for C in params['C_grid']:
        clf = LogisticRegression(solver='saga', penalty='l2', C=C,
                                 class_weight=None,
                                 random_state=seed, max_iter=3000, n_jobs=-1, verbose=0)
        t0 = time.time(); clf.fit(Xtr, ytr); pva = clf.predict_proba(Xva)[:,1]
        auc = roc_auc_score(yva, pva)
        print(f'  C={C} | AUC={auc:.6f} | secs={time.time()-t0:.2f}', flush=True)
        if auc > best_auc:
            best_auc, best_pva, best_pte, bestC = auc, pva, clf.predict_proba(Xte)[:,1], C
    print(f'Chain {ci}: best C={bestC} | AUC={best_auc:.6f}', flush=True)
    del Xtr, Xva, Xte; gc.collect()
    return va_idx, best_pva.astype(np.float32), best_pte.astype(np.float32)

with timer('Train Word-only TF-IDF LR across forward chains'):
    for ci in chains:
        res = fit_chain(ci, seed=42)
        if res is None: continue
        va_idx, pva, pte = res
        oof[va_idx] = pva
        val_mask[va_idx] = True
        test_preds_accum.append(pte)

with timer('Evaluate and save Word-only TF-IDF LR artifacts'):
    if val_mask.any():
        oof_auc = roc_auc_score(y[val_mask], oof[val_mask])
        print('Word-only TF-IDF LR OOF AUC (val rows only):', round(oof_auc, 6))
    else:
        print('Warning: no validation rows; OOF not computed.')
    np.save('oof_wordlr_fc.npy', oof.astype(np.float32))
    if len(test_preds_accum):
        test_pred = np.mean(np.vstack(test_preds_accum), axis=0).astype(np.float32)
        np.save('test_wordlr_fc.npy', test_pred)
        sub = pd.read_csv('sampleSubmission.csv')
        sub['requester_received_pizza'] = np.clip(test_pred, 0.01, 0.99)
        sub.to_csv('submission_wordlr_fc.csv', index=False)
        print('Saved: oof_wordlr_fc.npy, test_wordlr_fc.npy, submission_wordlr_fc.csv')
    else:
        print('Warning: no test preds collected.')

[T0] Load data and align by time ...


[T+] Load data and align by time done in 0.16s


Chains: [1, 2, 3]
Params: {'word_max_features': 250000, 'min_df': 2, 'max_df': 0.995, 'C_grid': [1.0, 2.0, 4.0, 8.0, 12.0]}
[T0] Train Word-only TF-IDF LR across forward chains ...


Chain 1: train 1727 | val 565


[T0] Chain 1: vectorize word TF-IDF (word-only) ...


[T+] Chain 1: vectorize word TF-IDF (word-only) done in 0.27s


  C=1.0 | AUC=0.592940 | secs=0.47


  C=2.0 | AUC=0.588754 | secs=0.60


  C=4.0 | AUC=0.584040 | secs=0.74


  C=8.0 | AUC=0.580452 | secs=0.88


  C=12.0 | AUC=0.578271 | secs=1.01


Chain 1: best C=1.0 | AUC=0.592940


Chain 2: train 2158 | val 427


[T0] Chain 2: vectorize word TF-IDF (word-only) ...


[T+] Chain 2: vectorize word TF-IDF (word-only) done in 0.30s


  C=1.0 | AUC=0.635173 | secs=0.77


  C=2.0 | AUC=0.632686 | secs=0.89


  C=4.0 | AUC=0.631218 | secs=1.08


  C=8.0 | AUC=0.628371 | secs=1.23


  C=12.0 | AUC=0.628191 | secs=1.33


Chain 2: best C=1.0 | AUC=0.635173


Chain 3: train 2302 | val 567


[T0] Chain 3: vectorize word TF-IDF (word-only) ...


[T+] Chain 3: vectorize word TF-IDF (word-only) done in 0.33s


  C=1.0 | AUC=0.620548 | secs=0.77


  C=2.0 | AUC=0.617583 | secs=0.91


  C=4.0 | AUC=0.615380 | secs=1.12


  C=8.0 | AUC=0.613757 | secs=1.30


  C=12.0 | AUC=0.612365 | secs=1.44


Chain 3: best C=1.0 | AUC=0.620548


[T+] Train Word-only TF-IDF LR across forward chains done in 15.82s


[T0] Evaluate and save Word-only TF-IDF LR artifacts ...


Word-only TF-IDF LR OOF AUC (val rows only): 0.598862
Saved: oof_wordlr_fc.npy, test_wordlr_fc.npy, submission_wordlr_fc.csv
[T+] Evaluate and save Word-only TF-IDF LR artifacts done in 0.01s
