# Facebook Recruiting III - Keyword Extraction (MLE-Benchmark)

Objective: Predict multi-label tags from question Title + Body. Metric: micro-F1. Goal: WIN A MEDAL (Bronze → Silver → Gold).

## Strategy Overview (Gold-Trajectory, Concrete)
- Text construction and cleaning:
  - Parse HTML with BeautifulSoup (lxml parser). Extract:
    - Title (plain text)
    - Body text without code (strip tags, keep meaningful punctuation)
    - Code text from <pre><code> blocks (joined, as separate feature space)
  - Normalize: lowercase, collapse whitespace, replace URLs with token URL, emails with EMAIL, numbers with 0, hex with HEX.
  - Preserve punctuation for char-ngrams. Remove remaining HTML.
- Features (sparse, float32):
  - Title TF-IDF: word ngrams 1–3, max_df=0.95, min_df∈{2–5}, max_features 200k–400k, sublinear_tf=True.
  - Body TF-IDF: word ngrams 1–3, same params, max_features 400k–800k.
  - Char TF-IDF (body+title combined): analyzer='char_wb', ngrams 3–6, min_df∈{2–5}, max_features 300k–800k.
  - Code TF-IDF: word ngrams 1–3, token_pattern to include symbols (e.g., r'(?u)\b\w[\w_\+\-\#\.]*\b'), min_df 2–5, max_features 100k–200k.
  - Combine via hstack with Title boosting α ∈ {2,3,4} (Title features multiplied by α).
  - Meta features (scaled and hstacked via sparse csr): length (title/body/code char and token counts), num_urls, num_code_blocks, punctuation density, avg token length. Use StandardScaler(with_mean=False).
- Label space management:
  - Primary training retains all tags; for fast iterations, prune to tags with freq ≥ 3 or ≥ 5. Report coverage.
  - Compute label coverage % after pruning and ensure final model trains on full label set.
- Models:
  - Primary: OneVsRest LogisticRegression(solver='saga', penalty='l2', C ∈ {0.5, 1.0, 1.5, 2.0}, max_iter 2000, n_jobs=-1, class_weight=None).
  - Diversity: LinearSVC wrapped in CalibratedClassifierCV(method='sigmoid', cv=3) for probabilities.
  - Optional: SGDClassifier(loss='log_loss', early_stopping=True) for quick checks.
- Thresholding (critical):
  - Optimize global threshold t ∈ [0.05, 0.6].
  - Then per-tag thresholds t_k for tags with support ≥ 50 (else fall back to global). Optimize on OOF to maximize micro-F1.
  - Safety rule: if no tag exceeds threshold for a sample, emit top-1 highest score (optionally min_k tuned in OOF).
- Ensembling & dependencies:
  - Blend probabilities from LR(word+char+code+meta) and calibrated SVC; optionally add char-only LR with 0.2–0.4 weight. Tune weights on OOF.
  - Optional small co-occurrence nudge (validated): add +0.01–0.02 to scores of frequent tag pairs before thresholding.
- Contingency (if CV stalls < gold):
  - Classifier Chains with LR on frequent labels subset.
  - Lightweight transformer (DistilBERT/RoBERTa-base) multi-label head (2–3 epochs) and ensemble with linear stack.

## Formal CV & OOF Protocol (Leak-proof)
- Random seeds: global_seed=1337; fold seeds fixed [1337, 2025, 4242, 7777, 8888]. All PRNGs seeded.
- Folds: 5-fold iterative stratification (iterative-stratification library). For speed iterations, 3 folds.
- Within each fold:
  - Fit vectorizers (Title/Body/Char/Code) on training split only.
  - Fit scaler for meta-features on training split only.
  - Train model on training split; predict probabilities on validation split to produce OOF.
- OOF predictions:
  - Store per-model OOF probs for every label; use exclusively for threshold optimization, ensembling weight tuning, and error analysis.
  - Report OOF micro-F1 using optimized thresholds (global and per-tag).
- Final model for inference:
  - Refit vectorizers and models on full training data using best hyperparams and thresholds determined on OOF.

## Milestones with Acceptance Criteria
1) Data load + schema + dummy submission [Checkpoint]
   - Validate reading train/test; inspect columns; ensure Tags is space-separated.
   - Create a tiny dummy submission with correct format (Id, Tags) and pass local schema checks.
   - Acceptance: dummy submission created; basic EDA table produced; no memory issues.
2) Baseline OVR-LR with Title/Body concat (word+char) [Checkpoint]
   - 3-fold iterative CV; global threshold sweep; no code/meta features yet.
   - Acceptance: OOF micro-F1 ≥ 0.70, train time ≤ 60 min, feature_dim ≤ 1.2M.
3) Full feature stack (Title boost, Char WB, Code, Meta) + per-tag thresholds [Checkpoint]
   - 5-fold iterative CV; per-tag thresholds for support ≥ 50.
   - Acceptance: OOF micro-F1 ≥ 0.75, train time ≤ 120 min, label_coverage ≥ 98%.
4) Model diversity + blending [Checkpoint]
   - Add calibrated LinearSVC; blend; re-tune thresholds on blended OOF.
   - Acceptance: OOF micro-F1 ≥ 0.77, stable across folds (std ≤ 0.005).
5) Error analysis + targeted fixes [Checkpoint]
   - Analyze by tag frequency buckets, top false negatives/positives, tag co-occurrence errors; adjust features (e.g., code tokenizer, title alpha, min_df) accordingly.
   - Acceptance: Documented improvements with before/after OOF; no regression in ≥80% of folds.
6) Final training + test inference + submission.csv [Final]
   - Generate predictions with safety rule (min 1 tag); save submission.csv.
   - Acceptance: passes local validator; sizes sane; backed by notebook log.

## Error Analysis Loop (after first baseline)
- Report per-tag precision/recall, support; confusion via co-occurrence matrix deltas.
- Inspect 50 worst FNs/FPs; categorize by: code-heavy, very short title, rare tags, OOV tokens.
- Actions: adjust title weighting, char n-gram range, code token pattern, min_df, and thresholds.

## Experiment Log (Reproducibility-Grade)
| ID | Date/Time | random_seed | fold_scheme | label_pruning_rule | vec_title(word ngram, min_df, max_df, max_feat, sublinear) | vec_body | vec_char | vec_code | meta_feats | model(hypers) | thresholding_method | feature_dim | label_coverage_% | OOF micro-F1 | Holdout micro-F1 | train_time | peak_memory | notes |
|----|-----------|-------------|-------------|--------------------|--------------------------------------------------------------|----------|----------|----------|------------|----------------|---------------------|-------------|------------------|--------------|------------------|------------|-------------|-------|
| 001 | | 1337 | 3-fold iter | freq≥5 | title w(1-2) min_df=3 max_df=0.95 max_feat=200k sub=True | body w(1-2) ... | char wb(3-6) | code w(1-2) | len,url,code, punct | OVR-LR C=1.0 saga | global sweep |  |  |  |  |  |  | baseline |
| 002 | | 1337 | 5-fold iter | none | title w(1-3) ... | body w(1-3) ... | char wb(3-6) | code w(1-3) | +scaled | OVR-LR C=1.5 saga | per-tag + min1 |  |  |  |  |  |  |  |
| 003 | | 1337 | 5-fold iter | none | ... | ... | ... | ... | ... | Calibrated LinearSVC blend 30% | per-tag |  |  |  |  |  |  |  |

## Notebook Hygiene & Repro Policy
- Every milestone: backup notebook programmatically, then DELETE stale code after logging results here.
- Use CSR sparse float32 matrices; cap max_features to avoid OOM; prefer HashingVectorizer+TfidfTransformer if memory-bound.
- All vectorizers/scalers fit within folds only; refit on full train for final model.
- Submit for audit at each checkpoint.

## Risks / Mitigations
- Memory/time: limit feature dims; parallelize OVR; possibly train top-N frequent labels first to validate pipeline.
- Long-tail labels: per-tag thresholds; do not drop rare labels in final training.
- Format issues: early dummy submission to validate schema.

Proceed to: data loading, schema check, dummy submission, and EDA.

In [None]:
# Milestone 1: Data loading, schema validation, dummy submission, initial EDA, and early memory optimization (clean)
import os, sys, json, shutil, gc, re, time, math, random
from datetime import datetime
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Reproducibility
GLOBAL_SEED = 1337
random.seed(GLOBAL_SEED)
np.random.seed(GLOBAL_SEED)

# Utility: backup notebook
def backup_notebook():
    nb_path = 'agent_notebook.ipynb'
    if os.path.exists(nb_path):
        ts = datetime.now().strftime('%Y%m%d_%H%M%S')
        dst = f'agent_notebook_backup_{ts}.ipynb'
        shutil.copy2(nb_path, dst)
        print(f'[Backup] Notebook copied to {dst}')
    else:
        print('[Backup] Notebook file not found; skipping backup.')
backup_notebook()

# HTML parsing and basic cleaning utilities
URL_RE = re.compile(r'https?://\S+|www\.\S+', flags=re.IGNORECASE)
EMAIL_RE = re.compile(r'\b[\w\.-]+@[\w\.-]+\.[A-Za-z]{2,}\b')
HEX_RE = re.compile(r'\b0x[0-9A-Fa-f]+\b')
NUM_RE = re.compile(r'\b\d+\b')

def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ''
    s = s.lower()
    s = URL_RE.sub(' URL ', s)
    s = EMAIL_RE.sub(' EMAIL ', s)
    s = HEX_RE.sub(' HEX ', s)
    s = NUM_RE.sub(' 0 ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def extract_text_and_code_pre_only(html: str):
    """
    Extracts block-level code within <pre>...</pre> (including nested <code>) as code_text,
    keeps inline <code> in body text. Returns:
    (text_wo_block_code, block_code_text, num_block_code_blocks, url_count, punct_density)
    """
    if not isinstance(html, str):
        return '', '', 0, 0, 0.0
    soup = BeautifulSoup(html, 'lxml')
    pre_blocks = soup.find_all('pre')
    code_texts = []
    for pre in pre_blocks:
        code_texts.append(pre.get_text(' ', strip=True))
        pre.extract()
    text = soup.get_text(' ', strip=True)
    text_norm = normalize_text(text)
    code_norm = normalize_text(' \n '.join(code_texts))
    url_count = len(URL_RE.findall(text.lower()))
    puncts = re.findall(r'[\!\?\.,;:\-\(\)\[\]\{\}\#\+\*/\\\|\<\>\=\_\~\^\`\"]', text_norm)
    punct_density = (len(puncts) / max(1, len(text_norm)))
    return text_norm, code_norm, len(pre_blocks), url_count, punct_density

# Paths
TRAIN_PATH = 'train.csv'
TEST_PATH = 'test.csv'
SAMPLE_SUB_PATH = 'sample_submission.csv'

print('[Info] Listing CWD files:')
print('\n'.join(sorted(os.listdir('.'))))

# Load train/test
t0 = time.time()
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)
print(f'[Load] train shape={train.shape}, test shape={test.shape}, time={time.time()-t0:.2f}s')
print('[Schema] train columns:', train.columns.tolist())
print('[Schema] test columns:', test.columns.tolist())

# Early memory optimization (downcast + category) and schema validation
def mem_usage_mb(df: pd.DataFrame) -> float:
    return round(df.memory_usage(deep=True).sum() / (1024**2), 2)
print('[Memory] BEFORE optimization - train:', mem_usage_mb(train), 'MB; test:', mem_usage_mb(test), 'MB')
for df_name, df in [('train', train), ('test', test)]:
    if 'Id' in df.columns and pd.api.types.is_integer_dtype(df['Id']):
        df['Id'] = pd.to_numeric(df['Id'], downcast='integer')
        print(f'[Memory] {df_name}.Id dtype ->', df['Id'].dtype)
if 'Tags' in train.columns and train['Tags'].dtype == object:
    train['Tags'] = train['Tags'].astype('category')
    print('[Memory] train.Tags dtype ->', train['Tags'].dtype)
print('[Memory] AFTER optimization  - train:', mem_usage_mb(train), 'MB; test:', mem_usage_mb(test), 'MB')
# Minor schema validation for Tags delimiters
if 'Tags' in train.columns:
    tags_str = train['Tags'].astype(str)
    bad_delim_mask = tags_str.str.contains(r'[;,\|]', regex=True)
    n_bad = int(bad_delim_mask.sum())
    print(f"[Schema Check] Rows with unexpected delimiters in Tags (commas/semicolons/pipes): {n_bad}")

# Basic checks
required_train_cols = {'Id','Title','Body','Tags'}
required_test_cols = {'Id','Title','Body'}
missing_train = required_train_cols - set(train.columns)
missing_test = required_test_cols - set(test.columns)
if missing_train:
    print('[Warning] Missing expected train columns:', missing_train)
if missing_test:
    print('[Warning] Missing expected test columns:', missing_test)

print('\n[Head] train:')
display(train.head(3))
print('\n[Head] test:')
display(test.head(3))

# Create and validate a tiny dummy submission to ensure schema is right
def make_dummy_submission(output_path='submission_dummy.csv', n_rows=5):
    if os.path.exists(SAMPLE_SUB_PATH):
        ss = pd.read_csv(SAMPLE_SUB_PATH)
        if 'Id' in ss.columns and 'Id' in test.columns:
            ss = ss[ss['Id'].isin(test['Id'])]
        ss_small = ss.head(n_rows).copy()
        ss_small.to_csv(output_path, index=False)
        print(f'[Dummy] Wrote {output_path} with {len(ss_small)} rows from sample_submission.csv')
    else:
        sub = pd.DataFrame({'Id': test['Id'], 'Tags': ['tag'] * len(test)})
        sub.head(n_rows).to_csv(output_path, index=False)
        print(f'[Dummy] Wrote {output_path} with {n_rows} rows constructed from test Ids')
make_dummy_submission()

# Initial EDA: label stats
def split_tags(s):
    if isinstance(s, str):
        return s.strip().split()
    return []
tag_lists = train['Tags'].astype(str).apply(split_tags)
train['__n_tags__'] = tag_lists.apply(len)
avg_tags = train['__n_tags__'].mean()
print(f'[EDA] Avg tags per sample: {avg_tags:.3f}')
print(f'[EDA] Min/Max tags per sample: {train["__n_tags__"].min()} / {train["__n_tags__"].max()}')
top_tags = pd.Series([t for lst in tag_lists for t in lst]).value_counts().head(20)
print('[EDA] Top-20 tags by frequency:')
display(top_tags)

# Quick parsing demo on a few rows to validate HTML/code extraction (using <pre>-only)
demo_rows = train.sample(min(3, len(train)), random_state=GLOBAL_SEED)
demo_out = []
for _, r in demo_rows.iterrows():
    body_txt, code_txt, n_code, n_url, pden = extract_text_and_code_pre_only(r.get('Body', ''))
    title_txt = normalize_text(r.get('Title',''))
    demo_out.append({
        'Id': r.get('Id'),
        'title_len': len(title_txt),
        'body_len': len(body_txt),
        'code_len': len(code_txt),
        'num_block_code': n_code,
        'num_urls': n_url,
        'punct_density': round(pden, 4),
        'tags': str(r.get('Tags'))[:80] if not pd.isna(r.get('Tags')) else ''
    })
print('[Parse Demo] Title/Body/Code lengths and meta for a few samples:')
display(pd.DataFrame(demo_out))

# Memory footprint info (post-optimization)
print('[Memory] train dtypes:')
print(train.dtypes)
print('[Memory] train memory usage (MB):', round(train.memory_usage(deep=True).sum() / (1024**2), 2))
print('[Memory] test memory usage (MB):', round(test.memory_usage(deep=True).sum() / (1024**2), 2))

print('[Milestone 1] Completed data load, schema check, dummy submission, EDA, and early memory optimization (clean).')


In [None]:
# Install missing dependencies for HTML parsing
%pip install --quiet beautifulsoup4 lxml
import importlib
assert importlib.util.find_spec('bs4') is not None, 'bs4 not installed'
assert importlib.util.find_spec('lxml') is not None, 'lxml not installed'
print('Dependencies installed: beautifulsoup4, lxml')

In [None]:
# Milestone 1 fixes (headless-safe, no matplotlib): parsing correction + expanded EDA outputs to files
import os, re, math, random, gc
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

random.seed(GLOBAL_SEED if 'GLOBAL_SEED' in globals() else 1337)
np.random.seed(GLOBAL_SEED if 'GLOBAL_SEED' in globals() else 1337)

# Ensure train/test exist from previous cells
assert 'train' in globals() and 'test' in globals(), 'train/test DataFrames not found in environment.'

# Define regex and normalizer if missing
if 'URL_RE' not in globals():
    URL_RE = re.compile(r'https?://\S+|www\.\S+', flags=re.IGNORECASE)
if 'EMAIL_RE' not in globals():
    EMAIL_RE = re.compile(r'\b[\w\.-]+@[\w\.-]+\.[A-Za-z]{2,}\b')
if 'HEX_RE' not in globals():
    HEX_RE = re.compile(r'\b0x[0-9A-Fa-f]+\b')
if 'NUM_RE' not in globals():
    NUM_RE = re.compile(r'\b\d+\b')
if 'normalize_text' not in globals():
    def normalize_text(s: str) -> str:
        if not isinstance(s, str):
            return ''
        s = s.lower()
        s = URL_RE.sub(' URL ', s)
        s = EMAIL_RE.sub(' EMAIL ', s)
        s = HEX_RE.sub(' HEX ', s)
        s = NUM_RE.sub(' 0 ', s)
        s = re.sub(r'\s+', ' ', s).strip()
        return s

# Correct parsing: only block-level code from <pre> (keeping inline <code> in body)
def extract_text_and_code_pre_only(html: str):
    if not isinstance(html, str):
        return '', '', 0, 0, 0.0
    soup = BeautifulSoup(html, 'lxml')
    pre_blocks = soup.find_all('pre')
    code_texts = []
    for pre in pre_blocks:
        code_texts.append(pre.get_text(' ', strip=True))
        pre.extract()
    text = soup.get_text(' ', strip=True)
    text_norm = normalize_text(text)
    code_norm = normalize_text(' \n '.join(code_texts))
    url_count = len(URL_RE.findall(text.lower()))
    puncts = re.findall(r'[\!\?\.,;:\-\(\)\[\]\{\}\#\+\*/\\\|\<\>\=\_\~\^\`\"]', text_norm)
    punct_density = (len(puncts) / max(1, len(text_norm)))
    return text_norm, code_norm, len(pre_blocks), url_count, punct_density

out_dir = '.'

# Label space analysis and frequency CSV (top 5000)
tag_lists = train['Tags'].apply(lambda s: s.split() if isinstance(s, str) else [])
all_tags = pd.Series([t for lst in tag_lists for t in lst])
freq = all_tags.value_counts()
unique_tags = int(freq.shape[0])
freq_head = freq.head(5000).rename('count').to_frame()
freq_head.index.name = 'tag'
freq_head.reset_index().to_csv(os.path.join(out_dir, 'eda_tag_freq_top5000.csv'), index=False)
with open(os.path.join(out_dir, 'eda_tag_freq_summary.txt'), 'w') as f:
    f.write(f'Total unique tags: {unique_tags}\n')
    f.write(f'min={int(freq.min())}, median={int(freq.median())}, mean={freq.mean():.4f}, max={int(freq.max())}\n')
print('[EDA] Saved tag frequency CSV and summary stats.')

# Co-occurrence top-50 on sample, save as CSV
K = 50
topK = list(freq.head(K).index)
idx_map = {t:i for i,t in enumerate(topK)}
sample_size = min(200_000, len(train))
sample_idx = train.sample(sample_size, random_state=(GLOBAL_SEED if 'GLOBAL_SEED' in globals() else 1337)).index
co_mat = np.zeros((K, K), dtype=np.int32)
for tags in tag_lists.loc[sample_idx]:
    present = [idx_map[t] for t in tags if t in idx_map]
    for i in range(len(present)):
        a = present[i]
        for j in range(i, len(present)):
            b = present[j]
            co_mat[a, b] += 1
            if a != b:
                co_mat[b, a] += 1
co_df = pd.DataFrame(co_mat, index=topK, columns=topK)
co_df.to_csv(os.path.join(out_dir, 'eda_top50_cooccurrence.csv'))
print('[EDA] Saved co-occurrence matrix CSV for top-50 tags.')

# Length distributions (quantiles + hist counts) for Title/Body/Code on sample
sample_size_len = min(100_000, len(train))
sample_df = train.sample(sample_size_len, random_state=(GLOBAL_SEED if 'GLOBAL_SEED' in globals() else 1337))
title_lens = sample_df['Title'].fillna('').apply(lambda s: len(normalize_text(s)))
body_len_list = []
code_len_list = []
for html in sample_df['Body'].fillna(''):
    body_txt, code_txt, _, _, _ = extract_text_and_code_pre_only(html)
    body_len_list.append(len(body_txt))
    code_len_list.append(len(code_txt))
lengths = {
    'title': np.array(title_lens, dtype=np.int32),
    'body': np.array(body_len_list, dtype=np.int32),
    'code': np.array(code_len_list, dtype=np.int32)
}
quantiles = [0.5, 0.75, 0.9, 0.95, 0.99]
rows = []
for name, arr in lengths.items():
    qs = np.quantile(arr, quantiles)
    rows.append({'channel': name, 'mean': float(arr.mean()), 'std': float(arr.std()), 'min': int(arr.min()), 'max': int(arr.max()), **{f'q{int(q*100)}': float(val) for q, val in zip(quantiles, qs)}})
pd.DataFrame(rows).to_csv(os.path.join(out_dir, 'eda_length_stats.csv'), index=False)

# Also save histogram bin counts (no plotting)
hist_rows = []
for name, arr in lengths.items():
    counts, edges = np.histogram(arr, bins=50)
    hist_rows.append(pd.DataFrame({'channel': name, 'bin_left': edges[:-1], 'bin_right': edges[1:], 'count': counts}))
pd.concat(hist_rows, ignore_index=True).to_csv(os.path.join(out_dir, 'eda_length_hist_counts.csv'), index=False)
print('[EDA] Saved length stats and histogram counts CSVs.')

# Zero-tag analysis
n_tags_col_exists = '__n_tags__' in train.columns
zero_tag_mask = (train['__n_tags__'] == 0) if n_tags_col_exists else (tag_lists.apply(len) == 0)
n_zero = int(zero_tag_mask.sum())
pct_zero = 100.0 * n_zero / len(train)
print(f"[EDA] Samples with zero tags: {n_zero} ({pct_zero:.3f}%)")
train.loc[zero_tag_mask, ['Id', 'Title']].head(20).to_csv(os.path.join(out_dir, 'eda_zero_tag_samples_head.csv'), index=False)
print('[EDA] Saved head(20) of zero-tag samples to CSV.')

print('[Milestone 1 Fixes - Headless] Completed parsing correction and expanded EDA (CSV outputs).')


In [None]:
# Dedicated memory optimization + schema validation + notebook hygiene fixes
import pandas as pd
import numpy as np
import re

assert 'train' in globals() and 'test' in globals(), 'train/test DataFrames not found.'

def mem_usage_mb(df: pd.DataFrame) -> float:
    return round(df.memory_usage(deep=True).sum() / (1024**2), 2)

print('[Memory] BEFORE optimization - train:', mem_usage_mb(train), 'MB; test:', mem_usage_mb(test), 'MB')

# Downcast Id columns to reduce footprint
for df_name, df in [('train', train), ('test', test)]:
    if 'Id' in df.columns and pd.api.types.is_integer_dtype(df['Id']):
        df['Id'] = pd.to_numeric(df['Id'], downcast='integer')
        print(f'[Memory] {df_name}.Id dtype ->', df['Id'].dtype)

# Convert Tags to category (kept as strings later for ML when needed)
if 'Tags' in train.columns and train['Tags'].dtype == object:
    train['Tags'] = train['Tags'].astype('category')
    print('[Memory] train.Tags dtype ->', train['Tags'].dtype)

print('[Memory] AFTER optimization  - train:', mem_usage_mb(train), 'MB; test:', mem_usage_mb(test), 'MB')

# Minor EDA validation: ensure Tags schema is space-separated without other delimiters
if 'Tags' in train.columns:
    # Convert to string temporarily for regex checks (category -> str)
    tags_str = train['Tags'].astype(str)
    bad_delim_mask = tags_str.str.contains(r'[;,\|]', regex=True)
    n_bad = int(bad_delim_mask.sum())
    print(f"[Schema Check] Rows with unexpected delimiters in Tags (commas/semicolons/pipes): {n_bad}")
    if n_bad > 0:
        display(train.loc[bad_delim_mask, ['Id','Title','Tags']].head(5))

# Notebook hygiene: remove legacy incorrect parsing function from namespace, ensure only <pre>-aware version stays
removed_legacy = False
if 'extract_text_and_code' in globals():
    try:
        del globals()['extract_text_and_code']
        removed_legacy = True
    except Exception as e:
        print('[Hygiene] Could not delete legacy function:', e)
print('[Hygiene] Legacy extract_text_and_code removed from namespace:', removed_legacy)

# Optional: expose the correct function under a common name for downstream code (without redefining)
if 'extract_text_and_code_pre_only' in globals():
    extract_text_and_code = extract_text_and_code_pre_only  # alias to the correct implementation
    print('[Hygiene] Alias set: extract_text_and_code -> extract_text_and_code_pre_only')

print('[Milestone 1] Memory optimization and schema validation complete. Ready for audit.')

In [None]:
# Environment hygiene: suppress matplotlib_inline post-run callback errors (non-blocking noise)
from IPython import get_ipython
print('[Env] Activating matplotlib_inline suppression...')
try:
    import matplotlib as _mpl
    # Monkey-patch missing attribute to satisfy backend_inline access
    if not hasattr(_mpl, 'backend_bases'):
        _mpl.backend_bases = type('backend_bases', (), {'_Backend': type('_Backend', (), {})})
    # Prefer headless backend
    try:
        import matplotlib.pyplot as plt
        try:
            plt.switch_backend('Agg')
        except Exception:
            pass
    except Exception:
        pass
    ip = get_ipython()
    if ip is not None and hasattr(ip, 'events'):
        # Remove any matplotlib_inline configure_once callback from post_run_cell
        cbs = ip.events.callbacks.get('post_run_cell', [])
        new_cbs = []
        for cb in cbs:
            name = getattr(cb, '__name__', '')
            mod = getattr(cb, '__module__', '')
            qual = getattr(cb, '__qualname__', '')
            # Heuristic: drop callbacks coming from matplotlib_inline backend_inline
            if ('matplotlib_inline' in mod) or ('backend_inline' in mod) or (name == 'configure_once') or ('configure_once' in qual):
                continue
            new_cbs.append(cb)
        ip.events.callbacks['post_run_cell'] = new_cbs
    print('[Env] matplotlib_inline suppression active.')
except Exception as e:
    print('[Env] matplotlib_inline suppression failed:', repr(e))

In [None]:
# Milestone 2 Setup (Clean, Linear): Cache parsed fields to Pickle for fast, leak-proof CV
# - Consolidates required M1 logic (HTML parsing, normalization) into a single executable cell
# - Processes train/test in chunks, extracts title_norm, body_wo_pre, code_pre, and meta
# - Saves compact pickle caches to disk (avoids pyarrow/fastparquet issues)

import os, re, gc, time, math, shutil
from datetime import datetime
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

def backup_notebook():
    nb_path = 'agent_notebook.ipynb'
    if os.path.exists(nb_path):
        ts = datetime.now().strftime('%Y%m%d_%H%M%S')
        dst = f'agent_notebook_backup_{ts}.ipynb'
        shutil.copy2(nb_path, dst)
        print(f'[Backup] Notebook copied to {dst}')
backup_notebook()

# Robust normalization and HTML parsing (block-level code only)
URL_RE   = re.compile(r'https?://\S+|www\.\S+', flags=re.IGNORECASE)
EMAIL_RE = re.compile(r'\b[\w\.-]+@[\w\.-]+\.[A-Za-z]{2,}\b')
HEX_RE   = re.compile(r'\b0x[0-9A-Fa-f]+\b')
NUM_RE   = re.compile(r'\b\d+\b')

def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ''
    s = s.lower()
    s = URL_RE.sub(' URL ', s)
    s = EMAIL_RE.sub(' EMAIL ', s)
    s = HEX_RE.sub(' HEX ', s)
    s = NUM_RE.sub(' 0 ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def extract_text_and_code_pre_only(html: str):
    if not isinstance(html, str):
        return '', '', 0, 0, 0.0
    soup = BeautifulSoup(html, 'lxml')
    pre_blocks = soup.find_all('pre')
    code_texts = []
    for pre in pre_blocks:
        code_texts.append(pre.get_text(' ', strip=True))
        pre.extract()
    text = soup.get_text(' ', strip=True)
    text_norm = normalize_text(text)
    code_norm = normalize_text(' \n '.join(code_texts))
    url_count = len(URL_RE.findall(text.lower()))
    puncts = re.findall(r'[\!\?\.,;:\-\(\)\[\]\{\}\#\+\*/\\\|\<\>\=\_\~\^\`\"]', text_norm)
    punct_density = (len(puncts) / max(1, len(text_norm)))
    return text_norm, code_norm, len(pre_blocks), url_count, punct_density

# Chunked cache builder -> returns a single DataFrame (accumulated) to be written once to pickle
def build_cache(csv_path: str, is_train: bool = True, chunksize: int = 100_000, subsample_frac: float | None = None) -> pd.DataFrame:
    t0 = time.time()
    all_parts = []
    usecols = ['Id', 'Title', 'Body'] + (['Tags'] if is_train else [])
    reader = pd.read_csv(csv_path, usecols=usecols, chunksize=chunksize)
    total_rows = 0
    for i, chunk in enumerate(reader):
        if subsample_frac is not None and 0 < subsample_frac < 1:
            chunk = chunk.sample(frac=subsample_frac, random_state=GLOBAL_SEED)
        if 'Id' in chunk.columns:
            chunk['Id'] = pd.to_numeric(chunk['Id'], downcast='integer')
        out_records = {
            'Id': [], 'title_norm': [], 'body_text': [], 'code_text': [],
            'title_len': [], 'body_len': [], 'code_len': [], 'num_block_code': [], 'num_urls': [], 'punct_density': []
        }
        if is_train:
            out_records['Tags'] = []
        titles = chunk['Title'].fillna('').astype(str).tolist()
        titles_norm = [normalize_text(t) for t in titles]
        bodies = chunk['Body'].fillna('').astype(str).tolist()
        for idx in range(len(chunk)):
            body_txt, code_txt, n_code, n_url, pden = extract_text_and_code_pre_only(bodies[idx])
            out_records['Id'].append(int(chunk.iloc[idx]['Id']))
            out_records['title_norm'].append(titles_norm[idx])
            out_records['body_text'].append(body_txt)
            out_records['code_text'].append(code_txt)
            out_records['title_len'].append(len(titles_norm[idx]))
            out_records['body_len'].append(len(body_txt))
            out_records['code_len'].append(len(code_txt))
            out_records['num_block_code'].append(int(n_code))
            out_records['num_urls'].append(int(n_url))
            out_records['punct_density'].append(float(pden))
            if is_train:
                out_records['Tags'].append(chunk.iloc[idx]['Tags'])
        out_df = pd.DataFrame(out_records)
        all_parts.append(out_df)
        total_rows += len(out_df)
        if (i + 1) % 5 == 0:
            print(f'[Cache] Processed ~{total_rows} rows so far for {os.path.basename(csv_path)}')
        del chunk, out_df, out_records, titles, titles_norm, bodies
        gc.collect()
    if len(all_parts) == 0:
        result = pd.DataFrame(columns=['Id','title_norm','body_text','code_text','title_len','body_len','code_len','num_block_code','num_urls','punct_density'] + (['Tags'] if is_train else []))
    else:
        result = pd.concat(all_parts, ignore_index=True)
    print(f'[Cache] Built DataFrame with {len(result)} rows in {time.time()-t0:.1f}s from {os.path.basename(csv_path)}')
    return result

# Paths and configuration
TRAIN_CSV = 'train.csv'
TEST_CSV  = 'test.csv'
PARSED_TRAIN_PKL = 'parsed_train.pkl'
PARSED_TEST_PKL  = 'parsed_test.pkl'

# For speed during first baseline, use 5% of train; keep full test to enable full inference later
SUBSAMPLE_FRAC_TRAIN = 0.05  # quick baseline cache on 5% of train; set to None for full cache later
SUBSAMPLE_FRAC_TEST  = None  # do not subsample test

if not os.path.exists(PARSED_TRAIN_PKL):
    print('[Cache] Creating train cache (pickle) ...')
    df_train_cache = build_cache(TRAIN_CSV, is_train=True, chunksize=50_000, subsample_frac=SUBSAMPLE_FRAC_TRAIN)
    df_train_cache.to_pickle(PARSED_TRAIN_PKL)
    print(f'[Cache] Wrote {PARSED_TRAIN_PKL} ({len(df_train_cache)} rows)')
    del df_train_cache; gc.collect()
else:
    print('[Cache] Found existing', PARSED_TRAIN_PKL)

if not os.path.exists(PARSED_TEST_PKL):
    print('[Cache] Creating test cache (pickle) ...')
    df_test_cache = build_cache(TEST_CSV, is_train=False, chunksize=50_000, subsample_frac=SUBSAMPLE_FRAC_TEST)
    df_test_cache.to_pickle(PARSED_TEST_PKL)
    print(f'[Cache] Wrote {PARSED_TEST_PKL} ({len(df_test_cache)} rows)')
    del df_test_cache; gc.collect()
else:
    print('[Cache] Found existing', PARSED_TEST_PKL)

print('[Milestone 2 Setup] Parsed caches ready (pickle). Next: 3-fold leak-proof baseline with TF-IDF + OVR-LR.')

In [None]:
# Install and configure Parquet engine to avoid pyarrow issues
%pip install --quiet fastparquet
import importlib, pandas as pd
assert importlib.util.find_spec('fastparquet') is not None, 'fastparquet not installed'
pd.set_option('io.parquet.engine', 'fastparquet')
print('[Parquet] Using engine:', pd.get_option('io.parquet.engine'))

In [None]:
# Milestone 2: Fast 3-fold leak-proof baseline (word + char) with OVR Logistic (SGD) and global threshold sweep
# - Uses cached parsed_train.pkl (5% subsample) and parsed_test.pkl (full)
# - Speed-tuned: subsample train to 120k, prune labels by freq >= 100, smaller TF-IDF caps, early-stopping SGD
# - Iterative stratification (multilabel) for CV (scikit-multilearn)
# - Vectorizers fit within folds only (no leakage)
# - Outputs: OOF micro-F1, best threshold, submission.csv (baseline)

%pip install --quiet scikit-multilearn
import os, gc, time, math, sys
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.model_selection import IterativeStratification

GLOBAL_SEED = 1337
rng = np.random.RandomState(GLOBAL_SEED)
np.random.seed(GLOBAL_SEED)

t0_all = time.time()
TRAIN_PKL = 'parsed_train.pkl'
TEST_PKL  = 'parsed_test.pkl'
assert os.path.exists(TRAIN_PKL) and os.path.exists(TEST_PKL), 'Missing parsed caches. Run cache cell first.'

df_tr = pd.read_pickle(TRAIN_PKL)
df_te = pd.read_pickle(TEST_PKL)
print('[Data] Loaded caches:', df_tr.shape, df_te.shape)

# Build texts
def build_text(df: pd.DataFrame):
    # Baseline: concat title + body (ignore code for baseline)
    return (df['title_norm'].fillna('') + ' ' + df['body_text'].fillna('')).astype(str)
X_text_full = build_text(df_tr)
X_text_test = build_text(df_te)

# Labels: split Tags to list
y_tags_full = df_tr['Tags'].astype(str).apply(lambda s: s.split())

# Label pruning for speed
MIN_FREQ = 100
freq = pd.Series([t for lst in y_tags_full for t in lst]).value_counts()
kept_labels = set(freq[freq >= MIN_FREQ].index.tolist())
y_pruned_full = y_tags_full.apply(lambda lst: [t for t in lst if t in kept_labels])
coverage = 100.0 * (y_pruned_full.apply(len) > 0).mean()
print(f'[Labels] Kept labels >= {MIN_FREQ}: {len(kept_labels)}; samples with >=1 kept label: {coverage:.2f}%')

# Filter out samples with zero kept labels to keep CV stable
mask_keep = (y_pruned_full.apply(len) > 0).values
X_text_f = X_text_full.loc[mask_keep].reset_index(drop=True)
y_pruned_f = y_pruned_full.loc[mask_keep].reset_index(drop=True)
print('[Data] After filtering:', X_text_f.shape[0], 'samples')

# Subsample for speed (still leak-proof inside subset)
SUBSAMPLE_N = 120_000
if X_text_f.shape[0] > SUBSAMPLE_N:
    idx_sub = rng.choice(X_text_f.shape[0], size=SUBSAMPLE_N, replace=False)
    idx_sub.sort()
    X_text = X_text_f.iloc[idx_sub].reset_index(drop=True)
    y_pruned = y_pruned_f.iloc[idx_sub].reset_index(drop=True)
    print(f'[Data] Subsampled to {SUBSAMPLE_N} for fast baseline')
else:
    X_text = X_text_f
    y_pruned = y_pruned_f

mlb = MultiLabelBinarizer(sparse_output=True)
Y = mlb.fit_transform(y_pruned)
labels_list = mlb.classes_.tolist()
print('[Labels] Final label dimension:', Y.shape[1])

# CV setup using scikit-multilearn (supports sparse Y)
n_splits = 3
mskf = IterativeStratification(n_splits=n_splits, order=1)

# Model and vectorizers configs (baseline, fast)
word_vec_params = dict(
    analyzer='word', ngram_range=(1,2), min_df=3, max_df=0.95, max_features=100_000,
    sublinear_tf=True, dtype=np.float32
)
char_vec_params = dict(
    analyzer='char_wb', ngram_range=(3,5), min_df=3, max_features=150_000,
    dtype=np.float32
)

def vectorize_fit_transform(X_train_text, X_valid_text):
    wv = TfidfVectorizer(**word_vec_params)
    cv = TfidfVectorizer(**char_vec_params)
    Xw_tr = wv.fit_transform(X_train_text)
    Xc_tr = cv.fit_transform(X_train_text)
    Xw_va = wv.transform(X_valid_text)
    Xc_va = cv.transform(X_valid_text)
    X_tr = sparse.hstack([Xw_tr, Xc_tr]).tocsr().astype(np.float32)
    X_va = sparse.hstack([Xw_va, Xc_va]).tocsr().astype(np.float32)
    return (X_tr, X_va, wv, cv)

def train_ovr_sgd(X_tr, Y_tr):
    base = SGDClassifier(loss='log_loss', penalty='l2', alpha=2e-4, max_iter=10, tol=1e-3,
                         n_jobs=-1, random_state=GLOBAL_SEED, early_stopping=True, n_iter_no_change=3)
    clf = OneVsRestClassifier(base, n_jobs=-1)
    clf.fit(X_tr, Y_tr)
    return clf

def predict_proba_ovr(clf, X):
    # SGDClassifier decision_function -> convert via sigmoid
    from scipy.special import expit
    scores = clf.decision_function(X)
    return expit(scores)

# OOF loop with streaming F1 aggregation to avoid dense arrays
ths = np.linspace(0.1, 0.5, 8)
tp_tot = np.zeros_like(ths, dtype=np.int64)
fp_tot = np.zeros_like(ths, dtype=np.int64)
fn_tot = np.zeros_like(ths, dtype=np.int64)
fold_times = []
n_samples = X_text.shape[0]
X_dummy = np.zeros((n_samples, 1))
for fold, (tr_idx, va_idx) in enumerate(mskf.split(X_dummy, Y)):
    t0 = time.time()
    X_tr_text, X_va_text = X_text.iloc[tr_idx], X_text.iloc[va_idx]
    Y_tr, Y_va = Y[tr_idx], Y[va_idx]
    X_tr, X_va, wv, cv = vectorize_fit_transform(X_tr_text, X_va_text)
    print(f'[Fold {fold}] Shapes train={X_tr.shape}, valid={X_va.shape}, labels={Y_tr.shape[1]}')
    clf = train_ovr_sgd(X_tr, Y_tr)
    va_probs = predict_proba_ovr(clf, X_va)
    # Aggregate TP/FP/FN per threshold
    for ti, thr in enumerate(ths):
        tp = 0; fp = 0; fn = 0
        bs = 4096
        for s in range(0, va_probs.shape[0], bs):
            e = min(va_probs.shape[0], s + bs)
            batch_probs = va_probs[s:e]
            Y_batch = Y_va[s:e]
            for i in range(e - s):
                true_idx = Y_batch[i].indices
                pred_idx = np.where(batch_probs[i] >= thr)[0]
                if true_idx.size == 0 and pred_idx.size == 0:
                    continue
                if pred_idx.size == 0 or true_idx.size == 0:
                    tp_i = 0
                else:
                    a = true_idx; b = pred_idx; ia = ib = 0; tp_i = 0
                    while ia < a.size and ib < b.size:
                        if a[ia] == b[ib]:
                            tp_i += 1; ia += 1; ib += 1
                        elif a[ia] < b[ib]:
                            ia += 1
                        else:
                            ib += 1
                tp += tp_i
                fp += int(pred_idx.size - tp_i)
                fn += int(true_idx.size - tp_i)
        tp_tot[ti] += tp; fp_tot[ti] += fp; fn_tot[ti] += fn
    dt = time.time() - t0
    fold_times.append(dt)
    print(f'[Fold {fold}] Done in {dt/60:.1f} min')
    del X_tr, X_va, X_tr_text, X_va_text, Y_tr, Y_va, wv, cv, clf, va_probs
    gc.collect()

# Compute micro-F1 per threshold
f1s = []
for ti in range(len(ths)):
    tp, fp, fn = tp_tot[ti], fp_tot[ti], fn_tot[ti]
    denom = (2*tp + fp + fn)
    f1s.append(0.0 if denom == 0 else (2.0 * tp) / denom)
best_idx = int(np.argmax(f1s))
best_thr = float(ths[best_idx])
best_f1 = float(f1s[best_idx])
for t, f in zip(ths, f1s):
    print(f'[OOF] thr={t:.3f} micro-F1={f:.5f}')
print(f'[OOF] Best micro-F1={best_f1:.5f} at thr={best_thr:.3f}')
print(f'[Timing] Avg fold time: {np.mean(fold_times)/60:.1f} min, total {np.sum(fold_times)/60:.1f} min')

# Refit on all filtered (subsampled) data and predict test for a baseline submission
print('[Final Fit] Vectorizing full filtered train and test...')
wv_full = TfidfVectorizer(**word_vec_params)
cv_full = TfidfVectorizer(**char_vec_params)
Xw_full = wv_full.fit_transform(X_text)
Xc_full = cv_full.fit_transform(X_text)
X_full = sparse.hstack([Xw_full, Xc_full]).tocsr().astype(np.float32)
Xw_test = wv_full.transform(X_text_test)
Xc_test = cv_full.transform(X_text_test)
X_test_mat = sparse.hstack([Xw_test, Xc_test]).tocsr().astype(np.float32)

clf_full = train_ovr_sgd(X_full, Y)
test_probs = predict_proba_ovr(clf_full, X_test_mat)
pred_bin = (test_probs >= best_thr).astype(np.int32)

# Safety rule: ensure at least 1 tag per sample
rowsums = pred_bin.sum(axis=1)
if rowsums.ndim == 2:
    rowsums = rowsums.ravel()
for i in range(pred_bin.shape[0]):
    if rowsums[i] == 0:
        j = int(np.argmax(test_probs[i]))
        pred_bin[i, j] = 1

# Build submission
id_test = df_te['Id'].astype(int).values
pred_tags = []
for i in range(pred_bin.shape[0]):
    inds = np.where(pred_bin[i] == 1)[0]
    tags = [labels_list[j] for j in inds]
    pred_tags.append(' '.join(tags) if len(tags) > 0 else labels_list[int(np.argmax(test_probs[i]))])
sub = pd.DataFrame({'Id': id_test, 'Tags': pred_tags})
sub.to_csv('submission.csv', index=False)
print('[Submission] Wrote submission.csv with shape:', sub.shape)
print('[Milestone 2] Baseline complete. OOF micro-F1 (global):', f'{best_f1:.5f}', 'thr=', f'{best_thr:.3f}')
print('[Total time] {:.1f} min'.format((time.time()-t0_all)/60.0))


In [None]:
# Milestone 2 (Reset Caching): Build FULL training cache (no subsample) for leak-proof CV
# - Writes parsed_train_full.pkl to avoid clobbering the 5% cache
# - Reuses the approved HTML parsing: extract block-level <pre> code only

import os, re, gc, time, shutil
from datetime import datetime
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

def backup_notebook():
    nb_path = 'agent_notebook.ipynb'
    if os.path.exists(nb_path):
        ts = datetime.now().strftime('%Y%m%d_%H%M%S')
        dst = f'agent_notebook_backup_{ts}.ipynb'
        shutil.copy2(nb_path, dst)
        print(f'[Backup] Notebook copied to {dst}')
backup_notebook()

# Normalization and parsing (as approved in M1)
URL_RE   = re.compile(r'https?://\S+|www\.\S+', flags=re.IGNORECASE)
EMAIL_RE = re.compile(r'\b[\w\.-]+@[\w\.-]+\.[A-Za-z]{2,}\b')
HEX_RE   = re.compile(r'\b0x[0-9A-Fa-f]+\b')
NUM_RE   = re.compile(r'\b\d+\b')

def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ''
    s = s.lower()
    s = URL_RE.sub(' URL ', s)
    s = EMAIL_RE.sub(' EMAIL ', s)
    s = HEX_RE.sub(' HEX ', s)
    s = NUM_RE.sub(' 0 ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def extract_text_and_code_pre_only(html: str):
    if not isinstance(html, str):
        return '', '', 0, 0, 0.0
    soup = BeautifulSoup(html, 'lxml')
    pre_blocks = soup.find_all('pre')
    code_texts = []
    for pre in pre_blocks:
        code_texts.append(pre.get_text(' ', strip=True))
        pre.extract()
    text = soup.get_text(' ', strip=True)
    text_norm = normalize_text(text)
    code_norm = normalize_text(' \n '.join(code_texts))
    url_count = len(URL_RE.findall(text.lower()))
    puncts = re.findall(r'[\!\?\.,;:\-\(\)\[\]\{\}\#\+\*/\\\|\<\>\=\_\~\^\`\"]', text_norm)
    punct_density = (len(puncts) / max(1, len(text_norm)))
    return text_norm, code_norm, len(pre_blocks), url_count, punct_density

def build_cache(csv_path: str, is_train: bool = True, chunksize: int = 100_000) -> pd.DataFrame:
    t0 = time.time()
    all_parts = []
    usecols = ['Id', 'Title', 'Body'] + (['Tags'] if is_train else [])
    reader = pd.read_csv(csv_path, usecols=usecols, chunksize=chunksize)
    total_rows = 0
    for i, chunk in enumerate(reader):
        # Downcast Id
        if 'Id' in chunk.columns:
            chunk['Id'] = pd.to_numeric(chunk['Id'], downcast='integer')
        out_records = {
            'Id': [], 'title_norm': [], 'body_text': [], 'code_text': [],
            'title_len': [], 'body_len': [], 'code_len': [], 'num_block_code': [], 'num_urls': [], 'punct_density': []
        }
        if is_train:
            out_records['Tags'] = []
        titles = chunk['Title'].fillna('').astype(str).tolist()
        titles_norm = [normalize_text(t) for t in titles]
        bodies = chunk['Body'].fillna('').astype(str).tolist()
        for idx in range(len(chunk)):
            body_txt, code_txt, n_code, n_url, pden = extract_text_and_code_pre_only(bodies[idx])
            out_records['Id'].append(int(chunk.iloc[idx]['Id']))
            out_records['title_norm'].append(titles_norm[idx])
            out_records['body_text'].append(body_txt)
            out_records['code_text'].append(code_txt)
            out_records['title_len'].append(len(titles_norm[idx]))
            out_records['body_len'].append(len(body_txt))
            out_records['code_len'].append(len(code_txt))
            out_records['num_block_code'].append(int(n_code))
            out_records['num_urls'].append(int(n_url))
            out_records['punct_density'].append(float(pden))
            if is_train:
                out_records['Tags'].append(chunk.iloc[idx]['Tags'])
        out_df = pd.DataFrame(out_records)
        all_parts.append(out_df)
        total_rows += len(out_df)
        if (i + 1) % 5 == 0:
            print(f'[Cache] Processed ~{total_rows} rows so far for {os.path.basename(csv_path)}')
        del chunk, out_df, out_records, titles, titles_norm, bodies
        gc.collect()
    result = pd.concat(all_parts, ignore_index=True) if len(all_parts) else pd.DataFrame()
    print(f'[Cache] Built DataFrame with {len(result)} rows in {time.time()-t0:.1f}s from {os.path.basename(csv_path)}')
    return result

TRAIN_CSV = 'train.csv'
PARSED_TRAIN_FULL_PKL = 'parsed_train_full.pkl'

if not os.path.exists(PARSED_TRAIN_FULL_PKL):
    print('[Cache] Building FULL train cache (no subsample)...')
    df_train_full = build_cache(TRAIN_CSV, is_train=True, chunksize=75_000)
    df_train_full.to_pickle(PARSED_TRAIN_FULL_PKL)
    print(f'[Cache] Wrote {PARSED_TRAIN_FULL_PKL} with {len(df_train_full)} rows')
    del df_train_full; gc.collect()
else:
    print('[Cache] Found existing', PARSED_TRAIN_FULL_PKL)

print('[M2 Cache Reset] Full training cache ready. Next: implement leak-proof per-fold label pruning + multi-channel features.')


In [None]:
# Process Remediation: Create a pristine, single-cell notebook per audit mandate
# - Archives current notebook as a devlog
# - Writes a new notebook (agent_notebook_pristine.ipynb) with ONE idempotent setup cell
#   that imports dependencies, defines helpers, suppresses matplotlib_inline noise,
#   and builds/loads full caches into memory (df_train_full, df_test_cache).
# - Does NOT run the setup; it's authored for a clean Run-All experience.

import os, shutil, time, re, gc
from datetime import datetime
import nbformat as nbf

ts = datetime.now().strftime('%Y%m%d_%H%M%S')
src_nb = 'agent_notebook.ipynb'
devlog_nb = f'agent_notebook_devlog_{ts}.ipynb'
pristine_nb = 'agent_notebook_pristine.ipynb'

# 1) Archive current notebook
if os.path.exists(src_nb):
    shutil.copy2(src_nb, devlog_nb)
    print(f'[Archive] Current notebook archived to {devlog_nb}')
else:
    print('[Archive] Source notebook not found; skipping archive.')

# 2) Compose the single, consolidated setup cell (idempotent)
setup_source = r'''\
# Facebook Recruiting III - Keyword Extraction (Pristine Setup)
# Single-cell, idempotent setup: imports, constants, helpers, env hygiene, and full-data caching.

import os, re, gc, time, shutil
from datetime import datetime
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# --- Environment hygiene: suppress matplotlib_inline post-run callback noise ---
from IPython import get_ipython
print('[Env] Activating matplotlib_inline suppression...')
try:
    import matplotlib as _mpl
    if not hasattr(_mpl, 'backend_bases'):
        _mpl.backend_bases = type('backend_bases', (), {'_Backend': type('_Backend', (), {})})
    try:
        import matplotlib.pyplot as plt
        try:
            plt.switch_backend('Agg')
        except Exception:
            pass
    except Exception:
        pass
    ip = get_ipython()
    if ip is not None and hasattr(ip, 'events'):
        cbs = ip.events.callbacks.get('post_run_cell', [])
        new_cbs = []
        for cb in cbs:
            name = getattr(cb, '__name__', '')
            mod = getattr(cb, '__module__', '')
            qual = getattr(cb, '__qualname__', '')
            if ('matplotlib_inline' in mod) or ('backend_inline' in mod) or (name == 'configure_once') or ('configure_once' in qual):
                continue
            new_cbs.append(cb)
        ip.events.callbacks['post_run_cell'] = new_cbs
    print('[Env] matplotlib_inline suppression active.')
except Exception as e:
    print('[Env] matplotlib_inline suppression failed:', repr(e))

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

def backup_notebook():
    nb_path = 'agent_notebook_pristine.ipynb'
    if os.path.exists(nb_path):
        ts = datetime.now().strftime('%Y%m%d_%H%M%S')
        dst = f'agent_notebook_pristine_backup_{ts}.ipynb'
        shutil.copy2(nb_path, dst)
        print(f'[Backup] Pristine notebook copied to {dst}')
backup_notebook()

# Normalization and parsing (approved) - RAW regex strings to preserve \b, \d etc.
URL_RE   = re.compile(r'https?://\S+|www\.\S+', flags=re.IGNORECASE)
EMAIL_RE = re.compile(r'(?i)\b[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}\b')
HEX_RE   = re.compile(r'\b0x[0-9a-f]+\b', flags=re.IGNORECASE)
NUM_RE   = re.compile(r'\b\d+\b')

def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ''
    s = s.lower()
    s = URL_RE.sub(' URL ', s)
    s = EMAIL_RE.sub(' EMAIL ', s)
    s = HEX_RE.sub(' HEX ', s)
    s = NUM_RE.sub(' 0 ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def extract_text_and_code_pre_only(html: str):
    if not isinstance(html, str):
        return '', '', 0, 0, 0.0
    soup = BeautifulSoup(html, 'lxml')
    pre_blocks = soup.find_all('pre')
    code_texts = []
    for pre in pre_blocks:
        code_texts.append(pre.get_text(' ', strip=True))
        pre.extract()
    text = soup.get_text(' ', strip=True)
    text_norm = normalize_text(text)
    # Use a simple space to join code blocks to avoid escape issues in code generation
    code_norm = normalize_text(' '.join(code_texts))
    url_count = len(URL_RE.findall(text.lower()))
    puncts = re.findall(r'[\!\?\.,;:\-\(\)\[\]\{\}\#\+\*/\\\|\<\>\=\_\~\^\`\"]', text_norm)
    punct_density = (len(puncts) / max(1, len(text_norm)))
    return text_norm, code_norm, len(pre_blocks), url_count, punct_density

def build_cache(csv_path: str, is_train: bool = True, chunksize: int = 100_000) -> pd.DataFrame:
    t0 = time.time()
    all_parts = []
    usecols = ['Id', 'Title', 'Body'] + (['Tags'] if is_train else [])
    reader = pd.read_csv(csv_path, usecols=usecols, chunksize=chunksize)
    total_rows = 0
    for i, chunk in enumerate(reader):
        if 'Id' in chunk.columns:
            chunk['Id'] = pd.to_numeric(chunk['Id'], downcast='integer')
        out_records = {
            'Id': [], 'title_norm': [], 'body_text': [], 'code_text': [],
            'title_len': [], 'body_len': [], 'code_len': [], 'num_block_code': [], 'num_urls': [], 'punct_density': []
        }
        if is_train:
            out_records['Tags'] = []
        titles = chunk['Title'].fillna('').astype(str).tolist()
        titles_norm = [normalize_text(t) for t in titles]
        bodies = chunk['Body'].fillna('').astype(str).tolist()
        for idx in range(len(chunk)):
            body_txt, code_txt, n_code, n_url, pden = extract_text_and_code_pre_only(bodies[idx])
            out_records['Id'].append(int(chunk.iloc[idx]['Id']))
            out_records['title_norm'].append(titles_norm[idx])
            out_records['body_text'].append(body_txt)
            out_records['code_text'].append(code_txt)
            out_records['title_len'].append(len(titles_norm[idx]))
            out_records['body_len'].append(len(body_txt))
            out_records['code_len'].append(len(code_txt))
            out_records['num_block_code'].append(int(n_code))
            out_records['num_urls'].append(int(n_url))
            out_records['punct_density'].append(float(pden))
            if is_train:
                out_records['Tags'].append(chunk.iloc[idx]['Tags'])
        out_df = pd.DataFrame(out_records)
        all_parts.append(out_df)
        total_rows += len(out_df)
        if (i + 1) % 5 == 0:
            print(f'[Cache] Processed ~{total_rows} rows so far for {os.path.basename(csv_path)}')
        del chunk, out_df, out_records, titles, titles_norm, bodies
        gc.collect()
    result = pd.concat(all_parts, ignore_index=True) if len(all_parts) else pd.DataFrame()
    print(f'[Cache] Built DataFrame with {len(result)} rows in {time.time()-t0:.1f}s from {os.path.basename(csv_path)}')
    return result

TRAIN_CSV = 'train.csv'
TEST_CSV  = 'test.csv'
PARSED_TRAIN_FULL_PKL = 'parsed_train_full.pkl'
PARSED_TEST_PKL  = 'parsed_test.pkl'

print('[Setup] Starting pristine setup...')
built_any = False

# Load-or-build TRAIN cache, and ensure df_train_full in memory
if os.path.exists(PARSED_TRAIN_FULL_PKL):
    df_train_full = pd.read_pickle(PARSED_TRAIN_FULL_PKL)
    print(f"[Cache] Loaded {PARSED_TRAIN_FULL_PKL} with shape {df_train_full.shape}")
else:
    print('[Cache] Building FULL train cache (no subsample)...')
    df_train_full = build_cache(TRAIN_CSV, is_train=True, chunksize=75_000)
    df_train_full.to_pickle(PARSED_TRAIN_FULL_PKL)
    print(f'[Cache] Wrote {PARSED_TRAIN_FULL_PKL} with {len(df_train_full)} rows')
    built_any = True

# Load-or-build TEST cache, and ensure df_test_cache in memory
if os.path.exists(PARSED_TEST_PKL):
    df_test_cache = pd.read_pickle(PARSED_TEST_PKL)
    print(f"[Cache] Loaded {PARSED_TEST_PKL} with shape {df_test_cache.shape}")
else:
    print('[Cache] Building test cache (pickle)...')
    df_test_cache = build_cache(TEST_CSV, is_train=False, chunksize=75_000)
    df_test_cache.to_pickle(PARSED_TEST_PKL)
    print(f'[Cache] Wrote {PARSED_TEST_PKL} with {len(df_test_cache)} rows')
    built_any = True

print('[Setup] Completed. Built any:', built_any)
print('[Setup] df_train_full shape:', df_train_full.shape if 'df_train_full' in globals() else None)
print('[Setup] df_test_cache shape:', df_test_cache.shape if 'df_test_cache' in globals() else None)
'''

# 3) Create the pristine notebook object with a single code cell
nb = nbf.v4.new_notebook()
nb.cells = [nbf.v4.new_code_cell(setup_source)]
nb.metadata["kernelspec"] = {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
}
nb.metadata["language_info"] = {
    "name": "python",
    "version": "3.11"
}

# 4) Write pristine notebook to disk
with open(pristine_nb, 'w', encoding='utf-8') as f:
    nbf.write(nb, f)
print(f'[Pristine] Wrote {pristine_nb} with a single idempotent setup cell.')
print('[Instruction] Open agent_notebook_pristine.ipynb and Run-All. It will load/build caches, suppress matplotlib_inline noise, and expose df_train_full/df_test_cache in memory.')


In [None]:
# Install nbformat to programmatically create a pristine notebook per audit requirements
%pip install --quiet nbformat
import importlib
assert importlib.util.find_spec('nbformat') is not None, 'nbformat failed to install'
print('nbformat installed and importable.')


In [None]:
# Verification step (mandatory): validate pristine setup regex normalization and parsing
# - Reads agent_notebook_pristine.ipynb
# - Executes its single setup cell in an isolated namespace
# - Asserts normalize_text and extract_text_and_code_pre_only behave correctly

import os
import nbformat as nbf

pristine_nb = 'agent_notebook_pristine.ipynb'
assert os.path.exists(pristine_nb), '[Verify] Pristine notebook not found.'

# Load pristine notebook and extract the single code cell
nb = nbf.read(pristine_nb, as_version=4)
cells = [c for c in nb.cells if c.cell_type == 'code']
assert len(cells) == 1, f'[Verify] Expected 1 code cell in pristine notebook, found {len(cells)}'
setup_code = cells[0].source
print('[Verify] Loaded setup cell code length:', len(setup_code))

# Execute the setup cell in an isolated global namespace
g = {}
exec(setup_code, g, g)

# Pull required symbols
normalize_text = g.get('normalize_text')
extract_text_and_code_pre_only = g.get('extract_text_and_code_pre_only')
URL_RE = g.get('URL_RE'); EMAIL_RE = g.get('EMAIL_RE'); HEX_RE = g.get('HEX_RE'); NUM_RE = g.get('NUM_RE')
assert callable(normalize_text), '[Verify] normalize_text is not defined/callable.'
assert callable(extract_text_and_code_pre_only), '[Verify] extract_text_and_code_pre_only is not defined/callable.'
assert URL_RE is not None and EMAIL_RE is not None and HEX_RE is not None and NUM_RE is not None, '[Verify] One or more regex patterns missing.'
print('[Verify] Patterns:', 'URL=', URL_RE.pattern, 'EMAIL=', EMAIL_RE.pattern, 'HEX=', HEX_RE.pattern, 'NUM=', NUM_RE.pattern)

# Tests for normalize_text
s_in = 'Email: Foo.Bar-123@example.co.uk visit: https://example.com/path?q=1&a=2 hex=0xDEADbeef and numbers 42, 007.'
s_out = normalize_text(s_in)
print('[Verify] normalize_text output:', s_out[:160], '...')
# Validate replacements
assert ' URL ' in s_out, '[Verify] URL replacement failed'
assert ' EMAIL ' in s_out, '[Verify] EMAIL replacement failed'
assert '@' not in s_out, '[Verify] EMAIL address not removed'
assert ' HEX ' in s_out, '[Verify] HEX replacement failed'
assert '0x' not in s_out, '[Verify] HEX literal not removed'
assert ' 0 ' in s_out, '[Verify] number normalization failed'


# Tests for extract_text_and_code_pre_only
html = '''<p>Body has URL https://x.y and inline <code>print(123)</code>.</p>\n<pre><code>int main(){return 0;}</code></pre>\n<p>mail me: a@b.cc</p>'''
body_txt, code_txt, n_code, n_url, pden = extract_text_and_code_pre_only(html)
print('[Verify] body_txt:', body_txt)
print('[Verify] code_txt:', code_txt)
print('[Verify] n_code:', n_code, 'n_url:', n_url, 'punct_density:', pden)
# Inline <code> should remain in body text (numbers normalized to 0). Check for 'print' token presence.
assert 'print' in body_txt, '[Verify] Inline <code> content missing from body text.'
assert 'int main' in code_txt, '[Verify] <pre> code should be extracted to code_txt.'
assert n_code == 1, '[Verify] num_block_code should be 1.'
assert n_url >= 1, '[Verify] URL count should reflect URL presence.'

print('[Verify] PASSED: Pristine setup normalization and parsing behaviors are correct.')


In [None]:
# Append REFACTORED Phase 2 modeling cell to agent_notebook_pristine.ipynb (v3: streaming/online learning)
import nbformat as nbf, os

pristine_nb = 'agent_notebook_pristine.ipynb'
assert os.path.exists(pristine_nb), 'Pristine notebook not found.'
nb = nbf.read(pristine_nb, as_version=4)

# Keep only the first setup cell; drop any previously appended modeling cells for cleanliness
code_cells = [i for i,c in enumerate(nb.cells) if c.cell_type=='code']
if len(code_cells) > 1:
    nb.cells = [nb.cells[code_cells[0]]]

modeling_source = r'''\
# Phase 2 (Refactored v3): Scalable 5-fold CV with streaming features + online OVR via SGD (partial_fit)
# - Avoids materializing full X_tr/X_va: transforms batches on the fly per channel and hstack per-batch only
# - Per-fold label pruning + dynamic label sharding (4GB coef_ budget)
# - Streaming threshold optimization with correct FN accounting for excluded labels
# - Hygiene: no nested parallelism, sparse-safe ops, per-fold vectorizer fit, artifact persistence, safety rule
# - Micro-pilot mode (conditional): deterministic subsample and run only the first fold to validate pipeline

import os, gc, time, math
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.linear_model import SGDClassifier

# Dependency guard: iterative-stratification
try:
    from skmultilearn.model_selection import IterativeStratification
except Exception:
    import sys
    from subprocess import run
    run([sys.executable, '-m', 'pip', 'install', '--quiet', 'scikit-multilearn'])
    from skmultilearn.model_selection import IterativeStratification

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

# Expect df_train_full and df_test_cache in memory from setup cell
assert 'df_train_full' in globals() and 'df_test_cache' in globals(), 'Run setup cell first to load caches.'

# Prepare texts and labels
def to_list_tags(s):
    return s.split() if isinstance(s, str) else []
y_lists = df_train_full['Tags'].astype(str).apply(to_list_tags)
mlb = MultiLabelBinarizer(sparse_output=True)
Y_all = mlb.fit_transform(y_lists)
labels_list = mlb.classes_.tolist()
n_samples, n_labels = Y_all.shape
print('[Labels] #samples:', n_samples, '#labels:', n_labels)

# Channels
title_text = df_train_full['title_norm'].fillna('').astype(str)
body_text  = df_train_full['body_text'].fillna('').astype(str)
code_text  = df_train_full['code_text'].fillna('').astype(str)
meta_cols = ['title_len','body_len','code_len','num_block_code','num_urls','punct_density']
meta_all = df_train_full[meta_cols].astype(np.float32).values

# Micro-pilot configuration (auditor-approved): subsample deterministically and run only first fold
MICRO_PILOT = True
PILOT_N = 50_000  # target rows for micro-pilot (reduced for faster turnaround)
if MICRO_PILOT:
    n_keep = int(min(PILOT_N, n_samples))
    rng = np.random.RandomState(GLOBAL_SEED)
    idx_keep = rng.choice(n_samples, size=n_keep, replace=False)
    idx_keep.sort()
    # Subset all channels and labels consistently
    title_text = title_text.iloc[idx_keep].reset_index(drop=True)
    body_text  = body_text.iloc[idx_keep].reset_index(drop=True)
    code_text  = code_text.iloc[idx_keep].reset_index(drop=True)
    meta_all   = meta_all[idx_keep]
    Y_all      = Y_all[idx_keep]
    n_samples  = Y_all.shape[0]
    print(f'[Pilot] Subsampled to {n_samples} rows for micro-pilot.')

# CV setup
n_splits = 5
mskf = IterativeStratification(n_splits=n_splits, order=1)

# Vectorizer configs (fit within fold on train split)
title_vec_cfg = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=200_000, sublinear_tf=True, dtype=np.float32)
# Use streaming hashing with built-in l2 normalization (no IDF) to allow partial/batch processing
body_hash_cfg  = dict(analyzer='word', ngram_range=(1,3), n_features=2**19,
                      alternate_sign=False, norm='l2', dtype=np.float32)
char_hash_cfg  = dict(analyzer='char_wb', ngram_range=(3,6), n_features=2**18,
                      alternate_sign=False, norm='l2', dtype=np.float32)
code_vec_cfg   = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.999,
                      token_pattern=r'(?u)\b\w[\w_\+\-\#\.]*\b', max_features=100_000,
                      sublinear_tf=True, dtype=np.float32)
title_alpha = 3.0
MIN_LABEL_FREQ_TRAIN = 50  # higher for micro-pilot to reduce labels per shard and speed up
THS = np.linspace(0.05, 0.6, 12)
COEF_BUDGET_BYTES = 4_000_000_000  # 4GB
BATCH_SIZE = 8192  # larger batch for better throughput in micro-pilot

# Global label support for per-tag thresholds
global_support = np.asarray(Y_all.sum(axis=0)).ravel()
hi_support_mask = (global_support >= 50)
hi_label_idx = np.where(hi_support_mask)[0]
print('[Labels] High-support labels (>=50):', hi_label_idx.size)

# Streaming accumulators for global micro-F1
tp_tot = np.zeros_like(THS, dtype=np.int64)
fp_tot = np.zeros_like(THS, dtype=np.int64)
fn_tot = np.zeros_like(THS, dtype=np.int64)

# Per-tag (high-support only) streaming counts per threshold
tp_hi = np.zeros((hi_label_idx.size, THS.size), dtype=np.int64)
fp_hi = np.zeros((hi_label_idx.size, THS.size), dtype=np.int64)
fn_hi = np.zeros((hi_label_idx.size, THS.size), dtype=np.int64)
hi_pos = {lab: pos for pos, lab in enumerate(hi_label_idx)}

def update_counts_batch(Y_true_batch_csr, probs_batch, label_idx_batch, ths, tp, fp, fn,
                        tp_hi, fp_hi, fn_hi, hi_pos_map):
    bs, Lb = probs_batch.shape
    for ti, thr in enumerate(ths):
        pred_bin = (probs_batch >= thr).astype(np.uint8)
        pred_csr = sparse.csr_matrix(pred_bin, dtype=np.uint8)
        tp_mat = pred_csr.multiply(Y_true_batch_csr)
        tp_count = int(tp_mat.sum())
        pred_pos = int(pred_bin.sum())
        true_pos = int(Y_true_batch_csr.sum())
        tp[ti] += tp_count
        fp[ti] += (pred_pos - tp_count)
        fn[ti] += (true_pos - tp_count)
        if hi_pos_map:
            for j_local in range(Lb):
                g_lab = int(label_idx_batch[j_local])
                pos = hi_pos_map.get(g_lab, None)
                if pos is None:
                    continue
                col_true = Y_true_batch_csr[:, j_local]
                col_pred = pred_csr[:, j_local]
                tp_j = int(col_true.multiply(col_pred).sum())
                p_j = int(col_pred.sum())
                t_j = int(col_true.sum())
                tp_hi[pos, ti] += tp_j
                fp_hi[pos, ti] += (p_j - tp_j)
                fn_hi[pos, ti] += (t_j - tp_j)

def micro_f1(tp, fp, fn):
    denom = (2*tp + fp + fn)
    return 0.0 if denom == 0 else (2.0*tp)/denom

# Helper: yield batches of indices
def batch_indices(idxs, batch_size):
    for s in range(0, idxs.size, batch_size):
        yield idxs[s:min(idxs.size, s+batch_size)]

fold_times = []
X_dummy = np.zeros((n_samples, 1))
fold_id = 0
t_all = time.time()
for tr_idx, va_idx in mskf.split(X_dummy, Y_all):
    t0 = time.time()
    Y_tr_full = Y_all[tr_idx]
    Y_va_full = Y_all[va_idx]

    # Per-fold label pruning (train split only)
    sup_tr = np.asarray(Y_tr_full.sum(axis=0)).ravel()
    kept_labels = np.where(sup_tr >= MIN_LABEL_FREQ_TRAIN)[0]
    excluded_labels = np.setdiff1d(np.arange(n_labels), kept_labels)
    print(f'[Fold {fold_id}] Train={len(tr_idx)}, Valid={len(va_idx)}, kept_labels={kept_labels.size}, excluded={excluded_labels.size}')
    if kept_labels.size == 0:
        print(f'[Fold {fold_id}] No labels meet freq >= {MIN_LABEL_FREQ_TRAIN}; skipping fold.')
        continue

    # Fit vectorizers/scaler on training split only
    title_vec = TfidfVectorizer(**title_vec_cfg)
    code_vec  = TfidfVectorizer(**code_vec_cfg)
    body_hash = HashingVectorizer(**body_hash_cfg)
    char_hash = HashingVectorizer(**char_hash_cfg)
    meta_scaler = StandardScaler(with_mean=False)

    # Fit title/code on full train split (fit only, no transform to avoid materializing large matrices)
    title_vec.fit(title_text.iloc[tr_idx])
    code_vec.fit(code_text.iloc[tr_idx])

    # Fit meta scaler in batches
    for b_idx in batch_indices(tr_idx, BATCH_SIZE):
        meta_scaler.partial_fit(meta_all[b_idx])

    # Compute feature dimension D approximately by transforming one small batch
    probe_idx = tr_idx[:min(BATCH_SIZE, tr_idx.size)]
    Xt_probe = title_vec.transform(title_text.iloc[probe_idx]).astype(np.float32)
    Xt_probe = Xt_probe.multiply(title_alpha)
    Xb_probe = body_hash.transform(body_text.iloc[probe_idx]).astype(np.float32)
    Xc_probe = char_hash.transform((title_text.iloc[probe_idx] + ' ' + body_text.iloc[probe_idx])).astype(np.float32)
    Xcode_probe = code_vec.transform(code_text.iloc[probe_idx]).astype(np.float32)
    Xmeta_probe = sparse.csr_matrix(meta_scaler.transform(meta_all[probe_idx]), dtype=np.float32)
    D = sparse.hstack([Xt_probe, Xb_probe, Xc_probe, Xcode_probe, Xmeta_probe], format='csr', dtype=np.float32).shape[1]
    del Xt_probe, Xb_probe, Xc_probe, Xcode_probe, Xmeta_probe; gc.collect()
    # SGDClassifier stores coef_ as float64 by default -> 8 bytes per weight
    shard_cap_by_budget = max(1, int(COEF_BUDGET_BYTES // (8 * D)))
    dyn_shard_size = max(1, min(2000, shard_cap_by_budget))
    print(f'[Fold {fold_id}] Approx feature dim D={D:,}. Dynamic SHARD_SIZE={dyn_shard_size} (budget {COEF_BUDGET_BYTES/1e9:.1f}GB)')

    # Shard labels
    shards = [kept_labels[i:i+dyn_shard_size] for i in range(0, kept_labels.size, dyn_shard_size)]
    print(f'[Fold {fold_id}] #shards: {len(shards)} (size {dyn_shard_size})')

    # Train + validate per shard using online binary classifiers
    for si, shard in enumerate(shards):
        Lb = len(shard)
        if Lb == 0:
            continue
        print(f'[Fold {fold_id}] Shard {si+1}/{len(shards)} with {Lb} labels')

        # Create per-label SGD models
        models = []
        for _ in range(Lb):
            models.append(SGDClassifier(loss='log_loss', penalty='l2', alpha=2e-4,
                                       max_iter=1, tol=None, random_state=GLOBAL_SEED))  # single-epoch per partial_fit call

        # Training: stream over training indices in batches
        for b_idx in batch_indices(tr_idx, BATCH_SIZE):
            # Build batch features on the fly
            X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
            X_body  = body_hash.transform(body_text.iloc[b_idx]).astype(np.float32)
            X_char  = char_hash.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
            X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
            X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
            X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
            Y_b = Y_tr_full[b_idx][:, shard].toarray().astype(np.int8, copy=False)
            # partial_fit for each label binary model
            for j in range(Lb):
                yj = Y_b[:, j]
                models[j].partial_fit(X_batch, yj, classes=np.array([0,1], dtype=np.int32))
            del X_title, X_body, X_char, X_code, X_meta, X_batch, Y_b; gc.collect()

        # Validation: stream over validation batches, predict probs and update counts
        for b_idx in batch_indices(va_idx, BATCH_SIZE):
            X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
            X_body  = body_hash.transform(body_text.iloc[b_idx]).astype(np.float32)
            X_char  = char_hash.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
            X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
            X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
            X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
            # Collect probs per label model
            P = np.zeros((X_batch.shape[0], Lb), dtype=np.float32)
            for j in range(Lb):
                try:
                    prob = models[j].predict_proba(X_batch)[:, 1]
                except Exception:
                    from scipy.special import expit
                    scores = models[j].decision_function(X_batch)
                    prob = expit(scores)
                P[:, j] = prob.astype(np.float32, copy=False)
            Y_true_batch = Y_va_full[b_idx][:, shard]
            update_counts_batch(Y_true_batch.tocsr(), P, shard, THS, tp_tot, fp_tot, fn_tot,
                                tp_hi, fp_hi, fn_hi, hi_pos)
            del X_title, X_body, X_char, X_code, X_meta, X_batch, Y_true_batch, P; gc.collect()

        # Free models
        del models
        gc.collect()

    # Add FN from excluded labels to avoid optimistic bias
    fn_excluded = int(Y_va_full[:, excluded_labels].sum()) if excluded_labels.size > 0 else 0
    if fn_excluded > 0:
        for ti in range(THS.size):
            fn_tot[ti] += fn_excluded
    print(f'[Fold {fold_id}] Added FN from excluded labels: {fn_excluded}')

    del Y_tr_full, Y_va_full
    gc.collect()
    dt = time.time() - t0
    fold_times.append(dt)
    print(f'[Fold {fold_id}] Completed in {dt/60:.1f} min')
    fold_id += 1

    # Micro-pilot: run only the first fold
    if MICRO_PILOT:
        print('[Pilot] Completed first fold only (micro-pilot mode).')
        break

print('[CV] Completed. Optimizing thresholds...')
f1s = [micro_f1(tp_tot[i], fp_tot[i], fn_tot[i]) for i in range(THS.size)]
best_idx = int(np.argmax(f1s))
best_thr = float(THS[best_idx])
best_f1 = float(f1s[best_idx])
print('[OOF] Global best micro-F1 = {:.5f} at thr = {:.3f}'.format(best_f1, best_thr))

# Per-tag thresholds for high-support labels only
per_tag_thr = np.full(n_labels, best_thr, dtype=np.float32)
for k, lab in enumerate(hi_label_idx):
    tps = tp_hi[k]; fps = fp_hi[k]; fns = fn_hi[k]
    f1s_lab = np.array([micro_f1(tps[i], fps[i], fns[i]) for i in range(THS.size)], dtype=np.float32)
    j = int(np.argmax(f1s_lab))
    per_tag_thr[lab] = float(THS[j])
print('[OOF] Per-tag thresholds computed for', hi_label_idx.size, 'labels; others use global.')

# Persist artifacts
pd.DataFrame({'label': labels_list}).to_csv('labels.csv', index=False)
np.save('per_tag_thresholds.npy', per_tag_thr)
np.save('global_threshold.npy', np.array([best_thr], dtype=np.float32))
pd.DataFrame({'threshold': THS, 'f1': f1s}).to_csv('oof_global_f1_curve.csv', index=False)
print('[Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv')

# Safety rule for inference
def apply_thresholds_with_safety(prob_row, label_indices, per_thr_vec, global_thr, min_k=1):
    sel = []
    for j in label_indices:
        thr = per_thr_vec[j]
        if prob_row[j] >= thr:
            sel.append(j)
    if len(sel) == 0:
        if len(label_indices) == 0:
            return []
        j_top = int(max(label_indices, key=lambda jj: prob_row[jj]))
        sel = [j_top]
    return sel

print('[Phase 2 v3] DONE. Global OOF micro-F1 ~', round(best_f1, 5), 'at thr', round(best_thr, 3))
print('[Timing] Avg fold time: {:.1f} min; total: {:.1f} min'.format(np.mean(fold_times)/60.0, np.sum(fold_times)/60.0))
'''

# Append the corrected modeling cell (v3)
nb.cells.append(nbf.v4.new_code_cell(modeling_source))
nb.metadata.setdefault('kernelspec', {"display_name":"Python 3","language":"python","name":"python3"})
with open(pristine_nb, 'w', encoding='utf-8') as f:
    nbf.write(nb, f)
print('[Pristine] Appended REFACTORED Phase 2 modeling cell (v3, streaming SGD) to agent_notebook_pristine.ipynb. Submit for audit, then Run-All to execute CV safely.')


In [None]:
# Execute pristine notebook (micro-pilot): Run-All on agent_notebook_pristine.ipynb
# - This will run the setup cell and the approved v3 modeling cell with MICRO_PILOT enabled
# - Collects runtime logs and persists artifacts (labels.csv, thresholds, OOF curve, etc.)
import os, time, shutil
from datetime import datetime

# Ensure dependencies for programmatic notebook execution
try:
    import nbformat as nbf
except Exception:
    import sys
    get_ipython().run_line_magic('pip', 'install --quiet nbformat')
    import nbformat as nbf
try:
    from nbclient import NotebookClient
except Exception:
    import sys
    get_ipython().run_line_magic('pip', 'install --quiet nbclient')
    from nbclient import NotebookClient

pristine_nb_path = 'agent_notebook_pristine.ipynb'
assert os.path.exists(pristine_nb_path), 'Pristine notebook not found. Please generate it first.'

# Backup pristine notebook before execution
ts = datetime.now().strftime('%Y%m%d_%H%M%S')
backup_path = f'agent_notebook_pristine_backup_{ts}_preexec.ipynb'
shutil.copy2(pristine_nb_path, backup_path)
print(f'[Backup] Pristine notebook backed up to {backup_path}')

t0 = time.time()
nb = nbf.read(pristine_nb_path, as_version=4)
client = NotebookClient(nb, timeout=0, kernel_name='python3', allow_errors=False)
print('[Execute] Starting Run-All for agent_notebook_pristine.ipynb (micro-pilot mode)...')
client.execute()
elapsed = time.time() - t0
print(f'[Execute] Completed Run-All in {elapsed/60:.1f} minutes.')

# Save executed notebook to preserve outputs
executed_path = f'agent_notebook_pristine_executed_{ts}.ipynb'
with open(executed_path, 'w', encoding='utf-8') as f:
    nbf.write(nb, f)
print(f'[Execute] Executed notebook saved to {executed_path}')

# List expected artifacts
expected = ['labels.csv', 'per_tag_thresholds.npy', 'global_threshold.npy', 'oof_global_f1_curve.csv']
for art in expected:
    print(f'[Artifact] {art}:', 'FOUND' if os.path.exists(art) else 'MISSING')
print('[Micro-Pilot] Done. Review the console logs above for D, SHARD_SIZE, shards count, OOF micro-F1, and timings.')


In [None]:
# Micro-pilot executor without nbclient: load caches here and exec the pristine modeling cell with smaller N
import os, gc, time, re, nbformat as nbf
import pandas as pd
from datetime import datetime

# 1) Load caches directly into this kernel (fast, avoids running pristine setup)
TRAIN_FULL = 'parsed_train_full.pkl'
TEST_PKL   = 'parsed_test.pkl'
assert os.path.exists(TRAIN_FULL) and os.path.exists(TEST_PKL), 'Missing caches. Ensure parsed_train_full.pkl and parsed_test.pkl exist.'
df_train_full = pd.read_pickle(TRAIN_FULL)
df_test_cache = pd.read_pickle(TEST_PKL)
print('[Kernel] Loaded df_train_full:', df_train_full.shape, 'df_test_cache:', df_test_cache.shape)

# 2) Read pristine notebook and extract the modeling cell
pristine_nb = 'agent_notebook_pristine.ipynb'
assert os.path.exists(pristine_nb), 'Pristine notebook not found.'
nb = nbf.read(pristine_nb, as_version=4)
code_cells = [c for c in nb.cells if c.cell_type == 'code']
assert len(code_cells) >= 2, f'Expected >=2 code cells (setup + modeling), found {len(code_cells)}'
model_cell_src = code_cells[-1].source

# 3) Patch micro-pilot params to be lighter and faster inside this kernel run
def patch_param(src: str, pattern: str, replacement: str) -> str:
    return re.sub(pattern, replacement, src)

# Force MICRO_PILOT=True, PILOT_N=10000, MIN_LABEL_FREQ_TRAIN=100, BATCH_SIZE=8192
patched = model_cell_src
patched = patch_param(patched, r'MICRO_PILOT\s*=\s*\w+', 'MICRO_PILOT = True')
patched = patch_param(patched, r'PILOT_N\s*=\s*\d[\d_]*', 'PILOT_N = 10_000')
patched = patch_param(patched, r'MIN_LABEL_FREQ_TRAIN\s*=\s*\d+', 'MIN_LABEL_FREQ_TRAIN = 100')
patched = patch_param(patched, r'BATCH_SIZE\s*=\s*\d+', 'BATCH_SIZE = 8192')

# 3b) Precise bugfix: replace fold-local label batch indexing with absolute indexing on Y_all
cnt_before_yb = len(re.findall(r"Y_b\s*=\s*Y_tr_full\[.*?\]\s*\[:,\s*shard\]", patched))
cnt_before_yt = len(re.findall(r"Y_true_batch\s*=\s*Y_va_full\[.*?\]\s*\[:,\s*shard\]", patched))
patched = re.sub(r'Y_b\s*=\s*Y_tr_full\[(.*?)\]\s*\[:,\s*shard\]', r'Y_b = Y_all[\1][:, shard]', patched)
patched = re.sub(r'Y_true_batch\s*=\s*Y_va_full\[(.*?)\]\s*\[:,\s*shard\]', r'Y_true_batch = Y_all[\1][:, shard]', patched)
cnt_after_yb = len(re.findall(r"Y_b\s*=\s*Y_tr_full\[.*?\]\s*\[:,\s*shard\]", patched))
cnt_after_yt = len(re.findall(r"Y_true_batch\s*=\s*Y_va_full\[.*?\]\s*\[:,\s*shard\]", patched))
print(f"[Patch] Y_b replacement occurrences: before={cnt_before_yb}, after={cnt_after_yb}")
print(f"[Patch] Y_true_batch replacement occurrences: before={cnt_before_yt}, after={cnt_after_yt}")

# Optional: excluded FN computation remains valid using Y_va_full[:, excluded_labels]

# 4) Execute the patched modeling cell code in this kernel
print('[Exec] Starting micro-pilot with PILOT_N=10k, MIN_LABEL_FREQ_TRAIN=100, BATCH_SIZE=8192...')
t0 = time.time()
g = globals()
try:
    exec(patched, g, g)
    print('[Exec] Micro-pilot completed in {:.1f} min'.format((time.time()-t0)/60.0))
except Exception as e:
    import traceback
    print('[Exec] Micro-pilot failed:', repr(e))
    traceback.print_exc()

# 5) Verify artifacts existence after run
expected = ['labels.csv', 'per_tag_thresholds.npy', 'global_threshold.npy', 'oof_global_f1_curve.csv']
for art in expected:
    print(f'[Artifact] {art}:', 'FOUND' if os.path.exists(art) else 'MISSING')
print('[Note] Review above logs for D, SHARD_SIZE, shard count, OOF F1, and timings. This is a micro-pilot sanity check.')


In [2]:
# Vectorized micro-pilot (batch-major, shard-wise SGD) — 1-fold
# Mandatory fixes applied: remove per-batch class weighting, global bias init via prior_logit,
# LR=0.05, EPOCHS=3, threshold sweep widened to [0.02, 0.5] with 30 points.
import os, gc, time, math
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

TRAIN_FULL = 'parsed_train_full.pkl'
TEST_PKL = 'parsed_test.pkl'
assert os.path.exists(TRAIN_FULL) and os.path.exists(TEST_PKL), 'Missing caches parsed_train_full.pkl / parsed_test.pkl'
df_train_full = pd.read_pickle(TRAIN_FULL)
df_test_cache = pd.read_pickle(TEST_PKL)
print('[Kernel] Loaded caches:', df_train_full.shape, df_test_cache.shape)

# Labels
def to_list_tags(s):
    return s.split() if isinstance(s, str) else []
y_lists = df_train_full['Tags'].astype(str).apply(to_list_tags)
mlb = MultiLabelBinarizer(sparse_output=True)
Y_all = mlb.fit_transform(y_lists)
labels_list = mlb.classes_.tolist()
n_samples, n_labels = Y_all.shape
print('[Labels] #samples:', n_samples, '#labels:', n_labels)

# Channels
title_text = df_train_full['title_norm'].fillna('').astype(str)
body_text  = df_train_full['body_text'].fillna('').astype(str)
code_text  = df_train_full['code_text'].fillna('').astype(str)
meta_cols = ['title_len','body_len','code_len','num_block_code','num_urls','punct_density']
meta_all = df_train_full[meta_cols].astype(np.float32).values

# Pilot config (aim for realistic yet fast)
MICRO_PILOT = True
PILOT_N = 100_000
MIN_LABEL_FREQ_TRAIN = 20
BATCH_SIZE = 8_192  # batches per epoch
COEF_BUDGET_BYTES = 4_000_000_000  # 4GB
EPOCHS = 3
LR = 0.05
L2 = 2e-4
title_alpha = 3.0
THS = np.linspace(0.02, 0.5, 30)

if MICRO_PILOT:
    n_keep = int(min(PILOT_N, n_samples))
    rng = np.random.RandomState(GLOBAL_SEED)
    idx_keep = rng.choice(n_samples, size=n_keep, replace=False)
    idx_keep.sort()
    title_text = title_text.iloc[idx_keep].reset_index(drop=True)
    body_text  = body_text.iloc[idx_keep].reset_index(drop=True)
    code_text  = code_text.iloc[idx_keep].reset_index(drop=True)
    meta_all   = meta_all[idx_keep]
    Y_all      = Y_all[idx_keep]
    n_samples  = Y_all.shape[0]
    print(f'[Pilot] Subsampled to {n_samples} rows')

# Iterative stratification (dependency guard)
try:
    from skmultilearn.model_selection import IterativeStratification
except Exception:
    import sys
    from subprocess import run
    run([sys.executable, '-m', 'pip', 'install', '--quiet', 'scikit-multilearn'])
    from skmultilearn.model_selection import IterativeStratification

mskf = IterativeStratification(n_splits=5, order=1)
X_dummy = np.zeros((n_samples, 1))

# Vectorizers (reduced caps to keep D in check)
title_vec_cfg = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=150_000, sublinear_tf=True, dtype=np.float32)
body_vec_cfg  = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=200_000, sublinear_tf=True, dtype=np.float32)
code_vec_cfg  = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.999,
                     token_pattern=r'(?u)\b\w[\w_\+\-\#\.]*\b', max_features=80_000,
                     sublinear_tf=True, dtype=np.float32)
char_vec_cfg  = dict(analyzer='char_wb', ngram_range=(3,6), min_df=2, max_features=150_000, dtype=np.float32)
meta_scaler = StandardScaler(with_mean=False)

def batch_indices(idxs, bs):
    for s in range(0, idxs.size, bs):
        yield idxs[s:min(idxs.size, s+bs)]

def sigmoid_stable(Z):
    Z = np.clip(Z, -8.0, 8.0)
    return 1.0 / (1.0 + np.exp(-Z))

def micro_f1(tp, fp, fn):
    denom = (2*tp + fp + fn)
    return 0.0 if denom == 0 else (2.0*tp)/denom

def prior_logit(p):
    p = np.clip(p, 1e-6, 1-1e-6)
    return np.log(p/(1.0-p))

# Run first fold only for acceptance
fold_id = 0
tp_tot = np.zeros_like(THS, dtype=np.int64)
fp_tot = np.zeros_like(THS, dtype=np.int64)
fn_tot = np.zeros_like(THS, dtype=np.int64)
t_fold = time.time()
for tr_idx, va_idx in mskf.split(X_dummy, Y_all):
    print(f'[Fold {fold_id}] Train={tr_idx.size}, Valid={va_idx.size}')
    Y_tr = Y_all[tr_idx]
    Y_va = Y_all[va_idx]
    sup_tr = np.asarray(Y_tr.sum(axis=0)).ravel()
    kept_labels = np.where(sup_tr >= MIN_LABEL_FREQ_TRAIN)[0]
    excluded_labels = np.setdiff1d(np.arange(n_labels), kept_labels)
    print(f'[Fold {fold_id}] kept_labels={kept_labels.size}, excluded={excluded_labels.size}')
    if kept_labels.size == 0:
        print('[Fold {fold_id}] No labels after pruning; abort fold')
        break

    # Fit vectorizers on train split only
    title_vec = TfidfVectorizer(**title_vec_cfg)
    body_vec  = TfidfVectorizer(**body_vec_cfg)
    code_vec  = TfidfVectorizer(**code_vec_cfg)
    char_vec  = TfidfVectorizer(**char_vec_cfg)
    title_vec.fit(title_text.iloc[tr_idx])
    body_vec.fit(body_text.iloc[tr_idx])
    code_vec.fit(code_text.iloc[tr_idx])
    char_vec.fit((title_text.iloc[tr_idx] + ' ' + body_text.iloc[tr_idx]))
    # Fit meta scaler
    for b_idx in batch_indices(tr_idx, BATCH_SIZE):
        meta_scaler.partial_fit(meta_all[b_idx])

    # Probe to get D
    probe_idx = tr_idx[:min(BATCH_SIZE, tr_idx.size)]
    Xt_probe = title_vec.transform(title_text.iloc[probe_idx]).astype(np.float32)
    Xt_probe = Xt_probe.multiply(title_alpha)
    Xb_probe = body_vec.transform(body_text.iloc[probe_idx]).astype(np.float32)
    Xcode_probe = code_vec.transform(code_text.iloc[probe_idx]).astype(np.float32)
    Xchar_probe = char_vec.transform((title_text.iloc[probe_idx] + ' ' + body_text.iloc[probe_idx])).astype(np.float32)
    Xm_probe = sparse.csr_matrix(meta_scaler.transform(meta_all[probe_idx]), dtype=np.float32)
    D = sparse.hstack([Xt_probe, Xb_probe, Xchar_probe, Xcode_probe, Xm_probe], format='csr', dtype=np.float32).shape[1]
    del Xt_probe, Xb_probe, Xchar_probe, Xcode_probe, Xm_probe; gc.collect()
    bytes_per_coef = 4  # float32
    shard_cap = max(1, int(COEF_BUDGET_BYTES // (bytes_per_coef * D)))
    SHARD_SIZE = max(1, min(2000, shard_cap))
    SHARD_SIZE = min(SHARD_SIZE, 600)
    shards = [kept_labels[i:i+SHARD_SIZE] for i in range(0, kept_labels.size, SHARD_SIZE)]
    est_w_mb = (D * SHARD_SIZE * bytes_per_coef) / (1024**2)
    print(f'[Fold {fold_id}] D={D:,}, SHARD_SIZE={SHARD_SIZE}, #shards={len(shards)} (~{est_w_mb:.1f} MB per-shard W, budget={COEF_BUDGET_BYTES/1e9:.1f}GB, fp32)')

    # Initialize per-shard weights with global bias using prior logit from training split
    shard_params = []  # list of tuples (labels_idx, W[D x Lb], b[Lb])
    n_tr = tr_idx.size
    for shard in shards:
        Lb = len(shard)
        W = np.zeros((D, Lb), dtype=np.float32)
        # prior per label from training split
        pos = sup_tr[shard].astype(np.float64)
        p = pos / float(n_tr)
        b = prior_logit(p).astype(np.float32, copy=False)
        shard_params.append((shard, W, b))

    # Precompute list of batch index arrays for shuffling
    tr_batches = [b for b in batch_indices(tr_idx, BATCH_SIZE)]

    # Training: multi-epoch, batch-major, compute X_batch ONCE then update all shards
    t_tr = time.time()
    rng = np.random.RandomState(GLOBAL_SEED)
    for ep in range(EPOCHS):
        rng.shuffle(tr_batches)
        t_ep = time.time()
        for bi, b_idx in enumerate(tr_batches):
            # Build features once
            X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
            X_body  = body_vec.transform(body_text.iloc[b_idx]).astype(np.float32)
            X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
            X_char  = char_vec.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
            X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
            X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
            bs = X_batch.shape[0]
            # Update each shard with standard logistic loss (no per-batch class weighting)
            for shard, W, b in shard_params:
                Y_b = Y_all[b_idx][:, shard].toarray().astype(np.float32, copy=False)
                Z = X_batch @ W
                Z += b  # broadcast bias
                P = sigmoid_stable(Z)
                E = (P - Y_b)  # (bs x Lb)
                # Gradients
                grad_W = (X_batch.T @ E) / float(bs)
                if isinstance(grad_W, np.matrix):
                    grad_W = np.asarray(grad_W)
                grad_W = grad_W.astype(np.float32, copy=False) + (L2 * W)
                grad_b = E.mean(axis=0).astype(np.float32, copy=False)
                # Update
                W -= LR * grad_W
                b -= LR * grad_b
            if (bi % 5 == 0) and len(shard_params) > 0:
                shard0, W0, b0 = shard_params[0]
                Z0 = X_batch @ W0; Z0 += b0
                P0 = sigmoid_stable(Z0)
                Y0 = Y_all[b_idx][:, shard0].toarray().astype(np.float32, copy=False)
                eps = 1e-7
                loss = -np.mean(Y0*np.log(P0+eps) + (1-Y0)*np.log(1-P0+eps))
                print(f'[Fold {fold_id}] Ep {ep+1}/{EPOCHS} batch {bi+1}/{len(tr_batches)} loss~{loss:.4f}; meanP~{P0.mean():.4f}')
            del X_title, X_body, X_code, X_char, X_meta, X_batch; gc.collect()
        print(f'[Fold {fold_id}] Epoch {ep+1}/{EPOCHS} time: {(time.time()-t_ep)/60:.2f} min')
    print(f'[Fold {fold_id}] Train total time: {(time.time()-t_tr)/60:.2f} min')

    # Validation: batch-major, compute once, vectorized prediction per shard
    def update_counts(Y_true_csr, probs):
        for ti, thr in enumerate(THS):
            pred_bin = (probs >= thr).astype(np.uint8)
            pred_csr = sparse.csr_matrix(pred_bin, dtype=np.uint8)
            tp = int(pred_csr.multiply(Y_true_csr).sum())
            ppos = int(pred_bin.sum())
            tpos = int(Y_true_csr.sum())
            tp_tot[ti] += tp
            fp_tot[ti] += (ppos - tp)
            fn_tot[ti] += (tpos - tp)

    t_va = time.time()
    for b_idx in batch_indices(va_idx, BATCH_SIZE):
        X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
        X_body  = body_vec.transform(body_text.iloc[b_idx]).astype(np.float32)
        X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
        X_char  = char_vec.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
        X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
        X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
        P_all = None; col_slices = []
        for shard, W, b in shard_params:
            Z = (X_batch @ W); Z += b
            P = sigmoid_stable(Z)
            if P_all is None:
                P_all = P; col_slices = [np.array(shard)]
            else:
                P_all = np.concatenate([P_all, P], axis=1)
                col_slices.append(np.array(shard))
        if len(col_slices) > 1:
            order = np.concatenate(col_slices)
            pos_map = {lab:i for i, lab in enumerate(order)}
            kept_pos = np.array([pos_map[k] for k in kept_labels], dtype=np.int32)
            P_kept = P_all[:, kept_pos]
        else:
            P_kept = P_all
        Y_true_batch = Y_all[b_idx][:, kept_labels]
        update_counts(Y_true_batch.tocsr(), P_kept)
        del X_title, X_body, X_char, X_code, X_meta, X_batch, P_all, P_kept; gc.collect()
    fn_excluded = int(Y_va[:, excluded_labels].sum()) if excluded_labels.size > 0 else 0
    if fn_excluded > 0:
        for i in range(THS.size):
            fn_tot[i] += fn_excluded
    print(f'[Fold {fold_id}] Validation time: {(time.time()-t_va)/60:.2f} min, FN(excluded)={fn_excluded}')

    print(f'[Fold {fold_id}] Total fold time: {(time.time()-t_fold)/60:.2f} min')
    break

# Optimize global threshold
f1s = [micro_f1(tp_tot[i], fp_tot[i], fn_tot[i]) for i in range(THS.size)]
best_idx = int(np.argmax(f1s))
best_thr = float(THS[best_idx])
best_f1 = float(f1s[best_idx])
print('[OOF] Global best micro-F1 = {:.5f} at thr = {:.3f}'.format(best_f1, best_thr))

# Save artifacts compatible with downstream steps
np.save('global_threshold.npy', np.array([best_thr], dtype=np.float32))
pd.DataFrame({'threshold': THS, 'f1': f1s}).to_csv('oof_global_f1_curve.csv', index=False)
pd.DataFrame({'label': labels_list}).to_csv('labels.csv', index=False)
per_tag_thr = np.full(n_labels, best_thr, dtype=np.float32)
np.save('per_tag_thresholds.npy', per_tag_thr)
print('[Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv')
print('[Pilot Vectorized] DONE (with stable bias init and LR/EPOCHS adjustments)')


[Kernel] Loaded caches: (5430775, 11) (603420, 10)


[Labels] #samples: 5430775 #labels: 41781


[Pilot] Subsampled to 100000 rows


[Fold 0] Train=80041, Valid=19959
[Fold 0] kept_labels=1556, excluded=40225


[Fold 0] D=480,087, SHARD_SIZE=600, #shards=3 (~1098.8 MB per-shard W, budget=4.0GB, fp32)


[Fold 0] Ep 1/3 batch 1/10 loss~0.0093; meanP~0.0015


[Fold 0] Ep 1/3 batch 6/10 loss~0.0091; meanP~0.0015


[Fold 0] Epoch 1/3 time: 6.24 min


[Fold 0] Ep 2/3 batch 1/10 loss~0.0094; meanP~0.0015


[Fold 0] Ep 2/3 batch 6/10 loss~0.0093; meanP~0.0015


[Fold 0] Epoch 2/3 time: 6.74 min


[Fold 0] Ep 3/3 batch 1/10 loss~0.0092; meanP~0.0015


[Fold 0] Ep 3/3 batch 6/10 loss~0.0094; meanP~0.0015


[Fold 0] Epoch 3/3 time: 6.54 min
[Fold 0] Train total time: 19.53 min


[Fold 0] Validation time: 1.15 min, FN(excluded)=13245
[Fold 0] Total fold time: 24.23 min
[OOF] Global best micro-F1 = 0.08454 at thr = 0.053
[Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv
[Pilot Vectorized] DONE (with stable bias init and LR/EPOCHS adjustments)


In [7]:
# Vectorized micro-pilot vNext-opt: weighted + momentum + clipping + hashing (1-fold)
# Batch-major optimization: build X_batch once and reuse across shards per batch
# - Global per-label positive weighting (fold-level, clipped to [1,10])
# - Nesterov momentum, gradient clipping
# - Hashing for body/char (reduced dims)
# - Xavier warm-start, prior_logit bias
# - Batch-major training across shards to remove redundant feature work
import os, gc, time, math
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer

GLOBAL_SEED = 1337
rng_global = np.random.RandomState(GLOBAL_SEED)
np.random.seed(GLOBAL_SEED)

TRAIN_FULL = 'parsed_train_full.pkl'
TEST_PKL = 'parsed_test.pkl'
assert os.path.exists(TRAIN_FULL) and os.path.exists(TEST_PKL), 'Missing caches parsed_train_full.pkl / parsed_test.pkl'
df_train_full = pd.read_pickle(TRAIN_FULL)
df_test_cache = pd.read_pickle(TEST_PKL)
print('[Kernel] Loaded caches:', df_train_full.shape, df_test_cache.shape)

# Labels
def to_list_tags(s):
    return s.split() if isinstance(s, str) else []
y_lists = df_train_full['Tags'].astype(str).apply(to_list_tags)
mlb = MultiLabelBinarizer(sparse_output=True)
Y_all = mlb.fit_transform(y_lists)
labels_list = mlb.classes_.tolist()
n_samples, n_labels = Y_all.shape
print('[Labels] #samples:', n_samples, '#labels:', n_labels)

# Channels
title_text = df_train_full['title_norm'].fillna('').astype(str)
body_text  = df_train_full['body_text'].fillna('').astype(str)
code_text  = df_train_full['code_text'].fillna('').astype(str)
meta_cols = ['title_len','body_len','code_len','num_block_code','num_urls','punct_density']
meta_all = df_train_full[meta_cols].astype(np.float32).values

# Pilot config
MICRO_PILOT = True
PILOT_N = 100_000
MIN_LABEL_FREQ_TRAIN = 20
BATCH_SIZE = 8_192
COEF_BUDGET_BYTES = 4_000_000_000  # budget used for estimating shard size (we hold all shards concurrently here)
EPOCHS = 8
LR = 0.12
MU = 0.9
L2 = 2e-4
title_alpha = 3.0
THS = np.linspace(0.02, 0.5, 30)

if MICRO_PILOT:
    n_keep = int(min(PILOT_N, n_samples))
    idx_keep = rng_global.choice(n_samples, size=n_keep, replace=False)
    idx_keep.sort()
    title_text = title_text.iloc[idx_keep].reset_index(drop=True)
    body_text  = body_text.iloc[idx_keep].reset_index(drop=True)
    code_text  = code_text.iloc[idx_keep].reset_index(drop=True)
    meta_all   = meta_all[idx_keep]
    Y_all      = Y_all[idx_keep]
    n_samples  = Y_all.shape[0]
    print(f'[Pilot] Subsampled to {n_samples} rows')

# Iterative stratification (dependency guard)
try:
    from skmultilearn.model_selection import IterativeStratification
except Exception:
    import sys
    from subprocess import run
    run([sys.executable, '-m', 'pip', 'install', '--quiet', 'scikit-multilearn'])
    from skmultilearn.model_selection import IterativeStratification

mskf = IterativeStratification(n_splits=5, order=1)
X_dummy = np.zeros((n_samples, 1))

# Vectorizers (hashing for heavy channels, reduced dims)
title_vec_cfg = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=150_000, sublinear_tf=True, dtype=np.float32)
body_hash_cfg  = dict(analyzer='word', ngram_range=(1,3), n_features=2**18,
                      alternate_sign=False, norm='l2', dtype=np.float32)
char_hash_cfg  = dict(analyzer='char_wb', ngram_range=(3,6), n_features=2**17,
                      alternate_sign=False, norm='l2', dtype=np.float32)
code_vec_cfg  = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.999,
                     token_pattern=r'(?u)\b\w[\w_\+\-\#\.]*\b', max_features=80_000,
                     sublinear_tf=True, dtype=np.float32)
meta_scaler = StandardScaler(with_mean=False)

def batch_indices(idxs, bs):
    for s in range(0, idxs.size, bs):
        yield idxs[s:min(idxs.size, s+bs)]

def sigmoid_stable(Z):
    Z = np.clip(Z, -12.0, 12.0)
    return 1.0 / (1.0 + np.exp(-Z))

def micro_f1(tp, fp, fn):
    denom = (2*tp + fp + fn)
    return 0.0 if denom == 0 else (2.0*tp)/denom

def prior_logit(p):
    p = np.clip(p, 1e-6, 1-1e-6)
    return np.log(p/(1.0-p))

fold_id = 0
tp_tot = np.zeros_like(THS, dtype=np.int64)
fp_tot = np.zeros_like(THS, dtype=np.int64)
fn_tot = np.zeros_like(THS, dtype=np.int64)
t_fold = time.time()
for tr_idx, va_idx in mskf.split(X_dummy, Y_all):
    print(f'[Fold {fold_id}] Train={tr_idx.size}, Valid={va_idx.size}')
    Y_tr = Y_all[tr_idx]
    Y_va = Y_all[va_idx]
    sup_tr = np.asarray(Y_tr.sum(axis=0)).ravel()
    kept_labels = np.where(sup_tr >= MIN_LABEL_FREQ_TRAIN)[0]
    excluded_labels = np.setdiff1d(np.arange(n_labels), kept_labels)
    print(f'[Fold {fold_id}] kept_labels={kept_labels.size}, excluded={excluded_labels.size}')
    if kept_labels.size == 0:
        print('[Fold {fold_id}] No labels after pruning; abort fold')
        break

    # Global per-label positive weighting (fold-level)
    p_tr = np.clip(sup_tr / float(tr_idx.size), 1e-6, 1-1e-6)
    pos_weights_global = np.clip(1.0 / p_tr, 1.0, 10.0).astype(np.float32)

    # Fit vectorizers on train split only
    title_vec = TfidfVectorizer(**title_vec_cfg)
    code_vec  = TfidfVectorizer(**code_vec_cfg)
    body_hash = HashingVectorizer(**body_hash_cfg)
    char_hash = HashingVectorizer(**char_hash_cfg)
    title_vec.fit(title_text.iloc[tr_idx])
    code_vec.fit(code_text.iloc[tr_idx])
    # Fit meta scaler
    for b_idx in batch_indices(tr_idx, BATCH_SIZE):
        meta_scaler.partial_fit(meta_all[b_idx])

    # Probe to get D and shard size
    probe_idx = tr_idx[:min(BATCH_SIZE, tr_idx.size)]
    Xt_probe = title_vec.transform(title_text.iloc[probe_idx]).astype(np.float32).multiply(title_alpha)
    Xb_probe = body_hash.transform(body_text.iloc[probe_idx]).astype(np.float32)
    Xchar_probe = char_hash.transform((title_text.iloc[probe_idx] + ' ' + body_text.iloc[probe_idx])).astype(np.float32)
    Xcode_probe = code_vec.transform(code_text.iloc[probe_idx]).astype(np.float32)
    Xm_probe = sparse.csr_matrix(meta_scaler.transform(meta_all[probe_idx]), dtype=np.float32)
    D = sparse.hstack([Xt_probe, Xb_probe, Xchar_probe, Xcode_probe, Xm_probe], format='csr', dtype=np.float32).shape[1]
    del Xt_probe, Xb_probe, Xchar_probe, Xcode_probe, Xm_probe; gc.collect()
    bytes_per_coef = 4
    shard_cap = max(1, int(COEF_BUDGET_BYTES // (bytes_per_coef * D)))
    SHARD_SIZE = max(1, min(600, shard_cap))
    shards = [kept_labels[i:i+SHARD_SIZE] for i in range(0, kept_labels.size, SHARD_SIZE)]
    est_w_mb = (D * SHARD_SIZE * bytes_per_coef) / (1024**2)
    print(f'[Fold {fold_id}] D={D:,}, SHARD_SIZE={SHARD_SIZE}, #shards={len(shards)} (~{est_w_mb:.1f} MB per-shard W, fp32)')

    # Initialize all shard parameters (we hold all shards concurrently to enable batch-major reuse)
    shard_params = []  # list of tuples: (labels_idx, W, b, Vw, Vb)
    n_tr = tr_idx.size
    for shard in shards:
        Lb = len(shard)
        std = 1e-3 / math.sqrt(D)
        W = (rng_global.randn(D, Lb).astype(np.float32) * std).astype(np.float32)
        pos = sup_tr[shard].astype(np.float64)
        p = pos / float(n_tr)
        b = prior_logit(p).astype(np.float32, copy=False)
        Vw = np.zeros_like(W, dtype=np.float32)
        Vb = np.zeros_like(b, dtype=np.float32)
        shard_params.append((shard, W, b, Vw, Vb))

    # Precompute shuffled batch index arrays
    tr_batches = [b for b in batch_indices(tr_idx, BATCH_SIZE)]

    # Training: batch-major — build X_batch once and reuse across all shards
    t_tr = time.time()
    rng = np.random.RandomState(GLOBAL_SEED)
    for ep in range(EPOCHS):
        rng.shuffle(tr_batches)
        t_ep = time.time()
        for bi, b_idx in enumerate(tr_batches):
            # Build batch features once
            X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
            X_body  = body_hash.transform(body_text.iloc[b_idx]).astype(np.float32)
            X_char  = char_hash.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
            X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
            X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
            X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
            bs = X_batch.shape[0]
            # Update each shard using the same X_batch
            for si, (shard, W, b, Vw, Vb) in enumerate(shard_params):
                Y_b = Y_all[b_idx][:, shard].toarray().astype(np.float32, copy=False)
                Z = X_batch @ W; Z += b
                P = sigmoid_stable(Z)
                E = (P - Y_b)
                # Positive weighting mask
                w_pos = pos_weights_global[shard][None, :]
                Wmat = 1.0 + (w_pos - 1.0) * Y_b
                E_weighted = E * Wmat
                # Gradients
                grad_W = (X_batch.T @ E_weighted) / float(bs)
                if isinstance(grad_W, np.matrix):
                    grad_W = np.asarray(grad_W)
                grad_W = grad_W.astype(np.float32, copy=False) + (L2 * W)
                grad_b = E_weighted.mean(axis=0).astype(np.float32, copy=False)
                # Gradient clipping
                col_norms = np.linalg.norm(grad_W, axis=0) + 1e-8
                clip_scale = np.minimum(1.0, 5.0 / col_norms).astype(np.float32, copy=False)
                grad_W *= clip_scale
                grad_b = np.clip(grad_b, -1.0, 1.0)
                # Nesterov momentum (in-place)
                Vw *= MU; Vw += LR * grad_W
                Vb *= MU; Vb += LR * grad_b
                W -= (MU * Vw + LR * grad_W)
                b -= (MU * Vb + LR * grad_b)
                shard_params[si] = (shard, W, b, Vw, Vb)
            if (bi % 3 == 0) and len(shard_params) > 0:
                shard0, W0, b0, _, _ = shard_params[0]
                Z0 = X_batch @ W0; Z0 += b0
                P0 = sigmoid_stable(Z0)
                Y0 = Y_all[b_idx][:, shard0].toarray().astype(np.float32, copy=False)
                eps = 1e-7
                loss = -np.mean(Y0*np.log(P0+eps) + (1-Y0)*np.log(1-P0+eps))
                print(f'[Fold {fold_id}] Ep {ep+1}/{EPOCHS} batch {bi+1}/{len(tr_batches)} loss~{loss:.4f}; meanP~{P0.mean():.4f}')
            del X_title, X_body, X_char, X_code, X_meta, X_batch; gc.collect()
        print(f'[Fold {fold_id}] Epoch {ep+1}/{EPOCHS} time: {(time.time()-t_ep)/60:.2f} min')
    print(f'[Fold {fold_id}] Train total time: {(time.time()-t_tr)/60:.2f} min')

    # Validation: update TP/FP/FN per shard directly (no concatenation)
    def update_counts(Y_true_csr, probs):
        for ti, thr in enumerate(THS):
            pred_bin = (probs >= thr).astype(np.uint8)
            pred_csr = sparse.csr_matrix(pred_bin, dtype=np.uint8)
            tp = int(pred_csr.multiply(Y_true_csr).sum())
            ppos = int(pred_bin.sum())
            tpos = int(Y_true_csr.sum())
            tp_tot[ti] += tp
            fp_tot[ti] += (ppos - tp)
            fn_tot[ti] += (tpos - tp)

    t_va = time.time()
    for b_idx in batch_indices(va_idx, BATCH_SIZE):
        X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
        X_body  = body_hash.transform(body_text.iloc[b_idx]).astype(np.float32)
        X_char  = char_hash.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
        X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
        X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
        X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
        for shard, W, b, _, _ in shard_params:
            Z = (X_batch @ W); Z += b
            P = sigmoid_stable(Z)
            Y_true_batch = Y_all[b_idx][:, shard]
            update_counts(Y_true_batch.tocsr(), P)
        del X_title, X_body, X_char, X_code, X_meta, X_batch; gc.collect()
    print(f'[Fold {fold_id}] Validation time: {(time.time()-t_va)/60:.2f} min')

    # Account for excluded labels' positives as FN
    fn_excluded = int(Y_va[:, excluded_labels].sum()) if excluded_labels.size > 0 else 0
    if fn_excluded > 0:
        for i in range(THS.size):
            fn_tot[i] += fn_excluded
    print(f'[Fold {fold_id}] Added FN(excluded)={fn_excluded}')

    print(f'[Fold {fold_id}] Total fold time: {(time.time()-t_fold)/60:.2f} min')
    break

# Optimize global threshold
f1s = [micro_f1(tp_tot[i], fp_tot[i], fn_tot[i]) for i in range(THS.size)]
best_idx = int(np.argmax(f1s))
best_thr = float(THS[best_idx])
best_f1 = float(f1s[best_idx])
print('[OOF] Global best micro-F1 = {:.5f} at thr = {:.3f}'.format(best_f1, best_thr))

# Save artifacts
np.save('global_threshold.npy', np.array([best_thr], dtype=np.float32))
pd.DataFrame({'threshold': THS, 'f1': f1s}).to_csv('oof_global_f1_curve.csv', index=False)
pd.DataFrame({'label': labels_list}).to_csv('labels.csv', index=False)
per_tag_thr = np.full(n_labels, best_thr, dtype=np.float32)
np.save('per_tag_thresholds.npy', per_tag_thr)
print('[Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv')
print('[Pilot Vectorized vNext-opt] DONE (batch-major across shards)')


[Kernel] Loaded caches: (5430775, 11) (603420, 10)


[Labels] #samples: 5430775 #labels: 41781


[Pilot] Subsampled to 100000 rows


[Fold 0] Train=80041, Valid=19959
[Fold 0] kept_labels=1556, excluded=40225


[Fold 0] D=523,303, SHARD_SIZE=600, #shards=3 (~1197.7 MB per-shard W, fp32)


[Fold 0] Ep 1/8 batch 1/10 loss~0.0098; meanP~0.0024


[Fold 0] Ep 1/8 batch 4/10 loss~0.0138; meanP~0.0060


[Fold 0] Ep 1/8 batch 7/10 loss~0.0130; meanP~0.0064


[Fold 0] Ep 1/8 batch 10/10 loss~0.0140; meanP~0.0075
[Fold 0] Epoch 1/8 time: 7.60 min


[Fold 0] Ep 2/8 batch 1/10 loss~0.0144; meanP~0.0080


[Fold 0] Ep 2/8 batch 4/10 loss~0.0148; meanP~0.0087


[Fold 0] Ep 2/8 batch 7/10 loss~0.0151; meanP~0.0091


[Fold 0] Ep 2/8 batch 10/10 loss~0.0153; meanP~0.0094
[Fold 0] Epoch 2/8 time: 8.30 min


[Fold 0] Ep 3/8 batch 1/10 loss~0.0151; meanP~0.0095


[Fold 0] Ep 3/8 batch 4/10 loss~0.0154; meanP~0.0097


[Fold 0] Ep 3/8 batch 7/10 loss~0.0152; meanP~0.0099


[Fold 0] Ep 3/8 batch 10/10 loss~0.0152; meanP~0.0098
[Fold 0] Epoch 3/8 time: 8.37 min


[Fold 0] Ep 4/8 batch 1/10 loss~0.0150; meanP~0.0098


[Fold 0] Ep 4/8 batch 4/10 loss~0.0154; meanP~0.0102


[Fold 0] Ep 4/8 batch 7/10 loss~0.0152; meanP~0.0101


[Fold 0] Ep 4/8 batch 10/10 loss~0.0154; meanP~0.0102
[Fold 0] Epoch 4/8 time: 8.43 min


[Fold 0] Ep 5/8 batch 1/10 loss~0.0154; meanP~0.0102


[Fold 0] Ep 5/8 batch 4/10 loss~0.0152; meanP~0.0103


[Fold 0] Ep 5/8 batch 7/10 loss~0.0153; meanP~0.0104


[Fold 0] Ep 5/8 batch 10/10 loss~0.0149; meanP~0.0102
[Fold 0] Epoch 5/8 time: 8.66 min


[Fold 0] Ep 6/8 batch 1/10 loss~0.0153; meanP~0.0103


[Fold 0] Ep 6/8 batch 4/10 loss~0.0149; meanP~0.0103


[Fold 0] Ep 6/8 batch 7/10 loss~0.0151; meanP~0.0105


[Fold 0] Ep 6/8 batch 10/10 loss~0.0152; meanP~0.0105
[Fold 0] Epoch 6/8 time: 8.08 min


[Fold 0] Ep 7/8 batch 1/10 loss~0.0150; meanP~0.0104


[Fold 0] Ep 7/8 batch 4/10 loss~0.0148; meanP~0.0103


[Fold 0] Ep 7/8 batch 7/10 loss~0.0148; meanP~0.0103


[Fold 0] Ep 7/8 batch 10/10 loss~0.0151; meanP~0.0106
[Fold 0] Epoch 7/8 time: 7.86 min


[Fold 0] Ep 8/8 batch 1/10 loss~0.0149; meanP~0.0105


[Fold 0] Ep 8/8 batch 4/10 loss~0.0147; meanP~0.0103


[Fold 0] Ep 8/8 batch 7/10 loss~0.0150; meanP~0.0104


[Fold 0] Ep 8/8 batch 10/10 loss~0.0148; meanP~0.0104
[Fold 0] Epoch 8/8 time: 8.62 min
[Fold 0] Train total time: 65.92 min


[Fold 0] Validation time: 1.07 min
[Fold 0] Added FN(excluded)=13245
[Fold 0] Total fold time: 68.72 min
[OOF] Global best micro-F1 = 0.22046 at thr = 0.368
[Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv
[Pilot Vectorized vNext-opt] DONE (batch-major across shards)


In [9]:
# Track 2: OVR-SAGA Baseline (1-fold, 100k micro-pilot) with the same feature stack as Cell 17 (FAST CONFIG)
# - Apples-to-apples channels: Title TF-IDF (boosted), Body Hashing, Char Hashing, Code TF-IDF, Meta (scaled)
# - Leak-proof: fit vectorizers/scaler on train split only
# - Per-fold label pruning (train-only) with FN addback for excluded labels
# - Global threshold sweep + per-tag thresholds for high-support labels (>=50)

import os, gc, time, math
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Dependency guard for iterative stratification
try:
    from skmultilearn.model_selection import IterativeStratification
except Exception:
    import sys
    from subprocess import run
    run([sys.executable, '-m', 'pip', 'install', '--quiet', 'scikit-multilearn'])
    from skmultilearn.model_selection import IterativeStratification

GLOBAL_SEED = 1337
rng = np.random.RandomState(GLOBAL_SEED)
np.random.seed(GLOBAL_SEED)

TRAIN_FULL = 'parsed_train_full.pkl'
TEST_PKL   = 'parsed_test.pkl'
assert os.path.exists(TRAIN_FULL) and os.path.exists(TEST_PKL), 'Missing caches parsed_train_full.pkl / parsed_test.pkl'
df_train_full = pd.read_pickle(TRAIN_FULL)
df_test_cache = pd.read_pickle(TEST_PKL)
print('[SAGA] Loaded caches:', df_train_full.shape, df_test_cache.shape)

# Labels
def to_list_tags(s):
    return s.split() if isinstance(s, str) else []
y_lists = df_train_full['Tags'].astype(str).apply(to_list_tags)
mlb = MultiLabelBinarizer(sparse_output=True)
Y_all = mlb.fit_transform(y_lists)
labels_list = mlb.classes_.tolist()
n_samples, n_labels = Y_all.shape
print('[SAGA][Labels] #samples:', n_samples, '#labels:', n_labels)

# Channels
title_text = df_train_full['title_norm'].fillna('').astype(str)
body_text  = df_train_full['body_text'].fillna('').astype(str)
code_text  = df_train_full['code_text'].fillna('').astype(str)
meta_cols = ['title_len','body_len','code_len','num_block_code','num_urls','punct_density']
meta_all = df_train_full[meta_cols].astype(np.float32).values

# Micro-pilot: subsample to 100k and run first fold only
MICRO_PILOT = True
PILOT_N = 100_000
if MICRO_PILOT:
    n_keep = int(min(PILOT_N, n_samples))
    idx_keep = rng.choice(n_samples, size=n_keep, replace=False)
    idx_keep.sort()
    title_text = title_text.iloc[idx_keep].reset_index(drop=True)
    body_text  = body_text.iloc[idx_keep].reset_index(drop=True)
    code_text  = code_text.iloc[idx_keep].reset_index(drop=True)
    meta_all   = meta_all[idx_keep]
    Y_all      = Y_all[idx_keep]
    n_samples  = Y_all.shape[0]
    print(f'[SAGA][Pilot] Subsampled to {n_samples} rows')

# CV
mskf = IterativeStratification(n_splits=5, order=1)
X_dummy = np.zeros((n_samples, 1))

# Feature configs (reduced for speed, still apples-to-apples channels)
title_vec_cfg = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=100_000, sublinear_tf=True, dtype=np.float32)
body_hash_cfg  = dict(analyzer='word', ngram_range=(1,3), n_features=2**17,
                      alternate_sign=False, norm='l2', dtype=np.float32)
char_hash_cfg  = dict(analyzer='char_wb', ngram_range=(3,6), n_features=2**16,
                      alternate_sign=False, norm='l2', dtype=np.float32)
code_vec_cfg   = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.999,
                      token_pattern=r'(?u)\b\w[\w_\+\-\#\.]*\b', max_features=50_000,
                      sublinear_tf=True, dtype=np.float32)
title_alpha = 3.0
MIN_LABEL_FREQ_TRAIN = 100  # more aggressive pruning for speed (train split only)
THS = np.linspace(0.05, 0.6, 12)

def micro_f1(tp, fp, fn):
    denom = (2*tp + fp + fn)
    return 0.0 if denom == 0 else (2.0*tp)/denom

def batch_indices(idxs, bs):
    for s in range(0, idxs.size, bs):
        yield idxs[s:min(idxs.size, s+bs)]

BATCH_SIZE = 8192

tp_tot = np.zeros_like(THS, dtype=np.int64)
fp_tot = np.zeros_like(THS, dtype=np.int64)
fn_tot = np.zeros_like(THS, dtype=np.int64)

# Per-tag thresholds (for high-support labels only)
global_support = np.asarray(Y_all.sum(axis=0)).ravel()
hi_mask = (global_support >= 50)
hi_idx = np.where(hi_mask)[0]
tp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fn_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
hi_pos = {lab: pos for pos, lab in enumerate(hi_idx)}

def update_counts_batch(Y_true_batch_csr, probs_batch, label_idx_batch, ths, tp, fp, fn,
                        tp_hi, fp_hi, fn_hi, hi_pos_map):
    for ti, thr in enumerate(ths):
        pred_bin = (probs_batch >= thr).astype(np.uint8)
        pred_csr = sparse.csr_matrix(pred_bin, dtype=np.uint8)
        tp_b = int(pred_csr.multiply(Y_true_batch_csr).sum())
        ppos = int(pred_bin.sum())
        tpos = int(Y_true_batch_csr.sum())
        tp[ti] += tp_b
        fp[ti] += (ppos - tp_b)
        fn[ti] += (tpos - tp_b)
        if hi_pos_map:
            for j_local in range(probs_batch.shape[1]):
                g_lab = int(label_idx_batch[j_local])
                pos = hi_pos_map.get(g_lab, None)
                if pos is None:
                    continue
                col_true = Y_true_batch_csr[:, j_local]
                col_pred = pred_csr[:, j_local]
                tp_j = int(col_true.multiply(col_pred).sum())
                p_j = int(col_pred.sum())
                t_j = int(col_true.sum())
                tp_hi[pos, ti] += tp_j
                fp_hi[pos, ti] += (p_j - tp_j)
                fn_hi[pos, ti] += (t_j - tp_j)

fold_id = 0
t0_fold = time.time()
for tr_idx, va_idx in mskf.split(X_dummy, Y_all):
    print(f'[SAGA][Fold {fold_id}] Train={tr_idx.size}, Valid={va_idx.size}')
    Y_tr_full = Y_all[tr_idx]
    Y_va_full = Y_all[va_idx]
    sup_tr = np.asarray(Y_tr_full.sum(axis=0)).ravel()
    kept_labels = np.where(sup_tr >= MIN_LABEL_FREQ_TRAIN)[0]
    excluded_labels = np.setdiff1d(np.arange(n_labels), kept_labels)
    print(f'[SAGA][Fold {fold_id}] kept_labels={kept_labels.size}, excluded={excluded_labels.size}')
    if kept_labels.size == 0:
        print('[SAGA] No labels after pruning; abort fold')
        break

    # Fit vectorizers/scaler on train split only
    title_vec = TfidfVectorizer(**title_vec_cfg)
    code_vec  = TfidfVectorizer(**code_vec_cfg)
    body_hash = HashingVectorizer(**body_hash_cfg)
    char_hash = HashingVectorizer(**char_hash_cfg)
    title_vec.fit(title_text.iloc[tr_idx])
    code_vec.fit(code_text.iloc[tr_idx])
    meta_scaler = StandardScaler(with_mean=False)
    for b in batch_indices(tr_idx, BATCH_SIZE):
        meta_scaler.partial_fit(meta_all[b])

    # Transform helpers
    def transform_block(idxs):
        X_title = title_vec.transform(title_text.iloc[idxs]).astype(np.float32).multiply(title_alpha)
        X_body  = body_hash.transform(body_text.iloc[idxs]).astype(np.float32)
        X_char  = char_hash.transform((title_text.iloc[idxs] + ' ' + body_text.iloc[idxs])).astype(np.float32)
        X_code  = code_vec.transform(code_text.iloc[idxs]).astype(np.float32)
        X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[idxs]), dtype=np.float32)
        return sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)

    # Build sparse matrices in batches to avoid peak RAM: collect then vstack
    X_tr_parts, X_va_parts = [], []
    for b in batch_indices(tr_idx, BATCH_SIZE):
        X_tr_parts.append(transform_block(b))
    for b in batch_indices(va_idx, BATCH_SIZE):
        X_va_parts.append(transform_block(b))
    X_tr = sparse.vstack(X_tr_parts, format='csr', dtype=np.float32)
    X_va = sparse.vstack(X_va_parts, format='csr', dtype=np.float32)
    del X_tr_parts, X_va_parts; gc.collect()

    # Reduce label space to kept_labels for training; track excluded FNs during eval
    Y_tr = Y_tr_full[:, kept_labels]
    Y_va = Y_va_full[:, kept_labels]

    # Model: OVR Logistic Regression with SAGA (fast config)
    base = LogisticRegression(
        solver='saga', penalty='l2', C=1.0, max_iter=1000, tol=1e-3,
        class_weight=None, n_jobs=1, random_state=GLOBAL_SEED, verbose=0
    )
    clf = OneVsRestClassifier(base, n_jobs=16)
    t_fit = time.time()
    clf.fit(X_tr, Y_tr)
    print(f'[SAGA][Fold {fold_id}] Fit time: {(time.time()-t_fit)/60:.2f} min')

    # Predict probabilities on validation
    try:
        va_probs = clf.predict_proba(X_va)
    except Exception:
        from scipy.special import expit
        scores = clf.decision_function(X_va)
        va_probs = expit(scores)

    # Update TP/FP/FN across thresholds (streaming by batches to limit memory)
    bs = 4096
    for s in range(0, va_probs.shape[0], bs):
        e = min(va_probs.shape[0], s + bs)
        P = va_probs[s:e].astype(np.float32, copy=False)
        Y_true = Y_va[s:e]
        update_counts_batch(Y_true.tocsr(), P, kept_labels, THS, tp_tot, fp_tot, fn_tot,
                            tp_hi, fp_hi, fn_hi, hi_pos)
    # FN addback for excluded labels
    fn_excluded = int(Y_va_full[:, excluded_labels].sum()) if excluded_labels.size > 0 else 0
    if fn_excluded > 0:
        fn_tot += fn_excluded
    print(f'[SAGA][Fold {fold_id}] Excluded FN addback: {fn_excluded}')

    # Cleanup large mats before breaking (micro-pilot)
    del X_tr, X_va, Y_tr, Y_va, Y_tr_full, Y_va_full, va_probs, clf
    gc.collect()
    print(f'[SAGA][Fold {fold_id}] Total fold time: {(time.time()-t0_fold)/60:.2f} min')
    break

# Compute global micro-F1 across thresholds and choose best
f1s = [micro_f1(tp_tot[i], fp_tot[i], fn_tot[i]) for i in range(THS.size)]
best_idx = int(np.argmax(f1s))
best_thr = float(THS[best_idx])
best_f1 = float(f1s[best_idx])
print('[SAGA][OOF] Best micro-F1 = {:.5f} at thr = {:.3f}'.format(best_f1, best_thr))

# Per-tag thresholds for high-support labels only (based on collected counts)
per_tag_thr = np.full(n_labels, best_thr, dtype=np.float32)
for k, lab in enumerate(hi_idx):
    tps = tp_hi[k]; fps = fp_hi[k]; fns = fn_hi[k]
    f1s_lab = np.array([micro_f1(tps[i], fps[i], fns[i]) for i in range(THS.size)], dtype=np.float32)
    j = int(np.argmax(f1s_lab))
    per_tag_thr[lab] = float(THS[j])
print('[SAGA][OOF] Per-tag thresholds computed for', hi_idx.size, 'labels; others use global.')

# Persist artifacts aligned with the pipeline
pd.DataFrame({'label': labels_list}).to_csv('labels.csv', index=False)
np.save('per_tag_thresholds.npy', per_tag_thr)
np.save('global_threshold.npy', np.array([best_thr], dtype=np.float32))
pd.DataFrame({'threshold': THS, 'f1': f1s}).to_csv('oof_global_f1_curve.csv', index=False)
print('[SAGA][Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv')
print('[SAGA Baseline Pilot - FAST] DONE. Ready to scale folds and data after review.')


[SAGA] Loaded caches: (5430775, 11) (603420, 10)


[SAGA][Labels] #samples: 5430775 #labels: 41781


[SAGA][Pilot] Subsampled to 100000 rows


[SAGA][Fold 0] Train=80041, Valid=19959
[SAGA][Fold 0] kept_labels=307, excluded=41474


## Experiment Log (Emergency Track + Micro-Pilots) — To Be Archived Before Cleanup

This cell records the key outcomes from emergency fallback and pilot runs, per hygiene mandate. Cells 20–22 will be deleted next.

- Cell 16 (Vectorized micro-pilot, early version):
  - Config: 100k subset, 1-fold; TF-IDF (title/body/code/char) + meta; batch-major trainer without momentum.
  - Result: OOF micro-F1 = 0.08454 @ thr=0.053. Notes: under-learning; very low mean probabilities; served as a diagnostic baseline.

- Cell 17 (Batch-major custom trainer vNext-opt):
  - Config: 100k subset, 1-fold; Title TF-IDF (alpha=3), Body/Char Hashing, Code TF-IDF, Meta; positive weighting; Nesterov momentum; grad clipping; prior-logit bias; EPOCHS=8, BATCH=8192.
  - Result: OOF micro-F1 = 0.22046 @ thr=0.368. Notes: major architectural fix (batch-major across shards) validated learning dynamics; fastest scalable path per audits.

- Cell 18 (SAGA micro-pilot FAST):
  - Config: 100k subset, 1-fold; same channels as Cell 17; per-fold vectorizers; LogisticRegression(saga), OVR; threading guarded. 
  - Status: Began with kept_labels=307; run did not complete to metrics here (prior SAGA attempts in env stalled). Marked high-risk; retained for Track B2 time-boxed try only.

- Cell 20 (Emergency OVR-SGD fallback, hashing-only):
  - Config: 50k subset; top-1000 labels; word-hash(2^18)+char-wb(2^16); OneVsRest(SGDClassifier, max_iter=8). 
  - Holdout: micro-F1 = 0.25443 @ thr=0.100. ConvergenceWarnings due to low max_iter.
  - Submission (Cell 21 fast inference K=200): LB = 0.20057.
  - Submission (Cell 22 fast inference++ K=400): LB = 0.19910.
  - Notes: Hashing-only simplification and aggressive label pruning led to poor LB; retained only as emergency record.

Summary and Mandate Alignment:
- Custom batch-major trainer (Cell 17) is the primary asset to scale (Track A). 
- SAGA is kept as a time-boxed, guarded experiment (Track B2) only; prior stalls noted.
- Next actions (immediate):
  1) Delete Cells 20–22 (emergency fallback) to restore a clean, production notebook.
  2) Implement Track A scale-up: 300k subset, 3-fold iterative CV; feature caps (Title TF-IDF 150k, Body Hash 2^18, Char Hash 2^17, Code TF-IDF 80k), EPOCHS=5, BATCH=12,288, LR schedule (warmup + cosine). Persist kept_labels per fold and compute per-tag thresholds for support ≥ 50.
  3) Track B1: Systematize OVR-SGD on same folds for OOF ensembling data.
  4) Track B2: One last SAGA attempt with strict threading and TF-IDF on all channels; abort on stall.
  5) Track C: Blend OOF from Track A with best of B, re-optimize thresholds, refit, and infer.

In [17]:
# Track A: Productionize custom batch-major trainer on 300k subset with 3-fold iterative CV
# - Deterministic 300k sample (seed=1337). Persist fold indices for reuse across tracks.
# - Leak-proof per-fold vectorizers (Title TF-IDF 150k, Body Hash 2^18, Char Hash 2^17, Code TF-IDF 80k) + meta scaler.
# - Custom OVR logistic trainer: batch-major updates across label shards; Nesterov momentum, grad clipping,
#   positive weighting (clipped [1,10]), prior-logit bias. EPOCHS=5, BATCH=12,288.
# - LR schedule: Epoch 1 linear warmup 0.04->0.12; Epochs 2-5 cosine decay 0.12->0.02.
# - OOF aggregation: global threshold sweep + per-tag thresholds for labels with OOF support >= 50.
# - Artifacts: fold indices, kept_labels per fold, oof fold npz (va_idx, kept_labels, probs float16),
#   global_threshold.npy, per_tag_thresholds.npy, oof_global_f1_curve.csv

import os, gc, time, math
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer

# Dependency guard for iterative stratification
try:
    from skmultilearn.model_selection import IterativeStratification
except Exception:
    import sys
    from subprocess import run
    run([sys.executable, '-m', 'pip', 'install', '--quiet', 'scikit-multilearn'])
    from skmultilearn.model_selection import IterativeStratification

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

# Load parsed caches
TRAIN_FULL = 'parsed_train_full.pkl'
TEST_PKL   = 'parsed_test.pkl'
assert os.path.exists(TRAIN_FULL) and os.path.exists(TEST_PKL), 'Missing caches: parsed_train_full.pkl / parsed_test.pkl'
df_tr = pd.read_pickle(TRAIN_FULL)
df_te = pd.read_pickle(TEST_PKL)
print('[TrackA] Loaded caches:', df_tr.shape, df_te.shape)

# Prepare subset (deterministic 300k)
N_TARGET = 300_000
n_all = len(df_tr)
idx_all = np.arange(n_all)
if n_all > N_TARGET:
    rng = np.random.RandomState(GLOBAL_SEED)
    idx_sub = rng.choice(n_all, size=N_TARGET, replace=False)
    idx_sub.sort()
else:
    idx_sub = idx_all
print(f'[TrackA] Using {idx_sub.size} rows for 3-fold CV')

# Build channels on subset
title_text = df_tr['title_norm'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
body_text  = df_tr['body_text'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
code_text  = df_tr['code_text'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
meta_cols = ['title_len','body_len','code_len','num_block_code','num_urls','punct_density']
meta_all = df_tr.loc[idx_sub, meta_cols].astype(np.float32).values
y_lists = df_tr['Tags'].iloc[idx_sub].astype(str).apply(lambda s: s.split())
mlb = MultiLabelBinarizer(sparse_output=True)
Y_all = mlb.fit_transform(y_lists)
labels_list = mlb.classes_.tolist()
n_samples, n_labels = Y_all.shape
print('[TrackA][Labels] #samples:', n_samples, '#labels:', n_labels)

# Persist top-level subset indices for reproducibility
np.save('subset300k_idx.npy', idx_sub)

# CV setup: 3-fold iterative stratification
mskf = IterativeStratification(n_splits=3, order=1)
X_dummy = np.zeros((n_samples, 1))

# Feature configurations
title_vec_cfg = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=150_000, sublinear_tf=True, dtype=np.float32)
body_hash_cfg  = dict(analyzer='word', ngram_range=(1,3), n_features=2**18,
                      alternate_sign=False, norm='l2', dtype=np.float32)
char_hash_cfg  = dict(analyzer='char_wb', ngram_range=(3,6), n_features=2**17,
                      alternate_sign=False, norm='l2', dtype=np.float32)
code_vec_cfg   = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.999,
                      token_pattern=r'(?u)\b\w[\w_\+\-\#\.]*\b', max_features=80_000,
                      sublinear_tf=True, dtype=np.float32)
title_alpha = 3.0

# Trainer hyperparameters
EPOCHS = 5
BATCH_SIZE = 12_288
L2 = 2e-4
MU = 0.9
COEF_BUDGET_BYTES = 4_000_000_000  # fp32 budget for W per shard sizing
THS = np.linspace(0.05, 0.6, 12)

# Label pruning within-train to stabilize (count as FN on valid for excluded)
MIN_LABEL_FREQ_TRAIN = 20

def sigmoid_stable(Z):
    Z = np.clip(Z, -12.0, 12.0)
    return 1.0 / (1.0 + np.exp(-Z))

def prior_logit(p):
    p = np.clip(p, 1e-6, 1-1e-6)
    return np.log(p/(1.0-p))

def micro_f1(tp, fp, fn):
    denom = (2*tp + fp + fn)
    return 0.0 if denom == 0 else (2.0*tp)/denom

def batch_indices(idxs, bs):
    for s in range(0, idxs.size, bs):
        yield idxs[s:min(idxs.size, s+bs)]

def lr_for_batch(epoch_idx, batch_idx, batches_per_epoch):
    # Epochs indexed 0..EPOCHS-1
    if epoch_idx == 0:
        # linear warmup: 0.04 -> 0.12 across this epoch's batches
        start, end = 0.04, 0.12
        t = batch_idx / max(1, (batches_per_epoch - 1))
        return start + (end - start) * t
    else:
        # cosine decay over epochs 2-5 from 0.12 -> 0.02 (4 epochs total)
        # map epoch 1..4 (relative) -> progress in [0,1]
        rel = (epoch_idx) / 4.0  # epoch_idx in {1,2,3,4}
        lr_max, lr_min = 0.12, 0.02
        cos_val = 0.5 * (1 + math.cos(math.pi * rel))
        return lr_min + (lr_max - lr_min) * cos_val

# Global support to decide high-support labels for per-tag thresholds
global_support = np.asarray(Y_all.sum(axis=0)).ravel()
hi_mask = (global_support >= 50)
hi_idx = np.where(hi_mask)[0]
hi_pos = {lab: pos for pos, lab in enumerate(hi_idx)}
print('[TrackA] High-support labels (>=50):', hi_idx.size)

# Aggregators
tp_tot = np.zeros_like(THS, dtype=np.int64)
fp_tot = np.zeros_like(THS, dtype=np.int64)
fn_tot = np.zeros_like(THS, dtype=np.int64)
tp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fn_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)

# For OOF persistence (per-fold): write npz with va_idx, kept_label_idx, probs float16
oof_fold_paths = []

fold_num = 0
for tr_idx, va_idx in mskf.split(X_dummy, Y_all):
    t_fold = time.time()
    print(f"\n[TrackA][Fold {fold_num}] Train={tr_idx.size}, Valid={va_idx.size}")
    # Persist fold indices (aligned to subset)
    np.save(f'fold{fold_num}_train_idx.npy', tr_idx)
    np.save(f'fold{fold_num}_valid_idx.npy', va_idx)

    # Label pruning on train
    sup_tr = np.asarray(Y_all[tr_idx].sum(axis=0)).ravel()
    kept_labels = np.where(sup_tr >= MIN_LABEL_FREQ_TRAIN)[0]
    excluded_labels = np.setdiff1d(np.arange(n_labels), kept_labels)
    print(f'[TrackA][Fold {fold_num}] kept_labels={kept_labels.size}, excluded={excluded_labels.size}')
    # Persist kept_labels for this fold
    np.save(f'fold{fold_num}_kept_labels.npy', kept_labels)

    # Per-label positive weights (fold-level)
    p_tr = np.clip(sup_tr / float(tr_idx.size), 1e-6, 1-1e-6)
    pos_weights_global = np.clip(1.0 / p_tr, 1.0, 10.0).astype(np.float32)

    # Fit vectorizers/scaler on train split only
    title_vec = TfidfVectorizer(**title_vec_cfg)
    code_vec  = TfidfVectorizer(**code_vec_cfg)
    body_hash = HashingVectorizer(**body_hash_cfg)
    char_hash = HashingVectorizer(**char_hash_cfg)
    title_vec.fit(title_text.iloc[tr_idx])
    code_vec.fit(code_text.iloc[tr_idx])
    meta_scaler = StandardScaler(with_mean=False)
    for b in batch_indices(tr_idx, BATCH_SIZE):
        meta_scaler.partial_fit(meta_all[b])

    # Probe D and compute shard sizing
    probe_idx = tr_idx[:min(BATCH_SIZE, tr_idx.size)]
    Xt_probe = title_vec.transform(title_text.iloc[probe_idx]).astype(np.float32).multiply(title_alpha)
    Xb_probe = body_hash.transform(body_text.iloc[probe_idx]).astype(np.float32)
    Xc_probe = char_hash.transform((title_text.iloc[probe_idx] + ' ' + body_text.iloc[probe_idx])).astype(np.float32)
    Xcode_probe = code_vec.transform(code_text.iloc[probe_idx]).astype(np.float32)
    Xmeta_probe = sparse.csr_matrix(meta_scaler.transform(meta_all[probe_idx]), dtype=np.float32)
    D = sparse.hstack([Xt_probe, Xb_probe, Xc_probe, Xcode_probe, Xmeta_probe], format='csr', dtype=np.float32).shape[1]
    del Xt_probe, Xb_probe, Xc_probe, Xcode_probe, Xmeta_probe; gc.collect()
    bytes_per_coef = 4  # fp32
    shard_cap = max(1, int(COEF_BUDGET_BYTES // (bytes_per_coef * D)))
    SHARD_SIZE = max(1, min(600, shard_cap))
    shards = [kept_labels[i:i+SHARD_SIZE] for i in range(0, kept_labels.size, SHARD_SIZE)]
    est_w_mb = (D * SHARD_SIZE * bytes_per_coef) / (1024**2)
    print(f'[TrackA][Fold {fold_num}] D={D:,}, SHARD_SIZE={SHARD_SIZE}, #shards={len(shards)} (~{est_w_mb:.1f} MB per-shard W)')

    # Initialize shard parameters (held concurrently)
    shard_params = []  # (labels_idx, W, b, Vw, Vb)
    n_tr = tr_idx.size
    for shard in shards:
        Lb = len(shard)
        std = 1e-3 / math.sqrt(max(1, D))
        W = (np.random.randn(D, Lb).astype(np.float32) * std).astype(np.float32)
        pos = sup_tr[shard].astype(np.float64)
        p = pos / float(n_tr)
        b = prior_logit(p).astype(np.float32, copy=False)
        Vw = np.zeros_like(W, dtype=np.float32)
        Vb = np.zeros_like(b, dtype=np.float32)
        shard_params.append((shard, W, b, Vw, Vb))

    # Precompute batch index list
    tr_batches = [b for b in batch_indices(tr_idx, BATCH_SIZE)]
    batches_per_epoch = len(tr_batches)

    # Precompute mapping from kept_labels -> hi_idx row positions (or -1 if not high-support)
    kept_is_hi = np.array([int(lab) in hi_pos for lab in kept_labels], dtype=bool)
    kept_hi_rows = np.array([hi_pos.get(int(lab), -1) for lab in kept_labels], dtype=np.int32)

    # Training loop: batch-major with LR schedule
    t_tr = time.time()
    for ep in range(EPOCHS):
        rng = np.random.RandomState(GLOBAL_SEED + ep)
        rng.shuffle(tr_batches)
        t_ep = time.time()
        for bi, b_idx in enumerate(tr_batches):
            LR = lr_for_batch(ep, bi, batches_per_epoch)
            # Log LR for first and last batch per epoch for verification
            if bi == 0 or bi == (batches_per_epoch - 1):
                print(f'[TrackA][Fold {fold_num}] Epoch {ep+1}/{EPOCHS} batch {bi+1}/{batches_per_epoch} LR={LR:.5f}')
            # Build features once
            X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
            X_body  = body_hash.transform(body_text.iloc[b_idx]).astype(np.float32)
            X_char  = char_hash.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
            X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
            X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
            X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
            bs = X_batch.shape[0]
            for si, (shard, W, b, Vw, Vb) in enumerate(shard_params):
                Y_b = Y_all[b_idx][:, shard].toarray().astype(np.float32, copy=False)
                Z = X_batch @ W; Z += b
                P = sigmoid_stable(Z)
                E = (P - Y_b)
                # Positive weighting
                w_pos = pos_weights_global[shard][None, :]
                Wmat = 1.0 + (w_pos - 1.0) * Y_b
                E_weighted = E * Wmat
                # Gradients
                grad_W = (X_batch.T @ E_weighted) / float(bs)
                if isinstance(grad_W, np.matrix):
                    grad_W = np.asarray(grad_W)
                grad_W = grad_W.astype(np.float32, copy=False) + (L2 * W)
                grad_b = E_weighted.mean(axis=0).astype(np.float32, copy=False)
                # Gradient clipping (per-column L2) and bias clip
                col_norms = np.linalg.norm(grad_W, axis=0) + 1e-8
                clip_scale = np.minimum(1.0, 10.0 / col_norms).astype(np.float32, copy=False)
                grad_W *= clip_scale
                grad_b = np.clip(grad_b, -1.0, 1.0)
                # Nesterov momentum
                Vw *= MU; Vw += LR * grad_W
                Vb *= MU; Vb += LR * grad_b
                W -= (MU * Vw + LR * grad_W)
                b -= (MU * Vb + LR * grad_b)
                shard_params[si] = (shard, W, b, Vw, Vb)
            del X_title, X_body, X_char, X_code, X_meta, X_batch; gc.collect()
        print(f'[TrackA][Fold {fold_num}] Epoch {ep+1}/{EPOCHS} time: {(time.time()-t_ep)/60:.2f} min')
    print(f'[TrackA][Fold {fold_num}] Train total time: {(time.time()-t_tr)/60:.2f} min')

    # Validation streaming: update global and per-tag counts, and persist OOF probs per fold
    # We'll collect probs for kept_labels only and write float16 to disk with va indices and kept label indices
    P_va_chunks = []
    warned_mismatch = False
    for b in batch_indices(va_idx, BATCH_SIZE):
        # Normalize batch indices to avoid any dtype/shape surprises on last partial batch
        b = np.asarray(b, dtype=np.int64)
        b_list = b.tolist()
        X_title = title_vec.transform(title_text.iloc[b_list]).astype(np.float32).multiply(title_alpha)
        X_body  = body_hash.transform(body_text.iloc[b_list]).astype(np.float32)
        X_char  = char_hash.transform((title_text.iloc[b_list] + ' ' + body_text.iloc[b_list])).astype(np.float32)
        X_code  = code_vec.transform(code_text.iloc[b_list]).astype(np.float32)
        X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b]), dtype=np.float32)
        X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
        # collect probs across shards for kept_labels in proper order
        P_collect = []
        col_order = []
        for shard, W, b_sh, _, _ in shard_params:
            Z = (X_batch @ W); Z += b_sh
            P = sigmoid_stable(Z)
            P_collect.append(P)
            col_order.append(np.array(shard, dtype=np.int32))
        if len(P_collect) == 1:
            P_all = P_collect[0]
            order = col_order[0]
        else:
            P_all = np.concatenate(P_collect, axis=1)
            order = np.concatenate(col_order)
        # reorder columns to match kept_labels sorted order
        pos_map = {lab:i for i, lab in enumerate(order)}
        kept_pos = np.array([pos_map[k] for k in kept_labels], dtype=np.int32)
        P_kept = P_all[:, kept_pos].astype(np.float32, copy=False)
        # Dense ground-truth for safe counting (avoid sparse shape inconsistencies)
        Y_true_kept_dense = Y_all[b][:, kept_labels].toarray().astype(np.uint8, copy=False)
        # SAFEGUARD: align row counts between predictions and truths for this batch
        if P_kept.shape[0] != Y_true_kept_dense.shape[0]:
            min_rows = min(P_kept.shape[0], Y_true_kept_dense.shape[0])
            if not warned_mismatch:
                print(f"[TrackA][Fold {fold_num}] Warning: pred/true batch rows mismatch {P_kept.shape[0]} vs {Y_true_kept_dense.shape[0]} -> slicing to {min_rows}")
                warned_mismatch = True
            if P_kept.shape[0] > min_rows:
                P_kept = P_kept[:min_rows]
            if Y_true_kept_dense.shape[0] > min_rows:
                Y_true_kept_dense = Y_true_kept_dense[:min_rows]
        # Update counts for all thresholds (global) and per-tag (vectorized for hi labels)
        for ti, thr in enumerate(THS):
            pred_bin = (P_kept >= thr).astype(np.uint8)
            # Global counts
            tp = int((pred_bin & Y_true_kept_dense).sum())
            ppos = int(pred_bin.sum())
            tpos = int(Y_true_kept_dense.sum())
            tp_tot[ti] += tp
            fp_tot[ti] += (ppos - tp)
            fn_tot[ti] += (tpos - tp)
            # Per-tag counts for high-support labels present in kept_labels (vectorized)
            if hi_pos and kept_is_hi.any():
                pred_hi = pred_bin[:, kept_is_hi]
                true_hi = Y_true_kept_dense[:, kept_is_hi]
                # Column-wise sums
                tp_vec = (pred_hi & true_hi).sum(axis=0).astype(np.int64)
                p_vec = pred_hi.sum(axis=0).astype(np.int64)
                t_vec = true_hi.sum(axis=0).astype(np.int64)
                rows = kept_hi_rows[kept_is_hi]
                tp_hi[rows, ti] += np.asarray(tp_vec).ravel()
                fp_hi[rows, ti] += (np.asarray(p_vec).ravel() - np.asarray(tp_vec).ravel())
                fn_hi[rows, ti] += (np.asarray(t_vec).ravel() - np.asarray(tp_vec).ravel())
        P_va_chunks.append(P_kept.astype(np.float16, copy=False))
        del X_title, X_body, X_char, X_code, X_meta, X_batch, P_collect, col_order, P_all, order, kept_pos, P_kept, Y_true_kept_dense
        gc.collect()

    # FN addback for excluded labels to keep OOF honest
    fn_excluded = int(Y_all[va_idx][:, excluded_labels].sum()) if excluded_labels.size > 0 else 0
    if fn_excluded > 0:
        fn_tot += fn_excluded
    print(f'[TrackA][Fold {fold_num}] Excluded FN addback: {fn_excluded}')

    # Persist fold OOF npz (compressed)
    P_va = np.vstack(P_va_chunks) if len(P_va_chunks) > 0 else np.zeros((va_idx.size, kept_labels.size), dtype=np.float16)
    np.savez_compressed(f'fold{fold_num}_oof_probs.npz', va_idx=va_idx, kept_labels=kept_labels, probs=P_va)
    oof_fold_paths.append(f'fold{fold_num}_oof_probs.npz')
    print(f'[TrackA][Fold {fold_num}] Saved OOF npz: probs shape {P_va.shape}')

    # Cleanup heavy objects per fold
    del shard_params, P_va_chunks, P_va
    gc.collect()
    print(f'[TrackA][Fold {fold_num}] Time: {(time.time()-t_fold)/60:.2f} min')
    fold_num += 1

# Optimize global threshold
f1s = [micro_f1(tp_tot[i], fp_tot[i], fn_tot[i]) for i in range(THS.size)]
best_idx = int(np.argmax(f1s))
best_thr = float(THS[best_idx])
best_f1 = float(f1s[best_idx])
print('[TrackA][OOF] Global best micro-F1 = {:.5f} at thr = {:.3f}'.format(best_f1, best_thr))

# Per-tag thresholds for high-support labels only (based on aggregated counts)
per_tag_thr = np.full(n_labels, best_thr, dtype=np.float32)
for k, lab in enumerate(hi_idx):
    tps = tp_hi[k]; fps = fp_hi[k]; fns = fn_hi[k]
    f1s_lab = np.array([micro_f1(tps[i], fps[i], fns[i]) for i in range(THS.size)], dtype=np.float32)
    j = int(np.argmax(f1s_lab))
    per_tag_thr[lab] = float(THS[j])
print('[TrackA][OOF] Per-tag thresholds set for', hi_idx.size, 'labels; others use global.')

# Persist global/per-tag thresholds and OOF curve
pd.DataFrame({'label': labels_list}).to_csv('labels.csv', index=False)
np.save('per_tag_thresholds.npy', per_tag_thr)
np.save('global_threshold.npy', np.array([best_thr], dtype=np.float32))
pd.DataFrame({'threshold': THS, 'f1': f1s}).to_csv('oof_global_f1_curve.csv', index=False)
print('[TrackA][Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv')

# Log summary
print('\n[TrackA] SUMMARY:')
print(' - Subset size:', n_samples)
print(' - Labels:', n_labels)
print(' - High-support labels (>=50):', hi_idx.size)
print(' - Folds OOF files:', oof_fold_paths)
print(' - Best global thr:', round(best_thr,3), 'OOF micro-F1:', round(best_f1,5))
print('[TrackA] DONE. Proceed to Track B using persisted fold indices and kept_labels per fold.')


[TrackA] Loaded caches: (5430775, 11) (603420, 10)


[TrackA] Using 300000 rows for 3-fold CV


[TrackA][Labels] #samples: 300000 #labels: 26770


[TrackA] High-support labels (>=50): 2147



[TrackA][Fold 0] Train=199994, Valid=100006
[TrackA][Fold 0] kept_labels=3232, excluded=23538


[TrackA][Fold 0] D=597,039, SHARD_SIZE=600, #shards=6 (~1366.5 MB per-shard W)


[TrackA][Fold 0] Epoch 1/5 batch 1/17 LR=0.04000


[TrackA][Fold 0] Epoch 1/5 batch 17/17 LR=0.12000


[TrackA][Fold 0] Epoch 1/5 time: 30.16 min
[TrackA][Fold 0] Epoch 2/5 batch 1/17 LR=0.10536


[TrackA][Fold 0] Epoch 2/5 batch 17/17 LR=0.10536


[TrackA][Fold 0] Epoch 2/5 time: 29.75 min
[TrackA][Fold 0] Epoch 3/5 batch 1/17 LR=0.07000


[TrackA][Fold 0] Epoch 3/5 batch 17/17 LR=0.07000


[TrackA][Fold 0] Epoch 3/5 time: 30.96 min
[TrackA][Fold 0] Epoch 4/5 batch 1/17 LR=0.03464


[TrackA][Fold 0] Epoch 4/5 batch 17/17 LR=0.03464


[TrackA][Fold 0] Epoch 4/5 time: 30.09 min
[TrackA][Fold 0] Epoch 5/5 batch 1/17 LR=0.02000


[TrackA][Fold 0] Epoch 5/5 batch 17/17 LR=0.02000


[TrackA][Fold 0] Epoch 5/5 time: 31.33 min
[TrackA][Fold 0] Train total time: 152.29 min


[TrackA][Fold 0] Excluded FN addback: 40455


[TrackA][Fold 0] Saved OOF npz: probs shape (100006, 3232)


[TrackA][Fold 0] Time: 162.53 min

[TrackA][Fold 1] Train=200076, Valid=99924
[TrackA][Fold 1] kept_labels=3227, excluded=23543


[TrackA][Fold 1] D=596,170, SHARD_SIZE=600, #shards=6 (~1364.5 MB per-shard W)


[TrackA][Fold 1] Epoch 1/5 batch 1/17 LR=0.04000


[TrackA][Fold 1] Epoch 1/5 batch 17/17 LR=0.12000


[TrackA][Fold 1] Epoch 1/5 time: 28.95 min
[TrackA][Fold 1] Epoch 2/5 batch 1/17 LR=0.10536


[TrackA][Fold 1] Epoch 2/5 batch 17/17 LR=0.10536


[TrackA][Fold 1] Epoch 2/5 time: 31.27 min
[TrackA][Fold 1] Epoch 3/5 batch 1/17 LR=0.07000


[TrackA][Fold 1] Epoch 3/5 batch 17/17 LR=0.07000


[TrackA][Fold 1] Epoch 3/5 time: 27.76 min
[TrackA][Fold 1] Epoch 4/5 batch 1/17 LR=0.03464


[TrackA][Fold 1] Epoch 4/5 batch 17/17 LR=0.03464


[TrackA][Fold 1] Epoch 4/5 time: 29.13 min
[TrackA][Fold 1] Epoch 5/5 batch 1/17 LR=0.02000


[TrackA][Fold 1] Epoch 5/5 batch 17/17 LR=0.02000


[TrackA][Fold 1] Epoch 5/5 time: 33.07 min
[TrackA][Fold 1] Train total time: 150.17 min


[TrackA][Fold 1] Excluded FN addback: 40109


[TrackA][Fold 1] Saved OOF npz: probs shape (99924, 3227)


[TrackA][Fold 1] Time: 160.75 min

[TrackA][Fold 2] Train=199930, Valid=100070
[TrackA][Fold 2] kept_labels=3233, excluded=23537


[TrackA][Fold 2] D=596,389, SHARD_SIZE=600, #shards=6 (~1365.0 MB per-shard W)


[TrackA][Fold 2] Epoch 1/5 batch 1/17 LR=0.04000


[TrackA][Fold 2] Epoch 1/5 batch 17/17 LR=0.12000


[TrackA][Fold 2] Epoch 1/5 time: 31.76 min
[TrackA][Fold 2] Epoch 2/5 batch 1/17 LR=0.10536


[TrackA][Fold 2] Epoch 2/5 batch 17/17 LR=0.10536


[TrackA][Fold 2] Epoch 2/5 time: 31.48 min
[TrackA][Fold 2] Epoch 3/5 batch 1/17 LR=0.07000


[TrackA][Fold 2] Epoch 3/5 batch 17/17 LR=0.07000


[TrackA][Fold 2] Epoch 3/5 time: 30.35 min
[TrackA][Fold 2] Epoch 4/5 batch 1/17 LR=0.03464


[TrackA][Fold 2] Epoch 4/5 batch 17/17 LR=0.03464


[TrackA][Fold 2] Epoch 4/5 time: 28.57 min
[TrackA][Fold 2] Epoch 5/5 batch 1/17 LR=0.02000


[TrackA][Fold 2] Epoch 5/5 batch 17/17 LR=0.02000


[TrackA][Fold 2] Epoch 5/5 time: 33.28 min
[TrackA][Fold 2] Train total time: 155.44 min


[TrackA][Fold 2] Excluded FN addback: 40372


[TrackA][Fold 2] Saved OOF npz: probs shape (100070, 3233)


[TrackA][Fold 2] Time: 166.21 min


[TrackA][OOF] Global best micro-F1 = 0.18089 at thr = 0.400
[TrackA][OOF] Per-tag thresholds set for 2147 labels; others use global.
[TrackA][Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv

[TrackA] SUMMARY:
 - Subset size: 300000
 - Labels: 26770
 - High-support labels (>=50): 2147
 - Folds OOF files: ['fold0_oof_probs.npz', 'fold1_oof_probs.npz', 'fold2_oof_probs.npz']
 - Best global thr: 0.4 OOF micro-F1: 0.18089
[TrackA] DONE. Proceed to Track B using persisted fold indices and kept_labels per fold.


In [18]:
# Track A Rerun (Fixed): Critical per-tag FN bugfix + underfitting fixes + safer shard sizing
# - Reuse the EXACT same 300k subset and 3-fold splits from prior Track A (for comparability/ensembling)
# - Bugfix: add per-tag FN for any high-support label excluded in a fold (was missing before)
# - Learning: EPOCHS=8, more aggressive LR schedule (2-epoch warmup to 0.15, cosine decay to 0.03)
# - Memory headroom: cap SHARD_SIZE at 400 (prev 600)
# - Persist new OOF per-fold npz, thresholds, and curve (overwrites prior Track A artifacts)

import os, gc, time, math
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

# Load caches
TRAIN_FULL = 'parsed_train_full.pkl'
TEST_PKL   = 'parsed_test.pkl'
assert os.path.exists(TRAIN_FULL) and os.path.exists(TEST_PKL), 'Missing caches: parsed_train_full.pkl / parsed_test.pkl'
df_tr = pd.read_pickle(TRAIN_FULL)
df_te = pd.read_pickle(TEST_PKL)
print('[TrackA-FIX] Loaded caches:', df_tr.shape, df_te.shape)

# Reuse the exact 300k subset indices if available; else recreate deterministically
if os.path.exists('subset300k_idx.npy'):
    idx_sub = np.load('subset300k_idx.npy')
    print('[TrackA-FIX] Loaded subset indices from subset300k_idx.npy:', idx_sub.size)
else:
    N_TARGET = 300_000
    n_all = len(df_tr)
    if n_all > N_TARGET:
        rng = np.random.RandomState(GLOBAL_SEED)
        idx_sub = rng.choice(n_all, size=N_TARGET, replace=False)
        idx_sub.sort()
    else:
        idx_sub = np.arange(n_all)
    np.save('subset300k_idx.npy', idx_sub)
    print('[TrackA-FIX] Saved subset indices; size=', idx_sub.size)

# Build channels on subset
title_text = df_tr['title_norm'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
body_text  = df_tr['body_text'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
code_text  = df_tr['code_text'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
meta_cols = ['title_len','body_len','code_len','num_block_code','num_urls','punct_density']
meta_all = df_tr.loc[idx_sub, meta_cols].astype(np.float32).values
y_lists = df_tr['Tags'].iloc[idx_sub].astype(str).apply(lambda s: s.split())
mlb = MultiLabelBinarizer(sparse_output=True)
Y_all = mlb.fit_transform(y_lists)
labels_list = mlb.classes_.tolist()
n_samples, n_labels = Y_all.shape
print('[TrackA-FIX][Labels] #samples:', n_samples, '#labels:', n_labels)

# Attempt to reuse prior fold indices for perfect alignment with Track B; else derive them from disk
fold_splits = []
reuse_ok = True
for i in range(3):
    tr_path = f'fold{i}_train_idx.npy'
    va_path = f'fold{i}_valid_idx.npy'
    if os.path.exists(tr_path) and os.path.exists(va_path):
        tr_idx = np.load(tr_path)
        va_idx = np.load(va_path)
        fold_splits.append((tr_idx, va_idx))
    else:
        reuse_ok = False
        break
if not reuse_ok or len(fold_splits) != 3:
    # As a fallback, rebuild the same splits deterministically using iterative stratification
    print('[TrackA-FIX] Prior fold indices not found/complete. Rebuilding deterministic 3-fold splits...')
    try:
        from skmultilearn.model_selection import IterativeStratification
    except Exception:
        import sys
        from subprocess import run
        run([sys.executable, '-m', 'pip', 'install', '--quiet', 'scikit-multilearn'])
        from skmultilearn.model_selection import IterativeStratification
    mskf = IterativeStratification(n_splits=3, order=1)
    X_dummy = np.zeros((n_samples, 1))
    for k, (tr_idx, va_idx) in enumerate(mskf.split(X_dummy, Y_all)):
        np.save(f'fold{k}_train_idx.npy', tr_idx)
        np.save(f'fold{k}_valid_idx.npy', va_idx)
        fold_splits.append((tr_idx, va_idx))
    print('[TrackA-FIX] Saved fold indices.')
else:
    print('[TrackA-FIX] Reusing existing 3-fold indices from disk.')

# Feature configs (same channels; safer caps)
title_vec_cfg = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=150_000, sublinear_tf=True, dtype=np.float32)
body_hash_cfg  = dict(analyzer='word', ngram_range=(1,3), n_features=2**18,
                      alternate_sign=False, norm='l2', dtype=np.float32)
char_hash_cfg  = dict(analyzer='char_wb', ngram_range=(3,6), n_features=2**17,
                      alternate_sign=False, norm='l2', dtype=np.float32)
code_vec_cfg   = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.999,
                      token_pattern=r'(?u)\b\w[\w_\+\-\#\.]*\b', max_features=80_000,
                      sublinear_tf=True, dtype=np.float32)
title_alpha = 3.0

# Trainer hyperparameters (fixed per audit):
EPOCHS = 8
BATCH_SIZE = 12_288
L2 = 2e-4
MU = 0.9
COEF_BUDGET_BYTES = 4_000_000_000  # fp32 budget for W per shard sizing
MAX_SHARD_CAP = 400  # safer headroom (was 600)
THS = np.linspace(0.05, 0.6, 12)
MIN_LABEL_FREQ_TRAIN = 20

def sigmoid_stable(Z):
    Z = np.clip(Z, -12.0, 12.0)
    return 1.0 / (1.0 + np.exp(-Z))

def prior_logit(p):
    p = np.clip(p, 1e-6, 1-1e-6)
    return np.log(p/(1.0-p))

def micro_f1(tp, fp, fn):
    denom = (2*tp + fp + fn)
    return 0.0 if denom == 0 else (2.0*tp)/denom

def batch_indices(idxs, bs):
    for s in range(0, idxs.size, bs):
        yield idxs[s:min(idxs.size, s+bs)]

def lr_for_batch(epoch_idx, batch_idx, batches_per_epoch):
    # Aggressive schedule: 2-epoch warmup to 0.15, then cosine decay to 0.03 over epochs 2..7
    if epoch_idx <= 1:
        start, end = 0.06, 0.15
        steps = max(1, (2 * batches_per_epoch - 1))
        t = (epoch_idx * batches_per_epoch + batch_idx) / steps
        return start + (end - start) * t
    else:
        lr_max, lr_min = 0.15, 0.03
        # epochs 2..7 inclusive -> 6 epochs of decay
        rel = (epoch_idx - 2 + batch_idx / max(1, (batches_per_epoch - 1))) / 6.0
        rel = np.clip(rel, 0.0, 1.0)
        cos_val = 0.5 * (1 + math.cos(math.pi * rel))
        return lr_min + (lr_max - lr_min) * cos_val

# Global support to decide high-support labels for per-tag thresholds
global_support = np.asarray(Y_all.sum(axis=0)).ravel()
hi_mask = (global_support >= 50)
hi_idx = np.where(hi_mask)[0]
hi_pos = {lab: pos for pos, lab in enumerate(hi_idx)}
print('[TrackA-FIX] High-support labels (>=50):', hi_idx.size)

# Aggregators
tp_tot = np.zeros_like(THS, dtype=np.int64)
fp_tot = np.zeros_like(THS, dtype=np.int64)
fn_tot = np.zeros_like(THS, dtype=np.int64)
tp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fn_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)

oof_fold_paths = []
for fold_num, (tr_idx, va_idx) in enumerate(fold_splits[1:], start=1):  # RESUME FROM FOLD 1
    t_fold = time.time()
    print(f"\n[TrackA-FIX][Fold {fold_num}] Train={tr_idx.size}, Valid={va_idx.size}")
    # Save indices again to confirm reuse
    np.save(f'fold{fold_num}_train_idx.npy', tr_idx)
    np.save(f'fold{fold_num}_valid_idx.npy', va_idx)

    # Label pruning on train
    Y_tr_full = Y_all[tr_idx]
    Y_va_full = Y_all[va_idx]
    sup_tr = np.asarray(Y_tr_full.sum(axis=0)).ravel()
    kept_labels = np.where(sup_tr >= MIN_LABEL_FREQ_TRAIN)[0]
    excluded_labels = np.setdiff1d(np.arange(n_labels), kept_labels)
    np.save(f'fold{fold_num}_kept_labels.npy', kept_labels)
    print(f'[TrackA-FIX][Fold {fold_num}] kept_labels={kept_labels.size}, excluded={excluded_labels.size}')

    # Per-label positive weights (fold-level)
    p_tr = np.clip(sup_tr / float(tr_idx.size), 1e-6, 1-1e-6)
    pos_weights_global = np.clip(1.0 / p_tr, 1.0, 10.0).astype(np.float32)

    # Fit vectorizers/scaler on train split only
    title_vec = TfidfVectorizer(**title_vec_cfg)
    code_vec  = TfidfVectorizer(**code_vec_cfg)
    body_hash = HashingVectorizer(**body_hash_cfg)
    char_hash = HashingVectorizer(**char_hash_cfg)
    title_vec.fit(title_text.iloc[tr_idx])
    code_vec.fit(code_text.iloc[tr_idx])
    meta_scaler = StandardScaler(with_mean=False)
    for b in batch_indices(tr_idx, BATCH_SIZE):
        meta_scaler.partial_fit(meta_all[b])

    # Probe D and compute shard sizing (with safer cap)
    probe_idx = tr_idx[:min(BATCH_SIZE, tr_idx.size)]
    Xt_probe = title_vec.transform(title_text.iloc[probe_idx]).astype(np.float32).multiply(title_alpha)
    Xb_probe = body_hash.transform(body_text.iloc[probe_idx]).astype(np.float32)
    Xc_probe = char_hash.transform((title_text.iloc[probe_idx] + ' ' + body_text.iloc[probe_idx])).astype(np.float32)
    Xcode_probe = code_vec.transform(code_text.iloc[probe_idx]).astype(np.float32)
    Xmeta_probe = sparse.csr_matrix(meta_scaler.transform(meta_all[probe_idx]), dtype=np.float32)
    D = sparse.hstack([Xt_probe, Xb_probe, Xc_probe, Xcode_probe, Xmeta_probe], format='csr', dtype=np.float32).shape[1]
    del Xt_probe, Xb_probe, Xc_probe, Xcode_probe, Xmeta_probe; gc.collect()
    bytes_per_coef = 4
    shard_cap = max(1, int(COEF_BUDGET_BYTES // (bytes_per_coef * D)))
    SHARD_SIZE = max(1, min(MAX_SHARD_CAP, shard_cap))
    shards = [kept_labels[i:i+SHARD_SIZE] for i in range(0, kept_labels.size, SHARD_SIZE)]
    est_w_mb = (D * SHARD_SIZE * bytes_per_coef) / (1024**2)
    print(f'[TrackA-FIX][Fold {fold_num}] D={D:,}, SHARD_SIZE={SHARD_SIZE}, #shards={len(shards)} (~{est_w_mb:.1f} MB per-shard W)')

    # Initialize shard params
    shard_params = []  # (labels_idx, W, b, Vw, Vb)
    n_tr = tr_idx.size
    for shard in shards:
        Lb = len(shard)
        std = 1e-3 / math.sqrt(max(1, D))
        W = (np.random.randn(D, Lb).astype(np.float32) * std).astype(np.float32)
        pos = sup_tr[shard].astype(np.float64)
        p = pos / float(n_tr)
        b = prior_logit(p).astype(np.float32, copy=False)
        Vw = np.zeros_like(W, dtype=np.float32)
        Vb = np.zeros_like(b, dtype=np.float32)
        shard_params.append((shard, W, b, Vw, Vb))

    # Precompute batches
    tr_batches = [b for b in batch_indices(tr_idx, BATCH_SIZE)]
    batches_per_epoch = len(tr_batches)

    # Precompute mapping from kept_labels -> hi_idx row positions
    kept_is_hi = np.array([int(lab) in hi_pos for lab in kept_labels], dtype=bool)
    kept_hi_rows = np.array([hi_pos.get(int(lab), -1) for lab in kept_labels], dtype=np.int32)

    # Training loop with new LR schedule
    t_tr = time.time()
    for ep in range(EPOCHS):
        rng = np.random.RandomState(GLOBAL_SEED + ep)
        rng.shuffle(tr_batches)
        t_ep = time.time()
        for bi, b_idx in enumerate(tr_batches):
            LR = lr_for_batch(ep, bi, batches_per_epoch)
            if bi == 0 or bi == (batches_per_epoch - 1):
                print(f'[TrackA-FIX][Fold {fold_num}] Epoch {ep+1}/{EPOCHS} batch {bi+1}/{batches_per_epoch} LR={LR:.5f}')
            X_title = title_vec.transform(title_text.iloc[b_idx]).astype(np.float32).multiply(title_alpha)
            X_body  = body_hash.transform(body_text.iloc[b_idx]).astype(np.float32)
            X_char  = char_hash.transform((title_text.iloc[b_idx] + ' ' + body_text.iloc[b_idx])).astype(np.float32)
            X_code  = code_vec.transform(code_text.iloc[b_idx]).astype(np.float32)
            X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b_idx]), dtype=np.float32)
            X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
            bs = X_batch.shape[0]
            for si, (shard, W, b_sh, Vw, Vb) in enumerate(shard_params):
                Y_b = Y_all[b_idx][:, shard].toarray().astype(np.float32, copy=False)
                Z = X_batch @ W; Z += b_sh
                P = sigmoid_stable(Z)
                E = (P - Y_b)
                # Positive weighting
                w_pos = pos_weights_global[shard][None, :]
                Wmat = 1.0 + (w_pos - 1.0) * Y_b
                E_weighted = E * Wmat
                # Gradients
                grad_W = (X_batch.T @ E_weighted) / float(bs)
                if isinstance(grad_W, np.matrix):
                    grad_W = np.asarray(grad_W)
                grad_W = grad_W.astype(np.float32, copy=False) + (L2 * W)
                grad_b = E_weighted.mean(axis=0).astype(np.float32, copy=False)
                # Gradient clipping
                col_norms = np.linalg.norm(grad_W, axis=0) + 1e-8
                clip_scale = np.minimum(1.0, 10.0 / col_norms).astype(np.float32, copy=False)
                grad_W *= clip_scale
                grad_b = np.clip(grad_b, -1.0, 1.0)
                # Nesterov momentum update
                Vw *= MU; Vw += LR * grad_W
                Vb *= MU; Vb += LR * grad_b
                W -= (MU * Vw + LR * grad_W)
                b_sh -= (MU * Vb + LR * grad_b)
                shard_params[si] = (shard, W, b_sh, Vw, Vb)
            del X_title, X_body, X_char, X_code, X_meta, X_batch; gc.collect()
        print(f'[TrackA-FIX][Fold {fold_num}] Epoch {ep+1}/{EPOCHS} time: {(time.time()-t_ep)/60:.2f} min')
    print(f'[TrackA-FIX][Fold {fold_num}] Train total time: {(time.time()-t_tr)/60:.2f} min')

    # Validation + OOF collection
    P_va_chunks = []
    warned_mismatch = False
    for b in batch_indices(va_idx, BATCH_SIZE):
        b = np.asarray(b, dtype=np.int64)
        b_list = b.tolist()
        X_title = title_vec.transform(title_text.iloc[b_list]).astype(np.float32).multiply(title_alpha)
        X_body  = body_hash.transform(body_text.iloc[b_list]).astype(np.float32)
        X_char  = char_hash.transform((title_text.iloc[b_list] + ' ' + body_text.iloc[b_list])).astype(np.float32)
        X_code  = code_vec.transform(code_text.iloc[b_list]).astype(np.float32)
        X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[b]), dtype=np.float32)
        X_batch = sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)
        P_collect = []
        col_order = []
        for shard, W, b_sh, _, _ in shard_params:
            Z = (X_batch @ W); Z += b_sh
            P = sigmoid_stable(Z)
            P_collect.append(P)
            col_order.append(np.array(shard, dtype=np.int32))
        if len(P_collect) == 1:
            P_all = P_collect[0]; order = col_order[0]
        else:
            P_all = np.concatenate(P_collect, axis=1); order = np.concatenate(col_order)
        pos_map = {lab:i for i, lab in enumerate(order)}
        kept_pos = np.array([pos_map[k] for k in kept_labels], dtype=np.int32)
        P_kept = P_all[:, kept_pos].astype(np.float32, copy=False)
        Y_true_kept_dense = Y_all[b][:, kept_labels].toarray().astype(np.uint8, copy=False)
        if P_kept.shape[0] != Y_true_kept_dense.shape[0]:
            min_rows = min(P_kept.shape[0], Y_true_kept_dense.shape[0])
            if not warned_mismatch:
                print(f"[TrackA-FIX][Fold {fold_num}] Warning: pred/true batch rows mismatch {P_kept.shape[0]} vs {Y_true_kept_dense.shape[0]} -> slicing to {min_rows}")
                warned_mismatch = True
            P_kept = P_kept[:min_rows]
            Y_true_kept_dense = Y_true_kept_dense[:min_rows]
        for ti, thr in enumerate(THS):
            pred_bin = (P_kept >= thr).astype(np.uint8)
            tp = int((pred_bin & Y_true_kept_dense).sum())
            ppos = int(pred_bin.sum())
            tpos = int(Y_true_kept_dense.sum())
            tp_tot[ti] += tp
            fp_tot[ti] += (ppos - tp)
            fn_tot[ti] += (tpos - tp)
            if hi_pos and kept_is_hi.any():
                pred_hi = pred_bin[:, kept_is_hi]
                true_hi = Y_true_kept_dense[:, kept_is_hi]
                tp_vec = (pred_hi & true_hi).sum(axis=0).astype(np.int64)
                p_vec = pred_hi.sum(axis=0).astype(np.int64)
                t_vec = true_hi.sum(axis=0).astype(np.int64)
                rows = kept_hi_rows[kept_is_hi]
                tp_hi[rows, ti] += np.asarray(tp_vec).ravel()
                fp_hi[rows, ti] += (np.asarray(p_vec).ravel() - np.asarray(tp_vec).ravel())
                fn_hi[rows, ti] += (np.asarray(t_vec).ravel() - np.asarray(tp_vec).ravel())
        P_va_chunks.append(P_kept.astype(np.float16, copy=False))
        del X_title, X_body, X_char, X_code, X_meta, X_batch, P_collect, col_order, P_all, order, kept_pos, P_kept, Y_true_kept_dense
        gc.collect()

    # FN addback for globally excluded labels (global counts)
    fn_excluded = int(Y_va_full[:, excluded_labels].sum()) if excluded_labels.size > 0 else 0
    if fn_excluded > 0:
        fn_tot += fn_excluded
    print(f'[TrackA-FIX][Fold {fold_num}] Excluded FN addback (global): {fn_excluded}')

    # CRITICAL BUGFIX: per-tag FN addback for high-support labels excluded from this fold
    hi_missing = np.array([lab for lab in hi_idx if lab not in set(kept_labels)], dtype=np.int32)
    add_count_total = 0
    if hi_missing.size > 0:
        # For each missing high-support label, all its validation positives are FN across ALL thresholds
        for lab in hi_missing:
            row = hi_pos[int(lab)]
            val_pos = int(Y_va_full[:, lab].sum())
            if val_pos > 0:
                fn_hi[row, :] += val_pos
                add_count_total += val_pos
        print(f'[TrackA-FIX][Fold {fold_num}] Per-tag FN bugfix: missing_hi_labels={hi_missing.size}, added_FN_total={add_count_total}')
    else:
        print(f'[TrackA-FIX][Fold {fold_num}] Per-tag FN bugfix: no high-support labels missing in this fold.')

    # Persist fold OOF npz
    P_va = np.vstack(P_va_chunks) if len(P_va_chunks) > 0 else np.zeros((va_idx.size, kept_labels.size), dtype=np.float16)
    np.savez_compressed(f'fold{fold_num}_oof_probs.npz', va_idx=va_idx, kept_labels=kept_labels, probs=P_va)
    oof_fold_paths.append(f'fold{fold_num}_oof_probs.npz')
    print(f'[TrackA-FIX][Fold {fold_num}] Saved OOF npz: probs shape {P_va.shape}')

    # Cleanup per fold
    del shard_params, P_va_chunks, P_va, Y_tr_full, Y_va_full
    gc.collect()
    print(f'[TrackA-FIX][Fold {fold_num}] Time: {(time.time()-t_fold)/60:.2f} min')

# Optimize global threshold
f1s = [micro_f1(tp_tot[i], fp_tot[i], fn_tot[i]) for i in range(THS.size)]
best_idx = int(np.argmax(f1s))
best_thr = float(THS[best_idx])
best_f1 = float(f1s[best_idx])
print('[TrackA-FIX][OOF] Global best micro-F1 = {:.5f} at thr = {:.3f}'.format(best_f1, best_thr))

# Per-tag thresholds for high-support labels only
per_tag_thr = np.full(n_labels, best_thr, dtype=np.float32)
for k, lab in enumerate(hi_idx):
    tps = tp_hi[k]; fps = fp_hi[k]; fns = fn_hi[k]
    f1s_lab = np.array([micro_f1(tps[i], fps[i], fns[i]) for i in range(THS.size)], dtype=np.float32)
    j = int(np.argmax(f1s_lab))
    per_tag_thr[lab] = float(THS[j])
print('[TrackA-FIX][OOF] Per-tag thresholds set for', hi_idx.size, 'labels; others use global.')

# Persist artifacts
pd.DataFrame({'label': labels_list}).to_csv('labels.csv', index=False)
np.save('per_tag_thresholds.npy', per_tag_thr)
np.save('global_threshold.npy', np.array([best_thr], dtype=np.float32))
pd.DataFrame({'threshold': THS, 'f1': f1s}).to_csv('oof_global_f1_curve.csv', index=False)
print('[TrackA-FIX][Artifacts] Saved labels.csv, per_tag_thresholds.npy, global_threshold.npy, oof_global_f1_curve.csv')

print('\n[TrackA-FIX] SUMMARY:')
print(' - Subset size:', n_samples)
print(' - Labels:', n_labels)
print(' - High-support labels (>=50):', hi_idx.size)
print(' - Folds OOF files:', oof_fold_paths)
print(' - Best global thr:', round(best_thr,3), 'OOF micro-F1:', round(best_f1,5))
print('[TrackA-FIX] DONE. Next: Start Track B1 (OVR-SGD) on the same folds for ensembling.')


[TrackA-FIX] Loaded caches: (5430775, 11) (603420, 10)
[TrackA-FIX] Loaded subset indices from subset300k_idx.npy: 300000


[TrackA-FIX][Labels] #samples: 300000 #labels: 26770
[TrackA-FIX] Reusing existing 3-fold indices from disk.
[TrackA-FIX] High-support labels (>=50): 2147

[TrackA-FIX][Fold 0] Train=199994, Valid=100006
[TrackA-FIX][Fold 0] kept_labels=3232, excluded=23538


[TrackA-FIX][Fold 0] D=597,039, SHARD_SIZE=400, #shards=9 (~911.0 MB per-shard W)


[TrackA-FIX][Fold 0] Epoch 1/8 batch 1/17 LR=0.06000


[TrackA-FIX][Fold 0] Epoch 1/8 batch 17/17 LR=0.10364


[TrackA-FIX][Fold 0] Epoch 1/8 time: 35.87 min
[TrackA-FIX][Fold 0] Epoch 2/8 batch 1/17 LR=0.10636


[TrackA-FIX][Fold 0] Epoch 2/8 batch 17/17 LR=0.15000


[TrackA-FIX][Fold 0] Epoch 2/8 time: 34.93 min
[TrackA-FIX][Fold 0] Epoch 3/8 batch 1/17 LR=0.15000


[TrackA-FIX][Fold 0] Epoch 3/8 batch 17/17 LR=0.14196


[TrackA-FIX][Fold 0] Epoch 3/8 time: 34.50 min
[TrackA-FIX][Fold 0] Epoch 4/8 batch 1/17 LR=0.14196


[TrackA-FIX][Fold 0] Epoch 4/8 batch 17/17 LR=0.12000


[TrackA-FIX][Fold 0] Epoch 4/8 time: 35.91 min
[TrackA-FIX][Fold 0] Epoch 5/8 batch 1/17 LR=0.12000


[TrackA-FIX][Fold 0] Epoch 5/8 batch 17/17 LR=0.09000


[TrackA-FIX][Fold 0] Epoch 5/8 time: 32.78 min
[TrackA-FIX][Fold 0] Epoch 6/8 batch 1/17 LR=0.09000


[TrackA-FIX][Fold 0] Epoch 6/8 batch 17/17 LR=0.06000


[TrackA-FIX][Fold 0] Epoch 6/8 time: 34.36 min
[TrackA-FIX][Fold 0] Epoch 7/8 batch 1/17 LR=0.06000


[TrackA-FIX][Fold 0] Epoch 7/8 batch 17/17 LR=0.03804


[TrackA-FIX][Fold 0] Epoch 7/8 time: 37.41 min
[TrackA-FIX][Fold 0] Epoch 8/8 batch 1/17 LR=0.03804


[TrackA-FIX][Fold 0] Epoch 8/8 batch 17/17 LR=0.03000


[TrackA-FIX][Fold 0] Epoch 8/8 time: 36.62 min
[TrackA-FIX][Fold 0] Train total time: 282.38 min


[TrackA-FIX][Fold 0] Excluded FN addback (global): 40455


[TrackA-FIX][Fold 0] Per-tag FN bugfix: no high-support labels missing in this fold.


[TrackA-FIX][Fold 0] Saved OOF npz: probs shape (100006, 3232)


[TrackA-FIX][Fold 0] Time: 294.08 min

[TrackA-FIX][Fold 1] Train=200076, Valid=99924
[TrackA-FIX][Fold 1] kept_labels=3227, excluded=23543


[TrackA-FIX][Fold 1] D=596,170, SHARD_SIZE=400, #shards=9 (~909.7 MB per-shard W)


[TrackA-FIX][Fold 1] Epoch 1/8 batch 1/17 LR=0.06000


In [None]:
# Track B1: OVR-SGD baseline on EXACT same 300k subset and 3-fold splits as Track A
# - Uses persisted fold{i}_train_idx.npy, fold{i}_valid_idx.npy, and fold{i}_kept_labels.npy
# - Channels match Track A: Title TF-IDF (alpha=3), Body Hash 2^18, Char Hash 2^17, Code TF-IDF 80k, Meta scaler
# - Model: OneVsRest(SGDClassifier, loss='log_loss', early_stopping) with reasonable iterations
# - Outputs: fold{i}_oof_probs_sgd.npz (va_idx, kept_labels, probs float16), plus global OOF micro-F1 and thresholds

import os, gc, time, math
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

GLOBAL_SEED = 1337
np.random.seed(GLOBAL_SEED)

# Load caches and subset indices
TRAIN_FULL = 'parsed_train_full.pkl'
TEST_PKL   = 'parsed_test.pkl'
assert os.path.exists(TRAIN_FULL) and os.path.exists(TEST_PKL), 'Missing caches parsed_train_full.pkl / parsed_test.pkl'
assert os.path.exists('subset300k_idx.npy'), 'subset300k_idx.npy not found. Run Track A first.'
df_tr = pd.read_pickle(TRAIN_FULL)
idx_sub = np.load('subset300k_idx.npy')
print('[TrackB1] Loaded caches:', df_tr.shape, 'subset size=', idx_sub.size)

# Build channels on subset (identical to Track A)
title_text = df_tr['title_norm'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
body_text  = df_tr['body_text'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
code_text  = df_tr['code_text'].iloc[idx_sub].fillna('').astype(str).reset_index(drop=True)
meta_cols = ['title_len','body_len','code_len','num_block_code','num_urls','punct_density']
meta_all = df_tr.loc[idx_sub, meta_cols].astype(np.float32).values
y_lists = df_tr['Tags'].iloc[idx_sub].astype(str).apply(lambda s: s.split())
mlb = MultiLabelBinarizer(sparse_output=True)
Y_all = mlb.fit_transform(y_lists)
labels_list = mlb.classes_.tolist()
n_samples, n_labels = Y_all.shape
print('[TrackB1][Labels] #samples:', n_samples, '#labels:', n_labels)

# Feature configurations (match Track A)
title_vec_cfg = dict(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                     max_features=150_000, sublinear_tf=True, dtype=np.float32)
body_hash_cfg  = dict(analyzer='word', ngram_range=(1,3), n_features=2**18,
                      alternate_sign=False, norm='l2', dtype=np.float32)
char_hash_cfg  = dict(analyzer='char_wb', ngram_range=(3,6), n_features=2**17,
                      alternate_sign=False, norm='l2', dtype=np.float32)
code_vec_cfg   = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.999,
                      token_pattern=r'(?u)\b\w[\w_\+\-\#\.]*\b', max_features=80_000,
                      sublinear_tf=True, dtype=np.float32)
title_alpha = 3.0

BATCH_SIZE = 8192
THS = np.linspace(0.05, 0.6, 12)

def batch_indices(idxs, bs):
    for s in range(0, idxs.size, bs):
        yield idxs[s:min(idxs.size, s+bs)]

def micro_f1(tp, fp, fn):
    denom = (2*tp + fp + fn)
    return 0.0 if denom == 0 else (2.0*tp)/denom

# Global high-support mask (for per-tag thresholds)
global_support = np.asarray(Y_all.sum(axis=0)).ravel()
hi_mask = (global_support >= 50)
hi_idx = np.where(hi_mask)[0]
hi_pos = {lab: pos for pos, lab in enumerate(hi_idx)}
print('[TrackB1] High-support labels (>=50):', hi_idx.size)

tp_tot = np.zeros_like(THS, dtype=np.int64)
fp_tot = np.zeros_like(THS, dtype=np.int64)
fn_tot = np.zeros_like(THS, dtype=np.int64)
tp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fp_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)
fn_hi = np.zeros((hi_idx.size, THS.size), dtype=np.int64)

fold_oof_paths = []
for fold_num in range(3):
    tr_path = f'fold{fold_num}_train_idx.npy'
    va_path = f'fold{fold_num}_valid_idx.npy'
    kept_path = f'fold{fold_num}_kept_labels.npy'
    assert os.path.exists(tr_path) and os.path.exists(va_path) and os.path.exists(kept_path), f'Missing fold artifacts for fold {fold_num}'
    tr_idx = np.load(tr_path)
    va_idx = np.load(va_path)
    kept_labels = np.load(kept_path)
    print(f"\n[TrackB1][Fold {fold_num}] Train={tr_idx.size}, Valid={va_idx.size}, kept_labels={kept_labels.size}")

    # Fit vectorizers/scaler on train split only
    title_vec = TfidfVectorizer(**title_vec_cfg)
    code_vec  = TfidfVectorizer(**code_vec_cfg)
    body_hash = HashingVectorizer(**body_hash_cfg)
    char_hash = HashingVectorizer(**char_hash_cfg)
    title_vec.fit(title_text.iloc[tr_idx])
    code_vec.fit(code_text.iloc[tr_idx])
    meta_scaler = StandardScaler(with_mean=False)
    for b in batch_indices(tr_idx, BATCH_SIZE):
        meta_scaler.partial_fit(meta_all[b])

    # Transform helper (batch)
    def transform_block(idxs):
        idxs_list = np.asarray(idxs, dtype=np.int64).tolist()
        X_title = title_vec.transform(title_text.iloc[idxs_list]).astype(np.float32).multiply(title_alpha)
        X_body  = body_hash.transform(body_text.iloc[idxs_list]).astype(np.float32)
        X_char  = char_hash.transform((title_text.iloc[idxs_list] + ' ' + body_text.iloc[idxs_list])).astype(np.float32)
        X_code  = code_vec.transform(code_text.iloc[idxs_list]).astype(np.float32)
        X_meta  = sparse.csr_matrix(meta_scaler.transform(meta_all[idxs]), dtype=np.float32)
        return sparse.hstack([X_title, X_body, X_char, X_code, X_meta], format='csr', dtype=np.float32)

    # Build train/valid matrices in chunks to control memory
    X_tr_parts, X_va_parts = [], []
    for b in batch_indices(tr_idx, BATCH_SIZE):
        X_tr_parts.append(transform_block(b))
    for b in batch_indices(va_idx, BATCH_SIZE):
        X_va_parts.append(transform_block(b))
    X_tr = sparse.vstack(X_tr_parts, format='csr', dtype=np.float32)
    X_va = sparse.vstack(X_va_parts, format='csr', dtype=np.float32)
    del X_tr_parts, X_va_parts; gc.collect()

    Y_tr = Y_all[tr_idx][:, kept_labels]
    Y_va = Y_all[va_idx][:, kept_labels]

    # Model: OVR-SGD (logistic loss); keep jobs modest to avoid contention
    base = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4, max_iter=120, tol=1e-3,
                         early_stopping=True, n_iter_no_change=5, learning_rate='optimal',
                         random_state=GLOBAL_SEED)
    clf = OneVsRestClassifier(base, n_jobs=8)
    t_fit = time.time()
    clf.fit(X_tr, Y_tr)
    print(f'[TrackB1][Fold {fold_num}] Fit time: {(time.time()-t_fit)/60:.2f} min')

    # Predict probabilities on validation
    try:
        va_probs = clf.predict_proba(X_va)
    except Exception:
        from scipy.special import expit
        scores = clf.decision_function(X_va)
        va_probs = expit(scores)
    va_probs = va_probs.astype(np.float32, copy=False)

    # Update OOF counts by threshold (streamed by blocks) and save per-fold npz for ensembling
    bs = 4096
    for s in range(0, va_probs.shape[0], bs):
        e = min(va_probs.shape[0], s + bs)
        P = va_probs[s:e]
        Y_true = Y_va[s:e]
        # global counts
        for ti, thr in enumerate(THS):
            pred_bin = (P >= thr).astype(np.uint8)
            tp = int(sparse.csr_matrix(pred_bin, dtype=np.uint8).multiply(Y_true).sum())
            ppos = int(pred_bin.sum())
            tpos = int(Y_true.sum())
            tp_tot[ti] += tp
            fp_tot[ti] += (ppos - tp)
            fn_tot[ti] += (tpos - tp)
        # per-tag for high-support labels that are in kept_labels
        kept_set = set(int(x) for x in kept_labels.tolist())
        hi_in_fold = [lab for lab in hi_idx if lab in kept_set]
        if len(hi_in_fold) > 0:
            col_map = {lab: j for j, lab in enumerate(kept_labels.tolist())}
            cols = np.array([col_map[lab] for lab in hi_in_fold], dtype=np.int32)
            Y_true_hi = Y_true[:, cols]
            P_hi = P[:, cols]
            rows = np.array([hi_pos[lab] for lab in hi_in_fold], dtype=np.int32)
            for ti, thr in enumerate(THS):
                pred_bin_hi = (P_hi >= thr).astype(np.uint8)
                pred_csr = sparse.csr_matrix(pred_bin_hi, dtype=np.uint8)
                tp_vec = np.asarray(pred_csr.multiply(Y_true_hi).sum(axis=0)).ravel().astype(np.int64)
                p_vec = pred_bin_hi.sum(axis=0).astype(np.int64)
                t_vec = np.asarray(Y_true_hi.sum(axis=0)).ravel().astype(np.int64)
                tp_hi[rows, ti] += tp_vec
                fp_hi[rows, ti] += (p_vec - tp_vec)
                fn_hi[rows, ti] += (t_vec - tp_vec)

    # FN addback for labels excluded in this fold (global counts only)
    excluded_labels = np.setdiff1d(np.arange(n_labels), kept_labels)
    fn_excluded = int(Y_all[va_idx][:, excluded_labels].sum()) if excluded_labels.size > 0 else 0
    if fn_excluded > 0:
        fn_tot += fn_excluded
    print(f'[TrackB1][Fold {fold_num}] Excluded FN addback (global): {fn_excluded}')

    # Persist per-fold OOF probs for alignment with Track A
    np.savez_compressed(f'fold{fold_num}_oof_probs_sgd.npz', va_idx=va_idx, kept_labels=kept_labels, probs=va_probs.astype(np.float16))
    fold_oof_paths.append(f'fold{fold_num}_oof_probs_sgd.npz')
    print(f'[TrackB1][Fold {fold_num}] Saved OOF npz (SGD): probs shape {va_probs.shape}')

    # Cleanup
    del X_tr, X_va, Y_tr, Y_va, va_probs, clf, title_vec, code_vec, body_hash, char_hash, meta_scaler
    gc.collect()

# Optimize global threshold and compute per-tag thresholds (high-support only)
f1s = [micro_f1(tp_tot[i], fp_tot[i], fn_tot[i]) for i in range(THS.size)]
best_idx = int(np.argmax(f1s))
best_thr = float(THS[best_idx])
best_f1 = float(f1s[best_idx])
print('[TrackB1][OOF] Global best micro-F1 (SGD) = {:.5f} at thr = {:.3f}'.format(best_f1, best_thr))

per_tag_thr_sgd = np.full(n_labels, best_thr, dtype=np.float32)
for k, lab in enumerate(hi_idx):
    tps = tp_hi[k]; fps = fp_hi[k]; fns = fn_hi[k]
    f1s_lab = np.array([micro_f1(tps[i], fps[i], fns[i]) for i in range(THS.size)], dtype=np.float32)
    j = int(np.argmax(f1s_lab))
    per_tag_thr_sgd[lab] = float(THS[j])
print('[TrackB1][OOF] Per-tag thresholds computed for', hi_idx.size, 'labels (SGD).')

# Persist Track B1 artifacts (distinct filenames)
pd.DataFrame({'label': labels_list}).to_csv('labels.csv', index=False)
np.save('per_tag_thresholds_sgd.npy', per_tag_thr_sgd)
np.save('global_threshold_sgd.npy', np.array([best_thr], dtype=np.float32))
pd.DataFrame({'threshold': THS, 'f1': f1s}).to_csv('oof_global_f1_curve_sgd.csv', index=False)
print('[TrackB1][Artifacts] Saved per_tag_thresholds_sgd.npy, global_threshold_sgd.npy, oof_global_f1_curve_sgd.csv')

print('\n[TrackB1] SUMMARY:')
print(' - Folds OOF files (SGD):', fold_oof_paths)
print(' - Best global thr (SGD):', round(best_thr,3), 'OOF micro-F1:', round(best_f1,5))
print('[TrackB1] DONE. Ready to blend with Track A after Track A FIX completes all folds.')
