# Plan: chaii-hindi-and-tamil-question-answering

Objectives:
- Build a robust extractive QA pipeline for Hindi/Tamil; optimize the competition's word-level Jaccard score, aiming for medal range.
- Establish a trustworthy CV that mirrors the test set; iterate from a fast baseline to strong transformer fine-tuning; ensemble if time permits.

Process checklist:
1) Environment
- Verify GPU (nvidia-smi). Install PyTorch CUDA 12.1 stack, transformers, accelerate, datasets, sentencepiece.
- Add progress logging and elapsed-time prints in all loops.
- Note: Current GPU V100 16GB → adjust batch sizes accordingly.

2) Data & EDA
- Load train.csv/test.csv; inspect cols (id, context, question, language, answer_text, answer_start if present).
- Basic stats: language distribution, lengths, answer coverage in context, Unicode ranges, punctuation/diacritics.
- Validate that answers are substrings of context; handle edge cases (multiple occurrences, whitespace normalization).

3) Metric & CV
- Implement competition word-level Jaccard scorer; use it (not HF EM/F1) for model selection/OOF.
- Build normalized_context_hash: NFKC, remove zero-width (ZWJ/ZWNJ/BOM), collapse spaces, strip.
- CV: 5-fold StratifiedGroupKFold, stratified on language (passed as y) with groups=normalized_context_hash (optional extra stratification by question-length bins).
- Save and reuse folds.csv with fixed seed; no leakage (fit transforms per fold).
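
The no-leakage requirement above can be checked directly: every normalized-context hash should land in exactly one fold. The following is a minimal sketch under that assumption; `find_leaky_groups` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from collections import defaultdict

def find_leaky_groups(group_keys, fold_ids):
    """Return {group_key: sorted folds} for groups that span more than one fold."""
    folds_by_group = defaultdict(set)
    for key, fold in zip(group_keys, fold_ids):
        folds_by_group[key].add(fold)
    return {k: sorted(v) for k, v in folds_by_group.items() if len(v) > 1}

# Toy data: hash "b" is assigned to folds 1 and 2, so it would leak.
groups = ["a", "a", "b", "b", "c"]
folds = [0, 0, 1, 2, 1]
leaks = find_leaky_groups(groups, folds)
print(leaks)
```

On real data the two arguments would be the context-hash column and the fold column; an empty result means no context is shared between folds.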

4) Baselines
- Rule baseline: exact substring span finder; quick CV to sanity check pipeline.
- HF baseline: xlm-roberta-base for quick iteration; target xlm-roberta-large as workhorse; add MuRIL (google/muril-*-cased) for diversity.
- Convert data to SQuAD format; use_fast=True tokenizer with return_offsets_mapping=True.
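
One possible reading of the "exact substring span finder" is sketched below: locate every occurrence of the answer in the context and, when there are several, pick the one closest to the annotated start. The helper names `find_all` and `pick_span` and the nearest-occurrence tie-break are assumptions made for illustration, not something the plan specifies.

```python
def find_all(context: str, answer: str):
    """Return every character offset at which `answer` occurs in `context`."""
    if not answer:
        return []
    hits, pos = [], context.find(answer)
    while pos != -1:
        hits.append(pos)
        pos = context.find(answer, pos + 1)
    return hits

def pick_span(context: str, answer: str, hint_start=None):
    """Return (start, end) of the occurrence closest to hint_start, or None."""
    hits = find_all(context, answer)
    if not hits:
        return None
    if hint_start is None:
        start = hits[0]
    else:
        start = min(hits, key=lambda h: abs(h - hint_start))
    return start, start + len(answer)

# The answer occurs twice; the hint selects the second occurrence.
context = "दिल्ली भारत की राजधानी है। दिल्ली एक बड़ा शहर है।"
answer = "दिल्ली"
hits = find_all(context, answer)
span = pick_span(context, answer, hint_start=hits[-1])
print(hits, span, context[span[0]:span[1]])
```

Offsets here are Python code-point indices, which is also how `answer_start` is consumed by the preprocessing code later in this notebook.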

5) Training
- Mixed precision (fp16), gradient accumulation as needed; V100 16GB typical: seq_len=384, stride=128; bs ~8 (xlm-r-large) / 16–32 (base).
- Optim: AdamW, lr 2e-5–3e-5, warmup ~10%, weight_decay 0.01, max_grad_norm 1.0; optional LLRD ~0.95.
- Epochs: 2–3 (large), 3–4 (base). Save per-fold best by eval Jaccard; store OOF start/end logits and predictions.
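
To make the optional LLRD setting concrete, here is a small sketch that computes per-block learning rates for a 24-layer encoder such as xlm-roberta-large. The convention used (head at the base rate, each lower layer multiplied by the decay once more, embeddings lowest) is one common choice and an assumption here; `llrd_learning_rates` is a hypothetical helper.

```python
def llrd_learning_rates(base_lr: float, num_layers: int, decay: float = 0.95):
    """Map block name -> learning rate, decaying from the top layer downward."""
    rates = {"head": base_lr}
    for layer in range(num_layers):
        depth_from_top = num_layers - 1 - layer
        rates[f"layer_{layer}"] = base_lr * decay ** depth_from_top
    # Embeddings sit below layer 0, so they get one more decay step.
    rates["embeddings"] = base_lr * decay ** num_layers
    return rates

rates = llrd_learning_rates(base_lr=2e-5, num_layers=24, decay=0.95)
for name in ("embeddings", "layer_0", "layer_23", "head"):
    print(f"{name:>10}: {rates[name]:.2e}")
```

In practice these values would be attached to optimizer parameter groups (AdamW accepts a per-group `lr`), with parameters assigned to groups by matching their names to the layer index.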

6) Inference
- Sliding windows with return_overflowing_tokens=True; aggregate spans across features per example.
- n_best_size 20–30; max_answer_length 40–64.
- Post-process: normalization + realignment to original context; fallback heuristic if needed.
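
The realignment step can be done on character offsets so that the final prediction is still an exact slice of the original context. The sketch below shrinks a predicted span inward past whitespace and outer punctuation; `trim_span` and the `TRIM_CHARS` set (modeled on the outer-punctuation pattern used in this notebook) are illustrative assumptions.

```python
TRIM_CHARS = set("\"'“”‘’()[]{}|।,:;.!?-")

def trim_span(context: str, start: int, end: int):
    """Shrink the half-open span [start, end) past whitespace and outer punctuation."""
    start = max(0, start)
    end = min(len(context), end)
    while start < end and (context[start].isspace() or context[start] in TRIM_CHARS):
        start += 1
    while end > start and (context[end - 1].isspace() or context[end - 1] in TRIM_CHARS):
        end -= 1
    return start, end

context = "उत्तर: (नई दिल्ली)। अगला वाक्य"
raw_start = context.find("(")        # predicted span starts on the bracket
raw_end = context.find("।") + 1      # and ends on the danda
s, e = trim_span(context, raw_start, raw_end)
print(repr(context[raw_start:raw_end]), "->", repr(context[s:e]))
```

If the whole span consists of trimmable characters, the function returns an empty span (`start == end`), which is a natural place to trigger the fallback heuristic.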

7) Ensembling
- Average start/end logits across folds/seeds/models; for mixed tokenizers, optionally aggregate to char-level evidence before decoding.
- Diverse models: xlm-roberta-large/base + muril-base/large; optional mdeberta-v3-base, IndicBERT v2.
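
For models with different tokenizers, token positions do not line up, so one option is to project start/end scores onto character positions before averaging. The sketch below is a toy, pure-Python illustration of that idea; the function names, the floor value for uncovered positions, and the max-over-tokens projection are all assumptions, and an ASCII context is used so the offsets are easy to verify.

```python
def to_char_scores(token_scores, offsets, n_chars, use_end=False, floor=-1e4):
    """Project token scores onto character positions (max over tokens)."""
    char_scores = [floor] * (n_chars + 1)
    for score, off in zip(token_scores, offsets):
        if off is None:  # non-context token
            continue
        pos = off[1] if use_end else off[0]
        char_scores[pos] = max(char_scores[pos], score)
    return char_scores

def average(rows):
    """Element-wise mean of equally long score lists."""
    return [sum(col) / len(col) for col in zip(*rows)]

def decode(start_scores, end_scores, max_chars=50):
    """Return the (start, end) character span with the highest summed score."""
    best = (float("-inf"), 0, 0)
    for cs, s in enumerate(start_scores):
        for ce in range(cs + 1, min(len(end_scores), cs + max_chars + 1)):
            score = s + end_scores[ce]
            if score > best[0]:
                best = (score, cs, ce)
    return best[1], best[2]

context = "New Delhi is big"
n = len(context)
# Model A splits "Delhi" as one token; model B splits it into "Del" + "hi".
off_a = [(0, 3), (4, 9), (10, 12), (13, 16)]
off_b = [(0, 3), (4, 7), (7, 9), (10, 12), (13, 16)]
start_a, end_a = [1.0, 5.0, 0.0, 0.0], [0.0, 6.0, 0.0, 1.0]
start_b, end_b = [0.5, 4.0, 0.0, 0.0, 0.0], [0.0, 1.0, 5.0, 0.0, 0.5]

start = average([to_char_scores(start_a, off_a, n), to_char_scores(start_b, off_b, n)])
end = average([to_char_scores(end_a, off_a, n, use_end=True),
               to_char_scores(end_b, off_b, n, use_end=True)])
cs, ce = decode(start, end)
print(context[cs:ce])
```

Positions that only one model can start or end on are penalized by the floor, so the decoded span favors boundaries that both tokenizers agree on.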

8) Normalization & Error analysis
- normalize(text): NFKC; remove ZWJ/ZWNJ/BOM; collapse spaces; strip; gentle trim of outer punctuation/quotes/danda; keep diacritics.
- Apply normalize() for scoring and predictions (not for training offsets).
- Bucket errors by language, lengths; adjust stride/max_answer_len; ensure Unicode hygiene.

Milestones with Expert Review:
- After this plan.
- After EDA + CV definition and folds.csv creation.
- After baseline heuristic.
- After first transformer OOF.
- After any ensemble/tuning.

Next:
- Install torch cu121 + transformers stack; verify CUDA in torch.
- Load data; implement normalize(), Jaccard scorer, and StratifiedGroupKFold by context hash; save folds.csv.
- Quick rule baseline; then train xlm-roberta-base (384/128) to establish OOF; iterate to xlm-roberta-large.
- Request expert review on CV/OOF before heavier training/ensembles.

In [None]:
import subprocess, sys, os, time, platform
from pathlib import Path

print('Python:', sys.version)
print('Platform:', platform.platform())
print('Checking GPU with nvidia-smi...')
t0 = time.time()
subprocess.run(['bash','-lc','nvidia-smi || true'], check=False)
print('Elapsed: %.2fs' % (time.time()-t0))

In [None]:
import sys, subprocess, os, shutil, time
from pathlib import Path

def pip(*args):
    print('>', *args, flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

print('Starting PyTorch cu121 + NLP stack installation...')
t0 = time.time()

# 0) Uninstall any existing torch stack to avoid conflicts
for pkg in ('torch','torchvision','torchaudio'):
    try:
        subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', pkg], check=False)
    except Exception as e:
        print('Uninstall error for', pkg, e)

# Clean possible shadow dirs (idempotent)
for d in (
    '/app/.pip-target/torch',
    '/app/.pip-target/torchvision',
    '/app/.pip-target/torchaudio',
    '/app/.pip-target/torch-2.8.0.dist-info',
    '/app/.pip-target/torch-2.4.1.dist-info',
    '/app/.pip-target/torchvision-0.23.0.dist-info',
    '/app/.pip-target/torchvision-0.19.1.dist-info',
    '/app/.pip-target/torchaudio-2.8.0.dist-info',
    '/app/.pip-target/torchaudio-2.4.1.dist-info',
    '/app/.pip-target/torchgen',
    '/app/.pip-target/functorch',
):
    if os.path.exists(d):
        print('Removing', d)
        shutil.rmtree(d, ignore_errors=True)

# 1) Install the EXACT cu121 torch stack
pip('install',
    '--index-url', 'https://download.pytorch.org/whl/cu121',
    '--extra-index-url', 'https://pypi.org/simple',
    'torch==2.4.1', 'torchvision==0.19.1', 'torchaudio==2.4.1')

# 2) Freeze torch versions
Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')

# 3) Install NLP deps honoring constraints (avoid upgrading torch)
pip('install', '-c', 'constraints.txt',
    'transformers==4.44.2', 'accelerate==0.34.2',
    'datasets==2.21.0', 'evaluate==0.4.2',
    'sentencepiece', 'scikit-learn',
    '--upgrade-strategy', 'only-if-needed')

# 4) Sanity gate
import torch
print('torch:', torch.__version__, 'built CUDA:', getattr(torch.version, 'cuda', None))
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f'Wrong CUDA build: {torch.version.cuda}'
assert torch.cuda.is_available(), 'CUDA not available after install'
print('Install complete in %.1fs' % (time.time()-t0))

In [None]:
import pandas as pd, numpy as np, unicodedata, re, hashlib, os, time
from sklearn.model_selection import StratifiedGroupKFold

# Normalization helpers per expert guidance
ZW_CHARS = ''.join([
    '\u200c',  # ZWNJ
    '\u200d',  # ZWJ
    '\ufeff',  # BOM
])
ZW_RE = re.compile(f'[{re.escape(ZW_CHARS)}]')
WS_RE = re.compile(r'\s+')
OUTER_PUNCT_RE = re.compile(r'^[\s\"\'\“\”\‘\’\(\)\[\]\{\}\|।,:;.!?-]+|[\s\"\'\“\”\‘\’\(\)\[\]\{\}\|।,:;.!?-]+$')

def normalize_text(s: str, trim_outer_punct: bool = True) -> str:
    if s is None:
        return ''
    s = unicodedata.normalize('NFKC', str(s))
    s = ZW_RE.sub('', s)
    s = WS_RE.sub(' ', s).strip()
    if trim_outer_punct and s:
        s = OUTER_PUNCT_RE.sub('', s).strip()
    return s

def context_group_hash(s: str) -> str:
    norm = normalize_text(s, trim_outer_punct=False)
    return hashlib.md5(norm.encode('utf-8')).hexdigest()

def jaccard_word_level(a: str, b: str) -> float:
    a_n = normalize_text(a)
    b_n = normalize_text(b)
    if not a_n and not b_n:
        return 1.0
    a_set = set(a_n.split())
    b_set = set(b_n.split())
    if not a_set and not b_set:
        return 1.0
    inter = len(a_set & b_set)
    union = len(a_set | b_set)
    return inter / union if union else 0.0

# Load data
t0 = time.time()
train_path = 'train.csv'
assert os.path.exists(train_path), 'train.csv not found'
df = pd.read_csv(train_path)
print('Loaded train:', df.shape, 'cols:', list(df.columns))

# Basic sanity checks
required_cols = ['id','context','question','answer_text']
for c in required_cols:
    assert c in df.columns, f'Missing column: {c}'
lang_col = 'language' if 'language' in df.columns else None
if lang_col is None:
    # Fallback heuristic: mark unknown
    df['language'] = 'unknown'
    lang_col = 'language'

# Build normalized context hash (group key)
df['norm_ctx_hash'] = df['context'].astype(str).map(context_group_hash)

# Quick EDA: answer substring coverage
ans_in_ctx = (df.apply(lambda r: normalize_text(str(r['answer_text'])) in normalize_text(str(r['context'])), axis=1)).mean()
print(f'Answer in normalized context coverage: {ans_in_ctx:.3f}')

# Create StratifiedGroupKFold by language with groups = norm_ctx_hash
n_splits = 5
skf = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=2025)
y_strat = df[lang_col].astype(str).values
groups = df['norm_ctx_hash'].values
folds = np.full(len(df), -1, dtype=int)
for fold, (trn_idx, val_idx) in enumerate(skf.split(df, y_strat, groups)):
    folds[val_idx] = fold
assert (folds >= 0).all(), 'Unassigned fold indices present'
df['fold'] = folds

# Save folds.csv
folds_df = df[['id','fold']].copy()
folds_df.to_csv('folds.csv', index=False)
print('Saved folds.csv with shape:', folds_df.shape)

# Report per-fold counts and language distribution
for f in range(n_splits):
    sub = df[df['fold']==f]
    print(f'Fold {f}: n={len(sub)} | lang dist:', sub[lang_col].value_counts(normalize=True).to_dict())

print('Done in %.2fs' % (time.time()-t0))

In [None]:
import os, math, numpy as np, pandas as pd, time, collections, json, random
from datasets import Dataset
from transformers import AutoTokenizer

# Constants for QA preprocessing (retokenize to match model checkpoint)
MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
MAX_LENGTH = 384
DOC_STRIDE = 192
MAX_ANSWER_LEN = 64
NEG_RATIO = 2  # per example, keep at most NEG_RATIO negative windows per positive window; all positives are kept
SEED = 2025
random.seed(SEED); np.random.seed(SEED)

print('Loading tokenizer:', MODEL_NAME)
tok_t0 = time.time()
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'
print('Tokenizer loaded in %.2fs; pad_on_right=%s' % (time.time()-tok_t0, pad_on_right))

# Load data and folds
df = pd.read_csv('train.csv')
folds_df = pd.read_csv('folds.csv')
df = df.merge(folds_df, on='id', how='left')
assert df['fold'].notna().all(), 'Missing fold assignments'

# Select a fold for smoke-prep (fold 0 val) — full 5-fold will reuse same pipeline
FOLD = 0
train_df = df[df['fold'] != FOLD].reset_index(drop=True)
valid_df = df[df['fold'] == FOLD].reset_index(drop=True)
print(f'Fold {FOLD} split: train {len(train_df)} | valid {len(valid_df)}')

# Build HF Datasets
train_ds = Dataset.from_pandas(train_df[['id','question','context','answer_text','answer_start','language']])
valid_ds = Dataset.from_pandas(valid_df[['id','question','context','answer_text','answer_start','language']])

# Coverage-based labeling per expert guidance
def prepare_train_features(examples):
    questions = [q.strip() for q in examples['question']]
    contexts = examples['context']

    tokenized = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )

    sample_mapping = tokenized.pop('overflow_to_sample_mapping')
    offset_mapping = tokenized.pop('offset_mapping')
    input_ids = tokenized['input_ids']

    start_positions, end_positions = [], []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_mapping[i]
        context_text = contexts[sample_idx]
        ans_text = str(examples['answer_text'][sample_idx])
        ans_start = int(examples['answer_start'][sample_idx])
        ans_end = ans_start + len(ans_text)  # exclusive

        # Trim whitespace at boundaries to stabilize mapping
        while ans_start < ans_end and ans_start < len(context_text) and context_text[ans_start].isspace():
            ans_start += 1
        while ans_end > ans_start and ans_end-1 < len(context_text) and context_text[ans_end-1].isspace():
            ans_end -= 1
        if ans_start >= ans_end:
            ans_end = ans_start  # degenerate, will force CLS below if no coverage

        seq_ids = tokenized.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0

        # Find context token span [ctx_start_tok, ctx_end_tok]
        context_id = 1 if pad_on_right else 0
        try:
            ctx_start_tok = seq_ids.index(context_id)
            ctx_end_tok = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue

        # Find first token covering ans_start and last token covering ans_end-1 (inclusive)
        start_tok, end_tok = None, None
        target_end_inclusive = ans_end - 1

        for t in range(ctx_start_tok, ctx_end_tok + 1):
            o = offsets[t]
            if o is None:
                continue
            tok_s, tok_e = o  # [tok_s, tok_e)
            if tok_s is None or tok_e is None:
                continue
            if start_tok is None and tok_s <= ans_start < tok_e:
                start_tok = t
            if tok_s <= target_end_inclusive < tok_e:
                end_tok = t

        if start_tok is not None and end_tok is not None and start_tok <= end_tok:
            start_positions.append(start_tok)
            end_positions.append(end_tok)
        else:
            start_positions.append(cls_index)
            end_positions.append(cls_index)

    # Keep mapping for downstream sampling
    tokenized['start_positions'] = start_positions
    tokenized['end_positions'] = end_positions
    tokenized['overflow_to_sample_mapping'] = sample_mapping
    return tokenized

def prepare_validation_features(examples):
    questions = [q.strip() for q in examples['question']]
    contexts = examples['context']
    tokenized = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    sample_mapping = tokenized.pop('overflow_to_sample_mapping')
    tokenized['example_id'] = []
    new_offsets = []
    for i, offsets in enumerate(tokenized['offset_mapping']):
        sample_idx = sample_mapping[i]
        tokenized['example_id'].append(str(examples['id'][sample_idx]))
        sequence_ids = tokenized.sequence_ids(i)
        # Set offsets to None for non-context tokens
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if sequence_ids[k] == context_id else None for k, o in enumerate(offsets)])
    tokenized['offset_mapping'] = new_offsets
    return tokenized

print('Tokenizing train with sliding windows...')
t0 = time.time()
train_tokenized = train_ds.map(prepare_train_features, batched=True, remove_columns=train_ds.column_names, desc='train_tokenize')
print('Train features (pre-sample):', train_tokenized.num_rows, 'in %.2fs' % (time.time()-t0))

# Negative window sampling: keep all positives; for negatives per example, keep up to NEG_RATIO x positives
print('Applying negative window sampling ...')
smap = np.array(train_tokenized['overflow_to_sample_mapping'])
sp = np.array(train_tokenized['start_positions'])
is_pos = (sp != 0)  # CLS assumed at position 0 for XLM-R
keep_indices = []
by_example = collections.defaultdict(list)
for idx, ex in enumerate(smap):
    by_example[ex].append(idx)
for ex, idxs in by_example.items():
    idxs = np.array(idxs)
    pos_idxs = idxs[is_pos[idxs]]
    neg_idxs = idxs[~is_pos[idxs]]
    keep_indices.extend(pos_idxs.tolist())
    if len(pos_idxs) == 0:
        # Fallback: keep a small cap of negatives to avoid dropping the example entirely
        sel = neg_idxs[:min(4, len(neg_idxs))]
        keep_indices.extend(sel.tolist())
    else:
        cap = min(len(neg_idxs), NEG_RATIO * len(pos_idxs))
        if cap > 0:
            sel = neg_idxs[np.random.permutation(len(neg_idxs))[:cap]]
            keep_indices.extend(sel.tolist())
keep_indices = np.array(sorted(set(keep_indices)))
before_n = train_tokenized.num_rows
train_tokenized = train_tokenized.select(keep_indices.tolist())
after_n = train_tokenized.num_rows
print(f'Negative sampling reduced features: {before_n} -> {after_n} ({after_n/before_n:.2%})')

print('Tokenizing valid with sliding windows...')
t0 = time.time()
valid_tokenized = valid_ds.map(prepare_validation_features, batched=True, remove_columns=valid_ds.column_names, desc='valid_tokenize')
print('Valid features:', valid_tokenized.num_rows, 'in %.2fs' % (time.time()-t0))

# Quick sanity: ensure at least some start/end positions not equal to CLS
if 'start_positions' in train_tokenized.features:
    sp2 = np.array(train_tokenized['start_positions'])
    cls_pos = 0  # typically CLS is position 0 in XLM-R
    ratio_cls = float((sp2 == cls_pos).mean())
    print('Train start_positions CLS ratio (post-sample):', round(ratio_cls, 4))

print('Prep complete. Ready for model training (sanity next).')

In [None]:
import math, numpy as np, torch, time, os
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from collections import defaultdict
from torch.nn.functional import log_softmax

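# Environment tweaks to reduce CUDA OOM
os.environ.setdefault('TOKENIZERS_PARALLELISM', 'false')
# Allocator settings are safest to set before the first CUDA allocation in this process
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True,max_split_size_mb:64')
# TF32 exists only on Ampere and newer GPUs; on the V100 noted in the plan the next two settings should have no effect
torch.backends.cuda.matmul.allow_tf32 = True
try:
    torch.set_float32_matmul_precision('high')
except Exception:
    pass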

# Sanity-check training on fold 0 with QA-pretrained LARGE checkpoint
MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
EPOCHS = 2
LR = 2e-5
TRAIN_BS = 6  # per advice for V100 16GB
EVAL_BS = 16
GRAD_ACCUM = 3  # effective batch 18
SEED = 2025
N_BEST = 25
MAX_ANSWER_LEN = 64

def set_seed_all(seed: int):
    import random, numpy as _np, torch as _torch
    random.seed(seed); _np.random.seed(seed); _torch.manual_seed(seed)
    if _torch.cuda.is_available():
        _torch.cuda.manual_seed_all(seed)

set_seed_all(SEED)
torch.cuda.empty_cache()

# Reuse tokenizer, train_tokenized, valid_tokenized from previous cell
assert 'train_tokenized' in globals() and 'valid_tokenized' in globals(), 'Run preprocessing cell first'

# Prepare torch datasets
train_tok = train_tokenized.remove_columns([c for c in train_tokenized.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
valid_tok_for_pred = valid_tokenized  # keep example_id and offset_mapping in this copy for decoding
valid_tok = valid_tokenized.remove_columns([c for c in valid_tokenized.column_names if c not in ['input_ids','attention_mask']])
train_tok.set_format(type='torch')
valid_tok.set_format(type='torch')

print('Loading model:', MODEL_NAME)
m_t0 = time.time()
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
try:
    model.gradient_checkpointing_enable()
except Exception:
    pass
print('Model loaded in %.2fs' % (time.time()-m_t0))

collator = DataCollatorWithPadding(tokenizer)

args = TrainingArguments(
    output_dir='oof_xlmr_large_fold0_sanity',
    num_train_epochs=EPOCHS,
    learning_rate=LR,
    per_device_train_batch_size=TRAIN_BS,
    per_device_eval_batch_size=EVAL_BS,
    gradient_accumulation_steps=GRAD_ACCUM,
    fp16=True,
    logging_steps=100,
    save_strategy='no',
    eval_strategy='no',  # `evaluation_strategy` is the deprecated spelling in the pinned transformers 4.44.2
    seed=SEED,
    dataloader_num_workers=2,
    report_to=[],
    gradient_checkpointing=True,
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=1.0
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=collator,
)

print('Starting training ...')
t0 = time.time()
trainer.train()
print('Train done in %.1fs' % (time.time()-t0))
torch.cuda.empty_cache()

# Predict on validation features
print('Predicting on validation features ...')
pred = trainer.predict(valid_tok)
start_logits, end_logits = pred.predictions

# Decoding with log-softmax, mask non-context, aggregate best per example
s_logp = log_softmax(torch.tensor(start_logits), dim=-1).numpy()
e_logp = log_softmax(torch.tensor(end_logits), dim=-1).numpy()

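# Build the id -> context lookup once instead of scanning valid_df for every feature
id_to_context = dict(zip(valid_df['id'].astype(str), valid_df['context']))

example_to_best = {}  # example_id (str) -> (score, text)
for i in range(len(valid_tok_for_pred)):
    feat = valid_tok_for_pred[i]  # index the dataset once per feature
    example_id = feat['example_id']
    offsets = feat['offset_mapping']
    context_text = id_to_context[example_id]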

    s = s_logp[i]; e = e_logp[i]
    mask = np.array([o is not None for o in offsets], dtype=bool)
    s_masked = np.where(mask, s, -1e9)
    e_masked = np.where(mask, e, -1e9)

    top_starts = np.argsort(s_masked)[-N_BEST:]
    top_ends = np.argsort(e_masked)[-N_BEST:]

    best = (float('-inf'), '')
    for si in top_starts:
        for ei in top_ends:
            if ei < si:
                continue
            if (ei - si + 1) > MAX_ANSWER_LEN:
                continue
            if offsets[si] is None or offsets[ei] is None:
                continue
            cs, _ = offsets[si]; _, ce = offsets[ei]
            if cs is None or ce is None or ce <= cs:
                continue
            cand_text = context_text[cs:ce]
            score = float(s_masked[si] + e_masked[ei])
            if score > best[0]:
                best = (score, cand_text)

    if best[0] == float('-inf'):
        best = (-1e9, '')

    prev = example_to_best.get(example_id, (float('-inf'), ''))
    if best[0] > prev[0]:
        example_to_best[example_id] = best

# Build predictions and score Jaccard
pred_texts = {k: v[1] for k, v in example_to_best.items()}
y_true = valid_df.assign(id_str=valid_df['id'].astype(str)).set_index('id_str')['answer_text'].to_dict()

assert 'jaccard_word_level' in globals() and 'normalize_text' in globals(), 'Run CV/normalization cell first'
scores = []
for ex_id, gt in y_true.items():
    pr = pred_texts.get(ex_id, '')
    scores.append(jaccard_word_level(gt, pr))
oof_jaccard = float(np.mean(scores)) if scores else 0.0
print(f'Fold 0 OOF Jaccard: {oof_jaccard:.4f} (N={len(scores)})')

for k in list(pred_texts.keys())[:5]:
    print('ID', k, '| pred:', normalize_text(pred_texts[k])[:80], '| gt:', normalize_text(y_true.get(k, ''))[:80])

In [None]:
# 5-fold training with deepset/xlm-roberta-large-squad2, negative sampling, OOF Jaccard logging + TEST LOGIT SAVING
import os, time, numpy as np, pandas as pd, collections, random, math, json, sys
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from torch.nn.functional import log_softmax

SEED = 2025
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)

MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
MAX_LENGTH = 384
DOC_STRIDE = 192
MAX_ANSWER_LEN = 64
NEG_RATIO = 2
EPOCHS = 3
LR = 2e-5
TRAIN_BS = 4  # reduced from 6 to avoid OOM
EVAL_BS = 12  # reduced from 16
GRAD_ACCUM = 5  # effective batch 4 * 5 = 20
N_BEST = 25

print('Retokenizing with', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

full_df = pd.read_csv('train.csv').merge(pd.read_csv('folds.csv'), on='id', how='left')
assert full_df['fold'].notna().all()
test_df = pd.read_csv('test.csv')

def prepare_train_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions = [], []
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace():
            a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace():
            a_end -= 1
        if a_start >= a_end:
            a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None:
                continue
            ts, te = o
            if ts is None or te is None:
                continue
            if st_tok is None and ts <= a_start < te:
                st_tok = t
            if ts <= tgt_end_incl < te:
                en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
    tok['start_positions'] = start_positions; tok['end_positions'] = end_positions; tok['overflow_to_sample_mapping'] = smap
    return tok

def prepare_pred_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

# Pre-tokenize TEST once; feature order is deterministic
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])
test_tok_all = test_ds.map(prepare_pred_features_fn, batched=True, remove_columns=test_ds.column_names, desc='test_tokenize_allfolds')
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')

# Persist example_id and offsets once for downstream char-level aggregation
os.makedirs('xlmr_large_test_logits', exist_ok=True)
with open('xlmr_large_test_logits/test_example_id.json', 'w', encoding='utf-8') as f:
    json.dump(list(test_tok_all['example_id']), f, ensure_ascii=False)
np.save('xlmr_large_test_logits/test_offset_mapping.npy', np.array(test_tok_all['offset_mapping'], dtype=object), allow_pickle=True)

all_oof_scores = []; oof_rows = []; test_start_list = []; test_end_list = []
t_global = time.time()
for FOLD in range(5):
    t_fold = time.time()
    print(f'\n===== Fold {FOLD} =====')
    tr_df = full_df[full_df['fold'] != FOLD].reset_index(drop=True)
    va_df = full_df[full_df['fold'] == FOLD].reset_index(drop=True)
    tr_ds = Dataset.from_pandas(tr_df[['id','question','context','answer_text','answer_start','language']])
    va_ds = Dataset.from_pandas(va_df[['id','question','context','answer_text','answer_start','language']])
    t0 = time.time()
    tr_tok = tr_ds.map(prepare_train_features_fn, batched=True, remove_columns=tr_ds.column_names, desc=f'train_tokenize_f{FOLD}')
    va_tok_all = va_ds.map(prepare_pred_features_fn, batched=True, remove_columns=va_ds.column_names, desc=f'valid_tokenize_f{FOLD}')
    print(f'Fold {FOLD}: train feats pre-sample={tr_tok.num_rows} | valid feats={va_tok_all.num_rows} | tok_time={time.time()-t0:.1f}s')
    # Negative sampling
    smap = np.array(tr_tok['overflow_to_sample_mapping']); sp = np.array(tr_tok['start_positions']); is_pos = (sp != 0)
    keep = []; by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(smap):
        by_ex[ex].append(idx)
    for ex, idxs in by_ex.items():
        idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
        else:
            cap = min(len(neg), NEG_RATIO*len(pos))
            if cap > 0:
                sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
    keep = np.array(sorted(set(keep))); before = tr_tok.num_rows
    tr_tok = tr_tok.select(keep.tolist()); after = tr_tok.num_rows
    print(f'Fold {FOLD}: neg-sample {before}->{after} ({after/before:.2%})')
    # Format for Trainer
    tr_ds_torch = tr_tok.remove_columns([c for c in tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
    tr_ds_torch.set_format(type='torch')
    va_ds_pred = va_tok_all.remove_columns([c for c in va_tok_all.column_names if c not in ['input_ids','attention_mask']])
    va_ds_pred.set_format(type='torch')
    # Model/Trainer
    torch.cuda.empty_cache()
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    try:
        model.gradient_checkpointing_enable()
    except Exception:
        pass
    args = TrainingArguments(
        output_dir=f'xlmr_large_f{FOLD}',
        num_train_epochs=EPOCHS,
        learning_rate=LR,
        per_device_train_batch_size=TRAIN_BS,
        per_device_eval_batch_size=EVAL_BS,
        gradient_accumulation_steps=GRAD_ACCUM,
        fp16=True,
        logging_steps=100,
        save_strategy='no',
        eval_strategy='no',  # `evaluation_strategy` is the deprecated spelling in the pinned transformers 4.44.2
        seed=SEED,
        dataloader_num_workers=2,
        report_to=[],
        gradient_checkpointing=True,
        warmup_ratio=0.1,
        weight_decay=0.01,
        max_grad_norm=1.0
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds_torch,
        eval_dataset=None,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    print(f'Fold {FOLD}: training...')
    torch.cuda.empty_cache()
    t1 = time.time()
    trainer.train()
    print(f'Fold {FOLD}: train_done in {time.time()-t1:.1f}s | total_elapsed={time.time()-t_fold:.1f}s')
    # Predict on valid
    print(f'Fold {FOLD}: predicting valid...')
    pred = trainer.predict(va_ds_pred)
    s_logp = log_softmax(torch.tensor(pred.predictions[0]), dim=-1).numpy()
    e_logp = log_softmax(torch.tensor(pred.predictions[1]), dim=-1).numpy()
    # Decode
    ctx_by_id = dict(zip(va_df['id'].astype(str), va_df['context']))  # one lookup table per fold
    ex_best = {}
    for i in range(len(va_tok_all)):
        feat = va_tok_all[i]  # index the dataset once per feature
        ex_id = feat['example_id']
        offsets = feat['offset_mapping']
        ctx = ctx_by_id[ex_id]
        s = s_logp[i]; e = e_logp[i]
        mask = np.array([o is not None for o in offsets], dtype=bool)
        s_m = np.where(mask, s, -1e9); e_m = np.where(mask, e, -1e9)
        top_s = np.argsort(s_m)[-N_BEST:]; top_e = np.argsort(e_m)[-N_BEST:]
        best = (float('-inf'), '')
        for si in top_s:
            for ei in top_e:
                if ei < si: continue
                if (ei - si + 1) > MAX_ANSWER_LEN: continue
                if offsets[si] is None or offsets[ei] is None: continue
                cs,_ = offsets[si]; _,ce = offsets[ei]
                if cs is None or ce is None or ce <= cs: continue
                cand = ctx[cs:ce]; score = float(s_m[si] + e_m[ei])
                if score > best[0]: best = (score, cand)
        if best[0] == float('-inf'): best = (-1e9, '')
        prev = ex_best.get(ex_id, (float('-inf'), ''))
        if best[0] > prev[0]: ex_best[ex_id] = best
    preds = {k:v[1] for k,v in ex_best.items()}
    y_true = va_df.assign(id_str=va_df['id'].astype(str)).set_index('id_str')['answer_text'].to_dict()
    fold_scores = [jaccard_word_level(y_true[k], preds.get(k, '')) for k in y_true.keys()]
    f_j = float(np.mean(fold_scores)) if fold_scores else 0.0
    all_oof_scores.append(f_j)
    print(f'Fold {FOLD} Jaccard: {f_j:.4f} | elapsed_fold={time.time()-t_fold:.1f}s')
    for k,v in preds.items():
        oof_rows.append({'id': k, 'pred_text': v, 'gt': y_true.get(k, ''), 'fold': FOLD})

    # Predict TEST and save per-fold logits
    print(f'Fold {FOLD}: predicting TEST and saving logits...')
    te_pred = trainer.predict(te_ds_pred)
    te_start = te_pred.predictions[0]
    te_end = te_pred.predictions[1]
    # Save raw logits per fold
    np.save(f'xlmr_large_test_logits/test_start_logits_f{FOLD}.npy', te_start)
    np.save(f'xlmr_large_test_logits/test_end_logits_f{FOLD}.npy', te_end)
    # Keep in memory for averaging, use logit space (not softmax)
    test_start_list.append(te_start)
    test_end_list.append(te_end)

    # free
    del model, trainer, tr_ds_torch, va_ds_pred, tr_tok, va_tok_all, pred, s_logp, e_logp, te_pred
    torch.cuda.empty_cache()

# Save OOF preds
oof_df = pd.DataFrame(oof_rows)
oof_df.to_csv('oof_preds_xlmr_large.csv', index=False)
mean_oof = float(np.mean(all_oof_scores)) if all_oof_scores else 0.0
print('5-fold OOF Jaccard (mean across folds):', round(mean_oof, 4))

# Average TEST logits across folds and persist for downstream ensembling
print('Averaging TEST logits across folds...')
start_stack = np.stack(test_start_list, axis=0)  # [fold, n_feat, L]
end_stack = np.stack(test_end_list, axis=0)
start_avg = start_stack.mean(axis=0)
end_avg = end_stack.mean(axis=0)
np.savez_compressed('xlmr_large_test_avg.npz', start=start_avg, end=end_avg)
print('Saved averaged TEST logits to xlmr_large_test_avg.npz; example_id/offsets saved under xlmr_large_test_logits/')
print('Done 5-fold in %.1fs' % (time.time()-t_global))
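
The validation decode in the cell above scores every (start, end) pair drawn from the top-N start and end tokens of one window, then keeps the best window per example. The sketch below restates that search in plain Python so it can be checked in isolation. The function names `log_softmax` and `best_span` and the toy offsets are illustrative assumptions, not part of the notebook.

```python
import math
from typing import List, Optional, Sequence, Tuple

Offset = Optional[Tuple[int, int]]  # None marks question/special/padding tokens


def log_softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one window."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def best_span(start_logits: Sequence[float], end_logits: Sequence[float],
              offsets: Sequence[Offset], context: str,
              n_best: int = 25, max_answer_len: int = 64) -> Tuple[float, str]:
    """Return (score, text) for the best valid span in a single window."""
    s = log_softmax(start_logits)
    e = log_softmax(end_logits)
    valid = [i for i, o in enumerate(offsets) if o is not None]
    top_s = sorted(valid, key=lambda i: s[i], reverse=True)[:n_best]
    top_e = sorted(valid, key=lambda i: e[i], reverse=True)[:n_best]
    best = (float("-inf"), "")
    for si in top_s:
        for ei in top_e:
            # the end must not precede the start; span length is capped in tokens
            if ei < si or ei - si + 1 > max_answer_len:
                continue
            cs, ce = offsets[si][0], offsets[ei][1]
            if ce <= cs:
                continue
            score = s[si] + e[ei]
            if score > best[0]:
                best = (score, context[cs:ce])
    return best


# Toy window: two masked question tokens followed by six context tokens.
ctx = "Delhi is the capital of India"
offsets = [None, None, (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 29)]
score, text = best_span([0, 0, 0.1, 0, 0, 0, 0, 5.0],
                        [0, 0, 0, 0, 0, 0, 0, 5.0], offsets, ctx)
print(text)
```

Across overlapping windows of the same example, the notebook keeps the candidate with the highest summed log-probability, which is why the scores are normalized per window before comparison.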

In [None]:
# Train on full training set and predict test to create submission.csv
import os, time, numpy as np, pandas as pd, random, collections, torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from torch.nn.functional import log_softmax

SEED = 2025
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)

MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
MAX_LENGTH = 384
DOC_STRIDE = 192
MAX_ANSWER_LEN = 64
NEG_RATIO = 2
EPOCHS = 3
LR = 2e-5
TRAIN_BS = 6
EVAL_BS = 16
GRAD_ACCUM = 3
N_BEST = 25

os.environ.setdefault('TOKENIZERS_PARALLELISM', 'false')
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True,max_split_size_mb:64')
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 needs Ampere or newer; no effect on the V100 noted in the plan
try: torch.set_float32_matmul_precision('high')
except Exception: pass

print('Loading data ...')
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
print('Train:', train_df.shape, '| Test:', test_df.shape)

print('Loading tokenizer/model:', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

def prepare_train_features_full(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions = [], []
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace(): a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace(): a_end -= 1
        if a_start >= a_end: a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try: cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError: cls_index = 0
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None: continue
            ts, te = o
            if ts is None or te is None: continue
            if st_tok is None and ts <= a_start < te: st_tok = t
            if ts <= tgt_end_incl < te: en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
    tok['start_positions'] = start_positions; tok['end_positions'] = end_positions; tok['overflow_to_sample_mapping'] = smap
    return tok

def prepare_test_features(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

print('Building HF datasets ...')
full_tr_ds = Dataset.from_pandas(train_df[['id','question','context','answer_text','answer_start','language']])
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])

t0 = time.time()
full_tr_tok = full_tr_ds.map(prepare_train_features_full, batched=True, remove_columns=full_tr_ds.column_names, desc='train_tokenize_full')
print('Full train feats pre-sample=', full_tr_tok.num_rows, 'tok_time=%.1fs' % (time.time()-t0))

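# Negative sampling as in CV. A feature counts as positive when its start label is not the CLS index,
# which is 0 here because XLM-R always places <s> first.
smap = np.array(full_tr_tok['overflow_to_sample_mapping'])
sp = np.array(full_tr_tok['start_positions'])
is_pos = (sp != 0)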
keep = []; by_ex = collections.defaultdict(list)
for idx, ex in enumerate(smap): by_ex[ex].append(idx)
for ex, idxs in by_ex.items():
    idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
    keep.extend(pos.tolist())
    if len(pos) == 0:
        sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
    else:
        cap = min(len(neg), NEG_RATIO*len(pos))
        if cap > 0:
            sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
keep = np.array(sorted(set(keep))); before = full_tr_tok.num_rows
full_tr_tok = full_tr_tok.select(keep.tolist()); after = full_tr_tok.num_rows
print(f'Neg-sample {before}->{after} ({after/before:.2%})')

print('Tokenizing test ...')
t1 = time.time()
test_tok_all = test_ds.map(prepare_test_features, batched=True, remove_columns=test_ds.column_names, desc='test_tokenize')
print('Test feats:', test_tok_all.num_rows, 'tok_time=%.1fs' % (time.time()-t1))

tr_ds_torch = full_tr_tok.remove_columns([c for c in full_tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
tr_ds_torch.set_format(type='torch')
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')

print('Loading model ...')
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
try: model.gradient_checkpointing_enable()
except Exception: pass

args = TrainingArguments(
    output_dir='xlmr_large_full',
    num_train_epochs=EPOCHS,
    learning_rate=LR,
    per_device_train_batch_size=TRAIN_BS,
    per_device_eval_batch_size=EVAL_BS,
    gradient_accumulation_steps=GRAD_ACCUM,
    fp16=True,
    logging_steps=200,
    save_strategy='no',
    evaluation_strategy='no',
    seed=SEED,
    dataloader_num_workers=2,
    report_to=[],
    gradient_checkpointing=True,
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=1.0
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tr_ds_torch,
    eval_dataset=None,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
)

print('Training on full data ...')
t2 = time.time()
trainer.train()
print('Full training done in %.1fs' % (time.time()-t2))
torch.cuda.empty_cache()

print('Predicting test ...')
pred = trainer.predict(te_ds_pred)
s_logp = log_softmax(torch.tensor(pred.predictions[0]), dim=-1).numpy()
e_logp = log_softmax(torch.tensor(pred.predictions[1]), dim=-1).numpy()

# Decode per example
ex_best = {}
ctx_by_id = dict(zip(test_df['id'].astype(str), test_df['context']))  # O(1) context lookup per feature
for i in range(len(test_tok_all)):
    feat = test_tok_all[i]  # read the row once instead of twice
    ex_id = feat['example_id']
    offsets = feat['offset_mapping']
    ctx = ctx_by_id[ex_id]
    s = s_logp[i]; e = e_logp[i]
    mask = np.array([o is not None for o in offsets], dtype=bool)
    s_m = np.where(mask, s, -1e9); e_m = np.where(mask, e, -1e9)
    top_s = np.argsort(s_m)[-N_BEST:]; top_e = np.argsort(e_m)[-N_BEST:]
    best = (float('-inf'), '')
    for si in top_s:
        for ei in top_e:
            if ei < si: continue
            if (ei - si + 1) > MAX_ANSWER_LEN: continue
            if offsets[si] is None or offsets[ei] is None: continue
            cs,_ = offsets[si]; _,ce = offsets[ei]
            if cs is None or ce is None or ce <= cs: continue
            cand = ctx[cs:ce]
            score = float(s_m[si] + e_m[ei])
            if score > best[0]: best = (score, cand)
    if best[0] == float('-inf'): best = (-1e9, '')
    prev = ex_best.get(ex_id, (float('-inf'), ''))
    if best[0] > prev[0]: ex_best[ex_id] = best

test_preds = {k:v[1] for k,v in ex_best.items()}
sub = pd.DataFrame({'id': test_df['id'].astype(str)})
sub['PredictionString'] = sub['id'].map(test_preds).fillna('')
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape', sub.shape)
sub.head()
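
The training feature function above labels each window by finding the tokens whose character offsets contain the first and last characters of the answer, and falls back to the CLS index when the answer is not fully inside the window. A minimal sketch of that mapping follows. It assumes the offsets have already been set to `None` outside the context, and `char_span_to_token_span` is a hypothetical helper name.

```python
from typing import Optional, Sequence, Tuple

Offset = Optional[Tuple[int, int]]  # None marks tokens outside the context


def char_span_to_token_span(offsets: Sequence[Offset], answer_start: int,
                            answer_text: str, cls_index: int = 0) -> Tuple[int, int]:
    """Map a character-level answer to (start_token, end_token) for one window.

    Returns (cls_index, cls_index) when the window does not contain the whole answer.
    """
    a_start = answer_start
    last_char = a_start + len(answer_text) - 1  # inclusive index of the final answer char
    st_tok = en_tok = None
    for t, o in enumerate(offsets):
        if o is None:
            continue
        ts, te = o
        if st_tok is None and ts <= a_start < te:
            st_tok = t
        if ts <= last_char < te:
            en_tok = t
    if st_tok is not None and en_tok is not None and st_tok <= en_tok:
        return st_tok, en_tok
    return cls_index, cls_index


# Context: "Delhi is the capital of India"; answer "capital of" starts at char 13.
full_window = [None, None, (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 29)]
short_window = [None, None, (0, 5), (6, 8)]  # window that ends before the answer
print(char_span_to_token_span(full_window, 13, "capital of"))
print(char_span_to_token_span(short_window, 13, "capital of"))
```

Windows that return the CLS pair are the "negative" features that the sampling step later thins out.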

In [None]:
# 5-fold training with google/muril-large-cased + TEST LOGIT SAVING (mirror of XLM-R large)
import os, time, numpy as np, pandas as pd, collections, random, json, torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from torch.nn.functional import log_softmax

SEED = 2025
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)

MODEL_NAME = 'google/muril-large-cased'
MAX_LENGTH = 384
DOC_STRIDE = 192
MAX_ANSWER_LEN = 64
NEG_RATIO = 2
EPOCHS = 3
LR = 2e-5
TRAIN_BS = 6
EVAL_BS = 16
GRAD_ACCUM = 3
N_BEST = 30

print('Retokenizing with', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

full_df = pd.read_csv('train.csv').merge(pd.read_csv('folds.csv'), on='id', how='left')
assert full_df['fold'].notna().all()
test_df = pd.read_csv('test.csv')

def prepare_train_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions = [], []
    cls_positions = []  # track per-feature CLS index for robust negative sampling
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace(): a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace(): a_end -= 1
        if a_start >= a_end: a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0
        cls_positions.append(cls_index)
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None: continue
            ts, te = o
            if ts is None or te is None: continue
            if st_tok is None and ts <= a_start < te: st_tok = t
            if ts <= tgt_end_incl < te: en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
    tok['start_positions'] = start_positions; tok['end_positions'] = end_positions; tok['overflow_to_sample_mapping'] = smap; tok['cls_positions'] = cls_positions
    return tok

def prepare_pred_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

# Pre-tokenize TEST once
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])
test_tok_all = test_ds.map(prepare_pred_features_fn, batched=True, remove_columns=test_ds.column_names, desc='muril_test_tokenize_allfolds')
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')

os.makedirs('muril_large_test_logits', exist_ok=True)
with open('muril_large_test_logits/test_example_id.json', 'w', encoding='utf-8') as f:
    json.dump(list(test_tok_all['example_id']), f, ensure_ascii=False)
np.save('muril_large_test_logits/test_offset_mapping.npy', np.array(test_tok_all['offset_mapping'], dtype=object), allow_pickle=True)

all_oof_scores, oof_rows = [], []
test_start_list, test_end_list = [], []
t_global = time.time()
for FOLD in range(5):
    t_fold = time.time()
    print(f'\n===== MuRIL Fold {FOLD} =====')
    tr_df = full_df[full_df['fold'] != FOLD].reset_index(drop=True)
    va_df = full_df[full_df['fold'] == FOLD].reset_index(drop=True)
    tr_ds = Dataset.from_pandas(tr_df[['id','question','context','answer_text','answer_start','language']])
    va_ds = Dataset.from_pandas(va_df[['id','question','context','answer_text','answer_start','language']])
    t0 = time.time()
    tr_tok = tr_ds.map(prepare_train_features_fn, batched=True, remove_columns=tr_ds.column_names, desc=f'muril_train_tokenize_f{FOLD}')
    va_tok_all = va_ds.map(prepare_pred_features_fn, batched=True, remove_columns=va_ds.column_names, desc=f'muril_valid_tokenize_f{FOLD}')
    print(f'MuRIL Fold {FOLD}: train feats pre-sample={tr_tok.num_rows} | valid feats={va_tok_all.num_rows} | tok_time={time.time()-t0:.1f}s')
    # Negative sampling (robust to CLS index)
    smap = np.array(tr_tok['overflow_to_sample_mapping'])
    sp = np.array(tr_tok['start_positions'])
    cls_arr = np.array(tr_tok['cls_positions'])
    is_pos = (sp != cls_arr)
    keep = []; by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(smap): by_ex[ex].append(idx)
    for ex, idxs in by_ex.items():
        idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
        else:
            cap = min(len(neg), NEG_RATIO*len(pos))
            if cap > 0:
                sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
    keep = np.array(sorted(set(keep))); before = tr_tok.num_rows
    tr_tok = tr_tok.select(keep.tolist()); after = tr_tok.num_rows
    print(f'MuRIL Fold {FOLD}: neg-sample {before}->{after} ({after/before:.2%})')
    # Datasets
    tr_ds_torch = tr_tok.remove_columns([c for c in tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
    tr_ds_torch.set_format(type='torch')
    va_ds_pred = va_tok_all.remove_columns([c for c in va_tok_all.column_names if c not in ['input_ids','attention_mask']])
    va_ds_pred.set_format(type='torch')
    # Model
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    try: model.gradient_checkpointing_enable()
    except Exception: pass
    args = TrainingArguments(
        output_dir=f'muril_large_f{FOLD}',
        num_train_epochs=EPOCHS,
        learning_rate=LR,
        per_device_train_batch_size=TRAIN_BS,
        per_device_eval_batch_size=EVAL_BS,
        gradient_accumulation_steps=GRAD_ACCUM,
        fp16=True,
        logging_steps=250,
        save_strategy='no',
        evaluation_strategy='no',
        seed=SEED,
        dataloader_num_workers=2,
        report_to=[],
        gradient_checkpointing=True,
        warmup_ratio=0.1,
        weight_decay=0.01,
        max_grad_norm=1.0
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds_torch,
        eval_dataset=None,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    print(f'MuRIL Fold {FOLD}: training...')
    t1 = time.time()
    trainer.train()
    print(f'MuRIL Fold {FOLD}: train_done in {time.time()-t1:.1f}s | total_elapsed={time.time()-t_fold:.1f}s')
    # Predict valid
    print(f'MuRIL Fold {FOLD}: predicting valid...')
    pred = trainer.predict(va_ds_pred)
    s_logp = log_softmax(torch.tensor(pred.predictions[0]), dim=-1).numpy()
    e_logp = log_softmax(torch.tensor(pred.predictions[1]), dim=-1).numpy()
    ex_best = {}
    ctx_by_id = dict(zip(va_df['id'].astype(str), va_df['context']))  # O(1) context lookup per feature
    for i in range(len(va_tok_all)):
        feat = va_tok_all[i]  # read the row once instead of twice
        ex_id = feat['example_id']
        offsets = feat['offset_mapping']
        ctx = ctx_by_id[ex_id]
        s = s_logp[i]; e = e_logp[i]
        mask = np.array([o is not None for o in offsets], dtype=bool)
        s_m = np.where(mask, s, -1e9); e_m = np.where(mask, e, -1e9)
        top_s = np.argsort(s_m)[-N_BEST:]; top_e = np.argsort(e_m)[-N_BEST:]
        best = (float('-inf'), '')
        for si in top_s:
            for ei in top_e:
                if ei < si: continue
                if (ei - si + 1) > MAX_ANSWER_LEN: continue
                if offsets[si] is None or offsets[ei] is None: continue
                cs,_ = offsets[si]; _,ce = offsets[ei]
                if cs is None or ce is None or ce <= cs: continue
                cand = ctx[cs:ce]; score = float(s_m[si] + e_m[ei])
                if score > best[0]: best = (score, cand)
        if best[0] == float('-inf'): best = (-1e9, '')
        prev = ex_best.get(ex_id, (float('-inf'), ''))
        if best[0] > prev[0]: ex_best[ex_id] = best
    preds = {k:v[1] for k,v in ex_best.items()}
    y_true = va_df.assign(id_str=va_df['id'].astype(str)).set_index('id_str')['answer_text'].to_dict()
    fold_scores = [jaccard_word_level(y_true[k], preds.get(k, '')) for k in y_true.keys()]
    f_j = float(np.mean(fold_scores)) if fold_scores else 0.0
    all_oof_scores.append(f_j)
    print(f'MuRIL Fold {FOLD} Jaccard: {f_j:.4f} | elapsed_fold={time.time()-t_fold:.1f}s')
    for k,v in preds.items():
        oof_rows.append({'id': k, 'pred_text': v, 'gt': y_true.get(k, ''), 'fold': FOLD})

    # Predict TEST and save logits
    print(f'MuRIL Fold {FOLD}: predicting TEST and saving logits...')
    te_pred = trainer.predict(te_ds_pred)
    te_start = te_pred.predictions[0]
    te_end = te_pred.predictions[1]
    np.save(f'muril_large_test_logits/test_start_logits_f{FOLD}.npy', te_start)
    np.save(f'muril_large_test_logits/test_end_logits_f{FOLD}.npy', te_end)
    test_start_list.append(te_start)
    test_end_list.append(te_end)

    del model, trainer, tr_ds_torch, va_ds_pred, tr_tok, va_tok_all, pred, s_logp, e_logp, te_pred
    torch.cuda.empty_cache()

# Save OOF, mean
oof_df = pd.DataFrame(oof_rows)
oof_df.to_csv('oof_preds_muril_large.csv', index=False)
mean_oof = float(np.mean(all_oof_scores)) if all_oof_scores else 0.0
print('MuRIL 5-fold OOF Jaccard (mean):', round(mean_oof, 4))

# Average TEST logits across folds
print('MuRIL: Averaging TEST logits across folds...')
start_stack = np.stack(test_start_list, axis=0)
end_stack = np.stack(test_end_list, axis=0)
start_avg = start_stack.mean(axis=0)
end_avg = end_stack.mean(axis=0)
np.savez_compressed('muril_large_test_avg.npz', start=start_avg, end=end_avg)
print('Saved muril_large_test_avg.npz; example_id/offsets saved under muril_large_test_logits/')
print('MuRIL 5-fold done in %.1fs' % (time.time()-t_global))
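
The per-fold negative sampling above keeps every window that contains the answer, adds at most `NEG_RATIO` times as many answer-free windows from the same example, and keeps the first four windows when an example has no positive window at all. The sketch below restates the rule in plain Python. `sample_windows` is a hypothetical name, and it uses `random.Random` instead of NumPy's global RNG, so the selected negatives will not match the notebook's draw.

```python
import random
from collections import defaultdict
from typing import List, Sequence


def sample_windows(example_ids: Sequence[int], is_positive: Sequence[bool],
                   neg_ratio: int = 2, max_neg_if_no_pos: int = 4,
                   seed: int = 2025) -> List[int]:
    """Return the sorted indices of the windows to keep for training."""
    rng = random.Random(seed)
    by_ex = defaultdict(list)
    for idx, ex in enumerate(example_ids):
        by_ex[ex].append(idx)
    keep: List[int] = []
    for idxs in by_ex.values():
        pos = [i for i in idxs if is_positive[i]]
        neg = [i for i in idxs if not is_positive[i]]
        keep.extend(pos)
        if not pos:
            # no window holds the answer: keep a few leading windows as negatives
            keep.extend(neg[:max_neg_if_no_pos])
        else:
            cap = min(len(neg), neg_ratio * len(pos))
            keep.extend(rng.sample(neg, cap))
    return sorted(keep)


# A window is positive when its start label differs from its own CLS position,
# which is what makes the rule independent of where the tokenizer puts CLS.
start_positions = [0, 0, 17, 0, 0, 0] + [0] * 6
cls_positions = [0] * 12
is_positive = [sp != cls for sp, cls in zip(start_positions, cls_positions)]
example_ids = [0] * 6 + [1] * 6
kept = sample_windows(example_ids, is_positive)
print(len(kept))
```

Comparing against the stored CLS position rather than a literal 0 is the design choice that keeps this step correct for tokenizers that do not place the CLS token first.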

In [36]:
# Char-level blending and decoding (prob-space aggregation, banded decode, robust fallback with per-lang caps and optional smoothing)
import os, json, numpy as np, pandas as pd, time, torch, re, unicodedata
from torch.nn.functional import log_softmax

# Safe normalization fallback (in case kernel was restarted)
ZW_CHARS = ''.join([
    '\u200c',  # ZWNJ
    '\u200d',  # ZWJ
    '\ufeff',  # BOM
])
ZW_RE = re.compile(f'[{re.escape(ZW_CHARS)}]')
WS_RE = re.compile(r'\s+')
OUTER_PUNCT_RE = re.compile(r'^[\s\"\'\“\”\‘\’\(\)\[\]\{\}\|।,:;.!?-]+|[\s\"\'\“\”\‘\’\(\)\[\]\{\}\|।,:;.!?-]+$')
def _normalize_text_local(s: str, trim_outer_punct: bool = True) -> str:
    if s is None:
        return ''
    s = unicodedata.normalize('NFKC', str(s))
    s = ZW_RE.sub('', s)
    s = WS_RE.sub(' ', s).strip()
    if trim_outer_punct and s:
        s = OUTER_PUNCT_RE.sub('', s).strip()
    return s
try:
    normalize_text
except NameError:
    normalize_text = _normalize_text_local

# Globals and per-stream temperature
N_BEST_CHAR = 150
MAX_ANSWER_LEN = 64
LENGTH_PENALTY = 0.0048
LOGIT_TEMPERATURE = 0.90
STREAM_TEMP = {
    'xlmr': 0.95,
    'muril': {'hindi': 1.15, 'tamil': 1.10, 'default': 1.10},
    'xlmr512': 0.88,
    'mdeberta': 1.00,
}
APPLY_SMOOTHING = False  # set True for the smoothing probe (global; additional Tamil end-only smoothing applied inside build loop)
EPS = 1e-12

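# str.strip takes a set of characters, not a regex: regex-style escapes such as '\s' or '\-'
# would add a literal backslash and the letter 's' to the set, so list the characters directly.
PUNCT_STRIP = '\u0964\u0965,:;.!?"\'“”‘’)]}|- \t\r\n'

def strip_trailing_punct(s: str) -> str:
    # Note: despite the name, str.strip removes these characters from both ends.
    if not s: return s
    return s.strip().strip(PUNCT_STRIP)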

def trim_boundary_combining(s: str) -> str:
    if not s: return s
    if len(s) > 0 and unicodedata.combining(s[0]):
        s = s[1:]
    if len(s) > 0 and unicodedata.combining(s[-1]):
        s = s[:-1]
    return s

def remove_unmatched_quotes(s: str) -> str:
    if not s: return s
    quotes = [("\"","\""), ("'","'"), ('“','”'), ('‘','’')]
    for lq, rq in quotes:
        if s.startswith(lq) and not s.endswith(rq): s = s[len(lq):]
        if s.endswith(rq) and not s.startswith(lq): s = s[:-len(rq)]
    return s

# Load model evidence: masked log-softmax -> prob -> accumulate -> mean-by-coverage -> log
def load_model_char_evidence(avg_npz_path, exid_json, offsets_npy, test_df, stream_key):
    if not (os.path.exists(avg_npz_path) and os.path.exists(exid_json) and os.path.exists(offsets_npy)):
        return None
    data = np.load(avg_npz_path)
    start_logits = data['start']
    end_logits = data['end']
    with open(exid_json, 'r', encoding='utf-8') as f:
        example_ids = json.load(f)
    offsets_list = np.load(offsets_npy, allow_pickle=True)
    ctx_texts = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    lang_map = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))

    start_char_map, end_char_map = {}, {}
    start_cov, end_cov = {}, {}

    for i in range(len(example_ids)):
        ex_id = example_ids[i]
        ctx = ctx_texts.get(ex_id, '')
        ex_lang = str(lang_map.get(ex_id, 'unknown')).lower().strip()
        # per-language temperature (supports dict for muril) or scalar
        T_cfg = STREAM_TEMP.get(stream_key, LOGIT_TEMPERATURE)
        if isinstance(T_cfg, dict):
            T = T_cfg.get(ex_lang, T_cfg.get('default', LOGIT_TEMPERATURE))
        else:
            T = T_cfg

        n = len(ctx)
        if ex_id not in start_char_map:
            start_char_map[ex_id] = np.zeros(n, dtype=np.float32)
            end_char_map[ex_id] = np.zeros(n, dtype=np.float32)
            start_cov[ex_id] = np.zeros(n, dtype=np.float32)
            end_cov[ex_id] = np.zeros(n, dtype=np.float32)

        offs = offsets_list[i]
        s_logit = torch.tensor(start_logits[i], dtype=torch.float32) / max(T, 1e-6)
        e_logit = torch.tensor(end_logits[i], dtype=torch.float32) / max(T, 1e-6)
        mask_bool = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offs], dtype=bool)
        s_masked = torch.where(torch.tensor(mask_bool), s_logit, torch.full_like(s_logit, -1e9))
        e_masked = torch.where(torch.tensor(mask_bool), e_logit, torch.full_like(e_logit, -1e9))

        s_logp = torch.nn.functional.log_softmax(s_masked, dim=-1).numpy()
        e_logp = torch.nn.functional.log_softmax(e_masked, dim=-1).numpy()
        s_prob = np.exp(s_logp)
        e_prob = np.exp(e_logp)

        for t, o in enumerate(offs):
            if o is None: continue
            cs, ce = o
            if cs is None or ce is None or ce <= cs: continue
            if 0 <= cs < n:
                start_char_map[ex_id][cs] += float(s_prob[t])
                start_cov[ex_id][cs] += 1.0
            end_idx = ce - 1
            if 0 <= end_idx < n:
                end_char_map[ex_id][end_idx] += float(e_prob[t])
                end_cov[ex_id][end_idx] += 1.0

    for ex_id in start_char_map.keys():
        sc = start_cov[ex_id]; ec = end_cov[ex_id]
        s = start_char_map[ex_id]; e = end_char_map[ex_id]
        sc_safe = np.where(sc > 0, sc, 1.0)
        ec_safe = np.where(ec > 0, ec, 1.0)
        s_mean = s / sc_safe
        e_mean = e / ec_safe
        start_char_map[ex_id] = np.log(np.clip(s_mean, EPS, 1.0)).astype(np.float32)
        end_char_map[ex_id] = np.log(np.clip(e_mean, EPS, 1.0)).astype(np.float32)

    return start_char_map, end_char_map

# Banded outer-product decode that returns (score, start, end, text)
def decode_best_span(ctx, s_arr, e_arr, n_best=N_BEST_CHAR, max_len=MAX_ANSWER_LEN, length_penalty=LENGTH_PENALTY):
    if len(ctx) == 0:
        return (float('-inf'), -1, -1, '')
    n = len(s_arr)
    top_s = np.argsort(s_arr)[-n_best:]
    best = (float('-inf'), 0, -1)
    for si in top_s:
        ei_max = min(n - 1, si + max_len - 1)
        band = e_arr[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_arr[si] + e_arr[ei] - length_penalty * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    if best[2] < best[1]:
        return (float('-inf'), -1, -1, '')
    return (best[0], best[1], best[2], ctx[best[1]:best[2]+1])

def _conv1d_reflect(arr: np.ndarray, kernel=(0.2, 0.6, 0.2)) -> np.ndarray:
    if arr.size == 0: return arr
    k = np.array(kernel, dtype=np.float32)
    pad = len(k)//2
    # reflect padding
    left = arr[1:pad+1][::-1] if arr.size > 1 else arr
    right = arr[-pad-1:-1][::-1] if arr.size > 1 else arr
    padded = np.concatenate([left, arr, right]) if pad > 0 else arr
    out = np.convolve(padded, k, mode='valid')
    return out.astype(np.float32)

def robust_decode(ctx, s_char_log, e_char_log, lang='unknown'):
    lang_l = str(lang).lower().strip()
    # per-language max_len caps, decode knobs, and no-answer thresholds
    if lang_l == 'hindi':
        max_len = 50; nb = 160; len_pen = 0.0048; na_threshold = -5.0
    elif lang_l == 'tamil':
        max_len = 64; nb = 200; len_pen = 0.0040; na_threshold = -4.5
    else:
        max_len = MAX_ANSWER_LEN; nb = N_BEST_CHAR; len_pen = LENGTH_PENALTY; na_threshold = -5.0

    score, si, ei, ans = decode_best_span(ctx, s_char_log, e_char_log, n_best=nb, max_len=max_len, length_penalty=len_pen)
    # No-answer threshold: if below, return empty (disable risky fallbacks)
    if score < na_threshold:
        return ''

    # Minimal hygiene
    if ans:
        if len(ans) and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'):
            ans = ans[:-1]
        if len(ans) and unicodedata.category(ans[0]) == 'Mn':
            ans = ans[1:]
        # Tamil-specific: drop isolated trailing vowel sign once
        if lang_l == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn':
            ans = ans[:-1]
    return ans or ''

def build_submission(blend_weights=None, per_language_weights=None):
    t0 = time.time()
    test_df = pd.read_csv('test.csv')
    # Load models' char evidence
    xlmr = load_model_char_evidence(
        'xlmr_large_test_avg.npz',
        'xlmr_large_test_logits/test_example_id.json',
        'xlmr_large_test_logits/test_offset_mapping.npy',
        test_df, 'xlmr'
    )
    muril = load_model_char_evidence(
        'muril_large_test_avg.npz',
        'muril_large_test_logits/test_example_id.json',
        'muril_large_test_logits/test_offset_mapping.npy',
        test_df, 'muril'
    )
    mdb = load_model_char_evidence(
        'mdeberta_base_test_avg.npz',
        'mdeberta_base_test_logits/test_example_id.json',
        'mdeberta_base_test_logits/test_offset_mapping.npy',
        test_df, 'mdeberta'
    )
    xlmr512 = load_model_char_evidence(
        'xlmr_large_512_test_avg.npz',
        'xlmr_large_512_test_logits/test_example_id.json',
        'xlmr_large_512_test_logits/test_offset_mapping.npy',
        test_df, 'xlmr512'
    )
    have_xlmr = xlmr is not None
    have_muril = muril is not None
    have_mdb = mdb is not None
    have_xlmr512 = xlmr512 is not None
    assert have_xlmr or have_muril or have_mdb or have_xlmr512, 'No averaged logits found. Run folds first.'

    if blend_weights is None:
        if have_xlmr and have_muril and have_mdb and have_xlmr512:
            w_x, w_m, w_d, w_x512 = 0.45, 0.25, 0.0, 0.30
        elif have_xlmr and have_muril and have_xlmr512:
            w_x, w_m, w_d, w_x512 = 0.30, 0.10, 0.0, 0.60
        elif have_xlmr and have_xlmr512:
            w_x, w_m, w_d, w_x512 = 0.35, 0.0, 0.0, 0.65
        elif have_xlmr:
            w_x, w_m, w_d, w_x512 = 1.0, 0.0, 0.0, 0.0
        elif have_muril:
            w_x, w_m, w_d, w_x512 = 0.0, 1.0, 0.0, 0.0
        else:
            w_x, w_m, w_d, w_x512 = 0.0, 0.0, 0.0, 1.0
    else:
        if len(blend_weights) == 2:
            w_x, w_m = blend_weights; w_d = 0.0; w_x512 = 0.0
        elif len(blend_weights) == 3:
            w_x, w_m, w_d = blend_weights; w_x512 = 0.0
        else:
            w_x, w_m, w_d, w_x512 = blend_weights

    preds = []
    ids = test_df['id'].astype(str).values
    langs = test_df['language'].astype(str).values if 'language' in test_df.columns else np.array(['unknown']*len(ids))
    # Row-aligned contexts: avoids a full DataFrame scan per example (assumes ids are unique)
    ctxs = test_df['context'].astype(str).values
    for ex_id, lang, ctx in zip(ids, langs, ctxs):
        wx, wm, wd, wx512 = w_x, w_m, w_d, w_x512
        if per_language_weights is not None and isinstance(per_language_weights, dict):
            if lang in per_language_weights:
                pl = per_language_weights[lang]
                if len(pl) == 2:
                    wx, wm = pl; wd, wx512 = 0.0, 0.0
                elif len(pl) == 3:
                    wx, wm, wd = pl; wx512 = 0.0
                else:
                    wx, wm, wd, wx512 = pl
        weight_items = []
        if have_xlmr: weight_items.append(('xlmr', wx))
        if have_muril: weight_items.append(('muril', wm))
        if have_mdb: weight_items.append(('mdeberta', wd))
        if have_xlmr512: weight_items.append(('xlmr512', wx512))
        sumw = sum(max(0.0, w) for _, w in weight_items)
        if sumw <= 0:
            norm_weights = {k: 1.0/len(weight_items) for k,_ in weight_items} if weight_items else {}
        else:
            norm_weights = {k: max(0.0, w)/sumw for k,w in weight_items}

        s_char = np.zeros(len(ctx), dtype=np.float32)
        e_char = np.zeros(len(ctx), dtype=np.float32)
        if have_xlmr:
            s_map_x, e_map_x = xlmr
            sx = s_map_x.get(ex_id); exm = e_map_x.get(ex_id)
            if sx is not None: s_char += norm_weights.get('xlmr', 0.0) * sx
            if exm is not None: e_char += norm_weights.get('xlmr', 0.0) * exm
        if have_muril:
            s_map_m, e_map_m = muril
            sm = s_map_m.get(ex_id); em = e_map_m.get(ex_id)
            if sm is not None: s_char += norm_weights.get('muril', 0.0) * sm
            if em is not None: e_char += norm_weights.get('muril', 0.0) * em
        if have_mdb:
            s_map_d, e_map_d = mdb
            sd = s_map_d.get(ex_id); ed = e_map_d.get(ex_id)
            if sd is not None: s_char += norm_weights.get('mdeberta', 0.0) * sd
            if ed is not None: e_char += norm_weights.get('mdeberta', 0.0) * ed
        if have_xlmr512:
            s_map_x512, e_map_x512 = xlmr512
            sx512 = s_map_x512.get(ex_id); ex512 = e_map_x512.get(ex_id)
            if sx512 is not None: s_char += norm_weights.get('xlmr512', 0.0) * sx512
            if ex512 is not None: e_char += norm_weights.get('xlmr512', 0.0) * ex512

        # Optional global light smoothing (for probe only)
        if APPLY_SMOOTHING:
            s_char = _conv1d_reflect(s_char)
            e_char = _conv1d_reflect(e_char)

        # Tamil-only end smoothing as recommended
        if str(lang).lower().strip() == 'tamil':
            e_char = _conv1d_reflect(e_char)

        ans = robust_decode(ctx, s_char, e_char, lang)
        try:
            # Numeric punctuation collapse inside digit runs
            ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
            # Danda/virama hygiene: drop a virama directly before a danda, collapse repeated dandas, then strip one trailing danda
            ans = ans.replace('\u094D\u0964', '\u0964')  # '।' is U+0964, so a single replace covers both spellings
            ans = re.sub(r'[।]+', '।', ans)
            if ans.endswith('।'): ans = ans[:-1]
            ans = strip_trailing_punct(ans)
            ans = trim_boundary_combining(ans)
            ans = ans.strip()
            ans = remove_unmatched_quotes(ans)
            ans = normalize_text(ans, trim_outer_punct=True)
        except Exception:
            ans = (ans or '').strip()
        preds.append(ans if ans is not None else '')
    sub = pd.DataFrame({'id': ids, 'PredictionString': preds})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv', sub.shape, 'in %.1fs' % (time.time()-t0))
    return sub

print('Blend/Decode cell ready. Use build_submission() with per-lang nb/lp, Tamil end smoothing, per-lang temps, and per-language weights.')

Blend/Decode cell ready. Use build_submission() with per-lang nb/lp, Tamil end smoothing, per-lang temps, and per-language weights.
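
The 3-tap reflect-padded smoothing applied to the Tamil end logits above can be checked in isolation. The following is a minimal sketch, not the notebook's own helper: `smooth_reflect` is a hypothetical name, and it uses `np.pad(..., mode='reflect')` in place of the manual slicing in `_conv1d_reflect`. With the symmetric kernel `(0.2, 0.6, 0.2)` the output keeps the input length and a single spike is spread over its two neighbours.

```python
import numpy as np

def smooth_reflect(arr, kernel=(0.2, 0.6, 0.2)):
    """Hypothetical stand-in for _conv1d_reflect: 3-tap smoothing, same-length output."""
    arr = np.asarray(arr, dtype=np.float32)
    if arr.size < 2:
        # Nothing to reflect against; a kernel that sums to 1 leaves a single value unchanged.
        return arr.copy()
    k = np.asarray(kernel, dtype=np.float32)
    pad = len(k) // 2
    # Reflect padding mirrors the interior values without repeating the edge value.
    padded = np.pad(arr, pad, mode='reflect')
    return np.convolve(padded, k, mode='valid').astype(np.float32)

# Toy char-level end logits with one sharp peak.
logits = np.array([0.0, 0.0, 10.0, 0.0, 0.0], dtype=np.float32)
smoothed = smooth_reflect(logits)
print(smoothed)
```

Because the kernel is symmetric, the kernel flip performed by `np.convolve` has no effect here; an asymmetric kernel would need `np.correlate` to keep the orientation.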


In [None]:
# Build blended submission with 3-model weights [XLM-R, MuRIL, mDeBERTa] = [0.50, 0.30, 0.20]
sub = build_submission(blend_weights=(0.50, 0.30, 0.20))
sub.head()
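
When some of the averaged logit files are missing, `build_submission` renormalizes the requested weights over the models that were actually loaded. The sketch below reproduces that step on its own so the effective weights can be inspected before a run; `normalize_weights` is a hypothetical helper name, and the model keys simply mirror the ones used in the cell above.

```python
def normalize_weights(weights, available):
    """Clip negative weights to zero and rescale over the available models only."""
    items = [(name, max(0.0, float(weights.get(name, 0.0)))) for name in available]
    if not items:
        return {}
    total = sum(w for _, w in items)
    if total <= 0:
        # All requested weights are zero or negative: fall back to a uniform blend.
        return {name: 1.0 / len(items) for name, _ in items}
    return {name: w / total for name, w in items}

requested = {'xlmr': 0.50, 'muril': 0.30, 'mdeberta': 0.20}
# Suppose only the XLM-R and MuRIL logits were found on disk.
effective = normalize_weights(requested, ['xlmr', 'muril'])
print(effective)
```

In this toy case the missing mDeBERTa share is redistributed proportionally, so XLM-R ends up at 0.625 and MuRIL at 0.375.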

In [None]:
# 5-fold training with microsoft/mdeberta-v3-base (MAX_LEN=512, STRIDE=256) + TEST LOGIT SAVING
import os, time, numpy as np, pandas as pd, collections, random, json, torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from torch.nn.functional import log_softmax

SEED = 2025
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)

MODEL_NAME = 'microsoft/mdeberta-v3-base'
MAX_LENGTH = 512
DOC_STRIDE = 256
MAX_ANSWER_LEN = 64
NEG_RATIO = 1
EPOCHS = 3
LR = 3e-5
TRAIN_BS = 16  # fallback to 12 if OOM
EVAL_BS = 32
GRAD_ACCUM = 1
N_BEST = 30

print('Retokenizing with', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

full_df = pd.read_csv('train.csv').merge(pd.read_csv('folds.csv'), on='id', how='left')
assert full_df['fold'].notna().all()
test_df = pd.read_csv('test.csv')

def prepare_train_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions = [], []
    cls_positions = []
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace(): a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace(): a_end -= 1
        if a_start >= a_end: a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0
        cls_positions.append(cls_index)
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None: continue
            ts, te = o
            if ts is None or te is None: continue
            if st_tok is None and ts <= a_start < te: st_tok = t
            if ts <= tgt_end_incl < te: en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
    tok['start_positions'] = start_positions; tok['end_positions'] = end_positions; tok['overflow_to_sample_mapping'] = smap; tok['cls_positions'] = cls_positions
    return tok

def prepare_pred_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

# Pre-tokenize TEST once
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])
test_tok_all = test_ds.map(prepare_pred_features_fn, batched=True, remove_columns=test_ds.column_names, desc='mdeberta_test_tokenize_allfolds')
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')

os.makedirs('mdeberta_base_test_logits', exist_ok=True)
with open('mdeberta_base_test_logits/test_example_id.json', 'w', encoding='utf-8') as f:
    json.dump(list(test_tok_all['example_id']), f, ensure_ascii=False)
np.save('mdeberta_base_test_logits/test_offset_mapping.npy', np.array(test_tok_all['offset_mapping'], dtype=object), allow_pickle=True)

all_oof_scores = []; oof_rows = []; test_start_list = []; test_end_list = []
t_global = time.time()
for FOLD in range(5):
    t_fold = time.time()
    print(f'\n===== mDeBERTa Fold {FOLD} =====')
    tr_df = full_df[full_df['fold'] != FOLD].reset_index(drop=True)
    va_df = full_df[full_df['fold'] == FOLD].reset_index(drop=True)
    tr_ds = Dataset.from_pandas(tr_df[['id','question','context','answer_text','answer_start','language']])
    va_ds = Dataset.from_pandas(va_df[['id','question','context','answer_text','answer_start','language']])
    t0 = time.time()
    tr_tok = tr_ds.map(prepare_train_features_fn, batched=True, remove_columns=tr_ds.column_names, desc=f'mdeberta_train_tokenize_f{FOLD}')
    va_tok_all = va_ds.map(prepare_pred_features_fn, batched=True, remove_columns=va_ds.column_names, desc=f'mdeberta_valid_tokenize_f{FOLD}')
    print(f'mDeBERTa Fold {FOLD}: train feats pre-sample={tr_tok.num_rows} | valid feats={va_tok_all.num_rows} | tok_time={time.time()-t0:.1f}s')
    # Negative sampling (robust via cls_positions)
    smap = np.array(tr_tok['overflow_to_sample_mapping'])
    sp = np.array(tr_tok['start_positions'])
    cls_arr = np.array(tr_tok['cls_positions'])
    is_pos = (sp != cls_arr)
    keep = []; by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(smap): by_ex[ex].append(idx)
    for ex, idxs in by_ex.items():
        idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
        else:
            cap = min(len(neg), NEG_RATIO*len(pos))
            if cap > 0:
                sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
    keep = np.array(sorted(set(keep))); before = tr_tok.num_rows
    tr_tok = tr_tok.select(keep.tolist()); after = tr_tok.num_rows
    print(f'mDeBERTa Fold {FOLD}: neg-sample {before}->{after} ({after/before:.2%})')
    # Datasets
    tr_ds_torch = tr_tok.remove_columns([c for c in tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
    tr_ds_torch.set_format(type='torch')
    va_ds_pred = va_tok_all.remove_columns([c for c in va_tok_all.column_names if c not in ['input_ids','attention_mask']])
    va_ds_pred.set_format(type='torch')
    # Model
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    try: model.gradient_checkpointing_enable()
    except Exception: pass
    args = TrainingArguments(
        output_dir=f'mdeberta_base_f{FOLD}',
        num_train_epochs=EPOCHS,
        learning_rate=LR,
        per_device_train_batch_size=TRAIN_BS,
        per_device_eval_batch_size=EVAL_BS,
        gradient_accumulation_steps=GRAD_ACCUM,
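        fp16=True,  # mDeBERTa-v3 is reported to overflow under fp16 in some setups; if the loss turns NaN, retry in fp32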
        logging_steps=250,
        save_strategy='no',
        evaluation_strategy='no',
        seed=SEED,
        dataloader_num_workers=2,
        report_to=[],
        gradient_checkpointing=True,
        warmup_ratio=0.1,
        weight_decay=0.01,
        max_grad_norm=1.0
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds_torch,
        eval_dataset=None,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    print(f'mDeBERTa Fold {FOLD}: training...')
    t1 = time.time()
    try:
        trainer.train()
    except RuntimeError as e:
        if 'out of memory' in str(e).lower():  # case-insensitive match; the exact OOM wording varies across PyTorch versions
            print('OOM encountered; reducing TRAIN_BS to 12 and retrying this fold...')
            del trainer, model
            torch.cuda.empty_cache()
            model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
            try: model.gradient_checkpointing_enable()
            except Exception: pass
            args = TrainingArguments(
                output_dir=f'mdeberta_base_f{FOLD}',
                num_train_epochs=EPOCHS,
                learning_rate=LR,
                per_device_train_batch_size=12,
                per_device_eval_batch_size=EVAL_BS,
                gradient_accumulation_steps=GRAD_ACCUM,
                fp16=True,
                logging_steps=250,
                save_strategy='no',
                evaluation_strategy='no',
                seed=SEED,
                dataloader_num_workers=2,
                report_to=[],
                gradient_checkpointing=True,
                warmup_ratio=0.1,
                weight_decay=0.01,
                max_grad_norm=1.0,
            )
            trainer = Trainer(
                model=model,
                args=args,
                train_dataset=tr_ds_torch,
                eval_dataset=None,
                tokenizer=tokenizer,
                data_collator=DataCollatorWithPadding(tokenizer),
            )
            trainer.train()
        else:
            raise
    print(f'mDeBERTa Fold {FOLD}: train_done in {time.time()-t1:.1f}s | total_elapsed={time.time()-t_fold:.1f}s')
    # Predict valid
    print(f'mDeBERTa Fold {FOLD}: predicting valid...')
    pred = trainer.predict(va_ds_pred)
    s_logp = log_softmax(torch.tensor(pred.predictions[0]), dim=-1).numpy()
    e_logp = log_softmax(torch.tensor(pred.predictions[1]), dim=-1).numpy()
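    ex_best = {}
    # Build the id -> context lookup once instead of scanning va_df for every feature (assumes ids are unique)
    ctx_by_id = dict(zip(va_df['id'].astype(str), va_df['context']))
    for i in range(len(va_tok_all)):
        feat = va_tok_all[i]  # fetch the row once; each Dataset index decodes the full row
        ex_id = feat['example_id']
        offsets = feat['offset_mapping']
        ctx = ctx_by_id[ex_id]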
        s = s_logp[i]; e = e_logp[i]
        mask = np.array([o is not None for o in offsets], dtype=bool)
        s_m = np.where(mask, s, -1e9); e_m = np.where(mask, e, -1e9)
        top_s = np.argsort(s_m)[-N_BEST:]; top_e = np.argsort(e_m)[-N_BEST:]
        best = (float('-inf'), '')
        for si in top_s:
            for ei in top_e:
                if ei < si: continue
                if (ei - si + 1) > MAX_ANSWER_LEN: continue
                if offsets[si] is None or offsets[ei] is None: continue
                cs,_ = offsets[si]; _,ce = offsets[ei]
                if cs is None or ce is None or ce <= cs: continue
                cand = ctx[cs:ce]; score = float(s_m[si] + e_m[ei])
                if score > best[0]: best = (score, cand)
        if best[0] == float('-inf'): best = (-1e9, '')
        prev = ex_best.get(ex_id, (float('-inf'), ''))
        if best[0] > prev[0]: ex_best[ex_id] = best
    preds = {k:v[1] for k,v in ex_best.items()}
    y_true = va_df.assign(id_str=va_df['id'].astype(str)).set_index('id_str')['answer_text'].to_dict()
    fold_scores = [jaccard_word_level(y_true[k], preds.get(k, '')) for k in y_true.keys()]
    f_j = float(np.mean(fold_scores)) if fold_scores else 0.0
    all_oof_scores.append(f_j)
    print(f'mDeBERTa Fold {FOLD} Jaccard: {f_j:.4f} | elapsed_fold={time.time()-t_fold:.1f}s')
    for k,v in preds.items():
        oof_rows.append({'id': k, 'pred_text': v, 'gt': y_true.get(k, ''), 'fold': FOLD})

    # Predict TEST and save logits
    print(f'mDeBERTa Fold {FOLD}: predicting TEST and saving logits...')
    te_pred = trainer.predict(te_ds_pred)
    te_start = te_pred.predictions[0]
    te_end = te_pred.predictions[1]
    np.save(f'mdeberta_base_test_logits/test_start_logits_f{FOLD}.npy', te_start)
    np.save(f'mdeberta_base_test_logits/test_end_logits_f{FOLD}.npy', te_end)
    test_start_list.append(te_start)
    test_end_list.append(te_end)

    del model, trainer, tr_ds_torch, va_ds_pred, tr_tok, va_tok_all, pred, s_logp, e_logp, te_pred
    torch.cuda.empty_cache()

# Save OOF, mean
oof_df = pd.DataFrame(oof_rows)
oof_df.to_csv('oof_preds_mdeberta_base.csv', index=False)
mean_oof = float(np.mean(all_oof_scores)) if all_oof_scores else 0.0
print('mDeBERTa 5-fold OOF Jaccard (mean):', round(mean_oof, 4))

# Average TEST logits across folds
print('mDeBERTa: Averaging TEST logits across folds...')
start_stack = np.stack(test_start_list, axis=0)
end_stack = np.stack(test_end_list, axis=0)
start_avg = start_stack.mean(axis=0)
end_avg = end_stack.mean(axis=0)
np.savez_compressed('mdeberta_base_test_avg.npz', start=start_avg, end=end_avg)
print('Saved mdeberta_base_test_avg.npz; example_id/offsets saved under mdeberta_base_test_logits/')
print('mDeBERTa 5-fold done in %.1fs' % (time.time()-t_global))
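
The per-example negative sampling used in the training loop above can be exercised on toy arrays. This is a minimal sketch under stated assumptions: `sample_features` is a hypothetical name, it takes its own seeded `np.random.default_rng` rather than the global NumPy state the notebook relies on, and the cap of 4 negatives for examples without any positive window is copied from the cell above.

```python
import collections
import numpy as np

def sample_features(sample_ids, is_pos, neg_ratio=1, max_neg_if_no_pos=4, seed=0):
    """Keep every positive window, plus up to neg_ratio * n_pos negatives per example."""
    rng = np.random.default_rng(seed)
    by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(sample_ids):
        by_ex[int(ex)].append(idx)
    keep = []
    for idxs in by_ex.values():
        idxs = np.array(idxs)
        pos = idxs[is_pos[idxs]]
        neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            # No window contains the answer: keep the first few negatives, in order.
            keep.extend(neg[:max_neg_if_no_pos].tolist())
        else:
            cap = min(len(neg), neg_ratio * len(pos))
            keep.extend(rng.permutation(neg)[:cap].tolist())
    return sorted(keep)

# Example 0 has one positive and three negative windows; example 1 has two negatives only.
sample_ids = np.array([0, 0, 0, 0, 1, 1])
is_pos = np.array([True, False, False, False, False, False])
kept = sample_features(sample_ids, is_pos, neg_ratio=1)
print(kept)
```

With `neg_ratio=1`, example 0 contributes its positive window and one randomly chosen negative, while example 1 keeps both of its windows.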

In [3]:
# 5-fold training with deepset/xlm-roberta-large-squad2 at MAX_LEN=512, STRIDE=256 + TEST LOGIT SAVING (TTA-style retrain)
import os, time, numpy as np, pandas as pd, collections, random, json, torch, re, unicodedata, hashlib
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from torch.nn.functional import log_softmax

# --- Inline metric/normalization to avoid missing globals after restarts ---
ZW_CHARS = ''.join([
    '\u200c',  # ZWNJ
    '\u200d',  # ZWJ
    '\ufeff',  # BOM
])
ZW_RE = re.compile(f'[{re.escape(ZW_CHARS)}]')
WS_RE = re.compile(r'\s+')
OUTER_PUNCT_RE = re.compile(r'^[\s\"\'\“\”\‘\’\(\)\[\]\{\}\|।,:;.!?-]+|[\s\"\'\“\”\‘\’\(\)\[\]\{\}\|।,:;.!?-]+$')
def normalize_text(s: str, trim_outer_punct: bool = True) -> str:
    if s is None:
        return ''
    s = unicodedata.normalize('NFKC', str(s))
    s = ZW_RE.sub('', s)
    s = WS_RE.sub(' ', s).strip()
    if trim_outer_punct and s:
        s = OUTER_PUNCT_RE.sub('', s).strip()
    return s
def jaccard_word_level(a: str, b: str) -> float:
    a_n = normalize_text(a)
    b_n = normalize_text(b)
    if not a_n and not b_n:
        return 1.0
    a_set = set(a_n.split())
    b_set = set(b_n.split())
    if not a_set and not b_set:
        return 1.0
    inter = len(a_set & b_set)
    union = len(a_set | b_set)
    return inter / union if union else 0.0

# Memory-friendly env settings
os.environ.setdefault('TOKENIZERS_PARALLELISM', 'false')
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True,max_split_size_mb:64')
torch.backends.cuda.matmul.allow_tf32 = True
try: torch.set_float32_matmul_precision('high')
except Exception: pass

SEED = 2025
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)

MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
MAX_LENGTH = 512
DOC_STRIDE = 256
MAX_ANSWER_LEN = 64
NEG_RATIO = 2
EPOCHS = 3
LR = 2e-5
TRAIN_BS = 1  # ultra-conservative for a 24GB A10; rely on GRAD_ACCUM (bf16/TF32 below need Ampere or newer, so this cell will not run as-is on a V100)
EVAL_BS = 4   # reduce eval bs to avoid OOM during prediction
GRAD_ACCUM = 16  # effective batch size kept reasonable
N_BEST = 25

print('Retokenizing with', MODEL_NAME, 'len', MAX_LENGTH, 'stride', DOC_STRIDE)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

full_df = pd.read_csv('train.csv').merge(pd.read_csv('folds.csv'), on='id', how='left')
assert full_df['fold'].notna().all()
test_df = pd.read_csv('test.csv')

def prepare_train_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions = [], []
    cls_positions = []  # track per-feature CLS index for robust negative sampling
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace(): a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace(): a_end -= 1
        if a_start >= a_end: a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0
        cls_positions.append(cls_index)
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None: continue
            ts, te = o
            if ts is None or te is None: continue
            if st_tok is None and ts <= a_start < te: st_tok = t
            if ts <= tgt_end_incl < te: en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
    tok['start_positions'] = start_positions; tok['end_positions'] = end_positions; tok['overflow_to_sample_mapping'] = smap; tok['cls_positions'] = cls_positions
    return tok

def prepare_pred_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

# Pre-tokenize TEST once for 512 setting
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])
test_tok_all = test_ds.map(prepare_pred_features_fn, batched=True, remove_columns=test_ds.column_names, desc='xlmr512_test_tokenize_allfolds')
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')

os.makedirs('xlmr_large_512_test_logits', exist_ok=True)
with open('xlmr_large_512_test_logits/test_example_id.json', 'w', encoding='utf-8') as f:
    json.dump(list(test_tok_all['example_id']), f, ensure_ascii=False)
np.save('xlmr_large_512_test_logits/test_offset_mapping.npy', np.array(test_tok_all['offset_mapping'], dtype=object), allow_pickle=True)

all_oof_scores = []; oof_rows = []; test_start_list = []; test_end_list = []
t_global = time.time()
for FOLD in range(5):
    t_fold = time.time()
    print(f'\n===== XLM-R Large 512 Fold {FOLD} =====')
    tr_df = full_df[full_df['fold'] != FOLD].reset_index(drop=True)
    va_df = full_df[full_df['fold'] == FOLD].reset_index(drop=True)
    tr_ds = Dataset.from_pandas(tr_df[['id','question','context','answer_text','answer_start','language']])
    va_ds = Dataset.from_pandas(va_df[['id','question','context','answer_text','answer_start','language']])
    t0 = time.time()
    tr_tok = tr_ds.map(prepare_train_features_fn, batched=True, remove_columns=tr_ds.column_names, desc=f'xlmr512_train_tokenize_f{FOLD}')
    va_tok_all = va_ds.map(prepare_pred_features_fn, batched=True, remove_columns=va_ds.column_names, desc=f'xlmr512_valid_tokenize_f{FOLD}')
    print(f'Fold {FOLD}: train feats pre-sample={tr_tok.num_rows} | valid feats={va_tok_all.num_rows} | tok_time={time.time()-t0:.1f}s')
    # Negative sampling (robust via cls_positions)
    smap = np.array(tr_tok['overflow_to_sample_mapping']); sp = np.array(tr_tok['start_positions']); cls_arr = np.array(tr_tok['cls_positions'])
    is_pos = (sp != cls_arr)
    keep = []; by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(smap): by_ex[ex].append(idx)
    for ex, idxs in by_ex.items():
        idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
        else:
            cap = min(len(neg), NEG_RATIO*len(pos))
            if cap > 0:
                sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
    keep = np.array(sorted(set(keep))); before = tr_tok.num_rows
    tr_tok = tr_tok.select(keep.tolist()); after = tr_tok.num_rows
    print(f'Fold {FOLD}: neg-sample {before}->{after} ({after/before:.2%})')
    # Torch datasets
    tr_ds_torch = tr_tok.remove_columns([c for c in tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
    tr_ds_torch.set_format(type='torch')
    va_ds_pred = va_tok_all.remove_columns([c for c in va_tok_all.column_names if c not in ['input_ids','attention_mask']])
    va_ds_pred.set_format(type='torch')
    # Model
    torch.cuda.empty_cache()
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
    try: model.gradient_checkpointing_enable()
    except Exception: pass
    args = TrainingArguments(
        output_dir=f'xlmr_large_512_f{FOLD}',
        num_train_epochs=EPOCHS,
        learning_rate=LR,
        per_device_train_batch_size=TRAIN_BS,
        per_device_eval_batch_size=EVAL_BS,
        gradient_accumulation_steps=GRAD_ACCUM,
        fp16=False,
        bf16=True,
        logging_steps=200,
        save_strategy='no',
        evaluation_strategy='no',
        seed=SEED,
        dataloader_num_workers=2,
        report_to=[],
        gradient_checkpointing=True,
        warmup_ratio=0.1,
        weight_decay=0.01,
        max_grad_norm=1.0
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds_torch,
        eval_dataset=None,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    print(f'Fold {FOLD}: training...')
    t1 = time.time()
    try:
        trainer.train()
    except RuntimeError as e:
        if 'out of memory' in str(e).lower():
            print('OOM encountered; retrying with TRAIN_BS=1, EVAL_BS=2, GRAD_ACCUM=24, and fresh model...')
            del trainer, model
            torch.cuda.empty_cache()
            model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
            try: model.gradient_checkpointing_enable()
            except Exception: pass
            args = TrainingArguments(
                output_dir=f'xlmr_large_512_f{FOLD}',
                num_train_epochs=EPOCHS,
                learning_rate=LR,
                per_device_train_batch_size=1,
                per_device_eval_batch_size=2,
                gradient_accumulation_steps=24,
                fp16=False,
                bf16=True,
                logging_steps=200,
                save_strategy='no',
                evaluation_strategy='no',
                seed=SEED,
                dataloader_num_workers=2,
                report_to=[],
                gradient_checkpointing=True,
                warmup_ratio=0.1,
                weight_decay=0.01,
                max_grad_norm=1.0,
            )
            trainer = Trainer(
                model=model,
                args=args,
                train_dataset=tr_ds_torch,
                eval_dataset=None,
                tokenizer=tokenizer,
                data_collator=DataCollatorWithPadding(tokenizer),
            )
            trainer.train()
        else:
            raise
    print(f'Fold {FOLD}: train_done in {time.time()-t1:.1f}s | total_elapsed={time.time()-t_fold:.1f}s')
    # Predict valid
    print(f'Fold {FOLD}: predicting valid...')
    pred = trainer.predict(va_ds_pred)
    s_logp = log_softmax(torch.tensor(pred.predictions[0]), dim=-1).numpy()
    e_logp = log_softmax(torch.tensor(pred.predictions[1]), dim=-1).numpy()
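    ex_best = {}
    # Build the id -> context lookup once instead of filtering va_df for every feature
    ctx_by_id = dict(zip(va_df['id'].astype(str), va_df['context']))
    for i in range(len(va_tok_all)):
        feat = va_tok_all[i]
        ex_id = feat['example_id']
        offsets = feat['offset_mapping']
        ctx = ctx_by_id[ex_id]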
        s = s_logp[i]; e = e_logp[i]
        mask = np.array([o is not None for o in offsets], dtype=bool)
        s_m = np.where(mask, s, -1e9); e_m = np.where(mask, e, -1e9)
        top_s = np.argsort(s_m)[-N_BEST:]; top_e = np.argsort(e_m)[-N_BEST:]
        best = (float('-inf'), '')
        for si in top_s:
            for ei in top_e:
                if ei < si: continue
                if (ei - si + 1) > MAX_ANSWER_LEN: continue
                if offsets[si] is None or offsets[ei] is None: continue
                cs,_ = offsets[si]; _,ce = offsets[ei]
                if cs is None or ce is None or ce <= cs: continue
                cand = ctx[cs:ce]; score = float(s_m[si] + e_m[ei])
                if score > best[0]: best = (score, cand)
        if best[0] == float('-inf'): best = (-1e9, '')
        prev = ex_best.get(ex_id, (float('-inf'), ''))
        if best[0] > prev[0]: ex_best[ex_id] = best
    preds = {k:v[1] for k,v in ex_best.items()}
    y_true = va_df.assign(id_str=va_df['id'].astype(str)).set_index('id_str')['answer_text'].to_dict()
    fold_scores = [jaccard_word_level(y_true[k], preds.get(k, '')) for k in y_true.keys()]
    f_j = float(np.mean(fold_scores)) if fold_scores else 0.0
    all_oof_scores.append(f_j)
    print(f'Fold {FOLD} Jaccard: {f_j:.4f} | elapsed_fold={time.time()-t_fold:.1f}s')
    for k,v in preds.items():
        oof_rows.append({'id': k, 'pred_text': v, 'gt': y_true.get(k, ''), 'fold': FOLD})

    # Predict TEST and save logits
    print(f'Fold {FOLD}: predicting TEST and saving logits (512 setting)...')
    te_pred = trainer.predict(te_ds_pred)
    te_start = te_pred.predictions[0]
    te_end = te_pred.predictions[1]
    np.save(f'xlmr_large_512_test_logits/test_start_logits_f{FOLD}.npy', te_start)
    np.save(f'xlmr_large_512_test_logits/test_end_logits_f{FOLD}.npy', te_end)
    test_start_list.append(te_start)
    test_end_list.append(te_end)

    del model, trainer, tr_ds_torch, va_ds_pred, tr_tok, va_tok_all, pred, s_logp, e_logp, te_pred
    torch.cuda.empty_cache()

# Save OOF, mean
oof_df = pd.DataFrame(oof_rows)
oof_df.to_csv('oof_preds_xlmr_large_512.csv', index=False)
mean_oof = float(np.mean(all_oof_scores)) if all_oof_scores else 0.0
print('XLM-R Large 512 5-fold OOF Jaccard (mean):', round(mean_oof, 4))

# Average TEST logits across folds
print('XLM-R Large 512: Averaging TEST logits across folds...')
start_stack = np.stack(test_start_list, axis=0)
end_stack = np.stack(test_end_list, axis=0)
start_avg = start_stack.mean(axis=0)
end_avg = end_stack.mean(axis=0)
np.savez_compressed('xlmr_large_512_test_avg.npz', start=start_avg, end=end_avg)
print('Saved xlmr_large_512_test_avg.npz; example_id/offsets saved under xlmr_large_512_test_logits/')
print('XLM-R Large 512 5-fold done in %.1fs' % (time.time()-t_global))

Retokenizing with deepset/xlm-roberta-large-squad2 len 512 stride 256
xlmr512_test_tokenize_allfolds: 100%|██████████| 112/112 [00:02<00:00, 46.21 examples/s]

===== XLM-R Large 512 Fold 0 =====
xlmr512_train_tokenize_f0: 100%|██████████| 817/817 [00:18<00:00, 43.63 examples/s]
xlmr512_valid_tokenize_f0: 100%|██████████| 185/185 [00:04<00:00, 38.07 examples/s]
Fold 0: train feats pre-sample=10484 | valid feats=2301 | tok_time=23.7s
Fold 0: neg-sample 10484->2995 (28.57%)
Some weights of the model checkpoint at deepset/xlm-roberta-large-squad2 were not used when initializing XLMRobertaForQuestionAnswering: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Fold 0: training...
Fold 0: train_done in 1960.8s | total_elapsed=1985.8s
Fold 0: predicting valid...
Fold 0 Jaccard: 0.7057 | elapsed_fold=2097.2s
Fold 0: predicting TEST and saving logits (512 setting)...

===== XLM-R Large 512 Fold 1 =====
xlmr512_train_tokenize_f1: 100%|██████████| 807/807 [00:16<00:00, 48.70 examples/s]
xlmr512_valid_tokenize_f1: 100%|██████████| 195/195 [00:04<00:00, 44.95 examples/s]
Fold 1: train feats pre-sample=10396 | valid feats=2389 | tok_time=21.0s
Fold 1: neg-sample 10396->2932 (28.20%)
Fold 1: training...
Fold 1: train_done in 2076.0s | total_elapsed=2098.7s
Fold 1: predicting valid...
Fold 1 Jaccard: 0.6167 | elapsed_fold=2305.8s
Fold 1: predicting TEST and saving logits (512 setting)...

===== XLM-R Large 512 Fold 2 =====
xlmr512_train_tokenize_f2: 100%|██████████| 784/784 [00:16<00:00, 48.62 examples/s]
xlmr512_valid_tokenize_f2: 100%|██████████| 218/218 [00:05<00:00, 36.64 examples/s]
Fold 2: train feats pre-sample=9842 | valid feats=2943 | tok_time=22.2s
Fold 2: neg-sample 9842->2810 (28.55%)
Fold 2: training...
Fold 2: train_done in 1852.3s | total_elapsed=1877.2s
Fold 2: predicting valid...
Fold 2 Jaccard: 0.6377 | elapsed_fold=2027.6s
Fold 2: predicting TEST and saving logits (512 setting)...

===== XLM-R Large 512 Fold 3 =====
xlmr512_train_tokenize_f3: 100%|██████████| 795/795 [00:16<00:00, 49.57 examples/s]
xlmr512_valid_tokenize_f3: 100%|██████████| 207/207 [00:05<00:00, 37.10 examples/s]
Fold 3: train feats pre-sample=9960 | valid feats=2825 | tok_time=21.7s
Fold 3: neg-sample 9960->2857 (28.68%)
Fold 3: training...
OOM encountered; retrying with TRAIN_BS=1, EVAL_BS=2, GRAD_ACCUM=24, and fresh model...

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 15.77 GiB of which 7.00 MiB is free. Process 1299294 has 1.68 GiB memory in use. Process 1514573 has 3.27 GiB memory in use. Process 1609549 has 5.15 GiB memory in use. Process 1639108 has 1.15 GiB memory in use. Process 1685557 has 4.50 GiB memory in use. Of the allocated memory 4.77 GiB is allocated by PyTorch, and 8.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
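
Note on the fold scores logged above: they come from `jaccard_word_level`, which is defined in an earlier cell that is not shown here. The following is a minimal sketch of such a scorer, assuming the usual word-level definition (lowercase, split on whitespace, intersection over union of the word sets). The name `jaccard_word_level_sketch` and the convention of returning 1.0 when both strings are empty are assumptions of this sketch, not taken from the notebook.

```python
def jaccard_word_level_sketch(truth: str, pred: str) -> float:
    """Word-level Jaccard between two strings (hypothetical stand-in for jaccard_word_level)."""
    a = set(str(truth).lower().split())
    b = set(str(pred).lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty answers count as a perfect match
    inter = a & b
    return len(inter) / (len(a) + len(b) - len(inter))


# Interim mean over the three folds that finished before the OOM (values copied from the log)
fold_means = [0.7057, 0.6167, 0.6377]
interim_mean = sum(fold_means) / len(fold_means)
print('Interim 3-fold mean Jaccard:', round(interim_mean, 4))
```

The interim mean covers folds 0 to 2 only, so it is not comparable to a full 5-fold OOF score.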

In [8]:
# Build blended submissions per expert advice (conditional on availability of xlmr512):
import shutil, os

xlmr512_ready = os.path.exists('xlmr_large_512_test_avg.npz') and os.path.exists('xlmr_large_512_test_logits/test_example_id.json') and os.path.exists('xlmr_large_512_test_logits/test_offset_mapping.npy')

if xlmr512_ready:
    # Primary weights (xlmr384, muril, mdeberta, xlmr512) = (0.15, 0.25, 0.00, 0.60)
    sub = build_submission(blend_weights=(0.15, 0.25, 0.00, 0.60))
    if os.path.exists('submission.csv'):
        shutil.copy('submission.csv', 'submission_primary_015_025_000_060.csv')

    # Safety blends (no MuRIL, no mDeBERTa), 512-heavy
    sub_b = build_submission(blend_weights=(0.35, 0.00, 0.00, 0.65))
    if os.path.exists('submission.csv'):
        shutil.copy('submission.csv', 'submission_safety_035_000_000_065.csv')

    sub_c = build_submission(blend_weights=(0.30, 0.00, 0.00, 0.70))
    if os.path.exists('submission.csv'):
        shutil.copy('submission.csv', 'submission_safety_030_000_000_070.csv')

    # Extra safety: 512-only
    sub_d = build_submission(blend_weights=(0.00, 0.00, 0.00, 1.00))
    if os.path.exists('submission.csv'):
        shutil.copy('submission.csv', 'submission_safety_000_000_000_100.csv')

    # Optional alt decode variant: same primary weights but LENGTH_PENALTY=0.005 and TEMP=0.90
    try:
        _old_lp = LENGTH_PENALTY
        _old_temp = LOGIT_TEMPERATURE
    except NameError:
        _old_lp = 0.0045
        _old_temp = 1.00
    LENGTH_PENALTY = 0.0050
    LOGIT_TEMPERATURE = 0.90
    try:
        sub_alt = build_submission(blend_weights=(0.15, 0.25, 0.00, 0.60))
        if os.path.exists('submission.csv'):
            shutil.copy('submission.csv', 'submission_primary_altdec_lp0.0050_t0.90_015_025_000_060.csv')
    finally:
        # Restore defaults even if the alt decode raises or is interrupted
        LENGTH_PENALTY = _old_lp
        LOGIT_TEMPERATURE = _old_temp

    # Restore primary as the final submission.csv
    sub = build_submission(blend_weights=(0.15, 0.25, 0.00, 0.60))
else:
    # Fallback: 2-model blends without xlmr512 (order: xlmr384, muril)
    sub_2a = build_submission(blend_weights=(0.70, 0.30))
    if os.path.exists('submission.csv'):
        shutil.copy('submission.csv', 'submission_twomodel_070_030.csv')

    sub_2b = build_submission(blend_weights=(0.60, 0.40))
    if os.path.exists('submission.csv'):
        shutil.copy('submission.csv', 'submission_twomodel_060_040.csv')

    # Keep 0.70/0.30 as current default submission.csv
    sub = build_submission(blend_weights=(0.70, 0.30))

sub.head()

Saved submission.csv (112, 2) in 30.2s


Saved submission.csv (112, 2) in 30.3s


Saved submission.csv (112, 2) in 31.1s


KeyboardInterrupt: 
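
Note on the filenames in the cell above: each one encodes its blend weights by hand (for example `015_025_000_060`), which is easy to get out of sync with the tuple passed to `build_submission`. A small helper could derive the tag from the weights and check that they sum to 1. `weights_tag` is a hypothetical name, not part of the notebook.

```python
import math


def weights_tag(weights, tol=1e-6):
    """Format blend weights as zero-padded percentages joined by underscores (hypothetical helper)."""
    if not math.isclose(sum(weights), 1.0, abs_tol=tol):
        raise ValueError(f'blend weights should sum to 1, got {sum(weights):.6f}')
    return '_'.join(f'{int(round(w * 100)):03d}' for w in weights)


# Stream order as in the cell comment: (xlmr384, muril, mdeberta, xlmr512)
primary = (0.15, 0.25, 0.00, 0.60)
print(f'submission_primary_{weights_tag(primary)}.csv')
```

Weights finer than one percent would not survive this rounding, so the tag is only suitable for the coarse weights used here.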

In [4]:
# Utility: free GPU by killing stray Python processes (use with caution)
import os, subprocess, signal, time, sys

def list_gpu_procs():
    try:
        out = subprocess.check_output([
            'bash','-lc',
            "nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits || true"
        ], stderr=subprocess.STDOUT).decode('utf-8', errors='ignore')
        procs = []
        for line in out.strip().splitlines():
            parts = [p.strip() for p in line.split(',')]
            if len(parts) >= 3:
                try:
                    procs.append({'pid': int(parts[0]), 'name': parts[1], 'mem_mb': int(parts[2])})
                except Exception:
                    pass
        return procs
    except Exception as e:
        print('nvidia-smi query failed:', e)
        return []

me = os.getpid()
print('Current PID:', me)
procs = list_gpu_procs()
print('GPU processes before:', procs)

# Kill other processes holding GPU memory (not this PID); the mem_mb > 0 clause also matches non-Python processes
killed = []
for p in procs:
    if p['pid'] != me and ('python' in (p['name'] or '').lower() or (p['mem_mb'] > 0)):
        try:
            print('Sending SIGTERM to PID', p['pid'], p)
            os.kill(p['pid'], signal.SIGTERM)
            killed.append(p['pid'])
        except Exception as e:
            print('SIGTERM failed for', p['pid'], e)

time.sleep(2.0)
procs_after = list_gpu_procs()
print('GPU processes after SIGTERM:', procs_after)

# Force kill remaining heavy holders if needed
for p in procs_after:
    if p['pid'] != me and p['pid'] in killed:
        try:
            print('Sending SIGKILL to PID', p['pid'])
            os.kill(p['pid'], signal.SIGKILL)
        except Exception as e:
            print('SIGKILL failed for', p['pid'], e)

time.sleep(1.0)
print('Final GPU processes:', list_gpu_procs())
print('GPU cleanup done. If training still OOMs at Trainer init, restart kernel and rerun cell 12.')

Current PID: 2977
GPU processes before: [{'pid': 1299294, 'name': '[Not Found]', 'mem_mb': 1164}, {'pid': 1514573, 'name': '[Not Found]', 'mem_mb': 3352}, {'pid': 1609549, 'name': '[Not Found]', 'mem_mb': 5278}, {'pid': 1639108, 'name': '[Not Found]', 'mem_mb': 1174}, {'pid': 1685557, 'name': '[Not Found]', 'mem_mb': 4608}]
Sending SIGTERM to PID 1299294 {'pid': 1299294, 'name': '[Not Found]', 'mem_mb': 1164}
SIGTERM failed for 1299294 [Errno 3] No such process
Sending SIGTERM to PID 1514573 {'pid': 1514573, 'name': '[Not Found]', 'mem_mb': 3352}
SIGTERM failed for 1514573 [Errno 3] No such process
Sending SIGTERM to PID 1609549 {'pid': 1609549, 'name': '[Not Found]', 'mem_mb': 5278}
SIGTERM failed for 1609549 [Errno 3] No such process
Sending SIGTERM to PID 1639108 {'pid': 1639108, 'name': '[Not Found]', 'mem_mb': 1174}
SIGTERM failed for 1639108 [Errno 3] No such process
Sending SIGTERM to PID 1685557 {'pid': 1685557, 'name': '[Not Found]', 'mem_mb': 4608}
SIGTERM failed for 1685557 

GPU processes after SIGTERM: [{'pid': 1299294, 'name': '[Not Found]', 'mem_mb': 1164}, {'pid': 1514573, 'name': '[Not Found]', 'mem_mb': 3352}, {'pid': 1609549, 'name': '[Not Found]', 'mem_mb': 5278}, {'pid': 1639108, 'name': '[Not Found]', 'mem_mb': 1174}, {'pid': 1685557, 'name': '[Not Found]', 'mem_mb': 4608}]


Final GPU processes: [{'pid': 1299294, 'name': '[Not Found]', 'mem_mb': 1164}, {'pid': 1514573, 'name': '[Not Found]', 'mem_mb': 3352}, {'pid': 1609549, 'name': '[Not Found]', 'mem_mb': 5278}, {'pid': 1639108, 'name': '[Not Found]', 'mem_mb': 1174}, {'pid': 1685557, 'name': '[Not Found]', 'mem_mb': 4608}]
GPU cleanup done. If training still OOMs at Trainer init, restart kernel and rerun cell 12.
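
Note on the output above: every SIGTERM fails with "No such process" and every process name shows as `[Not Found]`, so no GPU memory was freed. One plausible explanation (an assumption, not verified in this log) is that nvidia-smi reports PIDs from a different PID namespace, such as the host outside a container, so they cannot be signalled from this kernel. A minimal sketch of checking whether a PID is visible before trying to signal it, assuming a POSIX system; `pid_visible` is a hypothetical helper.

```python
import os


def pid_visible(pid: int) -> bool:
    """Return True if `pid` exists in this process's PID namespace (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


gpu_pids = [1299294, 1514573, 1609549, 1639108, 1685557]  # PIDs copied from the log above
signalable = [p for p in gpu_pids if pid_visible(p)]
print('PIDs visible from this kernel:', signalable)
```

If the list comes back empty, the memory has to be released from wherever those processes actually live, for example by restarting the other kernels or sessions.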


In [11]:
# Build per-language blended submission (renormalizes by available streams)
import os, shutil

per_lang = {
    'hindi':  (0.35, 0.40, 0.00, 0.25),
    'tamil':  (0.45, 0.20, 0.00, 0.35),
}

sub_lang = build_submission(per_language_weights=per_lang)
if os.path.exists('submission.csv'):
    shutil.copy('submission.csv', 'submission_per_language.csv')
sub_lang.head()

Saved submission.csv (112, 2) in 28.9s


          id                                   PredictionString
0  be799d365
1  26f356026
2  57a56c43f
3  da062fdbb                                                 रक
4  72fc0d5b5  ोज़ेफ गोयबल्स को अपना प्रचारमंत्री नियुक्त किय...
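
Note on the cell comment above: it says the per-language weights are renormalized over the streams that are available. `build_submission` is defined elsewhere, so the following is only a sketch of what such a renormalization could look like. `renormalize_weights` and the `available` flags are hypothetical.

```python
def renormalize_weights(weights, available):
    """Zero out unavailable streams and rescale the rest to sum to 1 (hypothetical helper)."""
    kept = [w if ok else 0.0 for w, ok in zip(weights, available)]
    total = sum(kept)
    if total <= 0:
        raise ValueError('no available stream has a positive weight')
    return tuple(w / total for w in kept)


# Stream order: (xlmr384, muril, mdeberta, xlmr512)
per_lang = {
    'hindi': (0.35, 0.40, 0.00, 0.25),
    'tamil': (0.45, 0.20, 0.00, 0.35),
}
available = (True, True, False, False)  # assumption for illustration: only xlmr384 and muril exist
for lang, w in per_lang.items():
    print(lang, [round(x, 4) for x in renormalize_weights(w, available)])
```

Renormalizing silently changes the effective blend, so it is worth logging the final weights whenever a stream is missing.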


In [None]:
# Tiny decode grid for 2-model blend (xlmr384, muril) and empty-answer fallback; writes multiple CSVs and final submission.csv
import os, shutil, pandas as pd

def run_blend_with_params(wx=0.75, wm=0.25, n_best=80, len_pen=0.004, temp=0.90, fallback=False, fb_len_pen=0.0, fb_max_len=32, tag=''):
    # Decode the 2-model blend under the given settings, copy submission.csv to a tagged file, return its name
    global N_BEST_CHAR, LENGTH_PENALTY, LOGIT_TEMPERATURE, MAX_ANSWER_LEN
    prev_max_len = MAX_ANSWER_LEN
    N_BEST_CHAR = n_best
    LOGIT_TEMPERATURE = temp
    if not fallback:
        LENGTH_PENALTY = len_pen
        out_name = f'sub_2m_nb{n_best}_lp{len_pen:.4f}_t{temp:.2f}_{wx:0.2f}_{wm:0.2f}{tag}.csv'
    else:
        # fallback re-decode with zero length penalty and tighter max answer length
        LENGTH_PENALTY = fb_len_pen
        MAX_ANSWER_LEN = fb_max_len
        out_name = f'sub_2m_fallback_nb{n_best}_lp{fb_len_pen:.4f}_max{fb_max_len}_t{temp:.2f}_{wx:0.2f}_{wm:0.2f}{tag}.csv'
    try:
        build_submission(blend_weights=(wx, wm))
        if os.path.exists('submission.csv'): shutil.copy('submission.csv', out_name)
        return out_name
    finally:
        # Restore the answer-length cap even if decoding raises or is interrupted
        MAX_ANSWER_LEN = prev_max_len

# Grid params per expert advice
grid = [
    dict(n_best=80, len_pen=0.004, temp=0.90),
    dict(n_best=80, len_pen=0.004, temp=1.00),
    dict(n_best=80, len_pen=0.005, temp=0.90),
    dict(n_best=80, len_pen=0.005, temp=1.00),
]

wx, wm = 0.75, 0.25
generated = []
for g in grid:
    tag = f"_nb{g['n_best']}_lp{g['len_pen']:.4f}_t{g['temp']:.2f}"
    base_file = run_blend_with_params(wx=wx, wm=wm, n_best=g['n_best'], len_pen=g['len_pen'], temp=g['temp'], fallback=False, tag=tag)
    fb_file = run_blend_with_params(wx=wx, wm=wm, n_best=g['n_best'], len_pen=g['len_pen'], temp=g['temp'], fallback=True, fb_len_pen=0.0, fb_max_len=32, tag=tag)
    # Combine: if base PredictionString is empty, take from fallback
    try:
        b = pd.read_csv(base_file)
        f = pd.read_csv(fb_file)
        assert (b['id'] == f['id']).all()
        comb = b.copy()
        mask_empty = comb['PredictionString'].fillna('') == ''
        comb.loc[mask_empty, 'PredictionString'] = f.loc[mask_empty, 'PredictionString']
        comb_name = base_file.replace('.csv', '_with_fallback.csv')
        comb.to_csv(comb_name, index=False)
        generated.append(comb_name)
    except Exception as e:
        print('Combine failed for', base_file, 'and', fb_file, e)

# Choose a strong default to write as submission.csv: prefer len_pen=0.004, temp=0.90 with fallback
preferred = [p for p in generated if '_lp0.0040_' in p and '_t0.90_' in p]
final_file = preferred[0] if preferred else (generated[0] if generated else None)
if final_file and os.path.exists(final_file):
    shutil.copy(final_file, 'submission.csv')
    print('Final submission.csv copied from', final_file)
else:
    print('No generated submission from grid; keeping existing submission.csv if present.')
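
Note on the cell above: `run_blend_with_params` changes module-level decode settings and restores only some of them by hand. A context manager is one way to make such overrides temporary. The sketch below is an assumption about how this could be organized, using a plain dict `DECODE_CFG` (hypothetical, with illustrative values) in place of the notebook's globals.

```python
from contextlib import contextmanager

# Hypothetical settings container; the notebook keeps these as module-level globals instead
DECODE_CFG = {
    'N_BEST_CHAR': 80,
    'LENGTH_PENALTY': 0.0045,
    'LOGIT_TEMPERATURE': 1.00,
    'MAX_ANSWER_LEN': 64,
}


@contextmanager
def decode_overrides(cfg, **overrides):
    """Temporarily override entries of `cfg`; the previous values are restored on exit."""
    unknown = set(overrides) - set(cfg)
    if unknown:
        raise KeyError(f'unknown decode settings: {sorted(unknown)}')
    saved = {k: cfg[k] for k in overrides}
    cfg.update(overrides)
    try:
        yield cfg
    finally:
        cfg.update(saved)  # runs even if the body raises or is interrupted


with decode_overrides(DECODE_CFG, LENGTH_PENALTY=0.0, MAX_ANSWER_LEN=32):
    inside = dict(DECODE_CFG)  # a decode call would go here and read the overridden settings
print('inside:', inside['LENGTH_PENALTY'], '| after:', DECODE_CFG['LENGTH_PENALTY'])
```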

In [5]:
# Utility: create interim xlmr_large_512_test_avg.npz from available per-fold logits
import os, re, numpy as np, json

logdir = 'xlmr_large_512_test_logits'
start_files = []; end_files = []
if os.path.isdir(logdir):
    for fn in os.listdir(logdir):
        m = re.match(r'test_start_logits_f(\d+)\.npy$', fn)
        if m:
            f_end = f'test_end_logits_f{m.group(1)}.npy'
            if os.path.exists(os.path.join(logdir, f_end)):
                start_files.append(os.path.join(logdir, fn))
                end_files.append(os.path.join(logdir, f_end))

if start_files:
    start_stack = np.stack([np.load(fp) for fp in sorted(start_files)], axis=0)
    end_stack = np.stack([np.load(fp) for fp in sorted(end_files)], axis=0)
    start_avg = start_stack.mean(axis=0)
    end_avg = end_stack.mean(axis=0)
    np.savez_compressed('xlmr_large_512_test_avg.npz', start=start_avg, end=end_avg)
    print(f'Interim xlmr_large_512_test_avg.npz saved using {start_stack.shape[0]} folds.')
    # ensure example_id and offsets exist
    exid_json = os.path.join(logdir, 'test_example_id.json')
    offs_npy = os.path.join(logdir, 'test_offset_mapping.npy')
    assert os.path.exists(exid_json) and os.path.exists(offs_npy), 'Missing example_id/offset files for 512 stream'
else:
    print('No 512 per-fold logits found to average.')

Interim xlmr_large_512_test_avg.npz saved using 3 folds.
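
Note on the output above: the interim average uses only the 3 folds whose logits exist on disk. To decide which folds still need training, the same filename pattern can be scanned for fold indices. A minimal sketch, assuming 5 folds as in the plan and the `test_start_logits_f{k}.npy` / `test_end_logits_f{k}.npy` naming used above; `completed_folds` is a hypothetical helper.

```python
import os
import re


def completed_folds(filenames):
    """Return sorted fold indices that have both a start and an end logits file."""
    names = set(filenames)
    done = []
    for fn in names:
        m = re.match(r'test_start_logits_f(\d+)\.npy$', fn)
        if m and f'test_end_logits_f{m.group(1)}.npy' in names:
            done.append(int(m.group(1)))
    return sorted(done)


N_FOLDS = 5  # assumption: matches the 5-fold CV described in the plan
logdir = 'xlmr_large_512_test_logits'
filenames = os.listdir(logdir) if os.path.isdir(logdir) else []
done = completed_folds(filenames)
missing = [k for k in range(N_FOLDS) if k not in done]
print('done folds:', done, '| missing folds:', missing)
```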


In [27]:
# Replace submission.csv with safety blend (global 512-heavy) and ensure no NaNs
import os, shutil, pandas as pd
src = 'submission_safety_global_015_005_000_080_nb200_lp0.0044.csv'
assert os.path.exists(src), f'Missing safety CSV: {src}'
df = pd.read_csv(src)
if 'PredictionString' in df.columns:
    df['PredictionString'] = df['PredictionString'].fillna('')
df.to_csv('submission.csv', index=False)
print('submission.csv replaced with', src)
print('Head:')
print(pd.read_csv('submission.csv').head())

submission.csv replaced with submission_safety_global_015_005_000_080_nb200_lp0.0044.csv
Head:
          id               PredictionString
0  be799d365                          मुंबई
1  26f356026  उदासिनाचार्य सुमेरुदास महाराज
2  57a56c43f          பெருமூளை அரைக்கோளங்கள
3  da062fdbb                       बिंबिसार
4  72fc0d5b5                           1889


In [10]:
# Build 512-only blend and ensure no NaNs in submission.csv
import pandas as pd
sub_512 = build_submission(blend_weights=(0.00, 0.00, 0.00, 1.00))
df = pd.read_csv('submission.csv')
if 'PredictionString' in df.columns:
    df['PredictionString'] = df['PredictionString'].fillna('')
df.to_csv('submission.csv', index=False)
print('submission.csv written (512-only weights) with NaNs filled to empty strings. Head:')
print(df.head())

Saved submission.csv (112, 2) in 29.0s
submission.csv written (512-only weights) with NaNs filled to empty strings. Head:
          id PredictionString
0  be799d365   ने खरीदा था। च
1  26f356026                 
2  57a56c43f                 
3  da062fdbb                 
4  72fc0d5b5                ह


In [12]:
# Switch to per-language blend as submission and ensure no NaNs
import os, shutil, pandas as pd
src = 'submission_per_language.csv'
assert os.path.exists(src), f'Missing per-language CSV: {src}'
df = pd.read_csv(src)
if 'PredictionString' in df.columns:
    df['PredictionString'] = df['PredictionString'].fillna('')
df.to_csv('submission.csv', index=False)
print('submission.csv replaced with per-language blend and NaNs filled. Head:')
print(df.head())

submission.csv replaced with per-language blend and NaNs filled. Head:
          id                                   PredictionString
0  be799d365                                                   
1  26f356026                                                   
2  57a56c43f                                                   
3  da062fdbb                                                 रक
4  72fc0d5b5  ोज़ेफ गोयबल्स को अपना प्रचारमंत्री नियुक्त किय...


In [21]:
# Final blends per expert guidance: Primary and Safety
import os, shutil, pandas as pd

assert 'build_submission' in globals(), 'Run the decoding cell that defines build_submission() first (Cell 9).'

# Primary submission: weights (xlmr384, muril, mdeberta, xlmr512) = (0.15, 0.05, 0.00, 0.80)
sub_primary = build_submission(blend_weights=(0.15, 0.05, 0.00, 0.80))
if os.path.exists('submission.csv'):
    # Save a copy and ensure no NaNs
    df_p = pd.read_csv('submission.csv')
    if 'PredictionString' in df_p.columns:
        df_p['PredictionString'] = df_p['PredictionString'].fillna('')
    df_p.to_csv('submission.csv', index=False)
    shutil.copy('submission.csv', 'submission_primary_final.csv')
    print('Primary blend saved to submission.csv and submission_primary_final.csv')

# Safety submission: weights (0.25, 0.00, 0.00, 0.75)
sub_safety = build_submission(blend_weights=(0.25, 0.00, 0.00, 0.75))
if os.path.exists('submission.csv'):
    df_s = pd.read_csv('submission.csv')
    if 'PredictionString' in df_s.columns:
        df_s['PredictionString'] = df_s['PredictionString'].fillna('')
    df_s.to_csv('submission_safety_final.csv', index=False)
    print('Safety blend saved to submission_safety_final.csv')

# Restore primary as the active submission.csv
if os.path.exists('submission_primary_final.csv'):
    shutil.copy('submission_primary_final.csv', 'submission.csv')
    print('submission.csv restored from submission_primary_final.csv')

print('Head of submission.csv:')
print(pd.read_csv('submission.csv').head())

Saved submission.csv (112, 2) in 28.0s
Primary blend saved to submission.csv and submission_primary_final.csv


Saved submission.csv (112, 2) in 28.3s
Safety blend saved to submission_safety_final.csv
submission.csv restored from submission_primary_final.csv
Head of submission.csv:
          id               PredictionString
0  be799d365                          मुंबई
1  26f356026  उदासिनाचार्य सुमेरुदास महाराज
2  57a56c43f          பெருமூளை அரைக்கோளங்கள
3  da062fdbb                       बिंबिसार
4  72fc0d5b5                           1889


In [38]:
# Sanitize final submission.csv: ensure no NaNs and strings only
import pandas as pd, os
assert os.path.exists('submission.csv'), 'submission.csv not found'
df = pd.read_csv('submission.csv')
if 'PredictionString' in df.columns:
    # Fill NaNs before casting so missing values never become the literal string 'nan'
    df['PredictionString'] = df['PredictionString'].fillna('').astype(str)
else:
    raise AssertionError('PredictionString column missing in submission.csv')
df.to_csv('submission.csv', index=False)
print('submission.csv sanitized. Null count:', df['PredictionString'].isna().sum(), '| Empty count:', (df['PredictionString']=='').sum())
print(df.head())

submission.csv sanitized. Null count: 0 | Empty count: 71
          id          PredictionString
0  be799d365                     मुंबई
1  26f356026               सुधांशुबाला
2  57a56c43f  புறணிப் (cerebral cortex
3  da062fdbb                          
4  72fc0d5b5            २० अप्रैल १८८९
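
Note on the sanitizing cells above: they re-read submission.csv and fill NaNs because pandas parses empty fields as missing values by default. One alternative, sketched here under the assumption that the file has only the `id` and `PredictionString` columns, is to read everything as strings with `keep_default_na=False` so that empty predictions come back as empty strings. The three rows below are illustrative, adapted from outputs shown above.

```python
import io

import pandas as pd

# Illustrative rows only; the second row has an empty prediction
csv_text = 'id,PredictionString\nbe799d365,मुंबई\nda062fdbb,\n72fc0d5b5,1889\n'

# dtype=str keeps numeric-looking answers as text; keep_default_na=False keeps '' as ''
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
n_null = int(df['PredictionString'].isna().sum())
n_empty = int((df['PredictionString'] == '').sum())
print('Null count:', n_null, '| Empty count:', n_empty)
```

Reading this way avoids the `astype(str)` step that turns missing values into the literal string `nan`.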


In [19]:
# Resume XLM-R Large 512 training for missing folds only; then rebuild 512 avg npz
import os, time, numpy as np, pandas as pd, collections, random, json, torch, re, gc
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from torch.nn.functional import log_softmax

SEED = 2025
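# Set env vars first: PYTORCH_CUDA_ALLOC_CONF is only read when the CUDA allocator initializes,
# so it has no effect if this kernel has already used the GPU (restart the kernel in that case).
os.environ.setdefault('TOKENIZERS_PARALLELISM', 'false')
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True,max_split_size_mb:64')
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)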
torch.backends.cuda.matmul.allow_tf32 = True
try: torch.set_float32_matmul_precision('high')
except Exception: pass

MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
MAX_LENGTH = 512
DOC_STRIDE = 256
MAX_ANSWER_LEN = 64
NEG_RATIO = 2
EPOCHS = 3
LR = 2e-5
TRAIN_BS = 1
EVAL_BS = 1
GRAD_ACCUM = 16
N_BEST = 25

print('Tokenizer for resume with', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

full_df = pd.read_csv('train.csv').merge(pd.read_csv('folds.csv'), on='id', how='left')
assert full_df['fold'].notna().all()
test_df = pd.read_csv('test.csv')

def prepare_train_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions, cls_positions = [], [], []
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace(): a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace(): a_end -= 1
        if a_start >= a_end: a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0
        cls_positions.append(cls_index)
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None: continue
            ts, te = o
            if ts is None or te is None: continue
            if st_tok is None and ts <= a_start < te: st_tok = t
            if ts <= tgt_end_incl < te: en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
    tok['start_positions'] = start_positions; tok['end_positions'] = end_positions; tok['overflow_to_sample_mapping'] = smap; tok['cls_positions'] = cls_positions
    return tok

def prepare_pred_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

# Tokenize TEST for the 512 setting; write the example-id/offset artifacts only if they are not on disk yet
logdir = 'xlmr_large_512_test_logits'
os.makedirs(logdir, exist_ok=True)
exid_json = os.path.join(logdir, 'test_example_id.json')
offs_npy = os.path.join(logdir, 'test_offset_mapping.npy')
need_tokenize_test = not (os.path.exists(exid_json) and os.path.exists(offs_npy))
# te_ds_pred is needed for prediction in both cases, so tokenization always runs
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])
tok_desc = 'xlmr512_test_tokenize_resume' if need_tokenize_test else 'xlmr512_test_tokenize_resume_cached'
test_tok_all = test_ds.map(prepare_pred_features_fn, batched=True, remove_columns=test_ds.column_names, desc=tok_desc)
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')
if need_tokenize_test:
    with open(exid_json, 'w', encoding='utf-8') as f:
        json.dump(list(test_tok_all['example_id']), f, ensure_ascii=False)
    np.save(offs_npy, np.array(test_tok_all['offset_mapping'], dtype=object), allow_pickle=True)

# Determine missing folds
existing = set()
for f in range(5):
    s_fp = os.path.join(logdir, f'test_start_logits_f{f}.npy')
    e_fp = os.path.join(logdir, f'test_end_logits_f{f}.npy')
    if os.path.exists(s_fp) and os.path.exists(e_fp):
        existing.add(f)
missing = [f for f in range(5) if f not in existing]
print('Existing 512 folds:', sorted(existing), '| Missing:', missing)

test_start_list = []; test_end_list = []
# Also collect already-saved folds for averaging
for f in sorted(existing):
    test_start_list.append(np.load(os.path.join(logdir, f'test_start_logits_f{f}.npy')))
    test_end_list.append(np.load(os.path.join(logdir, f'test_end_logits_f{f}.npy')))

for FOLD in missing:
    t_fold = time.time()
    print(f'===== Resume XLM-R Large 512 Fold {FOLD} =====', flush=True)
    tr_df = full_df[full_df['fold'] != FOLD].reset_index(drop=True)
    va_df = full_df[full_df['fold'] == FOLD].reset_index(drop=True)
    tr_ds = Dataset.from_pandas(tr_df[['id','question','context','answer_text','answer_start','language']])
    va_ds = Dataset.from_pandas(va_df[['id','question','context','answer_text','answer_start','language']])
    t0 = time.time()
    tr_tok = tr_ds.map(prepare_train_features_fn, batched=True, remove_columns=tr_ds.column_names, desc=f'xlmr512_train_tokenize_f{FOLD}')
    print(f'Fold {FOLD}: train feats pre-sample={tr_tok.num_rows} | tok_time={time.time()-t0:.1f}s', flush=True)
    # Negative sampling
    smap = np.array(tr_tok['overflow_to_sample_mapping']); sp = np.array(tr_tok['start_positions']); cls_arr = np.array(tr_tok['cls_positions'])
    is_pos = (sp != cls_arr)
    keep = []; by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(smap): by_ex[ex].append(idx)
    for ex, idxs in by_ex.items():
        idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
        else:
            cap = min(len(neg), NEG_RATIO*len(pos))
            if cap > 0:
                sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
    keep = np.array(sorted(set(keep))); before = tr_tok.num_rows
    tr_tok = tr_tok.select(keep.tolist()); after = tr_tok.num_rows
    print(f'Fold {FOLD}: neg-sample {before}->{after} ({after/before:.2%})', flush=True)
    # Torch datasets
    tr_ds_torch = tr_tok.remove_columns([c for c in tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
    tr_ds_torch.set_format(type='torch')
    # Model + Trainer factory, shared by the first attempt and the OOM retries
    def build_trainer(train_bs, eval_bs, grad_accum):
        gc.collect(); torch.cuda.empty_cache()
        model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
        try:
            model.config.use_cache = False
        except Exception:
            pass
        try:
            model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
        except Exception:
            try: model.gradient_checkpointing_enable()
            except Exception: pass
        args = TrainingArguments(
            output_dir=f'xlmr_large_512_f{FOLD}',
            num_train_epochs=EPOCHS,
            learning_rate=LR,
            per_device_train_batch_size=train_bs,
            per_device_eval_batch_size=eval_bs,
            gradient_accumulation_steps=grad_accum,
            fp16=False,
            bf16=True,
            logging_steps=200,
            save_strategy='no',
            evaluation_strategy='no',
            seed=SEED,
            dataloader_num_workers=0,
            dataloader_pin_memory=False,
            eval_accumulation_steps=1,
            report_to=[],
            gradient_checkpointing=True,
            warmup_ratio=0.1,
            weight_decay=0.01,
            max_grad_norm=1.0,
        )
        trainer = Trainer(
            model=model,
            args=args,
            train_dataset=tr_ds_torch,
            eval_dataset=None,
            tokenizer=tokenizer,
            data_collator=DataCollatorWithPadding(tokenizer),
        )
        return model, trainer

    # (train_bs, eval_bs, grad_accum): configured sizes first, then two OOM fallbacks
    attempts = [(TRAIN_BS, EVAL_BS, GRAD_ACCUM), (1, 1, 24), (1, 1, 32)]
    for attempt, (bs, ebs, ga) in enumerate(attempts):
        if attempt > 0:
            print(f'OOM; retry with GRAD_ACCUM={ga}, EVAL_BS={ebs}', flush=True)
        model, trainer = build_trainer(bs, ebs, ga)
        if attempt == 0:
            print(f'Fold {FOLD}: training...', flush=True)
        oom = False
        try:
            trainer.train()
        except RuntimeError as e:
            # Only CUDA OOM is retried, and only while fallbacks remain
            if 'out of memory' not in str(e).lower() or attempt == len(attempts) - 1:
                raise
            oom = True
        if not oom:
            break
        del trainer, model
        gc.collect(); torch.cuda.empty_cache()
    print(f'Fold {FOLD}: train_done in {time.time()-t_fold:.1f}s', flush=True)
    # Predict TEST and save logits for this fold
    print(f'Fold {FOLD}: predicting TEST and saving logits...', flush=True)
    gc.collect(); torch.cuda.empty_cache()
    model.eval()
    with torch.no_grad():
        te_pred = trainer.predict(te_ds_pred)
    te_start = te_pred.predictions[0]
    te_end = te_pred.predictions[1]
    np.save(os.path.join(logdir, f'test_start_logits_f{FOLD}.npy'), te_start)
    np.save(os.path.join(logdir, f'test_end_logits_f{FOLD}.npy'), te_end)
    test_start_list.append(te_start)
    test_end_list.append(te_end)

    del model, trainer, tr_ds_torch, tr_tok, te_pred
    gc.collect(); torch.cuda.empty_cache()

# Rebuild averaged TEST logits for 512 stream
start_files = sorted([os.path.join(logdir, f) for f in os.listdir(logdir) if re.match(r'test_start_logits_f\d+\.npy$', f)])
end_files = sorted([os.path.join(logdir, f) for f in os.listdir(logdir) if re.match(r'test_end_logits_f\d+\.npy$', f)])
assert len(start_files) == len(end_files) and len(start_files) >= 3, 'Insufficient 512 folds to average'
start_stack = np.stack([np.load(fp) for fp in start_files], axis=0)
end_stack = np.stack([np.load(fp) for fp in end_files], axis=0)
start_avg = start_stack.mean(axis=0)
end_avg = end_stack.mean(axis=0)
np.savez_compressed('xlmr_large_512_test_avg.npz', start=start_avg, end=end_avg)
print(f'xlmr_large_512_test_avg.npz saved with {start_stack.shape[0]} folds averaged.')

Tokenizer for resume with deepset/xlm-roberta-large-squad2




xlmr512_test_tokenize_resume_cached: 100%|██████████| 112/112 [00:03<00:00, 36.09 examples/s]

Existing 512 folds: [0, 1, 2] | Missing: [3, 4]
===== Resume XLM-R Large 512 Fold 3 =====





xlmr512_train_tokenize_f3: 100%|██████████| 795/795 [00:18<00:00, 42.43 examples/s]

Fold 3: train feats pre-sample=9960 | tok_time=18.8s


Fold 3: neg-sample 9960->2857 (28.68%)





Some weights of the model checkpoint at deepset/xlm-roberta-large-squad2 were not used when initializing XLMRobertaForQuestionAnswering: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




Fold 3: training...




Fold 3: train_done in 2515.2s


Fold 3: predicting TEST and saving logits...


===== Resume XLM-R Large 512 Fold 4 =====


xlmr512_train_tokenize_f4: 100%|██████████| 805/805 [00:20<00:00, 39.52 examples/s]

Fold 4: train feats pre-sample=10458 | tok_time=20.4s


Fold 4: neg-sample 10458->2894 (27.67%)









Fold 4: training...




Fold 4: train_done in 2071.2s


Fold 4: predicting TEST and saving logits...


xlmr_large_512_test_avg.npz saved with 5 folds averaged.
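
The negative-sampling step in the cell above draws negatives with an unseeded `np.random.permutation`, so the kept feature set can differ between resumed runs. A minimal seeded sketch of the same idea follows. The function name, the `neg_ratio=2` default and the `seed` argument are assumptions made for illustration; `NEG_RATIO` itself is defined elsewhere in the notebook.

```python
import collections
import numpy as np

def sample_negative_features(sample_map, is_pos, neg_ratio=2, max_neg_if_no_pos=4, seed=42):
    """Return sorted feature indices to keep: every positive plus a capped set of negatives.

    sample_map[i] is the example index of feature i; is_pos[i] marks features
    whose window contains the answer span.
    """
    rng = np.random.default_rng(seed)  # seeded so a resumed run keeps the same features
    is_pos = np.asarray(is_pos, dtype=bool)
    by_example = collections.defaultdict(list)
    for feat_idx, ex_idx in enumerate(sample_map):
        by_example[int(ex_idx)].append(feat_idx)
    keep = []
    for idxs in by_example.values():
        idxs = np.asarray(idxs)
        pos, neg = idxs[is_pos[idxs]], idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            # No window contains the answer: keep the first few negatives, as the cell does
            keep.extend(neg[:max_neg_if_no_pos].tolist())
        else:
            cap = min(len(neg), neg_ratio * len(pos))
            if cap > 0:
                keep.extend(rng.choice(neg, size=cap, replace=False).tolist())
    return sorted(keep)
```

Under these assumptions, `tr_tok.select(sample_negative_features(smap, is_pos, NEG_RATIO, seed=SEED))` would stand in for the inline loop.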


In [22]:
# Build per-language blend with expert-recommended weights and keep as active submission
import os, shutil, pandas as pd
assert 'build_submission' in globals(), 'Run Cell 9 first to define build_submission()'
per_lang_weights = {
    'hindi': (0.30, 0.20, 0.00, 0.50),
    'tamil': (0.20, 0.00, 0.00, 0.80),
}
sub_pl = build_submission(per_language_weights=per_lang_weights)
if os.path.exists('submission.csv'):
    # Save a copy
    shutil.copy('submission.csv', 'submission_per_language_final.csv')
    df = pd.read_csv('submission.csv')
    if 'PredictionString' in df.columns:
        df['PredictionString'] = df['PredictionString'].fillna('')
        df.to_csv('submission.csv', index=False)
    print('Per-language blend saved to submission_per_language_final.csv and active in submission.csv')
print(pd.read_csv('submission.csv').head())

Saved submission.csv (112, 2) in 27.8s
Per-language blend saved to submission_per_language_final.csv and active in submission.csv
          id PredictionString
0  be799d365            मुंबई
1  26f356026      सुधांशुबाला
2  57a56c43f            புறணி
3  da062fdbb         बिंबिसार
4  72fc0d5b5   २० अप्रैल १८८९


In [24]:
# Decode/weight sweep using existing artifacts to generate multiple candidate submissions
import os, shutil, pandas as pd, time
assert 'build_submission' in globals(), 'Run Cell 9 first to define build_submission()'

def set_decode_params(n_best=None, len_pen=None):
    global N_BEST_CHAR, LENGTH_PENALTY
    if n_best is not None:
        N_BEST_CHAR = int(n_best)
    if len_pen is not None:
        LENGTH_PENALTY = float(len_pen)

out_files = []
combos = [
    # (weights tuple, tag, per_language)
    ((0.15, 0.05, 0.00, 0.80), 'w015_005_000_080', None),
    ((0.20, 0.05, 0.00, 0.75), 'w020_005_000_075', None),
    ((0.10, 0.05, 0.00, 0.85), 'w010_005_000_085', None),
    ((0.00, 0.00, 0.00, 1.00), 'w000_000_000_100', None),
]

nbests = [120, 150, 200]
len_pens = [0.0044, 0.0048, 0.0052]

t0 = time.time()
for wts, wtag, plw in combos:
    for nb in nbests:
        for lp in len_pens:
            set_decode_params(n_best=nb, len_pen=lp)
            sub = build_submission(blend_weights=wts, per_language_weights=plw)
            out_name = f'sub_grid_{wtag}_nb{nb}_lp{lp:.4f}.csv'
            if os.path.exists('submission.csv'):
                df = pd.read_csv('submission.csv')
                if 'PredictionString' in df.columns:
                    df['PredictionString'] = df['PredictionString'].fillna('')
                df.to_csv(out_name, index=False)
                out_files.append(out_name)
                print('Wrote', out_name)

# Per-language variants (fixed params)
set_decode_params(n_best=150, len_pen=0.0048)
per_lang_variants = [
    ({'hindi': (0.30,0.20,0.00,0.50), 'tamil': (0.20,0.00,0.00,0.80)}, 'pl_hi030_020_000_050_ta020_000_000_080'),
    ({'hindi': (0.25,0.15,0.00,0.60), 'tamil': (0.15,0.00,0.00,0.85)}, 'pl_hi025_015_000_060_ta015_000_000_085'),
]
for plw, tag in per_lang_variants:
    sub = build_submission(per_language_weights=plw)
    out_name = f'sub_grid_{tag}_nb150_lp0.0048.csv'
    if os.path.exists('submission.csv'):
        df = pd.read_csv('submission.csv')
        if 'PredictionString' in df.columns:
            df['PredictionString'] = df['PredictionString'].fillna('')
        df.to_csv(out_name, index=False)
        out_files.append(out_name)
        print('Wrote', out_name)

print(f'Decode/weight sweep complete: {len(out_files)} files in {time.time()-t0:.1f}s')

Saved submission.csv (112, 2) in 28.5s
Wrote sub_grid_w015_005_000_080_nb120_lp0.0044.csv


Saved submission.csv (112, 2) in 28.4s
Wrote sub_grid_w015_005_000_080_nb120_lp0.0048.csv


Saved submission.csv (112, 2) in 27.7s
Wrote sub_grid_w015_005_000_080_nb120_lp0.0052.csv


Saved submission.csv (112, 2) in 28.1s
Wrote sub_grid_w015_005_000_080_nb150_lp0.0044.csv


Saved submission.csv (112, 2) in 29.0s
Wrote sub_grid_w015_005_000_080_nb150_lp0.0048.csv


Saved submission.csv (112, 2) in 28.1s
Wrote sub_grid_w015_005_000_080_nb150_lp0.0052.csv


Saved submission.csv (112, 2) in 29.7s
Wrote sub_grid_w015_005_000_080_nb200_lp0.0044.csv


Saved submission.csv (112, 2) in 30.1s
Wrote sub_grid_w015_005_000_080_nb200_lp0.0048.csv


Saved submission.csv (112, 2) in 28.5s
Wrote sub_grid_w015_005_000_080_nb200_lp0.0052.csv


Saved submission.csv (112, 2) in 28.5s
Wrote sub_grid_w020_005_000_075_nb120_lp0.0044.csv


Saved submission.csv (112, 2) in 28.2s
Wrote sub_grid_w020_005_000_075_nb120_lp0.0048.csv


Saved submission.csv (112, 2) in 28.9s
Wrote sub_grid_w020_005_000_075_nb120_lp0.0052.csv


Saved submission.csv (112, 2) in 28.5s
Wrote sub_grid_w020_005_000_075_nb150_lp0.0044.csv


Saved submission.csv (112, 2) in 27.2s
Wrote sub_grid_w020_005_000_075_nb150_lp0.0048.csv


Saved submission.csv (112, 2) in 27.8s
Wrote sub_grid_w020_005_000_075_nb150_lp0.0052.csv


Saved submission.csv (112, 2) in 28.0s
Wrote sub_grid_w020_005_000_075_nb200_lp0.0044.csv


Saved submission.csv (112, 2) in 27.5s
Wrote sub_grid_w020_005_000_075_nb200_lp0.0048.csv


Saved submission.csv (112, 2) in 27.7s
Wrote sub_grid_w020_005_000_075_nb200_lp0.0052.csv


Saved submission.csv (112, 2) in 28.0s
Wrote sub_grid_w010_005_000_085_nb120_lp0.0044.csv


Saved submission.csv (112, 2) in 28.2s
Wrote sub_grid_w010_005_000_085_nb120_lp0.0048.csv


Saved submission.csv (112, 2) in 28.2s
Wrote sub_grid_w010_005_000_085_nb120_lp0.0052.csv


Saved submission.csv (112, 2) in 27.9s
Wrote sub_grid_w010_005_000_085_nb150_lp0.0044.csv


Saved submission.csv (112, 2) in 29.6s
Wrote sub_grid_w010_005_000_085_nb150_lp0.0048.csv


Saved submission.csv (112, 2) in 29.9s
Wrote sub_grid_w010_005_000_085_nb150_lp0.0052.csv


Saved submission.csv (112, 2) in 29.9s
Wrote sub_grid_w010_005_000_085_nb200_lp0.0044.csv


Saved submission.csv (112, 2) in 28.2s
Wrote sub_grid_w010_005_000_085_nb200_lp0.0048.csv


Saved submission.csv (112, 2) in 28.7s
Wrote sub_grid_w010_005_000_085_nb200_lp0.0052.csv


Saved submission.csv (112, 2) in 29.5s
Wrote sub_grid_w000_000_000_100_nb120_lp0.0044.csv


Saved submission.csv (112, 2) in 28.5s
Wrote sub_grid_w000_000_000_100_nb120_lp0.0048.csv


Saved submission.csv (112, 2) in 29.2s
Wrote sub_grid_w000_000_000_100_nb120_lp0.0052.csv


Saved submission.csv (112, 2) in 27.8s
Wrote sub_grid_w000_000_000_100_nb150_lp0.0044.csv


Saved submission.csv (112, 2) in 27.7s
Wrote sub_grid_w000_000_000_100_nb150_lp0.0048.csv


Saved submission.csv (112, 2) in 27.9s
Wrote sub_grid_w000_000_000_100_nb150_lp0.0052.csv


Saved submission.csv (112, 2) in 28.7s
Wrote sub_grid_w000_000_000_100_nb200_lp0.0044.csv


Saved submission.csv (112, 2) in 29.1s
Wrote sub_grid_w000_000_000_100_nb200_lp0.0048.csv


Saved submission.csv (112, 2) in 27.6s
Wrote sub_grid_w000_000_000_100_nb200_lp0.0052.csv


Saved submission.csv (112, 2) in 29.1s
Wrote sub_grid_pl_hi030_020_000_050_ta020_000_000_080_nb150_lp0.0048.csv


Saved submission.csv (112, 2) in 28.9s
Wrote sub_grid_pl_hi025_015_000_060_ta015_000_000_085_nb150_lp0.0048.csv
Decode/weight sweep complete: 38 files in 1083.2s
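
The sweep writes 38 candidate files but does not score them. If the same decode settings are also run on out-of-fold predictions, the candidates can be ranked with the word-level Jaccard named in the plan. A minimal sketch follows; treating two empty answers as a perfect match is an assumption, and the function names are illustrative.

```python
def word_jaccard(pred: str, truth: str) -> float:
    """Word-level Jaccard between two answers (lowercased, split on whitespace)."""
    a, b = set(str(pred).lower().split()), set(str(truth).lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty answers count as a match
    inter = a & b
    return len(inter) / (len(a) + len(b) - len(inter))

def mean_jaccard(preds, truths) -> float:
    """Average word-level Jaccard over paired predictions and references."""
    scores = [word_jaccard(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores) if scores else 0.0
```

Because the split is on whitespace only, a token with attached punctuation such as `मुंबई,` does not match `मुंबई`, which is why the trailing-punctuation trimming in the decode step can affect the score.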


In [26]:
# Final submissions per expert advice: Primary per-language and one Safety variant
import os, shutil, pandas as pd
assert 'build_submission' in globals(), 'Run Cell 9 first to define build_submission()'

# Set decode params
N_BEST_CHAR = 150
LENGTH_PENALTY = 0.0048
APPLY_SMOOTHING = False  # primary: no smoothing

# Per-language weights (xlmr384, muril, mdeberta, xlmr512)
per_lang_weights = {
    'hindi': (0.25, 0.15, 0.00, 0.60),
    'tamil': (0.15, 0.00, 0.00, 0.85),
}

print('Building PRIMARY per-language submission...')
sub_primary = build_submission(per_language_weights=per_lang_weights)
if os.path.exists('submission.csv'):
    df = pd.read_csv('submission.csv')
    if 'PredictionString' in df.columns:
        df['PredictionString'] = df['PredictionString'].fillna('')
        df.to_csv('submission.csv', index=False)
    shutil.copy('submission.csv', 'submission_primary_perlang.csv')
    print('Saved submission_primary_perlang.csv and kept as active submission.csv')

# Safety variant (choose ONE). We generate both but keep PRIMARY active.
# A) Global 512-heavy, no smoothing, nb=200, lp=0.0044
print('Building SAFETY (global 512-heavy) submission...')
N_BEST_CHAR = 200
LENGTH_PENALTY = 0.0044
APPLY_SMOOTHING = False
sub_safety_global = build_submission(blend_weights=(0.15, 0.05, 0.00, 0.80))
if os.path.exists('submission.csv'):
    df_g = pd.read_csv('submission.csv')
    if 'PredictionString' in df_g.columns:
        df_g['PredictionString'] = df_g['PredictionString'].fillna('')
    df_g.to_csv('submission_safety_global_015_005_000_080_nb200_lp0.0044.csv', index=False)
    print('Saved submission_safety_global_015_005_000_080_nb200_lp0.0044.csv')

# B) Smoothing probe: same per-language weights, smoothing on, revert decode params to primary
print('Building SECONDARY (smoothed per-language) submission...')
N_BEST_CHAR = 150
LENGTH_PENALTY = 0.0048
APPLY_SMOOTHING = True
sub_smoothed = build_submission(per_language_weights=per_lang_weights)
if os.path.exists('submission.csv'):
    df_s = pd.read_csv('submission.csv')
    if 'PredictionString' in df_s.columns:
        df_s['PredictionString'] = df_s['PredictionString'].fillna('')
    df_s.to_csv('submission_secondary_smoothed.csv', index=False)
    print('Saved submission_secondary_smoothed.csv')

# Restore PRIMARY as active submission.csv
if os.path.exists('submission_primary_perlang.csv'):
    shutil.copy('submission_primary_perlang.csv', 'submission.csv')
    print('submission.csv restored to PRIMARY per-language')

# Sanitize final active submission
df_final = pd.read_csv('submission.csv')
if 'PredictionString' in df_final.columns:
    df_final['PredictionString'] = df_final['PredictionString'].fillna('').astype(str)
df_final.to_csv('submission.csv', index=False)
print('Final submission.csv sanitized. Head:')
print(df_final.head())

Building PRIMARY per-language submission...


Saved submission.csv (112, 2) in 27.6s
Saved submission_primary_perlang.csv and kept as active submission.csv
Building SAFETY (global 512-heavy) submission...


Saved submission.csv (112, 2) in 27.5s
Saved submission_safety_global_015_005_000_080_nb200_lp0.0044.csv
Building SECONDARY (smoothed per-language) submission...


Saved submission.csv (112, 2) in 27.7s
Saved submission_secondary_smoothed.csv
submission.csv restored to PRIMARY per-language
Final submission.csv sanitized. Head:
          id PredictionString
0  be799d365            मुंबई
1  26f356026      सुधांशुबाला
2  57a56c43f            புறணி
3  da062fdbb         बिंबिसार
4  72fc0d5b5   २० अप्रैल १८८९
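
The cells here set `N_BEST_CHAR`, `LENGTH_PENALTY` and `APPLY_SMOOTHING` as module-level globals, so a later cell that does not set them inherits whatever the last executed cell left behind. One way to scope the overrides is a small context manager. This is a sketch; the name `decode_params` is hypothetical and it assumes `build_submission()` reads these names from the notebook's globals.

```python
from contextlib import contextmanager

@contextmanager
def decode_params(namespace, **overrides):
    """Temporarily set names in `namespace` (e.g. globals()) and restore them on exit."""
    missing = object()  # sentinel for names that did not exist beforehand
    saved = {k: namespace.get(k, missing) for k in overrides}
    namespace.update(overrides)
    try:
        yield
    finally:
        for k, v in saved.items():
            if v is missing:
                namespace.pop(k, None)
            else:
                namespace[k] = v

# Hypothetical usage inside the notebook:
# with decode_params(globals(), N_BEST_CHAR=200, LENGTH_PENALTY=0.0044, APPLY_SMOOTHING=False):
#     build_submission(blend_weights=(0.15, 0.05, 0.00, 0.80))
```

The previous values are restored even if the build raises, so a failed probe does not leave the primary settings changed.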


In [28]:
# Activate smoothing variant as submission and sanitize
import os, pandas as pd, shutil
src = 'submission_secondary_smoothed.csv'
assert os.path.exists(src), f'Missing smoothed CSV: {src}'
df = pd.read_csv(src)
if 'PredictionString' in df.columns:
    df['PredictionString'] = df['PredictionString'].fillna('').astype(str)
df.to_csv('submission.csv', index=False)
print('submission.csv replaced with', src)
print('Head:')
print(pd.read_csv('submission.csv').head())

submission.csv replaced with submission_secondary_smoothed.csv
Head:
          id                                   PredictionString
0  be799d365                                                  ७
1  26f356026                 सद्गुरु श्रेणी:परमहंस श्रेणी:हिन्द
2  57a56c43f  பெருமூளைப் புறணியில் உள்ள பார்வைப் புறணியில் (...
3  da062fdbb      भ ने छीन लिया था। उसके राजत्वकाल में ही विडूड
4  72fc0d5b5                                              फ गोय
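
The smoothed head above differs from the primary head in all five rows and contains fragments that start mid-word. A cheap guard before activating any variant is to measure how often it agrees with a trusted submission. A minimal sketch follows; the function name is hypothetical, and any cut-off for "too different" would be an assumption to tune.

```python
import pandas as pd

def exact_match_rate(df_a, df_b, key='id', col='PredictionString'):
    """Fraction of shared ids whose predictions are identical after stripping whitespace."""
    merged = df_a[[key, col]].merge(df_b[[key, col]], on=key, suffixes=('_a', '_b'))
    if len(merged) == 0:
        return 0.0
    a = merged[col + '_a'].fillna('').astype(str).str.strip()
    b = merged[col + '_b'].fillna('').astype(str).str.strip()
    return float((a == b).mean())
```

A variant that agrees with the primary submission on very few rows is more likely a decoding bug than a genuine improvement, so it is worth inspecting before it replaces `submission.csv`.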


In [29]:
# Aggressive probe: 512-only with N_BEST_CHAR=250, LENGTH_PENALTY=0.0040
import os, pandas as pd, shutil
assert 'build_submission' in globals(), 'Run Cell 9 first to define build_submission()'
N_BEST_CHAR = 250
LENGTH_PENALTY = 0.0040
APPLY_SMOOTHING = False
sub_aggr = build_submission(blend_weights=(0.00, 0.00, 0.00, 1.00))
out = 'submission_aggressive_512only_nb250_lp0.0040.csv'
if os.path.exists('submission.csv'):
    df = pd.read_csv('submission.csv')
    if 'PredictionString' in df.columns:
        df['PredictionString'] = df['PredictionString'].fillna('').astype(str)
        df.to_csv(out, index=False)
        shutil.copy(out, 'submission.csv')
        print('Saved and activated', out)
    else:
        print('Warning: PredictionString column missing in generated submission.csv')
else:
    print('No submission.csv found after build_submission()')
print(pd.read_csv('submission.csv').head())

Saved submission.csv (112, 2) in 27.7s
Saved and activated submission_aggressive_512only_nb250_lp0.0040.csv
          id               PredictionString
0  be799d365              मुंबई, महाराष्ट्र
1  26f356026  उदासिनाचार्य सुमेरुदास महाराज
2  57a56c43f                          புறணி
3  da062fdbb                       बिंबिसार
4  72fc0d5b5                           1889


In [37]:
# Build per-language submission with expert one-shot settings
import os, shutil, pandas as pd
assert 'build_submission' in globals(), 'Run Cell 9 first to define build_submission()'

APPLY_SMOOTHING = False  # global off; Tamil end-only smoothing is handled inside build loop

per_lang_weights = {
    'hindi': (0.25, 0.20, 0.00, 0.55),
    'tamil': (0.18, 0.00, 0.00, 0.82),
}
print('Building per-language submission (hi: 0.25/0.20/0/0.55, ta: 0.18/0/0/0.82) ...')
sub = build_submission(per_language_weights=per_lang_weights)
if os.path.exists('submission.csv'):
    df = pd.read_csv('submission.csv')
    if 'PredictionString' in df.columns:
        df['PredictionString'] = df['PredictionString'].fillna('').astype(str)
        df.to_csv('submission.csv', index=False)
    out_name = 'submission_perlang_expert_nb_langlp_temps.csv'
    shutil.copy('submission.csv', out_name)
    print('Saved', out_name, 'and kept as active submission.csv')
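# Note: read_csv parses empty fields as NaN, so an empty prediction prints as NaN in the head below
print(pd.read_csv('submission.csv').head())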

Building per-language submission (hi: 0.25/0.20/0/0.55, ta: 0.18/0/0/0.82) ...


Saved submission.csv (112, 2) in 29.1s
Saved submission_perlang_expert_nb_langlp_temps.csv and kept as active submission.csv
          id          PredictionString
0  be799d365                     मुंबई
1  26f356026               सुधांशुबाला
2  57a56c43f  புறணிப் (cerebral cortex
3  da062fdbb                       NaN
4  72fc0d5b5            २० अप्रैल १८८९
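
The `NaN` in row 3 above comes from the CSV round trip: `to_csv` writes an empty string as an empty field, and `read_csv` parses empty fields as missing values by default. The short sketch below reproduces this in memory with made-up ids and shows that `keep_default_na=False` preserves the empty string on read.

```python
import io
import pandas as pd

# Two illustrative rows: one real answer, one empty prediction
csv_text = pd.DataFrame({'id': ['a', 'b'], 'PredictionString': ['मुंबई', '']}).to_csv(index=False)

default_read = pd.read_csv(io.StringIO(csv_text))                       # empty field -> NaN
kept_empty = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)  # empty field -> ''

print(default_read['PredictionString'].isna().tolist())
print(kept_empty['PredictionString'].tolist())
```

The file on disk is the same either way, so the `fillna('')` calls only matter for in-memory use; reading with `keep_default_na=False` is what makes the printed head show the empty string.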


In [60]:
# Span-level decode: XLM-R large 512 only, per-language n_best/len_pen/max_len (aggressive knobs) + extra hygiene
import numpy as np, pandas as pd, json, os, time, re, unicodedata, shutil

def span_level_decode_xlmr512():
    t0 = time.time()
    # Prefer 3-seed averaged logits for safety submission; else 2-seed; else single
    if os.path.exists('xlmr_large_512_3seeds_avg.npz'):
        npz = 'xlmr_large_512_3seeds_avg.npz'
    elif os.path.exists('xlmr_large_512_2seeds_avg.npz'):
        npz = 'xlmr_large_512_2seeds_avg.npz'
    else:
        npz = 'xlmr_large_512_test_avg.npz'
    exid_json = 'xlmr_large_512_test_logits/test_example_id.json'
    offs_npy = 'xlmr_large_512_test_logits/test_offset_mapping.npy'
    assert os.path.exists(npz) and os.path.exists(exid_json) and os.path.exists(offs_npy), 'Missing 512 artifacts'
    data = np.load(npz)
    start_logits = data['start']  # [n_feat, L]
    end_logits = data['end']
    with open(exid_json, 'r', encoding='utf-8') as f:
        example_ids = json.load(f)
    offsets_list = np.load(offs_npy, allow_pickle=True)
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))

    def _stable_logsoftmax(arr):
        m = np.max(arr)
        return arr - (m + np.log(np.sum(np.exp(arr - m))))

    def best_span_for_feature(ex_id, lang, offsets, s_logit, e_logit):
        # mask
        mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
        s = np.where(mask, s_logit, -1e9)
        e = np.where(mask, e_logit, -1e9)
        # per-lang knobs (primary-safe settings)
        lang_l = str(lang).lower().strip()
        if lang_l == 'hindi':
            n_best = 180; max_len = 52; len_pen = 0.0050
        elif lang_l == 'tamil':
            n_best = 220; max_len = 60; len_pen = 0.0040  # tightened per expert advice
        else:
            n_best = 150; max_len = 64; len_pen = 0.0048
        # log-softmax
        s_logp = _stable_logsoftmax(s)
        e_logp = _stable_logsoftmax(e)
        top_s = np.argsort(s_logp)[-n_best:]
        best = (float('-inf'), None, None)
        for si in top_s:
            ei_max = min(len(e_logp)-1, si + max_len - 1)
            band = e_logp[si:ei_max+1]
            if band.size == 0: continue
            off = int(np.argmax(band))
            ei = si + off
            L = ei - si + 1
            score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
            if score > best[0]:
                best = (score, si, ei)
        if best[1] is None:
            return (-1e9, '')
        cs, _ = offsets[best[1]]; _, ce = offsets[best[2]]
        if cs is None or ce is None or ce <= cs:
            return (-1e9, '')
        ctx = id2ctx.get(ex_id, '')
        return (best[0], ctx[cs:ce])

    preds = {}  # ex_id -> (score, text)
    n_feat = start_logits.shape[0]
    for i in range(n_feat):
        ex_id = example_ids[i]
        lang = id2lang.get(ex_id, 'unknown')
        score, text = best_span_for_feature(ex_id, lang, offsets_list[i], start_logits[i], end_logits[i])
        prev = preds.get(ex_id, (float('-inf'), ''))
        if score > prev[0]:
            preds[ex_id] = (score, text)

    # Post-processing: safe trims + collapse spaces + remove ZWNJ/ZWJ and BOM/LRM/RLM, hyphen collapse, ZWSP/soft hyphen removal
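    def strip_trailing_punct(s: str) -> str:
        # str.strip treats its argument as a set of characters, so list them literally;
        # regex-style escapes such as '\s' or '\)' would add 's' and '\' to the set.
        PUNCT_STRIP = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'
        return s.strip().strip(PUNCT_STRIP) if s else s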
    def trim_boundary_combining(s: str) -> str:
        if not s: return s
        if len(s) > 0 and unicodedata.combining(s[0]): s = s[1:]
        if len(s) > 0 and unicodedata.combining(s[-1]): s = s[:-1]
        return s
    def remove_unmatched_quotes(s: str) -> str:
        if not s: return s
        for lq, rq in [("\"","\""), ("'","'"), ('\u201c','\u201d'), ('\u2018','\u2019')]:
            if s.startswith(lq) and not s.endswith(rq): s = s[len(lq):]
            if s.endswith(rq) and not s.startswith(lq): s = s[:-len(rq)]
        return s

    out_ids = test_df['id'].astype(str).values
    out_pred = []
    for ex_id in out_ids:
        lang = str(id2lang.get(ex_id, 'unknown')).lower().strip()
        ans = preds.get(ex_id, (float('-inf'), ''))[1]
        ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans or '')
        ans = ans.replace('\u094D\u0964', '\u0964').replace('\u094D।', '।')
        ans = re.sub(r'[।]+', '।', ans)
        if ans.endswith('।'): ans = ans[:-1]
        # trailing virama/combining guard
        if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'): ans = ans[:-1]
        # Hindi-specific: drop isolated trailing virama after space/punct
        if lang == 'hindi' and len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'):
            ans = ans[:-1]
        # Tamil-specific: drop trailing combining mark once
        if lang == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn':
            ans = ans[:-1]
        if ans and unicodedata.category(ans[0]) == 'Mn': ans = ans[1:]
        # collapse repeated hyphens/dashes
        ans = re.sub(r'(?<=\S)[\-–—]{2,}(?=\S)', '-', ans)
        ans = strip_trailing_punct(ans)
        ans = trim_boundary_combining(ans)
        ans = remove_unmatched_quotes(ans)
        # remove BOM/LRM/RLM and ZWNJ/ZWJ; collapse NBSP/thin space; remove ZWSP/soft hyphen
        ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
        ans = ans.replace('\u200B','').replace('\u00AD','')
        ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
        ans = ' '.join(ans.split())
        ans = ans.replace('\u200C','').replace('\u200D','')
        out_pred.append(ans or '')

    sub = pd.DataFrame({'id': out_ids, 'PredictionString': out_pred})
    sub.to_csv('submission.csv', index=False)
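    # Also save a named backup, tagged by how many seeds the source npz averages
    seed_tag = '3seeds' if '3seeds' in npz else ('2seeds' if '2seeds' in npz else '1seed')
    backup_name = f'submission_safety_512only_{seed_tag}.csv'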
    sub.to_csv(backup_name, index=False)
    print('Span-level 512-only submission.csv saved', sub.shape, 'in %.1fs' % (time.time()-t0), '| backup:', backup_name, '| source npz:', os.path.basename(npz))
    return sub

# Run span-level decode and write submission
sub_span512 = span_level_decode_xlmr512()
print(sub_span512.head())

Span-level 512-only submission.csv saved (112, 2) in 2.7s | backup: submission_safety_512only_3seeds.csv | source npz: xlmr_large_512_3seeds_avg.npz
          id               PredictionString
0  be799d365              मुंबई, महाराष्ट्र
1  26f356026  उदासिनाचार्य सुमेरुदास महाराज
2  57a56c43f               பெருமூளைப் புறணி
3  da062fdbb                       बिंबिसार
4  72fc0d5b5                           1889
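
The per-feature decode above scores a span as the sum of start and end log-probabilities minus a length penalty. A minimal sketch of that scoring on made-up logits follows; `best_span`, its default knob values, and the toy arrays are illustrative assumptions, not the notebook's tuned settings.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

def best_span(start_logits, end_logits, n_best=3, max_len=4, len_pen=0.005):
    """Return (score, start_idx, end_idx) for the best candidate span.

    For each of the n_best start tokens, the end token is the argmax of the
    end log-probs inside a window of max_len tokens; the span is then scored
    as log p(start) + log p(end) - len_pen * (span_len - 1).
    """
    s_logp = log_softmax(np.asarray(start_logits, dtype=float))
    e_logp = log_softmax(np.asarray(end_logits, dtype=float))
    best = (float('-inf'), None, None)
    for si in np.argsort(s_logp)[-n_best:]:
        si = int(si)
        hi = min(len(e_logp), si + max_len)          # exclusive window end
        ei = si + int(np.argmax(e_logp[si:hi]))
        score = float(s_logp[si] + e_logp[ei] - len_pen * (ei - si))
        if score > best[0]:
            best = (score, si, ei)
    return best

# Toy logits (made up): the start distribution peaks at 2, the end at 3.
start = [0.1, 0.2, 5.0, 0.3, 0.1, 0.0]
end = [0.0, 0.1, 0.2, 4.0, 0.5, 0.1]
score, si, ei = best_span(start, end)
print(si, ei, round(score, 3))
```

In this scheme the end index is chosen before the penalty is applied, so the length penalty only re-ranks candidate starts; it does not shorten a span within a given window.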


In [61]:
# Span-level ensemble: pick best span per model, then choose by weighted score per language
import numpy as np, pandas as pd, json, os, re, unicodedata, time

def _stable_logsoftmax(arr):
    m = np.max(arr)
    return arr - (m + np.log(np.sum(np.exp(arr - m))))

def _decode_best_for_feature(offsets, s_logit, e_logit, lang):
    mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
    s = np.where(mask, s_logit, -1e9)
    e = np.where(mask, e_logit, -1e9)
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi':
        n_best = 180; max_len = 52; len_pen = 0.0050; fallback_cap = 36
    elif lang_l == 'tamil':
        n_best = 220; max_len = 60; len_pen = 0.0040; fallback_cap = 42
    else:
        n_best = 150; max_len = 64; len_pen = 0.0048; fallback_cap = 42
    s_logp = _stable_logsoftmax(s)
    e_logp = _stable_logsoftmax(e)
    top_s = np.argsort(s_logp)[-n_best:]
    best = (float('-inf'), None, None)
    for si in top_s:
        ei_max = min(len(e_logp)-1, si + max_len - 1)
        band = e_logp[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    # Length-based fallback if char span > 40: re-decode with tighter max_len and overwrite if within 0.5 of original or higher
    if best[1] is not None and best[2] is not None:
        cs, _ = offsets[best[1]]; _, ce = offsets[best[2]]
        if cs is not None and ce is not None and ce > cs:
            char_len = ce - cs
            if char_len > 40:
                best_fb = (float('-inf'), None, None)
                for si in top_s:
                    ei_max = min(len(e_logp)-1, si + fallback_cap - 1)
                    band = e_logp[si:ei_max+1]
                    if band.size == 0: continue
                    off = int(np.argmax(band))
                    ei = si + off
                    L = ei - si + 1
                    score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
                    if score > best_fb[0]:
                        best_fb = (score, si, ei)
                if best_fb[1] is not None and best_fb[2] is not None:
                    # Accept fallback if new score within 0.5 of original or higher
                    if best_fb[0] >= best[0] - 0.5:
                        best = best_fb
    return best  # (score, si, ei) or ( -inf, None, None )

def _postprocess_ans(ans: str, lang: str = None) -> str:
    if not ans:
        return ''
    # Nukta normalization (safe map) before other trims
    nukta_map = {'\u095c':'ड़','\u095d':'ढ़','\u095e':'फ़','\u095f':'य़'}
    ans = ''.join(nukta_map.get(c, c) for c in ans)
    # Chandrabindu collapse
    ans = re.sub(r'ँ+', 'ँ', ans)
    # Numeric punctuation collapse inside digit runs
    ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
    # Danda/virama hygiene
    ans = ans.replace('\u094D\u0964', '\u0964').replace('\u094D।', '।')
    ans = re.sub(r'[।]+', '।', ans)
    if ans.endswith('।'): ans = ans[:-1]
    # Trim trailing virama/combining marks once
    if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'):
        ans = ans[:-1]
    # Hindi-specific: drop isolated trailing virama if preceded by space/punct
    if str(lang).lower().strip() == 'hindi':
        if len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'):
            ans = ans[:-1]
    # Tamil-specific: if last char is a lone combining mark/vowel sign, drop once
    if str(lang).lower().strip() == 'tamil':
        if len(ans) and unicodedata.category(ans[-1]) == 'Mn':
            ans = ans[:-1]
    # Strip leading combining mark once
    if ans and unicodedata.category(ans[0]) == 'Mn':
        ans = ans[1:]
    # Collapse repeated hyphens/dashes inside words (safe)
    ans = re.sub(r'(?<=\S)[\-–—]{2,}(?=\S)', '-', ans)
    # Trim outer punct/spaces (str.strip takes a character set, so no regex-style escapes)
    punct_strip = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'
    ans = ans.strip().strip(punct_strip)
    # Unmatched quotes
    for lq, rq in [("\"","\""), ("'","'"), ('“','”'), ('‘','’')]:
        if ans.startswith(lq) and not ans.endswith(rq): ans = ans[len(lq):]
        if ans.endswith(rq) and not ans.startswith(lq): ans = ans[:-len(rq)]
    # Remove BOM/LRM/RLM; drop ZWNJ/ZWJ; remove zero-width space/soft hyphen; normalize spaces incl. NBSP/thin space
    ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
    ans = ans.replace('\u200B','').replace('\u00AD','')
    ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
    ans = ' '.join(ans.split())
    ans = ans.replace('\u200C','').replace('\u200D','')
    # Final danda guard
    if ans.endswith('।'): ans = ans[:-1]
    return ans or ''

def build_span_level_ensemble_submission():
    t0 = time.time()
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))

    # Load averaged logits + mapping per model (if available)
    def load_stream(npz_path, dir_prefix):
        if not os.path.exists(npz_path):
            return None
        exid_json = os.path.join(dir_prefix, 'test_example_id.json')
        offs_npy = os.path.join(dir_prefix, 'test_offset_mapping.npy')
        if not (os.path.exists(exid_json) and os.path.exists(offs_npy)):
            return None
        data = np.load(npz_path)
        with open(exid_json, 'r', encoding='utf-8') as f:
            exids = json.load(f)
        offs = np.load(offs_npy, allow_pickle=True)
        return {'start': data['start'], 'end': data['end'], 'exids': exids, 'offs': offs}

    # Prefer 3-seed 512 npz if available; else 2-seed; else single-seed
    if os.path.exists('xlmr_large_512_3seeds_avg.npz'):
        xlmr512_npz = 'xlmr_large_512_3seeds_avg.npz'
    elif os.path.exists('xlmr_large_512_2seeds_avg.npz'):
        xlmr512_npz = 'xlmr_large_512_2seeds_avg.npz'
    else:
        xlmr512_npz = 'xlmr_large_512_test_avg.npz'

    streams = {
        'xlmr384': load_stream('xlmr_large_test_avg.npz', 'xlmr_large_test_logits'),
        'muril':   load_stream('muril_large_test_avg.npz', 'muril_large_test_logits'),
        'xlmr512': load_stream(xlmr512_npz, 'xlmr_large_512_test_logits'),
    }
    # Per-language weights (xlmr384, muril, xlmr512) updated for 3-seed-dominant primary
    per_lang_weights = {
        'hindi': {'xlmr384': 0.05, 'muril': 0.10, 'xlmr512': 0.85},
        'tamil': {'xlmr384': 0.02, 'muril': 0.00, 'xlmr512': 0.98},
    }

    # Collect best span and score per example per model (store text for final selection)
    best_by_model = {k: {} for k in streams.keys()}
    for key, pack in streams.items():
        if pack is None:
            continue
        S, E, exids, offs_all = pack['start'], pack['end'], pack['exids'], pack['offs']
        n_feat = S.shape[0]
        for i in range(n_feat):
            ex_id = exids[i]
            lang = id2lang.get(ex_id, 'unknown')
            ctx = id2ctx.get(ex_id, '')
            score, si, ei = _decode_best_for_feature(offs_all[i], S[i], E[i], lang)
            if si is not None and ei is not None:
                cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
                if cs is not None and ce is not None and ce > cs:
                    text = ctx[cs:ce]
                else:
                    text = ''
            else:
                text = ''
            prev = best_by_model[key].get(ex_id, (float('-inf'), ''))
            if score > prev[0]:
                best_by_model[key][ex_id] = (score, text)

    out_ids = test_df['id'].astype(str).values
    out_pred = []
    for ex_id in out_ids:
        lang = str(id2lang.get(ex_id, 'unknown')).lower().strip()
        wmap = per_lang_weights.get(lang, per_lang_weights.get('hindi', {'xlmr384':0.05,'muril':0.10,'xlmr512':0.85}))
        best = (float('-inf'), '')
        for key, w in wmap.items():
            if w <= 0: continue
            tup = best_by_model.get(key, {}).get(ex_id)
            if not tup: continue
            score, text = tup
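            # Scores are log-probability sums (<= 0), so multiplying by w would favor
            # low-weight models; add log(w) so that a larger weight raises the score.
            wscore = score + np.log(w)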
            if wscore > best[0]:
                best = (wscore, text)
        ans = _postprocess_ans(best[1], lang) if best[1] else ''
        out_pred.append(ans)

    sub = pd.DataFrame({'id': out_ids, 'PredictionString': out_pred})
    sub.to_csv('submission.csv', index=False)
    print('Span-level 3-stream weighted submission.csv saved', sub.shape, 'in %.1fs' % (time.time()-t0))
    return sub

sub_span_blend = build_span_level_ensemble_submission()
print(sub_span_blend.head())

Span-level 3-stream weighted submission.csv saved (112, 2) in 9.5s
          id   PredictionString
0  be799d365  मुंबई, महाराष्ट्र
1  26f356026        सुधांशुबाला
2  57a56c43f   பெருமூளைப் புறணி
3  da062fdbb           बिंबिसार
4  72fc0d5b5               1889
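
The span scores compared across models in the cell above are sums of log-probabilities, so they are at most zero. A minimal sketch, with made-up scores, of how a per-model weight behaves under multiplication versus an additive log-weight; `pick_weighted` and the toy values are hypothetical and only illustrate the arithmetic.

```python
import math

def pick_weighted(cands, weights):
    """cands: {model: (log_score, text)}; weights: {model: w in (0, 1]}.

    Ranks candidates by log_score + log(w), so a larger weight always helps.
    Models with a non-positive weight are skipped.
    """
    best = (float('-inf'), '')
    for model, (score, text) in cands.items():
        w = weights.get(model, 0.0)
        if w <= 0:
            continue
        adj = score + math.log(w)
        if adj > best[0]:
            best = (adj, text)
    return best

# Made-up example: the high-weight model is also the more confident one.
cands = {'xlmr512': (-0.5, 'A'), 'xlmr384': (-2.0, 'B')}
weights = {'xlmr512': 0.85, 'xlmr384': 0.05}

# Multiplying a non-positive score by a small weight pulls it toward zero,
# which ranks the low-weight model first.
mult = max(cands, key=lambda k: weights[k] * cands[k][0])
add_score, add_text = pick_weighted(cands, weights)
print('multiplicative pick:', mult, '| additive pick:', add_text)
```

The additive form is equivalent to multiplying the span probability (not its logarithm) by the weight, which is one common way to express a prior preference for a model.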


In [62]:
# Diagnostics: submission health check (empties, per-language breakdown, length stats)
import pandas as pd, numpy as np
sub = pd.read_csv('submission.csv')
test_df = pd.read_csv('test.csv')
assert len(sub)==112 and sub['id'].nunique()==112, f"Bad row/id count: rows={len(sub)}, unique={sub['id'].nunique()}"
sub = sub.merge(test_df[['id','language']], on='id', how='left')
# read_csv parses empty predictions as NaN; fill before casting so a literal 'nan' answer is preserved
sub['PredictionString'] = sub['PredictionString'].fillna('').astype(str)
sub['len'] = sub['PredictionString'].str.len()
sub['empty'] = (sub['PredictionString']=='')
print('Overall: empties', int(sub['empty'].sum()), '/', len(sub), '| mean len', round(sub['len'].mean(),2), '| median len', int(sub['len'].median()))
print('Per-language empties:')
print(sub.groupby('language')['empty'].sum().to_dict())
print('Per-language mean length:')
print(sub.groupby('language')['len'].mean().round(2).to_dict())
print('Head (id, lang, len, empty, PredictionString[:60]):')
print(sub[['id','language','len','empty','PredictionString']].assign(PredictionString=sub['PredictionString'].str[:60]).head(10))

Overall: empties 0 / 112 | mean len 9.71 | median len 9
Per-language empties:
{'hindi': 0, 'tamil': 0}
Per-language mean length:
{'hindi': 9.57, 'tamil': 10.14}
Head (id, lang, len, empty, PredictionString[:60]):
          id language  len  empty   PredictionString
0  be799d365    hindi   17  False  मुंबई, महाराष्ट्र
1  26f356026    hindi   11  False        सुधांशुबाला
2  57a56c43f    tamil   16  False   பெருமூளைப் புறணி
3  da062fdbb    hindi    8  False           बिंबिसार
4  72fc0d5b5    hindi    4  False               1889
5  6108306f5    hindi   10  False         जिमी वेल्स
6  850c30cde    hindi    3  False                हेग
7  af061a7a4    hindi   13  False      2330 वर्ग मील
8  64b643feb    hindi   14  False     मार्च 14, 1965
9  4301ae869    tamil    9  False          அரியலூரில
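
The decode cells above prefer a multi-seed averaged file such as `xlmr_large_512_3seeds_avg.npz` when it exists. A minimal sketch of how such a file might be produced from per-seed averages, assuming every seed used the same tokenizer, `MAX_LENGTH`, and `DOC_STRIDE` so that feature rows line up with the saved `test_example_id.json`; the helper name `average_seed_logits` and the demo files are hypothetical.

```python
import os
import tempfile

import numpy as np

def average_seed_logits(npz_paths, out_path):
    """Average 'start'/'end' logits from several per-seed .npz files.

    Raises ValueError if shapes differ, which would indicate that the
    seeds were not featurized identically and cannot be averaged row-wise.
    """
    starts, ends = [], []
    for p in npz_paths:
        with np.load(p) as d:
            starts.append(d['start'].astype(np.float32))
            ends.append(d['end'].astype(np.float32))
    shape = starts[0].shape
    for p, s, e in zip(npz_paths, starts, ends):
        if s.shape != shape or e.shape != shape:
            raise ValueError(f'{p}: shape {s.shape}/{e.shape} != {shape}')
    start_avg = np.mean(np.stack(starts, axis=0), axis=0)
    end_avg = np.mean(np.stack(ends, axis=0), axis=0)
    np.savez_compressed(out_path, start=start_avg, end=end_avg)
    return start_avg, end_avg

# Demo on synthetic arrays written to a temporary directory.
tmp = tempfile.mkdtemp()
paths = []
for k, fill in enumerate([0.0, 1.0, 2.0]):
    p = os.path.join(tmp, f'seed{k}.npz')
    np.savez_compressed(p, start=np.full((2, 4), fill), end=np.full((2, 4), -fill))
    paths.append(p)
out_file = os.path.join(tmp, 'avg.npz')
s_avg, e_avg = average_seed_logits(paths, out_file)
print(s_avg.shape, float(s_avg[0, 0]), float(e_avg[0, 0]))
```

Averaging raw logits row by row is only meaningful when row `i` refers to the same feature in every file, which is why the shape check fails loudly rather than broadcasting.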


In [47]:
# Train new XLM-R large 512 seed (SEED=2026) and save test logits; settings per expert with safe stride for logit-averaging
import os, time, numpy as np, pandas as pd, collections, random, json, torch, re, gc
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding
from torch.nn.functional import log_softmax

SEED = 2026
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)
os.environ.setdefault('TOKENIZERS_PARALLELISM', 'false')
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True,max_split_size_mb:64')
torch.backends.cuda.matmul.allow_tf32 = True
try: torch.set_float32_matmul_precision('high')
except Exception: pass

MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
MAX_LENGTH = 512
# Note: keep DOC_STRIDE=256 so test featurization is identical to the prior seed and logits can be averaged directly
DOC_STRIDE = 256  # expert suggested 224 for diversity; 256 keeps feature shapes aligned for averaging
MAX_ANSWER_LEN = 64
NEG_RATIO = 1  # increased diversity vs previous seed
EPOCHS = 3
LR = 2e-5
TRAIN_BS = 1
EVAL_BS = 2
GRAD_ACCUM = 20
N_BEST = 25
WARMUP_RATIO = 0.15
WEIGHT_DECAY = 0.02

print('Tokenizer for new seed with', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

full_df = pd.read_csv('train.csv').merge(pd.read_csv('folds.csv'), on='id', how='left')
assert full_df['fold'].notna().all()
test_df = pd.read_csv('test.csv')

def prepare_train_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions, cls_positions = [], [], []
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace(): a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace(): a_end -= 1
        if a_start >= a_end: a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0
        cls_positions.append(cls_index)
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None: continue
            ts, te = o
            if ts is None or te is None: continue
            if st_tok is None and ts <= a_start < te: st_tok = t
            if ts <= tgt_end_incl < te: en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
    tok['start_positions'] = start_positions; tok['end_positions'] = end_positions; tok['overflow_to_sample_mapping'] = smap; tok['cls_positions'] = cls_positions
    return tok

def prepare_pred_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

# Pre-tokenize TEST once for this seed
logdir = 'xlmr_large_512_seed2026_test_logits'
os.makedirs(logdir, exist_ok=True)
print('Tokenizing TEST for seed 2026 ...')
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])
test_tok_all = test_ds.map(prepare_pred_features_fn, batched=True, remove_columns=test_ds.column_names, desc='xlmr512_seed2026_test_tokenize')
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')
with open(os.path.join(logdir, 'test_example_id.json'), 'w', encoding='utf-8') as f:
    json.dump(list(test_tok_all['example_id']), f, ensure_ascii=False)
np.save(os.path.join(logdir, 'test_offset_mapping.npy'), np.array(test_tok_all['offset_mapping'], dtype=object), allow_pickle=True)

all_oof_scores = []; oof_rows = []; test_start_list = []; test_end_list = []; t_global = time.time()
for FOLD in range(5):
    t_fold = time.time()
    print(f'\n===== XLM-R Large 512 SEED2026 Fold {FOLD} =====', flush=True)
    tr_df = full_df[full_df['fold'] != FOLD].reset_index(drop=True)
    va_df = full_df[full_df['fold'] == FOLD].reset_index(drop=True)
    tr_ds = Dataset.from_pandas(tr_df[['id','question','context','answer_text','answer_start','language']])
    va_ds = Dataset.from_pandas(va_df[['id','question','context','answer_text','answer_start','language']])
    t0 = time.time()
    tr_tok = tr_ds.map(prepare_train_features_fn, batched=True, remove_columns=tr_ds.column_names, desc=f'seed2026_train_tokenize_f{FOLD}')
    print(f'Fold {FOLD}: train feats pre-sample={tr_tok.num_rows} | tok_time={time.time()-t0:.1f}s', flush=True)
    # Negative sampling (NEG_RATIO=1) robust via cls_positions
    smap = np.array(tr_tok['overflow_to_sample_mapping']); sp = np.array(tr_tok['start_positions']); cls_arr = np.array(tr_tok['cls_positions'])
    is_pos = (sp != cls_arr)
    keep = []; by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(smap): by_ex[ex].append(idx)
    for ex, idxs in by_ex.items():
        idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
        else:
            cap = min(len(neg), NEG_RATIO * len(pos))
            if cap > 0:
                sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
    keep = np.array(sorted(set(keep))); before = tr_tok.num_rows
    tr_tok = tr_tok.select(keep.tolist()); after = tr_tok.num_rows
    print(f'Fold {FOLD}: neg-sample {before}->{after} ({after/before:.2%})', flush=True)
    # Torch datasets
    tr_ds_torch = tr_tok.remove_columns([c for c in tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
    tr_ds_torch.set_format(type='torch')
    va_tok_all = va_ds.map(prepare_pred_features_fn, batched=True, remove_columns=va_ds.column_names, desc=f'seed2026_valid_tokenize_f{FOLD}')
    va_ds_pred = va_tok_all.remove_columns([c for c in va_tok_all.column_names if c not in ['input_ids','attention_mask']])
    va_ds_pred.set_format(type='torch')
    # Model
    gc.collect(); torch.cuda.empty_cache()
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
    try:
        model.config.use_cache = False
    except Exception:
        pass
    try:
        model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
    except Exception:
        try: model.gradient_checkpointing_enable()
        except Exception: pass
    args = TrainingArguments(
        output_dir=f'xlmr_large_512_seed2026_f{FOLD}',
        num_train_epochs=EPOCHS,
        learning_rate=LR,
        per_device_train_batch_size=TRAIN_BS,
        per_device_eval_batch_size=EVAL_BS,
        gradient_accumulation_steps=GRAD_ACCUM,
        fp16=False,
        bf16=True,
        logging_steps=200,
        save_strategy='no',
        evaluation_strategy='no',
        seed=SEED,
        dataloader_num_workers=0,
        dataloader_pin_memory=False,
        eval_accumulation_steps=1,
        report_to=[],
        gradient_checkpointing=True,
        warmup_ratio=WARMUP_RATIO,
        weight_decay=WEIGHT_DECAY,
        max_grad_norm=1.0
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds_torch,
        eval_dataset=None,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    print(f'Fold {FOLD}: training...')
    try:
        trainer.train()
    except RuntimeError as e:
        if 'out of memory' in str(e).lower():
            print('OOM; retry with GRAD_ACCUM=24, EVAL_BS=1')
            del trainer, model
            gc.collect(); torch.cuda.empty_cache()
            model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
            try: model.config.use_cache = False
            except Exception: pass
            try: model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
            except Exception:
                try: model.gradient_checkpointing_enable()
                except Exception: pass
            args = TrainingArguments(
                output_dir=f'xlmr_large_512_seed2026_f{FOLD}',
                num_train_epochs=EPOCHS,
                learning_rate=LR,
                per_device_train_batch_size=1,
                per_device_eval_batch_size=1,
                gradient_accumulation_steps=24,
                fp16=False,
                bf16=True,
                logging_steps=200,
                save_strategy='no',
                evaluation_strategy='no',
                seed=SEED,
                dataloader_num_workers=0,
                dataloader_pin_memory=False,
                eval_accumulation_steps=1,
                report_to=[],
                gradient_checkpointing=True,
                warmup_ratio=WARMUP_RATIO,
                weight_decay=WEIGHT_DECAY,
                max_grad_norm=1.0,
            )
            trainer = Trainer(
                model=model,
                args=args,
                train_dataset=tr_ds_torch,
                eval_dataset=None,
                tokenizer=tokenizer,
                data_collator=DataCollatorWithPadding(tokenizer),
            )
            trainer.train()
        else:
            raise
    print(f'Fold {FOLD}: train_done in {time.time()-t_fold:.1f}s')
    # Predict TEST and save logits for this fold
    print(f'Fold {FOLD}: predicting TEST and saving logits (seed2026) ...')
    gc.collect(); torch.cuda.empty_cache()
    model.eval()
    with torch.no_grad():
        te_pred = trainer.predict(te_ds_pred)
    te_start = te_pred.predictions[0]
    te_end = te_pred.predictions[1]
    np.save(os.path.join(logdir, f'test_start_logits_f{FOLD}.npy'), te_start)
    np.save(os.path.join(logdir, f'test_end_logits_f{FOLD}.npy'), te_end)
    test_start_list.append(te_start)
    test_end_list.append(te_end)

    del model, trainer, tr_ds_torch, tr_tok, va_ds_pred, va_tok_all, te_pred
    gc.collect(); torch.cuda.empty_cache()

# Average TEST logits across folds for this seed
print('Averaging TEST logits across folds for seed 2026 ...')
start_stack = np.stack(test_start_list, axis=0)
end_stack = np.stack(test_end_list, axis=0)
start_avg = start_stack.mean(axis=0)
end_avg = end_stack.mean(axis=0)
np.savez_compressed('xlmr_large_512_seed2026_test_avg.npz', start=start_avg, end=end_avg)
print('Saved xlmr_large_512_seed2026_test_avg.npz; example_id/offsets saved under', logdir)
print('Seed 2026 5-fold done in %.1fs' % (time.time()-t_global))

Tokenizer for new seed with deepset/xlm-roberta-large-squad2




Tokenizing TEST for seed 2026 ...


xlmr512_seed2026_test_tokenize: 100%|██████████| 112/112 [00:03<00:00, 31.73 examples/s]





===== XLM-R Large 512 SEED2026 Fold 0 =====


seed2026_train_tokenize_f0: 100%|██████████| 817/817 [00:21<00:00, 38.52 examples/s]

Fold 0: train feats pre-sample=10484 | tok_time=21.2s


Fold 0: neg-sample 10484->2085 (19.89%)





seed2026_valid_tokenize_f0: 100%|██████████| 185/185 [00:06<00:00, 30.79 examples/s]




Some weights of the model checkpoint at deepset/xlm-roberta-large-squad2 were not used when initializing XLMRobertaForQuestionAnswering: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




Fold 0: training...


Fold 0: train_done in 1665.5s
Fold 0: predicting TEST and saving logits (seed2026) ...



===== XLM-R Large 512 SEED2026 Fold 1 =====


seed2026_train_tokenize_f1: 100%|██████████| 807/807 [00:33<00:00, 24.41 examples/s]

Fold 1: train feats pre-sample=10396 | tok_time=33.1s





Fold 1: neg-sample 10396->2047 (19.69%)


seed2026_valid_tokenize_f1: 100%|██████████| 195/195 [00:09<00:00, 20.79 examples/s]








Fold 1: training...




Fold 1: train_done in 1837.8s
Fold 1: predicting TEST and saving logits (seed2026) ...



===== XLM-R Large 512 SEED2026 Fold 2 =====


seed2026_train_tokenize_f2: 100%|██████████| 784/784 [00:33<00:00, 23.36 examples/s]

Fold 2: train feats pre-sample=9842 | tok_time=33.6s





Fold 2: neg-sample 9842->1962 (19.93%)


seed2026_valid_tokenize_f2: 100%|██████████| 218/218 [00:16<00:00, 13.08 examples/s]








Fold 2: training...




Fold 2: train_done in 1902.8s
Fold 2: predicting TEST and saving logits (seed2026) ...



===== XLM-R Large 512 SEED2026 Fold 3 =====


seed2026_train_tokenize_f3: 100%|██████████| 795/795 [00:14<00:00, 53.38 examples/s]

Fold 3: train feats pre-sample=9960 | tok_time=14.9s


Fold 3: neg-sample 9960->1994 (20.02%)





seed2026_valid_tokenize_f3: 100%|██████████| 207/207 [00:05<00:00, 39.38 examples/s]








Fold 3: training...




Fold 3: train_done in 1822.6s
Fold 3: predicting TEST and saving logits (seed2026) ...



===== XLM-R Large 512 SEED2026 Fold 4 =====


seed2026_train_tokenize_f4: 100%|██████████| 805/805 [00:15<00:00, 51.00 examples/s]

Fold 4: train feats pre-sample=10458 | tok_time=15.8s


Fold 4: neg-sample 10458->2024 (19.35%)





seed2026_valid_tokenize_f4: 100%|██████████| 197/197 [00:04<00:00, 47.04 examples/s]








Fold 4: training...




Fold 4: train_done in 2550.9s
Fold 4: predicting TEST and saving logits (seed2026) ...


Averaging TEST logits across folds for seed 2026 ...


Saved xlmr_large_512_seed2026_test_avg.npz; example_id/offsets saved under xlmr_large_512_seed2026_test_logits
Seed 2026 5-fold done in 10308.1s
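
The `neg-sample` lines in the log above show that only about 20% of the tokenized windows are kept for training. A minimal sketch of that kind of per-example selection follows, assuming each window carries the index of its parent example and a positive/negative flag. The function name `sample_features` and its parameters are illustrative, not taken from the notebook, and the toy input is made up.

```python
import collections
import numpy as np

def sample_features(sample_ids, is_pos, neg_ratio=1, max_neg_if_no_pos=4, seed=0):
    """Return sorted window indices to keep after per-example negative sampling."""
    rng = np.random.default_rng(seed)
    is_pos = np.asarray(is_pos, dtype=bool)
    # Group window indices by the example they were cut from.
    by_example = collections.defaultdict(list)
    for feat_idx, ex in enumerate(sample_ids):
        by_example[ex].append(feat_idx)
    keep = []
    for idxs in by_example.values():
        idxs = np.asarray(idxs)
        flags = is_pos[idxs]
        pos, neg = idxs[flags], idxs[~flags]
        keep.extend(pos.tolist())  # every positive window is kept
        if len(pos) == 0:
            # No window contains the answer: keep a few negatives anyway.
            keep.extend(neg[:max_neg_if_no_pos].tolist())
        else:
            # Keep at most neg_ratio random negatives per positive window.
            cap = min(len(neg), neg_ratio * len(pos))
            keep.extend(rng.permutation(neg)[:cap].tolist())
    return sorted(set(keep))

# Toy input: example 0 has one positive and four negative windows,
# example 1 has two negative windows and no positive one.
sample_ids = [0, 0, 0, 0, 0, 1, 1]
is_pos = [False, True, False, False, False, False, False]
kept = sample_features(sample_ids, is_pos)
print(len(kept), 'of', len(sample_ids), 'windows kept')
```

With long contexts and a single answer span per question, most windows are negative, which is why a 1:1 ratio discards such a large share of the features.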


In [49]:
# Average 512 seeds at logit level, rebuild primary span-level ensemble, and run diagnostics
import os, numpy as np, pandas as pd, time

def average_512_seeds_and_rebuild():
    t0 = time.time()
    npz1 = 'xlmr_large_512_test_avg.npz'
    npz2 = 'xlmr_large_512_seed2026_test_avg.npz'
    assert os.path.exists(npz1), f'Missing {npz1}'
    assert os.path.exists(npz2), f'Missing {npz2} (wait for training to finish)'
    d1 = np.load(npz1)
    d2 = np.load(npz2)
    start = (d1['start'] + d2['start']) / 2.0
    end = (d1['end'] + d2['end']) / 2.0
    out_npz = 'xlmr_large_512_2seeds_avg.npz'
    np.savez_compressed(out_npz, start=start, end=end)
    print('Saved', out_npz, 'with shape', start.shape, 'in %.1fs' % (time.time()-t0))

    # Rebuild primary span-level 3-stream submission using Cell 31's builder (it prefers 2seeds file if present)
    assert 'build_span_level_ensemble_submission' in globals(), 'Run Cell 31 once to define the span-level builder.'
    sub = build_span_level_ensemble_submission()
    print('Primary span-level 3-stream rebuilt. Head:')
    print(sub.head())

    # Diagnostics (same as Cell 32)
    test_df = pd.read_csv('test.csv')
    sub_diag = pd.read_csv('submission.csv').merge(test_df[['id','language']], on='id', how='left')
    sub_diag['PredictionString'] = sub_diag['PredictionString'].fillna('').astype(str)
    sub_diag['len'] = sub_diag['PredictionString'].str.len()
    sub_diag['empty'] = (sub_diag['PredictionString']=='')
    print('Diag: empties', int(sub_diag['empty'].sum()), '/', len(sub_diag), '| mean len', round(sub_diag['len'].mean(),2))
    print(sub_diag.head())
    return out_npz

# Execute averaging and rebuild now
avg_npz_path = average_512_seeds_and_rebuild()
print('Averaged NPZ:', avg_npz_path)

Saved xlmr_large_512_2seeds_avg.npz with shape (1401, 512) in 0.7s


Span-level 3-stream weighted submission.csv saved (112, 2) in 15.0s
Primary span-level 3-stream rebuilt. Head:
          id   PredictionString
0  be799d365  मुंबई, महाराष्ट्र
1  26f356026        सुधांशुबाला
2  57a56c43f   பெருமூளைப் புறணி
3  da062fdbb           बिंबिसार
4  72fc0d5b5               1889
Diag: empties 0 / 112 | mean len 9.79
          id   PredictionString language  len  empty
0  be799d365  मुंबई, महाराष्ट्र    hindi   17  False
1  26f356026        सुधांशुबाला    hindi   11  False
2  57a56c43f   பெருமூளைப் புறணி    tamil   16  False
3  da062fdbb           बिंबिसार    hindi    8  False
4  72fc0d5b5               1889    hindi    4  False
Averaged NPZ: xlmr_large_512_2seeds_avg.npz
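
Averaging logits element-wise is only meaningful when every seed was run on the same tokenized windows in the same order, which the printed shape `(1401, 512)` suggests is the case here. The sketch below generalizes the two-file average to any number of seeds and fails loudly on a shape mismatch. `average_logit_files` is a hypothetical helper, and the random arrays only stand in for real logits.

```python
import os
import tempfile
import numpy as np

def average_logit_files(paths, out_path):
    """Average the 'start' and 'end' arrays of several .npz files after a shape check."""
    starts, ends = [], []
    for p in paths:
        with np.load(p) as d:
            starts.append(d['start'].astype(np.float32))
            ends.append(d['end'].astype(np.float32))
    ref_shape = starts[0].shape
    for p, s, e in zip(paths, starts, ends):
        if s.shape != ref_shape or e.shape != ref_shape:
            raise ValueError(f'{p}: shapes {s.shape}/{e.shape} do not match {ref_shape}')
    start = np.mean(starts, axis=0)
    end = np.mean(ends, axis=0)
    np.savez_compressed(out_path, start=start, end=end)
    return start, end

# Toy demonstration: three fake "seeds" with identical shapes.
tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
paths = []
for k in range(3):
    p = os.path.join(tmp, f'seed{k}.npz')
    np.savez_compressed(p, start=rng.normal(size=(6, 8)), end=rng.normal(size=(6, 8)))
    paths.append(p)
avg_start, avg_end = average_logit_files(paths, os.path.join(tmp, 'avg.npz'))
print(avg_start.shape)
```

A shape check does not prove that the windows are in the same order; comparing the saved `test_example_id.json` lists of each seed would be a stricter test.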


In [52]:
# Build alt per-language weight submission (hi 0.20/0.10/0.70, ta 0.15/0.00/0.85) without disturbing current primary
import os, json, numpy as np, pandas as pd, time

# Save current primary submission as backup
if os.path.exists('submission.csv'):
    pd.read_csv('submission.csv').to_csv('submission_primary_3stream_2seeds.csv', index=False)

def _load_stream_alt(npz_path, dir_prefix):
    if not os.path.exists(npz_path):
        return None
    exid_json = os.path.join(dir_prefix, 'test_example_id.json')
    offs_npy = os.path.join(dir_prefix, 'test_offset_mapping.npy')
    if not (os.path.exists(exid_json) and os.path.exists(offs_npy)):
        return None
    data = np.load(npz_path)
    with open(exid_json, 'r', encoding='utf-8') as f:
        exids = json.load(f)
    offs = np.load(offs_npy, allow_pickle=True)
    return {'start': data['start'], 'end': data['end'], 'exids': exids, 'offs': offs}

def build_span_level_ensemble_submission_alt():
    t0 = time.time()
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))
    xlmr512_npz = 'xlmr_large_512_2seeds_avg.npz' if os.path.exists('xlmr_large_512_2seeds_avg.npz') else 'xlmr_large_512_test_avg.npz'
    streams = {
        'xlmr384': _load_stream_alt('xlmr_large_test_avg.npz', 'xlmr_large_test_logits'),
        'muril':   _load_stream_alt('muril_large_test_avg.npz', 'muril_large_test_logits'),
        'xlmr512': _load_stream_alt(xlmr512_npz, 'xlmr_large_512_test_logits'),
    }
    # Alt per-language weights (xlmr384, muril, xlmr512)
    per_lang_weights = {
        'hindi': {'xlmr384': 0.20, 'muril': 0.10, 'xlmr512': 0.70},
        'tamil': {'xlmr384': 0.15, 'muril': 0.00, 'xlmr512': 0.85},
    }
    # Collect best per model
    best_by_model = {k: {} for k in streams.keys()}
    for key, pack in streams.items():
        if pack is None:
            continue
        S, E, exids, offs_all = pack['start'], pack['end'], pack['exids'], pack['offs']
        for i in range(S.shape[0]):
            ex_id = exids[i]
            lang = id2lang.get(ex_id, 'unknown')
            ctx = id2ctx.get(ex_id, '')
            score, si, ei = _decode_best_for_feature(offs_all[i], S[i], E[i], lang)
            if si is not None and ei is not None:
                cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
                text = ctx[cs:ce] if (cs is not None and ce is not None and ce > cs) else ''
            else:
                text = ''
            prev = best_by_model[key].get(ex_id, (float('-inf'), ''))
            if score > prev[0]:
                best_by_model[key][ex_id] = (score, text)
    out_ids = test_df['id'].astype(str).values
    out_pred = []
    for ex_id in out_ids:
        lang = str(id2lang.get(ex_id, 'unknown')).lower().strip()
        wmap = per_lang_weights.get(lang, per_lang_weights['hindi'])
        best = (float('-inf'), '')
        for key, w in wmap.items():
            if w <= 0: continue
            tup = best_by_model.get(key, {}).get(ex_id)
            if not tup: continue
            score, text = tup
            wscore = w * score
            if wscore > best[0]:
                best = (wscore, text)
        ans = _postprocess_ans(best[1], lang) if best[1] else ''
        out_pred.append(ans)
    sub = pd.DataFrame({'id': out_ids, 'PredictionString': out_pred})
    sub.to_csv('submission.csv', index=False)
    print('Alt weights span-level submission.csv saved', sub.shape, 'in %.1fs' % (time.time()-t0))
    return sub

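# Build ALT and save under distinct name; then restore primary to submission.csv
# The builder relies on _decode_best_for_feature and _postprocess_ans, which are defined in an earlier cell
assert '_decode_best_for_feature' in globals() and '_postprocess_ans' in globals(), 'Run the cell that defines the decode/post-process helpers first.'
sub_alt = build_span_level_ensemble_submission_alt()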
alt_name = 'submission_alt_perlang_hi020_010_070_ta015_000_085.csv'
sub_alt.to_csv(alt_name, index=False)
# Diagnostics for ALT
test_df = pd.read_csv('test.csv')
diag = sub_alt.merge(test_df[['id','language']], on='id', how='left')
diag['PredictionString'] = diag['PredictionString'].fillna('').astype(str)
diag['len'] = diag['PredictionString'].str.len()
diag['empty'] = (diag['PredictionString']=='')
print('ALT Diag: empties', int(diag['empty'].sum()), '/', len(diag), '| mean len', round(diag['len'].mean(),2))
print(diag.head())

# Restore primary submission.csv if backup exists
if os.path.exists('submission_primary_3stream_2seeds.csv'):
    pd.read_csv('submission_primary_3stream_2seeds.csv').to_csv('submission.csv', index=False)
    print('Restored primary submission.csv from backup; ALT saved as', alt_name)

Alt weights span-level submission.csv saved (112, 2) in 7.8s
ALT Diag: empties 0 / 112 | mean len 9.65
          id   PredictionString language  len  empty
0  be799d365  मुंबई, महाराष्ट्र    hindi   17  False
1  26f356026           बामाखेपा    hindi    8  False
2  57a56c43f   பெருமூளைப் புறணி    tamil   16  False
3  da062fdbb           बिंबिसार    hindi    8  False
4  72fc0d5b5               1889    hindi    4  False
Restored primary submission.csv from backup; ALT saved as submission_alt_perlang_hi020_010_070_ta015_000_085.csv
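
The two heads printed so far differ in row 1 (`सुधांशुबाला` in the primary, `बामाखेपा` in ALT), but a head only shows five of the 112 rows. A small sketch for listing every row that changed between two submissions, assuming both frames have `id` and `PredictionString` columns; `diff_submissions` is a hypothetical helper, and the toy frames copy only the first two printed rows.

```python
import pandas as pd

def diff_submissions(a, b):
    """Return the rows whose PredictionString differs between two submissions."""
    merged = a.merge(b, on='id', how='inner', suffixes=('_a', '_b'))
    left = merged['PredictionString_a'].fillna('').astype(str)
    right = merged['PredictionString_b'].fillna('').astype(str)
    return merged.loc[left != right].reset_index(drop=True)

# Toy frames mirroring the first two rows of the printed heads.
primary = pd.DataFrame({'id': ['be799d365', '26f356026'],
                        'PredictionString': ['मुंबई, महाराष्ट्र', 'सुधांशुबाला']})
alt = pd.DataFrame({'id': ['be799d365', '26f356026'],
                    'PredictionString': ['मुंबई, महाराष्ट्र', 'बामाखेपा']})
changed = diff_submissions(primary, alt)
print(len(changed), 'of', len(primary), 'predictions differ')
print(changed)
```

On the real files, the same call on `pd.read_csv(...)` of the primary backup and the ALT file would show how much the per-language weights actually move the predictions.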


In [63]:
# Activate ALT per-language weights submission as submission.csv
import pandas as pd, os
alt_fp = 'submission_alt_perlang_hi020_010_070_ta015_000_085.csv'
assert os.path.exists(alt_fp), f'Missing ALT file: {alt_fp}'
df_alt = pd.read_csv(alt_fp)
if 'PredictionString' in df_alt.columns:
    df_alt['PredictionString'] = df_alt['PredictionString'].fillna('').astype(str)
df_alt.to_csv('submission.csv', index=False)
print('submission.csv replaced with ALT:', alt_fp)
print(df_alt.head())

submission.csv replaced with ALT: submission_alt_perlang_hi020_010_070_ta015_000_085.csv
          id   PredictionString
0  be799d365  मुंबई, महाराष्ट्र
1  26f356026           बामाखेपा
2  57a56c43f   பெருமூளைப் புறணி
3  da062fdbb           बिंबिसार
4  72fc0d5b5               1889
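
Since this cell overwrites `submission.csv`, a quick structural check before submitting can catch a wrong file being activated. The sketch below assumes the expected columns are exactly `id` and `PredictionString`, as in the printed heads; `check_submission` is a hypothetical helper and the toy frames are illustrative.

```python
import pandas as pd

def check_submission(sub, test_ids):
    """Return a list of problems found; an empty list means the basic checks pass."""
    if list(sub.columns) != ['id', 'PredictionString']:
        return [f'unexpected columns: {list(sub.columns)}']
    problems = []
    ids = sub['id'].astype(str)
    expected = [str(i) for i in test_ids]
    if ids.duplicated().any():
        problems.append('duplicate ids')
    if set(ids) != set(expected):
        problems.append('id set differs from test ids')
    if sub['PredictionString'].isna().any():
        problems.append('NaN predictions present')
    return problems

test_ids = ['be799d365', '72fc0d5b5']
good = pd.DataFrame({'id': ['be799d365', '72fc0d5b5'],
                     'PredictionString': ['मुंबई, महाराष्ट्र', '1889']})
bad = pd.DataFrame({'id': ['be799d365', 'be799d365'],
                    'PredictionString': ['मुंबई, महाराष्ट्र', None]})
print(check_submission(good, test_ids))
print(check_submission(bad, test_ids))
```

Note that an empty prediction written to CSV is read back by `pd.read_csv` as NaN, so the NaN check is best applied to the in-memory frame or after `fillna('')`.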


In [68]:
# Span-level tiny knob sweep (per expert ranges) keeping streams and weights fixed
import os, json, numpy as np, pandas as pd, time, re, unicodedata

def _stable_logsoftmax(arr):
    m = np.max(arr)
    return arr - (m + np.log(np.sum(np.exp(arr - m))))

def _decode_best_for_feature_params(offsets, s_logit, e_logit, n_best, max_len, len_pen):
    mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
    s = np.where(mask, s_logit, -1e9)
    e = np.where(mask, e_logit, -1e9)
    s_logp = _stable_logsoftmax(s)
    e_logp = _stable_logsoftmax(e)
    top_s = np.argsort(s_logp)[-int(n_best):]
    best = (float('-inf'), None, None)
    for si in top_s:
        ei_max = min(len(e_logp)-1, si + int(max_len) - 1)
        band = e_logp[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_logp[si] + e_logp[ei] - float(len_pen) * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    return best

def _postprocess_ans_sweep(ans: str, lang: str = None) -> str:
    if not ans: return ''
    ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
    ans = ans.replace('\u094D\u0964', '\u0964').replace('\u094D।', '।')
    ans = re.sub(r'[।]+', '।', ans)
    if ans.endswith('।'): ans = ans[:-1]
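    # NOTE: category 'Mn' also covers legitimate word-final signs (e.g. Devanagari anusvara U+0902, Tamil pulli U+0BCD), so this trim can shorten valid answers
    if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'): ans = ans[:-1]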
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi' and len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'):
        ans = ans[:-1]
    if lang_l == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn':
        ans = ans[:-1]
    if ans and unicodedata.category(ans[0]) == 'Mn': ans = ans[1:]
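    # Characters trimmed from both ends: danda/double danda, ASCII and curly quotes, closing brackets, pipe, hyphen, whitespace
    punct_strip = '\u0964\u0965,:;.!?"\'“”‘’)]}|- \t\r\n'
    ans = ans.strip().strip(punct_strip)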
    for lq, rq in [("\"","\""), ("'","'"), ('“','”'), ('‘','’')]:
        if ans.startswith(lq) and not ans.endswith(rq): ans = ans[len(lq):]
        if ans.endswith(rq) and not ans.startswith(lq): ans = ans[:-len(rq)]
    ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
    ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
    ans = ' '.join(ans.split())
    ans = ans.replace('\u200C','').replace('\u200D','')
    return ans or ''

def build_span_level_ensemble_with_params(hi_params, ta_params, per_lang_weights=None, out_name=None):
    t0 = time.time()
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))
    # Streams
    xlmr512_npz = 'xlmr_large_512_2seeds_avg.npz' if os.path.exists('xlmr_large_512_2seeds_avg.npz') else 'xlmr_large_512_test_avg.npz'
    def load_stream(npz_path, dir_prefix):
        if not os.path.exists(npz_path): return None
        exid_json = os.path.join(dir_prefix, 'test_example_id.json')
        offs_npy = os.path.join(dir_prefix, 'test_offset_mapping.npy')
        if not (os.path.exists(exid_json) and os.path.exists(offs_npy)): return None
        data = np.load(npz_path)
        with open(exid_json, 'r', encoding='utf-8') as f:
            exids = json.load(f)
        offs = np.load(offs_npy, allow_pickle=True)
        return {'start': data['start'], 'end': data['end'], 'exids': exids, 'offs': offs}
    streams = {
        'xlmr384': load_stream('xlmr_large_test_avg.npz', 'xlmr_large_test_logits'),
        'muril':   load_stream('muril_large_test_avg.npz', 'muril_large_test_logits'),
        'xlmr512': load_stream(xlmr512_npz, 'xlmr_large_512_test_logits'),
    }
    if per_lang_weights is None:
        per_lang_weights = {
            'hindi': {'xlmr384': 0.15, 'muril': 0.10, 'xlmr512': 0.75},
            'tamil': {'xlmr384': 0.10, 'muril': 0.00, 'xlmr512': 0.90},
        }
    # Collect best span text per model using custom params
    best_by_model = {k: {} for k in streams.keys()}
    for key, pack in streams.items():
        if pack is None: continue
        S, E, exids, offs_all = pack['start'], pack['end'], pack['exids'], pack['offs']
        for i in range(S.shape[0]):
            ex_id = exids[i]
            lang = str(id2lang.get(ex_id, 'unknown')).lower().strip()
            ctx = id2ctx.get(ex_id, '')
            if lang == 'hindi':
                params = hi_params
            elif lang == 'tamil':
                params = ta_params
            else:
                params = {'n_best': 150, 'max_len': 64, 'len_pen': 0.0048}
            score, si, ei = _decode_best_for_feature_params(offs_all[i], S[i], E[i], params['n_best'], params['max_len'], params['len_pen'])
            if si is not None and ei is not None:
                cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
                text = ctx[cs:ce] if (cs is not None and ce is not None and ce > cs) else ''
            else:
                text = ''
            prev = best_by_model[key].get(ex_id, (float('-inf'), ''))
            if score > prev[0]:
                best_by_model[key][ex_id] = (score, text)
    # Choose per-language by weights
    out_ids = test_df['id'].astype(str).values
    out_pred = []
    for ex_id in out_ids:
        lang = str(id2lang.get(ex_id, 'unknown')).lower().strip()
        wmap = per_lang_weights.get(lang, per_lang_weights.get('hindi'))
        best = (float('-inf'), '')
        for key, w in wmap.items():
            if w <= 0: continue
            tup = best_by_model.get(key, {}).get(ex_id)
            if not tup: continue
            score, text = tup
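            # NOTE: score is a sum of log-probabilities (<= 0), so for the same span score a smaller w yields a larger wscore
            wscore = float(w) * score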
            if wscore > best[0]:
                best = (wscore, text)
        ans = _postprocess_ans_sweep(best[1], lang) if best[1] else ''
        out_pred.append(ans)
    sub = pd.DataFrame({'id': out_ids, 'PredictionString': out_pred})
    if out_name:
        sub.to_csv(out_name, index=False)
    return sub

# Define two tiny-probe configs per expert ranges
hi_def = {'n_best': 180, 'max_len': 52, 'len_pen': 0.0050}
ta_def = {'n_best': 220, 'max_len': 62, 'len_pen': 0.0040}
hi_alt = {'n_best': 170, 'max_len': 50, 'len_pen': 0.0051}
ta_alt = {'n_best': 210, 'max_len': 60, 'len_pen': 0.0039}

weights_pl = {
    'hindi': {'xlmr384': 0.15, 'muril': 0.10, 'xlmr512': 0.75},
    'tamil': {'xlmr384': 0.10, 'muril': 0.00, 'xlmr512': 0.90},
}

print('Building span-level DEF config...')
sub_def = build_span_level_ensemble_with_params(hi_def, ta_def, per_lang_weights=weights_pl, out_name='submission_spanlvl_def_weights_defknobs.csv')
print('Building span-level ALT config...')
sub_alt = build_span_level_ensemble_with_params(hi_alt, ta_alt, per_lang_weights=weights_pl, out_name='submission_spanlvl_def_weights_altknobs.csv')

# Diagnostics for both
test_df = pd.read_csv('test.csv')
for name in ['submission_spanlvl_def_weights_defknobs.csv', 'submission_spanlvl_def_weights_altknobs.csv']:
    df = pd.read_csv(name).merge(test_df[['id','language']], on='id', how='left')
    df['PredictionString'] = df['PredictionString'].fillna('').astype(str)
    df['len'] = df['PredictionString'].str.len()
    df['empty'] = (df['PredictionString']=='')
    print(name, '| empties:', int(df['empty'].sum()), '/', len(df), '| mean len:', round(df['len'].mean(),2), '| per-lang empties:', df.groupby('language')['empty'].sum().to_dict())
print('Sweep done. Choose which to activate and submit.')

Building span-level DEF config...


Building span-level ALT config...


submission_spanlvl_def_weights_defknobs.csv | empties: 0 / 112 | mean len: 9.79 | per-lang empties: {'hindi': 0, 'tamil': 0}
submission_spanlvl_def_weights_altknobs.csv | empties: 0 / 112 | mean len: 9.79 | per-lang empties: {'hindi': 0, 'tamil': 0}
Sweep done. Choose which to activate and submit.
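
The decoder in this cell scores a span as a sum of log-probabilities minus a length penalty, so every score is at most 0. Multiplying such a score by a weight in (0, 1] pulls it toward 0, which means a stream with a smaller weight gets a larger weighted score for the same span score. The sketch below shows the effect on made-up scores and contrasts it with adding `log(w)`, one possible alternative under which a larger weight helps a stream. It is an illustration under those assumptions, not the notebook's method; `pick_multiplicative` and `pick_log_additive` are hypothetical helpers, and only the weights are taken from the Hindi row of `weights_pl`.

```python
import math

def pick_multiplicative(cands, weights):
    """Select the stream with the highest w * score."""
    active = [k for k in cands if weights.get(k, 0) > 0]
    return max(active, key=lambda k: weights[k] * cands[k])

def pick_log_additive(cands, weights):
    """Select the stream with the highest score + log(w), i.e. the highest w * exp(score)."""
    active = [k for k in cands if weights.get(k, 0) > 0]
    return max(active, key=lambda k: cands[k] + math.log(weights[k]))

# Made-up log-probability scores for one question (all <= 0).
cands = {'xlmr384': -1.2, 'muril': -3.0, 'xlmr512': -1.0}
weights = {'xlmr384': 0.15, 'muril': 0.10, 'xlmr512': 0.75}

print(pick_multiplicative(cands, weights))  # xlmr384: 0.15 * -1.2 = -0.18 is closest to 0
print(pick_log_additive(cands, weights))    # xlmr512: -1.0 + log(0.75) is the largest
```

In this toy case the stream with the best raw score and the largest weight (`xlmr512`) loses under the multiplicative rule, so it may be worth checking on out-of-fold predictions which stream each rule actually selects.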


In [55]:
# Train another XLM-R large 512 seed (SEED=2027) for 3-seed averaging
import os, time, numpy as np, pandas as pd, collections, random, json, torch, re, gc
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DataCollatorWithPadding

SEED = 2027
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(SEED)
os.environ.setdefault('TOKENIZERS_PARALLELISM', 'false')
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True,max_split_size_mb:64')
torch.backends.cuda.matmul.allow_tf32 = True
try: torch.set_float32_matmul_precision('high')
except Exception: pass

MODEL_NAME = 'deepset/xlm-roberta-large-squad2'
MAX_LENGTH = 512
DOC_STRIDE = 256  # keep featurization identical to the other 512 seeds so logits can be averaged directly
NEG_RATIO = 1
EPOCHS = 3
LR = 2e-5
TRAIN_BS = 1
EVAL_BS = 2
GRAD_ACCUM = 20
WARMUP_RATIO = 0.15
WEIGHT_DECAY = 0.02

print('Tokenizer for new seed with', MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
pad_on_right = tokenizer.padding_side == 'right'

full_df = pd.read_csv('train.csv').merge(pd.read_csv('folds.csv'), on='id', how='left')
assert full_df['fold'].notna().all()
test_df = pd.read_csv('test.csv')

def prepare_train_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    offsets = tok.pop('offset_mapping')
    input_ids = tok['input_ids']
    start_positions, end_positions, cls_positions = [], [], []
    for i, off in enumerate(offsets):
        si = smap[i]
        ctx = contexts[si]
        a_text = str(examples['answer_text'][si])
        a_start = int(examples['answer_start'][si])
        a_end = a_start + len(a_text)
        while a_start < a_end and a_start < len(ctx) and ctx[a_start].isspace(): a_start += 1
        while a_end > a_start and a_end-1 < len(ctx) and ctx[a_end-1].isspace(): a_end -= 1
        if a_start >= a_end: a_end = a_start
        seq_ids = tok.sequence_ids(i)
        try:
            cls_index = input_ids[i].index(tokenizer.cls_token_id) if tokenizer.cls_token_id is not None else 0
        except ValueError:
            cls_index = 0
        cls_positions.append(cls_index)
        context_id = 1 if pad_on_right else 0
        try:
            ctx_s = seq_ids.index(context_id)
            ctx_e = len(seq_ids) - 1 - seq_ids[::-1].index(context_id)
        except ValueError:
            start_positions.append(cls_index); end_positions.append(cls_index); continue
        st_tok = en_tok = None
        tgt_end_incl = a_end - 1
        for t in range(ctx_s, ctx_e + 1):
            o = off[t]
            if o is None: continue
            ts, te = o
            if ts is None or te is None: continue
            if st_tok is None and ts <= a_start < te: st_tok = t
            if ts <= tgt_end_incl < te: en_tok = t
        if st_tok is not None and en_tok is not None and st_tok <= en_tok:
            start_positions.append(st_tok); end_positions.append(en_tok)
        else:
            start_positions.append(cls_index); end_positions.append(cls_index)
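    tok['start_positions'] = start_positions
    tok['end_positions'] = end_positions
    # smap indexes examples within the current map() batch; it is a global example
    # index only while the whole split fits in one batch (default batch_size=1000)
    tok['overflow_to_sample_mapping'] = smap
    tok['cls_positions'] = cls_positions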
    return tok

def prepare_pred_features_fn(examples):
    questions = [q.strip() for q in examples['question']]; contexts = examples['context']
    tok = tokenizer(
        questions if pad_on_right else contexts,
        contexts if pad_on_right else questions,
        truncation='only_second' if pad_on_right else 'only_first',
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )
    smap = tok.pop('overflow_to_sample_mapping')
    tok['example_id'] = []
    new_offsets = []
    for i, off in enumerate(tok['offset_mapping']):
        si = smap[i]
        tok['example_id'].append(str(examples['id'][si]))
        seq_ids = tok.sequence_ids(i)
        context_id = 1 if pad_on_right else 0
        new_offsets.append([o if seq_ids[k] == context_id else None for k, o in enumerate(off)])
    tok['offset_mapping'] = new_offsets
    return tok

# Pre-tokenize TEST once for this seed
logdir = 'xlmr_large_512_seed2027_test_logits'
os.makedirs(logdir, exist_ok=True)
print('Tokenizing TEST for seed 2027 ...')
test_ds = Dataset.from_pandas(test_df[['id','question','context','language']])
test_tok_all = test_ds.map(prepare_pred_features_fn, batched=True, remove_columns=test_ds.column_names, desc='xlmr512_seed2027_test_tokenize')
te_ds_pred = test_tok_all.remove_columns([c for c in test_tok_all.column_names if c not in ['input_ids','attention_mask']])
te_ds_pred.set_format(type='torch')
with open(os.path.join(logdir, 'test_example_id.json'), 'w', encoding='utf-8') as f:
    json.dump(list(test_tok_all['example_id']), f, ensure_ascii=False)
np.save(os.path.join(logdir, 'test_offset_mapping.npy'), np.array(test_tok_all['offset_mapping'], dtype=object), allow_pickle=True)

test_start_list = []; test_end_list = []; t_global = time.time()
for FOLD in range(5):
    t_fold = time.time()
    print(f'\n===== XLM-R Large 512 SEED2027 Fold {FOLD} =====', flush=True)
    tr_df = full_df[full_df['fold'] != FOLD].reset_index(drop=True)
    va_df = full_df[full_df['fold'] == FOLD].reset_index(drop=True)
    tr_ds = Dataset.from_pandas(tr_df[['id','question','context','answer_text','answer_start','language']])
    va_ds = Dataset.from_pandas(va_df[['id','question','context','answer_text','answer_start','language']])
    t0 = time.time()
    tr_tok = tr_ds.map(prepare_train_features_fn, batched=True, remove_columns=tr_ds.column_names, desc=f'seed2027_train_tokenize_f{FOLD}')
    print(f'Fold {FOLD}: train feats pre-sample={tr_tok.num_rows} | tok_time={time.time()-t0:.1f}s', flush=True)
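    # Negative sampling: keep every positive window plus up to NEG_RATIO negatives per positive
    # (up to 4 negatives when an example has no positive window); a window counts as negative
    # when its start label equals the CLS index stored in cls_positions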
    smap = np.array(tr_tok['overflow_to_sample_mapping']); sp = np.array(tr_tok['start_positions']); cls_arr = np.array(tr_tok['cls_positions'])
    is_pos = (sp != cls_arr)
    keep = []; by_ex = collections.defaultdict(list)
    for idx, ex in enumerate(smap): by_ex[ex].append(idx)
    for ex, idxs in by_ex.items():
        idxs = np.array(idxs); pos = idxs[is_pos[idxs]]; neg = idxs[~is_pos[idxs]]
        keep.extend(pos.tolist())
        if len(pos) == 0:
            sel = neg[:min(4, len(neg))]; keep.extend(sel.tolist())
        else:
            cap = min(len(neg), NEG_RATIO * len(pos))
            if cap > 0:
                sel = neg[np.random.permutation(len(neg))[:cap]]; keep.extend(sel.tolist())
    keep = np.array(sorted(set(keep))); before = tr_tok.num_rows
    tr_tok = tr_tok.select(keep.tolist()); after = tr_tok.num_rows
    print(f'Fold {FOLD}: neg-sample {before}->{after} ({after/before:.2%})', flush=True)
    # Torch datasets
    tr_ds_torch = tr_tok.remove_columns([c for c in tr_tok.column_names if c not in ['input_ids','attention_mask','start_positions','end_positions']])
    tr_ds_torch.set_format(type='torch')
    va_tok_all = va_ds.map(prepare_pred_features_fn, batched=True, remove_columns=va_ds.column_names, desc=f'seed2027_valid_tokenize_f{FOLD}')
    va_ds_pred = va_tok_all.remove_columns([c for c in va_tok_all.column_names if c not in ['input_ids','attention_mask']])
    va_ds_pred.set_format(type='torch')
    # Model
    gc.collect(); torch.cuda.empty_cache()
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
    try:
        model.config.use_cache = False
    except Exception:
        pass
    try:
        model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': False})
    except Exception:
        try: model.gradient_checkpointing_enable()
        except Exception: pass
    args = TrainingArguments(
        output_dir=f'xlmr_large_512_seed2027_f{FOLD}',
        num_train_epochs=EPOCHS,
        learning_rate=LR,
        per_device_train_batch_size=TRAIN_BS,
        per_device_eval_batch_size=EVAL_BS,
        gradient_accumulation_steps=GRAD_ACCUM,
        fp16=False,
        bf16=True,
        logging_steps=200,
        save_strategy='no',
        evaluation_strategy='no',
        seed=SEED,
        dataloader_num_workers=0,
        dataloader_pin_memory=False,
        eval_accumulation_steps=1,
        report_to=[],
        gradient_checkpointing=True,
        warmup_ratio=WARMUP_RATIO,
        weight_decay=WEIGHT_DECAY,
        max_grad_norm=1.0
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds_torch,
        eval_dataset=None,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    print(f'Fold {FOLD}: training...')
    try:
        trainer.train()
    except RuntimeError as e:
        if 'out of memory' in str(e).lower():
            print('OOM; retry with GRAD_ACCUM=24, EVAL_BS=1')
            del trainer, model
            gc.collect(); torch.cuda.empty_cache()
            model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
            try: model.config.use_cache = False
            except Exception: pass
            try: model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': False})
            except Exception:
                try: model.gradient_checkpointing_enable()
                except Exception: pass
            args = TrainingArguments(
                output_dir=f'xlmr_large_512_seed2027_f{FOLD}',
                num_train_epochs=EPOCHS,
                learning_rate=LR,
                per_device_train_batch_size=1,
                per_device_eval_batch_size=1,
                gradient_accumulation_steps=24,
                fp16=False,
                bf16=True,
                logging_steps=200,
                save_strategy='no',
                evaluation_strategy='no',
                seed=SEED,
                dataloader_num_workers=0,
                dataloader_pin_memory=False,
                eval_accumulation_steps=1,
                report_to=[],
                gradient_checkpointing=True,
                warmup_ratio=WARMUP_RATIO,
                weight_decay=WEIGHT_DECAY,
                max_grad_norm=1.0,
            )
            trainer = Trainer(
                model=model,
                args=args,
                train_dataset=tr_ds_torch,
                eval_dataset=None,
                tokenizer=tokenizer,
                data_collator=DataCollatorWithPadding(tokenizer),
            )
            trainer.train()
        else:
            raise
    print(f'Fold {FOLD}: train_done in {time.time()-t_fold:.1f}s')
    # Predict TEST and save logits for this fold
    print(f'Fold {FOLD}: predicting TEST and saving logits (seed2027) ...')
    gc.collect(); torch.cuda.empty_cache()
    model.eval()
    with torch.no_grad():
        te_pred = trainer.predict(te_ds_pred)
    te_start = te_pred.predictions[0]
    te_end = te_pred.predictions[1]
    np.save(os.path.join(logdir, f'test_start_logits_f{FOLD}.npy'), te_start)
    np.save(os.path.join(logdir, f'test_end_logits_f{FOLD}.npy'), te_end)
    test_start_list.append(te_start)
    test_end_list.append(te_end)

    del model, trainer, tr_ds_torch, tr_tok, va_ds_pred, va_tok_all, te_pred
    gc.collect(); torch.cuda.empty_cache()

# Average TEST logits across folds for this seed
print('Averaging TEST logits across folds for seed 2027 ...')
start_stack = np.stack(test_start_list, axis=0)
end_stack = np.stack(test_end_list, axis=0)
start_avg = start_stack.mean(axis=0)
end_avg = end_stack.mean(axis=0)
np.savez_compressed('xlmr_large_512_seed2027_test_avg.npz', start=start_avg, end=end_avg)
print('Saved xlmr_large_512_seed2027_test_avg.npz; example_id/offsets saved under', logdir)
print('Seed 2027 5-fold done in %.1fs' % (time.time()-t_global))

Tokenizer for new seed with deepset/xlm-roberta-large-squad2




Tokenizing TEST for seed 2027 ...


xlmr512_seed2027_test_tokenize: 100%|██████████| 112/112 [00:02<00:00, 39.14 examples/s]





===== XLM-R Large 512 SEED2027 Fold 0 =====


seed2027_train_tokenize_f0: 100%|██████████| 817/817 [00:17<00:00, 46.42 examples/s]

Fold 0: train feats pre-sample=10484 | tok_time=17.6s


Fold 0: neg-sample 10484->2085 (19.89%)





seed2027_valid_tokenize_f0: 100%|██████████| 185/185 [00:04<00:00, 40.94 examples/s]




Some weights of the model checkpoint at deepset/xlm-roberta-large-squad2 were not used when initializing XLMRobertaForQuestionAnswering: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




Fold 0: training...


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss


Fold 0: train_done in 2598.1s
Fold 0: predicting TEST and saving logits (seed2027) ...



===== XLM-R Large 512 SEED2027 Fold 1 =====


seed2027_train_tokenize_f1: 100%|██████████| 807/807 [00:14<00:00, 57.32 examples/s]

Fold 1: train feats pre-sample=10396 | tok_time=14.1s


Fold 1: neg-sample 10396->2047 (19.69%)





seed2027_valid_tokenize_f1: 100%|██████████| 195/195 [00:03<00:00, 50.07 examples/s]




Fold 1: training...


Fold 1: train_done in 2566.3s
Fold 1: predicting TEST and saving logits (seed2027) ...



===== XLM-R Large 512 SEED2027 Fold 2 =====


seed2027_train_tokenize_f2: 100%|██████████| 784/784 [00:13<00:00, 56.06 examples/s]

Fold 2: train feats pre-sample=9842 | tok_time=14.0s


Fold 2: neg-sample 9842->1962 (19.93%)





seed2027_valid_tokenize_f2: 100%|██████████| 218/218 [00:05<00:00, 43.04 examples/s]




Fold 2: training...


Fold 2: train_done in 2408.6s
Fold 2: predicting TEST and saving logits (seed2027) ...



===== XLM-R Large 512 SEED2027 Fold 3 =====


seed2027_train_tokenize_f3: 100%|██████████| 795/795 [00:15<00:00, 51.01 examples/s]

Fold 3: train feats pre-sample=9960 | tok_time=15.6s


Fold 3: neg-sample 9960->1994 (20.02%)





seed2027_valid_tokenize_f3: 100%|██████████| 207/207 [00:05<00:00, 38.45 examples/s]




Fold 3: training...


Fold 3: train_done in 2471.8s
Fold 3: predicting TEST and saving logits (seed2027) ...



===== XLM-R Large 512 SEED2027 Fold 4 =====


seed2027_train_tokenize_f4: 100%|██████████| 805/805 [00:13<00:00, 57.84 examples/s]

Fold 4: train feats pre-sample=10458 | tok_time=14.0s


Fold 4: neg-sample 10458->2024 (19.35%)





seed2027_valid_tokenize_f4: 100%|██████████| 197/197 [00:03<00:00, 49.82 examples/s]




Fold 4: training...


Fold 4: train_done in 2548.0s
Fold 4: predicting TEST and saving logits (seed2027) ...


Averaging TEST logits across folds for seed 2027 ...


Saved xlmr_large_512_seed2027_test_avg.npz; example_id/offsets saved under xlmr_large_512_seed2027_test_logits
Seed 2027 5-fold done in 13207.9s
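
The fold-averaging step at the end of the cell above can be sketched in isolation. This is a minimal illustration on made-up arrays, not the notebook's actual helper: the function name `average_fold_logits` and the explicit shape check are assumptions added here, on the premise that every fold predicts the same tokenized TEST features and so produces logits of identical shape.

```python
import numpy as np

def average_fold_logits(fold_logits):
    """Average per-fold logit arrays of identical shape (n_features, seq_len).

    Hypothetical helper for illustration; raises if the folds disagree on shape,
    which would indicate that they were not predicted on the same features.
    """
    shapes = {a.shape for a in fold_logits}
    if len(shapes) != 1:
        raise ValueError(f"fold logits disagree on shape: {shapes}")
    return np.stack(fold_logits, axis=0).mean(axis=0)

# Toy data: 3 folds, 2 features, 4 token positions (values are made up).
folds = [np.full((2, 4), v, dtype=np.float32) for v in (1.0, 2.0, 3.0)]
avg = average_fold_logits(folds)
print(avg.shape, float(avg[0, 0]))
```

Failing early on a shape mismatch is cheaper than discovering a misaligned ensemble after decoding.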


In [58]:
# Average 3 seeds (512) at logit level and rebuild span-level ensemble + diagnostics (no file swapping)
import os, time, numpy as np, pandas as pd

def average_3seeds_512_and_rebuild():
    t0 = time.time()
    npz_main = 'xlmr_large_512_test_avg.npz'
    npz_s2 = 'xlmr_large_512_seed2026_test_avg.npz'
    npz_s3 = 'xlmr_large_512_seed2027_test_avg.npz'
    assert os.path.exists(npz_main), f'Missing {npz_main}'
    assert os.path.exists(npz_s2), f'Missing {npz_s2}'
    assert os.path.exists(npz_s3), f'Missing {npz_s3} (wait for seed2027 training to finish)'
    d0 = np.load(npz_main); d1 = np.load(npz_s2); d2 = np.load(npz_s3)
    start = (d0['start'] + d1['start'] + d2['start']) / 3.0
    end = (d0['end'] + d1['end'] + d2['end']) / 3.0
    out_npz = 'xlmr_large_512_3seeds_avg.npz'
    np.savez_compressed(out_npz, start=start, end=end)
    print('Saved', out_npz, 'with shape', start.shape, 'in %.1fs' % (time.time()-t0))

    # Build span-level 3-stream using builder (Cell 31 prefers 3-seed if present)
    assert 'build_span_level_ensemble_submission' in globals(), 'Run Cell 31 to define the span-level builder.'
    sub = build_span_level_ensemble_submission()
    print('3-seed primary span-level rebuilt. Head:')
    print(sub.head())

    # Diagnostics
    test_df = pd.read_csv('test.csv')
    sub_diag = pd.read_csv('submission.csv').merge(test_df[['id','language']], on='id', how='left')
    sub_diag['PredictionString'] = sub_diag['PredictionString'].fillna('').astype(str)
    sub_diag['len'] = sub_diag['PredictionString'].str.len()
    sub_diag['empty'] = (sub_diag['PredictionString']=='')
    print('Diag: empties', int(sub_diag['empty'].sum()), '/', len(sub_diag), '| mean len', round(sub_diag['len'].mean(),2))
    return out_npz

# Execute now
avg_npz = average_3seeds_512_and_rebuild()
print('Averaged NPZ path:', avg_npz)

Saved xlmr_large_512_3seeds_avg.npz with shape (1401, 512) in 0.5s


Span-level 3-stream weighted submission.csv saved (112, 2) in 8.2s
3-seed primary span-level rebuilt. Head:
          id   PredictionString
0  be799d365  मुंबई, महाराष्ट्र
1  26f356026        सुधांशुबाला
2  57a56c43f   பெருமூளைப் புறணி
3  da062fdbb           बिंबिसार
4  72fc0d5b5               1889
Diag: empties 0 / 112 | mean len 9.79
Averaged NPZ path: xlmr_large_512_3seeds_avg.npz


In [64]:
# Activate expert-priority variant #2: span-level def-weights ALT knobs
import pandas as pd, shutil, os
alt2 = 'submission_spanlvl_def_weights_altknobs.csv'
assert os.path.exists(alt2), f'Missing file: {alt2}'
df = pd.read_csv(alt2)
if 'PredictionString' in df.columns:
    df['PredictionString'] = df['PredictionString'].fillna('').astype(str)
df.to_csv('submission.csv', index=False)
print('submission.csv replaced with', alt2)
print(df.head())

submission.csv replaced with submission_spanlvl_def_weights_altknobs.csv
          id   PredictionString
0  be799d365  मुंबई, महाराष्ट्र
1  26f356026        सुधांशुबाला
2  57a56c43f   பெருமூளைப் புறணி
3  da062fdbb           बिंबिसार
4  72fc0d5b5               1889
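
Several cells re-read a saved submission and then clean up `PredictionString`. The sketch below shows why that cleanup is needed and one alternative, using an in-memory CSV with made-up rows: by default `pd.read_csv` turns an empty field into NaN, whereas `dtype=str` with `keep_default_na=False` keeps the text as written. Whether this is preferable to the notebook's own cleanup is a judgment call.

```python
import io
import pandas as pd

# Made-up two-row submission: one numeric-looking answer, one empty answer.
csv_text = "id,PredictionString\na1,1889\na2,\n"

# Default parsing: the empty field is read as NaN.
df_default = pd.read_csv(io.StringIO(csv_text))

# Reading every column as text and disabling NaN inference preserves the strings.
df_text = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(int(df_default["PredictionString"].isna().sum()))
print(df_text["PredictionString"].tolist())
```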


In [65]:
# Build expert-suggested extra slot: 512-only span-level with tightened max_len and slightly higher len_pen
import os, json, numpy as np, pandas as pd, time, re, unicodedata, shutil

def _stable_logsoftmax(arr):
    m = np.max(arr)
    return arr - (m + np.log(np.sum(np.exp(arr - m))))

def _decode_best_for_feature_custom(offsets, s_logit, e_logit, lang):
    mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
    s = np.where(mask, s_logit, -1e9)
    e = np.where(mask, e_logit, -1e9)
    lang_l = str(lang).lower().strip()
    # Expert extra slot knobs: tighten max_len and add +0.0001 len_pen
    if lang_l == 'hindi':
        n_best = 180; max_len = 48; len_pen = 0.0051
    elif lang_l == 'tamil':
        n_best = 220; max_len = 58; len_pen = 0.0041
    else:
        n_best = 150; max_len = 60; len_pen = 0.0049
    s_logp = _stable_logsoftmax(s)
    e_logp = _stable_logsoftmax(e)
    top_s = np.argsort(s_logp)[-n_best:]
    best = (float('-inf'), None, None)
    for si in top_s:
        ei_max = min(len(e_logp)-1, si + max_len - 1)
        band = e_logp[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    return best

def _postprocess_ans_min(ans: str, lang: str = None) -> str:
    if not ans: return ''
    # numeric collapse
    ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
    # danda/virama hygiene and collapse
    ans = ans.replace('\u094D\u0964', '\u0964').replace('\u094D।', '।').replace('\u094D।', '।')
    ans = re.sub(r'[।]+', '।', ans)
    if ans.endswith('।'): ans = ans[:-1]
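    # trim one trailing nonspacing mark (category Mn, which already includes the virama U+094D);
    # caution: this also removes a legitimate final Mn vowel sign such as U+0947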
    if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'): ans = ans[:-1]
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi' and len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'):
        ans = ans[:-1]
    if lang_l == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn':
        ans = ans[:-1]
    if ans and unicodedata.category(ans[0]) == 'Mn': ans = ans[1:]
    # outer punct/space trim + zero-widths
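    # str.strip takes a literal character set, so list the characters directly (regex escapes such as \s do not apply)
    punct_strip = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'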
    ans = ans.strip().strip(punct_strip)
    for lq, rq in [("\"","\""), ("'","'"), ('\u201c','\u201d'), ('\u2018','\u2019')]:
        if ans.startswith(lq) and not ans.endswith(rq): ans = ans[len(lq):]
        if ans.endswith(rq) and not ans.startswith(lq): ans = ans[:-len(rq)]
    ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
    ans = ans.replace('\u200B','').replace('\u00AD','')
    ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
    ans = ' '.join(ans.split())
    ans = ans.replace('\u200C','').replace('\u200D','')
    if ans.endswith('।'): ans = ans[:-1]
    return ans or ''

def build_512only_spanlvl_tight():
    t0 = time.time()
    # Choose 3-seed averaged logits if available
    if os.path.exists('xlmr_large_512_3seeds_avg.npz'):
        npz = 'xlmr_large_512_3seeds_avg.npz'
    elif os.path.exists('xlmr_large_512_2seeds_avg.npz'):
        npz = 'xlmr_large_512_2seeds_avg.npz'
    else:
        npz = 'xlmr_large_512_test_avg.npz'
    exid_json = 'xlmr_large_512_test_logits/test_example_id.json'
    offs_npy = 'xlmr_large_512_test_logits/test_offset_mapping.npy'
    assert os.path.exists(npz) and os.path.exists(exid_json) and os.path.exists(offs_npy), 'Missing 512 artifacts'
    data = np.load(npz)
    S = data['start']; E = data['end']
    with open(exid_json, 'r', encoding='utf-8') as f:
        exids = json.load(f)
    offs_all = np.load(offs_npy, allow_pickle=True)
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))
    # Pick best per example from 512 stream only
    best_per_ex = {}  # ex_id -> (score, text)
    for i in range(S.shape[0]):
        ex_id = exids[i]
        lang = id2lang.get(ex_id, 'unknown')
        score, si, ei = _decode_best_for_feature_custom(offs_all[i], S[i], E[i], lang)
        if si is not None and ei is not None:
            cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
            if cs is not None and ce is not None and ce > cs:
                text = id2ctx.get(ex_id, '')[cs:ce]
            else:
                text = ''
        else:
            text = ''
        prev = best_per_ex.get(ex_id, (float('-inf'), ''))
        if score > prev[0]:
            best_per_ex[ex_id] = (score, text)
    # Post-process and write
    out_ids = test_df['id'].astype(str).values
    out_pred = []
    for ex_id in out_ids:
        lang = id2lang.get(ex_id, 'unknown')
        ans = best_per_ex.get(ex_id, (float('-inf'), ''))[1]
        out_pred.append(_postprocess_ans_min(ans, lang))
    sub = pd.DataFrame({'id': out_ids, 'PredictionString': out_pred})
    sub.to_csv('submission_512only_spanlvl_hi48_ta58_lenpen_plus0.0001.csv', index=False)
    # Activate as submission.csv
    sub.to_csv('submission.csv', index=False)
    print('Activated 512-only tight variant as submission.csv | shape', sub.shape, '| wrote backup submission_512only_spanlvl_hi48_ta58_lenpen_plus0.0001.csv in %.1fs' % (time.time()-t0))
    # Quick diagnostics
    diag = sub.merge(test_df[['id','language']], on='id', how='left')
    diag['PredictionString'] = diag['PredictionString'].fillna('').astype(str)
    diag['len'] = diag['PredictionString'].str.len()
    diag['empty'] = (diag['PredictionString']=='')
    print('Diag: empties', int(diag['empty'].sum()), '/', len(diag), '| mean len', round(diag['len'].mean(),2), '| per-lang empties', diag.groupby('language')['empty'].sum().to_dict())
    return sub

_ = build_512only_spanlvl_tight()

Activated 512-only tight variant as submission.csv | shape (112, 2) | wrote backup submission_512only_spanlvl_hi48_ta58_lenpen_plus0.0001.csv in 3.0s
Diag: empties 0 / 112 | mean len 10.4 | per-lang empties {'hindi': 0, 'tamil': 0}
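
A toy example makes the length penalty's role easier to see. In the decoder above, the end index for each start is chosen by raw end log-probability inside the `max_len` band, and `len_pen` is subtracted afterwards, so the penalty only affects the ranking between different starts. The sketch below is a variation, not the notebook's method: it applies the penalty to every candidate end, so a slightly weaker but much closer end can win. The function name `best_span_joint` and the logits are made up for illustration.

```python
import numpy as np

def log_softmax(x):
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

def best_span_joint(start_logits, end_logits, max_len, len_pen):
    """Return (score, start_idx, end_idx), penalizing length per candidate end."""
    s_logp = log_softmax(np.asarray(start_logits, dtype=float))
    e_logp = log_softmax(np.asarray(end_logits, dtype=float))
    best = (float("-inf"), None, None)
    for si in range(len(s_logp)):
        hi = min(len(e_logp), si + max_len)
        # Penalized end scores for ends si .. hi-1; the offset equals span length minus one.
        band = e_logp[si:hi] - len_pen * np.arange(hi - si)
        off = int(np.argmax(band))
        score = float(s_logp[si] + band[off])
        if score > best[0]:
            best = (score, si, si + off)
    return best

# One dominant start at index 1; two competing ends at indices 2 and 5.
start = [0.0, 5.0, 0.0, 0.0, 0.0, 0.0]
end = [0.0, 0.0, 4.9, 0.0, 0.0, 5.0]
no_pen = best_span_joint(start, end, max_len=6, len_pen=0.0)
with_pen = best_span_joint(start, end, max_len=6, len_pen=0.05)
print(no_pen[1:], with_pen[1:])
```

Without a penalty the far end (index 5) wins on its 0.1 logit advantage; with `len_pen=0.05` the three extra tokens cost 0.15, so the shorter span ending at index 2 is preferred.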


In [66]:
# 512-only span-level with fallback gate (experts' top pick)
import os, json, numpy as np, pandas as pd, time, re, unicodedata

def _stable_logsoftmax(arr):
    m = np.max(arr)
    return arr - (m + np.log(np.sum(np.exp(arr - m))))

def _decode_best_with_fallback(offsets, s_logit, e_logit, lang):
    # Per-language knobs
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi':
        n_best = 180; max_len = 50; len_pen = 0.00505; fb_cap = 36
    elif lang_l == 'tamil':
        n_best = 220; max_len = 60; len_pen = 0.00405; fb_cap = 42
    else:
        n_best = 150; max_len = 60; len_pen = 0.0048; fb_cap = 42
    # mask non-context
    mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
    s = np.where(mask, s_logit, -1e9)
    e = np.where(mask, e_logit, -1e9)
    s_logp = _stable_logsoftmax(s)
    e_logp = _stable_logsoftmax(e)
    top_s = np.argsort(s_logp)[-n_best:]
    best = (float('-inf'), None, None)
    for si in top_s:
        ei_max = min(len(e_logp)-1, si + max_len - 1)
        band = e_logp[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    # Fallback if chosen char span > 40: re-decode with tighter cap and accept if >= best-0.5
    if best[1] is not None and best[2] is not None:
        cs, _ = offsets[best[1]]; _, ce = offsets[best[2]]
        if cs is not None and ce is not None and ce > cs:
            if (ce - cs) > 40:
                best_fb = (float('-inf'), None, None)
                for si in top_s:
                    ei_max = min(len(e_logp)-1, si + fb_cap - 1)
                    band = e_logp[si:ei_max+1]
                    if band.size == 0: continue
                    off = int(np.argmax(band))
                    ei = si + off
                    L = ei - si + 1
                    score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
                    if score > best_fb[0]:
                        best_fb = (score, si, ei)
                if best_fb[1] is not None and best_fb[2] is not None and best_fb[0] >= best[0] - 0.5:
                    best = best_fb
    return best

def _postprocess_ans_fallback(ans: str, lang: str = None) -> str:
    if not ans: return ''
    # numeric collapse
    ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
    # danda/virama hygiene
    ans = ans.replace('\u094D\u0964', '\u0964').replace('\u094D।', '।')
    ans = re.sub(r'[।]+', '।', ans)
    if ans.endswith('।'): ans = ans[:-1]
    # trailing combining/virama guard
    if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'): ans = ans[:-1]
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi' and len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'):
        ans = ans[:-1]
    if lang_l == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn':
        ans = ans[:-1]
    if ans and unicodedata.category(ans[0]) == 'Mn': ans = ans[1:]
    # collapse hyphens inside words
    ans = re.sub(r'(?<=\S)[\-–—]{2,}(?=\S)', '-', ans)
    # strip outer punct/space
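    # str.strip takes a literal character set, so list the characters directly (regex escapes such as \s do not apply)
    punct_strip = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'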
    ans = ans.strip().strip(punct_strip)
    # unmatched quotes
    for lq, rq in [("\"","\""), ("'","'"), ('\u201c','\u201d'), ('\u2018','\u2019')]:
        if ans.startswith(lq) and not ans.endswith(rq): ans = ans[len(lq):]
        if ans.endswith(rq) and not ans.startswith(lq): ans = ans[:-len(rq)]
    # zero-widths and spaces
    ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
    ans = ans.replace('\u200B','').replace('\u00AD','')
    ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
    ans = ' '.join(ans.split())
    ans = ans.replace('\u200C','').replace('\u200D','')
    if ans.endswith('।'): ans = ans[:-1]
    return ans or ''

def build_512only_with_fallback_submit():
    t0 = time.time()
    # Prefer 3-seed avg npz
    if os.path.exists('xlmr_large_512_3seeds_avg.npz'):
        npz = 'xlmr_large_512_3seeds_avg.npz'
    elif os.path.exists('xlmr_large_512_2seeds_avg.npz'):
        npz = 'xlmr_large_512_2seeds_avg.npz'
    else:
        npz = 'xlmr_large_512_test_avg.npz'
    exid_json = 'xlmr_large_512_test_logits/test_example_id.json'
    offs_npy = 'xlmr_large_512_test_logits/test_offset_mapping.npy'
    assert os.path.exists(npz) and os.path.exists(exid_json) and os.path.exists(offs_npy), 'Missing 512 artifacts'
    data = np.load(npz)
    S = data['start']; E = data['end']
    with open(exid_json, 'r', encoding='utf-8') as f:
        exids = json.load(f)
    offs_all = np.load(offs_npy, allow_pickle=True)
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))
    # Best per example
    best_per_ex = {}
    for i in range(S.shape[0]):
        ex_id = exids[i]
        lang = id2lang.get(ex_id, 'unknown')
        score, si, ei = _decode_best_with_fallback(offs_all[i], S[i], E[i], lang)
        if si is not None and ei is not None:
            cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
            if cs is not None and ce is not None and ce > cs:
                text = id2ctx.get(ex_id, '')[cs:ce]
            else:
                text = ''
        else:
            text = ''
        prev = best_per_ex.get(ex_id, (float('-inf'), ''))
        if score > prev[0]:
            best_per_ex[ex_id] = (score, text)
    # Post-process and save
    out_ids = test_df['id'].astype(str).values
    preds = []
    for ex_id in out_ids:
        lang = id2lang.get(ex_id, 'unknown')
        ans = best_per_ex.get(ex_id, (float('-inf'), ''))[1]
        preds.append(_postprocess_ans_fallback(ans, lang))
    sub = pd.DataFrame({'id': out_ids, 'PredictionString': preds})
    out_name = 'submission_512only_fallback_hi180_50_00505_ta220_60_00405_fb36_42.csv'
    sub.to_csv(out_name, index=False)
    sub.to_csv('submission.csv', index=False)
    # Diagnostics
    diag = sub.merge(test_df[['id','language']], on='id', how='left')
    diag['PredictionString'] = diag['PredictionString'].fillna('').astype(str)
    diag['len'] = diag['PredictionString'].str.len()
    diag['empty'] = (diag['PredictionString']=='')
    print('512-only with fallback saved as submission.csv and', out_name, '| Empties:', int(diag['empty'].sum()), '/', len(diag), '| Mean len:', round(diag['len'].mean(),2), '| Per-lang empties:', diag.groupby('language')['empty'].sum().to_dict(), '| Took: %.1fs' % (time.time()-t0))
    return sub

# Build and activate
_ = build_512only_with_fallback_submit()

512-only with fallback saved as submission.csv and submission_512only_fallback_hi180_50_00505_ta220_60_00405_fb36_42.csv | Empties: 0 / 112 | Mean len: 10.4 | Per-lang empties: {'hindi': 0, 'tamil': 0} | Took: 2.4s
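
The fallback gate in the cell above reduces to a small decision rule: keep the primary span unless it is longer than 40 characters and the tighter re-decode scores within 0.5 of it. The sketch below isolates that rule. The function name `choose_span` and its parameter names are hypothetical; the 40-character limit and 0.5 margin are the values used in the cell.

```python
def choose_span(primary, fallback, primary_char_len, char_limit=40, margin=0.5):
    """Pick between two (score, start_idx, end_idx) candidates.

    The fallback (decoded with a tighter token cap) is accepted only when the
    primary span is long in characters and the fallback loses at most `margin`.
    """
    if primary_char_len <= char_limit:
        return primary
    if fallback[1] is None or fallback[2] is None:
        return primary
    return fallback if fallback[0] >= primary[0] - margin else primary

# Made-up scores: the primary span is 55 characters long.
primary = (-1.0, 10, 30)
close_fallback = (-1.3, 10, 18)   # within 0.5 of the primary score
weak_fallback = (-2.0, 10, 18)    # more than 0.5 worse
print(choose_span(primary, close_fallback, 55))
print(choose_span(primary, weak_fallback, 55))
```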


In [67]:
# 512-only span-level, no-fallback, tiny knob bump (experts' #2 variant)
import os, json, numpy as np, pandas as pd, time, re, unicodedata

def _stable_logsoftmax(arr):
    m = np.max(arr)
    return arr - (m + np.log(np.sum(np.exp(arr - m))))

def _decode_best_nofallback(offsets, s_logit, e_logit, lang):
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi':
        n_best = 190; max_len = 50; len_pen = 0.00505
    elif lang_l == 'tamil':
        n_best = 230; max_len = 60; len_pen = 0.00405
    else:
        n_best = 150; max_len = 60; len_pen = 0.0048
    mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
    s = np.where(mask, s_logit, -1e9)
    e = np.where(mask, e_logit, -1e9)
    s_logp = _stable_logsoftmax(s)
    e_logp = _stable_logsoftmax(e)
    top_s = np.argsort(s_logp)[-n_best:]
    best = (float('-inf'), None, None)
    for si in top_s:
        ei_max = min(len(e_logp)-1, si + max_len - 1)
        band = e_logp[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    return best

def _postprocess_ans_min2(ans: str, lang: str = None) -> str:
    if not ans: return ''
    ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
    ans = ans.replace('\u094D\u0964', '\u0964').replace('\u094D।', '।')
    ans = re.sub(r'[।]+', '।', ans)
    if ans.endswith('।'): ans = ans[:-1]
    if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'): ans = ans[:-1]
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi' and len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'):
        ans = ans[:-1]
    if lang_l == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn':
        ans = ans[:-1]
    if ans and unicodedata.category(ans[0]) == 'Mn': ans = ans[1:]
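    # str.strip takes a literal character set, so list the characters directly (regex escapes such as \s do not apply)
    punct_strip = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'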
    ans = ans.strip().strip(punct_strip)
    for lq, rq in [("\"","\""), ("'","'"), ('\u201c','\u201d'), ('\u2018','\u2019')]:
        if ans.startswith(lq) and not ans.endswith(rq): ans = ans[len(lq):]
        if ans.endswith(rq) and not ans.startswith(lq): ans = ans[:-len(rq)]
    ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
    ans = ans.replace('\u200B','').replace('\u00AD','')
    ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
    ans = ' '.join(ans.split())
    ans = ans.replace('\u200C','').replace('\u200D','')
    if ans.endswith('।'): ans = ans[:-1]
    return ans or ''

def build_512only_nofallback_submit():
    t0 = time.time()
    if os.path.exists('xlmr_large_512_3seeds_avg.npz'):
        npz = 'xlmr_large_512_3seeds_avg.npz'
    elif os.path.exists('xlmr_large_512_2seeds_avg.npz'):
        npz = 'xlmr_large_512_2seeds_avg.npz'
    else:
        npz = 'xlmr_large_512_test_avg.npz'
    exid_json = 'xlmr_large_512_test_logits/test_example_id.json'
    offs_npy = 'xlmr_large_512_test_logits/test_offset_mapping.npy'
    assert os.path.exists(npz) and os.path.exists(exid_json) and os.path.exists(offs_npy), 'Missing 512 artifacts'
    data = np.load(npz)
    S = data['start']; E = data['end']
    with open(exid_json, 'r', encoding='utf-8') as f:
        exids = json.load(f)
    offs_all = np.load(offs_npy, allow_pickle=True)
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))
    best_per_ex = {}
    for i in range(S.shape[0]):
        ex_id = exids[i]
        lang = id2lang.get(ex_id, 'unknown')
        score, si, ei = _decode_best_nofallback(offs_all[i], S[i], E[i], lang)
        if si is not None and ei is not None:
            cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
            if cs is not None and ce is not None and ce > cs:
                text = id2ctx.get(ex_id, '')[cs:ce]
            else:
                text = ''
        else:
            text = ''
        prev = best_per_ex.get(ex_id, (float('-inf'), ''))
        if score > prev[0]:
            best_per_ex[ex_id] = (score, text)
    out_ids = test_df['id'].astype(str).values
    preds = []
    for ex_id in out_ids:
        lang = id2lang.get(ex_id, 'unknown')
        ans = best_per_ex.get(ex_id, (float('-inf'), ''))[1]
        preds.append(_postprocess_ans_min2(ans, lang))
    sub = pd.DataFrame({'id': out_ids, 'PredictionString': preds})
    out_name = 'submission_512only_nb_hi190_ta230_max50_60_lp0.00505_0.00405.csv'
    sub.to_csv(out_name, index=False)
    sub.to_csv('submission.csv', index=False)
    # Diagnostics
    diag = sub.merge(test_df[['id','language']], on='id', how='left')
    diag['PredictionString'] = diag['PredictionString'].fillna('').astype(str)
    diag['len'] = diag['PredictionString'].str.len()
    diag['empty'] = (diag['PredictionString']=='')
    print('512-only no-fallback saved as submission.csv and', out_name, '| Empties:', int(diag['empty'].sum()), '/', len(diag), '| Mean len:', round(diag['len'].mean(),2), '| Per-lang empties:', diag.groupby('language')['empty'].sum().to_dict(), '| Took: %.1fs' % (time.time()-t0))
    return sub

_ = build_512only_nofallback_submit()

512-only no-fallback saved as submission.csv and submission_512only_nb_hi190_ta230_max50_60_lp0.00505_0.00405.csv | Empties: 0 / 112 | Mean len: 10.4 | Per-lang empties: {'hindi': 0, 'tamil': 0} | Took: 2.8s
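
The builder above decodes every tokenized window separately and then keeps, per example id, the window whose best span scored highest. A minimal sketch of that reduction step on toy data follows; `best_per_example`, `toy_ids`, and `toy_spans` are hypothetical names used only for illustration and are not part of the notebook.

```python
def best_per_example(example_ids, scored_spans):
    """Keep the highest-scoring (score, text) pair for each example id."""
    best = {}
    for ex_id, (score, text) in zip(example_ids, scored_spans):
        # An unseen id starts at -inf so the first window always wins.
        if score > best.get(ex_id, (float("-inf"), ""))[0]:
            best[ex_id] = (score, text)
    return best

# Toy data: two overlapping windows for q1, one window for q2.
toy_ids = ["q1", "q1", "q2"]
toy_spans = [(-2.0, "alpha"), (-0.5, "beta"), (-1.0, "gamma")]
picked = best_per_example(toy_ids, toy_spans)
print({k: v[1] for k, v in picked.items()})
```

Because scores are log-probabilities normalized within each window, comparing them across windows is a heuristic rather than a calibrated choice.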


In [69]:
# Hybrid-Lang submission: Hindi from 3-stream span-level (DEF weights), Tamil from 512-only with fallback
import pandas as pd, os

hi_src = 'submission_spanlvl_def_weights_defknobs.csv'  # fallback to DEF file from Cell 37
ta_src = 'submission_512only_fallback_hi180_50_00505_ta220_60_00405_fb36_42.csv'  # Cell 42
assert os.path.exists(hi_src), f'Missing Hindi source: {hi_src}'
assert os.path.exists(ta_src), f'Missing Tamil source: {ta_src}'

test_df = pd.read_csv('test.csv')
lang_map = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str)))
df_hi = pd.read_csv(hi_src).copy()
df_ta = pd.read_csv(ta_src).copy()
# Fill NaN before casting so empty predictions stay '' and a literal 'nan' answer is not blanked
df_hi['PredictionString'] = df_hi['PredictionString'].fillna('').astype(str)
df_ta['PredictionString'] = df_ta['PredictionString'].fillna('').astype(str)
map_hi = dict(zip(df_hi['id'].astype(str), df_hi['PredictionString']))
map_ta = dict(zip(df_ta['id'].astype(str), df_ta['PredictionString']))

out_ids = test_df['id'].astype(str).values
out_pred = []
for _id in out_ids:
    lang = str(lang_map.get(_id, 'unknown')).lower().strip()
    if lang == 'hindi':
        ans = map_hi.get(_id, '')
    elif lang == 'tamil':
        ans = map_ta.get(_id, '')
    else:
        ans = map_ta.get(_id, '')  # default to safer 512-fallback
    out_pred.append(ans if isinstance(ans, str) else '')

hybrid = pd.DataFrame({'id': out_ids, 'PredictionString': out_pred})
hybrid.to_csv('submission_hybrid_hi3stream_ta512fb.csv', index=False)
hybrid.to_csv('submission.csv', index=False)
print('Hybrid submission built and activated as submission.csv; backup saved as submission_hybrid_hi3stream_ta512fb.csv')
# Quick diagnostics
diag = hybrid.merge(test_df[['id','language']], on='id', how='left')
diag['PredictionString'] = diag['PredictionString'].fillna('').astype(str)
diag['len'] = diag['PredictionString'].str.len()
diag['empty'] = (diag['PredictionString'] == '')
print('Diag: empties', int(diag['empty'].sum()), '/', len(diag), '| mean len', round(diag['len'].mean(),2), '| per-lang empties', diag.groupby('language')['empty'].sum().to_dict())
print(diag.head())

Hybrid submission built and activated as submission.csv; backup saved as submission_hybrid_hi3stream_ta512fb.csv
Diag: empties 0 / 112 | mean len 10.38 | per-lang empties {'hindi': 0, 'tamil': 0}
          id   PredictionString language  len  empty
0  be799d365  मुंबई, महाराष्ट्र    hindi   17  False
1  26f356026        सुधांशुबाला    hindi   11  False
2  57a56c43f   பெருமூளைப் புறணி    tamil   16  False
3  da062fdbb           बिंबिसार    hindi    8  False
4  72fc0d5b5               1889    hindi    4  False
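
The per-language routing used for the hybrid can be sketched on toy frames as below. This is an illustrative sketch, not the notebook's code: `hybrid_by_language` and the toy frames are hypothetical. It fills NaN before casting to `str`, because `pd.read_csv` reads empty predictions back as NaN and `astype(str)` would otherwise turn them into the literal string `nan`.

```python
import pandas as pd

def hybrid_by_language(test_df, hi_df, ta_df):
    """Take Hindi rows from hi_df and every other row from ta_df."""
    def to_map(df):
        # fillna first: casting NaN with astype(str) would yield 'nan'.
        preds = df["PredictionString"].fillna("").astype(str)
        return dict(zip(df["id"].astype(str), preds))

    hi_map, ta_map = to_map(hi_df), to_map(ta_df)
    ids = test_df["id"].astype(str)
    langs = test_df["language"].astype(str).str.lower().str.strip()
    preds = [
        hi_map.get(i, "") if lang == "hindi" else ta_map.get(i, "")
        for i, lang in zip(ids, langs)
    ]
    return pd.DataFrame({"id": ids.values, "PredictionString": preds})

# Toy inputs: row "c" has a messy language label and a missing Hindi prediction.
toy_test = pd.DataFrame({"id": ["a", "b", "c"], "language": ["hindi", "tamil", "Hindi "]})
toy_hi = pd.DataFrame({"id": ["a", "b", "c"], "PredictionString": ["h1", "h2", None]})
toy_ta = pd.DataFrame({"id": ["a", "b", "c"], "PredictionString": ["t1", "t2", "t3"]})
out = hybrid_by_language(toy_test, toy_hi, toy_ta)
print(out["PredictionString"].tolist())
```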


In [70]:
# 512-only span-level with fallback and n_best bump (hi=200, ta=240) per expert advice
import os, json, numpy as np, pandas as pd, time, re, unicodedata

def _stable_logsoftmax(arr):
    m = np.max(arr)
    return arr - (m + np.log(np.sum(np.exp(arr - m))))

def _decode_best_with_fallback_nb_bump(offsets, s_logit, e_logit, lang):
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi':
        n_best = 200; max_len = 50; len_pen = 0.00505; fb_cap = 36
    elif lang_l == 'tamil':
        n_best = 240; max_len = 60; len_pen = 0.00405; fb_cap = 42
    else:
        n_best = 150; max_len = 60; len_pen = 0.0048; fb_cap = 42
    mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
    s = np.where(mask, s_logit, -1e9)
    e = np.where(mask, e_logit, -1e9)
    s_logp = _stable_logsoftmax(s)
    e_logp = _stable_logsoftmax(e)
    top_s = np.argsort(s_logp)[-n_best:]
    best = (float('-inf'), None, None)
    for si in top_s:
        ei_max = min(len(e_logp)-1, si + max_len - 1)
        band = e_logp[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    if best[1] is not None and best[2] is not None:
        cs, _ = offsets[best[1]]; _, ce = offsets[best[2]]
        if cs is not None and ce is not None and ce > cs:
            if (ce - cs) > 40:
                best_fb = (float('-inf'), None, None)
                for si in top_s:
                    ei_max = min(len(e_logp)-1, si + fb_cap - 1)
                    band = e_logp[si:ei_max+1]
                    if band.size == 0: continue
                    off = int(np.argmax(band))
                    ei = si + off
                    L = ei - si + 1
                    score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
                    if score > best_fb[0]:
                        best_fb = (score, si, ei)
                if best_fb[1] is not None and best_fb[2] is not None and best_fb[0] >= best[0] - 0.5:
                    best = best_fb
    return best

def _postprocess_ans_fb(ans: str, lang: str = None) -> str:
    if not ans: return ''
    ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
    ans = ans.replace('\u094D\u0964', '\u0964')  # drop a virama directly before a danda ('।' is U+0964)
    ans = re.sub(r'[।]+', '।', ans)
    if ans.endswith('।'): ans = ans[:-1]
    if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'): ans = ans[:-1]
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi' and len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'): ans = ans[:-1]
    if lang_l == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn': ans = ans[:-1]
    if ans and unicodedata.category(ans[0]) == 'Mn': ans = ans[1:]
    ans = re.sub(r'(?<=\S)[\-–—]{2,}(?=\S)', '-', ans)
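    # str.strip() takes a set of characters, not a regex: the old '\s' put a backslash and the letter 's' in the set
    punct_strip = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'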
    ans = ans.strip().strip(punct_strip)
    for lq, rq in [("\"","\""), ("'","'"), ('\u201c','\u201d'), ('\u2018','\u2019')]:
        if ans.startswith(lq) and not ans.endswith(rq): ans = ans[len(lq):]
        if ans.endswith(rq) and not ans.startswith(lq): ans = ans[:-len(rq)]
    ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
    ans = ans.replace('\u200B','').replace('\u00AD','')
    ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
    ans = ' '.join(ans.split())
    ans = ans.replace('\u200C','').replace('\u200D','')
    if ans.endswith('।'): ans = ans[:-1]
    return ans or ''

def build_512only_with_fallback_nb_bump_submit():
    t0 = time.time()
    if os.path.exists('xlmr_large_512_3seeds_avg.npz'):
        npz = 'xlmr_large_512_3seeds_avg.npz'
    elif os.path.exists('xlmr_large_512_2seeds_avg.npz'):
        npz = 'xlmr_large_512_2seeds_avg.npz'
    else:
        npz = 'xlmr_large_512_test_avg.npz'
    exid_json = 'xlmr_large_512_test_logits/test_example_id.json'
    offs_npy = 'xlmr_large_512_test_logits/test_offset_mapping.npy'
    assert os.path.exists(npz) and os.path.exists(exid_json) and os.path.exists(offs_npy), 'Missing 512 artifacts'
    data = np.load(npz)
    S = data['start']; E = data['end']
    with open(exid_json, 'r', encoding='utf-8') as f:
        exids = json.load(f)
    offs_all = np.load(offs_npy, allow_pickle=True)
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))
    best_per_ex = {}
    for i in range(S.shape[0]):
        ex_id = exids[i]
        lang = id2lang.get(ex_id, 'unknown')
        score, si, ei = _decode_best_with_fallback_nb_bump(offs_all[i], S[i], E[i], lang)
        if si is not None and ei is not None:
            cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
            if cs is not None and ce is not None and ce > cs:
                text = id2ctx.get(ex_id, '')[cs:ce]
            else:
                text = ''
        else:
            text = ''
        prev = best_per_ex.get(ex_id, (float('-inf'), ''))
        if score > prev[0]:
            best_per_ex[ex_id] = (score, text)
    out_ids = test_df['id'].astype(str).values
    preds = []
    for ex_id in out_ids:
        lang = id2lang.get(ex_id, 'unknown')
        ans = best_per_ex.get(ex_id, (float('-inf'), ''))[1]
        preds.append(_postprocess_ans_fb(ans, lang))
    sub = pd.DataFrame({'id': out_ids, 'PredictionString': preds})
    out_name = 'submission_512only_fallback_nb_bump_hi200_ta240.csv'
    sub.to_csv(out_name, index=False)
    sub.to_csv('submission.csv', index=False)
    diag = sub.merge(test_df[['id','language']], on='id', how='left')
    diag['PredictionString'] = diag['PredictionString'].fillna('').astype(str)
    diag['len'] = diag['PredictionString'].str.len()
    diag['empty'] = (diag['PredictionString']=='')
    print('512-only fallback (nb bump) saved as submission.csv and', out_name, '| Empties:', int(diag['empty'].sum()), '/', len(diag), '| Mean len:', round(diag['len'].mean(),2), '| Per-lang empties:', diag.groupby('language')['empty'].sum().to_dict(), '| Took: %.1fs' % (time.time()-t0))
    return sub

_ = build_512only_with_fallback_nb_bump_submit()

512-only fallback (nb bump) saved as submission.csv and submission_512only_fallback_nb_bump_hi200_ta240.csv | Empties: 0 / 112 | Mean len: 10.4 | Per-lang empties: {'hindi': 0, 'tamil': 0} | Took: 2.8s
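
The decoder in this cell scores a span as `log p_start + log p_end - len_pen * (L - 1)` and, for each of the top `n_best` start positions, looks for the best end within `max_len` tokens. The sketch below reproduces only that core step on toy logits; it omits the offset mask and the shorter-span fallback, and `best_span`, `toy_start`, and `toy_end` are hypothetical names chosen for illustration.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = np.asarray(x, dtype=np.float64)
    m = np.max(x)
    return x - (m + np.log(np.sum(np.exp(x - m))))

def best_span(start_logits, end_logits, n_best=2, max_len=3, len_pen=0.005):
    """Return (score, start, end) for the best length-penalized span."""
    s_logp = log_softmax(start_logits)
    e_logp = log_softmax(end_logits)
    top_s = np.argsort(s_logp)[-n_best:]  # indices of the n_best start positions
    best = (float("-inf"), None, None)
    for si in top_s:
        si = int(si)
        ei_max = min(len(e_logp) - 1, si + max_len - 1)
        ei = si + int(np.argmax(e_logp[si:ei_max + 1]))  # best end inside the band
        score = float(s_logp[si] + e_logp[ei] - len_pen * (ei - si))
        if score > best[0]:
            best = (score, si, ei)
    return best

# Toy logits: the strongest end (index 4) lies outside the band when max_len=3.
toy_start = [0.1, 4.0, 0.2, 0.0, 0.3]
toy_end = [0.0, 0.5, 3.5, 0.1, 5.0]
score, si, ei = best_span(toy_start, toy_end)
print(si, ei)
```

Widening `max_len` to 4 lets the same start reach the stronger end at index 4, which is the kind of trade-off the per-language `max_len` and `len_pen` knobs adjust.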


In [71]:
# Activate 3-stream primary (DEF weights, DEF knobs) as submission.csv
import pandas as pd, os
src = 'submission_spanlvl_def_weights_defknobs.csv'
assert os.path.exists(src), f'Missing file: {src}'
df = pd.read_csv(src)
if 'PredictionString' in df.columns:
    df['PredictionString'] = df['PredictionString'].fillna('').astype(str)
df.to_csv('submission.csv', index=False)
print('submission.csv replaced with', src)
print(df.head())

submission.csv replaced with submission_spanlvl_def_weights_defknobs.csv
          id   PredictionString
0  be799d365  मुंबई, महाराष्ट्र
1  26f356026        सुधांशुबाला
2  57a56c43f   பெருமூளைப் புறணி
3  da062fdbb           बिंबिसार
4  72fc0d5b5               1889


In [72]:
# 512-only span-level with fallback using DEF knobs (hi: 180/52/0.0050; ta: 220/62/0.0040)
import os, json, numpy as np, pandas as pd, time, re, unicodedata

def _stable_logsoftmax(arr):
    m = np.max(arr)
    return arr - (m + np.log(np.sum(np.exp(arr - m))))

def _decode_best_with_fallback_def(offsets, s_logit, e_logit, lang):
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi':
        n_best = 180; max_len = 52; len_pen = 0.0050; fb_cap = 36
    elif lang_l == 'tamil':
        n_best = 220; max_len = 62; len_pen = 0.0040; fb_cap = 42
    else:
        n_best = 150; max_len = 60; len_pen = 0.0048; fb_cap = 42
    mask = np.array([o is not None and o[0] is not None and o[1] is not None and (o[1] > o[0]) for o in offsets], dtype=bool)
    s = np.where(mask, s_logit, -1e9)
    e = np.where(mask, e_logit, -1e9)
    s_logp = _stable_logsoftmax(s)
    e_logp = _stable_logsoftmax(e)
    top_s = np.argsort(s_logp)[-n_best:]
    best = (float('-inf'), None, None)
    for si in top_s:
        ei_max = min(len(e_logp)-1, si + max_len - 1)
        band = e_logp[si:ei_max+1]
        if band.size == 0: continue
        off = int(np.argmax(band))
        ei = si + off
        L = ei - si + 1
        score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
        if score > best[0]:
            best = (score, si, ei)
    if best[1] is not None and best[2] is not None:
        cs, _ = offsets[best[1]]; _, ce = offsets[best[2]]
        if cs is not None and ce is not None and ce > cs:
            if (ce - cs) > 40:
                best_fb = (float('-inf'), None, None)
                for si in top_s:
                    ei_max = min(len(e_logp)-1, si + fb_cap - 1)
                    band = e_logp[si:ei_max+1]
                    if band.size == 0: continue
                    off = int(np.argmax(band))
                    ei = si + off
                    L = ei - si + 1
                    score = float(s_logp[si] + e_logp[ei] - len_pen * (L - 1))
                    if score > best_fb[0]:
                        best_fb = (score, si, ei)
                if best_fb[1] is not None and best_fb[2] is not None and best_fb[0] >= best[0] - 0.5:
                    best = best_fb
    return best

def _postprocess_ans_def(ans: str, lang: str = None) -> str:
    if not ans: return ''
    ans = re.sub(r'(?<=\d)[ ,._-](?=\d)', '', ans)
    ans = ans.replace('\u094D\u0964', '\u0964')  # drop a virama directly before a danda ('।' is U+0964)
    ans = re.sub(r'[।]+', '।', ans)
    if ans.endswith('।'): ans = ans[:-1]
    if ans and (unicodedata.category(ans[-1]) == 'Mn' or ans[-1] == '\u094D'): ans = ans[:-1]
    lang_l = str(lang).lower().strip()
    if lang_l == 'hindi' and len(ans) >= 2 and ans[-1] == '\u094D' and (ans[-2].isspace() or ans[-2] in '।,:;.!?"\'\u0964\u0965-'): ans = ans[:-1]
    if lang_l == 'tamil' and len(ans) and unicodedata.category(ans[-1]) == 'Mn': ans = ans[:-1]
    if ans and unicodedata.category(ans[0]) == 'Mn': ans = ans[1:]
    ans = re.sub(r'(?<=\S)[\-–—]{2,}(?=\S)', '-', ans)
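    # str.strip() takes a set of characters, not a regex: the old '\s' put a backslash and the letter 's' in the set
    punct_strip = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'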
    ans = ans.strip().strip(punct_strip)
    for lq, rq in [("\"","\""), ("'","'"), ('\u201c','\u201d'), ('\u2018','\u2019')]:
        if ans.startswith(lq) and not ans.endswith(rq): ans = ans[len(lq):]
        if ans.endswith(rq) and not ans.startswith(lq): ans = ans[:-len(rq)]
    ans = ans.replace('\ufeff','').replace('\u200e','').replace('\u200f','')
    ans = ans.replace('\u200B','').replace('\u00AD','')
    ans = ans.replace('\u00A0',' ').replace('\u2009',' ')
    ans = ' '.join(ans.split())
    ans = ans.replace('\u200C','').replace('\u200D','')
    if ans.endswith('।'): ans = ans[:-1]
    return ans or ''

def build_512only_with_fallback_def_submit():
    t0 = time.time()
    if os.path.exists('xlmr_large_512_3seeds_avg.npz'):
        npz = 'xlmr_large_512_3seeds_avg.npz'
    elif os.path.exists('xlmr_large_512_2seeds_avg.npz'):
        npz = 'xlmr_large_512_2seeds_avg.npz'
    else:
        npz = 'xlmr_large_512_test_avg.npz'
    exid_json = 'xlmr_large_512_test_logits/test_example_id.json'
    offs_npy = 'xlmr_large_512_test_logits/test_offset_mapping.npy'
    assert os.path.exists(npz) and os.path.exists(exid_json) and os.path.exists(offs_npy), 'Missing 512 artifacts'
    data = np.load(npz)
    S = data['start']; E = data['end']
    with open(exid_json, 'r', encoding='utf-8') as f:
        exids = json.load(f)
    offs_all = np.load(offs_npy, allow_pickle=True)
    test_df = pd.read_csv('test.csv')
    id2ctx = dict(zip(test_df['id'].astype(str), test_df['context'].astype(str)))
    id2lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str) if 'language' in test_df.columns else ['unknown']*len(test_df)))
    best_per_ex = {}
    for i in range(S.shape[0]):
        ex_id = exids[i]
        lang = id2lang.get(ex_id, 'unknown')
        score, si, ei = _decode_best_with_fallback_def(offs_all[i], S[i], E[i], lang)
        if si is not None and ei is not None:
            cs, _ = offs_all[i][si]; _, ce = offs_all[i][ei]
            if cs is not None and ce is not None and ce > cs:
                text = id2ctx.get(ex_id, '')[cs:ce]
            else:
                text = ''
        else:
            text = ''
        prev = best_per_ex.get(ex_id, (float('-inf'), ''))
        if score > prev[0]:
            best_per_ex[ex_id] = (score, text)
    out_ids = test_df['id'].astype(str).values
    preds = []
    for ex_id in out_ids:
        lang = id2lang.get(ex_id, 'unknown')
        ans = best_per_ex.get(ex_id, (float('-inf'), ''))[1]
        preds.append(_postprocess_ans_def(ans, lang))
    sub = pd.DataFrame({'id': out_ids, 'PredictionString': preds})
    out_name = 'submission_512only_fallback_hi180_52_00500_ta220_62_00400_fb36_42.csv'
    sub.to_csv(out_name, index=False)
    sub.to_csv('submission.csv', index=False)
    diag = sub.merge(test_df[['id','language']], on='id', how='left')
    diag['PredictionString'] = diag['PredictionString'].fillna('').astype(str)
    diag['len'] = diag['PredictionString'].str.len()
    diag['empty'] = (diag['PredictionString']=='')
    print('512-only fallback (DEF knobs) saved as submission.csv and', out_name, '| Empties:', int(diag['empty'].sum()), '/', len(diag), '| Mean len:', round(diag['len'].mean(),2), '| Per-lang empties:', diag.groupby('language')['empty'].sum().to_dict(), '| Took: %.1fs' % (time.time()-t0))
    return sub

_ = build_512only_with_fallback_def_submit()

512-only fallback (DEF knobs) saved as submission.csv and submission_512only_fallback_hi180_52_00500_ta220_62_00400_fb36_42.csv | Empties: 0 / 112 | Mean len: 10.4 | Per-lang empties: {'hindi': 0, 'tamil': 0} | Took: 3.0s
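
The post-processing functions trim edge punctuation with `str.strip`, which treats its argument as a set of individual characters rather than a pattern. The sketch below shows a trim with an explicitly listed set and the pitfall of passing regex-style escapes; `TRIM_CHARS` and `trim_edges` are hypothetical names, and the character list is an assumption modeled on the punctuation handled in this notebook.

```python
# Danda, double danda, ASCII punctuation, curly quotes, closing brackets, whitespace.
TRIM_CHARS = '\u0964\u0965,:;.!?"\'\u201c\u201d\u2018\u2019)]}|- \t\r\n'

def trim_edges(ans: str) -> str:
    """Strip the listed punctuation and whitespace from both ends of ans."""
    return ans.strip(TRIM_CHARS)

cleaned = trim_edges(' "Paris", ')
print(cleaned)

# Pitfall: a regex-style class is read as two literal characters, backslash and 's',
# so a trailing 's' is removed as well.
regex_like = '\\s'
damaged = 'Paris'.strip(regex_like)
print(damaged)
```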


In [73]:
# Hybrid-Lang v2: Hindi = 3-stream DEF, Tamil = 512-only fallback with n_best bump
import pandas as pd, os
test_df = pd.read_csv('test.csv')
lang = dict(zip(test_df['id'].astype(str), test_df['language'].astype(str)))
hi_fp = 'submission_spanlvl_def_weights_defknobs.csv'
ta_fp = 'submission_512only_fallback_nb_bump_hi200_ta240.csv'
assert os.path.exists(hi_fp), f'Missing Hindi source: {hi_fp}'
assert os.path.exists(ta_fp), f'Missing Tamil source: {ta_fp}'
hi = pd.read_csv(hi_fp)
ta = pd.read_csv(ta_fp)
# Fill NaN before casting; astype(str) first would turn empty predictions into the literal 'nan'
hi_map = dict(zip(hi['id'].astype(str), hi['PredictionString'].fillna('').astype(str)))
ta_map = dict(zip(ta['id'].astype(str), ta['PredictionString'].fillna('').astype(str)))
out_ids, out_pred = [], []
for _id in test_df['id'].astype(str).values:
    l = str(lang.get(_id,'')).lower().strip()
    out_ids.append(_id)
    out_pred.append(hi_map.get(_id,'') if l=='hindi' else ta_map.get(_id,''))
sub = pd.DataFrame({'id': out_ids, 'PredictionString': out_pred})
sub.to_csv('submission_hybrid_v2.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Hybrid v2 built and activated as submission.csv; backup saved as submission_hybrid_v2.csv')
print(sub.head())

Hybrid v2 built and activated as submission.csv; backup saved as submission_hybrid_v2.csv
          id   PredictionString
0  be799d365  मुंबई, महाराष्ट्र
1  26f356026        सुधांशुबाला
2  57a56c43f   பெருமூளைப் புறணி
3  da062fdbb           बिंबिसार
4  72fc0d5b5               1889
