# Plan: Tweet Sentiment Extraction (MLE-Benchmark)

Objective:
- Build a strong baseline fast and iterate to medal. Metric: Jaccard similarity on selected_text.

High-level Approach:
- Treat as extractive QA: sentiment = question; tweet = context; predict start/end token indices.
- Use transformer encoder (RoBERTa-base or DeBERTa-v3-base) fine-tuned for span extraction.
- Post-process: for neutral, output full tweet; sanitize offsets; fallback to sentiment-specific heuristics if needed.

Environment & Efficiency:
- Verify GPU with nvidia-smi; install PyTorch cu121 stack only; use Transformers + Accelerate.
- Log progress per fold with timings; cache tokenized datasets and OOF/test logits.
- Subsample smoke tests before full runs; early stop on plateau.

Data Pipeline:
- Load train/test; inspect nulls, length distributions, sentiments.
- Build character-level alignment of selected_text to tweet for training start/end char indices.
- Tokenize with fast tokenizer to get offset mapping; map char spans to token spans.
- Save processed features to disk (parquet/npz) for reuse.
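
The char-to-token mapping step above can be sketched without any tokenizer dependency. The function name `char_to_token_span` and the hand-written offsets are hypothetical, and the sketch assumes offsets in the style of a fast tokenizer's offset mapping, with `(0, 0)` marking special tokens.

```python
from typing import List, Optional, Tuple


def char_to_token_span(
    offsets: List[Tuple[int, int]],
    start_char: int,
    end_char: int,
) -> Optional[Tuple[int, int]]:
    """Map a [start_char, end_char) span to inclusive (first, last) token indices.

    `offsets` holds one (char_start, char_end) pair per token; empty pairs
    such as (0, 0) are treated as special tokens and skipped.
    """
    token_ids = [
        i
        for i, (s, e) in enumerate(offsets)
        if e > s and s < end_char and e > start_char  # token overlaps the span
    ]
    if not token_ids:
        return None  # span not covered by any token (e.g. truncated away)
    return token_ids[0], token_ids[-1]


# Hypothetical offsets for "so sad today": <special> so sad today <special>
text = "so sad today"
offsets = [(0, 0), (0, 2), (3, 6), (7, 12), (0, 0)]
start_char = text.find("sad")
span = char_to_token_span(offsets, start_char, start_char + len("sad"))
print(span)  # (2, 2)
```

In a real pair encoding the sentiment tokens carry offsets relative to the sentiment string, so they would need to be masked out first (for example with the encoding's `sequence_ids()`) before applying a mapping like this.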

Validation:
- Stratified KFold by sentiment (e.g., 5 folds). Deterministic seed. Same folds reused for all models.
- OOF Jaccard evaluation to guide iterations. Multiple seeds later if time.
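
A dependency-free sketch of deterministic, sentiment-stratified fold assignment follows. It is an illustration only: the notebook itself imports scikit-learn's `StratifiedKFold`, and the helper name `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[str], n_splits: int = 5, seed: int = 42) -> List[int]:
    """Return one fold id per row, spreading each label evenly across folds."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [-1] * len(labels)
    for label in sorted(by_label):  # sorted for a stable iteration order
        idxs = by_label[label]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[idx] = pos % n_splits  # deal rows round-robin into folds
    return folds


# Toy labels standing in for the `sentiment` column
labels = ["neutral"] * 10 + ["positive"] * 10 + ["negative"] * 5
folds = stratified_folds(labels)
for k in range(5):
    print(k, Counter(l for l, f in zip(labels, folds) if f == k))
```

Saving the resulting fold ids once and reloading them for every model keeps OOF scores comparable across runs.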

Modeling v1 (Baseline):
- roberta-base, max_len ~ 128 (cap at e.g., 96/128 after inspecting lengths).
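- Input format: "question" = sentiment token(s); "context" = tweet. Simple pair encoding via the tokenizer: for RoBERTa this is <s> sentiment </s></s> tweet </s>; the [CLS] sentiment [SEP] tweet [SEP] layout applies to BERT/DeBERTa-style tokenizers.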
- Loss: cross-entropy on start and end.
- Hyperparams: lr 2e-5 to 3e-5, batch size per GPU memory (16 if fits), epochs 3 with early stopping on OOF.
- Inference: average start/end logits across folds; pick start and end by argmax, with a simple constraint (end >= start).
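
One way to enforce the end >= start constraint at inference is a joint search over (start, end) pairs rather than two independent argmaxes. The sketch below assumes the logits have already been restricted to tweet tokens; the helper names, the `max_span_len` parameter, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def average_logits(per_fold: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of per-fold logit vectors for one example."""
    n = len(per_fold)
    return [sum(col) / n for col in zip(*per_fold)]


def best_span(
    start_logits: Sequence[float],
    end_logits: Sequence[float],
    max_span_len: int = 64,
) -> Tuple[int, int]:
    """Return the (start, end) pair with the highest summed logit, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(len(end_logits), s + max_span_len)):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Two hypothetical folds, four tweet tokens each
start = average_logits([[0.1, 0.2, 2.0, 0.0], [0.3, 0.0, 1.0, 0.2]])
end = average_logits([[0.0, 3.0, 1.0, 0.5], [0.2, 2.0, 2.0, 0.1]])
print(best_span(start, end))
```

In this toy case the independent argmaxes would give start=2 and end=1, which is an invalid span; the joint search settles on a valid pair instead.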

Post-processing:
- If sentiment == neutral: return full tweet.
- If the predicted span is empty or invalid (non-neutral only, since neutral already returns the full tweet): fall back to a minimal heuristic (e.g., top token).
- Optional refinement: trim leading/trailing spaces/punctuation to improve Jaccard.
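
The rules above can be sketched as a single function. The name `postprocess` is hypothetical, and the non-neutral fallback here returns the whole tweet as a stand-in for the "top token" heuristic, which would need access to the logits.

```python
def postprocess(text: str, sentiment: str, pred: str) -> str:
    """Apply the post-processing rules to one decoded span prediction."""
    text = text if isinstance(text, str) else ""
    if sentiment == "neutral":
        return text  # neutral rows: return the full tweet unchanged
    pred = pred.strip() if isinstance(pred, str) else ""
    if not pred or pred not in text:
        # Assumption for this sketch: fall back to the full tweet
        return text
    return pred


print(repr(postprocess(" I love this! ", "neutral", "love")))
print(repr(postprocess("so sad today", "negative", " sad ")))
print(repr(postprocess("so sad today", "negative", "")))
```

Only whitespace trimming is shown; punctuation trimming is left out because whether it helps Jaccard is something to check on OOF predictions first.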

Iteration Roadmap:
1) GPU/env check + installs.
2) EDA: lengths, nulls, label distribution.
3) Build alignment + tokenizer pipeline; cache.
4) Baseline training 3-5 folds; measure OOF Jaccard.
5) Error analysis buckets (neutral/pos/neg, short/long tweets).
6) Improvements:
   - Model: deberta-v3-base or roberta-large if time/memory.
   - Longer max_len if needed.
   - Data augmentation: none initially (risk).
   - Post-process rules tuning.
7) Blend diverse seeds/models if time.

Requests for Expert Review (next step):
- Are roberta-base/deberta-v3-base still the best for this dataset under time constraints?
- Critical post-processing rules that typically boost Jaccard here?
- Recommended max_len and any special text normalization to avoid alignment bugs?
- Optimal CV folds count vs runtime for medal-level performance?

Deliverables:
- Reusable fold splits, cached tokenized datasets, OOF metrics.
- submission.csv matching sample format.

Time Management:
- <1h to get the baseline pipeline ready and smoke-tested.
- 2-4h for full 5-fold run on base model.
- Remainder for improvements/ensembling and error-driven fixes.

In [1]:
# Environment check: GPU + install correct Torch stack (cu121)
import os, sys, subprocess, shutil, time
from pathlib import Path

def run(cmd):
    print('>>', ' '.join(cmd), flush=True)
    return subprocess.run(cmd, check=False, text=True, capture_output=True)

# 0) GPU presence
print(run(['bash','-lc','nvidia-smi || true']).stdout)

# 1) Clean any prior torch stacks
for pkg in ("torch","torchvision","torchaudio"):
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg], check=False)
for d in (
    "/app/.pip-target/torch",
    "/app/.pip-target/torch-2.8.0.dist-info",
    "/app/.pip-target/torch-2.4.1.dist-info",
    "/app/.pip-target/torchvision",
    "/app/.pip-target/torchvision-0.23.0.dist-info",
    "/app/.pip-target/torchvision-0.19.1.dist-info",
    "/app/.pip-target/torchaudio",
    "/app/.pip-target/torchaudio-2.8.0.dist-info",
    "/app/.pip-target/torchaudio-2.4.1.dist-info",
    "/app/.pip-target/torchgen",
    "/app/.pip-target/functorch",
):
    if os.path.exists(d):
        print("Removing", d)
        shutil.rmtree(d, ignore_errors=True)

def pip(*args):
    print('> pip', ' '.join(args), flush=True)
    subprocess.run([sys.executable, "-m", "pip", *args], check=True)

# 2) Install exact cu121 torch stack
pip("install",
    "--index-url", "https://download.pytorch.org/whl/cu121",
    "--extra-index-url", "https://pypi.org/simple",
    "torch==2.4.1", "torchvision==0.19.1", "torchaudio==2.4.1")

# 3) Freeze torch versions
Path("constraints.txt").write_text(
    "torch==2.4.1\n"
    "torchvision==0.19.1\n"
    "torchaudio==2.4.1\n"
)

# 4) Install NLP deps honoring constraints
pip("install", "-c", "constraints.txt",
    "transformers==4.44.2", "accelerate==0.34.2",
    "datasets==2.21.0", "evaluate==0.4.2",
    "sentencepiece", "scikit-learn", "pandas", "numpy", "pyarrow",
    "tqdm", "matplotlib",
    "--upgrade-strategy", "only-if-needed")

# 5) Sanity check torch + CUDA
import torch
print("torch:", torch.__version__, "built CUDA:", getattr(torch.version, "cuda", None))
print("CUDA available:", torch.cuda.is_available())
assert str(getattr(torch.version, "cuda", "")).startswith("12.1"), f"Wrong CUDA build: {torch.version.cuda}"
assert torch.cuda.is_available(), "CUDA not available"
print("GPU:", torch.cuda.get_device_name(0))
print("Setup OK at", time.strftime('%Y-%m-%d %H:%M:%S'))

>> bash -lc nvidia-smi || true


Tue Sep 30 04:14:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                
> pip install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 272.0 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 379.6 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 448.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105


  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 281.9 MB/s eta 0:00:00




Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 199.4 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 472.0 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 133.7 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 213.4 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 110.3 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 260.3 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 71.8 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 471.1 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 476.4 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 229.8 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 280.1 MB/s eta 0:00:00
Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 341.1 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 68.1 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 493.3 MB/s eta 0:00:00
Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 512.0 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 99.4 MB/s eta 0:00:00


Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 151.6 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 186.2 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 105.6 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)


Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 535.6 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


> pip install -c constraints.txt transformers==4.44.2 accelerate==0.34.2 datasets==2.21.0 evaluate==0.4.2 sentencepiece scikit-learn pandas numpy pyarrow tqdm matplotlib --upgrade-strategy only-if-needed


Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.5/9.5 MB 79.7 MB/s eta 0:00:00


Collecting accelerate==0.34.2
  Downloading accelerate-0.34.2-py3-none-any.whl (324 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 324.4/324.4 KB 507.6 MB/s eta 0:00:00
Collecting datasets==2.21.0
  Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 527.3/527.3 KB 490.0 MB/s eta 0:00:00
Collecting evaluate==0.4.2
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 KB 447.2 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 260.2 MB/s eta 0:00:00


Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 220.2 MB/s eta 0:00:00


Collecting pandas
  Downloading pandas-2.3.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 231.3 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 278.1 MB/s eta 0:00:00


Collecting pyarrow
  Downloading pyarrow-21.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (42.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 157.6 MB/s eta 0:00:00
Collecting tqdm
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 KB 449.0 MB/s eta 0:00:00


Collecting matplotlib
  Downloading matplotlib-3.10.6-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 181.1 MB/s eta 0:00:00


Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.7/64.7 KB 432.8 MB/s eta 0:00:00
Collecting safetensors>=0.4.1
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (485 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.8/485.8 KB 548.3 MB/s eta 0:00:00


Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 277.5 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.23.2
  Downloading huggingface_hub-0.35.3-py3-none-any.whl (564 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 564.3/564.3 KB 530.5 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting packaging>=20.0
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 329.6 MB/s eta 0:00:00


Collecting regex!=2019.12.17
  Downloading regex-2025.9.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 KB 315.3 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (806 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 806.6/806.6 KB 487.1 MB/s eta 0:00:00
Collecting torch>=1.10.0
  Downloading torch-2.4.1-cp311-cp311-manylinux1_x86_64.whl (797.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.1/797.1 MB 153.8 MB/s eta 0:00:00


Collecting psutil
  Downloading psutil-7.1.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291.2/291.2 KB 517.0 MB/s eta 0:00:00
Collecting dill<0.3.9,>=0.3.0
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 KB 430.3 MB/s eta 0:00:00


Collecting aiohttp
  Downloading aiohttp-3.12.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 171.0 MB/s eta 0:00:00
Collecting multiprocess
  Downloading multiprocess-0.70.18-py311-none-any.whl (144 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 144.5/144.5 KB 388.9 MB/s eta 0:00:00
Collecting fsspec[http]<=2024.6.1,>=2023.1.0
  Downloading fsspec-2024.6.1-py3-none-any.whl (177 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 177.6/177.6 KB 482.0 MB/s eta 0:00:00


Collecting xxhash
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.8/194.8 KB 461.8 MB/s eta 0:00:00
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)


Collecting scipy>=1.8.0
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 585.2 MB/s eta 0:00:00
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 KB 522.1 MB/s eta 0:00:00


Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 347.8/347.8 KB 514.9 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.2
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 229.9/229.9 KB 511.6 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 509.2/509.2 KB 535.8 MB/s eta 0:00:00
Collecting cycler>=0.10
  Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)


Collecting contourpy>=1.0.1
  Downloading contourpy-1.3.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (355 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 355.2/355.2 KB 284.1 MB/s eta 0:00:00
Collecting kiwisolver>=1.3.1
  Downloading kiwisolver-1.4.9-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 217.9 MB/s eta 0:00:00


Collecting fonttools>=4.22.0
  Downloading fonttools-4.60.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (5.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.0/5.0 MB 143.1 MB/s eta 0:00:00
Collecting pyparsing>=2.3.1
  Downloading pyparsing-3.2.5-py3-none-any.whl (113 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 113.9/113.9 KB 434.0 MB/s eta 0:00:00


Collecting pillow>=8
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 163.5 MB/s eta 0:00:00


Collecting attrs>=17.3.0
  Downloading attrs-25.3.0-py3-none-any.whl (63 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.8/63.8 KB 429.6 MB/s eta 0:00:00
Collecting propcache>=0.2.0
  Downloading propcache-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 213.5/213.5 KB 536.7 MB/s eta 0:00:00


Collecting yarl<2.0,>=1.17.0
  Downloading yarl-1.20.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (348 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 349.0/349.0 KB 519.5 MB/s eta 0:00:00
Collecting aiosignal>=1.4.0
  Downloading aiosignal-1.4.0-py3-none-any.whl (7.5 kB)
Collecting aiohappyeyeballs>=2.5.0
  Downloading aiohappyeyeballs-2.6.1-py3-none-any.whl (15 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.7.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (235 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.3/235.3 KB 514.5 MB/s eta 0:00:00


Collecting multidict<7.0,>=4.5
  Downloading multidict-6.6.4-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (246 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 246.7/246.7 KB 516.9 MB/s eta 0:00:00
Collecting hf-xet<2.0.0,>=1.1.3
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 560.3 MB/s eta 0:00:00
Collecting typing-extensions>=3.7.4.3
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 381.7 MB/s eta 0:00:00


Collecting six>=1.5
  Downloading six-1.17.0-py2.py3-none-any.whl (11 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.2/161.2 KB 507.6 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Downloading urllib3-2.5.0-py3-none-any.whl (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.8/129.8 KB 482.3 MB/s eta 0:00:00


Collecting charset_normalizer<4,>=2
  Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (150 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.3/150.3 KB 490.7 MB/s eta 0:00:00
Collecting idna<4,>=2.5
  Downloading idna-3.10-py3-none-any.whl (70 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 KB 456.4 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 177.8 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 186.9 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 159.7 MB/s eta 0:00:00
Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 504.5 MB/s eta 0:00:00
Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 406.9 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70


  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 73.0 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 204.6 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 321.2 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 213.0 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 506.2 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 128.8 MB/s eta 0:00:00
Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 223.1 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 488.5 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 254.3 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 192.8 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 214.7 MB/s eta 0:00:00


Collecting multiprocess
  Downloading multiprocess-0.70.17-py311-none-any.whl (144 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 144.3/144.3 KB 470.3 MB/s eta 0:00:00
  Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.5/143.5 KB 506.7 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 527.3 MB/s eta 0:00:00


Installing collected packages: pytz, mpmath, xxhash, urllib3, tzdata, typing-extensions, tqdm, threadpoolctl, sympy, six, sentencepiece, safetensors, regex, pyyaml, pyparsing, pyarrow, psutil, propcache, pillow, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, multidict, MarkupSafe, kiwisolver, joblib, idna, hf-xet, fsspec, frozenlist, fonttools, filelock, dill, cycler, charset_normalizer, certifi, attrs, aiohappyeyeballs, yarl, triton, scipy, requests, python-dateutil, nvidia-cusparse-cu12, nvidia-cudnn-cu12, multiprocess, jinja2, contourpy, aiosignal, scikit-learn, pandas, nvidia-cusolver-cu12, matplotlib, huggingface-hub, aiohttp, torch, tokenizers, transformers, datasets, accelerate, evaluate


Successfully installed MarkupSafe-3.0.3 accelerate-0.34.2 aiohappyeyeballs-2.6.1 aiohttp-3.12.15 aiosignal-1.4.0 attrs-25.3.0 certifi-2025.8.3 charset_normalizer-3.4.3 contourpy-1.3.3 cycler-0.12.1 datasets-2.21.0 dill-0.3.8 evaluate-0.4.2 filelock-3.19.1 fonttools-4.60.1 frozenlist-1.7.0 fsspec-2024.6.1 hf-xet-1.1.10 huggingface-hub-0.35.3 idna-3.10 jinja2-3.1.6 joblib-1.5.2 kiwisolver-1.4.9 matplotlib-3.10.6 mpmath-1.3.0 multidict-6.6.4 multiprocess-0.70.16 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 packaging-25.0 pandas-2.3.3 pillow-11.3.0 propcache-0.3.2 psutil-7.1.0 pyarrow-21.0.0 pyparsing-3.2.5 python-dateutil-2.9.0.post0 pytz-2025.2 pyyaml-6.0.3 

torch: 2.4.1+cu121 built CUDA: 12.1
CUDA available: True
GPU: NVIDIA A10-24Q
Setup OK at 2025-09-30 04:17:07


In [3]:
# EDA: load data, inspect distributions, token length coverage
import pandas as pd, numpy as np, os, time
from collections import Counter
from transformers import AutoTokenizer

t0 = time.time()
train_path, test_path = 'train.csv', 'test.csv'
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
print('Loaded:', train.shape, test.shape)
print(train.head(3))

# Basic checks
print('\nNulls train:\n', train.isnull().sum())
print('\nSentiment distribution (train):\n', train['sentiment'].value_counts())

# Tweet and selected_text length stats
train['tweet_len'] = train['text'].astype(str).apply(len)
train['sel_len'] = train['selected_text'].astype(str).apply(len)
print('\nTweet length percentiles:', np.percentile(train['tweet_len'], [50, 75, 90, 95, 99]))
print('Selected_text length percentiles:', np.percentile(train['sel_len'], [50, 75, 90, 95, 99]))

# Tokenizer length study (pair encoding: sentiment + tweet)
model_name = 'microsoft/deberta-v3-base'  # primary choice per expert advice
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
def pair_len(sent, txt):
    enc = tok(text=sent, text_pair=txt, add_special_tokens=True, truncation=False)
    return len(enc['input_ids'])

sample_idx = np.random.RandomState(42).choice(len(train), size=min(5000, len(train)), replace=False)
lens = [pair_len(train.loc[i,'sentiment'], str(train.loc[i,'text'])) for i in sample_idx]
lens = np.array(lens)
print('\nToken pair length percentiles (DeBERTa-v3-base):', np.percentile(lens, [50, 75, 90, 95, 99]))
coverage_128 = (lens <= 128).mean()
coverage_96 = (lens <= 96).mean()
print(f'Coverage <=128: {coverage_128:.4f}, <=96: {coverage_96:.4f}')

print('\nTop examples near tail:')
tail_idx = np.argsort(lens)[-5:]
for idx in tail_idx:
    i = sample_idx[idx]
    print('len=', lens[idx], '| sentiment=', train.loc[i,'sentiment'], '| text[:120]=', str(train.loc[i,'text'])[:120].replace('\n',' '))

print(f'EDA done in {time.time()-t0:.1f}s')

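# Decide tentative max_len recommendation based on coverage
# NOTE: coverage_96 <= coverage_128 always holds, so the elif branch below is unreachable as written.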
if coverage_128 > 0.995:
    print('Recommendation: max_len=128 (safe).')
elif coverage_96 > 0.995:
    print('Recommendation: max_len=96 (safe).')
else:
    print('Recommendation: max_len=128 (use), consider 160 if truncation noticeably >0.5%.')

Loaded: (24732, 4) (2749, 3)
       textID                                               text  \
0  8d4ad58b45  eating breakfast  getting ready to go to schoo...   
1  fdfe12a800  Going to fold laundry and then hit the sack. I...   
2  5efd224f4e  happy mothers day to all   im off to spend the...   

                                       selected_text sentiment  
0  eating breakfast  getting ready to go to schoo...  negative  
1                    I have boring saturday evenings  negative  
2                                              happy  positive  

Nulls train:
 textID           0
text             1
selected_text    1
sentiment        0
dtype: int64

Sentiment distribution (train):
 sentiment
neutral     9998
positive    7735
negative    6999
Name: count, dtype: int64

Tweet length percentiles: [ 64.  97. 122. 129. 137.]
Selected_text length percentiles: [ 22.  55.  97. 117. 135.]

Token pair length percentiles (DeBERTa-v3-base): [21. 29. 35. 38. 44.]
Coverage <=128: 1.0000, <=96: 1.0000

Top examples near tail:
len= 55 | sentiment= positive | text[:120]= wakey wakey lemon shakeyyyy!  haha, goin` 2 schooliooo!  it`s raining!! (ugh!) guess where I WISH I CAN BE RIGHT NOW....
len= 56 | sentiment= positive | text[:120]=  I LOVE IT!!!!!!!!!!!!!!!!!!!!!! I ALSO LIKE THE NEW PROFILE PIC!!!!!!!!!!!!!!!!!!!!
len= 58 | sentiment= negative | text[:120]= God **** you Twitter!!!!!!!!!!!!!! Stop eating my undeleted DMs!!!!!!!!!!!!!!!!!!!!!!!!!!
len= 63 | sentiment= negative | text[:120]= TODAy SUCKs - kisskass19: ï¿½you and katey broke up?! Yah Kate Broke up with me  Itï¿½s been awful nd Vodkas... http://t
len= 64 | sentiment= positive | text[:120]= Star trek was SOOOOO AWESOME!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Spock and Kirk were hillarious!!!  Im seeing it again. so
EDA done in 1.0s
Recommendation: max_len=128 (safe).
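
The recommendation above rests on a coverage figure: the share of encoded sentiment/tweet pairs whose token count fits within a candidate max_len. A minimal sketch of that calculation follows, assuming a plain list of pair lengths. The helper name `length_coverage` and the toy lengths are illustrative and not taken from the notebook.

```python
def length_coverage(lengths, max_len):
    """Fraction of sequences whose token count is <= max_len."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Toy pair lengths (hypothetical); the real ones come from the tokenizer.
toy_lens = [21, 29, 35, 38, 44, 64, 97, 130]
for cap in (96, 128):
    print(f"coverage <= {cap}: {length_coverage(toy_lens, cap):.4f}")
```

With the real lengths reported above (maximum around 64 tokens), both caps give full coverage, so a smaller max_len would mainly save compute.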


In [4]:
# Build robust char->token alignment, create 5-fold splits, and cache tokenized datasets (smoke: roberta-base, max_len=128)
import os, re, json, time, numpy as np, pandas as pd
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from transformers import AutoTokenizer

SEED = 42
np.random.seed(SEED)
OUT_DIR = Path('cache')
OUT_DIR.mkdir(exist_ok=True, parents=True)

def jaccard(str1, str2):
    if not isinstance(str1, str): str1 = '' if str1 is None else str(str1)
    if not isinstance(str2, str): str2 = '' if str2 is None else str(str2)
    a = set(str1.split()); b = set(str2.split())
    if not a and not b: return 1.0
    return float(len(a & b)) / (len(a | b) + 1e-12)

def find_span(text, sel):
    # Return (start_char, end_char) inclusive-exclusive on original text; None if invalid
    if not isinstance(text, str) or not isinstance(sel, str) or len(sel) == 0:
        return None
    # Exact substring first
    start = text.find(sel)
    if start != -1:
        return (start, start + len(sel))
    # Collapse multiple spaces for robust match
    def collapse_spaces(s):
        return re.sub(r'\s+', ' ', s.strip())
    text_c = collapse_spaces(text)
    sel_c = collapse_spaces(sel)
    start_c = text_c.find(sel_c)
    if start_c == -1:
        return None
    # Map collapsed indices back to original indices.
    # collapse_spaces() strips leading whitespace, so skip it here to keep both index spaces aligned.
    i = 0
    while i < len(text) and text[i].isspace():
        i += 1
    map_idx = []  # map from collapsed index to original index
    while i < len(text):
        if text[i].isspace():
            # a run of whitespace collapses to one space, mapped to the first char of the run
            map_idx.append(i)
            while i < len(text) and text[i].isspace():
                i += 1
        else:
            map_idx.append(i); i += 1
    # Ensure map covers length
    if start_c < len(map_idx):
        start_orig = map_idx[start_c]
        end_c = start_c + len(sel_c)
        end_orig = map_idx[min(end_c-1, len(map_idx)-1)] + 1
        return (start_orig, end_orig)
    return None

def map_char_to_tokens(offsets, seq_ids, char_span, target_seq_id=1):
    # offsets: list of (start,end) per token; seq_ids: list of sequence_ids (None,0,1,...)
    if char_span is None:
        return None
    cs, ce = char_span
    start_tok = end_tok = None
    for i, (o, sid) in enumerate(zip(offsets, seq_ids)):
        if sid != target_seq_id:  # only tweet side
            continue
        os_, oe_ = o
        if os_ is None:
            continue
        # token overlaps char span?
        if oe_ > cs and os_ < ce:
            if start_tok is None:
                start_tok = i
            end_tok = i
    if start_tok is None or end_tok is None:
        return None
    return (start_tok, end_tok)

def prepare_cached_dataset(model_name='roberta-base', max_len=128, prefix='roberta_base_m128'):
    print(f'Preparing dataset for {model_name}, max_len={max_len}', flush=True)
    tok_kwargs = {'use_fast': True}
    if 'roberta' in model_name:
        tok_kwargs['add_prefix_space'] = True
    tokenizer = AutoTokenizer.from_pretrained(model_name, **tok_kwargs)

    # Ensure no NaNs
    df = train.copy()
    df['text'] = df['text'].fillna('')
    df['selected_text'] = df['selected_text'].fillna('')
    df['sentiment'] = df['sentiment'].fillna('neutral')

    # Create folds
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    df['fold'] = -1
    for k, (_, val_idx) in enumerate(skf.split(df, df['sentiment'])):
        df.loc[val_idx, 'fold'] = k
    df.to_csv(OUT_DIR / f'train_folds_{prefix}.csv', index=False)
    print('Saved folds to', OUT_DIR / f'train_folds_{prefix}.csv')

    # Encode train with offsets
    input_ids_list = []; attention_mask_list = []; token_type_ids_list = []
    start_list = []; end_list = []
    n = len(df)
    t0 = time.time()
    for i, row in df.iterrows():
        if i % 2000 == 0:
            print(f'Row {i}/{n} elapsed {time.time()-t0:.1f}s', flush=True)
        sent = str(row['sentiment'])
        txt = str(row['text'])
        sel = str(row['selected_text'])
        enc = tokenizer(text=sent, text_pair=txt, add_special_tokens=True, truncation=True, max_length=max_len,
                        return_offsets_mapping=True, return_attention_mask=True)
        input_ids = enc['input_ids']; attn = enc['attention_mask']
        tt = enc.get('token_type_ids', None)
        offsets = enc['offset_mapping']
        seq_ids = enc.sequence_ids()

        span = find_span(txt, sel)  # keep original spans for all sentiments; post-proc handles neutral later
        tok_span = map_char_to_tokens(offsets, seq_ids, span, target_seq_id=1)
        # If mapping failed, default to tweet-side entire span
        if tok_span is None:
            # find first and last token on tweet side
            idxs = [idx for idx, sid in enumerate(seq_ids) if sid == 1]
            if len(idxs) == 0:
                s_tok = e_tok = 0
            else:
                s_tok, e_tok = idxs[0], idxs[-1]
        else:
            s_tok, e_tok = tok_span

        # Pad to max_len
        if len(input_ids) < max_len:
            pad_len = max_len - len(input_ids)
            pad_id = tokenizer.pad_token_id
            input_ids = input_ids + [pad_id] * pad_len
            attn = attn + [0] * pad_len
            if tt is not None:
                tt = tt + [0] * pad_len
        else:
            input_ids = input_ids[:max_len]
            attn = attn[:max_len]
            if tt is not None:
                tt = tt[:max_len]
            # also clamp labels within range if truncation occurred
            s_tok = min(s_tok, max_len-1); e_tok = min(e_tok, max_len-1)

        input_ids_list.append(np.array(input_ids, dtype=np.int32))
        attention_mask_list.append(np.array(attn, dtype=np.int8))
        if tt is not None:
            token_type_ids_list.append(np.array(tt, dtype=np.int8))
        else:
            token_type_ids_list = None
        start_list.append(s_tok); end_list.append(e_tok)

    X_train = {
        'input_ids': np.stack(input_ids_list),
        'attention_mask': np.stack(attention_mask_list),
        'start_positions': np.array(start_list, dtype=np.int32),
        'end_positions': np.array(end_list, dtype=np.int32),
        'fold': df['fold'].values.astype(np.int8),
        'sentiment': df['sentiment'].values,
        'text': df['text'].values,
        'selected_text': df['selected_text'].values,
    }
    if token_type_ids_list is not None:
        X_train['token_type_ids'] = np.stack(token_type_ids_list)
    np.savez_compressed(OUT_DIR / f'train_{prefix}.npz', **X_train)
    print('Saved', OUT_DIR / f'train_{prefix}.npz', 'shapes:',
          {k: v.shape if isinstance(v, np.ndarray) else len(v) for k, v in X_train.items() if hasattr(v, 'shape') or isinstance(v, (list, np.ndarray))})

    # Encode test (no labels)
    test_df = test.copy()
    test_df['text'] = test_df['text'].fillna('')
    test_df['sentiment'] = test_df['sentiment'].fillna('neutral')
    ti_ids = []; ta_masks = []; tt_ids = [];
    for i, row in test_df.iterrows():
        if i % 2000 == 0:
            print(f'Test row {i}/{len(test_df)}', flush=True)
        enc = tokenizer(text=str(row['sentiment']), text_pair=str(row['text']), add_special_tokens=True, truncation=True, max_length=max_len,
                        return_attention_mask=True)
        ids = enc['input_ids']; attn = enc['attention_mask']; tt = enc.get('token_type_ids', None)
        if len(ids) < max_len:
            pad_len = max_len - len(ids)
            ids = ids + [tokenizer.pad_token_id]*pad_len
            attn = attn + [0]*pad_len
            if tt is not None: tt = tt + [0]*pad_len
        else:
            ids = ids[:max_len]; attn = attn[:max_len]
            if tt is not None: tt = tt[:max_len]
        ti_ids.append(np.array(ids, dtype=np.int32))
        ta_masks.append(np.array(attn, dtype=np.int8))
        if tt is not None: tt_ids.append(np.array(tt, dtype=np.int8))
    X_test = {
        'input_ids': np.stack(ti_ids),
        'attention_mask': np.stack(ta_masks),
        'sentiment': test_df['sentiment'].values,
        'text': test_df['text'].values,
        'textID': test_df['textID'].values,
    }
    if len(tt_ids) == len(test_df):
        X_test['token_type_ids'] = np.stack(tt_ids)
    np.savez_compressed(OUT_DIR / f'test_{prefix}.npz', **X_test)
    print('Saved', OUT_DIR / f'test_{prefix}.npz', 'shapes:',
          {k: v.shape if isinstance(v, np.ndarray) else len(v) for k, v in X_test.items() if hasattr(v, 'shape') or isinstance(v, (list, np.ndarray))})

    meta = {'model_name': model_name, 'max_len': max_len, 'prefix': prefix}
    Path(OUT_DIR / f'meta_{prefix}.json').write_text(json.dumps(meta))
    print('Meta saved.')

prepare_cached_dataset(model_name='roberta-base', max_len=128, prefix='roberta_base_m128')
print('Cache build complete.')

Preparing dataset for roberta-base, max_len=128


Saved folds to cache/train_folds_roberta_base_m128.csv
Row 0/24732 elapsed 0.0s




Row 2000/24732 elapsed 0.3s


Row 4000/24732 elapsed 0.5s


Row 6000/24732 elapsed 0.8s


Row 8000/24732 elapsed 1.1s


Row 10000/24732 elapsed 1.3s


Row 12000/24732 elapsed 1.6s


Row 14000/24732 elapsed 1.9s


Row 16000/24732 elapsed 2.1s


Row 18000/24732 elapsed 2.4s


Row 20000/24732 elapsed 2.7s


Row 22000/24732 elapsed 2.9s


Row 24000/24732 elapsed 3.2s


Saved cache/train_roberta_base_m128.npz shapes: {'input_ids': (24732, 128), 'attention_mask': (24732, 128), 'start_positions': (24732,), 'end_positions': (24732,), 'fold': (24732,), 'sentiment': (24732,), 'text': (24732,), 'selected_text': (24732,)}
Test row 0/2749


Test row 2000/2749


Saved cache/test_roberta_base_m128.npz shapes: {'input_ids': (2749, 128), 'attention_mask': (2749, 128), 'sentiment': (2749,), 'text': (2749,), 'textID': (2749,)}
Meta saved.
Cache build complete.
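
A quick way to sanity-check the alignment step is a round trip: character span to token span and back to text. The sketch below uses a toy whitespace "tokenizer" as a stand-in for the fast tokenizer's offset mapping, so it runs without downloading a model. The names `char_span_to_token_span` and `whitespace_offsets` are hypothetical, and the overlap rule mirrors the one in `map_char_to_tokens`.

```python
def char_span_to_token_span(offsets, char_span):
    """Return (first, last) indices of tokens overlapping [cs, ce), or None."""
    cs, ce = char_span
    hits = [i for i, (s, e) in enumerate(offsets) if e > cs and s < ce]
    return (hits[0], hits[-1]) if hits else None

def whitespace_offsets(text):
    """Toy stand-in for an offset mapping: one token per whitespace-separated word."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets

text = "Going to fold laundry and then hit the sack"
selected = "fold laundry"
cs = text.find(selected)
offsets = whitespace_offsets(text)
tok_span = char_span_to_token_span(offsets, (cs, cs + len(selected)))
# Map the token span back to characters and recover the text.
recovered = text[offsets[tok_span[0]][0]:offsets[tok_span[1]][1]]
print(tok_span, repr(recovered))
```

Because the rule is "any overlap", a selection that starts or ends mid-token is widened to whole tokens, which is one reason a recovered span can differ slightly from the labelled `selected_text`.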


In [5]:
# Smoke training: 1-fold RoBERTa-base QA head, fp16, constrained decoding, neutral full-tweet rule
import math, time, json, numpy as np, pandas as pd, torch
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, get_linear_schedule_with_warmup
from torch.optim import AdamW

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
CACHE_TRAIN = Path('cache/train_roberta_base_m128.npz')
FOLDS_CSV = Path('cache/train_folds_roberta_base_m128.csv')
MAX_LEN = 128
MODEL_NAME = 'roberta-base'
BATCH_SIZE = 32
EPOCHS = 2  # smoke
LR = 3e-5
WARMUP = 0.1
MAX_SPAN_LEN = 30

tok_kwargs = {'use_fast': True, 'add_prefix_space': True}
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, **tok_kwargs)

data = np.load(CACHE_TRAIN, allow_pickle=True)
folds_df = pd.read_csv(FOLDS_CSV)

class QADataset(Dataset):
    def __init__(self, idxs):
        self.ids = data['input_ids'][idxs]
        self.attn = data['attention_mask'][idxs]
        self.has_tt = 'token_type_ids' in data.files
        if self.has_tt:
            self.tt = data['token_type_ids'][idxs]
        self.start = data['start_positions'][idxs]
        self.end = data['end_positions'][idxs]
    def __len__(self): return len(self.ids)
    def __getitem__(self, i):
        item = {
            'input_ids': torch.tensor(self.ids[i], dtype=torch.long),
            'attention_mask': torch.tensor(self.attn[i], dtype=torch.long),
            'start_positions': torch.tensor(self.start[i], dtype=torch.long),
            'end_positions': torch.tensor(self.end[i], dtype=torch.long),
        }
        if self.has_tt:
            item['token_type_ids'] = torch.tensor(self.tt[i], dtype=torch.long)
        return item

def constrained_decode_for_row(sentiment, text, model):
    if sentiment == 'neutral':
        return text
    enc = tokenizer(text=str(sentiment), text_pair=str(text), add_special_tokens=True, truncation=True, max_length=MAX_LEN,
                    return_offsets_mapping=True, return_tensors='pt')
    input_ids = enc['input_ids'].to(device)
    attention_mask = enc['attention_mask'].to(device)
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask)
        start_logits = out.start_logits[0].detach().cpu().numpy()
        end_logits = out.end_logits[0].detach().cpu().numpy()
    offsets = enc['offset_mapping'][0].tolist()
    seq_ids = enc.sequence_ids(0)  # list with None/0/1 (None = special token, 0 = sentiment, 1 = tweet)
    # indices for tweet side
    tweet_idxs = [i for i, sid in enumerate(seq_ids) if sid == 1]
    if not tweet_idxs:
        return text
    # restrict logits to tweet side by setting others to -inf
    neg_inf = -1e9
    s_logits = start_logits.copy(); e_logits = end_logits.copy()
    for i, sid in enumerate(seq_ids):
        if sid != 1:
            s_logits[i] = neg_inf; e_logits[i] = neg_inf
    # top-k candidates
    k = min(5, len(tweet_idxs))
    start_cand = np.argsort(s_logits)[-k:]
    end_cand = np.argsort(e_logits)[-k:]
    best = None; best_score = -1e18
    for si in start_cand:
        for ei in end_cand:
            if ei < si: continue
            if ei - si + 1 > MAX_SPAN_LEN: continue
            score = s_logits[si] + e_logits[ei]
            if score > best_score:
                best_score = score; best = (si, ei)
    if best is None:
        # fallback single best start token
        si = int(np.argmax(s_logits)); ei = si
    else:
        si, ei = best
    # map to char offsets and extract
    cs = offsets[si][0]; ce = offsets[ei][1]
    sub = text[cs:ce]
    sub = sub.strip()
    if not sub:
        # fallback
        si = int(np.argmax(s_logits)); cs = offsets[si][0]; ce = offsets[si][1]
        sub = text[cs:ce].strip() or text
    return sub

def run_fold(fold=0):
    all_folds = folds_df['fold'].values
    train_idx = np.where(all_folds != fold)[0]
    val_idx = np.where(all_folds == fold)[0]
    print(f'Fold {fold}: train {len(train_idx)} | val {len(val_idx)}')
    train_ds = QADataset(train_idx); val_ds = QADataset(val_idx)
    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
    val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)

    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)
    opt = AdamW(model.parameters(), lr=LR, weight_decay=0.01)
    num_train_steps = EPOCHS * len(train_loader)
    num_warmup = int(WARMUP * num_train_steps)
    sch = get_linear_schedule_with_warmup(opt, num_warmup, num_train_steps)
    scaler = torch.cuda.amp.GradScaler(enabled=True)

    t0 = time.time()
    for epoch in range(EPOCHS):
        model.train(); tr_loss = 0.0
        for step, batch in enumerate(train_loader):
            for k in list(batch.keys()): batch[k] = batch[k].to(device)
            opt.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast(enabled=True):
                out = model(**batch)
                loss = out.loss
            scaler.scale(loss).backward()
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt); scaler.update(); sch.step()
            tr_loss += loss.item()
            if (step+1) % 100 == 0:
                print(f'Epoch {epoch+1} Step {step+1}/{len(train_loader)} loss {tr_loss/(step+1):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        print(f'Epoch {epoch+1} done. Train loss {tr_loss/max(1,len(train_loader)):.4f}')

    # Validation decode and Jaccard
    model.eval()
    sentiments = data['sentiment'][val_idx]
    texts = data['text'][val_idx]
    gold = data['selected_text'][val_idx]
    preds = []
    for i in range(len(val_idx)):
        pred = constrained_decode_for_row(str(sentiments[i]), str(texts[i]), model)
        # neutral hard rule
        if sentiments[i] == 'neutral': pred = str(texts[i])
        preds.append(pred)
        if (i+1) % 500 == 0:
            print(f'Val decoded {i+1}/{len(val_idx)}', flush=True)
    # Jaccard
    def jac(a,b):
        sa = set(str(a).split()); sb = set(str(b).split());
        return (len(sa & sb)) / (len(sa | sb) + 1e-12)
    scores = [jac(preds[i], gold[i]) for i in range(len(preds))]
    score = float(np.mean(scores))
    print(f'Fold {fold} OOF Jaccard: {score:.5f}')
    # Save model for potential reuse
    outdir = Path(f'models/roberta_base_f{fold}')
    outdir.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(outdir)
    tokenizer.save_pretrained(outdir)
    # Return preds for potential error analysis
    return score, preds

score, _ = run_fold(fold=0)
print('Smoke training complete. Fold0 Jaccard =', score)

# If the score looks sane (>0.70), we will proceed to full 5-fold DeBERTa-v3-base next.



Fold 0: train 19785 | val 4947


Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  scaler = torch.cuda.amp.GradScaler(enabled=True)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  with torch.cuda.amp.autocast(enabled=True):


Epoch 1 Step 100/619 loss 3.1064 elapsed 10.9s


Epoch 1 Step 200/619 loss 2.1781 elapsed 21.5s


Epoch 1 Step 300/619 loss 1.7868 elapsed 32.1s


Epoch 1 Step 400/619 loss 1.5770 elapsed 42.7s


Epoch 1 Step 500/619 loss 1.4384 elapsed 53.3s


Epoch 1 Step 600/619 loss 1.3437 elapsed 64.0s


Epoch 1 done. Train loss 1.3306




Epoch 2 Step 100/619 loss 0.8026 elapsed 76.7s


Epoch 2 Step 200/619 loss 0.8096 elapsed 87.4s


Epoch 2 Step 300/619 loss 0.7896 elapsed 98.1s


Epoch 2 Step 400/619 loss 0.7779 elapsed 108.8s


Epoch 2 Step 500/619 loss 0.7738 elapsed 119.5s


Epoch 2 Step 600/619 loss 0.7672 elapsed 130.3s


Epoch 2 done. Train loss 0.7655


Val decoded 500/4947


Val decoded 1000/4947


Val decoded 1500/4947


Val decoded 2000/4947


Val decoded 2500/4947


Val decoded 3000/4947


Val decoded 3500/4947


Val decoded 4000/4947


Val decoded 4500/4947


Fold 0 OOF Jaccard: 0.71234


Smoke training complete. Fold0 Jaccard = 0.7123445887907528
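
The constrained decoding used in the smoke run can be illustrated on toy logits. The sketch below is a pure-Python restatement of the same idea (mask non-tweet positions, take the top-k start and end candidates, keep the best pair with end >= start and a capped span length). The function name `best_span` and the logit values are hypothetical.

```python
def best_span(start_logits, end_logits, allowed, top_k=5, max_span_len=30):
    """Pick (start, end) maximizing start + end logit among top-k candidates.

    Only positions flagged in `allowed` are considered; end must not precede
    start, and the span may cover at most `max_span_len` tokens.
    """
    idxs = [i for i, ok in enumerate(allowed) if ok]
    if not idxs:
        return None
    k = min(top_k, len(idxs))
    starts = sorted(idxs, key=lambda i: start_logits[i], reverse=True)[:k]
    ends = sorted(idxs, key=lambda i: end_logits[i], reverse=True)[:k]
    best, best_score = None, float("-inf")
    for s in starts:
        for e in ends:
            if e < s or e - s + 1 > max_span_len:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    if best is None:
        # No valid pair: fall back to the single best start token.
        best = (starts[0], starts[0])
    return best

# Toy logits (hypothetical). Positions 0-2 mimic special/sentiment tokens.
start_logits = [9.0, 0.1, 0.2, 1.0, 4.0, 0.5, 0.3]
end_logits = [9.0, 0.1, 0.2, 3.5, 0.2, 3.0, 0.1]
allowed = [False, False, False, True, True, True, True]
print(best_span(start_logits, end_logits, allowed))
```

In this toy case an independent argmax over the allowed positions would pair start 4 with end 3, which is an invalid span; the joint search instead settles on (4, 5).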


In [6]:
# Build DeBERTa-v3-base cache with sentiment prompt and robust alignment
import os, re, json, time, numpy as np, pandas as pd
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from transformers import AutoTokenizer

SEED = 42
OUT_DIR = Path('cache')
OUT_DIR.mkdir(exist_ok=True, parents=True)

def find_span(text, sel):
    if not isinstance(text, str) or not isinstance(sel, str) or len(sel) == 0:
        return None
    start = text.find(sel)
    if start != -1:
        return (start, start + len(sel))
    def collapse_spaces(s):
        return re.sub(r'\s+', ' ', s.strip())
    text_c = collapse_spaces(text); sel_c = collapse_spaces(sel)
    start_c = text_c.find(sel_c)
    if start_c == -1:
        return None
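    # collapse_spaces() strips leading whitespace, so skip it here to keep both index spaces aligned
    i = 0
    while i < len(text) and text[i].isspace():
        i += 1
    map_idx = []  # map from collapsed index to original index
    while i < len(text):
        if text[i].isspace():
            map_idx.append(i)
            while i < len(text) and text[i].isspace():
                i += 1
        else:
            map_idx.append(i); i += 1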
    if start_c < len(map_idx):
        start_orig = map_idx[start_c]
        end_c = start_c + len(sel_c)
        end_orig = map_idx[min(end_c-1, len(map_idx)-1)] + 1
        return (start_orig, end_orig)
    return None

def map_char_to_tokens(offsets, seq_ids, char_span, target_seq_id=1):
    if char_span is None:
        return None
    cs, ce = char_span
    start_tok = end_tok = None
    for i, (o, sid) in enumerate(zip(offsets, seq_ids)):
        if sid != target_seq_id:
            continue
        os_, oe_ = o
        if os_ is None:
            continue
        if oe_ > cs and os_ < ce:
            if start_tok is None:
                start_tok = i
            end_tok = i
    if start_tok is None or end_tok is None:
        return None
    return (start_tok, end_tok)

def prepare_cached_dataset_deberta(model_name='microsoft/deberta-v3-base', max_len=128, prefix='deberta_v3_base_m128_prompt'):
    print(f'Preparing dataset for {model_name}, max_len={max_len}, prefix={prefix}', flush=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    df = train.copy()
    df['text'] = df['text'].fillna('')
    df['selected_text'] = df['selected_text'].fillna('')
    df['sentiment'] = df['sentiment'].fillna('neutral')

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    df['fold'] = -1
    for k, (_, val_idx) in enumerate(skf.split(df, df['sentiment'])):
        df.loc[val_idx, 'fold'] = k
    df.to_csv(OUT_DIR / f'train_folds_{prefix}.csv', index=False)
    print('Saved folds to', OUT_DIR / f'train_folds_{prefix}.csv')

    input_ids_list = []; attention_mask_list = [];
    start_list = []; end_list = []
    n = len(df); t0 = time.time()
    for i, row in df.iterrows():
        if i % 2000 == 0:
            print(f'Row {i}/{n} elapsed {time.time()-t0:.1f}s', flush=True)
        sent_prompt = f"sentiment: {str(row['sentiment'])}"
        txt = str(row['text'])
        sel = str(row['selected_text'])
        enc = tokenizer(text=sent_prompt, text_pair=txt, add_special_tokens=True, truncation=True, max_length=max_len,
                        return_offsets_mapping=True, return_attention_mask=True)
        input_ids = enc['input_ids']; attn = enc['attention_mask']
        offsets = enc['offset_mapping']
        seq_ids = enc.sequence_ids()
        span = find_span(txt, sel)
        tok_span = map_char_to_tokens(offsets, seq_ids, span, target_seq_id=1)
        if tok_span is None:
            idxs = [idx for idx, sid in enumerate(seq_ids) if sid == 1]
            if len(idxs) == 0:
                s_tok = e_tok = 0
            else:
                s_tok, e_tok = idxs[0], idxs[-1]
        else:
            s_tok, e_tok = tok_span
        if len(input_ids) < max_len:
            pad_len = max_len - len(input_ids)
            pad_id = tokenizer.pad_token_id
            input_ids = input_ids + [pad_id]*pad_len
            attn = attn + [0]*pad_len
        else:
            input_ids = input_ids[:max_len]; attn = attn[:max_len]
            s_tok = min(s_tok, max_len-1); e_tok = min(e_tok, max_len-1)
        input_ids_list.append(np.array(input_ids, dtype=np.int32))
        attention_mask_list.append(np.array(attn, dtype=np.int8))
        start_list.append(s_tok); end_list.append(e_tok)

    X_train = {
        'input_ids': np.stack(input_ids_list),
        'attention_mask': np.stack(attention_mask_list),
        'start_positions': np.array(start_list, dtype=np.int32),
        'end_positions': np.array(end_list, dtype=np.int32),
        'fold': df['fold'].values.astype(np.int8),
        'sentiment': df['sentiment'].values,
        'text': df['text'].values,
        'selected_text': df['selected_text'].values,
    }
    np.savez_compressed(OUT_DIR / f'train_{prefix}.npz', **X_train)
    print('Saved', OUT_DIR / f'train_{prefix}.npz')

    # Test encoding
    test_df = test.copy()
    test_df['text'] = test_df['text'].fillna('')
    test_df['sentiment'] = test_df['sentiment'].fillna('neutral')
    ti_ids = []; ta_masks = []
    for i, row in test_df.iterrows():
        if i % 2000 == 0:
            print(f'Test row {i}/{len(test_df)}', flush=True)
        sent_prompt = f"sentiment: {str(row['sentiment'])}"
        enc = tokenizer(text=sent_prompt, text_pair=str(row['text']), add_special_tokens=True, truncation=True, max_length=max_len,
                        return_attention_mask=True)
        ids = enc['input_ids']; attn = enc['attention_mask']
        if len(ids) < max_len:
            pad_len = max_len - len(ids)
            ids = ids + [tokenizer.pad_token_id]*pad_len
            attn = attn + [0]*pad_len
        else:
            ids = ids[:max_len]; attn = attn[:max_len]
        ti_ids.append(np.array(ids, dtype=np.int32))
        ta_masks.append(np.array(attn, dtype=np.int8))
    X_test = {
        'input_ids': np.stack(ti_ids),
        'attention_mask': np.stack(ta_masks),
        'sentiment': test_df['sentiment'].values,
        'text': test_df['text'].values,
        'textID': test_df['textID'].values,
    }
    np.savez_compressed(OUT_DIR / f'test_{prefix}.npz', **X_test)
    print('Saved', OUT_DIR / f'test_{prefix}.npz')
    Path(OUT_DIR / f'meta_{prefix}.json').write_text(json.dumps({'model_name': model_name, 'max_len': max_len, 'prefix': prefix, 'prompt': True}))
    print('Meta saved.')

prepare_cached_dataset_deberta()
print('DeBERTa cache build complete.')

Preparing dataset for microsoft/deberta-v3-base, max_len=128, prefix=deberta_v3_base_m128_prompt




Saved folds to cache/train_folds_deberta_v3_base_m128_prompt.csv
Row 0/24732 elapsed 0.0s


Row 2000/24732 elapsed 0.3s


Row 4000/24732 elapsed 0.5s


Row 6000/24732 elapsed 0.8s


Row 8000/24732 elapsed 1.1s


Row 10000/24732 elapsed 1.4s


Row 12000/24732 elapsed 1.6s


Row 14000/24732 elapsed 1.9s


Row 16000/24732 elapsed 2.2s


Row 18000/24732 elapsed 2.5s


Row 20000/24732 elapsed 2.7s


Row 22000/24732 elapsed 3.0s


Row 24000/24732 elapsed 3.3s


Saved cache/train_deberta_v3_base_m128_prompt.npz
Test row 0/2749


Test row 2000/2749


Saved cache/test_deberta_v3_base_m128_prompt.npz
Meta saved.
DeBERTa cache build complete.
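
Both cache cells rebuild the folds with the same seed and write a separate folds CSV, so the assignments are expected to be identical. A small check like the one below can confirm that before OOF scores from different models are compared or blended. This is a sketch only: `read_folds` and `folds_match` are hypothetical helpers, and the in-memory CSVs are toy rows, not real data.

```python
import csv
import io

def read_folds(csv_file):
    """Map textID -> fold from a CSV that has `textID` and `fold` columns."""
    return {row["textID"]: int(row["fold"]) for row in csv.DictReader(csv_file)}

def folds_match(file_a, file_b):
    """True when both files assign every textID to the same fold."""
    return read_folds(file_a) == read_folds(file_b)

# In-memory stand-ins for the two fold CSVs (toy rows, row order differs).
roberta_csv = "textID,fold\na1,0\nb2,3\nc3,1\n"
deberta_csv = "textID,fold\nc3,1\na1,0\nb2,3\n"
print(folds_match(io.StringIO(roberta_csv), io.StringIO(deberta_csv)))
```

For the real files, the same call would take handles opened with `open(path, newline='', encoding='utf-8')` on `cache/train_folds_roberta_base_m128.csv` and `cache/train_folds_deberta_v3_base_m128_prompt.csv`.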


In [9]:
# DeBERTa-v3-base 5-fold training with sentiment prompt, fp16, OOF/test logits save, fold-avg decode, submission
import math, time, json, numpy as np, pandas as pd, torch, os
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, get_linear_schedule_with_warmup
from torch.optim import AdamW

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
PREFIX = 'deberta_v3_base_m128_prompt'
CACHE_TRAIN = Path(f'cache/train_{PREFIX}.npz')
CACHE_TEST = Path(f'cache/test_{PREFIX}.npz')
FOLDS_CSV = Path(f'cache/train_folds_{PREFIX}.csv')
MODEL_NAME = 'microsoft/deberta-v3-base'
MAX_LEN = 128
EPOCHS = 3
LR = 3e-5
WARMUP = 0.1
WEIGHT_DECAY = 0.01
BATCH_SIZE = 16  # per device
GRAD_ACCUM = 4   # effective batch 64
CLIP_NORM = 1.0
TOP_K = 10
SPAN_CAP = 30
SEED = 42
torch.manual_seed(SEED); np.random.seed(SEED)

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

train_npz = np.load(CACHE_TRAIN, allow_pickle=True)
test_npz = np.load(CACHE_TEST, allow_pickle=True)
folds_df = pd.read_csv(FOLDS_CSV)

class QADatasetCached(Dataset):
    def __init__(self, idxs):
        self.ids = train_npz['input_ids'][idxs]
        self.attn = train_npz['attention_mask'][idxs]
        self.start = train_npz['start_positions'][idxs]
        self.end = train_npz['end_positions'][idxs]
    def __len__(self): return len(self.ids)
    def __getitem__(self, i):
        return {
            'input_ids': torch.tensor(self.ids[i], dtype=torch.long),
            'attention_mask': torch.tensor(self.attn[i], dtype=torch.long),
            'start_positions': torch.tensor(self.start[i], dtype=torch.long),
            'end_positions': torch.tensor(self.end[i], dtype=torch.long),
        }

def decode_with_logits_for_row(sentiment, text, start_logits, end_logits):
    # Neutral hard rule
    if sentiment == 'neutral':
        return text
    # Re-encode to get offsets and sequence_ids; MUST pad to max_length to match logits shape
    sent_prompt = f'sentiment: {sentiment}'
    enc = tok(text=sent_prompt, text_pair=str(text), add_special_tokens=True, truncation=True, max_length=MAX_LEN,
              padding='max_length', return_offsets_mapping=True, return_tensors='pt')
    offsets = enc['offset_mapping'][0].tolist()
    seq_ids = enc.sequence_ids(0)
    # Restrict to tweet tokens
    tweet_mask = np.array([1 if sid==1 else 0 for sid in seq_ids], dtype=np.int8)
    neg_inf = -1e9
    s = start_logits.copy(); e = end_logits.copy()
    if tweet_mask.shape[0] != s.shape[0]:
        # fallback: return full text
        return text
    s[tweet_mask==0] = neg_inf; e[tweet_mask==0] = neg_inf
    k = min(TOP_K, int(tweet_mask.sum()))
    start_cand = np.argsort(s)[-k:]
    end_cand = np.argsort(e)[-k:]
    best = None; best_score = -1e18
    for si in start_cand:
        for ei in end_cand:
            if ei < si: continue
            if (ei - si + 1) > SPAN_CAP: continue
            sc = s[si] + e[ei]
            if sc > best_score:
                best_score = sc; best = (si, ei)
    if best is None:
        si = int(np.argmax(s)); ei = si
    else:
        si, ei = best
    # Guard invalid offsets: move start left / end right to the nearest valid tweet-side token
    # (normally a no-op, because non-tweet positions are masked out above)
    def valid_left(i):
        while i >= 0 and (seq_ids[i] != 1 or offsets[i][0] is None or offsets[i][1] is None):
            i -= 1
        return i
    def valid_right(i):
        n = len(offsets)
        while i < n and (seq_ids[i] != 1 or offsets[i][0] is None or offsets[i][1] is None):
            i += 1
        return i
    si = valid_left(si); ei = valid_right(ei)
    if si < 0 or ei >= len(offsets) or si > ei:
        # fallback to best single start token
        si = int(np.argmax(s)); si = valid_left(si)
        if si < 0: return text
        cs, ce = offsets[si][0], offsets[si][1]
        sub = text[cs:ce].strip()
        return sub if sub else text
    cs = offsets[si][0]; ce = offsets[ei][1]
    if cs is None or ce is None:
        return text
    sub = text[cs:ce].strip()
    return sub if sub else text

def jaccard_str(a, b):
    sa = set(str(a).split()); sb = set(str(b).split())
    return (len(sa & sb)) / (len(sa | sb) + 1e-12)

def train_fold(fold):
    all_folds = folds_df['fold'].values
    tr_idx = np.where(all_folds != fold)[0]
    va_idx = np.where(all_folds == fold)[0]
    print(f'Fold {fold}: train {len(tr_idx)} | val {len(va_idx)}', flush=True)
    train_ds = QADatasetCached(tr_idx)
    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)
    opt = AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
    # Scheduler calibrated to optimizer steps (with grad accumulation)
    steps_per_epoch = len(train_loader)
    opt_steps_per_epoch = math.ceil(steps_per_epoch / GRAD_ACCUM)
    total_opt_steps = opt_steps_per_epoch * EPOCHS
    warmup_steps = int(WARMUP * total_opt_steps)
    sch = get_linear_schedule_with_warmup(opt, warmup_steps, total_opt_steps)
    scaler = torch.amp.GradScaler('cuda', enabled=True)  # torch.cuda.amp.GradScaler(...) is deprecated
    best_score = -1.0; best_state = None
    t0 = time.time()
    model.train()
    for epoch in range(EPOCHS):
        tr_loss = 0.0
        opt.zero_grad(set_to_none=True)
        for step, batch in enumerate(train_loader):
            batch = {k: v.to(device) for k, v in batch.items()}
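            with torch.amp.autocast('cuda', enabled=True):
                out = model(**batch)
                loss = out.loss / GRAD_ACCUM
            scaler.scale(loss).backward()
            # Also step on the last micro-batch so a trailing partial group is not discarded
            if (step + 1) % GRAD_ACCUM == 0 or (step + 1) == steps_per_epoch: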
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
                scaler.step(opt); scaler.update(); sch.step()
                opt.zero_grad(set_to_none=True)
            tr_loss += loss.item() * GRAD_ACCUM
            if (step + 1) % 100 == 0:
                print(f'fold {fold} epoch {epoch+1} step {step+1}/{steps_per_epoch} loss {tr_loss/(step+1):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        # Eval at epoch end
        model.eval()
        # Collect OOF logits for val indices by running on cached inputs
        va_ids = train_npz['input_ids'][va_idx]
        va_attn = train_npz['attention_mask'][va_idx]
        start_logits = np.zeros((len(va_idx), MAX_LEN), dtype=np.float32)
        end_logits = np.zeros((len(va_idx), MAX_LEN), dtype=np.float32)
        bs = 64
        with torch.no_grad():
            for i in range(0, len(va_idx), bs):
                x_ids = torch.tensor(va_ids[i:i+bs], dtype=torch.long, device=device)
                x_attn = torch.tensor(va_attn[i:i+bs], dtype=torch.long, device=device)
                out = model(input_ids=x_ids, attention_mask=x_attn)
                start_logits[i:i+bs] = out.start_logits.detach().cpu().numpy()
                end_logits[i:i+bs] = out.end_logits.detach().cpu().numpy()
        # Decode OOF
        sentiments = train_npz['sentiment'][va_idx]
        texts = train_npz['text'][va_idx]
        gold = train_npz['selected_text'][va_idx]
        preds = []
        for i in range(len(va_idx)):
            pred = decode_with_logits_for_row(str(sentiments[i]), str(texts[i]), start_logits[i], end_logits[i])
            preds.append(pred)
        score = float(np.mean([jaccard_str(preds[i], gold[i]) for i in range(len(preds))]))
        print(f'fold {fold} epoch {epoch+1} OOF Jaccard {score:.5f}', flush=True)
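        if score > best_score + 1e-4:
            best_score = score
            # state_dict() returns live references; clone to CPU so later epochs do not overwrite the snapshot
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}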
        model.train()
    # Load best state
    if best_state is not None:
        model.load_state_dict(best_state)
    # Final OOF logits with best model
    model.eval()
    va_ids = train_npz['input_ids'][va_idx]
    va_attn = train_npz['attention_mask'][va_idx]
    oof_start = np.zeros((len(va_idx), MAX_LEN), dtype=np.float32)
    oof_end = np.zeros((len(va_idx), MAX_LEN), dtype=np.float32)
    bs = 64
    with torch.no_grad():
        for i in range(0, len(va_idx), bs):
            x_ids = torch.tensor(va_ids[i:i+bs], dtype=torch.long, device=device)
            x_attn = torch.tensor(va_attn[i:i+bs], dtype=torch.long, device=device)
            out = model(input_ids=x_ids, attention_mask=x_attn)
            oof_start[i:i+bs] = out.start_logits.detach().cpu().numpy()
            oof_end[i:i+bs] = out.end_logits.detach().cpu().numpy()
    # Test logits for this fold
    te_ids = test_npz['input_ids']
    te_attn = test_npz['attention_mask']
    te_start = np.zeros((len(te_ids), MAX_LEN), dtype=np.float32)
    te_end = np.zeros((len(te_ids), MAX_LEN), dtype=np.float32)
    with torch.no_grad():
        for i in range(0, len(te_ids), bs):
            x_ids = torch.tensor(te_ids[i:i+bs], dtype=torch.long, device=device)
            x_attn = torch.tensor(te_attn[i:i+bs], dtype=torch.long, device=device)
            out = model(input_ids=x_ids, attention_mask=x_attn)
            te_start[i:i+bs] = out.start_logits.detach().cpu().numpy()
            te_end[i:i+bs] = out.end_logits.detach().cpu().numpy()
    # Save logits
    fold_dir = Path(f'cache/oof_{PREFIX}')
    fold_dir.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(fold_dir / f'fold{fold}_oof_logits.npz', idx=va_idx, start=oof_start, end=oof_end)
    np.savez_compressed(fold_dir / f'fold{fold}_test_logits.npz', start=te_start, end=te_end)
    # Report final best OOF using decode
    sentiments = train_npz['sentiment'][va_idx]
    texts = train_npz['text'][va_idx]
    gold = train_npz['selected_text'][va_idx]
    preds = []
    for i in range(len(va_idx)):
        pred = decode_with_logits_for_row(str(sentiments[i]), str(texts[i]), oof_start[i], oof_end[i])
        preds.append(pred)
    final_oof = float(np.mean([jaccard_str(preds[i], gold[i]) for i in range(len(preds))]))
    print(f'fold {fold} best OOF Jaccard {final_oof:.5f}', flush=True)
    return final_oof

# Train all folds
fold_scores = []
for f in range(5):
    t0 = time.time()
    sc = train_fold(f)
    fold_scores.append(sc)
    print(f'Fold {f} done in {time.time()-t0:.1f}s, OOF {sc:.5f}', flush=True)
print('OOF mean:', float(np.mean(fold_scores)))

# Average test logits across folds and decode once
fold_dir = Path(f'cache/oof_{PREFIX}')
te_files = [np.load(fold_dir / f'fold{f}_test_logits.npz') for f in range(5)]
te_start = np.mean([f['start'] for f in te_files], axis=0)
te_end = np.mean([f['end'] for f in te_files], axis=0)
test_sent = test_npz['sentiment']
test_text = test_npz['text']
preds = []
for i in range(len(test_text)):
    pred = decode_with_logits_for_row(str(test_sent[i]), str(test_text[i]), te_start[i], te_end[i])
    preds.append(pred)
# Build submission
sub = pd.DataFrame({'textID': test_npz['textID'], 'selected_text': preds})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(sub), 'rows')

# Quick sanity: show head
print(sub.head())



Fold 0: train 19785 | val 4947


Some weights of DebertaV2ForQuestionAnswering were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


fold 0 epoch 1 step 100/1237 loss 4.2007 elapsed 8.4s


fold 0 epoch 1 step 200/1237 loss 3.3069 elapsed 16.7s


fold 0 epoch 1 step 300/1237 loss 2.6995 elapsed 25.1s


fold 0 epoch 1 step 400/1237 loss 2.3098 elapsed 33.4s


fold 0 epoch 1 step 500/1237 loss 2.0506 elapsed 41.8s


fold 0 epoch 1 step 600/1237 loss 1.8634 elapsed 50.2s


fold 0 epoch 1 step 700/1237 loss 1.7289 elapsed 58.6s


fold 0 epoch 1 step 800/1237 loss 1.6225 elapsed 67.0s


fold 0 epoch 1 step 900/1237 loss 1.5384 elapsed 75.5s


fold 0 epoch 1 step 1000/1237 loss 1.4743 elapsed 83.9s


fold 0 epoch 1 step 1100/1237 loss 1.4201 elapsed 92.4s


fold 0 epoch 1 step 1200/1237 loss 1.3738 elapsed 100.9s


fold 0 epoch 1 OOF Jaccard 0.70714


fold 0 epoch 2 step 100/1237 loss 0.8010 elapsed 129.7s


fold 0 epoch 2 step 200/1237 loss 0.7932 elapsed 138.2s


fold 0 epoch 2 step 300/1237 loss 0.7956 elapsed 146.7s


fold 0 epoch 2 step 400/1237 loss 0.7863 elapsed 155.3s


fold 0 epoch 2 step 500/1237 loss 0.7863 elapsed 163.8s


fold 0 epoch 2 step 600/1237 loss 0.7919 elapsed 172.3s


fold 0 epoch 2 step 700/1237 loss 0.7933 elapsed 180.8s


fold 0 epoch 2 step 800/1237 loss 0.7888 elapsed 189.4s


fold 0 epoch 2 step 900/1237 loss 0.7907 elapsed 197.9s


fold 0 epoch 2 step 1000/1237 loss 0.7913 elapsed 206.4s


fold 0 epoch 2 step 1100/1237 loss 0.7929 elapsed 215.0s


fold 0 epoch 2 step 1200/1237 loss 0.7906 elapsed 223.5s


fold 0 epoch 2 OOF Jaccard 0.71007


fold 0 epoch 3 step 100/1237 loss 0.6945 elapsed 252.6s


fold 0 epoch 3 step 200/1237 loss 0.7003 elapsed 261.1s


fold 0 epoch 3 step 300/1237 loss 0.7083 elapsed 269.7s


fold 0 epoch 3 step 400/1237 loss 0.7140 elapsed 278.2s


fold 0 epoch 3 step 500/1237 loss 0.7126 elapsed 286.8s


fold 0 epoch 3 step 600/1237 loss 0.7122 elapsed 295.3s


fold 0 epoch 3 step 700/1237 loss 0.7089 elapsed 303.9s


fold 0 epoch 3 step 800/1237 loss 0.7123 elapsed 312.4s


fold 0 epoch 3 step 900/1237 loss 0.7108 elapsed 321.0s


fold 0 epoch 3 step 1000/1237 loss 0.7121 elapsed 329.5s


fold 0 epoch 3 step 1100/1237 loss 0.7139 elapsed 338.3s


fold 0 epoch 3 step 1200/1237 loss 0.7133 elapsed 346.8s


fold 0 epoch 3 OOF Jaccard 0.71308


fold 0 best OOF Jaccard 0.71308


Fold 0 done in 394.2s, OOF 0.71308


Fold 1: train 19785 | val 4947


Some weights of DebertaV2ForQuestionAnswering were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


fold 1 epoch 1 step 100/1237 loss 5.0575 elapsed 8.6s


fold 1 epoch 1 step 200/1237 loss 3.9591 elapsed 17.1s


fold 1 epoch 1 step 300/1237 loss 3.1630 elapsed 25.7s


fold 1 epoch 1 step 400/1237 loss 2.6524 elapsed 34.3s


fold 1 epoch 1 step 500/1237 loss 2.3128 elapsed 42.9s


fold 1 epoch 1 step 600/1237 loss 2.0861 elapsed 51.5s


fold 1 epoch 1 step 700/1237 loss 1.9183 elapsed 60.1s


fold 1 epoch 1 step 800/1237 loss 1.7856 elapsed 68.8s


fold 1 epoch 1 step 900/1237 loss 1.6889 elapsed 77.4s


fold 1 epoch 1 step 1000/1237 loss 1.6067 elapsed 86.0s


fold 1 epoch 1 step 1100/1237 loss 1.5379 elapsed 94.6s


fold 1 epoch 1 step 1200/1237 loss 1.4816 elapsed 103.2s


fold 1 epoch 1 OOF Jaccard 0.70076


fold 1 epoch 2 step 100/1237 loss 0.7906 elapsed 132.4s


fold 1 epoch 2 step 200/1237 loss 0.7685 elapsed 141.0s


fold 1 epoch 2 step 300/1237 loss 0.7838 elapsed 149.6s


fold 1 epoch 2 step 400/1237 loss 0.7811 elapsed 158.2s


fold 1 epoch 2 step 500/1237 loss 0.7907 elapsed 166.8s


fold 1 epoch 2 step 600/1237 loss 0.7826 elapsed 175.4s


fold 1 epoch 2 step 700/1237 loss 0.7852 elapsed 184.0s


fold 1 epoch 2 step 800/1237 loss 0.7883 elapsed 192.6s


fold 1 epoch 2 step 900/1237 loss 0.7907 elapsed 201.2s


fold 1 epoch 2 step 1000/1237 loss 0.7886 elapsed 209.8s


fold 1 epoch 2 step 1100/1237 loss 0.7883 elapsed 218.4s


fold 1 epoch 2 step 1200/1237 loss 0.7847 elapsed 227.0s


fold 1 epoch 2 OOF Jaccard 0.70544


fold 1 epoch 3 step 100/1237 loss 0.7157 elapsed 256.3s


fold 1 epoch 3 step 200/1237 loss 0.7262 elapsed 264.9s


fold 1 epoch 3 step 300/1237 loss 0.7225 elapsed 273.5s


fold 1 epoch 3 step 400/1237 loss 0.7190 elapsed 282.1s


fold 1 epoch 3 step 500/1237 loss 0.7203 elapsed 290.7s


fold 1 epoch 3 step 600/1237 loss 0.7200 elapsed 299.3s


fold 1 epoch 3 step 700/1237 loss 0.7226 elapsed 307.9s


fold 1 epoch 3 step 800/1237 loss 0.7210 elapsed 316.5s


fold 1 epoch 3 step 900/1237 loss 0.7166 elapsed 325.1s


fold 1 epoch 3 step 1000/1237 loss 0.7157 elapsed 333.7s


fold 1 epoch 3 step 1100/1237 loss 0.7124 elapsed 342.2s


fold 1 epoch 3 step 1200/1237 loss 0.7082 elapsed 350.8s


fold 1 epoch 3 OOF Jaccard 0.70956


fold 1 best OOF Jaccard 0.70956


Fold 1 done in 398.5s, OOF 0.70956


Fold 2: train 19786 | val 4946


Some weights of DebertaV2ForQuestionAnswering were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


fold 2 epoch 1 step 100/1237 loss 4.9843 elapsed 8.6s


fold 2 epoch 1 step 200/1237 loss 3.8584 elapsed 17.2s


fold 2 epoch 1 step 300/1237 loss 3.0886 elapsed 25.8s


fold 2 epoch 1 step 400/1237 loss 2.6052 elapsed 34.4s


fold 2 epoch 1 step 500/1237 loss 2.2897 elapsed 43.0s


fold 2 epoch 1 step 600/1237 loss 2.0640 elapsed 51.6s


fold 2 epoch 1 step 700/1237 loss 1.8926 elapsed 60.2s


fold 2 epoch 1 step 800/1237 loss 1.7633 elapsed 68.8s


fold 2 epoch 1 step 900/1237 loss 1.6671 elapsed 77.5s


fold 2 epoch 1 step 1000/1237 loss 1.5862 elapsed 86.1s


fold 2 epoch 1 step 1100/1237 loss 1.5219 elapsed 94.7s


fold 2 epoch 1 step 1200/1237 loss 1.4678 elapsed 103.3s


fold 2 epoch 1 OOF Jaccard 0.71214


fold 2 epoch 2 step 100/1237 loss 0.8000 elapsed 132.6s


fold 2 epoch 2 step 200/1237 loss 0.8050 elapsed 141.1s


fold 2 epoch 2 step 300/1237 loss 0.7938 elapsed 149.7s


fold 2 epoch 2 step 400/1237 loss 0.7873 elapsed 158.2s


fold 2 epoch 2 step 500/1237 loss 0.7945 elapsed 166.8s


fold 2 epoch 2 step 600/1237 loss 0.7926 elapsed 175.4s


fold 2 epoch 2 step 700/1237 loss 0.8012 elapsed 183.9s


fold 2 epoch 2 step 800/1237 loss 0.7970 elapsed 192.5s


fold 2 epoch 2 step 900/1237 loss 0.7954 elapsed 201.1s


fold 2 epoch 2 step 1000/1237 loss 0.7907 elapsed 209.6s


fold 2 epoch 2 step 1100/1237 loss 0.7911 elapsed 218.2s


fold 2 epoch 2 step 1200/1237 loss 0.7868 elapsed 226.8s


fold 2 epoch 2 OOF Jaccard 0.71689


fold 2 epoch 3 step 100/1237 loss 0.7003 elapsed 255.9s


fold 2 epoch 3 step 200/1237 loss 0.7104 elapsed 264.5s


fold 2 epoch 3 step 300/1237 loss 0.7135 elapsed 273.1s


fold 2 epoch 3 step 400/1237 loss 0.7152 elapsed 281.7s


fold 2 epoch 3 step 500/1237 loss 0.7252 elapsed 290.3s


fold 2 epoch 3 step 600/1237 loss 0.7262 elapsed 298.9s


fold 2 epoch 3 step 700/1237 loss 0.7205 elapsed 307.5s


fold 2 epoch 3 step 800/1237 loss 0.7163 elapsed 316.1s


fold 2 epoch 3 step 900/1237 loss 0.7145 elapsed 324.7s


fold 2 epoch 3 step 1000/1237 loss 0.7108 elapsed 333.3s


fold 2 epoch 3 step 1100/1237 loss 0.7094 elapsed 341.9s


fold 2 epoch 3 step 1200/1237 loss 0.7071 elapsed 350.5s


fold 2 epoch 3 OOF Jaccard 0.71737


fold 2 best OOF Jaccard 0.71737


Fold 2 done in 398.1s, OOF 0.71737


Fold 3: train 19786 | val 4946


Some weights of DebertaV2ForQuestionAnswering were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


fold 3 epoch 1 step 100/1237 loss 4.3199 elapsed 8.6s


fold 3 epoch 1 step 200/1237 loss 3.3764 elapsed 17.2s


fold 3 epoch 1 step 300/1237 loss 2.7407 elapsed 25.7s


fold 3 epoch 1 step 400/1237 loss 2.3289 elapsed 34.3s


fold 3 epoch 1 step 500/1237 loss 2.0599 elapsed 42.9s


fold 3 epoch 1 step 600/1237 loss 1.8718 elapsed 51.5s


fold 3 epoch 1 step 700/1237 loss 1.7367 elapsed 60.1s


fold 3 epoch 1 step 800/1237 loss 1.6310 elapsed 68.7s


fold 3 epoch 1 step 900/1237 loss 1.5493 elapsed 77.3s


fold 3 epoch 1 step 1000/1237 loss 1.4787 elapsed 85.9s


fold 3 epoch 1 step 1100/1237 loss 1.4225 elapsed 94.5s


fold 3 epoch 1 step 1200/1237 loss 1.3739 elapsed 103.1s


fold 3 epoch 1 OOF Jaccard 0.69629


fold 3 epoch 2 step 100/1237 loss 0.7958 elapsed 132.3s


fold 3 epoch 2 step 200/1237 loss 0.8031 elapsed 140.8s


fold 3 epoch 2 step 300/1237 loss 0.7935 elapsed 149.4s


fold 3 epoch 2 step 400/1237 loss 0.7866 elapsed 158.0s


fold 3 epoch 2 step 500/1237 loss 0.7875 elapsed 166.6s


fold 3 epoch 2 step 600/1237 loss 0.7836 elapsed 175.1s


fold 3 epoch 2 step 700/1237 loss 0.7843 elapsed 183.7s


fold 3 epoch 2 step 800/1237 loss 0.7895 elapsed 192.3s


fold 3 epoch 2 step 900/1237 loss 0.7894 elapsed 200.8s


fold 3 epoch 2 step 1000/1237 loss 0.7881 elapsed 209.4s


fold 3 epoch 2 step 1100/1237 loss 0.7870 elapsed 218.0s


fold 3 epoch 2 step 1200/1237 loss 0.7849 elapsed 226.5s


fold 3 epoch 2 OOF Jaccard 0.70987


fold 3 epoch 3 step 100/1237 loss 0.7241 elapsed 255.8s


fold 3 epoch 3 step 200/1237 loss 0.7148 elapsed 264.4s


fold 3 epoch 3 step 300/1237 loss 0.7078 elapsed 273.0s


fold 3 epoch 3 step 400/1237 loss 0.7189 elapsed 281.6s


fold 3 epoch 3 step 500/1237 loss 0.7197 elapsed 290.2s


fold 3 epoch 3 step 600/1237 loss 0.7199 elapsed 298.8s


fold 3 epoch 3 step 700/1237 loss 0.7166 elapsed 307.4s


fold 3 epoch 3 step 800/1237 loss 0.7177 elapsed 316.0s


fold 3 epoch 3 step 900/1237 loss 0.7136 elapsed 324.6s


fold 3 epoch 3 step 1000/1237 loss 0.7140 elapsed 333.2s


fold 3 epoch 3 step 1100/1237 loss 0.7106 elapsed 341.8s


fold 3 epoch 3 step 1200/1237 loss 0.7090 elapsed 350.4s


fold 3 epoch 3 OOF Jaccard 0.70755


fold 3 best OOF Jaccard 0.70755


Fold 3 done in 398.1s, OOF 0.70755


Fold 4: train 19786 | val 4946


Some weights of DebertaV2ForQuestionAnswering were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


fold 4 epoch 1 step 100/1237 loss 4.7051 elapsed 8.6s


fold 4 epoch 1 step 200/1237 loss 3.6356 elapsed 17.2s


fold 4 epoch 1 step 300/1237 loss 2.9339 elapsed 25.9s


fold 4 epoch 1 step 400/1237 loss 2.4758 elapsed 34.5s


fold 4 epoch 1 step 500/1237 loss 2.1727 elapsed 43.1s


fold 4 epoch 1 step 600/1237 loss 1.9551 elapsed 51.7s


fold 4 epoch 1 step 700/1237 loss 1.8093 elapsed 60.4s


fold 4 epoch 1 step 800/1237 loss 1.6935 elapsed 69.0s


fold 4 epoch 1 step 900/1237 loss 1.5999 elapsed 77.7s


fold 4 epoch 1 step 1000/1237 loss 1.5275 elapsed 86.3s


fold 4 epoch 1 step 1100/1237 loss 1.4679 elapsed 94.9s


fold 4 epoch 1 step 1200/1237 loss 1.4168 elapsed 103.5s


fold 4 epoch 1 OOF Jaccard 0.70402


fold 4 epoch 2 step 100/1237 loss 0.8123 elapsed 132.8s


fold 4 epoch 2 step 200/1237 loss 0.8173 elapsed 141.4s


fold 4 epoch 2 step 300/1237 loss 0.8020 elapsed 150.0s


fold 4 epoch 2 step 400/1237 loss 0.7964 elapsed 158.6s


fold 4 epoch 2 step 500/1237 loss 0.8024 elapsed 167.2s


fold 4 epoch 2 step 600/1237 loss 0.7952 elapsed 175.8s


fold 4 epoch 2 step 700/1237 loss 0.7900 elapsed 184.4s


fold 4 epoch 2 step 800/1237 loss 0.7889 elapsed 193.1s


fold 4 epoch 2 step 900/1237 loss 0.7885 elapsed 201.7s


fold 4 epoch 2 step 1000/1237 loss 0.7893 elapsed 210.3s


fold 4 epoch 2 step 1100/1237 loss 0.7896 elapsed 219.0s


fold 4 epoch 2 step 1200/1237 loss 0.7872 elapsed 227.6s


fold 4 epoch 2 OOF Jaccard 0.71489


fold 4 epoch 3 step 100/1237 loss 0.7368 elapsed 256.9s


fold 4 epoch 3 step 200/1237 loss 0.7310 elapsed 265.6s


fold 4 epoch 3 step 300/1237 loss 0.7236 elapsed 274.2s


fold 4 epoch 3 step 400/1237 loss 0.7237 elapsed 282.8s


fold 4 epoch 3 step 500/1237 loss 0.7131 elapsed 291.4s


fold 4 epoch 3 step 600/1237 loss 0.7102 elapsed 300.0s


fold 4 epoch 3 step 700/1237 loss 0.7078 elapsed 308.6s


fold 4 epoch 3 step 800/1237 loss 0.7079 elapsed 317.2s


fold 4 epoch 3 step 900/1237 loss 0.7064 elapsed 325.8s


fold 4 epoch 3 step 1000/1237 loss 0.7073 elapsed 334.4s


fold 4 epoch 3 step 1100/1237 loss 0.7113 elapsed 343.0s


fold 4 epoch 3 step 1200/1237 loss 0.7102 elapsed 351.6s


fold 4 epoch 3 OOF Jaccard 0.71474


fold 4 best OOF Jaccard 0.71474


Fold 4 done in 399.2s, OOF 0.71474


OOF mean: 0.7124612654099624


Saved submission.csv with 2749 rows
       textID                                      selected_text
0  80a1e6bc32                                               wish
1  863097735d                                  gosh today sucks!
2  264cd5277f  tired and didn`t really have an exciting Satur...
3  baee1e6ffc             i`ve been eating cheetos all morning..
4  67d06a8dee   haiiii sankQ i`m fineee ima js get a checkup ...


In [10]:
# Second-seed DeBERTa-v3-base run (SEED=43) with identical folds; saves logits under new prefix for later blending
import math, time, json, numpy as np, pandas as pd, torch, os
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, get_linear_schedule_with_warmup
from torch.optim import AdamW

os.environ['TOKENIZERS_PARALLELISM'] = 'false'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BASE_PREFIX = 'deberta_v3_base_m128_prompt'
SEED2 = 43
PREFIX2 = BASE_PREFIX + '_s43'
CACHE_TRAIN = Path(f'cache/train_{BASE_PREFIX}.npz')
CACHE_TEST = Path(f'cache/test_{BASE_PREFIX}.npz')
FOLDS_CSV = Path(f'cache/train_folds_{BASE_PREFIX}.csv')  # reuse identical folds for fair CV
MODEL_NAME = 'microsoft/deberta-v3-base'
MAX_LEN = 128; EPOCHS = 3; LR = 3e-5; WARMUP = 0.1; WEIGHT_DECAY = 0.01
BATCH_SIZE = 16; GRAD_ACCUM = 4; CLIP_NORM = 1.0; TOP_K = 10; SPAN_CAP = 30

torch.manual_seed(SEED2); np.random.seed(SEED2)
tok2 = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
train_npz2 = np.load(CACHE_TRAIN, allow_pickle=True); test_npz2 = np.load(CACHE_TEST, allow_pickle=True)
folds_df2 = pd.read_csv(FOLDS_CSV)

class QADatasetCached2(Dataset):
    def __init__(self, idxs):
        self.ids = train_npz2['input_ids'][idxs]
        self.attn = train_npz2['attention_mask'][idxs]
        self.start = train_npz2['start_positions'][idxs]
        self.end = train_npz2['end_positions'][idxs]
    def __len__(self): return len(self.ids)
    def __getitem__(self, i):
        return {
            'input_ids': torch.tensor(self.ids[i], dtype=torch.long),
            'attention_mask': torch.tensor(self.attn[i], dtype=torch.long),
            'start_positions': torch.tensor(self.start[i], dtype=torch.long),
            'end_positions': torch.tensor(self.end[i], dtype=torch.long),
        }

def decode_row2(sentiment, text, start_logits, end_logits):
    if sentiment == 'neutral':
        return text
    sent_prompt = f'sentiment: {sentiment}'
    enc = tok2(text=sent_prompt, text_pair=str(text), add_special_tokens=True, truncation=True, max_length=MAX_LEN,
               padding='max_length', return_offsets_mapping=True, return_tensors='pt')
    offsets = enc['offset_mapping'][0].tolist()
    seq_ids = enc.sequence_ids(0)
    tweet_mask = np.array([1 if sid==1 else 0 for sid in seq_ids], dtype=np.int8)
    s = start_logits.copy(); e = end_logits.copy(); neg_inf = -1e9
    if tweet_mask.shape[0] != s.shape[0]:
        return text
    s[tweet_mask==0] = neg_inf; e[tweet_mask==0] = neg_inf
    k = min(TOP_K, int(tweet_mask.sum()))
    start_cand = np.argsort(s)[-k:]; end_cand = np.argsort(e)[-k:]
    best = None; best_score = -1e18
    for si in start_cand:
        for ei in end_cand:
            if ei < si: continue
            if (ei - si + 1) > SPAN_CAP: continue
            sc = s[si] + e[ei]
            if sc > best_score: best_score = sc; best = (si, ei)
    if best is None:
        si = int(np.argmax(s)); ei = si
    else:
        si, ei = best
    def valid_left(i):
        while i >= 0 and (seq_ids[i] != 1 or offsets[i][0] is None or offsets[i][1] is None): i -= 1
        return i
    def valid_right(i):
        n = len(offsets)
        while i < n and (seq_ids[i] != 1 or offsets[i][0] is None or offsets[i][1] is None): i += 1
        return i
    si = valid_left(si); ei = valid_right(ei)
    if si < 0 or ei >= len(offsets) or si > ei:
        si = int(np.argmax(s)); si = valid_left(si)
        if si < 0: return text
        cs, ce = offsets[si][0], offsets[si][1]
        sub = text[cs:ce].strip()
        return sub if sub else text
    cs = offsets[si][0]; ce = offsets[ei][1]
    if cs is None or ce is None: return text
    sub = text[cs:ce].strip()
    return sub if sub else text

def jaccard_str2(a, b):
    # Word-level Jaccard; lowercase both sides, as the competition metric does
    sa = set(str(a).lower().split()); sb = set(str(b).lower().split())
    return (len(sa & sb)) / (len(sa | sb) + 1e-12)

def train_fold_seed2(fold):
    all_folds = folds_df2['fold'].values
    tr_idx = np.where(all_folds != fold)[0]
    va_idx = np.where(all_folds == fold)[0]
    print(f'[s43] Fold {fold}: train {len(tr_idx)} | val {len(va_idx)}', flush=True)
    train_ds = QADatasetCached2(tr_idx)
    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
    model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)
    opt = AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
    steps_per_epoch = len(train_loader)
    opt_steps_per_epoch = math.ceil(steps_per_epoch / GRAD_ACCUM)
    total_opt_steps = opt_steps_per_epoch * EPOCHS
    warmup_steps = int(WARMUP * total_opt_steps)
    sch = get_linear_schedule_with_warmup(opt, warmup_steps, total_opt_steps)
    scaler = torch.amp.GradScaler('cuda', enabled=True)  # torch.cuda.amp.GradScaler(...) is deprecated
    best_score = -1.0; best_state = None
    t0 = time.time()
    model.train()
    for epoch in range(EPOCHS):
        tr_loss = 0.0
        opt.zero_grad(set_to_none=True)
        for step, batch in enumerate(train_loader):
            batch = {k: v.to(device) for k, v in batch.items()}
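            with torch.amp.autocast('cuda', enabled=True):
                out = model(**batch)
                loss = out.loss / GRAD_ACCUM
            scaler.scale(loss).backward()
            # Also step on the last micro-batch so a trailing partial group is not discarded
            if (step + 1) % GRAD_ACCUM == 0 or (step + 1) == steps_per_epoch: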
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
                scaler.step(opt); scaler.update(); sch.step()
                opt.zero_grad(set_to_none=True)
            tr_loss += loss.item() * GRAD_ACCUM
            if (step + 1) % 100 == 0:
                print(f'[s43] fold {fold} epoch {epoch+1} step {step+1}/{steps_per_epoch} loss {tr_loss/(step+1):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        # Eval
        model.eval()
        va_ids = train_npz2['input_ids'][va_idx]
        va_attn = train_npz2['attention_mask'][va_idx]
        start_logits = np.zeros((len(va_idx), MAX_LEN), dtype=np.float32)
        end_logits = np.zeros((len(va_idx), MAX_LEN), dtype=np.float32)
        bs = 64
        with torch.no_grad():
            for i in range(0, len(va_idx), bs):
                x_ids = torch.tensor(va_ids[i:i+bs], dtype=torch.long, device=device)
                x_attn = torch.tensor(va_attn[i:i+bs], dtype=torch.long, device=device)
                out = model(input_ids=x_ids, attention_mask=x_attn)
                start_logits[i:i+bs] = out.start_logits.detach().cpu().numpy()
                end_logits[i:i+bs] = out.end_logits.detach().cpu().numpy()
        sentiments = train_npz2['sentiment'][va_idx]
        texts = train_npz2['text'][va_idx]
        gold = train_npz2['selected_text'][va_idx]
        preds = [decode_row2(str(sentiments[i]), str(texts[i]), start_logits[i], end_logits[i]) for i in range(len(va_idx))]
        score = float(np.mean([jaccard_str2(preds[i], gold[i]) for i in range(len(preds))]))
        print(f'[s43] fold {fold} epoch {epoch+1} OOF Jaccard {score:.5f}', flush=True)
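        if score > best_score + 1e-4:
            best_score = score
            # state_dict() returns live references; clone to CPU so later epochs do not overwrite the snapshot
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}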
        model.train()
    if best_state is not None: model.load_state_dict(best_state)
    # Save final OOF/Test logits
    model.eval()
    va_ids = train_npz2['input_ids'][va_idx]; va_attn = train_npz2['attention_mask'][va_idx]
    oof_start = np.zeros((len(va_idx), MAX_LEN), dtype=np.float32); oof_end = np.zeros_like(oof_start)
    bs = 64
    with torch.no_grad():
        for i in range(0, len(va_idx), bs):
            x_ids = torch.tensor(va_ids[i:i+bs], dtype=torch.long, device=device)
            x_attn = torch.tensor(va_attn[i:i+bs], dtype=torch.long, device=device)
            out = model(input_ids=x_ids, attention_mask=x_attn)
            oof_start[i:i+bs] = out.start_logits.detach().cpu().numpy()
            oof_end[i:i+bs] = out.end_logits.detach().cpu().numpy()
    te_ids = test_npz2['input_ids']; te_attn = test_npz2['attention_mask']
    te_start = np.zeros((len(te_ids), MAX_LEN), dtype=np.float32); te_end = np.zeros_like(te_start)
    with torch.no_grad():
        for i in range(0, len(te_ids), bs):
            x_ids = torch.tensor(te_ids[i:i+bs], dtype=torch.long, device=device)
            x_attn = torch.tensor(te_attn[i:i+bs], dtype=torch.long, device=device)
            out = model(input_ids=x_ids, attention_mask=x_attn)
            te_start[i:i+bs] = out.start_logits.detach().cpu().numpy()
            te_end[i:i+bs] = out.end_logits.detach().cpu().numpy()
    fold_dir = Path(f'cache/oof_{PREFIX2}'); fold_dir.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(fold_dir / f'fold{fold}_oof_logits.npz', idx=va_idx, start=oof_start, end=oof_end)
    np.savez_compressed(fold_dir / f'fold{fold}_test_logits.npz', start=te_start, end=te_end)
    # report
    sentiments = train_npz2['sentiment'][va_idx]; texts = train_npz2['text'][va_idx]; gold = train_npz2['selected_text'][va_idx]
    preds = [decode_row2(str(sentiments[i]), str(texts[i]), oof_start[i], oof_end[i]) for i in range(len(va_idx))]
    final_oof = float(np.mean([jaccard_str2(preds[i], gold[i]) for i in range(len(preds))]))
    print(f'[s43] fold {fold} best OOF Jaccard {final_oof:.5f}', flush=True)
    return final_oof

def run_seed43_all_folds():
    scores = []
    for f in range(5):
        t0 = time.time()
        sc = train_fold_seed2(f)
        scores.append(sc)
        print(f'[s43] Fold {f} done in {time.time()-t0:.1f}s, OOF {sc:.5f}', flush=True)
    print('[s43] OOF mean:', float(np.mean(scores)))

# Note: this cell only defines the second-seed pipeline. Do NOT call run_seed43_all_folds() while the seed=42 run (Cell 6) is still training.
# After both seeds finish, blend by averaging the test logits from both cache/oof_* directories, then decode once with the same decoder.
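
The note above describes blending by averaging logits across runs and decoding once. A minimal pure-Python sketch of that idea on toy values follows. `blend_logits`, `best_span`, and the logit numbers are hypothetical illustrations, not the notebook's decoder, which additionally masks non-tweet tokens and restricts the search to top-k candidates.

```python
def blend_logits(runs):
    """Element-wise mean over a list of equal-length logit vectors."""
    n = len(runs)
    return [sum(vals) / n for vals in zip(*runs)]

def best_span(start, end, max_len=30):
    """Return (si, ei) maximising start[si] + end[ei] with si <= ei < si + max_len."""
    best, best_score = None, float("-inf")
    for si, s in enumerate(start):
        for ei in range(si, min(len(end), si + max_len)):
            score = s + end[ei]
            if score > best_score:
                best, best_score = (si, ei), score
    return best

# Toy logits for one 5-token tweet from two hypothetical seeds.
seed42_start = [0.1, 2.0, 0.3, 0.2, 0.0]
seed43_start = [0.2, 0.5, 1.9, 0.1, 0.0]
seed42_end = [0.0, 0.4, 0.2, 2.1, 0.3]
seed43_end = [0.1, 0.2, 0.3, 1.8, 0.5]

# Average first, then decode once on the blended logits.
start = blend_logits([seed42_start, seed43_start])
end = blend_logits([seed42_end, seed43_end])
span = best_span(start, end)
print(span)  # (1, 3)
```

In this toy case the two seeds disagree on the start token (index 1 versus index 2) when decoded separately; averaging the logits first resolves the disagreement by combined confidence rather than by picking one model's answer.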



In [None]:
# Execute second-seed training, then blend both seeds' logits and decode once
import numpy as np, pandas as pd, torch, os, math, time
from pathlib import Path
from transformers import AutoTokenizer

# 1) Train seed=43 across all folds
run_seed43_all_folds()

# 2) Blend test logits across seeds and decode once
MODEL_NAME = 'microsoft/deberta-v3-base'
MAX_LEN = 128
TOP_K = 10
SPAN_CAP = 30
tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

base = 'deberta_v3_base_m128_prompt'
dir1 = Path(f'cache/oof_{base}')
dir2 = Path(f'cache/oof_{base}_s43')
te_files1 = [np.load(dir1 / f'fold{f}_test_logits.npz') for f in range(5)]
te_files2 = [np.load(dir2 / f'fold{f}_test_logits.npz') for f in range(5)]
te_start = np.mean([f['start'] for f in te_files1 + te_files2], axis=0)
te_end = np.mean([f['end'] for f in te_files1 + te_files2], axis=0)
test_npz = np.load(f'cache/test_{base}.npz', allow_pickle=True)
test_sent = test_npz['sentiment']
test_text = test_npz['text']

def decode_row(sentiment, text, start_logits, end_logits):
    if sentiment == 'neutral':
        return text
    sent_prompt = f'sentiment: {sentiment}'
    enc = tok(text=sent_prompt, text_pair=str(text), add_special_tokens=True, truncation=True, max_length=MAX_LEN,
              padding='max_length', return_offsets_mapping=True, return_tensors='pt')
    offsets = enc['offset_mapping'][0].tolist()
    seq_ids = enc.sequence_ids(0)
    tweet_mask = np.array([1 if sid==1 else 0 for sid in seq_ids], dtype=np.int8)
    if tweet_mask.shape[0] != start_logits.shape[0]:
        return text
    neg_inf = -1e9
    s = start_logits.copy(); e = end_logits.copy()
    s[tweet_mask==0] = neg_inf; e[tweet_mask==0] = neg_inf
    k = min(TOP_K, int(tweet_mask.sum()))
    start_cand = np.argsort(s)[-k:]
    end_cand = np.argsort(e)[-k:]
    best = None; best_score = -1e18
    for si in start_cand:
        for ei in end_cand:
            if ei < si: continue
            if (ei - si + 1) > SPAN_CAP: continue
            sc = s[si] + e[ei]
            if sc > best_score: best_score = sc; best = (si, ei)
    if best is None:
        si = int(np.argmax(s)); ei = si
    else:
        si, ei = best
    def valid_left(i):
        while i >= 0 and (seq_ids[i] != 1 or offsets[i][0] is None or offsets[i][1] is None): i -= 1
        return i
    def valid_right(i):
        n = len(offsets)
        while i < n and (seq_ids[i] != 1 or offsets[i][0] is None or offsets[i][1] is None): i += 1
        return i
    si = valid_left(si); ei = valid_right(ei)
    if si < 0 or ei >= len(offsets) or si > ei:
        si = int(np.argmax(s)); si = valid_left(si)
        if si < 0: return text
        cs, ce = offsets[si][0], offsets[si][1]
        sub = text[cs:ce].strip()
        return sub if sub else text
    cs = offsets[si][0]; ce = offsets[ei][1]
    if cs is None or ce is None: return text
    sub = text[cs:ce].strip()
    return sub if sub else text

preds = []
for i in range(len(test_text)):
    preds.append(decode_row(str(test_sent[i]), str(test_text[i]), te_start[i], te_end[i]))

sub = pd.DataFrame({'textID': test_npz['textID'], 'selected_text': preds})
sub.to_csv('submission.csv', index=False)
print('Blended submission.csv saved with', len(sub), 'rows')

[s43] Fold 0: train 19785 | val 4947


Some weights of DebertaV2ForQuestionAnswering were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


