This notebook was created to explore the Jigsaw competition (the Jigsaw Agile Community Rules dataset) using Qwen2.5-0.5B.

I am also using LoRA, a parameter-efficient fine-tuning technique, both to understand it and, of course, to make the model consume fewer compute resources.

I intend to understand the basics myself and share what I learn with fellow Kagglers.

Please feel free to give feedback, add to the discussion, and correct me where needed (I hope not :-))

If you like the work, please upvote :-)




In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/qwen2.5/transformers/0.5b/1/config.json
/kaggle/input/qwen2.5/transformers/0.5b/1/merges.txt
/kaggle/input/qwen2.5/transformers/0.5b/1/LICENSE
/kaggle/input/qwen2.5/transformers/0.5b/1/README.md
/kaggle/input/qwen2.5/transformers/0.5b/1/tokenizer.json
/kaggle/input/qwen2.5/transformers/0.5b/1/vocab.json
/kaggle/input/qwen2.5/transformers/0.5b/1/tokenizer_config.json
/kaggle/input/qwen2.5/transformers/0.5b/1/model.safetensors
/kaggle/input/qwen2.5/transformers/0.5b/1/.gitattributes
/kaggle/input/qwen2.5/transformers/0.5b/1/generation_config.json
/kaggle/input/jigsaw-agile-community-rules/sample_submission.csv
/kaggle/input/jigsaw-agile-community-rules/train.csv
/kaggle/input/jigsaw-agile-community-rules/test.csv


### Setting Environment Variables

<ul>
    <li>TOKENIZERS_PARALLELISM controls whether the Hugging Face tokenizers library uses its own internal multi-threading to speed up tokenization.</li>
    <li>The reason to disable it is to prevent a conflict with PyTorch's multi-process data loading, which can lead to deadlocks or severely degraded performance.</li>
</ul>
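
A minimal sketch of how the variable could be set. It assumes the assignment runs at the very top of the notebook, before `transformers` or `tokenizers` is imported, and that the string `"false"` is used to turn the internal threading off:

```python
import os

# Set this before importing transformers/tokenizers so the setting is
# already in place when the tokenizer is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting it in Python rather than in the shell keeps the notebook self-contained when it is re-run with "Save & Run All".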

In [2]:
# ===== Qwen2.5-0.5B cross-encoder, head-only, 5-fold (error-proof) =====
!pip -q install "transformers==4.44.2" "accelerate==0.33.0"

import os, gc, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)

# --------------- Config ---------------
DATA_DIR  = "/kaggle/input/jigsaw-agile-community-rules"
TRAIN_CSV = f"{DATA_DIR}/train.csv"

MODEL_ID  = "Qwen/Qwen2.5-0.5B"
OUT_ROOT  = "/kaggle/working/qwen25_ce_headonly"
os.makedirs(OUT_ROOT, exist_ok=True)

SEED=42; N_FOLDS=5
MAX_LEN=256
EPOCHS=2
TRAIN_BS=4
GRAD_ACC=4
LR=2e-4
WARMUP=0.05
LABEL_SMOOTH = 0.0  # set 0.05 if you want mild smoothing

set_seed(SEED)
device = "cuda" if torch.cuda.is_available() else "cpu"

# --------------- Build paired inputs ---------------
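def _clip(x):
    # Despite the name, this does not truncate: values that are missing, empty,
    # or 300+ characters long are dropped (returned as "").
    x = "" if pd.isna(x) else str(x).strip()
    return x if 0 < len(x) < 300 else ""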

def build_text_a(row):
    sub  = str(row["subreddit"])
    rule = str(row["rule"])
    pos1 = _clip(row.get("positive_example_1",""))
    neg1 = _clip(row.get("negative_example_1",""))
    parts = [f"r/{sub}", f"Rule: {rule}"]
    if pos1: parts.append(f"Yes: {pos1}")
    if neg1: parts.append(f"No: {neg1}")
    return " | ".join(parts)

def prepare_df(df):
    df = df.copy()
    df["text_a"] = df.apply(build_text_a, axis=1)
    df["text_b"] = df["body"].astype(str)
    return df

df = pd.read_csv(TRAIN_CSV)
df = prepare_df(df)
y  = df["rule_violation"].astype(int).values

# --------------- Tokenizer ---------------
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True, trust_remote_code=True)
if tok.pad_token is None:
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token":"</s>"})
    tok.pad_token = tok.eos_token
tok.padding_side = "right"

collator = DataCollatorWithPadding(tokenizer=tok)

# --------------- Dataset ---------------
class PairDataset(Dataset):
    def __init__(self, df, labels=None, max_len=256):
        self.a = df["text_a"].tolist()
        self.b = df["text_b"].tolist()
        self.labels = labels
        self.max_len = max_len
    def __len__(self): return len(self.a)
    def __getitem__(self, i):
        item = tok(self.a[i], self.b[i], truncation=True, max_length=self.max_len)
        if self.labels is not None:
            # labels must be Long for CE
            item["labels"] = torch.tensor(int(self.labels[i]), dtype=torch.long)
        return item

# --------------- Model (head-only) ---------------
def build_headonly_model():
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        num_labels=2,
        trust_remote_code=True,
        torch_dtype=torch.float32,  # keep everything fp32 to avoid dtype mismatches
    )
    model.config.use_cache = False
    model.config.pad_token_id = tok.pad_token_id
    model.config.problem_type = "single_label_classification"

    # keep embeddings in sync if special tokens were added
    if model.get_input_embeddings().num_embeddings != len(tok):
        model.resize_token_embeddings(len(tok))

    # fresh classifier head
    hidden = model.config.hidden_size
    model.score = nn.Linear(hidden, 2)
    nn.init.xavier_uniform_(model.score.weight)
    nn.init.zeros_(model.score.bias)

    # freeze everything except the classifier head
    for n, p in model.named_parameters():
        p.requires_grad = ("score" in n)

    return model.to(device)

# --------------- Weighted CE & safe FP32 loss ---------------
pos_frac = float(y.mean())
neg_frac = 1.0 - pos_frac
# class weights, shape [2], created on CPU and moved to the logits' device inside compute_loss;
# weight(neg)=1, weight(pos)=neg/pos
cw = torch.tensor([1.0, (neg_frac / max(pos_frac, 1e-6))], dtype=torch.float32)

class FP32Trainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None, **kwargs):
        labels = inputs.get("labels")
        inputs = {k: v for k, v in inputs.items() if k != "labels"}
        outputs = model(**inputs)
        logits = outputs.logits.float()  # always compute loss in FP32

        # move class weights to correct device only here (model may be on GPU/CPU)
        loss_fct = nn.CrossEntropyLoss(
            weight=cw.to(logits.device),
            label_smoothing=LABEL_SMOOTH
        )
        loss = loss_fct(logits.view(-1, 2), labels.view(-1))

        return (loss, outputs) if return_outputs else loss

# --------------- CV Train ---------------
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros(len(df), dtype=float)

for fold, (trn_idx, val_idx) in enumerate(skf.split(df, y), 1):
    print(f"\n===== Fold {fold}/{N_FOLDS} =====")
    dtr = PairDataset(df.iloc[trn_idx], y[trn_idx], MAX_LEN)
    dvl = PairDataset(df.iloc[val_idx], y[val_idx], MAX_LEN)

    model = build_headonly_model()

    args = TrainingArguments(
        output_dir=f"{OUT_ROOT}/fold{fold}",
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=TRAIN_BS,
        gradient_accumulation_steps=GRAD_ACC,
        learning_rate=LR,
        warmup_ratio=WARMUP,
        weight_decay=0.01,
        logging_steps=50,
        save_strategy="no",          # keep training simple (no checkpointing)
        report_to="none",
        remove_unused_columns=False,
        fp16=False, bf16=False,      # keep pure FP32 to avoid dtype issues
        dataloader_pin_memory=False,
        seed=SEED,
    )

    trainer = FP32Trainer(
        model=model,
        args=args,
        train_dataset=dtr,
        tokenizer=tok,
        data_collator=collator,
    )

    # quick sanity forward to ensure tensors/devices OK (no grads)
    tmp_loader = DataLoader(dtr, batch_size=2, shuffle=False, collate_fn=collator)
    batch = next(iter(tmp_loader))
    for k in batch: batch[k] = batch[k].to(device)
    with torch.no_grad():
        _ = model(**{k:v for k,v in batch.items() if k!="labels"})
    del tmp_loader, batch; gc.collect()

    trainer.train()

    # Save fold weights & tokenizer
    save_dir = f"{OUT_ROOT}/fold{fold}"
    os.makedirs(save_dir, exist_ok=True)
    trainer.model.save_pretrained(save_dir)
    tok.save_pretrained(save_dir)
    print(f"Saved fold {fold} to {save_dir}")

    # ---------- OOF AUC ----------
    model.eval()
    dl = DataLoader(dvl, batch_size=128, shuffle=False, collate_fn=collator)
    preds = []
    with torch.no_grad():
        for batch in dl:
            for k in batch: batch[k] = batch[k].to(device)
            logits = model(**{k:v for k,v in batch.items() if k!="labels"}).logits.float()
            logits = torch.nan_to_num(logits, nan=0.0, posinf=1e4, neginf=-1e4)
            prob = torch.softmax(logits, dim=1)[:, 1]
            prob = torch.nan_to_num(prob, nan=0.5)
            preds.append(prob.detach().cpu().numpy())

    prob1 = np.concatenate(preds)
    prob1 = np.clip(prob1, 0.0, 1.0)
    if np.isnan(prob1).any():
        prob1 = np.nan_to_num(prob1, nan=0.5)

    oof[val_idx] = prob1
    print(f"Fold {fold} AUC: {roc_auc_score(y[val_idx], prob1):.4f}")

    del trainer, model, dtr, dvl, dl, preds, prob1
    gc.collect()
    if torch.cuda.is_available(): torch.cuda.empty_cache()

print(f"\nOOF AUC: {roc_auc_score(y, oof):.4f}")
pd.DataFrame({"oof": oof, "y": y}).to_csv(f"{OUT_ROOT}/oof.csv", index=False)
print("Saved folds under:", OUT_ROOT)



===== Fold 1/5 =====


Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,3.5592
100,2.1131


Saved fold 1 to /kaggle/working/qwen25_ce_headonly/fold1
Fold 1 AUC: 0.5776

===== Fold 2/5 =====


Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,3.7362
100,2.3614


Saved fold 2 to /kaggle/working/qwen25_ce_headonly/fold2
Fold 2 AUC: 0.5333

===== Fold 3/5 =====


Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,3.7003
100,2.3576


Saved fold 3 to /kaggle/working/qwen25_ce_headonly/fold3
Fold 3 AUC: 0.5531

===== Fold 4/5 =====


Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,3.6156
100,2.3428


Saved fold 4 to /kaggle/working/qwen25_ce_headonly/fold4
Fold 4 AUC: 0.5257

===== Fold 5/5 =====


Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,3.5813
100,2.3327


Saved fold 5 to /kaggle/working/qwen25_ce_headonly/fold5
Fold 5 AUC: 0.5518

OOF AUC: 0.5481
Saved folds under: /kaggle/working/qwen25_ce_headonly
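
The weighted cross-entropy in the cell above uses weight 1 for the negative class and neg/pos for the positive class. Below is a minimal sketch in plain Python of what that ratio does. The label list `toy_labels` is made up for illustration and is not taken from `train.csv`. With these weights, each class contributes the same total weight to the loss:

```python
import math

# Hypothetical toy labels for illustration only (2 positives, 8 negatives).
toy_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

pos_frac = sum(toy_labels) / len(toy_labels)
neg_frac = 1.0 - pos_frac

# Same formula as in the notebook: weight(neg)=1, weight(pos)=neg/pos.
class_weights = [1.0, neg_frac / max(pos_frac, 1e-6)]

# Total weight each class contributes across the whole label list.
total_neg = sum(class_weights[0] for y in toy_labels if y == 0)
total_pos = sum(class_weights[1] for y in toy_labels if y == 1)

print(class_weights, total_neg, total_pos)
assert math.isclose(total_neg, total_pos)
```

When the classes are already close to balanced, the positive weight is close to 1 and the weighting has little effect.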
