# NLI base results: BERTurk base (dbmdz/bert-base-turkish-cased)

Loads the three configs of [yilmazzey/sdp2-nli](https://huggingface.co/datasets/yilmazzey/sdp2-nli) (snli_tr_1_1, multinli_tr_1_1, trglue_mnli) and evaluates this model on held-out splits only, with no fine-tuning.

**No prompts:** BERT NLI is sequence-pair classification (premise [SEP] hypothesis → label).
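
To make the pair layout concrete, here is a minimal sketch of how a BERT-style input is assembled from a premise and a hypothesis. The helper `build_pair_input` and the two example sentences are hypothetical, and whitespace splitting stands in for the real WordPiece tokenizer, so this only illustrates the layout and is not what `AutoTokenizer` does internally.

```python
def build_pair_input(premise_tokens, hypothesis_tokens):
    """Lay out a sentence pair as [CLS] premise [SEP] hypothesis [SEP]."""
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the premise and the first [SEP];
    # segment 1 covers the hypothesis and the final [SEP].
    token_type_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    return tokens, token_type_ids


# Made-up example pair; whitespace splitting replaces real subword tokenization.
tokens, token_type_ids = build_pair_input("Bir adam koşuyor".split(), "Bir adam uyuyor".split())
print(tokens)
print(token_type_ids)
```

The classifier reads a single pooled representation of this joint sequence and maps it to one of the three labels, which is why no prompt template is involved.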

**Splits:** Test only where available: snli → `test`; multinli → `validation_matched`/`validation_mismatched` (no test); trglue → `test_matched`/`test_mismatched`.

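**Metrics:** Accuracy, macro F1, per-class F1, and a confusion matrix (saved as CSV and as a plot). The base BERTurk checkpoint has no NLI classifier, so the 3-way head is randomly initialized and accuracy should sit near chance (~33%).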
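
As a reminder of what the F1 numbers mean, the sketch below computes per-class and macro F1 from scratch on a tiny made-up label list. The function names are hypothetical; the notebook itself relies on scikit-learn's `f1_score`, and this version only mirrors the standard definition (with 0.0 for a class that has no true or predicted examples).

```python
def per_class_f1(y_true, y_pred, num_labels=3):
    """F1 for each class: 2*TP / (2*TP + FP + FN), or 0.0 when undefined."""
    scores = []
    for c in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(y_true, y_pred, num_labels=3):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, num_labels)
    return sum(scores) / len(scores)


# Toy labels for illustration only (0=entailment, 1=neutral, 2=contradiction).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Because macro F1 weights every class equally, a model that under-predicts one class can score a lower macro F1 than its accuracy suggests.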

In [9]:
REPO_ID = "yilmazzey/sdp2-nli"
CONFIGS = ["snli_tr_1_1", "multinli_tr_1_1", "trglue_mnli"]
MODEL_ID = "dbmdz/bert-base-turkish-cased"
NUM_LABELS = 3  # entailment, neutral, contradiction
RESULTS_DIR = "results"
BATCH_SIZE = 32
EVAL_SPLITS = {
    "snli_tr_1_1": ["test"],
    "multinli_tr_1_1": ["validation_matched", "validation_mismatched"],
    "trglue_mnli": ["test_matched", "test_mismatched"],
}

In [10]:
import json
import random
from collections import Counter
from pathlib import Path

import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding

try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    HAS_PLOT = True
except ImportError:
    HAS_PLOT = False

LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Reproducibility: fixed seed for random, numpy, torch (and cuda if available)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

In [11]:
# Load all three dataset configs
datasets = {}
for cfg in CONFIGS:
    print(f"Loading {REPO_ID} :: {cfg} ...")
    datasets[cfg] = load_dataset(REPO_ID, cfg)
    print("  splits:", list(datasets[cfg].keys()))

Loading yilmazzey/sdp2-nli :: snli_tr_1_1 ...
  splits: ['train', 'validation', 'test']
Loading yilmazzey/sdp2-nli :: multinli_tr_1_1 ...
  splits: ['train', 'validation_matched', 'validation_mismatched']
Loading yilmazzey/sdp2-nli :: trglue_mnli ...
  splits: ['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched']


In [12]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
print(f"Using device: {device}")

Loading weights: 100%|██████████| 199/199 [00:00<00:00, 1828.46it/s, Materializing param=bert.pooler.dense.weight]
BertForSequenceClassification LOAD REPORT from: dbmdz/bert-base-turkish-cased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identi

Using device: cpu


In [13]:
def tokenize_fn(examples):
    # No padding here: DataCollatorWithPadding pads each batch to its longest sequence in the DataLoader
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        truncation=True,
        max_length=256,
    )


def run_inference(ds):
    """Tokenize one split, run the model over it in batches, and return (y_true, y_pred)."""
    remove_cols = [c for c in ds.column_names if c != "label"]
    ds = ds.map(
        tokenize_fn,
        batched=True,
        remove_columns=remove_cols,
        desc="Tokenize",
    )
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    def collate_fn(examples):
        labels = torch.tensor([ex["label"] for ex in examples])
        batch = data_collator([{k: v for k, v in ex.items() if k != "label"} for ex in examples])
        batch["label"] = labels
        return batch

    # Uses BATCH_SIZE (32); lower it to 16 or 8 if this is too slow on CPU
    loader = torch.utils.data.DataLoader(ds, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    preds_list, labels_list = [], []
    with torch.no_grad():
        for batch in tqdm(loader, desc="Inference"):
            if len(preds_list) == 0:
                print("batch['input_ids'].shape:", batch["input_ids"].shape)
            out = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
            )
            preds_list.append(out.logits.argmax(-1).cpu().numpy())
            labels_list.append(batch["label"].numpy())
    y_pred = np.concatenate(preds_list)
    y_true = np.concatenate(labels_list)
    return y_true, y_pred
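
The dynamic padding used in `run_inference` explains why the first batch of each split has a different width (for example 40 tokens versus 116). The sketch below is a simplified pure-Python illustration of the idea, not the `DataCollatorWithPadding` implementation; `pad_batch`, `iter_batches`, and `pad_id=0` are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


def iter_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up token ids of varying length, batched two at a time.
examples = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 7, 8, 102], [101, 102]]
for batch in iter_batches(examples, batch_size=2):
    input_ids, attention_mask = pad_batch(batch)
    print(len(input_ids), "x", len(input_ids[0]))
```

Padding per batch rather than to a global `max_length` keeps most batches short, which matters when inference runs on CPU.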

In [14]:
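def compute_metrics(y_true, y_pred):
    # Pin the label set so every class gets a score and the matrix is always
    # NUM_LABELS x NUM_LABELS, even if a class is absent from a split.
    labels = list(range(NUM_LABELS))
    acc = float(accuracy_score(y_true, y_pred))
    f1_macro = float(f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0))
    f1_per_class = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
    f1_per_class = {LABEL_NAMES[i]: float(f1_per_class[i]) for i in labels}
    # Rows are true labels, columns are predicted labels.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    out = {"accuracy": acc, "f1_macro": f1_macro, "f1_per_class": f1_per_class}
    return out, cm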


def save_confusion_plot(cm, path):
    if not HAS_PLOT:
        return
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt="d", xticklabels=LABEL_NAMES, yticklabels=LABEL_NAMES, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
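
A confusion matrix stored as comma-separated integer rows can be read back and turned into per-class recall with the standard library alone. This is a minimal sketch under the assumption that rows are true labels and columns are predictions (scikit-learn's convention); the helper names and the counts in `example_csv` are made up for illustration and are not results from this notebook.

```python
import csv
import io

LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def parse_confusion_csv(text):
    """Parse comma-separated integer rows into a list of lists."""
    return [[int(cell) for cell in row] for row in csv.reader(io.StringIO(text)) if row]


def recall_per_class(cm):
    """Diagonal count divided by the row total (rows are true labels)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


# Made-up counts for illustration only.
example_csv = "50,30,20\n10,60,30\n25,25,50\n"
cm = parse_confusion_csv(example_csv)
for name, recall in zip(LABEL_NAMES, recall_per_class(cm)):
    print(f"{name}: recall={recall:.2f}")
```

Row-normalized values are often easier to compare across splits than raw counts, since the splits differ slightly in size.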

In [15]:
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)
all_metrics = {}

for config_name in CONFIGS:
    ds_dict = datasets[config_name]
    split_names = EVAL_SPLITS[config_name]
    all_metrics[config_name] = {}

    for split_name in split_names:
        if split_name not in ds_dict:
            print(f"  Skip {config_name}/{split_name} (missing)")
            continue
        ds = ds_dict[split_name]
        print(f"Evaluating {config_name} / {split_name} ...")
        y_true, y_pred = run_inference(ds)
        # Label distribution (true and predicted)
        print("  True label dist:", dict(Counter(y_true)))
        print("  Pred label dist:", dict(Counter(y_pred)))
        metrics, cm = compute_metrics(y_true, y_pred)
        all_metrics[config_name][split_name] = metrics

        cm_path = Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.csv"
        np.savetxt(cm_path, cm, fmt="%d", delimiter=",")
        save_confusion_plot(cm, Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.png")

        print(f"  accuracy={metrics['accuracy']:.4f}, f1_macro={metrics['f1_macro']:.4f}")

with open(Path(RESULTS_DIR) / "metrics.json", "w") as f:
    json.dump(all_metrics, f, indent=2)
print(f"Saved {RESULTS_DIR}/metrics.json")

Evaluating snli_tr_1_1 / test ...
Tokenize: 100%|██████████| 9824/9824 [00:00<00:00, 58216.28 examples/s]
batch['input_ids'].shape: torch.Size([32, 40])
Inference: 100%|██████████| 307/307 [01:32<00:00,  3.33it/s]
  True label dist: {np.int64(1): 3219, np.int64(0): 3368, np.int64(2): 3237}
  Pred label dist: {np.int64(2): 2323, np.int64(0): 4004, np.int64(1): 3497}
  accuracy=0.3347, f1_macro=0.3288
Evaluating multinli_tr_1_1 / validation_matched ...
Tokenize: 100%|██████████| 9809/9809 [00:00<00:00, 54379.68 examples/s]
batch['input_ids'].shape: torch.Size([32, 73])
Inference: 100%|██████████| 307/307 [02:27<00:00,  2.08it/s]
  True label dist: {np.int64(1): 3123, np.int64(2): 3211, np.int64(0): 3475}
  Pred label dist: {np.int64(1): 3748, np.int64(2): 4457, np.int64(0): 1604}
  accuracy=0.3217, f1_macro=0.3110
Evaluating multinli_tr_1_1 / validation_mismatched ...
Tokenize: 100%|██████████| 9825/9825 [00:00<00:00, 53023.94 examples/s]
batch['input_ids'].shape: torch.Size([32, 116])
Inference: 100%|██████████| 308/308 [02:33<00:00,  2.01it/s]
  True label dist: {np.int64(2): 3240, np.int64(0): 3456, np.int64(1): 3129}
  Pred label dist: {np.int64(1): 3518, np.int64(2): 5004, np.int64(0): 1303}
  accuracy=0.3129, f1_macro=0.2933
Evaluating trglue_mnli / test_matched ...
Tokenize: 100%|██████████| 9008/9008 [00:00<00:00, 50884.01 examples/s]
batch['input_ids'].shape: torch.Size([32, 73])
Inference: 100%|██████████| 282/282 [02:04<00:00,  2.27it/s]
  True label dist: {np.int64(1): 3138, np.int64(2): 2946, np.int64(0): 2924}
  Pred label dist: {np.int64(1): 3635, np.int64(2): 3905, np.int64(0): 1468}
  accuracy=0.3323, f1_macro=0.3141
Evaluating trglue_mnli / test_mismatched ...
Tokenize: 100%|██████████| 9217/9217 [00:00<00:00, 52341.35 examples/s]
batch['input_ids'].shape: torch.Size([32, 74])
Inference: 100%|██████████| 289/289 [02:04<00:00,  2.31it/s]
  True label dist: {np.int64(1): 3043, np.int64(0): 3101, np.int64(2): 3073}
  Pred label dist: {np.int64(2): 5031, np.int64(0): 1079, np.int64(1): 3107}
  accuracy=0.3328, f1_macro=0.3010
Saved results/metrics.json

In [16]:
# Summary: per config/split
for config_name, splits in all_metrics.items():
    for split_name, m in splits.items():
        print(f"{config_name} / {split_name}: acc={m['accuracy']:.4f}, F1_macro={m['f1_macro']:.4f}, F1_per_class={m['f1_per_class']}")

snli_tr_1_1 / test: acc=0.3347, F1_macro=0.3288, F1_per_class={'entailment': 0.37710255018990774, 'neutral': 0.35318642048838594, 'contradiction': 0.25611510791366904}
multinli_tr_1_1 / validation_matched: acc=0.3217, F1_macro=0.3110, F1_per_class={'entailment': 0.21894073636542627, 'neutral': 0.34463687963906275, 'contradiction': 0.36932707355242567}
multinli_tr_1_1 / validation_mismatched: acc=0.3129, F1_macro=0.2933, F1_per_class={'entailment': 0.17062408068922041, 'neutral': 0.32014442605686777, 'contradiction': 0.3891314895681708}
trglue_mnli / test_matched: acc=0.3323, F1_macro=0.3141, F1_per_class={'entailment': 0.17987249544626593, 'neutral': 0.36054923962793445, 'contradiction': 0.40198511166253104}
trglue_mnli / test_mismatched: acc=0.3328, F1_macro=0.3010, F1_per_class={'entailment': 0.1291866028708134, 'neutral': 0.3460162601626016, 'contradiction': 0.42769002961500496}
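
One rough way to judge how close these accuracies are to chance is a z-score against a 1/3 baseline. The sketch below assumes independent examples and a uniform three-way guesser, which is only an approximation here: the classes are not perfectly balanced and an untrained head is biased rather than uniform. The function name `z_vs_chance` is hypothetical; the accuracies and split sizes are copied from the output above.

```python
import math


def z_vs_chance(accuracy, n, num_labels=3):
    """Standard-normal z-score of an observed accuracy against uniform guessing."""
    p0 = 1.0 / num_labels
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    return (accuracy - p0) / standard_error


# (split, reported accuracy, number of examples) taken from the runs above.
REPORTED = [
    ("snli_tr_1_1/test", 0.3347, 9824),
    ("multinli_tr_1_1/validation_matched", 0.3217, 9809),
    ("multinli_tr_1_1/validation_mismatched", 0.3129, 9825),
    ("trglue_mnli/test_matched", 0.3323, 9008),
    ("trglue_mnli/test_mismatched", 0.3328, 9217),
]

for name, acc, n in REPORTED:
    print(f"{name}: z={z_vs_chance(acc, n):+.2f}")
```

A split whose z-score lies well below zero would point to a head whose skewed predictions happen to clash with that split's label distribution, rather than to anything the model has learned.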
