# NLI base results: Qwen2-1.5B-Instruct (Qwen/Qwen2-1.5B-Instruct)

Loads [yilmazzey/sdp2-nli](https://huggingface.co/datasets/yilmazzey/sdp2-nli) (snli_tr_1_1, multinli_tr_1_1, trglue_mnli) and runs **test-only** evaluation with this model.

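The model is a 1.5B-parameter generative LLM from the Qwen2 series, instruction-tuned with SFT and DPO. It offers multilingual support through an improved tokenizer and is reported to be strong on non-English benchmarks (e.g. TurkishMMLU, ~66.85%). Evaluation is zero-shot prompted NLI with no fine-tuning; given the model size, accuracy of roughly 50-65% is expected on reasoning/NLI benchmarks. Generated outputs are parsed to 0 = entailment, 1 = neutral, 2 = contradiction.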

**Splits:** snli → test; multinli → validation_matched/mismatched; trglue → test_matched/test_mismatched. **Metrics:** Accuracy, macro F1, per-class F1, confusion matrix (CSV + plot).

In [1]:
# Install dependencies (run once)
!pip install -q -U transformers datasets accelerate scikit-learn tqdm "huggingface_hub[hf_transfer]"

# Enable faster Hugging Face downloads. Set this before importing any
# Hugging Face library: huggingface_hub reads the variable when it is first imported.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

import json
import random
from collections import Counter
from pathlib import Path

import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from tqdm import tqdm
from transformers import pipeline, AutoTokenizer

# Plotting is optional; confusion matrices are still saved as CSV without it.
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    HAS_PLOT = True
except ImportError:
    HAS_PLOT = False

# Index in this list == integer label id used by the datasets.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

if torch.backends.mps.is_available():
    print("Apple Silicon MPS available; using for acceleration.")
elif torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU/MPS; running on CPU.")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition


In [10]:
REPO_ID = "yilmazzey/sdp2-nli"
CONFIGS = ["snli_tr_1_1", "multinli_tr_1_1", "trglue_mnli"]
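# NOTE: this ID selects the 7B checkpoint, while the notebook title and description
# name Qwen2-1.5B-Instruct; use "Qwen/Qwen2-1.5B-Instruct" to evaluate the 1.5B model.
MODEL_ID = "Qwen/Qwen2-7B-Instruct"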
NUM_LABELS = 3  # entailment, neutral, contradiction
RESULTS_DIR = "results"
# Lower to 4-8 if memory low (1.5B very lightweight). Even 32 is fine on M4.
BATCH_SIZE = 16
EVAL_SPLITS = {
    "snli_tr_1_1": ["test"],
    "multinli_tr_1_1": ["validation_matched", "validation_mismatched"],
    "trglue_mnli": ["test_matched", "test_mismatched"],
}
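
With `BATCH_SIZE = 16`, the number of inference loop iterations for a split is the ceiling of the split size divided by the batch size. A minimal sketch of that arithmetic follows; the helper name `num_batches` is hypothetical and not part of this notebook, and the example size is the `snli_tr_1_1` test split size reported by the dataset loading output further down.

```python
import math


def num_batches(n_examples: int, batch_size: int = 16) -> int:
    """Number of iterations needed to cover n_examples in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(n_examples / batch_size)


# 9824 examples in snli_tr_1_1 / test (taken from the loading output).
print(num_batches(9824))  # 9824 = 16 * 614, so this prints 614
```

Halving `BATCH_SIZE` roughly doubles the iteration count, which is worth keeping in mind when estimating run time from the progress bar.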

In [8]:
from huggingface_hub import login
login()  # use token from huggingface-cli login or HF_TOKEN env

In [4]:
# Load all three dataset configs
datasets = {}
for cfg in CONFIGS:
    print(f"Loading {REPO_ID} :: {cfg} ...")
    datasets[cfg] = load_dataset(REPO_ID, cfg)
    print("  splits:", list(datasets[cfg].keys()))

Loading yilmazzey/sdp2-nli :: snli_tr_1_1 ...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/548487 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9836 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/9824 [00:00<?, ? examples/s]

  splits: ['train', 'validation', 'test']
Loading yilmazzey/sdp2-nli :: multinli_tr_1_1 ...


Generating train split:   0%|          | 0/392599 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9809 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9825 [00:00<?, ? examples/s]

  splits: ['train', 'validation_matched', 'validation_mismatched']
Loading yilmazzey/sdp2-nli :: trglue_mnli ...


Generating train split:   0%|          | 0/162788 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9050 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9200 [00:00<?, ? examples/s]

Generating test_matched split:   0%|          | 0/9008 [00:00<?, ? examples/s]

Generating test_mismatched split:   0%|          | 0/9217 [00:00<?, ? examples/s]

  splits: ['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched']
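
The evaluation below assumes every gold label is 0, 1, or 2. Some NLI releases mark unlabeled examples with `-1`; whether that applies to this dataset is an assumption, not something the loading output confirms. A small check like the following sketch can be run on each evaluated split before inference. The helper `label_report` is hypothetical.

```python
from collections import Counter

VALID_LABELS = {0, 1, 2}  # entailment, neutral, contradiction


def label_report(labels):
    """Return (distribution of valid labels, indices whose label is not in VALID_LABELS)."""
    dist = Counter(label for label in labels if label in VALID_LABELS)
    bad = [i for i, label in enumerate(labels) if label not in VALID_LABELS]
    return dict(dist), bad


# Toy input: one example carries the assumed "unlabeled" marker -1.
dist, bad = label_report([0, 2, 1, -1, 0])
print(dist, bad)
```

In this notebook it could be called as `label_report(list(datasets[cfg][split]["label"]))` for each entry in `EVAL_SPLITS`.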


In [11]:
print("Loading Qwen2-1.5B-Instruct (text-generation pipeline)...")

device = "mps" if torch.backends.mps.is_available() else 0 if torch.cuda.is_available() else -1
generator = pipeline(
    "text-generation",
    model=MODEL_ID,
    device=device,
    torch_dtype=torch.bfloat16 if torch.backends.mps.is_available() or torch.cuda.is_available() else torch.float32,
    max_length=None,  # Silence max_length warning
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully.")

Loading Qwen2-1.5B-Instruct (text-generation pipeline)...


`torch_dtype` is deprecated! Use `dtype` instead!


Passing `generation_config` together with generation-related arguments=({'max_length'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.


Model loaded successfully.
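
The tokenizer loaded above supplies the chat template that turns a list of role/content messages into a single prompt string. The debug output printed during evaluation in this notebook shows the resulting layout, with `<|im_start|>` and `<|im_end|>` markers around each turn. The sketch below is a hand-written approximation of that layout for illustration only; `render_chatml` is a hypothetical helper, and `tokenizer.apply_chat_template` remains the authoritative source of the format.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate the ChatML-style layout seen in this notebook's debug output."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = render_chatml(
    [
        {"role": "system", "content": "Answer with one word."},
        {"role": "user", "content": "Premise: A\nHypothesis: B\nLabel:"},
    ]
)
print(example)
```

Because the prompt ends with the opened assistant turn, anything the model generates after it is the candidate label text.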


In [12]:
def nli_prompt(premise, hypothesis):
    return [
        {"role": "system", "content": "You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text."},
        {"role": "user", "content": f"Premise: {premise}\nHypothesis: {hypothesis}\nLabel:"}
    ]

def parse_generated_label(gen_text, formatted_prompt):
    # Remove prompt part + any leading/trailing junk
    continuation = gen_text[len(formatted_prompt):].strip().lower()
    if not continuation:
        return 1  # neutral fallback

    # Take first word, remove punctuation
    first_word = continuation.split()[0].rstrip('.,!?;:')

    label_map = {"entailment": 0, "neutral": 1, "contradiction": 2}
    return label_map.get(first_word, 1)  # Default to neutral if unknown

def run_prompted_inference(ds):
    premises = list(ds["premise"])
    hypotheses = list(ds["hypothesis"])
    labels = list(ds["label"])
    n = len(ds)
    y_pred = []
    all_generations = []  # Collect all for full debug

    for start in tqdm(range(0, n, BATCH_SIZE), desc="Inference"):
        end = min(start + BATCH_SIZE, n)
        batch_premises = premises[start:end]
        batch_hypotheses = hypotheses[start:end]
        batch_prompts = [nli_prompt(p, h) for p, h in zip(batch_premises, batch_hypotheses)]

        formatted_prompts = tokenizer.apply_chat_template(batch_prompts, tokenize=False, add_generation_prompt=True)

        out = generator(
            formatted_prompts,
            max_new_tokens=5,  # short budget: the answer should be a single label word
            do_sample=False,   # greedy decoding; temperature is ignored when sampling is off
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
            max_length=None,
        )

        all_generations.extend(out)

        for i, gen in enumerate(out):
            gen_text = gen[0]["generated_text"]
            parsed = parse_generated_label(gen_text, formatted_prompts[i])
            y_pred.append(parsed)

    y_true = np.array(labels, dtype=np.int64)
    y_pred = np.array(y_pred, dtype=np.int64)

    # Debug: print the first 5 samples, then every 100th
    debug_indices = list(range(min(5, n))) + list(range(100, n, 100))
    for i in debug_indices:
        print(f"Debug Sample {i}: Generated: {all_generations[i][0]['generated_text']}, Parsed Label: {y_pred[i]}")

    # .tolist() yields plain Python ints, so the printed dict keys stay readable
    print("True label dist:", dict(Counter(y_true.tolist())))
    print("Pred label dist:", dict(Counter(y_pred.tolist())))

    return y_true, y_pred
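
`parse_generated_label` maps any unrecognized or empty continuation to neutral, so parsing failures are indistinguishable from genuine neutral predictions. One possible variant, sketched below under the assumption that a failure count is wanted, searches for the first label word anywhere in the continuation and returns `-1` when none is found. The names `parse_label_strict` and `LABEL_MAP` are hypothetical.

```python
import re

LABEL_MAP = {"entailment": 0, "neutral": 1, "contradiction": 2}
_LABEL_RE = re.compile(r"\b(entailment|neutral|contradiction)\b")


def parse_label_strict(continuation: str) -> int:
    """Return the id of the first label word found, or -1 if none is present."""
    match = _LABEL_RE.search(continuation.lower())
    return LABEL_MAP[match.group(1)] if match else -1


for text in ["neutral", "Contradiction.", "The answer is entailment", "", "yes"]:
    print(repr(text), "->", parse_label_strict(text))
```

Counting the `-1` results per split shows how often the fallback would have fired; those examples can then be mapped to neutral, or excluded, as a separate and visible decision.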

In [13]:
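def compute_metrics(y_true, y_pred):
    # Fix the label set so per-class scores and the confusion matrix stay
    # NUM_LABELS x NUM_LABELS even if a class never appears in y_true or y_pred.
    label_ids = list(range(NUM_LABELS))
    acc = float(accuracy_score(y_true, y_pred))
    f1_macro = float(f1_score(y_true, y_pred, labels=label_ids, average="macro", zero_division=0))
    f1_per_class = f1_score(y_true, y_pred, labels=label_ids, average=None, zero_division=0)
    f1_per_class = {LABEL_NAMES[i]: float(f1_per_class[i]) for i in label_ids}
    cm = confusion_matrix(y_true, y_pred, labels=label_ids)
    out = {"accuracy": acc, "f1_macro": f1_macro, "f1_per_class": f1_per_class}
    return out, cm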


def save_confusion_plot(cm, path):
    if not HAS_PLOT:
        return
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt="d", xticklabels=LABEL_NAMES, yticklabels=LABEL_NAMES, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
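
To make the reported metrics concrete, the sketch below recomputes a confusion matrix, per-class F1, and macro F1 by hand on six toy predictions, without scikit-learn. The toy labels and the helper names `confusion` and `per_class_f1` are made up for illustration; the notebook itself relies on `sklearn.metrics`.

```python
def confusion(y_true, y_pred, num_labels=3):
    """Rows are true labels, columns are predicted labels."""
    cm = [[0] * num_labels for _ in range(num_labels)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_f1(cm):
    """F1 per class as 2*TP / (2*TP + FP + FN); 0.0 when the denominator is zero."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(len(cm))) - tp  # column sum minus diagonal
        fn = sum(cm[k]) - tp                             # row sum minus diagonal
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion(y_true, y_pred)
f1 = per_class_f1(cm)
macro_f1 = sum(f1) / len(f1)  # unweighted mean over the three classes
accuracy = sum(cm[k][k] for k in range(3)) / len(y_true)
print(cm, f1, macro_f1, accuracy)
```

Macro F1 weights each class equally, which is why a weak neutral class pulls it down even when overall accuracy looks similar.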

In [14]:
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)
all_metrics = {}

for config_name in CONFIGS:
    ds_dict = datasets[config_name]
    split_names = EVAL_SPLITS[config_name]
    all_metrics[config_name] = {}

    for split_name in split_names:
        if split_name not in ds_dict:
            print(f"  Skip {config_name}/{split_name} (missing)")
            continue
        ds = ds_dict[split_name]
        print(f"Evaluating {config_name} / {split_name} ...")
        y_true, y_pred = run_prompted_inference(ds)
        metrics, cm = compute_metrics(y_true, y_pred)
        all_metrics[config_name][split_name] = metrics

        cm_path = Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.csv"
        np.savetxt(cm_path, cm, fmt="%d", delimiter=",")
        save_confusion_plot(cm, Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.png")

        print(f"  accuracy={metrics['accuracy']:.4f}, f1_macro={metrics['f1_macro']:.4f}")

with open(Path(RESULTS_DIR) / "metrics.json", "w") as f:
    json.dump(all_metrics, f, indent=2)
print(f"Saved {RESULTS_DIR}/metrics.json")

Evaluating snli_tr_1_1 / test ...



The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Passing `generation_config` together with generation-related arguments=({'max_new_tokens', 'pad_token_id', 'do_sample', 'temperature'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.


Debug Sample 0: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Bu kilise korosu, kilisedeki kitaptan neşeli şarkılar söylerken kitlelere şarkı söyler.
Hypothesis: Kilisenin tavanında çatlaklar var.
Label:<|im_end|>
<|im_start|>assistant
neutral, Parsed Label: 1
Debug Sample 1: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Bu kilise korosu, kilisedeki kitaptan neşeli şarkılar söylerken kitlelere şarkı söyler.
Hypothesis: Kilise 



Debug Sample 0: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Yeni haklar yeterince güzel.
Hypothesis: Herkes gerçekten en yeni faydaları seviyor
Label:<|im_end|>
<|im_start|>assistant
neutral, Parsed Label: 1
Debug Sample 1: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Bu site, tüm ödül kazananların bir listesini ve Hükümet Yönetici makalelerinin aranabilir bir veritabanını içerir.
Hypothesis: Web sitesinde yer alan Hükümet 



Debug Sample 0: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Katkınız, öğrencilerimize kaliteli bir eğitim sağlamamıza yardımcı oldu.
Hypothesis: Katkılarınızın öğrencilerimizin eğitimine faydası olmadı.
Label:<|im_end|>
<|im_start|>assistant
contradiction, Parsed Label: 2
Debug Sample 1: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Cevap onların nedeni ile ilgisi yoktur, ancak sözlükler bi-benzersiz ikame egzersizleri değil



Debug Sample 0: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Herkese merhabalar! Az önce Türk Telekom Play Store üzerinden F1 2012 satın aldım fakat ülkenizde geçerli değil diyor? Ne yapmam lazım? Son düzenleyen: Moderatör: 16 Mayıs 2021.
Hypothesis: Su, hayat için önemlidir.
Label:<|im_end|>
<|im_start|>assistant
neutral, Parsed Label: 1
Debug Sample 1: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Savcının fitne - fesat içi



Debug Sample 0: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: Yaşamı boyunca pek çok zorluğun üstesinden gelen Ebisu sonunda çocukların, balıkçıların, refahın ve talihin koruyucu tanrısına dönüştü.
Hypothesis: Kiraz ağaçları Japonya'da yaygın olarak yetişir.
Label:<|im_end|>
<|im_start|>assistant
neutral, Parsed Label: 1
Debug Sample 1: Generated: <|im_start|>system
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word only: entailment, neutral, or contradiction. No explanation, no other text.<|im_end|>
<|im_start|>user
Premise: 2018 ve 2020 yılları arasında yavaş yavaş dev




In [15]:
# Summary: per config/split
for config_name, splits in all_metrics.items():
    for split_name, m in splits.items():
        print(f"{config_name} / {split_name}: acc={m['accuracy']:.4f}, F1_macro={m['f1_macro']:.4f}, F1_per_class={m['f1_per_class']}")

snli_tr_1_1 / test: acc=0.7411, F1_macro=0.7403, F1_per_class={'entailment': 0.7970054707745465, 'neutral': 0.6639093137254902, 'contradiction': 0.7599611273080661}
multinli_tr_1_1 / validation_matched: acc=0.7137, F1_macro=0.7086, F1_per_class={'entailment': 0.7508600917431193, 'neutral': 0.6125878922997771, 'contradiction': 0.7622962854206431}
multinli_tr_1_1 / validation_mismatched: acc=0.7214, F1_macro=0.7150, F1_per_class={'entailment': 0.7574112734864301, 'neutral': 0.6114805740287015, 'contradiction': 0.7761813064731151}
trglue_mnli / test_matched: acc=0.8187, F1_macro=0.8181, F1_per_class={'entailment': 0.8259732266502539, 'neutral': 0.8015959376133478, 'contradiction': 0.8265867066466767}
trglue_mnli / test_mismatched: acc=0.8790, F1_macro=0.8778, F1_per_class={'entailment': 0.8926432191675626, 'neutral': 0.851109520400859, 'contradiction': 0.8896606156274665}
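
The summary lines above can be easier to compare as a table. The sketch below renders accuracy and macro F1 as a Markdown table; the values are copied from the printed summary (four decimals), and `to_markdown_table` is a hypothetical helper. In the notebook the rows could instead be built from `all_metrics`.

```python
# Accuracy and macro F1 copied from the printed summary above.
results = [
    ("snli_tr_1_1", "test", 0.7411, 0.7403),
    ("multinli_tr_1_1", "validation_matched", 0.7137, 0.7086),
    ("multinli_tr_1_1", "validation_mismatched", 0.7214, 0.7150),
    ("trglue_mnli", "test_matched", 0.8187, 0.8181),
    ("trglue_mnli", "test_mismatched", 0.8790, 0.8778),
]


def to_markdown_table(rows):
    """Render (config, split, accuracy, macro_f1) rows as a Markdown table."""
    lines = ["| Config | Split | Accuracy | Macro F1 |", "|---|---|---|---|"]
    for config, split, acc, f1 in rows:
        lines.append(f"| {config} | {split} | {acc:.4f} | {f1:.4f} |")
    return "\n".join(lines)


table = to_markdown_table(results)
print(table)
```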
