# NLI base results: Gemma-3-4b-it (google/gemma-3-4b-it)

Loads [yilmazzey/sdp2-nli](https://huggingface.co/datasets/yilmazzey/sdp2-nli) (snli_tr_1_1, multinli_tr_1_1, trglue_mnli) and runs **test-only** evaluation with this model.

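Gemma-3-4b-it is a 4B-parameter, instruction-tuned generative LLM from the Gemma 3 series with multilingual support (140+ languages, including Turkish). It is evaluated here with zero-shot prompting only (no fine-tuning). Expected accuracy is roughly 50-60%, judging by its reasoning/NLI benchmark scores (e.g., MMLU 59.6, BoolQ 72.3 0-shot). Generated outputs are parsed to integer labels: 0 = entailment, 1 = neutral, 2 = contradiction.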

**Splits:** snli → test; multinli → validation_matched/mismatched; trglue → test_matched/test_mismatched. **Metrics:** Accuracy, macro F1, per-class F1, confusion matrix (CSV + plot).
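
As a rough illustration of what the accuracy, per-class F1, and macro F1 figures measure, the sketch below recomputes them in plain Python on a made-up toy example. The helper names `per_class_f1` and `macro_f1` and the toy labels are assumptions for illustration only; the notebook itself relies on scikit-learn for these metrics.

```python
from collections import Counter

LABELS = [0, 1, 2]  # 0=entailment, 1=neutral, 2=contradiction


def per_class_f1(y_true, y_pred, labels=LABELS):
    """Return {label: F1}; a class with no true or predicted items scores 0.0."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy labels, invented for this example (not taken from the datasets).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 2, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", round(accuracy, 4))
print("per-class F1:", per_class_f1(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 4))
```

Because macro F1 weights every class equally, a weak class (here class 1) pulls the average down even when overall accuracy looks acceptable.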

In [None]:
# Colab: run once to install/upgrade dependencies (select a GPU first: Runtime -> Change runtime type -> GPU)
!pip install -q -U transformers datasets accelerate scikit-learn tqdm

import json
import random
from collections import Counter
from pathlib import Path

import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from tqdm import tqdm
from transformers import pipeline, AutoTokenizer

try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    HAS_PLOT = True
except ImportError:
    HAS_PLOT = False

LABEL_NAMES = ["entailment", "neutral", "contradiction"]

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU; running on CPU (4B still feasible but slower).")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition


In [None]:
REPO_ID = "yilmazzey/sdp2-nli"
CONFIGS = ["snli_tr_1_1", "multinli_tr_1_1", "trglue_mnli"]
MODEL_ID = "google/gemma-3-4b-it"
NUM_LABELS = 3  # entailment, neutral, contradiction
RESULTS_DIR = "results"
# Lower (e.g., to 4) if memory is limited. The 4B model is light enough to run on CPU, though slowly.
BATCH_SIZE = 8
EVAL_SPLITS = {
    "snli_tr_1_1": ["test"],
    "multinli_tr_1_1": ["validation_matched", "validation_mismatched"],
    "trglue_mnli": ["test_matched", "test_mismatched"],
}

In [None]:
# Load all three dataset configs
datasets = {}
for cfg in CONFIGS:
    print(f"Loading {REPO_ID} :: {cfg} ...")
    datasets[cfg] = load_dataset(REPO_ID, cfg)
    print("  splits:", list(datasets[cfg].keys()))

Loading yilmazzey/sdp2-nli :: snli_tr_1_1 ...


Generating train split:   0%|          | 0/548487 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9836 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/9824 [00:00<?, ? examples/s]

  splits: ['train', 'validation', 'test']
Loading yilmazzey/sdp2-nli :: multinli_tr_1_1 ...


Generating train split:   0%|          | 0/392599 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9809 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9825 [00:00<?, ? examples/s]

  splits: ['train', 'validation_matched', 'validation_mismatched']
Loading yilmazzey/sdp2-nli :: trglue_mnli ...


Generating train split:   0%|          | 0/162788 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9050 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9200 [00:00<?, ? examples/s]

Generating test_matched split:   0%|          | 0/9008 [00:00<?, ? examples/s]

Generating test_mismatched split:   0%|          | 0/9217 [00:00<?, ? examples/s]

  splits: ['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched']


In [None]:
from huggingface_hub import login

login()  # Gemma is a gated model: log in with a Hugging Face token that has been granted access

In [None]:
print("Loading Gemma-3-4b-it (text-generation pipeline)...")

generator = pipeline(
    "text-generation",
    model=MODEL_ID,
    device=0 if torch.cuda.is_available() else -1,
    # `torch_dtype` is deprecated in recent transformers releases; older versions still need `torch_dtype=`.
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    max_length=None,  # Silence warning
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

if hasattr(tokenizer, "pad_token") and tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully.")

Loading Gemma-3-4b-it (text-generation pipeline)...


config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json:   0%|          | 0.00/90.6k [00:00<?, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/883 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/215 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

Passing `generation_config` together with generation-related arguments=({'max_length'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.


Model loaded successfully.


In [None]:
def nli_prompt(premise, hypothesis):
    return [
        {"role": "system", "content": "You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text."},
        {"role": "user", "content": f"Premise: {premise}\nHypothesis: {hypothesis}\nLabel:"}
    ]

def parse_generated_label(gen_text, prompt_text):
    continuation = gen_text[len(prompt_text):].strip().lower()
    if not continuation:
        return 1  # neutral fallback
    first_word = continuation.split()[0].rstrip('.,!?')
    label_map = {"entailment": 0, "neutral": 1, "contradiction": 2}
    return label_map.get(first_word, 1)  # Default to neutral if unknown

def run_prompted_inference(ds):
    premises = list(ds["premise"])
    hypotheses = list(ds["hypothesis"])
    labels = list(ds["label"])
    n = len(ds)
    y_pred = []
    all_generations = []  # Collect all for full debug

    for start in tqdm(range(0, n, BATCH_SIZE), desc="Inference"):
        batch_premises = premises[start : start + BATCH_SIZE]
        batch_hypotheses = hypotheses[start : start + BATCH_SIZE]
        batch_prompts = [nli_prompt(p, h) for p, h in zip(batch_premises, batch_hypotheses)]

        formatted_prompts = tokenizer.apply_chat_template(batch_prompts, tokenize=False, add_generation_prompt=True)

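        # Greedy decoding. `temperature` is omitted because it is ignored when
        # do_sample=False and only triggers an "invalid generation flags" warning.
        # Note: without a `batch_size` argument the pipeline still processes these
        # prompts one at a time; BATCH_SIZE only chunks the input list.
        out = generator(
            formatted_prompts,
            max_new_tokens=5,  # Strict limit
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
            max_length=None,  # Silence warning
        )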

        all_generations.extend(out)  # Save all

        for i, gen in enumerate(out):
            gen_text = gen[0]["generated_text"]
            y_pred.append(parse_generated_label(gen_text, formatted_prompts[i]))

    y_true = np.array(labels, dtype=np.int64)
    y_pred = np.array(y_pred, dtype=np.int64)

    # Log first 5 generations for debug
    for i in range(min(5, n)):
        print(f"Debug Sample {i}: Generated: {all_generations[i][0]['generated_text']}, Parsed Label: {y_pred[i]}")

    print("True label dist:", dict(Counter(y_true)))
    print("Pred label dist:", dict(Counter(y_pred)))

    return y_true, y_pred
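
`parse_generated_label` silently maps anything it cannot recognize to neutral, so parse failures are indistinguishable from genuine neutral predictions. The sketch below shows one possible alternative, not the notebook's method: it searches the continuation for the first label word and reports how often the fallback was used. The names `parse_label_strict`, `apply_fallback`, and the `UNPARSED` sentinel are hypothetical, and the sample strings are invented.

```python
import re

LABEL_MAP = {"entailment": 0, "neutral": 1, "contradiction": 2}
_LABEL_RE = re.compile(r"\b(entailment|neutral|contradiction)\b")
UNPARSED = -1  # hypothetical sentinel; the original notebook does not use it


def parse_label_strict(continuation):
    """Return the id of the first label word in the text, else UNPARSED."""
    match = _LABEL_RE.search(continuation.lower())
    return LABEL_MAP[match.group(1)] if match else UNPARSED


def apply_fallback(parsed, fallback=1):
    """Replace UNPARSED with the fallback label and return the fallback rate."""
    n_unparsed = sum(1 for p in parsed if p == UNPARSED)
    labels = [fallback if p == UNPARSED else p for p in parsed]
    rate = n_unparsed / len(parsed) if parsed else 0.0
    return labels, rate


# Invented continuations covering a clean answer, markdown decoration,
# a full sentence, and an off-vocabulary reply.
samples = ["entailment\n", "**Contradiction**", "The answer is neutral.", "evet"]
parsed = [parse_label_strict(s) for s in samples]
labels, fallback_rate = apply_fallback(parsed)
print(parsed, labels, fallback_rate)
```

A high fallback rate would suggest that the neutral counts are inflated by parsing rather than by the model. One caveat of searching the whole continuation is that a negated answer such as "not entailment" would still be read as entailment.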

In [None]:
def compute_metrics(y_true, y_pred):
    acc = float(accuracy_score(y_true, y_pred))
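    # Pass the label set explicitly so every class gets an F1 entry and the
    # confusion matrix is always NUM_LABELS x NUM_LABELS, even if a class is absent.
    labels = list(range(NUM_LABELS))
    f1_macro = float(f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0))
    f1_per_class = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
    f1_per_class = {LABEL_NAMES[i]: float(f1_per_class[i]) for i in range(NUM_LABELS)}
    cm = confusion_matrix(y_true, y_pred, labels=labels)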
    out = {"accuracy": acc, "f1_macro": f1_macro, "f1_per_class": f1_per_class}
    return out, cm


def save_confusion_plot(cm, path):
    if not HAS_PLOT:
        return
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt="d", xticklabels=LABEL_NAMES, yticklabels=LABEL_NAMES, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
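
Raw confusion counts can be hard to compare across splits of different sizes. A minimal sketch of row normalization follows, assuming the matrix is a list of rows with true labels on the rows and predictions on the columns (the same orientation as the heatmap above). The helper `row_normalize` and the counts in `toy_cm` are made up for illustration.

```python
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def row_normalize(cm):
    """Divide each row by its sum, so entry [i][j] estimates P(pred=j | true=i)."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Made-up counts for illustration only (rows: true label, columns: prediction).
toy_cm = [
    [8, 1, 1],
    [4, 4, 2],
    [0, 5, 5],
]

norm = row_normalize(toy_cm)
# The diagonal of the row-normalized matrix is the per-class recall.
recall = [norm[i][i] for i in range(len(norm))]
for name, r in zip(LABEL_NAMES, recall):
    print(f"{name}: recall={r:.2f}")
```

The saved CSV files can be loaded with `np.loadtxt(path, delimiter=",")` and converted with `.tolist()` before being passed to a helper like this.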

In [None]:
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)
all_metrics = {}

for config_name in CONFIGS:
    ds_dict = datasets[config_name]
    split_names = EVAL_SPLITS[config_name]
    all_metrics[config_name] = {}

    for split_name in split_names:
        if split_name not in ds_dict:
            print(f"  Skip {config_name}/{split_name} (missing)")
            continue
        ds = ds_dict[split_name]
        print(f"Evaluating {config_name} / {split_name} ...")
        y_true, y_pred = run_prompted_inference(ds)
        metrics, cm = compute_metrics(y_true, y_pred)
        all_metrics[config_name][split_name] = metrics

        cm_path = Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.csv"
        np.savetxt(cm_path, cm, fmt="%d", delimiter=",")
        save_confusion_plot(cm, Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.png")

        print(f"  accuracy={metrics['accuracy']:.4f}, f1_macro={metrics['f1_macro']:.4f}")

with open(Path(RESULTS_DIR) / "metrics.json", "w") as f:
    json.dump(all_metrics, f, indent=2)
print(f"Saved {RESULTS_DIR}/metrics.json")

Evaluating snli_tr_1_1 / test ...


Inference:   0%|          | 0/1228 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Passing `generation_config` together with generation-related arguments=({'temperature', 'pad_token_id', 'max_new_tokens', 'do_sample'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Inference:   1%|          | 10/1228 [00:11<14:39,  1.39it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Inference: 100%|██████████| 1228/1228 [13:32<00:00,  1.51it/s]


Debug Sample 0: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Bu kilise korosu, kilisedeki kitaptan neşeli şarkılar söylerken kitlelere şarkı söyler.
Hypothesis: Kilisenin tavanında çatlaklar var.
Label:<end_of_turn>
<start_of_turn>model
entailment
, Parsed Label: 0
Debug Sample 1: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Bu kilise korosu, kilisedeki kitaptan neşeli şarkılar söylerken kitlelere şarkı söyler.
Hypothesis: Kilise şarkıyla dolu.
Label:<end_of_turn>
<

Inference: 100%|██████████| 1227/1227 [14:16<00:00,  1.43it/s]


Debug Sample 0: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Yeni haklar yeterince güzel.
Hypothesis: Herkes gerçekten en yeni faydaları seviyor
Label:<end_of_turn>
<start_of_turn>model
neutral
, Parsed Label: 1
Debug Sample 1: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Bu site, tüm ödül kazananların bir listesini ve Hükümet Yönetici makalelerinin aranabilir bir veritabanını içerir.
Hypothesis: Web sitesinde yer alan Hükümet Yürütme makaleleri aranamaz.
Label:<end

Inference: 100%|██████████| 1229/1229 [14:25<00:00,  1.42it/s]


Debug Sample 0: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Katkınız, öğrencilerimize kaliteli bir eğitim sağlamamıza yardımcı oldu.
Hypothesis: Katkılarınızın öğrencilerimizin eğitimine faydası olmadı.
Label:<end_of_turn>
<start_of_turn>model
contradiction
, Parsed Label: 2
Debug Sample 1: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Cevap onların nedeni ile ilgisi yoktur, ancak sözlükler bi-benzersiz ikame egzersizleri değildir basit gerçeği ile; Başka bir deyişl

Inference: 100%|██████████| 1126/1126 [12:39<00:00,  1.48it/s]


Debug Sample 0: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Herkese merhabalar! Az önce Türk Telekom Play Store üzerinden F1 2012 satın aldım fakat ülkenizde geçerli değil diyor? Ne yapmam lazım? Son düzenleyen: Moderatör: 16 Mayıs 2021.
Hypothesis: Su, hayat için önemlidir.
Label:<end_of_turn>
<start_of_turn>model
neutral
, Parsed Label: 1
Debug Sample 1: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Savcının fitne - fesat içinde olduğunu söylemek için mangal gibi 

Inference: 100%|██████████| 1153/1153 [12:53<00:00,  1.49it/s]

Debug Sample 0: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: Yaşamı boyunca pek çok zorluğun üstesinden gelen Ebisu sonunda çocukların, balıkçıların, refahın ve talihin koruyucu tanrısına dönüştü.
Hypothesis: Kiraz ağaçları Japonya'da yaygın olarak yetişir.
Label:<end_of_turn>
<start_of_turn>model
neutral
, Parsed Label: 1
Debug Sample 1: Generated: <bos><start_of_turn>user
You are a helpful assistant for natural language inference. Classify the relationship between premise and hypothesis as entailment, neutral, or contradiction. Respond with exactly one word: entailment, neutral, or contradiction. No explanation or additional text.

Premise: 2018 ve 2020 yılları arasında yavaş yavaş devreye girecek ev cari açığı da bu şekild




In [12]:
# Summary: per config/split
for config_name, splits in all_metrics.items():
    for split_name, m in splits.items():
        print(f"{config_name} / {split_name}: acc={m['accuracy']:.4f}, F1_macro={m['f1_macro']:.4f}, F1_per_class={m['f1_per_class']}")

snli_tr_1_1 / test: acc=0.5518, F1_macro=0.5265, F1_per_class={'entailment': 0.7103448275862069, 'neutral': 0.38134129103188313, 'contradiction': 0.48766644837371753}
multinli_tr_1_1 / validation_matched: acc=0.6107, F1_macro=0.5752, F1_per_class={'entailment': 0.6766901258264022, 'neutral': 0.34350797266514804, 'contradiction': 0.7052991452991453}
multinli_tr_1_1 / validation_mismatched: acc=0.6192, F1_macro=0.5797, F1_per_class={'entailment': 0.6829937717724058, 'neutral': 0.3280373831775701, 'contradiction': 0.7281668645073767}
trglue_mnli / test_matched: acc=0.7633, F1_macro=0.7614, F1_per_class={'entailment': 0.7663079580178026, 'neutral': 0.7164958810528431, 'contradiction': 0.8015239477503628}
trglue_mnli / test_mismatched: acc=0.8151, F1_macro=0.8099, F1_per_class={'entailment': 0.816027689030884, 'neutral': 0.7333742582361367, 'contradiction': 0.8801988400994201}
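
The printed summary is dense, so a tabular view may be easier to scan. The sketch below renders a metrics dictionary with the same nesting as `all_metrics` (config, then split, then metric) as a Markdown table. The function name `metrics_table` is hypothetical, and the two example entries reuse the accuracy and macro F1 values printed above, rounded to four decimals.

```python
def metrics_table(all_metrics):
    """Render {config: {split: metrics}} as a Markdown table string."""
    lines = [
        "| config | split | accuracy | macro F1 |",
        "|---|---|---|---|",
    ]
    for config_name, splits in all_metrics.items():
        for split_name, m in splits.items():
            lines.append(
                f"| {config_name} | {split_name} "
                f"| {m['accuracy']:.4f} | {m['f1_macro']:.4f} |"
            )
    return "\n".join(lines)


# Two entries copied from the printed summary (4-decimal values).
example = {
    "snli_tr_1_1": {"test": {"accuracy": 0.5518, "f1_macro": 0.5265}},
    "trglue_mnli": {"test_mismatched": {"accuracy": 0.8151, "f1_macro": 0.8099}},
}

table = metrics_table(example)
print(table)
```

In the notebook, the same function could be fed the dictionary loaded from `results/metrics.json` with `json.load`.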
