# NLI base results: Aya-23-8B (CohereForAI/aya-23-8B) via Ollama on M4

Loads [yilmazzey/sdp2-nli](https://huggingface.co/datasets/yilmazzey/sdp2-nli) (snli_tr_1_1, multinli_tr_1_1, trglue_mnli) and runs zero-shot Turkish NLI evaluation on the held-out splits only (no training), using **CohereForAI/aya-23-8B** served through **Ollama** (no Hugging Face pipeline).

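Uses the official Aya-23 chat template (system + user turn), as applied by Ollama. The model is instructed to answer with exactly one word: entailment, neutral, or contradiction. Outputs are parsed to 0=entailment, 1=neutral, 2=contradiction; any output that cannot be parsed defaults to 1 (neutral). Runs on Apple Silicon (M4) with pure Ollama (CPU/Metal) and no CUDA. The notebook applies no quantization itself; the weight precision depends on the Ollama model tag that was pulled.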

**Splits:** snli → `test`; multinli → `validation_matched`/`validation_mismatched`; trglue → `test_matched`/`test_mismatched`. **Metrics:** accuracy, macro F1, per-class F1, and a confusion matrix (CSV + seaborn plot). Results are saved to `./results/`. **Prerequisite:** pull the model into Ollama first, e.g. `ollama pull cohereforai/aya-23-8b`, and set `MODEL_ID` to the name under which Aya-23-8B is available locally.

In [3]:
# Install ollama Python client if needed; standard libs for datasets/metrics/plots
# !pip install -q ollama datasets scikit-learn tqdm matplotlib seaborn
# If ollama is already installed via brew/pip, skip or run: pip install -q ollama

In [4]:
import json
import random
import re
from collections import Counter
from pathlib import Path

import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from tqdm import tqdm
import ollama

try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    HAS_PLOT = True
except ImportError:
    HAS_PLOT = False

LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Device: M4 / Apple Silicon — no CUDA; Ollama uses Metal/CPU
print("Running on Apple Silicon (M4) / CPU — Ollama handles Metal.")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

Running on Apple Silicon (M4) / CPU — Ollama handles Metal.


In [12]:
REPO_ID = "yilmazzey/sdp2-nli"
CONFIGS = ["snli_tr_1_1", "multinli_tr_1_1", "trglue_mnli"]
MODEL_ID = "aya:8b-23"  # Ollama model name; must match a model listed by `ollama list`
NUM_LABELS = 3
RESULTS_DIR = "results"
BATCH_SIZE = 6  # Examples per progress-bar step (requests are sent one at a time); run used an M4 with 36 GB RAM
MAX_TOKENS = 10
TEMPERATURE = 0.0
TOP_P = 0.0
EVAL_SPLITS = {
    "snli_tr_1_1": ["test"],
    "multinli_tr_1_1": ["validation_matched", "validation_mismatched"],
    "trglue_mnli": ["test_matched", "test_mismatched"],
}

In [6]:
# Load all three dataset configs (same as Turkish-Gemma-9b-T1)
datasets = {}
for cfg in CONFIGS:
    print(f"Loading {REPO_ID} :: {cfg} ...")
    datasets[cfg] = load_dataset(REPO_ID, cfg)
    print("  splits:", list(datasets[cfg].keys()))

Loading yilmazzey/sdp2-nli :: snli_tr_1_1 ...
  splits: ['train', 'validation', 'test']
Loading yilmazzey/sdp2-nli :: multinli_tr_1_1 ...
  splits: ['train', 'validation_matched', 'validation_mismatched']
Loading yilmazzey/sdp2-nli :: trglue_mnli ...
  splits: ['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched']


In [7]:
SYSTEM_PROMPT = """You are a natural language inference classifier. You must answer with exactly one word and nothing else: entailment, neutral, or contradiction. No explanation, no punctuation, no extra text. Only one of these three words."""


def nli_user_prompt(premise, hypothesis):
    return f"""Premise: {premise}
Hypothesis: {hypothesis}
Does the premise entail, is neutral to, or contradict the hypothesis? Answer with only one word: entailment, neutral, or contradiction."""


def ollama_chat_single(user_content: str) -> str:
    """Aya-23 chat: system + user turn. Returns assistant message content only."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
    response = ollama.chat(
        model=MODEL_ID,
        messages=messages,
        options={
            "num_predict": MAX_TOKENS,
            "temperature": TEMPERATURE,
            "top_p": TOP_P,
        },
    )
    return (response.get("message") or {}).get("content", "") or ""


LABEL_WORD_TO_ID = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2,
    "içerme": 0,
    "tarafsız": 1,
    "nötr": 1,
    "çelişki": 2,
}


def parse_generated_label(raw_text: str) -> int:
    """Extract first word from model output; strip punctuation; lowercase; map EN+TR; default 1 (neutral)."""
    if not raw_text or not isinstance(raw_text, str):
        return 1
    text = raw_text.strip()
    if not text:
        return 1
    # First token/word (split by whitespace)
    parts = text.split()
    first = parts[0] if parts else ""
    # Strip punctuation and lowercase
    first = re.sub(r"[.,;:!?\"'()\[\]]+", "", first).strip().lower()
    if not first:
        return 1
    return LABEL_WORD_TO_ID.get(first, 1)
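
The parser above inspects only the first whitespace-delimited token and falls back to neutral, so a reply such as "The answer is entailment." would be scored as neutral. The following is a minimal sketch of a more lenient variant, assuming it is acceptable to scan the whole reply for the first known label word and to return `None` for unparseable replies so they can be counted separately. `parse_label_lenient` and `LABEL_WORDS` are hypothetical names and are not part of the notebook.

```python
import re
from typing import Optional

# Hypothetical copy of the notebook's label vocabulary (English + Turkish).
LABEL_WORDS = {
    "entailment": 0, "neutral": 1, "contradiction": 2,
    "içerme": 0, "tarafsız": 1, "nötr": 1, "çelişki": 2,
}

# \w is Unicode-aware for str patterns, so Turkish letters stay inside tokens.
_TOKEN_RE = re.compile(r"\w+")


def parse_label_lenient(raw_text: object) -> Optional[int]:
    """Return the id of the first known label word anywhere in the reply, else None."""
    if not isinstance(raw_text, str):
        return None
    for token in _TOKEN_RE.findall(raw_text.lower()):
        if token in LABEL_WORDS:
            return LABEL_WORDS[token]
    return None
```

Counting the `None` results per split would show how often the neutral fallback is actually triggered. The trade-off is that scanning the whole reply can misread negated answers such as "not entailment", so this is only one possible design.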

In [8]:
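def run_prompted_inference(ds):
    """Classify every example in `ds`; return (y_true, y_pred) as int64 arrays."""
    premises = ds["premise"]
    hypotheses = ds["hypothesis"]
    labels = ds["label"]
    n = len(labels)
    y_pred = []
    # Echo the first five examples and every 100th one for a quick sanity check.
    debug_indices = set(list(range(min(5, n))) + list(range(0, n, 100)))

    # Each example is sent to Ollama in its own request; BATCH_SIZE only sets
    # how many examples one progress-bar step covers.
    for start in tqdm(range(0, n, BATCH_SIZE), desc="Inference"):
        end = min(start + BATCH_SIZE, n)
        for i in range(start, end):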
            user_content = nli_user_prompt(premises[i], hypotheses[i])
            raw = ollama_chat_single(user_content)
            label_id = parse_generated_label(raw)
            y_pred.append(label_id)
            if i in debug_indices:
                print(f"[sample {i}] raw: {repr(raw)} -> parsed: {label_id} ({LABEL_NAMES[label_id]})")

    y_true = np.array(labels, dtype=np.int64)
    y_pred = np.array(y_pred, dtype=np.int64)
    print("True label dist:", dict(Counter(y_true)))
    print("Pred label dist:", dict(Counter(y_pred)))
    return y_true, y_pred
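
Because each example is sent to Ollama in its own request, `BATCH_SIZE` only determines how many examples one progress-bar step covers. The sketch below works through that arithmetic for the SNLI test split, using the label counts reported in the run output. `progress_steps` is an illustrative helper and is not part of the notebook.

```python
import math


def progress_steps(num_examples: int, batch_size: int) -> int:
    """Number of iterations of range(0, num_examples, batch_size)."""
    return math.ceil(num_examples / batch_size)


# 3219 + 3368 + 3237 labels in the SNLI test split, grouped in steps of 6.
snli_examples = 3219 + 3368 + 3237
print(snli_examples, progress_steps(snli_examples, 6))  # prints: 9824 1638
```

This matches the 1638 steps shown by the progress bar. The final progress line for that split reports 1.52 s per step, which corresponds to roughly 0.25 s per example.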

In [9]:
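def compute_metrics(y_true, y_pred):
    """Return (metrics dict, confusion matrix); matrix rows are true labels, columns are predictions."""
    # Pin the label set so every output has NUM_LABELS entries even if a class never occurs.
    label_ids = list(range(NUM_LABELS))
    acc = float(accuracy_score(y_true, y_pred))
    f1_macro = float(f1_score(y_true, y_pred, labels=label_ids, average="macro", zero_division=0))
    f1_per_class = f1_score(y_true, y_pred, labels=label_ids, average=None, zero_division=0)
    f1_per_class = {LABEL_NAMES[i]: float(f1_per_class[i]) for i in range(NUM_LABELS)}
    cm = confusion_matrix(y_true, y_pred, labels=label_ids)
    out = {"accuracy": acc, "f1_macro": f1_macro, "f1_per_class": f1_per_class}
    return out, cm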


def save_confusion_plot(cm, path):
    if not HAS_PLOT:
        return
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt="d", xticklabels=LABEL_NAMES, yticklabels=LABEL_NAMES, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
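
To make the reported metrics concrete, the following sketch recomputes accuracy and macro F1 from a confusion matrix in plain Python, following the scikit-learn convention that rows are true labels and columns are predicted labels. The matrix in the usage lines is a made-up toy example, not a result from this run, and `metrics_from_confusion` is a hypothetical helper.

```python
from typing import Dict, List


def metrics_from_confusion(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy and macro F1 from a square confusion matrix (rows=true, cols=pred)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    f1_scores = []
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp  # predicted i, true label differs
        fn = sum(cm[i]) - tp                       # true i, predicted otherwise
        denom = 2 * tp + fp + fn
        # Mirror zero_division=0: a class with no support and no predictions scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {
        "accuracy": correct / total if total else 0.0,
        "f1_macro": sum(f1_scores) / k if k else 0.0,
    }


# Toy matrix for illustration only (not from the evaluation in this notebook).
toy_cm = [[8, 1, 1], [4, 2, 4], [1, 0, 9]]
print(metrics_from_confusion(toy_cm))
```

In the toy matrix the middle class is rarely predicted, so its F1 is low and pulls macro F1 below accuracy. This is the same kind of gap that appears when a model under-predicts one class.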

In [13]:
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)
all_metrics = {}

for config_name in CONFIGS:
    ds_dict = datasets[config_name]
    split_names = EVAL_SPLITS[config_name]
    all_metrics[config_name] = {}

    for split_name in split_names:
        if split_name not in ds_dict:
            print(f"  Skip {config_name}/{split_name} (missing)")
            continue
        ds = ds_dict[split_name]
        print(f"Evaluating {config_name} / {split_name} ...")
        y_true, y_pred = run_prompted_inference(ds)
        metrics, cm = compute_metrics(y_true, y_pred)
        all_metrics[config_name][split_name] = metrics

        cm_path = Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.csv"
        np.savetxt(cm_path, cm, fmt="%d", delimiter=",")
        save_confusion_plot(cm, Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.png")

        print(f"  accuracy={metrics['accuracy']:.4f}, f1_macro={metrics['f1_macro']:.4f}")

with open(Path(RESULTS_DIR) / "metrics.json", "w") as f:
    json.dump(all_metrics, f, indent=2)
print(f"Saved {RESULTS_DIR}/metrics.json")

Evaluating snli_tr_1_1 / test ...

[sample 0] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)
[sample 100] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 400] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 700] raw: 'neutral' -> parsed: 1 (neutral)
[sample 800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 900] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1000] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1200] raw: 'neutral' -> parsed: 1 (neutral)
[sample 1300] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1400] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1700] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1900] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2000] raw: 'neutral' -> parsed: 1 (neutral)
[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2600] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2700] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3000] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3100] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3600] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4100] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4400] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4600] raw: 'neutral' -> parsed: 1 (neutral)
[sample 4700] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4800] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5000] raw: 'neutral' -> parsed: 1 (neutral)
[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 5300] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 5400] raw: 'neutral' -> parsed: 1 (neutral)
[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5600] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 5700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6100] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 6200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6500] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 6600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6800] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 6900] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 7000] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 7100] raw: 'neutral' -> parsed: 1 (neutral)
[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7300] raw: 'neutral' -> parsed: 1 (neutral)
[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8400] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8800] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9200] raw: 'neutral' -> parsed: 1 (neutral)
[sample 9300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9500] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 9600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9800] raw: 'neutral' -> parsed: 1 (neutral)

Inference: 100%|██████████| 1638/1638 [41:37<00:00,  1.52s/it]

True label dist: {np.int64(1): 3219, np.int64(0): 3368, np.int64(2): 3237}
Pred label dist: {np.int64(2): 2825, np.int64(0): 6354, np.int64(1): 645}
  accuracy=0.6226, f1_macro=0.5644
Evaluating multinli_tr_1_1 / validation_matched ...

[sample 0] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)
[sample 100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 300] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 400] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 500] raw: 'neutral' -> parsed: 1 (neutral)
[sample 600] raw: 'neutral' -> parsed: 1 (neutral)
[sample 700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 800] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 900] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1000] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1300] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1600] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2000] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2300] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3100] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3300] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3400] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3600] raw: 'neutral' -> parsed: 1 (neutral)
[sample 3700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3800] raw: 'neutral' -> parsed: 1 (neutral)
[sample 3900] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4000] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 5300] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6000] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 6100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6200] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 6800] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 6900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7600] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 7700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)
[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9100] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9200] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9300] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9400] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9500] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9600] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9700] raw: 'entailment' -> parsed: 0 (entailment)
[sample 9800] raw: 'entailment' -> parsed: 0 (entailment)

Inference: 100%|██████████| 1635/1635 [49:40<00:00,  1.82s/it]

True label dist: {np.int64(1): 3123, np.int64(2): 3211, np.int64(0): 3475}
Pred label dist: {np.int64(0): 7329, np.int64(2): 2272, np.int64(1): 208}
  accuracy=0.5299, f1_macro=0.4398
Evaluating multinli_tr_1_1 / validation_mismatched ...


Inference:   0%|          | 0/1638 [00:00<?, ?it/s]

[sample 0] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   1%|          | 16/1638 [00:30<49:03,  1.81s/it]

[sample 100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   2%|▏         | 33/1638 [01:02<50:58,  1.91s/it]

[sample 200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   3%|▎         | 50/1638 [01:33<49:44,  1.88s/it]

[sample 300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   4%|▍         | 66/1638 [02:03<49:33,  1.89s/it]

[sample 400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   5%|▌         | 83/1638 [02:35<48:42,  1.88s/it]

[sample 500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   6%|▌         | 100/1638 [03:07<48:26,  1.89s/it]

[sample 600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   7%|▋         | 116/1638 [03:37<47:46,  1.88s/it]

[sample 700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   8%|▊         | 133/1638 [04:09<47:44,  1.90s/it]

[sample 800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   9%|▉         | 150/1638 [04:40<44:51,  1.81s/it]

[sample 900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  10%|█         | 166/1638 [05:09<43:20,  1.77s/it]

[sample 1000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  11%|█         | 183/1638 [05:40<42:50,  1.77s/it]

[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  12%|█▏        | 200/1638 [06:11<43:37,  1.82s/it]

[sample 1200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  13%|█▎        | 216/1638 [06:41<45:24,  1.92s/it]

[sample 1300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  14%|█▍        | 233/1638 [07:12<41:42,  1.78s/it]

[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  15%|█▌        | 250/1638 [07:44<41:35,  1.80s/it]

[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  16%|█▌        | 266/1638 [08:15<44:59,  1.97s/it]

[sample 1600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  17%|█▋        | 283/1638 [08:47<42:10,  1.87s/it]

[sample 1700] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  18%|█▊        | 300/1638 [09:18<42:46,  1.92s/it]

[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  19%|█▉        | 316/1638 [09:49<40:50,  1.85s/it]

[sample 1900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  20%|██        | 333/1638 [10:21<41:10,  1.89s/it]

[sample 2000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  21%|██▏       | 350/1638 [10:53<40:28,  1.89s/it]

[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  22%|██▏       | 366/1638 [11:22<40:15,  1.90s/it]

[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  23%|██▎       | 383/1638 [11:55<39:14,  1.88s/it]

[sample 2300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  24%|██▍       | 400/1638 [12:27<38:00,  1.84s/it]

[sample 2400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  25%|██▌       | 416/1638 [12:57<39:03,  1.92s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  26%|██▋       | 433/1638 [13:28<37:18,  1.86s/it]

[sample 2600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  27%|██▋       | 450/1638 [14:00<36:43,  1.85s/it]

[sample 2700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  28%|██▊       | 466/1638 [14:29<35:36,  1.82s/it]

[sample 2800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  29%|██▉       | 483/1638 [15:02<35:22,  1.84s/it]

[sample 2900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███       | 500/1638 [15:35<36:16,  1.91s/it]

[sample 3000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  32%|███▏      | 516/1638 [16:05<34:11,  1.83s/it]

[sample 3100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  33%|███▎      | 533/1638 [16:38<36:22,  1.98s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  34%|███▎      | 550/1638 [17:10<34:26,  1.90s/it]

[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  35%|███▍      | 566/1638 [17:40<32:50,  1.84s/it]

[sample 3400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  36%|███▌      | 583/1638 [18:11<32:39,  1.86s/it]

[sample 3500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  37%|███▋      | 600/1638 [18:43<32:12,  1.86s/it]

[sample 3600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  38%|███▊      | 616/1638 [19:13<32:29,  1.91s/it]

[sample 3700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▊      | 633/1638 [19:45<31:55,  1.91s/it]

[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|███▉      | 650/1638 [20:18<31:00,  1.88s/it]

[sample 3900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  41%|████      | 666/1638 [20:47<29:13,  1.80s/it]

[sample 4000] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  42%|████▏     | 683/1638 [21:19<29:02,  1.82s/it]

[sample 4100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  43%|████▎     | 700/1638 [21:51<28:27,  1.82s/it]

[sample 4200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  44%|████▎     | 716/1638 [22:22<29:31,  1.92s/it]

[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  45%|████▍     | 733/1638 [22:54<28:10,  1.87s/it]

[sample 4400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  46%|████▌     | 750/1638 [23:26<27:23,  1.85s/it]

[sample 4500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  47%|████▋     | 766/1638 [23:55<26:45,  1.84s/it]

[sample 4600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  48%|████▊     | 783/1638 [24:27<26:56,  1.89s/it]

[sample 4700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  49%|████▉     | 800/1638 [24:59<25:58,  1.86s/it]

[sample 4800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  50%|████▉     | 816/1638 [25:28<25:44,  1.88s/it]

[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  51%|█████     | 833/1638 [26:00<24:53,  1.86s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  52%|█████▏    | 850/1638 [26:33<25:51,  1.97s/it]

[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  53%|█████▎    | 866/1638 [27:02<24:12,  1.88s/it]

[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  54%|█████▍    | 883/1638 [27:34<23:47,  1.89s/it]

[sample 5300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  55%|█████▍    | 900/1638 [28:06<22:59,  1.87s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  56%|█████▌    | 916/1638 [28:37<23:46,  1.98s/it]

[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  57%|█████▋    | 933/1638 [29:10<24:05,  2.05s/it]

[sample 5600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  58%|█████▊    | 950/1638 [29:42<21:53,  1.91s/it]

[sample 5700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  59%|█████▉    | 966/1638 [30:12<21:15,  1.90s/it]

[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  60%|██████    | 983/1638 [30:43<21:23,  1.96s/it]

[sample 5900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  61%|██████    | 1000/1638 [31:15<20:12,  1.90s/it]

[sample 6000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 1016/1638 [31:46<19:56,  1.92s/it]

[sample 6100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  63%|██████▎   | 1033/1638 [32:18<19:00,  1.88s/it]

[sample 6200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  64%|██████▍   | 1050/1638 [32:50<18:17,  1.87s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 1066/1638 [33:20<17:45,  1.86s/it]

[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  66%|██████▌   | 1083/1638 [33:53<17:53,  1.93s/it]

[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  67%|██████▋   | 1100/1638 [34:26<17:30,  1.95s/it]

[sample 6600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  68%|██████▊   | 1116/1638 [34:56<16:17,  1.87s/it]

[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  69%|██████▉   | 1133/1638 [35:28<15:27,  1.84s/it]

[sample 6800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  70%|███████   | 1150/1638 [36:00<15:49,  1.94s/it]

[sample 6900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  71%|███████   | 1166/1638 [36:31<14:38,  1.86s/it]

[sample 7000] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  72%|███████▏  | 1183/1638 [37:02<14:00,  1.85s/it]

[sample 7100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  73%|███████▎  | 1200/1638 [37:33<13:31,  1.85s/it]

[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▍  | 1216/1638 [38:03<13:07,  1.87s/it]

[sample 7300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  75%|███████▌  | 1233/1638 [38:35<12:52,  1.91s/it]

[sample 7400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  76%|███████▋  | 1250/1638 [39:07<12:07,  1.87s/it]

[sample 7500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1266/1638 [39:36<11:37,  1.88s/it]

[sample 7600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  78%|███████▊  | 1283/1638 [40:07<11:11,  1.89s/it]

[sample 7700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  79%|███████▉  | 1300/1638 [40:39<10:30,  1.87s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  80%|████████  | 1316/1638 [41:09<10:05,  1.88s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  81%|████████▏ | 1333/1638 [41:42<09:33,  1.88s/it]

[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  82%|████████▏ | 1350/1638 [42:14<09:07,  1.90s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  83%|████████▎ | 1366/1638 [42:44<08:18,  1.83s/it]

[sample 8200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  84%|████████▍ | 1383/1638 [43:16<08:12,  1.93s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  85%|████████▌ | 1400/1638 [43:49<07:37,  1.92s/it]

[sample 8400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  86%|████████▋ | 1416/1638 [44:18<06:37,  1.79s/it]

[sample 8500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  87%|████████▋ | 1433/1638 [44:50<06:27,  1.89s/it]

[sample 8600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▊ | 1450/1638 [45:22<06:21,  2.03s/it]

[sample 8700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▉ | 1466/1638 [45:52<05:33,  1.94s/it]

[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1483/1638 [46:24<04:55,  1.91s/it]

[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  92%|█████████▏| 1500/1638 [46:56<04:09,  1.80s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1516/1638 [47:26<03:50,  1.89s/it]

[sample 9100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  94%|█████████▎| 1533/1638 [47:58<03:09,  1.80s/it]

[sample 9200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  95%|█████████▍| 1550/1638 [48:31<02:48,  1.91s/it]

[sample 9300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  96%|█████████▌| 1566/1638 [49:01<02:15,  1.88s/it]

[sample 9400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  97%|█████████▋| 1583/1638 [49:33<01:41,  1.85s/it]

[sample 9500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  98%|█████████▊| 1600/1638 [50:05<01:16,  2.01s/it]

[sample 9600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  99%|█████████▊| 1616/1638 [50:35<00:40,  1.86s/it]

[sample 9700] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|█████████▉| 1633/1638 [51:08<00:09,  1.85s/it]

[sample 9800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference: 100%|██████████| 1638/1638 [51:16<00:00,  1.88s/it]


True label dist: {np.int64(2): 3240, np.int64(0): 3456, np.int64(1): 3129}
Pred label dist: {np.int64(0): 7521, np.int64(2): 2134, np.int64(1): 170}
  accuracy=0.5275, f1_macro=0.4350
Evaluating trglue_mnli / test_matched ...


Inference:   0%|          | 0/1502 [00:00<?, ?it/s]

[sample 0] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   1%|          | 16/1502 [00:30<47:21,  1.91s/it]

[sample 100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   2%|▏         | 33/1502 [01:01<45:08,  1.84s/it]

[sample 200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   3%|▎         | 50/1502 [01:32<42:54,  1.77s/it]

[sample 300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   4%|▍         | 66/1502 [02:02<44:39,  1.87s/it]

[sample 400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   6%|▌         | 83/1502 [02:34<45:21,  1.92s/it]

[sample 500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   7%|▋         | 100/1502 [03:06<43:46,  1.87s/it]

[sample 600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   8%|▊         | 116/1502 [03:36<43:08,  1.87s/it]

[sample 700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   9%|▉         | 133/1502 [04:08<43:12,  1.89s/it]

[sample 800] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  10%|▉         | 150/1502 [04:41<42:12,  1.87s/it]

[sample 900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  11%|█         | 166/1502 [05:11<40:55,  1.84s/it]

[sample 1000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  12%|█▏        | 183/1502 [05:43<40:32,  1.84s/it]

[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  13%|█▎        | 200/1502 [06:15<40:19,  1.86s/it]

[sample 1200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  14%|█▍        | 216/1502 [06:44<39:49,  1.86s/it]

[sample 1300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  16%|█▌        | 233/1502 [07:15<38:53,  1.84s/it]

[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  17%|█▋        | 250/1502 [07:47<39:35,  1.90s/it]

[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  18%|█▊        | 266/1502 [08:16<38:31,  1.87s/it]

[sample 1600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  19%|█▉        | 283/1502 [08:47<37:49,  1.86s/it]

[sample 1700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  20%|█▉        | 300/1502 [09:19<37:50,  1.89s/it]

[sample 1800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  21%|██        | 316/1502 [09:48<35:30,  1.80s/it]

[sample 1900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  22%|██▏       | 333/1502 [10:20<37:35,  1.93s/it]

[sample 2000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  23%|██▎       | 350/1502 [10:52<37:17,  1.94s/it]

[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  24%|██▍       | 366/1502 [11:21<33:13,  1.75s/it]

[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  25%|██▌       | 383/1502 [11:52<33:38,  1.80s/it]

[sample 2300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  27%|██▋       | 400/1502 [12:23<35:01,  1.91s/it]

[sample 2400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  28%|██▊       | 416/1502 [12:54<34:05,  1.88s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  29%|██▉       | 433/1502 [13:25<32:22,  1.82s/it]

[sample 2600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  30%|██▉       | 450/1502 [13:57<32:47,  1.87s/it]

[sample 2700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███       | 466/1502 [14:27<32:31,  1.88s/it]

[sample 2800] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  32%|███▏      | 483/1502 [14:59<31:56,  1.88s/it]

[sample 2900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  33%|███▎      | 500/1502 [15:31<31:15,  1.87s/it]

[sample 3000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  34%|███▍      | 516/1502 [16:01<31:04,  1.89s/it]

[sample 3100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  35%|███▌      | 533/1502 [16:34<32:05,  1.99s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  37%|███▋      | 550/1502 [17:06<29:33,  1.86s/it]

[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  38%|███▊      | 566/1502 [17:36<29:52,  1.92s/it]

[sample 3400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▉      | 583/1502 [18:07<27:42,  1.81s/it]

[sample 3500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|███▉      | 600/1502 [18:39<27:22,  1.82s/it]

[sample 3600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  41%|████      | 616/1502 [19:09<27:52,  1.89s/it]

[sample 3700] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  42%|████▏     | 633/1502 [19:41<27:10,  1.88s/it]

[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  43%|████▎     | 650/1502 [20:12<26:08,  1.84s/it]

[sample 3900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  44%|████▍     | 666/1502 [20:42<26:02,  1.87s/it]

[sample 4000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  45%|████▌     | 683/1502 [21:14<25:33,  1.87s/it]

[sample 4100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  47%|████▋     | 700/1502 [21:46<24:16,  1.82s/it]

[sample 4200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  48%|████▊     | 716/1502 [22:16<24:51,  1.90s/it]

[sample 4300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  49%|████▉     | 733/1502 [22:48<24:07,  1.88s/it]

[sample 4400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  50%|████▉     | 750/1502 [23:19<23:50,  1.90s/it]

[sample 4500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  51%|█████     | 766/1502 [23:50<23:12,  1.89s/it]

[sample 4600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  52%|█████▏    | 783/1502 [24:22<23:10,  1.93s/it]

[sample 4700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  53%|█████▎    | 800/1502 [24:53<21:40,  1.85s/it]

[sample 4800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  54%|█████▍    | 816/1502 [25:24<21:51,  1.91s/it]

[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  55%|█████▌    | 833/1502 [25:57<21:16,  1.91s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  57%|█████▋    | 850/1502 [26:29<19:42,  1.81s/it]

[sample 5100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  58%|█████▊    | 866/1502 [27:00<20:12,  1.91s/it]

[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  59%|█████▉    | 883/1502 [27:31<18:29,  1.79s/it]

[sample 5300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  60%|█████▉    | 900/1502 [28:03<18:11,  1.81s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  61%|██████    | 916/1502 [28:32<18:05,  1.85s/it]

[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 933/1502 [29:05<18:23,  1.94s/it]

[sample 5600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  63%|██████▎   | 950/1502 [29:37<17:10,  1.87s/it]

[sample 5700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  64%|██████▍   | 966/1502 [30:07<17:00,  1.90s/it]

[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 983/1502 [30:40<16:36,  1.92s/it]

[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  67%|██████▋   | 1000/1502 [31:11<15:02,  1.80s/it]

[sample 6000] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  68%|██████▊   | 1016/1502 [31:41<15:14,  1.88s/it]

[sample 6100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  69%|██████▉   | 1033/1502 [32:13<14:59,  1.92s/it]

[sample 6200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  70%|██████▉   | 1050/1502 [32:45<14:24,  1.91s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  71%|███████   | 1066/1502 [33:15<13:40,  1.88s/it]

[sample 6400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  72%|███████▏  | 1083/1502 [33:48<13:37,  1.95s/it]

[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  73%|███████▎  | 1100/1502 [34:19<12:33,  1.87s/it]

[sample 6600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▍  | 1116/1502 [34:48<11:45,  1.83s/it]

[sample 6700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  75%|███████▌  | 1133/1502 [35:21<11:38,  1.89s/it]

[sample 6800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1150/1502 [35:53<11:13,  1.91s/it]

[sample 6900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  78%|███████▊  | 1166/1502 [36:23<10:32,  1.88s/it]

[sample 7000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  79%|███████▉  | 1183/1502 [36:56<10:23,  1.95s/it]

[sample 7100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  80%|███████▉  | 1200/1502 [37:28<09:40,  1.92s/it]

[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  81%|████████  | 1216/1502 [37:59<08:56,  1.87s/it]

[sample 7300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  82%|████████▏ | 1233/1502 [38:32<08:47,  1.96s/it]

[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  83%|████████▎ | 1250/1502 [39:04<07:42,  1.83s/it]

[sample 7500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  84%|████████▍ | 1266/1502 [39:36<07:39,  1.95s/it]

[sample 7600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  85%|████████▌ | 1283/1502 [40:08<07:10,  1.97s/it]

[sample 7700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  87%|████████▋ | 1300/1502 [40:41<06:35,  1.96s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  88%|████████▊ | 1316/1502 [41:12<05:59,  1.93s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▊ | 1333/1502 [41:45<05:15,  1.87s/it]

[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  90%|████████▉ | 1350/1502 [42:17<04:41,  1.85s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1366/1502 [42:48<04:24,  1.95s/it]

[sample 8200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  92%|█████████▏| 1383/1502 [43:20<03:51,  1.95s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1400/1502 [43:54<03:21,  1.97s/it]

[sample 8400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  94%|█████████▍| 1416/1502 [44:24<02:42,  1.89s/it]

[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  95%|█████████▌| 1433/1502 [44:56<02:06,  1.84s/it]

[sample 8600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  97%|█████████▋| 1450/1502 [45:29<01:38,  1.90s/it]

[sample 8700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  98%|█████████▊| 1466/1502 [45:59<01:06,  1.85s/it]

[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  99%|█████████▊| 1483/1502 [46:32<00:36,  1.94s/it]

[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|█████████▉| 1500/1502 [47:05<00:03,  1.98s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|██████████| 1502/1502 [47:08<00:00,  1.88s/it]


True label dist: {np.int64(1): 3138, np.int64(2): 2946, np.int64(0): 2924}
Pred label dist: {np.int64(0): 5736, np.int64(2): 2773, np.int64(1): 499}
  accuracy=0.5694, f1_macro=0.5141
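
The distribution lines printed so far suggest why macro F1 sits below accuracy: the model rarely answers `neutral`, even though the true labels are roughly balanced. The sketch below only redoes that arithmetic on counts copied by hand from the `True label dist` / `Pred label dist` lines above. It does not rerun the model, and the helper names `label_shares` and `neutral_recall_upper_bound` are hypothetical, not part of the notebook.

```python
# Counts copied by hand from the log lines above (0=entailment, 1=neutral, 2=contradiction).
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

printed_counts = {
    "multinli_tr_1_1/validation_mismatched": {
        "true": {0: 3456, 1: 3129, 2: 3240},
        "pred": {0: 7521, 1: 170, 2: 2134},
    },
    "trglue_mnli/test_matched": {
        "true": {0: 2924, 1: 3138, 2: 2946},
        "pred": {0: 5736, 1: 499, 2: 2773},
    },
}


def label_shares(counts):
    """Return each label's fraction of the total, keyed by label name."""
    total = sum(counts.values())
    return {LABEL_NAMES[k]: counts[k] / total for k in sorted(counts)}


def neutral_recall_upper_bound(true_counts, pred_counts):
    """Upper bound on neutral recall.

    Recall is TP / (number of true neutral samples), and TP can be at most
    the number of neutral predictions, so the bound holds even if every
    neutral prediction were correct.
    """
    neutral = LABEL_NAMES.index("neutral")
    return min(pred_counts[neutral], true_counts[neutral]) / true_counts[neutral]


for split, dists in printed_counts.items():
    true_share = label_shares(dists["true"])
    pred_share = label_shares(dists["pred"])
    bound = neutral_recall_upper_bound(dists["true"], dists["pred"])
    print(split)
    for name in LABEL_NAMES:
        print(f"  {name:<13} true={true_share[name]:.3f} pred={pred_share[name]:.3f}")
    print(f"  neutral recall is at most {bound:.3f}")
```

Because the bound depends only on the two printed distributions, it can be checked for any split without the per-sample predictions; the actual per-class F1 still has to come from the confusion matrix saved to `./results/`.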
Evaluating trglue_mnli / test_mismatched ...


Inference:   0%|          | 0/1537 [00:00<?, ?it/s]

[sample 0] raw: 'neutral' -> parsed: 1 (neutral)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   1%|          | 16/1537 [00:31<49:21,  1.95s/it]

[sample 100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   2%|▏         | 33/1537 [01:03<47:57,  1.91s/it]

[sample 200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   3%|▎         | 50/1537 [01:36<48:46,  1.97s/it]

[sample 300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   4%|▍         | 66/1537 [02:07<46:44,  1.91s/it]

[sample 400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   5%|▌         | 83/1537 [02:40<47:39,  1.97s/it]

[sample 500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   7%|▋         | 100/1537 [03:13<45:44,  1.91s/it]

[sample 600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   8%|▊         | 116/1537 [03:44<45:08,  1.91s/it]

[sample 700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   9%|▊         | 133/1537 [04:17<45:54,  1.96s/it]

[sample 800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  10%|▉         | 150/1537 [04:49<43:27,  1.88s/it]

[sample 900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  11%|█         | 166/1537 [05:20<44:41,  1.96s/it]

[sample 1000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  12%|█▏        | 183/1537 [05:53<45:21,  2.01s/it]

[sample 1100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  13%|█▎        | 200/1537 [06:26<43:01,  1.93s/it]

[sample 1200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  14%|█▍        | 216/1537 [06:58<43:41,  1.98s/it]

[sample 1300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  15%|█▌        | 233/1537 [07:30<42:33,  1.96s/it]

[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  16%|█▋        | 250/1537 [08:03<40:30,  1.89s/it]

[sample 1500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  17%|█▋        | 266/1537 [08:33<40:00,  1.89s/it]

[sample 1600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  18%|█▊        | 283/1537 [09:06<40:01,  1.92s/it]

[sample 1700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  20%|█▉        | 300/1537 [09:38<38:26,  1.86s/it]

[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  21%|██        | 316/1537 [10:08<38:53,  1.91s/it]

[sample 1900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  22%|██▏       | 333/1537 [10:41<38:53,  1.94s/it]

[sample 2000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  23%|██▎       | 350/1537 [11:13<37:42,  1.91s/it]

[sample 2100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  24%|██▍       | 366/1537 [11:44<38:05,  1.95s/it]

[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  25%|██▍       | 383/1537 [12:16<35:20,  1.84s/it]

[sample 2300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  26%|██▌       | 400/1537 [12:49<36:27,  1.92s/it]

[sample 2400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  27%|██▋       | 416/1537 [13:21<37:11,  1.99s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  28%|██▊       | 433/1537 [13:53<36:03,  1.96s/it]

[sample 2600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  29%|██▉       | 450/1537 [14:27<36:37,  2.02s/it]

[sample 2700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  30%|███       | 466/1537 [14:58<34:10,  1.91s/it]

[sample 2800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███▏      | 483/1537 [15:31<34:59,  1.99s/it]

[sample 2900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  33%|███▎      | 500/1537 [16:04<33:08,  1.92s/it]

[sample 3000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  34%|███▎      | 516/1537 [16:34<32:17,  1.90s/it]

[sample 3100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  35%|███▍      | 533/1537 [17:07<31:03,  1.86s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  36%|███▌      | 550/1537 [17:39<30:50,  1.88s/it]

[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  37%|███▋      | 566/1537 [18:10<31:15,  1.93s/it]

[sample 3400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  38%|███▊      | 583/1537 [18:44<30:50,  1.94s/it]

[sample 3500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▉      | 600/1537 [19:17<31:19,  2.01s/it]

[sample 3600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|████      | 616/1537 [19:47<27:52,  1.82s/it]

[sample 3700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  41%|████      | 633/1537 [20:20<30:00,  1.99s/it]

[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  42%|████▏     | 650/1537 [20:54<29:26,  1.99s/it]

[sample 3900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  43%|████▎     | 666/1537 [21:26<27:59,  1.93s/it]

[sample 4000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  44%|████▍     | 683/1537 [21:59<27:45,  1.95s/it]

[sample 4100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  46%|████▌     | 700/1537 [22:32<26:20,  1.89s/it]

[sample 4200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  47%|████▋     | 716/1537 [23:03<26:51,  1.96s/it]

[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  48%|████▊     | 733/1537 [23:36<25:41,  1.92s/it]

[sample 4400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  49%|████▉     | 750/1537 [24:09<26:18,  2.01s/it]

[sample 4500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  50%|████▉     | 766/1537 [24:40<25:48,  2.01s/it]

[sample 4600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  51%|█████     | 783/1537 [25:13<24:20,  1.94s/it]

[sample 4700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  52%|█████▏    | 800/1537 [25:47<24:05,  1.96s/it]

[sample 4800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  53%|█████▎    | 816/1537 [26:18<23:21,  1.94s/it]

[sample 4900] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  54%|█████▍    | 833/1537 [26:51<23:19,  1.99s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  55%|█████▌    | 850/1537 [27:24<22:28,  1.96s/it]

[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  56%|█████▋    | 866/1537 [27:55<21:34,  1.93s/it]

[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  57%|█████▋    | 883/1537 [28:28<21:33,  1.98s/it]

[sample 5300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  59%|█████▊    | 900/1537 [29:02<21:08,  1.99s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  60%|█████▉    | 916/1537 [29:33<20:19,  1.96s/it]

[sample 5500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  61%|██████    | 933/1537 [30:06<19:29,  1.94s/it]

[sample 5600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 950/1537 [30:40<19:44,  2.02s/it]

[sample 5700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  63%|██████▎   | 966/1537 [31:12<19:39,  2.07s/it]

[sample 5800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  64%|██████▍   | 983/1537 [31:45<17:56,  1.94s/it]

[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 1000/1537 [32:18<17:20,  1.94s/it]

[sample 6000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  66%|██████▌   | 1016/1537 [32:50<17:14,  1.99s/it]

[sample 6100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  67%|██████▋   | 1033/1537 [33:23<16:22,  1.95s/it]

[sample 6200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  68%|██████▊   | 1050/1537 [33:56<16:12,  2.00s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  69%|██████▉   | 1066/1537 [34:27<15:22,  1.96s/it]

[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  70%|███████   | 1083/1537 [35:00<14:21,  1.90s/it]

[sample 6500] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  72%|███████▏  | 1100/1537 [35:33<14:07,  1.94s/it]

[sample 6600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  73%|███████▎  | 1116/1537 [36:04<13:31,  1.93s/it]

[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▎  | 1133/1537 [36:38<14:04,  2.09s/it]

[sample 6800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  75%|███████▍  | 1150/1537 [37:12<13:14,  2.05s/it]

[sample 6900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  76%|███████▌  | 1166/1537 [37:44<12:00,  1.94s/it]

[sample 7000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1183/1537 [38:17<11:37,  1.97s/it]

[sample 7100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  78%|███████▊  | 1200/1537 [38:50<11:05,  1.97s/it]

[sample 7200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  79%|███████▉  | 1216/1537 [39:22<10:49,  2.02s/it]

[sample 7300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  80%|████████  | 1233/1537 [39:56<09:47,  1.93s/it]

[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  81%|████████▏ | 1250/1537 [40:29<09:35,  2.00s/it]

[sample 7500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  82%|████████▏ | 1266/1537 [41:01<09:02,  2.00s/it]

[sample 7600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  83%|████████▎ | 1283/1537 [41:34<08:19,  1.97s/it]

[sample 7700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  85%|████████▍ | 1300/1537 [42:07<07:49,  1.98s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  86%|████████▌ | 1316/1537 [42:38<06:59,  1.90s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  87%|████████▋ | 1333/1537 [43:12<06:41,  1.97s/it]

[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  88%|████████▊ | 1350/1537 [43:46<06:27,  2.07s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▉ | 1366/1537 [44:18<05:39,  1.98s/it]

[sample 8200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  90%|████████▉ | 1383/1537 [44:51<05:00,  1.95s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1400/1537 [45:24<04:31,  1.98s/it]

[sample 8400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  92%|█████████▏| 1416/1537 [45:56<04:03,  2.01s/it]

[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1433/1537 [46:29<03:25,  1.97s/it]

[sample 8600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  94%|█████████▍| 1450/1537 [47:02<02:48,  1.94s/it]

[sample 8700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  95%|█████████▌| 1466/1537 [47:33<02:18,  1.95s/it]

[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  96%|█████████▋| 1483/1537 [48:06<01:46,  1.97s/it]

[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  98%|█████████▊| 1500/1537 [48:40<01:14,  2.01s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  99%|█████████▊| 1516/1537 [49:13<00:43,  2.06s/it]

[sample 9100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference: 100%|█████████▉| 1533/1537 [49:45<00:07,  1.87s/it]

[sample 9200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference: 100%|██████████| 1537/1537 [49:52<00:00,  1.95s/it]

True label dist: {np.int64(1): 3043, np.int64(0): 3101, np.int64(2): 3073}
Pred label dist: {np.int64(1): 619, np.int64(0): 5805, np.int64(2): 2793}
  accuracy=0.6156, f1_macro=0.5645
Saved results/metrics.json

In [14]:
# Summary: accuracy, macro F1, and per-class F1 for each config/split
for config_name, splits in all_metrics.items():
    for split_name, m in splits.items():
        print(
            f"{config_name} / {split_name}: acc={m['accuracy']:.4f}, "
            f"F1_macro={m['f1_macro']:.4f}, F1_per_class={m['f1_per_class']}"
        )

snli_tr_1_1 / test: acc=0.6226, F1_macro=0.5644, F1_per_class={'entailment': 0.6751697181649866, 'neutral': 0.2287784679089027, 'contradiction': 0.7891784889475421}
multinli_tr_1_1 / validation_matched: acc=0.5299, F1_macro=0.4398, F1_per_class={'entailment': 0.6175490559052202, 'neutral': 0.05764034824377064, 'contradiction': 0.6441728980485136}
multinli_tr_1_1 / validation_mismatched: acc=0.5275, F1_macro=0.4350, F1_per_class={'entailment': 0.6169263004463879, 'neutral': 0.050318278266141256, 'contradiction': 0.6378861183475996}
trglue_mnli / test_matched: acc=0.5694, F1_macro=0.5141, F1_per_class={'entailment': 0.6702078521939954, 'neutral': 0.2562551553478141, 'contradiction': 0.6158419304074139}
trglue_mnli / test_mismatched: acc=0.6156, F1_macro=0.5645, F1_per_class={'entailment': 0.6925668088928812, 'neutral': 0.3134898962315674, 'contradiction': 0.6873508353221957}
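
Macro F1 is the unweighted mean of the per-class F1 scores, so a weak "neutral" class pulls it below accuracy on every split. As a sanity check, the sketch below recomputes macro F1 from the per-class values printed above and compares it with the reported figure. The numbers are copied by hand from the summary output; `macro_f1` and the two dictionaries are illustrative names, not objects from the notebook.

```python
# Per-class F1 in the order entailment, neutral, contradiction,
# copied from the summary output above.
per_class_f1 = {
    "snli_tr_1_1 / test":
        [0.6751697181649866, 0.2287784679089027, 0.7891784889475421],
    "multinli_tr_1_1 / validation_matched":
        [0.6175490559052202, 0.05764034824377064, 0.6441728980485136],
    "multinli_tr_1_1 / validation_mismatched":
        [0.6169263004463879, 0.050318278266141256, 0.6378861183475996],
    "trglue_mnli / test_matched":
        [0.6702078521939954, 0.2562551553478141, 0.6158419304074139],
    "trglue_mnli / test_mismatched":
        [0.6925668088928812, 0.3134898962315674, 0.6873508353221957],
}

# Macro F1 as printed in the summary (rounded to four decimals).
reported_macro_f1 = {
    "snli_tr_1_1 / test": 0.5644,
    "multinli_tr_1_1 / validation_matched": 0.4398,
    "multinli_tr_1_1 / validation_mismatched": 0.4350,
    "trglue_mnli / test_matched": 0.5141,
    "trglue_mnli / test_mismatched": 0.5645,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores) / len(scores)


recomputed = {name: macro_f1(scores) for name, scores in per_class_f1.items()}

for name, value in recomputed.items():
    delta = abs(value - reported_macro_f1[name])
    print(f"{name}: recomputed={value:.4f}, "
          f"reported={reported_macro_f1[name]:.4f}, |diff|={delta:.1e}")
```

The recomputed values agree with the reported ones to within rounding, which confirms that the gap between accuracy and macro F1 comes from the neutral class rather than from a different averaging scheme.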
