# NLI base results: Llama-3.1-8B-Instruct (meta-llama/Llama-3.1-8B-Instruct) via Ollama on M4

Loads [yilmazzey/sdp2-nli](https://huggingface.co/datasets/yilmazzey/sdp2-nli) (snli_tr_1_1, multinli_tr_1_1, trglue_mnli) and runs **test-only** zero-shot Turkish NLI evaluation with **Llama-3.1-8B-Instruct** via **Ollama** (no Hugging Face pipeline).

Uses Llama 3.1 Instruct chat format (system + user turn; Ollama applies the template). Model is instructed to answer with exactly one word: entailment, neutral, or contradiction. Outputs parsed to 0=entailment, 1=neutral, 2=contradiction. Runs on Apple Silicon (M4) with pure Ollama (CPU/Metal), no CUDA, no quantization.

**Splits:** snli → test; multinli → validation_matched/mismatched; trglue → test_matched/test_mismatched. **Metrics:** Accuracy, macro F1, per-class F1, confusion matrix (CSV + seaborn plot). Results saved to `./results/`. **Prerequisite:** `ollama pull llama3.1:8b` (or your Ollama model name for Llama-3.1-8B-Instruct).

In [1]:
# Install ollama Python client if needed; standard libs for datasets/metrics/plots
# !pip install -q ollama datasets scikit-learn tqdm matplotlib seaborn
# If ollama is already installed via brew/pip, skip or run: pip install -q ollama

In [2]:
import json
import random
import re
from collections import Counter
from pathlib import Path

import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from tqdm import tqdm
import ollama

try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    HAS_PLOT = True
except ImportError:
    HAS_PLOT = False

LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Device: M4 / Apple Silicon — no CUDA; Ollama uses Metal/CPU
print("Running on Apple Silicon (M4) / CPU — Ollama handles Metal.")

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

  from .autonotebook import tqdm as notebook_tqdm


Running on Apple Silicon (M4) / CPU — Ollama handles Metal.


In [3]:
REPO_ID = "yilmazzey/sdp2-nli"
CONFIGS = ["snli_tr_1_1", "multinli_tr_1_1", "trglue_mnli"]
MODEL_ID = "llama3.1:8b"  # Ollama model name (e.g. after: ollama pull llama3.1:8b)
NUM_LABELS = 3
RESULTS_DIR = "results"
BATCH_SIZE = 6  # Safe on M4 36GB RAM; use 4–8
MAX_TOKENS = 10
TEMPERATURE = 0.0
TOP_P = 0.0
EVAL_SPLITS = {
    "snli_tr_1_1": ["test"],
    "multinli_tr_1_1": ["validation_matched", "validation_mismatched"],
    "trglue_mnli": ["test_matched", "test_mismatched"],
}

In [4]:
# Load all three dataset configs (same as Turkish-Gemma-9b-T1)
datasets = {}
for cfg in CONFIGS:
    print(f"Loading {REPO_ID} :: {cfg} ...")
    datasets[cfg] = load_dataset(REPO_ID, cfg)
    print("  splits:", list(datasets[cfg].keys()))

Loading yilmazzey/sdp2-nli :: snli_tr_1_1 ...
  splits: ['train', 'validation', 'test']
Loading yilmazzey/sdp2-nli :: multinli_tr_1_1 ...
  splits: ['train', 'validation_matched', 'validation_mismatched']
Loading yilmazzey/sdp2-nli :: trglue_mnli ...
  splits: ['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched']


In [5]:
# Llama 3.1 Instruct: system + user turn; Ollama applies <|start_header_id|>system/user/assistant<|end_header_id|> + <|eot_id|>
SYSTEM_PROMPT = """You are a natural language inference classifier. You must answer with exactly one word and nothing else: entailment, neutral, or contradiction. No explanation, no punctuation, no extra text. Only one of these three words."""


def nli_user_prompt(premise, hypothesis):
    return f"""Premise: {premise}
Hypothesis: {hypothesis}
Does the premise entail, is neutral to, or contradict the hypothesis? Answer with only one word: entailment, neutral, or contradiction."""


def ollama_chat_single(user_content: str) -> str:
    """Llama 3.1 chat: system + user turn. Returns assistant message content only."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
    response = ollama.chat(
        model=MODEL_ID,
        messages=messages,
        options={
            "num_predict": MAX_TOKENS,
            "temperature": TEMPERATURE,
            "top_p": TOP_P,
        },
    )
    return (response.get("message") or {}).get("content", "") or ""


LABEL_WORD_TO_ID = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2,
    "içerme": 0,
    "tarafsız": 1,
    "nötr": 1,
    "çelişki": 2,
}


def parse_generated_label(raw_text: str) -> int:
    """Extract first word from model output; strip punctuation; lowercase; map EN+TR; default 1 (neutral)."""
    if not raw_text or not isinstance(raw_text, str):
        return 1
    text = raw_text.strip()
    if not text:
        return 1
    # First token/word (split by whitespace)
    parts = text.split()
    first = parts[0] if parts else ""
    # Strip punctuation and lowercase (handles quoted output e.g. 'entailment')
    first = re.sub(r"[.,;:!?\"'()\[\]]+", "", first).strip().lower()
    if not first:
        return 1
    return LABEL_WORD_TO_ID.get(first, 1)

In [6]:
def run_prompted_inference(ds):
    premises = ds["premise"]
    hypotheses = ds["hypothesis"]
    labels = ds["label"]
    n = len(labels)
    y_pred = []
    debug_indices = set(list(range(min(5, n))) + list(range(0, n, 100)))

    for start in tqdm(range(0, n, BATCH_SIZE), desc="Inference"):
        end = min(start + BATCH_SIZE, n)
        for i in range(start, end):
            user_content = nli_user_prompt(premises[i], hypotheses[i])
            raw = ollama_chat_single(user_content)
            label_id = parse_generated_label(raw)
            y_pred.append(label_id)
            if i in debug_indices:
                print(f"[sample {i}] raw: {repr(raw)} -> parsed: {label_id} ({LABEL_NAMES[label_id]})")

    y_true = np.array(labels, dtype=np.int64)
    y_pred = np.array(y_pred, dtype=np.int64)
    print("True label dist:", dict(Counter(y_true)))
    print("Pred label dist:", dict(Counter(y_pred)))
    return y_true, y_pred

In [7]:
def compute_metrics(y_true, y_pred):
    acc = float(accuracy_score(y_true, y_pred))
    f1_macro = float(f1_score(y_true, y_pred, average="macro", zero_division=0))
    f1_per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
    f1_per_class = {LABEL_NAMES[i]: float(f1_per_class[i]) for i in range(NUM_LABELS)}
    cm = confusion_matrix(y_true, y_pred)
    out = {"accuracy": acc, "f1_macro": f1_macro, "f1_per_class": f1_per_class}
    return out, cm


def save_confusion_plot(cm, path):
    if not HAS_PLOT:
        return
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt="d", xticklabels=LABEL_NAMES, yticklabels=LABEL_NAMES, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()

In [8]:
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)
all_metrics = {}

for config_name in CONFIGS:
    ds_dict = datasets[config_name]
    split_names = EVAL_SPLITS[config_name]
    all_metrics[config_name] = {}

    for split_name in split_names:
        if split_name not in ds_dict:
            print(f"  Skip {config_name}/{split_name} (missing)")
            continue
        ds = ds_dict[split_name]
        print(f"Evaluating {config_name} / {split_name} ...")
        y_true, y_pred = run_prompted_inference(ds)
        metrics, cm = compute_metrics(y_true, y_pred)
        all_metrics[config_name][split_name] = metrics

        cm_path = Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.csv"
        np.savetxt(cm_path, cm, fmt="%d", delimiter=",")
        save_confusion_plot(cm, Path(RESULTS_DIR) / f"confusion_{config_name}_{split_name}.png")

        print(f"  accuracy={metrics['accuracy']:.4f}, f1_macro={metrics['f1_macro']:.4f}")

with open(Path(RESULTS_DIR) / "metrics.json", "w") as f:
    json.dump(all_metrics, f, indent=2)
print(f"Saved {RESULTS_DIR}/metrics.json")
sys.stdout = _nli_orig_stdout
_nli_log_file.close()
print(f"Log closed: {_nli_log_path}")

Evaluating snli_tr_1_1 / test ...


Inference:   0%|          | 0/1638 [00:00<?, ?it/s]

[sample 0] raw: 'neutral' -> parsed: 1 (neutral)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   1%|          | 16/1638 [00:19<33:34,  1.24s/it]

[sample 100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:   2%|▏         | 33/1638 [00:40<32:57,  1.23s/it]

[sample 200] raw: 'neutral' -> parsed: 1 (neutral)


Inference:   3%|▎         | 50/1638 [01:02<33:50,  1.28s/it]

[sample 300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   4%|▍         | 66/1638 [01:22<33:54,  1.29s/it]

[sample 400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:   5%|▌         | 83/1638 [01:44<33:02,  1.28s/it]

[sample 500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   6%|▌         | 100/1638 [02:07<34:39,  1.35s/it]

[sample 600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   7%|▋         | 116/1638 [02:29<36:41,  1.45s/it]

[sample 700] raw: 'neutral' -> parsed: 1 (neutral)


Inference:   8%|▊         | 133/1638 [02:56<42:50,  1.71s/it]

[sample 800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   9%|▉         | 150/1638 [03:29<46:21,  1.87s/it]

[sample 900] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  10%|█         | 166/1638 [03:56<39:26,  1.61s/it]

[sample 1000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  11%|█         | 183/1638 [04:22<38:09,  1.57s/it]

[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  12%|█▏        | 200/1638 [04:49<37:17,  1.56s/it]

[sample 1200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  13%|█▎        | 216/1638 [05:14<36:19,  1.53s/it]

[sample 1300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  14%|█▍        | 233/1638 [05:40<35:42,  1.53s/it]

[sample 1400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  15%|█▌        | 250/1638 [06:05<35:33,  1.54s/it]

[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  16%|█▌        | 266/1638 [06:29<33:53,  1.48s/it]

[sample 1600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  17%|█▋        | 283/1638 [06:54<33:04,  1.46s/it]

[sample 1700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  18%|█▊        | 300/1638 [07:19<33:25,  1.50s/it]

[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  19%|█▉        | 316/1638 [07:43<32:01,  1.45s/it]

[sample 1900] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  20%|██        | 333/1638 [08:08<32:01,  1.47s/it]

[sample 2000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  21%|██▏       | 350/1638 [08:32<31:00,  1.44s/it]

[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  22%|██▏       | 366/1638 [08:57<35:29,  1.67s/it]

[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  23%|██▎       | 383/1638 [09:24<32:58,  1.58s/it]

[sample 2300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  24%|██▍       | 400/1638 [09:49<32:14,  1.56s/it]

[sample 2400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  25%|██▌       | 416/1638 [10:14<30:22,  1.49s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  26%|██▋       | 433/1638 [10:40<31:19,  1.56s/it]

[sample 2600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  27%|██▋       | 450/1638 [11:08<31:19,  1.58s/it]

[sample 2700] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  28%|██▊       | 466/1638 [11:32<30:32,  1.56s/it]

[sample 2800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  29%|██▉       | 483/1638 [11:57<28:45,  1.49s/it]

[sample 2900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███       | 500/1638 [12:24<31:47,  1.68s/it]

[sample 3000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  32%|███▏      | 516/1638 [12:50<28:25,  1.52s/it]

[sample 3100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  33%|███▎      | 533/1638 [13:15<26:38,  1.45s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  34%|███▎      | 550/1638 [13:43<30:47,  1.70s/it]

[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  35%|███▍      | 566/1638 [14:06<26:20,  1.47s/it]

[sample 3400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  36%|███▌      | 583/1638 [14:32<26:10,  1.49s/it]

[sample 3500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  37%|███▋      | 600/1638 [14:56<24:48,  1.43s/it]

[sample 3600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  38%|███▊      | 616/1638 [15:19<24:55,  1.46s/it]

[sample 3700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▊      | 633/1638 [15:44<24:44,  1.48s/it]

[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|███▉      | 650/1638 [16:08<23:17,  1.41s/it]

[sample 3900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  41%|████      | 666/1638 [16:31<22:50,  1.41s/it]

[sample 4000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  42%|████▏     | 683/1638 [16:55<22:37,  1.42s/it]

[sample 4100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  43%|████▎     | 700/1638 [17:19<22:34,  1.44s/it]

[sample 4200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  44%|████▎     | 716/1638 [17:42<22:21,  1.46s/it]

[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  45%|████▍     | 733/1638 [18:07<21:33,  1.43s/it]

[sample 4400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  46%|████▌     | 750/1638 [18:31<20:55,  1.41s/it]

[sample 4500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  47%|████▋     | 766/1638 [18:54<21:19,  1.47s/it]

[sample 4600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  48%|████▊     | 783/1638 [19:18<19:39,  1.38s/it]

[sample 4700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  49%|████▉     | 800/1638 [19:42<19:32,  1.40s/it]

[sample 4800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  50%|████▉     | 816/1638 [20:04<19:30,  1.42s/it]

[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  51%|█████     | 833/1638 [20:28<18:24,  1.37s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  52%|█████▏    | 850/1638 [20:52<18:16,  1.39s/it]

[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  53%|█████▎    | 866/1638 [21:16<18:44,  1.46s/it]

[sample 5200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  54%|█████▍    | 883/1638 [21:40<17:56,  1.43s/it]

[sample 5300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  55%|█████▍    | 900/1638 [22:04<17:23,  1.41s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  56%|█████▌    | 916/1638 [22:27<16:32,  1.37s/it]

[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  57%|█████▋    | 933/1638 [22:51<17:01,  1.45s/it]

[sample 5600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  58%|█████▊    | 950/1638 [23:15<16:26,  1.43s/it]

[sample 5700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  59%|█████▉    | 966/1638 [23:39<16:34,  1.48s/it]

[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  60%|██████    | 983/1638 [24:04<16:04,  1.47s/it]

[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  61%|██████    | 1000/1638 [24:28<15:11,  1.43s/it]

[sample 6000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 1016/1638 [24:51<14:41,  1.42s/it]

[sample 6100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  63%|██████▎   | 1033/1638 [25:15<14:37,  1.45s/it]

[sample 6200] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  64%|██████▍   | 1050/1638 [25:39<13:36,  1.39s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 1066/1638 [26:02<13:42,  1.44s/it]

[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  66%|██████▌   | 1083/1638 [26:27<13:04,  1.41s/it]

[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  67%|██████▋   | 1100/1638 [26:51<12:59,  1.45s/it]

[sample 6600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  68%|██████▊   | 1116/1638 [27:14<12:10,  1.40s/it]

[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  69%|██████▉   | 1133/1638 [27:38<11:57,  1.42s/it]

[sample 6800] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  70%|███████   | 1150/1638 [28:02<11:05,  1.36s/it]

[sample 6900] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  71%|███████   | 1166/1638 [28:24<11:16,  1.43s/it]

[sample 7000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  72%|███████▏  | 1183/1638 [28:49<11:06,  1.46s/it]

[sample 7100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  73%|███████▎  | 1200/1638 [29:13<10:22,  1.42s/it]

[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▍  | 1216/1638 [29:35<09:49,  1.40s/it]

[sample 7300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  75%|███████▌  | 1233/1638 [30:00<10:37,  1.57s/it]

[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  76%|███████▋  | 1250/1638 [30:25<10:04,  1.56s/it]

[sample 7500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1266/1638 [30:50<09:23,  1.51s/it]

[sample 7600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  78%|███████▊  | 1283/1638 [31:16<09:15,  1.56s/it]

[sample 7700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  79%|███████▉  | 1300/1638 [31:43<08:39,  1.54s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  80%|████████  | 1316/1638 [32:07<08:11,  1.53s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  81%|████████▏ | 1333/1638 [32:32<07:28,  1.47s/it]

[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  82%|████████▏ | 1350/1638 [32:58<07:16,  1.52s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  83%|████████▎ | 1366/1638 [33:22<07:02,  1.55s/it]

[sample 8200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  84%|████████▍ | 1383/1638 [33:49<07:20,  1.73s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  85%|████████▌ | 1400/1638 [34:16<05:50,  1.47s/it]

[sample 8400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  86%|████████▋ | 1416/1638 [34:39<05:21,  1.45s/it]

[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  87%|████████▋ | 1433/1638 [35:04<04:58,  1.45s/it]

[sample 8600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▊ | 1450/1638 [35:30<04:50,  1.54s/it]

[sample 8700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▉ | 1466/1638 [35:54<04:21,  1.52s/it]

[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1483/1638 [36:21<04:05,  1.59s/it]

[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  92%|█████████▏| 1500/1638 [36:47<03:32,  1.54s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1516/1638 [37:11<03:02,  1.50s/it]

[sample 9100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  94%|█████████▎| 1533/1638 [37:38<02:45,  1.58s/it]

[sample 9200] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  95%|█████████▍| 1550/1638 [38:03<02:10,  1.49s/it]

[sample 9300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  96%|█████████▌| 1566/1638 [38:27<01:47,  1.50s/it]

[sample 9400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  97%|█████████▋| 1583/1638 [38:53<01:26,  1.57s/it]

[sample 9500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  98%|█████████▊| 1600/1638 [39:19<00:58,  1.55s/it]

[sample 9600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  99%|█████████▊| 1616/1638 [39:44<00:32,  1.50s/it]

[sample 9700] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|█████████▉| 1633/1638 [40:09<00:07,  1.51s/it]

[sample 9800] raw: 'neutral' -> parsed: 1 (neutral)


Inference: 100%|██████████| 1638/1638 [40:15<00:00,  1.47s/it]


True label dist: {np.int64(1): 3219, np.int64(0): 3368, np.int64(2): 3237}
Pred label dist: {np.int64(1): 1769, np.int64(0): 7555, np.int64(2): 500}
  accuracy=0.4081, f1_macro=0.3161
Evaluating multinli_tr_1_1 / validation_matched ...


Inference:   0%|          | 0/1635 [00:00<?, ?it/s]

[sample 0] raw: 'entailment' -> parsed: 0 (entailment)
[sample 1] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 2] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   1%|          | 16/1635 [00:30<48:44,  1.81s/it]

[sample 100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   2%|▏         | 33/1635 [01:02<49:39,  1.86s/it]

[sample 200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   3%|▎         | 50/1635 [01:33<46:35,  1.76s/it]

[sample 300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   4%|▍         | 66/1635 [02:04<47:34,  1.82s/it]

[sample 400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   5%|▌         | 83/1635 [02:36<48:58,  1.89s/it]

[sample 500] raw: 'neutral' -> parsed: 1 (neutral)


Inference:   6%|▌         | 100/1635 [03:06<48:18,  1.89s/it]

[sample 600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:   7%|▋         | 116/1635 [03:36<47:35,  1.88s/it]

[sample 700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   8%|▊         | 133/1635 [04:07<44:42,  1.79s/it]

[sample 800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   9%|▉         | 150/1635 [04:38<46:07,  1.86s/it]

[sample 900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  10%|█         | 166/1635 [05:07<42:52,  1.75s/it]

[sample 1000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  11%|█         | 183/1635 [05:38<44:02,  1.82s/it]

[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  12%|█▏        | 200/1635 [06:10<43:43,  1.83s/it]

[sample 1200] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  13%|█▎        | 216/1635 [06:40<45:51,  1.94s/it]

[sample 1300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  14%|█▍        | 233/1635 [07:12<42:59,  1.84s/it]

[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  15%|█▌        | 250/1635 [07:44<45:01,  1.95s/it]

[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  16%|█▋        | 266/1635 [08:14<42:35,  1.87s/it]

[sample 1600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  17%|█▋        | 283/1635 [08:46<42:38,  1.89s/it]

[sample 1700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  18%|█▊        | 300/1635 [09:18<41:51,  1.88s/it]

[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  19%|█▉        | 316/1635 [09:48<42:26,  1.93s/it]

[sample 1900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  20%|██        | 333/1635 [10:19<38:52,  1.79s/it]

[sample 2000] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  21%|██▏       | 350/1635 [10:50<37:47,  1.76s/it]

[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  22%|██▏       | 366/1635 [11:18<36:23,  1.72s/it]

[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  23%|██▎       | 383/1635 [11:48<36:01,  1.73s/it]

[sample 2300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  24%|██▍       | 400/1635 [12:18<37:00,  1.80s/it]

[sample 2400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  25%|██▌       | 416/1635 [12:47<36:25,  1.79s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  26%|██▋       | 433/1635 [13:16<34:21,  1.71s/it]

[sample 2600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  28%|██▊       | 450/1635 [13:47<34:45,  1.76s/it]

[sample 2700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  29%|██▊       | 466/1635 [14:16<36:17,  1.86s/it]

[sample 2800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  30%|██▉       | 483/1635 [14:45<33:03,  1.72s/it]

[sample 2900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███       | 500/1635 [15:15<32:33,  1.72s/it]

[sample 3000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  32%|███▏      | 516/1635 [15:43<31:28,  1.69s/it]

[sample 3100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  33%|███▎      | 533/1635 [16:13<32:24,  1.76s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  34%|███▎      | 550/1635 [16:43<32:00,  1.77s/it]

[sample 3300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  35%|███▍      | 566/1635 [17:11<31:41,  1.78s/it]

[sample 3400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  36%|███▌      | 583/1635 [17:43<32:50,  1.87s/it]

[sample 3500] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  37%|███▋      | 600/1635 [18:12<30:26,  1.76s/it]

[sample 3600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  38%|███▊      | 616/1635 [18:40<30:55,  1.82s/it]

[sample 3700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▊      | 633/1635 [19:10<28:23,  1.70s/it]

[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|███▉      | 650/1635 [19:40<28:04,  1.71s/it]

[sample 3900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  41%|████      | 666/1635 [20:08<27:22,  1.70s/it]

[sample 4000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  42%|████▏     | 683/1635 [20:38<27:41,  1.75s/it]

[sample 4100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  43%|████▎     | 700/1635 [21:07<27:11,  1.74s/it]

[sample 4200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  44%|████▍     | 716/1635 [21:35<27:13,  1.78s/it]

[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  45%|████▍     | 733/1635 [22:05<27:12,  1.81s/it]

[sample 4400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  46%|████▌     | 750/1635 [22:35<26:34,  1.80s/it]

[sample 4500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  47%|████▋     | 766/1635 [23:03<26:14,  1.81s/it]

[sample 4600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  48%|████▊     | 783/1635 [23:33<24:03,  1.69s/it]

[sample 4700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  49%|████▉     | 800/1635 [24:03<24:19,  1.75s/it]

[sample 4800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  50%|████▉     | 816/1635 [24:31<23:55,  1.75s/it]

[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  51%|█████     | 833/1635 [25:01<23:15,  1.74s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  52%|█████▏    | 850/1635 [25:32<23:51,  1.82s/it]

[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  53%|█████▎    | 866/1635 [26:00<22:30,  1.76s/it]

[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  54%|█████▍    | 883/1635 [26:31<23:52,  1.91s/it]

[sample 5300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  55%|█████▌    | 900/1635 [27:01<21:45,  1.78s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  56%|█████▌    | 916/1635 [27:29<20:53,  1.74s/it]

[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  57%|█████▋    | 933/1635 [28:00<20:52,  1.78s/it]

[sample 5600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  58%|█████▊    | 950/1635 [28:32<24:11,  2.12s/it]

[sample 5700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  59%|█████▉    | 966/1635 [29:01<20:11,  1.81s/it]

[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  60%|██████    | 983/1635 [29:32<20:14,  1.86s/it]

[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  61%|██████    | 1000/1635 [30:04<19:28,  1.84s/it]

[sample 6000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 1016/1635 [30:32<18:12,  1.76s/it]

[sample 6100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  63%|██████▎   | 1033/1635 [31:03<19:01,  1.90s/it]

[sample 6200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  64%|██████▍   | 1050/1635 [31:33<16:47,  1.72s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 1066/1635 [32:02<16:59,  1.79s/it]

[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  66%|██████▌   | 1083/1635 [32:33<17:14,  1.87s/it]

[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  67%|██████▋   | 1100/1635 [33:04<16:08,  1.81s/it]

[sample 6600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  68%|██████▊   | 1116/1635 [33:33<15:07,  1.75s/it]

[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  69%|██████▉   | 1133/1635 [34:05<15:53,  1.90s/it]

[sample 6800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  70%|███████   | 1150/1635 [34:37<14:47,  1.83s/it]

[sample 6900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  71%|███████▏  | 1166/1635 [35:08<14:28,  1.85s/it]

[sample 7000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  72%|███████▏  | 1183/1635 [35:39<14:10,  1.88s/it]

[sample 7100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  73%|███████▎  | 1200/1635 [36:12<13:27,  1.86s/it]

[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▍  | 1216/1635 [36:42<12:45,  1.83s/it]

[sample 7300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  75%|███████▌  | 1233/1635 [37:14<12:32,  1.87s/it]

[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  76%|███████▋  | 1250/1635 [37:46<11:40,  1.82s/it]

[sample 7500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1266/1635 [38:16<11:43,  1.91s/it]

[sample 7600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  78%|███████▊  | 1283/1635 [38:47<10:38,  1.81s/it]

[sample 7700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  80%|███████▉  | 1300/1635 [39:19<10:10,  1.82s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  80%|████████  | 1316/1635 [39:48<09:48,  1.84s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  82%|████████▏ | 1333/1635 [40:19<09:28,  1.88s/it]

[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  83%|████████▎ | 1350/1635 [40:51<09:20,  1.97s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  84%|████████▎ | 1366/1635 [41:20<08:03,  1.80s/it]

[sample 8200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  85%|████████▍ | 1383/1635 [41:54<07:57,  1.89s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  86%|████████▌ | 1400/1635 [42:25<07:02,  1.80s/it]

[sample 8400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  87%|████████▋ | 1416/1635 [42:54<06:36,  1.81s/it]

[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  88%|████████▊ | 1433/1635 [43:25<06:12,  1.84s/it]

[sample 8600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▊ | 1450/1635 [43:55<05:24,  1.75s/it]

[sample 8700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  90%|████████▉ | 1466/1635 [44:25<05:15,  1.87s/it]

[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1483/1635 [44:56<04:43,  1.87s/it]

[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  92%|█████████▏| 1500/1635 [45:28<04:07,  1.83s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1516/1635 [45:56<03:41,  1.86s/it]

[sample 9100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  94%|█████████▍| 1533/1635 [46:27<03:07,  1.83s/it]

[sample 9200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  95%|█████████▍| 1550/1635 [46:59<02:40,  1.89s/it]

[sample 9300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  96%|█████████▌| 1566/1635 [47:27<02:06,  1.83s/it]

[sample 9400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  97%|█████████▋| 1583/1635 [47:58<01:34,  1.82s/it]

[sample 9500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  98%|█████████▊| 1600/1635 [48:29<01:02,  1.79s/it]

[sample 9600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  99%|█████████▉| 1616/1635 [48:59<00:37,  1.95s/it]

[sample 9700] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|█████████▉| 1633/1635 [49:30<00:03,  1.74s/it]

[sample 9800] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|██████████| 1635/1635 [49:33<00:00,  1.82s/it]


True label dist: {np.int64(1): 3123, np.int64(2): 3211, np.int64(0): 3475}
Pred label dist: {np.int64(0): 7737, np.int64(2): 1512, np.int64(1): 560}
  accuracy=0.5049, f1_macro=0.4264
Evaluating multinli_tr_1_1 / validation_mismatched ...


Inference:   0%|          | 0/1638 [00:00<?, ?it/s]

[sample 0] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   1%|          | 16/1638 [00:30<48:22,  1.79s/it]

[sample 100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   2%|▏         | 33/1638 [01:03<54:04,  2.02s/it]

[sample 200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   3%|▎         | 50/1638 [01:36<50:19,  1.90s/it]

[sample 300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   4%|▍         | 66/1638 [02:07<50:09,  1.91s/it]

[sample 400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   5%|▌         | 83/1638 [02:38<48:35,  1.87s/it]

[sample 500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   6%|▌         | 100/1638 [03:10<45:17,  1.77s/it]

[sample 600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   7%|▋         | 116/1638 [03:38<44:59,  1.77s/it]

[sample 700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   8%|▊         | 133/1638 [04:07<44:29,  1.77s/it]

[sample 800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   9%|▉         | 150/1638 [04:39<48:50,  1.97s/it]

[sample 900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  10%|█         | 166/1638 [05:10<45:10,  1.84s/it]

[sample 1000] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  11%|█         | 183/1638 [05:41<42:58,  1.77s/it]

[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  12%|█▏        | 200/1638 [06:12<43:50,  1.83s/it]

[sample 1200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  13%|█▎        | 216/1638 [06:42<44:44,  1.89s/it]

[sample 1300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  14%|█▍        | 233/1638 [07:12<40:30,  1.73s/it]

[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  15%|█▌        | 250/1638 [07:43<42:50,  1.85s/it]

[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  16%|█▌        | 266/1638 [08:14<43:51,  1.92s/it]

[sample 1600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  17%|█▋        | 283/1638 [08:46<41:45,  1.85s/it]

[sample 1700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  18%|█▊        | 300/1638 [09:16<40:15,  1.81s/it]

[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  19%|█▉        | 316/1638 [09:45<38:43,  1.76s/it]

[sample 1900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  20%|██        | 333/1638 [10:16<40:15,  1.85s/it]

[sample 2000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  21%|██▏       | 350/1638 [10:46<39:45,  1.85s/it]

[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  22%|██▏       | 366/1638 [11:14<37:21,  1.76s/it]

[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  23%|██▎       | 383/1638 [11:45<38:23,  1.84s/it]

[sample 2300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  24%|██▍       | 400/1638 [12:16<35:29,  1.72s/it]

[sample 2400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  25%|██▌       | 416/1638 [12:44<36:34,  1.80s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  26%|██▋       | 433/1638 [13:13<35:04,  1.75s/it]

[sample 2600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  27%|██▋       | 450/1638 [13:42<33:46,  1.71s/it]

[sample 2700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  28%|██▊       | 466/1638 [14:09<32:17,  1.65s/it]

[sample 2800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  29%|██▉       | 483/1638 [14:39<32:01,  1.66s/it]

[sample 2900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███       | 500/1638 [15:08<32:24,  1.71s/it]

[sample 3000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  32%|███▏      | 516/1638 [15:35<30:37,  1.64s/it]

[sample 3100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  33%|███▎      | 533/1638 [16:05<32:15,  1.75s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  34%|███▎      | 550/1638 [16:34<30:52,  1.70s/it]

[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  35%|███▍      | 566/1638 [17:01<29:18,  1.64s/it]

[sample 3400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  36%|███▌      | 583/1638 [17:29<29:24,  1.67s/it]

[sample 3500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  37%|███▋      | 600/1638 [17:58<29:42,  1.72s/it]

[sample 3600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  38%|███▊      | 616/1638 [18:26<29:16,  1.72s/it]

[sample 3700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▊      | 633/1638 [18:55<28:59,  1.73s/it]

[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|███▉      | 650/1638 [19:24<27:44,  1.69s/it]

[sample 3900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  41%|████      | 666/1638 [19:51<26:12,  1.62s/it]

[sample 4000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  42%|████▏     | 683/1638 [20:22<28:23,  1.78s/it]

[sample 4100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  43%|████▎     | 700/1638 [20:51<26:32,  1.70s/it]

[sample 4200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  44%|████▎     | 716/1638 [21:20<27:19,  1.78s/it]

[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  45%|████▍     | 733/1638 [21:48<25:43,  1.71s/it]

[sample 4400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  46%|████▌     | 750/1638 [22:17<25:02,  1.69s/it]

[sample 4500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  47%|████▋     | 766/1638 [22:45<25:22,  1.75s/it]

[sample 4600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  48%|████▊     | 783/1638 [23:14<25:46,  1.81s/it]

[sample 4700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  49%|████▉     | 800/1638 [23:44<24:51,  1.78s/it]

[sample 4800] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  50%|████▉     | 816/1638 [24:12<24:10,  1.77s/it]

[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  51%|█████     | 833/1638 [24:42<23:21,  1.74s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  52%|█████▏    | 850/1638 [25:12<24:17,  1.85s/it]

[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  53%|█████▎    | 866/1638 [25:39<22:24,  1.74s/it]

[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  54%|█████▍    | 883/1638 [26:09<22:16,  1.77s/it]

[sample 5300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  55%|█████▍    | 900/1638 [26:39<21:52,  1.78s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  56%|█████▌    | 916/1638 [27:08<22:11,  1.84s/it]

[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  57%|█████▋    | 933/1638 [27:39<22:42,  1.93s/it]

[sample 5600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  58%|█████▊    | 950/1638 [28:09<20:32,  1.79s/it]

[sample 5700] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  59%|█████▉    | 966/1638 [28:37<19:36,  1.75s/it]

[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  60%|██████    | 983/1638 [29:07<19:48,  1.81s/it]

[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  61%|██████    | 1000/1638 [29:37<18:51,  1.77s/it]

[sample 6000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 1016/1638 [30:05<18:23,  1.77s/it]

[sample 6100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  63%|██████▎   | 1033/1638 [30:36<17:31,  1.74s/it]

[sample 6200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  64%|██████▍   | 1050/1638 [31:06<17:11,  1.75s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 1066/1638 [31:34<16:52,  1.77s/it]

[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  66%|██████▌   | 1083/1638 [32:05<16:54,  1.83s/it]

[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  67%|██████▋   | 1100/1638 [32:37<17:43,  1.98s/it]

[sample 6600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  68%|██████▊   | 1116/1638 [33:08<17:20,  1.99s/it]

[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  69%|██████▉   | 1133/1638 [33:42<16:56,  2.01s/it]

[sample 6800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  70%|███████   | 1150/1638 [34:16<15:55,  1.96s/it]

[sample 6900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  71%|███████   | 1166/1638 [34:47<14:24,  1.83s/it]

[sample 7000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  72%|███████▏  | 1183/1638 [35:17<13:15,  1.75s/it]

[sample 7100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  73%|███████▎  | 1200/1638 [35:48<13:19,  1.83s/it]

[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▍  | 1216/1638 [36:16<12:26,  1.77s/it]

[sample 7300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  75%|███████▌  | 1233/1638 [36:47<12:35,  1.87s/it]

[sample 7400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  76%|███████▋  | 1250/1638 [37:19<12:12,  1.89s/it]

[sample 7500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1266/1638 [37:48<11:44,  1.90s/it]

[sample 7600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  78%|███████▊  | 1283/1638 [38:20<11:31,  1.95s/it]

[sample 7700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  79%|███████▉  | 1300/1638 [38:52<10:49,  1.92s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  80%|████████  | 1316/1638 [39:22<09:48,  1.83s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  81%|████████▏ | 1333/1638 [39:53<09:13,  1.81s/it]

[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  82%|████████▏ | 1350/1638 [40:24<08:34,  1.79s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  83%|████████▎ | 1366/1638 [40:53<07:59,  1.76s/it]

[sample 8200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  84%|████████▍ | 1383/1638 [41:24<08:06,  1.91s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  85%|████████▌ | 1400/1638 [41:55<07:18,  1.84s/it]

[sample 8400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  86%|████████▋ | 1416/1638 [42:23<06:14,  1.69s/it]

[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  87%|████████▋ | 1433/1638 [42:53<06:15,  1.83s/it]

[sample 8600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▊ | 1450/1638 [43:24<05:56,  1.90s/it]

[sample 8700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▉ | 1466/1638 [43:52<05:21,  1.87s/it]

[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1483/1638 [44:22<04:34,  1.77s/it]

[sample 8900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  92%|█████████▏| 1500/1638 [44:53<03:54,  1.70s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1516/1638 [45:21<03:38,  1.79s/it]

[sample 9100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  94%|█████████▎| 1533/1638 [45:51<02:58,  1.70s/it]

[sample 9200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  95%|█████████▍| 1550/1638 [46:23<02:45,  1.88s/it]

[sample 9300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  96%|█████████▌| 1566/1638 [46:52<02:08,  1.78s/it]

[sample 9400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  97%|█████████▋| 1583/1638 [47:21<01:35,  1.74s/it]

[sample 9500] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  98%|█████████▊| 1600/1638 [47:52<01:12,  1.91s/it]

[sample 9600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  99%|█████████▊| 1616/1638 [48:20<00:39,  1.78s/it]

[sample 9700] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|█████████▉| 1633/1638 [48:52<00:08,  1.74s/it]

[sample 9800] raw: 'neutral' -> parsed: 1 (neutral)


Inference: 100%|██████████| 1638/1638 [49:00<00:00,  1.80s/it]


True label dist: {np.int64(2): 3240, np.int64(0): 3456, np.int64(1): 3129}
Pred label dist: {np.int64(2): 1590, np.int64(0): 7766, np.int64(1): 469}
  accuracy=0.5109, f1_macro=0.4306
Evaluating trglue_mnli / test_matched ...


Inference:   0%|          | 0/1502 [00:00<?, ?it/s]

[sample 0] raw: 'neutral' -> parsed: 1 (neutral)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'entailment' -> parsed: 0 (entailment)
[sample 3] raw: 'neutral' -> parsed: 1 (neutral)
[sample 4] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   1%|          | 16/1502 [00:29<45:51,  1.85s/it]

[sample 100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   2%|▏         | 33/1502 [00:59<44:48,  1.83s/it]

[sample 200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   3%|▎         | 50/1502 [01:29<38:56,  1.61s/it]

[sample 300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   4%|▍         | 66/1502 [01:57<42:23,  1.77s/it]

[sample 400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:   6%|▌         | 83/1502 [02:27<44:09,  1.87s/it]

[sample 500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   7%|▋         | 100/1502 [02:57<41:38,  1.78s/it]

[sample 600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   8%|▊         | 116/1502 [03:26<39:28,  1.71s/it]

[sample 700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:   9%|▉         | 133/1502 [03:57<41:07,  1.80s/it]

[sample 800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  10%|▉         | 150/1502 [04:27<40:06,  1.78s/it]

[sample 900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  11%|█         | 166/1502 [04:55<37:01,  1.66s/it]

[sample 1000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  12%|█▏        | 183/1502 [05:25<38:46,  1.76s/it]

[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  13%|█▎        | 200/1502 [05:56<39:06,  1.80s/it]

[sample 1200] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  14%|█▍        | 216/1502 [06:23<37:20,  1.74s/it]

[sample 1300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  16%|█▌        | 233/1502 [06:53<37:11,  1.76s/it]

[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  17%|█▋        | 250/1502 [07:23<37:14,  1.78s/it]

[sample 1500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  18%|█▊        | 266/1502 [07:52<36:24,  1.77s/it]

[sample 1600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  19%|█▉        | 283/1502 [08:22<36:18,  1.79s/it]

[sample 1700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  20%|█▉        | 300/1502 [08:53<35:56,  1.79s/it]

[sample 1800] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  21%|██        | 316/1502 [09:22<35:11,  1.78s/it]

[sample 1900] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  22%|██▏       | 333/1502 [09:53<35:35,  1.83s/it]

[sample 2000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  23%|██▎       | 350/1502 [10:24<36:04,  1.88s/it]

[sample 2100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  24%|██▍       | 366/1502 [10:53<32:20,  1.71s/it]

[sample 2200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  25%|██▌       | 383/1502 [11:24<34:26,  1.85s/it]

[sample 2300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  27%|██▋       | 400/1502 [11:55<35:08,  1.91s/it]

[sample 2400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  28%|██▊       | 416/1502 [12:25<34:06,  1.88s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  29%|██▉       | 433/1502 [12:58<34:34,  1.94s/it]

[sample 2600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  30%|██▉       | 450/1502 [13:29<31:52,  1.82s/it]

[sample 2700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███       | 466/1502 [13:59<31:41,  1.84s/it]

[sample 2800] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  32%|███▏      | 483/1502 [14:31<31:27,  1.85s/it]

[sample 2900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  33%|███▎      | 500/1502 [15:02<30:30,  1.83s/it]

[sample 3000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  34%|███▍      | 516/1502 [15:31<29:25,  1.79s/it]

[sample 3100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  35%|███▌      | 533/1502 [16:02<29:37,  1.83s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  37%|███▋      | 550/1502 [16:33<28:19,  1.78s/it]

[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  38%|███▊      | 566/1502 [17:01<27:49,  1.78s/it]

[sample 3400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▉      | 583/1502 [17:31<25:51,  1.69s/it]

[sample 3500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|███▉      | 600/1502 [18:02<26:10,  1.74s/it]

[sample 3600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  41%|████      | 616/1502 [18:31<26:28,  1.79s/it]

[sample 3700] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  42%|████▏     | 633/1502 [19:01<25:52,  1.79s/it]

[sample 3800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  43%|████▎     | 650/1502 [19:31<25:10,  1.77s/it]

[sample 3900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  44%|████▍     | 666/1502 [20:00<24:45,  1.78s/it]

[sample 4000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  45%|████▌     | 683/1502 [20:30<24:00,  1.76s/it]

[sample 4100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  47%|████▋     | 700/1502 [21:01<24:08,  1.81s/it]

[sample 4200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  48%|████▊     | 716/1502 [21:30<24:28,  1.87s/it]

[sample 4300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  49%|████▉     | 733/1502 [22:01<22:33,  1.76s/it]

[sample 4400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  50%|████▉     | 750/1502 [22:31<22:31,  1.80s/it]

[sample 4500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  51%|█████     | 766/1502 [23:00<21:49,  1.78s/it]

[sample 4600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  52%|█████▏    | 783/1502 [23:32<22:22,  1.87s/it]

[sample 4700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  53%|█████▎    | 800/1502 [24:02<20:59,  1.79s/it]

[sample 4800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  54%|█████▍    | 816/1502 [24:32<20:21,  1.78s/it]

[sample 4900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  55%|█████▌    | 833/1502 [25:02<19:36,  1.76s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  57%|█████▋    | 850/1502 [25:34<19:06,  1.76s/it]

[sample 5100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  58%|█████▊    | 866/1502 [26:02<19:16,  1.82s/it]

[sample 5200] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  59%|█████▉    | 883/1502 [26:32<17:47,  1.72s/it]

[sample 5300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  60%|█████▉    | 900/1502 [27:03<18:10,  1.81s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  61%|██████    | 916/1502 [27:30<16:19,  1.67s/it]

[sample 5500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 933/1502 [28:01<17:35,  1.85s/it]

[sample 5600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  63%|██████▎   | 950/1502 [28:32<16:35,  1.80s/it]

[sample 5700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  64%|██████▍   | 966/1502 [29:01<17:14,  1.93s/it]

[sample 5800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 983/1502 [29:33<15:37,  1.81s/it]

[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  67%|██████▋   | 1000/1502 [30:03<14:41,  1.76s/it]

[sample 6000] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  68%|██████▊   | 1016/1502 [30:31<14:41,  1.81s/it]

[sample 6100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  69%|██████▉   | 1033/1502 [31:02<14:12,  1.82s/it]

[sample 6200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  70%|██████▉   | 1050/1502 [31:34<13:43,  1.82s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  71%|███████   | 1066/1502 [32:02<12:52,  1.77s/it]

[sample 6400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  72%|███████▏  | 1083/1502 [32:33<13:07,  1.88s/it]

[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  73%|███████▎  | 1100/1502 [33:04<12:22,  1.85s/it]

[sample 6600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▍  | 1116/1502 [33:34<11:24,  1.77s/it]

[sample 6700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  75%|███████▌  | 1133/1502 [34:05<10:53,  1.77s/it]

[sample 6800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1150/1502 [34:36<10:52,  1.85s/it]

[sample 6900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  78%|███████▊  | 1166/1502 [35:04<10:06,  1.81s/it]

[sample 7000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  79%|███████▉  | 1183/1502 [35:34<09:38,  1.81s/it]

[sample 7100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  80%|███████▉  | 1200/1502 [36:06<09:13,  1.83s/it]

[sample 7200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  81%|████████  | 1216/1502 [36:34<08:24,  1.76s/it]

[sample 7300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  82%|████████▏ | 1233/1502 [37:05<08:15,  1.84s/it]

[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  83%|████████▎ | 1250/1502 [37:37<07:35,  1.81s/it]

[sample 7500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  84%|████████▍ | 1266/1502 [38:07<07:16,  1.85s/it]

[sample 7600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  85%|████████▌ | 1283/1502 [38:38<06:29,  1.78s/it]

[sample 7700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  87%|████████▋ | 1300/1502 [39:08<06:02,  1.80s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  88%|████████▊ | 1316/1502 [39:38<05:49,  1.88s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▊ | 1333/1502 [40:09<04:57,  1.76s/it]

[sample 8000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  90%|████████▉ | 1350/1502 [40:39<04:15,  1.68s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1366/1502 [41:07<03:58,  1.75s/it]

[sample 8200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  92%|█████████▏| 1383/1502 [41:37<03:26,  1.73s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1400/1502 [42:07<03:02,  1.79s/it]

[sample 8400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  94%|█████████▍| 1416/1502 [42:35<02:28,  1.73s/it]

[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  95%|█████████▌| 1433/1502 [43:05<01:58,  1.72s/it]

[sample 8600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  97%|█████████▋| 1450/1502 [43:35<01:31,  1.75s/it]

[sample 8700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  98%|█████████▊| 1466/1502 [44:03<01:00,  1.68s/it]

[sample 8800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  99%|█████████▊| 1483/1502 [44:33<00:34,  1.83s/it]

[sample 8900] raw: 'neutral' -> parsed: 1 (neutral)


Inference: 100%|█████████▉| 1500/1502 [45:03<00:03,  1.74s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference: 100%|██████████| 1502/1502 [45:06<00:00,  1.80s/it]


True label dist: {np.int64(1): 3138, np.int64(2): 2946, np.int64(0): 2924}
Pred label dist: {np.int64(1): 1596, np.int64(0): 5264, np.int64(2): 2148}
  accuracy=0.7023, f1_macro=0.6981
Evaluating trglue_mnli / test_mismatched ...


Inference:   0%|          | 0/1537 [00:00<?, ?it/s]

[sample 0] raw: 'neutral' -> parsed: 1 (neutral)
[sample 1] raw: 'entailment' -> parsed: 0 (entailment)
[sample 2] raw: 'contradiction' -> parsed: 2 (contradiction)
[sample 3] raw: 'entailment' -> parsed: 0 (entailment)
[sample 4] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   1%|          | 16/1537 [00:28<46:43,  1.84s/it]

[sample 100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   2%|▏         | 33/1537 [00:58<43:44,  1.74s/it]

[sample 200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   3%|▎         | 50/1537 [01:29<45:02,  1.82s/it]

[sample 300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   4%|▍         | 66/1537 [01:58<44:09,  1.80s/it]

[sample 400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   5%|▌         | 83/1537 [02:28<43:16,  1.79s/it]

[sample 500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   7%|▋         | 100/1537 [02:59<42:41,  1.78s/it]

[sample 600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   8%|▊         | 116/1537 [03:28<42:51,  1.81s/it]

[sample 700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:   9%|▊         | 133/1537 [03:59<42:12,  1.80s/it]

[sample 800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  10%|▉         | 150/1537 [04:29<40:56,  1.77s/it]

[sample 900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  11%|█         | 166/1537 [04:58<42:21,  1.85s/it]

[sample 1000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  12%|█▏        | 183/1537 [05:29<41:34,  1.84s/it]

[sample 1100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  13%|█▎        | 200/1537 [05:59<39:26,  1.77s/it]

[sample 1200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  14%|█▍        | 216/1537 [06:29<40:14,  1.83s/it]

[sample 1300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  15%|█▌        | 233/1537 [06:59<39:21,  1.81s/it]

[sample 1400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  16%|█▋        | 250/1537 [07:30<39:08,  1.83s/it]

[sample 1500] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  17%|█▋        | 266/1537 [07:59<38:41,  1.83s/it]

[sample 1600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  18%|█▊        | 283/1537 [08:30<37:08,  1.78s/it]

[sample 1700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  20%|█▉        | 300/1537 [09:00<36:33,  1.77s/it]

[sample 1800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  21%|██        | 316/1537 [09:29<36:36,  1.80s/it]

[sample 1900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  22%|██▏       | 333/1537 [09:59<35:40,  1.78s/it]

[sample 2000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  23%|██▎       | 350/1537 [10:29<35:05,  1.77s/it]

[sample 2100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  24%|██▍       | 366/1537 [10:58<34:58,  1.79s/it]

[sample 2200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  25%|██▍       | 383/1537 [11:28<32:52,  1.71s/it]

[sample 2300] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  26%|██▌       | 400/1537 [11:59<33:42,  1.78s/it]

[sample 2400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  27%|██▋       | 416/1537 [12:28<34:33,  1.85s/it]

[sample 2500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  28%|██▊       | 433/1537 [12:58<32:41,  1.78s/it]

[sample 2600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  29%|██▉       | 450/1537 [13:29<32:57,  1.82s/it]

[sample 2700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  30%|███       | 466/1537 [13:58<31:28,  1.76s/it]

[sample 2800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  31%|███▏      | 483/1537 [14:29<32:54,  1.87s/it]

[sample 2900] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  33%|███▎      | 500/1537 [15:01<31:59,  1.85s/it]

[sample 3000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  34%|███▎      | 516/1537 [15:30<30:49,  1.81s/it]

[sample 3100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  35%|███▍      | 533/1537 [16:00<30:47,  1.84s/it]

[sample 3200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  36%|███▌      | 550/1537 [16:31<29:41,  1.81s/it]

[sample 3300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  37%|███▋      | 566/1537 [17:00<29:04,  1.80s/it]

[sample 3400] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  38%|███▊      | 583/1537 [17:31<28:30,  1.79s/it]

[sample 3500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  39%|███▉      | 600/1537 [18:03<29:48,  1.91s/it]

[sample 3600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  40%|████      | 616/1537 [18:31<26:50,  1.75s/it]

[sample 3700] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  41%|████      | 633/1537 [19:03<27:53,  1.85s/it]

[sample 3800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  42%|████▏     | 650/1537 [19:34<27:37,  1.87s/it]

[sample 3900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  43%|████▎     | 666/1537 [20:05<27:11,  1.87s/it]

[sample 4000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  44%|████▍     | 683/1537 [20:36<26:41,  1.88s/it]

[sample 4100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  46%|████▌     | 700/1537 [21:08<25:01,  1.79s/it]

[sample 4200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  47%|████▋     | 716/1537 [21:37<25:59,  1.90s/it]

[sample 4300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  48%|████▊     | 733/1537 [22:08<25:13,  1.88s/it]

[sample 4400] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  49%|████▉     | 750/1537 [22:39<24:44,  1.89s/it]

[sample 4500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  50%|████▉     | 766/1537 [23:09<24:08,  1.88s/it]

[sample 4600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  51%|█████     | 783/1537 [23:41<23:54,  1.90s/it]

[sample 4700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  52%|█████▏    | 800/1537 [24:13<22:58,  1.87s/it]

[sample 4800] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  53%|█████▎    | 816/1537 [24:42<21:37,  1.80s/it]

[sample 4900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  54%|█████▍    | 833/1537 [25:15<21:43,  1.85s/it]

[sample 5000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  55%|█████▌    | 850/1537 [25:46<21:09,  1.85s/it]

[sample 5100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  56%|█████▋    | 866/1537 [26:15<20:01,  1.79s/it]

[sample 5200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  57%|█████▋    | 883/1537 [26:45<20:06,  1.85s/it]

[sample 5300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  59%|█████▊    | 900/1537 [27:17<19:52,  1.87s/it]

[sample 5400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  60%|█████▉    | 916/1537 [27:46<18:30,  1.79s/it]

[sample 5500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  61%|██████    | 933/1537 [28:17<18:55,  1.88s/it]

[sample 5600] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  62%|██████▏   | 950/1537 [28:48<17:20,  1.77s/it]

[sample 5700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  63%|██████▎   | 966/1537 [29:17<17:33,  1.85s/it]

[sample 5800] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  64%|██████▍   | 983/1537 [29:47<16:13,  1.76s/it]

[sample 5900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  65%|██████▌   | 1000/1537 [30:17<15:49,  1.77s/it]

[sample 6000] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  66%|██████▌   | 1016/1537 [30:46<15:39,  1.80s/it]

[sample 6100] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  67%|██████▋   | 1033/1537 [31:17<14:50,  1.77s/it]

[sample 6200] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  68%|██████▊   | 1050/1537 [31:47<14:53,  1.84s/it]

[sample 6300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  69%|██████▉   | 1066/1537 [32:15<14:06,  1.80s/it]

[sample 6400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  70%|███████   | 1083/1537 [32:45<13:16,  1.76s/it]

[sample 6500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  72%|███████▏  | 1100/1537 [33:15<12:44,  1.75s/it]

[sample 6600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  73%|███████▎  | 1116/1537 [33:43<12:07,  1.73s/it]

[sample 6700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  74%|███████▎  | 1133/1537 [34:15<13:03,  1.94s/it]

[sample 6800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  75%|███████▍  | 1150/1537 [34:46<12:31,  1.94s/it]

[sample 6900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  76%|███████▌  | 1166/1537 [35:15<11:14,  1.82s/it]

[sample 7000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  77%|███████▋  | 1183/1537 [35:46<10:51,  1.84s/it]

[sample 7100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  78%|███████▊  | 1200/1537 [36:16<10:46,  1.92s/it]

[sample 7200] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  79%|███████▉  | 1216/1537 [36:47<10:17,  1.92s/it]

[sample 7300] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  80%|████████  | 1233/1537 [37:18<09:18,  1.84s/it]

[sample 7400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  81%|████████▏ | 1250/1537 [37:49<08:44,  1.83s/it]

[sample 7500] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  82%|████████▏ | 1266/1537 [38:19<08:28,  1.88s/it]

[sample 7600] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  83%|████████▎ | 1283/1537 [38:50<07:40,  1.81s/it]

[sample 7700] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  85%|████████▍ | 1300/1537 [39:21<07:21,  1.86s/it]

[sample 7800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  86%|████████▌ | 1316/1537 [39:50<06:38,  1.80s/it]

[sample 7900] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  87%|████████▋ | 1333/1537 [40:22<06:18,  1.85s/it]

[sample 8000] raw: 'neutral' -> parsed: 1 (neutral)


Inference:  88%|████████▊ | 1350/1537 [40:53<06:05,  1.95s/it]

[sample 8100] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  89%|████████▉ | 1366/1537 [41:23<05:06,  1.79s/it]

[sample 8200] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  90%|████████▉ | 1383/1537 [41:54<04:30,  1.76s/it]

[sample 8300] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  91%|█████████ | 1400/1537 [42:26<04:17,  1.88s/it]

[sample 8400] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  92%|█████████▏| 1416/1537 [42:58<04:11,  2.08s/it]

[sample 8500] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  93%|█████████▎| 1433/1537 [43:30<03:20,  1.92s/it]

[sample 8600] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  94%|█████████▍| 1450/1537 [44:02<02:45,  1.90s/it]

[sample 8700] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  95%|█████████▌| 1466/1537 [44:32<02:10,  1.84s/it]

[sample 8800] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  96%|█████████▋| 1483/1537 [45:03<01:38,  1.82s/it]

[sample 8900] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference:  98%|█████████▊| 1500/1537 [45:34<01:09,  1.88s/it]

[sample 9000] raw: 'entailment' -> parsed: 0 (entailment)


Inference:  99%|█████████▊| 1516/1537 [46:05<00:39,  1.90s/it]

[sample 9100] raw: 'contradiction' -> parsed: 2 (contradiction)


Inference: 100%|█████████▉| 1533/1537 [46:35<00:06,  1.74s/it]

[sample 9200] raw: 'neutral' -> parsed: 1 (neutral)


Inference: 100%|██████████| 1537/1537 [46:41<00:00,  1.82s/it]

True label dist: {np.int64(1): 3043, np.int64(0): 3101, np.int64(2): 3073}
Pred label dist: {np.int64(1): 1481, np.int64(0): 5180, np.int64(2): 2556}
  accuracy=0.7386, f1_macro=0.7260
Saved results/metrics.json





NameError: name '_nli_orig_stdout' is not defined

In [None]:
# Summary: per config/split
for config_name, splits in all_metrics.items():
    for split_name, m in splits.items():
        print(f"{config_name} / {split_name}: acc={m['accuracy']:.4f}, F1_macro={m['f1_macro']:.4f}, F1_per_class={m['f1_per_class']}")