# LLM Zero-Shot Evaluation on QNLI

Zero-shot evaluation of the base language model **Qwen2.5-0.5B** on the QNLI task, with no fine-tuning.

## 1. Setup

In [24]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm

MODEL_NAME = "Qwen/Qwen2.5-0.5B"
BATCH_SIZE = 8
MAX_LENGTH = 512
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device: {DEVICE}")

Device: cuda


## 2. Load Dataset

QNLI (Question-answering Natural Language Inference) is the task of deciding whether a sentence contains the answer to a question.

In [25]:
dataset = load_dataset("glue", "qnli", split="validation")
print(f"Validation size: {len(dataset)}")

Validation size: 5463
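For orientation, each QNLI record in the `glue`/`qnli` configuration carries, among its fields, a `question`, a `sentence`, and an integer `label`, where 0 means entailment and 1 means not_entailment. The sketch below shows that mapping on a made-up record; the record is an illustration and is not taken from the dataset, and `label_name` is a hypothetical helper.

```python
# Label convention of the GLUE QNLI task: 0 = entailment, 1 = not_entailment.
LABEL_NAMES = ["entailment", "not_entailment"]

# Hypothetical record for illustration only; it is not taken from the dataset.
example = {
    "question": "When was the bridge opened?",
    "sentence": "The bridge was opened to traffic in 1932.",
    "label": 0,
}

def label_name(label_id):
    """Map an integer label to its class name."""
    return LABEL_NAMES[label_id]

print(label_name(example["label"]))  # entailment
```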


## 3. Load Model

We use a decoder-only model with left padding, so that generation continues directly from the end of each prompt.

In [26]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32
).to(DEVICE)
model.eval()

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
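Why the padding side matters: during generation, new tokens are appended to the right end of every row in the batch. With right padding, a short prompt would be followed by pad tokens and only then by the generated text; with left padding, every prompt ends at the same position and the continuation starts right after it. The pure-Python sketch below illustrates this with made-up token IDs. `pad_batch` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to equal length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(pad)   # padded positions are masked out
        ones = [1] * len(seq)       # real tokens are attended to
        if side == "left":
            input_ids.append(pad + seq)
            attention_mask.append(mask_pad + ones)
        else:
            input_ids.append(seq + pad)
            attention_mask.append(ones + mask_pad)
    return input_ids, attention_mask

# Made-up token IDs; 0 plays the role of the pad token.
prompts = [[11, 12, 13, 14], [21, 22]]
left_ids, left_mask = pad_batch(prompts, pad_id=0, side="left")
right_ids, _ = pad_batch(prompts, pad_id=0, side="right")

print(left_ids)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(right_ids)  # [[11, 12, 13, 14], [21, 22, 0, 0]]
```

In the left-padded batch both rows end with a real prompt token, so slicing off the first `max_len` positions of the generated sequence yields exactly the continuation for every row.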

## 4. Prompt Engineering

A simple prompt that explicitly states the expected answer format.

In [27]:
def create_prompt(question, sentence):
    return f"""Does the sentence answer the question?

Question: {question}
Sentence: {sentence}

Answer with only "yes" or "no":"""

In [28]:
def parse_answer(text):
    """Parse the model's answer into a QNLI label (substring match)."""
    text = text.lower().strip()
    if "yes" in text:
        return 0  # entailment
    elif "no" in text:
        return 1  # not_entailment
    else:
        return -1  # unrecognized
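One caveat of substring matching: the check for "no" also succeeds on outputs such as "not sure" or "I don't know", and an output containing both words is always read as "yes". In this run every generation starts with ` yes` or ` no`, so the reported results are unaffected, but a stricter variant could look only at the first word. The following is a minimal sketch of such a variant; `parse_answer_strict` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

def parse_answer_strict(text):
    """Classify by the first word only.

    Returns 0 (entailment), 1 (not_entailment), or -1 (unrecognized).
    """
    # Skip leading whitespace, quotes, and punctuation, then take the first run of letters.
    match = re.match(r"[^a-z]*([a-z]+)", text.lower())
    first_word = match.group(1) if match else ""
    if first_word == "yes":
        return 0
    if first_word == "no":
        return 1
    return -1

for sample in [" yes", " no", " yes\nYou are an", " not sure", "I don't know"]:
    print(repr(sample), parse_answer_strict(sample))
```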

## 5. Inference

Generate answers for the entire validation set.

In [30]:
all_preds = []
all_labels = []
raw_outputs = []

with torch.no_grad():
    for i in tqdm(range(0, len(dataset), BATCH_SIZE), desc="Evaluating"):
        batch = dataset[i:i+BATCH_SIZE]
        
        prompts = [
            create_prompt(q, s) 
            for q, s in zip(batch["question"], batch["sentence"])
        ]
        
        inputs = tokenizer(
            prompts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=MAX_LENGTH
        ).to(DEVICE)
        
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id
        )
        
        for j, output in enumerate(outputs):
            input_len = inputs["input_ids"][j].shape[0]
            generated = tokenizer.decode(output[input_len:], skip_special_tokens=True)
            raw_outputs.append(generated)
            all_preds.append(parse_answer(generated))
        
        all_labels.extend(batch["label"])

Evaluating: 100%|██████████| 683/683 [00:57<00:00, 11.98it/s]
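The loop above walks the validation set in slices of `BATCH_SIZE`, and the last slice is shorter whenever the dataset size is not a multiple of the batch size. The sketch below reproduces only that index arithmetic and shows why the progress bar reports 683 iterations for 5463 examples. `batch_ranges` is a hypothetical helper introduced for this illustration.

```python
import math

def batch_ranges(n_examples, batch_size):
    """Yield (start, stop) index pairs covering range(n_examples) in order."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)

ranges = list(batch_ranges(5463, 8))
print(len(ranges))          # 683
print(ranges[-1])           # (5456, 5463), i.e. a final batch of 7 examples
print(math.ceil(5463 / 8))  # 683
```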


## 6. Analysis of Generated Answers

Check what the model actually generates.

In [31]:
from collections import Counter

print("Distribution of model answers:")
print(Counter(raw_outputs).most_common(10))

unrecognized = sum(1 for p in all_preds if p == -1)
print(f"\nUnrecognized: {unrecognized} ({unrecognized/len(all_preds)*100:.1f}%)")

Distribution of model answers:
[(' no', 3292), (' yes', 2170), (' yes\nYou are an', 1)]

Unrecognized: 0 (0.0%)


## 7. Results

Model performance metrics.

In [32]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

valid_idx = [i for i, p in enumerate(all_preds) if p != -1]
filtered_preds = [all_preds[i] for i in valid_idx]
filtered_labels = [all_labels[i] for i in valid_idx]

print(f"Evaluated on {len(filtered_preds)} of {len(all_preds)} examples")

accuracy = accuracy_score(filtered_labels, filtered_preds)
print(f"\n{'='*50}")
print(f"ACCURACY: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"{'='*50}")

print(f"\nClassification Report:")
print(classification_report(filtered_labels, filtered_preds, target_names=["entailment", "not_entailment"]))

print("Confusion Matrix:")
print(confusion_matrix(filtered_labels, filtered_preds))

Evaluated on 5463 of 5463 examples

ACCURACY: 0.5792 (57.92%)

Classification Report:
                precision    recall  f1-score   support

    entailment       0.59      0.48      0.53      2702
not_entailment       0.57      0.68      0.62      2761

      accuracy                           0.58      5463
     macro avg       0.58      0.58      0.57      5463
  weighted avg       0.58      0.58      0.57      5463

Confusion Matrix:
[[1287 1415]
 [ 884 1877]]
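The numbers in the classification report can be re-derived from the confusion matrix alone, which is a handy sanity check. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. The sketch below redoes the arithmetic in plain Python and also computes the accuracy of always predicting the larger class as a reference point. The helper name `per_class_metrics` is illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = prediction.
cm = [[1287, 1415],
      [884, 1877]]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class index k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly labeled k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
majority_baseline = max(sum(row) for row in cm) / total

print(f"accuracy: {accuracy:.4f}")                    # 0.5792
print(f"majority baseline: {majority_baseline:.4f}")  # 0.5054
for k, name in enumerate(["entailment", "not_entailment"]):
    p, r, f1 = per_class_metrics(cm, k)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Read this way, the 57.92% accuracy sits roughly 7.4 percentage points above the 50.54% obtained by always answering not_entailment.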


## 8. Examples

Sample model predictions.

In [33]:
print("\nEXAMPLES:")
for i in range(10):
    true_label = "entailment" if all_labels[i] == 0 else "not_entailment"
    pred_label = "entailment" if all_preds[i] == 0 else ("not_entailment" if all_preds[i] == 1 else "???")
    mark = "✓" if all_labels[i] == all_preds[i] else "✗"
    print(f"{mark} '{raw_outputs[i][:15]:<15}' | Pred: {pred_label:<15} | True: {true_label}")


EXAMPLES:
✓ ' yes           ' | Pred: entailment      | True: entailment
✓ ' no            ' | Pred: not_entailment  | True: not_entailment
✗ ' yes           ' | Pred: entailment      | True: not_entailment
✗ ' no            ' | Pred: not_entailment  | True: entailment
✗ ' yes           ' | Pred: entailment      | True: not_entailment
✗ ' yes           ' | Pred: entailment      | True: not_entailment
✓ ' no            ' | Pred: not_entailment  | True: not_entailment
✓ ' no            ' | Pred: not_entailment  | True: not_entailment
✗ ' yes           ' | Pred: entailment      | True: not_entailment
✗ ' no            ' | Pred: not_entailment  | True: entailment
