# Load Turkish NLI Datasets

This notebook loads three Turkish Natural Language Inference (NLI) datasets from the `yilmazzey/sdp2-nli` collection.

## Install Required Libraries

Install the `datasets` library if it is not already available.

In [1]:
!pip install datasets



## Import Libraries

In [2]:
from datasets import load_dataset



## Load the Turkish NLI Datasets

In [3]:
# Local cache directory for downloaded datasets and models; adjust for your machine.
cache_dir = "/Users/denizcii/Downloads/sdp2-nli/datasets"

multinli_ds = load_dataset("yilmazzey/sdp2-nli", "multinli_tr_1_1", cache_dir=cache_dir)
print("MultiNLI Turkish dataset loaded successfully!")
print(multinli_ds)

snli_ds = load_dataset("yilmazzey/sdp2-nli", "snli_tr_1_1", cache_dir=cache_dir)
print("SNLI Turkish dataset loaded successfully!")
print(snli_ds)

trglue_ds = load_dataset("yilmazzey/sdp2-nli", "trglue_mnli", cache_dir=cache_dir)
print("TrGLUE MNLI dataset loaded successfully!")
print(trglue_ds)

Generating train split: 100%|██████████| 392599/392599 [00:00<00:00, 3901649.71 examples/s]
Generating validation_matched split: 100%|██████████| 9809/9809 [00:00<00:00, 3004156.84 examples/s]
Generating validation_mismatched split: 100%|██████████| 9825/9825 [00:00<00:00, 3571592.72 examples/s]


MultiNLI Turkish dataset loaded successfully!
DatasetDict({
    train: Dataset({
        features: ['annotator_labels', 'genre', 'pairID', 'promptID', 'translation_annotations', 'premise', 'hypothesis', 'label'],
        num_rows: 392599
    })
    validation_matched: Dataset({
        features: ['annotator_labels', 'genre', 'pairID', 'promptID', 'translation_annotations', 'premise', 'hypothesis', 'label'],
        num_rows: 9809
    })
    validation_mismatched: Dataset({
        features: ['annotator_labels', 'genre', 'pairID', 'promptID', 'translation_annotations', 'premise', 'hypothesis', 'label'],
        num_rows: 9825
    })
})


Generating train split: 100%|██████████| 548487/548487 [00:00<00:00, 6129917.39 examples/s]
Generating validation split: 100%|██████████| 9836/9836 [00:00<00:00, 3121608.21 examples/s]
Generating test split: 100%|██████████| 9824/9824 [00:00<00:00, 1570545.91 examples/s]


SNLI Turkish dataset loaded successfully!
DatasetDict({
    train: Dataset({
        features: ['annotator_labels', 'captionID', 'pairID', 'translation_annotations', 'premise', 'hypothesis', 'label'],
        num_rows: 548487
    })
    validation: Dataset({
        features: ['annotator_labels', 'captionID', 'pairID', 'translation_annotations', 'premise', 'hypothesis', 'label'],
        num_rows: 9836
    })
    test: Dataset({
        features: ['annotator_labels', 'captionID', 'pairID', 'translation_annotations', 'premise', 'hypothesis', 'label'],
        num_rows: 9824
    })
})


Generating train split: 100%|██████████| 162788/162788 [00:00<00:00, 4231108.00 examples/s]
Generating validation_matched split: 100%|██████████| 9050/9050 [00:00<00:00, 3310233.82 examples/s]
Generating validation_mismatched split: 100%|██████████| 9200/9200 [00:00<00:00, 2349893.23 examples/s]
Generating test_matched split: 100%|██████████| 9008/9008 [00:00<00:00, 3935655.25 examples/s]
Generating test_mismatched split: 100%|██████████| 9217/9217 [00:00<00:00, 3832546.84 examples/s]

TrGLUE MNLI dataset loaded successfully!
DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 162788
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9050
    })
    validation_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9200
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9008
    })
    test_mismatched: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9217
    })
})
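The three `load_dataset` calls above differ only in the configuration name, so they could be collapsed into a loop. The following is a minimal sketch rather than the notebook's own code: `load_configs` and `NLI_CONFIGS` are hypothetical names, and the loader is passed in as an argument so the helper can be exercised without network access.

```python
from typing import Any, Callable, Dict, Iterable


def load_configs(
    loader: Callable[..., Any],
    repo: str,
    configs: Iterable[str],
    **loader_kwargs: Any,
) -> Dict[str, Any]:
    """Call loader(repo, config, **loader_kwargs) once per config name.

    Hypothetical helper: `loader` is meant to be `datasets.load_dataset`,
    but any callable accepting the same arguments works.
    """
    return {name: loader(repo, name, **loader_kwargs) for name in configs}


# Configuration names taken from the loading cell above.
NLI_CONFIGS = ["multinli_tr_1_1", "snli_tr_1_1", "trglue_mnli"]
```

In this notebook the call would look like `load_configs(load_dataset, "yilmazzey/sdp2-nli", NLI_CONFIGS, cache_dir=cache_dir)`, with each dataset then available under its configuration name.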





## Explore Dataset Splits

Check which splits are available in each dataset. The cell that lists them is the last cell of this notebook.
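Beyond the split names, the row counts often matter when choosing an evaluation split. A minimal sketch, assuming each dataset behaves like a mapping from split name to a sized collection (as a `DatasetDict` does); `split_sizes` is a hypothetical helper name, not part of the `datasets` library.

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset: Mapping[str, Sized]) -> Dict[str, int]:
    """Map each split name to its number of rows."""
    return {split: len(rows) for split, rows in dataset.items()}
```

Calling `split_sizes(snli_ds)` should reproduce the `num_rows` values printed when the dataset was loaded.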

## Load Pre-trained Turkish NLI Model

Load a BERTurk model that has already been fine-tuned on All-NLI-TR, for use in evaluation.

In [4]:
!pip install transformers torch



In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "emrecan/bert-base-turkish-cased-allnli_tr"

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_name, cache_dir=cache_dir)

print(f"Model loaded: {model_name}")
print(f"Number of labels: {model.config.num_labels}")
print(f"Label mapping: {model.config.id2label}")

Model loaded: emrecan/bert-base-turkish-cased-allnli_tr
Number of labels: 3
Label mapping: {0: 'entailment', 1: 'neutral', 2: 'contradiction'}


## Zero-Shot Evaluation

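Evaluate the model exactly as loaded, without any additional training in this notebook. Because the checkpoint was already fine-tuned on All-NLI-TR, the scores below measure its off-the-shelf performance on each dataset.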

In [9]:
from sklearn.metrics import accuracy_score, classification_report, f1_score
import numpy as np

def evaluate_nli(model, tokenizer, dataset, dataset_name, split_name=None):
    """Evaluate NLI model on a dataset"""
    model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    
    # Get evaluation split
    if split_name:
        eval_data = dataset[split_name]
    elif 'test' in dataset:
        eval_data = dataset['test']
        split_name = 'test'
    elif 'test_matched' in dataset:
        eval_data = dataset['test_matched']
        split_name = 'test_matched'
    elif 'validation' in dataset:
        eval_data = dataset['validation']
        split_name = 'validation'
    elif 'validation_matched' in dataset:
        eval_data = dataset['validation_matched']
        split_name = 'validation_matched'
    else:
        print(f"No test/validation split found in {dataset_name}")
        return None, None
    
    predictions = []
    true_labels = []
    
    print(f"\nEvaluating {dataset_name} - {split_name} ({len(eval_data)} samples)...")
    
    for i, example in enumerate(eval_data):
        if i % 500 == 0:
            print(f"  Processing {i}/{len(eval_data)}...")
        
        # Format input based on dataset structure
        premise = example.get('premise', example.get('sentence1', ''))
        hypothesis = example.get('hypothesis', example.get('sentence2', ''))
        
        inputs = tokenizer(premise, hypothesis, 
                          return_tensors="pt", 
                          padding=True, 
                          truncation=True, 
                          max_length=512)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = model(**inputs)
            pred = torch.argmax(outputs.logits, dim=-1).item()
        
        predictions.append(pred)
        true_labels.append(example['label'])
    
    accuracy = accuracy_score(true_labels, predictions)
    f1_macro = f1_score(true_labels, predictions, average='macro')
    
    print(f"\n{dataset_name} - {split_name} Results:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Macro: {f1_macro:.4f}")
    print("\nClassification Report:")
    print(classification_report(true_labels, predictions, 
                               target_names=['entailment', 'neutral', 'contradiction']))
    
    return accuracy, f1_macro

# Evaluate on all datasets (full splits)
results = {}

# MultiNLI-TR: has validation_matched and validation_mismatched
results['MultiNLI-TR (matched)'] = evaluate_nli(model, tokenizer, multinli_ds, "MultiNLI-TR", 
                                                 split_name='validation_matched')
results['MultiNLI-TR (mismatched)'] = evaluate_nli(model, tokenizer, multinli_ds, "MultiNLI-TR", 
                                                    split_name='validation_mismatched')

# SNLI-TR: has standard test split
results['SNLI-TR'] = evaluate_nli(model, tokenizer, snli_ds, "SNLI-TR")

# TrGLUE-MNLI: has test_matched and test_mismatched
results['TrGLUE-MNLI (matched)'] = evaluate_nli(model, tokenizer, trglue_ds, "TrGLUE-MNLI", 
                                                 split_name='test_matched')
results['TrGLUE-MNLI (mismatched)'] = evaluate_nli(model, tokenizer, trglue_ds, "TrGLUE-MNLI", 
                                                    split_name='test_mismatched')

print("\n" + "="*70)
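print("SUMMARY - ACCURACY AND F1 MACRO")
print("="*70)
for dataset_name, (acc, f1) in results.items():
    if acc is None:  # evaluate_nli found no usable split
        print(f"{dataset_name:35} | no evaluation split found")
        continue
    print(f"{dataset_name:35} | Accuracy: {acc:.4f} | F1 Macro: {f1:.4f}")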


Evaluating MultiNLI-TR - validation_matched (9809 samples)...
  Processing 0/9809...
  Processing 500/9809...
  Processing 1000/9809...
  Processing 1500/9809...
  Processing 2000/9809...
  Processing 2500/9809...
  Processing 3000/9809...
  Processing 3500/9809...
  Processing 4000/9809...
  Processing 4500/9809...
  Processing 5000/9809...
  Processing 5500/9809...
  Processing 6000/9809...
  Processing 6500/9809...
  Processing 7000/9809...
  Processing 7500/9809...
  Processing 8000/9809...
  Processing 8500/9809...
  Processing 9000/9809...
  Processing 9500/9809...

MultiNLI-TR - validation_matched Results:
Accuracy: 0.7983
F1 Macro: 0.7979

Classification Report:
               precision    recall  f1-score   support

   entailment       0.86      0.78      0.82      3475
      neutral       0.74      0.78      0.76      3123
contradiction       0.80      0.84      0.82      3211

     accuracy                           0.80      9809
    macro avg       0.80      0.80      0.8
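
The F1 Macro value reported above is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its support. As a quick check, the rounded per-class scores in the report (0.82, 0.76, 0.82) average to 0.80, consistent with the printed 0.7979. The sketch below spells out the computation in plain Python. The function names are illustrative, and treating an undefined per-class F1 as 0.0 is an assumption of this sketch.

```python
from typing import Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    labels: Sequence[int] = (0, 1, 2),  # entailment, neutral, contradiction
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)
```

This is meant only to illustrate the metric; the evaluation cell itself relies on `sklearn.metrics.f1_score` with `average='macro'`.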

## (Optional) Fine-tune on sdp2-nli

Optionally fine-tune the model further on the three sdp2-nli datasets, which may improve performance on them.

In [None]:
from transformers import Trainer, TrainingArguments
from datasets import concatenate_datasets

# Tokenization helper, applied to batches of examples via Dataset.map
def prepare_dataset(examples):
    """Tokenize premise-hypothesis pairs"""
    premise = examples.get('premise', examples.get('sentence1'))
    hypothesis = examples.get('hypothesis', examples.get('sentence2'))
    return tokenizer(premise, hypothesis, truncation=True, padding='max_length', max_length=512)

# Prepare training data
train_datasets = []
if 'train' in multinli_ds:
    train_datasets.append(multinli_ds['train'])
if 'train' in snli_ds:
    train_datasets.append(snli_ds['train'])
if 'train' in trglue_ds:
    train_datasets.append(trglue_ds['train'])

# Combine and tokenize
combined_train = concatenate_datasets(train_datasets)
tokenized_train = combined_train.map(prepare_dataset, batched=True)

# Prepare validation data
eval_datasets = []
if 'validation_matched' in multinli_ds:
    eval_datasets.append(multinli_ds['validation_matched'])
elif 'validation' in multinli_ds:
    eval_datasets.append(multinli_ds['validation'])
    
if 'validation' in snli_ds:
    eval_datasets.append(snli_ds['validation'])
    
if 'validation_matched' in trglue_ds:
    eval_datasets.append(trglue_ds['validation_matched'])
elif 'validation' in trglue_ds:
    eval_datasets.append(trglue_ds['validation'])

combined_eval = concatenate_datasets(eval_datasets)
tokenized_eval = combined_eval.map(prepare_dataset, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    processing_class=tokenizer,
)

# Uncomment to fine-tune
# trainer.train()
# trainer.save_model("./fine_tuned_model")

print("Fine-tuning setup complete. Uncomment trainer.train() to start training.")

Map: 100%|██████████| 28695/28695 [00:03<00:00, 9048.28 examples/s]


Fine-tuning setup complete. Uncomment trainer.train() to start training.
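
The three datasets do not share a schema: according to the feature lists printed when they were loaded, only `premise`, `hypothesis`, and `label` appear in all of them. One way to make the combined dataset's schema explicit is to reduce each dataset to the shared columns before concatenating. The sketch below computes that intersection. `shared_columns` is a hypothetical helper, and the column lists are copied from the outputs earlier in the notebook.

```python
from typing import List, Sequence


def shared_columns(feature_lists: Sequence[Sequence[str]]) -> List[str]:
    """Return the columns present in every list, in the order of the first list."""
    if not feature_lists:
        return []
    common = set(feature_lists[0]).intersection(*map(set, feature_lists[1:]))
    return [name for name in feature_lists[0] if name in common]


# Feature lists copied from the dataset summaries printed earlier.
MULTINLI_COLS = ['annotator_labels', 'genre', 'pairID', 'promptID',
                 'translation_annotations', 'premise', 'hypothesis', 'label']
SNLI_COLS = ['annotator_labels', 'captionID', 'pairID',
             'translation_annotations', 'premise', 'hypothesis', 'label']
TRGLUE_COLS = ['premise', 'hypothesis', 'label']

print(shared_columns([MULTINLI_COLS, SNLI_COLS, TRGLUE_COLS]))
# ['premise', 'hypothesis', 'label']
```

In the notebook, the resulting list could be passed to each training split's `select_columns` method before `concatenate_datasets`, assuming the installed `datasets` version provides that method.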


In [None]:
print("Available splits in each dataset:\n")
print(f"MultiNLI-TR: {list(multinli_ds.keys())}")
print(f"SNLI-TR: {list(snli_ds.keys())}")
print(f"TrGLUE-MNLI: {list(trglue_ds.keys())}")


Available splits in each dataset:

MultiNLI-TR: ['train', 'validation_matched', 'validation_mismatched']
SNLI-TR: ['train', 'validation', 'test']
TrGLUE-MNLI: ['train', 'validation_matched', 'validation_mismatched', 'test_matched', 'test_mismatched']
