# Language model applied to Ancient Greek

Students:
<pre>Sarah Batara [qq577530]</pre>
<pre>Arjun Menon []</pre>
<pre>Javier Marsicano [qq577517]</pre>

## Motivation
We chose this topic because we are interested in language models and have prior experience with NLP. For most language tasks, LLMs achieve state-of-the-art performance. However, they still struggle in some settings, and low-resource languages are one of them. As the name implies, datasets and corpora for low-resource languages are scarce, or simply too small to train models such as LLMs that require large amounts of data. Applying deep learning to low-resource languages, for instance dialects spoken by specific ethnic groups in Africa or Asia, remains an active area of research.

Ancient or classical languages are a good example of low-resource languages, and Ancient Greek in particular. The language also evolved significantly over time, with changes in grammar and vocabulary. Over a few centuries it went through several distinct phases, including Classical Greek, Hellenistic Greek, and Koine Greek, each with its own characteristics. This evolution makes it even harder to build a single, comprehensive language model that accurately represents the complexities of Ancient Greek.

There is still little work on language models applied specifically to Ancient Greek, but we found two relevant papers, both published in 2023. One is [Kevin Krahn, Derrick Tate, and Andrew C. Lamicela (2023) Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation](https://aclanthology.org/2023.alp-1.2/), which is based on multilingual models; the other is [Frederick Riemenschneider and Anette Frank (2023) Exploring Large Language Models for Classical Philology](https://aclanthology.org/2023.acl-long.846/), which is based on monolingual models. In their research, the authors leveraged a multilingual translation model to achieve high accuracy in translating Ancient Greek texts.

While their results were promising, we identified several areas for potential improvement. In this project, we aim to explore a more targeted approach focusing specifically on Koine Greek, to determine whether a specialized model can yield better translation performance.

## Model

For low-resource languages, BERT or one of its variants is a common choice of model. There is a variant trained specifically for [Ancient Greek](https://github.com/pranaydeeps/Ancient-Greek-BERT), which appears to be the first one published. Later, a more elaborate language model, GreBERTa, was published. It derives from RoBERTa, was trained specifically on Ancient Greek without using any Modern Greek data, and is suitable for fine-tuning on NLP tasks.

## Datasets

In our work, we plan to use the same datasets employed in these studies, along with several additional ones that we found in public repositories. The New Testament was originally written in Koine Greek and is one of the most translated texts in history, which has made it a central reference point for translation theory and linguistics. Hence, almost all related work uses the New Testament as part of its dataset, since many repositories already provide the text formatted and annotated. The most relevant ones are:

https://github.com/STEPBible/STEPBible-Data

https://github.com/Faithlife/SBLGNT

https://github.com/OpenGreekAndLatin/First1KGreek

https://github.com/proiel/proiel-treebank/





In [None]:
!pip install evaluate seqeval
!pip install datasets==3.6.0


In [None]:
# Import required libraries
import pandas as pd
import csv
import re
import json
import argparse
from pathlib import Path
from collections import Counter
from typing import Dict, List, Tuple
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
    AutoModel,
    get_linear_schedule_with_warmup
)
from datasets import Dataset, DatasetDict
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import random
from dataclasses import dataclass
from tqdm import tqdm
import os, warnings, unicodedata, numpy as np
import evaluate
from seqeval.metrics import classification_report

try:
    import google.colab  # noqa: F401  (only importable inside Colab)

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    path_prefix = '/content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/'
else:
    path_prefix = ''

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## POS Tagging and Word Alignment

### Fine-tuning GreBerta
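
The loader in the cell below reads verse-level JSON files. The following is a minimal sketch of the layout it appears to expect, inferred only from the keys that `load_pos_data` accesses (`data`, `verse_id`, `greek_tokens`, `word`, `pos`). The verse id format and the sample tokens are illustrative assumptions, not taken from the actual data files.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample in the shape read by load_pos_data; values are illustrative.
sample = {
    "data": [
        {
            "verse_id": "John.1.1",  # assumed id format
            "greek_tokens": [
                {"word": "Ἐν", "pos": "P-"},
                {"word": "ἀρχῇ", "pos": "N-"},
                {"word": "ἦν", "pos": "V-"},
            ],
        },
        # A verse without tokens, which the loader skips.
        {"verse_id": "empty", "greek_tokens": []},
    ]
}

# Round-trip through a UTF-8 file, as the notebook does with train/dev/test.json.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

# Same extraction as load_pos_data: one record per non-empty verse.
pos_data = [
    {
        "tokens": [t["word"] for t in verse["greek_tokens"]],
        "pos_tags": [t["pos"] for t in verse["greek_tokens"]],
        "verse_id": verse["verse_id"],
    }
    for verse in loaded["data"]
    if verse["greek_tokens"]
]
print(pos_data)
```

Each resulting record holds parallel `tokens` and `pos_tags` lists, which is the form `Dataset.from_list` receives later in the cell.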

In [None]:
def load_pos_data(filepath: Path) -> List[Dict]:
    """Load and extract POS tagging data from alignment JSON"""
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)

    pos_data = []
    for verse in data['data']:
        tokens = [t['word'] for t in verse['greek_tokens']]
        pos_tags = [t['pos'] for t in verse['greek_tokens']]

        if tokens:  # Skip empty verses
            pos_data.append({
                'tokens': tokens,
                'pos_tags': pos_tags,
                'verse_id': verse['verse_id']
            })

    return pos_data


def create_label_mapping(train_data: List[Dict]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Create mappings between POS tags and integer IDs (both directions)"""
    all_tags = set()
    for example in train_data:
        all_tags.update(example['pos_tags'])

    # Sort for consistency
    sorted_tags = sorted(all_tags)
    tag_to_id = {tag: i for i, tag in enumerate(sorted_tags)}
    id_to_tag = {i: tag for tag, i in tag_to_id.items()}

    return tag_to_id, id_to_tag


def tokenize_and_align_labels(examples, tokenizer, tag_to_id):
    """
    Tokenize text and align POS labels with subword tokens.

    When a word is split into several subword pieces, we assign the label
    to the first piece and -100 to the rest (ignored in loss).
    """
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        is_split_into_words=True,
        padding=False,
        max_length=512
    )

    labels = []
    for i, label_list in enumerate(examples['pos_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None

        for word_idx in word_ids:
            # Special tokens get -100 (ignored in loss)
            if word_idx is None:
                label_ids.append(-100)
            # First subword of each word gets the label
            elif word_idx != previous_word_idx:
                label_ids.append(tag_to_id[label_list[word_idx]])
            # Other subwords get -100
            else:
                label_ids.append(-100)

            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs


def compute_metrics(eval_pred, id_to_tag):
    """Compute token-level accuracy over labelled (non -100) positions"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [id_to_tag[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id_to_tag[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # Flatten for metrics
    flat_predictions = [tag for sent in true_predictions for tag in sent]
    flat_labels = [tag for sent in true_labels for tag in sent]

    accuracy = accuracy_score(flat_labels, flat_predictions)

    return {
        'accuracy': accuracy,
        'num_examples': len(flat_labels)
    }


def train_pos_tagger():
    # Fine-tune GreBerta for POS tagging
    args = {
        'epochs': 3,  # Number of training epochs
        'batch_size': 16,  # Training batch size
        'lr': 2e-5,  # Learning rate
        'model_name': 'bowphs/GreBerta',  # Base model name
        'output_dir': path_prefix + 'pos_tagger_output', # Output directory
    }

    print("=" * 80)
    print("FINE-TUNING GREBERTA FOR POS TAGGING")
    print("=" * 80)

    # 1. Load data
    print("\n1. Loading data...")
    train_data = load_pos_data(Path(path_prefix + 'data/train.json'))
    dev_data = load_pos_data(Path(path_prefix + 'data/dev.json'))
    test_data = load_pos_data(Path(path_prefix + 'data/test.json'))

    print(f"  Train: {len(train_data):,} verses")
    print(f"  Dev:   {len(dev_data):,} verses")
    print(f"  Test:  {len(test_data):,} verses")

    # 2. Create label mapping
    print("\n2. Creating label mapping...")
    tag_to_id, id_to_tag = create_label_mapping(train_data)
    num_labels = len(tag_to_id)
    print(f"  Found {num_labels} POS tags:")
    for tag, idx in sorted(tag_to_id.items(), key=lambda x: x[1]):
        # Count occurrences in train
        count = sum(1 for ex in train_data for t in ex['pos_tags'] if t == tag)
        print(f"    {idx:2d}. {tag:5s} ({count:,} tokens)")

    # 3. Load tokenizer and model
    print(f"\n3. Loading {args['model_name']}...")
    tokenizer = AutoTokenizer.from_pretrained(args['model_name'], add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(
        args['model_name'],
        num_labels=num_labels,
        id2label=id_to_tag,
        label2id=tag_to_id
    )
    print(f"  ✓ Model loaded with {num_labels} labels")
    print(f"  ✓ Model has {sum(p.numel() for p in model.parameters()):,} parameters")

    # 4. Create datasets
    print("\n4. Creating Hugging Face datasets...")
    train_dataset = Dataset.from_list(train_data)

    dev_dataset = Dataset.from_list(dev_data)
    test_dataset = Dataset.from_list(test_data)


    # Tokenize
    print("  Tokenizing...")
    train_dataset = train_dataset.map(
        lambda x: tokenize_and_align_labels(x, tokenizer, tag_to_id),
        batched=True,
        remove_columns=['tokens', 'pos_tags', 'verse_id']
    )
    dev_dataset = dev_dataset.map(
        lambda x: tokenize_and_align_labels(x, tokenizer, tag_to_id),
        batched=True,
        remove_columns=['tokens', 'pos_tags', 'verse_id']
    )
    test_dataset = test_dataset.map(
        lambda x: tokenize_and_align_labels(x, tokenizer, tag_to_id),
        batched=True,
        remove_columns=['tokens', 'pos_tags', 'verse_id']
    )
    print(f"  ✓ Tokenized {len(train_dataset):,} training examples")

    # 5. Setup training
    print(f"\n5. Setting up training...")
    print(f"  Epochs: {args['epochs']}")
    print(f"  Batch size: {args['batch_size']}")
    print(f"  Learning rate: {args['lr']}")
    print(f"  Output dir: {args['output_dir']}")

    training_args = TrainingArguments(
        output_dir=args['output_dir'],
        learning_rate=args['lr'],
        per_device_train_batch_size=args['batch_size'],
        per_device_eval_batch_size=args['batch_size'],
        num_train_epochs=args['epochs'],
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        push_to_hub=False,
        logging_dir=f"{args['output_dir']}/logs",
        logging_steps=50,
        report_to="none" # Added to prevent W&B login prompt
    )

    data_collator = DataCollatorForTokenClassification(tokenizer)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=lambda x: compute_metrics(x, id_to_tag),
    )

    # 6. Train!
    print("\n" + "=" * 80)
    print("6. TRAINING STARTED")
    print("=" * 80)

    train_result = trainer.train()

    print("\n" + "=" * 80)
    print("TRAINING COMPLETE!")
    print("=" * 80)
    print(f"\nTraining metrics:")
    for key, value in train_result.metrics.items():
        print(f"  {key}: {value}")

    # 7. Evaluate on dev set
    print("\n7. Evaluating on dev set...")
    dev_results = trainer.evaluate(eval_dataset=dev_dataset)
    print(f"Dev accuracy: {dev_results['eval_accuracy']:.4f}")

    # 8. Evaluate on test set
    print("\n8. Evaluating on test set...")
    test_results = trainer.evaluate(eval_dataset=test_dataset)
    print(f"Test accuracy: {test_results['eval_accuracy']:.4f}")

    # 9. Save model
    print(f"\n9. Saving model to {args['output_dir']}...")
    trainer.save_model(args['output_dir'])
    tokenizer.save_pretrained(args['output_dir'])

    # Save label mappings (JSON stores the integer ids of id_to_tag as string keys)
    with open(Path(args['output_dir']) / 'label_mapping.json', 'w', encoding='utf-8') as f:
        json.dump({'tag_to_id': tag_to_id, 'id_to_tag': id_to_tag}, f, indent=2)

    print("\n" + "=" * 80)
    print("✓ FINE-TUNING COMPLETE!")
    print("=" * 80)
    print(f"\nModel saved to: {args['output_dir']}")
    print(f"Dev accuracy:   {dev_results['eval_accuracy']:.4f}")
    print(f"Test accuracy:  {test_results['eval_accuracy']:.4f}")
    print("\nTo use the model:")
    print(f"  from transformers import AutoModelForTokenClassification, AutoTokenizer")
    print(f"  model = AutoModelForTokenClassification.from_pretrained('{args['output_dir']}')")
    print(f"  tokenizer = AutoTokenizer.from_pretrained('{args['output_dir']}')")
    print("=" * 80)


train_pos_tagger()

FINE-TUNING GREBERTA FOR POS TAGGING

1. Loading data...
  Train: 7,198 verses
  Dev:   284 verses
  Test:  443 verses

2. Creating label mapping...
  Found 13 POS tags:
     0. A-    (7,860 tokens)
     1. C-    (16,112 tokens)
     2. D-    (5,655 tokens)
     3. I-    (15 tokens)
     4. N-    (24,573 tokens)
     5. P-    (9,662 tokens)
     6. RA    (17,088 tokens)
     7. RD    (1,579 tokens)
     8. RI    (1,102 tokens)
     9. RP    (10,446 tokens)
    10. RR    (1,490 tokens)
    11. V-    (25,398 tokens)
    12. X-    (905 tokens)

3. Loading bowphs/GreBerta...


Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at bowphs/GreBerta and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  ✓ Model loaded with 13 labels
  ✓ Model has 125,397,517 parameters

4. Creating Hugging Face datasets...
  Tokenizing...



  ✓ Tokenized 7,198 training examples

5. Setting up training...
  Epochs: 3
  Batch size: 16
  Learning rate: 2e-05
  Output dir: /content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/pos_tagger_output

6. TRAINING STARTED




| Epoch | Training Loss | Validation Loss | Accuracy | Num Examples |
|---|---|---|---|---|
| 1 | 0.0413 | 0.04067 | 0.987398 | 5158 |
| 2 | 0.0236 | 0.033632 | 0.989531 | 5158 |
| 3 | 0.0124 | 0.03352 | 0.991276 | 5158 |



TRAINING COMPLETE!

Training metrics:
  train_runtime: 114.9723
  train_samples_per_second: 187.819
  train_steps_per_second: 11.742
  total_flos: 429776895059172.0
  train_loss: 0.09274685502052307
  epoch: 3.0

7. Evaluating on dev set...


Dev accuracy: 0.9913

8. Evaluating on test set...
Test accuracy: 0.9932

9. Saving model to /content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/pos_tagger_output...

✓ FINE-TUNING COMPLETE!

Model saved to: /content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/pos_tagger_output
Dev accuracy:   0.9913
Test accuracy:  0.9932

To use the model:
  from transformers import AutoModelForTokenClassification, AutoTokenizer
  model = AutoModelForTokenClassification.from_pretrained('/content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/pos_tagger_output')
  tokenizer = AutoTokenizer.from_pretrained('/content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/pos_tagger_output')
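
The label alignment performed by `tokenize_and_align_labels` can be illustrated without loading a tokenizer. The sketch below takes a hand-written `word_ids` list and applies the same rule: the first subword of each word carries the tag id, and every other position gets `-100`. The list is an assumption standing in for what `tokenized_inputs.word_ids()` would return; the real subword split depends on the GreBerta vocabulary. The sketch then computes accuracy over the labelled positions only, as `compute_metrics` does. The helper names `align_labels` and `masked_accuracy` are hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Give each word's label to its first subword; mask all other positions."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:          # special token such as <s> or </s>
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous:    # first subword of a new word
            aligned.append(word_labels[word_idx])
        else:                         # continuation subword
            aligned.append(IGNORE_INDEX)
        previous = word_idx
    return aligned


def masked_accuracy(predictions: List[int], labels: List[int]) -> float:
    """Accuracy over positions whose label is not IGNORE_INDEX."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    return sum(p == l for p, l in kept) / len(kept)


# Hypothetical split: word 0 -> 1 piece, word 1 -> 3 pieces, word 2 -> 1 piece.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [5, 4, 11]  # ids of P-, N-, V- in the mapping printed above

labels = align_labels(word_ids, word_labels)
print(labels)  # [-100, 5, 4, -100, -100, 11, -100]

# Made-up model output: correct on two of the three labelled positions.
predictions = [0, 5, 4, 7, 7, 12, 0]
accuracy = masked_accuracy(predictions, labels)
print(f"{accuracy:.3f}")  # 0.667
```

Because only first subwords are scored, the `Num Examples` value reported during evaluation counts words rather than subword tokens.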


### Word Alignment
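
The dataset class in the next cell trains the aligner as a binary classifier over (Greek token, English token) pairs. Gold alignments are the positives, and randomly drawn non-aligned pairs serve as negatives, up to twice as many as the positives. The following is a minimal, tokenizer-free sketch of that sampling step, mirroring the logic of `_get_single_item`. The function name `sample_pairs`, the seed, and the toy indices are assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple


def sample_pairs(
    positives: List[Tuple[int, int]],
    greek_positions: List[int],
    english_positions: List[int],
    max_pairs: int = 50,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, int, int]]:
    """Return (greek_pos, english_pos, label) triples, where label 1 = aligned."""
    rng = rng or random.Random()
    positive_set = set(positives)
    num_negatives = min(len(positives) * 2, max_pairs)

    negatives = []
    attempts = 0
    # Rejection sampling with a bounded number of attempts, so that short
    # verses with few non-aligned combinations cannot loop forever.
    while len(negatives) < num_negatives and attempts < num_negatives * 10:
        candidate = (rng.choice(greek_positions), rng.choice(english_positions))
        if candidate not in positive_set:
            negatives.append((*candidate, 0))
        attempts += 1

    pairs = [(g, e, 1) for g, e in positives] + negatives
    rng.shuffle(pairs)
    return pairs[:max_pairs]  # cap the number of pairs per verse


gold = [(1, 1), (2, 3), (3, 2)]
pairs = sample_pairs(
    positives=gold,
    greek_positions=[1, 2, 3, 4],
    english_positions=[1, 2, 3, 4, 5],
    rng=random.Random(0),
)
for greek_pos, english_pos, label in pairs:
    print(greek_pos, english_pos, label)
```

As in the notebook code, negatives are drawn with replacement, so the same non-aligned pair can appear more than once in a verse.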

In [None]:
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

@dataclass
class AlignmentExample:
    """Single verse with alignment pairs"""
    verse_id: str
    greek_tokens: List[str]
    english_tokens: List[str]
    alignments: List[Tuple[int, int]]  # (greek_idx, english_idx) pairs


class AlignmentModel(nn.Module):
    """Cross-lingual word alignment model"""

    def __init__(self, greek_model_name='bowphs/GreBerta',
                 english_model_name='bert-base-uncased',
                 hidden_dim=256):
        super().__init__()

        # Encoders
        self.greek_encoder = AutoModel.from_pretrained(greek_model_name)
        self.english_encoder = AutoModel.from_pretrained(english_model_name)

        # Get hidden sizes
        greek_hidden = self.greek_encoder.config.hidden_size
        english_hidden = self.english_encoder.config.hidden_size

        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(greek_hidden + english_hidden, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 2)  # Binary: aligned or not
        )

    def forward(self, greek_input_ids, greek_attention_mask,
                english_input_ids, english_attention_mask,
                greek_indices, english_indices):
        """
        Args:
            greek_input_ids: [batch_size, max_greek_len]
            greek_attention_mask: [batch_size, max_greek_len]
            english_input_ids: [batch_size, max_english_len]
            english_attention_mask: [batch_size, max_english_len]
            greek_indices: [batch_size, num_pairs] - which Greek token for each pair
            english_indices: [batch_size, num_pairs] - which English token for each pair

        Returns:
            logits: [batch_size, num_pairs, 2] - alignment scores
        """
        # Encode Greek
        greek_outputs = self.greek_encoder(
            input_ids=greek_input_ids,
            attention_mask=greek_attention_mask
        )
        greek_embeddings = greek_outputs.last_hidden_state  # [batch, greek_len, hidden]

        # Encode English
        english_outputs = self.english_encoder(
            input_ids=english_input_ids,
            attention_mask=english_attention_mask
        )
        english_embeddings = english_outputs.last_hidden_state  # [batch, eng_len, hidden]

        # Gather embeddings for specified pairs
        batch_size, num_pairs = greek_indices.shape

        # Get Greek embeddings for each pair
        greek_pair_embeddings = torch.gather(
            greek_embeddings,
            dim=1,
            index=greek_indices.unsqueeze(-1).expand(-1, -1, greek_embeddings.size(-1))
        )  # [batch, num_pairs, greek_hidden]

        # Get English embeddings for each pair
        english_pair_embeddings = torch.gather(
            english_embeddings,
            dim=1,
            index=english_indices.unsqueeze(-1).expand(-1, -1, english_embeddings.size(-1))
        )  # [batch, num_pairs, english_hidden]

        # Concatenate embeddings
        combined = torch.cat([greek_pair_embeddings, english_pair_embeddings], dim=-1)

        # Classify each pair
        logits = self.classifier(combined)  # [batch, num_pairs, 2]

        return logits


class AlignmentDataset(Dataset):
    """Dataset for word alignment training"""

    def __init__(self, examples: List[AlignmentExample],
                 greek_tokenizer, english_tokenizer,
                 max_pairs_per_verse=50):
        self.examples = examples
        self.greek_tokenizer = greek_tokenizer
        self.english_tokenizer = english_tokenizer
        self.max_pairs_per_verse = max_pairs_per_verse

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):

        if torch.is_tensor(idx):
            idx = idx.tolist()
        if isinstance(idx, list):
            return [self._get_single_item(i) for i in idx]
        return self._get_single_item(idx)

    def _get_single_item(self, idx):

        example = self.examples[idx]


        # Tokenize Greek (join with spaces)
        greek_text = ' '.join(example.greek_tokens)
        greek_encoded = self.greek_tokenizer(
            greek_text,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )

        # Tokenize English
        english_text = ' '.join(example.english_tokens)
        english_encoded = self.english_tokenizer(
            english_text,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )

        # Map original word indices to subword indices
        greek_word_to_token = self._get_word_to_token_map(
            example.greek_tokens, greek_encoded
        )
        english_word_to_token = self._get_word_to_token_map(
            example.english_tokens, english_encoded
        )

        # Create training pairs
        # Positive examples: actual alignments
        positive_pairs = []
        for greek_idx, english_idx in example.alignments:
            if greek_idx in greek_word_to_token and english_idx in english_word_to_token:
                positive_pairs.append((
                    greek_word_to_token[greek_idx],
                    english_word_to_token[english_idx],
                    1  # label: aligned
                ))

        # Negative examples: random non-aligned pairs
        num_negatives = min(len(positive_pairs) * 2, self.max_pairs_per_verse)
        negative_pairs = []

        all_greek_indices = list(greek_word_to_token.values())
        all_english_indices = list(english_word_to_token.values())

        # Create set of positive pairs for fast lookup
        positive_set = {(g, e) for g, e, _ in positive_pairs}

        attempts = 0
        while len(negative_pairs) < num_negatives and attempts < num_negatives * 10:
            g_idx = random.choice(all_greek_indices)
            e_idx = random.choice(all_english_indices)
            if (g_idx, e_idx) not in positive_set:
                negative_pairs.append((g_idx, e_idx, 0))  # label: not aligned
            attempts += 1

        # Combine and shuffle
        all_pairs = positive_pairs + negative_pairs
        random.shuffle(all_pairs)

        # Limit total pairs
        all_pairs = all_pairs[:self.max_pairs_per_verse]

        if not all_pairs:
            # Create dummy pair if no valid pairs
            all_pairs = [(1, 1, 0)]

        greek_indices = torch.tensor([p[0] for p in all_pairs], dtype=torch.long)
        english_indices = torch.tensor([p[1] for p in all_pairs], dtype=torch.long)
        labels = torch.tensor([p[2] for p in all_pairs], dtype=torch.long)

        return {
            'greek_input_ids': greek_encoded['input_ids'].squeeze(0),
            'greek_attention_mask': greek_encoded['attention_mask'].squeeze(0),
            'english_input_ids': english_encoded['input_ids'].squeeze(0),
            'english_attention_mask': english_encoded['attention_mask'].squeeze(0),
            'greek_indices': greek_indices,
            'english_indices': english_indices,
            'labels': labels,
            'verse_id': example.verse_id
        }

    def _get_word_to_token_map(self, words, encoded):
        """Map word indices to their first subword token index"""
        word_to_token = {}
        word_ids = encoded.word_ids()

        for token_idx, word_idx in enumerate(word_ids):
            if word_idx is not None and word_idx not in word_to_token:
                word_to_token[word_idx] = token_idx

        return word_to_token


def collate_fn(batch):
    """Custom collate function for batching"""
    # Find max lengths
    max_greek_len = max(item['greek_input_ids'].size(0) for item in batch)
    max_english_len = max(item['english_input_ids'].size(0) for item in batch)
    max_pairs = max(item['labels'].size(0) for item in batch)

    # Pad everything
    greek_input_ids = []
    greek_attention_mask = []
    english_input_ids = []
    english_attention_mask = []
    greek_indices = []
    english_indices = []
    labels = []
    verse_ids = []

    for item in batch:
        # Pad Greek
        g_len = item['greek_input_ids'].size(0)
        greek_input_ids.append(
            torch.cat([item['greek_input_ids'],
                      torch.zeros(max_greek_len - g_len, dtype=torch.long)])
        )
        greek_attention_mask.append(
            torch.cat([item['greek_attention_mask'],
                      torch.zeros(max_greek_len - g_len, dtype=torch.long)])
        )

        # Pad English
        e_len = item['english_input_ids'].size(0)
        english_input_ids.append(
            torch.cat([item['english_input_ids'],
                      torch.zeros(max_english_len - e_len, dtype=torch.long)])
        )
        english_attention_mask.append(
            torch.cat([item['english_attention_mask'],
                      torch.zeros(max_english_len - e_len, dtype=torch.long)])
        )

        # Pad pairs
        num_pairs = item['labels'].size(0)
        greek_indices.append(
            torch.cat([item['greek_indices'],
                      torch.zeros(max_pairs - num_pairs, dtype=torch.long)])
        )
        english_indices.append(
            torch.cat([item['english_indices'],
                      torch.zeros(max_pairs - num_pairs, dtype=torch.long)])
        )
        labels.append(
            torch.cat([item['labels'],
                      torch.full((max_pairs - num_pairs,), -100, dtype=torch.long)])  # -100 = ignore
        )

        verse_ids.append(item['verse_id'])

    return {
        'greek_input_ids': torch.stack(greek_input_ids),
        'greek_attention_mask': torch.stack(greek_attention_mask),
        'english_input_ids': torch.stack(english_input_ids),
        'english_attention_mask': torch.stack(english_attention_mask),
        'greek_indices': torch.stack(greek_indices),
        'english_indices': torch.stack(english_indices),
        'labels': torch.stack(labels),
        'verse_ids': verse_ids
    }


def load_alignment_data(filepath: Path) -> List[AlignmentExample]:
    """Load alignment data from JSON"""
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)

    examples = []
    for verse in data['data']:
        if not verse['alignments']:
            continue

        greek_tokens = [t['word'] for t in verse['greek_tokens']]
        english_tokens = [t['word'] for t in verse['english_tokens']]
        alignments = [(a['greek_idx'], a['english_idx'])
                     for a in verse['alignments']]

        examples.append(AlignmentExample(
            verse_id=verse['verse_id'],
            greek_tokens=greek_tokens,
            english_tokens=english_tokens,
            alignments=alignments
        ))

    return examples


def train_epoch(model, dataloader, optimizer, scheduler, device):
    """Train for one epoch"""
    model.train()
    total_loss = 0
    all_preds = []
    all_labels = []

    progress_bar = tqdm(dataloader, desc='Training')

    for batch in progress_bar:
        # Move to device
        greek_input_ids = batch['greek_input_ids'].to(device)
        greek_attention_mask = batch['greek_attention_mask'].to(device)
        english_input_ids = batch['english_input_ids'].to(device)
        english_attention_mask = batch['english_attention_mask'].to(device)
        greek_indices = batch['greek_indices'].to(device)
        english_indices = batch['english_indices'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        logits = model(
            greek_input_ids, greek_attention_mask,
            english_input_ids, english_attention_mask,
            greek_indices, english_indices
        )

        # Compute loss (only on valid pairs)
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        loss = loss_fn(logits.view(-1, 2), labels.view(-1))

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        # Track metrics
        total_loss += loss.item()

        # Get predictions
        preds = torch.argmax(logits, dim=-1)
        valid_mask = labels != -100
        all_preds.extend(preds[valid_mask].cpu().numpy())
        all_labels.extend(labels[valid_mask].cpu().numpy())

        progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})

    # Compute metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_labels, all_preds, average='binary'
    )

    return {
        'loss': total_loss / len(dataloader),
        'precision': precision,
        'recall': recall,
        'f1': f1
    }


def evaluate(model, dataloader, device):
    """Evaluate model"""
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc='Evaluating'):
            greek_input_ids = batch['greek_input_ids'].to(device)
            greek_attention_mask = batch['greek_attention_mask'].to(device)
            english_input_ids = batch['english_input_ids'].to(device)
            english_attention_mask = batch['english_attention_mask'].to(device)
            greek_indices = batch['greek_indices'].to(device)
            english_indices = batch['english_indices'].to(device)
            labels = batch['labels'].to(device)

            logits = model(
                greek_input_ids, greek_attention_mask,
                english_input_ids, english_attention_mask,
                greek_indices, english_indices
            )

            preds = torch.argmax(logits, dim=-1)
            valid_mask = labels != -100
            all_preds.extend(preds[valid_mask].cpu().numpy())
            all_labels.extend(labels[valid_mask].cpu().numpy())

    precision, recall, f1, _ = precision_recall_fscore_support(
        all_labels, all_preds, average='binary'
    )

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }


def train_alignment():
    args = {
        'epochs': 3,
        'batch_size': 8,
        'lr': 2e-5,
        'output_dir': path_prefix+'alignment_model_output_colab'
    }

    print("=" * 80)
    print("TRAINING WORD ALIGNMENT MODEL")
    print("=" * 80)

    # Device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"\nUsing device: {device}")

    # Load data
    print("\n1. Loading data...")
    train_examples = load_alignment_data(Path(path_prefix+'data/train.json'))
    dev_examples = load_alignment_data(Path(path_prefix+'data/dev.json'))
    test_examples = load_alignment_data(Path(path_prefix+'data/test.json'))

    print(f"  Train: {len(train_examples):,} verses with alignments")
    print(f"  Dev:   {len(dev_examples):,} verses")
    print(f"  Test:  {len(test_examples):,} verses")

    # Load tokenizers
    print("\n2. Loading tokenizers...")
    greek_tokenizer = AutoTokenizer.from_pretrained('bowphs/GreBerta')
    english_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    print("  ✓ Tokenizers loaded")

    # Create datasets
    print("\n3. Creating datasets...")
    train_dataset = AlignmentDataset(train_examples, greek_tokenizer, english_tokenizer)
    dev_dataset = AlignmentDataset(dev_examples, greek_tokenizer, english_tokenizer)
    test_dataset = AlignmentDataset(test_examples, greek_tokenizer, english_tokenizer)

    train_dataloader = DataLoader(
        train_dataset, batch_size=args['batch_size'],
        shuffle=True, collate_fn=collate_fn, num_workers=0
    )
    dev_dataloader = DataLoader(
        dev_dataset, batch_size=args['batch_size'],
        shuffle=False, collate_fn=collate_fn, num_workers=0
    )
    test_dataloader = DataLoader(
        test_dataset, batch_size=args['batch_size'],
        shuffle=False, collate_fn=collate_fn, num_workers=0
    )

    print(f"  ✓ Created {len(train_dataloader):,} training batches")

    # Create model
    print("\n4. Creating model...")
    model = AlignmentModel()
    model = model.to(device)

    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"  ✓ Model created")
    print(f"  Total parameters: {total_params:,}")
    print(f"  Trainable parameters: {trainable_params:,}")

    # Setup training
    print("\n5. Setting up training...")
    optimizer = torch.optim.AdamW(model.parameters(), lr=args['lr'])
    num_training_steps = len(train_dataloader) * args['epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_training_steps // 10,
        num_training_steps=num_training_steps
    )

    print(f"  Epochs: {args['epochs']}")
    print(f"  Batch size: {args['batch_size']}")
    print(f"  Learning rate: {args['lr']}")
    print(f"  Output dir: {args['output_dir']}")

    # Train!
    print("\n" + "=" * 80)
    print("6. TRAINING STARTED")
    print("=" * 80)

    best_f1 = 0
    for epoch in range(args['epochs']):
        print(f"\nEpoch {epoch + 1}/{args['epochs']}")
        print("-" * 80)

        # Train
        train_metrics = train_epoch(model, train_dataloader, optimizer, scheduler, device)
        print(f"Train - Loss: {train_metrics['loss']:.4f}, "
              f"P: {train_metrics['precision']:.4f}, "
              f"R: {train_metrics['recall']:.4f}, "
              f"F1: {train_metrics['f1']:.4f}")

        # Evaluate
        dev_metrics = evaluate(model, dev_dataloader, device)
        print(f"Dev   - P: {dev_metrics['precision']:.4f}, "
              f"R: {dev_metrics['recall']:.4f}, "
              f"F1: {dev_metrics['f1']:.4f}")

        # Save best model
        if dev_metrics['f1'] > best_f1:
            best_f1 = dev_metrics['f1']
            output_dir = Path(args['output_dir'])
            output_dir.mkdir(exist_ok=True)
            torch.save(model.state_dict(), output_dir / 'best_model.pt')
            print(f"  ✓ Saved best model (F1: {best_f1:.4f})")

    # Final evaluation
    print("\n" + "=" * 80)
    print("7. FINAL EVALUATION")
    print("=" * 80)

    # Load best model
    model.load_state_dict(torch.load(Path(args['output_dir']) / 'best_model.pt'))

    print("\nDev set:")
    dev_metrics = evaluate(model, dev_dataloader, device)
    print(f"  Precision: {dev_metrics['precision']:.4f}")
    print(f"  Recall:    {dev_metrics['recall']:.4f}")
    print(f"  F1 Score:  {dev_metrics['f1']:.4f}")

    print("\nTest set:")
    test_metrics = evaluate(model, test_dataloader, device)
    print(f"  Precision: {test_metrics['precision']:.4f}")
    print(f"  Recall:    {test_metrics['recall']:.4f}")
    print(f"  F1 Score:  {test_metrics['f1']:.4f}")

    # Save final artifacts
    print(f"\n8. Saving model and tokenizers...")
    output_dir = Path(args['output_dir'])
    output_dir.mkdir(exist_ok=True)

    torch.save(model.state_dict(), output_dir / 'model.pt')
    greek_tokenizer.save_pretrained(output_dir / 'greek_tokenizer')
    english_tokenizer.save_pretrained(output_dir / 'english_tokenizer')

    # Save config
    config = {
        'greek_model': 'bowphs/GreBerta',
        'english_model': 'bert-base-uncased',
        'dev_f1': dev_metrics['f1'],
        'test_f1': test_metrics['f1'],
    }
    with open(output_dir / 'config.json', 'w') as f:
        json.dump(config, f, indent=2)

    print("\n" + "=" * 80)
    print("✓ TRAINING COMPLETE!")
    print("=" * 80)
    print(f"\nModel saved to: {args['output_dir']}")
    print(f"Test F1 Score: {test_metrics['f1']:.4f}")
    print("\nTo use the model, see test_alignment.py")
    print("=" * 80)


train_alignment()

TRAINING WORD ALIGNMENT MODEL

Using device: cuda

1. Loading data...
  Train: 5,846 verses with alignments
  Dev:   284 verses
  Test:  380 verses

2. Loading tokenizers...
  ✓ Tokenizers loaded

3. Creating datasets...
  ✓ Created 731 training batches

4. Creating model...


Some weights of RobertaModel were not initialized from the model checkpoint at bowphs/GreBerta and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  ✓ Model created
  Total parameters: 235,886,978
  Trainable parameters: 235,886,978

5. Setting up training...
  Epochs: 3
  Batch size: 8
  Learning rate: 2e-05
  Output dir: /content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/alignment_model_output_colab

6. TRAINING STARTED

Epoch 1/3
--------------------------------------------------------------------------------


Training: 100%|██████████| 731/731 [02:46<00:00,  4.39it/s, loss=0.3591]


Train - Loss: 0.4756, P: 0.7374, R: 0.5678, F1: 0.6415


Evaluating: 100%|██████████| 36/36 [00:00<00:00, 39.09it/s]


Dev   - P: 0.7717, R: 0.7666, F1: 0.7691
  ✓ Saved best model (F1: 0.7691)

Epoch 2/3
--------------------------------------------------------------------------------


Training: 100%|██████████| 731/731 [02:45<00:00,  4.41it/s, loss=0.2739]


Train - Loss: 0.2988, P: 0.8158, R: 0.8457, F1: 0.8305


Evaluating: 100%|██████████| 36/36 [00:00<00:00, 39.17it/s]


Dev   - P: 0.8004, R: 0.8550, F1: 0.8268
  ✓ Saved best model (F1: 0.8268)

Epoch 3/3
--------------------------------------------------------------------------------


Training: 100%|██████████| 731/731 [02:45<00:00,  4.42it/s, loss=0.2892]


Train - Loss: 0.2557, P: 0.8450, R: 0.8797, F1: 0.8620


Evaluating: 100%|██████████| 36/36 [00:00<00:00, 36.13it/s]


Dev   - P: 0.8254, R: 0.8567, F1: 0.8408
  ✓ Saved best model (F1: 0.8408)

7. FINAL EVALUATION

Dev set:


Evaluating: 100%|██████████| 36/36 [00:00<00:00, 37.13it/s]


  Precision: 0.8272
  Recall:    0.8658
  F1 Score:  0.8461

Test set:


Evaluating: 100%|██████████| 48/48 [00:01<00:00, 30.73it/s]


  Precision: 0.8754
  Recall:    0.9137
  F1 Score:  0.8942

8. Saving model and tokenizers...

✓ TRAINING COMPLETE!

Model saved to: /content/drive/MyDrive/UofT DL Term project/GreBerta-experiment-2/alignment_model_output_colab
Test F1 Score: 0.8942

To use the model, see test_alignment.py


# Named Entity Recognition (NER)

Another relevant application of large language models (LLMs) in linguistic tasks is Named Entity Recognition (NER). Recent work by Beersmans et al. (2024) demonstrates this by combining transformer-based models with domain-specific knowledge to identify individuals in Ancient Greek texts. Their study, “Gotta catch ’em all!: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge,” was presented at the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024) and provides a strong example of how NLP techniques can be adapted for historical languages.

The authors built an Ancient Greek NER model with an F1 score of 0.826, which is the state of the art at the time of writing. This is lower than the scores reported for modern high-resource languages (e.g., English, where CoNLL-2003 NER F1 scores exceed 0.93) because Ancient Greek is a low-resource, highly inflected language with limited annotated corpora. Ancient languages suffer from data scarcity, orthographic variation (e.g., diacritics, dialects), and domain noise (e.g., fragmentary inscriptions or papyri). State-of-the-art results in this niche are typically in the 0.80–0.89 range for transformer-based models on similar tasks.

We tuned two hyperparameters that were not tested in the original paper, warmup ratio and batch size, in an attempt to improve performance. Transformer fine-tuning is sensitive to both.

A larger batch size reduces gradient noise, which can result in better token representations. This is particularly useful for morphologically rich languages like Ancient Greek. However, smaller batches tend to act as a regularizer and allow better generalization. Tuning the batch size therefore means finding the balance between overfitting and better token representations.
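
The noise argument can be illustrated with a toy simulation. This is a minimal sketch, not part of our training code: it assumes each per-example "gradient" is a standard normal draw, and the helper name `minibatch_mean_spread` is hypothetical. It shows how the spread of the mini-batch average shrinks as the batch size grows.

```python
import random
import statistics


def minibatch_mean_spread(batch_size, n_batches=2000, seed=0):
    """Standard deviation of mini-batch means for toy gradients drawn from N(0, 1)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(n_batches)
    ]
    return statistics.pstdev(means)


# Batch sizes taken from the search space used in this notebook.
for bs in (8, 16, 32):
    print(f"batch_size={bs:>2}  spread of batch mean = {minibatch_mean_spread(bs):.3f}")
```

Under this assumption the spread falls roughly as one over the square root of the batch size, so going from 8 to 32 about halves the noise. Real gradients are not independent normal draws, so this only conveys the trend.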

Early in training, the weights are not yet adapted to the task, and a high learning rate applied too soon can destroy useful pretrained knowledge. Warmup raises the learning rate gradually so that the model adjusts gently during the first training steps. This is especially helpful with small datasets, where too high a learning rate early on can quickly lead to overfitting.
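
To make the effect of the warmup ratio concrete, here is a small sketch of a linear warmup followed by linear decay, the schedule shape we select with `lr_scheduler_type="linear"` in this notebook. The function `linear_warmup_decay` and the value `total_steps = 1000` are illustrative assumptions, not code from the library.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier in [0, 1]: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 1000  # illustrative value only
for ratio in (0.0, 0.06, 0.1, 0.2):  # warmup ratios from our search space
    warmup_steps = int(total_steps * ratio)
    factors = [
        round(linear_warmup_decay(s, warmup_steps, total_steps), 3)
        for s in (0, 50, 100, 200, 1000)
    ]
    print(f"warmup_ratio={ratio:<4} -> multipliers at steps 0/50/100/200/1000: {factors}")
```

With a ratio of 0.0 the first update already uses the full learning rate, while larger ratios delay the peak and shorten the decay phase.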


Data augmentation is another potential approach to improving the model’s F1 score, but it is not practical for this project. Given the limited resources available for Koine Greek, we would need a pipeline of several models:

1. Start with parallel Koine Greek and English translation pairs.
2. Annotate the English translations of the Koine Greek sentences with an existing English NER model such as dslim/bert-base-NER.
3. Align the annotated English tokens to the corresponding Koine Greek tokens using the alignment model above.
4. Transfer the labels across these alignments to obtain inferred NER labels for the Koine Greek tokens.
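
The label-transfer step of this pipeline can be sketched as follows. This is a hypothetical illustration that we did not run as part of the project: `project_ner_labels`, the toy sentence, and the alignment pairs are all made up for the example, and the English labels are assumed to come from an English NER model.

```python
def project_ner_labels(greek_tokens, english_labels, alignments):
    """Copy English NER labels onto Greek tokens through word alignments.

    alignments: iterable of (greek_index, english_index) pairs.
    Greek tokens without an aligned entity keep the label "O".
    """
    greek_labels = ["O"] * len(greek_tokens)
    for g_idx, e_idx in alignments:
        label = english_labels[e_idx]
        # Keep the first entity label if a Greek token has several alignments.
        if label != "O" and greek_labels[g_idx] == "O":
            greek_labels[g_idx] = label
    return greek_labels


# Toy example (hypothetical data, not taken from our corpus).
greek = ["ὁ", "Ἰησοῦς", "ἦλθεν", "εἰς", "Ἱεροσόλυμα"]
english_labels = ["B-PER", "O", "O", "B-LOC"]  # for: Jesus came to Jerusalem
alignments = [(1, 0), (2, 1), (3, 2), (4, 3)]

for token, label in zip(greek, project_ner_labels(greek, english_labels, alignments)):
    print(f"{token:<12} {label}")
```

A real pipeline would also need to map between tag sets (for example `PER` versus the `PERS` label used by the Greek model in this notebook) and to handle multi-token entities, which this sketch ignores.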

However, even after this automated pipeline, human verification would still be required to ensure label accuracy. Producing a dataset of a few thousand Koine Greek tokens under these constraints would be extremely time-consuming and is not feasible within the scope of this project.

The code below takes about 3 hours to run on a Colab T4 GPU.

In [None]:
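import unicodedata
from pathlib import Path

from datasets import Dataset as HFDataset, DatasetDict

if IN_COLAB:
    ner_path_prefix = '/content/drive/MyDrive/UofT DL Term project/NER Data/'
else:
    # Trailing slash is required: file names are appended to this prefix below
    ner_path_prefix = './NER Data/'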


def read_conll(p: Path):
    """
    Parse CoNLL with format:
        [line_id]  token  [POS]  NER
    Example:
        110089790	βίβλος	O
    Returns: {"tokens": [...], "ner_tags": [...]}
    """
    sents, labs = [], []
    with p.open(encoding="utf-8") as f:
        sent, lab = [], []
        for i, raw in enumerate(f, 1):
            line = raw.strip()
            if not line or line.startswith("#"):
                if sent:
                    sents.append(sent)
                    labs.append(lab)
                    sent, lab = [], []
                continue

            # Split on whitespace (handles tabs and spaces)
            parts = line.split()
            if len(parts) < 2:
                print(f"Warning: Line {i} in {p.name} has <2 columns → SKIPPED")
                print(f"    → {line!r}")
                continue

            if len(parts) == 2:
                token = parts[0]
                ner   = parts[1]
            else:
                token = parts[1]   # skip ID
                ner   = parts[-1]  # last column is NER

            sent.append(unicodedata.normalize("NFC", token))
            lab.append(ner)

        if sent:
            sents.append(sent)
            labs.append(lab)

    print(f"Loaded {len(sents)} sentences from {p.name}")
    return {"tokens": sents, "ner_tags": labs}

# load data
train_path = Path(ner_path_prefix + 'train.conll')
val_path   = Path(ner_path_prefix + 'val.conll')
test_path  = Path(ner_path_prefix + 'test.conll')

raw = {
    "train": read_conll(train_path),
    "validation": read_conll(val_path),
    "test": read_conll(test_path),
}
data = DatasetDict({k: HFDataset.from_dict(v) for k, v in raw.items()})

#Model name -------------------------------------------------------------
model_name = "Marijke/AG_BERT_hypopt_NER"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
#------------------------------------------------------------------------

all_labels = sorted({l for s in data["train"]["ner_tags"] for l in s})
label2id   = {l: i for i, l in enumerate(all_labels)}
id2label   = {i: l for l, i in label2id.items()}

#tokenise + align labels
def tokenise_align(example):
    tok = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    aligned = []
    for i, labs in enumerate(example["ner_tags"]):
        word_ids = tok.word_ids(batch_index=i)
        prev = None
        ids  = []
        for wid in word_ids:
            if wid is None:
                ids.append(-100)
            elif wid != prev:
                ids.append(label2id[labs[wid]])
            else:
                ids.append(-100)               # sub-word → ignore
            prev = wid
        aligned.append(ids)
    tok["labels"] = aligned
    return tok

tokenised = data.map(tokenise_align, batched=True,
                     remove_columns=data["train"].column_names)


model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(all_labels),
    id2label=id2label,
    label2id=label2id,
)

collator = DataCollatorForTokenClassification(tokenizer)


#Compute metrics to evaluate model performance for NER task
def compute_metrics(p):
    preds, labels = p
    preds = np.argmax(preds, axis=2)

    true_labels = []
    pred_labels = []

    for prediction, label in zip(preds, labels):
        true_seq = [id2label[l] for l in label if l != -100]
        pred_seq = [id2label[pred] for pred, l in zip(prediction, label) if l != -100]
        if true_seq:  # Only add if not empty
            true_labels.append(true_seq)
            pred_labels.append(pred_seq)

    if not true_labels:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}


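    # Import locally under an alias: the name `evaluate` is also used by the
    # alignment evaluation function defined earlier in this notebook.
    import evaluate as hf_evaluate
    metric = hf_evaluate.load("seqeval")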
    results = metric.compute(predictions=pred_labels, references=true_labels)

    return {
      "precision": results["overall_precision"],
      "recall": results['overall_recall'],
      "f1": results["overall_f1"]
    }



Loaded 30686 sentences from train.conll
Loaded 4434 sentences from val.conll
Loaded 4701 sentences from test.conll


Map:   0%|          | 0/30686 [00:00<?, ? examples/s]

Map:   0%|          | 0/4434 [00:00<?, ? examples/s]

Map:   0%|          | 0/4701 [00:00<?, ? examples/s]

Hyperparameter tuning was performed using Hyperopt. Although Hyperopt is less commonly used today, we chose it to maintain consistency with the methodology described in the referenced paper.

In [None]:
from datasets import load_dataset
import evaluate
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK, STATUS_FAIL
from hyperopt.early_stop import no_progress_loss

# -------------------------------
# HYPEROPT SEARCH SPACE
# Include 2 new parameters that were not tried in the paper - batch size and warmup ratio
# -------------------------------
FIXED_LR = 6.040686648207059e-05
FIXED_WD = 0.01
FIXED_EPOCH = 3

space = {
    "batch_size":    hp.choice("batch_size", [8, 16, 32]),            # 3 options
    "warmup_ratio":  hp.choice("warmup_ratio", [0.0, 0.06, 0.1, 0.2]), # 4 options
    "seed": 123 #for reproducibility
}

# -------------------------------
# OBJECTIVE FUNCTION used by Hyperopt to test parameters
# -------------------------------
def objective(params):
    # Learning rate, weight decay, and number of training epochs stay fixed at
    # the values the paper found optimal; only batch size and warmup ratio vary.

    try:
        batch_size = params["batch_size"]
        warmup_ratio = params["warmup_ratio"]

        model_for_trial = AutoModelForTokenClassification.from_pretrained(
            model_name,
            num_labels=len(all_labels),
            id2label=id2label,
            label2id=label2id,
        )


        total_steps = int(len(tokenised["train"]) / batch_size * FIXED_EPOCH)
        warmup_steps = int(total_steps * warmup_ratio)

        training_args = TrainingArguments(
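            # One directory per (batch size, warmup ratio) pair so trials do not overwrite each other
            output_dir=f"./hyperopt_trial_{int(FIXED_EPOCH)}_{batch_size}_{warmup_ratio}_{FIXED_LR:.2e}",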
            num_train_epochs=FIXED_EPOCH,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size * 2,
            learning_rate=FIXED_LR,
            weight_decay=FIXED_WD,
            warmup_steps=warmup_steps,
            lr_scheduler_type="linear",
            eval_strategy="epoch",
            save_strategy="epoch",
            logging_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            greater_is_better=True,
            report_to="none",
            seed=params["seed"],
            dataloader_num_workers=4,
            disable_tqdm=False,
        )

        trainer = Trainer(
            model=model_for_trial,
            args=training_args,
            train_dataset=tokenised["train"],
            eval_dataset=tokenised["validation"],
            tokenizer=tokenizer,
            data_collator=collator,
            compute_metrics=compute_metrics,
        )

        trainer.train()
        metrics = trainer.evaluate()

        return {
            "loss": -metrics["eval_f1"],
            "status": STATUS_OK,
            "eval_f1": metrics["eval_f1"],
            "params": params,
        }

    except Exception as e:
        print(f"Trial failed: {e}")
        return {"loss": 10.0, "status": STATUS_FAIL}

# -------------------------------
# RUN HYPEROPT
# -------------------------------
trials = Trials()

best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=12,
    trials=trials,
    rstate=np.random.default_rng(42),
    show_progressbar=True,
)

# -------------------------------
# PRINT BEST RESULT
# -------------------------------
best_trial = trials.best_trial
print("\n" + "="*60)
print("Best Hyperparameters found:")
print("="*60)
print(f"Best eval micro F1 : {best_trial['result']['eval_f1']:.4f}")
print(f"Batch size         : {int(best_trial['result']['params']['batch_size'])}")
print(f"Warmup ratio       : {best_trial['result']['params']['warmup_ratio']}")
print("="*60)

  0%|          | 0/12 [00:00<?, ?trial/s, best loss=?]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0351,0.093696,0.824993,0.837818,0.831356
2,0.0208,0.102668,0.822042,0.850673,0.836113
3,0.0113,0.11854,0.826864,0.850224,0.838382


Downloading builder script: 0.00B [00:00, ?B/s]

  8%|▊         | 1/12 [12:02<2:12:26, 722.44s/trial, best loss: -0.8383816051293389]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0441,0.100207,0.815089,0.79133,0.803034
2,0.0406,0.108174,0.825643,0.821076,0.823353
3,0.02,0.114458,0.828738,0.849178,0.838834


 17%|█▋        | 2/12 [25:39<2:09:39, 777.92s/trial, best loss: -0.8388335179032853]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0519,0.098099,0.818537,0.813154,0.815837
2,0.0342,0.108257,0.830836,0.828849,0.829841
3,0.0173,0.11908,0.826112,0.843647,0.834788


 25%|██▌       | 3/12 [39:14<1:59:16, 795.13s/trial, best loss: -0.8388335179032853]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0391,0.094385,0.806828,0.83722,0.821743
2,0.0278,0.108474,0.817944,0.840807,0.829218
3,0.0138,0.119901,0.82284,0.84006,0.831361


 33%|███▎      | 4/12 [51:10<1:41:49, 763.64s/trial, best loss: -0.8388335179032853]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0404,0.096233,0.81681,0.825112,0.82094
2,0.0268,0.103557,0.811997,0.843797,0.827591
3,0.0131,0.119147,0.829025,0.843647,0.836272


 42%|████▏     | 5/12 [1:03:03<1:26:57, 745.35s/trial, best loss: -0.8388335179032853]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0339,0.096309,0.828423,0.832138,0.830276
2,0.0228,0.103959,0.824971,0.850374,0.83748
3,0.0121,0.115516,0.831309,0.849327,0.840222


 50%|█████     | 6/12 [1:15:05<1:13:44, 737.47s/trial, best loss: -0.8402218114602588]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0339,0.098631,0.828236,0.833931,0.831074
2,0.023,0.103365,0.82447,0.855157,0.839533
3,0.0127,0.114783,0.829045,0.852466,0.840593


 58%|█████▊    | 7/12 [1:27:07<1:01:02, 732.48s/trial, best loss: -0.840592527083794] 

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0498,0.096462,0.809453,0.814051,0.811745
2,0.0358,0.112934,0.819001,0.82855,0.823748
3,0.0176,0.119417,0.825066,0.835426,0.830214


 67%|██████▋   | 8/12 [1:40:39<50:31, 757.79s/trial, best loss: -0.840592527083794]  

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.035,0.096184,0.827612,0.836024,0.831797
2,0.0209,0.103945,0.822383,0.851271,0.836577
3,0.0114,0.116258,0.829155,0.850224,0.839557


 75%|███████▌  | 9/12 [1:52:39<37:18, 746.13s/trial, best loss: -0.840592527083794]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0437,0.105039,0.806384,0.804335,0.805358
2,0.0401,0.111479,0.82484,0.808072,0.81637
3,0.02,0.124998,0.829265,0.837818,0.833519


 83%|████████▎ | 10/12 [2:06:14<25:34, 767.18s/trial, best loss: -0.840592527083794]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0516,0.109468,0.822748,0.795815,0.809057
2,0.0339,0.109853,0.827704,0.835127,0.831399
3,0.0172,0.122048,0.830787,0.837369,0.834065


 92%|█████████▏| 11/12 [2:19:49<13:01, 781.75s/trial, best loss: -0.840592527083794]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0322,0.105476,0.821339,0.830792,0.826038
2,0.0251,0.101783,0.820688,0.84559,0.832953
3,0.0128,0.117092,0.829479,0.847085,0.83819


100%|██████████| 12/12 [2:31:54<00:00, 759.52s/trial, best loss: -0.840592527083794]

BEST HYPERPARAMETERS FOUND
Best eval micro F1 : 0.8406
Batch size         : 32
Warmup ratio       : 0.1
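
The search space above has only 3 × 4 = 12 discrete combinations, and as far as we understand TPE can propose the same combination more than once, so 12 evaluations do not necessarily cover every pair. An exhaustive grid is a simple alternative. The sketch below is an assumption-laden illustration: `grid_search` and `toy_trial` are hypothetical names, and `toy_trial` returns a made-up score instead of training a model.

```python
from itertools import product

batch_sizes = [8, 16, 32]
warmup_ratios = [0.0, 0.06, 0.1, 0.2]


def grid_search(run_trial):
    """Evaluate every (batch_size, warmup_ratio) pair once and return the best by F1."""
    results = []
    for batch_size, warmup_ratio in product(batch_sizes, warmup_ratios):
        f1 = run_trial(batch_size, warmup_ratio)
        results.append({"batch_size": batch_size, "warmup_ratio": warmup_ratio, "f1": f1})
    best = max(results, key=lambda r: r["f1"])
    return best, results


def toy_trial(batch_size, warmup_ratio):
    """Placeholder scorer so the sketch runs on its own; NOT a real F1 score."""
    return -abs(batch_size - 16) - abs(warmup_ratio - 0.1)


best, results = grid_search(toy_trial)
print(f"Tried {len(results)} combinations; best under the toy score: {best}")
```

In practice `run_trial` would wrap the training and evaluation logic of the `objective` function and return the validation F1.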


In [None]:
# Create a model with the best hyper parameters found.
# ------------------------------------------------------------
# Hyper-parameters
# ------------------------------------------------------------
LEARNING_RATE = 6.040686648207059e-05 #From paper
EPOCHS        = 3                     #From paper
WEIGHT_DECAY  = 0.01                   #From paper
BATCH_SIZE    = 32                     #Best parameter from above
WARMUP_RATIO  = 0.1                    #Best parameter from above
SEED          = 123
OUTPUT_DIR    = "./tuned_ner_model"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    seed=SEED,
    logging_steps=10,
    save_total_limit=2,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

trainer.train()

#Save the model
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"\nModel saved to {OUTPUT_DIR}")



  trainer = Trainer(



STARTING TRAINING ...



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0135,0.114615,0.845205,0.808819,0.826612
2,0.02,0.112511,0.822129,0.846338,0.834058
3,0.012,0.128031,0.831515,0.85426,0.842734



Model saved to ./tuned_ner_model


In [None]:
#Quick test
from transformers import pipeline
import unicodedata

ner = pipeline("ner", model=OUTPUT_DIR, tokenizer=OUTPUT_DIR,
               aggregation_strategy="simple")

txt = unicodedata.normalize("NFC", """
  ᾿Ανέστη δὲ βασιλεὺς ἕτερος ἐπ᾿ Αἴγυπτον, ὃς οὐκ ᾔδει τὸν ᾿Ιωσήφ.
  εἶπε δὲ τῷ ἔθνει αὐτοῦ· ἰδοὺ τὸ γένος τῶν υἱῶν ᾿Ισραὴλ μέγα πλῆθος καὶ ἰσχύει ὑπὲρ ἡμᾶς·
  δεῦτε οὖν κατασοφισώμεθα αὐτούς, μή ποτε πληθυνθῇ, καὶ ἡνίκα ἂν συμβῇ ἡμῖν πόλεμος,
  προστεθήσονται καὶ οὗτοι πρὸς τοὺς ὑπεναντίους καὶ ἐκπολεμήσαντες ἡμᾶς ἐξελεύσονται ἐκ τῆς γῆς.
  καὶ ἐπέστησεν αὐτοῖς ἐπιστάτας τῶν ἔργων, ἵνα κακώσωσιν αὐτοὺς ἐν τοῖς ἔργοις· καὶ Ισραήλᾠκοδόμησαν πόλεις ὀχυρὰς τῷ Φαραώ, τήν τε Πειθὼ καὶ Ῥαμεσσῆ καὶ ῎Ων, ἥ ἐστιν ῾Ηλιούπολις.
  καθότι δὲ αὐτοὺς ἐταπείνουν, τοσούτῳ πλείους ἐγίγνοντο, καὶ ἴσχυον σφόδρα σφόδρα· καὶ ἐβδελύσσοντο οἱ Αἰγύπτιοι ἀπὸ τῶν υἱῶν ᾿.
  καὶ κατεδυνάστευον οἱ Αἰγύπτιοι τοὺς υἱοὺς ᾿Ισραὴλ βίᾳ καὶ κατωδύνων αὐτῶν τὴν ζωὴν ἐν τοῖς ἔργοις τοῖς σκληροῖς, τῷ πηλῷ καὶ τῇ πλινθείᾳ καὶ πᾶσι τοῖς ἔργοις τοῖς ἐν τοῖς πεδίοις, κατὰ πάντα τὰ ἔργα, ὧν κατεδουλοῦντο αὐτοὺς μετὰ βίας.
""")

merged_results = []

for r in ner(txt):
    if r['word'].startswith("##"):
        merged_results[-1]['word'] += r['word'][2:]  # remove ## and join the subwords together instead of splitting it
        merged_results[-1]['score'] = max(merged_results[-1]['score'], r['score'])
    else:
        merged_results.append(r)

for r in merged_results:
    print(f"{r['word']:<20} → {r['entity_group']:<6} ({r['score']:.3f})")


Device set to use cuda:0


αιγυπτον             → LOC    (0.987)
φαραωω               → PERS   (0.999)
ραμεσση              → LOC    (0.908)
αιγυπτιοι            → GRP    (1.000)
αιγυπτιοι            → GRP    (1.000)


In [None]:
#Final test F1 score
FINAL_MODEL_DIR = OUTPUT_DIR
tokenizer_test = AutoTokenizer.from_pretrained(FINAL_MODEL_DIR)
model_test = AutoModelForTokenClassification.from_pretrained(FINAL_MODEL_DIR)

trainer_test = Trainer(
    model=model_test,
    args=TrainingArguments(
        output_dir="./temp_eval",
        per_device_eval_batch_size=32,
    ),
    eval_dataset=tokenised['test'], #used the Test dataset that was previously processed in same manner as the Train and Val
    tokenizer=tokenizer_test,
    data_collator=DataCollatorForTokenClassification(tokenizer_test),
    compute_metrics=compute_metrics,
)

print("Running official test set evaluation...")
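# Evaluate with trainer_test so the held-out test split is scored;
# `trainer` is still bound to the validation split.
results = trainer_test.evaluate()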

print("\n" + "═" * 60)
print("FINAL OFFICIAL TEST RESULTS (same as paper)")
print("═" * 60)
print(f"Precision : {results['eval_precision']:.4f}")
print(f"Recall    : {results['eval_recall']:.4f}")
print(f"Micro F1  : {results['eval_f1']:.4f}")
print("═" * 60)

if results['eval_f1'] > 0.826:
    print("We did better than the paper's 0.826!")
else:
    print("Close to or matches the original paper result.")

  trainer_test = Trainer(


Running official test set evaluation...



════════════════════════════════════════════════════════════
FINAL OFFICIAL TEST RESULTS (same as paper)
════════════════════════════════════════════════════════════
Precision : 0.8315
Recall    : 0.8543
Micro F1  : 0.8427
════════════════════════════════════════════════════════════
We did better than the paper's 0.826!
