To do:


1.   Test more hyperparameters to get metrics in 0.9 - HuggingFace Trainer has native support for hyperparameter search using either Optuna, Ray Tune, or Weights & Biases.
2.   Data augmentation - use another LLM to do NER categorization of text. and then add that to the training model.



Another relevant application of large language models (LLMs) in linguistic tasks is Named Entity Recognition (NER). Recent work by Beersmans et al. (2024) demonstrates this by combining transformer-based models with domain-specific knowledge to identify individuals in Ancient Greek texts. Their study, “Gotta catch ’em all!: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge,” was presented at the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024) and provides a strong example of how modern NLP techniques can be adapted for historical languages.

The authors built a model for Ancient greek NER task with F1 score of 0.826, which is State-of-Art as of current. Since Ancient Greek is a low-resource, highly inflected ancient language with limited annotated corpora (e.g., the dataset used here aggregates ~5,579 test tokens from projects like First1KGreek and NERAncientGreekML4AL). Unlike modern high-resource languages (e.g., English, where CoNLL-2003 NER F1 scores exceed 0.93), ancient languages suffer from data scarcity, orthographic variations (e.g., diacritics, dialects), and domain noise (e.g., fragmentary inscriptions or papyri). SOTA in this niche is typically in the 0.80–0.89 range for transformer-based models on similar tasks.

We attempted to do hyperparameters tuning for a better performance - there are 2 hyperparameters not tested in the original paper - Warmup ratio and batch size. Both of these are sensitive to transformer tuning.
Large batch size reduces noise, resulting in better token representation. This is particulary useful for complex morphologically rich languages like Ancient Greek. However, smaller batches tend to act like regularization allowing for beter generalization. Thus tuning of the batch size is to find the balance between overfitting and token representation.

[describe warm up]

Data augmentation is another potential approach to improving the model’s F1 score, but it is not practical for this project. Given our limited resources, we would need multiple models to first annotate the English translations of the Ancient Greek sentences and then align those annotations back to the corresponding Koine Greek tokens to infer NER labels. Even after this automated pipeline, human verification would still be required to ensure label accuracy. Producing a dataset of roughly 100,000 Koine Greek tokens under these constraints would be extremely time-consuming and effectively not feasible within the scope of this project.

In [4]:
#!pip install --upgrade transformers
!pip install -q transformers datasets seqeval torch tqdm evaluate


In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:


# Clone the repo
#!git clone https://github.com/NER-AncientLanguages/NERAncientGreekML4AL.git
#%cd NERAncientGreekML4AL

# Verify data exists
#!ls final_dataset/normal/*.conll

Cloning into 'NERAncientGreekML4AL'...
remote: Enumerating objects: 251, done.[K
remote: Total 251 (delta 0), reused 0 (delta 0), pack-reused 251 (from 3)[K
Receiving objects: 100% (251/251), 106.47 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (98/98), done.
Updating files: 100% (199/199), done.
Downloading Data/homogenisation/full_dataset_FINAL.csv (113 MB)
Error downloading object: Data/homogenisation/full_dataset_FINAL.csv (82d984c): Smudge error: Error downloading Data/homogenisation/full_dataset_FINAL.csv (82d984c506fbdcea63db80edc6d34c42f5128b2de3a34df706c6ecae87f02254): batch response: This repository exceeded its LFS budget. The account responsible for the budget should increase it to restore access.

Errors logged to /content/NERAncientGreekML4AL/.git/lfs/logs/20251118T021904.003062136.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: Data/homogenisation/full_dataset_FINAL.csv: smudge filter lfs failed
You can inspe

In [12]:
import os, warnings, unicodedata, numpy as np
from pathlib import Path
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    TrainingArguments, Trainer, DataCollatorForTokenClassification
)

import evaluate
from seqeval.metrics import classification_report

def read_conll(p: Path):
    """
    Parse CoNLL with format:
        [line_id]  token  [POS]  NER
    Example:
        110089790	βίβλος	O
    Returns: {"tokens": [...], "ner_tags": [...]}
    """
    sents, labs = [], []
    with p.open(encoding="utf-8") as f:
        sent, lab = [], []
        for i, raw in enumerate(f, 1):
            line = raw.strip()
            if not line or line.startswith("#"):
                if sent:
                    sents.append(sent)
                    labs.append(lab)
                    sent, lab = [], []
                continue

            # Split on whitespace (handles tabs and spaces)
            parts = line.split()
            if len(parts) < 2:
                print(f"Warning: Line {i} in {p.name} has <2 columns → SKIPPED")
                print(f"    → {line!r}")
                continue

            if len(parts) == 2:
                token = parts[0]
                ner   = parts[1]
            else:
                token = parts[1]   # skip ID
                ner   = parts[-1]  # last column is NER

            sent.append(unicodedata.normalize("NFC", token))
            lab.append(ner)

        if sent:
            sents.append(sent)
            labs.append(lab)

    print(f"Loaded {len(sents)} sentences from {p.name}")
    return {"tokens": sents, "ner_tags": labs}

# load data
train_path = Path("/content/drive/My Drive/Deep Learning Group Project/train.conll")
val_path   = Path("/content/drive/My Drive/Deep Learning Group Project/val.conll")
test_path  = Path("/content/drive/My Drive/Deep Learning Group Project/test.conll")

raw = {
    "train": read_conll(train_path),
    "validation": read_conll(val_path),
    "test": read_conll(test_path),
}
data = DatasetDict({k: Dataset.from_dict(v) for k, v in raw.items()})

#Model name -------------------------------------------------------------
model_name = "Marijke/AG_BERT_hypopt_NER"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
#------------------------------------------------------------------------

all_labels = sorted({l for s in data["train"]["ner_tags"] for l in s})
label2id   = {l: i for i, l in enumerate(all_labels)}
id2label   = {i: l for l, i in label2id.items()}

#tokenise + align labels
def tokenise_align(example):
    tok = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    aligned = []
    for i, labs in enumerate(example["ner_tags"]):
        word_ids = tok.word_ids(batch_index=i)
        prev = None
        ids  = []
        for wid in word_ids:
            if wid is None:
                ids.append(-100)
            elif wid != prev:
                ids.append(label2id[labs[wid]])
            else:
                ids.append(-100)               # sub-word → ignore
            prev = wid
        aligned.append(ids)
    tok["labels"] = aligned
    return tok

tokenised = data.map(tokenise_align, batched=True,
                     remove_columns=data["train"].column_names)


model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(all_labels),
    id2label=id2label,
    label2id=label2id,
)

collator = DataCollatorForTokenClassification(tokenizer)


def compute_metrics(p):
    preds, labels = p
    preds = np.argmax(preds, axis=2)

    true_labels = []
    pred_labels = []

    for prediction, label in zip(preds, labels):
        true_seq = [id2label[l] for l in label if l != -100]
        pred_seq = [id2label[pred] for pred, l in zip(prediction, label) if l != -100]
        if true_seq:  # Only add if not empty
            true_labels.append(true_seq)
            pred_labels.append(pred_seq)

    if not true_labels:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}


    metric = evaluate.load("seqeval")
    results = metric.compute(predictions=pred_labels, references=true_labels)

    return {
      "precision": results["overall_precision"],
      "recall": results['overall_recall'],
      "f1": results["overall_f1"]
    }



Loaded 30686 sentences from train.conll
Loaded 4434 sentences from val.conll
Loaded 4701 sentences from test.conll


Map:   0%|          | 0/30686 [00:00<?, ? examples/s]

Map:   0%|          | 0/4434 [00:00<?, ? examples/s]

Map:   0%|          | 0/4701 [00:00<?, ? examples/s]

Hyperparameter tuning was performed using Hyperopt. Although Hyperopt is less commonly used today, we chose it to maintain consistency with the methodology described in the referenced paper.

In [22]:
import os
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)
import evaluate
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK, STATUS_FAIL
from hyperopt.early_stop import no_progress_loss

# -------------------------------
# HYPEROPT SEARCH SPACE
# Include 2 new parameters that were not tried in the paper - batch size and warmup ratio
# -------------------------------
FIXED_LR = 6.040686648207059e-05
FIXED_WD = 0.01
FIXED_EPOCH = 3

space = {
    "batch_size":    hp.choice("batch_size", [8, 16, 32]),            # 3 options
    "warmup_ratio":  hp.choice("warmup_ratio", [0.0, 0.06, 0.1, 0.2]), # 4 options
    "seed": 123 #for reproducibility
}

# -------------------------------
# OBJECTIVE FUNCTION used by Hyperopt to test parameters
# -------------------------------
def objective(params):
  # we are keeping these 3 hyperparameters from the paper itself as they have found the optimal values for the Learning Rate, Weight Decay
  # and number of training epoch

    try:

        # Corrected: Directly use the values from params, as hp.choice returns the value itself, not an index
        batch_size = params["batch_size"]
        warmup_ratio = params["warmup_ratio"]

        model_for_trial = AutoModelForTokenClassification.from_pretrained(
            model_name,
            num_labels=len(all_labels),
            id2label=id2label,
            label2id=label2id,
        )


        total_steps = int(len(tokenised["train"]) / batch_size * FIXED_EPOCH)
        warmup_steps = int(total_steps * warmup_ratio)

        training_args = TrainingArguments(
            output_dir=f"./hyperopt_trial_{int(FIXED_EPOCH)}_{batch_size}_{FIXED_LR:.2e}",
            num_train_epochs=FIXED_EPOCH,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size * 2,
            learning_rate=FIXED_LR,
            weight_decay=FIXED_WD,
            warmup_steps=warmup_steps,
            lr_scheduler_type="linear",
            eval_strategy="epoch",
            save_strategy="epoch",
            logging_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            greater_is_better=True,
            report_to="none",
            seed=params["seed"],
            dataloader_num_workers=4,
            disable_tqdm=False,
        )

        trainer = Trainer(
            model=model_for_trial,
            args=training_args,
            train_dataset=tokenised["train"],
            eval_dataset=tokenised["validation"],
            tokenizer=tokenizer,
            data_collator=collator,
            compute_metrics=compute_metrics,
        )

        trainer.train()
        metrics = trainer.evaluate()

        return {
            "loss": -metrics["eval_f1"],
            "status": STATUS_OK,
            "eval_f1": metrics["eval_f1"],
            "params": params,
        }

    except Exception as e:
        print(f"Trial failed: {e}")
        return {"loss": 10.0, "status": STATUS_FAIL}

# -------------------------------
# RUN HYPEROPT
# -------------------------------
trials = Trials()

best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=12,
    trials=trials,
    rstate=np.random.default_rng(42),
    show_progressbar=True,
)

# -------------------------------
# PRINT BEST RESULT
# -------------------------------
best_trial = trials.best_trial
print("\n" + "="*60)
print("BEST HYPERPARAMETERS FOUND")
print("="*60)
print(f"Best eval micro F1 : {best_trial['result']['eval_f1']:.4f}")
print(f"Batch size         : {int(best_trial['result']['params']['batch_size'])}")
print(f"Warmup ratio       : {best_trial['result']['params']['warmup_ratio']}")
print("="*60)

# Optional: retrain on full train+val with best params and evaluate on test set

  0%|          | 0/12 [00:00<?, ?trial/s, best loss=?]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.035,0.095455,0.826412,0.841854,0.834061
2,0.0208,0.104416,0.825974,0.852765,0.839156
3,0.0114,0.11828,0.828293,0.85157,0.83977


  8%|▊         | 1/12 [11:41<2:08:33, 701.21s/trial, best loss: -0.8397700471698114]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0441,0.115932,0.812135,0.768311,0.789615
2,0.0406,0.108344,0.824427,0.823318,0.823873
3,0.0202,0.11544,0.830653,0.841704,0.836142


 17%|█▋        | 2/12 [24:50<2:05:28, 752.89s/trial, best loss: -0.8397700471698114]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0519,0.101465,0.81825,0.809567,0.813885
2,0.0335,0.111172,0.829559,0.821375,0.825447
3,0.0169,0.122214,0.831281,0.837369,0.834314


 25%|██▌       | 3/12 [37:55<1:55:09, 767.71s/trial, best loss: -0.8397700471698114]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0393,0.09385,0.814826,0.841256,0.82783
2,0.0278,0.109792,0.81571,0.8429,0.829082
3,0.014,0.120506,0.827865,0.841106,0.834433


 33%|███▎      | 4/12 [49:26<1:38:18, 737.33s/trial, best loss: -0.8397700471698114]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0404,0.097479,0.822495,0.811061,0.816738
2,0.0268,0.102771,0.816739,0.846039,0.831131
3,0.0133,0.118005,0.831167,0.841106,0.836107


 42%|████▏     | 5/12 [1:01:00<1:24:12, 721.85s/trial, best loss: -0.8397700471698114]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0338,0.097444,0.829641,0.828401,0.82902
2,0.0231,0.104894,0.824975,0.853214,0.838857
3,0.0119,0.116623,0.830093,0.855157,0.842439


 50%|█████     | 6/12 [1:12:43<1:11:31, 715.32s/trial, best loss: -0.8424385215726697]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.034,0.099151,0.830648,0.832138,0.831392
2,0.0229,0.103374,0.820671,0.855755,0.837846
3,0.0125,0.116794,0.830523,0.854111,0.842152


 58%|█████▊    | 7/12 [1:24:25<59:15, 711.08s/trial, best loss: -0.8424385215726697]  

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0493,0.100252,0.813259,0.781166,0.796889
2,0.0355,0.1102,0.822118,0.834529,0.828277
3,0.0176,0.118238,0.825474,0.83991,0.832629


 67%|██████▋   | 8/12 [1:37:34<49:03, 735.85s/trial, best loss: -0.8424385215726697]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0351,0.093263,0.823667,0.840658,0.832076
2,0.021,0.103993,0.826864,0.850224,0.838382
3,0.0113,0.115343,0.827386,0.852616,0.839812


 75%|███████▌  | 9/12 [1:49:16<36:15, 725.32s/trial, best loss: -0.8424385215726697]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0438,0.109722,0.809429,0.787892,0.798515
2,0.0399,0.106748,0.826685,0.827055,0.82687
3,0.0201,0.118187,0.826151,0.839611,0.832827


 83%|████████▎ | 10/12 [2:02:23<24:48, 744.33s/trial, best loss: -0.8424385215726697]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0515,0.103613,0.818583,0.804634,0.811548
2,0.0343,0.109292,0.823027,0.835575,0.829254
3,0.0167,0.124583,0.833012,0.840359,0.836669


 92%|█████████▏| 11/12 [2:15:30<12:37, 757.46s/trial, best loss: -0.8424385215726697]

  trainer = Trainer(



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0322,0.10515,0.820698,0.833333,0.826967
2,0.0251,0.101871,0.81998,0.8429,0.831282
3,0.0129,0.114957,0.828934,0.849626,0.839153


100%|██████████| 12/12 [2:27:14<00:00, 736.17s/trial, best loss: -0.8424385215726697]

BEST HYPERPARAMETERS FOUND
Best eval micro F1 : 0.8424
Batch size         : 32
Warmup ratio       : 0.1


In [24]:
# Create a model with the best hyper parameters found.
# ------------------------------------------------------------
# Hyper-parameters
# ------------------------------------------------------------
LEARNING_RATE = 6.040686648207059e-05 #From paper
EPOCHS        = 3                     #From paper
WEIGHT_DECAY  = 0.01                   #From paper
BATCH_SIZE    = 32                     #Best parameter from above
WARMUP_RATIO  = 0.1                    #Best parameter from above
SEED          = 123
OUTPUT_DIR    = "/content/drive/My Drive/Deep Learning Group Project/tuned_ner_model"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    seed=SEED,
    logging_steps=10,
    save_total_limit=2,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

#Train the model
print("\nSTARTING TRAINING ...\n")
trainer.train()

#Save the model
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"\nModel saved to {OUTPUT_DIR}")



  trainer = Trainer(



STARTING TRAINING ...



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.0306,0.097973,0.827334,0.827952,0.827643
2,0.0228,0.104885,0.826486,0.847982,0.837096
3,0.017,0.113758,0.832677,0.853214,0.84282



Model saved to /content/drive/My Drive/Deep Learning Group Project/tuned_ner_model


In [25]:
#Quick test
from transformers import pipeline
import unicodedata

ner = pipeline("ner", model=OUTPUT_DIR, tokenizer=OUTPUT_DIR,
               aggregation_strategy="simple")

txt = unicodedata.normalize("NFC", """
  ᾿Ανέστη δὲ βασιλεὺς ἕτερος ἐπ᾿ Αἴγυπτον, ὃς οὐκ ᾔδει τὸν ᾿Ιωσήφ.
  εἶπε δὲ τῷ ἔθνει αὐτοῦ· ἰδοὺ τὸ γένος τῶν υἱῶν ᾿Ισραὴλ μέγα πλῆθος καὶ ἰσχύει ὑπὲρ ἡμᾶς·
  δεῦτε οὖν κατασοφισώμεθα αὐτούς, μή ποτε πληθυνθῇ, καὶ ἡνίκα ἂν συμβῇ ἡμῖν πόλεμος,
  προστεθήσονται καὶ οὗτοι πρὸς τοὺς ὑπεναντίους καὶ ἐκπολεμήσαντες ἡμᾶς ἐξελεύσονται ἐκ τῆς γῆς.
  καὶ ἐπέστησεν αὐτοῖς ἐπιστάτας τῶν ἔργων, ἵνα κακώσωσιν αὐτοὺς ἐν τοῖς ἔργοις· καὶ Ισραήλᾠκοδόμησαν πόλεις ὀχυρὰς τῷ Φαραώ, τήν τε Πειθὼ καὶ Ῥαμεσσῆ καὶ ῎Ων, ἥ ἐστιν ῾Ηλιούπολις.
  καθότι δὲ αὐτοὺς ἐταπείνουν, τοσούτῳ πλείους ἐγίγνοντο, καὶ ἴσχυον σφόδρα σφόδρα· καὶ ἐβδελύσσοντο οἱ Αἰγύπτιοι ἀπὸ τῶν υἱῶν ᾿.
  καὶ κατεδυνάστευον οἱ Αἰγύπτιοι τοὺς υἱοὺς ᾿Ισραὴλ βίᾳ καὶ κατωδύνων αὐτῶν τὴν ζωὴν ἐν τοῖς ἔργοις τοῖς σκληροῖς, τῷ πηλῷ καὶ τῇ πλινθείᾳ καὶ πᾶσι τοῖς ἔργοις τοῖς ἐν τοῖς πεδίοις, κατὰ πάντα τὰ ἔργα, ὧν κατεδουλοῦντο αὐτοὺς μετὰ βίας.
""")

merged_results = []

for r in ner(txt):
    if r['word'].startswith("##"):
        merged_results[-1]['word'] += r['word'][2:]  # remove ## and join the subwords together instead of splitting it
        merged_results[-1]['score'] = max(merged_results[-1]['score'], r['score'])
    else:
        merged_results.append(r)

for r in merged_results:
    print(f"{r['word']:<20} → {r['entity_group']:<6} ({r['score']:.3f})")


Device set to use cuda:0


αιγυπτον             → LOC    (0.991)
φαραω                → PERS   (0.998)
ραμεσση              → PERS   (0.837)
αιγυπτιοι            → GRP    (0.999)
αιγυπτιοι            → GRP    (0.999)


In [27]:
FINAL_MODEL_DIR = OUTPUT_DIR
tokenizer_test = AutoTokenizer.from_pretrained(FINAL_MODEL_DIR)
model_test = AutoModelForTokenClassification.from_pretrained(FINAL_MODEL_DIR)

trainer_test = Trainer(
    model=model_test,
    args=TrainingArguments(
        output_dir="./temp_eval",
        per_device_eval_batch_size=32,
    ),
    eval_dataset=tokenised['test'], #used the Test dataset that was previously processed in same manner as the Train and Val
    tokenizer=tokenizer_test,
    data_collator=DataCollatorForTokenClassification(tokenizer_test),
    compute_metrics=compute_metrics,
)

print("Running official test set evaluation...")
results = trainer.evaluate()

print("\n" + "═" * 60)
print("FINAL OFFICIAL TEST RESULTS (same as paper)")
print("═" * 60)
print(f"Precision : {results['eval_precision']:.4f}")
print(f"Recall    : {results['eval_recall']:.4f}")
print(f"Micro F1  : {results['eval_f1']:.4f}")
print("═" * 60)

if results['eval_f1'] > 0.826:
    print("We did better than the paper's 0.826!")
else:
    print("Close to or matches the original paper result.")

  trainer_test = Trainer(


Running official test set evaluation...



════════════════════════════════════════════════════════════
FINAL OFFICIAL TEST RESULTS (same as paper)
════════════════════════════════════════════════════════════
Precision : 0.8327
Recall    : 0.8532
Micro F1  : 0.8428
════════════════════════════════════════════════════════════
We did better than the paper's 0.826!


In [None]:
# ------------------------------------------------------------
# Hyper-parameters
# ------------------------------------------------------------
LEARNING_RATE = 3e-5
BATCH_SIZE    = 32
EPOCHS        = 5
WEIGHT_DECAY  = 0.01
WARMUP_RATIO  = 1.0
SEED          = 123
OUTPUT_DIR    = f"./tuned_ner_model_lr{LEARNING_RATE}_bs{BATCH_SIZE}_ep{EPOCHS}"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    seed=SEED,
    logging_steps=10,
    save_total_limit=2,
    report_to=[],
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Reference:
Beersmans, M., Keersmaekers, A., de Graaf, E., Van de Cruys, T., Depauw, M., & Fantoli, M. (2024). “Gotta catch ’em all!”: Retrieving people in Ancient Greek texts combining transformer models and domain knowledge. In J. Pavlopoulos et al. (Eds.), Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024) (pp. 152–164). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.ml4al-1.16