# Language model applied to Ancient Greek

Students:
<pre>Sarah Batara []</pre>
<pre>Arjun Menon []</pre>
<pre>Javier Marsicano [qq577517]</pre>


Training Data
https://github.com/kevinkrahn/ancient-greek-datasets

https://github.com/Clear-Bible/macula-greek

https://github.com/Heidelberg-NLP/ancient-language-models?tab=readme-ov-file

https://huggingface.co/datasets/hmcgovern/original-language-bibles-greek

https://opengreekandlatin.github.io/First1KGreek/

https://github.com/PerseusDL/canonical-greekLit/tree/master?tab=readme-ov-file



Test Data








## Motivation
We chose this topic because we are interested in language models and have some prior experience with NLP. For most language tasks, LLMs achieve state-of-the-art performance, with a few exceptions; low-resource languages are one of them. As the name implies, datasets and corpora for low-resource languages are scarce or simply insufficient for training models that require large amounts of data, such as LLMs. Applying deep learning to low-resource languages, for instance dialects spoken by specific ethnic groups in Africa or Asia, remains an active area of research.

Ancient or classical languages are a good example of low-resource languages, Ancient Greek in particular. Furthermore, the language evolved significantly over time, with changes in grammar, vocabulary, and more. Specifically, Ancient Greek went through several distinct phases over many centuries, including Classical Greek and Hellenistic (Koine) Greek, each with its own characteristics. This evolution makes it even more challenging to create a single, comprehensive language model that accurately represents the complexities of Ancient Greek.

There is still little work on language models applied to Ancient Greek, but we found two interesting papers, both published in 2023. One is [Kevin Krahn, Derrick Tate, and Andrew C. Lamicela (2023) Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation](https://aclanthology.org/2023.alp-1.2/), based on multilingual models; the other is [Exploring Large Language Models for Classical Philology](https://aclanthology.org/2023.acl-long.846/), based on monolingual models. In their research, the authors leveraged a multilingual translation model to achieve high accuracy in translating Ancient Greek texts.

While their results were promising, we identified several areas for potential improvement. Notably, their model was trained on both Modern and Ancient Greek, which may have limited its ability to capture the nuanced linguistic and semantic distinctions between the two. In this project, we aim to explore a more targeted approach focusing specifically on Koine Greek, to determine whether a specialized model can yield better translation performance.

For low-resource languages, BERT or one of its variants is often the preferred model. In our work, we plan to use the same datasets employed in these studies, along with several additional ones we have found published in repositories.


In [None]:
#Import required libraries
import pandas as pd
import csv
import re

In [None]:
#Data Processing
#Location: https://github.com/biblicalhumanities/Nestle1904/tree/master/morph


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Path to the Nestle1904 morphology file in Google Drive (tab-separated, despite the .csv extension)

file_path = '/content/drive/MyDrive/Deep Learning Group Project/Nestle1904.csv'
good = []
bad = []

# 'utf-8-sig' strips a leading byte-order mark so it does not stick to the first header cell
with open(file_path, "r", encoding="utf-8-sig") as f:
    reader = csv.reader(f, delimiter='\t')
    for i, row in enumerate(reader, start=1):
        if len(row) < 2:  # e.g., missing fields
            bad.append({"line_number": i, "content": row})
        else:
            good.append(row)

bad_df = pd.DataFrame(bad)
good_df = pd.DataFrame(good)

In [None]:
bad_df.size

0

In [None]:
good_df

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,﻿BCV,text,func_morph,form_morph,strongs,lemma,normalized,,
1,Matt 1:1,Βίβλος,N-NSF,N-NSF,976,βίβλος,Βίβλος,,
2,Matt 1:1,γενέσεως,N-GSF,N-GSF,1078,γένεσις,γενέσεως,,
3,Matt 1:1,Ἰησοῦ,N-GSM,N-GSM,2424,Ἰησοῦς,Ἰησοῦ,,
4,Matt 1:1,Χριστοῦ,N-GSM,N-GSM,5547,Χριστός,Χριστοῦ,,
...,...,...,...,...,...,...,...,...,...
137775,Rev 22:21,τοῦ,T-GSM,T-GSM,3588,ὁ,τοῦ,,
137776,Rev 22:21,Κυρίου,N-GSM,N-GSM,2962,κύριος,Κυρίου,,
137777,Rev 22:21,Ἰησοῦ,N-GSM,N-GSM,2424,Ἰησοῦς,Ἰησοῦ,,
137778,Rev 22:21,μετὰ,PREP,PREP,3326,μετά,μετά,,


Where is the English translation? Do we have to manually align the reference verses with the Greek words?
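
The Nestle1904 rows hold one word each, keyed by verse in the `BCV` column, so verse-level pairs may not need any word-level alignment. A minimal sketch, assuming a hypothetical verse-keyed English table (`english_df` and its placeholder text are not part of the original data): rebuild each verse by grouping on `BCV`, then merge on that key.

```python
import pandas as pd

# Toy rows in the same layout as the Nestle1904 file (one word per row).
words = pd.DataFrame(
    {
        "BCV": ["Matt 1:1", "Matt 1:1", "Matt 1:1", "Matt 1:2"],
        "text": ["Βίβλος", "γενέσεως", "Ἰησοῦ", "Ἀβραὰμ"],
    }
)

# Hypothetical verse-level English table; a real one would be any
# translation keyed by the same book/chapter/verse reference.
english_df = pd.DataFrame(
    {
        "BCV": ["Matt 1:1", "Matt 1:2"],
        "english_text": ["placeholder English for 1:1", "placeholder English for 1:2"],
    }
)

# Join the words of each verse in their original order
# (sort=False keeps verses in the order they first appear).
verses = (
    words.groupby("BCV", sort=False)["text"]
    .agg(" ".join)
    .reset_index()
    .rename(columns={"text": "greek_text"})
)

# Verse-level alignment: one row per verse, Greek and English side by side.
pairs = verses.merge(english_df, on="BCV", how="inner")
print(pairs)
```

The same grouping would apply to `good_df` once its first row is used as the header.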

In [None]:
# sr_corpus.txt from the Krahn dataset

sr_corpus_path = '/content/drive/MyDrive/Deep Learning Group Project/sr_corpus.txt'

# Initialize lists to store parsed data
records = []
current_record = {}
with open(sr_corpus_path, 'r', encoding='utf-8') as file:
    lines = file.readlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        # Check if the line looks like metadata (e.g., contains commas and an identifier)
        if ',' in line and 'tlg' in line:
            # Split metadata into author, title, identifier
            metadata = [part.strip() for part in line.split(',', 2)]
            if len(metadata) == 3:
                current_record['author'] = metadata[0]
                current_record['title'] = metadata[1]
                current_record['identifier'] = metadata[2]
            i += 1
            # Collect all lines until next metadata or end of file
            block = []
            while i < len(lines) and not (',' in lines[i].strip() and 'tlg' in lines[i].strip()):
                if lines[i].strip():  # Skip empty lines
                    block.append(lines[i].strip())
                i += 1
            # Split block into Greek and English
            greek_text = []
            english_text = []
            for j, line in enumerate(block):
                # Check if line starts with Latin alphabetic character (A-Z or a-z)
                if re.match(r'^[A-Za-z]', line):
                    # Everything before this is Greek, after is English
                    greek_text = block[:j]
                    english_text = block[j:]
                    break
            current_record['greek_text'] = ' '.join(greek_text) if greek_text else ''
            current_record['english_text'] = ' '.join(english_text) if english_text else ''
            # Add the record to the list
            records.append(current_record.copy())
            current_record = {}
        else:
            i += 1

sr_corpus_df = pd.DataFrame(records)
sr_corpus_df.head(10)

Unnamed: 0,author,title,identifier,greek_text,english_text
0,Dinarchus,Against Demosthenes,tlg0029.tlg004,"ἐκεῖνοι ἦσαν, ἐκεῖνοι, ὦ Ἀθηναῖοι, ἄξιοι σύμβο...","They were counsellors, Athenians, they were le..."
1,Dinarchus,Against Demosthenes,tlg0029.tlg004,"οὐκ αἰσχύνεσθε, ἄνδρες Ἀθηναῖοι, τὴν κατὰ Δημο...","Are you not ashamed, Athenians, that you shoul..."
2,Dinarchus,Against Demosthenes,tlg0029.tlg004,εἰσί τινες ἐν τῷ δικαστηρίῳ τῶν ἐν τοῖς τριακο...,Are there any people in the court who were amo...
3,Dinarchus,Against Demosthenes,tlg0029.tlg004,"εἴπατέ μοι πρὸς Διός, ὦ ἄνδρες, προῖκα τοῦτον ...","Really, gentlemen, tell me: do you think he go..."
4,Dinarchus,Against Demosthenes,tlg0029.tlg004,"ἢ τὸ γράψαι Ταυροσθένην Ἀθηναῖον εἶναι, τὸν το...",Did he get nothing for proposing that Taurosth...
5,Dinarchus,Against Demosthenes,tlg0029.tlg004,"περὶ τούτων οὖν μάρτυρας ὑμῖν δεῖ καλεῖν, ἢ πε...",Is there any need then for me to call up witne...
6,Dinarchus,Against Demosthenes,tlg0029.tlg004,"πολλοί, ὦ ἄνδρες, πολλοὶ τῶν πολιτῶν καὶ τῶν ἄ...","Gentlemen, you have very many witnesses, as I ..."
7,Dinarchus,Against Demosthenes,tlg0029.tlg004,ἐπιωρκηκὼς μὲν τὰς σεμνὰς θεὰς ἐν Ἀρείῳ πάγῳ κ...,having broken the oaths he took on the Areopag...
8,Dinarchus,Against Demosthenes,tlg0029.tlg004,"ἀλλ’ ὅμως, ἄνδρες δικασταί, οὕτω Δημοσθένης τῷ...","Despite this, gentlemen of the jury, Demosthen..."
9,Dinarchus,Against Demosthenes,tlg0029.tlg004,"πρᾶγμα κατασκευάζων οὐ γεγενημένον, ἀλλὰ ψεύδε...","This is a story of his own invention, not base..."
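
The parsing loop above treats the first line that starts with a Latin letter as the beginning of the English text, so an English line that opens with a quotation mark or a digit would be counted as Greek. A possible alternative, sketched here under the assumption that each line is entirely Greek or entirely English, classifies every line by the share of Greek letters it contains. The helper names `is_greek_line` and `split_block` are illustrative and not part of the original notebook.

```python
import unicodedata

def is_greek_line(line, threshold=0.5):
    """Return True if at least `threshold` of the letters in `line` are Greek."""
    letters = [ch for ch in line if ch.isalpha()]
    if not letters:
        return False
    greek = sum(1 for ch in letters if "GREEK" in unicodedata.name(ch, ""))
    return greek / len(letters) >= threshold

def split_block(block):
    """Split a list of lines into (greek_text, english_text) strings."""
    greek = [ln for ln in block if is_greek_line(ln)]
    english = [ln for ln in block if not is_greek_line(ln)]
    return " ".join(greek), " ".join(english)

# The English line starts with a quotation mark, not a Latin letter.
block = ["ἐκεῖνοι ἦσαν, ὦ Ἀθηναῖοι", "“They were counsellors, Athenians”"]
greek_text, english_text = split_block(block)
print(greek_text)
print(english_text)
```

Lines with no letters at all (for example bare numbers) end up on the English side in this sketch.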


In [None]:
sr_corpus_df['author'].unique()

array(['Dinarchus', 'Thucydides', 'Pseudo-Plutarch', 'Homer', 'Aeschylus',
       'Anonymous', 'Palaiphatos', 'Eusebius of Caesaria', 'Ignatius',
       'Lysias', 'Plato', 'Andocides', 'Xenophon', 'Aretaeus',
       'Euripides', 'Pseudo-Justin Martyr', 'Hyperides', 'Isocrates',
       'Plutarch', 'Sophocles', 'Antiphon', 'Didache', '', 'Aristotle',
       'Shepherd of Hermas', 'Pindar', 'Apollodorus of Athens', 'Galen',
       'Isaeus', 'Appian', 'Lycurgus', 'Agathemerus', 'Herodotus',
       'Pausanias', 'Polycarp', 'Diogenes Laertius', 'Aeschines',
       'Demosthenes', 'Martyrdom of Polycarp', 'Epistle of Barnabus',
       'Hippocrates', 'Bacchylides', 'Clement of Rome', 'Septuagint'],
      dtype=object)

Following Yamshchikov et al. (2022) and Singh et al. (2021), we lowercase all the Greek text and strip diacritics, but keep punctuation.

This dataset consists of Greek texts spanning a thousand years, covering different dialects and time periods of the language. We do not filter out any texts based on their dialect or time period. This dataset does not contain Modern Greek.


In [None]:
import unicodedata

def strip_diacritics(text):
    # Decompose each character into its base letter plus combining marks (NFD),
    # drop the combining marks (accents, breathings, iota subscript, diaeresis),
    # then recompose (NFC). Case and punctuation are left untouched.
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    return unicodedata.normalize('NFC', stripped)

sr_corpus_df['greek_text'] = sr_corpus_df['greek_text'].str.lower()
sr_corpus_df['greek_text'] = sr_corpus_df['greek_text'].apply(strip_diacritics)
sr_corpus_df.head(10)

Unnamed: 0,author,title,identifier,greek_text,english_text
0,Dinarchus,Against Demosthenes,tlg0029.tlg004,"εκεινοι ησαν, εκεινοι, ω αθηναιοι, αξιοι συμβο...","They were counsellors, Athenians, they were le..."
1,Dinarchus,Against Demosthenes,tlg0029.tlg004,"ουκ αισχυνεσθε, ανδρες αθηναιοι, την κατα δημο...","Are you not ashamed, Athenians, that you shoul..."
2,Dinarchus,Against Demosthenes,tlg0029.tlg004,εισι τινες εν τω δικαστηριω των εν τοις τριακο...,Are there any people in the court who were amo...
3,Dinarchus,Against Demosthenes,tlg0029.tlg004,"ειπατε μοι προς διος, ω ανδρες, προικα τουτον ...","Really, gentlemen, tell me: do you think he go..."
4,Dinarchus,Against Demosthenes,tlg0029.tlg004,"η το γραψαι ταυροσθενην αθηναιον ειναι, τον το...",Did he get nothing for proposing that Taurosth...
5,Dinarchus,Against Demosthenes,tlg0029.tlg004,"περι τουτων ουν μαρτυρας υμιν δει καλειν, η πε...",Is there any need then for me to call up witne...
6,Dinarchus,Against Demosthenes,tlg0029.tlg004,"πολλοι, ω ανδρες, πολλοι των πολιτων και των α...","Gentlemen, you have very many witnesses, as I ..."
7,Dinarchus,Against Demosthenes,tlg0029.tlg004,επιωρκηκως μεν τας σεμνας θεας εν αρειω παγω κ...,having broken the oaths he took on the Areopag...
8,Dinarchus,Against Demosthenes,tlg0029.tlg004,"αλλ’ ομως, ανδρες δικασται, ουτω δημοσθενης τω...","Despite this, gentlemen of the jury, Demosthen..."
9,Dinarchus,Against Demosthenes,tlg0029.tlg004,"πραγμα κατασκευαζων ου γεγενημενον, αλλα ψευδε...","This is a story of his own invention, not base..."


In [None]:
new_dataset = sr_corpus_df[['greek_text', 'english_text']]

In [None]:
# https://huggingface.co/datasets/hmcgovern/original-language-bibles-greek

dataset_hf_path = '/content/drive/MyDrive/Deep Learning Group Project/train-00000-of-00001.parquet'
dataset_hf = pd.read_parquet(dataset_hf_path)
dataset_hf.head(10)

Unnamed: 0,reference,text,transliteration,translation,dStrongs,manuscript_source
0,Mat.1.1.01,Βίβλος,Biblos,[The] book,G0976=N-NSF,NKO
1,Mat.1.1.02,γενέσεως,geneseōs,of [the] genealogy,G1078=N-GSF,NKO
2,Mat.1.1.03,Ἰησοῦ,Iēsou,of Jesus,G2424G=N-GSM-P,NKO
3,Mat.1.1.04,Χριστοῦ,Christou,Christ,G5547=N-GSM-T,NKO
4,Mat.1.1.05,υἱοῦ,huiou,son,G5207=N-GSM,NKO
5,Mat.1.1.06,Δαυὶδ,Dauid,of David,G1138=N-GSM-P,NKO
6,Mat.1.1.07,υἱοῦ,huiou,son,G5207=N-GSM,NKO
7,Mat.1.1.08,Ἀβραάμ.,Abraam,of Abraham.,G0011=N-GSM-P,NKO
8,Mat.1.2.01,Ἀβραὰμ,Abraam,Abraham,G0011=N-NSM-P,NKO
9,Mat.1.2.02,ἐγέννησεν,egennēsen,begat,G1080=V-AAI-3S,NKO


In [None]:
dataset_hf.size

852582

In [None]:
dataset_hf.tail(30)

Unnamed: 0,reference,text,transliteration,translation,dStrongs,manuscript_source
142067,Rev.22.19.31,τῶν,tōn,of those,G3588=T-GPM,NKO
142068,Rev.22.19.32,γεγραμμένων,gegrammenōn,written,G1125=V-RPP-GPM,NKO
142069,Rev.22.19.33,ἐν,en,in,G1722=PREP,NKO
142070,Rev.22.19.34,τῷ,tō,<the>,G3588=T-DSN,NKO
142071,Rev.22.19.35,βιβλίῳ,bibliō,book,G0975=N-DSN,NKO
142072,Rev.22.19.36,τούτῳ.,toutō,this.,G3778=D-DSN,NKO
142073,Rev.22.20.01,Λέγει,Legei,Says,G3004G=V-PAI-3S,NKO
142074,Rev.22.20.02,ὁ,ho,the [One],G3588=T-NSM,NKO
142075,Rev.22.20.03,μαρτυρῶν,marturōn,testifying,G3140=V-PAP-NSM,NKO
142076,Rev.22.20.04,ταῦτα·,tauta,these things;,G3778=D-APN,NKO
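
Because every row carries a `reference` of the form book.chapter.verse.word, verse-level Greek and English strings can be rebuilt by dropping the trailing word index and grouping. The sketch below uses a few toy rows in the same layout; the column names `verse`, `greek_text`, and `english_gloss` are our own choices. The joined glosses follow Greek word order, so they are an interlinear gloss and not a fluent translation.

```python
import pandas as pd

# Toy rows in the same layout as the Hugging Face dataset.
rows = pd.DataFrame(
    {
        "reference": ["Mat.1.1.01", "Mat.1.1.02", "Mat.1.1.03", "Mat.1.2.01"],
        "text": ["Βίβλος", "γενέσεως", "Ἰησοῦ", "Ἀβραὰμ"],
        "translation": ["[The] book", "of [the] genealogy", "of Jesus", "Abraham"],
    }
)

# "Mat.1.1.01" -> "Mat.1.1": drop the trailing word index to get a verse key.
rows["verse"] = rows["reference"].str.rsplit(".", n=1).str[0]

# One row per verse, with the words joined in their original order.
verse_pairs = (
    rows.groupby("verse", sort=False)
    .agg(greek_text=("text", " ".join), english_gloss=("translation", " ".join))
    .reset_index()
)
print(verse_pairs)
```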


In [None]:
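# Use the dataset above to create a new dataset in the CoNLL format
from transformers import pipeline

# Load English NER model
ner_model = pipeline(
    "ner",
    model="Jean-Baptiste/roberta-large-ner-english",
    aggregation_strategy="simple"
)

# Function to get the NER label for a single word or short gloss
def get_ner_label(word):
    # Missing or empty glosses get the "outside" label without calling the model
    if not isinstance(word, str) or not word.strip():
        return "O"
    result = ner_model(word)
    if len(result) == 0:
        return "O"
    return result[0]["entity_group"]

# Label each distinct gloss once, then map the labels back onto every row;
# calling the model row by row on CPU is very slow for a dataset of this size
unique_glosses = dataset_hf["translation"].dropna().unique()
gloss_to_label = {gloss: get_ner_label(gloss) for gloss in unique_glosses}
dataset_hf["ner_label"] = dataset_hf["translation"].map(gloss_to_label).fillna("O")

print(dataset_hf[["text", "translation", "ner_label"]].head())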



Device set to use cpu


KeyboardInterrupt: 

In [None]:
dataset_hf.to_csv('/content/drive/MyDrive/Deep Learning Group Project/dataset_hf_.csv', index=False)
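
The cell above mentions the CoNLL format but stops at adding a label column. A minimal sketch of the remaining step, assuming one token and one label per row and treating each verse as a sentence, is shown below. The function name `to_conll` and the sample labels are illustrative only; the labels here are plain entity groups, not BIO tags.

```python
def to_conll(rows):
    """Format (reference, token, label) triples as CoNLL-style text.

    One "token<TAB>label" line per word, with a blank line between verses.
    """
    lines = []
    previous_verse = None
    for reference, token, label in rows:
        verse = reference.rsplit(".", 1)[0]  # "Mat.1.1.01" -> "Mat.1.1"
        if previous_verse is not None and verse != previous_verse:
            lines.append("")  # sentence (verse) boundary
        lines.append(f"{token}\t{label}")
        previous_verse = verse
    return "\n".join(lines) + "\n"

# Illustrative rows; the labels are made up for the example.
sample = [
    ("Mat.1.1.01", "Βίβλος", "O"),
    ("Mat.1.1.03", "Ἰησοῦ", "PER"),
    ("Mat.1.2.01", "Ἀβραὰμ", "PER"),
]
conll_text = to_conll(sample)
print(conll_text)
```

With the real data, the triples could come from `dataset_hf[["reference", "text", "ner_label"]].itertuples(index=False)`.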

https://arxiv.org/abs/2001.08210 - Fine-tuning mBART, a multilingual model that works for low-resource languages. mBART is one of the first methods for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches focused only on the encoder, the decoder, or reconstructing parts of the text. As far as we can tell from the published language lists, Greek is not among the languages of mBART-25 or mBART-50, so we should not assume the model was pre-trained on Modern Greek.
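
To make "denoising full texts" concrete, here is a toy illustration of the two kinds of corruption used in this style of pre-training: shuffling sentence order and replacing spans of words with a single mask token. It is a simplified sketch, not the paper's implementation: span lengths are drawn uniformly here, whereas the paper samples them from a Poisson distribution, and `add_noise`, `max_span`, and the sample sentences are our own. The model is then trained to reconstruct the original text from the corrupted one.

```python
import random

MASK = "<mask>"

def add_noise(sentences, mask_ratio=0.35, max_span=3, seed=0):
    """Toy denoising-style corruption: shuffle sentences, then mask word spans."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)

    words = " ".join(shuffled).split()
    budget = int(len(words) * mask_ratio)  # maximum number of words to hide
    noised = []
    i = 0
    while i < len(words):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(rng.randint(1, max_span), budget, len(words) - i)
            noised.append(MASK)  # the whole span collapses to one mask token
            i += span
            budget -= span
        else:
            noised.append(words[i])
            i += 1
    return " ".join(noised)

original = ["ἐν ἀρχῇ ἦν ὁ λόγος", "καὶ ὁ λόγος ἦν πρὸς τὸν θεόν"]
print(add_noise(original))
```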

In [None]:
# -------------------------------------------------
#  Fine-tune mBART on Ancient Greek (no HF token)
# -------------------------------------------------
import pandas as pd
from datasets import Dataset
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Trainer,
    TrainingArguments,
    pipeline,
)

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"

tokenizer = MBart50TokenizerFast.from_pretrained(
    MODEL_NAME
)
# Targets are English. The source language code is left at the tokenizer default;
# which mBART-50 code to use for Ancient Greek is still an open choice.
tokenizer.tgt_lang = "en_XX"

def preprocess(batch):
    # Keep only pairs where both sides are non-empty strings
    valid_inputs = []
    valid_labels = []

    for src, tgt in zip(batch["greek_text"], batch["english_text"]):
        if not isinstance(src, str) or not isinstance(tgt, str):
            continue
        if not src.strip() or not tgt.strip():
            continue

        valid_inputs.append(src.strip())
        valid_labels.append(tgt.strip())

    if not valid_inputs:
        return {"input_ids": [], "attention_mask": [], "labels": []}

    # Tokenize the filtered Greek inputs (not the raw batch)
    model_inputs = tokenizer(
        valid_inputs,
        max_length=256,
        truncation=True,
        padding=False,
    )

    # Tokenize the English targets and keep only their token ids as labels
    labels = tokenizer(
        text_target=valid_labels,
        max_length=256,
        truncation=True,
        padding=False,
    )
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

temp_dataset = Dataset.from_pandas(new_dataset)
# All original columns are removed because filtering can change the number of rows per batch
tokenized = temp_dataset.map(preprocess, batched=True, remove_columns=temp_dataset.column_names)
tokenized = tokenized.train_test_split(test_size=0.2)
print(tokenized)

model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)




In [None]:
tokenized