OK, I used my latinepi tool (which needs cleaning up) to download *a lot* of inscriptions from EDH, then used a combination of scripts and an LLM to annotate them. In this notebook we clean out the unsuccessfully annotated inscriptions, leaving a core of real data; then we generate a whole bunch of synthetic inscriptions (thank god for formulaic epigraphy, eh?) that are correctly annotated, and mix the two together. The goal is to train the LatinCy spaCy model to recognize the elements of funerary inscriptions, for data extraction from transcriptions. Why not?

In [1]:
!mkdir -p assets         # To store your raw data files (jsonl, csv)
!mkdir -p configs        # To store configuration files
!mkdir -p scripts        # To store helper scripts (like data conversion)
!mkdir -p training       # To store the output of the training process
!mkdir -p corpus         # To store the processed .spacy files

In [None]:
#!pip install -U spacy #already in colab
#!python -m spacy download en_core_web_lg
#!pip install "la-core-web-sm @ https://huggingface.co/latincy/la_core_web_sm/resolve/main/la_core_web_sm-any-py3-none-any.whl"
!pip install "la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl"
#
# this is what we're going to retrain.
!pip install spacy-transformers

In [None]:
# Then run this to restart the runtime so the newly installed packages load.
# Colab will report that the session crashed. Ignore that and continue.
import os
os.kill(os.getpid(), 9)

In [1]:
# don't use any of this, just generate some fresh ones below
# start with some synthetic training annotations

#!wget https://gist.githubusercontent.com/shawngraham/2121b3ee828c4547fc2fd9470158e6d8/raw/ea0e2c578097ea67ab6c54e170d4559ea7c29790/simple-training.jsonl -O assets/synthetic-training.jsonl ## simple ones


#!wget https://gist.githubusercontent.com/shawngraham/f44663efc80916a75c736a38f024b371/raw/6585104793170cb5ef7c57dde16adf2d591dff04/synthetic-training.jsonl -O assets/synthetic-training.jsonl

#!wget https://gist.githubusercontent.com/shawngraham/f44663efc80916a75c736a38f024b371/raw/a924e157c8d87d13377dcc93e890879251d0c674/synthetic-training.jsonl # this is a mixture of 400+ lines of completely synthetic data and 1000+ lines of real data with llm generated annotations

#!wget https://gist.githubusercontent.com/shawngraham/f44663efc80916a75c736a38f024b371/raw/9b1d724d19b30ded168268af0fd959dccaae521e/synthetic-training.jsonl -O assets/synthethic-training.jsonl

## and some synthetic testing data
#!wget https://gist.githubusercontent.com/shawngraham/3633224a209ab01f650f9dee9183888d/raw/9cc9dfeb566dc8465d744e6745af98af363a227c/testing-epigraphs-synthetic.csv -O assets/test-fake-epigraphs.csv

## Real inscriptions from EDH
#!wget https://gist.githubusercontent.com/shawngraham/8229265886e776624476331194c79934/raw/dac32c34cfb8528edaf1c2d961ed8b8f77e24c86/inscriptions.csv -O assets/inscriptions.csv

## clean up some real inscriptions to add with the synthetic ones

In [2]:
###### ok, try to clean up some real ones and then mix them in with the synthetic

## these are about 1,100 rows of real inscriptions that were annotated
## through a combination of scripts & an LLM, but the results had issues
## with annotation offsets
!wget https://gist.githubusercontent.com/shawngraham/d949119d45f5cc661205a3bfbb266d86/raw/c93dfe9324d01478fe88f59f9cf9ee9d560dfc1e/annotations-to-clean-up-for-dataset.jsonl -O assets/to-clean.jsonl

--2025-11-21 14:11:51--  https://gist.githubusercontent.com/shawngraham/d949119d45f5cc661205a3bfbb266d86/raw/c93dfe9324d01478fe88f59f9cf9ee9d560dfc1e/annotations-to-clean-up-for-dataset.jsonl
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 459000 (448K) [text/plain]
Saving to: ‘assets/to-clean.jsonl’


2025-11-21 14:11:51 (18.5 MB/s) - ‘assets/to-clean.jsonl’ saved [459000/459000]



In [3]:
## so we clean out the ones that are no good and keep only the cleanest / easiest to fix
import json
import spacy

def clean_real_inscriptions(input_path, output_path, model='la_core_web_lg'):
    """
    Cleans real inscription data:
    1. Copies transcription → text
    2. Re-validates all annotation spans
    3. Drops records with errors or unalignable annotations
    4. Outputs clean subset ready for training
    """
    nlp = spacy.load(model)

    stats = {
        "total": 0,
        "has_error_flag": 0,
        "no_transcription": 0,
        "no_annotations": 0,
        "perfect_match": 0,
        "fixed_spans": 0,
        "dropped_records": 0,
        "saved": 0
    }

    salvaged = []

    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                record = json.loads(line)
                stats["total"] += 1
            except json.JSONDecodeError:
                continue

            # Skip flagged errors
            if record.get("_error"):
                stats["has_error_flag"] += 1
                continue

            # Use transcription as text
            text = record.get('transcription', '').strip()
            if not text:
                stats["no_transcription"] += 1
                continue

            # Update record to use text field
            record['text'] = text
            record.pop('transcription', None)  # Remove redundant field

            # Skip if no annotations
            annotations = record.get('annotations', [])
            if not isinstance(annotations, list) or not annotations:
                stats["no_annotations"] += 1
                continue

            # Validate/fix annotations
            doc = nlp.make_doc(text)
            validated_ents = []
            record_is_salvageable = True

            for entity in annotations:
                if not isinstance(entity, list) or len(entity) != 3:
                    continue

                start, end, label = entity

                # Validate span is within text bounds
                if start < 0 or end > len(text) or start >= end:
                    record_is_salvageable = False
                    break

                # Snap the span to token boundaries: "expand" widens it to
                # cover every token the character range touches
                span = doc.char_span(start, end, label=label, alignment_mode="expand")

                if span is not None:
                    # Span aligned successfully
                    if span.start_char == start and span.end_char == end:
                        # Perfect match
                        validated_ents.append([start, end, label])
                        stats["perfect_match"] += 1
                    else:
                        # Adjusted but acceptable
                        validated_ents.append([span.start_char, span.end_char, label])
                        stats["fixed_spans"] += 1
                else:
                    # Could not align - record is bad
                    record_is_salvageable = False
                    break

            if record_is_salvageable and validated_ents:
                record['annotations'] = validated_ents
                salvaged.append(record)
                stats["saved"] += 1
            else:
                stats["dropped_records"] += 1

    # Save cleaned data
    with open(output_path, 'w', encoding='utf-8') as f:
        for record in salvaged:
            f.write(json.dumps(record) + '\n')

    print(f"\n✅ Real inscription cleaning complete")
    print(f"   Total records: {stats['total']}")
    print(f"   Flagged as error: {stats['has_error_flag']}")
    print(f"   No transcription: {stats['no_transcription']}")
    print(f"   No annotations: {stats['no_annotations']}")
    print(f"   Perfect annotations: {stats['perfect_match']}")
    print(f"   Fixed spans: {stats['fixed_spans']}")
    print(f"   Dropped (unrecoverable): {stats['dropped_records']}")
    print(f"   ✅ SAVED: {stats['saved']} clean records")
    print(f"   Saved to: {output_path}")

    return stats

# Run it
clean_real_inscriptions('assets/to-clean.jsonl', 'assets/real_inscriptions_clean.jsonl')


✅ Real inscription cleaning complete
   Total records: 1100
   Flagged as error: 221
   No transcription: 0
   No annotations: 9
   Perfect annotations: 2111
   Fixed spans: 7932
   Dropped (unrecoverable): 175
   ✅ SAVED: 695 clean records
   Saved to: assets/real_inscriptions_clean.jsonl


{'total': 1100,
 'has_error_flag': 221,
 'no_transcription': 0,
 'no_annotations': 9,
 'perfect_match': 2111,
 'fixed_spans': 7932,
 'dropped_records': 175,
 'saved': 695}
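
With 7932 spans adjusted against 2111 perfect ones, it is worth eyeballing what the adjusted offsets actually cover before trusting them. A minimal sketch for that spot check, assuming records shaped like the ones written above (`text` plus `[start, end, label]` triples); `span_texts` and the example record are hypothetical, not part of the pipeline:

```python
def span_texts(record):
    """Return (label, covered_text) pairs for one record's annotations."""
    text = record["text"]
    return [(label, text[start:end]) for start, end, label in record["annotations"]]


# Hypothetical record in the same shape the cleaning step writes out
example = {
    "text": "D M Lucio Ocratio Corrintho",
    "annotations": [
        [0, 3, "DEDICATORY_FORMULA"],
        [4, 9, "PRAENOMEN"],
        [10, 17, "NOMEN"],
    ],
}

for label, covered in span_texts(example):
    print(f"{label:<20} {covered!r}")
```

To run it over the cleaned file, read `assets/real_inscriptions_clean.jsonl` line by line with `json.loads` and pass each record to `span_texts`.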

## make some synthetic

In [4]:
#!/usr/bin/env python3
"""
Generate synthetic Latin inscriptions with:
- Name pools taken from the most frequent names in the real data (picked uniformly)
- Focused label set: TRIBE, PRAENOMEN, NOMEN, COGNOMEN, EPITHET, DEDICATORY_FORMULA, ORIGIN
"""

import json
import random
from collections import Counter


class FrequencyExtractor:
    """Extract frequency distributions from real inscription data"""

    @staticmethod
    def extract_frequencies(input_path):
        """Extract praenomen, nomen, cognomen frequencies from real data"""
        praenomen_freq = Counter()
        nomen_freq = Counter()
        cognomen_freq = Counter()
        tribe_freq = Counter()
        origin_freq = Counter()

        with open(input_path) as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue

                text = record.get('text', '')
                annotations = record.get('annotations', [])

                if not text or not isinstance(annotations, list):
                    continue

                for entity in annotations:
                    if isinstance(entity, list) and len(entity) == 3:
                        start, end, label = entity
                    else:
                        continue

                    if start < 0 or end > len(text):
                        continue

                    word = text[start:end].strip()

                    if label == 'PRAENOMEN':
                        praenomen_freq[word] += 1
                    elif label == 'NOMEN':
                        nomen_freq[word] += 1
                    elif label == 'COGNOMEN':
                        cognomen_freq[word] += 1
                    elif label == 'TRIBE':
                        tribe_freq[word] += 1
                    elif label == 'ORIGIN':
                        origin_freq[word] += 1

        return praenomen_freq, nomen_freq, cognomen_freq, tribe_freq, origin_freq

    @staticmethod
    def get_weighted_list(freq_counter, top_n=30):
        """Return the top-N most frequent items as a plain list.

        Note: the counts themselves are discarded here, so a later
        random.choice() picks uniformly among these items rather than
        in proportion to their frequency.
        """
        return [word for word, _ in freq_counter.most_common(top_n)]


class FocusedInscriptionGenerator:
    """Generate inscriptions focused on 7 core labels"""

    def __init__(self, praenomina, nomina, cognomina, tribes, origins):
        self.PRAENOMINA = praenomina or ['M', 'L', 'C', 'T', 'Q']
        self.NOMINA = nomina or ['IVLIVS', 'CLAVDIVS', 'FLAVIVS']
        self.COGNOMINA = cognomina or ['VICTOR', 'FELIX', 'MAXIMVS']
        self.TRIBES = tribes or ['GAL', 'POL', 'VEL']
        self.ORIGINS = origins or ['Roma', 'Italia', 'Hispania', 'Gallia', 'Britannia']

        self.RELATIONSHIPS = ['CONIVGI', 'FILIO', 'FILIAE', 'PATRI', 'FRATRI', 'SORORI']
        self.EPITHETS = ['BENE MERENTI', 'PIISSIMO', 'DVLCISSIMO', 'CARISSIMO', 'OPTIMO']

        self.DATIVE_MAP = {
            'IVLIVS': 'IVLIO', 'CLAVDIVS': 'CLAVDIO', 'FLAVIVS': 'FLAVIO',
            'VICTOR': 'VICTORI', 'FELIX': 'FELICI', 'MAXIMVS': 'MAXIMO',
        }

    @staticmethod
    def _to_roman(num):
        """Convert integer to Roman numeral."""
        roman_map = [(100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'), (10, 'X'),
                     (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]
        result = ''
        for val, sym in roman_map:
            count = num // val
            result += sym * count
            num -= val * count
        return result

    def _to_dative(self, latin_word):
        """Convert a nominative name to dative; names not in DATIVE_MAP pass through unchanged."""
        return self.DATIVE_MAP.get(latin_word, latin_word)

    def generate_inscriptions(self, count=1000):
        """Generate focused inscriptions"""
        data = []

        for i in range(count):
            spans = []
            text_parts = []

            # Optional: Dedicatory formula header (D M = Dis Manibus) - 70% chance
            if random.random() < 0.7:
                start = 0
                text_parts.append("D M")
                spans.append([start, 3, "DEDICATORY_FORMULA"])

            # Deceased person - Praenomen + Nomen + Cognomen (all in dative)
            prae_dec = random.choice(self.PRAENOMINA)
            nomen_dec = random.choice(self.NOMINA)
            cog_dec = random.choice(self.COGNOMINA)

            # Mark praenomen (dative)
            start = len(" ".join(text_parts)) + 1 if text_parts else 0
            text_parts.append(prae_dec)
            spans.append([start, start + len(prae_dec), "PRAENOMEN"])

            # Mark nomen (dative)
            nomen_dat = self._to_dative(nomen_dec)
            start = len(" ".join(text_parts)) + 1
            text_parts.append(nomen_dat)
            spans.append([start, start + len(nomen_dat), "NOMEN"])

            # Optional tribe (30% chance)
            if random.random() < 0.3:
                tribe = random.choice(self.TRIBES)
                start = len(" ".join(text_parts)) + 1
                text_parts.append(tribe)
                spans.append([start, start + len(tribe), "TRIBE"])

            # Mark cognomen (dative) - 90% include it
            if random.random() < 0.9:
                cog_dat = self._to_dative(cog_dec)
                start = len(" ".join(text_parts)) + 1
                text_parts.append(cog_dat)
                spans.append([start, start + len(cog_dat), "COGNOMEN"])

            # Optional age at death (60% chance)
            if random.random() < 0.6:
                years = random.randint(15, 90)
                roman_num = self._to_roman(years)
                age_string = f"vixit ann(is) {roman_num}"
                start = len(" ".join(text_parts)) + 1
                text_parts.append(age_string)
                # Age is not labeled - just part of text for context

            # Optional origin (40% chance), appended after the deceased's details
            if random.random() < 0.4:
                origin = random.choice(self.ORIGINS)
                start = len(" ".join(text_parts)) + 1
                text_parts.append(f"ex {origin}")
                # Skip the 3 characters of "ex " so only the place name is labeled
                spans.append([start + 3, start + 3 + len(origin), "ORIGIN"])

            # Optional dedicator (50% chance)
            if random.random() < 0.5:
                ded_prae = random.choice(self.PRAENOMINA)
                ded_nomen = random.choice(self.NOMINA)
                ded_cog = random.choice(self.COGNOMINA)

                # Add each component separately
                start = len(" ".join(text_parts)) + 1
                text_parts.append(ded_prae)
                spans.append([start, start + len(ded_prae), "PRAENOMEN"])

                start = len(" ".join(text_parts)) + 1
                text_parts.append(ded_nomen)
                spans.append([start, start + len(ded_nomen), "NOMEN"])

                start = len(" ".join(text_parts)) + 1
                text_parts.append(ded_cog)
                spans.append([start, start + len(ded_cog), "COGNOMEN"])

                # Optional epithet with dedicator (50% chance)
                if random.random() < 0.5:
                    epithet = random.choice(self.EPITHETS)
                    start = len(" ".join(text_parts)) + 1
                    text_parts.append(epithet)
                    spans.append([start, start + len(epithet), "EPITHET"])

                # Optional action verb (60% chance)
                if random.random() < 0.6:
                    actions = ['posuit', 'fecit', 'dedicavit']
                    action = random.choice(actions)
                    start = len(" ".join(text_parts)) + 1
                    text_parts.append(action)
                    # Action not labeled

            # Construct final text
            full_text = " ".join(text_parts)

            # Create entry
            entry = {
                "id": f"gen_{i+1:04d}",
                "text": full_text,
                "annotations": spans
            }
            data.append(json.dumps(entry))

        return data


def main():
    # Extract frequencies from real data
    print("Extracting frequencies from real inscriptions...")
    praenomina_freq, nomina_freq, cognomina_freq, tribe_freq, origin_freq = \
        FrequencyExtractor.extract_frequencies('assets/real_inscriptions_clean.jsonl')

    # Get weighted lists
    praenomina = FrequencyExtractor.get_weighted_list(praenomina_freq, top_n=20)
    nomina = FrequencyExtractor.get_weighted_list(nomina_freq, top_n=30)
    cognomina = FrequencyExtractor.get_weighted_list(cognomina_freq, top_n=40)
    tribes = FrequencyExtractor.get_weighted_list(tribe_freq, top_n=15)
    origins = FrequencyExtractor.get_weighted_list(origin_freq, top_n=20)

    print(f"✅ Extracted frequencies:")
    print(f"   Praenomina: {len(praenomina)} unique")
    print(f"   Nomina: {len(nomina)} unique")
    print(f"   Cognomina: {len(cognomina)} unique")
    print(f"   Tribes: {len(tribes)} unique")
    print(f"   Origins: {len(origins)} unique")

    # Generate synthetic data
    print("\nGenerating 3000 synthetic inscriptions with weighted selection...")
    gen = FocusedInscriptionGenerator(praenomina, nomina, cognomina, tribes, origins)
    inscriptions = gen.generate_inscriptions(3000)

    # Write to file
    with open('assets/synthetic_focused.jsonl', 'w') as f:
        for inscription in inscriptions:
            f.write(inscription + '\n')

    print(f"✅ Saved to assets/synthetic_focused.jsonl")

    # Show sample
    print("\nSample inscriptions:")
    with open('assets/synthetic_focused.jsonl') as f:
        for _ in range(3):
            sample = json.loads(f.readline())
            print(f"\n  Text: {sample['text']}")
            print(f"  Labels: {[ann[2] for ann in sample['annotations']]}")


if __name__ == "__main__":
    main()

Extracting frequencies from real inscriptions...
✅ Extracted frequencies:
   Praenomina: 20 unique
   Nomina: 30 unique
   Cognomina: 40 unique
   Tribes: 15 unique
   Origins: 0 unique

Generating 3000 synthetic inscriptions with weighted selection...
✅ Saved to assets/synthetic_focused.jsonl

Sample inscriptions:

  Text: Tito Cornelio Galeria Fortunati vixit ann(is) XXX Publi Claudiae Rufus DVLCISSIMO fecit
  Labels: ['PRAENOMEN', 'NOMEN', 'TRIBE', 'COGNOMEN', 'PRAENOMEN', 'NOMEN', 'COGNOMEN', 'EPITHET']

  Text: Tiberius Marcus Fortunati Aulus Caecilius filius
  Labels: ['PRAENOMEN', 'NOMEN', 'COGNOMEN', 'PRAENOMEN', 'NOMEN', 'COGNOMEN']

  Text: D M Luci Lucio Arnensi Secundo ex Italia Tiberius Attius nini Pii
  Labels: ['DEDICATORY_FORMULA', 'PRAENOMEN', 'NOMEN', 'TRIBE', 'COGNOMEN', 'ORIGIN', 'PRAENOMEN', 'NOMEN', 'COGNOMEN']
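
The generator above picks names with `random.choice`, which treats every name in the top-N list as equally likely. If the intent is for common names to turn up more often, one option is `random.choices` with the counts as weights. A minimal sketch, assuming a `Counter` like the ones `extract_frequencies` returns; `weighted_sampler` and the counts below are hypothetical:

```python
import random
from collections import Counter


def weighted_sampler(freq_counter, top_n=30, rng=random):
    """Build a zero-argument function that draws names in proportion to their counts."""
    common = freq_counter.most_common(top_n)
    if not common:
        raise ValueError("frequency counter is empty")
    words = [word for word, _ in common]
    weights = [count for _, count in common]
    # random.choices returns a list of k items; we only want one
    return lambda: rng.choices(words, weights=weights, k=1)[0]


# Hypothetical counts, just to show the skew
counts = Counter({"Lucius": 50, "Marcus": 30, "Aulus": 1})
draw = weighted_sampler(counts, rng=random.Random(0))
sample = Counter(draw() for _ in range(1000))
print(sample.most_common())
```

Whether frequency-weighted names help the model or just make it memorise the commonest ones is an open question; uniform sampling over the top N gives the rarer names more exposure.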


## combine real & synthetic

In [5]:
#Step 2 harmonize real & synthetic
#!/usr/bin/env python3
"""
Harmonize all data to focused 7-label schema:
- PRAENOMEN
- NOMEN
- COGNOMEN
- TRIBE
- EPITHET
- DEDICATORY_FORMULA
- ORIGIN
"""

import json
from sklearn.model_selection import train_test_split


REAL_LABEL_MAPPING = {
    # Direct mappings
    "PRAENOMEN": "PRAENOMEN",
    "NOMEN": "NOMEN",
    "COGNOMEN": "COGNOMEN",
    "TRIBE": "TRIBE",
    "EPITHET": "EPITHET",
    "ORIGIN": "ORIGIN",

    # Consolidate formulas
    "FUNERARY_FORMULA": "DEDICATORY_FORMULA",
    "DEDICATION_TO_THE_GODS": "DEDICATORY_FORMULA",
    "BENE_MERENTI": "EPITHET",

    # Drop these labels (too noisy/rare)
    "FILIATION": None,
    "RELATIONSHIP": None,
    "AGE_YEARS": None,
    "AGE_DAYS": None,
    "AGE_MONTHS": None,
    "AGE_PREFIX": None,
    "OCCUPATION": None,
    "MILITARY_UNIT": None,
    "AGE": None,
    "ACTION": None,
    "VERB": None,
    "DECEASED_NAME": None,
}


def harmonize_real_data(input_path, output_path):
    """Harmonize real inscriptions to 7-label schema"""
    harmonized = []
    dropped = 0

    with open(input_path) as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue

            text = record.get('text', '')
            if not text:
                continue

            annotations = record.get('annotations', [])
            if not isinstance(annotations, list):
                continue

            # Filter and remap annotations
            new_annotations = []
            for entity in annotations:
                if not isinstance(entity, list) or len(entity) != 3:
                    continue

                start, end, label = entity

                # Remap label
                new_label = REAL_LABEL_MAPPING.get(label)

                if new_label is None:
                    # Drop this label
                    dropped += 1
                    continue

                # Validate span
                if start < 0 or end > len(text) or start >= end:
                    continue

                new_annotations.append([start, end, new_label])

            # Only keep records with at least one annotation
            if new_annotations:
                record['annotations'] = new_annotations
                harmonized.append(record)

    # Write output
    with open(output_path, 'w') as f:
        for record in harmonized:
            f.write(json.dumps(record) + '\n')

    print(f"✅ Harmonized real inscriptions")
    print(f"   Records saved: {len(harmonized)}")
    print(f"   Annotations dropped: {dropped}")


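def combine_and_split(real_path, synthetic_path, train_out, dev_out, real_weight=0.2):
    """Combine real + synthetic and split into train/dev.

    Note: real_weight is accepted but not used yet; the real/synthetic
    ratio is simply whatever the two input files contain.
    """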

    # Load data
    real_records = []
    with open(real_path) as f:
        real_records = [json.loads(line) for line in f]

    synthetic_records = []
    with open(synthetic_path) as f:
        synthetic_records = [json.loads(line) for line in f]

    # Combine
    combined = real_records + synthetic_records

    print(f"\n✅ Combined data:")
    print(f"   Real: {len(real_records)} ({len(real_records)/len(combined)*100:.1f}%)")
    print(f"   Synthetic: {len(synthetic_records)} ({len(synthetic_records)/len(combined)*100:.1f}%)")
    print(f"   Total: {len(combined)}")

    # Split
    train, dev = train_test_split(combined, test_size=0.2, random_state=42)

    # Write
    with open(train_out, 'w') as f:
        for record in train:
            f.write(json.dumps(record) + '\n')

    with open(dev_out, 'w') as f:
        for record in dev:
            f.write(json.dumps(record) + '\n')

    print(f"\n✅ Split into train/dev:")
    print(f"   Train: {len(train)}")
    print(f"   Dev: {len(dev)}")


if __name__ == "__main__":
    # Harmonize real data
    print("=== HARMONIZING REAL DATA ===")
    harmonize_real_data('assets/real_inscriptions_clean.jsonl',
                       'assets/real_focused.jsonl')

    # Combine and split
    print("\n=== COMBINING AND SPLITTING ===")
    combine_and_split('assets/real_focused.jsonl',
                     'assets/synthetic_focused.jsonl',
                     'assets/train_focused.jsonl',
                     'assets/dev_focused.jsonl',
                     real_weight=0.2)

    print("\n✅ Ready for alignment and training")

=== HARMONIZING REAL DATA ===
✅ Harmonized real inscriptions
   Records saved: 627
   Annotations dropped: 4142

=== COMBINING AND SPLITTING ===

✅ Combined data:
   Real: 627 (17.3%)
   Synthetic: 3000 (82.7%)
   Total: 3627

✅ Split into train/dev:
   Train: 2901
   Dev: 726

✅ Ready for alignment and training
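
The split above shuffles real and synthetic records together, so the share of real inscriptions in dev is left to chance. A minimal sketch of an alternative that splits each source separately and keeps the same proportion on both sides, using only the standard library instead of `train_test_split`; `stratified_split` and the `source` field are hypothetical additions:

```python
import random


def stratified_split(real, synthetic, dev_fraction=0.2, seed=42):
    """Split each source separately so train and dev keep the same real/synthetic mix."""
    rng = random.Random(seed)
    train, dev = [], []
    for source, records in (("real", real), ("synthetic", synthetic)):
        shuffled = list(records)
        rng.shuffle(shuffled)
        n_dev = round(len(shuffled) * dev_fraction)
        for i, record in enumerate(shuffled):
            # "source" is a hypothetical provenance field, handy for later analysis
            tagged = {**record, "source": source}
            (dev if i < n_dev else train).append(tagged)
    rng.shuffle(train)
    rng.shuffle(dev)
    return train, dev


# Stand-in records with the same counts as the run above
real = [{"id": f"real_{i}"} for i in range(627)]
synthetic = [{"id": f"gen_{i}"} for i in range(3000)]
train, dev = stratified_split(real, synthetic)
print(len(train), len(dev))
print(sum(r["source"] == "real" for r in dev), "real records in dev")
```

The extra `source` key is ignored by the later steps, which only read `text` and `annotations`. It also makes it possible to score the model on the real dev records alone, which is arguably the number that matters.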


In [6]:
# STEP 3: Align annotation offsets to token boundaries
import spacy
import json

def align_annotations(input_path, output_path):
    nlp = spacy.load('la_core_web_lg')
    corrected_records = []

    stats = {"total": 0, "perfect": 0, "fixed": 0, "malformed": 0, "dropped": 0}

    with open(input_path, 'r') as f:
        for line in f:
            try:
                record = json.loads(line)
                stats["total"] += 1
            except json.JSONDecodeError:
                continue

            text = record.get('text', '')
            if not text:
                continue

            annotations = record.get('annotations', [])
            doc = nlp.make_doc(text)
            corrected_ents = []

            for entity in annotations:
                if isinstance(entity, list) and len(entity) == 3:
                    start, end, label = entity
                else:
                    stats["malformed"] += 1
                    continue

                try:
                    start, end = int(start), int(end)
                except (ValueError, TypeError):
                    stats["malformed"] += 1
                    continue

                if start < 0 or end > len(text) or start >= end:
                    stats["malformed"] += 1
                    continue

                span = doc.char_span(start, end, label=label, alignment_mode="expand")

                if span is not None:
                    if span.start_char == start and span.end_char == end:
                        stats["perfect"] += 1
                    else:
                        stats["fixed"] += 1
                    corrected_ents.append([span.start_char, span.end_char, label])
                else:
                    stats["dropped"] += 1

            record['annotations'] = corrected_ents
            corrected_records.append(record)

    with open(output_path, 'w') as f:
        for record in corrected_records:
            f.write(json.dumps(record) + '\n')

    print(f"\n✅ Aligned {input_path}")
    print(f"   Perfect: {stats['perfect']}, Fixed: {stats['fixed']}, Dropped: {stats['dropped']}")

# Align both
align_annotations('assets/train_focused.jsonl', 'assets/train_focused_aligned.jsonl')
align_annotations('assets/dev_focused.jsonl', 'assets/dev_focused_aligned.jsonl')




✅ Aligned assets/train_focused.jsonl
   Perfect: 17280, Fixed: 120, Dropped: 21

✅ Aligned assets/dev_focused.jsonl
   Perfect: 4447, Fixed: 32, Dropped: 3
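
What "Fixed" means here: with `alignment_mode="expand"`, a character range that starts or ends mid-token is widened to cover every token it touches. A rough standard-library sketch of that idea, assuming plain whitespace tokens (spaCy's tokenizer also splits punctuation and other cases, so the real boundaries can differ); `expand_to_tokens` is a hypothetical helper, not a spaCy function:

```python
import re


def expand_to_tokens(text, start, end):
    """Widen [start, end) to whitespace-token boundaries; None if it touches no token."""
    token_spans = [m.span() for m in re.finditer(r"\S+", text)]
    touched = [(s, e) for s, e in token_spans if s < end and e > start]
    if not touched:
        return None
    return touched[0][0], touched[-1][1]


text = "D M Lucio Ocratio Corrintho"
print(expand_to_tokens(text, 5, 8))   # offsets that fall inside "Lucio"
print(expand_to_tokens(text, 4, 9))   # offsets already on token boundaries
print(expand_to_tokens(text, 3, 4))   # a range covering only whitespace
```

A widened span is only as good as the original guess: if the LLM's offsets were shifted by a whole word, expanding them yields a clean span over the wrong token, so "Fixed" should be read as "aligned", not "verified".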


In [7]:
# STEP 4: Convert to .spacy format
from spacy.tokens import DocBin
from spacy.util import filter_spans

def create_spacy_file(input_path, output_path, model='la_core_web_lg'):
    nlp = spacy.load(model)
    db = DocBin()

    with open(input_path) as f:
        for line in f:
            record = json.loads(line)
            text = record.get('text', '')
            if not text:
                continue

            doc = nlp.make_doc(text)
            ents = []

            for start, end, label in record.get('annotations', []):
                span = doc.char_span(start, end, label=label, alignment_mode="expand")
                if span is not None:
                    ents.append(span)

            doc.ents = filter_spans(ents)
            db.add(doc)

    db.to_disk(output_path)
    print(f"✅ Created {output_path}")

create_spacy_file('assets/train_focused_aligned.jsonl', 'corpus/train.spacy')
create_spacy_file('assets/dev_focused_aligned.jsonl', 'corpus/dev.spacy')

print("\n✅ Ready for training!")

✅ Created corpus/train.spacy
✅ Created corpus/dev.spacy

✅ Ready for training!
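
`doc.ents` cannot hold overlapping entities, which is why the cell above passes the spans through `filter_spans` first. spaCy documents that function as preferring the (first) longest span when spans overlap. A minimal sketch of that rule on plain `(start, end, label)` tuples, as an approximation for illustration only (the real function works on `Span` objects and token indices); `drop_overlaps` is a hypothetical name:

```python
def drop_overlaps(spans):
    """Keep the longest spans first; skip any span that overlaps one already kept."""
    # Longest first; among equal lengths, the earlier start wins
    by_length = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    for start, end, label in by_length:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# "Dis Manibus Lucio": a bad NOMEN annotation overlaps the formula
spans = [
    (0, 11, "DEDICATORY_FORMULA"),
    (4, 11, "NOMEN"),
    (12, 17, "PRAENOMEN"),
]
print(drop_overlaps(spans))
```

Note that the longest span is not necessarily the correct one, so records where spans were silently dropped at this stage may deserve a second look.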


In [8]:
# STEP 5: Check sample output
with open('assets/synthetic_focused.jsonl') as f:
    samples = [json.loads(f.readline()) for _ in range(3)]
    print("\nSample synthetic inscriptions:")
    for sample in samples:
        print(f"\n  {sample['text']}")
        labels = [ann[2] for ann in sample['annotations']]
        print(f"  Labels: {set(labels)}")


Sample synthetic inscriptions:

  Tito Cornelio Galeria Fortunati vixit ann(is) XXX Publi Claudiae Rufus DVLCISSIMO fecit
  Labels: {'COGNOMEN', 'TRIBE', 'EPITHET', 'PRAENOMEN', 'NOMEN'}

  Tiberius Marcus Fortunati Aulus Caecilius filius
  Labels: {'COGNOMEN', 'NOMEN', 'PRAENOMEN'}

  D M Luci Lucio Arnensi Secundo ex Italia Tiberius Attius nini Pii
  Labels: {'TRIBE', 'COGNOMEN', 'PRAENOMEN', 'ORIGIN', 'DEDICATORY_FORMULA', 'NOMEN'}
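
Three samples do not say much about how often each label occurs overall. A minimal sketch that tallies labels across a whole JSONL file, assuming the same record shape as above; `count_labels` and the sample lines are hypothetical:

```python
import json
from collections import Counter


def count_labels(lines):
    """Tally annotation labels across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts.update(label for _, _, label in record.get("annotations", []))
    return counts


# Hypothetical lines standing in for a real file
sample_lines = [
    json.dumps({"text": "D M Lucio", "annotations": [[0, 3, "DEDICATORY_FORMULA"], [4, 9, "PRAENOMEN"]]}),
    json.dumps({"text": "Marco Iulio", "annotations": [[0, 5, "PRAENOMEN"], [6, 11, "NOMEN"]]}),
]
counts = count_labels(sample_lines)
for label, n in counts.most_common():
    print(f"{label:<20} {n}")
```

Since an open file is an iterable of lines, `count_labels(open('assets/train_focused_aligned.jsonl'))` should work on the training data directly. Labels with few examples are the first place to look when a per-label score turns out low.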


In [27]:
import spacy
from pathlib import Path

# --- 1. Generate the base config ---
!python -m spacy init config configs/config.cfg --lang la --pipeline tok2vec,ner --optimize accuracy --force

print("✅ Base 'config.cfg' generated.")

# --- 2. Load and Modify ---
config_path = Path("configs/config.cfg")
config = spacy.util.load_config(config_path)

# Define the model we are using
LATIN_MODEL = "la_core_web_lg"

# --- Part A: Initialize Vectors (CRITICAL FOR LG MODELS) ---
# This loads the 300-dim vectors into the vocab so the tok2vec layer can find them.
config["initialize"]["vectors"] = LATIN_MODEL

# --- Part B: Source the tok2vec component ---
config["components"]["tok2vec"] = {
    "source": LATIN_MODEL,
    "component": "tok2vec"
}

# --- Part C: Connect NER to the shared tok2vec ---
# The listener width must match the tok2vec OUTPUT width (96 for la_core_web_lg),
# not the 300-dim input vectors.
config["components"]["ner"]["model"]["tok2vec"] = {
    "@architectures": "spacy.Tok2VecListener.v1",
    "width": 96,
    "upstream": "tok2vec"
}

config["nlp"]["batch_size"] = 200

# --- Part D: Paths and Freezing ---
config["paths"]["train"] = "./corpus/train.spacy"
config["paths"]["dev"] = "./corpus/dev.spacy"

# Option 1: freeze tok2vec so training doesn't disturb the pretrained Latin representations
#config["training"]["frozen_components"] = ["tok2vec"]
# Option 2 (used here): leave it unfrozen and see what happens
config["training"]["frozen_components"] = []
#config["training"]["max_epochs"]= 100
config["training"]["max_epochs"] = 5
# List tok2vec as an annotating component so it sets its annotations during updates
config["training"]["annotating_components"] = ["tok2vec"]

# --- 3. Save ---
config.to_disk(config_path)

print(f"✅ Config updated for {LATIN_MODEL}. Listener width set to 96 (correct output dim).")

ℹ Generated config template specific for your use case
- Language: la
- Pipeline: ner
- Optimize for: accuracy
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
configs/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
✅ Base 'config.cfg' generated.
✅ Config updated for la_core_web_lg. Listener width set to 96 (correct output dim).


In [28]:
# Start the training process!
!python -m spacy train configs/config.cfg --output ./training/ #--gpu-id 0

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Set annotations on update for: ['tok2vec']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     69.33   12.19   15.28   10.14    0.12
  0     200        205.46   4614.68   88.87   93.87   84.37    0.89
  1     400        216.32   2682.71   88.79   92.78   85.14    0.89
  2     600        297.30   2872.65   88.36   91.27   85.63    0.88
  3     800        384.43   2750.82   88.53   90.74   86.42    0.89
✔ Saved pipeline to output directory
training/model-last


In [29]:
import spacy
from spacy.scorer import Scorer
from spacy.training import Example
from spacy.tokens import DocBin

def evaluate_final(model_path, dev_data_path):
    print(f"--- Evaluating {model_path} ---")
    nlp = spacy.load(model_path)
    db = DocBin().from_disk(dev_data_path)
    docs = list(db.get_docs(nlp.vocab))

    examples = []
    for doc in docs:
        examples.append(Example(nlp(doc.text), doc))

    scores = Scorer().score(examples)

    print(f"{'LABEL':<30} {'PREC':<8} {'REC':<8} {'F1':<8}")
    print("-" * 60)
    for label, metrics in scores['ents_per_type'].items():
        print(f"{label:<30} {metrics['p']:.2f}     {metrics['r']:.2f}     {metrics['f']:.2f}")

evaluate_final("training/model-best", "corpus/dev.spacy")

--- Evaluating training/model-best ---
LABEL                          PREC     REC      F1      
------------------------------------------------------------
PRAENOMEN                      0.95     0.97     0.96
COGNOMEN                       0.93     0.87     0.90
NOMEN                          0.94     0.90     0.92
DEDICATORY_FORMULA             0.92     0.56     0.70
ORIGIN                         1.00     1.00     1.00
EPITHET                        0.99     0.74     0.85
TRIBE                          0.84     0.89     0.87
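
The F1 column is the harmonic mean of precision and recall, which is why DEDICATORY_FORMULA ends up at 0.70 despite 0.92 precision: the low recall drags it down. A quick sketch to check a few rows; the inputs are the rounded values printed above, so a recomputed value could differ from the reported one in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
reported = {
    "DEDICATORY_FORMULA": (0.92, 0.56),
    "EPITHET": (0.99, 0.74),
    "COGNOMEN": (0.93, 0.87),
}

for label, (p, r) in reported.items():
    print(f"{label:<20} F1 = {f1(p, r):.2f}")
```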


In [30]:
import spacy

# The best model should be saved as model-best during training
# Load it
nlp = spacy.load("training/model-best")

# Test on real inscriptions from your dataset
test_cases = [
    "Dis Manibus Lucio Ocratio Corrintho vixit annos XXX dies XI Ocratia Silvana filio piissimo bene merenti fecit",
    "Caius Pompeius Caius libertus librarius",
    "Dis Manibus sacrum Fortunatus municipum municipii Ipolcobulensiorum servus annorum XXXXIII pius",
    "Dis Manibus Spediae Luci filiae Severae coniugi Luci Valeri Montani Quinti fili primi pili legionis XIII",
]

print("Testing on real inscriptions:\n")
for text in test_cases:
    doc = nlp(text)
    print(f"Text: {text[:70]}...")
    print("Entities predicted:")
    for ent in doc.ents:
        print(f"  {ent.text:25} → {ent.label_:15}")
    print()

Testing on real inscriptions:

Text: Dis Manibus Lucio Ocratio Corrintho vixit annos XXX dies XI Ocratia Si...
Entities predicted:
  Dis Manibus               → DEDICATORY_FORMULA
  Lucio                     → NOMEN          
  Ocratio                   → NOMEN          

Text: Caius Pompeius Caius libertus librarius...
Entities predicted:
  Caius                     → PRAENOMEN      
  Pompeius                  → NOMEN          

Text: Dis Manibus sacrum Fortunatus municipum municipii Ipolcobulensiorum se...
Entities predicted:
  Dis Manibus               → DEDICATORY_FORMULA

Text: Dis Manibus Spediae Luci filiae Severae coniugi Luci Valeri Montani Qu...
Entities predicted:
  Dis Manibus               → DEDICATORY_FORMULA
  Spediae                   → NOMEN          



## Real Data Time

Loading in data from P.

In [34]:
# CELL 3: Process Leiden conventions
import csv
import re
import json
from collections import Counter

class LeidenProcessor:
    """Convert Leiden conventions to clean transcription"""

    # Fallback expansions, consulted only when the parentheses are empty (e.g. "Q()");
    # otherwise the expansion given in the Leiden text is used as-is.
    # Capitalize proper nouns, keep others lowercase
    ABBREV_PROPER = {  # Names (capitalize)
        'Q': 'Quintus', 'C': 'Caius', 'M': 'Marcus', 'L': 'Lucius',
        'T': 'Titus', 'P': 'Publius', 'D': 'Dis', 'A': 'Aulus',
        'Cn': 'Gnaeus', 'Ti': 'Tiberius', 'S': 'Sextus', 'N': 'Numerius',
    }

    ABBREV_COMMON = {  # Common words (lowercase)
        'a': 'animo', 'l': 'libens', 'v': 'votum', 'p': 'posuit',
        's': 'sacrum', 'f': 'fecit', 'm': 'mensis', 'an': 'anno',
        'ann': 'annorum', 'h': 'hic', 'e': 'est', 'pos': 'posuit',
        't': 'tibi', 'd': 'de', 'sit': 'sit'
    }

    @staticmethod
    def process(leiden_text):
        """Full pipeline: Leiden → clean transcription"""

        # Step 1: Remove damage markers [3] (n unknown letters)
        text = re.sub(r'\[\d+\]', '', leiden_text)

        # Step 2: Remove question marks and uncertain markers
        text = re.sub(r'\?', '', text)
        text = re.sub(r'\[([^\]]*)\]', r'\1', text)  # [text] → text

        # Step 3: Join words broken across lines intelligently
        # Handle patterns like "Gem/ellian" or "ann]or/um"
        # Remove line breaks only when joining word fragments
        text = re.sub(r'([a-z])/([a-z])', r'\1\2', text, flags=re.IGNORECASE)
        text = re.sub(r'(\])/([a-z])', r'\1\2', text, flags=re.IGNORECASE)

        # Step 4: Expand abbreviations with proper case handling
        def expand_abbrev(match):
            abbrev = match.group(1)
            expansion = match.group(2) if match.group(2) else ""

            # If expansion provided in parentheses, use it
            if expansion:
                # Keep expansion as-is, preserve case
                return abbrev + expansion

            # Try proper noun abbreviations first
            if abbrev in LeidenProcessor.ABBREV_PROPER:
                return LeidenProcessor.ABBREV_PROPER[abbrev]

            # Try common abbreviations
            if abbrev.lower() in LeidenProcessor.ABBREV_COMMON:
                return LeidenProcessor.ABBREV_COMMON[abbrev.lower()]

            # Return original if not found
            return abbrev

        # Pattern: X(expansion) captures abbreviation and optional expansion text
        text = re.sub(r'([A-Za-z]+)\(([^)]*)\)', expand_abbrev, text)

        # Step 5: Clean line break markers and multiple spaces
        text = text.replace('/', ' ')
        text = text.replace('\\', ' ')
        text = re.sub(r'\s+', ' ', text).strip()

        # Step 6: Remove remaining brackets and junk
        text = re.sub(r'[\[\]]', '', text)

        # Step 7: Normalize capitalization
        # Note: this title-cases every word except the short function words below,
        # so expanded common words ("annorum" -> "Annorum") end up capitalized like names
        LOWERCASE_WORDS = {'et', 'de', 'a', 'in', 'ex', 'ab'}

        words = text.split()
        result = []

        for i, word in enumerate(words):
            if i == 0:  # First word always capitalized
                result.append(word.capitalize())
            elif word.lower() in LOWERCASE_WORDS:
                result.append(word.lower())
            elif word[0].isupper():  # Already capitalized (likely a name)
                result.append(word)
            else:
                result.append(word.capitalize())

        return ' '.join(result)


# Test it
test_cases = [
    "Iovi / Optimo / Maximo / Q(uintus) Cassius / Cassianus / a(nimo) l(ibens) [v(otum) p(osuit)]",
    "D(is) M(anibus) s(acrum) / Severus / Tongini / an(norum) XXI / h(ic) s(itus) e(st)",
    "Fig(lina) Gem/ellian[a]",
    "C[3] / N[3] / Pap(iria?) [ann]or/um L h(ic) s(itus) e(st)",
    "Saturn/inus Bo/uti f(ilius) an(norum)",
]

print("Updated Leiden processor:\n")
for leiden in test_cases:
    transcribed = LeidenProcessor.process(leiden)
    print(f"Leiden:        {leiden}")
    print(f"Transcription: {transcribed}\n")

Updated Leiden processor:

Leiden:        Iovi / Optimo / Maximo / Q(uintus) Cassius / Cassianus / a(nimo) l(ibens) [v(otum) p(osuit)]
Transcription: Iovi Optimo Maximo Quintus Cassius Cassianus Animo Libens Votum Posuit

Leiden:        D(is) M(anibus) s(acrum) / Severus / Tongini / an(norum) XXI / h(ic) s(itus) e(st)
Transcription: Dis Manibus Sacrum Severus Tongini Annorum XXI Hic Situs Est

Leiden:        Fig(lina) Gem/ellian[a]
Transcription: Figlina Gemelliana

Leiden:        C[3] / N[3] / Pap(iria?) [ann]or/um L h(ic) s(itus) e(st)
Transcription: C N Papiria Annorum L Hic Situs Est

Leiden:        Saturn/inus Bo/uti f(ilius) an(norum)
Transcription: Saturninus Bouti Filius Annorum
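
Note that the processor title-cases almost everything, so `an(norum)` comes out as `Annorum` and looks like a name. Here is a minimal sketch of a variant that skips the capitalization step and keeps whatever case the Leiden text had. `leiden_to_text_keep_case` is a hypothetical helper, not part of `LeidenProcessor`, and whether it helps depends on how the training data was cased, which I am assuming rather than checking here.

```python
import re


def leiden_to_text_keep_case(leiden_text):
    """Hypothetical variant of LeidenProcessor.process without the title-casing step."""
    text = re.sub(r'\[\d+\]', '', leiden_text)               # drop [3]-style gap markers
    text = text.replace('?', '')                             # drop uncertainty marks
    text = re.sub(r'\[([^\]]*)\]', r'\1', text)              # [restored] -> restored
    text = re.sub(r'([A-Za-z])/([A-Za-z])', r'\1\2', text)   # rejoin words split across lines
    text = re.sub(r'([A-Za-z]+)\(([^)]*)\)', r'\1\2', text)  # Q(uintus) -> Quintus
    text = text.replace('/', ' ')                            # remaining line breaks -> spaces
    return re.sub(r'\s+', ' ', text).strip()


for leiden in [
    "D(is) M(anibus) s(acrum) / Severus / Tongini / an(norum) XXI / h(ic) s(itus) e(st)",
    "C[3] / N[3] / Pap(iria?) [ann]or/um L h(ic) s(itus) e(st)",
]:
    print(leiden_to_text_keep_case(leiden))
```

With this variant only words that were capitalized in the source stay capitalized, so the formulae (`hic situs est`, `annorum`) stay lowercase.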



In [36]:
# CELL 4: Load your CSV and process it
import pandas as pd

# Read CSV (adjust filename to match your upload)
df = pd.read_csv('assets/actual_inscriptions.csv')  # or whatever your file is named

print(f"Loaded {len(df)} inscriptions\n")

# Process all inscriptions
records = []
for idx, row in df.iterrows():
    leiden_text = row['text']

    # Skip empty cells; pandas reads them as NaN, which is truthy
    if pd.isna(leiden_text) or not str(leiden_text).strip():
        continue

    transcription = LeidenProcessor.process(leiden_text)

    record = {
        "id": f"actual_{idx}",
        "text": transcription,
        "leiden_source": leiden_text,
        "annotations": []
    }
    records.append(record)

print(f"Processed {len(records)} inscriptions")

# Show samples
print("\nFirst 3 samples:")
for rec in records[:3]:
    print(f"\n  Leiden:        {rec['leiden_source']}")
    print(f"  Transcription: {rec['text']}")

Loaded 138 inscriptions

Processed 138 inscriptions

First 3 samples:

  Leiden:        Iovi / Optimo / Maximo / Q(uintus) Cassius / Cassianus / a(nimo) l(ibens) [v(otum) p(osuit)]
  Transcription: Iovi Optimo Maximo Quintus Cassius Cassianus Animo Libens Votum Posuit

  Leiden:        D(is) M(anibus) s(acrum) / Sycecale / v(ixit) an(n)o m(ensibus) [V] / soror[es] / Tricism[a] / Salcea / et Veget[a]
  Transcription: Dis Manibus Sacrum Sycecale Vixit Anno Mensibus V Sorores Tricisma Salcea et Vegeta

  Leiden:        Fig(lina) Gem/ellian[a]
  Transcription: Figlina Gemelliana


In [38]:
# CELL 5: Run model inference
import spacy

# Load model
try:
    nlp = spacy.load("training/model-best")
    print("✅ Loaded model")
except OSError:
    print("❌ Model not found at training/model-best")
    print("   Make sure the model folder exists and is in the correct location")
    raise  # stop here rather than predicting with a stale or undefined `nlp`

# Run predictions
predictions = []

for i, record in enumerate(records, 1):
    text = record['text']

    # Run model
    doc = nlp(text)

    # Extract entities
    entities = []
    for ent in doc.ents:
        entities.append({
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char
        })

    pred = {
        "id": record['id'],
        "leiden_source": record['leiden_source'],
        "text": text,
        "entities": entities
    }
    predictions.append(pred)

    if i % 100 == 0:
        print(f"  Processed {i}/{len(records)}...")

print(f"✅ Generated predictions for {len(predictions)} inscriptions")

✅ Loaded model
  Processed 100/138...
✅ Generated predictions for 138 inscriptions


In [39]:
# CELL 6: Display predictions
print("=" * 100)
print("PREDICTIONS")
print("=" * 100)

for i, pred in enumerate(predictions[:10], 1):
    print(f"\n[{i}] {pred['leiden_source']}")
    print(f"    Transcription: {pred['text']}")
    print(f"    Entities:")

    if pred['entities']:
        for ent in pred['entities']:
            print(f"      • {ent['text']:30} → {ent['label']}")
    else:
        print(f"      (none detected)")

PREDICTIONS

[1] Iovi / Optimo / Maximo / Q(uintus) Cassius / Cassianus / a(nimo) l(ibens) [v(otum) p(osuit)]
    Transcription: Iovi Optimo Maximo Quintus Cassius Cassianus Animo Libens Votum Posuit
    Entities:
      • Iovi                           → DEDICATORY_FORMULA
      • Quintus                        → PRAENOMEN
      • Cassius                        → NOMEN
      • Cassianus                      → COGNOMEN

[2] D(is) M(anibus) s(acrum) / Sycecale / v(ixit) an(n)o m(ensibus) [V] / soror[es] / Tricism[a] / Salcea / et Veget[a]
    Transcription: Dis Manibus Sacrum Sycecale Vixit Anno Mensibus V Sorores Tricisma Salcea et Vegeta
    Entities:
      • Dis Manibus                    → DEDICATORY_FORMULA

[3] Fig(lina) Gem/ellian[a]
    Transcription: Figlina Gemelliana
    Entities:
      (none detected)

[4] D(is) M(anibus) s(acrum) / Severus / Tongini / an(norum) XXI / h(ic) s(itus) e(st) s(it) t(ibi) / t(erra) l(evis) m(ater)
    Transcription: Dis Manibus Sacrum Severus Tong

In [40]:
# CELL 7: Analyze statistics
from collections import Counter

label_counts = Counter()
entity_counts = Counter()

for pred in predictions:
    for ent in pred['entities']:
        label_counts[ent['label']] += 1
        entity_counts[ent['text']] += 1

print("\n" + "=" * 100)
print("STATISTICS")
print("=" * 100)

print(f"\nTotal inscriptions: {len(predictions)}")
print(f"Total entities detected: {sum(len(p['entities']) for p in predictions)}")

print(f"\nEntities by label:")
for label, count in sorted(label_counts.items(), key=lambda x: -x[1]):
    print(f"  {label:25} {count:5}")

print(f"\nMost common entities:")
for entity, count in entity_counts.most_common(10):
    print(f"  {entity:30} {count:5}")


STATISTICS

Total inscriptions: 138
Total entities detected: 221

Entities by label:
  COGNOMEN                     87
  DEDICATORY_FORMULA           63
  NOMEN                        34
  PRAENOMEN                    29
  TRIBE                         6
  EPITHET                       2

Most common entities:
  Dis Manibus                       21
  Annorum                           16
  Est                               10
  Situs                              9
  Hic                                8
  Filio                              6
  Galeria                            5
  Iovi                               4
  Caius                              4
  Filius                             4
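
Several of the most common "entities" above (Annorum, Est, Situs, Hic, Filio, Filius) are formulaic words, not names. A minimal sketch of a post-filter that sets those aside for review. The stoplist and the helper name are assumptions for illustration, and the label on `Annorum` in the toy sample is made up; the entity dicts follow the shape built in CELL 5.

```python
# Hypothetical stoplist: formulaic words that showed up as entities in the counts above
FORMULAIC_WORDS = {"annorum", "est", "situs", "hic", "filio", "filius"}


def split_suspect_entities(entities, stoplist=FORMULAIC_WORDS):
    """Separate entities whose text is a formulaic word from the rest."""
    kept, suspect = [], []
    for ent in entities:
        target = suspect if ent["text"].lower() in stoplist else kept
        target.append(ent)
    return kept, suspect


# Toy example in the same shape as the `entities` lists built in CELL 5
sample = [
    {"text": "Severus", "label": "COGNOMEN", "start": 19, "end": 26},
    {"text": "Annorum", "label": "COGNOMEN", "start": 35, "end": 42},  # label is illustrative
]

kept, suspect = split_suspect_entities(sample)
print("kept:   ", [e["text"] for e in kept])
print("suspect:", [e["text"] for e in suspect])
```

This only hides the symptom. If the false hits come from the capitalization in the Leiden processor, fixing it there would be the cleaner route.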


In [44]:
# CELL 8: Save predictions to CSV (wide format)
import pandas as pd

# Define all possible labels
LABELS = ['PRAENOMEN', 'NOMEN', 'COGNOMEN', 'TRIBE', 'EPITHET', 'DEDICATORY_FORMULA', 'ORIGIN']

# Flatten predictions into wide format
export_data = []
for pred in predictions:
    row = {
        'id': pred['id'],
        'leiden_source': pred['leiden_source'],
        'text': pred['text'],
    }

    # Add a column for each label with entities found
    for label in LABELS:
        entities_with_label = [e['text'] for e in pred['entities'] if e['label'] == label]
        row[label] = ' | '.join(entities_with_label) if entities_with_label else ''

    # Add total entity count
    row['entity_count'] = len(pred['entities'])

    export_data.append(row)

# Create dataframe
export_df = pd.DataFrame(export_data)

# Reorder columns: metadata first, then labels, then count
column_order = ['id', 'leiden_source', 'text'] + LABELS + ['entity_count']
export_df = export_df[column_order]

# Save
export_df.to_csv('predictions.csv', index=False)

print("✅ Saved predictions to predictions.csv\n")
print(f"Dimensions: {export_df.shape[0]} rows × {export_df.shape[1]} columns\n")
print(export_df.head(10).to_string())

# Download
from google.colab import files
files.download('predictions.csv')


✅ Saved predictions to predictions.csv

Dimensions: 138 rows × 11 columns

         id                                                                                                              leiden_source                                                                                     text PRAENOMEN    NOMEN         COGNOMEN TRIBE EPITHET DEDICATORY_FORMULA ORIGIN  entity_count
0  actual_0                               Iovi / Optimo / Maximo / Q(uintus) Cassius / Cassianus / a(nimo) l(ibens) [v(otum) p(osuit)]                   Iovi Optimo Maximo Quintus Cassius Cassianus Animo Libens Votum Posuit   Quintus  Cassius        Cassianus                             Iovi                    4
1  actual_1        D(is) M(anibus) s(acrum) / Sycecale / v(ixit) an(n)o m(ensibus) [V] / soror[es] / Tricism[a] / Salcea / et Veget[a]      Dis Manibus Sacrum Sycecale Vixit Anno Mensibus V Sorores Tricisma Salcea et Vegeta                                                          Dis Manibus     
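
The wide format joins entities with ` | ` and drops the character offsets. If the offsets matter later, a long format (one row per entity) keeps them. A minimal sketch using only the standard library; the function name and the sample record are assumptions for illustration, with entities in the shape built in CELL 5.

```python
import csv
import io

FIELDNAMES = ["id", "label", "text", "start", "end"]


def predictions_to_long_rows(predictions):
    """Yield one row per predicted entity, keeping the character offsets."""
    for pred in predictions:
        for ent in pred["entities"]:
            yield {
                "id": pred["id"],
                "label": ent["label"],
                "text": ent["text"],
                "start": ent["start"],
                "end": ent["end"],
            }


# Toy record in the same shape as the `predictions` list from CELL 5
sample_predictions = [{
    "id": "actual_0",
    "text": "Iovi Optimo Maximo Quintus Cassius Cassianus Animo Libens Votum Posuit",
    "entities": [
        {"text": "Quintus", "label": "PRAENOMEN", "start": 19, "end": 26},
        {"text": "Cassius", "label": "NOMEN", "start": 27, "end": 34},
    ],
}]

# Write to an in-memory buffer here; swap in open("predictions_long.csv", "w", newline="") for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(predictions_to_long_rows(sample_predictions))
print(buffer.getvalue())
```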


In [None]:
!zip -r inscription_model.zip training/model-best