# Latin Epigraphic Inscription Parser (latinepi) - Complete Workflow Demo

This notebook demonstrates the complete workflow for extracting structured personal data from Roman Latin epigraphic inscriptions using the `latinepi` tool with Latin BERT.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shawngraham/latinitas/blob/main/latinepi_demo.ipynb)

**Note on Google Colab Compatibility:** Google Colab currently uses Python 3.12, which is incompatible with Latin BERT (requires Python 3.8-3.10). The demo will work in stub mode without Latin BERT, or you can run it locally with Python 3.10 for full functionality.

## Features Demonstrated

1. **Installation** - Set up latinepi in Google Colab
2. **Latin BERT Setup** - Install and configure the Latin BERT model (requires Python 3.10)
3. **Basic Processing** - Extract entities from CSV and JSON files
4. **EDH Integration** - Download inscriptions from Epigraphic Database Heidelberg
5. **Bulk Search** - Search and download multiple inscriptions by criteria
6. **Confidence Filtering** - Apply thresholds and flag ambiguous entities
7. **Complete Pipeline** - Search → Download → Extract → Analyze
8. **Visualization** - Analyze and visualize the extracted data

## About the Tool

**latinepi** extracts prosopographical data from Latin inscriptions using Named Entity Recognition:
- Personal names (praenomen, nomen, cognomen)
- Social status markers
- Locations
- Ages, occupations, military service
- And more!

The tool can use **Latin BERT** (https://github.com/dbamman/latin-bert/), a transformer model trained on 642.7 million words of Latin text, for more accurate entity extraction. Without Latin BERT it falls back to stub mode, which is sufficient for demonstration purposes.

Repository: https://github.com/shawngraham/latinitas
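The switch between stub mode and Latin BERT can be inspected from Python. The sketch below is illustrative only: it reads the two environment variables that section 2a of this notebook sets (`LATINEPI_USE_STUB` and `LATIN_BERT_PATH`), while the helper name `describe_mode` and the assumption that stub mode is the default when `LATINEPI_USE_STUB` is unset are hypothetical and not taken from the latinepi source.

```python
import os


def describe_mode(env=None):
    """Summarize the extraction mode implied by the environment.

    Hypothetical helper: how latinepi itself interprets these
    variables is an assumption, not documented behavior.
    """
    env = os.environ if env is None else env
    # Assumption: anything other than the literal string "false" means stub mode.
    use_stub = env.get("LATINEPI_USE_STUB", "true").strip().lower() != "false"
    # None means no model directory has been configured.
    model_path = env.get("LATIN_BERT_PATH")
    return {"stub": use_stub, "model_path": model_path}


# Nothing configured: assumed to mean stub mode.
print(describe_mode({}))

# The values section 2a sets after a successful installation.
print(describe_mode({
    "LATINEPI_USE_STUB": "false",
    "LATIN_BERT_PATH": "/content/latin-bert/models/latin_bert/",
}))
```

Passing a plain dictionary instead of `os.environ` keeps the check side-effect free, so it can be run before or after section 2a without changing the notebook's state.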

## 1. Installation

First, let's clone the repository and install dependencies.

In [None]:
# Clone the repository
!git clone https://github.com/shawngraham/latinitas.git
%cd latinitas

# Install core dependencies
!pip install -q pandas requests

print("✅ Installation complete!")

## 2a. Optional: Install Latin BERT for Enhanced Accuracy

The tool can use a pre-trained Latin BERT model for more accurate named entity recognition. This installs the Latin BERT model from https://github.com/dbamman/latin-bert/

**CRITICAL: Python Version Requirement**
- Latin BERT requires Python 3.8, 3.9, or 3.10 (NOT 3.11 or 3.12)
- **Google Colab now uses Python 3.12** - you will encounter errors
- The notebook will check your version and provide workarounds

**Workaround Options:**
1. **Skip Latin BERT** - The tool works without it (uses stub mode for demo purposes)
2. **Local environment** - Install with Python 3.10 on your machine
3. **Wait for updates** - Latin BERT maintainers may update for Python 3.12+

This section will:
1. Check Python version compatibility
2. Attempt installation (may fail on Python 3.12+)
3. Configure the tool to use the model if successful

In [None]:
# Step 0: Check Python version compatibility
import sys
print("Step 0: Checking Python version...")
python_version = sys.version_info
print(f"Current Python version: {python_version.major}.{python_version.minor}.{python_version.micro}")

COMPATIBLE = False

if python_version.major == 3 and 8 <= python_version.minor <= 10:
    print("✅ Python version is compatible with Latin BERT")
    print("   Proceeding with installation...\n")
    COMPATIBLE = True
else:
    if python_version.major == 3 and python_version.minor >= 11:
        print(f"\n❌ INCOMPATIBLE: Python {python_version.major}.{python_version.minor} detected!")
        print("   Latin BERT dependencies require Python 3.8-3.10")
        print("   (The 'cltk' package does not support Python 3.11+)")
    else:
        print(f"\n⚠️  WARNING: Python {python_version.major}.{python_version.minor} may not be compatible")
        print("   Latin BERT requires Python 3.8-3.10")
    
    print("\n📌 RECOMMENDED SOLUTIONS:")
    print("   1. SKIP THIS SECTION - The demo works in stub mode without Latin BERT")
    print("   2. Run locally with Python 3.10:")
    print("      conda create -n latinbert python=3.10")
    print("      conda activate latinbert")
    print("      jupyter notebook")
    print("   3. Wait for Latin BERT to support Python 3.12+")
    
    print("\n⚠️  Installation will likely FAIL. Stopping here.")
    print("   You can continue with the rest of the notebook (skip to section 2).")
    
# Only proceed if compatible
if not COMPATIBLE:
    raise SystemExit("Stopping installation due to Python version incompatibility. "
                     "Please skip to section 2 to use the tool in stub mode.")

# Step 1: Install PyTorch and transformers
print("Step 1: Installing PyTorch and transformers...")
!pip install -q torch transformers

# Step 2: Clone Latin BERT repository
print("\nStep 2: Cloning Latin BERT repository...")
!git clone https://github.com/dbamman/latin-bert.git /content/latin-bert

# Step 3: Install dependencies
print("\nStep 3: Installing Latin BERT dependencies...")
!pip install -q -r /content/latin-bert/requirements.txt

# Step 4: Download the pre-trained model
print("\nStep 4: Downloading pre-trained Latin BERT model...")
import os
os.chdir('/content/latin-bert')
!bash scripts/download.sh

# Step 5: Set environment variables for latinepi
print("\nStep 5: Configuring latinepi to use Latin BERT...")
os.chdir('/content/latinitas')
os.environ['LATINEPI_USE_STUB'] = 'false'
os.environ['LATIN_BERT_PATH'] = '/content/latin-bert/models/latin_bert/'

print("\n✅ Latin BERT installation complete!")
print(f"   Python version: {python_version.major}.{python_version.minor}.{python_version.micro}")
print(f"   Model location: {os.environ['LATIN_BERT_PATH']}")
print("   The tool will now use Latin BERT for entity extraction.")
print("\nNote: Installation may take several minutes depending on your connection speed.")

## 2. Basic Setup

Create sample data and set up working directories.

In [None]:
import json
import csv
import pandas as pd
from pathlib import Path

# Create output directories
Path('data').mkdir(exist_ok=True)
Path('output').mkdir(exist_ok=True)
Path('edh_downloads').mkdir(exist_ok=True)

# Create sample CSV data
sample_inscriptions = [
    {"id": 1, "text": "D M GAIVS IVLIVS CAESAR", "location": "Rome"},
    {"id": 2, "text": "MARCVS ANTONIVS FELIX", "location": "Pompeii"},
    {"id": 3, "text": "D M MARCIA TVRPILIA", "location": "Ostia"},
    {"id": 4, "text": "LVCIVS CORNELIVS SCIPIO", "location": "Rome"},
    {"id": 5, "text": "D M GAIVS IVLIVS CICERO ANNORVM XXX", "location": "Forum"},
]

# Save as CSV
with open('data/sample_inscriptions.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'text', 'location'])
    writer.writeheader()
    writer.writerows(sample_inscriptions)

# Save as JSON
with open('data/sample_inscriptions.json', 'w', encoding='utf-8') as f:
    json.dump(sample_inscriptions, f, indent=2, ensure_ascii=False)

print("✅ Sample data created:")
print(f"  - data/sample_inscriptions.csv ({len(sample_inscriptions)} inscriptions)")
print(f"  - data/sample_inscriptions.json ({len(sample_inscriptions)} inscriptions)")
print("\n📄 Sample inscription texts:")
for insc in sample_inscriptions[:3]:
    print(f"  {insc['id']}: {insc['text']}")

## 3. Basic Entity Extraction

Process the sample CSV file and extract entities. The CLI run for this step follows the optional Latin BERT check in section 3a.

## 3a. Using Latin BERT Model for Enhanced Extraction

If you installed the Latin BERT dependencies in section 2a, the entity extraction will now use the transformer model instead of the simple stub implementation. This provides more accurate results for complex inscriptions.

In [None]:
# Demonstrate using the Latin BERT model directly via the Python API
# This shows the model in action before running the full CLI

print("Testing Latin BERT model integration...")
print("="*60)

# Ensure environment is set to use the model
import os
if 'LATIN_BERT_PATH' in os.environ:
    print(f"Using Latin BERT model from: {os.environ['LATIN_BERT_PATH']}")
else:
    print("Note: LATIN_BERT_PATH not set. Using default model.")
    print("Run section 2a to install Latin BERT for best results.")

os.environ['LATINEPI_USE_STUB'] = 'false'

# Import the entity extraction function
import sys
sys.path.insert(0, 'latinepi')
from parser import extract_entities

# Test inscriptions
test_inscriptions = [
    "D M GAIVS IVLIVS CAESAR ANNORVM LX",
    "MARCVS ANTONIVS FELIX VIXIT ANNOS XXV",
    "LVCIVS CORNELIVS SCIPIO"
]

print("\nExtracting entities from test inscriptions:\n")

for i, inscription in enumerate(test_inscriptions, 1):
    print(f"{i}. '{inscription}'")
    entities = extract_entities(inscription, use_model=True)
    
    if entities:
        for entity_name, entity_data in entities.items():
            print(f"   {entity_name}: {entity_data['value']} (confidence: {entity_data['confidence']:.3f})")
    else:
        print("   No entities extracted")
    print()

print("="*60)
print("Status Check:")
# Run each inscription once and reuse the results for the stub check
status_results = [extract_entities(t, use_model=True) for t in test_inscriptions]
if any('text' in r and r['text']['confidence'] == 0.50 for r in status_results):
    print("⚠️  Model may not be loaded (stub mode active)")
    print("   Install transformers and Latin BERT for full functionality")
else:
    print("✅ Latin BERT model appears to be active")
print("="*60)

In [None]:
# Process CSV to JSON output
!python3 latinepi/cli.py \
    --input data/sample_inscriptions.csv \
    --output output/entities.json

# Display results
print("\n" + "="*60)
print("EXTRACTED ENTITIES (JSON)")
print("="*60)

with open('output/entities.json', 'r') as f:
    entities = json.load(f)

# Pretty print first 3 results
for entity in entities[:3]:
    print(f"\n📜 Inscription {entity.get('inscription_id')}:")
    for key, value in entity.items():
        if key != 'inscription_id' and not key.endswith('_confidence'):
            confidence_key = f"{key}_confidence"
            confidence = entity.get(confidence_key, 'N/A')
            print(f"   {key}: {value} (confidence: {confidence})")

## 4. CSV Output Format

Process the same data but output as CSV for easier analysis in spreadsheet tools.
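Because different inscriptions yield different entities, a flat CSV needs one column for every key that appears in any record. The following is a minimal sketch of that flattening using only the standard library. The records are made-up illustrations in the `key` / `key_confidence` layout used by the JSON output in section 3, and `records_to_csv` is a hypothetical helper, not the tool's own writer.

```python
import csv
import io

# Illustrative records only; these are not real latinepi output.
records = [
    {"inscription_id": 1, "praenomen": "Gaius", "praenomen_confidence": 0.91},
    {"inscription_id": 2, "nomen": "Antonius", "nomen_confidence": 0.88},
]


def records_to_csv(rows):
    """Serialize dicts with differing keys into one CSV string."""
    # Collect the union of keys, preserving first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    # restval fills the cells of entities a record does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

Empty cells then mean "entity not found for this inscription", which is why later analysis steps count non-null values per column rather than rows.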

In [None]:
# Process to CSV output
!python3 latinepi/cli.py \
    --input data/sample_inscriptions.json \
    --output output/entities.csv \
    --output-format csv

# Display as pandas DataFrame
print("\n" + "="*60)
print("EXTRACTED ENTITIES (CSV)")
print("="*60 + "\n")

df = pd.read_csv('output/entities.csv')
print(df.to_string(index=False))

print(f"\n📊 Extracted {len(df)} inscription records with {len(df.columns)} fields")

## 5. Confidence Threshold Filtering

Apply confidence thresholds to filter high-quality entities.
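To make the idea concrete, here is a rough sketch of threshold filtering on a single flat record. It assumes, without confirmation from the latinepi source, that entities below the threshold are dropped by default and are instead kept and marked with a `<name>_ambiguous` field when flagging is enabled. The function name `filter_by_confidence` and the sample record are hypothetical.

```python
def filter_by_confidence(record, threshold, flag_ambiguous=False):
    """Keep entities at or above `threshold`.

    Assumed semantics: low-confidence entities are dropped unless
    `flag_ambiguous` is set, in which case they are kept and marked.
    """
    kept = {"inscription_id": record.get("inscription_id")}
    for key, value in record.items():
        # Skip the ID and the helper columns; only entity fields are evaluated.
        if key == "inscription_id" or key.endswith("_confidence"):
            continue
        confidence = record.get(f"{key}_confidence")
        if confidence is None:
            continue  # no score available, so nothing to compare
        if confidence >= threshold:
            kept[key] = value
            kept[f"{key}_confidence"] = confidence
        elif flag_ambiguous:
            kept[key] = value
            kept[f"{key}_confidence"] = confidence
            kept[f"{key}_ambiguous"] = True
    return kept


# Illustrative record; the scores are invented for the example.
sample = {
    "inscription_id": 1,
    "praenomen": "Gaius", "praenomen_confidence": 0.95,
    "cognomen": "Caesar", "cognomen_confidence": 0.75,
}

strict = filter_by_confidence(sample, threshold=0.9)
flagged = filter_by_confidence(sample, threshold=0.9, flag_ambiguous=True)
print(strict)
print(flagged)
```

With a threshold of 0.9, the strict call keeps only the praenomen, while the flagged call also retains the cognomen together with a `cognomen_ambiguous` marker.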

In [None]:
# High confidence threshold (0.9)
!python3 latinepi/cli.py \
    --input data/sample_inscriptions.json \
    --output output/high_confidence.json \
    --confidence-threshold 0.9

# Low confidence with ambiguous flagging
!python3 latinepi/cli.py \
    --input data/sample_inscriptions.json \
    --output output/with_ambiguous.json \
    --confidence-threshold 0.7 \
    --flag-ambiguous

print("\n" + "="*60)
print("CONFIDENCE FILTERING RESULTS")
print("="*60)

# Compare results
with open('output/high_confidence.json', 'r') as f:
    high_conf = json.load(f)

with open('output/with_ambiguous.json', 'r') as f:
    with_amb = json.load(f)

def count_entities(record):
    """Count entity fields, skipping the ID and *_confidence columns."""
    return sum(1 for k in record if k != 'inscription_id' and not k.endswith('_confidence'))

print(f"\n✅ High confidence (≥0.9): {len(high_conf)} inscriptions processed")
if high_conf:  # avoid dividing by zero when nothing was extracted
    avg_entities = sum(count_entities(r) for r in high_conf) / len(high_conf)
    print(f"   Average entities per inscription: {avg_entities:.1f}")

print(f"\n⚠️  With ambiguous flagging (≥0.7): {len(with_amb)} inscriptions processed")
ambiguous_count = sum(sum(1 for k in r.keys() if k.endswith('_ambiguous') and r[k]) for r in with_amb)
print(f"   Total ambiguous entities flagged: {ambiguous_count}")

# Show example with ambiguous flags (only if at least one record exists)
if with_amb:
    print("\n📋 Example with ambiguous flags:")
    example = with_amb[0]
    for key, value in example.items():
        if not key.endswith('_confidence'):
            print(f"   {key}: {value}")

## 6. EDH Single Inscription Download

Download a specific inscription from the Epigraphic Database Heidelberg.

In [None]:
# Download inscription HD000001 from EDH
print("📥 Downloading inscription HD000001 from EDH...\n")

!python3 latinepi/cli.py \
    --download-edh HD000001 \
    --download-dir edh_downloads/

# Check what was downloaded
import os
edh_files = list(Path('edh_downloads').glob('*.json'))
print(f"\n✅ Downloaded {len(edh_files)} file(s) to edh_downloads/")

if edh_files:
    # Show structure of downloaded file
    with open(edh_files[0], 'r') as f:
        edh_data = json.load(f)
    
    print(f"\n📄 Downloaded file: {edh_files[0].name}")
    print(f"   Top-level keys: {list(edh_data.keys())}")
    
    # Show inscriptions if present
    if 'inscriptions' in edh_data:
        print(f"   Number of inscriptions: {len(edh_data['inscriptions'])}")
        if edh_data['inscriptions']:
            first_insc = edh_data['inscriptions'][0]
            print(f"   Inscription fields: {list(first_insc.keys())[:10]}...")

## 7. EDH Bulk Search and Download

Search for multiple inscriptions by criteria and download them in parallel.

⚠️ **Note**: This example uses small limits to avoid long download times. Adjust `--search-limit` for production use.
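For readers unfamiliar with parallel downloads, the pattern behind an option like `--search-workers` can be sketched with a thread pool. This is not the tool's implementation: `fetch_inscription` is a hypothetical stand-in that returns a fake record without touching the network, and `download_all` is likewise invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_inscription(inscription_id):
    """Stand-in for a real download; returns a fake record, no network access."""
    return {"id": inscription_id, "status": "ok"}


def download_all(inscription_ids, workers=5):
    """Fetch every ID using at most `workers` concurrent threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its ID so failures can be attributed.
        futures = {pool.submit(fetch_inscription, i): i for i in inscription_ids}
        for future in as_completed(futures):
            inscription_id = futures[future]
            try:
                results[inscription_id] = future.result()
            except Exception as exc:
                # One failed download should not abort the whole batch.
                results[inscription_id] = {"id": inscription_id, "status": f"failed: {exc}"}
    return results


# Twenty IDs in the HD000001 style used elsewhere in this notebook.
ids = [f"HD{n:06d}" for n in range(1, 21)]
downloaded = download_all(ids, workers=5)
print(len(downloaded))
```

Threads suit this kind of work because each download spends most of its time waiting on the network. Keeping the worker count small also limits the load placed on the remote server.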

In [None]:
# Search for inscriptions from Rome (modern findspot)
print("🔍 Searching EDH for inscriptions from Rome...\n")

!python3 latinepi/cli.py \
    --search-edh \
    --search-findspot-modern "rome*" \
    --search-limit 20 \
    --search-workers 5 \
    --download-dir edh_downloads/rome/

# Check results
rome_files = list(Path('edh_downloads/rome').glob('*.json'))
print(f"\n✅ Downloaded {len(rome_files)} inscriptions from Rome")
print(f"   Files saved to: edh_downloads/rome/")

# Show some inscription IDs
if rome_files:
    print(f"\n📋 Sample inscription IDs:")
    for f in rome_files[:5]:
        print(f"   - {f.stem}")

## 8. Temporal Search (By Date Range)

Search inscriptions by time period.

In [None]:
# Search for 1st century AD inscriptions
print("🔍 Searching for 1st century AD inscriptions...\n")

!python3 latinepi/cli.py \
    --search-edh \
    --search-year-from 1 \
    --search-year-to 100 \
    --search-limit 15 \
    --download-dir edh_downloads/first_century/

# Check results
century_files = list(Path('edh_downloads/first_century').glob('*.json'))
print(f"\n✅ Downloaded {len(century_files)} inscriptions from 1st century AD")

## 9. Complete Pipeline: Search → Download → Extract → Analyze

Demonstrate the full workflow from search to analysis.

In [None]:
# Step 1: Search and download inscriptions from a specific province
print("🔍 Step 1: Searching for inscriptions from Dalmatia...\n")

!python3 latinepi/cli.py \
    --search-edh \
    --search-province "Dalmatia" \
    --search-limit 10 \
    --download-dir edh_downloads/dalmatia/

# Step 2: Process all downloaded inscriptions
print("\n🔧 Step 2: Extracting entities from downloaded inscriptions...\n")

# Get list of downloaded files
dalmatia_files = list(Path('edh_downloads/dalmatia').glob('*.json'))

if dalmatia_files:
    # Process each file and collect results
    all_results = []
    
    for file_path in dalmatia_files:
        # For now, process files individually (in production you might batch this)
        output_file = f'output/dalmatia_{file_path.stem}.json'
        !python3 latinepi/cli.py \
            --input {str(file_path)} \
            --output {output_file} \
            --confidence-threshold 0.7
        
        with open(output_file, 'r') as f:
            results = json.load(f)
            all_results.extend(results)
    
    print(f"\n✅ Processed {len(dalmatia_files)} inscriptions")
    print(f"   Total entity records extracted: {len(all_results)}")
    
    # Save combined results
    with open('output/dalmatia_combined.json', 'w') as f:
        json.dump(all_results, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Combined results saved to: output/dalmatia_combined.json")
else:
    print("⚠️  No files downloaded. The search may not have returned results.")

## 10. Data Analysis and Visualization

Analyze the extracted entities to gain insights.

In [None]:
import matplotlib.pyplot as plt
from collections import Counter

# Load all extracted entities
with open('output/entities.json', 'r') as f:
    entities = json.load(f)

print("📊 ENTITY EXTRACTION ANALYSIS")
print("="*60 + "\n")

# Count entity types
entity_types = Counter()
confidence_scores = []

for record in entities:
    for key, value in record.items():
        if not key.endswith('_confidence') and key != 'inscription_id' and not key.endswith('_ambiguous'):
            entity_types[key] += 1
            confidence_key = f"{key}_confidence"
            if confidence_key in record:
                confidence_scores.append(record[confidence_key])

# Print statistics
print(f"Total inscriptions processed: {len(entities)}")
print(f"Total entities extracted: {sum(entity_types.values())}")
print(f"Average entities per inscription: {sum(entity_types.values()) / max(len(entities), 1):.2f}")
print(f"\nMost common entity types:")
for entity_type, count in entity_types.most_common():
    print(f"  {entity_type}: {count}")

if confidence_scores:
    avg_confidence = sum(confidence_scores) / len(confidence_scores)
    print(f"\nAverage confidence score: {avg_confidence:.3f}")
    print(f"Confidence range: {min(confidence_scores):.3f} - {max(confidence_scores):.3f}")

# Visualization
if entity_types:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Entity type distribution
    types, counts = zip(*entity_types.most_common())
    ax1.barh(types, counts, color='steelblue')
    ax1.set_xlabel('Count')
    ax1.set_title('Entity Type Distribution')
    ax1.invert_yaxis()
    
    # Confidence score distribution
    if confidence_scores:
        ax2.hist(confidence_scores, bins=20, color='coral', edgecolor='black')
        ax2.set_xlabel('Confidence Score')
        ax2.set_ylabel('Frequency')
        ax2.set_title('Confidence Score Distribution')
        ax2.axvline(avg_confidence, color='red', linestyle='--', label=f'Mean: {avg_confidence:.3f}')
        ax2.legend()
    
    plt.tight_layout()
    plt.savefig('output/analysis.png', dpi=150, bbox_inches='tight')
    print("\n📈 Visualizations saved to: output/analysis.png")
    plt.show()

## 11. Advanced Analysis: Name Patterns

Analyze Roman naming conventions in the extracted data.

In [None]:
print("📊 ROMAN NAMING PATTERNS ANALYSIS")
print("="*60 + "\n")

# Count full name combinations
tria_nomina = 0  # praenomen + nomen + cognomen
duo_nomina = 0   # two of three
single_names = 0 # just one name

praenomina = Counter()
nomina = Counter()
cognomina = Counter()

for record in entities:
    has_praenomen = 'praenomen' in record
    has_nomen = 'nomen' in record
    has_cognomen = 'cognomen' in record
    
    name_count = sum([has_praenomen, has_nomen, has_cognomen])
    
    if name_count == 3:
        tria_nomina += 1
    elif name_count == 2:
        duo_nomina += 1
    elif name_count == 1:
        single_names += 1
    
    # Collect name components
    if has_praenomen:
        praenomina[record['praenomen']] += 1
    if has_nomen:
        nomina[record['nomen']] += 1
    if has_cognomen:
        cognomina[record['cognomen']] += 1

total_with_names = tria_nomina + duo_nomina + single_names

if total_with_names > 0:
    print(f"Naming Conventions:")
    print(f"  Tria nomina (3 names): {tria_nomina} ({tria_nomina/total_with_names*100:.1f}%)")
    print(f"  Duo nomina (2 names):  {duo_nomina} ({duo_nomina/total_with_names*100:.1f}%)")
    print(f"  Single names:          {single_names} ({single_names/total_with_names*100:.1f}%)")
    
    print(f"\nMost common praenomina:")
    for name, count in praenomina.most_common(5):
        print(f"  {name}: {count}")
    
    print(f"\nMost common nomina:")
    for name, count in nomina.most_common(5):
        print(f"  {name}: {count}")
    
    print(f"\nMost common cognomina:")
    for name, count in cognomina.most_common(5):
        print(f"  {name}: {count}")
else:
    print("⚠️  No name entities found in the data.")

## 12. Export Results for Further Analysis

Prepare data for external tools (Excel, R, etc.).

In [None]:
# Convert all JSON results to CSV for spreadsheet analysis
print("💾 Exporting results to CSV format...\n")

# Load entities
with open('output/entities.json', 'r') as f:
    entities = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(entities)

# Save as CSV
df.to_csv('output/all_entities_export.csv', index=False)
print(f"✅ Exported {len(df)} records to: output/all_entities_export.csv")

# Create summary statistics CSV
summary_data = []
for col in df.columns:
    if not col.endswith('_confidence') and col != 'inscription_id':
        summary_data.append({
            'entity_type': col,
            'count': df[col].notna().sum(),
            'unique_values': df[col].nunique(),
            'avg_confidence': df[f"{col}_confidence"].mean() if f"{col}_confidence" in df.columns else None
        })

summary_df = pd.DataFrame(summary_data)
summary_df.to_csv('output/entity_summary.csv', index=False)
print(f"✅ Summary statistics saved to: output/entity_summary.csv")

# Display summary
print("\n📊 Entity Summary:")
print(summary_df.to_string(index=False))

print("\n" + "="*60)
print("📦 All outputs saved to 'output/' directory")
print("   Download these files to analyze in Excel, R, or other tools.")
print("="*60)

## Summary

This notebook demonstrated:

✅ **Installation** - Set up latinepi in Google Colab

✅ **Latin BERT Integration** - Install and configure Latin BERT model (when Python 3.8-3.10 available)

✅ **Basic Processing** - Extract entities from CSV and JSON files

✅ **Confidence Filtering** - Apply thresholds and flag ambiguous entities

✅ **EDH Integration** - Download single inscriptions from EDH

✅ **Bulk Search** - Search and download multiple inscriptions by:
   - Geographic location (Rome, provinces)
   - Time period (1st century AD)
   - Combined criteria

✅ **Complete Pipeline** - Search → Download → Extract → Analyze

✅ **Data Analysis** - Visualize entity distributions and confidence scores

✅ **Export** - Prepare data for external analysis tools

## Known Limitations

**Python Compatibility:**
- Latin BERT requires Python 3.8, 3.9, or 3.10
- Google Colab now uses Python 3.12 (incompatible)
- For full functionality, run locally with Python 3.10
- The tool works in stub mode without Latin BERT

## Next Steps

- **Latin BERT**: Run locally with Python 3.10 for ML-based entity extraction
- **Scale up**: Increase `--search-limit` for larger datasets
- **Customize searches**: Try different provinces, date ranges, locations
- **Adjust thresholds**: Experiment with confidence filtering
- **Deep analysis**: Use exported CSV files in R, Python, or Excel
- **Fine-tuning**: Train Latin BERT on domain-specific inscription data
- **Alternative models**: Explore other Latin NLP models compatible with Python 3.12+

## Resources

- **Repository**: https://github.com/shawngraham/latinitas
- **Latin BERT**: https://github.com/dbamman/latin-bert/
- **Documentation**: See README.md and SETUP.md
- **EDH Database**: https://edh.ub.uni-heidelberg.de/
- **Issues/Questions**: https://github.com/shawngraham/latinitas/issues

## Alternative Latin NLP Models

If Latin BERT is incompatible with your Python version, consider:
- **Classical Language Toolkit (CLTK)**: Basic NER without Latin BERT
- **Hugging Face models**: Search for Latin-compatible transformers
- **Custom training**: Fine-tune modern BERT models on Latin corpora

---

*Created for digital humanities and ancient history research*