# Make Turtles

This notebook contains code to turn JSON files downloaded from PAThs into RDF triples serialized as Turtles (.ttl). It uses the [RDFLib](https://github.com/RDFLib/rdflib) library to make the conversion.

In [1]:
import json
from rdflib import BNode, Graph, URIRef, Literal, Namespace, URIRef, RDF
from rdflib.namespace import RDF, RDFS, DCMITYPE, DCTERMS, FOAF, OWL, SKOS
import re
import glob
import os

## Authors

We want to end up with turtles like the following example:

```
<http://paths.uniroma1.it/atlas/authors/11> a coptic:Author ;
    dcterms:creator <http://paths.uniroma1.it/atlas/works/253> ;
    dcterms:description """Probably a native of the Thebaid, Stephen of Thebes is for us a fairly elusive figure. Indeed, no historically reliable pieces of information are available. However, moving from the analysis of the sole composition attributed to him in Coptic, his life and activity seem to be closely related to the monastic environment of Lower Egypt, and specifically to the sites of Scetis, Kellia, and Nitria. \r
A number of possible identifications have been suggested, including that with Stephen the Anchorite (see Lucchesi 2007a and Lucchesi 1998b), with the Stephen mentioned by Palladius in Historia Lausiaca 55.3 (see Dechow 1988; Lucchesi 2007a), and again with a homonymous anchorite celebrated on May 7 in the calendar provided by Abū al-Barakāt in his Lamp of Darkness (see Suciu 2018b, also for an updated status quaestionis).\r
However, in the absence of concluding evidence, all these possibilities remain hypothetical. \r
Indeed, the only firm source of information available is the text itself of the Sermo asceticum (CC0253), a collection of aphorisms and gnomic sentences addressed by an ascetic teacher to his spiritual pupil. Although the Sahidic redaction (probably the original one) is acephalous and fragmentarily preserved – and consequently anonymous – the indirect transmission in Greek, Arabic, Georgian, and Ethiopic seems to attribute the authorship to an Egyptian ascetic named Stephen, who probably lived contemporaneously with Athanasius. This would make Stephen of Thebes one of the earliest (and least known) original authors in Coptic.\r
""" ;
    dcterms:identifier "11" ;
    owl:sameAs <https://viaf.org/viaf/64111548> ;
    foaf:name "Stephen of Thebes",
        "ⲥⲧⲉⲫⲁⲛⲟⲥ"@cop,
        "Στέφανος"@el .
```

I'll use as many common metadata terms as possible, but I am going to define an Author class in the COPTIC namespace for the purposes of easy querying. Not every LLM understands the concept of `dcterms:creator`.

In [2]:
# Load JSON data
with open("../data/authors.json", "r", encoding="utf-8") as f:
    authors = json.load(f)

# Define namespaces
PATHS = Namespace("http://paths.uniroma1.it/atlas/")
FRBR = Namespace("http://purl.org/vocab/frbr/core#")
SCHEMA = Namespace("http://schema.org/")
COPTIC = Namespace("http://www.semanticweb.org/sjhuskey/ontologies/2025/7/coptic-metadata-viewer/")

# Create graph
g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("foaf", FOAF)
g.bind("owl", OWL)
g.bind("paths", PATHS)
g.bind("schema1", SCHEMA)
g.bind("coptic", COPTIC)

for author in authors:
    core = author.get("core", {})
    backlinks = author.get("backlinks", {})
    plugins = author.get("plugins", {})

    aid = core["id"]["val"]
    author_uri = PATHS[f"authors/{aid}"]

    g.add((author_uri, RDF.type, COPTIC.Author))
    g.add((author_uri, DCTERMS.identifier, Literal(aid)))

    # Add simple string fields using dcterms
    if (name := core.get("name", {}).get("val")):
        g.add((author_uri, FOAF.name, Literal(name)))
    if (copticname := core.get("copticname", {}).get("val")):
        g.add((author_uri, FOAF.name, Literal(copticname, lang="cop")))
    if (greekname := core.get("greekname", {}).get("val")):
        g.add((author_uri, FOAF.name, Literal(greekname, lang="el")))
    if (title := core.get("title", {}).get("val")):
        g.add((author_uri, SCHEMA.title, Literal(title)))
    if (bio := core.get("bio", {}).get("val")):
        g.add((author_uri, DCTERMS.description, Literal(bio)))
    if (viaf := core.get("viaf", {}).get("val")):
        if isinstance(viaf, str) and viaf.isdigit():
            g.add((author_uri, OWL.sameAs, URIRef(f"https://viaf.org/viaf/{viaf}")))
    

    # Link to works (via backlinks)
    if isinstance(author.get("backlinks"), dict):
        works_backlink = author["backlinks"].get("paths__works")
        if isinstance(works_backlink, dict):
            for work in works_backlink.get("data", []):
                if isinstance(work, dict) and "id" in work:
                    work_id = work["id"]
                    work_uri = PATHS[f"works/{work_id}"]

                    # Link author to work
                    g.add((author_uri, DCTERMS.creator , work_uri))
                    g.add((work_uri, DCTERMS.creator , author_uri))

# Save graph as Turtle
g.serialize(destination="../turtles/authors.ttl", format="turtle")
print("✅ RDF data saved to ../turtles/authors.ttl")


✅ RDF data saved to ../turtles/authors.ttl


## Collections

## Collections

We want to generate turtles like this:

```
<http://paths.uniroma1.it/atlas/collections/20> a dcmitype:Collection ;
    dcterms:hasPart <http://paths.uniroma1.it/atlas/manuscripts/2200>,
        <http://paths.uniroma1.it/atlas/manuscripts/2201>,
        <http://paths.uniroma1.it/atlas/manuscripts/2203>,
        <http://paths.uniroma1.it/atlas/manuscripts/24>,
        <http://paths.uniroma1.it/atlas/manuscripts/3223>,
        <http://paths.uniroma1.it/atlas/manuscripts/324>,
        <http://paths.uniroma1.it/atlas/manuscripts/378>,
        <http://paths.uniroma1.it/atlas/manuscripts/3972>,
        <http://paths.uniroma1.it/atlas/manuscripts/3973>,
        <http://paths.uniroma1.it/atlas/manuscripts/3974>,
        <http://paths.uniroma1.it/atlas/manuscripts/3975>,
        <http://paths.uniroma1.it/atlas/manuscripts/3979>,
        <http://paths.uniroma1.it/atlas/manuscripts/3981>,
        <http://paths.uniroma1.it/atlas/manuscripts/3982>,
        <http://paths.uniroma1.it/atlas/manuscripts/424>,
        <http://paths.uniroma1.it/atlas/manuscripts/6255>,
        <http://paths.uniroma1.it/atlas/manuscripts/6256>,
        <http://paths.uniroma1.it/atlas/manuscripts/6788>,
        <http://paths.uniroma1.it/atlas/manuscripts/687>,
        <http://paths.uniroma1.it/atlas/manuscripts/692>,
        <http://paths.uniroma1.it/atlas/manuscripts/709> ;
    dcterms:identifier "20",
        "BS.OCT" ;
    dcterms:spatial "Berlin",
        "Germany",
        "Staatsbibliothek zu Berlin - Preußischer Kulturbesitz" ;
    schema1:location "Berlin",
        "Germany" ;
    schema1:name "Germany, Berlin, Staatsbibliothek zu Berlin - Preußischer Kulturbesitz, Ms. or. oct.",
        "Staatsbibliothek zu Berlin - Preußischer Kulturbesitz" .
```

In [3]:
# Load JSON data
with open("../data/collections.json", "r", encoding="utf-8") as f:
    collections = json.load(f)

def get_place_uri_and_node(label, class_uri):
    place_id = label.lower().replace(" ", "-").replace(",", "").replace(".", "")
    place_uri = PATHS[f"places/{place_id}"]
    g.add((place_uri, RDF.type, class_uri))
    g.add((place_uri, RDFS.label, Literal(label)))
    return place_uri

# Define namespaces
PATHS = Namespace("http://paths.uniroma1.it/atlas/")
FRBR = Namespace("http://purl.org/vocab/frbr/core#")
SCHEMA = Namespace("http://schema.org/")

# Create graph
g = Graph()
g.bind("dcmitype", DCMITYPE)
g.bind("dcterms", DCTERMS)
g.bind("foaf", FOAF)
g.bind("owl", OWL)
g.bind("frbr", FRBR)
g.bind("paths", PATHS)
g.bind("schema1", SCHEMA)

for collection in collections:
    core = collection.get("core", {})
    backlinks = collection.get("backlinks", {})

    cid = core["id"]["val"]
    collection_uri = PATHS[f"collections/{cid}"]

    g.add((collection_uri, RDF.type, DCMITYPE.Collection))
    g.add((collection_uri, DCTERMS.identifier, Literal(cid)))

    # Add simple string fields using dcterms
    if (cmclname := core.get("cmclname", {}).get("val")):
        g.add((collection_uri, DCTERMS.identifier, Literal(cmclname)))
    if institution := core.get("institution", {}).get("val"):
        g.add((collection_uri, SCHEMA.name, Literal(institution)))
    if (fullname := core.get("fullname", {}).get("val")):
        g.add((collection_uri, SCHEMA.name, Literal(fullname)))
    if town := core.get("town", {}).get("val"):
        g.add((collection_uri, SCHEMA.location, Literal(town)))
    if country := core.get("country", {}).get("val"):
        g.add((collection_uri, SCHEMA.location, Literal(country)))

    # Link to works (via backlinks)
    if isinstance(collection.get("backlinks"), dict):
        manuscripts_backlinks = collection["backlinks"].get("paths__manuscripts")
        if isinstance(manuscripts_backlinks, dict):
            for manuscript in manuscripts_backlinks.get("data", []):
                if isinstance(manuscript, dict) and "id" in manuscript:
                    manuscript_id = manuscript["id"]
                    manuscript_uri = PATHS[f"manuscripts/{manuscript_id}"]
                    g.add((collection_uri, DCTERMS.hasPart, manuscript_uri))
                    g.add((manuscript_uri, DCTERMS.isPartOf, collection_uri))

# Save graph as Turtle
g.serialize(destination="../turtles/collections.ttl", format="turtle")
print("✅ RDF data saved to ../turtles/collections.ttl")

✅ RDF data saved to ../turtles/collections.ttl


## Manuscripts

We want to end up with output like the following:

```
<http://paths.uniroma1.it/atlas/manuscripts/3493> a coptic:Manuscript ;
    dcterms:bibliographicCitation [ dcterms:creator "Behlmer, H." ;
            dcterms:description "190",
                "24-25" ;
            dcterms:extent "11-27" ;
            dcterms:issued "2003" ;
            dcterms:title "Streiflichter auf die christliche Besiedlung Thebens: Koptische Ostraka aus dem Grab des Senneferi (TT 99)" ],
        [ dcterms:creator "Behlmer, H." ;
            dcterms:description "190",
                "24-25" ;
            dcterms:extent "11-27" ;
            dcterms:issued "2003" ;
            dcterms:title "Streiflichter auf die christliche Besiedlung Thebens: Koptische Ostraka aus dem Grab des Senneferi (TT 99)" ],
        [ dcterms:creator "Behlmer, H." ;
            dcterms:description "190",
                "24-25" ;
            dcterms:extent "11-27" ;
            dcterms:issued "2003" ;
            dcterms:title "Streiflichter auf die christliche Besiedlung Thebens: Koptische Ostraka aus dem Grab des Senneferi (TT 99)" ],
        [ dcterms:creator "Behlmer, H." ;
            dcterms:description "190",
                "24-25" ;
            dcterms:extent "11-27" ;
            dcterms:issued "2003" ;
            dcterms:title "Streiflichter auf die christliche Besiedlung Thebens: Koptische Ostraka aus dem Grab des Senneferi (TT 99)" ],
        [ dcterms:creator "Behlmer, H." ;
            dcterms:description "190",
                "24-25" ;
            dcterms:extent "11-27" ;
            dcterms:issued "2003" ;
            dcterms:title "Streiflichter auf die christliche Besiedlung Thebens: Koptische Ostraka aus dem Grab des Senneferi (TT 99)" ],
        [ dcterms:creator "Behlmer, H." ;
            dcterms:description "190",
                "24-25" ;
            dcterms:extent "11-27" ;
            dcterms:issued "2003" ;
            dcterms:title "Streiflichter auf die christliche Besiedlung Thebens: Koptische Ostraka aus dem Grab des Senneferi (TT 99)" ] ;
    dcterms:hasPart "Unidentified literary text" ;
    dcterms:identifier "3493" ;
    dcterms:medium "ostrakon",
        "stone" ;
    time:hasBeginning "501" ;
    time:hasEnd "650" .
```


In [4]:
# Namespaces
PATHS = Namespace("http://paths.uniroma1.it/atlas/")
TIME = Namespace("http://www.w3.org/2006/time#")
COPTIC = Namespace("http://www.semanticweb.org/sjhuskey/ontologies/2025/7/coptic-metadata-viewer/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("paths", PATHS)
g.bind("time", TIME)
g.bind("coptic", COPTIC)

# Load JSON
with open("../data/manuscripts.json", encoding="utf-8") as f:
    manuscripts = json.load(f)

for m in manuscripts:
    core = m.get("core", {})
    plugins = m.get("plugins", [])

    # Function for looping through plugins
    def get_plugin_data(plugins, key):
        if isinstance(plugins, list):
            for plugin in plugins:
                if isinstance(plugin, dict):
                    maybe_data = plugin.get(key)
                    if maybe_data and "data" in maybe_data:
                        data = maybe_data["data"]
                        if isinstance(data, dict):
                            return list(data.values())
                        elif isinstance(data, list):
                            return data
        elif isinstance(plugins, dict):
            maybe_data = plugins.get(key)
            if maybe_data and "data" in maybe_data:
                data = maybe_data["data"]
                if isinstance(data, dict):
                    return list(data.values())
                elif isinstance(data, list):
                    return data
        return []

    
    mid = core.get("id", {}).get("val")
    if not mid:
        continue

    m_uri = PATHS[f"manuscripts/{mid}"]
    g.add((m_uri, RDF.type, COPTIC.Manuscript))
    g.add((m_uri, DCTERMS.identifier, Literal(mid)))

    def add_literal(field, predicate):
        val = core.get(field, {}).get("val")
        if val:
            g.add((m_uri, predicate, Literal(val)))
    
    # Get identifiers
    id_fields = {
    "cmclid": "CMCL",
    "tm": "TM",
    "ldab": "LDAB",
    "lcbm": "LCBM"
    }

    # Build the dictionary
    identifiers = {}
    for field, label in id_fields.items():
        val = core.get(field, {}).get("val")
        if val:  # Only include non-null values
            identifiers[label] = val
    
    # Add identifiers to the graph
    for id_type, id_value in identifiers.items():
        id_node = BNode()
        g.add((manuscript_uri, DCTERMS.identifier, id_node))
        g.add((id_node, RDF.value, Literal(id_value)))
        g.add((id_node, DCTERMS.publisher, Literal(id_type)))


    # Descriptions
    add_literal("modernhistory", DCTERMS.description)
    add_literal("gennotes", DCTERMS.description)

    # Contents → hasPart
    contents = core.get("contents", {}).get("val")
    if contents:
        g.add((m_uri, DCTERMS.hasPart, Literal(contents)))

    # Chrono range → temporal (both as separate strings)
    if (chronofrom := core.get("chronofrom", {}).get("val")):
        g.add((m_uri, TIME.hasBeginning, Literal(chronofrom)))
    if (chronoto := core.get("chronoto", {}).get("val")):
        g.add((m_uri, TIME.hasEnd, Literal(chronoto)))
    # Typology → has_type
    for field in ["bookform", "writingsupport"]:
        val = core.get(field, {}).get("val")
        if val:
            g.add((m_uri, DCTERMS.medium, Literal(val)))

    # Shelfmark block
    for shelf in get_plugin_data(plugins, "paths__m_shelfmarks"):
        sid = shelf.get("id", {}).get("val")
        if not sid:
            continue
        shelf_uri = PATHS[f"shelfmark/{sid}"]
        collection_id = shelf.get("collection", {}).get("val")
        if not collection_id:
            continue
        collection_uri = PATHS[f"collections/{collection_id}"]
        g.add((manuscript_uri, DCTERMS.isPartOf, collection_uri))
        g.add((collection_uri, DCTERMS.hasPart, manuscript_uri))

        # Full shelf string
        if (full := shelf.get("fullsegnat", {}).get("val")):
            g.add((manuscript_uri, DCTERMS.identifier, Literal(full)))

    # Bibliography block
    for bib_item in get_plugin_data(plugins, "paths__m_biblio"):
        bib_uri = BNode()
        g.add((m_uri, DCTERMS.bibliographicCitation, bib_uri))
        
        # Add bibliographic details
        if (short := bib_item.get("short", {}).get("val")):
            g.add((bib_uri, DCTERMS.description, Literal(short)))
        if (authors := bib_item.get("authors", {}).get("val")):
            g.add((bib_uri, DCTERMS.creator, Literal(authors)))
        if (title := bib_item.get("title", {}).get("val")):
            g.add((bib_uri, DCTERMS.title, Literal(title)))
        if (details := bib_item.get("details", {}).get("val")):
            g.add((bib_uri, DCTERMS.description, Literal(details)))
        if (pages := bib_item.get("pages", {}).get("val")):
            g.add((bib_uri, DCTERMS.extent, Literal(pages)))
        if (series := bib_item.get("series", {}).get("val")):
            g.add((bib_uri, DCTERMS.isPartOf, Literal(series)))
        if (year := bib_item.get("year", {}).get("val")):
            g.add((bib_uri, DCTERMS.issued, Literal(year)))
        

# Serialize to Turtle
g.serialize("../turtles/manuscripts.ttl", format="turtle")
print("✅ RDF data saved to ../turtles/manuscripts.ttl")


✅ RDF data saved to ../turtles/manuscripts.ttl


## Persons

We want output like this:

```
<http://paths.uniroma1.it/atlas/persons/100> a foaf:Person ;
    rdfs:label "Cyrus" ;
    ns1:transliteration "Kyre" ;
    dcterms:identifier "100" ;
    schema1:gender "male" ;
    schema1:roleName "donor" ;
    time:hasBeginning "851" ;
    time:hasEnd "950" ;
    foaf:name "ⲁⲡⲁ ⲕⲩⲣⲉ"@cop .
```

In [5]:
# Define Namespaces
PATHS = Namespace("http://paths.uniroma1.it/atlas/")
SCHEMA = Namespace("http://schema.org/")
LEXINFO = Namespace("http://lexinfo.net/ontology/2.0/lexinfo#")
TIME = Namespace("http://www.w3.org/2006/time#")

g = Graph()
g.bind("paths", PATHS)
g.bind("foaf", FOAF)
g.bind("dcterms", DCTERMS)
g.bind("owl", OWL)
g.bind("schema", SCHEMA)
g.bind("lexinfo", LEXINFO)
g.bind("time", TIME)

# Load person records (replace this with actual file read if needed)
with open("../data/persons.json", "r", encoding="utf-8") as f:
    persons = json.load(f)

for person in persons:
    core = person.get("core", {})
    plugins = person.get("plugins", {})
    
    # Initialize empty list in case no nameforms found
    nameforms_data = {}

    # Handle plugins being either dict or list
    if isinstance(plugins, dict):
        nameforms_data = (
            plugins.get("paths__m_nameforms", {})
            .get("data", {})
        )
    elif isinstance(plugins, list):
        for p in plugins:
            if isinstance(p, dict) and "paths__m_nameforms" in p:
                nameforms_data = (
                    p.get("paths__m_nameforms", {})
                    .get("data", {})
                )
                break  # Found it, stop looking
    
    manual_links = person.get("manualLinks", {})
    manuallinks_data = list(manual_links.values()) if isinstance(manual_links, dict) else []

    pid = core.get("id", {}).get("val")
    if not pid:
        continue

    
    person_uri = PATHS[f"persons/{pid}"]
    g.add((person_uri, RDF.type, FOAF.Person))
    g.add((person_uri, DCTERMS.identifier, Literal(pid)))

    # foaf:name
    name = core.get("name", {}).get("val")
    if name:
        g.add((person_uri, RDFS.label, Literal(name)))

    # copt:profession
    if (prof := core.get("profession", {}).get("val")):
        g.add((person_uri, SCHEMA.roleName, Literal(prof)))

    # type → e.g., copyist, scribe, etc.
    if (ptype := core.get("type", {}).get("val")):
        g.add((person_uri, SCHEMA.roleName, Literal(ptype)))

    # date range
    datefrom = core.get("datefrom", {}).get("val")
    dateto = core.get("dateto", {}).get("val")
    if datefrom:
        g.add((person_uri, TIME.hasBeginning, Literal(datefrom)))
    if dateto and dateto != datefrom:
        g.add((person_uri, TIME.hasEnd, Literal(dateto)))

    # sex
    if (sex := core.get("sex", {}).get("val")):
        g.add((person_uri, SCHEMA.gender, Literal(sex)))

    # place of birth
    if (pb := core.get("placebirth", {}).get("val")):
        place_uri = PATHS[f"places/{pb}"]
        g.add((person_uri, SCHEMA.birthPlace, place_uri))

    for nf_id, nf_record in nameforms_data.items():
        nameform = nf_record.get("nameform", {}).get("val")
        language = nf_record.get("language", {}).get("val")
        translit = nf_record.get("transliteration", {}).get("val")

        if nameform:
            lang_tag = {"Coptic": "cop", "Greek": "el"}.get(language, None)
            if lang_tag:
                g.add((person_uri, FOAF.name, Literal(nameform, lang=lang_tag)))
            else:
                g.add((person_uri, FOAF.name, Literal(nameform)))

        if translit:
            g.add((person_uri, LEXINFO.transliteration, Literal(translit)))

    # colophon links (manualLinks)
    for mlink in manuallinks_data:
        if mlink.get("tb_stripped") == "colophons":
            col_id = mlink.get("ref_id")
            col_uri = PATHS[f"colophons/{col_id}"]
            g.add((person_uri, DCTERMS.isReferencedBy, col_uri))
            g.add((col_uri, DCTERMS.references, person_uri))

# Serialize
g.serialize("../turtles/persons.ttl", format="turtle")
print("✅ Saved RDF data to ../turtles/persons.ttl")


✅ Saved RDF data to ../turtles/persons.ttl


## Works

We want output like this:

```
<http://paths.uniroma1.it/atlas/works/966> a frbr:Work ;
    dcterms:identifier "966",
        "cc0969" ;
    dcterms:isPartOf <http://paths.uniroma1.it/atlas/manuscripts/393> ;
    dcterms:temporal "“Classical” translations – acts of councils and Canones (end of 4th-6th cent.)" ;
    dcterms:title "Collectio Nicaena C" .
```

In [6]:
# Namespaces
PATHS = Namespace("http://paths.uniroma1.it/atlas/")
SCHEMA = Namespace("http://schema.org/")
FRBR = Namespace("http://purl.org/vocab/frbr/core#")

g = Graph()
g.bind("paths", PATHS)
g.bind("dcterms", DCTERMS)
g.bind("rdfs", RDFS)
g.bind("schema", SCHEMA)
g.bind("frbr", FRBR)

def extract_cmcl_number(where_clause):
    match = re.search(r'cc0*(\d+)', where_clause)
    return match.group(1) if match else None

# Load JSON
with open("../data/works.json", encoding="utf-8") as f:
    works = json.load(f)

for work in works:
    core = work.get("core", {})
    # Normalize plugins into a dict (even if it's wrapped in a list)
    raw_plugins = work.get("plugins", {})
    if isinstance(raw_plugins, list):
        # Look for the first dict in the list
        plugins = next((item for item in raw_plugins if isinstance(item, dict)), {})
    else:
        plugins = raw_plugins
    # Normalize links into a dict (even if it's wrapped in a list)
    raw_links = work.get("links", {})
    if isinstance(raw_links, list):
        # Look for the first dict in the list
        links = next((item for item in raw_links if isinstance(item, dict)), {})
    else:
        links = raw_links
    raw_manual_links = work.get("manualLinks", {})
    if isinstance(raw_manual_links, list):
        # Look for the first dict in the list
        manual_links = next((item for item in raw_manual_links if isinstance(item, dict)), {})
    else:
        manual_links = raw_manual_links

    wid = core.get("id", {}).get("val")
    if not wid:
        continue

    work_uri = PATHS[f"works/{wid}"]
    g.add((work_uri, RDF.type, FRBR.Work))
    g.add((work_uri, DCTERMS.identifier, Literal(wid)))

    # Title
    if (title := core.get("title", {}).get("val")):
        g.add((work_uri, DCTERMS.title, Literal(title)))

    # Clavis Coptica ID
    if (clavis := core.get("cmcl", {}).get("val")):
        g.add((work_uri, DCTERMS.identifier, Literal(clavis)))

    # Literary period
    if (period := core.get("litperiod", {}).get("val")):
        g.add((work_uri, DCTERMS.temporal, Literal(period)))

    # Notes
    if (notes := core.get("notes", {}).get("val")):
        g.add((work_uri, DCTERMS.description, Literal(notes)))

    
    # Authorship: link to pre-existing person URIs
    author_data = plugins.get("paths__m_wkauthors", {}).get("data", [])
    for author in author_data:
        author_id = author.get("author", {}).get("val")
        author_type = author.get("type", {}).get("val")
        if author_id:
            author_uri = PATHS[f"authors/{author_id}"]
            g.add((work_uri, DCTERMS.creator, author_uri))
            g.add((author_uri, DCTERMS.creator, work_uri))
            if author_type == "creator":
                g.add((work_uri, RDFS.label, Literal("established authorship")))
            if author_type == "stated":
                g.add((work_uri, RDFS.label, Literal("stated authorship")))

    # Manuscript backlinks
    for entry in manual_links.values():
        if entry.get("tb_stripped") == "manuscripts":
            mid = entry.get("ref_id")
            if mid:
                m_uri = PATHS[f"manuscripts/{mid}"]
                g.add((work_uri, DCTERMS.isPartOf, m_uri))
                g.add((m_uri, DCTERMS.hasPart, work_uri))    

# Output
g.serialize("../turtles/works.ttl", format="turtle")
print("✅ Saved RDF data to ../turtles/works.ttl")


✅ Saved RDF data to ../turtles/works.ttl


## Titles

We want output like this:

```
<http://paths.uniroma1.it/atlas/titles/1041> a coptic:Title ;
    dcterms:description "The title is clearly separated by dotted lines on top and at the bottom of the page.",
        "ⲛⲉⲛⲧⲟⲗⲏ ⲛⲡⲉⲛⲙⲉⲣⲓⲧ ⲛ̄ⲉⲓⲱⲧ ⲁⲡⲁ ⲁⲛⲟⲩⲡ ⲛ̄ⲛⲉ̣ⲣ̣ⲧ̣ⲏ̣ ϩⲛ ⲟⲩⲉⲓⲣⲏⲛⲏ ϩⲁⲙⲏⲛ:"@cop,
        "The commandments of our beloved Father Apa Anoup of Nerte. In peace Amen"@en ;
    dcterms:identifier "1041",
        "ccT0814-I" ;
    dcterms:isPartOf <http://paths.uniroma1.it/atlas/manuscripts/870> ;
    dcterms:type "title" .
```

In [7]:
# Replace this with your actual sample records (truncated here for brevity)
with open("../data/titles.json", "r", encoding="utf-8") as f:
    titles_data = json.load(f)
    
# Namespaces
PATHS = Namespace("http://paths.uniroma1.it/atlas/")
SCHEMA = Namespace("http://schema.org/")
LEXINFO = Namespace("http://lexinfo.net/ontology/2.0/lexinfo#")
COPTIC = Namespace("http://www.semanticweb.org/sjhuskey/ontologies/2025/7/coptic-metadata-viewer/")

# Create graph
g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("paths", PATHS)
g.bind("schema", SCHEMA)
g.bind("lexinfo", LEXINFO)
g.bind("coptic", COPTIC)

# Helper to extract manuscript ID from 'where'
def extract_id(where_clause):
    match = re.search(r"id\|=\|(\d+)", where_clause)
    return match.group(1) if match else None

def extract_cmcl_number(where_clause):
    match = re.search(r'cc0*(\d+)', where_clause)
    return match.group(1) if match else None

for record in titles_data:
    core = record.get("core", {})
    links = record.get("links", {})
    plugins = record.get("plugins", {})

    title_id = core.get("id", {}).get("val")
    if not title_id:
        continue

    title_uri = PATHS[f"titles/{title_id}"]
    g.add((title_uri, RDF.type, COPTIC.Title))
    g.add((title_uri, DCTERMS.type, Literal("title")))
    g.add((title_uri, DCTERMS.identifier, Literal(title_id)))
    # Add simple string fields using dcterms
    if (description := core.get("description", {}).get("val")):
        g.add((title_uri, DCTERMS.description, Literal(description)))
    if (cc := core.get("cc", {}).get("val")):
        g.add((title_uri, DCTERMS.identifier, Literal(cc)))
    if (msid := core.get("msid", {}).get("val")):
        m_uri = PATHS[f"manuscripts/{msid}"]
        g.add((m_uri, DCTERMS.hasPart, title_uri))
        g.add((title_uri, DCTERMS.isPartOf, m_uri))
    if (text := core.get("text", {}).get("val")):
        g.add((title_uri, DCTERMS.description, Literal(text, lang="cop")))
    if (translation := core.get("translation", {}).get("val")):
        g.add((title_uri, DCTERMS.description, Literal(translation, lang="en")))

    # Link to the work
    work_link = links.get("paths__works", {}).get("where", "")
    work_id = extract_cmcl_number(work_link)
    if work_id:
        work_uri = PATHS[f"works/{work_id}"]
        g.add((title_uri, DCTERMS.references, work_uri))
        g.add((work_uri, DCTERMS.isReferencedBy, title_uri))

# Output
g.serialize("../turtles/titles.ttl", format="turtle")
print("✅ Saved RDF data to ../turtles/titles.ttl")

✅ Saved RDF data to ../turtles/titles.ttl


## Colophons

We want output like this:

```
<http://paths.uniroma1.it/atlas/colophons/105> a coptic:Colophon ;
    ns1:translation "] because (he is) the one which took care of this book at his own expenses, he gave it to the place (τόπος) of Apa E[pima of the acaci]a for the release of his soul (ψυχή) so that might the God of the two great archangels (ἀρχάγγελος) Michael and Gabriel and the Christ[-loving] saint [Epima] protect all his business for blessing him in this world (κόσμος) with everything which belongs to him, [might give] him plenty of peaceful (εἰρηνικός) years [---] him safe from all the snares of the devil (διάβολος) and, then, when he will go out of his body (σῶμα) and he will inherit (κληρονομεῖν) the things of sky, so that might the saint apa E[---] entreat (παρακαλεῖν) the Lord on his behalf so that he could be worthy of hearing the voice full of every joy saying come to me, the blessed of my father, inherit the kingdom which was prepared for you from the foundation (καταβολή) of the world, amen, (so) be it [---].    "@en ;
    dcterms:description "[± 17]ⲛ̣|[±16]ⲡⲁ|[± 11 ϫⲉⲛⲧⲟ]ϥ | [ⲡⲉⲛⲧⲁϥϥⲓⲡⲣⲟⲟⲩϣ ⲙ]ⲡⲓ|(5)[ϫⲱⲱⲙⲉ ϩⲛⲛⲉϥϩ]ⲓⲥⲉ | [ⲙⲙⲓⲛ ⲙⲙⲟϥ ⲁϥⲧⲁ]ⲁϥ | [ⲉϩⲟⲩⲛ ⲉⲡⲧⲟⲡⲟⲥ ⲛ]ⲁ̣ⲡⲁ ⲉ|[ⲡⲓⲙⲁ ⲙ̄ⲡϣⲁⲛⲧ]ⲉ̣ ϩⲁⲡⲟⲩ|[ϫⲁⲓ ⲛⲧⲉϥⲯⲩⲭ]ⲏ̣ ϫⲉⲕⲁⲥ | (10) [ⲉⲣⲉⲡⲛⲟⲩⲧⲉ ⲙⲛⲡ]ⲛ̣ⲟϭⲥⲛⲁⲩ | [ⲛⲁⲣⲭⲁⲅⲅⲉⲗⲟⲥ] ⲙⲓⲭⲁⲏⲗ | [ⲙⲛⲅⲁⲃⲣⲓⲏⲗ] ⲙ̣ⲛⲡϩⲁ̣ⲅ̣ⲓⲟⲥ | [ⲉⲡⲓⲙⲁ ⲙⲙⲁⲓ]ⲡⲉⲭ(ⲣⲓⲥⲧⲟ)ⲥ̣ ⲣ̣ⲟ̣ⲉ̣|[ⲓⲥ ⲉⲛⲉϥⲭⲣⲓⲁ] ⲧⲏⲣⲟⲩ ⲉ̣ⲩ|(15)[ⲥⲙⲟⲩ ⲉⲣⲟϥ] ϩⲙⲡⲓⲕⲟⲥⲙⲟⲥ | [ⲙⲛⲛⲕⲁ ⲛⲓⲙ ⲉ]ⲧ̣ϣⲟⲟⲡ ⲛⲁϥ | [± 5]ⲏ̣[  ̣] ⲛⲁ̣ϥ̣ ⲛⲟⲩ̣|[ⲙⲏⲏϣⲉ ⲛⲣⲟ]ⲙⲡⲉ ⲛⲓⲣⲏⲛⲓ|[ⲕⲟⲛ ± 6 ⲧ]ⲟ̣ⲩϫⲏⲩ ⲉ̣ⲙ|(20)[ⲙⲟϥ ⲛⲛϭⲟⲣϭⲥ ⲧⲏ]ⲣⲟⲩ ⲙ̣ⲓ̣ⲡⲓ|[ⲇⲓⲁⲃⲟⲗⲟⲥ ⲁⲩⲱ] ⲟⲛ ⲉϥϣⲁⲛ|[ⲉⲓ ⲉⲃⲟⲗ ϩⲛⲥⲱ]ⲙ̣ⲁ̣ ⲛϥϫⲓ|[ⲕⲗⲏⲣⲟⲛⲟⲙⲓⲁ ⲛ]ⲛ̣ⲁ̣ⲧⲡⲉ | [ϫⲉⲕⲁⲥ ⲉⲣⲉⲡϩ]ⲁ̣ⲅⲓⲟⲥ ⲁⲡⲁ ⲉ|(25)[± 4 ⲛⲁⲡⲁⲣⲁ]ⲕ̣ⲁⲗⲓ ⲙⲡϭ(ⲟⲓ)ⲥ | [ⲉϩⲣⲁⲓ ⲉϫⲱϥ ⲛϥⲙ]ⲡ̣ϣⲁ ⲛ̇ⲥⲱⲧⲙ | [ⲉⲧⲉⲥⲙⲏ ⲉⲧⲙ]ⲉϩ ⲛ̇ⲣⲁϣⲉ̣ | [ⲛⲓⲙ ϫⲉⲁⲙⲏ]ⲓⲧⲛ̇ ϣⲁⲣⲟⲓ̈ | [ⲛⲉⲧⲥⲙⲁⲙⲁⲁⲧ] ⲛⲧ[ⲉ]ⲡⲁ|(30)[ⲉⲓⲱⲧ ⲛⲧⲉⲧⲛ]ⲕⲗⲏⲣⲟ|[ⲛⲟⲙⲉⲓ ⲛⲧⲙⲛⲧⲉ]ⲣ̣ⲟ̇ ⲛ̇ⲧⲁⲩ|[ⲥⲃⲧⲱⲧⲥ ⲛⲏⲧ]ⲛ ⲛϫⲓⲛ|[ⲧⲕⲁⲧⲁⲃⲟⲗⲏ ⲙ]ⲡⲕⲟⲥⲙⲟⲥ | ϩⲁⲙⲏⲛ ⲉⲥⲉϣⲱ]ⲡⲉ | (35) [± 10]ⲡ̣ⲟ̣ⲡⲟⲥ | [± 12]ⲅⲁ | [± 11]ⲉ"@cop,
        "Only the left part of the text remains, due to the bad stato of preservation of the parchment sheet. The two sections are separated by bands made of dots and dashes."@en ;
    dcterms:identifier "105" ;
    dcterms:isPartOf <http://paths.uniroma1.it/atlas/manuscripts/209> ;
    dcterms:references <http://paths.uniroma1.it/atlas/persons/419> ;
    dcterms:type "colophon" ;
    time:hasBeginning "801" ;
    time:hasEnd "925" .
```

In [8]:
# Replace this with your actual sample records (truncated here for brevity)
with open("../data/colophons.json", "r", encoding="utf-8") as f:
    colophons_data = json.load(f)

# Namespaces
PATHS = Namespace("http://paths.uniroma1.it/atlas/")
SCHEMA = Namespace("http://schema.org/")
TIME = Namespace("http://www.w3.org/2006/time#")
COPTIC = Namespace("http://www.semanticweb.org/sjhuskey/ontologies/2025/7/coptic-metadata-viewer/")

# Create graph
g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("paths", PATHS)
g.bind("schema", SCHEMA)
g.bind("time", TIME)
g.bind("coptic", COPTIC)

# Helper to extract manuscript ID from 'where'
def extract_id(where_clause):
    match = re.search(r"id\|=\|(\d+)", where_clause)
    return match.group(1) if match else None

def extract_cmcl_number(where_clause):
    match = re.search(r'cc0*(\d+)', where_clause)
    return match.group(1) if match else None

for record in colophons_data:
    core = record.get("core", {})
    links = record.get("links", {})
    plugins = record.get("plugins", {})
    # Normalize links into a dict (even if it's wrapped in a list)
    raw_manual_links = record.get("manualLinks", {})
    if isinstance(raw_manual_links, list):
        # Look for the first dict in the list
        manual_links = next((item for item in raw_manual_links if isinstance(item, dict)), {})
    else:
        manual_links = raw_manual_links

    colophon_id = core.get("id", {}).get("val")
    if not colophon_id:
        continue

    colophon_uri = PATHS[f"colophons/{colophon_id}"]
    g.add((colophon_uri, RDF.type, COPTIC.Colophon))
    g.add((colophon_uri, DCTERMS.identifier, Literal(colophon_id)))
    # Chrono range → temporal (both as separate strings)
    g.add((colophon_uri, DCTERMS.type, Literal("colophon")))
    if (chronofrom := core.get("chronofrom", {}).get("val")):
        g.add((colophon_uri, TIME.hasBeginning, Literal(chronofrom)))
    if (chronoto := core.get("chronoto", {}).get("val")):
        g.add((colophon_uri, TIME.hasEnd, Literal(chronoto)))
    if (istitutionplace := core.get("institutionplace", {}).get("val")):
        inst_uri = PATHS[f"places/{istitutionplace.lower().replace(' ', '-')}"]
        g.add((colophon_uri, DCTERMS.spatial, inst_uri))
        g.add((inst_uri, RDF.type, DCMITYPE.Place))
        g.add((inst_uri, RDFS.label, Literal(core.get("institutionplace", {}).get("val_label"))))
    if (description := core.get("description", {}).get("val")):
        g.add((colophon_uri, DCTERMS.description, Literal(description, lang="en")))
    if (text := core.get("text", {}).get("val")):
        g.add((colophon_uri, DCTERMS.description, Literal(text, lang="cop")))
    if (translation := core.get("translation", {}).get("val")):
        g.add((colophon_uri, LEXINFO.translation, Literal(translation, lang="en")))

    # Link to the manuscript
    msid = extract_id(links.get("paths__manuscripts", {}).get("where", ""))
    if msid:
        ms_uri = PATHS[f"manuscripts/{msid}"]
        g.add((ms_uri, DCTERMS.hasPart, colophon_uri))
        g.add((colophon_uri, DCTERMS.isPartOf, ms_uri))

    # Link to the persons
    for mlink in manual_links.values():
        if mlink.get("tb_stripped") == "persons":
            person_id = mlink.get("ref_id")
            if person_id:
                person_uri = PATHS[f"persons/{person_id}"]
                g.add((colophon_uri, DCTERMS.references, person_uri))

# Output
g.serialize("../turtles/colophons.ttl", format="turtle")
print("✅ Saved RDF data to ../turtles/colophons.ttl")

✅ Saved RDF data to ../turtles/colophons.ttl


## Places

We want output like this:

```
<http://paths.uniroma1.it/atlas/places/345> a lawd:Place ;
    rdfs:label "Talit"@en ;
    lawd:primaryForm "Talit"@en ;
    skos:exactMatch <http://pleiades.stoa.org/places/737065>,
        <https://www.trismegistos.org/place/2236> .
```

In [9]:
# Load your plain JSON
with open("../data/places.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Setup RDF graph and namespaces
g = Graph()
LAWD = Namespace("http://lawd.info/ontology/")
g.bind("lawd", LAWD)
g.bind("skos", SKOS)
g.bind("rdfs", RDFS)

# Loop through top-level entries (skip _:genid ones)
for uri, properties in data.items():
    if not uri.startswith("http://paths.uniroma1.it/atlas/places/"):
        continue  # Skip _:genid blocks

    subj = URIRef(uri)

    # Add type triple if present
    types = properties.get("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", [])
    for t in types:
        g.add((subj, RDF.type, URIRef(t["value"])))

    # Add exactMatch links
    matches = properties.get("http://www.w3.org/2004/02/skos/core#exactMatch", [])
    for m in matches:
        g.add((subj, SKOS.exactMatch, URIRef(m["value"])))

    # Add rdfs:label
    labels = properties.get("http://www.w3.org/2000/01/rdf-schema#label", [])
    for l in labels:
        lang = l.get("lang")
        g.add((subj, RDFS.label, Literal(l["value"], lang=lang)))

    # Add lawd:primaryForm
    primary_forms = properties.get("http://lawd.info/ontology/primaryForm", [])
    for pf in primary_forms:
        lang = pf.get("lang")
        g.add((subj, LAWD.primaryForm, Literal(pf["value"], lang=lang)))

# Output the filtered Turtle
g.serialize("../turtles/places.ttl", format="turtle")
print("✅ Wrote '../turtles/places.ttl'")


✅ Wrote '../turtles/places.ttl'


## Validate TTL Files

Make sure that all the turtles are valid.

In [10]:
g = Graph()
# Loop over all Turtle files in the directory
for ttl_file in glob.glob("../turtles/*.ttl"):
    if not os.path.isfile(ttl_file):
        continue  # Skip if it's not a file
    print(f"Validating {ttl_file}...")
    g.parse(ttl_file, format="turtle")
print("TTL is valid!")

Validating ../turtles/persons.ttl...
Validating ../turtles/works.ttl...
Validating ../turtles/manuscripts.ttl...
Validating ../turtles/colophons.ttl...
Validating ../turtles/collections.ttl...
Validating ../turtles/statements.ttl...
Validating ../turtles/authors.ttl...
Validating ../turtles/places.ttl...
Validating ../turtles/titles.ttl...
TTL is valid!


## Merge the Turtle files into one graph

In [11]:
# Directory containing your Turtle files
input_dir = "../turtles"
output_file = "../turtles/graph.ttl"

# Create an empty graph
merged_graph = Graph()

# Iterate through Turtle files and parse each one
for filename in os.listdir(input_dir):
    if filename.endswith(".ttl"):
        path = os.path.join(input_dir, filename)
        merged_graph.parse(path, format="turtle")
        print(f"Merged: {filename}")

# Serialize the merged graph (duplicates are already removed)
merged_graph.serialize(destination=output_file, format="turtle")
print(f"Total triples in original graphs: {sum(len(Graph().parse(os.path.join(input_dir, f), format='turtle')) for f in os.listdir(input_dir) if f.endswith('.ttl'))}")
print(f"Total triples in merged graph: {len(merged_graph)}")

# Validate the merged graph
g = Graph()
g.parse(output_file, format="turtle")
print("Merged Turtle graph is valid!")
print(f"✅ Merged Turtle graph is valid and saved to {output_file}")

Merged: persons.ttl
Merged: works.ttl
Merged: manuscripts.ttl
Merged: colophons.ttl
Merged: collections.ttl
Merged: statements.ttl
Merged: authors.ttl
Merged: places.ttl
Merged: titles.ttl
Total triples in original graphs: 470639
Total triples in merged graph: 188288
Merged Turtle graph is valid!
✅ Merged Turtle graph is valid and saved to ../turtles/graph.ttl


## Add inferences

This block mimics the action of a reasoner. It looks for triples that reference other triples and adds them explicitly to existing turtles.

In [12]:
from rdflib import Graph, Namespace, URIRef

# Load your graph
g = Graph()
g.parse("../turtles/graph.ttl", format="ttl")  # Adjust if needed

# Define the namespaces
DCTERMS = Namespace("http://purl.org/dc/terms/")

# Apply the inference rule
inferred_triples = set()
for title, _, work in g.triples((None, DCTERMS.references, None)):
    for _, _, author in g.triples((work, DCTERMS.creator, None)):
        triple = (title, DCTERMS.creator, author)
        if triple not in g:
            inferred_triples.add(triple)

for title, _, work in g.triples((None, DCTERMS.references, None)):
    triple = (work, DCTERMS.isReferencedBy, title)
    if triple not in g:
        inferred_triples.add(triple)

# Add inferred triples to the graph
for triple in inferred_triples:
    g.add(triple)

# Save the updated graph (optional)
g.serialize("../turtles/graph_with_inference.ttl", format="turtle")

<Graph identifier=N594dba656a26407db4c41202ebf99922 (<class 'rdflib.graph.Graph'>)>

## Get the overall schema for the graph

In [13]:
from collections import defaultdict
from rdflib import Graph

# Load the merged graph
g = Graph()
g.parse("../turtles/graph_with_inference.ttl", format="turtle")

# Dictionary to store classes and their properties
class_properties = defaultdict(set)

# Get all triples and organize by class
for subject, predicate, obj in g:
    # Get the types (classes) of the subject
    for s, p, class_uri in g.triples((subject, RDF.type, None)):
        if s == subject:  # Make sure we're looking at the right subject
            # Convert URIs to prefixed format
            try:
                class_name = g.namespace_manager.qname(class_uri)
            except:
                class_name = str(class_uri)
            
            try:
                property_name = g.namespace_manager.qname(predicate)
            except:
                property_name = str(predicate)
            
            class_properties[class_name].add(property_name)

# Sort and display results
for class_name in sorted(class_properties.keys()):
    print(f"\n{class_name}:")
    for prop in sorted(class_properties[class_name]):
        print(f"  {prop}")


coptic:Author:
  dcterms:creator
  dcterms:description
  dcterms:identifier
  foaf:name
  owl:sameAs
  rdf:type
  schema1:title

coptic:Colophon:
  dcterms:description
  dcterms:identifier
  dcterms:isPartOf
  dcterms:references
  dcterms:type
  ns1:translation
  rdf:type
  time:hasBeginning
  time:hasEnd

coptic:Manuscript:
  dcterms:bibliographicCitation
  dcterms:description
  dcterms:hasPart
  dcterms:identifier
  dcterms:isPartOf
  dcterms:medium
  rdf:type
  time:hasBeginning
  time:hasEnd

coptic:Title:
  dcterms:creator
  dcterms:description
  dcterms:identifier
  dcterms:isPartOf
  dcterms:references
  dcterms:type
  rdf:type

dcmitype:Collection:
  dcterms:hasPart
  dcterms:identifier
  dcterms:spatial
  rdf:type
  schema1:location
  schema1:name

foaf:Person:
  dcterms:identifier
  dcterms:isReferencedBy
  foaf:name
  ns1:transliteration
  rdf:type
  rdfs:label
  schema1:birthPlace
  schema1:gender
  schema1:roleName
  time:hasBeginning
  time:hasEnd

frbr:Work:
  dcterms:c