Note: This is a work in progress. I am shifting to use the Geoscience Ontology (GSO) for geologic time information in the GeoKB, along with a number of other reference sets.

This notebook works through a structured source for the geologic time scale to introduce items into the GeoKB representing eon, era, period, sub-period, epoch, and age units that other related items can link to. The source used is the 2020 version of the [International Chronostratigraphic Chart] as represented in the [timescale RDF structure](https://github.com/CGI-IUGS/timescale-data) from Simon Cox and Steve Richard. Following our principle of pragmatic semantics for the GeoKB, we are building out representations of reference concepts and entities only to the level of detail needed for our own specific use cases. If we need to go back later and bring in more content from one or more sources, we can do that because we are building in and retaining solid provenance and linkages to resolvable identifiers.

This particular case prompted me to rethink a few things in how we've laid out properties and classification structure within the GeoKB. This is the first time I've worked through processing what we want from a source that is organized against a suite of formal ontologies (most of this is part of mature GeoSciML work). Building a set of items within Wikibase that is structured on a more informal foundation within the context of a broad knowledge graph that will contain many other types of content forced me to narrow in a bit and think through the implications for this and other cases where we want to base a part of our system on other formal ontological encodings.

I ended up narrowing in on just the names, the geochronological rank (eon, era, etc.), and the basic broader/narrower relationships. The latter prompted me to leverage the formal time interval concepts from the OWL Time ontology. For now, I restricted this to the intervalDuring and intervalContains properties, instantiated in the GeoKB as item-type properties pointing to the corresponding item representations for the higher level and lower level "eras" (the ICS GTS uses "GeochronologicalEra" as the higher level concept for all ranks, which I also followed in the GeoKB representation). Another approach, perhaps more in keeping with at least some Wikidata conventions, would have been to establish separate properties for eon, era, period, and so on to house claims that point a given item to its containing geologic time period and the time periods it contains. One approach requires chasing down the meaning of the property and the other may require chasing down the further definition of the linked items. I chose to build a few new properties more in keeping with the Cox/Richard encoding, which is more semantically correct, and we'll have to see if it communicates and works in practice.

I also introduced an explicit new property for IRI as an identifier to a presumably semantically robust resolver. For our purposes, this will indicate that a given item sources directly to that identifier as its foundation, giving us the linkage to go back to the source for additional definition and characteristics.

I also reworked a bit of what I had been trying to figure out about pointing a given property or entity in the GeoKB to some other place where information about that thing can be found. I used the owl:sameAs property concept here in the same way that the ICS GTS uses it (pointing in that case to DBpedia identifiers). I'll end up using this in place of "related wikidata item" and "related wikidata property." I may also revisit this and introduce a set of qualifiers on "same as" claims, clarifying exactly what the relationship is. I still need to deal with the fact that even when something in another knowledge graph is basically the same thing as something in our GeoKB, there are still some caveats that need to be applied in practice.

In [1]:
import requests
from wbmaker import WikibaseConnection
import rdflib
import re
import pandas as pd

In [None]:
geokb = WikibaseConnection("GEOKB_CLOUD")

# Quick exploration
This is a simple look at the most important aspects of the ISC Chart part of geo time in the GSO that we need to work through and bring into the GeoKB. It also looks at a new way to pull the source reference from the item in the GeoKB documenting the data source. The resulting table was shared for discussion.

In [2]:
# Experimenting with a new way of recording and retrieving data source details
# Need to make this much more bulletproof
def data_source(qid, source_property='P160'):
    source_item_url = f"https://geokb.wikibase.cloud/wiki/Special:EntityData/{qid}.json"
    r = requests.get(source_item_url)
    if r.status_code != 200:
        return {'error': r.status_code}
    try:
        source_item = r.json()
    except Exception as e:
        return {'error': str(e)}
    
    source_record = source_item['entities'][qid]

    if source_property not in source_record['claims']:
        return {'error': f"Property {source_property} not found"}
    else:
        data_values = [i['mainsnak']['datavalue'] for i in source_record['claims'][source_property]]
        if len(data_values) == 1:
            return data_values[0]['value']
        else:
            return [i['value'] for i in data_values]
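
One possible step toward making this lookup more robust, sketched here under assumptions: keep the JSON handling separate from the HTTP call so it can be exercised offline, always return a list, and skip statements whose main snak has no `datavalue` (Wikibase marks those with a `novalue` or `somevalue` snak type). The helper name `claim_values` and the sample record are hypothetical and are not part of the notebook.

```python
def claim_values(entity_json, qid, source_property="P160"):
    """Return a list of main-snak values for one property, or [] if there are none."""
    record = entity_json.get("entities", {}).get(qid, {})
    claims = record.get("claims", {}).get(source_property, [])
    values = []
    for claim in claims:
        snak = claim.get("mainsnak", {})
        # 'novalue' and 'somevalue' snaks carry no 'datavalue' key
        if snak.get("snaktype") == "value" and "datavalue" in snak:
            values.append(snak["datavalue"]["value"])
    return values


# Hypothetical record shaped like Special:EntityData JSON output
sample = {
    "entities": {
        "Q1": {
            "claims": {
                "P160": [
                    {"mainsnak": {"snaktype": "value",
                                  "datavalue": {"value": "https://example.org/data.ttl",
                                                "type": "string"}}},
                    {"mainsnak": {"snaktype": "novalue"}},
                ]
            }
        }
    }
}

print(claim_values(sample, "Q1"))          # one usable value
print(claim_values(sample, "Q1", "P999"))  # property absent: empty list
```

Returning a list in every case means the caller no longer has to check whether it received a string, a list, or an error dictionary before passing the result on.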


In [3]:
# Pull the graph from raw GitHub source
gstime = rdflib.Graph()
gstime.parse(
    data_source('Q26294'),
    format="ttl"
)

# Query for the available properties on gst:Age
query_age = """
    SELECT ?iri ?label ?issued ?comment ?isPartOf ?nextTimeInterval ?previousTimeInterval ?timeHasBeginning ?timeHasEnd ?source 
    WHERE {
        ?iri a gst:Age ;
            rdfs:label ?label ;
            dcterms:issued ?issued ;
            rdfs:comment ?comment ;
            gsoc:isPartOf ?isPartOf ;
            gsoc:nextTimeInterval ?nextTimeInterval ;
            gsoc:previousTimeInterval ?previousTimeInterval ;
            gsoc:timeHasBeginning ?timeHasBeginning ;
            gsoc:timeHasEnd ?timeHasEnd .
        OPTIONAL {
            ?iri gsoc:source ?source .
        }
    }
"""

gstime_ages = gstime.query(query_age)

# Process out the rows into a list of dicts
rows = []
for row in gstime_ages:
    record = {}
    for i, prop in enumerate([str(i) for i in gstime_ages.vars]):
        record[prop] = str(row[i])
    rows.append(record)

# Convert to a pandas dataframe
df_gstime_ages = pd.DataFrame(rows)

# Aggregate the values on unique IRI with a little bit of processing for readability
def aggregator(x):
    return str(x.unique()[0]) if len(x.unique()) == 1 else ' > '.join(map(str, sorted(x.unique())))

gstime_ages_grouped = df_gstime_ages.groupby('iri', as_index=False).agg(aggregator)

gstime_ages_grouped.head(10)

Unnamed: 0,iri,label,issued,comment,isPartOf,nextTimeInterval,previousTimeInterval,timeHasBeginning,timeHasEnd,source
0,https://w3id.org/gso/ischart/Aalenian2004,Aalenian Age (2004),2004-04-01,older bound -175.6 +|-2.0 Ma > younger bound -...,https://w3id.org/gso/geologictime/AalenianAge,https://w3id.org/gso/ischart/Bajocian2004,https://w3id.org/gso/ischart/LowerJurassic2004...,https://w3id.org/gso/ischart/BaseMiddleJurassi...,https://w3id.org/gso/ischart/BaseBajocian2004,
1,https://w3id.org/gso/ischart/Aalenian2012,Aalenian Age (2012),2012-08-01,older bound -174.1 +|-1.0 Ma > younger bound -...,https://w3id.org/gso/geologictime/AalenianAge,https://w3id.org/gso/ischart/Bajocian2012,https://w3id.org/gso/ischart/LowerJurassic2012...,https://w3id.org/gso/ischart/BaseMiddleJurassi...,https://w3id.org/gso/ischart/BaseBajocian2012,
2,https://w3id.org/gso/ischart/Aeronian2004,Aeronian Age,2004-04-01,older bound -439.0 +|-1.8 Ma > younger bound -...,https://w3id.org/gso/geologictime/AeronianAge,https://w3id.org/gso/ischart/Telychian2004,https://w3id.org/gso/ischart/Rhuddanian2004,https://w3id.org/gso/ischart/BaseAeronian2004,https://w3id.org/gso/ischart/BaseTelychian2004,
3,https://w3id.org/gso/ischart/Aeronian2012,Aeronian Age,2012-08-01,older bound -440.8 +|-1.2 Ma > younger bound -...,https://w3id.org/gso/geologictime/AeronianAge,https://w3id.org/gso/ischart/Telychian2012,https://w3id.org/gso/ischart/Rhuddanian2012,https://w3id.org/gso/ischart/BaseAeronian2012,https://w3id.org/gso/ischart/BaseTelychian2012,
4,https://w3id.org/gso/ischart/Albian2004,Albian Age,2004-04-01,older bound -112.0 +|-1.0 Ma > predates defini...,https://w3id.org/gso/geologictime/AlbianAge,https://w3id.org/gso/ischart/Cenomanian2004 > ...,https://w3id.org/gso/ischart/Aptian2004,https://w3id.org/gso/ischart/BaseAlbian2004,https://w3id.org/gso/ischart/BaseUpperCretaceo...,
5,https://w3id.org/gso/ischart/Albian2012,Albian Age,2012-08-01,older bound -113.0 Ma > younger bound -100.5 Ma,https://w3id.org/gso/geologictime/AlbianAge,https://w3id.org/gso/ischart/Cenomanian2012 > ...,https://w3id.org/gso/ischart/Aptian2012,https://w3id.org/gso/ischart/BaseAlbian2012,https://w3id.org/gso/ischart/BaseUpperCretaceo...,
6,https://w3id.org/gso/ischart/Anisian2004,Anisian Age,2004-04-01,older bound -245.0 +|-1.5 Ma > younger bound -...,https://w3id.org/gso/geologictime/AnisianAge,https://w3id.org/gso/ischart/Ladinian2004,https://w3id.org/gso/ischart/LowerTriassic2004...,https://w3id.org/gso/ischart/BaseMiddleTriassi...,https://w3id.org/gso/ischart/BaseLadinian2004,
7,https://w3id.org/gso/ischart/Anisian2012,Anisian Age,2012-08-01,older bound -247.2 Ma > younger bound -242 Ma,https://w3id.org/gso/geologictime/AnisianAge,https://w3id.org/gso/ischart/Ladinian2012 > ht...,https://w3id.org/gso/ischart/LowerTriassic2012...,https://w3id.org/gso/ischart/BaseMiddleTriassi...,https://w3id.org/gso/ischart/BaseLadinian2012,
8,https://w3id.org/gso/ischart/Aptian2004,Aptian Age,2004-04-01,older bound -125.0 +|-4.0 Ma > younger bound -...,https://w3id.org/gso/geologictime/AptianAge,https://w3id.org/gso/ischart/Albian2004,https://w3id.org/gso/ischart/Barremian2004,https://w3id.org/gso/ischart/BaseAptian2004,https://w3id.org/gso/ischart/BaseAlbian2004,
9,https://w3id.org/gso/ischart/Aptian2012,Aptian Age,2012-08-01,older bound -125.0 Ma > younger bound -113.0 Ma,https://w3id.org/gso/geologictime/AptianAge,https://w3id.org/gso/ischart/Albian2012,https://w3id.org/gso/ischart/Barremian2012,https://w3id.org/gso/ischart/BaseAptian2004,https://w3id.org/gso/ischart/BaseAlbian2012,
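
The ` > ` separators in the table come from the aggregation step: the SPARQL result holds one row per combination of matched values, so an IRI with two comments shows up twice until the rows are grouped. A minimal sketch of that behavior on made-up rows (the `ex:` identifiers and values are illustrative, not taken from the source graph):

```python
import pandas as pd


def aggregator(x):
    # A single distinct value is kept as-is; several distinct values
    # are sorted and joined with ' > '
    unique = x.unique()
    return str(unique[0]) if len(unique) == 1 else " > ".join(map(str, sorted(unique)))


# Made-up rows standing in for the flattened SPARQL result
toy = pd.DataFrame([
    {"iri": "ex:A", "label": "A Age", "comment": "older bound -2 Ma"},
    {"iri": "ex:A", "label": "A Age", "comment": "younger bound -1 Ma"},
    {"iri": "ex:B", "label": "B Age", "comment": "older bound -3 Ma"},
])

grouped = toy.groupby("iri", as_index=False).agg(aggregator)
print(grouped.to_dict("records"))
```

Because the join sorts the strings alphabetically, the order within a cell reflects the text of the values and not any temporal ordering.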


# Older approach now being reworked

I ended up using the 2020 RDF/TTL encoding of the ICS time scale from the GitHub source. I noticed that there is a 2023 graphical chart from the web site, and it may be that the maintenance of the underlying encoded information might have shifted elsewhere. We can track that down, but this should suffice for the time being. We still also need to figure out exactly how we are going to deal with changes in things like this over time within the GeoKB.

In [None]:
source_item_label = "International Chronostratigraphic Chart v.2020"
source_item_qid = geokb.ref_lookup[source_item_label]
source_item = geokb.wbi.item.get(source_item_qid)

knowledgebase_source_claim = next((i for i in source_item.claims.get_json()[geokb.prop_lookup['instance of']] if i["mainsnak"]["datavalue"]["value"]["id"] == geokb.class_lookup['knowledgebase source']), None)
reference_url = knowledgebase_source_claim["references"][0]["snaks"][geokb.prop_lookup['reference URL']][0]["datavalue"]["value"]

references = geokb.models.References()
references.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['knowledge source'],
        value=geokb.ref_lookup[source_item_label]
    )
)

In [None]:
# Only the rdflib module is imported above, so reference Graph through it
gts_2020 = rdflib.Graph()
gts_2020.parse(reference_url, format='ttl')

The turtle (TTL) representation of the ICS GTS is the same underlying RDF structure we have in Wikibase. I used the SPARQL method here to tease out what I want to work with in creating items. In this first pass, I am simply pulling each thing with a rank to get IRI identifiers, labels, and ranks. I go ahead and pull the time:intervalIn label to use in the description so that we get a more meaningful descriptive statement. The following codeblock pulls the query and builds a simplified data structure to work through (sample output shown).

There is some issue in the data: the initial iteration of my query, which looked for anything with a gts:rank, missed a couple (2) of items. That should essentially get all ranks, but I missed the all-important Phanerozoic Eon record. The only thing I found was an extraneous space in the label, but I'm still not sure why that would have caused the query to not return that record. This new query gets everything with a label and then makes the other properties I want optional.

In [None]:
q = """
    SELECT ?item ?label ?geo_rank ?interval_in_label
    WHERE {
        ?item rdfs:label ?label .
        OPTIONAL {
            ?item gts:rank ?geo_rank .
        }
        OPTIONAL {
            ?item time:intervalIn ?interval_in .
            ?interval_in rdfs:label ?interval_in_label .
        }
    }
"""

geo_eras = []
for r in gts_2020.query(q):
    record = {
        "iri": re.sub("<|>", "", r["item"].n3()),
        "label": r["label"].strip()
    }
    if r["geo_rank"] is not None:
        rank_str = re.sub("<|>", "", r["geo_rank"].n3()).split("/")[-1]
        record["geo_rank"] = rank_str
        record["geo_rank_qid"] = geokb.class_lookup[rank_str] if rank_str in geokb.class_lookup else None
        record["alt_label"] = r["label"].replace(rank_str, "").strip()
    
        if r["interval_in_label"] is not None:
            record["description"] = f"a geologic {rank_str} in the {r['interval_in_label']}"
        else:
            record["description"] = f"a geologic {rank_str}"

        geo_eras.append(record)

df_geo_eras = pd.DataFrame(geo_eras)
geo_eras[99]

In [None]:
eras_query = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fera%20%3Flabel%20%3Firi%0AWHERE%20%7B%0A%20%20VALUES%20%3Finstance_of%20%7B%20wd%3AQ26292%20wd%3AQ26288%20wd%3AQ26291%20wd%3AQ26289%20wd%3AQ26290%20wd%3AQ26293%20wd%3AQ26287%20%7D%0A%20%20%3Fera%20wdt%3AP1%20%3Finstance_of%20.%0A%20%20%3Fera%20rdfs%3Alabel%20%3Flabel%20.%0A%20%20%3Fera%20wdt%3AP80%20%3Firi%20.%0A%7D%0A"

df_wb_eras = geokb.wb_ref_data(query=eras_query)
df_wb_eras["era"] = df_wb_eras.era.apply(lambda x: x.split("/")[-1])

missing_eras = df_geo_eras[~df_geo_eras.iri.isin(df_wb_eras.iri)]
missing_eras
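
The percent-encoded `eras_query` string is hard to read and to edit. One option, sketched here with the standard library, is to keep the query as readable SPARQL and encode it when needed. The readable text below is simply the decoded form of the string above. Whether `wb_ref_data` expects exactly this encoding for other queries is an assumption; the notebook only shows the pre-encoded form.

```python
from urllib.parse import quote, unquote

# Decoded form of the eras_query string used in the cell above
readable_query = """PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?era ?label ?iri
WHERE {
  VALUES ?instance_of { wd:Q26292 wd:Q26288 wd:Q26291 wd:Q26289 wd:Q26290 wd:Q26293 wd:Q26287 }
  ?era wdt:P1 ?instance_of .
  ?era rdfs:label ?label .
  ?era wdt:P80 ?iri .
}
"""

# safe="" also encodes '/' so that the result matches the style of the original string
encoded_query = quote(readable_query, safe="")

print(encoded_query[:60])
print(unquote(encoded_query) == readable_query)
```

Keeping the readable form in the notebook also makes it easier to see which classes (the `VALUES` list) and properties (`P1`, `P80`) the lookup depends on.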

In [None]:
for index, era in missing_eras.iterrows():
    item = geokb.wbi.item.new()
    item.labels.set('en', era["label"])
    item.descriptions.set('en', era["description"])
    item.aliases.set('en', era["alt_label"])

    claims = geokb.models.Claims()
    claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['instance of'],
            value=era['geo_rank_qid'],
            references=references
        )
    )

    claims.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup['IRI'],
            value=era['iri'],
            references=references
        )
    )

    item.claims.add(claims)
    response = item.write(summary="Added initial geochronologic era record from source")
    print(era["label"], response.id)
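
Since `geo_rank_qid` is set to `None` whenever a rank has no match in the class lookup, it may be worth setting such records aside before writing anything. A minimal sketch of that check, with a hypothetical helper name (`split_writable`) and made-up records and QIDs:

```python
def split_writable(records, required=("label", "description", "geo_rank_qid", "iri")):
    """Partition records into those with every required field set and the rest."""
    writable, skipped = [], []
    for rec in records:
        if all(rec.get(key) for key in required):
            writable.append(rec)
        else:
            skipped.append(rec)
    return writable, skipped


# Made-up records; the second one has no matching rank class
candidates = [
    {"label": "A Eon", "description": "a geologic Eon", "geo_rank_qid": "Q100",
     "iri": "http://example.org/A"},
    {"label": "B Sub-Period", "description": "a geologic Sub-Period", "geo_rank_qid": None,
     "iri": "http://example.org/B"},
]

writable, skipped = split_writable(candidates)
print(len(writable), "ready to write;", len(skipped), "need attention")
```

The same function works on the rows of `missing_eras` if they are first converted with `to_dict("records")`.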

## Rework through to add temporal relationships

I went back and forth on the best way to bring the potentially somewhat complex relationships between geologic time units into the Wikibase context. This again gets at the principle of pragmatic semantics, where we don't need to bring everything from perfectly solid reference sources into our GeoKB context if we don't need that information or level of depth in our system to reason against. If we ever do, we can go back and work through a process to augment what we have.

Looking at the source (e.g., [Phanerozoic Eon](http://resource.geosciml.org/classifier/ics/ischart/Phanerozoic)) there's a lot that Cox and Richard brought together in terms of semantic depth. I haven't dealt with the boundary concepts at all yet, which is where we get interpreted point in time figures that I experimented with from another source previously. We also need to decide how deep we want to go in terms of the geologic time units within each broader GeologicEra type. We could use the transitive relationships to show every era, period, epoch, and stage/age for a given eon, but I opted to simply work with the immediate narrower and broader relationships for now. The connections are there to walk the system as needed.
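
To illustrate what walking the system means when only immediate relationships are stored, here is a minimal sketch that collects every unit below a starting point by following narrower links. The `narrower` mapping is a hand-typed excerpt of the time scale for illustration, not data pulled from the source graph or the GeoKB, and the function name is hypothetical.

```python
def descendants(unit, narrower):
    """Collect every unit reachable from `unit` through immediate narrower links."""
    found = []
    seen = set()
    stack = list(narrower.get(unit, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against repeated or circular links
        seen.add(current)
        found.append(current)
        stack.extend(narrower.get(current, []))
    return sorted(found)


# Hand-typed excerpt: only immediate narrower relationships are recorded
narrower = {
    "Phanerozoic Eon": ["Paleozoic Era", "Mesozoic Era", "Cenozoic Era"],
    "Cenozoic Era": ["Paleogene Period", "Neogene Period", "Quaternary Period"],
    "Quaternary Period": ["Pleistocene Epoch", "Holocene Epoch"],
}

print(descendants("Cenozoic Era", narrower))
```

Storing only the immediate links keeps the number of claims per item small, at the cost of a traversal like this (or a SPARQL property path) whenever the full set is needed.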

I also debated whether we want to stick with the higher level SKOS concepts like broader/narrower as more generalized properties or stand in something that would put the different types of geochronologic units in as specific properties. It might be useful, in some ways, to work with the data showing specific eras within an eon without having to pull the type classification for the narrower units. However, this is partly how Wikidata has had such a proliferation of properties that mean essentially the same thing or perform the same basic function within the graph, so I opted to try and stick with higher level semantic concepts for a time and see how it works out in practice.

In [None]:
df_wb_eras = geokb.wb_ref_data(query=eras_query)
df_wb_eras["era"] = df_wb_eras.era.apply(lambda x: x.split("/")[-1])

wb_era_lookup = df_wb_eras.set_index("iri")["era"].to_dict()


In [None]:
q_broader = """
    SELECT ?item ?label ?broader
    WHERE {
        ?item rdfs:label ?label .
        ?item skos:broader ?broader .
    }
"""

q_narrower = """
    SELECT ?item ?label ?narrower
    WHERE {
        ?item rdfs:label ?label .
        ?item skos:narrower ?narrower .
    }
"""

relationships = []
for r in gts_2020.query(q_narrower):
    relationships.append({
        "subject": wb_era_lookup[re.sub("<|>", "", r["item"].n3())],
        "predicate": geokb.prop_lookup['narrower'],
        "object": wb_era_lookup[re.sub("<|>", "", r["narrower"].n3())]
    })

for r in gts_2020.query(q_broader):
    relationships.append({
        "subject": wb_era_lookup[re.sub("<|>", "", r["item"].n3())],
        "predicate": geokb.prop_lookup['broader'],
        "object": wb_era_lookup[re.sub("<|>", "", r["broader"].n3())]
    })

df_relationships = pd.DataFrame(relationships)
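
The loops above index `wb_era_lookup` directly, so a single IRI without a GeoKB item raises a `KeyError` and stops the run. A more forgiving variant, sketched with a hypothetical helper (`resolve_pairs`) and made-up IRIs, QIDs, and property ID, sets unresolved pairs aside for review:

```python
def resolve_pairs(pairs, lookup, predicate):
    """Map (subject IRI, object IRI) pairs to QIDs, setting aside any that fail to resolve."""
    resolved, unmatched = [], []
    for subject_iri, object_iri in pairs:
        subject_qid = lookup.get(subject_iri)
        object_qid = lookup.get(object_iri)
        if subject_qid is None or object_qid is None:
            unmatched.append((subject_iri, object_iri))
            continue
        resolved.append({"subject": subject_qid, "predicate": predicate, "object": object_qid})
    return resolved, unmatched


# Made-up lookup and pairs; "P999" is a placeholder property ID
toy_lookup = {"http://example.org/A": "Q1", "http://example.org/B": "Q2"}
toy_pairs = [
    ("http://example.org/A", "http://example.org/B"),
    ("http://example.org/A", "http://example.org/C"),  # C has no item yet
]

resolved, unmatched = resolve_pairs(toy_pairs, toy_lookup, "P999")
print(resolved)
print(unmatched)
```

The `unmatched` list doubles as a report of which source IRIs still need items before the relationships can be completed.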

In [None]:
for era in df_relationships.groupby(["subject","predicate"], as_index=False)["object"].agg(list).to_dict("records"):
    item = geokb.wbi.item.get(era["subject"])

    claim_list = []
    for o in era["object"]:
        claim_list.append(
            geokb.datatypes.Item(
                prop_nr=era["predicate"],
                value=o,
                references=references
            )
        )
    claims = geokb.models.Claims()
    claims.add(claim_list)

    item.claims.add(claims)
    item.write(
        summary="Fixed issue on multiple claims for broader and narrower geologic time periods",
        clear=True
    )
    print(era["subject"])
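
The grouping in the loop above is what lets all values for one property be added to an item in a single pass. A small sketch on made-up identifiers shows the shape of the records it produces:

```python
import pandas as pd

# Made-up relationships: Q1 has two objects for the same predicate
toy_relationships = pd.DataFrame([
    {"subject": "Q1", "predicate": "P10", "object": "Q2"},
    {"subject": "Q1", "predicate": "P10", "object": "Q3"},
    {"subject": "Q2", "predicate": "P11", "object": "Q1"},
])

# One record per (subject, predicate) with its objects gathered into a list
grouped = (
    toy_relationships
    .groupby(["subject", "predicate"], as_index=False)["object"]
    .agg(list)
    .to_dict("records")
)

for record in grouped:
    print(record)
```

Each record then maps to one item fetch and one list of claims, so an item with several narrower units gets all of them in the same write.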