This notebook works out contributions to the Geoscience Ontology (GSO) rock material classification (GSRM) by adding Mindat identifiers, which supports the next step for the GeoKB: aligning ourselves with the GSO. Mindat "min" identifiers are used for rock classification items as well as minerals, and the GSO minerals ontology already records them as an identifier (gsmin:mindatid) plus a URL (gsmin:mindaturl). My plan is to add these two properties to the GSO rock material ontology and contribute the result via pull request for consideration. Whether or not the pull request is accepted, we can use our branch to drive work within the GeoKB.

In [1]:
import pandas as pd
import requests
from rdflib import Graph, Namespace, URIRef, Literal

# Get the GSRM ontology

In [2]:
gsrm_url = "https://raw.githubusercontent.com/Loop3D/GKM/master/Loop3D-GSO/Modules/GSO-Geologic_Rock_Material.ttl"

gsrm = Graph()
gsrm.parse(gsrm_url, format="ttl")

q_gsrm_classes = """
SELECT ?class ?class_label ?parent ?parent_label
WHERE {
    ?class rdf:type owl:Class ;
              rdfs:label ?class_label ;
              rdfs:subClassOf ?parent .
    ?parent rdfs:label ?parent_label .
}
"""

gsrm_classes = []
for row in gsrm.query(q_gsrm_classes):
    gsrm_classes.append({
        "class": str(row["class"]),
        "class_label": str(row["class_label"]),
        "parent": str(row["parent"]),
        "parent_label": str(row["parent_label"])
    })

gsrm_items = pd.DataFrame(gsrm_classes)
gsrm_items.head()

,class,class_label,parent,parent_label
0,https://w3id.org/gso/rockmaterial/Acidic_Igneo...,acidic igneous material,https://w3id.org/gso/rockmaterial/Igneous_Mate...,igneous material
1,https://w3id.org/gso/rockmaterial/Acidic_Igneo...,acidic igneous rock,https://w3id.org/gso/rockmaterial/Acidic_Igneo...,acidic igneous material
2,https://w3id.org/gso/rockmaterial/Acidic_Igneo...,acidic igneous rock,https://w3id.org/gso/rockmaterial/Igneous_Rock,igneous rock
3,https://w3id.org/gso/rockmaterial/Alkali-Olivi...,alkali olivine basalt,https://w3id.org/gso/rockmaterial/Basalt,basalt
4,https://w3id.org/gso/rockmaterial/Alkali_Felds...,alkali feldspar granite,https://w3id.org/gso/rockmaterial/Granitoid,granitoid


# GeoKB mindat linkages
Here, we grab all items in the GeoKB that currently have Mindat URLs in sameAs relationships. This covers more than just the rock classes, but not much more at this point, since I haven't brought in all the minerals yet. I pull both the label and the alt labels because a couple of our aliases line up with GSRM labels.

In [3]:
query_geokb_mindat = """
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?item_label ?item_alt_label ?same_as
WHERE {
  ?item wdt:P84 ?same_as ;
        rdfs:label ?item_label .
  OPTIONAL {
    ?item skos:altLabel ?item_alt_label .
  }
  FILTER CONTAINS(STR(?same_as), 'mindat.org/min-')
}
"""

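# Query the GeoKB SPARQL endpoint; fail on HTTP errors before trying to parse JSON
response = requests.get(
    "https://geokb.wikibase.cloud/query/sparql",
    params={
        "query": query_geokb_mindat,
        "format": "json"
    },
    timeout=60
)
response.raise_for_status()
geokb_mindat_items = response.json()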

df_geokb_mindat_items = pd.DataFrame([{
    'qid': i['item']['value'].split('/')[-1], 
    'label': i['item_label']['value'],
    'alt_label': i['item_alt_label']['value'] if 'item_alt_label' in i else None,
    'mindaturl': i['same_as']['value'],
    'mindatid': i['same_as']['value'].split('/')[-1].split('.')[0]
} for i in geokb_mindat_items['results']['bindings']])

df_geokb_mindat_items.head()

,qid,label,alt_label,mindaturl,mindatid
0,Q421,clay,Clay,https://mindat.org/min-52526.html,min-52526
1,Q462,granite,Granite,https://mindat.org/min-48141.html,min-48141
2,Q41261,rock,,https://mindat.org/min-50468.html,min-50468
3,Q41274,adinole,,https://mindat.org/min-25.html,min-25
4,Q41282,chert,,https://mindat.org/min-994.html,min-994
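
The `mindatid` column above comes from splitting the URL on `/` and `.`. A slightly more defensive variant is sketched below: it uses a regular expression so that a URL with a query string, or one that is not a `min-` page at all, does not produce a malformed identifier. The helper name `extract_mindat_id` is hypothetical, and the pattern assumes identifiers always have the `min-<digits>` shape seen in the output above.

```python
import re

# Assumed identifier shape: "min-" followed by digits, right after the host name
MINDAT_ID_PATTERN = re.compile(r"mindat\.org/(min-\d+)")


def extract_mindat_id(url):
    """Return the 'min-NNNN' identifier from a Mindat URL, or None if absent."""
    match = MINDAT_ID_PATTERN.search(url)
    return match.group(1) if match else None


print(extract_mindat_id("https://mindat.org/min-52526.html"))
print(extract_mindat_id("https://mindat.org/loc-1234.html"))
```

Returning `None` for non-matching URLs makes it easy to drop or inspect those rows in pandas rather than carrying a bad identifier into the graph.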


# Find alignment
I'm starting with simple exact name matching on labels and alt labels. I'll likely discover a few additional matches under completely different names as I work to "purge" the Mindat items that I can't line up with the GSRM list.

In [4]:
name_matches = pd.concat([
    pd.merge(
        left=gsrm_items[['class', 'class_label']],
        right=df_geokb_mindat_items[['label','mindaturl','mindatid']],
        how='inner',
        left_on='class_label',
        right_on='label'
    ),
    pd.merge(
        left=gsrm_items[['class', 'class_label']],
        right=df_geokb_mindat_items[['alt_label','mindaturl','mindatid']],
        how='inner',
        left_on='class_label',
        right_on='alt_label'
    )
])
name_matches = name_matches[['class','class_label','mindaturl','mindatid']].drop_duplicates().reset_index(drop=True)

name_matches.head()

,class,class_label,mindaturl,mindatid
0,https://w3id.org/gso/rockmaterial/Alkali-Olivi...,alkali olivine basalt,https://mindat.org/min-53210.html,min-53210
1,https://w3id.org/gso/rockmaterial/Amphibolite,amphibolite,https://mindat.org/min-48627.html,min-48627
2,https://w3id.org/gso/rockmaterial/Andesite,andesite,https://mindat.org/min-48484.html,min-48484
3,https://w3id.org/gso/rockmaterial/Anorthosite,anorthosite,https://mindat.org/min-48323.html,min-48323
4,https://w3id.org/gso/rockmaterial/Anthracite_Coal,anthracite,https://mindat.org/min-9434.html,min-9434
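
To find the Mindat items that did not line up with a GSRM label, one option is a left merge with `indicator=True`, which works as an anti-join. This is a minimal sketch on toy frames, not the notebook's actual data: `gsrm_toy` and `geokb_toy` are hypothetical stand-ins for `gsrm_items` and `df_geokb_mindat_items`, and the toy GSRM list deliberately omits some labels so the result is non-empty. It checks only `label`; the same pattern would need repeating for `alt_label`.

```python
import pandas as pd

# Toy stand-ins for gsrm_items and df_geokb_mindat_items (illustrative values only)
gsrm_toy = pd.DataFrame({"class_label": ["granite", "andesite"]})
geokb_toy = pd.DataFrame({
    "qid": ["Q462", "Q41274", "Q41282"],
    "label": ["granite", "adinole", "chert"],
})

# A left merge with indicator=True adds a "_merge" column;
# "left_only" marks GeoKB rows that found no GSRM partner
merged = geokb_toy.merge(
    gsrm_toy,
    how="left",
    left_on="label",
    right_on="class_label",
    indicator=True,
)
unmatched = (
    merged.loc[merged["_merge"] == "left_only", ["qid", "label"]]
    .reset_index(drop=True)
)
print(unmatched)
```

The resulting frame is the candidate list for manual review, where matches under different names can be added before anything is purged.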


# Add to graph
The gsmin namespace is not bound in the current GSRM, so I add it first. I then add the mindatid and mindaturl values for each matched class and write the graph out to a file in the fork/branch I started for this work in the GKM repo.

In [5]:
gsmin_prefix = 'gsmin'
gsmin_namespace = Namespace('https://w3id.org/gso/mineral/')

gsrm.namespace_manager.bind(gsmin_prefix, gsmin_namespace)

# Attach both Mindat properties to every matched GSRM class
for _, row in name_matches.iterrows():
    class_uri = URIRef(row["class"])
    mindaturl_val = Literal(row["mindaturl"])
    mindatid_val = Literal(row["mindatid"])

    gsrm.add((class_uri, gsmin_namespace.mindaturl, mindaturl_val))
    gsrm.add((class_uri, gsmin_namespace.mindatid, mindatid_val))

# Write the updated rock material graph to its module file in the local GKM fork
gsrm.serialize(
    destination='../../GKM/Loop3D-GSO/Modules/GSO-Geologic_Rock_Material.ttl',
    format="turtle"
)