The Geoscience Knowledgebase (GeoKB) needs a representation of rocks, minerals, and commodities (mostly mineral but others as well) as identified entities that can be linked to from other entities in various ways. For instance, we identify mineral commodities in documents like the NI 43-101 Technical Reports as the commodities being mined in a mining project. We link to these entities in various ways through properties in the graph that define the relationship of a subject item to an object rock, mineral, or commodity.

This notebook works through the process of pulling rocks, minerals, and commodities from Mindat and introducing a small part of their information into the GeoKB to represent those concepts. Mindat gives us the best open online source of mineralogical information, including the best technical interface to IMA-listed minerals along with much other useful information. Mindat is currently engineering and releasing an API that provides a reasonable interface to work against, both for an initial representation of some types of items and for ongoing connections and updates.

For the GeoKB, we really only need a small portion of the content that Mindat provides, including linkages on identifiers that will let us re-tap the system as needed for things we don't port into our knowledge graph. We're going to focus on just those pieces we need for query, linkage, and reasoning in the GeoKB, ignoring or only minimally representing everything else.

### Everything in a name
We are currently experimenting with a design decision that may or may not be the best way to go. Certain named entities can mean more than one thing. For instance, the concept "manganese" is a chemical element, a metallic mineral (not on the IMA list), and a commodity. Mindat treats each of these as separate unique entities, which is a perfectly reasonable approach given its data structure. As a knowledge representation, we are experimenting with establishing a single unique entity in the GeoKB that is instantiated as a chemical element, mineral, and mineral commodity at the same time. We link the assertion of manganese as a mineral and mineral commodity to Mindat via the specific identifiers in that system along with the additional reference to a commodity code in the USGS MRDS. This gives us a single concept to link to in multiple contexts, relying on the property establishing the linkage to determine what the context is. This may or may not prove out to be the most workable approach, and we'll determine that through usage and adjust as needed.
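To make the single-entity idea concrete, here is a minimal sketch in plain Python dictionaries. It is not the Wikibase data model or the `wbmaker` API: the dictionary layout and the helpers `roles_for` and `identifier_for` are hypothetical and exist only for illustration. The arsenic values (Q297, the Mindat long identifier, and the MRDS code) are taken from the merged commodity table later in this notebook.

```python
# One concept, several roles: each "instance of" claim carries its own
# qualifiers tying that role to an identifier in an external system.
arsenic = {
    "qid": "Q297",  # from the merged commodity table in this notebook
    "label": "arsenic",
    "instance_of": [
        {"class": "chemical element", "qualifiers": {}},
        {
            "class": "mineral commodity",
            "qualifiers": {
                "Mindat identifier": "1:1:52423:2",
                "MRDS commodity code": "AS",
            },
        },
    ],
}


def roles_for(entity):
    """Return the class labels an entity is instantiated as (hypothetical helper)."""
    return [claim["class"] for claim in entity["instance_of"]]


def identifier_for(entity, role, id_property):
    """Look up an external identifier qualifier for one role, or None."""
    for claim in entity["instance_of"]:
        if claim["class"] == role:
            return claim["qualifiers"].get(id_property)
    return None


print(roles_for(arsenic))
print(identifier_for(arsenic, "mineral commodity", "Mindat identifier"))
```

The point of the layout is that anything linking to the concept only needs the one QID, while the qualifiers on each classification keep the external identifiers attached to the role they actually describe.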

In [1]:
import os
import requests
import pandas as pd
import html
import re
from wbmaker import WikibaseConnection
from getpass import getpass
from glob import glob

In [4]:
# We need a Mindat API key for interactions with Mindat
if not os.environ.get('MINDAT_API_KEY'):
    os.environ['MINDAT_API_KEY'] = getpass(prompt="Input Mindat API Key: ")

In [5]:
# For this work we need to establish a connection to our Wikibase instance where we're sending information
geokb = WikibaseConnection("GEOKB_CLOUD")

# Mindat API helper functions

In [6]:
def pull_mindat_records(mindat_api):
    headers = {'Authorization': f"Token {os.environ['MINDAT_API_KEY']}"}
    all_records = []
    page_num = 1

    while page_num > 0:
        params = {
            'page_size': '100',
            'page': str(page_num),
            'format': 'json',
        }
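        response = requests.get(mindat_api, params=params, headers=headers)
        # Fail loudly on HTTP errors (e.g. a bad or expired API key)
        response.raise_for_status()
        x = response.json()
        # Some endpoints return a bare list instead of a paginated object
        if isinstance(x, list) and len(x) > 0:
            return x
        if "results" in x and x["results"]:
            all_records.extend(x["results"])
            if x.get("next"):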
                page_num += 1
            else:
                page_num = 0
        else:
            page_num = 0

    return all_records

def mindat_materials(type):
    if type == "mineral":
        entrytype = "0"

        return_props = [
            "id",
            "longid",
            "name",
            "description_short",
            "occurrence",
            "otheroccurrence",
            "strunz10ed1",
            "strunz10ed2",
            "strunz10ed3",
            "strunz10ed4",
            "key_elements",
            "ima_status",
            "shortcode_ima"
        ]

    elif type == "rock":
        entrytype = "7"

        return_props = [
            "id",
            "longid",
            "name",
            "description_short",
            "rock_parent",
            "rock_parent2",
            "rock_root",
            "rock_bgs_code"
        ]

    elif type == "commodity":
        entrytype = "8"

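        return_props = [
            "id",
            "longid",
            "name",
            "description_short"
        ]

    else:
        # Without this guard an unsupported value leaves entrytype undefined
        raise ValueError(f"Unsupported material type: {type!r}")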

    mindat_items = pull_mindat_records(
        mindat_api=f"https://api.mindat.org/geomaterials/?entrytype={entrytype}"
    )

    df_mindat_items = pd.DataFrame(mindat_items)
    return df_mindat_items[return_props].reset_index(drop=True)
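
The pagination pattern that `pull_mindat_records` relies on (a `results` list plus a `next` link that is empty on the last page) can be exercised offline. This is a small sketch only: `collect_pages`, `fetch_page`, and the `fake_pages` payloads are hypothetical stand-ins for the HTTP request and are not part of the Mindat API or this notebook.

```python
def collect_pages(fetch_page):
    """Accumulate 'results' from successive pages until 'next' is empty."""
    records = []
    page_num = 1
    while True:
        payload = fetch_page(page_num)
        records.extend(payload.get("results", []))
        if not payload.get("next"):
            break
        page_num += 1
    return records


# Hypothetical three-page response used only for illustration.
fake_pages = {
    1: {"results": [{"id": 1}, {"id": 2}], "next": "page=2"},
    2: {"results": [{"id": 3}], "next": "page=3"},
    3: {"results": [{"id": 4}], "next": None},
}

records = collect_pages(lambda n: fake_pages[n])
print([r["id"] for r in records])
```

Passing the fetcher in as a callable keeps the loop logic testable without an API key or network access.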

# Rocks

In [25]:
mindat_rocks = mindat_materials("rock")
mindat_rocks.sort_values("name").head()

Unnamed: 0,id,longid,name,description_short,rock_parent,rock_parent2,rock_root,rock_bgs_code
165,48128,1:1:48128:3,A-type granite,A general term for granitoids typically occurr...,48126,0,0,
427,48464,1:1:48464:2,Absarokite,A basaltic-trachyandesite rock containing phen...,48463,0,0,
1523,49848,1:1:49848:7,Acapulcoite meteorite,"Acapulcoites, named after the Acapulco, Mexico...",49847,0,1,
1524,49849,1:1:49849:6,Acapulcoite-lodranite meteorite,A transitional type between acapulcoite and lo...,49848,11263,0,
1752,50157,1:1:50157:8,Aceite,"A low-temperature alkaline metasomatic rock, m...",48653,0,1,


In [28]:
geokb_rocks = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0APREFIX%20p%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2F%3E%0APREFIX%20pq%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fqualifier%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FitemDescription%20%3Fmindat_id%20%3Fsubclass_of%20%3Fsubclass_ofLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ41261%20.%0A%20%20%3Fitem%20p%3AP1%20%3Finstance_of%20.%0A%20%20%3Finstance_of%20pq%3AP99%20%3Fmindat_id%20.%0A%20%20%3Fitem%20wdt%3AP2%20%3Fsubclass_of%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

geokb_rocks.sort_values("itemLabel").head(10)

Unnamed: 0,item,itemLabel,itemDescription,mindat_id,subclass_of,subclass_ofLabel
885,https://geokb.wikibase.cloud/entity/Q41534,A-type granite,A general term for granitoids typically occurr...,1:1:48128:3,https://geokb.wikibase.cloud/entity/Q41533,Granitoid
556,https://geokb.wikibase.cloud/entity/Q41802,Absarokite,A basaltic-trachyandesite rock containing phen...,1:1:48464:2,https://geokb.wikibase.cloud/entity/Q41801,Shoshonite
1592,https://geokb.wikibase.cloud/entity/Q42880,Acapulcoite meteorite,"Acapulcoites, named after the Acapulco, Mexico...",1:1:49848:7,https://geokb.wikibase.cloud/entity/Q42879,Primitive achondrite meteorite
1594,https://geokb.wikibase.cloud/entity/Q42881,Acapulcoite-lodranite meteorite,A transitional type between acapulcoite and lo...,1:1:49849:6,https://geokb.wikibase.cloud/entity/Q42880,Acapulcoite meteorite
1593,https://geokb.wikibase.cloud/entity/Q42881,Acapulcoite-lodranite meteorite,A transitional type between acapulcoite and lo...,1:1:49849:6,https://geokb.wikibase.cloud/entity/Q41403,Lodranite meteorite
2343,https://geokb.wikibase.cloud/entity/Q43111,Aceite,"A low-temperature alkaline metasomatic rock, m...",1:1:50157:8,https://geokb.wikibase.cloud/entity/Q41946,Metasomatic-rock
3113,https://geokb.wikibase.cloud/entity/Q43580,Acid volcanic rock,A chemical classification based on the TAS dia...,1:1:51212:1,https://geokb.wikibase.cloud/entity/Q41789,"Fine-grained (""volcanic"") normal crystalline i..."
2038,https://geokb.wikibase.cloud/entity/Q43070,Actinolite schist,A schistose rock composed predominantly of act...,1:1:50201:8,https://geokb.wikibase.cloud/entity/Q44025,Amphibole schist
2045,https://geokb.wikibase.cloud/entity/Q43073,Actinolite-chlorite schist,A schistose metamorphic rock containing actino...,1:1:50115:2,https://geokb.wikibase.cloud/entity/Q41939,Chlorite schist
2046,https://geokb.wikibase.cloud/entity/Q43073,Actinolite-chlorite schist,A schistose metamorphic rock containing actino...,1:1:50115:2,https://geokb.wikibase.cloud/entity/Q42020,Greenschist


# Commodities

In [29]:
mindat_commodities = mindat_materials("commodity")
mindat_commodities.head(10)

Unnamed: 0,id,longid,name,description_short
0,52420,1:1:52420:5,commodity:Aggregates,Aggregate is a broad classification of coarse ...
1,52421,1:1:52421:4,commodity:Alumina,"Alumina, or aluminium oxide (aluminum oxide), ..."
2,52422,1:1:52422:3,commodity:Antimony,Antimony is a silvery metalloid chemically rel...
3,52423,1:1:52423:2,commodity:Arsenic,Arsenic is a silvery metalloid.
4,52424,1:1:52424:1,commodity:Asbestos,A fibrous natural form of certain silicate min...
5,52425,1:1:52425:0,commodity:Barite (Barytes),Barite (or baryte) is a mineral composed prima...
6,52427,1:1:52427:8,commodity:Bauxite (Aluminium),Bauxite is the main ore of aluminium (aluminum...
7,52429,1:1:52429:6,commodity:Bentonite,
8,52430,1:1:52430:2,commodity:Beryllium,Beryllium is a toxic white-grey light metal.
9,52431,1:1:52431:1,commodity:Bismuth,


# Minerals

In [22]:
mindat_file = sorted(glob('data/mindat_minerals_*'), reverse=True)
mindat_file
if mindat_file:
    mindat_items = pd.read_parquet(mindat_file[0])
else:
    mindat_items = mindat_materials("mineral")

mindat_items.head()

Unnamed: 0,id,longid,name,description_short,occurrence,otheroccurrence,strunz10ed1,strunz10ed2,strunz10ed3,strunz10ed4,key_elements,ima_status,shortcode_ima
0,1,1:1:1:5,Abelsonite,"Chemically a nickel porphyrine derivative, cla...",Mahogany Zone oil shale; found in six stratigr...,"Small aggregates, to 1 cm, of thin laths or\r\...",10,C,A,20.0,-N-Ni-,[APPROVED],Abl
1,2,1:1:2:4,Abenakiite-(Ce),Unique chemistry (the only Na-REE-Si-P-C miner...,In a xenolith of sodalite syenite. A late-stag...,,9,C,K,10.0,-Ce-,[APPROVED],Abk-Ce
2,3,1:1:3:3,Abernathyite,Meta-autunite Group. Chemically the As analogu...,Colorado Plateau uranium-vanadium deposit,Oxidation zone of U deposits.,8,E,B,15.0,-As-U-,"[APPROVED, GRANDFATHERED]",Abn
3,4,1:1:4:2,Abhurite,A tin hydroxychloride mineral.\r\nThis mineral...,On the surface of a tin ingot recovered from a...,On tin ingots corroded by sea-water,3,D,A,30.0,-Cl-Sn-,[APPROVED],Abh
4,5,1:1:5:1,Ablykite,A clay mineral close to Halloysite. \r\n\r\nOr...,,,0,0,0,,,[],


In [5]:
geokb_mindat_ids = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0APREFIX%20p%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2F%3E%0APREFIX%20pq%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fqualifier%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fmindat_id%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ24%20.%0A%20%20%3Fitem%20p%3AP1%20%3Finstance_of%20.%0A%20%20%3Finstance_of%20pq%3AP99%20%3Fmindat_id%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

geokb_mindat_items = geokb_mindat_ids[~geokb_mindat_ids["mindat_id"].str.contains(':')].reset_index(drop=True)

geokb_mindat_items["qid"] = geokb_mindat_items["item"].apply(lambda x: x.split("/")[-1])
geokb_mindat_items["id"] = geokb_mindat_items["mindat_id"].astype("Int64")

In [6]:
df_merged_commodities = pd.merge(
    left=geokb_mindat_items,
    right=mindat_commodities[["id","longid"]],
    how="left",
    on="id"
)

In [7]:
df_merged_commodities[df_merged_commodities.longid.notnull()]

Unnamed: 0,item,itemLabel,mindat_id,mrds_code,qid,id,longid
0,https://geokb.wikibase.cloud/entity/Q297,arsenic,52423,AS,Q297,52423,1:1:52423:2
1,https://geokb.wikibase.cloud/entity/Q302,bismuth,52431,BI,Q302,52431,1:1:52431:1
2,https://geokb.wikibase.cloud/entity/Q306,cadmium,52434,CD,Q306,52434,1:1:52434:8
3,https://geokb.wikibase.cloud/entity/Q313,chromium,52438,CR,Q313,52438,1:1:52438:4
4,https://geokb.wikibase.cloud/entity/Q314,cobalt,52440,CO,Q314,52440,1:1:52440:9
5,https://geokb.wikibase.cloud/entity/Q315,copper,52442,CU,Q315,52442,1:1:52442:7
6,https://geokb.wikibase.cloud/entity/Q329,gold,52454,AU,Q329,52454,1:1:52454:2
7,https://geokb.wikibase.cloud/entity/Q330,hafnium,52458,HF,Q330,52458,1:1:52458:8
8,https://geokb.wikibase.cloud/entity/Q332,helium,52459,HE,Q332,52459,1:1:52459:7
9,https://geokb.wikibase.cloud/entity/Q335,indium,52462,IN,Q335,52462,1:1:52462:1


In [8]:
refs = geokb.models.References()
refs.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['knowledge source'],
        value="Q41269"
    )
)

refs.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['knowledge source'],
        value="Q44207"
    )
)

for index, row in df_merged_commodities[df_merged_commodities.longid.notnull()].iterrows():
    item = geokb.wbi.item.get(row["qid"])

    quals = geokb.models.Qualifiers()
    quals.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup['MRDS commodity code'],
            value=row["mrds_code"]
        )
    )
    quals.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup["Mindat identifier"],
            value=row["longid"]
        )
    )

    commodity_instance_claim = geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup["instance of"],
        value="Q406",
        qualifiers=quals,
        references=refs
    )

    item.claims.add(
        claims=commodity_instance_claim,
        action_if_exists=geokb.action_if_exists.APPEND_OR_REPLACE
    )

    response = item.write(
        summary="Changed Mindat identifier to long form for commodity in instance of classification qualifier"
    )
    print(row["item"])

https://geokb.wikibase.cloud/entity/Q297
https://geokb.wikibase.cloud/entity/Q302
https://geokb.wikibase.cloud/entity/Q306
https://geokb.wikibase.cloud/entity/Q313
https://geokb.wikibase.cloud/entity/Q314
https://geokb.wikibase.cloud/entity/Q315
https://geokb.wikibase.cloud/entity/Q329
https://geokb.wikibase.cloud/entity/Q330
https://geokb.wikibase.cloud/entity/Q332
https://geokb.wikibase.cloud/entity/Q335
https://geokb.wikibase.cloud/entity/Q336
https://geokb.wikibase.cloud/entity/Q337
https://geokb.wikibase.cloud/entity/Q342
https://geokb.wikibase.cloud/entity/Q343
https://geokb.wikibase.cloud/entity/Q346
https://geokb.wikibase.cloud/entity/Q349
https://geokb.wikibase.cloud/entity/Q350
https://geokb.wikibase.cloud/entity/Q381
https://geokb.wikibase.cloud/entity/Q392
https://geokb.wikibase.cloud/entity/Q414
https://geokb.wikibase.cloud/entity/Q415
https://geokb.wikibase.cloud/entity/Q449
https://geokb.wikibase.cloud/entity/Q451
https://geokb.wikibase.cloud/entity/Q453
https://geokb.wi

In [None]:
mindat_items = []
headers = {'Authorization': f"Token {os.environ['MINDAT_API_KEY']}"}
page_num = 1
params = {
    'page_size': '10',
    'page': '1',
    'format': 'json',
}

r = requests.get("https://api.mindat.org/geomaterials/?entrytype=7", params=params, headers=headers)

# while page_num > 0:
#     params = {
#         'page_size': '100',
#         'page': str(page_num),
#         'format': 'json',
#     }
#     x = requests.get("https://api.mindat.org/items/", params=params, headers=headers).json()
#     if "results" in x and x["results"]:
#         mindat_items.extend(x["results"])
#         page_num += 1
#     else:
#         page_num = 0

# df_mindat_items = pd.DataFrame(mindat_items)
# # df_mindat_items.to_pickle("data/mindat_items_20230420.p")


In [None]:
r.json()["results"][0]

In [None]:
geokb = WikibaseConnection("GEOKB_CLOUD")

# mindat_rock_class_source = geokb.ref_lookup['Mindat Rock Classification']
# mrdata_commodity_source = geokb.ref_lookup['USGS MRDS Commodity Names and Codes']

# ref_mindat = geokb.datatypes.Item(
#     prop_nr=geokb.prop_lookup['knowledge source'],
#     value=mindat_rock_class_source
# )
# ref_mrdata = geokb.datatypes.Item(
#     prop_nr=geokb.prop_lookup['knowledge source'],
#     value=mrdata_commodity_source
# )

In [None]:
q_mindat_ids = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0APREFIX%20p%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2F%3E%0APREFIX%20pq%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fqualifier%2F%3E%20%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fmindat_id%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ41261%20.%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3Fitem%20p%3AP1%20%3Fstatement%20.%0A%20%20%20%20OPTIONAL%20%7B%20%3Fstatement%20pq%3AP99%20%3Fmindat_id%20.%20%7D%0A%20%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D%0ALIMIT%2010000"
df_mindat_ids = geokb.wb_ref_data(query=q_mindat_ids)
df_mindat_ids["qid"] = df_mindat_ids["item"].apply(lambda x: x.split("/")[-1])

In [None]:
mindat_item_types = pd.read_parquet("mindat_item_types.parquet")
mindat_item_types["old_mindat_id"] = mindat_item_types["id"].astype(str)
df_mindat_ids["old_mindat_id"] = df_mindat_ids["mindat_id"].astype(str)

In [None]:
fix_mindat_ids = pd.merge(
    left=df_mindat_ids,
    right=mindat_item_types,
    how="left",
    on="old_mindat_id"
)
fix_mindat_ids["new_mindat_id"] = fix_mindat_ids.apply(lambda x: f"1:1:{x.old_mindat_id}:{x.entrytype}", axis=1)

In [None]:
fix_mindat_ids[fix_mindat_ids.itemLabel == 'Coal']

In [None]:
elem_query = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fatomic_number%20%3Fchem_symbol%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP101%20%3Fatomic_number%20.%0A%20%20%3Fitem%20wdt%3AP17%20%3Fchem_symbol%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D%0ALIMIT%201000"

elem_items = geokb.wb_ref_data(query=elem_query)
elem_items["item_qid"] = elem_items["item"].apply(lambda x: x.split("/")[-1])

In [None]:
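# ActionIfExists is used below, so it has to be imported before this cell runs
from wikibaseintegrator.wbi_enums import ActionIfExists

source_reference = geokb.datatypes.Item(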
    prop_nr=geokb.prop_lookup['knowledge source'],
    value="Q44208"
)

for index, row in elem_items.iterrows():
    item_references = geokb.models.References()
    item_references.add(source_reference)

    item = geokb.wbi.item.get(row.item_qid)

    instance_of_claim = geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['instance of'],
        value="Q280",
        references=item_references
    )

    chem_symbol_claim = geokb.datatypes.String(
        prop_nr=geokb.prop_lookup['element symbol'],
        value=row.chem_symbol,
        references=item_references
    )

    atomic_number_claim = geokb.datatypes.Quantity(
        prop_nr=geokb.prop_lookup['atomic number'],
        amount=row.atomic_number,
        references=item_references
    )

    item.claims.add(
        claims=[instance_of_claim, chem_symbol_claim, atomic_number_claim],
        action_if_exists=ActionIfExists.APPEND_OR_REPLACE
    )

    item.write(summary="Added references for chemical elements to Mindat chemical element list")

    print(row["item"])

In [None]:
from wikibaseintegrator.wbi_enums import ActionIfExists

In [None]:
rock_source_reference = geokb.datatypes.Item(
    prop_nr=geokb.prop_lookup['knowledge source'],
    value=geokb.ref_lookup['Mindat Rock Classification']
)

for index, row in mindat_id_items[mindat_id_items['instance_ofLabel'] == 'rock'].iterrows():
    r = geokb.models.References()
    r.add(rock_source_reference)

    item = geokb.wbi.item.get(row.item_qid)

    q = geokb.models.Qualifiers()
    q.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup['Mindat identifier'],
            value=str(row.mindat_identifier)
        )
    )

    instance_of_claim = geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['instance of'],
        value='Q41261',
        references=r,
        qualifiers=q
    )

    item.claims.add(
        claims=instance_of_claim,
        action_if_exists=ActionIfExists.REPLACE_ALL
    )

    item.claims.remove(
        property=geokb.prop_lookup['Mindat identifier']
    )

    item.write(
        summary="Moved Mindat rock identifier to instance of claim as qualifier"
    )
    print(row['item'])

In [None]:
if os.path.exists("data/mindat_items_20230420.p"):
    df_mindat_items = pd.read_pickle("data/mindat_items_20230420.p")
else:
    # Takes about 10 minutes to paginate through all records at the maximum page size (100)
    mindat_items = []
    headers = {'Authorization': f"Token {os.environ['MINDAT_API_KEY']}"}
    page_num = 1

    while page_num > 0:
        params = {
            'page_size': '100',
            'page': str(page_num),
            'format': 'json',
        }
        x = requests.get("https://api.mindat.org/items/", params=params, headers=headers).json()
        if "results" in x and x["results"]:
            mindat_items.extend(x["results"])
            page_num += 1
        else:
            page_num = 0

    df_mindat_items = pd.DataFrame(mindat_items)
    df_mindat_items.to_pickle("data/mindat_items_20230420.p")


# Commodities

In [None]:
commodities = df_mindat_items[df_mindat_items['entrytype_text'] == 'commodity'].reset_index()
commodities["commodity_name"] = commodities["name"].apply(lambda x: x.split(":")[-1].split("(")[0].strip().lower())

In [None]:
element_commodities = commodities[~commodities['mindat_formula'].isin(['','0'])][["id","commodity_name","mindat_formula"]].reset_index()
non_element_commodities = commodities[commodities['mindat_formula'].isin(['','0'])][["id","commodity_name"]].reset_index()

In [None]:
element_commodities

In [None]:
non_element_commodities

# Questionable concept items (rocks but not rocks)

Rock type names may or may not be better sourced from elsewhere, but Mindat does provide a pretty solid source of rock type names. More importantly, it lists the relationships between minerals and mineral commodities and their rock "parents," giving us a nice connection for the linked data model in the knowledge graph. If we look at items that are not classified as "rock" but are listed as a rock_root, we have an interesting collection of concepts that seem like they are "shoved" into the Mindat model because it needs them as references, but they should really be classed as something else. The MRDS Commodity Codes that I worked with previously had some similar dynamics. Working this from the broader GeoKB context, we should be able to resolve some of this semantic dissonance, but I've got to work out how best to source and classify these other concepts if we need them.

In [None]:
df_mindat_items[
    (df_mindat_items.entrytype_text != "rock")
    &
    (df_mindat_items.rock_root == 1)
][["id","name","entrytype_text"]].sort_values("entrytype_text")

# Rocks/rock types

We do need to have a representation of most or all of what Mindat has in its rock classification. Whether or not we want to represent the full hierarchical classification is a question. The rock type classification for [Garnet clinopyroxenite](https://www.mindat.org/min-470579.html) is an interesting example. The data we get from the API is fully sufficient to create what's shown in the web app. We get to the level of [Coarse-grained-ultramafic-rock](https://www.mindat.org/min-50598.html), and we have the split classification represented in data as rock_parent and rock_parent2. Since we will presumably have all of these items in the GeoKB representation, we may as well go ahead and capture the relationships in claims for the graph.
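As a minimal sketch of how rock_parent and rock_parent2 translate into graph relationships, the following turns a few rows into (child, parent) pairs, one per populated parent slot. The rows are copied from the Mindat rocks table shown earlier in this notebook; the `parent_edges` helper is hypothetical, and treating `0` as "no parent" is an assumption based on how those columns appear in that table.

```python
# (id, name, rock_parent, rock_parent2); 0 is assumed to mean "no parent".
rocks = [
    (48128, "A-type granite", 48126, 0),
    (49848, "Acapulcoite meteorite", 49847, 0),
    (49849, "Acapulcoite-lodranite meteorite", 49848, 11263),
]


def parent_edges(rows):
    """Yield (child_id, parent_id) pairs, one per non-zero parent slot."""
    for rock_id, _name, parent, parent2 in rows:
        for parent_id in (parent, parent2):
            if parent_id != 0:
                yield (rock_id, parent_id)


edges = list(parent_edges(rocks))
print(edges)
```

Each pair corresponds to one "subclass of" claim, so a rock with a split classification simply contributes two pairs.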

A handful (17) of Mindat's rocks are listed with identifiers from the British Geological Survey's rock classification system. This is one of several efforts to provide a semantically explicit rock type system. It is available in linked data form from the BGS, maintained via [GitHub](https://github.com/BritishGeologicalSurvey/vocabularies/tree/main/vocabularies) with other vocabularies. Interestingly, the BGS source includes their own take on geochronology that I've not yet compared to the IUGS source I ended up using (see the "Geo Time" notebook). There are also some other representations of rock type/lithology classification schemes in various states of maturity and questionable "authoritativeness" that we could explore.

So, the question here is, should we start with what Mindat provides on named rocks and build up from there, or should we spend the time tracking down and hashing through other sources that may or may not be more complete or robust for this particular piece of information? This brings up an overall philosophical question exposed by this type of work: what exactly should we tap into for our own purposes? On one end of the spectrum are platforms like Mindat and our own MRData that are fundamentally a structured blend of data, information, and knowledge initially designed and built exclusively for human consumption. Later they added functionality like APIs or encoded renderings of their contents, mainly for human app-builders to write software against. On the other end, we have numerous attempts to encode some of that same data, information, and knowledge essentially for AIs or other reasoning processes that need something more efficient to work with at the scale (size and complexity) of the problems involved.

In a lot of ways, the best next generation frameworks should do both things well. This is where approaches like schema.org and the related "Science on Schema" work in ESIP come into play. They take what starts and continues as a web system designed to communicate concepts to humans and embed encoded data, information, and knowledge in a way that AIs can read the same content more efficiently for themselves. This is where I think OpenMindat is going as a project. Unfortunately, we don't yet have a whole lot of those things built out in our domain. They have been built for over a decade now in the world of e-commerce, which is what drove the development to begin with. This is what drives things like CapitalOne Shopping and other AI-driven tools. The more open-ended part of AI that has all the hype right now (ChatGPT, etc.) is just another piece of something that's already been in all our lives, manipulating us to buy stuff, for a long time.

I'm inclined to start with Mindat and then build up from there as needed. It may not provide a fully semantically coherent, AI-enabling API yet, but it does have an API, meaning we don't have to scrape web pages. Most importantly, we can encode identifiers in anything we represent in the GeoKB that are resolvable for both humans and software systems, with the former still the most important. If we link to some name of a rock type and that named entity shows we got it from Mindat and links right to the associated record, then we and everyone else know exactly what we meant when we established the linkage. It is at least something larger than ourselves and part of a global community system that has active people in our same domain interacting with and improving on the information all the time. As we find things not yet covered in Mindat that we need (about some of the same things or about new things), we can go consult other sources. When we do, we can establish new linkages that enhance what we know and communicate about what we know.
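A short sketch of what "resolvable" can mean in practice follows. Both helpers are hypothetical. The URL pattern is inferred from the two Mindat links cited above and is assumed to hold for other identifiers. Reading the third component of a long-form identifier as the numeric Mindat id is likewise an assumption, based only on the id and longid columns in the tables earlier in this notebook.

```python
def mindat_page_url(mindat_id):
    """Build a Mindat page URL from a numeric id (URL pattern is assumed)."""
    return f"https://www.mindat.org/min-{int(mindat_id)}.html"


def short_id_from_longid(longid):
    """Extract the numeric id from a long-form id such as '1:1:48128:3'.

    Assumes the third colon-separated component is the numeric Mindat id.
    """
    return int(longid.split(":")[2])


print(mindat_page_url(50598))
print(mindat_page_url(short_id_from_longid("1:1:48128:3")))
```

With something like this, a person can follow the link from a GeoKB item to the Mindat page, and software can go the other way from a stored identifier to the source record.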

In [None]:
mindat_rocks = df_mindat_items[
    (df_mindat_items['entrytype_text'] == 'rock')
    &
    (
        ~(
            (df_mindat_items["rock_parent"] == 0)
            &
            (df_mindat_items["rock_parent2"] == 0)
        )
    )
][[
    "id",
    "name",
    "description_short",
    'rock_parent',
    'rock_parent2',
    'rock_root',
    'rock_bgs_code',
    'meteoritical_code'
]].reset_index()

mindat_rocks["desc"] = mindat_rocks["description_short"].apply(lambda x: x.split("\r\n")[0].split(".")[0][:250])

In [None]:
def wb_item_from_mindat(mindat_id, df_mindat, wb, ref_mindat, summary):
    df_mindat_record = df_mindat[df_mindat["id"] == int(mindat_id)]
    if df_mindat_record.empty:
        return None

    mindat_record = df_mindat_record.iloc[0]

    # Derive the short description here (first sentence, max 250 characters)
    # because the dataframe passed in may not carry the "desc" column
    desc = str(mindat_record["description_short"]).split("\r\n")[0].split(".")[0][:250]

    item = wb.wbi.item.new()

    item.labels.set('en', mindat_record["name"])
    if mindat_record["name"] == desc:
        item.descriptions.set('en', f"a rock named {mindat_record['name']}")
    else:
        item.descriptions.set('en', desc)

    # Use the connection passed in as wb rather than the global geokb
    item.claims.add(
        wb.datatypes.Item(
            prop_nr=wb.prop_lookup["instance of"],
            value=wb.class_lookup["rock"],
            references=[ref_mindat]
        )
    )

    item.claims.add(
        wb.datatypes.ExternalID(
            prop_nr=wb.prop_lookup["Mindat min Identifier"],
            value=str(mindat_record["id"])
        )
    )

    response = item.write(
        summary=summary
    )
    return response.id

def add_subclass_of_claim(wb, ref, item_qid, subclass_of_qid, summary):
    item = wb.wbi.item.get(item_qid)

    subclass_claims = []
    for qid in subclass_of_qid:
        subclass_claims.append(
            wb.datatypes.Item(
                prop_nr=wb.prop_lookup["subclass of"],
                value=qid,
                references=[ref]
            )
        )
    item.claims.add(subclass_claims)

    response = item.write(
        summary=summary,
        clear=True
    )
    return response.id

In [None]:
wb_rocks = []

for index, row in mindat_rocks.iterrows():
    item = geokb.wbi.item.new()

    item.labels.set('en', row["name"])
    if row["name"] == row["desc"]:
        item.descriptions.set('en', f"a rock named {row['name']}")
    else:
        item.descriptions.set('en', row["desc"])

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup["instance of"],
            value=geokb.class_lookup["rock"],
            references=[ref_mindat]
        )
    )

    item.claims.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup["Mindat min Identifier"],
            value=str(row["id"])
        )
    )

    response = item.write(
        summary="Added initial named and identified rock item from Mindat"
    )
    wb_rocks.append({
        "mindat_id": row["id"],
        "label": row["name"],
        "qid": response.id
    })
    print("ADDED ROCK:", row["name"], response.id)

In [None]:
query_mindat_ids = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fmindat_min_code%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP99%20%3Fmindat_min_code%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D%0A"
df_wb_mindat_ids = geokb.wb_ref_data(query=query_mindat_ids)
df_wb_mindat_ids["qid"] = df_wb_mindat_ids["item"].apply(lambda x: x.split("/")[-1])
df_wb_mindat_ids["id"] = df_wb_mindat_ids["mindat_min_code"].astype(int)
mindat_id_lookup = df_wb_mindat_ids.set_index("id")["qid"].to_dict()

In [None]:
mindat_rocks["qid"] = mindat_rocks["id"].apply(lambda x: mindat_id_lookup[x] if x in mindat_id_lookup else None)
mindat_rocks["first_subclass_of"] = mindat_rocks["rock_parent"].apply(lambda x: mindat_id_lookup[x] if x in mindat_id_lookup else None)
mindat_rocks["second_subclass_of"] = mindat_rocks["rock_parent2"].apply(lambda x: mindat_id_lookup[x] if x in mindat_id_lookup else None)

In [None]:
missing_parents = []
missing_parents.extend(list(mindat_rocks[mindat_rocks["first_subclass_of"].isnull()].rock_parent.unique()))
missing_parents.extend(list(mindat_rocks[mindat_rocks["second_subclass_of"].isnull()].rock_parent2.unique()))
missing_parents = list(set([i for i in missing_parents if i != 0]))
missing_parents

In [None]:
for mindat_id in missing_parents:
    new_qid = wb_item_from_mindat(
        mindat_id=mindat_id,
        df_mindat=df_mindat_items,
        wb=geokb,
        ref_mindat=ref_mindat,
        summary="Added initial rock parent item from Mindat source"
    )
    print(mindat_id, new_qid)

In [None]:
for index, row in mindat_rocks.iterrows():
    subclass_of_qid = [row["first_subclass_of"]]
    if isinstance(row["second_subclass_of"], str):
        subclass_of_qid.append(row["second_subclass_of"])

    updated_qid = add_subclass_of_claim(
        wb=geokb,
        ref=ref_mindat,
        item_qid=row["qid"],
        subclass_of_qid=subclass_of_qid,
        summary="Added to primary parent rock class as subclass"
    )

    print("ADDED classification", updated_qid)



# Commodities

In [None]:
mindat_commodities = df_mindat_items[df_mindat_items.entrytype_text == "commodity"]
mindat_commodities.head()

In [None]:
df_mindat_items[(df_mindat_items["name"].str.startswith("commodity:")) & (~df_mindat_items.id.isin(mindat_commodities.id))]

In [None]:

rock_types = df_mindat_items[df_mindat_items.id.isin(df_mindat_items.rock_parent) | df_mindat_items.id.isin(df_mindat_items.rock_parent2)].reset_index()

In [None]:
rock_types.mask(rock_types.isin(['',0,[]])).dropna(axis=1)

In [None]:
df_mindat_items.rock_parent.unique()

In [None]:
ids = df_mindat_items["name"]
df_mindat_items[ids.isin(ids[ids.duplicated()])].sort_values("name")

# df[ids.isin(ids[ids.duplicated()])].sort_values("ID")

In [None]:
df_mindat_items[(df_mindat_items.mindat_formula.astype(str) != '') & (df_mindat_items.synid == 0) & (df_mindat_items.varietyof == 0)]

# Do Wikibase Stuff