This notebook works through an experimental process for building out a set of "class items" in the GeoKB from a Wikidata source that has been through some deliberate curation. As in other cases, I am leveraging the fact that our knowledgebase lives in a MediaWiki instance, using simple wiki pages to house important details about what we are assembling in the GeoKB. Here, we are building subdisciplines of the broader scientific field of hydrology. We get some source data from Wikidata, refine it in OpenRefine, and then cache it to the talk page for the "superclass" item, hydrology, for later processing into new and improved GeoKB items. While we could push these changes directly from an OpenRefine instance, doing so would not capture some of the details in a form that we can point to as provenance and build upon for other use cases. This is an alternative pathway that we'll examine in more complex use cases to determine its utility.

In [59]:
import pandas as pd
from wbmaker import WikibaseConnection
import pytablewriter
import json
import yaml

geokb = WikibaseConnection('GEOKB_CLOUD')

# Get raw source from Wikidata
Here we pull the raw source data from Wikidata (the same data we worked through OpenRefine) and convert it to a MediaWiki table that will render as HTML in the wiki page. This has the effect of caching a point-in-time snapshot of the content we leveraged from Wikidata.

In [5]:
wd_hydrology_query = """
SELECT ?item ?itemLabel ?itemDescription ?practiced_by ?practiced_byLabel ?practiced_byDescription ?subclass_of ?subclass_ofLabel
WHERE {
  ?item wdt:P279* wd:Q42250 .
  ?item wdt:P279 ?subclass_of .
  OPTIONAL {
    ?item wdt:P3095 ?practiced_by .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

wd_hydrology = geokb.sparql_query(
    query=wd_hydrology_query,
    endpoint='https://query.wikidata.org/sparql'
)

raw_table_writer = pytablewriter.MediaWikiTableWriter()
raw_table_writer.from_dataframe(wd_hydrology)
raw_table = raw_table_writer.dumps()
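
For readers unfamiliar with the target format, the following is a minimal plain-Python sketch of the kind of wikitext table markup involved. It is not how `pytablewriter` is implemented, and `rows_to_wikitable` is a hypothetical helper written only for illustration; the sample values are labels that appear in this notebook's data.

```python
def rows_to_wikitable(headers, rows):
    """Render headers and rows as a MediaWiki wikitext table (illustrative only)."""
    lines = ['{| class="wikitable"']
    # Header cells start with "!" and are separated by "!!"
    lines.append("! " + " !! ".join(headers))
    for row in rows:
        lines.append("|-")  # row separator
        # Data cells start with "|" and are separated by "||"
        lines.append("| " + " || ".join(str(cell) for cell in row))
    lines.append("|}")
    return "\n".join(lines)


sample = rows_to_wikitable(
    ["itemLabel", "practiced_byLabel"],
    [["hydrology", "hydrologist"], ["hydrogeology", "hydrogeologist"]],
)
print(sample)
```

This sketch does not escape cell values, so a value containing a pipe character would break the markup; that is one reason to rely on a table-writing library for real data.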

# Get processed source from cache
There might eventually be a better programmatic way to do this, but for now I dump the refined data table to CSV, import it here, and convert it to a MediaWiki wikitext table that will render as HTML and be readable with pandas.

In [6]:
refined_hydrology_subclasses = pd.read_csv('../../../../Downloads/hydrology-subclasses.csv')
refined_hydrology_subclasses.fillna('', inplace=True)

processed_table_writer = pytablewriter.MediaWikiTableWriter()
processed_table_writer.from_dataframe(refined_hydrology_subclasses)
processed_table = processed_table_writer.dumps()

# Get processing history
There might eventually be a better programmatic way to do this, but for now I dump the processing history from OpenRefine to its raw JSON form and then convert that to YAML so we can stash it in the item talk page.

In [7]:
with open('../../../../Downloads/history.json') as f:
    processing_history = yaml.safe_dump(json.load(f), sort_keys=False)
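
Before saving the history, it can help to check what the exported file contains. The sketch below assumes the export is a JSON list of operation objects that each carry `op` and `description` keys, which is the usual shape of OpenRefine's extracted operation history but is not confirmed by this notebook. The sample entries are invented for illustration, and `summarize_history` is a hypothetical helper.

```python
import json


def summarize_history(history_json):
    """Return (op, description) pairs from an OpenRefine-style operation list."""
    operations = json.loads(history_json)
    # .get() keeps the sketch tolerant of entries that lack either key
    return [(op.get("op", ""), op.get("description", "")) for op in operations]


# Invented sample standing in for the contents of history.json
sample_history = json.dumps([
    {"op": "core/column-rename", "description": "Rename column a to b"},
    {"op": "core/text-transform", "description": "Trim whitespace in column b"},
])

for op, description in summarize_history(sample_history):
    print(op, "-", description)
```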

# Write to wiki page

In [13]:
hydrology_item_talk = 'Item_talk:Q158992'
hydrology_page = geokb.mw_site.pages[hydrology_item_talk]

In [11]:
# Write the final processed data from the cache as the first table on the page
hydrology_page.save(
    processed_table, 
    section="new", 
    summary="Source Data"
)

# Write the Wikidata query used
hydrology_page.save(
    "<sparql>" + wd_hydrology_query + "</sparql>", 
    section="new", 
    summary="Wikidata Query"
)

# Write the raw data from Wikidata
hydrology_page.save(
    raw_table, 
    section="new", 
    summary="Raw Wikidata"
)

# Write the processing history as YAML
hydrology_page.save(
    processing_history, 
    section="new", 
    summary="Processing History"
)

OrderedDict([('result', 'Success'),
             ('pageid', 213746),
             ('title', 'Item talk:Q158992'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 492783),
             ('newrevid', 492784),
             ('newtimestamp', '2023-10-23T12:14:36Z'),
             ('watched', '')])
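
Because each `save` call above uses `section="new"`, re-running the cell appends another copy of all four sections. One way to guard against that, sketched here as an assumption rather than something this notebook does, is to check the current wikitext for a heading before saving. `has_section` is a hypothetical helper; in practice the wikitext could come from the page object's `text()` method.

```python
import re


def has_section(wikitext, title):
    """Return True if wikitext contains a level-2 heading with this exact title."""
    pattern = r"^==\s*" + re.escape(title) + r"\s*==\s*$"
    return re.search(pattern, wikitext, flags=re.MULTILINE) is not None


# Invented page text standing in for the current talk page content
page_text = (
    "== Source Data ==\n"
    "{| class=\"wikitable\"\n"
    "|}\n"
    "\n"
    "== Wikidata Query ==\n"
    "<sparql>SELECT ?item WHERE {}</sparql>\n"
)

print(has_section(page_text, "Source Data"))         # True
print(has_section(page_text, "Processing History"))  # False
```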

# Process items and claims
The following is something I could refine into a standard pipeline. It takes a source table like the one I produced, which has information about new science disciplines, types of scientists, and the connections between them, and builds items within the GeoKB. The same basic logic could apply to anything with this kind of relationship. For now, there were a couple of specialized things I needed to handle in the following code, so I'll revisit and improve on this with the next case.

In [14]:
hydrology_fields_source = pd.read_html(f'https://geokb.wikibase.cloud/wiki/{hydrology_item_talk}')[0]
hydrology_fields_source.head()

Unnamed: 0,item,itemLabel,geokb_discipline_qid,itemDescription,practiced_by,practiced_byLabel,geokb_profession_qid,practiced_byDescription,subclass_of,subclass_ofLabel,geokb_superclass_qid
0,http://www.wikidata.org/entity/Q42250,hydrology,Q158992,"science that deals with the water above, on an...",http://www.wikidata.org/entity/Q3644587,hydrologist,Q159575,earth scientist that studies water,http://www.wikidata.org/entity/Q336,earth sciences,Q158971
1,http://www.wikidata.org/entity/Q179509,hydrogeology,Q158991,branch of science that studies distribution an...,http://www.wikidata.org/entity/Q35380338,hydrogeologist,Q159638,earth scientist that studies ground water,http://www.wikidata.org/entity/Q1069,geology,Q158984
2,http://www.wikidata.org/entity/Q179509,hydrogeology,Q158991,branch of science that studies distribution an...,http://www.wikidata.org/entity/Q35380338,hydrogeologist,Q159638,earth scientist that studies ground water,http://www.wikidata.org/entity/Q42250,hydrology,Q158992
3,http://www.wikidata.org/entity/Q912065,hydrochemistry,,branch of hydrology that deals with the chemic...,http://www.wikidata.org/entity/Q110624867,hydrochemist,,earth scientist who studies the chemical chara...,http://www.wikidata.org/entity/Q42250,hydrology,Q158992
4,http://www.wikidata.org/entity/Q912065,hydrochemistry,,branch of hydrology that deals with the chemic...,http://www.wikidata.org/entity/Q110624867,hydrochemist,,earth scientist who studies the chemical chara...,http://www.wikidata.org/entity/Q161764,geochemistry,Q158980


In [22]:
practiced_by_query = """
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?practiced_by ?practiced_byLabel
WHERE {
  ?item wdt:P158 ?practiced_by .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

practiced_by = geokb.sparql_query(practiced_by_query)
practiced_by['geokb_superclass_qid'] = practiced_by['item'].apply(lambda x: x.split('/')[-1])
practiced_by['practiced_by_parent'] = practiced_by['practiced_by'].apply(lambda x: x.split('/')[-1])
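
The `split('/')[-1]` idiom above assumes every value is an entity URI string. A slightly more defensive variant, sketched here with a hypothetical `qid_from_uri` helper, returns `None` for anything that does not end in a Q identifier.

```python
import re


def qid_from_uri(value):
    """Return the trailing Q identifier of an entity URI, or None if there is none."""
    if not isinstance(value, str):
        return None  # e.g. a missing value represented as float NaN
    match = re.search(r"/(Q\d+)$", value)
    return match.group(1) if match else None


print(qid_from_uri("http://www.wikidata.org/entity/Q42250"))  # Q42250
print(qid_from_uri(float("nan")))                             # None
```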

In [24]:
geokb_practiced_by_merge = pd.merge(
    left=hydrology_fields_source,
    right=practiced_by[['geokb_superclass_qid', 'practiced_by_parent']],
    how='left',
    on='geokb_superclass_qid'
)

In [29]:
geokb_practiced_by_merge

Unnamed: 0,item,itemLabel,geokb_discipline_qid,itemDescription,practiced_by,practiced_byLabel,geokb_profession_qid,practiced_byDescription,subclass_of,subclass_ofLabel,geokb_superclass_qid,practiced_by_parent
0,http://www.wikidata.org/entity/Q42250,hydrology,Q158992,"science that deals with the water above, on an...",http://www.wikidata.org/entity/Q3644587,hydrologist,Q159575,earth scientist that studies water,http://www.wikidata.org/entity/Q336,earth sciences,Q158971,Q159571
1,http://www.wikidata.org/entity/Q179509,hydrogeology,Q158991,branch of science that studies distribution an...,http://www.wikidata.org/entity/Q35380338,hydrogeologist,Q159638,earth scientist that studies ground water,http://www.wikidata.org/entity/Q1069,geology,Q158984,Q159572
2,http://www.wikidata.org/entity/Q179509,hydrogeology,Q158991,branch of science that studies distribution an...,http://www.wikidata.org/entity/Q35380338,hydrogeologist,Q159638,earth scientist that studies ground water,http://www.wikidata.org/entity/Q42250,hydrology,Q158992,Q159575
3,http://www.wikidata.org/entity/Q912065,hydrochemistry,,branch of hydrology that deals with the chemic...,http://www.wikidata.org/entity/Q110624867,hydrochemist,,earth scientist who studies the chemical chara...,http://www.wikidata.org/entity/Q42250,hydrology,Q158992,Q159575
4,http://www.wikidata.org/entity/Q912065,hydrochemistry,,branch of hydrology that deals with the chemic...,http://www.wikidata.org/entity/Q110624867,hydrochemist,,earth scientist who studies the chemical chara...,http://www.wikidata.org/entity/Q161764,geochemistry,Q158980,Q159577
5,http://www.wikidata.org/entity/Q2330259,potamology,,study of rivers; branch of geography,http://www.wikidata.org/entity/Q107448718,potamologist,,hydrologist that studies rivers,http://www.wikidata.org/entity/Q42250,hydrology,Q158992,Q159575
6,http://www.wikidata.org/entity/Q2363192,ecohydrology,,interdisciplinary field studying the interacti...,http://www.wikidata.org/entity/Q113057726,ecohydrologist,,interdisciplinary scientist studying the inter...,http://www.wikidata.org/entity/Q7150,ecology,Q158972,Q159589
7,http://www.wikidata.org/entity/Q2363192,ecohydrology,,interdisciplinary field studying the interacti...,http://www.wikidata.org/entity/Q113057726,ecohydrologist,,interdisciplinary scientist studying the inter...,http://www.wikidata.org/entity/Q42250,hydrology,Q158992,Q159575
8,http://www.wikidata.org/entity/Q2883300,agricultural hydrology,,study of water balance components intervening ...,,,,,http://www.wikidata.org/entity/Q42250,hydrology,Q158992,Q159575
9,http://www.wikidata.org/entity/Q2883300,agricultural hydrology,,study of water balance components intervening ...,,,,,http://www.wikidata.org/entity/Q396152,hydrology,Q158992,Q159575


In [31]:
new_scientist_classes = geokb_practiced_by_merge[
    geokb_practiced_by_merge['geokb_profession_qid'].isnull()
    &
    geokb_practiced_by_merge['practiced_by_parent'].notnull()
][
    [
        'practiced_by',
        'practiced_byLabel',
        'practiced_byDescription',
        'practiced_by_parent'
    ]
].groupby(['practiced_by', 'practiced_byLabel', 'practiced_byDescription'], as_index=False)['practiced_by_parent'].agg(list)

new_scientist_classes

Unnamed: 0,practiced_by,practiced_byLabel,practiced_byDescription,practiced_by_parent
0,http://www.wikidata.org/entity/Q101508301,hydroecologist,researcher at the interface between the hydrol...,"[Q159589, Q159575]"
1,http://www.wikidata.org/entity/Q105552827,forest hydrologist,earth scientist who gathers and analyzes infor...,[Q159575]
2,http://www.wikidata.org/entity/Q107448718,potamologist,hydrologist that studies rivers,[Q159575]
3,http://www.wikidata.org/entity/Q110624867,hydrochemist,earth scientist who studies the chemical chara...,"[Q159575, Q159577]"
4,http://www.wikidata.org/entity/Q113057726,ecohydrologist,interdisciplinary scientist studying the inter...,"[Q159589, Q159575]"


In [33]:
new_scientist_classes_qids = []

for _, row in new_scientist_classes.iterrows():
    item = geokb.wbi.item.new()
    item.labels.set('en', row['practiced_byLabel'])
    item.descriptions.set('en', row['practiced_byDescription'])

    subclass_claims = []
    for parent_qid in row['practiced_by_parent']:
        subclass_claims.append(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['subclass of'],
                value=parent_qid
            )
        )
    item.claims.add(subclass_claims)

    item.claims.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['same as'],
            value=row['practiced_by']
        )
    )

    response = item.write(
        summary="Added new class of scientist for hydrology-related fields from Wikidata source"
    )
    new_scientist_classes_qids.append({
        'qid': response.id,
        'label': row['practiced_byLabel']
    })
    print(response.id, row['practiced_byLabel'])

Q161195 hydroecologist
Q161196 forest hydrologist
Q161197 potamologist
Q161198 hydrochemist
Q161199 ecohydrologist


In [34]:
new_science_classes = geokb_practiced_by_merge[
    geokb_practiced_by_merge['geokb_discipline_qid'].isnull()
][
    [
        'item',
        'itemLabel',
        'itemDescription',
        'geokb_superclass_qid'
    ]
].groupby(['item', 'itemLabel', 'itemDescription'], as_index=False)['geokb_superclass_qid'].agg(list)

new_science_classes

Unnamed: 0,item,itemLabel,itemDescription,geokb_superclass_qid
0,http://www.wikidata.org/entity/Q102073785,fault zone hydrogeology,study of how deformed rocks alter fluid flows,[Q158991]
1,http://www.wikidata.org/entity/Q10295440,statistical hydrology,branch of hydrology that uses mathematical and...,"[Q161190, Q158992]"
2,http://www.wikidata.org/entity/Q105552831,forest hydrology,branch of science that combines aspects of hyd...,"[Q158992, Q161191]"
3,http://www.wikidata.org/entity/Q110968692,watershed hydrology,"study of how water flows across a watershed, w...","[Q158992, Q158992]"
4,http://www.wikidata.org/entity/Q111327099,urban hydrology,scientific study of the hydrological cycle wit...,[Q158992]
5,http://www.wikidata.org/entity/Q115827802,coastal hydrogeology,branch of hydrogeology that focuses on the mov...,[Q158991]
6,http://www.wikidata.org/entity/Q13416553,Isotope hydrology,field of geochemistry and hydrology that uses ...,[Q158992]
7,http://www.wikidata.org/entity/Q20638505,hydrogeomorphology,interaction and linkage of hydrologic processe...,"[Q158992, Q158985]"
8,http://www.wikidata.org/entity/Q2330259,potamology,study of rivers; branch of geography,[Q158992]
9,http://www.wikidata.org/entity/Q2363192,ecohydrology,interdisciplinary field studying the interacti...,"[Q158972, Q158992]"
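
The grouped parent lists can contain repeats, as in the watershed hydrology row above (`[Q158992, Q158992]`). This most likely happens because the source table has one row per Wikidata `subclass of` statement, and more than one Wikidata parent can map to the same GeoKB item. The following plain-Python sketch shows the same grouping with order-preserving deduplication. It illustrates the idea and is not the code used in this notebook; `group_parents` is a hypothetical helper.

```python
def group_parents(rows):
    """Group (label, parent_qid) pairs into label -> unique parents, keeping first-seen order."""
    grouped = {}
    for label, parent_qid in rows:
        parents = grouped.setdefault(label, [])
        if parent_qid not in parents:
            parents.append(parent_qid)
    return grouped


rows = [
    ("watershed hydrology", "Q158992"),
    ("watershed hydrology", "Q158992"),
    ("ecohydrology", "Q158972"),
    ("ecohydrology", "Q158992"),
]
print(group_parents(rows))
```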


In [35]:
new_science_classes_qids = []

for _, row in new_science_classes.iterrows():
    item = geokb.wbi.item.new()
    item.labels.set('en', row['itemLabel'])
    item.descriptions.set('en', row['itemDescription'])

    subclass_claims = []
    for parent_qid in row['geokb_superclass_qid']:
        subclass_claims.append(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['subclass of'],
                value=parent_qid
            )
        )
    item.claims.add(subclass_claims)

    item.claims.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['same as'],
            value=row['item']
        )
    )

    response = item.write(
        summary="Added new scientific discipline for hydrology-related fields from Wikidata source"
    )
    new_science_classes_qids.append({
        'qid': response.id,
        'label': row['itemLabel']
    })
    print(response.id, row['itemLabel'])

Q161200 fault zone hydrogeology
Q161201 statistical hydrology
Q161202 forest hydrology
Q161203 watershed hydrology
Q161204 urban hydrology
Q161205 coastal hydrogeology
Q161206 Isotope hydrology
Q161207 hydrogeomorphology
Q161208 potamology
Q161209 ecohydrology
Q161210 agricultural hydrology
Q161211 hydroecology
Q161212 telmatology
Q161213 river morphology
Q161214 catchment hydrology
Q161215 palaeohydrology
Q161216 fluvial geomorphology
Q161217 snow hydrology
Q161218 hydrochemistry


In [42]:
discipline_qid_lookup = pd.DataFrame(new_science_classes_qids).set_index('label')['qid'].to_dict()
profession_qid_lookup = pd.DataFrame(new_scientist_classes_qids).set_index('label')['qid'].to_dict()

In [54]:
# One row per new discipline and its Wikidata practitioner label
practitioner_mapping = geokb_practiced_by_merge[
    geokb_practiced_by_merge['geokb_discipline_qid'].isnull()
][['itemLabel', 'practiced_byLabel']].reset_index(drop=True)
practitioner_mapping['discipline_qid'] = practitioner_mapping['itemLabel'].apply(lambda x: discipline_qid_lookup[x])
# Fall back to the existing hydrologist item (Q159575) when no new profession was created
practitioner_mapping['profession_qid'] = practitioner_mapping['practiced_byLabel'].apply(lambda x: profession_qid_lookup.get(x, 'Q159575'))

In [60]:
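# Collect the professions for each new discipline
discipline_professions = practitioner_mapping[['discipline_qid', 'profession_qid']].groupby(
    'discipline_qid', as_index=False
)['profession_qid'].agg(list)

for _, row in discipline_professions.iterrows():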
    item = geokb.wbi.item.get(row['discipline_qid'])
    practiced_by_claims = []
    for profession_qid in row['profession_qid']:
        practiced_by_claims.append(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['practiced by'],
                value=profession_qid
            )
        )
    item.claims.add(practiced_by_claims, action_if_exists=geokb.action_if_exists.REPLACE_ALL)
    response = item.write(
        summary="Added new practitioner mapping for hydrology-related fields from Wikidata source"
    )
    print(response.id)


Q161200
Q161201
Q161202
Q161203
Q161204
Q161205
Q161206
Q161207
Q161208
Q161209
Q161210
Q161211
Q161212
Q161213
Q161214
Q161215
Q161216
Q161217
Q161218


In [61]:
# Exclude the fallback hydrologist item (Q159575), which already existed before this run
profession_disciplines = practitioner_mapping[
    practitioner_mapping['profession_qid'] != 'Q159575'
][['discipline_qid', 'profession_qid']].groupby(
    'profession_qid', as_index=False
)['discipline_qid'].agg(list)

for _, row in profession_disciplines.iterrows():
    item = geokb.wbi.item.get(row['profession_qid'])
    practitioner_claims = []
    for discipline_qid in row['discipline_qid']:
        practitioner_claims.append(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['practitioner of'],
                value=discipline_qid
            )
        )
    item.claims.add(practitioner_claims, action_if_exists=geokb.action_if_exists.REPLACE_ALL)
    response = item.write(
        summary="Added new practitioner mapping for hydrology-related fields from Wikidata source"
    )
    print(response.id)

Q161195
Q161196
Q161197
Q161198
Q161199
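
The last two loops write both directions of one relationship: `practiced by` on each discipline and `practitioner of` on each profession. As a plain-Python sketch of that idea (not the notebook's pandas code), both mappings can be derived from a single list of pairs. `build_mappings` is a hypothetical helper, and the sample pairs reuse QIDs printed above.

```python
from collections import defaultdict

FALLBACK_PROFESSION = "Q159575"  # hydrologist, used when no specific profession exists


def build_mappings(pairs, skip_profession=FALLBACK_PROFESSION):
    """Return (discipline -> professions, profession -> disciplines) from QID pairs."""
    by_discipline = defaultdict(list)
    by_profession = defaultdict(list)
    for discipline_qid, profession_qid in pairs:
        if profession_qid not in by_discipline[discipline_qid]:
            by_discipline[discipline_qid].append(profession_qid)
        # The reverse direction skips the pre-existing fallback item
        if profession_qid == skip_profession:
            continue
        if discipline_qid not in by_profession[profession_qid]:
            by_profession[profession_qid].append(discipline_qid)
    return dict(by_discipline), dict(by_profession)


pairs = [
    ("Q161208", "Q161197"),  # potamology, potamologist
    ("Q161218", "Q161198"),  # hydrochemistry, hydrochemist
    ("Q161218", "Q161198"),  # repeated row, as in the source table
    ("Q161210", "Q159575"),  # agricultural hydrology, fallback hydrologist
]
disciplines, professions = build_mappings(pairs)
print(disciplines)
print(professions)
```

Deriving both directions from one list keeps the two sets of claims consistent with each other, at the cost of holding the pairs in memory, which is negligible at this scale.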
