We will use place names throughout the GeoKB. These will often accompany geographic coordinate claims on items, providing simple name-based characterization that is useful in searches and reporting. Some place name claims may be derived by bots operating within the GeoKB that work from coordinates; others may come in from source material.

Our GeoKB is likely not an authority for any place names, so we need to develop a series of bots to handle one-time and potentially periodic updating of place references. This notebook starts to build these out, and we may move to more purpose-built code for the bots in future.

# Bot Operator

Following the principle outlined in the README, there is a specific bot account for handling this process. Establishing foundational place name/geospatial references will be something we need to revisit routinely as we bring in additional data sources.

In [6]:
import os
import pandas as pd
import requests
from zipfile import ZipFile
from io import BytesIO
import swifter
import geopandas as gpd

from utils import (
    sparql_query,
    query_by_item_label,
    property_lookup
)

from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator import WikibaseIntegrator, wbi_login, wbi_helpers
from wikibaseintegrator.datatypes import Item, String, ExternalID, URL

In [2]:
# Visit the README for information on these variables
wbi_config['MEDIAWIKI_API_URL'] = os.environ['MEDIAWIKI_API_URL']
wbi_config['SPARQL_ENDPOINT_URL'] = os.environ['SPARQL_ENDPOINT_URL']
wbi_config['WIKIBASE_URL'] = os.environ['WIKIBASE_URL']

# Use bot account for this specific task
geokb_auth = wbi_login.Login(
    user=os.environ['WB_BOT_GEO'], 
    password=os.environ['WB_PASS_GEO']
)
wbi = WikibaseIntegrator(login=geokb_auth)
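
The configuration cell above reads five environment variables directly from `os.environ`, so a missing one surfaces as a bare `KeyError`. A minimal sketch of a pre-flight check, assuming the variable names used above; the helper name `missing_env_vars` is hypothetical and not part of the GeoKB code:

```python
import os

# Variable names taken from the configuration cell above
REQUIRED_VARS = [
    "MEDIAWIKI_API_URL",
    "SPARQL_ENDPOINT_URL",
    "WIKIBASE_URL",
    "WB_BOT_GEO",
    "WB_PASS_GEO",
]

def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Set these variables before logging in:", ", ".join(missing))
```

Running something like this before the login cell gives one readable message listing everything that still needs to be set.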

# Properties and Classes

Working through places introduces new properties and classification items:

* geographic region (aka place)
    * Rather than try to be exhaustive on this, I stuck with a simple high-level item (subclass of entity) for now. We can make this more complex in future if needed.
* US State
    * I debated this for some time but ended up using one of the methods employed in Wikidata, which has this very specific classifier. Wikidata also uses a much broader classification scheme for "administrative units" of various kinds within some larger administrative unit. With this approach, we will need equally specific classifiers such as Canadian Province and Mexican State. In a lot of ways, I'd rather keep the semantics at a more general level, but for now, this is what may communicate best in our context.
* ISO 3166-1 alpha-2 code, ISO 3166-2 code, FIPS 5-2 alpha code (US states), FIPS 5-2 numeric code (US states)
    * Specific properties (ExternalID datatype) adopted from Wikidata as identifiers we need for linking to other data sources



In [3]:
geokb_props = property_lookup()
geokb_props

{'instance of': 'P1',
 'subclass of': 'P2',
 'reference item': 'P3',
 'reference url': 'P4',
 'reference statement': 'P5',
 'coordinate location': 'P6',
 'publication date': 'P7',
 'subject matter': 'P8',
 'ranking': 'P9',
 'ISO 3166-1 alpha-2 code': 'P10',
 'located in the administrative territorial entity': 'P11',
 'ISO 3166-2 code': 'P12',
 'FIPS 5-2 alpha code (US states)': 'P13',
 'FIPS 5-2 numeric code (US states)': 'P14',
 'corresponding wikidata property': 'P15',
 'related wikidata item': 'P16',
 'element symbol': 'P17',
 'SEDAR Identifier': 'P18',
 'MRDS commodity code': 'P19',
 'USGS Thesaurus ID': 'P20'}
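
The label-to-ID mapping above could stand in for hard-coded property IDs in later cells. A minimal sketch, assuming `property_lookup()` returns a plain dict like the output shown; `prop_ids` is a hypothetical helper, and the sample dict is a subset copied from that output so the sketch runs on its own:

```python
# Subset of the mapping printed above
sample_props = {
    "instance of": "P1",
    "reference url": "P4",
    "ISO 3166-1 alpha-2 code": "P10",
    "located in the administrative territorial entity": "P11",
    "ISO 3166-2 code": "P12",
}

def prop_ids(props, *labels):
    """Resolve property labels to IDs, failing loudly on any unknown label."""
    unknown = [label for label in labels if label not in props]
    if unknown:
        raise KeyError(f"Properties not found in lookup: {unknown}")
    return tuple(props[label] for label in labels)

instance_of_id, country_code_id = prop_ids(
    sample_props, "instance of", "ISO 3166-1 alpha-2 code"
)
print(instance_of_id, country_code_id)  # P1 P10
```

Resolving by label keeps the notebook readable and catches a renamed or missing property before any write is attempted.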

# Countries

For our mineral assessment use cases, we need items in GeoKB for every country where we have something like a mineral occurrence record or a document dealing with a mineral prospect in that country. So, we may as well input every country in the world from some reasonable source. There are lots of places we could go for this, but the Wikidata source of country records is pretty robust with a massive slate of properties for many countries because they are in such heavy use within the global knowledgebase.

Here, I run a query that pulls back just the essential bits we need right now:
* country name (ignoring aliases for now)
* country description
* two-character ISO 3166-1 alpha-2 country code

We might need numeric codes at some point as well, but this is the minimum to start.

In [None]:
wd_country_query = """
SELECT ?country ?countryLabel ?countryDescription ?country_code WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?country wdt:P31 wd:Q6256.
  OPTIONAL { ?country wdt:P297 ?country_code. }
}
"""

wd_countries = sparql_query(
    endpoint='https://query.wikidata.org/sparql',
    query=wd_country_query,
    output='dataframe'
)
wd_countries.head()

In [None]:
wd_countries[wd_countries.countryLabel.str.contains('United')]

In [None]:
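def get_label_id(label):
    """Return the ID of the single GeoKB item matched by label, or None for zero or multiple matches."""
    label_query = query_by_item_label(label=label, include_aliases=False)
    geokb_results = wbi_helpers.execute_sparql_query(label_query)
    bindings = geokb_results["results"]["bindings"]
    if len(bindings) == 1:
        # The item value is an entity URI; the ID is its last path segment
        return bindings[0]["item"]["value"].split('/')[-1]
    return None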
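
`get_label_id` relies on the standard SPARQL JSON results layout (`results`, then `bindings`). A sketch that separates the parsing from the query and distinguishes "no match" from "ambiguous match", assuming the binding variable is named `item` as above and that entity URIs end in the item ID; the function names and the example URI are hypothetical:

```python
def qids_from_bindings(results, variable="item"):
    """Extract entity IDs from SPARQL JSON results by taking the last URI segment."""
    bindings = results.get("results", {}).get("bindings", [])
    return [b[variable]["value"].rsplit("/", 1)[-1] for b in bindings if variable in b]

def single_qid(results, variable="item"):
    """Return the only matching ID, None if nothing matched, or raise if ambiguous."""
    qids = qids_from_bindings(results, variable)
    if len(qids) > 1:
        raise ValueError(f"Ambiguous label, matched {len(qids)} items: {qids}")
    return qids[0] if qids else None

# Hypothetical response shaped like SPARQL JSON results
example = {
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": "https://example.org/entity/Q161"}}
        ]
    }
}
print(single_qid(example))  # Q161
```

Raising on multiple matches is one possible design choice; it makes duplicate labels visible instead of treating them the same as a missing item.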

In [None]:
get_label_id('United States of America')

In [None]:
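def add_item(label: str, description: str, aliases: list|str|None, claims: list|None):
    """Create a new GeoKB item with an English label and description, plus optional aliases and claims."""
    item = wbi.item.new()

    item.labels.set(language='en', value=label)
    item.descriptions.set(language='en', value=description)
    # Both arguments may be None, so only set them when something was supplied
    if aliases:
        item.aliases.set(language='en', values=aliases)
    if claims:
        item.claims.add(claims)

    item.write()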
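
`add_item` accepts aliases as a list, a single string, or `None`, and values read from a DataFrame row may also be NaN. A minimal sketch of a normalizer that could run before aliases are set; `clean_aliases` is a hypothetical helper, not part of the notebook's `utils`:

```python
import math

def clean_aliases(aliases):
    """Coerce a str, list, None, or NaN into a list of unique, non-empty strings."""
    if aliases is None:
        return []
    if isinstance(aliases, float) and math.isnan(aliases):
        return []
    if isinstance(aliases, str):
        aliases = [aliases]
    cleaned = []
    for alias in aliases:
        # Skip non-strings and blanks, and drop duplicates after trimming
        if isinstance(alias, str) and alias.strip() and alias.strip() not in cleaned:
            cleaned.append(alias.strip())
    return cleaned

print(clean_aliases("US"))                # ['US']
print(clean_aliases(float("nan")))        # []
print(clean_aliases(["AK", " AK ", ""]))  # ['AK']
```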


In [None]:
instance_of = 'P1'
country_code_prop = 'P10'
country_class = 'Q27'

instance_of_country_claim = Item(
    prop_nr=instance_of, 
    value=country_class
)

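# South Africa (ZA) is excluded from this run
for index, row in wd_countries[wd_countries.country_code != 'ZA'].iterrows():
    print("PROCESSING:", row.countryLabel)
    claims = [instance_of_country_claim]
    # The code is optional in the query, so it may be missing (None, or NaN, which is truthy)
    has_code = pd.notna(row.country_code) and bool(row.country_code)
    if has_code:
        country_code_claim = String(
            prop_nr=country_code_prop,
            value=row.country_code
        )
        claims.append(country_code_claim)

    add_item(
        label=row.countryLabel,
        description=row.countryDescription,
        aliases=row.country_code if has_code else None,
        claims=claims
    )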
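
Re-running the loop above would create a second item for every country already loaded. A sketch of splitting labels into those to create and those that already exist, assuming a lookup such as `get_label_id` is passed in as a callable; `plan_items` and the stub lookup are hypothetical:

```python
def plan_items(labels, lookup):
    """Split labels into ones to create and ones that already resolve to an ID."""
    to_create, existing = [], {}
    for label in labels:
        item_id = lookup(label)
        if item_id is None:
            to_create.append(label)
        else:
            existing[label] = item_id
    return to_create, existing

# Stub standing in for a real lookup against the knowledge base
known = {"United States of America": "Q161"}
to_create, existing = plan_items(
    ["United States of America", "Canada"], known.get
)
print(to_create)  # ['Canada']
print(existing)   # {'United States of America': 'Q161'}
```

One caveat with this approach: a lookup that returns `None` for ambiguous labels as well as missing ones would put ambiguous labels in the "create" list, so those cases may need separate handling.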


# Sub-country Units

At a minimum, we also need U.S. states and territories, and we may need states/provinces from other countries, since those show up in some of our data. We will probably also want U.S. counties. We'll introduce additional sub-country administrative units as they come up in our data sources, working through the same overall logic of identifying and processing reasonable reference sources.

In [None]:
wd_us_states_query = """
SELECT DISTINCT ?state ?stateLabel ?stateDescription ?iso3166 ?fips_alpha ?fips_numeric WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?state wdt:P31 wd:Q35657.
  OPTIONAL { ?state wdt:P300 ?iso3166. }
  OPTIONAL { ?state wdt:P5087 ?fips_numeric. }
  OPTIONAL { ?state wdt:P5086 ?fips_alpha. }
}

"""

wd_us_states = sparql_query(
    endpoint='https://query.wikidata.org/sparql',
    query=wd_us_states_query,
    output='dataframe'
)
wd_us_states.head()

In [None]:
reference_url = 'P4'
us_state = 'Q229'
is_located_in = 'P11'
us = 'Q161'
iso_3166_2 = 'P12'
fips_alpha = 'P13'
fips_numeric = 'P14'

state_instance_of_claim = Item(
    prop_nr=instance_of,
    value=us_state
)

located_in_us_claim = Item(
    prop_nr=is_located_in,
    value=us
)

for index, row in wd_us_states.iterrows():
    # Cite the Wikidata entity URI as the reference for each identifier claim
    references = [URL(
        prop_nr=reference_url,
        value=row.state
    )]

    claims = [
        state_instance_of_claim,
        located_in_us_claim
    ]

    # Identifiers are optional in the query, so add a claim only when a value is present
    identifiers = [
        (iso_3166_2, row.iso3166),
        (fips_alpha, row.fips_alpha),
        (fips_numeric, row.fips_numeric)
    ]
    for prop_nr, value in identifiers:
        if pd.notna(value):
            claims.append(ExternalID(
                prop_nr=prop_nr,
                value=value,
                references=references
            ))

    add_item(
        label=row.stateLabel,
        description=row.stateDescription,
        aliases=row.fips_alpha if pd.notna(row.fips_alpha) else None,
        claims=claims
    )
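
Before writing state items, it may be worth checking that each identifier has the expected shape, since numeric FIPS codes in particular can lose a leading zero when passed through tabular tools. A minimal sketch, assuming the column names from the query above; `invalid_fields` is a hypothetical helper and the rows are illustration data rather than query output:

```python
import re

# Expected shapes for US state identifiers
PATTERNS = {
    "iso3166": re.compile(r"US-[A-Z]{2}"),
    "fips_alpha": re.compile(r"[A-Z]{2}"),
    "fips_numeric": re.compile(r"[0-9]{2}"),
}

def invalid_fields(row):
    """Return the names of identifier fields that are missing or malformed."""
    return [
        field for field, pattern in PATTERNS.items()
        if not isinstance(row.get(field), str) or not pattern.fullmatch(row[field])
    ]

# The second row has a numeric code that lost its leading zero
rows = [
    {"iso3166": "US-AK", "fips_alpha": "AK", "fips_numeric": "02"},
    {"iso3166": "US-AL", "fips_alpha": "AL", "fips_numeric": "1"},
]
for row in rows:
    print(row["iso3166"], invalid_fields(row))
# US-AK []
# US-AL ['fips_numeric']
```

Because `row.get(...)` and `row[...]` work on both dicts and pandas rows, the same check could be applied inside an `iterrows()` loop.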
