We will use place names throughout the GeoKB. These will often accompany geographic coordinate claims on items, providing a simple name-based characterization that is useful in all kinds of searches and reporting. Place name claims may be derived by bots that operate internally to the GeoKB, building from coordinates, or they may come in directly from source material.

Our GeoKB is likely not an authority for any place names, so we need to develop a series of bots to handle one-time and potentially periodic updating of place references. This notebook starts to build these out, and we may move to more purpose-built code for the bots in future.

# Properties and Classes

Working through places introduces new properties and classification items:

* geographic region (aka place)
    * Rather than try to be exhaustive on this, I stuck with a simple high level item (subclass of entity) for now. We can make this more complex in future if needed.
* ISO 3166-1 alpha-2 code
    * Specific property pulled from Wikidata as an identifier.

In [26]:
import os
import pandas as pd
from utils import (
    sparql_query,
    get_wb,
    process_item
)
import requests
from zipfile import ZipFile
from io import BytesIO


In [2]:
sparql_endpoint = os.environ['SPARQL_ENDPOINT']
wb_domain = os.environ['WB_DOMAIN']
geokb_init_sheet_id = '1dbuKc4cZJz0YY81B2xWXM5fId6gWgzmQar3hg3CI0Rw'

accepted_languages = ['en']

# Countries

For our mineral assessment use cases, we need items in GeoKB for every country where we have something like a mineral occurrence record or a document dealing with a mineral prospect in that country. So, we may as well input every country in the world from some reasonable source. There are lots of places we could go for this, but the Wikidata source of country records is pretty robust with a massive slate of properties for many countries because they are in such heavy use within the global knowledgebase.

Here, I run a query that pulls back just the essential bits we need right now:
* country name (ignoring aliases for now)
* country description
* two-character ISO 3166-1 alpha-2 country code

We might need numeric codes at some point as well, but this is the minimum to start.

In [23]:
wd_country_query = """
SELECT ?country ?countryLabel ?countryDescription ?country_code WHERE {
  ?country wdt:P31 wd:Q6256.
  OPTIONAL { ?country wdt:P297 ?country_code. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

wd_countries = sparql_query(
    endpoint='https://query.wikidata.org/sparql',
    query=wd_country_query,
    output='dataframe'
)
wd_countries.head()

Unnamed: 0,country,countryLabel,countryDescription,country_code
0,http://www.wikidata.org/entity/Q929,Central African Republic,country in Central Africa,CF
1,http://www.wikidata.org/entity/Q945,Togo,country in West Africa,TG
2,http://www.wikidata.org/entity/Q948,Tunisia,country in North Africa,TN
3,http://www.wikidata.org/entity/Q954,Zimbabwe,sovereign state in southern Africa,ZW
4,http://www.wikidata.org/entity/Q953,Zambia,sovereign state in Southern Africa,ZM
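
Because the country code is matched inside an `OPTIONAL` clause, some rows may come back without a code, and a country could appear more than once if Wikidata holds more than one code value for it. The following is a minimal sketch of a check for both cases. It uses a small hand-built dataframe in place of the live query result; the `Example Land` row and the `split_by_code` helper are hypothetical and only there for illustration.

```python
import pandas as pd

# Stand-in for the live query result. The first two rows mirror the output
# above; the third row is HYPOTHETICAL and exists only to show a missing code.
sample = pd.DataFrame(
    {
        "country": [
            "http://www.wikidata.org/entity/Q929",
            "http://www.wikidata.org/entity/Q945",
            "http://www.wikidata.org/entity/Q0",  # hypothetical placeholder URI
        ],
        "countryLabel": ["Central African Republic", "Togo", "Example Land"],
        "country_code": ["CF", "TG", None],
    }
)


def split_by_code(df, code_col="country_code"):
    """Split a dataframe into rows with and without a value in code_col."""
    has_code = df[code_col].notna()
    return df[has_code], df[~has_code]


with_code, without_code = split_by_code(sample)

# Rows sharing the same country URI would signal multiple code values.
duplicated = with_code[with_code.duplicated("country", keep=False)]

print(len(with_code), len(without_code), len(duplicated))
```

Rows without a code could be held back for manual review rather than loaded with an empty identifier claim.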


# Sub-country Units

At a minimum, we also need U.S. states and territories, and we may need states/provinces from other countries since those show up in some of our data. We'll probably want U.S. counties as well. We'll introduce additional sub-country administrative units as they come up in our data sources, working through the same overall logic of identifying and processing reasonable reference sources.
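
As a rough sketch of how the country query pattern might carry over to sub-country units, the helper below builds a similar query for any Wikidata class. The function name `build_subunit_query` is hypothetical, and the identifiers used here are assumptions to verify against Wikidata before use: `Q35657` for U.S. state and `P300` for the ISO 3166-2 code.

```python
def build_subunit_query(class_qid, code_property="P300"):
    """Build a SPARQL query for instances of a Wikidata class.

    ASSUMPTION: P300 is the Wikidata property for ISO 3166-2 codes.
    """
    return f"""
SELECT ?unit ?unitLabel ?unitDescription ?unit_code WHERE {{
  ?unit wdt:P31 wd:{class_qid}.
  OPTIONAL {{ ?unit wdt:{code_property} ?unit_code. }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


# ASSUMPTION: Q35657 is the Wikidata class for U.S. states.
wd_state_query = build_subunit_query("Q35657")
print(wd_state_query)
```

The resulting string could be passed to `sparql_query` in the same way as the country query. Territories and counties would need their own class identifiers, which are not worked out here.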

# GNIS Places

There are likely some key feature classes from the GNIS that we should establish a relationship with. While the GNIS is *A* source for some features, it may not be the *BEST* source for our purposes. Mines are a good example of the questions we need to work through:

* Should we simply pull all GNIS named mines as things we are likely to encounter and need to link to at some point?
* Should we pull mines from USMIN as a more contemporary source?
* Should we pull USMIN mines and then get other mines from GNIS?
* When and how should we handle mine references in MRDS that exist as mineral occurrences?

One way or the other, the whole point of the GeoKB effort is to help resolve these issues, giving us a living source with a set of rules for how things come into the knowledgebase and work out their differences.

The code blocks below pull the GNIS national file and read it into a dataframe. We look at the feature classes present and pull out mines as an example.

In [65]:
gnis_national_file = 'https://geonames.usgs.gov/docs/stategaz/NationalFile.zip'
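r_gnis_national_file = requests.get(gnis_national_file)
# Fail early on an HTTP error instead of trying to unzip an error page
r_gnis_national_file.raise_for_status()
z = ZipFile(BytesIO(r_gnis_national_file.content))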
# We know it's the first/only file and it's delimited with pipe
gnis_national = pd.read_csv(z.open(z.namelist()[0]), sep='|')

In [66]:
gnis_national.FEATURE_CLASS.unique()

array(['Stream', 'Valley', 'Park', 'Crossing', 'Cape', 'Summit', 'School',
       'Canal', 'Mine', 'Plain', 'Locale', 'Spring', 'Basin', 'Reservoir',
       'Well', 'Populated Place', 'Flat', 'Cliff', 'Lake', 'Arch', 'Gap',
       'Range', 'Dam', 'Bench', 'Ridge', 'Trail', 'Cemetery', 'Hospital',
       'Building', 'Arroyo', 'Falls', 'Bay', 'Swamp', 'Area', 'Pillar',
       'Rapids', 'Island', 'Bend', 'Church', 'Bar', 'Bridge', 'Civil',
       'Channel', 'Tunnel', 'Military', 'Tower', 'Levee', 'Crater',
       'Lava', 'Harbor', 'Woods', 'Airport', 'Reserve', 'Slope', 'Gut',
       'Beach', 'Post Office', 'Forest', 'Oilfield', 'Census', 'Glacier',
       'Sea', 'Isthmus', 'Unknown'], dtype=object)

In [67]:
mine_features = gnis_national[gnis_national.FEATURE_CLASS == 'Mine']
mine_features.head(10)

Unnamed: 0,FEATURE_ID,FEATURE_NAME,FEATURE_CLASS,STATE_ALPHA,STATE_NUMERIC,COUNTY_NAME,COUNTY_NUMERIC,PRIMARY_LAT_DMS,PRIM_LONG_DMS,PRIM_LAT_DEC,PRIM_LONG_DEC,SOURCE_LAT_DMS,SOURCE_LONG_DMS,SOURCE_LAT_DEC,SOURCE_LONG_DEC,ELEV_IN_M,ELEV_IN_FT,MAP_NAME,DATE_CREATED,DATE_EDITED
44,444,Yucca Mine,Mine,AZ,4,Mohave,15.0,343909N,1142231W,34.652509,-114.375235,,,,,451.0,1480.0,Topock,02/08/1980,05/01/2006
70,470,Abe Lincoln Mine,Mine,AZ,4,Yavapai,25.0,340244N,1123232W,34.045586,-112.542118,,,,,1192.0,3911.0,Morgan Butte,02/08/1980,
73,473,Abril Mine,Mine,AZ,4,Cochise,3.0,315429N,1095929W,31.90814,-109.991459,,,,,2031.0,6663.0,Cochise Stronghold,02/08/1980,
83,483,Adams Mine,Mine,AZ,4,Mohave,15.0,345808N,1142335W,34.968892,-114.393014,,,,,659.0,2162.0,Boundary Cone,02/08/1980,
118,519,Aguinaldo Mine,Mine,AZ,4,Pima,19.0,315508N,1111712W,31.918971,-111.286767,,,,,1139.0,3737.0,Stevens Mountain,02/08/1980,
146,547,Alabama Mine,Mine,AZ,4,Mohave,15.0,352027N,1133603W,35.340831,-113.600772,,,,,1501.0,4924.0,Valentine SE,02/08/1980,
162,563,Alaska Mine,Mine,AZ,4,Maricopa,13.0,334342N,1131854W,33.728366,-113.314918,,,,,572.0,1877.0,Weldon Hill,02/08/1980,
163,564,Ajax Mine,Mine,AZ,4,Cochise,3.0,320048N,1091243W,32.013422,-109.212006,,,,,1473.0,4833.0,Blue Mountain,02/08/1980,
174,575,Alcyone Mine,Mine,AZ,4,Mohave,15.0,345934N,1142425W,34.992781,-114.406904,,,,,664.0,2178.0,Boundary Cone,02/08/1980,
187,588,Alice Mine,Mine,AZ,4,Pinal,21.0,330756N,1105525W,33.132284,-110.923722,,,,,935.0,3068.0,Hot Tamale Peak,02/08/1980,
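
To get from these rows toward something loadable, one option is to reduce each mine to a small record holding a label, a description, and a coordinate pair. The sketch below works on two rows copied from the output above. The `mine_records` helper, the record keys, and the description wording are all hypothetical and do not reflect any settled GeoKB convention.

```python
import pandas as pd

# Two rows copied from the GNIS mine output above, limited to the columns used.
mines = pd.DataFrame(
    {
        "FEATURE_ID": [444, 470],
        "FEATURE_NAME": ["Yucca Mine", "Abe Lincoln Mine"],
        "STATE_ALPHA": ["AZ", "AZ"],
        "COUNTY_NAME": ["Mohave", "Yavapai"],
        "PRIM_LAT_DEC": [34.652509, 34.045586],
        "PRIM_LONG_DEC": [-114.375235, -112.542118],
    }
)


def mine_records(df):
    """Turn GNIS rows into plain dicts, skipping rows without coordinates."""
    located = df.dropna(subset=["PRIM_LAT_DEC", "PRIM_LONG_DEC"])
    records = []
    for row in located.itertuples(index=False):
        records.append(
            {
                "gnis_id": int(row.FEATURE_ID),
                "label": row.FEATURE_NAME,
                # Hypothetical description format
                "description": f"mine in {row.COUNTY_NAME}, {row.STATE_ALPHA}",
                "lat": float(row.PRIM_LAT_DEC),
                "lon": float(row.PRIM_LONG_DEC),
            }
        )
    return records


records = mine_records(mines)
print(records[0])
```

The same function could be applied to `mine_features` once the sourcing questions above are settled.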


# Adding to GeoKB

With this information in hand, we can stub out basic country reference items with the bare minimum of information needed to establish linkages. We can also include the Wikidata identifier for future use in retrieving additional information when it comes up in a use case.
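
The query returns each country as a full entity URI, so keeping the Wikidata identifier means pulling the trailing `Q` number off that URI. A minimal sketch follows; the helper name `qid_from_uri` is hypothetical.

```python
def qid_from_uri(uri):
    """Return the trailing Q identifier from a Wikidata entity URI, or None."""
    if not isinstance(uri, str):
        return None
    candidate = uri.rstrip("/").rsplit("/", 1)[-1]
    # Accept only values shaped like Q followed by digits
    if candidate.startswith("Q") and candidate[1:].isdigit():
        return candidate
    return None


print(qid_from_uri("http://www.wikidata.org/entity/Q929"))  # Q929
```

Applied to the query result, `wd_countries['country'].map(qid_from_uri)` would give a column of identifiers. Which GeoKB property should hold that value is not decided in this notebook.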

## Stuck Again!

I'm stuck here again because of whatever is going on with the API. I can add a single claim to an item, but as soon as I do that, the addClaim() function fails on the next claim I try to add with a different error. Essentially, once claims are on an item, that item is alive and well in Wikibase (UI and SPARQL) but is dead to the API. Argh!

In [3]:
geokb_site = get_wb('geokb')

In [None]:
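# Dry run on the first five countries: build and display the claims only.
# country_code can be missing (NaN) because the query treats it as optional.
for _, row in wd_countries.head().iterrows():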
    claims = [
        {
            'property_id': 'P2',
            'object': 'Q37',
            'prov_statement': 'Adding classification for country'
        },
        {
            'property_id': 'P26',
            'object': row.country_code,
            'prov_statement': 'Adding country code'
        }
    ]

    display(claims)

    # Item creation stays commented out until the API problem described above is resolved
    # process_item(
    #     site=geokb_site,
    #     label=row.countryLabel,
    #     description=row.countryDescription,
    #     claims=claims,
    #     prov_statement='Adding country item borrowed from Wikidata'
    # )