The GeoKB can help synthesize and integrate the several disparate sources of mine-related names. Current resources like MRDS and other parts of the "mineral occurrence" space carry a certain degree of semantic dissonance. USMIN is clearing some of that up, but it also adds one more source to deal with. Other sources use shorthand naming approaches that do not hold up at global scale or in a knowledge management system that relies on explicit semantics.

In the immediate future, we need to tag our NI 43-101 reports to a mine name. Shorthand string approaches were followed to some extent in the two underlying inventories used to pull together our initial tranche of the reports. These were incorporated into the Zotero meta-model, but we need to put these together with other sources to build out our system.

I'm going to take a somewhat different approach in this case to the processing flow for initial population of the GeoKB and subsequent integration of additional sources. I'm starting with the GNIS mine features even though I know GNIS is an imperfect source in many ways; for one thing, its coordinate locations are not particularly accurate. However, the gazetteer aspect of GNIS as a name source is a reasonable place to begin, knowing that we will improve the quality of the information over time.

In [1]:
import os
import geopandas as gpd
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
import requests

from utils import sparql_query, property_lookup

from wikibaseintegrator.models import Qualifiers, References, Reference
from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator import WikibaseIntegrator, wbi_login, wbi_helpers
from wikibaseintegrator.datatypes import Item, String, ExternalID, URL, GlobeCoordinate

In [2]:
wbi_config['MEDIAWIKI_API_URL'] = os.environ["WC_MEDIAWIKI_API_URL"]
wbi_config['SPARQL_ENDPOINT_URL'] = os.environ['WC_SPARQL_ENDPOINT_URL']
wbi_config['WIKIBASE_URL'] = os.environ['WC_WIKIBASE_URL']

# Use bot account for this specific task
geokb_auth = wbi_login.Login(
    user=os.environ["WC_BOT_INIT"],
    password=os.environ["WC_PASS_INIT"]
)
wbi = WikibaseIntegrator(login=geokb_auth)

### Place Linkages

We will want to add some contextual place linkages to help orient, select, and group mining site items. Here we pull the U.S. states and counties already added to the GeoKB so that we have their QIDs along with FIPS codes we can use for matching as needed.

In [3]:
query_geokb_states = """
PREFIX wd: <%(wb_domain)s/entity/>
PREFIX wdt: <%(wb_domain)s/prop/direct/>

SELECT ?state ?stateLabel ?state_fips WHERE {
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    ?state wdt:P1 wd:Q229 .
    OPTIONAL {
        ?state wdt:P14 ?state_fips .
    }
}
""" % {'wb_domain': os.environ['WC_WIKIBASE_URL']}

query_geokb_counties = """
PREFIX wd: <%(wb_domain)s/entity/>
PREFIX wdt: <%(wb_domain)s/prop/direct/>

SELECT ?county ?countyLabel ?county_fips WHERE {
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    ?county wdt:P1 wd:Q481 .
    OPTIONAL {
        ?county wdt:P22 ?county_fips .
    }
}
""" % {'wb_domain': os.environ['WC_WIKIBASE_URL']}

geokb_states = sparql_query(
    endpoint=os.environ["WC_SPARQL_ENDPOINT_URL"],
    query=query_geokb_states,
    output='dataframe'
)

geokb_counties = sparql_query(
    endpoint=os.environ["WC_SPARQL_ENDPOINT_URL"],
    query=query_geokb_counties,
    output='dataframe'
)

### Wikidata Mines

Wikidata has many mine items that look to have been introduced from Geonames. Most of them have GNIS identifiers, so we can link them to one of our own sources though with the qualifier that the linkage has not been confirmed. We will explore the Wikidata items further to determine if there are any particularly useful claims on them for us to exploit.

In [4]:
query_wd_mine = """
SELECT ?mine ?mineLabel ?mineAltLabel ?mineDescription ?gnis_id WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?mine wdt:P31 wd:Q820477.
  OPTIONAL {
      ?mine wdt:P590 ?gnis_id .
  }
}
"""

wd_mines = sparql_query(
    endpoint='https://query.wikidata.org/sparql',
    query=query_wd_mine,
    output='dataframe'
)

### GNIS Source

We can get the GNIS from a zip download (this one is not yet part of the TNM staged data products). I set up an item in the GeoKB to serve as the reference. It includes both a link to an information web site for the GNIS as an organizational entity as well as a link to the specific download. Using the "reference item" property for references is one of the documentation dynamics I'm exploring as a more robust pointer to a reference than a URL to somewhere else or a simple string statement.

In [5]:
gnis_national_file = 'https://geonames.usgs.gov/docs/stategaz/NationalFile.zip'
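r_gnis_national_file = requests.get(gnis_national_file)
# Fail early on a bad HTTP response rather than on an unreadable zip
r_gnis_national_file.raise_for_status()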
z = ZipFile(BytesIO(r_gnis_national_file.content))
# We know it's the first/only file and it's delimited with pipe
gnis_national = pd.read_csv(z.open(z.namelist()[0]), sep='|', dtype=str)

In [6]:
mine_features = gnis_national[gnis_national.FEATURE_CLASS == 'Mine'].copy()
mine_features.head(10)

Unnamed: 0,FEATURE_ID,FEATURE_NAME,FEATURE_CLASS,STATE_ALPHA,STATE_NUMERIC,COUNTY_NAME,COUNTY_NUMERIC,PRIMARY_LAT_DMS,PRIM_LONG_DMS,PRIM_LAT_DEC,PRIM_LONG_DEC,SOURCE_LAT_DMS,SOURCE_LONG_DMS,SOURCE_LAT_DEC,SOURCE_LONG_DEC,ELEV_IN_M,ELEV_IN_FT,MAP_NAME,DATE_CREATED,DATE_EDITED
44,444,Yucca Mine,Mine,AZ,4,Mohave,15,343909N,1142231W,34.6525089,-114.3752352,,,,,451,1480,Topock,02/08/1980,05/01/2006
70,470,Abe Lincoln Mine,Mine,AZ,4,Yavapai,25,340244N,1123232W,34.0455864,-112.5421178,,,,,1192,3911,Morgan Butte,02/08/1980,
73,473,Abril Mine,Mine,AZ,4,Cochise,3,315429N,1095929W,31.9081404,-109.9914592,,,,,2031,6663,Cochise Stronghold,02/08/1980,
83,483,Adams Mine,Mine,AZ,4,Mohave,15,345808N,1142335W,34.9688924,-114.3930143,,,,,659,2162,Boundary Cone,02/08/1980,
118,519,Aguinaldo Mine,Mine,AZ,4,Pima,19,315508N,1111712W,31.9189709,-111.2867671,,,,,1139,3737,Stevens Mountain,02/08/1980,
146,547,Alabama Mine,Mine,AZ,4,Mohave,15,352027N,1133603W,35.340831,-113.6007724,,,,,1501,4924,Valentine SE,02/08/1980,
162,563,Alaska Mine,Mine,AZ,4,Maricopa,13,334342N,1131854W,33.7283663,-113.3149184,,,,,572,1877,Weldon Hill,02/08/1980,
163,564,Ajax Mine,Mine,AZ,4,Cochise,3,320048N,1091243W,32.0134216,-109.2120057,,,,,1473,4833,Blue Mountain,02/08/1980,
174,575,Alcyone Mine,Mine,AZ,4,Mohave,15,345934N,1142425W,34.9927808,-114.4069036,,,,,664,2178,Boundary Cone,02/08/1980,
187,588,Alice Mine,Mine,AZ,4,Pinal,21,330756N,1105525W,33.1322841,-110.9237224,,,,,935,3068,Hot Tamale Peak,02/08/1980,


### Join Linked Items

Here we join the states, counties, and Wikidata links to the mine features on available identifiers so that we have our local GeoKB identifiers to link in claims.

In [7]:
mine_features["state_co_code"] = mine_features.apply(lambda x: f"{x.STATE_NUMERIC}{x.COUNTY_NUMERIC}", axis=1)

mine_features = pd.merge(
    left=mine_features,
    right=geokb_states[["state","stateLabel","state_fips"]].rename(columns={"state": "qid_state", "state_fips": "STATE_NUMERIC"}),
    how="left",
    on="STATE_NUMERIC"
)

mine_features = pd.merge(
    left=mine_features,
    right=geokb_counties[["county","countyLabel","county_fips"]].rename(columns={"county": "qid_county", "county_fips": "state_co_code"}),
    how="left",
    on="state_co_code"
)

mine_features = pd.merge(
    left=mine_features,
    right=wd_mines[["mine","gnis_id"]].rename(columns={"mine": "wd_id", "gnis_id": "FEATURE_ID"}),
    how="left",
    on="FEATURE_ID"
)
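
Because all three joins are left joins, any mine that fails to match keeps its row but carries a missing QID or Wikidata link forward. The following is a minimal sketch of how match rates could be checked before loading, using the `indicator=True` option of `pd.merge` on toy data. The helper name `match_report`, the sample frames, and the QID `Q1` are made up for illustration and are not part of the notebook.

```python
import pandas as pd


def match_report(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.Series:
    """Count left-hand rows that did or did not find a match in `right`."""
    # Deduplicate the right-hand keys so each left row is counted exactly once
    right_keys = right[[on]].drop_duplicates()
    merged = pd.merge(left, right_keys, how="left", on=on, indicator=True)
    # "_merge" is "both" for matched rows and "left_only" for unmatched rows
    return merged["_merge"].astype(str).value_counts()


# Toy stand-ins for the GNIS mine features and the GeoKB state lookup
toy_mines = pd.DataFrame({
    "FEATURE_ID": ["444", "470", "999"],
    "STATE_NUMERIC": ["04", "04", "99"],
})
toy_states = pd.DataFrame({
    "qid_state": ["Q1"],  # hypothetical QID
    "STATE_NUMERIC": ["04"],
})

report = match_report(toy_mines, toy_states, on="STATE_NUMERIC")
print(report)
```

A large `left_only` count would be a hint to look at key formatting on both sides, for example whether FIPS codes are zero-padded consistently.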


### Data Prep

Here I run through a bit of additional data prep. I trust but verify that the point coordinates are at least structurally sound by creating a GeoDataFrame and then pulling the point coordinates back out in a validated format for coordinate location claims. I also made a judgment call on handling duplicate names at this stage that I may have to revisit: where a name occurs more than once, I append the GNIS identifier to it to create a unique label.

In [21]:
gnis_props = [
    'FEATURE_ID',
    'FEATURE_NAME', 
    'PRIM_LAT_DEC', 
    'PRIM_LONG_DEC', 
    'state_co_code', 
    'qid_state', 
    'stateLabel', 
    'qid_county', 
    'countyLabel',
    'wd_id'
]

gdf_mine_features = gpd.GeoDataFrame(
    mine_features[gnis_props], 
    geometry=gpd.points_from_xy(mine_features.PRIM_LONG_DEC, mine_features.PRIM_LAT_DEC)
)
gdf_mine_features['lon'] = gdf_mine_features['geometry'].x
gdf_mine_features['lat'] = gdf_mine_features['geometry'].y

df_mine_features = pd.DataFrame(gdf_mine_features.drop(columns=['geometry','PRIM_LAT_DEC','PRIM_LONG_DEC']))
df_mine_features['description'] = df_mine_features.apply(
    lambda x: f"a mine feature originated from the GNIS located in {x.countyLabel}, {x.stateLabel} USA", 
    axis=1
)

df_mine_features = df_mine_features.drop_duplicates(subset="FEATURE_ID", keep="first")

ids = df_mine_features["FEATURE_NAME"]
dup_mine_names = df_mine_features[ids.isin(ids[ids.duplicated()])].FEATURE_NAME.unique()

df_mine_features["label"] = df_mine_features.apply(
    lambda x: f"{x.FEATURE_NAME} ({x.FEATURE_ID})" if x.FEATURE_NAME in dup_mine_names else x.FEATURE_NAME, 
    axis=1
)
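
Building point geometries confirms that the coordinates parse as numbers, but not that they are plausible. As a complement, here is a minimal sketch of a range check that could run before coordinate claims are built. The function name `flag_suspect_coordinates` and the toy data are hypothetical, and treating a (0, 0) pair as a placeholder for an unknown location is an assumption that should be checked against the GNIS documentation.

```python
import pandas as pd


def flag_suspect_coordinates(
    df: pd.DataFrame, lat_col: str = "lat", lon_col: str = "lon"
) -> pd.Series:
    """Return a boolean Series that is True where a coordinate pair looks unusable."""
    lat = pd.to_numeric(df[lat_col], errors="coerce")
    lon = pd.to_numeric(df[lon_col], errors="coerce")
    missing = lat.isna() | lon.isna()
    out_of_range = ~lat.between(-90, 90) | ~lon.between(-180, 180)
    # Assumption: a (0, 0) pair stands in for an unknown location
    placeholder = (lat == 0) & (lon == 0)
    return missing | out_of_range | placeholder


# Toy rows: one valid point, one (0, 0) pair, one bad latitude, one missing latitude
toy = pd.DataFrame({
    "lat": [34.6525089, 0.0, 95.0, None],
    "lon": [-114.3752352, 0.0, -110.0, -112.5],
})
suspect = flag_suspect_coordinates(toy)
print(toy[suspect])
```

Flagged rows could either be loaded without a coordinate claim or held back for review; which is preferable depends on how the GeoKB is meant to treat incomplete items.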

### Load the Items

I'm continuing to fiddle with just how to structure the code for building items. Wherever possible, I like the idea of never preparing an intermediary file-based data object between the source and the GeoKB. That keeps us from having an extra artifact somewhere that we have to keep track of and incorporate into our provenance trace. If everything is handled in code from some form of API or a crude download-to-memory like this, the code itself is the only thing we need to keep.

I also like the idea of not introducing too much code between the process being run and the library being used. The less extraneous stuff we have to document and manage, the better. I learned a few new things about WikibaseIntegrator here in dealing with qualifiers, the last major part of the Wikibase model I needed to tackle. Like everything with Wikibase, once you understand the data model, everything starts falling into place in terms of how to incorporate different types of information structurally, letting us focus on getting the more important semantic aspects right.

I did make a judgment call here we may need to revisit by putting country, state, and county all into one property - "located in the administrative territorial entity." We know they are different types (instance of) of entities but at the item claim level, they are all in the same bucket. Wikidata is a mixed bag on this approach with most (I think) items using the lower level classification with individual properties for country, state, province, county, etc. We can go either way; I'm just trying to be slow and deliberate on modeling properties.
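
One consequence of the single-property choice is that anyone who wants only the county (or only the state) for a mine has to filter on the type of the linked entity. A minimal sketch of what such a query could look like follows. It assumes, based on the queries and claims elsewhere in this notebook, that `P1` is "instance of", `P11` is the "located in" property, `Q3646` is the mine class, and `Q229` and `Q481` are the state and county classes. The function name and the example domain are hypothetical.

```python
def mines_by_admin_type_query(wb_domain: str, admin_class_qid: str) -> str:
    """Build a SPARQL query for mines linked to places of one administrative type."""
    return """
PREFIX wd: <%(wb_domain)s/entity/>
PREFIX wdt: <%(wb_domain)s/prop/direct/>

SELECT ?mine ?mineLabel ?place ?placeLabel WHERE {
    ?mine wdt:P1 wd:Q3646 .
    ?mine wdt:P11 ?place .
    ?place wdt:P1 wd:%(admin_class_qid)s .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""" % {"wb_domain": wb_domain, "admin_class_qid": admin_class_qid}


# Example: restrict the "located in" links to counties (assumed class Q481)
county_query = mines_by_admin_type_query("https://example.org", "Q481")
print(county_query)
```

The extra triple pattern is the cost of the single-property model; with separate properties for country, state, and county the type filter would be unnecessary.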

In [None]:
%%time
ref_gnis_item = Reference()
ref_gnis_item.add(
    Item(
        prop_nr='P3',
        value='Q3624'
    )
)

gnis_refs = References()
gnis_refs.add(ref_gnis_item)

for index, mine in df_mine_features[51:].iterrows():
    # Set mine item basics
    item = wbi.item.new()
    item.labels.set(language='en', value=mine.label)
    item.descriptions.set(
        language='en', 
        value=mine.description
    )

    # Establish classification
    instance_of = Item(
        prop_nr='P1',
        value='Q3646', # instance of mine
        references=gnis_refs
    )
    item.claims.add(instance_of)
    
    # GNIS ID
    gnis_id = ExternalID(
        prop_nr='P21',
        value=mine.FEATURE_ID,
        references=gnis_refs
    )
    item.claims.add(gnis_id)

    # Country, state and county
    # State and county QIDs come from left joins and may be missing (NaN),
    # so only link the ones that actually matched
    admin_links = [
        Item(
            prop_nr='P11',
            value='Q161',
            references=gnis_refs
        )
    ]
    for qid_url in (mine.qid_state, mine.qid_county):
        if isinstance(qid_url, str):
            admin_links.append(
                Item(
                    prop_nr='P11',
                    value=qid_url.split("/")[-1],
                    references=gnis_refs
                )
            )
    item.claims.add(admin_links)

    # Coordinate location
    mine_location = GlobeCoordinate(
        prop_nr='P6',
        latitude=mine.lat,
        longitude=mine.lon,
        references=gnis_refs
    )
    item.claims.add(mine_location)

    # Wikidata link
    if isinstance(mine.wd_id, str):
        wd_qualifiers = Qualifiers()
        wd_qualifiers.add(
            String(
                prop_nr='P5',
                value='Wikidata link created on GNIS ID; not confirmed'
            )
        )
        wd_link = URL(
            prop_nr='P16',
            value=mine.wd_id,
            qualifiers=wd_qualifiers
        )
        item.claims.add(wd_link)
    
    item.write()    
    
    

Service unavailable (HTTP Code 502). Sleeping for 60 seconds.
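
If a long-running load like this one is interrupted, restarting from a positional slice (as with `df_mine_features[51:]`) works only as long as the row order is stable. A minimal sketch of an alternative is to filter out rows whose GNIS ID is already present in the GeoKB. The helper `pending_rows` and the toy data are hypothetical; in practice the list of existing IDs could come from a SPARQL query on the GNIS ID property (`P21` in the loop above).

```python
import pandas as pd


def pending_rows(
    source: pd.DataFrame, existing_ids, id_col: str = "FEATURE_ID"
) -> pd.DataFrame:
    """Return only the source rows whose identifier is not already loaded."""
    loaded = {str(i) for i in existing_ids}
    return source[~source[id_col].astype(str).isin(loaded)]


# Toy stand-in for df_mine_features
toy_features = pd.DataFrame({
    "FEATURE_ID": ["444", "470", "473"],
    "label": ["Yucca Mine", "Abe Lincoln Mine", "Abril Mine"],
})
# Pretend these identifiers were returned by a query against the GeoKB
already_loaded = ["444", "470"]

todo = pending_rows(toy_features, already_loaded)
print(todo)
```

Filtering on the identifier also makes the load safe to rerun, since rows that were written before the interruption are skipped rather than duplicated.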
