In [None]:
!pip install awswrangler wikibaseintegrator

This notebook works through the process of adding county-level government units to the GeoKB using the government units source from the USGS National Map. While there are other sources we could work against, this provides a way to demonstrate operating against National Map staged data in the AWS cloud, something we often need to do with other data assets such as Lidar point clouds for processing custom DEMs.

In [1]:
import boto3
import awswrangler as wr
import geopandas as gpd
from io import BytesIO
import zipfile
from getpass import getpass

from utils import sparql_query, property_lookup

from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator import WikibaseIntegrator, wbi_login, wbi_helpers
from wikibaseintegrator.datatypes import Item, String, ExternalID, URL, GlobeCoordinate

In [2]:
# I cheated a little bit here since I haven't spun up a persistent environment yet
s3_resource = boto3.resource('s3')

wbi_config['MEDIAWIKI_API_URL'] = 'https://geokb.wikibase.cloud/w/api.php'
wbi_config['SPARQL_ENDPOINT_URL'] = 'https://geokb.wikibase.cloud/query/sparql'
wbi_config['WIKIBASE_URL'] = 'https://geokb.wikibase.cloud/'

# Use bot account for this specific task
geokb_auth = wbi_login.Login(
    user=input("BOT Name"), 
    password=getpass("BOT Password")
)
wbi = WikibaseIntegrator(login=geokb_auth)

BOT Name Sky@init
BOT Password ········


### TNM Staged Products

The StagedProducts folder of the prd-tnm bucket holds a lot that could be useful in other use cases. The following gets us a list of the zip files containing the GeoPackage versions of the state files. I'm going this route instead of processing the national file so I can use the more open standard format (the national file is only available as an Esri GDB).

In [3]:
tnm_gov_units = wr.s3.list_objects('s3://prd-tnm/StagedProducts/GovtUnit/GPKG')
tnm_state_gpkg = [i for i in tnm_gov_units if i.endswith('.zip')]
tnm_state_gpkg[:5]

['s3://prd-tnm/StagedProducts/GovtUnit/GPKG/GOVTUNIT_Alabama_State_GPKG.zip',
 's3://prd-tnm/StagedProducts/GovtUnit/GPKG/GOVTUNIT_Alaska_State_GPKG.zip',
 's3://prd-tnm/StagedProducts/GovtUnit/GPKG/GOVTUNIT_American_Samoa_State_GPKG.zip',
 's3://prd-tnm/StagedProducts/GovtUnit/GPKG/GOVTUNIT_Arizona_State_GPKG.zip',
 's3://prd-tnm/StagedProducts/GovtUnit/GPKG/GOVTUNIT_Arkansas_State_GPKG.zip']

The first function reads a zip file into memory and then loads it into a GeoDataFrame for processing. This saves having to extract the file to disk. It would be better if we used gzip or another compressor that some tools can read directly, without going through this kind of process. This is not a cleanly engineered function, and it needs error trapping. I happen to know that this is a pretty stable source, so it should work for my immediate purposes. These data don't change very often, but we'll have to harden this eventually.

The second is just a simple helper for the final push of a new item to the GeoKB.

In [4]:
def gdf_from_s3_zip(path):
    # Parse out the bucket and key
    bucket = path.split("/")[2]
    key = "/".join(path.split("/")[3:])
    
    # Get the file object
    file_obj = s3_resource.Object(
        bucket_name=bucket, 
        key=key
    )
    # Read the file object into a buffer
    buffer = BytesIO(file_obj.get()["Body"].read())
    # Put the zip in the buffer
    z = zipfile.ZipFile(buffer)
    
    # Figure out the gpkg file name
    gpkg_file = next((i for i in z.namelist() if i.endswith('.gpkg')), None)
    
    if gpkg_file is None:
        return
    # Pull the file into a geodataframe
    return gpd.read_file(z.open(gpkg_file))

def add_item(label, description, aliases, claims):
    item = wbi.item.new()

    item.labels.set(language='en', value=label)
    item.descriptions.set(language='en', value=description)
    item.aliases.set(language='en', values=aliases)
    item.claims.add(claims)

    item.write()
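
The in-memory zip pattern used in gdf_from_s3_zip can be tried without S3 access. The following is a minimal sketch using only the standard library: split_s3_uri and first_member_with_suffix are hypothetical helper names, and the demo archive and its member names are made up for illustration. It also shows one way the function could start to be hardened, by rejecting paths that are not S3 URIs.

```python
import zipfile
from io import BytesIO
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def first_member_with_suffix(zip_bytes, suffix):
    """Return (name, bytes) of the first archive member ending in suffix, or None."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as z:
        name = next((n for n in z.namelist() if n.endswith(suffix)), None)
        if name is None:
            return None
        return name, z.read(name)


# Build a tiny stand-in archive in memory; member names and contents are made up
demo = BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("readme.txt", "not the file we want")
    z.writestr("GOVTUNIT_Demo_State_GPKG.gpkg", b"placeholder bytes")

bucket, key = split_s3_uri(
    "s3://prd-tnm/StagedProducts/GovtUnit/GPKG/GOVTUNIT_Alabama_State_GPKG.zip"
)
member = first_member_with_suffix(demo.getvalue(), ".gpkg")
print(bucket, key, member[0])
```

Nothing here touches the network, so the same helpers could be dropped into a test before pointing them at the real bucket.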

### GeoKB Properties

We need to know which properties we are using to put information from this or any source into the GeoKB. I put a function in utils.py that gets all properties via SPARQL and puts them in a simple dict so we can call "P" identifiers by name. For this case, we'll be using a number of the ExternalID properties, such as the FIPS codes and the GNIS identifier.

I'm going to leave off the related Wikidata item for now. We can use Wikidata's own use of FIPS codes and GNIS identifiers to pull back items claiming to be associated with those, but it is just a claim. We need to trust but verify if we want to leverage the linkage to do anything like get additional characteristics.
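
One possible shape for the "verify" step, sketched under assumptions: a Wikidata candidate is accepted only when both its FIPS and GNIS values agree with our source record. The function name, the candidate dict keys (qid, fips, gnis), and all values below are placeholders for illustration, not real identifiers or an existing API.

```python
def verified_matches(local, candidates):
    """Keep candidate IDs whose FIPS and GNIS values both equal the local record's."""
    return [
        c["qid"]
        for c in candidates
        if c.get("fips") == local["stco_fipscode"]
        and c.get("gnis") == local["gnis_id"]
    ]


# Placeholder values only; none of these are real identifiers
local = {"stco_fipscode": "00000", "gnis_id": "0000000"}
candidates = [
    {"qid": "Q_A", "fips": "00000", "gnis": "0000000"},  # both identifiers agree
    {"qid": "Q_B", "fips": "00000", "gnis": "9999999"},  # GNIS disagrees
    {"qid": "Q_C", "fips": "00000"},                     # no GNIS claim at all
]
print(verified_matches(local, candidates))
```

Requiring agreement on two independent identifiers is stricter than matching on either one, which fits treating the Wikidata linkage as a claim to be checked.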

From the gov units schema, we are really just focused on the following properties:

* stco_fipscode - gives us a linkage to other systems and assets
* county_name - short name of the county used as an alias
* gnis_id - gives us a linkage to other systems and assets
* gnis_name - primary label with better context
* coordinate_location - computed centroid of the county boundary entered for informational purposes

In [5]:
geokb_props = property_lookup(wbi_config['SPARQL_ENDPOINT_URL'])
geokb_props

{'instance of': 'P1',
 'subclass of': 'P2',
 'reference item': 'P3',
 'reference url': 'P4',
 'reference statement': 'P5',
 'coordinate location': 'P6',
 'publication date': 'P7',
 'subject matter': 'P8',
 'ranking': 'P9',
 'ISO 3166-1 alpha-2 code': 'P10',
 'located in the administrative territorial entity': 'P11',
 'ISO 3166-2 code': 'P12',
 'FIPS 5-2 alpha': 'P13',
 'FIPS 5-2 numeric': 'P14',
 'corresponding wikidata property': 'P15',
 'related wikidata item': 'P16',
 'element symbol': 'P17',
 'SEDAR Identifier': 'P18',
 'MRDS commodity code': 'P19',
 'USGS Thesaurus ID': 'P20',
 'GNIS ID': 'P21',
 'FIPS 6-4': 'P22'}
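
For readers without utils.py at hand, a lookup like the one above could be built from standard SPARQL JSON results. This is only a sketch of the idea, not the actual property_lookup implementation: the function name and the ?property / ?propertyLabel variable names are assumptions, and the sample response is hand-written from two entries in the output above.

```python
def bindings_to_property_lookup(results):
    """Map property labels to P identifiers from a SPARQL JSON result."""
    lookup = {}
    for row in results["results"]["bindings"]:
        # Entity URIs end in the identifier, e.g. .../entity/P21
        pid = row["property"]["value"].rsplit("/", 1)[-1]
        lookup[row["propertyLabel"]["value"]] = pid
    return lookup


# Hand-written sample in the SPARQL 1.1 JSON results layout
sample = {
    "results": {
        "bindings": [
            {
                "property": {"type": "uri", "value": "https://geokb.wikibase.cloud/entity/P21"},
                "propertyLabel": {"type": "literal", "value": "GNIS ID"},
            },
            {
                "property": {"type": "uri", "value": "https://geokb.wikibase.cloud/entity/P22"},
                "propertyLabel": {"type": "literal", "value": "FIPS 6-4"},
            },
        ]
    }
}

props = bindings_to_property_lookup(sample)
print(props["GNIS ID"], props["FIPS 6-4"])
```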

In [6]:
# We need our identifiers for states so we can establish the linkage
# This will also drive what gov units we get, because we don't yet have the territories
query_us_states = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT DISTINCT ?item ?itemLabel ?fips_code
WHERE { ?item  wdt:P1  wd:Q229 . 
       ?item wdt:P14 ?fips_code .
        SERVICE wikibase:label 
          { bd:serviceParam  wikibase:language  "en" . } 
      }
"""

geokb_states = sparql_query(
    endpoint=wbi_config['SPARQL_ENDPOINT_URL'],
    query=query_us_states,
    output='dict'
)
geokb_states[:5]

[{'item': 'https://geokb.wikibase.cloud/entity/Q230',
  'itemLabel': 'Michigan',
  'fips_code': '26'},
 {'item': 'https://geokb.wikibase.cloud/entity/Q231',
  'itemLabel': 'Louisiana',
  'fips_code': '22'},
 {'item': 'https://geokb.wikibase.cloud/entity/Q232',
  'itemLabel': 'Oklahoma',
  'fips_code': '40'},
 {'item': 'https://geokb.wikibase.cloud/entity/Q233',
  'itemLabel': 'California',
  'fips_code': '06'},
 {'item': 'https://geokb.wikibase.cloud/entity/Q234',
  'itemLabel': 'Georgia',
  'fips_code': '13'}]
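
Each state label returned above has to be paired with its zip file in the S3 listing. A plain substring test works when the listing is sorted, but it depends on that ordering for names such as Virginia and West Virginia. The sketch below matches on the full file name instead and builds the public HTTPS form of the path. Both function names are hypothetical, and the sample paths simply follow the naming pattern seen in the listing; I have not checked them against the bucket.

```python
def state_zip_path(state_label, s3_paths):
    """Find the zip whose file name exactly matches the state's label."""
    wanted = f"GOVTUNIT_{state_label.replace(' ', '_')}_State_GPKG.zip"
    return next((p for p in s3_paths if p.rsplit("/", 1)[-1] == wanted), None)


def s3_to_https(s3_path):
    """Turn 's3://bucket/key' into 'https://bucket.s3.amazonaws.com/key'."""
    if not s3_path.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {s3_path}")
    bucket, _, key = s3_path[len("s3://"):].partition("/")
    return f"https://{bucket}.s3.amazonaws.com/{key}"


# Sample paths following the pattern in the listing (assumed, not verified)
prefix = "s3://prd-tnm/StagedProducts/GovtUnit/GPKG/"
paths = [
    prefix + "GOVTUNIT_West_Virginia_State_GPKG.zip",
    prefix + "GOVTUNIT_Virginia_State_GPKG.zip",
]

match = state_zip_path("Virginia", paths)
print(match)
print(s3_to_https(match))
```

With the deliberately unsorted sample list, a substring test for "Virginia" would have picked the West Virginia file first; the exact file-name comparison does not.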

In [7]:
# Set universal properties
claim_instance_of = Item(
    prop_nr=geokb_props['instance of'],
    value='Q481' # U.S. County
)

In [None]:
for st in geokb_states[1:]:
    print("STARTED PROCESSING:", st['itemLabel'])
    geokb_id = st['item'].split('/')[-1]

    # Get the appropriate S3 file to process
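    gov_unit_s3_path = next((i for i in tnm_state_gpkg if st['itemLabel'].replace(' ', '_') in i), None)
    # Skip states with no matching file rather than failing on None below
    if gov_unit_s3_path is None:
        print("NO SOURCE FILE FOUND FOR:", st['itemLabel'])
        continue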
    
    # Set up an http reference
    http_ref = gov_unit_s3_path.replace('s3://prd-tnm/', 'https://prd-tnm.s3.amazonaws.com/')
    
    # Set up reference source to add to claims
    ref_source = URL(
        prop_nr=geokb_props['reference url'],
        value=http_ref
    )
    
    claim_county_in_state = Item(
        prop_nr=geokb_props['located in the administrative territorial entity'],
        value=geokb_id,
        references=[[ref_source]]
    )  
    
    # Get the gpkg
    state_units = gdf_from_s3_zip(gov_unit_s3_path)
    # Add a centroid so we can populate coordinate location
    state_units['coordinate_location'] = state_units.geometry.centroid
    state_units['lon'] = state_units['coordinate_location'].x
    state_units['lat'] = state_units['coordinate_location'].y
    
    print("PULLED AND PREPPED SOURCE DATA", gov_unit_s3_path)

    for index, row in state_units.iterrows():
        print("PROCESSING CLAIMS AND ADDING:", row.gnis_name)
        # Set county fips code
        claim_county_fips = ExternalID(
            prop_nr=geokb_props['FIPS 6-4'],
            value=row.stco_fipscode,
            references=[[ref_source]]
        )
        
        # Set GNIS code
        claim_county_gnis = ExternalID(
            prop_nr=geokb_props['GNIS ID'],
            value=row.gnis_id,
            references=[[ref_source]]
        )
        
        # Set coordinate location
        claim_location = GlobeCoordinate(
            prop_nr=geokb_props['coordinate location'],
            latitude=row.lat,
            longitude=row.lon,
            references=[[ref_source]]
        )
        
        # Send it
        add_item(
            label=row.gnis_name, # I prefer this form for the context
            description=f"a county in {st['itemLabel']}",
            aliases=row.county_name,
            claims=[
                claim_instance_of,
                claim_county_in_state,
                claim_county_fips,
                claim_county_gnis,
                claim_location
            ]
        )



STARTED PROCESSING: Louisiana
PULLED AND PREPPED SOURCE DATA s3://prd-tnm/StagedProducts/GovtUnit/GPKG/GOVTUNIT_Louisiana_State_GPKG.zip
PROCESSING CLAIMS AND ADDING: Madison Parish
PROCESSING CLAIMS AND ADDING: Concordia Parish
PROCESSING CLAIMS AND ADDING: Saint Helena Parish
PROCESSING CLAIMS AND ADDING: Bossier Parish
PROCESSING CLAIMS AND ADDING: Lafayette Parish
PROCESSING CLAIMS AND ADDING: Jefferson Davis Parish
PROCESSING CLAIMS AND ADDING: LaSalle Parish
PROCESSING CLAIMS AND ADDING: Calcasieu Parish
PROCESSING CLAIMS AND ADDING: Ascension Parish
PROCESSING CLAIMS AND ADDING: Evangeline Parish
PROCESSING CLAIMS AND ADDING: Jackson Parish
PROCESSING CLAIMS AND ADDING: Vernon Parish
PROCESSING CLAIMS AND ADDING: West Carroll Parish
PROCESSING CLAIMS AND ADDING: Caldwell Parish
PROCESSING CLAIMS AND ADDING: Plaquemines Parish
PROCESSING CLAIMS AND ADDING: Claiborne Parish
PROCESSING CLAIMS AND ADDING: Allen Parish
PROCESSING CLAIMS AND ADDING: Saint Bernard Parish
PROCESSING CLA