This notebook introduces items for U.S. States into the knowledgebase. This later iteration of the notebook works from the transformed U.S. Census data on the Microsoft Planetary Computer (MPC), rather than my initial approach of using the HTML tables directly from the U.S. Census' TIGER system. The MPC instantiation on Azure public cloud storage provides a more robust processing environment for when we need to run geospatial operations on the data.

Every time I come back to a given reference source, I work through a few improvements in how I think about incorporating the material into the knowledge graph.

In [1]:
import planetary_computer
import pystac_client
import dask_geopandas
import pandas as pd

from wbmaker import WikibaseConnection

In [11]:
eew = WikibaseConnection('EEW')

In [3]:
# Select source item to process
source_item_id = eew.ref_lookup['U.S. State names and identifiers from U.S. Census via Microsoft Planetary Computer']
source_item = eew.wbi.item.get(source_item_id)

## Data source documentation

I took another approach toward documenting a data source here, with some new properties that should be applicable to other SpatioTemporal Asset Catalog (STAC) sources. We need essentially three pieces of information: a catalog URL, a collection name, and an item name. From those, we can pull the details needed to reach the data. This source is a Parquet file, but other types of assets may just need a different method for reading and working with the data.

The other thing I've worked on in the item is the mapping of source properties to knowledgebase properties. In this instance, I didn't use that mapping to drive the processing, but I'll come back and look at that again at some point.
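
As a rough illustration of what mapping-driven processing could look like, the sketch below pairs each identifier property label used later in this notebook with the source column and string pattern that feeds it. The `PROPERTY_MAP` structure and the `derive_values` helper are hypothetical, not something stored on the source item today, and the sample record is only illustrative.

```python
# Hypothetical mapping: knowledgebase property label -> (source column, format pattern).
# The labels and patterns mirror the identifier claims built later in this notebook.
PROPERTY_MAP = {
    "FIPS 5-2 alpha": ("STUSPS", "{}"),
    "FIPS 5-2 numeric": ("STATEFP", "{}"),
    "FIPS 10-4": ("STATEFP", "US{}"),
    "ISO 3166-2 code": ("STUSPS", "US-{}"),
    "TIGER GEOID": ("GEOID", "{}"),
    "GNIS ID": ("STATENS", "{}"),
}


def derive_values(row, property_map=PROPERTY_MAP):
    """Return {property label: string value} for one source record."""
    return {
        label: pattern.format(row[column])
        for label, (column, pattern) in property_map.items()
    }


# Illustrative record using the Census column names
sample_row = {"STUSPS": "CA", "STATEFP": "06", "GEOID": "06", "STATENS": "01779778"}
values = derive_values(sample_row)
print(values["FIPS 10-4"], values["ISO 3166-2 code"])
```

Keeping the patterns in data rather than in the loop body would let the same loop serve another source with a different column layout.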

In [4]:
# Extract STAC connection details from source item
stac_source = source_item.claims.get_json()[eew.prop_lookup['STAC catalog URL']][0]
stac_catalog_url = stac_source['mainsnak']['datavalue']['value']
stac_collection_name = stac_source['qualifiers'][eew.prop_lookup['STAC Collection Name']][0]['datavalue']['value']
stac_item_name = stac_source['qualifiers'][eew.prop_lookup['STAC Item Name']][0]['datavalue']['value']
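
For orientation, the lookups above walk the standard Wikibase statement JSON: a property ID keys a list of statements, each with a `mainsnak` and an optional `qualifiers` dict that is itself keyed by property ID. The sketch below reproduces that navigation on a hand-built dict. The property IDs (`P100`, `P101`, `P102`), the URL, and the collection and item names are placeholders invented for the example, not values from this knowledgebase.

```python
# Placeholder property IDs standing in for the eew.prop_lookup results
P_CATALOG, P_COLLECTION, P_ITEM = "P100", "P101", "P102"

claims_json = {
    P_CATALOG: [
        {
            "mainsnak": {"datavalue": {"value": "https://example.org/stac/v1", "type": "string"}},
            "qualifiers": {
                P_COLLECTION: [{"datavalue": {"value": "example-collection", "type": "string"}}],
                P_ITEM: [{"datavalue": {"value": "example-item", "type": "string"}}],
            },
        }
    ]
}


def stac_details(claims, p_catalog, p_collection, p_item):
    """Pull (catalog URL, collection name, item name) from the first statement."""
    statement = claims[p_catalog][0]
    qualifiers = statement.get("qualifiers", {})

    def first_qualifier(prop):
        snaks = qualifiers.get(prop)
        if not snaks:
            raise KeyError(f"statement has no {prop} qualifier")
        return snaks[0]["datavalue"]["value"]

    return (
        statement["mainsnak"]["datavalue"]["value"],
        first_qualifier(p_collection),
        first_qualifier(p_item),
    )


url, collection, item = stac_details(claims_json, P_CATALOG, P_COLLECTION, P_ITEM)
print(url, collection, item)
```

A descriptive error for a missing qualifier makes an incompletely documented source item easier to spot than a failure partway through a chained lookup.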

In [5]:
# Query STAC catalog for asset
stac_catalog = pystac_client.Client.open(
    stac_catalog_url,
    modifier=planetary_computer.sign_inplace,
)
stac_collection = stac_catalog.get_collection(stac_collection_name)
stac_asset = stac_collection.get_item(stac_item_name).assets["data"]

In [6]:
# Read source data and compute coordinates
gdf_us_states = dask_geopandas.read_parquet(
    stac_asset.href,
    storage_options=stac_asset.extra_fields["table:storage_options"],
    calculate_divisions=True,
)
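# Reproject to WGS 84 (EPSG:4326) and load the full table into memory
gdf_us_states = gdf_us_states.to_crs(epsg=4326).compute()
# Take centroids in a cylindrical equal-area projection, then convert back to lon/lat
gdf_us_states['coordinates'] = gdf_us_states.to_crs('+proj=cea').geometry.centroid.to_crs(gdf_us_states.crs)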
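
The detour through `+proj=cea` matters because a centroid computed directly on longitude/latitude values treats degrees as if they were planar units. The sketch below illustrates the idea in plain Python on a unit sphere: project to cylindrical equal-area, take the planar centroid, and project back. It is a simplified spherical stand-in with helper names made up for the example, not what PROJ does internally, since PROJ works on an ellipsoid.

```python
import math


def to_cea(lon_deg, lat_deg):
    """Forward cylindrical equal-area projection on a unit sphere."""
    return math.radians(lon_deg), math.sin(math.radians(lat_deg))


def from_cea(x, y):
    """Inverse of to_cea: planar coordinates back to degrees."""
    return math.degrees(x), math.degrees(math.asin(y))


def planar_centroid(points):
    """Area-weighted centroid of a simple polygon via the shoelace formula."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)


# A lon/lat box from the equator to 60N, 20 degrees wide (corners as lon, lat)
box = [(-10, 0), (10, 0), (10, 60), (-10, 60)]

naive_lon, naive_lat = planar_centroid(box)
lon, lat = from_cea(*planar_centroid([to_cea(*p) for p in box]))
print(f"naive: {naive_lat:.2f}  equal-area: {lat:.2f}")
```

For this box the equal-area centroid falls south of the naive 30N midpoint, because the northern half of the box covers less ground than the southern half.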

In [7]:
# Look up the QID for each LSAD code and merge it into the source data
query_lsad = "PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Feew-edgi.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Flsad_code%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP54%20%3Flsad_code%20.%0A%7D%0A"
df_lsad = eew.wb_ref_data(query=query_lsad)
df_lsad['lsad_qid'] = df_lsad.item.apply(lambda x: x.split('/')[-1])

state_records = pd.merge(
    left=gdf_us_states,
    right=df_lsad[["lsad_code","lsad_qid"]],
    how="left",
    left_on="LSAD",
    right_on="lsad_code"
)
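
The percent-encoded `query_lsad` string is hard to read at a glance. Decoding it with the standard library shows the SPARQL it carries, and the same step in reverse allows keeping a readable query in the notebook. Whether `eew.wb_ref_data` also accepts an unencoded query is not something this notebook establishes, so the sketch only demonstrates the round trip. The entity URI at the end is a placeholder.

```python
from urllib.parse import quote, unquote

encoded = (
    "PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Feew-edgi.wikibase.cloud%2Fprop%2Fdirect%2F%3E"
    "%0A%0ASELECT%20%3Fitem%20%3Flsad_code%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP54"
    "%20%3Flsad_code%20.%0A%7D%0A"
)

# Decode to see the query text
readable = unquote(encoded)
print(readable)

# Round trip: encoding the readable text reproduces the original string
assert quote(readable, safe="") == encoded

# The QID is the last path segment of an entity URI (placeholder URI shown)
entity_uri = "https://eew-edgi.wikibase.cloud/entity/Q1"
qid = entity_uri.rsplit("/", 1)[-1]
print(qid)
```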

In [8]:
# Set the "instance of" class for each item based on its numeric FIPS code
def instance_of_name(STATEFP):
    int_statefp = int(STATEFP)
    if int_statefp == 11:
        return 'U.S. federal district'
    if int_statefp > 56:
        return 'U.S. Territory'
    return 'U.S. State'

state_records['instance_of_name'] = state_records.STATEFP.apply(instance_of_name)
state_records['instance_of_qid'] = state_records.instance_of_name.apply(lambda x: eew.class_lookup[x])
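
A quick check of the rule on a few known codes: 06 is California, 11 is the District of Columbia, 56 is Wyoming, and the codes above that belong to the territories, for example 66 for Guam and 72 for Puerto Rico. The function is repeated here only so the snippet runs on its own.

```python
def instance_of_name(STATEFP):
    int_statefp = int(STATEFP)
    if int_statefp == 11:
        return 'U.S. federal district'
    if int_statefp > 56:
        return 'U.S. Territory'
    return 'U.S. State'


# Codes arrive as zero-padded strings in the Census data
for code in ["06", "11", "56", "66", "72"]:
    print(code, instance_of_name(code))
```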

In [9]:
def lookup_state(fips_alpha):
    # Find an existing item by its FIPS 5-2 alpha code (the STUSPS value in the source)
    q = """
    %(namespaces)s

    SELECT ?st ?fips_alpha
    WHERE {
    ?st wdt:%(p_fips_alpha)s "%(v_fips_alpha)s" .
    ?st wdt:%(p_fips_alpha)s ?fips_alpha .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    """ % {
        "namespaces": eew.sparql_namespaces(),
        "v_fips_alpha": fips_alpha,
        "p_fips_alpha": eew.prop_lookup['FIPS 5-2 alpha']
    }

    return eew.sparql_query(query=q, output="lookup")
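
To see what `lookup_state` sends, the sketch below renders the same template with stand-in values. The prefix line and the property ID `P99` are placeholders assumed for the example; the real ones come from `eew.sparql_namespaces()` and `eew.prop_lookup`. The label service line is left out here since the query selects no labels.

```python
QUERY_TEMPLATE = """
%(namespaces)s

SELECT ?st ?fips_alpha
WHERE {
?st wdt:%(p_fips_alpha)s "%(v_fips_alpha)s" .
?st wdt:%(p_fips_alpha)s ?fips_alpha .
}
"""


def render_lookup_query(fips_alpha, p_fips_alpha, namespaces):
    """Fill the lookup template; all three arguments are plain strings."""
    return QUERY_TEMPLATE % {
        "namespaces": namespaces,
        "v_fips_alpha": fips_alpha,
        "p_fips_alpha": p_fips_alpha,
    }


# Placeholder prefix and property ID, for illustration only
example_namespaces = "PREFIX wdt: <https://example.org/prop/direct/>"
rendered = render_lookup_query("GU", "P99", example_namespaces)
print(rendered)
```

The second triple pattern re-binds the matched code to `?fips_alpha`, which is presumably what lets the `lookup` output key the result by that code.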

## Item upsert

This is a somewhat new approach I'm trying: I check for an existing item based on a specific identifier and then either pull that item to work on or instantiate a new one. Everything else follows from there, establishing claims from the specific source and setting labels, descriptions, and aliases. WikibaseIntegrator (WBI) includes a "clear" parameter on the write operation, which will clear existing claims that align directly on prop_nr and value. It leaves in place any other claims that might exist on the item.
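
Stripped of the Wikibase specifics, the get-or-create flow can be sketched against an in-memory store. The `store` dict, the `upsert` function, and the ID numbering are all invented for this illustration and say nothing about how WBI handles writes internally.

```python
def upsert(store, key_field, record):
    """Update the item whose key_field matches record[key_field], or create one.

    store maps item IDs to dicts of fields. Returns (item_id, created).
    """
    key = record[key_field]
    for item_id, item in store.items():
        if item.get(key_field) == key:
            item.update(record)  # overwrite matching fields, keep the rest
            return item_id, False
    new_id = f"Q{len(store) + 1}"  # toy ID scheme for the example
    store[new_id] = dict(record)
    return new_id, True


store = {"Q1": {"fips_alpha": "GU", "label": "Guam", "note": "kept as-is"}}

first = upsert(store, "fips_alpha", {"fips_alpha": "GU", "label": "Guam (updated)"})
second = upsert(store, "fips_alpha", {"fips_alpha": "TX", "label": "Texas"})
print(first, second)
```

Keying on a stable identifier rather than the label means a renamed item is still found on the next run.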

In [14]:
# Every claim written below cites the source item as its reference
references = eew.models.References()
references.add(
    eew.datatypes.Item(
        prop_nr=eew.prop_lookup['data source'],
        value=source_item_id
    )
)

for _, row in state_records.iterrows():
    # Upsert: reuse the existing item for this FIPS alpha code, or start a new one
    wb_item_id = lookup_state(row.STUSPS)
    if wb_item_id:
        wb_item = eew.wbi.item.get(wb_item_id[row.STUSPS])
    else:
        wb_item = eew.wbi.item.new()

    wb_item.labels.set('en', row.NAME)
    wb_item.aliases.set('en', row.STUSPS)
    wb_item.descriptions.set('en', f'a {row.instance_of_name}')

    claims = eew.models.Claims()
    claims.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['instance of'],
            value=row.instance_of_qid,
            references=references
        )
    )
    claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['FIPS 5-2 alpha'],
            value=row.STUSPS,
            references=references
        )
    )
    claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['FIPS 5-2 numeric'],
            value=str(row.STATEFP),
            references=references
        )
    )
    claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['FIPS 10-4'],
            value=f"US{row.STATEFP}",
            references=references
        )
    )
    claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['ISO 3166-2 code'],
            value=f"US-{row.STUSPS}",
            references=references
        )
    )
    claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['TIGER GEOID'],
            value=str(row.GEOID),
            references=references
        )
    )
    claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['GNIS ID'],
            value=str(row.STATENS),
            references=references
        )
    )
    claims.add(
        eew.datatypes.GlobeCoordinate(
            prop_nr=eew.prop_lookup['coordinate location'],
            latitude=row.coordinates.y,
            longitude=row.coordinates.x,
            references=references
        )
    )
    claims.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['Legal/Statistical Area Description'],
            value=row.lsad_qid,
            references=references
        )
    )

    wb_item.add_claims(claims=claims)
    response = wb_item.write(
        summary="Updated item from Census source",
        clear=True
    )
    print("UPDATED:", row.NAME, response.id)


UPDATED: Guam Q242
UPDATED: Texas Q311
UPDATED: Wisconsin Q317
UPDATED: Rhode Island Q307
UPDATED: New York Q300
UPDATED: United States Virgin Islands Q152
UPDATED: New Hampshire Q297
UPDATED: Minnesota Q291
UPDATED: Puerto Rico Q149
UPDATED: Missouri Q293
UPDATED: North Carolina Q301
UPDATED: Michigan Q290
UPDATED: Louisiana Q286
UPDATED: Nebraska Q295
UPDATED: California Q272
UPDATED: Wyoming Q318
UPDATED: South Carolina Q308
UPDATED: Commonwealth of the Northern Mariana Islands Q59
UPDATED: Kansas Q284
UPDATED: Delaware Q275
UPDATED: Alaska Q269
UPDATED: New Jersey Q298
UPDATED: North Dakota Q302
UPDATED: District of Columbia Q276
UPDATED: Colorado Q273
UPDATED: Virginia Q314
UPDATED: Indiana Q282
UPDATED: Nevada Q296
UPDATED: New Mexico Q299
UPDATED: Alabama Q268
UPDATED: Tennessee Q310
UPDATED: Kentucky Q285
UPDATED: Oregon Q305
UPDATED: Mississippi Q292
UPDATED: Connecticut Q274
UPDATED: Georgia Q278
UPDATED: Utah Q312
UPDATED: Idaho Q280
UPDATED: Illinois Q281
UPDATED: Iowa Q283