I reworked this process to use the US Census data on the Microsoft Planetary Computer as its source, as I have done for other areas. This establishes a connection for future processing of U.S. counties when I need their full boundary geometry. I had already populated counties in previous work with the TIGERweb HTML tables, so this notebook just fills in some blanks that I somehow missed in that process.

I continue to be slowed by the wikibase.cloud instance not handling mass creation of items well. I tried parallel processing with Dask, but that only made matters worse. This notebook also ran into a problem with the SPARQL query for existing Wikibase items for U.S. counties: it does not return everything it should. Items created during the first run of this script, which failed partway through, do not appear in the query results.

In [1]:
import planetary_computer
import pystac_client
import pandas as pd
import geopandas as gpd

from wbmaker import WikibaseConnection

In [2]:
eew = WikibaseConnection('EEW')

In [3]:
# MPC Data Catalog
data_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

search = data_catalog.search(collections=["us-census"])
items = {item.id: item for item in search.items()}

In [4]:
item_counties = items['2020-cb_2020_us_county_500k']
asset_counties = item_counties.assets['data']
gdf_counties = gpd.read_parquet(
    asset_counties.href,
    storage_options=asset_counties.extra_fields["table:storage_options"]
)
gdf_counties.to_crs(epsg=4326, inplace=True)
gdf_counties["coordinates"] = gdf_counties.to_crs('+proj=cea').geometry.centroid.to_crs(gdf_counties.crs)


In [5]:
df_county_ref = eew.wb_ref_data('us_counties')

In [6]:
missing_counties = gdf_counties[~gdf_counties.GEOID.isin(df_county_ref.county_fips)].reset_index(drop=True)

### Problem here

This is where I ran into trouble: the Wikibase instance seems to lag in fully incorporating new records. The "missing counties" are actually in the Wikibase instance; I just can't turn them up with a SPARQL query.
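
One possible workaround, sketched here under assumptions that are not part of this notebook, is to keep a local record of every GEOID the script has successfully written and to subtract that record from the candidates before creating items. The file name `written_geoids.txt` and all three helper functions are hypothetical.

```python
from pathlib import Path

# Hypothetical local log (an assumption, not part of wbmaker): one GEOID per line.
LOG_PATH = Path("written_geoids.txt")


def load_written(path=LOG_PATH):
    """Return the set of GEOIDs recorded by earlier runs."""
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def record_written(geoid, path=LOG_PATH):
    """Append one GEOID to the log right after a successful write."""
    with path.open("a") as f:
        f.write(f"{geoid}\n")


def still_missing(candidate_geoids, path=LOG_PATH):
    """Keep only candidates that no earlier run has recorded as written."""
    written = load_written(path)
    return [g for g in candidate_geoids if str(g) not in written]
```

In this notebook, `record_written(row.GEOID)` could be called right after a successful `county_item.write(...)`, and `missing_counties` could be filtered with `missing_counties.GEOID.isin(still_missing(missing_counties.GEOID))` before the write loop. This only guards against items created by this script; it does not address the query lag itself.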

In [7]:
# Get existing US state reference from knowledgebase
wb_state_ref_query = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Feew-edgi.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Feew-edgi.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fstate%20%3FstateLabel%20%3FSTUSPS%0AWHERE%20%7B%0A%20%20%3Fstate%20wdt%3AP23%20%3FSTUSPS%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D%0A"

wb_state_ref_query_url = f"{eew.sparql_endpoint}?query={wb_state_ref_query}"
wb_state_ref_json = eew.url_sparql_query(wb_state_ref_query_url)

df_state_ref = eew.df_sparql_results(wb_state_ref_json)
df_state_ref["state_qid"] = df_state_ref['state'].apply(lambda x: x.split("/")[-1])
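
The percent-encoded string above is hard to read and edit. As a minimal sketch, the same query can be kept in readable form and encoded with the standard library's `urllib.parse.quote`. The variable names `state_query` and `encoded_query` are illustrative and do not appear elsewhere in this notebook.

```python
from urllib.parse import quote, unquote

# Readable form of the percent-encoded query string used in the cell above.
state_query = """PREFIX wd: <https://eew-edgi.wikibase.cloud/entity/>
PREFIX wdt: <https://eew-edgi.wikibase.cloud/prop/direct/>

SELECT ?state ?stateLabel ?STUSPS
WHERE {
  ?state wdt:P23 ?STUSPS .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

# safe="" also encodes "/" so the whole query fits inside a URL parameter.
encoded_query = quote(state_query, safe="")

# Round-trip check: decoding gives back the readable query.
assert unquote(encoded_query) == state_query
```

The result can then be dropped into the same URL pattern used above, `f"{eew.sparql_endpoint}?query={encoded_query}"`.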

In [8]:
missing_counties = pd.merge(
    left=missing_counties,
    right=df_state_ref[["STUSPS","state_qid","stateLabel"]],
    how="left",
    on="STUSPS"
)

In [None]:
references = eew.models.References()
references.add(
    eew.datatypes.Item(
        prop_nr=eew.prop_lookup['data source'],
        value=eew.ref_lookup['U.S. Census data as part of open public data catalog on the Microsoft Planetary Computer']
    )
)

for index, row in missing_counties.iterrows():
    county_label = f"{row.NAMELSAD}, {row.stateLabel}"
    county_item = eew.wbi.item.new()
    county_item.labels.set('en', county_label)
    county_item.descriptions.set('en', f"a county in {row.stateLabel}")
    county_item.aliases.set('en', [row.NAMELSAD, row.NAME])

    county_item.claims.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['instance of'],
            value=eew.class_lookup['U.S. County (or equivalent)'],
            references=references
        )
    )

    county_item.claims.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['U.S. state'],
            value=row.state_qid,
            references=references
        )
    )

    county_item.claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['GNIS ID'],
            value=str(row.COUNTYNS),
            references=references
        )
    )

    county_item.claims.add(
        eew.datatypes.ExternalID(
            prop_nr=eew.prop_lookup['TIGER GEOID'],
            value=str(row.GEOID),
            references=references
        )
    )

    county_item.claims.add(
        eew.datatypes.GlobeCoordinate(
            prop_nr=eew.prop_lookup['coordinate location'],
            latitude=row.coordinates.y,
            longitude=row.coordinates.x,
            references=references
        )
    )

    try:
        response = county_item.write(summary="Added county item from U.S. Census source")
        print("ADDED:", county_label, response.id)
    except Exception as e:
        # Include the error itself so failed writes can be diagnosed later
        print("PROBLEM:", county_label, repr(e))
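
Since the instance struggles with mass creation, one option is to slow down and retry failed writes with an increasing delay rather than moving straight on. The following is a minimal sketch of exponential backoff, not something this notebook currently does. The helper `write_with_retry` and its parameters are hypothetical and are not part of `wbmaker` or WikibaseIntegrator.

```python
import time


def write_with_retry(write_fn, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call write_fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller report the error
            sleep(base_delay * 2 ** attempt)
```

Inside the loop this might be used as `response = write_with_retry(lambda: county_item.write(summary="Added county item from U.S. Census source"))`. One caveat: if a failed call had in fact created the item on the server, a retry would create a duplicate, so this would need to be paired with some check for existing items.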