Working through the EPA FRS data, I found that location information at the level of a city or town is prominent in the records. I think it's important to add this level of location to the EEW knowledgebase for a couple of reasons:

1. As we start connecting the dots on other important information about populations and demographics, much of this will be tied to cities and "Census Designated Places." We want to be able to link to this information confidently and associate facilities in those locations.
2. In an effort to disambiguate facilities with the same name, we'll need to add in something else from the data. In some cases, this can come from a city, state reference while in others we will need to take things down to the level of the road/street part of an address.

This notebook uses the "places" data from the 2020 U.S. Census, as cached on the Microsoft Planetary Computer. We do a little prep work on those records to group the few cases where more than one Census record shares the same city, state label, since that label is all we really need to focus on here. In these cases, we add multiple GEOID, GNIS ID, and coordinate location claims to the grouped record to provide future linkages.
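As a minimal sketch of the grouping described above, the toy table below stands in for the Census places data. The "Springfield, XX" rows and their GEOID values are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the Census places table.
# The two "Springfield" rows are invented to show a shared label.
toy_places = pd.DataFrame(
    {
        "NAME": ["Springfield", "Springfield", "Albany"],
        "STUSPS": ["XX", "XX", "KY"],
        "GEOID": ["0000001", "0000002", "2100694"],
    }
)
toy_places["wb_label"] = toy_places["NAME"] + ", " + toy_places["STUSPS"]

# Collapse to one row per label, keeping every GEOID that shares the label.
toy_ref = toy_places.groupby("wb_label")["GEOID"].agg(list).reset_index()
toy_ref["GEOID"] = toy_ref["GEOID"].apply(lambda ids: sorted(set(ids)))

geoids_by_label = dict(zip(toy_ref["wb_label"], toy_ref["GEOID"]))
```

Each label ends up as a single row, and a label shared by several Census records carries all of their identifiers as a list, which is what later becomes multiple claims on one item.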

Initially, we only add in places where we have a match from the FRS data on city, state label. There may be additional records needed from the source once we work through issues in the FRS data with missing city, state or misalignment with coordinate locations. We'll come back to that in another step, pulling the boundary polygons from the US Census source for that exercise.

In [1]:
import planetary_computer
import pystac_client
import pandas as pd
import geopandas as gpd

from wbmaker import WikibaseConnection

In [2]:
eew = WikibaseConnection('EEW')

In [3]:
# MPC Data Catalog
data_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

search = data_catalog.search(collections=["us-census"])
items = {item.id: item for item in search.items()}

In [4]:
# Place Reference Data from US Census
item_places = items["2020-cb_2020_us_place_500k"]
asset_places = item_places.assets["data"]
gdf_places = gpd.read_parquet(
    asset_places.href,
    storage_options=asset_places.extra_fields["table:storage_options"]
)
gdf_places.to_crs(epsg=4326, inplace=True)

gdf_places["wb_label"] = gdf_places.apply(lambda x: f"{x.NAME}, {x.STUSPS}", axis=1)
gdf_places["city_state"] = gdf_places.apply(lambda x: f"{x.NAME.lower()}, {x.STUSPS.lower()}", axis=1)
gdf_places["city_state_alt"] = gdf_places.apply(lambda x: f"{x.NAMELSAD.lower()}, {x.STUSPS.lower()}", axis=1)
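# Compute centroids in a cylindrical equal-area projection, then convert them back to the frame's CRS (EPSG:4326)
gdf_places["coordinates"] = gdf_places.to_crs('+proj=cea').geometry.centroid.to_crs(gdf_places.crs)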


In [5]:
# Prep Census places for knowledgebase representation
census_place_label_lookup = gdf_places[["wb_label","NAMELSAD","STUSPS"]].reset_index(drop=True)
census_place_label_lookup["wb_label_lower"] = census_place_label_lookup.wb_label.apply(lambda x: x.lower())
census_place_label_lookup["wb_alt_label_lower"] = census_place_label_lookup.apply(lambda x: f"{x.NAMELSAD.lower()}, {x.STUSPS.lower()}", axis=1)
census_place_label_lookup.drop(columns=["NAMELSAD","STUSPS"], inplace=True)

census_place_ref = gdf_places[["wb_label","NAME","NAMELSAD","STUSPS","GEOID","PLACENS","coordinates"]].groupby("wb_label").agg(list).reset_index("wb_label")

census_place_ref["aliases"] = census_place_ref.apply(lambda x: list(set(x.NAME + x.NAMELSAD)), axis=1)
census_place_ref["state_abbrev"] = census_place_ref.STUSPS.apply(lambda x: x[0])
census_place_ref["GEOID"] = census_place_ref.GEOID.apply(lambda x: list(set(x)))
census_place_ref["PLACENS"] = census_place_ref.PLACENS.apply(lambda x: list(set(x)))
census_place_ref["coordinates"] = census_place_ref.coordinates.apply(lambda x: list(set([(i.x,i.y) for i in x])))
census_place_ref.drop(columns=["STUSPS","NAME","NAMELSAD"], inplace=True)

In [7]:
# Get existing US state reference from knowledgebase
df_state_ref = eew.wb_ref_data('us_states')
df_state_ref["state_qid"] = df_state_ref['state'].apply(lambda x: x.split("/")[-1])

In [8]:
# Add state QIDs to places
census_place_ref = pd.merge(
    left=census_place_ref,
    right=df_state_ref[["state_abbrev","state_qid"]],
    how="left",
    on="state_abbrev"
)

In [9]:
# Pull cached FRS facilities
# Pre-cached in my MPC Hub folder space
frs_facilities = pd.read_parquet('data/FRS_FACILITIES.parquet')

In [10]:
# Prep FRS city/state to determine what we need from the full Census data source
frs_city_state = frs_facilities[
    frs_facilities.FAC_CITY.notnull()
    &
    frs_facilities.FAC_STATE.notnull()
][["FAC_CITY","FAC_STATE"]].drop_duplicates().reset_index(drop=True)
frs_city_state["city_state"] = frs_city_state.apply(lambda x: f"{x.FAC_CITY.lower()}, {x.FAC_STATE.lower()}", axis=1)

wb_place_labels = census_place_label_lookup[
    census_place_label_lookup.wb_label_lower.isin(frs_city_state.city_state)
    |
    census_place_label_lookup.wb_alt_label_lower.isin(frs_city_state.city_state)
].wb_label.to_list()

places_to_wb = census_place_ref[census_place_ref.wb_label.isin(wb_place_labels)].reset_index(drop=True)
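The matching in the cell above is a lowercase comparison against either the short label or the longer NAMELSAD-based label. A rough sketch of how one might also list the FRS city, state strings that matched neither form is shown below. The toy rows are illustrative only ("Nowhereville" is invented), and `known` and `unmatched` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Toy FRS-style input; values are illustrative, not real FRS records.
frs_toy = pd.DataFrame(
    {
        "FAC_CITY": ["ALLENTOWN", "Allentown City", "Nowhereville"],
        "FAC_STATE": ["PA", "PA", "PA"],
    }
)
frs_toy["city_state"] = (
    frs_toy["FAC_CITY"].str.lower() + ", " + frs_toy["FAC_STATE"].str.lower()
)

# Toy lookup with the same columns as census_place_label_lookup.
lookup_toy = pd.DataFrame(
    {
        "wb_label": ["Allentown, PA"],
        "wb_label_lower": ["allentown, pa"],
        "wb_alt_label_lower": ["allentown city, pa"],
    }
)

# A place counts as known if either lowercase form appears in the lookup.
known = set(lookup_toy["wb_label_lower"]) | set(lookup_toy["wb_alt_label_lower"])
frs_toy["matched"] = frs_toy["city_state"].isin(known)

# FRS labels with no Census match; candidates for later cleanup.
unmatched = frs_toy.loc[~frs_toy["matched"], "city_state"].tolist()
```

A list like `unmatched` could be a starting point for the later step on missing or misaligned city, state values.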

### Issue

There's a problem in pulling the full set of records from the Wikibase instance: we are not getting back every record we ask for in this query. I'm classifying all of these places as instances of human settlement (a high-level classification aligned with Wikidata that we can revisit later).

I had worked up a version of this that used Dask delayed to build items in parallel, but that ran into issues with too many connections to the WB instance. Running the process sequentially, I'm still seeing errors when we try to write a label/description combination that's already in the instance but did not come back in our search.

This probably has something to do with a problem I noted with the wikibase.cloud team on index completion. When a WB item is written, a number of different "pieces" of that record have to be built out in the underlying SQL database and Elasticsearch. This relies on a background process that apparently doesn't always complete, or at least not right away. Even days after submitting records, though, I'm seeing problems in type-ahead search completion, SPARQL queries, and other places that indicate missing records, even though I can go to the QID landing page and see the record.
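Given the errors described above, one defensive option is to keep the writes sequential and record the error text for each failure, so the failed labels can be reconciled later against their QID pages. This is only a sketch: `write_sequentially` and `fake_write` are hypothetical names, and `write_fn` stands in for whatever code builds and writes an item.

```python
import time
from typing import Callable, Iterable, List, Tuple


def write_sequentially(
    labels: Iterable[str],
    write_fn: Callable[[str], str],
    pause_seconds: float = 0.0,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Call write_fn once per label, one at a time.

    write_fn is a hypothetical stand-in for the item-building code; it
    should return the new item ID or raise on failure. Returns
    (added, failed): added holds (label, item_id) pairs and failed holds
    (label, error_message) pairs.
    """
    added, failed = [], []
    for label in labels:
        try:
            added.append((label, write_fn(label)))
        except Exception as exc:  # keep going, but remember why it failed
            failed.append((label, f"{type(exc).__name__}: {exc}"))
        if pause_seconds:
            time.sleep(pause_seconds)  # optional pause between writes
    return added, failed


# Demo with a fake writer that rejects a label it has already seen.
_seen = {"Albany, KY"}


def fake_write(label: str) -> str:
    if label in _seen:
        raise ValueError("label/description already exists")
    _seen.add(label)
    return f"Q{len(_seen)}"


added, failed = write_sequentially(["Albany, KY", "Algood, TN"], fake_write)
```

Keeping the failures in a list, rather than only printing them, makes it possible to re-check them once the background indexing has had time to catch up.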

In [11]:
# Check for cities we already have in the knowledgebase
df_city_ref = eew.wb_ref_data('us_cities')

# Filter to the records we still need to add to the knowledgebase
missing_cities = places_to_wb[~places_to_wb.wb_label.isin(df_city_ref.cityLabel)]

print("MISSING CITY RECORDS:", len(missing_cities))

MISSING CITY RECORDS: 351


In [15]:
df_city_ref.head()

Unnamed: 0,city,cityLabel,state,stateLabel,gnis_id,geoid
0,https://eew-edgi.wikibase.cloud/entity/Q4242,"Allegany, NY",https://eew-edgi.wikibase.cloud/entity/Q300,New York,2391509,3601286
1,https://eew-edgi.wikibase.cloud/entity/Q4243,"Allegan, MI",https://eew-edgi.wikibase.cloud/entity/Q290,Michigan,1625819,2601260
2,https://eew-edgi.wikibase.cloud/entity/Q4244,"Alleghany, CA",https://eew-edgi.wikibase.cloud/entity/Q272,California,2582931,600982
3,https://eew-edgi.wikibase.cloud/entity/Q4245,"Allen, KS",https://eew-edgi.wikibase.cloud/entity/Q284,Kansas,2393921,2001275
4,https://eew-edgi.wikibase.cloud/entity/Q4246,"Alleman, IA",https://eew-edgi.wikibase.cloud/entity/Q283,Iowa,2393920,1901180


In [12]:
missing_cities

Unnamed: 0,wb_label,GEOID,PLACENS,coordinates,aliases,state_abbrev,state_qid
197,"Albany, KY",[2100694],[2403071],"[(-85.13529756416843, 36.69054372377878)]","[Albany, Albany city]",KY,Q285
210,"Albemarle, NC",[3700680],[2403073],"[(-80.19147829275747, 35.3593285337532)]","[Albemarle city, Albemarle]",NC,Q301
282,"Alexandria, PA",[4200756],[1215264],"[(-78.099853020863, 40.558307156305844)]","[Alexandria borough, Alexandria]",PA,Q306
299,"Algood, TN",[4700640],[2405133],"[(-85.44675960970967, 36.19998328243239)]","[Algood city, Algood]",TN,Q310
340,"Allentown, PA",[4202000],[1215372],"[(-75.47559349633475, 40.596155537947496)]","[Allentown, Allentown city]",PA,Q306
...,...,...,...,...,...,...,...
24206,"White Bear Lake, MN",[2769970],[2397299],"[(-93.01497052820436, 45.06563459450618)]","[White Bear Lake, White Bear Lake city]",MN,Q291
24252,"White River, SD",[4671340],[1267653],"[(-100.74487253796136, 43.567064252603714)]","[White River, White River city]",SD,Q309
24279,"Whitelaw, WI",[5586775],[1584428],"[(-87.82780441833039, 44.14534763250391)]","[Whitelaw village, Whitelaw]",WI,Q317
24831,"Worthville, KY",[2184900],[2405792],"[(-85.06812329511678, 38.609775107591915)]","[Worthville, Worthville city]",KY,Q285


In [None]:
references = eew.models.References()
references.add(
    eew.datatypes.Item(
        prop_nr=eew.prop_lookup['data source'],
        value=eew.ref_lookup['U.S. Census data as part of open public data catalog on the Microsoft Planetary Computer']
    )
)

for index, row in missing_cities.iterrows():
    try:
        place_item = eew.wbi.item.new()
        place_item.labels.set('en', row.wb_label)
        place_item.descriptions.set('en', 'a city, town, or other named place from U.S. Census data')
        place_item.aliases.set('en', row.aliases)

        place_item.claims.add(
            eew.datatypes.Item(
                prop_nr=eew.prop_lookup['instance of'],
                value=eew.class_lookup['human settlement'],
                references=references
            )
        )

        place_item.claims.add(
            eew.datatypes.Item(
                prop_nr=eew.prop_lookup['U.S. state'],
                value=row.state_qid,
                references=references
            )
        )

        gnis_id_claims = []
        for gnis_id in row.PLACENS:
            gnis_id_claims.append(
                eew.datatypes.ExternalID(
                    prop_nr=eew.prop_lookup['GNIS ID'],
                    value=str(gnis_id),
                    references=references
                )
            )
        place_item.claims.add(gnis_id_claims)

        geoid_claims = []
        for geoid in row.GEOID:
            geoid_claims.append(
                eew.datatypes.ExternalID(
                    prop_nr=eew.prop_lookup['TIGER GEOID'],
                    value=str(geoid),
                    references=references
                )
            )
        place_item.claims.add(geoid_claims)

        loc_claims = []
        for coord_set in row.coordinates:
            loc_claims.append(
                eew.datatypes.GlobeCoordinate(
                    prop_nr=eew.prop_lookup['coordinate location'],
                    latitude=coord_set[1],
                    longitude=coord_set[0],
                    references=references
                )
            )
        place_item.claims.add(loc_claims)
    
        response = place_item.write(summary="Added place item from U.S. Census place data")
        print("ADDED:", row.wb_label, response.id)

    except Exception as e:
        # Report the error itself so failed writes can be diagnosed and retried
        print("PROBLEM WITH:", row.wb_label, repr(e))