This notebook works back through the PW baseline to add "addresses place" claims based on PW Catalog metadata for countries, states, and counties that can be confidently matched to established items in the GeoKB. We are ignoring cities for the time being because a) I've not yet pulled any of those into the GeoKB and b) the PW metadata is pretty poor as it does not indicate what state or country the cities are in. We could use the context of any country/state/county combinations to get close on some records, but we'll deal with that down the road at some point.

The otherGeospatial field from the PW Catalog is potentially even more interesting in that it can contain other geographic place names beyond administrative boundaries. This is another area I need to work through at some point, but we'll have to compare the quality of the metadata with what we can extract from full texts. For now, I'm ignoring this field.

I place this information using the "addresses subject" predicate based on a presumption that that is what the metadata means - the article/report has something in it that addresses the named place.

In [1]:
import pickle
import pandas as pd
from wbmaker import WikibaseConnection

ModuleNotFoundError: No module named 'sqlalchemy'

In [2]:
geokb = WikibaseConnection('GEOKB_CLOUD')

# Working with a cached dump of USGS Numbered Series
pw_dump = pd.DataFrame(pickle.load(open('data/pw_usgs_reports_dump.pickle', 'rb')))

# Get PW IDs

I'm running the whole process as a batch to establish a baseline of PW items in the GeoKB. To build out a batch of claims for places, I need to get the mapping of GeoKB QID to PW indexId. There is a current issue in the SPARQL query not returning all the items it should be. To compensate, I've run a handful of queries to pull out more of the items, stashing those into a few files stored temporarily on the MPC hub where I'm running this code. The following codeblock pulls all of that together to create an ID map.

In [3]:
geokb_pw_ids = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FindexId%0AWHERE%20%7B%0A%20%20%3Fclasses%20wdt%3AP2%20wd%3AQ11%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Fclasses%20.%0A%20%20%3Fitem%20wdt%3AP114%20%3FindexId%20.%0A%7D",
    output_format="dataframe"
)

geokb_pw_ids["qid"] = geokb_pw_ids['item'].apply(lambda x: x.split('/')[-1])

id_dfs = [
    geokb_pw_ids.drop(columns=['item'])
]

# Workaround for partial failure state in SPARQL
import json
from glob import glob

for fn in glob('./data/extra_pwids/*'):
    cached_ids = pd.DataFrame(json.load(open(fn)))
    cached_ids['qid'] = cached_ids['item'].apply(lambda x: x.split('/')[-1])
    id_dfs.append(cached_ids.drop(columns=['item']))

id_map = pd.concat(id_dfs)

id_map.drop_duplicates(inplace=True)

# GeoNames

Here I tease out the country and state fields from the PW dump and set those up for easier processing. Some records list multiple, so I split those into lists and then explode them for use.

In [4]:
def split_geo_names(name_string):
    if not isinstance(name_string, str):
        return
    delim = ","
    if ";" in name_string:
        delim=';'

    return [i.strip() for i in name_string.split(delim)]

pw_geo_name_props = [
    "indexId",
    "country",
    "state"
]

pw_geo_names = pw_dump[pw_geo_name_props].dropna(subset=[i for i in pw_geo_name_props if i != 'indexId'], how="all")
pw_geo_names['country'] = pw_geo_names['country'].apply(split_geo_names)
pw_geo_names['state'] = pw_geo_names['state'].apply(split_geo_names)

dfs = []
for name_class in pw_geo_name_props:
    if name_class != 'indexId':
        class_df = pw_geo_names[pw_geo_names[name_class].str.len() > 0][['indexId',name_class]].explode(name_class).rename(columns={name_class: 'place_name'}).reset_index(drop=True)
        class_df['name_type'] = name_class
        dfs.append(class_df)

pw_classed_geo_names = pd.concat(dfs)

# Countries

In [5]:
pw_countries = pw_classed_geo_names[pw_classed_geo_names['name_type'] == 'country']

# Get country reference from the GeoKB
geokb_countries = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fitem_alt_label%20%3Fiso_country_code%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP38%20%3Fiso_country_code%20.%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3Fitem%20skos%3AaltLabel%20%3Fitem_alt_label%20.%0A%20%20%20%20FILTER%20(lang(%3Fitem_alt_label)%3D'en')%0A%20%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)
geokb_countries["object"] = geokb_countries['item'].apply(lambda x: x.split('/')[-1])
country_lookup = pd.concat([
    geokb_countries[['object','itemLabel']].drop_duplicates().rename(columns={'itemLabel': 'country_name'}),
    geokb_countries[['object','item_alt_label']].dropna().rename(columns={'item_alt_label': 'country_name'}),
])
country_lookup.drop_duplicates(inplace=True)

# Build country claims
country_claims = pd.merge(
    left=pw_classed_geo_names[pw_classed_geo_names['name_type'] == 'country'],
    right=country_lookup[['object','country_name']].rename(columns={'country_name': 'place_name'}),
    how="inner",
    on="place_name"
)

country_claims.head()

Unnamed: 0,indexId,place_name,name_type,object
0,ofr20231045,United States,country,Q161
1,ofr20231052,United States,country,Q161
2,sir20235079,United States,country,Q161
3,fs20233030,United States,country,Q161
4,sir20235042,United States,country,Q161


# States/Provinces

The majority of "state" strings in the PW Catalog are for the U.S. along with some for Canada and Mexico. While there are other strings that could be useful, I have not yet incorporated first-level subdivisions for any other countries.

In [6]:
pw_states = pw_classed_geo_names[pw_classed_geo_names['name_type'] == 'state']

# Get state, territory, province references from the GeoKB for US, CA, and MX
geokb_state_entities = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20%3Fstate_classes%20.%0A%20%20VALUES%20%3Fstate_classes%20%7B%20wd%3AQ138361%20wd%3AQ138360%20wd%3AQ138362%20wd%3AQ229%20wd%3AQ25363%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

geokb_state_entities['object'] = geokb_state_entities['item'].apply(lambda x: x.split('/')[-1])
geokb_state_entities.drop(columns='item', inplace=True)

state_claims = pd.merge(
    left=pw_states,
    right=geokb_state_entities.rename(columns={'itemLabel': 'place_name'}),
    how="inner",
    on="place_name"
)

state_claims.head()

Unnamed: 0,indexId,place_name,name_type,object
0,ofr20231052,Alaska,state,Q254
1,ofr20231049,Alaska,state,Q254
2,sir20235039,Alaska,state,Q254
3,sir20235029,Alaska,state,Q254
4,sir20225099,Alaska,state,Q254


# Counties

County references are problematic in the PW Catalog because they are simple text strings with no specific context or identifiers. To deal with this, I run through and only use items where there are both state and county lists, build strings that match how I labeled these in the GeoKB (e.g., Mesa County, Colorado), and then only make connections where I have an exact match on those strings. This cuts out a bunch of misspellings and other issues, but those are all suspect anyway.

In [7]:
def county_states(x):
    place_names = []
    for state in x['states']:
        for county in x['counties']:
            place_names.append(f"{county}, {state}")
    return place_names

pw_counties = pw_dump[['indexId','county','state']].reset_index(drop=True)
pw_counties.dropna(subset=['county','state'], inplace=True)
pw_counties['counties'] = pw_counties['county'].apply(lambda x: [i.strip() for i in x.split(',')])
pw_counties['states'] = pw_counties['state'].apply(lambda x: [i.strip() for i in x.split(',')])
pw_counties.drop(columns=['state','county'], inplace=True)

pw_counties['place_name'] = pw_counties.apply(county_states, axis=1)
pw_counties = pw_counties[['indexId','place_name']].explode("place_name").reset_index(drop=True)

# Get county references from the GeoKB for the U.S.
geokb_county_entities = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ481%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

geokb_county_entities['object'] = geokb_county_entities['item'].apply(lambda x: x.split('/')[-1])
geokb_county_entities.drop(columns='item', inplace=True)

# Build the claims
county_claims = pd.merge(
    left=pw_counties,
    right=geokb_county_entities.rename(columns={'itemLabel': 'place_name'}),
    how="inner",
    on="place_name"
)

county_claims.head()

Unnamed: 0,indexId,place_name,object
0,ofr20231047,"Kern County, California",Q51311
1,ofr20231014,"Kern County, California",Q51311
2,ofr20201116,"Kern County, California",Q51311
3,ofr20191114,"Kern County, California",Q51311
4,ofr20161181,"Kern County, California",Q51311


# Combine places

We only want to pull each item once, so we combine all the claims together and get them lined up with our ID map to the subject item QID in the GeoKB

In [8]:
place_claims = pd.concat([
    country_claims[["indexId","object"]],
    state_claims[["indexId","object"]],
    country_claims[["indexId","object"]]
])

place_claims_identified = pd.merge(
    left=place_claims,
    right=id_map,
    how="inner",
    on="indexId"
)

place_claims_identified.head()

Unnamed: 0,indexId,object,qid
0,ofr20231045,Q161,Q55218
1,ofr20231045,Q161,Q55218
2,ofr20231052,Q161,Q55220
3,ofr20231052,Q254,Q55220
4,ofr20231052,Q161,Q55220


# Submit as addresses place claims

In [9]:
addresses_place_claims = place_claims_identified[['qid','object']].groupby('qid')['object'].agg(list).to_dict()

In [14]:
list(addresses_place_claims.items())[:5]

[('Q100001', ['Q161', 'Q236', 'Q161']),
 ('Q100003', ['Q161', 'Q256', 'Q161']),
 ('Q100007', ['Q161', 'Q276', 'Q268', 'Q161']),
 ('Q100013', ['Q161', 'Q256', 'Q161']),
 ('Q100032', ['Q161', 'Q233', 'Q161'])]

In [None]:
refs = geokb.models.References()
refs.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup["data source"],
        value="Q54915"
    )
)

failures = []
successes = []

for item_qid, place_qids in addresses_place_claims.items():
    item = geokb.wbi.item.get(item_qid)

    place_claims = []
    for place_qid in place_qids:
        place_claims.append(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['addresses place'],
                value=place_qid,
                references=refs
            )
        )

    item.claims.add(place_claims, action_if_exists=geokb.action_if_exists.REPLACE_ALL)

    try:
        response = item.write(
            summary="Added addresses place claims to publication from Pubs Warehouse country/state/county information that was successfully linked by name to GeoKB labels"
        )
        print(f"SUCCESS:{response.id}")
    except:
        print(f"FAILURE:{item_qid}")