
Now that I'm working out more of the methodology for documenting source data in Wikibase items, this notebook exercises using those source items to build item content. One of the big issues I still have to work out is how to deal effectively with change over time. One piece of that is figuring out, at the item level, what we already know about; then we have to figure out the same thing at the claim level. Already knowing about something doesn't necessarily mean we ignore it; it just means we need to decide what to do in that case.

In [1]:
import pandas as pd
from wbmaker import WikibaseConnection

In [2]:
eew = WikibaseConnection('eew')

In [3]:
data_sources = eew.datasets(output='dataframe')

In [4]:
data_sources

Unnamed: 0,item,itemLabel
0,https://eew-edgi.wikibase.cloud/entity/Q13,Wikidata listing of world countries
1,https://eew-edgi.wikibase.cloud/entity/Q264,TIGER data file source for U.S. States
2,https://eew-edgi.wikibase.cloud/entity/Q265,Wikidata listing of U.S. states and territories
3,https://eew-edgi.wikibase.cloud/entity/Q321,TIGER data file source for Alaska Native Regio...
4,https://eew-edgi.wikibase.cloud/entity/Q337,TIGER data file source for Federal American In...
5,https://eew-edgi.wikibase.cloud/entity/Q522,TIGER data file source for Hawaiian Home Lands
6,https://eew-edgi.wikibase.cloud/entity/Q654,TIGER data file source for U.S. State American...
7,https://eew-edgi.wikibase.cloud/entity/Q657,TIGER data file source for U.S. Counties
8,https://eew-edgi.wikibase.cloud/entity/Q664,TIGER data file source for U.S. Congressional ...


# Data Source Configuration

In [5]:
county_datasource_qid = 'Q657'
county_ds = eew.datasource(county_datasource_qid)

In [6]:
county_ds

Unnamed: 0,wdLabel,ps_,ps_Label,ps_type,wdpq,wdpqLabel,pq_,pq_Label
0,instance of,https://eew-edgi.wikibase.cloud/entity/Q11,dataset,,,,,
1,reference URL,https://tigerweb.geo.census.gov/tigerwebmain/T...,https://tigerweb.geo.census.gov/tigerwebmain/T...,,,,,
2,html table,https://tigerweb.geo.census.gov/tigerwebmain/F...,https://tigerweb.geo.census.gov/tigerwebmain/F...,,https://eew-edgi.wikibase.cloud/entity/P44,applies to jurisdiction,https://eew-edgi.wikibase.cloud/entity/Q273,Colorado
3,html table,https://tigerweb.geo.census.gov/tigerwebmain/F...,https://tigerweb.geo.census.gov/tigerwebmain/F...,,https://eew-edgi.wikibase.cloud/entity/P44,applies to jurisdiction,https://eew-edgi.wikibase.cloud/entity/Q280,Idaho
4,html table,https://tigerweb.geo.census.gov/tigerwebmain/F...,https://tigerweb.geo.census.gov/tigerwebmain/F...,,https://eew-edgi.wikibase.cloud/entity/P44,applies to jurisdiction,https://eew-edgi.wikibase.cloud/entity/Q270,Arizona
...,...,...,...,...,...,...,...,...
61,property from data source,https://eew-edgi.wikibase.cloud/entity/P41,label,http://wikiba.se/ontology#String,https://eew-edgi.wikibase.cloud/entity/P38,source property,NAME,NAME
62,property from data source,https://eew-edgi.wikibase.cloud/entity/P43,alias,http://wikiba.se/ontology#String,https://eew-edgi.wikibase.cloud/entity/P38,source property,BASENAME,BASENAME
63,property from data source,https://eew-edgi.wikibase.cloud/entity/P36,TIGER GEOID,http://wikiba.se/ontology#ExternalId,https://eew-edgi.wikibase.cloud/entity/P38,source property,GEOID,GEOID
64,property from data source,https://eew-edgi.wikibase.cloud/entity/P11,coordinate location,http://wikiba.se/ontology#GlobeCoordinate,https://eew-edgi.wikibase.cloud/entity/P38,source property,"CENTLAT,CENTLON","CENTLAT,CENTLON"


# Source Material

The source for counties we documented from the U.S. Census TIGER data system is a collection of HTML pages with tables that we can harvest. The query for full details on the source item is generalized in that it brings back every property and qualifier. We need to make sense of what we get for this specific purpose. We get two things in this case:

1. The URL to go get the table from
2. The state the counties are located in, including its local identifier in this Wikibase instance so we can link to it

We set that up here as a simpler table to work from.

In [7]:
html_tables = county_ds[county_ds.wdLabel == 'html table'][
    ['ps_','pq_','pq_Label']
].rename(columns={
    'ps_': 'source_url',
    'pq_': 'linked_state_qid',
    'pq_Label': 'linked_state'
}).reset_index(drop=True)

html_tables['linked_state_qid'] = html_tables.linked_state_qid.apply(eew.extract_wbid)
html_tables.head()

Unnamed: 0,source_url,linked_state_qid,linked_state
0,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q273,Colorado
1,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q280,Idaho
2,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q270,Arizona
3,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q287,Maine
4,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q277,Florida


# Entity Classification

Another key piece of information we need is how to classify the entities we will be getting from this source. That's included as another property on the data source representation. It tells us how to create the items and also how to check whether any of the items already exist. Right now, this is just a single value, but we could end up with use cases where we need to classify something multiple ways, since an item can be an instance of multiple things simultaneously. To make this easy to read, I pull out the pertinent bits of the dataframe here.

In [8]:
classifier = county_ds[county_ds.wdLabel == 'entity classifier'].reset_index(drop=True)
classifier['ps_qid'] = classifier.ps_.apply(eew.extract_wbid)
classifier = classifier.set_index('ps_Label')['ps_qid'].to_dict()
classifier

{'U.S. County (or equivalent)': 'Q655'}

# Property Mapping

The other important piece of information recorded in items describing a data source is how to map what the source provides to how we want it represented in the knowledge graph. We're following a principle of minimizing transformation at the point of ingest: we take only what we can make sense of reasonably well and that supports our use cases, then do more work once the information is in our system. Tables like these support a fairly simple mapping to a few key properties. This is done with another property on the source item, "property from data source".

Here we extract the pertinent information from the dataframe and rename things to make better sense.

In [9]:
prop_mapping = county_ds[county_ds.wdLabel == 'property from data source'].reset_index(drop=True)
prop_mapping['wb_pid'] = prop_mapping.ps_.apply(eew.extract_wbid)
prop_mapping['wb_property_type'] = prop_mapping.ps_type.apply(lambda x: x.split('#')[-1])
prop_mapping = prop_mapping[['wb_pid','ps_Label','wb_property_type','pq_']].rename(columns={
    'ps_Label': 'wb_property',
    'pq_': 'source_property'
})
display(prop_mapping)

Unnamed: 0,wb_pid,wb_property,wb_property_type,source_property
0,P25,GNIS ID,ExternalId,COUNTYNS
1,P41,label,String,NAME
2,P43,alias,String,BASENAME
3,P36,TIGER GEOID,ExternalId,GEOID
4,P11,coordinate location,GlobeCoordinate,"CENTLAT,CENTLON"
5,P34,FIPS 10-4,ExternalId,GEOID


# Retrieve the source

We're going to want to check whether we have any of these counties already in our Wikibase instance. If we were working against some kind of API, we might be able to start with what we already know and go after anything missing, or use some other means to narrow it down first. We could potentially do that using the ArcGIS REST services provided by the TIGER system, but county references really don't change much, and running a full harvest of the tables doesn't take much time. I did put a helper function into "wbmaker" for this process, so we can just run our URLs and build a collection of dataframes to put together and work on.

In [10]:
%%time
county_dfs = []

for source in html_tables.to_dict('records'):
    county_dfs.append(
        eew.get_html_table(
            url=source['source_url'],
            injection=source
        )
    )

CPU times: user 5.59 s, sys: 105 ms, total: 5.69 s
Wall time: 59.5 s
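The internals of the `wbmaker` helper aren't shown in this notebook. As a rough, hypothetical stand-in for the idea (parse a page's table into a dataframe and inject the source metadata as extra columns), a sketch using only the standard library parser and pandas might look like the following. It assumes the page HTML has already been downloaded and that the first row of the table holds the column names; `html_table_to_df` is an illustrative name, not the real helper.

```python
from html.parser import HTMLParser
import pandas as pd

class _TableRows(HTMLParser):
    """Collect the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_df(html, injection=None):
    """Hypothetical helper: table rows -> DataFrame, plus injected constant columns."""
    parser = _TableRows()
    parser.feed(html)
    if not parser.rows:
        raise ValueError("no table rows found")
    header, *body = parser.rows
    df = pd.DataFrame(body, columns=header)  # every cell is kept as a string
    for key, value in (injection or {}).items():
        df[key] = value
    return df

page = "<table><tr><th>GEOID</th><th>NAME</th></tr><tr><td>08001</td><td>Adams County</td></tr></table>"
df_example = html_table_to_df(page, injection={"linked_state": "Colorado"})
```

Injecting the source record into every row is what lets each county carry its state link forward into later steps.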


In [12]:
df_counties = pd.concat(county_dfs).reset_index(drop=True)
df_counties.head()

Unnamed: 0,MTFCC,OID,GEOID,STATE,COUNTY,COUNTYNS,BASENAME,NAME,LSADC,FUNCSTAT,...,CSA,CBSA,METDIV,CENTLAT,CENTLON,INTPTLAT,INTPTLON,source_url,linked_state_qid,linked_state
0,G4020,27590700234319,8001,8,1,198116,Adams,Adams County,6,A,...,,,,39.8741784,-104.337846,39.8743252,-104.3318718,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q273,Colorado
1,G4020,2759086215981,8003,8,3,198117,Alamosa,Alamosa County,6,A,...,,,,37.5728342,-105.7884,37.5684423,-105.7880414,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q273,Colorado
2,G4020,27590703789414,8005,8,5,198118,Arapahoe,Arapahoe County,6,A,...,,,,39.6503198,-104.3393295,39.6445537,-104.3317065,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q273,Colorado
3,G4020,275901333230110,8007,8,7,198119,Archuleta,Archuleta County,6,A,...,,,,37.1937061,-107.0481382,37.2023952,-107.0508634,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q273,Colorado
4,G4020,27590100102454,8009,8,9,198120,Baca,Baca County,6,A,...,,,,37.319105,-102.5604802,37.3097802,-102.5437412,https://tigerweb.geo.census.gov/tigerwebmain/F...,Q273,Colorado
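The GEOID values display here without leading zeros (8001 rather than 08001), which suggests the column was parsed as integers. TIGER county GEOIDs are five characters (two-digit state FIPS plus three-digit county FIPS), so if the external identifiers should keep that form, a sketch like this could restore the padding before building claims. The `pad_identifiers` helper and the `ID_WIDTHS` table are hypothetical, and the eight-character width for COUNTYNS is an assumption worth checking against the source.

```python
import pandas as pd

# Assumed fixed widths for identifier columns (verify against the source)
ID_WIDTHS = {"GEOID": 5, "COUNTYNS": 8}

def pad_identifiers(df, widths=ID_WIDTHS):
    """Return a copy of df with identifier columns as zero-padded strings."""
    out = df.copy()
    for column, width in widths.items():
        out[column] = out[column].astype(str).str.zfill(width)
    return out

# Toy rows mirroring the first two Colorado records
sample = pd.DataFrame({"GEOID": [8001, 8003], "COUNTYNS": [198116, 198117]})
padded = pad_identifiers(sample)
```

Converting to strings also matters because external identifier values are text in Wikibase, not numbers.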


## Validate assumptions

Before processing what we scraped from tables, we need to validate some assumptions and take at least one action on the source information:

* We need to put together county name with state name to make a unique label within what will be a single instance of classification ("U.S. County (or equivalent)"). And we need to make sure that is actually unique.
* We assume that the three identifiers we're going to use here are unique across this collection, but we need to make sure.
* We assume that the geospatial center coordinates are correct because they came from a GIS system that is relied upon. We don't know if they are actually accurate, but we can at least make sure they are valid. We could go so far as to pull geospatial information together and make sure they are within their states or compare with another source, but we won't worry about that for now.

In [13]:
df_counties['label'] = df_counties.apply(lambda x: f"{x.NAME}, {x.linked_state}", axis=1)
df_counties['CENTLAT'] = df_counties.CENTLAT.apply(float)
df_counties['CENTLON'] = df_counties.CENTLON.apply(float)

In [14]:
# Crude but effective tests
print(len(df_counties), len(df_counties.label.unique())) # Unique labels
print(len(df_counties), len(df_counties.GEOID.unique())) # Unique internal ID and FIPS code
print(len(df_counties), len(df_counties.COUNTYNS.unique())) # Unique GNIS identifier
df_counties[
    ((df_counties.CENTLAT > 90) | (df_counties.CENTLAT < -90))
    |
    ((df_counties.CENTLON > 180) | (df_counties.CENTLON < -180))
].empty # No lat/lon values out of bounds for WGS-84

3235 3235
3235 3235
3235 3235


True
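The crude checks above rely on eyeballing printed counts. A slightly more informative sketch, under the same assumptions about column names, reports which values are duplicated and which rows have coordinates out of range. `validate_counties` is a hypothetical helper, and the toy data is deliberately broken to show what a failure looks like.

```python
import pandas as pd

def validate_counties(df, unique_columns=("label", "GEOID", "COUNTYNS")):
    """Return a dict of problems found; an empty dict means every check passed."""
    problems = {}
    for column in unique_columns:
        dupes = df[df[column].duplicated(keep=False)]
        if not dupes.empty:
            problems[column] = sorted(dupes[column].unique().tolist())
    # between() is False for NaN, so missing coordinates are flagged as well
    out_of_bounds = df[~df.CENTLAT.between(-90, 90) | ~df.CENTLON.between(-180, 180)]
    if not out_of_bounds.empty:
        problems["coordinates"] = out_of_bounds.index.tolist()
    return problems

# Deliberately broken toy data: one repeated label, one impossible latitude
sample = pd.DataFrame({
    "label": ["Adams County, Colorado", "Adams County, Colorado", "Baca County, Colorado"],
    "GEOID": ["08001", "08003", "08009"],
    "COUNTYNS": ["00198116", "00198117", "00198120"],
    "CENTLAT": [39.87, 37.57, 97.3],
    "CENTLON": [-104.33, -105.78, -102.56],
})
problems = validate_counties(sample)
```

On the real dataframe, an empty result would correspond to the matching counts and the `True` shown above.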

# Already in Wikibase

We can now check to see what we might already have in our Wikibase instance for this classification. The query below looks for items of the kind we will be adding from this source, borrowing the namespaces and property lookups from our wbmaker object.

In [21]:
query_counties = """
%(namespaces)s

SELECT ?county ?countyLabel ?countyDescription
?coordinate_location ?gnis_id ?geoid ?fips_10_4
WHERE
{
    ?county wdt:%(instance_of)s wd:%(us_county)s .
    OPTIONAL { ?county wdt:%(geoid)s ?geoid . }
    OPTIONAL { ?county wdt:%(fips_10_4)s ?fips_10_4 . }
    OPTIONAL { ?county wdt:%(gnis_id)s ?gnis_id . }
    OPTIONAL { ?county wdt:%(coordinate_location)s ?coordinate_location . }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
""" % {
    'namespaces': eew.sparql_namespaces(),
    'instance_of': eew.prop_lookup['instance of'],
    'us_county': classifier['U.S. County (or equivalent)'],
    'geoid': eew.prop_lookup['TIGER GEOID'],
    'fips_10_4': eew.prop_lookup['FIPS 10-4'],
    'gnis_id': eew.prop_lookup['GNIS ID'],
    'coordinate_location': eew.prop_lookup['coordinate location']
}

df_current_counties = eew.sparql_query(
    query=query_counties,
    output='dataframe'
)

df_current_counties is None

True
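Here the query returns nothing, so every county is new. Once items do exist, the item-level question becomes which source rows are already represented. A minimal sketch of that split, assuming the query result has `county` and `geoid` columns named after the SPARQL variables and that identifiers are compared as strings, could use a left merge with an indicator column. The `split_new_and_existing` function is hypothetical, and the entity URL in the toy data is a placeholder.

```python
import pandas as pd

def split_new_and_existing(df_source, df_current, source_key="GEOID", current_key="geoid"):
    """Split source rows into those not yet in the Wikibase and those already there."""
    if df_current is None or df_current.empty:
        # Nothing recorded yet, so every source row is new
        return df_source.copy(), df_source.iloc[0:0].copy()

    current = df_current[["county", current_key]].dropna().astype({current_key: str})
    merged = df_source.astype({source_key: str}).merge(
        current,
        how="left",
        left_on=source_key,
        right_on=current_key,
        indicator=True,  # adds a _merge column: 'left_only' or 'both'
    )
    is_new = merged["_merge"] == "left_only"
    new_rows = merged[is_new].drop(columns=["county", current_key, "_merge"])
    existing_rows = merged[~is_new].drop(columns=["_merge"])
    return new_rows, existing_rows

source = pd.DataFrame({
    "GEOID": ["08001", "08003"],
    "label": ["Adams County, Colorado", "Alamosa County, Colorado"],
})
# Placeholder entity URL, not a real item
current = pd.DataFrame({"county": ["https://example.org/entity/Q1"], "geoid": ["08001"]})

new_rows, existing_rows = split_new_and_existing(source, current)
all_new, none_existing = split_new_and_existing(source, None)
```

The existing rows keep the matched entity URL, which is what a later claim-level comparison would need.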

# Processing Workflow

There are a number of ways to go about the workflow, and I keep experimenting with different approaches. When I eventually get to massive data loads, I'm going to have to try something totally different to parallelize the process, but for now a simple loop suffices. It's dumb but effective.

In [None]:
for index, row in df_counties.iterrows():
    print("PROCESSING:", row.label)
    
    item = eew.wbi.item.new()
    
    item.labels.set('en', row.label)
    item.descriptions.set('en', f"a county in {row.linked_state}")
    item.aliases.set('en', [row.NAME, row.BASENAME])
    
    ds_ref = eew.models.References()
    ds_ref.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['data source'],
            value=county_datasource_qid
        )
    )
    
    instance_of_claims = []
    for qid in classifier.values():
        instance_of_claims.append(
            eew.datatypes.Item(
                prop_nr=eew.prop_lookup['instance of'],
                value=qid,
                references=ds_ref
            )
        )
    item.claims.add(instance_of_claims)
    
    item.claims.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['U.S. state'],
            value=row.linked_state_qid,
            references=ds_ref
        )
    )
    
    for i, r in prop_mapping[~prop_mapping.wb_property.isin(['label','alias'])].iterrows():
        if r.wb_property_type == 'ExternalId':
            item.claims.add(
                eew.datatypes.ExternalID(
                    prop_nr=r.wb_pid,
                    value=row[r.source_property],
                    references=ds_ref
                )
            )
        elif r.wb_property_type == 'GlobeCoordinate':
            coord_props = r.source_property.split(',')
            item.claims.add(
                eew.datatypes.GlobeCoordinate(
                    prop_nr=r.wb_pid,
                    latitude=row[coord_props[0]],
                    longitude=row[coord_props[1]],
                    references=ds_ref
                )
            )
            
    new_item = item.write(summary="County item added from TIGER source data")
    print("NEW ITEM ADDED:", new_item.id)
