This notebook works through the process of adding records for U.S. Counties (and equivalent units) to the knowledgebase. I'm starting with the U.S. Census TIGER source that I've laid out in a data source item in the knowledgebase already.

This is an interesting case in that the information in Wikidata is probably pretty good. There are items for U.S. Counties, and Wikidata contributors have gone so far as to build out crosslinks from states to their counties and from counties to their state. However, there is some level of semantic dissonance in how these items are classified, and we would have to consult a lot of information to decide whether to trust the completeness of what is in Wikidata. So, I'm going to start with a more authoritative source on which to base items, establish linkages to Wikidata via identifiers, and then decide what I can do with those relationships.

In [1]:
import os
import requests
import pandas as pd
import numpy as np

from functions import (
    sparql_query,
    kb_props,
    kb_datasources,
    valid_classes
)

from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.models import Qualifiers, References, Reference, Claims
from wikibaseintegrator import datatypes
from wikibaseintegrator.wbi_helpers import execute_sparql_query

In [2]:
wbi_config['MEDIAWIKI_API_URL'] = os.environ['MEDIAWIKI_API_URL']
wbi_config['SPARQL_ENDPOINT_URL'] = os.environ['SPARQL_ENDPOINT_URL']
wbi_config['WIKIBASE_URL'] = os.environ['WIKIBASE_URL']
wbi_config['USER_AGENT'] = f'EDJIBot/1.0 ({os.environ["WIKIBASE_URL"]})'

login_instance = wbi_login.Login(
    user=os.environ['BOT_NAME'],
    password=os.environ['BOT_PASS']
)

wbi = WikibaseIntegrator(login=login_instance)

# Foundational properties, classifiers, and data sources

Every time we run a workflow to build out some concept in the knowledgebase, we need to pull together a reference set of the fundamental properties (along with the definition information that drives how claims are built), the items that serve as classifiers (establishing "instance of" claims), and the data sources. As I work through each source several times, I'm still fiddling with the best way to document a source so that a reference from a claim to that source item gives enough detail to fully understand where the claim came from.

In [3]:
prop_item_definitions, properties = kb_props()
classes = valid_classes()
datasources = kb_datasources()

In [4]:
display(properties)
display(classes)
display(datasources)

{'NAICS Sector Code': 'P6',
 'NAICS Subsector Code': 'P7',
 'NAICS Industry Group Code': 'P8',
 'NAICS Industry Code': 'P9',
 'NAICS National Industry Code': 'P10',
 'instance of': 'P1',
 'subclass of': 'P2',
 'SIC Code': 'P3',
 'reference url': 'P4',
 'data source': 'P5',
 'file format': 'P11',
 'item of this property': 'P12',
 'identifier length': 'P13',
 'formatter URL': 'P14',
 'equivalent property': 'P15',
 'related wikidata item': 'P16',
 'caveat': 'P17',
 'ISO 3166-1 alpha-2 code': 'P18',
 'ISO 3166-1 alpha-3 code': 'P19',
 'ISO 3166-1 numeric code': 'P20',
 'ISO 3166-2 code': 'P21',
 'country': 'P22',
 'location': 'P23',
 'HTML Data Table': 'P24',
 'entity classification': 'P26',
 'FIPS 5-2 alpha code': 'P27',
 'GNIS ID': 'P28',
 'coordinate location': 'P29',
 'FIPS 5-2 numeric code': 'P30',
 'FIPS 6-4': 'P31'}

{'spatio-temporal activity': 'Q2',
 'data source': 'Q4',
 'file format': 'Q455',
 'geographic entity': 'Q2148',
 'industrial activity': 'Q3',
 'NAICS Sector': 'Q450',
 'NAICS Subsector': 'Q451',
 'NAICS Industry Group': 'Q452',
 'NAICS Industry': 'Q453',
 'NAICS Industry (national)': 'Q454',
 'artificial geographic entity': 'Q2149',
 'country': 'Q1897',
 'U.S. State': 'Q2150',
 'U.S. Territory': 'Q2158',
 'U.S. County': 'Q2206',
 'civil political division': 'Q2207'}

{'SEC listing of SIC codes': 'Q5',
 'North American Industry Classification System': 'Q458',
 'U.S. Census Bureau TIGER Data Files': 'Q2205',
 'Wikidata': 'Q2208'}

# TIGER Source

I'm starting off working against the very rudimentary but complete set of data tables for counties (and equivalent units) from the U.S. Census web site as HTML tables. On the one hand, this seems like a technologically crude source to operate against when there are other options like the ArcGIS REST services also advertised by Census. County names and identifiers are also not all that special in that they are incorporated into any number of other data services, some of which are more functionally usable than what Census puts out as part of TIGER.

However, exploring the available services left some confusion as to what I should actually use as a foundational data source. As is typical in many cases, the ArcGIS services are organized in a way that supports GIS and web mapping functionality specific to how Census operates. This means they are not really spun up with the intent of serving other use cases. They could change based on what the provider's needs are, and there are actually many "layers" that provide exactly the same information on U.S. Counties - which one do I use?

So, even though the HTML tables are crude, they are simple enough to read and parse for use. There could be issues like text encoding of extra spaces that have to be dealt with, but we can check for those easily enough.
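As a rough illustration of that kind of check, here is a minimal sketch that flags cells whose text would change once stray whitespace (including non-breaking spaces) is normalized. The function names `clean_cell` and `find_dirty_cells` are hypothetical helpers, not part of this notebook's `functions` module, and the sample values are made up rather than taken from the TIGER tables.

```python
def clean_cell(value):
    """Collapse whitespace runs (including no-break spaces) and trim the ends."""
    if not isinstance(value, str):
        return value  # leave numbers and missing values alone
    # str.split() with no arguments splits on any Unicode whitespace, U+00A0 included
    return " ".join(value.split())


def find_dirty_cells(values):
    """Return (position, original) pairs for string cells that cleaning would change."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if isinstance(v, str) and clean_cell(v) != v
    ]


# Made-up sample: one clean name, one with a no-break space, one padded, one number
sample = ["Autauga County", "Baldwin\u00a0County", " Barbour County ", 1001]
print(find_dirty_cells(sample))
```

The same function could be applied to a text column with something like `df.NAME.map(clean_cell)` if the check turns anything up.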

In [5]:
tiger_source_query = """
PREFIX wd: <https://edji-knows.wikibase.cloud/entity/>
PREFIX wdt: <https://edji-knows.wikibase.cloud/prop/direct/>

SELECT ?statement
WHERE {
  wd:Q2205 wdt:P24 ?statement.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

tiger_source_urls = sparql_query(
    endpoint=os.environ['SPARQL_ENDPOINT_URL'],
    query=tiger_source_query,
    output="dict"
)

county_source_links = [i for i in tiger_source_urls if "_county_" in i["statement"]]

In [6]:
%%time
county_sources = []

for source_link in county_source_links:
    url = source_link['statement']
    source_dfs = pd.read_html(
        url,
        converters={
            'GEOID': str, 
            'STATE': str,
            'COUNTY': str,
            'COUNTYNS': str 
        }
    )
    county_sources.append(source_dfs[0])

df_county_sources = pd.concat(county_sources)

CPU times: user 2.74 s, sys: 27.8 ms, total: 2.77 s
Wall time: 28 s


## What's Here

* NAME (combined with the state name) gives us the label, and NAME and BASENAME serve as aliases
* GEOID is the FIPS 6-4 ExternalID (new property from this source)
* COUNTYNS is the GNIS ID
* CENTLAT and CENTLON give us coordinate location
* STATE provides our link to existing U.S. State or U.S. Territory items in our knowledgebase
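
The identifier columns are read as strings because their leading zeros are significant. A county-level GEOID is the two-digit state FIPS code followed by the three-digit county FIPS code. A minimal sketch of that structure, using a hypothetical helper `split_geoid` that is not part of the notebook:

```python
def split_geoid(geoid):
    """Split a 5-character county GEOID into (state FIPS, county FIPS)."""
    if not (isinstance(geoid, str) and len(geoid) == 5 and geoid.isdigit()):
        raise ValueError(f"not a 5-digit county GEOID: {geoid!r}")
    return geoid[:2], geoid[2:]


# Autauga County, Alabama: state 01, county 001
print(split_geoid("01001"))

# Parsed as an integer, the same code loses its leading zero and no longer matches
print(str(int("01001")))
```

This is why the read step passes `str` converters for GEOID, STATE, COUNTY, and COUNTYNS instead of letting pandas infer numeric types.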

## Bring in States

Since we need to establish a linkage to state items, we can pull in the state name and abbreviation to build out additional aliases on these items. In establishing the linkage to state, I'm currently sticking with a much simpler classification of the relationship and making this a "location/located in" relationship. This is crude and may need to get more sophisticated, but I don't know what the most appropriate semantics are at this point. Wikidata uses "located in the administrative territorial entity," which makes sense, but there are other ontologies that might better inform these relationships.
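
Because the state linkage is a left join on the STATE code, any county whose code has no matching state or territory item ends up with an empty state label and no state QID. A minimal sketch of a check for that, written in plain Python so it stands alone; `unmatched_state_codes` is a hypothetical helper and the rows and QID placeholders are made up for illustration:

```python
def unmatched_state_codes(county_rows, state_rows):
    """Return the sorted STATE codes found in county_rows but not in state_rows."""
    known = {row["STATE"] for row in state_rows if row.get("STATE")}
    return sorted({row["STATE"] for row in county_rows} - known)


# Made-up rows; the "st" values are placeholders, not real item identifiers
states = [
    {"STATE": "01", "st": "Q_PLACEHOLDER_1"},
    {"STATE": "02", "st": "Q_PLACEHOLDER_2"},
]
counties = [
    {"GEOID": "01001", "STATE": "01"},
    {"GEOID": "02013", "STATE": "02"},
    {"GEOID": "72001", "STATE": "72"},
]
print(unmatched_state_codes(counties, states))
```

With pandas, passing `indicator=True` to `pd.merge` gives a similar view by adding a `_merge` column that marks rows found only in the left frame.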

In [7]:
query_kb_states_territories = """
PREFIX wd: <https://edji-knows.wikibase.cloud/entity/>
PREFIX wdt: <https://edji-knows.wikibase.cloud/prop/direct/>

SELECT ?st ?stLabel ?fips_alpha ?STATE
WHERE {
    ?st wdt:P1 ?classifier .
    VALUES ?classifier { wd:Q2150 wd:Q2158 } .
    OPTIONAL { ?st wdt:P27 ?fips_alpha . }
    OPTIONAL { ?st wdt:P30 ?STATE . }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

kb_states_territories = sparql_query(
    endpoint=os.environ['SPARQL_ENDPOINT_URL'],
    query=query_kb_states_territories,
    output="dataframe"
)

In [8]:
merged_counties_states = pd.merge(
    left=df_county_sources,
    right=kb_states_territories,
    how="left",
    on="STATE"
)

In [9]:
# Create a label that will be unique by combining county name and state name
merged_counties_states['label'] = merged_counties_states.NAME + ', ' + merged_counties_states.stLabel

## Wikidata Items with FIPS Codes

Our best way to connect "county" items in this knowledgebase to Wikidata items that might align and contain additional relevant information and linkages is through the FIPS 6-4 codes. We can query for everything in Wikidata with that property and return information to work with.

One interesting challenge with the FIPS-coded items at the "county" level is that these are not all counties. Classification of these areas below the U.S. State level gets messy. Contributors to Wikidata took the tack of building state-level classification items to some extent, with a mix of other things. Even after excluding names that end in "County," there are still 24 different type classifications. To simplify this for our purposes, I created a new classifier item for a "civil political division" and used that to classify everything with a FIPS 6-4 code. If we need to get into deeper-level classification to handle particular use cases, we can perhaps pull concepts from Wikidata or another source.

Since descriptions can be just about anything and the information coming from Wikidata for FIPS 6-4 identified items may be somewhat useful in the absence of specific classification, I opted to use those descriptions in building our items.
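
One caveat on the classification check: the query concatenates class labels with a comma and the notebook splits on the same character, so any class label that itself contains a comma would be split in two. A minimal sketch of a more defensive split, assuming the query were changed to use `separator='|'` in `GROUP_CONCAT`; `split_classes` is a hypothetical helper and the labels below are made up:

```python
SEPARATOR = "|"  # assumption: class labels do not contain a pipe character


def split_classes(concatenated, separator=SEPARATOR):
    """Split a GROUP_CONCAT string into a de-duplicated, order-preserving list."""
    if not isinstance(concatenated, str) or not concatenated:
        return []
    seen = []
    for label in concatenated.split(separator):
        label = label.strip()
        if label and label not in seen:
            seen.append(label)
    return seen


# Made-up labels: the comma inside the first label survives the split
print(split_classes("example class, with comma|county seat|county seat"))
```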

In [10]:
query_wd_fips64 = """
SELECT ?item ?itemLabel ?itemDescription ?GEOID 
(GROUP_CONCAT(?classLabel  ; separator=',') as ?classes)
WHERE {
  ?item wdt:P882 ?GEOID .
  ?item wdt:P31 ?class .
  ?class rdfs:label ?classLabel . filter (lang(?classLabel)='en')
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?item ?itemLabel ?itemDescription ?GEOID
"""

wd_fips64 = sparql_query(
    endpoint="https://query.wikidata.org/sparql",
    query=query_wd_fips64,
    output="dataframe"
)

In [11]:
# Examine "instance of" classes on Wikidata FIPS 6-4 items whose TIGER names do not end in "County"
wd_fips64['class_list'] = wd_fips64.classes.apply(lambda x: x.split(','))
non_county = merged_counties_states[~merged_counties_states.NAME.str.endswith('County')]
list(wd_fips64[wd_fips64.GEOID.isin(non_county.GEOID)][['class_list']].explode('class_list').class_list.unique())

['county seat',
 'independent city',
 'railway town',
 'parish of Louisiana',
 'borough of Alaska',
 'Metropolitan Statistical Area',
 'municipality of Puerto Rico',
 'big city',
 'city in the United States',
 'consolidated city-county',
 'state or insular area capital of the United States',
 'census area of Alaska',
 'city',
 'census-designated place in the United States',
 'island',
 'unorganized atoll of American Samoa',
 'disputed territory',
 'unincorporated territory',
 'insular area',
 'territory of the United States',
 'municipality of the Northern Mariana Islands',
 'district of American Samoa',
 'census district of the United States Virgin Islands',
 'district of the United States Virgin Islands']

In [12]:
wd_matches = wd_fips64[wd_fips64.GEOID.isin(merged_counties_states.GEOID)]
wd_matches = wd_matches.drop_duplicates(subset="GEOID", keep="first")

merged_counties_wd = pd.merge(
    left=merged_counties_states,
    right=wd_matches[["item", "itemDescription", "GEOID"]],
    how="left",
    on="GEOID"
)

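# Keep only the trailing QID from each entity URI; unmatched rows (NaN after the left joins) become None
merged_counties_wd['kb_state_qid'] = merged_counties_wd.st.apply(lambda x: x.split('/')[-1] if isinstance(x, str) else None)
merged_counties_wd['wd_rel_id'] = merged_counties_wd.item.apply(lambda x: x.split('/')[-1] if isinstance(x, str) else None)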

# Build Items

There is a little bit of a functional question here on whether we should run this kind of blending operation in building items. It might be cleaner to simply work on one individual source at a time, using it to present only what it knows individually at the point of integration into a knowledgebase. Each one of these that we run will consult what's already in the knowledgebase to decide what to do next. The next time it runs, it may be able to do a little bit more because some other process has run to bank its own information.

In this process, we've done that to some extent - linking to state records already in our knowledgebase. If those records weren't there, and this process ran first, the most notable "issue" would have been an inability to create functionally unique labels for these items (because the county FIPS source in the TIGER data files does not include state name). We also consult Wikidata on the way in with these records. That is also something that is going to potentially need to be re-run over time. The only thing we are doing at this stage with Wikidata consultation is bringing in a convenient description and nailing down the link via FIPS code so we can exploit the link to a Wikidata item with additional information about counties more readily later.
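
The "consult what's already there, then do only what's missing" pattern can be sketched in a few lines. This is a simplified stand-in for the DataFrame filtering used in this notebook, not the actual implementation; `qid_from_uri` and `records_to_create` are hypothetical helpers and the rows are made up:

```python
def qid_from_uri(uri):
    """Return the trailing identifier of an entity URI, or None for missing values."""
    if not isinstance(uri, str) or not uri:
        return None
    return uri.rstrip("/").split("/")[-1]


def records_to_create(source_rows, existing_geoids):
    """Keep only source rows whose GEOID is not yet present in the knowledgebase."""
    existing = set(existing_geoids)
    return [row for row in source_rows if row["GEOID"] not in existing]


# Made-up rows: one county already banked, one still missing
source = [{"GEOID": "01001"}, {"GEOID": "01003"}]
print(records_to_create(source, ["01001"]))
print(qid_from_uri("http://www.wikidata.org/entity/Q42"))
```

Running the same filter on a later pass means a partially completed load can be resumed without creating duplicate items.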

In [13]:
# I goofed up one thing so I needed to see what was already in the knowledgebase.
# Note: P31 here is this knowledgebase's "FIPS 6-4" property, not Wikidata's "instance of".
query_existing_fips = """
PREFIX wdt: <https://edji-knows.wikibase.cloud/prop/direct/>

SELECT ?fips_code
WHERE {
    ?item wdt:P31 ?fips_code .
}
"""

existing_fips = sparql_query(
    endpoint=os.environ['SPARQL_ENDPOINT_URL'],
    query=query_existing_fips,
    output="dataframe"
)

missing_records = merged_counties_wd[~merged_counties_wd.GEOID.isin(existing_fips.fips_code)]

In [None]:
tiger_refs = References()
ref_tiger = datatypes.Item(
    prop_nr=properties['data source'],
    value=datasources['U.S. Census Bureau TIGER Data Files']
)
tiger_refs.add(ref_tiger)

for index, row in missing_records.iterrows():
    print("PROCESSING:", row.label)
    prov_statement="Added new county record from TIGER source matched to existing state item"

    item = wbi.item.new()
    
    item.labels.set('en', row.label)
    
    # Add some additional alt names
    aliases = [
        row.NAME,
        row.BASENAME,
        f"{row.NAME}, {row.fips_alpha}"
    ]
    item.aliases.set('en', aliases)

    # Instance of classification
    item.claims.add(
        datatypes.Item(
            prop_nr=properties['instance of'],
            value=classes['civil political division'],
            references=tiger_refs
        )
    )
    
    # Location in state
    item.claims.add(
        datatypes.Item(
            prop_nr=properties['location'],
            value=row.kb_state_qid,
            references=tiger_refs
        )
    )
    
    # Coordinate location
    item.claims.add(
        datatypes.GlobeCoordinate(
            prop_nr=properties['coordinate location'],
            latitude=row.CENTLAT,
            longitude=row.CENTLON,
            references=tiger_refs
        )
    )
    
    # FIPS ID
    item.claims.add(
        datatypes.ExternalID(
            prop_nr=properties['FIPS 6-4'],
            value=row.GEOID,
            references=tiger_refs
        )
    )
    
    # GNIS ID
    item.claims.add(
        datatypes.ExternalID(
            prop_nr=properties['GNIS ID'],
            value=row.COUNTYNS.lstrip('0'),
            references=tiger_refs
        )
    )
    
    # Related wikidata item
    if isinstance(row.wd_rel_id, str):
        # Borrow description from Wikidata
        item.descriptions.set('en', row.itemDescription)
        
        wd_refs = References()
        ref_wd = datatypes.Item(
            prop_nr=properties['data source'],
            value=datasources['Wikidata']
        )
        wd_refs.add(ref_wd)
        
        item.claims.add(
            datatypes.ExternalID(
                prop_nr=properties['related wikidata item'],
                value=row.wd_rel_id,
                references=wd_refs
            )
        )
        prov_statement="Added new county record from TIGER source matched to existing state item with link to related Wikidata item"
    else:
        # Set standard description
        item.descriptions.set('en', f'a civil political division in {row.stLabel}')

    
    item.write(summary=prov_statement)
        