This notebook brings in the U.S. Census Legal/Statistical Area Description (LSAD) codes and values from an HTML reference table documented as a data source. The codes and their descriptions are somewhat data-centric and specific to how the U.S. Census Bureau sees the world it operates in, but they are a necessary and potentially useful starting point for classifying different populated places for future analysis.

I'm again running into a challenge bringing these into the Wikibase (WB) instance: what are likely delays in Elasticsearch index processing mean that newly written items do not immediately show up in SPARQL queries.

In [1]:
import pandas as pd
from wbmaker import WikibaseConnection

In [2]:
eew = WikibaseConnection("EEW")

In [3]:
datasource_qid = eew.ref_lookup['Legal/Statistical Area Description Codes and Definitions']
instance_of_class = eew.class_lookup['Legal/Statistical Area Description']
lsad_property = eew.prop_lookup['LSAD']

In [4]:
ds = eew.wbi.item.get(datasource_qid)
ds_table_url = ds.claims.get('P31')[0].get_json()["mainsnak"]["datavalue"]["value"]
lsad_tables = pd.read_html(ds_table_url)
lsad_reference = lsad_tables[0].astype({
    "LSAD": "category", 
    "LSAD Description": "str", 
    "Associated Geographic Entity": "str"
})

In [10]:
query_lsad = "PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Feew-edgi.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Flsad%20%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP54%20%3Flsad.%0A%7D%0A"
wb_lsad = eew.wb_ref_data(query=query_lsad)
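
The percent-encoded query string is hard to read as written. As a reading aid, this sketch decodes it with the standard library's `urllib.parse.unquote` and then re-encodes it with `quote(..., safe="")`. The notebook does not say whether `wb_ref_data` requires the encoded form, so treat the round trip as an illustration rather than a recommended change.

```python
from urllib.parse import quote, unquote

# Same percent-encoded SPARQL string used in the cell above
query_lsad = (
    "PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Feew-edgi.wikibase.cloud%2Fprop%2Fdirect%2F%3E"
    "%0A%0ASELECT%20%3Fitem%20%3Flsad%20%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP54%20"
    "%3Flsad.%0A%7D%0A"
)

# Decode to human-readable SPARQL
readable_query = unquote(query_lsad)
print(readable_query)

# Re-encode everything except unreserved characters (letters, digits, "_.-~")
reencoded = quote(readable_query, safe="")
```

Decoded, the query selects every item that has a value for `wdt:P54`, returned as `?item` and `?lsad`.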

In [11]:
missing_lsad = lsad_reference[~lsad_reference.LSAD.isin(wb_lsad.lsad)]

### Elasticsearch Index Catch-up Challenge

I should not be seeing anything in `missing_lsad` at this point. Spot checks show that the items missing from the SPARQL results do appear on the recent changes listing for the WB instance, so they were created but are not yet queryable. This is likely related to [this issue](https://phabricator.wikimedia.org/T330796) I posted for another experimental Wikibase instance. An administrator mentioned a backlog of clogged MediaWiki jobs that hadn't been completed for that instance, which is likely the case here too. I don't have a way to resolve this myself, but I've posted [another bug report](https://phabricator.wikimedia.org/T332894).
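
One way to tell a short indexing lag apart from a stalled job queue is to re-run the check on a timer and give up after a deadline. The sketch below is a generic polling helper. `wait_until` and its parameters are hypothetical names, not part of `wbmaker`, and the `check` callable is assumed to wrap whatever re-runs the SPARQL query and reports whether anything is still missing. It does not fix a backlog on the server side.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses.

    Returns True if check() succeeded, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        # Stop if another wait would run past the deadline
        if time.monotonic() + interval > deadline:
            return False
        sleep(interval)
```

In this notebook the `check` could be a small function that repeats the query and the `isin` comparison and returns `True` once the result is empty. If it still returns `False` after a generous timeout, the problem is more likely a stuck queue than ordinary lag.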

In [12]:
missing_lsad

Unnamed: 0,LSAD,LSAD Description,Associated Geographic Entity
45,73,Primary Metropolitan Statistical Area,Primary Metropolitan Statistical Area
46,74,New England County Metropolitan Area,New England County Metropolitan Area
112,M3,Metro Division (suffix),Metropolitan Division
113,M4,Combined NECTA (suffix),Combined New England City and Town Area
114,M5,Metropolitan NECTA (suffix),New England City and Town Metropolitan and Mic...
115,M6,Micropolitan NECTA (suffix),New England City and Town Metropolitan and Mic...
116,M7,NECTA Division (suffix),New England City and Town Division
117,MB,metropolitan government (balance),"Economic Census Place, Incorporated Place"
118,MG,metropolitan government (suffix),"Consolidated City, Economic Census Place, Inco..."
119,MT,metro government (suffix),Consolidated City


In [None]:
references = eew.models.References()
references.add(
    eew.datatypes.Item(
        prop_nr=eew.prop_lookup['data source'],
        value=datasource_qid
    )
)

for index, row in missing_lsad.iterrows():
    item = eew.wbi.item.new()
    if row["LSAD Description"] == "nan":
        item.labels.set("en", "general or unknown legal/statistical area")
        item.descriptions.set("en", "general LSAD category referring to any other classification; essentially unclassified")
    else:
        item.labels.set("en", row["LSAD Description"])
        item.descriptions.set("en", row["Associated Geographic Entity"])

    item.claims.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['instance of'],
            value=instance_of_class,
            references=references
        )
    )

    item.claims.add(
        eew.datatypes.ExternalID(
            prop_nr=lsad_property,
            value=row["LSAD"],
            references=references
        )
    )

    try:
        response = item.write(summary="Added LSAD item from source HTML table")
        print("ADDED:", row["LSAD"], response.id)
    except Exception as e:
        # Report the failure instead of silently skipping the row
        print("FAILED:", row["LSAD"], e)
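
When writing many items in a loop, it can help to record which rows failed so they can be retried once the instance settles. This is a minimal sketch, assuming a hypothetical `write_one` callable that stands in for building and writing a single item. `write_all`, `write_one`, and `fake_write` are illustrative names, not part of `wbmaker` or WikibaseIntegrator.

```python
from typing import Callable, Iterable, List, Tuple


def write_all(
    codes: Iterable[str],
    write_one: Callable[[str], str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Try to write every code; return (added, failed) lists.

    added holds (code, new_id) pairs; failed holds (code, error_text) pairs.
    """
    added: List[Tuple[str, str]] = []
    failed: List[Tuple[str, str]] = []
    for code in codes:
        try:
            added.append((code, write_one(code)))
        except Exception as exc:
            # Keep the error so the row can be inspected and retried later
            failed.append((code, repr(exc)))
    return added, failed


def fake_write(code: str) -> str:
    """Stand-in writer that fails for one code to exercise the error path."""
    if code == "M4":
        raise RuntimeError("simulated API error")
    return "Q-" + code


added, failed = write_all(["73", "M4", "MT"], fake_write)
print("added:", added)
print("failed:", failed)
```

Keeping the failed codes in a list makes it straightforward to filter the source table down to just those rows for a second pass.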