I'm working on bringing USEPA Facility Registration Service records into a Wikibase instance. Part of the classification for facilities includes code references to the North American Industry Classification System. In order to link everything up, it is useful to first ingest the NAICS as Wikibase items that can be linked to. The NAICS data tables are available by specific year from the U.S. Census Bureau.

Things are a little more complicated here than they were with the SEC Standard Industrial Classification codes, as NAICS is a 5-level hierarchical system in which the 6-digit code applies specifically to the U.S. national industry classification. Encoding this information in a knowledge base context is also different from treating the source simply as a code lookup. We need to identify the core concepts within the classification system that will be meaningful in a larger context like the one we are working with here.
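
As a rough illustration of the hierarchy described above, here is a minimal sketch that derives the level of a code from its length and lists its shorter prefixes. The function names are hypothetical, and the handling of range-style sector codes such as `31-33` is an assumption about how those values should be treated if they appear in the source table.

```python
# Level names follow the five NAICS levels; deriving the level from the code
# length is the idea this sketch illustrates.
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "industry",
    6: "national industry",
}


def naics_level(code):
    """Return the hierarchy level name for a NAICS code, based on its length."""
    code_str = str(code).strip()
    # Assumption: a range-style sector code such as "31-33" counts as a sector.
    if "-" in code_str:
        parts = code_str.split("-")
        if len(parts) == 2 and all(p.isdigit() and len(p) == 2 for p in parts):
            return NAICS_LEVELS[2]
        raise ValueError(f"Unrecognized NAICS code: {code_str!r}")
    if not code_str.isdigit() or len(code_str) not in NAICS_LEVELS:
        raise ValueError(f"Unrecognized NAICS code: {code_str!r}")
    return NAICS_LEVELS[len(code_str)]


def naics_prefixes(code):
    """List the shorter prefixes of a plain numeric code, broadest first."""
    code_str = str(code).strip()
    return [code_str[:n] for n in range(2, len(code_str))]


print(naics_level("111191"), naics_prefixes("111191"))
```

Note that a naive `len(str(code))` would treat a value like `31-33` as five characters long, which is why the sketch checks for the range form first.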

What I decided to do was treat these classification concepts in a gazetteer context: we work out the uniquely named industries across the different levels, classify each with one or more "instance of" claims, and record all applicable code values for linking to those items. In some cases, then, a single industry concept like "Wholesale Trade Agents and Brokers" can be an Industry Subsector, Industry Group, Industry (multi-national), and Industry (national) all at once. I think this is workable in the knowledge graph because a given facility or commercial company can link to one concept based on any of the codes applied to that entity in the source data. When we establish the linkage, we'll record the hierarchical level that applies to the linkage as a qualifier. For most use cases, we'll really want to home in on the broad concept for a query anyway, and the specific level used in the governmental classification process won't matter much.
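
To make the linking idea concrete, here is a small sketch in plain Python, not tied to the Wikibase API. The record shape, the function names, and the `level` field standing in for a qualifier are all hypothetical; the titles and codes are taken from the NAICS table.

```python
# Hypothetical records shaped like the grouped NAICS table: one concept per
# unique title, carrying every code that shares that title.
concepts = [
    {"title": "Wineries", "codes": ["31213", "312130"]},
    {"title": "Crop Production", "codes": ["111"]},
]

LEVEL_BY_LENGTH = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "industry",
    6: "national industry",
}


def build_code_index(concepts):
    """Map each code to its concept title and the level implied by code length."""
    index = {}
    for concept in concepts:
        for code in concept["codes"]:
            index[code] = {
                "title": concept["title"],
                "level": LEVEL_BY_LENGTH[len(code)],
            }
    return index


def link_entity(entity_codes, index):
    """Resolve an entity's codes to concept links; unknown codes are skipped."""
    links = []
    for code in entity_codes:
        match = index.get(str(code))
        if match is not None:
            links.append(
                {"concept": match["title"], "code": str(code), "level": match["level"]}
            )
    return links


index = build_code_index(concepts)
print(link_entity([312130], index))
```

A facility coded `31213` and another coded `312130` would both land on the same "Wineries" concept, with only the recorded level differing.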

In [None]:
import os
import pandas as pd
import numpy as np

from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.models import Qualifiers, References, Reference
from wikibaseintegrator import datatypes
from wikibaseintegrator.wbi_helpers import execute_sparql_query

In [None]:
wbi_config['MEDIAWIKI_API_URL'] = os.environ['MEDIAWIKI_API_URL']
wbi_config['SPARQL_ENDPOINT_URL'] = os.environ['SPARQL_ENDPOINT_URL']
wbi_config['WIKIBASE_URL'] = os.environ['WIKIBASE_URL']
wbi_config['USER_AGENT'] = f'EDJIBot/1.0 ({os.environ["WIKIBASE_URL"]})'

login_instance = wbi_login.Login(
    user=os.environ['BOT_NAME'],
    password=os.environ['BOT_PASS']
)

wbi = WikibaseIntegrator(login=login_instance)

I took a new tack here, one I'm continuing to evolve: I used the property definitions in the Wikibase instance to hold what is needed to complete the property mapping. There are properties for each of the 5 code levels in the NAICS hierarchy, named according to the explanation in the documentation: sector, subsector, industry group, industry (multi-national), and industry (US). The length of the code indicates which level an item belongs to. Each code property relates to the Wikibase item used for classification (instance of). By recording the identifier length on the property along with the corresponding item (the latter is done regularly in Wikidata), we have everything we need from the property definitions to both establish the instance of classification and record identifiers.

A more robust and elegant way to do this would be to provide more specific structure in RegEx form for ExternalID properties. We could then use the same basic idea to configure various other aspects of knowledge encoding logic into the Wikibase driven from one or more key properties contained in source data.
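
A minimal sketch of what the RegEx idea could look like, assuming each ExternalID property carried a pattern. The property IDs are the ones used in this instance, but the patterns and the `match_property` helper are hypothetical; nothing like this is stored on the properties today.

```python
import re

# Hypothetical per-property patterns; in practice these would be read from
# the property definitions rather than hard-coded.
PROPERTY_PATTERNS = {
    "P6": r"[0-9]{2}",   # NAICS Sector Code
    "P7": r"[0-9]{3}",   # NAICS Subsector Code
    "P8": r"[0-9]{4}",   # NAICS Industry Group Code
    "P9": r"[0-9]{5}",   # NAICS Industry Code
    "P10": r"[0-9]{6}",  # NAICS National Industry Code
}

COMPILED = {pid: re.compile(pattern) for pid, pattern in PROPERTY_PATTERNS.items()}


def match_property(code):
    """Return the property ID whose pattern fully matches the code, or None."""
    code_str = str(code).strip()
    for prop_id, pattern in COMPILED.items():
        if pattern.fullmatch(code_str):
            return prop_id
    return None


print(match_property("312130"))
```

Compared with a bare length check, a pattern rejects values that happen to have the right length but the wrong shape, and the same lookup could serve other identifier schemes.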

In [None]:
prop_query = """
PREFIX wdt: <https://edji-knows.wikibase.cloud/prop/direct/>

SELECT ?property ?propertyLabel ?item_of_prop ?item_of_propLabel ?id_len WHERE {
  ?property a wikibase:Property .
  OPTIONAL {
    ?property wdt:P12 ?item_of_prop .
    ?property wdt:P13 ?id_len .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
edjikb_props = execute_sparql_query(prop_query)

prop_records = []
for result in edjikb_props['results']['bindings']:
    prop_records.append({k:v['value'] for k,v in result.items()})

df_props = pd.DataFrame(prop_records)
df_props['prop_qid'] = df_props.property.apply(lambda x: x.split('/')[-1])
df_props['instance_of_qid'] = df_props.item_of_prop.apply(lambda x: x.split('/')[-1] if isinstance(x, str) else None)
item_of_props = df_props[df_props.item_of_prop.notnull()].copy()
item_of_props = item_of_props.convert_dtypes()
display(item_of_props)

prop_lookup = df_props[["propertyLabel","prop_qid"]].set_index("propertyLabel").to_dict()["prop_qid"]
display(prop_lookup)

datasource_query = """
PREFIX wd: <https://edji-knows.wikibase.cloud/entity/>
PREFIX wdt: <https://edji-knows.wikibase.cloud/prop/direct/>

SELECT ?ds ?dsLabel WHERE {
  ?ds wdt:P1 wd:Q4 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

edjikb_datasources = execute_sparql_query(datasource_query)
datasource_lookup = {}
for x in edjikb_datasources['results']['bindings']:
    datasource_lookup[x['dsLabel']['value']] = x['ds']['value'].split('/')[-1]

display(datasource_lookup)

Unnamed: 0,property,item_of_prop,id_len,propertyLabel,item_of_propLabel,prop_qid,instance_of_qid
0,https://edji-knows.wikibase.cloud/entity/P6,https://edji-knows.wikibase.cloud/entity/Q450,2,NAICS Sector Code,NAICS Sector,P6,Q450
1,https://edji-knows.wikibase.cloud/entity/P7,https://edji-knows.wikibase.cloud/entity/Q451,3,NAICS Subsector Code,NAICS Subsector,P7,Q451
2,https://edji-knows.wikibase.cloud/entity/P8,https://edji-knows.wikibase.cloud/entity/Q452,4,NAICS Industry Group Code,NAICS Industry Group,P8,Q452
3,https://edji-knows.wikibase.cloud/entity/P9,https://edji-knows.wikibase.cloud/entity/Q453,5,NAICS Industry Code,NAICS Industry,P9,Q453
4,https://edji-knows.wikibase.cloud/entity/P10,https://edji-knows.wikibase.cloud/entity/Q454,6,NAICS National Industry Code,NAICS Industry (national),P10,Q454


{'NAICS Sector Code': 'P6',
 'NAICS Subsector Code': 'P7',
 'NAICS Industry Group Code': 'P8',
 'NAICS Industry Code': 'P9',
 'NAICS National Industry Code': 'P10',
 'instance of': 'P1',
 'subclass of': 'P2',
 'SIC Code': 'P3',
 'reference url': 'P4',
 'data source': 'P5',
 'file format': 'P11',
 'item of this property': 'P12',
 'identifier length': 'P13'}

{'SEC listing of SIC codes': 'Q5',
 'North American Industry Classification System': 'Q458'}

I tried to stick with a fairly basic transformation of the source data. The Excel file containing descriptions seemed to be the most complete source to work from. The underlying data is also provided in a couple of other ways.

I had to take care of a title issue where the values include a superscript indicator at the end of the title string indicating whether the classification is comparable across all three North American countries. I may need to come back and incorporate this later, for example with an "applicable to" (or similar) concept linking a classification to a country item. Description texts were too long and messy in many cases to be usable here, so I punted on that and am working on a formatter URL so that the external IDs link to a dynamic view on the US Census web site.
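
If the comparability indicator is needed later, one option is to keep it while cleaning the title. This is a sketch only, assuming the indicator shows up as a literal trailing capital `T` in the title string; `split_title_marker` is a hypothetical helper.

```python
def split_title_marker(title, marker="T"):
    """Split a raw title into (clean_title, has_marker).

    Assumption: the indicator is a literal trailing marker character. A title
    that legitimately ends in that character would be misread.
    """
    text = str(title).strip()
    if text.endswith(marker):
        return text[: -len(marker)].rstrip(), True
    return text, False


print(split_title_marker("Crop ProductionT"))
```

The boolean could later drive an "applicable to" style claim without having to go back to the source file.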

In [None]:
naics_desc_url = "https://www.census.gov/naics/2022NAICS/2022_NAICS_Descriptions.xlsx"

# Get data table from source
naics_desc = pd.read_excel(naics_desc_url)

# Set classification level indicator
naics_desc['class_level'] = naics_desc.Code.apply(lambda x: len(str(x)))

# Strip superscript indicator from title
naics_desc['fixed_title'] = naics_desc.Title.apply(
    lambda x: x[:-1] if str(x).endswith('T') else x
)

# Indicate if title is duplicated
dup_title = naics_desc.duplicated(['fixed_title'], keep=False)
naics_desc['dup_title'] = dup_title

# Pull out duplicated titles and group codes/class levels
dup_titles = naics_desc[naics_desc.dup_title][['fixed_title','Code','class_level']].reset_index(drop=True)
dup_title_mapping = dup_titles.groupby('fixed_title').agg(list).reset_index()

# Match non-dup title structure
non_dup_titles = naics_desc[~naics_desc.dup_title][['fixed_title','Code','class_level']].reset_index(drop=True)
non_dup_titles['Code'] = non_dup_titles.Code.apply(lambda x: [x])
non_dup_titles['class_level'] = non_dup_titles.class_level.apply(lambda x: [x])

# Concat the two dup/non-dup together
naics_records = pd.concat([
    non_dup_titles,
    dup_title_mapping  # grouped duplicates from above; dup_title_with_desc is never defined in this cell
])

In [None]:
naics_records

Unnamed: 0,fixed_title,Code,class_level,Description,reference_url
0,"Agriculture, Forestry, Fishing and Hunting",[11],[2],,https://www.census.gov/naics/?input=[11]&year=...
1,Crop Production,[111],[3],,https://www.census.gov/naics/?input=[111]&year...
2,Oilseed and Grain Farming,[1111],[4],,https://www.census.gov/naics/?input=[1111]&yea...
3,Other Grain Farming,[11119],[5],,https://www.census.gov/naics/?input=[11119]&ye...
4,Oilseed and Grain Combination Farming,[111191],[6],,https://www.census.gov/naics/?input=[111191]&y...
...,...,...,...,...,...
610,Wine and Distilled Alcoholic Beverage Merchant...,"[42482, 424820]","[5, 6]",This industry comprises establishments primari...,"https://www.census.gov/naics/?input=[42482, 42..."
611,Wineries,"[31213, 312130]","[5, 6]",This industry comprises establishments primari...,"https://www.census.gov/naics/?input=[31213, 31..."
612,Wood Container and Pallet Manufacturing,"[32192, 321920]","[5, 6]",This industry comprises establishments primari...,"https://www.census.gov/naics/?input=[32192, 32..."
613,Wood Kitchen Cabinet and Countertop Manufacturing,"[33711, 337110]","[5, 6]",This industry comprises establishments primari...,"https://www.census.gov/naics/?input=[33711, 33..."


I still need to come back to this process and work out a parallelized method. However, Wikibase.cloud instances have a delay before new items become fully functional, I believe because of some kind of indexing lag. So it doesn't really matter how quickly I push items into the system if they take a while to work fully anyway.
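
One way to cope with that lag is to poll until an item is actually usable before depending on it. The helper below is a generic sketch, not part of WikibaseIntegrator: `wait_until` and its parameters are hypothetical, and `check` stands in for whatever test confirms an item is ready (for example, a SPARQL lookup by code).

```python
import time


def wait_until(check, attempts=5, initial_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call check() until it returns True, backing off between attempts.

    Returns the number of attempts used, or raises TimeoutError.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            # Wait longer after each failed attempt.
            sleep(delay)
            delay *= factor
    raise TimeoutError(f"condition not met after {attempts} attempts")


# Stand-in check that succeeds on the third call; the sleep is replaced so
# the example runs instantly.
state = {"calls": 0}


def ready():
    state["calls"] += 1
    return state["calls"] >= 3


print(wait_until(ready, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.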

In [None]:
naics_refs = References()
naics_ref = Reference()
naics_ref.add(
    datatypes.Item(
        prop_nr=prop_lookup['data source'],
        value=datasource_lookup['North American Industry Classification System']
    )
)
naics_refs.add(naics_ref)

for index, row in naics_records.iterrows():
    print("PROCESSING:", row['fixed_title'])
    
    item = wbi.item.new()
    
    # Set label and description
    item.labels.set('en', row['fixed_title'])
    item.descriptions.set('en', 'a NAICS industry classification')

    instance_of_claims = []

    # For each code, establish instance of and external id claims
    for code in row['Code']:
        code_str = str(code)
        prop_ref = item_of_props[item_of_props.id_len == str(len(code_str))].iloc[0].to_dict()

        instance_of_claims.append(
            datatypes.Item(
                prop_nr=prop_lookup['instance of'],
                value=prop_ref['instance_of_qid'],
                references=naics_refs
            )
        )

        item.claims.add(
            datatypes.ExternalID(
                prop_nr=prop_ref['prop_qid'],
                value=code_str,
                references=naics_refs
            )
        )

    item.add_claims(instance_of_claims)

    item.write()

PROCESSING: Boat Building
PROCESSING: Motorcycle, Bicycle, and Parts Manufacturing
PROCESSING: Military Armored Vehicle, Tank, and Tank Component Manufacturing
PROCESSING: All Other Transportation Equipment Manufacturing
PROCESSING: Furniture and Related Product Manufacturing
PROCESSING: Household and Institutional Furniture and Kitchen Cabinet Manufacturing
PROCESSING: Household and Institutional Furniture Manufacturing
PROCESSING: Upholstered Household Furniture Manufacturing
PROCESSING: Nonupholstered Wood Household Furniture Manufacturing
PROCESSING: Household Furniture (except Wood and Upholstered) Manufacturing
PROCESSING: Institutional Furniture Manufacturing
PROCESSING: Wood Office Furniture Manufacturing
PROCESSING: Custom Architectural Woodwork and Millwork Manufacturing
PROCESSING: Office Furniture (except Wood) Manufacturing
PROCESSING: Showcase, Partition, Shelving, and Locker Manufacturing
PROCESSING: Other Furniture Related Product Manufacturing
PROCESSING: Miscellaneous
