I'm working on bringing USEPA Facility Registration Service records into a Wikibase instance. Part of the classification for facilities includes code references to the North American Industry Classification System. In order to link everything up, it is useful to first ingest the NAICS as Wikibase items that can be linked to. The NAICS data tables are available by specific year from the U.S. Census Bureau.

Things are a little more complicated here than they were with the SEC Standard Industrial Classification codes, as NAICS is a 5-level hierarchical system in which the 6-digit code applies specifically to the U.S. national industry classification. Encoding this information in a knowledgebase context is also different from treating the source simply as a code lookup. We need to identify the core concepts within the classification system that will be meaningful in a larger context like the one we are working with here.

What I decided to do was treat these classification concepts in a gazetteer context: we work out the unique named industries across the different levels, classify them with multiple "instance of" claims, and record all applicable code values for linking to those items. There are some cases, then, where an industry concept like "Wholesale Trade Agents and Brokers" can be an Industry Subsector, Industry Group, Industry (multi-national), and Industry (national). I think this is workable in the knowledge graph in that a given facility or commercial company could link to one concept based on any of the codes applied to that entity in source data. When we establish the linkage, we'll record the hierarchical level of the code used as a qualifier on the linkage. For most use cases, we'll really want to home in on the broad concept anyway for some type of query, and the specific level used in the governmental classification process won't really matter.
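
Since the hierarchical level is implied by the length of a code, it can be derived at linkage time rather than stored separately. The following is a minimal sketch of that idea; the `naics_level` function and the level labels are my own illustration here (they follow the standard NAICS naming) and would need to be mapped onto whatever classes actually exist in the Wikibase.

```python
# Level names keyed by code length (illustrative labels, not Wikibase classes).
NAICS_LEVELS = {
    2: "Sector",
    3: "Subsector",
    4: "Industry Group",
    5: "NAICS Industry",
    6: "National Industry",
}


def naics_level(code: str) -> str:
    """Return the hierarchical level implied by the length of a NAICS code."""
    code = str(code).strip()
    # Combined sector codes such as "31-33" are still sector-level entries.
    if "-" in code:
        return NAICS_LEVELS[2]
    if not code.isdigit() or len(code) not in NAICS_LEVELS:
        raise ValueError(f"Not a recognizable NAICS code: {code!r}")
    return NAICS_LEVELS[len(code)]
```

A value like `naics_level("4251")` could then feed the qualifier recorded on a facility-to-industry link.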

In [66]:
import pandas as pd
import numpy as np
from wbmaker import WikibaseConnection

In [94]:
eew = WikibaseConnection('EEW')

### NAICS 2007-2022

Each of the NAICS editions stored in a flavor of Excel can be read from URLs and processed to produce usable dataframes.

In [3]:
naics_2022 = pd.read_excel(
    "https://www.census.gov/naics/2022NAICS/2-6%20digit_2022_Codes.xlsx", 
    # dtype keys must match the spreadsheet's column headers (the same ones renamed below)
    dtype={"2022 NAICS US   Code": str, "2022 NAICS US Title": str}
).rename(
    columns={
        "2022 NAICS US   Code": "code",
        "2022 NAICS US Title": "desc"
    }
)
naics_2022["edition"] = 2022

naics_2017 = pd.read_excel(
    "https://www.census.gov/naics/2017NAICS/2-6%20digit_2017_Codes.xlsx", 
    # dtype keys must match the spreadsheet's column headers (the same ones renamed below)
    dtype={"2017 NAICS US   Code": str, "2017 NAICS US Title": str}
).rename(
    columns={
        "2017 NAICS US   Code": "code",
        "2017 NAICS US Title": "desc"
    }
)
naics_2017["edition"] = 2017

naics_2012 = pd.read_excel(
    "https://www.census.gov/naics/2012NAICS/2-digit_2012_Codes.xls",
    # dtype keys must match the spreadsheet's column headers (the same ones renamed below)
    dtype={"2012 NAICS US   Code": str, "2012 NAICS US Title": str}
).rename(
    columns={
        "2012 NAICS US   Code": "code",
        "2012 NAICS US Title": "desc"
    }
)
naics_2012["edition"] = 2012

naics_2007 = pd.read_excel(
    "https://www.census.gov/naics/reference_files_tools/2007/naics07.xls",
    # dtype keys must match the spreadsheet's column headers (the same ones renamed below)
    dtype={"2007 NAICS US Code": str, "2007 NAICS US Title": str}
).rename(
    columns={
        "2007 NAICS US Code": "code",
        "2007 NAICS US Title": "desc"
    }
)
naics_2007["edition"] = 2007
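
The four reads above differ only in the year and in the whitespace inside the column headers, so the rename mapping could be worked out from the headers themselves. This is a rough sketch of that alternative, not what I actually ran; `naics_column_renames` is a hypothetical helper, and the header pattern is inferred only from the column names used in the cell above.

```python
import re


def naics_column_renames(columns):
    """Map '<year> NAICS [US] Code/Title' headers to 'code' and 'desc'."""
    renames = {}
    for col in columns:
        # Collapse runs of whitespace so 'US   Code' and 'US Code' are treated alike.
        name = re.sub(r"\s+", " ", str(col)).strip()
        match = re.fullmatch(r"\d{4} NAICS (?:US )?(Code|Title)", name)
        if match:
            renames[col] = "code" if match.group(1) == "Code" else "desc"
    return renames
```

With a dataframe `df` read from any of the spreadsheets, `df.rename(columns=naics_column_renames(df.columns))` would produce the same `code` and `desc` columns without hard-coding each header.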

### NAICS 2002

The 2002 text file needed a slight bit of cleanup and was cached locally for processing.

In [4]:
naics_2002_records = []
with open("datacache/naics_2002.txt", "r") as f:
    for line in f:
        # Each code is followed by a run of spaces; shorter codes get longer runs.
        # Check the longest run first: 6 spaces -> 2-character code, ..., 2 spaces -> 6-character code.
        for width in range(2, 7):
            if " " * (8 - width) in line:
                naics_2002_records.append({
                    "code": line[:width],
                    "desc": line[width:].strip(),
                    "edition": 2002
                })
                break
naics_2002 = pd.DataFrame(naics_2002_records)
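
The same parsing could probably be done with a single regular expression. This is only a sketch under the assumption that every record line starts with the code, followed by at least two spaces and then the title, which is the layout the cell above relies on; `parse_naics_line` and `LINE_PATTERN` are names made up for this illustration.

```python
import re

# A 2-6 digit code (optionally a sector range like "31-33"),
# at least two spaces of padding, then the title.
LINE_PATTERN = re.compile(r"^(\d{2,6}(?:-\d{2})?)\s{2,}(\S.*)$")


def parse_naics_line(line, edition=2002):
    """Return a record dict for a code/title line, or None for anything else."""
    match = LINE_PATTERN.match(line.rstrip())
    if match is None:
        return None
    code, desc = match.groups()
    return {"code": code, "desc": desc.strip(), "edition": edition}
```

Lines that do not start with a code (headers, blank lines) come back as `None` and can be filtered out before building the dataframe.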

### NAICS 1997

I had to use a separate web scraping process to assemble the 1997 edition of the NAICS codes.

In [5]:
naics_1997 = pd.read_parquet('datacache/NAICS_1997.parquet')
# Use an integer year so the edition column has the same type as the other editions
naics_1997['edition'] = 1997

## NAICS Knowledgebase Reference

For the knowledgebase representation of EPA-regulated facilities, it will be most useful to connect to the simple set of named industries. While there may be some variation in definition between editions of the NAICS, that won't matter much at a conceptual level if we group on the basic name/label of each industry.

Grouping the classification on label ("desc" as I've called it here), we get codes and the specific year in which they apply. We can encode these into Wikibase items as the code (ExternalID), which is what we'll need to link to from source records, and a qualifier indicating the year/edition of the NAICS for that code. Some codes will apply to more than one year/edition.
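
To make the grouping concrete, here is a small sketch on a handful of rows shaped like the combined table built later in this section. The `sample` frame is a tiny hand-picked subset used only for illustration.

```python
import pandas as pd

# A tiny stand-in for the combined table of all editions.
sample = pd.DataFrame([
    {"code": "11111", "desc": "Soybean Farming", "edition": 2022},
    {"code": "111110", "desc": "Soybean Farming", "edition": 2022},
    {"code": "11111", "desc": "Soybean Farming", "edition": 2017},
    {"code": "111110", "desc": "Soybean Farming", "edition": 2017},
    {"code": "111", "desc": "Crop Production", "edition": 2022},
])

# One entry per (label, code) pair, carrying every edition in which that code applies.
codes_by_industry = (
    sample.sort_values("edition")
    .groupby(["desc", "code"])["edition"]
    .agg(list)
    .to_dict()
)
```

Each key in `codes_by_industry` corresponds to one ExternalID claim on the item for that label, and each year in the value becomes a qualifier on that claim.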

In most use cases, we probably don't care all that much about the built-in hierarchy in the NAICS, but it may be useful in some cases where we want to look at broader industrial sectors.

In [6]:
naics_all = pd.concat([
    naics_2022[["code","desc","edition"]],
    naics_2017[["code","desc","edition"]],
    naics_2012[["code","desc","edition"]],
    naics_2007[["code","desc","edition"]],
    naics_2002[["code","desc","edition"]],
    naics_1997[["code","desc","edition"]]
]).dropna()

naics_all["code"] = naics_all.code.astype(str)
naics_all["desc"] = naics_all.desc.apply(lambda x: x.strip())
naics_all["code_edition"] = naics_all.apply(lambda x: f'{x.code}_{x.edition}', axis=1)

# NAICS Industry Classification Items

My process for working through and adding the classification items here is a little clunky, but it results in a set of named industries with their associated NAICS codes (2, 3, 4, 5, or 6 digit codes) that can be linked to from items representing EPA-regulated facilities. The NAICS items also have their hierarchical relationships accounted for using "has part" and "part of" claims, facilitating hierarchical analysis of higher level industry groups.

The items created have unique labels within the simplified instance of industry classification, accounting for differences across years/editions of the NAICS classification system through qualifiers indicating the years to which the NAICS code applies. This smooths over some of the differences to create a simplified classification mechanism for facilities given the fact that the EPA has not fully accounted for changes to the NAICS in their own systems. Facility registrations across the various constituent information systems do not have any kind of specific linkage to a given year/edition of the NAICS and do not fully release any internal lookup source that may have been used within a system over a period of time. I've found NAICS codes within the FRS specific to a given year going back to the initial 1997 edition. This approach does not account for any other nuances that may have occurred within the specific coded and labeled classification element, but those differences do not really matter for this type of basic classification mechanism.
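
Because facility records carry a bare code with no edition, linking comes down to a lookup from code to item. The sketch below shows one way that lookup might work once items exist; `build_code_index` is a hypothetical helper, and preferring the most recent edition when a code appears in several is my assumption, not something the EPA data dictates.

```python
def build_code_index(rows):
    """Index QIDs by NAICS code, keeping the entry from the latest edition."""
    index = {}
    for row in rows:
        edition = int(row["edition"])
        current = index.get(row["code"])
        # Assumption: when a code recurs across editions, the newest edition wins.
        if current is None or edition > current["edition"]:
            index[row["code"]] = {"qid": row["qid"], "edition": edition}
    return index


# Rows in the shape of the code/item table assembled later in this notebook.
rows = [
    {"code": "111", "qid": "Q30683", "edition": 2022},
    {"code": "11111", "qid": "Q30685", "edition": 2022},
    {"code": "928110", "qid": "Q32112", "edition": 1997},
]
code_index = build_code_index(rows)
```

A facility coded `928110` would then resolve to `code_index["928110"]["qid"]` regardless of which edition the source system had in mind.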

In [33]:
unique_industry_names = list(naics_all.desc.unique())
wb_naics = []

refs = eew.models.References()
refs.add(
    eew.datatypes.Item(
        prop_nr=eew.prop_lookup['data source'],
        value=eew.ref_lookup['North American Industry Classification System']
    )
)

for industry_name in unique_industry_names:
    item = eew.wbi.item.new()

    item.labels.set('en', industry_name)
    item.descriptions.set('en', 'A name identifying a type of industry under the North American Industry Classification System')

    claims = eew.models.Claims()

    claims.add(
        eew.datatypes.Item(
            prop_nr=eew.prop_lookup['instance of'],
            value=eew.class_lookup['industry'],
            references=refs
        )
    )

    for code, years in naics_all[naics_all.desc == industry_name][["code","edition"]].groupby("code")["edition"].agg(list).items():
        qualifiers = eew.models.Qualifiers()
        for year in years:
            dt = f"+{year}-01-01T00:00:00Z"
            qualifiers.add(
                eew.datatypes.Time(
                    prop_nr=eew.prop_lookup['point in time'],
                    time=dt
                )
            )
        claims.add(
            eew.datatypes.ExternalID(
                prop_nr=eew.prop_lookup['NAICS Code'],
                value=code,
                qualifiers=qualifiers,
                references=refs
            )
        )

    item.claims.add(claims)
    
    response = item.write(summary="Added item for an NAICS industry classification term")
    wb_naics.append({
        'desc': industry_name,
        'qid': response.id
    })
    print(industry_name, response.id)
    

Agriculture, Forestry, Fishing and Hunting Q30682
Crop Production Q30683
Oilseed and Grain Farming Q30684
Soybean Farming Q30685
Oilseed (except Soybean) Farming Q30686
Dry Pea and Bean Farming Q30687
Wheat Farming Q30688
Corn Farming Q30689
Rice Farming Q30690
Other Grain Farming Q30691
Oilseed and Grain Combination Farming Q30692
All Other Grain Farming Q30693
Vegetable and Melon Farming Q30694
Potato Farming Q30695
Other Vegetable (except Potato) and Melon Farming Q30696
Fruit and Tree Nut Farming Q30697
Orange Groves Q30698
Citrus (except Orange) Groves Q30699
Noncitrus Fruit and Tree Nut Farming Q30700
Apple Orchards Q30701
Grape Vineyards Q30702
Strawberry Farming Q30703
Berry (except Strawberry) Farming Q30704
Tree Nut Farming Q30705
Fruit and Tree Nut Combination Farming Q30706
Other Noncitrus Fruit Farming Q30707
Greenhouse, Nursery, and Floriculture Production Q30708
Food Crops Grown Under Cover Q30709
Mushroom Production Q30710
Other Food Crops Grown Under Cover Q30711
Nurse

In [80]:
wb_naics_items = pd.merge(
    left=naics_all,
    right=pd.DataFrame(wb_naics),
    how="left",
    on="desc"
)
wb_naics_items["parent_code_edition"] = wb_naics_items.apply(
    lambda x: f"{x.code[:-1]}_{x.edition}" if len(x.code) > 2 else None,
    axis=1
)

parent_lookup = wb_naics_items[["code_edition","qid","desc"]].rename(columns={
    "code_edition": "parent_code_edition",
    "qid": "parent_qid",
    "desc": "parent_desc"
}).reset_index(drop=True)

wb_naics_items_parents = pd.merge(
    left=wb_naics_items[wb_naics_items.parent_code_edition.notnull()],
    right=parent_lookup,
    how="left",
    on="parent_code_edition"
)
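
Trimming the last character works for most codes, but NAICS has three combined sectors ("31-33", "44-45", and "48-49") where a 3-digit subsector such as 311 trims to "31", which is not a code in the table. That may be one source of unmatched parents in the merge above. A hedged sketch of a parent derivation that accounts for these ranges follows; `parent_code` and `SECTOR_RANGES` are names introduced here for illustration and are not part of the process I ran.

```python
# Combined sector codes and the 2-digit prefixes they cover.
SECTOR_RANGES = {
    "31-33": range(31, 34),
    "44-45": range(44, 46),
    "48-49": range(48, 50),
}


def parent_code(code):
    """Return the code one level up the hierarchy, or None for a sector."""
    # Sector-level entries (2 digits or a combined range) have no parent.
    if "-" in code or len(code) <= 2:
        return None
    parent = code[:-1]
    if len(parent) == 2:
        # Map a 2-digit prefix onto its combined sector code where one applies.
        for label, prefixes in SECTOR_RANGES.items():
            if int(parent) in prefixes:
                return label
    return parent
```

The result could be combined with the edition in the same `f"{code}_{edition}"` form used for `parent_code_edition`.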

In [86]:
wb_naics_items_parents

| | code | desc | edition | code_edition | qid | parent_code_edition | parent_qid | parent_desc |
|---|---|---|---|---|---|---|---|---|
| 0 | 111 | Crop Production | 2022 | 111_2022 | Q30683 | 11_2022 | Q30682 | Agriculture, Forestry, Fishing and Hunting |
| 1 | 1111 | Oilseed and Grain Farming | 2022 | 1111_2022 | Q30684 | 111_2022 | Q30683 | Crop Production |
| 2 | 11111 | Soybean Farming | 2022 | 11111_2022 | Q30685 | 1111_2022 | Q30684 | Oilseed and Grain Farming |
| 3 | 111110 | Soybean Farming | 2022 | 111110_2022 | Q30685 | 11111_2022 | Q30685 | Soybean Farming |
| 4 | 11112 | Oilseed (except Soybean) Farming | 2022 | 11112_2022 | Q30686 | 1111_2022 | Q30684 | Oilseed and Grain Farming |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13411 | 9281 | National Security and International Affairs | 1997 | 9281_1997 | Q32111 | 928_1997 | Q32111 | National Security and International Affairs |
| 13412 | 92811 | National Security | 1997 | 92811_1997 | Q32112 | 9281_1997 | Q32111 | National Security and International Affairs |
| 13413 | 928110 | National Security | 1997 | 928110_1997 | Q32112 | 92811_1997 | Q32112 | National Security |
| 13414 | 92812 | International Affairs | 1997 | 92812_1997 | Q32113 | 9281_1997 | Q32111 | National Security and International Affairs |


In [103]:
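# Collect every parent QID recorded for each item across all editions
for index, row in wb_naics_items_parents[["qid","parent_qid"]].groupby("qid").agg(list).reset_index().iterrows():
    # Drop duplicates and self-references (same label at adjacent levels)
    parent_qids = [i for i in list(set(row.parent_qid)) if i != row.qid]
    if parent_qids:
        item = eew.wbi.item.get(row.qid)
        try:
            # If "part of" claims already exist, leave the item alone
            part_of_claims = item.claims.get(property=eew.prop_lookup['part of'])
        except Exception: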
            try:
                new_claims = []
                for parent_qid in parent_qids:
                    new_claims.append(
                        eew.datatypes.Item(
                            prop_nr=eew.prop_lookup['part of'],
                            value=parent_qid,
                            references=refs
                        )
                    )
                
                item.claims.add(claims=new_claims)
                item.write(summary="Added parent industry as part of claim")

                print(row.qid, parent_qids)
            except Exception as e:
                print(e)
                print(row.qid, parent_qids)


Expected str or int, found <class 'float'> (nan)
Q30861 [nan]
Expected str or int, found <class 'float'> (nan)
Q30862 [nan]
Q30863 ['Q30862']
Q30864 ['Q30863']
Q30865 ['Q30863']
Q30866 ['Q30862']
Q30867 ['Q30866']
Q30868 ['Q30867']
Q30869 ['Q30867']
Q30870 ['Q30867']
Q30871 ['Q30866']
Q30872 ['Q30871']
Q30873 ['Q30871']
Q30874 ['Q30871']
Q30875 ['Q30866']
Q30876 ['Q30862']
Q30877 ['Q30876']
Q30878 ['Q30877']
Q30879 ['Q30877']
Q30880 ['Q30876']
Q30881 ['Q30876']
Q30882 ['Q30881', 'Q30876']
Q30883 ['Q30881', 'Q30876']
Q30884 ['Q30862']
Q30885 ['Q30884']
Q30886 ['Q30885']
Q30887 ['Q30885']
Q30888 ['Q30884']
Q30889 ['Q30888']
Q30890 ['Q30888']
Q30891 ['Q30888']
Q30892 ['Q30862']
Q30893 ['Q30892']
Q30894 ['Q30893']
Q30895 ['Q30893']
Q30896 ['Q30893']
Q30897 ['Q30893']
Q30898 ['Q30892']
Q30899 ['Q30862']
Q30900 ['Q30899']
Q30901 ['Q30899']
Q30902 ['Q30899']
Q30903 ['Q30899']
Q30904 ['Q30862']
Q30905 ['Q30862']
Q30906 ['Q30905']
Q30907 ['Q30906']
Q30908 ['Q30906']
Q30909 ['Q30906']
Q30910 ['Q

In [104]:
parents_w_children = wb_naics_items_parents[["parent_qid","qid"]].groupby("parent_qid").agg(list).reset_index()
parents_w_children["qid"] = parents_w_children.qid.apply(lambda x: [i for i in list(set(x)) if isinstance(i, str)])

In [107]:
for index, row in parents_w_children.iterrows():
    item = eew.wbi.item.get(row.parent_qid)

    new_claims = eew.models.Claims()
    for child_qid in row.qid:
        if child_qid != row.parent_qid:
            new_claims.add(
                eew.datatypes.Item(
                    prop_nr=eew.prop_lookup['has part'],
                    value=child_qid,
                    references=refs
                )
            )
    item.claims.add(claims=new_claims)
    item.write(summary="Added child industries using 'has part' claims")
    print(row.parent_qid, row.qid)

Q30682 ['Q30724', 'Q30754', 'Q30760', 'Q32300', 'Q30750', 'Q30683']
Q30683 ['Q30715', 'Q30697', 'Q30684', 'Q30694', 'Q30708']
Q30684 ['Q30690', 'Q30687', 'Q30689', 'Q30685', 'Q30686', 'Q30691', 'Q30688']
Q30685 ['Q30685']
Q30686 ['Q30686']
Q30687 ['Q30687']
Q30688 ['Q30688']
Q30689 ['Q30689']
Q30690 ['Q30690']
Q30691 ['Q30692', 'Q30693']
Q30694 ['Q30695', 'Q30696', 'Q30694']
Q30697 ['Q30698', 'Q30699', 'Q30700']
Q30698 ['Q30698']
Q30699 ['Q30699']
Q30700 ['Q30701', 'Q30707', 'Q30703', 'Q30702', 'Q30704', 'Q30706', 'Q30705']
Q30708 ['Q30712', 'Q30709']
Q30709 ['Q30711', 'Q30710']
Q30712 ['Q30713', 'Q30714']
Q30715 ['Q30717', 'Q30716', 'Q30719', 'Q30720', 'Q30718']
Q30716 ['Q30716']
Q30717 ['Q30717']
Q30718 ['Q30718']
Q30719 ['Q30719']
Q30720 ['Q30722', 'Q30723', 'Q30721']
Q30724 ['Q30741', 'Q30731', 'Q30745', 'Q30738', 'Q30732', 'Q30725']
Q30725 ['Q30726', 'Q30729', 'Q30730']
Q30726 ['Q30727', 'Q30728']
Q30729 ['Q30729']
Q30730 ['Q30730']
Q30731 ['Q30731']
Q30732 ['Q30735', 'Q30736', 'Q