# Standard Industrial Classification

The U.S. Securities and Exchange Commission (SEC) maintains a list of Standard Industrial Classification (SIC) codes for the industries it regulates. SIC codes are also one of the reference elements in the U.S. EPA's Facility Registry Service. To help connect these information sources, we build the SIC codes and names into the knowledgebase as reference items. When we process facility records, we link each facility to these items as its classification.

In [13]:
import os
import pandas as pd

from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.models import Qualifiers, References, Reference
from wikibaseintegrator import datatypes
from wikibaseintegrator.wbi_helpers import execute_sparql_query


In [14]:
wbi_config['MEDIAWIKI_API_URL'] = os.environ['MEDIAWIKI_API_URL']
wbi_config['SPARQL_ENDPOINT_URL'] = os.environ['SPARQL_ENDPOINT_URL']
wbi_config['WIKIBASE_URL'] = os.environ['WIKIBASE_URL']
wbi_config['USER_AGENT'] = f'EDJIBot/1.0 ({os.environ["WIKIBASE_URL"]})'

login_instance = wbi_login.Login(
    user=os.environ['BOT_NAME'],
    password=os.environ['BOT_PASS']
)

wbi = WikibaseIntegrator(login=login_instance)

## EDJI-KB properties, classification, and data sources

We set up the knowledgebase with three kinds of supporting entities: properties used to characterize items, organizing classes (subclasses below entity) used in "instance of" claims, and data sources that document provenance. We pull all three into the notebook here so that we can refer to them by name and look up their IDs (P-numbers for properties, Q-numbers for items) in the particular Wikibase instance we are operating on.

In [10]:
prop_query = """
SELECT ?property ?propertyLabel WHERE {
  ?property a wikibase:Property .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

edjikb_props = execute_sparql_query(prop_query)
edjikb_prop_lookup = {}
for x in edjikb_props['results']['bindings']:
    edjikb_prop_lookup[x['propertyLabel']['value']] = x['property']['value'].split('/')[-1]

display(edjikb_prop_lookup)

class_query = """
PREFIX wdt: <https://edji-knows.wikibase.cloud/prop/direct/>

SELECT ?class ?classLabel WHERE {
  ?class wdt:P2 [] .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

edjikb_classes = execute_sparql_query(class_query)
edjikb_class_lookup = {}
for x in edjikb_classes['results']['bindings']:
    edjikb_class_lookup[x['classLabel']['value']] = x['class']['value'].split('/')[-1]

display(edjikb_class_lookup)

datasource_query = """
PREFIX wd: <https://edji-knows.wikibase.cloud/entity/>
PREFIX wdt: <https://edji-knows.wikibase.cloud/prop/direct/>

SELECT ?ds ?dsLabel WHERE {
  ?ds wdt:P1 wd:Q4 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

edjikb_datasources = execute_sparql_query(datasource_query)
edjikb_datasource_lookup = {}
for x in edjikb_datasources['results']['bindings']:
    edjikb_datasource_lookup[x['dsLabel']['value']] = x['ds']['value'].split('/')[-1]

display(edjikb_datasource_lookup)


{'instance of': 'P1',
 'subclass of': 'P2',
 'SIC Code': 'P3',
 'reference url': 'P4',
 'data source': 'P5'}

{'spatio-temporal activity': 'Q2',
 'data source': 'Q4',
 'industrial activity': 'Q3'}

{'SEC listing of SIC codes': 'Q5'}
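
The three lookups above repeat the same loop over SPARQL result bindings. As a minimal sketch, that pattern could be folded into one helper. The name `bindings_to_lookup` and the hand-built `sample_results` are illustrative assumptions, not part of the notebook or of WikibaseIntegrator; the helper only assumes the standard SPARQL JSON result shape that the cells above already index into.

```python
def bindings_to_lookup(results, entity_var, label_var=None):
    """Map each label to the trailing ID (e.g. 'P1', 'Q4') of its entity URI."""
    # By default the label variable follows the "<var>Label" naming used above.
    label_var = label_var or f"{entity_var}Label"
    lookup = {}
    for binding in results["results"]["bindings"]:
        label = binding[label_var]["value"]
        # The ID is the last path segment of the entity URI.
        entity_id = binding[entity_var]["value"].rsplit("/", 1)[-1]
        lookup[label] = entity_id
    return lookup


# Hand-built stand-in for a SPARQL JSON response (sample data only),
# so the sketch runs without a live Wikibase.
sample_results = {
    "results": {
        "bindings": [
            {
                "property": {"value": "https://edji-knows.wikibase.cloud/entity/P1"},
                "propertyLabel": {"value": "instance of"},
            },
            {
                "property": {"value": "https://edji-knows.wikibase.cloud/entity/P3"},
                "propertyLabel": {"value": "SIC Code"},
            },
        ]
    }
}

sample_lookup = bindings_to_lookup(sample_results, "property")
print(sample_lookup)
```

With a helper like this, each lookup would reduce to one call, for example `bindings_to_lookup(execute_sparql_query(prop_query), "property")`.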

### SIC Source

The source item in the knowledgebase describes a web page with a table listing the SIC codes. This is the best source I've found that appears to be comprehensive, current, and authoritative. The page is simple enough that we can read it with pandas' `read_html` function, which returns a list of dataframes, one per HTML table found; the SIC table is the first. It's crude but effective enough.

Note: I need to do more work on the data source idea in the Wikibase design so that a source can also describe the type of process run against it. We could also point to the code used to introduce the items.

In [15]:
sec_sic_page = pd.read_html("https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list")
sec_sic_page[0]

|  | SIC Code | Office | Industry Title |
| --- | --- | --- | --- |
| 0 | 100 | Industrial Applications and Services | AGRICULTURAL PRODUCTION-CROPS |
| 1 | 200 | Industrial Applications and Services | AGRICULTURAL PROD-LIVESTOCK & ANIMAL SPECIALTIES |
| 2 | 700 | Industrial Applications and Services | AGRICULTURAL SERVICES |
| 3 | 800 | Industrial Applications and Services | FORESTRY |
| 4 | 900 | Industrial Applications and Services | FISHING, HUNTING AND TRAPPING |
| ... | ... | ... | ... |
| 439 | 8880 | Office of International Corp Fin | AMERICAN DEPOSITARY RECEIPTS |
| 440 | 8888 | Office of International Corp Fin | FOREIGN GOVERNMENTS |
| 441 | 8900 | Office of Trade & Services | SERVICES-SERVICES, NEC |
| 442 | 9721 | Office of International Corp Fin | INTERNATIONAL AFFAIRS |
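
The `SIC Code` column comes back as integers (100, 200, 700, ...), while SIC codes are conventionally written with four digits. If the facility records we link later carry zero-padded codes such as `0100`, the values would need to be normalized before matching. That is an assumption about the downstream data, not something this notebook confirms. A minimal sketch of such a normalizer, with `normalize_sic_code` as a hypothetical helper name:

```python
def normalize_sic_code(value, width=4):
    """Return a SIC code as a zero-padded string of `width` digits."""
    text = str(value).strip()
    # Reject anything that is not purely digits (e.g. '100.0' or 'N/A')
    # rather than silently writing a malformed identifier.
    if not text.isdigit():
        raise ValueError(f"not a numeric SIC code: {value!r}")
    return text.zfill(width)


# Values taken from the first and last rows of the table above.
padded = [normalize_sic_code(code) for code in (100, 900, 8880, 9721)]
print(padded)
```

Whether to store the padded or unpadded form in the knowledgebase is a design choice; what matters is that both sides of a later join use the same form.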


### Item Processing

For a few hundred items, this basic serial loop is fine. I don't really understand the "fast run mode" in the WikibaseIntegrator package, and I'm not sure it would help much here anyway. Nominally, I could parallelize this by packaging up the WBI item construction and then chunking the rows out to runners that each instantiate a WBI connection and send their data. But I don't know how much the Wikibase API can handle at once, or whether there is a better way to manage the connections. I need to research further.
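
As a rough sketch of the chunk-and-dispatch idea only, the rows could be split into fixed-size chunks and handed to a small thread pool. Everything here is an assumption for illustration: `chunked`, `write_chunk`, and `run_in_chunks` are hypothetical names, and `write_chunk` is a stand-in that just echoes the codes it receives. A real runner would create its own login and `WikibaseIntegrator` instance and write each item. How much concurrency the Wikibase API tolerates is still unknown, so the worker count is kept deliberately small.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def write_chunk(chunk):
    """Hypothetical runner: stands in for logging in and writing each item."""
    return [record["SIC Code"] for record in chunk]


def run_in_chunks(records, size=50, max_workers=2):
    """Send each chunk to a worker thread and collect results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order the chunks were submitted.
        results = pool.map(write_chunk, chunked(records, size))
        return [code for chunk_result in results for code in chunk_result]


# Sample rows copied from the table output above.
demo_records = [
    {"SIC Code": 100, "Industry Title": "AGRICULTURAL PRODUCTION-CROPS"},
    {"SIC Code": 200, "Industry Title": "AGRICULTURAL PROD-LIVESTOCK & ANIMAL SPECIALTIES"},
    {"SIC Code": 700, "Industry Title": "AGRICULTURAL SERVICES"},
    {"SIC Code": 800, "Industry Title": "FORESTRY"},
    {"SIC Code": 900, "Industry Title": "FISHING, HUNTING AND TRAPPING"},
]

written = run_in_chunks(demo_records, size=2)
print(written)
```

With the dataframe from this notebook, the records could come from `sec_sic_page[0].to_dict("records")`.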

In [19]:
sic_refs = References()
sic_ref = Reference()
sic_ref.add(
    datatypes.Item(
        prop_nr=edjikb_prop_lookup['data source'],
        value=edjikb_datasource_lookup['SEC listing of SIC codes']
    )
)
sic_refs.add(sic_ref)

for index, row in sec_sic_page[0].iterrows():
    print("PROCESSING:", row['Industry Title'])

    # Start a new, empty item for this SIC entry
    item = wbi.item.new()

    item.labels.set('en', row['Industry Title'].strip())
    item.descriptions.set('en', 'a Standard Industrial Classification type of industry')

    # Classify the item as an industrial activity, citing the SEC source
    instance_of_claim = datatypes.Item(
        prop_nr=edjikb_prop_lookup['instance of'],
        value=edjikb_class_lookup['industrial activity'],
        references=sic_refs
    )
    item.claims.add(instance_of_claim)

    # Record the SIC code itself as an identifier, with the same reference
    sic_code_claim = datatypes.ExternalID(
        prop_nr=edjikb_prop_lookup['SIC Code'],
        value=str(row['SIC Code']),
        references=sic_refs
    )
    item.claims.add(sic_code_claim)

    # Send the new item to the Wikibase
    item.write()

PROCESSING: AGRICULTURAL PRODUCTION-CROPS
PROCESSING: AGRICULTURAL PROD-LIVESTOCK & ANIMAL SPECIALTIES
PROCESSING: AGRICULTURAL SERVICES
PROCESSING: FORESTRY
PROCESSING: FISHING, HUNTING AND TRAPPING
PROCESSING: METAL MINING
PROCESSING: GOLD AND SILVER ORES
PROCESSING: MISCELLANEOUS METAL ORES
PROCESSING: BITUMINOUS COAL & LIGNITE MINING
PROCESSING: BITUMINOUS COAL & LIGNITE SURFACE MINING
PROCESSING: CRUDE PETROLEUM & NATURAL GAS
PROCESSING: DRILLING OIL & GAS WELLS
PROCESSING: OIL & GAS FIELD EXPLORATION SERVICES
PROCESSING: OIL & GAS FIELD SERVICES, NEC
PROCESSING: MINING & QUARRYING OF NONMETALLIC MINERALS (NO FUELS)
PROCESSING: GENERAL BLDG CONTRACTORS - RESIDENTIAL BLDGS
PROCESSING: OPERATIVE BUILDERS
PROCESSING: GENERAL BLDG CONTRACTORS - NONRESIDENTIAL BLDGS
PROCESSING: HEAVY CONSTRUCTION OTHER THAN BLDG CONST - CONTRACTORS
PROCESSING: WATER, SEWER, PIPELINE, COMM & POWER LINE CONSTRUCTION
PROCESSING: CONSTRUCTION - SPECIAL TRADE CONTRACTORS
PROCESSING: ELECTRICAL WORK
PROCESSI
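
Because the loop calls `wbi.item.new()` for every row, re-running the cell would attempt to create every item again. One hedged way to make a re-run safer is to fetch the SIC codes already in the knowledgebase and skip those rows. The sketch below assumes a query along the lines of `SELECT ?item ?code WHERE { ?item wdt:P3 ?code }` (with `P3` being the `SIC Code` property in this instance) has been run through `execute_sparql_query`. The helper names and the hand-built `sample_existing` response are hypothetical.

```python
def existing_sic_codes(results, code_var="code"):
    """Collect the SIC code strings already present in a SPARQL JSON response."""
    return {binding[code_var]["value"] for binding in results["results"]["bindings"]}


def rows_still_needed(rows, existing):
    """Keep only the rows whose SIC code is not already in the knowledgebase."""
    # Codes are compared as strings, matching how the loop writes them.
    return [row for row in rows if str(row["SIC Code"]) not in existing]


# Hand-built stand-in for the query response (sample data only).
sample_existing = {
    "results": {
        "bindings": [
            {"code": {"value": "100"}},
            {"code": {"value": "200"}},
        ]
    }
}

sample_rows = [
    {"SIC Code": 100, "Industry Title": "AGRICULTURAL PRODUCTION-CROPS"},
    {"SIC Code": 200, "Industry Title": "AGRICULTURAL PROD-LIVESTOCK & ANIMAL SPECIALTIES"},
    {"SIC Code": 700, "Industry Title": "AGRICULTURAL SERVICES"},
]

pending = rows_still_needed(sample_rows, existing_sic_codes(sample_existing))
print(pending)
```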
