The 1997 edition of the NAICS codes is the only one not available as a download. Because some EPA references use codes found only in the 1997 edition, I put together a scraping process to pull them into the same structure as the other editions. The sector-level index table had to be cached as an HTML snippet, since it was structured in a way that made scraping difficult. The notebook reads that cached inventory, retrieves each sector page, parses the industry codes and descriptions from the h3 headings on the page, and writes the result to a parquet file in the data cache.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
with open('datacache/naics_1997_inventory.html', 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
    table = soup.find('table')

naics_1997_index = []
for index, row in enumerate(table.find_all('tr')):
    if index > 0:
        columns = row.find_all('td')
        sector = {
            "code": columns[0].find('a').text,
            "url": columns[0].find('a')['href'],
            "desc": columns[1].find('span').text
        }
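        # Fetch the sector page; stop on HTTP errors rather than parsing an error page
        r_sector = requests.get(sector['url'], timeout=30)
        r_sector.raise_for_status()
        soup_sector = BeautifulSoup(r_sector.content, 'html.parser')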
        sector["items"] = [i.text for i in soup_sector.find_all('h3')]
        naics_1997_index.append(sector)

In [3]:
naics_1997 = []
for sector in naics_1997_index:
    naics_1997.append({
        'code': sector['code'],
        'desc': sector['desc'],
        'reference': sector['url']
    })
    for ind in sector['items']:
        industry_code = ind.split()[0]
        industry_desc = ' '.join(ind.split()[1:])
        # Strip only the trailing country marker; str.replace would also
        # remove 'US' or 'CAN' from the middle of a description
        for marker in ('US', 'CAN'):
            if industry_desc.endswith(marker):
                industry_desc = industry_desc[:-len(marker)].rstrip()
                break
        naics_1997.append({
            'code': industry_code,
            'desc': industry_desc,
            'reference': sector['url']
        })

pd.DataFrame(naics_1997).to_parquet('datacache/NAICS_1997.parquet')
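
The heading parsing in the last cell can be checked in isolation, without fetching any pages. The following is a minimal sketch, assuming each h3 heading is an industry code followed by a title, with an optional US or CAN marker fused onto the end of the title. The helper name `split_industry_heading` and the sample headings are hypothetical and are not taken from the scraped pages.

```python
def split_industry_heading(heading, markers=('US', 'CAN')):
    """Split an h3 heading into (code, description).

    Assumes the heading looks like '<code> <title>' with an optional
    country marker appended to the title. Hypothetical helper.
    """
    parts = heading.split()
    if not parts:
        return None  # empty heading, nothing to parse
    code, desc = parts[0], ' '.join(parts[1:])
    for marker in markers:
        if desc.endswith(marker):
            # Slice off the suffix only; str.replace would also remove
            # the marker text from the middle of the title.
            desc = desc[:-len(marker)].rstrip()
            break
    return code, desc


# Made-up headings for illustration only
samples = [
    '0001 Example Widget FarmingUS',
    '0002 US Widget AssemblyCAN',
    '0003 Plain Title',
]
parsed = [split_industry_heading(s) for s in samples]
for code, desc in parsed:
    print(code, '|', desc)
```

The second sample shows why the suffix is sliced off rather than replaced: the leading "US" in the title survives. One caveat of this approach is that a title that genuinely ends in uppercase "US" or "CAN" would still be truncated, so it may be worth spot-checking the output against a few sector pages.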