# NAICS Codes Tagging for Companies

Becky Wang

beiqi.wang0509@gmail.com

### Objectives

This project extracts text from HTML files and uses it to categorize companies by NAICS ([North American Industry Classification System](https://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2017)) code.

Each company will be assigned one or more NAICS codes based on publicly available information about the company and on descriptions of NAICS codes. Descriptions of NAICS subsector codes are provided in the HTML files in the zipped file. Snippets of text about 4 companies are provided as separate text files.

### Methodologies

I first scrape all HTML files under the ```lookup``` folder, focusing on three parts of the description on each HTML page. As shown below, I scrape part 1 for the NAICS subsector code and name, part 2 for the first paragraph, and part 3 for all industry groups under the subsector.
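The string slicing applied to parts 1 and 2 can be tried on its own. This is a minimal sketch using only the standard library; the two sample strings and the helper names `split_title` and `split_paragraph` are hypothetical, and the real pages may be formatted differently.

```python
# Hypothetical strings shaped like a page title (part 1) and a first
# paragraph (part 2). They are assumptions, not copied from the lookup files.
title = "Subsector 111 – Crop Production"
paragraph = ("Subsector 111: Industries in the Crop Production subsector "
             "grow crops mainly for food and fiber.")


def split_title(title):
    """Return the text after the first en dash, without surrounding spaces."""
    _, _, name = title.partition("–")
    return name.strip()


def split_paragraph(paragraph):
    """Return (code, description) split at the first colon.

    The code is taken as the three characters just before the colon.
    """
    head, _, body = paragraph.partition(":")
    return head[-3:], body.strip()


subsector_name = split_title(title)
subsector_code, description = split_paragraph(paragraph)
print(subsector_code, subsector_name)
```

Stripping the name here avoids carrying the space that follows the dash into later output.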

After collecting this information, I use a bag-of-words method to compute the similarity of each company's description to every NAICS subsector description and find the three most closely related subsectors. The method is ```term frequency-inverse document frequency```, abbreviated ```tf-idf```, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
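To make the statistic concrete, here is a small standard-library sketch of tf-idf weighting followed by cosine similarity. It is an illustration rather than the code used in this project: the three toy documents are hypothetical, tokenization is a plain whitespace split, and the idf formula is one common smoothed variant, ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


docs = [
    "fashion retailer operates clothing stores",  # hypothetical company text
    "clothing and clothing accessories stores",   # hypothetical subsector text
    "banks and credit intermediation",            # hypothetical subsector text
]
vecs = tfidf_vectors(docs)
scores = [cosine(vecs[0], v) for v in vecs[1:]]
print(scores)
```

The company text shares terms only with the first subsector text, so that score is positive while the second is zero.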

## Scrape data from HTMLs

In [1]:
# import packages
import os
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

![title](html_parsing_eg.jpg)


In [2]:
# scrape htmls to get useful information in NAICS descriptions

subsector_names = []
subsector_codes = []
descriptions = []
child_groups = []

for root, dirs, files in os.walk("lookup"):
    for file in files:
        if file.endswith("index.html"):
            # find all htmls under lookup folder
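            path = os.path.join(root, file)
            # open with a context manager so the file handle is closed after parsing
            with open(path) as html_file:
                soup = BeautifulSoup(html_file, "html.parser")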
            
            # find html tag for descriptions
            content_box = soup.find("div", attrs={"class": "entry-content"})
            
            # parse part 1 (see picture above) to get subsector name and code number
            title = content_box.find("h1").text.strip()
            index1 = title.find("–")
            subsector_name = title[index1 + 1 :]
            subsector_names.append(subsector_name)
            
            # parse part 2 (see picture above) to get the paragraph of description
            content = content_box.find("p").text.strip()
            index2 = content.find(":")        
            subsector_codes.append(content[index2 - 3: index2])            
            descriptions.append(content[index2 + 2:])
            
            # parse part 3 (see picture above) to get industry groups under the subsector
            text = content_box.get_text()
            index3 = text.find(".Sector ")
            child_groups.append(text[index3 + 1:])

In [3]:
# dataframe of all the information
table = pd.DataFrame({"subsector_code" : subsector_codes ,"subsector_name" : subsector_names,
                      "description" : descriptions, "industry_group" : child_groups})
table.head()

| | description | industry_group | subsector_code | subsector_name |
|---|---|---|---|---|
| 0 | Industries in the Crop Production subsector gr... | Sector 11: Agriculture, Forestry, Fishing and ... | 111 | Crop Production |
| 1 | Industries in the Animal Production and Aquacu... | Sector 11: Agriculture, Forestry, Fishing and ... | 112 | Animal Production and Aquaculture |
| 2 | Industries in the Forestry and Logging subsect... | Sector 11: Agriculture, Forestry, Fishing and ... | 113 | Forestry and Logging |
| 3 | Industries in the Fishing, Hunting and Trappin... | Sector 11: Agriculture, Forestry, Fishing and ... | 114 | Fishing, Hunting and Trapping |
| 4 | Industries in the Support Activities for Agric... | Sector 11: Agriculture, Forestry, Fishing and ... | 115 | Support Activities for Agriculture and Forestry |


In [4]:
# export to csv file
table.to_csv('subsector_data.csv', index = False)

## Data Cleaning

In [5]:
# define function to clean text
def clean_text(text): 
    """Lowercase the text, drop the word "industry", remove punctuation and
        digits, and keep only the stem of each word
    Args:
        text (str): text to be cleaned
    Returns:
        str: text after cleaning.
    """
    text = text.lower()
    text = text.replace("industry","")
    # raw string so that \s is read as a regex class, not a string escape
    only_words = re.sub(r'[^a-zA-Z\s]+', '', text)
    words_list = word_tokenize(only_words)
    stemmer = SnowballStemmer("english")
    cleaned_text = ' '.join(stemmer.stem(word) for word in words_list)

    
    return cleaned_text
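The character-filtering step can be checked in isolation. This sketch uses only the standard library and leaves out tokenization and stemming, so it is not a replacement for `clean_text`; the helper name `strip_non_letters` is hypothetical.

```python
import re


def strip_non_letters(text):
    """Lowercase the text and drop everything except ASCII letters and whitespace."""
    return re.sub(r"[^a-zA-Z\s]+", "", text.lower())


sample = "Sector 11: Agriculture, Forestry, Fishing and Hunting"
tokens = strip_non_letters(sample).split()
print(tokens)
```

Because removed characters are deleted rather than replaced with a space, a word containing an apostrophe or an accented letter is fused into one token, for example `that's` becomes `thats`.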

In [6]:
# clean description and industry group for each subsector
cleaned_descriptions = []
for description in descriptions:
    cleaned_descriptions.append(clean_text(description))

cleaned_child_groups = []
for child_group in child_groups:
    child_group = clean_text(child_group)
    child_group = child_group.replace(" digit", "").replace(" code", "").replace(" naic", "")
    cleaned_child_groups.append(child_group)

In [7]:
# combine cleaned description and industry group
contents = []
for i in range(len(cleaned_child_groups)):
    contents.append(cleaned_descriptions[i] + " " + cleaned_child_groups[i])
    
contents[0]

'industri in the crop product subsector grow crop main for food and fiber the subsector compris establish such as farm orchard grove greenhous and nurseri primarili engag in grow crop plant vine or tree and their seed sector agricultur forestri fish and huntingsubsector crop product group oilse and grain farm oilse except soybean farm oilse except soybean farm dri pea and bean farm dri pea and bean farm group oilse and grain farm wheat farm wheat farm corn farm corn farm rice farm rice farm other grain farm oilse and grain combin farm all other grain farm group veget and melon farm veget and melon farm potato farm other veget except potato and melon farm group fruit and tree nut farm orang grove orang grove citrus except orang grove citrus except orang grove noncitrus fruit and tree nut farm appl orchard grape vineyard strawberri farm berri except strawberri farm tree nut farm fruit and tree nut combin farm other noncitrus fruit farm group greenhous nurseri and floricultur product food

## Generate Similarity Matrix

In [10]:
def similarity_matrix(test_text):
    """Clean the test text, compute its tf-idf cosine similarity to every
        NAICS subsector description, and print the three most similar subsectors
    Args:
        test_text (str): company description to be tested
    Returns:
        None: the top three subsectors are printed, not returned.
    """
    cleaned_test = clean_text(test_text)
    comb = [cleaned_test] + contents
    vect = TfidfVectorizer(min_df=0.002, max_df = .1, stop_words='english', norm='l2')
    tfidf = vect.fit_transform(comb)
    # rows are L2-normalized, so this product holds pairwise cosine similarities
    cor = (tfidf * tfidf.T).toarray()
    # row 0 is the company text; skip its similarity to itself
    corr_df = pd.DataFrame({"corrcoef": cor[0][1:]})
    corr_df_sorted = corr_df.sort_values(by = "corrcoef", ascending=False)
    print("1: ",subsector_codes[corr_df_sorted.index[0]], "--",subsector_names[corr_df_sorted.index[0]])
    print("2: ",subsector_codes[corr_df_sorted.index[1]], "--",subsector_names[corr_df_sorted.index[1]])
    print("3: ",subsector_codes[corr_df_sorted.index[2]], "--",subsector_names[corr_df_sorted.index[2]])
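If the ranked matches are needed as data rather than printed text, the ranking step can be separated out. This is a minimal sketch assuming the similarity scores are already available as a plain list; `top_matches` is a hypothetical helper and the scores below are made up for illustration, not results from this notebook.

```python
def top_matches(scores, codes, names, k=3):
    """Return the k (code, name, score) triples with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(codes[i], names[i], scores[i]) for i in order[:k]]


# Made-up scores for four subsectors, for illustration only.
scores = [0.05, 0.31, 0.00, 0.12]
codes = ["111", "448", "521", "551"]
names = [
    "Crop Production",
    "Clothing and Clothing Accessories Stores",
    "Monetary Authorities-Central Bank",
    "Management of Companies and Enterprises",
]

for rank, (code, name, score) in enumerate(top_matches(scores, codes, names), start=1):
    print(rank, code, name, score)
```

Returning the scores alongside the codes also makes it easier to see how close the second and third matches are to the first.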

In [11]:
test1 = "Nordstrom, Inc. is a leading fashion specialty retailer based in the U.S. Founded in 1901 as a shoe store in Seattle, today Nordstrom operates 370 stores including, including 122 full-line stores in the United States, Canada and Puerto Rico; 236 Nordstrom Racks; two Jeffrey boutiques; and two clearance stores. Nordstrom also serves customers online through Nordstrom.com, Nordstromrack.com and private sale site HauteLook. The company also owns Trunk Club, a personalized clothing service that takes care of customers online at TrunkClub.com and its seven showrooms. Nordstrom, Inc.'s common stock is publicly traded on the NYSE under the symbol JWN."
print("Company:  Nordstrom")
similarity_matrix(test1)

Company:  Nordstrom
1:  448 --  Clothing and Clothing Accessories Stores
2:  551 --  Management of Companies and Enterprises
3:  814 --  Private Households


In [12]:
test2 = "Nissan Motor gets where it's going in North America through Nissan North America. With plants in the US and Mexico, Nissan North America designs, engineers, and produces such vehicles as the Xterra SUV, the Altima sedan, and the Frontier pickup. It also provides marketing, financing, distribution, and services in Canada, Guam, Mexico, Puerto Rico, and the US. It oversees sales of Nissan's luxury Infiniti brand of cars in North America. Through Nissan Forklift, the company distributes and sells Nissan's electric and gasoline-powered forklifts. Nissan North America was formed in 1990 to coordinate the company's US, Mexican, and Canadian operations. North America accounts for almost 40% of Nissan Motor's sales."
print("Company:  Nissan Motor")
similarity_matrix(test2)

Company:  Nissan Motor
1:  336 --  Transportation Equipment Manufacturing
2:  551 --  Management of Companies and Enterprises
3:  335 --  Electrical Equipment, Appliance, and Component Manufacturing


In [13]:
test3 = "The Sonos Wireless HiFi System delivers all the music on earth, in every room, with warm, full-bodied sound that's crystal clear at any volume. Sonos can fill your home with music by combining HiFi sound and rock-solid wireless in a smart system that is easy to set-up, control and expand."
print("Company:  Sonos Wireless HiFi System")
similarity_matrix(test3)

Company:  Sonos Wireless HiFi System
1:  512 --  Motion Picture and Sound Recording Industries
2:  451 --  Sporting Goods, Hobby, Musical Instrument, and Book Stores
3:  517 --  Telecommunications


In [14]:
test4 = "Bank of America is one of the world's largest financial institutions, serving individuals, small- and middle-market businesses and large corporations with a full range of banking, investing, asset management and other financial and risk management products and services. The company serves approximately 56 million U.S. consumer and small business relationships. It is among the world's leading wealth management companies and is a global leader in corporate and investment banking and trading."
print("Company:  Bank of America")
similarity_matrix(test4)

Company:  Bank of America
1:  551 --  Management of Companies and Enterprises
2:  521 --  Monetary Authorities-Central Bank
3:  523 --  Securities, Commodity Contracts, and Other Financial Investments and Related Activities
