# Analyzing NAICS codes

I would like to understand the NAICS coding used in the San Diego business data.  Some context:
  1. The original data set contains two different columns with NAICS codes, NAICS and naics_code.  I'm not sure who did the encoding.
  2. I'm making a big assumption that the codes follow the 2017 version of NAICS.
  3. This analysis starts with the non-geocoded dataframe built in transform.ipynb.  Once we understand the coding we can look at it geographically.
  4. The transform process added two columns, sector and sector_desc, based on the first two digits of NAICS.
  5. The NAICS reference data used for validation (accessed primarily via the naics module) was obtained from census.gov (2017 version).
  
  
This analysis should answer two questions:
  1. Which codes are valid 2017 codes?
  2. Can we identify an algorithm to "roll up" invalid codes to the "best" valid code?

I will stop short of implementing the algorithm.

`Starting from Daniel's email:`<br/>
Based on that table, we've identified the following categories (`sector highlighting is mine`):<br/>
Construction - code **23**, **23**3, **23**4 etc.<br/>
Personal care - **81**21<br/>
Health - **62**1 and similar<br/>
restaurants - **31**, **31**1811, **31**212 etc.<br/>
schools and instruction (including fitness and personal training) - **61**, **61**1<br/>
retail (**45**) somehow didn't make it into pivot tables, but I believe it should be there as well - small retail

So, for starters, the `sectors` **we're interested in** are: **23, 31, 45, 61, 62, and 81**.

In [1]:
%run start.py
import naics
from collections import defaultdict

Read the data and set the dtypes explicitly.  Left to itself, the read/write machinery converts the NAICS codes to ints, so the code columns are forced to str.

In [2]:
transformed_df = pd.read_csv("../data/transformed.csv", sep='\t', index_col=0, dtype={'NAICS': str, 'naics_code': str, 'sector': str})
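
To illustrate why the dtype hints matter, here is a minimal sketch using a made-up two-row sample (the sample data is hypothetical and only stands in for transformed.csv).  Without the hints pandas infers integer columns, which breaks string operations such as slicing off the first two digits.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for ../data/transformed.csv.
sample = "\tNAICS\tnaics_code\tsector\n0\t236115\t2361\t23\n1\t8121\t8121\t81\n"

# Without dtype hints pandas infers a numeric type for the code columns.
inferred = pd.read_csv(io.StringIO(sample), sep="\t", index_col=0)

# With dtype hints the codes stay strings, so prefix slicing works.
as_str = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    index_col=0,
    dtype={"NAICS": str, "naics_code": str, "sector": str},
)

print(inferred.dtypes["NAICS"])
print(as_str["NAICS"].str[:2].tolist())
```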

## Utilities

Adding functions and a lookup table (dictionary) for use later in the analysis.  Eventually these need to move into a .py module for use in other areas.

In [3]:
#TBD: move this to naics module?
def invalid_codes(pd_column):
    # count how often each code fails the naics module's validity check
    missing_codes = defaultdict(int)
    for c in pd_column:
        if not naics.valid_code(c):
            # defaultdict(int) starts missing keys at 0, so no membership test is needed
            missing_codes[c] += 1
    return missing_codes

def total_count(df):
    return len(df)

def invalid_count(invalid_dict):
    return sum(invalid_dict.values())

level_map = {2: 'Sector Level(2)',
            3: 'Subsector Level(3)',
            4: 'Industry Group Level(4)',
            5: 'Industry Level(5)',
            6: 'US Industry Level(6)'}

def naics_level_mapping(naics_str):
    return level_map[len(naics_str)]

# Moved here from lower in the (old) notebook.  Both helpers compute the same total as invalid_count, so they are probably not needed.  Revisit.
def count_per_code(code_dict):
    return sum(code_dict.values())

def sum_code_counts(count_dict):
    return sum(count_dict.values())


# taken from https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side
# attributed to jakevdp
from IPython.display import display_html, display, HTML
def display_side_by_side(dfs:list, captions:list):
    # render each frame as an inline HTML table so the tables sit next to each other
    html_str=''
    # zip directly: going through a dict would silently drop frames that share a caption
    for caption, df in zip(captions, dfs):
        html_str += df.style.set_table_attributes("style='display:inline'").set_caption(caption)._repr_html_()
        html_str += "\xa0\xa0\xa0\xa0"
    #display_html(html_str.replace('table','table style="display:inline"'),raw=True)
    display(HTML(html_str))
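
As an aside, the counting helper could also be written with `collections.Counter`.  This is only a sketch: `VALID_2017` and `valid_code` below are hypothetical stand-ins for the real `naics.valid_code` predicate, populated with a handful of codes for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the naics module: a tiny set of "valid" codes.
VALID_2017 = {"23", "236", "2361", "81", "8121"}

def valid_code(code):
    """Stand-in predicate; the notebook uses naics.valid_code instead."""
    return code in VALID_2017

def invalid_codes(codes, is_valid=valid_code):
    """Count how often each invalid code occurs in an iterable of codes."""
    return Counter(c for c in codes if not is_valid(c))

counts = invalid_codes(["236", "233", "233", "8121", "421"])
print(counts)
print(sum(counts.values()))  # total number of invalid entries
```

Passing the predicate in as an argument keeps the helper testable without the census data.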

### Levels encoded by NAICS codes

NAICS codes have different numbers of digits based on the specificity of the encoding.<br/>
I'm assuming that more digits (a more specific code) are better than fewer?<br/><br/>
Here's a list of the level names and digit counts.

In [4]:
for encoding_level in list(level_map.values()):
    print(encoding_level)

Sector Level(2)
Subsector Level(3)
Industry Group Level(4)
Industry Level(5)
US Industry Level(6)
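
Because each level is a prefix of the next, a code can be split into all of its coarser levels by slicing.  A small sketch (the `naics_prefixes` helper and `level_names` dictionary are names I made up for illustration):

```python
# Level names keyed by digit count, mirroring level_map above.
level_names = {
    2: "Sector",
    3: "Subsector",
    4: "Industry Group",
    5: "Industry",
    6: "US Industry",
}

def naics_prefixes(code):
    """Return the prefix of `code` at every level up to its own length."""
    return {level_names[n]: code[:n] for n in range(2, len(code) + 1)}

# 8121 is the Personal care code from the email above.
print(naics_prefixes("8121"))
```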


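Before I look at NAICS levels in the data set, one last check on the sector level mappings (the first two digits): verify they're the same whether derived from NAICS or from naics_code.  The sector column was built from NAICS, so it is enough to compare it with the first two digits of naics_code.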

In [5]:
len(transformed_df[transformed_df['naics_code'].apply(lambda c: c[:2]) != transformed_df['sector']]) == 0

True

So sector codes derived from both NAICS and naics_code are the same.  That is good!

## Check for valid codes

This step simply checks the codes in the business data set against the codes from census.gov.<br/>
I'm using the python utilities defined above.  They use a simple census NAICS wrapper found in src/naics.py.

### sector code validation

In [6]:
sector_dict = invalid_codes(transformed_df['sector'])

percentage = invalid_count(sector_dict) / total_count(transformed_df)
print(f"{percentage:.2%}")

0.00%


All the sector codes are valid, so we're good at that level.<br/>
Now look at the counts for each sector (note I'm using the sector_desc for clarity).<br/>

Interesting!

In [7]:
transformed_df.sector_desc.value_counts()

Professional, Scientific, and Technical Services(54)                            11358
Other Services (except Public Administration)(81)                                8159
Health Care and Social Assistance(62)                                            4170
Administrative and Support and Waste Management and Remediation Services(56)     4125
Construction(23)                                                                 4122
Retail Trade(45)                                                                 3567
Accommodation and Food Services(72)                                              2951
Retail Trade(44)                                                                 2855
Real Estate and Rental and Leasing(53)                                           2730
Wholesale Trade(42)                                                              1719
Transportation and Warehousing(48)                                               1624
Educational Services(61)                              

### First look at codes

Recall the list of sectors we're interested in: 23, 31, 45, 61, 62, and 81.

In [8]:
scalesd_biz_sectors = ['23', '31', '45', '61', '62', '81']

In [9]:
len(transformed_df.query(f"sector in {scalesd_biz_sectors}"))

21764

In [10]:
print(f"{_/len(transformed_df):.2%}")

40.55%


So 41% of the rows are in our sectors of interest.

I'm interested in "**_knowledge businesses_**" and wonder how many there are in the dataset?<br/>
`some might argue all businesses are knowledge businesses?`

In [11]:
# i.e. the NAICS sector codes I'm defining as knowledge businesses (i.e. selling your knowledge)
knowledge_biz_codes = ['51', '52', '54', '55', '56']
for naics_sector in knowledge_biz_codes:
    print(f"{naics_sector}: {naics.title(naics_sector)}")

51: Information
52: Finance and Insurance
54: Professional, Scientific, and Technical Services
55: Management of Companies and Enterprises
56: Administrative and Support and Waste Management and Remediation Services


In [12]:
knowledge_biz = len(transformed_df.query(f"sector in {knowledge_biz_codes}"))
print(f"{knowledge_biz/len(transformed_df):.2%}")

32.35%


So a reasonable percentage of the businesses (about a third) fall within my definition of knowledge-based businesses.

### NAICS validation

This step looks at the NAICS codes and uses the predicate from the naics module to determine whether each code is valid in the 2017 version of NAICS.<br/>
First let's look at the NAICS column from the original data set.

In [13]:
invalid_dict = invalid_codes(transformed_df['NAICS'])

percentage = invalid_count(invalid_dict) / total_count(transformed_df['NAICS'])
print(f"Invalid NAICS codes: {percentage:.2%}")
print(f"Valid NAICS codes: {1 - percentage:.2%}")

Invalid NAICS codes: 23.81%
Valid NAICS codes: 76.19%


In [14]:
# remember the length of the code identifies the NAICS level (see level_map above)
transformed_df['NAICS_len'] = transformed_df['NAICS'].apply(lambda v: len(v))
transformed_df['NAICS_valid'] = transformed_df['NAICS'].apply(lambda c: naics.valid_code(c))

Note that both the NAICS and naics_code columns contain codes at several different `naics levels`.<br/>
We can look at the distribution, by `naics level`, for the NAICS column.

In [15]:
transformed_df['NAICS_len'].apply(lambda v: level_map[v]).value_counts()

Industry Level(5)          25632
Industry Group Level(4)    11046
US Industry Level(6)       10013
Subsector Level(3)          6249
Sector Level(2)              735
Name: NAICS_len, dtype: int64

### naics_code validation

Similar steps for naics_code encoding.

In [16]:
invalid_dict = invalid_codes(transformed_df['naics_code'])

percentage = invalid_count(invalid_dict) / total_count(transformed_df)
print(f"Invalid naics_code codes: {percentage:.2%}")
print(f"Valid naics_code codes: {1 - percentage:.2%}")

Invalid naics_code codes: 17.20%
Valid naics_code codes: 82.80%


In [17]:
transformed_df['naics_code_len'] = transformed_df['naics_code'].apply(lambda v: len(v))
transformed_df['naics_code_valid'] = transformed_df['naics_code'].apply(lambda c: naics.valid_code(c))

transformed_df['naics_code_len'].apply(lambda v: level_map[v]).value_counts()

Industry Group Level(4)    46691
Subsector Level(3)          6249
Sector Level(2)              735
Name: naics_code_len, dtype: int64

### Results so far

Observations from the counts above:

  1. The Sector Level and Subsector Level counts are the same in both columns (which doesn't mean they are the same codes, of course).
  2. naics_code encoding only goes as deep as level 4.
  3. NAICS has encodings at all levels.
  4. The percentage of valid naics_code is a bit higher than NAICS.
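
One way to see how the levels in the two columns line up row by row is a cross-tabulation of the code lengths.  A sketch on a made-up five-row frame (the `toy` data is hypothetical; on the real data one would pass `naics_df['NAICS_len']` and `naics_df['naics_code_len']`):

```python
import pandas as pd

# Hypothetical rows: naics_code never goes deeper than four digits.
toy = pd.DataFrame({
    "NAICS":      ["23", "233", "2361", "23611", "236115"],
    "naics_code": ["23", "233", "2361", "2361",  "2361"],
})

# Rows: length of NAICS; columns: length of naics_code.
table = pd.crosstab(toy["NAICS"].str.len(), toy["naics_code"].str.len())
print(table)
```

If every level 5 and level 6 NAICS row lands in the level 4 naics_code column, that would be consistent with naics_code being a coarser version of NAICS.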

## Compare NAICS and naics_code

In [18]:
naics_df = transformed_df[['BUSINESS ACCT#', 'OWNERSHIP TYPE', 'sector', 'NAICS', 'NAICS_len', 'NAICS_valid', 'naics_code', 'naics_code_len', 'naics_code_valid']]
naics_df = naics_df.rename(columns={'OWNERSHIP TYPE': 'TYPE'})

In [19]:
naics_df.NAICS_len.value_counts()

5    25632
4    11046
6    10013
3     6249
2      735
Name: NAICS_len, dtype: int64

In [20]:
naics_df.naics_code_len.value_counts()

4    46691
3     6249
2      735
Name: naics_code_len, dtype: int64

In [21]:
n_value_counts = naics_df.NAICS_len.value_counts().sort_index(ascending=True)
n_df_value_counts = pd.DataFrame(n_value_counts)
n_df_value_counts = n_df_value_counts.reset_index()
n_df_value_counts.columns = ['NAICS level', 'biz count']

nc_value_counts = naics_df.naics_code_len.value_counts().sort_index(ascending=True)
nc_df_value_counts = pd.DataFrame(nc_value_counts)
nc_df_value_counts = nc_df_value_counts.reset_index()
nc_df_value_counts.columns = ['NAICS level', 'biz count']

In [22]:
display_side_by_side([n_df_value_counts, nc_df_value_counts], ["From NAICS", "From naics_code"])

From NAICS

| NAICS level | biz count |
|---|---|
| 2 | 735 |
| 3 | 6249 |
| 4 | 11046 |
| 5 | 25632 |
| 6 | 10013 |

From naics_code

| NAICS level | biz count |
|---|---|
| 2 | 735 |
| 3 | 6249 |
| 4 | 46691 |


There is good information in this side-by-side comparison.<br>
We can use it to dig deeper and find the most specific NAICS business level for each business in the dataset.

### NAICS Level 3 - Subsector Level

First, the counts at level 3 are the same, but are they the same rows?

In [23]:
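# == between two DataFrames raises unless their row and column labels are identical,
# so getting a result at all means both selectors picked the same rows; len() just reports how many
len(naics_df.query("NAICS_len == 3") == naics_df.query("naics_code_len == 3"))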

6249

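Yes.  pandas only compares identically labeled DataFrames, so the comparison running without an error means both selectors return the same rows.  It doesn't matter which selector we use.  I'll use NAICS.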
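
A more explicit way to ask "are these the same rows?" is to compare the row labels of the two selections.  A sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical frame with the two length columns used above.
toy = pd.DataFrame({
    "NAICS_len":      [3, 5, 3, 6],
    "naics_code_len": [3, 4, 3, 4],
})

rows_a = toy.query("NAICS_len == 3")
rows_b = toy.query("naics_code_len == 3")

# Index.equals is True only when both selections have the same labels in the same order.
same_rows = rows_a.index.equals(rows_b.index)
print(same_rows)
```

This returns a plain boolean rather than relying on an exception to signal a mismatch.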

In [24]:
valid_3_percentage = len(naics_df.query("NAICS_len == 3 and NAICS_valid")) / \
len(naics_df.query("NAICS_len == 3"))
print(f"{valid_3_percentage:.2%} of the Subsector codes are valid")

# stick with the NAICS selector here as well
invalid_subsector_codes = len(naics_df.query("NAICS_len == 3 and not NAICS_valid"))
print(f"So we have {invalid_subsector_codes} invalid codes.")

73.96% of the Subsector codes are valid
So we have 1627 invalid codes.


I'm curious which Sectors contain the invalid Subsector codes.

In [25]:
naics_df.query(f"NAICS_len == 3 and not NAICS_valid")['sector'].value_counts()

23    1188
42     370
51      69
Name: sector, dtype: int64

In [26]:
for sector_code in list(_.keys()):
    print(f"{sector_code}: {naics.title(sector_code)}")

23: Construction
42: Wholesale Trade
51: Information


One of our sectors of interest is 23, Construction.  This tells me that 1188 Construction businesses have a Subsector code that is not valid in NAICS 2017.  I'm curious about this, so I'll spot check a few codes with the [NAICS lookup](https://www.census.gov/naics/?input=1028&year=2017) at census.gov.

In [27]:
naics_df.query(f"NAICS_len == 3 and not NAICS_valid")['NAICS'][:10]

8      233
29     421
133    235
186    233
187    235
188    233
226    233
246    233
323    421
389    421
Name: NAICS, dtype: object

So, by inspection, 233, 421, and 235 are not valid 2017 Subsector codes.  Use the link above to check the 2012 and 2007 versions. <br/>

You'll see they're not valid in either of those versions `(a couple of hours later: they are valid NAICS 1997 codes)`.
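
The manual lookup could be automated by checking a code against one set of codes per NAICS version.  The sketch below is only an illustration: `SUBSECTORS_BY_VINTAGE` is a hand-typed excerpt of a few Construction and Wholesale Trade subsectors, not the full census lists, and `vintages_containing` is a hypothetical helper.  A real check would load the complete code files for each version from census.gov.

```python
# Hand-typed excerpts for illustration only; NOT the complete code lists.
SUBSECTORS_BY_VINTAGE = {
    1997: {"233", "234", "235", "421", "422"},
    2017: {"236", "237", "238", "423", "424", "425"},
}

def vintages_containing(code, by_vintage=SUBSECTORS_BY_VINTAGE):
    """Return the NAICS versions (years) whose code set contains `code`."""
    return [year for year, codes in sorted(by_vintage.items()) if code in codes]

for code in ["233", "421", "235", "236"]:
    print(code, vintages_containing(code))
```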

## Rollup - Part 1

It is straightforward to adjust the invalid Subsector codes: chop off the last digit and fall back to the two-digit Sector code, which we already know is valid (chop, chop).
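
A minimal sketch of that chopping idea, generalized to any level: drop trailing digits until the code validates.  This is one possible reading of the roll-up, not a settled algorithm; `roll_up` is a hypothetical helper and `VALID` is a tiny stand-in for the real `naics.valid_code` check.

```python
# Stand-in validity set; the notebook would use naics.valid_code instead.
VALID = {"23", "42", "51", "236", "2361"}

def roll_up(code, is_valid):
    """Drop trailing digits until `code` is valid; None if nothing validates."""
    while len(code) >= 2:
        if is_valid(code):
            return code
        code = code[:-1]  # chop the last digit and try the coarser level
    return None

for code in ["233", "421", "2361", "23619"]:
    print(code, "->", roll_up(code, VALID.__contains__))
```

Note that chopping trades specificity for validity: 233 rolls up to 23 and loses the subsector detail.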

### NAICS Level 4 - Industry Group Level

At this point the two encodings diverge.  Looking at the side-by-side value_counts we see that the naics_code count for level 4 is the same as the NAICS counts for levels 4 + 5 + 6 combined.<br/>
That makes me wonder if we've lost some information?
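
The count arithmetic can be checked directly from the value_counts above.  The second half is a sketch, on hypothetical code pairs, of how one might test whether naics_code is simply NAICS truncated to four digits; that hypothesis is my assumption and has not been confirmed against the data here.

```python
# Counts copied from the value_counts output above.
naics_counts = {2: 735, 3: 6249, 4: 11046, 5: 25632, 6: 10013}
naics_code_counts = {2: 735, 3: 6249, 4: 46691}

# Levels 4, 5 and 6 of NAICS together should match level 4 of naics_code.
deep = sum(n for level, n in naics_counts.items() if level >= 4)
print(deep, deep == naics_code_counts[4])

# Hypothetical (NAICS, naics_code) pairs to illustrate a truncation check.
pairs = [("236115", "2361"), ("81211", "8121"), ("6211", "6211")]
is_truncation = all(short == full[:4] for full, short in pairs)
print(is_truncation)
```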

In [28]:
len_4 = len(naics_df.query(f"NAICS_len == 4"))
valid_4 = len(naics_df.query(f"NAICS_len == 4 and NAICS_valid"))
valid_4_percentage = valid_4 / len_4 
print(f"Total rows: {len_4}, valid: {valid_4} ({valid_4_percentage:.2%}) of the Industry Group codes.")

invalid_industry_group_codes = len(naics_df.query(f"NAICS_len == 4 and not NAICS_valid")) 
print(f"So we have {invalid_industry_group_codes} invalid codes.")

Total rows: 11046, valid: 8911 (80.67%) of the Industry Group codes.
So we have 2135 invalid codes.


In [29]:
len_4 = len(naics_df.query(f"naics_code_len == 4"))
valid_4 = len(naics_df.query(f"naics_code_len == 4 and naics_code_valid"))
valid_4_percentage = valid_4 / len_4
print(f"Total rows: {len_4}, valid: {valid_4} ({valid_4_percentage:.2%}) of the Industry Group codes.")

invalid_industry_group_codes = len(naics_df.query(f"naics_code_len == 4 and not naics_code_valid")) 
print(f"So we have {invalid_industry_group_codes} invalid codes.")

Total rows: 46691, valid: 39085 (83.71%) of the Industry Group codes.
So we have 7606 invalid codes.


### NAICS Level 5 - Industry Level

In [30]:
# !!! really ??? why isn't this a function by now !!!
len_5 = len(naics_df.query(f"NAICS_len == 5"))
valid_5 = len(naics_df.query(f"NAICS_len == 5 and NAICS_valid"))
valid_5_percentage = valid_5 / len_5 
print(f"Total rows: {len_5}, valid: {valid_5} ({valid_5_percentage:.2%}) of the Industry Level codes.")

invalid_industry_codes = len(naics_df.query("NAICS_len == 5 and not NAICS_valid"))
print(f"So we have {invalid_industry_codes} invalid codes.")

Total rows: 25632, valid: 20742 (80.92%) of the Industry Level codes.
So we have 4890 invalid codes.


### NAICS Level 6 - US Industry Level

In [31]:
# !!! really ??? why isn't this a function by now !!!
len_6 = len(naics_df.query(f"NAICS_len == 6"))
valid_6 = len(naics_df.query(f"NAICS_len == 6 and NAICS_valid"))
valid_6_percentage = valid_6 / len_6 
print(f"Total rows: {len_6}, valid: {valid_6} ({valid_6_percentage:.2%}) of the US Industry Level codes.")

invalid_us_industry_codes = len(naics_df.query("NAICS_len == 6 and not NAICS_valid"))
print(f"So we have {invalid_us_industry_codes} invalid codes.")

Total rows: 10013, valid: 5885 (58.77%) of the US Industry Level codes.
So we have 4128 invalid codes.
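
The per-level cells above all follow the same pattern, so they could be folded into one helper.  A sketch using a groupby (the `validity_by_level` name and the `toy` frame are hypothetical; on the real data one would call it with `naics_df`, `'NAICS_len'` and `'NAICS_valid'`):

```python
import pandas as pd

def validity_by_level(df, len_col, valid_col):
    """Total, valid and invalid row counts per code length."""
    summary = df.groupby(len_col)[valid_col].agg(total="size", valid="sum")
    summary["invalid"] = summary["total"] - summary["valid"]
    summary["valid_pct"] = summary["valid"] / summary["total"]
    return summary

# Hypothetical rows standing in for naics_df.
toy = pd.DataFrame({
    "NAICS_len":   [4, 4, 5, 6, 6],
    "NAICS_valid": [True, False, True, False, False],
})

summary = validity_by_level(toy, "NAICS_len", "NAICS_valid")
print(summary)
```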


## Looking at this another way

In [32]:
total = len(naics_df.query(f"NAICS_len >= 4"))
valid = len(naics_df.query(f"NAICS_len >= 4 and NAICS_valid"))
invalid = total - valid

print(f"total: {total}")
print(f"{valid / total:.2%}")
print(f"Missing: {invalid}")

total: 46691
76.11%
Missing: 11153


In [33]:
total = len(naics_df.query(f"NAICS_len >= 4 and sector in {scalesd_biz_sectors}"))
valid = len(naics_df.query(f"NAICS_len >= 4 and NAICS_valid and sector in {scalesd_biz_sectors}"))
invalid = total - valid

print(f"total: {total}")
print(f"{valid / total:.2%} valid")
print(f"Missing: {invalid}")

total: 18737
66.87% valid
Missing: 6208


In [34]:
total = len(naics_df.query(f"NAICS_len >= 4 and sector in {knowledge_biz_codes}"))
valid = len(naics_df.query(f"NAICS_len >= 4 and NAICS_valid and sector in {knowledge_biz_codes}"))
invalid = total - valid

print(f"total: {total}")
print(f"{valid / total:.2%} valid")
print(f"Missing: {invalid}")

total: 15645
91.15% valid
Missing: 1385
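
The three cells above differ only in the sector filter, so a small helper could cover them all.  A sketch, assuming the same column names as naics_df (`validity_summary` and the `toy` frame are hypothetical):

```python
import pandas as pd

def validity_summary(df, sectors=None, min_len=4):
    """Count total, valid and invalid NAICS codes at or below a given level."""
    subset = df[df["NAICS_len"] >= min_len]
    if sectors is not None:
        # restrict to the requested two-digit sectors
        subset = subset[subset["sector"].isin(sectors)]
    total = len(subset)
    valid = int(subset["NAICS_valid"].sum())
    return {"total": total, "valid": valid, "invalid": total - valid}

# Hypothetical rows standing in for naics_df.
toy = pd.DataFrame({
    "sector":      ["23", "23", "54", "81", "54"],
    "NAICS_len":   [4, 3, 5, 6, 4],
    "NAICS_valid": [False, True, True, False, True],
})

print(validity_summary(toy))
print(validity_summary(toy, sectors=["23", "81"]))
```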
