# Imports 

In [1]:
import pandas as pd
import numpy as np

## show the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Read in data and make sure relevant columns are string/character 

- San Diego data: `naics_code` and `account_key`
- NAICS details data: `naics` 

Run the code below. If you pulled the repository from GitHub, the relative paths should work as written; if you are working elsewhere, you may need to edit the paths passed to `read_csv`.

In [2]:
sd_df = pd.read_csv("../../public_data/sd_df.csv")
naics_df = pd.read_csv("../../public_data/naics_df.csv")

In [3]:
cols_sd_use = ["naics_code", "account_key"]
cols_naics_use = ["naics"]

sd_df[cols_sd_use] = sd_df[cols_sd_use].astype(str)
naics_df[cols_naics_use] = naics_df[cols_naics_use].astype(str)

sd_df.dtypes
naics_df.dtypes

account_key          object
dba_name             object
council_district     object
naics_code           object
naics_description    object
naics_nchar           int64
dtype: object

naics                object
naics_description    object
dtype: object

## "Inner join"- retain only San Diego businesses with details on their NAICS code

- Use the `naics_code` column in the San Diego business data as the join key
- Use the `naics` column in the NAICS code details data as the join key

- Do an inner join of the San Diego data onto the NAICS code details using these join keys
- After the inner join, print some examples of San Diego businesses lost in the merge
- Use `value_counts()` on the `naics_nchar` column in the San Diego data to see why they might have been lost
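
As a rough illustration of the mechanics (not a solution on the real data), the sketch below runs an inner join on two small made-up dataframes. The rows, the `left` and `right` names, and the placeholder descriptions are all hypothetical; only the column names mirror the `dtypes` output above.

```python
import pandas as pd

# Hypothetical toy data, NOT the real San Diego or Census files
left = pd.DataFrame({
    "account_key": ["a1", "a2", "a3", "a4"],
    "naics_code": ["541110", "54", "722511", "4451"],
})
left["naics_nchar"] = left["naics_code"].str.len()

right = pd.DataFrame({
    "naics": ["541110", "722511", "445110"],
    "naics_description": ["placeholder 1", "placeholder 2", "placeholder 3"],
})

# An inner join keeps only rows whose key appears in both dataframes
inner = left.merge(right, how="inner", left_on="naics_code", right_on="naics")

# Rows of `left` whose key never appears in `right` were dropped by the join
lost = left[~left["naics_code"].isin(right["naics"])]

# Tabulating code length among the lost rows hints at why they failed to match
lost_lengths = lost["naics_nchar"].value_counts()
```

In this toy case the two lost rows are the ones with codes shorter than six characters.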


In [None]:
# your code here

## "Left join"- retain all San Diego businesses even if the NAICS code isn't in the NAICS details data

- Using the same join keys as above, and treating the San Diego businesses as the left-hand-side data, left join the NAICS code details onto the San Diego businesses
- Use the `indicator` argument within `merge` to create an indicator column, `naics_merge_status`, to help with later merge diagnostics, then examine a sample of rows that did not merge
- Use the `suffixes` argument within `merge` to add `_sd` as the left suffix and `_census` as the right suffix
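
A minimal sketch of how `indicator` and `suffixes` behave, again on made-up rows rather than the real files (the `left`, `right`, and `merged` names and all values are hypothetical). Passing a string to `indicator` names the status column; `suffixes` only affects non-key columns that share a name on both sides, here `naics_description`.

```python
import pandas as pd

# Hypothetical toy data, NOT the real San Diego or Census files
left = pd.DataFrame({
    "naics_code": ["541110", "54"],
    "naics_description": ["sd placeholder 1", "sd placeholder 2"],
})
right = pd.DataFrame({
    "naics": ["541110"],
    "naics_description": ["census placeholder 1"],
})

merged = left.merge(
    right,
    how="left",                        # keep every row of `left`
    left_on="naics_code",
    right_on="naics",
    indicator="naics_merge_status",    # "both" or "left_only" in a left join
    suffixes=("_sd", "_census"),       # disambiguates the shared column name
)

# Rows that found no partner in `right`
unmatched = merged[merged["naics_merge_status"] == "left_only"]
```

The result has `naics_description_sd` and `naics_description_census` side by side, which makes later comparisons easier.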


In [None]:
# your code here

## Use group by and agg to see if there are differences in merge rates by area

- Using the left-joined dataframe created in the previous step, create a boolean indicator, `is_lost`, that is True when the merge indicator is equal to "left_only"
- Group by `council_district` and take the mean of `is_lost` to find the proportion of businesses in each council district that were lost in the merge (that is, rows in the left join that failed to match to `naics_df`). This works because the mean of a True/False column is the proportion of True values
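
The boolean-mean shortcut can be sketched on a small made-up table. The district labels and merge statuses below are hypothetical and chosen so the proportions are easy to check by hand; the `prop_lost` name is also just for illustration.

```python
import pandas as pd

# Hypothetical merge results, NOT the real left-joined data
merged = pd.DataFrame({
    "council_district": ["cd1", "cd1", "cd2", "cd2", "cd2", "cd2"],
    "naics_merge_status": ["both", "left_only", "both", "both", "both", "left_only"],
})

# True where the row failed to find a match on the right-hand side
merged["is_lost"] = merged["naics_merge_status"] == "left_only"

# True counts as 1 and False as 0, so the group mean is the proportion lost
prop_lost = merged.groupby("council_district").agg(prop_lost=("is_lost", "mean"))
```

Here one of two rows is lost in `cd1` and one of four in `cd2`, so the grouped means are 0.5 and 0.25.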


In [None]:
# your code here

## Optional challenge exercise: add trailing 0's and see if merge rate improves

You noticed earlier that a big reason for non-matches is that the San Diego tax certificate NAICS codes were often not 6 digits long, while the Census ones were always 6 digits.

You wonder whether trailing 0's in some of the San Diego NAICS codes got cut off (e.g., 540000 became 54) and whether adding these trailing zeros back would improve the merge rate in a left join.

- Using one of two approaches, pad the `naics_code` column in `sd_df` with 0's on the right to get that column up to 6 digits: (1) `str.pad` in pandas (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.pad.html); (2) for more of a challenge, write your own function that checks the number of digits in the NAICS code string and pads with 0's at the end up to 6 digits, then use row-wise apply, `df.apply(funcname, axis = 1)`, to execute it
- Perform the same left join as above and see how the match rate changes
- Create an indicator variable, `is_new_match`, for new matches under the padded NAICS code; compare the `naics_description` column from San Diego versus Census in the new dataset for a sample of these new matches and comment on whether the padding seems to be correct
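
Both padding approaches can be sketched on a few made-up codes. The `codes` dataframe, the `pad_row` helper, and the new column names are hypothetical; the point is only to show that `str.pad` with `side="right"` and a hand-written row-wise function give the same result.

```python
import pandas as pd

# Hypothetical codes of varying length, NOT the real sd_df
codes = pd.DataFrame({"naics_code": ["54", "4451", "541110"]})

# Approach 1: vectorized padding on the right with "0" up to width 6
codes["naics_padded"] = codes["naics_code"].str.pad(width=6, side="right", fillchar="0")

# Approach 2: hypothetical helper applied row by row
def pad_row(row, width=6):
    code = row["naics_code"]
    # Append as many zeros as needed; codes already at `width` are unchanged
    return code + "0" * max(width - len(code), 0)

codes["naics_padded_apply"] = codes.apply(pad_row, axis=1)
```

Whether the padded codes are meaningful is a separate question, which is why the exercise asks you to compare the two description columns for the new matches.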

In [None]:
# your code here