# 00 - Data Collection Notebook - Development Version
**This notebook shows the working logic behind the processes engineered in the `00 - Data Collection Notebook`.**

This notebook walks through the process of scraping external data in multiple steps, arriving at a compiled dataframe that is then exported into the `raw_data` folder. The notebook then transforms the data into the format needed for modeling.

*Some of the scraping methods used in this notebook were referenced from the following lecture: https://git.generalassemb.ly/DSIR-0124/lesson-webscraping/blob/master/intro-to-web-scraping-spiders-with-scrapy.ipynb*

#### Imports

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import time
import os

### Individual State Scraper 
In this scraper, we extract one link per state from the U.S. Bureau of Labor Statistics website, found in the `url` variable below. We chose to remove the U.S. territories of Guam, Puerto Rico, and the Virgin Islands from our analysis.

Using BeautifulSoup, this scraper returns a Pandas DataFrame of the State name and associated URL.

In [2]:
url = 'https://www.bls.gov/eag'

response = requests.get(url)

html = response.text

soup = BeautifulSoup(html, 'lxml')

all_h4 = soup.find_all('h4')
all_h4

state_link_list = []

for element in all_h4:
    a_href = element.find('a')
    # Only keep headings that contain a link; otherwise an empty row would be appended
    if a_href:
        result = {}
        result['title'] = a_href.text
        result['link'] = 'https://www.bls.gov/' + a_href['href'].strip().lstrip('/')
        state_link_list.append(result)
    
state_link_list

state_links = pd.DataFrame(state_link_list)
state_links = state_links.set_index('title')
state_links = state_links.drop(['Guam', 'Puerto Rico', 'Virgin Islands'])
state_links

title,link
Alabama,https://www.bls.gov/regions/southeast/alabama....
Alaska,https://www.bls.gov/regions/west/alaska.htm#eag
Arizona,https://www.bls.gov/regions/west/arizona.htm#eag
Arkansas,https://www.bls.gov/regions/southwest/arkansas...
California,https://www.bls.gov/regions/west/california.ht...
Colorado,https://www.bls.gov/regions/mountain-plains/co...
Connecticut,https://www.bls.gov/regions/new-england/connec...
Delaware,https://www.bls.gov/regions/mid-atlantic/delaw...
District of Columbia,https://www.bls.gov/regions/mid-atlantic/distr...
Florida,https://www.bls.gov/regions/southeast/florida....
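
The string concatenation used above works for root-relative `href` values. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves a relative link against the page URL. The sketch below runs on a small inline HTML sample so that it works offline; the sample markup and the helper name `extract_state_links` are assumptions for illustration, and the real BLS page may be structured differently.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the BLS page (assumption).
SAMPLE_HTML = """
<h4><a href="/regions/southeast/alabama.htm#eag">Alabama</a></h4>
<h4>Heading without a link</h4>
<h4><a href=" /regions/west/alaska.htm#eag ">Alaska</a></h4>
"""


def extract_state_links(html, base_url):
    """Return {title: absolute_url} for every <h4> that contains a link."""
    soup = BeautifulSoup(html, 'html.parser')
    links = {}
    for heading in soup.find_all('h4'):
        anchor = heading.find('a')
        if anchor is None or not anchor.get('href'):
            continue  # skip headings that carry no link
        links[anchor.text.strip()] = urljoin(base_url, anchor['href'].strip())
    return links


sample_links = extract_state_links(SAMPLE_HTML, 'https://www.bls.gov/eag')
print(sample_links)
```

Headings without a link are skipped, so no empty rows reach the DataFrame.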


### All State Links Scraper 
In this scraper, we use BeautifulSoup to extract the necessary child links for each state in the aforementioned DataFrame, and then save them into a dictionary keyed by state name.

In [3]:
state_backdata = {}

for i in range(len(state_links)):
    state_url = state_links.iloc[i, 0]  # the 'link' column
    state_name = state_links.index[i]
    
    loop_response = requests.get(state_url)

    loop_html = loop_response.text

    loop_soup = BeautifulSoup(loop_html, 'lxml')

    loop_table = loop_soup.find('table', {'class': 'regular'})

    loop_backdata_list = []

    loop_backdata_links = []
    
    for row in loop_table.find_all('tr')[1:]:

        loop_backdata_list.append(row)

    for item in loop_backdata_list:
        item = str(item)
        if 'https' in item:
            links = re.findall(r"(https:\S+)", item)  # raw string avoids an invalid-escape warning
            loop_backdata_links.append(links[0])

    # Keep link 0 plus every second link from index 4 through 28 (at most 14 links per state)
    loop_final_list = [x for i, x in enumerate(loop_backdata_links) if i in [0,4,6,8,10,12,14,16,18,20,22,24,26,28]]
    state_backdata[state_name] = loop_final_list
    print(f'{state_name} links added to dictionary')
    time.sleep(1)

Alabama links added to dictionary
Alaska links added to dictionary
Arizona links added to dictionary
Arkansas links added to dictionary
California links added to dictionary
Colorado links added to dictionary
Connecticut links added to dictionary
Delaware links added to dictionary
District of Columbia links added to dictionary
Florida links added to dictionary
Georgia links added to dictionary
Hawaii links added to dictionary
Idaho links added to dictionary
Illinois links added to dictionary
Indiana links added to dictionary
Iowa links added to dictionary
Kansas links added to dictionary
Kentucky links added to dictionary
Louisiana links added to dictionary
Maine links added to dictionary
Maryland links added to dictionary
Massachusetts links added to dictionary
Michigan links added to dictionary
Minnesota links added to dictionary
Mississippi links added to dictionary
Missouri links added to dictionary
Montana links added to dictionary
Nebraska links added to dictionary
Nevada links ad
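
Because the regex runs over the raw HTML of each row, the stored links (shown in the next cells) end with a stray `"` and contain the escaped `&amp;`. A possible alternative, sketched here under the assumption that the links sit in ordinary `<a href="...">` tags, is to read the `href` attribute from the parsed tag. The sample row is hypothetical markup, not copied from the BLS site.

```python
from bs4 import BeautifulSoup

# Hypothetical table row resembling the BLS markup (assumption).
SAMPLE_ROW = (
    '<table class="regular"><tr><th>Series</th></tr>'
    '<tr><td><a href="https://data.bls.gov/timeseries/SMS08000000000000001'
    '?data_tool=XGtable&amp;output_view=data">Back data</a></td></tr></table>'
)

table = BeautifulSoup(SAMPLE_ROW, 'html.parser').find('table', {'class': 'regular'})

# Collect href attributes directly; the parser unescapes &amp; and
# never includes the closing quote of the attribute.
hrefs = [
    a['href']
    for row in table.find_all('tr')[1:]
    for a in row.find_all('a', href=True)
    if a['href'].startswith('https')
]
print(hrefs)
```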

In [4]:
# Testing the state_backdata dictionary for one state to ensure the correct links were passed through
state_backdata['Colorado']

['https://data.bls.gov/timeseries/LASST080000000000006?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
 'https://data.bls.gov/timeseries/SMS08000000000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
 'https://data.bls.gov/timeseries/SMS08000001000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
 'https://data.bls.gov/timeseries/SMS08000002000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
 'https://data.bls.gov/timeseries/SMS08000003000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
 'https://data.bls.gov/timeseries/SMS08000004000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
 'https://data.bls.gov/timeseries/SMS08000005000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
 'https://data.bls.gov/timeseries/SMS08000005500000001?amp%253bdata_tool=XGtable&amp;output_view=

In [5]:
# Viewing the links for the first two states
list(state_backdata.items())[:2]

[('Alabama',
  ['https://data.bls.gov/timeseries/LASST010000000000006?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
   'https://data.bls.gov/timeseries/SMS01000000000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
   'https://data.bls.gov/timeseries/SMS01000001000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
   'https://data.bls.gov/timeseries/SMS01000002000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
   'https://data.bls.gov/timeseries/SMS01000003000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
   'https://data.bls.gov/timeseries/SMS01000004000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
   'https://data.bls.gov/timeseries/SMS01000005000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true"',
   'https://data.bls.gov/timeseries/SMS01000005500000001?amp%253bdata_

#### Working on Parsing the URL
Here we test our regular expressions on sample URLs to ensure the logic will work in the scraper.

##### Index into URL for ID placement

In [11]:
('https://data.bls.gov/timeseries/SMS02000009000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true')[42:44]

'90'

##### Regex method for finding subsegment ID in URL
We found this to be a bit more resilient, in case the BLS modifies the length of the URL

In [12]:
link1 = 'https://data.bls.gov/timeseries/SMS02000009000000001?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true'
re.findall(r"\/([A-Z\d\s]+)", link1)[0][10:12]

'90'
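
The slice `[10:12]` depends on the layout of the series ID. A slightly more explicit variant, sketched below, names each part of the ID in the pattern. The layout (three-letter prefix, two-digit state code, five-digit area code, two-digit supersector code) is an assumption inferred from the sample URL above, and `supersector_code` is a hypothetical helper name.

```python
import re

# Assumed layout of the series ID, inferred from the sample URL:
# 3-letter prefix + 2-digit state + 5-digit area + 2-digit supersector code.
SERIES_ID_PATTERN = re.compile(r'/timeseries/([A-Z]{3})(\d{2})(\d{5})(\d{2})')


def supersector_code(url):
    """Return the two-digit supersector code from a timeseries URL, or None."""
    match = SERIES_ID_PATTERN.search(url)
    return match.group(4) if match else None


sample_url = 'https://data.bls.gov/timeseries/SMS02000009000000001?output_view=data'
print(supersector_code(sample_url))
```

URLs that do not follow this layout, such as the `LASST...` employment link, return `None` instead of a misleading slice.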

##### Regex method for removing non-numeric characters

In [13]:
re.findall(r'^\d+\.?\d*', '53.1(R)')[0]  # demonstrate the regex

'53.1'

##### Prove out process for accessing state employment data using one hardcoded state

In [21]:
link = 'https://data.bls.gov/timeseries/LASST010000000000006?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true'
state_name = 'Alabama'
state_employment =  {}

resp = requests.get(link)
data_name = f'./Development/development_state_employment/{state_name}_Employment'
with open(f'{data_name}.xls', 'wb') as output:
    output.write(resp.content)

df_list = pd.read_html(f'{data_name}.xls')
df_employment = pd.DataFrame(df_list[1])
# drop the trailing footnote row; footnote markers such as (R) are stripped later
df_employment = df_employment.drop(df_employment.index[-1])
df_target = df_employment[['labor force participation rate','employment-population ratio', 'labor force', 'employment', 'unemployment', 'unemployment rate']] # where to apply regex
# df_target = df_target.apply(remove_special_char)
state_employment[state_name] = df_employment # .astype(float) after regex applied later

print(f'{state_name} added to list.')

Alabama added to list.


In [22]:
# View results
state_employment['Alabama']

Unnamed: 0,Year,Period,labor force participation rate,employment-population ratio,labor force,employment,unemployment,unemployment rate
0,2012,Jan,58.3,53.4,2179750,1993523,186227,8.5
1,2012,Feb,58.2,53.3,2175635,1992089,183546,8.4
2,2012,Mar,58.1,53.2,2173644,1991104,182540,8.4
3,2012,Apr,58.1,53.2,2173494,1991037,182457,8.4
4,2012,May,58.1,53.2,2174754,1992484,182270,8.4
...,...,...,...,...,...,...,...,...
116,2021,Sep,56.3(R),54.5(R),2240353(R),2168364(R),71989(R),3.2(R)
117,2021,Oct,56.3(R),54.5(R),2241442(R),2170873(R),70569(R),3.1(R)
118,2021,Nov,56.2(R),54.5(R),2242078(R),2172390(R),69688(R),3.1(R)
119,2021,Dec,56.2(R),54.5(R),2242275(R),2172841(R),69434(R),3.1(R)


##### Struggled to apply the regex function, so we explored how `.apply` operates over a DataFrame

In [23]:
df_target.apply(lambda s: s * 2) 

Unnamed: 0,labor force participation rate,employment-population ratio,labor force,employment,unemployment,unemployment rate
0,58.358.3,53.453.4,21797502179750,19935231993523,186227186227,8.58.5
1,58.258.2,53.353.3,21756352175635,19920891992089,183546183546,8.48.4
2,58.158.1,53.253.2,21736442173644,19911041991104,182540182540,8.48.4
3,58.158.1,53.253.2,21734942173494,19910371991037,182457182457,8.48.4
4,58.158.1,53.253.2,21747542174754,19924841992484,182270182270,8.48.4
...,...,...,...,...,...,...
116,56.3(R)56.3(R),54.5(R)54.5(R),2240353(R)2240353(R),2168364(R)2168364(R),71989(R)71989(R),3.2(R)3.2(R)
117,56.3(R)56.3(R),54.5(R)54.5(R),2241442(R)2241442(R),2170873(R)2170873(R),70569(R)70569(R),3.1(R)3.1(R)
118,56.2(R)56.2(R),54.5(R)54.5(R),2242078(R)2242078(R),2172390(R)2172390(R),69688(R)69688(R),3.1(R)3.1(R)
119,56.2(R)56.2(R),54.5(R)54.5(R),2242275(R)2242275(R),2172841(R)2172841(R),69434(R)69434(R),3.1(R)3.1(R)


In [24]:
df_target.apply(lambda s: re.findall(r'^\d+\.?\d*', str(s))[0])

labor force participation rate    0
employment-population ratio       0
labor force                       0
employment                        0
unemployment                      0
unemployment rate                 0
dtype: object

##### Found that `.apply` passes whole columns to the function rather than individual values

In [25]:
df_target.iloc[:,0] * 2 

0            58.358.3
1            58.258.2
2            58.158.1
3            58.158.1
4            58.158.1
            ...      
116    56.3(R)56.3(R)
117    56.3(R)56.3(R)
118    56.2(R)56.2(R)
119    56.2(R)56.2(R)
120    56.4(P)56.4(P)
Name: labor force participation rate, Length: 121, dtype: object
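
The difference can be made visible on a toy DataFrame. This is a minimal sketch with made-up values: it records the type of object each method hands to the function.

```python
import pandas as pd

demo = pd.DataFrame({'rate': ['58.3', '56.3(R)'], 'level': ['100', '200(P)']})

# DataFrame.apply hands each column to the function as a whole Series.
seen_by_apply = demo.apply(lambda col: type(col).__name__)

# Series.map hands each cell value to the function one at a time.
seen_by_map = demo['rate'].map(lambda value: type(value).__name__)

print(seen_by_apply.to_dict())
print(seen_by_map.tolist())
```

This is why `str(s)` inside `.apply` produced the text of an entire column, and the regex matched only the leading index label `0`.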

##### Target column (Series) before regex method

In [26]:
df_target.iloc[:,0]

0         58.3
1         58.2
2         58.1
3         58.1
4         58.1
        ...   
116    56.3(R)
117    56.3(R)
118    56.2(R)
119    56.2(R)
120    56.4(P)
Name: labor force participation rate, Length: 121, dtype: object

##### Target column (Series) after regex method applied

In [27]:
df_target.iloc[:,0].map(lambda s: re.findall(r'^\d+\.?\d*', str(s))[0])

0      58.3
1      58.2
2      58.1
3      58.1
4      58.1
       ... 
116    56.3
117    56.3
118    56.2
119    56.2
120    56.4
Name: labor force participation rate, Length: 121, dtype: object

Now create a function that takes a column and applies the regex to each value, returning the converted Series:

In [28]:
def remove_special_char(column):
    return column.map(lambda s: re.findall(r'^\d+\.?\d*', str(s))[0])
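
One caveat: `re.findall(...)[0]` raises an `IndexError` for any value that does not start with a digit. A possible alternative, sketched here with made-up sample values, uses the vectorized `Series.str.extract` together with `pd.to_numeric`, so that unparseable values become `NaN`. The name `strip_footnote_markers` is hypothetical and is not the function used in the rest of this notebook.

```python
import pandas as pd


def strip_footnote_markers(column):
    """Keep the leading number of each value and convert the column to float."""
    leading_number = column.astype(str).str.extract(r'^(\d+\.?\d*)', expand=False)
    return pd.to_numeric(leading_number, errors='coerce')


sample = pd.DataFrame({
    'unemployment rate': ['8.5', '3.1(R)', '-'],
    'labor force': ['2179750', '2242275(R)', '2251000(P)'],
})
cleaned = sample.apply(strip_footnote_markers)
print(cleaned)
```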

##### Prove out process for accessing state employment data using one hardcoded state - with regex method included

In [31]:
link = 'https://data.bls.gov/timeseries/LASST010000000000006?amp%253bdata_tool=XGtable&amp;output_view=data&amp;include_graphs=true'
state_name = 'Alabama'
state_employment =  {}

resp = requests.get(link)
data_name = f'./Development/development_state_employment/{state_name}_Employment'
with open(f'{data_name}.xls', 'wb') as output:
    output.write(resp.content)

df_list = pd.read_html(f'{data_name}.xls')
df_employment = pd.DataFrame(df_list[1])
# drop the trailing footnote row; the footnote markers are stripped by remove_special_char below
df_employment = df_employment.drop(df_employment.index[-1])
df_target = df_employment[['labor force participation rate','employment-population ratio', 'labor force', 'employment', 'unemployment', 'unemployment rate']] # where to apply regex
df_target = df_target.apply(remove_special_char).astype(float)
df_employment.loc[:, df_target.columns] = df_target
state_employment[state_name] = df_employment

print(f'{state_name} added to list.')

Alabama added to list.


In [32]:
# Viewing output
df_target

Unnamed: 0,labor force participation rate,employment-population ratio,labor force,employment,unemployment,unemployment rate
0,58.3,53.4,2179750.0,1993523.0,186227.0,8.5
1,58.2,53.3,2175635.0,1992089.0,183546.0,8.4
2,58.1,53.2,2173644.0,1991104.0,182540.0,8.4
3,58.1,53.2,2173494.0,1991037.0,182457.0,8.4
4,58.1,53.2,2174754.0,1992484.0,182270.0,8.4
...,...,...,...,...,...,...
116,56.3,54.5,2240353.0,2168364.0,71989.0,3.2
117,56.3,54.5,2241442.0,2170873.0,70569.0,3.1
118,56.2,54.5,2242078.0,2172390.0,69688.0,3.1
119,56.2,54.5,2242275.0,2172841.0,69434.0,3.1


In [33]:
# Viewing output
df_employment

Unnamed: 0,Year,Period,labor force participation rate,employment-population ratio,labor force,employment,unemployment,unemployment rate
0,2012,Jan,58.3,53.4,2179750.0,1993523.0,186227.0,8.5
1,2012,Feb,58.2,53.3,2175635.0,1992089.0,183546.0,8.4
2,2012,Mar,58.1,53.2,2173644.0,1991104.0,182540.0,8.4
3,2012,Apr,58.1,53.2,2173494.0,1991037.0,182457.0,8.4
4,2012,May,58.1,53.2,2174754.0,1992484.0,182270.0,8.4
...,...,...,...,...,...,...,...,...
116,2021,Sep,56.3,54.5,2240353.0,2168364.0,71989.0,3.2
117,2021,Oct,56.3,54.5,2241442.0,2170873.0,70569.0,3.1
118,2021,Nov,56.2,54.5,2242078.0,2172390.0,69688.0,3.1
119,2021,Dec,56.2,54.5,2242275.0,2172841.0,69434.0,3.1


In [34]:
# Viewing output
state_employment

{'Alabama':      Year Period labor force participation rate employment-population ratio  \
 0    2012    Jan                           58.3                        53.4   
 1    2012    Feb                           58.2                        53.3   
 2    2012    Mar                           58.1                        53.2   
 3    2012    Apr                           58.1                        53.2   
 4    2012    May                           58.1                        53.2   
 ..    ...    ...                            ...                         ...   
 116  2021    Sep                           56.3                        54.5   
 117  2021    Oct                           56.3                        54.5   
 118  2021    Nov                           56.2                        54.5   
 119  2021    Dec                           56.2                        54.5   
 120  2022    Jan                           56.4                        54.7   
 
     labor force employment

#### Subsegment
The `subsegment` dictionary below is used to ensure that the scraper classifies and stores files by industry. Using regular expressions, the scraper parses each URL and looks up the extracted ID in this dictionary.

The source of the subsegment dictionary can be found here: https://download.bls.gov/pub/time.series/sm/sm.supersector

In [35]:
subsegment = {'00':'Total Nonfarm', '05':'Total Private', '06':'Goods Producing',
              '07':'Service-Providing', '08':'Private Service Providing', '10':'Mining and Logging',
              '15':'Mining, Logging and Construction', '20':'Construction', '30':'Manufacturing', 
              '31':'Durable Goods', '32':'Non-Durable Goods', '40':'Trade, Transportation, and Utils', 
              '41':'Wholesale Trade', '42':'Retail Trade', '43':'Transportation and Utils', 
              '50':'Information', '55':'Financial Activities', '60':'Professional & Business Services', 
              '65':'Education & Health Services', '70':'Leisure & Hospitality', '80':'Other Services', 
              '90':'Government'}

In [36]:
# This reuses the regex from the "Working on Parsing the URL" section
# However, this time the parsed ID is used as the key to look up its value in the subsegment dictionary
# We did not end up using this, but at the time thought we might need to
subsegment[re.findall(r"\/([A-Z\d\s]+)", link1)[0][10:12]]

'Government'
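
A direct `subsegment[...]` lookup raises a `KeyError` if a URL carries a code that is not in the dictionary. The sketch below shows one way to fall back to a default instead. It uses a three-entry excerpt of the dictionary, and `subsegment_name` is a hypothetical helper, not part of the scraper.

```python
import re

# Small excerpt of the subsegment mapping defined above.
SUBSEGMENT_EXCERPT = {'00': 'Total Nonfarm', '20': 'Construction', '90': 'Government'}


def subsegment_name(url, mapping, default='Unknown'):
    """Map the two-digit code embedded in a timeseries URL to its industry name."""
    series_ids = re.findall(r'/([A-Z\d]+)', url)
    if not series_ids:
        return default  # no series ID found in the URL
    return mapping.get(series_ids[0][10:12], default)


known = subsegment_name('https://data.bls.gov/timeseries/SMS02000009000000001?x=1',
                        SUBSEGMENT_EXCERPT)
unknown = subsegment_name('https://data.bls.gov/timeseries/SMS02000005500000001',
                          SUBSEGMENT_EXCERPT)
print(known, unknown)
```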

In [39]:
# This scraper takes 10+ minutes and writes many files into the working directories.
# To follow the reasoning without running it, scroll through the notebook and read the comments instead.

state_data = {}
state_employment = {}

for state, links in list(state_backdata.items()):

    state_name = state

    dfs = {}
    employment = {}

    for i, link  in enumerate(links):
        if i == 0:
            resp = requests.get(link)
            data_name = f'./Development/development_state_employment/{state_name}_Employment'
            with open(f'{data_name}.xls', 'wb') as output:
                output.write(resp.content)

            df_list = pd.read_html(f'{data_name}.xls')
            df_employment = pd.DataFrame(df_list[1])
            # drop the trailing footnote row; the footnote markers are stripped by remove_special_char below
            df_employment = df_employment.drop(df_employment.index[-1])
            df_target = df_employment[['labor force participation rate','employment-population ratio', 'labor force', 'employment', 'unemployment', 'unemployment rate']] # where to apply regex
            df_target = df_target.apply(remove_special_char).astype(float)
            df_employment.loc[:, df_target.columns] = df_target
            state_employment[state_name] = df_employment

            print(f'{state_name} added to state employment dictionary.')
            time.sleep(1)
            
        # pull in industry subsegment information
        else:
            resp = requests.get(link)
            sub_name = subsegment[re.findall(r"\/([A-Z\d\s]+)", link)[0][10:12]]
            data_name = f'./Development/development_state_data/{state_name}_{sub_name}'
            with open(f'{data_name}.xls', 'wb') as output:
                output.write(resp.content)

            df_list = pd.read_html(f'{data_name}.xls')
            df = pd.DataFrame(df_list[1])
            df.drop([10, 11], inplace=True) # removing unnecessary rows
            df.set_index('Year', inplace=True)
            dfs[sub_name] = df.astype(float) # create state-subsegment entry for state dictionary
            print(f'{state_name}_{sub_name} added to list.')
            time.sleep(1)
    
    state_data[state_name] = dfs
    print(f'{state_name} data added to state data dictionary')

Alabama added to state employment dictionary.
Alabama_Total Nonfarm added to list.
Alabama_Mining and Logging added to list.
Alabama_Construction added to list.
Alabama_Manufacturing added to list.
Alabama_Trade, Transportation, and Utils added to list.
Alabama_Information added to list.
Alabama_Financial Activities added to list.
Alabama_Professional & Business Services added to list.
Alabama_Education & Health Services added to list.
Alabama_Leisure & Hospitality added to list.
Alabama_Other Services added to list.
Alabama_Government added to list.
Alabama data added to state data dictionary
Alaska added to state employment dictionary.
Alaska_Total Nonfarm added to list.
Alaska_Mining and Logging added to list.
Alaska_Construction added to list.
Alaska_Manufacturing added to list.
Alaska_Trade, Transportation, and Utils added to list.
Alaska_Information added to list.
Alaska_Financial Activities added to list.
Alaska_Professional & Business Services added to list.
Alaska_Education & 
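
The scraper writes into `./Development/...` folders, and none of the cells shown here create them, so they presumably have to exist beforehand. A small helper along the following lines could create the folder on demand and close the file reliably. It is a sketch only: `save_bytes` is a hypothetical name, and the demonstration writes placeholder bytes to a temporary directory rather than a real response.

```python
import os
import tempfile


def save_bytes(content, directory, filename):
    """Write raw bytes to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, 'wb') as output:  # the file is closed even if the write fails
        output.write(content)
    return path


# Demonstration in a temporary directory with placeholder bytes.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(b'<table></table>',
                       os.path.join(tmp, 'development_state_data'),
                       'Alabama_Total Nonfarm.xls')
    saved_exists = os.path.exists(saved)
    saved_size = os.path.getsize(saved)
print(saved_exists, saved_size)
```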

##### Viewing outputs

In [40]:
state_data['Alabama']['Total Nonfarm'].loc['2018'].mean()

2046.2583333333332

In [41]:
state_data['Alaska']['Total Nonfarm'].loc['2018'].mean()

327.6583333333333

In [42]:
state_data['Alabama']['Total Nonfarm']

Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
2012,1898.8,1898.9,1905.1,1905.9,1902.9,1904.0,1901.7,1906.3,1908.5,1911.6,1913.3,1912.9
2013,1914.0,1918.7,1921.2,1921.8,1924.3,1924.9,1924.5,1925.9,1925.2,1927.0,1929.0,1933.0
2014,1930.8,1929.3,1933.7,1939.1,1939.4,1942.7,1943.8,1947.5,1951.4,1954.0,1955.5,1960.1
2015,1958.2,1960.5,1958.2,1962.4,1968.9,1969.0,1970.8,1975.3,1977.1,1980.6,1983.2,1986.4
2016,1989.2,1989.1,1990.2,1995.9,1994.8,1993.8,1999.3,1999.6,2005.8,2002.0,2000.3,2003.7
2017,2009.2,2012.4,2016.1,2012.5,2016.4,2019.0,2018.8,2021.2,2022.6,2026.7,2023.7,2025.1
2018,2027.9,2032.3,2036.7,2038.9,2040.9,2047.0,2048.5,2051.1,2054.1,2057.7,2059.3,2060.7
2019,2066.4,2067.7,2069.4,2075.0,2077.9,2078.4,2081.6,2083.2,2082.7,2081.1,2083.5,2084.4
2020,2087.7,2087.8,2079.3,1843.5,1893.5,1942.6,1962.5,1984.7,1996.4,2011.4,2016.4,2022.8
2021,2023.0,2026.7,2030.5,2028.3,2034.3,2039.7,2048.3,2045.7,2039.8,2053.9,2052.8,2052.8
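
To make the yearly average concrete, the sketch below rebuilds the 2018 row of the Alabama table above and averages its twelve monthly values. Only the one row is reproduced; everything else about the real DataFrame is left out.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Alabama Total Nonfarm values for 2018, copied from the table above.
alabama_2018 = [2027.9, 2032.3, 2036.7, 2038.9, 2040.9, 2047.0,
                2048.5, 2051.1, 2054.1, 2057.7, 2059.3, 2060.7]

wide = pd.DataFrame([alabama_2018], index=pd.Index(['2018'], name='Year'), columns=months)

# .loc['2018'] selects the row as a Series of 12 monthly values; .mean() averages them.
annual_average = wide.loc['2018'].mean()
print(round(annual_average, 3))
```

The result agrees with the `2046.258...` value returned by the Alabama cell above.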


#### Working with 2018 Data

In [43]:
# Calculate average values by state and industry
target_year = '2018'

state_avgs_2018 = {}

for state_name, data in state_data.items():
    year_avgs = {}

    for sub_name, data_local in data.items():  # sub_name avoids overwriting the `subsegment` dictionary
        year_avgs[sub_name] = round(data_local.loc[target_year, :].mean(), 3)
        print(f'{state_name} {target_year} data added to year averages')

    state_avgs_2018[state_name] = year_avgs
    print(f'{state_name} averages added to state year dictionary')

Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama 2018 data added to year averages
Alabama averages added to state year dictionary
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data added to year averages
Alaska 2018 data add

In [44]:
# Save dictionary to DataFrame
economies_2018 = pd.DataFrame(state_avgs_2018).T
economies_2018

Unnamed: 0,Total Nonfarm,Mining and Logging,Construction,Manufacturing,"Trade, Transportation, and Utils",Information,Financial Activities,Professional & Business Services,Education & Health Services,Leisure & Hospitality,Other Services,Government,"Mining, Logging and Construction"
Alabama,2046.258,10.0,89.2,266.992,383.467,21.092,96.258,244.242,245.167,205.933,97.117,386.792,
Alaska,327.658,12.658,15.833,12.5,64.408,5.608,11.767,27.358,50.442,35.65,11.058,80.375,
Arizona,2857.717,13.05,157.4,171.45,534.525,47.558,220.008,434.033,445.417,325.975,92.442,415.858,
Arkansas,1267.492,5.933,50.833,160.8,253.408,12.433,61.058,145.425,191.575,118.267,55.65,212.108,
California,17172.225,22.492,860.683,1323.55,3045.983,542.85,837.875,2670.217,2722.283,1993.142,571.667,2581.483,
Colorado,2726.925,28.525,173.125,147.508,470.375,75.617,171.617,423.55,340.8,339.45,110.958,445.4,
Connecticut,1699.275,0.567,58.767,160.667,296.5,31.667,125.467,221.092,344.792,157.783,65.617,236.358,
Delaware,461.508,,,27.05,80.7,4.067,47.817,63.425,79.7,51.767,18.508,66.183,22.3
District of Columbia,792.958,,,1.342,33.208,19.158,29.592,168.117,130.892,79.808,76.783,238.4,15.65
Florida,8780.95,5.742,542.617,372.908,1779.892,139.492,575.8,1361.775,1305.058,1229.408,353.075,1115.008,


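##### Sum columns to fill the "missing" values
At first we thought data was missing. In fact, the gaps come from inconsistent reporting: some states report a combined 'Mining, Logging and Construction' figure instead of separate 'Mining and Logging' and 'Construction' figures. Summing the two separate columns wherever the combined column is empty resolves this.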

In [45]:
economies_2018['Mining, Logging and Construction'] = economies_2018['Mining, Logging and Construction'].fillna(economies_2018['Mining and Logging'] + economies_2018['Construction'])

economies_2018 = economies_2018.drop(columns=['Mining and Logging', 'Construction'])

However, this approach loses resolution on how 'Mining and Logging', as a separate industry subsegment, impacts economic resiliency. A potential next step is an additional clustering model that leaves out the states that were missing the 'Mining and Logging' subsegment, to see whether there is a notable correlation.

In [46]:
# Ensure solution worked - it did!
economies_2018.isnull().sum()

Total Nonfarm                       0
Manufacturing                       0
Trade, Transportation, and Utils    0
Information                         0
Financial Activities                0
Professional & Business Services    0
Education & Health Services         0
Leisure & Hospitality               0
Other Services                      0
Government                          0
Mining, Logging and Construction    0
dtype: int64
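
The fill-by-sum step can be traced on two rows. The sketch below uses the Alabama and Delaware values from the 2018 table: Alabama reports the two columns separately, while Delaware reports only the combined figure.

```python
import pandas as pd

# Values for Alabama and Delaware copied from the 2018 table above.
sample = pd.DataFrame(
    {
        'Mining and Logging': [10.0, None],
        'Construction': [89.2, None],
        'Mining, Logging and Construction': [None, 22.3],
    },
    index=['Alabama', 'Delaware'],
)

# Where the combined column is empty, use the sum of the two separate columns.
combined = sample['Mining, Logging and Construction'].fillna(
    sample['Mining and Logging'] + sample['Construction']
)
print(combined)
```

Delaware keeps its reported value because `fillna` only touches entries that are empty.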

#### Calculate Percentages
This cell computes each industry's share of the `Total Nonfarm` column and saves it to a new `pct_`-prefixed column. The shares are stored as fractions (for example, 0.1305), not multiplied by 100.

In [47]:
economies_2018_pcts = economies_2018.copy()

for column in economies_2018.columns[1:]:
    new_column = f'pct_{column}'
    economies_2018_pcts[new_column] = round(economies_2018[column] / economies_2018['Total Nonfarm'], 4) 

economies_2018_pcts.head()

Unnamed: 0,Total Nonfarm,Manufacturing,"Trade, Transportation, and Utils",Information,Financial Activities,Professional & Business Services,Education & Health Services,Leisure & Hospitality,Other Services,Government,...,pct_Manufacturing,"pct_Trade, Transportation, and Utils",pct_Information,pct_Financial Activities,pct_Professional & Business Services,pct_Education & Health Services,pct_Leisure & Hospitality,pct_Other Services,pct_Government,"pct_Mining, Logging and Construction"
Alabama,2046.258,266.992,383.467,21.092,96.258,244.242,245.167,205.933,97.117,386.792,...,0.1305,0.1874,0.0103,0.047,0.1194,0.1198,0.1006,0.0475,0.189,0.0485
Alaska,327.658,12.5,64.408,5.608,11.767,27.358,50.442,35.65,11.058,80.375,...,0.0381,0.1966,0.0171,0.0359,0.0835,0.1539,0.1088,0.0337,0.2453,0.087
Arizona,2857.717,171.45,534.525,47.558,220.008,434.033,445.417,325.975,92.442,415.858,...,0.06,0.187,0.0166,0.077,0.1519,0.1559,0.1141,0.0323,0.1455,0.0596
Arkansas,1267.492,160.8,253.408,12.433,61.058,145.425,191.575,118.267,55.65,212.108,...,0.1269,0.1999,0.0098,0.0482,0.1147,0.1511,0.0933,0.0439,0.1673,0.0448
California,17172.225,1323.55,3045.983,542.85,837.875,2670.217,2722.283,1993.142,571.667,2581.483,...,0.0771,0.1774,0.0316,0.0488,0.1555,0.1585,0.1161,0.0333,0.1503,0.0514
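
The same shares can also be computed without a Python loop. The following is a sketch of that alternative, not the method used in this notebook: it uses `DataFrame.div` with `axis=0` on a two-state, two-industry excerpt of the table above.

```python
import pandas as pd

# Two industry columns for two states, copied from the table above.
sample = pd.DataFrame(
    {'Total Nonfarm': [2046.258, 327.658],
     'Manufacturing': [266.992, 12.5],
     'Government': [386.792, 80.375]},
    index=['Alabama', 'Alaska'],
)

industries = sample.columns.drop('Total Nonfarm')

# Divide every industry column by Total Nonfarm row by row, then prefix the names.
shares = sample[industries].div(sample['Total Nonfarm'], axis=0).round(4).add_prefix('pct_')
with_shares = sample.join(shares)
print(with_shares)
```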


In [48]:
# Check that each state's shares sum to 1 (+/- rounding error)
economies_2018_pcts.iloc[:, -10:].T.sum()

Alabama                 1.0000
Alaska                  0.9999
Arizona                 0.9999
Arkansas                0.9999
California              1.0000
Colorado                0.9999
Connecticut             1.0000
Delaware                1.0000
District of Columbia    0.9999
Florida                 1.0000
Georgia                 1.0001
Hawaii                  1.0001
Idaho                   1.0001
Illinois                1.0000
Indiana                 0.9999
Iowa                    1.0001
Kansas                  1.0000
Kentucky                1.0000
Louisiana               1.0001
Maine                   1.0000
Maryland                0.9999
Massachusetts           1.0000
Michigan                1.0001
Minnesota               1.0001
Mississippi             0.9999
Missouri                1.0000
Montana                 0.9999
Nebraska                0.9999
Nevada                  1.0002
New Hampshire           0.9999
New Jersey              0.9999
New Mexico              1.0000
New York
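
Rather than reading the totals by eye, the check can be expressed as a tolerance test. The sketch below uses illustrative shares, not BLS data, and assumes rounding is the only source of drift: with `n` shares each rounded to four decimals, the total can move by at most `n * 0.00005`.

```python
import pandas as pd

# Hypothetical rounded shares for two states (illustrative values, not BLS data).
shares = pd.DataFrame(
    {'pct_a': [0.2501, 0.3333], 'pct_b': [0.2499, 0.3333], 'pct_c': [0.5000, 0.3333]},
    index=['State A', 'State B'],
)

row_totals = shares.sum(axis=1)  # equivalent to shares.T.sum()

# Each share is rounded to 4 decimals, so the total can drift by up to n * 0.00005.
tolerance = shares.shape[1] * 0.00005
within_tolerance = (row_totals - 1).abs() <= tolerance
print(within_tolerance.all())
```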

In [52]:
# Save file to CSV
economies_2018_pcts.to_csv('./Development/development_source_data/economies_2018.csv')

#### Working with 2021 Data

In [53]:
# Calculate average values by state and industry

target_year = '2021'

state_avgs_2021 = {}

for state_name, data in state_data.items():
    year_avgs = {}

    for sub_name, data_local in data.items():  # sub_name avoids overwriting the `subsegment` dictionary
        year_avgs[sub_name] = round(data_local.loc[target_year, :].mean(), 3)
        print(f'{state_name} {target_year} data added to year averages')

    state_avgs_2021[state_name] = year_avgs
    print(f'{state_name} averages added to state year dictionary')

Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama 2021 data added to year averages
Alabama averages added to state year dictionary
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data added to year averages
Alaska 2021 data add
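
Since the 2018 and 2021 cells differ only in the target year, the logic could be wrapped in one function. This is a sketch of that refactoring under the assumption that `state_data` keeps its `{state: {industry: DataFrame}}` shape. The helper `yearly_averages` and the toy data (two months instead of twelve) are hypothetical.

```python
import pandas as pd


def yearly_averages(state_data, year):
    """Return a states-by-industries DataFrame of mean monthly values for one year."""
    averages = {
        state: {name: round(table.loc[year].mean(), 3) for name, table in tables.items()}
        for state, tables in state_data.items()
    }
    return pd.DataFrame(averages).T


# Tiny stand-in for state_data (illustrative values).
toy_state_data = {
    'State A': {'Total Nonfarm': pd.DataFrame({'Jan': [100.0, 110.0], 'Feb': [102.0, 112.0]},
                                              index=['2018', '2021'])},
    'State B': {'Total Nonfarm': pd.DataFrame({'Jan': [50.0, 40.0], 'Feb': [52.0, 44.0]},
                                              index=['2018', '2021'])},
}

toy_2018 = yearly_averages(toy_state_data, '2018')
toy_2021 = yearly_averages(toy_state_data, '2021')
print(toy_2018)
print(toy_2021)
```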

In [54]:
# Save dictionary to DataFrame
economies_2021 = pd.DataFrame(state_avgs_2021).T
economies_2021

Unnamed: 0,Total Nonfarm,Mining and Logging,Construction,Manufacturing,"Trade, Transportation, and Utils",Information,Financial Activities,Professional & Business Services,Education & Health Services,Leisure & Hospitality,Other Services,Government,"Mining, Logging and Construction"
Alabama,2039.65,8.575,94.333,263.567,394.55,19.958,98.017,250.492,239.05,188.392,95.058,387.658,
Alaska,310.458,10.45,15.925,12.5,61.05,4.775,10.783,26.458,50.508,30.35,10.475,77.183,
Arizona,2957.95,11.992,177.525,180.817,583.842,47.383,245.65,444.75,464.508,303.942,91.125,406.417,
Arkansas,1282.608,5.3,55.233,157.183,256.342,11.708,65.892,146.117,195.1,115.8,66.958,206.975,
California,16705.817,19.067,880.417,1272.492,3033.142,566.55,823.083,2703.4,2809.083,1630.642,500.008,2467.933,
Colorado,2745.258,19.733,176.95,148.675,486.225,76.325,177.75,452.75,347.85,306.617,113.675,438.708,
Connecticut,1614.067,0.492,59.667,153.408,290.475,29.842,117.575,213.175,333.408,133.825,58.233,223.967,
Delaware,449.433,,,24.758,80.875,3.558,47.567,62.892,77.583,45.058,18.242,65.867,23.042
District of Columbia,742.292,,,1.092,29.242,19.567,28.083,167.208,120.1,49.575,71.233,241.15,15.042
Florida,8915.367,5.358,575.683,388.083,1840.283,138.55,621.783,1455.658,1340.283,1122.65,334.008,1093.083,


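##### Sum columns to fill the "missing" values
As with the 2018 data, the gaps come from inconsistent reporting rather than missing data: some states report a combined 'Mining, Logging and Construction' figure instead of separate 'Mining and Logging' and 'Construction' figures. Summing the two separate columns wherever the combined column is empty resolves this.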

In [55]:
economies_2021['Mining, Logging and Construction'] = economies_2021['Mining, Logging and Construction'].fillna(economies_2021['Mining and Logging'] + economies_2021['Construction'])

economies_2021 = economies_2021.drop(columns=['Mining and Logging', 'Construction'])

#### Calculate Percentages
This cell computes each industry's share of the `Total Nonfarm` column and saves it to a new `pct_`-prefixed column. The shares are stored as fractions (for example, 0.1292), not multiplied by 100.

In [56]:
economies_2021_pcts = economies_2021.copy()

for column in economies_2021.columns[1:]:
    new_column = f'pct_{column}'
    economies_2021_pcts[new_column] = round(economies_2021[column] / economies_2021['Total Nonfarm'], 4) 

economies_2021_pcts.head()

Unnamed: 0,Total Nonfarm,Manufacturing,"Trade, Transportation, and Utils",Information,Financial Activities,Professional & Business Services,Education & Health Services,Leisure & Hospitality,Other Services,Government,...,pct_Manufacturing,"pct_Trade, Transportation, and Utils",pct_Information,pct_Financial Activities,pct_Professional & Business Services,pct_Education & Health Services,pct_Leisure & Hospitality,pct_Other Services,pct_Government,"pct_Mining, Logging and Construction"
Alabama,2039.65,263.567,394.55,19.958,98.017,250.492,239.05,188.392,95.058,387.658,...,0.1292,0.1934,0.0098,0.0481,0.1228,0.1172,0.0924,0.0466,0.1901,0.0505
Alaska,310.458,12.5,61.05,4.775,10.783,26.458,50.508,30.35,10.475,77.183,...,0.0403,0.1966,0.0154,0.0347,0.0852,0.1627,0.0978,0.0337,0.2486,0.085
Arizona,2957.95,180.817,583.842,47.383,245.65,444.75,464.508,303.942,91.125,406.417,...,0.0611,0.1974,0.016,0.083,0.1504,0.157,0.1028,0.0308,0.1374,0.0641
Arkansas,1282.608,157.183,256.342,11.708,65.892,146.117,195.1,115.8,66.958,206.975,...,0.1225,0.1999,0.0091,0.0514,0.1139,0.1521,0.0903,0.0522,0.1614,0.0472
California,16705.817,1272.492,3033.142,566.55,823.083,2703.4,2809.083,1630.642,500.008,2467.933,...,0.0762,0.1816,0.0339,0.0493,0.1618,0.1681,0.0976,0.0299,0.1477,0.0538


In [57]:
# Check that each state's shares sum to 1 (+/- rounding error)
economies_2021_pcts.iloc[:, -10:].T.sum()

Alabama                 1.0001
Alaska                  1.0000
Arizona                 1.0000
Arkansas                1.0000
California              0.9999
Colorado                0.9999
Connecticut             1.0001
Delaware                1.0000
District of Columbia    1.0002
Florida                 0.9999
Georgia                 1.0002
Hawaii                  1.0000
Idaho                   1.0000
Illinois                1.0000
Indiana                 0.9999
Iowa                    0.9999
Kansas                  1.0001
Kentucky                0.9999
Louisiana               0.9998
Maine                   1.0001
Maryland                1.0000
Massachusetts           1.0001
Michigan                0.9999
Minnesota               1.0000
Mississippi             0.9999
Missouri                1.0000
Montana                 1.0001
Nebraska                0.9999
Nevada                  1.0001
New Hampshire           1.0000
New Jersey              1.0000
New Mexico              1.0000
New York

In [59]:
# Save file to directory
economies_2021_pcts.to_csv('./Development/development_source_data/economies_2021.csv', index_label='State')
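
The `index_label='State'` argument names the index column in the file. Without it (as in the 2018 export above), the first header cell is left blank and pandas labels that column `Unnamed: 0` when the file is read back. The round trip below is a minimal sketch that uses a temporary file and two values from the 2021 table.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({'Total Nonfarm': [2039.65, 310.458]}, index=['Alabama', 'Alaska'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'economies_sample.csv')
    sample.to_csv(path, index_label='State')         # names the index column in the file
    restored = pd.read_csv(path, index_col='State')  # restores it as the index on load

print(restored)
```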

##### Export Employment Data to CSVs by State

In [60]:
for state, dataframe in state_employment.items():
    filename = f'./Development/development_state_employment/{state}_employment.csv'
    dataframe.to_csv(filename, index=False)