# 00 - Data Collection Notebook
This notebook acquires state-level economic data, including industry and employment information, for all 50 states and Washington, D.C. Once compiled into Pandas DataFrames, the data is exported into the `raw_data` folder. The notebook then transforms the data into the format needed for modeling in our next step.

*Some of the scraping methods used in this notebook were referenced from the following lecture: https://git.generalassemb.ly/DSIR-0124/lesson-webscraping/blob/master/intro-to-web-scraping-spiders-with-scrapy.ipynb*

#### Imports

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import time
import os

### Individual State Scraper 
In this scraper, we extract one link per state from the U.S. Bureau of Labor Statistics website at the address stored in the `url` variable below. As the code shows, we chose to exclude the U.S. territories of Guam, Puerto Rico, and the Virgin Islands from our analysis.

Using BeautifulSoup, this scraper returns a Pandas DataFrame of the State name and associated URL.

In [6]:
url = 'https://www.bls.gov/eag'

response = requests.get(url)

html = response.text

soup = BeautifulSoup(html, 'lxml')

all_h4 = soup.find_all('h4')
all_h4

state_link_list = []

for element in all_h4:
    a_href = element.find('a')
    # skip headings without a link so no empty rows reach the DataFrame
    if a_href:
        result = {}
        result['title'] = a_href.text
        result['link'] = 'https://www.bls.gov/' + a_href['href'].strip().lstrip('/')
        state_link_list.append(result)
    
state_link_list

state_links = pd.DataFrame(state_link_list)
state_links = state_links.set_index('title')
state_links = state_links.drop(['Guam', 'Puerto Rico', 'Virgin Islands'])
state_links

title,link
Alabama,https://www.bls.gov/regions/southeast/alabama....
Alaska,https://www.bls.gov/regions/west/alaska.htm#eag
Arizona,https://www.bls.gov/regions/west/arizona.htm#eag
Arkansas,https://www.bls.gov/regions/southwest/arkansas...
California,https://www.bls.gov/regions/west/california.ht...
Colorado,https://www.bls.gov/regions/mountain-plains/co...
Connecticut,https://www.bls.gov/regions/new-england/connec...
Delaware,https://www.bls.gov/regions/mid-atlantic/delaw...
District of Columbia,https://www.bls.gov/regions/mid-atlantic/distr...
Florida,https://www.bls.gov/regions/southeast/florida....
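
The scraper above builds each absolute link by concatenating the site root with the stripped `href`. As a minimal sketch of an alternative, assuming the `href` values look like the ones in the output above, the standard library's `urljoin` handles leading slashes and already-absolute links without manual stripping. The helper name `absolute_link` and the sample `href` strings are illustrative, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bls.gov/"  # same site root as the scraper above


def absolute_link(href: str, base: str = BASE_URL) -> str:
    """Resolve a scraped href against the site root."""
    return urljoin(base, href.strip())


# Hypothetical href values, shaped like the links in the output above.
print(absolute_link("/regions/west/alaska.htm#eag"))
print(absolute_link(" regions/west/arizona.htm#eag "))
```

Both calls resolve to a URL under `https://www.bls.gov/regions/west/`, whether or not the `href` starts with a slash.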


### All State Links Scraper 
In this scraper, we extract the necessary child links from the aforementioned DataFrame, and save them into a Dictionary.

In [5]:
state_backdata = {}

# iterate over (state name, link) pairs directly instead of positional lookups
for state_name, state_url in state_links['link'].items():
    
    loop_response = requests.get(state_url)

    loop_html = loop_response.text

    loop_soup = BeautifulSoup(loop_html, 'lxml')

    loop_table = loop_soup.find('table', {'class': 'regular'})

    loop_backdata_list = []

    loop_backdata_links = []
    
    for row in loop_table.find_all('tr')[1:]:

        loop_backdata_list.append(row)

    for item in loop_backdata_list:
        item = str(item)
        if 'https' in item:
            links = re.findall(r"(https:\S+)", item)
            loop_backdata_links.append(links[0])

    loop_final_list = [x for i, x in enumerate(loop_backdata_links) if i in [0,4,6,8,10,12,14,16,18,20,22,24,26,28]]
    state_backdata[state_name] = loop_final_list
    print(f'{state_name} links added to dictionary')
    time.sleep(1)

Alabama links added to dictionary
Alaska links added to dictionary
Arizona links added to dictionary
Arkansas links added to dictionary
California links added to dictionary
Colorado links added to dictionary
Connecticut links added to dictionary
Delaware links added to dictionary
District of Columbia links added to dictionary
Florida links added to dictionary
Georgia links added to dictionary
Hawaii links added to dictionary
Idaho links added to dictionary
Illinois links added to dictionary
Indiana links added to dictionary
Iowa links added to dictionary
Kansas links added to dictionary
Kentucky links added to dictionary
Louisiana links added to dictionary
Maine links added to dictionary
Maryland links added to dictionary
Massachusetts links added to dictionary
Michigan links added to dictionary
Minnesota links added to dictionary
Mississippi links added to dictionary
Missouri links added to dictionary
Montana links added to dictionary
Nebraska links added to dictionary
Nevada links ad
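
The link extraction in the loop above runs `https:\S+` over the string form of each table row, so a match continues until the first whitespace character. The sketch below, using a made-up row on `example.com` (the real BLS markup may differ), shows how that pattern can pick up trailing markup and how a tighter character class stops at the closing quote. The tighter pattern is a suggestion, not what the notebook uses.

```python
import re

# Hypothetical table-row HTML; the real BLS markup may differ.
row_html = '<tr><td><a href="https://example.com/timeseries/ABC123">Back data</a></td></tr>'

# Pattern used in the notebook: runs until the next whitespace character.
loose = re.findall(r"(https:\S+)", row_html)

# Tighter alternative: also stops at quotes and angle brackets.
tight = re.findall(r"https:[^\s\"'<>]+", row_html)

print(loose[0])
print(tight[0])
```

If the scraped rows ever yield links with trailing characters, inspecting `loop_backdata_links` for one state is a quick way to see which pattern fits.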

In [13]:
def remove_special_char(column):
    # keep only the leading number in each cell, dropping trailing footnote markers
    return column.map(lambda s: re.findall(r'^\d+\.?\d*', str(s))[0])
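
`remove_special_char` indexes the first regex match directly, so a cell with no leading digits would raise an `IndexError`. A minimal sketch of a more tolerant variant follows, working on single values rather than a whole column. The function name `leading_number` and the sample strings are hypothetical, and the footnote formats in the real tables may differ.

```python
import re
from typing import Optional

_LEADING_NUMBER = re.compile(r"^\d+\.?\d*")


def leading_number(value) -> Optional[float]:
    """Return the number at the start of a cell, or None if there is none."""
    match = _LEADING_NUMBER.match(str(value).strip())
    return float(match.group()) if match else None


# Hypothetical cell values with and without footnote markers.
samples = ["4.1(P)", "2265", "(1)", "61.3"]
print([leading_number(s) for s in samples])
```

Like the original pattern, this stops at the first non-numeric character, so a value written with thousands separators would be cut at the comma.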

In [None]:
df_target

In [None]:
df_employment

In [None]:
state_employment

In [16]:
subsegment = {'00':'Total Nonfarm', '05':'Total Private', '06':'Goods Producing',
              '07':'Service-Providing', '08':'Private Service Providing', '10':'Mining and Logging',
              '15':'Mining, Logging and Construction', '20':'Construction', '30':'Manufacturing', 
              '31':'Durable Goods', '32':'Non-Durable Goods', '40':'Trade, Transportation, and Utils', 
              '41':'Wholesale Trade', '42':'Retail Trade', '43':'Transportation and Utils', 
              '50':'Information', '55':'Financial Activities', '60':'Professional & Business Services', 
              '65':'Education & Health Services', '70':'Leisure & Hospitality', '80':'Other Services', 
              '90':'Government'}
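
The two-digit keys of `subsegment` are matched later in the notebook against characters 10 and 11 of the series ID embedded in each download link. The sketch below walks through that lookup on a made-up link. It assumes the ID is laid out as prefix (3 characters), state (2), area (5), supersector (2), industry (6), and data type (2); both that layout and the `example.com` URL are assumptions for illustration.

```python
import re

# Small subset of the subsegment mapping above, repeated so the sketch is self-contained.
subsegment_sample = {"00": "Total Nonfarm", "30": "Manufacturing", "90": "Government"}

# Assumed series ID layout: prefix + state + area + supersector + industry + data type.
series_id = "SMS" + "01" + "00000" + "30" + "000000" + "01"
link = f"https://example.com/timeseries/{series_id}"  # hypothetical URL


def supersector_code(url: str) -> str:
    """Pull the first all-caps/digit path segment and slice out characters 10-11."""
    segment = re.findall(r"/([A-Z\d\s]+)", url)[0]
    return segment[10:12]


code = supersector_code(link)
print(code, subsegment_sample[code])
```

Because the slice is positional, any change in the ID length or prefix would silently select the wrong characters, so checking one real link by hand is worthwhile.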

In [19]:
os.makedirs('./state_employment', exist_ok=True)

In [None]:
os.makedirs('./state_industry', exist_ok=True)


In [20]:
# state_backdata.iloc[:2].items()
state_data = {}
state_employment = {}

for state, links in list(state_backdata.items()):

    state_name = state

    dfs = {}
    employment = {}

    for i, link in enumerate(links):
        if i == 0:
            resp = requests.get(link)
            data_name = f'./state_employment/{state_name}_Employment'
            # write the downloaded bytes to disk; the file is closed automatically
            with open(f'{data_name}.xls', 'wb') as output:
                output.write(resp.content)

            df_list = pd.read_html(f'{data_name}.xls')
            df_employment = pd.DataFrame(df_list[1])
            # drop the last row of the table
            df_employment = df_employment.drop(df_employment.index[-1])
            # numeric columns that may carry footnote markers; strip them with the regex helper
            df_target = df_employment[['labor force participation rate','employment-population ratio', 'labor force', 'employment', 'unemployment', 'unemployment rate']]
            df_target = df_target.apply(remove_special_char).astype(float)
            df_employment.loc[:, df_target.columns] = df_target
            state_employment[state_name] = df_employment

            print(f'{state_name} added to state employment dictionary.')
            time.sleep(1)
            
        # pull in industry subsegment information
        else:
            resp = requests.get(link)
            # characters 10-11 of the series ID in the link hold the subsegment code
            sub_name = subsegment[re.findall(r"/([A-Z\d\s]+)", link)[0][10:12]]
            data_name = f'./state_industry/{state_name}_{sub_name}'
            with open(f'{data_name}.xls', 'wb') as output:
                output.write(resp.content)

            df_list = pd.read_html(f'{data_name}.xls')
            df = pd.DataFrame(df_list[1])
            df.drop([10, 11], inplace=True) # removing unnecessary rows
            df.set_index('Year', inplace=True)
            dfs[sub_name] = df.astype(float) # create state-subsegment entry for state dictionary
            print(f'{data_name} added to list.')
            time.sleep(1)
    
    state_data[state_name] = dfs
    print(f'{state_name} data added to state data dictionary')

Alabama added to state employment dictionary.
Alabama_Total Nonfarm added to list.
Alabama_Mining and Logging added to list.
Alabama_Construction added to list.
Alabama_Manufacturing added to list.
Alabama_Trade, Transportation, and Utils added to list.
Alabama_Information added to list.
Alabama_Financial Activities added to list.
Alabama_Professional & Business Services added to list.
Alabama_Education & Health Services added to list.
Alabama_Leisure & Hospitality added to list.
Alabama_Other Services added to list.
Alabama_Government added to list.
Alabama data added to state data dictionary
Alaska added to state employment dictionary.
Alaska_Total Nonfarm added to list.
Alaska_Mining and Logging added to list.
Alaska_Construction added to list.
Alaska_Manufacturing added to list.
Alaska_Trade, Transportation, and Utils added to list.
Alaska_Information added to list.


KeyboardInterrupt: 

In [None]:
state_data['Alabama']['Total Nonfarm'].loc['2018'].mean()

In [None]:
state_data['Alaska']['Total Nonfarm'].loc['2018'].mean()

In [None]:
pd.DataFrame(state_data).loc['Total Nonfarm', 'Alabama'].loc['2018'].mean()

In [None]:
state_data['Alabama']['Total Nonfarm']

In [None]:
state_data['Alaska']['Total Nonfarm']

In [None]:
state_data.items()

In [None]:
list(state_data.items())[0]

In [None]:
list(list(state_data.items())[0][1].items())

In [None]:
target_year = '2018'

state_avgs_2018 = {}

for state_name, data in state_data.items():
    year_avgs = {}

    # use sub_name as the loop variable so the subsegment lookup dict is not overwritten
    for sub_name, data_local in data.items():
        year_avgs[sub_name] = round(data_local.loc[target_year, :].mean(), 3)
        print(f'{state_name} {sub_name} {target_year} average added to year averages')

    state_avgs_2018[state_name] = year_avgs
    print(f'{state_name} averages added to state year dictionary')
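
The loop above reduces each subsegment table to a single average for the target year. Since the same logic is needed for more than one year, it could be wrapped in a small helper. The sketch below assumes, as in the notebook, a nested dictionary of state to subsegment to DataFrame indexed by year; the function name `yearly_averages` and the toy numbers are made up for illustration.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar"]  # toy subset of monthly columns

# Toy stand-in for state_data: {state: {subsegment: DataFrame indexed by year}}.
toy_state_data = {
    "State A": {
        "Total Nonfarm": pd.DataFrame(
            [[100.0, 102.0, 104.0], [110.0, 112.0, 114.0]],
            index=["2018", "2019"], columns=months),
        "Government": pd.DataFrame(
            [[20.0, 21.0, 22.0], [23.0, 24.0, 25.0]],
            index=["2018", "2019"], columns=months),
    }
}


def yearly_averages(state_data, year):
    """Average each subsegment's monthly values for one year, per state."""
    return {
        state: {name: round(df.loc[year].mean(), 3) for name, df in tables.items()}
        for state, tables in state_data.items()
    }


averages_2018 = yearly_averages(toy_state_data, "2018")
print(pd.DataFrame(averages_2018).T)
```

Calling the helper once per year would produce the same kind of dictionary as the loop, ready for `pd.DataFrame(...).T`.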

In [None]:
economies_2018 = pd.DataFrame(state_avgs_2018).T
economies_2018

In [None]:
economies_2018['Mining, Logging and Construction'] = economies_2018['Mining, Logging and Construction'].fillna(economies_2018['Mining and Logging'] + economies_2018['Construction'])

economies_2018 = economies_2018.drop(columns=['Mining and Logging', 'Construction'])
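
The cell above fills gaps in the combined 'Mining, Logging and Construction' column with the sum of the two separate columns, then drops the separate columns. A toy illustration of the `fillna` step, with invented state names and values:

```python
import pandas as pd

# Toy data: State A reports only the combined column, State B only the split columns.
toy = pd.DataFrame(
    {
        "Mining, Logging and Construction": [50.0, None],
        "Mining and Logging": [None, 10.0],
        "Construction": [None, 30.0],
    },
    index=["State A", "State B"],
)

# fillna aligns on the index, so only the missing combined values are replaced.
combined = toy["Mining, Logging and Construction"].fillna(
    toy["Mining and Logging"] + toy["Construction"]
)
print(combined)
```

State A keeps its reported value and State B receives the sum of its two components.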

In [None]:
economies_2018.isnull().sum()

In [None]:
economies_2018_pcts = economies_2018.copy()

for column in economies_2018.columns[1:]:
    new_column = f'pct_{column}'
    economies_2018_pcts[new_column] = round(economies_2018[column] / economies_2018['Total Nonfarm'], 4) 

economies_2018_pcts.head()
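
The loop above relies on 'Total Nonfarm' being the first column, because it iterates over `columns[1:]`. As a sketch of an equivalent that selects the denominator by name instead of position, the same shares can be computed in one vectorized step. The toy frame and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Total Nonfarm": [200.0, 400.0],
        "Manufacturing": [50.0, 40.0],
        "Government": [30.0, 100.0],
    },
    index=["State A", "State B"],
)

# Divide every subsegment column by Total Nonfarm, row by row.
shares = (
    toy.drop(columns="Total Nonfarm")
    .div(toy["Total Nonfarm"], axis=0)
    .round(4)
    .add_prefix("pct_")
)
toy_pcts = toy.join(shares)
print(toy_pcts)
```

This yields the same `pct_` column names as the loop without depending on column order.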

In [None]:
economies_2018_pcts.to_csv('./Source Data/economies_2018.csv')

In [None]:
target_year = '2021'

state_avgs_2021 = {}

for state_name, data in state_data.items():
    year_avgs = {}

    # use sub_name as the loop variable so the subsegment lookup dict is not overwritten
    for sub_name, data_local in data.items():
        year_avgs[sub_name] = round(data_local.loc[target_year, :].mean(), 3)
        print(f'{state_name} {sub_name} {target_year} average added to year averages')

    state_avgs_2021[state_name] = year_avgs
    print(f'{state_name} averages added to state year dictionary')

In [None]:
state_avgs_2021

In [None]:
economies_2021 = pd.DataFrame(state_avgs_2021).T
economies_2021

In [None]:
economies_2021['Mining, Logging and Construction'] = economies_2021['Mining, Logging and Construction'].fillna(economies_2021['Mining and Logging'] + economies_2021['Construction'])

economies_2021 = economies_2021.drop(columns=['Mining and Logging', 'Construction'])

In [None]:
economies_2021_pcts = economies_2021.copy()

for column in economies_2021.columns[1:]:
    new_column = f'pct_{column}'
    economies_2021_pcts[new_column] = round(economies_2021[column] / economies_2021['Total Nonfarm'], 4) 

economies_2021_pcts.head()

In [None]:
economies_2021_pcts.to_csv('./Source Data/economies_2021.csv', index_label='State')

In [None]:
pd.read_csv('./Source Data/economies_2021.csv').set_index('State')

We lost resolution on how 'Mining and Logging', as a separate industry subsegment, impacts economic resiliency. We could build an additional clustering model that leaves out the states that were missing the 'Mining and Logging' subsegment, to see whether there is a notable correlation.
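
One way to prepare for that follow-up analysis, sketched here on invented data, is to record which states report 'Mining and Logging' separately. This would have to run before the separate columns are dropped; the frame and state names below are hypothetical.

```python
import pandas as pd

# Toy frame shaped like the yearly averages before the separate columns are dropped.
toy = pd.DataFrame(
    {
        "Total Nonfarm": [200.0, 400.0, 150.0],
        "Mining and Logging": [10.0, None, 5.0],
    },
    index=["State A", "State B", "State C"],
)

# States that report the subsegment on its own.
states_with_split = toy.index[toy["Mining and Logging"].notna()].tolist()

# Subset that a second clustering model could use.
split_only = toy.loc[states_with_split]
print(states_with_split)
```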

In [None]:
state_employment['Alabama']

Exploring the addition of a percent-of-Total-Nonfarm column for each subsegment column

In [None]:
economies_2018.columns

In [None]:
economies_2018_pcts = economies_2018.copy()

for column in economies_2018.columns[1:]:
    new_column = f'pct_{column}'
    economies_2018_pcts[new_column] = round(economies_2018[column] / economies_2018['Total Nonfarm'], 4) 

economies_2018_pcts.head()

In [None]:
economies_2018_pcts.iloc[:, -10:].T.sum()
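
The cell above sums the last ten columns per state as a sanity check on the shares. A sketch of the same check that selects the share columns by their `pct_` prefix rather than by position, and flags states whose shares do not add up to roughly 1, is shown below. The toy values and the 0.01 tolerance are arbitrary choices for illustration.

```python
import pandas as pd

toy_pcts = pd.DataFrame(
    {
        "Total Nonfarm": [200.0, 400.0],
        "pct_Manufacturing": [0.25, 0.5],
        "pct_Government": [0.75, 0.4],
    },
    index=["State A", "State B"],
)

# Sum the share columns row by row.
share_totals = toy_pcts.filter(like="pct_").sum(axis=1)

# Flag states whose shares are more than 0.01 away from 1.
off = share_totals[(share_totals - 1).abs() > 0.01]
print(off)
```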

In [None]:
state_employment

In [None]:
state_employment.items()

In [None]:
if not os.path.exists('./State Employment'):
    os.mkdir('./State Employment')

In [None]:
for state, dataframe in state_employment.items():
    filename = f'./State Employment/{state}_employment.csv'
    dataframe.to_csv(filename, index=False)