# (E)xtract (T)ransform (L)oad

**Abstract**

This notebook contains the code we used to extract and transform our datasets. The final step of the ETL process (Load) is in [load_data_to_SQL_database.ipynb](./load_data_to_SQL_database.ipynb).

In [1]:
import requests
import pandas as pd

A helper function, used throughout the notebook, that uppercases all column headings of a dataframe in place.

In [15]:
def upper_columns(data):
    """Uppercase every column heading of `data` in place."""
    data.columns = [name.upper() for name in data.columns]
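
For comparison, a non-mutating variant can be sketched with `DataFrame.rename`, which returns a new frame and leaves the original untouched. The sample frame and its values below are hypothetical and exist only to show the effect.

```python
import pandas as pd

# Hypothetical sample frame; the column names and values are illustrative only.
sample = pd.DataFrame({"name": ["Aitkin County"], "lat": [46.6], "lng": [-93.4]})

# rename() applies str.upper to each column label and returns a new frame.
upper_sample = sample.rename(columns=str.upper)

print(list(upper_sample.columns))  # uppercased labels
print(list(sample.columns))        # the original frame is unchanged
```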

## Census.gov API Call 

https://www.census.gov/data/developers/data-sets/Poverty-Statistics.html

Three API calls collect the poverty data at the national, state, and county levels.

**Extract**

In [16]:
national_url ='https://api.census.gov/data/timeseries/poverty/saipe?get=SAEPOVRTALL_PT,SAEPOVRT0_17_PT,SAEMHI_PT,YEAR&for=us'
state_url = 'https://api.census.gov/data/timeseries/poverty/saipe?get=NAME,SAEPOVRTALL_PT,SAEPOVRT0_17_PT,SAEMHI_PT,YEAR&for=state:*'
county_url = 'https://api.census.gov/data/timeseries/poverty/saipe?get=NAME,SAEPOVRTALL_PT,SAEPOVRT0_17_PT,SAEMHI_PT,YEAR&for=county:*'

national_poverty_response = requests.get(national_url)
state_poverty_response = requests.get(state_url)
county_poverty_response = requests.get(county_url)
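
The three URLs differ only in the variable list and the geography, so they could also be assembled by a small helper. This is a minimal sketch using only the standard library; `build_saipe_url` and `SAIPE_BASE_URL` are hypothetical names, not part of the original notebook.

```python
from urllib.parse import urlencode

SAIPE_BASE_URL = "https://api.census.gov/data/timeseries/poverty/saipe"


def build_saipe_url(variables, geography):
    """Return a SAIPE query URL for the given variable names and geography."""
    query = {"get": ",".join(variables), "for": geography}
    # Keep ',', ':' and '*' unescaped so the URL matches the hand-written ones.
    return SAIPE_BASE_URL + "?" + urlencode(query, safe=",:*")


variables = ["NAME", "SAEPOVRTALL_PT", "SAEPOVRT0_17_PT", "SAEMHI_PT", "YEAR"]
state_url_built = build_saipe_url(variables, "state:*")
print(state_url_built)
```

When fetching, passing `timeout=` to `requests.get` and calling `raise_for_status()` on each response before `.json()` would surface failed calls early.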

**Transform**

In [17]:
### Converting the responses from the API calls into JSON format

national_poverty_response_json = national_poverty_response.json()
state_poverty_response_json = state_poverty_response.json()
county_poverty_response_json = county_poverty_response.json()

In [18]:
### Storing the JSON into different dataframes

poverty_nat = pd.DataFrame(national_poverty_response_json[1:], columns = national_poverty_response_json[0])
poverty_state = pd.DataFrame(state_poverty_response_json[1:], columns = state_poverty_response_json[0])
poverty_county = pd.DataFrame(county_poverty_response_json[1:], columns = county_poverty_response_json[0])
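
The slicing above relies on the response being a list of lists whose first entry holds the column names. A minimal sketch with a hypothetical payload (the numbers are made up for illustration) shows the shape:

```python
import pandas as pd

# Hypothetical payload in the same shape as the API response:
# the first inner list holds the column names, the rest are data rows.
sample_json = [
    ["NAME", "SAEPOVRTALL_PT", "YEAR", "state"],
    ["Minnesota", "9.3", "2020", "27"],
    ["Wisconsin", "10.0", "2020", "55"],
]

# Row 0 becomes the header; rows 1..n become the data.
sample_df = pd.DataFrame(sample_json[1:], columns=sample_json[0])
print(sample_df)  # all values are still strings at this point
```

Because every value arrives as a string, the numeric columns need an explicit type conversion later on.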

In [19]:
### 1. Create dataframe for state names
### 2. Rename columns
### 3. Generating another dataframe by merging new dataframe with county poverty dataframe
### 4. Dropping any duplicates

state_name_df = poverty_state[['NAME','state']].rename(columns={'NAME': "STATE_NAME"})
poverty_county_state = state_name_df.merge(poverty_county, on='state')
poverty_county_state = poverty_county_state.drop_duplicates()
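
Because the state frame holds one row per state and year, the lookup contains repeated state codes and the merge produces repeated county rows that are dropped afterwards. One possible alternative, sketched here on hypothetical miniature frames, is to de-duplicate the lookup before merging:

```python
import pandas as pd

# Hypothetical miniature versions of the state and county frames.
state_rows = pd.DataFrame({
    "NAME": ["Minnesota", "Minnesota", "Wisconsin"],
    "YEAR": ["2019", "2020", "2020"],
    "state": ["27", "27", "55"],
})
county_rows = pd.DataFrame({
    "NAME": ["Aitkin County", "Adams County"],
    "YEAR": ["2020", "2020"],
    "state": ["27", "55"],
})

# One row per state code, so the merge cannot multiply county rows.
lookup = (
    state_rows[["NAME", "state"]]
    .drop_duplicates()
    .rename(columns={"NAME": "STATE_NAME"})
)
merged = lookup.merge(county_rows, on="state")
print(merged)
```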


In [20]:
### Dropping and renaming columns

poverty_state = poverty_state.drop(columns='state')
poverty_nat = poverty_nat.drop(columns='us')

poverty_state = poverty_state.rename(columns={"NAME": "STATE", "SAEPOVRTALL_PT": "PR_ALL", "SAEPOVRT0_17_PT": "PR_YOUTH", 'SAEMHI_PT': 'MED_HH_INCOME' })
poverty_nat = poverty_nat.rename(columns={"NAME": "STATE", "SAEPOVRTALL_PT": "PR_ALL", "SAEPOVRT0_17_PT": "PR_YOUTH", 'SAEMHI_PT': 'MED_HH_INCOME' })


poverty_county_state = poverty_county_state.drop(columns='state')
poverty_county_state = poverty_county_state.rename(columns={"NAME": "COUNTY", "SAEPOVRTALL_PT": "PR_ALL", "SAEPOVRT0_17_PT": "PR_YOUTH", 'SAEMHI_PT': 'MED_HH_INCOME' })
poverty_county_state = poverty_county_state.drop(columns='county')

In [21]:
### Changing the datatypes 

poverty_state = poverty_state.astype({'PR_ALL': float, 'PR_YOUTH': float, 'MED_HH_INCOME': float, 'YEAR': int})
poverty_nat = poverty_nat.astype({'PR_ALL': float, 'PR_YOUTH': float, 'MED_HH_INCOME': float, 'YEAR': int})
poverty_county_state = poverty_county_state.astype({'PR_ALL': float, 'PR_YOUTH': float, 'MED_HH_INCOME': float, 'YEAR': int})
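
`astype` raises an error as soon as one value cannot be parsed. If the API ever returned a placeholder instead of a number, `pd.to_numeric` with `errors="coerce"` would be one way to convert such entries to `NaN`. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical column containing a non-numeric placeholder.
raw = pd.Series(["9.3", "10.0", "N/A"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
rates = pd.to_numeric(raw, errors="coerce")
print(rates)
```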

In [22]:
### Filtering out certain values from the 'STATE' column

excluded_areas = ['Guam', 'Puerto Rico', 'Virgin Islands', 'District of Columbia']
poverty_state = poverty_state[~poverty_state['STATE'].isin(excluded_areas)]

## County Dataset with Latitude/Longitude/Population

**Extract**

Dataset obtained from: https://simplemaps.com/data/us-counties

In [24]:
### Reading in the CSV and saving the data in a dataframe

uscounties_df = pd.read_csv("../data/uscounties.csv")

**Transform**

In [25]:
### Uppercasing column headings

upper_columns(uscounties_df)

In [26]:
### Creating a new dataframe by merging LAT/LONG/POP with county poverty dataset

county_locs_df = poverty_county_state.merge(uscounties_df, how='left', left_on=['STATE_NAME', 'COUNTY'], right_on=['STATE_NAME', 'COUNTY_FULL'])
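
A left merge keeps county rows even when no location matches, so spelling differences between the two sources show up only as missing coordinates. A minimal sketch, on hypothetical miniature frames, of using the `indicator` option of `merge` to list the unmatched rows:

```python
import pandas as pd

# Hypothetical miniature frames; only the key columns matter here.
poverty = pd.DataFrame({
    "STATE_NAME": ["Minnesota", "Minnesota"],
    "COUNTY": ["Aitkin County", "Lac qui Parle County"],
})
locations = pd.DataFrame({
    "STATE_NAME": ["Minnesota"],
    "COUNTY_FULL": ["Aitkin County"],
    "LAT": [46.6],
})

checked = poverty.merge(
    locations,
    how="left",
    left_on=["STATE_NAME", "COUNTY"],
    right_on=["STATE_NAME", "COUNTY_FULL"],
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["STATE_NAME", "COUNTY"]])
```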

In [27]:
### Selecting the columns to keep (copied so later assignments do not touch a slice)

county_final = county_locs_df[['STATE_NAME','YEAR','PR_ALL','PR_YOUTH','MED_HH_INCOME','COUNTY_FULL','LAT','LNG','POPULATION']].copy()

In [28]:
### Removing population data from all years except 2020

county_final.loc[county_final['YEAR'] != 2020, 'POPULATION'] = None

## County Unemployment Rate Dataset


In [29]:
AL_UE_df = pd.read_csv("../data/A-L_unemployment.csv")
MZ_UE_df = pd.read_csv("../data/M-Z_unemployment.csv")

In [30]:
al_transpose = AL_UE_df.transpose()
mz_transpose = MZ_UE_df.transpose()

In [31]:
unemployment_df = pd.concat([al_transpose, mz_transpose], axis=0)

In [32]:
header = unemployment_df.iloc[0]
unemployment_df2 = unemployment_df[1:].copy()
unemployment_df2.columns = header
unemployment_df2['STATE_NAME'] = "Minnesota"


In [33]:
ue_df = unemployment_df2.rename(columns={ '2022 Annual Avg.' : '2022', '2021 Annual Avg.': '2021', '2020 Annual Avg.' : '2020',
       '2019 Annual Avg.': '2019', '2018 Annual Avg.': '2018', '2017 Annual Avg.': '2017',
       '2016 Annual Avg.': '2016', '2015 Annual Avg.': '2015', '2014 Annual Avg.': '2014',
       '2013 Annual Avg.': '2013', '2012 Annual Avg.': '2012', '2011 Annual Avg.': '2011',
       '2010 Annual Avg.' : '2010', '2009 Annual Avg.': '2009', '2008 Annual Avg.': '2008',
       '2007 Annual Avg.': '2007', '2006 Annual Avg.': '2006', '2005 Annual Avg.': '2005',
       '2004 Annual Avg.': '2004', '2003 Annual Avg.': '2003', '2002 Annual Avg.': '2002',
       '2001 Annual Avg.': '2001'})

In [34]:
### Row labels 1 through 88, used as the new index

num_list = list(range(1, 89))

In [35]:
ue_df['COUNTY_FULL'] = ue_df.index
ue_df.index = num_list

ue_df.columns

Index(['2022', '2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006', '2005',
       '2004', '2003', '2002', '2001', 'STATE_NAME', 'COUNTY_FULL'],
      dtype='object', name='Year/Month')

In [36]:
upper_columns(ue_df)

In [37]:
ue_df_unpivot = pd.melt(ue_df, id_vars=['COUNTY_FULL', 'STATE_NAME'], value_vars = ['2022', '2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006', '2005',
       '2004', '2003', '2002', '2001'])

In [38]:
ue_df_unpivot2 = ue_df_unpivot.rename(columns={'variable': 'YEAR', 'value': 'UE_RATE'})

In [39]:
ue_df_unpivot3 = ue_df_unpivot2[ue_df_unpivot2.COUNTY_FULL != 'Year/Month']

In [40]:
ue_final = ue_df_unpivot3.astype({'YEAR': int, 'UE_RATE': float})
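
The unpivot step turns one column per year into one row per county and year. A compact sketch of the same idea on a hypothetical two-county table, naming the output columns directly through `var_name` and `value_name`:

```python
import pandas as pd

# Hypothetical wide table: one row per county, one column per year.
wide = pd.DataFrame({
    "COUNTY_FULL": ["Aitkin County", "Anoka County"],
    "STATE_NAME": ["Minnesota", "Minnesota"],
    "2021": ["5.1", "3.4"],
    "2022": ["4.2", "2.5"],
})

# Every column not listed in id_vars is treated as a value column.
long_df = pd.melt(
    wide,
    id_vars=["COUNTY_FULL", "STATE_NAME"],
    var_name="YEAR",
    value_name="UE_RATE",
).astype({"YEAR": int, "UE_RATE": float})
print(long_df)
```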

## National Employment

**Extract**

Tables obtained from: https://www.bls.gov/oes/tables.htm

Tables created by BLS: U.S. Bureau of Labor Statistics

In [41]:
### Reading in Excel files and saving all data into dataframes

national2001_df = pd.read_excel("../data/national_M2001_dl.xls")
national2002_df = pd.read_excel("../data/national_M2002_dl.xls")
national2003_df = pd.read_excel("../data/national_M2003_dl.xls")
national2004_df = pd.read_excel("../data/national_M2004_dl.xls")
national2005_df = pd.read_excel("../data/national_M2005_dl.xls")
national2006_df = pd.read_excel("../data/national_M2006_dl.xls")
national2007_df = pd.read_excel("../data/national_M2007_dl.xls")
national2008_df = pd.read_excel("../data/national_M2008_dl.xls")
national2009_df = pd.read_excel("../data/national_M2009_dl.xls")
national2010_df = pd.read_excel("../data/national_M2010_dl.xls")
national2011_df = pd.read_excel("../data/national_M2011_dl.xls")
national2012_df = pd.read_excel("../data/national_M2012_dl.xls")
national2013_df = pd.read_excel("../data/national_M2013_dl.xls")
national2014_df = pd.read_excel("../data/national_M2014_dl.xlsx")
national2015_df = pd.read_excel("../data/national_M2015_dl.xlsx")
national2016_df = pd.read_excel("../data/national_M2016_dl.xlsx")
national2017_df = pd.read_excel("../data/national_M2017_dl.xlsx")
national2018_df = pd.read_excel("../data/national_M2018_dl.xlsx")
national2019_df = pd.read_excel("../data/national_M2019_dl.xlsx")
national2020_df = pd.read_excel("../data/national_M2020_dl.xlsx")
national2021_df = pd.read_excel("../data/national_M2021_dl.xlsx")
national2022_df = pd.read_excel("../data/national_M2022_dl.xlsx")
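
The file names follow a regular pattern, so the paths could also be generated rather than typed out. This is a minimal sketch; `oes_path` is a hypothetical helper, and the extension rule simply mirrors the calls above.

```python
YEARS = range(2001, 2023)


def oes_path(level, year):
    """Return the local path of one workbook, e.g. level='national'."""
    # Files through 2013 are .xls; later ones are .xlsx, as in the calls above.
    extension = "xls" if year <= 2013 else "xlsx"
    return f"../data/{level}_M{year}_dl.{extension}"


national_paths = {year: oes_path("national", year) for year in YEARS}
print(national_paths[2013], national_paths[2014])
```

A dictionary comprehension such as `{year: pd.read_excel(path) for year, path in national_paths.items()}` would then hold one dataframe per year.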

**Transform**

In [42]:
### Adding a 'Year' column to each dataframe

national2002_df['Year'] = 2002
national2003_df['Year'] = 2003
national2004_df['Year'] = 2004
national2005_df['Year'] = 2005
national2006_df['Year'] = 2006
national2007_df['Year'] = 2007
national2008_df['Year'] = 2008
national2009_df['Year'] = 2009
national2010_df['Year'] = 2010
national2011_df['Year'] = 2011
national2012_df['Year'] = 2012
national2013_df['Year'] = 2013
national2014_df['Year'] = 2014
national2015_df['Year'] = 2015
national2016_df['Year'] = 2016
national2017_df['Year'] = 2017
national2018_df['Year'] = 2018
national2019_df['Year'] = 2019
national2020_df['Year'] = 2020
national2021_df['Year'] = 2021
national2022_df['Year'] = 2022

In [43]:
### Uppercasing all column headings

upper_columns(national2001_df)
upper_columns(national2002_df)
upper_columns(national2003_df)
upper_columns(national2004_df)
upper_columns(national2005_df)
upper_columns(national2006_df)
upper_columns(national2007_df)
upper_columns(national2008_df)
upper_columns(national2009_df)
upper_columns(national2010_df)
upper_columns(national2011_df)
upper_columns(national2012_df)
upper_columns(national2013_df)
upper_columns(national2014_df)
upper_columns(national2015_df)
upper_columns(national2016_df)
upper_columns(national2017_df)
upper_columns(national2018_df)
upper_columns(national2019_df)
upper_columns(national2020_df)
upper_columns(national2021_df)
upper_columns(national2022_df)

In [44]:
### Renaming columns

n2001_df = national2001_df.rename(columns={ 'GROUP': 'OCC_GROUP', 'H_WPCT10' : 'H_PCT10', 'H_WPCT25' : 'H_PCT25', 'H_WPCT75' : 'H_PCT75', 'H_WPCT90' : 'H_PCT90', 'A_WPCT10' : 'A_PCT10', 'A_WPCT25' : 'A_PCT25', 'A_WPCT75' : 'A_PCT75', 'A_WPCT90' : 'A_PCT90'})
n2002_df = national2002_df.rename(columns={ 'GROUP': 'OCC_GROUP', 'H_WPCT10' : 'H_PCT10', 'H_WPCT25' : 'H_PCT25', 'H_WPCT75' : 'H_PCT75', 'H_WPCT90' : 'H_PCT90', 'A_WPCT10' : 'A_PCT10', 'A_WPCT25' : 'A_PCT25', 'A_WPCT75' : 'A_PCT75', 'A_WPCT90' : 'A_PCT90'})
n2003_df = national2003_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
n2004_df = national2004_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
n2005_df = national2005_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
n2006_df = national2006_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
n2007_df = national2007_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
n2008_df = national2008_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
n2009_df = national2009_df.rename(columns={'GROUP': 'OCC_GROUP'})
n2010_df = national2010_df.rename(columns={'LOC QUOTIENT': 'LOC_QUOTIENT', 'GROUP': 'OCC_GROUP'})
n2011_df = national2011_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT', 'GROUP': 'OCC_GROUP'})
n2012_df = national2012_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
n2013_df = national2013_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
n2014_df = national2014_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
n2015_df = national2015_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
n2016_df = national2016_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
n2017_df = national2017_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
n2018_df = national2018_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
n2019_df = national2019_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})
n2020_df = national2020_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})
n2021_df = national2021_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})
n2022_df = national2022_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})

In [45]:
### Filtering desired columns and assigning column order

n2001_ordered = n2001_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2002_ordered = n2002_df[[
    'OCC_CODE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2003_ordered = n2003_df[[
  'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2004_ordered = n2004_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2005_ordered = n2005_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2006_ordered = n2006_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2007_ordered = n2007_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2008_ordered = n2008_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2009_ordered = n2009_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2010_ordered = n2010_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2011_ordered = n2011_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2012_ordered = n2012_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2013_ordered = n2013_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2014_ordered = n2014_df[[
   'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2015_ordered = n2015_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2016_ordered = n2016_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2017_ordered = n2017_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2018_ordered = n2018_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
  'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2019_ordered = n2019_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2020_ordered = n2020_df[[
   'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2021_ordered = n2021_df[[
    'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
   'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
   'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
   'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
n2022_ordered = n2022_df[[
     'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]

In [46]:
### Concatenating all dataframes into one dataframe

national_df = pd.concat([n2001_ordered,n2002_ordered,n2003_ordered,n2004_ordered,
                         n2005_ordered,n2006_ordered, n2007_ordered,n2008_ordered,
                         n2009_ordered,n2010_ordered, n2011_ordered,n2012_ordered,
                         n2013_ordered,n2014_ordered,n2015_ordered,n2016_ordered,
                         n2017_ordered,n2018_ordered,n2019_ordered,n2020_ordered,
                         n2021_ordered,n2022_ordered ], axis=0)
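
The rename, select, and concatenate steps could also be expressed as one loop over the yearly frames. A minimal sketch on two hypothetical miniature frames; `RENAMES` and `WANTED` are illustrative names and cover only a few of the real columns.

```python
import pandas as pd

# Hypothetical per-year frames with slightly different column names.
frames = {
    2002: pd.DataFrame({"OCC_CODE": ["11-0000"], "GROUP": ["major"], "YEAR": [2002]}),
    2019: pd.DataFrame({
        "OCC_CODE": ["11-0000"],
        "OCC_TITLE": ["Management Occupations"],
        "O_GROUP": ["major"],
        "YEAR": [2019],
    }),
}
RENAMES = {"GROUP": "OCC_GROUP", "O_GROUP": "OCC_GROUP"}
WANTED = ["OCC_CODE", "OCC_TITLE", "OCC_GROUP", "YEAR"]

# rename() ignores keys that are absent; reindex() adds missing columns as NaN.
harmonized = [
    df.rename(columns=RENAMES).reindex(columns=WANTED) for df in frames.values()
]
combined = pd.concat(harmonized, axis=0, ignore_index=True)
print(combined)
```

Unlike selecting with a list of names, which raises a `KeyError` for a missing column, `reindex` fills the gap silently, so this variant trades strictness for brevity.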

In [47]:
### Filtering columns

national_df = national_df[[
    'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP', 'H_MEAN',
    'A_MEAN', 'H_MEDIAN', 'A_MEDIAN', 'YEAR']]


In [48]:
### Removing certain values from the dataframe

cols = list(national_df.columns)
for col in cols: 
    national_df = national_df[national_df[col] != '#']
    national_df = national_df[national_df[col] != '*']
    national_df = national_df[national_df[col] != '**']

In [49]:
### Changing datatypes

nat_final_df = national_df.astype({'TOT_EMP': int, 'H_MEAN': float, 'A_MEAN': float, 'H_MEDIAN': float, 'A_MEDIAN': float, 'YEAR': int})
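
The marker filter and the type conversion can be tried together on a small frame. This sketch uses hypothetical rows and expresses the filter with `DataFrame.isin` instead of a loop over columns:

```python
import pandas as pd

# Hypothetical rows; '**' and '#' stand in for the marker strings removed above.
sample = pd.DataFrame({
    "OCC_TITLE": ["Chief Executives", "Legislators", "Surgeons"],
    "TOT_EMP": [200000, "**", 35000],
    "H_MEAN": [95.0, 25.0, "#"],
})

markers = ["#", "*", "**"]
# Keep only rows where no column holds one of the marker strings.
clean = sample[~sample.isin(markers).any(axis=1)]
clean = clean.astype({"TOT_EMP": int, "H_MEAN": float})
print(clean)
```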

## State Employment

**Extract**

Tables obtained from: https://www.bls.gov/oes/tables.htm

Tables created by BLS: U.S. Bureau of Labor Statistics

In [50]:
### Reading in tables and storing data into different dataframes

state2001_df = pd.read_excel("../data/state_M2001_dl.xls")
state2002_df = pd.read_excel("../data/state_M2002_dl.xls")
state2003_df = pd.read_excel("../data/state_M2003_dl.xls")
state2004_df = pd.read_excel("../data/state_M2004_dl.xls")
state2005_df = pd.read_excel("../data/state_M2005_dl.xls")
state2006_df = pd.read_excel("../data/state_M2006_dl.xls")
state2007_df = pd.read_excel("../data/state_M2007_dl.xls")
state2008_df = pd.read_excel("../data/state_M2008_dl.xls")
state2009_df = pd.read_excel("../data/state_M2009_dl.xls")
state2010_df = pd.read_excel("../data/state_M2010_dl.xls")
state2011_df = pd.read_excel("../data/state_M2011_dl.xls")
state2012_df = pd.read_excel("../data/state_M2012_dl.xls")
state2013_df = pd.read_excel("../data/state_M2013_dl.xls")
state2014_df = pd.read_excel("../data/state_M2014_dl.xlsx")
state2015_df = pd.read_excel("../data/state_M2015_dl.xlsx")
state2016_df = pd.read_excel("../data/state_M2016_dl.xlsx")
state2017_df = pd.read_excel("../data/state_M2017_dl.xlsx")
state2018_df = pd.read_excel("../data/state_M2018_dl.xlsx")
state2019_df = pd.read_excel("../data/state_M2019_dl.xlsx")
state2020_df = pd.read_excel("../data/state_M2020_dl.xlsx")
state2021_df = pd.read_excel("../data/state_M2021_dl.xlsx")
state2022_df = pd.read_excel("../data/state_M2022_dl.xlsx")

**Transform**

In [51]:
### Creating a 'Year' column for each dataframe

state2002_df['Year'] = 2002
state2003_df['Year'] = 2003
state2004_df['Year'] = 2004
state2005_df['Year'] = 2005
state2006_df['Year'] = 2006
state2007_df['Year'] = 2007
state2008_df['Year'] = 2008
state2009_df['Year'] = 2009
state2010_df['Year'] = 2010
state2011_df['Year'] = 2011
state2012_df['Year'] = 2012
state2013_df['Year'] = 2013
state2014_df['Year'] = 2014
state2015_df['Year'] = 2015
state2016_df['Year'] = 2016
state2017_df['Year'] = 2017
state2018_df['Year'] = 2018
state2019_df['Year'] = 2019
state2020_df['Year'] = 2020
state2021_df['Year'] = 2021
state2022_df['Year'] = 2022

In [52]:
### Uppercasing all column headings in each dataframe

upper_columns(state2001_df)
upper_columns(state2002_df)
upper_columns(state2003_df)
upper_columns(state2004_df)
upper_columns(state2005_df)
upper_columns(state2006_df)
upper_columns(state2007_df)
upper_columns(state2008_df)
upper_columns(state2009_df)
upper_columns(state2010_df)
upper_columns(state2011_df)
upper_columns(state2012_df)
upper_columns(state2013_df)
upper_columns(state2014_df)
upper_columns(state2015_df)
upper_columns(state2016_df)
upper_columns(state2017_df)
upper_columns(state2018_df)
upper_columns(state2019_df)
upper_columns(state2020_df)
upper_columns(state2021_df)
upper_columns(state2022_df)

In [53]:
### Renaming columns

s2001_df = state2001_df.rename(columns={ 'GROUP': 'OCC_GROUP', 'H_WPCT10' : 'H_PCT10', 'H_WPCT25' : 'H_PCT25', 'H_WPCT75' : 'H_PCT75', 'H_WPCT90' : 'H_PCT90', 'A_WPCT10' : 'A_PCT10', 'A_WPCT25' : 'A_PCT25', 'A_WPCT75' : 'A_PCT75', 'A_WPCT90' : 'A_PCT90'})
s2002_df = state2002_df.rename(columns={ 'GROUP': 'OCC_GROUP', 'H_WPCT10' : 'H_PCT10', 'H_WPCT25' : 'H_PCT25', 'H_WPCT75' : 'H_PCT75', 'H_WPCT90' : 'H_PCT90', 'A_WPCT10' : 'A_PCT10', 'A_WPCT25' : 'A_PCT25', 'A_WPCT75' : 'A_PCT75', 'A_WPCT90' : 'A_PCT90'})
s2003_df = state2003_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
s2004_df = state2004_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
s2005_df = state2005_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
s2006_df = state2006_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
s2007_df = state2007_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
s2008_df = state2008_df.rename(columns={ 'GROUP': 'OCC_GROUP'})
s2009_df = state2009_df.rename(columns={'GROUP': 'OCC_GROUP'})
s2010_df = state2010_df.rename(columns={'LOC QUOTIENT': 'LOC_QUOTIENT', 'GROUP': 'OCC_GROUP'})
s2011_df = state2011_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT', 'GROUP': 'OCC_GROUP'})
s2012_df = state2012_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
s2013_df = state2013_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
s2014_df = state2014_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
s2015_df = state2015_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
s2016_df = state2016_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
s2017_df = state2017_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
s2018_df = state2018_df.rename(columns={'LOC_Q': 'LOC_QUOTIENT'})
s2019_df = state2019_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})
s2020_df = state2020_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})
s2021_df = state2021_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})
s2022_df = state2022_df.rename(columns={'O_GROUP': 'OCC_GROUP', 'AREA_TITLE': 'STATE'})


In [54]:
### Filtering columns and assigning column order

s2001_ordered = s2001_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2002_ordered = s2002_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2003_ordered = s2003_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2004_ordered = s2004_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2005_ordered = s2005_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2006_ordered = s2006_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2007_ordered = s2007_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE',  'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2008_ordered = s2008_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2009_ordered = s2009_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2010_ordered = s2010_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2011_ordered = s2011_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2012_ordered = s2012_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2013_ordered = s2013_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2014_ordered = s2014_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2015_ordered = s2015_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2016_ordered = s2016_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2017_ordered = s2017_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2018_ordered = s2018_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2019_ordered = s2019_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2020_ordered = s2020_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2021_ordered = s2021_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]
s2022_ordered = s2022_df[[
    'AREA', 'STATE', 'OCC_CODE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'EMP_PRSE', 'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN', 'MEAN_PRSE',
    'H_PCT10', 'H_PCT25', 'H_MEDIAN', 'H_PCT75', 'H_PCT90', 'A_PCT10',
    'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90', 'YEAR']]

In [55]:
### Concatenating all dataframes into one

state_df = pd.concat([s2001_ordered, s2002_ordered, s2003_ordered,s2004_ordered,s2005_ordered,s2006_ordered
                      ,s2007_ordered,s2008_ordered,s2009_ordered,s2010_ordered,s2011_ordered,s2012_ordered,
                      s2013_ordered,s2014_ordered,s2015_ordered,s2016_ordered,s2017_ordered,s2018_ordered,
                      s2019_ordered,s2020_ordered,s2021_ordered,s2022_ordered], axis=0)

In [56]:
### Filtering out certain values in the 'STATE' column

excluded_areas = ['Guam', 'Puerto Rico', 'Virgin Islands', 'District of Columbia']
state_df = state_df[~state_df['STATE'].isin(excluded_areas)]

In [57]:
### Filtering columns

state_df = state_df[[
    'AREA', 'STATE', 'OCC_TITLE', 'OCC_GROUP', 'TOT_EMP',
    'JOBS_1000', 'LOC_QUOTIENT', 'H_MEAN', 'A_MEAN',
    'H_MEDIAN', 'A_MEDIAN', 'YEAR']]

In [58]:
### Removing certain values

cols = list(state_df.columns)
for col in cols:
    state_df = state_df[state_df[col] != '#']
    state_df = state_df[state_df[col] != '*']
    state_df = state_df[state_df[col] != '**']

In [59]:
### Changing datatypes

state_final_df = state_df.astype({
    'TOT_EMP': int, 'JOBS_1000': float, 'LOC_QUOTIENT': float, 
    'H_MEAN': float, 'A_MEAN': float, 'H_MEDIAN': float, 
    'A_MEDIAN': float, 'YEAR': int})

## Poverty Threshold

**Extract**

Excel file obtained from: https://aspe.hhs.gov/topics/poverty-economic-mobility/poverty-guidelines

A portion of the Excel file was then saved as a CSV.

In [2]:
### Reading in the CSV file and storing the data into a dataframe

poverty_threshold_df = pd.read_csv('../data/guidelines-1983-2023.csv')

**Transform**

In [3]:
### Reassigning the column headers

poverty_threshold_df.columns = poverty_threshold_df.iloc[2]

In [4]:
### Dropping certain rows and columns

poverty_threshold_df = poverty_threshold_df.drop([0,1,2,44,45,46,47,48]).dropna(axis = 1).drop(columns = ['2 Persons', '3 Persons', '4 Persons', '5 Persons', '6 Persons', '7 Persons', '8 Persons', '$ For Each Additional Person (9+)'])

In [5]:
### Replacing certain strings within the values

poverty_threshold_df['1 Person'] = poverty_threshold_df['1 Person'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
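
In a regular expression `$` is the end-of-string anchor, so stripping a literal dollar sign is safest with `regex=False`. A minimal sketch on hypothetical sample strings in the same `$12,345` shape:

```python
import pandas as pd

# Hypothetical sample strings in the same '$12,345' shape as the source column.
amounts = pd.Series(["$4,860", "$14,580"])

# regex=False makes '$' a literal character rather than an end-of-string anchor.
cleaned = (
    amounts.str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(cleaned.tolist())
```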


## State Employment & State Poverty Merged

In [64]:
states_merged = state_df.merge(poverty_state, on=['STATE', 'YEAR'])

(639115, 15)

**Writing all dataframes to CSVs**

In [65]:
poverty_state.to_csv('../output/state_poverty.csv', index = False)
poverty_nat.to_csv('../output/national_poverty.csv', index = False)
nat_final_df.to_csv('../output/national_salary.csv', index = False)
state_final_df.to_csv('../output/state_salary.csv', index = False)
states_merged.to_csv('../output/state_poverty_and_income.csv', index = False)
county_final.to_csv('../output/county_poverty.csv', index = False)
poverty_threshold_df.to_csv('../output/poverty_threshold.csv', index = False)
ue_final.to_csv('../output/unemployment.csv', index = False)