## Unemployment Data Collection

The code for the unemployment data is reproduced and explained in this section.

First, import required packages:

In [1]:
import requests
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from os import listdir

# Bokeh provides state and county boundaries as lists of latitudes and longitudes.
# This will be useful for assigning oil/gas wells to a given county.
from bokeh.sampledata.us_counties import data as counties
from bokeh.sampledata.us_states import data as states

Next, read in the area codes published on the BLS website. The file was downloaded locally and is provided in the 'UnemploymentData' folder. It covers the entire country; rows whose area_type_code is 'F' are the counties, so we keep only those.

In [2]:
area_codes = pd.read_csv('BLS_AreaCodes.txt',sep='\t',index_col=False)
county_codes = area_codes[area_codes['area_type_code'] == 'F']
county_codes = county_codes.reset_index(drop=True)

Check the codes:

In [3]:
county_codes.head()

Unnamed: 0,area_type_code,area_code,area_text,display_level,selectable,sort_sequence
0,F,CN0100100000000,"Autauga County, AL",0,T,31
1,F,CN0100300000000,"Baldwin County, AL",0,T,32
2,F,CN0100500000000,"Barbour County, AL",0,T,33
3,F,CN0100700000000,"Bibb County, AL",0,T,34
4,F,CN0100900000000,"Blount County, AL",0,T,35


Create a dictionary based on the county codes:

In [4]:
county_names = dict(zip(county_codes['area_code'],county_codes['area_text']))

The county codes are arranged by state, and each state corresponds to a contiguous range of sort_sequence values. For the four states listed here, the ranges are:

Texas county codes: 6755-7008

North Dakota county codes: 5563-5615

Oklahoma county codes: 5881-5957

Wyoming county codes: 8139-8161

We create a dictionary for easy access to these codes:

In [5]:
state_sequences = {'tx':[6755,7008], 'nd':[5563,5615], 'ok':[5881,5957], 'wy':[8139,8161]}
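
As an illustration of how such a range picks out one state's counties, the sketch below filters a small, made-up table by sort_sequence. The table contents and the helper name counties_in_range are hypothetical; only the column names come from the BLS file shown above.

```python
import pandas as pd

# Hypothetical miniature of the county_codes table (all values are made up)
demo_codes = pd.DataFrame({
    'area_code': ['CN0000100000000', 'CN0000300000000',
                  'CN0000500000000', 'CN0000700000000'],
    'area_text': ['County A', 'County B', 'County C', 'County D'],
    'sort_sequence': [10, 11, 12, 13],
})

def counties_in_range(codes, first_seq, last_seq):
    """Return the rows whose sort_sequence lies in [first_seq, last_seq], inclusive."""
    mask = codes['sort_sequence'].between(first_seq, last_seq)
    return codes[mask].reset_index(drop=True)

demo_subset = counties_in_range(demo_codes, 11, 12)
print(demo_subset['area_text'].tolist())
```

Filtering on the values directly avoids any index arithmetic, at the cost of assuming that no other area shares a sort_sequence inside the range.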

The BLS reports monthly data with period codes such as 'M01' and 'M02'. We also create a dictionary that maps each code to the month name so the data is legible:

In [6]:
months = dict(zip(['M01','M02','M03','M04','M05','M06','M07','M08','M09','M10','M11','M12'],
                  ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']))

Now we can define the target states for which data is to be extracted:



In [7]:
target_states = ['tx','nd','wy'] # correspond to Texas, North Dakota and Wyoming, respectively

Define the starting and ending years for which to extract data:

In [8]:
start_year = 1990
end_year = 2017

Define your own API key, which is provided when you register on the BLS website:

In [9]:
bls_key = "62fa9141c8314e7eb68c043321bde37a"
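
One way to keep the key out of the notebook itself is to read it from an environment variable. This is only a sketch: the variable name BLS_API_KEY and the helper load_bls_key are hypothetical choices made for this example.

```python
import os

def load_bls_key(env_var='BLS_API_KEY'):
    """Return the key stored in the given environment variable, or None if it is unset."""
    # BLS_API_KEY is a hypothetical variable name chosen for this sketch
    return os.environ.get(env_var)

key_from_env = load_bls_key()
if key_from_env is None:
    print('No key found; set BLS_API_KEY before requesting data')
```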

Now we are ready to start extracting the data. We have extracted the data by state, and saved it for each state in separate files. Even with a registered key, the BLS API accepts at most 50 series (here, one series per county) and 20 years per request. We work around this limit by requesting each state's data 50 counties at a time, and at most 20 years at a time, in nested loops.

We extract two kinds of data: the unemployment rate, and the total workforce, for each county, for each state under consideration. The data is reported by month and year.
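
The batching described above can be sketched in isolation. The helper names chunk_list and year_windows and the placeholder series names are hypothetical, and the sketch assumes that both the start year and the end year of a request count toward the 20-year limit.

```python
def chunk_list(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def year_windows(first_year, last_year, span=20):
    """Split [first_year, last_year] into inclusive windows of at most `span` years."""
    windows = []
    for start in range(first_year, last_year + 1, span):
        windows.append((start, min(start + span - 1, last_year)))
    return windows

# Hypothetical placeholder names standing in for real series IDs
placeholder_ids = ['series_%03d' % n for n in range(120)]
batches = chunk_list(placeholder_ids, 50)

print([len(b) for b in batches])   # [50, 50, 20]
print(year_windows(1990, 2017))    # [(1990, 2009), (2010, 2017)]
```

Each (batch, window) pair then corresponds to one API request, so 120 series over 1990 to 2017 would need six requests.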

### Unemployment rate

In [10]:
headers = {'Content-type': 'application/json'}

# Getting data by state
for state in target_states:
    start_row = int(county_codes.index[county_codes['sort_sequence'] == state_sequences[state][0]][0])
    end_row = int(start_row + state_sequences[state][1] - state_sequences[state][0]) + 1
    
    # The BLS API requires the seriesId, which is a specific code for each county, and the kind of data which is being extracted
    series_id_list = list(map(lambda x: 'LAU' + x + '03', list(county_codes.iloc[start_row:end_row]['area_code'].values)))

    cur_state_data = {'SeriesId': [], 'Year': [], 'Period': [], 'Value': []}

    # must break up series list into lots of 50 for API
    for i in np.arange(0,len(series_id_list),50):
        series_end = min(i+50,len(series_id_list))
        cur_series_list = series_id_list[i:series_end]
        
        # Iterate over start and end dates (at most 20 years, counted inclusively, per request)
        for year in np.arange(start_year, end_year + 1, 20):
            cur_start_year = int(year)
            cur_end_year = int(min(year + 19, end_year))
            
            data = json.dumps({"seriesid": cur_series_list, "startyear": str(cur_start_year), "endyear": str(cur_end_year),"registrationkey":bls_key})
        
            # Send the actual data request
            p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
        
            # Load raw text data received from the API
            json_data = json.loads(p.text)

            # Collect every observation from each series returned in json_data
            for series in json_data['Results']['series']:
                seriesId = series['seriesID']
                for item in series['data']:
                    cur_state_data['SeriesId'].append(seriesId)
                    cur_state_data['Year'].append(item['year'])
                    cur_state_data['Period'].append(item['period'])
                    cur_state_data['Value'].append(item['value'])

    # Create data frame from current state data
    cur_state_df = pd.DataFrame.from_dict(cur_state_data)
    
    # Map the SeriesId so that the actual county is output, not a weird sounding code
    cur_state_df['CountyName'] = cur_state_df['SeriesId'].apply(lambda x: x[3:-2]).map(county_names)
    
    # Map the month using the months dictionary so that the actual month is reported
    cur_state_df['Month'] = cur_state_df['Period'].map(months)
    
    # Make a new column so that the month and year are combined into a single time column
    cur_state_df['Date'] = pd.to_datetime(cur_state_df['Month'] + cur_state_df['Year'].astype(str))
    cur_state_df = cur_state_df[['Date','CountyName','Value']]
    cur_state_df['Date'] = cur_state_df['Date'].dt.strftime('%m/%Y')
    
    # Pivot the data series so that the data for each county becomes its own time series
    cur_state_df = cur_state_df.pivot(index='Date',columns='CountyName',values='Value')
    
    # Write out the file for each state to the UnemploymentData directory
    cur_state_df.to_csv(state + '_unemployment.csv')
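
The loop above assumes every request succeeds. A minimal sketch of a more defensive parsing step follows, run on a made-up payload rather than a live request. The helper name response_to_frame and the sample values are hypothetical, and the check assumes that a successful reply carries a status field equal to 'REQUEST_SUCCEEDED'.

```python
import pandas as pd

# Hypothetical payload with the same fields the loop above reads; values are made up
demo_series_id = 'LAU' + 'CN0000100000000' + '03'
demo_response = {
    'status': 'REQUEST_SUCCEEDED',
    'Results': {'series': [
        {'seriesID': demo_series_id,
         'data': [{'year': '1990', 'period': 'M02', 'value': '5.1'},
                  {'year': '1990', 'period': 'M01', 'value': '5.3'}]},
    ]},
}

def response_to_frame(payload):
    """Flatten a BLS-style payload into one row per (series, year, period)."""
    # Assumption: a successful reply has status == 'REQUEST_SUCCEEDED'
    if payload.get('status') != 'REQUEST_SUCCEEDED':
        raise RuntimeError('BLS request failed: %s' % payload.get('message'))
    rows = []
    for series in payload['Results']['series']:
        for item in series['data']:
            rows.append({'SeriesId': series['seriesID'],
                         'Year': item['year'],
                         'Period': item['period'],
                         'Value': item['value']})
    frame = pd.DataFrame(rows)
    # Convert the string values to numbers; anything non-numeric becomes NaN
    frame['Value'] = pd.to_numeric(frame['Value'], errors='coerce')
    return frame

demo_frame = response_to_frame(demo_response)
print(demo_frame)
```

Failing loudly on a bad status makes a rejected request visible immediately, instead of surfacing later as a KeyError or as missing rows in the saved file.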


### Total workforce, by county


In [11]:
headers = {'Content-type': 'application/json'}

# Getting data by state
for state in target_states:
    start_row = int(county_codes.index[county_codes['sort_sequence'] == state_sequences[state][0]][0])
    end_row = int(start_row + state_sequences[state][1] - state_sequences[state][0])+1
    
    # The BLS API requires the seriesId, which is a specific code for each county, and the kind of data which is being extracted
    series_id_list = list(map(lambda x: 'LAU' + x + '06', list(county_codes.iloc[start_row:end_row]['area_code'].values)))

    cur_state_data = {'SeriesId': [], 'Year': [], 'Period': [], 'Value': []}

    # must break up series list into lots of 50 for API
    for i in np.arange(0,len(series_id_list),50):
        series_end = min(i+50,len(series_id_list))
        cur_series_list = series_id_list[i:series_end]
        
        # Iterate over start and end dates (at most 20 years, counted inclusively, per request)
        for year in np.arange(start_year, end_year + 1, 20):
            cur_start_year = int(year)
            cur_end_year = int(min(year + 19, end_year))
            
            data = json.dumps({"seriesid": cur_series_list, "startyear": str(cur_start_year), "endyear": str(cur_end_year),"registrationkey":bls_key})
        
        
            # Send the actual data request
            p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
        
            # Load raw text data received from the API
            json_data = json.loads(p.text)

            # Collect every observation from each series returned in json_data
            for series in json_data['Results']['series']:
                seriesId = series['seriesID']
                for item in series['data']:
                    cur_state_data['SeriesId'].append(seriesId)
                    cur_state_data['Year'].append(item['year'])
                    cur_state_data['Period'].append(item['period'])
                    cur_state_data['Value'].append(item['value'])

    # Create data frame from current state data
    cur_state_df = pd.DataFrame.from_dict(cur_state_data)
    
    # Map the SeriesId so that the actual county is output, not a weird sounding code
    cur_state_df['CountyName'] = cur_state_df['SeriesId'].apply(lambda x: x[3:-2]).map(county_names)
    
    # Map the month using the months dictionary so that the actual month is reported
    cur_state_df['Month'] = cur_state_df['Period'].map(months)
    
    # Make a new column so that the month and year are combined into a single time column
    cur_state_df['Date'] = pd.to_datetime(cur_state_df['Month'] + cur_state_df['Year'].astype(str))
    cur_state_df = cur_state_df[['Date','CountyName','Value']]
    cur_state_df['Date'] = cur_state_df['Date'].dt.strftime('%m/%Y')
    
    # Pivot the data series so that the data for each county becomes its own time series
    cur_state_df = cur_state_df.pivot(index='Date',columns='CountyName',values='Value')
    
    # Write out the file for each state to the UnemploymentData directory
    cur_state_df.to_csv(state + '_laborForce.csv')
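
The two loops differ only in the measure suffix appended to each area code ('03' for the unemployment rate, '06' for the labor force) and in the output file name. A sketch of how the series ID handling could be shared is given below; the names MEASURE_CODES, build_series_ids and series_id_to_area_code are hypothetical.

```python
# Measure suffixes used in this section: '03' unemployment rate, '06' labor force
MEASURE_CODES = {'unemployment': '03', 'laborForce': '06'}

def build_series_ids(area_codes, measure):
    """Build series IDs as 'LAU' + area code + measure suffix."""
    return ['LAU' + code + MEASURE_CODES[measure] for code in area_codes]

def series_id_to_area_code(series_id):
    """Strip the 'LAU' prefix and the two-character measure suffix."""
    return series_id[3:-2]

# Area code of Autauga County, AL, taken from the county table shown earlier
labor_ids = build_series_ids(['CN0100100000000'], 'laborForce')
print(labor_ids)
print(series_id_to_area_code(labor_ids[0]))
```

With helpers like these, a single function taking the measure name as an argument could replace the two near-identical cells.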


Now we check the data files output to make sure everything looks good:

In [12]:
tx_data = pd.read_csv('tx_unemployment.csv')
nd_data = pd.read_csv('nd_unemployment.csv')
wy_data = pd.read_csv('wy_unemployment.csv')

In [13]:
tx_data.head()

Unnamed: 0,Date,"Anderson County, TX","Andrews County, TX","Angelina County, TX","Aransas County, TX","Archer County, TX","Armstrong County, TX","Atascosa County, TX","Austin County, TX","Bailey County, TX",...,"Willacy County, TX","Williamson County, TX","Wilson County, TX","Winkler County, TX","Wise County, TX","Wood County, TX","Yoakum County, TX","Young County, TX","Zapata County, TX","Zavala County, TX"
0,01/1990,7.2,4.4,6.5,4.7,2.6,3.8,7.4,3.6,2.8,...,18.9,4.5,4.7,5.2,4.3,6.1,3.1,3.5,14.6,22.6
1,01/1991,6.8,3.2,7.2,4.7,2.6,4.6,7.2,3.2,4.2,...,20.3,3.3,4.2,5.1,4.8,6.0,3.9,3.5,14.7,24.1
2,01/1992,9.7,8.0,8.0,6.1,4.7,5.1,9.0,5.2,5.8,...,22.5,3.3,5.6,8.7,6.9,8.2,4.6,6.9,18.9,30.5
3,01/1993,7.6,8.3,8.2,6.9,3.1,4.9,7.2,4.2,9.6,...,25.3,2.9,4.0,10.4,6.6,7.3,7.8,7.9,19.1,25.3
4,01/1994,6.6,6.2,7.2,7.4,4.3,5.2,6.0,4.4,6.9,...,28.1,2.6,3.7,9.5,5.2,6.9,5.5,9.3,17.5,24.8


In [14]:
tx_data.isnull().sum().sum()

0

In [15]:
nd_data.head()

Unnamed: 0,Date,"Adams County, ND","Barnes County, ND","Benson County, ND","Billings County, ND","Bottineau County, ND","Bowman County, ND","Burke County, ND","Burleigh County, ND","Cass County, ND",...,"Slope County, ND","Stark County, ND","Steele County, ND","Stutsman County, ND","Towner County, ND","Traill County, ND","Walsh County, ND","Ward County, ND","Wells County, ND","Williams County, ND"
0,01/1990,2.1,4.7,9.7,8.7,5.2,2.9,3.7,5.1,3.6,...,2.9,4.9,2.7,4.8,4.7,3.9,5.4,5.5,9.0,4.3
1,01/1991,3.4,5.2,9.0,6.3,5.6,2.6,4.1,4.8,3.4,...,3.5,5.7,1.5,4.7,4.6,3.3,5.3,5.1,9.9,3.9
2,01/1992,3.2,5.6,12.5,8.1,4.9,4.7,4.9,5.1,3.8,...,5.1,6.8,2.4,5.7,4.7,3.4,5.2,5.6,10.3,6.1
3,01/1993,2.2,6.3,15.7,6.1,5.5,5.1,3.8,5.1,3.6,...,3.3,5.1,2.5,5.2,3.6,3.4,5.3,6.0,10.4,5.5
4,01/1994,1.9,4.9,11.2,2.9,4.0,3.6,3.2,4.9,3.2,...,2.9,5.4,1.8,4.8,2.4,3.0,6.8,5.3,8.9,6.4


In [16]:
nd_data.isnull().sum().sum()

0

In [17]:
wy_data.head()

Unnamed: 0,Date,"Albany County, WY","Big Horn County, WY","Campbell County, WY","Carbon County, WY","Converse County, WY","Crook County, WY","Fremont County, WY","Goshen County, WY","Hot Springs County, WY",...,"Niobrara County, WY","Park County, WY","Platte County, WY","Sheridan County, WY","Sublette County, WY","Sweetwater County, WY","Teton County, WY","Uinta County, WY","Washakie County, WY","Weston County, WY"
0,01/1990,5.8,7.2,5.3,6.7,8.8,5.4,11.1,5.0,5.1,...,5.6,7.2,7.5,7.2,3.9,6.7,2.5,8.0,5.4,6.3
1,01/1991,4.2,7.8,5.7,8.0,8.0,4.0,9.3,4.2,5.3,...,6.2,6.3,8.1,8.3,4.2,7.6,3.0,8.3,4.7,7.5
2,01/1992,3.6,7.9,6.5,8.2,7.6,4.7,10.1,4.5,5.7,...,7.2,6.6,7.8,8.4,5.6,7.9,4.9,9.7,5.3,8.1
3,01/1993,3.5,7.3,6.5,7.9,6.5,4.1,8.5,4.5,6.6,...,4.8,5.7,7.1,5.9,3.8,7.4,2.7,8.4,7.3,5.8
4,01/1994,3.2,7.9,6.0,7.3,7.5,5.1,9.6,4.4,6.2,...,3.9,6.0,7.5,7.5,4.7,6.7,2.7,9.3,6.6,7.8


In [18]:
wy_data.isnull().sum().sum()

0

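So none of the three unemployment files contains any null values. Note that the rows are ordered by the 'MM/YYYY' date string, which is why the first rows of each file are all Januaries.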
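
Because the dates are stored as 'MM/YYYY' strings, the rows sort by month before year. A minimal sketch of restoring chronological order follows; the demo frame and its values are made up, and only the date layout matches the saved files.

```python
import pandas as pd

# Hypothetical rows in the same 'MM/YYYY' layout as the saved files; values are made up
demo = pd.DataFrame({
    'Date': ['01/1990', '01/1991', '02/1990'],
    'County A': [5.3, 4.9, 5.1],
})

# Parse the strings into timestamps, then sort and index by date
demo['Date'] = pd.to_datetime(demo['Date'], format='%m/%Y')
demo_sorted = demo.sort_values('Date').set_index('Date')
print(demo_sorted)
```

The same two lines could be applied to tx_data, nd_data and wy_data before plotting them as time series.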

In [19]:
tx_laborForce = pd.read_csv('tx_laborForce.csv')
nd_laborForce = pd.read_csv('nd_laborForce.csv')
wy_laborForce = pd.read_csv('wy_laborForce.csv')

In [20]:
print(tx_laborForce.isnull().sum().sum())
print(nd_laborForce.isnull().sum().sum())
print(wy_laborForce.isnull().sum().sum())

0
0
0


In [21]:
tx_laborForce.head()

Unnamed: 0,Date,"Anderson County, TX","Andrews County, TX","Angelina County, TX","Aransas County, TX","Archer County, TX","Armstrong County, TX","Atascosa County, TX","Austin County, TX","Bailey County, TX",...,"Willacy County, TX","Williamson County, TX","Wilson County, TX","Winkler County, TX","Wise County, TX","Wood County, TX","Yoakum County, TX","Young County, TX","Zapata County, TX","Zavala County, TX"
0,01/1990,17854,6356,32084,7444,3826,953,12447,8992,3671,...,7075,77254,10354,3463,15603,12560,3996,8471,3084,5013
1,01/1991,17918,6389,32338,7509,3645,1076,13248,9131,3563,...,7063,79805,10327,3403,15700,12936,3986,8506,3122,5067
2,01/1992,18689,6432,33236,7520,3721,1072,14041,9371,3621,...,7385,84445,10786,3630,16856,13358,3996,8746,3475,5010
3,01/1993,18558,6064,33691,7937,3784,1037,14559,10136,3908,...,7805,91141,11247,3458,17265,14260,3867,8866,3643,4905
4,01/1994,18930,5989,33722,8465,3938,998,15176,10332,3547,...,7840,100224,12056,3306,18344,14026,3684,8813,3898,5033


In [22]:
nd_laborForce.head()

Unnamed: 0,Date,"Adams County, ND","Barnes County, ND","Benson County, ND","Billings County, ND","Bottineau County, ND","Bowman County, ND","Burke County, ND","Burleigh County, ND","Cass County, ND",...,"Slope County, ND","Stark County, ND","Steele County, ND","Stutsman County, ND","Towner County, ND","Traill County, ND","Walsh County, ND","Ward County, ND","Wells County, ND","Williams County, ND"
0,01/1990,1734,5923,2719,416,3684,1851,1254,32471,59221,...,561,11857,1210,11233,1588,3953,6089,25984,2517,9602
1,01/1991,1580,5571,2511,475,3455,1795,1220,33726,59750,...,433,11558,973,11080,1513,3733,6345,25459,2574,9921
2,01/1992,1495,5295,2508,406,3292,1731,1072,34350,60516,...,369,11652,984,10835,1457,3645,6060,25756,2393,9998
3,01/1993,1424,5311,2594,374,3245,1728,1084,35127,61361,...,359,11564,926,10790,1563,3619,6161,25932,2480,9847
4,01/1994,1646,5796,2784,312,3525,1766,1091,35829,63238,...,482,12394,1245,11203,1703,3918,6058,27499,2472,9898


In [23]:
wy_laborForce.head()

Unnamed: 0,Date,"Albany County, WY","Big Horn County, WY","Campbell County, WY","Carbon County, WY","Converse County, WY","Crook County, WY","Fremont County, WY","Goshen County, WY","Hot Springs County, WY",...,"Niobrara County, WY","Park County, WY","Platte County, WY","Sheridan County, WY","Sublette County, WY","Sweetwater County, WY","Teton County, WY","Uinta County, WY","Washakie County, WY","Weston County, WY"
0,01/1990,16669,4875,16214,8682,5729,2631,16046,5928,2382,...,1159,12169,3827,12010,2488,20164,7287,9603,4300,3348
1,01/1991,16636,4734,16308,8480,5567,2508,15699,5737,2392,...,1097,11781,3685,12039,2524,20705,7482,9989,4319,3293
2,01/1992,16289,4646,16649,8155,5809,2557,15955,5918,2376,...,1139,11991,3738,12083,2514,21210,7825,10198,4428,3178
3,01/1993,15122,4781,17094,8157,5974,2654,16079,6016,2445,...,1063,12131,3901,12680,2613,21090,8405,10444,4492,3198
4,01/1994,15979,5028,17714,8393,6118,2746,17003,6360,2482,...,1149,13191,3944,13061,2815,21345,8900,10405,4579,3311


So the labor force files also contain no null values, and the values look plausible.
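
A further check, beyond counting nulls, is whether each file has one column per county. The sketch below derives the expected counts from the sort_sequence ranges given earlier; the helper names and the demo frame are hypothetical.

```python
import pandas as pd

# Sort-sequence bounds copied from the state_sequences dictionary defined earlier
state_sequences = {'tx': [6755, 7008], 'nd': [5563, 5615],
                   'ok': [5881, 5957], 'wy': [8139, 8161]}

def expected_county_count(bounds):
    """Number of counties implied by an inclusive [first, last] sort_sequence range."""
    first_seq, last_seq = bounds
    return last_seq - first_seq + 1

def has_expected_counties(frame, state):
    """True if the frame has one column per county, ignoring the 'Date' column."""
    county_columns = [c for c in frame.columns if c != 'Date']
    return len(county_columns) == expected_county_count(state_sequences[state])

expected = {state: expected_county_count(b) for state, b in state_sequences.items()}
print(expected)  # {'tx': 254, 'nd': 53, 'ok': 77, 'wy': 23}

# Hypothetical frame with a Date column plus 23 made-up county columns
demo_wy = pd.DataFrame({'Date': ['01/1990']})
for n in range(23):
    demo_wy['County %d' % n] = [0.0]
print(has_expected_counties(demo_wy, 'wy'))
```

Applied to the real files, a call such as has_expected_counties(wy_data, 'wy') would be expected to return True if every Wyoming county came back from the API.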