# BLS Web Downloader - Tool 1 of 3 for accessing ATUS files

>This tool pulls flat files from the ATUS (American Time Use Survey) estimates website and converts them into a data dictionary that can be used to search for specific series IDs by characteristics.

>The data dictionary can be searched manually for specific characteristics, or you can use the BLS Series Selector (Tool 2) to choose characteristics interactively.

>The resulting list of series IDs can be used in the BLS API Connector (Tool 3) to pull data files from the BLS site. Tool 3 also cleans, organizes, and merges multiple data files into a single CSV, with an option to add standard error estimates from the aspect.txt flat file.


### Information about ATUS

The American Time Use Survey contains data from 2003 to 2018. However, the coding of demographic and time-use data changed after 2008, so it is suggested to start longitudinal studies from 2008. The series files are estimates made from the raw data files. They include annual averages for major activity categories and demographics.
    
To search for series, you can use the [BLS DataFinder](https://beta.bls.gov/dataQuery/find?fq=survey:[tu]&s=popularity:D), but I find it a little clunky, and I wanted to build a data dictionary that would let me see all the available estimates and search them by characteristics.

Step 1 - Get the flat files from ATUS. ATUS maintains several text files that contain basic dictionary information about their series data. This information is NOT in the dictionary PDF for the raw files.

Website for flat files: [ATUS Flat Files](https://download.bls.gov/pub/time.series/tu/)

1. From the ATUS database page, click on the Text Files icon. The download page for the ATUS Flat files will open.

2. The tu.txt file describes all the files in this directory.

3. The files tu.data.0.Current and tu.data.1.AllData contain the data for all ATUS data series available on LABSTAT.

4. The tu.series file contains the series ID, the series title, and characteristics of each series. The titles of the data series describe all characteristics relevant to each data series.

5. The tu.series file uses codes to describe individual demographic characteristics. For example, 'sex_code' == '1' indicates that the series is for men. However, the mappings for these codes are not in the tu.series file but in individual text files, one for each characteristic. This notebook combines the individual code files with the larger tu.series file to make a human-searchable dictionary.

### Before you begin, you should have some idea of what series you are interested in. The code below lets you see the columns in the tu.series file. I have saved it into the data/ directory

In [1]:
#Import modules
import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display
import requests
import urllib3, shutil

## If this is the first time you have used this toolbox, you will need to download the tu.series file. Once it is in your data folder, you should be able to open it without downloading it again.

In [4]:
#Run to download the tu.series file; it is saved locally as data/tu.txt
c = urllib3.PoolManager()
with c.request('GET', 'https://download.bls.gov/pub/time.series/tu/tu.series', preload_content=False) as res, open('data/tu.txt', 'wb') as out_file:
    shutil.copyfileobj(res, out_file)
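
The "download once, then reuse" idea described above could be sketched as a small helper. This is only an illustration under stated assumptions: `download_if_missing` and `fetch_to_file` are hypothetical names that are not part of the toolbox, and the sketch uses the standard library's `urllib.request` rather than the `urllib3` call in the cell above.

```python
from pathlib import Path
import shutil
import urllib.request

TU_SERIES_URL = 'https://download.bls.gov/pub/time.series/tu/tu.series'

def fetch_to_file(url, path):
    """Stream the response body for url into the file at path."""
    with urllib.request.urlopen(url) as res, open(path, 'wb') as out_file:
        shutil.copyfileobj(res, out_file)

def download_if_missing(url, path, fetch=fetch_to_file):
    """Download url to path only if path does not exist yet.

    Returns True when a download happened and False when the
    existing file was reused. (Hypothetical helper, not in the toolbox.)
    """
    path = Path(path)
    if path.exists():
        return False
    # Create the data/ folder on first use so open() does not fail.
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, path)
    return True

# Usage (needs network access):
# download_if_missing(TU_SERIES_URL, 'data/tu.txt')
```

Passing the fetch function as a parameter keeps the caching logic testable without touching the network.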



In [5]:
#open the downloaded tu.series file (saved as data/tu.txt) with pandas
lex = pd.read_csv('data/tu.txt', sep='\t') 

In [33]:
#this will let you see all the columns that are in the dataset and an example of what the coding looks like
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
lex.head(1)

Unnamed: 0,series_id,seasonal,stattype_code,datays_code,sex_code,region_code,lfstat_code,educ_code,maritlstat_code,age_code,orig_code,race_code,mjcow_code,nmet_code,where_code,sjmj_code,timeday_code,actcode_code,industry_code,occ_code,prhhchild_code,earn_code,disability_code,who_code,hhnscc03_code,schenr_code,prownhhchild_code,work_code,elnum_code,ecage_code,elfreq_code,eldur_code,elwho_code,ecytd_code,elder_code,lfstatw_code,pertype_code,series_title,footnote_codes,begin_year,begin_period,end_year,end_period
0,TUU10100AA01000007,U,10100,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,0,Number of persons - Employed,,2003,A01,2018,A01


In [34]:
#I printed this out to show that the series_id column name is padded with trailing spaces.
lex.columns

Index(['series_id                     ', 'seasonal', 'stattype_code',
       'datays_code', 'sex_code', 'region_code', 'lfstat_code', 'educ_code',
       'maritlstat_code', 'age_code', 'orig_code', 'race_code', 'mjcow_code',
       'nmet_code', 'where_code', 'sjmj_code', 'timeday_code', 'actcode_code',
       'industry_code', 'occ_code', 'prhhchild_code', 'earn_code',
       'disability_code', 'who_code', 'hhnscc03_code', 'schenr_code',
       'prownhhchild_code', 'work_code', 'elnum_code', 'ecage_code',
       'elfreq_code', 'eldur_code', 'elwho_code', 'ecytd_code', 'elder_code',
       'lfstatw_code', 'pertype_code', 'series_title', 'footnote_codes',
       'begin_year', 'begin_period', 'end_year', 'end_period'],
      dtype='object')
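
One possible way to deal with the padded column name is to strip whitespace from the headers right after loading. The sketch below uses a tiny made-up stand-in for tu.series; the padding on the ID value itself is an assumption of the toy data, not something shown in the output above.

```python
import io
import pandas as pd

# Tiny stand-in for tu.series: the first header (and, by assumption,
# its values) are padded with trailing spaces.
sample = (
    "series_id                     \tseasonal\tsex_code\n"
    "TUU10100AA01000007            \tU\t0\n"
)
toy = pd.read_csv(io.StringIO(sample), sep='\t')

# Strip whitespace from every column name.
toy.columns = toy.columns.str.strip()
# Strip the ID values as well so they can be matched exactly later.
toy['series_id'] = toy['series_id'].str.strip()

print(list(toy.columns))
print(repr(toy.loc[0, 'series_id']))
```

If you apply this to `lex`, any later cell that selects the column by its padded name would need to use `'series_id'` instead.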

I'm only interested in some categories, so I am going to pull those out to make the file smaller and easier to read.
* series_id - the ID that will be used in the API. Note the large number of trailing spaces in this column name!
* sex_code - denotes M/F - note that the US government is not hip with non-binary :(
* race_code - the race category
* educ_code - the level of education attained
* age_code - the age class of the individual
* pertype_code - weekdays or weekends
* series_title - just so we know what we are getting!
* earn_code - the earnings class
* actcode_code - the activity being done
* lfstat_code - employment information
* orig_code - Latinx is not listed in the race code; it is here
* stattype_code - what estimate was taken

The definitions of all codes can be found in [tu.txt](https://download.bls.gov/pub/time.series/tu/tu.txt).


In [35]:
#Copy and paste in any of the series columns you are interested in. Some do not offer any choice, like "elfreq_code"
#This makes the smaller dictionary we will use in the rest of the toolbox. It will make everything faster later.
lex_sm = lex[['series_id                     ', 'sex_code', 'race_code', 'educ_code', 'age_code', 'pertype_code', 'series_title', 'earn_code', 'actcode_code', 'lfstat_code', 'orig_code', 'stattype_code']]

lex_sm.reset_index(drop=True, inplace=True)




In [36]:
#check the new smaller file! You can hide the output by double clicking to the left of the output box.
lex_sm

Unnamed: 0,series_id,sex_code,race_code,educ_code,age_code,pertype_code,series_title,earn_code,actcode_code,lfstat_code,orig_code,stattype_code
0,TUU10100AA01000007,0,0,0,0,0,Number of persons - Employed,0,0,1,0,10100
1,TUU10100AA01000013,0,0,0,0,0,"Number of persons - Employed, Multiple jobholders",0,0,1,0,10100
2,TUU10100AA01000014,0,0,0,0,0,"Number of persons - Employed, Single jobholders",0,0,1,0,10100
3,TUU10100AA01000015,0,0,0,0,0,"Number of persons - Employed, Wage and salary workers",0,0,1,0,10100
4,TUU10100AA01000018,0,0,0,0,0,"Number of persons - Employed, Self-employed workers",0,0,1,0,10100
...,...,...,...,...,...,...,...,...,...,...,...,...
85274,TUU30107AA01019524,0,0,0,0,19,"Number participating on an avg day - Working at main job, Nonholiday weekdays, Employed, Self-employed workers",0,50101,1,0,30107
85275,TUU30107AA01054190,0,0,37,28,16,"Number participating on an avg day - Working, Weekend days and holidays, Employed, Bachelor's degree only, 25 yrs and over",0,50100,1,0,30107
85276,TUU30107AA01054191,0,0,37,28,19,"Number participating on an avg day - Working, Nonholiday weekdays, Employed, Bachelor's degree only, 25 yrs and over",0,50100,1,0,30107
85277,TUU30107AA01054197,0,0,38,28,16,"Number participating on an avg day - Working, Weekend days and holidays, Employed, Advanced degree, 25 yrs and over",0,50100,1,0,30107


In [37]:
#little assert to make sure we did not miss anything!
assert (len(lex_sm) == len(lex)), 'Missing Lines!'

## Get the code definitions flat files for all the columns we wanted
Now we will pull all the code mapping files and merge them into the dictionary.
The flat files all follow the same naming pattern: on the server each file is called "tu.codename", and this notebook saves it as "codename.txt". The codename matches the string before the _code in the dictionary file. For example, the code mapping file for "sex_code" is saved as "sex.txt". You can see all of them at [Flat Files](https://download.bls.gov/pub/time.series/tu/).

**By default they will be saved in the /data directory**
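
The naming rule above can be sketched as a small function that turns a code column name into the remote URL and the local filename. `mapping_file_for` is a hypothetical helper written for illustration; only the base URL and the `data/` folder come from this notebook.

```python
BASE_URL = 'https://download.bls.gov/pub/time.series/tu/tu.'

def mapping_file_for(column_name, base_url=BASE_URL, data_dir='data'):
    """Return (url, local_path) for a code column such as 'sex_code'."""
    suffix = '_code'
    if not column_name.endswith(suffix):
        raise ValueError(f'{column_name!r} is not a *_code column')
    # 'sex_code' -> 'sex'
    stem = column_name[:-len(suffix)]
    return base_url + stem, f'{data_dir}/{stem}.txt'

print(mapping_file_for('sex_code'))
# ('https://download.bls.gov/pub/time.series/tu/tu.sex', 'data/sex.txt')
```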

In [38]:
#define the base url
server_url = 'https://download.bls.gov/pub/time.series/tu/tu.' #all files start with tu
#list of files we want, used both to build the request URLs and to create the filenames
cat_list = ['actcode', 'age', 'aspect', 'earn', 'educ', 'lfstat', 'orig', 'pertype', 'race', 'sex', 'stattype']
c = urllib3.PoolManager()
def get_bls_write_bls(server_url = server_url, cat_list = cat_list):
    """This function takes the list of categories and iteratively turns them into
    URLs to the individual code maps, then uses them as the filenames for the output txt files.
    The resulting files are written to the data/ directory. Leave them there if you are using
    the other tools in the toolbox."""
    for cat in cat_list:
        url = server_url + cat
        filename = 'data/' + cat + '.txt'
        with c.request('GET', url, preload_content=False) as res, open(filename, 'wb') as out_file:
            shutil.copyfileobj(res, out_file)



In [39]:
#Run the function, unless you change the name of the server_url or cat_list variables, you can call it without arguments.
get_bls_write_bls() #ignore fun warnings about insecure requests



## Replacing the codes in the dictionary with usable text!
This section will take the individual maps and merge them with the dictionary.

In [41]:
#just in case
lex_replace = lex_sm.copy(deep=True)

In [42]:
#make it a string - needed for replacing later - we don't do any stats on this file, so it's okay
lex_replace = lex_replace.applymap(str)

In [43]:
#setting up the function
#List of the codes you want to replace with text - you will keep the original code column in the dictionary so we can do a little checking later
cat_list2 = ['actcode', 'age', 'earn', 'educ', 'lfstat', 'orig', 'pertype', 'race', 'sex', 'stattype']


In [44]:
def adding_text_to_dict(df=lex_replace, cat_list=cat_list2):#defaults are set to use the files set above
    """This function imports the individual code files and attempts to match them to column names in the dictionary file.
    If the column is found, it will add a new column with the textual description for the code. The column will be called
    codename_text. For example, the age_code column will be matched to a new age_text column."""
    for cat in cat_list:
        filename = 'data/' + cat + '.txt' #the code maps were saved to data/ by get_bls_write_bls
        var_name = cat + '_code'
        text_name = cat + '_text'
        coding = pd.read_csv(filename, sep='\t')
        coding = coding.iloc[:,:2] #keep only the code and its text description
        coding = coding.applymap(str)
        code_dict = coding.set_index(var_name).T.to_dict('list')
        df[text_name] = df[var_name].replace(code_dict)
        

In [45]:
adding_text_to_dict()#if using defaults, no need for inputs
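
To make the code-to-text step easier to follow, here is a toy version on a three-row frame. It is a sketch rather than the function above: it swaps in `Series.map` with a plain dict instead of `replace`, and the code '9' is made up to show what happens to an unmapped value. The two sex codes shown are the ones that appear in this notebook.

```python
import pandas as pd

# Toy code map shaped like a mapping file: code first, text second.
codes = pd.DataFrame({
    'sex_code': ['0', '1'],
    'sex_text': ['Both sexes', 'Men'],
})
lookup = dict(zip(codes['sex_code'], codes['sex_text']))

# Toy dictionary rows; '9' is a made-up code with no mapping.
toy = pd.DataFrame({'sex_code': ['0', '1', '9']})

# map() looks each code up in the dict; codes without an entry become NaN.
toy['sex_text'] = toy['sex_code'].map(lookup)
print(toy)
```

The design difference is worth knowing: `replace` leaves an unmatched code in place, while `map` turns it into NaN, which makes gaps in a mapping file easier to spot.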

In [47]:
#check!!! You should now see the new columns at the far right
lex_replace.head(5)

Unnamed: 0,series_id,sex_code,race_code,educ_code,age_code,pertype_code,series_title,earn_code,actcode_code,lfstat_code,orig_code,stattype_code,actcode_text,age_text,earn_text,educ_text,lfstat_text,orig_text,pertype_text,race_text,sex_text,stattype_text
0,TUU10100AA01000007,0,0,0,0,0,Number of persons - Employed,0,0,1,0,10100,"Total, all activities",15 years and over,All persons,All education levels,Employed,,All days,All races,Both sexes,Number of persons (in thousands)
1,TUU10100AA01000013,0,0,0,0,0,"Number of persons - Employed, Multiple jobholders",0,0,1,0,10100,"Total, all activities",15 years and over,All persons,All education levels,Employed,,All days,All races,Both sexes,Number of persons (in thousands)
2,TUU10100AA01000014,0,0,0,0,0,"Number of persons - Employed, Single jobholders",0,0,1,0,10100,"Total, all activities",15 years and over,All persons,All education levels,Employed,,All days,All races,Both sexes,Number of persons (in thousands)
3,TUU10100AA01000015,0,0,0,0,0,"Number of persons - Employed, Wage and salary workers",0,0,1,0,10100,"Total, all activities",15 years and over,All persons,All education levels,Employed,,All days,All races,Both sexes,Number of persons (in thousands)
4,TUU10100AA01000018,0,0,0,0,0,"Number of persons - Employed, Self-employed workers",0,0,1,0,10100,"Total, all activities",15 years and over,All persons,All education levels,Employed,,All days,All races,Both sexes,Number of persons (in thousands)


In [48]:
#write it to a file if you want. Keep track of this name if you want to use the next tool to search for series ids.
lex_replace.to_csv('data/with_replace.csv')

In [49]:
#code to check a few of the replacements to make sure they worked
#this section is just for checking; it can be modified to check for different values
#check the values against the original definition file you downloaded; you can open it in a reader or read it in here.
#I just used Atom; most of the files have very few codes

#setup a match column - I used age_code 31 and matched it to the correct text '25 to 34 years'
lex_replace.loc[(lex_replace['age_code'] == '31') & (lex_replace['age_text'] == '25 to 34 years'), 'match'] = 'Match'
lex_replace.loc[(lex_replace['age_code'] == '31') & (lex_replace['age_text'] != '25 to 34 years'), 'match'] = 'NoMatch'


In [50]:
#see if you got any NoMatch! nan means that the row did not meet either criteria
lex_replace['match'].unique()

array([nan, 'Match'], dtype=object)
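
The spot check above covers a single code. A broader check could compare every code/text pair against a mapping at once. The sketch below is an assumption-laden illustration: `mismatched_rows` is a hypothetical helper, and the toy frame deliberately contains one wrong row so there is something to find.

```python
import pandas as pd

def mismatched_rows(df, code_map, code_col, text_col):
    """Return rows whose text differs from what code_map gives for their code."""
    expected = df[code_col].map(code_map)
    # Only judge rows whose code actually appears in the map.
    known = expected.notna()
    return df[known & (df[text_col] != expected)]

# Toy rows shaped like lex_replace; the last row is deliberately wrong.
toy = pd.DataFrame({
    'age_code': ['0', '31', '31'],
    'age_text': ['15 years and over', '25 to 34 years', 'WRONG'],
})
age_map = {'0': '15 years and over', '31': '25 to 34 years'}

bad = mismatched_rows(toy, age_map, 'age_code', 'age_text')
print(bad)
```

An empty result would mean every known code was paired with the expected text.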

## You now have a searchable dictionary! If you want, you can move on to the BLS Series Selector to search for specific series.
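
For a quick manual search without the selector, a boolean filter on the text columns is one option. The frame below is a made-up stand-in for `lex_replace`: the series IDs are hypothetical, and only the column names and text labels follow this notebook.

```python
import pandas as pd

# Three rows shaped like lex_replace; the IDs are hypothetical.
toy = pd.DataFrame({
    'series_id': ['TUU0001', 'TUU0002', 'TUU0003'],
    'sex_text': ['Men', 'Both sexes', 'Men'],
    'age_text': ['25 to 34 years', '25 to 34 years', '15 years and over'],
})

# Keep only series for men aged 25 to 34.
mask = (toy['sex_text'] == 'Men') & (toy['age_text'] == '25 to 34 years')
wanted_ids = toy.loc[mask, 'series_id'].tolist()
print(wanted_ids)
# ['TUU0001']
```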