# Data Exploration

My objective for this visualization is to give an intuitive picture of how employee wages, particularly in Information Technology related fields, have changed over time. The primary dataset I will rely on is **National Employment, Hours, and Earnings** [(link to Kaggle)](https://www.kaggle.com/bls/employment#ce.series.csv) provided by _Current Employment Statistics (CES)_.

Before building the visualization, I need to explore the data and understand how the dataset is organized. This notebook walks through those steps.

In [155]:
import os

import pandas as pd
import requests

def download_ce(filename, dst='./ce_data/'):
    """
    Download a file from https://download.bls.gov/pub/time.series/ce/ into dst
    """
    url = f'https://download.bls.gov/pub/time.series/ce/{filename}'
    f = os.path.join(dst, filename)

    print('downloading from', url)
    r = requests.get(url)
    r.raise_for_status()  # fail loudly instead of saving an error page

    print('save to', f)
    os.makedirs(dst, exist_ok=True)  # make sure the target folder exists
    with open(f, 'wb') as out:
        out.write(r.content)

def load_ce(filename, dst='./ce_data/', force_download=False):
    """
    Load a ce file as a DataFrame; download it first if it does not exist locally
    """
    f = os.path.join(dst, filename)
    if force_download or not os.path.exists(f):
        download_ce(filename, dst)

    # Fields are tab-separated and padded with spaces; the regex strips the padding before each tab
    return pd.read_csv(f, sep=r'\s*\t', engine='python')
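
A note on the `sep=r'\s*\t'` pattern used in `load_ce`: because `\s` also matches a tab, two consecutive tabs (an empty field) can be consumed as a single separator, which shifts the remaining values one column to the left. The sketch below shows this with the standard `re` module, assuming pandas splits each line with the regex in the same way `re.split` does. The sample line and the alternative pattern `[ ]*\t` are illustrative assumptions, not taken from the BLS files.

```python
import re

# Hypothetical line: the third field is empty (two tabs in a row).
line = "CES5000000007    \t50\t\t1964\tM01"

# \s* may swallow the first of the two tabs, so the empty field disappears.
greedy = re.split(r"\s*\t", line)

# Treat only spaces as padding, so every tab still delimits one field.
strict = re.split(r"[ ]*\t", line)

print(greedy)  # ['CES5000000007', '50', '1964', 'M01']
print(strict)  # ['CES5000000007', '50', '', '1964', 'M01']
```

If the real files contain empty fields, this behavior could make values appear under the wrong column header, so it may be worth checking the parsed columns against the raw file.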

In [200]:
%mkdir ce_data
open('ce_data/touch', 'w').write('');

mkdir: ce_data: File exists


In [157]:
df_50a = load_ce('ce.data.50a.Information.Employment', force_download=True)
df_50a.head()

downloading from https://download.bls.gov/pub/time.series/ce/ce.data.50a.Information.Employment
save to ./ce_data/ce.data.50a.Information.Employment


Unnamed: 0,series_id,year,period,value,footnote_codes
0,CES5000000001,1939,M01,1112.0,
1,CES5000000001,1939,M02,1118.0,
2,CES5000000001,1939,M03,1126.0,
3,CES5000000001,1939,M04,1127.0,
4,CES5000000001,1939,M05,1125.0,


In [158]:
df_50b = load_ce('ce.data.50b.Information.AllEmployeeHoursAndEarnings')
df_50b.head()

Unnamed: 0,series_id,year,period,value,footnote_codes
0,CES5000000002,2006,M03,36.4,
1,CES5000000002,2006,M04,36.4,
2,CES5000000002,2006,M05,36.5,
3,CES5000000002,2006,M06,36.5,
4,CES5000000002,2006,M07,36.4,


In [159]:
df_50b.year.unique()

array([2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
       2017, 2018, 2019])

In [160]:
df_50c = load_ce('ce.data.50c.Information.ProductionEmployeeHoursAndEarnings')
df_50c.head()

Unnamed: 0,series_id,year,period,value,footnote_codes
0,CES5000000007,1964,M01,38.0,
1,CES5000000007,1964,M02,37.9,
2,CES5000000007,1964,M03,38.1,
3,CES5000000007,1964,M04,38.3,
4,CES5000000007,1964,M05,38.2,


In [192]:
df_50c[ df_50c['year'] == 1964 ].series_id.unique()

array(['CES5000000007', 'CES5000000008', 'CES5000000030', 'CES5000000031',
       'CES5000000032', 'CES5000000034', 'CES5000000035', 'CES5000000081',
       'CES5000000082', 'CEU5000000007', 'CEU5000000008', 'CEU5000000030',
       'CEU5000000031', 'CEU5000000032', 'CEU5000000034', 'CEU5000000035',
       'CEU5000000081', 'CEU5000000082'], dtype=object)

**Alright... but what do IDs like `CES5000000032` mean?**

In [170]:
_df_series = load_ce('ce.series')
def get_meaning_series(series_id):
    return _df_series[ _df_series['series_id'] == series_id ]

In [191]:
get_meaning_series('CES5000000007')

Unnamed: 0,series_id,supersector_code,industry_code,data_type_code,seasonal,series_title,footnote_codes,begin_year,begin_period,end_year,end_period
7364,CES5000000007,50,50000000,7,S,Average weekly hours of production and nonsupe...,1964,M01,2019,M09,


**And `supersector_code`? `industry_code`? `data_type_code`?**

In [189]:
_df_supersector = load_ce('ce.supersector')
_df_industry = load_ce('ce.industry')
_df_datatype = load_ce('ce.datatype')

def get_meaning_supersector(code):
    return _df_supersector[ _df_supersector['supersector_code'] == code ].supersector_name.iloc[0]
def get_meaning_industry(code):
    return _df_industry[ _df_industry['industry_code'] == code ]
def get_meaning_datatype(code):
    return _df_datatype[ _df_datatype['data_type_code'] == code ]

In [187]:
arr = df_50c[ df_50c['year'] == 1964 ].series_id.unique()
print('len', len(arr))
arr

len 18


array(['CES5000000007', 'CES5000000008', 'CES5000000030', 'CES5000000031',
       'CES5000000032', 'CES5000000034', 'CES5000000035', 'CES5000000081',
       'CES5000000082', 'CEU5000000007', 'CEU5000000008', 'CEU5000000030',
       'CEU5000000031', 'CEU5000000032', 'CEU5000000034', 'CEU5000000035',
       'CEU5000000081', 'CEU5000000082'], dtype=object)

In [190]:
arr = df_50c[ df_50c['year'] == 2018 ].series_id.unique()
print('len', len(arr))
arr[:30]

len 234


array(['CES5000000007', 'CES5000000008', 'CES5000000030', 'CES5000000031',
       'CES5000000032', 'CES5000000034', 'CES5000000035', 'CES5000000081',
       'CES5000000082', 'CES5051100007', 'CES5051100008', 'CES5051100030',
       'CES5051100031', 'CES5051100032', 'CES5051100034', 'CES5051100035',
       'CES5051100081', 'CES5051100082', 'CES5051110007', 'CES5051110008',
       'CES5051110030', 'CES5051110031', 'CES5051110032', 'CES5051110034',
       'CES5051110035', 'CES5051110081', 'CES5051110082', 'CES5051111007',
       'CES5051111008', 'CES5051111030'], dtype=object)

**Interesting: it looks like new `series` were added over time.**
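
One way to make this observation concrete is to count the distinct `series_id` values per year. The sketch below uses a small hand-made DataFrame as a stand-in for `df_50c`; the rows are toy data chosen only to illustrate the call, and `series_per_year` is a name introduced here.

```python
import pandas as pd

# Toy stand-in for df_50c: same column names, made-up rows.
toy = pd.DataFrame({
    "series_id": ["CES5000000007", "CES5000000007", "CES5000000008",
                  "CES5000000007", "CES5000000008", "CES5051100007"],
    "year":      [1964, 1964, 1964, 2018, 2018, 2018],
})

# Number of distinct series reported in each year.
series_per_year = toy.groupby("year")["series_id"].nunique()
print(series_per_year)
```

Applied to the real `df_50c`, the same call would be expected to give 18 for 1964 and 234 for 2018, matching the counts printed above.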

In [178]:
df = get_meaning_series(arr[15])
df

Unnamed: 0,series_id,supersector_code,industry_code,data_type_code,seasonal,series_title,footnote_codes,begin_year,begin_period,end_year,end_period
20382,CEU5000000035,50,50000000,35,U,Indexes of aggregate weekly payrolls of produc...,1964,M01,2019,M09,


In [175]:
print(df.industry_code.iloc[0])
get_meaning_industry(df.industry_code.iloc[0]) 

50511000


Unnamed: 0,industry_code,naics_code,publishing_status,industry_name,display_level,selectable,sort_sequence
520,50511000,511,A,"Publishing industries, except Internet",4,T,521


In [176]:
print(df.data_type_code.iloc[0])
get_meaning_datatype(df.data_type_code.iloc[0])

35


Unnamed: 0,data_type_code,data_type_text
28,35,INDEXES OF AGGREGATE WEEKLY PAYROLLS OF PRODUC...


In [182]:
print(df.supersector_code.iloc[0])
get_meaning_supersector(df.supersector_code.iloc[0]) 

50


'Information'
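
The lookups above suggest that a `series_id` is simply these codes concatenated: for `CES5000000007`, the row in `ce.series` lists seasonal `S`, industry code `50000000` (whose first two digits are the supersector `50`), and data type `7`. A minimal sketch that splits an ID on that assumption follows. The fixed widths (2 + 1 + 8 + 2 characters) and the helper name `parse_series_id` are inferred from the examples in this notebook, not from official documentation.

```python
from typing import NamedTuple

class SeriesId(NamedTuple):
    prefix: str            # survey prefix, 'CE' in every ID seen here
    seasonal: str          # 'S' or 'U'
    supersector_code: int  # first two digits of the industry code
    industry_code: int     # eight digits, supersector included
    data_type_code: int    # last two digits

def parse_series_id(series_id: str) -> SeriesId:
    """Split a 13-character CES series ID into its assumed components."""
    if len(series_id) != 13:
        raise ValueError(f"expected 13 characters, got {len(series_id)}")
    return SeriesId(
        prefix=series_id[:2],
        seasonal=series_id[2],
        supersector_code=int(series_id[3:5]),
        industry_code=int(series_id[3:11]),
        data_type_code=int(series_id[11:13]),
    )

print(parse_series_id("CES5000000007"))
```

If the assumption holds, this avoids a lookup in `ce.series` when only the codes are needed; the lookup is still required for the series title and the begin and end dates.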

## 2. Find Data of Interest

It looks like `ce.series` is the key table: it lets me find values (indexed by `series_id`) filtered by supersector, industry, time span, and so on. It should be a good starting point.

To begin with, I would like to know which `industries` belong to the `Information` supersector.

In [201]:
arr = df_50c[ df_50c['year'] == 2019 ].series_id.unique()
len(arr)

234

In [222]:
df = pd.concat([(get_meaning_series(s)) for s in arr[1:]])
df.head(3)

Unnamed: 0,series_id,supersector_code,industry_code,data_type_code,seasonal,series_title,footnote_codes,begin_year,begin_period,end_year,end_period
7365,CES5000000008,50,50000000,8,S,Average hourly earnings of production and nons...,1964,M01,2019,M09,
7372,CES5000000030,50,50000000,30,S,Average weekly earnings of production and nons...,1964,M01,2019,M09,
7373,CES5000000031,50,50000000,31,S,Average weekly earnings of production and nons...,1964,M01,2019,M09,


In [218]:
df.industry_code.unique()

array([50000000, 50511000, 50511100, 50511110, 50511120, 50511200,
       50512000, 50515000, 50515110, 50517000, 50517300, 50518000,
       50519000])

In [233]:
df_industries_2019 = pd.concat([get_meaning_industry(i) for i in df.industry_code.unique()])
print(len(df_industries_2019))
df_industries_2019

13


Unnamed: 0,industry_code,naics_code,publishing_status,industry_name,display_level,selectable,sort_sequence
519,50000000,51,A,Information,2,T,520
520,50511000,511,A,"Publishing industries, except Internet",4,T,521
521,50511100,5111,A,"Newspaper, book, and directory publishers",5,T,522
522,50511110,51111,A,Newspaper publishers,6,T,523
523,50511120,51112,A,Periodical publishers,6,T,524
526,50511200,5112,A,Software publishers,5,T,527
527,50512000,512,A,Motion picture and sound recording industries,4,T,528
530,50515000,515,A,"Broadcasting, except Internet",4,T,531
532,50515110,51511,A,Radio broadcasting,6,T,533
535,50517000,517,A,Telecommunications,4,T,536


Or, find all the industries that have ever existed in supersector 50 (Information):

In [223]:
_df_series.head(3)

Unnamed: 0,series_id,supersector_code,industry_code,data_type_code,seasonal,series_title,footnote_codes,begin_year,begin_period,end_year,end_period
0,CES0000000001,0,0,1,S,"All employees, thousands, total nonfarm, seaso...",1939,M01,2019,M09,
1,CES0000000010,0,0,10,S,"Women employees, thousands, total nonfarm, sea...",1964,M01,2019,M09,
2,CES0000000025,0,0,25,S,"All employees, quarterly averages, seasonally ...",1939,M03,2019,M09,


In [237]:
arr = _df_series[ _df_series.supersector_code == 50 ].industry_code.unique()
arr

array([50000000, 50511000, 50511100, 50511110, 50511120, 50511130,
       50511190, 50511200, 50512000, 50512110, 50512130, 50515000,
       50515100, 50515110, 50515120, 50515200, 50517000, 50517300,
       50517311, 50517312, 50517900, 50517911, 50518000, 50519000,
       50519130, 50519190])

In [238]:
df_industries_all = pd.concat([get_meaning_industry(i) for i in arr])
print(len(df_industries_all))
df_industries_all

26


Unnamed: 0,industry_code,naics_code,publishing_status,industry_name,display_level,selectable,sort_sequence
519,50000000,51,A,Information,2,T,520
520,50511000,511,A,"Publishing industries, except Internet",4,T,521
521,50511100,5111,A,"Newspaper, book, and directory publishers",5,T,522
522,50511110,51111,A,Newspaper publishers,6,T,523
523,50511120,51112,A,Periodical publishers,6,T,524
524,50511130,51113,E,Book publishers,6,T,525
525,50511190,511149,C,"Directory, mailing list, and other publishers",6,T,526
526,50511200,5112,A,Software publishers,5,T,527
527,50512000,512,A,Motion picture and sound recording industries,4,T,528
528,50512110,51211,C,Motion picture and video production,6,T,529


In [236]:
print(f"There are {len(df_industries_all) - len(df_industries_2019)} industries not in the 2019 stats.")

There are 13 industries not in the 2019 stats.
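
To see which industries those are, rather than only how many, the two code lists can be compared as sets. The sketch below hard-codes the arrays printed earlier in this notebook; `missing_in_2019` is a name introduced here for illustration. Note that "2019" here refers only to the series found in the production-employee file (`df_50c`).

```python
# Industry codes ever listed under supersector 50 in ce.series (printed above).
codes_all = [50000000, 50511000, 50511100, 50511110, 50511120, 50511130,
             50511190, 50511200, 50512000, 50512110, 50512130, 50515000,
             50515100, 50515110, 50515120, 50515200, 50517000, 50517300,
             50517311, 50517312, 50517900, 50517911, 50518000, 50519000,
             50519130, 50519190]

# Industry codes found in the 2019 rows of df_50c (printed above).
codes_2019 = [50000000, 50511000, 50511100, 50511110, 50511120, 50511200,
              50512000, 50515000, 50515110, 50517000, 50517300, 50518000,
              50519000]

# Codes present overall but absent from the 2019 rows.
missing_in_2019 = sorted(set(codes_all) - set(codes_2019))
print(len(missing_in_2019))
print(missing_in_2019)
```

Each code in the result can then be passed to `get_meaning_industry` to look up its name.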


**OK, with all these industries in the Information supersector, what kinds of data does the dataset provide?**

According to `ce.txt`:

          ...

          a - Employment - includes all employment datatypes                                 
                      (Datatype codes shown in parentheses):                                  

                  *  All employees (01)                                                       
                  *  Production or nonsupervisory employees (06)                              
                  *  Women employees (10)                                                     
                  *  1-month diffusion index (21)                                             
                  *  3-month diffusion index (22)                                             
                  *  6-month diffusion index (23)                                             
                  *  12-month diffusion index (24)                                            
                  *  All employee quarterly average (25)                                      
                  *  3-month moving average change (26)                                       

           b - Hours and earnings of all employees - includes all hours and earnings          
                     datatypes for all employee payroll: (Data type numbers shown in          
                     parentheses)                                                             

                  *  Average weekly hours for all employees (02)                              
                  *  Average hourly earnings for all employees (03)                           
                  *  Average overtime hours of all employees (04)                             
                  *  Average weekly earnings of all employees (11)                            
                  *  Average weekly earnings of all employees $82-84 (12)                  
                  *  Average hourly earnings of all employees $82-84 (13)                     
                  *  Average hourly earnings of all employees excluding overtime (15)         
                  *  Index of aggregate weekly hours of all employees, 2007=100 (16)          
                  *  Index of aggregate weekly payrolls of all employees, 2007=100 (17)       
                  *  Quarterly average weekly hours of all employees (19)                     
                  *  Quarterly average weekly overtime hours of all employees (20)            
                  *  Aggregate weekly hours of all employees (56)                             
                  *  Aggregate weekly payrolls of all employees (57)                          
                  *  Aggregate weekly overtime hours of all employees (58)                    

            c - Production employee hours and earnings - includes all hours and               
                     earnings data types for production or nonsupervisory                     
                     employees: (Data type numbers shown in parentheses)                      

                  *  Average weekly hours for production employees (07)                       
                  *  Average hourly earnings for production employees (08)                    
                  *  Average weekly overtime hours of production employees (09)               
                  *  Average weekly earnings of production employees (30)                     
                  *  Average weekly earnings of production employees, $82-84 (31)             
                  *  Average hourly earnings of production employees, $82-84 (32)             
                  *  Average hourly earnings of production employees excluding                
                     overtime (33)                                                            
                  *  Index of aggregate hours of production employees, 2002=100 (34)          
                  *  Index of aggregate payrolls of production employees, 2002=100 (35)       
                  *  Quarterly average weekly hours of production employees  (36)             
                  *  Quarterly average weekly overtime hours of production                    
                     employees (37)                                                           
                  *  Aggregate weekly hours of production employees (81)                      
                  *  Aggregate weekly payrolls of production employees (82)                   
                  *  Aggregate weekly overtime hours of production employees (83)  
                  
             ...

**Let's look at the number of employees over time**

This corresponds to data type code `01` (All employees).

In [240]:
df_50a.head()

Unnamed: 0,series_id,year,period,value,footnote_codes
0,CES5000000001,1939,M01,1112.0,
1,CES5000000001,1939,M02,1118.0,
2,CES5000000001,1939,M03,1126.0,
3,CES5000000001,1939,M04,1127.0,
4,CES5000000001,1939,M05,1125.0,
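
To plot these values over time, `year` and `period` need to be combined into a single date. A minimal sketch follows, assuming periods `M01` to `M12` denote calendar months and that any other code should be skipped (for example an annual-average `M13`, which is an assumption about the BLS convention and does not appear in the rows above). `period_to_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import date
from typing import Optional

def period_to_date(year: int, period: str) -> Optional[date]:
    """Map (year, 'M01'..'M12') to the first day of that month, else None."""
    if len(period) == 3 and period[0] == "M" and period[1:].isdigit():
        month = int(period[1:])
        if 1 <= month <= 12:
            return date(year, month, 1)
    return None

print(period_to_date(1939, "M01"))  # 1939-01-01
print(period_to_date(1939, "M13"))  # None
```

With pandas, the helper could be applied row by row to build a date column before plotting, dropping the rows where it returns `None`.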
