The code in this notebook computes the two measures below.

- **Measure 3:** How many unique job titles are there in the firm?
    1. For each firm-week, firm-year and firm-full-sample, calculate the number of unique values of each title variable below. This should give 9 variables per firm: unique counts that don't vary within the firm, that vary within the firm each year, or that vary within the firm each week:
        - CleanJobTitle
        - ConsolidatedJobTitle
        - CanonJobTitle
- **Measure 4:** Managerial intensity?
    1. For each firm-location-week-occupationbelow, calculate the number of job ads with the following job codes:
    2. For each firm-location-week-alloccups, calculate also the number of job ads with the following job code:
        - Everything starting with ‘11’
    3. For each firm-location-week, calculate also the total number of job ads posted by that firm regardless of the occupation

In [18]:
import dask, dask.dataframe as dd, dask.array as da
from dask.diagnostics import ProgressBar
import matplotlib.pyplot as plt
import pandas as pd
import re, csv, os
import numpy as np
from dask import delayed, persist
from dask.distributed import Client
from glob import glob
from typing import List, Union


pd.set_option('display.max_columns', None)
csv.field_size_limit(10000000)

%matplotlib inline

In [19]:
path = '/Users/ramonperez/Dropbox/Burning Glass/Data/companies_76k/filtered_data_14/'
path_out = '/Users/ramonperez/Dropbox/Burning Glass/Analysis/approach_8/data_14'

In [20]:
col_names = ['JobID', 'CleanJobTitle', 'CanonCity', 'CanonState', 'CanonJobTitle', 'CanonCounty', 
             'BGTOcc', 'CanonPostalCode', 'ConsolidatedTitle', 'EmployerClean', 'JobDate']

# `np.str` was only an alias for the built-in `str` and has been removed from recent NumPy releases.
dtypes={'CanonSkills': str, 'Latitude': np.float32, 'JobID': str, 'CanonJobTitle': str,
        'CanonYearsOfExperienceLevel': str, 'Longitude': np.float32, 'CanonJobType': str, 
        'CleanJobTitle': str, 'ConsolidatedInferredNAICS': str, 'CanonRequiredDegrees': str,
        'YearsOfExperience': str, 'CanonCity': str, 'CanonCounty': str, 'CanonJobHours': str,
        'CanonState': str, 'ConsolidatedONET': str, 'MSA': str, 'CanonMinimumDegree': str,
        'ConsolidatedDegreeLevels': str, 'BGTSubOcc': str, 'ConsolidatedTitle': str,
        'CanonSkillClusters': str, 'Language': str, 'JobDate': str,
        'MinDegreeLevel': str, 'LMA': str, 'MinExperience': str, 'CanonPostalCode': str,
        'InternshipFlag': np.bool_, 'Source': str, 'BGTOcc': str, 'CanonYearsOfExperienceCanonLevel': str}

In [21]:
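# Read every matching CSV lazily, keeping only the columns listed in col_names.
ddf = dd.read_csv(os.path.join(path, 'da*.csv'), 
                 engine='python',
                 dtype=dtypes,
                 assume_missing=True,
                 # Skip malformed rows. pandas >= 1.3 deprecates this flag in favour of
                 # on_bad_lines='warn' (skip and report) or on_bad_lines='skip'.
                 error_bad_lines=False,
                 blocksize=None,  # one partition per file
                 usecols=col_names,
                )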
ddf

Skipping line 59: unexpected end of data


Dask DataFrame structure: npartitions=50; every column has dtype object.
Columns: CanonCity, CanonState, CleanJobTitle, JobDate, JobID, CanonPostalCode, CanonCounty, CanonJobTitle, BGTOcc, ConsolidatedTitle, EmployerClean


## Measure 3

How many unique job titles are there in the firm?
1. For each firm-week, firm-year and firm-full-sample, calculate the number of unique values of each title variable below. This should give 9 variables per firm: unique counts that don't vary within the firm, that vary within the firm each year, or that vary within the firm each week:
    - CleanJobTitle
    - ConsolidatedJobTitle
    - CanonJobTitle

For this measure we will need to create a couple of additional variables, years and weeks, in order to use them in our `.groupby()` call.

## You have changed ddf

In [5]:
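# Parse JobDate into datetimes, then derive the year and week-of-year columns used for grouping.
JobDate = dd.to_datetime(ddf['JobDate'])
ddf6 = ddf.assign(JobDate=JobDate)
# Note: `.dt.week` is deprecated in newer pandas releases, which use `.dt.isocalendar().week` instead.
weeks = ddf6['JobDate'].dt.week
years = ddf6['JobDate'].dt.year
ddf7 = ddf6.assign(weeks=weeks, years=years)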
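
One thing to keep in mind when grouping on `weeks` alone is that the week number repeats every year, so postings from week 5 of different years fall into the same group. The sketch below uses plain pandas on made-up data (the `toy` frame and its values are assumptions, not taken from the dataset) to show the difference between grouping by week only and by year plus week.

```python
import pandas as pd

# Hypothetical toy data: three postings, all in ISO week 5, spread over two years.
toy = pd.DataFrame({
    'EmployerClean': ['Acme', 'Acme', 'Acme'],
    'JobDate': ['2010-02-03', '2011-02-02', '2011-02-03'],
})
toy['JobDate'] = pd.to_datetime(toy['JobDate'])

# isocalendar() returns a frame with 'year', 'week' and 'day' columns.
iso = toy['JobDate'].dt.isocalendar()
toy['years'] = toy['JobDate'].dt.year
toy['weeks'] = iso['week'].astype('int64')

# Week alone pools the two years together; year + week keeps them apart.
by_week_only = toy.groupby(['EmployerClean', 'weeks']).size()
by_year_week = toy.groupby(['EmployerClean', 'years', 'weeks']).size()

print(by_week_only)
print(by_year_week)
```

Whether weeks should be pooled across years depends on how the firm-week measure is meant to be read; if each calendar week is its own period, adding `years` to the grouping keys is one way to get there.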

We first drop rows that repeat a `CleanJobTitle`, recode 'Unknown' as missing, and then group the de-duplicated data by firm (and by year or week). Within each group we count the non-missing values of each job title variable, and reset the index so the grouping keys become ordinary columns instead of a multi-level index.

In [6]:
ddf_deduplicated = ddf7.drop_duplicates(subset=['CleanJobTitle'])
ddf_deduplicated2 = ddf_deduplicated[['EmployerClean', 'CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle', 'years', 'weeks']].replace('Unknown', np.nan)
firm_full_sample = ddf_deduplicated2.groupby('EmployerClean')[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
firm_year = ddf_deduplicated2.groupby(['EmployerClean', 'years'])[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
firm_week = ddf_deduplicated2.groupby(['EmployerClean', 'weeks'])[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
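
Note that `drop_duplicates(subset=['CleanJobTitle'])` de-duplicates across the whole dataset, so a title that appears at several firms survives only in the first firm where it is seen. As a point of comparison, here is a minimal pandas sketch that counts distinct titles inside each group with `nunique()` instead. This is an alternative technique, not the notebook's approach, and the `toy` data is invented for illustration; Dask's series group-by also offers `nunique()`, though performance on the full dataset is not something this sketch speaks to.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'Analyst' is posted by both firms.
toy = pd.DataFrame({
    'EmployerClean': ['Acme', 'Acme', 'Acme', 'Bolt', 'Bolt'],
    'CleanJobTitle': ['Analyst', 'Analyst', 'Manager', 'Analyst', 'Unknown'],
    'years': [2010, 2010, 2011, 2010, 2010],
}).replace('Unknown', np.nan)

# Global de-duplication followed by count(): Bolt loses its 'Analyst' row
# because Acme's 'Analyst' row came first.
dedup_counts = (toy.drop_duplicates(subset=['CleanJobTitle'])
                   .groupby('EmployerClean')['CleanJobTitle'].count())

# Distinct non-missing titles within each firm, and within each firm-year.
unique_per_firm = toy.groupby('EmployerClean')['CleanJobTitle'].nunique()
unique_per_firm_year = toy.groupby(['EmployerClean', 'years'])['CleanJobTitle'].nunique()

print(dedup_counts)
print(unique_per_firm)
print(unique_per_firm_year)
```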

If you want to check the result from the above, you can run the computations by uncommenting the cell below.

In [7]:
%%time

# firm_full_sample, firm_year, firm_week = dask.compute(firm_full_sample, firm_year, firm_week)

CPU times: user 4 µs, sys: 3 µs, total: 7 µs
Wall time: 27.2 µs


In [8]:
# firm_week.head(5)

## Measure 4

Managerial intensity

### Part 1

For each firm-location-week-occupation, calculate the number of job ads with the managerial job codes available in the dataset.

In [9]:
occu_condition = ddf7['BGTOcc'].str.startswith('11')
managers_dummy_df = ddf7.assign(managerial_occu=occu_condition)
managers_only_df = managers_dummy_df[managers_dummy_df['managerial_occu'] == True]
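
The prefix test behaves in a particular way when an occupation code is missing: `str.startswith` returns a missing value rather than `False`, which is why the `== True` comparison above is doing real work. A small pandas sketch (the example codes are made up) shows this, along with the `na=False` argument as one way to get a clean boolean flag directly.

```python
import numpy as np
import pandas as pd

# Hypothetical occupation codes, one of them missing.
codes = pd.Series(['11-1021', '15-1132', np.nan, '11-9199'], name='BGTOcc')

raw_flag = codes.str.startswith('11')               # missing code stays missing
clean_flag = codes.str.startswith('11', na=False)   # missing code becomes False

# A pure boolean mask can be used for filtering without comparing to True.
managers = codes[clean_flag]

print(raw_flag.tolist())
print(clean_flag.tolist())
print(managers.tolist())
```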

In [10]:
managers_only_df1 = managers_only_df[['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 
                                      'weeks', 'BGTOcc', 'CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].replace('Unknown', np.nan)
managers_group1 = managers_only_df1.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'weeks', 'BGTOcc'])
individual_managers = managers_group1[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()

### Part 2

For each firm-location-week, pooling all occupations, calculate also the number of job ads with the following job code:
- Everything starting with ‘11’

In [11]:
managers_group2 = managers_only_df1.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'weeks'])
all_managers = managers_group2[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()

### Part 3

For each firm-location-week, calculate the total number of job ads posted by that firm regardless of the occupation

In [12]:
ddf8 = ddf7[['EmployerClean', 'CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle', 'years', 'weeks', 'CanonState', 'CanonCounty', 'CanonPostalCode']].replace('Unknown', np.nan)
firm_loc_week_group = ddf8.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'weeks'])
firm_loc_week_df = firm_loc_week_group[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
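
The notebook stops at the separate counts. If a single managerial-intensity ratio is wanted, one possible next step is to join the manager counts onto the totals and divide. The sketch below is an assumption about how that could look, not part of the original analysis: it uses plain pandas, invented data, a shortened location key, and `size()` (rows per group) rather than the non-missing title counts used above.

```python
import pandas as pd

# Shortened, hypothetical grouping key; the notebook also groups on county and postal code.
keys = ['EmployerClean', 'CanonState', 'weeks']

ads = pd.DataFrame({
    'EmployerClean': ['Acme', 'Acme', 'Acme', 'Bolt'],
    'CanonState':    ['NY', 'NY', 'NY', 'CA'],
    'weeks':         [5, 5, 5, 5],
    'BGTOcc':        ['11-1021', '15-1132', '11-9199', '41-2031'],
})

# Total ads and managerial ads ('11' prefix) per firm-location-week.
total_ads = ads.groupby(keys).size().rename('total_ads').reset_index()
is_manager = ads['BGTOcc'].str.startswith('11', na=False)
manager_ads = ads[is_manager].groupby(keys).size().rename('manager_ads').reset_index()

# Left join keeps groups with no managerial ads; fill their count with 0.
intensity = total_ads.merge(manager_ads, on=keys, how='left')
intensity['manager_ads'] = intensity['manager_ads'].fillna(0)
intensity['manager_share'] = intensity['manager_ads'] / intensity['total_ads']

print(intensity)
```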

If you want to check the result from the above cells you can run the computations by uncommenting the cell below.

In [13]:
%%time

# individual_managers, all_managers, firm_loc_week_df = dask.compute(individual_managers, all_managers, firm_loc_week_df)

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.05 µs


In [14]:
# firm_loc_week_df.head()

## Save all Files

The following function saves the output of a measure as CSV. It lets you:
- choose between writing 1 file (the result is computed and saved as a single CSV) or many files (the Dask dataframe is repartitioned and saved as one CSV per partition)
- create a new directory for this output, under the `path_out` provided at the beginning of this notebook
- add a name for your file

In [15]:
def save_csv_files(new_dir_name, data, new_file_name, pandas_or_dask=True, partitions=None):
    """Save a Dask dataframe as CSV under path_out/new_dir_name.

    pandas_or_dask=True computes the result and writes a single file;
    False repartitions into `partitions` pieces and writes one file per partition.
    """
    if not os.path.exists(os.path.join(path_out, new_dir_name)):
        os.makedirs(os.path.join(path_out, new_dir_name))

    if pandas_or_dask:
        # materialize the lazy result as a pandas dataframe and write one file
        data = data.compute()
        data.to_csv(os.path.join(path_out, new_dir_name, f'{new_file_name}.csv'), index=False)
    else:
        # the following lines of code will take the last dataset, repartition it,
        # and save it to the desired location. Notice the wildcard "*" below. That is
        # the spot Dask will use to number your files starting from 0
        (data
         .repartition(npartitions=partitions)
         .to_csv(os.path.join(path_out, new_dir_name, f'{new_file_name}*.csv'), index=False)
         )
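
To try the single-file branch without touching the real output folder, a pandas-only variant can be pointed at a temporary directory. The helper name `save_pandas_csv`, its explicit `base_dir` argument and the toy frame below are assumptions made for this sketch; it also uses `os.makedirs(..., exist_ok=True)` as a shorter alternative to the explicit existence check.

```python
import os
import tempfile

import pandas as pd


def save_pandas_csv(base_dir, new_dir_name, data, new_file_name):
    """Hypothetical helper: write one pandas dataframe to base_dir/new_dir_name/new_file_name.csv."""
    out_dir = os.path.join(base_dir, new_dir_name)
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists
    out_file = os.path.join(out_dir, f'{new_file_name}.csv')
    data.to_csv(out_file, index=False)
    return out_file


# Made-up result with one firm and its title count.
toy = pd.DataFrame({'EmployerClean': ['Acme'], 'CleanJobTitle': [2]})

with tempfile.TemporaryDirectory() as tmp:
    written = save_pandas_csv(tmp, 'measure_3', toy, 'firm_full_sample')
    round_trip = pd.read_csv(written)  # read the file back while it still exists

print(round_trip)
```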

In [17]:
%%time

save_csv_files(new_dir_name='measure_3/', data=firm_full_sample,    new_file_name='firm_full_sample',    pandas_or_dask=True)
save_csv_files(new_dir_name='measure_3/', data=firm_year,           new_file_name='firm_year',           pandas_or_dask=True)
save_csv_files(new_dir_name='measure_3/', data=firm_week,           new_file_name='firm_week',           pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=individual_managers, new_file_name='individual_managers', pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=all_managers,        new_file_name='all_managers',        pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=firm_loc_week_df,    new_file_name='firm_loc_week_df',    pandas_or_dask=True)

CPU times: user 16min 10s, sys: 5min 55s, total: 22min 6s
Wall time: 25min 10s
