# Data Analysis at Scale on Filtered Data

The code in this notebook computes the following measures.

- **Measure 1:** Relative usage of upward and downward terms (new version)
- **Measure 2:** Does the job ad list another job title and tell us who reports to whom?
    1. This measure needs to take every keyword above and export the next 7 words in the job ad following it, if any (return NaN otherwise), as a separate column, one for each of the dummies above.
- **Measure 3:** How many unique job titles are there in the firm?
    1. For each firm-week, firm-year and firm-full-sample, calculate the number of unique values of each variable below. We should end up with 9 variables per firm: unique counts that do not vary within a firm, that vary within a firm by year, and that vary within a firm by week:
        - CleanJobTitle
        - ConsolidatedJobTitle
        - CanonJobTitle
- **Measure 4:** Managerial intensity?
    1. For each firm-location-week-occupationbelow, calculate the number of job ads with the following job codes:
    2. For each firm-location-week-alloccups, calculate also the number of job ads with the following job code:
        - Everything starting with ‘11’
    3. For each firm-location-week, calculate also the total number of job ads posted by that firm regardless of the occupation

### Things to keep in mind before running the code in this notebook.

- This notebook assumes you have already run the notebook "06_dask_get_companies.ipynb", which produces the filtered dataset from the cleaned sample. Alternatively, you already have access to this dataset.
- You will be using dask dataframes for distributed computing on your local machine. You can think of these dataframes as lazy pandas (no pun intended).
- This notebook generates quite a few files at the end. If you modify it, make sure to tweak the function at the end of the notebook so that it only saves the files you want.

In [1]:
import dask, dask.dataframe as dd, dask.array as da
from dask.diagnostics import ProgressBar
import matplotlib.pyplot as plt
import pandas as pd
import re, csv, os
import numpy as np
from dask import delayed, persist
from dask.distributed import Client
from glob import glob
from typing import List, Union


pd.set_option('display.max_columns', None)
csv.field_size_limit(10000000)

%matplotlib inline

Select the path where your filtered data lives and assign it to the `path` variable below. The output files will be written under `path_out`.

In [8]:
path = '/Users/ramonperez/Dropbox/Burning Glass/Data/companies_76k/filtered_data_07/'
path_out = '/Users/ramonperez/Dropbox/Burning Glass/Analysis/approach_8/'

The list and dictionary below contain the names of the clean variables from the previous steps and the data types we will use to read them in.

In [9]:
col_names = ['JobID', 'CleanJobTitle', 'CanonCity', 'CanonState', 'Source', 'Latitude', 
             'Longitude', 'CanonJobTitle', 'CanonCounty', 'MSA', 'LMA', 'InternshipFlag',
             'ConsolidatedONET', 'CanonSkillClusters', 'CanonSkills', 'CanonMinimumDegree',
             'CanonRequiredDegrees', 'MinExperience', 'ConsolidatedInferredNAICS', 'BGTOcc',
             'YearsOfExperience', 'CanonJobHours', 'CanonJobType', 'CanonPostalCode', 
             'CanonYearsOfExperienceCanonLevel', 'CanonYearsOfExperienceLevel', 'ConsolidatedTitle',
             'BGTSubOcc', 'ConsolidatedDegreeLevels', 'MinDegreeLevel', 'EmployerClean',
             'clean_text', 'JobDate']

# `np.str` was an alias of the built-in `str` and was removed in NumPy 1.24, so we use `str` directly
dtypes={'CanonSkills': str, 'Latitude': np.float32, 'JobID': str, 'CanonJobTitle': str,
        'CanonYearsOfExperienceLevel': str, 'Longitude': np.float32, 'CanonJobType': str, 
        'CleanJobTitle': str, 'ConsolidatedInferredNAICS': str, 'CanonRequiredDegrees': str,
        'YearsOfExperience': str, 'CanonCity': str, 'CanonCounty': str, 'CanonJobHours': str,
        'CanonState': str, 'ConsolidatedONET': str, 'MSA': str, 'CanonMinimumDegree': str,
        'ConsolidatedDegreeLevels': str, 'BGTSubOcc': str, 'ConsolidatedTitle': str,
        'CanonSkillClusters': str, 'Language': str, 'JobDate': str,
        'MinDegreeLevel': str, 'LMA': str, 'MinExperience': str, 'CanonPostalCode': str,
        'InternshipFlag': np.bool_, 'Source': str, 'BGTOcc': str, 'CanonYearsOfExperienceCanonLevel': str}

Starting with the cell below, we will begin building a directed acyclic graph with dask. This means that we will be making barely any computations until the very end of the notebook.

The snippet below reads in all the files in the directory specified above that match the file pattern. Make sure to place the wildcard (`*`) in the appropriate spot, otherwise you will not be able to read in the data.

Parameters used:

- `engine='python'`: use the Python parsing engine of pandas, which is slower than the C engine but more feature-complete.
- `dtype=dtypes`: our dictionary of data types above
- `assume_missing=True`: integer columns that are not listed in `dtype` are assumed to contain missing values and are read as floats. There might be some edge cases of missing values not taken care of in our previous step.
- `error_bad_lines=False`: lines that cannot be parsed (for example, lines with too many fields) are dropped with a warning instead of raising an error and stopping the read. pandas 1.3 and later replace this flag with `on_bad_lines`.
- `blocksize=None`: Dask normally splits each file into blocks of bytes and reads every block as its own partition. Because some of our job descriptions hold quite large amounts of text per observation, a split can land in the middle of a record and dask will most likely misinterpret the commas in some of the values of the job text column. With `blocksize=None`, every file is read whole as a single partition. Luckily, since we created small enough files in the previous step, even reading each full file in will be fast.
- `usecols=col_names`: our list of columns above
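
To see why splitting a CSV file at arbitrary positions is risky for free-text columns, here is a small sketch that uses only the standard library. The sample records and the two column names are made up for illustration. The point is only that a quoted field may contain commas and line breaks, so a parser has to see a record in full to read it correctly.

```python
import csv
import io

# Hypothetical two-record file; the first description spans two lines and contains commas.
raw = (
    'JobID,clean_text\n'
    '1,"Manager, Sales\nreports to the VP, Operations"\n'
    '2,"Analyst"\n'
)

# Parsing the whole text respects the quotes: 1 header row + 2 records.
rows = list(csv.reader(io.StringIO(raw)))

# Splitting on line breaks first cuts the first record in half.
naive_lines = raw.strip().split('\n')

print(len(rows))         # 3
print(len(naive_lines))  # 4
```

The same kind of breakage can happen whenever a file is cut into pieces without regard to quoting, which is why reading each file whole is the safer choice for this dataset.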

In [10]:
ddf = dd.read_csv(os.path.join(path, 'da*.csv'), 
                 engine='python',
                 dtype=dtypes,
                 assume_missing=True,
                 on_bad_lines='warn',  # pandas >= 1.3 replacement for error_bad_lines=False: skip bad lines with a warning
                 blocksize=None,
                 usecols=col_names,
                )
ddf

Dask DataFrame Structure (npartitions=50):
,CanonCity,CanonState,CleanJobTitle,JobDate,JobID,Latitude,Longitude,CanonPostalCode,CanonCounty,LMA,MSA,CanonJobTitle,ConsolidatedONET,InternshipFlag,Source,CanonSkillClusters,CanonSkills,CanonMinimumDegree,CanonRequiredDegrees,MinExperience,ConsolidatedInferredNAICS,BGTOcc,CanonJobHours,CanonJobType,CanonYearsOfExperienceCanonLevel,CanonYearsOfExperienceLevel,ConsolidatedDegreeLevels,ConsolidatedTitle,MinDegreeLevel,BGTSubOcc,YearsOfExperience,EmployerClean,clean_text
,object,object,object,object,object,float32,float32,object,object,object,object,object,object,bool,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


Be careful when checking the `.head()` or `.tail()` of the datasets you will be working with. These calls trigger the computations needed to produce the rows you ask for, so depending on how much data is involved, especially if it doesn't fit into memory, this could take anywhere from 2 minutes to an hour or so.

In [11]:
%%time

ddf.tail()

CPU times: user 85.4 ms, sys: 90 ms, total: 175 ms
Wall time: 189 ms


Unnamed: 0,CanonCity,CanonState,CleanJobTitle,JobDate,JobID,Latitude,Longitude,CanonPostalCode,CanonCounty,LMA,MSA,CanonJobTitle,ConsolidatedONET,InternshipFlag,Source,CanonSkillClusters,CanonSkills,CanonMinimumDegree,CanonRequiredDegrees,MinExperience,ConsolidatedInferredNAICS,BGTOcc,CanonJobHours,CanonJobType,CanonYearsOfExperienceCanonLevel,CanonYearsOfExperienceLevel,ConsolidatedDegreeLevels,ConsolidatedTitle,MinDegreeLevel,BGTSubOcc,YearsOfExperience,EmployerClean,clean_text
83,Cypress,CA,Senior Healthcare Analyst - Sas Programming Sk...,2007-12-31,339756230,33.815399,-118.037003,90630,Orange,DV064204|MT063110,31080: Metropolitan Statistical Area|348: Comb...,Unknown,29207100,False,Company from Job Board,Health Care: Clinical Data Management;Speciali...,{'Clinical Data Review': 'Health Care: Clinica...,Unknown,Unknown,5,524113,29-2071.96,fulltime,permanent,1-6,mid,Unknown,"Senior Healthcare Analyst, Sas",Unknown,Healthcare Analyst,5-6 years,UnitedHealth Group,Sr. Healthcare Analyst - SAS programming skill...
84,Santa Clara,CA,Iridesse Sales Professional,2007-12-31,339820992,37.352901,-121.953003,95052,Santa Clara,MT064194,41940: Metropolitan Statistical Area,Sales Professional,41203100,False,Company from Job Board,Marketing and Public Relations: Customer Relat...,{'Client Base Retention': 'Marketing and Publi...,Unknown,Unknown,2,448310,41-2031.00,Unknown,Unknown,1-6,mid,Unknown,Sales Professional,Unknown,Retail Sales Associate (General),minimum of 2 years,Tiffany & Co.,Careers Iridesse Sales Professional Req #: ...
85,San Francisco,CA,"Director, Distribution Center Operations",2007-12-31,339820871,37.7798,-122.417,94175,San Francisco,DV064188|MT064186,41860: Metropolitan Statistical Area|488: Comb...,Unknown,11102100,False,Company from Job Board,Finance: Budget Management;Specialized Skills|...,{'Budgeting': 'Finance: Budget Management;Spec...,Bachelor's,Bachelor's,13,448140,11-1021.91,fulltime,permanent,6+,high,16,"Director, Distribution,Operations",16,Retail Operations Supervisor,10+ years|At least 5 years,Levi Strauss,"Job Search Engine - Job Blogs Friday, October..."
86,Miami,FL,Financial Services Professional,2007-12-31,360306555,25.794901,-80.1278,33159,Miami-Dade,DV123312|MT123310,33100: Metropolitan Statistical Area|370: Comb...,Unknown,41303102,False,Job Board,Sales: Solution Sales Engineering;Specialized ...,{'Consultative Sales': 'Sales: Solution Sales ...,Unknown,Unknown,Unknown,6233,41-3031.00,Unknown,Unknown,Unknown,Unknown,Unknown,Financial Services Professional,Unknown,Financial Services Representative,Unknown,Brookdale Senior Living,- Levin Financial Group of Massmutual - Tamp...
87,Lorton,VA,District Sales Manager,2007-12-31,38009597932,38.704102,-77.227997,22199,Fairfax,DV114789|MT114790,47900: Metropolitan Statistical Area|548: Comb...,District Sales Manager,11202200,False,Company from Job Board,Common Skills|Specialized Skills|Analysis: Bus...,"{'Communication Skills': 'Common Skills', 'Com...",Higher Secondary Certificate,Unknown,Unknown,511110,11-2022.00,fulltime,permanent,Unknown,Unknown,16|12,District Sales Manager,12,Territory / Regional Sales Manager,Unknown,USA Today,


In [12]:
missing_count = ((ddf.isna().sum() / ddf.index.size) * 100)
missing_count_pct = missing_count.compute()
missing_count_pct

CanonCity                           0.000000
CanonState                          0.000000
CleanJobTitle                       0.000000
JobDate                             0.000000
JobID                               0.000000
Latitude                            0.000000
Longitude                           0.000000
CanonPostalCode                     0.000000
CanonCounty                         0.000000
LMA                                 0.000000
MSA                                 0.000000
CanonJobTitle                       0.000000
ConsolidatedONET                    0.000000
InternshipFlag                      0.000000
Source                              0.000000
CanonSkillClusters                  0.000000
CanonSkills                         0.000000
CanonMinimumDegree                  0.000000
CanonRequiredDegrees                0.000000
MinExperience                       0.000000
ConsolidatedInferredNAICS           0.000000
BGTOcc                              0.000000
CanonJobHo

In [34]:
no_missing = ddf['clean_text'].notnull()
ddf = ddf[no_missing]

## Measure 1

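- **Measure 1:** Relative usage of upward and downward terms (new version)

The two lines below check whether a job description contains at least one of the keywords: the pipes (`|`) act as an OR between the alternatives. Notice the spaces around each keyword between the pipes. They tell python that, for example, the word `intern` should only match as its own word and not as part of `international`. A side effect is that a keyword at the very start or end of the text, or one followed directly by punctuation, will not be matched.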

In [35]:
downward = ddf['clean_text'].str.lower().str.contains(' will supervise | supervising | guiding | mentoring | leading | lead | overseeing | will guide | be in charge of | mentor | coaching | mentoring | coordinating | building teams | build team | guiding | advising | setting performance standard | sets performance standard | resolving conflict | resolves conflict | responsibility for outcomes | responsible for outcomes | directing | appointing | instructing | recruiting | managing | approve | approving | assign | assigning | delegate | delegating | control | controlling | review | reviewing | arbitrate | arbitrating | command | commanding | govern | governing ', regex=True)
upward = ddf['clean_text'].str.lower().str.contains(' reports to | report to | reporting to | answers to | answer to | managed by | responds to | respond to | directed by | receives guidance | receive guidance | supervised by | assists | assist | support | supports | supporting | helps | help | helping ', regex=True)

ddf0 = ddf.assign(downward=downward, upward=upward)
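
As a quick illustration of the space-padding trick and its side effect, the sketch below compares a space-padded pattern with a regular-expression word boundary (`\b`). The sample sentence is made up, and the `\b` variant is shown only as a possible alternative, not as what this notebook uses.

```python
import re

text = 'You will lead, mentor and support our international interns.'.lower()

# Space-padded keyword, as in the notebook: needs a literal space on both sides.
padded = re.search(' lead ', text) is not None

# Word-boundary version: matches whole words even next to punctuation.
bounded = re.search(r'\blead\b', text) is not None

# Neither pattern matches inside a longer word such as "international".
inside_padded = re.search(' intern ', text) is not None
inside_bounded = re.search(r'\bintern\b', text) is not None

print(padded, bounded)                # False True
print(inside_padded, inside_bounded)  # False False
```

Here `lead,` is missed by the padded pattern because a comma follows the word, while both patterns correctly ignore `international`.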

We will also create two lists with the words above and use them to create the dummies for our dataframes.

In [36]:
# each keyword is listed once: a repeated keyword would give a repeated column label
# and would be counted twice when the dummies are summed later on
down_words = [' will supervise ', ' supervising ', ' guiding ', ' mentoring ', ' leading ',
              ' lead ', ' overseeing ', ' will guide ', ' be in charge of ', ' mentor ', 
              ' coaching ', ' coordinating ', ' building teams ', ' build team ', 
              ' advising ', ' setting performance standard ', ' sets performance standard ',
              ' resolving conflict ', ' resolves conflict ', ' responsibility for outcomes ', 
              ' responsible for outcomes ', ' directing ', ' appointing ', ' instructing ',
              ' recruiting ', ' managing ', ' approve ', ' approving ', ' assign ', ' assigning ',
              ' delegate ', ' delegating ', ' control ', ' controlling ', ' review ', ' reviewing ',
              ' arbitrate ', ' arbitrating ', ' command ', ' commanding ', ' govern ', ' governing ']

up_words = [' reports to ', ' report to ', ' reporting to ', ' answers to ', ' answer to ', 
            ' managed by ', ' responds to ', ' respond to ', ' directed by ', ' receives guidance ',
            ' receive guidance ', ' supervised by ', ' assists ', ' assist ', ' support ', 
            ' supports ', ' supporting ', ' helps ', ' help ', ' helping ']

In [37]:
def get_indicators(data: pd.DataFrame, column: str, words: List[str]) -> pd.DataFrame:
    """
    Check whether each keyword appears in a text column of a dataframe, create a
    boolean dummy variable for it, and add it back into the dataframe.
    """
    data = data.copy()  # avoid mutating the partition that dask passes in
    lowered = data[column].str.lower()  # lowercase once instead of once per keyword
    for word in words:
        # the dummy is named after the stripped keyword and is True where the keyword was found
        data[word.strip()] = lowered.str.contains(word, regex=False)
    return data

Dask has a very useful method called `.map_partitions()` that applies a function to each partition of the dask dataframe while treating these partitions as pandas dataframes. We pass in the function itself, without parentheses, followed by its parameters as keyword arguments. Dask supplies each partition as the `data` argument.

In [38]:
ddf1 = ddf0.map_partitions(get_indicators, column='clean_text', words=down_words)
ddf2 = ddf1.map_partitions(get_indicators, column='clean_text', words=up_words)
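
Conceptually, mapping a function over partitions amounts to calling it on each pandas piece and stitching the results back together. The sketch below mimics that idea with plain pandas. The helper `flag_keyword`, the sample sentences, and the manual split into two "partitions" are all hypothetical and only meant to show the pattern.

```python
import pandas as pd

def flag_keyword(data: pd.DataFrame, column: str, word: str) -> pd.DataFrame:
    """Hypothetical stand-in for get_indicators with a single keyword."""
    data = data.copy()
    data[word.strip()] = data[column].str.lower().str.contains(word, regex=False)
    return data

df = pd.DataFrame({'clean_text': ['This role reports to the CFO',
                                  'You will lead a team',
                                  'Reports to nobody',
                                  'She reports to the board']})

# Two "partitions", each one an ordinary pandas dataframe.
partitions = [df.iloc[:2], df.iloc[2:]]

# Apply the function to every partition, then stitch the pieces back together.
result = pd.concat([flag_keyword(part, column='clean_text', word=' reports to ')
                    for part in partitions])

print(result['reports to'].tolist())  # [True, False, False, True]
```

The third row is `False` because the text starts with the keyword, so there is no leading space to match.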

## Measure 2

Let's first strip the padding spaces from the words above so that they match the names of the dummy columns we added to our dataframe.

In [39]:
up_stripped = [w.strip() for w in up_words]
down_stripped = [w.strip() for w in down_words]

We will then sum up both sets of dummy columns to get a sense of how many of these keywords were spotted in a job description, and assign the new arrays back into our dask dataframe.

In [40]:
up_instances = ddf2.loc[:, up_stripped].sum(axis=1)
down_instances = ddf2.loc[:, down_stripped].sum(axis=1)
ddf3 = ddf2.assign(up_instances=up_instances, down_instances=down_instances)

In [41]:
def get_words(word: str, string: str, num_chars: int=60) -> Union[str, None]:
    """
    Retrieve the text that starts at the first occurrence of a keyword in a
    string. The default number of characters returned is 60, keyword included.
    Returns None when the keyword is not found.
    """
    
    # search case-insensitively, in line with the lowercased matching of Measure 1
    start = string.lower().find(word)
    if start == -1:
        return None
    return string[start:start + num_chars]

def get_some_text(data: pd.DataFrame, column: str, list_of_words: List[str]) -> pd.DataFrame:
    """
    Extend get_words by storing the text found for each keyword in a column named
    after that keyword. Note that this overwrites the Measure 1 dummy of the same name.
    """
    
    data = data.copy()  # avoid mutating the partition that dask passes in
    for word in list_of_words:
        data[word.strip()] = data[column].apply(lambda x: get_words(word, x))
    return data
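
The measure description asks for the next 7 words after a keyword, while the function above works with a fixed number of characters. If a word-based cut is preferred, a variant could look like the sketch below. The helper `next_n_words` and the sample ad are hypothetical and not part of the original notebook.

```python
from typing import Optional

def next_n_words(keyword: str, text: str, n: int = 7) -> Optional[str]:
    """Hypothetical helper: return up to `n` words that follow `keyword`, or None."""
    start = text.lower().find(keyword)
    if start == -1:
        return None
    # everything after the keyword, split on whitespace
    following = text[start + len(keyword):].split()
    return ' '.join(following[:n]) if following else None

ad = 'The analyst reports to the Director of Finance and supports the wider accounting team daily.'

print(next_n_words(' reports to ', ad))  # the Director of Finance and supports the
print(next_n_words(' managed by ', ad))  # None
```

Returning `None` keeps the behaviour close to the character-based version, since pandas treats `None` in an object column as a missing value.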

We will now map our functions above to our dataframe partitions.

In [42]:
ddf4 = ddf3.map_partitions(get_some_text, column='clean_text', list_of_words=down_words)
ddf5 = ddf4.map_partitions(get_some_text, column='clean_text', list_of_words=up_words)
# ddf5.head()

## Measure 3

How many unique job titles are there in the firm?
1. For each firm-week, firm-year and firm-full-sample, calculate the number of unique values of each variable below. We should end up with 9 variables per firm: unique counts that do not vary within a firm, that vary within a firm by year, and that vary within a firm by week:
    - CleanJobTitle
    - ConsolidatedJobTitle
    - CanonJobTitle

For this measure we will need to create a couple of additional variables, years and weeks, in order to use them in our `.groupby()` call.

In [43]:
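JobDate = dd.to_datetime(ddf5['JobDate'])
ddf6 = ddf5.assign(JobDate=JobDate)
# `.dt.week` was removed in pandas 2.0; `.dt.isocalendar().week` gives the same ISO week number
weeks = ddf6['JobDate'].dt.isocalendar().week
# calendar year; note that ISO week 1 can start in the last days of December
years = ddf6['JobDate'].dt.year
ddf7 = ddf6.assign(weeks=weeks, years=years)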
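
One detail worth keeping in mind is that ISO week numbers do not always line up with calendar years. The standard-library sketch below uses 2007-12-31, a date that appears in this dataset, to show that a calendar year paired with an ISO week number can place a late-December ad in the same bucket as early-January ads of the same calendar year.

```python
from datetime import date

last_day = date(2007, 12, 31)   # a Monday
first_day = date(2007, 1, 1)    # also a Monday

iso_year, iso_week, _ = last_day.isocalendar()

print(last_day.year, iso_year, iso_week)   # 2007 2008 1
print(first_day.isocalendar()[:2])         # (2007, 1)
```

Both dates end up with calendar year 2007 and week 1, so if exact week boundaries matter, grouping on the ISO year together with the ISO week is the safer key.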

To count unique job titles we group the data by firm (and by year, or by year and week) and call `.nunique()` on each of the job title variables above. We then concatenate the three resulting series side by side and reset the index so that the grouping keys become regular columns.

In [44]:
title_cols = ['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']

def unique_titles(keys):
    # number of distinct values of each title column within every group
    return dd.concat([ddf7.groupby(keys)[col].nunique() for col in title_cols], axis=1).reset_index()

firm_full_sample = unique_titles(['EmployerClean'])
firm_year = unique_titles(['EmployerClean', 'years'])
firm_week = unique_titles(['EmployerClean', 'years', 'weeks'])  # years + weeks so week numbers are not pooled across years
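
The difference between counting rows, counting distinct titles, and dropping duplicates before counting is easy to see on a toy table. The sketch below uses plain pandas and made-up firms and titles.

```python
import pandas as pd

ads = pd.DataFrame({
    'EmployerClean': ['A', 'A', 'A', 'B', 'B'],
    'CleanJobTitle': ['Analyst', 'Analyst', 'Manager', 'Analyst', 'Clerk'],
})

rows_per_firm = ads.groupby('EmployerClean')['CleanJobTitle'].count()
unique_per_firm = ads.groupby('EmployerClean')['CleanJobTitle'].nunique()

# Dropping duplicate titles across the whole table before grouping removes
# firm B's "Analyst" ad, because firm A already used that title.
deduped_globally = (ads.drop_duplicates(subset=['CleanJobTitle'])
                       .groupby('EmployerClean')['CleanJobTitle'].count())

print(rows_per_firm.to_dict())     # {'A': 3, 'B': 2}
print(unique_per_firm.to_dict())   # {'A': 2, 'B': 2}
print(deduped_globally.to_dict())  # {'A': 2, 'B': 1}
```

Only the `.nunique()` result gives the number of distinct titles within each firm.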

If you want to check the result from the above you can run the computations by uncommenting the cell below.

In [45]:
%%time

# firm_full_sample, firm_year, firm_week = dask.compute(firm_full_sample, firm_year, firm_week)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 19.1 µs


In [46]:
# firm_week.head(5)

## Measure 4

Managerial intensity

### Part 1

For each firm-location-week-occupation, calculate the number of job ads with the managerial job codes available in the dataset.

In [47]:
# managerial occupations are the ones whose BGTOcc code starts with '11'
occu_condition = ddf7['BGTOcc'].str.startswith('11')
managers_dummy_df = ddf7.assign(managerial_occu=occu_condition)
managers_only_df = managers_dummy_df[managers_dummy_df['managerial_occu']]

In [48]:
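# 'years' is part of the key so that the same week number in different years is not pooled
managers_group1 = managers_only_df.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'years', 'weeks', 'BGTOcc'])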
individual_managers = managers_group1[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()

### Part 2

For each firm-location-week-alloccups, calculate also the number of job ads with the following job code
- Everything starting with ‘11’

In [49]:
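# 'years' is part of the key so that the same week number in different years is not pooled
managers_group2 = managers_only_df.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'years', 'weeks'])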
all_managers = managers_group2[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()

### Part 3

For each firm-location-week, calculate the total number of job ads posted by that firm regardless of the occupation

In [50]:
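# 'years' is part of the key so that the same week number in different years is not pooled
firm_loc_week_group = ddf7.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'years', 'weeks'])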
firm_loc_week_df = firm_loc_week_group[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
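
One possible way to turn these counts into an intensity figure is to divide the managerial ad count by the total ad count for the same firm, location and week. This ratio is an assumption about how the measure might be used and is not computed anywhere in this notebook. The sketch below uses plain pandas with made-up counts and a shortened set of key columns.

```python
import pandas as pd

keys = ['EmployerClean', 'CanonState', 'weeks']  # shortened key, for illustration only

# Hypothetical, already-computed counts.
manager_ads = pd.DataFrame({'EmployerClean': ['A', 'B'], 'CanonState': ['CA', 'FL'],
                            'weeks': [1, 1], 'CleanJobTitle': [2, 1]})
all_ads = pd.DataFrame({'EmployerClean': ['A', 'B', 'C'], 'CanonState': ['CA', 'FL', 'VA'],
                        'weeks': [1, 1, 1], 'CleanJobTitle': [8, 4, 5]})

# A left merge keeps firm-location-weeks without any managerial ad.
merged = all_ads.merge(manager_ads, on=keys, how='left', suffixes=('_all', '_managers'))
merged['CleanJobTitle_managers'] = merged['CleanJobTitle_managers'].fillna(0)
merged['manager_share'] = merged['CleanJobTitle_managers'] / merged['CleanJobTitle_all']

print(merged['manager_share'].tolist())  # [0.25, 0.25, 0.0]
```

Filling the missing managerial counts with 0 matters here, because firm-location-weeks with no managerial ads do not appear in the managers-only table at all.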

If you want to check the result from the above cells you can run the computations by uncommenting the cell below.

In [51]:
%%time

# individual_managers, all_managers, firm_loc_week_df = dask.compute(individual_managers, all_managers, firm_loc_week_df)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 5.72 µs


In [52]:
# firm_loc_week_df.head()

## Save all Files

The following function saves the output of a measure as csv and lets you:
- choose between writing a single file or many files for the output of your measure
- create a new directory for this output, based on the path provided at the beginning of this notebook
- add a name for your file

In [59]:
def save_csv_files(new_dir_name, data, new_file_name, pandas_or_dask=True, partitions=None):
    """
    Save `data` as csv inside `path_out/new_dir_name`. With pandas_or_dask=True a
    single file is written; with pandas_or_dask=False the dask dataframe is
    repartitioned into `partitions` pieces and written as one file per partition.
    """
    out_dir = os.path.join(path_out, new_dir_name)
    os.makedirs(out_dir, exist_ok=True)

    if pandas_or_dask:
        if hasattr(data, 'compute'):  # a lazy dask result: turn it into pandas first
            data = data.compute()
        data.to_csv(os.path.join(out_dir, f'{new_file_name}.csv'), index=False)
    else:
        # the following lines of code will take the dataset, repartition it,
        # and save it to the desired location. Notice the wildcard "*" below. That is
        # the spot Dask will use to number your files starting from 0
        (data
         .repartition(npartitions=partitions)
         .to_csv(os.path.join(out_dir, f'{new_file_name}*.csv'), index=False)
         )
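
A small note on building the output paths: gluing a file name onto a directory path with `+` only works if the directory name already ends with a separator. The sketch below shows the difference with the standard library. The root folder is hypothetical and the printed paths assume a POSIX system.

```python
import os

path_root = '/tmp/analysis'  # hypothetical output root

# String concatenation only works if the directory name ends with a separator.
glued = os.path.join(path_root, 'measure_3') + 'firm_year.csv'

# Passing every piece to os.path.join inserts the separators for us,
# with or without a trailing slash on the directory name.
joined = os.path.join(path_root, 'measure_3', 'firm_year.csv')
joined_with_slash = os.path.join(path_root, 'measure_3/', 'firm_year.csv')

print(glued)   # /tmp/analysis/measure_3firm_year.csv
print(joined)  # /tmp/analysis/measure_3/firm_year.csv
```

Passing the directory and the file name as separate arguments therefore works for both `'measure_3'` and `'measure_3/'`.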

In [None]:
%%time

save_csv_files(new_dir_name='measure_2/', data=ddf5,                new_file_name='keywords_',           pandas_or_dask=False, partitions=10)
save_csv_files(new_dir_name='measure_3/', data=firm_full_sample,    new_file_name='firm_full_sample',    pandas_or_dask=True)
save_csv_files(new_dir_name='measure_3/', data=firm_year,           new_file_name='firm_year',           pandas_or_dask=True)
save_csv_files(new_dir_name='measure_3/', data=firm_week,           new_file_name='firm_week',           pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=individual_managers, new_file_name='individual_managers', pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=all_managers,        new_file_name='all_managers',        pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=firm_loc_week_df,    new_file_name='firm_loc_week_df',    pandas_or_dask=True)