# Data Analysis at Scale on Filtered Data - Part 2

The code in this notebook will help you create a dataset that addresses the following measures.

- **Measure 3:** How many unique job titles are there in the firm?
    1. For each firm-week, firm-year, and firm-full-sample, calculate the number of unique values of each of the following variables. This should yield 9 variables per firm: counts that don't vary within the firm (full sample), counts that vary within the firm each year, and counts that vary within the firm each week:
        - CleanJobTitle
        - ConsolidatedJobTitle
        - CanonJobTitle
- **Measure 4:** Managerial intensity?
    1. For each firm-location-week-occupationbelow, calculate the number of job ads with the following job codes:
    2. For each firm-location-week-alloccups, calculate also the number of job ads with the following job code:
        - Everything starting with ‘11’
    3. For each firm-location-week, calculate also the total number of job ads posted by that firm regardless of the occupation

### Things to keep in mind before running the code in this notebook.

- This notebook assumes you have already run the notebook called "06_dask_get_companies.ipynb", which produces the filtered dataset (from the cleaned sample) that you will use here. Alternatively, it assumes you already have access to those datasets.
- You will be using dask dataframes for distributed computing on your local machine. You can think of these dataframes as lazy pandas (no pun intended).
- Depending on how you modify this notebook and decide to use it moving forward, keep in mind that it can generate quite a few files. Make sure to tweak the `save_csv_files` function at the end of the notebook so that it produces your desired output.

In [1]:
import dask, dask.dataframe as dd
import matplotlib.pyplot as plt
import pandas as pd
import re, csv, os
import numpy as np
from typing import List, Union


pd.set_option('display.max_columns', None)
csv.field_size_limit(10000000)

%matplotlib inline

Select the path where your filtered data files live and assign it to the `path` variable below. In addition, select a path where you would like the final files to go and assign it to `path_out`.

In [3]:
path = '~/Dropbox/Burning Glass/Data/companies_76k/filtered_data_07/'
path_out = '~/Dropbox/Burning Glass/Analysis/approach_8/data_07'

The following list and dictionary contain the names of the clean variables from the previous steps and the data types we will use to read them in. Because we don't need all the variables for the following 2 measures, we will limit our selection to only what is needed, which also speeds up processing.

In [5]:
col_names = ['JobID', 'CleanJobTitle', 'CanonCity', 'CanonState', 'CanonJobTitle', 'CanonCounty', 
             'BGTOcc', 'CanonPostalCode', 'ConsolidatedTitle', 'EmployerClean', 'JobDate']

# `np.str` was only an alias for the builtin `str` and has been removed from
# recent NumPy releases, so we use `str` directly
dtypes={'JobID': str, 'CanonJobTitle': str, 'EmployerClean': str,
        'CleanJobTitle': str, 'CanonCity': str, 'CanonCounty': str,
        'CanonState': str, 'ConsolidatedTitle': str, 'BGTOcc': str,
        'JobDate': str, 'CanonPostalCode': str}

From the cell below onwards, we will begin creating a directed acyclic graph using dask. This means that we will run almost no computations until the very end of the notebook.

The snippet below will help us read in all the files in the directory specified above. Make sure to place the wildcard `"*"` in the appropriate spot, otherwise you will not be able to read in the data. In a file-name (glob) pattern, a wildcard is a placeholder indicating that any sequence of characters can appear at that spot. For example, the `"*"` in between `da` and `.csv` allows us to select all of the files that start with `da` and end in `.csv`.
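As a small illustration of how the `"*"` wildcard selects files, the sketch below applies the same `da*.csv` pattern to a made-up list of file names using the standard-library `fnmatch` module. The file names are hypothetical and are not the actual files in the data directory.

```python
import fnmatch

# hypothetical file names, for illustration only
file_names = ['data_01.csv', 'data_02.csv', 'da.csv', 'notes.txt', 'metadata.csv']

# keep only the names that start with "da" and end in ".csv"
matched = fnmatch.filter(file_names, 'da*.csv')
print(matched)
```

Note that `"*"` also matches an empty string, so a file named exactly `da.csv` would be picked up as well.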

Parameters used:

- `engine='python'`: the default engine is written in `C` and is faster, but the `python` engine is more feature-complete
- `dtype=dtypes`: our dictionary of data types above
- `assume_missing=True`: Yes, there might be some edge cases of missing values not taken care of in our previous step
- `error_bad_lines=False`: malformed lines (for example, rows with too many fields) are skipped and reported with a warning instead of raising an error. Note that pandas 1.3 deprecated this parameter in favor of `on_bad_lines`
- `blocksize=None`: By default, dask cuts each file into blocks of bytes and turns every block into a partition. Because some of the job descriptions contain quite large amounts of text, cutting a file in the middle of a record could lead dask to misinterpret the commas in some of the values in the `JobText` column. To get around this, `blocksize=None` reads each file whole as a single partition. Luckily, since we created small enough files in the previous step, operations will be very fast.
- `usecols=col_names`: our list of columns above
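To see what `dtype` and `usecols` do in isolation, here is a minimal sketch that uses plain pandas on an in-memory csv. The three-column sample text is made up for illustration; only the column names come from this notebook. It assumes that dask forwards these two parameters to the pandas reader.

```python
import io
import pandas as pd

# hypothetical sample with one column (JobText) that we do not need
csv_text = (
    "JobID,CanonPostalCode,JobText\n"
    "1,01730,some long job description\n"
    "2,90006,another job description\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'JobID': str, 'CanonPostalCode': str},  # read as text, not numbers
    usecols=['JobID', 'CanonPostalCode'],           # skip everything else
)
print(sample)
```

Reading `CanonPostalCode` as a string keeps the leading zero in `01730`, which would be lost if the column were parsed as a number.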

In [6]:
ddf = dd.read_csv(os.path.join(path, 'da*.csv'), 
                 engine='python',
                 dtype=dtypes,
                 assume_missing=True,
                 error_bad_lines=False,
                 blocksize=None,
                 usecols=col_names,
                )
ddf

| npartitions=14 | CanonCity | CanonState | CleanJobTitle | JobDate | JobID | CanonPostalCode | CanonCounty | CanonJobTitle | BGTOcc | ConsolidatedTitle | EmployerClean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | object | object | object | object | object | object | object | object | object | object | object |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


Be careful when calling `.head()` or `.tail()` on large datasets. Depending on how much data you are trying to view, especially if it doesn't fit into memory, this could take anywhere from 2 to 30 minutes or so.

In [9]:
%%time

ddf.tail()

CPU times: user 3.18 s, sys: 290 ms, total: 3.47 s
Wall time: 3.54 s


| | CanonCity | CanonState | CleanJobTitle | JobDate | JobID | CanonPostalCode | CanonCounty | CanonJobTitle | BGTOcc | ConsolidatedTitle | EmployerClean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 66048 | Los Angeles | CA | Director Of Education And Outreach | 2007-12-31 | 339743939 | 90006 | Los Angeles | Director of Education | 11-9151.00 | Director of Education | Center Theatre Group |
| 66049 | Cypress | CA | Senior Healthcare Analyst - Sas Programming Sk... | 2007-12-31 | 339756230 | 90630 | Orange | Unknown | 29-2071.96 | Senior Healthcare Analyst, Sas | UnitedHealth Group |
| 66050 | Santa Clara | CA | Iridesse Sales Professional | 2007-12-31 | 339820992 | 95052 | Santa Clara | Sales Professional | 41-2031.00 | Sales Professional | Tiffany & Co. |
| 66051 | San Francisco | CA | Director, Distribution Center Operations | 2007-12-31 | 339820871 | 94175 | San Francisco | Unknown | 11-1021.91 | Director, Distribution,Operations | Levi Strauss |
| 66052 | Miami | FL | Financial Services Professional | 2007-12-31 | 360306555 | 33159 | Miami-Dade | Unknown | 41-3031.00 | Financial Services Professional | Brookdale Senior Living |


## Measure 3

How many unique job titles are there in the firm?
1. For each firm-week, firm-year, and firm-full-sample, calculate the number of unique values of each of the following variables. This should yield 9 variables per firm: counts that don't vary within the firm (full sample), counts that vary within the firm each year, and counts that vary within the firm each week:
    - CleanJobTitle
    - ConsolidatedJobTitle
    - CanonJobTitle

For this measure we will need to create a couple of additional variables, `years` and `weeks`, in order to use them in our `.groupby()` call.

In [10]:
JobDate = dd.to_datetime(ddf['JobDate'])
ddf1 = ddf.assign(JobDate=JobDate)
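# note: `.dt.week` was deprecated in pandas 1.1 and removed in pandas 2.0;
# on newer versions, `.dt.isocalendar().week` is the replacement
weeks = ddf1['JobDate'].dt.week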
years = ddf1['JobDate'].dt.year
ddf2 = ddf1.assign(weeks=weeks, years=years)
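The sketch below shows the same year and week extraction on a tiny pandas series, assuming a pandas version (1.1 or later) that provides `.dt.isocalendar()`. The first and last dates are the boundaries of 2007, the year covered by this sample; the middle date is arbitrary.

```python
import pandas as pd

# three dates from 2007, parsed from strings just like the JobDate column
dates = pd.to_datetime(pd.Series(['2007-01-01', '2007-06-15', '2007-12-31']))

years = dates.dt.year
weeks = dates.dt.isocalendar().week  # ISO week number

print(pd.DataFrame({'date': dates, 'years': years, 'weeks': weeks}))
```

One thing to keep in mind: ISO weeks do not always line up with calendar years. December 31, 2007 falls in ISO week 1 (of 2008), so ads posted on that day share a week number with ads from early January 2007 when grouping by `weeks` alone.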

We will remove duplicate titles from the dataset and then use groupby on this deduplicated version. We will then count the job titles within each of the job title variables above and reset the index to flatten the multi-level index that the groupby produces. Also, in the cell below we will replace the `"Unknown"` placeholder for missing values with `np.nan` so that we don't count these `"Unknown"` elements as job titles.

In [11]:
ddf_deduplicated = ddf2.drop_duplicates(subset=['CleanJobTitle'])[['EmployerClean', 'CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle', 'years', 'weeks']].replace('Unknown', np.nan)
# ddf_deduplicated2 = ddf_deduplicated[['EmployerClean', 'CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle', 'years', 'weeks']].replace('Unknown', np.nan)
firm_full_sample = ddf_deduplicated.groupby('EmployerClean')[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
firm_year = ddf_deduplicated.groupby(['EmployerClean', 'years'])[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
firm_week = ddf_deduplicated.groupby(['EmployerClean', 'weeks'])[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
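For reference, the toy pandas example below compares two ways of counting titles per firm: deduplicating on `CleanJobTitle` across the whole dataset and then counting (the pattern used above), and calling `.nunique()` within each firm. The firms and titles are made up, and the `.nunique()` variant is only a possible alternative, not the method this notebook uses.

```python
import pandas as pd

# hypothetical data: both firms post an "Analyst" job
toy = pd.DataFrame({
    'EmployerClean': ['A', 'A', 'A', 'B', 'B'],
    'CleanJobTitle': ['Analyst', 'Analyst', 'Manager', 'Analyst', 'Clerk'],
})

# pattern used above: drop duplicate titles across all firms, then count
dedup_then_count = (toy.drop_duplicates(subset=['CleanJobTitle'])
                       .groupby('EmployerClean')['CleanJobTitle'].count())

# alternative: count distinct titles within each firm
per_firm_unique = toy.groupby('EmployerClean')['CleanJobTitle'].nunique()

print(dedup_then_count.to_dict())
print(per_firm_unique.to_dict())
```

The two approaches differ whenever the same title appears at more than one firm, because the global deduplication keeps that title only for the firm where it shows up first. Which behavior is appropriate depends on how the measure is meant to be defined.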

If you want to check the result from the above, you can run the computations by uncommenting the cell below. Please note that at the end of this notebook you will save csv files with the output from the above lines so you can examine the results later on as well.

In [12]:
%%time

# firm_full_sample, firm_year, firm_week = dask.compute(firm_full_sample, firm_year, firm_week)

CPU times: user 51.6 s, sys: 5.44 s, total: 57.1 s
Wall time: 1min 30s


In [15]:
# firm_week.tail(15)

| | EmployerClean | weeks | CleanJobTitle | ConsolidatedTitle | CanonJobTitle |
|---|---|---|---|---|---|
| 31083 | Zebra Technologies | 51 | 3 | 3 | 3 |
| 31084 | Zenith | 1 | 5 | 5 | 4 |
| 31085 | Zenith | 2 | 1 | 1 | 0 |
| 31086 | Zenith | 5 | 1 | 1 | 0 |
| 31087 | Zenith | 6 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 31178 | inVentiv Health | 48 | 4 | 4 | 3 |
| 31179 | inVentiv Health | 49 | 5 | 5 | 4 |
| 31180 | inVentiv Health | 50 | 2 | 2 | 1 |
| 31181 | inVentiv Health | 51 | 7 | 7 | 6 |


## Measure 4

### Part 1 - Managerial intensity

For each firm-location-week-occupation, calculate the number of job ads with the managerial job codes available in the dataset.

Occupation codes that start with 11 are, almost exclusively, managerial roles. With that in mind, we will pick all occupations that start with 11 as opposed to specific ones, which also increases the speed at which we create the dataframes.

In [16]:
# We will first create our dummy occupation condition
occu_condition = ddf2['BGTOcc'].str.startswith('11')

# we will then assign it back into the dask dataframe
managers_dummy_df = ddf2.assign(managerial_occu=occu_condition)

# and finally filter in the managerial roles
managers_only_df = managers_dummy_df[managers_dummy_df['managerial_occu'] == True]
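The filtering step above can be tried out on a handful of values with plain pandas. The four occupation codes below are taken from the `ddf.tail()` output earlier in this notebook; the missing value is added as an assumption, to show how rows without a code behave.

```python
import numpy as np
import pandas as pd

# occupation codes from the sample rows shown earlier, plus one missing value
occ = pd.Series(['11-9151.00', '29-2071.96', '41-2031.00', '11-1021.91', np.nan])

# na=False treats a missing code as "not managerial" instead of leaving it undefined
is_manager = occ.str.startswith('11', na=False)

managers_only = occ[is_manager]
print(managers_only.tolist())
```

Passing `na=False` is one way to get a clean boolean mask; comparing the dummy with `== True`, as done above, has the same effect of leaving out rows with a missing code.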

In [17]:
# let's replace the Unknown placeholder again with np.nan
managers_only_df1 = managers_only_df[['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 
                                      'weeks', 'BGTOcc', 'CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].replace('Unknown', np.nan)

# let's group a dataset with our variables of interest
managers_group1 = managers_only_df1.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'weeks', 'BGTOcc'])

# and then let's count the jobs in our three job title variables
individual_managers = managers_group1[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
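The counts produced here depend on the fact that `.count()` skips missing values, which is why `"Unknown"` is replaced with `np.nan` first. A minimal pandas sketch with made-up rows (the employer name appears in this notebook's output, the titles are hypothetical):

```python
import numpy as np
import pandas as pd

# three hypothetical ads; two of them have no canonical title
toy = pd.DataFrame({
    'EmployerClean': ['Zenith', 'Zenith', 'Zenith'],
    'CleanJobTitle': ['Buyer', 'Planner', 'Clerk'],
    'CanonJobTitle': ['Buyer', 'Unknown', 'Unknown'],
})

counts = (toy.replace('Unknown', np.nan)
             .groupby('EmployerClean')[['CleanJobTitle', 'CanonJobTitle']]
             .count())
print(counts)
```

This also explains why, in the outputs of this notebook, the `CanonJobTitle` column can be lower than `CleanJobTitle` for the same group: `CleanJobTitle` effectively counts the ads, while `CanonJobTitle` counts only the ads whose canonical title is known.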

### Part 2

For each firm-location-week-alloccups, calculate the number of job ads for all managerial roles (i.e., every occupation that starts with 11).

In [18]:
managers_group2 = managers_only_df1.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'weeks'])
all_managers = managers_group2[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()

### Part 3

For each firm-location-week, calculate the total number of job ads posted by that firm regardless of the occupation.

In [19]:
ddf3 = ddf2[['EmployerClean', 'CleanJobTitle', 'ConsolidatedTitle', 
             'CanonJobTitle', 'years', 'weeks', 'CanonState', 
             'CanonCounty', 'CanonPostalCode']].replace('Unknown', np.nan)

firm_loc_week_group = ddf3.groupby(['EmployerClean', 'CanonState', 'CanonCounty', 'CanonPostalCode', 'weeks'])

firm_loc_week_df = firm_loc_week_group[['CleanJobTitle', 'ConsolidatedTitle', 'CanonJobTitle']].count().reset_index()
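One possible downstream use of the tables from Parts 2 and 3 is to divide the managerial ad count by the total ad count for each firm-location-week. The sketch below illustrates this with two tiny pandas frames. The column names `total_ads`, `manager_ads`, and `manager_share`, the reduced set of keys, and all the numbers are assumptions made for illustration; this step is not part of the notebook itself.

```python
import pandas as pd

keys = ['EmployerClean', 'CanonState', 'weeks']

# hypothetical totals per firm-location-week (stand-in for Part 3)
all_ads = pd.DataFrame({'EmployerClean': ['A', 'A', 'B'],
                        'CanonState': ['CA', 'CA', 'NY'],
                        'weeks': [1, 2, 1],
                        'total_ads': [10, 4, 5]})

# hypothetical managerial counts (stand-in for Part 2); groups without
# managerial ads are simply absent
mgr = pd.DataFrame({'EmployerClean': ['A', 'B'],
                    'CanonState': ['CA', 'NY'],
                    'weeks': [1, 1],
                    'manager_ads': [2, 1]})

# a left join keeps every firm-location-week, even those without managers
merged = all_ads.merge(mgr, on=keys, how='left')
merged['manager_ads'] = merged['manager_ads'].fillna(0)
merged['manager_share'] = merged['manager_ads'] / merged['total_ads']
print(merged)
```

The left join matters because the managerial table only contains groups with at least one managerial ad. An inner join would silently drop the firm-location-weeks whose share is zero.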

If you want to check the result from the above cells, you can run the computations with the cell below. Note that after `dask.compute`, the three variables hold pandas dataframes rather than lazy dask dataframes.

In [20]:
%%time

individual_managers, all_managers, firm_loc_week_df = dask.compute(individual_managers, all_managers, firm_loc_week_df)

CPU times: user 43.2 s, sys: 13.1 s, total: 56.3 s
Wall time: 1min 2s


In [21]:
firm_loc_week_df.head()

| | EmployerClean | CanonState | CanonCounty | CanonPostalCode | weeks | CleanJobTitle | ConsolidatedTitle | CanonJobTitle |
|---|---|---|---|---|---|---|---|---|
| 0 | 1-800-Flowers.com | NY | Nassau | 11514 | 2 | 1 | 1 | 0 |
| 1 | 170 Systems | MA | Middlesex | 1730 | 2 | 1 | 1 | 1 |
| 2 | 1st Lake Properties | LA | Jefferson | 70001 | 1 | 2 | 2 | 2 |
| 3 | 1st Lake Properties | MS | Madison | 39158 | 1 | 1 | 1 | 1 |
| 4 | 24 Hour Fitness | AZ | Maricopa | 85013 | 1 | 1 | 1 | 0 |


## Save all Files

The following function will help you save csv files with the following characteristics:
- write the output of your measure as either one csv file or many
- create a new directory for this output, inside the `path_out` directory provided at the beginning of this notebook
- add a name for your file

In [23]:
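def save_csv_files(new_dir_name, data, new_file_name, pandas_or_dask=True, partitions=None):
    """Save `data` under `path_out/new_dir_name` as one csv file (pandas_or_dask=True)
    or as several numbered csv files written by dask (pandas_or_dask=False)."""

    # expand "~" so that the directory we create is the one the csv is written to
    out_dir = os.path.expanduser(os.path.join(path_out, new_dir_name))
    os.makedirs(out_dir, exist_ok=True)

    if pandas_or_dask:
        # compute only if the data is still a lazy dask object; results already
        # materialized with dask.compute() are pandas dataframes without .compute()
        if hasattr(data, 'compute'):
            data = data.compute()
        data.to_csv(os.path.join(out_dir, f'{new_file_name}.csv'), index=False)
    else:
        # the following lines of code will take the dataset, repartition it,
        # and save it to the desired location. Notice the wildcard "*" below. That is
        # the spot Dask will use to number your files starting from 0
        (data
         .repartition(npartitions=partitions)
         .to_csv(os.path.join(out_dir, f'{new_file_name}*.csv'), index=False)
         )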
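Since saving the real outputs can take a long time, it may be worth trying the directory and file naming on a toy dataframe first. The sketch below does this for the single-file (pandas) case, writing into a temporary directory so nothing is left behind. The helper `save_one_csv` and the toy data are hypothetical and only mimic the layout that `save_csv_files` produces.

```python
import os
import tempfile
import pandas as pd

def save_one_csv(base_dir, new_dir_name, frame, new_file_name):
    # hypothetical helper: same directory and file naming as the pandas case above
    out_dir = os.path.expanduser(os.path.join(base_dir, new_dir_name))
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists
    out_path = os.path.join(out_dir, f'{new_file_name}.csv')
    frame.to_csv(out_path, index=False)
    return out_path

# made-up counts, for illustration only
toy = pd.DataFrame({'EmployerClean': ['A', 'B'], 'CleanJobTitle': [2, 1]})

with tempfile.TemporaryDirectory() as tmp:
    written = save_one_csv(tmp, 'measure_3/', toy, 'firm_full_sample')
    file_name = os.path.basename(written)
    round_trip = pd.read_csv(written)  # read the file back to confirm its contents

print(file_name)
print(round_trip)
```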

In [51]:
%%time

save_csv_files(new_dir_name='measure_3/', data=firm_full_sample,    new_file_name='firm_full_sample',    pandas_or_dask=True)
save_csv_files(new_dir_name='measure_3/', data=firm_year,           new_file_name='firm_year',           pandas_or_dask=True)
save_csv_files(new_dir_name='measure_3/', data=firm_week,           new_file_name='firm_week',           pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=individual_managers, new_file_name='individual_managers', pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=all_managers,        new_file_name='all_managers',        pandas_or_dask=True)
save_csv_files(new_dir_name='measure_4/', data=firm_loc_week_df,    new_file_name='firm_loc_week_df',    pandas_or_dask=True)

CPU times: user 33min 53s, sys: 28min 31s, total: 1h 2min 24s
Wall time: 1h 31min 34s
