# Data Analysis at Scale on Filtered Data - Part 1

The code in this notebook will help you create a dataset that addresses the following measures.

- **Measure 1:** The new version of the relative usage of upward and downward terms.
- **Measure 2:** Does the job ad list another job title and tell us who reports to whom?
    1. This measure takes every keyword from Measure 1 and exports the text that follows it in the job ad, if any (a missing value otherwise), as a separate column, one for each of the keyword dummies. The target is roughly the next 7 words; the code below approximates this with a fixed-length character window.

### Things to keep in mind before running the code in this notebook.

- This notebook assumes you have already run the notebook called "06_dask_get_companies.ipynb", which produces the filtered dataset from the cleaned sample that you will use here. Alternatively, you already have access to those datasets.
- You will be using dask dataframes for parallel computing on your local machine. You can think of these dataframes as lazy pandas (no pun intended): operations are recorded first and only executed when a result is requested.
- Depending on how you modify this notebook and decide to use it moving forward, keep in mind that you might generate quite a few files at the end, so make sure to tweak the `save_csv_files` function at the end of the notebook to produce your desired output.

In [1]:
import dask, dask.dataframe as dd
from dask.diagnostics import ProgressBar
import matplotlib.pyplot as plt
import pandas as pd
import re, csv, os
import numpy as np
from typing import List, Union


pd.set_option('display.max_columns', None)
csv.field_size_limit(10000000)

%matplotlib inline

Select the path where your filtered data files live and assign it to the `path` variable below. In addition, select a path where you would like the final files to go and assign it to `path_out`.

In [24]:
path = '~/Dropbox/Burning Glass/Data/companies_76k/filtered_data_07/'
path_out = '~/Desktop/'



The following dictionary maps the names of the clean variables from the previous steps to the data types we will use to read them in.

In [25]:
# np.str was only an alias of the built-in str and was removed in NumPy 1.24, so we use str directly
dtypes = {'BGTSubOcc': str, 'MaxExperience': str, 'CanonJobTitle': str, 'CanonIntermediary': str,
          'clean_text': str, 'CanonYearsOfExperienceLevel': str, 'InternshipFlag': np.bool_, 'MaxAnnualSalary': np.float32,
          'MaxHourlySalary': np.float32, 'DivisionCode': str, 'CanonPostalCode': str, 'CleanJobTitle': str,
          'CanonCounty': str, 'MinDegreeLevel': str, 'CanonJobHours': str, 'MinAnnualSalary': np.float32,
          'BGTOcc': str, 'YearsOfExperience': str, 'MaxDegreeLevel': str, 'CanonSkillClusters': str,
          'CanonMaximumDegree': str, 'CanonSkills': str, 'ConsolidatedDegreeLevels': str, 'CanonRequiredDegrees': str,
          'JobID': str, 'MinHourlySalary': np.float32, 'Longitude': np.float32, 'Latitude': np.float32,
          'CanonJobType': str, 'CanonState': str, 'LMA': str, 'ConsolidatedInferredNAICS': str,
          'MSA': str, 'CanonYearsOfExperienceCanonLevel': str, 'JobDate': str, 'CIPCode': str,
          'ConsolidatedTitle': str, 'ConsolidatedONET': str, 'MinExperience': str, 'EmployerClean': str,
          'CanonCity': str, 'Source': str, 'CanonMinimumDegree': str}

From the cell below onwards, we will be building a directed acyclic graph (DAG) of tasks with dask. This means that almost no computation happens until the very end of the notebook, when we ask for the results to be written out.
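To make the idea of a deferred task graph more concrete, here is a toy sketch in plain Python. It is not how dask works internally; the `graph` dictionary, the task names, and the `compute` helper are all hypothetical and only illustrate that describing the work and running it are two separate steps.

```python
# Toy illustration of lazy evaluation with a task graph (not dask's real machinery).
# Each key maps to a tuple: (function, *names_of_the_tasks_it_depends_on).
calls = []  # records which tasks actually ran, and in which order

def load():
    calls.append('load')
    return ['Reports to the director', 'Will supervise a team']

def lower(rows):
    calls.append('lower')
    return [r.lower() for r in rows]

def count(rows):
    calls.append('count')
    return len(rows)

graph = {
    'raw': (load,),
    'clean': (lower, 'raw'),
    'n_rows': (count, 'clean'),
}

def compute(graph, key):
    """Recursively evaluate `key`, running only the tasks it depends on."""
    func, *deps = graph[key]
    return func(*(compute(graph, d) for d in deps))

ran_before_compute = list(calls)      # still empty: building the graph ran nothing
result = compute(graph, 'n_rows')     # only now do load, lower, and count execute
```

Nothing is executed while `graph` is being defined; the three functions only run once `compute` is called, which mirrors why most cells in this notebook return almost instantly.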

The snippet below reads in all of the files in the directory specified above that match a pattern. Make sure to place the wildcard `"*"` in the appropriate spot, otherwise you will not be able to read in the data. In a glob pattern (the file-matching syntax used here, which is related to but distinct from regular expressions), a wildcard is a placeholder that matches any sequence of characters at that spot. For example, the `"*"` between `da` and `.csv` allows us to select all of the files that start with `da` and end in `.csv`.
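As a quick illustration of how the wildcard behaves, the sketch below matches a handful of made-up file names (they are assumptions, not the real files in the data directory) against `da*.csv`, and shows the regular expression that would select the same names.

```python
import fnmatch
import re

# Hypothetical file names, for illustration only
filenames = ['da_01.csv', 'da_02.csv', 'data_old.csv', 'db_01.csv', 'da_01.txt']

# Glob pattern: "*" stands for any run of characters (including none)
glob_matches = fnmatch.filter(filenames, 'da*.csv')

# The regular-expression equivalent needs ".*" and an escaped dot
regex = re.compile(r'da.*\.csv')
regex_matches = [f for f in filenames if regex.fullmatch(f)]
```

Both approaches select `da_01.csv`, `da_02.csv`, and `data_old.csv`. Note that the last one also starts with `da`, so a broad prefix can pick up files you did not intend to read.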

Parameters used:

- `engine='python'`: the default parser is written in `C` and, although it is faster, it is less feature-complete than the Python parser
- `dtype=dtypes`: our dictionary of data types above
- `assume_missing=True`: integer columns without an explicit data type are read as floats, since there might be some edge cases of missing values not taken care of in our previous step
- `error_bad_lines=False`: malformed lines (for example, rows with too many fields) are dropped with a warning instead of raising an exception. Note that pandas 1.3 deprecated this option in favor of `on_bad_lines`
- `blocksize=None`: By default, dask splits each file into blocks of bytes and reads every block as its own partition. Because some of our job descriptions contain large amounts of text, with commas and possibly line breaks inside quoted fields, a block boundary could fall in the middle of a record and the job text column could be misparsed. With `blocksize=None`, each file is read whole as a single partition. Luckily, since we created small enough files in the previous step, operations will still be fast.
- `usecols=dtypes.keys()`: only the columns named in our `dtypes` dictionary are read
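To see why free-text columns need careful parsing, the sketch below uses only Python's `csv` module and a made-up record whose text field contains commas and a line break. It does not reproduce dask's block-splitting logic; it only shows that cutting such data at arbitrary line boundaries breaks a record that a CSV-aware parser keeps whole.

```python
import csv
import io

# One header row plus one record whose quoted text field contains commas and a line break
raw = 'JobID,clean_text\n1,"Reports to the VP, Sales.\nWill supervise a team of 5, remotely."\n'

# Naive approach: cut on every newline, then on every comma
naive_rows = [line.split(',') for line in raw.strip().split('\n')]

# CSV-aware parsing keeps the quoted field in one piece
parsed_rows = list(csv.reader(io.StringIO(raw)))
```

The naive split produces three fragments with inconsistent field counts, while `csv.reader` returns the header and a single two-field record.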

In [26]:
ddf = dd.read_csv(os.path.join(path, 'da*.csv'), 
                 engine='python',
                 dtype=dtypes,
                 assume_missing=True,
                 error_bad_lines=False,
                 blocksize=None,
                 usecols=dtypes.keys(),
                )
ddf

Dask DataFrame Structure (npartitions=12, 43 columns, values not yet computed)
dtypes: float32 for Latitude, Longitude, MaxAnnualSalary, MaxHourlySalary, MinAnnualSalary, and MinHourlySalary; bool for InternshipFlag; object for the remaining 36 columns


Be careful when calling `.head()` or `.tail()` on large datasets. Both trigger actual computation, and depending on how much data has to be read, especially if it doesn't fit into memory, this could take anywhere from 2 to 30 minutes or so.

In [14]:
%%time

ddf.tail()

CPU times: user 7.33 s, sys: 7.07 s, total: 14.4 s
Wall time: 33.9 s


Unnamed: 0,CanonCity,CanonState,CleanJobTitle,JobDate,JobID,Latitude,Longitude,CanonPostalCode,CanonCounty,DivisionCode,LMA,MSA,CanonJobTitle,ConsolidatedONET,CanonIntermediary,InternshipFlag,Source,CanonSkillClusters,CanonSkills,CanonMaximumDegree,CanonMinimumDegree,CanonRequiredDegrees,CIPCode,MaxExperience,MinExperience,ConsolidatedInferredNAICS,BGTOcc,MaxAnnualSalary,MaxHourlySalary,MinAnnualSalary,MinHourlySalary,CanonJobHours,CanonJobType,CanonYearsOfExperienceCanonLevel,CanonYearsOfExperienceLevel,ConsolidatedDegreeLevels,ConsolidatedTitle,MaxDegreeLevel,MinDegreeLevel,BGTSubOcc,YearsOfExperience,EmployerClean,clean_text
98868,Los Angeles,CA,Director Of Education And Outreach,2007-12-31,339743939,34.048,-118.291,24470,Los Angeles,31084,DV063108|MT063110,31080: Metropolitan Statistical Area|348: Comb...,Director of Education,11915100,Unknown,False,Recruiter,Health Care: Mental Health Diseases and Disord...,"{""Alzheimer's Disease knowledge"": 'Health Care...",Master's,Master's,Unknown,Unknown,Unknown,Unknown,7111,11-9151.00,-9999.0,-9999.0,-9999.0,-9999.0,Unknown,Unknown,Unknown,Unknown,18,Director of Education,18,18,Social / Human Services Manager,Unknown,Center Theatre Group,email tiis posting to a friend please with ...
98869,Cypress,CA,Senior Healthcare Analyst - Sas Programming Sk...,2007-12-31,339756230,33.815399,-118.037003,25094,Orange,11244,DV064204|MT063110,31080: Metropolitan Statistical Area|348: Comb...,Unknown,29207100,Unknown,False,Company from Job Board,Health Care: Clinical Data Management;Speciali...,{'Clinical Data Review': 'Health Care: Clinica...,Unknown,Unknown,Unknown,Unknown,6,5,524113,29-2071.96,-9999.0,-9999.0,-9999.0,-9999.0,fulltime,permanent,1-6,mid,Unknown,"Senior Healthcare Analyst, Sas",Unknown,Unknown,Healthcare Analyst,5-6 years,UnitedHealth Group,Sr. Healthcare Analyst - SAS programming skill...
98870,Santa Clara,CA,Iridesse Sales Professional,2007-12-31,339820992,37.352901,-121.953003,29516,Santa Clara,Unknown,MT064194,41940: Metropolitan Statistical Area,Sales Professional,41203100,Unknown,False,Company from Job Board,Marketing and Public Relations: Customer Relat...,{'Client Base Retention': 'Marketing and Publi...,Unknown,Unknown,Unknown,Unknown,Unknown,2,448310,41-2031.00,-9999.0,-9999.0,-9999.0,-9999.0,Unknown,Unknown,1-6,mid,Unknown,Sales Professional,Unknown,Unknown,Retail Sales Associate (General),minimum of 2 years,Tiffany & Co.,Careers Iridesse Sales Professional Req #: ...
98871,San Francisco,CA,"Director, Distribution Center Operations",2007-12-31,339820871,37.7798,-122.417,28639,San Francisco,41884,DV064188|MT064186,41860: Metropolitan Statistical Area|488: Comb...,Unknown,11102100,Unknown,False,Company from Job Board,Finance: Budget Management;Specialized Skills|...,{'Budgeting': 'Finance: Budget Management;Spec...,Unknown,Bachelor's,Bachelor's,Unknown,Unknown,13,448140,11-1021.91,-9999.0,-9999.0,-9999.0,-9999.0,fulltime,permanent,6+,high,16,"Director, Distribution,Operations",Unknown,16,Retail Operations Supervisor,10+ years|At least 5 years,Levi Strauss,"Job Search Engine - Job Blogs Friday, October..."
98872,Miami,FL,Financial Services Professional,2007-12-31,360306555,25.794901,-80.1278,-32377,Miami-Dade,33124,DV123312|MT123310,33100: Metropolitan Statistical Area|370: Comb...,Unknown,41303102,Unknown,False,Job Board,Sales: Solution Sales Engineering;Specialized ...,{'Consultative Sales': 'Sales: Solution Sales ...,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown,6233,41-3031.00,-9999.0,-9999.0,-9999.0,-9999.0,Unknown,Unknown,Unknown,Unknown,Unknown,Financial Services Professional,Unknown,Unknown,Financial Services Representative,Unknown,Brookdale Senior Living,- Levin Financial Group of Massmutual - Tamp...


## Measure 1

- **Measure 1:** The new version of the relative usage of upward and downward terms.

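The lines below flag a job ad if any one of the keywords appears in it: the pipes (`|`) act as a logical OR between the alternatives. Notice the spaces around each keyword between the pipes. Because the spaces are part of the pattern, a keyword such as `intern` only matches as a standalone word and not as part of `international`. One side effect is that a keyword at the very start or end of the text, or directly next to punctuation, is not matched.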

In [15]:
downward = ddf['clean_text'].str.lower().str.contains(' will supervise | supervising | guiding | mentoring | leading | lead | overseeing | will guide | be in charge of | mentor | coaching | mentoring | coordinating | building teams | build team | guiding | advising | setting performance standard | sets performance standard | resolving conflict | resolves conflict | responsibility for outcomes | responsible for outcomes | directing | appointing | instructing | recruiting | managing | approve | approving | assign | assigning | delegate | delegating | control | controlling | review | reviewing | arbitrate | arbitrating | command | commanding | govern | governing ', regex=True)
upward = ddf['clean_text'].str.lower().str.contains(' reports to | report to | reporting to | answers to | answer to | managed by | responds to | respond to | directed by | receives guidance | receive guidance | supervised by | assists | assist | support | supports | supporting | helps | help | helping ', regex=True)

ddf0 = ddf.assign(downward=downward, upward=upward)
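To see what the space padding in the patterns above does and does not match, here is a small sketch with Python's `re` module. The sample sentences are made up, and the word-boundary pattern (`\b`) is shown only as a possible alternative, not as what this notebook uses.

```python
import re

padded = re.compile(r' intern | lead ')         # space-padded, in the style used above
bounded = re.compile(r'\b(?:intern|lead)\b')    # word-boundary alternative (an assumption)

texts = [
    'we are an international firm',     # keyword only appears inside a longer word
    'you will lead a small team',       # keyword surrounded by spaces
    'lead the project, then mentor',    # keyword at the very start of the text
    'summer intern.',                   # keyword followed by punctuation
]

padded_hits = [bool(padded.search(t)) for t in texts]
bounded_hits = [bool(bounded.search(t)) for t in texts]
```

Both patterns correctly ignore `international`, but only the word-boundary version catches the keyword at the start of a text or next to punctuation. Whether that matters depends on how `clean_text` was prepared in the previous notebooks.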

We will also create two lists with the words above and use them to create the dummies for our dataframes.

In [16]:
down_words = [' will supervise ', ' supervising ', ' guiding ', ' mentoring ', ' leading ',
              ' lead ', ' overseeing ', ' will guide ', ' be in charge of ', ' mentor ', 
              ' coaching ', ' mentoring ', ' coordinating ', ' building teams ', ' build team ', 
              ' guiding ', ' advising ', ' setting performance standard ', ' sets performance standard ',
              ' resolving conflict ', ' resolves conflict ', ' responsibility for outcomes ', 
              ' responsible for outcomes ', ' directing ', ' appointing ', ' instructing ',
              ' recruiting ', ' managing ', ' approve ', ' approving ', ' assign ', ' assigning ',
              ' delegate ', ' delegating ', ' control ', ' controlling ', ' review ', ' reviewing ',
              ' arbitrate ', ' arbitrating ', ' command ', ' commanding ', ' govern ', ' governing ']

up_words = [' reports to ', ' report to ', ' reporting to ', ' answers to ', ' answer to ', 
            ' managed by ', ' responds to ', ' respond to ', ' directed by ', ' receives guidance ',
            ' receive guidance ', ' supervised by ', ' assists ', ' assist ', ' support ', 
            ' supports ', ' supporting ', ' helps ', ' help ', ' helping ']
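Since the same keywords appear both in the long pattern strings and in these lists, one option is to build the pattern from the list so the two cannot drift apart. The sketch below uses a shortened, hypothetical list (`down_words_demo`) and plain `re`; it is one possible approach rather than what the notebook does.

```python
import re

# Shortened, hypothetical keyword list that contains a repeated entry
down_words_demo = [' will supervise ', ' mentoring ', ' lead ', ' mentoring ']

# Drop duplicates while keeping the original order
unique_words = list(dict.fromkeys(down_words_demo))

# Escape each keyword and join the alternatives with "|"
pattern = '|'.join(re.escape(w) for w in unique_words)

has_downward = bool(re.search(pattern, 'you will supervise two analysts'))
```

The resulting `pattern` string could be passed to `.str.contains(pattern, regex=True)` in the same way as the hand-written patterns.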

In some functions, I declare the data type explicitly to make it easier for any user to understand what goes in and what comes out of the function.

In [17]:
def get_indicators(data: pd.DataFrame, column: str, words: List[str]) -> pd.DataFrame:
    """
    This function checks for the existence of each keyword in a column of a dataframe,
    creates a boolean dummy variable for it, and adds it back into the dataframe.
    """
    data = data.copy()  # work on a copy so the input partition is not modified in place
    text = data[column].str.lower()  # lowercase once instead of once per keyword
    for word in words:
        # name the new column after the keyword; True if the keyword was found
        data[word.strip()] = text.str.contains(word, regex=False)

    return data

Dask has a very useful method called `.map_partitions()` that applies a function to each partition of the dask dataframe, treating each partition as a regular pandas dataframe. We pass in the function itself (without parentheses) along with its keyword arguments, and we leave out the `data` argument since dask supplies each partition (i.e. a small pandas dataframe) for it.

In [18]:
ddf1 = ddf0.map_partitions(get_indicators, column='clean_text', words=down_words)
ddf2 = ddf1.map_partitions(get_indicators, column='clean_text', words=up_words)
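Conceptually, mapping a function over partitions amounts to applying it to each chunk of rows and stitching the results back together. The sketch below emulates that with plain pandas on a tiny made-up dataframe; the `add_flag` helper and the manual row slices are hypothetical stand-ins, and real dask does this lazily and in parallel.

```python
import pandas as pd

def add_flag(part: pd.DataFrame, column: str, word: str) -> pd.DataFrame:
    """Return a copy of one partition with a boolean column for `word`."""
    part = part.copy()
    part[word.strip()] = part[column].str.lower().str.contains(word, regex=False)
    return part

df = pd.DataFrame({'clean_text': ['You will lead a team', 'Reports to the CFO',
                                  'Please lead by example', 'No keywords here']})

# Emulate two partitions by slicing rows, apply the function to each, then glue them back
partitions = [df.iloc[:2], df.iloc[2:]]
result = pd.concat([add_flag(p, column='clean_text', word=' lead ') for p in partitions])
```

Because each partition is handled independently, the function should only rely on information available within a single chunk of rows.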

## Measure 2

Let's first clean the lists of words above so that they match the names of the dummy columns in our dask dataframe.

In [19]:
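# strip the padding spaces; dict.fromkeys drops repeated keywords (e.g. 'mentoring', 'guiding')
# while keeping their order, so that no column is selected, and therefore counted, twice
up_stripped = list(dict.fromkeys(w.strip() for w in up_words))
down_stripped = list(dict.fromkeys(w.strip() for w in down_words))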

We will then sum up both sets of dummy columns to get a sense of how many distinct keywords were spotted in each job description. We will then assign the new arrays back into our dask dataframe.

In [20]:
up_instances = ddf2.loc[:, up_stripped].sum(axis=1)
down_instances = ddf2.loc[:, down_stripped].sum(axis=1)

ddf3 = ddf2.assign(up_instances=up_instances, down_instances=down_instances)
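Summing boolean dummies counts how many different keywords appear, not how often each one appears. The plain-Python sketch below makes the distinction explicit for a single made-up job text and a shortened, hypothetical keyword list.

```python
keywords = [' assist ', ' support ', ' help ']   # shortened, hypothetical list
text = 'you will assist the team, assist clients and support the manager '

# One boolean per keyword, mirroring the dummy columns
flags = {kw.strip(): kw in text for kw in keywords}

# Summing booleans counts how many distinct keywords appear (True counts as 1)
distinct_hits = sum(flags.values())

# The total number of occurrences is a different quantity
total_occurrences = sum(text.count(kw) for kw in keywords)
```

Here `assist` shows up twice in the text but contributes only once to `distinct_hits`, which is the quantity that `up_instances` and `down_instances` capture.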

We will now create two functions that will help us extract the text that follows our keywords. Notice that the `num_chars` parameter below defaults to 60, so the extracted snippet is 60 characters long, starting at the keyword itself. You can change it to a different value to get more or fewer words following the keyword.

In [21]:
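def get_words(word: str, string: str, num_chars: int = 60) -> Union[str, None]:
    """
    This function retrieves a snippet of `num_chars` characters, starting at the first
    occurrence of a keyword spotted in a string (so the snippet includes the keyword
    itself). It returns None if the keyword is absent. The default number of characters is 60.
    """
    start = string.find(word)  # -1 when the keyword does not appear
    if start == -1:
        return None
    return string[start:start + num_chars]

def get_some_text(data: pd.DataFrame, column: str, list_of_words: List[str]) -> pd.DataFrame:
    """
    This function extends get_words by storing the snippet found for each keyword in a
    column named after that keyword. Note that this reuses the names of the dummy columns
    from Measure 1, so those boolean columns are replaced by the extracted text.
    """
    data = data.copy()  # avoid modifying the input partition in place
    text = data[column].str.lower()  # lowercase once instead of once per keyword
    for word in list_of_words:
        data[word.strip()] = text.apply(lambda x: get_words(word, x))
    return data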
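Since Measure 2 is phrased in terms of the next 7 words rather than a number of characters, a word-based variant may be closer to the stated goal. The helper below, `next_n_words`, is a hypothetical sketch and not part of the original notebook; it returns the words that come after the keyword, excluding the keyword itself.

```python
from typing import Optional

def next_n_words(keyword: str, text: str, n: int = 7) -> Optional[str]:
    """Return the `n` words that follow the first occurrence of `keyword`
    in `text`, or None if the keyword is absent."""
    start = text.find(keyword)
    if start == -1:
        return None
    following = text[start + len(keyword):].split()
    return ' '.join(following[:n])

ad = 'the analyst reports to the director of finance and will assist the controller daily'
after_reports = next_n_words(' reports to ', ad)
missing = next_n_words(' supervised by ', ad)
```

If fewer than `n` words remain after the keyword, the helper simply returns whatever is left.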

We will now map our functions above to our dataframe partitions.

In [22]:
ddf4 = ddf3.map_partitions(get_some_text, column='clean_text', list_of_words=down_words)
ddf5 = ddf4.map_partitions(get_some_text, column='clean_text', list_of_words=up_words)
# ddf5.head()

## Save all Files

The following function will help you save your results as csv files. It lets you:
- choose between one file or many files for the output of your measure
- create a new directory for this output, based on the `path_out` variable provided at the beginning of this notebook
- add a name for your file

In [23]:
def save_csv_files(new_dir_name, data, new_file_name, pandas_or_dask=True, partitions=None):
    """
    Save `data` under path_out/new_dir_name. If pandas_or_dask is True, the dask dataframe
    is computed into pandas and written as a single csv; otherwise it is repartitioned
    into `partitions` pieces and written as one csv per partition.
    """
    # expand "~" explicitly, since os functions treat it as a literal directory name
    out_dir = os.path.expanduser(os.path.join(path_out, new_dir_name))
    os.makedirs(out_dir, exist_ok=True)

    if pandas_or_dask:
        data = data.compute()
        data.to_csv(os.path.join(out_dir, f'{new_file_name}.csv'), index=False)
    else:
        # the following lines of code will take the last dataset, repartition it,
        # and save it to the desired location. Notice the wildcard "*" below. That is
        # the spot Dask will use to number your files starting from 0
        (data
         .repartition(npartitions=partitions)
         .to_csv(os.path.join(out_dir, f'{new_file_name}*.csv'), index=False)
         )
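One detail worth checking when output paths start with `~`: functions in the `os` module do not expand it on their own. The sketch below shows the expansion step and a repeat-safe way to create the output directory; it writes only inside a temporary directory, and the folder name is just an example.

```python
import os
import tempfile

# "~" is shorthand for the home directory; os functions treat it literally unless expanded
raw_path = os.path.join('~', 'Desktop', 'measure_2')
expanded = os.path.expanduser(raw_path)

# exist_ok=True makes directory creation safe to repeat (demonstrated in a temp dir)
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'measure_2')
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)   # second call is a no-op, not an error
    created = os.path.isdir(out_dir)
```

Without the expansion, `os.makedirs` would create a folder literally named `~` inside the current working directory.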

In [None]:
%%time

save_csv_files(new_dir_name='measure_2/', data=ddf5, new_file_name='keywords_', pandas_or_dask=False, partitions=5)