# Data Analysis at Scale on Filtered Data

The code in this notebook computes the following two measures.

- **Measure 1:** The new version of the relative usage of upward and downward terms.
- **Measure 2:** Does the job ad list another job title and tell us who reports to whom?
    1. This measure takes every keyword from Measure 1 and exports the next 7 words that follow it in the job ad, if any (returning NaN otherwise), as a separate column, one for each of the keyword dummies.

### Things to keep in mind before running the code in this notebook.

- This notebook assumes you already ran the notebook called "06_dask_get_companies.ipynb", which produces the filtered dataset from the cleaned sample. Alternatively, you already have access to this dataset.
- You will be using dask dataframes for distributed computing on your local machine. You can think of these dataframes as lazy pandas dataframes (no pun intended).
- This notebook generates quite a few files at the end. Depending on how you modify it and decide to use it moving forward, make sure to tweak the call to the saving function at the end of the notebook so that only the files you want are written.

In [19]:
import dask, dask.dataframe as dd, dask.array as da
from dask.diagnostics import ProgressBar
import matplotlib.pyplot as plt
import pandas as pd
import re, csv, os
import numpy as np
from dask import delayed, persist
from dask.distributed import Client
from glob import glob
from typing import List, Union


pd.set_option('display.max_columns', None)
csv.field_size_limit(10000000)

%matplotlib inline

Set the `path` variable below to the directory where your filtered data lives, and `path_out` to the directory where the results should be written.

In [36]:
path = '/Users/ramonperez/Dropbox/Burning Glass/Data/companies_76k/filtered_data_10/'
path_out = '/Users/ramonperez/Dropbox/Burning Glass/Analysis/approach_8/data_10'

The list and dictionary below contain the names of the clean variables from the previous steps and the data types we will use to read them in.

In [37]:
col_names = ['JobID', 'CleanJobTitle', 'CanonCity', 'CanonState', 'Source', 'Latitude', 
             'Longitude', 'CanonJobTitle', 'CanonCounty', 'MSA', 'LMA', 'InternshipFlag',
             'ConsolidatedONET', 'CanonSkillClusters', 'CanonSkills', 'CanonMinimumDegree',
             'CanonRequiredDegrees', 'MinExperience', 'ConsolidatedInferredNAICS', 'BGTOcc',
             'YearsOfExperience', 'CanonJobHours', 'CanonJobType', 'CanonPostalCode', 
             'CanonYearsOfExperienceCanonLevel', 'CanonYearsOfExperienceLevel', 'ConsolidatedTitle',
             'BGTSubOcc', 'ConsolidatedDegreeLevels', 'MinDegreeLevel', 'EmployerClean',
             'clean_text', 'JobDate']

# The builtin `str` replaces `np.str`, a deprecated alias that was removed in NumPy 1.24
dtypes={'CanonSkills': str, 'Latitude': np.float32, 'JobID': str, 'CanonJobTitle': str,
        'CanonYearsOfExperienceLevel': str, 'Longitude': np.float32, 'CanonJobType': str, 
        'CleanJobTitle': str, 'ConsolidatedInferredNAICS': str, 'CanonRequiredDegrees': str,
        'YearsOfExperience': str, 'CanonCity': str, 'CanonCounty': str, 'CanonJobHours': str,
        'CanonState': str, 'ConsolidatedONET': str, 'MSA': str, 'CanonMinimumDegree': str,
        'ConsolidatedDegreeLevels': str, 'BGTSubOcc': str, 'ConsolidatedTitle': str,
        'CanonSkillClusters': str, 'Language': str, 'JobDate': str,
        'MinDegreeLevel': str, 'LMA': str, 'MinExperience': str, 'CanonPostalCode': str,
        'InternshipFlag': np.bool_, 'Source': str, 'BGTOcc': str, 'CanonYearsOfExperienceCanonLevel': str}

Starting with the cell below, we will build a directed acyclic graph (DAG) with dask. This means that almost no computation happens until the very end of the notebook.

The snippet below reads in all the files in the directory specified above that match the wildcard pattern. Make sure to place the wildcard (`*`) in the appropriate spot; otherwise you will not be able to read in the data.

Parameters used:

- `engine='python'`: use the Python parsing engine, which is slower than the default C engine but more feature-complete.
- `dtype=dtypes`: our dictionary of data types above.
- `assume_missing=True`: integer columns without an explicit data type are read as floats so that they can hold missing values. There might be some edge cases of missing values not taken care of in our previous step.
- `error_bad_lines=False`: malformed lines (for example, rows with too many fields) are skipped instead of raising an error, and by default a warning is issued for each skipped line.
- `blocksize=None`: By default, dask splits each file into blocks of bytes and reads each block as a separate partition. Because some of the job descriptions contain quite large amounts of text per observation, a block boundary can fall in the middle of a long text value and cause the row to be misparsed. With `blocksize=None`, each file is read whole as a single partition. Luckily, since the previous step produced small enough files, reading full files is still fast.
- `usecols=col_names`: our list of columns above.
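
In pandas 1.3 and later, `error_bad_lines` is deprecated in favor of `on_bad_lines`, and it was removed in pandas 2.0. Since dask forwards these keyword arguments to pandas, the call may need adjusting on newer installations. The following pandas-only sketch shows the replacement argument on made-up inline data (the `raw` string is purely illustrative and not taken from the dataset):

```python
import io
import pandas as pd

# Three records; the second has one field too many and counts as a "bad line".
raw = "JobID,clean_text\n1,first ad\n2,second ad,unexpected extra field\n3,third ad\n"

# on_bad_lines='skip' drops malformed rows instead of raising an error;
# on_bad_lines='warn' drops them too, but also reports each dropped row.
df = pd.read_csv(io.StringIO(raw), engine="python", dtype={"JobID": str},
                 on_bad_lines="skip")
print(df)
```

The same keyword, `on_bad_lines='warn'` or `on_bad_lines='skip'`, can be passed to `dd.read_csv` in place of `error_bad_lines=False`.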

In [38]:
ddf = dd.read_csv(os.path.join(path, 'da*.csv'), 
                 engine='python',
                 dtype=dtypes,
                 assume_missing=True,
                 error_bad_lines=False,
                 blocksize=None,
                 usecols=col_names,
                )
ddf

Dask DataFrame Structure (npartitions=50; nothing has been computed yet, so only column names and dtypes are shown):
column,CanonCity,CanonState,CleanJobTitle,JobDate,JobID,Latitude,Longitude,CanonPostalCode,CanonCounty,LMA,MSA,CanonJobTitle,ConsolidatedONET,InternshipFlag,Source,CanonSkillClusters,CanonSkills,CanonMinimumDegree,CanonRequiredDegrees,MinExperience,ConsolidatedInferredNAICS,BGTOcc,CanonJobHours,CanonJobType,CanonYearsOfExperienceCanonLevel,CanonYearsOfExperienceLevel,ConsolidatedDegreeLevels,ConsolidatedTitle,MinDegreeLevel,BGTSubOcc,YearsOfExperience,EmployerClean,clean_text
dtype,object,object,object,object,object,float32,float32,object,object,object,object,object,object,bool,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object


Be careful when checking the `.head()` or `.tail()` of the datasets you will be working with. Depending on how much data you are trying to view, especially if it doesn't fit into memory, this could take anywhere from 2 minutes to an hour or so.

In [39]:
%%time

ddf.tail()

CPU times: user 2.22 s, sys: 912 ms, total: 3.14 s
Wall time: 3.9 s


Unnamed: 0,CanonCity,CanonState,CleanJobTitle,JobDate,JobID,Latitude,Longitude,CanonPostalCode,CanonCounty,LMA,MSA,CanonJobTitle,ConsolidatedONET,InternshipFlag,Source,CanonSkillClusters,CanonSkills,CanonMinimumDegree,CanonRequiredDegrees,MinExperience,ConsolidatedInferredNAICS,BGTOcc,CanonJobHours,CanonJobType,CanonYearsOfExperienceCanonLevel,CanonYearsOfExperienceLevel,ConsolidatedDegreeLevels,ConsolidatedTitle,MinDegreeLevel,BGTSubOcc,YearsOfExperience,EmployerClean,clean_text
26980,Kansas City,KS,Senior Project Manager - Telecommunications,2010-12-31,299596976,39.118599,-94.625397,66101,Wyandotte,MT292814,28140: Metropolitan Statistical Area,Unknown,15-119909,False,Job Board,Finance: Budget Management;Specialized Skills|...,{'Budgeting': 'Finance: Budget Management;Spec...,Unknown,Unknown,Unknown,Unknown,15-1199.95,Unknown,Unknown,Unknown,Unknown,Unknown,"Senior Project Manager, Telecommunications",Unknown,IT Project Manager,Unknown,Techrp,Job Details Sr. Project Manager- Telecommunica...
26981,Buffalo,NY,Associate Manager,2010-12-31,299940327,42.788399,-78.826897,14219,Erie,MT361538,15380: Metropolitan Statistical Area,Unknown,41-101100,False,Company,Finance: General Accounting;Specialized Skills...,{'Accounting': 'Finance: General Accounting;Sp...,Higher Secondary Certificate,Bachelor's|Higher Secondary Certificate,1,448140,41-1011.00,fulltime,permanent,0-1,low,16|12,Associate Manager,12,Retail Associate Manager,Minimum of one year,American Eagle Outfitters,Posting Job Title: Associate Manager Brand: ...
26982,Stuart,FL,Litigation Attorney,2010-12-31,307235633,27.192499,-80.2537,34994,Martin,MT123894,38940: Metropolitan Statistical Area,Litigation Attorney,23-101100,False,Job Board,Legal: Litigation;Specialized Skills|Specializ...,{'Litigation': 'Legal: Litigation;Specialized ...,Unknown,Unknown,3,5411,23-1011.00,fulltime,permanent,1-6,mid,Unknown,Litigation Attorney,Unknown,Attorney,2-5 Years|3-5 years,Peterson Bernard,Litigation Attorney: Peterson Bernard | Job I...
26983,Hopkins,MN,Senior Business Intelligence,2010-12-31,307056813,44.929501,-93.427696,55343,Hennepin,MT273346,33460: Metropolitan Statistical Area,Unknown,15-119908,False,Job Board,Information Technology: Version Control;Specia...,{'Apache Subversion (SVN)': 'Information Techn...,Bachelor's,Bachelor's,5,524113,15-1199.93,Unknown,Unknown,1-6,mid,16,Senior Business Intelligence,16,Business Intelligence Analyst,preferredAt least 5 years|5 years|3 years|5 years,UnitedHealth Group,"ManufacturingCrossing 12/27/2010 - Minnetonka,..."
26984,Dover,DE,"Techno, Cardiac Cath",2010-12-31,299825686,39.158699,-75.517403,19905,Kent,MT102010,20100: Metropolitan Statistical Area,Unknown,29-203100,False,Company,Health Care: Radiology;Specialized Skills|Heal...,{'Catheterization Laboratory (CATH LAB)': 'Hea...,Unknown,Unknown,3,622110,29-2031.00,fulltime,permanent,1-6,mid,Unknown,"Techno, Cardiac Cath",Unknown,Cardiovascular Technician / Technologist,2 years|3 years,Bayhealth Medical Center,Department: Cardiac Cath Lab Schedule: Full ti...


## Measure 1

- **Measure 1:** The new version of the relative usage of upward and downward terms.

The two lines below flag a job ad when its text contains at least one of the listed keywords; each pipe (`|`) acts as a logical OR between keywords. Notice the spaces around each keyword: they make the pattern match whole words only, so that, for example, the word `intern` would not be matched inside `international`.

In [5]:
downward = ddf['clean_text'].str.lower().str.contains(' will supervise | supervising | guiding | mentoring | leading | lead | overseeing | will guide | be in charge of | mentor | coaching | mentoring | coordinating | building teams | build team | guiding | advising | setting performance standard | sets performance standard | resolving conflict | resolves conflict | responsibility for outcomes | responsible for outcomes | directing | appointing | instructing | recruiting | managing | approve | approving | assign | assigning | delegate | delegating | control | controlling | review | reviewing | arbitrate | arbitrating | command | commanding | govern | governing ', regex=True)
upward = ddf['clean_text'].str.lower().str.contains(' reports to | report to | reporting to | answers to | answer to | managed by | responds to | respond to | directed by | receives guidance | receive guidance | supervised by | assists | assist | support | supports | supporting | helps | help | helping ', regex=True)

ddf0 = ddf.assign(downward=downward, upward=upward)
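
One caveat of padding keywords with spaces is that a keyword at the very start or end of the text, or one followed by punctuation (for example `reports to:`), is not matched. A minimal sketch of one alternative follows, assuming regular-expression word boundaries (`\b`) are acceptable for this measure. `build_pattern` and the sample strings are hypothetical and not part of the original notebook.

```python
import re
from typing import List

def build_pattern(words: List[str]) -> str:
    """Join keywords into a single alternation wrapped in word boundaries."""
    # Strip the padding spaces, drop duplicates while keeping order, and
    # escape each keyword so that it is matched literally.
    unique = list(dict.fromkeys(w.strip() for w in words))
    return r"\b(?:" + "|".join(re.escape(w) for w in unique) + r")\b"

sample_words = [" reports to ", " assist ", " lead "]  # illustrative subset
pattern = build_pattern(sample_words)

print(bool(re.search(pattern, "reports to: the director")))  # keyword at start, then punctuation
print(bool(re.search(pattern, "the assistant manager")))     # "assist" only inside a longer word
print(bool(re.search(pattern, "you will lead.")))            # keyword right before a period
```

The resulting string can be passed to `.str.contains(pattern, regex=True)` in the same way as the space-padded pattern above; the non-capturing group `(?:...)` avoids introducing match groups.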

We will also create two lists with the same keywords and use them to create one dummy per keyword in our dataframes.

In [6]:
down_words = [' will supervise ', ' supervising ', ' guiding ', ' mentoring ', ' leading ',
              ' lead ', ' overseeing ', ' will guide ', ' be in charge of ', ' mentor ', 
              ' coaching ', ' mentoring ', ' coordinating ', ' building teams ', ' build team ', 
              ' guiding ', ' advising ', ' setting performance standard ', ' sets performance standard ',
              ' resolving conflict ', ' resolves conflict ', ' responsibility for outcomes ', 
              ' responsible for outcomes ', ' directing ', ' appointing ', ' instructing ',
              ' recruiting ', ' managing ', ' approve ', ' approving ', ' assign ', ' assigning ',
              ' delegate ', ' delegating ', ' control ', ' controlling ', ' review ', ' reviewing ',
              ' arbitrate ', ' arbitrating ', ' command ', ' commanding ', ' govern ', ' governing ']

up_words = [' reports to ', ' report to ', ' reporting to ', ' answers to ', ' answer to ', 
            ' managed by ', ' responds to ', ' respond to ', ' directed by ', ' receives guidance ',
            ' receive guidance ', ' supervised by ', ' assists ', ' assist ', ' support ', 
            ' supports ', ' supporting ', ' helps ', ' help ', ' helping ']

In [7]:
def get_indicators(data: pd.DataFrame, column: str, words: List[str]) -> pd.DataFrame:
    """
    This function checks for the existence of each keyword in a column of a dataframe,
    creates a boolean dummy variable for it, and adds it back into the dataframe.
    """
    text = data[column].str.lower()  # lowercase once instead of once per keyword
    for word in words:
        # name the column after the stripped keyword; True if the keyword was found
        data[word.strip()] = text.str.contains(word, regex=False)
    return data

Dask has a very useful method called `.map_partitions()` that applies a function to each partition of the dask dataframe while treating these partitions as pandas dataframes. We pass in the function itself (without parentheses) together with its keyword arguments; dask supplies each partition as the first (`data`) argument, so we do not pass anything for it.

In [8]:
ddf1 = ddf0.map_partitions(get_indicators, column='clean_text', words=down_words)
ddf2 = ddf1.map_partitions(get_indicators, column='clean_text', words=up_words)

## Measure 2

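Let's first strip the padding spaces from the keywords above so that they match the names of the dummy columns in our dataframe. We also drop repeated keywords (`mentoring` and `guiding` appear twice in `down_words`), since selecting the same column twice would count it twice in the sums below.

In [9]:
# dict.fromkeys removes duplicates while preserving the original order
up_stripped = list(dict.fromkeys(w.strip() for w in up_words))
down_stripped = list(dict.fromkeys(w.strip() for w in down_words))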

We will then sum the dummy columns in each set to get a sense of how many of these keywords were spotted in a job description. We will then assign the two new columns back into our dask dataframe.

In [10]:
up_instances = ddf2.loc[:, up_stripped].sum(axis=1)
down_instances = ddf2.loc[:, down_stripped].sum(axis=1)
ddf3 = ddf2.assign(up_instances=up_instances, down_instances=down_instances)

In [11]:
def get_words(word: str, string: str, num_chars: int=60) -> Union[str, None]:
    """
    This function retrieves a window of characters from a piece of string, starting
    at the first occurrence of a keyword (the keyword itself is included). The default
    number of characters is 60. The match is case-sensitive, and None is returned
    when the keyword is not found.
    """
    
    start = string.find(word)  # -1 when the keyword is absent
    if start == -1:
        return None
    return string[start:start + num_chars]

def get_some_text(data: pd.DataFrame, column: str, list_of_words: List[str]) -> pd.DataFrame:
    """
    This function extends get_words by applying it to every row of a column, once per
    keyword, and storing the detected text in a column named after the stripped keyword.
    """
    
    for word in list_of_words:
        data[word.strip()] = data[column].apply(lambda x: get_words(word, x))
    return data
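
Measure 2 is described in terms of the next 7 words after a keyword, whereas `get_words` returns a fixed window of characters that starts at the keyword itself. A minimal sketch of a word-based variant follows. `get_next_words` is a hypothetical helper, not part of the original notebook, and it assumes that whitespace-separated tokens are an adequate definition of a word.

```python
from typing import Optional

def get_next_words(word: str, text: str, num_words: int = 7) -> Optional[str]:
    """Return up to `num_words` words that follow the first occurrence of `word`."""
    # Non-string values (for example NaN from a missing job description) yield None.
    if not isinstance(text, str):
        return None
    start = text.find(word)
    if start == -1:
        return None  # keyword not present
    # Take everything after the keyword and split it on whitespace.
    following = text[start + len(word):].split()
    return " ".join(following[:num_words])

example = "the analyst reports to the senior director of finance and strategy in boston"
print(get_next_words(" reports to ", example))
```

It takes the same first two arguments as `get_words`, so it could be swapped into `get_some_text` if a word-based window is preferred over a character-based one.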

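We will now map our functions above to our dataframe partitions. Note that the new text columns reuse the names of the boolean dummies created for Measure 1, so in `ddf5` each dummy is replaced by the text extracted for that keyword; the `up_instances` and `down_instances` counts are unaffected.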

In [12]:
ddf4 = ddf3.map_partitions(get_some_text, column='clean_text', list_of_words=down_words)
ddf5 = ddf4.map_partitions(get_some_text, column='clean_text', list_of_words=up_words)
# ddf5.head()

## Save all Files

The following function saves your output as CSV and lets you:
- choose between a single file and many files (one per partition) for the output of your measure
- create a new directory for this output, inside the `path_out` provided at the beginning of this notebook
- add a name for your file

In [50]:
def save_csv_files(new_dir_name, data, new_file_name, pandas_or_dask=True, partitions=None):
    """
    Save `data` inside `path_out/new_dir_name`. If `pandas_or_dask` is True, the dask
    dataframe is computed and written as a single csv file; otherwise it is
    repartitioned into `partitions` pieces and written as one csv file per partition.
    """
    out_dir = os.path.join(path_out, new_dir_name)
    os.makedirs(out_dir, exist_ok=True)  # create the output directory if it is missing

    if pandas_or_dask:
        # compute() loads the full result into memory as a pandas dataframe
        data = data.compute()
        data.to_csv(os.path.join(out_dir, f'{new_file_name}.csv'), index=False)
    else:
        # the following lines of code will take the last dataset, repartition it,
        # and save it to the desired location. Notice the wildcard "*" below. That is
        # the spot Dask will use to number your files starting from 0
        (data
         .repartition(npartitions=partitions)
         .to_csv(os.path.join(out_dir, f'{new_file_name}*.csv'), index=False)
         )

In [51]:
%%time

save_csv_files(new_dir_name='measure_2/', 
               data=ddf5,
               new_file_name='keywords_',
               pandas_or_dask=False, 
               partitions=25)

CPU times: user 13min 46s, sys: 12min 23s, total: 26min 9s
Wall time: 42min 37s
