# Data Science for Good: City of Los Angeles

This notebook analyses the data given in the "Data Science for Good: City of Los Angeles" competition. The problem statement given on the [competition website](https://www.kaggle.com/c/data-science-for-good-city-of-los-angeles) notes the following main challenge:

> The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly-specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

To this end, the following goals are stated in the competition:
> The goal is to convert a folder full of plain-text job postings into a single structured CSV file and then to use this data to: (1) identify language that can negatively bias the pool of applicants; (2) improve the diversity and quality of the applicant pool; and/or (3) make it easier to determine which promotions are available to employees in each job class.

This notebook is structured around the four parts of the problem statement above:
* Task 1: creation of a single structured CSV file
* Task 2: identification of language that can negatively bias the pool of applicants
* Task 3: improvement of diversity and quality of the applicant pool
* Task 4: determining available promotions for employees in each job class

### Preparatory statements

Installations and imports.

In [None]:
# installations
!python -m spacy download en_core_web_md
!pip install multidict
!pip install vaderSentiment
!pip install networkx

In [None]:
# importing required (general) packages
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt 

import os

import re
import string

from graphviz import Digraph

from IPython.display import HTML, display
import tabulate
import ipywidgets as widgets

In [None]:
# increase plot size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 10)

## Task 1: Create a single structured CSV

In the first part, we will create a single structured CSV file.

To create the CSV, this section performs the following steps:

* read all files
* process all files and extract the required information in separate functions
* create data frame

The code below follows these steps to create a homogeneous CSV that conforms to the given data dictionary and the provided example (`sample job class export template.csv`).

### Read files

Methods for reading files.

In [None]:
bulletin_path = "../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Job Bulletins/"
files = os.listdir(bulletin_path)

print("There are ", len(files), " job bulletins available.")

In [None]:
def open_bulletin(file):
    with open(bulletin_path + file, encoding='utf-8', errors='ignore') as f:
        return "".join([line for line in f.readlines() if line.strip()])

In [None]:
titles_path = "../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Additional data/job_titles.csv"
with open(titles_path) as f:
       job_titles = [line.lower().strip() for line in f.readlines() if line.strip()]

### Process files

Methods for processing files and helper methods for extracting all the required information.
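
As a rough illustration of the normalization performed in this section, the sketch below splits a requirements block into numbered sets ("1.", "2.", ...) and lettered subsets ("a.", "b.", ...). The sample text and the helper name `split_requirements` are hypothetical and only approximate the behaviour of the full methods; real bulletins are less regular.

```python
import re

# Hypothetical requirements block, written to resemble a bulletin
sample = (
    "1. Two years of full-time paid experience as a Management Assistant; or\n"
    "2. A bachelor's degree from an accredited four-year college and one of the following:\n"
    "a. One year of experience in budgeting; or\n"
    "b. Completion of four courses in accounting."
)


def split_requirements(block):
    """Return (set_id, subset_id, text) tuples for a requirements block."""
    rows = []
    # split in front of "1.", "2.", ... (whitespace followed by digits and a dot)
    sets = [s.strip() for s in re.split(r"\s(?=[0-9]+\.)", block) if s.strip()]
    for set_id, set_text in enumerate(sets, start=1):
        # drop the leading enumeration marker, e.g. "1."
        set_text = re.sub(r"^[0-9]+\.\s*", "", set_text)
        # split in front of lines that start with "a.", "b.", ...
        parts = re.split(r"\n(?=[a-z]\.)", set_text)
        # the main text of the set gets an empty subset id
        rows.append((set_id, "", parts[0].strip()))
        for offset, sub in enumerate(parts[1:]):
            subset_id = chr(ord("A") + offset)
            sub_text = re.sub(r"^[a-z]\.\s*", "", sub).strip()
            rows.append((set_id, subset_id, sub_text))
    return rows


for row in split_requirements(sample):
    print(row)
```

Each tuple corresponds to one row of the final CSV, which is why a single bulletin can produce several rows that differ only in the requirement-related columns.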

In [None]:
def process_file(file, text):
    '''parses text and returns a list containing multiple dictionaries
    as a result of normalizing the requirements attribute; the result belongs
    to the text of one single file'''
    
    processed_data = process_file_compacted(file, text)
    
    # get_requirements returns None when no requirements section is found;
    # fall back to a single empty requirement so the bulletin keeps one row
    requirements = processed_data.pop("requirements") or [{}]
    
    data = []
    
    # create separate entry in data for each requirement, where all the attributes
    # are the same, except the requirement-related ones
    for requirement in requirements:
        split_data = processed_data.copy()
        split_data.update(requirement)
        data.append(split_data)
    
    return data


def process_file_compacted(file, text):
    '''parses text and returns a single dictionary with the attributes
    extracted from one file; the requirements are kept as a nested list
    under the "requirements" key'''
    
    data = {
        "FILE_NAME" : get_file_name(file),
        "JOB_CLASS_TITLE" : get_job_class_title(text),
        "JOB_CLASS_NO" : get_job_class_no(text),
        "JOB_DUTIES" : get_job_duties(text),
        "DRIVERS_LICENSE_REQ" : get_drivers_license_requirement(text),
        "DRIVERS_LIC_TYPE" : get_drivers_license_type(text),
        "ADDTL_LIC" : get_additional_license(text),
        "EXAM_TYPE" : get_exam_type(text),
        "ENTRY_SALARY_GEN" : get_gen_salary(text),
        "ENTRY_SALARY_DWP" : get_dwp_salary(text),
        "OPEN_DATE" : get_open_date(text),
        "requirements" : get_requirements(text),
        #"jobpost" : text
    }
    return data


def get_file_name(file):
    return file


def get_job_class_title(text):
    # job class is given mostly as an upper-case letter string (on its own line)
    # it appears at the beginning of the document, hence, only the first
    # result is considered
    match = re.search(r"^\s*([A-Z0-9\s'\-a-z ()]*)(?: {5,}|\t|\n)", text)
    if match is None:
        return None
    job_class_title = match.group(0).replace("CAMPUS INTERVIEWS ONLY", "").replace("\n", "").strip().lower()
    return job_class_title


def get_job_class_no(text):
    # job class is a four-digit code
    # it appears at the beginning of the document, hence, only the first
    # result is considered    
    match = re.search("([0-9]{4,4})", text)
    if match is None:
        return None
    job_class_no = match.group(0)
    return job_class_no


def get_job_duties(text):
    # the job duties are listed after the title "DUTIES";
    # all the text after the title belongs to this section, up until 
    # the next upper-case title
    match = re.search(r"DUTIES(?:.*?RESPONSIBILITIES)?[:]?([\s\S]*?)(?=[A-Z]{3,})", text)
    if match is None:
        return None
    duties = match.group(1).strip() 
    return duties


def get_requirements(text):
    # the requirements are listed after the title "REQUIREMENT";
    # all the text after the title belongs to this section, up until 
    # the next upper-case title
    # the requirements section is processed further and the
    # single parts of the requirements are analysed and split into
    # sub-categories
    match = re.search(r"REQUIREMENT.*\n([\s\S]*?)[A-Z]{3,}", text)
    if match is None:
        return None
    requirement = match.group(1).strip()
    requirements_raw = [item.strip() for item in re.split(r'[\s](?=[0-9]+\.)', requirement)]
    
    requirements = []
    
    for index, requirement_text in enumerate(requirements_raw):
        requirement_set_id = index + 1
        processed_sub_requirements = process_single_requirement(requirement_set_id, requirement_text)
        
        if processed_sub_requirements is not None:
            for processed_sub_requirement in processed_sub_requirements:
                requirements.append(processed_sub_requirement)
        
    return requirements


SUBSET_ID = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
            "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "AA", "BB", "CC", "DD", "EE",
            "FF", "GG", "HH", "II", "JJ", "KK", "LL", "MM", "NN", "OO", "PP", "QQ", "RR", 
            "SS", "TT", "UU", "VV", "WW", "XX", "YY", "ZZ"]


def process_single_requirement(requirement_set_id, text, requirement_subset_id = "", parent_text = "", hint = ""):
    # process a single requirement, split its parts and create
    # a dictionary reflecting the content of the requirement
    requirement = {}
    requirement["REQUIREMENTS_SET_ID"] = requirement_set_id
    requirement["REQUIREMENTS_SUBSET_ID"] = requirement_subset_id

    has_subrequirement = has_sub_requirements(text)
    
    if(has_subrequirement):
        # the requirement has sub requirements, hence, the requirement
        # receives only the main-text (and ignores the sub-requirements)
        requirement_text = re.search(r'([\s\S\n]+)(?:\sa\.)', text).group(1).strip()
    else:
        requirement_text = text.strip()
    
    # split requirement into its part (divided by "and" or "or")
    parts_raw = get_requirement_parts(requirement_text)
    parts = [{"requirements_parts" : item } for item in parts_raw]
    for part in parts:
        # get details for the requirement part (e.g., educational, course-related, experience-related)
        part["details"] = get_requirement_details(part["requirements_parts"], hint)       
    
    # merge all the received details into the requirement dictionary for the given requirement
    for part in parts:
        for key, detail in part["details"].items():
            if(key not in requirement):
                requirement[key] = detail
    
    processed_requirements = []
    processed_requirements.append(requirement)
    
    # if requirement has sub requirements, these have to be processed as well; this is
    # done separately
    if(has_subrequirement):
        sub_requirements_raw = re.split(r'\n([a-z]\..*)', text)
        sub_requirements_raw = list(filter(None, sub_requirements_raw))
        sub_requirements_raw = list(filter(lambda i : re.search(r'^[a-z]+\.', i), sub_requirements_raw))
        
        sub_requirements = []
        
        # iterate over a. ... b. ... c. ...
        for index, sub_requirement in enumerate(sub_requirements_raw):
            sub_requirement_text = re.sub(r'^[a-z]+\.', "", sub_requirement)
            
            # as sometimes, the sub-requirement does not contain enough information
            # to classify a text (and its information) into one of the categories, i.e.,
            # course-related, educational, experience-related, we pass on a "hint" which
            # comes from the requirement's text, rather than the sub-requirements text
            hint = ""
            if(bool(re.search(r'experience\s*in\s*\:', requirement_text))):
                hint = "experience"
            elif(is_course_related(requirement_text)):
                hint = "course"
            elif(is_educational(requirement_text)):
                hint = "education"

            sub_requirement = process_single_requirement(requirement_set_id,  sub_requirement_text, SUBSET_ID[index], requirement_text, hint)
            sub_requirements.extend(sub_requirement)
        
        processed_requirements.extend(sub_requirements)
        return processed_requirements 
    
    return processed_requirements


def has_sub_requirements(text):
    return bool(re.search(r'[\n](?=[a-z]+\.)', text))


def get_requirement_details(text, hint = ""):
    # processes a requirement text based on whether it is course-related, educational
    # or experience-related and searches for the information to add to
    # the corresponding attributes
    # a hint can be passed, to ensure that the right category is chosen
    
    use_hint = not (is_course_related(text) or is_educational(text) or is_experience_related(text))
    
    part = {}
    
    if(is_course_related(text) or (use_hint and hint == "course")):
        part["COURSE_COUNT"] = get_course_count(text)
        part["COURSE_LENGTH"] = get_course_length(text)
        part["COURSE_SUBJECT"] = get_course_subject(text)
        part["MISC_COURSE_DETAILS"] = text
    elif(is_educational(text) or (use_hint and hint == "education")):
        part["EDUCATION_YEARS"] = get_education_years(text)
        part["SCHOOL_TYPE"] = get_school_type(text)
        part["EDUCATION_MAJOR"] = get_education_major(text)
    elif(is_experience_related(text) or (use_hint and hint == "experience")):
        part["EXPERIENCE_LENGTH"] = get_experience_years(text)
        part["FULL_TIME_PART_TIME"] = get_full_time_part_time(text)
        part["EXP_JOB_CLASS_TITLE"] = get_exp_job_class_title(text)
        part["EXP_JOB_CLASS_ALT_RESP"] = get_exp_job_class_alt_resp(text)
        part["EXP_JOB_CLASS_FUNCTION"] = get_exp_job_class_function(text)        
        
    return part
        

def check_text_for_catch_words(text, catch_words):
    for word in catch_words:
        if(word in text): 
            return True
    return False
    
def is_educational(text):
    # check if a text is educational; this is done by checking for certain catch words
    catch_words = ["education", "university", "college", "high school", "apprenticeship"]
    return check_text_for_catch_words(text, catch_words)


def is_course_related(text):
    # check if a text is course-related; this is done by checking for certain catch words
    catch_words = ["course"]
    return check_text_for_catch_words(text, catch_words)

def is_experience_related(text):    
    # check if a text is experience-related; this is done by checking for certain catch words 
    # or checking if a full-time/part-time information is given or a job experience class
    catch_words = ["experience"]
    result = check_text_for_catch_words(text, catch_words)

    if(result):
        return True
    elif (get_experience_years(text) is not None or get_full_time_part_time(text) is not None or get_exp_job_class_title(text) is not None):
        return True
    
    return False


def get_years(text):
    # extracts from a text a year/month information, by checking for certain catch words 
    # (i.e., year and month)
    # note that months are converted to years
    years_match = re.search(r"([0-9A-Za-z]+)[\- ]?year[s]?", text)
    if years_match is not None:
        years =  years_match.group(1).strip().lower()
        int_years = text_to_number(years)
        
        if int_years is not None:
            return float(int_years)
        else:
            return None
    
    months_match = re.search(r"([0-9A-Za-z]+)[\- ]?month[s]?", text)
    if months_match is not None:
        months =  months_match.group(1).strip().lower()
        int_months = text_to_number(months)
        
        if int_months is not None:
            return round(float(int_months) / 12.0, 2)
        else:
            return None

        
def get_experience_years(text):
    return get_years(text)


def get_education_years(text):
    return get_years(text)


def text_to_number(text):
    # translates a number given as text (e.g., "two") into an integer
    try:
        i = int(text)
        return i
    except ValueError:
        typical_numbers = {"one" : 1, "two" : 2, "three" : 3, "four" : 4, "five" : 5, "six" : 6, 
                          "seven" : 7, "eight" : 8, "nine" : 9, "ten" : 10, "eleven" : 11, "twelve" : 12,
                          "thirteen" : 13, "fourteen" : 14, "fifteen" : 15, "sixteen" : 16, "seventeen" : 17,
                          "eighteen" : 18, "nineteen" : 19, "twenty" : 20, "thirty": 30, "forty" : 40, "fifty" : 50,
                          "sixty" : 60, "seventy" : 70, "eighty" : 80, "ninety" : 90}
        
        if(text in typical_numbers):     
            return typical_numbers[text]
        else:
            return None
    
    
def get_school_type(text):
    # extracts school type information, i.e., college/university, high school, apprenticeship
    if("college" in text or "university" in text):
        return "COLLEGE OR UNIVERSITY"
    elif("high school" in text):
        return "HIGH SCHOOL"
    elif("apprenticeship" in text):
        return "APPRENTICESHIP"
    else:
        return None

        
def get_education_major(text):
    # extracts information on major/concentration in education-related texts
    # it splits the majors by or/and and joins them with a unified '|' symbol
    match = re.search(r"(?:major|concentration)\s*in\s*([\s\S]*)(?:[;]+|or a closely related field)", text)
    if match is not None:
        majors_text = match.group(1).strip()
        majors_text = re.sub(r'\([^)]*\)', '', majors_text) #remove information in brackets
        majors_split = re.split(';|,|\sor\s|\sand\s', majors_text)
        majors = [item.strip().lower() for item in majors_split]
        majors = list(filter(None, majors))
        return "|".join(majors)
    return None


def get_requirement_parts(text):
    # gets the single parts of a requirement (which are connected by an and or an or)
    matches = re.split(r'\b(?:and|or)(?=[\s][A-Za-z0-9]*?[- ](?:year[s]?|month[s]?))', text)
    matches = list(filter(None, matches))
    if(len(matches) > 1):
        return [item.strip() for item in matches]
    return [text]


def get_full_time_part_time(text):
    # extracts full-time or part-time information from text
    if("full time" in text or "full-time" in text):
        return "FULL_TIME"
    elif("part time" in text or "part-time" in text):
        return "PART_TIME"
    else:
        return None
    
    
def get_exp_job_class_title(text):
    # extracts the job class from a text by comparing it to a job titles dictionary
    for word in job_titles:
        if(word in text.lower()): 
            return word
    return None

def get_exp_job_class_alt_resp(text):
    # extracts the alternative job class from a text by searching for "in a class"
    match = re.search(r'(?:in a class)(.*)[.,;!\n]', text)
    if match is not None:
        return match.group(1).strip()
    return None   

def get_exp_job_class_function(text):
    #extracts the job class function by searching for the string "experience"
    match = re.search(r'(?:[0-9a-z]*\s*)?(?:.*?)(?:[Ee]xperience\s*\S*?\s*(?:an|a)?\s*)((?:with)?.*?)(?:[;.\n])', text)
    if match is not None:
        return match.group(1).strip()
    return None   

def get_course_count(text):
    # extracts the number of courses
    match = re.search(r'([A-Za-z0-9]*)\scourse[s]?', text)
    if match is not None:
        return text_to_number(match.group(1).strip())
    return None    


def get_course_length(text):
    # extracts the semester or quarters for a course, the information is 
    # returned in a unified schema with 'S' denoting semesters, 'Q' quarters and
    # both information separated by a '|'
    semester_match = re.search(r'([A-Za-z0-9]+)\ssemester[s]?', text)
    quarters_match = re.search(r'([A-Za-z0-9]+)\squarter[s]?', text)
    
    if semester_match is not None:
        semesters = text_to_number(semester_match.group(1).strip())
    else:
        semesters = None
    
    if quarters_match is not None:
        quarters = text_to_number(quarters_match.group(1).strip())
    else:
        quarters = None

    if semesters is not None and quarters is not None:
        return str(quarters) + "Q" + "|"+ str(semesters) + "S"
    elif semesters is not None:
        return str(semesters) + "S"
    elif quarters is not None:
        return str(quarters) + "Q"
    return None    


def get_course_subject(text):
    # extracts the subject of the courses; the information is returned
    # in a unified schema where the subjects are separated by a '|'
    match = re.search(r'course[s]?(?:.*?)(?:in |:)(.*)', text)
    if match is not None:
        subjects_text = match.group(1).strip()
        subjects_text = re.sub(r'\([^)]*\)', '', subjects_text) #remove information in brackets
        subjects_splitted = re.split("[,;]", subjects_text)
        subjects = [re.sub(r'[\.,;]', '', item) for item in subjects_splitted]
        subjects = [item.strip().lower() for item in subjects]
        subjects = [re.sub(r'^(or|and|either)\s', '', item) for item in subjects] #remove leading or/and
        subjects = [re.sub(r'\s(or|and)$', '', item) for item in subjects] #remove ending or/and   
        subjects = [re.sub(r'^\s*(or|and)\s*$', '', item) for item in subjects] #remove empty or/and  
        subjects = list(filter(None, subjects))
        return "|".join(subjects)
    return None    


def get_drivers_license_requirement(text):
    # extracts requirements for driver's license from the text; if a sentence mentioning
    # the driver's license contains the words may/might, a "P" is returned (for possible),
    # whereas if a driver's license is truly required, a "R" is returned
    sentences = text.split(".")
    requirements = list(filter(lambda x: re.search(r"driver.* license", x), sentences))
    if(requirements):
        for requirement in requirements:
            match = re.search(r"\b(may|might)\b", requirement, re.IGNORECASE)
            if match is not None:
                return "P"
            else:
                return "R"
    else:
        return ""

    
def get_drivers_license_type(text):
    # extracts requirements for driver's license type from the text
    # the information is returned in a unified format
    sentences = text.split(".")
    requirements = list(filter(lambda x: re.search(r"driver.* license", x), sentences))
    for requirement in requirements:
        match = re.search(r"[Cc]lass\s*([A-Z1-9\/]{1,3}(?:(?:\s*or\s*|\s*\,\s*)(?:[Cc]lass\s*)?[A-Z1-9\/]{1,3})*)", requirement, re.IGNORECASE)
        if match is not None:
            return match.group(1).strip().replace("or", ",").replace("Class", "").replace("class", "").replace(" ", "")
    return None


def get_additional_license(text):
    # extracts additional license requirements which do not involve the driver's license
    match = re.search(r"REQUIREMENT([\s\S]*)", text, re.IGNORECASE)
    if match is None:
        # no requirements section found, hence no additional license information
        return ""
    text = match.group(1)
    sentences = [sentence.strip() for sentence in re.split(r"[\.]", text)]
    return "\n".join(list(filter(lambda x: re.search(r"(?<!driver's) license", x), sentences)))


def get_exam_type(text):
    # extracts information on the examination; 'EXAMINATION' is used as catch word and the category
    # is decided based on the words appearing within the same sentence, i.e., 
    # "INTERDEPARTMENTAL" and "OPEN", "INTERDEPARTMENTAL", "OPEN", "DEPARTMENTAL"
    sentences = text.replace("\n", "").split(".")
    examinations = list(filter(lambda x: re.search(r"(?:THIS|THE)\s*EXAMINATION", x), sentences))
    
    if(examinations):
        examination = re.search(r"(([A-Z]*\s)*)", examinations[0]).group(0).replace('\n', ' ').strip()
        
        if("INTERDEPARTMENTAL" in examination and "OPEN" in examination):
            return "OPEN_INT_PROM"
        elif("INTERDEPARTMENTAL" in examination):
            return "INT_DEP_PROM"
        elif("DEPARTMENTAL" in examination):
            return "DEPT_PROM"
        elif("OPEN" in examination):
            return "OPEN"
        else:
            return None

        
def get_dwp_salary(text):
    # extracts the salary information at DWP; the information is searched by looking for
    # lines which contain information with a dollar symbol ($)
    # to consider it as a DWP salary, the word "Department" should appear in the same line
    # the information is returned in a unified format: either as a single number or with a
    # range separated by a '-'
    lines = text.split("\n")
    salary_lines = list(filter(lambda x: re.search(r"\$", x), lines))

    for salary_line in salary_lines:
        if("Department" in salary_line or "department" in salary_line):
            match = re.search(r'\$\s?[0-9,]*(?:\s*(?:to|-)\s*\$\s?[0-9,]*)?', salary_line)
            if match is not None:
                salary = match.group(0).strip()
                return re.sub(r"\s*to\s*", "-", salary)

            
def get_gen_salary(text):
    # extracts the salary information; the information is searched by looking for
    # lines which contain information with a dollar symbol ($)
    # the information is returned in a unified format: either as a single number or with a
    # range separated by a '-'
    lines = text.split("\n")
    salary_lines = list(filter(lambda x: re.search(r"\$", x), lines))

    for salary_line in salary_lines:
        if("Department" not in salary_line and "department" not in salary_line):
            match = re.search(r'\$\s?[0-9,]*(?:\s*(?:to|-)\s*\$\s?[0-9,]*)?', salary_line)
            if match is not None:
                salary = match.group(0).strip()
                return re.sub(r"\s*to\s*", "-", salary)
  
            
def get_open_date(text):
    # extracts the open date information by searching for the title "Open Date" and a 
    # text formatted like a date
    match = re.search(r'Open\s*Date\s*:\s*([0-9]{1,2}\-[0-9]{1,2}\-[0-9]{2,4})', text, re.IGNORECASE)
    if match is not None:
        return match.group(1).strip()
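
To make the salary and open date extraction more concrete, the following minimal sketch applies patterns similar to the ones above to a hypothetical bulletin fragment. The fragment, the helper name `first_salary` and all values in it are made up for illustration; the sketch assumes that a DWP salary is recognisable by the word "Department" on the same line, as in the methods above.

```python
import re

# Hypothetical bulletin fragment; real bulletins vary in layout
sample = (
    "EXAMPLE ANALYST\n"
    "Class Code:       9999\n"
    "Open Date:  02-09-18\n"
    "ANNUAL SALARY\n"
    "$49,903 to $72,996\n"
    "The salary in the Department of Water and Power is $70,908 to $88,092.\n"
)

SALARY = re.compile(r"\$\s?[0-9,]+(?:\s*(?:to|-)\s*\$\s?[0-9,]+)?")
OPEN_DATE = re.compile(
    r"Open\s*Date\s*:\s*([0-9]{1,2}-[0-9]{1,2}-[0-9]{2,4})", re.IGNORECASE
)


def first_salary(text, dwp):
    """Return the first salary found on a DWP line (dwp=True) or a general line."""
    for line in text.split("\n"):
        if "$" not in line:
            continue
        # a line counts as DWP-related if it mentions a department
        if ("department" in line.lower()) != dwp:
            continue
        match = SALARY.search(line)
        if match:
            # unify ranges: "$a to $b" becomes "$a-$b"
            return re.sub(r"\s*to\s*", "-", match.group(0))
    return None


print(first_salary(sample, dwp=False))
print(first_salary(sample, dwp=True))
print(OPEN_DATE.search(sample).group(1))
```

Only the first matching line of each kind is used, so bulletins that list several salary ranges are reduced to the entry salary.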

Read all files, create data frame and export data frame as CSV

In [None]:
def get_data():
    '''reads the data from all files and creates a
    data frame with all the extracted information
    '''
    data = []

    for file in files:
        text = open_bulletin(file)

        results = process_file(file, text)

        for result in results:
            data.append(result)

    df = pd.DataFrame.from_dict(data)
    df.fillna(value=np.nan, inplace=True)
    
    return df

In [None]:
data = get_data()

The data is exported into `export.csv`. Note that `"` is used as a text qualifier.

In [None]:
data.to_csv("export.csv", index = False)
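
The effect of the text qualifier can be sketched with the standard `csv` module, which follows the same convention as the pandas defaults (quote character `"`, quoting only where needed). The row below is hypothetical; the point is that fields containing commas or quotes survive a round trip.

```python
import csv
import io

# Hypothetical row; the duties text contains commas and embedded quotes
row = {
    "JOB_CLASS_TITLE": "sign shop supervisor",
    "JOB_DUTIES": 'Plans, assigns, and reviews work; trains "new" staff.',
}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=list(row), quotechar='"', quoting=csv.QUOTE_MINIMAL
)
writer.writeheader()
writer.writerow(row)
text = buffer.getvalue()
print(text)

# reading the text back restores the original field values
buffer.seek(0)
restored = next(csv.DictReader(buffer, quotechar='"'))
print(restored["JOB_DUTIES"])
```

Embedded quotes are doubled inside a quoted field, so any consumer of `export.csv` should use a CSV parser rather than splitting lines on commas.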

In [None]:
display(data)

## Task 2: Identify Language that can Negatively Bias the Pool of Applicants

### Method

In this task, we focus only on the job duties (`JOB_DUTIES`) attribute, as it is arguably the most important field and the one where the language is comparatively unconstrained. 

For this task, we use the VADER sentiment analysis tools, as available on [GitHub](https://github.com/cjhutto/vaderSentiment) and as introduced in:

> Hutto, Clayton J., and Eric Gilbert. "Vader: A parsimonious rule-based model for sentiment analysis of social media text." Eighth international AAAI conference on weblogs and social media. 2014.

The VADER package is used to receive a valence score denoting how positive/negative the language used in the text is.

We first introduce the methods used for the analysis.

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Moreover, the spaCy package is used for tokenization, lemmatization and normalization of the sentences. spaCy is available on [GitHub](https://github.com/explosion/spaCy).

In [None]:
import spacy
from spacy.lang.en import English

In [None]:
sm_parser = spacy.load("en_core_web_sm")
stopwords = spacy.lang.en.stop_words.STOP_WORDS
punctuations = string.punctuation

sm_parser.max_length = 51391693

In [None]:
def tokenize_clean_normalize(text):
    '''tokenizes, cleans and normalizes a given text, i.e., the text is split/parsed, lower-cased, stripped from
    any whitespace or other characters, words are lemmatized, and stopwords (as well as punctuations) are removed'''
    
    if text is not None and len(str(text)) > 0:
        tokens = sm_parser(str(text))
        tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens ]
        tokens = [ word for word in tokens if word not in stopwords and word not in punctuations ]
        return tokens
    else:
        return []

In [None]:
analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment_document(text):
    '''returns the valence for a document, i.e., the mean over the compound scores of its sentences; 
    a positive valence denotes positive sentiment in the text (max: 1.0), 
    a negative valence denotes negative sentiment in the text (min: -1.0)'''
    
    try:
        if text is not None and len(str(text)) > 0:
            document = sm_parser(str(text))
            analysis = [analyzer.polarity_scores(sentence.text)["compound"] for sentence in document.sents]
            return np.mean(analysis)
        else:
            return None
    except:
        return 0.0

    
def analyze_sentiment_word(word):
    '''returns the valence for a single word; 
    a positive valence denotes positive sentiment in the text (max: 1.0), 
    a negative valence denotes negative sentiment in the text (min: -1.0)'''
    
    try:
        analysis = analyzer.polarity_scores(word)
        return analysis["compound"]
    except:
        return 0.0
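
The document-level aggregation can be sketched without any external package. In the example below, the lexicon and the sentence scorer are toy stand-ins and are not VADER; only the final step, averaging per-sentence scores into one document score, mirrors `analyze_sentiment_document`.

```python
import re
from statistics import mean

# Hypothetical toy lexicon; these values are NOT taken from the VADER lexicon
TOY_LEXICON = {"excellent": 0.6, "support": 0.4, "hazardous": -0.5, "emergency": -0.4}


def toy_sentence_score(sentence):
    """Toy stand-in for a sentence scorer: mean lexicon value of known words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    return mean(scores) if scores else 0.0


def document_valence(text):
    """Average the per-sentence scores into a single document score."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return None
    return mean(toy_sentence_score(s) for s in sentences)


sample = (
    "Provides excellent support to the public. "
    "Responds to hazardous emergency situations."
)
print(document_valence(sample))
```

Because the score is a mean over sentences, one strongly negative sentence can be offset by a positive one, so a near-zero document valence does not imply neutral wording throughout.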

In the following, we introduce a method to analyze the readability of a text (using the [Automated readability index](https://en.wikipedia.org/wiki/Automated_readability_index)).

In [None]:
def analyze_readability(text):
    if text is not None and len(str(text)) > 0:
        text = str(text)
        num_characters = len(re.sub(r"[^a-zA-Z0-9]","", text))

        tokens = list(sm_parser(str(text)))
        tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens ]
        tokens = [ word for word in tokens if word not in punctuations ]
        num_words = len(tokens)

        num_sentences = len(re.sub(r"[^\.;:?!]","", text))

        score = 4.71 * (num_characters / max(num_words, 1)) + 0.5 * (num_words / max(num_sentences, 1)) - 21.43
        return int(score)
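
As a worked example of the index, the sketch below computes the score for a short, made-up duties text. It assumes plain whitespace tokenization instead of spaCy and, like the method above, counts `.`, `;`, `:`, `?` and `!` as sentence boundaries; the function name `ari` is hypothetical.

```python
import re


def ari(text):
    """Automated readability index with simplified tokenization."""
    characters = len(re.sub(r"[^a-zA-Z0-9]", "", text))
    words = len(text.split())
    sentences = len(re.findall(r"[.;:?!]", text))
    return (
        4.71 * characters / max(words, 1)
        + 0.5 * words / max(sentences, 1)
        - 21.43
    )


sample = "Supervises staff. Prepares reports; reviews budgets."
# 44 characters, 6 words, 3 sentence boundaries
print(round(ari(sample), 2))
```

Long words dominate the score for short texts, since the characters-per-word term is weighted far more heavily than the words-per-sentence term.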

### Analysis

In [None]:
job_posts = get_data()[["JOB_DUTIES", "FILE_NAME"]].drop_duplicates()
job_posts["valence"] = job_posts["JOB_DUTIES"].apply(analyze_sentiment_document)

Below, the documents are plotted in ascending order of valence, where -1.0 is the most negative and +1.0 the most positive score. As the plot shows, some job postings have a positive valence, whereas others contain negative language.

In [None]:
def valence_colors(lst):
    cols = []
    for l in lst:
        if l < 0:
            cols.append('red')
        elif l > 0:
            cols.append('green')
        else:
            cols.append('grey')
    return cols

plt.scatter(x = range(0, len(job_posts)), y = job_posts.sort_values(["valence"])["valence"], c = valence_colors(job_posts.sort_values(["valence"])["valence"]))
plt.show()

As an example, the following document has the highest positive valence:

In [None]:
high_valence_file = job_posts.loc[job_posts["valence"].idxmax()]["FILE_NAME"]
high_valence_file

In [None]:
pdf_file = "../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Additional data/PDFs/2015/March 2015/03062015/SIGN SHOP SUPERVISOR 3419.pdf"
from wand.image import Image as WImage
high_valence_post = WImage(filename=pdf_file)
high_valence_post

Methods for displaying positive/negative wording.

For the following analysis, we use the WordCloud package which is also available on [GitHub](https://github.com/amueller/word_cloud).

In [None]:
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import multidict
import colorsys
from gensim.corpora.dictionary import Dictionary
import matplotlib.colors as colors

In [None]:
def get_color_for_word(word, font_size, position,orientation,random_state=None, **kwargs):
    # returns a color value for a given word, where positive words are marked in green
    # negative words are marked in red
    valence = analyze_sentiment_word(word)
    valence = (valence + 1.0) / 2.0
    
    cdict = {'red':   ((0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)),
             'green': ((0.0, 0.5, 0.5), (0.5, 0.5, 0.5), (1.0, 0.0, 0.0)),
             'blue':  ((0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 0.0, 0.0))
            }
    
    cmap = colors.LinearSegmentedColormap('GnRd', cdict)
    rgba = cmap(1-valence)
    color = "rgb(" + str(int(rgba[0] * 255)) + "," + str(int(rgba[1] * 255)) + "," + str(int(rgba[2] * 255)) + ")"
    return(color) 
        

def is_words_acceptable_for_wordcloud(word):
    # checks if word is worthy to be considered in word cloud (e.g., we remove months,
    # numbers)
    months = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december']
    
    if(word in months or is_number(word)):
        return False
    
    return True


def is_number(word):
    # checks if a word is a number
    try:
        float(word)
        return True
    except ValueError:
        return False
    
    
def generate_word_cloud(documents, min_valence = -1.0, max_valence = 1.0, max_words = 20):
    ''' generates a word cloud by processing the given documents; a min_valence and max_valence
    can be passed to the method (e.g., to only display negatively connotated or positively connotated
    words)'''
    
    tokenized_documents = [tokenize_clean_normalize(document) for document in documents]
    
    dictionary = Dictionary(tokenized_documents)
    dictionary.compactify()

    mdictionary = multidict.MultiDict()

    for key, value in dictionary.dfs.items():
        valence = analyze_sentiment_word(dictionary[key])
        if(is_words_acceptable_for_wordcloud(dictionary[key]) and  valence > min_valence and valence < max_valence):
            mdictionary.add(dictionary[key], value)
        
    wc = WordCloud(background_color="white", max_words=max_words)
    wc.generate_from_frequencies(mdictionary)
    wc.recolor(color_func=get_color_for_word)

    return (dictionary, wc)
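
The color mapping used in `get_color_for_word` can also be illustrated without matplotlib. The following is a minimal sketch, assuming a plain linear interpolation between green (valence 1.0), grey (valence 0.0) and red (valence -1.0). The helper name `valence_to_rgb` is hypothetical and not part of the notebook; its result may differ slightly from the quantised colormap used above.

```python
def valence_to_rgb(valence):
    # hypothetical helper: maps a valence in [-1.0, 1.0] to an "rgb(r,g,b)" string
    # x is the position on the color scale: 0.0 = most positive, 1.0 = most negative
    x = 1.0 - (valence + 1.0) / 2.0
    x = min(max(x, 0.0), 1.0)

    red = x  # rises linearly from 0.0 to 1.0
    if x <= 0.5:
        green, blue = 0.5, x       # green stays at 0.5, blue rises to 0.5
    else:
        green = blue = 1.0 - x     # both fall linearly from 0.5 to 0.0

    return "rgb({},{},{})".format(int(red * 255), int(green * 255), int(blue * 255))


for v in (1.0, 0.0, -1.0):
    print(v, valence_to_rgb(v))
```

Positive words therefore end up in a dark green, neutral words in grey and negative words in a bright red.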

The following word clouds display positive words used in the job posts in green, negative words in red.

In [None]:
(dictionary, wc_positive) = generate_word_cloud(job_posts["JOB_DUTIES"], 0.0, 1.0)
plt.imshow(wc_positive, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
(dictionary, wc_negative) = generate_word_cloud(job_posts["JOB_DUTIES"], -1.0, 0.0)
plt.imshow(wc_negative, interpolation='bilinear')
plt.axis("off")
plt.show()

So as not to limit the selection of words to those already used in the posts, we use an external dataset (from [Kaggle](https://www.kaggle.com/madhab/jobposts)) of 19,000 online job posts to analyse which other words might be used in job postings and which ones should be used or avoided. The following plot shows the distribution of the valence values for a random sample of 1,000 job posts from the external dataset.

In [None]:
online_posts = pd.read_csv("../input/jobposts/data job posts.csv").sample(1000)
online_posts["valence"] = online_posts["JobDescription"].apply(analyze_sentiment_document)
plt.scatter(x = range(0, len(online_posts)), y = online_posts.sort_values(["valence"])["valence"], c = valence_colors(online_posts.sort_values(["valence"])["valence"]))
plt.show()

The following table shows how the LA job posts are distributed across the categories (positive / neutral / negative language), compared with the distribution in the external dataset. We can see that the LA dataset has comparatively more job posts with negatively connotated language than the external dataset.

In [None]:
perc_pos_job_posts = job_posts[job_posts["valence"] > 0].shape[0] / job_posts.shape[0]
perc_neut_job_posts = job_posts[job_posts["valence"] == 0].shape[0] / job_posts.shape[0]
perc_neg_job_posts = job_posts[job_posts["valence"] < 0].shape[0] / job_posts.shape[0]


perc_pos_online_posts = online_posts[online_posts["valence"] > 0].shape[0] / online_posts.shape[0]
perc_neut_online_posts = online_posts[online_posts["valence"] == 0].shape[0] / online_posts.shape[0]
perc_neg_online_posts = online_posts[online_posts["valence"] < 0].shape[0] / online_posts.shape[0]

table = [["","LA job posts","External dataset"],
         ["positive", str(round(perc_pos_job_posts * 100, 2)) + "%", str(round(perc_pos_online_posts * 100, 2)) + "%" ],
         ["neutral",  str(round(perc_neut_job_posts * 100, 2)) + "%", str(round(perc_neut_online_posts * 100, 2)) + "%"],
         ["negative", str(round(perc_neg_job_posts * 100, 2)) + "%", str(round(perc_neg_online_posts * 100, 2)) + "%"]]
display(HTML(tabulate.tabulate(table, tablefmt='html')))
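
The six percentage computations above follow the same pattern. As a minimal sketch, they could be factored into one helper that works on any list of valence scores; the name `valence_shares` and the example values are assumptions made for illustration only.

```python
def valence_shares(valences):
    # hypothetical helper: returns the percentage of positive, neutral and negative
    # values in a list of valence scores, rounded to two decimals
    n = len(valences)
    if n == 0:
        return {"positive": 0.0, "neutral": 0.0, "negative": 0.0}

    return {
        "positive": round(100.0 * sum(1 for v in valences if v > 0) / n, 2),
        "neutral": round(100.0 * sum(1 for v in valences if v == 0) / n, 2),
        "negative": round(100.0 * sum(1 for v in valences if v < 0) / n, 2),
    }


# made-up valence scores, not taken from the datasets
shares = valence_shares([0.8, 0.4, 0.0, -0.3])
print(shares)
```

With a pandas column, the helper could be called as `valence_shares(job_posts["valence"].tolist())`.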

In the following word cloud, we mark in green the words that have a positive valence and are used in the external dataset. These words may serve as inspiration for improving the job posts.

In [None]:
(dictionary, wc_online_positive) = generate_word_cloud(online_posts["JobDescription"], 0.0, 1.0)
plt.imshow(wc_online_positive, interpolation='bilinear')
plt.axis("off")
plt.show()

Similarly, the following word cloud displays words from the external dataset that should be avoided in job postings.

In [None]:
(dictionary, wc_online_negative) = generate_word_cloud(online_posts["JobDescription"], -1.0, 0.0)
plt.imshow(wc_online_negative, interpolation='bilinear')
plt.axis("off")
plt.show()

In summary, the following list shows positive words that should be used more often (sorted by valence):

In [None]:
positive_words = list(wc_positive.words_.keys()) + list(wc_online_positive.words_.keys())
positive_dictionary = pd.DataFrame([{"word" : word, "valence" : analyze_sentiment_word(word)} for word in positive_words]).drop_duplicates()
positive_dictionary.sort_values(by="valence", ascending=False).head(10)

Similarly, the following list shows words that should rather be avoided (again sorted by valence):

In [None]:
negative_words = list(wc_negative.words_.keys()) + list(wc_online_negative.words_.keys())
negative_dictionary = pd.DataFrame([{"word" : word, "valence" : analyze_sentiment_word(word)} for word in negative_words]).drop_duplicates()
negative_dictionary.sort_values(by="valence").head(10)

Regarding the complexity of the wording, the following plot shows the readability score of each job post (sorted by score).

In [None]:
job_posts = get_data()[["JOB_DUTIES", "FILE_NAME"]].drop_duplicates()
job_posts["readability"] = job_posts["JOB_DUTIES"].apply(analyze_readability)

plt.scatter(x = range(0, len(job_posts)), y = job_posts.sort_values(["readability"])["readability"])
plt.show()

### Recommendations and Evaluation

* Improve the wording of the job posts by using more positively connotated words. Positive words include "successful", "effective", "motivated" and "benefit".
* Our analysis relies on the valence computed by the VADER tool. A negative valence may, however, stem from the true nature of the job: a job posting for a criminal prosecutor who investigates "fraud, violence, accidents, etc." cannot avoid negative valence. We have not accounted for this in our analysis.
* Extending the existing job posting tool so that it suggests alternative wordings for negative language might be a useful approach.

## Task 3: Improve the Diversity and Quality of the Applicant Pool

### Method

We consider the task of improving diversity and the quality of the applicant pool by looking at the aspect of gendered wording.

To approach the problem, we follow the ideas and the word list given in the following paper:
 
> Gaucher, D., Friesen, J., & Kay, A. C. (2011). Evidence that gendered wording in job advertisements exists and sustains gender inequality. Journal of personality and social psychology, 101(1), 109.

In this paper, the authors provide a list of masculine-coded and feminine-coded words that influence whether a text is perceived as being directed towards men or women. The word lists are the following:

In [None]:
masculine_words = ['active', 'adventurous', 'aggress', 'ambitio', 'analy', 'assert', 'athlet', 'autonom', 'battle', 'boast', 'challeng', 'champion', 'compet', 'confident', 'courag', 'decid', 'decision', 'decisive', 'defend', 'determin', 'domina', 'dominant', 'driven', 'fearless', 'fight', 'force', 'greedy', 'head-strong', 'headstrong', 'hierarch', 'hostil', 'impulsive', 'independen', 'individual', 'intellect', 'lead', 'logic', 'objective', 'opinion', 'outspoken', 'persist', 'principle', 'reckless', 'self-confiden', 'self-relian', 'self-sufficien', 'selfconfiden', 'selfrelian', 'selfsufficien', 'stubborn', 'superior', 'unreasonab']
feminine_words = ['agree', 'affectionate', 'child', 'cheer', 'collab', 'commit', 'communal', 'compassion', 'connect', 'considerate', 'cooperat', 'co-operat', 'depend', 'emotiona', 'empath', 'feel', 'flatterable', 'gentle', 'honest', 'interpersonal', 'interdependen', 'interpersona', 'inter-personal', 'inter-dependen', 'inter-persona', 'kind', 'kinship', 'loyal', 'modesty', 'nag', 'nurtur', 'pleasant', 'polite', 'quiet', 'respon', 'sensitiv', 'submissive', 'support', 'sympath', 'tender', 'together', 'trust', 'understand', 'warm', 'whin', 'enthusias', 'inclusive', 'yield', 'share', 'sharin']

In the following, we define a function to compute the bias of a text towards one gender (referred to below as the gender bias score). The gender bias score ranges from -1.0, denoting strongly masculine-oriented language, to 1.0, denoting strongly feminine-oriented language.

In [None]:
def analyze_gendered_words(text):
    '''returns a tuple (masculine share, feminine share): the number of masculine / feminine
    word stems from the lists above that occur in the text, each divided by the number of
    words in the text'''
    
    # the word lists are lower case, so the text is lower-cased before matching
    text = str(text).lower()
    
    m_sum = 0
    f_sum = 0
    
    nwords = len(text.split(" "))
    
    for word in masculine_words:
        if(word in text):
            m_sum += 1

    for word in feminine_words:
        if(word in text):
            f_sum += 1
            
    return (m_sum / float(nwords), f_sum / float(nwords))


def analyze_gender_bias(text):
    '''returns a bias score where -1.0 denotes a strongly masculine-oriented language
    and 1.0 denotes a strongly feminine-oriented language'''  
    
    result = analyze_gendered_words(text)
    return (result[1] - result[0])
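
Note that the substring test used in `analyze_gendered_words` also matches stems inside other words (e.g., "nag" occurs in "manages"). As a possible refinement, the following sketch counts a word only if it starts with one of the stems. The function names `tokenize` and `gender_bias_by_word`, the shortened stem lists and the example sentence are assumptions made for illustration; this is not the method used for the results in this notebook.

```python
import re


def tokenize(text):
    # lower-cases the text and splits it into words (hyphenated words stay together)
    return re.findall(r"[a-z]+(?:-[a-z]+)*", str(text).lower())


def gender_bias_by_word(text, masculine_stems, feminine_stems):
    # share of feminine-coded words minus share of masculine-coded words;
    # a word counts if it starts with one of the stems
    words = tokenize(text)
    if not words:
        return 0.0

    m_count = sum(1 for w in words if w.startswith(tuple(masculine_stems)))
    f_count = sum(1 for w in words if w.startswith(tuple(feminine_stems)))
    return (f_count - m_count) / len(words)


# shortened stem lists (a subset of the lists above) and a made-up sentence
m_stems = ["lead", "compet"]
f_stems = ["support", "nag"]

score = gender_bias_by_word("Manages and leads a competitive team", m_stems, f_stems)
print(score)
```

In this example, "leads" and "competitive" count as masculine-coded, whereas "Manages" is no longer counted as feminine-coded because it does not start with "nag".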

### Analysis

In [None]:
job_posts = get_data()[["FILE_NAME", "JOB_CLASS_TITLE", "JOB_DUTIES", "ENTRY_SALARY_GEN"]].drop_duplicates()
job_posts["gender_bias"] = job_posts["JOB_DUTIES"].apply(analyze_gender_bias)

The following plot and the corresponding numbers below show that, although a large number of posts are formulated in gender-neutral language, a considerable share is gendered:

In [None]:
def gender_bias_colors(lst):
    cols = []
    for l in lst:
        if l < 0:
            cols.append('blue')
        elif l > 0:
            cols.append('red')
        else:
            cols.append('grey')
    return cols

plt.scatter(x = range(0, len(job_posts)), y = job_posts.sort_values(["gender_bias"])["gender_bias"], c = gender_bias_colors(job_posts.sort_values(["gender_bias"])["gender_bias"]))
plt.show()

In [None]:
print("postings with male-oriented language:", len(job_posts.loc[job_posts['gender_bias'] < 0]))
print("postings with female-oriented language:", len(job_posts.loc[job_posts['gender_bias'] > 0]))
print("postings with neutral language:", len(job_posts.loc[job_posts['gender_bias'] == 0]))

print("\n")
print("mean:", job_posts['gender_bias'].mean())

In the following, we analyse whether the language is related to salary or job class.

In [None]:
def get_ceil_entry_salary(text):
    # returns the upper bound of a salary range of the form "$low-$high" as a float
    # (or the single value if no range is given); returns None if the value is
    # missing or cannot be parsed
    if pd.isnull(text):
        return None
    
    res = str(text).strip().replace("$", "").replace(",", "").split("-")
    value = res[1] if len(res) > 1 else res[0]
    
    try:
        return float(value)
    except ValueError:
        return None

The analysis finds no correlation between the salary and the gender bias score, as shown by the Pearson correlation coefficient:

In [None]:
job_posts["CEIL_ENTRY_SALARY_GEN"] = job_posts["ENTRY_SALARY_GEN"].apply(get_ceil_entry_salary)
filtered_job_posts = job_posts[job_posts["CEIL_ENTRY_SALARY_GEN"].notnull()]
filtered_job_posts = filtered_job_posts[filtered_job_posts["gender_bias"].notnull()]
filtered_job_posts['CEIL_ENTRY_SALARY_GEN'].corr(filtered_job_posts['gender_bias'])
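
By default, pandas' `Series.corr` computes the Pearson correlation coefficient. To make explicit what this number measures, the following sketch computes the coefficient by hand on made-up values (the function name `pearson` and the data are assumptions for illustration). Note that a coefficient close to 0 only indicates the absence of a linear relationship.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient of two equally long lists of numbers:
    # covariance divided by the product of the standard deviations
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# made-up values: a perfectly linear relationship and an uncorrelated one
print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3, 4], [1, -1, -1, 1]))
```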

In the following, we analyse whether gender-biased language is related to the level of the position. We compare the mean gender bias score for positions whose class title contains certain words: "supervisor", "head", "senior", "director", "inspector" or "principal" for senior-level positions; "junior", "assistant" or "secretary" for entry-level positions.

In [None]:
senior_level = pd.concat([
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("supervisor", case=False, na=False)],    
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("head", case=False, na=False)],    
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("senior", case=False, na=False)],    
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("director", case=False, na=False)],
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("inspector", case=False, na=False)],
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("principal", case=False, na=False)]],ignore_index=True).drop_duplicates().reset_index(drop=True)


entry_level = pd.concat([
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("junior", case=False, na=False)],
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("assistant", case=False, na=False)],
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("secretary", case=False, na=False)]
],ignore_index=True).drop_duplicates().reset_index(drop=True)


print("senior-level, mean:", senior_level['gender_bias'].mean())
print("entry-level, mean:", entry_level['gender_bias'].mean())

There is a slight tendency for senior-level positions to use male-oriented language and for entry-level positions to use feminine-oriented language. Although this tendency is subtle, it deserves a more detailed investigation. The same is true for jobs with a technical orientation (i.e., where the job class title contains "engineering", "technical" or "mechanic") vs. jobs with a care focus (i.e., where the job class title contains "nurse" or "care"):

In [None]:
technical = pd.concat([
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("engineering", case=False, na=False)],
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("technical", case=False, na=False)],
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("mechanic", case=False, na=False)]],ignore_index=True).drop_duplicates().reset_index(drop=True)

# jobs with a care focus (class title contains "nurse" or "care")
care = pd.concat([
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("nurse", case=False, na=False)],
    job_posts[job_posts["JOB_CLASS_TITLE"].str.contains("care", case=False, na=False)]],ignore_index=True).drop_duplicates().reset_index(drop=True)

print("technical jobs, mean:", technical['gender_bias'].mean())
print("care jobs, mean:", care['gender_bias'].mean())

### Recommendations and Evaluation

* Ensure a balanced use of feminine and masculine words. Male-oriented words include "adventurous", "aggressive", "challenging", "reckless", "self-confident" and "superior"; female-oriented words include "affectionate", "considerate", "compassion", "emotional", "trust", "understanding" and "sharing".
* Following scientific practice, further studies should examine whether this list is complete and whether all words are truly biased towards one gender. In this task, we follow the results of the study by Gaucher et al.
* Extending the existing job posting tool so that it suggests alternative wordings for gender-biased language might be a useful approach.

### Future Work

To improve the pool of applicants, our initial idea was to increase the number of applicants in the first place. To this end, we analyzed the [WUZZUF Jobs Posts](https://www.kaggle.com/WUZZUF/wuzzuf-job-posts) dataset and built a machine learning pipeline to predict the number of views from the job post. However, the experiment failed (and is, hence, not listed here), as the training set was too small for training a regression model. Nevertheless, we mention the idea here, as it might be useful for future research.

## Task 4: Make it easier to Determine which Promotions are Available to Employees in each Job Class

### Method

To answer the question of possible promotions for employees in each job class, the goal is to graphically display for each job the promotional possibilities.

In [None]:
job_posts = get_data()[["FILE_NAME", "JOB_CLASS_TITLE", "EXP_JOB_CLASS_TITLE", "EXPERIENCE_LENGTH", "FULL_TIME_PART_TIME"]]
job_posts_num_of_exp_job = job_posts["EXP_JOB_CLASS_TITLE"].notnull().groupby([job_posts["FILE_NAME"], job_posts["JOB_CLASS_TITLE"]]).sum().reset_index(name ='num_of_exp_job_class_title')

# entry jobs, i.e., job classes that do not require experience in another job class
jobs_without_exp_job = job_posts_num_of_exp_job[job_posts_num_of_exp_job["num_of_exp_job_class_title"] == 0]["JOB_CLASS_TITLE"].drop_duplicates()

In [None]:
def find_promotions(job_class_title):
    return job_posts[job_posts["EXP_JOB_CLASS_TITLE"] == job_class_title][["JOB_CLASS_TITLE", "EXPERIENCE_LENGTH", "FULL_TIME_PART_TIME"]].drop_duplicates().dropna().values.tolist()


def find_promotion_tree(job_class_title, experience_length = 0, full_time_part_time = "", depth = 10):
    # recursively collects the promotions reachable from the given job class;
    # depth limits the recursion and guards against cycles in the data
    if(depth > 0):
        promotions = [find_promotion_tree(job_class, exp_length, ft_pt, depth = depth - 1) 
                      for job_class, exp_length, ft_pt in find_promotions(job_class_title) if job_class != job_class_title]
    else:
        promotions = []
        
    return {"job_title" : job_class_title, "experience_length" : experience_length, "full_time_part_time" : full_time_part_time, "promotions" : promotions}

In [None]:
def create_graph(promotion_tree, dot = None):
    # draws the promotion tree; the same Digraph is passed down the recursion so that
    # promotions of promotions end up in the same graph
    if(dot is None):
        dot = Digraph(comment='Promotions', strict = True)
        
        if(not promotion_tree or (len(promotion_tree) == 1 and not promotion_tree[0]["promotions"])):
            dot.attr(label=r'\nno promotions available')
    
    for job in promotion_tree:
        if(job["promotions"]): # only show the jobs which allow for promotions
            dot.node(job["job_title"].replace(" ", "_"), job["job_title"])

            for promotion in job["promotions"]:
                dot.node(promotion["job_title"].replace(" ", "_"), promotion["job_title"])
                dot.edge(job["job_title"].replace(" ", "_"), promotion["job_title"].replace(" ", "_"), label=str(promotion["experience_length"]) + " years\n" + str(promotion["full_time_part_time"]))
            
            # add the next level of promotions to the same graph
            create_graph(job["promotions"], dot)

    return dot


def find_promotion_tree_create_graph(job_class_title):
    promotion_tree = [find_promotion_tree(job_class_title)]
    return create_graph(promotion_tree)
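
The recursion behind the promotion tree can be tried out independently of the parsed data. The following is a minimal sketch, assuming the requirements are given as simple `(job class, required job class)` pairs; the function names `build_promotion_tree` and `collect_edges` as well as the job titles are made up for illustration.

```python
def build_promotion_tree(job, requirements, depth=10):
    # requirements: list of (job_class, required_job_class) pairs, i.e. job_class
    # can be reached from required_job_class; depth guards against cycles
    if depth == 0:
        return {"job_title": job, "promotions": []}

    next_jobs = sorted({target for target, source in requirements
                        if source == job and target != job})
    return {"job_title": job,
            "promotions": [build_promotion_tree(t, requirements, depth - 1) for t in next_jobs]}


def collect_edges(tree):
    # flattens the nested tree into (from, to) pairs, e.g., as input for a graph drawing library
    edges = []
    for promotion in tree["promotions"]:
        edges.append((tree["job_title"], promotion["job_title"]))
        edges.extend(collect_edges(promotion))
    return edges


# made-up job classes, not taken from the job bulletins
requirements = [("SENIOR CLERK", "CLERK"), ("CHIEF CLERK", "SENIOR CLERK")]
tree = build_promotion_tree("CLERK", requirements)
print(collect_edges(tree))
```

The depth limit matters if the data contains circular requirements (job A requires B and B requires A): without it, the recursion would not terminate.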

### Analysis

The following figure displays the full promotional tree as given by the job bulletins.

In [None]:
full_promotion_tree = [find_promotion_tree(job) for job in jobs_without_exp_job]
create_graph(full_promotion_tree)

In the following, we supply a small widget for selecting the current job class title; it then displays the possible promotions (with the required number of years and a note on whether full-time or part-time experience is needed).

In [None]:
jobs = pd.concat([job_posts["JOB_CLASS_TITLE"],job_posts["EXP_JOB_CLASS_TITLE"]],ignore_index=True).drop_duplicates().reset_index(drop=True).sort_values().values.tolist()

widget_output = widgets.Output()
w = widgets.Dropdown(options=jobs,description='Current job:',disabled=False)

@widget_output.capture(clear_output=True)
def on_selection_change(change):
    job_class_title = change["new"]
    graph = find_promotion_tree_create_graph(job_class_title)
    with widget_output:
        display(w)
        display(graph)

w.observe(on_selection_change, names='value')
        
# initially, only the dropdown is shown; a graph is rendered once a job is selected
with widget_output:
    display(w)

Choose the current position below; if the data is available, the widget displays the possible promotions.

In [None]:
display(widget_output)

### Recommendations

* We have used the parsed data to display possible promotional paths. 
* A tool similar to the one prototyped in this notebook might be useful for employees.