# Data Science for Good: City of Los Angeles


## Summary

This Kernel is an entry into the City of LA Kaggle competition which requires the production of a single structured job bulletins CSV file and analysis designed to identify improvements.

The dataframe which is the source of the CSV file can be found [here](#sdf). A second dataframe called df_eda_exam has also been produced [here](#eda_exam) for the Exploratory Data Analysis (EDA).

The [applications](#gedata) for a sample of the job bulletins have been shown to be a reasonable representation of the full set of job bulletins. Analysis of the CSV file and the applications shows:

* [significant gender imbalance](#gender) for applications to some job class types and salary levels

* an [under-representation of the Hispanic community](#ethnicity) at all levels, which is most apparent at the top salary range

The need for welcoming, inclusive and more readable language has been identified by [qualitative](#jblanganal) and [quantitative](#langanal) means. [Recommendations](#bullrec) are made to improve the job bulletin language to fulfill the requirements of the problem stated. The recommendations could be implemented in a staged process.

Graphical representations of the [explicit links](#el) between jobs, showing promotion routes, have been provided. The data used for these graphical representations can be interrogated to provide employee career progression suggestions.

A series of [Next Steps](#nextsteps) are suggested to improve the structured data file and the analysis.



# Contents

## The problem
## The Single Structure CSV file
## CSV file production
## Exploratory Data Analysis
        Applications Data
            Gender
            Ethnicity
        EDA Conclusion
## Language Analysis       
    ### Sentiment Analysis
    ### Gender Coding
    ### You and Me
    ### Readability
## Job Bulletins: conclusions and recommendations
## Explicit links
## Implicit links
## Data Notes
## Data Dictionary
    ### Load in Kaggle Dictionary
    ### Modify the Dictionary
## Loading in the Job Classes and adding missing ones
## Displaying the Sample Template
## Functions called by the cell that produces the structured data file
## Cell that produces the structured data file
## Saving and reloading the raw data
## Data Clean and Validation
## This is the Structured Data File
## EDA Code
## Gender and Ethnicity Analysis
    ### LA Ethnicity
    ### Application Data
    ### Checking how representative the application data is of the 678 job bulletins
    ### Comparing the sum of all applications
    ### Reviewing application balance for various groupings of jobs
## Finding Explicit Links
    ### Making suggestions about possible promotions
    ### A diagram showing all promotional pathways
## Job Bulletin Language Analysis
## Job Bulletins: conclusions and recommendations
## Implicit Links using Fasttext with sentence embedding
## Output for competition



## The problem

This Kernel is an entry into the City of LA Kaggle competition with the following problem statement:

The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly-specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

The goal is to convert a folder full of plain-text job postings into a single structured CSV file and then to use this data to: 

(1) identify language that can negatively bias the pool of applicants; 

(2) improve the diversity and quality of the applicant pool; and/or 

(3) make it easier to determine which promotions are available to employees in each job class.

During the competition, the discussion moved from a focus on diversity to the idea of balance. Does the LA City recruitment process encourage the development of a workforce where the make-up of the city’s population is well represented in all departments and levels of seniority? This would be a balanced workforce.


## The Single Structure CSV file

The dataframe which is the source of the CSV file can be found [here](#sdf). As the requirements can be complex, there is often more than one row per job bulletin.

A second dataframe called df_eda_exam has also been produced [here](#eda_exam) for the Exploratory Data Analysis (EDA). This dataframe consists of one row per job bulletin which makes analysis more straightforward. 

## CSV file production
The CSV file has been produced in an iterative manner, with a cycle of code refactoring as familiarity with the data grew. Anomalies have been identified by output to the terminal. Some of these outputs have been retained in the final version.

The CSV file has been tested in the EDA as well as in the data cleaning and validation sections by comparing the outputs with the revised data dictionary. Much of this comparison has been done visually, using pandas features such as .describe, .groupby and sorting to focus on the relevant data.

To increase readability and the potential for re-use by others, the coding strategy has been to:

* isolate the processing of the different data types into separate functions as far as possible
* use regex as the main technique for data extraction, using the less esoteric methods available

Readability has been prioritised over processing speed with the exception of coding for just one read per job bulletin.
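
As a rough illustration of this strategy, the sketch below keeps each field's regex in its own small function. The sample text, its layout, the values in it and the function names are all assumptions made for illustration; they are not the functions used later in this kernel.

```python
import re

# Hypothetical bulletin excerpt: the layout and every value are assumptions.
SAMPLE_BULLETIN = """SENIOR SAFETY ENGINEER PRESSURE VESSELS
Class Code:       4263
Open Date:  03-09-18
ANNUAL SALARY
$98,000 to $120,000
"""

def extract_class_code(text):
    """Return the class code as a string, or None if it is absent."""
    match = re.search(r"Class\s+Code:\s*(\d+)", text, flags=re.IGNORECASE)
    return match.group(1) if match else None

def extract_open_date(text):
    """Return the open date in its original MM-DD-YY form, or None."""
    match = re.search(r"Open\s+Date:\s*(\d{2}-\d{2}-\d{2})", text, flags=re.IGNORECASE)
    return match.group(1) if match else None

def extract_entry_salary(text):
    """Return the first '$low to $high' range as a (low, high) tuple of ints."""
    match = re.search(r"\$([\d,]+)\s+to\s+\$([\d,]+)", text)
    if not match:
        return None
    return tuple(int(value.replace(",", "")) for value in match.groups())

fields = {
    "JOB_CLASS_NO": extract_class_code(SAMPLE_BULLETIN),
    "OPEN_DATE": extract_open_date(SAMPLE_BULLETIN),
    "ENTRY_SALARY": extract_entry_salary(SAMPLE_BULLETIN),
}
print(fields)
```

Keeping one pattern per function means a format anomaly in one field can be fixed without touching the others.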

It is likely that the data cleaning routines will not be used extensively and so time has not been spent on optimising or generalising them. As the functions will not be reused, parameter passing is not strictly necessary. However, as the use of globals in Python is not straightforward, parameter passing has been retained.

This approach would not have been appropriate if the task was to provide a generalised method for many different organisations. The current method is focussed on simplifying the process by taking advantage of the particular format used. Given enough data, a machine learning approach would be possible. Alternatively a pre-processing stage could have been introduced for first stage cleaning that could be bespoke.

## Exploratory Data Analysis
The main purpose of the EDA is to extract insights to develop recommendations concerning the improvement of the job bulletins, as required in the problem statement.

As a useful secondary effect, the analysis further tests the validity of the CSV file.

A range of common graphics packages have been used, including matplotlib, seaborn, networkx and graphviz.

As well as the usual Python libraries, I have used TextBlob and Textstat for the language analysis.

### Applications Data

There is a file providing [gender and ethnicity data](#gedata)  for 187 of the 678 job bulletins provided.

[Here](#sample) I show that the sample is a reasonable representation of the full set of bulletins by comparing ENTRY_SALARY, SCHOOL_TYPE, EXAM_TYPE, DEGREE_REQ, FULL_TIME? and whether a driver license is required.
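
One simple way to make this kind of comparison concrete is to compare the share of each category in the full set with its share in the sample. The sketch below does this for a single categorical field using only the standard library; the EXAM_TYPE values and counts are made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter

def proportions(values):
    """Return the share of each distinct value in the list."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

def max_proportion_gap(full, sample):
    """Largest absolute difference in category share between two lists."""
    p_full, p_sample = proportions(full), proportions(sample)
    keys = set(p_full) | set(p_sample)
    return max(abs(p_full.get(k, 0.0) - p_sample.get(k, 0.0)) for k in keys)

# Hypothetical EXAM_TYPE columns: counts are invented for illustration only.
full_exam_types = ["OPEN"] * 6 + ["INT_DEPT_PROM"] * 3 + ["OPEN_INT_PROM"] * 1
sample_exam_types = ["OPEN"] * 3 + ["INT_DEPT_PROM"] * 2

gap = max_proportion_gap(full_exam_types, sample_exam_types)
print(f"largest difference in share: {gap:.2f}")
```

A small gap for every field compared suggests the sample is a fair stand-in; a large gap on any one field would be a reason to treat conclusions about that field with care.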


The following insights have been drawn from the data:

[There is a gender imbalance](#gendertotal) in the total applications, with 40% coming from women.

[The reflection of ethnicity](#ethnicitytotal) in the total applications compared to the [city population](#laethnicity) is more nuanced. Using the terms in the gender and ethnicity data:

* applications from the black and Filipino communities are much higher than would be predicted from population data
* applications from the Hispanic community are lower than would be predicted from population data
* applications from the white and Asian communities are much lower than would be predicted from population data
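
The comparison with population data can be expressed as a ratio of application share to population share, where 1.0 means applications match the population. The sketch below is only an illustration of that arithmetic: the group names and all shares are placeholders, not figures from the application or census data.

```python
def representation_ratio(application_share, population_share):
    """Ratio of application share to population share for each group.

    A value above 1.0 means the group applies more often than its share
    of the population would predict; below 1.0 means less often.
    """
    return {
        group: application_share[group] / population_share[group]
        for group in population_share
    }

# Placeholder shares, invented for illustration only.
applications = {"group_a": 0.30, "group_b": 0.70}
population = {"group_a": 0.50, "group_b": 0.50}

ratios = representation_ratio(applications, population)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```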
 
 Digging deeper into the data, it is possible to generate further actionable insights:

#### Gender<a id='gender'></a>
From a [gender point of view](#genderdegree), the total number of applications for non-degree level jobs is reasonably balanced. However there is a huge gender imbalance at the job level. For instance only a small fraction of the applications for [apprenticeship type jobs](#genderapprenticeship) and jobs requiring [special driver licenses](#genderdl) are from women.

The gender imbalance in the total number of applications is almost entirely due to the larger percentage of men applying for [degree level jobs](#genderdegree). This grouping includes jobs where previous experience has required a degree-level qualification. When we look at applications for jobs that require a [college/university education](#genderapprenticeship) for the current role, women are far better represented. However, they are still in the minority.

Women are less likely to apply for senior roles, as implied by both a [higher salary](#gendersalary) and a college/university qualification.

#### Ethnicity<a id='ethnicity'></a>
The applications for [apprenticeship roles](#ethnicityapprenticeship) most accurately reflect the LA population. This is useful to know in light of the fact that there is such a wide gender disparity for these roles.
 
 The set of [salary charts](#ethnicitysalary) show the  following trends:
 
* the Hispanic community is under-represented at all levels, and this is most apparent at the top salary range
* the Caucasian community dominates the top salary ranges
* the Asian community is better represented for roles with higher salaries and academic requirements
* the black and Filipino communities are well represented in all charts
     
 ### EDA Conclusion
**When identifying language that can negatively bias the pool of applicants and finding ways to improve the diversity and quality of the applicant pool, we should put a priority on the gender and Hispanic imbalances identified.**

## Language Analysis<a id='langanal'></a>

The  job bulletins have been reviewed quantitatively for the following properties:


### [Sentiment Analysis](#sa)

The results show that the bulletins usually deliver a negative sentiment which is largely due to the legal process statements.

### [Gender Coding](#gc)

The results show that the bulletins are heavily biased to masculine wording.
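
To show the general idea behind this kind of count, the sketch below matches words against short lists of masculine-coded and feminine-coded word stems. The stems are a small illustrative subset chosen for this example, and the function name is hypothetical; this is not the word list or code used in the analysis.

```python
import re

# Illustrative subsets of gender-coded stems: an assumption for this sketch.
MASCULINE_STEMS = ("compet", "domina", "lead", "independen", "assert")
FEMININE_STEMS = ("cooperat", "support", "understand", "commit", "trust")

def gender_code_counts(text):
    """Return (masculine_count, feminine_count) of words starting with a coded stem."""
    words = re.findall(r"[a-z]+", text.lower())
    masculine = sum(any(w.startswith(s) for s in MASCULINE_STEMS) for w in words)
    feminine = sum(any(w.startswith(s) for s in FEMININE_STEMS) for w in words)
    return masculine, feminine

example = "Leads a competitive team and supports independent staff."
print(gender_code_counts(example))
```

Matching on stems rather than whole words catches variants such as "leads" and "leadership" with a single entry, at the cost of some false matches.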

### [You and Me](#yandme)

Textio, a platform that predicts the type of response job offers will get based on their wording, reports that ads using “you” and “we” are filled faster. Unfortunately, when we do find instances of these words in the job bulletins, they relate to formal instructions and so the opportunity is missed.
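
A count of this kind needs very little code. The sketch below counts whole-word occurrences of "you", "we" and "us"; the example sentence and the function name are assumptions for illustration.

```python
import re
from collections import Counter

PRONOUNS = ("you", "we", "us")

def pronoun_counts(text):
    """Count whole-word occurrences of each pronoun, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in PRONOUNS)
    return {p: counts.get(p, 0) for p in PRONOUNS}

example = "You will join our team. We value your ideas. Contact us."
print(pronoun_counts(example))
```

A count alone does not show whether the words appear in welcoming text or in formal instructions, so the matching sentences still need to be read.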

### [Readability](#readr)

Many of the job bulletins have a reading ease suitable for university work. Texts need to be much easier to read.
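
For readers without Textstat to hand, the Flesch reading ease formula can be sketched with the standard library. The syllable counter below is a crude vowel-group heuristic, so scores will differ from those Textstat produces; it is only meant to show how sentence length and word length drive the score. Both example texts are invented.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: the number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

simple = "Apply now. We want you."
formal = ("Candidates demonstrating insufficient qualifications will be "
          "disqualified from examination participation.")
print(round(flesch_reading_ease(simple), 1), round(flesch_reading_ease(formal), 1))
```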


## Job Bulletins: conclusions and recommendations<a id='bullrec'></a>

The current job bulletins do not communicate to the potential applicant why they should be interested in the job and the organisation. The bulletins are process driven and read like official public notices. The small print has taken over. There is a place for small print but it needs to be demoted. After all, who reads the T&Cs?

Ideally the format should change to respond to general best practice and the specific LA City requirement of encouraging applications that address the gender and the Hispanic imbalances. Here is an ideal proposed structure:

* Job Title indicating the level of the job
* Salary
* Why they should be interested in terms of:
    * The job challenges and opportunities
    * The team and its culture
    * The opportunity to progress
    * The work environment
    * The benefits
* Written to help them see themselves in the role
* Clear and short application notes

This would be a big change and so, as a first stage, the bulletins could be refactored as follows:

* Job Title indicating the level of the job
* Salary
* Duties
* Requirements
* Application Process
* Small print

The text should be written to:

* Improve readability with shorter words and sentences. 
* Replace insider/jargon words like “This examination is based on a validation study”
* Reduce the negative sentiment typical of the process driven approach that talks about examinations and disqualification.
* Express ideas in a feminine, co-operative style rather than a masculine one
* Let the organisation culture shine through the words
* Show the diversity statement is more than a set of necessary legal words.

**Are job bulletins available in Spanish? If not, could this happen? It would definitely send a message of inclusivity. If they are available, could they be made more accessible?**

 
Try to loosen up on the strict requirements. Studies show that women are much less likely than men to apply for a job if they fall short of any of the stated requirements. Here is an [interesting paragraph](https://hbr.org/2014/08/why-women-dont-apply-for-jobs-unless-theyre-100-qualified) about job requirements:

"There was a sizable gender difference in the responses for one other reason: 15% of women indicated the top reason they didn’t apply was because “I was following the guidelines about who should apply.” Only 8% of men indicated this as their top answer. Unsurprisingly, given how much girls are socialized to follow the rules, a habit of “following the guidelines” was a more significant barrier to applying for women than men."

## [Explicit links](#explicitlinks)

Extract<a id='el'></a> from the Description of promotions in job bulletins.docx file:

*All such promotions could thus be represented in a single directed graph, representing all the City’s job classes and the allowable promotions between them. You are invited to explore ways to visualize this directed graph.*

Two methods of representing these links are explored:

* circular representation with the job in focus at the centre
* hierarchical representation with the job in focus at the appropriate level

The single graph requested is presented [here](#hierarchyall).

Perhaps a more useful presentation is at the department level and some examples are presented [here](#hierarchydepartment). It is useful to note that these examples often differ from the City Job Path diagrams because the raw data used is different. The output from the digraph format used here is in text format and could be used as the input for other methods of presentation.

A hierarchical method has also been used to [present graphically](#hierarchyreq) a fuller set of requirements, including school type and licence requirement.

The circular diagrams are useful when focussing on one job class. [Here](#el) are some examples.

The first set of plots shows the [subordinate](#el) positions to a role, i.e. who could be promoted. The second set of plots shows the subordinate positions to one of the [difficult](#difficult) to fill roles reported by the hosts. The third set of plots shows what [promotions](#promotional) are available to a role. There are some amazing routes...
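
To illustrate how link data of this kind can be queried for career suggestions, the sketch below stores promotions as a directed graph in a plain dictionary and walks it breadth-first. Only the first link reflects a requirement quoted in this kernel's test code; the "HYPOTHETICAL CHIEF INSPECTOR" class and the function name are invented for illustration, and the kernel itself uses networkx and graphviz rather than this code.

```python
from collections import deque

# Edges point from a job class to the classes it can be promoted into.
PROMOTIONS = {
    "SAFETY ENGINEER PRESSURE VESSELS": ["SENIOR SAFETY ENGINEER PRESSURE VESSELS"],
    # The next link is hypothetical, added only to show a two-step route.
    "SENIOR SAFETY ENGINEER PRESSURE VESSELS": ["HYPOTHETICAL CHIEF INSPECTOR"],
    "HYPOTHETICAL CHIEF INSPECTOR": [],
}

def promotion_routes(graph, start):
    """Map every job reachable from start to the number of promotions needed."""
    steps = {start: 0}
    queue = deque([start])
    while queue:
        job = queue.popleft()
        for next_job in graph.get(job, []):
            if next_job not in steps:  # first visit is the shortest route
                steps[next_job] = steps[job] + 1
                queue.append(next_job)
    del steps[start]
    return steps

routes = promotion_routes(PROMOTIONS, "SAFETY ENGINEER PRESSURE VESSELS")
for job, n in routes.items():
    print(f"{n} step(s): {job}")
```

Reversing the edge direction in the dictionary gives the subordinate view instead, i.e. which roles feed into a given job.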


## [Implicit Linking](#implicitlinks)

A test to develop implicit linking using Fasttext and word embedding comparison of the duty or requirements has been tried.

The initial results are not useful for a number of reasons and [recommendations](#implicitlinks) for future work are provided.
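
The comparison step can be pictured with a much simpler stand-in for embeddings. The sketch below scores two duty statements by the cosine similarity of their word counts; it swaps Fasttext sentence embeddings for a plain bag-of-words vector, so it only captures shared vocabulary, not meaning. The duty texts and function names are invented for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Word-count vector for a text, lowercased."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors (0.0 to 1.0)."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

duty_a = "inspects pressure vessels and boilers"
duty_b = "inspects boilers and elevators"
print(round(cosine_similarity(duty_a, duty_b), 2))
```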

## Next Steps<a id='nextsteps'></a>

The structured data file may not be in the ideal format and so this kernel could be rerun with the necessary code edits. The code has been written with readability and ease of revision as a top priority to allow for this.

The provided data is not complete and there are some inconsistencies. If the kernel is rerun with new data, comment out the chk_file_valid(filename) function [here](#topcell). Remove the names of new files from the [missing list](#missingfiles). The kernel will then print out details of the anomalies it finds.

In this kernel the structured data file has been used with application data to draw insights about where balance could be a problem. By replacing the application data with successful candidate and current team make-up data, further insights are possible without any code changes.

There are a number of suggestions concerning the re-presentation of the job bulletins. Again, when revisions are produced the kernel can be re-run to show what improvements have been made.

When editing a single file, it can be checked for readability using either Microsoft Word or [this resource](http://www.readabilityformulas.com/freetests/six-readability-formulas.php). For [gender coding](#gc), refer to these words. Also, you can do a you/we/us count using your word processing package. The [output](#sa) from the Sentiment Analysis shows the kind of words to avoid. Words do matter, and with these pointers, I think we all hope that more inclusive job bulletins will naturally follow. Remember to keep it short and relegate the small print!

By adding further data, the explicit link data and diagrams will be automatically updated. It is interesting to compare the diagrams here with the Job Path data provided. Differences are due to both missing job bulletins and new links found by the software.

Implicit links could be an interesting follow up project. We would need revised job bulletins and any further data including fuller job descriptions.


## Data Notes

#### 683 bulletins were provided. Of these 2 were duplicate job classes and only the latest bulletin has been included:

CHIEF CLERK POLICE 1249 083118.txt

SENIOR UTILITY SERVICES SPECIALIST 3573 113018.txt

#### In 3 cases the body text does not match the file name and so these bulletins have also been excluded:

ANIMAL CARE TECHNICIAN SUPERVISOR 4313 122118.txt

WASTEWATER COLLECTION SUPERVISOR 4113 121616.txt

SENIOR EXAMINER OF QUESTIONED DOCUMENTS 3231 072216 REVISED 072716.txt

#### In one case a class code is not provided:

Vocational Worker DEPARTMENT OF PUBLIC WORKS.txt

The structured data file therefore contains 678 jobs with 677 class codes.



In [None]:
import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import networkx as nx
import nltk, string
import matplotlib.pyplot as plt
import re,glob
import time
import os
import random
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

%matplotlib inline

print(os.listdir("../input"))

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

stop_words = set(stopwords.words('english'))
stop_words.update(['gives'])

random_state = 21

# Replace hyphens with spaces so hyphenated words are split before tokenizing
punct = '-'
remove_punctuation_map = dict((ord(char), ' ') for char in punct)

def normalize(text):
    # Lowercase, replace hyphens with spaces, then tokenize into words
    return nltk.word_tokenize(text.lower().translate(remove_punctuation_map))

def clean_text(text):
    # Lowercase, replace hyphens, then lemmatize each whitespace-separated token
    text = text.lower().translate(remove_punctuation_map)
    return ' '.join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))

# Map number words ("zero" to "eighteen") to their integer values
numbers = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight","nine", "ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen","seventeen","eighteen"]
numwords = {word: idx for idx, word in enumerate(numbers)}



#column heads equals 25 plus any added
len_ch = 25




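# cosine_sim needs a TF-IDF vectorizer, which is not defined elsewhere in this cell.
# The definition below is an assumption: a TfidfVectorizer using normalize as its tokenizer.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    # Cosine similarity between the TF-IDF vectors of the two texts
    tfidf = vectorizer.fit_transform([text1, text2])
    return (tfidf * tfidf.T).toarray()[0, 1]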



## Data Dictionary

### Load in Kaggle Dictionary

In [None]:
path = '../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Additional data/'
filename = 'kaggle_data_dictionary.csv'

df_kddict = pd.read_csv(path + filename)
df_kddict.set_index('Field Name', inplace=True)
df_kddict.head(40)

### Modify the Dictionary

In [None]:
df_kddict_T = df_kddict.T
# df_kddict_T.head()

In [None]:
df_kddict_T.insert(loc = 25, column = 'NOTES', value = '')
df_kddict_T.insert(loc = 26, column = 'SELECTION_PROCESS', value = '')
df_kddict_T.insert(loc = 18, column = 'VOCATIONAL_QUAL', value = '')
df_kddict_T.insert(loc = 5, column = 'AND_OR', value = '')
df_kddict_T.insert(loc = 7, column = 'DEGREE_REQ', value = '')
len_ch += 5
df_new_kddict = df_kddict_T.T
# df_new_kddict.head(40)

In [None]:
#REQUIREMENT_SET_ID is set to string. Although it is a number, math ops are not required
df_new_kddict.at['REQUIREMENT_SET_ID','Data Type'] = 'string'
df_new_kddict.at['REQUIREMENT_SET_ID','Allowable Values'] = '0-9'
df_new_kddict.at['REQUIREMENT_SET_ID','Accepts Null Values?'] = 'Yes'
df_new_kddict.at['REQUIREMENT_SET_ID','Additional Notes'] = 'Many requirements do not have an assigned REQUIREMENT_SET_ID '

#REQUIREMENT_SUBSET_ID is set to string. Although it is a number, math ops are not required
df_new_kddict.at['REQUIREMENT_SUBSET_ID','Accepts Null Values?'] = 'Yes'
df_new_kddict.at['REQUIREMENT_SUBSET_ID','Additional Notes'] = 'Many sub- requirements do not have an assigned REQUIREMENT_SUBSET_ID '

#AND_OR is set to string. It records the conjunction (AND/OR) between requirements
df_new_kddict.at['AND_OR','Data Type'] = 'string'
df_new_kddict.at['AND_OR','Description'] = 'Requirement conjunctions: The And or Or that separates requirement sets, suggesting options for entry into the job class or multiple requirements to satisfy. E1: Overall requirement conjunction, separating overall requirement sets. E2: Sub-requirement conjunction, separating sub-requirement sets'
df_new_kddict.at['AND_OR','Annotation Letter'] = 'E1/E2'
df_new_kddict.at['AND_OR','Accepts Null Values?'] = 'Yes'

#DEGREE_REQ will be useful when reviewing balance in the EDA
df_new_kddict.at['DEGREE_REQ','Data Type'] = 'string'
df_new_kddict.at['DEGREE_REQ','Description'] = 'Is a degree required for a requirement option or was it required for a previous role'
df_new_kddict.at['DEGREE_REQ','Accepts Null Values?'] = 'No'
df_new_kddict.at['DEGREE_REQ','Additional Notes'] = 'DEGREE_REQ will be useful when reviewing balance in the EDA'


#There are nine School_Types 

df_new_kddict.at['SCHOOL_TYPE','Allowable Values'] =  'AMERICAN BAR ASSOCIATION ACCREDITED LAW SCHOOL, APPRENTICESHIP, \
COLLEGE, COLLEGE OR TRADE SCHOOL, COLLEGE OR UNIVERSITY, \
COLLEGE OR UNIVERSITY OR TRADE SCHOOL, HIGH SCHOOL OR G.E.D. EQUIVALENT \
HIGH SCHOOL, UNIVERSITY, COLLEGE, TRADE OR TECHNICAL SCHOOL,UNIVERSITY' 

#As SCHOOL_TYPE include certification, the details should appear here
types = df_new_kddict.loc['EDUCATION_MAJOR','Additional Notes']
df_new_kddict.at['EDUCATION_MAJOR','Additional Notes'] = types +', Certificate details'


#As well as part/full time, experience can be measured in hours

types = df_new_kddict.loc['FULL_TIME_PART_TIME','Allowable Values']
df_new_kddict.at['FULL_TIME_PART_TIME','Allowable Values'] = types +', xxx HOURS'
df_new_kddict.at['FULL_TIME_PART_TIME','Additional Notes'] = 'Some experience requirements are measured in hours'


#As well as *S*Q, courses can be measured in hours

types = df_new_kddict.loc['COURSE_LENGTH','Allowable Values']
df_new_kddict.at['COURSE_LENGTH','Allowable Values'] = types +', xxx HOURS'

#Full list of options
df_new_kddict.at['DRIV_LIC_TYPE','Allowable Values'] = 'A|A OR B|A AND/OR B|A,B OR C|A OR I|B|B OR C|B AND/OR C|C'
df_new_kddict.at['DRIV_LIC_TYPE','Additional Notes'] = 'Where there are both requirements and possibilities only the requirements are recorded'


#There is a fifth EXAM_TYPE
types = df_new_kddict.loc['EXAM_TYPE','Allowable Values']
df_new_kddict.at['EXAM_TYPE','Allowable Values'] = types +', EXEMPT_EMPLOYEES'

#Not all bulletins have gen salaries
df_new_kddict.at['ENTRY_SALARY_GEN','Accepts Null Values?'] = 'Yes'

#Not all bulletins have open dates recorded
df_new_kddict.at['OPEN_DATE','Accepts Null Values?'] = 'Yes'


#df_new_kddict.reset_index()
df_new_kddict.reset_index(level=0, inplace=True)
df_new_kddict.head(40)

In [None]:
column_heads = df_new_kddict['Field Name'].tolist()
print (column_heads)

In [None]:
c_h = df_new_kddict['Field Name'].to_dict()
#invert the mapping so that each field name gives its column position
col_heads = {value: key for key, value in c_h.items()}

print (col_heads)

## Loading in the Job Classes and adding missing ones<a id='missingfiles'></a>

In [None]:
path = '../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Additional data/'
filename ='job_titles.csv'
with open(path + filename, 'r', errors='ignore') as f:
    j_t = f.readlines()
len_j_t = len(j_t)
for i in range(len_j_t):
    #strip the newline and normalise the one title that includes its department
    j_t[i] = j_t[i].replace("\n","")
    if j_t[i]  ==  'Vocational Worker  DEPARTMENT OF PUBLIC':
        j_t[i] = 'VOCATIONAL WORKER'

j_t.append('SEASONAL POOL MANAGER')
j_t.append('OPEN WATER LIFEGUARD')
j_t.append('ELECTRICAL ENGINEER')
j_t.append('WASTEWATER TREATMENT MECHANIC')
j_t.append('TELECOMMUNICATIONS PLANNER')
j_t.append('CONSTRUCTION EQUIPMENT SERVICE SUPERVISOR')
j_t.append('COMPUTER OPERATOR')
j_t.append('SPECIAL PROGRAM ASSISTANT')
j_t.append('FIRE PROTECTION ENGINEER')
j_t.append('PRINT SHOP TRAINEE')
j_t.append('IMPROVEMENT ASSESSOR')
j_t.append('ENGINEERING ASSOCIATE')
j_t.append('SAFETY ENGINEER PRESSURE VESSELS')
j_t.append('SENIOR ROOFER')
j_t.append('SENIOR CLERK TYPIST')
j_t.append('SENIOR STREET SERVICES INVESTIGATOR')
j_t.append('LIBRARY CLERICAL ASSISTANT')
j_t.append('WASTEWATER TREATMENT ELECTRICIAN')
j_t.append('PERFORMING ARTS PROGRAM COORDINATOR')
j_t.append('PROCUREMENT AIDE')
j_t.append('LOAD DISPATCHER')
j_t.append('ASSOCIATE ZOO CURATOR')
j_t.append('AIRPORT POLICE SERGEANT')
j_t.append('FIRE SPRINKLER FITTER')




#Executive, Senior, Coordinating or Web Content Producer.?


#sort the list so that longest by number of words is first
#this allows us to remove a match and so duplicates are avoided when searching for subordinate roles
j_t.sort(key=lambda x: len(x.split()), reverse=True)

#print (j_t)

## Displaying the Sample Template

In [None]:
# path = '../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Additional data/'
# sample_job_class = pd.read_csv(path + 'sample job class export template.csv')
# df_job_class = sample_job_class.copy()
# sample_job_class.head()

## Functions called by the cell that produces the structured data file

In [None]:
def find_experience(line,job):
    #j_t contains a list of the job class names
    len_j_t = len(j_t)
    max_pos = 0
    exp = ''
    last_exp = ''
    line = line.upper()
    full_line = line
    job = job.upper()

    #only focus on part of line that is relevant
    #keep only the text before each of these markers, applied in turn
    for marker in ('ASSISTING', 'SUBSTITUT', 'YEARS OF WHICH', 'INCLUDING.*?YEARS'):
        found = re.search('(.*)' + marker + '(.*)', line)
        if found:
            line = found.group(1)

    #search the line for each job class
    #print ('line',line)
    for i in range(len_j_t):
        #match the job title followed by an optional level (I, II, III) and a delimiter
        pattern =  j_t[i]+ r'( I{0,3}[ |\.|;|,])'
        pattern2 =  j_t[i] + r'( |\.|;|,)'
        
        special = 'FIREFIGHTER'
        special_ex = 'ENDORSEMENT'
        special2 = 'ARCHITECT'
        special2_ex = 'LICENSE'
        
        if job == j_t[i] and re.search(pattern2, line):
            #remove instances of the job title in the requirements section
            #print ('job removed', job)
            src = re.search(pattern2, line)
            line =line.replace(src.group(0),'')
                           
        #we are looking for subordinates, so dont want assistants or the job itself
        if re.search(pattern2, line) and job!= j_t[i] and job != ('ASSISTANT '+ j_t[i]):
            #exclude references where job title is used for other information
            if not(j_t[i] == special and  re.search(special + " " +special_ex, line)):
                 if not(j_t[i] == special2 and  re.search(special2 + " " +special2_ex, line)):
                    #include all the subsets I, II, III etc
                    src = re.search(pattern2, line)
                    matches = re.findall(pattern, line) 
                    if matches:
                        for match in matches:
                            exp = exp + j_t[i]  + match +', '
                    else:
                        exp = exp + j_t[i] +', '
                    #remove the job so we dont include shorter versions: eg "senior job" and "job"
                    line =line.replace(src.group(0),'')
                    #but need to find the last job in line to mark beginning of other text
                    src = re.search(j_t[i], full_line)
                    if src.start(0) > max_pos:
                        max_pos = src.start(0)
                        last_exp = j_t[i]
    exp = exp.rstrip(', ')
    return exp, last_exp

# TEST CODE
# line = "1. Two years of full-time paid experience as a Safety Engineer Pressure Vessels with the City of Los Angeles; and"
# job = 'Senior Safety Engineer Pressure Vessels'

# find_experience(line,job)

In [None]:
def examination_type(line, level):
#The exam type information is sometimes spread over two lines and so we have two levels, one for each line.
#level = 0 for identifying we are in exam type
#level = 1 for the second line

    if (level == 0):
        exam_type_pattern = '(THIS EXAM)(.*?)( IS TO BE GIVEN)(.*)'   
        exam_type_pattern2 = 'FOR EXEMPT EMPLOYEES SEEKING TO BECOME(.*)'   
        if re.search(exam_type_pattern,line):
            #print (re.search(exam_type_pattern,line).group(4))
            row[col_heads['EXAM_TYPE']]  = re.search(exam_type_pattern,line).group(4)
            level = 1
        if  re.search(exam_type_pattern2,line):
            #print (line)
            row[col_heads['EXAM_TYPE']]  = line
            level = 1            
    else:
        #print ('line, level',line, level)
#         #print (line)
#         row[col_heads['EXAM_TYPE']]  = row[col_heads['EXAM_TYPE']] + ' ' + line
        level = 0
#Sometimes the second line has other info, or there isn't a second line
        exam_type_pattern3 = 'The City of'   
        
        if not re.search(exam_type_pattern3,line):
            #normal case
            exam_type = row[col_heads['EXAM_TYPE']] + ' ' + line  
            exam_type = exam_type.replace ('AN AN','AN')
            exam_type = exam_type.replace ('AND OPEN','AND AN OPEN')
            exam_type = exam_type.replace ('BOTH ON','ON')
            exam_type = exam_type.replace ('ON BOTH','ON')
            exam_type = exam_type.replace ('ON A ','ON AN ') 
            exam_type = exam_type.replace ('AND ON AN','AND AN')  
            exam_type = exam_type.replace ('TO ON','ON')  
            exam_type = exam_type.replace ('ONLY ON A','ON A')  
            exam_type = exam_type.replace ('BASIS ONLY','BASIS')  
       
            exam_type = exam_type.replace ('INTERDEPARMENTAL','INTERDEPARTMENTAL')  
            exam_type = exam_type.replace ('BASIS   NVVC','BASIS')  
   
        
            exam_type = exam_type.replace ('AND INTERDEPARTMENTAL','AND AN INTERDEPARTMENTAL')  
       
            exam_type = exam_type.replace ('COMPETITVE','COMPETITIVE')
            exam_type = exam_type.replace ('ON AN OPEN COMPETITIVE AND AN INTERDEPARTMENTAL PROMOTIONAL BASIS','ON AN INTERDEPARTMENTAL PROMOTIONAL AND AN OPEN COMPETITIVE BASIS')
            
            row[col_heads['EXAM_TYPE']]  = exam_type
        else:
            row[col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']].replace ('ONLY ON A','ON A') 
#Dictionary requires an abbreviated set of options
        row[col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']].replace ('ON AN OPEN COMPETITIVE BASIS', 'OPEN')
        row[col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']].replace ('FOR EXEMPT EMPLOYEES SEEKING TO BECOME CIVIL SERVICE EMPLOYEES', 'EXEMPT_EMPLOYEES')
        row[col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']].replace('ON AN INTERDEPARTMENTAL PROMOTIONAL BASIS','INT_DEPT_PROM')
        row[col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']].replace('ON AN DEPARTMENTAL PROMOTIONAL BASIS','DEPT_PROM')
        row[col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']].replace('ON AN INTERDEPARTMENTAL PROMOTIONAL AND AN OPEN COMPETITIVE BASIS','OPEN_INT_PROM')

           
            
            
            
            
   
    return level

In [None]:
def degree_req(line):
# cases to consider:
# require possession of a degree 
# degree require 

    line = line.lower()
            
    degree_req_pattern = '(degree)(.*)(required)'  
    degree_req_pattern2 = '(require)(.*?)(degree)'   
        
    if re.search(degree_req_pattern,line) or  re.search(degree_req_pattern2,line):
        if not re.search('desired',line):
            #print ('deg', line)
            row[col_heads['DEGREE_REQ']]  = 'YES'
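            #for senior jobs a college education is implicit because of required previous experience
            #row values are initialised to '' rather than NaN, so test for both
            if row[col_heads['SCHOOL_TYPE']] == '' or pd.isna(row[col_heads['SCHOOL_TYPE']]):
                 row[col_heads['SCHOOL_TYPE']] = 'COLLEGE OR UNIVERSITY'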
    
    return    
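
As a quick illustration of the two patterns used in `degree_req`, the sketch below applies them to a few made-up lines. The sample sentences and the helper name `degree_required` are assumptions for illustration only. Unlike the notebook function, the helper returns a value instead of writing to `row`.

```python
import re

# The same two patterns as degree_req, written as raw strings.
DEGREE_REQ_PATTERN = r'(degree)(.*)(required)'
DEGREE_REQ_PATTERN2 = r'(require)(.*?)(degree)'

def degree_required(line):
    """Return True when a line states that a degree is required (hypothetical helper)."""
    line = line.lower()
    if re.search(DEGREE_REQ_PATTERN, line) or re.search(DEGREE_REQ_PATTERN2, line):
        # A degree that is only "desired" does not count as a requirement.
        return not re.search('desired', line)
    return False

# Made-up sample lines, not taken from the bulletins.
samples = [
    "A Bachelor's degree is required.",
    "Positions require possession of a degree in engineering.",
    "A Master's degree is desired but not required.",
    "Two years of full-time paid experience.",
]
for sample in samples:
    print(degree_required(sample), '|', sample)
```

Only the first two sample lines are flagged. The third is rejected by the `desired` check and the fourth matches neither pattern.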

In [None]:
def clean_licence_line(lclass):
    lclass = lclass.rstrip(" ").lstrip(" ")
    lclass = lclass.rstrip("\'").lstrip("\'").upper()
    lclass = lclass.replace( '"', '')
    lclass = lclass.replace( 'CLASS', '')
    lclass = lclass.replace( '(', '')
    lclass = lclass.replace( ')', '')
    lclass = lclass.replace( '  ', ' ')
    lclass = lclass.replace( 'I OR A', 'A OR I')
    lclass = lclass.replace( 'B OR A', 'A OR B')
    lclass = lclass.replace( 'C OR B', 'B OR C')
    lclass = lclass.replace( 'C AND/OR B', 'B AND/OR C')
    lclass = lclass.replace( '1/A OR 2/B', 'A OR B')
    return lclass


def driver_licence(line):
# cases to consider:
# a valid california driver's license is required.
# a valid california driver's license may be required.
# for positions requiring a valid class b driver's license
# a valid unrestricted california commercial class a or class b driver's license and valid medical certificate approved
# a valid california commercial class b driver's license 
#Some positions may require a valid California Class C and/or Class B driver's license.
# 

    line = line.lower()
#     if ('driver'  in line):
#         if ('some positions may require' not in line):
#             if ("a valid california driver's license is required" not in line):
#                 print (line)
            
    driver_lic_pattern = '(driver)(.*)(required)'  
    driver_lic_may_pattern = '(driver)(.*)(may)(.*)(required|approved)'   
    driver_lic_may_pattern2 = '(positions)(.*?)(may)(.*?)(require)(.*?)(driver)'   
    
    #driver_lic_class_pattern = '(.*)(class)(.*)(driver)(.*)(required|approved)'   
#    driver_lic_class_pattern = '(.*?)(class)(?!\..*?)(driver)(.*?)(;|\.|:|,|\r|\n)'   
    driver_lic_class_pattern = '(.*?)(class)(.*?)(commercial|california|driver)(.*?)(required)(.*?)(;|\.|:|,|\r|\n)'   
    #driver_lic_may_class_pattern = 'positions(.*?)may(.*?)(requir)(.*?)(class)(.*)(driver)'   
    driver_lic_may_class_pattern = 'positions(.*?)may(.*?)(requir)(.*?)(class)(.*?)(driver|license)'   
    driver_lic_may_class_pattern2 = '(.*?)(class)(.*?)(commercial|california|driver)(.*?)(may)(.*?)(;|\.|:|,|\r|\n)' 
   
    driver_lic_endorse_pattern = '(license with )(.*?)(endorsement)'   
    
    if re.search(driver_lic_may_pattern,line) or  re.search(driver_lic_may_pattern2,line):
            #print ('dlp', ' P', line)
            row[col_heads['DRIVERS_LICENSE_REQ']]  = 'P'
    elif re.search(driver_lic_pattern,line):
#            print ('dlp', ' R')
            row[col_heads['DRIVERS_LICENSE_REQ']]  = 'R'
    if re.search(driver_lic_class_pattern,line):
            may = re.search(driver_lic_class_pattern,line).group(1)
            #print ('may', may)
            #print ("re.search('may',may)",re.search('may',may))
            if not re.search('may',may) and not re.search('(incumbents|employees| of this class)|(positions in the class)|(class of commercial)',line):
                lclass = re.search(driver_lic_class_pattern,line).group(3)
                #print ('dlpc', lclass)
                lclass = clean_licence_line(lclass)
                #print ('dlpcs', lclass)
                if row[col_heads['DRIV_LIC_TYPE']] == '':
                    row[col_heads['DRIV_LIC_TYPE']]  =  lclass
#                 if row[col_heads['DRIV_LIC_TYPE']] == '':
#                     row[col_heads['DRIV_LIC_TYPE']]  =  lclass.upper() + " "
#                 else:
#                     row[col_heads['DRIV_LIC_TYPE']]  = row[col_heads['DRIV_LIC_TYPE']] + lclass + " "

                if row[col_heads['DRIVERS_LICENSE_REQ']] != 'R':
                    row[col_heads['DRIVERS_LICENSE_REQ']]  = 'R'
#                 print ('R')

    if re.search(driver_lic_may_class_pattern,line):
  #              print("row[col_heads['DRIVERS_LICENSE_REQ']]",row[col_heads['DRIVERS_LICENSE_REQ']])
                lclass = re.search(driver_lic_may_class_pattern,line).group(6)
                lclass = clean_licence_line(lclass)
                
                #print ('dlpcsm', lclass)
                #print (lclass)
                #print (line)
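                #NB this test is always true (the value cannot equal both 'R' and 'P'),
                #so the licence type and 'P' are recorded on every match
                if row[col_heads['DRIVERS_LICENSE_REQ']] != 'R' or row[col_heads['DRIVERS_LICENSE_REQ']] != 'P':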
                    row[col_heads['DRIV_LIC_TYPE']]  = lclass
                    row[col_heads['DRIVERS_LICENSE_REQ']]  = 'P'
                    #print ('dlmpc', 'P',lclass.upper())
                    #print ('dlmpcline', line)

    if re.search(driver_lic_may_class_pattern2,line) and not re.search('(incumbents|employees| of this class)|(positions in the class)|(class of commercial)',line):
        lclass = re.search(driver_lic_may_class_pattern2,line).group(3)
        lclass = clean_licence_line(lclass)
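        #NB this test is always true (the value cannot equal both 'R' and 'P'),
        #so the licence type and 'P' are recorded on every match
        if row[col_heads['DRIVERS_LICENSE_REQ']] != 'R' or row[col_heads['DRIVERS_LICENSE_REQ']] != 'P':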
            row[col_heads['DRIV_LIC_TYPE']]  = lclass
            row[col_heads['DRIVERS_LICENSE_REQ']]  = 'P'
               
    if re.search(driver_lic_endorse_pattern,line):
            lclass = re.search(driver_lic_endorse_pattern,line).group(2) + re.search(driver_lic_endorse_pattern,line).group(3)
            #print ('endorse', lclass)
            row[col_heads['ADDTL_LIC']]  = lclass.upper()
    
    return    
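
The licence class extraction can be tried in isolation. The sketch below reuses `driver_lic_class_pattern` and the replacements from `clean_licence_line`. The wrapper `licence_class` and the sample sentences are assumptions for illustration, and the checks for "may", "incumbents" and similar wording made in `driver_licence` are left out.

```python
import re

# Pattern copied from driver_licence: the text between "class" and the next
# "commercial", "california" or "driver" is captured as group 3.
DRIVER_LIC_CLASS_PATTERN = (
    r'(.*?)(class)(.*?)(commercial|california|driver)(.*?)(required)(.*?)(;|\.|:|,|\r|\n)'
)

# Replacements applied in order, as in clean_licence_line.
REPLACEMENTS = [
    ('"', ''), ('CLASS', ''), ('(', ''), (')', ''), ('  ', ' '),
    ('I OR A', 'A OR I'), ('B OR A', 'A OR B'), ('C OR B', 'B OR C'),
    ('C AND/OR B', 'B AND/OR C'), ('1/A OR 2/B', 'A OR B'),
]

def clean_licence_line(lclass):
    """Normalise a captured licence class string."""
    lclass = lclass.strip(" ").strip("'").upper()
    for old, new in REPLACEMENTS:
        lclass = lclass.replace(old, new)
    return lclass

def licence_class(line):
    """Return the cleaned licence class, or '' if the pattern does not match (hypothetical helper)."""
    match = re.search(DRIVER_LIC_CLASS_PATTERN, line.lower())
    return clean_licence_line(match.group(3)) if match else ''

# Made-up sample lines.
print(licence_class("A valid California Class B driver's license is required."))
print(licence_class("A valid Class A or Class B driver's license is required."))
print(licence_class("A valid Class B or A driver's license is required;"))
```

The second and third lines both normalise to `A OR B`, which is the purpose of the ordering replacements.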

In [None]:

def entry_salaries (body,row):
    
    #chain the replacements so that the comma removal is not overwritten
    line = body.replace (',','')
    line = line.replace ('*','')
    line = line.lower()
    line = line.replace(' to ','-')
    
    #print ("sal line", line)
    salary_flat_pattern = r'(\$|\$ )(\d{4,8})(.*)(flat)(.*?)(rated)'
    #the fourth group allows the second "$" to be absent
    salary_range_pattern = r'(\$|\$ )(\d{4,8})(-)(\$|)(\d{4,8})'
    

    salary_pattern = '(.*)(department of water and power)(.*)'  
    if re.search(salary_pattern,line):
        part1 = re.search(salary_pattern,line).group(1)
        part2 = re.search(salary_pattern,line).group(3)
    else:
        part1 = line
        part2 = ''

    part1s = ''
    part2s = ''

    #salary_flat_pattern = '(\$)(.*)(flat-rated\))'
    if re.search(salary_flat_pattern,part1):
        part1s = re.search(salary_flat_pattern,part1).group(1) + re.search(salary_flat_pattern,part1).group(2)  +'(flat-rated)' 
    if re.search(salary_flat_pattern,part2):
        part2s = re.search(salary_flat_pattern,part2).group(1) + re.search(salary_flat_pattern,part2).group(2)  +'(flat-rated)'
        
    if re.search(salary_range_pattern,part1):
        part1s = re.search(salary_range_pattern,part1).group(0) 
    if re.search(salary_range_pattern,part2):
        part2s = re.search(salary_range_pattern,part2).group(0) 
    if (part1s != ''):
        row[col_heads['ENTRY_SALARY_GEN']]  = "\\" + part1s
    if (part2s != ''):
        row[col_heads['ENTRY_SALARY_DWP']]  = "\\" + part2s
    if re.search('scale pending',part1):
         row[col_heads['ENTRY_SALARY_GEN']]  = part1
    if re.search('scale pending',part2):
         row[col_heads['ENTRY_SALARY_DWP']]  = part2
       
    
    #print (line)
    #print (part1s)
    #print (part2s)

    return row
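
The salary parsing can be sketched on its own as below. The helper names `first_salary` and `parse_salaries` and the sample strings are assumptions for illustration. The sketch returns the two salaries instead of writing them to `row`, and it omits both the "scale pending" handling and the leading backslash that the notebook adds.

```python
import re

SALARY_FLAT_PATTERN = r'(\$|\$ )(\d{4,8})(.*)(flat)(.*?)(rated)'
SALARY_RANGE_PATTERN = r'(\$|\$ )(\d{4,8})(-)(\$|)(\d{4,8})'
DWP_SPLIT_PATTERN = r'(.*)(department of water and power)(.*)'

def first_salary(text):
    """Return a salary range if one is present, else a flat rate, else ''."""
    found = ''
    flat = re.search(SALARY_FLAT_PATTERN, text)
    if flat:
        found = flat.group(1) + flat.group(2) + '(flat-rated)'
    salary_range = re.search(SALARY_RANGE_PATTERN, text)
    if salary_range:
        # A range takes precedence over a flat rate, as in entry_salaries.
        found = salary_range.group(0)
    return found

def parse_salaries(body):
    """Return (general salary, DWP salary) parsed from the salary text."""
    line = body.replace(',', '').replace('*', '').lower().replace(' to ', '-')
    split = re.search(DWP_SPLIT_PATTERN, line)
    general, dwp = (split.group(1), split.group(3)) if split else (line, '')
    return first_salary(general), first_salary(dwp)

# Made-up sample text.
text = ("$49,903 to $72,996; The salary in the Department of Water and Power "
        "is $60,489 to $75,147.")
print(parse_salaries(text))
print(parse_salaries("$95,000 (flat-rated)"))
```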


In [None]:
def req_clr (row):
#When iterating through the requirements, the previous requirements 
#need to be cleared but other details are retained
    row[col_heads['EDUCATION_YEARS']] = ''
    row[col_heads['SCHOOL_TYPE']] = ''
    row[col_heads['EDUCATION_MAJOR']] = ''
    row[col_heads['EXPERIENCE_LENGTH']] = ''
    row[col_heads['FULL_TIME_PART_TIME']] = ''
    row[col_heads['COURSE_LENGTH']] = ''
    row[col_heads['COURSE_COUNT']] = ''
    row[col_heads['COURSE_SUBJECT']] = ''
    row[col_heads['MISC_COURSE_DETAILS']] = ''
    row[col_heads['EXP_JOB_CLASS_FUNCTION']] = ''
    row[col_heads['EXP_JOB_CLASS_ALT_RESP']] = ''
    row[col_heads['EXP_JOB_CLASS_TITLE']] = ''
    row[col_heads['REQUIREMENT_SET_ID']] = ''
    row[col_heads['REQUIREMENT_SUBSET_ID']] = ''
    row[col_heads['AND_OR']] = ''

    return (row)

In [None]:
def fill_row(title,content):
    #print ('content', content)
    #print ('row[col_heads[title]] ', row[col_heads[title]] )
    row[col_heads[title]] = content
    return True

def clr_and_fill_row(title,content):
    row[col_heads[title]] = content
    return 
 

In [None]:


def clean_txt (line):
    #specific non-standard usage with modifications to some text to make searches work
    #hard coded: too few examples to generalise
    line = line.replace ('health and safety',' health & safety')
    line = line.replace ('Sr.','Senior')
    line = line.replace ('Pre-','Pre')
    line = line.replace ('Administratve','Administrative')
    
    line = line.replace ('Construction Maintenance Superintendent','Construction and Maintenance Superintendent')
    return line

def  convert_words_to_number (line):
# convert word numbers to numbers as strings
            
    oldline = line.replace ('-',' ')
    line =""
    for i, word in enumerate(oldline.split()):
        #print (word)
        word_l = word.lower()
        if word_l in numwords:
            #print ('numwords[word_l]',numwords[word_l])
            word = str (numwords[word_l])
        line = line + " " + word
    #print ('oldline',oldline)
    #print ('line',line)
    return line

#the empty alternative in the first group allows the index to start the line
num_index_pattern = r'^(.|)(\d)(\.)(.*)'
char_index_pattern = r'(?i)^(.||.\()([a-z])(\.|\))(.*)'
edpattern = '(apprenticeship|high school, university, college, trade or technical school|trade school or college|college or trade school|college or university or trade school|college or university|college|university|American Bar Association accredited law school)'


def requirement_and_sub(line, row):
    #print ('set line', line)
    if  re.search(num_index_pattern,line):
        curr_req = re.search(num_index_pattern,line).group(2)
        #row[col_heads['REQUIREMENT_SET_ID']] = int(curr_req)
        row[col_heads['REQUIREMENT_SET_ID']] = curr_req
    elif  re.search(char_index_pattern,line):
        curr_reqsub = re.search(char_index_pattern,line).group(2)
        row[col_heads['REQUIREMENT_SUBSET_ID']] = curr_reqsub.upper()
    else:   
        row[col_heads['REQUIREMENT_SET_ID']] = ''
        row[col_heads['REQUIREMENT_SUBSET_ID']] = ''
    #match "or"/"and" as whole words so that endings such as "supervisor" are not caught
    if re.search(r'\bor$', line):
         row[col_heads['AND_OR']] = 'OR'
    if re.search(r'\band$', line):
         row[col_heads['AND_OR']] = 'AND'
       
        
    return row

def education(line, row):
    semester_pattern ='(\d{1,3})(.)(semester)'
    coursework_pattern ='(\d{1,3} hours)( of course work)(.*?)(;|\.)'
    qtr_pattern ='(\d{1,3})(.)(quarter units)(.*?)( from|in|with|of|at)(.*)(;|\.)'
    courses_pattern = '(?i)(completion of )(\d{1,3})(.)(course)'
    pattern2 = 'college|university'
    major_pattern = '(?i)(major in|degree*.?in|degree from|college in)(.*?)(;|including|in |and |from|\.)'
    grad_pattern = '(?i)(graduation from)'
    pattern3 = 'high school or G.E.D. equivalent'
    course_len = ''
    exp_found = False
    if (re.search(major_pattern, line) and not re.search('may be substituted', line)):
        exp_found = fill_row('EDUCATION_MAJOR',(re.search(major_pattern, line).group(2).replace(", ", "|").replace(" or ", "|")))
        row[col_heads['DEGREE_REQ']]  = 'YES'
    if (re.search(grad_pattern, line) and not re.search('may be substituted', line)):
        row[col_heads['DEGREE_REQ']]  = 'YES'

    if (re.search(edpattern, line)):
        clr_and_fill_row('SCHOOL_TYPE', re.search(edpattern, line).group(0).upper())
        course_len_pattern ='(\d{1,3})(.)(year)(.{1,5})(college or university or trade school|college or university|college|university)'
        if (re.search(course_len_pattern, line)):
            #exp_found = fill_row('EDUCATION_YEARS',int(re.search(course_len_pattern, line).group(1)))
            exp_found = fill_row('EDUCATION_YEARS',(re.search(course_len_pattern, line).group(1)))

    if (re.search(semester_pattern, line)):
        course_len = re.search(semester_pattern, line).group(1)+ 'S'
    if (re.search(qtr_pattern, line)):
        course_len = course_len + re.search(qtr_pattern, line).group(1)+ 'Q'
        exp_found = fill_row('COURSE_SUBJECT',re.search(qtr_pattern, line).group(6))
        exp_found = fill_row('COURSE_LENGTH', course_len)
    if (re.search(courses_pattern, line)):
        exp_found = fill_row('COURSE_COUNT',re.search(courses_pattern, line).group(2))

    if (re.search(coursework_pattern, line)):
        exp_found = fill_row('COURSE_SUBJECT',re.search(coursework_pattern, line).group(2) + re.search(coursework_pattern, line).group(3))
        exp_found = fill_row('COURSE_LENGTH',re.search(coursework_pattern, line).group(1))

        
#school                    
    if (re.search(pattern3, line)):
        exp_found = fill_row('SCHOOL_TYPE', pattern3.upper())
    return row,exp_found

def record_apprenticeship (line, row):
    app_pattern ='(?i)(completion of)(.*?)(apprenticeship)(.*?)(;|\.|and)'
    if (re.search(app_pattern, line)):
        row[col_heads['EDUCATION_MAJOR']] = row[col_heads['EDUCATION_MAJOR']]  + " " + re.search(app_pattern, line).group(0)
    
    return row

def apply_certificate(certificate):
    exp_found = False
    if not (re.search("at the time", certificate)) and not (re.search("medical", certificate)) \
    and not (re.search("conducting disciplinary", certificate)) and not (re.search("certificates of occupancy,", certificate))\
    and not (re.search("business tax certificates", certificate)) and not (re.search("submit verification", certificate)) \
    and not (re.search("issuance", certificate)):

        exp_found = True
        row[col_heads['VOCATIONAL_QUAL']] = row[col_heads['VOCATIONAL_QUAL']]  + " " + certificate 
    #    if row[col_heads['EDUCATION_MAJOR']] =='':
        row[col_heads['EDUCATION_MAJOR']] = row[col_heads['EDUCATION_MAJOR']]  + " " + row[col_heads['VOCATIONAL_QUAL']] 
        row[col_heads['SCHOOL_TYPE']] = 'CERTIFICATION' 
    return exp_found

def cert_and_completion(line, row):
#certification
    exp_found = False
#(?s:.*) forces regex to find last match

    #possession or completion of a certificate
    cert_pattern = r'(?i)(possession|completion of)(.*?)(certificate)(.*?)(;|\.|and)'
    cert_pattern1 = r'(?i)(certified)(.*?)(certification council)'
    cert_pattern2 = r'(?i)(and|or)(.*?)(certification)(.*?)(;|\.)'
    cert_pattern3 = r'(?i)(?s:.*)(\.|;|and |or )(.*?)(certificate)(.*?)(;|\.|and)'

    if (re.search(cert_pattern, line)):
        exp_found = apply_certificate(re.search(cert_pattern, line).group(1) +\
                                                    re.search(cert_pattern, line).group(2) + \
                                                    re.search(cert_pattern, line).group(3) + re.search(cert_pattern, line).group(4))
    elif (re.search(cert_pattern1, line)):
        exp_found = apply_certificate(re.search(cert_pattern1, line).group(0))

    elif (re.search(cert_pattern2, line)):
        exp_found = apply_certificate(re.search(cert_pattern2, line).group(2)+re.search(cert_pattern2, line).group(3)\
                                      +re.search(cert_pattern2, line).group(4))

    elif (re.search(cert_pattern3, line)):
        exp_found = apply_certificate( re.search(cert_pattern3, line).group(2) +\
                                                    re.search(cert_pattern3, line).group(3) + \
                                                    re.search(cert_pattern3, line).group(4))


#completion of miscellaneous course and certification requirements that cannot readily be converted to school type and EDUCATION_MAJOR
    comp_pattern ='(?i)(completion|attainment)(.*?)(in |of )(.*?)(;|\.)'
    if (re.search(comp_pattern, line))  and not re.search(edpattern, line): 
        exp_found = fill_row('VOCATIONAL_QUAL', (re.search(comp_pattern, line)).group(0))
        #print ('line misc 1',line)
    if (re.search(comp_pattern, line))  and  re.search('high school, university, college, trade or technical school|community college or trade school', line):
        exp_found = fill_row('VOCATIONAL_QUAL', (re.search(comp_pattern, line)).group(0))
        #print ('line misc 2',line)
  
    return row,exp_found

def experience_len(line, row):
    exp_found = False

    pattern ='(full.time|part.time|years as a|\d{0,1}.\d{2,6} hours)(.*)(|;|\.)'
    if (re.search(pattern, line)):
        if re.search(pattern, line).group(1) != 'years as a':
            if not re.search("course work", re.search(pattern, line).group(2)):
                row[col_heads['FULL_TIME_PART_TIME']] = re.search(pattern, line).group(1).upper()
                #print ('in ftpt',line)
        month_pattern = '(\d{1,3})(.)(month)'
        year_pattern = '(\d{1,3})(.)(year)(.{1,5})(full.time|part.time|as a)'
        if (re.search(month_pattern, line)):
            fract_yr = (int(re.search(month_pattern, line).group(1)))/12
            fract_yr = str(fract_yr)
            exp_found = fill_row('EXPERIENCE_LENGTH', fract_yr)
            exp_found = fill_row('EXP_JOB_CLASS_FUNCTION', re.search(pattern, line).group(2))
        if (re.search(year_pattern, line)):
            #yr = (int(re.search(year_pattern, line).group(1)))
            yr = str(re.search(year_pattern, line).group(1))
            #print ('yr exp A',yr,re.search(pattern, line).group(1))
            exp_found = fill_row('EXPERIENCE_LENGTH', yr)
            exp_found = fill_row('EXP_JOB_CLASS_FUNCTION',  re.search(pattern, line).group(2))
    year_pattern2 ='(\d{1,3})(.)(years of experience)(.*)(|;|\.|and|,)'
    if (re.search(year_pattern2, line)):
        #yr = (int(re.search(year_pattern2, line).group(1)))
        yr = str(re.search(year_pattern2, line).group(1))
        #print ('yr exp B',yr,re.search(year_pattern2, line).group(1))
        #print ('yr exp lne',line)
        exp_found = fill_row('EXPERIENCE_LENGTH', yr)
        exp_found = fill_row('EXP_JOB_CLASS_FUNCTION',  re.search(year_pattern2, line).group(3) + re.search(year_pattern2, line).group(4)  )

    return row,exp_found

def catch_all (line,row):
#strip the req id and then print without mod
    if  re.search(num_index_pattern,line):
            line = re.search(num_index_pattern,line).group(4)
    if  re.search(char_index_pattern,line):
            line = re.search(char_index_pattern,line).group(4)
    exp_found = fill_row('EXP_JOB_CLASS_FUNCTION', line)
    return row

def experience(line, row,exp,last_exp):              
    row[col_heads['EXP_JOB_CLASS_TITLE']] = exp
    job_pattern =last_exp.lower()+'(.*)'+'(;|\.|:|,|\r|\n)'
    #print('exp',line)
    if (re.search(job_pattern, line.lower())):
        alt = re.search(job_pattern, line.lower()).group(1)
        or_pattern = '(.{1,2})(or.)'
        if (re.match(or_pattern, alt)):
            #print('exp2',line)
            exp_found = fill_row('EXP_JOB_CLASS_ALT_RESP', re.search(job_pattern, line.lower()).group(1))
            exp_found = clr_and_fill_row('EXP_JOB_CLASS_FUNCTION', '')
        else:
            class_pattern = '(.*)(in a class|at the level|performing the duties|paid experience as)(.*)(;|\.|:|,|\r|\n)'
            if re.search(class_pattern,line.lower()):
                #print('exp3',line)

                exp_found = clr_and_fill_row('EXP_JOB_CLASS_FUNCTION', '')
                alt_class = "or " + re.search(class_pattern, line.lower()).group(2) +re.search(class_pattern, line.lower()).group(3)+ re.search(class_pattern, line.lower()).group(4)
                exp_found = fill_row('EXP_JOB_CLASS_ALT_RESP', alt_class)
            else:
                job_pattern2 ='(experience)(.*)' +last_exp.lower()+'(.*)'+'(;|\.|:|,|\r|\n)'
                if re.search(job_pattern2, line.lower()):
                    exp_found = fill_row('EXP_JOB_CLASS_FUNCTION', re.search(job_pattern2, line.lower()).group(0))
                else:
                    exp_found = fill_row('EXP_JOB_CLASS_FUNCTION', re.search(job_pattern, line.lower()).group(1))
    return row
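
To show how the requirement set and subset IDs are read from the start of a line, the sketch below applies the two index patterns to a few made-up requirement lines. The helper `requirement_ids` is an assumption for illustration. It returns the IDs instead of writing them to `row`.

```python
import re

# The empty alternative in the first group lets the index start the line,
# while "." allows one leading character such as a space.
NUM_INDEX_PATTERN = r'^(.|)(\d)(\.)(.*)'
CHAR_INDEX_PATTERN = r'(?i)^(.||.\()([a-z])(\.|\))(.*)'

def requirement_ids(line):
    """Return (set_id, subset_id) for one requirement line, '' where absent."""
    num = re.search(NUM_INDEX_PATTERN, line)
    if num:
        return num.group(2), ''
    char = re.search(CHAR_INDEX_PATTERN, line)
    if char:
        return '', char.group(2).upper()
    return '', ''

# Made-up sample lines.
for sample in ["1. Two years of full-time paid experience; or",
               " 2. Graduation from a recognized college",
               "a. A degree in accounting; and",
               "(b) A valid certificate",
               "Completion of a training course."]:
    print(requirement_ids(sample), '|', sample)
```

Note that `(\d)` captures a single digit, so a line beginning `10.` would be read as set `0` by these patterns.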



In [None]:
def process_state (state,body,line,row,job,data_list):
    #print ('state',state)
    global eda_row
    if state == 'duties':
        body += line
        row[col_heads['JOB_DUTIES']] = body.replace('\n','')
    if state == 'annualsalary':
        line = line.replace (',','').replace(' to ','-')
        body += line
        entry_salaries (body,row)
    if state == 'notes':
        row[col_heads['NOTES']] += line
    if state == 'selection':
        row[col_heads['SELECTION_PROCESS']] += line        
    if state == 'requirements':
        exp_found = False
        line = clean_txt (line)       
        if (line !=''):
            sub_pattern = 'substitut'
            #if substitute is found in a requirement, it refers to substituting requirements
            #and is not a new requirement. Hence the information is recorded but not processed
            if (re.search(sub_pattern, line)):
                    exp_found = fill_row('EXP_JOB_CLASS_FUNCTION', line)
            else:
                line = convert_words_to_number (line)
                row = requirement_and_sub(line, row)
                row,exp_found =  education(line, row)
                row,exp_found = cert_and_completion(line, row)
                row = record_apprenticeship (line, row)
                row,exp_found = experience_len(line, row)
                exp,last_exp = find_experience (line,job)
                if exp == '' and not exp_found:
                    catch_all (line,row)
                else:
                    if (exp):
                        row = experience(line, row,exp,last_exp)
                    #Specials where job requirement is defined in terms of the job itself
                    if re.search('Zoo Registrar',line):
                        exp_found = fill_row('EXP_JOB_CLASS_ALT_RESP','performing the duties of a Zoo Registrar')
                        exp_found = clr_and_fill_row('EXP_JOB_CLASS_FUNCTION', '')
                    if re.search('Administratve Hearing Examiner',line):
                        exp_found = fill_row('EXP_JOB_CLASS_ALT_RESP','within the past two years from the date of filing as an exempt or contract Administratve Hearing Examiner')
                        exp_found = clr_and_fill_row('EXP_JOB_CLASS_FUNCTION', '')
   
            #one row is allocated for each requirement            
            save_row = row.copy()
            data_list.append(save_row)
            #however for the EDA it is convenient to have one row per job
            len_row = len(row)
            for i in range(len_row):
                if eda_row[i] == '':
                     eda_row[i] = row[i]
                elif  row[i] not in eda_row[i]:
                    eda_row[i] = eda_row[i] + ' ' + row[i]
            row = req_clr (row)
    return row, body,data_list
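
The "one row per job" merge at the end of `process_state` can be sketched on toy data as follows. The function name `merge_into` and the three-column rows are assumptions for illustration.

```python
def merge_into(eda_row, row):
    """Fold one requirement row into the per-job row, column by column."""
    for i, value in enumerate(row):
        if eda_row[i] == '':
            # Nothing recorded yet for this column: take the value as is.
            eda_row[i] = value
        elif value not in eda_row[i]:
            # A new value: append it, separated by a space.
            eda_row[i] = eda_row[i] + ' ' + value
    return eda_row

# Toy columns: job title, years of education, school type.
requirement_rows = [
    ['Clerk', '2', 'COLLEGE'],
    ['Clerk', '4', ''],
]
eda_row = [''] * 3
for row in requirement_rows:
    merge_into(eda_row, row)
print(eda_row)
```

Because `not in` is a substring test on strings, repeated and empty values are skipped, but so is a value such as `'2'` when `'12'` is already present.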

In [None]:
#each sub requirement generates a new row
#to fulfill the requirements of the sample file,when a file is finished, the last information found 
#must be "backfilled" into the earlier rows

def backfill (data_list, start_job_index, row):
    
    last_job_index = len(data_list)
    for i in range ( start_job_index,last_job_index):
        data_list[i][col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']]
        data_list[i][col_heads['DRIVERS_LICENSE_REQ']] = row[col_heads['DRIVERS_LICENSE_REQ']]
        data_list[i][col_heads['DRIV_LIC_TYPE']] = row[col_heads['DRIV_LIC_TYPE']]
        data_list[i][col_heads['ADDTL_LIC']] = row[col_heads['ADDTL_LIC']]
        data_list[i][col_heads['DEGREE_REQ']] = row[col_heads['DEGREE_REQ']]

    data_list[start_job_index][col_heads['NOTES']] = row[col_heads['NOTES']]
    data_list[start_job_index][col_heads['SELECTION_PROCESS']] = row[col_heads['SELECTION_PROCESS']] 

    eda_row[col_heads['EXAM_TYPE']] = row[col_heads['EXAM_TYPE']]
    eda_row[col_heads['DRIVERS_LICENSE_REQ']] = row[col_heads['DRIVERS_LICENSE_REQ']]
    eda_row[col_heads['DRIV_LIC_TYPE']]= row[col_heads['DRIV_LIC_TYPE']]
    eda_row[col_heads['ADDTL_LIC']] = row[col_heads['ADDTL_LIC']]
    eda_row[col_heads['DEGREE_REQ']] = row[col_heads['DEGREE_REQ']]
   #print ('eda_row',eda_row)
    save_row = eda_row.copy()
    eda_data_list.append(save_row)

    return last_job_index
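
A minimal sketch of the backfill step on toy data is shown below. The reduced `col_heads` mapping, the `JOB_LEVEL_FIELDS` list and the function name `backfill_rows` are assumptions for illustration. The notebook version also copies the notes and selection process to the first row of the job and builds the EDA row.

```python
# Toy column layout; the notebook defines the full col_heads elsewhere.
col_heads = {'JOB_CLASS_TITLE': 0, 'REQUIREMENT_SET_ID': 1, 'EXAM_TYPE': 2, 'DEGREE_REQ': 3}

# Fields known only once the whole bulletin has been read.
JOB_LEVEL_FIELDS = ['EXAM_TYPE', 'DEGREE_REQ']

def backfill_rows(data_list, start_job_index, row):
    """Copy job-level fields from the final row into every row of the current job."""
    last_job_index = len(data_list)
    for i in range(start_job_index, last_job_index):
        for field in JOB_LEVEL_FIELDS:
            data_list[i][col_heads[field]] = row[col_heads[field]]
    # The next job starts where this one ended.
    return last_job_index

data_list = [
    ['Typist', '1', 'OPEN', ''],   # earlier job, already complete
    ['Clerk', '1', '', ''],        # current job, first requirement
    ['Clerk', '2', '', 'YES'],     # current job, second requirement
]
final_row = ['Clerk', '', 'INT_DEPT_PROM', 'YES']
next_start = backfill_rows(data_list, 1, final_row)
print(data_list, next_start)
```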

In [None]:
def line_interp (line, title_line, row, state, body,level,job,data_list):
    class_pattern ='(Code:.*)(\d{4})(.*)'
    sub_pattern = 'substitut'

#save the job title and report if inconsistent    
    if line != '' and title_line:
        j = line.replace('\tREVISED','')
        job= j.replace('\n','').lower().title()
        #in some cases the first line of the file is not the job title
        if job.upper() != 'CAMPUS INTERVIEWS ONLY':
            row[col_heads['JOB_CLASS_TITLE']] = job
            title_line = False
        else:
            print ()
            print ('CAMPUS INTERVIEWS ONLY at file top',row[col_heads['FILE_NAME']] )
            print ()
            
#save the job class
    if (re.search(class_pattern, line)):
        if row[col_heads['JOB_CLASS_NO']]  =='':
            
            
            job_class = re.search(class_pattern, line).group(2)
            if re.search (job_class,row[col_heads['FILE_NAME']]):            
                #only save the first instance because
                #SENIOR ELECTRIC SERVICE REPRESENTATIVE has wrong code at btm of file
                row[col_heads['JOB_CLASS_NO']] = re.search(class_pattern, line).group(2)
            else:
                print ()
                print (row[col_heads['FILE_NAME']] ,'JOB_CLASS mismatch')
                print ()
                

    elif "Open Date:" in line:
        row[col_heads['OPEN_DATE']]  = line.split("Open Date:")[1].split("(")[0].strip().replace ('-','/')

#use the upper case headings to define the current state
#process the state on reaching the next heading
    elif (line.isupper() and "$" not in line):
        state = ''
        body =''
    if state != '':
        row, body,data_list = process_state (state,body,line,row,job,data_list)
    elif re.search('DUTIES',line):
        state = 'duties'
    elif re.search('REQUIREMENT',line):
        state = 'requirements'
    elif re.search('(ANNUAL SALARY)|(ANNUALSALARY)',line):
        state = 'annualsalary'
    elif re.search('NOTE',line):
        state = 'notes'
    elif re.search('SELECTION',line):
        state = 'selection'

#look for licence and degree type anywhere in the file as the information is dispersed        
    driver_licence(line)
    degree_req(line)
    level = examination_type(line,level)
    return line, title_line, row, state, body,level, job,data_list
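
The heading-driven state handling in `line_interp` can be sketched in simplified form as below. The function `split_sections`, the `HEADINGS` list and the sample bulletin are assumptions for illustration. The sketch only groups lines by section and leaves out the title, class code and open date handling.

```python
import re

# Heading patterns in the order line_interp tests them.
HEADINGS = [
    ('DUTIES', 'duties'),
    ('REQUIREMENT', 'requirements'),
    ('(ANNUAL SALARY)|(ANNUALSALARY)', 'annualsalary'),
    ('NOTE', 'notes'),
    ('SELECTION', 'selection'),
]

def split_sections(lines):
    """Group bulletin lines under the most recent upper-case heading."""
    sections = {}
    state = ''
    for line in lines:
        line = line.strip()
        if line == '':
            continue
        if line.isupper() and '$' not in line:
            # An upper-case line ends the current section and may start a new one.
            state = ''
            for pattern, name in HEADINGS:
                if re.search(pattern, line):
                    state = name
                    break
        elif state != '':
            sections.setdefault(state, []).append(line)
    return sections

# Made-up bulletin text.
sample = [
    "ACCOUNTANT",
    "Class Code: 1513",
    "ANNUAL SALARY",
    "$49,903 to $72,996",
    "DUTIES",
    "An Accountant prepares financial reports.",
    "REQUIREMENTS",
    "1. A bachelor's degree; or",
    "2. Two years of experience.",
]
sections = split_sections(sample)
print(sections)
```

The `'$' not in line` test matters because a salary line made up only of digits and upper-case letters would otherwise be treated as a heading.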

In [None]:
def chk_file_valid(filename):
    #files are marked manually as invalid where appropriate
    file_valid = True
    if filename == 'ANIMAL CARE TECHNICIAN SUPERVISOR 4313 122118.txt':
        #excluded because bulletin text is wrong
        file_valid = False
    if filename == 'WASTEWATER COLLECTION SUPERVISOR 4113 121616.txt':
        #excluded because bulletin text is wrong
        file_valid = False
    if filename == 'SENIOR EXAMINER OF QUESTIONED DOCUMENTS 3231 072216 REVISED 072716.txt':
        #excluded because bulletin text is wrong
        file_valid = False
    if filename == 'SENIOR UTILITY SERVICES SPECIALIST 3753 121815 (1).txt':
        #excluded because a newer bulletin exists
        file_valid = False
    if filename == 'CHIEF CLERK POLICE 1219 061215.txt':
        #excluded because a newer bulletin exists
        file_valid = False

    if not file_valid:
        print ()
        print ('invalid', filename)
        print()
    return file_valid
    

## Cell that produces the structured data file<a id='topcell'></a>

A '.' is printed for each job bulletin reviewed.

Job bulletin inconsistencies are reported, including:

**double requirement found**, which means that the terminator `; or` was found **within** a requirement line. An attempt is made to handle this, but it would be better if the file followed the template guidelines.

**invalid files**, which are excluded because they are either duplicates or have contents inconsistent with the title.

**CAMPUS INTERVIEWS ONLY at file top**, where the job title is not the first line.

**mismatch between class number in the file and in the filename**. Sometimes the content is still correct. When it is not, the file is marked manually as invalid in chk_file_valid.
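
The double requirement handling described above can be sketched as follows. The function name `split_double_requirement` and the sample lines are assumptions for illustration. The pattern and the 40-character threshold are those used in the cell below, which also applies the split only while the parser is in the requirements section.

```python
import re

DOUBLE_REQ_PATTERN = r'(.*?)(; or)(.*)'

def split_double_requirement(line, min_tail=40):
    """Split a line at its first '; or' when a substantial alternative follows it."""
    match = re.search(DOUBLE_REQ_PATTERN, line)
    if match and len(match.group(3)) > min_tail:
        # Keep the terminator with the first part so it still ends in '; or'.
        return [match.group(1) + match.group(2), match.group(3)]
    return [line]

# Made-up sample lines.
double = ("1. Two years of full-time paid experience as a Clerk; or one year of "
          "full-time paid experience as a Senior Clerk Typist.")
single = "1. Two years of full-time paid experience as a Clerk; or"
print(split_double_requirement(double))
print(split_double_requirement(single))
```

Each part of a split line is then interpreted as its own requirement, so a double requirement produces two rows in the structured file.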






In [None]:
bulletin_dir = "../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Job Bulletins"
data_list = []
eda_data_list = []
body = ''
state = ''
job = ''
level = 0
cnt = 0
start_job_index = 0
global eda_row

for filename in os.listdir(bulletin_dir):
     if cnt >-1 and cnt <700:   #use this line to control number of files reviewed during testing
        row = [''] * len_ch
        eda_row= [''] * len_ch
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            row[col_heads['FILE_NAME']] = filename
            print ('. ',end="")
            title_line = True
            file_valid = chk_file_valid(filename)
            if (file_valid):
                for index,line in enumerate(f.readlines()):
                    line = line.rstrip().lstrip()
                    if (line !='' and line != "OR" ):
                        pattern = '(.*?)(; or)(.*)' 
                        #sometimes a significant alternative is included within a requirement
                        if re.search(pattern, line) and(len(re.search(pattern, line).group(3))) > 40 and  state == 'requirements':
                            print () 
                            print ('double requirement found', filename)
                            print()   

                            _, title_line, row, state, body,level,job,data_list = \
                                    line_interp (re.search(pattern, line).group(1) + re.search(pattern, line).group(2), title_line, row, state, body,level,job,data_list)
                            _, title_line, row, state, body,level,job,data_list = \
                                    line_interp (re.search(pattern, line).group(3), title_line, row, state, body,level,job,data_list)
                        else:
                            line, title_line, row, state, body,level,job,data_list = \
                                line_interp (line, title_line, row, state, body,level,job,data_list)             
                start_job_index = backfill(data_list,start_job_index,row)
        cnt += 1

df_job_class = pd.DataFrame(data_list)
df_job_class.columns = column_heads
df_eda_exam = pd.DataFrame(eda_data_list)
df_eda_exam.columns = column_heads


In [None]:
#pd.options.display.max_colwidth = 200 ;-1
pd.set_option('max_colwidth', 700)
with pd.option_context("display.max_rows", 700): display (df_job_class)

In [None]:
#cannot sort job class
# df_job_class.sort_values('FILE_NAME', inplace = True)
# df_job_class = df_job_class.reset_index(drop=True)
# df_eda_exam.sort_values('FILE_NAME', inplace = True)
# df_eda_exam = df_eda_exam.reset_index(drop=True)


# df_job_class.head(640)

## Saving and reloading the raw data
### Saving

In [None]:
workingpath = ('../working/')

df_job_class.to_csv(workingpath + 'job_class.csv')
df_eda_exam.to_csv(workingpath + 'eda_exam.csv')

### Reloading

In [None]:
workingpath = ('../working/')
df_job_class = pd.read_csv(workingpath + 'job_class.csv', \
                                 converters={'JOB_CLASS_NO': str, 'REQUIREMENT_SET_ID': str, \
                                             'EXP_JOB_CLASS_TITLE': str},index_col=0)
df_eda_exam = pd.read_csv(workingpath + 'eda_exam.csv', \
                                 converters={'JOB_CLASS_NO': str, 'REQUIREMENT_SET_ID': str, 'EXP_JOB_CLASS_TITLE': str},index_col=0)
df_job_class = df_job_class.replace('',np.nan)
df_eda_exam = df_eda_exam.replace('',np.nan)

df_job_class.head(640)


In [None]:
include =['object', 'float', 'int'] 
df_eda_exam.describe( include = include)

## Data Clean and Validation



### The following cells test for those columns where all rows must contain the appropriate type.


In [None]:
print ('FILE_NAME:       ',df_eda_exam['FILE_NAME'].apply(type).eq(str).all())
print ('JOB_CLASS_TITLE: ',df_eda_exam['JOB_CLASS_TITLE'].apply(type).eq(str).all())
print ('JOB_CLASS_NO:    ',df_eda_exam['JOB_CLASS_NO'].apply(type).eq(str).all())
print ('JOB_DUTIES:      ',df_eda_exam['JOB_DUTIES'].apply(type).eq(str).all())
print ('DEGREE_REQ:      ',df_eda_exam['DEGREE_REQ'].apply(type).eq(str).all())
print ('EXAM_TYPE:       ',df_eda_exam['EXAM_TYPE'].apply(type).eq(str).all())



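At this stage JOB_CLASS_NO, JOB_DUTIES and DEGREE_REQ fail this test because they still contain missing values (NaN). The cells further down fill these gaps.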
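
A minimal sketch of how the offending rows could be counted per column, rather than reported as a single True/False. The helper `non_string_rows` and the toy frame are illustrative assumptions and are not part of the kernel itself.

```python
import numpy as np
import pandas as pd

def non_string_rows(df, columns):
    """Return, for each column, the number of rows that do not hold a str."""
    return {col: int((~df[col].apply(lambda v: isinstance(v, str))).sum())
            for col in columns}

# Toy frame standing in for df_eda_exam (values are made up for illustration).
toy = pd.DataFrame({
    'FILE_NAME': ['a.txt', 'b.txt', 'c.txt'],
    'JOB_CLASS_NO': ['1234', np.nan, '5678'],
    'JOB_DUTIES': [np.nan, np.nan, 'Does things'],
})

failures = non_string_rows(toy, ['FILE_NAME', 'JOB_CLASS_NO', 'JOB_DUTIES'])
print(failures)
```

A count per column makes it easier to confirm that a later fill step has dealt with every missing value.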

### The following cells test for those columns where all rows must contain the appropriate type or NaN.

In [None]:
#NaN is itself a float, so .all() checks that every row is either a float or NaN
print ('EDUCATION_YEARS:   ',df_eda_exam['EDUCATION_YEARS'].apply(type).eq(float).all())
print ('EXPERIENCE_LENGTH: ',df_eda_exam['EXPERIENCE_LENGTH'].apply(type).eq(float).all())

print ('OPEN_DATE: ',df_eda_exam['OPEN_DATE'].dtype)



### JOB_CLASS_NO

In [None]:
#The Describe function output above indicates that 1 job bulletin has no Job_Class
#This cell showed it was:
#Vocational Worker DEPARTMENT OF PUBLIC WORKS.txt
# df_eda_exam['JOB_CLASS_NO']= np.where(pd.isna(df_eda_exam['JOB_CLASS_NO']), 
#                             '    ', 
#                             df_eda_exam['JOB_CLASS_NO'])
# df_eda_exam.sort_values('JOB_CLASS_NO',  inplace = True)
# df_eda_exam.head(10)

### Missing JOB_CLASS_NO replaced with '0000'

In [None]:
df_job_class['JOB_CLASS_NO']= np.where(pd.isna(df_job_class['JOB_CLASS_NO']), 
                            '0000', 
                            df_job_class['JOB_CLASS_NO'])
df_eda_exam['JOB_CLASS_NO']= np.where(pd.isna(df_eda_exam['JOB_CLASS_NO']), 
                            '0000', 
                            df_eda_exam['JOB_CLASS_NO'])
df_eda_exam.sort_values('JOB_CLASS_NO',  inplace = True)

df_eda_exam.head(2)


### Missing Duties replaced with 'NOT PROVIDED'

In [None]:
#The Describe function output above shows that 7 job bulletins have no duty
#This cell finds the 7:
#ENGINEER OF FIRE DEPARTMENT
#Vocational Worker DEPARTMENT OF PUBLIC WORKS
#FIRE ASSISTANT CHIEF 
#FIRE BATTALION CHIEF
#FIRE HELICOPTER PILOT 
#FIRE INSPECTOR 
#APPARATUS OPERATOR 

# df_eda_exam['JOB_DUTIES']= np.where(pd.isna(df_eda_exam['JOB_DUTIES']), 
#                             '    ', 
#                             df_eda_exam['JOB_DUTIES'])
# df_eda_exam.sort_values('JOB_DUTIES',  inplace = True)
# df_eda_exam.head(10)

In [None]:
# Replace missing duties NOT PROVIDED
df_job_class['JOB_DUTIES']= np.where(pd.isna(df_job_class['JOB_DUTIES']), 
                            'NOT PROVIDED', 
                            df_job_class['JOB_DUTIES'])

#df_job_class.head(1)


In [None]:

df_eda_exam['JOB_DUTIES']= np.where(pd.isna(df_eda_exam['JOB_DUTIES']), 
                            'NOT PROVIDED', 
                            df_eda_exam['JOB_DUTIES'])

df_eda_exam.head(2)


### For later EDA it will be useful to know which jobs require no DRIV_LIC_TYPE

This is a more specialised, job-related license that is associated with traditionally male-oriented jobs.

In [None]:
df_job_class['DRIV_LIC_TYPE']= np.where(pd.isna(df_job_class['DRIV_LIC_TYPE']), 
                            'NONE', 
                            df_job_class['DRIV_LIC_TYPE'])
#print (df_job_class['DEGREE_REQ'].nunique())
df_job_class.head(20)
df_eda_exam['DRIV_LIC_TYPE']= np.where(pd.isna(df_eda_exam['DRIV_LIC_TYPE']), 
                            'NONE', 
                            df_eda_exam['DRIV_LIC_TYPE'])
#print (df_eda_exam['DEGREE_REQ'].nunique())
df_eda_exam.head(20)

### For later EDA it will be useful to know which jobs require no SCHOOL_TYPE

In [None]:
df_job_class['SCHOOL_TYPE']= np.where(pd.isna(df_job_class['SCHOOL_TYPE']), 
                            'NONE', 
                            df_job_class['SCHOOL_TYPE'])
#print (df_job_class['DEGREE_REQ'].nunique())
df_job_class.head(20)
df_eda_exam['SCHOOL_TYPE']= np.where(pd.isna(df_eda_exam['SCHOOL_TYPE']), 
                            'NONE', 
                            df_eda_exam['SCHOOL_TYPE'])
#print (df_eda_exam['DEGREE_REQ'].nunique())
df_eda_exam.head(20)

### For later EDA it will be useful to know which jobs require a degree or previous experience that required a degree

In [None]:
df_job_class['DEGREE_REQ']= np.where(pd.isna(df_job_class['DEGREE_REQ']), 
                            'NO', 
                            df_job_class['DEGREE_REQ'])
#print (df_job_class['DEGREE_REQ'].nunique())
df_job_class.head(20)
df_eda_exam['DEGREE_REQ']= np.where(pd.isna(df_eda_exam['DEGREE_REQ']), 
                            'NO', 
                            df_eda_exam['DEGREE_REQ'])
#print (df_eda_exam['DEGREE_REQ'].nunique())
df_eda_exam.head(20)

In [None]:
#types not right yet
df_job_class["OPEN_DATE"] = df_job_class["OPEN_DATE"].astype('datetime64[ns]')
df_eda_exam["OPEN_DATE"] = df_eda_exam["OPEN_DATE"].astype('datetime64[ns]')


In [None]:
print ('FILE_NAME:       ',df_eda_exam['FILE_NAME'].apply(type).eq(str).all())
print ('JOB_CLASS_TITLE: ',df_eda_exam['JOB_CLASS_TITLE'].apply(type).eq(str).all())
print ('JOB_CLASS_NO:    ',df_eda_exam['JOB_CLASS_NO'].apply(type).eq(str).all())
print ('JOB_DUTIES:      ',df_eda_exam['JOB_DUTIES'].apply(type).eq(str).all())
print ('DEGREE_REQ:      ',df_eda_exam['DEGREE_REQ'].apply(type).eq(str).all())
print ('EXAM_TYPE:       ',df_eda_exam['EXAM_TYPE'].apply(type).eq(str).all())

#NaN is itself a float, so .all() checks that every row is either a float or NaN
print ('EDUCATION_YEARS:   ',df_eda_exam['EDUCATION_YEARS'].apply(type).eq(float).all())
print ('EXPERIENCE_LENGTH: ',df_eda_exam['EXPERIENCE_LENGTH'].apply(type).eq(float).all())

print ('OPEN_DATE: ',df_eda_exam['OPEN_DATE'].dtype)





## This is the Structured Data File<a id='sdf'></a>

This is the structured data file following production, cleaning and validation.


In [None]:
#to print out the full file
pd.options.display.max_colwidth = 200
with pd.option_context("display.max_rows", 2000): display (df_job_class)

## EDA Code

The df_eda_exam dataframe has one row for each job class and is useful for analysis.<a id='eda_exam'></a>


In [None]:
include =['object', 'float', 'int'] 
df_eda_exam.describe( include = include)

In [None]:
pd.options.display.max_colwidth = 200
with pd.option_context("display.max_rows", 2000): display (df_eda_exam)

### Reviewing the data 

#### Requirements

Requirement lists are complicated with one job bulletin providing 8 main options and another providing 10 sub options.

Over-specification of requirements is off-putting for some underrepresented groups, and a less formal, more people-focused approach should help. For instance, the Real Estate Associate requirement is currently:

*1. Graduation from an accredited four-year college or university and successful completion of at least:*
*a) six semester or eight quarter units of college level courses in real estate from an accredited college or university; or*
*b) 60 hours of course work from a recognized professional real estate/right-of-way association; or*
*2. Two years of full-time paid experience as a Real Estate Trainee for the City of Los Angeles and successful completion of at least:*
*a) six semester or eight quarter units of college level courses in real estate from an accredited college or university; or*
*b) 48 hours of course work from a recognized professional real estate/right-of-way association.*

*One year of full-time paid experience in performing right-of-way work; appraising the market value of real property; managing commercial or industrial real property; or negotiating on behalf of a large organization or governmental agency for the acquisition, sale, or lease of real property rights may be substituted for up to two years of college education (i.e., 30 semester/45 quarter units = 1 year of college education) lacking on a year-for-year basis, but may not be substituted for the required courses in real estate.*

This could be loosened to say:

1. Graduation from an accredited four-year college/university and successful completion of some real estate course work; or
2. Two years of full-time paid experience working in real estate and completion of some real estate course work.





In [None]:
df_exam_group = df_job_class.groupby('REQUIREMENT_SET_ID').count()
df_exam_group.head(20)

In [None]:
df_exam_group = df_job_class.groupby('REQUIREMENT_SUBSET_ID').count()
df_exam_group.head(20)

In [None]:
df_exam_group = df_job_class.groupby('AND_OR').count()
df_exam_group.head(20)

#### SCHOOL_TYPE

In [None]:
df_job_class_group = df_job_class.groupby('SCHOOL_TYPE').count()

df_job_class_group.head(30)

In [None]:
df_exam_group = df_eda_exam.groupby('SCHOOL_TYPE').count()

df_exam_group.head(30)

In [None]:
df_job_class_f =df_job_class['SCHOOL_TYPE'].value_counts().reset_index()
df_job_class_f['index']=df_job_class_f['index'].apply(lambda x : x.title())
df_job_class_f=df_job_class_f.groupby('index',as_index=False).agg('sum')
labels=df_job_class_f['index']
sizes=df_job_class_f['SCHOOL_TYPE']
plt.figure(figsize=(5,7))
plt.pie(sizes,labels=labels)
plt.gca().axis('equal')
plt.title('SCHOOL_TYPE including apprenticeship and certification' )
plt.show()

#### EDUCATION_MAJOR

In [None]:
df_exam_group = df_job_class.groupby('EDUCATION_MAJOR').count()

with pd.option_context("display.max_rows", 2000): display (df_exam_group)

#### FULL_TIME_PART_TIME

The majority of the experience required by the job requirements is for FULL_TIME.

Some experience is measured in hours, eg:

ADMINISTRATIVE HEARING EXAMINER
520 hours of paid experience with the City of Los Angeles within the past two years from the date of filing as an exempt or contract Administratve Hearing Examiner.


In [None]:
df_exam_group = df_job_class.groupby('FULL_TIME_PART_TIME').count()

df_exam_group.head(20)

#### Course Information

In [None]:
df_eda_exam_group = df_eda_exam.groupby('COURSE_COUNT').count()
df_eda_exam_group.head(30)

In [None]:
df_eda_exam_group = df_eda_exam.groupby('COURSE_LENGTH').count()
df_eda_exam_group.head(30)

In [None]:
df_eda_exam_group = df_eda_exam.groupby('COURSE_SUBJECT').count()
with pd.option_context("display.max_rows", 2000): display (df_eda_exam_group)

#### DRIVERS_LICENSE_REQ


In [None]:
df_eda_exam_group = df_eda_exam.groupby('DRIVERS_LICENSE_REQ').count()
df_eda_exam_group.head(30)

#### DRIV_LIC_TYPE

In [None]:
df_eda_exam_group = df_eda_exam.groupby('DRIV_LIC_TYPE').count()
df_eda_exam_group.head(30)

#### ADDTL_LIC

In [None]:
df_eda_exam_group = df_eda_exam.groupby('ADDTL_LIC').count()
df_eda_exam_group.head(20)

#### EXAM_TYPE

EXAM_TYPE describes the type of opportunity, i.e. who may apply.

Just over half of the bulletins are open to non-employees.
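
A minimal sketch of how the share of bulletins open to non-employees could be computed directly. It assumes that EXAM_TYPE codes beginning with `OPEN` mark bulletins open to non-employees; that coding and the toy values below are assumptions made for illustration.

```python
import pandas as pd

# Toy EXAM_TYPE column; the codes and their frequencies are assumed.
exam_type = pd.Series(['OPEN', 'OPEN_INT_PROM', 'INT_DEPT_PROM',
                       'OPEN', 'DEPT_PROM', 'OPEN_INT_PROM'])

# Proportion of bulletins per exam type.
shares = exam_type.value_counts(normalize=True)

# Sum the proportions of every code that starts with 'OPEN'.
open_share = shares[shares.index.str.startswith('OPEN')].sum()
print(round(open_share, 3))
```

Using `value_counts(normalize=True)` gives proportions directly and does not depend on the column names produced by `reset_index()`, which differ between pandas versions.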


In [None]:
df_eda_exam_f =df_eda_exam['EXAM_TYPE'].value_counts().reset_index()
df_eda_exam_f['index']=df_eda_exam_f['index'].apply(lambda x : x.title())
df_eda_exam_f=df_eda_exam_f.groupby('index',as_index=False).agg('sum')
labels=df_eda_exam_f['index']
sizes=df_eda_exam_f['EXAM_TYPE']
plt.figure(figsize=(5,7))
#size the explode list from the data so the plot still works if the number of exam types changes
plt.pie(sizes,explode=[0.1] * len(sizes),labels=labels)
plt.gca().axis('equal')
plt.title('Exam Type for all Job Bulletins' )
plt.show()

In [None]:
df_eda_exam_group = df_eda_exam.groupby('EXAM_TYPE').count()
df_eda_exam_group.head(30)

#### VOCATIONAL_QUAL

This often repeats the contents of the EDUCATION_MAJOR column, but it is also a catch-all for miscellaneous training requirements.

In [None]:
df_exam_group = df_eda_exam.groupby('VOCATIONAL_QUAL').count()

with pd.option_context("display.max_rows", 2000): display (df_exam_group)

In [None]:
df_exam_group = df_eda_exam.groupby('DEGREE_REQ').count()

df_exam_group.head(30)

### Reformatting salary information to allow graphical presentation

In [None]:

df_eda_exam = df_eda_exam.sort_values('ENTRY_SALARY_GEN')

df_eda_exam['entry_salary'] = 99
df_eda_exam['final_salary'] = 99
df_eda_exam['pc_range'] = 99

#with pd.option_context("display.max_rows", 2000): display (df_eda_exam)


In [None]:
#raw strings avoid invalid escape sequence warnings for \d
salary_flat_pattern = r'(\d{4,6})(.*)(flat-rated)'
salary_range_pattern = r'(\d{4,6})(.*?)(\d{4,6})'

#df_salary_eda['entry_salary'] = df_salary_eda['ENTRY_SALARY_GEN']

for i, row in df_eda_exam.iterrows():
    salary_range =  ''
    if not pd.isna(row['ENTRY_SALARY_GEN']):
        salary_range = row['ENTRY_SALARY_GEN']
    elif not pd.isna(row['ENTRY_SALARY_DWP']):
        salary_range = row['ENTRY_SALARY_DWP']
   # print ('FILE_NAME', row['FILE_NAME'])
    #print ('salary_range',salary_range)
    
    #print ('ENTRY_SALARY_GEN',row['ENTRY_SALARY_GEN'])
    #print ('ENTRY_SALARY_DWP',row['ENTRY_SALARY_DWP'])
      
    entry_salary = -1
    final_salary = -1

    if re.search(salary_flat_pattern,salary_range):
        #print ('re.search(salary_flat_pattern,salary_range).group(1)',re.search(salary_flat_pattern,salary_range).group(1))
        entry_salary = int(re.search(salary_flat_pattern,salary_range).group(1))
        final_salary = 0
    if re.search(salary_range_pattern,salary_range):
        #print ('re.search(salary_range_pattern,salary_range).group(1)',re.search(salary_range_pattern,salary_range).group(1))
        entry_salary = int(re.search(salary_range_pattern,salary_range).group(1))
        final_salary = int(re.search(salary_range_pattern,salary_range).group(3))
    #print ('entry_salary',entry_salary)
    #print ('final_salary',final_salary)
    if final_salary != 0:
        pc_range = 100* (final_salary - entry_salary)/entry_salary
    else:
        pc_range = 0
    df_eda_exam.loc[i,'entry_salary'] = entry_salary
    df_eda_exam.loc[i,'final_salary'] = final_salary
    df_eda_exam.loc[i,'pc_range'] = pc_range
df_eda_exam.describe()                                  

In [None]:
bins = pd.cut(df_eda_exam['entry_salary'], [0, 50000, 75000, 100000, 150000, 200000, 250000])

df_eda_exam.groupby(bins)['entry_salary'].agg(['count'])



#### Salary

In two cases an entry salary is not available in either the GEN or the DWP column. In these cases -1 is used as a placeholder.
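
The salary parsing above can be summarised as a small function. This is a sketch, assuming salary strings look like `'50000-75000'` or `'100000 (flat-rated)'`; the function name `parse_salary` and the sample strings are illustrative and do not come from the bulletins.

```python
import re

FLAT = re.compile(r'(\d{4,6}).*flat-rated')
RANGE = re.compile(r'(\d{4,6}).*?(\d{4,6})')

def parse_salary(text):
    """Return (entry, final, pc_range).

    entry == -1 marks a missing salary; final == 0 marks a flat rate.
    """
    entry, final = -1, -1
    if not isinstance(text, str):
        return entry, final, 0.0
    flat = FLAT.search(text)
    if flat:
        entry, final = int(flat.group(1)), 0
    rng = RANGE.search(text)
    if rng:
        entry, final = int(rng.group(1)), int(rng.group(2))
    # Percentage spread between entry and final salary, 0 when there is no range.
    pc_range = 100 * (final - entry) / entry if final > 0 else 0.0
    return entry, final, pc_range

print(parse_salary('50000-75000'))
print(parse_salary('100000 (flat-rated)'))
print(parse_salary(None))
```

Keeping the placeholder explicit matters later: any binning or plotting of `entry_salary` needs to decide whether the -1 rows are included or dropped.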

In [None]:
df_eda_exam = df_eda_exam.sort_values('entry_salary')
df_eda_exam.head(3)


In [None]:
import seaborn as sns

plt.rcParams['figure.figsize']=(12,6)
sns.distplot(df_eda_exam.entry_salary.fillna(axis=0, method='ffill'), bins =30, color = 'red')


## Gender and Ethnicity Analysis

### LA Ethnicity<a id='laethnicity'></a>

[Los Angeles Demographics](http://www.census.gov/quickfacts/fact/table/losangelescitycalifornia/PST045218#PST045218)

According to the US Census Bureau, and using their terms, the composition of Los Angeles was estimated in 2018 as:

        Hispanic or Latino: 48.7%
        White: 28.4%
        Asian: 11.7%
        Black or African American: 8.9%
        Two or More Races: 3.5%
        American Indian and Alaska Native: 0.7%
        Native Hawaiian and Other Pacific Islander: 0.2%

In [None]:
labels=['Hispanic or Latino','White','Asian','Black or African American','Two or More Races','American Indian and Alaska Native','Native Hawaiian and Other Pacific Islander']
sizes=[48.7,28.4,11.7,8.9,3.5,0.7,.2]
plt.figure(figsize=(5,7))
plt.pie(sizes,explode=(0.1, 0.1,0.1,0.1,0.1,0.1,0.1),labels=labels)
plt.gca().axis('equal')
plt.title('LA Ethnicity' )
plt.show()

### Application Data<a id='gedata'></a>

We have [access](https://catalog.data.gov/dataset/applicant-information-from-7-1-2014-to-9-30-2014-7835b) to the application data for 187 of the 672 job bulletins supplied.

In [None]:
#df_ge = pd.read_csv("../input/la-applicants-gender-and-ethnicity/Job_Applicants_by_Gender_and_Ethnicity.csv")
df_ge = pd.read_csv("../input/la-applicants-gender-and-ethnicity/rows.csv")

Hispanic = 48.7
White =  28.4
Asian =  11.7
Black_or_African_American =  8.9
Two_or_more_races = 3.5
American_Indian_and_Alaska_Native = 0.7
Native_Hawaiian_and_Pacific_Islander = 0.2

df_g_and_e = df_ge.copy()
#df_g_and_e = df_g_and_e.rename(columns={'Job Description': 'JOB_DESCRIPTION', 'Job Number': 'JOB_CLASS_NO'})
df_g_and_e = df_g_and_e.rename(columns={'Job Description': 'JOB_DESCRIPTION'})

#df_g_and_e['JOB_CLASS_NO'] = df_g_and_e['JOB_CLASS_NO'].map(lambda x: str(x)[:4])
df_g_and_e['JOB_CLASS_NO'] = df_g_and_e['JOB_DESCRIPTION'].map(lambda x: re.search (r'(\d{4})',x).group(1)  if re.search (r'(\d{4})',x)  else None)

df_g_and_e.fillna('NONE', inplace = True)

class_pattern = r'(\d{4})'
for index, row in df_g_and_e.iterrows():
    #print('before',df_g_and_e.iloc[index]['JOB_DESCRIPTION'])
    #print('before',df_g_and_e.iloc[index]['Job Number'])
    if re.search (r'(^\d{4})',df_g_and_e.iloc[index]['Job Number']):
        #print ('in')
        df_g_and_e.at[index,'JOB_CLASS_NO'] = df_g_and_e.iloc[index]['Job Number'][:4]
    else:
        #print ('out')
        df_g_and_e.at[index,'JOB_CLASS_NO'] = re.search (class_pattern,df_g_and_e.iloc[index]['JOB_DESCRIPTION']).group(1)
    #print('after',df_g_and_e.iloc[index]['JOB_CLASS_NO'])
        

                                                           
df_g_and_e.insert(loc = 3, column = 'HISPANIC_REP', value = 100 * df_g_and_e['Hispanic']/ 
                  (Hispanic * (df_g_and_e['Apps Received']- df_g_and_e['Unknown_Ethnicity'])))
df_g_and_e.insert(loc = 4, column = 'CAUCASIAN_REP', value = 100 * df_g_and_e['Caucasian']/ 
                  (White * (df_g_and_e['Apps Received']- df_g_and_e['Unknown_Ethnicity'])))
df_g_and_e.insert(loc = 5, column = 'ASIAN_REP', value = 100 * df_g_and_e['Asian']/ 
                  (Asian * (df_g_and_e['Apps Received']- df_g_and_e['Unknown_Ethnicity'])))
df_g_and_e.insert(loc = 6, column = 'BLACK_REP', value = 100 * df_g_and_e['Black']/ 
                  (Black_or_African_American * (df_g_and_e['Apps Received']- df_g_and_e['Unknown_Ethnicity'])))
df_g_and_e.insert(loc = 7, column = 'AI_OR_AN_REP', value = 100 * df_g_and_e['American Indian/ Alaskan Native']/ 
                  (American_Indian_and_Alaska_Native * (df_g_and_e['Apps Received']- df_g_and_e['Unknown_Ethnicity'])))
df_g_and_e.insert(loc = 8, column = 'FILIPINO_REP', value = 100 * df_g_and_e['Filipino']/ 
                  (Native_Hawaiian_and_Pacific_Islander * (df_g_and_e['Apps Received']- df_g_and_e['Unknown_Ethnicity'])))

df_g_and_e =df_g_and_e.sort_values('Apps Received', ascending  = False)
df_g_and_e.describe()

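In the dataframe below, each `_REP` column is the ratio of a group's share of applications (excluding applications of unknown ethnicity) to its share of the LA population. A value of 1 means the group applies exactly in proportion to its population share; values below 1 indicate under-representation. Note that FILIPINO_REP is measured against the census "Native Hawaiian and Other Pacific Islander" share, because the census list above has no separate Filipino category.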
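
As a worked illustration of the ratio, here is a minimal sketch using the same formula as the cell above on made-up application counts. The toy numbers are assumptions; only the 48.7% population share and the column names come from this notebook.

```python
import pandas as pd

HISPANIC_PCT = 48.7  # LA population share quoted above

# Toy application counts (illustrative only).
toy = pd.DataFrame({
    'Apps Received':     [1000, 100],
    'Unknown_Ethnicity': [0, 20],
    'Hispanic':          [487, 20],
})

# Applications with a known ethnicity form the denominator.
known = toy['Apps Received'] - toy['Unknown_Ethnicity']
toy['HISPANIC_REP'] = 100 * toy['Hispanic'] / (HISPANIC_PCT * known)
print(toy)
```

In the first row 48.7% of the known applications are Hispanic, so the ratio is 1. In the second row 25% are Hispanic, so the ratio is roughly 0.51, i.e. about half the population share.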

In [None]:
with pd.option_context("display.max_rows", 2000): display (df_g_and_e)


### Checking how representative the application data is of the 678 job bulletins<a id='sample'></a>

The gender and ethnicity data of the job applicants looks like useful information, but we need to be sure that it is a reasonable representation of the job bulletin set provided.

To do this we merge the gender and ethnicity data with the structured file. We can then compare the full set with the sample on a number of different measures.
 

In [None]:
df_eda_ge_prep = df_eda_exam.copy()
df_eda_ge_prep = df_eda_ge_prep.drop (['REQUIREMENT_SET_ID','REQUIREMENT_SUBSET_ID','AND_OR','JOB_DUTIES','EXP_JOB_CLASS_ALT_RESP', \
                                    'EXP_JOB_CLASS_FUNCTION','OPEN_DATE','NOTES','SELECTION_PROCESS'], axis = 1)
df_eda_ge = df_eda_ge_prep.merge(right=df_g_and_e, how = 'right',
                                            left_on ='JOB_CLASS_NO',
                                            right_on ='JOB_CLASS_NO')

df_eda_ge.describe()

In [None]:
with pd.option_context("display.max_rows", 2000): display (df_eda_ge)

In [None]:
def plot_full_sample_cmp_sub(var_f, var_s, index, column):
    labels1=var_f[index]
    sizes1=var_f[column]
    labels2=var_s[index]
    sizes2=var_s[column]
    colors = ['yellowgreen','red','gold','lightskyblue','lightcoral','blue','pink', 'darkgreen','yellow','grey','violet','magenta','cyan']
    
    fig = plt.figure()
    ax1 = fig.add_axes([0, 0, .5, .5], aspect=1)
    ax1.pie(sizes1, labels=labels1, colors = colors,radius = 1.2)
    ax1.set_title ('ALL JOB BULLETINS\n')

    ax2 = fig.add_axes([.5, .0, .5, .5], aspect=1)
    ax2.pie(sizes2, labels=labels2,colors = colors, radius = 1.2)
    ax2.set_title ('SAMPLE WITH APPLICATIONS DATA\n')
    plt.show()
    return

def plot_full_sample_cmp(column):
    var_f=df_eda_exam[column].value_counts().reset_index()
    var_f['index']=var_f['index'].apply(lambda x : x.title())
    var_f=var_f.groupby('index',as_index=False).agg('sum')

    var_s=df_eda_ge[column].value_counts().reset_index()
    var_s['index']=var_s['index'].apply(lambda x : x.title())
    var_s=var_s.groupby('index',as_index=False).agg('sum')

    plot_full_sample_cmp_sub(var_f, var_s, 'index', column)
    return


plot_full_sample_cmp ('EXAM_TYPE')

In [None]:
plot_full_sample_cmp ('SCHOOL_TYPE')

In [None]:
bins = pd.cut(df_eda_exam['entry_salary'], [-1, 50000, 75000, 100000, 150000, 200000, 250000])
var_f = df_eda_exam.groupby(bins)['entry_salary'].agg(['count']).reset_index()
labels1=var_f['entry_salary']
sizes1=var_f['count']

bins = pd.cut(df_eda_ge['entry_salary'], [-1, 50000, 75000, 100000, 150000, 200000, 250000])
var_s = df_eda_ge.groupby(bins)['entry_salary'].agg(['count']).reset_index()

plot_full_sample_cmp_sub(var_f, var_s, 'entry_salary', 'count')
plt.show()

In [None]:
plot_full_sample_cmp ('DEGREE_REQ')

In [None]:
plot_full_sample_cmp ('FULL_TIME_PART_TIME')

In [None]:
plot_full_sample_cmp ('DRIVERS_LICENSE_REQ')

Reviewing the above charts by eye indicates that the sample is a reasonable reflection of the full set of bulletins.
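
The visual check could be complemented by a simple number. The sketch below computes the total variation distance between the category proportions of the full set and of the sample: 0 means identical proportions and 1 means no overlap. The helper name `category_gap` and the toy data are assumptions, and this measure was not used in the analysis above.

```python
import pandas as pd

def category_gap(full, sample):
    """Total variation distance between the category shares of two Series."""
    p = full.value_counts(normalize=True)
    q = sample.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side counts as 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * (p - q).abs().sum()

# Toy columns standing in for df_eda_exam['DEGREE_REQ'] and df_eda_ge['DEGREE_REQ'].
full = pd.Series(['YES', 'NO', 'NO', 'NO'])
sample = pd.Series(['YES', 'NO'])

gap = category_gap(full, sample)
print(gap)
```

Applied to each of the columns charted above, a small value would support the conclusion that the sample is representative for that measure.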


### Comparing the sum of all applications

#### Gender<a id='gendertotal'></a>

In [None]:
df_ge_summ = df_ge.copy()
df_ge_summ.drop(df_ge_summ.columns[[0, 1, 2,3,7,8,9,10,11,12,13]], axis=1, inplace=True)
df_ge_summ = df_ge_summ.sum(axis = 0, skipna = True)

df_ge_summ.head()

In [None]:
labels=df_ge_summ.index
sizes=df_ge_summ
plt.figure(figsize=(5,7))
plt.pie(sizes,explode=(0.1, 0.1,0.1),labels=labels)
plt.gca().axis('equal')
plt.title('Gender of all applicants' )
plt.show()

#### Ethnicity<a id='ethnicitytotal'></a>

In [None]:
df_ge_summ = df_ge.copy()
df_ge_summ.drop(df_ge_summ.columns[[0, 1, 2,3,4,5,6,13]], axis=1, inplace=True)
df_ge_summ = df_ge_summ.sum(axis = 0, skipna = True)
#df_ge_summ.head(10)

In [None]:
#LA population data
#using the sample application data titles and removing 'Two or More Races' as we don't have application data for this category
total = 48.7 + 28.4 + 11.7 + 8.9 +0.7 + .2
Black = 8.9/total
Hispanic = 48.7/total
Asian = 11.7/total
Caucasian = 28.4/total
American_IndianAlaskan_Native = .7/total
Filipino = .2/total


In [None]:
def cmp_apps_with_pop (labels2,sizes2, title):

    labels1=['Black','Hispanic','Asian','Caucasian','American Indian/Alaska Native','Filipino']
    sizes1=[Black,Hispanic,Asian,Caucasian,American_IndianAlaskan_Native,Filipino]
    fig = plt.figure()
    ax1 = fig.add_axes([0, 0, .5, .5], aspect=1)
    ax1.pie(sizes1, explode=(0.1, 0.1, 0.1,0.1,0.1,0.1), labels=labels1, radius = 1.2)
    ax1.set_title ('LA ETHNICITY\n')

    ax2 = fig.add_axes([.5, .0, .5, .5], aspect=1)
    ax2.pie(sizes2, labels=labels2, explode=(0.1, 0.1, 0.1,0.1, 0.1,0.1),radius = 1.2)
    ax2.set_title (title)
    plt.show()
    return

cmp_apps_with_pop (df_ge_summ.index,df_ge_summ, 'APPLICATIONS FOR ALL JOBS\n')


### Reviewing application balance for various groupings of jobs

#### Degree required?


In [None]:
df_eda_ge_group = df_eda_ge.groupby('DEGREE_REQ').sum()
df_eda_ge_group.head()

In [None]:
df_eda_ge_group_T = df_eda_ge_group.T
df_eda_ge_group_T = df_eda_ge_group_T['Female':'Male']
df_eda_ge_group_T.head(30)

#### Gender: degree required?<a id='genderdegree'></a>

In [None]:
def cmp_apps_with_pop_g (labels,sizes1, title1, sizes2, title2):

    fig = plt.figure()
    ax1 = fig.add_axes([0, .0, .5, .5], aspect=1)
    ax1.pie(sizes1, labels=labels, explode=(0.1, 0.1),radius = 1.2)
    ax1.set_title (title1)

    ax2 = fig.add_axes([.5, .0, .5, .5], aspect=1)
    ax2.pie(sizes2, labels=labels, explode=(0.1, 0.1),radius = 1.2)
    ax2.set_title (title2)
    plt.show()
    return

cmp_apps_with_pop_g (df_eda_ge_group_T.index,df_eda_ge_group_T['NO'], 'APPLICATIONS FOR NON-DEGREE LEVEL JOBS\n',df_eda_ge_group_T['YES'], 'APPLICATIONS FOR DEGREE LEVEL JOBS\n')


In [None]:
df_eda_ge_group_T = df_eda_ge_group.T
df_eda_ge_group_T = df_eda_ge_group_T['Black':'Filipino']

df_eda_ge_group_T.head(30)


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['YES'], 'APPLICATIONS FOR DEGREE LEVEL JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['NO'], 'APPLICATIONS FOR NON DEGREE LEVEL JOBS\n')


#### School Type


In [None]:
df_eda_ge_group = df_eda_ge.groupby('SCHOOL_TYPE').sum()
df_eda_ge_group.head(20)

In [None]:
df_eda_ge_group_T = df_eda_ge_group.T
df_eda_ge_group_T = df_eda_ge_group_T['Female':'Male']
df_eda_ge_group_T.head(30)

#### Gender: SCHOOL_TYPE Applications<a id='genderapprenticeship'></a>

In [None]:
cmp_apps_with_pop_g (df_eda_ge_group_T.index,df_eda_ge_group_T['APPRENTICESHIP'], 'APPLICATIONS FOR APPRENTICESHIP LEVEL JOBS\n',df_eda_ge_group_T['COLLEGE OR UNIVERSITY'], 'APPLICATIONS FOR COLLEGE OR UNIVERSITY LEVEL JOBS\n')


#### Ethnicity: SCHOOL_TYPE Applications<a id='ethnicityapprenticeship'></a>

In [None]:
df_eda_ge_group_T = df_eda_ge_group.T
df_eda_ge_group_T = df_eda_ge_group_T['Black':'Filipino']
df_eda_ge_group_T.head(30)

In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['APPRENTICESHIP'], 'APPLICATIONS FOR APPRENTICESHIP LEVEL JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['COLLEGE OR UNIVERSITY'], 'APPLICATIONS FOR COLLEGE OR UNIVERSITY LEVEL JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['CERTIFICATION'], 'APPLICATIONS FOR CERTIFICATION LEVEL JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['NONE'], 'APPLICATIONS FOR JOBS WITH NO SCHOOL TYPE\n')


### Gender: Driver License<a id='genderdl'></a>

In [None]:
df_eda_ge_group = df_eda_ge.groupby('DRIV_LIC_TYPE').sum()
df_eda_ge_group_T = df_eda_ge_group.T
df_eda_ge_group_T = df_eda_ge_group_T['Female':'Male']
df_eda_ge_group_T.head(30)

In [None]:
labels=['Female','Male']
sizes =[2+1+337+96+33,461+84+4370+3437+250]
plt.figure(figsize=(5,7))
plt.pie(sizes,explode=(0.1, 0.1),labels=labels)
plt.gca().axis('equal')
plt.title('Special Driver Licence Required' )
plt.show()
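
The totals in the cell above are typed in by hand from the preceding table. A minimal sketch of deriving them from the data instead is given below. It assumes that the hand-typed figures cover every DRIV_LIC_TYPE other than 'NONE'; the toy frame and its values are illustrative.

```python
import pandas as pd

# Toy frame standing in for df_eda_ge (values are made up).
toy = pd.DataFrame({
    'DRIV_LIC_TYPE': ['A', 'B', 'NONE', 'A'],
    'Female':        [2, 10, 500, 3],
    'Male':          [40, 60, 450, 50],
})

# Keep only the bulletins that require a special license type.
special = toy[toy['DRIV_LIC_TYPE'] != 'NONE']

# Total applications by gender for those bulletins.
sizes = special[['Female', 'Male']].sum()
print(sizes)
```

Computing the totals this way keeps the chart in step with the data if the application file or the merge changes.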

#### Salary

In [None]:
bins = pd.cut(df_eda_ge['entry_salary'], [-1, 50000, 75000, 100000, 150000, 200000, 250000])
df_eda_ge_group = df_eda_ge.groupby(bins).sum()
df_eda_ge_group.head()

#### Gender: salary<a id='gendersalary'></a>

In [None]:
df_eda_ge_group_T = df_eda_ge_group.T
df_eda_ge_group_T = df_eda_ge_group_T['Female':'Male']
df_eda_ge_group_T.head(30)

In [None]:
df_eda_ge_group_T = df_eda_ge_group_T.rename(columns={ df_eda_ge_group_T.columns[0]: "Up to $50k",df_eda_ge_group_T.columns[1]: "50k-$75k",df_eda_ge_group_T.columns[4]: "Over $150k"})
df_eda_ge_group_T.head()

In [None]:
cmp_apps_with_pop_g (df_eda_ge_group_T.index,df_eda_ge_group_T['Up to $50k'], 'APPLICATIONS FOR UP TO $50k JOBS\n',df_eda_ge_group_T['50k-$75k'],'APPLICATIONS FOR $50k-$75k JOBS\n')


In [None]:
cmp_apps_with_pop_g (df_eda_ge_group_T.index,df_eda_ge_group_T['Up to $50k'], 'APPLICATIONS FOR UP TO $50k JOBS\n',df_eda_ge_group_T['Over $150k'],'APPLICATIONS FOR $150k+ LEVEL JOBS\n')


#### Ethnicity: salary<a id='ethnicitysalary'></a>

In [None]:
df_eda_ge_group_T = df_eda_ge_group.T
df_eda_ge_group_T = df_eda_ge_group_T['Black':'Filipino']
df_eda_ge_group_T = df_eda_ge_group_T.rename(columns={ df_eda_ge_group_T.columns[0]: "Up to $50k",\
                                                      df_eda_ge_group_T.columns[1]: "50k-$75k",\
                                                      df_eda_ge_group_T.columns[2]: "75k-$100k",\
                                                      df_eda_ge_group_T.columns[3]: "100k-$150k",\
                                                      df_eda_ge_group_T.columns[4]: "Over $150k"})

df_eda_ge_group_T.head(30)

In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['Up to $50k'], 'APPLICATIONS FOR UP TO $50k JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['50k-$75k'], 'APPLICATIONS FOR $50-75k JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['75k-$100k'], 'APPLICATIONS FOR $75-100k JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['100k-$150k'], 'APPLICATIONS FOR $100-150k JOBS\n')


In [None]:
cmp_apps_with_pop (df_eda_ge_group_T.index,df_eda_ge_group_T['Over $150k'], 'APPLICATIONS FOR $150k+ JOBS\n')


## Finding Explicit Links<a id='explicitlinks'></a>

In [None]:
df_job_class.head()

In [None]:
df_explicit = pd.DataFrame()
data_list = []

df_job_class_len = len(df_job_class)
index = 0
while index < df_job_class_len:
    job = df_job_class.iloc[index]['JOB_CLASS_TITLE']
    reqs = df_job_class.iloc[index]['EXP_JOB_CLASS_TITLE']
    exp_len = df_job_class.iloc[index]['EXPERIENCE_LENGTH']
    license = df_job_class.iloc[index]['DRIVERS_LICENSE_REQ']
    license_type = df_job_class.iloc[index]['DRIV_LIC_TYPE']
    ed_yr = df_job_class.iloc[index]['EDUCATION_YEARS']
    school_type = df_job_class.iloc[index]['SCHOOL_TYPE']
    course_length = df_job_class.iloc[index]['COURSE_LENGTH']
    #print ('job',job) 
    
    #print ('req',reqs)
    if not pd.isna(reqs):
        for i, word in enumerate(reqs.split(',')): 
            word = word.rstrip(' ').lstrip(' ')
            #print ('req words',word)
            data_list_len = len(data_list)
            copy_found  = False
            j = 0
            while copy_found == False and j < data_list_len:
                list_job = data_list[j][0].upper()
                list_word = data_list[j][1].upper()
                # Compare case-insensitively: stored job titles are upper-cased on append
                if list_job == job.upper() and list_word == word.upper():
                    print ('list_job',list_job)
                    copy_found = True
                j += 1
            if word != '' and copy_found == False:
                data_list.append([job.upper(), word,exp_len,license,license_type,ed_yr,school_type,course_length])
    index += 1
df_explicit = pd.DataFrame(data_list)
df_explicit.columns = ["JOB", "REQUIREMENT", "EXPERIENCE_LENGTH","DRIVERS_LICENSE_REQ","DRIV_LIC_TYPE","EDUCATION_YEARS","SCHOOL_TYPE","COURSE_LENGTH"]
df_explicit.head()


In [None]:
include =['object', 'float', 'int'] 
df_eda_exam.describe( include = include)

### Making suggestions about possible promotions

The df_explicit dataframe can be used to make promotion suggestions. Given an employee's job class, education and driver licence details, the dataframe can be interrogated to supply possible promotion routes or to show what extra experience or qualifications would be needed to open up a career route.
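
As a rough sketch of such an interrogation, the snippet below filters requirement records by an employee's current job class and years of experience. It uses a small hand-made list of dictionaries with some of the same column names as df_explicit rather than the dataframe itself. The sample rows, the function name `suggest_promotions` and the `years_experience` parameter are illustrative assumptions, not data taken from the bulletins.

```python
# Hypothetical rows shaped like df_explicit (values invented for illustration).
SAMPLE_LINKS = [
    {"JOB": "SENIOR SYSTEMS ANALYST", "REQUIREMENT": "SYSTEMS ANALYST", "EXPERIENCE_LENGTH": 2.0},
    {"JOB": "SYSTEMS ANALYST", "REQUIREMENT": "SYSTEMS AIDE", "EXPERIENCE_LENGTH": 1.0},
    {"JOB": "INFORMATION SYSTEMS MANAGER", "REQUIREMENT": "SYSTEMS ANALYST", "EXPERIENCE_LENGTH": 4.0},
]

def suggest_promotions(links, current_job, years_experience):
    """Split jobs that list current_job as a requirement into those the
    employee is eligible for now and those needing more experience
    (returned with the shortfall in years)."""
    current_job = current_job.strip().upper()
    eligible, needs_more = [], []
    for row in links:
        if row["REQUIREMENT"].upper() != current_job:
            continue
        shortfall = row["EXPERIENCE_LENGTH"] - years_experience
        if shortfall <= 0:
            eligible.append(row["JOB"])
        else:
            needs_more.append((row["JOB"], shortfall))
    return eligible, needs_more

eligible, needs_more = suggest_promotions(SAMPLE_LINKS, "Systems Analyst", 2)
print("Eligible now:", eligible)
print("Needs more experience:", needs_more)
```

The same filter could be extended to the education and driver licence columns. Rows with a missing EXPERIENCE_LENGTH would need handling first, which this sketch does not attempt.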


In [None]:
with pd.option_context("display.max_rows", 2000): display (df_explicit)


In [None]:
include =['object', 'float', 'int'] 
df_explicit.describe( include = include)

### Job classes with the most first level subordinates

In [None]:
df_explicit_g = df_explicit.groupby('JOB').count()

df_explicit_s_g = df_explicit_g.sort_values('REQUIREMENT',ascending = False)
df_explicit_s_g.head()

### Looking for subordinates, i.e. who could apply

In [None]:

G = nx.Graph()

job = 'WATER UTILITY SUPERINTENDENT'
#G.add_node(job)
df_explicit_len = len(df_explicit)
index = 0
edges={}

while index < df_explicit_len:
    if df_explicit.iloc[index]['JOB'] == job:
        #print (job, ":   ",df_explicit.iloc[index]['REQUIREMENT'] )
        G.add_edge(job,df_explicit.iloc[index]['REQUIREMENT'])
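        # Label the edge with the required experience rather than a fixed '2yr'
        edges[job,df_explicit.iloc[index]['REQUIREMENT']] = str(df_explicit.iloc[index]['EXPERIENCE_LENGTH'])+'yr'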
        
    index +=1
plt.figure(figsize=(15, 15)) 
plt.axis('off')
pos = nx.circular_layout(G)

nx.draw_networkx_edge_labels(G,pos,edge_labels=edges
,font_color='red')

nx.draw_networkx(G, pos,with_labels=True, node_color='red', font_size=12, node_size=20000, arrows = True, width = 2)
plt.show()
#print (edges)


Now find the total career paths all the way back to an entry job...
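
The recursive search can be sketched without the plotting code. The version below is a simplified stand-in for the notebook's function, not a copy of it: it walks a plain dictionary that maps each job to its required job classes, and it uses a visited set instead of a depth counter to stop on cycles. The `LINKS` data and the name `career_edges` are illustrative assumptions.

```python
# Hypothetical job -> required job classes mapping (invented for illustration).
LINKS = {
    "WATER UTILITY SUPERINTENDENT": ["WATER UTILITY SUPERVISOR"],
    "WATER UTILITY SUPERVISOR": ["WATER UTILITY WORKER"],
    "WATER UTILITY WORKER": [],
}

def career_edges(job, links, visited=None):
    """Return (requirement, job) edges found by following requirements
    back from `job`. Each job is expanded at most once, so cycles end."""
    if visited is None:
        visited = set()
    if job in visited:
        return []
    visited.add(job)
    edges = []
    for req in links.get(job, []):
        edges.append((req, job))
        edges.extend(career_edges(req, links, visited))
    return edges

edges = career_edges("WATER UTILITY SUPERINTENDENT", LINKS)
for req, job in edges:
    print(f"{req} -> {job}")
```

The resulting list of pairs can be passed to `G.add_edges_from(edges)` on a `nx.DiGraph` to draw the same kind of diagram.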

In [None]:

def findsubsrecurse (job, edges, depth ):
    G.add_node(job)
    df_explicit_len = len(df_explicit)
    index = 0
    #print ('depth', depth)
    
    while index < df_explicit_len:
        #print ('index',index)
        
        if  (df_explicit.iloc[index]['JOB'] == job or\
             df_explicit.iloc[index]['JOB'] == job +' I' or\
             df_explicit.iloc[index]['JOB'] == job + " II" or\
             df_explicit.iloc[index]['JOB'] == job + " III") and depth<10:
            #print ('link found')
            #print ('job,req', job,df_explicit.iloc[index]['REQUIREMENT'] )
            G.add_edge(df_explicit.iloc[index]['REQUIREMENT'],job)
            edges[df_explicit.iloc[index]['REQUIREMENT'],job] = \
            str(df_explicit.iloc[index]['EXPERIENCE_LENGTH'])+'yr'
            depth += 1
            #print ('depth in loop1',depth)
            edges, depth  = findsubsrecurse (df_explicit.iloc[index]['REQUIREMENT'], edges, depth )
            
        index +=1
    return edges, depth

def findsubs (job):
    edges={}
    job= job.upper()
    depth = 0
    edges, depth  = findsubsrecurse (job,edges, depth )
    plt.figure(figsize=(15, 15)) 
    plt.axis('off')
    pos = nx.circular_layout(G)
    #pos = nx.spectral_layout(G)
    pos[job] = np.array([0, 0])
    nx.draw_networkx_edge_labels(G,pos,edge_labels=edges,font_color='red')

    nx.draw_networkx(G, pos,with_labels=True, node_color='red', 
                     font_size=12, node_size=2000, arrows = True, width = 2)
    plt.show()

    
    #print (pos)
    return

### Network diagrams to show explicit links between job classes<a id='el'></a>

In [None]:
G = nx.DiGraph()
findsubs ('SENIOR SYSTEMS ANALYST')

In [None]:
G = nx.DiGraph()
findsubs ('WATER UTILITY SUPERINTENDENT')

In [None]:
G = nx.DiGraph()
findsubs ('CHIEF INSPECTOR')

In [None]:
G = nx.DiGraph()
findsubs ('ELECTRICAL SERVICES MANAGER')


In [None]:
G = nx.DiGraph()
findsubs ('UTILITY SERVICES MANAGER')

### Most of the difficult-to-fill roles are open to everyone, not just LA City employees.<a id='difficult'></a>



The following 17 job classes can be challenging to fill with qualified candidates:

    Accountant
    Accounting Clerk
    Applications Programmer
    Assistant Street Lighting Electrician
    Building Mechanical Inspector
    Detention Officer
    Electrical Mechanic
    Equipment Mechanic
    Field Engineering Aide
    Housing Inspector
    Housing Investigator
    Librarian
    Security Officer
    Senior Administrative Clerk
    Senior Custodian
    Senior Equipment Mechanic
    Tree Surgeon

In the future, our Personnel Department expects to find it challenging to fill the following classes:

    IT-related classes (e.g., Applications Programmer)
    Wastewater classes
    Inspector classes
    Journey-level classes


In [None]:
G = nx.DiGraph()
# no subs findsubs ('Applications Programmer') education plus paid experience performing systems or programming tasks in a professional IT environment
# no subs findsubs ('Accountant') graduation required  not work experience required
# no subs findsubs ('Accounting Clerk') but paid clerical accounting work is required
# no subs findsubs ('Assistant Street Lighting Electrician') experience working in the construction, maintenance, and repair of street lighting circuitry
# findsubs ('Building Mechanical Inspector')  #ASSISTANT INSPECTOR
#findsubs ('Detention Officer') #PARK RANGER
# no subs findsubs (' Equipment Mechanic')
# no subs findsubs (' Field Engineering Aide')
# no subs findsubs ('Housing Inspector')#ASSISTANT INSPECTOR
# no subs findsubs ('Housing Investigator')
# no subs findsubs ('Librarian')
#findsubs ('Security Officer')  #PARK RANGER
# no subs findsubs ('Senior Administrative Clerk')
# no subs findsubs ('Senior Custodian')
findsubs ('Senior Equipment Mechanic') #HEAVY DUTY EQUIPMENT MECHANIC Auto Electrician
#findsubs ('Tree Surgeon') # Tree Surgeon assistant



### Finding promotion routes<a id='promotional'></a>

In [None]:

def findsuprecurse (experience, edges):
    G.add_node(experience)
    df_explicit_len = len(df_explicit)
    index = 0
    while index < df_explicit_len:
        if  (df_explicit.iloc[index]['REQUIREMENT'] == experience or\
             df_explicit.iloc[index]['REQUIREMENT'] == experience +' I' or\
             df_explicit.iloc[index]['REQUIREMENT'] == experience + " II" or\
             df_explicit.iloc[index]['REQUIREMENT'] == experience + " III"):
             
            G.add_edge(experience,df_explicit.iloc[index]['JOB'])
            edges[experience, df_explicit.iloc[index]['JOB'] ]= \
                 str(df_explicit.iloc[index]['EXPERIENCE_LENGTH'])+'yr'
            findsuprecurse (df_explicit.iloc[index]['JOB'], edges)
        index +=1
    return edges
def findsup (job):
    edges={}
   
    edges = findsuprecurse (job, edges)
    plt.figure(figsize=(15, 15)) 
    plt.axis('off')
    pos = nx.circular_layout(G)
    pos[job] = np.array([0, 0])
    
    nx.draw_networkx_edge_labels(G,pos,edge_labels=edges,font_color='red')

    nx.draw_networkx(G, pos,with_labels=True, node_color='red', 
                     font_size=12, node_size=2000, arrows = True, width = 2)
    plt.show()
    #print (edges)
    return


In [None]:
G = nx.DiGraph()
findsup ('SYSTEMS ANALYST')

In [None]:
G = nx.DiGraph()
findsup('PUBLIC RELATIONS SPECIALIST')

In [None]:
G = nx.DiGraph()
findsup('WELDER')

In [None]:
G = nx.DiGraph()
findsup('ELECTRICAL CRAFT HELPER')

In [None]:

G = nx.DiGraph()
findsup('ACCOUNTANT')

In [None]:
G = nx.DiGraph()
findsup('AUTOMOTIVE SUPERVISOR')

In [None]:
# dot.edge_attr.update(arrowhead='vee', arrowsize='2', dir ='back')

# for index, row in df_explicit.iterrows():
#     dot.edge(str(row["JOB"]), "AND"+str(index))
#     dot.edge("AND"+str(index), str(row["REQUIREMENT"]), label=str (row['EXPERIENCE_LENGTH']+'yr'))
#     if row['DRIVERS_LICENSE_REQ'] != '':
#             dot.edge( "AND"+str(index), "DRIVER LICENCE", label = str(row["DRIVERS_LICENSE_REQ"]))

    
# dot

### Graphical representation of requirement<a id='hierarchyreq'></a>

In [None]:
from graphviz import Digraph
dot = Digraph(name='Promotion Options')

def find_promotions(job):
    dot.edge_attr.update(arrowhead='vee', arrowsize='2', dir ='back')

    df_explicit_len = len(df_explicit)
    index = 0
    andstr = ''
    licstr = ''
    while index < df_explicit_len:
        if  (df_explicit.iloc[index]['REQUIREMENT'] == job or\
             df_explicit.iloc[index]['REQUIREMENT'] == job +' I' or\
             df_explicit.iloc[index]['REQUIREMENT'] == job + " II" or\
             df_explicit.iloc[index]['REQUIREMENT'] == job + " III"):

            if not pd.isna(df_explicit.iloc[index]['SCHOOL_TYPE']):
                
                if not pd.isna(df_explicit.iloc[index]["EDUCATION_YEARS"]):
                    #print ('df_explicit.iloc[index]["EDUCATION_YEARS"]',df_explicit.iloc[index]["EDUCATION_YEARS"])
                    dot.edge( "AND"+andstr, df_explicit.iloc[index]["SCHOOL_TYPE"].title(), label = str(df_explicit.iloc[index]["EDUCATION_YEARS"])+ 'yr')
                elif not pd.isna(df_explicit.iloc[index]["COURSE_LENGTH"]):
                    #print ('df_explicit.iloc[index]["COURSE_LENGTH"]',df_explicit.iloc[index]["COURSE_LENGTH"])
                    dot.edge( "AND"+andstr, df_explicit.iloc[index]["SCHOOL_TYPE"].title(), label = df_explicit.iloc[index]["COURSE_LENGTH"].title())
            
            dot.edge(df_explicit.iloc[index]["JOB"].title(), "AND"+andstr)
            dot.edge("AND"+andstr, df_explicit.iloc[index]["REQUIREMENT"].title())   #, label=str (df_explicit.iloc[index]['EXPERIENCE_LENGTH']+'yr'))
            if not pd.isna(df_explicit.iloc[index]['DRIVERS_LICENSE_REQ']):
                # No licence type recorded: label with the requirement only
                if pd.isna(df_explicit.iloc[index]["DRIV_LIC_TYPE"]):
                    dot.edge( "AND"+andstr, "Driver License", label = df_explicit.iloc[index]["DRIVERS_LICENSE_REQ"].title() )
                else:
                    dot.edge( "AND"+andstr, "Driver License", label = df_explicit.iloc[index]["DRIVERS_LICENSE_REQ"].title() + ", " +df_explicit.iloc[index]["DRIV_LIC_TYPE"].title())
            andstr= andstr + " "
            licstr= licstr + " "
        index +=1
    
    return

In [None]:
dot = Digraph()
find_promotions('ASSISTANT SIGNAL SYSTEMS ELECTRICIAN')
dot

In [None]:
dot = Digraph()
find_promotions('ENVIRONMENTAL SUPERVISOR')
dot

In [None]:
dot = Digraph()
find_promotions('AUDITOR')
dot

In [None]:
dot = Digraph()
find_promotions('WELDER')
dot

In [None]:
dot = Digraph()
find_promotions('ACCOUNTANT')
dot

In [None]:
dot = Digraph()
find_promotions('AUTOMOTIVE SUPERVISOR')
dot

In [None]:
from graphviz import Digraph
#dot = Digraph(name='Promotion Options')

def find_promotions_min(job):
    dot.edge_attr.update(arrowhead='vee', arrowsize='2', dir ='back')

    df_explicit_len = len(df_explicit)
    index = 0
    andstr = ''
    licstr = ''
    while index < df_explicit_len:
        if  (df_explicit.iloc[index]['REQUIREMENT'] == job or\
             df_explicit.iloc[index]['REQUIREMENT'] == job +' I' or\
             df_explicit.iloc[index]['REQUIREMENT'] == job + " II" or\
             df_explicit.iloc[index]['REQUIREMENT'] == job + " III"):
            label=str (df_explicit.iloc[index]['EXPERIENCE_LENGTH'])+'yr'
#             print ('index, label',index,label)
#             print ("df_explicit.iloc[index]['REQUIREMENT']",df_explicit.iloc[index]['REQUIREMENT'])
#             print ("df_explicit.iloc[index]['JOB']",df_explicit.iloc[index]['JOB'])
            dot.edge(df_explicit.iloc[index]["JOB"].title(), df_explicit.iloc[index]["REQUIREMENT"].title(), \
                     label=str (df_explicit.iloc[index]['EXPERIENCE_LENGTH'])+'yr')
        index +=1
    return

### A diagram showing all promotional pathways<a id='hierarchyall'></a>

In [None]:
dot = Digraph()
 
index = 0
df_eda_exam
while index < len(df_eda_exam):
#    print ('main index',index)
    find_promotions_min(df_eda_exam.iloc[index]['JOB_CLASS_TITLE'].upper())
    
    print ('.',end="")
    index +=1
dot

In [None]:
#The output looks much more interesting on my Kernel where the x axis is unlimited.
#Here is the dot file...

print (dot)

Output which lists all the promotion links

In [None]:
from graphviz import Digraph
dot = Digraph(name='Promotion Options')
dot.edge_attr.update(arrowhead='vee', arrowsize='2', dir ='back')
    
def find_promotions_minrecurse(job, depth):    
    dot.edge_attr.update(arrowhead='vee', arrowsize='2', dir ='back')
    df_explicit_len = len(df_explicit)
    #print (df_explicit_len)
    index = 0
    while index < df_explicit_len:
        #print ('job',job)
        if  (df_explicit.iloc[index]['JOB'] == job or\
             df_explicit.iloc[index]['JOB'] == job +' I' or\
             df_explicit.iloc[index]['JOB'] == job + " II" or\
             df_explicit.iloc[index]['JOB'] == job + " III") and depth<40:           
            req_pattern ='(.*)( I)'
            next_job = df_explicit.iloc[index]['REQUIREMENT']
            #print ('next_job',next_job)
            if re.search (req_pattern,next_job ):
                next_job = re.search (req_pattern,next_job ).group(1)

            dot.edge(df_explicit.iloc[index]["JOB"].title(), \
                     next_job.title(), \
                     label=str (df_explicit.iloc[index]['EXPERIENCE_LENGTH'])+'yr')
            depth += 1
            #print ('depth in loop1',depth)
           
            depth  = find_promotions_minrecurse (next_job,  depth )
    
        index +=1
    return depth



def findsubords (job):
    
    depth = 0
    depth  = find_promotions_minrecurse (job,depth )
    return

### Hierarchical Diagram at Department Level<a id='hierarchydepartment'></a>

In [None]:
dot = Digraph(strict = True)


findsubords ('ENGINEER OF SURVEYS')
dot

In [None]:
dot = Digraph(strict = True)

findsubords ('DIRECTOR OF AIRPORTS ADMINISTRATION')
dot

In [None]:
dot = Digraph()

findsubords ('ELECTRICAL SERVICES MANAGER')
dot

In [None]:
dot = Digraph(strict = True)

findsubords ('UTILITY SERVICES MANAGER')
dot

## Job Bulletin Language Analysis<a id='jblanganal'></a>


The concept of Belongingness should be useful in this analysis. [Belongingness](http://www.fortefoundation.org/site/DocServer/gendered_wording_JPSP.pdf?docID=16121) is defined as a feeling that one fits in with others in a particular domain; it affects people's willingness to engage. Belongingness cues can be picked up from the environment.

People will pick up these cues in job descriptions and draw conclusions beyond the set of requirements.

The bulletins should therefore [use inclusive language](http://breezy.hr/blog/3-simple-rules-for-using-inclusive-language-in-your-job-ads), i.e. make an intentional choice to use words that do not marginalize groups of people who may be knowingly or unknowingly discriminated against because of their culture, race, ethnicity, gender, sexual orientation, age, disability, socioeconomic status, appearance — or any other factor that simply shouldn’t play a role.

[This BBVA article](http://www.bbva.com/en/an-inclusive-workplace-begins-with-the-wording-of-job-ads/) contains a 10-point checklist that can act as a starting point in analysing current job bulletins, and data-driven evidence to support recommended changes.


### Of the 10 points, we can measure the LA City job bulletins against the following aspects:

**Avoid extreme language:** this can include jargon, imprecise exclusive words like "expert", negative sentiment and difficult-to-read sentences. For instance, is the level of English required to understand the bulletin higher than that required to perform the job? All of these exclude people who could be good candidates. 

**Avoid words that may convey stereotypes:** words like "lead" and "determine" reflect masculinity and may deter women from applying. Words that convey individuality over group endeavour deter some under-represented groups ([see page 13 of this article](http://gender-decoder.katmatfield.com/static/documents/Gaucher-Friesen-Kay-JPSP-Gendered-Wording-in-Job-ads.pdf)). 

**Avoid using masculine nouns and pronouns.** Using the second-person singular makes it possible to avoid masculine nouns and adjectives. However, when a direct reference is unavoidable, it is advisable to use gender-neutral nouns, such as “the person” or “the candidates”.

**Use “you” and “us”.** According to Textio, a platform that predicts the type of response job offers will get based on their wording, offers that use “you” and “we” are filled faster. Expressions like “you love finding the best solution to a problem” to address candidates are much better than impersonal ones like “the ideal candidate”.

**Write as concisely as possible.** Job offers should be brief. Ads written concisely are usually filled faster and usually draw in more applications.

### The remaining aspects are  more qualitative but recommendations about the following can still be made:

**Avoid unclear or unnecessary requirements**: these cannot be measured but they can be found. For instance: "a driver licence *may be* required" will deter both high quality candidates who don't want to waste their time and less confident candidates. The EDA shows that some jobs have many complicated requirement options.

**Convey a growth mindset.** Companies that are committed to the development of their talent are more likely to attract candidates from underrepresented groups. Expressions that reflect fixed qualities such as “natural-born analytical thinker”, “extremely intelligent” or “constantly outperforming” discourage aspiring candidates who may have high growth potential. The opposite happens with expressions such as “passionate learner” or “motivated to take on challenges”.

**State the company’s purpose and values.** Emphasizing the company’s values and mission is a good practice that should be taken into account when drawing up the offer, as it can help the candidates determine if it is a place where they would like to work.

**Demonstrate commitment to diversity and inclusion.** It is very advisable to devote some space to describing the company’s commitment to looking for all kinds of talent to build a diverse workforce in which all social groups are represented.

    
Hiring better starts by writing better. Good writing is, in many cases, the key to promoting inclusion. According to Textio, openings advertised using inclusive language get filled 17% faster and attract 23% more female candidates.
    

    

    

    

    


### Job Adverts: best practice

As well as considering diversity and bias, we need to improve the quality of applications. [This blog](https://www.talentlyft.com/en/blog/article/167/job-advertisement-best-practices) gives some interesting insights and the recommendations are consistent with the BBVA article above.

Essentially the quality candidate is in charge. You need to sell the job and tell them why they should consider it. 

### Sentiment Analysis: polarity<a id='sa'></a>

The TextBlob library provides a quick method of measuring the polarity (sentiment) and subjectivity of a piece of text. For the job bulletins, subjectivity is not an interesting concept but the polarity data is useful. 

Negative sentiment is unwelcoming and excluding.

The cell below can be modified to output the particular lines that are judged to have negative sentiment. They are largely process information about what happens if candidates fail to complete a required step. A better way of making these legal process points should be sought.
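
One way to surface the negative lines is sketched below. To keep the example runnable without TextBlob, it takes the scoring function as a parameter; in this notebook that could be something like `lambda s: TextBlob(s).sentiment.polarity`. The `toy_polarity` scorer, the sample lines and the function name `negative_lines` are illustrative assumptions; the -0.15 threshold mirrors the value used in the cell that follows.

```python
def negative_lines(lines, polarity, threshold=-0.15):
    """Return (score, line) pairs whose polarity is below the threshold,
    most negative first. Blank lines are skipped."""
    scored = [(polarity(line), line) for line in lines if line.strip()]
    flagged = [pair for pair in scored if pair[0] < threshold]
    return sorted(flagged, key=lambda pair: pair[0])

def toy_polarity(line):
    """Stand-in scorer for this demo only: -0.5 if the line mentions
    'fail' or 'disqualif', otherwise 0.0."""
    lowered = line.lower()
    return -0.5 if ("fail" in lowered or "disqualif" in lowered) else 0.0

# Invented sample lines, not quotes from a bulletin.
sample = [
    "Applicants must submit the form by the deadline.",
    "Applicants who fail to submit the form will be disqualified.",
    "",
]
flagged = negative_lines(sample, toy_polarity)
for score, line in flagged:
    print(score, line)
```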


In [None]:
from textblob import TextBlob
cnt = 0
sent =  []

for filename in os.listdir(bulletin_dir):
#     if cnt >-1 and cnt <20:
        neg_sent = 0
        pos_sent = 0
        neut_sent = 0
        cmp_sent = 0
        total = 0
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            row[col_heads['FILE_NAME']] = filename
            #print (filename)
            for line in f.read().split('\n'):
               
                analysis = TextBlob(line)
                if filename == 'HEAVY DUTY EQUIPMENT MECHANIC 3743 021717.txt' and analysis.sentiment.polarity < -0.15:
                    print ('analysis',analysis.sentiment.polarity)
                    print ('line', line)

                cmp_sent += analysis.sentiment.polarity
#Initial tests show the subjectivity measure is not useful for this application
#                 if analysis.sentiment.subjectivity >0:
#                     print("subjectivity",analysis.sentiment.subjectivity)
#                     print (line)
                if analysis.sentiment.polarity < 0:
                    neg_sent+= 1
                    total +=1
                    #print ('analysis.sentiment: polarity, subjectivity',analysis.sentiment.polarity,analysis.sentiment.subjectivity)
                    #print ('analysis.sentiment_assessments',analysis.sentiment_assessments)

                    #print ('neg line', line)
                if analysis.sentiment.polarity > 0:
                    pos_sent+= 1
                    total +=1
                    #print ('analysis.sentiment: polarity, subjectivity',analysis.sentiment.polarity,analysis.sentiment.subjectivity)
                    #print ('analysis.sentiment_assessments',analysis.sentiment_assessments)
                if analysis.sentiment.polarity == 0:
                    neut_sent+= 1
                   
                #         print ('neg_sent', neg_sent)
#         print ('pos_sent', pos_sent)
#         print ('neut_sent', neut_sent)
#         print ('cmp_sent', cmp_sent/total)
        # Guard against bulletins with no positive or negative lines (total == 0)
        sent.append([filename,cmp_sent/total if total else 0.0,neg_sent,pos_sent])
        cnt +=1
#         print ('sent list',sent)
#print ('sent list',sent)
        
df_sent = pd.DataFrame(sent)




In [None]:

df_sent.columns = ["FILENAME", "SENTIMENT", "NEG_SENTIMENT", "POS_SENTIMENT"]
df_sent = df_sent.sort_values('SENTIMENT')
df_sent.head()

In [None]:
df_sent = df_sent.sort_values('SENTIMENT', ascending  = False)
df_sent.head()

In [None]:
df_sent.describe()

### Gender Coding<a id='gc'></a>

This tool uses the original list of gender-coded words from the research paper written by [Danielle Gaucher, Justin Friesen, and Aaron C. Kay: Evidence That Gendered Wording in Job Advertisements Exists and Sustains Gender Inequality (Journal of Personality and Social Psychology, July 2011, Vol 101(1), p109-28).](http://gender-decoder.katmatfield.com/static/documents/Gaucher-Friesen-Kay-JPSP-Gendered-Wording-in-Job-ads.pdf)

Their results support the proposition that some words have feminine connotations and others have masculine connotations. This gender coding helps maintain the traditional gender division in work, as people use the wording to judge whether they will "belong".

The words identified in the paper are reproduced below and in the following cells the job bulletins are ["gender scored".](#gcresults)





In [None]:
feminine_coded_words = ["agree", "affectionate", "child", "cheer", "collab", "commit", "communal",   "compassion", "connect", "considerate", "cooperat", "co-operat", "depend",   "emotiona", "empath", "feel", "flatterable", "gentle", "honest", "interpersonal",   "interdependen", "interpersona", "inter-personal", "inter-dependen", "interpersona",   "kind", "kinship", "loyal", "modesty", "nag", "nurtur", "pleasant", "polite",   "quiet", "respon", "sensitiv", "submissive", "support", "sympath", "tender",   "together", "trust", "understand", "warm", "whin", "enthusias", "inclusive",   "yield", "shar"]

masculine_coded_words = [ "active", "adventurous", "aggress", "ambitio", "analy",   "assert", "athlet", "autonom", "battle", "boast", "challeng", "champion",   "compet", "confident", "courag", "decid", "decision", "decisive", "defend",   "determin", "domina", "dominant", "driven", "fearless", "fight", "force",   "greedy", "head-strong", "headstrong", "hierarch", "hostil", "impulsive",   "independen", "individual", "intellect", "lead", "logic", "objective", "opinion",   "outspoken", "persist", "principle", "reckless", "self-confiden", "self-relian", "self-sufficien", "selfconfiden", "selfrelian", "selfsufficien", "stubborn", "superior","unreasonab"]


In [None]:
def assess(ad_text):
    #print (ad_text)
    ad_text = ''.join([i if ord(i) < 128 else ' ' for i in ad_text])
    # Lower-case first so the lower-case word stems also match capitalised text
    ad_text = re.sub(r"\s", " ", ad_text.lower())
    ad_text = re.sub(r"[.\t,:;()]", "", ad_text).split(" ")
    ad_text = [ad for ad in ad_text if ad != ""]
    
    feminine_coded_words_fnd = ''
    feminine_coded_words_cnt = 0
    for adword in ad_text:
        for word in feminine_coded_words:
            if adword.startswith(word):
                feminine_coded_words_cnt +=1
                if word not in feminine_coded_words_fnd:
                    feminine_coded_words_fnd += " " + word

    masculine_coded_words_fnd = ''
    masculine_coded_words_cnt = 0
    for adword in ad_text:
        for word in masculine_coded_words:
            if adword.startswith(word):
                masculine_coded_words_cnt += 1
                if word not in masculine_coded_words_fnd:
                     masculine_coded_words_fnd += " " + word
                
    #print ('feminine_coded_words_fnd', feminine_coded_words_fnd)
    #print ('masculine_coded_words_fnd', masculine_coded_words_fnd)

    return feminine_coded_words_fnd, masculine_coded_words_fnd,feminine_coded_words_cnt,masculine_coded_words_cnt

In [None]:
cnt = 0
word_code =  []

for filename in os.listdir(bulletin_dir):
#    if cnt >-1 and cnt <20:
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            text = f.read()
            f,m, f_cnt, m_cnt = assess (text)
            g_c = 'not det'
            if f_cnt and not m_cnt:
                g_c = "strongly feminine"
            elif m_cnt and not f_cnt:
                g_c = "strongly masculine"
            elif (not m_cnt and not f_cnt) or (m_cnt == f_cnt):
                g_c = "neutral"
            elif f_cnt > m_cnt:
                g_c = "feminine"
            elif m_cnt > f_cnt:
                g_c = "masculine"
            word_code.append([filename,f,m, f_cnt, m_cnt, g_c])
            cnt +=  1
#print  (word_code)            
df_word_code = pd.DataFrame(word_code)
df_word_code.columns = ["FILENAME", "F_WORDS", "M_WORDS", "F_CNT", "M_CNT","GENDER_CODE"]
df_word_code = df_word_code.sort_values('GENDER_CODE')




#### Gender coding results<a id='gcresults'></a>

In [None]:
df_word_code.head(10)

In [None]:
df_word_code = df_word_code.sort_values('GENDER_CODE', ascending  = False)
df_word_code.head(10)


In [None]:
df_word_code.describe()

In [None]:
df_word_code_group = df_word_code.groupby('GENDER_CODE').count()
df_word_code_group.head()

In [None]:
# from nltk.tokenize import sent_tokenize, word_tokenize
# not_punctuation = lambda w: not (len(w)==1 and (not w.isalpha()))
# get_word_count = lambda text: len(list(filter(not_punctuation, word_tokenize(text))))
# get_sent_count = lambda text: len(sent_tokenize(text))

# from nltk.corpus import cmudict
# prondict = cmudict.dict()
# prondict["apple"]



In [None]:

# numsyllables_pronlist = lambda w: len(list(filter(lambda s: s[-1].isdigit(), w)))
# def numsyllables(word):
#     try:
#         return list(set(map(numsyllables_pronlist, prondict[word.lower()])))
#     except KeyError:
#         return [0]



In [None]:
# def numsyllables(word):
#     try:
#         x = prondict[word.lower()]
#         syll = 0
#         for y in x[0]:
#             if y[-1].isdigit():
#                 syll += 1
#         return  syll
#     except KeyError:
#         return 0
    

In [None]:
# def text_statistics(text):
#     word_count = get_word_count(text)
#     sent_count = get_sent_count(text)
#     print('word_count',word_count)
#     #w is argument to the function
#     syllable_count = sum(map(lambda w: max(numsyllables(w)), word_tokenize(text)))
#     print ('syllable_count',syllable_count)
#     return word_count, sent_count, syllable_count


In [None]:
# flesch_formula = lambda word_count, sent_count, syllable_count : 206.835 - 1.015*word_count/sent_count - 84.6*syllable_count/word_count
# def flesch(text):
#     word_count, sent_count, syllable_count,not_found = text_statistics(text)
# #     print ('word_count ',word_count, 'sent_count ', sent_count, 'syllable_count ', syllable_count)
# #     print ('words_not_found',not_found)
#     return flesch_formula(word_count, sent_count, syllable_count)
 
# fk_formula = lambda word_count, sent_count, syllable_count : 0.39 * word_count / sent_count + 11.8 * syllable_count / word_count - 15.59
# def flesch_kincaid(text):
#     word_count, sent_count, syllable_count,not_found = text_statistics(text)
#     return fk_formula(word_count, sent_count, syllable_count)


In [None]:
# cnt = 0
# for filename in os.listdir(bulletin_dir):
#     if cnt >-1 and cnt <2:
#         with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
#             text = f.read()
#             text = text.replace('. . ', '' )
#             print (filename)
#             print ('flesch_reading_ease',flesch(text))
#             print ('flesch_reading_grade',flesch_kincaid(text))
#             #print (text)
#             print ()
#         cnt +=1
 

In [None]:
# from nltk.corpus import cmudict
# word = 'president'
# d = cmudict.dict()
# for x in d[word.lower()]:
#     for y in x:
#         if y[-1].isdigit():
#             print (y)
#     print (x)

In [None]:
# from nltk.corpus import cmudict
# d = cmudict.dict()
# def nsyl(word):
#     a = [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]] 
# #     b = list(y for y in x if y[-1].isdigit()) for x in d[word.lower()]
# #     a = [len b]
#     return a
# print (nsyl('president'))

### You and Me<a id='yandme'></a>

Textio reports that taking a personal approach rather than a formal one produces more applications. This is an opportunity to talk to the applicant as a person and let the culture shine through. 

In the following cells, we count the usage of the words "you" and "we" in the job bulletins. Unfortunately, when we do find instances of the words, they are related to formal instructions and so the opportunity is missed. 
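
A compact way to count these pronouns is sketched below using word-boundary matching, which also catches occurrences at the start of a line or directly before punctuation. This is an alternative to the pattern used in the cells that follow, not a description of it, so its counts may differ. The function name `count_pronouns` and the sample text are invented for illustration.

```python
import re

def count_pronouns(text, pronouns=("you", "we")):
    """Count whole-word, case-insensitive occurrences of each pronoun."""
    counts = {}
    for pronoun in pronouns:
        pattern = r"\b" + re.escape(pronoun) + r"\b"
        counts[pronoun] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# "your" and "weekend" must not be counted as "you" or "we".
sample = "You must apply online.\nWe will notify you by e-mail; your weekend is safe."
counts = count_pronouns(sample)
print(counts)
```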

In [None]:
youandme = []
cnt = 0
for filename in os.listdir(bulletin_dir):
    
    if cnt >203 and cnt <205:
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            text = f.read()
            text = text.lower()
            print (filename)
            print ('You instances')
            you_list = re.findall (r'( |^)(you )(.*?)(,|\.|;)' , text)
            print (len(you_list))
            print (you_list)

            print ('We instances')
            we_list = re.findall (r'( |^)(we )(.*?)(,|\.|;)' , text)
            print (len(we_list))
            print (we_list)

            
            print ()

    cnt +=1


In [None]:
youandme = []
cnt = 0
for filename in os.listdir(bulletin_dir):
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            text = f.read()
            text = text.lower()
            you_list = re.findall (r'( |^)(you )(.*?)(,|\.|;)' , text)
            we_list = re.findall (r'( |^)(we )(.*?)(,|\.|;)' , text)
#            we_list = re.findall ('(we )(.*?)(,|\.|;)' , text)
            youandme.append([filename,len(you_list),len(we_list)])

#print (youandme)
df_youandme = pd.DataFrame(youandme)
df_youandme.columns = ["FILENAME", "YOU_INSTANCE", "WE_INSTANCE"]
df_youandme = df_youandme.sort_values('YOU_INSTANCE',ascending  =False)
df_youandme.head()

In [None]:
df_youandme = df_youandme.sort_values('WE_INSTANCE',ascending  =False)
df_youandme.head()

### Readability<a id='read'></a>

The Flesch Reading Ease formula tells us how easy or difficult a text is to read: the higher the Reading Ease score, the easier the text is to read. The Flesch-Kincaid grade score corresponds to a notional US school grade.

An ease score of 60-70 is a good target for the internet. A score of 0-30 indicates text that is very difficult to read and best suited to university graduates.
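
For reference, both scores depend only on the average sentence length and the average number of syllables per word. The sketch below uses the standard published coefficients; the function names and the counts passed in are made up for illustration and are not measurements from a bulletin.

```python
def reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def grade_level(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: a notional US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts: 100 words, 5 sentences, 150 syllables
ease = reading_ease(100, 5, 150)
grade = grade_level(100, 5, 150)
print(round(ease, 1), round(grade, 1))
```

Longer sentences and longer words both lower the ease score and raise the grade, which is why the differences between tools largely come down to how each one counts sentences and syllables.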

The automatic scoring methods tried below show wide variation in their results.
I have therefore coded an alternative version that agrees reasonably well with Microsoft Word and the readabilityformulas.com website. It produced the following results:

ARTS ASSOCIATE

21.8   Flesch-Kincaid  Reading Ease score

17.1   Flesch-Kincaid Grade Level

SENIOR AUTOMOTIVE SUPERVISOR

24.2   Flesch-Kincaid  Reading Ease score

14.7   Flesch-Kincaid Grade Level

**The following results were obtained from the free test found here**:

http://www.readabilityformulas.com/freetests/six-readability-formulas.php
                              
ARTS ASSOCIATE

23.5   Flesch-Kincaid  Reading Ease score

15.5   Flesch-Kincaid Grade Level

SENIOR AUTOMOTIVE SUPERVISOR

33.5   Flesch-Kincaid  Reading Ease score

12.2   Flesch-Kincaid Grade Level

**Microsoft Word produced the following results:**

ARTS ASSOCIATE

19.3   Flesch-Kincaid  Reading Ease score

16.1   Flesch-Kincaid Grade Level

with 1890 words and 52 sentences

SENIOR AUTOMOTIVE SUPERVISOR

24.7   Flesch-Kincaid  Reading Ease score

14.2   Flesch-Kincaid Grade Level

with 890 words and 35 sentences

**Textstat produced the following results:**

ARTS ASSOCIATE 2454 072117 REV 072817.txt

 -38.47 flesch_reading_ease 

33.1 flesch_kincaid_grade 

SENIOR AUTOMOTIVE SUPERVISOR

-15.39 Flesch-Kincaid  Reading Ease score   

26.3  Flesch-Kincaid Grade Level   


#### Textstat results

This method is not used as the results don't agree well with other online sources.


In [None]:
!pip install PyPDF2
!pip install textstat
from textstat.textstat import textstatistics, easy_word_set, legacy_round



In [None]:
#commented out as committing was crashing
# !git clone https://github.com/shivam5992/textstat.git
# !cd textstat
# !pip install textstat

In [None]:
from textstat.textstat import textstatistics, easy_word_set, legacy_round

import textstat

cnt = 0
for filename in os.listdir(bulletin_dir):
    if cnt >-1 and cnt <3:
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            text = f.read()
            print (filename)
            print ('flesch_reading_ease',textstat.flesch_reading_ease(text))
            print ('flesch_kincaid_grade',textstat.flesch_kincaid_grade(text))
            print ('smog_index',textstat.smog_index(text))
            print ('coleman_liau_index',textstat.coleman_liau_index(text))
            print ('automated_readability_index',textstat.automated_readability_index(text))
            print ('dale_chall_readability_score',textstat.dale_chall_readability_score(text))
            print ('difficult_words',textstat.difficult_words(text))
            print ('linsear_write_formula',textstat.linsear_write_formula(text))
            print ('gunning_fog',textstat.gunning_fog(text))
            print ('text_standard',textstat.text_standard(text))

            #print (text)

            print ()
        cnt +=1
 

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import cmudict

# CMU Pronouncing Dictionary: maps a word to a list of phoneme pronunciations
prondict = cmudict.dict()

def not_punctuation(w):
    """False only for single non-alphabetic characters such as ',' or '.'."""
    return not (len(w)==1 and (not w.isalpha()))

def get_word_count(text):
    return len(list(filter(not_punctuation, word_tokenize(text))))

def get_sent_count(text):
    return len(sent_tokenize(text))

def numsyllables_pronlist(w):
    """Count the phonemes ending in a stress digit (the vowels) in one pronunciation."""
    return len(list(filter(lambda s: s[-1].isdigit(), w)))

def numsyllables(word):
    """Syllable counts for each known pronunciation, or [0] if the word is not found."""
    try:
        return list(set(map(numsyllables_pronlist, prondict[word.lower()])))
    except KeyError:
        return [0]

def text_statistics(text):
    word_count = get_word_count(text)
    sent_count = get_sent_count(text)

    syllable_count = 0
    not_found = 0
    for word in word_tokenize(text):
        if word.isalpha():
            syllables = numsyllables(word)[0]
            syllable_count += syllables
            if syllables == 0:
                not_found += 1
    # estimate syllables for words missing from the dictionary using the average per word
    syllable_count += not_found * syllable_count / word_count
    return word_count, sent_count, syllable_count,not_found

def flesch_formula(word_count, sent_count, syllable_count):
    return 206.835 - 1.015*word_count/sent_count - 84.6*syllable_count/word_count

def flesch(text):
    word_count, sent_count, syllable_count,not_found = text_statistics(text)
#     print ('word_count ',word_count, 'sent_count ', sent_count, 'syllable_count ', syllable_count)
#     print ('words_not_found',not_found)
    return flesch_formula(word_count, sent_count, syllable_count)

def fk_formula(word_count, sent_count, syllable_count):
    return 0.39 * word_count / sent_count + 11.8 * syllable_count / word_count - 15.59

def flesch_kincaid(text):
    word_count, sent_count, syllable_count,not_found = text_statistics(text)
    return fk_formula(word_count, sent_count, syllable_count)


In [None]:
cnt = 0
for filename in os.listdir(bulletin_dir):
    if cnt >-1 and cnt <2:
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            text = f.read()
            text = text.replace('. . ', '' )
            print (filename)
            print ('flesch_reading_ease',flesch(text))
            print ('flesch_reading_grade',flesch_kincaid(text))
            #print (text)
            print ()
        cnt +=1
 

In [None]:
readability = []
for filename in os.listdir(bulletin_dir):
    with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
        text = f.read()
    # strip '. . ' sequences, which could otherwise be counted as sentence breaks
    text = text.replace('. . ', '' )
    readability.append([filename,flesch(text),flesch_kincaid(text)])

# column names must not contain stray spaces, or later lookups by name will fail
df_readability = pd.DataFrame(readability, columns=["FILENAME", "FLESCH_READING_EASE", "FLESCH_READING_GRADE"])
df_readability = df_readability.sort_values('FLESCH_READING_GRADE')


#### Readability Results<a id='readr'></a>
   
Many of the job bulletins have a reading ease suitable for university work. Texts need to be much easier to read.

In [None]:
df_readability.head(10)

In [None]:
df_readability = df_readability.sort_values('FLESCH_READING_GRADE',ascending = False )
df_readability.head(10)

#### Line by line Examination
This cell can be used to examine a job bulletin line by line, to see where readability is compromised.

In [None]:
cnt = 0
for filename in os.listdir(bulletin_dir):
    # change this condition to examine a different file
    if cnt >-1 and cnt <2:
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            for index,line in enumerate(f.readlines()):
                text = line.strip()
                if (text !=''):
                    print (text)
                    print ('flesch_reading_ease',flesch(text))
                    print ('flesch_reading_grade',flesch_kincaid(text))

                    print ()
        cnt +=1
 



#### Example of difficult to understand text

On the http://www.readabilityformulas.com site, the following text scored:

Flesch Reading Ease score: -99.4 (text scale)
Flesch Reading Ease scored your text: impossible to comprehend.

Flesch-Kincaid Grade Level: 57.4
Grade level: College Graduate and above.

The examination score will be based entirely on the interview.  In the interview, emphasis may be placed on the candidate's experience, training and professional development as they have provided the knowledge of: grant administration processes including collecting and reviewing applications, managing contracts, maintaining the department database of records and archives, presenting grants to the Cultural Affairs commission, and communicating information to grantees; various art disciplines including dance, literary arts, media arts, music, theater, urban and design arts, visual arts, public sculptures, monuments, murals, and  interdisciplinary or multidisciplinary art experiences where performances and other activities explore non-traditional formats and processes fusing or transcending distinct art disciplines; and the ability to logically and effectively organize priorities sufficient to plan and coordinate community art programs; seek advice regarding possible problems in order to determine how unexpected changes will affect other aspects of a project or program; conduct online and traditional research to gather data, fact check information, and prepare memoranda, letters, news releases, or reports; make recommendations regarding art programs or departmental changes based on staff experience and customer feedback, data, and qualitative histories or outcomes; persuasively communicate art program information including evaluations, department opinions, and recommended courses of action to diverse audiences through oral presentations; facilitate discussions in community meetings and grant review sessions; interact with artists, developers, contractors, agencies, other City employees, management, elected officials, the public, and others in a courteous, tactful, persuasive, and effective manner; and other necessary skills, knowledge, and abilities.



The following rewording scored:

Flesch Reading Ease score: 31.4 (text scale)
Flesch Reading Ease scored your text: difficult to read.

Flesch-Kincaid Grade Level: 12.8
Grade level: College.


The examination score will be based entirely on the interview.  In the interview, emphasis may be placed on the candidate's experience, training and professional development.  

We will be looking for experience in grant administration processes. This should include collecting and reviewing applications, managing contracts and communicating information to grantees. If you are successful you will also be involved in maintaining the department database of records and archives and presenting grants to the Cultural Affairs commission.

Do you have experience in a variety of art disciplines? For instance, dance, literary arts, media arts, music, theater, urban and design arts, visual arts, public sculptures, monuments and murals.

Have you developed any art experiences where performances and other activities explore non-traditional formats and processes? For instance, we are interested in interdisciplinary or multidisciplinary art experiences which fuse or transcend distinct art disciplines. 

You should be able to show you can logically and effectively plan and coordinate community art programs. How have you dealt with unexpected changes or other problems?

What experience do you have in gathering data from various sources including online, staff experience and customer feedback? Have you been involved in fact checking, preparing reports and news releases? Have you been involved in making written or oral recommendations regarding art programs or departmental changes based on the data gathered?

Have you facilitated discussions in community meetings and grant review sessions?

We would also like to hear about your experience of interacting with artists, developers, contractors, agencies, other City employees, management, elected officials and the public. As you would expect we insist on a courteous, tactful, persuasive, and effective approach.
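
Much of the improvement in the rewording comes from shorter sentences. As a rough sketch, the average sentence length of a passage can be estimated with a naive splitter. The helper name and the two sample strings are hypothetical, and a tokenizer such as NLTK's would handle abbreviations better.

```python
import re

def avg_words_per_sentence(text):
    """Estimate average sentence length using a naive split on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return len(words) / len(sentences) if sentences else 0.0

# Hypothetical before/after snippets in the spirit of the rewording above
before = ("Emphasis may be placed on experience, training and development "
          "as they have provided the knowledge of grant administration.")
after = "We will look at your experience. Have you managed grants?"
print(avg_words_per_sentence(before), avg_words_per_sentence(after))
```

Because sentence length feeds directly into both Flesch formulas, breaking one long sentence into several short ones moves the scores even when the vocabulary is unchanged.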

## Job Bulletins: conclusions and recommendations

The current job bulletins do not communicate to potential applicants why they should be interested in the job and the organisation. The bulletins are process driven and read like official public notices. The small print has taken over. There is a place for small print, but it needs to be demoted. After all, who reads the T&Cs?

Ideally, the format should change to reflect general best practice and the specific City of LA requirement to encourage applications that address the gender and Hispanic imbalances. Here is a proposed ideal structure:

Job Title indicating the level of the job

Salary

Why they should be interested in terms of: 

	The job challenges and opportunities
    
	The team and its culture
    
	The opportunity to progress
    
	The work environment
    
	The benefits
    
Written to help them see themselves in the role

Clear and short application notes

This would be a big change, so as a first stage the bulletins could be restructured as follows:

Job Title indicating the level of the job

Salary

Duties

Requirements

Application Process

Small print

The text should be rewritten to:

* Improve readability with shorter words and sentences.
* Replace insider/jargon phrases like “This examination is based on a validation study”.
* Reduce the negative sentiment typical of the process-driven approach that talks about examinations and disqualification.
* Express ideas in a feminine, co-operative style rather than a masculine one.
* Let the organisation's culture shine through the words.
* Show that the diversity statement is more than a set of necessary legal words.
 
Try to loosen up on the strict requirements. Studies show that women are much less likely than men to apply for a job if they fall short of any of the stated requirements. Here is an [interesting paragraph](https://hbr.org/2014/08/why-women-dont-apply-for-jobs-unless-theyre-100-qualified) about job requirements:

"There was a sizable gender difference in the responses for one other reason: 15% of women indicated the top reason they didn’t apply was because “I was following the guidelines about who should apply.” Only 8% of men indicated this as their top answer. Unsurprisingly, given how much girls are socialized to follow the rules, a habit of “following the guidelines” was a more significant barrier to applying for women than men."

## Implicit Links using FastText with sentence embedding

There is a Kaggle tutorial on Word2Vec here: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial

Here is an article which explains the idea behind Word2Vec: http://cgi.cs.mcgill.ca/~enewel3/posts/implementing-word2vec/

"One of these assumptions is the distributional hypothesis, which is the idea that the meaning of a word can be understood from the words that tend to be near it. For example “bread” might tend to show up near “eat”, “bake”, “butter”, “toast”, etc., and this entourage gives a signal of what “bread” means."

Here is the original paper: https://arxiv.org/pdf/1301.3781.pdf
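
As a toy illustration of the distributional hypothesis quoted above, the sketch below counts which words appear within a small window of a target word. The sentence, the function name and the window size are all made up for the example; Word2Vec learns dense vectors from this kind of context rather than storing raw counts.

```python
from collections import Counter

def neighbours(tokens, target, window=2):
    """Count words appearing within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # skip position i itself so the target is not counted as its own neighbour
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

# Hypothetical sentence for illustration
tokens = "we bake bread and eat bread with butter".split()
result = neighbours(tokens, "bread")
print(result)
```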

This Kaggle kernel alerted me that FastText may produce better results than Word2Vec: https://www.kaggle.com/antonsruberts/sentence-embeddings-centorid-method-vs-doc2vec

The author, Antons Ruberts, states:

"The main difference of FastText from Word2Vec is that it uses sub-word information (i.e character n-grams). While it brings additional utility to the embeddings, it also considerably slows down the process."

The method used here is identical to the one I have used previously for the Word2Vec model except that the vectors are calculated by FastText rather than Word2Vec.

By using FastText here, we retain all the advantages of the Word2Vec process and may improve the accuracy.
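
To illustrate the sub-word information mentioned in the quote above, the sketch below lists the character n-grams of a word in the way the FastText paper describes: the word is wrapped in `<` and `>` boundary markers and every n-gram within a length range is collected. The function name and the 3 to 6 range are assumptions for illustration; the settings actually used by the trained model are not shown in this notebook.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# Only trigrams here, to keep the output short
print(char_ngrams("water", min_n=3, max_n=3))
```

Related words such as "water" and "waters" share most of their n-grams, which is how sub-word models can relate rare or unseen words to familiar ones.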

First, build the FastText model and check that its vectors produce sensible word similarities.




In [None]:
df_duties = df_eda_exam.copy()
df_duties = df_duties.sort_values('JOB_DUTIES')
df_duties['JOB_DUTIES']= np.where(pd.isna(df_duties['JOB_DUTIES']), 
                            '.', 
                            df_duties['JOB_DUTIES'])

df_duties.head()

In [None]:
#just trying req text instead



df_duties = df_eda_exam.copy()

# replace missing values with '.' so the columns can be concatenated as text
for col in ['EXP_JOB_CLASS_TITLE', 'EXP_JOB_CLASS_ALT_RESP', 'EXP_JOB_CLASS_FUNCTION']:
    df_duties[col] = df_duties[col].fillna('.')

#over write duties for quick test
df_duties['JOB_DUTIES'] = df_duties['EXP_JOB_CLASS_TITLE'] + " " + df_duties['EXP_JOB_CLASS_ALT_RESP'] + " "+  df_duties['EXP_JOB_CLASS_FUNCTION'] 

df_duties.head(10)

In [None]:
df_duties = df_duties[df_duties.JOB_DUTIES != '']
df_duties.describe()

In [None]:
start = time.time()
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


total_duties = ["".join(x) for x in (df_duties['JOB_DUTIES'])]
vectorizer = TfidfVectorizer(tokenizer=normalize)
tfidf = vectorizer.fit_transform(total_duties)

cachedStopWords = stopwords.words("english")
#print(total_duties)

total_duties_l  = [x.lower() for x in total_duties]
#print(total_duties_l)

# punct = '-'
# remove_punctuation_map = dict((ord(char), ' ') for char in punct)    
remove_punctuation_map = dict((ord(char), ' ') for char in string.punctuation)    

all_words = [nltk.word_tokenize(x.translate(remove_punctuation_map)) for x in total_duties_l]

for i in range(len(all_words)):  
    all_words[i] = [w for w in all_words[i] if w not in cachedStopWords]

#print(all_words)

end = time.time()
print('run time',end - start)
print (len(df_eda_exam))
print(len(total_duties_l))


In [None]:
from gensim.models import FastText
start = time.time()
embed_size = 300
# all_words is a list of all the duties with the words separated and cleaned
# note: `size` is the gensim 3.x parameter name; gensim 4+ renamed it to `vector_size`
ft_model = FastText(all_words, size=embed_size, window=5, min_count=2, workers=1
                    ,sg=1)
print('Time to build FastText model: {} mins'.format(round((time.time() - start) / 60, 2)))


ft_model.wv.most_similar(positive=["water"])


### Duty embedding

In the following code the FastText word vectors are combined to produce vectors for each Duty. This is done by finding the average of all the embeddings improved by taking into account the tfidf scores as described in this paper: http://www2.aueb.gr/users/ion/docs/BioNLP_2016.pdf.

These Duty vectors are then used to find similarity using cosine similarity.
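
The idea can be sketched in a few lines: each duty vector is the tf-idf weighted mean of its word vectors, and two duties are compared by the cosine of the angle between their vectors. The tiny two-dimensional vectors and the weights below are invented for illustration and are not output from the model.

```python
import math

def weighted_mean(vectors, weights):
    """Return the weighted mean of equal-length vectors."""
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dims)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word vectors and tf-idf weights for two short duties
duty_a = weighted_mean([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
duty_b = weighted_mean([[1.0, 0.0], [1.0, 1.0]], [0.5, 0.5])
print(duty_a, duty_b, cosine(duty_a, duty_b))
```

Weighting by tf-idf lets distinctive words pull the duty vector further than words that appear in almost every bulletin.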


In [None]:
"""Uncomment this to see how tfidfs are stored"""
#print (tfidf)


#### sentence embedding for duties using FastText

In [None]:

start = time.time()
#tfidf is calculated above
#There is a tfidf value for every word in all_words
rows, cols = tfidf.nonzero()
print (rows)
print (cols)
rows_l = len(rows)

s_embed = []
s_embeds = []
dividend = []
atStart = True
oldr = -1
w_cnt = 0
"""using vectorization calculated in the Word2Vec section"""
# note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vocab = vectorizer.get_feature_names()
#print (vocab)
#this method of calculating the embeddings is a bit ugly but takes advantage of how tfidfs are stored
#for every non-zero tfidf entry
for i in range (rows_l):
    r = rows[i]
    c = cols[i]
    if (oldr != r):
        #new Duty and so store last embeddings
        if (atStart == False):
            #calc embedding for last Duty
            s_embed = np.divide(dividend, divisor)
            s_embeds.append(s_embed.flatten())
            
        else: 
            atStart = False
        oldr = r
        w_cnt = 0
        dividend = np.zeros((1, embed_size))
        divisor = 0

       
    #find the next word
    word = vocab[c]
    # note: `wv.vocab` is the gensim 3.x attribute; gensim 4+ uses `wv.key_to_index`
    if word in ft_model.wv.vocab:
        #word is in the vocab and so calculate its contribution to the duty vector
        wt = tfidf[r,c]
        #print (wt, word)
        w_embed = ft_model.wv[word]
        #print(w_embed)
        #print(w_embed * wt)
        dividend = np.add(dividend, w_embed * wt)
        divisor += wt
        w_cnt +=1
#    else:
#        print (word, " not in vocab")
s_embed = np.divide(dividend, divisor)
s_embeds.append(s_embed.flatten())
#print (s_embeds)
end = time.time()
print('Sentence embedding run time',end - start)
start = time.time()

q_embed_array = cosine_similarity(s_embeds, s_embeds)
print('q_embed_array',q_embed_array)
print ('q_embed_arrayshape', len(q_embed_array))
end = time.time()
print('cosine sim time',end - start)

In [None]:
"""function to produce dataframe of results of similarity tests"""
def get_sim_duties (column_head,index,sim_array,duties,duty):
    h_threshold =0.94
    l_threshold =0.9

    col_h = column_head + str(index)
    
    df_sim_d = pd.DataFrame({'Cosine':sim_array[:,index], col_h:duties['JOB_DUTIES']})

    df_sim_d_sorted = df_sim_d.sort_values('Cosine',ascending = False )
    if df_sim_d_sorted.iloc[0]['Cosine'] > .9999:
        df_sim_d_sorted = df_sim_d_sorted.drop(df_sim_d_sorted.index[0])

    h_num = 0
    l_num = 0
    worst_h_num = -1
    i = 0
    duties_len = len(duties)
    #while i< duties_len and df_sim_d_sorted.iloc[i]['Cosine'] > l_threshold:
        #print ('i, df_sim_d_sorted.iloc[i]['Cosine']')
#         if df_sim_d_sorted.iloc[i]['Cosine'] > l_threshold:
#             l_num += 1
#             worst_match_to_profs= i
#         if df_sim_q_sorted.iloc[i]['Cosine'] > h_threshold:
#             worst_h_num = i
#             h_num += 1
#         i += 1
    
    df_sim_d_sample = df_sim_d_sorted[:10]
        
    best_cos_0 = df_sim_d_sample.iloc[0]['Cosine']
    best_cos_9 = df_sim_d_sample.iloc[9]['Cosine']
    
    df_sim_d_sample = df_sim_d_sample.drop ('Cosine', axis=1).reset_index()
    df_sim_d_sample = df_sim_d_sample.drop ( 'index', axis=1)

    df_sim_d_sample_T = df_sim_d_sample.T
    df_sim_d_sample_T.insert(loc=0, column='Job', value=[duty.iloc[index]['JOB_CLASS_TITLE']] )
    df_sim_d_sample_T.insert(loc=1, column='Duty', value=[duty.iloc[index]['JOB_DUTIES']]  )
    df_sim_d_sample_T.insert(loc=2, column='best_cos', value=best_cos_0)
    df_sim_d_sample_T.insert(loc=3, column='10th_best_cos', value=best_cos_9)
#     df_sim_q_sample_T.insert(loc=4, column='similar Q to students', value= h_num)
#     df_sim_q_sample_T.insert(loc=5, column='Qs to profs', value=l_num)
    df_sim_d_sample_T.insert(loc=4, column='best matches', value=' ')

#     if worst_h_num > -1:
#         df_sim_q_sample_T.insert(loc=17, column='worst match to students', value=df_sim_q_sorted.iloc[worst_h_num][col_h])
#     else:
#         df_sim_q_sample_T.insert(loc=17, column='worst match to students', value='not available')
    
#     if l_num > 0:
#         df_sim_q_sample_T.insert(loc=18, column='worst match to profs', value=df_sim_q_sorted.iloc[worst_match_to_profs][col_h])
#     else:
#         df_sim_q_sample_T.insert(loc=18, column='worst match to profs', value='not available')
    
    
    
    return ( df_sim_d_sample_T)

In [None]:
"""Compare  Duty with Duty using sentence embedding"""
start = time.time()

results_T = get_sim_duties ('Duty',0,q_embed_array,df_duties,df_duties)
# for i in range(1,sample_len):
for i in range(1,20):

    next_result = get_sim_duties ('Duty',i,q_embed_array,df_duties,df_duties) 
    results_T = pd.concat([results_T,next_result])
results_FastText = results_T.T
pd.options.display.max_colwidth = 500
pd.options.display.max_seq_items = 2000
end = time.time()
print('df time',end - start)

display (results_FastText) 

# Implicit linking using FastText and word embedding<a id='implicitlinks'></a>

In the dataframe above, the first rows are for the job being considered. The rows below show the duties of jobs that are considered to be similar.

The similarity is measured by the cosine value: the closer it is to 1, the better the match. The table shows that it is difficult to distinguish between duties, as they all have similar cosine values. This is mainly because the job-specific words are overpowered by the common process words. Matching based on the level of the job is also unhelpful (being a manager of x does not imply suitability to be a manager of y). Overall, there is too little text to make a useful match.

As the provision of implicit links is not part of the competition, this line of enquiry has been terminated.

Recommendations for future work:
When the requirements have been re-worked with less precision and less process wording, the approach could be retried. Tuning is possible by removing common unhelpful words. Better still, is there a source with a more extensive description of each job that could be used?
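
The tuning suggestion can be pictured with a small sketch that drops any word occurring in more than a chosen fraction of the documents before the text is embedded. This was not applied in this notebook; the function name, the 0.8 cut-off and the sample token lists are assumptions for illustration.

```python
from collections import Counter

def drop_common_words(docs, max_doc_fraction=0.8):
    """Remove words that occur in more than max_doc_fraction of the documents."""
    # document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    limit = max_doc_fraction * len(docs)
    return [[w for w in doc if doc_freq[w] <= limit] for doc in docs]

# Hypothetical tokenized requirement texts
docs = [
    ["two", "years", "experience", "as", "plumber"],
    ["two", "years", "experience", "as", "electrician"],
    ["four", "years", "experience", "as", "accountant"],
]
filtered = drop_common_words(docs)
print(filtered)
```

Once the shared process wording is removed, the words left behind are the ones that actually distinguish one job from another.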



## Output for competition

### The structured data file

In [None]:
df_job_class.to_csv("lacitystructureddatafile.csv")
df_job_class.describe()

### The revised data dictionary

In [None]:
df_new_kddict.to_csv("reviseddatadictonary.csv")
df_new_kddict.describe()

### Explicit link data

In [None]:
df_explicit.to_csv("explicitlinkdata.csv")
df_explicit.describe()

In [None]:
# #Code to find files in available datasets
# import os
# inputFolder = '../input/'
# for root, directories, filenames in os.walk(inputFolder):
#    for filename in filenames:
#        print(os.path.join(root,filename))


In [None]:
#Used for cleaning data
#find duplicates class titles
#no duplicates now

# df_eda_exam_len = len(df_eda_exam)
# print (df_eda_exam_len)
# index = 0
# while index < df_eda_exam_len:
#     index2 = 0
#     #print (index)
#     var = df_eda_exam.iloc[index]['JOB_CLASS_TITLE']
#     while index2 < df_eda_exam_len:
#         if df_eda_exam.iloc[index]['FILE_NAME'] != df_eda_exam.iloc[index2]['FILE_NAME']:
     
#             if var == df_eda_exam.iloc[index2]['JOB_CLASS_TITLE']:
#                 print (df_eda_exam.iloc[index]['FILE_NAME'], df_eda_exam.iloc[index2]['FILE_NAME'])
#         index2 += 1
#     index += 1


In [None]:
# #Used for cleaning data
# #find duplicates class codes

# df_eda_exam_len = len(df_eda_exam)
# print (df_eda_exam_len)
# index = 0
# while index < df_eda_exam_len:
#     index2 = 0
#     #print (index)
#     var = df_eda_exam.iloc[index]['JOB_CLASS_NO']
#     while index2 < df_eda_exam_len:
#         if df_eda_exam.iloc[index]['FILE_NAME'] != df_eda_exam.iloc[index2]['FILE_NAME']:
     
#             if var == df_eda_exam.iloc[index2]['JOB_CLASS_NO']:
#                 print (df_eda_exam.iloc[index]['FILE_NAME'], df_eda_exam.iloc[index2]['FILE_NAME'])
#         index2 += 1
#     index += 1


In [None]:
# #find number of open jobs where internal explicit candidates has been identified

# df_job_class_len = len(df_job_class)
# print (df_job_class_len)
# index = 0
# jobs_found = 0
# jobs_not_found = 0
# job_no_sub = 0
# job_chk = 0
# while index < df_job_class_len:
#     if df_job_class.iloc[index]['EXP_JOB_CLASS_TITLE'] != '':
#         job_chk +=1
#         print (df_job_class.iloc[index]['EXAM_TYPE'])
#         if re.search('ON AN INTERDEPARTMENTAL PROMOTIONAL AND AN OPEN COMPETITIVE BASIS',df_job_class.iloc[index]['EXAM_TYPE'] ) or re.search('ON AN OPEN COMPETITIVE BASIS',df_job_class.iloc[index]['EXAM_TYPE']) :
#             print (df_job_class.iloc[index]['FILE_NAME'], df_job_class.iloc[index]['EXAM_TYPE'])
#             jobs_found += 1
            
            
#             print ("exp", df_job_class.iloc[index]['FILE_NAME'],df_job_class.iloc[index]['EXP_JOB_CLASS_TITLE'])
#         elif re.search('ON AN DEPARTMENTAL PROMOTIONAL BASIS',df_job_class.iloc[index]['EXAM_TYPE']) \
#                 or re.search('ON AN INTERDEPARTMENTAL PROMOTIONAL BASIS',df_job_class.iloc[index]['EXAM_TYPE']):
#             jobs_not_found += 1
#     else:
#         job_no_sub += 1
#     index += 1
# print ('jobs_found',jobs_found)
# print ('jobs_not_found',jobs_not_found)
# print ('jos_no_sub',job_no_sub)

# print ('job_chk',job_chk)
