# Data Science for Good: City of Los Angeles

# 0. Workflow stages

The competition solution workflow goes through the following stages.

1. Question definition and goal.
2. Prepare and cleanse the data.
3. Analyze, identify patterns, and explore the data.
4. Recommendations.



# 1. Question definition and goal

**Question**

The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

**Goal**

The goal is to convert a folder full of plain-text job postings into a single structured CSV file and then to use this data to:
1. identify language that can negatively bias the pool of applicants;
2. improve the diversity and quality of the applicant pool; 
3. make it easier to determine which promotions are available to employees in each job class.



# 2. Prepare and cleanse the data
**Objective**

Parse the job bulletin text files and create an output dataframe with the structure given in "Sample job class export template.csv". The output dataframe is exported to "job_bulletins.csv".

## 2.1 Headings
After looking at the data in the text files, we can observe that the job bulletins follow a specific pattern or format, which we can use to parse the text. For example, the headings are written in upper-case letters and appear in almost the same order in every bulletin.

In [None]:
import os, sys
import pandas as pd,numpy as np
import re
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import xml.etree.ElementTree as ET  #cElementTree was removed in Python 3.9
from collections import OrderedDict
import matplotlib.pyplot as plt
import seaborn as sns
import json
from datetime import datetime
import calendar
from wordcloud import WordCloud ,STOPWORDS
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import word2vec
from sklearn.manifold import TSNE
from nltk import pos_tag
from nltk.help import upenn_tagset
import gensim
import matplotlib.colors as mcolors
plt.style.use('ggplot')

In [None]:
'''
data directory 
Job Bulletins: This directory contains the job bulletins in text format.
Additional data: This directory contains additional data in pdf and csv format.
'''
bulletin_dir = "../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Job Bulletins"
bulletins=os.listdir(bulletin_dir)
additional_data_dir = '../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Additional data'

In [None]:
headings = {}
for filename in os.listdir(bulletin_dir):
    with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
        for line in f.readlines():
            line = line.replace("\n","").replace("\t","").replace(":","").strip()
            
            if line.isupper():
                if line not in headings.keys():
                    headings[line] = 1  #add the heading as a new key into dictionary 
                else:
                    count = int(headings[line])
                    headings[line] = count+1  #add value 1 to the corresponding heading 

In [None]:
headings.pop('$103,606 TO $151,484', None) #This is not a heading, it's an Annual Salary component
headingsFrame = [] 
for i,j in (sorted(headings.items(), key = lambda kv:(kv[1], kv[0]), reverse = True)):
    headingsFrame.append([i,j]) #convert the dictionary into dataframe
headingsFrame = pd.DataFrame(headingsFrame)
headingsFrame.columns = ["Heading","Count"]
headingsFrame.head()
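
The heading count above can be tried on a small in-memory sample. The sketch below is a minimal, self-contained version; the sample text and the helper name `count_headings` are assumptions made for illustration and are not part of the dataset.

```python
from collections import Counter

# Hypothetical bulletin text; the real files live in the Job Bulletins folder.
sample = """SYSTEMS ANALYST
Class Code:       1596
ANNUAL SALARY

$50,000 to $70,000
DUTIES

A Systems Analyst analyzes procedures.
REQUIREMENTS/MINIMUM QUALIFICATIONS
"""

def count_headings(text):
    """Count lines written entirely in upper case, ignoring tabs and colons."""
    counts = Counter()
    for line in text.splitlines():
        line = line.replace("\t", "").replace(":", "").strip()
        if line.isupper():
            counts[line] += 1
    return counts

heading_counts = count_headings(sample)
print(heading_counts.most_common())
```

A salary line written as `$103,606 TO $151,484` also passes `str.isupper()`, because its only cased characters are the upper-case "TO". That is why this entry has to be removed from the headings by hand.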

## 2.2 Basic information
We extract the basic information of "FILE_NAME", "OPEN_DATE", "POSITION", "JOB_CLASS_NO" into a dataframe df here. 

In [None]:
#Add 'FILE_NAME', 'POSITION', 'JOB_CLASS_NO'
data_list = []
for filename in os.listdir(bulletin_dir):
    with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
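        position=''
        #Reset per file so that a bulletin without these fields does not inherit the previous file's values
        job_bulletin_date=''
        job_class_no=''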
        for line in f.readlines():
            #Parse job bulletins
            if "Open Date:" in line:
                job_bulletin_date = line.split("Open Date:")[1].split("(")[0].strip()
            if "Class Code:" in line:
                job_class_no = line.split("Class Code:")[1].split(".")[0].split("C")[0].strip().split(" ")[0]
            if len(position)<2 and len(line.strip())>1:
                position=line.split("Class Code:")[0].strip().lower()
        data_list.append([filename, job_bulletin_date, position, job_class_no])

In [None]:
df = pd.DataFrame(data_list)
df.columns = ["FILE_NAME", "OPEN_DATE", "POSITION", "JOB_CLASS_NO"]
df.head()
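
As a sketch of the same idea with regular expressions, the snippet below pulls the title, class code, and open date out of a few header lines. The sample lines and the helper name `parse_header` are hypothetical; this is an alternative to the `split`-based parsing above, not a replacement for it.

```python
import re

# Hypothetical header lines in the style of a job bulletin.
sample_lines = [
    "SYSTEMS ANALYST",
    "Class Code:       1596",
    "Open Date:  10-27-17 (Exam Open to All, including Current City Employees)",
]

def parse_header(lines):
    """Return (position, class_code, open_date); missing fields stay empty."""
    position, class_code, open_date = "", "", ""
    for line in lines:
        if not position and line.strip():
            # The first non-empty line is treated as the job title.
            position = line.split("Class Code:")[0].strip().lower()
        code = re.search(r"Class Code:\s*(\d+)", line)
        if code:
            class_code = code.group(1)
        date = re.search(r"Open Date:\s*([0-9/\-]+)", line)
        if date:
            open_date = date.group(1)
    return position, class_code, open_date

print(parse_header(sample_lines))
```

Because every field starts out empty inside the function, a bulletin that lacks one of the fields yields an empty string rather than a value carried over from another file.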

## 2.3 Requirement
We extract the requirements of the jobs into a dataframe df_requirements here. 

In [None]:
#Add 'REQUIREMENT_SET_ID','REQUIREMENT_SUBSET_ID','REQUIREMENT_TEXT'
requirements = []
requirementHeadings = [k for k in headingsFrame['Heading'].values if 'requirement' in k.lower()]
for filename in os.listdir(bulletin_dir):
    with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
        readNext = 0
        isNumber=0
        prevNumber=0
        prevLine=''
        
        for line in f.readlines():
            clean_line = line.replace("\n","").replace("\t","").replace(":","").strip()   
            if readNext == 0:                         
                if clean_line in requirementHeadings:
                    readNext = 1
            elif readNext == 1:
                if clean_line in headingsFrame['Heading'].values:
                    if isNumber>0:
                        requirements.append([filename,prevNumber,'',prevLine])
                    break
                elif len(clean_line)<2:
                    continue
                else:
                    rqrmntText = clean_line.split('.')
                    if len(rqrmntText)<2:
                        requirements.append([filename,'','',clean_line])
                    else:                        
                        if rqrmntText[0].isdigit():
                            if isNumber>0:
                                requirements.append([filename,prevNumber,'',prevLine])
                            isNumber=1
                            prevNumber=rqrmntText[0]
                            prevLine=clean_line
                        elif re.match('^[a-z]$',rqrmntText[0]):
                            requirements.append([filename,prevNumber,rqrmntText[0],prevLine+' - '+clean_line])
                            isNumber=0
                        else:
                            requirements.append([filename,'','',clean_line])

In [None]:
df_requirements = pd.DataFrame(requirements)
df_requirements.columns = ['FILE_NAME','REQUIREMENT_SET_ID','REQUIREMENT_SUBSET_ID','REQUIREMENT_TEXT']
df_requirements.head()

In [None]:
#Check for one sample file 
df_requirements.loc[df_requirements['FILE_NAME']=='SYSTEMS ANALYST 1596 102717.txt']
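
The numbering logic can be illustrated on a handful of lines. The following is a simplified sketch, assuming requirements are numbered `1.`, `2.`, ... with optional lettered sub-items `a.`, `b.`, ...; the sample lines and the helper name `split_requirements` are made up for illustration, and the code is not identical to the cell above.

```python
import re

# Hypothetical requirement lines: numbered sets with lettered subsets.
sample = [
    "1. Two years of full-time paid experience; or",
    "2. A bachelor's degree and one of the following:",
    "a. One year of experience as an analyst; or",
    "b. Completion of 12 semester units.",
]

def split_requirements(lines):
    """Return (set_id, subset_id, text) rows from numbered and lettered lines."""
    rows = []
    current_set, current_text, has_subset = "", "", False
    for line in lines:
        m = re.match(r"^(\d+|[a-z])\.\s*(.*)$", line.strip())
        if not m:
            rows.append(("", "", line.strip()))
            continue
        marker, body = m.groups()
        if marker.isdigit():
            # Flush the previous set if it had no lettered subsets.
            if current_set and not has_subset:
                rows.append((current_set, "", current_text))
            current_set, current_text, has_subset = marker, body, False
        else:
            # A lettered line is stored together with the text of its parent set.
            rows.append((current_set, marker, current_text + " - " + body))
            has_subset = True
    if current_set and not has_subset:
        rows.append((current_set, "", current_text))
    return rows

rows = split_requirements(sample)
for row in rows:
    print(row)
```

Each lettered sub-item becomes its own row that repeats the parent text, which mirrors the REQUIREMENT_SET_ID / REQUIREMENT_SUBSET_ID layout of the output dataframe.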

## 2.4 Salary
We extract the salary of the jobs into a dataframe df_salary here. 

In [None]:
#Check for salary components
salHeadings = [k for k in headingsFrame['Heading'].values if 'salary' in k.lower()]
sal_list = []
for filename in os.listdir(bulletin_dir):
    with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
        readNext = 0
        for line in f.readlines():
            clean_line = line.replace("\n","").replace("\t","").replace(":","").strip()  
            if clean_line in salHeadings:
                readNext = 1
            elif readNext == 1:
                if clean_line in headingsFrame['Heading'].values:
                    break
                elif len(clean_line)<2:
                    continue
                else:
                    sal_list.append([filename, clean_line])

In [None]:
df_salary = pd.DataFrame(sal_list)
df_salary.columns = ['FILE_NAME','SALARY_TEXT']
df_salary.head()

However, the salary information comes in different formats and differs between departments. We therefore use regular expressions to extract the salary numbers, storing them in df_salary_gen for city workers in departments other than DWP and in df_salary_dwp for Department of Water and Power workers.

In [None]:
files = []
for filename in os.listdir(bulletin_dir):
    files.append(filename)

In [None]:
#Add 'ENTRY_SALARY_GEN','ENTRY_SALARY_DWP'
pattern = r'\$?\ ?([0-9]{1,3},\ ?([0-9]{3},)*\ ?[0-9]{3}|[0-9]+)(\.[0-9][0-9])?\*?'  #the decimal point is escaped so it only matches a literal '.'
df_salary_dwp=pd.DataFrame(columns=['FILE_NAME','ENTRY_SALARY_START_DWP','ENTRY_SALARY_END_DWP'])
df_salary_gen=pd.DataFrame(columns=['FILE_NAME','ENTRY_SALARY_START_GEN','ENTRY_SALARY_END_GEN'])
dwp_salary_list = {}
gen_salary_list = {}
for filename in files:
    for sal_text in df_salary.loc[df_salary['FILE_NAME']==filename]['SALARY_TEXT']:
        if 'department of water' in sal_text.lower():
            if filename in dwp_salary_list.keys():
                continue
            matches = re.findall(pattern+' to '+pattern, sal_text) 
            if len(matches)>0:
                salary_dwp = ' - '.join([x for x in matches[0] if x and not x.endswith(',')])
                start=matches[0][0]
                end=matches[0][3]
            else:
                matches = re.findall(pattern, sal_text)
                if len(matches)>0:
                    salary_dwp = matches[0][0]
                else:
                    salary_dwp = 0
                start=salary_dwp
                end=salary_dwp
            dwp_salary_list[filename]= salary_dwp
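            #Strip thousands separators; str() also covers the integer 0 used when no salary was found
            start=int(str(start).replace(',','').replace(' ',''))
            end=int(str(end).replace(',','').replace(' ',''))
            #DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
            df_salary_dwp=pd.concat([df_salary_dwp,pd.DataFrame([{'FILE_NAME':filename,'ENTRY_SALARY_START_DWP':start,
                               'ENTRY_SALARY_END_DWP':end}])],ignore_index=True)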

            
        else:
            if filename in gen_salary_list.keys():
                continue
            matches = re.findall(pattern+' to '+pattern, sal_text)
            if len(matches)>0:
                salary_gen = ' - '.join([x for x in matches[0] if x and not x.endswith(',')])
                start=matches[0][0]
                end=matches[0][3]
            else:
                matches = re.findall(pattern, sal_text)
                if len(matches)>0:
                    salary_gen = matches[0][0]
                else:
                    salary_gen = 0
                start=salary_gen
                end=salary_gen
            gen_salary_list[filename]= salary_gen
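            #Strip thousands separators; str() also covers the integer 0 used when no salary was found
            start=int(str(start).replace(',','').replace(' ',''))
            end=int(str(end).replace(',','').replace(' ',''))
            #DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
            df_salary_gen=pd.concat([df_salary_gen,pd.DataFrame([{'FILE_NAME':filename,'ENTRY_SALARY_START_GEN':start,
                               'ENTRY_SALARY_END_GEN':end}])],ignore_index=True)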
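
To see the salary extraction in isolation, here is a minimal sketch that works on a single line of text. The pattern names `AMOUNT`, `RANGE`, `SINGLE` and the helper `first_salary_range` are assumptions for this example. Unlike the pattern above, this variant requires the dollar sign, which avoids picking up unrelated numbers at the cost of missing amounts written without it.

```python
import re

# Matches a dollar amount such as "$68,611" or "$ 68,611.50"; group 1 is the number.
AMOUNT = r"\$\s?([0-9]{1,3}(?:,\s?[0-9]{3})+|[0-9]+)(?:\.[0-9]{2})?"
RANGE = re.compile(AMOUNT + r"\s+to\s+" + AMOUNT, flags=re.IGNORECASE)
SINGLE = re.compile(AMOUNT)

def to_int(amount):
    """Drop thousands separators and stray spaces, then convert to int."""
    return int(amount.replace(",", "").replace(" ", ""))

def first_salary_range(text):
    """Return (start, end) of the first range, a flat rate twice, or None."""
    m = RANGE.search(text)
    if m:
        return to_int(m.group(1)), to_int(m.group(2))
    m = SINGLE.search(text)
    if m:
        value = to_int(m.group(1))
        return value, value
    return None

print(first_salary_range("$68,611 to $100,307; $80,000 to $90,000"))
print(first_salary_range("$95,000 (flat-rated)"))
print(first_salary_range("Salary to be determined"))
```

Returning `None` when nothing matches keeps "no salary found" distinct from a salary of zero.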


        


## 2.5 Education major
We use nltk to process the requirement text and create an 'EDUCATION_MAJOR' column in df_requirements. The main idea is to create part-of-speech tags first and then collect the noun phrases that follow the words major in/majoring/apprenticeship.

In [None]:
def preprocess(txt):
    txt = nltk.word_tokenize(txt)
    txt = nltk.pos_tag(txt)
    return txt

In [None]:
def getEducationMajor(row):
    txt = row['REQUIREMENT_TEXT']
    txtMajor = ''
    if 'major in' not in txt.lower() and ' majoring ' not in txt.lower():
        return txtMajor
    result = []
    
    istart = txt.lower().find(' major in ')
    if istart!=-1:
        txt = txt[istart+10:]
    else:
        istart = txt.lower().find(' majoring ')
        if istart==-1:
            return txtMajor
        txt = txt[istart+12:]
    
    txt = txt.replace(',',' or ').replace(' and/or ',' or ').replace(' a closely related field',' related field')
    sent = preprocess(txt)
    pattern = """
            NP: {<DT>? <JJ>* <NN.*>*}
           BR: {<W.*>|<V.*>} 
        """
    cp = nltk.RegexpParser(pattern)
    cs = cp.parse(sent)
    #print(cs)
    checkNext = 0
    for subtree in cs.subtrees():
        if subtree.label()=='NP':
            result.append(' '.join([w for w, t in subtree.leaves()]))
            checkNext=1
        elif checkNext==1 and subtree.label()=='BR':
            break
    return '|'.join(result)

In [None]:
#Add EDUCATION_MAJOR
df_requirements['EDUCATION_MAJOR']=df_requirements.apply(getEducationMajor, axis=1)

In [None]:
df_requirements.loc[df_requirements['EDUCATION_MAJOR']!=''].head()
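
The chunking step inside getEducationMajor can be looked at on its own. The sketch below feeds hand-tagged tokens to `nltk.RegexpParser`, so it runs without downloading the tokenizer or tagger data. The tags are an assumption made for illustration, and the noun rule uses `+` instead of `*` so that an NP chunk always contains at least one noun.

```python
import nltk

# Hand-tagged tokens (an assumption for illustration); the notebook obtains
# the tags from nltk.pos_tag, which needs the tagger data to be downloaded.
tagged = [
    ("computer", "NN"), ("science", "NN"), ("or", "CC"),
    ("information", "NN"), ("systems", "NNS"), ("or", "CC"),
    ("a", "DT"), ("related", "JJ"), ("field", "NN"),
    ("is", "VBZ"), ("required", "VBN"),
]

# NP: optional determiner, adjectives, then one or more nouns.
# BR: a wh-word or verb, used as the signal to stop collecting majors.
grammar = r"""
    NP: {<DT>?<JJ>*<NN.*>+}
    BR: {<W.*>|<V.*>}
"""

def collect_majors(tagged_tokens):
    """Join the noun phrases that appear before the first verb or wh-word."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    majors = []
    for subtree in tree.subtrees():
        if subtree.label() == "NP":
            majors.append(" ".join(word for word, tag in subtree.leaves()))
        elif majors and subtree.label() == "BR":
            break
    return "|".join(majors)

majors = collect_majors(tagged)
print(majors)
```

The quality of the result depends on the tagger: a mis-tagged noun or verb moves the break point, so the extracted majors are worth spot-checking.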

In [None]:
#function to fill majors for apprenticeship programs
def getApprenticeshipMajor(row):
    txt = row['REQUIREMENT_TEXT']
    txtMajor = row['EDUCATION_MAJOR']
    if 'apprenticeship' not in txt.lower():
        return txtMajor
    if txtMajor != '':
        return txtMajor
    result = []
    
    istart = txt.lower().find(' apprenticeship program')
    if istart!=-1:
        txt = txt[istart+23:]
    else:
        istart = txt.lower().find(' apprenticeship ')
        if istart==-1:
            return txtMajor
        txt = txt[istart+15:]
    
    txt = txt.replace(',',' or ').replace(' full-time ',' ')
    sent = preprocess(txt)
    pattern = """
            NP: {<DT>? <JJ>* <NN>*}
           BR: {<W.*>|<V.*>} 
        """
    cp = nltk.RegexpParser(pattern)
    cs = cp.parse(sent)
    #print(cs)
    checkNext = 0
    for subtree in cs.subtrees():
        if subtree.label()=='NP':
            result.append(' '.join([w for w, t in subtree.leaves()]))
            checkNext=1
        elif checkNext==1 and subtree.label()=='BR':
            break
    return '|'.join(result)

In [None]:
df_requirements['EDUCATION_MAJOR']=df_requirements.apply(getApprenticeshipMajor, axis=1)

In [None]:
df_requirements[(df_requirements['EDUCATION_MAJOR']!='') & (df_requirements['REQUIREMENT_TEXT'].str.contains('apprentice'))].head()

## 2.6 Duties
We extract the duties of the jobs into a dataframe df_duties here. 

In [None]:
def getValues(searchText, COL_NAME):
    data_list = []
    dataHeadings = [k for k in headingsFrame['Heading'].values if searchText in k.lower()]

    for filename in os.listdir(bulletin_dir):
        with open(bulletin_dir + "/" + filename, 'r', errors='ignore') as f:
            readNext = 0 
            datatxt = ''
            for line in f.readlines():
                clean_line = line.replace("\n","").replace("\t","").replace(":","").strip()   
                if readNext == 0:                         
                    if clean_line in dataHeadings:
                        readNext = 1
                elif readNext == 1:
                    if clean_line in headingsFrame['Heading'].values:
                        break
                    else:
                        datatxt = datatxt + ' ' + clean_line
            data_list.append([filename,datatxt.strip()])
    result = pd.DataFrame(data_list)
    result.columns = ['FILE_NAME',COL_NAME]
    return result

In [None]:
#Add JOB_DUTIES
df_duties = getValues('duties','JOB_DUTIES')

In [None]:
print(df_duties['JOB_DUTIES'].loc[df_duties['FILE_NAME'] == 'AIRPORT POLICE SPECIALIST 3236 063017 (2).txt'].values)
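
The section extraction performed by getValues can be sketched on an in-memory string. The sample text, the set of headings, and the helper name `extract_section` are hypothetical.

```python
# Hypothetical bulletin text and heading set for illustration.
sample = """ANNUAL SALARY
$50,000 to $70,000
DUTIES
A Systems Analyst analyzes procedures
and recommends improvements.

REQUIREMENTS
1. A bachelor's degree.
"""
headings = {"ANNUAL SALARY", "DUTIES", "REQUIREMENTS"}

def extract_section(text, keyword, known_headings):
    """Return the text between the first heading containing `keyword` and the next heading."""
    collecting = False
    parts = []
    for line in text.splitlines():
        clean = line.replace("\t", "").replace(":", "").strip()
        if not collecting:
            if clean in known_headings and keyword in clean.lower():
                collecting = True
        elif clean in known_headings:
            break
        elif clean:
            parts.append(clean)
    return " ".join(parts)

duties_text = extract_section(sample, "duties", headings)
print(duties_text)
```

If no heading contains the keyword, the function returns an empty string, which matches the empty JOB_DUTIES value produced for such files.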

## 2.7 Title and school
We update df_requirements with the 'EXP_POSITION' and 'SCHOOL_TYPE' columns.

In [None]:
#Function to retrieve values that match with pre-defined values 
def section_value_extractor( document, section, subterms_dict, parsed_items_dict ):
    retval = OrderedDict()
    single_section_lines = document.lower()
    
    for node_tag, pattern_string in subterms_dict.items():
        pattern_list = re.split(r",|:", pattern_string[0])#.sort(key=len)
        pattern_list=sorted(pattern_list, key=len, reverse=True)
        #print (pattern_list)
        matches=[]
        for pattern in pattern_list:
            if pattern.lower() in single_section_lines:
                matches.append(pattern)
                single_section_lines = single_section_lines.replace(pattern.lower(),'')
        #print (matches)
        if len(matches):
            info_string = ", ".join(list(matches)) + " "
            retval[node_tag] = info_string
    return retval

In [None]:
#Function to read the XML configuration and return it as a list of OrderedDicts
def read_config( configfile ):
    root = ET.fromstring(configfile)
    config = []
    for child in root:
        term = OrderedDict()
        term["Term"] = child.get('name', "")
        for level1 in child:
            term["Method"] = level1.get('name', "")
            term["Section"] = level1.get('section', "")
            for level2 in level1:
                term[level2.tag] = term.get(level2.tag, []) + [level2.text]

        config.append(term)
    json_result = json.dumps(config, indent=4)
    return config

In [None]:
def parse_document(document, config):
    parsed_items_dict = OrderedDict()

    for term in config:
        term_name = term.get('Term')
        extraction_method = term.get('Method')
        extraction_method_ref = globals()[extraction_method]
        section = term.get("Section")
        subterms_dict = OrderedDict()
        
        for node_tag, pattern_list in list(term.items())[3:]:
            subterms_dict[node_tag] = pattern_list
        parsed_items_dict[term_name] = extraction_method_ref(document, section, subterms_dict, parsed_items_dict)

    return parsed_items_dict
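
The interplay between the XML configuration and the term matching can be shown with a reduced example. The job titles, the requirement sentence, and the helper name `find_terms` below are illustrative assumptions; the matching mirrors section_value_extractor by trying longer terms first and removing each match from the text.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical configuration in the same shape as the notebook's config.
config_xml = """
<Config-Specifications>
    <Term name="Requirements">
        <Method name="section_value_extractor" section="RequirementSection">
            <SchoolType>College or University,High School,Apprenticeship</SchoolType>
            <JobTitle>Systems Analyst,Management Assistant</JobTitle>
        </Method>
    </Term>
</Config-Specifications>
"""

def find_terms(text, comma_separated_terms):
    """Return the listed terms found in `text`, longest first, each matched once."""
    remaining = text.lower()
    found = []
    for term in sorted(comma_separated_terms.split(","), key=len, reverse=True):
        if term.lower() in remaining:
            found.append(term)
            # Remove the match so a shorter term inside it is not counted again.
            remaining = remaining.replace(term.lower(), "")
    return found

root = ET.fromstring(config_xml)
method = root.find("Term/Method")
requirement = ("Two years of experience as a Management Assistant "
               "and a degree from a college or university.")
for node in method:
    print(node.tag, find_terms(requirement, node.text))
```

Sorting by length matters when one title contains another: the longer title is matched and removed first, so the shorter one is not reported as a separate hit.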

In [None]:
#Read job_titles and use them as patterns to extract positions from the requirement text
job_titles = pd.read_csv(additional_data_dir+'/job_titles.csv', header=None)

job_titles = ','.join(job_titles[0])
job_titles = job_titles.replace('\'','').replace('&','and')

In [None]:
configfile = r'''
<Config-Specifications>
<Term name="Requirements">
        <Method name="section_value_extractor" section="RequirementSection">
            <SchoolType>College or University,High School,Apprenticeship,Certificates</SchoolType>
            <JobTitle>'''+job_titles+'''</JobTitle>
        </Method>
    </Term>
</Config-Specifications>
'''

In [None]:
config = read_config(configfile)
result = df_requirements['REQUIREMENT_TEXT'].apply(lambda k: parse_document(k,config))
i=0
df_requirements['EXP_POSITION']=''
df_requirements['SCHOOL_TYPE']=''
for item in (result.values):
    for requirement,dic in list(item.items()):        
        if 'JobTitle' in dic.keys():
            df_requirements.loc[i,'EXP_POSITION'] = dic['JobTitle']
        if 'SchoolType' in dic.keys():
            df_requirements.loc[i,'SCHOOL_TYPE'] = dic['SchoolType']
    i=i+1

In [None]:
#Let's check the result for one sample file
df_requirements[df_requirements['FILE_NAME']=='SYSTEMS ANALYST 1596 102717.txt'][['FILE_NAME','EXP_POSITION','SCHOOL_TYPE']]

In [None]:
#Combine all the results into one dataframe
result = pd.merge(df, df_requirements, how='inner', left_on='FILE_NAME', right_on='FILE_NAME', sort=True)

result = pd.merge(result, df_salary_dwp, how='left', left_on='FILE_NAME', right_on='FILE_NAME', sort=True)

result = pd.merge(result, df_salary_gen, how='left', left_on='FILE_NAME', right_on='FILE_NAME', sort=True)

result = pd.merge(result, df_duties, how='left', left_on='FILE_NAME', right_on='FILE_NAME', sort=True)

In [None]:
# result.drop(columns=['REQUIREMENT_TEXT'], inplace=True)
result[result['FILE_NAME']=='SYSTEMS ANALYST 1596 102717.txt']

## 2.8 Data columns

**Columns Added** 
>        'FILE_NAME', 'OPEN_DATE',  'POSITION', 'JOB_CLASS_NO', 'REQUIREMENT_SET_ID', 
       'REQUIREMENT_SUBSET_ID', 'REQUIREMENT_TEXT', 'EDUCATION_MAJOR','EXP_POSITION', 
       'SCHOOL_TYPE','ENTRY_SALARY_START_DWP','ENTRY_SALARY_END_DWP', 
       'ENTRY_SALARY_START_GEN','ENTRY_SALARY_END_GEN', 'JOB_DUTIES'
      

In [None]:
df=result
df.to_csv('job_bulletins.csv')

# 3. Explore the data

## 3.1 Jobs


In [None]:
print("There are %d text files in bulletin directory and there are %d different jobs available." %
      (len(bulletins),df['POSITION'].nunique()))

In [None]:
plt.figure(figsize=(8,5))
text=' '.join(job for job in df['POSITION'])                               ##joining data with spaces so titles do not run together
text=word_tokenize(text)
jobs=Counter(text)                                                         ##counting number of occurrences
jobs_class=[job for job in jobs.most_common(12) if len(job[0])>3]          ##selecting most common words
a,b=map(list, zip(*jobs_class))
sns.barplot(x=b,y=a,palette='rocket')                                       ##creating barplot (keyword arguments are required by recent seaborn versions)
plt.title('Job sectors')
plt.xlabel("count")
plt.ylabel('sector')

We can see that the service sector dominates in creating opportunities.

**Has job opportunities really increased recently?**




In [None]:
'''Extracting year out of opendate timestamp object and counting
    the number of occurrences of each year using value_counts() '''
df['OPEN_DATE']=pd.to_datetime(df['OPEN_DATE'])
df['YEAR_OF_OPEN']=[date.year for date in df['OPEN_DATE']]

#Sort by year so the x-axis is chronological instead of relying on a hard-coded list of years
count=df['YEAR_OF_OPEN'].value_counts().sort_index()
years=[str(int(y)) for y in count.index]
plt.figure(figsize=(7,5))
plt.plot(years,count.values,color='blue')

plt.title('Opportunities over the years')
plt.xlabel('years')
plt.ylabel('count')
plt.xticks(rotation=45)
plt.show()

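- The graph suggests that the number of bulletins opened per year has been increasing since 2013.
- These counts come from the open dates of the bulletins in this dataset, so they are only a rough indicator of how job opportunities have changed over time.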

**Which month of the year offers most opportunities?**

In [None]:
'''Extracting month out of opendate timestamp object and counting
    the number of occurrences of each month using value_counts() '''


plt.figure(figsize=(7,5))
df['OPEN_MONTH']=[z.month for z in df['OPEN_DATE']]
count=df['OPEN_MONTH'].value_counts().sort_index()                          ##calendar order
sns.barplot(y=count.values,x=count.index,palette='rocket')
plt.gca().set_xticklabels([calendar.month_name[int(x)] for x in count.index],rotation=45)
plt.show()

We can see that more job opportunities are created in **March, October, and December**.


## 3.2 Salary
**What are the best paid jobs in LA?**

In [None]:
'''finding the most paid 10 jobs at LA'''
df.ENTRY_SALARY_START_GEN=df.ENTRY_SALARY_START_GEN.astype(float)
most_paid=df[['POSITION','ENTRY_SALARY_START_GEN']].groupby('POSITION',as_index=False).mean()
most_paid=most_paid.sort_values(by='ENTRY_SALARY_START_GEN',ascending=False)[:10]
plt.figure(figsize=(7,5))
sns.barplot(y=most_paid['POSITION'],x=most_paid['ENTRY_SALARY_START_GEN'],palette='rocket')
plt.title('Best paid jobs in LA')

**Which jobs have the widest entry salary range?**

In [None]:
#Calculate the width of the salary range: |salary end - salary start|
df['SALARY_DIFF']=abs(df['ENTRY_SALARY_END_GEN'].astype(float)-df['ENTRY_SALARY_START_GEN'].astype(float))
ranges=df[['POSITION','SALARY_DIFF']].groupby('POSITION',as_index=False).mean().sort_values(by='SALARY_DIFF',ascending=False)[:10]
plt.figure(figsize=(7,5))
sns.barplot(y=ranges['POSITION'],x=ranges['SALARY_DIFF'],palette='RdBu')   ##plotting



## 3.3 Requirements
### 3.3.1 Word cloud

In [None]:
def show_wordcloud(data, title = None):
    
    
    '''function to produce and display a wordcloud
        takes 2 arguments
        1. data to produce the wordcloud
        2. title of the wordcloud'''
    
    
    wordcloud = WordCloud(
        background_color='white',
        stopwords=set(STOPWORDS),
        max_words=250,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(str(data))

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()
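#Build the cloud from the requirement text; `text` still holds the job-title tokens from section 3.1
show_wordcloud(' '.join(df['REQUIREMENT_TEXT']),'REQUIREMENTS')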

### 3.3.2 Most influential words in requirements

In [None]:
req=' '.join(text for text in df['REQUIREMENT_TEXT'])
lem=WordNetLemmatizer()
#Treat every requirement as one document; fitting on single tokens would never produce 2- or 3-grams
text=[' '.join(lem.lemmatize(w) for w in word_tokenize(t)) for t in df['REQUIREMENT_TEXT']]
vect=TfidfVectorizer(ngram_range=(1,3),max_features=100)
vectorized_data=vect.fit_transform(text)
#id_map=dict((v,k) for k,v in vect.vocabulary_.items())
vect.vocabulary_.keys()

### 3.3.3 Most common requirements

In [None]:
token=word_tokenize(req)
counter=Counter(token)
count=[x[0] for x in counter.most_common(40) if len(x[0])>3]
print("Most common words in Requirement")
print(count)

It can be observed that the bulletins prefer candidates who are:
- **experienced**
- **educated professionals** with a **degree from an accredited university**
- willing to work **full-time**

### 3.3.4 Word 2 Vec and TSNE

In [None]:
def build_corpus(df,col):
    
    '''function to build corpus from dataframe'''
    lem=WordNetLemmatizer()
    corpus= []
    for x in df[col]:
        
        
        words=word_tokenize(x)
        corpus.append([lem.lemmatize(w) for w in words])
    return corpus

corpus=build_corpus(df,'REQUIREMENT_TEXT')
model = word2vec.Word2Vec(corpus, vector_size=100, window=20, min_count=30, workers=4)  #gensim>=4.0; use size=100 on gensim 3.x


In [None]:
def tsne_plot(model,title=''):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []

    #gensim>=4.0 exposes the vocabulary as wv.key_to_index (wv.vocab on gensim 3.x)
    for word in model.wv.key_to_index:
        tokens.append(model.wv[word])
        labels.append(word)
    
    tsne_model = TSNE(perplexity=80, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(12, 12)) 
    plt.title(title)
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()
    
tsne_plot(model,'Requirements')

## 3.4 Duties
### 3.4.1 Word cloud

In [None]:
duties= ' '.join(d for d in df['JOB_DUTIES'])
show_wordcloud(duties,'Duties')

### 3.4.2 Most influential words in duties

In [None]:
lem=WordNetLemmatizer()
#Treat the duties of every row as one document so that 2- and 3-grams can be formed
text=[' '.join(lem.lemmatize(w) for w in word_tokenize(d)) for d in df['JOB_DUTIES']]
vect=TfidfVectorizer(ngram_range=(1,3),max_features=200)
vectorized_data=vect.fit_transform(text)
#id_map=dict((v,k) for k,v in vect.vocabulary_.items())
vect.vocabulary_.keys()

### 3.4.3 Most common words in duties

In [None]:
token=word_tokenize(duties)
counter=Counter(token)
count=[x[0] for x in counter.most_common(40) if len(x[0])>3]
print("Most common words in Duties")
print(count)

### 3.4.4 Word 2 Vec and TSNE

In [None]:
corpus=build_corpus(df,'JOB_DUTIES')
model = word2vec.Word2Vec(corpus, vector_size=100, window=20, min_count=40, workers=4)  #gensim>=4.0; use size=100 on gensim 3.x

tsne_plot(model,'Duties')

## 3.5 Gender bias

In the following section we investigate whether any gender-biased terms are used in the **Requirement** and **Duties** sections of the job bulletins.   
To do so, we POS-tag all the text in both fields and then:
- extract the words that have a pronoun tag;
- check whether any gendered terms such as he/she are used.


In [None]:
def pronoun(data):
    
    '''function to tokenize data and perform pos_tagging. Returns a Counter of tokens having the "PRP" tag'''
    
    token=word_tokenize(data)
    pos=pos_tag(token)
    
    return Counter([x[0] for x in pos if x[1]=='PRP'])
    


req_prn=pronoun(req)
duties_prn=pronoun(duties)
print('pronouns used in requirement section are')
print(req_prn.keys())
print('\npronouns used in duties section are')
print(duties_prn.keys())
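
As a cross-check that does not depend on the tagger, a plain word-list search can be run over the same text. This is a different and simpler technique than the POS tagging above: the `PRP` tag covers personal pronouns, while possessive forms carry the `PRP$` tag in the Penn Treebank tagset, so a word list also catches "his" and possessive "her". The word list and the helper name `gendered_terms` are assumptions for this sketch.

```python
import re
from collections import Counter

# Assumed word list; extend it as needed.
GENDERED = {"he", "she", "him", "her", "his", "hers", "himself", "herself"}

def gendered_terms(text):
    """Count whole-word occurrences of gendered pronouns, case-insensitively."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in GENDERED)

example = "The candidate must submit his or her application; they may apply online."
print(gendered_terms(example))
```

Applied to the notebook's data, the function could be called as `gendered_terms(req)` and `gendered_terms(duties)`; an empty Counter would support the conclusion that the text is gender-neutral.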


1. Surprisingly, no gender-biased pronouns can be found in the **Requirement** or **Duties** sections.
2. All the pronouns tagged PRP are gender-neutral.

# 4. Recommendations 

## 4.1  First goal

> identify language that can negatively bias the pool of applicants

We find there is a lot we can do in the job descriptions to reduce bias in the applicant pool.

1. The reading grade level is too high. There are too many multi-syllable words and too many words per sentence.
 - Our suggestion is to use simpler, lower-grade-level words in the descriptions.


2. The content might be overly formal, which reduces readability.
 - Our suggestion is to use less formal words so that more people can read and understand the job descriptions and then apply.


3. The postings are generally too long and exceed a 700-word limit.
 - Our suggestion is to simplify or visually break up the description.
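
The three observations above can be quantified with a small readability helper. The sketch below computes the word count, the average sentence length, and the Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The syllable counter is a rough vowel-group heuristic chosen for this example, so the grade is an approximation; the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (assumed heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text):
    """Return word count, words per sentence, and approximate Flesch-Kincaid grade."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        return {"words": 0, "words_per_sentence": 0.0, "fk_grade": 0.0}
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    # Flesch-Kincaid grade level formula.
    grade = 0.39 * words_per_sentence + 11.8 * syllables / len(words) - 15.59
    return {"words": len(words), "words_per_sentence": words_per_sentence, "fk_grade": grade}

simple = "The cat sat. The dog ran."
formal = "Candidates must demonstrate considerable administrative responsibility."
print(readability(simple))
print(readability(formal))
```

Running `readability` over each bulletin's full text would give a per-posting word count to compare against the 700-word threshold and a grade level to track as the wording is simplified.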



## 4.2 Second goal

> improve the diversity and quality of the applicant pool

We believe a well standardized job description can significantly improve the diversity and quality of the applicant pool.

1. Standardize the degree and coursework selections (for example, as a drop-down) so that they can be mapped more easily.
2. Standardize the education type into a drop-down so that "four-year degree" and "Bachelor's" are kept in sync.
    - Some postings state four-year college but not necessarily a degree.
3. Standardize the job titles across the different data sources.
    - Ideally, job titles should match in the requirements and across documents and job graphs.
4. Standardize the certification, license, and training course names across descriptions.
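
One way to approach such standardization is a lookup table that maps the phrasings found in the bulletins to a small set of canonical values. The sketch below is only an illustration: the canonical labels, the variant phrases, and the helper name `normalize_education` are all assumptions, and a real list would have to be agreed with the City.

```python
# Hypothetical canonical values and the phrasings mapped to them.
CANONICAL = {
    "bachelor's degree": ("bachelor's degree", "four-year degree", "baccalaureate"),
    "four-year college attendance": ("four-year college", "four years of college"),
    "high school": ("high school", "g.e.d."),
}

def normalize_education(text):
    """Return the canonical education types whose variants appear in `text`."""
    lowered = text.lower()
    return [canon for canon, variants in CANONICAL.items()
            if any(variant in lowered for variant in variants)]

print(normalize_education("Graduation from a four-year degree program"))
print(normalize_education("Completion of a four-year college curriculum"))
```

Keeping "four-year college attendance" separate from "bachelor's degree" reflects the point that some postings mention a four-year college without requiring a degree.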
 
 


## 4.3 Third goal

> make it easier to determine which promotions are available to employees in each job class

The promotions are not shown clearly in the text job postings; we find them in the Job Paths, which are PDF files. Extracting the tree relationships from the PDF files is extremely difficult. Moreover, the Job Path diagrams already make it clear and easy enough to determine which promotions are available to employees in each job class, so we do not pursue this goal further.



### Acknowledgement 
Thanks to Sahil Tyagi's [Kernel](https://www.kaggle.com/tyagit3/starter-text-bulletins-to-dataframe) for instructions on extracting the information from text bulletins and Shahules786's [Kernel](https://www.kaggle.com/shahules/discovering-opportunities-at-la/log#Which-are-the-best-paid-jobs-in-LA?) for inspiring the items investigated here. 