# City of Los Angeles Analysis

The City of Los Angeles faces a major hiring challenge: one third of its 50,000 workers are eligible to retire by July 2020. The city has partnered with Kaggle to create a competition to improve the job bulletins that will be used to fill those open positions.

The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

The goal is to convert a folder full of plain-text job postings into a structured CSV file and then to use this data to:

(1) identify language that can negatively bias the pool of applicants;

(2) improve the diversity and quality of the applicant pool; and/or

(3) make it easier to determine which promotions are available to employees in each job class

This notebook attempts to extract data for each of the required columns; the current status is listed below. Comments and suggestions on the code are welcome. And as usual, if you find this kernel useful then **please do not forget to upvote**. Happy Kaggling :)

## Things to do

1) Data Cleansing
    
    a) FileName (Completed)
    b) JOB_CLASS_TITLE (Completed)
    c) JOB_CLASS_NO (Completed)
    d) REQUIREMENT_SET_ID
    e) REQUIREMENT_SUBSET_ID
    f) JOB_DUTIES (Completed)
    g) EDUCATION_YEARS (Completed)
    h) SCHOOL_TYPE (Completed)
    i) EDUCATION_MAJOR
    j) EXPERIENCE_LENGTH (Completed)
    k) FULL_TIME_PART_TIME (Completed)
    l) EXP_JOB_CLASS_TITLE
    m) EXP_JOB_CLASS_ALT_RESP
    n) EXP_JOB_CLASS_FUNCTION
    o) COURSE_COUNT
    p) COURSE_LENGTH
    q) COURSE_SUBJECT
    r) MISC_COURSE_DETAILS
    s) DRIVERS_LICENSE_REQ
    t) DRIV_LIC_TYPE
    u) ADDTL_LIC
    v) EXAM_TYPE
    w) ENTRY_SALARY_GEN
    x) ENTRY_SALARY_DWP
    y) OPEN_DATE

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats

In [None]:
sample_output_df = pd.read_csv("../input/cityofla/CityofLA/Additional data/sample job class export template.csv")
sample_output_df.T

In [None]:
kaggle_dic = pd.read_csv("../input/cityofla/CityofLA/Additional data/kaggle_data_dictionary.csv")
kaggle_dic

In [None]:
job_bulletins_path = "../input/cityofla/CityofLA/Job Bulletins/"
print("Number of Job bulletins : ",len(os.listdir(job_bulletins_path)))

In [None]:
os.listdir(job_bulletins_path)[0:2]

In [None]:
#Preview one bulletin, using the same encoding as the loading cell below
with open(job_bulletins_path + os.listdir(job_bulletins_path)[15], encoding = "ISO-8859-1") as f:
    print(f.read(5000))

### A) Filename

In [None]:
#Extracting file name
jobs_list = []
for file_name in os.listdir(job_bulletins_path):
    with open(job_bulletins_path + file_name, encoding = "ISO-8859-1") as f:
        content = f.read()
        jobs_list.append([file_name, content])
jobs_df = pd.DataFrame(jobs_list)
jobs_df.columns = ["FileName", "Content"]
jobs_df.head()

### B) JOB_CLASS_TITLE

In [None]:
#Extracting Job class title
import re

def extract_job_class_title(text):
    #Keep only the alphabetic runs of the file name, then drop the last one (the "txt" extension)
    word1 = " ".join(re.findall("[a-zA-Z]+", text))
    return word1.rsplit(' ', 1)[0]
    
jobs_df["JOB_CLASS_TITLE"] = jobs_df["FileName"].apply(lambda x: extract_job_class_title(x))
jobs_df.head()
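
To make the file-name parsing concrete, here is a minimal sketch that applies the same logic to two made-up file names. The helper name `title_from_file_name` and both file names are assumptions for illustration only. It also shows a limitation: any digits that belong to the title itself are dropped.

```python
import re

def title_from_file_name(file_name):
    # Same idea as extract_job_class_title: keep alphabetic runs,
    # then drop the last one, which is the "txt" extension.
    letters_only = " ".join(re.findall("[a-zA-Z]+", file_name))
    return letters_only.rsplit(" ", 1)[0]

# Hypothetical file names, used only for illustration
print(title_from_file_name("SENIOR ACCOUNTANT 1523 100518.txt"))  # SENIOR ACCOUNTANT
print(title_from_file_name("311 DIRECTOR 9206 041814.txt"))       # DIRECTOR (the leading "311" is lost)
```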

### C) JOB_CLASS_NO

In [None]:
#Extracting Job Class No
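def extract_job_class_no(text):
    #First run of digits in the file name; None when the name contains no digits
    numbers_found = re.findall(r'\d+', text)
    if numbers_found:
        return int(numbers_found[0])
    return None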

jobs_df["JOB_CLASS_NO"] = jobs_df["FileName"].apply(lambda x: extract_job_class_no(x))
jobs_df.head()  

In [None]:
#Only one posting has no job class number
jobs_df[jobs_df.isnull().any(axis=1)]
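
Taking the first run of digits can pick up a number that is part of the title rather than the class code. A possible alternative, sketched below, looks for the first standalone four-digit number instead. This assumes class codes are always four digits, which should be checked against the data; the helper name and file names are hypothetical.

```python
import re

def class_no_from_file_name(file_name):
    # Assumption: the class code is the first run of exactly four digits.
    # The lookarounds stop the pattern from matching part of a longer number
    # such as a six-digit date stamp.
    match = re.search(r"(?<!\d)\d{4}(?!\d)", file_name)
    return int(match.group()) if match else None

# Hypothetical file names, used only for illustration
print(class_no_from_file_name("311 DIRECTOR 9206 041814.txt"))  # 9206
print(class_no_from_file_name("ACCOUNTANT 1513 100518.txt"))    # 1513
print(class_no_from_file_name("NO CLASS CODE.txt"))             # None
```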

### D) REQUIREMENT_SET_ID

In [None]:
#Still Unclear on what these mean

### E) REQUIREMENT_SUBSET_ID

In [None]:
#Still Unclear on what these mean

### F) JOB_DUTIES 

In [None]:
def extract_job_duties(text):
    #Use the first sentence that contains the "DUTIES" heading
    sentences = re.findall(r"([^.]*\.)" ,text)
    for sentence in sentences:
        if 'DUTIES' in sentence:
            lines = sentence.split('\n')
            #The duties text is normally the fifth line; fall back to the fourth when the sentence is shorter
            try:
                return lines[4]
            except IndexError:
                return lines[3]

            
jobs_df["JOB_DUTIES"] = jobs_df["Content"].apply(lambda x: extract_job_duties(x))
jobs_df.head()
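
Picking a fixed line index depends on how many blank lines surround the heading. One alternative, sketched here under the assumption that section headings sit alone on a line in capital letters, is to capture everything between a heading and the next one. The function name `extract_section` and the sample text are made up for illustration.

```python
import re

def extract_section(text, heading):
    # Capture the text after `heading` up to the next all-caps heading line
    # (or the end of the text). Returns None if the heading is not found.
    pattern = (
        rf"^{re.escape(heading)}[ \t]*\n"          # the heading on its own line
        r"(.*?)"                                   # section body (lazy)
        r"(?=^[ \t]*[A-Z][A-Z /]+[ \t]*$|\Z)"      # stop at the next heading or end
    )
    match = re.search(pattern, text, flags=re.MULTILINE | re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical bulletin text, used only for illustration
sample = (
    "ACCOUNTANT\n"
    "Class Code: 1513\n\n"
    "DUTIES\n\n"
    "An Accountant does professional accounting work.\n\n"
    "REQUIREMENTS\n\n"
    "1. A degree.\n"
)
print(extract_section(sample, "DUTIES"))
```

The same helper could be reused for other sections by changing the `heading` argument.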

### G) EDUCATION_YEARS

In [None]:
#Extracting data for education years and education type
def extract_edu_info(text):
    words = 'college university'.split(' ')
    sentences = re.findall(r"([^.]*\.)" ,text)  
    for sentence in sentences:
        if any(word in sentence for word in words):
            return sentence
            

jobs_df["EDUCATION_INFO"] = jobs_df["Content"].apply(lambda x: extract_edu_info(x))
jobs_df.head()

#A large share of postings (62%) have no sentence mentioning a college or university.
#Note: the check below counts rows with a null in any column, not only EDUCATION_INFO.
jobs_df[jobs_df.isnull().any(axis=1)].shape[0]/jobs_df.shape[0] * 100

In [None]:
numbers = ["one","two","three", "four","five","six","seven","eight","nine"]

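def extract_edu_years(text):
    #Postings without an education sentence have no text to search
    if not isinstance(text, str):
        return None
    #Tokenise on letters only so hyphens ("four-year") and punctuation do not hide a number word
    tokens = re.findall(r'[a-z]+', text.lower())
    return set(tokens).intersection(numbers)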
        
jobs_df["EDUCATION_YEARS"] = jobs_df["EDUCATION_INFO"].apply(lambda x: extract_edu_years(x))
jobs_df.head()
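
The column above holds sets of number words. If numeric values are wanted instead, a lookup table is one option. The sketch below is only an illustration; `WORD_TO_INT` and `number_words_to_ints` are hypothetical names, and the sample sentences are invented.

```python
import re

# Map the number words "one" to "nine" to their integer values
WORD_TO_INT = {
    word: value
    for value, word in enumerate(
        ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"],
        start=1,
    )
}

def number_words_to_ints(text):
    # Return the integer value of every number word, in order of appearance
    if not isinstance(text, str):
        return []
    tokens = re.findall(r"[a-z]+", text.lower())
    return [WORD_TO_INT[token] for token in tokens if token in WORD_TO_INT]

# Invented sentences, used only for illustration
print(number_words_to_ints("Graduation from an accredited four-year college or university."))  # [4]
print(number_words_to_ints("Two years of college and one year of experience."))                # [2, 1]
```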

### H) SCHOOL_TYPE

In [None]:
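def extract_school_type(text):
    #Postings without an education sentence have no text to search
    if not isinstance(text, str):
        return None
    #Tokenise on letters only so trailing punctuation ("university.") does not block a match
    tokens = re.findall(r'[a-z]+', text.lower())
    return set(tokens).intersection({'college', 'university'})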
        
jobs_df["SCHOOL_TYPE"] = jobs_df["EDUCATION_INFO"].apply(lambda x: extract_school_type(x))
jobs_df.head()

### I) EDUCATION_MAJOR

In [None]:
filelist = os.listdir("../input/cityofla/CityofLA/Additional data/City Job Paths")

In [None]:
#Still figuring out how to do this; as a first attempt, match job path file names against the education text
filelist = [i.split('.')[0].replace('_',' ').lower() for i in filelist]
text = jobs_df.EDUCATION_INFO[0].lower()
x = [i for i in filelist if i in text]
print(x)

### J) EXPERIENCE_LENGTH

In [None]:
#Extracting the EXPERIENCE_LENGTH

numbers = ["one","two","three", "four","five","six","seven","eight","nine"]

def extract_exp_len(text):
    words = 'full-time paid experience'.split(' ')
    sentences = re.findall(r"([^.]*\.)" ,text)

    #Keep only the sentences that mention one of the experience keywords
    matches = [sentence for sentence in sentences if any(word in sentence for word in words)]
    if not matches:
        return None

    #Number words found in the first matching sentence
    return set(matches[0].lower().split()).intersection(set(numbers))

jobs_df["EXPERIENCE_LENGTH"] = jobs_df["Content"].apply(lambda x: extract_exp_len(x))
jobs_df.head()

In [None]:
jobs_df[jobs_df['EXPERIENCE_LENGTH'].isnull()]
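
A set of number words does not say whether the figure refers to years or months. As a rough sketch, the number and its unit can be captured together and converted to years. The names `EXP_PATTERN` and `extract_experience_years` are hypothetical, the sample sentences are invented, and only durations written as words from "one" to "nine" are handled.

```python
import re

WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9}

# A number word followed by "year(s)" or "month(s)"
EXP_PATTERN = re.compile(
    r"\b(" + "|".join(WORD_TO_INT) + r")\s+(year|month)s?\b",
    re.IGNORECASE,
)

def extract_experience_years(text):
    # Return the first stated duration in years, or None if none is found
    match = EXP_PATTERN.search(text)
    if match is None:
        return None
    value = WORD_TO_INT[match.group(1).lower()]
    unit = match.group(2).lower()
    return float(value) if unit == "year" else value / 12

# Invented sentences, used only for illustration
print(extract_experience_years("Two years of full-time paid experience as an Accountant."))  # 2.0
print(extract_experience_years("Six months of full-time paid experience."))                  # 0.5
```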

### K) FULL_TIME_PART_TIME

In [None]:
def extract_FULL_TIME_PART_TIME(text):
    words = 'full-time part-time'.split(' ')
    sentences = re.findall(r"([^.]*\.)" ,text)  

    for sentence in sentences:
        if any(word in sentence for word in words):
            x = set(sentence.split()).intersection(set(words))
            return x
        
jobs_df["FULL_TIME_PART_TIME"] = jobs_df["Content"].apply(lambda x: extract_FULL_TIME_PART_TIME(x))
jobs_df.head()
jobs_df[jobs_df['FULL_TIME_PART_TIME'].isnull()].shape[0]
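
Splitting on whitespace misses keywords that carry punctuation or capital letters, for example "Full-Time,". A case-insensitive regular expression is one way around this. The sketch below is an illustration only; `find_time_status` is a hypothetical name and the sample sentence is invented.

```python
import re

def find_time_status(text):
    # Match "full-time"/"part-time", also written with a space, in any case
    found = re.findall(r"\b(full|part)[- ]time\b", text, flags=re.IGNORECASE)
    # Normalise to lower-case hyphenated labels and remove duplicates
    return sorted({f"{prefix.lower()}-time" for prefix in found})

# Invented sentence, used only for illustration
sample = "Two years of Full-Time, paid experience; part time work is prorated."
print(find_time_status(sample))  # ['full-time', 'part-time']
```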