# Data Science for Good: City of Los Angeles

## Problem Objective:
Help the City of Los Angeles structure and analyze its job descriptions.

The City of Los Angeles faces a big hiring challenge: 1/3 of its 50,000 workers are eligible to retire by July of 2020. The city has partnered with Kaggle to create a competition to improve the job bulletins that will fill all those open positions.

The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

The goal is to convert a folder full of plain-text job postings into a structured CSV file and then to use this data to:

(1) identify language that can negatively bias the pool of applicants;

(2) improve the diversity and quality of the applicant pool; and/or

(3) make it easier to determine which promotions are available to employees in each job class.


## Import Packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import xml.etree.ElementTree as ET
import zipfile
import os
from os import walk
import shutil
from shutil import copytree, ignore_patterns
from PIL import Image
from wand.image import Image as Img
import matplotlib.pyplot as plt
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS

%matplotlib inline

In [None]:
print(os.listdir("../input"))

In [None]:
bulletin_dir = "../input/cityofla/CityofLA/Job Bulletins"
addl_data_dir = "../input/cityofla/CityofLA/Additional data"

## Introduction
Let us start by looking at some of the job postings to get an idea of what they look like.


In [None]:
pdf = '../input/cityofla/CityofLA/Additional data/PDFs/2014/September 2014/09262014/PRINCIPAL INSPECTOR 4226.pdf'
Img(filename=pdf, resolution=200)

In [None]:
pdf = '../input/cityofla/CityofLA/Additional data/PDFs/2018/December/Dec 7/COMMERCIAL FIELD REPRESENTATIVE 1600 120718.pdf'
Img(filename=pdf, resolution=200)

## Let's check `sample job class export template.csv`

In [None]:
sample_job_template = pd.read_csv(os.path.join(addl_data_dir, 'sample job class export template.csv'))
sample_job_template

## Let's check `kaggle_data_dictionary.csv`

In [None]:
data_dictionary = pd.read_csv(os.path.join(addl_data_dir, 'kaggle_data_dictionary.csv'))
data_dictionary.head()

## Job titles listing:

In [None]:

job_titles = pd.read_csv(os.path.join(addl_data_dir, 'job_titles.csv'), names=['JOB TITLES'])
job_titles.head()

## Iterate over Job Bulletins directory
We are also given the job descriptions as plain-text files. Let us count the files and look at the first few lines of one of them.

In [None]:
job_files = os.listdir(bulletin_dir)
print("Number of files in Job Bulletins folder:", len(job_files))

In [None]:
with open(os.path.join(bulletin_dir, job_files[0]), errors='ignore') as file:
    print("File name: ",file.name)
    print("=====================================")
    print(file.read(1000))

## Data Extraction

In this section, let us extract the data and create a structured table out of it.


In [None]:
# Code from the starter kernel to iterate over Job Bulletins directory
data_list = []
for filename in os.listdir(bulletin_dir):
    # Reset per file so a bulletin missing a field does not inherit the previous file's value
    class_code, job_bulletin_date = None, None
    with open(os.path.join(bulletin_dir, filename), 'r', errors='ignore') as f:
        for line in f:
            # The class code sits between "Class Code:" and an optional "Open Date" on the same line
            if "Class Code:" in line:
                class_code = line.split("Class Code:")[1].split("Open Date")[0].strip()
            # The open date runs up to an optional parenthesised note
            if "Open Date:" in line:
                job_bulletin_date = line.split("Open Date:")[1].split("(")[0].strip()
    data_list.append([filename, class_code, job_bulletin_date])
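The string splitting above can also be written with regular expressions. The following is a minimal sketch, not the starter kernel's method: `parse_header`, both patterns, and the sample text are hypothetical, and the patterns assume a numeric class code and a dash-separated date.

```python
import re

# Hypothetical patterns: a numeric class code and a date made of digits and dashes.
CLASS_CODE_RE = re.compile(r"Class\s+Code:\s*(\d+)")
OPEN_DATE_RE = re.compile(r"Open\s+Date:\s*(\d{1,2}-\d{1,2}-\d{2,4})")

def parse_header(text):
    """Return (class_code, open_date) as strings; a missing field becomes None."""
    code = CLASS_CODE_RE.search(text)
    date = OPEN_DATE_RE.search(text)
    return (code.group(1) if code else None,
            date.group(1) if date else None)

# Made-up sample text that imitates the top of a bulletin.
sample = (
    "PRINCIPAL INSPECTOR\n\n"
    "Class Code:       4226\n"
    "Open Date:  09-26-14\n"
    "(Exam Open to Current City Employees)\n"
)
print(parse_header(sample))
```

Unlike the loop, this returns `None` for a missing field rather than relying on a default set elsewhere, which makes incomplete bulletins easy to spot later with `df.isna()`.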

In [None]:
# Form a DataFrame
df = pd.DataFrame(data_list, columns=["FILE_NAME", "CLASS_CODE", "OPEN_DATE"])
df.head()

In [None]:
df.info()

### Let's convert `OPEN_DATE` to `datetime`

In [None]:
df.OPEN_DATE = pd.to_datetime(df.OPEN_DATE)
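If some bulletins carry a malformed or missing date, a plain `pd.to_datetime` call can raise an error. The sketch below shows one way to handle that, using made-up sample values; the `format="%m-%d-%y"` string is an assumption about how the bulletins write dates and should be checked against the real data.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately not a date.
raw = pd.Series(["09-26-14", "12-07-18", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT
# instead of raising, so bad rows can be inspected afterwards.
parsed = pd.to_datetime(raw, format="%m-%d-%y", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values after the conversion gives a quick check on how many bulletins need a closer look.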

In [None]:
df.info()

In [None]:
df.tail()

## Convert to CSV

In [None]:
df.to_csv("job_bulletins.csv", index=False)

In [None]:
job_df = pd.read_csv("job_bulletins.csv")
job_df.head()
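Note that a CSV file does not store column types, so reading it back with default settings infers them again: dates come back as text and numeric-looking codes as integers. A small sketch of restoring the types on read, using an in-memory CSV with made-up values (the leading zero in the class code is there only to make the effect visible):

```python
import io
import pandas as pd

# Made-up CSV content with the same columns as job_bulletins.csv.
csv_text = "FILE_NAME,CLASS_CODE,OPEN_DATE\nEXAMPLE 0123.txt,0123,2018-12-07\n"

# Default read: CLASS_CODE is inferred as an integer, OPEN_DATE stays text.
default_df = pd.read_csv(io.StringIO(csv_text))

# Typed read: keep CLASS_CODE as a string and parse OPEN_DATE as a datetime.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"CLASS_CODE": str},
    parse_dates=["OPEN_DATE"],
)

print(default_df.dtypes)
print(typed_df.dtypes)
```

The same `dtype` and `parse_dates` arguments can be passed when reading `job_bulletins.csv`.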

## More to come. Stay tuned!