# Data Science for Good: City of Los Angeles
# Starter Kernel

### Introduction

The City of Los Angeles has a variety of job classes, but unfortunately much of the data regarding these classes is stored in free-form job bulletins.

We see tremendous value in structuring this data, as it could help us better understand our workforce and improve our hiring processes.

*That's why we're asking for your help!* 

**We need you to use your data skills to take the information from the job bulletins and store it in a structured CSV.**

**After you've structured the CSV, we want you to analyze the job bulletin data. The job bulletin analysis would ideally cover one of the three topics below:**
1. Identify language that can bias the pool of applicants
2. Improve the diversity and quality of the applicant pool
3. Increase the discoverability of promotional pathways

**Lastly, make sure your code/analysis is well documented, as we may use the results in the future to improve the City's hiring processes.**

Below is the standard Kaggle intro cell, which explains the environment we're operating in and imports pandas, numpy, and os.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['cityofla']


### Structuring the Job Bulletin Data into a CSV

Here we're collecting data from all of the job bulletins and storing it in a CSV.

The code block below goes through all of the job bulletins, extracts the FILE_NAME and OPEN_DATE from each, and appends them to a list.

We're only collecting two features here, but ideally your code would collect a lot more. Please refer to the **kaggle_data_dictionary - output_fields.csv** and the **sample job class export template.csv** to see all the features that we're interested in.

Additional consideration will be given to those who extract features beyond those listed in the data dictionary we provided. Just be sure to provide an updated data dictionary so that we can understand what you have collected.

In [2]:
bulletin_dir = "../input/cityofla/CityofLA/Job Bulletins"
data_list = []
for filename in os.listdir(bulletin_dir):
    # Reset per file so a bulletin without an open date does not reuse the previous file's value
    job_bulletin_date = None
    with open(os.path.join(bulletin_dir, filename), 'r', errors='ignore') as f:
        for line in f:
            # Insert code to parse job bulletins
            if "Open Date:" in line:
                # Keep the text after "Open Date:" and before any parenthetical note
                job_bulletin_date = line.split("Open Date:")[1].split("(")[0].strip()
        data_list.append([filename, job_bulletin_date])
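As one example of collecting more than two features, the file names shown in this kernel's output appear to follow the pattern "TITLE, four-digit number, six-digit date". The sketch below pulls the title and the four-digit number out of a file name. The pattern is an assumption inferred from those few names, not a documented convention, and `parse_bulletin_filename`, `title`, and `class_number` are hypothetical names used only for illustration.

```python
import re

# Assumed pattern: "<TITLE> <4 digits> <6 digits>[anything].txt".
# Inferred from the file names printed in this kernel, not from a spec.
FILENAME_PATTERN = re.compile(
    r"^(?P<title>.+?)\s+(?P<class_number>\d{4})\s+(?P<date>\d{6})\b"
)

def parse_bulletin_filename(filename):
    """Return (title, class_number) from a file name, or (None, None) if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None, None
    return match.group("title").strip(), match.group("class_number")

# File names copied from the outputs shown in this kernel.
sample_names = [
    "PARK MAINTENANCE SUPERVISOR 3145 102618.txt",
    "HOUSING INVESTIGATOR 8516 062918.txt",
    "POLICE LIEUTENANT 2232 020918.txt",
]
parsed = [parse_bulletin_filename(name) for name in sample_names]
print(parsed)
```

File names that do not fit the assumed pattern return `(None, None)`, so they can be flagged for manual review rather than silently mis-parsed.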

We now put the data from the list into a dataframe so that we can conduct our analysis, and save the dataframe as a CSV.

In [3]:
df = pd.DataFrame(data_list)

In [4]:
df.columns = ["FILE_NAME", "OPEN_DATE"]

In [5]:
df.head()

Unnamed: 0,FILE_NAME,OPEN_DATE
0,PARK MAINTENANCE SUPERVISOR 3145 102618.txt,10-26-18
1,MOTION PICTURE AND TELEVISION MANAGER 1789 111...,11-17-17
2,HOUSING INVESTIGATOR 8516 062918.txt,06-29-18
3,DEPARTMENTAL CHIEF ACCOUNTANT 1593 111717 revi...,11-17-17
4,POLICE LIEUTENANT 2232 020918.txt,02-09-18


In [6]:
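# index=False keeps the row index out of the CSV, so no unnamed first column is written
df.to_csv("competition_output.csv", index=False)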

### Analysis

Now that we have the data structured in a dataframe, we can conduct our analysis.

Below we convert the open date into a datetime field. The summary shows that the earliest posting was released in October 2002 and the latest in February 2019.

The analysis we provide below is basic and not very useful.

**Make sure that your analysis is useful by providing actionable insights that are related to the three topics listed:**
1. Identify language that can bias the pool of applicants
2. Improve the diversity and quality of the applicant pool
3. Increase the discoverability of promotional pathways
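For the first topic, one possible starting point is to count how often a set of terms appears in each bulletin. The sketch below does this for a single string. The term list and the sample sentence are made-up placeholders for illustration only; they are not a validated lexicon of biased language and not text from a real bulletin, and `count_terms` is a hypothetical helper name.

```python
import re
from collections import Counter

# Illustrative placeholder terms only; NOT a validated list of biased language.
ILLUSTRATIVE_TERMS = ["he", "his", "she", "her", "journeyman", "foreman"]

def count_terms(text, terms=ILLUSTRATIVE_TERMS):
    """Count whole-word, case-insensitive occurrences of each term in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Report only the terms that actually occur.
    return {term: counts[term] for term in terms if counts[term] > 0}

# Made-up sentence for demonstration; not taken from a job bulletin.
sample_text = "The foreman directs his crew. He reports to the superintendent."
print(count_terms(sample_text))
```

Raw counts like these only flag candidates. Whether a term actually discourages applicants is a judgment that needs supporting evidence and context.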

In [7]:
df["OPEN_DATE"] = df["OPEN_DATE"].astype('datetime64[ns]')
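The `astype` call above relies on pandas inferring the date format. A more explicit alternative, assuming every open date follows the MM-DD-YY layout seen in the `df.head()` output, is sketched below. That assumption has not been checked against all bulletins, so `errors="coerce"` is used to turn anything that does not fit into `NaT` instead of raising.

```python
import pandas as pd

# Sample values copied from the df.head() output, plus one missing value.
open_dates = pd.Series(["10-26-18", "11-17-17", "06-29-18", None])

# Explicit format: month-day-two-digit-year. Unparseable or missing values become NaT.
parsed_dates = pd.to_datetime(open_dates, format="%m-%d-%y", errors="coerce")
print(parsed_dates)

# Count how many values failed to parse so they can be inspected.
print(parsed_dates.isna().sum())
```

Checking the number of `NaT` values after parsing is a quick way to find bulletins whose open date was missed or written in a different layout.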

In [8]:
df.describe()

Unnamed: 0,FILE_NAME,OPEN_DATE
count,668,668
unique,668,212
top,ROOFER 3476 121214.txt,2017-01-20 00:00:00
freq,1,12
first,,2002-10-11 00:00:00
last,,2019-02-01 00:00:00
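The earliest and latest dates can also be read directly from the column, and a count of postings per year gives a slightly more informative summary. The sketch below uses only the five dates from the `df.head()` output as sample data, so the counts it prints describe that sample, not the full dataset.

```python
import pandas as pd

# The five open dates from the df.head() output, written in ISO form.
dates = pd.to_datetime(
    pd.Series(["2018-10-26", "2017-11-17", "2018-06-29", "2017-11-17", "2018-02-09"])
)

# Earliest and latest posting dates in the sample.
print(dates.min(), dates.max())

# Number of postings per calendar year, ordered by year.
postings_per_year = dates.dt.year.value_counts().sort_index()
print(postings_per_year)
```

On the full dataframe, the same calls can be applied to `df["OPEN_DATE"]` once it has been converted to a datetime column.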
