In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob
from datetime import datetime
import re

**Help the City of Los Angeles to structure and analyze its job descriptions**

The City of Los Angeles faces a big hiring challenge: 1/3 of its 50,000 workers are eligible to retire by July of 2020. The city has partnered with Kaggle to create a competition to improve the job bulletins that will fill all those open positions.

### Problem Statement
The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

The goal is to convert a folder full of plain-text job postings into a structured CSV file and then to use this data to: (1) identify language that can negatively bias the pool of applicants; (2) improve the diversity and quality of the applicant pool; and/or (3) make it easier to determine which promotions are available to employees in each job class.

### Data Description

The job bulletins will be provided as a folder of plain-text files, one for each job classification.

Job Bulletins [Folder]

- 683 plain-text job postings

Instructions and Additional Documents [Folder]

- Job Bulletins with Annotations

- Annotation Descriptions.docx

- City Job Paths

- PDFs

- Description of promotions in job bulletins.docx

- Job_titles.csv

- Kaggle_data_dictionary.csv

- Sample job class export template.csv

In [2]:
input_path="../input/cityofla/CityofLA/"

In [3]:
job_bulletein_path=input_path+"Job Bulletins/"
job_bulletein_path

'../input/cityofla/CityofLA/Job Bulletins/'

### Let us list the text files that will go into the dataframe


In [4]:
files=glob.glob(os.path.join(job_bulletein_path,"*.txt"))
print("Number of files in Job Bulletin path ",len(files))

Number of files in Job Bulletin path  683
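
Listing the files is only the first step; the bulletin text itself still has to be read before it can be analyzed. The following is a minimal sketch of one way to do that, not the notebook's own method. The helper name `load_bulletins`, the `TEXT` column, and the choice of `errors="replace"` for undecodable bytes are all assumptions made for illustration, and the demo runs on two made-up files in a temporary folder rather than on the real data.

```python
import glob
import os
import tempfile

import pandas as pd


def load_bulletins(folder):
    """Read every .txt file in `folder` into a DataFrame, one row per file."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        # Assumption: errors="replace" avoids a crash on bytes that are not
        # valid UTF-8, at the cost of substituting a replacement character.
        with open(path, encoding="utf-8", errors="replace") as handle:
            rows.append({"FILE_NAME": os.path.basename(path),
                         "TEXT": handle.read()})
    return pd.DataFrame(rows, columns=["FILE_NAME", "TEXT"])


# Demo on a throwaway folder holding two made-up bulletins.
with tempfile.TemporaryDirectory() as demo_dir:
    samples = [("CLERK 1234 010118.txt", "CLERK\nClass Code: 1234"),
               ("PAINTER 5678 020218.txt", "PAINTER\nClass Code: 5678")]
    for name, text in samples:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    demo_df = load_bulletins(demo_dir)

print(demo_df.shape)
```

On the real data the same helper would be called with the bulletin folder path, and the result could be joined to the filename-derived columns on `FILE_NAME`.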


### Let us extract Job Class Title, Job Class No, Open Date, and Revised Date (if present) from the filename

In [5]:
data=pd.DataFrame()
job_class_title=[]
job_class_number=[]
open_date=[]
revised_date=[]
file_name=[]

for file_path in files:
    # Work on the bare filename so the folder path never affects the regexes
    base_name=os.path.basename(file_path)
    file_name.append(base_name)

    # Title: uppercase the name, then strip the extension, digits and revision markers
    jp=base_name.upper()
    jp=jp.replace("DRAFT.TXT","")
    jp=jp.replace(".TXT","")
    jp=re.sub(r"\d+","",jp)
    # Remove REVISED/REV only as standalone tokens, not inside a longer word
    jp=re.sub(r"(?<![A-Z])(REVISED|REV)(?![A-Z])","",jp)
    jp=jp.replace("()","")
    jp=jp.replace("BULLETIN FINAL","")
    # Collapse the whitespace left behind by the removals
    jp=" ".join(jp.split())
    job_class_title.append(jp)

    ## Get job_class_number - it is either length 3 or length 4
    ## Store the first match as a plain string ("" when nothing matches)
    class_match=re.search(r" (\d{3,4}) ",base_name)
    job_class_number.append(class_match.group(1) if class_match else "")

    ### Get open date and Revised Date: 5- or 6-digit runs, in order of appearance
    dates_search=re.findall(r"\d{5,6}",base_name)
    open_date.append(dates_search[0] if len(dates_search)>0 else "")
    revised_date.append(dates_search[1] if len(dates_search)>1 else "")
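
To make the filename convention concrete, here is a small worked example of the same idea: a space-delimited 3- or 4-digit class number, followed by one or two 5- or 6-digit date strings. It is a sketch under assumptions, not the notebook's exact code. The helper `parse_filename` and both sample filenames are hypothetical and were chosen only to show the pattern with and without a revised date.

```python
import re


def parse_filename(name):
    """Split a bulletin filename into title, class number and date strings."""
    stem = name.upper().replace(".TXT", "")
    # Class number: 3 or 4 digits with a space on each side.
    class_match = re.search(r" (\d{3,4}) ", name)
    # Dates: runs of 5 or 6 digits, in order of appearance.
    dates = re.findall(r"\d{5,6}", name)
    # Title: drop digits, then standalone REVISED/REV markers.
    title = re.sub(r"\d+", "", stem)
    title = re.sub(r"(?<![A-Z])(REVISED|REV)(?![A-Z])", "", title)
    return {
        "title": " ".join(title.split()),
        "class_no": class_match.group(1) if class_match else "",
        "open_date": dates[0] if dates else "",
        "revised_date": dates[1] if len(dates) > 1 else "",
    }


# Hypothetical filenames, used only to illustrate the pattern.
simple = parse_filename("ACCOUNTANT 1513 062218.txt")
revised = parse_filename("SENIOR CLERK 1368 040116 REV 051217.txt")
print(simple)
print(revised)
```

Filenames that deviate from this layout (for example a class number not surrounded by spaces) come back with empty strings, so a quick count of empty values per column is a cheap sanity check.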

In [6]:
data['JOB_CLASS_TITLE']=job_class_title
data['JOB_CLASS_NO']=job_class_number
data['FILE_NAME']=file_name
data['OPEN_DATE']=open_date
data['REVISED_DATE']=revised_date
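
The date columns above are still raw digit strings. A possible next step is converting them to real dates. The sketch below assumes the digits follow a `MMDDYY` layout, which the filenames suggest but which is not confirmed here. The helper name `parse_date_column` and the demo frame are made up for illustration. Values that do not fit the format, including empty strings, become `NaT` rather than raising an error.

```python
import pandas as pd


def parse_date_column(series):
    """Convert MMDDYY strings to timestamps; unparseable values become NaT."""
    # Assumption: dates are month, day, two-digit year with no separators.
    return pd.to_datetime(series, format="%m%d%y", errors="coerce")


# Made-up values standing in for the OPEN_DATE / REVISED_DATE columns.
demo = pd.DataFrame({
    "OPEN_DATE": ["062218", ""],
    "REVISED_DATE": ["", "051217"],
})
parsed = demo.apply(parse_date_column)
print(parsed)
```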

In [7]:
data.to_excel("JobBulleteins_version1.xlsx")