***
This notebook is part of the solution for the DSG: City of LA competition. The solution is split into 5 parts. Here is the list of notebooks in the correct order. The part you are currently reading is highlighted in bold. 

[1. Introduction to the solution of DSG: City of LA](https://www.kaggle.com/niyamatalmass/1-introduction-to-the-solution-of-dsg-city-of-la)

[**2. Raw Job Postings to structured CSV**](https://www.kaggle.com/niyamatalmass/2-raw-job-bulletins-to-structured-csv)

[3. Identify biased language](https://www.kaggle.com/niyamatalmass/3-identify-biased-language)

[4. Improve the diversity and quality](https://www.kaggle.com/niyamatalmass/4-improve-the-diversity-and-quality)

[5. Jobs Promotional Pathway](https://www.kaggle.com/niyamatalmass/5-jobs-promotional-pathway)
***
<br/>

<h1 align="center"><font color="#5831bc" face="Comic Sans MS">Convert raw job bulletins to structured CSV file</font></h1> 

# <font color="#5831bc" face="Comic Sans MS">Notebook Overview</font> 
This notebook converts all the job bulletin text files into a structured CSV file. The techniques used are **regex and a custom Named Entity Recognition (NER) model**. Please read the previous part of the solution first for easier understanding. Descriptions, information and recommendations are provided with each step. 


Creating the CSV file takes several steps.

1. **Extract every main section as whole text from the job bulletins using regex**
2. **Extract the possible data fields from those extracted texts using regex**
3. **Build a custom NER model in spaCy**
4. **Extract the rest of the fields using the NER model**
5. **Overview of the structured CSV**

Each section documents the techniques it uses. Let's get started! 
***

In [None]:
import pandas as pd
import numpy as np
import re
import glob
import matplotlib.pyplot as plt

%matplotlib inline

# <font color="#5831bc" face="Comic Sans MS">1. Extract all main sections as whole text using regex</font> 
In this section, we first create a pandas dataframe and store each job bulletin's whole text as a row, which makes it easy to apply filters and preprocessing. The City of LA job postings have several main sections such as ```ANNUAL SALARY, DUTIES, REQUIREMENTS, PROCESS NOTES, WHERE TO APPLY```. We first extract each whole section and store it as a column in our dataframe. For example, from ```ANNUAL SALARY``` we can then extract the ```salary field``` of our data dictionary very easily. You could extract the fields directly from the raw text, but working section by section keeps the code modular, manageable and customizable. It also makes our regex patterns more accurate. Without further ado, let's do it!
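
To make the idea concrete, here is a minimal sketch of section extraction on a tiny made-up bulletin. The sample text and the helper name `section_between` are assumptions for illustration only; the functions actually used in this notebook are defined in the cells below.

```python
import re

# Hypothetical miniature bulletin; real bulletins are much longer.
sample = """ACCOUNTANT
Class Code:       1513
ANNUAL SALARY

$49,903 to $72,996

DUTIES

An Accountant does professional accounting work.

REQUIREMENTS

A four-year degree.
"""

# Collapse every run of whitespace (including newlines) into one space,
# so that '.' in the pattern can span what used to be several lines.
flat = ' '.join(sample.split())

def section_between(text, start, end):
    """Return the text between two headings, or '' if either is missing."""
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1).strip() if match else ''

salary_section = section_between(flat, 'ANNUAL SALARY', 'DUTIES')
duties_section = section_between(flat, 'DUTIES', 'REQUIREMENT')
print(salary_section)
print(duties_section)
```

Flattening the whitespace first is what lets a simple `(.*?)` capture a whole section without needing the `re.DOTALL` flag.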

> ### <font color="#5831bc" face="Comic Sans MS">Define all our function first</font>
> In this section, I define all the functions needed for extracting each main section. The functions are modular and documented.

In [None]:
def convert_jobs_to_df(
    path='../input/data-science-for-good-city-of-los-angeles/cityofla/CityofLA/Job Bulletins/*.txt',
    raw_text_col_name='raw_job_text'):
    
    """
    Convert each text file in job bulletins to pandas dataframe
    
    -----------
    Returns
        Pandas Dataframe
            ------------------------------------
            |    index     |  raw_text         |
            |-----------------------------------
            |     0    |  raw_job_descriptions | 
            ____________________________________
    """
    
    
    job_list = []
    
    files = glob.glob(path)
    for file in files:
        with open(file, 'r', errors='replace') as f:
            content = f.read()
            job_list.append(content)
            
    return pd.DataFrame({raw_text_col_name:job_list})

In [None]:
# all functions whose names start with _ are used
# in pandas apply calls

def _class_code_apply(text):
    """
    Extract the job class code
    """
    match = re.search(r'Class Code: (\d+)', text)
    # return None when the bulletin has no class code
    return match.group(1) if match else None
        

def _open_date_apply(text):
    
    """
    Extract entire job open date section
    """
    
    open_date = ''
    result= re.search(
        "(Class Code:|Class  Code:)(.*)(ANNUAL SALARY|ANNUALSALARY)",
        text)
    
    shortContent=''
    if result:
        shortContent=result.group(2).strip()
        result= re.search(
            "Open Date:(.*)REVISED",
            shortContent,flags=re.IGNORECASE)
        if result:
            open_date=result.group(1).strip()
        if open_date=='':
            result= re.search(
                r"Open Date:(.*)\(Exam",
                shortContent,flags=re.IGNORECASE)
            if result:
                open_date=result.group(1).strip()
        if open_date=='':
            result= re.search(
                "Open Date:(.*)",
                shortContent,flags=re.IGNORECASE)
            if result:
                open_date=result.group(1).strip()
    return open_date


def _exam_type_apply(text):
    
    """
    Extract entire exam type section
    """
    
    exam_type = ""
    result= re.search(
        "(Class Code:|Class  Code:)(.*)(ANNUAL SALARY|ANNUALSALARY)",
        text)
    
    shortContent=''
    if result:
        shortContent=result.group(2).strip()
        result= re.search(
            r"\(+(.*?)\)", shortContent,flags=re.IGNORECASE)
        if result:
            exam_type=result.group(1).strip()
    return exam_type


def _salary_apply(text):
    """
    Extract entire salary section
    """
    salary = ''
    salary_notes = ''
    result=re.search(
        "(ANNUAL SALARY|ANNUALSALARY)(.*?)DUTIES", text)
    if result:
        salContent= result.group(2).strip()
        if "NOTE:" in salContent or "NOTES:" in salContent:
            result=re.search(
                "(.*?)(NOTE:|NOTES:)",
                salContent,flags=re.IGNORECASE)
            if result:
                salary=result.group(1).strip()  
            result= re.search(
                "(NOTE:|NOTES:)(.*)",
                salContent,flags=re.IGNORECASE)
            if result:
                salary_notes= result.group(2).strip()
        else:
            salary = salContent
    else:
        result=re.search(
            "(ANNUAL SALARY|ANNUALSALARY)(.*?)REQUIREMENT",
            text,flags=re.IGNORECASE)
        if result:
            salContent= result.group(2).strip()
            if "NOTE:" in salContent or "NOTES:" in salContent:
                result=re.search(
                    "(.*?)(NOTE:|NOTES:)",
                    salContent,flags=re.IGNORECASE)
                if result:
                    salary=result.group(1).strip()  
                result= re.search(
                    "(NOTE:|NOTES:)(.*)",
                    salContent,flags=re.IGNORECASE)
                if result:
                    salary_notes= result.group(2).strip()
            else:
                salary= salContent
    salary_text = "|||||||||||||||".join([salary, salary_notes])
    return salary_text


def _duties_apply(text):
    """
    Extract job duties section
    """
    duties=''
    result= re.search("DUTIES(.*?)REQUIREMENT", text)
    if result:
        duties= result.group(1).strip()
    return duties

def _requirements_apply(text):
    """
    Extract entire job requirements section
    """
    req='|'.join(["REQUIREMENT/MIMINUMUM QUALIFICATION",
                  "REQUIREMENT/MINUMUM QUALIFICATION",
                  "REQUIREMENT/MINIMUM QUALIFICATION",
                  "REQUIREMENT/MINIMUM QUALIFICATIONS",
                  "REQUIREMENT/ MINIMUM QUALIFICATION",
                  "REQUIREMENTS/MINUMUM QUALIFICATIONS",
                  "REQUIREMENTS/ MINIMUM QUALIFICATIONS",
                  "REQUIREMENTS/MINIMUM QUALIFICATIONS",
                  "REQUIREMENTS/MINIMUM REQUIREMENTS",
                  "REQUIREMENTS/MINIMUM QUALIFCATIONS",
                  "MINIMUM REQUIREMENTS:",
                  "REQUIREMENTS",
                  "REQUIREMENT"])
    
    result= re.search(f"({req})(.*)(WHERE TO APPLY|HOW TO APPLY)", text)
    requirements=''
    if result:
        requirements = result.group(2).strip()
    return requirements


def _where_to_apply(text):
    
    """
    Extract entire 'WHERE TO APPLY' section
    """
    
    where_to_apply = ''
    result= re.search(
        "(HOW TO APPLY|WHERE TO APPLY)(.*)(APPLICATION DEADLINE|APPLICATION PROCESS)",
        text)
    if result:
        where_to_apply= result.group(2).strip()
    else:
        result= re.search(
            "(HOW TO APPLY|WHERE TO APPLY)(.*)(SELECTION PROCESS|SELELCTION PROCESS)",
            text)
        if result:
            where_to_apply= result.group(2).strip()
    return where_to_apply

def _deadline_apply(text):
    """
    Extract entire deadline section
    """
    
    deadline=''
    result= re.search(
        "(APPLICATION DEADLINE|APPLICATION PROCESS)(.*?)(SELECTION PROCESS|SELELCTION PROCESS)",
        text)
    if result:
        deadline= result.group(2).strip()
    else:
        result= re.search(
            "(APPLICATION DEADLINE|APPLICATION PROCESS)(.*?)(Examination Weight:)",
            text)
        if result:
            deadline= result.group(2).strip()
            
    return deadline

def _selection_process_apply(text):
    
    """
    Extract selection process section
    """
    
    selection_process=''
    result= re.search(
        "(SELECTION PROCESS|Examination Weight:)(.*)(APPOINTMENT|APPOINTMENT IS SUBJECT TO:)",
        text)
    if result:
        selection_process= result.group(2).strip()
    else:
        result= re.search(
            "(SELECTION PROCESS|Examination Weight:)(.*)",
            text)
        if result:
            selection_process= result.group(2).strip()
            
    return selection_process

In [None]:
def _whole_clean_text(text):
    return text.replace("\n","").replace("\t","").strip()

def pre_processing(dataframe):
    # strip leading whitespace (including newlines) from the text
    dataframe['raw_job_text'] = dataframe['raw_job_text'].apply(
        lambda x: x.lstrip())
    return dataframe

def extract_job_title(dataframe):
    # split at the first newline character and take the first part;
    # that is the title
    dataframe['JOB_CLASS_TITLE'] = dataframe['raw_job_text'].apply(
        lambda x: x.split('\n', 1)[0])
    dataframe['JOB_CLASS_TITLE'] = dataframe['JOB_CLASS_TITLE'].apply(
        lambda x: _whole_clean_text(x))
    return dataframe

def extract_class_code(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    # find class code
    dataframe['JOB_CLASS_NO'] = temp.apply(lambda x: _class_code_apply(x))
    return dataframe

def extract_open_date(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['OPEN_DATE'] = temp.apply(lambda x: _open_date_apply(x))
    return dataframe

def extract_exam_type(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['TEMP_EXAM_TYPE'] = temp.apply(lambda x: _exam_type_apply(x))
    return dataframe

def extract_salary(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['TEMP_SALARY'] = temp.apply(lambda x: _salary_apply(x))
    return dataframe


def extract_duties(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['JOB_DUTIES'] = temp.apply(lambda x: _duties_apply(x))
    return dataframe

def extract_requirements(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['TEMP_REQUIREMENTS'] = temp.apply(lambda x: _requirements_apply(x))
    return dataframe

def extract_where_to_apply(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['WHERE_TO_APPLY'] = temp.apply(lambda x: _where_to_apply(x))
    return dataframe

def extract_deadline(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['DEADLINE'] = temp.apply(lambda x: _deadline_apply(x))
    return dataframe

def extract_selection_process(dataframe):
    # remove all extra white spaces
    temp = dataframe['raw_job_text'].apply(lambda x: ' '.join(x.split()))
    
    dataframe['SELECTION_PROCESS'] = temp.apply(lambda x: _selection_process_apply(x))
    return dataframe



> ### <font color="#5831bc" face="Comic Sans MS">Let's extract each main section of job bulletins!</font>
> We have already defined the necessary functions. Now all we have to do is apply them to the pandas dataframe. Each function takes the dataframe as its argument, does the rest, and returns the processed dataframe.

In [None]:
# first let's convert folder of raw text job bulletins
# to pandas dataframe
df_jobs = convert_jobs_to_df()

# do some initial text cleaning
df_jobs = pre_processing(df_jobs)

###############################
# Here the actual extraction of the main sections begins;
# we just call the functions
###############################
df_jobs = extract_job_title(df_jobs) # extract job title

df_jobs = extract_class_code(df_jobs) # extract class code

df_jobs = extract_open_date(df_jobs) # extract open date

df_jobs = extract_exam_type(df_jobs) # extract exam type section

df_jobs = extract_salary(df_jobs) # extract salary section

df_jobs = extract_duties(df_jobs) # extract duties section

df_jobs = extract_requirements(df_jobs) # extract requirements section

df_jobs = extract_where_to_apply(df_jobs) # extract where to apply section

df_jobs = extract_deadline(df_jobs) # extract deadline section

df_jobs = extract_selection_process(df_jobs) # extract selection process section

# create a new column containing whole text but clean from new line and tab 
df_jobs['raw_clean_job_text'] = df_jobs['raw_job_text'].apply(
    lambda x: _whole_clean_text(x))

# finally let's see what we have got
df_jobs.head()


Looking very good. We have successfully extracted all the main sections of each job bulletin and stored them in a pandas dataframe. This will be really helpful in the next section of the solution for **applying more rigorous regex and functions**.

***

# <font color="#5831bc" face="Comic Sans MS">2. Extract possible data fields using regex</font>
In the previous step, we extracted each **main section** of the job bulletins. In this step, we are going to extract possible data fields from those main sections. You may ask, why ***possible fields***? Because, as I said in my first notebook, not every data field can be extracted using regex. Some field value patterns change a lot from one job bulletin to the next. In this section, **I will extract the field values that can be extracted using regex**. 

I will extract these data field values: ```ENTRY_SALARY_GEN```, ```ENTRY_SALARY_DWP```, ```EXAM_TYPE```, ```DRIVERS_LICENSE_REQ```, ```DRIV_LIC_TYPE```. Let's get to work! 
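
As a small illustration of the kind of pattern involved, the sketch below pulls the first salary range out of a salary string. The helper name `first_salary_range` and the sample strings are assumptions made for this example; the pattern has the same shape as the one used for ```ENTRY_SALARY_GEN``` further down.

```python
import re

# A dollar amount, optionally followed by "to" or "and" and a second amount.
SALARY_RE = re.compile(r'\$(\d+,\d+)(?:\s(?:to|and)\s\$(\d+,\d+))?')

def first_salary_range(text):
    """Return 'low-high', 'amount (flat-rated)', or '' when nothing matches."""
    match = SALARY_RE.search(text)
    if not match:
        return ''
    low, high = match.group(1), match.group(2)
    if high:
        return f'${low}-${high}'
    return f'${low} (flat-rated)'

# Hypothetical salary strings, for illustration only.
print(first_salary_range('$49,903 to $72,996 and $55,019 to $80,472'))
print(first_salary_range('$125,175 (flat-rated)'))
print(first_salary_range('Salary to be determined'))
```

Only the first range is returned, which matches the idea of an entry salary: later ranges in the same string are ignored.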

> ### <font color="#5831bc" face="Comic Sans MS">First Define our necessary function</font>
> As always, we define all our functions first. This helps us reuse our code and makes it more modular. Each function is documented with comments.

In [None]:
################################
# ENTRY_SALARY_GEN
################################
def salary(content):   
    try:
        salary=re.compile(r'\$(\d+,\d+)((\s(to|and)\s)(\$\d+,\d+))?') #match salary
        sal=re.search(salary,content)
        if sal:
            range1=sal.group(1)
            if range1 and '$' not in range1:
                range1='$'+range1
            range2=sal.group(2)
            if range2:
                range2=sal.group(2).replace('to','')
                range2=range2.replace('and','')
            if range1 and range2:
                return f"{range1}-{range2.strip()}"
            elif range1:
                return f"{range1} (flat-rated)"
        else:
            return ''
    except Exception as e:
        return ''

################################
# ENTRY_SALARY_DWP
################################
def salaryDWP(content):
    try:
        result= re.search("(Department of Water and Power is)(.*)", content)
        if result:
            salary=re.compile(r'\$(\d+,\d+)((\s(to|and)\s)(\$\d+,\d+))?') #match salary
            sal=re.search(salary,result.group(2))
            if sal:
                range1=sal.group(1)
                if range1 and '$' not in range1:
                    range1='$'+range1
                range2=sal.group(2)
                if range2:
                    range2=sal.group(2).replace('to','')
                    range2=range2.replace('and','')
                if range1 and range2:
                    return f"{range1}-{range2.strip()}"
                elif range1:
                    return f"{range1} (flat-rated)"
            else:
                return ''
        # no DWP salary sentence in this text
        return ''
    except Exception as e:
        return ''  

################################
# DRIVERS_LICENSE_REQ
# Whether a driver's license is required, 
# possibly required, or not required
# (note: the job class will most likely not explicitly say if a license is not required)
# P,R
################################
def drivingLicenseReq(content):
    try:
        result= re.search(
            "(.*?)(California driver\'s license|driver\'s license)",
            content)
        
        if result:
            exp=result.group(1).strip()
            exp=' '.join(exp.split()[-10:]).lower()
            if 'may require' in exp:
                return 'P'
            else:
                return 'R'
        else:
            return ''
    except Exception as e:
        return '' 

################################
#DRIV_LIC_TYPE
################################
def drivingLicense(content):
    driving_License=[]
    result= re.search(
        "(valid California Class|valid Class|valid California Commercial Class)(.*?)(California driver\'s license|driver\'s license)",
        content)
    if result:
        dl=result.group(2).strip()
        dl=dl.replace("Class","").replace("commercial","").replace("or","").replace("and","")
        if 'A' in dl:
            driving_License.append('A')
        if 'B' in dl:
            driving_License.append('B') 
        if 'C' in dl:
            driving_License.append('C')  
        if 'I' in dl:
            driving_License.append('I')   
        return ','.join(driving_License)
    else:
        return ''

################################
#EXAM_TYPE
################################
def examType(content):
    '''Code explanation:
    OPEN: Exam open to anyone (pending other requirements)
    INT_DEPT_PROM: Interdepartmental Promotional
    DEPT_PROM: Departmental Promotional
    OPEN_INT_PROM: Open or Competitive Interdepartmental Promotional
    '''
    exam_type=''
    if 'INTERDEPARTMENTAL PROMOTIONAL AND AN OPEN COMPETITIVE BASIS' in content:
        exam_type='OPEN_INT_PROM' 
    elif 'OPEN COMPETITIVE BASIS' in content:
        exam_type='OPEN'
    elif 'INTERDEPARTMENTAL PROMOTIONAL' in content or 'INTERDEPARMENTAL PROMOTIONAL' in content:
        exam_type='INT_DEPT_PROM'
    elif 'DEPARTMENTAL PROMOTIONAL' in content:
        exam_type='DEPT_PROM' 
    return exam_type

def split_salary(text):
    """
    When we extracted the salary section,
    we joined the salary text and the salary notes
    with the separator '|||||||||||||||'.
    This function returns only the salary text (the first part).
    """
    return text.split('|||||||||||||||')[0]


> ### <font color="#5831bc" face="Comic Sans MS">Let's extract those data fields!</font>
> We have already defined our functions. Now we just have to call them and store their values in the pandas dataframe. Each line of code is commented.

In [None]:
# keep only the salary text, dropping the notes after the separator
df_jobs['TEMP_SALARY'] = df_jobs['TEMP_SALARY'].apply(lambda x: split_salary(x))

# extract ENTRY_SALARY_GEN and ENTRY_SALARY_DWP
df_jobs['ENTRY_SALARY_GEN'] = df_jobs['TEMP_SALARY'].apply(lambda x: salary(x))
df_jobs['ENTRY_SALARY_DWP'] = df_jobs['TEMP_SALARY'].apply(lambda x: salaryDWP(x))

# extract DRIVERS_LICENSE_REQ and DRIV_LIC_TYPE
df_jobs['DRIVERS_LICENSE_REQ'] = df_jobs['TEMP_REQUIREMENTS'].apply(
    lambda x: drivingLicenseReq(x))
df_jobs['DRIV_LIC_TYPE'] = df_jobs['raw_clean_job_text'].apply(lambda x: drivingLicense(x))

# extract EXAM_TYPE
df_jobs['EXAM_TYPE'] = df_jobs['raw_clean_job_text'].apply(lambda x: examType(x))

# finally let's see what we have done
df_jobs[['ENTRY_SALARY_GEN', 'ENTRY_SALARY_DWP',
         'DRIVERS_LICENSE_REQ', 'DRIV_LIC_TYPE', 'EXAM_TYPE']].head(10)

Awesome! We have successfully **extracted 5 data field columns** from each job bulletin. We have made a lot of progress towards our goal. In the next section, we will see how to extract the column values that are impractical to extract using regex!
<br>

***

# <font color="#5831bc" face="Comic Sans MS">3. Build a custom NER model in spaCy</font>
We have already extracted some of the important data field values. But there are some column values that are **impossible to extract** using regex. We're not saying that it will never be possible, but it's really hard, because the pattern of the value is different in most of the jobs. The data fields we can't extract using regex are 

```EDUCATION_YEARS, SCHOOL_TYPE ,EDUCATION_MAJOR, EXPERIENCE_LENGTH ,FULL_TIME_PART_TIME,EXP_JOB_CLASS_TITLE, EXP_JOB_CLASS_ALT_RESP,  EXP_JOB_CLASS_FUNCTION, EXP_JOB_CLASS_ADDITIONAL_FUNCTION,COURSE_COUNT, COURSE_LENGTH, COURSE_SUBJECT, MISC_COURSE_DETAILS, EXP_JOB_COMPANY, DEGREE NAME, EXP_JOB_CLASS_ALT_JOB_TITLE, REQUIRED_CERTIFICATE, CERTIFICATE_ISSUED_BY,COURSE_TITLE, REQUIRED_EXAM_PASS, EXPERIENCE_EXTRA_DETAILS```

As we can see, that is a lot of data fields for which we shouldn't use regex. One interesting and useful fact is that **all of these data field values can be found** in one particular section of the job description: the ```REQUIREMENTS``` section. That is very good news for us because we have already extracted that section and stored it in a pandas column.


## <font color="#5831bc" face="Comic Sans MS">So, what is a NER?</font>
Named entity recognition (NER), also known as entity chunking/extraction, is a popular technique used in information extraction to identify and segment named entities and classify them under various predefined classes. That's what we want! We have our ```REQUIREMENTS``` section text and we want to extract information from it into predefined data fields or classes. Here is an example of NER in text:
![](http://imanage.com/wp-content/uploads/2014/10/NER1.png)


## <font color="#5831bc" face="Comic Sans MS">How does NER work?</font>
I don't want to go into the mathematical details of NER. In short, we first train a NER model using our hand-labelled dataset, and then the model can predict labels for new data. A description and an example of the training data are provided below when we import it.


## <font color="#5831bc" face="Comic Sans MS">How to train a custom NER?</font>
Nowadays, many state-of-the-art libraries offer pre-trained NER models that are ready to use. So why don't we just use those? Those pre-trained models were trained on datasets that are totally different from ours, and we have our own custom labels. For that reason, we have to build our own model designed for our use case. 

First, we need a custom hand-labelled dataset. What does that mean? Our dataset consists of ```REQUIREMENTS``` section texts, and each text has labels identifying which word or phrase goes to which data field. If we train our custom model on that, it can predict the fields for other jobs as well.

I have built a custom hand-labelled dataset with the ```REQUIREMENTS``` of 70 jobs labelled with the correct data fields. We could use a different library for training our model, but I would like to use spaCy. It is a very popular and useful library for NLP, and it makes working with NER really easy. Without further ado, let's begin! 

> ### <font color="#5831bc" face="Comic Sans MS">Some Preprocessing to our REQUIREMENTS section texts</font>
> We previously extracted the entire ```REQUIREMENTS``` section texts and stored them in a column. But we don't need the entire text: it also contains process notes, and we don't need those.

> **Also, the ```REQUIREMENTS``` section has a numbered list whose items are joined by or/and. We need to split it at ```or``` and store each item as a separate row because we need to extract data fields from each bullet point**.  So let's process our requirements section.
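
The splitting step can be sketched as follows on a made-up requirements string (the sample text is an assumption for illustration). The pattern matches `or` only when it is followed by a list number such as `2.`:

```python
import re

# Hypothetical requirements text with two alternatives joined by "or".
requirements = (
    "1. Two years of full-time paid experience as a Clerk; or "
    "2. A bachelor's degree from an accredited college or university."
)

# Split only where "or" is followed by a list number such as "2.",
# so the "or" inside "college or university" is left alone.
parts = [p.strip() for p in re.split(r'or \d\.', requirements)]
for part in parts:
    print(part)
```

Note that the list number of the first item stays in the text, while the numbers of later items are consumed by the split.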

In [None]:
# function for separating process notes
# from the requirements section
def _seperate_process_notes(text):
    result = re.search('(.*)(PROCESS NOTES|NOTES)(.*)', text)
    if result:
        req = result.group(1)
        process_notes = result.group(3)
    else:
        req = text 
        process_notes = None
    return req, process_notes

def _split_requirements(text):
    req_list = re.split(r'or \d\.', text)
    return req_list


# function for splitting the requirements section's
# numbered points and storing them as separate rows
def split_list_to_rows(dataframe, col_name):
    # here col_name is the name of the column which contains list
    return pd.DataFrame({
          col:np.repeat(dataframe[col].values, dataframe[col_name].str.len())
          for col in dataframe.columns.drop(col_name)}
        ).assign(**{col_name:np.concatenate(dataframe[col_name].values)})[dataframe.columns]


In [None]:
# separate process notes
df_jobs['REQUIREMENTS'], df_jobs['REQUIREMENTS_PROCESS'] = \
zip(*df_jobs['TEMP_REQUIREMENTS'].apply(lambda x: _seperate_process_notes(x)))

# remove some text
df_jobs['REQUIREMENTS'] = df_jobs['REQUIREMENTS'].str.replace(r'PROCESS', '')
# df_jobs['REQUIREMENTS'].to_csv('./jobs_desc.csv', index=None)

# split requirements and store them as separate rows
df_jobs['req_list'] = df_jobs['REQUIREMENTS'].apply(lambda x: _split_requirements(x))
df_jobs = split_list_to_rows(df_jobs, 'req_list')


# finally let's see what we did
df_jobs[['JOB_CLASS_TITLE', 'JOB_CLASS_NO', 'req_list']].head(20)

We have cleaned our requirements section and done some pre-processing. We can see that the same job has multiple rows because we split each ```REQUIREMENTS``` section at ```or``` followed by a list number and stored each part as a separate row. Let's move on to the next section and build the NER model in spaCy.

> ### <font color="#5831bc" face="Comic Sans MS">Import and Process our training data</font>
> I built a hand-labelled custom dataset consisting of only 70 job description requirements. The dataset is provided with this kernel as an external dataset. But in order to use it for the spaCy NER model, we need to do some pre-processing on it. Let's do that!

In [None]:
def convert_dataturks_to_spacy(dataturks_JSON_FilePath):
    try:
        training_data = []
        lines=[]
        with open(dataturks_JSON_FilePath, 'r') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                #only a single point in text annotation.
                point = annotation['points'][0]
                labels = annotation['label']
                # handle both list of labels or a single label.
                if not isinstance(labels, list):
                    labels = [labels]

                for label in labels:
                    #dataturks indices are both inclusive [start, end] but spacy is not [start, end)
                    entities.append((point['start'], point['end'] + 1 ,label))


            training_data.append((text, {"entities" : entities}))

        return training_data
    except Exception as e:
        logging.exception("Unable to process " + dataturks_JSON_FilePath + "\n" + "error = " + str(e))
        return None

In [None]:
import logging
import json
TRAIN_DATA = convert_dataturks_to_spacy('../input/ner-annotation-of-city-of-la-jobs/city_la_jobs_ner_labeling.json')

TRAIN_DATA[0]

We successfully imported and processed our training data. It's a JSON file. Above, we can see one record of the file. First we see the job requirements text, then a list of entities. **Entities are the data fields that we need**. For example, the first entity ```EXP_JOB_CLASS_FUNCTION``` denoted by ```118 and 229``` means that **the value for the ```EXP_JOB_CLASS_FUNCTION``` data field is found between character positions ```118 and 229``` in the requirements text of that job**. Now that we understand our training data, let's move on to training our model!
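
To see how these offsets relate to the text, here is a minimal sketch with a made-up record in the same shape as the Dataturks export (the sentence and the label assignment are assumptions for illustration). It applies the same inclusive-to-exclusive end conversion as the function above and then slices the text:

```python
import json

# Hypothetical Dataturks-style record; the text and label are illustrative.
line = json.dumps({
    "content": "Two years of full-time paid experience as a Clerk.",
    "annotation": [
        {"label": ["EXPERIENCE_LENGTH"],
         "points": [{"start": 0, "end": 8, "text": "Two years"}]},
    ],
})

record = json.loads(line)
text = record["content"]
entities = []
for annotation in record["annotation"]:
    point = annotation["points"][0]
    for label in annotation["label"]:
        # Dataturks end offsets are inclusive; spaCy expects exclusive ends.
        entities.append((point["start"], point["end"] + 1, label))

for start, end, label in entities:
    # Slicing with the converted offsets recovers the labelled phrase.
    print(label, '->', text[start:end])
```

Slicing `text[start:end]` like this is a quick sanity check that the offsets really point at the phrase that was labelled.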

> ### <font color="#5831bc" face="Comic Sans MS">Building and Training NER model</font>
> Building and training a NER model in spaCy is pretty straightforward. Below I have provided informative comments for the code. The official spaCy website also has an extensive guide on building a NER model, and I highly recommend checking it out. After training has completed, the model can be saved to disk for later use. We don't do that here because we don't need it right now. 

In [None]:
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
from tqdm import tqdm, tqdm_notebook


def ner_model(model=None, output_dir='./', n_iter=500):
    """Load the model, set up the pipeline and train the entity recognizer."""
    print('Training started...')
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly - but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in tqdm_notebook(range(n_iter)):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
    print('Training completed.')
    return nlp
#             print("Losses", losses)

    # test the trained model
#     for text, _ in TRAIN_DATA:
#         doc = nlp(text)
#         print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
# #         print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
#     if output_dir is not None:
#         output_dir = Path(output_dir)
#         if not output_dir.exists():
#             output_dir.mkdir()
#         nlp.to_disk(output_dir)
#         print("Saved model to", output_dir)


# we have built our model training function
# let's train our model

nlp = ner_model()

Finally, we created and trained a custom Named Entity Recognition model. The model can be saved to disk so that we can reuse it later, but for now let's use it directly to extract our data fields.
***

# <font color="#5831bc" face="Comic Sans MS">4. Extract rest of the fields using NER model</font>
Previously, we built our NER model. In this step, we use the newly trained model to extract the remaining data fields. If the model had been saved to disk, we would load it first; here we use the in-memory model directly. Let's do that!

> ### <font color="#5831bc" face="Comic Sans MS">Extract data fields and process</font>
> In this section, we do our most important task: extracting the necessary data fields using the custom-built NER model. First, we write a small helper function to keep the code simple, then apply it to the requirements column (```req_list```) of the dataframe. Our trained model is stored in the ```nlp``` variable, so we use that.

In [None]:
# function for extracting data field value
def _ner_apply(text):
    # pass our text to spacy
    # it will return us doc (spacy doc)
    doc = nlp(text)
    # return a list of tuples that looks like this:
    # [('four-year', 'EDUCATION_YEARS'), ('college or university', 'SCHOOL_TYPE')]
    return [(ent.text, ent.label_) for ent in doc.ents]


# apply the function and store the result in a new column 
df_jobs['temp_entity'] = df_jobs['req_list'].apply(_ner_apply)

# finally look at our data
df_jobs[['req_list', 'temp_entity']].head(10)

Okay, we have finally extracted our data fields, but the result is not clean yet: for each requirements row the function returns a list of tuples, for example ```[('four-year', 'EDUCATION_YEARS'), ('college or university', 'SCHOOL_TYPE')]```. So we have to process it and store each data field value in its own column. Let's do that.

In [None]:
import itertools

# flatten the entity lists into [value, label, row_position] triples, sorted by label
# (itertools.groupby only groups consecutive items, so sorting by label is required)
flatter = sorted([list(x) + [idx] for idx, y in enumerate(df_jobs['temp_entity']) 
                  for x in y], key = lambda x: x[1]) 

# collect the values that belong to each entity label (one label per new column)
for key, group in itertools.groupby(flatter, lambda x: x[1]):
    list_of_vals = [(val, idx) for val, _, idx in group]

    # write each value into its row, in the column named after the label
    # (idx is a position from enumerate, so this assumes a default RangeIndex)
    for val, idx in list_of_vals:
        df_jobs.loc[idx, key] = val
        
df_jobs['REQUIREMENT_SET_ID'] = df_jobs.groupby('JOB_CLASS_NO').cumcount() + 1
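
To make the reshaping step concrete, here is a minimal plain-Python sketch of the same idea on two made-up rows: each list of `(text, label)` tuples becomes a mapping from label to value. The sample rows and the helper name `entities_to_fields` are illustrative assumptions, not part of the notebook. Unlike the `.loc` assignment above, which keeps only the last value when a label occurs twice in one row, this sketch joins repeated values with `|`.

```python
from collections import defaultdict


def entities_to_fields(entities, sep="|"):
    """Turn [(text, label), ...] into {label: "text1|text2"}."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return {label: sep.join(texts) for label, texts in grouped.items()}


# made-up NER output for two requirement rows
rows = [
    [("four-year", "EDUCATION_YEARS"), ("college or university", "SCHOOL_TYPE")],
    [("Two years", "EXPERIENCE_LENGTH"), ("full-time", "FULL_TIME_PART_TIME"),
     ("paid", "FULL_TIME_PART_TIME")],
]

records = [entities_to_fields(row) for row in rows]
for record in records:
    print(record)
```

Each resulting dictionary corresponds to one row of the final table, with labels that never appeared in a row simply absent (they would show up as missing values in a dataframe).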

Let's see what our final dataframe looks like and what we extracted, after cleaning up the fields. 

In [None]:
COLUMNS_ORDER = ['JOB_CLASS_TITLE', 'JOB_CLASS_NO', 
                 'REQUIREMENT_SET_ID',
                 'JOB_DUTIES', 'ENTRY_SALARY_GEN',
                 'ENTRY_SALARY_DWP','OPEN_DATE',
                 'EDUCATION_YEARS', 'SCHOOL_TYPE',
                 'EDUCATION_MAJOR', 'DEGREE NAME','EXPERIENCE_LENGTH',
                 'FULL_TIME_PART_TIME',
                 'EXP_JOB_CLASS_TITLE', 'EXP_JOB_CLASS_FUNCTION',
                 'EXP_JOB_COMPANY','EXP_JOB_CLASS_ALT_JOB_TITLE',
                 'EXP_JOB_CLASS_ALT_RESP',
                 'COURSE_LENGTH', 'COURSE_SUBJECT',
                 'REQUIRED_CERTIFICATE','CERTIFICATE_ISSUED_BY',
                 'DRIVERS_LICENSE_REQ', 'DRIV_LIC_TYPE',
                 'EXAM_TYPE', 'req_list', 'raw_clean_job_text',
                 'REQUIREMENTS']

In [None]:
# first, order the columns the way we want them
df_jobs_clean_col = df_jobs[COLUMNS_ORDER]

# drop rows where an extracted value does not look plausible for its field;
# na=True keeps rows where the field was not extracted at all
df_jobs_clean_col = df_jobs_clean_col[df_jobs_clean_col['EDUCATION_YEARS'].str.contains(
    "year", flags=re.IGNORECASE, na=True)]

# raw string with escaped dots so that "G.E.D" is matched literally
df_jobs_clean_col = df_jobs_clean_col[df_jobs_clean_col['SCHOOL_TYPE'].str.contains(
    r"school|college|university|apprenticeship|G\.E\.D", flags=re.IGNORECASE, na=True)]

df_jobs_clean_col = df_jobs_clean_col[df_jobs_clean_col['EXPERIENCE_LENGTH'].str.contains(
    "year|month|hour", flags=re.IGNORECASE, na=True)]

df_jobs_clean_col = df_jobs_clean_col[df_jobs_clean_col['FULL_TIME_PART_TIME'].str.contains(
    "time", flags=re.IGNORECASE, na=True)]

In [None]:

# save the structured CSV for the next parts of the solution
df_jobs_clean_col.to_csv('./jobs.csv', index=False)
# preview the first 25 rows
df_jobs_clean_col.head(25)

Wow! That looks very good. If we scroll to the right, we can see that our model extracted the data field values correctly. It's a pleasure to see a structured version of all the job bulletins. The dataset is also saved to disk for later use in the solution. 
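
One detail of the cleaning step is worth spelling out: each keyword filter keeps a row when the field is missing (that is what `na=True` does in `str.contains`) or when the value contains an expected keyword, and drops it otherwise. A plain-Python sketch of that rule, with a hypothetical helper `keep_value` and made-up sample values:

```python
import re


def keep_value(value, pattern):
    """Keep missing values, or values that contain the pattern (case-insensitive)."""
    if value is None:
        return True
    return re.search(pattern, value, flags=re.IGNORECASE) is not None


# made-up EXPERIENCE_LENGTH candidates; None stands for "nothing extracted"
samples = ["Two years", None, "Los Angeles", "6 MONTHS"]
kept = [v for v in samples if keep_value(v, r"year|month|hour")]
print(kept)
```

The consequence is that a mislabelled entity removes the whole row, while a row with no entity for that field survives untouched.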
***

# <font color="#5831bc" face="Comic Sans MS">5. Overview of structured CSV</font>

We finally made a structured version of our job bulletins. Looking at the dataset, most of the data fields were extracted correctly. Some data fields may contain a few outliers, but we can ignore those. 

    - We didn't rely on regex alone for extracting data fields. With NER we can extract data fields from new, unseen job information. If we used only regex, a pattern might fail to match a new job posting and the extraction code could throw an error. My solution can still extract data fields from new City of LA job postings. 

    - The data fields extracted with regex in my solution follow patterns that are consistent across the bulletins, so they are not expected to break on new jobs. 

    - For all these reasons, we can say that this solution solves the first problem of this competition (converting job bulletins to a structured CSV) well.
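
The point about regex failing on unseen postings can be illustrated with a small sketch. `re.search` returns `None` when nothing matches, so calling `.group()` on the result raises an `AttributeError` unless the miss is handled. The pattern, the helper name `extract_open_date`, and the sample strings below are assumptions for illustration, not the notebook's actual extraction code.

```python
import re

# hypothetical pattern for an "Open Date: MM-DD-YY" line
OPEN_DATE_RE = re.compile(r"open\s+date:\s*(\d{1,2}-\d{1,2}-\d{2,4})", re.IGNORECASE)


def extract_open_date(text):
    """Return the date string, or None when the pattern does not match."""
    match = OPEN_DATE_RE.search(text)
    return match.group(1) if match else None


print(extract_open_date("Open Date:  04-13-18"))
print(extract_open_date("Posting opens in April"))
```

Returning `None` instead of raising lets a batch run continue and leaves the field empty for later inspection.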
***

The City of LA provided us with a data dictionary containing examples of possible data fields we could extract. We followed that example, added some new data fields, and dropped some others. Let's talk about that. 

Our structured CSV has these field names: 

    'JOB_CLASS_TITLE', 'JOB_CLASS_NO', 
     'REQUIREMENT_SET_ID',
     'JOB_DUTIES', 'ENTRY_SALARY_GEN',
     'ENTRY_SALARY_DWP','OPEN_DATE',
     'EDUCATION_YEARS', 'SCHOOL_TYPE',
     'EDUCATION_MAJOR', 'DEGREE NAME','EXPERIENCE_LENGTH',
     'FULL_TIME_PART_TIME',
     'EXP_JOB_CLASS_TITLE', 'EXP_JOB_CLASS_FUNCTION',
     'EXP_JOB_COMPANY','EXP_JOB_CLASS_ALT_JOB_TITLE',
     'EXP_JOB_CLASS_ALT_RESP',
     'COURSE_LENGTH', 'COURSE_SUBJECT',
     'REQUIRED_CERTIFICATE','CERTIFICATE_ISSUED_BY',
     'DRIVERS_LICENSE_REQ', 'DRIV_LIC_TYPE',
     'EXAM_TYPE', 'req_list', 'raw_clean_job_text',
     'REQUIREMENTS'
     
     
The City of LA asked for information on any newly created data fields, so here it is. 

    - DEGREE NAME - Name of the degree required (bachelor's, associate's, etc.). It will be helpful for EDA on education and related fields. 

    - EXP_JOB_COMPANY - The employer where the required experience must have been gained (e.g. City of Los Angeles). Job bulletins often state explicitly that the experience must be with the City of Los Angeles, and this data field makes that easy to identify. 

    - EXP_JOB_CLASS_ALT_JOB_TITLE - Alternative job class title. We already have the alternate job class responsibilities (EXP_JOB_CLASS_ALT_RESP); with this field we also have the title. This can enhance our explicit promotional pathway. 

    - REQUIRED_CERTIFICATE - Certificate required for this job. Some jobs require specific certificates, and this data field lets us identify them. 

    - CERTIFICATE_ISSUED_BY - The organization that issues the required certificate. 

    - req_list - A single requirement set taken from REQUIREMENTS. A requirements section lists several requirements joined by and/or. We split at ```or``` and store each part as a separate row; this data field holds that text. 

    - raw_clean_job_text - The whole raw job text, cleaned. The next part of the solution needs the full job text, and this data field provides it. 
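
As a rough illustration of the `req_list` and `REQUIREMENT_SET_ID` fields, the sketch below splits a made-up requirements text at a trailing `or` and numbers the resulting rows per job class. The split pattern, the sample text, and the job class number are assumptions; the notebook's own splitting logic lives in an earlier section and may differ.

```python
import re
from collections import Counter


def split_requirements(text):
    """Split a requirements block into alternatives separated by a trailing 'or'."""
    parts = re.split(r";?\s*\bor\b\s*\n", text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


# made-up requirements section with two alternatives
requirements = (
    "1. Two years of full-time paid experience as a Clerk; or\n"
    "2. A four-year degree from an accredited college or university."
)

counter = Counter()  # running count per job class, like groupby(...).cumcount() + 1
rows = []
for req in split_requirements(requirements):
    counter["1234"] += 1  # "1234" is a made-up JOB_CLASS_NO
    rows.append({"JOB_CLASS_NO": "1234",
                 "REQUIREMENT_SET_ID": counter["1234"],
                 "req_list": req})

for row in rows:
    print(row)
```

Only an `or` at the end of a line triggers a split here, so the phrase "college or university" inside a requirement stays intact.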
***

# <font color="#5831bc" face="Comic Sans MS">Conclusion</font>
We finally converted our raw job postings to a structured CSV, which solves one of the hardest problems in this competition. The structured CSV is saved to disk. Feel free to download it and check its quality. We will also check some of its columns in part 4 of the solution. Thank you very much for reading the notebook. See you in the next part of the solution. 