This notebook explores what additional information can be extracted from the job descriptions.

spaCy is used for this, with the en_core_web_lg model for named entity recognition.

In [1]:
import spacy

In [2]:
%%bash
python3.6 -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')




In [3]:
import en_core_web_lg

nlp = en_core_web_lg.load()

In [4]:
import pandas as pd

job_data = pd.read_csv('processed_job_data.csv', index_col=0)

In [5]:
job_data.head()

Unnamed: 0,job_id,job_title,job_description,job_sector,clean_job_description,title_and_clean_desc,label_job_sector
0,19549447,Geography Teacher,<p>Forde Education are looking to recruit a Te...,Education,Forde Education are looking to recruit a Teach...,Geography Teacher Geography Teacher Geography ...,0
1,7447537,PPA Cover teacher,Teachers Plus is seeking to employ a fully qua...,Education,Teachers Plus is seeking to employ a fully qua...,PPA Cover teacher PPA Cover teacher PPA Cover ...,0
2,26969327,Higher Level Teaching Assistant,We are currently recruiting High Level Teachin...,Education,We are currently recruiting High Level Teachin...,Higher Level Teaching Assistant Higher Level T...,0
3,7447589,Yr 2 Teacher,A suitably qualified and experienced Yr 2 Teac...,Education,A suitably qualified and experienced Yr 2 Teac...,Yr 2 Teacher Yr 2 Teacher Yr 2 Teacher Yr 2 Te...,0
4,26978624,Science Teachers,<strong>Job Description</strong><br /><br />Mo...,Education,Job Description Most Secondary Schools require...,Science Teachers Science Teachers Science Teac...,0


Let's try to detect the following in the clean job description, where present:
- location
- company hiring
- person hiring
- salary
- e-mail of the person hiring

In [6]:
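# Map the spaCy entity labels of interest to the output column names.
spacy_accepted_entity_types = {
    'PERSON': 'Person',
    'GPE': 'Location',
    'ORG': 'Company',
    'MONEY': 'Salary',
}
# Currently unused: no phone-number extraction is done below.
minimum_number_of_characters_phone_number = 8

for i, row in job_data.iterrows():
    doc = nlp(row['clean_job_description'])
    for entity in doc.ents:
        column = spacy_accepted_entity_types.get(entity.label_)
        if column is None:
            continue
        # Each match overwrites the previous one, so only the last entity
        # of each type found in a description is kept.
        job_data.at[i, column] = entity.text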

In [7]:
job_data.head(100)

Unnamed: 0,job_id,job_title,job_description,job_sector,clean_job_description,title_and_clean_desc,label_job_sector,Person,Company,Location,Salary
0,19549447,Geography Teacher,<p>Forde Education are looking to recruit a Te...,Education,Forde Education are looking to recruit a Teach...,Geography Teacher Geography Teacher Geography ...,0,Debbie Slater,GCSE,,
1,7447537,PPA Cover teacher,Teachers Plus is seeking to employ a fully qua...,Education,Teachers Plus is seeking to employ a fully qua...,PPA Cover teacher PPA Cover teacher PPA Cover ...,0,,,Wolverhampton,
2,26969327,Higher Level Teaching Assistant,We are currently recruiting High Level Teachin...,Education,We are currently recruiting High Level Teachin...,Higher Level Teaching Assistant Higher Level T...,0,Monarch Education,SEN,UK,
3,7447589,Yr 2 Teacher,A suitably qualified and experienced Yr 2 Teac...,Education,A suitably qualified and experienced Yr 2 Teac...,Yr 2 Teacher Yr 2 Teacher Yr 2 Teacher Yr 2 Te...,0,desirableApplications,ICT,Westminster,
4,26978624,Science Teachers,<strong>Job Description</strong><br /><br />Mo...,Education,Job Description Most Secondary Schools require...,Science Teachers Science Teachers Science Teac...,0,John,Reed Education,,
5,7461623,TEACHER OF MATHS,A popular mixed Maidenhead secondary school i...,Education,A popular mixed Maidenhead secondary school i...,TEACHER OF MATHS TEACHER OF MATHS TEACHER OF M...,0,,Math’s,,
6,26996572,Science Teacher,A successful Secondary School in the London Bo...,Education,A successful Secondary School in the London Bo...,Science Teacher Science Teacher Science Teache...,0,Acorn Appointments,the Home Counties,Essex,
7,7494389,Nursery Manager,Fantastic Nursery Managers vacancy in a beaut...,Education,Fantastic Nursery Managers vacancy in a beaut...,Nursery Manager Nursery Manager Nursery Manage...,0,,•Knowledge,,
8,26127364,Nursery Deputy Manager,<p><strong>Nursery Deputy Manager</strong></p...,Education,Nursery Deputy Manager £16 000 - £19 000 ...,Nursery Deputy Manager Nursery Deputy Manager ...,0,,Settling,Coventry,16 000 - £19 000
9,3565164,KS4 History (14/16 yr old),Teachers needed for various assignments Supply...,Education,Teachers needed for various assignments Supply...,KS4 History (14/16 yr old) KS4 History (14/16 ...,0,Stephen Wills,Monarch Education,,


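The extracted data is not very reliable. In some rows the detected person is actually a company (for example Monarch Education and Acorn Appointments) or a fragment of merged words (desirableApplications), and the detected company is often an acronym or a stray token (GCSE, SEN, ICT, •Knowledge). More work would be required in this area.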
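
One possible contributor is that the extraction loop keeps only the last entity of each type found in a description. The following is a minimal sketch of an alternative, not the method used in this notebook: it counts every mention per column and keeps the most frequent one. It assumes the entities are available as (text, label) pairs; the helper name most_common_entities and the sample pairs are made up for illustration.

```python
from collections import Counter, defaultdict

# Same label-to-column mapping as in the notebook.
ACCEPTED = {'PERSON': 'Person', 'GPE': 'Location', 'ORG': 'Company', 'MONEY': 'Salary'}


def most_common_entities(entities, accepted=ACCEPTED):
    """Return {column: most frequent entity text} for (text, label) pairs.

    Hypothetical helper; labels missing from `accepted` are ignored.
    """
    counts = defaultdict(Counter)
    for text, label in entities:
        column = accepted.get(label)
        if column is not None:
            counts[column][text.strip()] += 1
    return {column: counter.most_common(1)[0][0]
            for column, counter in counts.items()}


# With spaCy, the pairs could be built as:
#   entities = [(ent.text, ent.label_) for ent in nlp(description).ents]
example = [  # made-up sample data
    ('Monarch Education', 'ORG'),
    ('SEN', 'ORG'),
    ('Monarch Education', 'ORG'),
    ('UK', 'GPE'),
    ('Tuesday', 'DATE'),
]
print(most_common_entities(example))
```

Frequency is only one possible heuristic; whether it actually improves these columns would need to be checked against the data.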

Let's also look at e-mail extraction:

In [8]:
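import re

for i, row in job_data.iterrows():
    # Find every substring shaped like local-part@domain.tld.
    emails = re.findall(
        r'[\w\.-]+@[\w\.-]+\.[\w-]+',
        row['clean_job_description']
    )
    if emails:
        # Several addresses in one description are joined with spaces.
        job_data.at[i, 'E-mail'] = ' '.join(emails)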
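
A description may repeat the same address, or may be missing altogether. As a hedged sketch, the same pattern can be wrapped in a small function that skips non-string values and drops duplicates. The name extract_emails is hypothetical, the sample address is made up, and lowercasing addresses before comparing them is an assumption rather than something the notebook does.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
EMAIL_PATTERN = re.compile(r'[\w\.-]+@[\w\.-]+\.[\w-]+')


def extract_emails(text):
    """Return unique e-mail addresses, lowercased, in order of first appearance."""
    if not isinstance(text, str):
        # Missing descriptions (for example NaN) yield no addresses.
        return []
    unique = []
    for match in EMAIL_PATTERN.findall(text):
        address = match.lower()
        if address not in unique:
            unique.append(address)
    return unique


# Made-up sample text for illustration.
sample = 'Contact jane.doe@example.com or Jane.Doe@example.com.'
print(extract_emails(sample))
```

It could then be applied per row with something like ' '.join(extract_emails(row['clean_job_description'])).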

In [9]:
job_data.head(300)

Unnamed: 0,job_id,job_title,job_description,job_sector,clean_job_description,title_and_clean_desc,label_job_sector,Person,Company,Location,Salary,E-mail
0,19549447,Geography Teacher,<p>Forde Education are looking to recruit a Te...,Education,Forde Education are looking to recruit a Teach...,Geography Teacher Geography Teacher Geography ...,0,Debbie Slater,GCSE,,,
1,7447537,PPA Cover teacher,Teachers Plus is seeking to employ a fully qua...,Education,Teachers Plus is seeking to employ a fully qua...,PPA Cover teacher PPA Cover teacher PPA Cover ...,0,,,Wolverhampton,,
2,26969327,Higher Level Teaching Assistant,We are currently recruiting High Level Teachin...,Education,We are currently recruiting High Level Teachin...,Higher Level Teaching Assistant Higher Level T...,0,Monarch Education,SEN,UK,,
3,7447589,Yr 2 Teacher,A suitably qualified and experienced Yr 2 Teac...,Education,A suitably qualified and experienced Yr 2 Teac...,Yr 2 Teacher Yr 2 Teacher Yr 2 Teacher Yr 2 Te...,0,desirableApplications,ICT,Westminster,,
4,26978624,Science Teachers,<strong>Job Description</strong><br /><br />Mo...,Education,Job Description Most Secondary Schools require...,Science Teachers Science Teachers Science Teac...,0,John,Reed Education,,,john.mcgee@reedglobal.com
5,7461623,TEACHER OF MATHS,A popular mixed Maidenhead secondary school i...,Education,A popular mixed Maidenhead secondary school i...,TEACHER OF MATHS TEACHER OF MATHS TEACHER OF M...,0,,Math’s,,,
6,26996572,Science Teacher,A successful Secondary School in the London Bo...,Education,A successful Secondary School in the London Bo...,Science Teacher Science Teacher Science Teache...,0,Acorn Appointments,the Home Counties,Essex,,
7,7494389,Nursery Manager,Fantastic Nursery Managers vacancy in a beaut...,Education,Fantastic Nursery Managers vacancy in a beaut...,Nursery Manager Nursery Manager Nursery Manage...,0,,•Knowledge,,,
8,26127364,Nursery Deputy Manager,<p><strong>Nursery Deputy Manager</strong></p...,Education,Nursery Deputy Manager £16 000 - £19 000 ...,Nursery Deputy Manager Nursery Deputy Manager ...,0,,Settling,Coventry,16 000 - £19 000,
9,3565164,KS4 History (14/16 yr old),Teachers needed for various assignments Supply...,Education,Teachers needed for various assignments Supply...,KS4 History (14/16 yr old) KS4 History (14/16 ...,0,Stephen Wills,Monarch Education,,,


The extracted data would probably improve if the results were combined with those of other tools, such as DBpedia Spotlight, or if the spaCy model itself were improved.
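
One simple way to combine two annotators is to keep only the entities on which both agree. The sketch below illustrates that idea under stated assumptions: both tools' outputs have already been converted to (text, label) pairs using the same label set, and the sample outputs are made up. It does not call DBpedia Spotlight or any other service, and the function names are hypothetical.

```python
def normalise(text):
    """Lowercase and collapse whitespace so near-identical mentions compare equal."""
    return ' '.join(text.lower().split())


def agreed_entities(primary, secondary):
    """Keep (text, label) pairs from `primary` that `secondary` also reports."""
    confirmed = {(normalise(text), label) for text, label in secondary}
    return [(text, label) for text, label in primary
            if (normalise(text), label) in confirmed]


# Made-up outputs of two annotators for the same description.
spacy_entities = [('Monarch Education', 'PERSON'), ('Stephen Wills', 'PERSON'), ('UK', 'GPE')]
other_entities = [('Monarch Education', 'ORG'), ('stephen wills', 'PERSON'), ('UK', 'GPE')]
print(agreed_entities(spacy_entities, other_entities))
```

Requiring agreement trades recall for precision: a company mislabelled as a person is dropped, but so is any entity that only one tool finds.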

In [10]:
job_data.to_csv('job_data_with_extra_information_extracted.csv')