# Find my NOC

### The goal of this notebook is to map job postings to their correct job code in the National Occupational Classification (NOC)

In [1]:
# utils
import pandas as pd
import numpy as np

# modules
import src.utils.data_cleaning as data_cleaning
import models.demand.demand as demand

### Import relevant datasets
First, let's import the postings. They come from an external MySQL database that aggregates postings from different job posting websites. The database was designed by an external organization for a different project, and I have special access to it for this one. To save time, I downloaded a data dump from the database beforehand and load it from a CSV file here.

In [2]:
postings = pd.read_csv('data/raw/postings.csv')

In [3]:
postings.head()

Unnamed: 0,hash,title,content,noc
0,000007c5b6113eafb89b3b73c811dfb4,FINANCIAL ANALYST Corporate Office,,1112
1,00000be8f81af54b07c885688edb14db,Security Guard,Retail environment. Ensuring COVID19 restricti...,6541
2,00001627d6d70611a25853337fb5912b,Kitchen Exhaust Hood Cleaning Technician,Hood Cleaners of America is looking for candid...,6732
3,000058401cec51520f401a493886e5f5,Child Care Supply / Assistant,Good Shepherd Lutheran Christian Day Care is l...,4214
4,000066567c6fe57b54b79cb41ff4e03a,WINDSOR &#8211; Sales Team Member: Part Time,Princess Auto is a Canadian based multi-channe...,6421


In [4]:
postings.size

1256728

In [5]:
postings.noc.value_counts()

6421    16235
7452    15981
4412    12497
6552    11801
7511    10294
        ...  
8614        1
8441        1
1227        1
2146        1
5245        1
Name: noc, Length: 519, dtype: int64

Only some of the postings are human-labelled. The rest were labelled by a string-matching program that does not use machine learning. We will train the model on the human-labelled data.

In [6]:
jobs_to_tags = pd.read_csv('data/raw/jobs_to_tags.csv')

In [7]:
jobs_to_tags.head()

Unnamed: 0,hash
0,0000773cb4875c8a442094365013263f
1,0001990412714e43f413d6b37b13a166
2,000d1e8051337e97660c0052379d9b0a
3,001a4de013ee3c8ba2f36c5bce851dbe
4,001af489878da488b65b079afe993c82


These hashes identify the job postings that were matched to a NOC by hand.

Now we can keep only the rows of `postings` whose hash also appears in `jobs_to_tags`.

In [8]:
human_labelled_postings = pd.merge(postings,
                        jobs_to_tags,
                        on='hash',
                        how='inner')

In [9]:
human_labelled_postings.size

35288
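Note that `DataFrame.size` counts cells (rows times columns), so the two `.size` values above are not row counts. The following is a minimal sketch on made-up data (the toy hashes, titles, and codes are illustrative assumptions, not values from the dataset) showing the same inner join on `hash` and the difference between `len` and `.size`:

```python
import pandas as pd

# Toy stand-ins for postings and jobs_to_tags (all values are made up).
toy_postings = pd.DataFrame({
    "hash": ["a1", "b2", "c3", "d4"],
    "title": ["Cook", "Welder", "Nurse", "Clerk"],
    "noc": [6322, 7237, 3012, 1411],
})
toy_tags = pd.DataFrame({"hash": ["b2", "d4", "zz"]})

# An inner join keeps only the postings whose hash appears in both frames.
toy_labelled = pd.merge(toy_postings, toy_tags, on="hash", how="inner")

n_rows = len(toy_labelled)    # number of postings kept
n_cells = toy_labelled.size   # rows * columns, not the number of postings
print(n_rows, n_cells)
```

For a row count on the real data, `len(human_labelled_postings)` or `human_labelled_postings.shape` would be the more direct check.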

In [10]:
human_labelled_postings.to_csv('data/interim/human_labelled_postings.csv')

### Data Cleaning

In [11]:
cleaned_postings = data_cleaning.run(input_file='data/interim/human_labelled_postings.csv', output_file='data/cleaned/cleaned_postings.csv')

In [12]:
cleaned_postings.head()

Unnamed: 0,hash,content,noc
0,0000773cb4875c8a442094365013263f,housekeep aid clean assign hospit manner candi...,3414
1,0001990412714e43f413d6b37b13a166,summer summari cream manufactur outgrow commun...,9617
2,000d1e8051337e97660c0052379d9b0a,adjoint bureau emplac provinc type classif rel...,1511
3,001a4de013ee3c8ba2f36c5bce851dbe,develop manag market descript univers student ...,213
4,001af489878da488b65b079afe993c82,covid test site grow fit duti prioriti safeti ...,3414


In [13]:
cleaned_postings.isna().sum()

hash       0
content    0
noc        0
dtype: int64

By the end of this step, the text is fully pre-processed. I kept only nouns and verbs, since they carry the most meaning, and dropping the other words also saves time and space.
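The internals of `data_cleaning.run` are not shown in this notebook, so the following is only a sketch of what a noun-and-verb filter could look like. It assumes the tokens have already been tagged with Penn Treebank part-of-speech tags by some tagger; the helper name `keep_nouns_and_verbs` and the hand-tagged example are hypothetical, and the stemming visible in the output above is not covered.

```python
def keep_nouns_and_verbs(tagged_tokens):
    """Keep lowercased tokens whose tag marks a noun (NN*) or a verb (VB*).

    Assumes Penn Treebank tags; this is an illustration, not the
    notebook's actual cleaning code.
    """
    return [
        token.lower()
        for token, tag in tagged_tokens
        if tag.startswith(("NN", "VB"))
    ]


# Hand-tagged example; in practice the tags would come from a POS tagger.
tagged = [
    ("Clean", "VB"), ("the", "DT"), ("assigned", "VBN"),
    ("hospital", "NN"), ("rooms", "NNS"), ("quickly", "RB"),
]
filtered = keep_nouns_and_verbs(tagged)
print(filtered)
```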

### Prediction of classes
The NOC has a tree structure, so classifying a job posting directly into a 4-digit NOC would be difficult. We can take advantage of the tree structure by making a prediction at each level. The top level has only 10 classes (0-9). Classifying each job posting into one of these 10 classes leaves us with 10 partitions, and within each partition we then predict the second, third, and fourth levels in turn.

For training, we first train on the top level, using the first digit of the NOC as the label. We then partition the data by that label and train a separate model within each partition on the second digit, and so on down the tree.

This logic is implemented in `demand.traversal`, in the demand.py file.
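To make the partitioning concrete, here is a small sketch of how the per-level labels and the training partitions could be derived from the codes. It is not the code in demand.py: the helper names `noc_levels` and `partition_by_parent` are hypothetical, and the zero-padding assumes that codes read as integers have lost a leading zero (for example, 213 standing for 0213).

```python
from collections import defaultdict


def noc_levels(noc):
    """Return the 1-, 2-, 3- and 4-digit prefixes of a NOC code as strings."""
    code = str(noc).zfill(4)  # assumption: restore a leading zero lost by int parsing
    return [code[:k] for k in range(1, 5)]


def partition_by_parent(nocs, level):
    """Group codes by their prefix at `level` (0 = first digit).

    Each key is one partition; its values are the next-level labels
    that a model trained on that partition would have to predict.
    """
    groups = defaultdict(list)
    for noc in nocs:
        levels = noc_levels(noc)
        groups[levels[level]].append(levels[level + 1])
    return dict(groups)


sample = [3414, 3413, 213, 6421]
second_digit_targets = partition_by_parent(sample, 0)
fourth_digit_targets = partition_by_parent(sample, 2)
print(second_digit_targets)
print(fourth_digit_targets)
```

Each key of the returned dictionary corresponds to one partition, which matches the one-model-per-partition pattern visible in the training log below.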

In [14]:
level_3_df = demand.traversal(cleaned_postings)
# traversal leaves the existing data unchanged; it only adds one label column per NOC level,
# which also lets us check whether the concatenation is working

Training model...
Created pipeline
Fitted
Model: class_top_level_0, Train Accuracy: 73.95104895104895%, Test Accuracy: 63.58436606291706%
Training model...
Created pipeline
Fitted
Model: class_3_level_1, Train Accuracy: 88.70523415977961%, Test Accuracy: 67.21311475409836%
Training model...
Created pipeline
Fitted
Model: class_9_level_1, Train Accuracy: 59.210526315789465%, Test Accuracy: 58.77192982456141%
Training model...
Created pipeline
Fitted
Model: class_1_level_1, Train Accuracy: 84.67650397275823%, Test Accuracy: 65.3061224489796%
Training model...
Created pipeline
Fitted
Model: class_0_level_1, Train Accuracy: 88.88888888888889%, Test Accuracy: 63.31360946745562%
Training model...
Created pipeline
Fitted
Model: class_8_level_1, Train Accuracy: 58.46153846153847%, Test Accuracy: 54.54545454545454%
Training model...
Created pipeline
Fitted
Model: class_6_level_1, Train Accuracy: 79.06793048973144%, Test Accuracy: 56.73758865248227%
Training model...
Created pipeline
Fitted
Mode

Fitted
Model: class_143_level_3, Train Accuracy: 94.04761904761905%, Test Accuracy: 89.28571428571429%
Training model...
Created pipeline
Fitted
Model: class_145_level_3, Train Accuracy: 66.66666666666666%, Test Accuracy: 100.0%
Training model...
Training model...
Created pipeline
Fitted
Model: class_131_level_3, Train Accuracy: 66.66666666666666%, Test Accuracy: 85.71428571428571%
Training model...
Created pipeline
Fitted
Model: class_122_level_3, Train Accuracy: 67.79661016949152%, Test Accuracy: 72.5%
Training model...
Created pipeline
Fitted
Model: class_124_level_3, Train Accuracy: 60.99290780141844%, Test Accuracy: 57.446808510638306%
Training model...
Created pipeline
Fitted
Model: class_125_level_3, Train Accuracy: 75.0%, Test Accuracy: 100.0%
Training model...
Created pipeline
Fitted
Model: class_121_level_3, Train Accuracy: 72.72727272727273%, Test Accuracy: 80.0%
Training model...
Created pipeline
Fitted
Model: class_021_level_3, Train Accuracy: 66.66666666666666%, Test Accu

Fitted
Model: class_415_level_3, Train Accuracy: 72.17391304347827%, Test Accuracy: 79.48717948717949%
Training model...
Created pipeline
Fitted
Model: class_416_level_3, Train Accuracy: 62.03703703703704%, Test Accuracy: 52.77777777777778%
Training model...
Training model...
Created pipeline
Fitted
Model: class_421_level_3, Train Accuracy: 62.96296296296296%, Test Accuracy: 70.3125%
Training model...
Created pipeline
Fitted
Model: class_441_level_3, Train Accuracy: 82.85714285714286%, Test Accuracy: 68.57142857142857%
Training model...
Created pipeline
Fitted
Model: class_442_level_3, Train Accuracy: 80.0%, Test Accuracy: 100.0%
Training model...
Created pipeline
Fitted
Model: class_401_level_3, Train Accuracy: 90.0%, Test Accuracy: 71.42857142857143%
Training model...
Training model...
Created pipeline
Fitted
Model: class_403_level_3, Train Accuracy: 57.446808510638306%, Test Accuracy: 62.5%
Training model...
Training model...
Created pipeline
Fitted
Model: class_521_level_3, Train A

In [15]:
level_3_df  # check that every row has a label at each NOC level, i.e. that it was used in training

Unnamed: 0,hash,content,noc,level_0_label,level_1_label,level_2_label,level_3_label
0,0000773cb4875c8a442094365013263f,housekeep aid clean assign hospit manner candi...,3414,3,34,341,3414
1,001af489878da488b65b079afe993c82,covid test site grow fit duti prioriti safeti ...,3414,3,34,341,3414
2,007c9fbb0bc78d8f32a3c0d17b93dcae,endoscop work locat hill look compani health h...,3414,3,34,341,3414
3,01dacff5876c21f148ff0c99b018126a,recreationist time term care home passion pers...,3414,3,34,341,3414
4,01f6d74cdea3fafd7bf77811860da078,porter depart patient summari context patient ...,3414,3,34,341,3414
...,...,...,...,...,...,...,...
8385,2d26ce8ed29ba40bf192919e69464ba8,heritag specialist world lead purpos futur pro...,5112,5,51,511,5112
8386,65f511cecdbe8b90fc4178c8af86aa28,time compani descript provid custom marc shelf...,5111,5,51,511,5111
8387,8b1e08790c2102d612baffe83263c37d,field locat base employ basi contract compens ...,5112,5,51,511,5112
8388,c7cd950d6f10b0f32ed21e710de314f4,head date locat compani univers depart time ti...,5112,5,51,511,5112
