In [1]:
from functions.db import *
from functions.labeling import *
import numpy as np
import time

# Labeling the job descriptions

The next problem is that there are too many jobs. LinkedIn's filters help, but I've found that they're not enough to filter out the jobs I'm not a good match for.

This is where NLP comes in. I've used <a href="https://huggingface.co/docs/transformers/index">HuggingFace's transformers</a> and <a href="https://huggingface.co/facebook/bart-large-mnli">Facebook's BART Large MNLI</a> to create a function that goes through all of the job descriptions in our database and assigns each one a score depending on how well it fits certain labels.

Since I am a data science student, I am interested in jobs that fit the following labels:

* Python
* Data analysis
* SQL

But considering that I am a student, I have to look for junior positions, so I've added the following label too:

* Junior level
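
As a rough sketch of how labels like these can be scored with the `transformers` zero-shot classification pipeline: the helper names `load_classifier`, `score_label` and `score_all` below are hypothetical and are not the functions in `functions.labeling`, whose implementation is not shown here. The classifier is passed in as an argument, so the scoring logic can be tried without downloading the model.

```python
def load_classifier():
    """Load the zero-shot pipeline (downloads the model on first use)."""
    # Imported lazily so the rest of the sketch works without transformers installed.
    from transformers import pipeline
    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def score_label(classifier, label, text):
    """Return the classifier's score for one candidate label on one text."""
    # multi_label=True scores the label on its own instead of normalising
    # the scores across all candidate labels.
    result = classifier(text, candidate_labels=[label], multi_label=True)
    return result["scores"][0]


# The four labels listed above.
LABELS = ["python", "data analysis", "sql", "junior level"]


def score_all(classifier, text, labels=LABELS):
    """Score one job description against every label."""
    return {label: score_label(classifier, label, text) for label in labels}
```

Calling `score_all(load_classifier(), description)` would return one score per label. With this plain pipeline call each score lies between 0 and 1; the scores in this notebook's output go above 1, so the `pipe` helper presumably post-processes them in a way that is not shown here.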

I also wanted to create an overall score, for which I combined the Python, data analysis and SQL scores, weighting them according to my personal interests!
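
To make the weighting concrete, here is a small sketch of the combination. The weights (3 for Python, 1.5 for data analysis, 1 for SQL) are the ones used in the scoring loop in this notebook; the `WEIGHTS` dictionary and the `overall_score` function are names introduced only for this illustration.

```python
# Weights taken from the notebook's scoring loop; the names are illustrative.
WEIGHTS = {"python": 3.0, "analysis": 1.5, "sql": 1.0}


def overall_score(scores, weights=WEIGHTS):
    """Weighted sum of the per-label scores."""
    return sum(weights[label] * scores[label] for label in weights)


# Scores of the first job in the results table.
first_job = {"python": 2.675368, "analysis": 2.588062, "sql": 2.725649}
print(round(overall_score(first_job), 6))  # 14.633846
```

The junior level score is left out of the overall score, so it can be used as a separate filter.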

The function below can be quite slow depending on the machine you're using. I've added a timer that estimates how much time is left until completion!
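
The estimate is simply the average time per finished job multiplied by the number of jobs left. A minimal sketch of that idea as two reusable helpers (the names `estimate_remaining` and `format_remaining` are hypothetical, not part of `functions.labeling`):

```python
def estimate_remaining(durations, jobs_left):
    """Estimate the seconds left: average time per finished job times jobs left."""
    if not durations:
        # Nothing has been timed yet, so there is no basis for an estimate.
        return 0.0
    return jobs_left * sum(durations) / len(durations)


def format_remaining(seconds):
    """Turn a number of seconds into an 'X hours and Y minutes' message."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    return f"{hours} hours and {minutes} minutes remaining..."
```

For example, if the first two jobs took 2 and 4 seconds and 10 jobs are left, `estimate_remaining([2.0, 4.0], 10)` gives 30 seconds.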

In [2]:
# Load every job description from the database as a NumPy array (one row per job)
df = sql_query_to_pandas("SELECT description FROM job_posts").to_numpy()
# Load the zero-shot classification model (helper from functions.labeling)
Bert = Bert_for_zero_shot_classification()
python_scores = []
analysis_scores = []
sql_scores = []
junior_scores = []
overall_scores = []
count = 0
time_array = []  # seconds spent on each job, used for the time estimate
for job in df:
    beginning = time.time()
    # Score the description against each label
    python_score = pipe(Bert, "python", job)
    python_scores.append(python_score)
    analysis_score = pipe(Bert, "data analysis", job)
    analysis_scores.append(analysis_score)
    sql_score = pipe(Bert, "sql", job)
    sql_scores.append(sql_score)
    junior_score = pipe(Bert, "junior level", job)
    junior_scores.append(junior_score)
    # Overall score: weighted sum of the Python, data analysis and SQL scores
    overall_score = 3*python_score + 1.5*analysis_score + sql_score
    overall_scores.append(overall_score)
    count += 1
    # Estimate the remaining time from the average time per job so far
    job_time = time.time() - beginning
    time_array.append(job_time)
    jobs_left = df.shape[0] - count
    avg_time = sum(time_array)/len(time_array)
    remaining_time = jobs_left * avg_time
    hours_remaining = int(remaining_time // 3600)
    minutes_remaining = int((remaining_time % 3600) // 60)
    message = f"{hours_remaining} hours and {minutes_remaining} minutes remaining...          "
    # Refresh the progress message every 10 jobs
    if count % 10 == 0:
        print(message, end = "\r", flush = True)



0 hours and 0 minutes remaining...           

Once we have the lists that contain the scores for each of these labels, we create a dataframe with the `id`s of the jobs and their scores. Then we create a database table for the job scores.

In [3]:
db_df = sql_query_to_pandas("SELECT id FROM job_posts")
db_df["python"] = python_scores
db_df["analytics"] = analysis_scores
db_df["sql"] = sql_scores
db_df["junior"] = junior_scores
db_df["overall"] = overall_scores
db_df

,id,python,analytics,sql,junior,overall
0,1,2.675368,2.588062,2.725649,1.241452,14.633846
1,2,2.065728,2.554153,2.041401,1.645679,12.069816
2,3,1.827861,1.539850,1.603662,1.105828,9.397020
3,4,5.109573,38.508186,5.640140,1.595000,78.731139
4,5,2.141974,3.493265,3.631444,1.651816,15.297265
...,...,...,...,...,...,...
5152,5153,1.211412,1.006622,2.053728,0.865876,7.197896
5153,5154,3.205282,6.215873,6.520350,4.592146,25.460006
5154,5155,2.838358,2.325623,2.084330,1.023668,14.087840
5155,5156,1.930573,1.830294,3.839613,0.860003,12.376772


In [4]:
pandas_to_mysql(db_df, "job_labels")

5157 entries were added to the table job_labels!
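
`pandas_to_mysql` comes from `functions.db` and its implementation is not shown. As a minimal sketch of the same idea, the snippet below writes a small dataframe to a table with pandas' `DataFrame.to_sql`. It uses an in-memory SQLite database so that it runs without a server, which is an assumption made purely for illustration (the notebook itself targets MySQL), and the function name `save_scores` is hypothetical.

```python
import sqlite3

import pandas as pd


def save_scores(df, table, conn):
    """Write the dataframe to `table` and return how many rows it now holds."""
    # index=False keeps the dataframe index out of the table;
    # if_exists="replace" overwrites any earlier version of the table.
    df.to_sql(table, conn, if_exists="replace", index=False)
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# In-memory SQLite stands in for the MySQL database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
# The first two rows of the results table, with a subset of the columns.
scores = pd.DataFrame({
    "id": [1, 2],
    "python": [2.675368, 2.065728],
    "overall": [14.633846, 12.069816],
})
added = save_scores(scores, "job_labels", conn)
print(f"{added} entries were added to the table job_labels!")
```

For a real MySQL database, `to_sql` would need a SQLAlchemy engine instead of the SQLite connection used here.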
