### Parse Dataset to get Job Titles and Corresponding Skills
In this section, we parse the LinkedIn Skills dataset to get the job titles and associated skills we will use for our recommender system.

In [6]:
# import job skill data
import pandas as pd
df = pd.read_csv('data/job_skills.csv')

In [7]:
df.head()

,job_link,job_skills
0,https://www.linkedin.com/jobs/view/housekeeper...,"Building Custodial Services, Cleaning, Janitor..."
1,https://www.linkedin.com/jobs/view/assistant-g...,"Customer service, Restaurant management, Food ..."
2,https://www.linkedin.com/jobs/view/school-base...,"Applied Behavior Analysis (ABA), Data analysis..."
3,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Engineering, Project Controls, Sche..."
4,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Assembly, Point to point wiring, St..."


In [8]:
# first few job links
for i in range(10):
    print(df["job_link"][i])

https://www.linkedin.com/jobs/view/housekeeper-i-pt-at-jacksonville-state-university-3802280436
https://www.linkedin.com/jobs/view/assistant-general-manager-huntington-4131-at-ruby-tuesday-3575032747
https://www.linkedin.com/jobs/view/school-based-behavior-analyst-at-ccres-educational-and-behavioral-health-services-3739544400
https://www.linkedin.com/jobs/view/electrical-deputy-engineering-group-supervisor-at-energy-jobline-3773709557
https://www.linkedin.com/jobs/view/electrical-assembly-lead-at-sanmina-3704300377
https://www.linkedin.com/jobs/view/senior-lead-technician-programmer-at-security-101-3785441848
https://www.linkedin.com/jobs/view/program-consultant-at-methodist-family-health-3588621456
https://www.linkedin.com/jobs/view/veterinary-receptionist-at-wellhaven-pet-health-3803807922
https://www.linkedin.com/jobs/view/sr-technician-receiving-inspection-at-abbott-3799867135
https://www.linkedin.com/jobs/view/experienced-hvac-service-technician-at-lane-valente-industries-37982085

In [9]:
# drop rows with missing skills; copy so later column assignments act on an independent dataframe
print(len(df))
df = df[df['job_skills'].notna()].copy()
print(len(df))

1296381
1294346


In [10]:
# function to parse job title from linkedin link
def extract_title(link):
    # take the last path segment of the URL
    title_text = link.split('/')[-1]
    # split title text along -
    title_text_split = title_text.split('-')
    try:
        # get index of the first 'at', which separates the title from the company
        at_index = title_text_split.index('at')
        # return capitalized version of every word that is part of title
        return ' '.join([word.capitalize() for word in title_text_split[:at_index]])
    except ValueError:
        # no 'at' in the link: fall back to capitalizing every word
        return ' '.join([word.capitalize() for word in title_text_split])

In [11]:
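# test job link parsing on the first 10 rows
# (iterate by position: index labels can have gaps after rows were dropped)
for link in df["job_link"].head(10):
    print(extract_title(link))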

Housekeeper I Pt
Assistant General Manager Huntington 4131
School Based Behavior Analyst
Electrical Deputy Engineering Group Supervisor
Electrical Assembly Lead
Senior Lead Technician Programmer
Program Consultant
Veterinary Receptionist
Sr Technician Receiving Inspection
Experienced Hvac Service Technician
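
The fallback branch of `extract_title` keeps every word of the last URL segment, so a link without an `at` separator would leave the numeric job ID in the title. Below is a minimal sketch of one way to handle that case, assuming the ID is always the final, all-digit word. The name `extract_title_without_id` and the second example link are hypothetical and not part of the dataset.

```python
def extract_title_without_id(link):
    """Hypothetical variant of extract_title that also drops a trailing numeric ID."""
    # last path segment, e.g. 'electrical-assembly-lead-at-sanmina-3704300377'
    words = link.rstrip('/').split('/')[-1].split('-')
    if 'at' in words:
        # same rule as extract_title: keep everything before the first 'at'
        words = words[:words.index('at')]
    elif words and words[-1].isdigit():
        # assumption: with no 'at' separator, a final all-digit word is the job ID
        words = words[:-1]
    return ' '.join(word.capitalize() for word in words)


# link taken from the output above
with_company = extract_title_without_id(
    "https://www.linkedin.com/jobs/view/electrical-assembly-lead-at-sanmina-3704300377")
# made-up link with no 'at' separator
without_company = extract_title_without_id(
    "https://www.linkedin.com/jobs/view/registered-nurse-1234567890")
print(with_company)
print(without_company)
```

Like the original function, this sketch cuts at the first `at`, so a title that itself contains the word "at" would be shortened.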


In [12]:
# apply job title parsing to dataframe
df['job_title'] = df['job_link'].apply(extract_title)

In [13]:
# keep only the title and skills columns; copy so later edits do not touch df
job_title_skills_df = df[["job_title", "job_skills"]].copy()
job_title_skills_df.head()

,job_title,job_skills
0,Housekeeper I Pt,"Building Custodial Services, Cleaning, Janitor..."
1,Assistant General Manager Huntington 4131,"Customer service, Restaurant management, Food ..."
2,School Based Behavior Analyst,"Applied Behavior Analysis (ABA), Data analysis..."
3,Electrical Deputy Engineering Group Supervisor,"Electrical Engineering, Project Controls, Sche..."
4,Electrical Assembly Lead,"Electrical Assembly, Point to point wiring, St..."
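
The `job_skills` column holds each posting's skills as one comma-separated string. If individual skills are needed later, a small helper along these lines could split them. `split_skills` is a hypothetical name, and the sketch assumes that no skill name itself contains a comma.

```python
def split_skills(skills_text):
    """Split a comma-separated skills string into a list of trimmed skill names."""
    # strip whitespace around each piece and drop empty pieces (e.g. from a trailing comma)
    return [skill.strip() for skill in skills_text.split(',') if skill.strip()]


skills = split_skills("Electrical Assembly, Point to point wiring")
print(skills)
```

The notebook itself embeds the whole string at once, so this split is optional.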


In [14]:
# save title + skills dataframe
job_title_skills_df.to_csv("data/skills.csv")
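
By default, `to_csv` also writes the dataframe index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. The sketch below shows the difference on a toy dataframe (the single row is an illustrative stand-in, not the real data); passing `index=False` is one way to avoid the extra column if the index is not needed.

```python
import io

import pandas as pd

# toy stand-in for job_title_skills_df
toy = pd.DataFrame({
    "job_title": ["Electrical Assembly Lead"],
    "job_skills": ["Electrical Assembly, Point to point wiring"],
})

# default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the named columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```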

### Get Skills Embeddings Dataset
After extracting the job titles and skills, we convert each job's comma-separated skills string into a single embedding vector by averaging BERT's token embeddings. The code below is adapted from:
https://www.geeksforgeeks.org/how-to-generate-word-embedding-using-bert/

In [15]:
!pip install transformers





In [16]:
from transformers import BertTokenizer, BertModel
import torch

# load BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# use the GPU if one is available, otherwise fall back to the CPU
# (defined before the function below, which relies on it)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
# move the model to the device (is done in-place)
model.to(device)

# function to get one embedding per text by averaging its token embeddings
def get_BERT_embedding(text):
    # tokenize input (truncated to the model's maximum input length)
    encoding = tokenizer(text,
                         return_tensors='pt',
                         padding=True,
                         truncation=True)
    input_ids = encoding['input_ids'].to(device)
    mask = encoding['attention_mask'].to(device)
    # generate embedding
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=mask)
        # mean over the token dimension; shape (1, 768) for a single string
        embeddings = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
    return embeddings

Example Cosine Similarity

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

# Example skill
text = "Python coding, data analysis"

# Extract embeddings
emb1 = get_BERT_embedding(text)


text = "Snake programming, information science"
emb2 = get_BERT_embedding(text)

cosine_similarity(emb1, emb2)

array([[0.7513383]], dtype=float32)
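
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. Below is a minimal pure-Python sketch of that formula on toy vectors. The helper name `cosine` is hypothetical, and `sklearn`'s `cosine_similarity` remains the function used in this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# vectors pointing in the same direction score close to 1
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
# perpendicular vectors score 0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```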

In [None]:
df = job_title_skills_df

In [None]:
# drop any remaining rows with missing values
print(len(df))
df = df.dropna()
print(len(df))

Getting BERT embeddings takes a while: each of the roughly 1.29 million rows is passed through the model one at a time.

In [None]:
# convert skills to embeddings in dataframe
from tqdm import tqdm
tqdm.pandas()
df['skill_embedding'] = df['job_skills'].progress_apply(get_BERT_embedding)
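
One possible speed-up, not used in this notebook, is to pass several skill strings through the model per call. In that case shorter strings are padded, and a plain `.mean(dim=1)` would average the padding positions too. The sketch below shows a mask-aware mean on toy tensors that stand in for `last_hidden_state` and `attention_mask`. `masked_mean_pool` is a hypothetical helper, shown under the assumption that inputs are batched.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real tokens only, ignoring padding positions."""
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # zero out padding positions, then sum over the token dimension
    summed = (last_hidden_state * mask).sum(dim=1)
    # number of real tokens per row; clamp guards against division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# toy batch: one sequence with two real tokens and one padding position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With a single string per call, as in the cell above, there is no padding and the plain mean gives the same result.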

Save the dataframe with the skill embeddings to a pickle file.

In [None]:
df.to_pickle('skill.pickle')