# Introduction

A talent sourcing and management company is seeking to develop a machine learning-powered pipeline that can spot potential talented candidates, and rank them based on the fitness of their profile against some specific keywords such as “full-stack software engineer”, “engineering manager”, or “aspiring human resources”. The company also would like to be able to manually re-rank the candidate list, so that the candidate they ended up choosing would be set as the ideal candidate independently of their previous rank position. 

For this particular case, the keywords required by the company are: “aspiring human resources” or “seeking human resources”.


Goal(s):
- Predict how fit the candidate is based on their available information (variable fit)


Bonus(es):

- We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action.

- How can we filter out candidates which in the first place should not be in this list?

- Can we determine a cut-off point that would work for other roles without losing high potential candidates?

- Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

In [132]:
# basics
import pandas as pd
import numpy as np
import math
import re
from collections import Counter

# preprocessing tools
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('stopwords')

# embedding models
from sklearn.feature_extraction.text import TfidfVectorizer #TFiDF
import torchtext #GloVe
from gensim.models import Word2Vec # Word2Vec
import transformers #BERT and SBERT
import torch

# metrics
from sklearn.metrics.pairwise import cosine_similarity


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rp3650/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Data exploration

In [133]:
data = pd.read_csv("../data/raw/potential-talents_aspiring-humanresources_seeking-human-resources.csv").set_index("id")

data.head()

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
4,People Development Coordinator at Ryan,"Denton, Texas",500+,
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


Attributes:
- `id`: unique identifier for candidate (numeric)
- `job_title`: job title for candidate (text)
- `location`: geographical location for candidate (text)
- `connections`: number of connections candidate has, 500+ means over 500 (text)

Output (desired target):
- `fit`: how fit the candidate is for the role? (numeric, probability between 0-1)

There is no missing value. 

In [134]:
data.isnull().sum()

job_title       0
location        0
connection      0
fit           104
dtype: int64

There are no duplicates in the dataset.

In [135]:
data.duplicated().sum()

51

In [136]:
data_copy = data.copy()

# Modeling

## Pre-processing

Let's take a look at all possible `job_title` values to spot potential pre-processing procedure to be done before diving into futher steps.

In [137]:
data_copy['job_title'].value_counts()

2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 7
Aspiring Human Resources Professional                                                                                    7
Student at Humber College and Aspiring Human Resources Generalist                                                        7
People Development Coordinator at Ryan                                                                                   6
Native English Teacher at EPIK (English Program in Korea)                                                                5
Aspiring Human Resources Specialist                                                                                      5
HR Senior Specialist                                                                                                     5
Student at Chapman University                                                                                            4
SVP, CHRO, Marke

It looks like there are a number of abbreviation that need to be spelt out so they could be properly tokenized and stemmed.

In [138]:
abbreviations = {
    'GPHR': 'Global Professional in Human Resources',
    'CSR': 'Corporate Social Responsibility',
    'MES': 'Manufacturing Execution Systems',
    'SPHR': 'Senior Professional in Human Resources',
    'SVP': 'Senior Vice President',
    'GIS': 'Geographic Information System',
    'RRP': 'Reduced Risk Products',
    'CHRO': 'Chief Human Resources Officer',
    'HRIS': 'Human resources information system',
    'HR': 'Human resources',
}

def replace_abbreviations(title):
    for k, v in abbreviations.items():
        regex = r'\b{}\b'.format(re.escape(k)) # create a readable regex
        title = re.sub(regex, v, title, flags=re.IGNORECASE)
    return title

We are then ready to clean up all `job_title` values for each row of the dataset. We will apply:
1. abbreviation replacement (as defined in the function above)
2. word tokenization (from the `nltk` package)
3. stemming (from the `nltk` package)
4. lemmatization (from the `nltk` package)

In [139]:
def clean_title(title):

    title = replace_abbreviations(title) # replace abbreviations

    words = word_tokenize(title.lower()) # tokenize words in each job title

    # stemming
    ps = PorterStemmer()
    stems = []
    for word in words:
        stem = ps.stem(word)
        stems.append(stem)

    # lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_stems = []
    engl_stopwords = stopwords.words('english')
    for stem in stems:
        if stem not in engl_stopwords:
            lemma = lemmatizer.lemmatize(stem)
            lemmatized_stems.append(lemma)
            
    return ' '.join(lemmatized_stems)

In [140]:
data_copy['job_title'] = data_copy['job_title'].apply(clean_title)

In [141]:
data_copy.head()

Unnamed: 0_level_0,job_title,location,connection,fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,2019 c.t . bauer colleg busi graduat ( magna c...,"Houston, Texas",85,
2,nativ english teacher epik ( english program k...,Kanada,500+,
3,aspir human resourc profession,"Raleigh-Durham, North Carolina Area",44,
4,peopl develop coordin ryan,"Denton, Texas",500+,
5,advisori board member celal bayar univers,"İzmir, Türkiye",500+,


In [142]:
data_copy.to_csv('../data/processed/potential-talents_aspiring-humanresources_seeking-human-resources_preprocessed.csv') # save pre-processed dataset

While the preprocessing procedure above might have been useful in the past, it is actually unnecessary for modern models. It has indeed been reported that at the current state the available models seem to be able to handle unpreprocessed strings with satisfactory results. while the above section may be a good exemplification of a typical pre-processing pipeline, what follows will be using the original data, with some minimal cleaning involving string lowering and space removal (but without stemming and lemmatization).

In [143]:
## getting nan values for embedding when simple clearning is not done for a few job_titles
def minimal_cleaning(str):
    str = re.sub(r'\s+', ' ', str).strip()
    return str.lower()

data['job_title'] = data['job_title'].apply(minimal_cleaning)
data.to_csv('../data/processed/potential-talents_aspiring-humanresources_seeking-human-resources_preprocessed_minimal.csv') # save pre-processed dataset


## Fitting

The main goal of the project is to predict how fit each candidate for the position of 'Aspiring human resources' based on their job title. In practice, the fit score can be seen as the vector distance between the vector embedding of the job description of a candidate and the vector embedding of the job position offered. So, the fitting here consists of two steps:

1. Get the embedding of the job title of each candidate, and the embedding of the description of the position offered. I will retrieve the embeddings from a different models:
  - Continuous Bag of Words
  - TF-IDF (pre-trained)
  - Word2Vec (pre-trained)
  - FastText (pre-trained)
  - BERT/SBERT (pre-trained)
2. Calculate the distance between the two vectors. For this project, I will use cosine similarity (rather than Euclidean distance). 

It is worth noting that all models just mentioned will be adopted just to see if and how much they differ in the cosine similarity measures. But there will be no ground truth to base the choice of a specific model over the others on.


In [144]:
data.drop('fit', axis=1, inplace=True)

In [145]:
keywords = ['aspiring human resources']
data_master = data.copy()

### CBOW

In the Continuous Bag of Words (CBOW) model, the embeddings are just the sum of the occurrences in the current dataset.

In [146]:
word = re.compile(r'\w+')

def str_to_vec(str):
  return Counter(word.findall(str))

def get_cosine(v1, v2):
  intersect = set(v1.keys()) & set(v2.keys())
  num = sum([v1[i] * v2[i] for i in intersect])
  sum_v1 = sum([v1[j] ** 2 for j in v1.keys()])
  sum_v2 = sum([v2[k] ** 2 for k in v2.keys()])
  denom = math.sqrt(sum_v1) * math.sqrt(sum_v2)
  if not denom:
    return 0.0
  else:
    cosine = float(num/denom)
    return cosine

cbow_data = data_master.copy()
cbow_title_embeddings = [str_to_vec(str) for str in cbow_data['job_title']]
cbow_keywords_embeddings = [str_to_vec(str) for str in keywords]

cbow_cosine = [get_cosine(key_emb, title_emb) for key_emb in cbow_keywords_embeddings for title_emb in cbow_title_embeddings]
cbow_data['cbow_fit'] = cbow_cosine

data = data.merge(cbow_data['cbow_fit'], how='left', left_index=True, right_index=True)
data.sort_values('cbow_fit', ascending=False, inplace=True)
data.head(10)

Unnamed: 0_level_0,job_title,location,connection,cbow_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
36,aspiring human resources specialist,Greater New York City Area,1,0.866025
58,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025
33,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025
17,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025
24,aspiring human resources specialist,Greater New York City Area,1,0.866025
49,aspiring human resources specialist,Greater New York City Area,1,0.866025
60,aspiring human resources specialist,Greater New York City Area,1,0.866025
97,aspiring human resources professional,"Kokomo, Indiana Area",71,0.866025
46,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025
6,aspiring human resources specialist,Greater New York City Area,1,0.866025


### TF-IDF

In [147]:
vectorizer = TfidfVectorizer()

tfidf_data = data_master.copy()
titles = tfidf_data['job_title'].tolist()

tfidf_title_embs = vectorizer.fit_transform(titles)
tfidf_keyword_embs = vectorizer.transform(keywords)

tfidf_cosine = [cosine_similarity(tfidf_keyword_embs, tfidf_title_emb) for tfidf_title_emb in tfidf_title_embs]
cosine_list = []
for i in tfidf_cosine:
  cosine_list.append(i.item())

tfidf_data['tfidf_fit'] = cosine_list

data = data.merge(tfidf_data['tfidf_fit'], how='left', left_index=True, right_index=True)
data.sort_values('tfidf_fit', ascending=False, inplace=True)
data.head(10)

Unnamed: 0_level_0,job_title,location,connection,cbow_fit,tfidf_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
21,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591
46,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591
33,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591
17,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591
58,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591
3,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591
97,aspiring human resources professional,"Kokomo, Indiana Area",71,0.866025,0.753591
6,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679
36,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679
60,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679


### GloVe

In [148]:
glove = torchtext.vocab.GloVe(name='6B', dim=100)

def str_to_glove(str):
  tokens = str.split()
  ind = [glove.stoi[token] for token in tokens if token in glove.stoi]
  vecs = glove.vectors[ind]
  vecs_arr = vecs.numpy()
  embs = vecs_arr.mean(axis=0)
  return embs

glove_data = data_master.copy()
glove_titles = glove_data['job_title'].apply(str_to_glove)
glove_title_embeddings = [title for title in glove_titles]
glove_keywords_embeddings = str_to_glove(keywords[0])

glove_cosines = [cosine_similarity(title_emb.reshape(1,-1), glove_keywords_embeddings.reshape(1,-1))[0,0] for title_emb in glove_title_embeddings]

glove_data['gloVe_fit'] = glove_cosines

data = data.merge(glove_data['gloVe_fit'], how='left', left_index=True, right_index=True)
data.sort_values('gloVe_fit', ascending=False, inplace=True)
data.head(10)

Unnamed: 0_level_0,job_title,location,connection,cbow_fit,tfidf_fit,gloVe_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
36,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001
6,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001
24,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001
49,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001
60,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001
46,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721
21,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721
97,aspiring human resources professional,"Kokomo, Indiana Area",71,0.866025,0.753591,0.948721
3,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721
58,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721


### Word2Vec

In [149]:
def tokenization(str):
  tokens = str.split()
  return tokens

w2v_data = data_master.copy()
tokens = w2v_data['job_title'].apply(tokenization).tolist()

w2v_model = Word2Vec(tokens, vector_size=100, window=5, min_count=2, workers=4)

def str_to_w2v_embedding(str):
  tokens = str.split()
  vecs = [w2v_model.wv[token] for token in tokens if token in w2v_model.wv]
  embeddings = np.mean(vecs, axis=0)
  return embeddings

embeddings_w2v = w2v_data['job_title'].apply(str_to_w2v_embedding)
w2v_list = [embedding for embedding in embeddings_w2v]
keywords_emb = str_to_w2v_embedding(keywords[0])

w2v_cosines = [cosine_similarity(w2v.reshape(1,-1), keywords_emb.reshape(1,-1))[0,0] for w2v in w2v_list]
w2v_data['w2v_fit'] = w2v_cosines

data = data.merge(w2v_data['w2v_fit'], left_index=True, right_index=True, how='left')
data.sort_values('w2v_fit', ascending=False, inplace=True)
data.head(10)


Unnamed: 0_level_0,job_title,location,connection,cbow_fit,tfidf_fit,gloVe_fit,w2v_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
79,liberal arts major. aspiring human resources a...,"Baton Rouge, Louisiana Area",7,0.654654,0.297679,0.892498,1.0
36,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571
24,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571
49,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571
60,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571
6,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571
46,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362
21,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362
97,aspiring human resources professional,"Kokomo, Indiana Area",71,0.866025,0.753591,0.948721,0.881362
3,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362


### BERT

In [150]:
bert = transformers.BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')

def str_to_bert_embedding(str):
  ids = bert_tokenizer.encode_plus(str, add_special_tokens=True, return_tensors='pt')
  out = bert(**ids)
  embeddings = torch.mean(out.last_hidden_state, dim=1)
  return embeddings

bert_data = data_master.copy()
bert_title_embeddings = [emb.detach().numpy() for emb in bert_data['job_title'].apply(str_to_bert_embedding)]
bert_keywords_embeddings = str_to_bert_embedding(keywords[0]).detach().numpy()

bert_cosine = [cosine_similarity(bert_keywords_embeddings, bert_title_embedding).item() for bert_title_embedding in bert_title_embeddings]
bert_data['bert_fit'] = bert_cosine

data = data.merge(bert_data['bert_fit'], how='left', left_index=True, right_index=True)
data.sort_values('bert_fit', ascending=False, inplace=True)
data.head(10)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Unnamed: 0_level_0,job_title,location,connection,cbow_fit,tfidf_fit,gloVe_fit,w2v_fit,bert_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
36,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571,0.90548
24,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571,0.90548
49,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571,0.90548
60,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571,0.90548
6,aspiring human resources specialist,Greater New York City Area,1,0.866025,0.695679,0.953001,0.895571,0.90548
3,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362,0.902632
33,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362,0.902632
58,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362,0.902632
17,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362,0.902632
97,aspiring human resources professional,"Kokomo, Indiana Area",71,0.866025,0.753591,0.948721,0.881362,0.902632


### Interim conclusion

Apart from minimal differences embedding methods, the models seem to behave quite similarly. In absence of an objective evaluation method, the above was merely explorative. In the following, I will just use the BERT embeddings, which have been consistently proven to be quite effective in multiple applications.

## Reranking

The client additionally asked for a way to re-rank candidates, so that a specific subset of candidates are moved to the first positions regardless of their fit score.

There are two ways to do this. 

1. Add the job title of the selected candidate(s) to the keywords
2. Calculate the average of the embeddings of the job title of the selected candidates(s) and of the pre-set keywords.

Below I show both methods while using BERT (but the same can be accomplished with any other model).

### Method 1 - update keyword string

In [151]:
def update_keywords(keywords, candidate_ids, df):
  for i in candidate_ids:
    keywords_join = ' '.join(keywords)
    keywords_l = keywords_join.lower().split()
    job_titles = df.loc[i]['job_title'].lower().split()
    for title in job_titles:
      if title not in keywords_l:
        keywords_l.append(title)
        keywords = ' '.join(keywords_l)
  return keywords

rerank_data = data_master.copy()
candidate_id = [75]
updated_keywords = update_keywords(keywords, candidate_id, rerank_data)

bert_updated_keywords_embeddings = str_to_bert_embedding(updated_keywords).detach().numpy()

bert_cosine_reranked = [cosine_similarity(bert_updated_keywords_embeddings, bert_title_embedding).item() for bert_title_embedding in bert_title_embeddings]

rerank_data['rerank_bert_fit'] = bert_cosine_reranked

data = data.merge(rerank_data['rerank_bert_fit'], how='left', left_index=True, right_index=True)
data.sort_values('rerank_bert_fit', ascending=False, inplace=True)
data.head(10)

Unnamed: 0_level_0,job_title,location,connection,cbow_fit,tfidf_fit,gloVe_fit,w2v_fit,bert_fit,rerank_bert_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
75,"nortia staffing is seeking human resources, pa...","San Jose, California",500+,0.333333,0.102635,0.733283,0.475071,0.60748,0.985951
64,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.0,0.0,0.346589,0.153282,0.621767,0.77085
12,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.0,0.0,0.346589,0.153282,0.621767,0.77085
55,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.0,0.0,0.346589,0.153282,0.621767,0.77085
42,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.0,0.0,0.346589,0.153282,0.621767,0.77085
77,human resources| conflict management| policies...,Dallas/Fort Worth Area,409,0.333333,0.108243,0.732047,0.346412,0.67145,0.750498
100,aspiring human resources manager | graduating ...,"Cape Girardeau, Missouri",103,0.629941,0.282174,0.775113,0.704915,0.659241,0.75031
67,"human resources, staffing and recruiting profe...","Jackson, Mississippi Area",500+,0.471405,0.19758,0.783895,0.43956,0.716683,0.75019
68,human resources specialist at luxottica,Greater New York City Area,500+,0.516398,0.240072,0.799546,0.616314,0.749704,0.749126
82,aspiring human resources professional | an ene...,"Austin, Texas Area",174,0.547723,0.261724,0.836089,0.674098,0.710185,0.748669


The method works also in case we wanted to star more than one candidate.

### Method 2 - embedding averages

In [61]:
def get_candidate_keywords(candidate_id, df):
  candidate_keywords_l = []
  for i in candidate_id:
    candidate_title = df.loc[i]['job_title'].lower().split()
    for word in candidate_title:
      if word not in candidate_keywords_l:
        candidate_keywords_l.append(word)
  candidate_keywords = ' '.join(candidate_keywords_l)
  return candidate_keywords

def averaged_bert_emb(keywords, candidate_id, df):
  bert_keywords_embeddings = str_to_bert_embedding(keywords)

  candidate_keywords = get_candidate_keywords(candidate_id, df)
  bert_candidate_keywords = str_to_bert_embedding(candidate_keywords)

  avg_bert_emb = (bert_keywords_embeddings + bert_candidate_keywords)/2
  return avg_bert_emb

In [82]:
rerank2_data = data_master.copy()
candidate_id = [75]

updated_avg_keywords = averaged_bert_emb(keywords, candidate_id, rerank2_data).detach().numpy()

bert_cosine_reranked_2 = [cosine_similarity(updated_avg_keywords, bert_title_embedding).item() for bert_title_embedding in bert_title_embeddings]

rerank2_data['rerank2_bert_fit'] = bert_cosine_reranked_2

data = data.merge(rerank2_data['rerank2_bert_fit'], how='left', left_index=True, right_index=True)
data.sort_values('rerank2_bert_fit', ascending=False, inplace=True)
data.head(10)

Unnamed: 0_level_0,job_title,location,connection,cbow_fit,tfidf_fit,gloVe_fit,w2v_fit,bert_fit,rerank_bert_fit,rerank2_bert_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
75,"nortia staffing is seeking human resources, pa...","San Jose, California",500+,0.333333,0.102635,0.733283,0.475071,0.60748,0.985951,0.793741
99,seeking human resources position,"Las Vegas, Nevada Area",48,0.57735,0.279124,0.873271,0.64363,0.815409,0.667776,0.720714
38,hr senior specialist,San Francisco Bay Area,500+,0.0,0.0,0.48781,0.188918,0.80481,0.630726,0.712602
26,hr senior specialist,San Francisco Bay Area,500+,0.0,0.0,0.48781,0.188918,0.80481,0.630726,0.712602
8,hr senior specialist,San Francisco Bay Area,500+,0.0,0.0,0.48781,0.188918,0.80481,0.630726,0.712602
51,hr senior specialist,San Francisco Bay Area,500+,0.0,0.0,0.48781,0.188918,0.80481,0.630726,0.712602
61,hr senior specialist,San Francisco Bay Area,500+,0.0,0.0,0.48781,0.188918,0.80481,0.630726,0.712602
46,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362,0.902632,0.64574,0.708697
3,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362,0.902632,0.64574,0.708697
33,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.866025,0.753591,0.948721,0.881362,0.902632,0.64574,0.708697


The method works also in case we wanted to star more than one candidate.

### Conclusions

Both methods seem to work well in case we wanted to star candidates that wouldn't necessarily be at the top fit rank. Moreoever, both methods work also in case we wanted to star more than one candidate.

##### Method 1 - update keyword string

In [None]:
rerank_data = data_master.copy()
candidate_id = [75, 44]
updated_keywords = update_keywords(keywords, candidate_id, rerank_data)

bert_updated_keywords_embeddings = str_to_bert_embedding(updated_keywords).detach().numpy()

bert_cosine_reranked = [cosine_similarity(bert_updated_keywords_embeddings, bert_title_embedding).item() for bert_title_embedding in bert_title_embeddings]

rerank_data['rerank_bert_fit'] = bert_cosine_reranked


rerank_data.sort_values('rerank_bert_fit', ascending=False, inplace=True)
rerank_data.head(10)

Unnamed: 0_level_0,job_title,location,connection,rerank_bert_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
75,"nortia staffing is seeking human resources, pa...","San Jose, California",500+,0.548542
12,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.529591
42,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.529591
55,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.529591
64,"svp, chro, marketing & communications, csr off...","Houston, Texas Area",500+,0.529591
1,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.524188
44,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.524188
57,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.524188
19,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.524188
15,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.524188


##### Method 2 - embedding averages

In [None]:
rerank2_data = data_master.copy()
candidate_id = [75, 44]

updated_avg_keywords = averaged_bert_emb(keywords, candidate_id, rerank2_data).detach().numpy()

bert_cosine_reranked_2 = [cosine_similarity(updated_avg_keywords, bert_title_embedding).item() for bert_title_embedding in bert_title_embeddings]

rerank2_data['rerank2_bert_fit'] = bert_cosine_reranked_2

rerank2_data.sort_values('rerank2_bert_fit', ascending=False, inplace=True)
rerank2_data.head(20)

Unnamed: 0_level_0,job_title,location,connection,rerank2_bert_fit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
75,"nortia staffing is seeking human resources, pa...","San Jose, California",500+,0.737147
1,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.732687
57,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.732687
44,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.732687
31,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.732687
19,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.732687
14,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.732687
15,2019 c.t. bauer college of business graduate (...,"Houston, Texas",85,0.732687
17,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.727043
3,aspiring human resources professional,"Raleigh-Durham, North Carolina Area",44,0.727043
