This project is to predict how fit the candidate is based on their available information (variable fit), utilizing Doc2Vec and BERT models.

**Data Description:**

- id : unique identifier for candidate (numeric)

- job_title : job title for candidate (text)

- location : geographical location for candidate (text)

- connections: number of connections candidate has, 500+ means over 500 (text)

- Output (desired target): fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

**Success Metric(s):**

- Rank candidates based on a fitness score.

- Re-rank candidates when a candidate is starred.

**Bonus(es):**

- We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action.

- How can we filter out candidates which in the first place should not be in this list?

- Can we determine a cut-off point that would work for other roles without losing high potential candidates?

- Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?




# Table of contents
- [Check datatype,  duplicates & missing values](#nan)
- [Doc2Vec](#d2v)
- [BERT](#bt)
- [ELMo](#elmo)
- [Learning to Rank](#ltr)
    - [Doc2Vec](#d2vop)
    - [BERT](#btop)


In [1]:
import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 400)
import gensim
import gensim.downloader as api
import re
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine as cosine_distance
from scipy.spatial.distance import cosine
from sklearn.preprocessing import MinMaxScaler
from nltk.corpus import stopwords
from nltk import download
from sentence_transformers import SentenceTransformer
import lightgbm 
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import ndcg_score
import random
from tqdm import tqdm

download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /Users/thao/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df=pd.read_csv('potential-talents - Aspiring human resources - seeking human resources.csv')
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Program in Korea),Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [3]:
print(f' This dataset has {df.shape[0]} rows and {df.shape[1]} columns.')

 This dataset has 104 rows and 5 columns.


# Check datatype,  duplicates & missing values <a class="anchor" id="nan"></a>

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [5]:
df.isna().sum()

id              0
job_title       0
location        0
connection      0
fit           104
dtype: int64

In [6]:
df.duplicated().sum()

0

In [7]:
# we'll use only 2 columns job_title and connection
docs=df[['job_title','connection']]
docs.head()

Unnamed: 0,job_title,connection
0,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,85
1,Native English Teacher at EPIK (English Program in Korea),500+
2,Aspiring Human Resources Professional,44
3,People Development Coordinator at Ryan,500+
4,Advisory Board Member at Celal Bayar University,500+


# Doc2Vec <a class="anchor" id="d2v"></a>

In [8]:
# prepare dataset for doc2vec
data = [simple_preprocess(d) for d in docs['job_title']]

# look at the first 5 values
for i in range (5):
    print(f"Job title {i}: {docs['job_title'][i]}")
    print('After preprocessing:', data[i],'\n')
    print('=============================\n')

Job title 0: 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
After preprocessing: ['bauer', 'college', 'of', 'business', 'graduate', 'magna', 'cum', 'laude', 'and', 'aspiring', 'human', 'resources', 'professional'] 


Job title 1: Native English Teacher at EPIK (English Program in Korea)
After preprocessing: ['native', 'english', 'teacher', 'at', 'epik', 'english', 'program', 'in', 'korea'] 


Job title 2: Aspiring Human Resources Professional
After preprocessing: ['aspiring', 'human', 'resources', 'professional'] 


Job title 3: People Development Coordinator at Ryan
After preprocessing: ['people', 'development', 'coordinator', 'at', 'ryan'] 


Job title 4: Advisory Board Member at Celal Bayar University
After preprocessing: ['advisory', 'board', 'member', 'at', 'celal', 'bayar', 'university'] 




In [9]:
# Create the tagged document needed for Doc2Vec
def create_tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield TaggedDocument(list_of_words, [i])

train_data = list(create_tagged_document(data))

print(train_data[:1])

[TaggedDocument(words=['bauer', 'college', 'of', 'business', 'graduate', 'magna', 'cum', 'laude', 'and', 'aspiring', 'human', 'resources', 'professional'], tags=[0])]


In [10]:
# Initiate the Doc2Vec model
doc2vec = Doc2Vec(min_count=1, epochs=100)

# Build the Vocabulary
doc2vec.build_vocab(train_data)

# Train the Doc2Vec model
doc2vec.train(train_data, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

# sanity check
print("Raw text:", docs.loc[11,'job_title'])
print("\nAfter preprocessing: ", simple_preprocess(docs.loc[11,'job_title']))
print("\nVector:", doc2vec.infer_vector(simple_preprocess(docs.loc[11,'job_title'])))

Raw text: SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR

After preprocessing:  ['svp', 'chro', 'marketing', 'communications', 'csr', 'officer', 'engie', 'houston', 'the', 'woodlands', 'energy', 'gphr', 'sphr']

Vector: [-0.05448751 -0.1701523  -0.27007812 -0.11856808  0.34031236 -0.12773614
 -0.11242825  0.18488906 -0.08661037 -0.01859559 -0.03029132  0.00327014
 -0.06770435  0.06368679  0.16657469 -0.09503851 -0.05302703 -0.31221947
 -0.09028573  0.04132126  0.0451204  -0.0805016   0.32475853 -0.16146518
  0.13893728 -0.46121028 -0.02062518 -0.08637611 -0.13886283 -0.21672426
  0.1408282   0.23916028  0.19489495 -0.01035442  0.12450544  0.08115862
 -0.28935444 -0.14849536 -0.15848145 -0.18392552  0.12489627  0.01814957
 -0.03956014 -0.08834229 -0.10693482  0.06379726 -0.10545415  0.00879371
 -0.01520778  0.360761   -0.01292751  0.01838173  0.07517787  0.06018162
  0.2141232  -0.10837337  0.10799746  0.12380023 -0.1923312   

In [11]:
query = 'Aspiring human resources'  

# embed all job titles and calculate Cosine Similarity with the query
for index, sentence in docs['job_title'].items():
    doc2vec.random.seed(1)
    docs.loc[index,'Doc2Vec_Cosine_Similarity'] = cosine_similarity(
                                        doc2vec.infer_vector(simple_preprocess(sentence)).reshape(1,-1), 
                                        doc2vec.infer_vector(simple_preprocess(query)).reshape(1,-1))[0][0]

docs.sort_values(by='Doc2Vec_Cosine_Similarity', ascending=False).head(20)

Unnamed: 0,job_title,connection,Doc2Vec_Cosine_Similarity
66,"Human Resources, Staffing and Recruiting Professional",500+,0.996322
2,Aspiring Human Resources Professional,44,0.996053
45,Aspiring Human Resources Professional,44,0.996053
32,Aspiring Human Resources Professional,44,0.996053
57,Aspiring Human Resources Professional,44,0.996053
20,Aspiring Human Resources Professional,44,0.996053
16,Aspiring Human Resources Professional,44,0.996053
96,Aspiring Human Resources Professional,71,0.996053
87,Human Resources Management Major,18,0.99589
88,Director Human Resources at EY,349,0.995562


In [12]:
# turn 500+ connection into 500 
docs['connection'] = np.where(docs['connection'] == '500+ ', 500, docs['connection'])

# scale connection
scaler = MinMaxScaler()
docs[['Scaled_Connection']] = scaler.fit_transform(docs[['connection']])

# calculate ranking by weighted sum of cosine_similarity and scaled_connection
docs['Doc2Vec_Ranking']=docs['Doc2Vec_Cosine_Similarity']*0.8 + docs['Scaled_Connection']*0.2

# sort ranking
docs.sort_values(by='Doc2Vec_Ranking', ascending=False).head(20)

Unnamed: 0,job_title,connection,Doc2Vec_Cosine_Similarity,Scaled_Connection,Doc2Vec_Ranking
66,"Human Resources, Staffing and Recruiting Professional",500,0.996322,1.0,0.997057
77,Human Resources Generalist at Schwan's,500,0.995309,1.0,0.996248
100,Human Resources Generalist at Loparex,500,0.995143,1.0,0.996114
67,Human Resources Specialist at Luxottica,500,0.994954,1.0,0.995963
103,Director Of Administration at Excellence Logging,500,0.99489,1.0,0.995912
70,"Human Resources Generalist at ScottMadden, Inc.",500,0.993441,1.0,0.994753
74,"Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!! (408) 709-2621",500,0.992786,1.0,0.994229
102,Always set them up for Success,500,0.991876,1.0,0.993501
7,HR Senior Specialist,500,0.990402,1.0,0.992322
25,HR Senior Specialist,500,0.990402,1.0,0.992322


# BERT <a class="anchor" id="bt"></a>

In [13]:
# use pre-trained BERT model: all-MiniLM-L6-v2
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)


In [14]:
# calculate Cosine Similarity between the query and all job titles
for index, sentence in docs['job_title'].items():
    docs.loc[index,'BERT_Cosine_Similarity'] = cosine_similarity(model.encode(sentence).reshape(1,-1), 
                                                                 model.encode(query).reshape(1,-1))[0][0]

# sort it    
docs.sort_values(by='BERT_Cosine_Similarity', ascending=False).head(20)

Unnamed: 0,job_title,connection,Doc2Vec_Cosine_Similarity,Scaled_Connection,Doc2Vec_Ranking,BERT_Cosine_Similarity
96,Aspiring Human Resources Professional,71,0.996053,0.140281,0.824898,0.949807
2,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807
45,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807
20,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807
16,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807
32,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807
57,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807
23,Aspiring Human Resources Specialist,1,0.994566,0.0,0.795653,0.928035
35,Aspiring Human Resources Specialist,1,0.994566,0.0,0.795653,0.928035
48,Aspiring Human Resources Specialist,1,0.994566,0.0,0.795653,0.928035


The result looks better than doc2vec when top rankings are job titles including exactly all the keywords.

In [15]:
# calculate ranking by weighted sum of cosine_similarity and scaled_connection
docs['BERT_Ranking']=docs['BERT_Cosine_Similarity']*0.8 + docs['Scaled_Connection']*0.2

# sort ranking
docs.sort_values(by='BERT_Ranking', ascending=False).head(20)

Unnamed: 0,job_title,connection,Doc2Vec_Cosine_Similarity,Scaled_Connection,Doc2Vec_Ranking,BERT_Cosine_Similarity,BERT_Ranking
66,"Human Resources, Staffing and Recruiting Professional",500,0.996322,1.0,0.997057,0.745484,0.796387
52,Seeking Human Resources HRIS and Generalist Positions,500,0.982756,1.0,0.986205,0.74471,0.795768
39,Seeking Human Resources HRIS and Generalist Positions,500,0.982756,1.0,0.986205,0.74471,0.795768
61,Seeking Human Resources HRIS and Generalist Positions,500,0.982756,1.0,0.986205,0.74471,0.795768
9,Seeking Human Resources HRIS and Generalist Positions,500,0.982756,1.0,0.986205,0.74471,0.795768
27,Seeking Human Resources Opportunities,390,0.993028,0.779559,0.950334,0.799641,0.795625
29,Seeking Human Resources Opportunities,390,0.993028,0.779559,0.950334,0.799641,0.795625
96,Aspiring Human Resources Professional,71,0.996053,0.140281,0.824898,0.949807,0.787902
16,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807,0.77708
32,Aspiring Human Resources Professional,44,0.996053,0.086172,0.814077,0.949807,0.77708


# ELMo <a class="anchor" id="elmo"></a>

# Learning to Rank <a class="anchor" id="ltr"></a> 

In [16]:
df_ltr=docs.sort_values('Doc2Vec_Ranking', ascending=False)[['Doc2Vec_Cosine_Similarity','BERT_Cosine_Similarity','Scaled_Connection']]
# set default value for target rank = 0
df_ltr['Target_Rank']=0
df_ltr

Unnamed: 0,Doc2Vec_Cosine_Similarity,BERT_Cosine_Similarity,Scaled_Connection,Target_Rank
66,0.996322,0.745484,1.000000,0
77,0.995309,0.623912,1.000000,0
100,0.995143,0.603002,1.000000,0
67,0.994954,0.622302,1.000000,0
103,0.994890,0.275140,1.000000,0
...,...,...,...,...
56,0.705799,0.573268,0.168337,0
43,0.705799,0.573268,0.168337,0
18,0.705799,0.573268,0.168337,0
30,0.705799,0.573268,0.168337,0


In [17]:
# target rank = 1 for the top 10 positions 
df_ltr.iloc[:10]['Target_Rank']=1
df_ltr

Unnamed: 0,Doc2Vec_Cosine_Similarity,BERT_Cosine_Similarity,Scaled_Connection,Target_Rank
66,0.996322,0.745484,1.000000,1
77,0.995309,0.623912,1.000000,1
100,0.995143,0.603002,1.000000,1
67,0.994954,0.622302,1.000000,1
103,0.994890,0.275140,1.000000,1
...,...,...,...,...
56,0.705799,0.573268,0.168337,0
43,0.705799,0.573268,0.168337,0
18,0.705799,0.573268,0.168337,0
30,0.705799,0.573268,0.168337,0


In [18]:
X=df_ltr[['Doc2Vec_Cosine_Similarity','BERT_Cosine_Similarity','Scaled_Connection']]
y=df_ltr['Target_Rank']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1)

In [19]:
param_dic = {
    'learning_rate': list(np.arange(0.001, 0.1, 0.001)),
    'n_estimators': list(range(5, 101)),
    'num_leaves': list(range(5, 51)),
    'max_depth': list(range(1, 20)),
    'min_split_gain':list(10**i for i in range (-10,1)),
    'min_child_samples':list(range(5, 21))}

def randomized_search_cv_custom(x_train_total, y_train_total):
    
    x_train_total, y_train_total = x_train_total.reset_index(drop=True), y_train_total.reset_index(drop=True)
    
    # we would use the same default value of n_iter=10 as RandomizedSearchCV, 
    # and store the mean score of each iteration into these 2 lists
    train_scores = []
    test_scores = []

    param_list=[]
    random.seed(22)
    for iteration in tqdm(range(10)):
        # for each iteration, we would take a random param set from param_dic and save it into a list
        random_params = {k: random.choice(v) for k, v in param_dic.items()}
        param_list.append(random_params)
        
        # each random param set would have 5 train and test scores  
        trainscores_folds = []
        testscores_folds = []

        skf = StratifiedKFold(n_splits=5, random_state=99, shuffle=True)
        for fold_number, (train_index, test_index) in enumerate(skf.split(X_train, y_train)):
            
            # select datapoints based on test_index and train_index
                x_train_fold = x_train_total.iloc[train_index]
                y_train_fold = y_train_total.iloc[train_index]
                x_test_fold = x_train_total.iloc[test_index]
                y_test_fold = y_train_total.iloc[test_index]

#             # sanity check     
#                 print('Iter', iteration)
#                 print('Fold number', fold_number)
#                 print('Train index',list(x_train_fold.index))
#                 print('Test index', list(x_test_fold.index))
#                 print('--------------------------------------')
                
            # call and fit the classifier on the x_train_fold    
                gbm = lightgbm.LGBMRanker(objective='lambdarank', n_jobs=-1, random_state=12, **random_params)
                query_train = [x_train_fold.shape[0]]
                gbm.fit(x_train_fold, y_train_fold, group=query_train, eval_metric='ndcg')

            # predict x_test_fold and append the ndcg score in the testscores_folds
                test_pred = gbm.predict(x_test_fold)
                y_test_df = pd.DataFrame({"relevance_score": y_test_fold, "predicted_ranking": test_pred})

                true_relevance = y_test_fold.sort_values(ascending=False)
                relevance_score = y_test_df.sort_values("predicted_ranking", ascending=False)
                testscores_folds.append(ndcg_score([true_relevance.to_numpy()], 
                                                   [relevance_score["relevance_score"].to_numpy()]))

            # predict x_train_fold and append the ndcg score in the trainscores_folds
                train_pred = gbm.predict(x_train_fold)
                y_train_df = pd.DataFrame({"relevance_score": y_train_fold, "predicted_ranking": train_pred})

                train_true_relevance = y_train_fold.sort_values(ascending=False)
                train_relevance_score = y_train_df.sort_values("predicted_ranking", ascending=False)
                trainscores_folds.append(ndcg_score([train_true_relevance.to_numpy()], 
                                                    [train_relevance_score["relevance_score"].to_numpy()]))

    # append the mean score for each random param set
    train_scores.append(np.mean(np.array(trainscores_folds)))
    test_scores.append(np.mean(np.array(testscores_folds)))
    
    # get the highest validation score and best param set
    highest_test_score=max(test_scores)
    highest_test_score_index=np.argmax(test_scores)
    best_param=param_list[highest_test_score_index]
    return highest_test_score, best_param

# sanity check 
#randomized_search_cv_custom(X_train, y_train)

highest_test_score, best_param = randomized_search_cv_custom(X_train, y_train)
print('Highest validation score', highest_test_score)
print('\nBest params:', best_param)


100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00,  4.11it/s]

Highest validation score 0.9385455493943781

Best params: {'learning_rate': 0.018000000000000002, 'n_estimators': 36, 'num_leaves': 6, 'max_depth': 15, 'min_split_gain': 1e-08, 'min_child_samples': 8}





In [31]:
# call and fit the best model
gbm_optimal = lightgbm.LGBMRanker(n_jobs=-1, random_state=22, **best_param)
query_train = [X_train.shape[0]]
gbm_optimal.fit(X_train, y_train, group=query_train, eval_metric='ndcg')

# predict and put into the dataframe
X_train["Predicted_relevance_score"] = gbm_optimal.predict(X_train)
X_train['Target_Rank']=y_train

X_test["Predicted_relevance_score"] = gbm_optimal.predict(X_test)
X_test['Target_Rank']=y_test

# sort 
X_train_sorted=X_train.sort_values("Predicted_relevance_score", ascending=False)
X_test_sorted=X_test.sort_values("Predicted_relevance_score", ascending=False)
X_train_sorted.head()

Unnamed: 0,Doc2Vec_Cosine_Similarity,BERT_Cosine_Similarity,Scaled_Connection,Predicted_relevance_score,Target_Rank
25,0.990402,0.503456,1.0,0.926795,1
77,0.995309,0.623912,1.0,0.926795,1
66,0.996322,0.745484,1.0,0.926795,1
102,0.991876,0.148272,1.0,0.926795,1
67,0.994954,0.622302,1.0,0.926795,1


In [32]:
# reshape to calculate ndcg score on train and test sets
y_train_true=[y_train.sort_values(ascending=False).to_numpy()]
y_train_score=[X_train_sorted['Target_Rank'].to_numpy()]

y_test_true=[y_test.sort_values(ascending=False).to_numpy()]
y_test_score=[X_test_sorted['Target_Rank'].to_numpy()]

# calculate ndcg score
ndcg_train=ndcg_score(y_train_true, y_train_score, k=10)
ndcg_test=ndcg_score(y_test_true, y_test_score, k=10)
print('nDCG@10 score on TRAIN set:', round(ndcg_train,2))
print('nDCG@10 score on TEST set:', round(ndcg_test,2))
print('nDCG@10 score DIFFERENCE:', round(ndcg_train-ndcg_test,2))

nDCG@10 score on TRAIN set: 1.0
nDCG@10 score on TEST set: 1.0
nDCG@10 score DIFFERENCE: 0.0
