**Introduction**

This project is a solution for a talent sourcing company that wants to automate its talent sourcing process.
We will do this by using a machine learning model to spot talented individuals and rank them by how well they fit a role.
The model will be trained on historical data, including the skills and experience of successful candidates, the keywords used to search for candidates, and the results of manual reviews.
The model will rank candidates by their likelihood of being a good fit for a specific role.
In addition to automating the talent sourcing process, our solution will allow us to re-rank the list of fitting candidates based on manual review feedback.
This will help the company decide which candidates to pursue and find the best possible candidates for its clients.


* The following is the default initialization of a Kaggle Notebook

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/textsearch/potential-talents - Aspiring human resources - seeking human resources.csv


* Importing the required libraries and setting the I/O max_chars limit to control resource utilization

In [2]:
import IPython
IPython.core.display._iopub_max_chars = 1000000
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import ndcg_score
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import average_precision_score
import numpy as np


**Training the model**

We trained the model using different approaches to check which one gives the best fit.
We found that TF-IDF gives the best similarity scores among the techniques we tried (the others were Word2Vec, BERT, and GloVe).
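
The TF-IDF cell below scores each job title with a plain dot product. That equals cosine similarity only because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). A minimal NumPy sketch of that equivalence follows; the vectors and helper names are made up for illustration and are not the notebook's data.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def cosine(a, b):
    """Textbook cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical raw term weights for a job title and a query (illustrative values only)
title_vec = np.array([1.0, 2.0, 0.0, 2.0])
query_vec = np.array([1.0, 1.0, 0.0, 0.0])

# After normalization, the dot product alone gives the cosine similarity
dot_of_unit_vectors = float(np.dot(l2_normalize(title_vec), l2_normalize(query_vec)))
print(dot_of_unit_vectors, cosine(title_vec, query_vec))
```

If the vectorizer were created with `norm=None`, the dot product would no longer be bounded by 1 and fixed thresholds such as 0.4 would need to be reconsidered.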

* The following code fits a TF-IDF model and computes the similarity score between the keywords and the job_title column for each applicant.

In [3]:

# Load the dataset into the notebook
df = pd.read_csv("/kaggle/input/textsearch/potential-talents - Aspiring human resources - seeking human resources.csv")

# create a TF-IDF vectorizer to extract features from text documents
vectorizer = TfidfVectorizer()

# fit the vectorizer on the job_title column
vectorizer.fit(df['job_title'])

# transform the job_title column into a TF-IDF matrix
tfidf_matrix = vectorizer.transform(df['job_title'])

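# compute the cosine similarity between each row and the target phrase
# (TfidfVectorizer L2-normalizes rows by default, so the dot product below is the cosine similarity)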
target_phrase = 'aspiring human resources'
target_tfidf = vectorizer.transform([target_phrase])
similarity_scores = tfidf_matrix.dot(target_tfidf.T).toarray().flatten()

# add the scores as a new column in the DataFrame
df['tfidf'] = similarity_scores

# filter the DataFrame to show only rows with similarity >= 0.4
similar_jobs = df[df['tfidf'] >= 0.4]

# print the filtered DataFrame
print(similar_jobs, '\n')

# print the number of non-null values in each column of the filtered DataFrame
print(similar_jobs.count())

    id                                          job_title  \
2    3              Aspiring Human Resources Professional   
5    6                Aspiring Human Resources Specialist   
6    7  Student at Humber College and Aspiring Human R...   
8    9  Student at Humber College and Aspiring Human R...   
16  17              Aspiring Human Resources Professional   
20  21              Aspiring Human Resources Professional   
23  24                Aspiring Human Resources Specialist   
24  25  Student at Humber College and Aspiring Human R...   
32  33              Aspiring Human Resources Professional   
35  36                Aspiring Human Resources Specialist   
36  37  Student at Humber College and Aspiring Human R...   
38  39  Student at Humber College and Aspiring Human R...   
45  46              Aspiring Human Resources Professional   
48  49                Aspiring Human Resources Specialist   
49  50  Student at Humber College and Aspiring Human R...   
51  52  Student at Humbe

**Learning To Rank**

In the following code we prepare the DataFrame produced by the TF-IDF step for use in a learning-to-rank (LTR) model that ranks the job titles. We use LightGBM, a framework developed by Microsoft that uses tree-based learning algorithms, to rank the results.
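
The LightGBM configuration in the cell below uses the `lambdarank` objective and reports `ndcg`. As a rough illustration of what NDCG measures, here is one common formulation (exponential gain, log2 discount) written in NumPy. The function names and relevance labels are invented for the example, and LightGBM's own implementation may differ in details such as tie handling.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items, in the order given."""
    rel = np.asarray(relevances, dtype=float)[:k]
    gains = 2.0 ** rel - 1.0                          # exponential gain
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(position + 1), positions start at 1
    return float(np.sum(gains / discounts))

def ndcg_at_k(relevances, k):
    """DCG of the given order divided by the DCG of the ideal (descending) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels, listed in the order a model ranked the items
model_order = [1, 3, 2, 0]
print(ndcg_at_k(model_order, k=4))                        # below 1: the best item is not first
print(ndcg_at_k(sorted(model_order, reverse=True), k=4))  # ideal order scores 1.0
```

A score of 1.0 means the predicted order matches the ideal order of the labels, so the metric is only as meaningful as the labels themselves.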

In [4]:

# Extract the job titles from the relevant column
job_titles = df['job_title'].tolist()

# Define the keywords as a complete sentence
keywords = "Aspiring Human Resources"

# Compute the TFIDF scores
vectorizer = TfidfVectorizer()
tfidf_scores = vectorizer.fit_transform(job_titles)
keyword_scores = vectorizer.transform([keywords])

# Compute the similarity score between each job title (a TF-IDF vector) and the keywords (also a TF-IDF vector).
similarity_scores = tfidf_scores.dot(keyword_scores.T).toarray()

# Choose the rows with similarity scores >= 0.5
selected_rows = np.where(similarity_scores >= 0.5)[0]
dropped_rows = np.where(similarity_scores < 0.5)[0]
print(selected_rows,'\n')
print(dropped_rows,'\n')

# Define the target variable y as a list of integers from 1 to the number of selected rows.
# Note: these labels simply follow row order; they are not derived from review feedback.
y = [i+1 for i in range(len(selected_rows))]

# Prepares the input data X for the model by selecting the rows with similarity scores greater than or equal to 0.5 from the TF-IDF matrix.
X = tfidf_scores[selected_rows,:]

# Converts the resulting sparse matrix to a dense array.
X = X.toarray()

# Computing the length of each job title (in characters and words) and concatenating these features to the TF-IDF matrix using the np.hstack function.
X = np.hstack((X, np.array([len(job_title) for job_title in job_titles])[selected_rows].reshape(-1,1)))
X = np.hstack((X, np.array([len(job_title.split()) for job_title in job_titles])[selected_rows].reshape(-1,1)))

# Define the query information; this is used to avoid dimension errors in the LTR model.
# Array of integers from 0 to the number of selected rows minus one, one entry per row (not used below).
query_ids = np.arange(len(y))

# Group sizes for LightGBM: a single query that contains all selected rows.
query_lengths = np.array([len(selected_rows)])

# Largest label value in y, i.e. the highest relevance grade assigned to any row
max_label = max(y)

print(max_label, '\n')
print(query_lengths, '\n')

# Array containing the unique label values in y.
unique_labels = np.unique(y)
print(unique_labels, '\n')

# Train the LTR model
lgb_train = lgb.Dataset(X, y, group=query_lengths)
params = {
    'boosting_type': 'gbdt',
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [5, 10],
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1
}
model = lgb.train(params, lgb_train, num_boost_round=100)

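# Use the model for ranking (prediction scores each row independently, so no group argument is needed)
rankings = model.predict(X)
# np.argsort sorts ascending, so this list runs from the lowest to the highest predicted score
ranked_job_titles_Aspiring = [job_titles[selected_rows[i]] for i in np.argsort(rankings)]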

# Print the ranked job titles
print("The top ranked job titles for Aspiring are:")
for job_title in ranked_job_titles_Aspiring:
    print(job_title)

[ 2  5 16 20 23 32 35 45 48 57 59 72 96] 

[  0   1   3   4   6   7   8   9  10  11  12  13  14  15  17  18  19  21
  22  24  25  26  27  28  29  30  31  33  34  36  37  38  39  40  41  42
  43  44  46  47  49  50  51  52  53  54  55  56  58  60  61  62  63  64
  65  66  67  68  69  70  71  73  74  75  76  77  78  79  80  81  82  83
  84  85  86  87  88  89  90  91  92  93  94  95  97  98  99 100 101 102
 103] 

13 

[13] 

[ 1  2  3  4  5  6  7  8  9 10 11 12 13] 

The top ranked job titles for Aspiring are:
Aspiring Human Resources Professional
Aspiring Human Resources Specialist
Aspiring Human Resources Professional
Aspiring Human Resources Professional
Aspiring Human Resources Specialist
Aspiring Human Resources Professional
Aspiring Human Resources Specialist
Aspiring Human Resources Professional
Aspiring Human Resources Specialist
Aspiring Human Resources Professional
Aspiring Human Resources Specialist
Aspiring Human Resources Manager, seeking internship in Human Resources.
Aspi

**Sorting**

In the following code we take the N job titles with the highest ranking scores and print the corresponding rows.
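
Because `np.argsort` returns positions in ascending order, picking the best items first takes a reversal, and the resulting positions must then be mapped back to rows of the full DataFrame. A small sketch of that pattern, with a hypothetical helper `top_n_rows` and invented scores:

```python
import numpy as np

def top_n_rows(scores, row_ids, n):
    """Return the row ids of the n highest scores, best first."""
    order = np.argsort(scores)[::-1][:n]   # argsort is ascending, so reverse it
    return [int(row_ids[i]) for i in order]

# Hypothetical model scores for four filtered rows and their positions in the full DataFrame
scores = np.array([0.2, 0.9, 0.5, 0.7])
row_ids = np.array([2, 5, 16, 20])

print(top_n_rows(scores, row_ids, n=3))   # [5, 20, 16]
```

When several scores are equal, their relative order depends on the sort implementation, so ties should not be read as a meaningful ranking.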

In [5]:
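# Get the positions of the N job titles with the highest ranking scores
# (np.argsort is ascending, so the last N entries are the top N, with the best one last)
N = 10
top_indices = np.argsort(rankings)[-N:]

# Print the full row, including the TF-IDF similarity score, for each top-N job title
for i in top_indices:
    # i already indexes into rankings, so map it straight back to the DataFrame row
    index = selected_rows[i]
    print(df.iloc[index])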

id                                               21
job_title     Aspiring Human Resources Professional
location        Raleigh-Durham, North Carolina Area
connection                                       44
fit                                             NaN
tfidf                                      0.753591
Name: 20, dtype: object
id                                             24
job_title     Aspiring Human Resources Specialist
location               Greater New York City Area
connection                                      1
fit                                           NaN
tfidf                                    0.695679
Name: 23, dtype: object
id                                               33
job_title     Aspiring Human Resources Professional
location        Raleigh-Durham, North Carolina Area
connection                                       44
fit                                             NaN
tfidf                                      0.753591
Name: 32, dtype: object
id  

**Starred Applications**

The chosen applications are starred and moved to the top of the query result, and the remaining applications are re-ranked below them.
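
The cell below stars a candidate by assigning it an infinite score. One consequence is that several starred candidates tie with each other, so their relative order depends on the sort. As an alternative sketch (the function `rerank_with_stars` and the scores are hypothetical, not part of the notebook), the starred positions can be placed first explicitly, leaving the model scores untouched:

```python
import numpy as np

def rerank_with_stars(scores, starred):
    """Return positions ordered best first, with starred positions moved to the front.

    Starred positions keep the order in which they were starred; the rest
    follow in descending score order.
    """
    by_score = [int(i) for i in np.argsort(scores, kind="stable")[::-1]]
    starred_set = set(starred)
    rest = [i for i in by_score if i not in starred_set]
    return list(starred) + rest

# Hypothetical scores for five candidates; positions 3 and 0 are starred by a reviewer
scores = np.array([0.1, 0.8, 0.4, 0.3, 0.6])
print(rerank_with_stars(scores, starred=[3, 0]))   # [3, 0, 1, 4, 2]
```

Keeping the original scores intact also makes it possible to un-star a candidate later without retraining or re-predicting.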

In [6]:
# Star a candidate by giving it an infinite score (modifies rankings in place).
# candidate_id is the candidate's position in the rankings array, not the id column.
def star_candidate(candidate_id, rankings):
    rankings[candidate_id] = float('inf')
    return rankings

# Star the candidates at positions 10 and 5 as a trial
rankings = star_candidate(10, rankings)
rankings = star_candidate(5, rankings)

# Print the re-ranked candidates
print(np.argsort(rankings)[::-1])


# Re-ranking function: returns a copy in which every starred candidate has an infinite score
def re_rank_candidates(rankings, starred_candidates):
    new_rankings = rankings.copy()
    for candidate_id in starred_candidates:
        new_rankings[candidate_id] = float('inf')
    return new_rankings

# Get the list of starred candidates
starred_candidates = [10, 5]

# Re-rank the candidates
new_rankings = re_rank_candidates(rankings, starred_candidates)

# Print the re-ranked candidates
print(np.argsort(new_rankings)[::-1])

[10  5 12 11  9  8  7  6  4  3  2  1  0]
[10  5 12 11  9  8  7  6  4  3  2  1  0]


**Print the final result**

The following code prints the rankings in a table with the starred candidates at the top.
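
The `Candidate ID` values in the table below are positions within the filtered rows, not the `id` column of the dataset. A possible way to translate them back into readable rows is sketched here; the small DataFrame, `selected_rows`, and `final_order` are hypothetical stand-ins with invented values, not the notebook's actual variables.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's data (illustrative values only)
titles = pd.DataFrame({
    "id": [3, 6, 17],
    "job_title": ["Title A", "Title B", "Title C"],
}, index=[2, 5, 16])                       # index = row position in the full data
selected_rows = np.array([2, 5, 16])       # rows that passed the similarity filter
final_order = np.array([1, 2, 0])          # positions within selected_rows, best first

# Translate positions back to DataFrame rows, then attach a 1-based rank
result = titles.loc[selected_rows[final_order]].reset_index(drop=True)
result.insert(0, "Ranking", np.arange(1, len(result) + 1))
print(result.to_string(index=False))
```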

In [7]:
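# Order the candidates from best to worst (starred candidates come first)
final_order = np.argsort(new_rankings)[::-1]

# Create a Pandas DataFrame; 'Candidate ID' is the position within the filtered rows
# (separate names are used so the original df and rankings are not overwritten)
results_df = pd.DataFrame({
    'Candidate ID': final_order,
    'Ranking': np.arange(1, len(final_order) + 1)
})

# Print the DataFrame
print(results_df.to_string())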

    Candidate ID  Ranking
0             10        1
1              5        2
2             12        3
3             11        4
4              9        5
5              8        6
6              7        7
7              6        8
8              4        9
9              3       10
10             2       11
11             1       12
12             0       13
