# Potential Talents - An Apziva Project (#3)

By Samuel Alter

Apziva: 6bImatZVlK6DnbEo

# Proceed to the [previous notebook](potential_talents_pt2_ranknet.ipynb) to view my work on RankNet.

## Project Overview

We are working with a talent sourcing and management company to help them surface the candidates who best fit their human resources job post. The dataset contains each candidate's job title, location, and number of LinkedIn connections.

### Goals

Produce a probability, between 0 and 1, of how closely each candidate fits the job description of **"Aspiring human resources"** or **"Seeking human resources."** After an initial recommendation pulls out one or more candidates to be starred for future consideration, the recommendation will be re-run and new "stars" will be awarded.

To help predict how the candidates fit, we are tracking the performance of two success metrics:
* Rank candidates based on a fitness score
* Re-rank candidates when a candidate is starred

We also need to do the following:
* Explain how the algorithm works and how the ranking improves after each starring iteration
* Determine how to filter out candidates who should not be considered at all
* Determine a cut-off point (if possible) that would work for other roles without losing high-potential candidates
* Propose ideas for automating this procedure to reduce or eliminate human bias

### The Dataset

| Column | Data Type | Comments |
|---|---|---|
| `id` | Numeric | Unique identifier for the candidate |
| `job_title` | Text | Job title for the candidate |
| `location` | Text | Geographic location of the candidate |
| `connections` | Text | Number of LinkedIn connections for the candidate |

Connection counts above 500 are encoded as "500+". Some candidates listed only a country rather than a specific location, so I substituted capital cities or geographic centers to represent those countries.

# Imports and Helper Functions

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.model_selection import train_test_split  # used for the train/val/test split below

# LambdaRank

To explore other ranking algorithms, we now turn to LambdaRank, an evolution of the RankNet algorithm from the previous notebook. While RankNet optimizes pairwise accuracy, LambdaRank optimizes for ranking metrics like NDCG, or Normalized Discounted Cumulative Gain. It checks not only whether the first item should be ranked higher than the second, but also how much swapping their order would improve the final ranking. The gain can be thought of this way: a relevant item placed close to the top contributes a greater gain than the same item placed towards the bottom. RankNet minimizes a pairwise loss in which every misordered pair counts equally, while LambdaRank uses **lambdas**: the pairwise gradients scaled by how much swapping the pair would change NDCG. This shifts the model's focus towards the pairs that matter most for overall ranking quality.
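To make the notion of gain concrete, here is a minimal sketch of NDCG in plain Python. It assumes the common exponential gain `2**rel - 1` and a `log2` position discount; the helper names `dcg`, `ndcg`, and `delta_ndcg` are illustrative and are not part of this project's code.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    # Assumption: exponential gain (2**rel - 1) and a log2 position discount.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def delta_ndcg(relevances, i, j):
    """Change in NDCG if the items at positions i and j swap places."""
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return ndcg(swapped) - ndcg(relevances)

# One relevant candidate, ranked first versus ranked last.
relevant_first = [1, 0, 0, 0]
relevant_last = [0, 0, 0, 1]
print(ndcg(relevant_first), ndcg(relevant_last))

# Fixing a misordered pair near the top helps more than fixing one near the bottom.
ranking = [0, 1, 0, 1]
print(delta_ndcg(ranking, 0, 1), delta_ndcg(ranking, 2, 3))
```

The size of `delta_ndcg` for a pair is the quantity LambdaRank uses to weight that pair's gradient, which is why mistakes near the top of the list receive more attention during training.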

For a walkthrough of learning to rank with LightGBM, see [this article](https://tamaracucumides.medium.com/learning-to-rank-with-lightgbm-code-example-in-python-843bd7b44574). Microsoft Research, whose researchers designed the algorithm, also offers [a short overview of LambdaRank](https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/).

[This repository](https://github.com/Ransaka/LTR-with-LIghtGBM) gives a good example of how to implement the algorithm.

In [None]:
## we'll include a validation set for this run
# split data into training+validation and testing sets
pairs_train_val, pairs_test, labels_train_val, labels_test, id_pairs_train_val, id_pairs_test = train_test_split(
    pairs, labels, id_pairs, test_size=0.2, random_state=seed
)

# split the training+validation set into training and validation sets
pairs_train, pairs_val, labels_train, labels_val, id_pairs_train, id_pairs_val = train_test_split(
    pairs_train_val, labels_train_val, id_pairs_train_val, test_size=0.25, random_state=seed
)  # 0.25 x (1.0 - test_size) = 0.2, so validation set is 20% of the original data

# convert to DataFrame for saving to parquet
train_data = pd.DataFrame({
    'input_1': [pair[0].detach().cpu().numpy() for pair in pairs_train],
    'input_2': [pair[1].detach().cpu().numpy() for pair in pairs_train],
    'label': labels_train,
    'id_1': [id_pair[0] for id_pair in id_pairs_train],
    'id_2': [id_pair[1] for id_pair in id_pairs_train],
})
print(f'Finished defining train_data. Shape: {train_data.shape}')

test_data = pd.DataFrame({
    'input_1': [pair[0].detach().cpu().numpy() for pair in pairs_test],
    'input_2': [pair[1].detach().cpu().numpy() for pair in pairs_test],
    'label': labels_test,
    'id_1': [id_pair[0] for id_pair in id_pairs_test],
    'id_2': [id_pair[1] for id_pair in id_pairs_test],
})
print(f'Finished defining test_data. Shape: {test_data.shape}')

# save as parquet files
train_data.to_parquet('../joblib/3_pairs_train.parquet', index=False)
test_data.to_parquet('../joblib/3_pairs_test.parquet', index=False)
print('Finished saving to parquet')
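The two-stage split above is meant to produce a 60/20/20 division of the original pairs. The arithmetic can be checked with a small sketch; `split_fractions` is a hypothetical helper written for illustration, and it ignores the rounding that `train_test_split` applies to actual row counts.

```python
from fractions import Fraction

def split_fractions(test_size, val_size):
    """Return (train, val, test) shares of the original data for a two-stage split.

    test_size is the share held out first; val_size is the share of the
    remainder that is then held out for validation.
    """
    test = Fraction(test_size)
    remainder = 1 - test            # share left after removing the test set
    val = Fraction(val_size) * remainder
    train = remainder - val
    return train, val, test

# Same settings as the split above: test_size=0.2, then test_size=0.25.
train, val, test = split_fractions("0.2", "0.25")
print(train, val, test)  # exact shares as fractions
```

Using `Fraction` keeps the shares exact, so the check does not depend on floating-point rounding.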

In [None]:
# instantiate the LightGBM learning-to-rank model
ranker = lgb.LGBMRanker()