# Learning to rank

Learning to rank is a subfield of machine learning, typically used to order documents by their relevance score. It is rarely used on Kaggle, so I decided to give it a shot on a proxy task in the [American Express](https://www.kaggle.com/competitions/amex-default-prediction) competition.

The data has a temporal structure: each customer has N months of observed history (N = 1, ..., 13).

We are going to test the hypothesis that the dataset contains time-related features, i.e. features that are strongly correlated with time. In particular, I was interested in whether there is a `time_since_customer` type of variable in the dataset. Spotting this by looking at the data is cumbersome. However, learning to rank models are perfect for this task!




# Model setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_importance
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

We are going to use a cleaned-up version of the original data that I prepared. More info in this Kaggle [thread](https://www.kaggle.com/competitions/amex-default-prediction/discussion/328514).

In [None]:
train = pd.read_parquet('../input/amex-data-integer-dtypes-parquet-format/train.parquet')

# train/val split: assign one of 4 folds based on the first character of customer_ID
fi = {'0':0,'1':0,'2':0,'3':0,'4':1,'5':1,'6':1,'7':1,'8':2,'9':2,'a':2,'b':2,'c':3,'d':3,'e':3,'f':3}
train['fold'] = train['customer_ID'].apply(lambda t: fi[t[0]])
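
The lookup table above is the same as integer-dividing the first hex digit by 4. A minimal sketch of that equivalence, assuming every `customer_ID` starts with a lowercase hex character; `fold_of` and the toy IDs are made up for illustration and are not part of the notebook:

```python
def fold_of(customer_id: str) -> int:
    """Map the first hex character of an ID to one of four folds (0..3)."""
    # int(..., 16) turns '0'..'f' into 0..15; // 4 buckets them into 0..3
    return int(customer_id[0], 16) // 4


toy_ids = ["0a1f", "4c22", "9e07", "fb90"]  # made-up IDs, not from the dataset
folds = [fold_of(cid) for cid in toy_ids]
print(folds)
```

Splitting on the ID rather than on rows keeps all observations of a customer in the same fold, which matters here because the ranking groups are customers.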

For simplicity, we will keep only the customers with the full history available (N = 13).

In [None]:
train = train.loc[train.groupby('customer_ID')['customer_ID'].transform('count')==13].reset_index(drop=True)
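
To make the filter concrete, here is a small sketch of the same `transform('count')` pattern on toy data. The frame, the column `x` and `n_required = 3` (standing in for 13) are assumptions for illustration only:

```python
import pandas as pd

# Made-up data: customers a and c have 3 rows each, b has only 2.
toy = pd.DataFrame({
    "customer_ID": ["a"] * 3 + ["b"] * 2 + ["c"] * 3,
    "x": range(8),
})
n_required = 3  # stands in for 13 in the notebook

# transform('count') broadcasts each customer's row count back to its rows,
# so the result can be used directly as a row mask.
counts = toy.groupby("customer_ID")["customer_ID"].transform("count")
full = toy.loc[counts == n_required].reset_index(drop=True)
print(full)
```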

Now we will construct a rank based on the date field `S_2`. The lower the rank, the closer the row is to the last observation date of that customer (rank 1 is the most recent observation).

Important: We are not going to use `S_2` as a feature! The idea is that a model should be able to approximate `S_2` based on other features (if there are time-related features of course).

In [None]:
train['rank'] = train.groupby('customer_ID')['S_2'].rank(ascending=False).astype(int)
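
A quick sketch of what the rank column looks like, using made-up dates for two hypothetical customers. It assumes `S_2` is a datetime column with no duplicate dates per customer:

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b", "b"],
    "S_2": pd.to_datetime([
        "2017-03-01", "2017-04-02", "2017-05-03",
        "2017-03-15", "2017-04-15", "2017-05-15",
    ]),
})

# ascending=False: the latest date within each customer gets rank 1
toy["rank"] = toy.groupby("customer_ID")["S_2"].rank(ascending=False).astype(int)
print(toy)
```

Note that with duplicate dates inside a customer, the default `rank` method would produce fractional (averaged) ranks, which `astype(int)` would truncate.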

# Training the model

Let's build a very simple `rank:pairwise` xgboost model, with each customer acting as one ranking group. We are going to use a GPU, as it speeds up training by at least 30x! More on that in [NVIDIA's blog](https://developer.nvidia.com/blog/learning-to-rank-with-xgboost-and-gpu/#:~:text=XGBoost%20is%20a%20widely%20used,descent%20using%20an%20objective%20function).

In [None]:
# features: everything except identifiers, the date and the target rank
cols = [x for x in train.columns if x not in ('customer_ID','fold','S_2','rank')]

# `group` holds the number of rows per customer; this assumes the rows are
# sorted by customer_ID, so that each customer's rows are contiguous and in
# the same order as the groupby output
tr = train.loc[train.fold == 0].reset_index(drop=True)
dtrain = xgb.DMatrix(tr[cols], label=tr['rank'], group=tr.groupby('customer_ID').size().to_numpy())

va = train.loc[train.fold == 1].reset_index(drop=True)
dvalid = xgb.DMatrix(va[cols], label=va['rank'], group=va.groupby('customer_ID').size().to_numpy())

del train

# note: XGBoost 2.0+ deprecates 'gpu_hist' in favor of tree_method='hist' with device='cuda'
model = xgb.train({'tree_method': 'gpu_hist',
                   'objective': 'rank:pairwise',
                   'subsample': 1,
                   'colsample_bytree': 1},
                  dtrain=dtrain,
                  evals=[(dtrain, 'train'),
                         (dvalid, 'valid')],
                  num_boost_round=100,
                  verbose_eval=100,
                  maximize=True)

plot_importance(model,max_num_features=10, grid=False)
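
A side note on the `group` argument passed to `DMatrix` above: it is a list of consecutive block sizes, so it only lines up with the rows if each customer's rows are contiguous and appear in the same order as the sizes. `groupby(...).size()` returns sizes sorted by key, which matches only when the frame is sorted by `customer_ID`. The sketch below is one way to derive the sizes from the actual row order and to fail loudly otherwise; `group_sizes_in_row_order` is a hypothetical helper, not an XGBoost or pandas function:

```python
import numpy as np
import pandas as pd


def group_sizes_in_row_order(ids: pd.Series) -> np.ndarray:
    """Sizes of consecutive runs of identical IDs, in row order.

    Raises ValueError if an ID appears in more than one run, i.e. the rows
    of at least one group are not contiguous.
    """
    ids = ids.reset_index(drop=True)
    # A new run starts wherever the ID differs from the previous row.
    run_id = (ids != ids.shift()).cumsum()
    grouped = ids.groupby(run_id)
    first_ids = grouped.first()
    if first_ids.duplicated().any():
        raise ValueError("rows of at least one group are not contiguous")
    return grouped.size().to_numpy()


toy_ids = pd.Series(["b", "b", "a", "a", "a", "c"])  # made-up, unsorted IDs
sizes = group_sizes_in_row_order(toy_ids)                   # row order: b, a, c
sorted_sizes = toy_ids.groupby(toy_ids).size().to_numpy()   # key order: a, b, c
print(sizes, sorted_sizes)
```

On the toy data the two arrays differ, which is exactly the situation where passing the sorted sizes would silently mix rows of different customers into one group.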

We can see that three `D` (**D** for Delinquency) features stand out. We will investigate this further, but first let's see how accurate our very simple model is at predicting the time rank.

In [None]:
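va['pred'] = model.predict(dvalid)
# the ranker gives higher scores to higher labels, so ranking the scores in
# ascending order within each customer puts them on the same scale as `rank`
va['rank_pred'] = va.groupby('customer_ID')['pred'].rank().astype(int)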

cm = confusion_matrix(va['rank'],va['rank_pred'],normalize='true')
cmp = ConfusionMatrixDisplay(cm)
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax)

It seems the model predicts the rank correctly in 40-50% of all instances. Quite good. However, this rejects the hypothesis that we have a `time_since_customer` feature in the dataset: the accuracy would be close to 100% if the hypothesis were true.
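
The accuracy figure is read off the diagonal of the row-normalised confusion matrix. Below is a minimal NumPy sketch of that calculation on made-up ranks, together with the chance level for 13 ranks (a uniformly random ordering matches a given position with probability 1/13, roughly 7.7%). `per_rank_accuracy` is a hypothetical helper, and the sketch assumes every true rank occurs at least once:

```python
import numpy as np


def per_rank_accuracy(y_true, y_pred, n_ranks):
    """Diagonal of the row-normalised confusion matrix (recall per true rank)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((n_ranks, n_ranks), dtype=float)
    # Ranks are 1-based, matrix indices are 0-based.
    np.add.at(cm, (y_true - 1, y_pred - 1), 1)
    return np.diag(cm) / cm.sum(axis=1)


# Made-up ranks for two customers with 3 observations each.
y_true = [1, 2, 3, 1, 2, 3]
y_pred = [1, 3, 2, 1, 2, 3]

acc = per_rank_accuracy(y_true, y_pred, n_ranks=3)
overall = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
random_baseline = 1 / 13  # chance level for exact matches with 13 ranks
print(acc, overall, random_baseline)
```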

Let's dig in a bit deeper!

# Understanding `D_59`

Feature importance indicates that there is something special about this feature. Let's display a few customers:


In [None]:
pd.set_option('display.max_rows', 100)
va[['customer_ID','rank','rank_pred','D_59']].head(13*3)

Quite interesting! For some customers `D_59` correlates with time (and the time rank prediction is almost perfect). For other customers there is no such behavior, as `D_59` acts more like a constant.

We know that `D_59` is a delinquency feature. We can assume that `D_59` represents how many credit installments/months a customer is late in repaying their credit card. However, in that case we would expect to see clients with `D_59 = 0` (no delinquencies), and none are present in the dataset. Maybe the dataset was somehow stratified to include only previously delinquent customers? There is a lot of room to speculate here.

There are 2 clear segments (`D_59` correlates / does not correlate with time), which opens the opportunity to analyze these segments individually.

# Client segmentation

In [None]:
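# create arbitrary segments based on the min/max spread of D_59 per customer
# (rows with D_59 == -1 are excluded before aggregating)
gg = va.loc[va.D_59!=-1].groupby('customer_ID')['D_59'].agg(['max','min']).reset_index()
s1 = gg.loc[gg['max']-gg['min']>9].reset_index(drop=True)   # D_59 moves over time
s2 = gg.loc[gg['max']-gg['min']<=9].reset_index(drop=True)  # D_59 is roughly constant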

s1.shape, s2.shape
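
To illustrate the segmentation rule, here is a toy version with three hypothetical customers. The values are made up, and the sketch assumes `-1` marks a missing value, as the filter above suggests:

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a"] * 3 + ["b"] * 3 + ["c"] * 3,
    "D_59": [2, 8, 14, 5, 5, 6, -1, -1, -1],  # made-up values
})
threshold = 9  # same arbitrary cut-off as in the cell above

valid = toy.loc[toy["D_59"] != -1]
spread = valid.groupby("customer_ID")["D_59"].agg(["max", "min"])
spread["range"] = spread["max"] - spread["min"]

temporal = spread.index[spread["range"] > threshold].tolist()
flat = spread.index[spread["range"] <= threshold].tolist()
# Customers with only -1 values are filtered out and land in neither segment.
dropped = sorted(set(toy["customer_ID"]) - set(spread.index))
print(temporal, flat, dropped)
```

One consequence worth keeping in mind: customers whose `D_59` is always `-1` belong to neither segment, so the two segments do not necessarily cover the whole validation fold.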

Let's see what the model's confusion matrix looks like for these 2 segments separately.

### segment with temporal `D_59`:

In [None]:
seg = va.loc[va.customer_ID.isin(s1.customer_ID)]
cm = confusion_matrix(seg['rank'],seg['rank_pred'],normalize='true')
cmp = ConfusionMatrixDisplay(cm)
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax)

### segment without temporal `D_59`:

In [None]:
seg = va.loc[va.customer_ID.isin(s2.customer_ID)]
cm = confusion_matrix(seg['rank'],seg['rank_pred'],normalize='true')
cmp = ConfusionMatrixDisplay(cm)
fig, ax = plt.subplots(figsize=(10,10))
cmp.plot(ax=ax)

As expected, the model's accuracy is 2-3x better for the segment showing temporal `D_59` behavior.

# End notes

There is potential to segment clients based on their delinquency status. However, it is not clear whether doing so gives any advantage in overall default risk modelling.