<div style="padding:20px;color:white;margin:0;font-size:175%;text-align:center;display:fill;border-radius:5px;background-color:#016CC9;overflow:hidden;font-weight:500">D: The Default Rate captured at 4%</div>

As explained by the organizers [here](https://www.kaggle.com/competitions/amex-default-prediction/overview/evaluation), the competition's metric M is the average of two sub-metrics: G and D. We will focus here on the D part.


The organizers have provided us with the [code](https://www.kaggle.com/code/inversion/amex-competition-metric-python) to calculate the metric. For those who have been using the provided functions and would like to know more about how they work, this notebook walks step by step through the calculation of the D part of the metric.


[AmbrosM](https://www.kaggle.com/ambrosm) has published a beautiful graphical 
[explanation](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464) for the two components of the metric.

# <b><span style='color:#4B4B4B'>1 |</span><span style='color:#016CC9'> Submission file</span></b>

We are going to work through an example submission. The submission file below was built from the train customers: a model was trained and then applied to some of them. We have added the column "target" with the true value coming from the file train_labels.csv.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

sub = pd.read_csv('../input/xgbwfecublend/submission.csv')
print(sub.head(5).to_markdown())

As you can see, this submission file contains 3 columns:
* customer_ID (these are some customers from the training data)
* target (the true label, from the train_labels file provided by Amex)
* prediction (the output of a trained model).


We actually have everything we need to calculate the amex score, so let's do it.

# <b><span style='color:#4B4B4B'>2 |</span><span style='color:#016CC9'> Score</span></b>

We are going to use the functions provided by the organizers. We simply extracted the two sub-metrics to make them callable independently.

In [None]:
def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    df = (pd.concat([y_true, y_pred], axis='columns')
             .sort_values('prediction', ascending=False))
    df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
    four_pct_cutoff = int(0.04 * df['weight'].sum())
    df['weight_cumsum'] = df['weight'].cumsum()
    df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
    return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    df = (pd.concat([y_true, y_pred], axis='columns')
            .sort_values('prediction', ascending=False))
    df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
    df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
    total_pos = (df['target'] * df['weight']).sum()
    df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
    df['lorentz'] = df['cum_pos_found'] / total_pos
    df['gini'] = (df['lorentz'] - df['random']) * df['weight']
    return df['gini'].sum()

def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    y_true_pred = y_true.rename(columns={'target': 'prediction'})
    return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)

In [None]:
score = amex_metric(sub[['target']],sub[['prediction']])
print (f'The amex score is: {score:.3f}')

We have a score of **0.796**. Let's now look at both sub-scores separately.

In [None]:
G = normalized_weighted_gini(sub[['target']],sub[['prediction']])
D = top_four_percent_captured(sub[['target']],sub[['prediction']])

print(f'The score split is as follows. G: {G:.3f} and D: {D:.3f}')

This is a big surprise. The score is not balanced: G is much higher than D. You can check this on your own validation data while cross-validating your models. In my case, **D is always worse than G**: D is in the 60s and G in the 90s. The barrier at 0.80 seems to come from D.

For this reason, it is worth understanding exactly how D is calculated. It may give us some insight into improving our models or our feature selection. Maybe someone will crack D and break the 0.800 barrier!

 # <b><span style='color:#4B4B4B'>3 |</span><span style='color:#016CC9'> Adjusting the subsampling</span></b>

To understand the metric calculation, we really need to understand the subsampling process. This was the most confusing part in my case. As stated by the organizers:

**Note that the negative class has been subsampled for this dataset at 5%**

In the original data, the ratio of defaulting customers is very low. To give us the same number of defaulters (and therefore enough signal to train our models), Amex would need to provide files 20 times larger, but the files are already gigantic.
In the train and test data, the share of non-defaulting customers has therefore been artificially decreased. To achieve this, Amex has subsampled the negative class. In other words, **they have removed 19 out of 20** non-defaulting customers to increase the density of defaulting customers.

 **and thus receives a 20x weighting in the scoring metric**.
But for the scoring, Amex wants to reproduce the original conditions, so we need to undo the subsampling. Since the removed customers are lost, we need an artificial adjustment: every non-defaulting customer (target=0) is duplicated 20 times, or equivalently given a weight of 20, while a defaulting customer (target=1) keeps a weight of 1.
In other words, the subsampling removed 19 out of 20 non-defaulting customers, and the scoring artificially recreates them by counting each remaining non-defaulting customer 20 times.
 
 So this is the thing to keep in mind, **during the scoring, a non defaulting customer (target=0) is "multiplied" 20 times** (through a weight).

In [None]:
sub['weight']=20-sub['target']*19
print(sub.head().to_markdown())

We have added a weight column. Think of the weight as a multiplication factor. Each non-defaulting customer (target=0) 'exists' 20 times. Each defaulting customer (target=1) 'exists' only once.
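To make the "multiplication" concrete, here is a minimal sketch on a made-up five-customer frame. The `toy` data and the use of `Index.repeat` to physically duplicate rows are assumptions for illustration only; the organizers' code never expands the data, it only uses the weights. The sketch checks that summing the weights gives the same count as really duplicating the rows.

```python
import pandas as pd

# Hypothetical toy data: 2 defaulters and 3 non-defaulters
toy = pd.DataFrame({'target': [1, 0, 0, 1, 0]})
toy['weight'] = 20 - toy['target'] * 19   # 20 for target=0, 1 for target=1

# Physically duplicate every row 'weight' times
expanded = toy.loc[toy.index.repeat(toy['weight'])]

weighted_len = int(toy['weight'].sum())
print(f'sum of weights: {weighted_len}, rows after duplication: {len(expanded)}')
```

Both numbers agree, which is why the metric can work with a weight column instead of a file 20 times larger.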

In [None]:
print(f'the raw length of the submission file is {len(sub)}')
length_after_adjustment = sub['weight'].sum()
print (f'the length of the adjusted submission file is {length_after_adjustment}')

The raw submission file contains 91,782 customers, but the adjusted file, which reproduces the original conditions, contains 1,386,879 customers (most of them artificially created by multiplication).

In [None]:
rr=sub['target'].sum()/len(sub)
print(f'The raw ratio of defaulting customers is: {rr*100:.1f}%')
ar=(sub['target'].sum()/length_after_adjustment)
print(f'The adjusted ratio of defaulting customers is: {ar*100:.1f}%')

In [None]:

fig, ax = plt.subplots(ncols=2,figsize=(15,10))
sizes=[sub['target'].sum(),len(sub)-sub['target'].sum()]
labels=['Default','Non Default']
ax[0].pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax[0].axis('equal') 
ax[0].set_title('subsampled')
sizes=[sub['target'].sum(),length_after_adjustment-sub['target'].sum()]
labels=['Default','Non Default']
ax[1].pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax[1].axis('equal') 
ax[1].set_title('adjusted for subsampling')

plt.show()

In the train and test files we are working on, the defaulting customers represent around 25% of the customers. But in the original files at Amex, the share is much lower (1.70% in this example). By adjusting the submission file, we get back to this original ratio. As a rough idea, in the train/test datasets, the ratio is 1 Default to 3 Non-Defaults. After adjustment, the ratio becomes roughly 1 Default to 60 Non-Defaults.
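The link between the two ratios can be written down directly. The sketch below is only an illustration: the helper name `adjusted_default_rate` is hypothetical, and the number of defaulters is derived from the two row counts quoted above (91,782 raw rows and 1,386,879 adjusted rows) rather than read from the data.

```python
def adjusted_default_rate(raw_rate: float, neg_weight: int = 20) -> float:
    """Default rate once every non-defaulter counts neg_weight times."""
    return raw_rate / (raw_rate + neg_weight * (1.0 - raw_rate))

# Row counts quoted above for the example submission
n_raw = 91_782          # rows in the raw submission file
n_adjusted = 1_386_879  # sum of the weights

# n_adjusted = n_pos + 20 * (n_raw - n_pos), solved for n_pos
n_pos = (20 * n_raw - n_adjusted) // 19

raw_rate = n_pos / n_raw
adjusted_rate = adjusted_default_rate(raw_rate)
print(f'defaulters: {n_pos}')
print(f'raw rate: {raw_rate:.1%}, adjusted rate: {adjusted_rate:.2%}')
```

The formula is just P / (P + 20 N) rewritten in terms of the raw rate, where P and N are the numbers of defaulters and non-defaulters in the subsampled file.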

 # <b><span style='color:#4B4B4B'>4 |</span><span style='color:#016CC9'> 4% cut-off</span></b>

We will first sort the customers by prediction (probability of default) from highest to lowest.

In [None]:
sub_sorted=sub.sort_values('prediction',ascending=False)
print(sub_sorted.head(5).to_markdown())

We add a column weight_cumsum. This is a counter of the adjusted number of customers up to this row. 

In [None]:
sub_sorted['weight_cumsum']=sub_sorted['weight'].cumsum()
print(sub_sorted.head(5).to_markdown())
print(sub_sorted.tail(5).to_markdown())


We will now select the first 4% of the sorted and adjusted customers. 

In [None]:
sub_cutoff=sub_sorted.loc[sub_sorted['weight_cumsum'] <= (0.04*length_after_adjustment)]

The idea here is to select the 4% of (adjusted) customers with the highest probability of defaulting according to the model.

In [None]:
print(f'cut_off at : {len(sub_cutoff)/len(sub)*100:.1f}%')

We can see that we have actually selected much more than 4% of the submission file: we have kept 19.3% of its rows. This is explained by the adjustment process. To make it clearer, we will draw the cutoff after and before adjustment.

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.scatter(sub_sorted['weight_cumsum'],sub_sorted['prediction'],s=1)
ax.set_title('sorted prediction. cut-off after adjustment',fontsize = 12)
plt.xlabel("adjusted prediction rank (cumulative weight)",fontsize = 12)
plt.ylabel("default prediction",fontsize = 12)
plt.ticklabel_format(style='plain')
plt.xlim(left=0)
plt.axvline(x=0.04*length_after_adjustment,color='r')
plt.text(100000,0.8,'The red line is the 4% cutoff (after adjustment). The customers on the left of this line are selected.', fontsize = 12)
 

plt.show()

However, if we represent the raw submission file, the 4% cutoff looks like this.

In [None]:
import numpy as np
fig, ax = plt.subplots(figsize=(15,10))
l=len(sub_sorted)
ax.scatter(np.arange(l),sub_sorted['prediction'],s=1)
ax.set_title('sorted prediction, what the cutoff looks like on the raw data',fontsize = 12)
plt.xlabel("prediction rank",fontsize = 12)
plt.xlim(left=0)
plt.ylabel("default prediction",fontsize = 12)
plt.axvline(x=len(sub_cutoff),color='r')
plt.text(20000,0.8,'The red line is the 4% cutoff. The customers on the left of this line are selected.', fontsize = 12)
 

plt.show()

Wait a minute, what happened there? The red line has shifted! Instead of selecting 4% of the submission file, we are selecting 19.3% of it. This is all due to the adjustment process, which is not a proportional stretching of the curve. The customers on the right are far more likely to be non-defaulters and therefore to get multiplied by 20. So the stretching mostly happens to the right of the red line, and the selected fraction goes from 19.3% of the raw rows to 4% after adjustment.
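The same effect can be reproduced on made-up data. The sketch below assumes a hypothetical perfect ranking (all 250 defaulters first, then 750 non-defaulters), which is not what a real model produces. The cutoff is computed with integer arithmetic to avoid floating-point edge cases in this toy case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ranking, already sorted by decreasing prediction:
# 250 defaulters first, then 750 non-defaulters
toy = pd.DataFrame({'target': np.r_[np.ones(250, dtype=int), np.zeros(750, dtype=int)]})
toy['weight'] = 20 - toy['target'] * 19
toy['weight_cumsum'] = toy['weight'].cumsum()

total_weight = int(toy['weight'].sum())   # 250*1 + 750*20
cutoff = 4 * total_weight // 100          # 4% of the adjusted length
selected = toy.loc[toy['weight_cumsum'] <= cutoff]

raw_fraction = len(selected) / len(toy)
adjusted_fraction = selected['weight'].sum() / total_weight
print(f'raw rows selected: {raw_fraction:.1%}, adjusted share: {adjusted_fraction:.1%}')
```

The rows to the left of the cutoff are mostly weight-1 defaulters, so many raw rows fit into a small adjusted share, while everything to the right is stretched by a factor of 20.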

 # <b><span style='color:#4B4B4B'>5 |</span><span style='color:#016CC9'> D calculation</span></b>

Now that we have selected the 4% of customers most likely to default according to the model, we want to know whether we captured most of the true defaulters. This is the True Positive Rate (or recall) at this cutoff: the ratio of the True Positives captured within the 4% threshold to the total number of actual positives (defaulters).
It's not really clear where this 4% comes from; it may be a fixed number that Amex is using (maybe regulatory). In any case, Amex wants us to produce a model that captures as many True Positives as possible within this 4% limit. This is exactly what D measures.

First we calculate the number of True Positives (target=1) within this 4% cutoff. Then we simply divide it by the total number of positives in the submission file.
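The same two steps can be written as a formula. The notation is introduced here for illustration and is not the organizers': customers are indexed $i = 1, \dots, n$ in decreasing order of prediction, $y_i \in \{0, 1\}$ is the target, $w_i = 20 - 19\,y_i$ is the weight, $c_i = \sum_{j \le i} w_j$ is the cumulative weight, and $W = \sum_{i} w_i$ is the adjusted length.

$$
D = \frac{\sum_{i=1}^{n} y_i \,\mathbb{1}\left[\,c_i \le 0.04\,W\,\right]}{\sum_{i=1}^{n} y_i}
$$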

In [None]:
TP=(sub_cutoff['target']==1).sum() # Number of True Positives within the 4% cutoff
P=(sub['target']==1).sum() # Total number of actual positives (defaulters)

TPR=TP/P # Ratio to get the True Positive Rate
print(f'True Positive Rate with a cutoff at 4%: {TPR:.3f}')

We get **0.667**. We have recalculated D in the same way as the function provided by Amex!

Now the description of D in the [Evaluation part](https://www.kaggle.com/competitions/amex-default-prediction/overview/evaluation)  of the competition is clear:

**The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions**

This is done after adjustment for subsampling.
The wider the cutoff (5%, 6%...), the more True Positives are being captured, but unwanted False Positives will also be captured.
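To experiment with other cutoffs, here is a small sketch that generalizes the calculation to an arbitrary percentage. The function name `default_rate_captured`, its `pct` parameter and the toy data are assumptions made for illustration; the organizers' function hard-codes 4%.

```python
import pandas as pd

def default_rate_captured(df: pd.DataFrame, pct: float = 0.04) -> float:
    """Share of defaulters captured in the top `pct` of weight-adjusted ranks."""
    ranked = df.sort_values('prediction', ascending=False)
    weight = 20 - ranked['target'] * 19          # 20 for target=0, 1 for target=1
    cutoff = int(pct * weight.sum())             # same truncation as the organizers' code
    captured = ranked.loc[weight.cumsum() <= cutoff, 'target']
    return captured.sum() / ranked['target'].sum()

# Hypothetical toy submission: 4 defaulters, 6 non-defaulters
toy = pd.DataFrame({
    'target':     [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    'prediction': [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10],
})

for pct in (0.04, 0.20, 0.60):
    print(f'cutoff {pct:.0%}: captured {default_rate_captured(toy, pct):.2f}')
```

On this toy data the captured share rises as the cutoff widens, but each wider cutoff also lets in more non-defaulters.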

 # <b><span style='color:#4B4B4B'>6 |</span><span style='color:#016CC9'> Conclusion</span></b>

Note that D=0.667 is not great: it means that the model missed a third of the defaulters when the 4% cutoff was applied. This seems to be where there is room for progress.
It would be really interesting to see each part of the score, G and D, on the leaderboard. Which sub-metric explains the gap between 0.795 and 0.800? Are we improving our models on the G or the D part? You can see an example of the learning curves for both sub-metrics [here](https://www.kaggle.com/competitions/amex-default-prediction/discussion/334157).