This notebook is intended for those who want a gentle introduction to the Evaluation Metric for this competition. 

I created this notebook to decompose the Evaluation Metric, M, into its components, D and G.

In [None]:
import pandas as pd

Let's create a simple benchmark to calculate the Evaluation Metric. We will use the *P_2* column in train_data.csv to build *y_pred* and train_labels.csv to build *y_true*.

In [None]:
train_data = pd.read_csv('../input/amex-default-prediction/train_data.csv', index_col='customer_ID', usecols=['customer_ID', 'P_2'])
train_labels = pd.read_csv('../input/amex-default-prediction/train_labels.csv', index_col='customer_ID')

In [None]:
train_data.head()

In [None]:
train_labels.head()

Since there are multiple rows for each *customer_ID*, we group them by customer and use the mean of the *P_2* column as our prediction.

In [None]:
ave_p2 = (train_data.groupby('customer_ID').mean().rename(columns={'P_2': 'prediction'}))

# Scale the mean P_2 by the max value and take the complement
ave_p2['prediction'] = 1.0 - (ave_p2['prediction'] / ave_p2['prediction'].max())

ave_p2.head()
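
To make the aggregation and rescaling concrete, here is a minimal sketch on a hypothetical three-row frame (the customer IDs `a` and `b` and the *P_2* values are made up for illustration). It follows the same two steps as the cell above: average *P_2* per customer, then divide by the maximum and take the complement, so the customer with the lowest mean *P_2* ends up with the highest prediction.

```python
import pandas as pd

# Hypothetical toy data: customer 'a' has two rows, customer 'b' has one.
toy_data = pd.DataFrame(
    {'customer_ID': ['a', 'a', 'b'], 'P_2': [0.2, 0.4, 0.9]}
).set_index('customer_ID')

# One row per customer: the mean of P_2 over that customer's rows.
toy_pred = (toy_data.groupby('customer_ID').mean()
            .rename(columns={'P_2': 'prediction'}))

# Scale by the maximum and take the complement, which reverses the ordering.
toy_pred['prediction'] = 1.0 - toy_pred['prediction'] / toy_pred['prediction'].max()

print(toy_pred)
```

Customer `b` has the largest mean *P_2*, so its prediction is 0, while customer `a` gets roughly 0.67.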

In [None]:
y_true = train_labels.copy()
y_pred = ave_p2.copy()

The evaluation metric, **M**, for this competition is the mean of two measures of rank ordering: Normalized Gini Coefficient, **G**, and default rate captured at 4%, **D**.

*M = 0.5 x (G + D)*

In the following, we will calculate each component in detail.

##### Calculating Parameter D (default rate captured at 4%):
The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions, and represents a Sensitivity/Recall statistic.

Let's break down the steps to calculate D:

In [None]:
# Create a df dataframe by concatenating y_true and y_pred, then sort the rows by y_pred in descending order.
df = (pd.concat([y_true, y_pred], axis='columns').sort_values('prediction', ascending=False))
df

In [None]:
# Create a 'weight' column in df with a value of 20 for y_true = 0 and 1 for y_true = 1.
df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
df

In [None]:
# Calculate four_pct_cutoff variable as 4 percent of the sum of all values in 'weight' column.
four_pct_cutoff = int(0.04 * df['weight'].sum())
four_pct_cutoff

In [None]:
# Create 'weight_cumsum' column in df to calculate cumulative sum of weights in 'weight' column.
df['weight_cumsum'] = df['weight'].cumsum()
df

In [None]:
# Create the df_cutoff dataframe by keeping the top-ranked rows (df is sorted by descending prediction) whose weight_cumsum does not exceed four_pct_cutoff.
df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
df_cutoff

In [None]:
# Calculate the ratio of the number of y_true = 1 in df_cutoff to number of y_true = 1 in the input dataframe.
d = (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
d
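
The steps above can be checked by hand on a small example. The following is a minimal sketch on hypothetical toy data (10 customers, 2 of them defaults); the names `toy`, `toy_cutoff`, `toy_top` and `toy_d` are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical toy data: 10 customers, 2 defaults (target = 1).
toy = pd.DataFrame({
    'target':     [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    'prediction': [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
}).sort_values('prediction', ascending=False)

# Weight 20 for non-defaults and 1 for defaults, as in the cells above.
toy['weight'] = toy['target'].map({0: 20, 1: 1})

# Total weight is 2 * 1 + 8 * 20 = 162, so the cutoff is int(6.48) = 6.
toy_cutoff = int(0.04 * toy['weight'].sum())

# Cumulative weights are 1, 21, 22, 42, ...; only the first row stays within the cutoff.
toy['weight_cumsum'] = toy['weight'].cumsum()
toy_top = toy.loc[toy['weight_cumsum'] <= toy_cutoff]

# One of the two defaults is captured.
toy_d = (toy_top['target'] == 1).sum() / (toy['target'] == 1).sum()
print(toy_d)  # 0.5
```

The default ranked third is not counted: the non-default ranked second carries a weight of 20, which alone pushes the cumulative weight past the cutoff.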

##### Calculating Weighted Gini:

Weighted Gini is required to calculate normalized weighted Gini (G parameter).

Let's break down the steps to calculate Weighted Gini:

In [None]:
# Create a df dataframe by concatenating y_true and y_pred, then sort the rows by y_pred in descending order.
df = (pd.concat([y_true, y_pred], axis='columns').sort_values('prediction', ascending=False))
df

In [None]:
# Create a 'weight' column in df with a value of 20 for y_true = 0 and 1 for y_true = 1.
df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
df

In [None]:
# Create 'random' column in df that has cumulative sum of 'weight' column divided by sum of 'weight' column.
df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
df

In [None]:
# Define total_pos as the sum of the y_true values multiplied by their corresponding 'weight' values (the total weight of the positive labels).
total_pos = (df['target'] * df['weight']).sum()
total_pos

In [None]:
# Create 'cum_pos_found' column in df by calculating cumulative sum of the multiplication of y_true and their corresponding 'weight' values.
df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
df

In [None]:
# Create 'lorentz' column (points on the Lorenz curve) by dividing 'cum_pos_found' values by total_pos value.
df['lorentz'] = df['cum_pos_found'] / total_pos
df

In [None]:
# Create 'gini' column by ('lorentz' - 'random') * 'weight' values
df['gini'] = (df['lorentz'] - df['random']) * df['weight']
df

In [None]:
# Calculate sum of 'gini' column values.
g_not_normalized = df['gini'].sum()
g_not_normalized
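
Written as a single expression, the cells above compute the following sum. The symbols are introduced here for illustration and do not appear in the code: rows are indexed $i = 1, \dots, n$ in descending order of prediction, $y_i$ is the target of row $i$ and $w_i$ is its weight.

$$
g_{\text{not normalized}} = \sum_{i=1}^{n} w_i \left( \frac{\sum_{j \le i} y_j w_j}{\sum_{j=1}^{n} y_j w_j} - \frac{\sum_{j \le i} w_j}{\sum_{j=1}^{n} w_j} \right)
$$

The first fraction is the 'lorentz' column and the second fraction is the 'random' column.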

Let's put everything together in two functions to calculate **d** and **g_not_normalized**:

In [None]:
def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    df = (pd.concat([y_true, y_pred], axis='columns')
          .sort_values('prediction', ascending=False))
    df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
    four_pct_cutoff = int(0.04 * df['weight'].sum())
    df['weight_cumsum'] = df['weight'].cumsum()
    df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
    return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()

def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    df = (pd.concat([y_true, y_pred], axis='columns')
          .sort_values('prediction', ascending=False))
    df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
    df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
    total_pos = (df['target'] * df['weight']).sum()
    df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
    df['lorentz'] = df['cum_pos_found'] / total_pos
    df['gini'] = (df['lorentz'] - df['random']) * df['weight']
    return df['gini'].sum()

In [None]:
print('D = {}'.format(top_four_percent_captured(y_true, y_pred)))
print('g_not_normalized = {}'.format(weighted_gini(y_true, y_pred)))

##### Calculating G (Normalized Weighted Gini):

Let's break down the steps to calculate Normalized Weighted Gini:
- Create the y_true_pred dataframe by renaming the 'target' column of the y_true dataframe to 'prediction'.
- Return weighted_gini(y_true, y_pred) divided by weighted_gini(y_true, y_true_pred).

In [None]:
y_true_pred = y_true.rename(columns={'target': 'prediction'})
y_true_pred

In [None]:
weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

Let's put everything together in one function to calculate **g_normalized**:

In [None]:
def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    y_true_pred = y_true.rename(columns={'target': 'prediction'})
    return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)
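
To see what the normalization does, it can help to try it on a tiny example. The sketch below uses a hypothetical helper, `weighted_gini_sketch`, which repeats the weighted Gini steps on two plain Series so that the block runs on its own, together with made-up toy data. Using the true labels as the prediction gives the best possible ordering, so dividing by that value puts a perfect ranking at exactly 1.

```python
import pandas as pd

def weighted_gini_sketch(target: pd.Series, prediction: pd.Series) -> float:
    # Illustrative helper: the same weighted Gini steps, applied to two plain Series.
    df = pd.DataFrame({'target': target, 'prediction': prediction})
    df = df.sort_values('prediction', ascending=False)
    df['weight'] = df['target'].map({0: 20, 1: 1})
    df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
    total_pos = (df['target'] * df['weight']).sum()
    df['lorentz'] = (df['target'] * df['weight']).cumsum() / total_pos
    return ((df['lorentz'] - df['random']) * df['weight']).sum()

# Hypothetical toy data: 10 customers, 2 defaults.
toy_target = pd.Series([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
toy_prediction = pd.Series([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])

# Denominator: the weighted Gini of a perfect ranking (labels used as predictions).
best_possible = weighted_gini_sketch(toy_target, toy_target)

g_perfect = weighted_gini_sketch(toy_target, toy_target) / best_possible
g_toy = weighted_gini_sketch(toy_target, toy_prediction) / best_possible

print(g_perfect)  # 1.0
print(g_toy)      # below 1, because one non-default is ranked above a default
```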

Now that we have calculated **D** and **G**, we can measure the evaluation metric, **M**.

In [None]:
g = normalized_weighted_gini(y_true, y_pred)
d = top_four_percent_captured(y_true, y_pred)

0.5 * (g + d)

Now let's put everything in one place:

In [None]:
def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:

    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)

In [None]:
amex_metric(y_true, y_pred)
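
For larger inputs, the row-wise `apply` calls can be slow. The following is a minimal NumPy sketch of the same steps, not the official implementation; the function name `amex_metric_np` and its array-based signature are assumptions made for this example. It uses a stable sort, so when predictions are tied the row order (and therefore the result) may differ slightly from the pandas version above.

```python
import numpy as np

def amex_metric_np(target, prediction) -> float:
    # Hypothetical array-based variant of amex_metric (illustration only).
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)

    def sort_by_prediction(t, p):
        # Sort targets by descending prediction and attach the 20/1 weights.
        order = np.argsort(-p, kind='stable')
        t = t[order]
        w = np.where(t == 0, 20.0, 1.0)
        return t, w

    def top_four_percent_captured(t, p):
        t, w = sort_by_prediction(t, p)
        cutoff = int(0.04 * w.sum())
        # Share of all defaults found in rows within the weighted 4% cutoff.
        return t[np.cumsum(w) <= cutoff].sum() / t.sum()

    def weighted_gini(t, p):
        t, w = sort_by_prediction(t, p)
        random = np.cumsum(w) / w.sum()
        lorentz = np.cumsum(t * w) / (t * w).sum()
        return ((lorentz - random) * w).sum()

    g = weighted_gini(target, prediction) / weighted_gini(target, target)
    d = top_four_percent_captured(target, prediction)
    return 0.5 * (g + d)

# Hypothetical toy data: 10 customers, 2 defaults.
toy_target = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_prediction = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

m_toy = amex_metric_np(toy_target, toy_prediction)
m_perfect = amex_metric_np(toy_target, toy_target)
print(m_toy, m_perfect)
```

On the real data this would be called as `amex_metric_np(df['target'].to_numpy(), df['prediction'].to_numpy())` after aligning labels and predictions on `customer_ID`, where `df` stands for such an aligned frame.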