# <b>1 <span style='color:greenyellow'>|</span> Introduction</b>


Credit default risk modeling is a crucial task in the BFSI (banking, financial services and insurance) domain. The dataset provided by AMEX is very large, which poses computing and processing hurdles.

In this notebook we explore the features and conduct a preliminary EDA to prepare for model building. We also explore feature partitioning in this high-dimensional dataset and building ensemble models for each partition.

Let us dive in!
 


<div style="color:white;display:fill;border-radius:12px;
            background-color:#323232;font-size:150%;
            font-family:Georgia;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.1 | Load Libraries</b></p>
</div>

In [None]:
import pandas as pd
import os

import matplotlib.pyplot as plt
import seaborn as sns
import gc

pd.options.display.max_rows = 200

# <b>2 <span style='color:greenyellow'>|</span> Load Large Datasets</b>

We can use one of the following strategies to load a large dataset:
1. Reduce the default dtype footprint when reading by downcasting columns
2. Use a Feather dataset
3. Use Dask distributed computing
4. Read in chunks (but the analysis cannot then be done on all the data at once)

Here I am using the Feather dataset [here](https://www.kaggle.com/datasets/seefun/amex-default-prediction-feather). This dataset is only 4 GB and excludes the target label.
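For comparison, the first strategy (downcasting) can be sketched as follows. This is only an illustration on a made-up toy frame: the helper name `downcast_numeric` and the column `n_statements` are assumptions, not part of the AMEX data or of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float64 columns as float32 and int64 columns as the smallest integer type."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].astype("float32")
    for col in out.select_dtypes(include="int64").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy frame standing in for a slice of the real data (values are random, not real)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "P_2": rng.random(1000),
    "B_1": rng.random(1000),
    "n_statements": rng.integers(1, 14, size=1000),  # hypothetical integer column
})

small = downcast_numeric(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Downcasting floats to `float32` loses precision, so whether it is acceptable depends on the feature and the model.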




<div style="color:white;display:fill;border-radius:12px;
            background-color:#323232;font-size:150%;
            font-family:Georgia;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.1 | Loading Feather Dataset</b></p>
</div>

In [None]:
train_feather = pd.read_feather("../input/amex-default-prediction-feather/train.feather")
#test_feather = pd.read_feather("../input/amex-default-prediction-feather/test.feather")

In [None]:
train_feather.head()

# <b>3 <span style='color:greenyellow'>|</span> Feature Analysis and EDA</b>



<div style="color:white;display:fill;border-radius:12px;
            background-color:#323232;font-size:150%;
            font-family:Georgia;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.1 | Data Dimensions</b></p>
</div>

**Size of Train** : (5531451, 190) ~5.5 million rows

**Size of Test** : (5531451, 190) ~5.5 million rows

**Number of Unique Customers** : 458913 

Observations from below:
* Most customers have credit card statements for 13 months
* All the credit card 

In [None]:
train_feather.dtypes

<div style="color:white;display:fill;border-radius:12px;
            background-color:#323232;font-size:150%;
            font-family:Georgia;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.2 | Missing Value Features</b></p>
</div>

Many features have more than 50% missing values. We therefore remove the features that have more than 50% NaN values in both the train and test sets.
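The selection rule can be sketched on toy data as follows. The helper `high_missing_columns` and the toy values are assumptions made for illustration; only the 50% threshold and the "missing in both sets" rule come from the text above.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df: pd.DataFrame, threshold: float = 0.5) -> set:
    """Names of the columns whose fraction of missing values exceeds the threshold."""
    frac_missing = df.isna().mean()
    return set(frac_missing.index[frac_missing > threshold])

# Toy train/test frames (made-up values)
toy_train = pd.DataFrame({
    "D_87": [np.nan, np.nan, np.nan, 1.0],
    "B_29": [np.nan, np.nan, np.nan, 0.2],
    "P_2":  [0.9, 0.8, np.nan, 0.7],
})
toy_test = pd.DataFrame({
    "D_87": [np.nan, np.nan, 1.0, np.nan],
    "B_29": [0.1, 0.3, np.nan, 0.2],
    "P_2":  [0.5, 0.6, 0.7, 0.8],
})

# Keep only the columns that are mostly missing in BOTH sets
common = high_missing_columns(toy_train) & high_missing_columns(toy_test)
reduced_train = toy_train.drop(columns=sorted(common))
print(common)
```

Here `B_29` is mostly missing in the toy train set only, so it is kept; a column is dropped only when it crosses the threshold in both sets.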

In [None]:
gc.collect()
TRAIN_SHAPE = train_feather.shape
missing_df = train_feather.isnull().sum().sort_values(ascending=False)
missing_percent = missing_df.apply(lambda x:round(x/TRAIN_SHAPE[0],2))
#clean up
del missing_df
gc.collect()
missing_percent

In [None]:
gc.collect()
test_feather = pd.read_feather("../input/amex-default-prediction-feather/test.feather")
TEST_SHAPE = test_feather.shape


In [None]:
missing_df_test = test_feather.isnull().sum().sort_values(ascending=False)
missing_percent_test = missing_df_test.apply(lambda x:round(x/TEST_SHAPE[0],2))
#clean up
del missing_df_test 
del test_feather
gc.collect()

missing_percent_test

In [None]:
train_missing = set(missing_percent.index[missing_percent>0.5])
test_missing = set(missing_percent_test.index[missing_percent_test>0.5])
# features with more than 50% missing values in both the train and test sets
common_miss_feature = train_missing.intersection(test_missing) 
print("Features with more than 50% missing values in both train and test sets:\n",common_miss_feature)

#clean up
del train_missing, test_missing
gc.collect()


<div style="color:white;display:fill;border-radius:12px;
            background-color:#323232;font-size:150%;
            font-family:Georgia;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.3 | Basic EDA</b></p>
</div>

## Categorical Variables

['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

## Binary Variables

* B_31 is always 0 or 1.
* D_87 is mostly missing (it is among the features with more than 50% missing values).

## Count of Variables per Feature Type
Here we count the number of features in each of the five attribute types: Delinquency, Spend, Payment, Balance and Risk.

In [None]:
train_features = train_feather.columns

print("Count Delinquency Features :",sum(train_features.str.startswith('D')))
print("Count Spend Features :",sum(train_features.str.startswith('S')))
print("Count Payment  Features :",sum(train_features.str.startswith('P')))
print("Count Balance Features :",sum(train_features.str.startswith('B')))
print("Count Risk Features :",sum(train_features.str.startswith('R')))

## Customer Statements (S_2 feature)

Let us count the number of credit card statements present for each unique customer ID. The `S_2` feature is the statement date.

**Observation**
* The plot shows that the majority of customers have received 13 credit card statements.
* If we inspect the `last` row in each `customer_ID` group, the last statement for every customer falls in March 2018.
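Both observations can be checked with a sketch like the one below. It runs on a small made-up frame, so the dates and counts are illustrative only; on the real data the same calls would be applied to `train_feather`.

```python
import pandas as pd

# Toy statements: customer "a" has 3, "b" has 2, "c" has 1 (dates are made up)
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b", "c"],
    "S_2": pd.to_datetime([
        "2018-01-15", "2018-02-14", "2018-03-16",
        "2018-02-20", "2018-03-21",
        "2018-03-05",
    ]),
})

# Number of statements per customer, then how many customers have each count
statements_per_customer = toy.groupby("customer_ID").size()
distribution = statements_per_customer.value_counts().sort_index()

# Date of the last statement per customer, reduced to year-month
last_statement = toy.sort_values("S_2").groupby("customer_ID")["S_2"].last()
last_months = last_statement.dt.to_period("M").unique()

print(distribution)
print(last_months)
```

If `last_months` contains a single period, every customer's final statement falls in the same month.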

In [None]:
gc.collect()
customer_statements = (train_feather['customer_ID'].value_counts()).value_counts()
customer_statements

In [None]:
fig,ax = plt.subplots(1,1,figsize=(15,5))
sns.barplot(x=customer_statements.index,y=customer_statements,palette='Reds').set_xlabel('No. of credit card statements for each customer')
plt.show()

In [None]:
#clean up 
del customer_statements
gc.collect()

# <b>4 <span style='color:greenyellow'>|</span> Feature Engineering and Transformation</b>

<div style="color:white;display:fill;border-radius:12px;
            background-color:#323232;font-size:150%;
            font-family:Georgia;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.1 | Custom Feature Engineering Function</b></p>
</div>
The custom function below transforms and engineers features. It does the following:
1. Drops the columns that are mostly missing in both train and test.
2. Groups the data by `customer_ID`.
3. Aggregates each group:
    * For numerical features: mean, std, min, max and last (the value on the last statement received by the customer).
    * For categorical features: count, nunique and last (the value on the last statement received by the customer).
4. Counts the number of missing features in each row.

In [None]:
gc.collect()

# Custom function for feature engineering

def feature_engg(data, missing_f, num_f, cat_f):
    # drop the mostly-missing columns
    data = data.drop(columns=missing_f)
    
    # aggregate the numerical features per customer
    data_num_agg = data.groupby('customer_ID')[num_f].agg(['mean','std','min','max','last'])
    # flatten the (feature, statistic) column tuples into names such as 'P_2_mean'
    data_num_agg.columns = ['_'.join(c) for c in data_num_agg.columns]
    
    # aggregate the categorical features per customer
    data_cat_agg = data.groupby('customer_ID')[cat_f].agg(['count','nunique','last'])
    data_cat_agg.columns = ['_'.join(c) for c in data_cat_agg.columns]
    
    del data
    gc.collect()
    
    #concat
    df_out = pd.concat([data_num_agg,data_cat_agg],axis=1)
    del data_cat_agg, data_num_agg
    gc.collect()
    
    #count of missing features per row
    df_out['na_count'] = df_out.isna().sum(axis=1)
    
    print('Shape of data after feature engineering',df_out.shape)
    
    return df_out
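
The aggregation pattern used in the function can be tried on a tiny frame first. The sketch below is a standalone illustration with made-up values and a reduced set of statistics; it does not call `feature_engg` itself.

```python
import numpy as np
import pandas as pd

# Toy statements for two customers (made-up values), ordered by date within each customer
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b", "b"],
    "P_2": [0.5, 0.7, 0.2, np.nan, 0.4],       # numerical feature
    "D_63": ["CO", "CR", "CO", "CO", "CO"],    # categorical feature
})

# Numerical aggregates; the result has (feature, statistic) column tuples
num_agg = toy.groupby("customer_ID")[["P_2"]].agg(["mean", "max", "last"])
num_agg.columns = ["_".join(c) for c in num_agg.columns]

# Categorical aggregates
cat_agg = toy.groupby("customer_ID")[["D_63"]].agg(["count", "nunique", "last"])
cat_agg.columns = ["_".join(c) for c in cat_agg.columns]

# One row per customer, plus the number of missing engineered features in that row
features = pd.concat([num_agg, cat_agg], axis=1)
features["na_count"] = features.isna().sum(axis=1)
print(features)
```

Note that the groupby `last` aggregation returns the last non-null value in each group, and it reflects the latest statement only if the rows are ordered by date within each customer.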
    
    

In [None]:

#Define features
missing_f = list( {'D_53', 'R_26', 'D_50', 'D_136', 'D_88', 'D_134', 'D_137', 'D_111', 'D_132', 'D_73', 'B_17', 'D_87', 'D_142', 'B_42', 'D_138', 'B_29', 'D_110', 'D_106', 'D_135', 'D_108', 'R_9', 'D_76', 'D_49', 'D_105', 'B_39', 'D_42', 'D_82'})
cat_f = ['B_30','B_31','B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
# S_2 is the statement date, so it is kept out of the numerical aggregations
exclude_f = ['customer_ID','S_2']+missing_f+cat_f 
num_f= [i for i in train_feather.columns if i not in exclude_f]
print('Number of missing features:', len(missing_f))
print('Number of cat features:', len(cat_f))
print('Number of num features:', len(num_f))

#call function
train_trans = feature_engg(train_feather, missing_f, num_f, cat_f)

In [None]:
train_trans

In [None]:
train_trans.to_pickle('./train_train.pkl')

<div style="color:white;display:fill;border-radius:12px;
            background-color:#323232;font-size:150%;
            font-family:Georgia;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.2 | Merge Target Labels</b></p>
</div>

In [None]:
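train_labels = pd.read_csv('../input/amex-default-prediction/train_labels.csv')

# customer_ID is the index of train_trans and a column of train_labels;
# pd.merge accepts a name that is an index level on one side and a column on the other
train_xy = pd.merge(left=train_trans,right=train_labels,on='customer_ID',how='inner')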

In [None]:
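# group the raw statements by customer so that per-customer statistics can be computed
train_grouped = train_feather.groupby('customer_ID')
mean_p2 = train_grouped.P_2.mean()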

In [None]:
merged_df_p2 = pd.merge(left=mean_p2,right=train_labels,on='customer_ID',how='inner')

In [None]:
sns.histplot(x = merged_df_p2.P_2,hue=merged_df_p2.target)

In [None]:
del merged_df_p2
gc.collect()


In [None]:
mean_p3 = train_grouped.P_3.mean()
merged_df_p3 = pd.merge(left=mean_p3,right=train_labels,on='customer_ID',how='inner')
sns.histplot(x = merged_df_p3.P_3,hue=merged_df_p3.target)

In [None]:
train_feather[:100].isna().sum(axis=1)