## Competition Objective

The objective of the competition is to predict credit default, i.e., the probability that a customer will not pay back their credit card balance in the future, based on their monthly customer profile. The training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information.

## Notebook Objective

The objective of the notebook is to explore the given datasets and make some inferences along the way.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import plotly.graph_objects as go
import plotly.express as px

%matplotlib inline

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

## Files Information

Let us first look at the given files. 

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        file_size = round(os.path.getsize(os.path.join(dirname, filename)) / (1e9), 2)
        print(f"Filename : {filename} \t File Size : {file_size} GB")

We are given four files, and the names are self-explanatory. But look at those file sizes: they are huge!

## Reading the dataset files

As the files are huge, reading them with pandas `read_csv` as is will blow up the Kaggle notebook memory. So let us read the training dataset with the numeric columns stored as `float16` instead of the default `float64`.

In [None]:
dtype_dict = {'customer_ID': "object",
 'S_2': "object",
 'P_2': 'float16',
 'D_39': 'float16',
 'B_1': 'float16',
 'B_2': 'float16',
 'R_1': 'float16',
 'S_3': 'float16',
 'D_41': 'float16',
 'B_3': 'float16',
 'D_42': 'float16',
 'D_43': 'float16',
 'D_44': 'float16',
 'B_4': 'float16',
 'D_45': 'float16',
 'B_5': 'float16',
 'R_2': 'float16',
 'D_46': 'float16',
 'D_47': 'float16',
 'D_48': 'float16',
 'D_49': 'float16',
 'B_6': 'float16',
 'B_7': 'float16',
 'B_8': 'float16',
 'D_50': 'float16',
 'D_51': 'float16',
 'B_9': 'float16',
 'R_3': 'float16',
 'D_52': 'float16',
 'P_3': 'float16',
 'B_10': 'float16',
 'D_53': 'float16',
 'S_5': 'float16',
 'B_11': 'float16',
 'S_6': 'float16',
 'D_54': 'float16',
 'R_4': 'float16',
 'S_7': 'float16',
 'B_12': 'float16',
 'S_8': 'float16',
 'D_55': 'float16',
 'D_56': 'float16',
 'B_13': 'float16',
 'R_5': 'float16',
 'D_58': 'float16',
 'S_9': 'float16',
 'B_14': 'float16',
 'D_59': 'float16',
 'D_60': 'float16',
 'D_61': 'float16',
 'B_15': 'float16',
 'S_11': 'float16',
 'D_62': 'float16',
 'D_63': 'object',
 'D_64': 'object',
 'D_65': 'float16',
 'B_16': 'float16',
 'B_17': 'float16',
 'B_18': 'float16',
 'B_19': 'float16',
 'D_66': 'float16',
 'B_20': 'float16',
 'D_68': 'float16',
 'S_12': 'float16',
 'R_6': 'float16',
 'S_13': 'float16',
 'B_21': 'float16',
 'D_69': 'float16',
 'B_22': 'float16',
 'D_70': 'float16',
 'D_71': 'float16',
 'D_72': 'float16',
 'S_15': 'float16',
 'B_23': 'float16',
 'D_73': 'float16',
 'P_4': 'float16',
 'D_74': 'float16',
 'D_75': 'float16',
 'D_76': 'float16',
 'B_24': 'float16',
 'R_7': 'float16',
 'D_77': 'float16',
 'B_25': 'float16',
 'B_26': 'float16',
 'D_78': 'float16',
 'D_79': 'float16',
 'R_8': 'float16',
 'R_9': 'float16',
 'S_16': 'float16',
 'D_80': 'float16',
 'R_10': 'float16',
 'R_11': 'float16',
 'B_27': 'float16',
 'D_81': 'float16',
 'D_82': 'float16',
 'S_17': 'float16',
 'R_12': 'float16',
 'B_28': 'float16',
 'R_13': 'float16',
 'D_83': 'float16',
 'R_14': 'float16',
 'R_15': 'float16',
 'D_84': 'float16',
 'R_16': 'float16',
 'B_29': 'float16',
 'B_30': 'float16',
 'S_18': 'float16',
 'D_86': 'float16',
 'D_87': 'float16',
 'R_17': 'float16',
 'R_18': 'float16',
 'D_88': 'float16',
 'B_31': 'int64',
 'S_19': 'float16',
 'R_19': 'float16',
 'B_32': 'float16',
 'S_20': 'float16',
 'R_20': 'float16',
 'R_21': 'float16',
 'B_33': 'float16',
 'D_89': 'float16',
 'R_22': 'float16',
 'R_23': 'float16',
 'D_91': 'float16',
 'D_92': 'float16',
 'D_93': 'float16',
 'D_94': 'float16',
 'R_24': 'float16',
 'R_25': 'float16',
 'D_96': 'float16',
 'S_22': 'float16',
 'S_23': 'float16',
 'S_24': 'float16',
 'S_25': 'float16',
 'S_26': 'float16',
 'D_102': 'float16',
 'D_103': 'float16',
 'D_104': 'float16',
 'D_105': 'float16',
 'D_106': 'float16',
 'D_107': 'float16',
 'B_36': 'float16',
 'B_37': 'float16',
 'R_26': 'float16',
 'R_27': 'float16',
 'B_38': 'float16',
 'D_108': 'float16',
 'D_109': 'float16',
 'D_110': 'float16',
 'D_111': 'float16',
 'B_39': 'float16',
 'D_112': 'float16',
 'B_40': 'float16',
 'S_27': 'float16',
 'D_113': 'float16',
 'D_114': 'float16',
 'D_115': 'float16',
 'D_116': 'float16',
 'D_117': 'float16',
 'D_118': 'float16',
 'D_119': 'float16',
 'D_120': 'float16',
 'D_121': 'float16',
 'D_122': 'float16',
 'D_123': 'float16',
 'D_124': 'float16',
 'D_125': 'float16',
 'D_126': 'float16',
 'D_127': 'float16',
 'D_128': 'float16',
 'D_129': 'float16',
 'B_41': 'float16',
 'B_42': 'float16',
 'D_130': 'float16',
 'D_131': 'float16',
 'D_132': 'float16',
 'D_133': 'float16',
 'R_28': 'float16',
 'D_134': 'float16',
 'D_135': 'float16',
 'D_136': 'float16',
 'D_137': 'float16',
 'D_138': 'float16',
 'D_139': 'float16',
 'D_140': 'float16',
 'D_141': 'float16',
 'D_142': 'float16',
 'D_143': 'float16',
 'D_144': 'float16',
 'D_145': 'float16'}

df = pd.read_csv("/kaggle/input/amex-default-prediction/train_data.csv", dtype=dtype_dict)
df.shape

The training dataset has 5,531,451 rows and 190 columns.
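
The dtype dictionary above can also be built from the file header rather than typed out. The sketch below is a minimal illustration on a tiny in-memory CSV with made-up values (not the real file); `csv_text`, `special`, and `small_dtypes` are names introduced here for illustration only. It also compares the memory taken by the float columns under the default `float64` and under `float16`.

```python
import io
import pandas as pd

# Tiny stand-in for train_data.csv: real column names, made-up values.
csv_text = (
    "customer_ID,S_2,P_2,D_39,D_63,B_31\n"
    "a1,2017-03-09,0.938,0.001,CR,1\n"
    "a1,2017-04-07,0.936,0.002,CO,1\n"
    "b2,2017-03-11,0.880,0.091,CR,0\n"
)

# Read only the header row to get the column names without loading any data.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Columns that are not float16 in the dtype_dict above; everything else is float16.
special = {"customer_ID": "object", "S_2": "object",
           "D_63": "object", "D_64": "object", "B_31": "int64"}
small_dtypes = {col: special.get(col, "float16") for col in header}

df_small = pd.read_csv(io.StringIO(csv_text), dtype=small_dtypes)
df_default = pd.read_csv(io.StringIO(csv_text))

# Compare the bytes used by the float columns under each dtype.
float_cols = ["P_2", "D_39"]
bytes_small = df_small[float_cols].memory_usage(index=False, deep=True).sum()
bytes_default = df_default[float_cols].memory_usage(index=False, deep=True).sum()
print(f"float16: {bytes_small} bytes, float64: {bytes_default} bytes")
```

Keep in mind that `float16` carries only about three significant decimal digits, so this approach trades precision for memory.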

Features are anonymized and normalized, and are grouped into the following general categories:
* D_* = Delinquency variables
* S_* = Spend variables
* P_* = Payment variables
* B_* = Balance variables
* R_* = Risk variables

In [None]:
df.head()

**Takeaways:**

* It is also given that the following variables are categorical: `['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']`

* We are given the monthly customer profiles which means we have got multiple rows per `customer_ID`, one for each month.

* Number of variables in each of the general categories
  - Delinquency variables = 96
  - Spend variables = 22
  - Payment variables = 3
  - Balance variables = 40
  - Risk variables = 28
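
The per-category counts can be derived from the column-name prefixes. The sketch below shows one way to do it, assuming a short hand-picked list of column names as a stand-in for `df.columns`; `columns` and `prefix_names` are illustrative names, not part of the dataset.

```python
from collections import Counter

# A few column names from dtype_dict, standing in for df.columns.
columns = ["customer_ID", "S_2", "P_2", "D_39", "B_1", "B_2", "R_1", "S_3", "D_41"]

prefix_names = {"D": "Delinquency", "S": "Spend", "P": "Payment",
                "B": "Balance", "R": "Risk"}

# Group by the text before the first underscore; customer_ID is not a feature.
counts = Counter(col.split("_")[0] for col in columns if col != "customer_ID")
for prefix, name in prefix_names.items():
    print(f"{name} variables = {counts[prefix]}")
```

Note that grouping purely by prefix places the date column `S_2` in the Spend group.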
  
There are several other ways to load the data faster using other packages and file formats. Please check the excellent tutorial notebook below by Rohan Rao.
https://www.kaggle.com/code/rohanrao/tutorial-on-reading-large-datasets/notebook

## Customer Analysis

In this section, let us look into the customer level information.

Let us start with looking into the number of unique customers in the dataset.

In [None]:
print(f"There are {df['customer_ID'].nunique()} unique customers in the training dataset")

Now let us check the number of months for which the customer profiles are available.

In [None]:
cnt_srs = (df['customer_ID'].value_counts()).value_counts()
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[0])
plt.xlabel('Number of Months', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)
plt.title("Number of months for which the customer profile is present", fontsize=15)
plt.show()

**Takeaways:**
* Most of the customers in the given training dataset have 13 months of customer profile.
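
To put a number on "most", one option is to compute the share of customers whose row count equals the maximum. This is a minimal sketch on a toy frame with made-up IDs; `toy`, `rows_per_customer`, and `share_full` are hypothetical names, and on the real data `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in for the training frame: customers "a" and "b" have 3 rows, "c" has 1.
toy = pd.DataFrame({"customer_ID": ["a", "a", "a", "b", "b", "b", "c"]})

# Number of monthly rows available for each customer.
rows_per_customer = toy["customer_ID"].value_counts()

# Longest history present in the data (3 in this toy frame).
full_history = rows_per_customer.max()

# Fraction of customers that have the full history.
share_full = (rows_per_customer == full_history).mean()
print(f"{share_full:.1%} of customers have {full_history} rows")
```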

## Time Period Analysis

In this section, let us check the time period of the given training dataset. The variable `S_2` has time information.

In [None]:
df['S_2'] = pd.to_datetime(df['S_2'])
print(f"Minimum date value in the training dataset : {df['S_2'].min()}")
print(f"Maximum date value in the training dataset : {df['S_2'].max()}")

So, we are given monthly customer data from March 1, 2017 to March 31, 2018, i.e., 13 months, in the training dataset.
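
The month count can be checked directly from the minimum and maximum dates. The sketch below assumes a handful of made-up dates in place of `df['S_2']`; `months_spanned` and `months_present` are names introduced for illustration.

```python
import pandas as pd

# Made-up statement dates standing in for df["S_2"].
dates = pd.to_datetime(pd.Series(["2017-03-01", "2017-03-20", "2017-04-15", "2018-03-31"]))

first, last = dates.min(), dates.max()

# Inclusive count of calendar months between the first and the last date.
months_spanned = (last.year - first.year) * 12 + (last.month - first.month) + 1

# Calendar months that actually contain at least one row.
months_present = dates.dt.to_period("M").nunique()
print(months_spanned, months_present)
```

Comparing the two values shows whether any calendar month inside the range has no rows at all.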

Now let us see how many records there are in each of these months.

In [None]:
cnt_srs = (df['S_2'].dt.year*100 + df['S_2'].dt.month).value_counts()
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[0])
#plt.xticks(rotation='vertical')
plt.xlabel('YearMonth', fontsize=12)
plt.ylabel('Number of Rows', fontsize=12)
plt.title("Monthly distribution of training data", fontsize=15)
plt.show()

**Takeaways:**
* The number of customer profiles increases slightly month on month.
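
The month-on-month change can also be expressed as a percentage instead of being read off the bar chart. This is a rough sketch on made-up dates, assuming the same `S_2`-style datetime column; `monthly` and `growth` are illustrative names.

```python
import pandas as pd

# Made-up statement dates: 2 rows in March, 3 in April, 3 in May 2017.
dates = pd.to_datetime(pd.Series([
    "2017-03-05", "2017-03-19",
    "2017-04-02", "2017-04-11", "2017-04-27",
    "2017-05-08", "2017-05-16", "2017-05-30",
]))

# Rows per calendar month, in chronological order.
monthly = dates.dt.to_period("M").value_counts().sort_index()

# Relative change versus the previous month (the first month has no predecessor).
growth = monthly.pct_change()
print(pd.DataFrame({"rows": monthly, "growth": growth}))
```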

Let us check the day of the month distribution as well.

In [None]:
cnt_srs = (df['S_2'].dt.day).value_counts()
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[0])
plt.xlabel('Day of the month', fontsize=12)
plt.ylabel('Number of Rows', fontsize=12)
plt.title("Daywise distribution of training data", fontsize=15)
plt.show()

**Takeaways:**
* The monthly customer profile is taken from all days of the month.
* The beginning of the month has comparatively fewer customer profiles than the rest of the month.

Now let us take a sample customer and see if the profiling is done on the same day of every month.

In [None]:
sample_customer_id = df["customer_ID"].values[0]
# Copy the slice so the new columns are added to an independent frame.
temp_df = df[df["customer_ID"] == sample_customer_id].copy()
# First day of each month, used as the x-axis position for that month's profile.
temp_df["YearMonth"] = temp_df["S_2"].dt.to_period("M").dt.to_timestamp()
temp_df["DayOfMonth"] = temp_df["S_2"].dt.day

plt.figure(figsize=(12,6))
sns.scatterplot(data=temp_df, x="YearMonth", y="DayOfMonth", alpha=0.8, color=color[0], s=80)
plt.xlabel('Year&Month', fontsize=12)
plt.ylabel('Day of Month', fontsize=12)
plt.title(f"Customer profiling days for customer_ID={sample_customer_id}", fontsize=15)
plt.show()

**Takeaways:**
* For this sample customer, profiling is done once every month.
* The profiling is not done on a fixed day every month.
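
The takeaways above come from a single customer. One way to check them across all customers is sketched below on a toy frame with made-up IDs and dates; `toy`, `once_per_month`, and `distinct_days` are hypothetical names, and on the real data `toy` would be replaced by `df` after `S_2` has been converted to datetime.

```python
import pandas as pd

# Toy stand-in: customer "a" is profiled on varying days, "b" always on the 15th.
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "S_2": pd.to_datetime(["2017-03-09", "2017-04-07", "2017-05-28",
                           "2017-03-15", "2017-04-15"]),
})
toy["YearMonth"] = toy["S_2"].dt.to_period("M")
toy["DayOfMonth"] = toy["S_2"].dt.day

# Is there at most one row per customer per calendar month?
rows_per_month = toy.groupby(["customer_ID", "YearMonth"]).size()
once_per_month = bool((rows_per_month == 1).all())

# How many distinct days of the month does each customer use?
distinct_days = toy.groupby("customer_ID")["DayOfMonth"].nunique()
print(once_per_month)
print(distinct_days)
```

A customer with a single distinct day is profiled on a fixed day; larger values mean the day varies from month to month.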

**This is a work in progress. More to come. Stay tuned**