# **Notebook Objective**

Explore the time periods covered by the train and test data.



# **Target Definition**

The binary target is computed over an 18-month performance window after the customer's latest credit card statement: if the customer does not pay the amount due within 120 days of their latest statement date, it is counted as a default event.

1. In credit terminology, this definition means **customers who become 120+ days delinquent within 18 months**.

2. **ECM model:** For Amex this is an existing customer management (ECM) model. It will primarily be used for credit line management and for assessing portfolio risk.



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Delinquency (DQ) Transitions: A Credit Terminology Example

A customer who receives their credit card statement on April 30 and does not pay the minimum amount due is:

* 30+ DQ on May 1

* 60+ DQ on June 1

* 90+ DQ on July 1

* 120+ DQ on August 1

* 150+ DQ on September 1

* 180+ DQ on October 1. At this point the customer is considered charged off.
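The schedule above can be reproduced with a few lines of pandas. This is a minimal sketch under the same month-start convention as the list; the helper name `dq_schedule` and the year 2018 are assumptions chosen for illustration, and real delinquency ageing is normally counted from the payment due date.

```python
import pandas as pd

# DQ bucket labels taken from the list above.
DQ_BUCKETS = ["30+", "60+", "90+", "120+", "150+", "180+"]

def dq_schedule(missed_statement_date):
    """Return (bucket, date) pairs under the month-start convention above.

    Assumption: a new bucket is entered on the first day of each month
    that follows the missed statement (hypothetical helper, illustrative only).
    """
    stmt = pd.Timestamp(missed_statement_date)
    # First day of the month after the statement month.
    first_bucket_date = stmt.to_period("M").to_timestamp() + pd.DateOffset(months=1)
    return [
        (bucket, first_bucket_date + pd.DateOffset(months=i))
        for i, bucket in enumerate(DQ_BUCKETS)
    ]

# The year is an arbitrary choice; the text only gives day and month.
schedule = dq_schedule("2018-04-30")
for bucket, date in schedule:
    print(f"{bucket} DQ on {date:%B %d, %Y}")
```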

We can think of the transition phases above as a Markov chain, in which the recovery (cure) rate decreases from one stage to the next as we move towards the final charge-off state.
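To make the Markov chain view concrete, here is a small sketch of a simplified chain in which an account in each DQ stage either cures or rolls forward to the next stage. The cure rates are hypothetical numbers, not estimates from this dataset; they only encode the idea that cure becomes less likely at deeper stages.

```python
import numpy as np

# Stages an account passes through before charge-off (180+).
states = ["30+", "60+", "90+", "120+", "150+"]

# HYPOTHETICAL cure probabilities per stage (not estimated from any data).
cure_rate = np.array([0.60, 0.40, 0.25, 0.15, 0.08])
roll_rate = 1.0 - cure_rate  # probability of rolling to the next stage

# Simplifying assumption: cure is absorbing, so
# P(charge-off | currently in stage i) = product of roll rates from stage i onward.
p_charge_off = np.cumprod(roll_rate[::-1])[::-1]

for state, p in zip(states, p_charge_off):
    print(f"{state}: P(eventual charge-off) = {p:.3f}")
```

Under these made-up rates, the probability of eventual charge-off rises steadily with each stage, so the outcome becomes more predictable the deeper an account is in delinquency.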

Once a customer enters the DQ phases, transition rates to the next phases are easy to predict, so existing customer models focus on the time period before a customer enters the DQ cycle.

# **Understanding the vintage (time periods) given in the dataset**


For this initial part, I am focusing on the statement date column, `S_2`.

In [None]:
dtype_dict = {'customer_ID': "object",
 'S_2': "object"}

train = pd.read_csv("/kaggle/input/amex-default-prediction/train_data.csv", dtype=dtype_dict, usecols=['customer_ID','S_2'])
print(f' Number of customers {train.customer_ID.nunique()} , Number of rows {train.shape[0]}')
train.head()

In [None]:
# get statement month
train['stmt_mon'] = train['S_2'].to_numpy().astype('datetime64[M]')
train.head()
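An alternative way to derive the statement month is to parse the dates explicitly with pandas and truncate them to the month. The sketch below uses a toy frame in place of `train` and assumes `S_2` holds ISO `YYYY-MM-DD` strings; it should produce the same month-start timestamps as the NumPy cast above.

```python
import pandas as pd

# Toy stand-in for the S_2 column (assumed ISO "YYYY-MM-DD" strings).
toy = pd.DataFrame({"S_2": ["2017-03-09", "2017-03-31", "2018-03-13"]})

# Parse explicitly, then truncate each date to the first day of its month.
toy["stmt_mon"] = (
    pd.to_datetime(toy["S_2"], format="%Y-%m-%d")
    .dt.to_period("M")
    .dt.to_timestamp()
)
print(toy)
```

Parsing with an explicit `format` fails loudly on malformed dates, which can be helpful when the column is read as `object`.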

In [None]:
gp = train.groupby('stmt_mon').agg({'customer_ID':'nunique'})
print(gp)
gp.plot.bar(title='Number of Unique customers in each month')

# March 2018 as the vintage for Train data

From the above we can see that the train data has statements up to March 2018.

####  Get data at the customer level
**Starting and ending statement month for each customer**

In [None]:
cust = train.groupby(['customer_ID']).agg({'customer_ID':'count',
                             'stmt_mon':['min','max']
                              
                             })
cust.columns = ['num_obs','st_mon','end_mon']
cust.head()

In [None]:
vc = cust.num_obs.value_counts(normalize=True).round(4)
print(vc)
vc.plot.bar(title='Number of customers by number of statements')

**For 85% of the customers we have data for 13 months**

In [None]:
cust = cust.reset_index()
cust.groupby('st_mon').agg({'customer_ID':'count','num_obs':['min','max']})

# **Starting month of Train data is March 2017**

* My assumption is that the other starting months represent customers who were already on book in March 2017 but were spend-inactive during that period. This still needs to be investigated.
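One way to start investigating this is to separate customers who simply start late from customers who have gaps inside their history. The sketch below uses a made-up toy frame with the same columns as `cust`, and assumes at most one statement per customer per month.

```python
import pandas as pd

# Toy customer-level frame with the same columns as `cust` (made-up rows).
toy_cust = pd.DataFrame({
    "customer_ID": ["a", "b", "c"],
    "num_obs": [13, 5, 10],
    "st_mon": pd.to_datetime(["2017-03-01", "2017-11-01", "2017-03-01"]),
    "end_mon": pd.to_datetime(["2018-03-01", "2018-03-01", "2018-03-01"]),
})

# Calendar months covered between first and last statement, inclusive.
toy_cust["span_months"] = (
    (toy_cust["end_mon"].dt.year - toy_cust["st_mon"].dt.year) * 12
    + (toy_cust["end_mon"].dt.month - toy_cust["st_mon"].dt.month)
    + 1
)

# Months inside the span that have no statement
# (assumes at most one statement per month).
toy_cust["missing_inside"] = toy_cust["span_months"] - toy_cust["num_obs"]
print(toy_cust)
```

Here customer `b` starts late but has no gaps, while customer `c` starts in March 2017 and is missing three months in between. The two patterns may point to different explanations.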

In [None]:
cust.groupby('end_mon').agg({'customer_ID':'count','num_obs':['min','max']})

# **Train data represents default behavior as seen from the March 2018 statement**

# **Relevance of 13 months**

* **We are given at most 13 months of statement history per customer; the target then records whether the customer defaulted in the 18-month window that follows the last statement.**
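The two windows can be laid out on the calendar for the train vintage. This is a sketch that assumes both windows are counted in whole calendar months and that the performance window starts the month after the last statement; the exact boundaries used to build the target are not given in the data.

```python
import pandas as pd

last_stmt_mon = pd.Timestamp("2018-03-01")  # train vintage found above
HISTORY_MONTHS = 13                         # maximum statements per customer
PERFORMANCE_MONTHS = 18                     # from the target definition

# Observation (history) window: the 13 months ending at the last statement.
history_start = last_stmt_mon - pd.DateOffset(months=HISTORY_MONTHS - 1)

# Performance window: the 18 months after the last statement (assumed boundaries).
performance_start = last_stmt_mon + pd.DateOffset(months=1)
performance_end = last_stmt_mon + pd.DateOffset(months=PERFORMANCE_MONTHS)

print(f"History:     {history_start:%b %Y} to {last_stmt_mon:%b %Y}")
print(f"Performance: {performance_start:%b %Y} to {performance_end:%b %Y}")
```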


# Relation with target by number of statements


In [None]:
dep = pd.read_csv('/kaggle/input/amex-default-prediction/train_labels.csv')
cust = cust.merge(dep,on='customer_ID')
cust.shape

In [None]:
gp  = cust.groupby('num_obs').target.mean()
gp.plot.bar(title='Target proportion by number of statements')

**As the plot above shows, customers who were inactive for some of the months have higher default rates.**
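Because only a small share of customers have fewer than 13 statements, it may help to read the default rate next to the group size. The sketch below uses made-up toy rows in place of `cust`; the aggregation itself would apply unchanged to the real frame.

```python
import pandas as pd

# Toy stand-in for `cust` after merging the labels (made-up rows).
toy = pd.DataFrame({
    "num_obs": [13, 13, 13, 13, 5, 5, 2],
    "target":  [0, 0, 1, 0, 1, 0, 1],
})

# Default rate and number of customers per statement count.
summary = (
    toy.groupby("num_obs")["target"]
    .agg(default_rate="mean", n_customers="count")
    .sort_index()
)
print(summary)
```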

# **Time periods for test data**

In [None]:
del(train)
del(dep)

In [None]:

test = pd.read_csv("/kaggle/input/amex-default-prediction/test_data.csv", dtype=dtype_dict, usecols=['customer_ID','S_2'])
print(f' Number of customers {test.customer_ID.nunique()} , Number of rows {test.shape[0]}')

In [None]:
test['stmt_mon'] = test['S_2'].to_numpy().astype('datetime64[M]')
gp = test.groupby('stmt_mon').agg({'customer_ID':'nunique'})
print(gp)
gp.plot.bar(title='Number of unique customers in each month for test dataset')

In [None]:
cust_test = test.groupby(['customer_ID']).agg({'customer_ID':'count',
                             'stmt_mon':['min','max']
                              
                             })
cust_test.columns = ['num_obs','st_mon','end_mon']
cust_test = cust_test.reset_index()
cust_test.groupby('st_mon').agg({'customer_ID':'count','num_obs':['min','max']})

### Checking overlap of customers

In [None]:
# cust_test was reset_index()-ed above, so compare the customer_ID column, not the index
set(cust_test.customer_ID).intersection(cust.customer_ID)

In [None]:
cust_test.groupby('end_mon').agg({'customer_ID':'count','num_obs':['min','max']})

# **Test data is from 2 Time periods** 

1. April 2018 to April 2019
2. October 2018 to October 2019
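To work with the two periods separately, each test customer can be labelled by the month of their last statement. This is a sketch on a made-up toy frame with the same `end_mon` column as `cust_test`; the `vintage` column name is an assumption introduced here.

```python
import pandas as pd

# Toy stand-in for `cust_test` (made-up rows).
toy_test = pd.DataFrame({
    "customer_ID": ["t1", "t2", "t3"],
    "end_mon": pd.to_datetime(["2019-04-01", "2019-10-01", "2019-04-01"]),
})

# Label each customer by the month of their last statement.
toy_test["vintage"] = toy_test["end_mon"].dt.strftime("%Y-%m")

# Number of customers in each vintage.
vintage_counts = toy_test["vintage"].value_counts().sort_index()
print(vintage_counts)
```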

**Because DQ behavior is influenced by seasonality, out-of-time validation might be the most challenging part of this competition, considering that the train data comes from a single vintage (March 2018).**


# Leaderboard scoring is done on the April 2019 vintage, and final evaluation will be done on the October 2019 vintage

