### Description:
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

[Home Credit](http://www.homecredit.net/about-us.aspx) strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

# Import Libraries and The Data
First, we import the necessary libraries:

In [None]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

Then, we load the data.

In [None]:
#load train and test dataset
train = pd.read_csv('/kaggle/input/home-credit-default-risk/application_train.csv')
test = pd.read_csv('/kaggle/input/home-credit-default-risk/application_test.csv')

previous_application = pd.read_csv('/kaggle/input/home-credit-default-risk/previous_application.csv')
installment_payment = pd.read_csv('/kaggle/input/home-credit-default-risk/installments_payments.csv')

# Exploratory Data Analysis
## Exploration: Train and Test Dataset
### - Concise Summary
Display ```info()``` and ```head()``` to familiarize ourselves with the train and test datasets.

In [None]:
#concise summary of train dataset
train.info()
train.head()

In [None]:
#concise summary of test dataset
test.info()
test.head()

The target variable indicates whether the client had payment difficulties: 1 means the client was more than X days late on a payment, and 0 covers all other cases.

### - Check For Anomalies
One way to do this is to analyze the output of the ```describe()``` method. We will check for anomalies such as typos, extreme outliers, dtype mix-ups between numerical and categorical features, etc.

In [None]:
#describe dataset
train.describe(include='all')

Quick observation on the ```describe()``` output :

- The maximum value of the DAYS_EMPLOYED feature is positive. That looks like an error, since the feature is described as 'How many days before the application the person started current employment' and should therefore be negative. Let's plot the distribution of DAYS_EMPLOYED to visualize it.

In [None]:
#plot distribution
sns.distplot(train['DAYS_EMPLOYED']);

From the plot above, we can see quite a lot of outliers with the value 365243. Since we don't know whether this was a typo or intentional, we will handle these anomalies by setting them to missing values.

In [None]:
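#set anomalies to a missing value
#(assign the result back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas)
train['DAYS_EMPLOYED'] = train['DAYS_EMPLOYED'].replace(365243, np.nan)
test['DAYS_EMPLOYED'] = test['DAYS_EMPLOYED'].replace(365243, np.nan)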

In [None]:
#plot distribution after removing anomalies
sns.distplot(train['DAYS_EMPLOYED']);

Now the distribution looks like what we would expect.
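
Replacing the sentinel with a missing value discards the fact that a row ever held it. As a minimal sketch on toy data (the flag column `DAYS_EMPLOYED_ANOM` is a hypothetical name, not part of the original dataset), one option is to record that fact before replacing:

```python
import numpy as np
import pandas as pd

# Toy data; 365243 is the sentinel value observed in DAYS_EMPLOYED.
toy = pd.DataFrame({'DAYS_EMPLOYED': [-1200, 365243, -35, 365243, -8000]})

# Hypothetical flag column: remembers which rows held the sentinel.
toy['DAYS_EMPLOYED_ANOM'] = toy['DAYS_EMPLOYED'] == 365243

# Replace the sentinel with a missing value.
toy['DAYS_EMPLOYED'] = toy['DAYS_EMPLOYED'].replace(365243, np.nan)

n_flagged = int(toy['DAYS_EMPLOYED_ANOM'].sum())
n_missing = int(toy['DAYS_EMPLOYED'].isna().sum())
print(n_flagged, n_missing)
```

Whether such a flag helps the model is an open question; it simply keeps the information available after imputation.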

### - Check For Duplicates

In [None]:
#check for duplicated data
print('Duplicated value(s) on the train dataset : ', train.duplicated().sum())
print('Duplicated value(s) on the test dataset  : ', test.duplicated().sum())

There are no duplicated rows.
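
Note that ```duplicated()``` without arguments only flags rows that are identical in every column. A small sketch on toy data (the values are made up) shows how the key column alone could be checked as well:

```python
import pandas as pd

# Toy data: ID 2 appears twice, but with different values in another column.
toy = pd.DataFrame({'SK_ID_CURR': [1, 2, 2, 3],
                    'AMT_CREDIT': [10.0, 20.0, 25.0, 30.0]})

full_dups = int(toy.duplicated().sum())                     # identical rows
id_dups = int(toy.duplicated(subset='SK_ID_CURR').sum())    # repeated IDs
print(full_dups, id_dups)
```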

### - Check The Distribution of The Target Array

In [None]:
#plot distribution of the target array
#(pass the column by keyword; recent seaborn versions no longer accept it positionally)
sns.countplot(x='TARGET', data=train);

In [None]:
print('Percentage of the target array distribution:')
print('--------------------------------------------')
print(train['TARGET'].value_counts() / len(train['TARGET']) * 100)

From the distribution plot, we can see that the dataset is highly imbalanced: the positive class accounts for only 8.07% of all targets. To deal with imbalanced data, we can either use a resampling technique such as over- or under-sampling, or set ```class_weight``` to 'balanced' when tuning the machine learning model. In this case, we will weight the classes rather than resample, because resampling can lead to overfitting when the data is highly imbalanced.
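
To make the weighting concrete, here is a small sketch on toy labels (the 920/80 split is made up to mimic the imbalance above). It computes the majority-to-minority ratio and the 'balanced' heuristic that scikit-learn documents as `n_samples / (n_classes * np.bincount(y))`:

```python
import numpy as np

# Toy labels with roughly the imbalance described above (8% positives).
y = np.array([0] * 920 + [1] * 80)

counts = np.bincount(y)            # number of samples per class
ratio = counts[0] / counts[1]      # majority-to-minority ratio

# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
balanced = len(y) / (len(counts) * counts)
print(ratio, balanced)
```

The ratio is the kind of value that can be passed as a positive-class weight, while the 'balanced' weights rescale both classes so that each contributes equally in total.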

### - Check For Missing Values
Check for percentage of missing values in each feature.

In [None]:
print('percentage of missing values for each feature:')
print('----------------------------------------------')
train.isnull().sum().sort_values(ascending=False) / len(train) * 100

Our strategy is to fill missing values in categorical features with the ```mode()``` and missing values in numerical features with the ```mean()```.

In [None]:
#iterate over test.columns so that TARGET (present only in train) is skipped
for feature in test.columns:
    if (train[feature].dtype == 'object'):
        #fill missing values in categorical features with its mode() 
        train[feature] = train[feature].fillna(train[feature].mode()[0])
        test[feature] = test[feature].fillna(test[feature].mode()[0])
    else:
        #fill missing values in numerical features with its mean()
        train[feature] = train[feature].fillna(train[feature].mean())
        test[feature] = test[feature].fillna(test[feature].mean())

In [None]:
#check for any missing data
print('columns with missing data in the train dataset : ', train.isnull().any().sum())
print('columns with missing data in the test dataset  : ', test.isnull().any().sum())
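
The loop above fills the test set with the test set's own statistics. A common alternative, sketched here on toy data, is to learn the fill values from the train set only and reuse them for the test set. The helper `fit_fill_values` is a hypothetical name introduced for this illustration:

```python
import numpy as np
import pandas as pd

def fit_fill_values(df):
    """Hypothetical helper: mean for numeric columns, mode for the rest."""
    return {col: (df[col].mean() if pd.api.types.is_numeric_dtype(df[col])
                  else df[col].mode()[0])
            for col in df.columns}

toy_train = pd.DataFrame({'INCOME': [100.0, np.nan, 300.0],
                          'GENDER': ['F', 'F', None]})
toy_test = pd.DataFrame({'INCOME': [np.nan, 50.0],
                         'GENDER': [None, 'M']})

fill_values = fit_fill_values(toy_train)   # learned from train only
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)    # the same values are reused for test
print(toy_test)
```

This keeps both datasets on the same scale of imputed values; whether it changes the final score here has not been measured.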

## Exploration: Previous Application Dataset
### - Concise Summary
Display ```info()``` and ```head()``` to familiarize ourselves with the previous application dataset.

In [None]:
#concise summary of previous application dataset
previous_application.info()
previous_application.head()

### - Check For Anomalies

In [None]:
#describe dataset
previous_application.describe(include='all')

Quick observation on the ```describe()``` output :

- AMT_DOWN_PAYMENT has negative values. Let's take a closer look at the dataset.

In [None]:
previous_application[previous_application['AMT_DOWN_PAYMENT'] < 0]['AMT_DOWN_PAYMENT'].count()

There are only 2 negative values. We will set them to 0, since doing so should not significantly affect the models.

In [None]:
#set negative value to 0
previous_application.loc[previous_application['AMT_DOWN_PAYMENT'] < 0 , 'AMT_DOWN_PAYMENT'] = 0

- The DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE and DAYS_TERMINATION features have positive values. That looks like an error, since these features are described as 'Relative to application date of current application when ... ' and should therefore be negative. Let's plot the distributions of these features to visualize it.

In [None]:
sns.distplot(previous_application['DAYS_FIRST_DRAWING'], kde=False);

In [None]:
sns.distplot(previous_application['DAYS_FIRST_DUE'], kde=False);

In [None]:
sns.distplot(previous_application['DAYS_LAST_DUE_1ST_VERSION'], kde=False);

In [None]:
sns.distplot(previous_application['DAYS_LAST_DUE'], kde=False);

In [None]:
sns.distplot(previous_application['DAYS_TERMINATION'], kde=False);

As before, since we don't know whether this was a typo or intentional, we will handle these anomalies by setting them to missing values.

In [None]:
#set anomalies to a missing value
days_features = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
                 'DAYS_LAST_DUE', 'DAYS_TERMINATION']
#replace the sentinel in all five columns at once and assign the result back
previous_application[days_features] = previous_application[days_features].replace(365243, np.nan)
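
Before or after replacing, it can help to quantify how common the sentinel is in each column. The following is a sketch on toy data (the values are made up; only the sentinel 365243 comes from the dataset):

```python
import numpy as np
import pandas as pd

SENTINEL = 365243
toy_prev = pd.DataFrame({'DAYS_FIRST_DUE': [-500.0, SENTINEL, -30.0, -90.0],
                         'DAYS_TERMINATION': [SENTINEL, SENTINEL, -10.0, np.nan]})

# Share (%) of rows holding the sentinel, per column
sentinel_share = (toy_prev == SENTINEL).mean() * 100

# Replace the sentinel in every column at once
cleaned = toy_prev.replace(SENTINEL, np.nan)
print(sentinel_share)
print(cleaned.isna().sum())
```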

Plot the distributions after removing the anomalies.

In [None]:
sns.distplot(previous_application['DAYS_FIRST_DRAWING']);

In [None]:
sns.distplot(previous_application['DAYS_FIRST_DUE']);

In [None]:
sns.distplot(previous_application['DAYS_LAST_DUE_1ST_VERSION']);

In [None]:
sns.distplot(previous_application['DAYS_LAST_DUE']);

In [None]:
sns.distplot(previous_application['DAYS_TERMINATION']);

### - Check For Missing Values

In [None]:
print('percentage of missing values for each feature:')
print('----------------------------------------------')
previous_application.isnull().sum().sort_values(ascending=False) / len(previous_application) * 100

In [None]:
for feature in previous_application.columns:
    if (previous_application[feature].dtype == 'object'):
        #fill missing values in categorical features with its mode() 
        previous_application[feature] = previous_application[feature].fillna(previous_application[feature].mode()[0])
    else:
        #fill missing values in numerical features with its mean() 
        previous_application[feature] = previous_application[feature].fillna(previous_application[feature].mean())

In [None]:
#check for any missing data
print('missing data in previous application dataset : ', previous_application.isnull().any().sum())

## Exploration: Installment Payment
### - Concise Summary
Display ```info()``` and ```head()``` to familiarize ourselves with the installment payment dataset.

In [None]:
#concise summary
installment_payment.info()
installment_payment.head()

### - Check For Anomalies

In [None]:
#describe dataset
installment_payment.describe(include='all')

No anomalies were found.

### - Check For Missing Values

In [None]:
print('percentage of missing values for each feature:')
print('----------------------------------------------')
installment_payment.isnull().sum().sort_values(ascending=False) / len(installment_payment) * 100

In [None]:
for feature in installment_payment.columns:
    if (installment_payment[feature].dtype == 'object'):
        #fill missing values in categorical features with its mode() 
        installment_payment[feature] = installment_payment[feature].fillna(installment_payment[feature].mode()[0])
    else:
        #fill missing values in numerical features with its mean() 
        installment_payment[feature] = installment_payment[feature].fillna(installment_payment[feature].mean())

In [None]:
#check for any missing data
print('missing data in installment payment dataset : ', installment_payment.isnull().any().sum())

# Feature Engineering
## Previous Application Dataset
### - Feature Creation

In [None]:
#for each ID, count the number of previous applications
prev_app_count = previous_application[['SK_ID_CURR', 'SK_ID_PREV']].groupby('SK_ID_CURR', as_index=False).count()
prev_app_count.rename(columns={'SK_ID_PREV':'PREV_APP_COUNT'}, inplace=True)

#merge to train and test dataset
#note: pd.merge defaults to an inner join, so IDs without a previous application are dropped
train = pd.merge(train, prev_app_count, on='SK_ID_CURR')
test = pd.merge(test, prev_app_count, on='SK_ID_CURR')

prev_app_count.head()
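
Because ```pd.merge``` defaults to an inner join, applicants that never appear in the previous application table are removed from train and test by the merge above. If every applicant should be kept, a left join is one option. The sketch below uses toy data, and filling the missing count with 0 is an assumption about how such applicants should be treated:

```python
import pandas as pd

toy_train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
toy_count = pd.DataFrame({'SK_ID_CURR': [1, 3], 'PREV_APP_COUNT': [4, 1]})

# Default (inner) join: ID 2 has no previous application and disappears.
inner = pd.merge(toy_train, toy_count, on='SK_ID_CURR')

# Left join: every applicant is kept; the missing count becomes NaN.
left = pd.merge(toy_train, toy_count, on='SK_ID_CURR', how='left')

# Assumption: no previous application means a count of 0.
left['PREV_APP_COUNT'] = left['PREV_APP_COUNT'].fillna(0).astype(int)
print(len(inner), len(left))
```

With a left join, other aggregated features (such as means) would stay missing for these applicants and would need their own handling.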

In [None]:
#recent application for each ID
recent_app = previous_application[['SK_ID_CURR', 'DAYS_DECISION']].groupby('SK_ID_CURR', as_index=False).max()
recent_app.rename(columns={'DAYS_DECISION':'RECENT_APP'}, inplace=True)

#merge to train and test dataset
train = pd.merge(train, recent_app, on='SK_ID_CURR')
test = pd.merge(test, recent_app, on='SK_ID_CURR')

recent_app.head()

In [None]:
#for each ID, average the numerical features of the previous applications
#(numeric_only=True skips the categorical columns, which cannot be averaged)
prev_app_mean = previous_application.groupby('SK_ID_CURR', as_index=False).mean(numeric_only=True)
prev_app_mean.drop(['SK_ID_PREV'], axis=1, inplace=True)

#add the 'PREV_' prefix and the '_MEAN' suffix
prev_app_mean.columns = ['PREV_' + col_name + '_MEAN' if col_name != 'SK_ID_CURR' else col_name for col_name in prev_app_mean.columns]

#merge to train and test dataset
train = pd.merge(train, prev_app_mean, on='SK_ID_CURR')
test = pd.merge(test, prev_app_mean, on='SK_ID_CURR')

prev_app_mean.head()
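
The count, most recent application, and mean features above can also be computed in a single pass with named aggregation. This is a sketch on toy data, with one mean column standing in for the full set:

```python
import pandas as pd

toy_prev = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                         'SK_ID_PREV': [10, 11, 12],
                         'DAYS_DECISION': [-300, -20, -700],
                         'AMT_CREDIT': [1000.0, 3000.0, 500.0]})

prev_features = toy_prev.groupby('SK_ID_CURR', as_index=False).agg(
    PREV_APP_COUNT=('SK_ID_PREV', 'count'),
    # DAYS_DECISION is negative, so the maximum is the most recent decision
    RECENT_APP=('DAYS_DECISION', 'max'),
    PREV_AMT_CREDIT_MEAN=('AMT_CREDIT', 'mean'),
)
print(prev_features)
```

A single aggregated table also means a single merge into train and test instead of three.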

## Installment Payment Dataset
### - Feature Creation

In [None]:
#for each ID, count the number of installment payment
inst_pay_count = installment_payment[['SK_ID_CURR', 'SK_ID_PREV']].groupby('SK_ID_CURR', as_index=False).count()
inst_pay_count.rename(columns={'SK_ID_PREV':'INST_PAY_COUNT'}, inplace=True)

#merge to train and test dataset
train = pd.merge(train, inst_pay_count, on='SK_ID_CURR')
test = pd.merge(test, inst_pay_count, on='SK_ID_CURR')

inst_pay_count.head()

In [None]:
#for each ID, average the features of the installment payments
inst_pay_mean = installment_payment.groupby('SK_ID_CURR', as_index=False).mean()
inst_pay_mean.drop(['SK_ID_PREV'], axis=1, inplace=True)

#add the 'INST_' prefix and the '_MEAN' suffix
inst_pay_mean.columns = ['INST_' + col_name + '_MEAN' if col_name != 'SK_ID_CURR' else col_name for col_name in inst_pay_mean.columns]

#merge to train and test dataset
train = pd.merge(train, inst_pay_mean, on='SK_ID_CURR')
test = pd.merge(test, inst_pay_mean, on='SK_ID_CURR')

inst_pay_mean.head()

## Train and Test Dataset
### - Feature Creation
- DEBT_BURDEN_RATIO : The debt burden ratio (DBR) is the ratio of a borrower's debt payments to their income. Assuming lenders cap the DBR at 35%, this feature stores the resulting limit, i.e. 35% of AMT_INCOME_TOTAL
- ANNUITY_TO_DBR : The annuity as a percentage of that 35% limit (DEBT_BURDEN_RATIO)
- ANNUITY_TO_CREDIT : The annuity as a percentage of the approved credit

In [None]:
main_dataset = [train, test] 

for dataset in main_dataset:
    dataset['DEBT_BURDEN_RATIO'] = dataset['AMT_INCOME_TOTAL'] * (35/100)
    dataset['ANNUITY_TO_DBR'] = (dataset['AMT_ANNUITY'] / dataset['DEBT_BURDEN_RATIO']) * 100
    dataset['ANNUITY_TO_CREDIT'] = (dataset['AMT_ANNUITY'] / dataset['AMT_CREDIT']) * 100

train.head()
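
As a worked example of the arithmetic, consider a single made-up applicant with an income of 200,000, an annuity of 25,000, and a credit of 500,000. The 35% threshold is the assumption stated above:

```python
import pandas as pd

toy = pd.DataFrame({'AMT_INCOME_TOTAL': [200000.0],
                    'AMT_ANNUITY': [25000.0],
                    'AMT_CREDIT': [500000.0]})

# 35% of income: about 70,000
toy['DEBT_BURDEN_RATIO'] = toy['AMT_INCOME_TOTAL'] * (35/100)
# 25,000 / 70,000 * 100: about 35.7
toy['ANNUITY_TO_DBR'] = (toy['AMT_ANNUITY'] / toy['DEBT_BURDEN_RATIO']) * 100
# 25,000 / 500,000 * 100: 5
toy['ANNUITY_TO_CREDIT'] = (toy['AMT_ANNUITY'] / toy['AMT_CREDIT']) * 100
print(toy)
```

Here the annuity uses roughly 36% of the assumed limit and equals 5% of the credit.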

### - Correlations
Finding correlations of all features with the target.

In [None]:
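#correlations (numeric_only=True skips the categorical columns, which have no correlation)
#[1:] drops TARGET itself, which always comes first with a correlation of 1
train.corr(numeric_only=True)['TARGET'].sort_values(ascending=False)[1:]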
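
Sorting by the raw correlation places strongly negative features at the bottom of the list. If the strength of the relationship matters more than its sign, the features can be ranked by absolute correlation instead. A sketch on toy data (the columns A, B, and C are made up):

```python
import pandas as pd

toy = pd.DataFrame({'TARGET': [0, 0, 1, 1],
                    'A': [1.0, 2.0, 3.0, 4.0],
                    'B': [4.0, 3.0, 2.0, 1.5],
                    'C': ['x', 'y', 'x', 'y']})   # categorical, ignored by corr

corr = toy.corr(numeric_only=True)['TARGET'].drop('TARGET')

# Order features by absolute correlation while keeping the signed values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

In this toy example B is negatively correlated with the target, yet it ranks first because its absolute correlation is the largest.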

In [None]:
#drop features based on correlations
features_to_be_dropped = ['SK_ID_CURR',
                          'FLAG_MOBIL']

#store target array
target_array = train['TARGET']
train.drop('TARGET', axis=1, inplace=True)

#store test's SK_ID_CURR
SK_ID_CURR = test['SK_ID_CURR']

#drop features
train.drop(features_to_be_dropped, axis=1, inplace=True)
test.drop(features_to_be_dropped, axis=1, inplace=True)

#print shape
print('train shape: ', train.shape)
print('test shape: ', test.shape)

### - One-Hot Encoding

In [None]:
train['SOURCE'] = 'train'
test['SOURCE'] = 'test'

#combine train and test dataset
combined_data = pd.concat([train, test], ignore_index=True)

print(train.shape, test.shape, combined_data.shape)

In [None]:
#create dummies
combined_data = pd.get_dummies(combined_data, drop_first=True)
combined_data.shape

In [None]:
combined_data.head()
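
Combining train and test before encoding guarantees that both end up with the same dummy columns. An alternative that keeps the two datasets separate, sketched here on toy data, is to encode each one and then align the test columns to the train columns. Note that this sketch does not use ```drop_first```, because separately encoded frames could otherwise drop different baseline categories:

```python
import pandas as pd

toy_train = pd.DataFrame({'AMT': [1.0, 2.0, 3.0],
                          'TYPE': ['cash', 'card', 'pos']})
toy_test = pd.DataFrame({'AMT': [4.0, 5.0],
                         'TYPE': ['card', 'card']})

train_d = pd.get_dummies(toy_train, dtype=int)
test_d = pd.get_dummies(toy_test, dtype=int)

# Give test the same columns, in the same order, as train;
# categories never seen in test become all-zero columns.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d)
```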

## Creating features matrix (X) and target array (y)

In [None]:
X = combined_data[combined_data['SOURCE_train'] == 1].copy()
y = target_array

X_test = combined_data[combined_data['SOURCE_train'] == 0].copy()

X.drop(['SOURCE_train'], axis=1, inplace=True)
X_test.drop(['SOURCE_train'], axis=1, inplace=True)

### - Check For Overfitting
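Check for near-constant features, i.e. features where a single value (typically zero) fills almost every row. Such features carry little signal and can encourage overfitting: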

In [None]:
def overfit_zeros(df, limit=99.95):
    """df (dataframe)  : data
       limit (float)   : maximum share (%) of rows that a single value may fill
       Returns a list of features whose most frequent value (typically zero)
       exceeds the limit; such near-constant features can encourage overfitting.
    """
    overfit = []
    
    for i in df.columns:
        #value_counts() sorts by frequency, so the first entry is the most common value
        counts = df[i].value_counts()
        most_common = counts.iloc[0]
        if most_common / len(df) * 100 > limit:
            overfit.append(i)
    
    return overfit

In [None]:
#list of overfitted features
overfitted_features = overfit_zeros(X)

#print overfitted features
print('Overfitted features :')
print('---------------------')
for feature in overfitted_features:
    print(feature)

In [None]:
#drop overfitted features
X.drop(overfitted_features, axis=1, inplace=True)
X_test.drop(overfitted_features, axis=1, inplace=True)
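
To see what this check measures, the sketch below computes the share of the most frequent value per column on toy data. The helper `dominant_share` and the 99.5% limit are hypothetical choices for this illustration:

```python
import pandas as pd

def dominant_share(series):
    """Hypothetical helper: share (%) of rows taken by the most frequent value."""
    return series.value_counts(normalize=True).iloc[0] * 100

# FLAG_RARE is zero in 999 of 1000 rows; AMT never repeats a value.
toy = pd.DataFrame({'FLAG_RARE': [0] * 999 + [1],
                    'AMT': list(range(1000))})

shares = toy.apply(dominant_share)
near_constant = shares[shares > 99.5].index.tolist()
print(shares)
print(near_constant)
```

Only the flag column crosses the limit, so it is the only one that would be dropped.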

### - Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
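
Standardization subtracts the mean and divides by the standard deviation, both learned from the training data and then reused for the test data. The following is a sketch of that arithmetic for a single toy column, using NumPy only:

```python
import numpy as np

train_col = np.array([1.0, 2.0, 3.0, 4.0])
test_col = np.array([2.5, 6.0])

mean = train_col.mean()
# Population standard deviation (ddof=0), the convention StandardScaler uses
std = train_col.std()

train_z = (train_col - mean) / std
test_z = (test_col - mean) / std    # reuse the training statistics
print(train_z, test_z)
```

After the transformation the training column has mean 0 and standard deviation 1, while the test column is only guaranteed to be on the same scale.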

# Creating A Model
## Model Selection: CatBoostClassifier
The name CatBoost comes from two words, Category and Boosting. "Boosting" refers to gradient boosting, the machine learning algorithm on which the library is based. Gradient boosting is a powerful algorithm that is widely applied to business problems such as fraud detection, recommendation systems, and forecasting, and it often performs well.

We begin by splitting the data into two subsets: one for training and one for validation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=7)

In [None]:
from catboost import CatBoostClassifier

#configure the model
cat_boost = CatBoostClassifier(iterations = 1000,
                               scale_pos_weight = 11, #approximate ratio of majority class to minority class
                               learning_rate = 0.01, 
                               depth = 8,
                               eval_metric = 'AUC',
                               random_seed = 7)

#fitting
cat_boost.fit(X_train, y_train, 
              eval_set=(X_eval, y_eval))

## Metrics
The results will be evaluated on the area under the ROC (Receiver Operating Characteristic) curve between the predicted probability and the observed target, because we are dealing with an imbalanced dataset. We should not use the accuracy score because it is biased towards the majority class.

The Area Under the ROC Curve, known as ROC AUC, measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1). AUC ranges in value from 0 to 1; a model that ranks clients at random scores about 0.5.
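
ROC AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy scores; `pairwise_auc` is a hypothetical helper written for illustration, and it compares every pair, so it is only practical for small inputs:

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four pairs are ranked correctly
```

Because only the ordering of the scores matters, the metric is unaffected by how many examples each class has, which is why it suits an imbalanced target.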

In [None]:
from sklearn.metrics import roc_auc_score

#probability
cat_boost_positive_prob = cat_boost.predict_proba(X_eval)[:, 1]

print('CatBoostClassifier ROC AUC score : ', roc_auc_score(y_eval, cat_boost_positive_prob))

## Feature Importances
The ```feature_importances_``` attribute gives us one score per feature; the higher the score, the more important the feature.

In [None]:
#feature importances
feature_importances = pd.DataFrame({'feature'   : X_train.columns,
                                    'importance': cat_boost.feature_importances_})

#plot top 10 feature importances
sns.barplot(x='importance', y='feature', data=feature_importances.sort_values('importance', ascending=False)[:10]);

## Make A Prediction
In this section, we predict the probability of the positive class, i.e. that the client had payment difficulties, and rank the clients from the highest probability to the lowest. From there, lenders can set a threshold that reflects how much risk they are willing to accept and decide who is eligible for a loan.

In [None]:
#create a predicted probability dataframe
prediction = pd.DataFrame({'SK_ID_CURR': SK_ID_CURR,
                           'TARGET_POSITIVE_PROB': cat_boost.predict_proba(X_test)[:, 1]})

#sort from the highest probability to the lowest
prediction = prediction.sort_values('TARGET_POSITIVE_PROB', ascending=False)

prediction
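
As a sketch of how a lender might apply such a threshold, the example below uses made-up probabilities. The threshold of 0.40 and the `ELIGIBLE` column are hypothetical and would in practice be chosen from validation data and business constraints:

```python
import pandas as pd

toy_pred = pd.DataFrame({'SK_ID_CURR': [101, 102, 103, 104],
                         'TARGET_POSITIVE_PROB': [0.82, 0.07, 0.41, 0.15]})

threshold = 0.40   # hypothetical level of acceptable risk

toy_pred = toy_pred.sort_values('TARGET_POSITIVE_PROB', ascending=False)
# Clients below the threshold are considered eligible
toy_pred['ELIGIBLE'] = toy_pred['TARGET_POSITIVE_PROB'] < threshold

eligible_ids = toy_pred.loc[toy_pred['ELIGIBLE'], 'SK_ID_CURR'].tolist()
print(eligible_ids)
```

Lowering the threshold rejects more clients and reduces risk; raising it does the opposite.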