# Home Credit Kaggle Competition

- Flaminia Spasiano (1889394)
- Onur Çopur (1891194)
- Anil Keshwani (1919705)

### Project Description

Our predictive analysis consisted of data cleaning, exploratory data analysis and modelling, the latter including feature selection and model evaluation. 

The initial **data cleaning** required:  

- basic recoding of specific columns (e.g. age of client)
- creation of dummy variables to replace categorical variables such as client gender or housing situation
- imputation of missing values (we imputed missing values according to column medians)

We performed this using a combination of readily available functions (Numpy and Pandas) and convenience functions we wrote ourselves. 
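
As a rough illustration of the three cleaning steps listed above, here is a minimal sketch on a made-up three-row table. The column names mirror the application table, but the values and the `AGE_YEARS` name are assumptions for illustration, not our actual convenience functions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the application table (values are made up)
toy = pd.DataFrame({
    "DAYS_BIRTH": [-9125, -14600, -18250],
    "CODE_GENDER": ["F", "M", "F"],
    "AMT_INCOME_TOTAL": [100000.0, np.nan, 300000.0],
})

# Recode: negative days since birth -> age in years
toy["AGE_YEARS"] = toy["DAYS_BIRTH"] / -365

# Dummy variables: one 0/1 column per category level
toy = pd.get_dummies(toy, columns=["CODE_GENDER"], dtype=int)

# Median imputation, column by column
numeric_cols = toy.select_dtypes("number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy)
```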

During **exploratory data analysis**, we created a number of cross-tabulations of the target variable (default or late loan repayment) with features we might expect to be predictive such as income type, loan type and age. We produced a range of interactive visualisations using Plotly (via custom wrapper functions for convenience) to help build some intuition about the dataset. 

To conduct **modelling**, we merged data from the ancillary tables (e.g. _bureau_, _bureau\_balance_, _previous\_application_ etc.) with the main _application_ table after aggregation on the `SK_ID_CURR` key. We then performed feature selection on this extended dataset by eliminating highly collinear variables (i.e. those with high absolute correlations). 

We attempted several modelling approaches including:  

- penalised logistic regression
- support vector machines
- light gradient boosting

We used k-fold ($k=5$) cross-validation to optimise hyperparameters in our models (e.g. weight of penalisation term in regularised logistic regression) and to provide estimates of test error.
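
The tuning step can be sketched as follows. This is a minimal example on synthetic data, assuming scikit-learn's `GridSearchCV` and an arbitrary grid of `C` values; it is not the exact search we ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data (roughly 8% positives)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.92],
                           random_state=0)

# L2-penalised logistic regression (scikit-learn's default penalty)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# C is the inverse of the penalisation weight: smaller C = stronger penalty
grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
cv_auc = search.best_score_
print(f"best C = {best_C}, mean 5-fold AUC = {cv_auc:.3f}")
```

The mean validation AUC of the winning setting doubles as a (slightly optimistic) estimate of test performance.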

Because the classes of the target variable are highly imbalanced, and in line with the competition's assessment criterion, we used ROC curves and the _area under the curve_ (AUC) metric to assess our models.

As shown in our notebook, our best performance was achieved with light gradient boosting on the extended version of the dataset, in which data from the ancillary tables was aggregated and joined onto the core _application_ table. 

### Note on the Outcome Variable: `TARGET`

- TARGET == 1: _late payment more than X days on at least one of the first Y installments of the loan in our sample_
- TARGET == 0: client repaid loan on time

NB For brevity, we will often refer to loans being paid or not paid, on the understanding that this refers to the repayment status at the due date.

### Resources We Used

A list of key notebooks from which we adapted code. 

- A Gentle Introduction: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
- Feature Selection: https://www.kaggle.com/willkoehrsen/introduction-to-feature-selection
- Home Credit Default Risk Extensive EDA: https://www.kaggle.com/gpreda/home-credit-default-risk-extensive-eda
- Home Credit : Complete EDA + Feature Importance: https://www.kaggle.com/codename007/home-credit-complete-eda-feature-importance/notebook
- Light-GBM : https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc

In [None]:
import csv
import os

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

# Plotting

from matplotlib import pyplot
import seaborn as sns 
import plotly
import plotly.offline as py
from plotly.offline import iplot
import plotly.graph_objs as go
import cufflinks as cf

py.init_notebook_mode(connected=True)
cf.go_offline()

# Data Preprocessing, Models, Feature and Model Selection

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split, KFold

import lightgbm as lgb

In [None]:
# Load Necessary Datasets

application_train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
application_test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')
previous_applications = pd.read_csv("../input/home-credit-default-risk/previous_application.csv")
bureau = pd.read_csv("../input/home-credit-default-risk/bureau.csv")

# Data Cleaning

# Convert DAYS_BIRTH to age in years of each client (it's expressed in negative days)

application_train["DAYS_BIRTH"] = application_train["DAYS_BIRTH"]/(-365)

The structure of the data is explained in the following image

![](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

# Exploratory Data Analysis

We have included some components of our initial exploration of the datasets to indicate the process we went through to familiarise ourselves with the problem domain, visualise the data and locate missing values by field. 

We use a number of convenience and wrapper functions for repetitive tasks.

### Plotting Convenience Functions

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    missing = pd.concat([total, percent], axis=1)
    missing.rename(columns= {0:'Total', 1:'Percent'}, inplace = True)
    return missing

def plot_iploty_stats(application_train, feature):
    temp = application_train[feature].value_counts()
    df1 = pd.DataFrame({feature: temp.index,'Number of contracts': temp.values})
    
    # Calculate the percentage of target=1 per category value
    
    cat_perc = application_train[[feature, 'TARGET']].groupby([feature],as_index=False).mean()
    cat_perc.sort_values(by='TARGET', ascending=False, inplace=True)
    
    trace = go.Bar(
        x = temp.index,
        y = temp / temp.sum()*100)
    data = [trace]
    layout = go.Layout(
        title = 'Percentage of contracts according to '+feature,
        xaxis=dict(
            title='Values',
            tickfont=dict(
                size=14,
                color='rgb(107, 107, 107)'
            )
        ),
        yaxis=dict(
            title='Percentage of contracts',
            titlefont=dict(
                size=16,
                color='rgb(107, 107, 107)'
            ),
            tickfont=dict(
                size=14,
                color='rgb(107, 107, 107)'
            )
            
        )
    )
    fig = go.Figure(data=data, layout=layout)
    fig.update_yaxes(range=[0, 100])
    
    py.iplot(fig, filename='statistics')
    
    trace = go.Bar(
        x = cat_perc[feature],
        y = cat_perc.TARGET * 100  # mean of the 0/1 target, as a percentage
    )
    data = [trace]
    layout = go.Layout(
        title = 'Percent of contracts with TARGET==1 according to '+feature,
        xaxis=dict(
            title='Values',
            tickfont=dict(
                size=14,
                color='rgb(107, 107, 107)'
            )
        ),
        yaxis=dict(
            title='Percent of target with value 1',
            titlefont=dict(
                size=16,
                color='rgb(107, 107, 107)'
            ),
            tickfont=dict(
                size=14,
                color='rgb(107, 107, 107)'
            )
        )
    )
    fig = go.Figure(data=data, layout=layout)
    
    py.iplot(fig, filename='schoolStateNames')
    
def plot_repayed_perc(application_train, feature, round_feat=-1):
    # Percentage of loans repaid on time or not, by value of the chosen feature
    values = application_train[feature]
    if round_feat > -1:
        # Round a local copy so the caller's DataFrame is left unchanged
        values = np.round(values, round_feat)
    temp = values.value_counts()
    
    temp_y0 = []
    temp_y1 = []
    for val in temp.index:
        temp_y1.append(np.sum(application_train["TARGET"][values == val] == 1))
        temp_y0.append(np.sum(application_train["TARGET"][values == val] == 0))
    trace1 = go.Bar(
        x = temp.index,
        y = (temp_y1 / temp.sum()) * 100,
        name='YES'
    )
    trace2 = go.Bar(
        x = temp.index,
        y = (temp_y0 / temp.sum()) * 100, 
        name='NO'
    )

    data = [trace1, trace2]
    fig = go.Figure(data=data)
    fig.update_layout(showlegend=True, title = 'Loan Defaults Decomposed by ' + feature + ' (Percentage)')

    iplot(fig)
    
def heatmap_coor_matrix(application_train, corr_pearson):
    data = [go.Heatmap(
        z= corr_pearson,
        x=application_train.columns.values,
        y=application_train.columns.values,
        colorscale='Viridis',
        reversescale = False,
        opacity = 1.0 )
       ]
    fig = go.Figure(data=data)
    fig.update_layout(title = 'Pearson Correlation between features', 
                      xaxis = dict(ticks='', nticks=36),
                      yaxis = dict(ticks='' ),
                      width = 900, height = 900,
                      margin=dict(l=240)) 
    py.iplot(fig, filename='coorelation_heatmap')

Check that `DAYS_BIRTH` has been recoded into an age in (positive) years

In [None]:
application_train["DAYS_BIRTH"].head()

Which values does `NAME_TYPE_SUITE` take? (It flags _Who accompanied client when applying for the previous application_; see the data dictionary.)

In [None]:
application_train['NAME_TYPE_SUITE'].unique()

Count number of missing values in `NAME_TYPE_SUITE`

In [None]:
application_train['NAME_TYPE_SUITE'].isna().sum()

Tabulate `CODE_GENDER`, which flags the gender of clients taking out loans

In [None]:
application_train['CODE_GENDER'].value_counts()

### Question: Are the data balanced or imbalanced?

In [None]:
temp = application_train["TARGET"].value_counts()
temp = (temp / temp.sum())*100
temp.iplot(kind='bar', labels='labels', values='values', colors ='green', title='Loan Repayed or Not')

In [None]:
temp = application_train["TARGET"].value_counts()
df = pd.DataFrame({'labels': temp.index,
                   'values': temp.values})

df.iplot(kind='pie', labels='labels', values='values', title='Loan Repayed or not')

### Answer: The data are highly imbalanced

- 91.92% of clients repaid the loan on time 
- 8.07% of clients did not repay the loan on time

This means that accuracy is a misleading performance metric here: a model that predicts on-time repayment for every client would already be about 92% accurate. Instead, we use ROC curves and the _area under the curve_ (AUC) metric to evaluate the power of our predictions. 
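
To make the point concrete, here is a small sketch on synthetic labels (not the competition data) with a similar class imbalance. The constant "model" is a deliberately useless baseline introduced only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with about 8% positives, mimicking the imbalance above
y_true = (rng.random(10_000) < 0.08).astype(int)

# A useless "model": the same score for everyone, so it always predicts class 0
constant_scores = np.zeros_like(y_true, dtype=float)
constant_labels = (constant_scores >= 0.5).astype(int)

acc = accuracy_score(y_true, constant_labels)
auc_constant = roc_auc_score(y_true, constant_scores)
print(f"accuracy = {acc:.3f}, AUC = {auc_constant:.3f}")
```

The accuracy looks high because of the imbalance alone, while the AUC sits at chance level, which is why AUC is the more informative metric for this problem.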

### Outcome Decompositions by Features

### 1. Breakdowns of the occupation type of each client

Grouping by occupation type, we can see that low-skill laborers have the highest likelihood of failing to repay the loan on time. However, this group comprises less than 1% of loans at Home Credit, whereas laborers, sales staff and "core staff" make up larger fractions at 26%, 15% and 13% respectively. 

In [None]:
plot_iploty_stats(application_train, 'OCCUPATION_TYPE')

### 2. Breakdown of clients by gender

The number of female clients is almost double the number of male clients. Looking at the percent of defaulted credits, males have a higher chance of not returning their loans.

In [None]:
plot_iploty_stats(application_train, 'CODE_GENDER')

In [None]:
# Tabulated Default Rates Decomposed by Gender

np.round(pd.crosstab(application_train.CODE_GENDER, application_train.TARGET, margins=True, normalize=0), 3)

### 3. Some statistics about the Family status of each client

We can see that married individuals constitute a large proportion of the loans issued by Home Credit. There appears to be little impact of family status on default risk.

In [None]:
plot_iploty_stats(application_train, 'NAME_FAMILY_STATUS')

### 4. Some statistics about the organization type of each client

We can see that clients' organisation types make a sizeable difference to default risk - at least on visual inspection - with clients whose organisation is in "transport type 3" being at higher risk of default than "industry type 12" for example.

In [None]:
plot_iploty_stats(application_train, "ORGANIZATION_TYPE")

### 5. Some statistics about the clients' income type (businessman, working, maternity leave, ...)

In [None]:
plot_repayed_perc(application_train, "NAME_INCOME_TYPE")

### 6. Some statistics about the clients' contract type

In [None]:
plot_iploty_stats(application_train, "NAME_CONTRACT_TYPE")

### 7. Some statistics about the age of each client

In [None]:
plot_repayed_perc(application_train, "DAYS_BIRTH", round_feat=0)

# Previous Applications

Some elementary exploration of clients' previous loan applications. We present tabular and graphical breakdowns of the purpose of cash loans in previous loan applications.

In [None]:
previous_applications['NAME_CASH_LOAN_PURPOSE'].value_counts()

In [None]:
temp = previous_applications['NAME_CASH_LOAN_PURPOSE'].value_counts()

temp.iplot(kind='bar', color="blue", 
           xTitle = 'Loan purpose', yTitle = "Count", 
           title = 'Types of NAME_CASH_LOAN_PURPOSE in previous applications ')

# Missing Values

We provide breakdowns of the missingness by field for the core tables we use for modelling. Later, we will impute these missing values using column medians.

In [None]:
missing_data(application_train)

In [None]:
missing_data(bureau)

In [None]:
missing_data(previous_applications)

### Create dummy variables for categorical variables, i.e. factors (data cleaning)

In [None]:
# Spin out categorical variables into dummied 0/1 indicator variables

train_one_hot = pd.get_dummies(application_train)

In [None]:
# Check what our data looks like now

train_one_hot.head()

In [None]:
# Check the dimensions after dummying out factors

train_one_hot.shape

### Feature Selection

We can drop **Collinear Variables**

Collinear variables are those which are highly correlated with one another. These can:

- decrease the model's ability to learn
- decrease model interpretability; and 
- decrease generalization performance on the test set

These are three things we want to increase, so removing collinear variables is a useful step. We can pick an arbitrary threshold for removing collinear variables, and then remove one variable out of any pair whose correlation exceeds that threshold.

### Matrix of Pearson Correlations

In [None]:
corr_pearson = train_one_hot.corr().values

In [None]:
# The correlations were computed on train_one_hot, so take the axis labels from it too
heatmap_coor_matrix(train_one_hot, corr_pearson)

In [None]:
# Absolute value correlation matrix

corr_matrix = train_one_hot.corr().abs()

In [None]:
corr_matrix

In [None]:
# where: Replace values where the condition is False. 
# triu: Upper triangle of an array. Return a copy of a matrix with the elements below the k-th diagonal zeroed.

corr_upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))

In [None]:
corr_upper

In [None]:
# Select columns with correlations above threshold

threshold = 0.9
drop = []
for column in corr_upper.columns:
    if any(corr_upper[column]>threshold):
        drop.append(column)


print("columns to drop " + str(len(drop)))

In [None]:
# Some of the features we are going to drop

drop[:10]

In [None]:
train = train_one_hot.drop(columns = drop)

In [None]:
train

We can **drop columns with too many missing values**: We can choose a threshold and drop every column that has a percentage of missing values over the chosen threshold. 

Many learners (for example from the scikit-learn library) do not handle missing values in the feature matrix. 

We use this approach and also **median imputation**.

In [None]:
threshold = 0.55
train_missing = (train.isnull().sum() / len(train)).sort_values(ascending = False)

In [None]:
train_missing = train_missing.index[train_missing > threshold]

print("we are going to drop " + str(len(train_missing))+" columns")

In [None]:
train.drop(columns = train_missing, inplace = True)

In [None]:
train.head()

In [None]:
# Save and drop the SK_ID_CURR column: the ID is just a number and should have no predictive power

ID = train["SK_ID_CURR"] # the ids

train_clean = train.drop(columns = ['SK_ID_CURR'] )


# Save the "TARGET" column and drop it

train_target = train['TARGET'] # the target list
train_clean.drop(columns = ['TARGET'], inplace = True)

In [None]:
train_clean.head()

In [None]:
print('we started from ', application_train.shape)
print('With just one-hot encoding we had ', train_one_hot.shape)
print('after feature selection we now have ', train_clean.shape)

# Test preprocessing

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
#One-hot encoding on the test set 
test_one_hot = pd.get_dummies(application_test)

#and remove irrelevant features from the test set

relevant_features = list(train_clean.columns)

to_drop = [col for col in test_one_hot.columns if col not in relevant_features]
test = test_one_hot.drop(columns = to_drop)

print('we started from ', application_test.shape)
print('With just one-hot encoding we had ', test_one_hot.shape)
print('after feature selection we now have ', test.shape)


In [None]:
#there are 3 columns in train that are not present in test, remove them from train

for col in train_clean.columns:
    if col not in test.columns.tolist():
        train_clean.drop(columns = [col], inplace = True)

print('now we have ', train_clean.shape)

In [None]:
sc = StandardScaler()
train = sc.fit_transform(train_clean)

# Reuse the scaling fitted on the training data (do not refit on the test set)
test = sc.transform(test[train_clean.columns])

# Modelling with Light GBM

### Model building and training

We need to convert our training data into the LightGBM `Dataset` format (this is required for training with `lgb.train`).

After converting the data, we create a Python dictionary of parameters and their values. Model performance depends heavily on the values chosen for these parameters.

In this section we train a model using only the data from `application_train`. Later, we extract features from the other tables and retrain the model on the extended dataset.

In [None]:
# the set of parameters for Light GBM
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'auc'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10

In [None]:
# Split the training data into a training subset and a held-out validation subset

x_train, x_test, y_train, y_test = train_test_split(train ,train_target , test_size=0.4, random_state=18)

In [None]:
# Create the LightGBM data containers

train_data = lgb.Dataset(x_train, label=y_train)

test_data = lgb.Dataset(x_test, label=y_test)

In [None]:
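# Train the model with early stopping on the validation set.
# Note: `early_stopping_rounds` and `verbose_eval` are accepted by LightGBM < 4.0;
# from 4.0 onwards the equivalents are the `lgb.early_stopping` and
# `lgb.log_evaluation` callbacks.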

model = lgb.train(params,
                  train_data,
                  valid_sets=test_data,
                  num_boost_round=5000,
                  early_stopping_rounds=100, 
                  verbose_eval=False)

In [None]:
# Predict on train 

pred_train = model.predict(train)

In [None]:
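# AUC on the full training table. This includes the rows the model was fitted on,
# so it is an optimistic estimate of performance on unseen data.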

y_target = np.array(train_target)

y_predictions = np.array(pred_train)
auc = roc_auc_score(y_target, y_predictions)
auc
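
For an estimate that is less flattering, the AUC can be computed on held-out rows only. The sketch below illustrates the comparison on synthetic data, using scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM; the data, model settings and variable names are assumptions for illustration, not our results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data with some label noise
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.92],
                           flip_y=0.05, random_state=18)
X_fit, X_valid, y_fit, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=18, stratify=y)

clf = GradientBoostingClassifier(random_state=18).fit(X_fit, y_fit)

# AUC on the rows used for fitting versus rows the model has never seen
auc_fit = roc_auc_score(y_fit, clf.predict_proba(X_fit)[:, 1])
auc_valid = roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1])
print(f"AUC on fitting rows: {auc_fit:.3f}, AUC on held-out rows: {auc_valid:.3f}")
```

The held-out figure is typically the lower of the two and is the one to compare against the leaderboard score.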

In [None]:
# Calculate ROC curves

lgbm_fpr, lgbm_tpr, _ = roc_curve(y_target, pred_train)

# Plot the roc curve for the model

pyplot.plot(lgbm_fpr, lgbm_tpr, marker='.', label='LightGBM')

# Axis labels

pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')

# Show the legend

pyplot.legend()

# Show the plot

pyplot.show()

In [None]:
# Predict on the test set to make a submission

preds = model.predict(test)

#save the prediction into a csv file
submissions = pd.DataFrame()
submissions['SK_ID_CURR'] = application_test['SK_ID_CURR']
submissions['TARGET'] = preds
submissions.to_csv("predictions.csv", index=False)

## Feature Extraction

In this section we will focus on extracting information from other data sources besides the main _application_ table and create new features with this information to increase the prediction accuracy.

### Functions for Feature Extraction

Our aim here is to aggregate the information in the _bureau_, _previous\_application_, _POS\_CASH\_balance_, _installments\_payments_ and _credit\_card\_balance_ tables so that it can be merged with the _application_ table.

The numeric aggregation function takes the numeric columns of a table and groups by the `SK_ID_CURR` column. For each numeric column, it computes the count, mean, maximum, minimum and sum for each unique `SK_ID_CURR` value.

The categorical aggregation function takes the categorical columns of a table, applies one-hot encoding and then groups by `SK_ID_CURR`. For each dummy column, it computes the sum (a count) and the mean (a normalised count) for each unique `SK_ID_CURR` value.

The _bureau\_balance_ table is keyed by loan rather than by client, so we first aggregate it by `SK_ID_BUREAU`, merge the result with _bureau_ on `SK_ID_BUREAU` to attach each loan's `SK_ID_CURR`, and then aggregate again by client. Finally, we merge all the aggregated tables with the application train and test tables to obtain a single table containing the information from every source. 
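
The aggregate-then-merge pattern can be sketched on a tiny made-up child table. The `child_` prefix, the `STATUS` column and all values are hypothetical and only serve to show the shape of the output.

```python
import pandas as pd

# Toy child table: several past credits per client (values are made up)
child = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [100.0, 300.0, 50.0],
    "STATUS": ["Active", "Closed", "Closed"],
})
parent = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Numeric columns: summary statistics per client, flattened to single-level names
num = child.groupby("SK_ID_CURR")[["AMT_CREDIT"]].agg(
    ["count", "mean", "max", "min", "sum"])
num.columns = [f"child_{var}_{stat}" for var, stat in num.columns]
num = num.reset_index()

# Categorical columns: one-hot encode, then count and share of each level per client
dummies = pd.get_dummies(child[["STATUS"]], dtype=int)
dummies["SK_ID_CURR"] = child["SK_ID_CURR"]
cat = dummies.groupby("SK_ID_CURR").agg(["sum", "mean"])
cat.columns = [f"child_{var}_{'count' if stat == 'sum' else 'count_norm'}"
               for var, stat in cat.columns]
cat = cat.reset_index()

# Left joins keep clients with no child rows (their new features become NaN)
merged = (parent.merge(num, on="SK_ID_CURR", how="left")
                .merge(cat, on="SK_ID_CURR", how="left"))
print(merged)
```

Client 3 has no rows in the child table, so all of its aggregated features are missing after the left join; this is one source of the missing values handled later.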

In [None]:
def numeric_aggregation(table, key, name):
    """Aggregate the numeric columns of `table` by `key`.

    Returns one row per key value with the count, mean, max, min and sum of
    each numeric column, named '<name>_<column>_<stat>'.
    """
    # Drop any other ID columns: they are identifiers, not features
    for col in table:
        if col != key and "SK_ID" in col:
            table = table.drop(columns=col)
    numeric_table = table.select_dtypes("number")
    numeric_table[key] = table[key]
    agg = numeric_table.groupby(key).agg(['count', 'mean', 'max', 'min', 'sum'])

    # Flatten the (column, statistic) MultiIndex into single-level names
    agg.columns = ['%s_%s_%s' % (name, var, stat) for var, stat in agg.columns]
    return agg.reset_index()

def correlation_func(table):
    corrs = []

    for col in table.columns:
        if col != "TARGET":
            corr = table['TARGET'].corr(table[col])
            corrs.append((col, corr))

    corrs = sorted(corrs, key=lambda x: abs(x[1]), reverse=True)
    return corrs

def categorical_aggregation(table, key, name ):
    try:
        categoricals = pd.get_dummies(table.select_dtypes("object"))
        categoricals[key] = table[key]
    except ValueError:
        return None

    agg = categoricals.groupby(key).agg(["sum", "mean"])

    columns = []

    for var in agg.columns.levels[0]:
        for stat in ["count", "count_norm"]:
            columns.append('%s_%s_%s' % (name, var, stat))
    agg.columns = columns
    return agg

### Import tables containing ancillary data

In [None]:
bureau = pd.read_csv("../input/home-credit-default-risk/bureau.csv")
bureau_balance = pd.read_csv("../input/home-credit-default-risk/bureau_balance.csv")
train_data = pd.read_csv("../input/home-credit-default-risk/application_train.csv")
test_data = pd.read_csv("../input/home-credit-default-risk/application_test.csv")
previous_application = pd.read_csv("../input/home-credit-default-risk/previous_application.csv")
POS_CASH_balance = pd.read_csv("../input/home-credit-default-risk/POS_CASH_balance.csv")
installments_payments = pd.read_csv("../input/home-credit-default-risk/installments_payments.csv")
credit_card_balance = pd.read_csv("../input/home-credit-default-risk/credit_card_balance.csv")

### Aggregation: Bureau


In [None]:
bureau_num_agg = numeric_aggregation(bureau, key="SK_ID_CURR", name="bureau")
bureau_categorical_agg = categorical_aggregation(bureau, key="SK_ID_CURR", name="bureau")

### Aggregation: previous application

In [None]:
previous_application_num_agg = numeric_aggregation(previous_application, key="SK_ID_CURR", name="previous_application")
previous_application_categorical_agg = categorical_aggregation(previous_application, key="SK_ID_CURR", name="previous_application")
del previous_application

### Aggregation: Pos_Cash Balance

In [None]:
POS_CASH_balance_num_agg = numeric_aggregation(POS_CASH_balance, key="SK_ID_CURR", name="POS_CASH_balance")
POS_CASH_balance_categorical_agg = categorical_aggregation(POS_CASH_balance, key="SK_ID_CURR", name="POS_CASH_balance")
del POS_CASH_balance

### Aggregation: installment Payments

In [None]:
installments_payments_num_agg = numeric_aggregation(installments_payments, key="SK_ID_CURR", name="installments_payments")
del installments_payments

### Aggregation: Credit card balance

In [None]:
credit_card_balance_num_agg = numeric_aggregation(credit_card_balance, key="SK_ID_CURR", name="credit_card_balance")
credit_card_balance_categorical_agg = categorical_aggregation(credit_card_balance, key="SK_ID_CURR", name="credit_card_balance")
del credit_card_balance

### Aggregation: Bureau balance

In [None]:
bureau_balance_num_agg = numeric_aggregation(bureau_balance, key="SK_ID_BUREAU", name="bureau_balance")
bureau_balance_categorical_agg = categorical_aggregation(bureau_balance, key="SK_ID_BUREAU", name="bureau_balance")
del bureau_balance

### Aggregate: bureau balance to bureau

In [None]:
bureau_by_loan = bureau_balance_num_agg.merge(bureau_balance_categorical_agg, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')
bureau_balance_by_client = numeric_aggregation(bureau_by_loan.drop(columns=['SK_ID_BUREAU']), key='SK_ID_CURR', name='client')
del bureau
del bureau_by_loan

### Merge with Train Data

In [None]:
train_data = train_data.merge(bureau_num_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(bureau_categorical_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(previous_application_num_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(previous_application_categorical_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(POS_CASH_balance_num_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(POS_CASH_balance_categorical_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(installments_payments_num_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(credit_card_balance_num_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(credit_card_balance_categorical_agg, on = 'SK_ID_CURR', how = 'left')
train_data = train_data.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

### Merge with Test Data

In [None]:
test_data = test_data.merge(bureau_num_agg, on = 'SK_ID_CURR', how = 'left')
del bureau_num_agg
test_data = test_data.merge(bureau_categorical_agg, on = 'SK_ID_CURR', how = 'left')
del bureau_categorical_agg
test_data = test_data.merge(previous_application_num_agg, on = 'SK_ID_CURR', how = 'left')
del previous_application_num_agg
test_data = test_data.merge(previous_application_categorical_agg, on = 'SK_ID_CURR', how = 'left')
del previous_application_categorical_agg
test_data = test_data.merge(POS_CASH_balance_num_agg, on = 'SK_ID_CURR', how = 'left')
del POS_CASH_balance_num_agg
test_data = test_data.merge(POS_CASH_balance_categorical_agg, on = 'SK_ID_CURR', how = 'left')
del POS_CASH_balance_categorical_agg
test_data = test_data.merge(installments_payments_num_agg, on = 'SK_ID_CURR', how = 'left')
del installments_payments_num_agg
test_data = test_data.merge(credit_card_balance_num_agg, on = 'SK_ID_CURR', how = 'left')
del credit_card_balance_num_agg
test_data = test_data.merge(credit_card_balance_categorical_agg, on = 'SK_ID_CURR', how = 'left')
del credit_card_balance_categorical_agg
test_data = test_data.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
del bureau_balance_by_client

### Align train and test datasets

In this step, we align the formats of the train and test datasets to create predictions for the test cases on the basis of the models we have built using the training data.
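
As a small illustration of what the alignment does, consider two toy frames whose columns only partly overlap. The frame and column names here are made up.

```python
import pandas as pd

train_toy = pd.DataFrame({"A": [1, 2], "B": [3, 4], "TARGET": [0, 1]})
test_toy = pd.DataFrame({"B": [5], "A": [6], "C": [7]})

labels = train_toy["TARGET"]

# Keep only the columns the two frames share, in a common order
train_toy, test_toy = train_toy.align(test_toy, join="inner", axis=1)

# TARGET is not in the test frame, so it was dropped; add it back
train_toy["TARGET"] = labels

print(train_toy.columns.tolist())
print(test_toy.columns.tolist())
```

Column `C` (test only) disappears, and `TARGET` has to be saved beforehand and re-attached, which is exactly why the labels are set aside before aligning.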

In [None]:
train_labels = train_data["TARGET"]
train_data, test_data = train_data.align(test_data, join="inner", axis=1)
train_data["TARGET"] = train_labels
print(train_data.shape)
print(test_data.shape)

# handling the missing values
mis_val_count = train_data.isnull().sum()
percentage = mis_val_count/len(train_data)

# drop the columns with missing values higher than a threshold
columns = percentage[percentage < 0.4].index
train_data = train_data[columns]
test_data = test_data[columns[:-1]] #drop the target column

### Remove columns by cross correlation

In [None]:
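# Correlations are only defined for numeric columns; selecting them explicitly
# avoids an error in recent pandas versions when object columns are present
corrs = train_data.select_dtypes("number").corr()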

# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}


# For each column, record the variables that are above the threshold
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])

cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []


# Iterate through columns and correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            continue
        else:
            # Remove one of the columns
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))

train_corrs_removed = train_data.drop(columns = cols_to_remove)
test_corrs_removed = test_data.drop(columns = cols_to_remove)

### Save the extended Data 

In [None]:
train_corrs_removed.to_csv('extended_train.csv', index = False)
test_corrs_removed.to_csv('extended_test.csv', index = False)

In [None]:
del train_corrs_removed, test_corrs_removed

# LightGBM on Extended Data

We now perform the same modelling on the extended dataset we have created.

NB Writing the data to an external file and reading it back in is not necessary when working with a standalone file. We keep this step for consistency with our [Kaggle Notebook](https://www.kaggle.com/onurcopur/defaultrisk-dreamteam), and because it reflects the workflow we used: it allowed us to work with the dataset after merging and feature selection without re-running those steps each time.

## Import the Extended Data

In [None]:
extended_train = pd.read_csv("extended_train.csv")
extended_test = pd.read_csv("extended_test.csv")

In [None]:
print(extended_test.shape)
print(extended_train.shape)

### One hot encode categorical features

In [None]:
x_train = pd.get_dummies(extended_train.iloc[:,:-1])
y_train = extended_train.iloc[:,-1]
x_test = pd.get_dummies(extended_test)
del extended_train
del extended_test

In [None]:
drop_list = []
for column in x_train.columns:
    if column in x_test.columns:
        continue
    else:
        drop_list.append(column)

In [None]:
x_train = x_train.drop(columns = drop_list)
print(x_train.shape)
print(x_test.shape)

### Replace missing values with column medians

In [None]:
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
# Learn the medians from the training data only, then apply them to both sets
imp_median.fit(x_train)
x_train = imp_median.transform(x_train)
x_test = imp_median.transform(x_test[x_train.columns])

### Scale the data

In [None]:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
# Apply the training-set scaling to the test set rather than refitting
x_test = sc.transform(x_test)
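
One way to keep the imputation and scaling statistics tied to the training rows is to wrap both steps in a scikit-learn pipeline. The following is a minimal sketch on made-up arrays, assuming `make_pipeline`; it is an alternative arrangement of the same two steps rather than the code we ran.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test_toy = np.array([[np.nan, 20.0]])

prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Medians, means and standard deviations are learned from the training rows only
X_train_prepped = prep.fit_transform(X_train_toy)

# The test rows are transformed with those same training statistics
X_test_prepped = prep.transform(X_test_toy)

print(X_test_prepped)
```

Here the missing test value is filled with the training median (3.0), and both test values coincide with the training means, so the transformed test row is all zeros.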

### 5-fold cross-validation of LightGBM

In [None]:
kfold = KFold(n_splits = 5, shuffle = True, random_state = 50)

inputs=x_train[:,1:]
outputs = y_train
x_test = x_test[:,1:]

# Empty array for test predictions
y_pred = np.zeros(x_test.shape[0])
    
for train, test in kfold.split(inputs, outputs):
    
    model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                           class_weight = 'balanced', learning_rate = 0.05, 
                           reg_alpha = 0.1, reg_lambda = 0.1, 
                           subsample = 0.8, n_jobs = -1, random_state = 50)
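    # Train the model with early stopping on the validation fold.
    # Note: `early_stopping_rounds` and `verbose` are accepted by LightGBM < 4.0;
    # from 4.0 onwards the equivalents are the `lgb.early_stopping` and
    # `lgb.log_evaluation` callbacks.
    model.fit(inputs[train, :], outputs[train], eval_metric = 'auc',
              eval_set = [(inputs[test, :], outputs[test]), (inputs[train, :], outputs[train])],
              eval_names = ['valid', 'train'],
              early_stopping_rounds = 100, verbose = 200)
    # Record and report the best iteration for this fold
    best_iteration = model.best_iteration_
    print('Best iteration for this fold:', best_iteration)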
    # Make predictions
    y_pred += model.predict_proba(x_test, num_iteration = best_iteration)[:, 1] / kfold.n_splits
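
The loop above averages the test predictions of the five fold models. The same pattern can also yield a single cross-validated AUC by collecting out-of-fold predictions. The sketch below shows both on synthetic data, with logistic regression standing in for LightGBM; the data, split sizes and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1200, n_features=10, weights=[0.9],
                           random_state=50)
X_dev, y_dev = X[:1000], y[:1000]      # rows with known labels
X_new = X[1000:]                       # stands in for the competition test set

kfold = KFold(n_splits=5, shuffle=True, random_state=50)
oof_pred = np.zeros(len(X_dev))        # out-of-fold predictions
new_pred = np.zeros(len(X_new))        # averaged predictions for unseen rows

for fit_idx, valid_idx in kfold.split(X_dev):
    clf = LogisticRegression(max_iter=1000).fit(X_dev[fit_idx], y_dev[fit_idx])
    # Each development row is scored by the one model that never saw it
    oof_pred[valid_idx] = clf.predict_proba(X_dev[valid_idx])[:, 1]
    # Each fold's model contributes 1/k of the final prediction
    new_pred += clf.predict_proba(X_new)[:, 1] / kfold.n_splits

cv_auc = roc_auc_score(y_dev, oof_pred)
print(f"out-of-fold AUC = {cv_auc:.3f}")
```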

## Create the submission csv

In [None]:
submission = pd.read_csv("../input/home-credit-default-risk/sample_submission.csv")

In [None]:
submission["TARGET"] = y_pred

In [None]:
submission.to_csv('submission.csv', index = False)