# Home Credit Default Risk
## Walk Through Complete Workflow of One Submission

#### The idea of this kernel is to show the basic steps of a competition workflow, putting it all together from loading the data all the way to producing a submission
#### The main model I will use here is a Keras neural network

I hope this kernel will be useful, especially for people new to data challenges.

The data is taken from Kaggle's Home Credit Default Risk challenge.
It consists of eight dataframes, and I will load and present them one by one before creating the model.

In [None]:
import numpy as np
import pandas as pd
import os
from keras.preprocessing.text import Tokenizer
import keras
from keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from keras.models import Model
from keras import layers
from keras import Input 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import random
from keras import regularizers
from pandas.plotting import andrews_curves
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, LSTM, GRU, Dropout
from keras import regularizers
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from pandas.plotting import parallel_coordinates
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import confusion_matrix as CM
import warnings
warnings.filterwarnings("ignore")

In [None]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler() # Initialize the scaler here so it can be called later
# It standardizes column values to zero mean and unit variance
le = preprocessing.LabelEncoder() # The label encoder converts string values to integer codes

## Helper functions

In [None]:
def plot_pie(table, column):
    '''
    Function that plots single pie chart
    '''
    labels = []
    for (key, value) in table[column].value_counts().items():
        labels.append(key)
    #plt.figure(figsize = (6,5))
    plt.pie(table[column].value_counts())
    plt.legend(labels, loc = 'best', bbox_to_anchor=(0.1,0.9))
    plt.title(str(column))

In [None]:
def plotting_pies(table, columns):
    '''
    Plots multiple pie charts using the function above
    '''
    n = len(columns)
    plt.figure(figsize = (7*2,7*int(n/2+1)))
    for i in range(len(columns)):
        plt.subplot(int(n/2+1), 2, i+1)
        plot_pie(table, columns[i])
    plt.show()
    plt.close('all')

In [None]:
def plot_nans(table):
    '''
    Plots the number of NaN values in each column of a dataframe
    '''
    n = len(table.columns)
    zeros = []
    for i in table.columns:
        zeros.append(table[i].isnull().sum())
    zeros = np.array(zeros)
    indi = np.argsort(zeros)[::-1]
    plt.figure(figsize = (n/3,5))
    plt.title("Nans Over Columns")
    plt.bar(range(n), zeros[indi],
           color='teal', align="center")
    plt.xticks(np.arange(n), table.columns[indi], rotation='vertical')
    plt.xlim([-1, n])
    plt.show()
    plt.close("all")
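
The same counts can also be read as a table when a bar chart is too crowded. This is a minimal sketch on a toy dataframe; `nan_summary` and `toy` are hypothetical names introduced here for illustration and are not used elsewhere in the kernel.

```python
import numpy as np
import pandas as pd

def nan_summary(table):
    """Return NaN count and NaN share per column, sorted from most to fewest NaNs."""
    counts = table.isnull().sum()
    summary = pd.DataFrame({'n_nan': counts, 'share': counts / len(table)})
    return summary.sort_values('n_nan', ascending=False)

# Toy data (assumption): three columns with 2, 0 and 1 missing values
toy = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0],
                    'b': [1.0, 2.0, 3.0, 4.0],
                    'c': [np.nan, 2.0, 3.0, 4.0]})
summary = nan_summary(toy)
print(summary)
```

The `share` column makes it easier to decide whether a column is worth filling at all.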

In [None]:
def Aggregation(table1, table2, ID, summing, averaging, counting, maximum, minimum, suff):
    ''' 
    table1 and table2 are the two dataframes we want to merge; ID is the column on which the aggregation is performed;
    summing, averaging, counting, maximum and minimum are lists of columns on which to perform the corresponding operation;
    suff is a string appended to the name of each newly created column
    '''
    dictionary = {}
    for col in summing:
        dictionary[col] = 'sum'
    for col in averaging:
        dictionary[col] = 'mean'
    for col in counting:
        dictionary[col] = 'count'
    for col in maximum:
        dictionary[col] = 'max'
    for col in minimum:
        dictionary[col] = 'min'
    indexi = table1[ID].values
    Aggr = np.zeros((len(indexi),len(dictionary)))
    for i in range(len(indexi)):
        Aggr[i] = table2[table2[ID] == indexi[i]].agg(dictionary).values
    Aggr = pd.DataFrame(Aggr, columns = [i+suff for i in dictionary.keys()])
    
    return pd.concat([table1, Aggr], axis=1, join='inner') 
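
The function above filters `table2` once per ID, which gets slow on the full data. A possibly faster alternative is a single `groupby` followed by a left merge. This is only a sketch under assumptions: `aggregate_by_id` and the toy frames are hypothetical names, and the behaviour differs slightly from `Aggregation`, because IDs with no rows in `table2` get NaN in every new column (the loop above gives 0 for sums and counts).

```python
import pandas as pd

def aggregate_by_id(table1, table2, id_col, agg_map, suffix):
    """Aggregate table2 per id_col using agg_map, then left-join the result onto table1."""
    grouped = table2.groupby(id_col).agg(agg_map)
    grouped.columns = [col + suffix for col in grouped.columns]
    return table1.merge(grouped.reset_index(), on=id_col, how='left')

# Toy data (assumption): ID 3 has no rows in the right-hand table
left = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
right = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                      'AMT': [10.0, 30.0, 5.0],
                      'marker': [1.0, 1.0, 1.0]})
merged = aggregate_by_id(left, right, 'SK_ID_CURR',
                         {'AMT': 'mean', 'marker': 'sum'}, '_demo')
print(merged)
```

Here `agg_map` plays the role of the `dictionary` built inside `Aggregation`; a later `fillna(0)` would bring the two versions closer together.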

For each table, I will show it right after loading, present it with a couple of plots, preprocess it, and show it again after processing.



## Application train and test

These two are the 'basic' dataframes in this problem. We could create a simple model using only these two tables but, of course, with pretty poor performance.
'application_train' contains the ID of the person applying for a loan; the TARGET column, with value 1 if the client had payment difficulties (the loan defaulted) and 0 otherwise; and a number of other attributes.
'application_test' has all the columns of 'application_train' except TARGET, which needs to be predicted.

After processing these two dataframes, I will aggregate them with the other dataframes based on SK_ID_CURR.

In [None]:
application_train = pd.read_csv('../input/application_train.csv') 

In [None]:
application_test = pd.read_csv('../input/application_test.csv') 

In [None]:
application_train.head() 

#### Plots

There are a lot of NaN values in application_train that need to be filled.

In [None]:
plot_nans(application_train)

In [None]:
strColumns = ['NAME_CONTRACT_TYPE', 'CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS','NAME_HOUSING_TYPE','OCCUPATION_TYPE','WEEKDAY_APPR_PROCESS_START','ORGANIZATION_TYPE','FONDKAPREMONT_MODE','HOUSETYPE_MODE','WALLSMATERIAL_MODE','EMERGENCYSTATE_MODE','NAME_TYPE_SUITE']
numColumns = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE','DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'HOUR_APPR_PROCESS_START', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'DAYS_LAST_PHONE_CHANGE' ]
Categorical = ['FLAG_MOBIL','FLAG_EMP_PHONE','FLAG_WORK_PHONE','FLAG_CONT_MOBILE','FLAG_PHONE','FLAG_EMAIL','REG_REGION_NOT_LIVE_REGION','REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION','LIVE_CITY_NOT_WORK_CITY','FLAG_DOCUMENT_2','FLAG_DOCUMENT_3','FLAG_DOCUMENT_4','FLAG_DOCUMENT_5','FLAG_DOCUMENT_6','FLAG_DOCUMENT_7','FLAG_DOCUMENT_8','FLAG_DOCUMENT_9','FLAG_DOCUMENT_10','FLAG_DOCUMENT_11','FLAG_DOCUMENT_12','FLAG_DOCUMENT_13','FLAG_DOCUMENT_14','FLAG_DOCUMENT_15','FLAG_DOCUMENT_16','FLAG_DOCUMENT_17','FLAG_DOCUMENT_18','FLAG_DOCUMENT_19','FLAG_DOCUMENT_20','FLAG_DOCUMENT_21']

I split the data into three classes: 

1.) strColumns - columns consisting of string values; these usually have more than two possible categories, and I will present them using pie charts. After that, they need to be processed with the label encoder (initialized above) because we need numerical values for further work

2.) numColumns - columns with numerical values, usually floats, that need to be standardized to zero mean and unit variance; the reason for this is that most ML algorithms perform better when values are small and all attributes use the same scale

3.) Categorical - usually 'flag' columns; values are 0 and 1, and there is no real need to process these except for dealing with NaN values
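
The three lists above were written by hand. As a rough cross-check, a similar split can be guessed from the dtypes. This is a heuristic sketch on a toy frame, not the procedure used in this kernel: `toy`, `str_cols`, `flag_cols` and `num_cols` are hypothetical names, and a column like TARGET would also be picked up as a flag.

```python
import pandas as pd

# Toy data (assumption): one string column, one float column, one 0/1 flag
toy = pd.DataFrame({'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', 'Cash loans'],
                    'AMT_CREDIT': [406597.5, 1293502.5, 135000.0],
                    'FLAG_EMAIL': [0, 1, 0]})

# Everything that is not numeric is treated as a string column
str_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Numeric columns whose observed values are only 0 and 1 are treated as flags
numeric = toy.select_dtypes(include='number').columns
flag_cols = [c for c in numeric if set(toy[c].dropna().unique()) <= {0, 1}]
num_cols = [c for c in numeric if c not in flag_cols]

print(str_cols, num_cols, flag_cols)
```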

In [None]:
plotting_pies(application_train, strColumns)

#### Preprocessing

In order to scale the values, I first need to fill the NaN values, because most sklearn estimators (and the neural network later) cannot handle NaN inputs.
There are a number of ways to do that; here I will fill NaNs with the mean value of the corresponding column.

When filling NaNs in 'application_test' it is important to use the mean values from 'application_train', so that no information from the test set leaks into the preprocessing.

After that is finished, I call the scaler.

In [None]:
for col in numColumns:
    application_train[col] = application_train[col].fillna(application_train[col].mean()) 
    application_test[col] = application_test[col].fillna(application_train[col].mean()) 
    scaler.fit(application_train[col].values.reshape(-1, 1))
    application_train[col] = scaler.transform(application_train[col].values.reshape(-1, 1))
    application_test[col] = scaler.transform(application_test[col].values.reshape(-1, 1))    
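
To make the "fit on train, apply to test" idea concrete, here is a small worked sketch on toy numbers. The frames `train` and `test`, the list `cols` and the name `demo_scaler` are assumptions made for this illustration only; it handles all listed columns in one call instead of looping.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): one numeric column with a missing value in each set
train = pd.DataFrame({'AMT_CREDIT': [100.0, np.nan, 300.0]})
test = pd.DataFrame({'AMT_CREDIT': [np.nan, 400.0]})
cols = ['AMT_CREDIT']

train_means = train[cols].mean()                  # statistics come from the training set only
train[cols] = train[cols].fillna(train_means)
test[cols] = test[cols].fillna(train_means)       # the test set is filled with the TRAIN mean

demo_scaler = StandardScaler().fit(train[cols])   # fit on the training set only
train[cols] = demo_scaler.transform(train[cols])
test[cols] = demo_scaler.transform(test[cols])

print(train)
print(test)
```

The filled test value lands exactly at 0 after scaling, because it equals the training mean; the other test value is expressed in units of the training standard deviation.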

In the string columns, I replace NaN values with the string 'nan', and then call the label encoder to convert the string values into numbers.

As above, we fit the encoder on 'application_train' and transform the values in 'application_test' with that fitted encoder.

In [None]:
application_train[np.array(strColumns)] = application_train[np.array(strColumns)].fillna('nan') 
application_test[np.array(strColumns)] = application_test[np.array(strColumns)].fillna('nan')

for col in strColumns:
    le.fit(application_train[col])
    application_train[col] = le.transform(application_train[col])
    application_test[col] = le.transform(application_test[col])
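
One caveat: `LabelEncoder.transform` raises an error if the test set contains a label that never appeared in training. A possible workaround, sketched here on toy data, is to encode with `pd.Categorical` using the training categories, so unseen labels become -1. The series `train_col` and `test_col` are hypothetical, and this is an alternative, not what the kernel does.

```python
import pandas as pd

# Toy data (assumption): 'Managers' never appears in the training column
train_col = pd.Series(['Laborers', 'Drivers', 'nan', 'Laborers'])
test_col = pd.Series(['Drivers', 'Managers'])

categories = sorted(train_col.unique())   # sorted order, like LabelEncoder.classes_
train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(list(train_codes), list(test_codes))
```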

Finally, this is what 'application_train' looks like after processing

In [None]:
application_train.head()

Unfortunately, I am currently working on a pretty old laptop that can't handle that much data. Therefore, I will take only a fraction of the data (the first 1000 rows of the training set and the first 100 rows of the test set) and work with that.

If you are facing a similar problem you can do the same (but then you won't be able to build a realistic model).

Or, if you have a friend studying abroad, you can send him your notebook, ask him to execute it on the Soviet supercomputer his university has, and have him send back the processed data in CSV format.

In [None]:
application_train = application_train[:1000]
application_test = application_test[:100]

## Bureau

Now I load the next table.
It contains the column SK_ID_CURR, on which I will aggregate it with application_train/test;
but it also has a column SK_ID_BUREAU, which I will use to merge it with the next table, bureau_balance, before aggregating it with application_train/test.

In [None]:
bureau = pd.read_csv('../input/bureau.csv')
bureau.head()

In [None]:
# pd.get_dummies(bureau)
# this would also have worked, but it doesn't matter

#### Plots

In [None]:
plot_nans(bureau)

In [None]:
burNumeric = ['DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE','DAYS_ENDDATE_FACT','AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG','AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT','AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE', 'DAYS_CREDIT_UPDATE', 'AMT_ANNUITY'] 
burCateg = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY','CREDIT_TYPE']

In [None]:
plotting_pies(bureau, burCateg)

By looking at the pie charts, I realized that it could be useful to mark users that had bad debt or a sold credit, so I made columns for that

In [None]:
bureau['marker'] = np.ones(len(bureau)) # I added this column so that I can later count the number of occurrences for each user

Bad_Debt = np.zeros(len(bureau)) 
Bad_Debt[np.where(bureau.CREDIT_ACTIVE == 'Bad debt')[0]] = 1
Sold = np.zeros(len(bureau))
Sold[np.where(bureau.CREDIT_ACTIVE == 'Sold')[0]] = 1
bureau['Bad_debt'] = Bad_Debt
bureau['Sold'] = Sold
del Bad_Debt
del Sold
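
The same flags can be written without temporary arrays by comparing the column directly. A minimal sketch on a toy frame (`toy_bureau` is a hypothetical name); it has to run before the label encoder replaces the strings in CREDIT_ACTIVE with numbers.

```python
import pandas as pd

# Toy data (assumption): one row per possible CREDIT_ACTIVE value
toy_bureau = pd.DataFrame({'CREDIT_ACTIVE': ['Active', 'Bad debt', 'Closed', 'Sold']})

# A boolean comparison cast to float gives the same 0/1 columns
toy_bureau['Bad_debt'] = (toy_bureau['CREDIT_ACTIVE'] == 'Bad debt').astype(float)
toy_bureau['Sold'] = (toy_bureau['CREDIT_ACTIVE'] == 'Sold').astype(float)

print(toy_bureau)
```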

In [None]:
# Again, fill NaNs in numerical columns with the mean value and in categorical columns with the string 'nan'
for col in burNumeric: 
    bureau[col] = scaler.fit_transform(bureau[col].fillna(bureau[col].mean()).values.reshape(-1, 1))
for col in burCateg:
    bureau[col] = le.fit_transform(bureau[col].fillna('nan'))

In [None]:
# dataframe after processing
bureau.head()

In [None]:
# As I mentioned above, because of computer capacity I will take only a fraction of the data
bureau = bureau[:1000]

## Bureau balance

This one I need to merge with bureau first

In [None]:
bureau_balance = pd.read_csv('../input/bureau_balance.csv')
bureau_balance.head()

#### Plot

No NaN values here

In [None]:
plotting_pies(bureau_balance, ['STATUS'])

In [None]:
# Divide MONTHS_BALANCE by the magnitude of its minimum; the values are non-positive, so this scales the column to the [-1, 0] interval
bureau_balance.MONTHS_BALANCE /= -1*bureau_balance.MONTHS_BALANCE.min() 

In [None]:
# get_dummies makes 'flag' columns for each STATUS value, which is good practice when working with categorical data
bureau_balance = pd.get_dummies(bureau_balance)
bureau_balance.head()
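
To see what these two steps do, here is a small worked example on a toy table shaped like bureau_balance. The frame `toy_bb` and its values are assumptions for illustration: the most negative month maps to -1, and each STATUS value gets its own flag column.

```python
import pandas as pd

# Toy data (assumption): three monthly records with three different statuses
toy_bb = pd.DataFrame({'SK_ID_BUREAU': [1, 1, 2],
                       'MONTHS_BALANCE': [0, -4, -2],
                       'STATUS': ['C', '0', 'X']})

# Divide by the magnitude of the minimum (here 4)
toy_bb['MONTHS_BALANCE'] = toy_bb['MONTHS_BALANCE'] / -toy_bb['MONTHS_BALANCE'].min()

# One flag column per STATUS value; numeric columns are left unchanged
toy_bb = pd.get_dummies(toy_bb)
print(toy_bb)
```

Averaging such flag columns per ID, as done in the aggregation step, gives the share of months spent in each status.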

In [None]:
bureau_balance = bureau_balance[:1000]

#### Aggregating Bureau and Bureau Balance

To aggregate bureau and bureau_balance, I decided to take, for each ID in bureau_balance, the minimum value of MONTHS_BALANCE, because it indicates when the user first applied.

At the beginning of the kernel, I introduced the function Aggregation. As input, I need to provide lists of columns telling it where to take the minimum, where to average, and so on (sum, count, max). Now it's time to use it.

In [None]:
minimum = ['MONTHS_BALANCE']
averaging = [bureau_balance.columns[i] for i in range(2, len(bureau_balance.columns))]

In [None]:
bureau = Aggregation(bureau, bureau_balance, 'SK_ID_BUREAU', summing= [], averaging = averaging, counting =[], maximum=[], minimum =minimum, suff = '_b_b')

In [None]:
del bureau_balance # I don't need this table any more, so I'll delete it to save some working memory

In [None]:
bureau.head()

#### Aggregating Application train/test and Bureau 

I decided to first aggregate only those rows in 'bureau' where the status is ACTIVE (after label encoding, 'Active' is encoded as 0), and then those where it is not.
This way I will get twice as many columns, but I hope it will also bring more information.
Again, I invoke the function Aggregation with the specified columns.

In [None]:
minimum = ['AMT_CREDIT_MAX_OVERDUE']
maximum = ['DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE', 'CNT_CREDIT_PROLONG', 'DAYS_CREDIT_UPDATE']
averaging = ['AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_ANNUITY']
summing = ['marker']

In [None]:
TRAIN = Aggregation(application_train, bureau[bureau.CREDIT_ACTIVE == 0], 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=maximum, minimum =minimum, suff = '_b_Act')

In [None]:
del application_train # delete to release space

In [None]:
TEST = Aggregation(application_test, bureau[bureau.CREDIT_ACTIVE == 0], 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=maximum, minimum =minimum, suff = '_b_Act')

In [None]:
del application_test # delete to release space

And now for the other statuses as well, but with slightly different columns.

To the list 'maximum' I add 'DAYS_ENDDATE_FACT', a column with values only for closed credits; and, of course, the list 'summing' is different.

In [None]:
minimum = ['AMT_CREDIT_MAX_OVERDUE']
maximum = ['DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE', 'CNT_CREDIT_PROLONG', 'DAYS_CREDIT_UPDATE','DAYS_ENDDATE_FACT']
averaging = ['AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE', 'AMT_ANNUITY']
summing = ['marker', 'Bad_debt', 'Sold']

In [None]:
TRAIN = Aggregation(TRAIN, bureau[bureau.CREDIT_ACTIVE != 0], 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=maximum, minimum =minimum, suff = '_b_NAct')

In [None]:
TEST = Aggregation(TEST, bureau[bureau.CREDIT_ACTIVE != 0], 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=maximum, minimum =minimum, suff = '_b_NAct')

In [None]:
TRAIN.head() 

In [None]:
del bureau

Now, it is pretty similar for every other table. I read and process them one by one and aggregate each with TRAIN and TEST, so I won't go into much detail.


## POS CASH

In [None]:
POS_CASH_balance = pd.read_csv('../input/POS_CASH_balance.csv')

#### Plots

In [None]:
plot_nans(POS_CASH_balance)

In [None]:
POS_CASH_balance.head()

In [None]:
plotting_pies(POS_CASH_balance, ['NAME_CONTRACT_STATUS'])

#### Preprocessing

In [None]:
for col in ['CNT_INSTALMENT_FUTURE', 'CNT_INSTALMENT', 'MONTHS_BALANCE']:
    POS_CASH_balance[col] = scaler.fit_transform(POS_CASH_balance[col].fillna(POS_CASH_balance[col].mean()).values.reshape(-1, 1))
POS_CASH_balance.SK_DPD /= POS_CASH_balance.SK_DPD.max()
POS_CASH_balance.SK_DPD_DEF /= POS_CASH_balance.SK_DPD_DEF.max()
POS_CASH_balance =pd.get_dummies(POS_CASH_balance)

In [None]:
POS_CASH_balance.head()

#### Aggregating TRAIN and POS CASH 

In [None]:
maximum = ['MONTHS_BALANCE']
averaging = ['CNT_INSTALMENT', 'CNT_INSTALMENT_FUTURE', 'SK_DPD', 'SK_DPD_DEF']
summing = POS_CASH_balance.columns[7:].tolist()

In [None]:
TRAIN = Aggregation(TRAIN, POS_CASH_balance, 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=maximum, minimum =[], suff = '_PCb')

In [None]:
TEST = Aggregation(TEST, POS_CASH_balance, 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=maximum, minimum =[], suff = '_PCb')

In [None]:
del POS_CASH_balance

## Credit card balance

In [None]:
credit_card_balance = pd.read_csv('../input/credit_card_balance.csv')
credit_card_balance.head()

#### Plots

In [None]:
plot_nans(credit_card_balance)

In [None]:
plotting_pies(credit_card_balance, ['NAME_CONTRACT_STATUS'])

#### Preprocessing

In [None]:
credit_card_balance = pd.get_dummies(credit_card_balance)
for col in credit_card_balance.columns[2:15]:
    credit_card_balance[col] = scaler.fit_transform(credit_card_balance[col].fillna(credit_card_balance[col].mean()).values.reshape(-1, 1))
for col in credit_card_balance.columns[15:22]:
    credit_card_balance[col] /= credit_card_balance[col].max()

In [None]:
credit_card_balance.head()

#### Aggregation

In [None]:
averaging = credit_card_balance.columns[3:22].tolist()
maximum = ['MONTHS_BALANCE']
summing = credit_card_balance.columns[22:].tolist()
counting = ['NAME_CONTRACT_STATUS_Active']

In [None]:
TRAIN = Aggregation(TRAIN, credit_card_balance, 'SK_ID_CURR', summing= summing, averaging = averaging, counting =counting, maximum=maximum, minimum =[], suff = '_ccb')

In [None]:
TEST = Aggregation(TEST, credit_card_balance, 'SK_ID_CURR', summing= summing, averaging = averaging, counting =counting, maximum=maximum, minimum =[], suff = '_ccb')

In [None]:
del credit_card_balance

## Previous application


In [None]:
previous_application = pd.read_csv('../input/previous_application.csv')

In [None]:
previous_application.head()

In [None]:
plot_nans(previous_application)

In [None]:
strColumns = ['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY', 'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'PRODUCT_COMBINATION', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'NAME_GOODS_CATEGORY']
numColumns = ['AMT_ANNUITY','AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE','RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE','DAYS_TERMINATION']

In [None]:
plotting_pies(previous_application, strColumns)

#### Preprocessing

In [None]:
previous_application=pd.get_dummies(previous_application)

In [None]:
for col in numColumns:
    previous_application[col] = scaler.fit_transform(previous_application[col].fillna(previous_application[col].mean()).values.reshape(-1, 1))
for col in  ['HOUR_APPR_PROCESS_START', 'CNT_PAYMENT']:
    previous_application[col] /= previous_application[col].max()

In [None]:
previous_application.head()

#### Aggregation

In [None]:
averaging = [col for col in previous_application.columns[2:] if col in numColumns or col in ['HOUR_APPR_PROCESS_START', 'CNT_PAYMENT']]
summing = [col for col in previous_application.columns[2:] if col not in averaging]

In [None]:
TRAIN = Aggregation(TRAIN, previous_application, 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=[], minimum =[], suff = '_pa')

In [None]:
TEST = Aggregation(TEST, previous_application, 'SK_ID_CURR', summing= summing, averaging = averaging, counting =[], maximum=[], minimum =[], suff = '_pa')

In [None]:
del previous_application

## Installments payments

In [None]:
installments_payments = pd.read_csv('../input/installments_payments.csv')

In [None]:
installments_payments.head()

#### Plot

In [None]:
plot_nans(installments_payments)

In [None]:
for col in ['NUM_INSTALMENT_VERSION', 'NUM_INSTALMENT_NUMBER']:
    installments_payments[col] /= installments_payments[col].max()
for col in installments_payments.columns[4:]:
    installments_payments[col] = scaler.fit_transform(installments_payments[col].fillna(installments_payments[col].mean()).values.reshape(-1, 1))  

In [None]:
installments_payments.head()

#### Aggregation

In [None]:
averaging = [col for col in installments_payments.columns[2:]]

In [None]:
TRAIN = Aggregation(TRAIN, installments_payments, 'SK_ID_CURR', summing= [], averaging = averaging, counting =[], maximum=[], minimum =[], suff = '_ip')

In [None]:
TEST = Aggregation(TEST, installments_payments, 'SK_ID_CURR', summing= [], averaging = averaging, counting =[], maximum=[], minimum =[], suff = '_ip')

In [None]:
del installments_payments

##### Now we have completed data preprocessing and aggregation, and here is what our data looks like

In [None]:
TRAIN.head()

## Model

#### Now that the data is prepared, it is time to build the model and make predictions




Because of cropping the data earlier, the dataframe I am left with is not really representative, and probably contains a lot of missing values.


In [None]:
TRAIN = TRAIN.fillna(0) # in case that some nan values still remained

In [None]:
TEST = TEST.fillna(0)

In [None]:
TRAIN.head()

In [None]:
TRAIN.shape

In [None]:
TEST.shape

In [None]:
TEST.head()

As I said earlier, I will build a neural network model. But now there is another problem: our data is not balanced. The ratio between loans repaid without difficulties (TARGET = 0) and loans with payment difficulties (TARGET = 1) is about 92% : 8%. That means I can make a model that does absolutely nothing (it always predicts 0) and still achieves 92% accuracy.

So it isn't a good idea to judge a model by its accuracy score alone.
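
The point can be checked with a few lines. This is a toy sketch with made-up labels in the 92 : 8 proportion (the arrays `y_true` and `constant_scores` are assumptions, not data from the competition): a model that gives every row the same score gets high accuracy, while its ROC AUC shows it cannot rank the rows at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels (assumption): 92 negatives and 8 positives
y_true = np.array([0] * 92 + [1] * 8)

# A "model" that gives every row the same score, so it always predicts class 0
constant_scores = np.zeros(len(y_true))

accuracy = accuracy_score(y_true, (constant_scores > 0.5).astype(int))
auc = roc_auc_score(y_true, constant_scores)
print(accuracy, auc)
```

Accuracy comes out at 0.92 while the AUC is 0.5, the value of a random ranking.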

But whatever the metric, the model won't fit well if the data is as imbalanced as it is here. There are two simple solutions: oversampling and undersampling. In undersampling, you simply drop points from the majority class until you get a fairly even distribution of classes. Oversampling works the other way: you artificially create more instances of the minority class, usually by taking linear combinations of existing points, until you have an equal number in both classes.

I would recommend oversampling, in order not to lose valuable information, but here I will stick to undersampling, because it is less time- and memory-consuming.
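
For readers who want to try oversampling, here is a deliberately simplified sketch of the "linear combination" idea. It interpolates between randomly chosen pairs of minority points; methods such as SMOTE interpolate between nearest neighbours instead, so treat this as an illustration only. The function `interpolate_minority` and the toy array are hypothetical.

```python
import numpy as np

def interpolate_minority(X_min, n_new, seed=0):
    """Create n_new synthetic rows, each lying on a segment between two random minority rows."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_min), size=n_new)   # first endpoint of each segment
    j = rng.integers(0, len(X_min), size=n_new)   # second endpoint of each segment
    lam = rng.random((n_new, 1))                  # position along the segment, in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

# Toy minority class (assumption): three points in two dimensions
X_minority = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
synthetic = interpolate_minority(X_minority, n_new=5)
print(synthetic)
```

Every synthetic row stays inside the region spanned by the original minority points, and all of them would get the label 1.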

In [None]:
marker = [] # list that will keep track of the rows I want to keep
for i in range(len(TRAIN)):
    if TRAIN.TARGET[i] == 0: # majority class: client had no payment difficulties
        if np.random.rand()<=0.1: # keep it with probability 10% (so the ratio between positive and negative targets is close to 1)
            marker.append(i)
    else:
        marker.append(i) # keep all rows of the minority class (TARGET == 1)
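
The loop above can also be written as one vectorized mask, which is faster on the full data. A sketch on a toy target array follows; `target`, `keep` and `marker_demo` are hypothetical names, and the seed is fixed only to make the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target (assumption): 920 rows of class 0 and 80 rows of class 1
target = np.array([0] * 920 + [1] * 80)

# Keep every row of class 1, and each row of class 0 with probability 10%
keep = (target == 1) | (rng.random(len(target)) <= 0.1)
marker_demo = np.flatnonzero(keep)   # positions of the rows to keep

print(len(marker_demo), target[marker_demo].mean())
```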

The size of the original dataset was around 307k rows, and after undersampling I am left with nearly 50k.

Let's split the dataframe into data and labels, and also into training and validation sets.

In [None]:
X = TRAIN.iloc[np.array(marker),3:].values
Y = TRAIN.iloc[np.array(marker)].TARGET.values
Y = keras.utils.to_categorical(Y, 2) # one-hot encode the labels (0 -> [1, 0], 1 -> [0, 1]) to match the two-unit softmax output

In [None]:
limit = int(0.8*len(X)) # for training take 80% of data, and 20 for validation
trainX, valX, trainY, valY = X[:limit], X[limit:], Y[:limit], Y[limit:]
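
The split above takes the first 80% of rows in their existing order. An alternative worth considering is a shuffled, stratified split, which keeps the class ratio the same in both parts. A sketch on toy arrays (the names `X_demo`, `y_demo` and the split outputs are hypothetical); it needs the 0/1 labels, so it would have to run before the one-hot encoding.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 20 rows, half of them in each class
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# 80/20 split, shuffled, with the class ratio preserved in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

print(len(X_tr), len(X_val), y_val.mean())
```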

### Building model

The model is very simple: a stack of dense layers with dropout layers in between to reduce overfitting. I tried some other architectures as well, but the performance didn't change significantly, so I picked this one as a simple and fast model.

Considering the nature of the dataset (tabular, with no sequential or spatial structure), I wouldn't benefit from recurrent or convolutional layers, so there is no need to complicate things.

In [None]:
model = Sequential()
model.add(Dense(1024, input_shape=(trainX.shape[1],), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu',kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.4))
model.add(Dense(512, activation='relu',kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.3))
model.add(Dense(256, activation='relu',kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(2, activation='softmax'))
model.summary()

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(trainX, trainY,
                    epochs=3, # this should be many more epochs, so if your computer can handle it, increase it
                    batch_size=8,validation_data=(valX, valY))

### Model Evaluation

To see how the model performed, it helps to plot the confusion matrix

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predicted classes
sns.heatmap(CM(np.argmax(valY, axis=1), np.argmax(model.predict(valX), axis=1)), annot = True)
plt.show()
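
The four cells of the matrix can also be turned into numbers that are easier to compare between runs. A small worked sketch with made-up labels follows (`y_true` and `y_pred` here are toy arrays, not the model's output): precision is the share of predicted positives that are truly positive, and recall is the share of true positives that were found.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (assumption)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```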

If you are satisfied with the performance, it is time to train the same model on the whole training set and make a submission. If not, fine-tune the model until you get better results. 

#### Training model on whole dataset

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(X, Y,
                    epochs=3, 
                    batch_size=8)

#### Making submission

There is an explicit explanation of how the solution should look, so let's fit into that form

In [None]:
Solution = pd.DataFrame(TEST.SK_ID_CURR)
# the competition is scored on ROC AUC, so submit the predicted probability of TARGET == 1 rather than a hard 0/1 label
Solution['TARGET'] = model.predict(TEST.iloc[:,2:].values)[:, 1]
# the next line saves solution in CSV format, and it is ready for submitting
# Solution.set_index('SK_ID_CURR').to_csv('solution.csv')

##### That's all, folks.

###### I hope this helped you. 
###### The project is very simple, but it goes through all the necessary steps. The results probably aren't satisfying, but they can improve with more detailed feature engineering and hyperparameter tuning. Anyway, this should be a good skeleton for making good predictions.

