This competition is about predicting whether a loan should be approved or not. We need to see which of the variables make sense and which we can ignore in our prediction. We can do the classification directly in TensorFlow, or use only Keras with binary cross-entropy, depending on how complex the code is for direct TensorFlow.

Let us load the input files and figure out what data we need for our prediction.

In [None]:
import os
print(os.listdir('../input'))

In [None]:
import numpy as np
import pandas as pd

application_test=pd.read_csv('../input/application_test.csv')
application_train=pd.read_csv('../input/application_train.csv')
bureau=pd.read_csv('../input/bureau.csv')
bureau_balance=pd.read_csv('../input/bureau_balance.csv')
credit_card_balance=pd.read_csv('../input/credit_card_balance.csv')
installments=pd.read_csv('../input/installments_payments.csv')
pos=pd.read_csv('../input/POS_CASH_balance.csv')
previous_application=pd.read_csv('../input/previous_application.csv')

The main challenge with this prediction is the amount of data spread across different files. We need to prepare a single feature file as input to our prediction model, with inputs coming from different sources. We can then try different classification algorithms (both linear and non-linear) and compare their predictions.

In [None]:
# Only the last expression in a cell is displayed, so print the shape explicitly
print(application_test.shape)
application_test.head()

In [None]:
application_train

In [None]:
bureau.head()

In [None]:
bureau_balance.head()

In [None]:
credit_card_balance.head()

In [None]:
installments.head()

In [None]:
pos.head()

In [None]:
previous_application.head()

Based on the above data files, let us figure out which data to ignore and which features we need from each file. This is a form of manual feature engineering, which is better than feeding in every feature, meaningful or not, and asking the model to predict. Also, since the amounts vary a lot, we need to normalize them or express them as ratios so that the model is not unfair to people applying for small loans.

* Bureau data doesn't appear too useful- too much of a stretch to predict based on monthly bureau balances. **ignore**
* Credit card balances- two derived features, **Amt Balance/ Credit limit, Current drawing/ Credit limit**
* POS sales data, installments: these parameters do not look very relevant. **ignore**
* Previous application- **Amt credit/ Amt application**

Main application file: lots of useless columns (date, week of bureau, and other things which should not count towards the credit of a person). Let us retain features like **Car (Y/N), House (Y/N), Children, Gender**.
Use **amt credit/ amt income, amt annuity/ amt income**.

In [None]:
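# Take explicit copies so that adding columns later does not raise SettingWithCopyWarning
X_train=application_train.iloc[:,0:10].copy()
Y_train=X_train['TARGET']

# The test file has no TARGET column, so one fewer column is needed
X_test=application_test.iloc[:,0:9].copy()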

In [None]:
Y_train.sum()/len(Y_train)

In the training data, **only about 8% of the applicants have TARGET = 1**. Note that in this dataset TARGET = 1 marks a client with payment difficulties rather than an approved loan. We should expect a similar ratio in the test data.
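
The class balance can also be read off with `value_counts(normalize=True)`. The sketch below uses a small made-up label column (hypothetical data, not the competition file) only to illustrate the call; the mean of a 0/1 column is the share of 1s.

```python
import pandas as pd

# Hypothetical toy labels standing in for application_train['TARGET']:
# 25 rows, 2 of them positive.
toy_target = pd.Series([0] * 9 + [1] + [0] * 2 + [1] + [0] * 12, name='TARGET')

# Share of each class in the column
class_share = toy_target.value_counts(normalize=True)

# For a 0/1 column the mean equals the share of 1s
positive_rate = toy_target.mean()

print(class_share)
print(positive_rate)
```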

In [None]:
X_train= X_train.drop(['TARGET'],axis=1)

In [None]:
X_train['Total Credit/ Total Income']=X_train['AMT_CREDIT']/X_train['AMT_INCOME_TOTAL']
X_train['Annuity/Income']=X_train['AMT_ANNUITY']/X_train['AMT_INCOME_TOTAL']

X_test['Total Credit/ Total Income']=X_test['AMT_CREDIT']/X_test['AMT_INCOME_TOTAL']
X_test['Annuity/Income']=X_test['AMT_ANNUITY']/X_test['AMT_INCOME_TOTAL']

In [None]:
X_train= X_train.drop(['AMT_CREDIT','AMT_INCOME_TOTAL','AMT_ANNUITY'],axis=1)
X_test= X_test.drop(['AMT_CREDIT','AMT_INCOME_TOTAL','AMT_ANNUITY'],axis=1)

Let's use the previous application data: calculate the ratio of credit to application amount, grouped by ID, and then merge it with the original training data.
The same merge is needed for the test data.

In [None]:
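# Select multiple columns with a list (double brackets); the bare-tuple form is not supported in recent pandas
previous_application_temp= previous_application.groupby(['SK_ID_CURR'],as_index=False)[['AMT_APPLICATION','AMT_CREDIT']].sum()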
previous_application_temp['Previous Credit/ Previous Application']= previous_application_temp['AMT_CREDIT']/previous_application_temp['AMT_APPLICATION']
previous_application_temp= previous_application_temp.drop(['AMT_APPLICATION','AMT_CREDIT'],axis=1)

In [None]:
X_train=pd.merge(X_train,previous_application_temp,on=['SK_ID_CURR'],how='left')
X_test=pd.merge(X_test,previous_application_temp,on=['SK_ID_CURR'],how='left')
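
A left merge keeps every applicant, and those without any previous application end up with NaN in the new column. The sketch below shows this on two made-up frames (hypothetical values; only the column names mirror the ones used above), using the `indicator` option of `pd.merge` to count unmatched rows.

```python
import pandas as pd

# Hypothetical toy frames; only the column names mirror the notebook.
main = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
prev_ratio = pd.DataFrame({
    'SK_ID_CURR': [1, 3],
    'Previous Credit/ Previous Application': [1.1, 0.9],
})

# indicator=True adds a '_merge' column telling where each row came from
merged = pd.merge(main, prev_ratio, on='SK_ID_CURR', how='left', indicator=True)

# Applicants present only in the main file get NaN for the ratio
n_unmatched = int((merged['_merge'] == 'left_only').sum())

print(merged)
print(n_unmatched)
```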

As with the previous application data, we need to calculate credit utilization from the credit card file and then merge it with the main training and test files.

In [None]:
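# Select multiple columns with a list (double brackets); the bare-tuple form is not supported in recent pandas
credit_card_balance_temp= credit_card_balance.groupby(['SK_ID_CURR'],as_index=False)[['AMT_BALANCE','AMT_CREDIT_LIMIT_ACTUAL','AMT_DRAWINGS_CURRENT']].sum()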
credit_card_balance_temp['Balance/ Credit Limit']= credit_card_balance_temp['AMT_BALANCE']/credit_card_balance_temp['AMT_CREDIT_LIMIT_ACTUAL']
credit_card_balance_temp['Drawing/ Credit Limit']= credit_card_balance_temp['AMT_DRAWINGS_CURRENT']/credit_card_balance_temp['AMT_CREDIT_LIMIT_ACTUAL']
credit_card_balance_temp= credit_card_balance_temp.drop(['AMT_BALANCE','AMT_CREDIT_LIMIT_ACTUAL','AMT_DRAWINGS_CURRENT'],axis=1)

In [None]:
X_train=pd.merge(X_train,credit_card_balance_temp,on=['SK_ID_CURR'],how='left')
X_test=pd.merge(X_test,credit_card_balance_temp,on=['SK_ID_CURR'],how='left')

In [None]:
X_train= X_train.drop(['SK_ID_CURR'],axis=1)
X_test= X_test.drop(['SK_ID_CURR'],axis=1)

Both X_train and X_test are now ready with the features that we planned earlier, after a bit of manual feature engineering. A lot of the values are NaN. One approach is to remove those rows, but this would reduce the training data substantially. Besides, not having a particular value may itself have some correlation with the final outcome.
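
One way to keep the information that a value was missing is to record it in a separate 0/1 column before filling the gap. This is only a sketch on made-up data, and an optional variation rather than what this notebook does; the `_missing` column name is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature with gaps, standing in for a merged ratio column
toy = pd.DataFrame({'Balance/ Credit Limit': [0.5, np.nan, 0.0, np.nan]})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Record missingness as its own feature, then fill the gap with 0
toy['Balance/ Credit Limit_missing'] = toy['Balance/ Credit Limit'].isna().astype(int)
toy['Balance/ Credit Limit'] = toy['Balance/ Credit Limit'].fillna(0)

print(missing_share)
print(toy)
```

Adding such a column changes the number of features, so any model that hard-codes the input size would need adjusting.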

The next thing is to choose which classification model to use. Our **feature set has some qualitative data that needs encoding**. For **NaN values, use imputation** to put 0 in those places (it makes sense to assume 0 credit utilization if the information is not available).

* Try **classic or shallow ML algorithms like Logistic Regression, RandomForest, SVM** to see what % of loan applicants each model flags. TensorFlow can also be used for these algorithms, but it is low level and needs a lot of hyperparameter tuning, so let's go ahead with the scikit-learn library.
* Use Keras for a **neural net** to predict the binary classification.

Baseline: around **8% of all applicants are in the positive class (TARGET = 1)**, so the share of applicants flagged by each model should be in that range.
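
With a positive class of about 8%, a model that predicts 0 for everyone is already about 92% accurate, so accuracy alone says little. The sketch below holds out a stratified validation split and scores predicted probabilities with ROC AUC. It runs on synthetic data from `make_classification`, and the choice of ROC AUC and all names here are assumptions for illustration, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real features (about 8% positives)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=14, weights=[0.92], random_state=0
)

# stratify keeps the class ratio the same in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class
val_proba = clf.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_proba)
print(auc)
```

The same split-and-score pattern could be applied to each classifier below to compare them before writing any submission file.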

In [None]:
X_train= pd.get_dummies(X_train,columns=['NAME_CONTRACT_TYPE','CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY'])
X_test= pd.get_dummies(X_test,columns=['NAME_CONTRACT_TYPE','CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY'])

X_train= X_train.drop(columns=['CODE_GENDER_XNA'])
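
Dropping `CODE_GENDER_XNA` by hand works because that category appears only in the training file. A more general way to keep the two encoded frames in step is to keep only the columns they share. The sketch below shows this on made-up frames (hypothetical values; only the column names mirror the notebook).

```python
import pandas as pd

# Hypothetical toy data: the 'XNA' category appears only in the train frame
toy_train = pd.DataFrame({'CODE_GENDER': ['F', 'M', 'XNA'], 'CNT_CHILDREN': [0, 1, 2]})
toy_test = pd.DataFrame({'CODE_GENDER': ['M', 'F'], 'CNT_CHILDREN': [1, 0]})

train_enc = pd.get_dummies(toy_train, columns=['CODE_GENDER'])
test_enc = pd.get_dummies(toy_test, columns=['CODE_GENDER'])

# Keep only the columns present in both frames, in the same order
common = train_enc.columns.intersection(test_enc.columns)
train_enc = train_enc[common]
test_enc = test_enc[common]

print(list(train_enc.columns))
```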

In [None]:
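# Division by zero in the ratio features can give +inf or -inf; treat both as missing
X_train= X_train.replace([np.inf,-np.inf],np.nan)
X_test = X_test.replace([np.inf,-np.inf],np.nan)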

X_train= X_train.fillna(0)
X_test= X_test.fillna(0)

**Logistic regression**

In [None]:
from sklearn.linear_model import LogisticRegression

classifier_LR=LogisticRegression()
classifier_LR.fit(X_train,Y_train)

Y_pred_LR=classifier_LR.predict(X_test)

In [None]:
Y_pred_LR

**Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

classifier_RF=RandomForestClassifier()
classifier_RF.fit(X_train,Y_train)

Y_pred_RF=classifier_RF.predict(X_test)

**XGBoost Classifier**

In [None]:
import xgboost as xgb

classifier_xgb=xgb.XGBClassifier()
classifier_xgb.fit(X_train,Y_train)

Y_pred_xgb=classifier_xgb.predict(X_test)

**Deep Learning using Keras**

In [None]:
from keras import models
from keras import layers
from keras.layers import Dense, Dropout

model=models.Sequential()
# Take the input size from the data (14 features here) instead of hard-coding it
model.add(layers.Dense(14, activation='relu',input_dim=X_train.shape[1]))
model.add(Dropout(0.6))

model.add(layers.Dense(1,activation='sigmoid'))

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

model.fit(X_train,Y_train,batch_size=256,epochs=1)

Y_pred_DL=model.predict(X_test)

Let's see below, for the classification models above, what % of loan applicants are flagged as positive (TARGET = 1).

In [None]:
Y_pred_DL

In [None]:
count=0
for x in Y_pred_DL:
    if x >0.1338:
        count+=1
print(count)
print(count/len(X_test))

Let's flag applicants with a predicted probability greater than 13.38%. This comes to around 8% of the test applicants, which is similar to the share of TARGET = 1 in the train data.
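
Rather than tuning the cutoff by hand, the threshold that flags a chosen share of applicants can be read off as a quantile of the predicted scores. The sketch below uses random numbers as stand-in scores (hypothetical data, not the model output), with an assumed target share of 8%.

```python
import numpy as np

# Hypothetical scores standing in for the predicted probabilities
rng = np.random.default_rng(0)
toy_probs = rng.random(1000)

# Flagging the top 8% means cutting at the 92nd percentile of the scores
target_share = 0.08
threshold = np.quantile(toy_probs, 1 - target_share)

# Vectorized count instead of a Python loop
flagged_share = (toy_probs > threshold).mean()

print(threshold)
print(flagged_share)
```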

In [None]:
submission_DL=pd.DataFrame(application_test['SK_ID_CURR'],columns=['SK_ID_CURR'])
submission_LR=pd.DataFrame(application_test['SK_ID_CURR'],columns=['SK_ID_CURR'])
submission_RF=pd.DataFrame(application_test['SK_ID_CURR'],columns=['SK_ID_CURR'])

In [None]:
# Keras predict returns shape (n, 1), so take the first column;
# scikit-learn predict already returns a 1-D array, so [:,0] would raise an IndexError
submission_DL['TARGET']=Y_pred_DL[:,0]
submission_LR['TARGET']=Y_pred_LR
submission_RF['TARGET']=Y_pred_RF

In [None]:
submission_DL.to_csv('Submission File_DL.csv',index=False)
submission_LR.to_csv('Submission File_LR.csv',index=False)
submission_RF.to_csv('Submission File_RF.csv',index=False)

Write the output files for each classifier for submission and check the accuracy rate.