## The Home Credit Default Risk

![](http://www.homecredit.net/~/media/Images/H/Home-Credit-Group/image-gallery/full/image-gallery-01-11-2016-b.png)

- <a href='#1'>1. Introduction</a>  
- <a href='#2'>2. Retrieving the Data</a>
- <a href='#3'>3. Glimpse of Data</a>
- <a href='#4'>4. Check for missing data</a>
- <a href='#5'>5. Data Exploration</a>
    - <a href='#5-1'>5.1 Distribution of AMT_CREDIT</a>
    - <a href='#5-2'>5.2 Distribution of AMT_INCOME_TOTAL</a>
    - <a href='#5-3'>5.3 Distribution of AMT_GOODS_PRICE</a>
    - <a href='#5-4'>5.4 Distribution of Name of type of the Suite</a>
    - <a href='#5-5'>5.5 Data is balanced or imbalanced</a>
    - <a href='#5-6'>5.6 Types of loan</a>
- <a href='#6'>6. Pearson Correlation of features</a>
- <a href='#7'>7. Feature Importance using Logistic Regression</a>

# <a id='1'>1. Introduction</a>

The Home Credit Default Risk competition is a supervised binary classification task. The objective is to use historical financial and socioeconomic data to predict whether an applicant will be able to repay a loan:

- Supervised: the labels are included in the training data, and the goal is to train a model that learns to predict the labels from the features.
- Classification: the label is a binary variable, 0 (will repay the loan on time) or 1 (will have difficulty repaying the loan).

### Dataset

The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population.

There are 7 different data files:

- application_train/application_test: the main training and testing data, with information about each loan application at Home Credit. Every loan has its own row and is identified by SK_ID_CURR. The training application data includes the TARGET column, where 0 means the loan was repaid and 1 means the loan was not repaid.
- bureau: data about the client's previous credits from other financial institutions. Each previous credit has its own row in bureau and is identified by SK_ID_BUREAU. Each loan in the application data can have multiple previous credits.
- bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
- previous_application: previous applications for loans at Home Credit by clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by SK_ID_PREV.
- POS_CASH_BALANCE: monthly data about previous point-of-sale or cash loans that clients have had with Home Credit. Each row is one month of a previous point-of-sale or cash loan, and a single previous loan can have many rows.
- credit_card_balance: monthly data about previous credit cards that clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
- installments_payment: payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.
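
Because each client (SK_ID_CURR) can have many rows in the secondary files, those files have to be collapsed to one row per client before they can be joined to the application data. The following is a minimal sketch on made-up toy frames, not the notebook's own code: the aggregate column names `BUREAU_COUNT` and `BUREAU_CREDIT_SUM` are hypothetical, and `AMT_CREDIT_SUM` simply stands in for any numeric bureau column.

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (all values are made up)
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0],
})

# Collapse bureau to one row per client
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        BUREAU_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)

# A left join keeps every application, including clients with no bureau records
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Clients without any previous credit (client 3 here) end up with missing values in the aggregated columns, which then need the same missing-data handling as the rest of the features.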


# <a id='2'>2. Retrieving the Data</a>

In [13]:
import pandas as pd # package for high-performance, easy-to-use data structures and data analysis
import numpy as np # fundamental package for scientific computing with Python
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for making plots with seaborn

In [3]:
train = pd.read_csv("/home/wafa/Bureau/application_train.csv")
test = pd.read_csv("/home/wafa/Bureau/application_test.csv")

#train = train.sample(n=30751,replace=False, random_state=0)
#test = test.sample(n=4874,replace=False, random_state=0)

# <a id='3'>3. Glimpse of Data</a>

**application_train data**

In [14]:
train.head()

| | SK_ID_CURR | NAME_CONTRACT_TYPE | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 0 | 0 | 1 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 100003 | 0 | 0 | 0 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 100004 | 1 | 1 | 1 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 100006 | 0 | 0 | 1 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 100007 | 0 | 0 | 1 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [15]:
train.columns.values

array(['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
       'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
       'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
       'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG',
       'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG',
       'LANDAREA_AVG', 'LIVINGAP

# <a id='4'> 4. Check for missing data</a>

**Checking missing data in application_train**

In [16]:
# checking missing data
total = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_train_data.head(20)

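| | Total | Percent |
|---|---|---|
| EMERGENCYSTATE_MODE_Yes | 0 | 0.0 |
| NONLIVINGAPARTMENTS_MEDI | 0 | 0.0 |
| FLAG_DOCUMENT_6 | 0 | 0.0 |
| FLAG_DOCUMENT_5 | 0 | 0.0 |
| FLAG_DOCUMENT_4 | 0 | 0.0 |
| FLAG_DOCUMENT_3 | 0 | 0.0 |
| FLAG_DOCUMENT_2 | 0 | 0.0 |
| DAYS_LAST_PHONE_CHANGE | 0 | 0.0 |
| DEF_60_CNT_SOCIAL_CIRCLE | 0 | 0.0 |
| OBS_60_CNT_SOCIAL_CIRCLE | 0 | 0.0 |

Every count above is zero, and the column names include one-hot encoded features. This is consistent with the cell (In [16]) having been executed after the encoding and mean imputation steps (In [5] and In [7]) further down, so the output does not describe the missing values in the raw file.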
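
The same check can be wrapped in a small reusable function so that it can be applied to any of the data files. This is a sketch only: `missing_summary` is a hypothetical helper name, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column, worst first."""
    total = df.isnull().sum()
    percent = 100 * df.isnull().mean()
    summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return summary.sort_values("Total", ascending=False)

# Made-up example: column "a" is half missing, "c" a quarter, "b" complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1.0, 2.0, 3.0],
})
summary = missing_summary(toy)
print(summary)
```

To see the missing values of the raw data, a function like this would need to be called right after loading the CSV files, before any imputation.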


# <a id='5'>5. Data Exploration</a>

## <a id='5-5'>5.5 Data is balanced or imbalanced</a>

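In [17]:
# .iplot() is not part of pandas; it is assumed here to come from cufflinks
import cufflinks as cf
cf.go_offline()

# Count the loans in each TARGET class.
# This cell must run before the align step (In [6]), which drops TARGET from train.
temp = train["TARGET"].value_counts()
df = pd.DataFrame({'labels': temp.index,
                   'values': temp.values
                  })
df.iplot(kind='pie',labels='labels',values='values', title='Loan repaid or not')

KeyError: 'TARGET'

The KeyError is consistent with this cell (In [17]) having been executed after the alignment in In [6], which keeps only the columns shared with the test set and therefore removes TARGET from train.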
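
The class balance can also be inspected numerically, without any plotting dependency. A minimal sketch on made-up labels (the 9-to-1 split below is purely illustrative and is not the dataset's actual ratio):

```python
import pandas as pd

# Hypothetical labels: nine repaid loans (0) and one unpaid loan (1)
target = pd.Series([0] * 9 + [1], name="TARGET")

# Absolute counts and relative shares per class
counts = target.value_counts()
shares = target.value_counts(normalize=True)
balance = pd.DataFrame({"count": counts, "share": shares})
print(balance)
```

On the real data the same two calls would be applied to the TARGET column (or to `train_labels` once TARGET has been separated from the features).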

In [4]:
from sklearn import preprocessing

# Create a label encoder object
le = preprocessing.LabelEncoder()
le_count = 0

# Iterate through the columns
for col in train:
    if train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(train[col].unique())) <= 2:
            # Train on the training data
            le.fit(train[col])
            # Transform both the training and testing data
            train[col] = le.transform(train[col])
            test[col] = le.transform(test[col])
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)
print('Training Features shape: ', train.shape)
print('Testing Features shape: ', test.shape)

3 columns were label encoded.
Training Features shape:  (307511, 122)
Testing Features shape:  (48744, 121)
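
A `LabelEncoder` fitted on the training column raises a `ValueError` in `transform` if the test column contains a category it has not seen. The sketch below, on made-up columns, shows one way to check for that explicitly before transforming; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up binary categorical column in train and test
train_col = pd.Series(["Y", "N", "Y"])
test_col = pd.Series(["N", "Y", "N"])

# Fit on the training data only
le = LabelEncoder()
le.fit(train_col)

# Fail early, with a readable message, if test holds categories unseen in train
unseen = set(test_col.unique()) - set(le.classes_)
if unseen:
    raise ValueError(f"Unseen categories in test: {unseen}")

# Classes are sorted, so "N" maps to 0 and "Y" maps to 1
train_enc = le.transform(train_col)
test_enc = le.transform(test_col)
print(train_enc, test_enc)
```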


In [5]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print('Training Features shape: ', train.shape)
print('Testing Features shape: ', test.shape)

Training Features shape:  (307511, 243)
Testing Features shape:  (48744, 239)


In [None]:
# Age information into a separate dataframe
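# Copy the slice so that adding columns does not trigger SettingWithCopyWarning
age_data = train[['TARGET', 'DAYS_BIRTH']].copy()
# DAYS_BIRTH is recorded as negative days relative to the application date,
# so take the absolute value (harmless if the values are already positive)
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'].abs() / 365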

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)

In [None]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups
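
The binning and grouping steps can be tried on a handful of made-up rows to see what they produce. This sketch assumes DAYS_BIRTH is stored as negative day counts; the values and the resulting default rates are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up clients: two aged about 25 and two aged about 55
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 0],
    "DAYS_BIRTH": [-9000, -9100, -20000, -20050],
})

# Convert negative day counts to positive ages in years
toy["YEARS_BIRTH"] = toy["DAYS_BIRTH"].abs() / 365

# Ten 5-year bins between 20 and 70
toy["YEARS_BINNED"] = pd.cut(toy["YEARS_BIRTH"], bins=np.linspace(20, 70, num=11))

# Mean of TARGET per bin is the share of unpaid loans in that age group;
# observed=True drops the empty bins
rate = toy.groupby("YEARS_BINNED", observed=True)["TARGET"].mean()
print(rate)
```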

In [6]:
train_labels = train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
train, test = train.align(test, join = 'inner', axis = 1)

# Add the target back in
#train['TARGET'] = train_labels

print('Training Features shape: ', train.shape)
print('Testing Features shape: ', test.shape)

Training Features shape:  (307511, 239)
Testing Features shape:  (48744, 239)
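
The column mismatch after one-hot encoding (243 versus 239 columns above) arises when a category appears in only one of the two sets. The sketch below reproduces the effect on made-up data and shows how `align` with an inner join resolves it; the column name `HOUSE` is hypothetical.

```python
import pandas as pd

# Made-up data: the category "terraced" occurs only in the training set
train_toy = pd.DataFrame({"TARGET": [0, 1, 0], "HOUSE": ["flat", "house", "terraced"]})
test_toy = pd.DataFrame({"HOUSE": ["flat", "house", "flat"]})

# One-hot encoding yields different column sets for train and test
train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# Save the labels first: the inner join drops TARGET because test lacks it
labels = train_enc["TARGET"]
train_aligned, test_aligned = train_enc.align(test_enc, join="inner", axis=1)

print(train_enc.shape, test_enc.shape)
print(train_aligned.shape, test_aligned.shape)
```

Saving the labels before aligning matters because the inner join keeps only the shared columns, which is also why TARGET is no longer present in `train` after this step.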


In [7]:
train.fillna(train.mean(), inplace=True)
test.fillna(test.mean(), inplace=True)
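
The cell above fills each set with its own column means. A common alternative, shown here only as a sketch on made-up data, is to learn the means on the training set and reuse them for the test set, so that both sets are transformed identically. It uses scikit-learn's `SimpleImputer`, which is a swapped-in technique and not what the notebook itself does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up features with gaps in both sets
train_toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, np.nan]})
test_toy = pd.DataFrame({"x": [np.nan, 100.0], "y": [np.nan, 0.0]})

# Learn the column means on the training data only
imputer = SimpleImputer(strategy="mean")
train_imp = pd.DataFrame(imputer.fit_transform(train_toy), columns=train_toy.columns)

# Fill the test gaps with the training means (x: 2.0, y: 15.0)
test_imp = pd.DataFrame(imputer.transform(test_toy), columns=test_toy.columns)
print(test_imp)
```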

# <a id='7'>7. Feature Importance using Logistic Regression</a>

In [8]:
# Split the labeled data into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_val,y_train, y_val = train_test_split(train, train_labels, test_size=0.22, random_state=0)

from sklearn.linear_model import LogisticRegression

# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001)

# Train on the training data
log_reg.fit(X_train, y_train)


# Predict default probabilities on the validation set;
# column 1 of predict_proba holds P(TARGET = 1)
log_reg_pred = log_reg.predict_proba(X_val)[:, 1]

score = log_reg.score(X_val, y_val)
score

# calculate AUC
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_val, log_reg_pred )
print(f'AUC: {auc}')



AUC: 0.6224959428174544
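
The cell computes both accuracy (`log_reg.score`) and AUC. On imbalanced labels these two metrics can tell very different stories, as the following sketch on made-up labels suggests: a "model" that assigns everyone the same low default probability reaches high accuracy but has no ranking ability.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up imbalanced labels: nine repaid loans (0), one unpaid loan (1)
y_true = np.array([0] * 9 + [1])

# A constant score for every applicant, thresholded at 0.5 to get class labels
constant_scores = np.full(10, 0.1)
constant_labels = (constant_scores >= 0.5).astype(int)

# Accuracy rewards always predicting the majority class
acc = accuracy_score(y_true, constant_labels)

# AUC measures ranking quality; identical scores rank nothing
auc_value = roc_auc_score(y_true, constant_scores)

print(f"accuracy: {acc}, AUC: {auc_value}")
```

This is why AUC is the more informative of the two numbers for a baseline on this kind of data.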


In [10]:
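# Submission dataframe
# Predict on the test set; log_reg_pred holds predictions for the validation split X_val
test_pred = log_reg.predict_proba(test)[:, 1]

# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
submit = test[['SK_ID_CURR']].copy()
submit['TARGET'] = test_pred

submit.to_csv('log_reg_baseline.csv', index = False)
submit.head()

Assigning `log_reg_pred` to `submit['TARGET']` instead raises `ValueError: Length of values does not match length of index`, because the validation split and the test set have different numbers of rows.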
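
The submission step can be rehearsed end to end on a small synthetic dataset, which makes the length requirement easy to see. Everything in this sketch is made up for illustration: the feature `f`, the ID ranges, and the `_toy` variable names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training and test sets with one feature and an ID column
rng = np.random.default_rng(0)
X_train_toy = pd.DataFrame({"SK_ID_CURR": range(100, 120), "f": rng.normal(size=20)})
y_train_toy = np.array([0, 1] * 10)
test_toy = pd.DataFrame({"SK_ID_CURR": range(200, 205), "f": rng.normal(size=5)})

# Fit on the training features, then predict on the test features
model = LogisticRegression().fit(X_train_toy[["f"]], y_train_toy)
test_pred_toy = model.predict_proba(test_toy[["f"]])[:, 1]

# One prediction per test row is required for the assignment to succeed
assert len(test_pred_toy) == len(test_toy)

submit_toy = pd.DataFrame({
    "SK_ID_CURR": test_toy["SK_ID_CURR"],
    "TARGET": test_pred_toy,
})
print(submit_toy.head())
```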