# SIMPLE DATA ANALYSIS

### DATA DESCRIPTION

- **application_{train|test}.csv**

    This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).

    Static data for all applications. One row represents one loan in our data sample.


- **bureau.csv**

    All of the client's previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample).

    For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.


- **bureau_balance.csv**

    Monthly balances of previous credits in Credit Bureau.

    This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.


- **POS_CASH_balance.csv**

    Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.

    This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.


- **credit_card_balance.csv**

    Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.

    This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.


- **previous_application.csv**

    All previous applications for Home Credit loans of clients who have loans in our sample.

    There is one row for each previous application related to loans in our data sample.


- **installments_payments.csv**

    Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.

    There is a) one row for every payment that was made plus b) one row for each missed payment.

    One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
    
<img src='home_credit.png'>

In [4]:
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split,KFold,GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
import lightgbm as lgb

# for showing all columns
pd.set_option('display.max_columns', None)

%matplotlib inline 

## READING THE FILES

In [5]:
test=pd.read_csv("../input/application_test.csv")
train=pd.read_csv("../input/application_train.csv")
credit_card_balance=pd.read_csv("../input/credit_card_balance.csv")
installments_payments=pd.read_csv("../input/installments_payments.csv")
bureau=pd.read_csv("../input/bureau.csv")
bureau_balance=pd.read_csv("../input/bureau_balance.csv")
pos_cash=pd.read_csv("../input/POS_CASH_balance.csv")
prev_applications=pd.read_csv("../input/previous_application.csv")

# description=pd.read_csv("data/HomeCredit_columns_description.csv",encoding='ISO-8859-1')


In [None]:
print("Application train:",train.shape)
print("Application test:",test.shape)
print("Credit card balance:",credit_card_balance.shape)
print("Installments payments:",installments_payments.shape)
print("Bureau:",bureau.shape)
print("Bureau balance:",bureau_balance.shape)
print("POS_cash Balance:",pos_cash.shape)
print("Previous Applications:",prev_applications.shape)

## Distribution of Target variable

From the description, 0 means that the loan was repaid and 1 means that the loan was not repaid.

As you can see, the share of TARGET=1 is very small, so the data set is imbalanced. Accuracy would be misleading here, so a better metric for judging performance is the area under the ROC curve (ROC AUC).

In [None]:
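# pass the column as x explicitly; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=train['TARGET'])
plt.title("Distribution of target variable")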
plt.show()
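
To put a number on the imbalance, the class shares can be computed directly. The sketch below runs on a small made-up `TARGET` column (an assumption for illustration, not the real data). It also shows why plain accuracy says little here: a model that always predicts 0 already scores the majority-class share.

```python
import pandas as pd

def class_balance(target: pd.Series) -> pd.Series:
    """Return the share of each class, ordered by class label."""
    return target.value_counts(normalize=True).sort_index()

# Toy stand-in for train['TARGET']; the values are invented for illustration.
toy_target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

balance = class_balance(toy_target)
print(balance)

# Accuracy of a trivial model that predicts 0 for every row.
baseline_accuracy = (toy_target == 0).mean()
print("Always-0 accuracy:", baseline_accuracy)
```

On the real data the call would be `class_balance(train['TARGET'])`.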

In [None]:
train.head(10)

In [None]:
test.head(10)

## MISSING DATA 

Let's check how much data is missing, starting with the train and test tables.

In [None]:
def missing_data(data):
    # number of missing values per column, largest first
    counts=data.isnull().sum().sort_values(ascending=False)
    na=pd.DataFrame({'number':counts,'Percent':counts/data.shape[0]*100})
    # keep only the columns that actually have missing values
    return na.loc[na['number']>0]

In [None]:
print(missing_data(train).shape[0])
missing_data(train).head(10)

In [None]:
print(missing_data(test).shape[0])
missing_data(test).head(10)

## EXPLORATION

I am defining a helper function that plots a categorical feature.

I also plot the test data to check that its **distribution remains approximately the same** as in the training data.
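
The same train/test check can also be done numerically. This is a minimal sketch, assuming we only want the category shares side by side; the helper name `compare_shares` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def compare_shares(train_col: pd.Series, test_col: pd.Series) -> pd.DataFrame:
    """Category shares in train and test side by side, plus the absolute gap."""
    shares = pd.concat(
        [train_col.value_counts(normalize=True),
         test_col.value_counts(normalize=True)],
        axis=1,
        keys=["train", "test"],
    ).fillna(0.0)  # a category missing from one set gets a share of 0
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares

# Toy columns, invented for illustration only.
toy_train = pd.Series(["Cash loans"] * 9 + ["Revolving loans"] * 1)
toy_test = pd.Series(["Cash loans"] * 8 + ["Revolving loans"] * 2)

shares = compare_shares(toy_train, toy_test)
print(shares)
```

On the real data this would be called as `compare_shares(train[feature], test[feature])`. A large `abs_diff` points to a category whose share shifts between the two sets.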

In [None]:
def plot(train,test,feature,rot=False):
    # 2x2 grid: train counts, counts for TARGET=1, mean TARGET per category, test counts
    plt.figure(figsize=(12,12))
    
    plt.subplot(221)
    sns.countplot(x=train[feature])
    if rot:
        plt.xticks(rotation=90)
    plt.title("For training data set")

    plt.subplot(222)
    sns.countplot(x=train.loc[train['TARGET']==1,feature])
    if rot:
        plt.xticks(rotation=90)
    plt.title("Count plot when TARGET=1")

    # bar height is the mean of TARGET, i.e. the share of unpaid loans in each category
    plt.subplot(223)
    sns.barplot(x=train[feature],y=train['TARGET'])
    if rot:
        plt.xticks(rotation=90)
    plt.title("Bar Plot between Target and "+feature)

    plt.subplot(224)
    sns.countplot(x=test[feature])
    if rot:
        plt.xticks(rotation=90)
    plt.title("For test data set")
    
    plt.show()
    
    # these are numbers showing the stats
#     print(train[feature].value_counts())
#     print(train.loc[train['TARGET']==1,feature].value_counts())
    

### Loan types

Cash loans make up roughly 90% of all loans and revolving loans the remaining 10%. As the second graph shows, most of the unpaid loans are also cash loans.

But given that a loan is a cash loan, the probability that it is not repaid is about 8% (23221/278232), and given that it is a revolving loan, the probability is about 5% (1604/29279). This is what our **bar plot is showing**, and this is what matters for our prediction.

The distribution in the test data is similar to the training data: more cash loans and fewer revolving loans.

In [43]:
plot(train,test,'NAME_CONTRACT_TYPE')
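
The conditional probabilities quoted above (unpaid loans divided by all loans of that type) can be reproduced with a group-by. The sketch below is one possible way to do it; `default_rate_by` is a hypothetical helper and the toy frame is invented, so its numbers do not match the real data.

```python
import pandas as pd

def default_rate_by(df: pd.DataFrame, feature: str, target: str = "TARGET") -> pd.DataFrame:
    """Per category: number of loans, number of unpaid loans, and unpaid share."""
    summary = df.groupby(feature)[target].agg(total="count", unpaid="sum", rate="mean")
    return summary.sort_values("rate", ascending=False)

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 10 + ["Revolving loans"] * 5,
    "TARGET": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 0, 0, 0, 0],
})

rates = default_rate_by(toy, "NAME_CONTRACT_TYPE")
print(rates)
```

On the real data, `default_rate_by(train, 'NAME_CONTRACT_TYPE')` would give the counts and rates behind the bar plot.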

### GENDER

CODE_GENDER is the gender of the client who applied for the loan.

There are more female clients than male clients. The number of female clients who have not repaid their loan is also higher than the number of male clients, but the bar graph is the more important one to look at.

Given that the client is male, the probability of not repaying is about 10%, which is higher than for female clients (about 8%).

The distribution in the test data is the same as in the train data.

In [44]:
plot(train,test,'CODE_GENDER')

### CAR AND HOUSE

FLAG_OWN_CAR flags whether the client owns a car, and FLAG_OWN_REALTY flags whether the client owns a house or flat.

Fewer clients own a car than not. Whether the same holds for house ownership can be read off the FLAG_OWN_REALTY plot.

In [45]:
plot(train,test,'FLAG_OWN_CAR')
plot(train,test,'FLAG_OWN_REALTY')

### CNT_CHILDREN

CNT_CHILDREN is the number of children the client has. There are a lot of things we can observe.

The number of clients with zero children is really high; the other thing to look at is the bar plot.

In the bar plot, clients with 9 or 11 children show an unpaid rate of 100%. These categories most likely contain only a handful of clients, though, so this says little about how such clients behave in general.

In [46]:
plot(train,test,'CNT_CHILDREN')
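
Before reading much into an extreme bar, it helps to look at how many rows stand behind it. The sketch below puts the row count next to the unpaid rate and flags small categories. The helper name and the `min_count=30` threshold are arbitrary assumptions for illustration, and the toy data is invented.

```python
import pandas as pd

def rate_with_support(df, feature, target="TARGET", min_count=30):
    """Unpaid share per category, with a flag for categories that have few rows."""
    out = df.groupby(feature)[target].agg(count="count", rate="mean")
    # min_count is an arbitrary cut-off, not a statistical test
    out["reliable"] = out["count"] >= min_count
    return out

# Toy data, invented for illustration: two clients with 9 children, both unpaid.
toy = pd.DataFrame({
    "CNT_CHILDREN": [0] * 40 + [1] * 30 + [9] * 2,
    "TARGET": [0] * 37 + [1] * 3 + [0] * 27 + [1] * 3 + [1, 1],
})

support = rate_with_support(toy, "CNT_CHILDREN")
print(support)
```

In the toy frame the 100% rate for 9 children rests on just two rows, which is the situation to watch out for in the real bar plot.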

### NAME_TYPE_SUITE

NAME_TYPE_SUITE says who accompanied the client when applying for the loan. Most of the clients were unaccompanied, and there is nothing fancy in the bar graph.

In [47]:
plot(train,test,'NAME_TYPE_SUITE',rot=True)

### INCOME TYPE

Most loan applicants have the income type Working, followed by Commercial associate, Pensioner and State servant.

Applicants with the income type Maternity leave fail to repay almost 40% of the time, followed by Unemployed (37%). The rest of the income types stay below roughly 10%.

In [49]:
plot(train,test,'NAME_INCOME_TYPE',rot=True)

### EDUCATION TYPE

Most of the clients have Secondary / secondary special education. One thing to note from the bar plot is that clients with Lower secondary education have about a 10% chance of not repaying the loan, followed by Incomplete higher and Secondary / secondary special with about 8%.

In [51]:
plot(train,test,'NAME_EDUCATION_TYPE',rot=True)

### FAMILY STATUS

Most of the clients are married, but clients who are single or in a civil marriage have about a 10% chance of not repaying the loan.

In [52]:
plot(train,test,'NAME_FAMILY_STATUS',rot=True)

### HOUSING_TYPE

Most of the clients live in a house or apartment. Clients who live in rented apartments have about a 12% chance of not repaying the loan, followed by clients who live with their parents at about 10%.

In [53]:
plot(train,test,'NAME_HOUSING_TYPE',rot=True)

### OCCUPATION TYPE

Most of the clients are laborers, and low-skill laborers have about a 17.5% chance of not repaying the loan.

In [54]:
plot(train,test,'OCCUPATION_TYPE',rot=True)

### CNT_FAM_MEMBERS

CNT_FAM_MEMBERS is the number of family members. What we saw in the CNT_CHILDREN section also applies here: for families with 11 or 13 members the unpaid rate in the bar plot is 100%, which is most likely based on very few clients.

In [56]:
plot(train,test,'CNT_FAM_MEMBERS',rot=True)

### ORGANIZATION_TYPE

In [63]:
plot(train,test,'ORGANIZATION_TYPE',rot=True)

Now we will explore the continuous variables such as income, loan amount, etc.

In [76]:
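def plotc(train,feature):
    plt.figure(figsize=(12,7))
    
    # left: distribution of the feature (histogram with KDE overlay)
    # histplot replaces the deprecated sns.distplot
    plt.subplot(121)
    sns.histplot(x=train[feature].dropna(),bins=50,kde=True,stat='density')
    
    # right: feature values split by TARGET
    plt.subplot(122)
    sns.boxplot(x=train['TARGET'],y=train[feature])
    plt.show()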


### AMT_INCOME_TOTAL

In [77]:
plotc(train,'AMT_INCOME_TOTAL')

### AMT_CREDIT

In [78]:
plotc(train,'AMT_CREDIT')

### DAYS_BIRTH

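Age of the client in days at the time of application. The values are negative because they are counted backwards from the application date, i.e. they lie in the past.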

In [80]:
plotc(train,'DAYS_BIRTH')
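
Negative day counts are hard to read, so it can help to convert them to an age in years first. This is a minimal sketch on made-up values; dividing by 365.25 is an approximation that ignores the exact calendar.

```python
import pandas as pd

# Toy values standing in for train['DAYS_BIRTH']; invented for illustration.
days_birth = pd.Series([-9461, -16765, -19046], name="DAYS_BIRTH")

# Flip the sign and divide by the approximate number of days in a year.
age_years = (-days_birth / 365.25).round(1)
print(age_years.tolist())
```

The same expression applied to `train['DAYS_BIRTH']` would give a column that can be plotted on a more familiar scale.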

In [81]:
plotc(train,'DAYS_EMPLOYED')

In [82]:
plotc(train,'DAYS_REGISTRATION')

This notebook is under construction.

Suggestions are always welcome, and if you have any questions I am always at your disposal.