# This Notebook contains an Exploratory Data Analysis of the Home Credit Risk Competition

## 1. Load in and quick look at datasets      
## 2. Visualise the data 
## 3. One hot encoding and imputing missing data
## 4. Baseline Model Logistic Regression
## 5. Feature Engineering
## 6. Alternative Models

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

five_thirty_eight = [
    "#30a2da",
    "#fc4f30",
    "#e5ae38",
    "#6d904f",
    "#8b8b8b",
]


plt.style.use('fivethirtyeight')
#sns.set_palette(five_thirty_eight)
#sns.palplot(sns.color_palette())

%matplotlib inline 


from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go

init_notebook_mode(connected=True) #do not miss this line
from plotly import tools


# For model estimation
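from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# Imputer is no longer in sklearn.preprocessing; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer as Imputer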
from sklearn.linear_model import LogisticRegression
from sklearn import svm


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
PATH = "../input"
# Any results you write to the current directory are saved as output.

## 1: Load in and take a quick look at the data

In [2]:
application_train = pd.read_csv(PATH+"/application_train.csv")
application_test = pd.read_csv(PATH+"/application_test.csv")
bureau = pd.read_csv(PATH+"/bureau.csv")
bureau_balance = pd.read_csv(PATH+"/bureau_balance.csv")
credit_card_balance = pd.read_csv(PATH+"/credit_card_balance.csv")
installments_payments = pd.read_csv(PATH+"/installments_payments.csv")
previous_application = pd.read_csv(PATH+"/previous_application.csv")
POS_CASH_balance = pd.read_csv(PATH+"/POS_CASH_balance.csv")

## Calculate some simple correlations to give us an indication of the important variables

In [115]:
# Correlations are only defined for numeric columns
correlations = application_train.select_dtypes(include='number').corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations: \n', correlations.tail(15))
print('\nMost Negative Correlations: \n', correlations.head(15))

## Variables we want to visualise

### Application dataset
* TARGET
* CODE_GENDER
* AMT_INCOME_TOTAL
* AMT_CREDIT
* NAME_EDUCATION_TYPE
* NAME_FAMILY_STATUS
* CNT_CHILDREN
* OCCUPATION_TYPE
* NAME_INCOME_TYPE
* FLAG_OWN_CAR (own car)
* FLAG_OWN_REALTY (own house)
* ORGANIZATION_TYPE
* DAYS_REGISTRATION
* DAYS_EMPLOYED

In [4]:
application_train.head()
application_train.columns.values

In [5]:
bureau.head()

## Check for missing data

In [6]:
def missing(data):
    miss = data.isnull().sum().sort_values(ascending = False)
    perc = 100*(data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
    return pd.concat([miss, perc], axis = 1, keys = ['Total', 'Percent'])

In [7]:
missing(application_train).head(20)

There are a lot of missing values in certain columns. These columns appear to be mainly related to the applicant's housing situation,
which may not be hugely important for prediction. We may get rid of these or use simple mean or mode imputation later on.
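
As a rough sketch of those two options, the toy example below drops columns that are mostly missing and fills the rest with the column mean. The frame, its values and the 60% cut-off are all made up for illustration; they are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for application_train (values are made up)
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, np.nan, 400.0],
    "COMMONAREA_AVG": [np.nan, np.nan, np.nan, 0.1],
    "CNT_CHILDREN": [0, 1, 2, 0],
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Hypothetical cut-off: drop columns that are more than 60% missing
threshold = 0.6
kept = toy.loc[:, missing_share <= threshold]

# Fill the remaining gaps with each column's mean
imputed = kept.fillna(kept.mean())
print(imputed)
```

Whether dropping or imputing is the better choice probably depends on the model, so the cut-off is something to experiment with.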

In [13]:
missing(bureau).head(20)

In [14]:
missing(bureau_balance)

In [16]:
missing(credit_card_balance)

In [18]:
missing(previous_application)

There is a substantial amount of missing data in the rate columns of this dataset.

In [19]:
missing(POS_CASH_balance)

In [20]:
missing(installments_payments)

In [102]:
plt.figure(figsize=(10,6))
sns.countplot(x='TARGET', data =  application_train)
plt.title("Number of loans that were repaid and not repaid")


# prop = (temp.values/(temp.values[0]+temp.values[1]))*100
# print("Propotion of loans not repayed: %.2f" %prop[1]+"%")

### It looks like we have substantial class imbalance here (92% of loans repaid), so we will probably need to under- or over-sample the data when we get to prediction. A model that simply says all loans are repaid would score 92% on accuracy, so we probably need to use the F-score rather than accuracy to determine whether a model is good or not
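
To make the point concrete, here is a small sketch using made-up labels with the same 92% / 8% split. It scores a "model" that always predicts repayment, using `accuracy_score` and `f1_score` from scikit-learn. The labels are synthetic and only illustrate the metric behaviour.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels with a 92% / 8% split: 0 = repaid, 1 = not repaid
y_true = np.array([0] * 92 + [1] * 8)

# A "model" that predicts every loan is repaid
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when the positive class is never predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print("Accuracy: %.2f" % acc)
print("F-score for the default class: %.2f" % f1)
```

The accuracy looks good while the F-score for the default class is zero, which is why accuracy alone is misleading here.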

## 2. Visualise the data 

## Let's make a function to plot the data by the TARGET value.

In [10]:
def count_plots(feature, label_rotation=False):
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,9))
    plot1 = sns.countplot(ax = ax1, x=feature, hue = 'TARGET',data =  application_train)
    if(label_rotation):
        plot1.set_xticklabels(plot1.get_xticklabels(),rotation=90)
    plt.tick_params(axis='both', which='major', labelsize=10)
    # since 1 means not repaid, the mean gives us the proportion of loans that were not repaid
    perc_grouped = application_train[[feature, 'TARGET']].groupby(feature).mean().sort_values(by="TARGET", ascending= False)
    plot2 = sns.barplot(ax=ax2, x=perc_grouped['TARGET'], y = perc_grouped.index, orient="h")

    plt.tight_layout()
    plt.show()

In [11]:
count_plots("CODE_GENDER")

## It looks like females are much more likely to take out loans, with females having almost twice as many loans as males.
## Females also have a higher total number of defaults.
## However, the figure on the right tells us that men are actually more likely to default, with default rates at around 10% vs 7% for females.
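
The difference between the number of defaults and the default rate can be sketched with a toy groupby. The numbers below are made up and only mimic the pattern (one group has more loans and more defaults but a lower rate).

```python
import pandas as pd

# Made-up data: 1 in TARGET means the loan was not repaid
toy = pd.DataFrame({
    "CODE_GENDER": ["F"] * 20 + ["M"] * 10,
    "TARGET": [1] * 3 + [0] * 17 + [1] * 2 + [0] * 8,
})

summary = toy.groupby("CODE_GENDER")["TARGET"].agg(
    loans="count",        # number of loans in the group
    defaults="sum",       # number of loans not repaid
    default_rate="mean",  # share of loans not repaid
)
print(summary)
```

This is the same idea as the two panels of `count_plots`: the left one shows counts, the right one shows the mean of TARGET per group.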

In [98]:
count_plots('NAME_EDUCATION_TYPE', label_rotation=True)

### From the left figure, loans are mostly going to people with lower educational achievement. It also looks like the rate of defaults falls as the level of education rises. This variable is likely quite important for prediction.

In [110]:
count_plots("NAME_FAMILY_STATUS")

In [111]:
count_plots("CNT_CHILDREN")

## Occupation type and Income type

In [106]:
# * OCCUPATION_TYPE
# * FLAG_OWN_CAR (own car)
# * FLAG_OWN_REALTY (own house)
count_plots("OCCUPATION_TYPE",  label_rotation=True)

In [114]:
count_plots("NAME_INCOME_TYPE", True)

## Does owning a car or real estate affect their ability to pay?

In [107]:
count_plots("FLAG_OWN_CAR")

In [108]:
count_plots("FLAG_OWN_REALTY")

In [109]:
count_plots("NAME_CONTRACT_TYPE")

In [115]:
count_plots("ORGANIZATION_TYPE", True)

### A lot of these variables look like they may be pretty useful, as there seem to be different patterns between those who repay and those who do not.

# Let's look at the distribution of some of the continuous variables now
* AMT_INCOME_TOTAL
* AMT_CREDIT
* AMT_ANNUITY
* AMT_GOODS_PRICE
* DAYS_EMPLOYED

## Function to plot the distribution of our variables

In [46]:
def plot_distribution(data, feature, log_transform = False):
    plt.figure(figsize=(12,8))
    if log_transform:
        plt.title("Distribution of log of %s" % feature)
        # log is only defined for positive values, so keep those only
        values = data.loc[data[feature] > 0, feature]
        sns.histplot(np.log(values), kde=True, bins=100, stat="density")
    else:
        plt.title("Distribution of %s" % feature)
        sns.histplot(data[feature].dropna(), kde=True, bins=100, stat="density")
    plt.show()

## Function to plot variable distributions for defaulters and non-defaulters

In [125]:
def plot_target(data, feature, xlab= '', ylab= '', title= ""):
    plt.figure(figsize=(12,9))
    # KDE plot of loans which were repaid on time
    sns.kdeplot(data.loc[data['TARGET'] == 0, feature], label = 'target == 0')

    # KDE plot of loans which were not repaid on time
    sns.kdeplot(data.loc[data['TARGET'] == 1, feature], label = 'target == 1')
    
    # Labeling of plot: fall back to defaults when no label is passed in
    plt.xlabel(xlab or feature)
    plt.ylabel(ylab or 'Density')
    plt.title(title or "Distribution of %s" % feature)
    plt.legend()

In [89]:
plot_distribution(application_train, "AMT_INCOME_TOTAL")

In [98]:
plot_target(application_train, "AMT_INCOME_TOTAL")

In [90]:
plot_distribution(application_train, "AMT_INCOME_TOTAL", True)

In [91]:
plot_distribution(application_train,"AMT_CREDIT")

In [99]:
plot_target(application_train, "AMT_CREDIT")

In [92]:
plot_distribution(application_train,"AMT_CREDIT", True)

In [93]:
plot_distribution(application_train,"AMT_ANNUITY")

In [103]:
plot_target(application_train, "DAYS_BIRTH", "Days since Birth")
# The younger you are the less likely you are to repay your loan

In [136]:
plot_distribution(application_train,"AMT_ANNUITY", True)

In [137]:
plot_distribution(application_train,"AMT_GOODS_PRICE")

In [138]:
plot_distribution(application_train,"AMT_GOODS_PRICE", True)

In [139]:
plot_distribution(application_train,"DAYS_EMPLOYED")

In [32]:
plot_distribution(application_train,"EXT_SOURCE_1")

In [126]:
plot_target(application_train, "EXT_SOURCE_1")

In [117]:
plot_distribution(application_train,"EXT_SOURCE_2")

In [127]:
plot_target(application_train, "EXT_SOURCE_2")

In [119]:
plot_distribution(application_train,"EXT_SOURCE_3")

In [128]:
plot_target(application_train, "EXT_SOURCE_3")

## From the DAYS_EMPLOYED graph it looks like most values are < 0. In this dataset the DAYS_ columns are recorded relative to the application date, so a negative value means the applicant had been employed for that many days before applying. The distribution appears to be bimodal, with a second large group of people sitting at a very large positive value.

In [152]:
print(application_train['DAYS_EMPLOYED'].describe())
print("Longest time employed (most negative value) is %.0f" % (application_train['DAYS_EMPLOYED'].min()/365 * -1) + " years")
print("Largest positive value corresponds to %.0f" % (application_train['DAYS_EMPLOYED'].max()/365) + " years")

## Ok, so obviously some of this data needs to be cleaned, since the largest value implies someone being employed for 1000 years
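
One possible way to clean this, sketched on made-up values: flag durations that are implausibly long, keep the flag as its own column, and set the value itself to missing so it can be imputed later. The 100-year cut-off and the `DAYS_EMPLOYED_ANOM` column name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values; the large number stands in for the ~1000-year entries
toy = pd.DataFrame({"DAYS_EMPLOYED": [-637, -1188, 365243, -3039, 365243]})

# Hypothetical rule: anything implying more than 100 years is not a real duration
anomalous = toy["DAYS_EMPLOYED"].abs() / 365 > 100

# Keep a flag so a model can still use the information, then blank the value
toy["DAYS_EMPLOYED_ANOM"] = anomalous
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].where(~anomalous, np.nan)

print(toy)
print("Flagged rows: %d" % anomalous.sum())
```

The same rule would need to be applied to the test set so both frames stay consistent.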

In [153]:
plot_distribution(application_train,"DAYS_REGISTRATION")

# Now let's take a look at the Bureau dataset

The bureau dataset contains info on applicants' previous loan applications with other credit institutions

### We will merge these datasets on the primary key, which is SK_ID_CURR, using an inner join
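
Before running the merge it may help to see what an inner join on SK_ID_CURR does to the row count. The sketch below uses two tiny made-up frames (`applications` and `bureau_toy` are hypothetical stand-ins, not the real tables).

```python
import pandas as pd

# Made-up stand-ins for application_train and bureau
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "CREDIT_ACTIVE": ["Closed", "Active", "Closed"],
})

merged = applications.merge(bureau_toy, on="SK_ID_CURR", how="inner")
print(applications.shape, bureau_toy.shape, merged.shape)

# Applicant 1 appears twice (two bureau loans); applicant 3 is dropped
print(merged["SK_ID_CURR"].value_counts())
```

So after the merge each row is a bureau loan rather than an applicant, and applicants without any bureau record disappear. Counts and default rates computed on the merged frame should be read with that in mind.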

In [13]:
bureau.info()

In [14]:
app_bur_train = application_train.merge(bureau, left_on = 'SK_ID_CURR', right_on = 'SK_ID_CURR', how = 'inner')

In [15]:
print(application_train.shape, bureau.shape, app_bur_train.shape)

In [16]:
def count_plots2(data, feature, label_rotate = False):
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,9))
    plot1 = sns.countplot(ax=ax1, x = feature, hue = 'TARGET', data = data)
    if(label_rotate):
        plot1.set_xticklabels(plot1.get_xticklabels(),rotation=90)
    plt.tick_params(axis='both', which='major', labelsize=10)
    perc_grouped = data[[feature, 'TARGET']].groupby(feature).mean().sort_values(by="TARGET", ascending= False)
    plot2 = sns.barplot(ax=ax2, x=perc_grouped['TARGET'], y = perc_grouped.index, orient="h")

    plt.tight_layout()
    plt.show()

## Credit Active

In [167]:
count_plots2(app_bur_train, "CREDIT_ACTIVE" ,True)

## I think this is saying that the majority of people with current Home Credit loans have closed loans with the credit bureau. Some of these, however, are still active.
## We can also see that of those with bad debt previously, around 20% have defaulted on these home loans.

In [130]:
app_bur_train["CREDIT_ACTIVE"].value_counts()

## Ok so there are 20 current applicants with bad debt previously and around 5 of them have defaulted (20%)

## Credit Currency

In [169]:
count_plots2(app_bur_train, "CREDIT_CURRENCY", True)

* The majority of the loans are in currency 1. The currencies are only given as codes, so I can't tell which one this is (my first guess was the euro, but the Czech Republic uses the koruna).

## Do some plotly figures

In [133]:
import plotly.graph_objs as go
x = app_bur_train["CREDIT_ACTIVE"].value_counts()

trace1 = go.Bar(
                x = x.index,
                y = x.values,
                name = "Credit")
data = [trace1]

fig = go.Figure(data = data)
py.offline.iplot(fig)

## Plotly function to plot the counts of Variables

In [110]:
def plotly_plot(df,feature):
    x = df[feature].value_counts().sort_values(ascending=False)
    trace = go.Bar(
                x = x.index,
                y = x.values,
                name = feature)
    data = [trace]

    fig = go.Figure(data = data)
    py.offline.iplot(fig)

In [111]:
plotly_plot(app_bur_train,"CREDIT_TYPE")

## Function to plot the default rates by different variables

In [215]:
def plotly_plots_percs(data, feature):
    # groupby and mean gives us average default rate by that particular feature
    x = pd.DataFrame(data[[feature, "TARGET"]].groupby(feature).mean().sort_values(by = "TARGET",ascending=False))
    trace1 = go.Bar(
                x = x.index,
                y = x['TARGET'],
                name = feature)

    data = [trace1]

    fig = go.Figure(data = data)
    py.offline.iplot(fig)

In [113]:
plotly_plots_percs(app_bur_train, "CREDIT_TYPE")
# x = pd.DataFrame(app_bur_train[["CREDIT_TYPE", "TARGET"]].groupby("CREDIT_TYPE").mean().sort_values(by = "TARGET",ascending=False))

## Bureau Balance Dataset

- This dataset contains data on previous loans from the bureau. Each row is one month of a previous loan, and each loan can have multiple rows, e.g. one particular loan can be outstanding for many months.
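
As a small sketch of that row-per-month structure, the toy frame below has two loans with several monthly rows each, summarised per loan in a single `agg` call. The values are made up.

```python
import pandas as pd

# Made-up monthly records for two bureau loans
toy = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 10, 11, 11],
    "MONTHS_BALANCE": [0, -1, -2, -5, -6],
})

# One row per loan: number of monthly records, oldest and most recent month
per_loan = toy.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"].agg(["count", "min", "max"])
print(per_loan)
```

A per-loan summary like this is one way to bring the monthly table down to one row per SK_ID_BUREAU before joining it to the bureau data.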

In [37]:
bureau_balance.columns.values

In [45]:
# MONTHS_BALANCE is measured relative to the application date, so values are <= 0
# number of monthly records for each loan
bureau_balance_size = bureau_balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].size()

# most recent month on record for each loan
bureau_balance_max = bureau_balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].max()

# oldest month on record for each loan
bureau_balance_min = bureau_balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].min()

In [137]:
plot_distribution(bureau_balance,"MONTHS_BALANCE")
# It looks like the majority of loans are repaid quite quickly

## credit_card_balance Dataset
- contains info on credit cards that clients have previously held with Home Credit.
- each row contains monthly info on a credit card's balance
- one credit card can have multiple rows

In [61]:
credit_card_balance.info()

In [60]:
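print("Number of rows in the dataset: %d" % len(credit_card_balance))
# SK_ID_CURR identifies the client, SK_ID_PREV identifies the individual credit card
print("Number of unique clients: %d" % credit_card_balance['SK_ID_CURR'].nunique())
print("Number of unique credit cards: %d" % credit_card_balance['SK_ID_PREV'].nunique())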

### MONTHS_BALANCE

In [104]:
plot_distribution(credit_card_balance, "MONTHS_BALANCE")

### AMT_BALANCE

In [138]:
plot_distribution(credit_card_balance, "AMT_BALANCE")

In [139]:
plot_distribution(credit_card_balance, "AMT_CREDIT_LIMIT_ACTUAL")
print("Max credit limit: %d" %credit_card_balance['AMT_CREDIT_LIMIT_ACTUAL'].max())
print("Min credit limit: %d" %credit_card_balance['AMT_CREDIT_LIMIT_ACTUAL'].min())

In [149]:
temp1 = credit_card_balance["NAME_CONTRACT_STATUS"].value_counts().sort_values(ascending=False)
print(temp1)
plt.figure(figsize=(10,8))
temp_perc = 100*(credit_card_balance["NAME_CONTRACT_STATUS"].value_counts().sort_values(ascending=False)/temp1.sum())
print(temp_perc)
sns.countplot(x="NAME_CONTRACT_STATUS", data=credit_card_balance)

## Next steps
**- The vast majority of the loans are currently active (96%)**

**- Next we will merge this with our other data to plot by the TARGET value**

## 3. One hot encoding and imputing missing data

In [6]:
le = LabelEncoder()
le_count = 0

# only label encode those variables with 2 or less categories
for col in application_train:
    if application_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(application_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(application_train[col])
            # Transform both training and testing data
            application_train[col] = le.transform(application_train[col])
            application_test[col] = le.transform(application_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

### Use one-hot encoding on the rest of the categorical variables

In [8]:
application_train = pd.get_dummies(application_train)
application_test = pd.get_dummies(application_test)

print('Training Features shape: ', application_train.shape)
print('Testing Features shape: ', application_test.shape)

### We can align the training and test set now since some variables in the training set are not in the test set
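
To illustrate why the alignment is needed, the sketch below one-hot encodes two made-up frames where one category only occurs in the training data, then aligns them with an inner join on the columns. The frames and category values are hypothetical.

```python
import pandas as pd

# Made-up frames: "Maternity leave" only occurs in the training data
train_toy = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "Maternity leave"],
    "TARGET": [0, 0, 1],
})
test_toy = pd.DataFrame({"NAME_INCOME_TYPE": ["Working", "Pensioner"]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# Keep TARGET aside, because the inner join drops columns missing from test
target_toy = train_enc["TARGET"]
train_enc, test_enc = train_enc.align(test_enc, join="inner", axis=1)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After the alignment both frames have the same columns, and TARGET has to be re-attached from the saved copy if it is needed again.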

In [9]:
target = application_train['TARGET']

application_train, application_test = application_train.align(application_test, join = 'inner', axis = 1)

print('Training Features shape: ', application_train.shape)
print('Testing Features shape: ', application_test.shape)

In [11]:
# put target back in training set
#application_train['TARGET'] = target

In [13]:
from sklearn.pipeline import Pipeline
# Drop the target from the training data
if 'TARGET' in application_train:
    train = application_train.drop(columns = ['TARGET'])
else:
    train = application_train.copy()
features = list(train.columns)

test = application_test.copy()

# Median imputation of missing values
imputer = Imputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the training data
imputer.fit(train)

# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(application_test)

# Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)

## 4. Baseline Model Logistic Regression


In [15]:
# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0005)

# Train on the training data
log_reg.fit(train, target)

# Make predictions
# Make sure to select the second column only
log_reg_pred = log_reg.predict_proba(test)[:, 1]
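
The imputing, scaling and model steps could also be chained with scikit-learn's `Pipeline`, which was imported earlier but not used. The sketch below does that on synthetic data (not the competition data), using `SimpleImputer` for the median imputation. The step names are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with some missing values (not the competition data)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X[rng.rand(200, 3) < 0.1] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler(feature_range=(0, 1))),
    ("model", LogisticRegression(C=0.0005)),
])

# fit runs every step in order; predict_proba reuses the fitted steps
pipe.fit(X, y)
proba = pipe.predict_proba(X)[:, 1]
print(proba[:5])
```

A pipeline like this keeps the preprocessing fitted on the training data only, which also makes it easier to cross-validate later on.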

In [19]:
# Submission dataframe
# Copy so that adding a column does not modify a view of application_test
submit = application_test[['SK_ID_CURR']].copy()
submit['TARGET'] = log_reg_pred

submit.head()

In [20]:
# Save the submission to a csv file
submit.to_csv('logistic_baseline.csv', index = False)