<hr>
<h2 style="text-decoration:bold;color:Navy;" >Introduction &#127908; </h2>
<hr>

<html>
    <img src="https://cdn.standardmedia.co.ke/images/friday/top_10_most_attracti5ea2fa735cf45.jpeg" style="width:1200px;height:300px;background-color:lightgrey;" alt="">
    <p style = "color:brown;font-size:16px;" >
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
The data used is shown below:
        </p> 
                <img src="https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png" style=" display:block; margin-left:auto;margin-right:auto; width:1400px;height:650px;background-color:lightgrey;" alt="">
        
     
</html>

<h2 style = "color:brown;font-size:18px;" >
In this notebook, we will stick to using only the main application training and testing data. 
 We will apply a standard supervised classification model:
</h2>
<ul style = "color:brown;font-size:16px;">
    <li>
Supervised: The labels are included in the training data, and the goal is to train a model that predicts the labels from the features.
    </li>    
    <li>    
Classification: The label is a binary variable: 0 (will repay the loan on time) or 1 (will have difficulty repaying the loan).
    </li>    
</ul>

<hr>
<h2 style = "text-decoration:bold;color:Navy;">Import required packages and load data &#128451;</h2>
<hr>

In [None]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd 
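# Use the full option name; the short 'max_columns' pattern is ambiguous in newer pandas versions
pd.set_option('display.max_columns', None)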

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

# File system management
import os

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train_data = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
print('The train data has {} training observations with {} features'.format(train_data.shape[0],train_data.shape[1]))
train_data.head(3)

In [None]:
test_data = pd.read_csv('../input/home-credit-default-risk/application_test.csv')
print('The test data has {} test observations with {} features'.format(test_data.shape[0],test_data.shape[1]))
test_data.head(3)

<hr>
<h2 style="text-decoration:bold;color:Navy;" >Exploratory Data Analysis (EDA) &#128202; </h2>
<p style = "color:brown;font-size:16px;">
    The goal is to analyze our data to find patterns and correlations, and to calculate statistics that will help us better understand it.
</p>

<hr>

<h2 style="color:brown;">
Target Variable Distribution
</h2>

<p style="color:brown;font-size:16px;">
The target distribution shows that this is an imbalanced class problem: far more loans were repaid on time than were not repaid.
</p>

In [None]:
train_data['TARGET'].value_counts().plot.pie()
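
<p style="color:brown;font-size:16px;">
A pie chart shows the imbalance visually, but exact proportions are easier to quote. The sketch below is a minimal illustration on a made-up toy series; <code>toy_target</code> and <code>class_balance</code> are hypothetical names and the numbers are not from the competition data. The same helper could, in principle, be applied to <code>train_data['TARGET']</code>.
</p>

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class in a label column."""
    counts = labels.value_counts()
    shares = labels.value_counts(normalize=True)
    return pd.DataFrame({'count': counts, 'share': shares})


# Toy labels (assumption for illustration only): 9 repaid loans, 1 unpaid
toy_target = pd.Series([0] * 9 + [1] * 1, name='TARGET')
balance = class_balance(toy_target)
print(balance)
```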

<h2 style="color:brown;">
Determine Missing Values
</h2>


<p style = "color:brown;font-size:16px;">
    No action will be taken on missing values for now. In later steps, we will decide whether to drop the columns with a high number of missing values or fill them in with other values.
</p>


In [None]:
pd.options.display.float_format = '{:,.2f}'.format

# Count missing values per column in one step
# (DataFrame.append was removed in pandas 2.0, so rows are not appended one by one)
missing_counts = train_data.isnull().sum()

# Keep only columns that have missing values, sorted from most to least
df_missing_val = (
    missing_counts[missing_counts != 0]
    .astype(int)
    .sort_values(ascending=False)
    .rename_axis('column')
    .reset_index(name='number of missing values')
)

df_missing_val.head(10)
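
<p style = "color:brown;font-size:16px;">
    Raw counts are hard to judge without the dataset size, so a percentage column can help when later deciding whether to drop or fill a column. The following is a minimal sketch on a toy frame; <code>missing_value_table</code> and the toy columns are hypothetical and only illustrate the idea.
</p>

```python
import numpy as np
import pandas as pd


def missing_value_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    percent = 100 * counts / len(df)
    table = pd.DataFrame({'missing': counts, 'percent': percent})
    table = table[table['missing'] > 0]
    return table.sort_values('percent', ascending=False)


# Toy data (assumption for illustration only)
toy = pd.DataFrame({
    'a': [1, np.nan, 3, np.nan],      # 2 of 4 values missing
    'b': [1, 2, 3, 4],                # nothing missing
    'c': [np.nan, 'x', 'y', 'z'],     # 1 of 4 values missing
})
table = missing_value_table(toy)
print(table)
```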


<h2 style="color:brown;">
Dealing with Data Types
</h2>


<h3 style="color:brown;">
The table below shows the data types in our data and the number of features of each data type
</h3>


In [None]:
print(train_data.dtypes.value_counts())

<h3 style="color:brown;">
Let's look at the categorical variables
</h3>

In [None]:
print(train_data.select_dtypes('object').apply(pd.Series.nunique,axis=0).sort_values(ascending=False))

<h4 style="color:brown;">
We have two main methods of dealing with categorical variables:
    <ul>
        <li> One-hot encoding </li>
        <li> Label encoding </li>
    </ul>
We will apply label encoding to features with two or fewer unique categories and one-hot encoding to those with more than two. <br>
We will then apply a dimensionality reduction method known as <em>PCA</em> to reduce the number of dimensions while preserving as much of the information as possible.
</h4>
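
<p style="color:brown;font-size:16px;">
To make the difference between the two encodings concrete, here is a small sketch on a toy frame. The column names <code>OWNS_CAR</code> and <code>HOUSING</code> and their values are invented for illustration. Label encoding maps each category to an integer in a single column, while one-hot encoding creates one indicator column per category.
</p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (assumption for illustration only)
toy = pd.DataFrame({
    'OWNS_CAR': ['Y', 'N', 'Y', 'N'],               # two categories
    'HOUSING': ['rent', 'own', 'parents', 'own'],   # three categories
})

# Label encoding: categories are sorted, so 'N' -> 0 and 'Y' -> 1
le = LabelEncoder()
toy['OWNS_CAR'] = le.fit_transform(toy['OWNS_CAR'])

# One-hot encoding: one indicator column per HOUSING category
encoded = pd.get_dummies(toy, columns=['HOUSING'])
print(encoded)
```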

<h4 style="color:brown;">
i. Label Encoding
</h4>

In [None]:
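le = LabelEncoder()
le_count = 0

for col in train_data.columns:
    # Only text (object) columns need encoding
    if train_data[col].dtype == 'object':
        # Label-encode columns with two or fewer unique values
        if len(list(train_data[col].unique()))<=2:
            # Fit on the train data only, then apply the same mapping to both sets
            le.fit(train_data[col])
            train_data[col] = le.transform(train_data[col])
            test_data[col] = le.transform(test_data[col])
            
            le_count +=1
                   
print('{} columns were transformed'.format(le_count))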

<h4 style="color:brown;">
ii. One-Hot Encoding
</h4>

In [None]:
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)

print('The new train data shape after one-hot encoding is {} features and {} observations'.format(train_data.shape[1],train_data.shape[0]))
print('The new test data shape after one-hot encoding is {} features and {} observations'.format(test_data.shape[1],test_data.shape[0]))

<h4 style="color:brown;">One-hot encoding has created more columns in the train data than in the test data, since some categories that appear in the train dataset do not appear in the test dataset.
    We need to align the two datasets so that they have the same columns.
    However, as we align, we will ensure that the target variable is not excluded from the train data.
</h4>

<h4 style="color:brown;">
    After the alignment below, the training and testing datasets have the same features, which is required for machine learning
</h4>

In [None]:
train_labels = train_data.TARGET

train_data,test_data= train_data.align(test_data,join='inner', axis=1)
train_data['TARGET'] = train_labels

print('The new train data shape after aligning is {} features and {} observations'.format(train_data.shape[1],train_data.shape[0]))
print('The new test data shape after aligning is {} features and {} observations'.format(test_data.shape[1],test_data.shape[0]))
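
<p style="color:brown;font-size:16px;">
    The effect of an inner alignment is easier to see on a tiny example. In the sketch below, all frames and column names are made up for illustration. The toy train set has a one-hot column (<code>JOB_pilot</code>) that the toy test set lacks. Aligning with <code>join='inner'</code> keeps only the shared columns, which is why the target is saved first and added back afterwards.
</p>

```python
import pandas as pd

# Toy data (assumption for illustration only)
toy_train = pd.DataFrame({
    'AGE': [30, 45],
    'JOB_teacher': [1, 0],
    'JOB_pilot': [0, 1],   # category that never appears in the toy test set
    'TARGET': [0, 1],
})
toy_test = pd.DataFrame({'AGE': [52], 'JOB_teacher': [1]})

# Save the labels, because the inner join drops TARGET (absent from test)
toy_labels = toy_train['TARGET']

# Keep only the columns present in both frames
toy_train, toy_test = toy_train.align(toy_test, join='inner', axis=1)

# Restore the labels on the train side
toy_train['TARGET'] = toy_labels

print(toy_train.columns.tolist())
print(toy_test.columns.tolist())
```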


<h3 style="color:brown;">
Anomalies
</h3>

<h4 style="color:brown;">
    Let's look at other data anomalies.
    The days employed column has a maximum value of 365243 days (roughly 1,000 years), which is unreasonable.
    There are different ways of dealing with anomalies. I will choose to replace these values with NaN and record them in a separate flag column.
</h4>

In [None]:
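# Plot the raw values (not the summary statistics) to see the outlying spike
train_data['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram (before cleaning)')
train_data[['DAYS_EMPLOYED']].describe()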

In [None]:
# Create an anomalous flag column before overwriting the anomalous value
train_data['DAYS_EMPLOYED_ANOM'] = train_data["DAYS_EMPLOYED"] == 365243

# Replace the anomalous values with nan
# (assign back instead of using inplace on a column selection, which newer pandas versions do not support reliably)
train_data['DAYS_EMPLOYED'] = train_data['DAYS_EMPLOYED'].replace({365243: np.nan})

train_data['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

# Apply the same treatment to the test data
test_data['DAYS_EMPLOYED_ANOM'] = test_data["DAYS_EMPLOYED"] == 365243
test_data['DAYS_EMPLOYED'] = test_data['DAYS_EMPLOYED'].replace({365243: np.nan})

print('There are %d anomalies in the test data out of %d entries' % (test_data["DAYS_EMPLOYED_ANOM"].sum(), len(test_data)))
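
<p style="color:brown;font-size:16px;">
    The order of the two steps matters: the flag has to be computed before the value is replaced, otherwise the information about which rows were anomalous is lost. The following minimal sketch shows this on a toy column. The values other than 365243 are invented, and <code>SENTINEL</code> is a hypothetical name.
</p>

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # the implausible value observed in DAYS_EMPLOYED

# Toy data (assumption for illustration only)
toy = pd.DataFrame({'DAYS_EMPLOYED': [-1200, SENTINEL, -300, SENTINEL]})

# 1. Flag the anomalous rows while the sentinel is still present
toy['DAYS_EMPLOYED_ANOM'] = toy['DAYS_EMPLOYED'] == SENTINEL

# 2. Replace the sentinel with NaN so it no longer distorts statistics
toy['DAYS_EMPLOYED'] = toy['DAYS_EMPLOYED'].replace({SENTINEL: np.nan})

n_anom = int(toy['DAYS_EMPLOYED_ANOM'].sum())
print(toy)
```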

<h3 style="color:brown;">
Correlations
</h3>

<h4 style="color:brown;">
Let's look at the features most negatively and most positively correlated with the target variable.
    From there, we will look at their distributions.
    We will then proceed to visualize the correlations
</h4>

In [None]:
# Correlation of every feature with the target, sorted from most negative to most positive
correlations = train_data.corr()['TARGET'].sort_values()
high_correlations = correlations.tail(15)
low_correlations = correlations.head(15)

print('most positive correlations:\n', high_correlations)
print('most negative correlations:\n', low_correlations)
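
<p style="color:brown;font-size:16px;">
    As a small illustration of how this ranking works, the sketch below builds a toy frame with invented feature names and values, and sorts the correlations with the target. It drops the target's own entry first, since a variable's correlation with itself is always 1 and would otherwise sit at the top of the positive list.
</p>

```python
import pandas as pd

# Toy data (assumption for illustration only)
toy = pd.DataFrame({
    'TARGET':       [0, 0, 1, 1, 0, 1],
    'FEATURE_UP':   [1, 2, 5, 6, 1, 7],   # larger when TARGET is 1
    'FEATURE_DOWN': [9, 8, 2, 1, 9, 3],   # smaller when TARGET is 1
    'FEATURE_FLAT': [4, 6, 4, 6, 5, 5],   # same values in both classes
})

# Pearson correlation with the target, excluding the target itself
correlations = toy.corr()['TARGET'].drop('TARGET').sort_values()

most_negative = correlations.index[0]
most_positive = correlations.index[-1]
print(correlations)
```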

In [None]:
most_corr=train_data[['NAME_INCOME_TYPE_Working','REGION_RATING_CLIENT',
                      'REGION_RATING_CLIENT_W_CITY','DAYS_EMPLOYED','DAYS_BIRTH','TARGET']]
most_corr_corr = most_corr.corr()

sns.set_style("dark")
sns.set_context("notebook", font_scale=2.0, rc={"lines.linewidth": 1.0})
fig, axes = plt.subplots(figsize = (20,10),sharey=True)
sns.heatmap(most_corr_corr,cmap=plt.cm.RdYlBu_r,vmin=-0.25,vmax=0.6,annot=True)
plt.title('Correlation Heatmap for features with the most positive correlations with the target variable')


In [None]:


# iterate through the most positively correlated features
for i, source in enumerate(['NAME_INCOME_TYPE_Working','REGION_RATING_CLIENT','REGION_RATING_CLIENT_W_CITY','DAYS_EMPLOYED','DAYS_BIRTH']):
    
    # create a new figure for each feature
    sns.set_style("dark")
    sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.0})
    fig, axes = plt.subplots(figsize = (10,5))
    # plot repaid loans
    sns.kdeplot(train_data.loc[train_data['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(train_data.loc[train_data['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots and show the legend so the two curves can be told apart
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    plt.legend();
    


In [None]:
sns.set_style("dark")
sns.set_context("notebook", font_scale=2.0, rc={"lines.linewidth": 1.0})
fig, axes = plt.subplots(figsize = (20,10),sharey=True)
least_corr = train_data[['EXT_SOURCE_3','EXT_SOURCE_2','EXT_SOURCE_1','NAME_EDUCATION_TYPE_Higher education','CODE_GENDER_F']]
least_corr_corr = least_corr.corr()
sns.heatmap(least_corr_corr,cmap=plt.cm.RdYlBu_r,vmin=-0.25,vmax=0.6,annot=True)
plt.title('Correlation Heatmap for features with the most negative correlations with the target variable')

<html>
    <h2 style="text-decoration:italics;color:red;">Work in Progress. 4 Days Left to Complete</h2>    
</html>