In [this dataset](https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset), we need to predict whether a loan should be approved based on the applicant's information. This is a binary classification problem, and we will use a Decision Tree Classifier to make the prediction.

# Import Libraries
First, we import the necessary libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings('ignore')

# Import The Data

In [None]:
train = pd.read_csv('/kaggle/input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv')

# Read The Data
First, let's look at the first 5 rows to familiarize ourselves with the data.

In [None]:
train.head()

For more detail, we call ```info()``` and ```describe()``` and count the missing values per column to get a quick overview of the data.

In [None]:
train.info()

In [None]:
train.describe(include='all')

In [None]:
train.isnull().sum().sort_values(ascending=False)

### Quick observations on the data
- Total applicants: 614
- Feature that can be dropped from training immediately:
    - **Loan_ID**
- The **Loan_Status** feature is the target array. It can also serve as a reference when filling missing values, so we will not drop it immediately.
- The **Dependents** feature is stored as categorical but contains numerical values, so we have to convert it to a numerical feature.
- The **Credit_History** feature is stored as numerical, where 1 means 'Yes' and 0 means 'No'. We will convert it to a categorical feature.
- Features that have missing values:
    - **Credit_History:**        50
    - **Self_Employed:**         32
    - **LoanAmount:**            22
    - **Dependents:**            15
    - **Gender:**                14
    - **Loan_Amount_Term:**      13
    - **Married:**                3
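
The missing-value counts above are easier to compare when they are shown as a share of all rows. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the loan dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up).
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female', 'Male'],
    'LoanAmount': [120.0, np.nan, np.nan, 66.0],
    'Married': ['Yes', 'No', 'Yes', 'No'],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    'missing': missing,
    'percent': (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually have gaps, largest first.
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)
print(summary)
```

Applied to `train`, the same pattern would show which features lose only a few rows and which lose a noticeable fraction.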

### Plot The Distribution of Numerical Features

In [None]:
#plot the distribution of numerical features
train.hist(bins=50,figsize=(10,10),grid=False)
plt.tight_layout()
plt.show()

The histograms show that several features are skewed. We will address this in a later step by taking the log of the values to bring them closer to a normal distribution, which can help some models.
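
To see what a log transform does to a skewed feature, here is a small sketch on synthetic data. The `income` series is an assumption (a log-normal sample, not the real loan data), and `np.log1p` is used because it is defined at zero. Note that a log transform reduces right skew; it is not expected to help a left-skewed feature.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for an income-like feature.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000))

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = income.skew()
skew_after = np.log1p(income).skew()

print(f"skew before log1p: {skew_before:.2f}")
print(f"skew after  log1p: {skew_after:.2f}")
```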

# Exploratory Data Analysis

### Drop Features

In [None]:
#drop feature
train.drop(['Loan_ID'], axis=1, inplace=True)

### Change to Numerical

In [None]:
#check unique values
train['Dependents'].unique()

In [None]:
#replace '3+' with '3' (assign back instead of a chained inplace call)
train['Dependents'] = train['Dependents'].replace('3+', '3')

#change to numerical
train['Dependents'] = train['Dependents'].astype('float')

### Change to Categorical

In [None]:
#check unique values
train['Credit_History'].unique()

In [None]:
#replace 1.0 with 'Y' and 0.0 with 'N' (missing values stay NaN)
train['Credit_History'] = train['Credit_History'].map({1.: 'Y', 0.: 'N'})

#change to categorical
train['Credit_History'] = train['Credit_History'].astype('object')

### Fill Missing Values

- Fill with mode()

In [None]:
#fill missing values with mode 
features_fill_with_mode = ['Self_Employed',
                           'Dependents',
                           'Gender',
                           'Loan_Amount_Term']

for feature in features_fill_with_mode:
    #assign back; a chained inplace fillna may not update the DataFrame
    train[feature] = train[feature].fillna(train[feature].mode()[0])

- Fill with mean()

In [None]:
#fill missing values with mean
train['LoanAmount'] = train['LoanAmount'].fillna(train['LoanAmount'].mean())

- Credit_History Feature

Before we fill missing values in Credit_History feature, we will take a deeper look by plotting it.

In [None]:
sns.countplot(x='Credit_History', hue='Loan_Status', data=train);

From the plot above, we can see that Credit_History is an important feature. Most people without a credit history did not get a loan, while people with a credit history had a much better chance of getting one.

Since Credit_History = 'Y' is the most frequent value for both Loan_Status classes, we will fill the missing values with 'Y'.
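
The relationship read off the count plot can also be expressed as approval rates per group. This is a minimal sketch using `pd.crosstab` on a made-up toy frame (the values are illustrative assumptions, not the real counts):

```python
import pandas as pd

# Toy data: 'Y'/'N' labels mirror the notebook's encoding, values are made up.
toy = pd.DataFrame({
    'Credit_History': ['Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N'],
    'Loan_Status':    ['Y', 'Y', 'N', 'Y', 'N', 'N', 'Y', 'Y'],
})

# normalize='index' turns each row into shares that sum to 1,
# i.e. the approval and rejection rate within each Credit_History group.
rates = pd.crosstab(toy['Credit_History'], toy['Loan_Status'], normalize='index')
print(rates)
```

Running the same call on `train` before filling the gaps would give the actual approval rate for each Credit_History value.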

In [None]:
train['Credit_History'] = train['Credit_History'].fillna('Y')

- Married Feature

To start, we check whether the rows with a missing Married value have Dependents or CoapplicantIncome greater than 0. If so, we fill Married with 'Yes'; otherwise we fill it with 'No'.

In [None]:
#check Dependents and CoapplicantIncome
mask = ((train['Dependents'] > 0) | (train['CoapplicantIncome'] > 0)) \
        & \
        train['Married'].isnull()

train[mask][['Married','Dependents','CoapplicantIncome']]

In [None]:
#fill missing values
train.loc[mask,'Married'] = 'Yes'
train['Married'] = train['Married'].fillna('No')

### Target Array
Let's look at the target distribution.

In [None]:
sns.countplot(x='Loan_Status', data=train);

From the distribution above, the classes are not severely imbalanced, so we can go straight to the next step: converting the target to a numerical feature.
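
Class balance can also be checked numerically rather than by eye. The sketch below uses a made-up `status` series (a 70/30 split chosen for illustration, not the real proportions). The majority share is also the accuracy of a model that always predicts the majority class, which gives a baseline for the classifier later on.

```python
import pandas as pd

# Toy target with a 70/30 split (illustrative values only).
status = pd.Series(['Y'] * 7 + ['N'] * 3, name='Loan_Status')

# normalize=True returns class shares instead of raw counts.
shares = status.value_counts(normalize=True)
majority_share = shares.max()

print(shares)
print(f"baseline accuracy (always predict majority): {majority_share:.2f}")
```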

In [None]:
#transform to numerical
train['Loan_Status'] = train['Loan_Status'].apply(lambda x: 1 if x=='Y' else 0)

#correlation (numeric columns only; object columns cannot be correlated)
sns.heatmap(train.corr(numeric_only=True),annot=True);

In [None]:
#copy 
target_array = train['Loan_Status'].copy()

#drop
train.drop(['Loan_Status'], axis=1, inplace=True)

### Creating new features

In [None]:
#create total income feature
train['Total_Income'] = train['ApplicantIncome'] + train['CoapplicantIncome']

#create average loan amount feature (per month, since Loan_Amount_Term is in months)
train['Loan_Amount_Avg'] = train['LoanAmount'] / train['Loan_Amount_Term']

#drop
train.drop(['ApplicantIncome','CoapplicantIncome'], axis=1, inplace=True)

### Epilogue

- Check for any missing values

In [None]:
#missing values
print(train.isnull().any().sum())

- Normality Test

In [None]:
#define a normality test function
from scipy import stats

def normalityTest(data, alpha=0.05):
    """Return True if the sample looks normally distributed.

    data (array)  : The array containing the sample to be tested.
    alpha (float) : Significance level.
    """
    statistic, p_value = stats.normaltest(data)

    #null hypothesis: array comes from a normal distribution
    #p_value < alpha  -> the null hypothesis can be rejected (not normal)
    #p_value >= alpha -> the null hypothesis cannot be rejected
    return bool(p_value >= alpha)

In [None]:
#check normality of all numerical features and log-transform those that are not normally distributed
for feature in train.columns:
    if train[feature].dtype != 'object':
        if not normalityTest(train[feature]):
            train[feature] = np.log1p(train[feature])
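
The loop above transforms a feature once and does not check the result. As a hedged sketch of how one might verify the effect, the example below runs `scipy.stats.normaltest` before and after `np.log1p` on a synthetic log-normal sample (the `skewed` array is an assumption, not loan data). A smaller test statistic means the sample is closer to normal.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (log-normal), standing in for a numeric feature.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=8.0, sigma=1.0, size=1000)

# D'Agostino-Pearson test on the raw values and on the transformed values.
stat_raw, p_raw = stats.normaltest(skewed)
stat_log, p_log = stats.normaltest(np.log1p(skewed))

print(f"raw   : statistic={stat_raw:.1f}, p={p_raw:.3g}")
print(f"log1p : statistic={stat_log:.1f}, p={p_log:.3g}")
```

On real data the transformed feature may still fail the test, so the statistic is often more informative than the yes/no decision.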

- Creating Dummies

In [None]:
#create dummies
train = pd.get_dummies(train, drop_first=True)

print(train.shape)
display(train.head())
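
To illustrate what `drop_first=True` does, here is a minimal sketch on a made-up frame (the `toy` values are assumptions). For each categorical column, the first category in sorted order is dropped, so a two-level feature becomes a single indicator column.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (values are made up).
toy = pd.DataFrame({
    'LoanAmount': [120.0, 66.0, 141.0],
    'Gender': ['Male', 'Female', 'Male'],
    'Married': ['Yes', 'No', 'Yes'],
})

# Numeric columns pass through unchanged; each categorical column is expanded
# into indicator columns, minus the first category ('Female' and 'No' here).
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```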

- Creating features matrix (X) and target array (y)

In [None]:
X = train
y = target_array

# Creating a Model
We begin by splitting the data into two subsets: one for training and one for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state = 0)
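
One optional variation, not used in this notebook, is to pass `stratify=y` so that both subsets keep the same class proportions as the full data. A minimal sketch on made-up arrays (`X_toy` and `y_toy` are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 70% positive class (values are made up).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

# stratify=y_toy keeps the 70/30 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(f"positive share in train: {y_tr.mean():.2f}")
print(f"positive share in test : {y_te.mean():.2f}")
```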

Model training: Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

#create a model
model = DecisionTreeClassifier()

In [None]:
#search grid for optimal parameters
from sklearn.model_selection import GridSearchCV

param_grid = {'random_state' : [0,42],
              'max_depth': [1,10,100]}

grid = GridSearchCV(model, param_grid, cv=5)

grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
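
Beyond the single best score, `cv_results_` holds the mean and spread of the cross-validation score for every parameter combination. The sketch below shows one way to inspect it on synthetic data from `make_classification` (the data, the `grid_toy` name, and the `max_depth` values are assumptions for illustration). Here `random_state` is fixed on the estimator rather than searched, since it controls reproducibility rather than model capacity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the loan features.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

grid_toy = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [1, 3, 10]},
    cv=5,
)
grid_toy.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid_toy.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```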

In [None]:
from sklearn.metrics import classification_report

#use the best model
model = grid.best_estimator_

#make a prediction
y_predict = model.predict(X_test)

#calculate classification report
print(classification_report(y_test,y_predict))
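
The classification report summarizes precision and recall per class. A confusion matrix shows the raw counts behind those numbers. This is a minimal sketch with made-up labels (`y_true_toy` and `y_pred_toy` are illustrative, not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = approved, 0 = rejected).
y_true_toy = [1, 0, 1, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes, both ordered [0, 1].
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"true negatives={tn}, false positives={fp}, false negatives={fn}, true positives={tp}")
```

The same call with `y_test` and `y_predict` would show how many rejected applications the model wrongly approves, which is often the costlier error in lending.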