**Sources**:

https://www.kaggle.com/kamalkhumar/loan-status-prediction

https://www.kaggle.com/charlessamuel/are-you-getting-the-loan-loan-status-prediction

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **DS Toolkit for Classification**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn import metrics
from sklearn import preprocessing
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline

In [None]:
FILEPATH = '/kaggle/input/loan-data-set/loan_data_set.csv'

In [None]:
df = pd.read_csv(FILEPATH)
df.head()

# **DataFrame Analysis**

* Size
* Description
* Info
* Null values in any columns

In [None]:
df.shape

In [None]:
df.describe(include='all')

In [None]:
df.info()

In [None]:
df.isnull().any()

There are null values, but how many exactly?

In [None]:
df.isna().sum()
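
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the loan data) that reports both:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data; the values are made up.
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Credit_History": [1.0, np.nan, np.nan, 0.0],
})

# isna().sum() counts missing cells per column; isna().mean() gives the fraction.
missing = pd.DataFrame({
    "count": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
})
print(missing)
```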

Missing values in categorical columns such as Gender and Married are hard to replace sensibly.

For lack of a better strategy, we drop the rows where any of the remaining columns is still missing a value.

For the numerical columns Credit_History and Loan_Amount_Term, let's fill the gaps with the column mean first.

In [None]:
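# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which can act on a temporary copy in recent pandas versions.
df["Credit_History"] = df["Credit_History"].fillna(df["Credit_History"].mean())
df["Loan_Amount_Term"] = df["Loan_Amount_Term"].fillna(df["Loan_Amount_Term"].mean())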

In [None]:
df.dropna(how="any",inplace=True)

In [None]:
df.isnull().any()

OK, the null values are handled.
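
Dropping rows is not the only option for categorical columns. As an alternative (not what this notebook does), here is a minimal sketch of filling missing categories with the most frequent value, the mode. The toy frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data; the values are made up.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Married": ["Yes", np.nan, "No", "Yes"],
})

# Fill each categorical column with its most frequent value (the mode).
for column in ["Gender", "Married"]:
    most_frequent = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```

Mode imputation keeps every row, at the cost of nudging the column towards its majority class.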

Loan_ID only identifies a row and carries no predictive information, so we drop it.

In [None]:
df.drop("Loan_ID", axis=1, inplace=True)

A classifier needs numeric inputs, so the categorical (object) columns have to be encoded. This notebook uses LabelEncoder.
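
To see what the encoding does, here is a minimal sketch on a made-up list of labels (standing in for a column such as Loan_Status). LabelEncoder sorts the distinct values and assigns each one its position as an integer code:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for a categorical column.
labels = ["Y", "N", "Y", "Y", "N"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted original values; each value's index is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Keeping one fitted encoder per column, rather than refitting a single one, makes it possible to map the codes back later with `inverse_transform`.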

In [None]:
le = LabelEncoder()
cols = df.columns.tolist()
for column in cols:
    if df[column].dtype == 'object':
        df[column] = le.fit_transform(df[column])

In [None]:
df.dtypes

Alright, on to:

# The Heatmap of Correlation

In [None]:
fig, ax = plt.subplots(figsize=(20, 15))
sns.heatmap(data=df.corr().round(2), annot=True, linewidths=0.7, cmap='YlGnBu')
plt.show()
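
A full heatmap can be hard to scan when only the relationship with the target matters. A minimal sketch, on a made-up toy frame that borrows three column names from this dataset, of ranking features by the absolute value of their correlation with Loan_Status:

```python
import pandas as pd

# Toy, already-encoded frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 0, 1],
    "Loan_Amount_Term": [360, 360, 360, 180, 360, 180],
    "Loan_Status": [1, 1, 0, 1, 0, 0],
})

# Take the target's column of the correlation matrix, drop the target itself,
# and sort by absolute correlation so strong negative values rank high too.
target_corr = (
    toy.corr()["Loan_Status"]
    .drop("Loan_Status")
    .sort_values(key=abs, ascending=False)
)
print(target_corr.round(2))
```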

The helper below plots the feature importances of a fitted model as a horizontal bar chart, sorted from most to least important.

In [None]:
def plot_feature_importance(importance, names, model_type):
    """Plot feature importances as a bar chart, most important first."""

    # Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    # Create a DataFrame using a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)

    # Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    # Define size of bar plot
    plt.figure(figsize=(10, 8))
    # Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    # Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

# **Random Forest Classifier**

In [None]:
X = df.drop("Loan_Status", axis=1)
y = df["Loan_Status"]

rand_f = RandomForestClassifier().fit(X, y)

plot_feature_importance(rand_f.feature_importances_, X.columns, 'RANDOM FOREST')

# Top 5

* Credit_History
* Applicant_Income
* Loan_Amount
* Coapplicant_Income
* Loan_Amount_Term

This makes sense, since these are the factors a bank weighs when deciding whether to grant a loan.
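
A top-5 list like the one above can also be read off programmatically instead of from the chart. A minimal sketch on synthetic data (the feature names `f0` to `f7` are placeholders, not loan columns), with a fixed `random_state` because the ranking of a random forest can change between runs:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names are placeholders for illustration.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
names = [f"f{i}" for i in range(X_toy.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its feature name and keep the five largest.
importances = pd.Series(model.feature_importances_, index=names)
top5 = importances.nlargest(5)
print(top5)
```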

# **Gradient Boosting Classifier**

In [None]:
gb_m = GradientBoostingClassifier().fit(X, y)

plot_feature_importance(gb_m.feature_importances_, X.columns, 'GRADIENT BOOSTING')

The top 5 are the same as for Random Forest.

# **Ada Boosting Classifier**

In [None]:
ada = AdaBoostClassifier().fit(X, y)

plot_feature_importance(ada.feature_importances_, X.columns, 'ADA BOOST')

Interestingly, Property Area sneaks ahead of Credit History with AdaBoost.

In [None]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

svc = SVC()

grid = GridSearchCV(svc, parameters)

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=30) 

In [None]:
grid.fit(Xtrain, ytrain)

In [None]:
grid.best_params_
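
The grid search above runs on unscaled features. SVMs are sensitive to feature scale, so one variation worth trying (an assumption on my part, not something this notebook does) is to put a StandardScaler in front of the SVC inside a pipeline. That way the scaler is refit on each cross-validation fold. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan features and target.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=30
)

# make_pipeline names each step after its lowercased class, hence the "svc__" prefix.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [1, 10]}

search = GridSearchCV(pipeline, param_grid)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(test_accuracy)
```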

In [None]:
pred = grid.best_estimator_.predict(Xtest)

In [None]:
confusion_matrix(ytest,pred)
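
For a binary target, the four cells of the confusion matrix can be unpacked and turned into precision, recall and F1. A minimal sketch with made-up labels (1 standing for an approved loan, 0 for a rejected one):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true and predicted labels for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes, so ravel() yields
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```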

In [None]:
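# accuracy_score may return a plain Python float, which has no .round() method,
# so use the built-in round() instead.
print("Accuracy score: {0}%".format(round(accuracy_score(ytest, pred) * 100, 2)))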

In [None]:
fig,ax=plt.subplots(figsize=(15,8))
sns.regplot(x=ytest,y=pred,marker="*")
plt.show()