##### Among all industries, insurance is one of the heaviest users of analytics and data science methods. This dataset gives you a taste of working with data from insurance companies: the challenges faced, the strategies used, the variables that influence the outcome, and so on. This is a classification problem. The data has 614 rows and 13 columns.

**Problem: Predict if a loan will get approved or not.**

We are going to work on a binary classification problem: given some information about a sample of applicants, we need to predict whether each of them should be given a loan. The sample is small (614 rows), so we will use classical machine learning techniques to solve the problem.

## The Dataset
The dataset contains the following variables:
- Loan ID, the identifier code of each applicant.
- Gender, Male or Female for each applicant.
- Married, the marital status.
- Dependents, the number of dependents the applicant has.
- Education, the level of education: graduate or not graduate.
- Self Employed, Yes or No.
- Applicant Income
- Coapplicant Income
- Loan Amount
- Loan Amount Term
- Credit History, 1 (Yes) or 0 (No).
- Property Area, whether the applicant's property is in an urban, semiurban or rural area.
- Loan Status, Yes or No (the dependent variable, i.e. the target).

## Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv')

Let's analyse our data with pandas profiling first

In [None]:
!pip install pandas_profiling

In [None]:
from pandas_profiling import ProfileReport

In [None]:
design_report = ProfileReport(df)
design_report.to_file(output_file='report.html')

In [None]:
design_report

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
df.info()

Some columns have missing values; we will handle them as we go.

In [None]:
df.describe()

Credit history only takes the values 1 or 0, so let's treat it as a categorical column by casting it to the object dtype.

In [None]:
df['Credit_History'].value_counts()

In [None]:
df['Credit_History'] = df['Credit_History'].astype('O')

In [None]:
df.describe(include='O')

We will drop `Loan_ID` because it is only an identifier and carries no information for our model.

In [None]:
df.drop('Loan_ID',axis=1,inplace=True)

##### Do we have any duplicates?

In [None]:
df.duplicated().any()

There are no duplicated rows.

#### Let's look at our target

In [None]:
df.Loan_Status.value_counts().plot.bar(color='blue')

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='Loan_Status', data=df);

# value_counts(normalize=True) returns each class's share of the rows (a fraction, not a percentage)
class_share = df['Loan_Status'].value_counts(normalize=True)
print('The proportion of Y class : %.2f' % class_share['Y'])
print('The proportion of N class : %.2f' % class_share['N'])

In [None]:
# Credit_History

# note: FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(df,col='Loan_Status', height=3.2, aspect=1.6)
grid.map(sns.countplot, 'Credit_History');

# most applicants with Credit History = 0 did not get a loan,
# while most applicants with Credit History = 1 did,
# so Credit History = 1 gives you a better chance of getting a loan

# important feature
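The count plots show absolute counts, so groups of different sizes are hard to compare by eye. One way to put a number on the pattern above is a row-normalised cross-tabulation. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not the real data); on the notebook's `df` the same call would use `df['Credit_History']` and `df['Loan_Status']`.

```python
import pandas as pd

# Toy data standing in for the real columns (values are made up for illustration)
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    'Loan_Status':    ['Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'Y'],
})

# normalize='index' makes every row sum to 1, so each cell is the share of
# Y or N outcomes within one Credit_History group
approval_rates = pd.crosstab(toy['Credit_History'], toy['Loan_Status'], normalize='index')
print(approval_rates)
```

The same call works for any of the categorical features explored below, which makes "good feature" versus "no pattern" judgements easier to check.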

In [None]:
# Gender

grid = sns.FacetGrid(df,col='Loan_Status', height=3.2, aspect=1.6)
grid.map(sns.countplot, 'Gender');

# most males got a loan and most females got one too (no pattern)

# probably not an important feature, we will see later

In [None]:
# Married
plt.figure(figsize=(15,5))
sns.countplot(x='Married', hue='Loan_Status', data=df);

# most married applicants got a loan
# if you're married, you have a better chance of getting a loan :)
# good feature

Before analysing the Dependents column, let's look at the different values it contains.

In [None]:
df.Dependents.value_counts()

In [None]:
# Dependents

plt.figure(figsize=(15,5))
sns.countplot(x='Dependents', hue='Loan_Status', data=df);

# applicants with Dependents = 0 have a higher chance of getting a loan (a very high chance)
# good feature

In [None]:
# Education

grid = sns.FacetGrid(df,col='Loan_Status', height=3.2, aspect=1.6)
grid.map(sns.countplot, 'Education');

# graduates and non-graduates have almost the same chance of getting a loan (no pattern)
# most applicants are graduates, and most of them got a loan
# most non-graduates also got a loan, but at a lower rate than graduates

# not an important feature

In [None]:
# Self_Employed

grid = sns.FacetGrid(df,col='Loan_Status', height=3.2, aspect=1.6)
grid.map(sns.countplot, 'Self_Employed');

# no pattern (same as Education)

In [None]:
# Property_Area

plt.figure(figsize=(15,5))
sns.countplot(x='Property_Area', hue='Loan_Status', data=df);

# applicants with a Semiurban property got a loan more than 50% of the time

# good feature

### Correlation

In [None]:
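# correlate only the numeric columns; recent pandas versions raise an error on text columns
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True)
plt.show()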

### Processing our data

Missing values

In [None]:
df.isnull().sum().sort_values(ascending = False)

Let's separate the numerical columns from the categorical


In [None]:
cat_data = []
num_data = []

for i,c in enumerate(df.dtypes):
    if c == object:
        cat_data.append(df.iloc[:, i])
    else :
        num_data.append(df.iloc[:, i])

In [None]:
cat_data = pd.DataFrame(cat_data).transpose()
num_data = pd.DataFrame(num_data).transpose()
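As an aside, pandas can do the same split without an explicit loop. The sketch below is an alternative under the assumption that "categorical" means "every non-numeric column"; the frame `toy` and the names `cat_part` and `num_part` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and two numeric columns (values are made up)
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', None],
    'ApplicantIncome': [5849, 4583, 3000],
    'LoanAmount': [np.nan, 128.0, 66.0],
})

# select_dtypes filters columns by dtype and keeps the original dtypes intact
num_part = toy.select_dtypes(include='number')
cat_part = toy.select_dtypes(exclude='number')

print(list(cat_part.columns))
print(list(num_part.columns))
```

In this notebook, `Credit_History` was cast to object earlier, so a split like this would also put it with the categorical columns.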

In [None]:
cat_data.head()

In [None]:
num_data.head()

In [None]:
cat_data.isnull().sum()

In [None]:
# Fill the missing values of each column with that column's most frequent value (its mode)

cat_data = cat_data.apply(lambda x:x.fillna(x.value_counts().index[0]))
cat_data.isnull().sum().any() # no more missing data

In [None]:
num_data.isnull().sum()

In [None]:
# fill every missing value with the next valid value in the same column (backward fill)
# note: fillna(method='bfill') is deprecated in recent pandas; .bfill() is the equivalent

num_data = num_data.bfill()
num_data.isnull().sum().any() # no more missing data
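To make the direction of the fill concrete, here is a minimal sketch on a made-up series. The median fill at the end is not used in this notebook; it is shown only as an assumption-labelled alternative that does not depend on row order.

```python
import numpy as np
import pandas as pd

# Made-up values with gaps at the start, in the middle and at the end
s = pd.Series([np.nan, 120.0, np.nan, 80.0, np.nan])

backward = s.bfill()   # each gap takes the NEXT valid value; a trailing NaN stays NaN
forward = s.ffill()    # each gap takes the PREVIOUS valid value; a leading NaN stays NaN
median_filled = s.fillna(s.median())  # alternative (assumption): order-independent fill

print(backward.tolist())
print(forward.tolist())
print(median_filled.tolist())
```

Because a backward fill cannot fill a missing value in the last row, it is worth keeping the `isnull().sum().any()` check after it.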

### Categorical columns


We are going to use LabelEncoder to turn the categorical values into integers:

In [None]:
from sklearn.preprocessing import LabelEncoder  
le = LabelEncoder()
cat_data.head()

In [None]:
# transform the target column: Y -> 0, N -> 1
# (so the positive class for precision/recall is a rejected loan)

target_values = {'Y': 0 , 'N' : 1}

target = cat_data['Loan_Status']
cat_data.drop('Loan_Status', axis=1, inplace=True)

target = target.map(target_values)

In [None]:
# transform other columns

for i in cat_data:
    cat_data[i] = le.fit_transform(cat_data[i])
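Re-fitting a single `LabelEncoder` in a loop works, but afterwards the encoder only remembers the last column's mapping. A minimal sketch of keeping one encoder per column, so that every mapping can be inspected or inverted later, could look like this (`toy` and the `encoders` dict are hypothetical names for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample of two categorical columns
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Property_Area': ['Urban', 'Rural', 'Semiurban'],
})

encoders = {}          # hypothetical helper: column name -> fitted encoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so a value's integer code is its position in this array
for col, enc in encoders.items():
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))
```

Note that the codes follow alphabetical order, so they imply an ordering (for example Rural < Semiurban < Urban) that tree-based models tolerate better than linear ones.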

In [None]:
cat_data.head(10)

In [None]:
target.head()

We can now concatenate our categorical data, numerical data, and target.

In [None]:
df = pd.concat([cat_data, num_data, target], axis=1)

In [None]:
df.head()

### Train the models

In [None]:
X = pd.concat([cat_data,num_data],axis=1)
y = target

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, log_loss, accuracy_score


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [None]:
    
print('X_train shape', X_train.shape)
print('y_train shape', y_train.shape)
print('X_test shape', X_test.shape)
print('y_test shape', y_test.shape)
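Since the two classes are not equally frequent, one option worth knowing is a stratified split, which keeps the class proportions the same in the train and test sets. The sketch below is not what the cell above does; it only illustrates the `stratify` argument of `train_test_split` on made-up data (`X_toy`, `y_toy`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 70 zeros and 30 ones (made-up numbers)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy preserves the 70/30 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

print('share of class 1 in train:', y_tr.mean())
print('share of class 1 in test :', y_te.mean())
```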

### Using Logistic Regression

In [None]:
model = LogisticRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)


In [None]:
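def metrics(y_true,y_pred,retu=False):
    # precision, recall and f1 use sklearn's default positive label 1, i.e. 'N' (loan rejected) here
    pre = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # note: log_loss is computed on hard 0/1 predictions here, not on predicted probabilities
    loss = log_loss(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    
    if retu:
        return pre, rec, f1, loss, acc
    else:
        print('  pre: %.3f\n  rec: %.3f\n  f1: %.3f\n  loss: %.3f\n  acc: %.3f' % (pre, rec, f1, loss, acc))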
    

In [None]:
metrics(y_test,y_pred)
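A note on the `loss` value: log loss is normally computed from predicted probabilities, which a fitted classifier exposes through `predict_proba`. When it is fed hard 0/1 labels, every wrong prediction is penalised as if the model had been completely certain. The sketch below shows the difference on made-up numbers (`proba_pos` is an assumed vector of class-1 probabilities, not output from the model above).

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])
proba_pos = np.array([0.1, 0.4, 0.35, 0.8])    # made-up P(class 1) for each row
hard = (proba_pos >= 0.5).astype(float)        # thresholded labels: 0, 0, 0, 1

# A 1-D y_pred is interpreted as the probability of the positive class
loss_from_proba = log_loss(y_true, proba_pos)
loss_from_labels = log_loss(y_true, hard)

print('log loss from probabilities:', round(loss_from_proba, 3))
print('log loss from hard labels  :', round(loss_from_labels, 3))
```

With the notebook's model, the probability-based version would be something like `log_loss(y_test, model.predict_proba(X_test)[:, 1])`.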

### Using Decision Tree

In [None]:
tree = DecisionTreeClassifier(max_depth=2,random_state=42)

In [None]:
tree.fit(X_train,y_train)
y_pred_tree = tree.predict(X_test)

In [None]:
metrics(y_test,y_pred_tree)

### Using Random Forest

In [None]:
forest = RandomForestClassifier()
forest.fit(X_train,y_train)
y_pred_forest = forest.predict(X_test)

In [None]:
metrics(y_test,y_pred_forest)

In [None]:
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
print(forest.get_params())
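Any of the parameters listed by `get_params()` can be tuned with the `GridSearchCV` class imported earlier. The following is only a sketch on synthetic data: the grid values, the `f1` scoring choice and the names `X_toy`, `y_toy` and `search` are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Small, arbitrary grid: every combination is evaluated with 3-fold cross-validation
param_grid = {'n_estimators': [10, 50], 'max_depth': [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='f1',
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a forest refitted on all the data passed to `fit` with the best parameter combination.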