# Loan Prediction ML Project
Weekly challenge: Week 08  
Date: 9/12/2022

**Objective:** Solve a binary classification problem using Python

**Steps that I will follow:**  
1. Problem statement  
2. Hypothesis generation  
3. Getting the system ready and loading the data  
4. Understanding the data  
5. Exploratory Data Analysis (EDA)  
  * Univariate analysis
  * Bivariate analysis
6. Missing value and outlier treatment
7. Evaluation Metrics for classification problems
8. Model building: Part I
9. Logistic Regression using stratified k-fold cross-validation
10. Feature Engineering  
11. Model building: Part II  
  * Logistic Regression  
  * Decision Tree  
  * Random Forest  
  * XGBoost

## 1. Problem statement  
Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility. The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include **Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History** and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target those customers.

This is a classification problem where we have to predict whether a loan would be approved or not. In a classification problem, we have to predict discrete values based on a given set of independent variables.  

Classification can be of two types:  
* Binary classification: Here we have to predict one of two given classes, e.g. classifying gender as male or female, or predicting a result as win or loss.
* Multiclass classification: Here we have to classify the data into three or more classes, e.g. classifying a movie's genre as comedy, action or romance, or classifying fruits as oranges, apples or pears.
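
As a small illustration of the binary case, the sketch below uses the first five `Loan_Status` values shown in the train data preview later in this notebook. The 1 / 0 encoding is an assumption made for the example, not something the dataset prescribes.

```python
import pandas as pd

# First five Loan_Status values from the train data preview (Y = approved, N = not approved).
loan_status = pd.Series(['Y', 'N', 'Y', 'Y', 'Y'], name='Loan_Status')

# Exactly two distinct classes make this a binary classification target.
n_classes = loan_status.nunique()

# Encode the two classes as 1 / 0 (illustrative mapping, chosen for this sketch).
target = loan_status.map({'Y': 1, 'N': 0})

print(n_classes)
print(target.tolist())
```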

## 2. Hypothesis generation
Hypothesis generation is the process of listing all the factors that could affect the outcome. Below are some of the factors that can affect loan approval (independent variables for the loan prediction problem).  

* Salary: Applicants with a higher income should have a better chance of loan approval.
* Previous history: Applicants who have repaid their previous debts should have a higher chance of loan approval.  
* Loan amount: The smaller the loan amount, the higher the chance of loan approval should be.  
* Loan term: Loans with a shorter duration should have a higher chance of loan approval.  
* Monthly installment: The lower the amount to be paid monthly, the higher the chance of loan approval should be.  

I have listed above some factors that I think might affect the target variable.
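
The monthly installment hypothesis can be made concrete with a rough sketch. The column name `Monthly_Installment` is hypothetical, and the plain amount / term ratio is a simplifying assumption that ignores interest. The three rows are taken from the train data preview later in this notebook.

```python
import pandas as pd

# Three rows from the train data preview.
# LoanAmount is in thousands; Loan_Amount_Term is in months.
sample = pd.DataFrame({
    'LoanAmount': [128.0, 66.0, 120.0],
    'Loan_Amount_Term': [360.0, 360.0, 360.0],
})

# Hypothetical feature: a simple ratio of amount to term, ignoring interest.
sample['Monthly_Installment'] = sample['LoanAmount'] / sample['Loan_Amount_Term']

print(sample)
```

Under the hypothesis above, rows with a lower `Monthly_Installment` would be expected to have a higher approval rate; whether the data supports this is a question for the bivariate analysis.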

## 3. Getting the system ready and loading the data

### Loading the packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Data

For this project, there are 2 files: train.csv and test.csv.
* The train.csv file will be used for training the model; the model will learn from this file. It contains all the independent variables as well as the target variable.
* The test.csv file contains all the independent variables but not the target variable. We will apply the model to predict the target variable for the test data.

### Reading the data

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
# Making copies of train & test data so we can make changes to the datasets, if needed.
train_original = train.copy()
test_original = test.copy()

## 4. Understanding the data
Here, we will look at the structure of train & test datasets. First, we will check the features present in the data and then will check their data types.

In [4]:
# Checking columns of the train data
train.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

There are 12 independent variables and 1 target variable in the train dataset.

In [5]:
# Checking columns of the test data
test.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')

There are 12 independent variables in the test dataset. 'Loan_Status', our target variable, is missing. We will predict 'Loan_Status' using the model that we will build on the train dataset.
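
The difference between the two column lists can also be checked programmatically instead of by eye. This is a minimal sketch that rebuilds the two indexes from the outputs above rather than reading the CSV files; with the real data, `train.columns` and `test.columns` would be used directly.

```python
import pandas as pd

# Column lists copied from the train.columns and test.columns outputs.
train_columns = pd.Index([
    'Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
    'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
    'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
])
test_columns = train_columns.drop('Loan_Status')

# Columns present in train but absent from test.
missing_in_test = train_columns.difference(test_columns)

print(list(missing_in_test))
```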

**Description of each variable**  
* Loan_ID: Unique loan ID  
* Gender: Male / Female  
* Married: Applicant married (Yes / No)  
* Dependents: # of dependents  
* Education: Applicant's education (Graduate, Not graduate)   
* Self_Employed: Applicant self-employed? (Yes / No)  
* ApplicantIncome: Income of the applicant  
* CoapplicantIncome: Income of the co-applicant  
* LoanAmount: Loan amount in thousands  
* Loan_Amount_Term: Term / duration of loan in months  
* Credit_History: Credit history meets guidelines (1 - Yes / 0 - No)  
* Property_Area: Urban / Semi urban / Rural  
* Loan_Status: Loan approved (Yes / No)

**Data type of each variable in the train dataset**

In [13]:
train.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

The train dataset contains three data types.  
* object: It represents variables that are text or categorical.  
  -- *There are 8 variables: 'Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status'*  
  -- Here, 'Loan_ID' is a text variable and the remaining 7 are categorical variables.
* int64: It represents variables that are integers.  
  -- *There is 1 variable: 'ApplicantIncome'*  
* float64: It represents variables with decimal values.  
  -- *There are 4 variables: 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History'*  
  -- Here, 'Credit_History' is float64, but it is really a categorical variable: 1 (Yes) means the credit history meets the guidelines and 0 (No) means it does not. For this reason, we can convert it to a categorical variable.
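
The grouping by data type above can be reproduced with `select_dtypes`. The sketch below runs on three rows and a subset of columns copied from the train data preview, so it only illustrates the call; on the full data, `train` would replace `sample`.

```python
import numpy as np
import pandas as pd

# Three rows and a subset of columns from the train data preview.
sample = pd.DataFrame({
    'Loan_ID': ['LP001002', 'LP001003', 'LP001005'],
    'Education': ['Graduate', 'Graduate', 'Graduate'],
    'ApplicantIncome': [5849, 4583, 3000],
    'LoanAmount': [np.nan, 128.0, 66.0],
    'Credit_History': [1.0, 1.0, 1.0],
})

# Split the column names into numeric and non-numeric groups.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
other_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```

Note that 'Credit_History' lands in the numeric group because of its storage type, even though its meaning is categorical.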

**Shape of each dataset**

In [14]:
train.shape

(614, 13)

In [15]:
test.shape

(367, 12)

There are 614 rows and 13 columns in the train dataset.  
There are 367 rows and 12 columns in the test dataset.

## 5. Exploratory Data Analysis (EDA)

### Univariate analysis
* It is the simplest form of analyzing data where we examine each variable individually.  
* For categorical features, we can use frequency table or bar plot which will calculate the number of values / rows in each category in a particular variable.  
* For numerical features, we can use probability density plots to look at the distribution of the variables.
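
A minimal sketch of both ideas, using the first five rows of the train data preview as a stand-in for the full dataset. It builds a frequency table for one categorical feature and a numeric summary for one numerical feature; the choice of 'Property_Area' and 'ApplicantIncome' is only for illustration.

```python
import pandas as pd

# First five rows of two columns from the train data preview.
sample = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Urban', 'Urban'],
    'ApplicantIncome': [5849, 4583, 3000, 2583, 6000],
})

# Frequency table for a categorical feature: absolute counts and proportions.
counts = sample['Property_Area'].value_counts()
shares = sample['Property_Area'].value_counts(normalize=True)

# Summary statistics for a numerical feature.
income_summary = sample['ApplicantIncome'].describe()

print(counts)
print(shares)
print(income_summary)
```

With the full data loaded, `counts.plot.bar()` would draw the bar plot, and `sns.kdeplot(train['ApplicantIncome'])` is one way to draw a density plot.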

In [None]:
# Convert 'Credit_History' to a categorical variable.
train['Credit_History'] = train['Credit_History'].astype('category')

### Bivariate analysis

In [None]:


In [10]:
train['Education'].unique()

array(['Graduate', 'Not Graduate'], dtype=object)

In [9]:
train

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y
