# Data Munging

In this step, the missing values of each attribute are filled with that attribute's mean, median or mode. In some cases, if an attribute is found not to affect Loan Status, it is dropped. Along with that, all attributes are converted to categorical type. Label encoding is also applied to make the data easier for a classifier to process.
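As a minimal sketch of the two ideas above, the toy frame below fills a numeric gap with the median, fills a categorical gap with the mode, and then label-encodes the categorical column. The data and the name `toy` are made up for illustration; only the column names mimic the loan dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names only mimic the loan dataset.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 150.0, 200.0],
    "Self_Employed": ["No", "Yes", None, "No"],
})

# Numeric column: fill gaps with the median of the observed values.
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].median())

# Categorical column: fill gaps with the most frequent value (the mode).
most_common = toy["Self_Employed"].mode()[0]
toy["Self_Employed"] = toy["Self_Employed"].fillna(most_common)

# Label encoding: map each category to an integer code in a fixed order.
toy["Self_Employed"] = pd.Categorical(
    toy["Self_Employed"], categories=["No", "Yes"]
).codes

print(toy)
```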

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

In [16]:
df=pd.read_csv("Dataset/train.csv")

The number of missing values in each column:

In [17]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

## Handling individual Attributes

### Gender

Through the visualization done earlier we can infer that gender does not influence the chances of getting a loan, hence we can remove this particular attribute.
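One way to back up such a claim numerically is to compare approval rates per group with a normalized cross-tabulation. The sketch below uses made-up rows (the frame `toy` is hypothetical and not taken from the real dataset); similar rates across groups would be consistent with the attribute having little influence.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y"],
})

# Share of approved (Y) and rejected (N) loans within each gender;
# normalize="index" makes every row sum to 1.
rates = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="index")
print(rates)
```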

In [18]:
df = df.drop(columns=["Gender"])

### Married

In [19]:
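# Only 3 rows lack a Married value, so drop them instead of imputing.
df = df.dropna(axis=0, subset=["Married"])
MarriedTypes = ["No", "Yes"]
# astype("category", categories=...) is not accepted by current pandas; pd.Categorical gives the same codes.
df["Married"] = pd.Categorical(df["Married"], categories=MarriedTypes).codes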
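The encoding used here relies on categorical codes: each value is replaced by its position in the category list, and anything missing or not in the list becomes -1. That is a likely reason to deal with missing values before encoding. A small sketch on a made-up series (the name `married` is only for illustration):

```python
import pandas as pd

# Hypothetical values, including one missing entry.
married = pd.Series(["Yes", "No", None, "Yes"])

# Codes follow the order of the category list: "No" -> 0, "Yes" -> 1.
# Missing values (or values outside the list) are encoded as -1.
codes = pd.Categorical(married, categories=["No", "Yes"]).codes

print(codes.tolist())  # [1, 0, -1, 1]
```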

### Dependents

In [20]:
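# Dependents is stored as text; map "3+" to 3 and convert to numbers so the median is defined.
df["Dependents"] = pd.to_numeric(df["Dependents"].replace("3+", "3"))
df["Dependents"] = df["Dependents"].fillna(df["Dependents"].median())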

### Education

In [21]:
EducationTypes = ["Graduate", "Not Graduate"]
df["Education"] = pd.Categorical(df["Education"], categories=EducationTypes).codes

### Self Employed

In [22]:
# Missing entries are filled with "No" before encoding.
df["Self_Employed"] = df["Self_Employed"].fillna("No")
Self_EmployedTypes = ["No", "Yes"]
df["Self_Employed"] = pd.Categorical(df["Self_Employed"], categories=Self_EmployedTypes).codes

### Applicant Income

In [23]:
df["ApplicantIncome"] = pd.qcut(df.ApplicantIncome, 5, labels=[0,1,2,3,4])

### Co-Applicant Income

In [24]:
df["CoapplicantIncome"] = pd.cut(df.CoapplicantIncome, 5, labels=[0,1,2,3,4])
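ApplicantIncome is binned with `pd.qcut` and CoapplicantIncome with `pd.cut`, and the two behave differently on skewed data. The sketch below, on made-up incomes (the series `income` is hypothetical), shows that `qcut` puts roughly the same number of rows in each bin, while `cut` uses equal-width bins and lets a single outlier push almost everything into the first bin.

```python
import pandas as pd

# Hypothetical skewed incomes: nine small values and one large outlier.
income = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# qcut: bin edges are quantiles, so each bin holds about the same number of rows.
by_quantile = pd.qcut(income, 5, labels=[0, 1, 2, 3, 4])

# cut: bin edges are evenly spaced between min and max, so the outlier
# stretches the range and most rows land in the first bin.
by_width = pd.cut(income, 5, labels=[0, 1, 2, 3, 4])

print(by_quantile.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(by_width.tolist())     # [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
```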

### Loan Amount

In [25]:
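# Assign the result back; fillna(inplace=True) on a selected column may not update df.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())
df["LoanAmount"] = pd.qcut(df["LoanAmount"], 3, labels=[0, 1, 2])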

### Loan Amount Term

In [26]:
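# Assign the result back; fillna(inplace=True) on a selected column may not update df.
df["Loan_Amount_Term"] = df["Loan_Amount_Term"].fillna(df["Loan_Amount_Term"].median())
df["Loan_Amount_Term"] = pd.cut(df["Loan_Amount_Term"], 3, labels=[0, 1, 2])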

### Credit History

In [27]:
df['Credit_History']=df['Credit_History'].fillna(0)

### Property Area

In [28]:
PropertyAreaTypes = ["Urban", "Rural", "Semiurban"]
df["Property_Area"] = pd.Categorical(df["Property_Area"], categories=PropertyAreaTypes).codes

### Loan Status

In [29]:
Loan_StatusTypes = ["N", "Y"]
df["Loan_Status"] = pd.Categorical(df["Loan_Status"], categories=Loan_StatusTypes).codes

In [30]:
df.isnull().sum()

Loan_ID              0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [31]:
# Y is the target; X holds the features (Loan_ID is only an identifier).
Y = df["Loan_Status"]
X = df.drop(columns=["Loan_Status", "Loan_ID"])

In [None]:
df.head(10)