
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df= pd.read_csv('/content/train_ctrUa4K.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [3]:
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [4]:
# Total missing values per column
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [5]:
# Total number of missing values in the entire dataset
missing_value= df.isnull().sum().sum()

In [6]:
# Percentage of missing values across all cells
# np.prod replaces np.product, which was removed in NumPy 2.0
total_value = np.prod(df.shape)
print((missing_value/total_value)*100)

1.8667000751691305
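
The overall percentage hides which columns are responsible for the gaps. A minimal sketch of a per-column breakdown follows; the `toy` frame and its values are made up for illustration and only stand in for the loan data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are invented for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 141.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# isnull() yields booleans; the mean of a boolean column is its missing fraction.
missing_pct = toy.isnull().mean() * 100

# Overall share of missing cells; DataFrame.size equals rows * columns.
overall_pct = toy.isnull().sum().sum() / toy.size * 100

print(missing_pct)
print(overall_pct)
```

Sorting `missing_pct` in descending order is a quick way to see which columns need attention first.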


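**Handling missing values**


*   Delete the rows or columns that contain them. *Avoid deletion when the values are Missing Not At Random (MNAR): the missingness itself carries information, so dropping those rows biases the data.*
*   Or impute them, that is, replace them with estimated values.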



In [7]:
# Delete every row that has at least one missing value (listwise deletion)
df1= df.dropna()
df1.isnull().sum()


Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
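
Listwise deletion can discard far more rows than the per-column counts suggest, because a row is dropped if any single column is missing. The sketch below uses an invented `toy` frame (not the loan data) to compare plain `dropna()` with its `subset` and `thresh` parameters, which delete less aggressively.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; only row 0 is fully populated.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male", "Female"],
    "LoanAmount": [128.0, 66.0, np.nan, 141.0, np.nan],
    "Credit_History": [1.0, 1.0, 0.0, np.nan, np.nan],
})

# Listwise deletion: keep only rows with no missing value at all.
complete = toy.dropna()

# Drop rows only when a specific column is missing.
has_amount = toy.dropna(subset=["LoanAmount"])

# Keep rows that have at least 2 non-missing values.
mostly_filled = toy.dropna(thresh=2)

print(len(toy), len(complete), len(has_amount), len(mostly_filled))
```

Comparing the row counts before and after deletion is a cheap check on how much data the chosen strategy costs.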

In [8]:
# Alternatively, drop a whole column; here 'Dependents' is dropped on the assumption that it is of little use
df1= df.drop(['Dependents'], axis=1)
df1.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

**Imputing the Missing value**

In [9]:
# Impute the missing values with an arbitrary constant.
# 'Dependents' holds strings, so fill with the string '0' to keep the column's type consistent.
df['Dependents'] = df['Dependents'].fillna('0')
df['Dependents'].isnull().sum()

0

**Imputing by Mean**

In [10]:
# Replace missing values in LoanAmount and Credit_History with the column mean
df['LoanAmount']= df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Credit_History']= df['Credit_History'].fillna(df['Credit_History'].mean())
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents            0
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64
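
One caveat with mean imputation: for a column that only takes the values 0 and 1, as Credit_History appears to in the rows shown earlier, the mean is a fraction that matches neither category. The sketch below uses made-up flags to contrast mean and mode imputation for such a column; treating the flag as categorical is an assumption, not something the notebook states.

```python
import numpy as np
import pandas as pd

# Made-up binary flags standing in for a column like Credit_History.
credit = pd.Series([1.0, 1.0, 0.0, np.nan, 1.0])

# The mean of the observed values is 0.75, which is not a valid 0/1 flag.
mean_filled = credit.fillna(credit.mean())

# The mode is the most frequent observed flag, so the result stays 0/1.
mode_filled = credit.fillna(credit.mode()[0])

print(mean_filled.tolist())
print(mode_filled.tolist())
```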

**Imputation by Mode** -- The mode is the most frequently occurring value, which makes it a natural choice for categorical features. Here ‘fillna’ fills the categorical columns ‘Gender’, ‘Married’, and ‘Self_Employed’ with their respective modes.



In [11]:
# Replace missing values in the categorical columns with the mode (mode() returns a Series, so take its first entry)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married']= df['Married'].fillna(df['Married'].mode()[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
df.isnull().sum()

Loan_ID               0
Gender                0
Married               0
Dependents            0
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64

**Imputation by Median** -- The median is the middle value of the sorted data. Because it is much less sensitive to outliers than the mean, it is the better choice for columns with extreme values. Here ‘fillna’ fills the column ‘Loan_Amount_Term’ with its median.

In [12]:
df['Loan_Amount_Term']= df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].median())
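
To see why the median is preferred when outliers are present, the sketch below compares both fills on a small made-up series with one extreme value. The numbers are invented for illustration and are not taken from the loan data.

```python
import numpy as np
import pandas as pd

# Four observed incomes, one of them an extreme outlier, plus one missing entry.
incomes = pd.Series([2500.0, 3000.0, 3500.0, np.nan, 81000.0])

# The mean is pulled toward the outlier: (2500 + 3000 + 3500 + 81000) / 4 = 22500.
mean_filled = incomes.fillna(incomes.mean())

# The median of the observed values is (3000 + 3500) / 2 = 3250.
median_filled = incomes.fillna(incomes.median())

print(mean_filled[3], median_filled[3])
```

The median fill stays close to the typical values, while the mean fill lands far from every ordinary observation.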


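**Forward Fill** -- Each missing value is replaced by the last valid value before it, which is why it is mostly used with time series data. Older pandas versions accept ‘fillna’ with the parameter ‘method = ffill’; that parameter is deprecated since pandas 2.1, and the ‘ffill’ method is the direct equivalent.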

In [13]:
# Create the series as float so that assigning NaN does not change its dtype
test = pd.Series(range(6), dtype='float64')
test.loc[2:4] = np.nan
test

0    0.0
1    1.0
2    NaN
3    NaN
4    NaN
5    5.0
dtype: float64

In [14]:
# Forward fill: propagate the last valid value forward
# (ffill() is equivalent to the deprecated fillna(method='ffill'))
test.ffill()

0    0.0
1    1.0
2    1.0
3    1.0
4    1.0
5    5.0
dtype: float64

**Backward Fill** -- Each missing value is replaced by the next valid value after it.

In [16]:
# bfill() is equivalent to the deprecated fillna(method='bfill')
test.bfill()

0    0.0
1    1.0
2    5.0
3    5.0
4    5.0
5    5.0
dtype: float64
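
Both fills above copy a single value across the whole gap, which can be misleading when the gap is long. As a hedged sketch, assuming a pandas version that provides `Series.ffill` and `Series.bfill`, the `limit` parameter caps how many consecutive missing values are filled:

```python
import numpy as np
import pandas as pd

# Same shape as the series above: a three-value gap between 1.0 and 5.0.
test = pd.Series(range(6), dtype="float64")
test.loc[2:4] = np.nan

# Fill at most one missing value after each valid observation.
capped = test.ffill(limit=1)

# Combine a capped forward fill with a capped backward fill;
# the middle of the gap stays missing.
both = test.ffill(limit=1).bfill(limit=1)

print(capped.tolist())
print(both.tolist())
```

Values that remain missing after a capped fill can then be handled by another method, such as interpolation.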

**Interpolation** -- Missing values can also be estimated from their neighbouring values. The pandas ‘interpolate’ method supports several interpolation methods, such as ‘linear’, ‘polynomial’, and ‘quadratic’ (the latter two require SciPy). The default method is ‘linear’.

In [17]:
test.interpolate()

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
dtype: float64
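
The default ‘linear’ method treats observations as equally spaced and ignores the index. For time series with uneven gaps, `method="time"` weights by the actual time difference instead. The sketch below uses made-up dates and values to show the difference.

```python
import numpy as np
import pandas as pd

# Unevenly spaced observations: 1 day, then 3 days between points.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# 'linear' ignores the index, so the gap is filled with the midpoint.
linear = s.interpolate()

# 'time' accounts for the dates: Jan 2 is a quarter of the way from Jan 1 to Jan 5.
by_time = s.interpolate(method="time")

print(linear.iloc[1], by_time.iloc[1])
```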

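**Handling Missing Values in Categorical Features**


*   Replace them with the most frequent value
*   Or replace them with a constant placeholder such as ‘missing’, which treats missingness as a category of its own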



In [19]:
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle', np.nan]})
X

| | Shape |
|---|---|
| 0 | square |
| 1 | square |
| 2 | oval |
| 3 | circle |
| 4 | NaN |


In [20]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)

array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['square']], dtype=object)

In [21]:
imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputer.fit_transform(X)

array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['missing']], dtype=object)

**Using the scikit-learn Library**

In [23]:
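# Univariate approach: each column is imputed using only that column's own values
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit learns the column means: (1 + 7) / 2 = 4 and (2 + 3 + 6) / 3 = 3.67
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
# transform fills the gaps in X with the means learned during fit
print(imp.transform(X))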

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
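
Real datasets mix numeric and categorical columns, and each type needs its own strategy. One possible way to combine them, sketched here under the assumption that scikit-learn's `ColumnTransformer` is acceptable for this workflow, is to route each group of columns to its own `SimpleImputer`. The `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Invented mixed-type data: one numeric and one categorical column.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 120.0, 500.0],
    "Gender": ["Male", "Male", np.nan, "Female"],
})

# Median for the numeric column, most frequent value for the categorical one.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["LoanAmount"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["Gender"]),
])

# Output columns follow the order of the transformers listed above.
filled = pd.DataFrame(
    preprocess.fit_transform(toy), columns=["LoanAmount", "Gender"]
)

print(filled)
```

Fitting the transformer on training data only and then calling `transform` on new data keeps the imputation values from leaking information out of the test set.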
