# Treating Missing Values

## Introduction

* Missing values are common in real-life datasets.
* They can bias the results of machine learning models and/or reduce their accuracy.

## What is a Missing Value?
* Missing data are values that are not stored (or not present) for one or more variables in a dataset.
* Below is a sample of missing data from the Titanic dataset. The columns ‘Age’ and ‘Cabin’ have some missing values.

<center> <img src = 'https://editor.analyticsvidhya.com/uploads/39935Missing%20Values%201.png' width = 600></center>

Some of the reasons why data is missing:

1. Past data might get corrupted due to improper maintenance.
2. Observations are not recorded for certain fields, for example because of human error while recording the values.
3. The user has not provided the values intentionally.

Types of Missing Values:

<center> <img src = 'https://editor.analyticsvidhya.com/uploads/63807Types.png'></center>

### Missing Completely At Random (MCAR)

* In MCAR, the probability of data being missing is the same for all the observations.

* In this case, there is no relationship between the missing data and any other values observed or unobserved (the data which is not recorded) within the given dataset.

* That is, missing values are completely independent of other data. There is no pattern.

* In the case of MCAR, the data could be missing due to human error, some system/equipment failure, loss of sample, or some unsatisfactory technicalities while recording the values.

### Missing At Random (MAR)

* Missing at random (MAR) means that the reason for the missing values can be explained by variables on which you have complete information: there is some relationship between the missing data and other observed values.

* In this case, the data is not missing for all the observations. It is missing only within sub-samples of the data and there is some pattern in the missing values.

### Missing Not At Random (MNAR)

* Missing values depend on the unobserved data.

* If there is some structure/pattern in the missing data and the other observed data cannot explain it, then it is Missing Not At Random (MNAR).

* If the missing data does not fall under the MCAR or MAR then it can be categorized as MNAR.
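To make the three mechanisms concrete, here is a small simulation on synthetic data. The column names (`age`, `income`), the sample size and the missingness probabilities are illustrative assumptions and do not come from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness in income depends only on the fully observed column age.
mar_mask = rng.random(n) < np.where(df["age"] > 60, 0.4, 0.1)

# MNAR: missingness in income depends on the income value itself,
# which is exactly the value we would not get to see.
mnar_mask = rng.random(n) < np.where(df["income"] > 60_000, 0.4, 0.1)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = df.loc[~mask, "income"].mean()
    print(f"{name}: {mask.mean():.1%} missing, observed mean income = {observed_mean:,.0f}")
```

In this sketch the MNAR mask removes high incomes more often, so the mean of the observed incomes falls below the true mean, while under MCAR it stays close to it.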

### It is important to handle the missing values appropriately.

* Many machine learning algorithms fail if the dataset contains missing values. Some algorithms, such as k-nearest neighbors and Naive Bayes, can tolerate missing values, although this depends on the implementation.
* If the missing values are not handled properly, you may end up building a biased machine learning model that produces incorrect results.
* Missing data can also lead to a lack of precision in the statistical analysis.


## Let's Work on the California Housing Prices Dataset
* Download the california-housing-prices data from Kaggle.


In [9]:
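# Install the Kaggle CLI and upload your kaggle.json API token (Colab only)
! pip install -q kaggle
from google.colab import files
files.upload()
# Put the token where the Kaggle CLI expects it and restrict its permissions
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
# Download and unzip the California housing prices dataset
! kaggle datasets download -d camnugent/california-housing-prices
! kaggle kernels pull sazid28/home-loan-prediction
! mkdir dataset
! unzip /content/california-housing-prices.zip -d dataset
# Download and unzip the home loan dataset used later for categorical features
! kaggle datasets download -d gavincanacam/home-loan-predictions
! mkdir dataset_homeloan
! unzip /content/home-loan-predictions.zip -d dataset_homeloan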

Saving kaggle.json to kaggle.json
mkdir: cannot create directory ‘/root/.kaggle’: File exists
Downloading california-housing-prices.zip to /content
  0% 0.00/400k [00:00<?, ?B/s]
100% 400k/400k [00:00<00:00, 110MB/s]
Archive:  /content/california-housing-prices.zip
  inflating: dataset/housing.csv     


Loading dataset

In [10]:
import pandas as pd
dataset = pd.read_csv('/content/dataset/housing.csv')
dataset.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


Checking missing values

In [59]:
dataset.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

The `total_bedrooms` column has 207 missing values.
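A raw count is easier to judge as a share of the rows. The sketch below computes the percentage of missing values per column on a tiny hypothetical frame (`toy` is made up for illustration); the same expression can be applied to any DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with two missing entries in one column
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# isna() gives booleans; their mean is the fraction of missing entries
missing_share = toy.isna().mean() * 100
print(missing_share)
```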

In [15]:
dataset[dataset.total_bedrooms.isna()]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
290,-122.16,37.77,47.0,1256.0,,570.0,218.0,4.3750,161900.0,NEAR BAY
341,-122.17,37.75,38.0,992.0,,732.0,259.0,1.6196,85100.0,NEAR BAY
538,-122.28,37.78,29.0,5154.0,,3741.0,1273.0,2.5762,173400.0,NEAR BAY
563,-122.24,37.75,45.0,891.0,,384.0,146.0,4.9489,247100.0,NEAR BAY
696,-122.10,37.69,41.0,746.0,,387.0,161.0,3.9063,178400.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20267,-119.19,34.20,18.0,3620.0,,3171.0,779.0,3.3409,220500.0,NEAR OCEAN
20268,-119.18,34.19,19.0,2393.0,,1938.0,762.0,1.6953,167400.0,NEAR OCEAN
20372,-118.88,34.17,15.0,4260.0,,1701.0,669.0,5.1033,410700.0,<1H OCEAN
20460,-118.75,34.29,17.0,5512.0,,2734.0,814.0,6.6073,258100.0,<1H OCEAN


## Techniques to Handle Missing Values


## 1. Deleting the missing values

* Generally, this approach is not recommended. It is one of the quick and dirty techniques one can use to deal with missing values.

* The disadvantage of this method is one might end up deleting some useful data from the dataset.

* There are 2 ways one can delete the missing values:

#### i) Deleting the entire row

In [21]:
dataset_delete_row = dataset.copy()
dataset_delete_row = dataset_delete_row.dropna(axis = 0 , subset = ['total_bedrooms'],inplace = False)
dataset_delete_row.isnull().sum()


longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

#### ii) Deleting the entire column

In [22]:
dataset_delete_column = dataset.copy()
dataset_delete_column = dataset_delete_column.dropna(axis = 1 ,inplace = False)
dataset_delete_column.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
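How much data each deletion strategy costs depends on where the gaps sit. The following sketch uses a hypothetical three-column frame (`toy`) to compare the shapes left after dropping rows and after dropping columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: columns a and b each have one gap, c is complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

rows_kept = toy.dropna(axis=0)  # drops every row with at least one NaN
cols_kept = toy.dropna(axis=1)  # drops every column with at least one NaN

print(f"original shape: {toy.shape}")
print(f"after dropping rows: {rows_kept.shape}")
print(f"after dropping columns: {cols_kept.shape}")
```

Here two scattered gaps are enough to lose half of the rows or two of the three columns, which is why deletion is usually a last resort.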

## 2. Imputing the Missing Values

There are different ways of replacing the missing values.

### Replacing With Arbitrary Value

If we can make an educated guess about the missing value, we can replace it with an arbitrary constant (here, 0) using the following code.



In [23]:
dataset_impute_random = dataset.copy()
dataset_impute_random = dataset_impute_random.fillna(0)
dataset_impute_random.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

### Replacing With Mean

* This is the most common method of imputing missing values of numeric columns. 
* If there are outliers then the mean will not be appropriate. In such cases, outliers need to be treated first.

* We can use the ‘fillna’ method to impute the column 'total_bedrooms' with the mean of that column.


In [24]:
dataset_impute_mean = dataset.copy()
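# Fill only total_bedrooms, so its mean is never written into other columns
dataset_impute_mean['total_bedrooms'] = dataset_impute_mean['total_bedrooms'].fillna(dataset_impute_mean['total_bedrooms'].mean())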
dataset_impute_mean.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

### Replacing With Mode

* Mode is the most frequently occurring value. 
* It is normally used for categorical features, but let's try it on continuous data.

In [26]:
dataset_impute_mode = dataset.copy()
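# mode() returns a Series (there can be ties), so take the first value
dataset_impute_mode['total_bedrooms'] = dataset_impute_mode['total_bedrooms'].fillna(dataset_impute_mode['total_bedrooms'].mode()[0])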
dataset_impute_mode.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

### Replacing With Median

The median is the middle value of the sorted data. Because it is much less affected by extreme values than the mean, it is the better choice for imputation when the column contains outliers.

In [29]:
dataset_impute_median = dataset.copy()
# Fill only total_bedrooms with its own median
dataset_impute_median['total_bedrooms'] = dataset_impute_median['total_bedrooms'].fillna(dataset_impute_median['total_bedrooms'].median())
dataset_impute_median.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
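The effect of an outlier on mean and median imputation is easy to see on a handful of numbers. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100.0 is an outlier, one value is missing
values = pd.Series([2.0, 3.0, 3.0, 4.0, np.nan, 100.0])

mean_filled = values.fillna(values.mean())      # mean of observed values is 22.4
median_filled = values.fillna(values.median())  # median of observed values is 3.0

print("filled with mean:  ", mean_filled[4])
print("filled with median:", median_filled[4])
```

The single outlier pulls the mean far away from the typical values, whereas the median stays at 3.0.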

### Replacing with previous value – Forward fill

In some cases, imputing the values with the previous value instead of mean, mode or median is more appropriate. This is called forward fill. It is mostly used in time series data.

In [31]:
dataset_impute_forwardfill = dataset.copy()
# ffill() is equivalent to fillna(method='ffill'), which recent pandas versions deprecate
dataset_impute_forwardfill = dataset_impute_forwardfill.ffill()
dataset_impute_forwardfill.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

### Replacing with next value – Backward fill

In backward fill, the missing value is imputed using the next value.

In [32]:
dataset_impute_backwardfill = dataset.copy()
# Backward fill: use bfill() (the original cell used 'ffill' by mistake)
dataset_impute_backwardfill = dataset_impute_backwardfill.bfill()
dataset_impute_backwardfill.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
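The difference between forward and backward fill shows up clearly on a short ordered series. The dates and readings below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = readings.ffill()   # each gap takes the last value seen before it
backward = readings.bfill()  # each gap takes the next value seen after it

print(pd.DataFrame({"original": readings, "ffill": forward, "bfill": backward}))
```

Note that backward fill leaves the last entry missing because no later value exists, and forward fill would do the same for a gap at the very start.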

### Interpolation

* Missing values can also be imputed using interpolation. 
* The pandas `interpolate` method replaces missing values using interpolation methods such as ‘polynomial’, ‘linear’ and ‘quadratic’. The default method is ‘linear’.

For more on interpolation, see https://en.wikipedia.org/wiki/Interpolation

In [33]:
dataset_impute_interpolate = dataset.copy()
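# Interpolate only the numeric column with gaps; ocean_proximity holds strings
dataset_impute_interpolate['total_bedrooms'] = dataset_impute_interpolate['total_bedrooms'].interpolate()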
dataset_impute_interpolate.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
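As a quick illustration of what linear interpolation does, the sketch below fills the gaps of a short hypothetical series by placing the missing points on straight lines between their known neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps of different lengths
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 8.0])

# Linear interpolation treats the values as equally spaced
linear = s.interpolate(method="linear")
print(linear.tolist())
```

Like forward and backward fill, this only makes sense when the row order carries meaning, for example in a time series.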

## Imputing Missing Values For Categorical Features
* For this, let's use the home loan dataset.

There are two ways to impute missing values for categorical features:

### Impute the Most Frequent Value

We will use ‘SimpleImputer’ here. Because these are non-numeric columns, we can’t use the mean or median, but we can use the most frequent value or a constant.

In [37]:
homeloan_dataset = pd.read_csv('/content/dataset_homeloan/Train_Loan_Home.csv')
homeloan_dataset.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [38]:
homeloan_dataset.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [43]:
from sklearn.impute import SimpleImputer
homeloan_dataset_most_frequent_value = homeloan_dataset[['Gender','Married','Self_Employed','Property_Area']]
imputer = SimpleImputer(strategy='most_frequent')
homeloan_dataset_most_frequent_value = imputer.fit_transform(homeloan_dataset_most_frequent_value)
pd.DataFrame(homeloan_dataset_most_frequent_value,columns =['Gender','Married','Self_Employed','Property_Area'] ).isna().sum()

Gender           0
Married          0
Self_Employed    0
Property_Area    0
dtype: int64

### Impute the Value “missing”, which treats it as a Separate Category

In [45]:
homeloan_dataset_seperate_category= homeloan_dataset[['Gender','Married','Self_Employed','Property_Area']]
imputer = SimpleImputer(strategy='constant', fill_value='missing')
homeloan_dataset_seperate_category = imputer.fit_transform(homeloan_dataset_seperate_category)
pd.DataFrame(homeloan_dataset_seperate_category,columns =['Gender','Married','Self_Employed','Property_Area'] ).isna().sum()

Gender           0
Married          0
Self_Employed    0
Property_Area    0
dtype: int64
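The two categorical strategies lead to different category counts. The sketch below compares them on a hypothetical single-column frame; the values are made up and only mirror the style of the `Gender` column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({"Gender": ["Male", "Female", np.nan, "Male", np.nan]})

most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(toy)
as_category = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(toy)

print(pd.Series(most_frequent[:, 0]).value_counts())
print(pd.Series(as_category[:, 0]).value_counts())
```

Imputing the most frequent value inflates the majority class, while the constant keeps the missingness visible as its own category.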

### Imputation with Univariate Approach

* In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median or some constant value.

In [54]:
import numpy as np
dataset_univariate= dataset.copy()
dataset_univariate.ocean_proximity = dataset_univariate.ocean_proximity.map({'<1H OCEAN':0,'INLAND':1,'NEAR OCEAN':2,'NEAR BAY':3,'ISLAND':4})
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
dataset_univariate = imputer.fit_transform(dataset_univariate)
pd.DataFrame(dataset_univariate,columns = list(dataset.columns)).isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
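To see that the univariate approach works column by column, the following sketch fits a `SimpleImputer` on a small made-up array and inspects the per-column means it learned through the `statistics_` attribute. The arrays `X_train` and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: each column has one missing value
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train)

print(imputer.statistics_)       # per-column means learned from X_train
print(imputer.transform(X_new))  # each gap gets the mean of its own column
```

Each column is filled from its own statistic only; the values in the other column play no role.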

### Multivariate Approach

In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values with this approach: the KNNImputer and IterativeImputer classes.

In [56]:
from sklearn.experimental import enable_iterative_imputer  # needed to enable the experimental IterativeImputer
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
from sklearn.impute import IterativeImputer
dataset_multivariate_iterative= dataset.copy()
dataset_multivariate_iterative.ocean_proximity = dataset_multivariate_iterative.ocean_proximity.map({'<1H OCEAN':0,'INLAND':1,'NEAR OCEAN':2,'NEAR BAY':3,'ISLAND':4})
impute_it = IterativeImputer()
dataset_multivariate_iterative = impute_it.fit_transform(dataset_multivariate_iterative)
pd.DataFrame(dataset_multivariate_iterative,columns = list(dataset.columns)).isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
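A toy case shows how the other features inform the imputed value. In the sketch below the second column is exactly twice the first (an assumption made purely for illustration), one entry is hidden, and `IterativeImputer` with its default estimator is asked to recover it.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: second column is 2 * first column
x1 = np.arange(1.0, 11.0)
X = np.column_stack([x1, 2.0 * x1])
X[4, 1] = np.nan  # hide the entry whose true value is 10.0

imputed = IterativeImputer(random_state=0).fit_transform(X)
print(imputed[4])  # the second entry should land close to 10.0
```

A univariate mean imputation would have filled this gap with the column mean instead, ignoring the relationship between the two columns.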

### Nearest Neighbors Imputations (KNNImputer)

Missing values are imputed using the k-nearest neighbors approach, where a Euclidean distance that skips missing coordinates (`nan_euclidean`) is used to find the nearest neighbors.

In [57]:
from sklearn.impute import KNNImputer
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer
impute_knn = KNNImputer(n_neighbors=2)
dataset_multivariate_knn= dataset.copy()
dataset_multivariate_knn.ocean_proximity = dataset_multivariate_knn.ocean_proximity.map({'<1H OCEAN':0,'INLAND':1,'NEAR OCEAN':2,'NEAR BAY':3,'ISLAND':4})
dataset_multivariate_knn = impute_knn.fit_transform(dataset_multivariate_knn)
pd.DataFrame(dataset_multivariate_knn,columns = list(dataset.columns)).isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
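On a tiny array it is possible to follow what `KNNImputer` does by hand. The array below is a made-up example: with `n_neighbors=2`, each gap is filled with the average of that feature over the two closest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical 4x3 array with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(imputed)
```

The gap in the first row becomes 4.0 (the mean of 3.0 and 5.0 from its two nearest rows) and the gap in the third row becomes 5.5 (the mean of 3.0 and 8.0). Because the result depends on distances, features on very different scales may need scaling first.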

References: https://github.com/justmarkham/scikit-learn-tips