
### Importing libraries

In [1]:
import pandas as pd
import numpy as np


Source: https://www.kaggle.com/uciml/pima-indians-diabetes-database/

Mounting Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
diabetes = pd.read_csv('/content/drive/MyDrive/datasets/diabetes.csv')

diabetes.head(10)

index,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [6]:
diabetes.shape

(768, 9)

In [7]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


#### Describe data
The summary below shows a minimum value of 0 for Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin and BMI. Zero pregnancies is plausible, but a zero reading is physiologically impossible for the other five columns, so those zeros stand in for missing values.

In [9]:
diabetes.describe().transpose()

feature,count,mean,std,min,25%,50%,75%,max
Pregnancies,768.0,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
Glucose,768.0,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
BloodPressure,768.0,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
SkinThickness,768.0,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
Insulin,768.0,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
BMI,768.0,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
DiabetesPedigreeFunction,768.0,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
Age,768.0,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
Outcome,768.0,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


The table above shows a minimum of 0 for several columns where that value makes no sense, which indicates that the dataset encodes missing values as 0.
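
One way to check which columns use 0 as a placeholder is to count the zeros per column. The sketch below uses a small hypothetical frame, `toy`, with made-up values rather than the real dataset. The same expression could be applied to `diabetes`.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few diabetes columns (values are made up).
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 0, 3],
    "Glucose": [148, 0, 137, 78],
    "Insulin": [0, 0, 168, 88],
})

# Comparing with 0 gives a boolean frame; summing it counts the zeros in each column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

A zero count alone does not say whether the zero is legitimate. That still requires domain knowledge, as with Pregnancies here.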

#### Replace 0 values with NaN

In [10]:
# Columns where a reading of 0 is physiologically impossible and therefore marks a missing value
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which may not update the DataFrame under pandas copy-on-write
diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)

#### Sum the null values in every column to see which columns have missing values

In [11]:
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
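
Raw counts are easier to judge as a share of the rows: in the output above, Insulin is missing in 374 of 768 rows, about 48.7%. A minimal sketch of that calculation on a hypothetical toy frame (made-up values) could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 137.0, 78.0],
    "Insulin": [np.nan, np.nan, 168.0, 88.0],
    "Age": [50, 31, 33, 26],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing entries.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```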

scikit-learn estimators expect a 2D array of shape (n_samples, n_features), so we reshape the 'SkinThickness' values into a single-column 2D array.

In [12]:
arr = diabetes['SkinThickness'].values.reshape(-1, 1)

arr.shape

(768, 1)
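
The difference between the 1D and 2D shapes can be seen on a small example. The frame `toy` below is hypothetical. Selecting with double brackets is shown as an alternative way to get a single-column 2D array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (values are made up).
toy = pd.DataFrame({"SkinThickness": [35.0, 29.0, np.nan, 23.0]})

col_1d = toy["SkinThickness"].values        # flat vector, shape (4,)
col_2d = col_1d.reshape(-1, 1)              # -1 lets NumPy infer the number of rows
col_2d_alt = toy[["SkinThickness"]].values  # double brackets select a one-column DataFrame

print(col_1d.shape, col_2d.shape, col_2d_alt.shape)
```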

To fill in missing values by inference from the available data, scikit-learn provides the SimpleImputer estimator, which offers several basic strategies for imputing missing values.

### Univariate feature imputation
#### SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
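
To see how the strategies differ, the sketch below applies each of them to the same toy column (made-up values, with one missing entry) and records the value that ends up in the missing slot. The names `col`, `toy_imp` and `filled` are only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature column with one missing entry (values are made up).
col = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    toy_imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # The last row is the missing one; keep the value it was filled with.
    filled[strategy] = float(toy_imp.fit_transform(col)[-1, 0])

# The 'constant' strategy fills with an explicit fill_value instead of a learned statistic.
const_imp = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
filled["constant"] = float(const_imp.fit_transform(col)[-1, 0])

print(filled)
```

With the observed values 1, 2, 2 and 7, the mean is 3 while the median and the mode are both 2, so a single large value affects the mean strategy more than the others.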

##### Here strategy = 'most_frequent', which means missing values are replaced with the most frequent value (mode) in the column.

In [13]:
from sklearn.impute import SimpleImputer

In [14]:
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

imp.fit(diabetes['SkinThickness'].values.reshape(-1, 1))

diabetes['SkinThickness'] = imp.transform(diabetes['SkinThickness'].values.reshape(-1, 1))

In [15]:
diabetes['SkinThickness'].describe()

count    768.000000
mean      29.994792
std        8.886506
min        7.000000
25%       25.000000
50%       32.000000
75%       32.000000
max       99.000000
Name: SkinThickness, dtype: float64

In [16]:
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness                 0
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
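
After fitting, the value a SimpleImputer will use for each column is stored in its `statistics_` attribute. In the summary above, the median and the 75th percentile of SkinThickness are both 32, which is consistent with 32 being the mode that was filled in. A minimal sketch on a hypothetical toy column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column in which 32 is the most frequent observed value (values are made up).
col = np.array([[32.0], [29.0], [32.0], [np.nan], [23.0]])

mode_imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
mode_imp.fit(col)

# statistics_ holds one learned fill value per input column.
print(mode_imp.statistics_)

filled = mode_imp.transform(col)
print(filled.ravel())
```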

##### Here strategy = 'median', which means missing values are replaced with the median of the column.

In [17]:
imp = SimpleImputer(missing_values=np.nan, strategy='median')

imp.fit(diabetes['Glucose'].values.reshape(-1, 1))

diabetes['Glucose'] = imp.transform(diabetes['Glucose'].values.reshape(-1, 1))

In [18]:
diabetes['Glucose'].describe()

count    768.000000
mean     121.656250
std       30.438286
min       44.000000
25%       99.750000
50%      117.000000
75%      140.250000
max      199.000000
Name: Glucose, dtype: float64

In [19]:
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                35
SkinThickness                 0
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

##### Here strategy = 'mean', which means missing values are replaced with the mean of the column.

In [20]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imp.fit(diabetes['BloodPressure'].values.reshape(-1, 1))

diabetes['BloodPressure'] = imp.transform(diabetes['BloodPressure'].values.reshape(-1, 1))

In [21]:
diabetes['BloodPressure'].describe()

count    768.000000
mean      72.405184
std       12.096346
min       24.000000
25%       64.000000
50%       72.202592
75%       80.000000
max      122.000000
Name: BloodPressure, dtype: float64

##### Here strategy = 'constant', which means missing values are replaced with the fixed fill_value (32 here).

In [22]:
imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=32)

imp.fit(diabetes['BMI'].values.reshape(-1, 1))

diabetes['BMI'] = imp.transform(diabetes['BMI'].values.reshape(-1, 1))

In [23]:
diabetes['BMI'].describe()

count    768.000000
mean      32.450911
std        6.875366
min       18.200000
25%       27.500000
50%       32.000000
75%       36.600000
max       67.100000
Name: BMI, dtype: float64

In [24]:
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                     374
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [25]:
diabetes.describe()

statistic,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,394.0,768.0,768.0,768.0,768.0
mean,3.845052,121.65625,72.405184,29.994792,155.548223,32.450911,0.471876,33.240885,0.348958
std,3.369578,30.438286,12.096346,8.886506,118.775855,6.875366,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0,0.0
25%,1.0,99.75,64.0,25.0,76.25,27.5,0.24375,24.0,0.0
50%,3.0,117.0,72.202592,32.0,125.0,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,190.0,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
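
The four cells above repeat the same fit and transform steps with a different strategy per column. One possible way to express that more compactly is a mapping from column name to SimpleImputer arguments. The sketch below assumes a hypothetical toy frame with made-up values. The `strategies` dictionary and the loop are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({
    "SkinThickness": [35.0, np.nan, 35.0, 23.0],
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "BloodPressure": [72.0, 66.0, 64.0, np.nan],
    "BMI": [33.6, np.nan, 23.3, 28.1],
})

# Same strategy per column as in the cells above.
strategies = {
    "SkinThickness": {"strategy": "most_frequent"},
    "Glucose": {"strategy": "median"},
    "BloodPressure": {"strategy": "mean"},
    "BMI": {"strategy": "constant", "fill_value": 32},
}

for column, kwargs in strategies.items():
    imputer = SimpleImputer(missing_values=np.nan, **kwargs)
    # fit_transform returns shape (n, 1); ravel() flattens it back into a column.
    toy[column] = imputer.fit_transform(toy[[column]]).ravel()

print(toy)
```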


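#### Save the file in CSV format

Insulin still contains missing values at this point, so the file is written at the end of the notebook, after multivariate imputation.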

### Multivariate feature imputation

Multivariate feature imputation algorithms use the entire set of available features to estimate the missing values, not just the values of the feature being imputed.


#### IterativeImputer

IterativeImputer can be fitted on the entire dataset, even when several columns have missing values. It models each feature with missing values as a function of the other features, in an iterative round-robin fashion.
In essence, it fits a regression model on all the other features to predict the values of the feature that has missing entries.
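
A single step of this idea can be sketched by hand: regress the incomplete feature on the other feature using only the complete rows, then predict the missing entries. This is a simplified illustration under stated assumptions, not scikit-learn's implementation. `LinearRegression` is swapped in for simplicity (IterativeImputer uses `BayesianRidge` by default), and the real imputer starts from a simple initial fill, cycles over every incomplete feature and repeats for up to `max_iter` rounds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: column 1 is half of column 0; one entry of column 1 is missing.
X = np.array([[4.0, 2.0],
              [24.0, 12.0],
              [8.0, np.nan],
              [28.0, 14.0]])

missing = np.isnan(X[:, 1])

# Fit the regression on complete rows only: predictor is column 0, target is column 1.
# LinearRegression is an assumption of this sketch, not IterativeImputer's default estimator.
model = LinearRegression()
model.fit(X[~missing][:, [0]], X[~missing, 1])

# Predict the missing entries from the observed feature and write them back.
X_filled = X.copy()
X_filled[missing, 1] = model.predict(X[missing][:, [0]])
print(X_filled)
```

Because the toy relationship is exactly linear, the prediction for the row with 8 in the first column comes out at 4, up to floating-point error.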

In [49]:
# IterativeImputer is experimental, so it has to be enabled explicitly before it can be imported
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [28]:
imp = IterativeImputer(max_iter=100, random_state=0)

In [50]:
features = [[4, 2, 1], 
            [24, 12, 6], 
            [8, np.nan, 2], 
            [28, 14, 7], 
            [32, 16, np.nan], 
            [600, 300, 150], 
            [np.nan, 60, 30], 
            [np.nan, np.nan, 1]]

In [30]:
imp.fit(features)

IterativeImputer(max_iter=100, random_state=0)

In [31]:
imp.transform(features)

array([[  4.        ,   2.        ,   1.        ],
       [ 24.        ,  12.        ,   6.        ],
       [  8.        ,   3.99966002,   2.        ],
       [ 28.        ,  14.        ,   7.        ],
       [ 32.        ,  16.        ,   7.92735309],
       [600.        , 300.        , 150.        ],
       [120.00314828,  60.        ,  30.        ],
       [  5.58961604,   2.79614869,   1.        ]])

After fitting, the imputer has learned that each value in a row is half of the previous one, so when it is given another 2D array containing NaN values it fills them in according to that pattern.

Rows with two NaN values are estimated less accurately than rows with one: the last row above is filled as roughly (5.59, 2.80, 1) rather than the (4, 2, 1) that the pattern implies.

In [32]:
X_test = [[np.nan, 24, 12], 
          [36, np.nan, np.nan], 
          [100, np.nan, 25], 
          [np.nan, 6, 3],
          [np.nan, 8, np.nan]]

In [33]:
imp.transform(X_test)

array([[ 48.00364638,  24.        ,  12.        ],
       [ 36.        ,  17.99997418,   8.92708811],
       [100.        ,  49.9996788 ,  25.        ],
       [ 12.00389542,   6.        ,   3.        ],
       [ 16.12053702,   8.        ,   5.86176342]])
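
To check how well each imputed row follows the halving pattern, one option is to compare the ratios between neighbouring columns, which should be close to 2. The sketch below refits the same imputer on the same data as above. The helper names `observed`, `n_missing` and `ratios` are illustrative, and exact imputed values may vary with the scikit-learn version.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

features = [[4, 2, 1], [24, 12, 6], [8, np.nan, 2], [28, 14, 7],
            [32, 16, np.nan], [600, 300, 150], [np.nan, 60, 30],
            [np.nan, np.nan, 1]]
X_test = [[np.nan, 24, 12], [36, np.nan, np.nan], [100, np.nan, 25],
          [np.nan, 6, 3], [np.nan, 8, np.nan]]

imp = IterativeImputer(max_iter=100, random_state=0)
imp.fit(features)
imputed = imp.transform(X_test)

X_test_arr = np.array(X_test, dtype=float)
observed = ~np.isnan(X_test_arr)       # True where a value was given
n_missing = (~observed).sum(axis=1)    # number of NaN values per row

# Ratio of each column to the next one; the pattern implies a ratio of 2.
ratios = imputed[:, :-1] / imputed[:, 1:]
for k, r in zip(n_missing, ratios):
    print(f"missing={k}  ratios={np.round(r, 3)}")
```

Observed entries pass through `transform` unchanged, so any deviation from a ratio of 2 comes from the imputed values only.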

Back to the diabetes dataset, where only Insulin still has missing values:

In [35]:
diabetes.head(10)

index,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,32.0,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,5,116.0,74.0,32.0,,25.6,0.201,30,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
7,10,115.0,72.405184,32.0,,35.3,0.134,29,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
9,8,125.0,96.0,32.0,,32.0,0.232,54,1


In [36]:
diabetes.shape

(768, 9)

In [37]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 54.1 KB


In [38]:
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                     374
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [39]:
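# Separate the features from the label so that Outcome is not used by the imputer
diabetes_features = diabetes.drop('Outcome', axis=1)
# Double brackets keep the label as a one-column DataFrame for the later concat
diabetes_label = diabetes[['Outcome']]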

diabetes_features.head()

index,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148.0,72.0,35.0,,33.6,0.627,50
1,1,85.0,66.0,29.0,,26.6,0.351,31
2,8,183.0,64.0,32.0,,23.3,0.672,32
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33


In [40]:
imp = IterativeImputer(max_iter=10000, random_state=0)

In [41]:
imp.fit(diabetes_features)

IterativeImputer(max_iter=10000, random_state=0)

In [42]:
diabetes_features_arr = imp.transform(diabetes_features)

In [43]:
diabetes_features_arr.shape

(768, 8)

In [44]:
diabetes_features = pd.DataFrame(diabetes_features_arr, columns=diabetes_features.columns)

diabetes_features.head()

index,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.0,148.0,72.0,35.0,219.028414,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,70.34155,26.6,0.351,31.0
2,8.0,183.0,64.0,32.0,270.573172,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0
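
By default, scikit-learn imputers return a plain NumPy array, so column names and the index have to be reattached. In this notebook the default RangeIndex makes the index argument unnecessary, but passing it keeps a later `pd.concat(..., axis=1)` aligned when rows have been filtered or reordered. A minimal sketch on a hypothetical toy frame, using SimpleImputer only to keep the example short:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a non-default index (values are made up).
toy = pd.DataFrame(
    {"Insulin": [np.nan, 94.0, 168.0], "BMI": [33.6, 28.1, 43.1]},
    index=[10, 11, 12],
)

# The imputer output is a NumPy array without labels.
arr = SimpleImputer(strategy="mean").fit_transform(toy)

# Reattach both the column names and the original index.
restored = pd.DataFrame(arr, columns=toy.columns, index=toy.index)
print(restored)
```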


In [45]:
diabetes = pd.concat([diabetes_features, diabetes_label], axis=1)

diabetes.head()

index,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,219.028414,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,70.34155,26.6,0.351,31.0,0
2,8.0,183.0,64.0,32.0,270.573172,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1


In [46]:
diabetes.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [48]:
diabetes.to_csv('/content/drive/MyDrive/datasets/diabetes_processed.csv', index=False)
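
A quick round trip can confirm that the saved file has the expected shape and no missing values. The sketch below writes a hypothetical toy frame to a temporary directory instead of Google Drive. The file name `toy_processed.csv` is an assumption made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the processed dataset (values are made up).
toy = pd.DataFrame({"Glucose": [148.0, 85.0], "Outcome": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_processed.csv")  # hypothetical file name
    # index=False avoids writing the row index as an extra unnamed column.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape, reloaded.isnull().sum().sum())
```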