https://mlforanalytics.com/2020/03/30/handling-missing-values-part-2/?fbclid=IwAR1ur-duOs9WtUAZmik1xDIlu-KnFuGRkuFoygGhxVCLrd54U8wCjYV9Q2k

#### The default strategy for this method (scikit-learn's `SimpleImputer`, used below) is mean imputation.
#### That is, each NaN value in a column is filled with the mean of the non-NaN values in that same column.

#### There are a few points to note here. When imputing missing values in continuous quantitative data, first look at the histogram of the column in question.

- If the distribution is approximately normal (close to a bell curve), impute missing values with the mean.

- If the distribution is skewed, impute with the median, which is less sensitive to extreme values.
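The rule of thumb above can also be approximated numerically instead of by eye. The following is only a sketch: the helper `choose_strategy` and its `skew_threshold` of 1.0 are hypothetical choices made for illustration, not part of scikit-learn, and a histogram should still be checked before trusting the result.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def choose_strategy(column, skew_threshold=1.0):
    """Hypothetical helper: 'mean' for roughly symmetric data, 'median' for skewed data.

    skew_threshold is an assumed cut-off, not an official rule.
    """
    skewness = pd.Series(column).skew()  # NaN values are skipped by default
    return "median" if abs(skewness) > skew_threshold else "mean"


symmetric = [4.0, 5.0, np.nan, 6.0, 5.0, 4.0, 6.0]
skewed = [1.0, 1.0, 2.0, np.nan, 2.0, 3.0, 50.0]  # 50.0 is an extreme value

print(choose_strategy(symmetric))
print(choose_strategy(skewed))

# Use the chosen strategy to fill the missing value in the skewed column.
skewed_imputer = SimpleImputer(strategy=choose_strategy(skewed))
filled = skewed_imputer.fit_transform(np.array(skewed).reshape(-1, 1))
print(filled.ravel())
```

For the skewed column, the missing entry is filled with the median of the observed values rather than a mean that the single extreme value would pull upward.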


#### For categorical data,

  - impute missing values with the mode (the most frequent category).

## Univariate feature imputation

In [1]:
import numpy as np

In [2]:
from sklearn.impute import SimpleImputer

## Simple Imputer
### 1. Numeric values: mean, median

In [3]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)

In [4]:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
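The heading mentions the median, but only the mean is shown above. As a minimal sketch, the same workflow with `strategy='median'` might look like this; the training data here is made up for illustration and includes one extreme value (100) to show why the median can be the safer choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data (an assumption): column 0 contains an outlier, 100.
X_train = [[1, 2], [np.nan, 3], [7, 6], [100, 4]]

imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imp_median.fit(X_train)

X_new = [[np.nan, 2], [6, np.nan]]
result = imp_median.transform(X_new)
print(result)
# [[7.  2. ]
#  [6.  3.5]]
```

Column 0 is filled with 7, the median of 1, 7 and 100, whereas the mean of those values would have been 36.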


-------------------------------------------------------------------------------------------------------------------------------

### 2. Categorical data

The `SimpleImputer` class also supports categorical data represented as string values or pandas categoricals when using the 'most_frequent' or 'constant' strategy:

In [5]:
import pandas as pd

In [6]:
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")
df

     0    1
0    a    x
1  NaN    y
2    a  NaN
3    b    y


In [7]:
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]
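The 'constant' strategy mentioned above is not demonstrated, so here is a minimal sketch of it. The placeholder string `"missing"` and the variable names are assumptions chosen for illustration, and the frame uses `dtype=object` rather than `"category"` to keep the example simple.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same values as the frame above, stored as plain Python objects.
df_obj = pd.DataFrame([["a", "x"],
                       [np.nan, "y"],
                       ["a", np.nan],
                       ["b", "y"]], dtype=object)

# Replace every missing entry with one fixed placeholder value.
imp_const = SimpleImputer(strategy="constant", fill_value="missing")
filled_const = imp_const.fit_transform(df_obj)
print(filled_const)
```

Unlike 'most_frequent', this keeps missingness visible as its own category, which can be useful when the fact that a value is missing carries information.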


-------------------------------------------------------------------------------------------------------------------------------

In [9]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)

In [10]:
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
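Note that the values filled in by `transform` come from the data passed to `fit`, not from `X` itself. A short sketch that inspects the fitted per-column means through the `statistics_` attribute:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# One learned fill value per column, computed from the non-NaN training values.
print(imp_mean.statistics_)
# [7.  3.5 6. ]
```

This is why the NaN in the first column is replaced with 7 and those in the second column with 3.5, regardless of the other values in the array being transformed.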
