<a href="https://colab.research.google.com/github/sudama-inc/Machine-Learning-Models/blob/main/missing_values_transformations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Sklearn : Imputer**


> **Imputation of missing values** : Transformers for missing value imputation


1.   **SimpleImputer** : Univariate imputer for completing missing values with simple strategies.

Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

> **SimpleImputer**(*, missing_values=nan, strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False) | strategy: mean, median, most_frequent, constant


2.   **IterativeImputer** : Multivariate imputer that uses the entire set of available feature dimensions to estimate the missing values.
3.   **KNNImputer** : Imputation for completing missing values using k-Nearest Neighbors.

Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.


> **KNNImputer**(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False, keep_empty_features=False)



1.   **SimpleImputer**

In [1]:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = [[np.nan, 2], [6, np.nan], [7, 6]]
imp.fit(X)
# imp.fit_transform(X)
print(imp.transform(X))

[[6.5 2. ]
 [6.  4. ]
 [7.  6. ]]


In [5]:
import scipy.sparse as sp
X = sp.csc_matrix([[1, 2], [0, -1], [8, 4], [-1, 5]])
imp = SimpleImputer(missing_values=-1, strategy='mean')
imp.fit(X)
# X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
print(imp.transform(X).toarray())

[[1.         2.        ]
 [0.         3.66666667]
 [8.         4.        ]
 [3.         5.        ]]


In [None]:
import pandas as pd
df = pd.DataFrame([["a", "x", 1],
                   [np.nan, "y", 1],
                   ["a", np.nan, 3],
                   ["b", "y", np.nan]], columns=['a','b','c'])
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))

[['a' 'x' 1.0]
 ['a' 'y' 1.0]
 ['a' 'y' 3.0]
 ['b' 'y' 1.0]]
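
The signature listed earlier also includes a `constant` strategy and an `add_indicator` flag, which the cells above do not exercise. The following is a minimal sketch, reusing the small matrix from the first example; the fill value `0` is an arbitrary choice made here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = [[np.nan, 2], [6, np.nan], [7, 6]]

# Replace every missing entry with the constant 0 and append one binary
# indicator column for each feature that had missing values during fit.
imp = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
X_filled = imp.fit_transform(X)

# Columns 0-1 hold the imputed data, columns 2-3 flag where values were missing.
print(X_filled)
```

The indicator columns let a downstream model see that a value was imputed rather than observed.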


2.   **IterativeImputer**

In [None]:
from sklearn.experimental import enable_iterative_imputer  # required: IterativeImputer is experimental and must be enabled first
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
# the model learns that the second feature is double the first
print(np.round(imp.transform(X_test)))

[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]


3.   **KNNImputer**

In [None]:
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit_transform(X)

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
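
To see where the imputed `4.` in the first row comes from, the neighbor search can be reproduced by hand. This is only a sketch: it uses `nan_euclidean_distances` from `sklearn.metrics.pairwise`, and the manual neighbor selection works here because every other row has the third feature observed.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])

# Pairwise distances that skip missing coordinates and rescale the rest.
D = nan_euclidean_distances(X)

# Row 0 is missing feature 2. Rank the other rows by distance to row 0
# and keep the two closest ones as donors.
donors = np.argsort(D[0, 1:])[:2] + 1
manual_value = X[donors, 2].mean()

# Compare with what KNNImputer fills in for the same cell.
imputed_value = KNNImputer(n_neighbors=2).fit_transform(X)[0, 2]
print(donors, manual_value, imputed_value)
```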

**Handling Missing Values with Pandas**


1.   fillna(), or ffill() for a forward fill
2.   dropna()
3.   replace()
4.   interpolate()
5.   df.drop(columns=['column_name'])
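
`replace()` and `interpolate()` appear in the list but are not demonstrated in the cells that follow. Here is a small sketch of both on a made-up Series, where `-999.0` is assumed to be a sentinel that marks missing readings (that sentinel is an illustration, not something from this notebook's data).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, -999.0, np.nan, 4.0])

# Turn the sentinel into a real missing value so pandas can handle it.
cleaned = s.replace(-999.0, np.nan)

# Linear interpolation (the default) fills the gap between 1.0 and 4.0.
filled = cleaned.interpolate()
print(filled.tolist())
```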





**fillna**
1.   fillna(0)
2.   ffill() (formerly fillna(method='ffill'), deprecated since pandas 2.1)
3.   fillna(df.mean())
4.   fillna(df.groupby('group_column')['column'].transform('mean'))
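
The last two patterns are not run anywhere in this notebook, so here is a minimal sketch of the group-wise variant. The frame `df_groups` and its column names `group` and `value` are hypothetical and stand in for `group_column` and `column` above.

```python
import numpy as np
import pandas as pd

df_groups = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# transform('mean') returns a Series aligned with the original rows,
# holding the mean of each row's own group (NaNs are ignored).
group_means = df_groups.groupby("group")["value"].transform("mean")

# Each missing value is filled with the mean of its group.
df_groups["value_filled"] = df_groups["value"].fillna(group_means)
print(df_groups)
```

Filling per group keeps the imputed values closer to the rows they belong to than a single global mean would.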

In [7]:
import numpy as np
import pandas as pd

In [9]:
df = pd.DataFrame([
                  [np.nan, 2, np.nan, 0],
                  [3, 4, np.nan, 1],
                  [np.nan, np.nan, np.nan, np.nan],
                  [np.nan, 3, np.nan, 4]
                  ], columns=list("ABCD")
                )
df

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,,,,
3,,3.0,,4.0


**Apply fillna on entire dataframe**

In [11]:
df.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0.0
1,3.0,4.0,0.0,1.0
2,0.0,0.0,0.0,0.0
3,0.0,3.0,0.0,4.0


**Limit how many missing values fillna replaces**

In [13]:
# limit=1 fills at most one NaN per column, starting from the top
df.fillna(0, limit=1)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0.0
1,3.0,4.0,,1.0
2,,0.0,,0.0
3,,3.0,,4.0


**Apply fillna on specific column**

In [14]:
df['A'].ffill()  # forward fill; replaces fillna(method='ffill'), deprecated since pandas 2.1

0    NaN
1    3.0
2    3.0
3    3.0
Name: A, dtype: float64

**Apply fillna with a different fill value per column**

In [12]:
values = {"A": 0, "B": 1, "C": 2, "D": 3}
df.fillna(value=values)

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0.0
1,3.0,4.0,2.0,1.0
2,0.0,1.0,2.0,3.0
3,0.0,3.0,2.0,4.0


**Apply dropna on entire dataframe**

In [15]:
df.dropna()

Unnamed: 0,A,B,C,D


In [16]:
df.dropna(how='all')

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
3,,3.0,,4.0


In [17]:
df.dropna(how='any')

Unnamed: 0,A,B,C,D


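**Apply a threshold on the share of non-missing values**

In [19]:
# thresh is an absolute count of non-NA values, so convert the 20% share into a count first
df.dropna(thresh=int(np.ceil(0.2 * df.shape[1])))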

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
3,,3.0,,4.0
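
`dropna` can also work on columns, or look at only some of the columns when deciding which rows to drop. The sketch below rebuilds the same frame so it runs on its own; the cut-off of two non-missing values and the choice of column `B` are arbitrary, picked only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, 2, np.nan, 0],
     [3, 4, np.nan, 1],
     [np.nan, np.nan, np.nan, np.nan],
     [np.nan, 3, np.nan, 4]],
    columns=list("ABCD"),
)

# Keep only the columns that have at least two non-missing values.
dense_columns = df.dropna(axis=1, thresh=2)

# Drop rows only when column B is missing, ignoring the other columns.
rows_with_b = df.dropna(subset=["B"])

print(dense_columns)
print(rows_with_b)
```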


**Apply dropna on rows**

In [20]:
df.dropna(axis=0, how='any')

Unnamed: 0,A,B,C,D


**Drop columns**

In [24]:
df.drop(['A'], axis=1)

Unnamed: 0,B,C,D
0,2.0,,0.0
1,4.0,,1.0
2,,,
3,3.0,,4.0


In [25]:
df.drop(['C', 'D'], axis=1)

Unnamed: 0,A,B
0,,2.0
1,3.0,4.0
2,,
3,,3.0


In [27]:
df.drop(df.columns[[0, 2]], axis=1)

Unnamed: 0,B,D
0,2.0,0.0
1,4.0,1.0
2,,
3,3.0,4.0


In [28]:
df.drop(columns=df.columns[0:2])  # drop the first two columns by position

Unnamed: 0,C,D
0,,0.0
1,,1.0
2,,
3,,4.0


**Drop rows by index label**

In [30]:
df.drop([0, 1])

Unnamed: 0,A,B,C,D
2,,,,
3,,3.0,,4.0


In [31]:
df.drop(3)

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,,,,


In [32]:
df.drop([df.index[1], df.index[2]])

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
3,,3.0,,4.0
