# Simple examples of data imputation with pandas and scikit-learn
#### (read and play)

In [0]:
import numpy as np
import pandas as pd
from io import StringIO

## 1. Creating some data with missing values

In [0]:
csvdata = '''
A,B,C,D,E
1,2,3,4,
5,6,,8,
0,,11,12,13
,4,15,16,17
'''

df = pd.read_csv(StringIO(csvdata))
df
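Before deleting or filling anything, it can help to see where the gaps are. The following is a minimal sketch that counts the missing values per column; the variable name `missing_per_column` is only illustrative.

```python
from io import StringIO

import pandas as pd

csvdata = '''
A,B,C,D,E
1,2,3,4,
5,6,,8,
0,,11,12,13
,4,15,16,17
'''

df = pd.read_csv(StringIO(csvdata))

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```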

## 2. Deleting missing values
Radical choice: [delete a whole column](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

In [0]:
df.drop(["E"], axis=1, inplace=True)
df

Recreate the original DataFrame

In [0]:
df = pd.read_csv(StringIO(csvdata))
df

Less radical: [delete rows](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) with missing values in column "C"

In [0]:
df.dropna(axis=0, how='any', subset=["C"], inplace=True)
df

If you do not specify a subset of columns, `dropna()` deletes every row that contains any missing value.

In [0]:
df.dropna(axis=0, how='any', subset=None, inplace=True)
df
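In this sample data every row has at least one missing value, so dropping rows with any gap leaves nothing. As a possible middle ground, the sketch below uses the `thresh` parameter of `dropna()` to keep rows with a minimum number of non-missing values. The threshold of 4 is an arbitrary choice for this example.

```python
from io import StringIO

import pandas as pd

csvdata = '''
A,B,C,D,E
1,2,3,4,
5,6,,8,
0,,11,12,13
,4,15,16,17
'''

df = pd.read_csv(StringIO(csvdata))

# Without inplace=True, dropna() returns a new DataFrame and leaves df untouched
all_dropped = df.dropna()

# Keep only the rows that have at least 4 non-missing values
kept = df.dropna(thresh=4)

print(len(all_dropped))
print(kept)
```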

## 3. Filling missing values using Pandas
Fill missing values with pandas' [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) method.

In [0]:
df = pd.read_csv(StringIO(csvdata))

Impute a constant value

In [0]:
df.fillna(value=200, inplace=True)
df

Impute a constant value for each column

In [0]:
df = pd.read_csv(StringIO(csvdata))

In [0]:
df.fillna(value={"A": 100, "B": 200, "C": 300, "D": 400, "E": 500}, inplace=True)
df
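The per-column values do not have to be hand-picked constants. As one possible variation, the sketch below passes the column means from `df.mean()` to `fillna()`, so each gap is filled with the mean of its own column.

```python
from io import StringIO

import pandas as pd

csvdata = '''
A,B,C,D,E
1,2,3,4,
5,6,,8,
0,,11,12,13
,4,15,16,17
'''

df = pd.read_csv(StringIO(csvdata))

# df.mean() skips missing values and returns one mean per column;
# fillna() matches that Series to the columns by label
column_means = df.mean()
filled = df.fillna(column_means)

print(filled)
```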

[Forward fill](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ffill.html) propagates the last valid observation forward.
Alternatively, [`fillna(method="ffill")`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) can be used, although the `method` argument is deprecated as of pandas 2.1.

In [0]:
df = pd.read_csv(StringIO(csvdata))

In [0]:
df.ffill(inplace=True)
df

[Back fill](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bfill.html)
uses the next valid observation to fill missing values.
Again, the alternative is [`.fillna(method="bfill")`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html), with the same deprecation caveat for the `method` argument as of pandas 2.1.

In [0]:
df = pd.read_csv(StringIO(csvdata))

In [0]:
df.bfill(inplace=True)
df

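Forward fill and back fill can be combined, so that gaps at the start of a column (which forward fill cannot reach) are filled by the back fill.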

In [0]:
df = pd.read_csv(StringIO(csvdata))

In [0]:
df = df.ffill().bfill()
df
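To see why combining the two can be useful, the sketch below counts the missing values that remain after each approach on the sample data. Forward fill alone cannot fill gaps in the first rows, and back fill alone cannot fill gaps in the last rows.

```python
from io import StringIO

import pandas as pd

csvdata = '''
A,B,C,D,E
1,2,3,4,
5,6,,8,
0,,11,12,13
,4,15,16,17
'''

df = pd.read_csv(StringIO(csvdata))

only_ffill = df.ffill()      # leading gaps in column E have no earlier value
only_bfill = df.bfill()      # trailing gap in column A has no later value
both = df.ffill().bfill()    # back fill handles what forward fill left over

print(only_ffill.isna().sum().sum())
print(only_bfill.isna().sum().sum())
print(both.isna().sum().sum())
```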

## 4. Imputing with scikit-learn
### 4.1 [Simple imputing](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)

In [0]:
from sklearn.impute import SimpleImputer

Imputing mean values

In [0]:
df = pd.read_csv(StringIO(csvdata))

In [0]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df["C"].values.reshape(-1,1))
df["C"] = imp.transform(df["C"].values.reshape(-1,1))
df
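The imputer is not limited to a single column. As a sketch, the same mean strategy can be applied to the whole DataFrame in one call with `fit_transform()`; the result is a NumPy array, so it is wrapped back into a DataFrame here. The name `imputed` is only illustrative.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

csvdata = '''
A,B,C,D,E
1,2,3,4,
5,6,,8,
0,,11,12,13
,4,15,16,17
'''

df = pd.read_csv(StringIO(csvdata))

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform() learns one mean per column and fills the gaps in a single step
imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns, index=df.index)

print(imp.statistics_)  # the per-column means learned during fitting
print(imputed)
```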

Imputing a constant value

In [0]:
df = pd.read_csv(StringIO(csvdata))

In [0]:
imp = SimpleImputer(missing_values=np.nan, fill_value=200, strategy='constant')
imp.fit(df["C"].values.reshape(-1,1))
df["C"] = imp.transform(df["C"].values.reshape(-1,1))
df

### 4.2 [Iterative imputing](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) (experimental)

In [0]:
# IterativeImputer is experimental: this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

In [0]:
df = pd.read_csv(StringIO(csvdata))

In [0]:
imp_mean = IterativeImputer(random_state=0)
imp_mean.fit(df)
columns = df.columns
df = pd.DataFrame(imp_mean.transform(df), columns=columns)
df
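Iterative imputing estimates each missing value from the other columns, so only the gaps should change. The sketch below is one way to check this on the sample data: it records which cells were observed and confirms that they are unchanged after imputation. The names `observed` and `imputed` are illustrative, and the imputed numbers depend on the scikit-learn version and defaults, so none are quoted here.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

csvdata = '''
A,B,C,D,E
1,2,3,4,
5,6,,8,
0,,11,12,13
,4,15,16,17
'''

df = pd.read_csv(StringIO(csvdata))

# Boolean mask of the cells that held a value before imputation
observed = df.notna().to_numpy()

imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

# Observed cells should be untouched; only the former gaps hold estimates
unchanged = np.allclose(imputed.to_numpy()[observed], df.to_numpy()[observed])
print(unchanged)
print(imputed.round(2))
```

With only four rows the estimates are rough at best, and scikit-learn may emit a convergence warning on such a small dataset.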