# Data cleaning - How to impute missing values

## Introduction

This notebook contains:
  * Horse Colic dataset
  * Statistical Imputation with SimpleImputer
    1. SimpleImputer data transform
    2. SimpleImputer and model evaluation
    3. Comparing different imputed statistics
    4. SimpleImputer transform when making predictions

## Horse colic dataset

In [1]:
import pandas as pd
import numpy as np

path = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv"
data = pd.read_csv(path, header=None, na_values='?')
data.head()

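|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 2.0 | 1 | 530101 | 38.5 | 66.0 | 28.0 | 3.0 | 3.0 | NaN | 2.0 | ... | 45.0 | 8.4 | NaN | NaN | 2.0 | 2 | 11300 | 0 | 0 | 2 |
| 1 | 1.0 | 1 | 534817 | 39.2 | 88.0 | 20.0 | NaN | NaN | 4.0 | 1.0 | ... | 50.0 | 85.0 | 2.0 | 2.0 | 3.0 | 2 | 2208 | 0 | 0 | 2 |
| 2 | 2.0 | 1 | 530334 | 38.3 | 40.0 | 24.0 | 1.0 | 1.0 | 3.0 | 1.0 | ... | 33.0 | 6.7 | NaN | NaN | 1.0 | 2 | 0 | 0 | 0 | 1 |
| 3 | 1.0 | 9 | 5290409 | 39.1 | 164.0 | 84.0 | 4.0 | 1.0 | 6.0 | 2.0 | ... | 48.0 | 7.2 | 3.0 | 5.3 | 2.0 | 1 | 2208 | 0 | 0 | 1 |
| 4 | 2.0 | 1 | 530255 | 37.3 | 104.0 | 35.0 | NaN | NaN | 6.0 | 2.0 | ... | 74.0 | 7.4 | NaN | NaN | 2.0 | 2 | 4300 | 0 | 0 | 2 |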


In [11]:
n_miss = data.isnull().sum()
print(f"Missing percentage:\n{n_miss/data.shape[0]*100}")

Missing percentage:
0      0.333333
1      0.000000
2      0.000000
3     20.000000
4      8.000000
5     19.333333
6     18.666667
7     23.000000
8     15.666667
9     10.666667
10    18.333333
11    14.666667
12    18.666667
13    34.666667
14    35.333333
15    82.333333
16    34.000000
17    39.333333
18     9.666667
19    11.000000
20    55.000000
21    66.000000
22     0.333333
23     0.000000
24     0.000000
25     0.000000
26     0.000000
27     0.000000
dtype: float64
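
The same per-column summary can also be built as a single table holding both counts and percentages. The sketch below uses a small made-up frame (`toy`, with hypothetical columns `a` and `b`) rather than the horse colic data, so it runs without a download; `isnull().mean()` gives the fraction of missing rows directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration; not part of the horse colic data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# isnull() marks each missing cell; sum() counts them per column,
# and mean() gives the per-column fraction of missing rows.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(summary)
```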


### 1. SimpleImputer data transform

In [16]:
from sklearn.impute import SimpleImputer

# Column 23 is used as the target; every other column is an input feature.
ix = [i for i in range(data.shape[1]) if i != 23]
X = data.iloc[:, ix]
y = data.iloc[:, 23]

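# Count NaNs with pandas: the built-in sum() over a DataFrame adds up its column labels, not its values.
print(f"Missing: {X.isnull().sum().sum()}")

Missing: 1605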


In [19]:
# Learn each column's mean from the observed values, then fill the NaNs with it.
imputer = SimpleImputer(strategy='mean')
imputer.fit(X)
Xtrans = imputer.transform(X)  # a NumPy array by default
print(f"Missing: {np.isnan(Xtrans).sum()}")

Missing: 0
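
To see what the `mean` strategy does to individual cells, it can help to run it on a tiny array where the column means are easy to check by hand. The following is a minimal sketch on made-up values (`toy`, `toy_imputer`, and `toy_filled` are illustrative names, not part of the notebook); the fitted `statistics_` attribute holds the fill value learned for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data; values are made up for illustration.
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

toy_imputer = SimpleImputer(strategy="mean")
toy_filled = toy_imputer.fit_transform(toy)

# statistics_ stores one fill value per column, computed from the
# non-missing entries seen during fit.
print(toy_imputer.statistics_)
print(toy_filled)
```

Each NaN is replaced by the mean of the observed values in its own column, and the observed values themselves are left unchanged.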
