Imputing with Scikit-learn: One of the most common techniques is mean imputation. We replace each missing value with the mean of the known values in the same column. We will use scikit-learn’s “SimpleImputer” class, which always works column by column. Other strategies are also available, such as strategy='median' or strategy='most_frequent'.


In [1]:
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # replaces the Imputer class removed in scikit-learn 0.22

In [2]:
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

In [3]:
df = pd.read_csv(StringIO(csv_data))

In [4]:
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

In [5]:
df.head()

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


In [6]:
# Replace each NaN with the mean of its column (SimpleImputer has no axis parameter)
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)



In [7]:
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
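
The other strategies mentioned above can be tried in the same way. The following is a minimal sketch, assuming the same small CSV sample and scikit-learn 0.22 or later; the variable names (median_imputer, mode_imputer, and so on) are chosen here for illustration only.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# Median of the known values in each column
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
median_filled = median_imputer.fit_transform(df.values)

# Most frequent known value in each column
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
mode_filled = mode_imputer.fit_transform(df.values)

print(median_filled)
print(mode_filled)
```

With only two known values per incomplete column, the median coincides with the mean in this sample, so the difference between the strategies only becomes visible on larger or more skewed data.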

Imputing with Pandas: Our data can have missing or blank values for different reasons. There can be errors in the data collection process, or particular fields could have been left blank in a survey. Missing values usually appear as blank spaces in our data, or they are mapped to placeholders such as NaN (Not a Number). Here is an example with two missing values.


In [18]:
from io import StringIO
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))
print(df)

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


We can also count the missing values by chaining the isnull() and sum() methods. This gives the number of missing values in each column.

In [19]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
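
The same pattern can be pointed at rows, or turned into a fraction. This is a small sketch on the same sample data; the names missing_per_row and missing_fraction are illustrative only.

```python
from io import StringIO

import pandas as pd

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# axis=1 sums across the columns, giving the number of NaN values in each row
missing_per_row = df.isnull().sum(axis=1)

# The mean of a boolean mask is the fraction of missing values in each column
missing_fraction = df.isnull().mean()

print(missing_per_row)
print(missing_fraction)
```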

Eliminate rows with missing data: We can use the “dropna()” method. Setting the “inplace” parameter to True modifies the DataFrame directly, so the same variable keeps only the rows without missing values.


In [20]:
df.dropna(inplace=True)  
print(df)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [21]:
df = df.dropna()  # alternative to inplace=True: assign the result back to df

In [22]:
df

     A    B    C    D
0  1.0  2.0  3.0  4.0


Some useful parameters for the dropna() method:

df.dropna(axis=1)        # Drops columns with at least one NaN in any row
df.dropna(how='all')     # Only drops rows where all columns are NaN
df.dropna(subset=['C'])  # Drops rows where NaN appears in specific columns (here: 'C')
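
To see what each of these parameters does, here is a short sketch that applies them to the three-row sample used earlier. The result names (no_nan_columns, not_all_nan, c_present) are illustrative only.

```python
from io import StringIO

import pandas as pd

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# axis=1: drop every column that contains at least one NaN (C and D here)
no_nan_columns = df.dropna(axis=1)

# how='all': drop only rows in which every value is NaN (none here)
not_all_nan = df.dropna(how='all')

# subset=['C']: drop rows that have NaN in column C (row 1 here)
c_present = df.dropna(subset=['C'])

print(no_nan_columns.columns.tolist())
print(len(not_all_nan))
print(c_present.index.tolist())
```

None of these calls changes df itself, because inplace is left at its default of False.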


Although removing records with missing information seems convenient, we may remove too many rows or columns, and losing that data can reduce the performance of our models.

Imputing missing values: For these reasons, we might instead use interpolation techniques to estimate the missing values from the other training samples in the dataset.


In [23]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
df.iloc[3:5,0] = np.nan
df.iloc[4:6,1] = np.nan
df.iloc[5:8,2] = np.nan
print(df)

          A         B         C
0 -0.840216 -0.847114  0.041934
1 -1.123485  0.573704 -0.348643
2 -1.375232  0.087336  0.520995
3       NaN -0.385076 -0.022960
4       NaN       NaN  0.628753
5 -0.234647       NaN       NaN
6  0.685074 -1.393554       NaN
7 -0.853454 -1.772317       NaN
8  0.887729  0.638184 -0.033063
9 -0.096867 -0.436735  1.413863


Fill missing values with the mean of each column:


In [24]:
print(df.fillna(df.mean()))

          A         B         C
0 -0.840216 -0.847114  0.041934
1 -1.123485  0.573704 -0.348643
2 -1.375232  0.087336  0.520995
3 -0.368887 -0.385076 -0.022960
4 -0.368887 -0.441946  0.628753
5 -0.234647 -0.441946  0.314411
6  0.685074 -1.393554  0.314411
7 -0.853454 -1.772317  0.314411
8  0.887729  0.638184 -0.033063
9 -0.096867 -0.436735  1.413863


Fill specific columns:

In [25]:
print(df.fillna(df.mean()['B':'C']))

          A         B         C
0 -0.840216 -0.847114  0.041934
1 -1.123485  0.573704 -0.348643
2 -1.375232  0.087336  0.520995
3       NaN -0.385076 -0.022960
4       NaN -0.441946  0.628753
5 -0.234647 -0.441946  0.314411
6  0.685074 -1.393554  0.314411
7 -0.853454 -1.772317  0.314411
8  0.887729  0.638184 -0.033063
9 -0.096867 -0.436735  1.413863


Important Note: The “fillna()” method has an “inplace” parameter, which is False by default. Unless it is set to True, or the result is assigned back to a variable, the original DataFrame is not updated.
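
The difference can be checked on a tiny example. This is a minimal sketch with a made-up two-column DataFrame; the data and the variable names are assumptions for illustration, not part of the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one NaN in each column
df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, 7.0]})

# Without inplace=True, fillna returns a new DataFrame and leaves df untouched
filled = df.fillna(df.mean())
nan_count_original = int(df.isnull().sum().sum())
nan_count_filled = int(filled.isnull().sum().sum())

# With inplace=True, the DataFrame itself is modified
df_inplace = df.copy()
df_inplace.fillna(df_inplace.mean(), inplace=True)
nan_count_inplace = int(df_inplace.isnull().sum().sum())

print(nan_count_original, nan_count_filled, nan_count_inplace)
```

Assigning the result back, as in df = df.fillna(df.mean()), has the same effect as inplace=True and makes the update explicit in the code.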
