### Handling Missing Data with Pandas

Pandas builds on numpy's selection capabilities and adds a number of convenient methods for handling missing values. Let's look at them one at a time.

In [1]:
# Hands on
import numpy as np
import pandas as pd

### Pandas utility functions

Like numpy, pandas has a few utility functions to detect null values (isnull and notnull are aliases of isna and notna):

In [2]:
pd.isnull(np.nan)

True

In [3]:
pd.isnull(None)

True

In [4]:
pd.isna(np.nan)

True

In [5]:
pd.isna(None)

True

In [6]:
# notnull/notna are the opposite of isnull/isna
pd.notnull(np.nan)

False

In [7]:
pd.notnull(None)

False

In [8]:
pd.notna(np.nan)

False

In [9]:
pd.notnull(3)

True

In [10]:
# The above functions also work with Series and DataFrames
pd.isnull(pd.Series([1, np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [11]:
pd.notnull(pd.Series([1, np.nan, 7]))

0     True
1    False
2     True
dtype: bool

In [12]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True
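As a small sketch of what these helpers count as missing: None, np.nan and pd.NaT are detected, while values that merely look "empty", such as 0 or an empty string, are not. The sample values below are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Values that pandas treats as missing
missing_like = [None, np.nan, pd.NaT]
# Values that only look "empty" but are regular data
not_missing = [0, "", False]

missing_flags = [bool(pd.isna(v)) for v in missing_like]
regular_flags = [bool(pd.isna(v)) for v in not_missing]

print(missing_flags)
print(regular_flags)
```

This matters when a dataset encodes missing entries as empty strings or zeros: those need to be converted to NaN first if they should be handled by the methods in this section.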


### Pandas Operations with Missing Values

Pandas manages missing values more gracefully than numpy. NaNs no longer behave like "viruses": aggregations such as count, sum and mean skip them by default.

In [13]:
pd.Series([1, 2, np.nan]).count()

2

In [15]:
pd.Series([1, 2, np.nan]).sum()

3.0

In [16]:
pd.Series([2, 2, np.nan]).mean()

2.0
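The skipping only applies to aggregations and can be switched off. A minimal sketch, assuming the skipna parameter that pandas aggregations accept (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1, 2, np.nan])

# Default behavior: the NaN is skipped
total_skipping = s_demo.sum()

# With skipna=False the NaN propagates, as it would in numpy
total_strict = s_demo.sum(skipna=False)

# Element-wise arithmetic still propagates NaN position by position
shifted = s_demo + 1

print(total_skipping, total_strict)
print(shifted)
```

So a NaN is ignored when summarizing a Series, but it is still there in the data and shows up again in element-wise results.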

### Filtering Missing Data

As we saw with numpy, we can combine boolean selection with pd.isnull or pd.notnull to filter NaN and null values in or out:

In [17]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [18]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [19]:
pd.isnull(s)

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [20]:
pd.notnull(s).sum()

4

In [21]:
pd.isnull(s).sum()

2

In [22]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

In [23]:
s[pd.isnull(s)]

3   NaN
4   NaN
dtype: float64

In [24]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [25]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [26]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
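Because a boolean mask behaves like 0s and 1s, the same trick that counts nulls with sum() can also give the share of missing values. A short sketch on the same Series (missing_fraction is an illustrative name):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# True counts as 1 and False as 0, so the mean of the mask
# is the fraction of entries that are missing
missing_fraction = s.isnull().mean()

print(f"{missing_fraction:.1%} of the values are missing")
```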

### Dropping Null Values

Combining boolean selection with notnull() is a bit verbose and repetitive. As we said before, any repetitive task probably has a better, more DRY alternative. In this case, we can use the dropna method.

In [27]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [28]:
s.dropna()
# the entries at index 3 and 4 were dropped since they're NaN

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
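Note that dropna returns a new Series and keeps the original index labels. A brief sketch of both points; the reset_index step is an optional extra that is not part of the original notebook:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# dropna returns a new object; s itself is left untouched
cleaned = s.dropna()
print(len(s), len(cleaned))

# The surviving rows keep their original labels (0, 1, 2, 5).
# reset_index(drop=True) renumbers them from 0 if that is preferred.
renumbered = cleaned.reset_index(drop=True)
print(renumbered)
```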

### Dropping Null Values on DataFrames

You saw how simple it is to drop NAs from a Series. DataFrames need a bit more care, because you can't drop single values: you can only drop entire rows or columns. Let's start with a sample DataFrame, df.

In [29]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

In [30]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [31]:
df.shape

(4, 4)

In [32]:
df.info()
# this shows how many non-null values there are in each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes


In [33]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [34]:
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64

In [35]:
df.dropna()

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


Here we're dropping rows: by default, every row that contains at least one null value is removed from the DataFrame. You can also use the axis parameter to drop the columns that contain null values instead.

In [36]:
df.dropna(axis='columns')
# only Column D is left since it's the only column with no null values

   Column D
0         5
1         8
2        34
3       110


In [37]:
df.dropna(axis='rows')
# only row 2 is left since it's the only row with no null values

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


By default, any row or column that contains at least one null value is dropped, which can be too extreme depending on the case. You can control this behavior with the how parameter, which can be either 'any' or 'all':

In [38]:
df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})

In [42]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


In [44]:
df2.dropna(how='all')

   Column A  Column B  Column C
0       1.0       2.0       NaN
2      30.0      31.0     100.0


In [45]:
# how='any' is the default behavior
df2.dropna(how='any')

   Column A  Column B  Column C
2      30.0      31.0     100.0


You can also use the thresh parameter to indicate a threshold (a minimum number) of non-null values for the row/column to be kept:

In [46]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [47]:
df.dropna(thresh=3)
# any row in the df that has two or fewer non-null values will be removed
# while the rows with three or more non-null values will be retained

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34


In [48]:
df.dropna(thresh=3, axis='columns')
# we specified the axis, the default is rows

   Column B  Column C  Column D
0       2.0       NaN         5
1       8.0       9.0         8
2      31.0      32.0        34
3       NaN     100.0       110
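Besides how and thresh, dropna also accepts a subset parameter that limits the null check to specific columns. This option is not covered in the original notebook; the sketch below applies it to the same sample DataFrame, and rows_with_a is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Only look at Column A when deciding which rows to drop;
# nulls in the other columns are tolerated
rows_with_a = df.dropna(subset=['Column A'])

print(rows_with_a)
```

Row 0 survives here even though its Column C is null, because only Column A is inspected.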


### Filling Null Values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. This depends heavily on your context and on the dataset you're working with. Sometimes a NaN can be replaced with a 0, sometimes with the mean of the sample, and other times with the closest value. We'll show you the different methods and mechanisms so you can apply them to your own problem.

In [49]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [50]:
# Filling nulls with an arbitrary value
s.fillna(0)

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

In [51]:
# Filling the nulls with the mean of the set
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

### Filling nulls with contiguous (close) values

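The method argument fills null values with neighboring values: 'ffill' propagates the last valid value forward, and 'bfill' propagates the next valid value backward. (Recent pandas releases deprecate this argument in favor of the dedicated s.ffill() and s.bfill() methods.)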

In [52]:
s.fillna(method='ffill')

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
5    4.0
dtype: float64

In [53]:
s.fillna(method='bfill')

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
dtype: float64

In [54]:
# This can still leave null values at the extremes of the Series/DataFrame
pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')

0    NaN
1    3.0
2    3.0
3    9.0
dtype: float64

In [55]:
pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64
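The same fills are available as the dedicated ffill() and bfill() methods, and chaining the two is one possible way to cover the nulls left at the extremes. A minimal sketch, assuming that filling a leading gap from the next valid value is acceptable for your data:

```python
import numpy as np
import pandas as pd

edges = pd.Series([np.nan, 3, np.nan, 9])

# Forward fill alone leaves the leading NaN in place
forward_only = edges.ffill()

# A backward fill afterwards covers what the forward fill could not reach
filled = edges.ffill().bfill()

print(forward_only)
print(filled)
```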

### Filling Null Values on DataFrames

The fillna method also works on DataFrames, and it works similarly. The main differences are that you can specify the axis (as usual, rows or columns) along which to fill the values (especially for method-based fills) and that you have more control over the values passed:

In [56]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [59]:
# fill null values with 0 in Column A
# fill null values with 99 in Column B
# fill null values with the column mean in Column C
df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()})



   Column A  Column B  Column C  Column D
0       1.0       2.0      47.0         5
1       0.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3       0.0      99.0     100.0       110


In [60]:
df.fillna(method='ffill', axis=1)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       5.0
1       NaN       8.0       9.0       8.0
2      30.0      31.0      32.0      34.0
3       NaN       NaN     100.0     110.0
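The per-column dictionary can also be replaced by a Series of values indexed by column name. As a sketch of that idea (not part of the original notebook), passing df.mean() fills every column with its own mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# df.mean() returns one mean per column (NaNs skipped);
# fillna matches those values to the columns by name
column_means = df.mean()
filled_with_means = df.fillna(column_means)

print(filled_with_means)
```

Whether the mean is a sensible replacement depends on the data; this only shows the mechanism.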


### Checking if there are NAs

The question is: Does this Series or DataFrame contain any missing value? The answer should be yes or no: True or False. How can you verify it?

### Example 1: Checking the length

If there are missing values, s.dropna() will have fewer elements than s:

In [61]:
s.dropna().count()

4

In [62]:
missing_values = len(s.dropna()) != len(s)
missing_values
# there are missing values

True

In [64]:
# There's also a count method that excludes NaNs from its result; compare it with len
len(s)

6

In [65]:
s.count()
# this means 6 - 4 = 2 null values

4

In [66]:
# So we could just do:
missing_values = s.count() != len(s)
missing_values

True

### More Pythonic Solution: any

The any and all methods check whether there's any True value in a Series or whether all of its values are True. They work the same way as Python's built-in any() and all():

In [67]:
pd.Series([True, False, False]).any()
# check if there's any True value in our series

True

In [68]:
pd.Series([True, False, False]).all()
# check if all values in series are True

False

In [69]:
pd.Series([True, True, True]).all()

True

The isnull() method returns a Boolean Series with True wherever there is a NaN:

In [70]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [71]:
# we can just use the 'any' method on the Boolean Series returned:
pd.Series([1, np.nan]).isnull().any()

True

In [72]:
pd.Series([1, 2]).isnull().any()
# it returned false since there's no null value in our series

False

In [73]:
s.isnull().any()
# it returned True since we have NaN in our series

True

A stricter version runs the check on the underlying NumPy array (the values of the Series) instead of the Series itself:

In [74]:
s.isnull().values

array([False, False, False,  True,  True, False])

In [75]:
s.isnull().values.any()

True
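The same checks extend to DataFrames, where any() works along an axis. A short sketch on the sample DataFrame used earlier (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# One True/False per column: does the column contain any null?
columns_with_nulls = df.isnull().any()

# One True/False per row: does the row contain any null?
rows_with_nulls = df.isnull().any(axis=1)

# A single yes/no answer for the whole DataFrame
has_any_null = df.isnull().values.any()

print(columns_with_nulls)
print(rows_with_nulls)
print(has_any_null)
```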