### Missing Data

In [1]:
# Hands on
import numpy as np
import pandas as pd

What does "missing data" mean? What counts as a missing value? It depends on where the data comes from and the context in which it was generated. For example, in a survey, a Salary field holding an empty value, the number 0, or an invalid value (a string, for example) can all be considered "missing data". These ideas are related to the values that Python treats as "falsy".

In [2]:
# Falsy values, which often signal "missing" data
falsy_values = (0, False, None, '', [], {})
falsy_values

(0, False, None, '', [], {})

In [3]:
any(falsy_values)

False

In Python, 'any()' is a built-in function that returns 'True' if at least one element of an iterable is truthy, and 'False' if every element is falsy or the iterable is empty. It accepts any iterable, such as a list, tuple, or set, and in effect applies a logical OR across the elements. Every element of falsy_values is falsy, which is why the result above is 'False'.
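
As a small sketch of this behavior, the cell below compares 'any()' on an all-falsy tuple, on a tuple with one truthy element, and on an empty list. The name mixed_values and its contents are hypothetical, chosen only for illustration.

```python
# Same tuple as in the cell above.
falsy_values = (0, False, None, '', [], {})

# Hypothetical sample with exactly one truthy element ('salary').
mixed_values = (0, None, 'salary', [])

# Every element is falsy, so any() returns False.
all_falsy = any(falsy_values)

# One truthy element is enough for any() to return True.
has_truthy = any(mixed_values)

# An empty iterable has no truthy element, so any() returns False.
empty_result = any([])

# bool() shows how Python evaluates each value on its own.
truthiness = [bool(v) for v in mixed_values]

print(all_falsy, has_truthy, empty_result)
print(truthiness)
```

Note that truthiness is only a rough proxy for "missing": a salary of 0 is falsy but may be a perfectly valid answer.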

In [4]:
# NumPy has a special "nullable" value for numbers: NaN ("Not a Number").
np.nan
# np.nan behaves like a virus: everything it touches becomes np.nan

nan

In [5]:
3 + np.nan

nan

In [6]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [7]:
a.sum()

nan

In [8]:
a.mean()

nan

This is still better than working with plain None values, which would have raised an exception in the previous examples. When an array is created with a float dtype, NumPy converts None to nan:

In [11]:

a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')
a

array([ 1.,  2.,  3., nan, nan,  4.])
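
The claim that None raises an exception can be checked directly. The following is a minimal sketch; the variable names are hypothetical, and it assumes an array built without an explicit float dtype, which NumPy stores as Python objects.

```python
import numpy as np

# Arithmetic with None is undefined in Python, so it raises TypeError.
try:
    3 + None
    none_raised = False
except TypeError:
    none_raised = True

# Without an explicit float dtype, NumPy keeps None as a Python object.
obj_array = np.array([1, 2, None])

# Summing the object array runs into the same TypeError.
try:
    obj_array.sum()
    sum_raised = False
except TypeError:
    sum_raised = True

# The same arithmetic with np.nan returns nan instead of failing.
nan_result = 3 + np.nan

print(none_raised, obj_array.dtype, sum_raised, nan_result)
```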

In [12]:
# NumPy also supports an infinity value
np.inf
# which also behaves like a virus

inf

In [13]:
3 + np.inf

inf

In [14]:
np.inf / 3

inf

In [15]:
np.inf / np.inf

nan

In [19]:
# b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=np.float)
# (the np.float alias was removed in NumPy 1.24, so the dtype argument is dropped below)

b = np.array([1, 2, 3, np.inf, np.nan, 4])

In [20]:
b.sum()

nan

### Checking for nan or inf

Two functions, np.isnan and np.isinf, perform these checks:

In [21]:
np.isnan(np.nan)

True

In [22]:
np.isinf(np.inf)

True

In [23]:
np.isnan(np.inf)

False

In [24]:
np.isinf(np.nan)

False

In [26]:
# np.isfinite covers both cases at once: it returns False for nan and for inf
np.isfinite(np.nan), np.isfinite(np.inf)

(False, False)

In [27]:
# np.isnan and np.isinf also take arrays as inputs, and return boolean arrays as results:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False,  True, False, False])

In [28]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False, False,  True, False])

In [29]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([ True,  True,  True, False, False,  True])
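
One reason these helper functions are needed is that nan never compares equal to anything, not even to itself, so an ordinary == test cannot find it. A short sketch, using the same array as the cells above (the variable names are hypothetical):

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, np.inf, 4])

# nan is not equal to anything, including itself, so == never matches it.
eq_mask = values == np.nan

# np.isnan is the reliable way to locate nan entries.
nan_mask = np.isnan(values)

# inf, by contrast, does compare equal to itself.
inf_eq_mask = values == np.inf

print(eq_mask.any(), nan_mask.sum(), inf_eq_mask.sum())
```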

Note: Infinite values are not very common. From now on, we'll work only with np.nan.

### Filtering Them Out

Whenever you perform an operation on a NumPy array that might contain missing values, you need to filter them out first to avoid nan propagation. We'll combine the np.isnan function from the previous section with boolean array indexing for this purpose:

In [30]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [31]:
a[~np.isnan(a)]
# The ~ operator negates the boolean array element-wise

array([1., 2., 3., 4.])

In [32]:
# for this array, which contains no inf values, the above is equivalent to:
a[np.isfinite(a)]

array([1., 2., 3., 4.])
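
The two filters only agree when the array has no infinite values. The sketch below uses a hypothetical array c that contains both nan and inf to show where they differ:

```python
import numpy as np

# Hypothetical array that contains both nan and inf.
c = np.array([1, 2, np.inf, np.nan, 4])

# ~np.isnan drops only nan, so inf survives the filter.
without_nan = c[~np.isnan(c)]

# np.isfinite drops nan and infinite values alike.
finite_only = c[np.isfinite(c)]

print(without_nan)
print(finite_only)
```

Which filter is appropriate depends on whether inf should count as a legitimate value in the data at hand.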

In [33]:
# with the missing values filtered out, the operations now work:
a[np.isfinite(a)].sum()

10.0

In [34]:
a[np.isfinite(a)].mean()

2.5
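
As an alternative to filtering by hand, NumPy also provides nan-aware reductions such as np.nansum and np.nanmean. They are not used elsewhere in this notebook; the sketch below only illustrates that, for this array, they give the same results as the manual filter.

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# Manual filtering, as in the cells above.
filtered_sum = a[~np.isnan(a)].sum()
filtered_mean = a[~np.isnan(a)].mean()

# nan-aware reductions skip nan entries without building a mask first.
nan_sum = np.nansum(a)
nan_mean = np.nanmean(a)

print(filtered_sum, nan_sum)
print(filtered_mean, nan_mean)
```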