# Missing Data

Real-world data is rarely clean, homogeneous and analysis-friendly. In many cases some values will be missing, and different data sources may indicate those missing entries in different ways.

The two most common strategies to deal with missing values are:
1. Using a mask that indicates missing values
2. Choosing a _sentinel value_ that indicates that an entry is missing

Each alternative has its own drawbacks:

Storing a mask may require the storage of a whole new array containing missing data information, adding overhead in terms of storage and computation.

Choosing a _sentinel value_ reduces the range of valid values that can be represented, and special values like `NaN` are not available for every data type.
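To make the two strategies concrete, here is a minimal sketch using NumPy. The `readings` and `missing` arrays are made up for illustration; the mask version uses NumPy's masked arrays, and the sentinel version uses `NaN`.

```python
import numpy as np

# Hypothetical measurements; the third entry is a placeholder for a missing value
readings = np.array([1.5, 2.0, 0.0, 4.5])
missing = np.array([False, False, True, False])  # True marks a missing entry

# Strategy 1: keep a separate boolean mask next to the data
masked = np.ma.masked_array(readings, mask=missing)
mask_mean = masked.mean()  # masked entries are skipped

# Strategy 2: overwrite the missing entry with a sentinel value (NaN)
with_sentinel = readings.copy()
with_sentinel[missing] = np.nan
sentinel_mean = np.nanmean(with_sentinel)  # NaN entries are skipped
```

Both means are computed from the three valid readings only. The mask needs a second array of the same length, while the sentinel only works here because the data is floating-point.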

## The Pandas approach

Pandas chose to use sentinels for missing data (also called NA), relying on two existing null values: the built-in Python `None` object and the special floating-point `NaN` value.

Since `None` is a Python object, it cannot be stored in a NumPy/Pandas array with a native data type, only in arrays of type `'object'` (arrays of Python objects):

In [1]:
import numpy as np
import pandas as pd

In [2]:
values = np.array([1, None, 3, 4])
values

array([1, None, 3, 4], dtype=object)

While having a representation of missing values is definitely useful, operations on an object array are performed at the Python level, without the performance benefits of NumPy arrays with native types:

In [3]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
62.2 ms ± 895 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
1.77 ms ± 59.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



Using Python objects in an array also means that aggregations like `sum()` or `min()` will raise an error when the array contains a `None` value:

In [4]:
values.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Addition between an integer and `None` is undefined, so the operation cannot be performed.

The other representation of missing data is `NaN` (Not a Number), a special floating-point value used in the IEEE floating-point representation:

In [5]:
values = np.array([1, np.nan, 3, 4])
values.dtype

dtype('float64')

Unlike the object array from before, this array contains only native types and supports fast operations that are pushed into compiled code. One thing to note is that the result of any arithmetic operation with `NaN` will be another `NaN`:

In [6]:
1 + np.nan

nan

In [7]:
0 * np.nan

nan
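A related consequence, sketched below with a made-up array: under IEEE floating-point rules `NaN` does not compare equal to anything, including itself, so testing for it with `==` finds nothing. `np.isnan` is the usual way to locate `NaN` entries.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# Comparing against NaN with == is False everywhere, even at the NaN entry
equal_to_nan = values == np.nan

# np.isnan produces the mask we actually want
is_nan = np.isnan(values)
```

Here `equal_to_nan` contains only `False`, while `is_nan` is `True` at position 1.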

Although in this case aggregates are well defined, they are not always useful:

In [8]:
(values.sum(), values.min(), values.max())

(nan, nan, nan)

NumPy provides `NaN`-aware versions of these aggregates that ignore the missing values:

In [9]:
(np.nansum(values), np.nanmin(values), np.nanmax(values))

(8.0, 1.0, 4.0)

One important thing to note is that, since `NaN` is specifically a floating-point value, there is no equivalent value for other data types.
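As a small illustration of this limitation (the array names are arbitrary), trying to store `NaN` in an integer NumPy array fails, while a floating-point copy of the same data accepts it:

```python
import numpy as np

ints = np.array([1, 2, 3])

try:
    ints[0] = np.nan  # integers have no representation for NaN
    error = None
except ValueError as exc:
    error = exc  # NumPy refuses to convert float NaN to an integer

# Converting to a floating-point dtype first makes room for the sentinel
floats = ints.astype(float)
floats[0] = np.nan
```

The integer array is left unchanged by the failed assignment.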

Pandas can convert between `None` and `NaN` where appropriate:

In [10]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Pandas can also perform automatic type-casting for types that don't have a sentinel value:

In [11]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [12]:
# Here x is converted to a float64 array to be able to work with NaN
x[0] = None
x

0    NaN
1    1.0
dtype: float64
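The same upcasting can be seen without assigning into an existing `Series`. The sketch below uses illustrative names and assumes default (non-nullable) dtypes: an integer `Series` becomes `float64` as soon as a missing entry appears, either at construction time or when reindexing introduces a label with no value.

```python
import pandas as pd

ints = pd.Series([1, 2], index=["a", "b"])  # integer dtype, no missing values

# "c" has no value in the original Series, so it is filled with NaN
reindexed = ints.reindex(["a", "b", "c"])

# A None among integers is converted to NaN at construction time
with_none = pd.Series([1, 2, None])
```

Both `reindexed` and `with_none` end up with dtype `float64`, while `ints` keeps its integer dtype.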

## Operations with Null Values

Pandas provides several methods for dealing with NA values in its data structures:

* `isnull()`: Creates a boolean mask indicating the position of NA values
* `notnull()`: Opposite of `isnull()`
* `dropna()`: Returns a filtered version of the data (without NA values)
* `fillna()`: Returns a copy of the data with missing values filled or imputed

In [13]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [14]:
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool

In [15]:
data.dropna()

0        1
2    hello
dtype: object
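Because `isnull()` and `notnull()` return boolean masks, they can be combined with ordinary indexing and aggregation. A short sketch using the same `data` values as above:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, "hello", None])

# Summing a boolean mask counts the True entries, i.e. the missing values
n_missing = data.isnull().sum()

# Indexing with the notnull() mask keeps only the valid entries
kept = data[data.notnull()]
```

Here `n_missing` is 2, and `kept` holds the same entries that `data.dropna()` returns.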

`DataFrame` provides more options:

In [16]:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


It is impossible to drop a single value from a `DataFrame`, only full rows or full columns. By default, `dropna()` will drop all rows that contain a null value:

In [17]:
df.dropna()

     0    1  2
1  2.0  3.0  5


In [18]:
# axis='columns' is equivalent to axis=1
df.dropna(axis='columns')

   2
0  2
1  5
2  6


The `how` and `thresh` parameters allow finer control over which rows or columns are dropped. The default, `how='any'`, drops any row or column (depending on the `axis` parameter) that contains at least one null value. Specifying `how='all'` drops only the rows/columns in which every value is null:

In [19]:
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [20]:
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


`thresh` lets us specify a minimum number of non-null values for the row/column to be **kept**:

In [21]:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN
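One way to check what `thresh` does is to count the non-null values per row by hand. The sketch below rebuilds the same `df` (including the all-`NaN` column `3`) and compares `dropna(thresh=3)` with an explicit boolean selection; `non_null_per_row` and `by_hand` are names chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# Number of non-null values in each row: 2, 3 and 2
non_null_per_row = df.notnull().sum(axis=1)

# Keep rows with at least 3 non-null values, in two equivalent ways
by_thresh = df.dropna(thresh=3)
by_hand = df[non_null_per_row >= 3]
```

Only row `1` reaches the threshold, so both selections contain that single row.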


Rather than dropping NA values, Pandas also provides the option to replace them with a valid value:

In [22]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

We can replace NA entries with a single value:

In [23]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
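The fill value does not have to be a constant chosen in advance. As a simple sketch of imputation, the missing entries can be replaced with the mean of the valid ones; whether the mean is a sensible replacement depends on the data and is only an assumption here.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

# mean() skips NA values by default, so this is the mean of 1, 2 and 3
mean_value = data.mean()

# Replace every NA entry with that mean
filled = data.fillna(mean_value)
```

In this case `mean_value` is `2.0`, so entries `b` and `d` both become `2.0`.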

We can forward-fill, propagating the previous valid value forward:

In [24]:
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas releases
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

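Or, similarly, back-fill, propagating the next valid value backward:

In [25]:
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas releases
data.bfill()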

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
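One caveat worth sketching: a forward-fill has nothing to propagate into a gap at the start of the data, and a back-fill has nothing for a gap at the end. The `gappy` series below is made up to show both cases.

```python
import numpy as np
import pandas as pd

gappy = pd.Series([np.nan, 1.0, np.nan], index=list("abc"))

forward = gappy.ffill()   # "a" stays NaN: no earlier value exists
backward = gappy.bfill()  # "c" stays NaN: no later value exists

# Chaining the two fills covers gaps at both ends
both = gappy.ffill().bfill()
```

After chaining, every entry of `both` is `1.0`.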

The options for `DataFrame` objects are similar, but it is also possible to specify an `axis` along which the fill takes place.
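As a sketch of the `axis` option, the example below rebuilds the `df` used earlier and forward-fills along the columns, so each missing value takes the value to its left in the same row. It uses the `ffill()` method directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# Fill each row from left to right (axis='columns' is equivalent to axis=1)
filled = df.ffill(axis="columns")
```

The first row becomes `1.0, 1.0, 2.0, 2.0`. The leading `NaN` in the last row is left as it is, because there is no earlier value in that row to propagate.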