# Working with missing data

In [4]:
import numpy as np
import pandas as pd

As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.
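
As a minimal sketch of that idea, the snippet below builds a small object-dtype Series (the values are made up for illustration) and checks which entries pandas regards as missing. It uses ``pd.isna()``, which is introduced later in this section.

```python
import numpy as np
import pandas as pd

# An object-dtype Series mixing real values with three kinds of missing marker.
mixed = pd.Series([1.5, np.nan, None, "text", pd.NaT], dtype=object)

# pd.isna() flags NaN, None, and NaT alike.
missing_mask = pd.isna(mixed)
print(missing_mask.tolist())  # [False, True, True, False, True]
```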

In [44]:
df = pd.DataFrame(
    np.random.randn(5, 3),
    index=["a", "c", "e", "f", "h"],
    columns=["one", "two", "three"],
)
df["four"] = "bar"
df["five"] = df["one"] > 0
df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
df2

,one,two,three,four,five
a,0.75022,0.565527,0.39492,bar,True
b,,,,,
c,1.787866,-0.797105,0.923675,bar,True
d,,,,,
e,-1.254917,-0.935671,0.279436,bar,False
f,0.976882,1.359126,-0.527689,bar,True
g,,,,,
h,-0.143684,-1.974573,0.640641,bar,False


To make it easier to detect missing values across different array dtypes, pandas provides the ``isna()`` and ``notna()`` functions, which are also available as methods on Series and DataFrame objects:

In [8]:
df2["one"]

a    1.042811
b         NaN
c    0.426486
d         NaN
e    0.750833
f   -1.116411
g         NaN
h   -0.175790
Name: one, dtype: float64

In [11]:
df2["one"].isna()

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [10]:
df2["four"].notna()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [14]:
df2.isna()

,one,two,three,four,five
a,False,False,False,False,False
b,True,True,True,True,True
c,False,False,False,False,False
d,True,True,True,True,True
e,False,False,False,False,False
f,False,False,False,False,False
g,True,True,True,True,True
h,False,False,False,False,False
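
Because ``isna()`` and ``notna()`` return boolean objects, they combine naturally with reductions and boolean indexing. The following is a small sketch on a hand-built frame (the name ``frame`` and its values are illustrative, not the random data above) that counts missing values per column and keeps only the fully populated rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [np.nan, np.nan, 0.5, 2.0]},
    index=["a", "b", "c", "d"],
)

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = frame.isna().sum()

# Keep only rows in which every column holds a value.
complete_rows = frame[frame.notna().all(axis=1)]

print(missing_per_column)
print(complete_rows)
```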


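Be aware that in Python (and NumPy), NaN values do not compare equal to each other, whereas None does compare equal to itself. pandas and NumPy rely on the fact that np.nan != np.nan, and pandas treats None like np.nan. As a result, an equality comparison against np.nan never finds missing values: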

In [16]:
np.nan == np.nan

False

In [17]:
df2["one"] == np.nan

a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
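
The practical consequence is that missing values should be located with ``isna()`` rather than ``==``. A short sketch with made-up values contrasts the two approaches:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Equality against NaN is always False, so this finds nothing.
equality_hits = int((values == np.nan).sum())

# isna() is the reliable test.
isna_hits = int(values.isna().sum())

# NaN is the only float that is unequal to itself.
self_unequal = values != values

# None, by contrast, compares equal to itself in plain Python.
none_equal = None == None  # noqa: E711

print(equality_hits, isna_hits, self_unequal.tolist(), none_equal)
```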

## Integer dtypes and missing data

Because NaN is a float, a column of integers with even one missing value is cast to a floating-point dtype (see Support for integer NA for more). pandas provides a nullable integer array, which can be used by explicitly requesting the dtype:

In [20]:
pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())

0       1
1       2
2    <NA>
3       4
dtype: Int64
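
To make the difference concrete, this sketch compares the default float cast with the nullable ``Int64`` dtype, requested here through its string alias. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Without an explicit dtype, the missing value forces a float column.
default_cast = pd.Series([1, 2, np.nan, 4])

# With the nullable dtype, the integers stay integers and the gap becomes <NA>.
nullable = pd.Series([1, 2, None, 4], dtype="Int64")

# Arithmetic propagates <NA>; reductions skip it by default.
shifted = nullable + 1
total = nullable.sum()

print(default_cast.dtype, nullable.dtype)
print(shifted)
print(total)
```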

For datetime64[ns] types, NaT represents missing values. It is a pseudo-native sentinel value that NumPy can represent in a single dtype (datetime64[ns]). pandas objects provide compatibility between NaT and NaN.

In [22]:
df2 = df.copy()

In [24]:
df2["timestamp"] = pd.Timestamp("20120101")
df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan

In [25]:
df2

,one,two,three,four,five,timestamp
a,,0.395957,-1.238431,bar,True,NaT
c,,0.56855,0.761625,bar,True,NaT
e,0.750833,-0.856533,-0.789349,bar,True,2012-01-01
f,-1.116411,0.271629,1.215195,bar,False,2012-01-01
h,,1.493427,-1.216145,bar,False,NaT
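
The same behaviour can be seen on a standalone Series. In this sketch (dates chosen arbitrarily), assigning ``np.nan`` into a datetime column stores ``NaT`` and leaves the column as a datetime dtype.

```python
import numpy as np
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]))

# Assigning NaN to a datetime Series stores NaT instead.
stamps.iloc[1] = np.nan

still_datetime = pd.api.types.is_datetime64_any_dtype(stamps)
print(stamps)
print(still_datetime, stamps.isna().tolist())
```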


## Cleaning / filling missing data

pandas objects are equipped with various data manipulation methods for dealing with missing data.

### Filling missing values: fillna

``fillna()`` can “fill in” NA values with non-NA data in a couple of ways, which we illustrate:

In [31]:
df2

,one,two,three,four,five,timestamp
a,,0.395957,-1.238431,bar,True,NaT
c,,0.56855,0.761625,bar,True,NaT
e,0.750833,-0.856533,-0.789349,bar,True,2012-01-01
f,-1.116411,0.271629,1.215195,bar,False,2012-01-01
h,,1.493427,-1.216145,bar,False,NaT


In [32]:
df2.fillna(0)

,one,two,three,four,five,timestamp
a,0.0,0.395957,-1.238431,bar,True,0
c,0.0,0.56855,0.761625,bar,True,0
e,0.750833,-0.856533,-0.789349,bar,True,2012-01-01 00:00:00
f,-1.116411,0.271629,1.215195,bar,False,2012-01-01 00:00:00
h,0.0,1.493427,-1.216145,bar,False,0
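
A single scalar is applied to every column, which is why the ``timestamp`` column above ends up holding ``0`` next to real timestamps. One way to avoid that, sketched here on a small illustrative frame, is to pass a dict that maps column names to fill values, or to fill one column from a statistic of its own data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"score": [1.0, np.nan, 3.0], "label": ["x", None, "z"]}
)

# A dict gives each column its own replacement value.
per_column = frame.fillna({"score": 0.0, "label": "unknown"})

# A column can also be filled with a value computed from itself.
mean_filled = frame["score"].fillna(frame["score"].mean())

print(per_column)
print(mean_filled)
```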


#### Fill gaps forward or backward

In [39]:
df2.ffill()

,one,two,three,four,five,timestamp
a,,0.395957,-1.238431,bar,True,NaT
c,,0.56855,0.761625,bar,True,NaT
e,0.750833,-0.856533,-0.789349,bar,True,2012-01-01
f,-1.116411,0.271629,1.215195,bar,False,2012-01-01
h,-1.116411,1.493427,-1.216145,bar,False,2012-01-01
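
Note that forward filling cannot fill rows ``a`` and ``c`` above, because no earlier value exists to carry forward. The sketch below, on a made-up Series, shows forward filling, its backward counterpart ``bfill()``, and the ``limit`` parameter that caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = gaps.ffill()          # the leading NaN has nothing before it
backward = gaps.bfill()         # every gap has a later value to copy
limited = gaps.ffill(limit=1)   # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```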


### Dropping axis labels with missing data: dropna

In [54]:
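df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])  # rebuild the frame that has rows "b", "d", "g"
df2.loc["b"] = df2.loc["b"].fillna(0)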

In [55]:
df2

,one,two,three,four,five
a,0.75022,0.565527,0.39492,bar,True
b,0.0,0.0,0.0,0,0
c,1.787866,-0.797105,0.923675,bar,True
d,,,,,
e,-1.254917,-0.935671,0.279436,bar,False
f,0.976882,1.359126,-0.527689,bar,True
g,,,,,
h,-0.143684,-1.974573,0.640641,bar,False


In [48]:
df2.dropna()

,one,two,three,four,five
a,0.75022,0.565527,0.39492,bar,True
c,1.787866,-0.797105,0.923675,bar,True
e,-1.254917,-0.935671,0.279436,bar,False
f,0.976882,1.359126,-0.527689,bar,True
h,-0.143684,-1.974573,0.640641,bar,False


In [52]:
df2.dropna(axis="columns")

a
b
c
d
e
f
g
h
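
In the last call every column contains at least one missing value, so no columns survive and only the index remains. ``dropna()`` also accepts ``how`` and ``subset`` to make the rule less strict. The following sketch uses a small hand-built frame (names and values are illustrative) to compare the variants.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

any_missing = frame.dropna()                   # drop rows with any NaN
all_missing = frame.dropna(how="all")          # drop rows that are entirely NaN
by_subset = frame.dropna(subset=["one"])       # only look at column "one"
no_columns = frame.dropna(axis="columns")      # both columns contain a NaN

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(by_subset.index.tolist())
print(no_columns.shape)
```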


### Interpolation

Both Series and DataFrame objects have an ``interpolate()`` method that, by default, performs linear interpolation at missing data points.

In [58]:
df2.interpolate()

,one,two,three,four,five
a,0.75022,0.565527,0.39492,bar,True
b,0.0,0.0,0.0,0,0
c,1.787866,-0.797105,0.923675,bar,True
d,0.266475,-0.866388,0.601555,,
e,-1.254917,-0.935671,0.279436,bar,False
f,0.976882,1.359126,-0.527689,bar,True
g,0.416599,-0.307724,0.056476,,
h,-0.143684,-1.974573,0.640641,bar,False
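
Linear interpolation treats the values as equally spaced and fills each gap from its neighbours, so only numeric columns can be interpolated. The sketch below uses illustrative data to show the default behaviour, the ``limit`` parameter, and one possible way (an assumption, not something the text above prescribes) to restrict a mixed frame to its numeric columns with ``select_dtypes()`` before interpolating.

```python
import numpy as np
import pandas as pd

series = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

linear = series.interpolate()           # the leading NaN is left untouched
limited = series.interpolate(limit=1)   # fill at most one consecutive gap

mixed = pd.DataFrame({"value": [0.0, np.nan, 2.0], "tag": ["a", None, "c"]})
numeric_only = mixed.select_dtypes(include="number").interpolate()

print(linear.tolist())
print(limited.tolist())
print(numeric_only)
```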


## Replacing generic values

Often we want to replace arbitrary values with other values.

``replace()`` on Series and ``replace()`` on DataFrame provide an efficient yet flexible way to perform such replacements.

For a Series, you can replace a single value or a list of values by another value:

In [61]:
ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

ser.replace(0, 5)

0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [62]:
ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64
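
``replace()`` also accepts a mapping, and it is a convenient way to turn a sentinel into a proper missing value. In this sketch the sentinel ``-999.0`` and all variable names are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Turn a sentinel value into NaN so that isna() and fillna() can see it.
readings = pd.Series([0.0, -999.0, 2.0, -999.0])
cleaned = readings.replace(-999.0, np.nan)

# A dict maps each old value to its replacement.
mapped = pd.Series([0.0, 1.0, 2.0]).replace({0.0: 10.0, 1.0: 100.0})

# On a DataFrame, a nested dict limits the replacement to one column.
frame = pd.DataFrame({"a": [0, 1, 2], "b": [0, 5, 0]})
only_a = frame.replace({"a": {0: -1}})

print(cleaned.isna().tolist())
print(mapped.tolist())
print(only_a)
```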