# Working with missing data

This section discusses missing (also referred to as `NA`) values in cudf. cudf supports missing values in all dtypes. Missing values are displayed as `<NA>` and are also referred to as "null values".

## How to detect missing values

To detect missing values, you can use the `isna()` and `notna()` functions.

In [1]:
import numpy as np

import cudf

In [2]:
df = cudf.DataFrame({"a": [1, 2, None, 4], "b": [0.1, None, 2.3, 17.17]})

In [3]:
df

Unnamed: 0,a,b
0,1.0,0.1
1,2.0,
2,,2.3
3,4.0,17.17


In [4]:
df.isna()

Unnamed: 0,a,b
0,False,False
1,False,True
2,True,False
3,False,False


In [5]:
df["a"].notna()

0     True
1     True
2    False
3     True
Name: a, dtype: bool

Be aware that in Python (and NumPy), `nan` values don't compare equal to each other, but `None` values do. cudf and NumPy rely on the fact that `np.nan != np.nan`, and cudf treats `None` like `np.nan`.

In [6]:
None == None

True

In [7]:
np.nan == np.nan

False

Consequently, a scalar equality comparison against `None` or `np.nan` doesn't provide useful information.

In [8]:
df["b"] == np.nan

0    False
1     <NA>
2    False
3    False
Name: b, dtype: bool

In [9]:
s = cudf.Series([None, 1, 2])

In [10]:
s

0    <NA>
1       1
2       2
dtype: int64

In [11]:
s == None

0    <NA>
1    <NA>
2    <NA>
dtype: bool

In [12]:
s = cudf.Series([1, 2, np.nan], nan_as_null=False)

In [13]:
s

0    1.0
1    2.0
2    NaN
dtype: float64

In [14]:
s == np.nan

0    False
1    False
2    False
dtype: bool
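
Because `==` against `None` or `NaN` never flags the missing entries, detection has to test for missingness directly, which is the job of `isna()` and `notna()`. The plain-Python sketch below illustrates the idea on a list. `is_missing` is a hypothetical helper written for this example, not a cudf function.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper, not a cudf API)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


values = [0.1, None, 2.3, math.nan]

# Equality against NaN is False for every element, including the NaN itself.
eq_nan = [v == math.nan for v in values]

# A direct missingness test finds both the None and the NaN.
missing = [is_missing(v) for v in values]

print(eq_nan)   # [False, False, False, False]
print(missing)  # [False, True, False, True]
```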

## Float dtypes and missing data

Because `NaN` is a float, a column of integers with even one missing value would have to be cast to a floating-point dtype to hold it. However, this doesn't happen by default in cudf.

By default, if a `NaN` value is passed to the `Series` constructor, it is treated as a `<NA>` value.

In [15]:
cudf.Series([1, 2, np.nan])

0       1
1       2
2    <NA>
dtype: int64

Hence, to keep a `NaN` as `NaN`, pass `nan_as_null=False` to the `Series` constructor.

In [16]:
cudf.Series([1, 2, np.nan], nan_as_null=False)

0    1.0
1    2.0
2    NaN
dtype: float64
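
One way to picture the difference is a toy column made of a data list plus a separate validity mask, assuming an Arrow-style layout in which nulls live in the mask rather than in the data. This is only a sketch: `build_column` is a hypothetical helper, and the real constructor does far more.

```python
import math


def build_column(values, nan_as_null=True):
    """Toy nullable column: returns (data, valid) lists. Hypothetical helper."""
    data, valid = [], []
    for v in values:
        is_nan = isinstance(v, float) and math.isnan(v)
        if v is None or (is_nan and nan_as_null):
            data.append(0)       # placeholder, ignored because valid is False
            valid.append(False)
        else:
            data.append(v)       # a kept NaN is stored as an ordinary float
            valid.append(True)
    return data, valid


# NaN becomes a null: the data stays integer-only and the mask marks the gap.
data, valid = build_column([1, 2, math.nan])
print(data, valid)  # [1, 2, 0] [True, True, False]

# NaN is kept as a value: every slot is valid, but the data now holds a float.
data_f, valid_f = build_column([1, 2, math.nan], nan_as_null=False)
print(valid_f)                # [True, True, True]
print(math.isnan(data_f[2]))  # True
```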

## Datetimes

For `datetime64` types, cudf doesn't support `NaT` values. Instead, these values, which are specific to NumPy and pandas, are treated as null values (`<NA>`) in cudf. The actual underlying value of `NaT` is `min(int64)`, and cudf retains the underlying value when converting a cudf object to a pandas object.

In [17]:
import pandas as pd

datetime_series = cudf.Series(
    [pd.Timestamp("20120101"), pd.NaT, pd.Timestamp("20120101")]
)
datetime_series

0    2012-01-01 00:00:00.000000
1                          <NA>
2    2012-01-01 00:00:00.000000
dtype: datetime64[us]

In [18]:
datetime_series.to_pandas()

0   2012-01-01
1          NaT
2   2012-01-01
dtype: datetime64[ns]
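
To make the `min(int64)` remark concrete, here is a small standard-library sketch of a sentinel-based encoding in the style NumPy and pandas use for `NaT`. The helper names (`encode_us`, `decode_us`) are hypothetical, and this is a toy round trip, not the conversion code cudf runs.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)  # integer that NumPy and pandas use to encode NaT
EPOCH = datetime(1970, 1, 1)


def encode_us(ts):
    """Encode a datetime as microseconds since the epoch; None -> INT64_MIN."""
    if ts is None:
        return INT64_MIN
    return (ts - EPOCH) // timedelta(microseconds=1)


def decode_us(raw):
    """Inverse of encode_us: the sentinel decodes back to a missing value."""
    if raw == INT64_MIN:
        return None
    return EPOCH + timedelta(microseconds=raw)


stamps = [datetime(2012, 1, 1), None, datetime(2012, 1, 1)]
encoded = [encode_us(ts) for ts in stamps]
decoded = [decode_us(raw) for raw in encoded]

print(encoded[1] == INT64_MIN)  # True
print(decoded == stamps)        # True
```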

Any operation on rows that have `<NA>` values in a `datetime` column results in a `<NA>` value at the same location in the resulting column:

In [19]:
datetime_series - datetime_series

0    0 days 00:00:00
1               <NA>
2    0 days 00:00:00
dtype: timedelta64[us]

## Calculations with missing data

Null values propagate naturally through arithmetic operations between cudf objects.

In [20]:
df1 = cudf.DataFrame(
    {
        "a": [1, None, 2, 3, None],
        "b": cudf.Series([np.nan, 2, 3.2, 0.1, 1], nan_as_null=False),
    }
)

In [21]:
df2 = cudf.DataFrame(
    {"a": [1, 11, 2, 34, 10], "b": cudf.Series([0.23, 22, 3.2, None, 1])}
)

In [22]:
df1

Unnamed: 0,a,b
0,1.0,
1,,2.0
2,2.0,3.2
3,3.0,0.1
4,,1.0


In [23]:
df2

Unnamed: 0,a,b
0,1,0.23
1,11,22.0
2,2,3.2
3,34,
4,10,1.0


In [24]:
df1 + df2

Unnamed: 0,a,b
0,2.0,
1,,24.0
2,4.0,6.4
3,37.0,
4,,2.0
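
The propagation rule for column `b` above can be sketched in plain Python, with `None` standing in for `<NA>` and `math.nan` for `NaN`. `add_nullable` is a hypothetical helper used only for illustration.

```python
import math


def add_nullable(x, y):
    """Element-wise add where None models <NA> (hypothetical helper)."""
    if x is None or y is None:
        return None   # a null on either side yields a null
    return x + y      # NaN propagates through ordinary float arithmetic


b1 = [math.nan, 2.0, 3.2, 0.1, 1.0]   # column "b" of df1
b2 = [0.23, 22.0, 3.2, None, 1.0]     # column "b" of df2

result = [add_nullable(x, y) for x, y in zip(b1, b2)]
print(result)  # [nan, 24.0, 6.4, None, 2.0]
```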


When summing the data in a Series, `NA` values are skipped by default, which for a sum is equivalent to treating them as `0`.

In [25]:
df1["a"]

0       1
1    <NA>
2       2
3       3
4    <NA>
Name: a, dtype: int64

In [26]:
df1["a"].sum()

6

Because `NA` values are skipped rather than counted, the mean is taken over the three non-null values only: `(1 + 2 + 3)/3 = 2`.

In [27]:
df1["a"].mean()

2.0

To preserve `NA` values in the above calculations, `sum` and `mean` support the `skipna` parameter.
Its default value is `True`; set it to `False` to preserve `NA` values.

In [28]:
df1["a"].sum(skipna=False)

nan

In [29]:
df1["a"].mean(skipna=False)

nan
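
As a rough model of what `skipna` does for reductions, the sketch below uses `None` for `<NA>`. The function names are hypothetical, and a missing result is modelled as `None` rather than the `nan` printed above.

```python
def nullable_sum(values, skipna=True):
    """Sum that either skips nulls or returns a null if any is present."""
    if not skipna and any(v is None for v in values):
        return None
    return sum(v for v in values if v is not None)


def nullable_mean(values, skipna=True):
    """Mean over the non-null values only when skipna is True."""
    if not skipna and any(v is None for v in values):
        return None
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


a = [1, None, 2, 3, None]

print(nullable_sum(a))                 # 6
print(nullable_mean(a))                # 2.0, i.e. (1 + 2 + 3) / 3
print(nullable_sum(a, skipna=False))   # None
print(nullable_mean(a, skipna=False))  # None
```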

Cumulative methods like `cumsum` and `cumprod` ignore `NA` values by default.

In [30]:
df1["a"].cumsum()

0       1
1    <NA>
2       3
3       6
4    <NA>
Name: a, dtype: int64

To preserve `NA` values in cumulative methods, provide `skipna=False`.

In [31]:
df1["a"].cumsum(skipna=False)

0       1
1    <NA>
2    <NA>
3    <NA>
4    <NA>
Name: a, dtype: int64
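
The two outputs above differ only in what happens after the first null. A minimal plain-Python sketch of that behaviour, with `None` standing in for `<NA>` and a hypothetical `nullable_cumsum` helper:

```python
def nullable_cumsum(values, skipna=True):
    """Cumulative sum over a list in which None models <NA>."""
    out, total, poisoned = [], 0, False
    for v in values:
        if v is None:
            out.append(None)
            # With skipna=False, the first null makes all later results null.
            poisoned = poisoned or not skipna
        elif poisoned:
            out.append(None)
        else:
            total += v
            out.append(total)
    return out


a = [1, None, 2, 3, None]

print(nullable_cumsum(a))                # [1, None, 3, 6, None]
print(nullable_cumsum(a, skipna=False))  # [1, None, None, None, None]
```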

## Sum/product of nulls/NaNs

The sum of an empty or all-NA Series or column of a DataFrame is 0.

In [32]:
cudf.Series([np.nan], nan_as_null=False).sum()

0.0

In [33]:
cudf.Series([np.nan], nan_as_null=False).sum(skipna=False)

nan

In [34]:
cudf.Series([], dtype="float64").sum()

0.0

The product of an empty or all-NA Series or column of a DataFrame is 1.

In [35]:
cudf.Series([np.nan], nan_as_null=False).prod()

1.0

In [36]:
cudf.Series([np.nan], nan_as_null=False).prod(skipna=False)

nan

In [37]:
cudf.Series([], dtype="float64").prod()

1.0
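
These results follow from the identity elements of addition and multiplication: once the missing values are skipped, nothing is left to reduce, so the reduction returns its starting value. A standard-library sketch of the same arithmetic:

```python
import math

empty = []
print(sum(empty, 0.0))               # 0.0
print(math.prod(empty, start=1.0))   # 1.0

# Skipping the NaN from an all-NaN list leaves an empty list to reduce.
all_nan = [math.nan]
kept = [v for v in all_nan if not math.isnan(v)]
total = sum(kept, 0.0)
product = math.prod(kept, start=1.0)
print(total, product)                # 0.0 1.0

# Without skipping, the NaN propagates into the result.
print(math.isnan(sum(all_nan)))      # True
```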

## NA values in GroupBy

`NA` groups in GroupBy are automatically excluded. For example:

In [38]:
df1

Unnamed: 0,a,b
0,1.0,
1,,2.0
2,2.0,3.2
3,3.0,0.1
4,,1.0


In [39]:
df1.groupby("a").mean()

Unnamed: 0_level_0,b
a,Unnamed: 1_level_1
2,3.2
1,
3,0.1


It is also possible to include `NA` in groups by passing `dropna=False`:

In [40]:
df1.groupby("a", dropna=False).mean()

Unnamed: 0_level_0,b
a,Unnamed: 1_level_1
2.0,3.2
1.0,
3.0,0.1
,1.5
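
The effect of `dropna` on the grouping keys can be modelled with a dictionary of groups. This is a simplified sketch: `group_mean` is a hypothetical helper, `None` stands in for `<NA>`, and the first value of `b` is set to `10.0` instead of `NaN` to keep the arithmetic simple.

```python
from collections import defaultdict


def group_mean(keys, values, dropna=True):
    """Mean of values per key; rows with a null key are dropped when dropna=True."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        if k is None and dropna:
            continue
        groups[k].append(v)
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}


keys = [1, None, 2, 3, None]        # column "a"
vals = [10.0, 2.0, 3.2, 0.1, 1.0]   # illustrative stand-in for column "b"

print(group_mean(keys, vals))
# {1: 10.0, 2: 3.2, 3: 0.1}
print(group_mean(keys, vals, dropna=False))
# {1: 10.0, None: 1.5, 2: 3.2, 3: 0.1}
```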


## Inserting missing data

All dtypes support insertion of missing values by assignment. Any specific location in a Series can be made null by assigning `None` to it.

In [41]:
series = cudf.Series([1, 2, 3, 4])

In [42]:
series

0    1
1    2
2    3
3    4
dtype: int64

In [43]:
series[2] = None

In [44]:
series

0       1
1       2
2    <NA>
3       4
dtype: int64

## Filling missing values: fillna

`fillna()` can fill in `NA` & `NaN` values with non-NA data.

In [45]:
df1

Unnamed: 0,a,b
0,1.0,
1,,2.0
2,2.0,3.2
3,3.0,0.1
4,,1.0


In [46]:
df1["b"].fillna(10)

0    10.0
1     2.0
2     3.2
3     0.1
4     1.0
Name: b, dtype: float64

## Filling with a cudf object

You can also call `fillna` with a dict or Series that is alignable. The labels of the dict or the index of the Series must match the columns of the frame you wish to fill. A typical use case is filling each column of a DataFrame with that column's mean.

In [47]:
import cupy as cp

dff = cudf.DataFrame(cp.random.randn(10, 3), columns=list("ABC"))

In [48]:
dff.iloc[3:5, 0] = np.nan

In [49]:
dff.iloc[4:6, 1] = np.nan

In [50]:
dff.iloc[5:8, 2] = np.nan

In [51]:
dff

Unnamed: 0,A,B,C
0,-0.408268,-0.676643,-1.274743
1,-0.029322,-0.873593,-1.214105
2,-0.866371,1.081735,-0.22684
3,,0.812278,1.074973
4,,,-0.366725
5,-1.016239,,
6,0.675123,1.067536,
7,0.221568,2.025961,
8,-0.317241,1.011275,0.674891
9,-0.877041,-1.919394,-1.029201


In [52]:
dff.fillna(dff.mean())

Unnamed: 0,A,B,C
0,-0.408268,-0.676643,-1.274743
1,-0.029322,-0.873593,-1.214105
2,-0.866371,1.081735,-0.22684
3,-0.327224,0.812278,1.074973
4,-0.327224,0.316145,-0.366725
5,-1.016239,0.316145,-0.337393
6,0.675123,1.067536,-0.337393
7,0.221568,2.025961,-0.337393
8,-0.317241,1.011275,0.674891
9,-0.877041,-1.919394,-1.029201


In [53]:
dff.fillna(dff.mean()[1:3])

Unnamed: 0,A,B,C
0,-0.408268,-0.676643,-1.274743
1,-0.029322,-0.873593,-1.214105
2,-0.866371,1.081735,-0.22684
3,,0.812278,1.074973
4,,0.316145,-0.366725
5,-1.016239,0.316145,-0.337393
6,0.675123,1.067536,-0.337393
7,0.221568,2.025961,-0.337393
8,-0.317241,1.011275,0.674891
9,-0.877041,-1.919394,-1.029201
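
The label alignment shown above, where a fill value applies only to the column whose label matches, can be sketched with dictionaries of lists. The `fillna` and `column_mean` functions below are hypothetical stand-ins written for this example, with `None` modelling a missing value.

```python
def column_mean(col):
    """Mean over the non-missing entries of one column."""
    present = [v for v in col if v is not None]
    return sum(present) / len(present)


def fillna(frame, fill):
    """Fill None entries; `fill` is a scalar or a dict keyed by column label."""
    out = {}
    for name, col in frame.items():
        if isinstance(fill, dict):
            if name not in fill:
                out[name] = list(col)   # no matching label: leave column as is
                continue
            value = fill[name]
        else:
            value = fill
        out[name] = [value if v is None else v for v in col]
    return out


frame = {"A": [1.0, None, 3.0], "B": [None, 4.0, 8.0], "C": [5.0, 5.0, None]}
means = {name: column_mean(col) for name, col in frame.items()}

filled_all = fillna(frame, means)
filled_bc = fillna(frame, {k: means[k] for k in ("B", "C")})

print(filled_all)  # every gap replaced by its own column's mean
print(filled_bc)   # column "A" keeps its None because no fill value matches it
```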


## Dropping axis labels with missing data: dropna

Missing data can be excluded using `dropna()`:

In [54]:
df1

Unnamed: 0,a,b
0,1.0,
1,,2.0
2,2.0,3.2
3,3.0,0.1
4,,1.0


In [55]:
df1.dropna(axis=0)

Unnamed: 0,a,b
2,2,3.2
3,3,0.1


In [56]:
df1.dropna(axis=1)

0
1
2
3
4


An equivalent `dropna()` is available for Series.

In [57]:
df1["a"].dropna()

0    1
2    2
3    3
Name: a, dtype: int64
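
The difference between `axis=0` and `axis=1` is whether rows or columns containing a missing value are removed. A plain-Python sketch on the data of `df1`, with both `<NA>` and `NaN` modelled as `None` for simplicity (the helper names are hypothetical):

```python
def dropna_rows(frame):
    """Keep only the row positions where every column has a value (axis=0)."""
    names = list(frame)
    n = len(frame[names[0]]) if names else 0
    keep = [i for i in range(n) if all(frame[c][i] is not None for c in names)]
    return {c: [frame[c][i] for i in keep] for c in names}, keep


def dropna_columns(frame):
    """Keep only the columns that contain no missing value (axis=1)."""
    return {c: col for c, col in frame.items() if all(v is not None for v in col)}


frame = {"a": [1, None, 2, 3, None], "b": [None, 2.0, 3.2, 0.1, 1.0]}

rows, kept_index = dropna_rows(frame)
print(kept_index)             # [2, 3]
print(rows)                   # {'a': [2, 3], 'b': [3.2, 0.1]}
print(dropna_columns(frame))  # {} because both columns contain a missing value
```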

## Replacing generic values

Often we want to replace arbitrary values with other values.

`replace()` in Series and `replace()` in DataFrame provide an efficient yet flexible way to perform such replacements.

In [58]:
series = cudf.Series([0.0, 1.0, 2.0, 3.0, 4.0])

In [59]:
series

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [60]:
series.replace(0, 5)

0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

We can also replace any value with a `<NA>` value.

In [61]:
series.replace(0, None)

0    <NA>
1     1.0
2     2.0
3     3.0
4     4.0
dtype: float64

You can replace a list of values by a list of other values:

In [62]:
series.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

You can also specify a mapping dict:

In [63]:
series.replace({0: 10, 1: 100})

0     10.0
1    100.0
2      2.0
3      3.0
4      4.0
dtype: float64
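
The scalar, list, and dict forms above all reduce to a lookup table applied once per element. The sketch below models that with a hypothetical `replace` function on a Python list. Because each element is looked up exactly once, swapping values (for example `0 -> 4` and `4 -> 0`) does not cascade.

```python
def replace(values, to_replace, value=None):
    """Replace matching elements; accepts a scalar, a list, or a mapping dict."""
    if isinstance(to_replace, dict):
        mapping = dict(to_replace)
    elif isinstance(to_replace, list):
        if isinstance(value, list):
            mapping = dict(zip(to_replace, value))   # list -> list, pairwise
        else:
            mapping = {k: value for k in to_replace}  # list -> one scalar
    else:
        mapping = {to_replace: value}                 # scalar -> scalar
    # Single pass: each element is looked up once, so replacements don't chain.
    return [mapping.get(v, v) for v in values]


series = [0.0, 1.0, 2.0, 3.0, 4.0]

print(replace(series, 0, 5))                              # [5, 1.0, 2.0, 3.0, 4.0]
print(replace(series, 0, None))                           # [None, 1.0, 2.0, 3.0, 4.0]
print(replace(series, [0, 1, 2, 3, 4], [4, 3, 2, 1, 0]))  # [4, 3, 2, 1, 0]
print(replace(series, {0: 10, 1: 100}))                   # [10, 100, 2.0, 3.0, 4.0]
```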

For a DataFrame, you can specify individual values by column:

In [64]:
df = cudf.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]})

In [65]:
df

Unnamed: 0,a,b
0,0,5
1,1,6
2,2,7
3,3,8
4,4,9


In [66]:
df.replace({"a": 0, "b": 5}, 100)

Unnamed: 0,a,b
0,100,100
1,1,6
2,2,7
3,3,8
4,4,9


## String/regular expression replacement

cudf supports replacing string values using the `replace` API:

In [67]:
d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", None, "d"]}

In [68]:
df = cudf.DataFrame(d)

In [69]:
df

Unnamed: 0,a,b,c
0,0,a,a
1,1,b,b
2,2,.,
3,3,.,d


In [70]:
df.replace(".", "A Dot")

Unnamed: 0,a,b,c
0,0,a,a
1,1,b,b
2,2,A Dot,
3,3,A Dot,d


In [71]:
df.replace([".", "b"], ["A Dot", None])

Unnamed: 0,a,b,c
0,0,a,a
1,1,,
2,2,A Dot,
3,3,A Dot,d


Replace a few different values (list -> list):

In [72]:
df.replace(["a", "."], ["b", "--"])

Unnamed: 0,a,b,c
0,0,b,b
1,1,b,b
2,2,--,
3,3,--,d


Only search in column 'b' (dict -> dict):

In [73]:
df.replace({"b": "."}, {"b": "replacement value"})

Unnamed: 0,a,b,c
0,0,a,a
1,1,b,b
2,2,replacement value,
3,3,replacement value,d
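
In the dict -> dict form, the keys restrict the search to the named columns. The outputs above are consistent with matching on whole values: `"."` is replaced only where it is the entire string. A plain-Python sketch under that assumption, using a hypothetical `replace_in_columns` helper:

```python
def replace_in_columns(frame, to_replace, value):
    """Both dicts are keyed by column label; only those columns are searched."""
    out = {}
    for name, col in frame.items():
        if name in to_replace:
            old, new = to_replace[name], value[name]
            # Whole-value comparison: "." is a literal string, not a pattern.
            out[name] = [new if v == old else v for v in col]
        else:
            out[name] = list(col)
    return out


frame = {
    "a": [0, 1, 2, 3],
    "b": ["a", "b", ".", "."],
    "c": ["a", "b", None, "d"],
}

result = replace_in_columns(frame, {"b": "."}, {"b": "replacement value"})
print(result["b"])  # ['a', 'b', 'replacement value', 'replacement value']
print(result["c"])  # ['a', 'b', None, 'd']
```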


## Numeric replacement

`replace()` can also be used similarly to `fillna()`.

In [74]:
df = cudf.DataFrame(cp.random.randn(10, 2))

In [75]:
df[np.random.rand(df.shape[0]) > 0.5] = 1.5

In [76]:
df.replace(1.5, None)

Unnamed: 0,0,1
0,-0.089358787,-0.728419386
1,-2.141612003,-0.574415182
2,,
3,0.774643462,2.07287721
4,0.93799853,-1.054129436
5,,
6,-0.435293012,1.163009584
7,1.346623287,0.31961371
8,,
9,,


Replacing more than one value is possible by passing a list.

In [77]:
df00 = df.iloc[0, 0]

In [78]:
df.replace([1.5, df00], [5, 10])

Unnamed: 0,0,1
0,10.0,-0.728419
1,-2.141612,-0.574415
2,5.0,5.0
3,0.774643,2.072877
4,0.937999,-1.054129
5,5.0,5.0
6,-0.435293,1.16301
7,1.346623,0.319614
8,5.0,5.0
9,5.0,5.0


You can also operate on the DataFrame in place:

In [79]:
df.replace(1.5, None, inplace=True)

In [80]:
df

Unnamed: 0,0,1
0,-0.089358787,-0.728419386
1,-2.141612003,-0.574415182
2,,
3,0.774643462,2.07287721
4,0.93799853,-1.054129436
5,,
6,-0.435293012,1.163009584
7,1.346623287,0.31961371
8,,
9,,
