Dataset.sum() returns 0 for nan values #5693

MarkusZehner · 2021-08-11T16:47:32Z

What happened:
While computing statistics over a sparsely populated xarray Dataset, I noticed that sum() returns 0 where all input values are nan.

What you expected to happen:
I expected positions where every value is nan to remain nan when setting skipna=True.
The result is the same for an integer dtype.

Minimal Complete Verifiable Example:

# thx to https://www.theurbanist.com.au/2020/03/how-to-create-an-xarray-dataset-from-scratch/
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range(start='2000-01-01', freq='1H', periods=6)
x = np.arange(6)
y = np.arange(6)

# 6x6x6 dataset filled with zeros
ds = xr.Dataset({
    'stuff': xr.DataArray(
        data=np.zeros((6, 6, 6), dtype=float),
        dims=['time', 'x', 'y'],
        coords={'time': times, 'x': x, 'y': y},
    )
})

print(ds)
print(ds.stuff.values)

# No value is > 0, so where() turns every element into nan
ds_nan = ds.where(ds.stuff.values > 0)
print(ds_nan.stuff.values)

# Prints all zeros, not nan
print(ds_nan.sum(dim=["time"], skipna=True).stuff.values)

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:35:11)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0

xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.6
cfgrib: None
iris: None
bottleneck: None
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.2.2
conda: None
pytest: 6.2.4
IPython: None
sphinx: None


dcherian · 2021-08-11T17:07:53Z

Yes, this is by design and matches numpy IIRC. See #2889. I think you have to manually mask the output using .count to get what you want.

Please reopen if you have more questions.
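
For readers following along, here is a minimal sketch of the masking suggested above. It assumes a small toy DataArray rather than the dataset from the report, and the names `da`, `n_valid`, and `masked_total` are illustrative only. It also shows the numpy behaviour being matched: `np.nansum` over an all-nan slice gives 0.

```python
import numpy as np
import xarray as xr

# NumPy's nan-skipping sum treats an all-nan slice as an empty sum.
all_nan = np.full(3, np.nan)
numpy_result = np.nansum(all_nan)

# Toy array (an assumption for illustration): column x=0 is entirely nan.
da = xr.DataArray(
    [[np.nan, 1.0], [np.nan, 2.0], [np.nan, np.nan]],
    dims=["time", "x"],
)

total = da.sum(dim="time", skipna=True)  # nan-skipping sum along "time"
n_valid = da.count(dim="time")           # number of non-nan values per column

# Keep the sum only where at least one valid value contributed to it.
masked_total = total.where(n_valid > 0)

print(numpy_result)
print(total.values)
print(masked_total.values)
```

The design choice here is to compute the sum and the count separately, so the threshold (`n_valid > 0`) can be raised if a result should only be trusted when several valid values went into it.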

MarkusZehner · 2021-08-12T09:19:12Z

What exactly is the skipna parameter doing? From the documentation it sounds like it would solve exactly the above 'problem'. Or is it just setting nan=0 for summing up?

dcherian · 2021-08-12T14:40:11Z

It selects either np.mean (skipna=False) or np.nanmean (skipna=True). You could try the min_count parameter.
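
A short sketch of how the `min_count` suggestion might look in practice, again on an assumed toy DataArray (`da` and the result names are illustrative, not from the thread). It compares the default nan-skipping sum, the same sum with `min_count=1`, and `skipna=False`.

```python
import numpy as np
import xarray as xr

# Toy array (an assumption for illustration): column x=0 is entirely nan,
# column x=1 has two valid values and one nan.
da = xr.DataArray(
    [[np.nan, 1.0], [np.nan, 2.0], [np.nan, np.nan]],
    dims=["time", "x"],
)

# Default: nans are skipped, so an all-nan column sums to 0.
default_total = da.sum(dim="time", skipna=True)

# min_count=1: the result is nan unless at least one valid value exists.
strict_total = da.sum(dim="time", skipna=True, min_count=1)

# skipna=False: any nan in a column makes that column's sum nan.
propagated = da.sum(dim="time", skipna=False)

print(default_total.values)
print(strict_total.values)
print(propagated.values)
```

With `min_count=1`, the all-nan column stays nan while the partially filled column still sums its valid values, which appears to be the behaviour asked for in the original report.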

dcherian closed this as completed Aug 11, 2021
