Dataset.sum() returns 0 for nan values #5693

Closed
MarkusZehner opened this issue Aug 11, 2021 · 3 comments

@MarkusZehner

What happened:
I noticed unexpected results while computing statistics over a sparsely populated xarray Dataset: summing over entries that are all NaN returns 0.

What you expected to happen:
I expected NaN values to remain NaN when setting skipna=True.
The same result occurs with an integer dtype.

Minimal Complete Verifiable Example:

# thx to https://www.theurbanist.com.au/2020/03/how-to-create-an-xarray-dataset-from-scratch/
import numpy as np
import xarray as xr
import pandas as pd

times = pd.date_range(start='2000-01-01',freq='1H',periods=6)
x = np.arange(6)
y = np.arange(6)
ds = xr.Dataset({
    'stuff': xr.DataArray(
                data   = np.zeros((6,6,6), dtype=float),  
                dims   = ['time', 'x', 'y'],
                coords = {'time': times,
                'x': x,
                'y':y,
                }
            )
        }
    )

print(ds)
print(ds.stuff.values)
ds_nan = ds.where(ds.stuff.values > 0)

print(ds_nan.stuff.values)
print(ds_nan.sum(dim=["time"], skipna=True).stuff.values)

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:35:11)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0

xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.6
cfgrib: None
iris: None
bottleneck: None
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.2.2
conda: None
pytest: 6.2.4
IPython: None
sphinx: None

@dcherian
Contributor

Yes, this is by design and matches numpy, IIRC. See #2889. I think you have to manually mask the output using .count to get what you want.

Please reopen if you have more questions.
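
A minimal sketch of the masking approach suggested above, assuming a small float DataArray in place of the Dataset from the report. The names `da`, `total`, `n_valid`, and `masked_total` are illustrative and do not come from the thread.

```python
import numpy as np
import xarray as xr

# Toy data: column x=0 is entirely NaN along "time", column x=1 has two valid values.
da = xr.DataArray(
    np.array([[np.nan, 1.0], [np.nan, 2.0], [np.nan, np.nan]]),
    dims=["time", "x"],
)

# With skipna=True the NaNs are skipped, so the all-NaN column sums to 0.0.
total = da.sum(dim="time", skipna=True)

# count() gives the number of non-NaN values along the reduced dimension.
n_valid = da.count(dim="time")

# Keep the sum only where at least one valid value contributed; NaN elsewhere.
masked_total = total.where(n_valid > 0)

print(total.values)
print(masked_total.values)
```

Here the mask is applied after the reduction, so the threshold (`n_valid > 0`) can be tightened if a sum over very few valid values should also be treated as missing.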

@MarkusZehner
Author

MarkusZehner commented Aug 12, 2021

What exactly is the skipna parameter doing? From the documentation it sounds like it would solve exactly the above 'problem'. Or is it just setting nan=0 for summing up?

@dcherian
Contributor

It selects either np.mean (skipna=False) or np.nanmean (skipna=True). You could try the min_count parameter.
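
As a rough illustration of the reply above, the sketch below compares the NaN-propagating and NaN-skipping numpy reductions for a sum (`np.sum` and `np.nansum`, the sum analogues of the functions named in the comment) and then tries `min_count=1` on a toy Dataset. The toy data and variable names are assumptions made for this example, not the dataset from the report.

```python
import numpy as np
import xarray as xr

# numpy behaviour on an all-NaN input.
column = np.array([np.nan, np.nan, np.nan])
plain_sum = np.sum(column)    # NaN propagates into the result
nan_sum = np.nansum(column)   # NaNs are treated as zero, so the result is 0.0

# Toy Dataset: column x=0 is all NaN along "time", column x=1 has valid values.
ds = xr.Dataset(
    {"stuff": (("time", "x"), np.array([[np.nan, 1.0], [np.nan, 2.0]]))}
)

# Default skipna=True behaviour: the all-NaN column sums to 0.0.
default_sum = ds.sum(dim="time", skipna=True).stuff

# min_count=1 requires at least one valid value; otherwise the result is NaN.
strict_sum = ds.sum(dim="time", skipna=True, min_count=1).stuff

print(plain_sum, nan_sum)
print(default_sum.values)
print(strict_sum.values)
```

With `min_count=1` the all-NaN column stays NaN while columns with at least one valid value are summed as usual, which appears to be the behaviour the original report expected.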
