Implicit use of dask feature #4164

Closed
inakleinbottle opened this issue Jun 18, 2020 · 3 comments · Fixed by #4318

Comments

@inakleinbottle
Contributor

What happened:
I tried to use the to_netcdf function to store a dataset in a NetCDF file, but the following exception was raised:

Traceback (most recent call last):
  File "dask-error.py", line 27, in <module>
    ds.to_netcdf("test.nc")
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/xarray/core/dataset.py", line 1544, in to_netcdf
    return to_netcdf(
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/xarray/backends/api.py", line 1051, in to_netcdf
    scheduler = _get_scheduler()
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/xarray/backends/locks.py", line 79, in _get_scheduler
    actual_get = dask.base.get_scheduler(get, collection)
AttributeError: module 'dask' has no attribute 'base'

This code sample works as expected when the dask package is not installed in the environment. However, when dask is installed, the _get_scheduler function is called and raises the error at the following line:

actual_get = dask.base.get_scheduler(get, collection)

After a little digging, the problem is that the base module in the dask package depends on the toolz package, which is not a default dependency of dask, and so causes a silent import failure when dask initialises its namespace (https://github.com/dask/dask/blob/416d348f7174a302815758cb87dbf6983226ddc5/dask/__init__.py#L10). As a result, the base module is not importable from the dask top level, and importing it separately, as follows,

from dask import base

raises a ModuleNotFoundError.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sam/dev/xarray-test/.venv/lib/python3.8/site-packages/dask/base.py", line 13, in <module>
    from tlz import merge, groupby, curry, identity
ModuleNotFoundError: No module named 'tlz'
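
The failure mode described above can be reproduced without dask. The sketch below builds a throwaway package (the names fakepkg and missing_dependency_for_demo are hypothetical) whose __init__.py swallows an ImportError from a submodule. It assumes the package guards the submodule import with try/except ImportError, which is how the issue describes dask's namespace initialisation; it is an illustration, not a copy of dask's code.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def build_fake_package(root: Path) -> None:
    """Write a package whose __init__ hides a failing submodule import."""
    pkg = root / "fakepkg"
    pkg.mkdir()
    # The package silently ignores the failed import of its own submodule.
    (pkg / "__init__.py").write_text(
        textwrap.dedent(
            """\
            try:
                from .base import helper
            except ImportError:
                pass
            """
        )
    )
    # The submodule depends on a module that is not installed.
    (pkg / "base.py").write_text(
        "import missing_dependency_for_demo\n\n\ndef helper():\n    return 1\n"
    )


def probe():
    """Return (package has 'base' attribute, error name from explicit import)."""
    with tempfile.TemporaryDirectory() as tmp:
        build_fake_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            fakepkg = importlib.import_module("fakepkg")
            has_attr = hasattr(fakepkg, "base")
            try:
                importlib.import_module("fakepkg.base")
                explicit_error = None
            except ImportError as exc:
                explicit_error = type(exc).__name__
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("fakepkg", None)
            sys.modules.pop("fakepkg.base", None)
    return has_attr, explicit_error


if __name__ == "__main__":
    print(probe())
```

The top-level import succeeds, so attribute access such as fakepkg.base fails with AttributeError, while the explicit submodule import surfaces the underlying ModuleNotFoundError.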

I recommend the following fix. At the following line in the _get_scheduler function

import dask # noqa: F401

replace the import with the following

from dask.base import get_scheduler

and remove dask.base from the later call.
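
A minimal sketch of the suggested change follows. It is not xarray's actual code: the helper name load_get_scheduler is hypothetical, and the sketch assumes the import sits inside a try/except ImportError, so that a dask install without toolz is treated the same as no dask at all.

```python
def load_get_scheduler():
    """Return dask's get_scheduler, or None if it cannot be imported.

    Importing from dask.base directly means a missing dependency raises
    ModuleNotFoundError (a subclass of ImportError) at import time,
    rather than an AttributeError later on `dask.base`.
    """
    try:
        from dask.base import get_scheduler
    except ImportError:
        # Covers both "dask not installed" and "dask installed without toolz".
        return None
    return get_scheduler


if __name__ == "__main__":
    get_scheduler = load_get_scheduler()
    if get_scheduler is None:
        print("dask scheduler lookup unavailable; falling back to no scheduler")
    else:
        print("dask scheduler lookup available")
```

Because ModuleNotFoundError subclasses ImportError, a single except clause handles both cases.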

I should, however, point out that get_scheduler does not appear to be part of the Dask public API.

What you expected to happen:
The to_netcdf method should have exited silently and created a new file in the working directory with the contents of the data set.

Minimal Complete Verifiable Example:
This code is basically the "Toy weather data" example from the documentation, except for the last line.

import numpy as np
import pandas as pd

import xarray as xr

np.random.seed(123)

xr.set_options(display_style="html")

times = pd.date_range("2000-01-01", "2001-12-31", name="time")
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))

base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)

ds = xr.Dataset(
    {
        "tmin": (("time", "location"), tmin_values),
        "tmax": (("time", "location"), tmax_values),
    },
    {"time": times, "location": ["IA", "IN", "IL"]},
)

ds.to_netcdf("test.nc")  # error here

Anything else we need to know?:
As mentioned above, the error only manifests when the dask package with no extras installed is present in the environment. (Many of the extras require the toolz package, in which case the import error goes away.)
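
To check whether an environment is in this state, something like the following could be used. It is only a sketch: the function name is hypothetical, and it relies on the tlz module (the one named in the traceback) being provided by the toolz distribution.

```python
from importlib.util import find_spec


def bare_dask_without_toolz() -> bool:
    """Return True if dask is importable but the tlz module is not."""
    has_dask = find_spec("dask") is not None
    has_tlz = find_spec("tlz") is not None
    return has_dask and not has_tlz


if __name__ == "__main__":
    if bare_dask_without_toolz():
        print("dask is installed without toolz; to_netcdf may hit this issue")
    else:
        print("environment does not match the failing configuration")
```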

Environment:
In a clean virtual environment, install the following packages.

pip install xarray netCDF4 dask

The package versions installed are as follows (generated by pip freeze):

cftime==1.1.3
dask==2.18.1
netCDF4==1.5.3
numpy==1.18.5
pandas==1.0.5
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
six==1.15.0
xarray==0.15.1

(Also running python3.8.2 on Debian Linux, not that I suppose this matters.)

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.2+ (heads/3.8:882a7f44da, Apr 26 2020, 19:31:38) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3

xarray: 0.15.1
pandas: 1.0.5
numpy: 1.18.5
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.18.1
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: None
IPython: None
sphinx: None

@dcherian
Contributor

Thanks @inakleinbottle for the very well-written issue and great diagnosis.

Can you open a PR with your suggested fix?

@inakleinbottle
Contributor Author

I will see if I can find time over the next couple of days.

@inakleinbottle
Contributor Author

I've created a pull request for a fix of this issue.
