
Remove Dask single-threaded setting in tests #7489

Merged · 2 commits · Jan 31, 2023

Conversation

jrbourbeau
Contributor

I'm not totally sure why single-threaded was being set to begin with, so I might be missing some important context here

Closes #7483

@jrbourbeau
Contributor Author

cc @dcherian

@dcherian
Contributor

Oh hmm that was trivial.

Is this a regression though?

The test has a distributed.Client created in a context manager.

with cluster() as (s, [a, b]):
    with Client(s["address"], loop=loop):
        original = create_test_data().chunk(chunks)
        if engine == "scipy":
            with pytest.raises(NotImplementedError):
                original.to_netcdf(
                    tmp_netcdf_filename, engine=engine, format=nc_format
                )
            return
        original.to_netcdf(tmp_netcdf_filename, engine=engine, format=nc_format)

but it seems like dask.base.get_scheduler isn't returning the distributed.Client's get any more (it is called somewhere in the to_netcdf call chain)

def _get_scheduler(get=None, collection=None) -> str | None:
    """Determine the dask scheduler that is being used.

    None is returned if no dask scheduler is active.

    See Also
    --------
    dask.base.get_scheduler
    """
    try:
        # Fix for bug caused by dask installation that doesn't involve the toolz library
        # Issue: 4164
        import dask
        from dask.base import get_scheduler  # noqa: F401

        actual_get = get_scheduler(get, collection)
    except ImportError:
        return None

    try:
        from dask.distributed import Client

        if isinstance(actual_get.__self__, Client):
            return "distributed"
    except (ImportError, AttributeError):
        pass

    try:
        # As of dask=2.6, dask.multiprocessing requires cloudpickle to be installed
        # Dependency removed in https://github.com/dask/dask/pull/5511
        if actual_get is dask.multiprocessing.get:
            return "multiprocessing"
    except AttributeError:
        pass

    return "threaded"

actual_get is now get_sync, whereas it used to be the get bound to a distributed.Client.
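For illustration, here's a minimal, dask-free sketch of the classification step in the _get_scheduler snippet above; FakeClient and get_sync here are stand-ins for distributed.Client and dask's single-threaded scheduler, not the real objects:

```python
# Dask-free sketch of how _get_scheduler classifies the callable returned
# by dask.base.get_scheduler. FakeClient and get_sync are hypothetical
# stand-ins for distributed.Client and dask's synchronous get.

class FakeClient:
    def get(self, dsk, keys):
        """Stand-in for the bound distributed.Client.get method."""

def get_sync(dsk, keys):
    """Stand-in for the single-threaded scheduler's get function."""

def classify(actual_get):
    # A bound method whose __self__ is a Client means "distributed";
    # plain functions like get_sync fall through to "threaded".
    if isinstance(getattr(actual_get, "__self__", None), FakeClient):
        return "distributed"
    return "threaded"

print(classify(FakeClient().get))  # distributed
print(classify(get_sync))          # threaded
```

So once the config makes get_scheduler hand back the synchronous get, the bound-method check never fires and xarray no longer sees the cluster.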

@jrbourbeau
Contributor Author

Hmm I'm not able to reproduce with this example using the latest 2023.1.1 release:

In [1]: import dask

In [2]: dask.utils.show_versions()
{
  "Python": "3.10.4",
  "Platform": "Darwin",
  "dask": "2023.1.1+3.gdb5b2178a",
  "distributed": "2023.1.1",
  "numpy": "1.24.1",
  "pandas": "2.0.0.dev0+1309.g7f2aa8f46a",
  "cloudpickle": "2.0.0",
  "fsspec": "2023.1.0+5.g012816b",
  "bokeh": "2.4.3",
  "fastparquet": "2022.12.1.dev6",
  "pyarrow": "11.0.0.dev316",
  "zarr": "2.4.1.dev528"
}
In [3]: from distributed import Client

In [4]: type(dask.base.get_scheduler())
Out[4]: NoneType

In [5]: c = Client()

In [6]: dask.base.get_scheduler()
Out[6]: <bound method Client.get of <Client: 'tcp://127.0.0.1:61659' processes=4 threads=8, memory=16.00 GiB>>

@jrbourbeau
Contributor Author

I've not looked super deeply, so please let me know if I'm missing something, but I think that these lines in get_scheduler are the relevant ones here

https://github.com/dask/dask/blob/db5b2178a79cacc1c882d60a82bf86e2e188eccb/dask/base.py#L1405-L1406

Previously, distributed would set the scheduler config option to point to the default distributed.Client (if one existed). We've since changed that logic, and distributed no longer uses the config option to say "you've got a distributed.Client, you should use it".

I think the problem here is that, with the previous config-based behavior, the scheduler option (currently set to the single-threaded scheduler in the test suite) would be overwritten by distributed to point to the Client. Now that distributed no longer touches the scheduler config option, the test suite actually uses the single-threaded scheduler, which is why get_scheduler() returns get_sync with the latest dask / distributed release.

I'd argue the new behavior is actually what we want, but I see what you're saying about it being a change in behavior. I think in this case though it's just a tests-related issue. Does that sound right? Or was setting the scheduler config option important for real-life user code too?
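To make the precedence change concrete, here is a hedged sketch with a plain dict standing in for dask's config; the function names are illustrative, not dask's actual API:

```python
# Illustrative sketch of the behavior change described above. A plain dict
# stands in for dask's config; old_resolve/new_resolve are hypothetical
# names, not real dask functions.

config = {"scheduler": "single-threaded"}  # what the test suite sets

def old_resolve(config, default_client_get):
    # Old behavior: creating a distributed.Client wrote its get into the
    # "scheduler" config key, clobbering the test suite's setting.
    overwritten = {**config, "scheduler": default_client_get}
    return overwritten["scheduler"]

def new_resolve(config, default_client_get):
    # New behavior: the config is left untouched, so an explicitly
    # configured scheduler wins; the Client is only a fallback.
    return config.get("scheduler") or default_client_get

client_get = "<bound Client.get>"
print(old_resolve(config, client_get))  # the Client used to win
print(new_resolve(config, client_get))  # "single-threaded" now wins
```

Under the old logic the Client always won; under the new logic an explicit scheduler setting, like the one in xarray's test suite, takes precedence.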

@jrbourbeau
Contributor Author

Just checking in, @dcherian does this look okay to you? Happy to continue iterating if you see a problem with just removing this config option

@dcherian
Contributor

dcherian commented Jan 31, 2023

Yup, after thinking about it for a while, this seems like an edge case that's not common in user code.

Thanks @jrbourbeau !

@dcherian dcherian merged commit 67d0ee2 into pydata:main Jan 31, 2023
@jrbourbeau jrbourbeau deleted the distributed-fixup branch January 31, 2023 18:48
@jrbourbeau
Contributor Author

Thanks @dcherian. Feel free to let me know if you or others run into this issue elsewhere. I also think it's an edge case the vast majority of users won't encounter (this is the first occurrence I've seen outside of the dask test suite), but if that turns out to not be the case then we can make adjustments upstream in dask 👍

dcherian added a commit to bzah/xarray that referenced this pull request Feb 28, 2023
* upstream/main: (291 commits)
  Update error message when saving multiindex (pydata#7475)
  DOC: cross ref the groupby tutorial (pydata#7555)
  [pre-commit.ci] pre-commit autoupdate (pydata#7543)
  supress namespace_package deprecation warning (doctests) (pydata#7548)
  [skip-ci] Add PDF of Xarray logo (pydata#7530)
  Support complex arrays in xr.corr (pydata#7392)
  clarification for thresh arg of dataset.dropna() (pydata#7481)
  [pre-commit.ci] pre-commit autoupdate (pydata#7524)
  Require to explicitly defining optional dimensions such as hue and markersize (pydata#7277)
  Use plt.rc_context for default styles (pydata#7318)
  Update HOW_TO_RELEASE.md (pydata#7512)
  Update whats-new for dev (pydata#7511)
  Fix whats-new for 2023.02.0 (pydata#7506)
  Update apply_ufunc output_sizes error message (pydata#7509)
  Zarr: drop "source" and  "original_shape" from encoding (pydata#7500)
  [pre-commit.ci] pre-commit autoupdate (pydata#7507)
  Whats-new for 2023.03.0
  Add `inclusive` argument to `cftime_range` and `date_range` and deprecate `closed` argument (pydata#7373)
  Make text match code example (pydata#7499)
  Remove Dask single-threaded setting in tests (pydata#7489)
  ...
Successfully merging this pull request may close these issues:
Upgrading dask-core and distributed packages to 2023.1.1 breaks tests