slow performance when storing datasets in gcsfs-backed zarr stores #1770
Comments
The threading locks in your profile are likely due to using the dask threaded scheduler. I recommend using the single-threaded scheduler (dask.get) when profiling.
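For context, a minimal sketch of profiling under the single-threaded scheduler, using the modern dask.config API (dask.get mentioned above is the older spelling of the same scheduler); the small array and in-memory zarr store here are illustrative stand-ins, not the reporter's setup:
import dask
import dask.array as dsa
import zarr

ar = dsa.random.random((4, 1080, 2160), chunks=(1, 1080, 2160))
za = zarr.create(ar.shape, chunks=(1, 1080, 2160), dtype=ar.dtype)  # in-memory store, for illustration only

# run the whole graph in the calling thread so a profiler sees real call stacks
with dask.config.set(scheduler='synchronous'):
    ar.store(za, lock=False)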
When pickling the GCS mapping it looks like we're actually pulling down all of the data within it (zarr has already placed some metadata) instead of serializing the connection information. @martindurant, what information can we safely pass around when serializing? These tasks would need to remain valid for longer than the standard hour-long short-lived token.
I am puzzled that serializing the mapping is pulling the data. GCSMap does not have get/set_state, and the only attributes are the GCSFileSystem and the path. Perhaps the …
Ah, we can just serialize the … Perhaps the …
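To illustrate what serializing only the connection information could look like in principle: the sketch below is not gcsfs's actual implementation; _rebuild_gcsmap is a hypothetical helper, and the .gcs, .root, and .project attribute names are assumptions that may vary across gcsfs versions.
import copyreg
import gcsfs

def _rebuild_gcsmap(project, root):
    # hypothetical helper: recreate the filesystem and mapping on the worker
    fs = gcsfs.GCSFileSystem(project=project)
    return gcsfs.mapping.GCSMap(root, gcs=fs)

def _reduce_gcsmap(gcsmap):
    # ship only connection info (project id + bucket path), never the stored chunks
    return _rebuild_gcsmap, (gcsmap.gcs.project, gcsmap.root)

copyreg.pickle(gcsfs.mapping.GCSMap, _reduce_gcsmap)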
It looks like serializing …
Yes, …
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test997', gcs=fs, check=True,
                              create=True)
import dask.array as dsa
shape = (30, 50, 1080, 2160)
chunkshape = (1, 1, 1080, 2160)
ar = dsa.random.random(shape, chunks=chunkshape)
import zarr
za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap)
In [2]: import cloudpickle
In [3]: %time len(cloudpickle.dumps(gcsmap))
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 560 µs
Out[3]: 213
Is this still an issue?
Closing. I think our fixes in xarray and zarr last winter addressed most of the problems here. If others feel differently, please reopen. |
We are working on integrating zarr with xarray. In the process, we have encountered a performance issue that I am documenting here. At this point, it is not clear whether the core issue is in zarr, gcsfs, dask, or xarray. I originally started posting this in the zarr repo, but along the way became more convinced that the issue lies with xarray.
Dask Only
Here is an example using only dask and zarr.
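The example code itself was not preserved in this copy of the issue; a sketch consistent with the setup snippet quoted earlier in the thread (the project and bucket names are placeholders, and the final store call is an assumption) would be:
import gcsfs
import zarr
import dask.array as dsa

fs = gcsfs.GCSFileSystem(project='my-project')               # placeholder project
gcsmap = gcsfs.mapping.GCSMap('my-bucket/test', gcs=fs,      # placeholder bucket/path
                              check=True, create=True)

shape = (30, 50, 1080, 2160)
chunkshape = (1, 1, 1080, 2160)
ar = dsa.random.random(shape, chunks=chunkshape)
za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap)

ar.store(za, lock=False)   # write the ~27 GB array to the gcsfs-backed zarr store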
When you run this store step, a long time is spent serializing objects before the computation starts.
For a more fine-grained look at the process, one can instead do the following:
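The exact code was not preserved in this copy of the issue; a plausible equivalent, assuming the ar, za, and gcsmap objects from the example above and a connected dask.distributed Client, would be:
from dask.distributed import Client
client = Client()                                         # or connect to an existing scheduler

delayed_store = ar.store(za, lock=False, compute=False)   # build the task graph only
future = client.compute(delayed_store)                    # the slow pre-compute step happens here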
This reveals that the pre-compute step takes about 10 s. Monitoring the distributed scheduler, I can see that, once the computation starts, storing the 27 GB array takes about 1 minute 30 seconds. (This is actually not bad!)
Some debugging by @mrocklin revealed that the following step is quite slow:
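Again, the snippet itself is missing from this copy; judging from the description that follows, it was a timing of pickling the store mapping, along the lines of the measurement shown in the comments above:
import cloudpickle
%time len(cloudpickle.dumps(gcsmap))   # ~1 s when the store is a GCSMap; microseconds for a plain path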
On my system, this was taking close to 1 s. In contrast, when the store passed in is not a GCSMap but instead a path, pickling is in microsecond territory. So pickling GCSMap objects is relatively slow. I'm not sure whether this pickling happens when we call client.compute or during task execution. There is room for improvement here, but overall, zarr + gcsfs + dask seem to integrate well and give decent performance.
Xarray
This gets much worse once xarray enters the picture. (Note that this example requires xarray PR #1528, which has not been merged yet.)
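The xarray snippet is likewise missing from this copy; given the to_zarr profiling described below, it presumably looked roughly like this (the dataset construction and dimension names are illustrative, reusing the ar and gcsmap from the dask-only example):
import xarray as xr

ds = xr.Dataset({'data': (('time', 'level', 'y', 'x'), ar)})   # wrap the dask array from above
ds.to_zarr(store=gcsmap)                                        # write through the gcsfs mapping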
Now the store step takes 18 minutes. Most of this time is spent up front, during which there is little CPU activity and no network activity. After about 15 minutes, it finally starts computing, at which point the writes to GCS proceed at more or less the same rate as in the dask-only example.
Profiling the to_zarr call with snakeviz reveals that it spends most of its time waiting for thread locks. I don't understand this, since I specifically eliminated locks when storing the zarr arrays.
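For reference, the profiling pattern described here (assuming the snakeviz IPython extension is installed, and reusing the hypothetical ds and gcsmap from above) is roughly:
%load_ext snakeviz
%snakeviz ds.to_zarr(store=gcsmap)   # most of the visualized time showed up as waiting on thread locks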