Description
Is your feature request related to a problem?
This idea follows up on @orbeckst's suggestion from a few months ago and a discussion with @hmacdope about making full use of dask in mda.
The current parallelism work allows splitting a trajectory into a number of parts and then combining the intermediate results. However, letting analysis classes use dask arrays for positions, velocities, and forces across the entire trajectory could cover cases that the split-apply-combine paradigm doesn't (like RMSF, AFAIK) and potentially lead to a greater speedup.
Describe the solution you'd like
A DaskTimeSeriesAnalysisBase which accepts a dasktimeseries as an argument. A dask timeseries is exactly the same as a reader's timeseries except that it is a dask.array rather than a numpy.ndarray, so it is loaded lazily into memory and a dask task graph is created and optimized by dask automatically before .compute() is called.
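To make the idea concrete, here is a minimal sketch of an RMSF calculation over a dask-backed timeseries. It is not the code from the PR: the `rmsf` helper, the synthetic `positions` array, and the chunk sizes are all assumptions made for illustration, and `da.from_array` on an in-memory array only stands in for a reader that would load frames lazily from disk.

```python
import numpy as np
import dask.array as da

# Synthetic stand-in for a reader's timeseries, shape (n_frames, n_atoms, 3).
# In real use this would come from a trajectory reader, not random numbers.
rng = np.random.default_rng(0)
positions = rng.normal(size=(1000, 50, 3))

# Wrap the data in a dask array chunked along the frame axis.
# The chunk size of 100 frames is an arbitrary choice for this sketch.
lazy_positions = da.from_array(positions, chunks=(100, 50, 3))


def rmsf(ts):
    """Per-atom RMSF for an array of shape (n_frames, n_atoms, 3).

    Hypothetical helper: works on both numpy and dask arrays because it
    only uses methods and operators that both array types provide.
    """
    mean = ts.mean(axis=0)                    # average position per atom
    sq_dev = ((ts - mean) ** 2).sum(axis=2)   # squared deviation per frame and atom
    return sq_dev.mean(axis=0) ** 0.5         # root of the time average


lazy_rmsf = rmsf(lazy_positions)  # builds a task graph, nothing is computed yet
result = lazy_rmsf.compute()      # executes the graph, chunk by chunk
reference = rmsf(positions)       # same calculation on the plain numpy array
```

The analysis function is written once against the array interface, so the same code runs serially on a numpy.ndarray or in parallel on a dask.array; only the final `.compute()` call differs.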
Describe alternatives you've considered
Do nothing.
Additional context
I provide an extremely minimal example in PR #4714. There, computing RMSF with dask rather than in serial gives a speedup of ~15x.
Sample notebook available here: https://github.com/ljwoods2/mdanalysis/blob/dask-timeseries/tmp/lazyts.ipynb