Bad performance of tf.data.Dataset API when dealing with windowed inputs / timeseries problems / 3D arrays #44675
Labels: comp:data, stat:awaiting tensorflower, TF 2.4, type:performance
Please make sure that this is an issue related to the performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:performance_template
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.4.0-dev20201006
Python version: 3.7
Describe the current behavior
I work on time-series forecasting, which involves manipulating 3-D arrays (batch_size, nb_timesteps, nb_features) as inputs to my models.
I used to rely on Keras "Sequence" objects to feed the "fit" function.
However, I recently moved to the tf.data.Dataset API to get a more robust input pipeline. In addition, it allows using named features, which is a huge advantage. Nonetheless, I found that both the usability and the performance of tf datasets could be improved when it comes to windowed data:
- Windowing a dataset of named features is slow and not very intuitive. I think this could be improved (see issue "For a dataset of nested elements, window creates a dataset of nested datasets of flat elements, not a dataset of datasets of nested elements." #43703).
- Datasets of windows tend to be heavy. When shuffling, the dataset "shuffle" method necessarily pre-buffers all samples used for training, resulting in huge memory usage, which makes training such models intractable with the Dataset API. Note that for time-series problems, since we start from chronologically ordered data, it is very important to get a "perfect" shuffle, so we cannot reduce the shuffle buffer size (it must be set equal to or greater than the dataset cardinality). I was thinking: maybe the "shuffle" method could be improved so that it does not pre-buffer all possible samples, but instead does the following:
- index the whole dataset, in case cardinality can be known
- shuffle the index list
- pick-up a new batch based on the shuffled index list, at each training step
This would be very similar to the way the Keras Sequence works. However, the Keras Sequence lacks many other useful features brought by the Dataset API, so it would be good to get the best of both in the Dataset API, which now seems to be the standard for data input pipelines in tf.
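The steps above can already be approximated by hand with today's API: shuffle a lightweight dataset of window indices instead of the windows themselves, then gather each window from the in-memory array at iteration time. A minimal sketch (array shapes and names are illustrative, not from the original report):

```python
import numpy as np
import tensorflow as tf

# Hypothetical setup: chronologically ordered data of shape
# (nb_samples, nb_features); each training sample is a fixed-length window.
data = tf.constant(np.random.rand(1_000, 8), dtype=tf.float32)
window_size = 24
nb_windows = data.shape[0] - window_size + 1  # cardinality is known

# 1. Index the whole dataset: one integer per possible window.
indices = tf.data.Dataset.range(nb_windows)

# 2. Shuffle the small index list instead of buffering heavy windows.
indices = indices.shuffle(nb_windows, reshuffle_each_iteration=True)

# 3. Gather each window from the in-memory array at each training step.
def gather_window(i):
    return data[i : i + window_size]

dataset = (
    indices.map(gather_window, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
```

The shuffle buffer here holds only int64 indices, so memory stays flat regardless of the window size, while the shuffle remains "perfect" over all windows.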
Describe the expected behavior
Improved performance and architecture for windowed datasets using the tf.data.Dataset API.
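For context on the first point, a minimal sketch of what windowing a dataset of named features currently requires (feature names are illustrative): `window` yields a structure of per-feature datasets, so each window must be re-zipped and batched by hand to recover nested elements.

```python
import tensorflow as tf

# A dataset of named features, as described above.
ds = tf.data.Dataset.from_tensor_slices(
    {"temperature": tf.range(10.0), "pressure": tf.range(10.0, 20.0)}
)

# `window` produces, per window, a dict of datasets (one per feature),
# not a dataset of dicts; re-zip and batch to get nested elements back.
windows = ds.window(size=5, shift=1, drop_remainder=True)
windows = windows.flat_map(
    lambda w: tf.data.Dataset.zip(w).batch(5, drop_remainder=True)
)

for elem in windows.take(1):
    print(elem["temperature"])  # first window: [0. 1. 2. 3. 4.]
```

This extra zip/batch step is the non-intuitive part referenced in issue #43703.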