
Bad performance of tf.data.Dataset API when dealing with windowed inputs / timeseries problems / 3D arrays #44675

Open
scd75 opened this issue Nov 7, 2020 · 1 comment
Assignees
Labels
comp:data (tf.data related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:performance (Performance Issue)

Comments

scd75 commented Nov 7, 2020

Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:performance_template

System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
TensorFlow installed from (source or binary): binary
TensorFlow version: 2.4.0-dev20201006
Python version: 3.7

Describe the current behavior
I work on timeseries forecasting, which involves manipulating 3-D arrays of shape (batch_size, nb_timesteps, nb_features) as model inputs.
I used to rely on Keras "Sequence" objects to feed the "fit" function.
However, I recently moved to the tf.data.Dataset API to get a more robust input pipeline. In addition, it allows using named features, which is a huge advantage. Nonetheless, I found that the usability and performance of tf.data datasets could be improved when it comes to windowed data:

  1. windowing a dataset of named features is slow and not very intuitive (a sketch of the current workaround follows this list). I think this could be improved, see issue #43703 ("For a dataset of nested elements, window creates a dataset of nested datasets of flat elements, not a dataset of datasets of nested elements").

  2. datasets of windows tend to be heavy. While shuffling, the dataset "shuffle" method necessarily pre-buffers all samples used for training, resulting in huge memory usage, which makes training such models intractable with the Dataset API. Note that for timeseries problems, since we start from chronologically ordered data, it is very important to get a "perfect" shuffle, so we cannot reduce the shuffle buffer size (it has to be set equal to or greater than the dataset cardinality). Maybe the "shuffle" method could be improved so that it does not pre-buffer all possible samples, but instead does the following (see the index-based sketch below):
    - index the whole dataset, when the cardinality can be known
    - shuffle the index list
    - pick up a new batch based on the shuffled index list at each training step
    This would be very similar to the way the Keras Sequence works. However, the Keras Sequence lacks many other useful features brought by the Dataset API, so it would be good to get the best of both in the Dataset API, which now seems to be the standard for data input pipelines in TF.
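
For reference, here is a minimal sketch of the workaround for point 1, assuming a toy in-memory series of two named features and a 24-step lookback (the names series, WINDOW_SIZE and to_window are illustrative, not from an existing API): each element produced by window() on a dict dataset is a dict of per-feature sub-datasets, which has to be re-zipped and re-batched before it can be fed to a model.

```python
import numpy as np
import tensorflow as tf

WINDOW_SIZE = 24  # illustrative lookback length

# Toy dataset of named features: one scalar per feature per timestep.
series = {
    "temperature": np.arange(1000, dtype=np.float32),
    "load": 2.0 * np.arange(1000, dtype=np.float32),
}
ds = tf.data.Dataset.from_tensor_slices(series)

# window() on a dict dataset yields, for each window, a dict of
# *datasets* of flat elements, so every window has to be re-zipped
# and re-batched to recover a dict of [WINDOW_SIZE] tensors.
def to_window(window_dict):
    return tf.data.Dataset.zip(
        {name: sub.batch(WINDOW_SIZE) for name, sub in window_dict.items()}
    )

windowed = (
    ds.window(WINDOW_SIZE, shift=1, drop_remainder=True)
      .flat_map(to_window)
)

for w in windowed.take(1):
    print({k: v.shape for k, v in w.items()})
    # {'temperature': TensorShape([24]), 'load': TensorShape([24])}
```

The extra window/zip/flat_map round-trip is what makes this both unintuitive and slow for long series.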

Describe the expected behavior
Improved performance and architecture for windowed datasets using the tf.data.Dataset API.
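
To make the proposal concrete, here is a minimal sketch of the index-based shuffling described in point 2, assuming the raw series fits in memory as a dict of 1-D arrays (WINDOW_SIZE, BATCH_SIZE, series and index_to_window are illustrative names, not an existing API): only window start indices go through shuffle, so the buffer holds scalars instead of full windows, while a full-cardinality shuffle is preserved.

```python
import numpy as np
import tensorflow as tf

WINDOW_SIZE = 24   # illustrative lookback length
BATCH_SIZE = 32    # illustrative batch size

# Assume the raw, chronologically ordered series fits in memory.
series = {
    "temperature": np.arange(1000, dtype=np.float32),
    "load": 2.0 * np.arange(1000, dtype=np.float32),
}
n_windows = len(series["temperature"]) - WINDOW_SIZE + 1
series_t = {k: tf.constant(v) for k, v in series.items()}

def index_to_window(start):
    # Slice each named feature at the (already shuffled) start index;
    # full windows are only materialized after shuffling.
    start = tf.cast(start, tf.int32)
    return {
        k: tf.slice(v, [start], [WINDOW_SIZE]) for k, v in series_t.items()
    }

ds = (
    tf.data.Dataset.range(n_windows)                      # one index per window
      .shuffle(n_windows, reshuffle_each_iteration=True)  # "perfect" shuffle over scalars only
      .map(index_to_window,
           num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .batch(BATCH_SIZE)
      .prefetch(tf.data.experimental.AUTOTUNE)
)
```

This mirrors the way a Keras Sequence indexes into the data, while keeping the rest of the tf.data pipeline (map, batch, prefetch) available.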

@scd75 scd75 added the type:performance Performance Issue label Nov 7, 2020
@scd75 scd75 changed the title Bad performance of tf.data.Dataset APIs when dealing with windowed inputs / timeseries problems / 3D arrays Bad performance of tf.data.Dataset API when dealing with windowed inputs / timeseries problems / 3D arrays Nov 7, 2020
@amahendrakar amahendrakar added comp:data tf.data related issues TF 2.4 for issues related to TF 2.4 type:feature Feature requests labels Nov 8, 2020
@rmothukuru rmothukuru assigned ymodak and unassigned rmothukuru Nov 11, 2020
@ymodak ymodak added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed type:feature Feature requests labels Nov 13, 2020
@ymodak ymodak assigned aaudiber and unassigned ymodak Nov 13, 2020
@code-fury

I am having the same issue. Are there any fixes?
