Bad performance of tf.data.Dataset API when dealing with windowed inputs / timeseries problems / 3D arrays #44675
Labels: comp:data, stat:awaiting tensorflower, TF 2.4, type:performance
Please make sure that this is an issue related to the performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:performance_template
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.4.0-dev20201006
Python version: 3.7
Describe the current behavior
I work on time-series forecasting, which involves manipulating 3-D arrays (batch_size, nb_timesteps, nb_features) as inputs to my models.
I used to rely on Keras "Sequence" objects to feed the "fit" function.
However, I recently moved to the tf.data.Dataset API to get a more robust input pipeline. In addition, it allows using named features, which is a huge advantage. Nonetheless, I found that both the usability and the performance of tf datasets could be improved when it comes to windowed data:
- Windowing a dataset of named features is slow and not very intuitive. I think this could be improved (see issue "For a dataset of nested elements, window creates a dataset of nested datasets of flat elements, not a dataset of datasets of nested elements." #43703).
- Datasets of windows tend to be heavy. When shuffling, the dataset "shuffle" method necessarily pre-buffers all samples used for training, resulting in huge memory usage, which makes training such models intractable with the Dataset API. Note that for time-series problems, since we start from chronologically ordered data, it is very important to get a "perfect" shuffle, so we cannot reduce the shuffle buffer size (it must be set equal to or greater than the dataset cardinality). I was thinking: maybe the "shuffle" method could be improved so that it does not pre-buffer all possible samples, but instead does the following:
- index the whole dataset, in case cardinality can be known
- shuffle the index list
- pick-up a new batch based on the shuffled index list, at each training step
This would be very similar to the way the Keras Sequence works. However, the Keras Sequence lacks many other useful features brought by the Dataset API, so it would be good to get the best of both in the Dataset API, which now seems to be the standard for data input pipelines in tf.
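The steps above can already be approximated by hand with today's API: shuffle a lightweight dataset of window indices instead of the windows themselves, then gather each window from the in-memory array at iteration time. A minimal sketch (array shapes and names are illustrative, not from the original report):

```python
import numpy as np
import tensorflow as tf

# Hypothetical setup: chronologically ordered data of shape
# (nb_samples, nb_features); each training sample is a fixed-length window.
data = tf.constant(np.random.rand(1_000, 8), dtype=tf.float32)
window_size = 24
nb_windows = data.shape[0] - window_size + 1  # cardinality is known

# 1. Index the whole dataset: one integer per possible window.
indices = tf.data.Dataset.range(nb_windows)

# 2. Shuffle the small index list instead of buffering heavy windows.
indices = indices.shuffle(nb_windows, reshuffle_each_iteration=True)

# 3. Gather each window from the in-memory array at each training step.
def gather_window(i):
    return data[i : i + window_size]

dataset = (
    indices.map(gather_window, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
```

The shuffle buffer here holds only int64 indices, so memory stays flat regardless of the window size, while the shuffle remains "perfect" over all windows.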
Describe the expected behavior
Improved performance and architecture for windowed datasets using the tf.data.Dataset API.
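For context on the first point, a minimal sketch of what windowing a dataset of named features currently requires (feature names are illustrative): `window` yields a structure of per-feature datasets, so each window must be re-zipped and batched by hand to recover nested elements.

```python
import tensorflow as tf

# A dataset of named features, as described above.
ds = tf.data.Dataset.from_tensor_slices(
    {"temperature": tf.range(10.0), "pressure": tf.range(10.0, 20.0)}
)

# `window` produces, per window, a dict of datasets (one per feature),
# not a dataset of dicts; re-zip and batch to get nested elements back.
windows = ds.window(size=5, shift=1, drop_remainder=True)
windows = windows.flat_map(
    lambda w: tf.data.Dataset.zip(w).batch(5, drop_remainder=True)
)

for elem in windows.take(1):
    print(elem["temperature"])  # first window: [0. 1. 2. 3. 4.]
```

This extra zip/batch step is the non-intuitive part referenced in issue #43703.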