
Does TFDF support lazy dataset loading during training? #30

Closed
Willian-Zhang opened this issue Jun 21, 2021 · 2 comments

Comments

@Willian-Zhang
Contributor

I'm asking for this feature because the dataset I'm working with is generally larger than RAM (>1.5 TiB).

For regular TensorFlow tasks, this can be worked around by tweaking the training loop and the dataset API.
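
A minimal sketch of that workaround for a plain TensorFlow model, streaming sharded TFRecord files so that only one batch is resident in memory at a time (the file pattern and the parse_example feature spec are illustrative placeholders, not from this project):

import tensorflow as tf

def parse_example(serialized):
    # Placeholder feature spec; adapt to the actual TFRecord layout.
    spec = {
        "features": tf.io.FixedLenFeature([16], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["features"], parsed["label"]

# The dataset is read lazily from disk, one batch at a time.
dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("/data/train-*.tfrecord"))
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(1024)
    .prefetch(tf.data.AUTOTUNE)
)

# A gradient-descent model can then consume the stream without ever holding
# the full dataset in memory, e.g.:
# model.fit(dataset, epochs=10)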

As for TFDF, if I understand correctly, it is a wrapper over the Yggdrasil C++ API, and datasets are either copied or moved to Yggdrasil as a whole:

// Initialize a dataset (including the dataset's dataspec) from the linked
// resource aggregators.
tensorflow::Status InitializeDatasetFromFeatures(
    tensorflow::OpKernelContext* ctx,
    const ::yggdrasil_decision_forests::dataset::proto::DataSpecificationGuide& guide,
    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);

// Moves the feature values contained in the aggregators into the dataset.
// Following this call, the feature aggregators are empty.
tensorflow::Status MoveExamplesFromFeaturesToDataset(
    tensorflow::OpKernelContext* ctx,
    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);

However, I see some interesting code in Yggdrasil:

https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L84-L93
https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L132-L140

Could you clarify how datasets are handled within and between TFDF and Yggdrasil? Is it even possible to train on a large dataset (larger than RAM)? If that can be achieved through TFRecord, does it depend on how we define the TFRecord data layout?

@achoum
Collaborator

achoum commented Jun 24, 2021

Hi Willian,

Thanks for the interest :)

Yggdrasil DF supports two types of dataset inputs for training: training from an in-memory dataset, or training from a set of dataset files (for example, a collection of TFRecord files). The first option is efficient for small datasets, while the second is best for large datasets (e.g. datasets that do not fit in memory). Each learning algorithm implements one or the other interface.
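
As a rough illustration of the two input styles (the "<format>:<path>" typed-path convention follows Yggdrasil's dataset documentation; the concrete paths and the "@100" shard notation are made-up examples):

# (a) In-memory interface: a finite tf.data.Dataset that is fully materialized
#     by the trainer before the trees are grown. Efficient when it fits in RAM.
in_memory_input = train_dataset  # e.g. a tf.data.Dataset of (features, label)

# (b) File-based interface: a typed dataset path read directly from disk, so
#     the examples never need to fit in memory all at once.
file_based_input = "tfrecord+tfe:/data/train@100"  # sharded TFRecords of tf.Example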

We have not open-sourced any learning algorithm for large datasets in Yggdrasil yet. This should happen soon (mid-Q3). Notably, we will open-source the Exact Distributed Random Forest algorithm.

Currently, TF-DF uses the in-memory interface of Yggdrasil. At training time, the dataset is streamed and stored in memory (with a few memory optimizations). At the end of the first epoch, the Yggdrasil training starts.
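
As a small sketch of that flow using the public TF-DF Keras API (the model class is just an example, and "dataset" is assumed to be a finite tf.data.Dataset of (features, label) batches, e.g. with features given as a dict of named tensors):

import tensorflow_decision_forests as tfdf

model = tfdf.keras.GradientBoostedTreesModel()

# fit() reads the dataset exactly once: each batch is appended to Yggdrasil's
# in-memory dataset, and the actual tree training starts only after that
# single pass completes.
model.fit(dataset)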

We are currently working on a distributed version of TF-DF compatible with TF Distribution Strategies. Both the data and the computation will be distributed. This will be released at the same time as the distributed learning algorithm in Yggdrasil.

Cheers,
M.

achoum closed this as completed Jun 30, 2021
@rjchee

rjchee commented Aug 4, 2023

Hi, sorry for reopening this old issue. Is there currently any solution that allows a single worker to stream over a dataset instead of requiring the entire dataset to be held in memory?

In my case, I have a large dataset that worked reasonably well with a GBDT model from TF1's estimator API, but I'm having difficulty migrating it to this library due to the massive memory requirements of storing the dataset. Migrating to distributed training on a cluster of workers is unfortunately not an option for me.
