
Does TFDF support lazy dataset loading during training? #30

Closed
Willian-Zhang opened this issue Jun 21, 2021 · 2 comments

Comments

@Willian-Zhang
Contributor

I'm asking for this feature because the dataset I'm working with is generally larger than RAM (>1.5 TiB).

For regular TensorFlow tasks, this can be worked around by tweaking the training loop and the dataset API.
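
A minimal sketch of that workaround for a plain TensorFlow model, streaming sharded TFRecord files so that only one batch is resident in memory at a time (the file pattern and the parse_example feature spec are illustrative placeholders, not from this project):

import tensorflow as tf

def parse_example(serialized):
    # Placeholder feature spec; adapt to the actual TFRecord layout.
    spec = {
        "features": tf.io.FixedLenFeature([16], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["features"], parsed["label"]

# The dataset is read lazily from disk, one batch at a time.
dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("/data/train-*.tfrecord"))
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(1024)
    .prefetch(tf.data.AUTOTUNE)
)

# A gradient-descent model can then consume the stream without ever holding
# the full dataset in memory, e.g.:
# model.fit(dataset, epochs=10)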

As for TFDF, if I understand correctly, it is a wrapper over the Yggdrasil C++ API, and datasets are either copied or moved to Yggdrasil as a whole:

// Initialize a dataset (including the dataset's dataspec) from the linked
// resource aggregators.
tensorflow::Status InitializeDatasetFromFeatures(
    tensorflow::OpKernelContext* ctx,
    const ::yggdrasil_decision_forests::dataset::proto::DataSpecificationGuide& guide,
    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);

// Moves the feature values contained in the aggregators into the dataset.
// Following this call, the feature aggregators are empty.
tensorflow::Status MoveExamplesFromFeaturesToDataset(
    tensorflow::OpKernelContext* ctx,
    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);

However, I see some interesting code in Yggdrasil:

https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L84-L93
https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L132-L140

Could you clarify how datasets are handled within and between TFDF and Yggdrasil? Is it even possible to train on a large dataset (larger than RAM)? If that can be achieved through TFRecord, does it depend on how we define the TFRecord data layout?

@achoum
Collaborator

achoum commented Jun 24, 2021

Hi Willian,

Thanks for the interest :)

Yggdrasil DF supports two types of dataset inputs for training: training from an in-memory dataset, or training from a set of dataset files (for example, a collection of TFRecord files). The first option is efficient for small datasets, while the second is best for large datasets (e.g. datasets that do not fit in memory). Each learning algorithm implements one or the other interface.
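
As a rough illustration of the two input styles (the "<format>:<path>" typed-path convention follows Yggdrasil's dataset documentation; the concrete paths and the "@100" shard notation are made-up examples):

# (a) In-memory interface: a finite tf.data.Dataset that is fully materialized
#     by the trainer before the trees are grown. Efficient when it fits in RAM.
in_memory_input = train_dataset  # e.g. a tf.data.Dataset of (features, label)

# (b) File-based interface: a typed dataset path read directly from disk, so
#     the examples never need to fit in memory all at once.
file_based_input = "tfrecord+tfe:/data/train@100"  # sharded TFRecords of tf.Example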

We have not open-sourced any learning algorithm for large datasets in Yggdrasil yet. This should happen soon (mid-Q3). Notably, we will open-source the Exact Distributed Random Forest algorithm.

Currently, TF-DF uses the in-memory interface of Yggdrasil. At training time, the dataset is streamed and stored in memory (with a few memory optimizations). At the end of the first epoch, the Yggdrasil training starts.
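
As a small sketch of that flow using the public TF-DF Keras API (the model class is just an example, and "dataset" is assumed to be a finite tf.data.Dataset of (features, label) batches, e.g. with features given as a dict of named tensors):

import tensorflow_decision_forests as tfdf

model = tfdf.keras.GradientBoostedTreesModel()

# fit() reads the dataset exactly once: each batch is appended to Yggdrasil's
# in-memory dataset, and the actual tree training starts only after that
# single pass completes.
model.fit(dataset)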

We are currently working on a distributed version of TF-DF compatible with TF Distribution Strategies. Both the data and the computation will be distributed. This will be released at the same time as the distributed learning algorithm in Yggdrasil.

Cheers,
M.

achoum closed this as completed Jun 30, 2021
@rjchee

rjchee commented Aug 4, 2023

Hi, sorry for reopening this old issue. Is there currently any solution that allows a single worker to stream over a dataset instead of requiring the entire dataset to be held in memory?

In my case, I have a large dataset that worked reasonably well with a GBDT model from TF1's estimator API, but I'm having difficulty migrating it to this library due to the massive memory requirements of storing the dataset. Migrating to distributed training on a cluster of workers is unfortunately not an option for me.
