I wonder if you could clarify how datasets are handled in and between TF-DF and Yggdrasil. Is it even possible to train on a large dataset (larger than RAM)? If that can be achieved by working with TFRecord files, does it relate to how the TFRecord data layout is defined?
Yggdrasil DF supports two types of dataset input for training: training from an in-memory dataset, or training from a set of dataset files (for example, a collection of TFRecord files). The first option is efficient for small datasets, while the second is best for large datasets (e.g. datasets that do not fit in memory). Each learning algorithm implements one or the other interface.
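To make the distinction concrete, here is a minimal, hypothetical sketch (plain Python with stand-in functions, not the actual Yggdrasil or TF-DF API) contrasting the two input styles:

```python
import csv
import io


def train_in_memory(rows):
    """In-memory interface: the learner receives the whole dataset at once."""
    # All examples are materialized before training starts.
    return f"trained on {len(rows)} examples"


def train_from_files(open_shard, shard_names):
    """File-based interface: the learner pulls examples shard by shard."""
    n = 0
    for name in shard_names:
        # Only one shard needs to be resident in memory at a time.
        with open_shard(name) as f:
            for _row in csv.reader(f):
                n += 1
    return f"trained on {n} examples"


# Tiny in-memory "files" standing in for, e.g., a set of TFRecord shards.
shards = {
    "part-0.csv": "1,a\n2,b\n",
    "part-1.csv": "3,c\n",
}
open_shard = lambda name: io.StringIO(shards[name])

print(train_in_memory([(1, "a"), (2, "b"), (3, "c")]))  # trained on 3 examples
print(train_from_files(open_shard, sorted(shards)))     # trained on 3 examples
```

Both paths see the same examples; the difference is only whether the full dataset is ever resident in memory at once.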
We have not open-sourced any learning algorithm for large datasets in Yggdrasil yet. This should happen soon (mid-Q3). Notably, we will open-source the Exact Distributed Random Forest algorithm.
Currently, TF-DF uses the in-memory interface of Yggdrasil. At training time, the dataset is streamed and stored in memory (with a few memory optimizations). At the end of the first epoch, the Yggdrasil training starts.
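That in-memory path can be sketched as follows (a pure-Python illustration of the pattern, with stubbed names, not TF-DF's actual implementation): the input dataset is consumed as a stream for one epoch, its columns are accumulated in memory, and only then is the trainer invoked.

```python
def stream_batches():
    """Stands in for a tf.data.Dataset yielding (features, labels) batches."""
    yield {"f1": [1.0, 2.0]}, [0, 1]
    yield {"f1": [3.0]}, [1]


def collect_one_epoch(batches):
    """Accumulate streamed batches into column-wise in-memory storage."""
    columns = {"f1": [], "label": []}
    for features, labels in batches:
        columns["f1"].extend(features["f1"])
        columns["label"].extend(labels)
    return columns


def yggdrasil_train_stub(columns):
    # Placeholder for the real in-memory Yggdrasil training call.
    return f"training on {len(columns['label'])} examples"


dataset = collect_one_epoch(stream_batches())  # end of the first "epoch"
print(yggdrasil_train_stub(dataset))           # training starts only now
```

This is why the peak memory usage is proportional to the full dataset size, even though the input is streamed.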
We are currently working on a distributed version of TF-DF compatible with TF Distribution Strategies. Both the data and the computation will be distributed. It will be released at the same time as the distributed learning algorithm in Yggdrasil.
Hi, sorry for reopening this old issue. Is there currently any solution that allows a single worker to stream over a dataset instead of requiring the entire dataset to be in memory?
In my case, I have a large dataset that worked reasonably well with a GBDT model from TF1's estimator API, but I'm having difficulty migrating it over to use this library due to the massive memory requirements from storing the dataset. Migrating to use distributed training on a cluster of workers is not an option for me unfortunately.
I'm asking for this feature because the dataset I'm working with is generally larger than RAM (>1.5 TiB).
For regular TensorFlow tasks, this can be worked around by tweaking the training loop and the tf.data Dataset API. As for TF-DF, if I understand correctly, it is a wrapper over the Yggdrasil C++ API, and datasets are either copied or moved to Yggdrasil as a whole:
decision-forests/tensorflow_decision_forests/tensorflow/ops/training/features.h, lines 381 to 393 at 0114e4a
However, I'm seeing some interesting code in Yggdrasil:
https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L84-L93
https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L132-L140
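The linked tests exercise glob-style matching over sharded files; a streaming reader over such shards could, in principle, look like this (a hypothetical pure-Python sketch using the stdlib fnmatch module, not Yggdrasil's filesystem API):

```python
import fnmatch


def iter_examples(shard_contents, pattern):
    """Lazily yield records from shards whose names match a glob pattern,
    visiting one shard at a time instead of loading everything up front."""
    for name in sorted(shard_contents):
        if fnmatch.fnmatch(name, pattern):
            for record in shard_contents[name]:
                yield record


# Fake sharded dataset standing in for files on disk.
shards = {
    "train-00000-of-00002": ["r1", "r2"],
    "train-00001-of-00002": ["r3"],
    "test-00000-of-00001": ["t1"],
}
print(list(iter_examples(shards, "train-*")))  # ['r1', 'r2', 'r3']
```

Because the function is a generator, a consumer can process records as they arrive without ever holding more than one shard's worth of data.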