From 80324db3fbc0e148bb68bb5ebaeb0e6d3a51fa2e Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Tue, 26 Jun 2018 23:15:41 -0400 Subject: [PATCH 01/11] add wip --- rfcs/20180626-tensor-forest.md | 60 ++++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) create mode 100644 rfcs/20180626-tensor-forest.md diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md new file mode 100644 index 000000000..3b2f4f764 --- /dev/null +++ b/rfcs/20180626-tensor-forest.md @@ -0,0 +1,60 @@ +# TensorForest Estimator + +| Status | Proposed | +:---------------|:-----------------------------------------------------| +| **Author(s)** | Peng Yu(yupbank@gmail.com) | +| **Sponsor** | Natalia P (Google) | +| **Updated** | 2018-06-26 | + +## Objective + +In this doc, we discuss the TensorForest Estimator API, which enable user create +Random Forest models. + +## Motivation + +Since tree algorithm is one of the most popular algorithm used in kaggle competition +and we already have a contrib project tensor_forest and people like them. If would be +beneficial to move them inside of canned estimators. + +## Design Proposal + +### Examples + +``` +classifier = random_forest.TensorForestEstimator(feature_columns, + n_classes, + model_dir=None, + weight_column=None, + label_vocabulary=None, + n_trees=50, + max_nodes=1000, + num_trainers=1, + num_splits_to_consider=10, + split_after_samples=250, + bagging_fraction=1.0, + feature_bagging_fraction=1.0, + base_random_seed=0) + +def input_fn_train(): + ... + return dataset + +classifier.train(input_fn=input_fn_train) + +def input_fn_eval(): + ... + return dataset + +metrics = classifier.evaluate(input_fn=input_fn_eval) +``` +## Detailed Design + +This section is optional. Elaborate on details if they’re important to +understanding the design, but would make it hard to read the proposal section +above. + +## Questions and Discussion Topics + +Seed this with open questions you require feedback on from the RFC process. + From 3a762118cf7c66ecb3d34388a20db240e5bb911a Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Thu, 28 Jun 2018 21:44:12 -0400 Subject: [PATCH 02/11] add regressor --- rfcs/20180626-tensor-forest.md | 96 ++++++++++++++++++++++++++++------ 1 file changed, 81 insertions(+), 15 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index 3b2f4f764..f611c069b 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -9,32 +9,39 @@ ## Objective In this doc, we discuss the TensorForest Estimator API, which enable user create -Random Forest models. +[Extremely Randomized Forest](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) +Classifier and Regressor. +And by inheriting from the `Estimator` class, all the corresponding interfaces will be supported ## Motivation Since tree algorithm is one of the most popular algorithm used in kaggle competition -and we already have a contrib project tensor_forest and people like them. If would be -beneficial to move them inside of canned estimators. +and we already have a contrib project [tensor_forest](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensor_forest) and people like them. If would be beneficial to move them inside of canned estimators. 
## Design Proposal -### Examples +### TensorForestClassifier ``` -classifier = random_forest.TensorForestEstimator(feature_columns, - n_classes, +bucketized_feature_1 = bucketized_column( + numeric_column('feature_1'), BUCKET_BOUNDARIES_1) +bucketized_feature_2 = bucketized_column( + numeric_column('feature_2'), BUCKET_BOUNDARIES_2) + +classifier = estimator.TensorForestClassifier(feature_columns=[bucketized_feature_1, bucketized_feature_2], model_dir=None, - weight_column=None, + n_classes=2, label_vocabulary=None, - n_trees=50, + n_trees=100, max_nodes=1000, num_trainers=1, num_splits_to_consider=10, split_after_samples=250, bagging_fraction=1.0, feature_bagging_fraction=1.0, - base_random_seed=0) + base_random_seed=0, + config=None) + def input_fn_train(): ... @@ -48,13 +55,72 @@ def input_fn_eval(): metrics = classifier.evaluate(input_fn=input_fn_eval) ``` -## Detailed Design -This section is optional. Elaborate on details if they’re important to -understanding the design, but would make it hard to read the proposal section -above. +Here are some explained details for the classifier parameters: + +* **feature_columns:** An iterable containing all the feature columns used by the model. + All items in the set should be instances of classes derived from `FeatureColumn`. +* **n_classes:** Defaults to 2. The number of classes in a classification problem. +* **model_dir:** Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. +* **label_vocabulary:** A list of strings represents possible label values. If given, labels must be string type and have any value in `label_vocabulary`. If it is not given, that means labels are already encoded as integer or float within [0, 1] for `n_classes=2` and encoded as integer values in {0, 1,..., n_classes-1} for `n_classes`>2 . Also there will be errors if vocabulary is not provided and labels are string. +* **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values. +* **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. +* **num_splits_to_consider:** Defaults to `sqrt(num_features)` capped to be between 10 and 1000. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. +* **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. +* **bagging_fraction:** If less than 1.0, then each tree sees only a different, random sampled (without replacement), bagging_fraction sized subset of the training data. Defaults to 1.0 (no bagging) because it fails to give any accuracy improvement our experiments so far. +* **feature_bagging_fraction:** If less than 1.0, then each tree sees only a different feature_bagging_fraction * num_features sized subset of the input features. Defaults to 1.0 (no feature bagging). +* **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. +* **config:** `RunConfig` object to configure the runtime settings. 
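
The `input_fn_train` and `input_fn_eval` bodies in the example above are intentionally elided. As a point of reference, a minimal sketch of such a function is shown below; the synthetic in-memory data and the use of `tf.data` are illustrative assumptions for this document, not part of the proposed API:

```python
import numpy as np
import tensorflow as tf

def input_fn_train():
  # Synthetic data purely for illustration; a real pipeline would read from
  # files. The dict keys must match the names of the feature columns above.
  features = {
      'feature_1': np.random.uniform(size=1000).astype(np.float32),
      'feature_2': np.random.uniform(size=1000).astype(np.float32),
  }
  labels = np.random.randint(0, 2, size=1000).astype(np.int32)
  dataset = tf.data.Dataset.from_tensor_slices((features, labels))
  # Shuffle, repeat and batch as is customary for Estimator training input.
  return dataset.shuffle(buffer_size=1000).repeat().batch(32)
```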
+ +### TensorForestRegressor + +``` +bucketized_feature_1 = bucketized_column( + numeric_column('feature_1'), BUCKET_BOUNDARIES_1) +bucketized_feature_2 = bucketized_column( + numeric_column('feature_2'), BUCKET_BOUNDARIES_2) + +regressor = estimator.TensorForestRegressor(feature_columns=[bucketized_feature_1, bucketized_feature_2], + model_dir=None, + label_dimension=1, + n_trees=100, + max_nodes=1000, + num_trainers=1, + num_splits_to_consider=10, + split_after_samples=250, + bagging_fraction=1.0, + feature_bagging_fraction=1.0, + base_random_seed=0, + config=None) -## Questions and Discussion Topics -Seed this with open questions you require feedback on from the RFC process. +def input_fn_train(): + ... + return dataset + +regressor.train(input_fn=input_fn_train) + +def input_fn_eval(): + ... + return dataset + +metrics = regressor.evaluate(input_fn=input_fn_eval) +``` + +Here are some explained details for the regressor parameters: + +* **feature_columns:** An iterable containing all the feature columns used by the model. All items in the set should be instances of classes derived from `FeatureColumn`. +* **model_dir:** Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. +* **label_dimension:** Defaults to 1. Number of regression targets per example. +* **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values. +* **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. +* **num_splits_to_consider:** Defaults to `sqrt(num_features)` capped to be between 10 and 1000. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. +* **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. +* **bagging_fraction:** If less than 1.0, then each tree sees only a different, random sampled (without replacement), bagging_fraction sized subset of the training data. Defaults to 1.0 (no bagging) because it fails to give any accuracy improvement our experiments so far. +* **feature_bagging_fraction:** If less than 1.0, then each tree sees only a different feature_bagging_fraction * num_features sized subset of the input features. Defaults to 1.0 (no feature bagging). +* **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. +* **config:** `RunConfig` object to configure the runtime settings. 
+ +## Questions and Discussion Topics +TBD From 16d1708fe5bdaedbbf21ed98ec8355912d5ba6de Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Sat, 7 Jul 2018 21:58:15 -0400 Subject: [PATCH 03/11] address the comment --- rfcs/20180626-tensor-forest.md | 34 +++++++++++++++++++++------------- 1 file changed, 21 insertions(+), 13 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index f611c069b..ede9b57fd 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -4,7 +4,7 @@ :---------------|:-----------------------------------------------------| | **Author(s)** | Peng Yu(yupbank@gmail.com) | | **Sponsor** | Natalia P (Google) | -| **Updated** | 2018-06-26 | +| **Updated** | 2018-07-07 | ## Objective @@ -18,6 +18,24 @@ And by inheriting from the `Estimator` class, all the corresponding interfaces w Since tree algorithm is one of the most popular algorithm used in kaggle competition and we already have a contrib project [tensor_forest](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensor_forest) and people like them. If would be beneficial to move them inside of canned estimators. +## BenchMark + +Comparing with Scikit-learn ExtraTrees. Both using 100 trees with 10k nodes. And Scikit-learn ExtraTrees is a batch algorithm while TensorForest is a streaming/online algorithm. + +|Data Set| #Examples | #Features| #Classes| TensorForest Accuracy(%)/R^2 Score| Scikit-learn ExtraTrees Accuracy(%)/R^2 Score| +|-------|:---------:|:---------:|:---------:|:---------:|---------:| +| Iris| 150| 4| 3| 95.6| 94.6| +|Diabetes| 442| 10| Regression| 0.462| 0.461| +|Boston| 506| 13| Regression| 0.793| 0.872| +|Digits| 1797| 64| 10| 96.7| 97.6| +|Sensit(Comb.)| 78k| 100| 3| 81.0| 83.1| +|Aloi| 108K| 128| 1000| 89.8| 91.7| +|rcv1| 518k| 47,236| 53| 78.7| 81.5| +|Covertype| 581k| 54| 7| 83.0| 85.0| +|HiGGS| 11M| 28| 2| 70.9| 71.7| + +With single machine training, TensorForest finishes much faster on big dataset like HIGGS, takes about one percent of the time scikit-lean required. + ## Design Proposal ### TensorForestClassifier @@ -34,11 +52,8 @@ classifier = estimator.TensorForestClassifier(feature_columns=[bucketized_featur label_vocabulary=None, n_trees=100, max_nodes=1000, - num_trainers=1, num_splits_to_consider=10, split_after_samples=250, - bagging_fraction=1.0, - feature_bagging_fraction=1.0, base_random_seed=0, config=None) @@ -65,10 +80,8 @@ Here are some explained details for the classifier parameters: * **label_vocabulary:** A list of strings represents possible label values. If given, labels must be string type and have any value in `label_vocabulary`. If it is not given, that means labels are already encoded as integer or float within [0, 1] for `n_classes=2` and encoded as integer values in {0, 1,..., n_classes-1} for `n_classes`>2 . Also there will be errors if vocabulary is not provided and labels are string. * **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values. * **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. -* **num_splits_to_consider:** Defaults to `sqrt(num_features)` capped to be between 10 and 1000. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. +* **num_splits_to_consider:** Defaults to `sqrt(num_features)`. 
In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. * **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. -* **bagging_fraction:** If less than 1.0, then each tree sees only a different, random sampled (without replacement), bagging_fraction sized subset of the training data. Defaults to 1.0 (no bagging) because it fails to give any accuracy improvement our experiments so far. -* **feature_bagging_fraction:** If less than 1.0, then each tree sees only a different feature_bagging_fraction * num_features sized subset of the input features. Defaults to 1.0 (no feature bagging). * **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. * **config:** `RunConfig` object to configure the runtime settings. @@ -85,11 +98,8 @@ regressor = estimator.TensorForestRegressor(feature_columns=[bucketized_feature_ label_dimension=1, n_trees=100, max_nodes=1000, - num_trainers=1, num_splits_to_consider=10, split_after_samples=250, - bagging_fraction=1.0, - feature_bagging_fraction=1.0, base_random_seed=0, config=None) @@ -114,10 +124,8 @@ Here are some explained details for the regressor parameters: * **label_dimension:** Defaults to 1. Number of regression targets per example. * **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values. * **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. -* **num_splits_to_consider:** Defaults to `sqrt(num_features)` capped to be between 10 and 1000. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. +* **num_splits_to_consider:** Defaults to `sqrt(num_features)`. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. * **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. -* **bagging_fraction:** If less than 1.0, then each tree sees only a different, random sampled (without replacement), bagging_fraction sized subset of the training data. Defaults to 1.0 (no bagging) because it fails to give any accuracy improvement our experiments so far. -* **feature_bagging_fraction:** If less than 1.0, then each tree sees only a different feature_bagging_fraction * num_features sized subset of the input features. Defaults to 1.0 (no feature bagging). * **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. * **config:** `RunConfig` object to configure the runtime settings. 
From db86784480779a1a528011d6cbe36ef27d53d0df Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Sun, 15 Jul 2018 16:43:30 -0400 Subject: [PATCH 04/11] including natalia"s comment --- rfcs/20180626-tensor-forest.md | 137 ++++++++++++++++++++++++++------- 1 file changed, 108 insertions(+), 29 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index ede9b57fd..c4ff3dad4 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -4,19 +4,77 @@ :---------------|:-----------------------------------------------------| | **Author(s)** | Peng Yu(yupbank@gmail.com) | | **Sponsor** | Natalia P (Google) | -| **Updated** | 2018-07-07 | +| **Updated** | 2018-07-15 | ## Objective -In this doc, we discuss the TensorForest Estimator API, which enable user create -[Extremely Randomized Forest](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) -Classifier and Regressor. -And by inheriting from the `Estimator` class, all the corresponding interfaces will be supported +### Goals + +* Provide state of the art (in terms of model quality) online random forest implementation as a canned estimator in tf.estimator module. +* Design interface for the random forest estimator that is intuitive and easy to experiment with. +* Simplify the design of the current contrib implementation, making code cleaner and easier to maintain. +* Use all the new APIs for the new implementation, including supporting new feature columns and new Estimator interface +* Provide value for both + - Users with small data, which fits into memory, by having a fast local version + - Provide a distributed version that requires minimum configuration and works well out of the box. + +### Non-Goals + +* Provide an implementation with all the features available in [contrib version](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensor_forest) + ## Motivation -Since tree algorithm is one of the most popular algorithm used in kaggle competition -and we already have a contrib project [tensor_forest](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensor_forest) and people like them. If would be beneficial to move them inside of canned estimators. +Tree based algorithms have been very popular in the last decade. Random Forest by Breiman [1](https://books.google.ca/books/about/Classification_and_Regression_Trees.html?id=JwQx-WOmSyQC&redir_esc=y) is among one of the most widely used tree-based algorithms so far. Numerous empirical benchmarks demonstrate its remarkable performance on small to medium size datasets, high dimensional datasets and in Kaggle competitions([2](https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf) , [3](http://icml2008.cs.helsinki.fi/papers/632.pdf), [4](https://www.kaggle.com/dansbecker/random-forests), [5](https://www.kaggle.com/sshadylov/titanic-solution-using-random-forest-classifier), [6](https://www.kaggle.com/thierryherrie/house-prices-random-forest)). 
Random Forest are champions in industry adoption, with numerous implementations (like scikit learn [7](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), Spark [8](https://spark.apache.org/docs/2.2.0/mllib-ensembles.html), Mahout [9](https://hub.packtpub.com/learning-random-forest-using-mahout/) and other) and tutorials ([10](https://medium.com/rants-on-machine-learning/the-unreasonable-effectiveness-of-random-forests-f33c3ce28883), [11](https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd) etc. ) available online. Random Forests also remain popular in academic community, as demonstrated by a number of papers each year in venues like ICML, NIPS and JMLR ([12](http://proceedings.mlr.press/v37/nan15.html), [13](http://proceedings.mlr.press/v32/denil14.html), [14](https://icml.cc/Conferences/2018/Schedule?showEvent=3238), [15](https://www.icml.cc/Conferences/2018/Schedule?showEvent=3181) and others). + +Widespread adoption of random forests is due to the advantages of tree-based algorithms and power of ensembling. + +__Tree-based algorithms are__ + +- **Easy to use:** + - Tree models are invariant to inputs scale, so no preprocessing of numerical features is required (e.g. no normalization that is a must for gradient-descent based methods) + - They work well out of the box and are not as sensitive to hyperparameters as neural nets, thus making them easier to tune + - They are able to filter out irrelevant features +- **Easier to understand:** + - Trees can be used to obtain feature importances + - They are arguably easier to reason about, for example, tools like described [here](http://engineering.pivotal.io/post/interpreting-decision-trees-and-random-forests/) are able to explain even the whole ensembles. + +__Ensembling delivers__ + +- **Excellent performance**: by combining multiple predictors, ensembling obtains better predictive performance than that of using a single predictor [16](https://en.wikipedia.org/wiki/Ensemble_learning) +- **Reduced overfitting**: comes from the fact that individual predictors don't have to be very strong themselves (which can be obtained, for example, by using bagging or feature sampling techniques). Weaker predictors have fewer possibilities to overfit. +- **Fast inference time**: Each predictor is independent from others, so inference time can be easily parallelized across different predictors + +On top of all the above, random forests +- Provide a way to estimate the generalization performance without having a validation set: random forests use bagging, that allows to obtain out-of-bag estimates, that have been shown to be good metrics for generalization performance. + +Compared with gradient boosted trees, which is another popular algorithm that uses both ensembling and tree based learners, random forests have the following +- __Advantages__: + - **Much faster to train**: due to the fact that each tree is independent from another in a random forest, parallelization during inference time is trivial. Boosted trees, on the other hand, is an iterative algorithm that relies on predictors built so far to obtain the next predictor. + - **Might be less prone to overfitting** since trees are less correlated with each other + - **More "robust"** to different values of hyperparameters. 
They require less tuning and a default configuration works well most of the time, with tuning allowing usually just marginal improvements +- __Disadvantages__: + - Need much larger ensembles due to the fact that trees are independent + - Can't easily handle custom losses like ranking + - Can have worse performance than boosted trees for a very complicated decision boundary + +## Algorithm + +[Extremely Randomized Forest](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) is an online training algorithm, that makes quick split decisions. +In contrast with a classic random forest, in extremely randomized forests: + +- Split candidates are generated after seeing only a number of samples from the data (as opposed to seeing the full dataset or a large portion, determined by the bagging fraction, of it. +- Splits quality is evaluated over the samples of the data, as opposed to full or a large portion of the dataset. + +Those modifications allow to make quick decisions and forest can be grown in online fashion, and experiments demonstrate the ERF provide similar performance to that of classical forests (at the cost of having to potentially build much deeper trees). + +At the start of training, the tree structure is initialized to a root node, and the leaf and growing statistics for it are both empty. Then, for each batch `{(x_i, y_i)}` of training data, the following steps are performed: + +1. Given the current tree structure, each instance `x_i` is used to find the leaf assignment `l_i` where this instance falls into. +2. `Y_i` (the label of `x_i`) is used to update the leaf statistics of leaf `l_i`. +3. If the growing statistics for the leaf `l_i` do not yet contain `num_splits_to_consider` splits, `x_i` is used to generate another split. Specifically, a random feature value is chosen, and `x_i`'s value at that feature is used for the split's threshold. +4. Otherwise, `(x_i, y_i)` is used to update the statistics of every split in the growing statistics of leaf `l_i`. If leaf `l_i` has now seen `split_after_samples` data points since creating all of its potential splits, the split with the best score is chosen, and the tree structure is grown. + ## BenchMark @@ -38,15 +96,14 @@ With single machine training, TensorForest finishes much faster on big dataset l ## Design Proposal +### Interface ### TensorForestClassifier ``` -bucketized_feature_1 = bucketized_column( - numeric_column('feature_1'), BUCKET_BOUNDARIES_1) -bucketized_feature_2 = bucketized_column( - numeric_column('feature_2'), BUCKET_BOUNDARIES_2) +feature_1 = numeric_column('feature_1') +feature_2 = numeric_column('feature_2') -classifier = estimator.TensorForestClassifier(feature_columns=[bucketized_feature_1, bucketized_feature_2], +classifier = estimator.TensorForestClassifier(feature_columns=[feature_1, feature_2], model_dir=None, n_classes=2, label_vocabulary=None, @@ -73,27 +130,23 @@ metrics = classifier.evaluate(input_fn=input_fn_eval) Here are some explained details for the classifier parameters: -* **feature_columns:** An iterable containing all the feature columns used by the model. - All items in the set should be instances of classes derived from `FeatureColumn`. -* **n_classes:** Defaults to 2. The number of classes in a classification problem. -* **model_dir:** Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. 
-* **label_vocabulary:** A list of strings represents possible label values. If given, labels must be string type and have any value in `label_vocabulary`. If it is not given, that means labels are already encoded as integer or float within [0, 1] for `n_classes=2` and encoded as integer values in {0, 1,..., n_classes-1} for `n_classes`>2 . Also there will be errors if vocabulary is not provided and labels are string. -* **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values. -* **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. -* **num_splits_to_consider:** Defaults to `sqrt(num_features)`. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. -* **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. -* **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. -* **config:** `RunConfig` object to configure the runtime settings. +- **feature_columns**: An iterable containing all the feature columns used by the model. All items in the set should be instances of classes derived from FeatureColumn. +- **n_classes**: Defaults to 2. The number of classes in a classification problem. +- **model_dir**: Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. +- **label_vocabulary**: A list of strings representing all possible label values. If provided, labels must be of string type and their values must be present in label_vocabulary list. If label_vocabulary is omitted, it is assumed that the labels are already encoded as integer or float values within [0, 1] for n_classes=2, or encoded as integer values in {0, 1,..., n_classes-1} for n_classes>2 . If vocabulary is not provided and labels are of string, an error will be generated. +- **n_trees**: The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values (assuming deep enough trees are built). +- **max_nodes**: Defaults to 10k. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. +- **num_splits_to_consider**: Defaults to sqrt(num_features). In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. +- **split_after_samples**: Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. 
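
To illustrate the `label_vocabulary` behavior described above, a hypothetical configuration with string labels might look like the following (sketch only; the feature columns are the ones from the earlier example):

```python
# String labels require an explicit vocabulary; integer labels do not.
classifier = estimator.TensorForestClassifier(
    feature_columns=[feature_1, feature_2],
    n_classes=3,
    label_vocabulary=['cat', 'dog', 'bird'],  # input_fn yields these strings as labels
    n_trees=100)
```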
+ ### TensorForestRegressor ``` -bucketized_feature_1 = bucketized_column( - numeric_column('feature_1'), BUCKET_BOUNDARIES_1) -bucketized_feature_2 = bucketized_column( - numeric_column('feature_2'), BUCKET_BOUNDARIES_2) +feature_1 = numeric_column('feature_1') +feature_2 = numeric_column('feature_2') -regressor = estimator.TensorForestRegressor(feature_columns=[bucketized_feature_1, bucketized_feature_2], +regressor = estimator.TensorForestRegressor(feature_columns=[feature_1, feature_2], model_dir=None, label_dimension=1, n_trees=100, @@ -129,6 +182,32 @@ Here are some explained details for the regressor parameters: * **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. * **config:** `RunConfig` object to configure the runtime settings. + +## High Level Design + +Each tree in the forest is trained independently and in parallel. + +In the first version, we only support dense numeric features. And no sample weight is supported yet. + +For each tree, we maintain the following data(two tf.resources): +1. Tree resource + - The tree structure, giving the two children of each non-leaf node and the split used to route data between them. Each split looks at a single input feature and compares it to a threshold value. + - Leaf statistics. Each leaf needs to gather statistics, and those statistics have the property that at the end of training, they can be turned into predictions. For classification problems, the statistics are class counts, and for regression problems they are the vector sum of the values seen at the leaf, along with a count of those values (which can be turned into the mean for the final prediction). +2. Fertile Stats resource + - Growing statistics. Each leaf needs to gather data that will potentially allow it to grow into a non-leaf parent node. That data usually consists of a list of potential splits, along with the statistics for each of those splits. Split statistics in turn consist of leaf statistics for their left and right branches, along with some other information that allows us to assess the quality of the split. For classification problems, that's usually the [gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) of the split, while for regression problems it's the [mean-squared error](https://en.wikipedia.org/wiki/Mean_squared_error). + +During training, every tree is being trained completely independent. For each tree, for every batch of data, we first pass through the tree structure to obtain the leaf ids. Then we update the leaf statistics and after that we update the grow statics and pick the leaf to grow and finally we grow the tree. + +During inferencing, for every batch of data, we pass through the tree structure and obtain the predictions from all the trees and then we average over all the predictions. + +## Distributed version + +Since the trees are independent, for the distributed version, we would distribute the number of trees required to train evenly around all the workers. For every tree, they would have two tf.resources available for training. 
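
To make the per-batch training flow described above concrete, here is a schematic, framework-free sketch of a single tree performing the same steps (route each example to a leaf, update the leaf statistics, update the growing statistics, and grow once enough samples have been seen). Every class and method name below is a hypothetical illustration, not one of the proposed ops or resources:

```python
import random
from collections import Counter

class Split(object):
  """A candidate split: one feature index and a threshold taken from data."""
  def __init__(self, feature, threshold):
    self.feature, self.threshold = feature, threshold
    self.left, self.right = Counter(), Counter()   # per-branch class counts

  def accumulate(self, x, y):
    branch = self.left if x[self.feature] <= self.threshold else self.right
    branch[y] += 1

  def score(self):
    # Size-weighted Gini impurity of the two branches; lower is better.
    def gini(counts):
      n = float(sum(counts.values()))
      return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0
    n_left, n_right = sum(self.left.values()), sum(self.right.values())
    total = max(n_left + n_right, 1)
    return (n_left * gini(self.left) + n_right * gini(self.right)) / total

class Node(object):
  def __init__(self):
    self.class_counts = Counter()   # leaf statistics
    self.splits = []                # growing statistics: candidate splits
    self.samples_seen = 0           # samples seen since all splits were created
    self.split = None               # set once the node is grown
    self.children = None

class OnlineExtraTree(object):
  """Schematic single-tree version of the training loop described above."""
  def __init__(self, num_splits_to_consider=10, split_after_samples=250):
    self.root = Node()
    self.num_splits_to_consider = num_splits_to_consider
    self.split_after_samples = split_after_samples

  def _leaf_for(self, x):
    node = self.root
    while node.split is not None:   # step 1: route x to its leaf
      go_left = x[node.split.feature] <= node.split.threshold
      node = node.children[0] if go_left else node.children[1]
    return node

  def train_on_batch(self, batch):
    for x, y in batch:
      leaf = self._leaf_for(x)
      leaf.class_counts[y] += 1     # step 2: update leaf statistics
      if len(leaf.splits) < self.num_splits_to_consider:
        f = random.randrange(len(x))            # step 3: add a random split
        leaf.splits.append(Split(f, x[f]))
      else:
        for split in leaf.splits:               # step 4: update split stats
          split.accumulate(x, y)
        leaf.samples_seen += 1
        if leaf.samples_seen >= self.split_after_samples:
          leaf.split = min(leaf.splits, key=lambda s: s.score())
          leaf.children = (Node(), Node())      # grow the tree at this leaf
```

A forest would simply hold `n_trees` such trees, each consuming the same batches independently, which is also what makes assigning trees to workers in the distributed version straightforward.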
+ +## Future Work + +Add sample importance, right now we don’t support sample importance, which it’s a widely used [feature](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier.fit). + ## Questions and Discussion Topics -TBD +[Google Groups](https://groups.google.com/a/tensorflow.org/forum/#!topic/developers/yreM9FRiBs4) From 58d0bf80401170f4a0f2efbe32510090cfc86929 Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Fri, 20 Jul 2018 21:46:30 -0400 Subject: [PATCH 05/11] add natlias nice improving --- rfcs/20180626-tensor-forest.md | 39 +++++++++++++++++++++++++--------- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index c4ff3dad4..817430b88 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -182,31 +182,50 @@ Here are some explained details for the regressor parameters: * **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. * **config:** `RunConfig` object to configure the runtime settings. +### First version supported features + +The first version will only: + +- Support dense numeric features. Categorical features would need to be imported as one-hot encoding +- No sample weight is supported +- No feature importances will be provided + ## High Level Design Each tree in the forest is trained independently and in parallel. -In the first version, we only support dense numeric features. And no sample weight is supported yet. +For each tree, we maintain the following data: + +1. Tree Resource over the DecisionTree [proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/decision_trees/proto/generic_tree_model.proto#L73). It contains information about the tree structure and has statistics for: + + - Non-leaf nodes: namely two children of each non-leaf node and the split used to route data between them. Each split looks at a single input feature and compares it to a threshold value. Right now only numeric features will be supported. Categorical features should be encoded as 1-hot. + - Leaf nodes ([proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/decision_trees/proto/generic_tree_model.proto#L137)). Each leaf needs to gather statistics, and those statistics have the property that at the end of training, they can be turned into predictions. For classification problems, the statistics are class counts, and for regression problems they are the vector sum of the values seen at the leaf, along with a count of those values (which can be turned into the mean for the final prediction). + +2. Fertile Stats resource over Growing statistics. Each leaf needs to gather data that will potentially allow it to grow into a non-leaf parent node. That data usually consists of + + - A list of potential splits, which is an array in a [proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tensor_forest/proto/fertile_stats.proto#L80) + - Split Statistics for each of those splits. Split statistics in turn consist of leaf statistics for their left and right branches, along with some other information that allows us to assess the quality of the split. 
For classification problems, that's usually the [gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) of the split, while for regression problems it's the mean-squared error. + -For each tree, we maintain the following data(two tf.resources): -1. Tree resource - - The tree structure, giving the two children of each non-leaf node and the split used to route data between them. Each split looks at a single input feature and compares it to a threshold value. - - Leaf statistics. Each leaf needs to gather statistics, and those statistics have the property that at the end of training, they can be turned into predictions. For classification problems, the statistics are class counts, and for regression problems they are the vector sum of the values seen at the leaf, along with a count of those values (which can be turned into the mean for the final prediction). -2. Fertile Stats resource - - Growing statistics. Each leaf needs to gather data that will potentially allow it to grow into a non-leaf parent node. That data usually consists of a list of potential splits, along with the statistics for each of those splits. Split statistics in turn consist of leaf statistics for their left and right branches, along with some other information that allows us to assess the quality of the split. For classification problems, that's usually the [gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) of the split, while for regression problems it's the [mean-squared error](https://en.wikipedia.org/wiki/Mean_squared_error). +During training, every tree is being trained completely independently. For each tree, for every batch of the data, we -During training, every tree is being trained completely independent. For each tree, for every batch of data, we first pass through the tree structure to obtain the leaf ids. Then we update the leaf statistics and after that we update the grow statics and pick the leaf to grow and finally we grow the tree. + - First, pass through the tree structure to obtain the leaf ids. + - Then we update the leaf statistics + - Update the growing statistics + - Pick the leaf to grow + - And finally, grow the tree. -During inferencing, for every batch of data, we pass through the tree structure and obtain the predictions from all the trees and then we average over all the predictions. +During inference, for every batch of data, we pass through the tree structure and obtain the predictions from all the trees and then we average over all the predictions. ## Distributed version -Since the trees are independent, for the distributed version, we would distribute the number of trees required to train evenly around all the workers. For every tree, they would have two tf.resources available for training. +Since the trees are independent, for the distributed version, we would distribute the number of trees required to train evenly among all the available workers. For every tree, they would have two tf.resources available for training. ## Future Work Add sample importance, right now we don’t support sample importance, which it’s a widely used [feature](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier.fit). 
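
For reference, the split quality measures mentioned in the High Level Design section above (Gini impurity for classification, mean squared error for regression) can be written as small helpers. These are illustrative only and are not the proposed implementation:

```python
def gini_impurity(class_counts):
  # Gini impurity of one branch, computed from its per-class counts.
  total = float(sum(class_counts))
  return 1.0 - sum((c / total) ** 2 for c in class_counts) if total else 0.0

def mean_squared_error(values):
  # MSE of one branch around its own mean (the regression analogue).
  if not values:
    return 0.0
  mean = sum(values) / float(len(values))
  return sum((v - mean) ** 2 for v in values) / len(values)

def split_score(left_metric, left_n, right_metric, right_n):
  # A candidate split is scored by the size-weighted metric of its branches;
  # lower scores indicate better splits for both measures.
  return (left_n * left_metric + right_n * right_metric) / (left_n + right_n)

# Example: a split sending class counts [8, 2] left and [1, 9] right.
score = split_score(gini_impurity([8, 2]), 10, gini_impurity([1, 9]), 10)
```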
+## Alternatives Considered ## Questions and Discussion Topics From fc56dbc67874dbcf47b5393c05142e1d7aff449c Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Wed, 8 Aug 2018 20:27:29 -0400 Subject: [PATCH 06/11] incoporating all the latest comments --- rfcs/20180626-tensor-forest.md | 44 ++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index 817430b88..de5a26f47 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -107,11 +107,11 @@ classifier = estimator.TensorForestClassifier(feature_columns=[feature_1, featur model_dir=None, n_classes=2, label_vocabulary=None, + head=None, n_trees=100, max_nodes=1000, num_splits_to_consider=10, split_after_samples=250, - base_random_seed=0, config=None) @@ -121,6 +121,12 @@ def input_fn_train(): classifier.train(input_fn=input_fn_train) +def input_fn_predict(): + ... + return dataset + +classifier.predict(input_fn=input_fn_predict) + def input_fn_eval(): ... return dataset @@ -133,11 +139,14 @@ Here are some explained details for the classifier parameters: - **feature_columns**: An iterable containing all the feature columns used by the model. All items in the set should be instances of classes derived from FeatureColumn. - **n_classes**: Defaults to 2. The number of classes in a classification problem. - **model_dir**: Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. -- **label_vocabulary**: A list of strings representing all possible label values. If provided, labels must be of string type and their values must be present in label_vocabulary list. If label_vocabulary is omitted, it is assumed that the labels are already encoded as integer or float values within [0, 1] for n_classes=2, or encoded as integer values in {0, 1,..., n_classes-1} for n_classes>2 . If vocabulary is not provided and labels are of string, an error will be generated. +- **label_vocabulary**: A list of strings representing all possible label values. If provided, labels must be of string type and their values must be present in label_vocabulary list. If label_vocabulary is omitted, it is assumed that the labels are already encoded as integer values within {0, 1} for n_classes=2, or encoded as integer values in {0, 1,..., n_classes-1} for n_classes>2 . If vocabulary is not provided and labels are of string, an error will be generated. +- **head**: .A `head_lib._Head` instance, the loss would be calculated for metrics purpose and not being used for training. If not provided, one will be automatically created based on params - **n_trees**: The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values (assuming deep enough trees are built). - **max_nodes**: Defaults to 10k. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. - **num_splits_to_consider**: Defaults to sqrt(num_features). In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. - **split_after_samples**: Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. +- **config**: RunConfig object to configure the runtime settings. 
+ ### TensorForestRegressor @@ -149,11 +158,11 @@ feature_2 = numeric_column('feature_2') regressor = estimator.TensorForestRegressor(feature_columns=[feature_1, feature_2], model_dir=None, label_dimension=1, + head=None, n_trees=100, max_nodes=1000, num_splits_to_consider=10, split_after_samples=250, - base_random_seed=0, config=None) @@ -163,6 +172,12 @@ def input_fn_train(): regressor.train(input_fn=input_fn_train) +def input_fn_predict(): + ... + return dataset + +regressor.predict(input_fn=input_fn_predict) + def input_fn_eval(): ... return dataset @@ -172,15 +187,15 @@ metrics = regressor.evaluate(input_fn=input_fn_eval) Here are some explained details for the regressor parameters: -* **feature_columns:** An iterable containing all the feature columns used by the model. All items in the set should be instances of classes derived from `FeatureColumn`. -* **model_dir:** Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. -* **label_dimension:** Defaults to 1. Number of regression targets per example. -* **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values. -* **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. -* **num_splits_to_consider:** Defaults to `sqrt(num_features)`. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. -* **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. -* **base_random_seed:** By default (base_random_seed = 0), the random number generator for each tree is seeded by a 64-bit random value when each tree is first created. Using a non-zero value causes tree training to be deterministic, in that the i-th tree's random number generator is seeded with the value base_random_seed + i. -* **config:** `RunConfig` object to configure the runtime settings. +- **feature_columns:** An iterable containing all the feature columns used by the model. All items in the set should be instances of classes derived from `FeatureColumn`. +- **model_dir:** Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. +- **label_dimension:** Defaults to 1. Number of regression targets per example. +- **head**: .A `head_lib._Head` instance, the loss would be calculated for metrics purpose and not being used for training. If not provided, one will be automatically created based on params +- **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values. +- **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large. +- **num_splits_to_consider:** Defaults to `sqrt(num_features)`. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node. +- **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples. +- **config:** `RunConfig` object to configure the runtime settings. 
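
Since the `head` parameter is new in this revision, here is a sketch of how a custom head might be supplied and how predictions would be consumed. It reuses the feature columns and input functions from the regressor example above; whether the estimator accepts a head constructed with `tf.contrib.estimator.regression_head` is an assumption of this sketch rather than a settled part of the proposal:

```python
# Hypothetical usage sketch; builds on the regressor example above.
import tensorflow as tf

head = tf.contrib.estimator.regression_head(label_dimension=1)

regressor = estimator.TensorForestRegressor(
    feature_columns=[feature_1, feature_2],
    head=head,           # used only for loss/metrics, not for training
    n_trees=100,
    max_nodes=1000)

regressor.train(input_fn=input_fn_train)
for prediction in regressor.predict(input_fn=input_fn_predict):
  ...  # each element is a dict of prediction tensors produced by the head
```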
### First version supported features @@ -222,6 +237,11 @@ During inference, for every batch of data, we pass through the tree structure an Since the trees are independent, for the distributed version, we would distribute the number of trees required to train evenly among all the available workers. For every tree, they would have two tf.resources available for training. +## Differences from the latest contrib version + +- Simplified code with only limited subset of features (obviously, excluding all the experimental ones) +- New estimator interface, support for new feature columns and losses + ## Future Work Add sample importance, right now we don’t support sample importance, which it’s a widely used [feature](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier.fit). From 74650ab86b11fa7583e316b025b9c5f90bf9233a Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Thu, 9 Aug 2018 14:34:11 -0400 Subject: [PATCH 07/11] add more content in the differences from the latest contrib section --- rfcs/20180626-tensor-forest.md | 1 + 1 file changed, 1 insertion(+) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index de5a26f47..609f4fe40 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -241,6 +241,7 @@ Since the trees are independent, for the distributed version, we would distribut - Simplified code with only limited subset of features (obviously, excluding all the experimental ones) - New estimator interface, support for new feature columns and losses +- We will try to reuse as much code from canned boosted trees as possible (proto, inference etc) ## Future Work From 424fe6523db36dfa3c8288c2049220abce4d3b7f Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Thu, 9 Aug 2018 14:37:35 -0400 Subject: [PATCH 08/11] update status --- rfcs/20180626-tensor-forest.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index 609f4fe40..5c3f0c922 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -1,10 +1,10 @@ # TensorForest Estimator -| Status | Proposed | +| Status | Accepted | :---------------|:-----------------------------------------------------| | **Author(s)** | Peng Yu(yupbank@gmail.com) | | **Sponsor** | Natalia P (Google) | -| **Updated** | 2018-07-15 | +| **Updated** | 2018-08-09 | ## Objective @@ -251,3 +251,5 @@ Add sample importance, right now we don’t support sample importance, which it ## Questions and Discussion Topics [Google Groups](https://groups.google.com/a/tensorflow.org/forum/#!topic/developers/yreM9FRiBs4) + +[Github Pull Request Discussion](https://github.com/tensorflow/community/pull/3) From 4e2326c175e9aeff7d776d4e5da52d342bb0098c Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Thu, 9 Aug 2018 15:05:24 -0400 Subject: [PATCH 09/11] high light python syntax --- rfcs/20180626-tensor-forest.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index 5c3f0c922..29e43d019 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -99,7 +99,7 @@ With single machine training, TensorForest finishes much faster on big dataset l ### Interface ### TensorForestClassifier -``` +```python feature_1 = numeric_column('feature_1') feature_2 = numeric_column('feature_2') @@ -151,7 +151,7 @@ Here are some explained details for the classifier parameters: ### 
TensorForestRegressor -``` +```python feature_1 = numeric_column('feature_1') feature_2 = numeric_column('feature_2') From 0b86700710fb542e9a0ce008e36add1c08881369 Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Fri, 10 Aug 2018 10:14:01 -0400 Subject: [PATCH 10/11] address comments --- rfcs/20180626-tensor-forest.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index 29e43d019..d1a0b64da 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -76,7 +76,7 @@ At the start of training, the tree structure is initialized to a root node, and 4. Otherwise, `(x_i, y_i)` is used to update the statistics of every split in the growing statistics of leaf `l_i`. If leaf `l_i` has now seen `split_after_samples` data points since creating all of its potential splits, the split with the best score is chosen, and the tree structure is grown. -## BenchMark +## Benchmark Comparing with Scikit-learn ExtraTrees. Both using 100 trees with 10k nodes. And Scikit-learn ExtraTrees is a batch algorithm while TensorForest is a streaming/online algorithm. From 030a57462ff45b42afda28519cfa587060d93634 Mon Sep 17 00:00:00 2001 From: Peng Yu Date: Fri, 10 Aug 2018 11:18:06 -0400 Subject: [PATCH 11/11] add perfromance source --- rfcs/20180626-tensor-forest.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfcs/20180626-tensor-forest.md b/rfcs/20180626-tensor-forest.md index d1a0b64da..f362f8da8 100644 --- a/rfcs/20180626-tensor-forest.md +++ b/rfcs/20180626-tensor-forest.md @@ -78,7 +78,7 @@ At the start of training, the tree structure is initialized to a root node, and ## Benchmark -Comparing with Scikit-learn ExtraTrees. Both using 100 trees with 10k nodes. And Scikit-learn ExtraTrees is a batch algorithm while TensorForest is a streaming/online algorithm. +Comparing with Scikit-learn ExtraTrees. Both using 100 trees with 10k nodes. And Scikit-learn ExtraTrees is a batch algorithm while TensorForest is a streaming/online algorithm [16](https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxtbHN5c25pcHMyMDE2fGd4OjFlNTRiOWU2OGM2YzA4MjE). |Data Set| #Examples | #Features| #Classes| TensorForest Accuracy(%)/R^2 Score| Scikit-learn ExtraTrees Accuracy(%)/R^2 Score| |-------|:---------:|:---------:|:---------:|:---------:|---------:| @@ -92,7 +92,7 @@ Comparing with Scikit-learn ExtraTrees. Both using 100 trees with 10k nodes. And |Covertype| 581k| 54| 7| 83.0| 85.0| |HiGGS| 11M| 28| 2| 70.9| 71.7| -With single machine training, TensorForest finishes much faster on big dataset like HIGGS, takes about one percent of the time scikit-lean required. +With single machine training, TensorForest finishes much faster on big dataset like HIGGS, takes about one percent of the time scikit-lean required [17](https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxtbHN5c25pcHMyMDE2fGd4OjFlNTRiOWU2OGM2YzA4MjE). ## Design Proposal