Request for Rborist support. #418

Closed
suiji opened this issue Apr 27, 2016 · 16 comments

Comments

suiji commented Apr 27, 2016

Please add appropriate hooks and documentation for invoking the Rborist package, an implementation of the Random Forest algorithm.

Please note that when iteratively training over a fixed data set, the command PreTrain can benefit performance by caching, rather than recomputing, internal state.
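For illustration, a minimal sketch of the caching pattern described above, assuming (per this thread) that PreTrain takes only the observation set and that its result can be passed to Rborist in place of the data.frame; x and y are placeholder training data:

```r
library(Rborist)

# Rework and cache the internal representation of the predictors once.
pt <- PreTrain(x)

# Reuse the cached object across repeated training runs with different settings.
fit1 <- Rborist(pt, y, nTree = 500)
fit2 <- Rborist(pt, y, nTree = 1000)
```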

topepo commented Apr 27, 2016

Is there any way to determine the total list of predictors used in any split among the trees?

Also, I'm not sure that PreTrain helps very much for train. We can exploit cases where the same object can be used to generate predictions from different configurations. It looks like PreTrain would require model training (albeit faster model training) across values of the tuning parameter.

suiji commented Apr 27, 2016

The 'predInfo' vector reports the Gini gain across all trees, so those entries with values greater than zero indicate the set of predictors participating in at least one split on at least one tree.
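For illustration, assuming the fitted object exposes predInfo as a named numeric vector (the slot name below is an assumption), the participating predictors could be recovered as:

```r
# `fit` is an Rborist model; `fit$predInfo` is assumed to hold the per-predictor
# Gini gain summed across all trees.
used_predictors <- names(fit$predInfo)[fit$predInfo > 0]
```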

I may be missing your point, but PreTrain is invoked independently of the tuning parameters involved in actual training. So, once built, the same PreTrain object can be passed to iterative or parallel invocations of the training method with any desired collection of tuning parameters. If I understand your point, then, yes: PreTrain currently offers only a minor speed advantage. If we are able to attain some of the speedups we believe we can get from the upcoming GPU implementation, though, then the sliver of time spent during pretraining will grow to a much greater proportion of overall execution time. If and when that happens, the ability to cache pre-trained state should be a performance win for iterative training.


topepo added a commit that referenced this issue Apr 28, 2016
topepo commented Apr 28, 2016

I've checked in a draft of the method. Please take a look.

Right now, train can do a workflow that exploits cases where a trained model can make predictions for sub-models without re-training.

For example, if I create a boosted tree with B = 100 iterations, I can usually get predictions from that same model object for sub-models where B <= 100 (all other parameters being equal).
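As a hedged illustration of that workflow with a package that already supports it (gbm is used here only as an example; dat is a placeholder data frame with a binary outcome y):

```r
library(gbm)

# Fit once with the largest number of iterations of interest.
fit <- gbm(y ~ ., data = dat, distribution = "bernoulli", n.trees = 100)

# Predictions for sub-models (fewer iterations) come from the same fitted
# object, with no re-training.
p50 <- predict(fit, newdata = dat, n.trees = 50, type = "response")
p75 <- predict(fit, newdata = dat, n.trees = 75, type = "response")
```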

It looks like Rborist would need to re-train the model to do that. With the current framework in train, this would have to occur whenever you make predictions (and we probably don't want to do that).

@LluisRamon (Contributor)

Correct me if I am wrong, but Rborist now accepts class weights. I think that commit 4f6a245 doesn't take this into account.

Thank you both for your great packages!

topepo commented Apr 28, 2016

It doesn't, but it wouldn't need to. Since classWeight isn't an argument to train, it would pass through to Rborist via the ... argument.
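For example, a call along these lines should forward the weights (x, y, and the weight values are placeholders, and the exact form classWeight expects is an assumption):

```r
library(caret)

fit <- train(x, y,
             method = "Rborist",
             classWeight = c(0.3, 0.7),   # not a tuning parameter; passed on via ...
             trControl = trainControl(method = "cv", number = 5))
```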

@LluisRamon (Contributor)

Great, thank you for the clarification.

I thought it was necessary as in ranger #414.

suiji commented Apr 28, 2016

Thank you for checking in the draft. I have not looked at it yet, but wanted to respond to your point.

Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels. How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

On a separate topic, the PreTrain object is just meant as a convenience, and can be passed either as a substitute for the data.frame or as an accompaniment. In the latter case, the Arborist should silently ignore the data.frame and prefer the PreTrain. Hence the Caret user who wants to benefit from PreTrain need not do anything more than pass an additional argument. It should otherwise be transparent to Caret.


topepo commented Apr 28, 2016

> I thought it was necessary as in ranger #414.

Issue #414 was about case weights, which would need some modification. If you pass in the case weights for the entire data set, the model is fit with a resampled version (and this is an issue).

topepo commented Apr 28, 2016

> Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels.

I'm not sure that I understand this. I'm saying that, when there is a trained model object where we can get predictions from different values of the tuning parameter (e.g. predProb), we can get some time savings using the "sub-model trick".

> How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

No, I don't think that it generally has anything to do with that.

The bottom line is that I don't think that train can exploit the PreTrain option since the model has not been trained yet. Once it is trained, the value of predProb is inherent in the model object.

suiji commented Apr 28, 2016

On 04/28/2016 12:19 PM, topepo wrote:

>> Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels.

> I'm not sure that I understand this. I'm saying that, when there is a trained model object where we can get predictions from different values of the tuning parameter (e.g. predProb), we can get some time savings using the "sub-model trick" (http://topepo.github.io/caret/custom_models.html#Illustration2).

Agreed. You were speaking about training, but I had misconstrued your remarks to be about an obscure property of separate testing. In a nutshell, I had read "submodel" but had mapped it to "subdesign".

>> How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

> No, I don't think that it generally has anything to do with that.

Agreed again: further down the wrong path.

> The bottom line is that I don't think that train can exploit the PreTrain option since the model has not been trained yet. Once it is trained, the value of predProb is inherent in the model object.

Agree with the second point, but not the first:

"PreTrain" is proving to be an unfortunate choice of terms. It should probably have been "PreFormat" or "PreSort". A pretrained object is simply a reworking of the data.frame into a format friendlier to the Arborist's internal representation. Multiple models can be trained from the same "PreTrain" object, without each instance having to generate the same representation each time. All that Caret should need to do is pass it down as an uninteresting extra parameter to the Arborist.


suiji commented Apr 28, 2016

Oh, dear. The points to which I was responding do not show up in the most recent reply, despite their being present in the "Sent" image. Let's try again, with some editing:

>> Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels.

> I'm not sure that I understand this. I'm saying that, when there is a trained model object where we can get predictions from different values of the tuning parameter (e.g. predProb), we can get some time savings using the "sub-model trick".

Agreed. You were speaking about training, but I had misconstrued your remarks to be about an obscure feature of separate testing. In a nutshell, I had read "submodel" but had mapped it to "subdesign" by mistake.

>> How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

> No, I don't think that it generally has anything to do with that.

Agreed again: further down the wrong path.

> The bottom line is that I don't think that train can exploit the PreTrain option since the model has not been trained yet. Once it is trained, the value of predProb is inherent in the model object.

Agree with the second point, but not the first:

"PreTrain" is proving to be an unfortunate choice of terms. It should probably have been "PreFormat" or "PreSort". A "pretrained" object is simply a reworking of the data.frame into a format friendlier to the Arborist's internal representation. Multiple models can be trained from the same "PreTrain" object, without each instance having to generate the same internal representation each time. In fact, invocations of the "PreTrain" command accept only a single argument, the observation set. All that Caret need do, assuming the user has independently generated a PreTrain object, is silently pass the object down to the Arborist as an uninteresting optional parameter.

In fact, one could even train subdesigns using the "submodel trick" by setting selected predictor weights to zero.
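For illustration, that idea might look like the following, assuming Rborist exposes a per-predictor weight vector (the argument name predWeight is an assumption):

```r
# Zero-weighted predictors are never sampled as split candidates.
wt <- rep(1, ncol(x))
wt[c(3, 7)] <- 0

fit <- Rborist(x, y, predWeight = wt)
```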

topepo commented Apr 29, 2016

That reminds me... what happens with predProb = 0?

suiji commented Apr 29, 2016

> That reminds me... what happens with predProb = 0?

It should default to 'predFixed', using a derived quantity.

topepo commented Apr 29, 2016

Is predProb just a scaled version of predFixed then?

suiji commented Apr 29, 2016

> Is predProb just a scaled version of predFixed then?

No:
'predFixed' is equivalent to 'mtry': sampling without replacement.
'predProb' specifies the mean parameter for Bernoulli sampling.

'predFixed' is the default for low predictor counts, using a value similar to that used by randomForest for 'mtry'. 'predProb' is the default for higher predictor counts.
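For illustration, the two modes would be selected roughly as follows (x and y are placeholders; the values shown are arbitrary):

```r
# Fixed-count sampling, analogous to randomForest's mtry.
fit_fixed <- Rborist(x, y, predFixed = floor(sqrt(ncol(x))))

# Bernoulli sampling: each predictor is tried for a split with probability predProb.
fit_prob <- Rborist(x, y, predProb = 0.25)
```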

topepo commented Apr 30, 2016

A final version and test documents are now checked in.
