Request for Rborist support. #418

Closed
suiji opened this issue Apr 27, 2016 · 16 comments

Comments

suiji commented Apr 27, 2016

Please add appropriate hooks and documentation for invoking the Rborist package, an implementation of the Random Forest algorithm.

Please note that when iteratively training over a fixed data set, the command PreTrain can benefit performance by caching, rather than recomputing, internal state.
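For illustration, a minimal sketch of the caching pattern described above, assuming (per this thread) that PreTrain takes only the observation set and that its result can be passed to Rborist in place of the data.frame; x and y are placeholder training data:

```r
library(Rborist)

# Rework and cache the internal representation of the predictors once.
pt <- PreTrain(x)

# Reuse the cached object across repeated training runs with different settings.
fit1 <- Rborist(pt, y, nTree = 500)
fit2 <- Rborist(pt, y, nTree = 1000)
```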

topepo commented Apr 27, 2016

Is there any way to determine the total list of predictors used in any split among the trees?

Also, I'm not sure that PreTrain helps very much for train. We can exploit cases where the same object can be used to generate predictions from different configurations. It looks like PreTrain would require model training (albeit faster model training) across values of the tuning parameter.

suiji commented Apr 27, 2016

The 'predInfo' vector reports the Gini gain across all trees, so those entries with values greater than zero indicate the set of predictors participating in at least one split on at least one tree.
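For illustration, assuming the fitted object exposes predInfo as a named numeric vector (the slot name below is an assumption), the participating predictors could be recovered as:

```r
# `fit` is an Rborist model; `fit$predInfo` is assumed to hold the per-predictor
# Gini gain summed across all trees.
used_predictors <- names(fit$predInfo)[fit$predInfo > 0]
```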

I may be missing your point, but PreTrain is invoked independently of the tuning parameters involved in actual training. So, once built, the same PreTrain object can be passed to iterative or parallel invocations of the training method with any desired collection of tuning parameters. If I understand your point, then, yes: PreTrain currently offers only a minor speed advantage. If we are able to attain some of the speedups we believe we can get from the upcoming GPU implementation, though, then the sliver of time spent during pretraining will grow to a much greater proportion of overall execution time. If and when that happens, the ability to cache pre-trained state should be a performance win for iterative training.


topepo added a commit that referenced this issue Apr 28, 2016
topepo commented Apr 28, 2016

I've checked in a draft of the method. Please take a look.

Right now, train can do a workflow that exploits cases where a trained model can make predictions for sub-models without re-training.

For example, if I create a boosted tree with B = 100 iterations, I can usually get predictions from that same model object for sub-models where B <= 100 (all other parameters being equal).
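As a hedged illustration of that workflow with a package that already supports it (gbm is used here only as an example; dat is a placeholder data frame with a binary outcome y):

```r
library(gbm)

# Fit once with the largest number of iterations of interest.
fit <- gbm(y ~ ., data = dat, distribution = "bernoulli", n.trees = 100)

# Predictions for sub-models (fewer iterations) come from the same fitted
# object, with no re-training.
p50 <- predict(fit, newdata = dat, n.trees = 50, type = "response")
p75 <- predict(fit, newdata = dat, n.trees = 75, type = "response")
```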

It looks like Rborist would need to re-train the model to do that. With the current framework in train, this would have to occur whenever you make predictions (and we probably don't want to do that).

@LluisRamon (Contributor)

Correct me if I am wrong, but Rborist now accepts class weights. I think that commit 4f6a245 doesn't take this into account.

Thank you both for your great packages!

topepo commented Apr 28, 2016

It doesn't, but it wouldn't need to. Since classWeight isn't an argument to train, it would pass through to Rborist via the ... argument.
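For example, a call along these lines should forward the weights (x, y, and the weight values are placeholders, and the exact form classWeight expects is an assumption):

```r
library(caret)

fit <- train(x, y,
             method = "Rborist",
             classWeight = c(0.3, 0.7),   # not a tuning parameter; passed on via ...
             trControl = trainControl(method = "cv", number = 5))
```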

@LluisRamon (Contributor)

Great, thank you for the clarification.

I thought it was necessary as in ranger #414.

suiji commented Apr 28, 2016

Thank you for checking in the draft. I have not looked at it yet, but wanted to respond to your point.

Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels. How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

On a separate topic, the PreTrain object is just meant as a convenience, and can be passed either as a substitute for the data.frame or as an accompaniment. In the latter case, the Arborist should silently ignore the data.frame and prefer the PreTrain. Hence the Caret user who wants to benefit from PreTrain need not do anything more than pass an additional argument. It should otherwise be transparent to Caret.


topepo commented Apr 28, 2016

> I thought it was necessary as in ranger #414.

Issue #414 was about case weights, which would need some modification. If you pass in the case weights for the entire data set, the model is fit with a resampled version (and this is an issue).

topepo commented Apr 28, 2016

> Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels.

I'm not sure that I understand this. I'm saying that, when there is a trained model object where we can get predictions from different values of the tuning parameter (e.g. predProb), we can get some time savings using the "sub-model trick".

> How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

No, I don't think that it generally has anything to do with that.

The bottom line is that I don't think that train can exploit the PreTrain option since the model has not been trained yet. Once it is trained, the value of predProb is inherent in the model object.

suiji commented Apr 28, 2016

On 04/28/2016 12:19 PM, topepo wrote:

>> Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels.

> I'm not sure that I understand this. I'm saying that, when there is a trained model object where we can get predictions from different values of the tuning parameter (e.g. predProb), we can get some time savings using the "sub-model trick" (http://topepo.github.io/caret/custom_models.html#Illustration2).

Agreed. You were speaking about training, but I had misconstrued your remarks to be about an obscure property of separate testing. In a nutshell, I had read "submodel" but had mapped it to "subdesign".

>> How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

> No, I don't think that it generally has anything to do with that.

Agreed again: further down the wrong path.

> The bottom line is that I don't think that train can exploit the PreTrain option since the model has not been trained yet. Once it is trained, the value of predProb is inherent in the model object.

Agree with the second point, but not the first:

"PreTrain" is proving to be an unfortunate choice of terms. It should probably have been "PreFormat" or "PreSort". A pretrained object is simply a reworking of the data.frame into a format friendlier to the Arborist's internal representation. Multiple models can be trained from the same "PreTrain" object, without each instance having to generate the same representation each time. All that Caret should need to do is pass it down as an uninteresting extra parameter to the Arborist.


suiji commented Apr 28, 2016

Oh, dear. The points to which I was responding do not show up in the most recent reply, despite their being present in the "Sent" image. Let's try again, with some editing:

>> Yes, I think I see the problem: prediction requires the same predictor signature as was used to train. In fact, an error is reported if the signatures do not match. You want to loosen this restriction to accept submodels.

> I'm not sure that I understand this. I'm saying that, when there is a trained model object where we can get predictions from different values of the tuning parameter (e.g. predProb), we can get some time savings using the "sub-model trick".

Agreed. You were speaking about training, but I had misconstrued your remarks to be about an obscure feature of separate testing. In a nutshell, I had read "submodel" but had mapped it to "subdesign" by mistake.

>> How general can the submodels be? For example, are you only interested in excluding predictors not referenced in the trained forest?

> No, I don't think that it generally has anything to do with that.

Agreed again: further down the wrong path.

> The bottom line is that I don't think that train can exploit the PreTrain option since the model has not been trained yet. Once it is trained, the value of predProb is inherent in the model object.

Agree with the second point, but not the first:

"PreTrain" is proving to be an unfortunate choice of terms. It should probably have been "PreFormat" or "PreSort". A "pretrained" object is simply a reworking of the data.frame into a format friendlier to the Arborist's internal representation. Multiple models can be trained from the same "PreTrain" object, without each instance having to generate the same internal representation each time. In fact, invocations of the "PreTrain" command accept only a single argument, the observation set. All that Caret need do, assuming the user has independently generated a PreTrain object, is silently pass the object down to the Arborist as an uninteresting optional parameter.

In fact, one could even train subdesigns using the "submodel trick" by setting selected predictor weights to zero.
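For illustration, that idea might look like the following, assuming Rborist exposes a per-predictor weight vector (the argument name predWeight is an assumption):

```r
# Zero-weighted predictors are never sampled as split candidates.
wt <- rep(1, ncol(x))
wt[c(3, 7)] <- 0

fit <- Rborist(x, y, predWeight = wt)
```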

topepo commented Apr 29, 2016

That reminds me... what happens with predProb = 0?

suiji commented Apr 29, 2016

> That reminds me... what happens with predProb = 0?

It should default to 'predFixed', using a derived quantity.

topepo commented Apr 29, 2016

Is predProb just a scaled version of predFixed then?

suiji commented Apr 29, 2016

> Is predProb just a scaled version of predFixed then?

No:
'predFixed' is equivalent to 'mtry': sampling without replacement.
'predProb' specifies the mean parameter for Bernoulli sampling.

'predFixed' is the default for low predictor counts, using a value similar to that used by randomForest for 'mtry'. 'predProb' is the default for higher predictor counts.
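For illustration, the two modes would be selected roughly as follows (x and y are placeholders; the values shown are arbitrary):

```r
# Fixed-count sampling, analogous to randomForest's mtry.
fit_fixed <- Rborist(x, y, predFixed = floor(sqrt(ncol(x))))

# Bernoulli sampling: each predictor is tried for a split with probability predProb.
fit_prob <- Rborist(x, y, predProb = 0.25)
```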

topepo commented Apr 30, 2016

A final version and test documents are now checked in.
