Wrong OOB estimates with cforest method #351

Closed
asardaes opened this issue Jan 13, 2016 · 5 comments

@asardaes
Contributor

This is a tricky one, and it took me a while to figure it out.

When using "oob" as the training method, the default function to obtain Accuracy and Kappa is the following:

obs <- x@data@get("response")[,1]
pred <- predict(x, x@data@get("input"), OOB = TRUE)
postResample(pred, obs)

The second argument to predict is the newdata parameter. This should be fine, since x@data@get("input") just loads the input data. However, looking at the source code of the party package, I realized that the call to the actual prediction function (which is a C function) uses the boolean expression OOB && is.null(newdata) to determine which output to give. Therefore, the only way to get the true OOB estimate is to call the predict generic with newdata = NULL, which is the default when the function is called as predict(x, OOB = TRUE).
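In other words, the fix is simply to drop the second argument so that newdata stays NULL. A minimal sketch of the corrected module function, based on the snippet above:

obs <- x@data@get("response")[,1]
# newdata is left as NULL, so party returns the true OOB predictions
pred <- predict(x, OOB = TRUE)
postResample(pred, obs)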

I realized I was getting OOB Accuracy estimates that were much larger than the one reported for finalModel, and this is the reason.


Proof:

library(party)

data(mtcars)

cforest_fit <- cforest(mpg ~ ., data = mtcars, controls = cforest_unbiased(mtry = 0))

# True OOB prediction
pred1 <- predict(cforest_fit, OOB = TRUE)

# This should be equal to 'pred1'...
pred2 <- predict(cforest_fit, newdata = cforest_fit@data@get("input"), OOB = TRUE)

# Prediction without using OOB
pred3 <- predict(cforest_fit, OOB = FALSE)

# This is FALSE
identical(pred1, pred2)

# This is actually TRUE, i.e. 'pred2' effectively ignores OOB = TRUE
identical(pred2, pred3)
@Sandy4321

So we have all been using it incorrectly?


@topepo
Owner

topepo commented Jan 13, 2016

Wow, nice catch.

I'll remove the second argument. In the meantime, you can just redefine the oob module to get the correct results.
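For anyone who needs this before the fix is released, a minimal sketch of that workaround, assuming caret's getModelInfo() and its custom-method interface (the cf_mod name is just illustrative):

library(caret)

# Fetch caret's built-in cforest module and override its oob function
cf_mod <- getModelInfo("cforest", regex = FALSE)[[1]]
cf_mod$oob <- function(x) {
  obs <- x@data@get("response")[, 1]
  # No newdata argument, so party returns the true OOB predictions
  pred <- predict(x, OOB = TRUE)
  postResample(pred, obs)
}

# The modified list can then be passed as the method, e.g.
# train(mpg ~ ., data = mtcars, method = cf_mod,
#       trControl = trainControl(method = "oob"))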

topepo added a commit that referenced this issue Jan 13, 2016
@topepo
Owner

topepo commented Jan 13, 2016

Please test when you have the time.

@asardaes
Contributor Author

Yes, that fixes it for me, at least.

@asardaes
Contributor Author

I should also point out that the predict generic for randomForest behaves similarly.

I see you've taken that into account during training, but be aware that if you use train with rf, predict(train_rf) != predict(train_rf$finalModel), which may or may not be desired.
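To illustrate (a sketch, assuming the randomForest package; the object names here are just for illustration):

library(randomForest)

data(mtcars)

set.seed(1)
rf_fit <- randomForest(mpg ~ ., data = mtcars)

# With no newdata, predict.randomForest returns the OOB predictions
oob_pred <- predict(rf_fit)

# With the training data passed as newdata, it predicts as usual (in-bag)
inbag_pred <- predict(rf_fit, newdata = mtcars)

# These generally differ
identical(oob_pred, inbag_pred)

Since predict.train passes the stored training data along as newdata when none is supplied, this is presumably why predict(train_rf) can differ from predict(train_rf$finalModel), which returns OOB predictions.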
