
potential bug in pd inner #5

Closed
flinder opened this issue Sep 11, 2014 · 2 comments
flinder commented Sep 11, 2014

I'm not 100% sure, but I think in
pred <- predict(fit, newdata = df, outcome = "test")$predicted.oob

the outcome argument should be 'train'. The relevant part of the documentation is a bit cryptic:

"If outcome="test", the predictor is calculated by using y-outcomes from the test data (outcome information must be present). In this case, the terminal nodes from the grow-forest are recalculated using the y-outcomes from the test set. This yields a modified predictor in which the topology of the forest is based solely on the training data, but where the predicted value is based on the test data. Error rates and VIMP are calculated by bootstrapping the test data and using out-of-bagging to ensure unbiased estimates. See the examples for illustration"

But I don't think we want to recalculate the terminal nodes of the original forest; we just want to drop the test data down the forest and get a prediction. I came to this because I got different and counterintuitive results with outcome = 'test', but matching and intuitive results with outcome = 'train'.

I changed it in the package code to do some tests.

zmjones commented Sep 11, 2014

Yeah, I have thought about it a bit now and I think you are right that we should not have outcome = "test". However, it is unclear to me from the docs whether predict(fit, newdata = df, outcome = "train") is the same as predict(fit, newdata = df). For one, predicted.oob is not available in the latter option, which suggests to me that they are dropping the unmodified training data down the tree. Does the pd actually work in this case? It should be easy to detect, since you wouldn't be setting the data frame to its unique values.
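
A quick way to compare the three calls directly. This is only a sketch against the randomForestSRC API (rfsrc / predict.rfsrc and its outcome argument); the mtcars regression is a stand-in, not the package's own data:

```r
# Sketch: compare outcome = "train", outcome = "test", and the default call,
# assuming the randomForestSRC package. mtcars/mpg is an illustrative stand-in.
library(randomForestSRC)

fit <- rfsrc(mpg ~ ., data = mtcars)

# outcome = "train": terminal-node values come from the training y's;
# newdata is just dropped down the grown forest.
p_train <- predict(fit, newdata = mtcars, outcome = "train")

# outcome = "test": terminal nodes are *recalculated* from the y's in newdata
# (so newdata must contain the outcome column).
p_test <- predict(fit, newdata = mtcars, outcome = "test")

# Default: no outcome argument; note whether predicted.oob is present here.
p_default <- predict(fit, newdata = mtcars)

# If the suspicion above is right, these will generally differ:
head(p_train$predicted.oob)
head(p_test$predicted.oob)
```

If the pd code is feeding a data frame set to unique grid values, outcome = "test" would recompute node estimates from that artificial data, which would explain the counterintuitive results.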

zmjones commented Sep 12, 2014

wow my commits got super messed up there for a minute.
