Added predleaf to xgbTree for predict #318

Status: Closed · 2 commits
Conversation

@terrytangyuan (Contributor)

Is this the correct way to add it?

@topepo (Owner) commented Nov 11, 2015

What was the thought behind adding `predleaf` as an argument? I don't know of any way to pass that through from `train`.

For the random grid search, you can get rid of the `len` multipliers on `colsample_bytree`, `nrounds`, and `gamma`.

Also, I did some testing and I don't get the same answers as the code that doesn't use the sub-model trick. Here is my testing code:

library(caret)
## modelInfo is the xgbTree model code from this PR; modelInfo2 drops
## the loop so every value of nrounds is fit from scratch
modelInfo2 <- modelInfo
modelInfo2$loop <- NULL

###################################################################

small <- expand.grid(max_depth = c(1, 10),
                     nrounds = c(10, 100, 500),
                     eta = .3,
                     gamma = 0,
                     colsample_bytree = .6,
                     min_child_weight = 1)

###################################################################

set.seed(46)
dat <- twoClassSim(200)

## one seed vector per resample, plus one for the final model
seeds <- vector(mode = "list", length = 26)
seeds <- lapply(seeds, function(x) 1:40)

set.seed(1)
mod1 <- train(Class ~ ., data = dat,
              method = modelInfo,
              tuneGrid = small,
              trControl = trainControl(seeds = seeds, 
                                       savePredictions = TRUE,
                                       classProbs = TRUE))

set.seed(1)
mod2 <- train(Class ~ ., data = dat,
              method = modelInfo2,
              tuneGrid = small,
              trControl = trainControl(seeds = seeds, 
                                       savePredictions = TRUE,
                                       classProbs = TRUE))

all.equal(mod1$results$Accuracy, mod2$results$Accuracy)
summary(mod1$results$Accuracy-mod2$results$Accuracy)
mod2$times$everything[3]/mod1$times$everything[3]

I get:

> all.equal(mod1$results$Accuracy, mod2$results$Accuracy)
[1] "Mean relative difference: 0.00694488"
> summary(mod1$results$Accuracy-mod2$results$Accuracy)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.005201 -0.001801  0.000000  0.001264  0.003199  0.010920 

They aren't large differences but they shouldn't be there.

It looks like the results are the same for some tuning parameters, specifically when the number of rounds are small. If I re-run mod1 with the same seed, I get the same results so that isn't the issue.

Update: it turns out I did my checking wrong. I can get the same answers from xgboost:

> library(xgboost)
> 
> data(agaricus.train, package='xgboost')
> data(agaricus.test, package='xgboost')
> train <- agaricus.train
> test <- agaricus.test
> 
> set.seed(1)
> bst1 <- xgboost(data = train$data, label = train$label, max.depth = 2,
+                eta = 1, nthread = 1, nround = 100, objective = "binary:logistic")
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
[5] train-error:0.001228
[6] train-error:0.001228
[7] train-error:0.001228
[8] train-error:0.001228
[9] train-error:0.000000
<snip>
> pred_1_50 <- predict(bst1, test$data, ntreelimit = 50)
> 
> set.seed(1)
> bst2 <- xgboost(data = train$data, label = train$label, max.depth = 2,
+                 eta = 1, nthread = 1, nround = 50, objective = "binary:logistic")
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
<snip>
> pred_2_50 <- predict(bst2, test$data)
> pred_2_50_ntl <- predict(bst2, test$data, ntreelimit = 50)
> 
> all.equal(pred_2_50, pred_1_50)
[1] TRUE
> all.equal(pred_2_50, pred_2_50_ntl)
[1] TRUE
> 
> sessionInfo()
R version 3.2.2 Patched (2015-10-19 r69547)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xgboost_0.4-2

loaded via a namespace (and not attached):
[1] magrittr_1.5     Matrix_1.2-2     tools_3.2.2      grid_3.2.2       data.table_1.9.6
[6] stringr_0.6.2    chron_2.3-45     lattice_0.20-33 

I'll look more at this tomorrow.
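The `ntreelimit` equivalence above is what makes the sub-model trick viable for `xgbTree`: fit once at the largest `nrounds` per setting of the other tuning parameters, then recover the smaller values at prediction time by truncating the ensemble. A minimal sketch following caret's `loop`/`submodels` convention (this is an illustration, not the PR's exact code; the helper name `xgb_loop` is made up here):

```r
## Sketch (not the PR's exact code) of caret's sub-model trick for
## xgboost: keep only the largest nrounds per combination of the other
## parameters, and list the smaller values as "sub-models".
xgb_loop <- function(grid) {
  ## one row per combination of the non-sequential parameters,
  ## keeping only the largest nrounds
  loop <- aggregate(nrounds ~ eta + max_depth + gamma +
                      colsample_bytree + min_child_weight,
                    data = grid, FUN = max)
  ## the smaller nrounds values become sub-models, recovered
  ## from the single full fit at predict time
  submodels <- vector(mode = "list", length = nrow(loop))
  for (i in seq_len(nrow(loop))) {
    index <- which(grid$max_depth == loop$max_depth[i] &
                     grid$eta == loop$eta[i] &
                     grid$gamma == loop$gamma[i] &
                     grid$colsample_bytree == loop$colsample_bytree[i] &
                     grid$min_child_weight == loop$min_child_weight[i])
    trees <- grid[index, "nrounds"]
    submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
  }
  list(loop = loop, submodels = submodels)
}

## At predict time each sub-model is just the full fit truncated:
##   predict(modelFit, newdata, ntreelimit = submodels[[i]]$nrounds[j])
```

With the `small` grid from above, this fits only the two `nrounds = 500` models and fills in `nrounds = 10` and `100` via `ntreelimit`, which is where the speedup comes from.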

@terrytangyuan (Contributor, author)

That's weird, but fortunately it's not a huge difference. I'm not sure how the seeds work inside `trainControl`, though. Specifying `predleaf` returns the predicted indices of the leaves, which would be helpful for some people/problems, e.g. using them as additional features. I think passing a default of `predleaf = FALSE` would not affect the normal usage of xgboost. Let me know what you think.
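For reference, a minimal sketch of what `predleaf` does, using the agaricus data from above (assumes the xgboost package is installed; with `predleaf = TRUE`, `predict` returns a matrix of leaf indices, one column per boosting round, rather than probabilities):

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nthread = 1, nround = 5,
               objective = "binary:logistic")

## predleaf = TRUE returns the index of the leaf each sample falls
## into, one row per test sample and one column per tree -- these
## indices can be one-hot encoded and used as engineered features
leaves <- predict(bst, agaricus.test$data, predleaf = TRUE)
dim(leaves)
```

The point of the argument is just to expose this mode through caret's `predict` path.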

@terrytangyuan (Contributor, author)

Could you re-run the Travis build? I did what you suggested but have no idea why it failed.

Also, this might be related to your question: dmlc/xgboost#310

@terrytangyuan (Contributor, author)

Any progress on this yet? @topepo
