Added predleaf to xgbTree for predict #318

Status: Closed · 2 commits
Conversation

@terrytangyuan (Contributor)

Is this the correct way to add it?

@topepo (Owner) commented Nov 11, 2015

What was the thought behind adding `predleaf` as an argument? I don't know of any way to pass that through from `train`.

For the random grid search, you can get rid of the `len` multipliers on `colsample_bytree`, `nrounds`, and `gamma`.

Also, I did some testing and I don't get the same answers as the code that doesn't use the sub-model trick. Here is my testing code:

library(caret)
## modelInfo is the xgbTree model code from this PR; modelInfo2 drops
## the loop so every value of nrounds is fit from scratch
modelInfo2 <- modelInfo
modelInfo2$loop <- NULL

###################################################################

small <- expand.grid(max_depth = c(1, 10),
                     nrounds = c(10, 100, 500),
                     eta = .3,
                     gamma = 0,
                     colsample_bytree = .6,
                     min_child_weight = 1)

###################################################################

set.seed(46)
dat <- twoClassSim(200)

## one seed vector per resample, plus one for the final model
seeds <- vector(mode = "list", length = 26)
seeds <- lapply(seeds, function(x) 1:40)

set.seed(1)
mod1 <- train(Class ~ ., data = dat,
              method = modelInfo,
              tuneGrid = small,
              trControl = trainControl(seeds = seeds, 
                                       savePredictions = TRUE,
                                       classProbs = TRUE))

set.seed(1)
mod2 <- train(Class ~ ., data = dat,
              method = modelInfo2,
              tuneGrid = small,
              trControl = trainControl(seeds = seeds, 
                                       savePredictions = TRUE,
                                       classProbs = TRUE))

all.equal(mod1$results$Accuracy, mod2$results$Accuracy)
summary(mod1$results$Accuracy-mod2$results$Accuracy)
mod2$times$everything[3]/mod1$times$everything[3]

I get:

> all.equal(mod1$results$Accuracy, mod2$results$Accuracy)
[1] "Mean relative difference: 0.00694488"
> summary(mod1$results$Accuracy-mod2$results$Accuracy)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.005201 -0.001801  0.000000  0.001264  0.003199  0.010920 

They aren't large differences but they shouldn't be there.

It looks like the results are the same for some tuning parameters, specifically when the number of rounds are small. If I re-run mod1 with the same seed, I get the same results so that isn't the issue.

Update: it turns out I did my checking wrong. I can get the same answers from xgboost:

> library(xgboost)
> 
> data(agaricus.train, package='xgboost')
> data(agaricus.test, package='xgboost')
> train <- agaricus.train
> test <- agaricus.test
> 
> set.seed(1)
> bst1 <- xgboost(data = train$data, label = train$label, max.depth = 2,
+                eta = 1, nthread = 1, nround = 100, objective = "binary:logistic")
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
[5] train-error:0.001228
[6] train-error:0.001228
[7] train-error:0.001228
[8] train-error:0.001228
[9] train-error:0.000000
<snip>
> pred_1_50 <- predict(bst1, test$data, ntreelimit = 50)
> 
> set.seed(1)
> bst2 <- xgboost(data = train$data, label = train$label, max.depth = 2,
+                 eta = 1, nthread = 1, nround = 50, objective = "binary:logistic")
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
<snip>
> pred_2_50 <- predict(bst2, test$data)
> pred_2_50_ntl <- predict(bst2, test$data, ntreelimit = 50)
> 
> all.equal(pred_2_50, pred_1_50)
[1] TRUE
> all.equal(pred_2_50, pred_2_50_ntl)
[1] TRUE
> 
> sessionInfo()
R version 3.2.2 Patched (2015-10-19 r69547)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xgboost_0.4-2

loaded via a namespace (and not attached):
[1] magrittr_1.5     Matrix_1.2-2     tools_3.2.2      grid_3.2.2       data.table_1.9.6
[6] stringr_0.6.2    chron_2.3-45     lattice_0.20-33 

I'll look more at this tomorrow.
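The `ntreelimit` equivalence above is what makes the sub-model trick viable for `xgbTree`: fit once at the largest `nrounds` per setting of the other tuning parameters, then recover the smaller values at prediction time by truncating the ensemble. A minimal sketch following caret's `loop`/`submodels` convention (this is an illustration, not the PR's exact code; the helper name `xgb_loop` is made up here):

```r
## Sketch (not the PR's exact code) of caret's sub-model trick for
## xgboost: keep only the largest nrounds per combination of the other
## parameters, and list the smaller values as "sub-models".
xgb_loop <- function(grid) {
  ## one row per combination of the non-sequential parameters,
  ## keeping only the largest nrounds
  loop <- aggregate(nrounds ~ eta + max_depth + gamma +
                      colsample_bytree + min_child_weight,
                    data = grid, FUN = max)
  ## the smaller nrounds values become sub-models, recovered
  ## from the single full fit at predict time
  submodels <- vector(mode = "list", length = nrow(loop))
  for (i in seq_len(nrow(loop))) {
    index <- which(grid$max_depth == loop$max_depth[i] &
                     grid$eta == loop$eta[i] &
                     grid$gamma == loop$gamma[i] &
                     grid$colsample_bytree == loop$colsample_bytree[i] &
                     grid$min_child_weight == loop$min_child_weight[i])
    trees <- grid[index, "nrounds"]
    submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
  }
  list(loop = loop, submodels = submodels)
}

## At predict time each sub-model is just the full fit truncated:
##   predict(modelFit, newdata, ntreelimit = submodels[[i]]$nrounds[j])
```

With the `small` grid from above, this fits only the two `nrounds = 500` models and fills in `nrounds = 10` and `100` via `ntreelimit`, which is where the speedup comes from.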

@terrytangyuan (Contributor, author)

That's weird, but fortunately it's not a huge difference. I'm not sure how the seeds work inside `trainControl`, though. Specifying `predleaf` returns the predicted indices of the leaves, which would be helpful for some people/problems, e.g. using them as additional features. I think passing a default of `predleaf = FALSE` would not affect the normal usage of xgboost. Let me know what you think.
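For reference, a minimal sketch of what `predleaf` does, using the agaricus data from above (assumes the xgboost package is installed; with `predleaf = TRUE`, `predict` returns a matrix of leaf indices, one column per boosting round, rather than probabilities):

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nthread = 1, nround = 5,
               objective = "binary:logistic")

## predleaf = TRUE returns the index of the leaf each sample falls
## into, one row per test sample and one column per tree -- these
## indices can be one-hot encoded and used as engineered features
leaves <- predict(bst, agaricus.test$data, predleaf = TRUE)
dim(leaves)
```

The point of the argument is just to expose this mode through caret's `predict` path.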

@terrytangyuan (Contributor, author)

Could you re-run the Travis build? I did what you suggested but have no idea why it failed.

Also, this might be related to your question: dmlc/xgboost#310

@terrytangyuan (Contributor, author)

Any progress on this yet? @topepo
