glmboost issue with verboseIter and parameters #396

Closed
4 tasks done
pverspeelt opened this issue Mar 26, 2016 · 1 comment

@pverspeelt commented Mar 26, 2016

The code below trains two glmboost models. The only difference between them is the order of the levels of the prune parameter: in the first model the order is "yes", "no"; in the second it is reversed. The issue is that the training log only shows the 3 folds for mstop = 150 with whichever prune value is listed first, and the final tuning selects mstop = 100, again with whichever prune value is listed first. Looking at the prints of both models, it appears that the same tuning parameters were chosen in both cases and that the prune parameter was ignored.

A related point is the prune parameter itself: it looks like it has no effect. If that is the case, shouldn't it be removed as a tuning option?

Minimal, reproducible example:

library(caret)
library(mlbench)
library(mboost)

data(Sonar)

set.seed(25)
trainIndex = createDataPartition(Sonar$Class, p = 0.9, list = FALSE)
training = Sonar[ trainIndex,]
testing  = Sonar[-trainIndex,]

### set training parameters
fitControl = trainControl(method = "cv",
                          number = 3,
                          ## Estimate class probabilities
                          classProbs = TRUE,
                          verboseIter = TRUE,
                          ## Evaluate two-class performance
                          ## (ROC, sensitivity, specificity) using the following function
                          summaryFunction = twoClassSummary)

### train the models

# Use expand.grid to specify the search space
glmBoostGrid1 = expand.grid(mstop = c(50, 100, 150),
                           prune = c("yes", "no"))

set.seed(4242)
glmBoostFit1 = train(Class ~ ., 
                    data = training,
                    method = "glmboost",
                    trControl = fitControl,
                    tuneGrid = glmBoostGrid1,
                    metric = "ROC")

print(glmBoostFit1)


glmBoostGrid2 = expand.grid(mstop = c(50, 100, 150),
                           prune = c("no", "yes"))

set.seed(4242)
glmBoostFit2 = train(Class ~ ., 
                     data = training,
                     method = "glmboost",
                     trControl = fitControl,
                     tuneGrid = glmBoostGrid2,
                     metric = "ROC")


print(glmBoostFit2)

Session Info:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mboost_2.6-0    stabs_0.5-1     mlbench_2.1-1   caret_6.0-64    ggplot2_2.1.0   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3        compiler_3.2.3     nloptr_1.0.4       plyr_1.8.3         iterators_1.0.8    tools_3.2.3        lme4_1.1-11       
 [8] nlme_3.1-126       gtable_0.2.0       mgcv_1.8-12        Matrix_1.2-4       foreach_1.4.3      SparseM_1.7        mvtnorm_1.0-5     
[15] coin_1.1-2         stringr_1.0.0      pROC_1.8           MatrixModels_0.4-1 stats4_3.2.3       grid_3.2.3         nnet_7.3-12       
[22] survival_2.38-3    multcomp_1.4-4     TH.data_1.0-7      minqa_1.2.4        reshape2_1.4.1     car_2.1-1          magrittr_1.5      
[29] scales_0.4.0       codetools_0.2-14   modeltools_0.2-21  MASS_7.3-45        splines_3.2.3      nnls_1.4           pbkrtest_0.4-6    
[36] strucchange_1.5-1  colorspace_1.2-6   quantreg_5.21      quadprog_1.5-5     sandwich_2.3-4     stringi_1.0-1      party_1.0-25      
[43] munsell_0.4.3      zoo_1.7-12        
topepo added a commit that referenced this issue Mar 30, 2016

topepo commented Mar 30, 2016

It isn't alarming that the verbose logging didn't print out every model; we use something called the sub-model trick here. If you want to evaluate models with 10, 50, and 100 boosting iterations, we don't have to fit all three models, only the last one; predictions for the smaller values come from that single fit. This can save a ton of time when tuning the model.
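
For intuition, here is a minimal sketch of that idea using mboost directly (an illustration, not caret's internal code): fit once at the largest mstop and derive predictions for the smaller values from that single fit.

library(mboost)

## Fit once with the largest number of boosting iterations
fit <- glmboost(dist ~ speed, data = cars,
                control = boost_control(mstop = 100))

pred_100 <- predict(fit)      # predictions at mstop = 100
pred_50  <- predict(fit[50])  # restrict the same fit to 50 iterations; no refit

## Note: as discussed below, the subsetting also changes `fit` itself in memory.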

Also:

  • the default tuning grid fixes prune = "no". You have the option to tune it (as you showed), but it defaults to a single value (see the sketch after this list).
  • it is very possible to get exactly the same answers with and without pruning: if the AIC statistic says that the largest number of iterations is optimal (by that metric), then pruning doesn't eliminate any iterations.
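
One way to check this default yourself is to inspect the model's registration in caret (hedged: the exact contents may differ between caret versions):

library(caret)

## Look up the glmboost model definition that caret uses
info <- getModelInfo("glmboost", regex = FALSE)$glmboost
info$parameters   # the tunable parameters: mstop and prune
info$grid         # the function building the default grid, with prune fixed at "no"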

However, there was a bug here. The mboost package does something very atypical for R by changing the object in memory without the object being re-assigned. For example:

> cars.gb <- glmboost(dist ~ speed, data = cars,
+                     control = boost_control(mstop = 2000),
+                     center = FALSE)
> cars.gb

     Generalized Linear Models Fitted via Gradient Boosting

Call:
glmboost.formula(formula = dist ~ speed, data = cars, center = FALSE,     control = boost_control(mstop = 2000))


     Squared Error (Regression) 

Loss function: (y - f)^2 


Number of boosting iterations: mstop = 2000 
Step size:  0.1 
Offset:  42.98 

Coefficients: 
(Intercept)       speed 
 -60.331204    3.918359 
attr(,"offset")
[1] 42.98

> 
> ### initial number of boosting iterations
> mstop(cars.gb)
[1] 2000
> ### look at the model after only 10 iterations:
> cars.gb[10]

     Generalized Linear Models Fitted via Gradient Boosting

Call:
glmboost.formula(formula = dist ~ speed, data = cars, center = FALSE,     control = boost_control(mstop = 2000))


     Squared Error (Regression) 

Loss function: (y - f)^2 


Number of boosting iterations: mstop = 10 
Step size:  0.1 
Offset:  42.98 

Coefficients: 
(Intercept)       speed 
 -0.6546347   0.2338597 
attr(,"offset")
[1] 42.98

> 

> mstop(cars.gb)
[1] 10

This is documented in ?mstop (although I didn't see it originally):

The [.mboost function can be used to enhance or restrict a given boosting model to the specified boosting iteration i. Note that in both cases the original x will be changed to reduce the memory footprint. If the boosting model is enhanced by specifying an index that is larger than the initial stop, only the missing i - stop steps are fitted. If the model is restricted, the spare steps are not dropped, i.e., if we increase i again, these boosting steps are immediately available. Alternatively, the same operation can be done by mstop(x) <- i.
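
To make the consequence concrete, here is a hedged sketch (an illustration of the needed bookkeeping, not the actual caret patch) of looping over several mstop values without permanently shrinking the fit:

library(mboost)

fit <- glmboost(dist ~ speed, data = cars,
                control = boost_control(mstop = 150))
orig_mstop <- mstop(fit)

preds <- list()
for (i in c(50, 100, 150)) {
  mstop(fit) <- i                         # restricts `fit` itself, in place
  preds[[as.character(i)]] <- predict(fit, newdata = cars)
}

mstop(fit) <- orig_mstop   # restore; the spare steps were kept, so this is cheap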

I made changes to this model and to gamboost so that they work properly. You can get the updated model code from the changes referenced above, or wait until the new version is on CRAN (I think in about 2 weeks).

topepo closed this as completed Mar 30, 2016