typo in caret rpart varImp, NA and varImp != preProcess with "zv,nzv,corr" #1087

Open
laz8 opened this issue Oct 31, 2019 · 2 comments
@laz8 laz8 commented Oct 31, 2019

Hi guys ;)...


1. The easy stuff first, a typo in rpart varImp()

View(caret:::getModelInfo("rpart", FALSE)[[1]]$varImp)

WRONG in LINE 37 : out <- data.frame(x = numeric(), Vaiable = character())

CORRECT : out <- data.frame(x = numeric(), Variable = character())


2. using "corr" in preProcess can remove features (inputs)

With varImp(fit,useModel=FALSE), or if the model has no varImp() of its own, caret uses filterVarImp()

View(caret:::varImp.train)

Example:

data(iris)

dat <- iris[sample(1:NROW(iris)),]

ctr <- trainControl(method="repeatedcv",number=2,repeats=2)

fit <- train(Species~.,data=dat,method="lvq",trControl=ctr,preProcess=c("corr"))

fit$preProcess

fit$preProcess$method

Created from 150 samples and 1 variables << wrong: it reports "created from 1 variable"

Pre-processing:

  • ignored (0)
  • removed (1)

$ignore
character(0)

$remove
[1] "Petal.Length"

vim <- varImp(fit,useModel=F) # lvq uses filterVarImp()
vim

ROC curve variable importance

variables are sorted by maximum importance across the classes
                setosa versicolor virginica
Petal.Length 100.00000  100.00000 100.00000
Petal.Width  100.00000  100.00000 100.00000
Sepal.Length  90.70048   59.29952  90.70048
Sepal.Width   54.58937   54.58937   0.00000

4 rows, including the removed "Petal.Length" which shows 100%...?

filterVarImp() ignores the fact that inputs were removed and still reports (wrong/high) importance values for them. That's confusing, and the question is what to do: should the correlated input be removed even if it has a high importance?
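One possible workaround is to cross-check the importance table against the list of removed predictors after the fact. The sketch below uses only base R and rebuilds the table from the output printed above; `flag_removed()` is a hypothetical helper, not part of caret. With a real fit, the table would presumably come from `vim$importance` and the removed names from `fit$preProcess$method$remove`.

```r
# Importance table as printed above (values copied from the issue output).
imp <- data.frame(
  setosa     = c(100, 100, 90.70048, 54.58937),
  versicolor = c(100, 100, 59.29952, 54.58937),
  virginica  = c(100, 100, 90.70048, 0),
  row.names  = c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")
)

# Predictors dropped by preProcess; with a real fit this would be
# fit$preProcess$method$remove.
removed <- c("Petal.Length")

# Hypothetical helper: flag rows that belong to predictors the model never saw.
flag_removed <- function(imp, removed) {
  imp$removed <- rownames(imp) %in% removed
  imp
}

flagged <- flag_removed(imp, removed)
print(flagged)

# Keep only the predictors that actually reached the model.
kept <- flagged[!flagged$removed, setdiff(names(flagged), "removed"), drop = FALSE]
print(kept)
```

Whether a removed predictor should be hidden, flagged, or zeroed is a design decision; the flag keeps the information visible without pretending the model used the variable.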


3. NA in regression varImp() importance from "zv,nzv"

Let's do a regression on iris data...

dat <- iris[sample(1:NROW(iris)),]
	
dat$Sepal.Length <- 0.01 # create a zv, nzv var
	
dat$Species      <- as.integer(dat$Species) # do regression on target
	
ctr <- trainControl(method="repeatedcv",number=2,repeats=2)
	
fit <- train(Species~.,data=dat,method="rpart",trControl=ctr,preProcess=c("zv","nzv"))

Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.

Why is this? It does not depend on preProcess... It comes from the metrics used. I use my own metrics, but sorry, I don't remember what the problem was...

fit$preProcess
	
fit$preProcess$method

Created from 150 samples and 1 variables # also wrong, created from 1?

Pre-processing:

  • ignored (0)
  • removed (1)

$ignore
character(0)

$remove
[1] "Sepal.Length"

vim <- varImp(fit,useModel=F)
vim

Warning in FUN(newX[, i], ...) :
no non-missing arguments to max; returning -Inf
              Overall
Petal.Width  100.0000
Petal.Length  99.9709
Sepal.Width    0.0000
Sepal.Length       NA <<< setting this to 0.0 would remove the warning
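A plausible source of the NA is that a constant predictor carries no information for any per-predictor statistic. The base R sketch below only illustrates that effect with a simple linear fit; it is not a copy of what filterVarImp() does internally, and `na_to_zero()` is a hypothetical helper showing the "set it to 0.0" idea.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

y       <- rnorm(20)
x_const <- rep(0.01, 20)            # same constant as dat$Sepal.Length above
x_ok    <- y + rnorm(20, sd = 0.1)  # an informative predictor for comparison

# The slope of a constant predictor cannot be estimated: lm() reports NA.
slope_const <- unname(coef(lm(y ~ x_const))["x_const"])
slope_ok    <- unname(coef(lm(y ~ x_ok))["x_ok"])

print(slope_const)
print(slope_ok)

# Hypothetical post-processing: treat an undefined score as zero importance.
na_to_zero <- function(v) {
  v[is.na(v)] <- 0
  v
}

scores <- c(x_ok = abs(slope_ok), x_const = abs(slope_const))
print(na_to_zero(scores))
```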


3.1 classification works

dat <- iris[sample(1:NROW(iris)),]

dat$Sepal.Length <- 0.01 # create a zv, nzv var

ctr <- trainControl(method="repeatedcv",number=2,repeats=2)

fit <- train(Species~.,data=dat,method="rpart",trControl=ctr,preProcess=c("zv","nzv"))
	
fit$preProcess
	
fit$preProcess$method

vim <- varImp(fit,useModel=T)
vim

rpart variable importance
             Overall
Petal.Width   100.00
Petal.Length   97.57
Sepal.Width     0.00
Sepal.Length         <-- missing

vim <- varImp(fit,useModel=F)
vim

ROC curve variable importance
variables are sorted by maximum importance across the classes
             setosa versicolor virginica
Petal.Length 100.00     100.00    100.00
Petal.Width  100.00     100.00    100.00
Sepal.Width   84.96      84.96     66.88
Sepal.Length   0.00       0.00      0.00

So sometimes removed features are set to 0.0, sometimes to NA, and sometimes they are dropped from the importance table entirely. I think it would be better to ALWAYS show all features and to set removed/unused features to 0.0. Is that the correct way?
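The "always show all features, removed ones at 0.0" behaviour could be sketched as a small post-processing step. This is base R only, and `complete_importance()` is a hypothetical helper, not a caret function; the two input tables are rebuilt from the outputs shown above.

```r
# Hypothetical helper: make sure every original feature has a row,
# using 0 for features that are missing or NA in the importance table.
complete_importance <- function(imp, all_features) {
  missing <- setdiff(all_features, rownames(imp))
  if (length(missing) > 0) {
    filler <- as.data.frame(
      matrix(0, nrow = length(missing), ncol = ncol(imp),
             dimnames = list(missing, names(imp)))
    )
    imp <- rbind(imp, filler)
  }
  imp[is.na(imp)] <- 0
  imp[all_features, , drop = FALSE]
}

features <- c("Petal.Width", "Petal.Length", "Sepal.Width", "Sepal.Length")

# rpart model importance from 3.1: Sepal.Length is missing.
imp_model <- data.frame(
  Overall   = c(100, 97.57, 0),
  row.names = c("Petal.Width", "Petal.Length", "Sepal.Width")
)

# Regression filter importance from 3: Sepal.Length is NA.
imp_filter <- data.frame(
  Overall   = c(100, 99.9709, 0, NA),
  row.names = features
)

fixed_model  <- complete_importance(imp_model, features)
fixed_filter <- complete_importance(imp_filter, features)
print(fixed_model)
print(fixed_filter)
```

Both tables end up with the same four rows, which makes results from different models or different varImp() paths directly comparable.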


### Session Info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-84    ggplot2_3.2.0   lattice_0.20-38

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         pillar_1.4.2       compiler_3.6.1     gower_0.2.1        plyr_1.8.4         iterators_1.0.10   class_7.3-15      
 [8] tools_3.6.1        rpart_4.1-15       ipred_0.9-9        lubridate_1.7.4    tibble_2.1.3       nlme_3.1-140       gtable_0.3.0      
[15] pkgconfig_2.0.2    rlang_0.4.0        Matrix_1.2-17      foreach_1.4.4      rstudioapi_0.10    prodlim_2018.04.18 stringr_1.4.0     
[22] withr_2.1.2        dplyr_0.8.3        generics_0.0.2     recipes_0.1.6      stats4_3.6.1       nnet_7.3-12        grid_3.6.1        
[29] tidyselect_0.2.5   data.table_1.12.2  glue_1.3.1         R6_2.4.0           survival_2.44-1.1  lava_1.6.5         reshape2_1.4.3    
[36] purrr_0.3.2        magrittr_1.5       ModelMetrics_1.2.2 splines_3.6.1      scales_1.0.0       codetools_0.2-16   MASS_7.3-51.4     
[43] assertthat_0.2.1   timeDate_3043.102  colorspace_1.4-1   stringi_1.4.3      lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4 

Thanks!

@laz8 laz8 commented Nov 1, 2019

The next problem is that using "zv,nzv,corr,pca" and others in resampling can create or remove features, so the tune grids (which are created from the unresampled data) are sometimes wrong (wrong input/row count)...

invalid mtry: reset to within valid range

The iris data has 4 features; if corr removes 1, the warning above is the result in resampling...
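To illustrate the mismatch, here is a minimal base R sketch of a grid built for 4 predictors being applied to a resample where only 3 remain. `clamp_mtry()` is a hypothetical helper and `n_kept` is an assumed count; clamping is presumably similar to what the "reset to within valid range" warning describes, but that is not verified here.

```r
# Grid built from the full data: iris has 4 predictors.
n_full <- 4
grid   <- expand.grid(mtry = 2:n_full)

# Inside a resample, "corr" might drop one predictor (assumed count).
n_kept <- 3

# Hypothetical helper: clamp mtry to the valid range and drop duplicates.
clamp_mtry <- function(grid, n_predictors) {
  grid$mtry <- pmin(pmax(grid$mtry, 1), n_predictors)
  unique(grid)
}

clamped <- clamp_mtry(grid, n_kept)
print(grid)     # contains mtry = 4, which is invalid for 3 predictors
print(clamped)  # every mtry is now <= 3
```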

@topepo topepo commented Jan 2, 2020

Please set the seed before running train() so I can reproduce your results.

For "using "corr" in preProcess can remove features"

  • filterVarImp() doesn't know anything about the model specification (since useModel = FALSE)

For issue 3.1, the warning "There were missing values in resampled performance measures." usually appears when the model predicts a single value for all samples (R^2 then cannot be computed because of a division by zero). For trees, it probably means that the tree found no good splits (but I can't tell for sure since the seed wasn't set).
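Both points can be illustrated in base R. The sketch assumes R^2 is computed as the squared correlation between predictions and observations, which is an assumption about the metric, and the seed value 1087 is arbitrary.

```r
# 1. Setting the seed makes the shuffled row order reproducible.
set.seed(1087)
idx1 <- sample(1:150)
set.seed(1087)
idx2 <- sample(1:150)
print(identical(idx1, idx2))

# 2. A model that predicts one value for all samples has zero variance in
#    its predictions, so the squared correlation is undefined (NA).
obs        <- c(1, 2, 3, 1, 2, 3)
pred_const <- rep(mean(obs), length(obs))
r2_const   <- suppressWarnings(cor(pred_const, obs)^2)
print(r2_const)

# With varying predictions the same formula gives a usable value.
pred_var <- c(1.1, 1.9, 3.2, 0.8, 2.1, 2.9)
r2_var   <- cor(pred_var, obs)^2
print(r2_var)
```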

topepo added a commit that referenced this issue Jan 2, 2020