typo in caret rpart varImp, NA and varImp != preProcess with "zv,nzv,corr" #1087
Comments
The next problem is that using "zv,nzv,corr,pca" and others in resampling can create new features or remove existing ones, so the tune grids created (based on the unresampled data) are sometimes wrong (wrong input/row count)... invalid mtry: reset to within valid range. The iris data has 4 features; if corr removes 1, that warning is the result in resampling...
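The feature-count mismatch can be illustrated with base R alone. This is a minimal sketch, assuming a correlation cutoff of 0.9 and assuming that a "corr" filter drops one member of each highly correlated pair; it does not call caret itself, and the variable names are only for illustration.

```r
# Predictors of the iris data (4 numeric features).
x <- iris[, 1:4]

# Absolute pairwise correlations; keep only the lower triangle so that
# each pair is counted once.
cm <- abs(cor(x))
cm[upper.tri(cm, diag = TRUE)] <- 0

# Pairs above the assumed cutoff of 0.9.
high <- which(cm > 0.9, arr.ind = TRUE)
pairs <- data.frame(
  a = rownames(cm)[high[, "row"]],
  b = colnames(cm)[high[, "col"]],
  r = cm[high]
)
print(pairs)

# If one feature of each such pair is dropped, this many features remain.
n_after <- ncol(x) - nrow(pairs)
print(n_after)
```

Only the Petal.Length / Petal.Width pair exceeds the cutoff, so a filtered resample would have 3 predictors, and an mtry value of 4 taken from the full data would no longer be valid.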
Please set the seed before running.
For "using "corr" in preProcess can remove features"
For issue 3.1, the warning "There were missing values in resampled performance measures." usually appears when the model predicts a single value for all samples (in that case R^2 cannot be computed because of a division by zero). For trees, it probably means that the tree found no good splits (but I can't tell for sure since the seed wasn't set).
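The division-by-zero case can be reproduced in base R. This is a minimal sketch, assuming R^2 is computed as the squared correlation between predictions and observations; the constant prediction stands in for a tree that found no splits.

```r
# Observed outcome and a "model" that predicts the same value for every sample.
obs  <- iris$Sepal.Length
pred <- rep(mean(obs), length(obs))

# The standard deviation of pred is zero, so the correlation is undefined.
# cor() returns NA here (with a warning, suppressed for readability).
r2 <- suppressWarnings(cor(pred, obs)^2)
print(is.na(r2))
```

A resample in which this happens contributes an NA to the performance summary, which is consistent with the warning quoted above.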
Hi guys ;)...
1. The easy stuff first, a typo in rpart varImp()
View(caret:::getModelInfo("rpart", FALSE)[[1]]$varImp)
WRONG in LINE 37 : out <- data.frame(x = numeric(), Vaiable = character())
CORRECT : out <- data.frame(x = numeric(), Variable = character())
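A quick base R check shows why the misspelling matters: any later code that looks up a column named Variable will not find it. This is only a sketch of the effect, not the surrounding caret code.

```r
# The empty data frame as currently written, and as corrected.
wrong <- data.frame(x = numeric(), Vaiable = character())
right <- data.frame(x = numeric(), Variable = character())

# The misspelled frame has no "Variable" column.
print("Variable" %in% names(wrong))  # FALSE
print("Variable" %in% names(right))  # TRUE
```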
2. using "corr" in preProcess can remove features (inputs)
With varImp(fit, useModel = FALSE), or if the model has no varImp() method of its own, caret uses filterVarImp().
View(caret:::varImp.train)
Example:
4 rows, including the removed "Petal.Length" which shows 100%...?
filterVarImp() does not take into account that inputs were removed and still shows wrong/high importance values for them. That's confusing, and the question is what to do: remove the correlated input even if it has a high importance?
3. NA in regression varImp() importance from "zv,nzv"
Let's do a regression on the iris data...
Why is this? It does not depend on the preProcess... It comes from the metrics used. I use my own, but sorry, I don't remember what the problem was...
3.1 classification works
rpart variable importance
              Overall
Petal.Width    100.00
Petal.Length    97.57
Sepal.Width      0.00
Sepal.Length   <-- missing
So sometimes removed features are set to 0.0, sometimes to NA, and sometimes they are dropped from the importance table altogether. I would think it is better to ALWAYS show all features and to set removed/unused features to 0.0. Is that the correct way?
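The suggested behaviour could be sketched as a small post-processing step in base R. The helper pad_importance() is hypothetical (it is not part of caret), and it assumes the importance table is a data frame with an Overall column and feature names as row names, like the rpart table printed above.

```r
# Hypothetical helper: make sure every original feature appears in the
# importance table, with missing or NA entries set to 0.
pad_importance <- function(imp, all_features) {
  missing <- setdiff(all_features, rownames(imp))
  if (length(missing) > 0) {
    pad <- data.frame(Overall = rep(0, length(missing)), row.names = missing)
    imp <- rbind(imp, pad)
  }
  imp$Overall[is.na(imp$Overall)] <- 0
  # Sort by decreasing importance, keeping the data frame shape.
  imp[order(-imp$Overall), , drop = FALSE]
}

# Importance values copied from the classification example above,
# where Sepal.Length is missing.
imp <- data.frame(
  Overall = c(100, 97.57, 0),
  row.names = c("Petal.Width", "Petal.Length", "Sepal.Width")
)

padded <- pad_importance(imp, names(iris)[1:4])
print(padded)
```

Whether 0 is the right value for a feature that was removed before the model ever saw it is exactly the open question here; the sketch only shows that a consistent table is cheap to produce.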
### Session Info:
Thanks!