Issue with RLDA Method Prediction Probabilities Matching Positive Class #761
Comments
I noticed the same with dda.
Weirdly, the package does have They are not documented, but it was reasonable to think that
> # they don't set the seed so I will
> set.seed(76324)
>
> # from ? lda_pseudo
> n <- nrow(iris)
> train <- sample(seq_len(n), n / 2)
> lda_pseudo_out <- lda_pseudo(Species ~ ., data = iris[train, ])
> predicted <- predict(lda_pseudo_out, iris[-train, -5])
>
> table(levels(iris$Species)[apply(predicted$scores, 2, which.max)], predicted$class)
            setosa versicolor virginica
  setosa         0         29        25
  virginica     21          0         0
> table(levels(iris$Species)[apply(predicted$scores, 2, which.min)], predicted$class)
             setosa versicolor virginica
  setosa         21          0         0
  versicolor      0         29         0
  virginica       0          0        25
I've made a change that fixes the issue:
> rlda_fit$pred %>% arrange(rowIndex) %>% filter(estimator == "Thomaz-Kitani-Gillies") %>% head(10)
   pred  obs       Bad      Good rowIndex             estimator Resample
1   Bad Good 0.5765554 0.4234446        1 Thomaz-Kitani-Gillies   Fold08
2   Bad  Bad 0.8103405 0.1896595        2 Thomaz-Kitani-Gillies   Fold03
3   Bad Good 0.6202382 0.3797618        3 Thomaz-Kitani-Gillies   Fold02
4   Bad Good 0.8769421 0.1230579        4 Thomaz-Kitani-Gillies   Fold06
5   Bad  Bad 0.7634530 0.2365470        5 Thomaz-Kitani-Gillies   Fold08
6   Bad Good 0.8812585 0.1187415        6 Thomaz-Kitani-Gillies   Fold07
7   Bad Good 0.6693844 0.3306156        7 Thomaz-Kitani-Gillies   Fold06
8   Bad Good 0.8441843 0.1558157        8 Thomaz-Kitani-Gillies   Fold01
9   Bad Good 0.6790054 0.3209946        9 Thomaz-Kitani-Gillies   Fold08
10  Bad  Bad 0.7817670 0.2182330       10 Thomaz-Kitani-Gillies   Fold03
I'll close this but please test and make sure that you get the same results.
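One way to carry out the requested test is to compare each predicted class with the class column that holds the largest probability. The following is a minimal sketch, not part of caret: `pred_matches_max_prob` is a hypothetical helper name, and the two small data frames only copy the first three rows of the `rlda_fit$pred` printouts in this thread (before and after the fix).

```r
# Hypothetical helper (not a caret function): returns TRUE for each row where
# `pred` names the class column holding the largest probability in that row.
pred_matches_max_prob <- function(pred_df, class_levels) {
  probs <- as.matrix(pred_df[, class_levels, drop = FALSE])
  top_class <- class_levels[max.col(probs, ties.method = "first")]
  as.character(pred_df$pred) == top_class
}

# First three rows of the output reported in the issue (before the fix).
before_fix <- data.frame(
  pred = c("Bad", "Bad", "Bad"),
  Bad  = c(0.42505113, 0.17682025, 0.38313288),
  Good = c(0.5749489, 0.8231798, 0.6168671)
)

# First three rows of the output posted after the fix.
after_fix <- data.frame(
  pred = c("Bad", "Bad", "Bad"),
  Bad  = c(0.5765554, 0.8103405, 0.6202382),
  Good = c(0.4234446, 0.1896595, 0.3797618)
)

before_ok <- pred_matches_max_prob(before_fix, c("Bad", "Good"))
after_ok  <- pred_matches_max_prob(after_fix, c("Bad", "Good"))
print(before_ok)
print(after_ok)
```

On a real fit, the same helper could presumably be applied to the whole saved prediction table, for example `all(pred_matches_max_prob(rlda_fit$pred, levels(rlda_fit$pred$obs)))`, assuming `savePredictions = TRUE` and `classProbs = TRUE` were set in `trainControl`.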
Basic Issue
When using the "rlda" method with caret::train, the predicted classes are the opposite of the highest probability class (binary classification models). With other methods such as "glm", the predicted classes do match the highest probability class.
Minimal, reproducible example:
update.packages(oldPkgs="caret", ask=FALSE)
library(dplyr)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
Load Data
data(GermanCredit)
Remove Predictors with Zero Variance
GermanCredit = GermanCredit %>% select(-Purpose.Vacation, -Personal.Female.Single)
Basic Train Control Settings
myControl <- trainControl(classProbs = TRUE,
verboseIter = FALSE,
savePredictions = TRUE,
method = "cv"
)
Fit RLDA
rlda_fit = train(Class ~ .,
data = GermanCredit,
method="rlda",
trControl = myControl
)
Display Predictions and Probabilities from RLDA
rlda_fit$pred %>% arrange(rowIndex) %>% filter(estimator == "Thomaz-Kitani-Gillies") %>% head(10)
#>   pred  obs        Bad      Good rowIndex             estimator Resample
#> 1  Bad Good 0.42505113 0.5749489        1 Thomaz-Kitani-Gillies   Fold06
#> 2  Bad  Bad 0.17682025 0.8231798        2 Thomaz-Kitani-Gillies   Fold09
#> 3  Bad Good 0.38313288 0.6168671        3 Thomaz-Kitani-Gillies   Fold05
#> 4  Bad Good 0.12930325 0.8706967        4 Thomaz-Kitani-Gillies   Fold08
#> 5  Bad  Bad 0.23457351 0.7654265        5 Thomaz-Kitani-Gillies   Fold06
#> 6  Bad Good 0.08919051 0.9108095        6 Thomaz-Kitani-Gillies   Fold02
Notice that in row 1 the probability for Bad is 0.425 and for Good is 0.575, and yet the prediction is Bad, the class with the lower probability.
Fit Logistic Regression
log_fit = train(Class ~ .,
data = GermanCredit,
method="glm",
family = "binomial",
trControl = myControl
)
Display Logistic Model Predictions and Probabilities
log_fit$pred %>% head(10)
#>   pred  obs        Bad      Good rowIndex parameter Resample
#> 1 Good Good 0.31684339 0.6831566        4      none   Fold01
#> 2 Good  Bad 0.38331551 0.6166845       14      none   Fold01
#> 3 Good Good 0.18120281 0.8187972       41      none   Fold01
#> 4 Good Good 0.01738535 0.9826147       72      none   Fold01
#> 5 Good Good 0.17669562 0.8233044       79      none   Fold01
#> 6 Good Good 0.39631486 0.6036851       89      none   Fold01
Notice that in row 1 the probability for Bad is 0.317 and for Good is 0.683, and the prediction is Good, the class with the higher probability.
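The iris comparison in the comments shows `predicted$class` agreeing with `which.min` of `predicted$scores`, which suggests the scores are "lower is better". If probabilities are derived from such scores as though larger were better, they would point at the opposite class, as seen with "rlda" here. The sketch below illustrates that idea only: the score values are hypothetical, and the softmax of negated scores is an illustrative assumption, not necessarily the change that was made in caret.

```r
# Hypothetical scores: rows are classes, columns are observations, matching
# the orientation implied by apply(predicted$scores, 2, ...) in the comments.
scores <- matrix(
  c(1.2, 3.4,   # observation 1
    5.0, 0.7,   # observation 2
    2.1, 2.9),  # observation 3
  nrow = 2,
  dimnames = list(c("Bad", "Good"), NULL)
)

# Assumption: the classifier assigns the class with the SMALLEST score.
predicted_class <- rownames(scores)[apply(scores, 2, which.min)]

# Treating larger scores as more likely inverts the probabilities.
naive_prob <- apply(scores, 2, function(x) exp(x) / sum(exp(x)))

# Negating the scores first makes the largest probability line up
# with the smallest score.
flipped_prob <- apply(scores, 2, function(x) exp(-x) / sum(exp(-x)))

naive_top   <- rownames(scores)[apply(naive_prob, 2, which.max)]
flipped_top <- rownames(scores)[apply(flipped_prob, 2, which.max)]

print(predicted_class)
print(naive_top)
print(flipped_top)
```

In this toy setup the naive probabilities disagree with the predicted class for every observation, while the probabilities built from negated scores agree with it.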
Session Info:
sessionInfo()
#> R version 3.4.2 (2017-09-28)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_3.4.2 backports_1.1.0 magrittr_1.5 rprojroot_1.2
#> [5] tools_3.4.2 htmltools_0.3.6 yaml_2.1.14 Rcpp_0.12.13
#> [9] stringi_1.1.5 rmarkdown_1.6 knitr_1.16 stringr_1.2.0
#> [13] digest_0.6.12 evaluate_0.10.1