Issue with RLDA Method Prediction Probabilities Matching Positive Class #761

Closed
cojamalo opened this issue Oct 26, 2017 · 3 comments
@cojamalo cojamalo commented Oct 26, 2017

Basic Issue

When using the "rlda" method with caret::train on a binary classification problem, the predicted class is the opposite of the class with the highest predicted probability. With other methods such as "glm", the predicted class does match the highest-probability class.

Minimal, reproducible example:

update.packages(oldPkgs="caret", ask=FALSE)

library(dplyr)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2

Load Data

data(GermanCredit)

Remove Predictors with Zero Variance

GermanCredit = GermanCredit %>% select(-Purpose.Vacation, -Personal.Female.Single)

Basic Train Control Settings

myControl <- trainControl(classProbs = TRUE,
                          verboseIter = FALSE,
                          savePredictions = TRUE,
                          method = "cv")

Fit RLDA

rlda_fit = train(Class ~ .,
                 data = GermanCredit,
                 method = "rlda",
                 trControl = myControl)

Display Predictions and Probabilities from RLDA

rlda_fit$pred %>% arrange(rowIndex) %>% filter(estimator == "Thomaz-Kitani-Gillies") %>% head(10)

#>   pred  obs        Bad      Good rowIndex             estimator Resample
#> 1  Bad Good 0.42505113 0.5749489        1 Thomaz-Kitani-Gillies   Fold06
#> 2  Bad  Bad 0.17682025 0.8231798        2 Thomaz-Kitani-Gillies   Fold09
#> 3  Bad Good 0.38313288 0.6168671        3 Thomaz-Kitani-Gillies   Fold05
#> 4  Bad Good 0.12930325 0.8706967        4 Thomaz-Kitani-Gillies   Fold08
#> 5  Bad  Bad 0.23457351 0.7654265        5 Thomaz-Kitani-Gillies   Fold06
#> 6  Bad Good 0.08919051 0.9108095        6 Thomaz-Kitani-Gillies   Fold02

Notice that in row 1 the probability for Bad is 0.425 and for Good is 0.575, and yet the prediction is Bad, the class with the lower probability.
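
The mismatch can also be checked programmatically instead of by eye. The following is a minimal sketch in base R; the helper name pred_matches_prob is hypothetical (it is not part of caret), and the small data frame simply copies the first three rows of the output above by hand.

```r
# Hypothetical helper (not part of caret): returns TRUE for each row where
# the predicted class is also the class with the highest probability.
pred_matches_prob <- function(pred_df, class_cols) {
  probs <- as.matrix(pred_df[, class_cols, drop = FALSE])
  # max.col() gives, for each row, the index of the largest column value
  top_class <- class_cols[max.col(probs, ties.method = "first")]
  as.character(pred_df$pred) == top_class
}

# First three rows of the rlda output shown above, copied by hand
rlda_rows <- data.frame(
  pred = c("Bad", "Bad", "Bad"),
  Bad  = c(0.42505113, 0.17682025, 0.38313288),
  Good = c(0.5749489, 0.8231798, 0.6168671)
)

rlda_agree <- pred_matches_prob(rlda_rows, c("Bad", "Good"))
mean(rlda_agree)  # proportion of rows where pred and probabilities agree
```

On the full resampling results, the same check would presumably be pred_matches_prob(rlda_fit$pred, c("Bad", "Good")), since that data frame carries the pred, Bad and Good columns shown above.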

Fit Logistic Regression

log_fit = train(Class ~ .,
                data = GermanCredit,
                method = "glm",
                family = "binomial",
                trControl = myControl)

Display Logistic Model Predictions and Probabilities

log_fit$pred %>% head(10)

#>   pred  obs        Bad      Good rowIndex parameter Resample
#> 1 Good Good 0.31684339 0.6831566        4      none   Fold01
#> 2 Good  Bad 0.38331551 0.6166845       14      none   Fold01
#> 3 Good Good 0.18120281 0.8187972       41      none   Fold01
#> 4 Good Good 0.01738535 0.9826147       72      none   Fold01
#> 5 Good Good 0.17669562 0.8233044       79      none   Fold01
#> 6 Good Good 0.39631486 0.6036851       89      none   Fold01

Notice that in row 1 the probability for Bad is 0.317 and for Good is 0.683, and the prediction is Good, the class with the higher probability.

Session Info:

sessionInfo()
#> R version 3.4.2 (2017-09-28)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_3.4.2 backports_1.1.0 magrittr_1.5 rprojroot_1.2
#> [5] tools_3.4.2 htmltools_0.3.6 yaml_2.1.14 Rcpp_0.12.13
#> [9] stringi_1.1.5 rmarkdown_1.6 knitr_1.16 stringr_1.2.0
#> [13] digest_0.6.12 evaluate_0.10.1

@Wermeling Wermeling commented Nov 8, 2017

I noticed the same with dda.

topepo added a commit that referenced this issue Nov 9, 2017

@topepo topepo commented Nov 9, 2017

Weirdly, the package does have sparsediscrim:::posterior_probs to compute the probabilities, but I reliably get covariance matrices that are all Inf or NaN, so that fails.

The scores returned by predict are not documented, but it was reasonable to think that they would be the discriminant functions. Instead, they appear to be something more like the distance to the class centroid, since smaller values are associated with the most likely class:

> # they don't set the seed so I will
> set.seed(76324)
> 
> # from ? lda_pseudo
> n <- nrow(iris)
> train <- sample(seq_len(n), n / 2)
> lda_pseudo_out <- lda_pseudo(Species ~ ., data = iris[train, ])
> predicted <- predict(lda_pseudo_out, iris[-train, -5])
> 
> table(levels(iris$Species)[apply(predicted$scores, 2, which.max)], predicted$class)
           
            setosa versicolor virginica
  setosa         0         29        25
  virginica     21          0         0
> table(levels(iris$Species)[apply(predicted$scores, 2, which.min)], predicted$class)
            
             setosa versicolor virginica
  setosa         21          0         0
  versicolor      0         29         0
  virginica       0          0        25

I've made a change that fixes the issue:

> rlda_fit$pred %>% arrange(rowIndex) %>% filter(estimator == "Thomaz-Kitani-Gillies") %>% head(10)
   pred  obs       Bad      Good rowIndex             estimator Resample
1   Bad Good 0.5765554 0.4234446        1 Thomaz-Kitani-Gillies   Fold08
2   Bad  Bad 0.8103405 0.1896595        2 Thomaz-Kitani-Gillies   Fold03
3   Bad Good 0.6202382 0.3797618        3 Thomaz-Kitani-Gillies   Fold02
4   Bad Good 0.8769421 0.1230579        4 Thomaz-Kitani-Gillies   Fold06
5   Bad  Bad 0.7634530 0.2365470        5 Thomaz-Kitani-Gillies   Fold08
6   Bad Good 0.8812585 0.1187415        6 Thomaz-Kitani-Gillies   Fold07
7   Bad Good 0.6693844 0.3306156        7 Thomaz-Kitani-Gillies   Fold06
8   Bad Good 0.8441843 0.1558157        8 Thomaz-Kitani-Gillies   Fold01
9   Bad Good 0.6790054 0.3209946        9 Thomaz-Kitani-Gillies   Fold08
10  Bad  Bad 0.7817670 0.2182330       10 Thomaz-Kitani-Gillies   Fold03
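
For readers who want to see how distance-like scores can be turned into class probabilities that agree with the predicted class, here is a minimal sketch. It is an assumption for illustration, not the code from the referenced commit: it applies a softmax to the negated scores, so the smallest score gets the largest probability. The function name scores_to_probs and the toy score matrix are hypothetical; the layout (classes in rows, samples in columns) follows the apply(predicted$scores, 2, which.min) call above.

```r
# Assumption: scores has one row per class and one column per sample, and a
# smaller score means a more likely class (as the which.min table suggests).
scores_to_probs <- function(scores) {
  neg <- -scores
  # Subtract each column's maximum before exponentiating, for numerical stability
  neg <- sweep(neg, 2, apply(neg, 2, max))
  e <- exp(neg)
  # Normalise each column so the class probabilities of a sample sum to 1
  probs <- sweep(e, 2, colSums(e), "/")
  t(probs)  # one row per sample, one column per class
}

# Toy scores for two samples (made-up numbers, filled column by column)
toy_scores <- matrix(c(1.0, 3.0,
                       2.5, 0.5),
                     nrow = 2,
                     dimnames = list(c("Bad", "Good"), NULL))

toy_probs <- scores_to_probs(toy_scores)
toy_probs
```

With this mapping, the class with the smallest score in each column is also the class with the highest probability, which is the property the corrected output above exhibits.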

@topepo topepo commented Nov 11, 2017

I'll close this but please test and make sure that you get the same results.

@topepo topepo closed this Nov 11, 2017