Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite case value causes error #30

Closed
jir88 opened this issue May 11, 2018 · 8 comments
Closed

Infinite case value causes error #30

jir88 opened this issue May 11, 2018 · 8 comments

Comments

@jir88
Copy link

jir88 commented May 11, 2018

pROC generates an error whenever the list of case values contains Inf. I suspect this is related to issue #25 . I am using the most recent GitHub version of pROC (as of May 11).

To Reproduce
Steps to reproduce the behavior:

  1. Started with a fresh R session:

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4 yaml_2.1.18

  1. Run attached demo script (includes minimal data set).

pROC-inf-bug-test.txt

  1. Error message is:

Error in delongPlacements(roc) :
pROC: error in calculating DeLong's theta: got 0.73333333333333328152 instead of 0.65000000000000002220. Diagnostic data saved in pROC_bug.RData. Please report this bug to https://github.com/xrobin/pROC/issues.

@xrobin
Copy link
Owner

xrobin commented May 11, 2018

Thanks for the report. It is related with issue #25 indeed, but this is a really tricky one.

If you define the criteria for the positivity of an observation as x >= t (this is the case in pROC with direction = "<"), then we have the negative criteria x < t. If x is -Inf you can never have a threshold that makes it negative. If it is in a control and direction = "<", then you have a control which will always be positive, so you cannot reach 100% specificity. (If it's a case then it's just always misclassified which is actually fine).

The opposite is true if you define positivity as x > t (strict) and negative x <= t, you can setup curves that never reach 100% sensitivity. In both cases you have a curve which is not fully defined. Curves not hitting 0 or 100% will create all sort of issues down the line (with AUC etc) and pROC should not create any.

This is definitely not handled properly by the DeLong Theta which is agnostic to the actual threshold and only cares about the ordering. It is also not handled properly by algorithm 2, which actually works like the DeLong Theta making this bit run:

controls <- c(1,2,3,4,5)
cases <- c(2,3,4,5,6,Inf)
roc_data <- roc(controls = controls,
                cases = cases,
                percent=TRUE, ci = TRUE, print.auc=TRUE,
                algorithm=2)
plot(roc_data)

Also note how the curve is not actually complete if the infinity is in the controls (or I guess in the cases with direction = ">").

controls <- c(1,2,3,4,5,Inf)
cases <- c(2,3,4,5,6)
roc_data <- roc(controls = controls,
                cases = cases,
                percent=TRUE, ci = TRUE, print.auc=TRUE)
coords(roc_data, "all")
all       all      all  all      all      all      all all
threshold   -Inf   1.50000  2.50000  3.5  4.50000  5.50000      Inf  NA
specificity    0  16.66667 33.33333 50.0 66.66667 83.33333 83.33333  NA
sensitivity  100 100.00000 80.00000 60.0 40.00000 20.00000  0.00000  NA

I am not sure how to handle this correctly yet. Most likely I'll have to return NA when infinities are present, possibly relaxing it in some selected circumstances. In any case lots of cleanup to do.

@xrobin
Copy link
Owner

xrobin commented May 11, 2018

Two curves that will be interesting to test this bug and define clear test cases.

~~This is one curve that can never reach 100% specificity:~~~~

controls <- c(-Inf, 1,2,3,4,5)
cases <- c(2,3,4,5,6)
roc_data <- roc(controls = controls,
                cases = cases, direction = "<")

This one actually does reach 100% sensitivity due to the positivity being x >= threshold, so if t=Inf the last case is positive:

controls <- c(1,2,3,4,5)
cases <- c(2,3,4,5,6, Inf)
roc_data <- roc(controls = controls,
                cases = cases, direction = "<")

Edit: The above is just plain wrong and shows how confusing thresholding infinities with infinities is.

@xrobin
Copy link
Owner

xrobin commented May 11, 2018

One more note to self: it is very likely that only algorithms 1 and 3 can deal with infinite values properly, so it will be important to remember to:

  • Disable algorithm 2 and DeLong when any +-Inf values are present.
  • Switch to algorithm 1 or 3 and bootstrap if appropriate, or throw an error message if the user specifically asked for DeLong or algorithm 2.
  • Verify behavior of Obuchowski and Venkatraman

@xrobin
Copy link
Owner

xrobin commented May 14, 2018

The more I look at this issue, the more I think +-Inf should be just flat-out rejected. It is simply too confusing, even I need to think very hard to get which conditions work and which ones don't. I fear that accepting some Inf but not others will be just far too confusing. Refusing to threshold infinities with infinity in all cases makes sense, and is much easier to understand.

This leaves the question: what should roc do when predictor contains an infinite value? Possible options include:

  1. return NA, probably with a warning
  2. stop with an error
  3. A mix of 1 and 2 depending on the status of na.rm

Feedback from users is welcome.

xrobin added a commit that referenced this issue May 14, 2018
@jir88
Copy link
Author

jir88 commented May 23, 2018

Thanks for handling this issue so quickly! My personal preference would be option 2: stop with an error. A well-written error message will hopefully help users quickly isolate and understand the problem.

@xrobin
Copy link
Owner

xrobin commented May 24, 2018

Option 2 has my preference too, the previous commit implements it.

@xrobin
Copy link
Owner

xrobin commented Jul 12, 2018

After lots of thinking and hesitation, I don't think throwing errors is a good idea, and it is likely to break things. Instead I think returning NaN makes more sense, as thresholding infinities with infinities is not defined mathematically, rather than missing.

I need to perform more tests, especially of reverse dependencies before submitting to CRAN.

@xrobin
Copy link
Owner

xrobin commented Sep 24, 2018

I pushed pROC v1.13.0 to CRAN yesterday and it the submission just went through. It should be available shortly.

@xrobin xrobin closed this as completed Sep 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants