Fast Calculation for Area Under ROC curve #17
Comments
Hi Michael, of course I am always open to contributions. First, a question: how does this compare to the DeLong code that I have in I guess the real question here is how to get rid of the overhead due to the construction of the ROC curve. I see the following things to think about:
Hi, thanks for responding quickly. Yes, the code in Since this method is restricted to the full AUC, it might make sense for this functionality to be provided in a separate function that exists outside of the primary
I agree it should be separate, and actually it sounds like it should be a separate package. pROC does a lot of checks on the inputs and accepts a pretty large range of formats (numeric, ordered, dealing with NAs, etc.), using arbitrary levels and direction for the comparison. This of course has a significant impact on the runtime, which you'll probably want to avoid if you're interested in pure speed. It has little impact when dealing with large data sets, but I can see your code also being useful when dealing with a large number of curves. I think it would be confusing to have some functions in pROC not check their inputs as thoroughly, and I'd rather avoid that. A separate package such as
The area under the ROC curve can be calculated directly from a vector of predictions and a vector of binary labels using the Mann-Whitney U statistic. Since this algorithm does not require calculating the ROC curve, it can provide a significant performance increase. My benchmarks show that, on 10 thousand observations, this algorithm is roughly 900 times faster than calculating AUROC with your package (2100 milliseconds vs 2.3 milliseconds).
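To make the idea concrete, here is a minimal sketch of the rank-based calculation. It is not the implementation that was benchmarked above, which is not shown in this thread. The function name `fast_auc` is hypothetical, and the sketch assumes that higher scores indicate the positive class and that tied scores receive midranks. It sorts the scores once, sums the ranks of the positives, and converts that sum to U / (n_pos * n_neg).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of pROC): AUC from the Mann-Whitney U statistic.
// labels[i] == true marks a positive (case), false a negative (control).
// Assumes higher scores indicate the positive class.
// Returns -1.0 when either class is empty (an arbitrary sentinel for this sketch).
double fast_auc(const std::vector<double>& scores, const std::vector<bool>& labels) {
    assert(scores.size() == labels.size());
    const std::size_t n = scores.size();

    // Sort indices by score instead of sorting the scores themselves,
    // so each score stays paired with its label.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return scores[a] < scores[b]; });

    double pos_rank_sum = 0.0;
    std::size_t n_pos = 0;
    std::size_t i = 0;
    while (i < n) {
        // Find the run [i, j) of tied scores.
        std::size_t j = i;
        while (j < n && scores[order[j]] == scores[order[i]]) ++j;
        // The run occupies 1-based ranks i+1 .. j; every member gets the midrank.
        const double midrank = 0.5 * static_cast<double>(i + 1 + j);
        for (std::size_t k = i; k < j; ++k) {
            if (labels[order[k]]) {
                pos_rank_sum += midrank;
                ++n_pos;
            }
        }
        i = j;
    }

    const std::size_t n_neg = n - n_pos;
    if (n_pos == 0 || n_neg == 0) return -1.0;

    const double np = static_cast<double>(n_pos);
    const double nn = static_cast<double>(n_neg);
    // U = (rank sum of positives) - n_pos * (n_pos + 1) / 2
    const double u = pos_rank_sum - np * (np + 1.0) / 2.0;
    return u / (np * nn);
}
```

The cost is dominated by the single sort, so it is O(n log n) with no ROC curve constructed. Note that this sketch performs none of the input checks discussed in the comments (NA handling, level ordering, direction detection).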
Would you be interested in adding a C++ implementation of this algorithm to your package? The speedup that this algorithm provides would be valuable for users who need to evaluate hundreds to thousands of models (e.g. with a grid search over a feature / hyper-parameter space).
If you are interested in this contribution to your package, please let me know.