ENH: stats.ecdf: add confidence_interval methods #18136
About removing: we could still have this private for a plot method. About the renaming: we could, although I am not sure about the name. I think it is quite obscure for non-experts. Not that I am a reference, but I would not have guessed the meaning of that function. Maybe
Thanks Matt! Some suggestions
If tests are passing, I think this is about ready @tupui. I added tests that cover both types of confidence intervals at varying confidence levels. CI failures seem unrelated.

It might be useful to have examples of reference code all in one place.

This function with this PR (documentation):

import numpy as np
from scipy import stats
times = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81] # times
died = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1] # 1 means deaths (not censored)
sample = stats.CensoredData.right_censored(times, np.logical_not(died))
res = stats.ecdf(sample)
ci = res.sf.confidence_interval(method='log-log', confidence_level=0.95)
print(np.array([ci.low, ci.high]).T)
print(res.sf.points)

Lifelines (https://colab.research.google.com/, documentation):

# !pip install lifelines # I used Colab
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
t1 = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81] # times
d1 = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1] # 1 means deaths (not censored)
kmf.fit(t1, event_observed=d1)
kmf.survival_function_
kmf.confidence_interval_survival_function_

R:

library(survival)
options(digits=16)
time = c(37, 43, 47, 56, 60, 62, 71, 77, 80, 81)
status = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1)
res = survfit(Surv(time, status)~1, conf.type="log-log", conf.int=0.95)
res$time; res$lower; res$upper

Matlab (https://matlab.mathworks.com/, documentation):

format long
t = [37 43 47 56 60 62 71 77 80 81];
d = [0 0 1 1 0 0 0 1 1 1];
censored = ~d;
[f, x, flo, fup] = ecdf(t, 'Censoring', censored, 'Function', 'survivor', 'Alpha', 0.05);

Mathematica (https://www.wolframcloud.com/, documentation):

e = {24, 3, 11, 19, 24, 13, 14, 2, 18, 17, 24, 21, 12, 1, 10, 23, 6, 5, 9, 17}
ci = {1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0}
R = EventData[e, ci]
S = SurvivalModelFit[R]
S["PointwiseIntervals", ConfidenceLevel->0.95, ConfidenceTransform->"LogLog"]
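For cross-checking the numbers from the libraries above by hand, here is a minimal pure-Python sketch of the Kaplan-Meier estimate with two Greenwood-type intervals for the same data. It is not the code in this PR: the name km_with_ci is made up for illustration, the sketch assumes sorted times without ties, and clipping the linear interval to [0, 1] is an assumption of the sketch.

```python
import math
from statistics import NormalDist

times = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81]  # times
died = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # 1 means death (not censored)


def km_with_ci(times, died, confidence_level=0.95):
    """Kaplan-Meier SF with linear and log-log Greenwood intervals.

    Hypothetical helper: assumes `times` is sorted and has no ties.
    """
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    n_total = len(times)
    s, gsum = 1.0, 0.0
    sf, linear, loglog = [], [], []
    for i, d in enumerate(died):
        n = n_total - i  # number at risk just before times[i]
        s *= (n - d) / n
        # Greenwood sum; undefined (NaN) once everyone at risk has died
        gsum = gsum + d / (n * (n - d)) if n > d else math.nan
        sf.append(s)

        # Linear interval: S +/- z * S * sqrt(sum), clipped to [0, 1]
        se = s * math.sqrt(gsum)
        if math.isnan(se):
            linear.append((math.nan, math.nan))
        else:
            linear.append((max(0.0, s - z * se), min(1.0, s + z * se)))

        # Log-log interval: S ** exp(+/- z * se), se on the log(-log S) scale
        if 0.0 < s < 1.0 and not math.isnan(gsum):
            se_ll = math.sqrt(gsum) / abs(math.log(s))
            loglog.append((s ** math.exp(z * se_ll), s ** math.exp(-z * se_ll)))
        else:
            loglog.append((math.nan, math.nan))
    return sf, linear, loglog


sf, linear, loglog = km_with_ci(times, died)
for t, s, lin, ll in zip(times, sf, linear, loglog):
    print(t, s, lin, ll)
```

In this sketch the log-log interval is NaN wherever the estimate is exactly 0 or 1, and both intervals are NaN at the last time, where the Greenwood sum divides by zero.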
To help us resolve lingering interface questions, I wanted to compile some information about existing nonparametric fit features in other libraries/languages.
It's nice to have a flat structure, but the downsides include a long list of attributes for a single object, long attribute names, and lots of documentation duplication. I also don't think that confidence limits should need to be provided at the times of fitting. R's
The downsides of this approach are similar to lifelines, and some of the bells and whistles we have considered don't seem to be implemented as attributes of the return object (although there might be other functions for doing these things). I'd rather keep this self-contained to avoid disorganized growth of SciPy's survival analysis features. Matlab's Mathematica's Some of those properties work like functions. For instance, Statsmodels has two relevant functions that I've found.
One thing I notice is that R's In a sense, Matlab does that, too. If you specify Ignoring the I like this grouping of everything about the survival function together. In our situation, we want more than just the survival function - we also want the CDF, and why not also the PPF and ISF (at some point). It seems natural, then, for each of these distribution functions to have a similar interface to the SF - each comes with its x-coordinates, y-coordinates, and confidence interval. If I had my way: A
An
The

To create a plot of the survival function point estimate and confidence interval:

res = NonparametricFit(sample)
ci = res.sf.confidence_interval()
_, ax = res.sf.plot()
ci.low.plot(ax)
ci.high.plot(ax)

To do QMC bootstrap resampling based on the sample:

res = NonparametricFit(sample)
qrng = Sobol(1)
p = qrng.random(size)
q = res.ppf.evaluate(p)

To get the median survival time and confidence interval:

res = NonparametricFit(sample)
ci = res.isf.confidence_interval()
median = res.isf.evaluate(0.5)
low, high = ci.low.evaluate(0.5), ci.high.evaluate(0.5)
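To make the proposed evaluate semantics concrete, here is a small sketch of a step-function object for the survival function, together with an inverse lookup of the kind the median example relies on. The class StepSF and its inverse method are hypothetical names for illustration only; they are not part of SciPy or of this PR, and treating the curve as right-continuous is an assumption of the sketch.

```python
import math
from bisect import bisect_right


class StepSF:
    """Hypothetical right-continuous, non-increasing step function."""

    def __init__(self, points, values):
        self.points = list(points)  # sorted x-coordinates of the steps
        self.values = list(values)  # SF value from each point onward

    def evaluate(self, x):
        """Return the SF at x: 1 before the first point, else the last step."""
        i = bisect_right(self.points, x)
        return 1.0 if i == 0 else self.values[i - 1]

    def inverse(self, p):
        """Return the smallest point t with S(t) <= p, or NaN if none exists."""
        for t, s in zip(self.points, self.values):
            if s <= p:
                return t
        return math.nan


points = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81]
values = [1.0, 1.0, 0.875, 0.75, 0.75, 0.75, 0.75, 0.5, 0.25, 0.0]
sf = StepSF(points, values)

print(sf.evaluate(50))   # value of the step that starts at 47
print(sf.inverse(0.5))   # median survival time under this convention
```

Returning NaN when the curve never drops to p mirrors the case where heavy censoring leaves the median undefined.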
I think I am in general fine with the proposal. I just don't much like the plotting for the CI. ATM we have a CI object which does not have plotting semantics. If we add this capability here, then one could ask for something similar for other CI objects (e.g. for Sobol' indices we can have arrays as well, and one could then ask for a plot function). Hence it would not be a namedtuple anymore but a normal object. If we add a plot method, I would maybe prefer to be able to specify a confidence level there and have it plotted directly using this function (in this scenario
The In any case, we don't need to add it here or even make the structural changes here, since we had discussed not letting the mathy parts of this PR get held up by interface. I was just recording my thoughts here. Do the algorithmic parts look OK? After this, I'd open a PR with the restructuring and send an email to the mailing list, since it has changed a lot since the initial email.
Thanks Matt! LGTM with the latest updates. For people following, we had some extensive discussions offline about the API and I think we found a good compromise between simplicity, extensibility and maintenance.
(I will merge after including the last suggestion about keyword only. The CI is green besides usual failures.)
self._x = x
self._n = n
self._d = d
self._sf = points if kind == 'sf' else 1 - points
This is private so ok. If we have some follow up consider naming this self._complement so it's not tainted with CDF/SF.
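As a side note on why storing one internal representation works for both kinds: an interval computed on the SF scale maps to the CDF scale by taking complements and swapping the ends. A minimal sketch, where bounds_for_kind is a made-up helper name and not something in the PR:

```python
def bounds_for_kind(sf_low, sf_high, kind):
    """Map SF-scale bounds to the requested kind ('sf' or 'cdf').

    Hypothetical helper: since CDF = 1 - SF, the lower CDF bound is the
    complement of the upper SF bound, and vice versa.
    """
    if kind == 'sf':
        return list(sf_low), list(sf_high)
    cdf_low = [1 - h for h in sf_high]
    cdf_high = [1 - l for l in sf_low]
    return cdf_low, cdf_high


low, high = bounds_for_kind([0.25], [0.75], 'cdf')
print(low, high)
```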
Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
Reference issue
Follow-up to gh-18092
What does this implement/fix?
This adds confidence intervals for the CDF/SF estimated by stats.ecdf.
Additional information
Other confidence interval methods can/will be added in follow-up PRs.
The default CI implemented here is the oldest and most common, and it's the one Matlab uses. Another, more modern method is also included.
I didn't add an example because the existing data yields NaNs in the variance calculation (correctly). I plan to substantially revise the example in a follow-up, at which point I'll show how to use the confidence_interval method.
Because this makes the cdf and sf attributes objects (instead of arrays), these objects can easily be endowed with __call__ methods that evaluate the ECDF/ESF at user-specified points. This will respond to requests from the original PR. Done. Should the upper and lower confidence intervals be objects with __call__ methods, too?
Perhaps we should remove the points attributes of the EmpiricalDistributionFunction objects after all? The user can recover the points using the __call__ method at the ECDFResult points x.
We've discussed that it's weird that stats.ecdf returns an object with an attribute sf. What about renaming stats.ecdf to something like stats.nonparametric_fit, and ECDFResult would become NonparametricFitResult? Ultimately, these classes should be documented on their own like FitResult is.
Matlab returns confidence intervals in which the first value is NaN, even though that is not produced by Greenwood's formula. Intuitively, this makes some sense, but I'd like to check against some other references before I implement this. Also, Matlab returns NaNs in the confidence intervals, but arguably [0, 1] could be used instead of [np.nan, np.nan].
To do?:
See what R does when we/Matlab produce NaNs. We agree with R survival survfit. Mathematica agrees with Matlab. Lifelines never seems to produce NaNs.
methods
__call__ methods to cdf/sf
__call__ methods to low/high ends of confidence interval?
ecdf to nonparametric_fit?
points attributes of cdf/sf?
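For the open question about NaNs, the alternative mentioned above (using the trivial interval [0, 1] where the formula is undefined) could be sketched as a post-processing step. This is only an illustration of that idea; fill_trivial_bounds is a hypothetical name, and the PR itself returns NaNs.

```python
import math


def fill_trivial_bounds(low, high):
    """Replace NaN confidence limits with the trivial interval [0, 1].

    Hypothetical helper: a NaN lower limit becomes 0 and a NaN upper
    limit becomes 1; finite limits are left unchanged.
    """
    new_low = [0.0 if math.isnan(v) else v for v in low]
    new_high = [1.0 if math.isnan(v) else v for v in high]
    return new_low, new_high


low, high = fill_trivial_bounds([math.nan, 0.4], [math.nan, 0.9])
print(low, high)
```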