ENH: stats.ecdf: add `confidence_interval` methods #18136

mdhaber · 2023-03-12T01:03:48Z

Reference issue

Follow-up to gh-18092

What does this implement/fix?

This adds confidence intervals for the CDF/SF estimated by stats.ecdf.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Enter censored data
times = [24, 3, 11, 19, 24, 13, 14, 2, 18, 17, 24, 21, 12, 1, 10, 23, 6, 5, 9, 17]
died = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
sample = stats.CensoredData.right_censored(times, np.logical_not(died))

# Generate point estimate and confidence interval of the survival function
res = stats.ecdf(sample)
point_estimate = res.sf.points
ci = res.sf.confidence_interval(0.90)

# plotting
ax = plt.subplot()
x = np.insert(res.x, 0, 0)  # plots look better when a leftmost point is added
ax.step(x, np.insert(point_estimate, 0, 1), color='C0', where='post')
ax.step(x, np.insert(ci.low, 0, 1), color='C0', linestyle='dashed', where='post')
ax.step(x, np.insert(ci.high, 0, 1),  color='C0', linestyle='dashed', where='post')
ax.set_xlabel("time")
ax.set_ylabel("survival function")
ax.set_title("Survival function point estimate and 90% CI")
plt.show()

Additional information

~~Other confidence interval methods can/will be added in follow-up PRs.~~ The default CI implemented here is the oldest and most common, and it's the one Matlab uses. Another, more modern method is also included.

I didn't add an example because the existing data yields NaNs in the variance calculation (correctly). I plan to substantially revise the example in a follow-up, at which point I'll show how to use the confidence_interval method.

Because this makes the cdf and sf attributes objects (instead of arrays), these objects can easily be endowed with __call__ methods that evaluate the ECDF/ESF at user-specified points. This will respond to requests from the original PR. Done. Should the upper and lower confidence intervals be objects with __call__ methods, too?

Perhaps we should remove the points attributes of the EmpiricalDistributionFunction objects after all? The user can recover the points using the __call__ method at the ECDFResult points x.

We've discussed that it's weird that stats.ecdf returns an object with an attribute sf. What about renaming stats.ecdf to something like stats.nonparametric_fit, and ECDFResult would become NonparametricFitResult? Ultimately, these classes should be documented on their own like FitResult is.

Matlab returns confidence intervals in which the first value as NaN, even though that is not produced by Greenwood's formula. Intuitively, this makes some sense, but I'd like to check against some other references before I implement this. Also, Matlab returns NaNs in the confidence intervals, but arguably [0, 1] could be used instead of [np.nan, np.nan].

To do?:

~~See what R does when we/Matlab produce NaNs~~ We agree with R survival survfit. Mathematica agrees with Matlab. Lifelines never seems to produce NaNs.
Add other confidence interval methods
Replace example
Email about additional interface changes
Add __call__ methods to cdf/sf
Add __call__ methods to low/high ends of confidence interval?
Rename ecdf to nonparametric_fit?
Remove points attributes of cdf/sf?

tupui · 2023-03-13T14:50:59Z

About removing points, I think we can consider that if E also remove x. In the end it's a matter of thinking how this is going to be used the most I suppose.

We could still have this private for a plot method.

About the renaming. We could although I am not sure about the name. This is quite obscure for non experts I think. Not that I am reference, but I would not have guess the meaning of that function.

Maybe ecdf some ok if instead we call sf: ccdf. I know sf is for consistency and linked also to survival analysis, but ccdf is more descriptive.

tupui

Thanks Matt! Some suggestions

scipy/stats/_survival.py

scipy/stats/tests/test_survival.py

…ervals

mdhaber · 2023-03-18T17:16:54Z

If tests are passing, I think this is about ready @tupui. I added tests that cover both types of confidence intervals at varying confidence_levels against a few references (R, Matlab, Mathematica), and I added a test that is mostly interested in NaNs that are produced by the formulas. There is a lot of disagreement here - we agree with R's survival package survfit function, and Mathematica and Matlab agree with each other, but there is some disagreement about leading NaNs that don't seem to be produced natively by the formulas. ISTM that lifelines turns NaNs into 1s or 0s. Anyway, I think we do a reasonable thing in following the formulas strictly, and R's survival package is solid precedent.

CI failures seem unrelated.

It might be useful to have examples of reference code all in one place.

This function with this PR (documentation)

import numpy as np
from scipy import stats
times = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81]  # times
died = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # 1 means deaths (not censored)
sample = stats.CensoredData.right_censored(times, np.logical_not(died))
res = stats.ecdf(sample)

ci = res.sf.confidence_interval(method='log-log', confidence_level=0.95)
print(np.array([ci.low, ci.high]).T)
print(res.sf.points)

Lifelines (https://colab.research.google.com/, documentation)

# !pip install lifelines  # I used Colab
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()

t1 = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81]  # times
d1 = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # 1 means deaths (not censored)

kmf.fit(t1, event_observed=d1)
kmf.survival_function_
kmf.confidence_interval_survival_function_

R survival library survfit function (https://rdrr.io/snippets/, documentation)

library(survival)
options(digits=16)
time = c(37, 43, 47, 56, 60, 62, 71, 77, 80, 81)
status = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1)
res = survfit(Surv(time, status)~1, conf.type="log-log", conf.int=0.95)
res$time; res$lower; res$upper

Matlab (https://matlab.mathworks.com/, documentation)

format long
 t = [37 43 47 56 60 62 71 77 80 81];
d = [0 0 1 1 0 0 0 1 1 1];
censored = ~d;
[f, x, flo, fup] = ecdf(t, 'Censoring', censored, 'Function', 'survivor', 'Alpha', 0.05);

Mathematica (https://www.wolframcloud.com/, documentation)

e = {24, 3, 11, 19, 24, 13, 14, 2, 18, 17, 24, 21, 12, 1, 10, 23, 6, 5, 9, 17}
ci = {1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0}
R = EventData[e, ci]
S = SurvivalModelFit[R]
S["PointwiseIntervals", ConfidenceLevel->0.95, ConfidenceTransform->"LogLog"]

mdhaber · 2023-03-19T23:50:47Z

To help us resolve lingering interface questions, I wanted to compile some information about existing nonparametric fit features in other libraries/languages.

--

lifelines has a KaplanMeierFitter with a fit method that accepts the durations and booleans indicating whether the event was observed. The confidence level alpha must be provided when fitting, and there is only one method (log-log) for producing the confidence interval. All attributes we might want (point and confidence interval CDF/SF) are in a flat structure. The object has attributes:

durations (our x)
cumulative_density_, survival_function_ (currently cdf.points, `sf.points)
confidence_interval_survival_function_, confidence_interval_cumulative_density_
median_survival_time_ (we considered having ppf, isf callables)
cumulative_density_at_times, survival_function_at_times to evaluate CDF/SF at provided times
plot_cumulative_density, plot_survival_function

It's nice to have a flat structure, but the downsides include a long list of attributes for a single object, long attribute names, and lots of documentation duplication. I also don't think that confidence limits should need to be provided at the times of fitting.

R's survival library has a survfit function that accepts a "formula" and returns a survfit object. It only computes the survival function and related quantities, not the CDF, so it's easier for it to be flat. The confidence level and method conf.int/conf.type are provided as additional arguments when calling survfit. All attributes we might want (point and confidence interval CDF/SF) are in a flat structure. It has attributes:

time (our x)
surv (our sf.points)
lower and upper

The downsides of this approach are similar to lifelines, and some of the bells and whistles we have considered don't seem to be implemented as attributes of the return object (although there might be other functions for doing these things). I'd rather keep this self-contained to avoid disorganized growth of SciPy's survival analysis features.

Matlab's ecdf function returns just points, points and confidence intervals, or nothing (but produces a plot) depending on how many variables there are to accept the output (e.g. [f,x] = ecdf(y) vs [f,x,flo,fup] = ecdf(y) vs ecdf(y)). You get the SF instead of the CDF by passing a pair of strings e.g. [f,x] = ecdf(y, 'Function', 'survivor'). Options for confidence interval level and type are specified similarly.

Mathematica's SurvivalModelFit is difficult to describe from a Python perspective. It returns an object with a lot of properties.

Some of those properties work like functions. For instance, S["CDF"] works like a function that can be evaluated at any point.
Some work like attributes. For instance, S["PointwiseIntervals"] returns confidence limits.
Sometimes the analogies break down. S["PointwiseIntervals", ConfidenceLevel → 0.9, ConfidenceTransform → "LogLog"] returns a 90% confidence interval with the log-log method. It can do a lot but the syntax is very foreign.

Statsmodels has two relevant functions that I've found.

ECDF just creates a step function callable from data.

SurvfuncRight creates an object representing only the survival function. It has attributes:

surv_times (our x)
surv_prob (our `sf.points)
simultaneous_cb`` (a method like confidence_interval`)
quantile callable (like we've considered for isf)
quantile_ci callable (confidence interval around the quantile)
plot

One thing I notice is that R's survfit interface seems simpler because it returns an object focused on only one distribution function - the survival function. For the survival function, it gives the x-coordinates ("times"/"durations") that were in the data, the y-coordinates (point estimates of the survival function), and lower/upper bounds of a confidence interval.

In a sense, Matlab does that, too. If you specify 'Function', 'survivor' you get the xs, ys, and lower/upper bounds of the confidence interval, all for the the SF. (If you specify 'Function', 'cdf', you get all that for the CDF).

Ignoring the quantile and quantile_ci attributes, so does statsmodels SurvfuncRight. surv_times for the x-coordinates, surv_prob for the y-coordinates, and simultaneous_cb for confidence intervals.

I like this grouping of everything about the survival function together. In our situation, we want more than just the survival function - we also want the CDF, and why not also the PPF and ISF (at some point). It seems natural, then, for each of these distribution functions to have a similar interface to the SF - each comes with its x-coordinates, y-coordinates, and confidence interval.

If I had my way:
nonparametric_fit (instead of ecdf) would return a NonparametricFitResult.

A NonparametricFitResult would have four attributes:

cdf
sf
ppf
isf
(although I'm not sure if we want to do ppf and isf right now).
Each of these would be an object of type EmpiricalDistributionFunction.

An EmpiricalDistributionFunction has the following attributes:

p (for probability) and q (for quantile): paired arrays containing the "x" and "y" coordinates of the function that were actually observed. For example, for the survival function, these are like the x and sf.points arrays we have now.
confidence_interval: a method for getting the lower/upper bounds of the confidence interval (mostly like w have now).
evaluate: a method for evaluating the y-coordinates given x-coordinates (like the __call__ method we had. I think giving the function a name keeps things more organized.).
plot (eventually): a method for adding a plot of the function to a new or existing axis.

The confidence_interval method accepts parameters confidence_level and method (as it does now), and returns two attributes, low and high. Each of these is an EmpiricalDistributionFunction. The only difference between these and the EmpiricalDistributionFunctions above is that their confidence_interval method would be disabled - either it doesn't exist, or it raises a NotImplementedError.

To create a plot of the survival function point estimate and confidence interval:

res = NonparametricFit(sample)
ci = res.sf.confidence_interval()
_, ax = res.sf.plot()
ci.low.plot(ax)
ci.high.plot(ax)

To do QMC bootstrap resampling based on the sample:

res = NonparametricFit(sample)
qrng = Sobol(1)
p = qrng.random(size)
q = res.ppf.evaluate(p)

To get the median survival time and confidence interval:

res = NonparametricFit(sample)
ci = res.isf.confidence_interval()
median = res.isf.evaluate(0.5)
low, high = ci.low.evaluate(0.5), ci.high.evaluate(0.5

tupui · 2023-03-20T16:54:37Z

I think I am in general fine with the proposal. I just don't like much the plotting for the CI.

ATM we have a CI object which does not have a plotting semantic. If we add this capability here, then one could ask to do something similar for other CI objects (e.g. for Sobol' indices we can have arrays as well and could then ask to have a plot function.) Hence it would not be a namedTuple anymore but a normal object.

If we add a plot method, I would maybe prefer to directly be able to specify there a confidence level and have it plotted directly using this function (in this scenario confidence_interval would accept an input array to get CI for the given points and allow plotting values where we want.) Since SciPy is not really a plotting library, I am not sure we want to add too many plot functions.

mdhaber · 2023-03-20T18:12:08Z

ATM we have a CI object which does not have a plotting semantic. If we add this capability here, then one could ask to do something similar for other CI objects (e.g. for Sobol' indices we can have arrays as well and could then ask to have a plot function.) Hence it would not be a namedTuple anymore but a normal object.

The plot method doesn't belong to the CI object. The CI object is still a ConfidenceInterval object like any other one; it's just that the objects it contains are EmpiricalDistributionFunctions rather than scalars.

In any case, we don't need to add it here or even make the structural changes here, since we had discussed not letting the mathy parts of this PR get held up by interface. I was just recording my thoughts here. Do the algorithmic parts look OK?

After this, I'd open a PR with the restructuring and send an email to the mailing list, since it has changed a lot since the initial email.

scipy/stats/_survival.py

tupui

Thanks Matt! LGTM with the latest updates. For people following, we had some extensive discussions offline about the API and I think we found a good compromise between simplicity, extendability and maintenance.

(I will merge after including the last suggestion about keyword only. The CI is green besides usual failures.)

tupui · 2023-03-21T12:12:28Z

scipy/stats/_survival.py

+        self._x = x
+        self._n = n
+        self._d = d
+        self._sf = points if kind == 'sf' else 1 - points


This is private so ok. If we have some follow up consider naming this self._complement so it's not tainted with CDF/SF.

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

mdhaber added 5 commits March 6, 2023 14:35

MAINT: stats: make cdf and sf attributes objects

6923bd3

ENH: stats.ecdf: add confidence intervals

6c8ee71

DOC: stats.ecdf: add reference for Greenwood's formula

b1a83dc

MAINT: stats.ecdf: adjust import order

485915d

DOC: stats.ecdf: improve documentation of new objects

553843c

mdhaber added scipy.stats enhancement A new feature or improvement labels Mar 12, 2023

mdhaber requested a review from tupui March 12, 2023 01:03

mdhaber added 2 commits March 11, 2023 22:38

MAINT: stats.ecdf: add __call__ methods, improve documentation

00369b5

ENH: stats.ecdf: add exponential greenwood CI

6c7a6ae

mdhaber changed the title ~~ENH: stats.ecdf: add confidence intervals~~ ENH: stats.ecdf: add confidence_intervals and __call__ methods Mar 12, 2023

mdhaber changed the title ~~ENH: stats.ecdf: add confidence_intervals and __call__ methods~~ ENH: stats.ecdf: add confidence_interval and __call__ methods Mar 12, 2023

tupui reviewed Mar 13, 2023

View reviewed changes

mdhaber commented Mar 14, 2023

View reviewed changes

scipy/stats/tests/test_survival.py Show resolved Hide resolved

mdhaber commented Mar 14, 2023

View reviewed changes

scipy/stats/tests/test_survival.py Show resolved Hide resolved

mdhaber added 2 commits March 17, 2023 19:53

MAINT: stats.ecdf: remove __call__ methods

9189c79

Merge remote-tracking branch 'upstream/main' into ecdf_confidence_int…

a44db0c

…ervals

mdhaber changed the title ~~ENH: stats.ecdf: add confidence_interval and __call__ methods~~ ENH: stats.ecdf: add confidence_interval methods Mar 18, 2023

TST: stats.ecdf: add tests

a11ff6e

mdhaber requested a review from tupui March 18, 2023 17:17

mdhaber commented Mar 20, 2023

View reviewed changes

scipy/stats/_survival.py Outdated Show resolved Hide resolved

tupui approved these changes Mar 21, 2023

View reviewed changes

MAINT: method as keyword only argument in CI. [skip ci]

2aabcea

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

tupui merged commit fbfb107 into scipy:main Mar 21, 2023

tupui added this to the 1.11.0 milestone Mar 21, 2023

This was referenced Mar 22, 2023

MAINT: stats.ecdf: store number at risk just before events #18187

Merged

ENH: stats.ecdf: add evaluate and plot methods; restructure result #18212

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: stats.ecdf: add `confidence_interval` methods #18136

ENH: stats.ecdf: add `confidence_interval` methods #18136

mdhaber commented Mar 12, 2023 •

edited

Loading

tupui commented Mar 13, 2023 •

edited

Loading

tupui left a comment

mdhaber commented Mar 18, 2023 •

edited

Loading

mdhaber commented Mar 19, 2023 •

edited

Loading

tupui commented Mar 20, 2023

mdhaber commented Mar 20, 2023 •

edited

Loading

tupui left a comment

tupui Mar 21, 2023

ENH: stats.ecdf: add confidence_interval methods #18136

ENH: stats.ecdf: add confidence_interval methods #18136

Conversation

mdhaber commented Mar 12, 2023 • edited Loading

Reference issue

What does this implement/fix?

Additional information

tupui commented Mar 13, 2023 • edited Loading

tupui left a comment

Choose a reason for hiding this comment

mdhaber commented Mar 18, 2023 • edited Loading

mdhaber commented Mar 19, 2023 • edited Loading

tupui commented Mar 20, 2023

mdhaber commented Mar 20, 2023 • edited Loading

tupui left a comment

Choose a reason for hiding this comment

tupui Mar 21, 2023

Choose a reason for hiding this comment

ENH: stats.ecdf: add `confidence_interval` methods #18136

ENH: stats.ecdf: add `confidence_interval` methods #18136

mdhaber commented Mar 12, 2023 •

edited

Loading

tupui commented Mar 13, 2023 •

edited

Loading

mdhaber commented Mar 18, 2023 •

edited

Loading

mdhaber commented Mar 19, 2023 •

edited

Loading

mdhaber commented Mar 20, 2023 •

edited

Loading