Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats.ecdf: add confidence_interval methods #18136

Merged
merged 11 commits into from
Mar 21, 2023

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Mar 12, 2023

Reference issue

Follow-up to gh-18092

What does this implement/fix?

This adds confidence intervals for the CDF/SF estimated by stats.ecdf.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Enter censored data
times = [24, 3, 11, 19, 24, 13, 14, 2, 18, 17, 24, 21, 12, 1, 10, 23, 6, 5, 9, 17]
died = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
sample = stats.CensoredData.right_censored(times, np.logical_not(died))

# Generate point estimate and confidence interval of the survival function
res = stats.ecdf(sample)
point_estimate = res.sf.points
ci = res.sf.confidence_interval(0.90)

# plotting
ax = plt.subplot()
x = np.insert(res.x, 0, 0)  # plots look better when a leftmost point is added
ax.step(x, np.insert(point_estimate, 0, 1), color='C0', where='post')
ax.step(x, np.insert(ci.low, 0, 1), color='C0', linestyle='dashed', where='post')
ax.step(x, np.insert(ci.high, 0, 1),  color='C0', linestyle='dashed', where='post')
ax.set_xlabel("time")
ax.set_ylabel("survival function")
ax.set_title("Survival function point estimate and 90% CI")
plt.show()

image

Additional information

Other confidence interval methods can/will be added in follow-up PRs. The default CI implemented here is the oldest and most common, and it's the one Matlab uses. Another, more modern method is also included.

I didn't add an example because the existing data yields NaNs in the variance calculation (correctly). I plan to substantially revise the example in a follow-up, at which point I'll show how to use the confidence_interval method.

Because this makes the cdf and sf attributes objects (instead of arrays), these objects can easily be endowed with __call__ methods that evaluate the ECDF/ESF at user-specified points. This will respond to requests from the original PR. Done. Should the upper and lower confidence intervals be objects with __call__ methods, too?

Perhaps we should remove the points attributes of the EmpiricalDistributionFunction objects after all? The user can recover the points using the __call__ method at the ECDFResult points x.

We've discussed that it's weird that stats.ecdf returns an object with an attribute sf. What about renaming stats.ecdf to something like stats.nonparametric_fit, and ECDFResult would become NonparametricFitResult? Ultimately, these classes should be documented on their own like FitResult is.

Matlab returns confidence intervals in which the first value as NaN, even though that is not produced by Greenwood's formula. Intuitively, this makes some sense, but I'd like to check against some other references before I implement this. Also, Matlab returns NaNs in the confidence intervals, but arguably [0, 1] could be used instead of [np.nan, np.nan].

To do?:

  • See what R does when we/Matlab produce NaNs We agree with R survival survfit. Mathematica agrees with Matlab. Lifelines never seems to produce NaNs.
  • Add other confidence interval methods
  • Replace example
  • Email about additional interface changes
  • Add __call__ methods to cdf/sf
  • Add __call__ methods to low/high ends of confidence interval?
  • Rename ecdf to nonparametric_fit?
  • Remove points attributes of cdf/sf?

@mdhaber mdhaber added scipy.stats enhancement A new feature or improvement labels Mar 12, 2023
@mdhaber mdhaber requested a review from tupui March 12, 2023 01:03
@mdhaber mdhaber changed the title ENH: stats.ecdf: add confidence intervals ENH: stats.ecdf: add confidence_intervals and __call__ methods Mar 12, 2023
@mdhaber mdhaber changed the title ENH: stats.ecdf: add confidence_intervals and __call__ methods ENH: stats.ecdf: add confidence_interval and __call__ methods Mar 12, 2023
@tupui
Copy link
Member

tupui commented Mar 13, 2023

About removing points, I think we can consider that if E also remove x. In the end it's a matter of thinking how this is going to be used the most I suppose.

We could still have this private for a plot method.

About the renaming. We could although I am not sure about the name. This is quite obscure for non experts I think. Not that I am reference, but I would not have guess the meaning of that function.

Maybe ecdf some ok if instead we call sf: ccdf. I know sf is for consistency and linked also to survival analysis, but ccdf is more descriptive.

Copy link
Member

@tupui tupui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt! Some suggestions

scipy/stats/_survival.py Outdated Show resolved Hide resolved
scipy/stats/_survival.py Outdated Show resolved Hide resolved
scipy/stats/_survival.py Show resolved Hide resolved
scipy/stats/_survival.py Show resolved Hide resolved
scipy/stats/_survival.py Show resolved Hide resolved
scipy/stats/_survival.py Outdated Show resolved Hide resolved
scipy/stats/_survival.py Outdated Show resolved Hide resolved
scipy/stats/_survival.py Outdated Show resolved Hide resolved
scipy/stats/_survival.py Outdated Show resolved Hide resolved
@mdhaber mdhaber changed the title ENH: stats.ecdf: add confidence_interval and __call__ methods ENH: stats.ecdf: add confidence_interval methods Mar 18, 2023
@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 18, 2023

If tests are passing, I think this is about ready @tupui. I added tests that cover both types of confidence intervals at varying confidence_levels against a few references (R, Matlab, Mathematica), and I added a test that is mostly interested in NaNs that are produced by the formulas. There is a lot of disagreement here - we agree with R's survival package survfit function, and Mathematica and Matlab agree with each other, but there is some disagreement about leading NaNs that don't seem to be produced natively by the formulas. ISTM that lifelines turns NaNs into 1s or 0s. Anyway, I think we do a reasonable thing in following the formulas strictly, and R's survival package is solid precedent.

CI failures seem unrelated.

It might be useful to have examples of reference code all in one place.

This function with this PR (documentation)

import numpy as np
from scipy import stats
times = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81]  # times
died = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # 1 means deaths (not censored)
sample = stats.CensoredData.right_censored(times, np.logical_not(died))
res = stats.ecdf(sample)

ci = res.sf.confidence_interval(method='log-log', confidence_level=0.95)
print(np.array([ci.low, ci.high]).T)
print(res.sf.points)

Lifelines (https://colab.research.google.com/, documentation)

# !pip install lifelines  # I used Colab
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()

t1 = [37, 43, 47, 56, 60, 62, 71, 77, 80, 81]  # times
d1 = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # 1 means deaths (not censored)

kmf.fit(t1, event_observed=d1)
kmf.survival_function_
kmf.confidence_interval_survival_function_

R survival library survfit function (https://rdrr.io/snippets/, documentation)

library(survival)
options(digits=16)
time = c(37, 43, 47, 56, 60, 62, 71, 77, 80, 81)
status = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1)
res = survfit(Surv(time, status)~1, conf.type="log-log", conf.int=0.95)
res$time; res$lower; res$upper

Matlab (https://matlab.mathworks.com/, documentation)

format long
 t = [37 43 47 56 60 62 71 77 80 81];
d = [0 0 1 1 0 0 0 1 1 1];
censored = ~d;
[f, x, flo, fup] = ecdf(t, 'Censoring', censored, 'Function', 'survivor', 'Alpha', 0.05);

Mathematica (https://www.wolframcloud.com/, documentation)

e = {24, 3, 11, 19, 24, 13, 14, 2, 18, 17, 24, 21, 12, 1, 10, 23, 6, 5, 9, 17}
ci = {1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0}
R = EventData[e, ci]
S = SurvivalModelFit[R]
S["PointwiseIntervals", ConfidenceLevel->0.95, ConfidenceTransform->"LogLog"]

@mdhaber mdhaber requested a review from tupui March 18, 2023 17:17
@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 19, 2023

To help us resolve lingering interface questions, I wanted to compile some information about existing nonparametric fit features in other libraries/languages.

--

lifelines has a KaplanMeierFitter with a fit method that accepts the durations and booleans indicating whether the event was observed. The confidence level alpha must be provided when fitting, and there is only one method (log-log) for producing the confidence interval. All attributes we might want (point and confidence interval CDF/SF) are in a flat structure. The object has attributes:

  • durations (our x)
  • cumulative_density_, survival_function_ (currently cdf.points, `sf.points)
  • confidence_interval_survival_function_, confidence_interval_cumulative_density_
  • median_survival_time_ (we considered having ppf, isf callables)
  • cumulative_density_at_times, survival_function_at_times to evaluate CDF/SF at provided times
  • plot_cumulative_density, plot_survival_function

It's nice to have a flat structure, but the downsides include a long list of attributes for a single object, long attribute names, and lots of documentation duplication. I also don't think that confidence limits should need to be provided at the times of fitting.


R's survival library has a survfit function that accepts a "formula" and returns a survfit object. It only computes the survival function and related quantities, not the CDF, so it's easier for it to be flat. The confidence level and method conf.int/conf.type are provided as additional arguments when calling survfit. All attributes we might want (point and confidence interval CDF/SF) are in a flat structure. It has attributes:

  • time (our x)
  • surv (our sf.points)
  • lower and upper

The downsides of this approach are similar to lifelines, and some of the bells and whistles we have considered don't seem to be implemented as attributes of the return object (although there might be other functions for doing these things). I'd rather keep this self-contained to avoid disorganized growth of SciPy's survival analysis features.


Matlab's ecdf function returns just points, points and confidence intervals, or nothing (but produces a plot) depending on how many variables there are to accept the output (e.g. [f,x] = ecdf(y) vs [f,x,flo,fup] = ecdf(y) vs ecdf(y)). You get the SF instead of the CDF by passing a pair of strings e.g. [f,x] = ecdf(y, 'Function', 'survivor'). Options for confidence interval level and type are specified similarly.


Mathematica's SurvivalModelFit is difficult to describe from a Python perspective. It returns an object with a lot of properties.
image

Some of those properties work like functions. For instance, S["CDF"] works like a function that can be evaluated at any point.
Some work like attributes. For instance, S["PointwiseIntervals"] returns confidence limits.
Sometimes the analogies break down. S["PointwiseIntervals", ConfidenceLevel → 0.9, ConfidenceTransform → "LogLog"] returns a 90% confidence interval with the log-log method. It can do a lot but the syntax is very foreign.


Statsmodels has two relevant functions that I've found.

ECDF just creates a step function callable from data.

SurvfuncRight creates an object representing only the survival function. It has attributes:

  • surv_times (our x)
  • surv_prob (our `sf.points)
  • simultaneous_cb`` (a method like confidence_interval`)
  • quantile callable (like we've considered for isf)
  • quantile_ci callable (confidence interval around the quantile)
  • plot

One thing I notice is that R's survfit interface seems simpler because it returns an object focused on only one distribution function - the survival function. For the survival function, it gives the x-coordinates ("times"/"durations") that were in the data, the y-coordinates (point estimates of the survival function), and lower/upper bounds of a confidence interval.

In a sense, Matlab does that, too. If you specify 'Function', 'survivor' you get the xs, ys, and lower/upper bounds of the confidence interval, all for the the SF. (If you specify 'Function', 'cdf', you get all that for the CDF).

Ignoring the quantile and quantile_ci attributes, so does statsmodels SurvfuncRight. surv_times for the x-coordinates, surv_prob for the y-coordinates, and simultaneous_cb for confidence intervals.

I like this grouping of everything about the survival function together. In our situation, we want more than just the survival function - we also want the CDF, and why not also the PPF and ISF (at some point). It seems natural, then, for each of these distribution functions to have a similar interface to the SF - each comes with its x-coordinates, y-coordinates, and confidence interval.

If I had my way:
nonparametric_fit (instead of ecdf) would return a NonparametricFitResult.

A NonparametricFitResult would have four attributes:

  • cdf
  • sf
  • ppf
  • isf
    (although I'm not sure if we want to do ppf and isf right now).
    Each of these would be an object of type EmpiricalDistributionFunction.

An EmpiricalDistributionFunction has the following attributes:

  • p (for probability) and q (for quantile): paired arrays containing the "x" and "y" coordinates of the function that were actually observed. For example, for the survival function, these are like the x and sf.points arrays we have now.
  • confidence_interval: a method for getting the lower/upper bounds of the confidence interval (mostly like w have now).
  • evaluate: a method for evaluating the y-coordinates given x-coordinates (like the __call__ method we had. I think giving the function a name keeps things more organized.).
  • plot (eventually): a method for adding a plot of the function to a new or existing axis.

The confidence_interval method accepts parameters confidence_level and method (as it does now), and returns two attributes, low and high. Each of these is an EmpiricalDistributionFunction. The only difference between these and the EmpiricalDistributionFunctions above is that their confidence_interval method would be disabled - either it doesn't exist, or it raises a NotImplementedError.

To create a plot of the survival function point estimate and confidence interval:

res = NonparametricFit(sample)
ci = res.sf.confidence_interval()
_, ax = res.sf.plot()
ci.low.plot(ax)
ci.high.plot(ax)

To do QMC bootstrap resampling based on the sample:

res = NonparametricFit(sample)
qrng = Sobol(1)
p = qrng.random(size)
q = res.ppf.evaluate(p)

To get the median survival time and confidence interval:

res = NonparametricFit(sample)
ci = res.isf.confidence_interval()
median = res.isf.evaluate(0.5)
low, high = ci.low.evaluate(0.5), ci.high.evaluate(0.5

@tupui
Copy link
Member

tupui commented Mar 20, 2023

I think I am in general fine with the proposal. I just don't like much the plotting for the CI.

ATM we have a CI object which does not have a plotting semantic. If we add this capability here, then one could ask to do something similar for other CI objects (e.g. for Sobol' indices we can have arrays as well and could then ask to have a plot function.) Hence it would not be a namedTuple anymore but a normal object.

If we add a plot method, I would maybe prefer to directly be able to specify there a confidence level and have it plotted directly using this function (in this scenario confidence_interval would accept an input array to get CI for the given points and allow plotting values where we want.) Since SciPy is not really a plotting library, I am not sure we want to add too many plot functions.

@mdhaber
Copy link
Contributor Author

mdhaber commented Mar 20, 2023

ATM we have a CI object which does not have a plotting semantic. If we add this capability here, then one could ask to do something similar for other CI objects (e.g. for Sobol' indices we can have arrays as well and could then ask to have a plot function.) Hence it would not be a namedTuple anymore but a normal object.

The plot method doesn't belong to the CI object. The CI object is still a ConfidenceInterval object like any other one; it's just that the objects it contains are EmpiricalDistributionFunctions rather than scalars.

In any case, we don't need to add it here or even make the structural changes here, since we had discussed not letting the mathy parts of this PR get held up by interface. I was just recording my thoughts here. Do the algorithmic parts look OK?

After this, I'd open a PR with the restructuring and send an email to the mailing list, since it has changed a lot since the initial email.

scipy/stats/_survival.py Outdated Show resolved Hide resolved
Copy link
Member

@tupui tupui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt! LGTM with the latest updates. For people following, we had some extensive discussions offline about the API and I think we found a good compromise between simplicity, extendability and maintenance.

(I will merge after including the last suggestion about keyword only. The CI is green besides usual failures.)

self._x = x
self._n = n
self._d = d
self._sf = points if kind == 'sf' else 1 - points
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is private so ok. If we have some follow up consider naming this self._complement so it's not tainted with CDF/SF.

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@tupui tupui merged commit fbfb107 into scipy:main Mar 21, 2023
@tupui tupui added this to the 1.11.0 milestone Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants