
DOC: Likelihood function depending on censoring type #19732

Closed
gbene opened this issue Dec 22, 2023 · 12 comments · Fixed by #20222

gbene commented Dec 22, 2023

Issue with current documentation:

Hi all!

I am using the CensoredData class introduced in SciPy v1.11.0 to model a censored dataset. I think the documentation for this class is not very clear about which type of censoring is considered, or whether it is possible to change the likelihood function depending on the censoring type. In the text "Reliability and Survival Analysis" by Karim and Islam (2019), three types of censoring are defined: Type I, Type II, and random; each type has a different formulation of the likelihood (chapter 4.4).

  • Type I: $L = \prod_{i=1}^{n} f(t_i)^{\delta_i}\, S(t_r)^{1-\delta_i}$
  • Type II: $L = \left[ \prod_{i=1}^{r} f(t_i) \right] S(t_r)^{n-r}$
  • Random censoring: the same as Type I if censoring is considered independent of the event.

There are also more complex, umbrella formulations that consider right, left, and interval censoring all together. Which of these formulations is used in SciPy?
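For context, this is the kind of usage I mean (a minimal example; the Weibull model and the numbers are arbitrary):

import numpy as np
from scipy import stats

# 10 observed failure times plus 3 units still running at t = 5 (right-censored)
observed = np.array([1.2, 0.8, 2.5, 3.1, 0.4, 1.9, 2.2, 0.9, 1.5, 2.8])
data = stats.CensoredData(uncensored=observed, right=np.full(3, 5.0))
shape, loc, scale = stats.weibull_min.fit(data, floc=0)  # MLE under censoring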

Idea or request for content:

I think that a more precise overview of the underlying likelihood calculation used in SciPy should be added.


mdhaber commented Dec 23, 2023

I believe this is the relevant code:

def _nnlf_and_penalty(self, x, args):

Since you have the book, please let us know what type this corresponds with.
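In outline, that function evaluates something like the following (a simplified sketch, not the actual SciPy source; the argument names are placeholders for the pieces a CensoredData instance carries):

import numpy as np
from scipy import stats

def nnlf_censored_sketch(dist, uncensored, left, right, interval, *args):
    # log f(t_i) for exactly observed failure times
    total = np.sum(dist.logpdf(uncensored, *args))
    # log F(t_i) for left-censored observations (event happened before t_i)
    total += np.sum(dist.logcdf(left, *args))
    # log S(t_i) for right-censored observations (event happened after t_i)
    total += np.sum(dist.logsf(right, *args))
    # log [F(b) - F(a)] for interval-censored observations
    a, b = interval[:, 0], interval[:, 1]
    total += np.sum(np.log(dist.cdf(b, *args) - dist.cdf(a, *args)))
    return -total  # negative log-likelihood

# Example: exponential lifetimes (scale 2), test stopped at t = 3
rng = np.random.default_rng(1234)
t = rng.exponential(scale=2, size=100)
observed = t[t <= 3]
censored = np.full(np.count_nonzero(t > 3), 3.0)
nll = nnlf_censored_sketch(stats.expon, observed, np.empty(0),
                           censored, np.empty((0, 2)), 0, 2)  # loc=0, scale=2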

WarrenWeckesser commented Dec 29, 2023

@gbene, I don't have the Karim and Islam book, but there are other relevant references that we can all access.

Type I, Type II and random (or uninformative) censoring all involve mixes of uncensored and right-censored data. For example, in the Type I case, the failure times of $n$ units are observed in a fixed time interval T (or $t_r$ in your notation). For the units that fail within the given time interval, the failure times are uncensored. The failure times of the units that haven't failed by time T are unknown, but they are greater than T, so those observations are recorded as right-censored with lower bound T. The formula that you quote from Karim and Islam for the Type I case is equivalent to the formula used in SciPy for right-censoring (assuming that $\delta_i$ is the failure indicator: 1 for an observed failure, 0 for right-censored).
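Spelling that out: splitting the product over the observed failures ($\delta_i = 1$) and the censored units ($\delta_i = 0$) gives

$$L = \prod_{i=1}^{n} f(t_i)^{\delta_i}\, S(T)^{1-\delta_i} = \prod_{i:\,\delta_i=1} f(t_i) \prod_{i:\,\delta_i=0} S(T),$$

i.e. a density factor for each uncensored observation and a survival-function factor for each right-censored one, which is exactly the right-censoring likelihood that SciPy maximizes (in log form).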

Are you sure that the formula that you quoted for $L$ for the Type II case is correct? In the Type II case, $r$ is given, and the test is stopped when $r$ failures have occurred. So the failure times of the $n - r$ units that didn't fail are right-censored with lower bound $t_r$. I think the second factor in $L$ should be $S(t_r)^{n-r}$ instead of $S(t_i)^{n-r}$, since $t_r$ is the last observed failure time, when the test was stopped. (Note that this factor is outside of the product, so $i$ can't be referenced.)

gbene commented Dec 29, 2023

Hi all! Thank you for your answers!

The formula that you quote from Karim and Islam for the Type I case is equivalent to the formula used in SciPy for right-censoring.

@WarrenWeckesser I can see this in the code provided by @mdhaber. In the _nnlf_and_penalty function, the log-likelihood is used (understandably), and I can follow the code fairly well until the end. What I do not understand is the penalization factor and where it comes from. At the end, the total sum is penalized by the total number of bad values multiplied by _LOGXMAX * 100:

return -total + n_bad * _LOGXMAX * 100

Why is this?

Are you sure that the formula that you quoted for $L$ for the Type II case is correct?

Sorry, you are right! I made a mistake while writing the formula! I edited the original comment.

Thank you again!

mdhaber commented Dec 29, 2023

Why is this?

It is a penalty method for turning a constrained optimization problem into an unconstrained one. When all the observations are within the support of the distribution, there are no "bad" points, and the penalty disappears. The penalty is supposed to be very large so that the solver never converges to a solution where there is a penalty rather than to the true MLE. It looks unsophisticated to me, but I've been surprised at how well it works in practice.
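As a toy illustration (my own sketch, not the SciPy internals): fit a shifted exponential, where samples below loc are outside the support and would contribute -inf to the log-likelihood; the penalty replaces those contributions with a huge finite value so a generic solver can still make progress:

import numpy as np
from scipy import stats
from scipy.optimize import minimize

_LOGXMAX = np.log(np.finfo(float).max)  # the analogue of SciPy's _LOGXMAX

def penalized_nnlf(params, x):
    loc, scale = params
    logpdf = stats.expon.logpdf(x, loc=loc, scale=scale)
    bad = ~np.isfinite(logpdf)  # x < loc (outside the support) or invalid scale
    total = np.sum(logpdf[~bad])
    # replace each -inf/nan contribution with a huge finite penalty
    return -total + np.count_nonzero(bad) * _LOGXMAX * 100

rng = np.random.default_rng(0)
x = 5 + rng.exponential(scale=2, size=200)
res = minimize(penalized_nnlf, x0=[0.0, 1.0], args=(x,), method='Nelder-Mead')
# res.x should approach (min(x), mean(x - min(x))), the MLE of a shifted exponential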

gbene commented Dec 29, 2023

Alright! Thank you very much for these clarifications. I think they would be a useful addition to make the docs clearer about what is going on in the code!

mdhaber commented Dec 29, 2023

Would you like to submit a PR?

gbene commented Dec 29, 2023

mdhaber commented Dec 29, 2023

Yup, that would be great. Note that if you're working on Windows, building the documentation locally is tricky, so you can skip that step. In any case, please include the text [docs only] in the body of your commit message. This will ensure that the documentation is built when you submit the PR (without performing other unnecessary tests) so we can see how it renders.

Maybe draft what you plan to add here in the issue?

gbene commented Dec 29, 2023

All right! I think that I will add the discussed information to the CensoredData page after: "Left-, right-, and interval-censored data can be represented by CensoredData."

More specifically, I am planning to:

  1. Add a further distinction for right-censoring between Type I, Type II, and uninformative censoring.
  2. Clarify that Type I and Type II have different formulations of the likelihood.
  3. Point out that SciPy uses the generalized form (including left and interval censoring) and that, because of this, right-censored data must be Type I / uninformative, otherwise the likelihood calculation will be incorrect (see the sketch after this list).
  4. Maybe add a note about the applied penalty so that users know what happens when invalid data are present in the dataset.
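For reference, the generalized form I mean in point 3 (my transcription, to be checked against whichever source we cite) combines all four kinds of observations:

$$L = \prod_{i \in \mathcal{U}} f(t_i) \prod_{i \in \mathcal{L}} F(t_i) \prod_{i \in \mathcal{R}} S(t_i) \prod_{i \in \mathcal{I}} \bigl[ F(R_i) - F(L_i) \bigr]$$

where $\mathcal{U}$, $\mathcal{L}$, $\mathcal{R}$, $\mathcal{I}$ are the index sets of the uncensored, left-, right-, and interval-censored observations.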

What do you think? Is it too specific?

Also, it would be nice to add a reference for where the generalized formulation comes from. I have the book by Karim and Islam, but maybe you used other sources.

Also, I think that a clarification of this line is in order:

https://github.com/scipy/scipy/blob/5e4a5e3785f79dd4e8930eed883da89958860db2/scipy/stats/_distn_infrastructure.py#L2328C7-L2328C7

To estimate the interval-censoring term, the _delta_cdf function is used, but this causes a bit of confusion. The formulation in Karim and Islam states that the interval-censoring term is calculated as the difference between $S(L_i)$ and $S(R_i)$:

$$\prod_{i \in I} \left[ S(L_i) - S(R_i) \right]$$

where $I$ is the set of interval-censored times and the event occurred between $(L_i, R_i)$.

In SciPy this is implemented correctly, but only because the _delta_cdf function returns CDF(x2) - CDF(x1).

In the code, the usage is:

np.log(self._delta_cdf(i1, i2, *args))  

where i1 = $L_i$ and i2 = $R_i$, so in our case this is CDF($R_i$) - CDF($L_i$). This is the same as $S(L_i) - S(R_i)$ because CDF = 1 - SF.
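Explicitly, applying CDF = 1 - SF to both terms:

$$\mathrm{CDF}(R_i) - \mathrm{CDF}(L_i) = \bigl(1 - S(R_i)\bigr) - \bigl(1 - S(L_i)\bigr) = S(L_i) - S(R_i)$$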

So I think the comment in the code should also be extended to clarify this step. What do you think?

Thanks again!

mdhaber commented Dec 30, 2023

What do you think? Is it too specific?

I'd clarify what we do in SciPy, but not compare and contrast all the different options. Just add a reference for that (see how references are added in differential_entropy, for example). Also, this sort of information can be appended to the Notes section rather than included in the main description.

The penalty is already mentioned in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html#scipy.stats.rv_continuous.fit, which is where that would belong.

So I think that the comment in the code should also be extended to clarify this step, what do you think?

I don't understand the confusion. Is it about the equivalence of SF(x) - SF(y) and CDF(y) - CDF(x), or that the name _delta_cdf is not descriptive of CDF(y) - CDF(x)? With the name _delta_cdf, I would expect it to calculate the probability mass between x and y, which I think is more intuitive than either SF(x) - SF(y) or CDF(y) - CDF(x).

gbene commented Jan 2, 2024

I'd clarify what we do in SciPy, but not compare and contrast all the different options. Just add a reference for that (see how references are added in differential_entropy, for example). Also, this sort of information can be appended to the Notes section rather than included in the main description.

All right, perfect!

I don't understand the confusion.

I was mostly confused by the name; for me it did not immediately equate to CDF(i2) - CDF(i1). I initially thought it was the difference in input order (so CDF(i1) - CDF(i2)), and only after I read the description of _delta_cdf did I understand what was going on. I was suggesting adding a comment to avoid that back-and-forth while reading the code.

Thanks again!

mdhaber commented Jan 2, 2024

I was suggesting adding a comment to avoid that back-and-forth while reading the code.

Sure. Maybe phrase it like "calculate the probability mass between...".
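For example (hypothetical wording; the call itself is the line quoted above from the SciPy source):

# Compute the probability mass between i1 and i2, i.e. CDF(i2) - CDF(i1)
# (equivalently SF(i1) - SF(i2)); _delta_cdf preserves precision in the tails.
np.log(self._delta_cdf(i1, i2, *args))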
