
DOC: Likelihood function depending on censoring type #19732

Closed
gbene opened this issue Dec 22, 2023 · 12 comments · Fixed by #20222

gbene commented Dec 22, 2023

Issue with current documentation:

Hi all!

I am using the CensoredData class introduced in SciPy v1.11.0 to model a censored dataset. I think the documentation for this class is not very clear about which type of censoring is considered, or whether it is possible to change the likelihood function depending on the censoring type. In the text "Reliability and Survival Analysis" by Karim and Islam (2019), three types of censoring are defined: Type I, Type II, and random; each type has a different formulation of the likelihood (chapter 4.4).

  • Type I: $L = \prod_{i=1}^{n} f(t_i)^{\delta_i}\, S(t_r)^{1-\delta_i}$
  • Type II: $L = \left[ \prod_{i=1}^{r} f(t_i) \right] S(t_r)^{n-r}$
  • Random censoring: the same as Type I if censoring is considered independent of the event.

There are also more complex, umbrella formulations that consider right, left, and interval censoring all together. Which of these formulations is used in SciPy?
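For context, this is the kind of usage I mean (a minimal example; the Weibull model and the numbers are arbitrary):

import numpy as np
from scipy import stats

# 10 observed failure times plus 3 units still running at t = 5 (right-censored)
observed = np.array([1.2, 0.8, 2.5, 3.1, 0.4, 1.9, 2.2, 0.9, 1.5, 2.8])
data = stats.CensoredData(uncensored=observed, right=np.full(3, 5.0))
shape, loc, scale = stats.weibull_min.fit(data, floc=0)  # MLE under censoring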

Idea or request for content:

I think that a more precise overview of the underlying likelihood calculation used in SciPy should be added.


mdhaber commented Dec 23, 2023

I believe this is the relevant code:

def _nnlf_and_penalty(self, x, args):

Since you have the book, please let us know what type this corresponds with.
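In outline, that function evaluates something like the following (a simplified sketch, not the actual SciPy source; the argument names are placeholders for the pieces a CensoredData instance carries):

import numpy as np
from scipy import stats

def nnlf_censored_sketch(dist, uncensored, left, right, interval, *args):
    # log f(t_i) for exactly observed failure times
    total = np.sum(dist.logpdf(uncensored, *args))
    # log F(t_i) for left-censored observations (event happened before t_i)
    total += np.sum(dist.logcdf(left, *args))
    # log S(t_i) for right-censored observations (event happened after t_i)
    total += np.sum(dist.logsf(right, *args))
    # log [F(b) - F(a)] for interval-censored observations
    a, b = interval[:, 0], interval[:, 1]
    total += np.sum(np.log(dist.cdf(b, *args) - dist.cdf(a, *args)))
    return -total  # negative log-likelihood

# Example: exponential lifetimes (scale 2), test stopped at t = 3
rng = np.random.default_rng(1234)
t = rng.exponential(scale=2, size=100)
observed = t[t <= 3]
censored = np.full(np.count_nonzero(t > 3), 3.0)
nll = nnlf_censored_sketch(stats.expon, observed, np.empty(0),
                           censored, np.empty((0, 2)), 0, 2)  # loc=0, scale=2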

WarrenWeckesser commented Dec 29, 2023

@gbene, I don't have the Karim and Islam book, but there are other relevant references that we can all access.

Type I, Type II and random (or uninformative) censoring all involve mixes of uncensored and right-censored data. For example, in the Type I case, the failure times of $n$ units are observed in a fixed time interval T (or $t_r$ in your notation). For the units that fail within the given time interval, the failure times are uncensored. The failure times of the units that haven't failed by time T are unknown, but they are greater than T, so those observations are recorded as right-censored with lower bound T. The formula that you quote from Karim and Islam for the Type I case is equivalent to the formula used in SciPy for right-censoring (assuming that $\delta_i$ is the failure indicator: 1 for an observed failure, 0 for right-censored).
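Spelling that out: splitting the product over the observed failures ($\delta_i = 1$) and the censored units ($\delta_i = 0$) gives

$$L = \prod_{i=1}^{n} f(t_i)^{\delta_i}\, S(T)^{1-\delta_i} = \prod_{i:\,\delta_i=1} f(t_i) \prod_{i:\,\delta_i=0} S(T),$$

i.e. a density factor for each uncensored observation and a survival-function factor for each right-censored one, which is exactly the right-censoring likelihood that SciPy maximizes (in log form).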

Are you sure that the formula that you quoted for $L$ for the Type II case is correct? In the Type II case, $r$ is given, and the test is stopped when $r$ failures have occurred. So the failure times of the $n - r$ units that didn't fail are right-censored with lower bound $t_r$. I think the second factor in $L$ should be $S(t_r)^{n-r}$ instead of $S(t_i)^{n-r}$, since $t_r$ is the last observed failure time, when the test was stopped. (Note that this factor is outside of the product, so $i$ can't be referenced.)

gbene commented Dec 29, 2023

Hi all! Thank you for your answers!

The formula that you quote from Karim and Islam for the Type I case is equivalent to the formula used in SciPy for right-censoring.

@WarrenWeckesser I can see this in the code provided by @mdhaber. In the _nnlf_and_penalty function, the log-likelihood is used (understandably), and I can follow the code fairly well until the end. What I do not understand is the penalization factor and where it comes from. At the end, the total sum is penalized by the total number of bad values multiplied by _LOGXMAX * 100:

return -total + n_bad * _LOGXMAX * 100

Why is this?

Are you sure that the formula that you quoted for $L$ for the Type II case is correct?

Sorry, you are right! I made a mistake while writing the formula! I edited the original comment.

Thank you again!

mdhaber commented Dec 29, 2023

Why is this?

It is a penalty method for turning a constrained optimization problem into an unconstrained one. When all the observations are within the support of the distribution, there are no "bad" points, and the penalty disappears. The penalty is supposed to be very large so that the solver never converges to a solution where there is a penalty rather than to the true MLE. It looks unsophisticated to me, but I've been surprised at how well it works in practice.
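As a toy illustration (my own sketch, not the SciPy internals): fit a shifted exponential, where samples below loc are outside the support and would contribute -inf to the log-likelihood; the penalty replaces those contributions with a huge finite value so a generic solver can still make progress:

import numpy as np
from scipy import stats
from scipy.optimize import minimize

_LOGXMAX = np.log(np.finfo(float).max)  # the analogue of SciPy's _LOGXMAX

def penalized_nnlf(params, x):
    loc, scale = params
    logpdf = stats.expon.logpdf(x, loc=loc, scale=scale)
    bad = ~np.isfinite(logpdf)  # x < loc (outside the support) or invalid scale
    total = np.sum(logpdf[~bad])
    # replace each -inf/nan contribution with a huge finite penalty
    return -total + np.count_nonzero(bad) * _LOGXMAX * 100

rng = np.random.default_rng(0)
x = 5 + rng.exponential(scale=2, size=200)
res = minimize(penalized_nnlf, x0=[0.0, 1.0], args=(x,), method='Nelder-Mead')
# res.x should approach (min(x), mean(x - min(x))), the MLE of a shifted exponential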

gbene commented Dec 29, 2023

Alright! Thank you very much for these clarifications. I think they would be a useful addition to make the docs clearer about what is going on in the code!

mdhaber commented Dec 29, 2023

Would you like to submit a PR?

gbene commented Dec 29, 2023

mdhaber commented Dec 29, 2023

Yup, that would be great. Note that if you're working on Windows, building the documentation locally is tricky, so you can skip that step. In any case, please include the text [docs only] in the body of your commit message. This will ensure that the documentation is built when you submit the PR (without performing other unnecessary tests) so we can see how it renders.

Maybe draft what you plan to add here in the issue?

gbene commented Dec 29, 2023

All right! I think that I will add the discussed information to the CensoredData page after: "Left-, right-, and interval-censored data can be represented by CensoredData."

More specifically, I am planning to:

  1. Add a further distinction for right-censoring between Type I, Type II, and uninformative censoring.
  2. Clarify that Type I and Type II have different formulations of the likelihood.
  3. Point out that SciPy uses the generalized form (including left and interval censoring) and that, because of this, right-censored data must be Type I / uninformative, otherwise the likelihood calculation will be incorrect (see the sketch after this list).
  4. Maybe add a note about the applied penalty so that users know what happens when invalid data are present in the dataset.
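For reference, the generalized form I mean in point 3 (my transcription, to be checked against whichever source we cite) combines all four kinds of observations:

$$L = \prod_{i \in \mathcal{U}} f(t_i) \prod_{i \in \mathcal{L}} F(t_i) \prod_{i \in \mathcal{R}} S(t_i) \prod_{i \in \mathcal{I}} \bigl[ F(R_i) - F(L_i) \bigr]$$

where $\mathcal{U}$, $\mathcal{L}$, $\mathcal{R}$, $\mathcal{I}$ are the index sets of the uncensored, left-, right-, and interval-censored observations.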

What do you think? Is it too specific?

Also, it would be nice to add a reference for where the generalized formulation comes from. I have the book by Karim and Islam, but maybe you used other sources.

Also, I think that a clarification of this line is in order:

https://github.com/scipy/scipy/blob/5e4a5e3785f79dd4e8930eed883da89958860db2/scipy/stats/_distn_infrastructure.py#L2328C7-L2328C7

To estimate the interval-censoring term, the _delta_cdf function is used, but this causes a bit of confusion. The formulation in Karim and Islam states that the interval-censoring term is calculated as the difference between $S(L_i)$ and $S(R_i)$:

$$\prod_{i \in I} \left[ S(L_i) - S(R_i) \right]$$

where $I$ is the set of interval-censored times and the event occurred between $(L_i, R_i)$.

In SciPy this is implemented correctly, but only because the _delta_cdf function returns CDF(x2) - CDF(x1).

In the code, the usage is:

np.log(self._delta_cdf(i1, i2, *args))  

where i1 = $L_i$ and i2 = $R_i$, so in our case this is CDF($R_i$) - CDF($L_i$). This is the same as $S(L_i) - S(R_i)$ because CDF = 1 - SF.
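Explicitly, applying CDF = 1 - SF to both terms:

$$\mathrm{CDF}(R_i) - \mathrm{CDF}(L_i) = \bigl(1 - S(R_i)\bigr) - \bigl(1 - S(L_i)\bigr) = S(L_i) - S(R_i)$$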

So I think the comment in the code should also be extended to clarify this step. What do you think?

Thanks again!

mdhaber commented Dec 30, 2023

What do you think? Is it too specific?

I'd clarify what we do in SciPy, but not compare and contrast all the different options. Just add a reference for that (see how references are added in differential_entropy, for example). Also, this sort of information can be appended to the Notes section rather than included in the main description.

The penalty is already mentioned in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html#scipy.stats.rv_continuous.fit, which is where that would belong.

So I think that the comment in the code should also be extended to clarify this step, what do you think?

I don't understand the confusion. Is it about the equivalence of SF(x) - SF(y) and CDF(y) - CDF(x), or that the name _delta_cdf is not descriptive of CDF(y) - CDF(x)? With the name _delta_cdf, I would expect it to calculate the probability mass between x and y, which I think is more intuitive than either SF(x) - SF(y) or CDF(y) - CDF(x).

gbene commented Jan 2, 2024

I'd clarify what we do in SciPy, but not compare and contrast all the different options. Just add a reference for that (see how references are added in differential_entropy, for example). Also, this sort of information can be appended to the Notes section rather than included in the main description.

All right, perfect!

I don't understand the confusion.

I was mostly confused by the name; for me it did not immediately equate to CDF(i2) - CDF(i1). I initially thought it was the difference in input order (so CDF(i1) - CDF(i2)), and only after I read the description of _delta_cdf did I understand what was going on. I was suggesting adding a comment to avoid that back-and-forth while reading the code.

Thanks again!

mdhaber commented Jan 2, 2024

I was suggesting adding a comment to avoid that back-and-forth while reading the code.

Sure. Maybe phrase it like "calculate the probability mass between...".
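For example (hypothetical wording; the call itself is the line quoted above from the SciPy source):

# Compute the probability mass between i1 and i2, i.e. CDF(i2) - CDF(i1)
# (equivalently SF(i1) - SF(i2)); _delta_cdf preserves precision in the tails.
np.log(self._delta_cdf(i1, i2, *args))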
