DOC: Likelihood function depending on censoring type #19732
I believe this is the relevant code: `scipy/scipy/stats/_distn_infrastructure.py`, line 2308 (at commit `5e4a5e3`).
Since you have the book, please let us know which type this corresponds to.
@gbene, I don't have the Karim and Islam book, but some relevant references that we can all access are:
Type I, Type II and random (or uninformative) censoring all involve mixtures of uncensored and right-censored data. For example, in the Type I case, the failure times of units still running when the test ends at the fixed cutoff time are right-censored. Also, are you sure that the formula you quoted is correct?
Hi all! Thank you for your answers!
@WarrenWeckesser I can see this in the code provided by @mdhaber. In the `_nnlf_and_penalty` function, the log-likelihood is used (understandably), and I can follow the code fairly well until the end. What I do not understand is the penalization factor and where it comes from. At the end, the total sum is penalized by the number of bad values multiplied by `_LOGXMAX * 100`: `return -total + n_bad * _LOGXMAX * 100`. Why is this?
Sorry, you are right! I made a mistake while writing the formula! I edited the original comment. Thank you again!
It is a penalty method for turning a constrained optimization problem into an unconstrained problem. When all the observations are within the support of the distribution, there are no "bad" points, and the penalty disappears. The penalty is supposed to be very large so that the solver converges to the true MLE rather than to a solution where a penalty is incurred. It looks unsophisticated to me, but I've been surprised at how well it works in practice.
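The idea can be sketched in a few lines. This is not SciPy's exact implementation, just a minimal illustration of the penalty mechanism: observations outside the support produce `-inf` log-pdf values, and each one is swapped for a huge finite penalty so the optimizer is steered back into the feasible region.

```python
import numpy as np
from scipy import stats

_LOGXMAX = np.log(np.finfo(float).max)  # log of the largest representable float

def nnlf_with_penalty(theta, x, dist):
    """Sketch of a penalized negative log-likelihood (not SciPy's exact code).

    Points outside the support give logpdf = -inf; instead of returning
    inf (which stalls many optimizers), each such point contributes a
    huge finite penalty, pushing the solver toward parameter values
    where every observation lies inside the support.
    """
    logpdf = dist.logpdf(x, *theta)
    finite = np.isfinite(logpdf)
    n_bad = np.count_nonzero(~finite)
    total = logpdf[finite].sum()
    return -total + n_bad * _LOGXMAX * 100

x = np.array([0.1, 0.5, 2.0])
# All points inside the support of uniform(loc=0, scale=3): no penalty.
in_support = nnlf_with_penalty((0.0, 3.0), x, stats.uniform)
# 2.0 lies outside uniform(loc=0, scale=1): one huge penalty term.
penalized = nnlf_with_penalty((0.0, 1.0), x, stats.uniform)
```

Because the penalty (roughly 7e4 per bad point) dwarfs any realistic log-likelihood, a minimizer essentially never settles on a parameter vector that leaves observations outside the support.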
Alright! Thank you very much for these clarifications. I think they would be a useful addition to the docs, to make it clearer what is going on in the code!
Would you like to submit a PR?
Sure! I can follow this guide: https://docs.scipy.org/doc/scipy/dev/contributor/rendering_documentation.html#contributing-docs correct? |
Yup, that would be great. Note that if you're working on Windows, building the documentation locally is tricky, so you can skip that step. In any case, please include the text `[docs only]` in your commit messages. Maybe draft what you plan to add here in the issue first?
All right! I think I will add the discussed information to the CensoredData page, after: "Left-, right-, and interval-censored data can be represented by CensoredData." More specifically, I am planning to:
What do you think? Is it too specific? Also, it would be nice to add a bibliography entry for the source of the generalized formulation. I have the book by Karim and Islam, but maybe you used other sources.

I also think a clarification is in order for the interval-censoring term: it is estimated with the `_delta_cdf` function, but this causes a bit of confusion. The formulation in Karim states that the interval-censoring term is calculated from the survival function as the product over S(L$_i$) − S(R$_i$), where I is the set of interval-censored observations and the event occurred in (L$_i$, R$_i$). In scipy this is implemented correctly, but only because `_delta_cdf` returns CDF(X2) − CDF(X1). The call in the code is `np.log(self._delta_cdf(i1, i2, *args))`, where `i1` and `i2` hold the lower and upper interval endpoints. So I think the comment in the code should also be extended to clarify this step. What do you think? Thanks again!
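The equivalence in question is easy to check numerically. A minimal sketch (the distribution and endpoints are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical distribution and interval-censoring endpoints.
dist = stats.norm(loc=2.0, scale=1.0)
L, R = 1.0, 2.5

# Likelihood contribution of an interval-censored observation:
# the probability mass inside (L, R). The survival-function form
# S(L) - S(R) and the CDF form CDF(R) - CDF(L) are identical,
# because S(x) = 1 - CDF(x).
mass_sf = dist.sf(L) - dist.sf(R)
mass_cdf = dist.cdf(R) - dist.cdf(L)
```

So writing the term as a CDF difference is not a different formula, just an algebraic rearrangement of the survival-function form in Karim and Islam.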
I'd clarify what we do in SciPy, but not compare and contrast all the different options; just add a reference for that (see how references are added in existing docstrings). The penalty is already mentioned in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html#scipy.stats.rv_continuous.fit, which is where that discussion would belong.
I don't understand the confusion. Is it about the equivalence of SF(x) - SF(y) and CDF(y) - CDF(x), or about the name `_delta_cdf` itself?
All right, perfect!
I was mostly confused by the name; for me it did not immediately equate to CDF(i2) - CDF(i1). I initially thought it was the difference in the other argument order (so CDF(i1) - CDF(i2)), and only after reading the description of `_delta_cdf` did I understand. Thanks again!
Sure. Maybe phrase it like "calculate the probability mass between...".
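One plausible reason for a dedicated helper rather than an inline `cdf(i2) - cdf(i1)` is numerical precision: far in the right tail, both CDF values round to 1.0 and their difference cancels to zero. A sketch of that idea (this is an illustration of the technique, not SciPy's exact `_delta_cdf` implementation):

```python
import numpy as np
from scipy import stats

def delta_cdf_sketch(dist, x1, x2):
    """Compute CDF(x2) - CDF(x1), i.e. the probability mass in (x1, x2].

    In the right tail both CDF values round to 1.0 in float64 and their
    difference cancels to 0, so the survival function is used there
    instead, using the identity CDF(x2) - CDF(x1) = SF(x1) - SF(x2).
    """
    c1 = dist.cdf(x1)
    if c1 > 0.5:
        return dist.sf(x1) - dist.sf(x2)
    return dist.cdf(x2) - c1

d = stats.norm()
naive = d.cdf(11.0) - d.cdf(10.0)           # cancels to 0.0 in float64
accurate = delta_cdf_sketch(d, 10.0, 11.0)  # small but nonzero mass
```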
See scipy#19732 [docs only]
Issue with current documentation:
Hi all!
I am using the CensoredData class, introduced in scipy v1.11.0, to model a censored dataset. I think the documentation for this class is not very clear about which type of censoring is considered, or whether it is possible to change the likelihood function depending on the censoring type. In the text "Reliability and Survival Analysis" by Karim and Islam (2019), three types of censoring are defined (Type I, Type II and random), and each type has a different formulation for calculating the likelihood (chapter 4.4).
There are also more complex, umbrella formulations that handle right, left and interval censoring all together. Which of these formulations is used in scipy?
Idea or request for content:
I think that a more precise overview of the underlying likelihood calculation used in scipy should be added.