Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add predict helpers such as: predict_expectation, predict_percentile (similar to lifelines) #190

Open
konradsemsch opened this issue Mar 22, 2021 · 11 comments

Comments

@konradsemsch
Copy link

Hi! I really like your package implementation and its compatibility with sklearn, but one thing that is refraining us from using it in our solution is the lack of the above predict function implementations (particularly predict_expectation). In our case, we need to know the expected number of days a given client is expected to remain a client. Ranking observations by the obtained "arbitrary" score will not work.

Given that the above function is based on the estimated survival functions it should be possible to implement (as long as I read the lifelines code correctly):

    def predict_expectation(self, X: DataFrame, conditional_after: Optional[ndarray] = None) -> pd.Series:
        r"""
        Compute the expected lifetime, :math:`E[T]`, using covariates X. This algorithm to compute the expectation is
        to use the fact that :math:`E[T] = \int_0^\inf P(T > t) dt = \int_0^\inf S(t) dt`. To compute the integral, we use the trapezoidal rule to approximate the integral.
        Caution
        --------
        If the survival function doesn't converge to 0, then the expectation is really infinity and the returned
        values are meaningless/too large. In that case, using ``predict_median`` or ``predict_percentile`` would be better.
        Parameters
        ----------
        X: numpy array or DataFrame
            a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns
            can be in any order. If a numpy array, columns must be in the
            same order as the training data.
        conditional_after: iterable, optional
            Must be equal is size to X.shape[0] (denoted `n` above).  An iterable (array, list, series) of possibly non-zero values that represent how long the
            subject has already lived for. Ex: if :math:`T` is the unknown event time, then this represents :math:`s` in
            :math:`T | T > s`. This is useful for knowing the *remaining* hazard/survival of censored subjects.
            The new timeline is the remaining duration of the subject, i.e. normalized back to starting at 0.
        Notes
        -----
        If X is a DataFrame, the order of the columns do not matter. But
        if X is an array, then the column ordering is assumed to be the
        same as the training dataset.
        See Also
        --------
        predict_median
        predict_percentile
        """
        subjects = utils._get_index(X)
        v = self.predict_survival_function(X, conditional_after=conditional_after)[subjects]
        return pd.Series(trapz(v.values.T, v.index), index=subjects)

Reference to the lifelines implementation:

  1. https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html#lifelines.fitters.coxph_fitter.SemiParametricPHFitter.predict_expectation
  2. https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/fitters/coxph_fitter.py
@sebp
Copy link
Owner

sebp commented Mar 25, 2021

It is correct that the expected survival time can be estimated with the integral under the survival function. However, this only leads to a reasonable estimate if the last observed time point corresponds to an event such that the survival function is zero thereafter. If that's not the case, the expected survival time is undefined without making any additional assumptions.

Typical assumptions are (i) assuming the remaining instances experience an event directly after the last observed time point (S(t) becomes zero), (ii) assuming the remaining instances never experience an event (S(t) remains constant), (iii) assuming a parametric form, e.g. exponential curve, at the tail. Finally, estimating the restricted mean survival time, i.e., an expectation over a pre-defined interval rather than all of time, is from a practical perspective the most reasonable.

I don't know how lifelines handles this, but I think, when implementing this, it is important to make it explicit that estimating the mean survival time can be tricky and avoid giving an answer for which the underlying assumptions are not clear.

@konradsemsch
Copy link
Author

konradsemsch commented Mar 25, 2021

Hi @sebp and thanks for coming back on this! Yes, there's a note in the lifelines documentation referring to what you have mentioned:

image

In general, I would consider this a really desirable feature in one of the future releases. I find your package much better integrated with the sklearn ecosystem, but the lack of these prediction helpers prevents us from using it. An additional point in favour of implementing this is that when I check other package implementations, such as the quite famous survival package in R, it also provides an option to predict the expected survival time.

I'd be happy to contribute here with a PR 😉

@sebp
Copy link
Owner

sebp commented Mar 25, 2021 via email

@PragyanSubedi
Copy link

Any updates on this issue for implementation? @sebp @konradsemsch

@sebp
Copy link
Owner

sebp commented May 2, 2021

@PragyanSubedi I did not have time to look into it, I'm afraid.

@konradsemsch
Copy link
Author

@PragyanSubedi - I've started looking into that. I should have a first PR draft in a week.

@konradsemsch
Copy link
Author

@sebp - I added a basic first PR: #196. Please take a look at it and let me know if you agree. Once we have that, I'd move further with implementing tests and adding this helper to other models as well.

@bobbyx27
Copy link

Hi, @sebp : I wonder why not considering a predict_hazard_function in addition to predict_cumulative_hazard_function ? Or is there an obvious and simple way to get it, that I'm missing... Thank you very much for all the great work!

@sebp
Copy link
Owner

sebp commented Apr 26, 2022 via email

@tomikrogerus
Copy link

tomikrogerus commented May 3, 2022

Hi, @konradsemsch and @sebp are you still working with this predict_expectation function? Would be really interested about it.

@sebp
Copy link
Owner

sebp commented May 4, 2022

@tomikrogerus I'm currently not working on it. PR #196 is inactive too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants