Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement some predictive model transformations #141

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ricardoV94
Copy link
Member

Companionship to #111

Exploring some possibilities and also understanding the limitations/ design issues.

This is just a draft to guide design decisions for now.

forecast_timeseries_rewrite = in2out(forecast_timeseries_node_rewrite, ignore_newtrees=True)


def forecast_timeseries(model: Model, forecast_steps: int) -> Model:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe request a replace_dims={"old_dim": "new_dim"}. Right now any dims of deterministics/variables below the timeseries that corresponded to to the initial timeseries steps will be carried over (incorrectly).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the GP predictions in mind with this example, but this was a bit simpler as all the graph information is available in the model. For more context see https://gist.github.com/ricardoV94/c16693680d55294eb73feed7c9997421

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, still have your gist up in a tab.. I'm thinking that a really nice way to conceptualize GPs, or AR models like this, would just be as functions.

with pm.Model() as model:
    X = pm.MutableData("X", X_)
    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp(X)  # formerly gp.prior(...)
    ...
    pm.sample()


with model:
    pm.set_data(X_new)
    # or maybe pm.set_data({"X": X_new}, replace_dims={"X": new_coords})

    

Is all that's needed is to tuck your function here into pm.set_data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to create a new model and not mess with the original one. In these cases the predictive model is different than the prior model, which is not such an unreasonable thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, it'd be nice from a user perspective though if there was one way to do this regardless of whether its a linear model, AR, or GP.

My bad, obv you cant return a new model somehow within pm.set_data. What if the syntax switched a bit, new_model = pm.set_data(X_new) (or a new function, pm.update_data) and could handle the diff cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API I have in mind for GPs would be:

with predict_gp(gp1=x1, gp2=x2, ...) as pred_m:
  pm.sample_posterior_predictive()

That feels much more intuitive.

Procedurally set_data doesn't seem like good analogy for GPs. We don't want to override the initial x, we want to expand them (and apply a specific multivariate normal to that expansion). If it was called pm.expand_data it would be closer, but again I don't think it's that intuitive.

I think sample is actually a good counter example, that code is really hard to understand and maintain. Users have no idea how to do something less standard like resuming sampling which libraries like blackjax and numpyro do much better.

Also a more practical if silly reason is that we can't create a distinct model inside the context of another. The variables would be added to the parent. And you can't call set_data outside of a model.

But instead of me arguing against it, do you see a problem with this approach? Do you think it would be hard to use, or to find or both?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the gp1=x, gp2=x2, ... part work? is that setting X for each gp object in the model?

The details of predicting with a GP could be abstracted away though. It'd be nice if a user could just think of GPs as random functions f(x) that take some x and return a distribution. So you sample initially with some x, then you can plug in new x and get the new GP at x. Which is true, the stuff about conditionals and non-parametric and all that is kind of an implementation detail, and there to make working with GPs more efficient. It's just as correct to fit the whole thing with NUTS, a GP over x (where data is observed) and x_new (where it isnt) up front. It's too slow though.

One potential problem (don't know if it necessarily is), is how this would interact with additive GPs, example here. Could you apply this to predict a component of an additive GP? (I think its def possible, predict_gp might need some args for this, just wanted to keep it in mind as a potential complication)

It also kind of looks like you're building higher level model components, so censoring, gps, timeseries, which all use this function chaining for uncensoring/prediction/forecasting whatever is natural. Even though you could do it with pm.set_data, you could additional build a predict_lm function that works the same more function chaining way too.

If you had a linear model as the mean of a GP, or equiv., $y \sim N(f + X\beta, \sigma^2)$, would you do

with predict_gp(...) as pred_m:
  pm.set_data({"X": x_new})
  pm.sample_posterior_predictive()

(or maybe, this one?)

with predict_gp(predict_lm(...)) as pred_m:
  pm.sample_posterior_predictive()

Anyway, please don't let this discussion be a drag! I feel there's a few threads here and theres a lot to think about.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential problem (don't know if it necessarily is), is how this would interact with additive GPs, example here. Could you apply this to predict a component of an additive GP? (I think its def possible, predict_gp might need some args for this, just wanted to keep it in mind as a potential complication)

I have to read a bit about that to understand it.

If you had a linear model as the mean of a GP, or equiv., , would you do

Either would be possible! The GLM part usually doesn't need any graph transformation (that's why it already works). Although you could use it replace ConstantData predictors by a new one if you wanted to. This way a PyMC model can sample with constant predictors, which may allow more efficient graphs, than when we use shared variables where it has to assume those could change values/shape any time.

But I would probably go for the first (not implement a predict_lm transform). The predict_gp, creates a new model with the prediction component inserted right after where the original gp was (conditioned on the new_X). After that you can still set any old MutableData variables you had in the original model as you want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, please don't let this discussion be a drag! I feel there's a few threads here and theres a lot to think about.

This was the point of me opening the PR, so not a drag at all. The idea is to experiment and see if something works well enough. Maybe I should add a gp_predict transform in this PR to see how it looks like?

Copy link
Contributor

@bwengals bwengals Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay definitely, I'm curious how it looks.

The predict_gp, creates a new model with the prediction component inserted right after where the original gp was (conditioned on the new_X). After that you can still set any old MutableData variables you had in the original model as you want.

Would mutating X affect the original GP though? Like, if you have

f = gp.prior("f", X)
...
pm.set_data({"X": x_new})

you might expect something to happen to f?

Also, I'd be curious to check the graph after .prior has been replaced by .conditional to see if there is an X as input somewhere still hanging around.

Copy link
Contributor

@lucianopaz lucianopaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late to the party here. I'm catching up with all of your discussion but I haven't read all of the stuff you wrote about using this for GPs.

To be honest, I think that you started with two rather difficult and specific applications. I suggest that you do a couple of things differently:

  1. Make model_transform a directory (subpackage)
  2. Have model transformations split within according to some logic, instead of a monolithic script
  3. Add a transformation called observe and one called unobserve. The idead behind these two is to move a random variable between a freeRV and an observedRV.
  4. Add a transformation called impute that takes any random variable and replaces it with a TensorConstant of the appropriate size and dtype. This, along with observe will serve as the basis for do-calculus and would open the door of practically all of causal inference to us.

pymc_experimental/model_transform.py Outdated Show resolved Hide resolved
Comment on lines 101 to 104
# FIXME: This special logic shouldn't be needed for ObservedRVs
# but PyMC does not allow one to not resample observed.
# We hack around by conditioning on the value variable directly,
# even though that should not be part of the generative graph...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. This special logic IS needed for ObservedRVs because forecasting is different than posterior predictive.

Posterior predictive does something like this:

$$ y_{i} ~ P(Y|y_{1}^{obs}, ..., y_{n}^{obs}) $$

draw other potential values of the $y_{i}^{obs}$ observations from the probability distribution of the observed variable conditioned on the previous set of observations. Essentially, this is agnostic of temporal order and is intended for resampling of the entire set of observations.

Forecasting does this:

$$ y_{t+1} ~ P(Y_{t+1}|y_{1}^{obs}, ..., y_{n}^{obs}) $$

So it samples from a different conditional distribution than the posterior predictive. That's why we do need to manually add the past y_{1}^{obs}, ..., y_{n}^{obs} values explicitly.

Copy link
Member Author

@ricardoV94 ricardoV94 May 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I think of it observed variables are just FreeRVs that had constant values during sampling. If I can condition on FreeRVs traced values I should also be able to condition on ObservedRV traced values (in the constant/observed data group) without having to hack across the value variable representation on the forward graph.

Similarly, if I had a Deterministic based on an ObservedRV (as happens with imputation), why can't I now make predictions based on those traced values without forcing sampling from the likelihood?

The difference between forecasting vs posterior predictive (+ forecasting) would be controllable by including or excluding the ObservedRV in the var_names kwarg in posterior predictive, which would behave exactly the same as with FreeRVs

@ricardoV94
Copy link
Member Author

Add a transformation called impute that takes any random variable and replaces it with a TensorConstant of the appropriate size and dtype. This, along with observe will serve as the basis for do-calculus and would open the door of practically all of causal inference to us.

Any preference for a TensorConstant instead of SharedVariable? The latter can be modified without needing to reconstruct the graph

@ricardoV94 ricardoV94 marked this pull request as ready for review May 10, 2023 15:46
@ricardoV94 ricardoV94 marked this pull request as draft May 10, 2023 15:47
@ricardoV94 ricardoV94 changed the title Fgraph model applications Implement some predictive model transformations Jul 6, 2023
@ricardoV94 ricardoV94 marked this pull request as ready for review July 6, 2023 20:37
@jessegrabowski
Copy link
Member

Following up here with the conversation I had with @ricardoV94 on discourse so it doesn't get lost --

Another nice transformation would be one that takes in an MvN distribution and marginalizes over observed data, returning a new conditional distribution over the set complement to the observations. It would look something like this:

from pytensor.tensor.random.basic import multivariate_normal, MvNormalRV
from pytensor.tensor.nlinalg import matrix_dot
import pytensor.tensor as pt
import pymc as pm

def marginalize_over_observations(dist, observations):
    if not isinstance(dist.owner.op, MvNormalRV):
        raise NotImplementedError

    # Is this always true?
    *_, mu, cov = dist.owner.inputs
    observations = pt.as_tensor_variable(observations)
    
    is_nan = pt.isnan(observations)
    missing_idx = pt.nonzero(is_nan, return_matrix=False)[0]
    obs_idx = pt.nonzero(pt.neq(is_nan, np.array(True)), return_matrix=False)[0]
    
    n_missing = missing_idx.shape[0]
    n_obs = obs_idx.shape[0]
    n = n_missing + n_obs
    
    # Z is a map from R_n to R_observed
    # K is a map from R_n to R_missing
    Z = pt.set_subtensor(pt.zeros((n_obs, n))[pt.arange(n_obs), obs_idx], 1)
    K = pt.set_subtensor(pt.zeros((n_missing, n))[pt.arange(n_missing), missing_idx], 1)
    
    resid = observations[obs_idx] - Z @ mu

    # TODO: Benchmark this vs indexing on a large problem, gains from BLAS?
    cov_uu = matrix_dot(K, cov, K.T)
    cov_uo = matrix_dot(K, cov, Z.T)
    cov_oo = matrix_dot(Z, cov, Z.T)
        
    # This could be sped up if we got chol instead of cov, because we could to solve_triangular twice to invert
    # Typical MvN parameterization is in terms of chol, so there's a lot of waste going back and forth
    # over and over?
    cov_oo_inv = pt.linalg.solve(cov_oo, pt.eye(n_obs), assume_a="pos", check_finite=False)
    beta = cov_uo @ cov_oo_inv
    
    conditional_mu = Z @ mu + beta @ resid
    conditional_cov = cov_uu - matrix_dot(beta, cov_oo, beta.T)
    
    return multivariate_normal(mean=conditional_mu, 
                               cov=conditional_cov, 
                               dtype=dist.dtype)

Simple usage:

observation = np.array([np.nan, 1, np.nan])
L = np.random.normal(size=(3, 3))
cov = L @ L.T
d = pm.MvNormal.dist(mu=np.zeros(3), cov=cov)
d2 = marginalize_over_observations(d, observation)
d2.eval().shape
>>> Out: (2,)

This could then be applied to a model with 1) A list of MvN Rvs to re-write, and 2) a list of observations to condition them on. I will have a look at how the other functions in this PR work and see if I can have a try, but I actually don't know how to PR into a PR :>

@ricardoV94
Copy link
Member Author

@jessegrabowski you can push directly to my branch if you have the git permissions. Otherwise you can fork my branch and open a PR with that as a base. If that doesn't work just open a PR directly on main because I will have the permissions to work on it anyway.

About the Cholesky thing we added some rewrites recently in PyTensor that work as long as the right "tags" are on the variables (e.g. "lower_triangular".

You can check it in: pymc-devs/pymc@ee1657b

pymc-devs/pytensor#303

Would those cover the case here? If not, can we add the rewrite in PyTensor?

@ricardoV94
Copy link
Member Author

ricardoV94 commented Jul 8, 2023

About the observations could we just request the new ones and combine with the old observed values from the model/trace

@jessegrabowski
Copy link
Member

About the Cholesky thing we added some rewrites recently in PyTensor that work as long as the right "tags" are on the variables (e.g. "lower_triangular".
You can check it in: pymc-devs/pymc@ee1657b
pymc-devs/pytensor#303
Would those cover the case here? If not, can we add the rewrite in PyTensor?

The rewrite handles my quip about going back and forth, since it looks like now the canonical MvN will be parameterized by the cholesky factorized covariance matrix? I was surprised to see that if you give pytensor.tensor.random.basic.multivariate_normal a cholesky factorized cov, it computes cov = L @ L.T, I assumed it would instead be the other way around (if you give cov it computes L = cholesky(cov)). Anyway my point in the comment was that if we always know we can get a cholesky factorized covariance matrix out of dist.owner.inputs, then we can do a more efficient algorithm for matrix inversion. Or check that tag and do a switch?

One last thing I just thought of: this "conditionalization" has implications for missing data imputation in MvN likelihoods -- the missing values should be filled with these conditional distributions. I know you just did a PR that brought back multivariate imputation, but I didn't carefully look to see what goes into the missing values now

@ricardoV94
Copy link
Member Author

The default parametrization of the MvNormal is still the covariance. What we did was to add rewrites to remove useless operations when you tag that some variables where already lower triangular (which the outputs of LKJ are tagged with).

You should still write the code expecting a full covariance. The question is if the rewrites that we have right now would go from the full covariance graph to the optimized one as long as the tags are provided?

If not, we could open an issue on Pytensor to do it. Maybe @dehorsley would be interested in also implementing those?

@ricardoV94
Copy link
Member Author

About the imputation, yes in that case you shouldn't need anything else but it can be pretty inneficient to sample the variables with NUTS so it's nice if we can get them afterwards using the conditioning algebra.

CC @bwengals as this is pretty much the same thing with GPs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants