# Generating new quantities of interest given an existing model, data, and sample


The [generated quantities block](https://mc-stan.org/docs/reference-manual/program-block-generated-quantities.html)
computes quantities of interest based on the data,
transformed data, parameters, and transformed parameters.
It can be used to:

-  generate simulated data for model testing by forward sampling
-  generate predictions for new data
-  calculate posterior event probabilities, including multiple
   comparisons, sign tests, etc.
-  calculate posterior expectations
-  transform parameters for reporting
-  apply full Bayesian decision theory
-  calculate log likelihoods, deviances, etc. for model comparison
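
As a rough illustration of two items in this list, a posterior event probability is the fraction of draws in which the event holds, and a posterior expectation is the average of a function of the parameters over the draws. The sketch below uses plain Python rather than Stan; the values in `theta_draws`, the threshold 0.3, and the choice of the odds as the derived quantity are all made up for illustration.

```python
# Hypothetical posterior draws of a probability parameter theta (illustration only).
theta_draws = [0.12, 0.25, 0.31, 0.18, 0.55, 0.22]

# Posterior event probability: fraction of draws where the event theta > 0.3 holds.
event_prob = sum(theta > 0.3 for theta in theta_draws) / len(theta_draws)

# Posterior expectation of a derived quantity, here the odds theta / (1 - theta).
odds_draws = [theta / (1.0 - theta) for theta in theta_draws]
expected_odds = sum(odds_draws) / len(odds_draws)

print('Pr[theta > 0.3] is approximately {:.3f}'.format(event_prob))
print('E[odds] is approximately {:.3f}'.format(expected_odds))
```

In a Stan program the same quantities would be declared in the generated quantities block, evaluated once per draw, and averaged afterwards.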

The ``CmdStanModel`` class ``generate_quantities`` method is useful once you
have successfully fit a model to your data and have a valid
sample from the posterior.
To generate new quantities of interest, you can modify just
the generated quantities block of the original model, 
adding all necessary statements to compute new quantities of interest.
After compiling the new model, you call its ``generate_quantities``
method, passing in the existing sample and input data as arguments.
CmdStanPy invokes CmdStan specifying ``method=generate_quantities``.
CmdStan then uses the parameter values from each draw in the sample to
evaluate just the generated quantities block of the new model.
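
Conceptually, this amounts to looping over the existing draws and evaluating the generated quantities block once per draw, with no new sampling of the parameters. The following is a minimal pure-Python sketch of that idea, not CmdStan's actual implementation; the function names, the `theta_draws` values, and the Bernoulli-style block are hypothetical stand-ins.

```python
import random


def generated_quantities_block(theta, n_trials, rng):
    """Hypothetical stand-in for a generated quantities block:
    simulate n_trials Bernoulli outcomes with success probability theta."""
    return [1 if rng.random() < theta else 0 for _ in range(n_trials)]


def run_generated_quantities(theta_draws, n_trials, seed=1234):
    """Evaluate the block once per existing draw; the draws themselves are reused, not resampled."""
    rng = random.Random(seed)
    return [generated_quantities_block(theta, n_trials, rng) for theta in theta_draws]


# Hypothetical posterior draws taken from an existing sample.
theta_draws = [0.15, 0.22, 0.31, 0.18]
y_rep = run_generated_quantities(theta_draws, n_trials=10)
print(len(y_rep), len(y_rep[0]))
```

The result has one row per input draw, which is why the new quantities can later be lined up column-wise with the original sample.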

The ``generate_quantities`` method returns a ``CmdStanGQ`` object
which provides the following properties and methods for retrieving the results:


- ``chains``
- ``column_names``
- ``generated_quantities``
- ``generated_quantities_pd``
- ``sample_plus_quantities``
- ``save_csvfiles()``

The ``sample_plus_quantities`` property combines the existing sample
and new quantities of interest into a pandas DataFrame object
which can be used for downstream analysis and visualization.
In this way you add more columns of information to an existing sample.

## Example: add posterior predictive checks to `bernoulli.stan`


In this example we use the CmdStan example model [bernoulli.stan](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli.stan)
and data file [bernoulli.data.json](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli.data.json) as our existing model and data.

We instantiate the model `bernoulli`,
as in the "Hello World" section
of the CmdStanPy [tutorial](../../cmdstanpy_tutorial.ipynb) notebook.

In [None]:
import os
from cmdstanpy import cmdstan_path, CmdStanModel, CmdStanMCMC, CmdStanGQ

bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
bernoulli_path = os.path.join(bernoulli_dir, 'bernoulli.stan')
bernoulli_data = os.path.join(bernoulli_dir, 'bernoulli.data.json')

# instantiate bernoulli model, compile Stan program
bernoulli_model = CmdStanModel(stan_file=bernoulli_path)
print(bernoulli_model.code())

The input data consists of `N`, the number of Bernoulli trials, and `y`, the list of observed outcomes.
Inspection of the data shows that 20% of the observed trials were successes, i.e., the mean of `y` is 0.2.

In [None]:
# examine bernoulli data
import json  # standard library; avoids the third-party ujson dependency
import statistics
with open(bernoulli_data, 'r') as fp:
    data_dict = json.load(fp)
print(data_dict)
print('mean of y: {}'.format(statistics.mean(data_dict['y'])))

As in the "Hello World" tutorial, we produce a sample from the posterior of the model conditioned on the data:

In [None]:
# fit the model to the data
bern_fit = bernoulli_model.sample(data=bernoulli_data)

The fitted model produces an estimate of `theta`, the chance of success.

In [None]:
bern_fit.summary()

To run a posterior predictive check, we add a `generated quantities` block to the model, in which we generate a new data vector `y_rep` using the value of `theta` in the current draw.  The resulting model is in file [bernoulli_ppc.stan](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli_ppc.stan).

In [None]:
bernoulli_ppc_model = CmdStanModel(stan_file='bernoulli_ppc.stan')
print(bernoulli_ppc_model.code())

We run the `generate_quantities` method on `bernoulli_ppc_model` using the existing sample `bern_fit` as input.  The `generate_quantities` method takes the values of `theta` in the `bern_fit` sample as the set of draws from the posterior used to generate the corresponding `y_rep` quantities of interest.

The arguments to the `generate_quantities` method are:
 + `data`  - the data used to fit the model
 + `mcmc_sample` - either a `CmdStanMCMC` object or a list of stan-csv files


In [None]:
new_quantities = bernoulli_ppc_model.generate_quantities(data=bernoulli_data, mcmc_sample=bern_fit)

The `generate_quantities` method returns a `CmdStanGQ` object which contains the values for all variables in the generated quantities block of the program ``bernoulli_ppc.stan``.  Unlike the output from the ``sample`` method, it doesn't contain any information on the joint log probability density, the sampler state, or the parameter and transformed parameter values.

In this example, each draw consists of an N-length array `y_rep`, a replicate of the `bernoulli` model's input variable `y`, which is itself an N-length array of Bernoulli outcomes.

In [None]:
print(new_quantities.generated_quantities.shape, new_quantities.column_names)
for i in range(3):
    print(new_quantities.generated_quantities[i, :])
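
With the replicated data in hand, one common form of posterior predictive check compares a test statistic of each replicate with the same statistic of the observed data. The sketch below is a minimal illustration using only the standard library. It assumes the observed outcomes shown in `y_obs` (mean 0.2) and uses a small hand-written `y_rep_draws` list as a hypothetical stand-in for the rows of `new_quantities.generated_quantities`.

```python
import statistics

# Assumed observed outcomes (mean 0.2); in the notebook this would be data_dict['y'].
y_obs = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Hypothetical stand-in for rows of new_quantities.generated_quantities,
# one replicated data set per posterior draw.
y_rep_draws = [
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]

# Test statistic: the proportion of successes.
obs_stat = statistics.mean(y_obs)
rep_stats = [statistics.mean(row) for row in y_rep_draws]

# Posterior predictive p-value: share of replicates at least as extreme as the data.
p_value = sum(stat >= obs_stat for stat in rep_stats) / len(rep_stats)
print('observed mean: {}, posterior predictive p-value: {}'.format(obs_stat, p_value))
```

A p-value near 0 or 1 would suggest that the model fails to reproduce this feature of the data; with a real sample the same calculation would run over every row of the generated quantities array.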

The property `sample_plus_quantities` returns a pandas DataFrame which combines the input drawset with the generated quantities.

In [None]:
sample_plus = new_quantities.sample_plus_quantities
print(sample_plus.shape)
print(sample_plus.columns)

For models as simple as the Bernoulli models here, it would be trivial to re-run the sampler and generate a new sample which contains both the estimate of the parameter `theta` and the `y_rep` values. For models which are difficult to fit, i.e., when producing a sample is computationally expensive, the `generate_quantities` method is preferred.