# Generating new quantities of interest


The [generated quantities block](https://mc-stan.org/docs/reference-manual/program-block-generated-quantities.html)
computes quantities of interest based on the data,
transformed data, parameters, and transformed parameters.
It can be used to:

-  generate simulated data for model testing by forward sampling
-  generate predictions for new data
-  calculate posterior event probabilities, including multiple
   comparisons, sign tests, etc.
-  calculate posterior expectations
-  transform parameters for reporting
-  apply full Bayesian decision theory
-  calculate log likelihoods, deviances, etc. for model comparison
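
As a minimal sketch of the "posterior event probabilities" item above: given a set of posterior draws, the probability of an event is estimated by the fraction of draws in which the event holds. The draws and the threshold below are hypothetical values chosen for illustration, and `event_probability` is an illustrative helper, not part of CmdStanPy.

```python
# Hypothetical posterior draws of a probability parameter; in practice these
# would come from a fitted model rather than being typed in by hand.
theta_draws = [0.31, 0.22, 0.25, 0.12, 0.40]


def event_probability(draws, event):
    """Monte Carlo estimate of P(event): the fraction of draws where event holds."""
    hits = sum(1 for d in draws if event(d))
    return hits / len(draws)


# Estimated probability that theta exceeds 0.3 (an arbitrary threshold).
p_gt_03 = event_probability(theta_draws, lambda theta: theta > 0.3)
print(p_gt_03)
```

In a Stan program the same indicator would be computed per draw inside the generated quantities block, and its posterior mean would be the event probability.
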

## Example:  add posterior predictive checks to `bernoulli.stan`


In this example we use the CmdStan example model [bernoulli.stan](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli.stan)
and data file [bernoulli.data.json](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli.data.json) as our existing model and data.

We instantiate the model `bernoulli`,
as in the "Hello World" section
of the CmdStanPy [tutorial](https://github.com/stan-dev/cmdstanpy/blob/develop/cmdstanpy_tutorial.ipynb) notebook.

In [1]:
import os
from cmdstanpy import cmdstan_path, CmdStanModel, CmdStanMCMC, CmdStanGQ

bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
stan_file = os.path.join(bernoulli_dir, 'bernoulli.stan')
data_file = os.path.join(bernoulli_dir, 'bernoulli.data.json')

# instantiate, compile bernoulli model
model = CmdStanModel(stan_file=stan_file)
print(model.code())

INFO:cmdstanpy:found newer exe file, not recompiling


data {
  int<lower=0> N;
  array[N] int<lower=0,upper=1> y; // or int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}



The input data consists of `N`, the number of Bernoulli trials, and `y`, the list of observed outcomes.
Inspection of the data shows 2 successes in 10 trials, an observed success rate of 20%.

In [2]:
# examine bernoulli data
import json
import statistics
with open(data_file, 'r') as fp:
    data_dict = json.load(fp)
print(data_dict)
print('mean of y: {}'.format(statistics.mean(data_dict['y'])))

{'N': 10, 'y': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
mean of y: 0.2


As in the "Hello World" tutorial, we produce a sample from the posterior of the model conditioned on the data:

In [3]:
# fit the model to the data
fit = model.sample(data=data_file)

INFO:cmdstanpy:CmdStan start processing


INFO:cmdstanpy:CmdStan done processing.





The fitted model produces an estimate of `theta`, the chance of success:

In [4]:
fit.summary()

| name | Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lp__ | -7.3 | 0.02 | 0.73 | -8.8 | -7.0 | -6.8 | 1400.0 | 25000.0 | 1.0 |
| theta | 0.25 | 0.0031 | 0.12 | 0.083 | 0.24 | 0.48 | 1500.0 | 26000.0 | 1.0 |
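
The summary can be sanity-checked analytically. Because the `beta(1,1)` prior is conjugate to the Bernoulli likelihood, the posterior for `theta` is a Beta distribution with parameters 1 + (number of successes) and 1 + (number of failures). The sketch below computes its mean and standard deviation from the standard Beta formulas; the variable names are illustrative.

```python
import math

y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # observed outcomes from the data file
successes = sum(y)
failures = len(y) - successes

# A Beta(1, 1) prior updated by the Bernoulli likelihood gives a Beta(a, b) posterior.
a = 1 + successes
b = 1 + failures

post_mean = a / (a + b)
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(round(post_mean, 2), round(post_sd, 2))  # prints: 0.25 0.12
```

Both values agree with the `Mean` and `StdDev` entries reported for `theta` above, up to rounding.
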


To run a posterior predictive check, we add a `generated quantities` block to the model, in which we generate a new data vector `y_rep` from each posterior draw of `theta`. The resulting model is in the file [bernoulli_ppc.stan](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli_ppc.stan).

In [5]:
model_ppc = CmdStanModel(stan_file='bernoulli_ppc.stan')
print(model_ppc.code())

INFO:cmdstanpy:compiling stan file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/examples/bernoulli_ppc.stan to exe file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/examples/bernoulli_ppc


INFO:cmdstanpy:compiled model executable: /home/runner/work/cmdstanpy/cmdstanpy/docsrc/examples/bernoulli_ppc


data { 
  int<lower=0> N; 
  int<lower=0,upper=1> y[N];
} 
parameters {
  real<lower=0,upper=1> theta;
} 
model {
  theta ~ beta(1,1);
  y ~ bernoulli(theta);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = bernoulli_rng(theta);
}
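
To make the `generated quantities` block concrete, the following sketch mimics it in plain Python: for each value of `theta`, it draws `N` independent Bernoulli outcomes, just as the loop over `bernoulli_rng(theta)` does. The `theta` values, the seed, and the helper name `simulate_y_rep` are assumptions made for illustration; Stan uses its own random number generator, so the outcomes will not match Stan's.

```python
import random


def simulate_y_rep(theta, n, rng):
    """Mirror of the Stan loop: n independent Bernoulli(theta) outcomes."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]


rng = random.Random(1234)        # arbitrary fixed seed for repeatability
theta_draws = [0.31, 0.22, 0.25]  # hypothetical posterior draws of theta
N = 10

# One replicated data set per posterior draw.
y_reps = [simulate_y_rep(theta, N, rng) for theta in theta_draws]
for theta, y_rep in zip(theta_draws, y_reps):
    print(theta, y_rep)
```

Because each replicate uses a different draw of `theta`, the collection of `y_rep` arrays reflects both the sampling variability of the outcomes and the posterior uncertainty about `theta`.
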



We run the `generate_quantities` method on `bernoulli_ppc` using the existing sample `fit` as input.  The `generate_quantities` method takes the values of `theta` in the `fit` sample as the set of draws from the posterior used to generate the corresponding `y_rep` quantities of interest.

The arguments to the `generate_quantities` method are:
 + `data`  - the data used to fit the model
 + `mcmc_sample` - either a `CmdStanMCMC` object or a list of stan-csv files


In [6]:
new_quantities = model_ppc.generate_quantities(data=data_file, mcmc_sample=fit)

INFO:cmdstanpy:Chain [1] start processing


INFO:cmdstanpy:Chain [1] done processing


INFO:cmdstanpy:Chain [2] start processing


INFO:cmdstanpy:Chain [2] done processing


INFO:cmdstanpy:Chain [3] start processing


INFO:cmdstanpy:Chain [3] done processing


INFO:cmdstanpy:Chain [4] start processing


INFO:cmdstanpy:Chain [4] done processing


The `generate_quantities` method returns a `CmdStanGQ` object which contains the values for all variables in the generated quantities block of the program ``bernoulli_ppc.stan``.  Unlike the output from the ``sample`` method, it doesn't contain any information on the joint log probability density, the sampler state, or the values of the parameters and transformed parameters.

In this example, each draw consists of an N-length array `y_rep`, a replicate of the `bernoulli` model's input variable `y`, which is itself an N-length array of Bernoulli outcomes.

In [7]:
print(new_quantities.draws().shape, new_quantities.column_names)
for i in range(3):
    print(new_quantities.draws()[i,:])

(1000, 4, 10) ('y_rep[1]', 'y_rep[2]', 'y_rep[3]', 'y_rep[4]', 'y_rep[5]', 'y_rep[6]', 'y_rep[7]', 'y_rep[8]', 'y_rep[9]', 'y_rep[10]')
[[1. 1. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 1. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
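
A posterior predictive check compares a statistic of the replicated data with the same statistic of the observed data. The sketch below does this for the mean, using only the twelve replicated arrays printed above (3 draws by 4 chains), so the resulting fraction is illustrative and not an estimate from the full sample. The helper `tail_fraction` is a hypothetical name introduced here.

```python
from statistics import mean

y_obs = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # observed data, mean 0.2

# The twelve replicated data sets shown in the output above.
y_reps = [
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def tail_fraction(replicates, observed, stat=mean):
    """Fraction of replicates whose statistic is at least the observed one."""
    t_obs = stat(observed)
    return sum(1 for rep in replicates if stat(rep) >= t_obs) / len(replicates)


frac = tail_fraction(y_reps, y_obs)
print(frac)
```

To apply the same check to every draw, the `(1000, 4, 10)` array returned by `draws()` could first be flattened with NumPy's `reshape(-1, 10)`, giving one row per replicated data set.
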


We can also use ``draws_pd(inc_sample=True)`` to get a pandas DataFrame which combines the input drawset with the generated quantities.

In [8]:
sample_plus = new_quantities.draws_pd(inc_sample=True)
print(type(sample_plus),sample_plus.shape)
names = list(sample_plus.columns.values[7:18])
sample_plus.iloc[0:3, :]

<class 'pandas.core.frame.DataFrame'> (4000, 18)


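|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -6.8697 | 0.868387 | 0.905522 | 2.0 | 3.0 | 0.0 | 7.94676 | 0.314757 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -6.78844 | 0.97961 | 0.905522 | 1.0 | 3.0 | 0.0 | 6.99758 | 0.215631 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | -6.74808 | 1.0 | 0.905522 | 2.0 | 3.0 | 0.0 | 6.78658 | 0.248604 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |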


For models as simple as the Bernoulli models here, it would be trivial to re-run the sampler and generate a new sample that contains both the estimates of the parameter `theta` and the `y_rep` values. For models that are difficult to fit, i.e., when producing a sample is computationally expensive, the `generate_quantities` method is preferred because it reuses the existing draws instead of running the sampler again.