# MCMC Sampling

The `CmdStanModel` class method `sample` invokes Stan's adaptive HMC-NUTS
sampler, which uses the Hamiltonian Monte Carlo (HMC) algorithm
and its adaptive variant, the no-U-turn sampler (NUTS), to produce a set of
draws from the posterior distribution of the model parameters conditioned on the data.
It returns a `CmdStanMCMC` object
which provides properties to retrieve information about the sample, as well as methods
to run CmdStan's summary and diagnostics tools.

In order to evaluate the fit of the model to the data, it is necessary to run
several Monte Carlo chains and compare the set of draws returned by each.
By default, the `sample` command runs 4 sampler chains, i.e.,
CmdStanPy invokes CmdStan 4 times.
CmdStanPy uses Python's `subprocess` and `multiprocessing` libraries
to run these chains in separate processes.
This processing can be done in parallel, up to the number of
processor cores available.
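
The idea of one operating-system process per chain, with parallelism capped by the core count, can be sketched with the standard library alone. This is an illustration under stated assumptions, not CmdStanPy's implementation: a Python one-liner stands in for the compiled Stan executable, a thread pool is used to launch the processes, and the names `run_chain` and `run_chains` are hypothetical.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_chain(chain_id: int) -> str:
    """Run one stand-in 'chain' as a separate OS process and return its output."""
    # Assumption: a tiny Python one-liner plays the role of the Stan executable.
    cmd = [sys.executable, "-c", f"print('chain {chain_id} done')"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_chains(chains: int = 4) -> list:
    """Run all chains, using at most one worker per available core."""
    workers = max(1, min(chains, os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chain order, regardless of finishing order.
        return list(pool.map(run_chain, range(1, chains + 1)))


outputs = run_chains(4)
print(outputs)
```

On a machine with fewer than four cores, the same four chains still run, but some of them wait until a worker is free.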

## Fitting a model to data

In this example we use the CmdStan example model
[bernoulli.stan](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli.stan)
and data file
[bernoulli.data.json](https://github.com/stan-dev/cmdstanpy/blob/master/test/data/bernoulli.data.json).

We instantiate a `CmdStanModel` from the Stan program file:

In [2]:
import os
from cmdstanpy.model import CmdStanModel
from cmdstanpy.utils import cmdstan_path
    
bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
stan_file = os.path.join(bernoulli_dir, 'bernoulli.stan')
data_file = os.path.join(bernoulli_dir, 'bernoulli.data.json')

# instantiate, compile bernoulli model
model = CmdStanModel(stan_file=stan_file)

INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:compiled model file: /Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli


By default, the model is compiled during instantiation.  The compiled executable is created in the same directory as the program file.  If the directory already contains an executable file with a newer timestamp, the model is not recompiled.

We run the sampler on the data using all default settings: 4 chains, each of which runs 1000 warmup iterations followed by 1000 sampling iterations.

In [3]:
# run CmdStan's sample method, returns object `CmdStanMCMC`
fit = model.sample(data=data_file)

INFO:cmdstanpy:sampling: ['/Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli', 'id=1', 'random', 'seed=8532', 'data', 'file=/Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli.data.json', 'output', 'file=/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-1-u5imbo2z.csv', 'method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 4
INFO:cmdstanpy:sampling completed


The `sample` method returns a `CmdStanMCMC` object, which contains:
- metadata
- draws
- HMC tuning parameters `metric`, `step_size`

In [4]:
print('sampler diagnostic variables:\n{}'.format(fit.metadata.method_vars_cols.keys()))
print('stan model variables:\n{}'.format(fit.metadata.stan_vars_cols.keys()))

sampler diagnostic variables:
dict_keys(['lp__', 'accept_stat__', 'stepsize__', 'treedepth__', 'n_leapfrog__', 'divergent__', 'energy__'])
stan model variables:
dict_keys(['theta'])


In [5]:
fit.summary()

| name | Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -7.3 | 0.021 | 0.8 | -9.0 | -7.0 | -6.8 | 1500.0 | 27000.0 | 1.0 |
| theta | 0.25 | 0.0031 | 0.12 | 0.071 | 0.23 | 0.47 | 1500.0 | 27000.0 | 1.0 |
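
The `R_hat` column is the statistic that compares the chains with one another: values close to 1 indicate that the chains agree. As a rough sketch of the idea, the block below computes the basic (non-split) potential scale reduction factor for a single parameter from per-chain draws. The function name `r_hat` and the synthetic Gaussian draws are assumptions made for illustration, and CmdStan's summary tool may use a refined variant of this statistic, so the numbers need not match its output exactly.

```python
import random
from statistics import mean, variance


def r_hat(chains):
    """Basic potential scale reduction factor for one parameter.

    `chains` is a list of equally long lists of draws, one list per chain.
    """
    n = len(chains[0])                              # draws per chain
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)      # W: average within-chain variance
    between = n * variance(chain_means)             # B: between-chain variance
    pooled = (n - 1) / n * within + between / n     # pooled variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(1)
# Four chains drawn from the same distribution: they should agree.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Same draws, but the last chain is shifted, as if it were stuck elsewhere.
stuck = mixed[:3] + [[x + 3.0 for x in mixed[3]]]

print(f"agreeing chains:    {r_hat(mixed):.3f}")
print(f"one shifted chain:  {r_hat(stuck):.3f}")
```

The shifted chain inflates the between-chain variance, which pushes the statistic well above 1.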


The sampling data from the fit can be accessed either as a `numpy` array or a pandas `DataFrame`:

In [6]:
print(fit.draws().shape)
fit.draws_pd().head()

(1000, 4, 8)


| | lp__ | accept_stat__ | stepsize__ | treedepth__ | n_leapfrog__ | divergent__ | energy__ | theta |
|---|---|---|---|---|---|---|---|---|
| 0 | -7.13333 | 0.980314 | 0.931626 | 1.0 | 1.0 | 0.0 | 7.17119 | 0.152238 |
| 1 | -7.13333 | 0.798976 | 0.931626 | 1.0 | 3.0 | 0.0 | 8.95325 | 0.152238 |
| 2 | -7.70447 | 0.907049 | 0.931626 | 1.0 | 1.0 | 0.0 | 7.70487 | 0.108057 |
| 3 | -7.19878 | 1.0 | 0.931626 | 2.0 | 3.0 | 0.0 | 7.62639 | 0.145411 |
| 4 | -6.83651 | 0.949836 | 0.931626 | 1.0 | 3.0 | 0.0 | 7.47493 | 0.304867 |


Additionally, if `xarray` is installed, this data can be accessed another way:

In [7]:
fit.draws_xr()

The `fit` object records the command, the return code,
and the paths to the sampler output CSV and console files.
The string representation of this object displays the CmdStan commands and
the location of the output files.

Output filenames are composed of the model name, a timestamp
in the form YYYYMMDDhhmmss, and the chain id, plus the corresponding
filetype suffix: '.csv' for the CmdStan output or '-stdout.txt' for
the console messages, e.g. `bernoulli-20210926162709-1.csv`. Output files
written to the temporary directory contain an additional 8-character
random string, e.g. `bernoulli-20210926162709-1-u5imbo2z.csv`.
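
Because the names follow a fixed pattern, they can be taken apart again with a regular expression. The sketch below is a hypothetical helper, not part of CmdStanPy: the pattern `OUTPUT_NAME` and the function `parse_output_name` are inferred only from file names like those shown in this section, so the accepted timestamp length and the character set of the random string are assumptions.

```python
import re

# Assumed layout: <model>-<timestamp>-<chain>[-<8 random chars>]<suffix>
OUTPUT_NAME = re.compile(
    r"^(?P<model>.+)-(?P<stamp>\d{12,14})-(?P<chain>\d+)"
    r"(?:-(?P<rand>[A-Za-z0-9_]{8}))?"
    r"(?P<suffix>-stdout\.txt|\.csv|\.txt)$"
)


def parse_output_name(filename):
    """Split an output file name into its parts; raise if it does not match."""
    match = OUTPUT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized output file name: {filename!r}")
    parts = match.groupdict()
    parts["chain"] = int(parts["chain"])
    return parts


plain = parse_output_name("bernoulli-201912081451-1.csv")
temp = parse_output_name("bernoulli-20210926162709-4-1_jnrmm6.csv")
console = parse_output_name("bernoulli-20210926162709-1-u5imbo2z-stdout.txt")
print(plain)
print(temp)
print(console)
```

A file without the random string yields `rand` equal to `None`, which is one way to tell user-directed output from temporary output.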

In [8]:
fit

CmdStanMCMC: model=bernoulli chains=4['method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']
 csv_files:
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-1-u5imbo2z.csv
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-2-5ci36cqs.csv
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-3-igmruz5j.csv
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-4-1_jnrmm6.csv
 output_files:
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-1-u5imbo2z-stdout.txt
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-2-5ci36cqs-stdout.txt
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-3-igmruz5j-stdout.txt
	/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926162709-4-1_jnrmm6-stdout.txt

By default, the sampler output files are written to a temporary directory
which is deleted when the session exits; specify the `output_dir` argument
to write them to a permanent location instead.
The `save_csvfiles` method moves the CmdStan CSV output files
to a specified directory without having to re-run the sampler.
The console output files are not saved. These files are treated as ephemeral; if the sample is valid, all relevant information is recorded in the CSV files.
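
The lifecycle described here (files created in a temporary directory, then moved somewhere durable before the directory disappears) can be illustrated with the standard library. This is a sketch of the idea only and does not use CmdStanPy: `save_outputs` is a hypothetical helper, and the small CSV files are stand-ins for real sampler output.

```python
import shutil
import tempfile
from pathlib import Path


def save_outputs(csv_files, target_dir):
    """Move files into target_dir (created if needed); return their new paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.move(str(f), str(target / Path(f).name))) for f in csv_files]


with tempfile.TemporaryDirectory() as keep_dir:
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Stand-in for sampler output: one small CSV file per chain.
        csv_files = []
        for chain in range(1, 5):
            path = Path(tmp_dir) / f"bernoulli-{chain}.csv"
            path.write_text("lp__,theta\n-7.0,0.25\n")
            csv_files.append(path)
        saved = save_outputs(csv_files, keep_dir)
    # The temporary directory has been deleted; the moved files remain.
    survived = [p.exists() for p in saved]
    tmp_gone = not Path(tmp_dir).exists()

print(survived, tmp_gone)
```

With a real fit object, the `save_csvfiles` method plays the role of `save_outputs` here.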

### Sampler Progress

Your model may take a long time to fit. The `sample` method provides two arguments for monitoring progress:

- visual progress bar: `show_progress=True`
- stream CmdStan output to the console: `show_console=True`
    
To illustrate how progress bars work, we will run the bernoulli model. Since the progress bars are only visible while the sampler is running and the bernoulli model takes no time at all to fit, we run this model for 200K iterations, in order to see the progress bars in action.

In [12]:
fit = model.sample(data=data_file, iter_warmup=100000, iter_sampling=100000, show_progress=True)


INFO:cmdstanpy:sampling: ['/Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli', 'id=1', 'random', 'seed=17676', 'data', 'file=/Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli.data.json', 'output', 'file=/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926163244-1-s71bymaa.csv', 'method=sample', 'num_samples=100000', 'num_warmup=100000', 'algorithm=hmc', 'adapt', 'engaged=1']


INFO:cmdstanpy:sampling completed





The Stan language `print` statement can be used to monitor the Stan program state.
In order to see this information as the sampler is running, use the `show_console=True` argument.
This will stream all CmdStan messages written to both stdout and stderr to the terminal while the sampler is running.


In [18]:
fit = model.sample(data=data_file, chains=2, parallel_chains=1, show_console=True)



INFO:cmdstanpy:sampling: ['/Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli', 'id=1', 'random', 'seed=36218', 'data', 'file=/Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli.data.json', 'output', 'file=/var/folders/db/4jnggnf549s42z50bd61jskm0000gq/T/tmp_i0ae0g7/bernoulli-20210926164335-1-g60x_8fb.csv', 'method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:sampling completed


chain 1: method = sample (Default)
chain 1: sample
chain 1: num_samples = 1000 (Default)
chain 1: num_warmup = 1000 (Default)
chain 1: save_warmup = 0 (Default)
chain 1: thin = 1 (Default)
chain 1: adapt
chain 1: engaged = 1 (Default)
chain 1: gamma = 0.050000000000000003 (Default)
chain 1: delta = 0.80000000000000004 (Default)
chain 1: kappa = 0.75 (Default)
chain 1: t0 = 10 (Default)
chain 1: init_buffer = 75 (Default)
chain 1: term_buffer = 50 (Default)
chain 1: window = 25 (Default)
chain 1: algorithm = hmc (Default)
chain 1: hmc
chain 1: engine = nuts (Default)
chain 1: nuts
chain 1: max_depth = 10 (Default)
chain 1: metric = diag_e (Default)
chain 1: metric_file =  (Default)
chain 1: stepsize = 1 (Default)
chain 1: stepsize_jitter = 0 (Default)
chain 1: id = 1
chain 1: data
chain 1: file = /Users/mitzi/.cmdstan/cmdstan-2.27.0/examples/bernoulli/bernoulli.data.json
chain 1: init = 2 (Default)
chain 1: random
chain 1: seed = 36218
chain 1: output
chain 1: file = /var/folders/db/4jn
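
The streaming behaviour described above, in which every line written to stdout or stderr is echoed as it arrives and tagged with its chain, can be sketched with `subprocess.Popen`. This is a generic illustration rather than CmdStanPy's code: `stream_process` is a hypothetical name, and a short Python child process stands in for the sampler.

```python
import subprocess
import sys


def stream_process(cmd, prefix):
    """Echo each output line as it arrives, tagged with a prefix; return the lines."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    ) as proc:
        for line in proc.stdout:
            tagged = f"{prefix}: {line.rstrip()}"
            print(tagged)
            lines.append(tagged)
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"{prefix} exited with code {proc.returncode}")
    return lines


# Assumption: this child process imitates a sampler writing to both streams.
child = "import sys; print('num_samples = 1000'); print('a warning', file=sys.stderr)"
captured = stream_process([sys.executable, "-c", child], prefix="chain 1")
```

The relative order of stdout and stderr lines depends on how the child buffers its output, so only the set of lines is predictable.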

## Running a data-generating model using `fixed_param=True`

In this example we use the CmdStan example model
[bernoulli_datagen.stan](https://github.com/stan-dev/cmdstanpy/blob/master/docs/notebooks/bernoulli_datagen.stan)
to generate a simulated dataset given fixed data values.

In [None]:
model_datagen = CmdStanModel(stan_file='bernoulli_datagen.stan')
datagen_data = {'N':300, 'theta':0.3}
fit_sim = model_datagen.sample(data=datagen_data, fixed_param=True)
fit_sim.summary()

Compute and plot a histogram of the total number of successes in `N` Bernoulli trials with chance of success `theta`:

In [None]:
drawset_pd = fit_sim.draws_pd()
drawset_pd.columns

# restrict to columns over new outcomes of N Bernoulli trials
y_sims = drawset_pd.drop(columns=['lp__', 'accept_stat__'])

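# plot total number of successes per draw, with one bin edge per possible count;
# `bins` is passed by keyword so the call does not depend on the positional
# parameter order of `plot.hist`, which differs across pandas versions
y_sums = y_sims.sum(axis=1)
y_sums.astype('int32').plot.hist(bins=range(0, datagen_data['N'] + 1))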
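
As a cross-check on what the histogram should look like, the same simulation can be sketched in plain Python. The Stan program itself is not shown here, so this assumes it draws `N` independent Bernoulli outcomes with success probability `theta` for every draw; `simulate_successes` is a hypothetical helper and the seed is arbitrary.

```python
import random
from collections import Counter


def simulate_successes(n_trials, theta, n_draws, seed=None):
    """For each draw, count successes in n_trials Bernoulli(theta) trials."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < theta for _ in range(n_trials))
        for _ in range(n_draws)
    ]


sums = simulate_successes(n_trials=300, theta=0.3, n_draws=1000, seed=1234)
average = sum(sums) / len(sums)
print(f"mean successes per draw: {average:.1f}")

# Text histogram in bins of 10 successes; each '#' is roughly 10 draws.
bins = Counter(10 * (s // 10) for s in sums)
for start in sorted(bins):
    print(f"{start:3d}-{start + 9:3d} {'#' * (bins[start] // 10)}")
```

The totals should cluster around `N * theta`, which is 90 for the data used above.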