# CmdStanPy Tutorial


### Workflow Outline

Given a dataset and a model specification written as a Stan program, the CmdStanPy workflow is:

#### Assemble input data as either:
  + A Python `dict` object consisting of key-value pairs, where each key corresponds to a Stan data variable and each value is of the correct type and shape.
  + An existing data file on disk in either JSON or Rdump format.
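
Both input options can be sketched in a few lines of plain Python. The example below builds the `dict` form and then writes it to a JSON file, giving the on-disk form. It assumes a model that declares an integer `N` and an integer array `y` of length `N`, as the Bernoulli example used later does. The helper name `write_json_data` is hypothetical and is not part of CmdStanPy.

```python
import json
import os
import tempfile

# Data for a Bernoulli model: N trials and their binary outcomes.
bern_data = {"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}

# Cheap sanity check before handing data to Stan: the declared size
# must match the length of the outcome array.
assert bern_data["N"] == len(bern_data["y"])


def write_json_data(data, path):
    """Write a dict of Stan data variables to a JSON file (hypothetical helper)."""
    with open(path, "w") as fd:
        json.dump(data, fd)
    return path


# Write the dict to disk and read it back to confirm nothing was lost.
with tempfile.TemporaryDirectory() as tmpdir:
    data_file = write_json_data(
        bern_data, os.path.join(tmpdir, "bernoulli.data.json")
    )
    with open(data_file) as fd:
        round_trip = json.load(fd)
```

Checking sizes in Python before sampling tends to give a clearer error than a failure reported by the compiled model at run time.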

#### Compile the model
  + A `Model` object stores the filepath of the Stan program.
  + Method `compile` translates the Stan program to C++ then calls the C++ compiler.

#### Fit the model to the data, sampling from the posterior
  + The `Model` class method `sample` invokes Stan's NUTS-HMC sampler to condition the model on the input data. It returns a `StanFit` object which contains a set of draws from the posterior plus metadata.
  + Runs any number of chains - default is 4 chains.
  + The output of each chain is stored on disk as a Stan csv file.

#### Summarize and check the fit
   + The `StanFit` class method `summary` invokes CmdStan's `stansummary` utility. It returns a pandas DataFrame with estimates of posterior means, standard deviations, Monte Carlo standard errors, effective sample sizes, and the R-hat convergence diagnostic for all parameters in the model.
   + The `StanFit` class method `diagnose` invokes CmdStan's `diagnose` utility which checks for the following problems:
    + transitions that hit the maximum treedepth
    + divergent transitions
    + low E-BFMI values (estimated Bayesian fraction of missing information, a sign that the sampler's transitions explore the HMC energy distribution inefficiently)
    + low effective sample sizes
    + high R-hat values
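
To give a feel for what a high R-hat value means, here is a minimal sketch of the classic (non-split) potential scale reduction statistic, which compares between-chain and within-chain variance. This is an illustration only: the function name `basic_rhat` is hypothetical, the draws are made up, and the variant reported by CmdStan's utilities may be computed differently.

```python
from statistics import mean, variance


def basic_rhat(chains):
    """Classic potential scale reduction for equal-length chains of one parameter.

    `chains` is a list of lists, one inner list of draws per chain.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return (pooled / within) ** 0.5


# Two chains exploring the same region: R-hat stays near (or below) 1.
mixed = [[0.1, 0.3, 0.2, 0.4], [0.2, 0.4, 0.1, 0.3]]

# Two chains stuck in different regions: R-hat is well above 1.
stuck = [[0.1, 0.3, 0.2, 0.4], [0.7, 0.9, 0.8, 1.0]]

rhat_mixed = basic_rhat(mixed)
rhat_stuck = basic_rhat(stuck)
```

When the chains disagree about where the posterior mass lies, the between-chain term dominates and the statistic grows, which is the situation the `diagnose` check is meant to flag.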

####  Assemble the sample in-memory
  + The resulting sample is accessed via the `StanFit` object:
    + `sample`  - all draws from all chains, stored as a 3-D numpy.ndarray.
    + `chains` - number of chains run by sampler
    + `draws` - draws per chain
    + `column_names` - names of the parameters, transformed parameters, and generated quantities variables returned in each draw
    + `csv_files` - list of Stan csv output files which comprise the sample
  + The method `get_drawset` flattens the 3-D sample array into a 2-D pandas.DataFrame for downstream analysis.
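
The flattening step can be sketched with NumPy alone. The example below takes a small synthetic array with dimensions (draws, chains, columns) and stacks the chains one after another into a 2-D array with one draw per row. The function name `flatten_draws` is hypothetical, and the chain-by-chain row order is an assumption made for illustration; it is not a statement about the order `get_drawset` uses.

```python
import numpy as np


def flatten_draws(sample):
    """Stack a (draws, chains, columns) array into (chains * draws, columns).

    Rows are ordered chain by chain: all draws of chain 0, then chain 1, ...
    """
    draws, chains, columns = sample.shape
    # Move the chain axis first so that a row-major reshape keeps
    # each chain's draws together.
    return sample.transpose(1, 0, 2).reshape(chains * draws, columns)


# Synthetic sample: 3 draws, 2 chains, 2 output columns.
sample = np.arange(12, dtype=float).reshape(3, 2, 2)
flat = flatten_draws(sample)
```

A 2-D array in this form can be passed to `pandas.DataFrame` together with a list of column names to obtain a labelled table.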


### Installation

* Install Python package from PyPI or directly from GitHub:

  + `pip install --upgrade cmdstanpy`
  + `pip install -e git+https://github.com/stan-dev/cmdstanpy#egg=cmdstanpy`


* CmdStanPy uses CmdStan directly to compile and run Stan programs, therefore CmdStan must be installed locally.

  + If you have a working installation of CmdStan, set the environment variable `CMDSTAN` to the full path of the top-level CmdStan directory.
  
  + If you don't already have CmdStan installed, run the Python script `install_cmdstan`, which downloads and compiles the latest release from https://github.com/stan-dev/cmdstan/releases.  By default this installs the latest version of CmdStan in the location `~/.cmdstanpy`.  The flags `-d` and `-v` specify the directory and version, respectively.
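
The two options above amount to a simple lookup rule: prefer the `CMDSTAN` environment variable, otherwise look under the default install location. The sketch below illustrates that rule in plain Python. The function `resolve_cmdstan_dir` is hypothetical and only mirrors the description in this section; it is not how CmdStanPy itself locates CmdStan.

```python
import os


def resolve_cmdstan_dir(environ=None):
    """Return the directory in which to look for CmdStan (hypothetical helper).

    Uses the CMDSTAN environment variable when it is set, otherwise the
    default install location ~/.cmdstanpy.
    """
    environ = os.environ if environ is None else environ
    configured = environ.get("CMDSTAN")
    if configured:
        return os.path.abspath(os.path.expanduser(configured))
    return os.path.expanduser(os.path.join("~", ".cmdstanpy"))


# Explicit setting wins; an empty environment falls back to the default.
from_env = resolve_cmdstan_dir({"CMDSTAN": "/opt/cmdstan"})
fallback = resolve_cmdstan_dir({})
```

Passing the environment in as a dictionary keeps the function easy to test without modifying the real process environment.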


### Example 1:  Compile and run CmdStan example model `bernoulli.stan`, data `bernoulli.data.json`

Every CmdStan release has an `examples/bernoulli` directory which contains the Stan model and test data files.  In this example we compile the model and run the sampler on the model and data.

##### Import relevant classes and methods

In [None]:
import os
import os.path
from cmdstanpy import Model, StanFit, cmdstan_path

##### Compile model, specify data or data file

The CmdStan directory `examples/bernoulli` contains the model and data files.

In [None]:
bernoulli_path = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')
bernoulli_model = Model(stan_file=bernoulli_path)
bernoulli_model.compile()
print(bernoulli_model)

Input data is either a Python `dict` with entries corresponding to input data values, or a file in JSON or Rdump format.

In [None]:
# bernoulli_path points at the .stan file, so join against its directory
bern_json = os.path.join(os.path.dirname(bernoulli_path), 'bernoulli.data.json')

If a `dict` is specified, CmdStanPy writes it to a temp file in JSON format.

In [None]:
bern_data = { "N" : 10, "y" : [0,1,0,0,0,0,0,0,0,1] }

##### Run the NUTS-HMC sampler on the model and data

The `sample` function runs the NUTS-HMC sampler and returns a `StanFit` object.

In [None]:
bern_fit = bernoulli_model.sample(data=bern_data)

By default, the `sample` function runs 4 sampler chains.  The `chains` argument specifies the number of chains to run.  The `cores` argument specifies the number of processes to run in parallel.

In [None]:
bern_fit = bernoulli_model.sample(chains=5, cores=3, data=bern_data)

##### Summarize or save the results

The `summary` function returns the output of CmdStan's `bin/stansummary` utility as a pandas.DataFrame:

In [None]:
bern_fit.summary()

The `diagnose` function prints diagnostics to console:

In [None]:
bern_fit.diagnose()

The `get_drawset` function returns a pandas.DataFrame, one draw per row.

In [None]:
bern_drawset = bern_fit.get_drawset()

By default, `get_drawset` returns a DataFrame which contains all columns from the sampler's csv output file, i.e., it contains both the sampler state and the values for all parameter, transformed parameter, and generated quantities variables.

In [None]:
bern_drawset.shape, bern_drawset.columns

The `get_drawset` function argument `params` takes a list of parameter or column names:

In [None]:
thetas = bern_fit.get_drawset(params=['theta'])
thetas.shape


In [None]:
thetas[0:3]

In [None]:
bern_drawset.theta.plot.density()

### Access to sampler output via `StanFit` methods and attributes

#### sample

The `sample` property is a 3-D numpy ndarray which contains all draws across all chains.  This array is created only as needed; therefore the first time that this property is accessed CmdStanPy will read in the contents of the sampler's csv output files.  Because the csv output files also contain stepsize and metric information, the `stepsize` and `metric` arrays will also be created.

The ndarray is stored in column-major format, so the values for each parameter are contiguous in memory, as are all draws from a single chain.  The dimensions of the ndarray are arranged as (draws, chains, columns):

In [None]:
bern_fit.sample
bern_fit.sample.shape
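
The effect of this layout can be checked on a synthetic array. The sketch below allocates a column-major (Fortran-order) array with illustrative sizes of 1000 draws, 4 chains, and 8 columns, then inspects its strides. The sizes are assumptions chosen for the example and are not read from an actual fit.

```python
import numpy as np

# Illustrative sizes only: draws per chain, chains, output columns.
draws, chains, columns = 1000, 4, 8

# order='F' requests column-major storage: the first index (draws)
# varies fastest in memory.
sample = np.zeros((draws, chains, columns), dtype=np.float64, order='F')

# Bytes to step along each dimension: one 8-byte float between
# consecutive draws, a full chain between chains, and all chains
# between columns.
strides = sample.strides

# All draws of one column from one chain form a contiguous block.
one_column_one_chain = sample[:, 0, 7]
is_contiguous = one_column_one_chain.flags['C_CONTIGUOUS']
```

Because consecutive draws sit next to each other in memory, per-parameter, per-chain operations read a single contiguous block.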

Python's index slicing operations can be used to access the information by chain.
For example, to select all draws and all output columns from the first chain,
we specify the chain index (2nd index dimension).  As array indexing starts at 0,
the index '0' corresponds to the first chain in the `StanFit` object.

The following expression selects the first 3 draws from the first chain for the parameter `theta`:

In [None]:
bern_fit.column_names[7], bern_fit.sample[0:3,0,7]
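
Rather than hard-coding the column index, the position of a parameter can be looked up by name. The sketch below uses a stand-in tuple of column names, assumed here to be the seven sampler-state columns followed by `theta`; in a live session `bern_fit.column_names` would take its place.

```python
# Stand-in for bern_fit.column_names (assumed layout: sampler state
# columns first, then the model parameter).
column_names = (
    'lp__', 'accept_stat__', 'stepsize__', 'treedepth__',
    'n_leapfrog__', 'divergent__', 'energy__', 'theta',
)

# Look up the column position by name instead of remembering the number.
theta_idx = column_names.index('theta')
```

The resulting index can then be used in the slicing expression, for example `bern_fit.sample[0:3, 0, theta_idx]`, and it stays correct if the set of output columns changes.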

#### stepsize

The `stepsize` property is a 1-D numpy ndarray which contains the stepsize used by the sampler for each chain.  This array is created at the same time as the `sample` and `metric` arrays are created.

At the end of adaptation, the stepsize for each chain in this example is:

In [None]:
bern_fit.stepsize

#### metric_type, metric

The `metric` property is a numpy ndarray which contains the metric used by the sampler for each chain.  This array is created at the same time as the `sample` and `stepsize` arrays are created.

At the end of adaptation, the metric for each chain in this example is:

In [None]:
bern_fit.metric_type,  bern_fit.metric