In [1]:
# import packages
import os
import subprocess
import numpy as np


# Rank test 
This notebook describes the pipeline for evaluating the Metropolis-Hastings algorithm in the paper "Stochastic phylogenetic models of shape".

When carrying out the rank test, we do the following:
1) Sample kernel parameters from the prior distribution.
2) For each parameter combination, simulate a data set and add observation noise, so that we simulate from the complete model.
3) For each data set, infer the posterior distribution using our MCMC algorithm.
4) Evaluate convergence of the MCMC chains using Gelman-Rubin diagnostics.
5) For each dimension of the posterior, compute the rank statistic and plot the distribution of all rank statistics.

In [2]:
# define ranktest settings
directory = '_ranktest'
num_datasets = 2

# create the output directory if it does not exist yet
os.makedirs(directory, exist_ok=True)

# define priors on parameters and variance of the Gaussian noise
prior_alpha = (0.0005, 0.03) # min, max
prior_sigma = (0.7, 1.3) # min, max

# define simulation and MCMC settings 
dt = 0.05
gamma = 0.001 
treepath = 'data/chazot_subtree.nw'
rootshape = 'data/hercules_forewing_n=20.csv'
sti = 1

# MCMC settings (variables refer to paper)
tau_alpha = 0.005
tau_sigma = 0.05
n_iter = 100
lambd = 0.8
burnin = 10 

##### 1. Sample kernel parameters from prior
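As a rough illustration of this step, the sketch below draws parameter pairs under the assumption that both priors are uniform on their (min, max) intervals. It is not the contents of sample_pars.py; the function name `sample_uniform_prior` and the `seed` argument are hypothetical.

```python
import numpy as np

def sample_uniform_prior(n, alpha_bounds, sigma_bounds, seed=None):
    """Draw n (alpha, sigma) pairs, each uniform on its (min, max) interval.

    Assumption: uniform priors; the real sample_pars.py may differ.
    """
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(alpha_bounds[0], alpha_bounds[1], size=n)
    sigma = rng.uniform(sigma_bounds[0], sigma_bounds[1], size=n)
    return np.column_stack([alpha, sigma])  # one row per data set: (alpha, sigma)

example_pars = sample_uniform_prior(2, (0.0005, 0.03), (0.7, 1.3), seed=0)
print(example_pars.shape)
```

Each row then corresponds to one simulated data set in step 2.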

In [3]:
subprocess.run(['python', 'sample_pars.py', 
                f'-n {num_datasets}', 
                '-palpha', f'{prior_alpha[0]}',f'{prior_alpha[1]}',
                '-psigma', f'{prior_sigma[0]}',f'{prior_sigma[1]}',
                '-o', f'{directory+"/"}'
                ])

CompletedProcess(args=['python', 'sample_pars.py', '-n 2', '-palpha', '0.0005', '0.03', '-psigma', '0.7', '1.3', '-o', '_ranktest/'], returncode=0)

##### 2. Simulate a data set for each parameter

We can either simulate the data sets by running a bash script in a terminal (remember to substitute the variables in "{}" with their actual values) or simulate them directly from this notebook, as in the cells below.

In [4]:
# alternatively we can simulate from this notebook. 
pars = np.genfromtxt(f'{directory}/alpha:{prior_alpha[0]}-{prior_alpha[1]}_sigma:{prior_sigma[0]}-{prior_sigma[1]}.csv')
pars.shape

(2, 2)

In [5]:
for i in range(pars.shape[0]):
  subprocess.run(['python', 'simulate_data.py', 
                '-dt', f'{dt}',
                '-a', f'{pars[i,0]}',
                '-s', f'{pars[i,1]}',
                '-ov', f'{gamma}',
                '-root', f'{rootshape}',
                '-o', f'{directory}/simdata', 
                '-simtree', f'{treepath}', 
                '-sti', f'{sti}', # whether to apply the Stratonovich-Ito correction
                '-rb', '1'  
                  ])

!! no data seed given
simulation seed: 96716799255318284


here() starts at /Users/lkn315/Documents/stoch_phyl_mod_shape


> args <- commandArgs(trailingOnly = TRUE)
> 
> # do what we want 
> tree = read.tree(paste(args[1], '.nw', sep=''))
> vcv_ = vcv(tree)
> write.table(vcv_, file=paste(args[1],'_vcv.csv', sep=''), row.names=F, col.names=F)
> 
!! no data seed given
simulation seed: 10581440337633583


here() starts at /Users/lkn315/Documents/stoch_phyl_mod_shape


> args <- commandArgs(trailingOnly = TRUE)
> 
> # do what we want 
> tree = read.tree(paste(args[1], '.nw', sep=''))
> vcv_ = vcv(tree)
> write.table(vcv_, file=paste(args[1],'_vcv.csv', sep=''), row.names=F, col.names=F)
> 


##### 3. Run MCMC to get posterior
This section shows how to run MCMC from the notebook. In practice we do not run MCMC from a notebook; the cells below only illustrate how to call the mcmc.py script.

A few comments regarding the mcmc.py script: 

Some of the variables in the code are not named in accordance with the paper (see below):
- gtheta = sigma 
- kalpha = alpha 
- obs_var = gamma 
- mirrored Gaussian = reflected Gaussian

In the paper, parameters are updated by evaluating $g_s(x_s; \theta)$; in the code this quantity is called "logrhorilde" rather than g_s. Since we do not infer the root in the rank test, "super root" and "root" are used interchangeably.
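
To make the term "reflected Gaussian" concrete, here is a minimal sketch of a random-walk proposal whose Gaussian step is folded back into a bounded interval. It assumes the reflected Gaussian serves as the parameter proposal with step size tau and that the interval is the prior support; the helper names are hypothetical and the code is not taken from mcmc.py.

```python
import numpy as np

def reflect_into_interval(x, lower, upper):
    """Fold x back into [lower, upper] by repeated reflection at the bounds."""
    width = upper - lower
    y = np.mod(x - lower, 2.0 * width)           # position within one period of length 2*width
    y = np.where(y > width, 2.0 * width - y, y)  # mirror the second half of the period back
    return float(lower + y)

def reflected_gaussian_proposal(current, tau, lower, upper, rng):
    """Propose current + tau * N(0, 1), reflected at the interval bounds."""
    return reflect_into_interval(current + tau * rng.standard_normal(), lower, upper)

rng = np.random.default_rng(0)
proposal = reflected_gaussian_proposal(1.25, 0.05, 0.7, 1.3, rng)
print(proposal)
```

A proposal built this way never leaves the interval and remains symmetric in the current and proposed values, so no proposal-density correction is needed in the acceptance ratio.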


In [6]:
# Widths (max - min) of the prior intervals passed to mcmc.py. Rounding to 6 decimals
# only removes floating-point noise; round(..., 2) would turn 0.0295 into 0.03.
alpha_width = round(prior_alpha[1] - prior_alpha[0], 6)
sigma_width = round(prior_sigma[1] - prior_sigma[0], 6)

mcmc_cmd = (f'screen -md -S ranktest python mcmc.py -N {n_iter} -l {lambd} -dt {dt} '
            f'-datapath {directory}/simdata -tau_sigma {tau_sigma} -tau_alpha {tau_alpha} '
            f'-palpha {prior_alpha[0]} {alpha_width} -psigma {prior_sigma[0]} {sigma_width} '
            f'-ov {gamma} -super_root {rootshape} -o {directory}/runs -ds $seed')

with open(f'{directory}.sh', 'w') as rsh:
    # the script reads a data seed from stdin and starts three mcmc.py runs for it
    rsh.write('#!/bin/bash\nread seed\n\n' + 3 * (mcmc_cmd + '\n'))

We submit the script generated above by running the line below on the command line.

In [None]:
# ls {directory}/simdata | parallel --memsuspend 5G -j 50% "echo {} | bash {directory}.sh"

##### 4. Evaluate convergence 
We evaluate convergence and plot posterior distribution.
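
For orientation, the sketch below computes the classic (non-split) Gelman-Rubin potential scale reduction factor for a single scalar quantity from several chains. It is an illustration only; the diagnostics script may use a different variant, and the function name `gelman_rubin` is hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (m chains, n draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
    B = n * chain_means.var(ddof=1)        # between-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled estimate of the posterior variance
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(3, 1000))                        # three chains, same target
stuck = mixed + np.array([[0.0], [5.0], [10.0]])          # chains centred in different places
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values close to 1 indicate that the chains agree with each other; values well above 1 indicate that they have not mixed.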

In [None]:
# ! ls _ranktest/runs | while read seed; do python diagnostics_1-ranktest.py -folder_runs _ranktest/runs/$seed -folder_simdata _ranktest/simdata/$seed -MCMC_iter 100 -burnin 10 -nnodes 9 -phylogeny data/chazot_subtree.nw; done 

To inspect the distribution of the bias and the overall convergence diagnostics, run the notebook 4.1-ranktest/summary-plots.ipynb.

##### 5. Compute ranks and do rank plots
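As a reminder of what a rank statistic is, the sketch below counts how many posterior draws fall below the value used to simulate the data, and checks on a toy conjugate Gaussian model (not the shape model) that the ranks come out roughly uniform when the posterior is exact. The function name `rank_statistic` and the toy model are assumptions for illustration and are not taken from calc_ri.py.

```python
import numpy as np

def rank_statistic(posterior_draws, true_value):
    """Number of posterior draws strictly smaller than the simulating value."""
    return int(np.sum(np.asarray(posterior_draws) < true_value))

# Toy check: theta ~ N(0, 1), y | theta ~ N(theta, 1), so theta | y ~ N(y / 2, 1 / 2).
rng = np.random.default_rng(1)
n_draws = 19  # ranks take values 0, ..., n_draws
ranks = []
for _ in range(2000):
    theta = rng.normal(0.0, 1.0)                              # draw from the prior
    y = rng.normal(theta, 1.0)                                # simulate one observation
    draws = rng.normal(y / 2.0, np.sqrt(0.5), size=n_draws)   # exact posterior draws
    ranks.append(rank_statistic(draws, theta))

counts = np.bincount(ranks, minlength=n_draws + 1)  # histogram of the ranks
print(counts)
```

With MCMC output the draws are autocorrelated, so the chain is usually thinned after burn-in before the rank is computed.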


In [8]:
# run script for computing ranks 
! python calc_ri.py -N {n_iter} -burnin {burnin} -folder_runs {directory}/runs -folder_simdata {directory}/simdata -nnodes 9 -nxd 40

3
_ranktest/runs/20139000411276579/
_ranktest/runs/49053031123053395/
_ranktest/runs/9230576093721682/


In [10]:
# run script for plotting ranks 
! python plot_ranks.py -folder_runs {directory}/runs -tree data/chazot_subtree.nw

niepelti
theseus
hercules
amphitryon
telemachus
[3, 4, 5, 7, 8]
[0, 1, 2, 6]
