
Conformational Sampling of the P53 Peptide in Explicit Solvent


by Adam Pratt

Contact: Adam.J.Pratt@pitt.edu

The video of the presentation on this subject can be found on YouTube.

Requirements: WESTPA 1.0 beta and GROMACS 4.6.5

Results from this tutorial can be compared to those of the extensive brute-force simulations reported in Xiong et al., J. Phys. Chem. A (2011).


Overview

This tutorial serves as an example of how to do a more advanced WESTPA simulation of a molecular-scale system; specifically, the conformational sampling of a peptide in explicit solvent using the GROMACS dynamics engine. As such, much of the tutorial is focused on what is different from the Na+/Cl- tutorial, and what complications are involved in a more complex system. To follow along successfully, the user should have already run the introductory tutorial in order to be familiar with the basics of WESTPA and weighted ensemble.

Required Files & Initial Setup

To run a WESTPA simulation of the p53 peptide, you will need a number of files for both WESTPA and the dynamics engine (GROMACS). To obtain all of these files, clone the following repository:

p53-tutorial

by using the following set of commands:

cd ~/
git clone https://github.com/ajoshpratt/p53-tutorial
As the tutorial has been designed for the WESTPA user's workshop, env.sh has been set up to run on the Frank cluster at the Center for Simulation and Modeling (SAM) at the University of Pittsburgh. Users wishing to get a head start can initialize the simulation as follows:
cd p53-tutorial
./init.sh
and then submit to the cluster:
qsub runwe-frank.sh

Preparing the GROMACS files

| Input File | Description |
| --- | --- |
| p53.top | topology |
| conf.itp | topology include file |
| ions.itp | ion topology; necessary due to differences in GROMACS |
| p53.tpr | binary run input file |
| p53.ndx | GROMACS index file |
| md.mdp | MD parameter file used for all segments |

Unlike the original Na+/Cl- tutorial, this simulation uses explicit solvent, and as such velocities are not generated for new trajectories; therefore, only one .mdp file is necessary. There is a new file, however: the index file. GROMACS is intelligent enough to auto-generate groups for many simple systems (Na+/Cl- being one of them) when using programs such as g_dist, but for proteins or other complex systems (particularly those with a binding partner) where the user wishes to perform more involved calculations, it is generally desirable to create an index file.
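If you do need an index file for your own system, it can be built interactively. Below is a minimal sketch using GROMACS 4.6's make_ndx; the group selections you type at its prompt depend on your system, and the tutorial repository already provides a suitable p53.ndx, so this is only needed when adapting the workflow elsewhere.

# Interactive creation of custom index groups (GROMACS 4.6.5).
# p53.ndx is already provided with this tutorial; only needed for your own systems.
make_ndx -f p53.tpr -o p53.ndx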

Some extra topology files have been included to work around minor differences in how the force field parameters are handled.

md.mdp

;   15 ps NPT production with Velocity Rescaling thermostat and Berendsen barostat
;   Template Written by Karl Debiec on 13-03-14, modified by Adam Pratt on 01-17-15
;##################################################### INTEGRATOR ######################################################
integrator              = md                                
ld_seed                 = -1                                ; Random seed
dt                      = 0.002                             ; Timestep (ps)
nsteps                  = 7500                              ; Simulation duration (timesteps)
nstcomm                 = 375                               ; Center of mass motion removal interval (timesteps)
comm-grps               = system                            ; Remove center of mass motion of system
;###################################################### ENSEMBLE #######################################################
Pcoupl                  = berendsen                         ; Berendsen barostat
tcoupl                  = v-rescale
ref_p                   = 1                                 ; System Pressure (Bar)
tau_p                   = 0.5                               ; Barostat time constant (ps)
compressibility         = 4.5e-5                            ; Compressibility (Bar -1)
refcoord_scaling        = com                               ; Scale center of mass of reference coordinates with box
ref_t                   = 303 303                           ; System temperature (K)
tau_t                   = 0.1 0.1                           ; Thermostat time constant (ps)
tc_grps                 = Protein Non-Protein               ; Apply thermostat to system
;gen_vel                 = no
continuation            = yes
;################################################ BONDED INTERACTIONS ##################################################
constraints             = hbonds                            ; Constrain bonds involving hydrogen
constraint_algorithm    = LINCS                             ; Constrain bonds using LINCS
lincs_iter              = 1                                 ; Number of LINCS iterations
;############################################### NONBONDED INTERACTIONS ################################################
coulombtype             = PME                               ; PME long-range electrostatics
fourierspacing          = 0.1                               ; PME FFT grid spacing (nm)
pme_order               = 6                                 ; PME interpolation order
ewald_rtol              = 0.000001                          ; Relative strength of the Ewald-shifted direct potential at rcoulomb
pbc                     = xyz                               ; Periodic boundary condition
rcoulomb                = 1.0                               ; Short-range electrostatic cutoff (nm)
;rcoulomb_switch         = 0.9                               ; Short-range electrostatic switch cutoff (nm)
vdwtype                 = Switch                            ; Switch van der Waals interactions
rvdw                    = 0.9                               ; Van der Waals cutoff (nm)
rvdw_switch             = 0.8                               ; Van der Waals switch cutoff (nm)
DispCorr                = EnerPres                          ; Long-range dispersion correction to energy and pressure
ns_type                 = grid                              ; Update neighbor list using grid
nstlist                 = 10                                ; Neighbor list update interval (timesteps)
rlist                   = 1.00                              ; Neighbor list cut-off (nm)
;####################################################### OUTPUT ########################################################
nstlog                  = 750                               ; Energy log output interval (timesteps)
nstenergy               = 750                               ; Energy output interval (timesteps)
nstcalcenergy           = 75                                ; Energy calculation interval (timesteps)
nstxout                 = 7500                              ; Full-resolution trajectory output interval (timesteps)
nstvout                 = 7500                              ; Full-resolution velocity output interval (timesteps)
nstfout                 = 7500                              ; Full-resolution force output interval (timesteps)
nstxtcout               = 750                               ; Reduced-resolution trajectory output interval (timesteps)
xtc-precision           = 10000                             ; Reduced-resolution trajectory output precision
While the original paper used the Nosé–Hoover thermostat with a Parrinello-Rahman barostat, those options are not suitable for a weighted ensemble run, as a stochastic element is required for split trajectories to diverge. Therefore, this tutorial uses the stochastic velocity-rescaling thermostat (time constant of 0.1 ps) with a weak Berendsen barostat; these options have been tested internally and deemed suitable.

The number of progress coordinate points per segment (pcoord_len in system.py) is determined by the options chosen here: with nsteps = 7500 and nstxtcout = 750, each segment stores 10 frames plus the initial frame, for 11 points.

Picking an appropriate τ value

The rule of thumb behind picking an appropriate τ value is to choose a timescale that will capture the system's slowest motion that is relevant to your observable of interest, whether that is the diffusion of individual ligands, vibration of side chains, etc. Choosing an inappropriate τ value will not make your simulation wrong, however; it will simply hurt efficiency.

Things to edit

You must ensure that everything is compatible with your chosen τ value (15 ps, here), such as the output frequency options for the xtc file, forces, energies, etc. Each of these should be at most equal to nsteps, and should divide evenly into nsteps (see the sanity-check sketch after this list). The following options are the ones to adjust:

nsteps
nstcomm
nstlog
nstenergy
nstcalcenergy
nstxout
nstvout
nstfout
nstxtcout
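As a quick sanity check, the following minimal bash sketch (using the values from md.mdp above; adjust the numbers if you change τ) confirms that each output interval divides nsteps evenly and that the xtc interval gives the expected number of progress coordinate points:

# Values taken from md.mdp: nsteps = 7500; intervals are nstcomm (375),
# nstlog/nstenergy/nstxtcout (750), nstcalcenergy (75), nstxout/nstvout/nstfout (7500).
nsteps=7500
for interval in 375 750 75 7500; do
    echo "nsteps % $interval = $(( nsteps % interval ))"    # 0 means the interval divides nsteps evenly
done
echo "pcoord points per segment: $(( nsteps / 750 + 1 ))"   # should match pcoord_len (11) in system.py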

Preparing the WESTPA files

For this section, it may be helpful to compare the files to those in the original GROMACS Na+/Cl- tutorial, as there are significant changes to many of them. If you are familiar with vim, running

vimdiff file1 file2
will give you a line-by-line comparison in a split-screen view, showing which lines have been added, removed, or changed.
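For example, assuming the Na+/Cl- tutorial is checked out next to this one (the second path below is hypothetical), you could compare the two runseg.sh scripts with:

# The second path is hypothetical; point it at your copy of the Na+/Cl- tutorial.
vimdiff ~/p53-tutorial/westpa_scripts/runseg.sh ~/nacl_gmx/westpa_scripts/runseg.sh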

init.sh

#!/bin/bash
source env.sh
ps aux | grep w_run | grep -v grep
pkill -9 -f w_run

SFX=.d$$
mv      traj_segs{,$SFX}
mv      seg_logs{,$SFX}
mv      istates{,$SFX}
rm -Rf traj_segs$SFX seg_logs$SFX istates$SFX & disown %1
rm -f system.h5 west.h5 seg_logs.tar
mkdir seg_logs traj_segs istates

BSTATE_ARGS="--bstate-file bstates/BASIS_STATES"

$WEST_ROOT/bin/w_init $BSTATE_ARGS --segs-per-state 1 \
  --work-manager=serial "$@"
As this is an equilibrium simulation, no target state/bin is specified. In addition, as there is no recycling and the simulation is started from a delta distribution, there is no reason to create more than one initial state per basis state. The rest of the script simply cleans up prior runs in a sane manner, and kills any running simulations owned by the current user.

What is BASIS_STATES?

One of the files in the bstates directory is BASIS_STATES; this contains information about the basis states (in this case, there is only one), and it is fed to w_init with the --bstate-file flag (see init.sh above) during the initialization step of WESTPA to generate initial states. It is formatted as follows:

internal_name relative_probability reference
where reference can be used in a manner determined entirely by the west.cfg file. In the Na+/Cl- tutorial, the bstates.txt file served the same purpose and pointed simply to nacl.gro within the same directory. Here, the reference points to a directory within bstates, and west.cfg and get_pcoord.sh have been modified appropriately to use the file name within that directory. Our BASIS_STATES looks like the following:
p53_bound_conf 1 p53
This indicates that within bstates there is a reference called p53, which will be internally referred to as p53_bound_conf and pulled with a probability of 1. A basis state file is useful when there are multiple conformations you may wish to start from, or when there is a known probability distribution inside a particular state. You need only assign an appropriate probability to each reference, one per line. If the probabilities do not sum to 1, WESTPA will renormalize them. For another example (obviously not an appropriate file for this tutorial), see the following.
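As a purely hypothetical illustration (the names and weights below are invented for this example and are not part of the tutorial), a multi-conformation basis state file could look like:

folded_conf    0.7    folded
extended_conf  0.2    extended
misfolded_conf 0.1    misfolded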

runseg.sh

There are significant changes to this file: not only are we returning auxiliary data (the end-to-end distance) and dealing with imaging issues, we are also working with a more expensive system that may put more stress on the communication infrastructure of your supercomputing cluster. While these changes aren't strictly necessary to get the system up and running, they are noted nonetheless.

We'll run through the file in chunks.

Changes to use the local scratch

if [ -n "$SEG_DEBUG" ] ; then
    set -x
    env | sort
fi

cd $WEST_SIM_ROOT

# Set up the run
mkdir -pv $WEST_CURRENT_SEG_DATA_REF
cd $WEST_CURRENT_SEG_DATA_REF

if [[ "$USE_LOCAL_SCRATCH" == "1" ]] ; then
    # make scratch directory
    WORKDIR=$SCRATCHROOT/$WEST_CURRENT_SEG_DATA_REF
    $SWROOT/bin/mkdir -pv $WORKDIR || exit 1
    cd $WORKDIR || exit 1
    STAGEIN="$SWROOT/bin/cp -avL"
else
    STAGEIN="$SWROOT/bin/ln -sv"
fi


function cleanup() {
    # Clean up.  Copy back what we want, and remove the rest.
    # Also, remove our copied in parent references.  We don't need to keep that.
    $SWROOT/bin/rm -f none.xtc whole.xtc $REF parent.*
    if [[ "$USE_LOCAL_SCRATCH" == "1" ]] ; then
        $SWROOT/bin/cp *.{cpt,xtc,trr,edr,tpr,gro,log,xvg} $WEST_CURRENT_SEG_DATA_REF || exit 1
        cd $WEST_CURRENT_SEG_DATA_REF
        $SWROOT/bin/rm -Rf $WORKDIR
    else
        # Here, we're not using local scratch.  Remove some specific things, in that case.
        $SWROOT/bin/rm -f *.itp *.mdp *.ndx *.top
    fi
}

# Regardless of the reason we exit, run the function cleanup.
trap cleanup EXIT
1. Enter the simulation directory.
2. Create the directory that will hold the trajectory segment data.
3. Change into that directory.

In env.sh, we have defined whether or not to use the local scratch space on the individual compute node, and have created a variable which holds the appropriate location, if such a space exists on this supercomputer. If $USE_LOCAL_SCRATCH is set to 1:

4. Create the trajectory segment directory on the scratch space of the local node.
5. Change into that directory.

We then define a variable called $STAGEIN (set to cp if we're using the local scratch space on the node, or ln if we're not), which is used to stage the files we need into our working directory. If using the local scratch space, copying in the parent trajectory data and topology information generally results in significantly less network communication than writing dynamics information to the shared file system as it is calculated.
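For reference, the scratch-related settings in env.sh might look something like the following sketch; the values here are placeholders, so check the env.sh shipped with the tutorial for what is actually used on Frank:

# Hypothetical excerpt from env.sh; values depend on your cluster.
export USE_LOCAL_SCRATCH=1           # 1: run dynamics in node-local scratch; 0: run on the shared filesystem
export SCRATCHROOT=/scratch/$USER    # root of the node-local scratch space
export SWROOT=""                     # prefix for core utilities, so $SWROOT/bin/cp resolves to /bin/cp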

In addition, the cleanup code is modified: if the simulation is not run on the node's local scratch, we remove the files we don't want; if it is, we copy back only what we want to the $WEST_CURRENT_SEG_DATA_REF location and then delete the scratch directory once we're finished. From this point on, runseg.sh works the same regardless of whether we are using the local scratch or not.

Trajectory Initialization & mdrun

    SEG_INITPOINT_NEWTRAJ)
        # Initiation of a new trajectory
        # In truth, there's very little difference between a new trajectory
        # and an old one, except we handle our istates a little differently
        # than a previous segment, and use the .edr file.  
        # For an explicit solvent simulation,
        # all trajectories are considered continuations.
        # We are also copying in the basis state as the imaged ref.
        # $WEST_PARENT_DATA_REF contains the reference to the
        #   appropriate basis or initial state
        $STAGEIN $WEST_PARENT_DATA_REF.edr ./parent.edr
        $STAGEIN $WEST_PARENT_DATA_REF.gro ./parent.gro
        $STAGEIN $WEST_PARENT_DATA_REF.trr ./parent.trr
        $STAGEIN $WEST_PARENT_DATA_REF.gro ./parent_imaged.gro
        $STAGEIN $GMX_CFG/* .
        $GROMPP -f $MDP -c parent.gro -e parent.edr -p $TOP \
          -t parent.trr -o seg.tpr -po md_out.mdp
    ;;

    *)
        # This should never fire.
        echo "unknown init point type $WEST_CURRENT_SEG_INITPOINT_TYPE"
        exit 2
    ;;
esac

# Propagate segment
# It's easiest to set our OpenMP thread count manually here.
export OMP_NUM_THREADS=1
$MDRUN -s   seg.tpr -o seg.trr -c  seg.gro -e seg.edr \
       -cpo seg.cpt -g seg.log -x  seg.xtc -nt 1

Like the implicit solvent simulations, new trajectories and continuations must be handled differently; this is despite the fact that there is little difference between the two, from a technical perspective (a new trajectory can simply be thought of as a continuation after the equilibration step in a new ensemble). There are a few significant differences here, however:

1. No new velocities are generated for a new trajectory. Velocities already exist due to the previous solvent equilibration step and have been stored.

2. New trajectories use the energy file from the basis state, whereas continuations use the checkpoint file. The checkpoint file is the preferred method of continuing simulations in newer versions of GROMACS, and saves full resolution state information from both the thermostat and the barostat, in addition to coordinates, velocities, etc.

3. A file named 'parent_imaged.gro' is called later on in runseg.sh, and so regardless of what type of trajectory this is, it must exist; continuations merely $STAGEIN that file, whereas new trajectories don't have parents and must decide on how to handle that scenario. As the basis state they are being pulled from is, in this case, guaranteed to be within the box (more on this below), it is okay to use it as the reference state.
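For comparison, a continuation branch consistent with the description above might look like the following sketch; the staged file names (seg.gro, seg.cpt, imaged_ref.gro) follow from the mdrun and imaging commands elsewhere in this script, but consult the runseg.sh in the repository for the authoritative version:

    SEG_INITPOINT_CONTINUES)
        # Continuation of an existing trajectory (hypothetical sketch):
        # stage in the parent's final coordinates, checkpoint, and final
        # imaged frame, then build the new tpr from the checkpoint.
        $STAGEIN $WEST_PARENT_DATA_REF/seg.gro        ./parent.gro
        $STAGEIN $WEST_PARENT_DATA_REF/seg.cpt        ./parent.cpt
        $STAGEIN $WEST_PARENT_DATA_REF/imaged_ref.gro ./parent_imaged.gro
        $STAGEIN $GMX_CFG/* .
        $GROMPP -f $MDP -c parent.gro -p $TOP \
          -t parent.cpt -o seg.tpr -po md_out.mdp
    ;;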

Once that's finished, mdrun is called (after setting a variable to force it to run with 1 thread; this is easier on Frank than fiddling with the thread options on mdrun).

Imaging & Progress Coordinate Return

    # Image the system correctly.
    COMMAND="0 \n"
    echo -e $COMMAND \
      | $TRJCONV    -f seg.xtc     -s parent_imaged.gro  -n $NDX -o none.xtc        -pbc none || exit 1
    echo -e $COMMAND \
      | $TRJCONV    -f none.xtc    -s parent_imaged.gro  -n $NDX -o whole.xtc       -pbc whole || exit 1
    echo -e $COMMAND \
      | $TRJCONV    -f whole.xtc   -s parent_imaged.gro  -n $NDX -o nojump.xtc      -pbc nojump || exit 1
    echo -e $COMMAND \
      | $TRJCONV    -f nojump.xtc  -s seg.tpr            -n $NDX -o imaged_ref.gro  -b -1 || exit 1

    # Update the command, then calculate the first dimension of the progress coordinate: end to end distance.
    COMMAND="18 \n 19 \n"
    echo -e $COMMAND \
      | $G_DIST -f seg.xtc -s seg.tpr -o dist.xvg -xvg none -n $NDX || exit 1
    cat dist.xvg | awk '{print $2*10;}' > $WEST_END_TO_END_DIST_RETURN

    # Update the command again, then run g_rms to calculate the second dimension: the heavy-atom RMSD of the protein aligned on itself.
    COMMAND="2 \n 2 \n"
    echo -e $COMMAND \
      | $G_RMS -s $REF -f nojump.xtc -n $NDX -xvg none || exit 1
    cat rmsd.xvg | awk '{print $2*10;}' > $WEST_PCOORD_RETURN

fi

# Output coordinates.  While we can return coordinates, this is expensive (data size) for a system of this size
# and so by default, it is off for this system.  However, by modifying the variable COMMAND, the group
# which has its coordinates returned can be modified and reduce the cost, so it is sensible to leave it in.

if [ ${WEST_COORD_RETURN} ]; then
    COMMAND="0 \n"
    if [ ${TRJCONV} ]; then
        # For GROMACS 4, use trjconv
        echo -e $COMMAND | $TRJCONV -f seg.trr -s seg.tpr -o seg.pdb
    fi
    cat seg.pdb | grep 'ATOM' \
      | awk '{print $6, $7, $8}' > $WEST_COORD_RETURN
fi

# Output log
if [ ${WEST_LOG_RETURN} ]; then
    cat seg.log \
      | awk '/Started mdrun/ {p=1}; p; /A V E R A G E S/ {p=0}' \
      > $WEST_LOG_RETURN
fi

There are 4 distinct things that happen within this block of code:

1. Image the system.
2. Save the final, correctly imaged frame.
3. Calculate the end to end center of mass distance of the caps, and return it (in Angstroms) as an auxiliary coordinate.
4. Calculate the heavy atom RMSD of the P53 peptide after aligning on itself (heavy atom), and return it (in Angstroms) as the progress coordinate 
   (which is loaded by a custom data loader; see the section on system.py).
Coordinates and log file parsing have been disabled for this system, but the functionality exists and should work, in general, for a large system (with adjustment of atom groups, nfields, etc).

While exact imaging requirements tend to vary between systems, the above is a good baseline and should work for many systems. The most important aspect of this imaging procedure is that, when imaging, trjconv is referencing the final correctly imaged frame of the parent trajectory (which is a correctly imaged version of frame 0 of the current trajectory) so it knows where the protein/waters/solutes should be at time 0, and whether the protein should be placed back within the box.

Without this correctly imaged parent reference structure, GROMACS images everything relative to frame 0 of the input trajectory file. In that case, even if the imaging commands are otherwise correct, a protein that left the box during the previous segment will start and stay outside of the box for this iteration (the parent's pcoord will still be correct), and the RMSD calculations will be artificially high, resulting in large, frustrating discontinuities in both the free energy profile and any visualized trajectory. Passing in the final, correctly imaged frame as a reference ensures that if the protein did leave the box, GROMACS will put it back in the box at the start of the current segment, and no such artifacts will result.

g_dist handles periodic boundary conditions correctly, regardless of imaging; it does not matter whether nojump.xtc or seg.xtc is passed in.

The imaging steps are as follows:

1. Remove periodic boundary conditions.  This effectively removes the box, allowing molecules to diffuse out of it.
2. Make whole any molecules that are split across a boundary.  Instead of existing across boundaries, they will now 'jump'.
3. Remove jumps, referencing the parent image.  With the parent image in hand, GROMACS knows to simply allow a protein
   or solvent molecule to diffuse across a boundary, rather than having it jump to the other side of the box.
4. Save the final, correctly imaged frame (imaged_ref.gro), which becomes the parent reference for this segment's children.
Once that is all complete, the appropriate metric is calculated (using g_dist and g_rms), converted into Angstroms, and written to its respective return variable ($WEST_PCOORD_RETURN or $WEST_END_TO_END_DIST_RETURN, each of which points to a temporary file generated by WESTPA on the server node); from there it is loaded by whatever loader is specified in west.cfg.
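If WESTPA ever reports a shape mismatch when loading the progress coordinate, a quick check (a hypothetical debugging aid, run by hand in a segment's working directory) is to count the values being returned and compare against pcoord_len in system.py:

# Hypothetical debugging aid: each file should contain one value per saved
# frame plus the initial frame, i.e. 11 lines for this tutorial's settings.
echo "pcoord (rmsd) values:       $(wc -l < rmsd.xvg)"
echo "end-to-end distance values: $(wc -l < dist.xvg)"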

west.cfg

# The master WEST configuration file for a simulation.
# vi: set filetype=yaml :
---
west: 
  system:
    driver:      system.System
    module_path: $WEST_SIM_ROOT
  propagation:
    max_total_iterations: 100
    max_run_wallclock:    6:00:00
    propagator:           executable
    gen_istates:          false
  data:
    west_data_file: west.h5
    datasets:
      - name:        pcoord
        scaleoffset: 4
      - name:        coord
        dtype:       float32
        scaleoffset: 3
      - name:        log
        dtype:       float32
        scaleoffset: 4
    data_refs:
      segment:       $WEST_SIM_ROOT/traj_segs/{segment.n_iter:06d}/{segment.seg_id:06d}
      basis_state:   $WEST_SIM_ROOT/bstates/{basis_state.auxref}/eq3
      initial_state: $WEST_SIM_ROOT/istates/{initial_state.iter_created}/{initial_state.state_id}.gro
  plugins:
  executable:
    environ:
      PROPAGATION_DEBUG: 1
    datasets:
      - name:    pcoord
        enabled: true
        loader:  system.pcoord_loader_color_tracker
      - name:    coord
        loader:  system.coord_loader
        enabled: false
      - name:    log
        loader:  system.log_loader
        enabled: false
      - name:    end_to_end_dist
        enabled: true
    propagator:
      executable: $WEST_SIM_ROOT/westpa_scripts/runseg.sh
      stdout:     $WEST_SIM_ROOT/seg_logs/{segment.n_iter:06d}-{segment.seg_id:06d}.log
      stderr:     stdout
      stdin:      null
      cwd:        null
      environ:
        SEG_DEBUG: 1
    get_pcoord:
      executable: $WEST_SIM_ROOT/westpa_scripts/get_pcoord.sh
      stdout:     /dev/null
      stderr:     stdout
    gen_istate:
      executable: $WEST_SIM_ROOT/westpa_scripts/gen_istate.sh
      stdout:     /dev/null
      stderr:     stdout
    post_iteration:
      enabled:    true
      executable: $WEST_SIM_ROOT/westpa_scripts/post_iter.sh
      stderr:     stdout
    pre_iteration:
      enabled:    false
      executable: $WEST_SIM_ROOT/westpa_scripts/pre_iter.sh
      stderr:     stdout
Very little has changed here; however, it's worth pointing out that the pcoord is being treated like any other dataset, and a custom loader is being used for it (system.pcoord_loader_color_tracker). This is so that the progress coordinate can be analyzed and on-the-fly state information generated and saved as a second dimension. When data is returned to $WEST_PCOORD_RETURN in runseg.sh, it now bypasses the normal, built-in pcoord loader and uses this one instead. In addition, each basis state reference is a directory under bstates (set up during the initialization step), and the files inside it share the prefix eq3 (see the basis_state entry under data_refs).
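Putting the basis_state entry under data_refs together with the files staged in by runseg.sh for a new trajectory, the bstates directory is expected to look roughly like this (the extensions are inferred from the $STAGEIN lines in runseg.sh; confirm against the repository):

bstates/
├── BASIS_STATES
└── p53/
    ├── eq3.gro
    ├── eq3.edr
    └── eq3.trr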

gen_istates is set to false. In this instance, we have already prepared a suitable starting structure and run it through energy minimization and solvent equilibration, and so the bstates are simply copied in as istates. While you could probably set up your WESTPA scripts to do this automatically, it is ill-advised for explicit solvent, as it can be a complicated procedure that requires human input and a lot of time.

system.py

# Standard WESTPA 1.0 imports, restored here for completeness; check the
# repository's system.py for the exact set it uses.
import numpy
import westpa
from west import WESTSystem
from westpa.binning import RectilinearBinMapper

import logging
log = logging.getLogger(__name__)
log.debug('loading module %r' % __name__)

class System(WESTSystem):
    """
    System for P53 folding and unfolding.
    """

    def initialize(self):
        """
        Initializes system
        """
        self.pcoord_ndim  = 2
        self.pcoord_len   = 11
        self.pcoord_dtype = numpy.float32
        # As the RMSD coordinate is taken relative to the coil, aligned on the coil,
        # it will remain sensitive to coil changes.  It's best to assume the maximum is
        # not dissimilar to the maximum for the distance; something around 57 A, as
        # that would take into account the peptide flipping completely around.
        # However, we must bin much finer.
        self.rmsd_binbounds         = [0.0+0.4*i for i in xrange(0,19)] + \
                                      [8.0+0.8*i for i in xrange(0,19)] + \
                                      [24.0+11.0*i for i in xrange(0,3)] + [float('inf')]

        # It's best not to place these at the integer boundaries, due to 
        # oddities with the way numpy/h5py stores the values inside the west.h5 file.
        # Given that we are starting in the coil conformation, the 'unknown state'
        # (that is, 1.5 to float, or 2) will never be used; our bins will never be more
        # than 66% filled.

        self.color_binbounds = [-0.5,0.5,1.5,float('inf')]

        # A simple rectilinear binmapper, with the second dimension as color, to ensure good sampling.
        self.bin_mapper   = RectilinearBinMapper([self.rmsd_binbounds, self.color_binbounds])

        self.bin_target_counts      = numpy.empty((self.bin_mapper.nbins,),
                                        numpy.int)
        self.bin_target_counts[...] = 4

def pcoord_loader_color_tracker(fieldname, coord_file, segment, single_point=False):
    """
    This function loads a 1-dimensional progress coordinate, performs some logic to track color,
    then returns the 2 dimensional progress coordinate to the system to be processed.
    In this tutorial, there are 2 dimensions specified in this file; runseg.sh returns one of them (the RMSD).
    The second (the color/state tag) is calculated here.
    Note that we are defining our states only based on one progress coordinate dimension, in this example.

    **Arguments:**
        :*fieldname*:      Key at which to store dataset
        :*coord_filename*: Temporary file from which to load coordinates
        :*segment*:        WEST segment
        :*single_point*:   Data to be stored for a single frame
                           (True when called from get_pcoord.sh)
    """

    # These are the raw coordinates.
    coord_raw = numpy.loadtxt(coord_file, dtype=numpy.float32) 
    # These are the states; they are left inclusive, and right exclusive, which is consistent with the normal
    # binning procedure.
    # It's difficult to ascertain what is truly 'folded' and 'unfolded' for these without a prior
    # free energy profile; thankfully, we just need some rough estimates.  In the worst case scenario,
    # we devolve to the original Huber and Kim sampling scheme.
    color_bins = [(0.0,2.0),(15.0,float('inf'))]
    unknown_state = 2
    system = westpa.rc.get_system_driver()

    if single_point == True:
        npts = 1
    else:
        npts = system.pcoord_len

    coords = numpy.empty((npts), numpy.float32)
    colors = numpy.empty((npts), numpy.float32)
    #coords = numpy.empty((npts,system.pcoord_ndim), numpy.float32)
    #colors = numpy.empty((npts), numpy.float32)
    if single_point == True:
        colors[:] = unknown_state
        for istate,state_tuple in enumerate(color_bins):
            # Note that here, we are using the first dimension and first dimension alone.
            # The shape of the returned coord_raw is slightly different if single_point evaluates
            # to true.
            # We evaluate whether or not we're in a state; if not, we leave it in the
            # unknown state.
            # Swap this line to enable an N-dimensional pcoord, using the 1st dimension
            # as the state definition.
            #if coord_raw[0] >= state_tuple[0] and coord_raw[0] < state_tuple[1]:
            if coord_raw >= state_tuple[0] and coord_raw < state_tuple[1]:
                colors[:] = istate
        coords[:] = coord_raw[...]
    else:
        # If we're not the first point, we set the state to be what it was in the beginning
        # of the iteration.  We only want to update the state when we update a bin for purposes
        # of state tracking.
        # Swap lines to enable multiple pcoord dimensions, then change dimensions.
        #colors[:] = segment.pcoord[0][2]
        colors[:] = segment.pcoord[0][1]
        coords[:] = coord_raw[...]

    for istate,state_tuple in enumerate(color_bins):
        #if coords[-1,0] >= state_tuple[0] and coords[-1,0] < state_tuple[1]:
        if coords[-1] >= state_tuple[0] and coords[-1] < state_tuple[1]:
            colors[-1] = istate

    # We require different stacking behavior to return things in the proper order
    # depending on how many points we have.  I could probably clean this up.
    if single_point == True:
        # Again, swap lines.
        #segment.pcoord = numpy.hstack((coords[0,0],coords[0,1],colors[:]))
        segment.pcoord = numpy.hstack((coords[:],colors[:]))
    else:
        # This could easily be modified to return N dimensions.
        #segment.pcoord = numpy.swapaxes(numpy.vstack((coords[:,0],coords[:,1],colors[:])), 0, 1)
        segment.pcoord = numpy.swapaxes(numpy.vstack((coords[:],colors[:])), 0, 1)

For the sake of readability, the unused log and coordinate loader are not shown here.

There are two major functions here: initialize, contained within the System class, and pcoord_loader_color_tracker.

system.System.initialize

As in the other examples/tutorials, the number of dimensions, progress coordinate points, bin boundaries, etc., are set here. What is notably different is that there are two dimensions: RMSD, and 'color', a simple on-the-fly state-tracking method that 'paints' walkers with a tag according to the state they were last in. The idea is to ensure that trajectories that have visited the final state and are coming back (that is, 'reverse' trajectories) are not merged with trajectories that are still approaching the final state (the 'forward' trajectories, which typically carry more weight). While there are various possible ways to implement such a scheme, it is done here as a custom progress coordinate loader (pcoord_loader_color_tracker).

The state tags, in pcoord_loader_color_tracker, are integers, but the bins themselves, self.color_binbounds, are halfway between the integers. This is because numpy does not cleanly store the tags as integers, as the progress coordinate has been defined as type float32 with a scale offset of 4, not a type int, in west.cfg; a tag of 1 is sometimes stored as 1.0003, and sometimes as 0.99993, etc, which can make tracking confusing (this does not necessarily result in incorrect state tracking, however, as each value in the array is offset by the same small amount).

system.pcoord_loader_color_tracker

This is the function that is called to load $WEST_PCOORD_RETURN in both runseg.sh and get_pcoord.sh. There are four input variables: fieldname, coord_file, segment, and single_point; these are standard for any data loader inside of WESTPA. In fact, there's very little to differentiate the progress coordinate dataset from any other dataset inside of WESTPA, internally. 'coord_file' is the file pointed to by $WEST_PCOORD_RETURN, segment is the Segment object being processed, and single_point evaluates to True if this function is being called from get_pcoord.sh (or from gen_istate.sh, if that script were being run, which it is not). While 'fieldname' isn't used here, as it is known that this function is called only for the progress coordinate, the function could easily be modified to write to any dataset.

The function works as follows, and is mostly the same regardless of whether single_point is True or False (the differences mostly lie in the fact that the data is shaped differently, and so both it and the output must be handled differently):

1.  Load the data from runseg.sh with numpy.loadtxt.
2.  Define the folded and unfolded bin values (that is, their pcoord value ranges).
3A. If single_point is True, check to see whether the first time point is within one of
    the states.  If not, tag it with the 'unknown state', which is simply index N, where
    N is the number of defined states (states are zero-indexed, so here the unknown state
    is 2).  Copy this through for every time point.
3B. If single_point is False, and this is a continuation, take whatever state the segment
    was in at time 0, and copy it through for every time point.
4.  Evaluate whether the state has changed at the final time point.  If so, update.
5.  Stack the data in the appropriate manner, and write to the progress coordinate.
The system will continue to propagate segments carrying an 'unknown state' tag; this tutorial avoids the issue by starting from a delta distribution within the folded state, so no segment is ever left untagged. For systems in which this isn't possible, the target count of the 'unknown' bins can be adjusted as the system evolves to encourage them to depopulate.

Run it!

At this point, you should be ready to run. As indicated in the instructions above,

cd p53-tutorial
./init.sh
qsub runwe-frank.sh
And wait for everything to happen. Good luck!