man:w_crawl

Overview

Crawl a weighted ensemble dataset, executing a function for each iteration. This can be used for postprocessing of trajectories, cleanup of datasets, or anything else that can be expressed as "do X for iteration N, then do something with the result". Tasks are parallelized by iteration, and no guarantees are made about evaluation order.
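
In the simplest case, the task_callable named on the command line is just a function that accepts an iteration number and that iteration's HDF5 group and returns a (picklable) result. A minimal sketch, using a hypothetical module name and the per-iteration seg_index table:

# wcrawl_minimal.py -- hypothetical module; the simplest possible task callable
import numpy

def calculate(n_iter, iter_group):
    '''Return the statistical weight of every segment in iteration n_iter.'''
    # 'seg_index' is the standard per-iteration summary table in west.h5
    return numpy.asarray(iter_group['seg_index']['weight'])

The callable is referenced as module.function (here, wcrawl_minimal.calculate), so the module typically needs to be importable from the directory where w_crawl is invoked. A fuller example is worked through below.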

Usage

$WEST_ROOT/bin/w_crawl [-h] [-r RCFILE] [--quiet | --verbose | --debug] [--version]
                       [--max-queue-length MAX_QUEUE_LENGTH] [-W WEST_H5FILE] [--first-iter N_ITER]
                       [--last-iter N_ITER] [-c CRAWLER_INSTANCE]
                       [--serial | --parallel | --work-manager WORK_MANAGER] [--n-workers N_WORKERS]
                       [--zmq-mode MODE] [--zmq-comm-mode COMM_MODE] [--zmq-write-host-info INFO_FILE]
                       [--zmq-read-host-info INFO_FILE] [--zmq-upstream-rr-endpoint ENDPOINT]
                       [--zmq-upstream-ann-endpoint ENDPOINT] [--zmq-downstream-rr-endpoint ENDPOINT]
                       [--zmq-downstream-ann-endpoint ENDPOINT] [--zmq-master-heartbeat MASTER_HEARTBEAT]
                       [--zmq-worker-heartbeat WORKER_HEARTBEAT] [--zmq-timeout-factor FACTOR]
                       [--zmq-startup-timeout STARTUP_TIMEOUT] [--zmq-shutdown-timeout SHUTDOWN_TIMEOUT]
                       task_callable
For WESTPA 2.0 (beta), replace $WEST_ROOT/bin/w_crawl with w_crawl.

Command-Line Options

See the general command-line tool reference for more information on the general options.

Input/Output Options

General Guidelines

Examples

  • One way to process a dataset is to first build an iteration processor class, e.g. in a file named wcrawl_functions.py:
from __future__ import print_function, division; __metaclass__ = type
import numpy
import os
import sys
from westpa import h5io
from w_crawl import WESTPACrawler
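# Note: under WESTPA 2.0, the equivalent import is typically
#     from westpa.cli.tools.w_crawl import WESTPACrawler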
import MDAnalysis

class IterationProcessor(object):
    '''
    This class performs analysis on each iteration.  It should contain a method
    ``process_iteration``, which may be called as 
    ``process_iteration(self, n_iter, iter_group)``, where ``n_iter`` refers to
    the weighted ensemble iteration index, and ``iter_group`` is the HDF5 group 
    for the given iteration. The method should return an array or other values, 
    which the ``process_iter_result`` method of the ``Crawler`` class receives 
    as the argument ``result``. 
    '''
    # Store the location of the PDB file to be used as the topology 
    topology_filename = '/path/to/pdb/file/topology.pdb'
    # Define the pattern used for finding each segment's traj file
    traj_pattern = 'traj_segs/{n_iter:06d}/{seg_id:06d}/testout.xtc'
    # In this example, there are three frames saved for each segment.
    nframes = 3

    def __init__(self):
        '''
        Initialize the IterationProcessor class
        '''
        # self.workdir = os.environ['LOCAL']

    def process_iteration(self, n_iter, iter_group):
        '''
        The main analysis function that w_crawl calls for each iteration.
        This should be changed based on your analysis. This method could
        contain all the code for your analysis, or it could call an outside
        function. 

        Parameters
        ----------
        n_iter: (int) The index of the weighted ensemble iteration for which
          analysis should be performed.
        iter_group: (h5py group) The HDF5 group corresponding to iteration
          n_iter, from the main WESTPA data file (typically west.h5).

        Returns
        -------
        result: (numpy.ndarray) In general this can be any picklable object,
          which is later processed by Crawler.process_iter_result. Here, it is
          an array of atomic coordinates with shape
          (n_segments, n_timepoints, n_atoms, 3), where dimension 0 indexes the
          segment, dimension 1 indexes the frame number, dimension 2 indexes
          the atom, and dimension 3 indexes the x/y/z coordinate.
        '''
        # Find the number of segments in the iteration at hand
        nsegs = iter_group['seg_index'].shape[0]

        # The number of atoms in the topology; coordinates for each atom are
        # stored at every frame
        n_atoms = 160

        # Create an array to hold the data for this iteration
        iteration_data_array = numpy.zeros((nsegs, self.nframes, n_atoms, 3))

        # Iterate over each segment
        for iseg in range(nsegs):

            # Generate the path to this segment's trajectory file
            trajpath = os.path.join(os.environ['WEST_SIM_ROOT'],
                                    self.traj_pattern.format(n_iter=n_iter, seg_id=iseg))

            # Load the segment trajectory and collect the coordinates of every frame
            u = MDAnalysis.Universe(self.topology_filename, trajpath)
            coord_data = []
            for frame in u.trajectory:
                coord_data.append(u.atoms.positions)

            # Iterate over each timestep and calculate the quantities of interest.
            for num, val in enumerate(coord_data):
                # Run your calculation for the current timestep here and add the
                # result to the array for the whole iteration. In this example,
                # the atomic coordinates are stored directly; the analysis can
                # be as complex as you need.
                iteration_data_array[iseg, num] = val
        return iteration_data_array
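
Before wiring this class into w_crawl, it can be worth sanity-checking process_iteration by hand on a single iteration. A minimal sketch, assuming WEST_SIM_ROOT is set, the topology and trajectory files are in place, west.h5 sits in the current directory, and iteration 10 exists:

import h5py

ip = IterationProcessor()
with h5py.File('west.h5', 'r') as west:
    result = ip.process_iteration(10, west['iterations/iter_00000010'])
print(result.shape)   # expect (nsegs, nframes, n_atoms, 3)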
  • Then set up a Crawler class:
class Crawler(WESTPACrawler):
    '''
    In this example, w_crawl works as follows:

    We supply the ``Crawler`` class, which handles writing data. The 
    Crawler specifies 3 methods: initialize, finalize, and process_iter_result.

    ``initialize`` is called only once--when w_crawl starts up. The job of
    ``initialize`` is to create the output file (here, an HDF5 file).

    Like ``initialize``, ``finalize`` is also called only once--when w_crawl
    finishes calculations for all iterations. The job of ``finalize`` is to
    gracefully close the output file, preventing data corruption.

    The method ``process_iter_result`` is called once per weighted ensemble
    iteration. It takes the weighted ensemble iteration (n_iter) and the result
    of the calculations for an iteration (result) as arguments, and stores the
    data in the output file.

    The actual calculations are performed by the IterationProcessor class 
    defined above. In particular, the IterationProcessor.process_iteration 
    method performs the calculations; the return value of this method is passed
    to Crawler.process_iter_result.
    '''

    def initialize(self, iter_start, iter_stop):
        '''
        Create an HDF5 file for saving the data.  Change the file path to
        a location that is available to you. 
        '''
        self.output_file = h5io.WESTPAH5File('/path/to/new/h5file/filename.h5', 'w')
        h5io.stamp_iter_range(self.output_file, iter_start, iter_stop)

    def finalize(self):
        self.output_file.close()

    def process_iter_result(self, n_iter, result):
        '''
        Save the result of the calculation in the output file.

        Parameters
        ----------
        n_iter: (int) The index of the weighted ensemble iteration to which
          the data in ``result`` corresponds.
        result: (numpy.ndarray) In general this could be an arbitrary object
          returned by IterationProcessor.process_iteration; here it is a numpy
          array of atomic coordinates.
        '''
        # Initialize/create the group for the specific iteration
        iter_group = self.output_file.require_iter_group(n_iter)

        iteration_data_array = result
        
        # Save the dataset for this iteration, using scale-offset filtering
        # and gzip compression to keep the file size down
        dataset = iter_group.create_dataset(
            'coords',
            data=iteration_data_array,
            scaleoffset=6,
            compression='gzip',
            compression_opts=4,
            chunks=h5io.calc_chunksize(iteration_data_array.shape,
                                       iteration_data_array.dtype),
        )
  • Then set up the entry point:
# Entry point for w_crawl
iteration_processor = IterationProcessor()
def calculate(n_iter, iter_group):
    '''Picklable shim for iteration_processor.process_iteration()'''
    global iteration_processor 
    return iteration_processor.process_iteration(n_iter, iter_group)

crawler = Crawler()
  • You can run these functions via:
w_crawl \
  wcrawl_functions.calculate \
  -c wcrawl_functions.crawler \
  --debug \
  --work-manager=processes \
  --n-workers=4 \
  &> ${ANALYSIS_DIR}/w_crawl.log 
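
Once the crawl finishes, it is worth quickly confirming that the output file contains what you expect. A minimal sketch, assuming the output path and the 'coords' dataset name used in the Crawler example above:

import h5py

with h5py.File('/path/to/new/h5file/filename.h5', 'r') as f:
    print(f.attrs['iter_start'], f.attrs['iter_stop'])
    print(f['iterations/iter_00000001/coords'].shape)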

After running w_crawl:

  • You will have created an output h5 file containing the newly calculated auxiliary data for each segment of each iteration that was specified.
  • To make a new copy of the main west.h5 file with the additional w_crawl data copied over for each iteration:
import h5py
from shutil import copyfile

# name of the dataset created in Crawler.process_iter_result (here: 'coords')
crawl_aux = "coords"
# path of the output file written by w_crawl (set in Crawler.initialize)
crawl_file = "/path/to/new/h5file/filename.h5"

# work on a copy so the main h5 file cannot be corrupted
source_h5 = "west.h5"
dest_h5 = "west_new.h5"
copyfile(source_h5, dest_h5)

# copy each iteration's aux data from the crawl file into the new west file
west_h5 = h5py.File(dest_h5, mode="r+")
crawl_h5 = h5py.File(crawl_file, mode="r")

# copy over all iterations present in the crawl h5 file
for n_iter in range(crawl_h5.attrs["iter_start"], crawl_h5.attrs["iter_stop"]):
    dest_group = west_h5.require_group(f"iterations/iter_{n_iter:08d}/auxdata")
    crawl_h5.copy(f"iterations/iter_{n_iter:08d}/{crawl_aux}", dest_group)

west_h5.close()
crawl_h5.close()
  • After this step, west_new.h5 contains all of your original WESTPA run information and data, plus the additional w_crawl data under each iteration's auxdata group (a quick check is sketched after this list).
  • All done!
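
A quick way to check the copy (again assuming the 'coords' dataset name from the example above):

import h5py

with h5py.File('west_new.h5', 'r') as f:
    # every copied iteration should now carry the new auxiliary dataset
    print(list(f['iterations/iter_00000001/auxdata'].keys()))   # expect 'coords'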