Running steps of the pipeline in parallel

Toby Dylan Hocking edited this page Dec 5, 2019 · 4 revisions

The PeakSegPipeline consists of seven steps, numbered 0 to 6. Below we give the R commands for each step as used by the test-pipeline-demo.R script, which uses the project directory ~/PeakSegPipeline-test/demo.

Step 0: convert labels, create problem directories, and shell scripts

Convert labels from ~/PeakSegPipeline-test/demo/labels/*.txt files to ~/PeakSegPipeline-test/demo/samples/*/*/labels.bed files using the following R command:

PeakSegPipeline::convert_labels("~/PeakSegPipeline-test/demo")

Step 1: target interval computation for each labeled sample and problem

Begin model training by computing ~/PeakSegPipeline-test/demo/samples/*/*/problems/*/target.tsv files. The target is the largest interval of log(penalty) values for which PeakSegFPOP returns peak models that have the minimum number of incorrect labels. The R code to do this for one sample and problem is, for example:

PeakSegPipeline::problem.target("~/PeakSegPipeline-test/demo/samples/kidney/MS002201/problems/chr10:18024675-38818835")
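Because each target is computed for one sample and problem independently, these calls can be spread over several cores. The following is a minimal sketch, not part of PeakSegPipeline: the helpers `list_sample_problems` and `run_in_parallel` are hypothetical names, and the sketch assumes that every directory matching samples/*/*/problems/* is a labeled problem (filter the list otherwise) and that forking via parallel::mclapply is available (on Windows it falls back to one core).

```r
# Hypothetical helper: list every sample/problem directory under a project directory.
list_sample_problems <- function(project.dir) {
  pattern <- file.path(
    path.expand(project.dir), "samples", "*", "*", "problems", "*")
  paths <- Sys.glob(pattern)
  paths[dir.exists(paths)] # keep directories only
}

# Hypothetical helper: apply FUN to each directory on several cores.
run_in_parallel <- function(dirs, FUN, cores = 2L) {
  if (.Platform$OS.type == "windows") cores <- 1L # mclapply cannot fork on Windows
  parallel::mclapply(dirs, FUN, mc.cores = cores)
}

problem.dirs <- list_sample_problems("~/PeakSegPipeline-test/demo")
if (length(problem.dirs) > 0 &&
    requireNamespace("PeakSegPipeline", quietly = TRUE)) {
  results <- run_in_parallel(problem.dirs, PeakSegPipeline::problem.target)
}
```

On a cluster, the same list of directories could instead be split into one scheduler job per directory; the choice of mclapply here is only for illustration on a single machine.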

Step 2: train a model that predicts a penalty for each sample and genomic segmentation problem

The target.tsv files computed in Step 1 are used in this step to train a machine learning model that can predict optimal penalty values, even for unlabeled samples and genome subsets. To train a model, use the R code:

PeakSegPipeline::problem.train("~/PeakSegPipeline-test/demo")

A model is trained using ~/PeakSegPipeline-test/demo/samples/*/*/problems/*/target.tsv files, and saved to ~/PeakSegPipeline-test/demo/model.RData.

Step 3: peak predictions for each sample independently, joint problem creation via peak clustering

The next step is to compute peak predictions independently for each sample and genomic segmentation problem, and then cluster the peaks into joint segmentation problems. This step is parallelized on genomic segmentation problems. Each job computes peak predictions for every sample in one genomic segmentation problem (PeakSegPipeline::problem.predict.allSamples), then clusters the predicted peaks to obtain joint segmentation problems (PeakSegPipeline::create_problems_joint), then computes joint target intervals (PeakSegPipeline::problem.joint.targets). To run one of these jobs on a single genomic segmentation problem, use an R command as below:

PeakSegPipeline::problem.pred.cluster.targets("~/PeakSegPipeline-test/demo/problems/chr10:18024675-38818835")

The outputs of this step are:

  • Joint segmentation problem files, e.g. ~/PeakSegPipeline-test/demo/problems/chr10:18024675-38818835/jointProblems.bed
  • Joint target interval files, e.g. ~/PeakSegPipeline-test/demo/problems/chr10:18024675-38818835/jointProblems/chr10:35182819-35261002/target.tsv
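Since this step runs one job per genomic segmentation problem, it can help to find the problems whose output is still missing and run only those. The sketch below is an illustration under assumptions rather than pipeline code: `unfinished_problems` is a hypothetical helper, and it assumes that a missing jointProblems.bed file means the job for that problem has not finished.

```r
# Hypothetical helper: problem directories that have no jointProblems.bed yet.
unfinished_problems <- function(project.dir) {
  problem.dirs <- Sys.glob(file.path(path.expand(project.dir), "problems", "*"))
  problem.dirs <- problem.dirs[dir.exists(problem.dirs)]
  done <- file.exists(file.path(problem.dirs, "jointProblems.bed"))
  problem.dirs[!done]
}

todo <- unfinished_problems("~/PeakSegPipeline-test/demo")
if (length(todo) > 0 &&
    requireNamespace("PeakSegPipeline", quietly = TRUE)) {
  # mclapply cannot fork on Windows, so use a single core there
  cores <- if (.Platform$OS.type == "windows") 1L else 2L
  results <- parallel::mclapply(
    todo, PeakSegPipeline::problem.pred.cluster.targets, mc.cores = cores)
}
```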

Step 4: train a model that predicts a penalty for each joint segmentation problem

In this step we use the joint target interval files computed in Step 3 to train a joint model, which will be used to predict a joint penalty for each joint segmentation problem. The R command is:

PeakSegPipeline::problem.joint.train("~/PeakSegPipeline-test/demo")

This step computes:

  • a joint model, ~/PeakSegPipeline-test/demo/joint.model.RData
  • a list of joint segmentation problems for each job in the next step, ~/PeakSegPipeline-test/demo/jobs/*/jobProblems.bed
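To see how the joint segmentation problems are distributed over jobs before launching the next step, the jobProblems.bed files can be counted line by line. This is a rough sketch only: `count_job_problems` is a hypothetical helper, and it assumes each jobProblems.bed is a plain-text file with one joint segmentation problem per line and no header, which the text above does not state.

```r
# Hypothetical helper: number of non-empty lines in each jobs/*/jobProblems.bed.
count_job_problems <- function(project.dir) {
  bed.files <- Sys.glob(
    file.path(path.expand(project.dir), "jobs", "*", "jobProblems.bed"))
  n <- vapply(bed.files, function(f) {
    lines <- readLines(f, warn = FALSE)
    sum(nzchar(trimws(lines))) # assumption: one problem per non-empty line
  }, integer(1))
  data.frame(
    job = basename(dirname(bed.files)),
    problems = unname(n),
    stringsAsFactors = FALSE)
}

print(count_job_problems("~/PeakSegPipeline-test/demo"))
```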

Step 5: joint peak predictions

To make joint peak predictions for one job, which consists of several joint segmentation problems, use the R code below:

PeakSegPipeline::problem.joint.predict.job("~/PeakSegPipeline-test/demo/jobs/1")

The outputs of this step are the ~/PeakSegPipeline-test/demo/jobs/*/jobPeaks.RData files, which contain joint peak predictions.
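All jobs can be run in one go and then checked for their output file. The following is a minimal sketch under assumptions: `run_jobs` is a hypothetical helper, not a PeakSegPipeline function, and it treats the presence of jobPeaks.RData as the sign that a job succeeded. It uses parallel::mclapply, which forks and therefore runs on a single core on Windows.

```r
# Hypothetical helper: run FUN on every jobs/* directory, then report which
# jobs have a jobPeaks.RData file.
run_jobs <- function(project.dir, FUN, cores = 2L) {
  job.dirs <- Sys.glob(file.path(path.expand(project.dir), "jobs", "*"))
  job.dirs <- job.dirs[dir.exists(job.dirs)]
  if (.Platform$OS.type == "windows") cores <- 1L
  # try() keeps one failing job from hiding the status of the others
  parallel::mclapply(
    job.dirs, function(d) try(FUN(d), silent = TRUE), mc.cores = cores)
  data.frame(
    job = basename(job.dirs),
    has.peaks = file.exists(file.path(job.dirs, "jobPeaks.RData")),
    stringsAsFactors = FALSE)
}

if (requireNamespace("PeakSegPipeline", quietly = TRUE)) {
  status <- run_jobs(
    "~/PeakSegPipeline-test/demo",
    PeakSegPipeline::problem.joint.predict.job)
  print(status[!status$has.peaks, ]) # jobs that still need attention
}
```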

Step 6: gather and summarize results

To gather all the peak predictions in the summary web page ~/PeakSegPipeline-test/demo/index.html, run the R code:

PeakSegPipeline::plot_all("~/PeakSegPipeline-test/demo")

This last step also creates ~/PeakSegPipeline-test/demo/hub.txt, which can be used as a track hub on the UCSC genome browser. The hub shows the ~/PeakSegPipeline-test/demo/samples/*/*/coverage.bigWig and ~/PeakSegPipeline-test/demo/samples/*/*/joint_peaks.bigWig files together in a multiWig container: for each sample, a colored coverage profile with superimposed peak calls drawn as horizontal black line segments.