Running steps of the pipeline in parallel
The PeakSegPipeline consists of seven steps. Below we give the command lines for the test-pipeline-demo.R script, which uses the project directory ~/PeakSegPipeline-test/demo.
Step 1: convert labels from ~/PeakSegPipeline-test/demo/labels/*.txt files to ~/PeakSegPipeline-test/demo/samples/*/*/labels.bed files, using the following R command:

PeakSegPipeline::convert_labels("~/PeakSegPipeline-test/demo")
Step 2: begin model training by computing ~/PeakSegPipeline-test/demo/samples/*/*/problems/*/target.tsv files. The target is the largest interval of log(penalty) values for which PeakSegFPOP returns peak models with the minimum number of incorrect labels. For example, the R code to do this for one sample and problem is:

PeakSegPipeline::problem.target("~/PeakSegPipeline-test/demo/samples/kidney/MS002201/problems/chr10:18024675-38818835")
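Each target.tsv computation is independent of the others, so these jobs can run in parallel, one per sample/problem directory. The sketch below is an assumption, not part of the pipeline itself: it requires a shell whose xargs supports the common -P extension, and an Rscript that can load PeakSegPipeline; the 4-way parallelism is arbitrary.

```shell
# Run problem.target for every sample/problem directory, 4 at a time.
# Adjust -P to the number of available cores.
ls -d ~/PeakSegPipeline-test/demo/samples/*/*/problems/* |
  xargs -I {} -P 4 \
    Rscript -e 'PeakSegPipeline::problem.target("{}")'
```

On a cluster, the same per-directory command can instead be submitted as one batch job per problem directory.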
Step 3: the target.tsv files computed in Step 2 are used in this step to train a machine learning model that can predict optimal penalty values, even for un-labeled samples and genome subsets. To train a model, use the R code:

PeakSegPipeline::problem.train("~/PeakSegPipeline-test/demo")

The model is trained using the ~/PeakSegPipeline-test/demo/samples/*/*/problems/*/target.tsv files, and saved to ~/PeakSegPipeline-test/demo/model.RData.
Step 4: compute peak predictions independently for each sample and genomic segmentation problem, then cluster the peaks into joint segmentation problems. This step is parallelized over genomic segmentation problems. Each job computes peak predictions for every sample in one genomic segmentation problem (PeakSegPipeline::problem.predict.allSamples), clusters the predicted peaks to obtain joint segmentation problems (PeakSegPipeline::create_problems_joint), and computes joint target intervals (PeakSegPipeline::problem.joint.targets). To run one of these jobs on a single genomic segmentation problem, use an R command as below:
PeakSegPipeline::problem.pred.cluster.targets("~/PeakSegPipeline-test/demo/problems/chr10:18024675-38818835")
The outputs of this step are:
- Joint segmentation problems files, e.g. ~/PeakSegPipeline-test/demo/problems/chr10:18024675-38818835/jointProblems.bed
- Joint target interval files, e.g. ~/PeakSegPipeline-test/demo/problems/chr10:18024675-38818835/jointProblems/chr10:35182819-35261002/target.tsv
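Because each genomic segmentation problem is processed independently, the jobs of this step can also be dispatched in parallel over the ~/PeakSegPipeline-test/demo/problems/* directories. As in Step 2, the following is a sketch assuming an xargs with the -P extension and PeakSegPipeline available to Rscript:

```shell
# One predict/cluster/target job per genomic segmentation problem,
# 2 at a time (illustrative; adjust -P to your machine).
ls -d ~/PeakSegPipeline-test/demo/problems/* |
  xargs -I {} -P 2 \
    Rscript -e 'PeakSegPipeline::problem.pred.cluster.targets("{}")'
```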
Step 5: use the joint target interval files computed in Step 4 to train a joint model, which will be used to predict a joint penalty for each joint segmentation problem. The R command is:
PeakSegPipeline::problem.joint.train("~/PeakSegPipeline-test/demo")
This step computes:
- a joint model,
~/PeakSegPipeline-test/demo/joint.model.RData
- a list of joint segmentation problems for each job in the next step,
~/PeakSegPipeline-test/demo/jobs/*/jobProblems.bed
Step 6: to make joint peak predictions for one job, which consists of several joint segmentation problems, use the R code below:
PeakSegPipeline::problem.joint.predict.job("~/PeakSegPipeline-test/demo/jobs/1")
The outputs of this step are the ~/PeakSegPipeline-test/demo/jobs/*/jobPeaks.RData files, which contain joint peak predictions.
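The ~/PeakSegPipeline-test/demo/jobs/* directories are independent of one another, so they can be launched concurrently. A minimal sketch using shell background jobs (assuming PeakSegPipeline is available to Rscript):

```shell
# Launch one background R process per jobs/* directory, then wait
# for all of them to finish before moving on to the next step.
for job in ~/PeakSegPipeline-test/demo/jobs/*; do
  Rscript -e "PeakSegPipeline::problem.joint.predict.job('$job')" &
done
wait
```

With many jobs, a batch scheduler or a throttled dispatcher (as in the earlier steps) avoids oversubscribing the machine.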
Step 7: to gather all the peak predictions into the summary web page ~/PeakSegPipeline-test/demo/index.html, run the R code:

PeakSegPipeline::plot_all("~/PeakSegPipeline-test/demo")
This last step includes creation of ~/PeakSegPipeline-test/demo/hub.txt, which can be used as a track hub on the UCSC genome browser, with ~/PeakSegPipeline-test/demo/samples/*/*/coverage.bigWig and ~/PeakSegPipeline-test/demo/samples/*/*/joint_peaks.bigWig files that will be shown together on the track hub in a multiWig container (for each sample, a colored coverage profile with superimposed peak calls as horizontal black line segments).
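For reference, a multiWig container in a UCSC track hub is declared with trackDb stanzas roughly like the following. This is an illustrative sketch of the standard UCSC syntax, not the exact output of plot_all; the track names and colors are assumptions.

```
track kidney_MS002201
container multiWig
type bigWig
shortLabel kidney MS002201
longLabel kidney MS002201 coverage with joint peak calls
aggregate transparentOverlay
visibility full

    track kidney_MS002201_coverage
    parent kidney_MS002201
    type bigWig
    bigDataUrl samples/kidney/MS002201/coverage.bigWig
    color 255,0,0

    track kidney_MS002201_peaks
    parent kidney_MS002201
    type bigWig
    bigDataUrl samples/kidney/MS002201/joint_peaks.bigWig
    color 0,0,0
```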