# GNU parallel

There are several ways to batch parallel processes in the shell.
The bash for loop is common and effective when the number of processors on your computer exceeds the number of tasks to be run in parallel.
But as the number of tasks grows, a for loop offers no easy way to use all available processors without overwhelming the system.
A plain loop runs one task at a time, and a loop that backgrounds every task with `&` launches all of them at once.
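
The two loop styles can be sketched as follows. This is a minimal illustration, not part of the original notebook: the `task` function and the four inputs are hypothetical stand-ins for real work.

```shell
# Hypothetical stand-in for a real task: sleeps briefly, then reports its input.
task() {
    sleep 0.1
    echo "done: $1"
}

inputs=(a b c d)

# Style 1: sequential loop, one task at a time.
sequential_out=$(for x in "${inputs[@]}"; do task "$x"; done)

# Style 2: background every task at once, then wait for all of them.
# With thousands of inputs this would start thousands of processes simultaneously.
concurrent_out=$(
    for x in "${inputs[@]}"; do task "$x" & done
    wait
)

echo "$sequential_out"
# Completion order is not guaranteed for background jobs, so sort before printing.
echo "$concurrent_out" | sort
```

Neither style caps the number of simultaneous tasks at a chosen value, which is the gap `parallel -j` fills.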

GNU `parallel` provides a better alternative in these scenarios.
With this command, you specify not only the inputs to be processed but also the number of tasks to run at once.
For example, if you have 1024 samples to process and a computer with 16 processors, you would invoke `parallel` to run 16 jobs at a time, starting a new job whenever one finishes, until all 1024 samples are processed.

This notebook will demonstrate the syntax for invoking the parallel command.
First, let's consider some sample data files.

In [1]:
ls data/creatures/

Bvul.indvX.repl1.fasta  Dori.indvX.repl1.fasta  Epeg.indvX.repl1.fasta
Bvul.indvX.repl2.fasta  Dori.indvX.repl2.fasta  Epeg.indvX.repl2.fasta
Bvul.indvX.repl3.fasta  Dori.indvX.repl3.fasta  Epeg.indvX.repl3.fasta
Bvul.indvY.repl1.fasta  Dori.indvY.repl1.fasta  Epeg.indvY.repl1.fasta
Bvul.indvY.repl2.fasta  Dori.indvY.repl2.fasta  Epeg.indvY.repl2.fasta
Bvul.indvY.repl3.fasta  Dori.indvY.repl3.fasta  Epeg.indvY.repl3.fasta
Bvul.indvZ.repl1.fasta  Dori.indvZ.repl1.fasta  Epeg.indvZ.repl1.fasta
Bvul.indvZ.repl2.fasta  Dori.indvZ.repl2.fasta  Epeg.indvZ.repl2.fasta
Bvul.indvZ.repl3.fasta  Dori.indvZ.repl3.fasta  Epeg.indvZ.repl3.fasta
Docc.indvX.repl1.fasta  Emon.indvX.repl1.fasta  Hdiu.indvX.repl1.fasta
Docc.indvX.repl2.fasta  Emon.indvX.repl2.fasta  Hdiu.indvX.repl2.fasta
Docc.indvX.repl3.fasta  Emon.indvX.repl3.fasta  Hdiu.indvX.repl3.fasta
Docc.indvY.repl1.fasta  Emon.indvY.repl1.fasta  Hdiu.indvY.repl1.fasta
Docc.indvY.repl2.fasta  Emon.indvY.repl2.fasta  Hdiu.indvY.repl2.fasta
Docc.i

And also a `summarize.sh` script, invoked like so.

In [2]:
./summarize.sh data/creatures/Bvul.indvX.repl1.fasta

Fasta file 'data/creatures/Bvul.indvX.repl1.fasta' contains 1 record(s) across 473 lines


Now imagine we want to run this script on all of these FASTA files.
The `parallel` command below does this, using the following syntax.

- We use `-j 4` to indicate that we want to run 4 tasks at a time
- The `{}` symbol is a placeholder for one of the inputs
- The `:::` symbol separates the command to be executed from the input declaration
- We use globbing to specify all `.fasta` files in `data/creatures/` as input

In [3]:
parallel -j 4 ./summarize.sh {} ::: data/creatures/*.fasta

Fasta file 'data/creatures/Bvul.indvX.repl1.fasta' contains 1 record(s) across 473 lines
Fasta file 'data/creatures/Bvul.indvX.repl2.fasta' contains 1 record(s) across 121 lines
Fasta file 'data/creatures/Bvul.indvX.repl3.fasta' contains 1 record(s) across 500 lines
Fasta file 'data/creatures/Bvul.indvY.repl1.fasta' contains 1 record(s) across 428 lines
Fasta file 'data/creatures/Bvul.indvY.repl3.fasta' contains 1 record(s) across 63 lines
Fasta file 'data/creatures/Bvul.indvY.repl2.fasta' contains 1 record(s) across 775 lines
Fasta file 'data/creatures/Bvul.indvZ.repl1.fasta' contains 1 record(s) across 833 lines
Fasta file 'data/creatures/Bvul.indvZ.repl2.fasta' contains 1 record(s) across 203 lines
Fasta file 'data/creatures/Bvul.indvZ.repl3.fasta' contains 1 record(s) across 311 lines
Fasta file 'data/creatures/Docc.indvX.repl1.fasta' contains 1 record(s) across 31 lines
Fasta file 'data/creatures/Docc.indvY.repl1.fasta' contains 1 record(s) across 806 lines
Fasta file 'data/creatu
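
Note that the output above is not strictly in input order (`Bvul.indvY.repl3` appears before `Bvul.indvY.repl2`), because each job's output is printed as the job finishes. When order matters, the `-k` (`--keep-order`) option can help. The sketch below is an illustration rather than part of the original notebook: it uses `sleep` and `echo` instead of `summarize.sh`, and it first checks that the installed `parallel` is the GNU version.

```shell
# Check that GNU parallel (not the unrelated moreutils tool of the same name) is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    have_parallel=yes
else
    have_parallel=no
fi

ordered=""
if [ "$have_parallel" = yes ]; then
    # The first input (3) sleeps longest and finishes last,
    # yet -k still prints results in input order: 3, 2, 1.
    ordered=$(parallel -k -j 3 'sleep 0.{}; echo {}' ::: 3 2 1)
    echo "$ordered"
fi
```

Without `-k`, the same command would typically print the shortest-sleeping job first.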

The parallel command also supports `{.}` as a placeholder, which (if the input is a filename) removes the final file extension.
This is convenient for keeping a consistent prefix across all output files and avoiding the all-too-common accumulation of extensions: `prefix.fasta` --> `prefix.fasta.cleaned.fasta` --> `prefix.fasta.cleaned.fasta.filtered` --> `prefix.fasta.cleaned.fasta.filtered.mash` --> `prefix.fasta.cleaned.fasta.filtered.mash.besthits.txt`.
In the command below, the `>` is quoted so that the redirection is performed separately for each job rather than once by the calling shell.

In [4]:
parallel -j 4 ./summarize.sh {} '>' {.}.summary.txt ::: data/creatures/D*.fasta
ls data/creatures/*.summary.txt

data/creatures/Docc.indvX.repl1.summary.txt
data/creatures/Docc.indvX.repl2.summary.txt
data/creatures/Docc.indvX.repl3.summary.txt
data/creatures/Docc.indvY.repl1.summary.txt
data/creatures/Docc.indvY.repl2.summary.txt
data/creatures/Docc.indvY.repl3.summary.txt
data/creatures/Docc.indvZ.repl1.summary.txt
data/creatures/Docc.indvZ.repl2.summary.txt
data/creatures/Docc.indvZ.repl3.summary.txt
data/creatures/Dori.indvX.repl1.summary.txt
data/creatures/Dori.indvX.repl2.summary.txt
data/creatures/Dori.indvX.repl3.summary.txt
data/creatures/Dori.indvY.repl1.summary.txt
data/creatures/Dori.indvY.repl2.summary.txt
data/creatures/Dori.indvY.repl3.summary.txt
data/creatures/Dori.indvZ.repl1.summary.txt
data/creatures/Dori.indvZ.repl2.summary.txt
data/creatures/Dori.indvZ.repl3.summary.txt
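
GNU parallel offers a few related placeholders besides `{.}`. As a small sketch (not from the original notebook), the command below uses `echo` to show what each one expands to for a single path taken from the listing above; the labels `full=`, `noext=`, and so on are only there for readability, and the file does not need to exist.

```shell
# Check that GNU parallel (not the unrelated moreutils tool of the same name) is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    have_parallel=yes
else
    have_parallel=no
fi

path=data/creatures/Bvul.indvX.repl1.fasta

parts=""
if [ "$have_parallel" = yes ]; then
    # {}   the input unchanged
    # {.}  the input without its final extension
    # {/}  the basename (directory removed)
    # {/.} the basename without its final extension
    # {//} the directory only
    parts=$(parallel echo 'full={}' 'noext={.}' 'base={/}' 'basenoext={/.}' 'dir={//}' ::: "$path")
    echo "$parts"
fi
```

For example, `{/.}` would be the natural choice if the summaries should be written to a different directory than the inputs.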


If we want to specify our inputs using a file (one input per line), we use `::::` instead of `:::`.

In [5]:
parallel -j 4 ./summarize.sh {} :::: creatures.txt

Fasta file 'data/creatures/Bvul.indvY.repl1.fasta' contains 1 record(s) across 428 lines
Fasta file 'data/creatures/Bvul.indvX.repl1.fasta' contains 1 record(s) across 473 lines
Fasta file 'data/creatures/Bvul.indvX.repl2.fasta' contains 1 record(s) across 121 lines
Fasta file 'data/creatures/Bvul.indvX.repl3.fasta' contains 1 record(s) across 500 lines
Fasta file 'data/creatures/Bvul.indvZ.repl1.fasta' contains 1 record(s) across 833 lines
Fasta file 'data/creatures/Bvul.indvY.repl2.fasta' contains 1 record(s) across 775 lines
Fasta file 'data/creatures/Bvul.indvY.repl3.fasta' contains 1 record(s) across 63 lines
Fasta file 'data/creatures/Docc.indvX.repl1.fasta' contains 1 record(s) across 31 lines
Fasta file 'data/creatures/Bvul.indvZ.repl2.fasta' contains 1 record(s) across 203 lines
Fasta file 'data/creatures/Docc.indvX.repl2.fasta' contains 1 record(s) across 238 lines
Fasta file 'data/creatures/Bvul.indvZ.repl3.fasta' contains 1 record(s) across 311 lines
Fasta file 'data/creatu
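
The contents of `creatures.txt` are not shown here, so the sketch below builds a hypothetical two-line list in a temporary file instead. It illustrates that reading inputs with `::::` and reading them from standard input should give the same result; `echo` stands in for `summarize.sh`, and `-k` keeps the output in input order so the two runs can be compared.

```shell
# Check that GNU parallel (not the unrelated moreutils tool of the same name) is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    have_parallel=yes
else
    have_parallel=no
fi

# Hypothetical input list, one path per line (a stand-in for creatures.txt).
list=$(mktemp)
printf '%s\n' \
    data/creatures/Bvul.indvX.repl1.fasta \
    data/creatures/Docc.indvX.repl1.fasta > "$list"

from_file=""
from_stdin=""
if [ "$have_parallel" = yes ]; then
    from_file=$(parallel -k echo {} :::: "$list")   # inputs read from the named file
    from_stdin=$(parallel -k echo {} < "$list")     # inputs read from standard input
    echo "$from_file"
fi
rm -f "$list"
```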

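We can use `:::` multiple times to specify multiple input sources.
Every combination of inputs (one from each source) will be executed.
The numbered placeholders `{1}`, `{2}`, and `{3}` refer to the values drawn from the first, second, and third input sources.
For example, if we wanted to print out every trinucleotide, we could do the following.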

In [6]:
parallel echo {1}{2}{3} ::: A C G T ::: A C G T ::: A C G T

AAA
AAC
AAG
AAT
ACA
ACC
ACG
ACT
AGA
AGC
AGG
AGT
ATA
ATC
ATG
ATT
CAA
CAC
CAG
CAT
CCA
CCC
CCG
CCT
CGA
CGC
CGG
CGT
CTA
CTC
CTG
CTT
GAA
GAC
GAG
GAT
GCA
GCC
GCG
GCT
GGA
GGC
GGG
GGT
GTA
GTC
GTG
GTT
TAA
TAC
TAG
TAT
TCA
TCC
TCG
TCT
TGA
TGC
TGG
TGT
TTA
TTC
TTG
TTT
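
To make "every combination" concrete, the same 64 trinucleotides can be generated with nested loops in plain bash. This is only an equivalent sketch for comparison, not how `parallel` works internally; the array names are illustrative.

```shell
bases=(A C G T)
trinucleotides=()

# One loop per input source; the innermost loop varies fastest,
# which matches the order of the output shown above.
for a in "${bases[@]}"; do
    for b in "${bases[@]}"; do
        for c in "${bases[@]}"; do
            trinucleotides+=("$a$b$c")
        done
    done
done

last=$(( ${#trinucleotides[@]} - 1 ))
echo "${#trinucleotides[@]} combinations, from ${trinucleotides[0]} to ${trinucleotides[$last]}"
# prints: 64 combinations, from AAA to TTT
```

Three input sources of four values each give 4 × 4 × 4 = 64 jobs.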


Multiple input sources also give us finer control over which inputs are matched than a glob does.
Returning to our sample data files, we can process a subset of the `.fasta` files (3 species × 2 individuals × 3 replicates = 18 files) using the following command.

In [7]:
parallel -j 4 ./summarize.sh data/creatures/{1}.indv{2}.repl{3}.fasta ::: Bvul Dori Hdiu ::: X Y ::: 1 2 3

Fasta file 'data/creatures/Bvul.indvX.repl1.fasta' contains 1 record(s) across 473 lines
Fasta file 'data/creatures/Bvul.indvX.repl2.fasta' contains 1 record(s) across 121 lines
Fasta file 'data/creatures/Bvul.indvY.repl1.fasta' contains 1 record(s) across 428 lines
Fasta file 'data/creatures/Bvul.indvX.repl3.fasta' contains 1 record(s) across 500 lines
Fasta file 'data/creatures/Bvul.indvY.repl2.fasta' contains 1 record(s) across 775 lines
Fasta file 'data/creatures/Dori.indvX.repl1.fasta' contains 1 record(s) across 828 lines
Fasta file 'data/creatures/Bvul.indvY.repl3.fasta' contains 1 record(s) across 63 lines
Fasta file 'data/creatures/Dori.indvX.repl2.fasta' contains 1 record(s) across 185 lines
Fasta file 'data/creatures/Dori.indvX.repl3.fasta' contains 1 record(s) across 15 lines
Fasta file 'data/creatures/Dori.indvY.repl2.fasta' contains 1 record(s) across 11 lines
Fasta file 'data/creatures/Dori.indvY.repl1.fasta' contains 1 record(s) across 536 lines
Fasta file 'data/creatur
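
Before launching a combinatorial run like this, it can be useful to preview the commands that would be generated. One way, sketched here under the assumption that GNU parallel is installed, is the `--dry-run` option, which prints each command instead of executing it, so neither `summarize.sh` nor the data files need to exist.

```shell
# Check that GNU parallel (not the unrelated moreutils tool of the same name) is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    have_parallel=yes
else
    have_parallel=no
fi

planned=""
if [ "$have_parallel" = yes ]; then
    # --dry-run prints the commands without running them; -k keeps them in input order.
    planned=$(parallel -k --dry-run ./summarize.sh 'data/creatures/{1}.indv{2}.repl{3}.fasta' \
        ::: Bvul Dori Hdiu ::: X Y ::: 1 2 3)
    echo "$planned" | head -n 3
    # 3 species x 2 individuals x 3 replicates = 18 planned commands.
    echo "$planned" | wc -l
fi
```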