# Bash for loops

## Intro

Like most programming languages, Bash has a flow control mechanism called a `for` loop for repeating a command or set of commands on multiple values or files.
Instead of executing the same statement over and over again...

In [1]:
echo 1
echo 2
echo 3
echo 4
echo 5

1
2
3
4
5


...you can write a `for` loop so that you only have to invoke the statement once.

In [2]:
for i in 1 2 3 4 5; do echo $i; done

1
2
3
4
5


The cell above shows how I typically write for loops on the command line—compressed on a single line.
But if you're writing the loop in a script, you can improve readability by spreading the loop across multiple lines, as shown here.

In [3]:
for i in 1 2 3 4 5; do
    echo $i
done

1
2
3
4
5


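The first line of the loop names the loop variable (`i`) and lists the inputs it will take on, one per iteration.
The indented body of the loop declares the operation(s) to be performed on each input, referring to the current value as `$i`.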
The final line `done` indicates the end of the loop.

> *Note 1: the indentation of the loop body is optional, but improves readability*

> *Note 2: it's common for the body of the loop to include multiple lines/operations on the input*

The for loop above is a trivial example in a couple of ways.
First, using a for loop for 5 inputs doesn't save you *that* much time or typing.
Second, the "operation" we're performing on this input (printing the value to the terminal) is about as trivial as it gets.
However, the for loop is a huge time saver when you need to operate on dozens, hundreds, or even thousands of inputs, or when the operation you're performing takes minutes or hours instead of milliseconds.
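
To make that concrete, here is a minimal sketch of a loop whose body performs several operations on each of a handful of files. The scratch directory, the `demo*.txt` file names, and their contents are all invented for illustration; they are not part of the course data.

```shell
# Hypothetical setup: a scratch directory with three small text files.
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/demo1.txt"
printf 'a\nb\nc\n' > "$workdir/demo2.txt"
printf 'a\n' > "$workdir/demo3.txt"

# The glob expands to every matching file, so the loop line stays the
# same whether there are 3 inputs or 3,000.
total=0
for file in "$workdir"/demo*.txt; do
    lines=$(wc -l < "$file")                    # operation 1: count lines
    total=$((total + lines))                    # operation 2: keep a running total
    echo "$(basename "$file"): $lines lines"    # operation 3: report
done
echo "total: $total lines"

# Clean up the scratch directory.
rm -r "$workdir"
```

Only the list after `in` changes as the number of inputs grows; the body is written once.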

## Nested for loops

Sometimes you need to loop over two variables.
For example, you might have reads for 5 species and 3 samples per species.
So if you want to loop over each read set, you can either type out each of the 15 species/sample combinations (yuck!) or you can use a nested loop.

In [4]:
for i in 1 2 3 4 5; do for j in a b c; do echo sample${i}${j}.fastq; done; done

sample1a.fastq
sample1b.fastq
sample1c.fastq
sample2a.fastq
sample2b.fastq
sample2c.fastq
sample3a.fastq
sample3b.fastq
sample3c.fastq
sample4a.fastq
sample4b.fastq
sample4c.fastq
sample5a.fastq
sample5b.fastq
sample5c.fastq


Here's the same nested loop expanded for readability.

In [5]:
for i in 1 2 3 4 5; do
    for j in a b c; do
        echo sample${i}${j}.fastq
    done
done

sample1a.fastq
sample1b.fastq
sample1c.fastq
sample2a.fastq
sample2b.fastq
sample2c.fastq
sample3a.fastq
sample3b.fastq
sample3c.fastq
sample4a.fastq
sample4b.fastq
sample4c.fastq
sample5a.fastq
sample5b.fastq
sample5c.fastq
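
The same pattern works with descriptive variable names in place of `i` and `j`. The sketch below also keeps a counter so the number of combinations can be checked; the `group` and `replicate` names and their values are placeholders invented for this example.

```shell
# Hypothetical labels, chosen only for illustration.
count=0
for group in red green blue; do
    for replicate in 1 2; do
        # The inner loop runs to completion for each value of the outer loop.
        echo "${group}-rep${replicate}.txt"
        count=$((count + 1))
    done
done
echo "$count combinations"   # 3 groups x 2 replicates
```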


## Exercise

Imagine we have reads for 5 species (Bant, Bcer, Ftul, Yent, Ypes), and 3 samples from each species.

In [6]:
ls data/loopy/

Bant-samp1.fastq  Bcer-samp2.fastq  Ftul-samp3.fastq  Ypes-samp1.fastq
Bant-samp2.fastq  Bcer-samp3.fastq  Yent-samp1.fastq  Ypes-samp2.fastq
Bant-samp3.fastq  Ftul-samp1.fastq  Yent-samp2.fastq  Ypes-samp3.fastq
Bcer-samp1.fastq  Ftul-samp2.fastq  Yent-samp3.fastq


Write a nested for loop that will count the number of lines in each file.

In [7]:
# your code goes here

## Parallel execution with the for loop

Normally, a for loop performs operations on the input *sequentially*.
That is, it executes operation(s) in the body of the loop on the first input, then it moves to the second input, and so on until all inputs have been processed.
However, it is also possible to have a for loop start the operations on all inputs at once, so that they run simultaneously.

The examples above are too trivial to make an effective demonstration, so here we'll use something that requires a bit more runtime—our trusty dusty `analyze.sh` script!

Let's process 5 of our Fastq files.
If we use a sequential loop here, it will take 30-60 seconds to process 5 inputs.

In [8]:
for species in Bant Bcer Ftul Yent Ypes; do ./analyze.sh data/loopy/${species}-sample1.fastq; done

Analysis of 'data/loopy/Bant-sample1.fastq'...done! (completed in 7 seconds)
Analysis of 'data/loopy/Bcer-sample1.fastq'...done! (completed in 7 seconds)
Analysis of 'data/loopy/Ftul-sample1.fastq'...done! (completed in 9 seconds)
Analysis of 'data/loopy/Yent-sample1.fastq'...done! (completed in 11 seconds)
Analysis of 'data/loopy/Ypes-sample1.fastq'...done! (completed in 10 seconds)


We can run all 5 jobs in parallel by replacing the second `;` symbol with a `&` symbol, indicating that the command should be run in the background.
This allows the for loop to move on to the second input before the first input is done being processed, and so on until processing of all inputs has been initiated.
The trailing `wait` then pauses until every background job has finished.
Since all inputs are being processed simultaneously, the total run time is roughly that of the slowest single job (about 10 seconds here) rather than the sum of all five.

In [9]:
for species in Bant Bcer Ftul Yent Ypes; do ./analyze.sh data/loopy/${species}-sample1.fastq & done && wait

[1] 11088
[2] 11089
[3] 11090
[4] 11091
[5] 11092
Analysis of 'data/loopy/Ftul-sample1.fastq'...done! (completed in 7 seconds)
Analysis of 'data/loopy/Yent-sample1.fastq'...done! (completed in 8 seconds)
Analysis of 'data/loopy/Bant-sample1.fastq'...done! (completed in 10 seconds)
Analysis of 'data/loopy/Bcer-sample1.fastq'...done! (completed in 10 seconds)
[1]   Done                    ./analyze.sh data/loopy/${species}-sample1.fastq
[3]   Done                    ./analyze.sh data/loopy/${species}-sample1.fastq
[4]-  Done                    ./analyze.sh data/loopy/${species}-sample1.fastq
Analysis of 'data/loopy/Ypes-sample1.fastq'...done! (completed in 11 seconds)
[2]-  Done                    ./analyze.sh data/loopy/${species}-sample1.fastq
[5]+  Done                    ./analyze.sh data/loopy/${species}-sample1.fastq
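
If `analyze.sh` is not at hand, the same effect can be reproduced with a stand-in. In this sketch, `slow_job` is a hypothetical function that just sleeps for one second, and Bash's built-in `SECONDS` counter is used to time the loop. The order in which the jobs report back is not guaranteed.

```shell
# Hypothetical stand-in for a long-running analysis: sleep for 1 second.
slow_job() {
    sleep 1
    echo "finished $1"
}

SECONDS=0                       # Bash counter of elapsed whole seconds
for species in Bant Bcer Ftul; do
    slow_job "$species" &       # '&' sends each job to the background
done
wait                            # block until every background job is done
elapsed=$SECONDS
echo "all jobs finished after about ${elapsed} second(s)"
```

Run sequentially (with `;` in place of `&`), the same three jobs would take about 3 seconds.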


> **IMPORTANT NOTE**: This approach should **NEVER** be used when the number of inputs exceeds the number of processors on your computer.
> If you have 100 inputs and 8 processors, a parallel for loop will launch all 100 jobs at once, and Bad Things™ will happen: the jobs compete for CPU time and memory, and the whole machine can grind to a halt.
> In cases where the number of inputs exceeds the number of available processors, GNU `parallel` is the recommended solution.
> See `parallel.ipynb` for more details.
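
For readers without GNU `parallel` installed, one rough workaround is to launch the background jobs in fixed-size batches and `wait` between batches. This is a plain-Bash substitute, not how `parallel` works, and it is less efficient because each batch waits for its slowest job. The `max_jobs` limit and the `fake_job` function are assumptions made up for this sketch.

```shell
# Hypothetical stand-in for real work.
fake_job() {
    sleep 1
    echo "processed input $1"
}

max_jobs=2      # assumed limit; in practice, no more than your processor count
running=0       # jobs launched in the current batch
batches=0       # full batches completed so far
for input in 1 2 3 4 5; do
    fake_job "$input" &
    running=$((running + 1))
    if [ "$running" -ge "$max_jobs" ]; then
        wait                    # pause until the current batch has finished
        running=0
        batches=$((batches + 1))
    fi
done
wait                            # catch any jobs left in a final partial batch
echo "completed $batches full batch(es) plus $running leftover job(s)"
```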