# Running commands on multiple samples

Now, fair warning, you're going to wish we'd told you this earlier on. However, then you wouldn't have had the fun of running and updating each of the previous commands, growling at typos and generally wishing that you'd gone for that cup of coffee before starting this tutorial.

Here we go... we can use a **loop** to run the same commands for multiple samples.

There's a great introduction to bash scripting and loops as part of our [Unix tutorial](../Unix/bash_scripts/bash_scripts.ipynb). But let's take a look at how we could have generated genome alignments for all of our samples using a single loop.

**First let's go to our `data` directory.**

In [None]:
cd data

Whenever you write a loop, it's always a good idea to build it up slowly to check that it's doing what you think.

In [None]:
for r in *.fastq
do
  echo $r
done  

This loop looks for all files whose names end with ".fastq" (the `*` matches any run of characters). The for loop then executes a sequence of commands for each filename that it finds. In the first iteration it's "MT1_1.fastq", then "MT1_2.fastq" and so on. In each iteration, the filename is assigned to a variable called "r".

`for r in *.fastq`

Then, to check we got what we expected, we printed the value of the variable "r" back to the terminal. To use the value of a variable we've created, we put the dollar ($) symbol in front of its name.

`echo $r`
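
One thing worth knowing while you build up a loop like this: if no files match the pattern, the shell leaves `*.fastq` as it is, and the loop runs once with "r" set to the literal text `*.fastq`. The sketch below shows one way to guard against that. It assumes an empty scratch directory created with `mktemp` (the names `scratch`, `start_dir` and `count` are only for illustration).

```shell
# Sketch: work in an empty scratch directory so that no FASTQ files match.
start_dir=$PWD
scratch=$(mktemp -d)
cd "$scratch" || exit 1

count=0
for r in *.fastq
do
  # With no matches, r holds the literal pattern "*.fastq", which is not a file,
  # so we skip it instead of passing it on to later commands.
  [ -e "$r" ] || continue
  count=$((count + 1))
  echo "$r"
done
echo "Found $count FASTQ file(s)"

# Go back to where we started.
cd "$start_dir" || exit 1
```

In the `data` directory the guard never triggers, because the FASTQ files are there, but it can save some confusing error messages if you run a loop from the wrong directory.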

Now, if we left things as they are, we would be running the commands twice for each sample. This is because we have two FASTQ files for each sample, i.e. "_1.fastq" and "_2.fastq". Let's change our loop so that we only get the "_1.fastq" files.

In [None]:
for r1 in *_1.fastq
do
  echo $r1
done  

Great! Now, the only problem here is that we're going to want to use both the "_1.fastq" and the "_2.fastq" files in our mapping. We can get around this by removing the "_1.fastq" from the filename to give us our sample name.

`sample=${r1/_1.fastq/}`

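This will replace the first occurrence of "_1.fastq" in the filename we stored as "r1" with nothing. In our filenames that is the ending, so we are left with just the sample name (e.g. "MT1").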
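
If you'd like to convince yourself what that substitution does, here is a small sketch you can run without any files. It uses "MT1_1.fastq" from above, plus a made-up awkward filename (purely hypothetical) to show how the substitution form differs from bash's suffix-removal form `${r1%_1.fastq}`, which only strips the text when it sits at the very end.

```shell
r1="MT1_1.fastq"

# Substitution, as used in this tutorial: replace the first "_1.fastq" with nothing.
sample=${r1/_1.fastq/}
echo "$sample"          # prints MT1

# Suffix removal: strip "_1.fastq" only if it is at the end of the value.
sample_suffix=${r1%_1.fastq}
echo "$sample_suffix"   # prints MT1

# A hypothetical filename where "_1.fastq" appears twice.
odd="lib_1.fastq_run_1.fastq"
odd_subst=${odd/_1.fastq/}
odd_suffix=${odd%_1.fastq}
echo "$odd_subst"       # prints lib_run_1.fastq
echo "$odd_suffix"      # prints lib_1.fastq_run
```

For the filenames in this tutorial both forms give the same sample name, so either would work here.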

We've added a little descriptive message so that when we run our loop we know which iteration it's on and what it's doing. Let's try adding our HISAT2 mapping command.

_Note: we assume that the HISAT2 index has already been generated as that's a command you'll only need to run once._

In [None]:
for r1 in *_1.fastq
do
  sample=${r1/_1.fastq/}
  echo "Processing sample: "$sample
  
  echo "Mapping sample: "$sample
  hisat2 --max-intronlen 10000 -x PccAS_v3_hisat2.idx \
  -1 $sample"_1.fastq" -2 $sample"_2.fastq" -S $sample".sam"
done

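Notice that because we're using a filename which starts with our variable but ends with a set phrase, we need to write the phrase in double quotes. The quotes tell the shell where the variable name stops. Without them, `$sample_1.fastq` would be read as a variable called "sample_1" followed by ".fastq".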

`$sample"_1.fastq"`
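
Here is a small sketch of why the quotes matter, again runnable without any files. It assumes a sample called "MT1" and that no variable named `sample_1` has been set. It also shows the brace form `${sample}`, which is another common way to mark where a variable name ends.

```shell
sample="MT1"

# Without a delimiter, the shell looks for a variable named "sample_1".
# That variable is unset here, so only ".fastq" is left.
wrong=$sample_1.fastq

# Starting a quoted phrase ends the variable name, as in this tutorial.
quoted=$sample"_1.fastq"

# Braces mark the end of the variable name explicitly.
braced="${sample}_1.fastq"

echo "wrong:  $wrong"
echo "quoted: $quoted"
echo "braced: $braced"
```

The quoted and braced forms produce the same filename, so which one you use is largely a matter of taste.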

Now let's add in our `samtools` commands.

In [None]:
for r1 in *_1.fastq
do
  sample=${r1/_1.fastq/}
  echo "Processing sample: "$sample
  
  echo "Mapping sample: "$sample
  hisat2 --max-intronlen 10000 -x PccAS_v3_hisat2.idx \
  -1 $sample"_1.fastq" -2 $sample"_2.fastq" -S $sample".sam"
  
  echo "Converting SAM to BAM: "$sample
  samtools view -b -o $sample".bam" $sample".sam"
  
  echo "Sorting BAM: "$sample
  samtools sort -o $sample"_sorted.bam" $sample".bam"
  
  echo "Indexing BAM: "$sample
  samtools index $sample"_sorted.bam"
done

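Finally, we don't really want to keep intermediate SAM and unsorted BAM files if we don't have to. They just take up precious space. So, let's turn our mapping and samtools commands into a single pipeline, using the pipe (`|`) to pass the stdout from one command to the next. When we leave out the `-S` option, HISAT2 writes its SAM output to stdout.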

In [None]:
for r1 in *_1.fastq
do
  sample=${r1/_1.fastq/}
  echo "Processing sample: "$sample
  hisat2 --max-intronlen 10000 -x PccAS_v3_hisat2.idx \
  -1 $sample"_1.fastq" -2 $sample"_2.fastq" \
  | samtools view -b - \
  | samtools sort -o $sample"_sorted.bam" - \
  && samtools index $sample"_sorted.bam" 
done
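
One caveat with this kind of pipeline: by default the shell judges a pipeline by its last command only, so in principle the command after `&&` can still run when an earlier step has failed. The sketch below illustrates this with `false` and `cat` as stand-ins for a failing first step and a succeeding last step (they are placeholders for illustration, not part of our workflow). It also shows bash's `pipefail` option as one possible way to make the whole pipeline count as failed.

```shell
# Start from bash's default behaviour.
set +o pipefail

# "false" always fails; "cat" succeeds. The pipeline's status is that of "cat".
if false | cat
then
  default_result="continued"
else
  default_result="stopped"
fi

# With pipefail, the pipeline fails if any command in it fails.
set -o pipefail
if false | cat
then
  pipefail_result="continued"
else
  pipefail_result="stopped"
fi
set +o pipefail

echo "default:  $default_result"
echo "pipefail: $pipefail_result"
```

If you wanted this behaviour in the mapping loop, one option would be to put `set -o pipefail` before the `for` line.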

You could also have used this approach for transcript quantification with Kallisto, assuming you had already generated the Kallisto index.

In [None]:
for r1 in *_1.fastq
do
  echo $r1
  sample=${r1/_1.fastq/}
  echo "Quantifying transcripts for sample: "$sample
  kallisto quant -i PccAS_v3_kallisto -o $sample -b 100 \
  $sample"_1.fastq" $sample"_2.fastq"
done
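
Before running either loop for real, it can be reassuring to do a dry run that only reports what would happen. The sketch below builds on the loops above and checks that each "_1.fastq" file has its "_2.fastq" partner. It assumes a scratch directory with empty stand-in files, where "MT2" is a hypothetical sample whose second file is missing.

```shell
start_dir=$PWD
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Hypothetical stand-in files: MT1 is complete, MT2 lacks its "_2.fastq" file.
touch MT1_1.fastq MT1_2.fastq MT2_1.fastq

ready=""
skipped=""
for r1 in *_1.fastq
do
  sample=${r1/_1.fastq/}
  if [ -f "${sample}_2.fastq" ]
  then
    # Both files are present, so this sample could be mapped or quantified.
    ready="$ready $sample"
    echo "Would process sample: $sample"
  else
    # Report the problem on stderr and move on to the next sample.
    skipped="$skipped $sample"
    echo "Skipping $sample: ${sample}_2.fastq not found" >&2
  fi
done

cd "$start_dir" || exit 1
```

Once the messages look right, you could swap the first `echo` for the real HISAT2 or Kallisto command.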