### Running Bash in a Jupyter notebook
To run Bash commands in a Jupyter notebook, you need to install the Bash kernel for Jupyter. For that, run the following commands in the terminal:
```
pip install bash_kernel
python -m bash_kernel.install
```

In [None]:
# Test a bash command
ls -lh

### Get sequencing data

### Option 1: use fastq-dump to download from the SRA (Sequence Read Archive)

In [None]:
# Check fastq-dump options
fastq-dump --help

To download the reads for run SRR3649298, run:
```
fastq-dump --accession SRR3649298 --split-files --outdir fastq --gzip
```

In [None]:
# Download reads from SRR3649298
fastq-dump --accession SRR3649298 --split-files --outdir fastq --gzip

### Option 2: download data from ENA (European Nucleotide Archive)
The same data (SRR3649298) is available from ENA in fastq format:
https://www.ebi.ac.uk/ena/data/view/SRR3649298  
The raw fastq files are available from these links:  
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_1.fastq.gz  
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_2.fastq.gz  

To download these files, run:
```
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_1.fastq.gz -P fastq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_2.fastq.gz -P fastq
```
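
The two links above follow a regular pattern, so they can be generated from the accession. The sketch below uses a hypothetical helper, `ena_fastq_url`, and assumes a 10-character run accession such as SRR3649298, for which the sub-directory is `00` followed by the last digit of the accession. Other accession lengths may use a different layout, so check the ENA page before relying on it.

```shell
# Hypothetical helper: build the ENA FTP URL for one mate of a paired-end run.
# Assumes a 10-character accession (e.g. SRR3649298).
ena_fastq_url() {
    local acc="$1" mate="$2"
    local prefix="${acc:0:6}"      # first six characters, e.g. SRR364
    local subdir="00${acc: -1}"    # "00" + last digit, e.g. 008
    echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/${subdir}/${acc}/${acc}_${mate}.fastq.gz"
}

# Print the URLs for both mates
for mate in 1 2; do
    ena_fastq_url SRR3649298 "$mate"
done
```

Each printed URL can then be passed to `wget` with `-P fastq`, as in the commands above.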

In [None]:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_1.fastq.gz -P fastq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_2.fastq.gz -P fastq

In [None]:
# Check the size of the downloaded files
ls -lh fastq/

In [None]:
# Check the format of the fastq files
gunzip -c fastq/SRR3649298_1.fastq.gz | head -n 12 

In [None]:
# Extract 100,000 reads (400,000 lines).
gunzip -c fastq/SRR3649298_1.fastq.gz | head -n 400000 > fastq/test_R1.fastq
gunzip -c fastq/SRR3649298_2.fastq.gz | head -n 400000 > fastq/test_R2.fastq
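
Since each fastq record occupies four lines, the number of reads in a file is its line count divided by 4. A minimal sketch on a made-up three-read file (the reads and the temporary file are for illustration only):

```shell
# Build a tiny three-read FASTQ file (made-up reads)
tmp_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCC\n+\nIIII\n' > "$tmp_fastq"

# Each FASTQ record spans 4 lines, so reads = lines / 4
n_lines=$(wc -l < "$tmp_fastq")
n_reads=$((n_lines / 4))
echo "lines: $((n_lines)), reads: $n_reads"

# Clean up the temporary file
rm -f "$tmp_fastq"
```

Applied to `fastq/test_R1.fastq`, the same calculation should report 100,000 reads.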

In [None]:
# Check the new files
ls -lh fastq/

### Read alignment (Bowtie2)

In [None]:
# Check the options of Bowtie2
bowtie2 --help

Many prebuilt Bowtie2 indexes can be downloaded directly from Illumina:
https://support.illumina.com/sequencing/sequencing_software/igenome.html

The sacCer3 archive is available at
ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Saccharomyces_cerevisiae/UCSC/sacCer3/Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz

In [None]:
# Download the Illumina archive with the sacCer3 genome
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Saccharomyces_cerevisiae/UCSC/sacCer3/Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz

# Unpack the archive
tar -xzf Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz

In [None]:
# Check the downloaded files
ls -lh

In [None]:
# View the directory tree for the downloaded folder
tree -d Saccharomyces_cerevisiae

In [None]:
# Copy the sacCer3 index to a new folder, e.g. bt2_index_sacCer3
mkdir bt2_index_sacCer3
cp Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/* bt2_index_sacCer3/
ls -lh bt2_index_sacCer3

Alternatively, one can use `bowtie2-build` to build the index from the fasta file.

In [None]:
# Check the options
bowtie2-build --help

In [None]:
# Create a folder for the aligned data (sam/bam files)
mkdir -p bam

# Align reads with bowtie2
bowtie2 -x bt2_index_sacCer3/genome -1 fastq/test_R1.fastq -2 fastq/test_R2.fastq -S bam/test.sam 2> bam/test.bt2.log
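
In the command above, `-S` names the SAM output file, while `2>` redirects standard error, where bowtie2 prints its alignment summary, to a log file. The sketch below illustrates the same separation of the two streams with a hypothetical stand-in function, `fake_aligner`, and made-up output. It does not call bowtie2.

```shell
# Hypothetical stand-in for an aligner: records go to stdout, the report to stderr
fake_aligner() {
    printf 'read1\tchrI\t100\n'              # stdout: a made-up record
    echo "summary: 1 read processed" >&2     # stderr: a made-up report
}

tmp_dir=$(mktemp -d)

# ">" captures stdout, "2>" captures stderr, each in its own file
fake_aligner > "$tmp_dir/out.txt" 2> "$tmp_dir/run.log"

out_content=$(cat "$tmp_dir/out.txt")
log_content=$(cat "$tmp_dir/run.log")
echo "stdout file: $out_content"
echo "stderr file: $log_content"

# Clean up the temporary directory
rm -rf "$tmp_dir"
```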

In [None]:
# Check the size of the aligned reads (sam file) 
ls -lh bam/test.sam

In [None]:
# View the alignment report generated by bowtie2
cat bam/test.bt2.log

Some read pairs could not be aligned as a proper concordant pair, or did not align at all.

In [None]:
# Check bowtie2 options for suppressing bad alignments
bowtie2 --help | grep "suppress"

In [None]:
# Align reads with bowtie2, discarding unpaired alignments (--no-mixed), discordant alignments (--no-discordant) and unaligned reads (--no-unal)
bowtie2 --no-mixed --no-discordant --no-unal \
        -x bt2_index_sacCer3/genome -1 fastq/test_R1.fastq -2 fastq/test_R2.fastq -S bam/test.sam 2> bam/test.bt2.log

In [None]:
cat bam/test.bt2.log

In [None]:
# Check the first lines of the sam file
head -30 bam/test.sam

### Faster way of using Bowtie2

In [None]:
# Get the number of CPU cores
CORES=$(getconf _NPROCESSORS_ONLN)
echo $CORES
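
The cells that follow pass `$((CORES-1))` as the thread count, which leaves one core free. On a single-core machine that expression evaluates to 0, which may not be a useful thread count. One possible guard, sketched with a hypothetical helper `threads_for`:

```shell
# Hypothetical helper: leave one core free, but never return fewer than 1 thread
threads_for() {
    local cores="$1"
    local n=$((cores - 1))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

CORES=$(getconf _NPROCESSORS_ONLN)
THREADS=$(threads_for "$CORES")
echo "cores: $CORES, threads to use: $THREADS"
```

`$THREADS` could then replace `$((CORES-1))` wherever a thread count is passed.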

In [None]:
# Run bowtie2 on multiple cores to improve speed
bowtie2 --no-mixed --no-discordant --no-unal \
        -p $((CORES-1)) \
        -x bt2_index_sacCer3/genome -1 fastq/test_R1.fastq -2 fastq/test_R2.fastq -S bam/test.sam 2> bam/test.bt2.log

OK, that was much faster. Now let's store the alignments in a more compact way (bam file instead of sam file). We can convert the sam to bam format using `samtools view`.

In [None]:
# Check the arguments of samtools view
samtools view

In [None]:
# Convert the sam file to bam format
samtools view -b --threads $((CORES-1)) bam/test.sam > bam/test.bam

In [None]:
# Let's compare the sizes of sam and bam files
ls -lh bam/

The bam file is about 5 times smaller than the corresponding sam file. Bam is a compressed binary format, so trying to read it as regular text won't work.
```
head bam/test.bam
```
will output some nonsense. The proper way of viewing a bam file is using `samtools view`.

In [None]:
# Proper way of listing the alignments from a bam file
samtools view bam/test.bam | head

In [None]:
# Count the total number of alignments
samtools view bam/test.bam | wc -l

There are 183,030 reads, corresponding to 91,515 DNA fragments (paired-end reads).

In [None]:
# Let's look again at the bowtie2 log:
cat bam/test.bt2.log

These 91,515 properly aligned pairs are the sum of the two "aligned concordantly" categories in the log (exactly 1 time and >1 times): 76,533 + 14,982 = 91,515.
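
The counts quoted above can be cross-checked with shell arithmetic: the two log values add up to the number of pairs, and each pair contributes two alignment records to the bam file. The variable names are illustrative only.

```shell
# Values quoted in the text above
concordant_a=76533
concordant_b=14982

# Properly aligned pairs, and the alignment records they produce (two per pair)
pairs=$((concordant_a + concordant_b))
records=$((pairs * 2))

echo "pairs: $pairs, alignment records: $records"
```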

### Explore the bam file in more detail

In [None]:
# Explore the columns of the bam file
# Column 1: alignment name
samtools view bam/test.bam | head | cut -f 1

In [None]:
# Column 2: Sum of multiple bitwise flags
# See https://samtools.github.io/hts-specs/SAMv1.pdf for an explanation of different FLAGs
samtools view bam/test.bam | head | cut -f 2

One can easily check the meaning of each FLAG using the following website:
https://broadinstitute.github.io/picard/explain-flags.html
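
The same decoding can be done locally with bitwise arithmetic. The sketch below uses a hypothetical helper, `decode_flag`, with the bit values from the SAM specification; the short labels are informal names chosen for this example, not the official wording.

```shell
# Decode a SAM FLAG into the labels of the bits that are set
decode_flag() {
    local flag="$1"
    local bits=(1 2 4 8 16 32 64 128 256 512 1024 2048)
    local names=("paired" "proper_pair" "unmapped" "mate_unmapped"
                 "reverse_strand" "mate_reverse_strand" "first_in_pair"
                 "second_in_pair" "secondary" "qc_fail" "duplicate"
                 "supplementary")
    local i out=""
    for i in "${!bits[@]}"; do
        # Bitwise AND is non-zero when this bit is set in the flag
        if (( flag & bits[i] )); then
            out+="${names[i]} "
        fi
    done
    echo "${out% }"   # drop the trailing space
}

decode_flag 99
decode_flag 147
```

Flags 99 and 147 are a typical combination for the two mates of a properly aligned pair: the first mate on the forward strand and the second on the reverse strand.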

In [None]:
# Column 3: chromosome name
samtools view bam/test.bam | head | cut -f 3

In [None]:
# Column 4: 1-based leftmost mapping position
samtools view bam/test.bam | head | cut -f 4

In [None]:
# Column 5: mapping quality
# −10 log10 Prob(mapping position is wrong)
samtools view bam/test.bam | head | cut -f 5
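
The formula in the comment above can be inverted to turn a mapping quality Q back into a probability: p = 10^(-Q/10). A minimal sketch using `awk` for the floating-point arithmetic (the helper name `mapq_to_prob` is hypothetical):

```shell
# Convert a mapping quality Q to the probability that the mapping position is wrong
mapq_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

for q in 10 20 30; do
    echo "MAPQ $q -> P(wrong position) = $(mapq_to_prob "$q")"
done
```

Each increase of 10 in mapping quality corresponds to a tenfold lower probability of a wrong mapping position.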

In [None]:
# Column 6: CIGAR string
# Column 7: ‘=’ if the paired read was aligned on the same chromosome
# Column 8: position of the paired read
# Column 9: length of the whole DNA fragment whose ends were sequenced (+/- for left/right end)
samtools view bam/test.bam | head | cut -f 9

In [None]:
# Column 10: DNA sequence
samtools view bam/test.bam | head | cut -f 10

In [None]:
# Column 11: ASCII representation of base quality plus 33
samtools view bam/test.bam | head | cut -f 11
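
To read a quality string, each character is converted back to a number by taking its ASCII code and subtracting 33. A minimal sketch with a hypothetical helper, `phred_score`:

```shell
# Convert one quality character to its Phred score (ASCII code minus 33)
phred_score() {
    local code
    code=$(printf '%d' "'$1")   # a leading quote makes printf return the character code
    echo $((code - 33))
}

phred_score 'I'
phred_score '!'
```

The character `!` has ASCII code 33 and therefore encodes the lowest score, 0.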

In [None]:
# By default, the sam/bam file produced by bowtie2 is ordered by read (the two mates of a pair are adjacent), not by genomic position
# Show the following 3 columns: 1 - read name, 3 - chromosome, 4 - position
samtools view bam/test.bam | head | cut -f 1,3,4

In [None]:
# Many tools require the bam file to be sorted such that the alignments occur in “genome order”. 
# That is, ordered positionally based upon their alignment coordinates on each chromosome.

In [None]:
# We can re-sort the BAM file using 
samtools sort

In [None]:
samtools sort -o bam/test.sorted.bam --threads $((CORES-1)) bam/test.bam

In [None]:
# Check the sorted bam file (columns 1 - read name, 3 - chromosome, 4 - position)
samtools view bam/test.sorted.bam | head | cut -f 1,3,4

In [None]:
# Indexing a position sorted bam file allows one to quickly extract alignments overlapping particular genomic regions.
# This indexing is done using
samtools index

In [None]:
samtools index -@ $((CORES-1)) bam/test.sorted.bam

In [None]:
# Check the sizes (the index file .bai is very small)
ls -lh bam/

### Pipes
All the above operations could be run in a more compact way, using pipes (see https://www.geeksforgeeks.org/piping-in-unix-or-linux/ for more info on the pipe operator).
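
As a minimal illustration of the pipe operator with standard tools only, the sketch below counts records per chromosome in three made-up, tab-separated lines (read name, chromosome, position). Each command reads the output of the previous one, so no intermediate file is written.

```shell
# Three made-up records: read name, chromosome, position
records=$(printf 'r1\tchrI\t100\nr2\tchrII\t250\nr3\tchrI\t900\n')

# cut keeps column 2, sort groups identical names, uniq -c counts them,
# and awk rearranges each line as "chromosome: count"
per_chrom=$(printf '%s\n' "$records" | cut -f 2 | sort | uniq -c | awk '{ print $2 ": " $1 }')
echo "$per_chrom"
```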

In [None]:
# Using pipes to perform all these operations in one step
bowtie2 --no-mixed --no-discordant --no-unal \
        -p $((CORES-1)) \
        -x bt2_index_sacCer3/genome -1 fastq/test_R1.fastq -2 fastq/test_R2.fastq 2> bam/test2.bt2.log \
        | samtools view -b -@ $((CORES-1)) - \
        | samtools sort -o bam/test2.sorted.bam -@ $((CORES-1)) -
samtools index -@ $((CORES-1)) bam/test2.sorted.bam

In [None]:
ls -lh bam/

We'll switch now to another Jupyter notebook and analyze the aligned data in R.

Further reading resources: <br>
https://samtools.github.io/hts-specs/SAMv1.pdf <br>
http://quinlanlab.org/tutorials/samtools/samtools.html <br>
http://biobits.org/samtools_primer.html <br>

