# Intro to basic command line tools (samtools, bowtie2, etc.)
### Running Bash commands inside a Jupyter notebook
To run Bash commands in a Jupyter notebook, you need to install the Bash kernel for Jupyter. For that, run the following commands in the terminal:
```
pip install bash_kernel
python -m bash_kernel.install
```
If the Bash kernel is installed and selected as the kernel for the current Jupyter notebook, we can run Bash commands inside the notebook. For example, we can use the `tree` command to show the contents of the current folder. Please make sure that you have downloaded all of the following files, and the tree of your current folder looks like this: 
```
.
├── Part_1_MNase-seq_intro.pptx
├── Part_2_Bash_commands.ipynb
├── Part_3_R_commands.ipynb
├── data_backup
│   ├── SRR3649298.bt2.log
│   ├── SRR3649298.sorted.bam
│   ├── SRR3649298.sorted.bam.bai
│   ├── SRR3649298_1.fastq.gz
│   ├── SRR3649298_2.fastq.gz
│   └── TxDbFromUCSC.sqlite
├── info.txt
└── sacCer3_genome
    └── genome.fa
```

In [1]:
tree

.
├── Part_1_MNase-seq_intro.pptx
├── Part_2_Bash_commands.ipynb
├── Part_3_R_commands.ipynb
├── data_backup
│   ├── SRR3649298.bt2.log
│   ├── SRR3649298.sorted.bam
│   ├── SRR3649298.sorted.bam.bai
│   ├── SRR3649298_1.fastq.gz
│   ├── SRR3649298_2.fastq.gz
│   └── TxDbFromUCSC.sqlite
├── info.txt
└── sacCer3_genome
    └── genome.fa

2 directories, 11 files


### Get sequencing data

Option 1: GEO database / Sequence Read Archive (SRA)  
https://www.ncbi.nlm.nih.gov/geo/  
https://www.ncbi.nlm.nih.gov/sra  
Use `fastq-dump` (part of the SRA Toolkit) to download sequencing data from the GEO database or the Sequence Read Archive (SRA). To download the MNase-seq data with the accession number SRR3649298, run
```
fastq-dump --accession SRR3649298 --split-files --outdir fastq_files --gzip
```

Option 2: European Nucleotide Archive (ENA)  
https://www.ebi.ac.uk/ena  
Use the direct links to the fastq files provided on the ENA website. The same data (SRR3649298) is available there in different formats:  
https://www.ebi.ac.uk/ena/data/view/SRR3649298  

The raw fastq files are available from these links:  
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_1.fastq.gz  
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_2.fastq.gz  

To download these files, run:
```
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_1.fastq.gz -P fastq_files
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR364/008/SRR3649298/SRR3649298_2.fastq.gz -P fastq_files
```

These files can be very big (100 MB - 1 TB in size). To avoid a large download, I have copied the necessary `fastq` files into the folder `data_backup`. Let's assume we have downloaded these files using the above commands (I will just copy the files from the local folder).

In [2]:
mkdir -p fastq_files
cp data_backup/*.fastq.gz fastq_files/

In [3]:
# Check the size of the downloaded files
ls -lh fastq_files/

total 1387616
-rw-r--r--@ 1 cherejirv  1360859114   334M Nov 22 16:07 SRR3649298_1.fastq.gz
-rw-r--r--@ 1 cherejirv  1360859114   343M Nov 22 16:07 SRR3649298_2.fastq.gz


Notice that these are gzip-compressed files, and together they occupy ~700 MB of space.

In [4]:
# Check the format of the fastq files. Print the first 12 lines of the first file
gunzip -c fastq_files/SRR3649298_1.fastq.gz | head -n 12

@SRR3649298.1 HWI-D00638:56:C6FTLANXX:2:1101:17378:1996/1
NTAAACTGTGCTCGGCCAGTGGAGACGTACAAATTCCACTACTATTCGTA
+
#<BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR3649298.2 HWI-D00638:56:C6FTLANXX:2:1101:18353:1996/1
NGTTCCCTGTGTTCTAAAAATTGGAAAAACATGGCTATTAAAACCTGTGG
+
#<BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR3649298.3 HWI-D00638:56:C6FTLANXX:2:1101:1209:2084/1
NTCTCAAATCATTTAATAACTACAAAGAAACAAAAAAAAATCAATGTTGA
+
#<<BBFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFF


Let's first work with only a small subset of 100,000 reads. Since each short read occupies 4 lines in a fastq file, we need to copy the first 400,000 lines from each `fastq` file.

In [5]:
# Extract 100,000 reads (400,000 lines) from the original fastq files
gunzip -c fastq_files/SRR3649298_1.fastq.gz | head -n 400000 > fastq_files/test_R1.fastq
gunzip -c fastq_files/SRR3649298_2.fastq.gz | head -n 400000 > fastq_files/test_R2.fastq

In [6]:
# Check the new files
ls -lh fastq_files/

total 1452608
-rw-r--r--@ 1 cherejirv  1360859114   334M Nov 22 16:07 SRR3649298_1.fastq.gz
-rw-r--r--@ 1 cherejirv  1360859114   343M Nov 22 16:07 SRR3649298_2.fastq.gz
-rw-r--r--  1 cherejirv  1360859114    16M Nov 22 16:07 test_R1.fastq
-rw-r--r--  1 cherejirv  1360859114    16M Nov 22 16:07 test_R2.fastq
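
As a quick sanity check, the number of reads in a fastq file is its line count divided by 4, and the two mate files should list the same read names in the same order. The sketch below builds two tiny toy fastq files (invented reads, not taken from SRR3649298) in a temporary folder so that it runs anywhere. The helper names `count_reads` and `read_names` are introduced here only for illustration.

```bash
# Toy example: count reads and check mate pairing in two small fastq files.
# The reads below are invented for illustration only.
tmpdir=$(mktemp -d)

printf '@r1/1\nACGT\n+\nFFFF\n@r2/1\nTTGA\n+\nFFFF\n@r3/1\nGGCC\n+\nFFFF\n' > "$tmpdir/toy_R1.fastq"
printf '@r1/2\nTGCA\n+\nFFFF\n@r2/2\nCCAT\n+\nFFFF\n@r3/2\nAATT\n+\nFFFF\n' > "$tmpdir/toy_R2.fastq"

# Each read occupies 4 lines, so reads = lines / 4
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Print the read names: every 4th line starting at line 1,
# without the leading "@" and without the trailing "/1" or "/2"
read_names() {
    awk 'NR % 4 == 1 { sub(/^@/, ""); sub(/\/[12]$/, ""); print }' "$1"
}

n_R1=$(count_reads "$tmpdir/toy_R1.fastq")
n_R2=$(count_reads "$tmpdir/toy_R2.fastq")
echo "R1: $n_R1 reads, R2: $n_R2 reads"

# The two files are properly paired if the name lists are identical
if [ "$(read_names "$tmpdir/toy_R1.fastq")" = "$(read_names "$tmpdir/toy_R2.fastq")" ]; then
    pairing="ok"
else
    pairing="mismatch"
fi
echo "Mate pairing: $pairing"

rm -r "$tmpdir"
```

On the real data, one would call `count_reads fastq_files/test_R1.fastq` and expect 100000.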


### Alignment of paired-end reads (Bowtie2)
The reads in the fastq files were obtained by digesting yeast DNA with micrococcal nuclease (MNase). Let's detect the genomic loci from which these DNA fragments originated. For this we will align the reads to the yeast genome, using the Bowtie2 aligner.

In [7]:
# Check the options of Bowtie2
bowtie2 --help

Bowtie 2 version 2.3.4.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: 
  bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i>} [-S <sam>]

  <bt2-idx>  Index filename prefix (minus trailing .X.bt2).
             NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <i>        Files with interleaved paired-end FASTQ reads
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times. 

First, we'll need to create a Bowtie2 index corresponding to the yeast genome. For this we'll use `bowtie2-build`.

In [8]:
# Check the options
bowtie2-build --help

Bowtie 2 version 2.3.4.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
    reference_in            comma-separated list of files with ref sequences
    bt2_index_base          write bt2 data to files with this dir/basename
*** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
Options:
    -f                      reference files are Fasta (default)
    -c                      reference sequences given on cmd line (as
                            <reference_in>)
    --large-index           force generated index to be 'large', even if ref
                            has fewer than 4 billion nucleotides
    --debug                 use the debug binary; slower, assertions enabled
    --sanitized             use sanitized binary; slower, uses ASan and/or UBSan
    --verbose               log the issued command
    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
   

In [9]:
# Create a folder for the Bowtie2 index
mkdir -p bt2_index

# Create the Bowtie2 index for sacCer3 genome
bowtie2-build sacCer3_genome/genome.fa bt2_index/sacCer3

Settings:
  Output files: "bt2_index/sacCer3.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  sacCer3_genome/genome.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 3039276
Using parameters --bmax 2279457 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 2279457 --dcv 1024
Construct

In [10]:
# Let's check the index files that we created
ls -lh bt2_index/

total 51040
-rw-r--r--  1 cherejirv  1360859114   7.9M Nov 22 16:07 sacCer3.1.bt2
-rw-r--r--  1 cherejirv  1360859114   2.9M Nov 22 16:07 sacCer3.2.bt2
-rw-r--r--  1 cherejirv  1360859114   161B Nov 22 16:07 sacCer3.3.bt2
-rw-r--r--  1 cherejirv  1360859114   2.9M Nov 22 16:07 sacCer3.4.bt2
-rw-r--r--  1 cherejirv  1360859114   7.9M Nov 22 16:07 sacCer3.rev.1.bt2
-rw-r--r--  1 cherejirv  1360859114   2.9M Nov 22 16:07 sacCer3.rev.2.bt2


Prebuilt Bowtie2 indexes for many genomes can be downloaded directly from Illumina (iGenomes):  
https://support.illumina.com/sequencing/sequencing_software/igenome.html

For example, the sacCer3 archive is available at
ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Saccharomyces_cerevisiae/UCSC/sacCer3/Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz

Now we can go ahead and align the short reads to the yeast genome.

In [11]:
# Create a folder for the aligned data (sam/bam files)
mkdir -p bam_files

# Align reads with bowtie2, discarding unpaired reads, 
# reads that are aligned discordantly, and unaligned reads
bowtie2 --no-mixed --no-discordant --no-unal \
        -x bt2_index/sacCer3 \
        -1 fastq_files/test_R1.fastq -2 fastq_files/test_R2.fastq \
        -S bam_files/test.sam 2> bam_files/test.bt2.log

In [12]:
# Check the size of the aligned reads (sam file) 
ls -lh bam_files/

total 80368
-rw-r--r--  1 cherejirv  1360859114   246B Nov 22 16:07 test.bt2.log
-rw-r--r--  1 cherejirv  1360859114    39M Nov 22 16:07 test.sam


In [13]:
# View the alignment report generated by bowtie2
cat bam_files/test.bt2.log

100000 reads; of these:
  100000 (100.00%) were paired; of these:
    8485 (8.48%) aligned concordantly 0 times
    76533 (76.53%) aligned concordantly exactly 1 time
    14982 (14.98%) aligned concordantly >1 times
91.52% overall alignment rate


In [14]:
# Let's check the first lines of the sam file containing the alignments
head -n 30 bam_files/test.sam

@HD	VN:1.0	SO:unsorted
@SQ	SN:chrI	LN:230218
@SQ	SN:chrII	LN:813184
@SQ	SN:chrIII	LN:316620
@SQ	SN:chrIV	LN:1531933
@SQ	SN:chrIX	LN:439888
@SQ	SN:chrM	LN:85779
@SQ	SN:chrV	LN:576874
@SQ	SN:chrVI	LN:270161
@SQ	SN:chrVII	LN:1090940
@SQ	SN:chrVIII	LN:562643
@SQ	SN:chrX	LN:745751
@SQ	SN:chrXI	LN:666816
@SQ	SN:chrXII	LN:1078177
@SQ	SN:chrXIII	LN:924431
@SQ	SN:chrXIV	LN:784333
@SQ	SN:chrXV	LN:1091291
@SQ	SN:chrXVI	LN:948066
@PG	ID:bowtie2	PN:bowtie2	VN:2.3.4.3	CL:"/usr/local/bin/../Cellar/bowtie2/2.3.4.3/bin/bowtie2-align-s --wrapper basic-0 --no-mixed --no-discordant -x bt2_index/sacCer3 --passthrough -1 fastq_files/test_R1.fastq -2 fastq_files/test_R2.fastq"
SRR3649298.1	99	chrXII	166119	42	50M	=	166163	94	NTAAACTGTGCTCGGCCAGTGGAGACGTACAAATTCCACTACTATTCGTA	#<BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	AS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:0T49	YS:i:0	YT:Z:CP
SRR3649298.1	147	chrXII	166163	42	50M	=	166119	-94	TTCGTACTTTGTCTAGTTCAAGGGTCTCCGTCGAGTGAGTCCTACGTTGA	FFFFFFFFFFFFFFFF

### Faster way of running Bowtie2 (use multiple CPUs/threads)

In [15]:
# Get the number of CPU cores
CORES=$(getconf _NPROCESSORS_ONLN)
echo $CORES

12
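
Using `$((CORES-1))` leaves one core free for the rest of the system, but the expression evaluates to 0 on a single-core machine, which is probably not what one wants. A small guard such as the following keeps the value at 1 or more. The variable name `THREADS` is only a suggestion and is not used elsewhere in this notebook.

```bash
# Number of online CPU cores; fall back to 1 if getconf cannot report it
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Use all cores but one, and never fewer than one thread
THREADS=$(( CORES > 1 ? CORES - 1 : 1 ))
echo "Using $THREADS of $CORES cores"
```

`$THREADS` could then be passed to `bowtie2 -p` in place of `$((CORES-1))`.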


In [16]:
# Run bowtie2 on multiple cores to improve speed
bowtie2 --no-mixed --no-discordant --no-unal \
        -p $((CORES-1)) \
        -x bt2_index/sacCer3 \
        -1 fastq_files/test_R1.fastq -2 fastq_files/test_R2.fastq \
        -S bam_files/test.sam 2> bam_files/test.bt2.log

OK, that was much faster. Now let's store the alignments in a more compact way (using a `bam` file instead of a `sam` file). We can convert a `sam` file to the `bam` format using `samtools view`.

In [17]:
# Check the arguments of samtools view
samtools view


Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
           query sequence >= INT [0]
  -f INT   only include reads with a

In [18]:
# Convert the sam file to bam format
samtools view -b -h --threads $((CORES-1)) bam_files/test.sam > bam_files/test.bam

In [19]:
# Let's compare the sizes of the sam and bam files
ls -lh bam_files/

total 94328
-rw-r--r--  1 cherejirv  1360859114   7.0M Nov 22 16:07 test.bam
-rw-r--r--  1 cherejirv  1360859114   246B Nov 22 16:07 test.bt2.log
-rw-r--r--  1 cherejirv  1360859114    39M Nov 22 16:07 test.sam


The `bam` file is about 5 times smaller than the corresponding `sam` file. Because `bam` is a binary format, it cannot be read as regular text. For example,
```
head bam_files/test.bam
```
will output some nonsense. The proper way of viewing a `bam` file is using `samtools view`.
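
BAM files are compressed with BGZF, a gzip-compatible block format, which is why they look like noise in a text viewer. As a small illustration, the sketch below uses an ordinary gzip file as a stand-in (so it runs without samtools) and prints its first two bytes, which form the gzip magic number `1f 8b`. The file name `toy.txt.gz` is made up for this example.

```bash
tmpdir=$(mktemp -d)

# Compress one line of text (the first five fields of an alignment record)
printf 'SRR3649298.1\t99\tchrXII\t166119\t42\n' | gzip -c > "$tmpdir/toy.txt.gz"

# Show the first two bytes of the compressed file in hexadecimal
magic=$(head -c 2 "$tmpdir/toy.txt.gz" | od -An -tx1 | tr -d ' \n')
echo "First two bytes: $magic"

# Decompressing recovers the original text
gunzip -c "$tmpdir/toy.txt.gz"

rm -r "$tmpdir"
```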

In [20]:
# Proper way of listing the alignments from a bam file 
# (use -h to include the header information)
samtools view bam_files/test.bam | head

SRR3649298.1	99	chrXII	166119	42	50M	=	166163	94	NTAAACTGTGCTCGGCCAGTGGAGACGTACAAATTCCACTACTATTCGTA	#<BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	AS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:0T49	YS:i:0	YT:Z:CP
SRR3649298.1	147	chrXII	166163	42	50M	=	166119	-94	TTCGTACTTTGTCTAGTTCAAGGGTCTCCGTCGAGTGAGTCCTACGTTGA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:50	YS:i:-1	YT:Z:CP
SRR3649298.2	83	chrIV	324159	42	50M	=	324054	-155	CCACAGGTTTTAATAGCCATGTTTTTCCAATTTTTAGAACACAGGGAACN	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBB<#	AS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:49A0	YS:i:0	YT:Z:CP
SRR3649298.2	163	chrIV	324054	42	50M	=	324159	155	TGCACTGTCAATAAATCATTGGAACTGAATGCTTTGTTGCATGTAGGGCA	BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:50	YS:i:-1	YT:Z:CP
SRR3649298.3	83	chrVIII	113027	42	50M	=	112907	-170	TCAACATTGATTTTTTTTTGTTTCTTTGTAGTTATTAAATGATTTGAGAN	FFFFFFFFFFBFFFFFFFFFF

In [21]:
# Count the total number of alignments
samtools view bam_files/test.bam | wc -l

  183030


There are 183,030 aligned reads (one per line), corresponding to 91,515 DNA fragments (two reads per pair).

In [22]:
# Let's look again at the bowtie2 log:
cat bam_files/test.bt2.log

100000 reads; of these:
  100000 (100.00%) were paired; of these:
    8485 (8.48%) aligned concordantly 0 times
    76533 (76.53%) aligned concordantly exactly 1 time
    14982 (14.98%) aligned concordantly >1 times
91.52% overall alignment rate


These 91,515 properly aligned pairs are the sum of: 76,533 + 14,982
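
The same arithmetic can be done directly on the report. The sketch below embeds a copy of the log shown above, so it runs without bowtie2; on a real run one would read `bam_files/test.bt2.log` instead. The variable names are illustrative.

```bash
# Copy of the bowtie2 report shown above (bam_files/test.bt2.log)
log=$(cat <<'EOF'
100000 reads; of these:
  100000 (100.00%) were paired; of these:
    8485 (8.48%) aligned concordantly 0 times
    76533 (76.53%) aligned concordantly exactly 1 time
    14982 (14.98%) aligned concordantly >1 times
91.52% overall alignment rate
EOF
)

# The first field of each relevant line is the number of read pairs
unique=$(printf '%s\n' "$log" | awk '/concordantly exactly 1 time/ { print $1 }')
multi=$(printf '%s\n' "$log"  | awk '/concordantly >1 times/ { print $1 }')

aligned=$(( unique + multi ))
echo "Aligned pairs: $aligned"
echo "Alignments expected in the bam file (2 per pair): $(( 2 * aligned ))"
```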

### Explore the bam file in more detail

In [23]:
# Show the first 5 columns of the bam file
samtools view bam_files/test.bam | head | cut -f 1-5

SRR3649298.1	99	chrXII	166119	42
SRR3649298.1	147	chrXII	166163	42
SRR3649298.2	83	chrIV	324159	42
SRR3649298.2	163	chrIV	324054	42
SRR3649298.3	83	chrVIII	113027	42
SRR3649298.3	163	chrVIII	112907	42
SRR3649298.4	99	chrIV	279819	42
SRR3649298.4	147	chrIV	279928	42
SRR3649298.5	99	chrIX	408131	42
SRR3649298.5	147	chrIX	408231	42


The first 5 columns represent the following quantities:  
Column 1: read name (the two reads of a pair share the same name).  
Column 2: sum of multiple bitwise flags. See https://samtools.github.io/hts-specs/SAMv1.pdf for an explanation of different FLAGs. One can easily check the meaning of each FLAG using the following website:
https://broadinstitute.github.io/picard/explain-flags.html  
Column 3: chromosome name.  
Column 4: mapping position of the left end of the read.  
Column 5: mapping quality, i.e. $-10 \log_{10} p$, where $p$ is the probability that the mapping position is wrong.
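
To make columns 2 and 5 more concrete, here is a minimal sketch that decodes a FLAG into its individual bits and converts a mapping quality into the corresponding error probability. The function names `explain_flag` and `mapq_to_prob` are hypothetical helpers written for this notebook; the bit descriptions paraphrase the SAM specification linked above.

```bash
# Print the meaning of each bit that is set in a SAM FLAG value
explain_flag() {
    local flag=$1
    local bits=(1 2 4 8 16 32 64 128 256 512 1024 2048)
    local names=(
        "read paired"
        "read mapped in proper pair"
        "read unmapped"
        "mate unmapped"
        "read on reverse strand"
        "mate on reverse strand"
        "first in pair"
        "second in pair"
        "not primary alignment"
        "read fails quality checks"
        "PCR or optical duplicate"
        "supplementary alignment"
    )
    local i
    for i in "${!bits[@]}"; do
        if (( flag & bits[i] )); then
            printf '%4d  %s\n' "${bits[i]}" "${names[i]}"
        fi
    done
}

# Convert a mapping quality Q into the error probability p = 10^(-Q/10)
mapq_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.2e\n", 10 ^ (-q / 10) }'
}

echo "FLAG 99:";  explain_flag 99
echo "FLAG 147:"; explain_flag 147
echo "MAPQ 42 -> p = $(mapq_to_prob 42)"
```

For example, 99 = 1 + 2 + 32 + 64 and 147 = 1 + 2 + 16 + 128, so the first two lines of the bam file are the first and second read of a properly aligned pair, mapped on opposite strands.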

In [24]:
# Show columns 6 to 9
samtools view bam_files/test.bam | head | cut -f 6-9

50M	=	166163	94
50M	=	166119	-94
50M	=	324054	-155
50M	=	324159	155
50M	=	112907	-170
50M	=	113027	170
50M	=	279928	159
50M	=	279819	-159
50M	=	408231	150
50M	=	408131	-150


These columns represent the following:  
Column 6: CIGAR string  
Column 7: ‘=’ if the paired read was aligned on the same chromosome  
Column 8: position of the paired read  
Column 9: length of the whole DNA fragment whose ends were sequenced (positive for the leftmost read of the pair, negative for the rightmost)
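
For a pair in which both reads align without gaps, column 9 can be recomputed from the other columns: the fragment runs from the position of the leftmost read to the end of its mate. The sketch below checks this on the five pairs listed above, assuming a plain match CIGAR such as `50M` for both reads (it would not hold with indels or clipping). `check_tlen` is a hypothetical helper name.

```bash
# Input: position, CIGAR, mate position, reported fragment length
# (columns 4, 6, 8 and 9 of the leftmost read of each pair)
check_tlen() {
    awk '{
        read_len = $2 + 0                 # "50M" -> 50 (numeric prefix of the CIGAR)
        expected = $3 + read_len - $1     # end of the mate minus start of the read, inclusive
        status = (expected == $4) ? "ok" : "MISMATCH"
        print $1, $3, "reported:", $4, "computed:", expected, status
    }'
}

printf '%s\n' \
    "166119 50M 166163 94" \
    "324054 50M 324159 155" \
    "112907 50M 113027 170" \
    "279819 50M 279928 159" \
    "408131 50M 408231 150" |
check_tlen
```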

In [25]:
# Show column 10: DNA sequence
samtools view bam_files/test.bam | head | cut -f 10

NTAAACTGTGCTCGGCCAGTGGAGACGTACAAATTCCACTACTATTCGTA
TTCGTACTTTGTCTAGTTCAAGGGTCTCCGTCGAGTGAGTCCTACGTTGA
CCACAGGTTTTAATAGCCATGTTTTTCCAATTTTTAGAACACAGGGAACN
TGCACTGTCAATAAATCATTGGAACTGAATGCTTTGTTGCATGTAGGGCA
TCAACATTGATTTTTTTTTGTTTCTTTGTAGTTATTAAATGATTTGAGAN
TGGCTTGAGGAGTGGTCGTACTGTTGGTCCACCTCACTAACGCAATCATT
TCGTTGTGAAATCCATTAAAATTAACACAGGCTCTCAAAAAGGAGGCTTG
TGAAGGCCCTCTTTGTTCAAACTGTCTAGCAAGGTCTGGAGAATTATACA
NTGTGTAATTGAAGATCCAGGATGTTTTCCTTTTCAGGGAGATGAGAAGG
AAAAATTGTGTCCAATGATTAGCATAGAGAGGTAGAGTATCAGAGAAACA


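Bowtie2 does not sort its output by genomic position (note `SO:unsorted` in the header). The alignments roughly follow the order of the reads in the fastq files, with the two reads of each pair listed one after the other.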

In [26]:
# Show the following 3 columns: 1 - read name, 3 - chromosome, 4 - position
samtools view bam_files/test.bam | head | cut -f 1,3,4

SRR3649298.1	chrXII	166119
SRR3649298.1	chrXII	166163
SRR3649298.2	chrIV	324159
SRR3649298.2	chrIV	324054
SRR3649298.3	chrVIII	113027
SRR3649298.3	chrVIII	112907
SRR3649298.4	chrIV	279819
SRR3649298.4	chrIV	279928
SRR3649298.5	chrIX	408131
SRR3649298.5	chrIX	408231


Many tools require the bam file to be sorted such that the alignments occur in “genome order”. That is, ordered positionally based upon their alignment coordinates on each chromosome.
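
To illustrate what "genome order" means, the sketch below sorts a few of the records shown above by chromosome and then numerically by position, using the standard `sort` command on plain text. This is only an illustration: `samtools sort` orders chromosomes as they are listed in the `@SQ` header lines, which for this genome happens to coincide with alphabetical order.

```bash
# Read name, chromosome and position of a few alignments from the unsorted file
sorted=$(
    printf '%s\t%s\t%s\n' \
        SRR3649298.1 chrXII 166119 \
        SRR3649298.2 chrIV  324159 \
        SRR3649298.2 chrIV  324054 \
        SRR3649298.4 chrIV  279819 \
        SRR3649298.5 chrIX  408131 |
    LC_ALL=C sort -t $'\t' -k2,2 -k3,3n    # by chromosome, then by position
)
printf '%s\n' "$sorted"
```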

In [27]:
# We can re-sort the BAM file using 
samtools sort

Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]


In [28]:
samtools sort -o bam_files/test.sorted.bam --threads $((CORES-1)) bam_files/test.bam

[bam_sort_core] merging from 0 files and 11 in-memory blocks...


In [29]:
# Check the sorted bam file (columns 1 - read name, 3 - chromosome, 4 - position)
samtools view bam_files/test.sorted.bam | head | cut -f 1,3,4

SRR3649298.54916	chrI	231
SRR3649298.54916	chrI	293
SRR3649298.36570	chrI	301
SRR3649298.26155	chrI	309
SRR3649298.36570	chrI	409
SRR3649298.26155	chrI	410
SRR3649298.18919	chrI	415
SRR3649298.18919	chrI	433
SRR3649298.60307	chrI	854
SRR3649298.60307	chrI	918


In [30]:
# Indexing a position sorted bam file allows one to quickly extract alignments overlapping 
# particular genomic regions. The indexing is done using
samtools index

Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]
  -@ INT   Sets the number of threads [none]



In [31]:
samtools index -@ $((CORES-1)) bam_files/test.sorted.bam

In [32]:
# Check the sizes (the index file .bai is very small)
ls -lh bam_files/

total 106672
-rw-r--r--  1 cherejirv  1360859114   7.0M Nov 22 16:07 test.bam
-rw-r--r--  1 cherejirv  1360859114   246B Nov 22 16:07 test.bt2.log
-rw-r--r--  1 cherejirv  1360859114    39M Nov 22 16:07 test.sam
-rw-r--r--  1 cherejirv  1360859114   6.0M Nov 22 16:07 test.sorted.bam
-rw-r--r--  1 cherejirv  1360859114   9.5K Nov 22 17:09 test.sorted.bam.bai
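
With the index in place, `samtools view` accepts one or more regions after the file name (see its usage message above). The sketch below counts and lists the alignments overlapping the first 1,000 bp of chrI. The region is an arbitrary choice for illustration, and the guard simply skips the query when samtools or the index is not available.

```bash
REGION="chrI:1-1000"
BAM="bam_files/test.sorted.bam"

if command -v samtools >/dev/null 2>&1 && [ -f "$BAM.bai" ]; then
    # Number of alignments overlapping the region
    samtools view -c "$BAM" "$REGION"
    # Read name, chromosome and position of the first few of them
    samtools view "$BAM" "$REGION" | cut -f 1,3,4 | head
else
    echo "samtools or $BAM.bai not found; skipping the region query" >&2
fi
```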


### Pipes
All the above operations can be run in a more compact way using pipes, which pass the output of one command directly to the next without writing intermediate files (see https://www.geeksforgeeks.org/piping-in-unix-or-linux/ for more info on the pipe operator). Now let's align all the reads, not only the first 100,000.

In [33]:
# Using pipes to perform all these operations in one step
bowtie2 --no-mixed --no-discordant --no-unal \
        -p $((CORES-1)) \
        -x bt2_index/sacCer3 \
        -1 fastq_files/SRR3649298_1.fastq.gz \
        -2 fastq_files/SRR3649298_2.fastq.gz \
        2> bam_files/SRR3649298.bt2.log \
        | samtools view -b -@ $((CORES-1)) - \
        | samtools sort -o bam_files/SRR3649298.sorted.bam -@ $((CORES-1)) -
samtools index -@ $((CORES-1)) bam_files/SRR3649298.sorted.bam

[bam_sort_core] merging from 0 files and 11 in-memory blocks...


In [34]:
ls -lh bam_files/

total 963936
-rw-r--r--  1 cherejirv  1360859114   256B Nov 22 17:15 SRR3649298.bt2.log
-rw-r--r--  1 cherejirv  1360859114   414M Nov 22 17:15 SRR3649298.sorted.bam
-rw-r--r--  1 cherejirv  1360859114    37K Nov 22 17:15 SRR3649298.sorted.bam.bai
-rw-r--r--  1 cherejirv  1360859114   7.0M Nov 22 16:07 test.bam
-rw-r--r--  1 cherejirv  1360859114   246B Nov 22 16:07 test.bt2.log
-rw-r--r--  1 cherejirv  1360859114    39M Nov 22 16:07 test.sam
-rw-r--r--  1 cherejirv  1360859114   6.0M Nov 22 16:07 test.sorted.bam
-rw-r--r--  1 cherejirv  1360859114   9.5K Nov 22 17:09 test.sorted.bam.bai


In [35]:
cat bam_files/SRR3649298.bt2.log

11358904 reads; of these:
  11358904 (100.00%) were paired; of these:
    990760 (8.72%) aligned concordantly 0 times
    8670333 (76.33%) aligned concordantly exactly 1 time
    1697811 (14.95%) aligned concordantly >1 times
91.28% overall alignment rate
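
The pipeline above works because bowtie2 writes the alignments to standard output, which travels through the pipe, while its report goes to standard error, which `2>` redirects to the log file. The toy sketch below reproduces this pattern with standard tools only; `fake_aligner` and `run.log` are made-up names standing in for bowtie2 and its log.

```bash
# Stand-in for an aligner: results go to standard output, the report to standard error
fake_aligner() {
    printf 'chrII\t500\nchrI\t300\nchrI\t100\n'
    echo "3 records written" >&2
}

tmpdir=$(mktemp -d)

# Only standard output travels through the pipe; "2>" sends the report to a log file
sorted=$(fake_aligner 2> "$tmpdir/run.log" | sort -k1,1 -k2,2n)

echo "Sorted output:"
printf '%s\n' "$sorted"

echo "Log file:"
report=$(cat "$tmpdir/run.log")
echo "$report"

rm -r "$tmpdir"
```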


### Analyze the aligned data  
We'll switch now to another Jupyter notebook and analyze the aligned data in R.

### Further reading resources:  
https://samtools.github.io/hts-specs/SAMv1.pdf  
http://quinlanlab.org/tutorials/samtools/samtools.html  
http://biobits.org/samtools_primer.html