## Analysis of Leishmania donovani single and paired end Illumina reads

## Convert SRA files to FASTQ


**Files for this session** are saved at /BGA2017/Course_material/June13/  

The following command links those files to your directory for use with this worksheet.


In [None]:
# establish links to all files for the session
!ln -s /BGA2017/Course_material/June13/* .

* Run **fastq-dump** on the .sra file.  
    The **--gzip** option compresses the output with gzip to save space.  
    The **--split-files** option writes two files for paired end reads, with suffixes _1 and _2.  
    
    This produces .fastq.gz files with the same base name as the .sra file.

In [None]:
#single read file
#time: 10+ minutes
#creates file SRR1254937.fastq.gz
#!fastq-dump --gzip  SRR1254937.sra


In [None]:
#paired end reads file
#time: 10+ minutes
#creates files SRR1254938_1.fastq.gz and SRR1254938_2.fastq.gz
#!fastq-dump --gzip --split-files SRR1254938.sra


In [None]:
#check that new files exist
!ls -Lltr SRR*

* *For our demonstration*, we use **gunzip** to capture the first 10,000 fastq reads from each gzip file into separate (much smaller) files.  

    * Note that the **-c** option writes the decompressed data to standard output and leaves the original file compressed. Also, gunzip looks for a file with the .gz suffix, so the suffix does not have to be included in the file name.  
    
    * **|** is the pipe symbol, which carries the output of one operation (gunzip) into the input of the next operation (head).  
    
    * **head -40000** puts the first 40,000 lines into the new file (each read is 4 lines). 

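 **Ignore "Broken pipe" errors** if they occur. They appear because head exits after reading its 40,000 lines while gunzip is still writing, and they do not affect the output files. Use **ls** below to check that the files were created.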


In [None]:
#single read file, first 10,000 reads
!gunzip -c SRR1254937.fastq | head -40000 > SRR1254937.first10000.fastq

#paired end reads files, first 10,000 reads from each
!gunzip -c SRR1254938_1.fastq | head -40000 > SRR1254938_1.first10000.fastq

!gunzip -c SRR1254938_2.fastq | head -40000 > SRR1254938_2.first10000.fastq
#ignore Broken pipe errors.

#check that new files exist
!ls -Lltr SRR*
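
The same subsetting can be sketched in Python, which avoids the pipe and therefore the "Broken pipe" message. This is only an illustration: the function name `first_n_reads` is made up for this worksheet, and it assumes well-formed FASTQ with exactly 4 lines per read.

```python
import gzip
from itertools import islice


def first_n_reads(src_gz, dst, n_reads=10000):
    """Copy the first n_reads FASTQ records from a gzip file to a plain file.

    Assumes every record is exactly 4 lines (header, sequence, '+', quality).
    """
    with gzip.open(src_gz, "rt") as fin, open(dst, "w") as fout:
        # islice stops reading after 4 * n_reads lines
        fout.writelines(islice(fin, 4 * n_reads))


# Example call (file names follow the worksheet; uncomment to run):
# first_n_reads("SRR1254937.fastq.gz", "SRR1254937.first10000.fastq")
```

Unlike gunzip, `gzip.open` does not add the .gz suffix for you, so the full file name is needed here.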

## Pipeline for Read Mapping and Visualization


*  First **copy the genome fasta file** and give the copy a shorter name to simplify commands.

In [None]:
#create a copy of the genome file with a shorter name
!cp TriTrypDB-32_LdonovaniBPK282A1_Genome.fasta LdBPK282A1.fa

#check that new files exist
!ls -ltr *.fa

* **Create the BWA Index** for the Leishmania donovani BPK282A1 reference genome  

    For a small genome < 2 Gbases use the "**is**" argument.  
    For example, Leishmania donovani BPK282A1, which is only partially reconstructed and contains 36 contigs, is 32,444,968 nucleotides long.  


In [None]:
#time: up to 30 seconds
!bwa index -a is LdBPK282A1.fa

#check that new files exist
!ls -ltr LdBPK282A1.*
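
Since the choice of index algorithm depends on genome size, it can help to check the number of sequences and the total length of the fasta file yourself. The following is a minimal sketch in plain Python; the helper name `fasta_stats` is hypothetical and not part of bwa or samtools.

```python
def fasta_stats(path):
    """Return (number of sequences, total number of bases) for a FASTA file."""
    n_seqs = 0
    n_bases = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # each '>' header line starts a new sequence (contig/chromosome)
                n_seqs += 1
            else:
                # sequence lines: count characters without the newline
                n_bases += len(line.strip())
    return n_seqs, n_bases


# Example call (uncomment to run on the worksheet genome):
# print(fasta_stats("LdBPK282A1.fa"))
```

A total well below 2 Gbases would be consistent with using the "is" algorithm as described above.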

* Use **BWA aln** to map the read data to the *Leishmania* genome  

    **Parameters** are as follows:  
    * **-n**	maximum number of differences allowed between the read and the mapped location. An integer is taken as the maximum edit distance; a fraction such as 0.04 (the default) lets bwa choose the maximum edit distance according to the read length  
    * **-o**	maximum number of gap opens allowed (default is 1; the commands below use 2)  
    * **-e**	maximum number of gap extensions allowed (the default, -1, disallows long gaps)  
    * **-t** 	number of threads (perhaps set to number of processor cores)  
    * **files**	
        * **LdBPK282A1.fa** – reference fasta file  
        * **SRR1254937.first10000.fastq** – read data file in fastq format (note the fastq input file could be compressed, for example data.fastq.gz)
        * **outSRR1254937.sai** – output file (sai means **suffix array indices**).



In [None]:
#these steps are fast because the files only contain 10,000 reads each

#align single reads file
!bwa aln -n 0.04 -o 2 -e -1 -t 7 LdBPK282A1.fa SRR1254937.first10000.fastq > outSRR1254937.sai

#align paired end reads file, first mates (_1)
!bwa aln -n 0.04 -o 2 -e -1 -t 7 LdBPK282A1.fa SRR1254938_1.first10000.fastq > outSRR1254938_1.sai

#align paired end reads file, second mates (_2)
!bwa aln -n 0.04 -o 2 -e -1 -t 7 LdBPK282A1.fa SRR1254938_2.first10000.fastq > outSRR1254938_2.sai

#check that new files exist
!ls -ltr *.sai

* Produce the **SAM file output**
    * For **single reads** (not paired end), use **bwa samse**
    * For **paired end reads**, use **bwa sampe**
    
  **Parameters** are as follows:  
     * **-n** maximum number of best alignments to report (default is 3)  
     * **files**
         * **LdBPK282A1.fa** – reference fasta file 
         * **outSRR1254937.sai** – file with the suffix array indices created by the bwa aln command
         * **SRR1254937.first10000.fastq** – read data file in fastq format (Note this file could be compressed, for example data.fastq.gz)
         * **alignmentsSRR1254937.sam** – SAM output file (this name is arbitrary)

  For sampe, give both .sai files followed by both .fastq files, keeping the two mates in the same order.


In [None]:
#Single reads
!bwa samse -n 3 LdBPK282A1.fa outSRR1254937.sai SRR1254937.first10000.fastq > alignmentsSRR1254937.sam

#Paired end reads
!bwa sampe -n 3 LdBPK282A1.fa outSRR1254938_1.sai outSRR1254938_2.sai SRR1254938_1.first10000.fastq SRR1254938_2.first10000.fastq > alignmentsSRR1254938.sam

#check that new files exist
!ls -ltr *.sam

## Samtools file format

**Compare the output for one line of the sam file to the format tables below.**

In [None]:
#prints line 1 (header) and line 40 (data)
!awk '(NR==1){print "Header line with chromosome name and length:\n"$0"\n";exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print "Data line with the following fields:\n"$0"\n";exit}' alignmentsSRR1254937.sam

#names individual fields
!awk '(NR==40){print " 1 Query name:         "$1" (read name)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 2 Flag:               "$2"           (indicates reverse complement)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 3 Reference name:     "$3"   (chromosome LD23)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 4 Pos:                "$4"          (mapped position)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 5 MapQ:               "$5"           (range is ???-???)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 6 Cigar:              "$6"          (indicates 36 matches)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 7 RNext:              "$7"            (no mate pair)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 8 PNext:              "$8"            (no mate pair)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print " 9 Tlen:               "$9"            (no mate pair)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print "10 Seq:                "$10" (read sequence)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print "11 Qual:               "$11" (read quality per position)"; exit}' alignmentsSRR1254937.sam
!awk '(NR==40){print "12 Optional Fields:    "$12,$13,$14,$15,$16,$17,$18,$19; exit}' alignmentsSRR1254937.sam
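
The same field breakdown can be sketched in Python. This is an illustration only: the function name `parse_sam_line` and the example line are made up for this worksheet (the example is not taken from the real output). It names the 11 mandatory columns, tests two flag bits, and splits the CIGAR string into (length, operation) pairs.

```python
import re

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of named fields."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        rec[key] = int(rec[key])
    rec["tags"] = cols[11:]                        # optional fields, column 12 onward
    rec["is_unmapped"] = bool(rec["flag"] & 0x4)   # flag bit 4: read is unmapped
    rec["is_reverse"] = bool(rec["flag"] & 0x10)   # flag bit 16: reverse complemented
    rec["cigar_ops"] = [(int(n), op) for n, op in
                        re.findall(r"(\d+)([MIDNSHP=X])", rec["cigar"])]
    return rec


# Made-up example line: a 36 base read mapped to the reverse strand
example = "\t".join(["read1", "16", "Ld23_v01s1", "1000", "37", "36M",
                     "*", "0", "0", "A" * 36, "I" * 36, "XT:A:U", "NM:i:0"])
rec = parse_sam_line(example)
print(rec["is_reverse"], rec["cigar_ops"])  # True [(36, 'M')]
```

A flag of 16 sets only the reverse complement bit, and a CIGAR of 36M means 36 aligned positions with no gaps.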


### Samtools format

![title](samtoolsp1.png)
![title](samtoolsp2.png)


* **Convert SAM to BAM** using the samtools **view** operation. BAM is a compressed version of SAM.

    **Parameters** are as follows:
    * **-b** output is in .bam format
    * **-S** input is in .sam format (recent samtools versions detect the input format automatically and ignore this option). The SAM file contains header lines which give the length of each chromosome as above, for example: @SQ     SN:Ld35_v01s1   LN:2113966

    * **files** 
        * **alignmentsSRR1254937.sam** – input single end SAM alignment file
        * **alignmentsSRR1254937.bam** – output single end BAM alignment file  
        
        * **alignmentsSRR1254938.sam** - input paired end SAM alignment file
        * **alignmentsSRR1254938.bam** - output paired end BAM alignment file



In [None]:
# convert single read sam file to bam file
!samtools view -bS alignmentsSRR1254937.sam > alignmentsSRR1254937.bam

# convert paired end read sam file to bam file
!samtools view -bS alignmentsSRR1254938.sam > alignmentsSRR1254938.bam

# check that new files were created
!ls -ltr *.bam
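
To see that BAM really is compressed data, one can peek at the start of the file. A BAM file is stored in BGZF, a gzip-compatible block format, and its decompressed content begins with the four bytes `BAM\1`. The sketch below relies on that; the helper name `looks_like_bam` is hypothetical, and the check is only a quick sanity test, not a full validation.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file decompresses as gzip/BGZF and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # not gzip-compatible data, for example a plain-text SAM file
        return False


# Example calls (uncomment to run on the worksheet files):
# print(looks_like_bam("alignmentsSRR1254937.bam"))
# print(looks_like_bam("alignmentsSRR1254937.sam"))
```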

* **Sort the BAM file** using the samtools **sort** function

    This command sorts the reads by chromosome and position.

    **Note that with the -o option used below, the output file name is given in full, including the .bam suffix.** (Older samtools versions, 0.1.x, instead took an output prefix and appended .bam themselves.)



In [None]:
#sort the single end read alignment file
!samtools sort alignmentsSRR1254937.bam -o alignmentsSRR1254937.sorted.bam

#sort the paired end read alignment file
!samtools sort alignmentsSRR1254938.bam -o alignmentsSRR1254938.sorted.bam

# check that new files were created
!ls -ltr *.bam

* **Index the BAM file** using the samtools **index** function

    This command makes an index for fast lookup of the reads by chromosome and position.  
    
    **It produces a .bai index file that is used to quickly locate the reads, for example, by the IGV viewer**.


In [None]:
!ls -ltr *.bam

#index the single end read sorted bam file
!samtools index alignmentsSRR1254937.sorted.bam

#index the paired end read sorted bam file
!samtools index alignmentsSRR1254938.sorted.bam

# check that new files were created
!ls -ltr *.bai

* **Check the number of reads mapped** using the samtools **idxstats** function

    **Output is chromosome name, length, number of reads mapped, and number of reads not mapped.**

In [None]:
#print headers for the output
!echo "Single end reads"
!echo "chromosome      length  mapped  not_mapped"

#check reads mapped in the single end read sorted bam file
!samtools idxstats alignmentsSRR1254937.sorted.bam

#print headers for the output
!printf "\nPaired end reads\n"
!echo "chromosome      length  mapped  not_mapped"

#check reads mapped in the paired end read sorted bam file
!samtools idxstats alignmentsSRR1254938.sorted.bam
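
The idxstats table can also be parsed in Python to find the chromosome with the most mapped reads. This is a minimal sketch: the function names are hypothetical, and the counts in `sample` (as well as the Ld31_v01s1 length) are made up for illustration, not real results. In a notebook the real text could be captured with something like `out = !samtools idxstats alignmentsSRR1254937.sorted.bam` and joined with newlines.

```python
def parse_idxstats(text):
    """Parse idxstats output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def most_mapped(rows):
    """Return the row with the highest mapped count, skipping the '*' line."""
    return max((r for r in rows if r[0] != "*"), key=lambda r: r[2])


# Made-up example in the four-column idxstats layout
sample = ("Ld31_v01s1\t1500000\t900\t0\n"
          "Ld35_v01s1\t2113966\t400\t0\n"
          "*\t0\t0\t150\n")
best = most_mapped(parse_idxstats(sample))
print(best[0], best[2])
```

The final `*` line in idxstats output collects reads that are not assigned to any reference sequence, which is why it is skipped when looking for the best chromosome.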

* **Extract alignments** for a particular region from BAM/SAM using the samtools **view** function

    **Parameters** are as follows:
    * **-b** output is in .bam format
    * **file** - use sorted bam alignment file
    * **region** - here we use chromosome Ld31_v01s1 which has the most reads mapped
 
 
* **Index again**


* **Check number of reads mapped**

In [None]:
#extract alignments from the single end read sorted bam file
!samtools view -b alignmentsSRR1254937.sorted.bam Ld31_v01s1 > alignmentsSRR1254937.Ld31_v01s1.sorted.bam

#extract alignments from the paired end read sorted bam file
!samtools view -b alignmentsSRR1254938.sorted.bam Ld31_v01s1 > alignmentsSRR1254938.Ld31_v01s1.sorted.bam

#check that new files were created
!ls -ltr *.Ld31*

#index new files
#single end
!samtools index alignmentsSRR1254937.Ld31_v01s1.sorted.bam

#paired end
!samtools index alignmentsSRR1254938.Ld31_v01s1.sorted.bam

#check again the number of reads mapped 

#print headers for the output
!echo "Single end reads"
!echo "chromosome      length  mapped  not_mapped"

#single end
!samtools idxstats alignmentsSRR1254937.Ld31_v01s1.sorted.bam

#print headers for the output
!printf "\nPaired end reads\n"
!echo "chromosome      length  mapped  not_mapped"

#paired end
!samtools idxstats alignmentsSRR1254938.Ld31_v01s1.sorted.bam

* **View mapped reads in the Integrative Genomics Viewer (IGV)**

    * **Start IGV** by clicking on the webstart file or icon.   
    Use of IGV may require resetting security parameters for Java.  
    IGV shows mapped sequencing reads (and other data) along a reference genome. 
    
    * **Orientation:**
        1. **Load reference genome:** In menu select Genomes -> Load Genome from File
            * use LdBPK282A1.fa
        2. **Choose chromosome:** In second text box from left at top choose **Ld31_v01s1**
        3. **Load data:** in menu select File->Load from File
            * choose alignmentsSRR1254937.Ld31_v01s1.sorted.bam
        4. **Choose an interval**
            * click on chromosome image
            * use zoom slider bar on upper right
            * use the Go text box in upper middle 
                * use Ld31_v01s1:833,479-862,052 
            * drag view left and right with mouse
        5. Sequence is visible only at higher zoom
        6. Mouse cursor on gray read bars shows read information
        7. Scroll to 843,620 to see a probable SNP relative to the reference: 
            * T substituted in place of G.
        8. For a paired alignment track, in the menu choose 
            * View as Pairs
            * Group alignment by none
            * Sort alignment by start position
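
The interval typed into the Go box uses the form chromosome:start-end with commas as thousands separators. The sketch below splits such a string into its parts and rebuilds it without commas, the form used for the region argument of samtools view earlier in the worksheet. The function name `parse_region` is hypothetical.

```python
def parse_region(region):
    """Split 'chrom:start-end' (commas allowed in the numbers) into its parts."""
    chrom, _, span = region.rpartition(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, start, end


chrom, start, end = parse_region("Ld31_v01s1:833,479-862,052")
print(chrom, start, end)            # Ld31_v01s1 833479 862052
print(end - start + 1)              # 28574 bases in the interval
print(f"{chrom}:{start}-{end}")     # Ld31_v01s1:833479-862052
```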
