## RNAseq notebook 3.2: SAM files and read counting
This notebook shows how to work with a Sequence Alignment/Map (SAM) file and how to derive gene counts from mapped reads.  
**Notes**
- Full SAM files tend to be large, so manipulating them in bash can take some time (typically minutes).
- Not all SAM attributes are present in every SAM file.
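
To see which optional attributes a given SAM file actually carries, one option is to tally the tag names found from column 12 onward. The following is a minimal sketch on a toy file with made-up reads (`r1`, `r2`, and the file name `toy.sam` are hypothetical); add `%%bash` as the first line to run it in a notebook cell.

```bash
# Toy SAM file (hypothetical reads) so the example runs anywhere.
tmpdir=$(mktemp -d)
toy_sam="$tmpdir/toy.sam"
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$toy_sam"
printf 'r1\t0\t20\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tNH:i:1\tYT:Z:UU\n' >> "$toy_sam"
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\tYT:Z:UU\n' >> "$toy_sam"

# Columns 12 and up hold optional TAG:TYPE:VALUE entries; tally the tag names.
tag_counts=$(awk -F'\t' '!/^@/ { for (i = 12; i <= NF; i++) { split($i, t, ":"); n[t[1]]++ } }
                         END   { for (k in n) print k, n[k] }' "$toy_sam" | sort)
echo "$tag_counts"
rm -r "$tmpdir"
```

On the toy file, `YT` appears on both reads while `NM` and `NH` appear only on the aligned one, which is the kind of variation the note above refers to.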

**Find where a specific read aligned**  
Example: 1000th read

```bash
%%bash  
cd ../Downloads/hisat_out  
# Anchor the match to the read-name column; an unescaped "." matches any character
grep '^SRR5454079\.1000[[:space:]]' ./SRR5454079.sam | head -n 1
```

In [6]:
%%bash  
cd ~/downloads/hisat_out  
grep SRR5454079.1 ./SRR5454079.sam | head -n 1

@PG	ID:hisat2	PN:hisat2	VN:2.1.0	CL:"/home/yuwang/downloads/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -x ./downloads/HISAT_indices/grch38/genome -S ./downloads/hisat_out/SRR5454079.sam -U ./SRR5454079_1.fastq"


grep: write error: Broken pipe
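
The hit recorded above is the `@PG` header line, not a read: in a regular expression an unescaped `.` matches any character, so `SRR5454079.1` also matches `SRR5454079_1.fastq` inside the header's command line. The `Broken pipe` message appears because `head` closes the pipe after one line while `grep` is still writing. A minimal sketch of a stricter lookup, run here on a toy file with hypothetical contents, compares the first column exactly:

```bash
# Toy SAM file (hypothetical contents) with a header that mentions the FASTQ name.
tmpdir=$(mktemp -d)
toy_sam="$tmpdir/toy.sam"
printf '@PG\tID:hisat2\tCL:"hisat2 -U ./SRR5454079_1.fastq"\n' > "$toy_sam"
printf 'SRR5454079.1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> "$toy_sam"
printf 'SRR5454079.10\t0\t20\t500\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$toy_sam"

# Loose pattern: the first hit is the header line, not a read.
loose_hit=$(grep 'SRR5454079.1' "$toy_sam" | head -n 1 | cut -f 1)

# Exact comparison on column 1; "exit" stops reading after the first match,
# so no pipe to head is needed to limit the output.
exact_hit=$(awk -F'\t' '$1 == "SRR5454079.1" { print; exit }' "$toy_sam" | cut -f 1,2)

echo "loose: $loose_hit"
echo "exact: $exact_hit"
rm -r "$tmpdir"
```

Because `awk` stops at the first match, this form also avoids scanning the rest of a large file.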


**Print the first few read records (header lines skipped)**
```bash
%%bash
cd ../Downloads/hisat_out
awk '/^SRR/' SRR5454079.sam | head
```

In [7]:
%%bash
cd ~/downloads/hisat_out
awk '/^SRR/' SRR5454079.sam | head

SRR5454079.1	4	*	0	0	*	*	0	0	NTCTTTCAGGTTTAGTTAGACGTCCTCCAAAAAGAGGCCANAANTCACC	#AAAFFJJJJJFAF-FAFAJJJ7JJFJJJJJJJJJJ<FJJ#JJ#JJJJJ	YT:Z:UU
SRR5454079.2	4	*	0	0	*	*	0	0	NTGCGCGTGCAGCCCCGGACATCTAAGGGCATCACAGACCNGTNATTGNT	#AAAFJJJJJJJJJJJJJJJJJFJJJJJFJJFJJJJJJJJ#JJ#JJJJ#J	YT:Z:UU
SRR5454079.3	4	*	0	0	*	*	0	0	NAAGATAATTGCTTTGGTCATCTGTAAGTCACTTTAGCCANTGNGTCTNC	#AAFFJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJ#JJ#JJJJ#J	YT:Z:UU
SRR5454079.4	4	*	0	0	*	*	0	0	NTGGATTGCCTGAGGTCAGGAATTCGAGGCCAGTCTGGCCNACNTGATN	#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#JJJJ#	YT:Z:UU
SRR5454079.5	4	*	0	0	*	*	0	0	NGGCAATGCAAACAGCAATCCTACATAATGTAGAATAATTNTTNTTCTNT	#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#JJJJ#J	YT:Z:UU
SRR5454079.6	4	*	0	0	*	*	0	0	NTCCGGATGCGTTGCTCATTTGTCATTTTCATAGGCAGCTNGANTCTTNC	#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#JJJJ#J	YT:Z:UU
SRR5454079.7	4	*	0	0	*	*	0	0	NAATAATATAAAACAGAAAGCTGAACACAACATGTGTGTGNGTNTGTG	#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ#JJ#JJJJ	YT:Z:UU
SRR5454079.8	4	*	0	0	*	*	0	0	NCTATA
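
Every record in the output above has FLAG `4` and `*` as the reference name, which in the SAM format means the read is unmapped (bit 0x4 is set). To list only reads that did align, one approach is to test that bit. The sketch below uses integer arithmetic so it should work in any awk; the toy reads and file name are hypothetical.

```bash
# Toy SAM file (hypothetical reads): one unmapped, two mapped.
tmpdir=$(mktemp -d)
toy_sam="$tmpdir/toy.sam"
printf '@HD\tVN:1.0\n' > "$toy_sam"
printf 'r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> "$toy_sam"        # unmapped
printf 'r2\t0\t1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$toy_sam"    # mapped, forward strand
printf 'r3\t16\tMT\t250\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$toy_sam"  # mapped, reverse strand

# int(FLAG / 4) % 2 is the value of bit 0x4 (1 = unmapped, 0 = mapped).
# Print read name, reference name, and position for mapped reads only.
mapped=$(awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 0 { print $1, $3, $4 }' "$toy_sam" | head)
echo "$mapped"
rm -r "$tmpdir"
```

Testing the bit rather than comparing `$2 != 4` keeps the filter correct when other FLAG bits are set alongside it.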

**Print the first few reads that aligned to the mitochondrial genome**  
```bash
%%bash  
cd ../Downloads/hisat_out  
awk '{if ($3 == "MT") {print}}' SRR5454079.sam | head
```

**Exercises**   
1) How many reads mapped to chromosome 20?  
2) Find the 76th read in the SAM file. Where did it map in the human genome? Now use blastn to map the read. Do the results agree with each other?  
3) Inspect the reference genome details in the SAM header. Beyond chromosomes, what else is included in the reference?

**Check how many reads were not counted because of multimapping (alignment not unique)**
```bash
%%bash
cd Unit2-RNAseq/data
tail SRR5454102_genecounts.txt
```
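
Assuming the gene count file was produced by htseq-count (an assumption based on the "alignment not unique" wording), its last lines are special counters whose names begin with `__`. Rather than reading the number off the `tail` output, the counter can be selected by name. This is a sketch on a toy file; all counts below are made up.

```bash
# Toy count table in the assumed htseq-count layout (counts are hypothetical).
tmpdir=$(mktemp -d)
toy_counts="$tmpdir/toy_genecounts.txt"
printf 'ENSG00000000003\t12\n'        >  "$toy_counts"
printf 'ENSG00000000005\t0\n'         >> "$toy_counts"
printf '__no_feature\t40\n'           >> "$toy_counts"
printf '__ambiguous\t3\n'             >> "$toy_counts"
printf '__too_low_aQual\t0\n'         >> "$toy_counts"
printf '__not_aligned\t7\n'           >> "$toy_counts"
printf '__alignment_not_unique\t25\n' >> "$toy_counts"

# Print only the count for the multimapping category.
multimapped=$(awk -F'\t' '$1 == "__alignment_not_unique" { print $2 }' "$toy_counts")
echo "$multimapped"
rm -r "$tmpdir"
```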

**Check how many Ensembl genes have zero expression**  
Spot and correct the mistake
```bash
%%bash
cd Unit2-RNAseq/data
awk '{if ($2 == 0) print }' SRR5454102_genecounts.txt  | wc -l
```

**HOMEWORK**  
1) Use awk to check the number of columns in the SAM file for all rows and print only the unique column counts. *HINT*: revisit Unit 1   
2) Count how many reads from SRR5454079 mapped to chromosome 20 with 2 soft-clipped bases at the start of the read. *HINT*: Consult the SAM documentation on CIGAR strings.  
3) Using the Ensembl human transcriptome annotation, calculate counts per gene in the BAM file for SRR5454079 and print the first 10 lines (use `-s reverse`)