# Mapping Post-Processing
Compress and sort the SAM file into BAM format, remove duplicates, and look at mapping statistics.

<hr >

## Current Directory Structure

In [1]:
%%bash
cd ./analysis
ls -1F

assembly/
data/
fastqc-analysis/
mappings/
trimmed/


- data: Raw FASTQ files
- trimmed: Sickle trimmed FASTQ files
- fastqc-analysis: FastQC analysis of raw and trimmed FASTQ files
- assembly: reference genome assembled from the ancestral genome, with Bowtie and BWA index files
- mappings: Bowtie and BWA alignments

<hr >

## Post processing with Samtools 
- Install samtools with conda
- Using samtools version 1.9

### SAM file format:

| Column        | Field         | Description  |
| ------------- |:-------------:|:-----:|
| 1 |QNAME  |Query (pair) NAME  |
| 2 |FLAG   |Bitwise FLAG  |
| 3 |RNAME  |Reference sequence NAME  |
| 4 |POS    |1-based leftmost Position/coordinate of clipped sequence  |
| 5 |MAPQ   |Mapping Quality (Phred-scaled)  |
| 6 |CIGAR  |Extended CIGAR string  |
| 7 |MRNM   |Mate Reference sequence name (‘=’ if same as RNAME)|
| 8 |MPOS   |1-based Mate Position  |
| 9 |ISIZE  |Inferred insert SIZE  |
| 10|SEQ    |Query Sequence on the same strand as the reference  |
| 11|QUAL   |Query Quality (ASCII-33 gives the Phred base quality) |
| 12|OPT    |Variable optional fields in the format TAG:VTYPE:VALUE  |
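
The FLAG and QUAL columns are the least self-explanatory fields in the table, so here is a small sketch of how they can be decoded by hand. The bit values follow the SAM specification; the function names `decode_flag` and `phred_of` and the lowercase labels are made up for this illustration and are not part of samtools.

```bash
# Print one label for every bit that is set in a SAM FLAG value (column 2).
# The labels are illustrative names, not official samtools identifiers.
decode_flag() {
    flag=$1
    for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                 16:reverse 32:mate_reverse 64:read1 128:read2 \
                 256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${entry%%:*}      # number before the colon
        name=${entry#*:}      # label after the colon
        if [ $((flag & bit)) -ne 0 ]; then
            echo "$name"
        fi
    done
}

# Convert one QUAL character (column 11) to its Phred score: ASCII code minus 33.
phred_of() {
    # printf with a leading single quote prints the character's ASCII code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

decode_flag 99   # 99 = 1 + 2 + 32 + 64
phred_of I       # 'I' is ASCII 73
```

For FLAG 99 this prints paired, proper_pair, mate_reverse and read1, and the quality character I corresponds to a Phred score of 40.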

<hr >

## Fix Mates and Compress
- Clean up read-pairing information and flags with samtools fixmate.
- Compress SAM to BAM for efficient storage.
- samtools fixmate expects name-sorted input, so first sort by read name with samtools sort -n.
- Options passed to samtools fixmate:
    - -m: adds ms (mate score) tags, which markdup later uses to select the best reads to keep.
    - -O bam: writes compressed BAM output from fixmate.
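
The steps above can be sketched as a single pipeline. This is a minimal sketch, not necessarily the exact command used to produce the files in this notebook: the wrapper function `fixmate_to_bam` is a hypothetical helper, and the input path `evolved-6.sam` is an assumption, since only the output name `evolved-6.fixmate.bam` appears here.

```bash
# Name-sort a SAM file and pipe it to fixmate, writing a compressed BAM.
# Arguments: $1 = input SAM, $2 = output BAM
fixmate_to_bam() {
    # sort -n: sort by read name; -O sam: keep the intermediate stream uncompressed
    # fixmate -m: add mate score tags; "-" reads the sorted stream from stdin
    samtools sort -n -O sam "$1" | samtools fixmate -m -O bam - "$2"
}

# Assumed input path; adjust to wherever the BWA output was written.
in_sam=analysis/mappings/bwa/evolved-6.sam
out_bam=analysis/mappings/bwa/evolved-6.fixmate.bam

# Only run when samtools and the input file are actually available.
if command -v samtools >/dev/null 2>&1 && [ -f "$in_sam" ]; then
    fixmate_to_bam "$in_sam" "$out_bam"
else
    echo "samtools or $in_sam not found; skipping" >&2
fi
```

Piping the name-sorted output straight into fixmate avoids writing an intermediate name-sorted file to disk.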

#### Look at header of BAM file (Sorted by Query Name):

In [3]:
%%bash
samtools view -h analysis/mappings/bwa/evolved-6.fixmate.bam | head

@HD	VN:1.6	SO:queryname
@SQ	SN:NODE_1_length_1394677_cov_15.3771	LN:1394677
@SQ	SN:NODE_2_length_1051867_cov_15.4779	LN:1051867
@SQ	SN:NODE_3_length_950567_cov_15.4139	LN:950567
@SQ	SN:NODE_4_length_925223_cov_15.3905	LN:925223
@SQ	SN:NODE_5_length_916389_cov_15.4457	LN:916389
@SQ	SN:NODE_6_length_772252_cov_15.4454	LN:772252
@SQ	SN:NODE_7_length_506590_cov_15.6969	LN:506590
@SQ	SN:NODE_8_length_473386_cov_15.0601	LN:473386
@SQ	SN:NODE_9_length_438517_cov_15.3909	LN:438517


## Sort the BAM file by coordinate order
- samtools sort orders reads by coordinate by default (when -n is not given)
- -O bam: writes the output in BAM format
- -o: sets the name of the output file
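
A possible invocation using these options is sketched below. The two file names are the ones inspected elsewhere in this notebook; the wrapper function `sort_by_coordinate` is a hypothetical helper added only to keep the example reusable.

```bash
# Sort a BAM file by reference coordinate (the default order for samtools sort).
# Arguments: $1 = input BAM, $2 = output BAM
sort_by_coordinate() {
    samtools sort -O bam -o "$2" "$1"
}

in_bam=analysis/mappings/bwa/evolved-6.fixmate.bam
out_bam=analysis/mappings/bwa/evolved-6.sorted.bam

# Only run when samtools and the input file are actually available.
if command -v samtools >/dev/null 2>&1 && [ -f "$in_bam" ]; then
    sort_by_coordinate "$in_bam" "$out_bam"
else
    echo "samtools or $in_bam not found; skipping" >&2
fi
```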

#### Look at header of BAM file (Sorted by Coordinate):

In [4]:
%%bash
samtools view -h analysis/mappings/bwa/evolved-6.sorted.bam | head

@HD	VN:1.6	SO:coordinate
@SQ	SN:NODE_1_length_1394677_cov_15.3771	LN:1394677
@SQ	SN:NODE_2_length_1051867_cov_15.4779	LN:1051867
@SQ	SN:NODE_3_length_950567_cov_15.4139	LN:950567
@SQ	SN:NODE_4_length_925223_cov_15.3905	LN:925223
@SQ	SN:NODE_5_length_916389_cov_15.4457	LN:916389
@SQ	SN:NODE_6_length_772252_cov_15.4454	LN:772252
@SQ	SN:NODE_7_length_506590_cov_15.6969	LN:506590
@SQ	SN:NODE_8_length_473386_cov_15.0601	LN:473386
@SQ	SN:NODE_9_length_438517_cov_15.3909	LN:438517
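
The only difference between the two headers is the SO field on the @HD line. As a quick check, that field can be pulled out on its own. The helper `sort_order` below is a hypothetical name for this sketch; it reads header text on stdin, so it can be tried without a BAM file.

```bash
# Print the SO (sort order) value from the @HD line of a SAM header read on stdin.
sort_order() {
    awk -F'\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SO:/) print substr($i, 4)   # drop the "SO:" prefix
    }'
}

# Demonstration on a literal header line
printf '@HD\tVN:1.6\tSO:coordinate\n' | sort_order
```

With samtools available, the same check can be run on a real file with `samtools view -H analysis/mappings/bwa/evolved-6.sorted.bam | sort_order`.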


<hr >