[NOTE] This tutorial is a reproduction of [the article](http://quinlanlab.org/tutorials/samtools/samtools.html) developed by the Quinlan Lab, via [Sandbox.bio](https://sandbox.bio/tutorials?id=samtools-intro).

### Download sample SAM data

Download a gzip-compressed sample SAM file, decompress it, and save it as `sample.sam`.

In [2]:
curl https://s3.amazonaws.com/samtools-tutorial/sample.sam.gz | gzip -d > sample.sam

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  371M  100  371M    0     0  17.0M      0  0:00:21  0:00:21 --:--:-- 13.5M


### Convert: SAM $\to$ BAM

SAM is a plain-text format, while BAM is its binary counterpart, which is easier for programs to process. Many tools support only the BAM format, so we need to convert the file.

Here are the first 5 lines of the `sample.sam` file we just downloaded.

In [6]:
head -n 5 sample.sam

@HD	VN:1.5	GO:none	SO:coordinate
@SQ	SN:1	LN:249250621
@SQ	SN:2	LN:243199373
@SQ	SN:3	LN:198022430
@SQ	SN:4	LN:191154276
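The header lines above are tab-separated: each `@SQ` line names one reference sequence (`SN`) and gives its length (`LN`). As a minimal sketch of how that header can be read, the snippet below recreates a few of those lines in a scratch file and lists the sequences. The file name `header_demo.sam` and the function `list_references` are hypothetical names used only for illustration.

```shell
# Recreate some of the header lines shown above in a scratch file
# (header_demo.sam is a hypothetical name used only for this sketch).
printf '@HD\tVN:1.5\tGO:none\tSO:coordinate\n'  > header_demo.sam
printf '@SQ\tSN:1\tLN:249250621\n'             >> header_demo.sam
printf '@SQ\tSN:2\tLN:243199373\n'             >> header_demo.sam

# Print "name<TAB>length" for every @SQ (reference sequence) line.
list_references() {
  awk -F'\t' '$1 == "@SQ" {
    name = ""; len = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)   # strip the "SN:" prefix
      if ($i ~ /^LN:/) len  = substr($i, 4)   # strip the "LN:" prefix
    }
    print name "\t" len
  }' "$1"
}

list_references header_demo.sam
```

The same function should work on the real file (`list_references sample.sam`), although it scans every line. To print only the header, `samtools view -H sample.sam` is the more direct route.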


The `-b` option, or `--bam`, sets BAM as the output file format. `samtools view` writes the data to standard output, and `>` redirects it to a new file, `sample.bam`.

In [1]:
samtools view -b sample.sam > sample.bam

`sample.bam` is a binary file, so the `head` command prints only unreadable bytes.

In [13]:
head -n 2 sample.bam

     � BC �W]�\W?�5if+UAAA�">X�̜��u��f7Yw6��&mj�xw���d��;Ν�dk
RP��B�i���I��b�j�>��[��/-R��{�s��(�>$������?_��j�yB+�^�\�^&5��q~9�#Q�=�܉�qw������.(T����ͩ�]JJ3�Ι��$J��)PX�v����sꤨ�9q�[�)�PƮ�P/EN��E4?Em��6T(�PB	��*p


You can use the `file` command to check the file type instead.

In [14]:
file sample.bam

sample.bam: gzip compressed data, extra field
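`file` reports gzip data because BAM files are stored in BGZF, a gzip-compatible block format. The sketch below checks for the two gzip magic bytes (`1f 8b`) by hand. The helper `is_gzip` and the scratch file `demo.txt.gz` are hypothetical names for this illustration, and the `sample.bam` part only runs if that file exists in the current directory.

```shell
# Succeed if the file starts with the gzip magic bytes 1f 8b.
is_gzip() {
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Demo on a throwaway gzip file (demo.txt.gz is a hypothetical name).
printf 'hello\n' | gzip -c > demo.txt.gz
is_gzip demo.txt.gz && echo "demo.txt.gz: gzip container"

# If sample.bam from the conversion step exists, check it the same way,
# then peek at the first 3 decompressed bytes, which should read "BAM".
if [ -f sample.bam ]; then
  is_gzip sample.bam && echo "sample.bam: gzip container"
  gzip -dc sample.bam 2>/dev/null | head -c 3
  echo
fi
```

Because the container is gzip-compatible, standard tools such as `gzip -dc` can decompress it, but the decompressed stream is still binary BAM rather than SAM text.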


### Sorting and indexing a BAM file

`igv` is a standard tool for visualizing BAM files, but it requires the BAM file to be in a 'proper' form.

https://software.broadinstitute.org/software/igv/BAM
> IGV requires ... BAM files be sorted by position and indexed, ....

So we'll sort the BAM file first, and then index it, using `samtools`.

In [1]:
samtools sort sample.bam > sample.sorted.bam

[bam_sort_core] merging from 2 files and 1 in-memory blocks...


Then we index the sorted BAM file, which creates an accompanying index file with the `.bam.bai` extension.

In [2]:
samtools index sample.sorted.bam
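To see the convert, sort, and index steps together on something small, here is a sketch that runs them on a tiny, made-up SAM file. The records in `tiny.sam` are hypothetical and unrelated to `sample.sam`. It writes the sorted output with `-o` instead of `>`, and it skips the `samtools` part if the program is not installed.

```shell
# Build a tiny, made-up SAM file (hypothetical data, unrelated to sample.sam).
# The two reads are deliberately listed out of coordinate order.
{
  printf '@HD\tVN:1.5\tSO:unsorted\n'
  printf '@SQ\tSN:20\tLN:1000\n'
  printf 'r2\t0\t20\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r1\t0\t20\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} > tiny.sam

if command -v samtools >/dev/null 2>&1; then
  samtools view -b tiny.sam > tiny.bam          # SAM -> BAM
  samtools sort -o tiny.sorted.bam tiny.bam     # sort by coordinate
  samtools index tiny.sorted.bam                # writes tiny.sorted.bam.bai
  samtools view tiny.sorted.bam | cut -f 1,4    # read name and position
else
  echo "samtools not found; skipping the demo" >&2
fi
```

After sorting, the read at position 100 (`r1`) should be listed before the read at position 200 (`r2`).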

#### Extract some BAM file information with `samtools view`

While graphical software like IGV is typically the way to visualize data, `samtools view` can also do the job. For example, you can peek at the alignments between 1.4 and 1.5 megabases on chromosome 20 with:

In [2]:
samtools view sample.sorted.bam 20:1.4e6-1.5e6 | head -n 5

HWI-ST354R:351:C0UPMACXX:5:1316:9518:23495	99	20	1424292	60	100M	=	1424436	244	GACTGCTGGGTATGCACAAGGGGGCAGGAGGGGCGATCCCCATGGGGCATGGCCACTGGCCATGGGAAACACAGGAGGGAGGCCAGGCAGCTGGCTGGGC	@@CFFFFFGHCFHJIJJJJJJJJIJJJJJJJJJIHFFDDEDCDDDDDDDDDBDDCDDDDDDDDDCDDDDDDDDDDDDDDDBDDDDDDDDDDDDDBDDB?#	MC:Z:100M	MD:Z:11G88	RG:Z:1719PC0017_51	NM:i:1	MQ:i:60	AS:i:95	XS:i:50
HWI-ST354R:351:C0UPMACXX:6:1315:14068:89888	99	20	1424343	60	100M	=	1424530	287	GCCACTGGCCATGGGAAACACAGGAGGGCGGCCAGGCAGCTGGCTGGGCGGTTATGTTAACCGCTGCAAGATGACAGCATTGAGCAGGTTGGCTTCCTTC	??;DB?;?:CF;DBEBAFBBDHHG;EHG):?@;@HA=AAD>7B=;;3((95&))::@3(:3:;<BBBA((++:@4:+2<@AC@>ACCB############	MC:Z:100M	MD:Z:28A39C31	RG:Z:1719PC0017_51	NM:i:2	MQ:i:60	AS:i:90	XS:i:41
HWI-ST354R:351:C0UPMACXX:5:2313:13369:27714	1187	20	1424373	60	100M	=	1424505	174	GCCAGGCAGCTGGCTGGGCGGTTATGTGAACCGCTGCACGATGACAGCATTGAGCAGGTTAGCTTCCTTCAGGGTCTGGCGCTCATCAGCCAGGTCTTTG	:;8DD?:+A;A:+22CG+3<@###############################################################################	MC:Z:58S
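The second column of each record is the FLAG, a bitwise combination of properties defined in the SAM specification. As a rough sketch, the function below decodes the two values that appear in the output above (99 and 1187). The function name `decode_flag` and the short labels are hypothetical names chosen for this example.

```shell
# Decode a SAM FLAG value into its component bits.
decode_flag() {
  flag=$1
  out=""
  # Each entry is "bit:label"; bit values follow the SAM specification,
  # the labels are informal names chosen for this sketch.
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse 32:mate_reverse 64:first_in_pair \
              128:second_in_pair 256:secondary 512:qc_fail \
              1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}
    name=${pair#*:}
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
  done
  echo "$flag: $out"
}

decode_flag 99     # 99: paired,proper_pair,mate_reverse,first_in_pair
decode_flag 1187   # 1187: paired,proper_pair,mate_reverse,second_in_pair,duplicate
```

`samtools` also ships a `flags` subcommand (for example `samtools flags 99`) that performs the same kind of decoding.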

You can also get the number of alignments in the region with:

In [4]:
samtools view -c sample.sorted.bam 20:1.4e6-1.5e6 

338
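The region `20:1.4e6-1.5e6` writes its coordinates in scientific notation. Assuming `samtools` treats the exponent form as the equivalent plain integers, the sketch below expands such a region string so the actual base positions are visible. The helper `expand_region` is a hypothetical name for this illustration.

```shell
# Expand "chrom:start-end" so that coordinates written in scientific
# notation (e.g. 1.4e6) are printed as plain integers.
expand_region() {
  printf '%s\n' "$1" | awk -F'[:-]' '{ printf "%s:%d-%d\n", $1, $2, $3 }'
}

expand_region 20:1.4e6-1.5e6   # 20:1400000-1500000
```

Under that assumption, `samtools view -c sample.sorted.bam 20:1400000-1500000` should report the same count as the command above.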

