# Project #1 "What causes antibiotic resistance?"

__Daria Tziba__, __Igor Kuznetsov__

The following steps reproduce the results from the mini-paper.
Working environment: _Ubuntu 18.04, Intel i5_

## 1. Obtaining data

Download the reference sequence and annotation of the E. coli strain that is not resistant to antibiotics.

`wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz`

`wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz`

Unzip data

`gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz`

`gunzip GCF_000005845.2_ASM584v2_genomic.gff.gz`

Download raw Illumina reads from shotgun sequencing of an E. coli strain that is resistant to the antibiotic ampicillin (1 and 2 refer to forward and reverse reads).

`wget http://public.dobzhanskycenter.ru/mrayko/amp_res_1.fastq.zip`

`wget http://public.dobzhanskycenter.ru/mrayko/amp_res_2.fastq.zip`

Unzip data

`unzip amp_res_1.fastq.zip`

`unzip amp_res_2.fastq.zip`

## 2. Inspect raw sequencing data

Head of the forward .fastq file

`head -8 amp_res_1.fastq`

Count the lines in the FASTQ files

`wc -l amp_res_1.fastq`

Output:

`wc -l amp_res_2.fastq`

Output:

Number of reads (each FASTQ record spans 4 lines): $\frac{1823504}{4} = 455876$
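
The division by 4 reflects the FASTQ layout: header, sequence, separator and quality line per read. A minimal sketch of the same count in Python, using a tiny made-up in-memory example (the read names and sequences are hypothetical, not taken from `amp_res_1.fastq`):

```python
def count_fastq_reads(lines):
    """Return the number of reads given an iterable of FASTQ lines."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        # A well-formed FASTQ file has exactly 4 lines per read.
        raise ValueError("FASTQ line count is not a multiple of 4")
    return n_lines // 4


# Hypothetical two-read example (header, sequence, '+', qualities).
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "II#I",
]
print(count_fastq_reads(example))

# The same arithmetic applied to the line count reported above.
print(1823504 // 4)
```

For a real file, the function can be fed an open file handle, for example `count_fastq_reads(open("amp_res_1.fastq"))`.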

## 3. Inspecting raw sequencing data with fastqc

3.1) Install the __fastqc__ program - a simple statistics analysis tool

`sudo apt-get install fastqc`

Run fastqc on the two FASTQ files

`fastqc -o . amp_res_1.fastq amp_res_2.fastq`

Anomaly #1 for the forward reads (per base sequence quality)

![title](init_per_base_seq_quality_1.png)

Meaning: the per base sequence quality falls below a quality score of 20 for part of the read.

The graph below shows the quality scores from each tile across all bases, which reveals whether a loss in quality is associated with only one part of the flowcell. FastQC reports a failure when any tile shows a mean Phred score more than 5 below the mean for that base across all tiles.
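
The failure rule can be illustrated with a small sketch. It assumes per-tile mean Phred scores have already been computed for each base position; the tile ids and scores below are made up, and FastQC's own implementation may differ in detail.

```python
from statistics import mean


def flag_tiles(tile_means, threshold=5.0):
    """Return tile ids whose mean quality at any position is more than
    `threshold` below the mean for that position across all tiles.

    tile_means maps a tile id to a list of mean Phred scores,
    one entry per base position (all lists have the same length).
    """
    n_positions = len(next(iter(tile_means.values())))
    # Mean for each base position across all tiles.
    overall = [
        mean(scores[i] for scores in tile_means.values())
        for i in range(n_positions)
    ]
    flagged = set()
    for tile, scores in tile_means.items():
        if any(overall[i] - scores[i] > threshold for i in range(n_positions)):
            flagged.add(tile)
    return sorted(flagged)


# Hypothetical data: tile 1103 is clearly worse at the second position.
example = {
    1101: [36, 36, 35],
    1102: [36, 35, 35],
    1103: [36, 20, 34],
}
print(flag_tiles(example))
```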

Anomaly #2 for the forward reads (per tile sequence quality)

![title](init_per_tile_seq_quality_1.png)

Anomaly for the reverse reads (per base sequence quality)

![title](init_per_base_seq_quality_2.png)

## 4. Filter the reads

The following steps improve the overall quality of the sequencing reads before proceeding with downstream analysis.

4.1) Install the trimming program Trimmomatic

`sudo apt-get install trimmomatic`

4.2) gzip the Illumina reads for use with Trimmomatic

`gzip amp_res_1.fastq`

`gzip amp_res_2.fastq`

__Note__: we use the `-phred33` parameter because the reads use the Illumina 1.9 quality encoding.
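
In the Phred+33 encoding each quality character stores the score as its ASCII code minus 33, and a score Q corresponds to an error probability of $10^{-Q/10}$. A minimal sketch of the decoding (the quality string is a made-up example):

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# 'I' encodes 40, '5' encodes 20 and '#' encodes 2.
print(phred33_scores("I5#"))
# A score of 20 means a 1% chance of an incorrect base call.
print(error_probability(20))
```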

4.3) Run TrimmomaticPE

`mkdir trim20`

`TrimmomaticPE -phred33 amp_res_1.fastq.gz amp_res_2.fastq.gz trim20/output_1P.fq.gz trim20/output_1U.fq.gz trim20/output_2P.fq.gz trim20/output_2U.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:20`

Output:
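
To give some intuition for `SLIDINGWINDOW:10:20` and `MINLEN:20`, here is a simplified sketch of the idea: scan the read from the 5' end and cut once the average quality within a window drops below the threshold, then discard reads that end up too short. This illustrates the principle only; it is not Trimmomatic's actual implementation, which may place the cut slightly differently, and the function names are hypothetical.

```python
def sliding_window_cut(qualities, window=10, threshold=20):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean
    quality is below `threshold`; if no window fails, keep everything.
    """
    for start in range(0, len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < threshold:
            return start
    return len(qualities)


def trim_read(qualities, window=10, threshold=20, min_len=20):
    """Return the kept length, or 0 if the trimmed read is shorter than min_len."""
    keep = sliding_window_cut(qualities, window, threshold)
    return keep if keep >= min_len else 0


# Hypothetical read: 30 good bases followed by 10 poor ones.
read = [30] * 30 + [5] * 10
print(trim_read(read))
```

Raising the threshold to 30 in this sketch makes more windows fail, which is the mechanism by which a stricter setting shortens or discards more reads.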

4.5) Count reads in the trimmed sequences

`cd trim20`

`gunzip output_1P.fq.gz`

`gunzip output_2P.fq.gz`

`wc -l output_1P.fq`

Number of reads: $\frac{1759076}{4} = 439769$

4.6) Run fastqc on the data trimmed with a quality threshold of 20

`fastqc -o . output_1P.fq output_2P.fq`

Result: for the forward reads only the per tile sequence quality anomaly remains; for the reverse reads the anomaly vanished.

The only remaining anomaly:

![title](trim20_per_tile_seq_quality_1.png)

4.7) Increase the trimming quality threshold to 30

`cd ..`

`mkdir trim30`

`TrimmomaticPE -phred33 amp_res_1.fastq.gz amp_res_2.fastq.gz  trim30/output_1P.fq.gz trim30/output_1U.fq.gz trim30/output_2P.fq.gz trim30/output_2U.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:30 TRAILING:30 SLIDINGWINDOW:10:30 MINLEN:30`

`gunzip trim30/output_1P.fq.gz`

`gunzip trim30/output_2P.fq.gz`

`wc -l trim30/output_1P.fq`

Number of reads: $\frac{1484276}{4}=371069$

__Conclusion__: trimming with a quality threshold of 30 discards too many reads (371069 remain, compared with 439769 at threshold 20), so the following steps use the reads trimmed with a threshold of 20.
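
The comparison can be expressed as the fraction of the original 455876 reads that survive each setting. A short sketch using only the read counts reported in this document (the helper name is hypothetical):

```python
def retained_percent(kept, total):
    """Percentage of reads that survive trimming."""
    return 100.0 * kept / total


raw_reads = 1823504 // 4   # reads before trimming
kept_q20 = 439769          # reads left with threshold 20
kept_q30 = 371069          # reads left with threshold 30

for label, kept in (("threshold 20", kept_q20), ("threshold 30", kept_q30)):
    print(f"{label}: {retained_percent(kept, raw_reads):.1f}% of reads kept")
```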

## 5. Aligning sequences to reference

__5.1) Index the reference file via bwa tool__

`sudo apt-get install bwa`

`bwa index -a bwtsw GCF_000005845.2_ASM584v2_genomic.fna`

__5.2) Aligning reads__

`bwa mem GCF_000005845.2_ASM584v2_genomic.fna trim20/output_1P.fq trim20/output_2P.fq > alignment.sam`


__5.3) Compress SAM file to BAM via samtools program__

`sudo apt-get install samtools`

Compress SAM file to BAM format:

`samtools view -S -b alignment.sam > alignment.bam`
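
For orientation, SAM is a tab-separated text format in which the second column of every alignment record is a bitwise flag; bit `0x4` marks a read as unmapped. The sketch below counts mapped records in a few made-up SAM lines. It is only an illustration of the flag logic, not a replacement for samtools, and the records are hypothetical.

```python
def count_mapped(sam_lines):
    """Return (mapped, total) alignment records from an iterable of SAM lines."""
    mapped = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment record
        flag = int(line.split("\t")[1])
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped, total


# Hypothetical records: one mapped read (flag 99), one unmapped read (flag 77).
example = [
    "@SQ\tSN:ref\tLN:100",
    "r1\t99\tref\t10\t60\t4M\t=\t50\t44\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped(example))
```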

To get statistics:

`samtools flagstat alignment.bam`

Output:

__5.4) Sort bam file__

5.4.1) Sort bam file by sequence coordinate on reference

`samtools sort alignment.bam -o alignment_sorted.bam`

5.4.2) Index bam file for faster search

`samtools index alignment_sorted.bam`

__5.5) Visualize in the IGV program__

    1) Download IGV Desktop Application
    2) Genomes -> Create .genome file -> Choose .fna file
    3) Genomes -> Load from file (choose .fna.fai file)
    4) File -> Load from file (choose .bam file)

## 6. Variant Calling

__6.1) Create .mpileup from reference genome and sorted .bam files__

`samtools mpileup -f GCF_000005845.2_ASM584v2_genomic.fna alignment_sorted.bam > my.mpileup`
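
Each pileup line lists, for one reference position, the bases observed in the overlapping reads: `.` and `,` denote matches to the reference, letters denote mismatches, `^` (followed by a mapping-quality character) and `$` mark read starts and ends, and `+n`/`-n` introduce indels. A variant caller derives a variant frequency from this column. The sketch below shows that calculation under simplifying assumptions (no base-quality or coverage filtering, indels ignored); a real caller such as VarScan applies additional filters.

```python
def variant_frequency(bases):
    """Fraction of non-reference base calls in a pileup read-bases string."""
    ref = 0
    alt = 0
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # skip '^' and the mapping-quality character after it
            continue
        if c in "+-":
            # Indel: '+' or '-', then a length, then that many bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if c in ".,":
            ref += 1
        elif c.upper() in "ACGT":
            alt += 1
        # '$', '*' and 'N' are neither reference nor variant calls here.
        i += 1
    total = ref + alt
    return alt / total if total else 0.0


# Hypothetical column: 3 reference calls and 3 mismatches, plus an insertion.
print(variant_frequency("..,^]A$a+2GGt"))
```

With a minimum variant frequency of 0.50, a position like the hypothetical one above would sit exactly at the cut-off.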

__6.2) Call actual variants via VarScan program__

Download VarScan from the website: [VarScan](http://dkoboldt.github.io/varscan/)

`java -jar VarScan.v2.3.9.jar mpileup2snp my.mpileup --min-var-freq 0.50 --variants --output-vcf 1 > VarScan_results.vcf`

Output:

## 7. Variant effect prediction

To find out where the mutations are located, load the annotation and the variants into IGV:

    1) File -> Load from file (open .gff annotation)
    2) File -> Load from file (open VarScan_results.vcf)
    
![title](igv_final.png)

The following mutated genes were detected:
* ftsI
* acrB
* rybA
* envZ
* rsgA
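
The lookup done visually in IGV amounts to asking which annotated gene intervals contain a variant position. A minimal sketch of that lookup on GFF3-style lines follows; the gene names and coordinates in the example are made up for illustration and are not taken from the E. coli annotation, and the function name is hypothetical.

```python
def genes_at(position, gff_lines):
    """Return names of 'gene' features whose interval contains the position."""
    hits = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])  # 1-based, inclusive
        if start <= position <= end:
            attrs = dict(
                item.split("=", 1) for item in cols[8].split(";") if "=" in item
            )
            hits.append(attrs.get("Name", attrs.get("ID", "unknown")))
    return hits


# Hypothetical annotation with two overlapping genes and one CDS.
example_gff = [
    "##gff-version 3",
    "chr\tsrc\tgene\t100\t500\t.\t+\t.\tID=gene-a;Name=geneA",
    "chr\tsrc\tgene\t450\t900\t.\t-\t.\tID=gene-b;Name=geneB",
    "chr\tsrc\tCDS\t100\t500\t.\t+\t0\tID=cds-a",
]
print(genes_at(470, example_gff))
```

Positions that fall in no gene interval return an empty list, which would correspond to an intergenic variant.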