#### BI462 - Introduction to Bioinformatics
# Lab 3: Short Read Assembly using `velvet`
___

In this lab you will practice using `velvet`, a popular de Bruijn graph-based short-read assembler, on single-end short reads generated by an Illumina sequencer.

The data to be assembled come from **Staphylococcus aureus USA300**, which has a genome of around **3 Mb**. The reads come from an Illumina single-end library.

## 1. Download and compile the assembler

```bash
mkdir -p ~/bioinformatics/velvet
cd ~/bioinformatics/velvet
wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.03.tgz
tar xzf velvet_1.2.03.tgz; cd velvet_1.2.03
make velveth velvetg
```
These commands build two executables:
- `velveth`: hashes the reads into k-mers and prepares the data set for `velvetg`.
- `velvetg`: builds the de Bruijn graph and assembles it into contigs.

## 2. Download the data

The reads can be obtained from the **Sequence Read Archive (SRA)**. For the following example use the run data **SRR022825** and **SRR022823** from the SRA Sample **SRS004748**.

```bash
cd ~/bioinformatics/velvet
mkdir -p data
# -P saves each download into the data/ directory
wget -P data ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022825/SRR022825.fastq.gz
wget -P data ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022823/SRR022823.fastq.gz
```

Now you are ready to process the compressed FASTQ files with velvet.
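Before assembling, it can be worth checking how many reads and bases each file holds. The following is a minimal sketch, not part of velvet: the helper name `fastq_summary` and the tiny demo file are made up for illustration, and it assumes a plain four-lines-per-record FASTQ layout.

```bash
# Build a tiny gzipped FASTQ file (made-up reads) so the sketch runs on its own.
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip > "$tmpdir/demo.fastq.gz"

# fastq_summary FILE: print "<reads> <bases>" for a gzipped FASTQ file.
# Assumes four lines per record, so the sequence is line 2 of every record.
fastq_summary() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                         END { print reads + 0, bases + 0 }'
}

summary=$(fastq_summary "$tmpdir/demo.fastq.gz")
echo "$summary"    # number of reads, then number of bases
```

On the real data you would call `fastq_summary data/SRR022825.fastq.gz`. Dividing the total number of bases by the genome size gives a rough estimate of the sequence coverage.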

## 3. `velveth`

`velveth` supports the following input file formats and read categories:

### 3.1 Input file formats
- **FASTA** (default) or **FASTA.gz**
- **FASTQ** or **FASTQ.gz**
- **SAM** or **BAM**
- **Eland**
- **Gerald**

### 3.2 Read categories
- **short**: short reads
- **shortPaired**: short paired-end reads
- **short2**: same as short, but for a second, separate library
- **shortPaired2**: same as shortPaired, but for a second library with a different insert size
- **long**: long reads

Now run `velveth` for the reads in **SRR022825.fastq.gz** and **SRR022823.fastq.gz** using the following options:
- A de Bruijn graph with a **k-mer length of 25**
- An output directory called **kmer_25**

```bash
velveth kmer_25 25 -fastq.gz -short data/SRR022825.fastq.gz data/SRR022823.fastq.gz
```

This takes a short while and leaves several files in the output directory.

### <font color="red">$\S$ Exercise 1</font>

1. What did you find in the folder kmer_25?
2. Describe the content of the two velveth output files.
3. What does the **Log** file store for you?

## 4. `velvetg`

Now run `velvetg` on your output directory with the command:
```bash
time velvetg kmer_25
```

### <font color="red">$\S$ Exercise 2</font>

1. What extra files do you see in the folder kmer_25?
2. What do you suppose they might represent?
3. In the **Log** file in kmer_25, what is the **N50**?

### $\S$ What is N50?

Broadly speaking, N50 is a length-weighted median of the contig lengths, not their average. Sort the contigs from longest to shortest and add up their lengths in that order; the N50 is the length of the contig at which the running total first reaches at least half of the sum of all contig lengths.
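The definition can be sketched in a few lines of shell. This is an illustration only: the function name `n50` and the contig lengths are hypothetical, and it is not the calculation velvet itself performs.

```bash
# n50: read one contig length per line on stdin and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }          # lengths arrive longest first
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # stop at the first contig where the running total reaches half
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Hypothetical contig lengths, for illustration only.
result=$(printf '%s\n' 80 70 50 40 30 20 | n50)
echo "N50 = $result"    # N50 = 70
```

Here the lengths sum to 290, so half is 145. The running total is 80 after the first contig and 150 after the second, so the N50 is 70, whereas the mean length is about 48.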

### <font color="red">$\S$ Exercise 3</font>

We can use a Java program from the `contrib` directory to calculate the truncated N25, N50, and N75:
```bash
java -jar contrib/gnx.jar -min 100 -nx 25,50,75 kmer_25/contigs.fa
```

Does the value of N50 agree with the value stored in the Log file? If not, why do you think this might be?

## 5. Improving the assembly using some options

In order to improve our results, take a closer look at the standard options of velvetg by typing `velvetg` without any parameters.

### 5.1 `-cov_cutoff`

The **-cov_cutoff** option removes nodes, and hence contigs, whose k-mer coverage falls below the given threshold. Such low coverage usually points to sequencing errors and therefore to unreliable sequence.
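To get a feel for which contigs from the first run have low coverage, you can read the coverage value from the contig names. The sketch below assumes headers of the form `NODE_<id>_length_<n>_cov_<c>`; the helper `low_cov_contigs` and the demo file are made up for illustration. It only inspects an existing FASTA file, whereas `-cov_cutoff` acts on the graph during assembly, so the two are not equivalent.

```bash
# Write a tiny FASTA file with made-up contigs so the sketch runs on its own.
tmpdir=$(mktemp -d)
cat > "$tmpdir/contigs.fa" <<'EOF'
>NODE_1_length_812_cov_13.4
ACGT
>NODE_2_length_95_cov_2.7
ACGT
>NODE_3_length_430_cov_11.9
ACGT
EOF

# low_cov_contigs FILE CUTOFF: print the names of contigs whose
# coverage (the last "_"-separated field of the header) is below CUTOFF.
low_cov_contigs() {
    awk -v cutoff="$2" -F'_' '/^>/ && $NF + 0 < cutoff + 0 { print substr($0, 2) }' "$1"
}

low=$(low_cov_contigs "$tmpdir/contigs.fa" 6)
echo "$low"    # NODE_2_length_95_cov_2.7
```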

### 5.2 `-exp_cov`

The **-exp_cov** switch tells `velvetg` what coverage to expect. If the observed coverage of a contig is substantially higher than this expected value, the contig probably represents a repeat.
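Keep in mind that velvet measures coverage in k-mers, not in nucleotides. The Velvet manual relates the two as $C_k = C \times (L - k + 1) / L$, where $C$ is the nucleotide coverage, $L$ the read length and $k$ the hash length. A small sketch of that arithmetic follows; the function name `kmer_coverage` and the numbers are hypothetical and are not taken from this data set.

```bash
# kmer_coverage C L K: convert nucleotide coverage C into k-mer coverage
# for reads of length L and hash length K, using Ck = C * (L - K + 1) / L.
kmer_coverage() {
    awk -v c="$1" -v l="$2" -v k="$3" 'BEGIN { printf "%.1f\n", c * (l - k + 1) / l }'
}

# Hypothetical numbers: 30x nucleotide coverage, 100 bp reads, k = 31.
ck=$(kmer_coverage 30 100 31)
echo "expected k-mer coverage: $ck"    # expected k-mer coverage: 21.0
```

The k-mer coverage is always lower than the nucleotide coverage, and the gap widens as $k$ approaches the read length.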

We can use a **contributed Perl script** to **visualize the distribution** and **estimate the expected coverage of $k$-mers**.
```bash
contrib/estimate-exp_cov/velvet-estimate-exp_cov.pl kmer_25/stats.txt
```

This will generate a **weighted ASCII histogram**, with which you can determine the expected coverage and figure out a possible coverage threshold.
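The idea of a length-weighted histogram can be sketched with `awk`: each node contributes its length, rather than a count of one, to its coverage bin. This is not the contributed script's implementation. The sketch assumes a tab-separated `stats.txt` with a header line, the node length in column 2 (`lgth`) and the short-read coverage in column 6 (`short1_cov`); the table below is made up.

```bash
# Made-up table in the tab-separated layout assumed for stats.txt.
tmpdir=$(mktemp -d)
printf 'ID\tlgth\tout\tin\tlong_cov\tshort1_cov\n' > "$tmpdir/stats.txt"
printf '1\t500\t1\t1\t0\t13.2\n2\t300\t1\t1\t0\t12.8\n' >> "$tmpdir/stats.txt"
printf '3\t40\t0\t1\t0\t1.4\n4\t60\t1\t0\t0\t2.1\n' >> "$tmpdir/stats.txt"

# weighted_hist FILE: round each coverage to the nearest integer and
# sum the node lengths per bin; prints "<coverage> <total length>".
weighted_hist() {
    awk -F'\t' 'NR > 1 { bin = int($6 + 0.5); total[bin] += $2 }
                END { for (b in total) print b, total[b] }' "$1" | sort -n
}

hist=$(weighted_hist "$tmpdir/stats.txt")
echo "$hist"
```

In this toy table most of the assembled length sits in the bin at coverage 13, while the short nodes at coverage 1 and 2 form a separate low-coverage group.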

### <font color='red'>$\S$ Exercise 4</font>

1. What is the possible expected coverage of $k$-mers?
2. In your opinion, what coverage cutoff would remove the likely noise?

### 5.3 Rerun `velvetg` using some options

```bash
# back up the contigs from the first run (no extra options)
cp kmer_25/contigs.fa kmer_25/contigs.fa.0

# rerun with a coverage cutoff of 6
time velvetg kmer_25 -cov_cutoff 6
cp kmer_25/contigs.fa kmer_25/contigs.fa.1

# rerun with an expected coverage of 13
time velvetg kmer_25 -exp_cov 13
cp kmer_25/contigs.fa kmer_25/contigs.fa.2

# rerun with both options
time velvetg kmer_25 -cov_cutoff 6 -exp_cov 13
cp kmer_25/contigs.fa kmer_25/contigs.fa.3
```

### <font color="red">$\S$ Exercise 5</font>

1. What is the N50 with no extra options?
2. What is the N50 with `-cov_cutoff 6`?
3. What is the N50 with `-exp_cov 13`?
4. What is the N50 with `-cov_cutoff 6 -exp_cov 13`?
5. Did you notice a trend in the time `velvetg` took to run? If so, can you explain why that might be?

Now you can try using different values for both `-cov_cutoff` and `-exp_cov`. Similarly, you can explore the further use of the other options:
- `-max_coverage`: upper coverage cutoff (the counterpart of `-cov_cutoff`, which sets the lower one).
- `-max_branch_length`: maximum length, in base pairs, of a bubble.
- `-unused_reads`: write the unused reads to a file.
- `-amos_file`: write an AMOS message file.
- `-read_trkg`: track the positions of reads in the assembly (uses more memory); switched on automatically for certain operations.
- etc.

### <font color="red">$\S$ Exercise 6</font>

If you want to explore the behaviour of velvet even further, you could experiment with the following.
1. Reduce the sequence coverage by only choosing one input file for the assembly e.g. SRR022825.fastq.gz.
2. Increase the sequence coverage by downloading further files from the same sample SRS004748.

How did the N50 change?

## 6. Visualization of the contigs

The **-amos_file** option instructs `velvetg` to write a version of the assembly that can be viewed with **Tablet**, a viewer written in Java.
```bash
velvetg kmer_25 -cov_cutoff 6 -exp_cov 13 -amos_file yes
```
This generates a file named `velvet_asm.afg`, which can be visualized using `tablet`.

You can download the reference genome sequence to assist the viewing.