## Before you begin:
Run the following line and ensure that it prints out something like
```
/home/ubuntu/<Your quest ID>/Tutorial-Series
```

In [2]:
!pwd

/home/ubuntu/Tutorial-Series


## Table of Contents
- <a href='#1.0'>Section 1 - Introduction</a>
- <a href='#2.0'>Section 2 - The Process of Genome Assembly</a>
    - <a href='#2.1'>Section 2.1 - Sequencing</a>
    - <a href='#2.2'>Section 2.2 - Assembly on the Computer</a>
    - <a href='#2.3'>Section 2.3 - Running Velvet</a>
- <a href='#3.0'>Section 3 - Visualizing our Assembly</a>

# Tutorial 2 - Genome Assembly
## <a id='1.0'>Section 1 - Introduction</a>

In Tutorial 1 we learned how to navigate Jupyter Notebooks and explored some basic command-line commands. We also learned about the BLAST algorithm and how we can search for similarities between sequences. We finished the tutorial by BLASTing a viral protein against the genome of an E. coli strain named O157:H7. Surprisingly, we found that there were some similar proteins in the bacterial genome!

In this tutorial, we'll be taking some unknown E. coli reads that we found and we'll try to identify which strain they might've come from by walking through the process of genome assembly and searching with BLAST.

## <a id='2.0'>Section 2 - The Process of Genome Assembly</a>

Genome assembly is a large part of bioinformatics. It's how we take physical DNA sequences and store them in a computer-readable format. Genome assembly usually refers to two specific processes: sequencing and computational assembly. Sequencing is the wet lab component, where we read our physical DNA into digital data, while assembly processes that data into a genome that can be investigated further or uploaded to NCBI.

<img src='img/tut02/assembly-overview.PNG'></img>

### <a id='2.1'>Section 2.1 - Sequencing</a>

Many recent advances in bioinformatics are thanks to the rapid development of sequencing technology! Next-Gen Sequencing technologies allow us to cheaply and rapidly read short segments of DNA or RNA. This in turn allows researchers to cover more of a genome and generate higher quality genome reconstructions. Older sequencing technologies, mainly Sanger sequencing, could only read a single sequence of roughly 600 - 1200 base pairs at a time. With Next-Gen Sequencing we're often able to get enough reads to cover an entire genome 50 times!
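To make "cover an entire genome 50 times" a bit more concrete, average coverage is commonly estimated as the number of reads times the read length, divided by the genome size. The sketch below uses made-up numbers chosen only so the arithmetic is easy to follow; they are not taken from this tutorial's data, and the function name `estimated_coverage` is our own.

```python
def estimated_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base of the genome is expected to be read."""
    return num_reads * read_length / genome_size


# Hypothetical values for illustration only:
# 2,300,000 reads of 100 bases each, against a 4,600,000-base genome.
coverage = estimated_coverage(2_300_000, 100, 4_600_000)
print(f"{coverage:.1f}x")
```

This is only an average. Real reads are not spread perfectly evenly, so some regions of a genome end up with more coverage than others.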

<img src='img/tut02/shotgun_sequencing.PNG'></img>

Typically when people talk about sequencing a genome they're referring to Whole Genome Shotgun Sequencing. This involves shearing our sample DNA into tiny chunks. These chunks, individually, are simpler to sequence, allowing modern sequencers to read multiple chunks at the same time! After all of that, all of the read sequences are stored in `.fastq` files for further processing. In the field, whenever you hear someone talk about reads, they're most likely talking about the `.fastq` file that was produced by the sequencer. Below is a picture of what an Illumina Sequencer looks like...

<img src='img/tut02/Sequencer.PNG'></img>
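If you're curious what is inside a `.fastq` file, each read is usually stored as four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. Here is a minimal sketch of reading that layout in Python. It assumes every record is exactly four lines long, the two reads are made up for illustration, and `read_fastq` is a helper name of our own (not part of any library).

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # reached the end of the file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored in this sketch
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at the start of a record, got: {header!r}")
        yield header[1:], sequence, quality


# Two made-up reads, purely for illustration.
example = StringIO(
    "@read_1\nATGCGTAC\n+\nIIIIHHGG\n"
    "@read_2\nGGCATTAC\n+\nFFFFIIII\n"
)
records = list(read_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, sequence, len(quality))
```

To try it on a real file you could pass `open("raw_reads1.fastq")` instead of the `StringIO` example, assuming the file follows the same four-line layout.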

### <a id='2.2'>Section 2.2 - Assembly on the Computer</a>

So, now we have our `raw_reads1.fastq` file and we want to assemble it into a single sequence. This is where the power of Computer Science comes in! To do this, we're going to use a famous assembly program called Velvet.

There are a ton of intricacies involved in genome assembly, but we'll try to cover the major ideas. To piece together our reads we need to slice them up further, into fragments called k-mers. The "k" in k-mers represents how many bases we have in each slice. For example, `ATGC` is a 4-mer, since it has four bases in it. By using k-mers, our computer can readily sort them and start building the sequences from there. The image below shows all of the 5-mers for two sets of reads.

<img src='img/tut02/kmers.png'></img>
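Slicing a read into k-mers is just a sliding window of width k, so a read of length L contains L - k + 1 k-mers. A minimal sketch follows; the read is made up and the function name `kmers` is our own.

```python
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return every k-base window of `sequence`, from left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ATGCGTAC"  # a made-up 8-base read
print(kmers(read, 5))  # 8 - 5 + 1 = 4 five-mers
```

Notice that neighbouring k-mers share all but one base. That overlap is what lets an assembler chain them back together.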

Now, after collecting all of the k-mers we can overlap them until we find the longest sequences possible. We can stack the k-mers of our reads to build contigs and finally we can stack our contigs to build scaffolds.

<img src='img/tut02/GenomeAssembly.png'></img>
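To get a feel for how overlapping k-mers turn into a contig, here is a heavily simplified sketch in the spirit of the de Bruijn graph approach that Velvet is built on. It is not Velvet's actual implementation: it ignores sequencing errors, reverse complements, and coverage, and it simply walks forward while there is exactly one way to continue. The three reads and all function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def kmers(sequence: str, k: int) -> List[str]:
    """Return every k-base window of `sequence`, from left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Link the first k-1 bases of every k-mer to its last k-1 bases."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend from `start` while exactly one unvisited successor exists."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:
            break  # avoid looping forever on a cycle
        contig += next_node[-1]  # each step adds one new base
        visited.add(next_node)
        node = next_node
    return contig


# Three made-up overlapping reads.
reads = ["ATGCGTAC", "GCGTACCT", "TACCTGAA"]
graph = build_graph(reads, k=5)

# Start from a node that nothing points to.
successors = {s for targets in graph.values() for s in targets}
starts = [node for node in graph if node not in successors]
contig = walk_contig(graph, starts[0])
print(contig)
```

None of the three reads covers the whole sequence on its own, but their shared k-mers let the walk stitch them into one longer contig.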

### <a id='2.3'>Section 2.3 - Running Velvet</a>

Now that we understand the process of genome assembly, let's run our commands to assemble our `raw_reads1.fastq`...

In [1]:
!velveth data/tut02/assembly 31 -short -fastq raw_reads1.fastq

[0.000000] Reading FastQ file raw_reads1.fastq;
[7.284250] 1257935 sequences found
[7.284262] Done
[8.277546] Reading read set file data/tut02/assembly/Sequences;
[8.536103] 1257935 sequences found
[9.853106] Done
[9.853124] 1257935 sequences in total.
[9.853176] Writing into roadmap file data/tut02/assembly/Roadmaps...
[13.124403] Inputting sequences...
[13.124417] Inputting sequence 0 / 1257935
[51.210761] Inputting sequence 1000000 / 1257935
[61.260348]  === Sequences loaded in 48.135962 s
[61.292445] Done inputting sequences
[61.292464] Destroying splay table
[61.316316] Splay table destroyed


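`velveth` breaks our reads into k-mers and writes the `Sequences` and `Roadmaps` files that the next step will use. The arguments specify the output directory (`data/tut02/assembly`), a k-mer size of 31 bases, and that we're reading short reads (`-short`) from a `.fastq` file (`-fastq`).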

In [2]:
!velvetg data/tut02/assembly/ -scaffolding no -read_trkg yes -amos_file yes -exp_cov auto

[0.000001] Reading roadmap file data/tut02/assembly//Roadmaps
[2.364279] 1257935 roadmaps read
[2.364802] Creating insertion markers
[2.716147] Ordering insertion markers
[4.058660] Counting preNodes
[4.266896] 2074211 preNodes counted, creating them now
[9.323758] Sequence 1000000 / 1257935
[10.483138] Adjusting marker info...
[10.750601] Connecting preNodes
[14.122816] Connecting 1000000 / 1257935
[15.071025] Cleaning up memory
[15.074082] Done creating preGraph
[15.074087] Concatenation...
[15.763336] Renumbering preNodes
[15.763352] Initial preNode count 2074211
[15.826490] Destroyed 1432820 preNodes
[15.826504] Concatenation over!
[15.826507] Clipping short tips off preGraph
[15.913341] Concatenation...
[16.015534] Renumbering preNodes
[16.015548] Initial preNode count 641391
[16.064066] Destroyed 45511 preNodes
[16.064077] Concatenation over!
[16.064079] 24400 tips cut off
[16.064081] 595880 nodes left
[16.064137] Writing into pregraph file data/tut02/assembly//PreGraph...
[17.59

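This is the actual command that assembles our reads into contigs, which it writes to `contigs.fa` in the assembly directory. The flags are just some recommended settings: `-scaffolding no` turns off scaffolding, `-read_trkg yes` keeps track of where each read ends up in the assembly, `-amos_file yes` writes the `velvet_asm.afg` file, and `-exp_cov auto` lets Velvet estimate the expected coverage on its own.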

Now that we have our contigs, let's run a BLAST search to find out where our reads came from.

In [4]:
!blastn -query data/tut02/assembly/contigs.fa -remote -db nt -evalue 0.05| head -n 50

BLASTN 2.6.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: Nucleotide collection (nt)
           50,405,971 sequences; 198,157,098,033 total letters



Query= NODE_1_length_6431_cov_50.828331

Length=6461

RID: 5R5J2HR0016
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

CP034953.1  Escherichia coli strain MT102 chromosome, complete ge...  11926   0.0  
CP034595.1  Escherichia coli strain WCHEC035053S1G0 chromosome, c...  11926   0.0  
LR134083.1  Escherichia coli strain NCTC12655 genome assembly, ch...  11926   0.0  
CP034426.1  Escherichia coli strain WPB121 chromosome                 11926   0.0  
CP034428.1  Escherichia coli strain WPB102 chromosome                 11926   0.0  
CP032667.1  Escherichia coli str. K-12 substr. MG1655 chrom

Here we can see that our contigs seem to match with different strains of E. coli. If we look very closely we can see that `K-12 substr. MG1655` appears twice. For our next tutorial, we're going to assume that our assembled contigs belong to this particular strain.
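You may have noticed that the query name in the BLAST output, `NODE_1_length_6431_cov_50.828331`, says the length is 6431 while BLAST reports `Length=6461`. That's because Velvet records contig lengths in k-mers rather than bases, and a contig made of N overlapping k-mers spans N + k - 1 bases. Here is a small sketch that pulls the header apart and does the conversion, using our k-mer size of 31. The function name and the dictionary keys are our own choices.

```python
import re


def parse_velvet_header(header: str, k: int) -> dict:
    """Split a Velvet contig name into its parts and convert its length to bases."""
    match = re.fullmatch(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)", header)
    if match is None:
        raise ValueError(f"not a Velvet contig header: {header!r}")
    node_id, length_kmers, coverage = match.groups()
    return {
        "node": int(node_id),
        "length_kmers": int(length_kmers),
        # N overlapping k-mers span N + k - 1 bases
        "length_bases": int(length_kmers) + k - 1,
        "kmer_coverage": float(coverage),
    }


info = parse_velvet_header("NODE_1_length_6431_cov_50.828331", k=31)
print(info["length_bases"])  # 6431 + 31 - 1
```

The result matches the `Length=6461` line that BLAST printed for this contig.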

## <a id='3.0'>Section 3 - Visualizing our Assembly</a>

If you are following this session from home, you'll need to install these two programs...

- <a href="https://ics.hutton.ac.uk/tablet/">Tablet</a>
- <a href="https://rrwick.github.io/Bandage/">Bandage</a>

Both of these programs offer different visualizations of the assembly we just made. Tablet lets us examine how the k-mers from `raw_reads1.fastq` were overlapped to make our contigs, while Bandage lets us examine the assembly graph, which shows how the pieces of our assembly were put together.

To use Tablet you need to download the `velvet_asm.afg` file from the `data/tut02/assembly` directory (Just navigate, right-click and select Download if you're in Jupyter) and open it in Tablet. If you do this you should end up with a visualization like this...

<img src='img/tut02/Tablet.PNG'></img>

The left section shows a list of all your contigs, the upper right section shows the amount of overlap for a given section, and the bottom right shows a visualization of your k-mer overlaps.

To use Bandage you need to download the `LastGraph` file from the `data/tut02/assembly` directory, just like before. Now we can open it up in Bandage. Once it's loaded, click on the button that says "Draw Graph" and you should end up with a visualization that looks like this...

<img src='img/tut02/Bandage.PNG'></img>