## Table of Contents
- <a href='#1.0'>Section 1 - Introduction</a>
    - <a href='#1.1'>Section 1.1 - Downloading our Data</a>
- <a href='#2.0'>Section 2 - Running BLAST</a>
    - <a href='#2.1'>Section 2.1 - Setting up the Database</a>
    - <a href='#2.2'>Section 2.2 - A Brief Look Into the BLAST Algorithm</a>
    - <a href='#2.3'>Section 2.3 - Running the BLAST Search</a>

# Tutorial 1 - Command Line BLAST
<img src='img/tut01/ncbi.png'></img>
## <a id='1.0'>Section 1 - Introduction</a>

If you were to ask any bioinformatician which tool they consider the most important, most would likely say BLAST.

But what does BLAST stand for?

$B$asic
<br>
$L$ocal
<br>
$A$lignment
<br>
$S$earch
<br>
$T$ool

It is used to search nucleotide (DNA/RNA) or amino acid sequences against the entire NCBI database. This is no trivial task, since millions of sequences have been uploaded to the database. In this tutorial, we will go over the basic architecture behind the NCBI database and how the BLAST algorithm works.

## <a id='1.1'>Section 1.1 - Downloading our Data</a>

To save a bit of time, we're not going to run BLAST searches against the entire NCBI database. Instead, we're going to download both the genome and the proteome for a strain of E. coli called O157:H7. The links are listed below.

Genome:
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_O157_H7_uid57781/NC_002127.fna

Proteome:
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_O157_H7_uid57781/NC_002127.faa

When working on a project of your own, it's often faster to download the few important sequences you'll be using so that you can run the BLAST algorithm from the command line. To do that, we will use the command-line tool `wget` to download our data. To see all of the options available for `wget`, run the cell below.

In [None]:
!wget -h

Now in the cell below, use one of the `wget` options to download the E. coli genome and proteome into our `data` directory.
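If you get stuck, here is one possible approach, written as a small Python sketch rather than the only answer. It assumes the `-P` (`--directory-prefix`) option of `wget`, which sets the directory files are saved into; the helper name `wget_command` is made up for this illustration. The sketch only builds and prints the commands, it does not download anything.

```python
import shlex

BASE_URL = (
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/"
    "Escherichia_coli_O157_H7_uid57781/"
)
GENOME_URL = BASE_URL + "NC_002127.fna"
PROTEOME_URL = BASE_URL + "NC_002127.faa"


def wget_command(url, dest_dir="data"):
    """Build a wget command that saves `url` inside `dest_dir` (hypothetical helper)."""
    # -P / --directory-prefix tells wget which directory to save the file into.
    return ["wget", "-P", dest_dir, url]


# Print the commands so they can be copied into a notebook cell after a `!`.
for url in (GENOME_URL, PROTEOME_URL):
    print(shlex.join(wget_command(url)))
```

Each printed line can be pasted into a cell with a leading `!`, or the list returned by `wget_command` can be passed to `subprocess.run`.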

## <a id='2.0'>Section 2 - Running BLAST</a>

## <a id='2.1'>Section 2.1 - Setting up the Database</a>
Now that we have our sequences downloaded, we need to set up our own version of the database using another command-line program called `makeblastdb`. To find out more about `makeblastdb`, run the cell below.

In [None]:
!makeblastdb -h

Running `makeblastdb` on our data will slice our large sequences into a series of small words that can be easily sorted. This will speed up the searches that we'll be performing.
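To make the idea of "small words" concrete, here is a minimal Python sketch of a word index. It illustrates the concept only and is not a description of the files `makeblastdb` writes; the function name `build_word_index`, the toy sequence, and the word size of 3 are all choices made for this example.

```python
from collections import defaultdict


def build_word_index(sequence, word_size=3):
    """Map every overlapping word of length `word_size` to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        word = sequence[start:start + word_size]
        index[word].append(start)
    return dict(index)


# Toy amino acid sequence, invented for the illustration.
index = build_word_index("MKTAYMKT")
print(index["MKT"])   # → [0, 5]
print(sorted(index))  # the distinct words, in sorted order
```

Looking a word up in the dictionary is much cheaper than scanning the whole sequence again for every query, which is the intuition behind indexing before searching.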

Now, let's actually run `makeblastdb` on our data. Here is what running `makeblastdb` on our amino acid sequences looks like:

In [None]:
!makeblastdb -in data/NC_002127.faa -dbtype 'prot' -parse_seqids

Do the same for our nucleotide sequence in the cell below.

## <a id='2.2'>Section 2.2 - A Brief Look Into the BLAST Algorithm</a>

<img src='img/tut01/word-based1.PNG'></img>
<img src='img/tut01/word-based2.PNG'></img>
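The word-based approach can be sketched in a few lines of Python. This is a heavily simplified toy, assuming exact word matches and extension only while characters are identical; real BLAST also uses scoring matrices, score thresholds, and gapped alignment. The function names and sequences are invented for the illustration.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    positions = {}
    for s in range(len(subject) - word_size + 1):
        positions.setdefault(subject[s:s + word_size], []).append(s)
    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in positions.get(query[q:q + word_size], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, word_size=3):
    """Grow a seed left and right while the two sequences keep agreeing.

    Returns (query_start, subject_start, match_length).
    """
    left = 0
    while (q - left - 1 >= 0 and s - left - 1 >= 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = word_size
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    return q - left, s - left, left + right


query = "GATTACA"
subject = "CCGATTACATT"
hits = {extend_seed(query, subject, q, s) for q, s in find_seeds(query, subject)}
best = max(hits, key=lambda hit: hit[2])
print(best)  # → (0, 2, 7): the whole query matches the subject starting at position 2
```

Several seeds extend into the same match, which is why the extended hits are collected in a set before picking the longest one.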

## <a id='2.3'>Section 2.3 - Running the BLAST Search</a>

When it comes to BLAST, there are a few different flavours depending on what you're searching for.

`blastn` ~ nucleotide query searched against nucleotide database<br>
`blastp` ~ protein query searched against protein database<br>
`blastx` ~ translated nucleotide query searched against protein database<br>
`tblastn` ~ protein query searched against translated nucleotide database<br>
`tblastx` ~ translated nucleotide query searched against another translated nucleotide database
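The list above boils down to a small lookup on the query type and the database type. The sketch below expresses that in Python; `choose_blast_program` and its `translate_both` argument are hypothetical names introduced here, not part of BLAST itself.

```python
BLAST_PROGRAMS = {
    # (query type, database type): program
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST flavour from the sequence types of the query and database."""
    if translate_both:
        # tblastx translates both sides, so both must be nucleotide sequences.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("protein", "nucleotide"))  # → tblastn
```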

To see the parameters for `blastn`, run the cell below.

In [None]:
!blastn -h

Now on to the fun part! Let's search `virus_protein.fasta` against our downloaded E. coli genome.
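One way this search could be set up is sketched below in Python. It rests on several assumptions: that `virus_protein.fasta` holds protein sequences and sits in the working directory, that the nucleotide database was built by running `makeblastdb` on `data/NC_002127.fna` without renaming it, and that tabular output (`-outfmt 6`) is wanted. A protein query against a nucleotide database calls for `tblastn`. The helper names and the output path `hits.tsv` are made up, and the sample line fed to the parser is an invented placeholder, not a real result.

```python
import shlex

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def tblastn_command(query, database, out_path):
    """Build a tblastn command that writes tabular hits to `out_path`."""
    return ["tblastn", "-query", query, "-db", database,
            "-outfmt", "6", "-out", out_path]


def parse_tabular_hits(text):
    """Turn -outfmt 6 text into a list of dicts with numeric fields converted."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(OUTFMT6_COLUMNS, line.split("\t")))
        for name in INT_COLUMNS:
            hit[name] = int(hit[name])
        for name in FLOAT_COLUMNS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# Paths below are assumptions; adjust them to where your files actually live.
print(shlex.join(tblastn_command("virus_protein.fasta", "data/NC_002127.fna", "hits.tsv")))

# Invented placeholder line, used only to show what the parser returns.
sample = "query1\tsubject1\t95.0\t40\t2\t0\t1\t40\t101\t220\t1e-20\t80.5"
print(parse_tabular_hits(sample)[0]["evalue"])
```

The printed command can be run in a notebook cell with a leading `!`; once `hits.tsv` exists, its contents can be passed to `parse_tabular_hits` to filter hits by `evalue` or `pident`.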