<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 01 - Using BLAST Programmatically

## Table of Contents

1. [Introduction](#introduction)
2. [Python imports](#imports)
3. [Running and analysing a local BLASTX search](#blastx)
   1. [Run BLASTX](#runblastx)

<a id="introduction"></a>
## Introduction

This notebook presents examples of methods for using `BLAST` programmatically, with a local installation of `BLAST`.

It can be very convenient to use a web interface - such as the [NCBI interface](https://blast.ncbi.nlm.nih.gov/Blast.cgi), or a project-specific instance (like this one at the [PGSC](http://solanaceae.plantbiology.msu.edu/blast.shtml)) - for `BLAST` searches, but there can be limitations to this approach.

It may not be practical to submit a large number of simultaneous queries via a web form, either because the interface prevents this or because it is tiresome to point and click over and over again. The interface may not make it easy to change the options you need to refine your query. The web database may not contain the sequences you want to search against (if, for example, some of the sequences are proprietary), or it may not be restricted to a relevant set of organisms, so the search takes much longer than it needs to for your purposes. And if you need to repeat a query, it can be awkward to reproduce the same settings every time.

<br><div class="alert-success">
<b>Using a programmatic approach to submitting <code>BLAST</code> queries provides a number of potential advantages</b>
</div><br>

* It is easy to set up repeatable searches for many sequences, or collections of sequences
* It is easy to read in the search results and conduct downstream analyses that add value to your search
* The same code can be readily adapted to different BLAST instances, databases, and servers

The code that we develop in this notebook will be adapted for use in the next notebook: `02-programming_web_blast`, but we focus on a local installation now to understand the principles.

<a id="imports"></a>
## Python imports

To interact with the local installation of `BLAST`, we will use the free `Biopython` programming tools. These provide an interface for constructing `BLAST` commands, running jobs, and reading in the output files.

To collate the `BLAST` search results as dataframes/tables for analysis, we will use the `pandas` package.

To graph the downstream results, we will use the `seaborn` visualisation package.

We import these tools, and some standard library packages for working with files (`os`) below.

In [1]:
# Standard library packages
import os

# Import Pandas
import pandas as pd

# Import Biopython tools for running local BLASTX
from Bio.Blast.Applications import NcbiblastxCommandline
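
Because everything in this notebook relies on a local installation of `BLAST`, it can be worth confirming that the executables are visible to Python before going further. The snippet below is a minimal sketch using the standard library's `shutil.which`. The helper name `find_blast_tools` is hypothetical, and the sketch assumes the programs are installed under their usual BLAST+ names (`blastx`, `makeblastdb`).

```python
import shutil


def find_blast_tools(tools=("blastx", "makeblastdb")):
    """Map each tool name to its full path, or None if it is not on the PATH.

    Hypothetical helper for illustration; not part of Biopython.
    """
    return {tool: shutil.which(tool) for tool in tools}


# Report which of the expected executables can be found
for tool, path in find_blast_tools().items():
    print("%s: %s" % (tool, path if path else "NOT FOUND on PATH"))
```

If a tool is reported as not found, the `BLAST` commands later in the notebook are unlikely to run until the installation (or the `PATH`) is fixed.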

<a id="blastx"></a>
## Running and analysing a local BLASTX search

As a first worked example, we will run a local `BLASTX` search, querying a nucleotide sequence against a local protein database, to identify potential homologues.

* The database comprises predicted gene products from five *Kitasatospora* genomes
* The query is a single nucleotide sequence of a predicted penicillin-binding protein from *Kitasatospora* sp. CB01950

We will use Python/`biopython` in the code blocks below to first perform the `BLASTX` search, then parse the results into a `pandas` dataframe, and finally we will plot some summary statistics using `seaborn`.

<a id="runblastx"></a>
### Run `BLASTX`

There are two steps to running a `BLAST` command line with `biopython`.

1. Create the command-line object
2. Run the command-line object

To create the command-line, we need to provide the same information as if we were running `BLAST` at the terminal: the location of the query sequence file, the location of the database, and any arguments that modify the type of `BLAST` search we are running.

First, we define two variables that contain the paths to the input data directory and to the directory where we want to place our `BLAST` output file. Then we define variables that contain paths to the input query sequence file, the database we're searching against, and the file that will contain the `BLAST` output.

In [2]:
# Define paths to input and output directories
datadir = os.path.join('data', 'kitasatospora')   # input
outdir = os.path.join('output', 'kitasatospora')  # output
os.makedirs(outdir, exist_ok=True)                # create output directory if it doesn't exist

# Define paths to input and output files
query = os.path.join(datadir, 'k_sp_CB01950_penicillin.fasta')           # query sequence(s)
db = os.path.join(datadir, 'kitasatospora_proteins.faa')                 # BLAST database
blastout = os.path.join(outdir, 'AMK19_00175_blastx_kitasatospora.tab')  # BLAST output
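
`BLAST` searches a formatted database rather than a raw FASTA file, and this notebook appears to assume that `kitasatospora_proteins.faa` has already been formatted. If it has not, a sketch like the one below could build it with the BLAST+ `makeblastdb` program. The helper name `makeblastdb_command` is hypothetical, and the sketch assumes `makeblastdb` is on the `PATH`; the build step is skipped if the tool or the input file is missing.

```python
import os
import shutil
import subprocess


def makeblastdb_command(fasta_path, dbtype="prot"):
    """Build the argument list that formats a FASTA file as a BLAST database.

    Hypothetical helper for illustration; dbtype is 'prot' or 'nucl'.
    """
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype]


# Same database path as defined in the notebook
db_fasta = os.path.join("data", "kitasatospora", "kitasatospora_proteins.faa")
cmd = makeblastdb_command(db_fasta)
print(" ".join(cmd))

# Only attempt the build if the tool and the input file are both available
if shutil.which("makeblastdb") and os.path.isfile(db_fasta):
    subprocess.run(cmd, check=True)
```

With no `-out` option, `makeblastdb` names the database after the input file, so the `db` path used in this notebook would still point at it.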

<div class="alert-danger">
When using a Jupyter notebook, if you ever forget how exactly to use a Python function or class, you can use Python's inbuilt `help()` system. We use this in the cell below to get information on how to construct a `BLASTX` command, using the `NcbiblastxCommandline` object we imported above:
</div>

In [3]:
# Get help with how to construct the command-line
help(NcbiblastxCommandline)

Help on class NcbiblastxCommandline in module Bio.Blast.Applications:

class NcbiblastxCommandline(_NcbiblastMain2SeqCommandline)
 |  Wrapper for the NCBI BLAST+ program blastx (nucleotide query, protein database).
 |  
 |  With the release of BLAST+ (BLAST rewritten in C++ instead of C), the NCBI
 |  replaced the old blastall tool with separate tools for each of the searches.
 |  This wrapper therefore replaces BlastallCommandline with option -p blastx.
 |  
 |  >>> from Bio.Blast.Applications import NcbiblastxCommandline
 |  >>> cline = NcbiblastxCommandline(query="m_cold.fasta", db="nr", evalue=0.001)
 |  >>> cline
 |  NcbiblastxCommandline(cmd='blastx', query='m_cold.fasta', db='nr', evalue=0.001)
 |  >>> print(cline)
 |  blastx -query m_cold.fasta -db nr -evalue 0.001
 |  
 |  You would typically run the command line with cline() or via the Python
 |  subprocess module, as described in the Biopython tutorial.
 |  
 |  Method resolution order:
 |      NcbiblastxCommandline
 |      

The information above tells us how to pass the paths to the query sequence and database, and how to specify other values that control `BLASTX`, e.g.:

```
cline = NcbiblastxCommandline(query="m_cold.fasta", db="nr", evalue=0.001)
```

<br><div class="alert-success">
We use this information to create a command-line object in the variable `cmd_blastx` that we can use to run our `BLASTX` query:
</div><br>

In [4]:
# Create command-line for BLASTX
cmd_blastx = NcbiblastxCommandline(query=query, out=blastout, outfmt=6, db=db)

The `cmd_blastx` object now contains instructions that are equivalent to running `BLASTX` at the command-line. We can even get it to print out a command-line that we could copy-and-paste into the terminal, to run the search:

In [5]:
# Get a working command-line
print(cmd_blastx)

blastx -out output/kitasatospora/AMK19_00175_blastx_kitasatospora.tab -outfmt 6 -query data/kitasatospora/k_sp_CB01950_penicillin.fasta -db data/kitasatospora/kitasatospora_proteins.faa
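
The help text above notes that a command line can also be run via the Python `subprocess` module, which expects the command as a list of arguments rather than a single string. As a rough sketch, the same `BLASTX` command could be assembled without Biopython as shown below. The helper name `blastx_args` is hypothetical, and the paths simply repeat those defined earlier in the notebook.

```python
import os
import shlex


def blastx_args(query, db, out, outfmt=6):
    """Return a blastx command as a list of arguments.

    Hypothetical helper for illustration; it only builds the list
    and does not run anything.
    """
    return ["blastx", "-query", query, "-db", db, "-out", out, "-outfmt", str(outfmt)]


datadir = os.path.join("data", "kitasatospora")
outdir = os.path.join("output", "kitasatospora")

args = blastx_args(
    query=os.path.join(datadir, "k_sp_CB01950_penicillin.fasta"),
    db=os.path.join(datadir, "kitasatospora_proteins.faa"),
    out=os.path.join(outdir, "AMK19_00175_blastx_kitasatospora.tab"),
)

# shlex.join produces a single, safely quoted string for pasting into a shell
print(shlex.join(args))
```

The order of the options differs from the Biopython output, but `BLAST` does not depend on option order, so the two commands request the same search.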


We don't need to use the terminal at all, though. We can run the `BLASTX` search from Python, by *calling* the `cmd_blastx` object, with:

```
cmd_blastx()
```

<br><div class="alert-warning">
Although the code above is the <i>simplest</i> way to run the command, it can be worth doing something slightly more complex.<br><br />
Any command-line program can write information to two special <i>streams</i>: `STDOUT` and `STDERR` (pronounced 'standard-out' and 'standard-error'). As you might expect, `STDOUT` receives the program's normal output, and errors and warnings are reported to `STDERR`. It is good practice to 'catch' these streams, and check them for reports from the program that's being run.
</div>

In [6]:
# Run BLASTX, and catch STDOUT/STDERR
stdout, stderr = cmd_blastx()

# Check STDOUT, STDERR
print("STDOUT: %s" % stdout)
print("STDERR: %s" % stderr)

STDOUT: 
STDERR: 


If everything has worked, there should be no information in either `STDOUT` or `STDERR`, because the search results are written to the file we named with `out` rather than to `STDOUT`. You should, however, now see a file named `AMK19_00175_blastx_kitasatospora.tab` in the `output/kitasatospora` directory. This file contains your `BLASTX` search results, and we shall import and inspect these in the next section.
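
Empty streams are a useful sign, but a program's exit code is a more direct signal of success: by convention, zero means the program finished normally. The sketch below shows one way to capture both streams and the exit code with the standard library's `subprocess.run`. The helper name `run_command` is hypothetical, and the Python interpreter stands in for `blastx` so that the example runs even without `BLAST` installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return (returncode, stdout, stderr) as text.

    Hypothetical helper for illustration.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Stand-in command: the Python interpreter writes one line to each stream
code = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"
returncode, stdout, stderr = run_command([sys.executable, "-c", code])

# Check the exit code, STDOUT and STDERR
print("Exit code: %d" % returncode)
print("STDOUT: %s" % stdout)
print("STDERR: %s" % stderr)
```

Passing a `blastx` argument list to the same helper would run the search itself, and a non-zero exit code would then be a prompt to read `STDERR` closely.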