<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 04 - Using the NCBI `BLAST+` Service with Biopython

## Table of Contents

1. [Introduction](#introduction)
2. [Python imports](#imports)
3. [Running and analysing a remote BLASTX search](#blastx)
  1. [Run `BLASTX`](#runblastx)
  2. [Load `BLASTX` results](#loadresults)
  3. [Exercise 01](#ex01)

## Introduction

This notebook presents examples of methods for using `BLAST` programmatically, with the webservice provided by NCBI. All calculations will be run on NCBI's servers, using NCBI's databases (not your local `BLAST` installation), but you will be controlling the search using Python code in this notebook.

The advantages of using a programmatic interface for remote `BLAST` queries are the same as those listed in [notebook 03](03-programming_for_blast.ipynb):

* It is easy to set up repeatable searches for many sequences, or collections of sequences
* It is easy to read in the search results and conduct downstream analyses that add value to your search

Where it is not practical to submit a large number of simultaneous queries via a web form (because it is tiresome to point-and-click over and over again), this can be handled programmatically instead. You also have more opportunity to customise options and refine your query than with the website interface. And if you need to repeat a query, a programmatic approach makes it trivial to use the same settings every time.

<a id="imports"></a>
## Python imports

To interact with NCBI's `BLAST` service, we will again use the free `Biopython` programming tools. These provide an interface to NCBI's `BLAST` server for running jobs and retrieving the output files.

To collate the `BLAST` search results as dataframes/tables for analysis, we will use the `pandas` package.

To graph the downstream results, we will use the `seaborn` visualisation package.

We import these tools below, along with the standard library package `os` for working with files.

In [1]:
# Show plots as part of the notebook
%pylab inline

# Standard library packages
import os

# Import Pandas and Seaborn
import pandas as pd
import seaborn as sns

# Import Biopython tools for running remote BLAST searches
from Bio.Blast import NCBIWWW

# Import Biopython SeqIO module to handle reading sequence data
from Bio import SeqIO

Populating the interactive namespace from numpy and matplotlib


<a id="blastx"></a>
## Running and analysing a remote BLASTX search

As in [notebook 03](03-programming_for_blast.ipynb), our first worked example will be to run a `BLASTX` search, querying a nucleotide sequence against a protein database, to identify potential homologues. What is different about this search is that we will be conducting it at NCBI, and using a different database.

We will use `Biopython` in the code blocks below to first perform the same `BLASTX` search you carried out in the exercise from [notebook 01](01-blast_at_NCBI_website.ipynb): querying a penicillin-binding protein from *Kitasatospora* against other *Kitasatospora* species, and writing the results to file. Then we will parse the results into a `pandas` dataframe, and finally plot some summary statistics using `seaborn`.

<a id="runblastx"></a>
### Run `BLASTX`

Running a remote `BLAST` search with `Biopython` is, in some ways, simpler than running a local `BLAST` query. The key steps are:

1. Read the query sequence(s) from a source (possibly a local file, but maybe a remote database)
2. Run a remote job with the `NCBIWWW.qblast()` method, specifying your query sequence, database, and `BLAST` program
3. Parse the output you get back from NCBI

To run the remote job, we need the same kind of information as if we were running `BLAST` via the web interface - these arguments are compulsory:

* the `BLAST` program to use
* query sequence(s) to search with
* the database to search in

but we can provide some extra choices when we run the remote job, including restricting the remote search on the basis of taxonomy, just as we did in the exercise from notebook 01 (we'll do this later).
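
The compulsory and optional arguments described above can be gathered in one place, which makes it easier to repeat a search with identical settings. The sketch below is only an illustration: `build_qblast_args()` and `run_remote_blast()` are hypothetical helper names, not part of `Biopython`, and the taxonomy restriction uses the same `txid2063[ORGN]` style of Entrez query that appears later in this notebook.

```python
def build_qblast_args(program, database, query_fasta, taxid=None, format_type="Text"):
    """Collect qblast() arguments in one dictionary so a search can be repeated exactly."""
    args = {
        "program": program,        # compulsory: which BLAST program to run
        "database": database,      # compulsory: which NCBI database to search
        "sequence": query_fasta,   # compulsory: the query, as a FASTA-formatted string
        "format_type": format_type,
    }
    if taxid is not None:
        # Optional: restrict the search to one NCBI taxonomy ID
        args["entrez_query"] = f"txid{taxid}[ORGN]"
    return args


def run_remote_blast(args):
    """Submit the search to NCBI (needs Biopython and a network connection)."""
    from Bio.Blast import NCBIWWW  # imported here so the helper above works offline
    return NCBIWWW.qblast(**args)


# Build (but do not submit) an example set of arguments with a short dummy query
example_args = build_qblast_args("blastx", "refseq_protein",
                                 ">query\nGTGAACAAGCCG\n", taxid=2063)
print(example_args)
```

Keeping the settings in a dictionary means the same dictionary can be saved, inspected, or reused for another query sequence; calling `run_remote_blast(example_args)` would then submit the job to NCBI.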

<div class="alert-success">
<b>Our first task is to obtain a query sequence for the search</b>
</div>

We use the same penicillin-binding protein from notebook 01, and read it from a local file with `Biopython`'s `SeqIO` module.

When data such as biological sequences are read in, their metadata - information on database IDs, and other features - follows them. `Biopython` does a nice job of showing us this information if we look at it with the `print()` function:

In [23]:
# Load sequence of penicillin-binding protein, and inspect the information
seq = SeqIO.read("data/kitasatospora/k_sp_CB01950_penicillin.fasta", "fasta")
print(seq)

ID: lcl|LISX01000001.1_cds_OKJ16671.1_31
Name: lcl|LISX01000001.1_cds_OKJ16671.1_31
Description: lcl|LISX01000001.1_cds_OKJ16671.1_31 [locus_tag=AMK19_00175] [protein=penicillin-binding protein] [protein_id=OKJ16671.1] [location=39730..41184]
Number of features: 0
Seq('GTGAACAAGCCGATCCGCCGGGTGTCGATCTTCTGCCTGGTCCTGATCCTGGCC...TAG', SingleLetterAlphabet())


<div class="alert-warning">
<b>However, the remote <code>BLAST</code> server requires us to present our sequence in <code>FASTA</code> format!</b>
</div>

One of the clever things about `Biopython`'s sequence objects - and a big advantage of using programmatic approaches - is that we can readily convert our sequence information into a number of different formats. To do this, we can use the sequence's `.format()` method to produce a `FASTA`-formatted string.

Doing this does not change the original sequence or its information in any way, but it creates a new presentation of that data, which we can use as our query:

In [24]:
# We need the sequence as a string, so use the .format() method
print(seq.format("fasta"))

>lcl|LISX01000001.1_cds_OKJ16671.1_31 [locus_tag=AMK19_00175] [protein=penicillin-binding protein] [protein_id=OKJ16671.1] [location=39730..41184]
GTGAACAAGCCGATCCGCCGGGTGTCGATCTTCTGCCTGGTCCTGATCCTGGCCCTGATG
CTCCGGGTGAACTGGGTGCAGGGCGTTCAGGCGTCGACGTGGGCCAACAACCCGCACAAC
GACCGCACCAAGTACGACAAGTACGCCTACCCGCGCGGCAACATCATCGTCGGCGGCCAG
GCCGTCACCAAGTCCGACTTCGTCAACGGGCTGCGCTACAAGTACAAGCGCTCCTGGGTG
GACGGGCCGATGTACGCGCCGGTCACCGGCTACTCCTCGCAGACGTACGACGCCAGCCAG
CTGGAGAAGCTGGAGGACGGCATCCTCTCCGGCACCGACTCGCGGCTGTTCTTCCGCAAC
ACCCTGGACATGCTGACCGGCAAGCCCAAGCAGGGCGGCGACGTGGTCACCACCATCGAC
CCCAAGGTGCAGAAGGCCGGCTTCGAGGGGCTCGGCAACAAGAAGGGCGCCGCGGTCGCC
ATCGACCCGAAGACCGGGGCGATCCTCGGGCTGGTCTCCACCCCGTCCTACGACCCGGGC
ACCTTCGCGGGCGGCACCAAGGACGACGAGAAGGCCTGGACGGCACTCGACAGCGACCCG
AACAAGCCGATGCTGAACCGGGCGCTGCGCGAGACCTACCCGCCCGGCTCGACCTTCAAG
CTGGTCACCGCGGCGACCGCGTTCGAGACCGGCAAGTACCAGAGCCCGTCGGACGTCACC
GACACCCCGGACCAGTACATCCTGCCCGGCACCAGCACCCCGCTGATCAACGCCAGCCCC
ACCGAGGACTGCGGGAACGCCACCGTGCAACACGCGATGGACCTGTCCTGCAACACGGTG
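
To make the structure of a `FASTA`-formatted string concrete, here is a minimal standard-library sketch of how such a string can be assembled: a header line starting with `>`, followed by the sequence wrapped at a fixed width. The function `to_fasta()` is a hypothetical helper written for illustration, not `Biopython`'s implementation, and the 60-character line width is an assumption chosen to match the output shown above.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Return a FASTA-formatted string: one '>' header line, then wrapped sequence."""
    header = f">{identifier} {description}".rstrip()
    # Split the sequence into consecutive chunks of at most `width` characters
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"


# A made-up 120-base sequence, used only to show the wrapping
fasta_text = to_fasta("query_01", "[protein=example]", "GTGAACAAGCCG" * 10)
print(fasta_text)
```

In practice the sequence object's own `.format("fasta")` method does this work, and carries the record's real identifier and description along with it.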

<div class="alert-success">
<b>We are now almost ready to build our <code>BLAST</code> query</b>
</div>

The last two things we need to do are to consider the database we're going to search against, and the format we want the data returned in.

* We are going to query against the `refseq_protein` database, as this should return a result fairly quickly.
* We are going to request `"Text"` format output, as this will be easy to read.

<div class="alert-warning">
<b>The search will return a special kind of object called a 'handle', which we still need to do some work with</b>
</div>
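
A handle behaves much like an open file: its contents are only available once they are read, and reading moves a position marker through the data. The sketch below uses `io.StringIO` from the standard library as a stand-in for the object returned by a search; the example text is invented, and whether a real result handle can be rewound with `seek()` may depend on your `Biopython` version, so treat that part as an assumption.

```python
import io

# Stand-in for the object returned by the search: an in-memory, file-like handle
handle = io.StringIO("BLASTX 2.6.1+\nRID: EXAMPLE\n")

first_read = handle.read()    # returns the whole contents and moves to the end
second_read = handle.read()   # nothing is left to read, so this is an empty string

print(repr(first_read))
print(repr(second_read))

handle.seek(0)                # in-memory handles can be rewound to the start
reread = handle.read()
```

This is why it is usually safest to read a handle once and keep the result in a variable, rather than reading from the handle repeatedly.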

In [25]:
# Run a remote BLASTX search, with the penicillin-binding protein as a query,
# against the refseq_protein database, getting text output
result_handle = NCBIWWW.qblast("blastx", "refseq_protein", seq.format("fasta"),
                               format_type="Text")

Once the search has completed, we will `read()` the contents of the handle into a variable called `output`, and then inspect the contents using the `print()` function.

In [26]:
# Read the contents of the handle, and print to the cell
output = result_handle.read()
print(output)

<p><!--
QBlastInfoBegin
	Status=READY
QBlastInfoEnd
--><p>
<PRE>
BLASTX 2.6.1+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro
A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and
David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs", Nucleic
Acids Res. 25:3389-3402.


RID: B0CPKFSE015


Database: NCBI Protein Reference Sequences
           77,318,157 sequences; 29,548,319,843 total letters
Query= lcl|LISX01000001.1_cds_OKJ16671.1_31 [locus_tag=AMK19_00175]
[protein=penicillin-binding protein] [protein_id=OKJ16671.1]
[location=39730..41184]

Length=1455


                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value

WP_073807071.1  penicillin-binding protein [Kitasatospora sp. ...  996     0.0  
WP_035864460.1  penicillin-binding protein [Kitasatospora chee...  946     0.0  
WP_043915377.1  penicillin-binding protein [Ki

<div class="alert-warning">
<h3>QUESTIONS</h3>
<ol>
<li> How is this output different to that from command-line and web-interface <code>BLAST</code> results?
<li> How many hits are there?
<li> What is the "best hit" to the query? Why do you think it is the "best hit" (what in the results tells you this?)
<li> At what point do you think the matches start to become less reliable? Why do you think this? (<i>HINT:</i> inspect the alignments)
</ol>
</div>

<a id="save"></a>
### Save `BLASTX` results to file

The results we have are human-readable, and similar to the default output type from command-line/terminal `BLAST`. But, for now, they exist only in the variable called `output`. If we want to come back to these results, we will need to save them to a file somewhere.

This is a common operation in programmatic approaches to bioinformatics: once a result is obtained, we usually want to save it to a file.

The Python code for saving the contents of `output` to a file is given in the cell below:

In [27]:
# Save output to file
outfilename = "output/kitasatospora/remote_blastx_query_01.txt"
with open(outfilename, 'w') as outfh:
    outfh.write(output)

This code does three main things:

1. It creates a variable called `outfilename`, with the path to the file we want to write
2. It opens that file, ready for writing, as a *handle* called `outfh`
3. It writes the contents of `output` into the `outfh` *handle*

When this is done, the `BLAST` search results we got from NCBI are written to a file, as though we did the search locally. You can inspect the contents of that file at the terminal using a command like:

```bash
less output/kitasatospora/remote_blastx_query_01.txt
```
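
Writing to a file fails if the target directory does not exist yet. A slightly more defensive version of the save step, sketched below, creates the directory first and then reads the file back to confirm what was written. It runs in a temporary directory with placeholder text standing in for real `BLAST` output, so it does not touch your `output/` folder; adapt the path for real use.

```python
import os
import tempfile

# Placeholder text standing in for the real search results
output = "BLASTX 2.6.1+\n(example text standing in for real results)\n"

# A temporary directory keeps this sketch from touching the real output folder
with tempfile.TemporaryDirectory() as tmpdir:
    outfilename = os.path.join(tmpdir, "kitasatospora", "remote_blastx_query_01.txt")

    # Create the parent directory if it is missing; no error if it already exists
    os.makedirs(os.path.dirname(outfilename), exist_ok=True)

    # Write the results, then read them back to check the file contents
    with open(outfilename, "w") as outfh:
        outfh.write(output)
    with open(outfilename) as infh:
        saved = infh.read()

print(saved == output)
```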

In [19]:
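# Repeat the BLASTX search against the nr database, restricting the search
# by taxonomy with an Entrez query (here, NCBI taxonomy ID 2063)
result_handle = NCBIWWW.qblast("blastx", "nr", seq.format("fasta"),
                               entrez_query="txid2063[ORGN]",
                               format_type="Text")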

In [14]:
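# Parse BLAST results; NCBIXML.parse() needs a handle from a search run with
# format_type="XML", and cannot read the "Text" output requested above
from Bio.Blast import NCBIXML
blast_result = NCBIXML.parse(result_handle)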

In [20]:
# Read the contents of the handle, and print to the cell
output = result_handle.read()
print(output)

<p><!--
QBlastInfoBegin
	Status=READY
QBlastInfoEnd
--><p>
<PRE>
BLASTX 2.6.1+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro
A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and
David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs", Nucleic
Acids Res. 25:3389-3402.


RID: B08XTG0G015


Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples
from WGS projects
           114,782,014 sequences; 42,075,798,770 total letters
Query= lcl|LISX01000001.1_cds_OKJ16671.1_31 [locus_tag=AMK19_00175]
[protein=penicillin-binding protein] [protein_id=OKJ16671.1]
[location=39730..41184]

Length=1455


                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value

WP_073807071.1  penicillin-binding protein [Kitasatospora sp. ...  996     0.0   
WP_035864460.1  penicillin-binding protein 