# 8. BLAST

Working with BLAST can be split into two steps, both of which can be done from within Biopython:
first, running BLAST for your query sequence(s) and getting some output; second, parsing the BLAST
output in Python for further analysis.

Your first introduction to running BLAST was probably via the NCBI web service. In fact, there are
lots of ways you can run BLAST, which can be categorised in several ways. The most important distinction
is between running BLAST locally (on your own machine) and running BLAST remotely (on another machine,
typically the NCBI servers). Running BLAST locally is not part of this Python course; in this chapter we will run BLAST remotely from within a Python script.

## 8.1 Running BLAST over the internet

We use the function `qblast()` in the `Bio.Blast.NCBIWWW` module to call the online version of BLAST. This
has three required arguments:
- The first argument is the BLAST program to use for the search, as a lower case string. The options and descriptions of the programs are available at https://blast.ncbi.nlm.nih.gov/Blast.cgi. Currently `qblast` only works with blastn, blastp, blastx, tblastn and tblastx.
- The second argument specifies the database to search against. Again, the options for this are available in the NCBI Guide to BLAST: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf.
- The third argument is a string containing your query sequence. This can either be the **sequence** itself, the **sequence in FASTA** format, or an **identifier like a GI number**.

The `qblast` function also takes a number of optional arguments, which are basically analogous to the
different parameters you can set on the BLAST web page. We'll just highlight a few of them here:

- The argument `url_base` sets the base URL for running BLAST over the internet. By default it connects to the NCBI, but one can use this to connect to an instance of NCBI BLAST running in the cloud. 
- The `qblast` function can return the BLAST results in various formats, which you can choose with the optional `format_type` keyword: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser, described in section 8.2 below.
- The argument `expect` sets the expectation or *e-value* threshold.
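
The optional arguments above are passed to `qblast()` as keywords. The sketch below collects a few of them in a dictionary and forwards them to the search. The helper names `build_qblast_options` and `run_blastn`, and the chosen values (an e-value cutoff of 0.001 and at most 5 hits), are assumptions for illustration and not recommended settings. `hitlist_size` is a further `qblast()` keyword that limits the number of hits returned.

```python
def build_qblast_options(e_value=1e-3, max_hits=5):
    """Collect optional qblast() keyword arguments (hypothetical helper)."""
    return {
        "expect": e_value,         # e-value threshold
        "hitlist_size": max_hits,  # maximum number of hits to return
        "format_type": "XML",      # the format the XML parser expects
    }


def run_blastn(query, **options):
    """Run an online blastn search against nt (hypothetical helper)."""
    # Imported here so that Biopython and the network are only touched
    # when a search is actually requested.
    from Bio.Blast import NCBIWWW

    return NCBIWWW.qblast("blastn", "nt", query, **options)


options = build_qblast_options(e_value=1e-3, max_hits=5)
print(options)

# To run the (slow) online search, uncomment:
# result_handle = run_blastn("8332116", **options)
```

Keeping the options in one dictionary makes it easier to compare them with the settings used on the BLAST web page when results differ.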

In [None]:
# Imports
from Bio.Blast import NCBIWWW
help(NCBIWWW.qblast)

Note that the default settings on the NCBI BLAST website are not quite the same as the defaults
on QBLAST. If you get different results, you'll need to check the parameters (e.g., the expectation value
threshold and the gap values).

For example, if you have a nucleotide sequence you want to search against the nucleotide database (nt)
using BLASTN, and you know the GI number of your query sequence, you can use:

In [None]:
# Takes a long time to run
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
result_handle

Alternatively, if we have our query sequence already in a FASTA formatted file, we just need to open the
file and read in this record as a string, and use that as the query argument:

In [None]:
with open("data/NC_005816.fna") as fh:
    fasta_string = fh.read()
# Run the search once the file has been read and closed
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
result_handle

We could also have read in the FASTA file as a SeqRecord and then supplied just the sequence itself:

In [None]:
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.fna", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
result_handle

Supplying just the sequence means that BLAST will assign an identifier for your sequence automatically.
You might prefer to use the `SeqRecord` object's `format` method to make a FASTA string (which will include
the existing identifier):

In [None]:
record = SeqIO.read("data/NC_005816.fna", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))
result_handle

This approach makes more sense if you have your sequence(s) in a non-FASTA file format which you can
extract using `Bio.SeqIO`.

Whatever arguments you give the `qblast()` function, you should get back your results in a handle
object (by default in XML format). The next step would be to parse the XML output into Python objects
representing the search results (section 8.2), but you might want to save a local copy of the output file first.

This is especially useful when debugging code that extracts information from the BLAST results, because
re-running the online search is slow and wastes NCBI computer time.

**You can use `result_handle.read()` to read the BLAST output only
once - calling `result_handle.read()` again returns an empty string.**

In [None]:
with open("data/my_blast.xml", "w") as out_handle:
    out_handle.write(result_handle.read())
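
The read-once behaviour can be checked with any file-like object. The sketch below uses an `io.StringIO` holding a placeholder string as a stand-in for the handle returned by `qblast()`; the XML content and the temporary file location are made up for illustration. It reads the handle once into a variable and writes that variable to disk, so the data stays available afterwards.

```python
import io
import os
import tempfile

# Stand-in for the handle returned by qblast(); the content is a placeholder
result_handle = io.StringIO("<?xml version='1.0'?><BlastOutput/>")

first_read = result_handle.read()   # the full content
second_read = result_handle.read()  # the handle is now exhausted

# Keep the data in a variable and write that, instead of reading again
out_path = os.path.join(tempfile.mkdtemp(), "my_blast.xml")
with open(out_path, "w") as out_handle:
    out_handle.write(first_read)

# Read the saved copy back to confirm nothing was lost
with open(out_path) as in_handle:
    saved = in_handle.read()

print(len(first_read), len(second_read), saved == first_read)
```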

After doing this, the results are in the file `data/my_blast.xml` and the original handle has had all its data
extracted. The `parse` function of the BLAST parser (described in section 8.2) takes a
file-handle-like object, so we can just open the saved file for input:

In [None]:
# Reopen the saved file to get a fresh handle for the parser
result_handle = open("data/my_blast.xml")

Now that we've got the BLAST results back into a handle again, we are ready to do something with
them, so this leads us right into the parsing section.

## 8.2 Parsing BLAST output
BLAST can generate output in various formats, such as XML, HTML and plain text. Working with the XML format is highly recommended: its layout is more stable than that of the plain text and HTML output, and it is much easier to parse automatically, which makes code built on the Biopython parser a lot more robust.

You can get BLAST output in XML format in various ways. For the parser, it doesn't matter how the output was generated, as long as it is in the XML format. You then need to get a handle to the results. 

Let's fetch some data:

In [None]:
# Uncomment following lines if you haven't run them above
#from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")

In [None]:
result_handle

Once we have a handle to the BLAST output (in XML format), we can either parse a single result or multiple results. 

Just like `Bio.SeqIO` and `Bio.AlignIO`, we have a pair of input functions, `read`
and `parse`: `read` is for when you have exactly one object, and `parse` is an iterator for when you can
have lots of objects. Reading gives a single BLAST record object; parsing gives an iterator over
BLAST record objects.

In [None]:
from Bio.Blast import NCBIXML

In [None]:
# For a single BLAST result
#blast_record = NCBIXML.read(result_handle)

In [None]:
# For multiple BLAST results in one XML file
blast_records = NCBIXML.parse(result_handle)
blast_records

In order to iterate over the objects that are part of the `blast_records`, we can use the following:
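
A minimal sketch of such a loop, using the attribute names of the `Bio.Blast.NCBIXML` record objects (`alignments`, `hsps`, `title`, `length`, `expect`, `score`). The function name `summarise_hits`, the e-value cutoff of 0.04, and the stand-in record built with `SimpleNamespace` are assumptions for illustration, so that the loop can be tried without waiting for an online search; the stand-in values are invented.

```python
from types import SimpleNamespace


def summarise_hits(blast_records, e_value_thresh=0.04):
    """Return (title, length, e-value, score) for each HSP below the cutoff."""
    hits = []
    for blast_record in blast_records:             # one record per query
        for alignment in blast_record.alignments:  # one alignment per hit
            for hsp in alignment.hsps:             # high-scoring segment pairs
                if hsp.expect < e_value_thresh:
                    hits.append(
                        (alignment.title, alignment.length, hsp.expect, hsp.score)
                    )
    return hits


# Invented stand-in that mimics only the attributes used above
fake_record = SimpleNamespace(
    alignments=[
        SimpleNamespace(
            title="example hit A",
            length=1200,
            hsps=[SimpleNamespace(expect=1e-10, score=90.0)],
        ),
        SimpleNamespace(
            title="example hit B",
            length=800,
            hsps=[SimpleNamespace(expect=0.7, score=47.0)],
        ),
    ]
)

for title, length, e_value, score in summarise_hits([fake_record]):
    print(title, length, e_value, score)
```

With real data, pass the iterator from `NCBIXML.parse(result_handle)` instead of the stand-in list. Because `parse` returns an iterator, the records can be stepped through only once; wrap it in `list()` if you need to revisit them.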

## 8.3 Exercise
Write a script that runs BLAST on the top 5 overrepresented sequences in a FASTQ file. Save the following information in a pandas DataFrame: title, e-value and score.


Here is a table that is part of the output of a FastQC process. The raw data can be obtained from the zipped folder that is always created as part of the process. This part represents the overrepresented sequences in a fastq file. The file that contains the data is stored under `data/overrepresented_sequences.txt`. 


```
#Sequence	Count	Percentage	Possible Source
GCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGG	634749	0.9399698125201895	No Hit
GCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGGCC	437871	0.6484224816077345	No Hit
GGGGACAGTCCGCCCCGCCCCCCACCGGGCCCCGAGAGAGGCGACGGAGG	319343	0.47289996493044484	No Hit
GGCTTCCTCGGCCCCGGGATTCGGCGAAAGCTGCGGCCGGAGGGCTGTAA	310651	0.4600283926862577	No Hit
GGGCCTTCCCGGCCGTCCCGGAGCCGGTCGCGGCGCACCGCCACGGTGGA	260086	0.3851490725611636	No Hit
ACGAATGGTTTAGCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGG	247602	0.3666621066273818	No Hit
CGGCTTCGTCGGGAGACGCGTGACCGACGGTCCCCCCGGGACCCGACGGC	170383	0.25231213687083787	No Hit
...
```
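
One possible starting point for reading this table, sketched with the standard-library `csv` module. The helper name `top_sequences` is an assumption, the rows are assumed to be sorted by count already (as in the table above), and the two rows embedded below are copied from that table so that the example runs without the data file.

```python
import csv
import io


def top_sequences(handle, n=5):
    """Return the first n sequences from a tab-separated FastQC table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line that starts with '#Sequence'
    sequences = [row[0] for row in reader if row]
    return sequences[:n]


# Two rows copied from the table above, used as a small in-memory sample
sample = (
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "GCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGG\t634749\t0.9399698125201895\tNo Hit\n"
    "GCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGGCC\t437871\t0.6484224816077345\tNo Hit\n"
)

print(top_sequences(io.StringIO(sample), n=1))
```

With the real file, pass `open("data/overrepresented_sequences.txt")` as the handle.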

In [None]:
# Imports
import pandas as pd
from Bio.Seq import Seq
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

Here is an example output: 

|   | Title                                             | Score | E-value      |
|--:|:--------------------------------------------------|------:|-------------:|
| 0 | PREDICTED: Arvicanthis niloticus 28S ribosomal... |  90.0 | 7.825980e-13 |
| 1 | PREDICTED: Arvicanthis niloticus 28S ribosomal... |  90.0 | 7.825980e-13 |
| 2 | Streptomyces sp. RPA4-2 chromosome, complete g... |  47.0 | 6.864950e-01 |
| 3 | PREDICTED: Canis lupus dingo 28S ribosomal RNA... |  73.0 | 6.016610e-08 |
| 4 | Chain 2, 28S rRNA                                 |  86.0 | 9.534000e-12 |

## Next session
Next session contains all the exercises. Practice makes perfect! Go to the exercises [here](09_Exercises_Day2.ipynb)