# BLAST Docker Jupyter Notebook

This notebook is adapted from NCBI's [BLAST Docker documentation](https://github.com/ncbi/docker/tree/master/blast) and uses a customized [BLAST](https://www.ncbi.nlm.nih.gov/books/NBK279690/) and [Entrez Direct (EDirect)](https://www.ncbi.nlm.nih.gov/books/NBK179288/) Docker image.

At the time of testing, the BLAST version was 2.7.1+ and the Entrez Direct version was 11.0. The latest versions of the tools available at that time were installed using Anaconda.

## How to use this notebook

Jupyter Notebook combines free text and runnable code in a single shareable document. If you are not familiar with Jupyter Notebook, take a look at the [documentation](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/).

The tools are already installed, using Docker, in the environment that generated this notebook.
To get started, click anywhere inside a code cell (shown in grey), then click the "Run" button above or press Shift + Enter. The first cell below prints the installed BLAST and Entrez Direct versions.


In [48]:
!blastn -version
!efetch -version

blastn: 2.7.1+
 Package: blast 2.7.1, build Sep 20 2018 02:20:26
11.0
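If the version check above fails, the tools may simply not be on the `PATH` of the notebook's environment. The following is a minimal sketch of how that could be checked from a Python cell. The helper name `find_tools` is hypothetical and not part of the original notebook; it only relies on the standard-library function `shutil.which`.

```python
import shutil

def find_tools(names):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}

# Command-line tools used in this notebook
for tool, path in find_tools(["blastn", "blastp", "makeblastdb", "efetch"]).items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

A tool reported as `NOT FOUND` suggests the notebook is running outside the Docker image described above.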


## What is NCBI BLAST?
The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool [(BLAST)](https://www.ncbi.nlm.nih.gov/pubmed/2231712) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

For a full description of the features and capabilities of BLAST+, please refer to the [BLAST Command Line Applications User Manual](https://www.ncbi.nlm.nih.gov/books/NBK279690/).

In this tutorial, a BLAST analysis consists of two steps:

* Data provisioning: obtaining and staging the query sequences and the sequences to be indexed as a database
* Running BLAST: comparing the query sequence(s) against the indexed database
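The two steps above can be sketched as a short sequence of command lines. The example below only builds and prints the argument lists; it does not run anything. The function names (`efetch_cmd`, `makeblastdb_cmd`, `blastp_cmd`, `show`) are hypothetical helpers, while the programs, flags, and file paths are the ones used in the cells of this notebook.

```python
import shlex

def efetch_cmd(ids):
    """Provisioning: fetch protein records as FASTA (efetch writes to standard output)."""
    return ["efetch", "-db", "protein", "-format", "fasta", "-id", ",".join(ids)]

def makeblastdb_cmd(fasta_path, db_prefix):
    """Provisioning: index a protein FASTA file as a BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot",
            "-parse_seqids", "-out", db_prefix]

def blastp_cmd(query_path, db_prefix, out_path):
    """Running BLAST: search a protein query against the indexed database."""
    return ["blastp", "-query", query_path, "-db", db_prefix, "-out", out_path]

def show(argv):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in argv)

steps = [
    efetch_cmd(["P01349"]),  # output is redirected to queries/P01349.fsa in the notebook
    makeblastdb_cmd("fasta/nurse-shark-proteins.fsa", "blastdb/nurse-shark-proteins"),
    blastp_cmd("queries/P01349.fsa", "blastdb/nurse-shark-proteins", "results/blastp.out"),
]
for argv in steps:
    print(show(argv))
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` later, if running them from Python is preferred over the `!` shell escape.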

### Provision Data
To create the directories that will hold the data, run the following cells.

In [None]:
%cd BLAST

In [56]:
!mkdir blastdb queries fasta results
!ls -al

total 32
drwsrwsr-x 1 jovyan users 4096 May  8 13:29 .
drwsrwsr-x 1 jovyan users 4096 May  8 13:04 ..
drwxr-sr-x 2 jovyan users 4096 May  8 13:29 blastdb
drwxr-sr-x 2 jovyan users 4096 May  8 13:29 fasta
drwxr-sr-x 2 jovyan users 4096 May  8 13:04 .ipynb_checkpoints
drwxr-sr-x 2 jovyan users 4096 May  8 13:29 queries
drwxr-sr-x 2 jovyan users 4096 May  8 13:29 results


To populate these directories with the sample data used in this example, run the
commands below:

In [58]:
# Retrieve sample query sequence
!efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa
# Retrieve sample database sequences
!efetch -db protein -format fasta \
    -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
    > fasta/nurse-shark-proteins.fsa

### Create/Index BLAST databases

To create the BLAST database, run the command below:

In [59]:
!makeblastdb -in fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out blastdb/nurse-shark-proteins



Building a new DB, current time: 05/08/2019 13:29:18
New DB name:   /home/jovyan/work/blastdb/nurse-shark-proteins
New DB title:  fasta/nurse-shark-proteins.fsa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 7 sequences in 0.00171518 seconds.
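`makeblastdb` reports how many sequences it added. An independent way to check that number is to count the records in the input FASTA file and compare them with the accessions that were requested. The sketch below assumes plain FASTA text in which each record starts with a `>` header line; `fasta_ids` is a hypothetical helper, and the second sample record and both sequence lines are made up for illustration.

```python
def fasta_ids(text):
    """Return the first whitespace-delimited token of each FASTA header line."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids

# Illustrative input: the first header follows the style shown in the BLAST report,
# the second record and the residues are placeholders, not real data.
sample = """>sp|P01349.2|RELX_CARTA RecName: Full=Relaxin
ACDEFGHIK
>demo|HYPOTHETICAL.1|EXAMPLE made-up record
LMNPQRSTV
"""

ids = fasta_ids(sample)
print(len(ids), "records:", ids)
```

In the notebook, the same function could be applied to the downloaded file with `fasta_ids(open("fasta/nurse-shark-proteins.fsa").read())`.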


### Run BLAST

To search the query sequence against the database, run the following command:


In [60]:
!blastp -query queries/P01349.fsa \
    -db blastdb/nurse-shark-proteins -out results/blastp.out

The results are written to `results/blastp.out`. The next cell displays them.

In [61]:
!cat results/blastp.out

BLASTP 2.7.1+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: fasta/nurse-shark-proteins.fsa
           7 sequences; 922 total letters



Query= sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName:
Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain

Length=44
                                                                      Score     E
Sequences producing significant
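
The report above is laid out for reading. For scripted processing, BLAST+ can also write tab-separated output when `-outfmt 6` is added to the search command; that option is not used in this notebook, so the following is only a sketch under that assumption. `parse_outfmt6` is a hypothetical helper, the column names are the default twelve fields of that format, and the sample row is made up.

```python
# Default columns of BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")

def parse_outfmt6(text):
    """Parse tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits

# Made-up row for illustration only; not a result from this notebook
sample = "query1\tsubjectA\t50.000\t20\t10\t0\t1\t20\t5\t24\t1e-05\t30.0\n"

for hit in parse_outfmt6(sample):
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Converting the numeric columns up front makes it easy to filter hits afterwards, for example by keeping only those with an E-value below a chosen threshold.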

### Additional Resources
Changes to this notebook are not saved once you stop this Jupyter session. To keep a modified copy, download it by clicking "File" > "Download as" > "Notebook".