# Elastic-BLAST RDRP in Jupyter notebook


### Requirements
Please see the [requirements.txt](https://github.com/boratyng/elastic-blast-notebook/blob/main/requirements.txt) file for the required Python packages.

In [None]:
import os
from uuid import uuid4
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Set up AWS credentials
You need to provide credentials for your AWS user account so that Elastic-BLAST can use cloud resources. Generating and providing user credentials is described here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html. There are two steps to this process:
1. Create an access key (an access key ID and a secret access key) via the AWS console: https://console.aws.amazon.com/iam/
1. Paste the AWS access key ID and the AWS secret access key into the code below (remember to use quotes, as these are Python strings)

Note that these keys authenticate your AWS account and anyone who has them has access to your account. We recommend creating new keys for working with this notebook and inactivating them when you are done.

In [None]:
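os.environ['AWS_ACCESS_KEY_ID'] = ''  # paste your AWS access key ID between the quotes
os.environ['AWS_SECRET_ACCESS_KEY'] = ''  # paste your AWS secret access key between the quotes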
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
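
Before moving on, it can help to confirm that all three variables are set without printing the secret values. The cell below is a minimal sketch of such a check; `missing_aws_variables` is a hypothetical helper written for this notebook, not part of Elastic-BLAST or the AWS CLI, and it only checks that the values are non-blank, not that AWS accepts them.

```python
import os

REQUIRED_AWS_VARIABLES = (
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'AWS_DEFAULT_REGION',
)

def missing_aws_variables(environ=os.environ):
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_AWS_VARIABLES
            if not environ.get(name, '').strip()]

missing = missing_aws_variables()
if missing:
    print('Not set:', ', '.join(missing))
else:
    print('All AWS credential variables are set.')
```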

## Create results bucket (if one does not exist)
Elastic-BLAST saves results in a cloud bucket. If you already have a cloud bucket in AWS, you can just provide its name.

### Name the results bucket
Select a name for your results bucket or provide the name of an existing bucket. Please remember that bucket names must be globally unique. You can either edit the _YOURNAME_ variable or change the value of the _RESULTS_BUCKET_ variable.

In [None]:
YOURNAME = str(uuid4())[:8]
RESULTS_BUCKET = f'elasticblast-{YOURNAME}'
print(f'Your results bucket: s3://{RESULTS_BUCKET}')
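
If you pick your own name, a quick local check can catch obvious problems before the bucket is created. The sketch below covers only a subset of the S3 naming rules (length of 3 to 63 characters, lowercase letters, digits, dots, and hyphens, starting and ending with a letter or digit). It cannot tell whether the name is already taken, and `looks_like_valid_bucket_name` is a hypothetical helper, not an AWS API.

```python
import re

# Subset of the S3 naming rules: 3-63 characters, lowercase letters,
# digits, dots, and hyphens, starting and ending with a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')

def looks_like_valid_bucket_name(name):
    """Return True if the name passes the basic checks above."""
    return BUCKET_NAME_PATTERN.fullmatch(name) is not None

print(looks_like_valid_bucket_name('elasticblast-1a2b3c4d'))  # True
print(looks_like_valid_bucket_name('ElasticBlast_Results'))   # False
```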

### Create results bucket
Skip if the bucket already exists.

In [None]:
!aws s3 mb s3://{RESULTS_BUCKET}

## Optional: Enable Elastic-BLAST Auto-shutdown feature
This feature lets Elastic-BLAST monitor its own status and shut down cloud resources when a search fails or completes successfully. It needs to be enabled only once per AWS user. If this feature is not enabled, you will need to run `elastic-blast delete` to delete cloud resources. Please see https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/janitor.html for more information.

In [None]:
!sed -i~ -e '/export PATH/d' $(which aws-create-elastic-blast-janitor-role.sh)
!aws-create-elastic-blast-janitor-role.sh

## Elastic-BLAST config
Below are the contents of the Elastic-BLAST configuration file, borrowed from the [Elastic-BLAST AWS Quickstart](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/quickstart-aws.html), and the code that writes them to a file named _BDQA.ini_.

In [None]:
conf_file = 'BDQA.ini'
conf = f"""[cloud-provider]
aws-region = us-east-1

[cluster]
num-nodes = 5
labels = owner={YOURNAME}

[blast]
program = blastp
db = s3://elasticblast-test/db/wolf18/RNAvirome.S2
queries = s3://elasticblast-test/queries/BDQA01.1.fsa_aa
results = s3://{RESULTS_BUCKET}
options = -task blastp-fast -evalue 0.01 -outfmt "7"
"""

with open(conf_file, 'w') as f:
    print(conf, file=f)
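
Since the configuration is built from an f-string, a typo in a section or key name is easy to miss. Assuming the file follows standard INI syntax, Python's `configparser` can read it back for a quick sanity check. The sketch below parses an example string rather than the real file; the `expected` dictionary is an assumption made for this illustration, not an official list of required Elastic-BLAST settings.

```python
import configparser

example_conf = """[cloud-provider]
aws-region = us-east-1

[cluster]
num-nodes = 5

[blast]
program = blastp
results = s3://elasticblast-example
options = -task blastp-fast -evalue 0.01 -outfmt "7"
"""

parser = configparser.ConfigParser()
parser.read_string(example_conf)

# Settings we expect to find (an assumption for this sketch).
expected = {'blast': ['program', 'results'], 'cluster': ['num-nodes']}
problems = [f'{section}.{key}'
            for section, keys in expected.items()
            for key in keys
            if not parser.has_option(section, key)]
print(problems or 'Config looks complete')
```

To check the file written above instead of the example string, replace `parser.read_string(example_conf)` with `parser.read(conf_file)`.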

## Submit Elastic-BLAST search
Run the cell below to submit the Elastic-BLAST search. It will take a few minutes.

In [None]:
!elastic-blast submit --cfg {conf_file}

## Check search status
The cell below checks the search status. Elastic-BLAST splits query sequences into parts, and the _elastic-blast status_ command shows how many of these parts are pending, running, succeeded, or failed. When the whole search is done, you will see only the message "Your Elastic-BLAST search succeeded ..." or "Your Elastic-BLAST search failed ..."

In [None]:
!elastic-blast status --cfg {conf_file}

## Wait until the search is done
Run the cell below to wait until the search is done.

In [None]:
!elastic-blast status --cfg {conf_file} --wait

## Download results
When the search is done, download results.

In [None]:
!aws s3 cp s3://{RESULTS_BUCKET}/ . --exclude "*" --include "*.out.gz" --recursive

## Uncompress and merge results
Elastic-BLAST produces compressed results files for each batch of queries. We are going to uncompress them and merge them into one file.

In [None]:
!gzip -d batch_*.gz
!cat batch_*.out | grep -v ^# >results.tab
!head results.tab
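
The merge step can also be done in Python, which avoids depending on `gzip`, `cat`, and `grep` being available in the notebook environment. The function below is a sketch of that alternative; `merge_blast_results` is a hypothetical helper. Unlike the shell commands above, it leaves the `.gz` files compressed and does not create the uncompressed `batch_*.out` files that the next section reads, so treat it as a replacement for the merge step only.

```python
import glob
import gzip

def merge_blast_results(pattern, merged_path):
    """Concatenate gzipped tabular results, dropping '#' comment lines.

    Returns the number of result rows written.
    """
    rows = 0
    with open(merged_path, 'w') as merged:
        for path in sorted(glob.glob(pattern)):
            with gzip.open(path, 'rt') as batch:
                for line in batch:
                    if not line.startswith('#'):
                        merged.write(line)
                        rows += 1
    return rows

# In this notebook it could be called as:
# merge_blast_results('batch_*.out.gz', 'results.tab')
```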

## Analyze results

BLAST output format 7 lists the column names in a comment line that starts with `# Fields:`. The cell below extracts the column names from that line.

In [None]:
with open('batch_000-blastp-RNAvirome.S2.out') as f:
    for line in f:
        if 'Fields:' not in line:
            continue
        columns = [col.strip() for col in line[9:].rstrip().split(',')]
        break
columns
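
The slice `line[9:]` relies on the `# Fields:` prefix having a fixed width. A variant that splits on the marker itself may be easier to read; the sketch below shows the idea on a made-up example line (the real header lists whichever columns the search produced). `parse_fields_line` is a hypothetical helper.

```python
def parse_fields_line(line):
    """Return column names from a '# Fields: ...' comment line, or None."""
    marker = 'Fields:'
    if marker not in line:
        return None
    # Keep only the text after the marker, then split on commas.
    _, _, names = line.partition(marker)
    return [name.strip() for name in names.split(',')]

# Illustrative header line, not copied from real search output.
example = '# Fields: query acc.ver, subject acc.ver, % identity, evalue, bit score\n'
print(parse_fields_line(example))
```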

Load search results into a pandas dataframe and show a snippet of results in tabular format.

In [None]:
hits = pd.read_csv('results.tab', sep='\t', names=columns)
hits

Let's look at the distribution of super kingdoms among the database sequences matched by query sequences.

In [None]:
hits[['query acc.ver', 'subject super kingdoms']].drop_duplicates()['subject super kingdoms'].value_counts()

And the distribution of species.

In [None]:
counts = hits[['query acc.ver', 'subject sci name']].drop_duplicates()['subject sci name'].value_counts()
counts
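
Both cells above call `drop_duplicates()` before `value_counts()`. A query can have several hits to the same species, and without the de-duplication each hit would be counted rather than each query. The toy data below (made-up names, not search results) illustrates the difference.

```python
import pandas as pd

# Query q1 has two hits to the same species.
toy_hits = pd.DataFrame({
    'query acc.ver': ['q1', 'q1', 'q2', 'q3'],
    'subject sci name': ['Virus A', 'Virus A', 'Virus A', 'Virus B'],
})

# Counts every hit: Virus A appears three times.
hit_counts = toy_hits['subject sci name'].value_counts()

# Counts each (query, species) pair once: Virus A is matched by two queries.
unique_pairs = toy_hits[['query acc.ver', 'subject sci name']].drop_duplicates()
query_counts = unique_pairs['subject sci name'].value_counts()

print(hit_counts['Virus A'], query_counts['Virus A'])
```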

Below is a bar plot of the 30 most frequently matched species.

In [None]:
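top_counts = counts[:30]
plt.figure(figsize=(14, 9))
# Horizontal bars: species on the y axis, number of matching queries on the x axis
ax = sns.barplot(y=top_counts.index, x=top_counts)
# Counts are integers, so place one tick per integer value
ax.set_xticks(range(top_counts.max() + 1))
ax.set_xlabel('Number of query matches')
# Label explicitly: the name of a value_counts() result differs between pandas versions
ax.set_ylabel('subject sci name')
plt.grid();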

## Clean up cloud resources
### Delete Elastic-BLAST queue and compute environment in AWS
If you did not enable the Elastic-BLAST auto-shutdown feature, the AWS Batch queue and compute environment have to be deleted manually.

In [None]:
!elastic-blast delete --cfg {conf_file}

### Delete cloud bucket
If you no longer need the BLAST search results stored in the cloud, delete the cloud bucket so that you are not charged for it.

In [None]:
!aws s3 rb s3://{RESULTS_BUCKET} --force

## Optional: Delete elastic-blast-janitor role
Deleting this role will disable the Elastic-BLAST auto-shutdown feature. You are not charged for this role, and it can be reused in future Elastic-BLAST searches.

In [None]:
!sed -i~ -e '/export PATH/d' $(which aws-delete-elastic-blast-janitor-role.sh)
!aws-delete-elastic-blast-janitor-role.sh