# Example 01 - Obtaining genome data with `Biopython` and `Entrez` <img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">

## Introduction

In this exercise, you will use [`Biopython`](http://biopython.org/) to download pathogen genome data from [`NCBI`](http://www.ncbi.nlm.nih.gov/) programmatically. 

It is possible to obtain the same data by point-and-click from a browser, or at the terminal using a program like `wget`, but being able to script data downloads has some advantages, such as:

* automation
* reproducibility
* self-documentation
* future adaptability
* linked searches

**Note: large data sets**: if you wish to download large datasets, then using `wget`, `ftp` or other methods is better than programmatic access *via* `Entrez`.

This Jupyter notebook provides some examples of scripting genome downloads from `NCBI` singly, and in larger quantities. This method of obtaining genome data uses the `Entrez` interface that NCBI provides for automated querying of its data.

### Online documentation

* `Biopython` tutorial for `Entrez`: [http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc109](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc109)
* `Biopython` technical documentation for `Bio.Entrez`: [http://biopython.org/DIST/docs/api/Bio.Entrez-module.html](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html)
* `Entrez` introductory documentation at `NCBI`: [http://www.ncbi.nlm.nih.gov/books/NBK25497/](http://www.ncbi.nlm.nih.gov/books/NBK25497/)

## `Entrez`

`Entrez` is the name `NCBI` give to the tools they provide as a computational interface to the data they hold across their genomic and other databases (e.g. `PubMed`). Most programs that interact with `NCBI` (GenBank/RefSeq) data will be using this set of tools.

### Caveats

It is possible to over-use `Entrez`. If this happens, you or your IP address may be blacklisted. In order to avoid this, you should keep to the following guidelines:

* Make no more than three URL requests per second
* Make large queries outwith the hours of 0900-1700 EST (1400-2200 GMT)
* Provide your email address as an identifier when querying

Programming libraries, such as `Biopython`'s `Entrez` module, will usually help you stay within those guidelines by limiting the frequency of queries, and insisting that you provide an email address.
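To make the first guideline concrete, here is a minimal sketch of a throttle that keeps calls at least a third of a second apart, using only Python's standard library. The `Throttle` class and its `wait()` method are hypothetical names chosen for illustration; they are not part of `Biopython`, and you would not normally need them when using `Bio.Entrez`.

```python
import time


class Throttle:
    """Hypothetical helper: allow at most `max_per_second` calls per second."""

    def __init__(self, max_per_second=3):
        # Minimum gap (in seconds) that must separate two calls
        self.min_interval = 1.0 / max_per_second
        self.last_call = None

    def wait(self):
        """Sleep just long enough to keep calls `min_interval` apart."""
        now = time.monotonic()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()


throttle = Throttle(max_per_second=3)
start = time.monotonic()
for _ in range(4):
    throttle.wait()  # a real script would make its request here
elapsed = time.monotonic() - start
print(f"4 throttled calls took {elapsed:.2f} s")
```

The first call goes ahead immediately and each later call waits for the remainder of the interval, so four calls take roughly one second.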

## `Biopython` and `Bio.Entrez`

[`Biopython`](http://biopython.org/) is a widely-used library, providing bioinformatics tools for the popular [Python](https://www.python.org/) programming language. Similar libraries exist for other programming languages.

`Bio.Entrez` is a module of `Biopython` that is concerned directly with making queries against the `NCBI` databases.

## Running cells in this notebook

This is an interactive notebook, which means you are able to run the code that is written in each of the cells.

To run the code in a cell, you should:

1. Place your mouse cursor in the cell, and click to make it active (this gives the cell *focus*).
2. Hold down the `Shift` key, and press the `Return` key.

If this is successful, you should see the input marker to the left of the cell change from

```
In [ ]:
```

to (for example)

```
In [1]:
```

and you may see output appear below the cell.

## 1. Connecting to `NCBI`

In order to use the `Bio.Entrez` module, you need to *import* it. This is how modules become available for use in Python.

It is good practice at this point to specify your email, so that `NCBI` can contact you in case of problems (or if you are likely to become blacklisted).

It is also good practice to specify a 'tool': the name of the script that is making the call.

In [None]:
# This line imports the Bio.Entrez module, and makes it available
# as 'Entrez'.
from Bio import Entrez

# This line sets the variable 'Entrez.email' to the specified
# email address. You should substitute your own address for the
# example address provided below. Please do not provide a
# fake name.
Entrez.email = "Fakey.McFakename@example.com"

# This line sets the name of the tool that is making the queries
Entrez.tool = "01-genome_data.ipynb"

## 2. Using `Bio.Entrez` to list available databases

When you send a query or request to `NCBI` using `Bio.Entrez`, the remote service will send back data in [XML](https://en.wikipedia.org/wiki/XML) format. This is a file format designed to be easy for computers to read, but is very verbose and difficult to read for humans.

The `Bio.Entrez` module can `read()` this data so that you can extract useful information.
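To give a feel for what this XML looks like, the sketch below parses a heavily trimmed, hand-written imitation of a database-list response with Python's standard-library `xml.etree.ElementTree`. The XML string is an assumption made for illustration (a real response is much longer), and `Entrez.read()` does considerably more work than this, but the idea of turning tagged text into Python values is the same.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed imitation of an XML response listing databases.
# A real response from NCBI contains many more entries.
mock_xml = """<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>nuccore</DbName>
    <DbName>assembly</DbName>
  </DbList>
</eInfoResult>"""

# Parse the text into a tree of elements
root = ET.fromstring(mock_xml)

# Collect the text inside every <DbName> element
db_names = [element.text for element in root.iter("DbName")]
print(db_names)
```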

In the example below, you will ask `NCBI` for a list of the databases you can search by using the `Entrez.einfo()` function. This returns a *handle* containing the XML response from `NCBI`. The `Entrez.read()` function then *reads* that response into a record that you can inspect and manipulate.

In [None]:
# The line below uses the Entrez.einfo() function to
# ask NCBI what databases are available. The result is
# 'stored' in a variable called 'handle'
handle = Entrez.einfo()

# In the line below, the response from NCBI is read
# into a record, that organises NCBI's response into
# something you can work with.
record = Entrez.read(handle)

The variable `record` contains a list of the available databases at `NCBI`, which you can see by executing the cell below:

In [None]:
print(record["DbList"])

You may recognise some of the database names, such as `pubmed`, `nuccore`, `assembly`, `sra`, and `taxonomy`.

`Entrez` allows you to query these databases in much the same way that you just obtained the list of databases.

## 3. Using `Bio.Entrez` to find genome assemblies at `NCBI`

In the cells below, you will use `Bio.Entrez` to identify assemblies for the bacterial plant pathogen *Ralstonia solanacearum*.

We are interested in genome data, so will query against the [`assembly`](http://www.ncbi.nlm.nih.gov/assembly) database at `NCBI`. This contains all genome assemblies, whether complete or draft.

We are interested in *Ralstonia solanacearum*, so will search against the `assembly` database with this text as a query. The function that allows us to do this is `Entrez.esearch()`. By default, searches are limited to 20 results (as on the `NCBI` webpage), but we can change this.

In [None]:
# The line below carries out a search of the `assembly` database at NCBI,
# using the phrase `Ralstonia solanacearum` as the search query,
# and asks NCBI to return up to the first 100 results
handle = Entrez.esearch(db="assembly", term="Ralstonia solanacearum", retmax=100)

# This line converts the returned information from NCBI into a form we
# can use, as before.
record = Entrez.read(handle)

The returned information can be viewed by running the cell below.

It may look confusing at first, but it simply lists the database identifiers of the assemblies in the `assembly` database that match the query we made, along with a few other pieces of information that we do not need right now.

In [None]:
# This line prints the downloaded information from NCBI, so
# we can read it.
print(record)

For now, we are interested in the list of database identifiers, in `record['IdList']`. We will use these to get information from the `assembly` database.
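One thing worth checking in a search result is whether `retmax` cut the list short. The sketch below uses a hand-made stand-in for `record` to compare the total number of matches with the number of identifiers returned. The numbers and identifiers are invented, and only the keys `Count` and `IdList` are assumed.

```python
# Hand-made stand-in for the result of Entrez.read(); all values are invented.
# As in a real result, the count is stored as text rather than as a number.
mock_record = {
    "Count": "137",
    "IdList": [str(900000 + i) for i in range(100)],
}

total_matches = int(mock_record["Count"])   # matches found by the search
returned_ids = len(mock_record["IdList"])   # identifiers actually returned

if returned_ids < total_matches:
    message = (f"Only {returned_ids} of {total_matches} identifiers returned; "
               "consider a larger retmax")
else:
    message = f"All {total_matches} identifiers returned"
print(message)
```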

We will look at a single record first, and then consider how to get all the *Ralstonia* genomes at the same time.

## 4. Downloading a single genome from `NCBI`

In this section, you will use one of the database identifiers returned from your search at `NCBI` to identify and download a single GenBank genome record corresponding to *Ralstonia solanacearum*.

To do this, we will select a single accession from the list in `record["IdList"]` in the cell below.

In [None]:
# The line below takes the first value in the list of 
# database accessions record["IdList"], and places it in
# the variable 'accession'
accession = record["IdList"][0]

# Show the contents of the variable 'accession'
print(accession)

There is a complicating factor here. Assemblies may not be complete, and could comprise several contigs, or a chromosome and several extrachromosomal elements. These are stored independently in a different database, `nucleotide` (also known as `nuccore`), and each has its own accession. We need to *link* the `assembly` accession to each of the `nucleotide` accessions.

This is a common requirement when querying `NCBI` databases, and is achieved using the `Entrez.elink()` function.

We need to specify the database for which we have the accession (or `UID`), and which database we want to query for related records (in this case, `nucleotide`).

In [None]:
# The line below requests the identifiers (UIDs) for all
# records in the `nucleotide` database that correspond to the
# assembly UID that is stored in the variable 'accession'
handle = Entrez.elink(dbfrom="assembly", db="nucleotide",
                     from_uid=accession)

# We place the downloaded information in the variable 'links'
links = Entrez.read(handle)

The `links` variable may contain links to more than one version of the genome (`NCBI` keep third-party managed genome data in GenBank/INSDC records, and `NCBI`-'owned' data in RefSeq records). 

The function below extracts only the INSDC information from the result of the `Entrez.elink()` query. It is ***not*** important that you understand the code.

In [None]:
# The code below provides a function that extracts nucleotide
# database accessions for INSDC data from the result of an
# Entrez.elink() query.
def extract_insdc(links):
    """Returns the link UIDs for INSDC entries, from the
    passed Elink search results"""
    linkset = [ls for ls in links[0]['LinkSetDb'] if
              ls['LinkName'] == 'assembly_nuccore_insdc']
    if 0 == len(linkset):
        raise ValueError("Elink() output has no assembly_nuccore_insdc data")
    uids = [i['Id'] for i in linkset[0]['Link']]
    return uids
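If you are curious about what the function above is filtering, the sketch below applies the same two steps to a small hand-built imitation of the nested structure in `links`. The UIDs are invented, only the keys used by `extract_insdc()` are included, and the second link name is there purely as a contrasting RefSeq-style example.

```python
# Hand-built imitation of the structure returned by Entrez.read() for an
# elink query. The UIDs are invented for illustration.
mock_links = [{
    "LinkSetDb": [
        {"LinkName": "assembly_nuccore_insdc",
         "Link": [{"Id": "1000001"}, {"Id": "1000002"}]},
        {"LinkName": "assembly_nuccore_refseq",
         "Link": [{"Id": "2000001"}, {"Id": "2000002"}]},
    ]
}]

# Step 1: keep only the link sets that point at INSDC (GenBank) records
insdc_sets = [ls for ls in mock_links[0]["LinkSetDb"]
              if ls["LinkName"] == "assembly_nuccore_insdc"]

# Step 2: pull the UID out of each link in the first matching set
insdc_uids = [link["Id"] for link in insdc_sets[0]["Link"]]
print(insdc_uids)
```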

You will use the `extract_insdc()` function to get the accession IDs for the sequences in this *Ralstonia solanacearum* genome, in the cell below.

In [None]:
# The line below uses the extract_insdc() function to get INSDC/GenBank
# accession UIDs for the components of the genome/assembly referred to
# in the 'links' variable. These will be stored in the variable
# 'nuc_uids'
nuc_uids = extract_insdc(links)

# Show the contents of 'nuc_uids'
print(nuc_uids)

Now that we have accession UIDs for the nucleotide sequences of the assembly, you will use `Entrez.efetch()` to *fetch* each sequence record from `NCBI`.

We need to tell `NCBI` which database we want to use (in this case, `nucleotide`), and the identifiers for the records (the values in `nuc_uids`). To get all the data at the same time, we can join the accession ids into a single string, with commas to separate the individual UIDs.

We will also tell `NCBI` two further pieces of information:

1. The format we want the data returned in. We will ask for FASTA format (`fasta`).
2. How we want the data returned. We will ask for plain text (`text`).
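The comma-joining step is plain Python string handling. As a small sketch with invented UIDs (the real values come from `nuc_uids`), this is the kind of string that is passed as the `id` argument, alongside the other settings described above:

```python
# Invented UIDs standing in for the contents of nuc_uids
example_uids = ["1000001", "1000002", "1000003"]

# Join the list into one comma-separated string
id_string = ",".join(example_uids)
print(id_string)  # prints 1000001,1000002,1000003

# The settings for the request, gathered in one place for inspection
request = {"db": "nucleotide", "rettype": "fasta",
           "retmode": "text", "id": id_string}
print(request)
```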

In [None]:
# The line below retrieves (fetches) the GenBank FASTA record for
# each database entry specified in `nuc_uids`, in plain text
# format.
handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",
                      id=",".join(nuc_uids))

### Reading sequence data with `Biopython`

To convert the sequences to a usable form, and then write them to a file, we can use `Biopython`'s [`SeqIO`](http://biopython.org/wiki/SeqIO) module, which is its standard interface for reading and writing biological sequences.

In the cell below, you will *parse* the `NCBI` output into usable [`SeqRecord`](http://biopython.org/wiki/SeqRecord) objects.

In [None]:
# This line imports the SeqIO module, so we can read and write
# sequences
from Bio import SeqIO

# The line below converts the NCBI output into Biopython's 
# SeqRecord representation of sequence data, and stores the
# results in the variable 'seqdata'
seqdata = list(SeqIO.parse(handle, 'fasta'))

In the cell below, you can see that each sequence in the *Ralstonia solanacearum* assembly has been downloaded.

In [None]:
# Show the contents of 'seqdata'
for s in seqdata:
    print(s)

### Writing sequence data with `Biopython`

The `SeqIO` module can also be used to write sequence data out to a file on your local hard drive. You will do this in the cell below, using the `SeqIO.write()` function.

In [None]:
# The line below writes the sequence data in 'seqdata' to
# the local file "data/ralstonia.fasta", in FASTA format.
# The function returns the number of sequences that were written to file
SeqIO.write(seqdata, "data/ralstonia.fasta", "fasta")
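Because `SeqIO.write()` reports how many sequences it wrote, a quick cross-check is to count the header lines (those starting with `>`) in the FASTA text. The sketch below does this on a short invented FASTA string rather than on the real file, and uses no `Biopython` functions; `count_fasta_records` is a hypothetical helper name.

```python
def count_fasta_records(fasta_text):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Invented two-record FASTA text, standing in for the file contents
example_fasta = """>seq1 first invented sequence
ATGCGTACGTTAGC
GGCTAAGT
>seq2 second invented sequence
TTGACCGTA
"""

n_records = count_fasta_records(example_fasta)
print(n_records)  # prints 2
```

To apply the same check to the file written above, you could read its contents as text and pass them to the function, then compare the result with the number returned by `SeqIO.write()`.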