# Downloading genome data from NCBI with `Biopython` and `Entrez`

## Introduction

In this worksheet, you will use [`Biopython`](http://biopython.org/) to download pathogen genome data from [`NCBI`](http://www.ncbi.nlm.nih.gov/) programmatically with Python. 

It is possible to obtain the same data by point-and-click from a browser, at the terminal using a program like `wget`, or by other means, but scripting data downloads in this way has advantages, such as:

* **automation** - only one script is required to download many sequences
* **reproducibility** - the same data will be downloaded each time, and copy-paste errors will be avoided
* **self-documentation** - the script itself describes exactly how the data was obtained
* **future adaptability (and reuse)** - only minor changes to the script may be required for the next analysis or project

<div class="alert alert-warning">
<b>Note: large data sets</b>: if you wish to download large datasets, then using <b>wget</b>, <b>ftp</b> or other methods can be better than programmatic access <i>via</i> <b>Entrez</b>. The <b>Entrez</b> interface may give errors during large downloads, and is not designed for large data transfers.
</div>

This Jupyter notebook provides some examples of scripting genome downloads from `NCBI` singly, and in groups. This method of obtaining genome data uses the [`Entrez`](http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Entrez/) interface that NCBI provides for automated querying of its data.

### Related online documentation

* `Biopython` tutorial for `Entrez`: [http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc109](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc109)
* `Biopython` technical documentation for `Bio.Entrez`: [http://biopython.org/DIST/docs/api/Bio.Entrez-module.html](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html)
* `Entrez` introductory documentation at `NCBI`: [http://www.ncbi.nlm.nih.gov/books/NBK25497/](http://www.ncbi.nlm.nih.gov/books/NBK25497/)
* `Entrez` help: [http://www.ncbi.nlm.nih.gov/books/NBK3837/](http://www.ncbi.nlm.nih.gov/books/NBK3837/)
* `Entrez` Quick Start Guide: [http://www.ncbi.nlm.nih.gov/books/NBK25500/](http://www.ncbi.nlm.nih.gov/books/NBK25500/)

### Requirements
To complete this worksheet, you will need:
* An active internet connection
* The <b>Biopython</b> libraries installed

In [5]:
# Install Biopython on your local computer (uncomment the line below)
#!pip3 install biopython

### Setup on Google Colab

In [2]:
try:
    import google.colab
    # Running on Google Colab, so install Biopython first
    !pip install biopython
except ImportError:
    pass

In [3]:
import os
import sys

from urllib.request import urlretrieve

import Bio
from Bio import SeqIO, SearchIO, Entrez
from Bio.Seq import Seq
from Bio.SeqUtils import GC
from Bio.Blast import NCBIWWW
from Bio.Data import CodonTable

print("Python version:", sys.version_info)
print("Biopython version:", Bio.__version__)

Python version: sys.version_info(major=3, minor=11, micro=5, releaselevel='final', serial=0)
Biopython version: 1.81


## `Entrez`

[`Entrez`](http://www.ncbi.nlm.nih.gov/Class/MLACourse/Original8Hour/Entrez/) is the name `NCBI` give to the tools they provide as a computational interface to the data they hold across their genomic and other databases (e.g. `PubMed`). Many scripts and programs that interact with `NCBI` to download data (e.g. from GenBank or RefSeq) will be using this set of tools.

<div class="alert alert-warning">
<b>Caveats</b>

There are usage caps for this service, and it is possible to over-use <b>Entrez</b>. If this happens, you or your IP address may be blacklisted. In order to avoid this, you should keep to the following guidelines:

* Make no more than three URL requests per second
* Make large queries outside the hours of 0900-1700 EST (1400-2200 GMT)
* Provide your email address as an identifier when querying

Programming libraries, such as <b>Biopython</b>'s <b>Bio.Entrez</b> module, will usually help you stay within those guidelines by limiting the frequency of queries, and insisting that you provide an email address.
</div>
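The first guideline can be illustrated with a small rate limiter. This is only a sketch of the idea, using the Python standard library: the `Throttle` class and its `wait()` method are hypothetical names introduced here, not part of `Biopython`, and `Bio.Entrez` already spaces out its own requests, so you would only need something like this when calling `NCBI` by other means.

```python
import time


class Throttle:
    """Space out calls so that no more than `max_per_second` start each second."""

    def __init__(self, max_per_second=3):
        # Minimum gap (in seconds) that must separate two consecutive calls
        self.min_interval = 1.0 / max_per_second
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum gap, then note the time."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(max_per_second=3)
start = time.monotonic()
for request_number in range(4):
    throttle.wait()  # call this immediately before each request
    # ... the request to NCBI would go here ...
elapsed = time.monotonic() - start
print("Four calls took {:.2f} seconds".format(elapsed))
```

The first call goes through immediately and each later call waits for the remainder of a third of a second, so four calls take about one second in total.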

## `Biopython` and `Bio.Entrez`

[`Biopython`](http://biopython.org/) is a widely-used library, providing bioinformatics tools for the popular [Python](https://www.python.org/) programming language. Similar libraries exist for other programming languages.

`Bio.Entrez` is a module of `Biopython` that provides tools to make queries against the `NCBI` databases using the `Entrez` interface.

## Connecting to `NCBI`

In order to use the `Bio.Entrez` module, you need to *import* it. This is how modules become available for use in Python.

<div class="alert alert-info" role="alert">
It is good practice at this point to specify your email, so that <b>NCBI</b> can contact you in case of problems (or if you are likely to become blacklisted through excessive use). It is also good practice to specify a '<b>tool</b>' that is the script making the call.
</div>

In [6]:
# This line imports the Bio.Entrez module, and makes it available
# as 'Entrez'.
from Bio import Entrez

# The line below imports the Bio.SeqIO module, which allows reading
# and writing of common bioinformatics sequence formats.
from Bio import SeqIO

# Create a new directory (if needed) for output/downloads
import os
outdir = "ncbi_downloads"
os.makedirs(outdir, exist_ok=True)

# This line sets the variable 'Entrez.email' to the specified
# email address. You should substitute your own address for the
# example address provided below. Please do not provide a
# fake address.
Entrez.email = "talipzengin@mu.edu.tr"

# This line sets the name of the tool that is making the queries
Entrez.tool = "week_03_biopython_NCBI_entrez_downloads.ipynb"

## Using `Bio.Entrez` to list available databases

When you send a query or request to `NCBI` using `Bio.Entrez`, the remote service will send back data in [XML](https://en.wikipedia.org/wiki/XML) format. This is a file format designed to be easy for computers to read, but is very verbose and difficult to read for humans.

The `Bio.Entrez` module can `read()` this data so that you can extract useful information.

In the example below, you will ask `NCBI` for a list of the databases you can search by using the `Entrez.einfo()` function. This returns a *handle* containing the XML response from `NCBI`. The `Entrez.read()` function then *reads* the handle into a record that you can inspect and manipulate.

In [7]:
# The line below uses the Entrez.einfo() function to
# ask NCBI what databases are available. The result is
# 'stored' in a variable called 'handle'
handle = Entrez.einfo()

# In the line below, the response from NCBI is read
# into a record, that organises NCBI's response into
# something you can work with.
record = Entrez.read(handle)

The variable `record` contains a list of the available databases at `NCBI`, which you can see by executing the cell below:

In [8]:
print(record["DbList"])

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']


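For each of these databases, we can call `Entrez.einfo()` again, this time naming the database, to obtain more information (note in the output that the `nucleotide` database is reported under the name `nuccore`):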

In [18]:
from Bio import Entrez
handle = Entrez.einfo(db="nucleotide")
record = Entrez.read(handle)
print(record)
record["DbInfo"]["Description"]

{'DbInfo': {'DbName': 'nuccore', 'MenuName': 'Nucleotide', 'Description': 'Core Nucleotide db', 'DbBuild': 'Build231015-2350m.1', 'Count': '609637622', 'LastUpdate': '2023/10/17 15:18', 'FieldList': [{'Name': 'ALL', 'FullName': 'All Fields', 'Description': 'All terms from all searchable fields', 'TermCount': '12965065280', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'N', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'UID', 'FullName': 'UID', 'Description': 'Unique number assigned to each sequence', 'TermCount': '0', 'IsDate': 'N', 'IsNumerical': 'Y', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'Y'}, {'Name': 'FILT', 'FullName': 'Filter', 'Description': 'Limits the records', 'TermCount': '424', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'WORD', 'FullName': 'Text Word', 'Description': 'Free text associated with record', 'TermCount': '2696390059', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'N', 'Hierarchy': 'N', 'I

'Core Nucleotide db'
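The record returned for a single database behaves like nested dictionaries and lists, so the usual Python indexing applies. The sketch below uses a hand-copied, abridged excerpt of the output above (only the first four fields, and only some of their keys) so that it runs without contacting `NCBI`; with a live record you would skip the first assignment.

```python
# Hand-copied, abridged excerpt of the einfo record printed above
record = {
    "DbInfo": {
        "DbName": "nuccore",
        "Description": "Core Nucleotide db",
        "Count": "609637622",
        "FieldList": [
            {"Name": "ALL", "FullName": "All Fields"},
            {"Name": "UID", "FullName": "UID"},
            {"Name": "FILT", "FullName": "Filter"},
            {"Name": "WORD", "FullName": "Text Word"},
        ],
    }
}

db_info = record["DbInfo"]

# Counts arrive as strings, so convert them before doing arithmetic
n_records = int(db_info["Count"])
print(db_info["DbName"], "holds", n_records, "records")

# Each entry in FieldList describes one searchable field
field_names = [field["Name"] for field in db_info["FieldList"]]
for field in db_info["FieldList"]:
    print(field["Name"], "-", field["FullName"])
```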

`Entrez` allows you to query these databases using `Entrez.esearch()` in much the same way that you just obtained the list of databases with `Entrez.einfo()`.

## Using `Bio.Entrez` to find genome assemblies at `NCBI`

In the cells below, you will use `Bio.Entrez` to identify assemblies for the bacterial plant pathogen *Ralstonia solanacearum*. As our interest is genome data, we will query against the [`assembly`](http://www.ncbi.nlm.nih.gov/assembly) database at `NCBI`. This database contains entries for all genome assemblies, whether complete or draft.

We are interested in *Ralstonia solanacearum*, so will search against the `assembly` database with the text `"Ralstonia solanacearum"` as a query. The function that allows us to do this is `Entrez.esearch()`. By default, searches are limited to 20 results (as on the `NCBI` webpage), but we can change this.

In [9]:
# The line below carries out a search of the `assembly` database at NCBI,
# using the phrase `Ralstonia solanacearum` as the search query,
# and asks NCBI to return up to the first 100 results
handle = Entrez.esearch(db="assembly", term="Ralstonia solanacearum", retmax=100)

# This line converts the returned information from NCBI into a form we
# can use, as before.
record = Entrez.read(handle)

The returned information can be viewed by running the cell below.

The output may look confusing at first, but it simply lists the identifiers (UIDs) that uniquely identify the entries in the `assembly` database matching our query, along with a few other pieces of information that we do not need right now: the number of returned entries, the total number of entries that could have been returned, and how the query was processed.

In [11]:
# This line prints the downloaded information from NCBI, so we can read it.
print(record)

{'Count': '497', 'RetMax': '100', 'RetStart': '0', 'IdList': ['17412981', '16182181', '16182171', '16182161', '15940591', '15940541', '15940491', '15794331', '15794271', '15794261', '15794241', '15794231', '15794221', '15794201', '15794191', '15794181', '15794171', '15794161', '15794151', '15794131', '15794101', '15623791', '15623761', '15623751', '15623741', '15623731', '15623721', '15623711', '15623701', '15623691', '15623681', '15623671', '15623661', '15623651', '15623641', '14327951', '14327941', '14231511', '14231501', '13916141', '13915671', '13559881', '13118301', '12942611', '12916701', '12916691', '12916681', '12916671', '12916661', '12916651', '12916631', '12916621', '12916611', '12916601', '12916581', '12574841', '12574831', '12574821', '12574811', '12574801', '12574791', '12574781', '12574771', '12574761', '12574751', '12574741', '12574731', '12574711', '12574701', '12574691', '12574681', '12574671', '12574661', '12574651', '12574611', '12574601', '12574591', '12574581', '1
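The record above reports a `Count` of 497 matches, but only 100 identifiers were returned because of `retmax`. One way to collect the rest is to repeat the search with an increasing `retstart` offset. The sketch below shows that loop under some assumptions: `collect_all_ids` and `fake_search` are hypothetical names introduced here, and `fake_search` is an offline stand-in that mimics the shape of an `esearch` record so the example runs without contacting `NCBI`.

```python
def collect_all_ids(search, batch_size=100):
    """Call `search(retstart=..., retmax=...)` until all identifiers are collected.

    `search` is any function returning a dict with 'Count' and 'IdList'
    keys, shaped like the esearch record shown above.
    """
    ids = []
    retstart = 0
    while True:
        result = search(retstart=retstart, retmax=batch_size)
        ids.extend(result["IdList"])
        retstart += batch_size
        # Stop once the offset passes the total, or if a batch comes back empty
        if retstart >= int(result["Count"]) or not result["IdList"]:
            break
    return ids


def fake_search(retstart, retmax):
    """Offline stand-in for NCBI: pretends that 497 records match the query."""
    matching = [str(number) for number in range(497)]
    return {"Count": str(len(matching)), "IdList": matching[retstart:retstart + retmax]}


all_ids = collect_all_ids(fake_search)
print(len(all_ids))
```

With `Bio.Entrez`, the `search` argument could be a small function that returns `Entrez.read(Entrez.esearch(db="assembly", term="Ralstonia solanacearum", retstart=retstart, retmax=retmax))`.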

For now, we are interested in the list of database identifiers, in `record['IdList']`. We will use these to get information from the `assembly` database.

We will look at a single record first, and then consider how to get all the *Ralstonia* genomes at the same time.

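`ESummary` retrieves document summaries from a list of primary IDs (see the ESummary help page for more information). In Biopython, ESummary is available as `Bio.Entrez.esummary()`. Using the search result above, we can for example ask for the record with UID 17412981. Note that UIDs are specific to each database: the cell below queries the `nucleotide` database, where this UID identifies an unrelated record (a medaka cDNA sequence, as the output shows) rather than the *Ralstonia solanacearum* assembly. To summarise the assembly itself, you would query `db="assembly"` instead.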

In [27]:
from Bio import Entrez
Entrez.email = "talipzengin@mu.edu.tr"     # Always tell NCBI who you are
# ESummary looks records up by UID, so no search term is needed
handle = Entrez.esummary(db="nucleotide", id="17412981")
record = Entrez.read(handle)
info = record[0]
print(info)
print("\nExtracted Info:")
print("ID\nid: {}\nTitle: {}".format(record[0]["Id"], info["Title"]))

{'Item': [], 'Id': '17412981', 'Caption': 'BJ019199', 'Title': "BJ019199 MF01SSA cDNA Oryzias latipes cDNA clone MF01SSA076A03 3', mRNA sequence", 'Extra': 'gi|17412981|gnl|dbEST|10499255|dbj|BJ019199.1|[17412981]', 'Gi': IntegerElement(17412981, attributes={}), 'CreateDate': '2001/12/07', 'UpdateDate': '2010/12/15', 'Flags': IntegerElement(0, attributes={}), 'TaxId': IntegerElement(8090, attributes={}), 'Length': IntegerElement(319, attributes={}), 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'BJ019199.1'}

Extracted Info:
ID
id: 17412981
Title: BJ019199 MF01SSA cDNA Oryzias latipes cDNA clone MF01SSA076A03 3', mRNA sequence


## Fetching sequence records from `NCBI`

Once you have UIDs for records in the `nucleotide` database, you can use `Entrez.efetch()` to *fetch* each sequence record from NCBI.

We need to tell `NCBI` which database we want to use (in this case, `nucleotide`), and the identifiers for the records (a single UID in the cell below; more generally, the values in a list such as `nuc_uids`). To get several records at the same time, we can join the UIDs into a single string, with commas to separate the individual UIDs.

We will also tell `NCBI` two further pieces of information:

1. The format we want the data returned in. We will ask for GenBank format (`gb` in the cell below; `gbwithparts` can be used to obtain the genome sequence and feature annotations).
2. How we want the data returned. We will ask for plain text (`text`).
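Joining several UIDs into one comma-separated string can be sketched as follows. The three UIDs are illustrative values taken from this worksheet's `elink` output, and `nuc_uids` and `id_string` are just example variable names.

```python
# Illustrative nucleotide UIDs, held as strings in a list
nuc_uids = ["2533054996", "2533054995", "2533054994"]

# Join them with commas, which is the form NCBI expects for several identifiers
id_string = ",".join(nuc_uids)
print(id_string)
```

The resulting string could then be passed as the `id` argument, for example `Entrez.efetch(db="nucleotide", id=id_string, rettype="gb", retmode="text")`.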

In [35]:
# The lines below retrieve (fetch) the GenBank record
from Bio import Entrez
Entrez.email = "talipzengin@mu.edu.tr"     # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="17412981", rettype="gb", retmode="text")
print(handle.read())

LOCUS       BJ019199                 319 bp    mRNA    linear   EST 15-DEC-2010
DEFINITION  BJ019199 MF01SSA cDNA Oryzias latipes cDNA clone MF01SSA076A03 3',
            mRNA sequence.
ACCESSION   BJ019199
VERSION     BJ019199.1
DBLINK      BioSample: SAMN00169934
KEYWORDS    EST.
SOURCE      Oryzias latipes (Japanese medaka)
  ORGANISM  Oryzias latipes
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Neoteleostei;
            Acanthomorphata; Ovalentaria; Atherinomorphae; Beloniformes;
            Adrianichthyidae; Oryziinae; Oryzias.
REFERENCE   1  (bases 1 to 319)
  AUTHORS   Kohara,Y., Shin-i,T., Kimura,T., Narita,T., Jindo,T. and Takeda,H.
  TITLE     Medaka EST Project in Takeda's lab
  JOURNAL   Unpublished (2001)
COMMENT     Contact: Tadasu Shin-i
            Genome Biology Lab., Center For Genetic Resource Information
            National Institute of Genetics
            1111 Yata, Mishima, Shiz

The arguments `rettype="gb"` and `retmode="text"` let us download this record in GenBank format.

Note that until Easter 2009, the Entrez `EFetch` API let you use "genbank" as the return type; NCBI now insists on the official return types "gb" or "gbwithparts" (or "gp" for proteins), as described online. Also note that until February 2012 the Entrez `EFetch` API defaulted to returning plain text, but it now defaults to `XML`.

Alternatively, you could use `rettype="fasta"` to get FASTA format; see the `EFetch` Sequences Help page for other options. Remember that the available formats depend on which database you are downloading from; see the main `EFetch` Help page.

If you fetch the record in one of the formats accepted by `Bio.SeqIO` (see the `Bio.SeqIO` chapter of the `Biopython` tutorial), you can parse it directly into a `SeqRecord`:

In [36]:
from Bio import Entrez, SeqIO
handle = Entrez.efetch(db="nucleotide", id="17412981", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(record)

ID: BJ019199.1
Name: BJ019199
Description: BJ019199 MF01SSA cDNA Oryzias latipes cDNA clone MF01SSA076A03 3', mRNA sequence
Database cross-references: BioSample:SAMN00169934
Number of features: 1
/molecule_type=mRNA
/topology=linear
/data_file_division=EST
/date=15-DEC-2010
/accessions=['BJ019199']
/sequence_version=1
/keywords=['EST']
/source=Oryzias latipes (Japanese medaka)
/organism=Oryzias latipes
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Neoteleostei', 'Acanthomorphata', 'Ovalentaria', 'Atherinomorphae', 'Beloniformes', 'Adrianichthyidae', 'Oryziinae', 'Oryzias']
/references=[Reference(title="Medaka EST Project in Takeda's lab", ...)]
/comment=Contact: Tadasu Shin-i
Genome Biology Lab., Center For Genetic Resource Information
National Institute of Genetics
1111 Yata, Mishima, Shizuoka 411-8540, Japan
Tel: 81-55-981-6856
Fax: 81-55-981-6855
Email: tshini@genes.nig.ac.jp, URL: http://dolph

Note that a more typical use would be to save the sequence data to a local file, and then parse it with `Bio.SeqIO`. This can save you having to re-download the same file repeatedly while working on your script, and places less load on the NCBI’s servers. For example:

In [37]:
import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "talipzengin@mu.edu.tr"     # Always tell NCBI who you are
filename = "gi_17412981.gbk"
if not os.path.isfile(filename):
    # Download the record and write it to a local file
    with Entrez.efetch(db="nucleotide", id="17412981", rettype="gb", retmode="text") as net_handle:
        with open(filename, "w") as out_handle:
            out_handle.write(net_handle.read())
    print("Saved")

print("Parsing...")
record = SeqIO.read(filename, "genbank")
print(record)

Saved
Parsing...
ID: BJ019199.1
Name: BJ019199
Description: BJ019199 MF01SSA cDNA Oryzias latipes cDNA clone MF01SSA076A03 3', mRNA sequence
Database cross-references: BioSample:SAMN00169934
Number of features: 1
/molecule_type=mRNA
/topology=linear
/data_file_division=EST
/date=15-DEC-2010
/accessions=['BJ019199']
/sequence_version=1
/keywords=['EST']
/source=Oryzias latipes (Japanese medaka)
/organism=Oryzias latipes
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Actinopterygii', 'Neopterygii', 'Teleostei', 'Neoteleostei', 'Acanthomorphata', 'Ovalentaria', 'Atherinomorphae', 'Beloniformes', 'Adrianichthyidae', 'Oryziinae', 'Oryzias']
/references=[Reference(title="Medaka EST Project in Takeda's lab", ...)]
/comment=Contact: Tadasu Shin-i
Genome Biology Lab., Center For Genetic Resource Information
National Institute of Genetics
1111 Yata, Mishima, Shizuoka 411-8540, Japan
Tel: 81-55-981-6856
Fax: 81-55-981-6855
Email: tshini@genes.nig.ac.jp, 

### Writing sequence data with `Biopython`

The `SeqIO` module can be used to write sequence data out to a file on your local hard drive. You will do this in the cells below, using the `SeqIO.write()` function.

<div class="alert alert-info" role="alert">
The <b>SeqRecord</b>s you downloaded contain sequence and feature annotation data, and can be written in any of several file formats. Some of these formats preserve annotation information, and some do not.
</div>

Firstly, in the cell below, you will write a GenBank format file that preserves both sequence and annotation data. For the `SeqIO.write()` function, we need to specify the `SeqRecord` or list of `SeqRecord`s to write (here, `record`), the output filename, and the format we wish to write (in this case `"genbank"`).

In [38]:
# The line below writes the sequence data in 'record' to
# the local file "ncbi_downloads/gi_17412981.gbk", in GenBank format.
# The function returns the number of sequences that were written to file
SeqIO.write(record, os.path.join(outdir, "gi_17412981.gbk"), "genbank")

1

If you inspect the newly-created `gi_17412981.gbk` file in the `ncbi_downloads` directory, you should see that it contains a complete GenBank record for the sequence you downloaded.

GenBank files are detailed and can be large, and sometimes we only want the sequence itself, not its annotation. The sequence can be written out on its own by specifying the `"fasta"` format to `SeqIO.write()` instead. This time, we write the output to `ncbi_downloads/ralstonia.fasta`.

In [39]:
# The line below writes the sequence data in 'record' to
# the local file "ncbi_downloads/ralstonia.fasta", in FASTA format.
SeqIO.write(record, os.path.join(outdir, "ralstonia.fasta"), "fasta")

1

## Downloading a single genome from `NCBI`

In this section, you will use one of the database identifiers returned from your search at `NCBI` to identify and download the GenBank records corresponding to a single assembly of *Ralstonia solanacearum*.

To do this, we will select a single accession from the list in `record["IdList"]`, using the code in the cell below. 

<div class="alert alert-danger" role="alert">
Although this is a single assembly, with a single accession ID, we shall see that we need to download more than one sequence to cover the complete genome.
</div>

In [50]:
# The line below carries out a search of the `assembly` database at NCBI,
# using the phrase `Ralstonia solanacearum` as the search query,
# and asks NCBI to return up to the first 100 results
handle = Entrez.esearch(db="assembly", term="Ralstonia solanacearum", retmax=100)

# This line converts the returned information from NCBI into a form we
# can use, as before.
record = Entrez.read(handle)

# The line below takes the first value in the list of 
# database accessions record["IdList"], and places it in
# the variable 'accession'
accession = record["IdList"][0]
print("Accession:", accession)

print('The record is from the {}.'.format(record["TranslationSet"]))
print('The IdList:')
for link in record["IdList"]:
    print(link)

Accession: 17412981
The record is from the [{'From': 'Ralstonia solanacearum', 'To': '"Ralstonia solanacearum"[Organism]'}].
The IdList:
17412981
16182181
16182171
16182161
15940591
15940541
15940491
15794331
15794271
15794261
15794241
15794231
15794221
15794201
15794191
15794181
15794171
15794161
15794151
15794131
15794101
15623791
15623761
15623751
15623741
15623731
15623721
15623711
15623701
15623691
15623681
15623671
15623661
15623651
15623641
14327951
14327941
14231511
14231501
13916141
13915671
13559881
13118301
12942611
12916701
12916691
12916681
12916671
12916661
12916651
12916631
12916621
12916611
12916601
12916581
12574841
12574831
12574821
12574811
12574801
12574791
12574781
12574771
12574761
12574751
12574741
12574731
12574711
12574701
12574691
12574681
12574671
12574661
12574651
12574611
12574601
12574591
12574581
12574571
12574511
12574491
12574451
12574411
12574401
12574371
12574331
12574321
12574301
12574291
12574281
12574221
12574211
12574201
12574191
12574161
12574141

### Linking across databases

<div class="alert alert-info" role="alert">
There is a complicating factor: assemblies may not be a single complete sequence, and could comprise several contigs, or a chromosome and several extrachromosomal elements, all annotated independently. These are stored independently in a different database, called <b>nucleotide</b>, and each has an individual accession.  

We need to <i>link</i> the <b>assembly</b> accession to each of the <b>nucleotide</b> accessions.

This is a common requirement when querying <b>NCBI</b> databases, and is achieved using the <b>Entrez.elink()</b> function.

We need to specify the database for which we have the accession (or `UID`), and which database we want to query for related records (in this case, `nucleotide`).  
</div>

In [72]:
# The line below requests the identifiers (UIDs) for all
# records in the `nucleotide` database that correspond to the
# assembly UID that is stored in the variable 'accession'
handle = Entrez.elink(dbfrom="assembly", db="nucleotide", id=accession)

# We place the downloaded information in the variable 'links'
links = Entrez.read(handle)
print(links)

[{'ERROR': [], 'LinkSetDbHistory': [], 'LinkSetDb': [{'Link': [{'Id': '2533054996'}, {'Id': '2533054995'}, {'Id': '2533054994'}, {'Id': '2533054993'}, {'Id': '2533054992'}, {'Id': '2533054991'}, {'Id': '2533054990'}, {'Id': '2533054989'}, {'Id': '2533054988'}, {'Id': '2533054987'}, {'Id': '2533054986'}, {'Id': '2533054985'}, {'Id': '2533054984'}, {'Id': '2533054983'}, {'Id': '2533054982'}, {'Id': '2533054981'}, {'Id': '2533054980'}, {'Id': '2533054979'}, {'Id': '2533054978'}, {'Id': '2533054977'}, {'Id': '2533054976'}, {'Id': '2533054975'}, {'Id': '2533054974'}, {'Id': '2533054973'}, {'Id': '2533054972'}, {'Id': '2533054971'}, {'Id': '2533054970'}, {'Id': '2533054969'}, {'Id': '2533054968'}, {'Id': '2533054967'}, {'Id': '2533054966'}, {'Id': '2533054965'}, {'Id': '2533054964'}, {'Id': '2533054963'}, {'Id': '2533054962'}, {'Id': '2533054961'}, {'Id': '2533054960'}, {'Id': '2533054959'}, {'Id': '2533054958'}, {'Id': '2533054957'}, {'Id': '2533054956'}, {'Id': '2533054955'}, {'Id': '25330
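The nesting in this output can be unpicked with ordinary list and dictionary indexing. The sketch below uses an abridged, hand-built stand-in with the same nesting as the output above (only three of the linked UIDs), so that it runs without contacting `NCBI`; `nuc_uids` is just an example variable name.

```python
# Abridged stand-in with the same nesting as the `links` output above
links = [
    {
        "ERROR": [],
        "LinkSetDbHistory": [],
        "LinkSetDb": [
            {"Link": [{"Id": "2533054996"}, {"Id": "2533054995"}, {"Id": "2533054994"}]}
        ],
    }
]

# `links` is a list of link-set records; here there is one, for the single
# assembly UID we queried. Each record can hold one or more sets of links.
nuc_uids = []
for linkset in links[0]["LinkSetDb"]:
    for link in linkset["Link"]:
        nuc_uids.append(link["Id"])

print(len(nuc_uids), "linked nucleotide UIDs")
print(nuc_uids)
```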

The `links` variable may contain links to more than one version of the genome (`NCBI` keep third-party managed genome data in GenBank/INSDC records, and `NCBI`-'owned' data in RefSeq records). 

The function below extracts only the INSDC information from the `Entrez.elink()` query. It is ***not*** important that you understand the code.