### 1. Import required packages and inform NCBI of your email address (change below)

In [1]:
from Bio import Entrez, Medline, SeqIO

In [2]:
Entrez.email = "tquangbk@gmail.com" 

In [3]:
# This gives you the list of available databases
handle = Entrez.einfo()
rec = Entrez.read(handle)
print(rec)

{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']}


### 2. Find Chloroquine resistance transporter (CRT)
We will now try to find the chloroquine resistance transporter (CRT) gene of Plasmodium falciparum (the parasite that causes the deadliest form of malaria) in the nucleotide database:

In [4]:
handle = Entrez.esearch(db="nucleotide", term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]')
rec_list = Entrez.read(handle)
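# Count and RetMax are returned as strings, so compare them as integers
if int(rec_list['RetMax']) < int(rec_list['Count']):
    # Repeat the search, asking for every matching ID this time
    handle = Entrez.esearch(db="nucleotide", term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]',
                            retmax=rec_list['Count'])
    rec_list = Entrez.read(handle)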
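
The `Count` and `RetMax` values come back from `Entrez.read` as string-like objects, so a direct `<` between them compares character by character rather than numerically. The sketch below illustrates the pitfall; the plain dict is a hypothetical stand-in for the parsed search result, and its numbers are illustrative rather than real query output:

```python
# Hypothetical stand-in for the dict-like result of Entrez.read(handle);
# the values are illustrative strings, not real query output.
search_result = {'Count': '100', 'RetMax': '20'}

# String comparison is lexicographic: '2' sorts after '1', so this is False
string_says_truncated = search_result['RetMax'] < search_result['Count']

# Integer comparison gives the intended answer: 20 < 100 is True
int_says_truncated = int(search_result['RetMax']) < int(search_result['Count'])

print(string_says_truncated, int_says_truncated)
```

Converting both values with `int()` before comparing avoids silently skipping the second, complete search.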

### 3. Retrieve all of these records
Now, let's retrieve all of these records. The following query downloads every matching nucleotide sequence from GenBank (481 at the time of writing). You probably won't want to do this every time:

In [7]:
id_list = rec_list['IdList']
hdl = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb', retmax=rec_list['Count'])
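
For larger result sets, one common alternative is to request the IDs in batches rather than all at once. The following is only a sketch: `chunked` is a hypothetical helper, the batch size of 100 is an arbitrary choice rather than anything NCBI or Biopython prescribes, and the IDs are made-up placeholders for `rec_list['IdList']`:

```python
def chunked(items, size):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs standing in for rec_list['IdList'] (not real GenBank IDs)
placeholder_ids = [str(n) for n in range(1, 482)]

batches = list(chunked(placeholder_ids, 100))
print(len(batches), [len(batch) for batch in batches])

# Each batch could then be fetched separately, for example:
# for batch in batches:
#     hdl = Entrez.efetch(db='nucleotide', id=batch, rettype='gb')
```

Smaller requests are easier to retry if one of them fails, at the cost of more round trips to the server.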

### 4. Read and parse the result
Note that we convert an iterator (the result of SeqIO.parse) to a list. The advantage is that we can reuse the result as many times as we want (for example, iterate over it repeatedly) without repeating the query on the server. This saves time, bandwidth, and server usage. The disadvantage is that it allocates memory for all records, so it will not work for very large datasets; you might not want to do this conversion genome-wide, as in the next chapter.

In [8]:
recs = list(SeqIO.parse(hdl, 'gb'))
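
The difference between an iterator and a list can be seen without touching the network. In this sketch, `fake_parse` is a hypothetical stand-in for a lazy parser such as SeqIO.parse, and the three "records" are made-up strings:

```python
def fake_parse(lines):
    """Hypothetical stand-in for a lazy parser: yields one record per line."""
    for line in lines:
        yield line.strip()

raw = ['rec1\n', 'rec2\n', 'rec3\n']

# Iterating the generator directly: the second pass finds nothing left
lazy = fake_parse(raw)
first_pass = [r for r in lazy]
second_pass = [r for r in lazy]

# Materializing once into a list: every pass sees all the records
kept = list(fake_parse(raw))
passes = [sum(1 for _ in kept) for _ in range(2)]

print(first_pass, second_pass, passes)
```

The list costs memory proportional to the number of records, which is the trade-off described above.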

### 5. Working with one specific dataset
We will now just concentrate on a single record. This will only work if you used the exact same preceding query:

In [9]:
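for rec in recs:
    if rec.name == 'KM288867':
        break
else:
    # No match: the query returned a different set of records
    raise LookupError('KM288867 not found in recs')
print(rec.name)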
print(rec.description)

KM288867
Plasmodium falciparum clone PF3D7_0709000 chloroquine resistance transporter (CRT) gene, complete cds


### 6. Extract some sequence features
Now, let's extract some sequence features, which contain information such as gene products and exon positions on the sequence.

If the feature.type is gene, we will print its name, which is in the qualifiers dictionary. We will also print all the locations of exons. Exons, as with all features, have locations in this sequence: a start, an end, and the strand from which they are read. While all the start and end positions for our exons are ExactPosition, note that Biopython supports many other types of positions. One type of position is BeforePosition, which specifies that a location point is before a certain sequence position. Another type of position is BetweenPosition, which gives the interval for a certain location start/end. There are quite a few more position types; these are just some examples.

In [10]:
for feature in rec.features:
    if feature.type == 'gene':
        print(feature.qualifiers['gene'])
    elif feature.type == 'exon':
        loc = feature.location
        print('Exon', loc.start, loc.end, loc.strand)
    else:
        print('not processed:\n%s' % feature)

not processed:
type: source
location: [0:10000](+)
qualifiers:
    Key: clone, Value: ['PF3D7_0709000']
    Key: db_xref, Value: ['taxon:5833']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Plasmodium falciparum']

['CRT']
not processed:
type: mRNA
location: join{[2751:3543](+), [3720:3989](+), [4168:4341](+), [4513:4646](+), [4799:4871](+), [4994:5070](+), [5166:5249](+), [5376:5427](+), [5564:5621](+), [5769:5862](+), [6055:6100](+), [6247:6302](+), [6471:7598](+)}
qualifiers:
    Key: gene, Value: ['CRT']
    Key: product, Value: ['chloroquine resistance transporter']

not processed:
type: 5'UTR
location: [2751:3452](+)
qualifiers:
    Key: gene, Value: ['CRT']

not processed:
type: primer_bind
location: [2935:2958](+)
qualifiers:

not processed:
type: primer_bind
location: [3094:3121](+)
qualifiers:

not processed:
type: CDS
location: join{[3452:3543](+), [3720:3989](+), [4168:4341](+), [4513:4646](+), [4799:4871](+), [4994:5070](+), [5166:5249](+), [5376:54
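
Biopython locations are zero-based and half-open (the source feature above spans [0:10000] on a 10,000-base record), so the length of each segment is its end minus its start. As a sketch, the snippet below takes the mRNA `join` string as printed above and computes the number of segments and their total length with a regular expression. Parsing the printed text is for illustration only; with a live record you would more likely iterate over `feature.location.parts`.

```python
import re

# The mRNA location, copied from the output above
mrna_location = (
    "join{[2751:3543](+), [3720:3989](+), [4168:4341](+), [4513:4646](+), "
    "[4799:4871](+), [4994:5070](+), [5166:5249](+), [5376:5427](+), "
    "[5564:5621](+), [5769:5862](+), [6055:6100](+), [6247:6302](+), "
    "[6471:7598](+)}"
)

# Pull out every [start:end] pair as integers
segments = [(int(start), int(end))
            for start, end in re.findall(r'\[(\d+):(\d+)\]', mrna_location)]

# Half-open coordinates: length is simply end - start
lengths = [end - start for start, end in segments]

print(len(segments), sum(lengths))
```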

### 7. Look at the annotations on the record
We will now look at the annotations on the record, which are mostly metadata that is not related to the sequence position:

In [11]:
for name, value in rec.annotations.items():
    print('%s=%s' % (name, value))

molecule_type=DNA
topology=linear
data_file_division=INV
date=12-NOV-2014
accessions=['KM288867']
sequence_version=1
keywords=['']
source=Plasmodium falciparum (malaria parasite P. falciparum)
organism=Plasmodium falciparum
taxonomy=['Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', 'Aconoidasida', 'Haemosporida', 'Plasmodiidae', 'Plasmodium', 'Plasmodium (Laverania)']
references=[Reference(title='Versatile control of Plasmodium falciparum gene expression with an inducible protein-RNA interaction', ...), Reference(title='Direct Submission', ...)]


### 8. Access the fundamental piece of information, the sequence

In [12]:
print(len(rec.seq))

10000


### 9. Access to more datasets
There are many more databases at NCBI. You will probably want to check the Sequence Read Archive (SRA) database (previously known as the Short Read Archive) if you are working with NGS data. The SNP database contains information on single-nucleotide polymorphisms (SNPs), whereas the protein database has protein sequences, and so on. A full list of databases in Entrez is linked in the See also section of this recipe.

Another database that you probably already know about with regard to NCBI is PubMed, which includes a list of scientific and medical citations, abstracts, and even full texts. You can also access it via Biopython. Furthermore, GenBank records often contain links to PubMed. For example, we can perform this on our previous record, as shown here:

In [13]:
refs = rec.annotations['references']
for ref in refs:
    if ref.pubmed_id != '':
        print(ref.pubmed_id)
        handle = Entrez.efetch(db="pubmed", id=[ref.pubmed_id],
                                rettype="medline", retmode="text")
        records = Medline.parse(handle)
        for med_rec in records:
            for k, v in med_rec.items():
                print('%s: %s' % (k, v))

25370483
PMID: 25370483
OWN: NLM
STAT: MEDLINE
DCOM: 20160112
LR: 20181113
IS: 2041-1723 (Electronic) 2041-1723 (Linking)
VI: 5
DP: 2014 Nov 5
TI: Versatile control of Plasmodium falciparum gene expression with an inducible protein-RNA interaction.
PG: 5329
LID: 10.1038/ncomms6329 [doi]
AB: The available tools for conditional gene expression in Plasmodium falciparum are limited. Here, to enable reliable control of target gene expression, we build a system to efficiently modulate translation. We overcame several problems associated with other approaches for regulating gene expression in P. falciparum. Specifically, our system functions predictably across several native and engineered promoter contexts, and affords control over reporter and native parasite proteins irrespective of their subcellular compartmentalization. Induction and repression of gene expression are rapid, homogeneous and stable over prolonged periods. To demonstrate practical application of our system, we used it to re
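
Printing every field is verbose. Since each parsed MEDLINE record is iterated with `.items()` like a dictionary keyed by field tag, a short summary can pick out just a few tags. In this sketch, a plain dict with values copied from the output above stands in for `med_rec`, and `summarize` is a hypothetical helper. It uses `.get` on the assumption that not every record carries every tag:

```python
def summarize(med_rec):
    """Return a one-line summary built from the PMID, DP and TI tags."""
    pmid = med_rec.get('PMID', 'n/a')
    date = med_rec.get('DP', 'n/a')
    title = med_rec.get('TI', 'n/a')
    return '%s (%s): %s' % (pmid, date, title)

# Stand-in for one parsed record, using values shown in the output above
example_rec = {
    'PMID': '25370483',
    'DP': '2014 Nov 5',
    'TI': 'Versatile control of Plasmodium falciparum gene expression '
          'with an inducible protein-RNA interaction.',
}

print(summarize(example_rec))
```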