# 7. Sequence input/output

Content:
- 7.1 Sequence input
- 7.2 Parsing sequences from the internet
- 7.3 Accessing NCBI's Entrez database
    

The `Bio.SeqIO` module aims to provide a simple interface for working with assorted sequence file formats in a uniform way. You work with `SeqRecord` objects, each of which contains a `Seq` object (as seen in the previous chapter) along with annotations such as an identifier and a description.

In [1]:
from Bio import SeqIO
help(SeqIO)

Help on package Bio.SeqIO in Bio:

NAME
    Bio.SeqIO - Sequence input/output as SeqRecord objects.

DESCRIPTION
    Bio.SeqIO is also documented at SeqIO_ and by a whole chapter in our tutorial:
    
      - `HTML Tutorial`_
      - `PDF Tutorial`_
    
    .. _SeqIO: http://biopython.org/wiki/SeqIO
    .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html
    .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
    
    Input
    -----
    The main function is Bio.SeqIO.parse(...) which takes an input file handle
    (or in recent versions of Biopython alternatively a filename as a string),
    and format string.  This returns an iterator giving SeqRecord objects:
    
    >>> from Bio import SeqIO
    >>> for record in SeqIO.parse("Fasta/f002", "fasta"):
    ...     print("%s %i" % (record.id, len(record)))
    gi|1348912|gb|G26680|G26680 633
    gi|1348917|gb|G26685|G26685 413
    gi|1592936|gb|G29385|G29385 471
    
    Note that the parse(

## 7.1 Sequence input
The first thing we'll want to do is read in sequence records. The function we need for that is `Bio.SeqIO.parse()`, which expects two arguments:
1. A **path to a file or an open file handle**. Data from the internet first has to be opened as a handle, as shown in section 7.2.
2. A lowercase string specifying the **sequence format**. Examples are: clustal, fasta, embl, fastq, genbank or gb, pdb-atom, swiss, uniprot-xml,... You must specify the file format because [*explicit is better than implicit*](https://www.python.org/dev/peps/pep-0020/).

The output of `Bio.SeqIO.parse()` is a `SeqRecord` iterator. 

On a side note, sequence alignment file formats can also be read with `Bio.SeqIO`; however, the newer `Bio.AlignIO` is designed to work with such alignment files directly.

In [1]:
from Bio import SeqIO

# Here we're using the explicit path to a fasta file (in the data folder)
for seq_record in SeqIO.parse("data/ls_orchid.fasta", "fasta"):
    print(seq_record.id)
#    print(repr(seq_record.seq))
#    print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533
gi|2765657|emb|Z78532.1|CCZ78532
gi|2765656|emb|Z78531.1|CFZ78531
gi|2765655|emb|Z78530.1|CMZ78530
gi|2765654|emb|Z78529.1|CLZ78529
gi|2765652|emb|Z78527.1|CYZ78527
gi|2765651|emb|Z78526.1|CGZ78526
gi|2765650|emb|Z78525.1|CAZ78525
gi|2765649|emb|Z78524.1|CFZ78524
gi|2765648|emb|Z78523.1|CHZ78523
gi|2765647|emb|Z78522.1|CMZ78522
gi|2765646|emb|Z78521.1|CCZ78521
gi|2765645|emb|Z78520.1|CSZ78520
gi|2765644|emb|Z78519.1|CPZ78519
gi|2765643|emb|Z78518.1|CRZ78518
gi|2765642|emb|Z78517.1|CFZ78517
gi|2765641|emb|Z78516.1|CPZ78516
gi|2765640|emb|Z78515.1|MXZ78515
gi|2765639|emb|Z78514.1|PSZ78514
gi|2765638|emb|Z78513.1|PBZ78513
gi|2765637|emb|Z78512.1|PWZ78512
gi|2765636|emb|Z78511.1|PEZ78511
gi|2765635|emb|Z78510.1|PCZ78510
gi|2765634|emb|Z78509.1|PPZ78509
gi|2765633|emb|Z78508.1|PLZ78508
gi|2765632|emb|Z78507.1|PLZ78507
gi|2765631|emb|Z78506.1|PLZ78506
gi|2765630|emb|Z78505.1|PSZ78505
gi|2765629|emb|Z78504.1|PKZ78504
gi|2765628|emb|Z78503.1|PCZ78503
gi|2765627

Alternatively, you can pass a file handle. The `with` statement makes sure that the file is properly closed after reading it. This happens automatically if you pass the filename instead.

In [4]:
with open("data/ls_orchid.fasta", "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)

gi|2765658|emb|Z78533.1|CIZ78533
gi|2765657|emb|Z78532.1|CCZ78532
gi|2765656|emb|Z78531.1|CFZ78531
gi|2765655|emb|Z78530.1|CMZ78530
gi|2765654|emb|Z78529.1|CLZ78529
gi|2765652|emb|Z78527.1|CYZ78527
gi|2765651|emb|Z78526.1|CGZ78526
gi|2765650|emb|Z78525.1|CAZ78525
gi|2765649|emb|Z78524.1|CFZ78524
gi|2765648|emb|Z78523.1|CHZ78523
gi|2765647|emb|Z78522.1|CMZ78522
gi|2765646|emb|Z78521.1|CCZ78521
gi|2765645|emb|Z78520.1|CSZ78520
gi|2765644|emb|Z78519.1|CPZ78519
gi|2765643|emb|Z78518.1|CRZ78518
gi|2765642|emb|Z78517.1|CFZ78517
gi|2765641|emb|Z78516.1|CPZ78516
gi|2765640|emb|Z78515.1|MXZ78515
gi|2765639|emb|Z78514.1|PSZ78514
gi|2765638|emb|Z78513.1|PBZ78513
gi|2765637|emb|Z78512.1|PWZ78512
gi|2765636|emb|Z78511.1|PEZ78511
gi|2765635|emb|Z78510.1|PCZ78510
gi|2765634|emb|Z78509.1|PPZ78509
gi|2765633|emb|Z78508.1|PLZ78508
gi|2765632|emb|Z78507.1|PLZ78507
gi|2765631|emb|Z78506.1|PLZ78506
gi|2765630|emb|Z78505.1|PSZ78505
gi|2765629|emb|Z78504.1|PKZ78504
gi|2765628|emb|Z78503.1|PCZ78503
gi|2765627
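One situation where a handle is needed rather than a plain filename is a compressed file. The following is a minimal sketch, assuming a gzip-compressed FASTA file; the two records and the temporary file name are made up for illustration and are not part of the course data.

```python
import gzip
import os
import tempfile

from Bio import SeqIO

# Two made-up FASTA records, written to a temporary gzip-compressed file.
fasta_text = ">seq1 first example\nACGTACGT\n>seq2 second example\nGGCC\n"
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.fasta.gz")
with gzip.open(gz_path, "wt") as out_handle:
    out_handle.write(fasta_text)

# Open in text mode ("rt") so that SeqIO receives strings, not bytes.
with gzip.open(gz_path, "rt") as handle:
    gz_ids = [record.id for record in SeqIO.parse(handle, "fasta")]

print(gz_ids)
```

The parsing loop is the same as before; only the way the handle is opened changes.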

It should be clear by now that `Bio.SeqIO.parse()` returns an *iterator* which yields `SeqRecord` objects. In the following cells we'll showcase a couple of variants. Let's start with a more elegant, one-liner approach to retrieve **only the IDs**.

In [5]:
identifiers = [seq_record.id for seq_record in SeqIO.parse("data/ls_orchid.gbk","genbank")]

If you're interested in the **first record** of this file:

In [7]:
# One special case to consider is when your sequence files have multiple records, but you only want the first one. 
first_record = next(SeqIO.parse("data/ls_orchid.gbk", "genbank"))
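Because the result of `parse()` is an iterator, every record is handed out only once. The sketch below illustrates this with made-up FASTA text held in memory (`StringIO` stands in for a real file): after `next()` has taken the first record, a subsequent `list()` call only sees the records that are left.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"  # made-up records

record_iterator = SeqIO.parse(StringIO(fasta_text), "fasta")
first = next(record_iterator)      # consumes the first record
remaining = list(record_iterator)  # collects whatever has not been consumed yet

print(first.id)
print([record.id for record in remaining])
```

If you need to go through the records a second time, call `SeqIO.parse()` again or keep the records in a list.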

If there is only one record in the file, you might as well use the `Bio.SeqIO.read()` function. It takes the same two arguments and returns a single `SeqRecord` object instead of an iterator.
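A small sketch of `Bio.SeqIO.read()`, again with made-up in-memory FASTA text: the function expects exactly one record and raises a `ValueError` when the input holds more than one (or none at all).

```python
from io import StringIO

from Bio import SeqIO

one_record = ">seq1 only record\nACGTACGT\n"  # made-up data
two_records = ">seq1\nACGT\n>seq2\nGGCC\n"    # made-up data

record = SeqIO.read(StringIO(one_record), "fasta")
print(record.id, len(record))

try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError as error:
    read_failed = True
    print("Could not read:", error)
```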

Iterators are great when you only need the records one by one, in the order found in the file. For some tasks you may need **random access to the records in any order**. In that case, you can turn the iterator into a list with the built-in `list()` function.

In [9]:
records = list(SeqIO.parse("data/ls_orchid.gbk","genbank"))
print("Found {}".format(len(records)))

Found 94
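A list gives access by position. If you would rather look records up by their identifier, `Bio.SeqIO.to_dict()` builds a dictionary keyed on `record.id`. The sketch below uses made-up in-memory records rather than the orchid file.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">seq1\nACGT\n>seq2\nGGCCGG\n>seq3\nTTAA\n"  # made-up records

# Keys are the record IDs, values are the SeqRecord objects.
records_by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(sorted(records_by_id))
print(records_by_id["seq2"].seq)
```

Like `list()`, this keeps every record in memory, so for very large files `Bio.SeqIO.index()` may be the better fit.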


Let's explore this list of records:

In [13]:
# Retrieve the third record:
print(records[2])

ID: Z78531.1
Name: Z78531
Description: C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Number of features: 5
/molecule_type=DNA
/topology=linear
/data_file_division=PLN
/date=30-NOV-2006
/accessions=['Z78531']
/sequence_version=1
/gi=2765656
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/source=Cypripedium fasciculatum
/organism=Cypripedium fasciculatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Cypripedium']
/references=[Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', IUPACAmbiguousDNA())


In [14]:
# Explore the last record by first saving it in a separate variable:
print("The last record is:")
last_record = records[-1]
print(last_record.id)
print(repr(last_record.seq))
print(len(last_record))

The last record is:
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592


In [15]:
# Exploring the annotations 
print(last_record.annotations)

{'molecule_type': 'DNA', 'topology': 'linear', 'data_file_division': 'PLN', 'date': '30-NOV-2006', 'accessions': ['Z78439'], 'sequence_version': 1, 'gi': '2765564', 'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 'source': 'Paphiopedilum barbatum', 'organism': 'Paphiopedilum barbatum', 'taxonomy': ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Paphiopedilum'], 'references': [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]}


Note that the annotations are stored as a Python dictionary, so the usual dictionary methods apply.

In [18]:
print(last_record.annotations.keys())

dict_keys(['molecule_type', 'topology', 'data_file_division', 'date', 'accessions', 'sequence_version', 'gi', 'keywords', 'source', 'organism', 'taxonomy', 'references'])


In [19]:
print(last_record.annotations.values())

dict_values(['DNA', 'linear', 'PLN', '30-NOV-2006', ['Z78439'], 1, '2765564', ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 'Paphiopedilum barbatum', 'Paphiopedilum barbatum', ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Paphiopedilum'], [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]])


In [21]:
print(last_record.annotations["source"])

Paphiopedilum barbatum


In [22]:
print(last_record.annotations["organism"])

Paphiopedilum barbatum
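Which keys are present depends on the file format: a GenBank record carries an organism field, whereas a plain FASTA record has nothing to fill it with. The sketch below, using one hand-made record and one made-up FASTA record, shows how `dict.get()` avoids a `KeyError` when a key is missing. The fallback value "unknown" is an arbitrary choice for this illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A hand-made record with an organism annotation (values are illustrative).
annotated = SeqRecord(
    Seq("ACGT"),
    id="example1",
    annotations={"organism": "Paphiopedilum barbatum"},
)
# A record parsed from FASTA text, which has no organism field.
plain = SeqIO.read(StringIO(">example2 some description\nACGT\n"), "fasta")

organisms = [r.annotations.get("organism", "unknown") for r in (annotated, plain)]
print(organisms)
```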


--- 
### 7.1.1 Exercise
Return a list that contains the organism of each record in the `data/ls_orchid.gbk` file.

Tip: make an empty list, iterate over all the records, access the organism of each record and append it to the list.

In [2]:
from Bio import SeqIO

In [7]:
# Method 1: using the annotations (cleaner)
all_species = []
for seq_record in SeqIO.parse("data/ls_orchid.gbk","genbank"):
    all_species.append(seq_record.annotations["organism"])
    
print(set(all_species))

{'Paphiopedilum glaucophyllum', 'Phragmipedium kaiteurum', 'Phragmipedium warszewiczianum', 'Paphiopedilum primulinum', 'Paphiopedilum adductum', 'Paphiopedilum barbatum', 'Paphiopedilum malipoense', 'Cypripedium californicum', 'Phragmipedium boissierianum', 'Paphiopedilum urbanianum', 'Cypripedium passerinum', 'Paphiopedilum haynaldianum', 'Paphiopedilum sanderianum', 'Mexipedium xerophyticum', 'Paphiopedilum villosum', 'Cypripedium segawai', 'Paphiopedilum niveum', 'Paphiopedilum philippinense', 'Paphiopedilum purpuratum', 'Paphiopedilum acmodontum', 'Paphiopedilum appletonianum', 'Paphiopedilum supardii', 'Paphiopedilum mastersianum', 'Paphiopedilum ciliolare', 'Cypripedium macranthon', 'Cypripedium calceolus', 'Cypripedium himalaicum', 'Paphiopedilum bullenianum', 'Cypripedium parviflorum var. pubescens', 'Paphiopedilum bellatulum', 'Phragmipedium schlimii', 'Paphiopedilum victoria', 'Paphiopedilum bougainvilleanum', 'Phragmipedium wallisii', 'Cypripedium fasciculatum', 'Paphiopedi

In [28]:
# Method 2: using the description (tricky if the organism name is not the second word)
all_species = []
for seq_record in SeqIO.parse("data/ls_orchid.fasta","fasta"):
    all_species.append(seq_record.description.split()[1])
print(all_species)

['C.irapeanum', 'C.californicum', 'C.fasciculatum', 'C.margaritaceum', 'C.lichiangense', 'C.yatabeanum', 'C.guttatum', 'C.acaule', 'C.formosanum', 'C.himalaicum', 'C.macranthum', 'C.calceolus', 'C.segawai', 'C.pubescens', 'C.reginae', 'C.flavum', 'C.passerinum', 'M.xerophyticum', 'P.schlimii', 'P.besseae', 'P.wallisii', 'P.exstaminodium', 'P.caricinum', 'P.pearcei', 'P.longifolium', 'P.lindenii', 'P.lindleyanum', 'P.sargentianum', 'P.kaiteurum', 'P.czerwiakowianum', 'P.boissierianum', 'P.caudatum', 'P.warszewiczianum', 'P.micranthum', 'P.malipoense', 'P.delenatii', 'P.armeniacum', 'P.emersonii', 'P.niveum', 'P.godefroyae', 'P.bellatulum', 'P.concolor', 'P.fairrieanum', 'P.druryi', 'P.tigrinum', 'P.hirsutissimum', 'P.barbigerum', 'P.henryanum', 'P.charlesworthii', 'P.villosum', 'P.exul', 'P.insigne', 'P.gratrixianum', 'P.primulinum', 'P.victoria', 'P.victoria', 'P.glaucophyllum', 'P.supardii', 'P.kolopakingii', 'P.sanderianum', 'P.lowii', 'P.dianthum', 'P.parishii', 'P.haynaldianum', 'P

Besides reading sequences (sequence input), it's also possible to write records out to a file (sequence output). The function you need for this is `Bio.SeqIO.write()`. We will not cover it in detail here, but more information is available on [Biopython's wiki page](https://biopython.org/wiki/SeqIO). The same goes for file format conversions with `Bio.SeqIO.convert()`, which transforms e.g. a GenBank file into a FASTA file. Combining sequence input with output or conversion allows you to manipulate or filter data.
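To give an idea of what such filtering could look like, here is a minimal sketch that reads made-up FASTA records, keeps only those above a length cut-off, and writes them out with `Bio.SeqIO.write()`. The records, the cut-off of 10 and the in-memory `StringIO` output (standing in for a real output file) are all assumptions made for this illustration.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">short\nACGT\n>long\nACGTACGTACGTACGT\n"  # made-up records
min_length = 10  # arbitrary cut-off chosen for this illustration

# A generator expression keeps only the records that pass the filter.
long_records = (
    record
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
    if len(record) >= min_length
)

out_handle = StringIO()  # stands in for an output file
written = SeqIO.write(long_records, out_handle, "fasta")

print(written)
print(out_handle.getvalue())
```

`SeqIO.write()` returns the number of records it wrote, which is a handy sanity check after filtering.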

## 7.2 Parsing sequences from the internet
In the previous sections, we looked at parsing sequence data from a file (using a filename or handle). As discussed in the introduction of this chapter, it's also possible to download and parse sequences from the internet.
Note that just because you can download sequence data and parse it into a SeqRecord object in one go doesn't mean this is a good idea. In general, you should probably download sequences once and save them to a file for reuse.

Let's try to retrieve a sequence from the NCBI website, more specifically the entry with the following ID: [6273291](https://www.ncbi.nlm.nih.gov/nuccore/6273291). We will discuss this more thoroughly in section 7.3; for now, let's have a quick look at how it is done. First, we have to identify ourselves by setting an e-mail address. Then we tell `Entrez.efetch()` which database to access (nucleotide), which format we want (FASTA, returned as plain text) and which ID to fetch. The result comes back as a handle, which we subsequently read with the `Bio.SeqIO.read()` function.

In [50]:
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "hello@its.me"
with Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291") as handle:
    seq_record = SeqIO.read(handle, "fasta")

print(seq_record)

ID: AF191665.1
Name: AF191665.1
Description: AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for chloroplast product, partial intron sequence
Number of features: 0
Seq('TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAA...AGA', SingleLetterAlphabet())


The same works for the GenBank record of this entry:

In [51]:
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="6273291"
) as handle:
    seq_record = SeqIO.read(handle, "gb")
print(f"{seq_record.id} with {len(seq_record.features)} features")

AF191665.1 with 3 features


In [52]:
seq_record.annotations

{'molecule_type': 'DNA',
 'topology': 'linear',
 'data_file_division': 'PLN',
 'date': '07-NOV-1999',
 'accessions': ['AF191665'],
 'sequence_version': 1,
 'keywords': [''],
 'source': 'chloroplast Grusonia marenae',
 'organism': 'Grusonia marenae',
 'taxonomy': ['Eukaryota',
  'Viridiplantae',
  'Streptophyta',
  'Embryophyta',
  'Tracheophyta',
  'Spermatophyta',
  'Magnoliopsida',
  'eudicotyledons',
  'Gunneridae',
  'Pentapetalae',
  'Caryophyllales',
  'Cactineae',
  'Cactaceae',
  'Opuntioideae',
  'Grusonia'],
 'references': [Reference(title='Phylogeny of the subfamily Opuntioideae (Cactaceae)', ...),
  Reference(title='Direct Submission', ...)]}

## 7.3 Accessing NCBI's Entrez database
Entrez (https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) is a data retrieval system that provides
users access to NCBI's databases such as PubMed, GenBank, GEO, and many others. You can access
Entrez from a web browser to enter queries manually, or you can use Biopython's `Bio.Entrez` module for
programmatic access. The latter allows you, for example, to search PubMed or download GenBank records from within a Python script. Read NCBI's usage guidelines [here](https://www.ncbi.nlm.nih.gov/books/NBK25497/) before you start.

The output returned by the Entrez Programming Utilities is typically in XML format. There are several ways to parse
such output, but the most convenient is often to let `Bio.Entrez`'s own parser turn the XML into a Python object.
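For comparison, one of the other options is a generic XML parser from the standard library. The sketch below parses a hand-written, abbreviated sample in the shape of an ESearch reply (a real reply carries more fields) and pulls out the hit count and the ID list by hand, which is the kind of work the `Bio.Entrez` parser does for you.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written sample in the shape of an ESearch XML reply.
sample_xml = """<eSearchResult>
  <Count>542</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>MT683624.1</Id>
    <Id>MK935187.1</Id>
  </IdList>
</eSearchResult>
"""

root = ET.fromstring(sample_xml)
count = int(root.findtext("Count"))  # element text is a string, so convert it
id_list = [element.text for element in root.findall("IdList/Id")]

print(count, id_list)
```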

First we'll have to import Entrez:

In [9]:
from Bio import Entrez

Next, we have to set our e-mail address. NCBI limits the number of requests you can make and may block you in case of excessive usage; with an e-mail address, it can contact you before doing so:

In [56]:
Entrez.email = "hello@itsme.eu"

Now we can use `Bio.Entrez.esearch()` to search any of the NCBI databases. We pass the following parameters:
- db: the database to search, e.g. nucleotide for GenBank records
- term: the search term, in the style of NCBI searches
- idtype: if set to 'acc', ESearch returns accession.version identifiers rather than GI numbers

In [10]:
handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]", idtype="acc")
record = Entrez.read(handle)
print(record["Count"])
print(record["IdList"])


Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


542
['MT683624.1', 'MK935187.1', 'MH659838.1', 'MN016934.1', 'NC_045279.1', 'NC_045278.1', 'NC_045400.1', 'MN602053.1', 'MN535015.1', 'MN535014.1', 'KX886268.1', 'KX886267.1', 'KX886266.1', 'KX886265.1', 'KX886264.1', 'KX886263.1', 'KX886262.1', 'KX886261.1', 'KX886260.1', 'KX886259.1']
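Note that the search reports 542 hits but lists only 20 IDs: by default, ESearch returns the first 20. The `retmax` parameter raises that limit, e.g. `Entrez.esearch(..., retmax=100)`. To make the parameters more tangible, the sketch below assembles the kind of request URL that ends up at NCBI's E-utilities server. It only builds and prints the URL; it is not how `Bio.Entrez` is implemented, and `Bio.Entrez` adds further parameters such as your e-mail address.

```python
from urllib.parse import urlencode

# Base address of the ESearch utility.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "nucleotide",
    "term": "Cypripedioideae[Orgn] AND matK[Gene]",
    "idtype": "acc",
    "retmax": 100,  # ask for up to 100 IDs instead of the default 20
}

# urlencode escapes spaces and brackets so that the term is URL-safe.
query_url = f"{ESEARCH_URL}?{urlencode(params)}"
print(query_url)
```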


Each of these IDs is a GenBank identifier (accession number). We can download the corresponding GenBank record(s) with EFetch, which is what you use when you want to retrieve a full record from Entrez. EFetch covers several databases, and for most of them NCBI supports several file formats. To request a specific format with `Bio.Entrez.efetch()`, specify the optional `rettype` and/or `retmode` arguments (the default is XML). The valid combinations for each database are described on the pages linked from the [NCBI EFetch webpage](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch).

In [64]:
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
print(handle.read())

LOCUS       EU490707                1302 bp    DNA     linear   PLN 26-JUL-2016
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
            chloroplast.
ACCESSION   EU490707
VERSION     EU490707.1
KEYWORDS    .
SOURCE      chloroplast Selenipedium aequinoctiale
  ORGANISM  Selenipedium aequinoctiale
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; Liliopsida; Asparagales; Orchidaceae;
            Cypripedioideae; Selenipedium.
REFERENCE   1  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A., Endara,L.,
            Williams,N.H. and Moore,M.
  TITLE     Phylogenetic utility of ycf1 in orchids: a plastid gene more
            variable than matK
  JOURNAL   Plant Syst. Evol. 277 (1-2), 75-84 (2009)
REFERENCE   2  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TIT

The arguments `rettype="gb"` and `retmode="text"` let us download this record in GenBank format. Alternatively, you could for example use `rettype="fasta"` to get the record in FASTA format.

Note that a more typical use would be to save the sequence data to a local file, and then parse it with
Bio.SeqIO. This can save you having to re-download the same file repeatedly while working on your script,
and places less load on the NCBI's servers. For example:

In [65]:
import os
from Bio import SeqIO
from Bio import Entrez

filename = "data/EU490707.gbk"
if not os.path.isfile(filename):
    # Download the record only if there is no local copy yet
    with Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text") as net_handle:
        # The with statements close both handles, even if something goes wrong
        with open(filename, "w") as out_handle:
            out_handle.write(net_handle.read())
    print("Saved")

print("Parsing...")
record = SeqIO.read(filename, "genbank")
print(record)

Saved
Parsing...
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast
Number of features: 3
/molecule_type=DNA
/topology=linear
/data_file_division=PLN
/date=26-JUL-2016
/accessions=['EU490707']
/sequence_version=1
/keywords=['']
/source=chloroplast Selenipedium aequinoctiale
/organism=Selenipedium aequinoctiale
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliopsida', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Selenipedium']
/references=[Reference(title='Phylogenetic utility of ycf1 in orchids: a plastid gene more variable than matK', ...), Reference(title='Direct Submission', ...)]
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())


Golden rice was created by transforming rice with two beta-carotene biosynthesis genes:

- psy (phytoene synthase) from daffodil (*Narcissus pseudonarcissus*)
- crtI (phytoene desaturase) from the soil bacterium *Erwinia uredovora*

In [None]:
# Search term for the psy gene of daffodil, in the style of NCBI searches
term = "Narcissus pseudonarcissus[Orgn] AND psy[Gene]"

## 7.x Next session
Click here to go to the [next session](08_Biopython_BLAST.ipynb). 