# 7. Sequence input/output

Content:
- 7.1 Sequence input
- 7.2 Parsing sequences from the internet
- 7.3 Accessing NCBI's Entrez database
    

The `SeqIO` module provides a simple interface for reading and writing sequence file formats. It works with `SeqRecord` objects, each of which holds a `Seq` object (as seen in the previous chapter) together with annotations such as an identifier and a description.

In [None]:
# Import SeqIO
from Bio import SeqIO

## 7.1 Sequence input
The main function for reading sequence data is `Bio.SeqIO.parse()`. It expects two arguments:
1. An explicit **path to a file or a filehandle**. A handle is typically a file opened for reading, but it can also wrap data downloaded from the internet.
2. A lower case string specifying the **sequence format**. Examples are: clustal, fasta, embl, fastq, genbank or gb, pdb-atom, swiss, uniprot-xml,... This argument is required: `SeqIO` does not guess the file format, because [*explicit is better than implicit*](https://www.python.org/dev/peps/pep-0020/).

The output of `Bio.SeqIO.parse()` is a `SeqRecord` iterator. 


In [None]:
# Here we're using the explicit path to a fasta file (in the data folder)
for seq_record in SeqIO.parse("data/ls_orchid.fasta", "fasta"):
    print(seq_record.id)
#    print(repr(seq_record.seq))
#    print(len(seq_record))

Alternatively, the same is also possible by using a handle. The `with` statement makes sure that the file is properly closed after reading it. 

In [None]:
# Similar to reading files in Python with a filehandle
with open("data/ls_orchid.fasta", "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)
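
Any text handle works here, not only a file opened from disk. The sketch below is a minimal illustration that runs without the `data` folder: the two FASTA records are made up for this example and wrapped in `io.StringIO`, which behaves like an open file.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text with two records (not taken from ls_orchid.fasta)
fasta_text = """>demo1 first toy sequence
ACGTACGTAC
>demo2 second toy sequence
GGGTTTAAA
"""

# StringIO wraps the string in a text handle, much like open() does for a file
handle = StringIO(fasta_text)

# Collect the identifier and length of every record the iterator yields
summary = [(record.id, len(record)) for record in SeqIO.parse(handle, "fasta")]

for identifier, length in summary:
    print(identifier, length)
```

Note that `record.id` is only the first word of the header line; the full header is kept in `record.description`.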

It should be clear now that `Bio.SeqIO.parse()` returns an *iterator* which yields `SeqRecord` objects. In the following cells we'll showcase a couple of variants. Let's start with a more elegant, one-liner approach to retrieve **only the IDs**.

In [None]:
identifiers = [seq_record.id for seq_record in SeqIO.parse("data/ls_orchid.gbk","genbank")]

If you're interested in the **first record** of this file:

In [None]:
# One special case to consider is when your sequence files have multiple records, but you only want the first one. 
first_record = next(SeqIO.parse("data/ls_orchid.gbk", "genbank"))

If there is only one record in the file, you might as well use the `Bio.SeqIO.read()` function. It takes the same two arguments and returns a single `SeqRecord` object instead of an iterator. It raises an error if the file contains no records or more than one record.
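
The difference between `read()` and `parse()` can be sketched with in-memory data. The FASTA strings below are invented for illustration and wrapped in `io.StringIO` so the cell runs without any file.

```python
from io import StringIO

from Bio import SeqIO

# Toy FASTA text: one string with a single record, one with two records
one_record = ">demo1 toy sequence\nACGTACGT\n"
two_records = one_record + ">demo2 another toy sequence\nGGGTTT\n"

# Exactly one record: read() returns the SeqRecord itself, not an iterator
record = SeqIO.read(StringIO(one_record), "fasta")
print(record.id, len(record))

# More than one record: read() refuses and raises a ValueError
try:
    SeqIO.read(StringIO(two_records), "fasta")
    outcome = "read succeeded"
except ValueError as error:
    outcome = f"ValueError: {error}"
print(outcome)
```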

Iterators are great when you only need the records one by one, in the order found in the file. For some tasks you may need **random access to the records in any order**. In that case, turn the iterator into a list with the built-in `list()` function.

In [None]:
records = list(SeqIO.parse("data/ls_orchid.gbk","genbank"))
print(f"Found {len(records)} records")

Let's explore this list of records:

In [None]:
# Retrieve the third record:
print(records[2])

In [None]:
# Explore the last record by first saving it in a separate variable:
last_record = records[-1]
print(last_record.id)
print(repr(last_record.seq))
print(len(last_record))

--- 
### 7.1.1 Exercise
Make a list that contains the organism of each record in the `data/ls_orchid.gbk`-file. 

Tip: make an empty list, iterate over all the records, access the organism of each record and append it to the list.

---

Besides reading sequences (sequence input), it's also possible to write records out to a file (sequence output). The function that you'll need for this is `Bio.SeqIO.write()`. We will not cover it in detail here, but more information is available on [Biopython's wiki page](https://biopython.org/wiki/SeqIO). The same goes for file format conversions with `Bio.SeqIO.convert()`, which transforms e.g. a GenBank file into a FASTA file. Combining sequence input, output and conversion allows you to manipulate or filter data.
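
For orientation, here is a minimal sketch of both functions. The two records are invented, and `io.StringIO` handles stand in for real files so that nothing is written to disk. The conversion goes from FASTA to Biopython's simple tab-separated `"tab"` format rather than from GenBank to FASTA, only to keep the example short.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two toy records, invented for this example
records = [
    SeqRecord(Seq("ACGTACGT"), id="demo1", description="toy sequence one"),
    SeqRecord(Seq("GGGTTTAA"), id="demo2", description="toy sequence two"),
]

# write() takes the records, a filename or handle, and a format string;
# it returns the number of records written
fasta_handle = StringIO()
n_written = SeqIO.write(records, fasta_handle, "fasta")
fasta_text = fasta_handle.getvalue()
print(f"Wrote {n_written} records")

# convert() reads one format and writes another in a single call
tab_handle = StringIO()
n_converted = SeqIO.convert(StringIO(fasta_text), "fasta", tab_handle, "tab")
tab_text = tab_handle.getvalue()
print(tab_text)
```

With real files you would pass filenames instead of handles, for example an input path, its format, an output path and the output format.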


On a side note, sequence alignment file formats can be read with `Bio.SeqIO`, but the newer `Bio.AlignIO` is designed to work with such alignment files directly.
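
The difference can be sketched with a toy alignment. The two gapped sequences below are made up and stored in FASTA format, which both modules can read: `SeqIO` returns separate records, while `AlignIO` returns a single alignment object.

```python
from io import StringIO

from Bio import AlignIO, SeqIO

# A toy two-sequence alignment in FASTA format (gaps shown as "-")
aligned_fasta = ">seqA\nACG-TAC\n>seqB\nACGTTAC\n"

# SeqIO sees two independent records
records = list(SeqIO.parse(StringIO(aligned_fasta), "fasta"))

# AlignIO sees one alignment whose rows all have the same length
alignment = AlignIO.read(StringIO(aligned_fasta), "fasta")
n_rows = len(alignment)
n_columns = alignment.get_alignment_length()

print(f"{len(records)} records, alignment of {n_rows} rows x {n_columns} columns")
```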

## 7.2 Parsing sequences from the internet
In the previous sections, we looked at parsing sequence data from a file (using a filename or handle). As discussed in the introduction of this chapter, it's also possible to **download and parse sequences from the internet**.
Note that just because you can download sequence data and parse it into a `SeqRecord` object in one go doesn't mean this is a good idea. In general, you should probably **download sequences once and save them to a file for later usage**.

Entrez (https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) is a data retrieval system that provides users access to NCBI's databases such as PubMed, GenBank, GEO, and many others. You can access Entrez from a web browser to manually enter queries, or you can use Biopython's `Bio.Entrez` module for programmatic access to Entrez. The latter allows you, for example, to search PubMed or download GenBank records from within a Python script. Read NCBI's usage guidelines [here](https://www.ncbi.nlm.nih.gov/books/NBK25497/) before you start.

Let's try to retrieve a sequence from the NCBI website, more specifically the entry with the following ID: [6273291](https://www.ncbi.nlm.nih.gov/nuccore/6273291). 
- First, we identify ourselves by setting `Entrez.email`. This is standard practice when fetching data from NCBI programmatically.
- Then, we call `efetch()` with the database to query (`db="nucleotide"`), the record format (`rettype="fasta"`), the retrieval mode (`retmode="text"`) and the `id` of the entry. It returns a handle, which we subsequently read with the `Bio.SeqIO.read()` function.

In [None]:
# Imports
from Bio import Entrez
from Bio import SeqIO

# Identify yourself for NCBI
Entrez.email = "hello@its.me"

# Create a handle to some specific data (a sequence in fasta format)
with Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291") as handle:
    # Read in the fasta file
    seq_record = SeqIO.read(handle, "fasta")

print(seq_record)

The same works for the GenBank file of this entry:

In [None]:
# Downloading genbank file of the same ID
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="6273291") as handle:
    # Read in the genbank file
    seq_record = SeqIO.read(handle, "gb")
    
# Print out the results
print(f"{seq_record.id} with {len(seq_record.features)} features")

In [None]:
# Check some annotations of the genbank (SeqRecord) data
seq_record.annotations

## 7.3 Accessing NCBI's Entrez database

Although we explicitly requested text output in the examples above, the output returned by `Entrez` is typically in XML format. There are several ways to parse such output, but the most appropriate is often `Bio.Entrez`'s own parser, which turns the XML into a Python object.

In [None]:
# Imports
from Bio import Entrez

Next, we enter our e-mail address. NCBI limits the number of requests you can make and may block you in case of excessive use, so it wants to know who is sending the requests:

In [None]:
# Identify yourself
Entrez.email = "hello@itsme.eu"

Now we can use `Bio.Entrez.esearch()` to search any of the NCBI databases. We pass it the following parameters:
- db: the database to search, e.g. nucleotide for GenBank records
- term: the search term, in the style of NCBI searches
- idtype: if idtype is set to 'acc', ESearch will return accession.version identifiers rather than GI (GenInfo Identifier) numbers.

In [None]:
# Search through NCBI nucleotide database for a gene from an organism. 
handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]", idtype="acc")
# Parse the XML result into a Python object, then close the handle
record = Entrez.read(handle)
handle.close()
print(record["Count"])
print(record["IdList"])


Each of these IDs is a GenBank identifier (accession number). We can download the corresponding GenBank record(s) with `efetch()`, which is the function to use when you want to retrieve a full record from Entrez. It covers several of the NCBI databases, and for most of them NCBI supports several different file formats. Requesting a specific file format from Entrez using `Bio.Entrez.efetch()` requires specifying the `rettype` and/or `retmode` (default = xml) optional arguments. The different combinations are described for each database type on the pages linked from the [NCBI efetch webpage](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch).

In [None]:
# Fetch the GenBank record as plain text; the with statement closes the handle afterwards
with Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text") as handle:
    print(handle.read())

The arguments `rettype="gb"` and `retmode="text"` let us download this record in the GenBank format. Alternatively, you could for example use `rettype="fasta"` to get the FASTA format.

**Note** that a more typical use would be to save the sequence data to a local file, and then parse it with `Bio.SeqIO`. This saves you from re-downloading the same file repeatedly while working on your script, and places less load on NCBI's servers. For example:

In [None]:
# Imports
import os
from Bio import SeqIO
from Bio import Entrez

# Name of the new file with the genbank record
filename = "data/EU490707.gbk"
# Check whether file exists
if not os.path.isfile(filename):
    # Download the entry from the NCBI nucleotide database as text; efetch returns a handle
    with Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text") as net_handle:
        # Make filehandle to where the file will be written to
        with open(filename, "w") as out_handle:
            # Write out the information to the file
            out_handle.write(net_handle.read())
    
# Read the record back in
record = SeqIO.read(filename, "genbank")
print(record)

## 7.4 Next session
Click here to go to the [next session](08_Biopython_BLAST.ipynb). 