# Reading and writing sequences with BioPython
(Víctor Sojo | vsojo@amnh.org)

In the previous lesson we had our first experience interacting with NCBI through BioPython. We downloaded `4` records that seem to contain the DNA sequence for the gene encoding the 12S rRNA of the red panda.

In this notebook we will continue our **BioPython** training by:
1. Reading back in the records that we downloaded in the previous lesson.
1. Extracting the `features` that contain the sequences of the 12S rRNA gene.
1. Assembling the sequences of the four files into a common FASTA file, which we will use later in the alignment lesson.

**References & recommended reading:**
+ The [_BioPython tutorial_](http://biopython.org/DIST/docs/tutorial/Tutorial.html).
+ Tiago Antao's [_Bioinformatics with Python Cookbook_](https://www.packtpub.com/product/bioinformatics-with-python-cookbook-second-edition/9781789344691).

## Contents
&emsp;[Importing required BioPython modules](#Importing-required-BioPython-modules)<br/>
&emsp;[Re-explore the file with the list of GenBank files](#Re-explore-the-file-with-the-list-of-GenBank-files)<br/>
&emsp;[Reading single-sequence files using BioPython's SeqIO.read\(\) ](#Reading-single-sequence-files-using-BioPython's-SeqIO.read\(\)-)<br/>
&emsp;[Extracting features and their sequences in a BioPython record](#Extracting-features-and-their-sequences-in-a-BioPython-record)<br/>
&emsp;[Creating new records and writing them to a file](#Creating-new-records-and-writing-them-to-a-file)<br/>
&emsp;[Reading multiple independent files the proper way](#Reading-multiple-independent-files-the-proper-way)<br/>
&emsp;[Reading a file with multiple sequences using SeqIO.parse\(\)](#Reading-a-file-with-multiple-sequences-using-SeqIO.parse\(\))<br/>
&emsp;&emsp;[Should I use SeqIO.parse\(\) or SeqIO.read\(\) to load my sequence file?](#Should-I-use-SeqIO.parse\(\)-or-SeqIO.read\(\)-to-load-my-sequence-file?)<br/>

Again, let's make sure that we're using the `bioinfo` environment that we created in the `Py201` notebook:

(If you're on Windows, remember that every line with `! some code` should be changed to `!wsl some code` and you should have an active [WSL installation](https://docs.microsoft.com/en-us/windows/wsl/install-win10))

## Importing required BioPython modules
Here we will need:

Module      | Use
:-----------|:-----------------------------------------
**Bio.SeqIO**   | To handle parsing, reading and writing sequences
**Bio.SeqRecord.SeqRecord** | To create new sequence records
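
A minimal import cell for the two entries in the table might look like the following (assuming BioPython is installed in the active environment):

```python
# Sequence input/output: parsing, reading and writing
from Bio import SeqIO

# Class used to build new sequence records
from Bio.SeqRecord import SeqRecord
```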

## Re-explore the file with the list of GenBank files
In the previous lesson, we used **Entrez** to download data from NCBI. We found `4` records, and stored each of them in a GenBank file. Cleverly, we also stored the names of the files that we created in a file of their own, so that we could automate our analyses. This list file simply contains the locations (names) of those files. We will use it here so that we can read the sequence records back in.

First, let's just take a look at the file:

That looks good. We can use this list file to open each of the GenBank files.

There are 4 files, which makes sense since we had 4 gene IDs.

Once again, let's take a look at the first few lines of one of these files:

---
Go ahead and take a look at the full file in a text editor such as Notepad++ (Win) or Sublime/BBEdit (Mac); or you can just use the Jupyter browser itself.

## Reading single-sequence files using BioPython's `SeqIO.read()` 
Take a look at the second line of the fragment that we just printed above. You'll see that this particular file has the _complete genome_ of the mitochondrial chromosome of the red panda. We don't want the whole chromosome – just the 12S rRNA gene. BioPython makes it easy to extract it.

Here is our plan of action for the next block of code:
1. We will read in the file with the list of GenBank file names.
1. We will use these file names to load the GenBank files one by one.
1. For each of these files, we will turn their contents into a BioPython record using `Bio.SeqIO.read()`.
1. We will then store each of these records into a list that we will call `seqrecs`, so that we can access them later.

A note on that last step: storing the records in a list is fine here because there are only 4 of them, but in typical bioinformatics work you should avoid lists of sequence records as much as possible (further details below).
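
The plan above can be sketched as follows. This is only an illustration under stated assumptions: `load_records` is a hypothetical helper, the list file is assumed to hold one file path per line, and the demo builds two tiny FASTA files in a temporary folder as stand-ins for the GenBank files (for the real files the format would be `"genbank"`).

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def load_records(list_path, fmt="genbank"):
    """Read one record per file named in list_path and return them as a list."""
    records = []
    with open(list_path) as f:
        for line in f:
            path = line.strip()
            if not path:  # skip blank lines
                continue
            # SeqIO.read() expects exactly one record in the file
            records.append(SeqIO.read(path, fmt))
    return records


# Demo with two tiny FASTA files (hypothetical stand-ins for the GenBank files)
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, bases in [("demo1", "ACGTACGT"), ("demo2", "TTGACA")]:
        path = os.path.join(tmp, name + ".fasta")
        SeqIO.write(SeqRecord(Seq(bases), id=name, description=""), path, "fasta")
        paths.append(path)

    # The list file holds one path per line (an assumption of this sketch)
    list_path = os.path.join(tmp, "file_list.txt")
    with open(list_path, "w") as f:
        f.write("\n".join(paths) + "\n")

    seqrecs = load_records(list_path, fmt="fasta")

for rec in seqrecs:
    print(rec.id, len(rec.seq))
```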

We now have our four sequence records in the `seqrecs` list.

⚠️ **Important** ⚠️<br/>
It is typically a very bad idea to store records in a `list`, as we did here. This is because, with lists, all records are loaded into memory together, as opposed to one at a time. With thousands of records and gigabytes of data, this would not be viable. Instead, we should process each record in its entirety, one at a time, whenever possible, and we do so [towards the end of this notebook](#Reading-multiple-independent-files-the-proper-way).

However, for the time being, we will keep going using the list. This will allow us to explore the steps separately from each other, for educational purposes. Since it's only 4 records, we will be fine. Then, at the end of this notebook, we will integrate it all and look at a **production-level alternative**.

## Extracting `features` and their sequences in a BioPython record
As you can see from the `description` of each of the records that we printed above, a couple of those records contain sequences for the entire mitochondrial genome. We don't want the whole mitochondrial chromosome – just the 12S rRNA gene – so we need to extract that information from the GenBank record.

I strongly recommend that you open the file `AM711897.1.gb` in a text editor and take a look at the whole text. In brief, the file contains only one sequence (at the end), but multiple "features", one for each significant portion of DNA in the chromosome (genes, tRNAs, rRNAs and so on). In our case, we're interested in the `12S rRNA` gene, so we can look for that.

Before we do that with BioPython, let's use the advantages of Jupyter's access to bash and take a quick look with `grep`:

You'll see on the left that there are features of several types (`gene`, `rRNA`, `tRNA`). We are looking for a `gene` encoding the `12S rRNA`.

You'll notice also that there's a `gene` encoding a `tRNA` for valine at positions `1035..1102`. Before that one, you'll see our `gene` of interest, which is located at `70..1034`. We can extract it programmatically using BioPython, through the `features` property for each of the 4 records. This will let us filter for only those features classified as a `gene` with `12S rRNA` in the name.

As a first step to get familiarised with the syntax, let's extract and print our desired features:
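
One way this filtering could look is sketched below. The toy record stands in for one entry of `seqrecs`, `is_12s_gene` is a hypothetical helper, and checking the `gene` and `product` qualifiers is an assumption about where the name is stored in the real files.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def is_12s_gene(feature):
    """Return True for `gene` features whose qualifiers mention 12S."""
    if feature.type != "gene":
        return False
    # Assumption: the name sits in the `gene` or `product` qualifier
    names = feature.qualifiers.get("gene", []) + feature.qualifiers.get("product", [])
    return any("12S" in name for name in names)


# Toy record standing in for one entry of `seqrecs`
record = SeqRecord(Seq("AAAACCCCGGGGTTTTACGT"), id="DEMO1.1", description="toy genome")
record.features = [
    SeqFeature(FeatureLocation(4, 12), type="gene", qualifiers={"gene": ["12S rRNA"]}),
    SeqFeature(FeatureLocation(12, 16), type="tRNA", qualifiers={"product": ["tRNA-Val"]}),
]

for feature in record.features:
    if is_12s_gene(feature):
        # extract() slices the parent sequence at the feature's location
        gene_seq = feature.extract(record.seq)
        print(record.id, feature.location, len(gene_seq))
```

Keep in mind that GenBank files use 1-based, inclusive coordinates, whereas BioPython locations are 0-based and end-exclusive, so a feature written as `70..1034` in the file corresponds to the Python slice `[69:1034]`.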

That was great. But here we just printed a bit of the information out. What we'd really like to do is save each of those sequences to a common FASTA file, so that we can compare them and check if they're the same (at least one of them won't be, since it's much shorter than the rest). Let's do that next.

## Creating new records and writing them to a file
Let's modify our last bit of code slightly so that, instead of just printing out that we found our desired 12S rRNA gene, we actually create a new sequence record with just that relevant bit and put it out to a common FASTA file.

(Note: We are calling this file `unaligned` because the sequences have no positional reference to one another at this point. We will align them in the next lesson)
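
As a rough sketch of the idea, the snippet below wraps two toy sequences in new `SeqRecord` objects and writes them to one FASTA file. The ids, the description text and the file name `demo_unaligned.fasta` are all made up for illustration, and the file is created in a temporary folder so that nothing in the working directory is touched.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy (id, sequence) pairs standing in for the extracted 12S rRNA genes
extracted = [("DEMO1.1", "CCCCGGGG"), ("DEMO2.1", "CCCCGGGA")]

# Wrap each extracted sequence in a fresh record with its own id and description
new_records = [
    SeqRecord(Seq(bases), id=acc, description="12S rRNA gene (toy example)")
    for acc, bases in extracted
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "demo_unaligned.fasta")  # hypothetical file name
    # SeqIO.write() returns the number of records written
    n_written = SeqIO.write(new_records, out_path, "fasta")
    with open(out_path) as f:
        fasta_text = f.read()

print(n_written)
print(fasta_text)
```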

Good! Let's take a look at the full file. I recommend that you open it outside (e.g. through the Jupyter file navigator), but we can also use the shell command `cat` to see its entire contents:

⚠️ This file was short, so it was OK to print it all to screen with `cat`. That's often not the case in bioinformatics, where you should instead use `grep` or `head`.

---
## Reading multiple independent files the proper way
Above we loaded all the records into a list. This would have been a terrible idea had we been working with thousands of records (not uncommon at all in bioinformatics).

Since in this case we didn't need to process the records together (they are all independent from each other), we could have just done everything in the very first `for` loop. That way, we would release the memory at the end of each iteration, loading only one record into memory at any given time. 

In this section we explore how that would look.

#### First, we define the names of the input and output files

#### Then we run the entire process above, but in a single `for` loop
(i.e., we clear the output file, we open the file with the list of input files, we then open each input file one at a time, extract any `12S rRNA` records, and export those to the multi-fasta output file).
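
A minimal sketch of such a loop is shown below, under several assumptions: `export_12s` is a hypothetical helper, the list file holds one path per line, the gene name is assumed to live in the `gene` qualifier, and the demo writes two toy GenBank files to a temporary folder so that the example runs on its own.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def export_12s(list_path, out_path):
    """Stream 12S rRNA genes from each listed GenBank file into one FASTA file."""
    open(out_path, "w").close()  # clear the output file
    n_written = 0
    with open(list_path) as f:
        for line in f:
            path = line.strip()
            if not path:
                continue
            record = SeqIO.read(path, "genbank")  # only one record in memory at a time
            for feature in record.features:
                # Assumption: the gene name is stored in the `gene` qualifier
                names = feature.qualifiers.get("gene", [])
                if feature.type == "gene" and any("12S" in n for n in names):
                    gene = SeqRecord(feature.extract(record.seq), id=record.id,
                                     description="12S rRNA gene")
                    with open(out_path, "a") as g:  # append to the common file
                        n_written += SeqIO.write(gene, g, "fasta")
    return n_written


# Demo: two toy GenBank files created on the fly (not the real red panda data)
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for acc, bases in [("DEMO1", "AAAACCCCGGGGTTTT"), ("DEMO2", "AAAACCCCGGGATTTT")]:
        rec = SeqRecord(Seq(bases), id=acc, name=acc, description="toy genome")
        rec.annotations["molecule_type"] = "DNA"  # needed by the GenBank writer in recent BioPython
        rec.features.append(
            SeqFeature(FeatureLocation(4, 12), type="gene", qualifiers={"gene": ["12S rRNA"]})
        )
        path = os.path.join(tmp, acc + ".gb")
        SeqIO.write(rec, path, "genbank")
        paths.append(path)

    list_path = os.path.join(tmp, "file_list.txt")
    with open(list_path, "w") as f:
        f.write("\n".join(paths) + "\n")

    out_path = os.path.join(tmp, "demo_unaligned.fasta")  # hypothetical file name
    n_written = export_12s(list_path, out_path)
    seqs = [str(r.seq) for r in SeqIO.parse(out_path, "fasta")]

print(n_written, seqs)
```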

⚠️ Note that in the second `with` we use `g` as the name of the file handle, as opposed to the more traditional `f`. This is because we're still inside the upper `with`, in which we had already used `f` as the file handle.

---
## Reading a file with multiple sequences using `SeqIO.parse()`
We don't need to, but just so we know how to do it, let's read the file back in, using BioPython.

⚠️ Above, when we read the GenBank files that had a _single sequence_, we used **`SeqIO.read()`**. Here, since we're reading a file with _multiple sequences_, we instead use **`SeqIO.parse()`**.

So, `SeqIO.parse()` gives us an iterator, in which each item corresponds to one of the independent sequences in the file, in this case our 4 sequences.
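
The sketch below illustrates this behaviour with a toy two-record FASTA string read through `StringIO`, so no file on disk is needed. The ids and sequences are invented for the example.

```python
from io import StringIO

from Bio import SeqIO

# Toy multi-FASTA text standing in for the file written earlier
fasta_text = ">DEMO1.1 12S rRNA gene\nCCCCGGGG\n>DEMO2.1 12S rRNA gene\nCCCCGGGA\n"

records = SeqIO.parse(StringIO(fasta_text), "fasta")  # an iterator, not a list

# First pass: each item is one SeqRecord
first_pass = [(rec.id, str(rec.seq)) for rec in records]

# Second pass over the same iterator: it is already exhausted
second_pass = [rec.id for rec in records]

print(first_pass)
print(second_pass)
```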

You'll see that BioPython nicely summarises the sequence and prints some useful information. This is the information that we gave when we created the FASTA file above.

Note: If you try to rerun the last `for` loop, you'll get nothing, because iterators run only once, as you surely know by now 😉. You'd have to go one further cell up to regenerate the iterator.

### Should I use `SeqIO.parse()` or `SeqIO.read()` to load my sequence file?
+ Use **`SeqIO.read()`** to load files that contain **only one sequence** (e.g., a single `>` in the case of a FASTA file). This returns a single record.
+ Use **`SeqIO.parse()`** to load files with **multiple sequences** (multiple `>` headers in the case of a FASTA file). This returns an iterator of records.

Note that a GenBank record may have multiple `feature`s, but it has only one _sequence_, so a GenBank file that holds a single record, like ours, should be read with `read()`, as we did above.

Note also that `.parse()` returns an `iterator`, in which each item is a record that corresponds to each of the sequences in the original file.<br/>
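Conversely, `.read()` returns a BioPython sequence record directly, so you should only use it if the file contains exactly one sequence: it raises a `ValueError` if the file has no records or more than one. If you only care for the first record of a multi-sequence file, call `next()` on the iterator returned by `.parse()` instead.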
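
The difference can be checked with a small experiment on a toy two-record FASTA string (the ids and sequences are invented for illustration):

```python
from io import StringIO

from Bio import SeqIO

multi_fasta = ">seq1\nACGT\n>seq2\nTTGA\n"

# parse() handles any number of records
ids = [rec.id for rec in SeqIO.parse(StringIO(multi_fasta), "fasta")]

# read() insists on exactly one record and raises ValueError otherwise
try:
    SeqIO.read(StringIO(multi_fasta), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("read() failed:", err)

# To take only the first record of a multi-record file, call next() on parse()
first = next(SeqIO.parse(StringIO(multi_fasta), "fasta"))

print(ids, first.id)
```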

In the next lesson, we will work with sequence alignments. As a starter, we will reload the FASTA file that we made here to contain the 4 red panda 12S rRNA sequences. We will then align it, and take a look at the alignment.