# Today's session

In today's session, we will get to use some of the genomics tools that were introduced in the theoretical lectures. In particular, we will see how to work with (mostly nucleotide) sequences, align them, and so on. To do this, we will be using `Biopython`, a comprehensive Python module that allows you to do these and many other things. For a complete tour of `Biopython`, take a look at the [complete tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html).

In the final exercise for this class you will be working on the following case study:

>A few days ago, a patient was admitted to the hospital where you work. The patient was coughing, and had difficulty breathing, as well as a fever---she is suspected to have COVID-19. Your colleagues Oscar and Maria, from the Omics platform at the hospital, have been able to isolate a viral genome from the patient. Here, you are expected to verify that the genome corresponds to the SARS-CoV-2 virus, and to investigate which variant of the virus has infected the patient by studying the mutations in the spike protein.

# What is Biopython?

The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

![Biopython](Media/biopython_logo.png)


The main Biopython releases have lots of functionalities, including:

* The ability to parse bioinformatics files into Python utilizable data structures, including support for the following formats:
    - Blast output – both from standalone and WWW Blast
    - Clustalw
    - FASTA
    - GenBank
    - PubMed and Medline
    - ExPASy files, like Enzyme and Prosite
    - SCOP, including ‘dom’ and ‘lin’ files
    - UniGene
    - SwissProt 
* Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
* Code to deal with popular on-line bioinformatics destinations such as:
    - NCBI – Blast, Entrez and PubMed services
    - ExPASy – Swiss-Prot and Prosite entries, as well as Prosite searches 
* Interfaces to common bioinformatics programs such as:
    - Standalone Blast from NCBI
    - Clustalw alignment program
    - EMBOSS command line tools 
* A standard sequence class that deals with sequences, ids on sequences, and sequence features.
* Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
* Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
* Code making it easy to split up parallelizable tasks into separate processes.
* GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
* Extensive documentation and help with using the modules, including the tutorial, on-line wiki documentation, the web site, and the mailing list.
* Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects.

# Sequences in `Biopython`

We will start with a quick introduction to the `Biopython` class for dealing with sequences. Most of the time, when we think of sequences, what we have in mind is a string of letters like 'AGTACACTGGT'. In `Biopython` you can create a `Seq` instance from such a sequence as follows:

In [None]:
from Bio.Seq import Seq

my_seq = Seq("AGTACACTGGT")
display(my_seq)

## Transcription and translation

So, in the example above we are dealing with a DNA sequence. Since this is a DNA sequence, we can do things with it that we could not do with a sequence of amino acids. For example, assuming that the `coding_dna` sequence defined below corresponds to the 5'-to-3' coding strand of the DNA, we can obtain the complementary strand, also written 5'-to-3', using `reverse_complement()`:

In [None]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
display(coding_dna.reverse_complement())

Or, we can **transcribe** this DNA sequence into an RNA sequence:

In [None]:
messenger_rna = coding_dna.transcribe()
display(messenger_rna)

Translation is a bit trickier, because codons are translated differently in different organisms and even in different cellular compartments. Here is what happens if we translate an RNA sequence without any parameters:

In [None]:
messenger_rna.translate()

You should notice in the above protein sequence that, in addition to the end stop character `*`, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (genetic codes).

The translation tables available in Biopython are based on those from [the NCBI](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). By default, translation will use the standard genetic code (NCBI table id 1). But suppose we are dealing with a mitochondrial sequence---we need to tell the translation function to use the relevant genetic code instead:

In [None]:
messenger_rna.translate(table="Vertebrate Mitochondrial")
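As a complement to the cells above, the following sketch repeats the translation with two optional arguments of `translate()`: `to_stop=True`, which stops at the first in-frame stop codon, and a numeric `table`. We assume here that NCBI table 2 is the vertebrate mitochondrial code, as listed on the NCBI page linked above. The sketch uses `print()` instead of `display()` so that it also runs outside a notebook.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Translating the coding DNA directly, with the standard genetic code
standard = coding_dna.translate()

# Stop at the first in-frame stop codon (the '*' itself is not included)
standard_to_stop = coding_dna.translate(to_stop=True)

# Same, but with NCBI table 2 (assumed: vertebrate mitochondrial code)
mito_to_stop = coding_dna.translate(table=2, to_stop=True)

print(standard)
print(standard_to_stop)
print(mito_to_stop)
```

With the standard code the protein ends at the internal stop, whereas with the mitochondrial code the same codon is read as an amino acid and translation continues to the final stop.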

## Otherwise, sequences are similar to strings

Other than the situations described above, most of the time `Seq` instances can be dealt with just like strings. For example, we can extract the elements of a sequence using indices, or slice a sequence as we would slice a string: 

In [None]:
# Seq instances work, mostly, as strings
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
display(my_seq[4])
display(my_seq[4:12])
print('Length of the sequence:      ', len(my_seq))
print('Number of As in the sequence:', my_seq.count('A'))
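Because `Seq` supports `count()` and `len()`, simple statistics follow directly. As a small sketch (the variable names `gc_percent`, `longer_seq`, and `plain` are ours), the GC content of the sequence above can be computed by hand, sequences can be concatenated with `+`, and `str()` converts a `Seq` into a plain Python string when one is needed:

```python
from Bio.Seq import Seq

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")

# Percentage of G and C bases, computed from plain counts
gc_percent = 100 * (my_seq.count("G") + my_seq.count("C")) / len(my_seq)
print("GC content (%):", gc_percent)

# Concatenation works as with strings and returns a new Seq
longer_seq = my_seq + Seq("AAAA")
print("Length after concatenation:", len(longer_seq))

# Convert to a plain string, for example to pass it to code that expects str
plain = str(my_seq)
print(type(plain).__name__, plain[:10])
```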

# Reading and writing annotated sequence files

Typically, we do not want to create `Seq` instances by manually entering the sequence of letters (bases or amino acids). Rather, we usually want to read a bunch of annotated sequences from a file or from some remote database. By annotated sequence here we mean a sequence to which additional information has been added (for example, about the organism, the positions of coding sequences within the sequence, etc.). Here we will learn how to read (and write) annotated sequences from (and to) files. First, we need to import the input/output functions in `Biopython`:

In [None]:
from Bio import SeqIO

## Reading GenBank

Now, we are going to read the file `Files/ls_orchid.gb`, which contains several annotated sequences in GenBank format. But before we read it using `Biopython`...

***
### Exercise

Open the file `Files/ls_orchid.gb` using a text editor such as Notepad or Word. Familiarize yourself with the format of the file. Copy the first complete annotated record in the file and paste it in the cell below. (When you are done, do not press `Ctrl+Enter`; if you do, it is OK, but it will look ugly.)

*Write your answer here.*

***

Now, let's do something similar but using `Biopython` and the `SeqIO.parse()` method:

In [None]:
# Read the GenBank file and create a list with the records
records = list(SeqIO.parse('Files/ls_orchid.gb', 'genbank'))

print('The file comprises', len(records), 'records')

Let us now select the first record and print it:

In [None]:
print(records[0])

The `SeqRecord` class that holds annotated sequences has the following attributes:

**.seq**
– The sequence itself, typically a Seq object.

**.id**
– The primary ID used to identify the sequence – a string. In most cases this is something like an accession number.

**.name**
– A “common” name/id for the sequence – a string. In some cases this will be the same as the accession number, but it could also be something else.

**.description**
– A human-readable description or expressive name for the sequence – a string.

**.letter_annotations**
– Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores or secondary structure information.

**.annotations**
– A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more “unstructured” information to the sequence.

**.features**
– A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence).

**.dbxrefs**
– A list of database cross-references as strings.

For example:

In [None]:
# Print some of the attributes of the first annotated record in the file
print('SEQUENCE:', records[0].seq, '\n')
print('SEQUENCE LENGTH:', len(records[0].seq), '\n')
print('FEATURES:\n')
for feature in records[0].features:
    print(feature)

From the listed features, we see that this record contains a gene (from a species named *Cypripedium irapeanum*) that codes for ribosomal RNA. The gene itself is encoded in bases 380 to 550, and before and after that we have some spacer DNA. 
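To make the link between features and sequence positions concrete, here is a minimal sketch that builds a small `SeqRecord` by hand and slices out the region covered by a feature. The record, its ID (`demo_1`), and the feature coordinates are made up for illustration and do not come from `Files/ls_orchid.gb`.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# A made-up record: 4 bases of spacer, a 6-base "gene", 4 bases of spacer
demo_record = SeqRecord(
    Seq("TTTTATGCCCGGGG"),
    id="demo_1",
    name="demo",
    description="Hypothetical record for illustration",
)

# Locations use Python-style coordinates: start is 0-based, end is exclusive
demo_gene = SeqFeature(
    FeatureLocation(4, 10),
    type="gene",
    qualifiers={"gene": ["demo_gene"]},
)
demo_record.features.append(demo_gene)

for feature in demo_record.features:
    start = int(feature.location.start)
    end = int(feature.location.end)
    # extract() returns the part of the sequence covered by the feature
    print(feature.type, start, end, feature.extract(demo_record.seq))
```

Note that the coordinates follow Python slicing conventions, so a feature printed as `[380:550]` covers bases 381 to 550 when counting from 1.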

***
### Exercise

Using `Biopython`, select the record **with the longest sequence** in `Files/ls_orchid.gb`, and print its features. Then, explain in words what each feature corresponds to.

In [None]:
# Write your code here

*Describe each feature succinctly*

***

## Reading FASTA

Another common file format for sequences is FASTA. As we have seen above, GenBank annotated records are quite complete. Now compare with the annotation in typical FASTA files:

In [None]:
# Read the FASTA file and create a list with the records
frecords = list(SeqIO.parse('Files/ls_orchid.fasta', 'fasta'))

print('The file comprises', len(frecords), 'records\n')
print(frecords[0])

The resulting `SeqRecord` has the same structure as before, but the annotation is much poorer than in a typical GenBank file.

## Writing annotated sequence files

Finally, to write a bunch of annotated sequences to a file we can use the `SeqIO.write()` function. For example, to write the first 5 records of the GenBank file that we read above to a file named `Files/five_records.fasta` in FASTA format, we need to do:

In [None]:
SeqIO.write(records[:5], 'Files/five_records.fasta', 'fasta')
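`SeqIO.write()` returns the number of records it wrote, and both `SeqIO.write()` and `SeqIO.parse()` accept an open handle as well as a file name. The sketch below uses this to do a round trip entirely in memory with `io.StringIO`; the two records are invented for the example.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records standing in for records read from a file
demo_records = [
    SeqRecord(Seq("ACGTACGT"), id="demo_1", description="first hypothetical record"),
    SeqRecord(Seq("GGGCCC"), id="demo_2", description="second hypothetical record"),
]

# Write to an in-memory text buffer instead of a file on disk
buffer = StringIO()
n_written = SeqIO.write(demo_records, buffer, "fasta")
print(n_written, "records written")

# Rewind the buffer and parse it back
buffer.seek(0)
read_back = list(SeqIO.parse(buffer, "fasta"))
print([rec.id for rec in read_back])
```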

***

### Exercise

Using `Biopython`, open the `Files/five_records.fasta` file that you just created and verify that it contains 5 records.

In [None]:
# Write your code here

### Exercise

Using `Biopython`, write the first 10 records in `records` to a file named `Files/ten_records.gb` in GenBank format.

In [None]:
# Write your code here

***

# Getting annotated sequences from online databases

We have seen how to read local GenBank or FASTA files. Therefore, at this point we should be able to go online, manually obtain whatever files we need, and then read them using `Biopython`. However, if we need to get many files this is impractical. Here, we will see how to get annotated sequence files **from online databases directly**, that is, without having to manually download them.

`Biopython` provides interfaces for several online databases. Here we will focus on obtaining annotated records from the [Entrez database](https://www.ncbi.nlm.nih.gov/search/). So, first, we import the Entrez functionalities in `Biopython`.

In [None]:
from Bio import Entrez

The `.efetch()` method in `Entrez` allows us to "fetch" entries from the database. Let's first see how to use it in practice:

In [None]:
# First, we need to identify ourselves to Entrez
Entrez.email = "name.surname@urv.cat" # MODIFY with your actual name and surname!

# Fetch one record by its ID (Z78502.1) and store it in a list of SeqRecord instances, which we call entrez_recs
with Entrez.efetch(db="nucleotide", id='Z78502.1', rettype="gb", retmode="text") as my_handle:
    entrez_recs = list(SeqIO.parse(my_handle, "genbank"))

# Show some stuff    
print(len(entrez_recs), 'record(s) read\n')
print(entrez_recs[0])

As illustrated in this example, the `efetch()` method needs several arguments. First, we need to specify which database within Entrez we are interested in. In this case, we have specified that we want to work with the `nucleotide` database. Second, we need to pass the ID of the entry we are interested in, in this example Z78502.1. Finally, we need to specify the format of the annotated record ('gb' for GenBank in this case, but it could be FASTA or others), and the format of the output (text).

Importantly, the `.efetch()` method does *not* return a list of `SeqRecord` instances directly. Rather, **`.efetch()` returns a "handle"**, which can be understood as **a file that is open for reading**. We stored this handle in the variable `my_handle` and **then parsed it with `SeqIO.parse()`**, just as we read annotated sequences from files earlier.

We can also get several annotated sequences in a single search as follows:

In [None]:
# Fetch multiple records by their ID and store them in the entrez_recs list of SeqRecord instances.
# Note that the multiple IDs are passed as a single string of comma-separated IDs, rather than a list of IDs.
with Entrez.efetch(db="nucleotide", id='AF191665.1, AF191664.1, AF191663.1', rettype="gb", retmode="text") as my_handle:
    entrez_recs = list(SeqIO.parse(my_handle, "genbank"))
print(len(entrez_recs), 'record(s) read\n')
print(entrez_recs[0])
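If the IDs are stored in a Python list, the comma-separated string used above can be built with `str.join()`. This sketch only prepares the string and does not contact Entrez; `id_list` and `id_string` are names chosen for the example.

```python
# Accession numbers from the cell above, kept in a list
id_list = ["AF191665.1", "AF191664.1", "AF191663.1"]

# Join them into the single comma-separated string passed as `id`
id_string = ",".join(id_list)
print(id_string)
```

The resulting string can then be passed as `id=id_string` in the call to `Entrez.efetch()`.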

OK, this is a good point to introduce a useful Python (not just Biopython!) trick: the `help()` function, which lets you learn more about a given module, function, class or whatever. If you want to obtain more information about `.efetch()` (or any other method), you can run the following:

In [None]:
help(Entrez.efetch)

# Sequence alignments and BLAST

So far, we have learned how to manage sequences and annotated sequences, including reading and writing files and obtaining information from remote databases. Last, we turn to the problem of sequence alignment. We start by considering local and global pairwise alignments (alignments of two sequences), and then turn to sequence search in databases using BLAST.

## Pairwise alignments

The `Bio.Align.PairwiseAligner` class implements the Needleman-Wunsch, Smith-Waterman, Gotoh (three-state), and Waterman-Smith-Beyer global and local pairwise alignment algorithms. Let's see a simple example:

In [None]:
# Import Align
from Bio import Align

# Define the sequences we intend to align
seq1 = Seq("GAACTAAATTCGTA")
seq2 = Seq("GATAAATAGTA")

# Create the pairwise aligner, which you will use to make the alignment
aligner = Align.PairwiseAligner()

# Align the two sequences
alignments = aligner.align(seq1, seq2)

# Show results
for alignment in alignments:
    print('SCORE:', alignment.score)
    print(alignment)

The aligner object (an instance of the `PairwiseAligner` class) is the "machine" that will allow you to do the alignment. When you create a `PairwiseAligner` it takes some default values for things like the algorithm it uses, the scores for match, mismatch, or gap, or whether the alignment is local/global. Below we will learn how to change these default values.

After the `PairwiseAligner` has been created, it is just a matter of getting the actual alignments using the `.align()` method. Note that the output of `.align()` is not a single alignment, but a collection of alignments that share the best score.
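To give an idea of what a global aligner computes, below is a minimal pure-Python sketch of the Needleman-Wunsch dynamic programming recursion with a linear gap penalty. It returns only the optimal score, not the alignments, and it is not how `PairwiseAligner` is implemented internally; the function name `nw_score` and the default scores are assumptions made for this illustration.

```python
def nw_score(seq_a, seq_b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Return the optimal global alignment score of two strings."""
    n_rows, n_cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0.0] * n_cols for _ in range(n_rows)]
    # First column and row: a prefix aligned entirely against gaps
    for i in range(1, n_rows):
        score[i][0] = i * gap
    for j in range(1, n_cols):
        score[0][j] = j * gap
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(nw_score("ACGT", "ACGT"))
print(nw_score("ACGT", "AGT"))
```

In the second call the two sequences differ in length, so any global alignment must contain at least one gap, and the gap penalty lowers the score accordingly.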

***

### Exercise

Use the `help()` function (as we did for `Entrez.efetch()` above) to learn which attributes of the `PairwiseAligner` class contain: 1) the algorithm being used by the aligner we just created, 2) the match score, 3) the mismatch score, 4) the gap score, and 5) whether the alignment is local/global. After finding out, write code to print these attributes.

In [None]:
# Use help here

In [None]:
# Write your code here to print the algorithm, the scores, and whether the alignment is local/global. 

*** 

Any of the attributes of the `PairwiseAligner` can be changed once an instance has been created and before actually aligning the sequences. For example, we can change the scores and, of course, we will get (slightly) different alignments: 

In [None]:
# Create the pairwise aligner
aligner2 = Align.PairwiseAligner()

# Change substitution scores
aligner2.match_score = 1.0
aligner2.mismatch_score = -2.0
aligner2.gap_score = -2.5

# Align the two sequences and show results
alignments2 = aligner2.align(seq1, seq2)
for alignment in alignments2:
    print('SCORE:', alignment.score)
    print(alignment)

## BLAST

Pairwise alignments are useful when we are interested in a few sequences, and we want to compare them to each other. However, more frequently we find ourselves with one (or a few) sequence(s) for which we want to find close matches; that is, for which we want to identify similar sequences wherever they are (for example, in other organisms). In those situations, we typically use BLAST or a similar tool. With BLAST, we can compare a target sequence with many other sequences in a reference database and, hopefully, find similar sequences that give us clues about the target sequence.

`Biopython` does not have its own implementation of BLAST, but it has an interface that allows you to interact with the BLAST web site automatically. However, we will not introduce this interface here. Instead, let's familiarize ourselves with the BLAST web page. 

***

### Exercise

Copy the sequence of the first record in `ls_orchid.fasta`. Then, go to the [BLAST home page](https://blast.ncbi.nlm.nih.gov/Blast.cgi), select the Nucleotide BLAST tool, enter the nucleotide sequence in the corresponding search box, and hit `BLAST`. After a short time, you will get the hits for your sequence, that is, a collection of aligned sequences that are similar to the target sequence you entered.

Download the BLAST results in XML format by clicking `Download All` > `XML`. Save the resulting `.xml` file as `Files/my_blast.xml`.

Take some time to explore and understand the BLAST results web page in more detail (especially the `Graphic Summary` and the `Distance tree of results`).

***

### Reading BLAST results

Now that we have saved the results of our BLAST search in this `XML` file, we will open and explore it with `Biopython`. For that, we need the `NCBIXML` module:

In [None]:
from Bio.Blast import NCBIXML

Similar to what we did for files containing annotated sequence records, we will need to create a 'handle' to the file, and then we will read the hits with `NCBIXML.parse(handle)`:

In [None]:
with open("Files/my_blast.xml") as handle:
    blast_records = list(NCBIXML.parse(handle))
print(len(blast_records), 'BLAST record read from the file (1 query => 1 record)')
print(len(blast_records[0].alignments), 'alignments for the BLAST record')

BLAST records are instances of the `Bio.Blast.Record.Blast` class. Unlike sequences and annotated sequences, these are very complex objects, whose structure is summarized in the following diagram:

![UML diagram of the BlastRecord](Media/BlastRecord.png)

To summarize, **each record corresponds to one query** (in our case, we only queried one sequence, so we only got one record) and **contains several alignments**, and **each alignment may contain several high-scoring pairs, or HSPs** (that is, regions within an alignment where there is a good match between the query sequence and the sequence identified as similar in the database). Next, we extract some useful information for each of the alignments in the `Files/my_blast.xml` file that we read above:

In [None]:
for alignment in blast_records[0].alignments:
    for hsp in alignment.hsps:  # HSPs = high-scoring pairs
        if hsp.expect < 1e-5:
            print("\n****Alignment****")
            print("sequence    :", alignment.title)
            print("length      :", alignment.length)
            print("e value     :", hsp.expect)
            print("identities  :", hsp.identities)
            # Percentage relative to the aligned length of this HSP
            print("% identities:", 100 * hsp.identities / hsp.align_length)
            print("bit score   :", hsp.bits)
            print(hsp.query[0:75] + "...")
            print(hsp.match[0:75] + "...")
            print(hsp.sbjct[0:75] + "...")
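The percentage identity of an HSP is the fraction of alignment columns in which query and subject carry the same letter. As a sketch of how this number relates to the aligned strings, the following plain-Python function recomputes it from two aligned sequences; the function name `percent_identity` is ours, and the strings below are invented and merely mimic `hsp.query` and `hsp.sbjct`.

```python
def percent_identity(query_aln, sbjct_aln):
    """Percentage of alignment columns where both sequences have the same letter."""
    if len(query_aln) != len(sbjct_aln):
        raise ValueError("Aligned strings must have the same length")
    # Count columns with identical letters; a gap ('-') never counts as identical
    identical = sum(
        1 for q, s in zip(query_aln, sbjct_aln) if q == s and q != "-"
    )
    return 100 * identical / len(query_aln)


# Made-up aligned strings (gaps shown as '-'), standing in for one HSP
demo_query = "ACGTAC-TGA"
demo_sbjct = "ACGTTCGTGA"
print(percent_identity(demo_query, demo_sbjct))
```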

# Final exercise (5/10 points): The mysterious virus


A few days ago, a patient was admitted to the hospital where you work. The patient was coughing, and had difficulty breathing, as well as a fever---she is suspected to have COVID-19. Your colleagues Oscar and Maria, from the Omics platform at the hospital, have been able to isolate a viral genome from the patient. Here, **you are expected to verify that the genome corresponds to the SARS-CoV-2 virus, and to investigate which variant of the virus has infected the patient**. 

1. **Open and read the genome of the virus** using Biopython. The genome is available as a FASTA file from `Files/patient.fasta`. 
2. Go to the BLAST [web page](https://blast.ncbi.nlm.nih.gov/Blast.cgi) and **BLAST the sequence manually**. BLAST against the standard nucleotide collections using the best algorithm for the alignment of highly similar sequences. After blasting, obtain, show (as a screenshot), and briefly discuss the **distance tree** of the results. Can you confirm the COVID-19 diagnosis for the patient?
3. **Save the BLAST results in a file** named `Files/blast_results.xml`. 

Now, you will proceed to establish which variant of the virus this is. You will accomplish this by studying the spike protein, which the virus uses to attach to cells and slip inside. We use the spike protein because the UK and the South African variants of the virus have specific mutations in this protein (with respect to the reference genome) that are well described and easy to identify. I encourage you to read [this fantastic article about the UK variant](https://www.nytimes.com/interactive/2021/health/coronavirus-mutations-B117-variant.html), as well as the [Wikipedia entry for the South African variant](https://en.wikipedia.org/wiki/501.V2_variant), to better understand your results and to give a more informed answer to the last question. For more information on how mutations in the spike protein affect COVID-19 vaccines, I also recommend [this article](https://www.quantamagazine.org/how-to-understand-covid-19-variants-and-their-effects-on-vaccines-20210225/).

4. Open the `Files/blast_results.xml` file using Biopython and **display the alignment with the highest percentage identity to the virus of our patient.** What percentage identity does it have with the query? What does this mean? **Store the accession code** of this genome in a variable called `seqid`.
5. Download from Entrez, using Biopython, the whole annotated genome of `seqid`.
6. List all the `features` of the genome you just downloaded. Identify and print the feature that contains the virus' **spike protein**, also called **surface glycoprotein** or simply **S protein**.
7. From the feature in 6, extract the translated amino acid sequence for the spike protein.
8. Open and read the FASTA file containing the amino acid sequence of the SARS-CoV-2 spike proteins for the reference, UK and South African variants. The file is available from `Files/s_protein.fasta`.
9. Do a **global pairwise alignment** between the amino acid sequence of the spike protein of your patient (exercise 7) and the spike protein of each of the variants (exercise 8). For the alignment, use the **Needleman-Wunsch algorithm in Biopython**. (Hint: for a more straightforward interpretation of the results, set the `target_internal_open_gap_score` in your `aligner` to -1, so that opening gaps in the alignment gets penalized).
10. Which variant of SARS-CoV-2 does your patient have? Discuss your answer based on the mutations you observe in your patient's spike protein (and using the articles referenced above).

In [None]:
# Question 1 code here

![Question 2. Save the screenshot of the distance tree of the alignments as Files/blast_tree.png and, after Ctrl+Enter in this cell, the image should show up here!](Files/blast_tree.png)

*Answer to question 2 here*

In [None]:
# Question 4 code here

*Answer to question 4 here*

In [None]:
# Question 5 code here

In [None]:
# Question 6 code here

In [None]:
# Question 7 code here

In [None]:
# Question 8 code here

In [None]:
# Question 9 code here

*Answer to question 10 here*