# Exercises

---
### 5.2.1 Exercise
Calculate the GC-content in the following sequence:
```
GATTACCACTCACTGACTCACTGACACGAGACCTATACATGATCGCCGGATGATACGAGAATTACTGACGACTAATCCCGGATACTGCATACACTGACGACGACT
```
- Use the `.count()` method as shown above
- Search through Bio.SeqUtils for a function that might help you

In [None]:
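from Bio.Seq import Seq
from Bio.Alphabet import IUPAC  # Bio.Alphabet exists only in Biopython < 1.78; on newer versions drop the alphabet argument

ex_seq = Seq("GATTACCACTCACTGACTCACTGACACGAGACCTATACATGATCGCCGGATGATACGAGAATTACTGACGACTAATCCCGGATACTGCATACACTGACGACGACT", IUPAC.unambiguous_dna)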
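
As a starting point for the first hint, the counting approach can be sketched on a plain Python string. This is only an illustration: the helper name `gc_fraction_by_count` and the toy sequence are assumptions, not part of the course material. `Seq` objects offer the same `.count()` method.

```python
def gc_fraction_by_count(sequence):
    """Return the fraction of G and C letters in a sequence string (hypothetical helper)."""
    sequence = sequence.upper()
    if not sequence:
        # Avoid dividing by zero for an empty sequence
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


toy_sequence = "ATGCGC"  # toy input, not the exercise sequence
print(f"GC-content: {gc_fraction_by_count(toy_sequence):.1%}")
```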


--- 
### 5.3.1 Exercise
Can you concatenate the following sequences using a for-loop?
- Seq("ACGT", generic_dna)
- Seq("GCTA", generic_dna)
- Seq("TACG", generic_dna)

## 5.7 Identifying open reading frames

Genes can be identified by looking for open reading frames (ORFs). For eukaryotic genes, the complex interplay between promoters, start codons, exons and introns makes this difficult. For prokaryotic and viral genes, however, the approach is still useful.

Depending on the organism, you also need to use the appropriate codon table. In this example we use a bacterial plasmid FASTA file, which requires codon [table 11](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG11). The following block of code stores the sequence in the variable `record`, sets the translation table, and requires a candidate protein to be at least 100 amino acids long.


In [None]:
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.fna", "fasta")
table = 11
min_pro_len = 100
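
One possible way to run the search, modelled on the six-frame translation example in the Biopython Tutorial and Cookbook, is sketched here. The generator name `iter_orfs` and the short toy sequence are assumptions, used so that the snippet runs without the data file. The sketch reports every stop-free stretch of sufficient length and does not check for a start codon.

```python
from Bio.Seq import Seq


def iter_orfs(seq, table, min_pro_len):
    """Yield (protein, strand, frame) for stop-free stretches in all six frames (hypothetical helper)."""
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            # Trim to a multiple of three so no partial codon is translated
            length = 3 * ((len(seq) - frame) // 3)
            for pro in nuc[frame:frame + length].translate(table).split("*"):
                if len(pro) >= min_pro_len:
                    yield pro, strand, frame


# Toy input with a low length threshold; not the plasmid sequence
toy = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
for pro, strand, frame in iter_orfs(toy, table=11, min_pro_len=5):
    print(f"{str(pro[:30])}...{str(pro[-3:])} - length {len(pro)}, strand {strand}, frame {frame}")
```

Calling `iter_orfs(record.seq, table, min_pro_len)` in the same printing loop applies the search to the plasmid.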

The output might look something like this: 
```
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF, - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE, - length 285, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA, - length 355, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD, - length 114, strand -1, frame 0
```

You could easily edit the loop-based code above to build up a list of the candidate proteins, or convert it to a list comprehension.

---
### 6.1.1 Exercise
Find the titles of all the articles related to the GenBank entry 'NC_005816'. Import this file using the following block of code.

Extra: Create a list of URLs that link directly to each article. For this you can combine the PubMed ID with `https://pubmed.ncbi.nlm.nih.gov/`.


Hint: look at the *references* section of [this link](https://biopython.readthedocs.io/en/latest/chapter_seq_annot.html)

In [None]:
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.gb","gb")
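
For the extra task, building the links is plain string formatting once the PubMed IDs are collected. A minimal sketch, assuming the IDs are available as strings and that a missing ID shows up as an empty string; `pubmed_url` is a hypothetical helper and the IDs are placeholders, not the references of NC_005816.

```python
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_url(pubmed_id):
    """Build the article URL for one PubMed ID (hypothetical helper)."""
    return f"{PUBMED_BASE}{pubmed_id}/"


# Placeholder IDs for illustration only; the empty string stands for a reference without a PubMed ID
placeholder_ids = ["12345678", "", "87654321"]

# Skip references that have no PubMed ID
urls = [pubmed_url(pid) for pid in placeholder_ids if pid]
print(urls)
```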

--- 
### 7.1.1 Exercise
Return a list containing the organism of each record in the file `data/ls_orchid.gbk`.

Tip: make an empty list, iterate over all the records, access the organism of each record and append it to the list.

In [None]:
from Bio import SeqIO
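
The pattern from the tip can be tried out on records built in memory. This is a sketch under assumptions: the two toy records and their organism names are invented for illustration. In the exercise the records would come from parsing `data/ls_orchid.gbk` instead.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy records standing in for the parsed GenBank entries (invented data)
toy_records = [
    SeqRecord(Seq("ACGT"), id="toy1", annotations={"organism": "Example species A"}),
    SeqRecord(Seq("GGCC"), id="toy2", annotations={"organism": "Example species B"}),
]

organisms = []  # start with an empty list
for rec in toy_records:
    # GenBank records keep the organism name in the annotations dictionary
    organisms.append(rec.annotations["organism"])

print(organisms)
```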

## 8.3 Exercise
Write a script that runs BLAST on the top 5 overrepresented sequences in a FASTQ file. Save the following information in a pandas DataFrame: title, E-value and score.


The table below is part of the output of a FastQC run. The raw data can be obtained from the zipped folder that FastQC always creates. This part lists the overrepresented sequences in a FASTQ file. The data is stored in `data/overrepresented_sequences.txt`.


```
#Sequence	Count	Percentage	Possible Source
GCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGG	634749	0.9399698125201895	No Hit
GCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGGCC	437871	0.6484224816077345	No Hit
GGGGACAGTCCGCCCCGCCCCCCACCGGGCCCCGAGAGAGGCGACGGAGG	319343	0.47289996493044484	No Hit
GGCTTCCTCGGCCCCGGGATTCGGCGAAAGCTGCGGCCGGAGGGCTGTAA	310651	0.4600283926862577	No Hit
GGGCCTTCCCGGCCGTCCCGGAGCCGGTCGCGGCGCACCGCCACGGTGGA	260086	0.3851490725611636	No Hit
ACGAATGGTTTAGCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGG	247602	0.3666621066273818	No Hit
CGGCTTCGTCGGGAGACGCGTGACCGACGGTCCCCCCGGGACCCGACGGC	170383	0.25231213687083787	No Hit
...
```
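
Before any BLAST call, the sequences have to be pulled out of this table. A minimal sketch with the standard `csv` module, assuming the file has exactly the tab-separated layout shown above with a single header line; the function name `top_overrepresented` is hypothetical, and the embedded text repeats the first three rows so the snippet runs without the data file.

```python
import csv
import io

# First rows of the table above, embedded so the example is self-contained
fastqc_text = (
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "GCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGG\t634749\t0.9399698125201895\tNo Hit\n"
    "GCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGGCC\t437871\t0.6484224816077345\tNo Hit\n"
    "GGGGACAGTCCGCCCCGCCCCCCACCGGGCCCCGAGAGAGGCGACGGAGG\t319343\t0.47289996493044484\tNo Hit\n"
)


def top_overrepresented(handle, n=5):
    """Return the n sequences with the highest count (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Sort explicitly by count rather than relying on the file order
    rows = sorted(reader, key=lambda row: int(row["Count"]), reverse=True)
    return [row["#Sequence"] for row in rows[:n]]


print(top_overrepresented(io.StringIO(fastqc_text), n=2))
```

For the real file, an opened file handle can be passed in place of the `io.StringIO` object.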

In [None]:
# Imports
import pandas as pd
from Bio.Seq import Seq
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio.Alphabet import IUPAC

Here is an example output: 

|   | Title                                             | Score |      E-value |
|--:|:--------------------------------------------------|------:|-------------:|
| 0 | PREDICTED: Arvicanthis niloticus 28S ribosomal... |  90.0 | 7.825980e-13 |
| 1 | PREDICTED: Arvicanthis niloticus 28S ribosomal... |  90.0 | 7.825980e-13 |
| 2 | Streptomyces sp. RPA4-2 chromosome, complete g... |  47.0 | 6.864950e-01 |
| 3 | PREDICTED: Canis lupus dingo 28S ribosomal RNA... |  73.0 | 6.016610e-08 |
| 4 | Chain 2, 28S rRNA                                 |  86.0 | 9.534000e-12 |

# 9. Two more exercises
The following two exercises are a bit longer. The first (9.1) combines the material covered today, while the second (9.2) dives into the world of proteins. Choose whichever is more relevant to you.

## 9.1 Diagnosing Sickle Cell Anemia
[This link](https://krother.gitbooks.io/biopython-tutorial/content/sicklecell.html) will bring you to a great example exercise from Kristian Rother that combines all of the things that we learned today. 

Your goal is to develop an experimental test that reveals whether a patient suffers from the hereditary disease sickle cell anemia. The diagnostic test should use a restriction enzyme on a patient's DNA sample. For the test to work, you need to know exactly what genetic difference to test for. In this tutorial, you will use Biopython to find out.

The idea is to compare DNA and protein sequences of sickle cell and healthy globin, and to try out different restriction enzymes on them.

This tutorial consists of four parts:

1. Use the module Bio.Entrez to retrieve DNA and protein sequences from NCBI databases.
2. Use the module Bio.SeqIO to read, write, and filter information in sequence files.
3. Use the modules Bio.Seq and Bio.SeqRecord to extract exons, transcribe and translate them to protein sequences.
4. Use the module re to identify restriction sites. Regular expressions are not part of the course.
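
Since regular expressions are not covered in the course, here is a small sketch of how part 4 might look. It searches a toy sequence for the EcoRI recognition site `GAATTC`; the enzyme choice, the helper name `find_sites` and the toy sequence are assumptions for illustration and are not taken from the linked tutorial.

```python
import re

ECORI_SITE = "GAATTC"  # EcoRI recognition sequence, used here only as an example


def find_sites(dna, site_pattern):
    """Return the 0-based start positions of all matches of a site pattern (hypothetical helper)."""
    return [match.start() for match in re.finditer(site_pattern, dna.upper())]


toy_dna = "AAGAATTCGGTTGAATTCAA"  # invented sequence containing two sites
print(find_sites(toy_dna, ECORI_SITE))
```

A site with an ambiguous base can be written as a character class, for example `[ACGT]` for any nucleotide.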

## 9.2 Protein plots
Make two 3D plots of protein structures using the matplotlib pyplot library. You can use one Biopython module to retrieve the protein's PDB data and another one to parse it.
1. The first plot shows chain A of [human oxyhaemoglobin](https://www.rcsb.org/structure/1hho).
2. The second plot shows chain B superimposed on chain A.

Here is a quick way of plotting a 3D structure of the protein as well:

In [None]:
%pip install py3Dmol
import py3Dmol
view1 = py3Dmol.view(query='pdb:1HHO')
view1.setStyle({'cartoon':{'color':'spectrum'}})
view1

## Extra exercises
- Simple quality filtering for FASTQ files
- Trimming off primer sequences
- Trimming off adaptor sequences
- Histogram of sequence lengths



## References
- Example exercise from https://krother.gitbooks.io/biopython-tutorial/content/
- Example exercise from the Biopython Tutorial - Chapter 20: *Cookbook - Cool things to do with it*.