# Exercises

---
### 5.2.1 Exercise
Calculate the GC-content in the following sequence:
```
GATTACCACTCACTGACTCACTGACACGAGACCTATACATGATCGCCGGATGATACGAGAATTACTGACGACTAATCCCGGATACTGCATACACTGACGACGACT
```
- Use the `.count()` method as shown above
- Search through Bio.SeqUtils for a function that might help you
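
As a starting point for the first hint, here is a minimal sketch of the counting approach on a short toy sequence (not the exercise sequence). It uses a plain Python string, on the assumption that `Seq.count()` behaves like `str.count()` for single letters. The function name `gc_percent` is made up for this illustration.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        # Avoid dividing by zero for an empty sequence
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


toy = "GATTACA"  # toy input, 2 of 7 letters are G or C
print(round(gc_percent(toy), 2))  # 28.57
```

The same two `.count()` calls should carry over to a `Seq` object, so the exercise sequence can be handled the same way.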


--- 
### 5.3.1 Exercise
Can you concatenate the following sequences using a for-loop?
- Seq("ACGT")
- Seq("GCTA")
- Seq("TACG")

## 5.7 Exercise

Genes can be identified by looking for open reading frames (ORFs). Eukaryotic genes involve a complex interplay between promoters, start codons, exons and introns, so an ORF scan alone is not enough there. For prokaryotic and viral genes, however, the approach is still useful.

The codon table has to match the organism. In this exercise we use a bacterial plasmid FASTA file, which calls for codon [table 11](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG11). Write a function that accepts a DNA sequence, translates it with the given translation table and stores the translated sequences in a pandas DataFrame. Only keep candidate proteins with a minimum length of 100 amino acids.

Input arguments of the function:
- `record`: DNA sequence record (`SeqRecord` object, as returned by `SeqIO.read`)
- `strand`: sense or antisense (+1/-1)
- `frame`: reading frame offset (0/1/2)
- `table`: translation table (e.g. 11)
- `min_len`: minimum length of protein sequences to be included (e.g. 100)
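
The frame handling and the length cut-off can be tried out without Biopython. The sketch below assumes the translation is already available as a string in which `*` marks a stop codon, which is how Biopython's `translate()` writes stops by default. `trim_to_frame` and `long_fragments` are hypothetical helper names, and the toy inputs are invented.

```python
def trim_to_frame(dna, frame):
    """Drop `frame` leading bases, then cut the tail to a multiple of three."""
    shifted = dna[frame:]
    usable = len(shifted) - len(shifted) % 3
    return shifted[:usable]


def long_fragments(protein, min_len):
    """Split a translated string at stop codons ('*') and keep the pieces
    that are at least min_len residues long."""
    return [piece for piece in protein.split("*") if len(piece) >= min_len]


print(trim_to_frame("ATGAAACCCGG", 1))     # TGAAACCCG
print(long_fragments("MKV*MA*MKLLVA", 3))  # ['MKV', 'MKLLVA']
```

For `strand=-1`, the sequence would first be reverse-complemented (for example with `Seq.reverse_complement()`) before the frame is applied.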


The output might look something like this: 

|   |                                          Sequence | Length | Strand | Frame |
|--:|--------------------------------------------------:|-------:|-------:|------:|
| 0 | WGKLQVIGLSMWMVLFSQRFDDWLNEQEDALQEKVLADLKKLQVYG... |    125 |     -1 |     1 |
| 1 | RGIFMSDTMVVNGSGGVPAFLFSGSTLSSYRPNFEANSITIALPHY... |    361 |     -1 |     1 |
| 2 | WDVKTVTGVLHHPFHLTFSLCPEGATQSGREAHLLAELPQRRMEPV... |    111 |     -1 |     1 |

In [None]:
def extract_ORF(record, strand, frame, table, min_len):
    """Return all possible ORFs of a sequence record for a given strand orientation and reading frame.

    Only translations of at least min_len amino acids are kept; translation uses the given codon table."""

    # Create empty dataframe that will store all the information
    
    # Change DNA sequence according to strand orientation 

    # Trim the DNA sequence according to the reading frame

    # Iterate over each possible translation 
    
    # If the possible translation is longer than min_len, add it to the DataFrame
    
    
    return # DataFrame

Test code:

In [None]:
import pandas as pd
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.fna", "fasta")
table = 11
min_len = 100
extract_ORF(record=record, strand=-1, frame=1, table=table, min_len=min_len)

---
### 6.1.1 Exercise
Find the titles of all the articles related to the GenBank entry `NC_005816.gbk`. 

Extra: Create a list of URLs that link directly to each article. For this you can combine the PubMed ID with `https://pubmed.ncbi.nlm.nih.gov/`. 


Hint: look at the *references* section of [this link](https://biopython.readthedocs.io/en/latest/chapter_seq_annot.html)
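
For the extra part, building the links is plain string concatenation. The sketch below assumes the PubMed IDs have already been collected as strings and that a reference without a PubMed entry shows up as an empty string. The IDs are made-up placeholders and `pubmed_links` is a hypothetical helper name.

```python
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


def pubmed_links(pubmed_ids):
    """Build one article URL per non-empty PubMed ID."""
    cleaned = (str(pmid).strip() for pmid in pubmed_ids)
    return [PUBMED_BASE + pmid for pmid in cleaned if pmid]


placeholder_ids = ["00000001", "00000002", ""]  # made-up IDs, not real articles
for link in pubmed_links(placeholder_ids):
    print(link)
```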

--- 
### 7.1.1 Exercise
Return a list that contains the organism of each record in the `data/ls_orchid.gbk`-file. 

Tip: make an empty list, iterate over all the records, access the organism of each record and append it to the list. 

## 8.3 Exercise
Write a script that runs BLAST on the top 5 overrepresented sequences in a FASTQ file. Save the following information in a pandas DataFrame: title, e-value and score. 


Here is a table that is part of the output of a FastQC process. The raw data can be obtained from the zipped folder that is always created as part of the process. This part represents the overrepresented sequences in a fastq file. The file that contains the data is stored under `data/overrepresented_sequences.txt`. 


```
#Sequence	Count	Percentage	Possible Source
GCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGG	634749	0.9399698125201895	No Hit
GCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGGCC	437871	0.6484224816077345	No Hit
GGGGACAGTCCGCCCCGCCCCCCACCGGGCCCCGAGAGAGGCGACGGAGG	319343	0.47289996493044484	No Hit
GGCTTCCTCGGCCCCGGGATTCGGCGAAAGCTGCGGCCGGAGGGCTGTAA	310651	0.4600283926862577	No Hit
GGGCCTTCCCGGCCGTCCCGGAGCCGGTCGCGGCGCACCGCCACGGTGGA	260086	0.3851490725611636	No Hit
ACGAATGGTTTAGCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGG	247602	0.3666621066273818	No Hit
CGGCTTCGTCGGGAGACGCGTGACCGACGGTCCCCCCGGGACCCGACGGC	170383	0.25231213687083787	No Hit
...
```
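
Before any BLAST query, the top sequences have to be pulled out of this tab-separated table. The sketch below does this with the standard `csv` module on the first three rows shown above. It assumes the file starts with the `#Sequence` header line and that every data row has a sequence and an integer count. `top_sequences` is a hypothetical helper name; `pandas.read_csv` with `sep="\t"` would be an alternative.

```python
import csv
import io


def top_sequences(handle, n=5):
    """Return the n sequences with the highest count from a FastQC-style table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the '#Sequence ...' header line
    # Keep (sequence, count) pairs; rows without a count column are ignored
    rows = [(row[0], int(row[1])) for row in reader if len(row) >= 2]
    rows.sort(key=lambda item: item[1], reverse=True)
    return [sequence for sequence, _ in rows[:n]]


# First rows of the table above, used here instead of the real file
sample = (
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "GCGCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGG\t634749\t0.9399698125201895\tNo Hit\n"
    "GCCAGGTTCCACACGAACGTGCGTTCAACGTGACGGGCGAGAGGGCGGCC\t437871\t0.6484224816077345\tNo Hit\n"
    "GGGGACAGTCCGCCCCGCCCCCCACCGGGCCCCGAGAGAGGCGACGGAGG\t319343\t0.47289996493044484\tNo Hit\n"
)

for sequence in top_sequences(io.StringIO(sample), n=2):
    print(sequence)
```

With the real file, the `io.StringIO(sample)` argument would be replaced by an open file handle on `data/overrepresented_sequences.txt`.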

In [1]:
# Imports
import pandas as pd
from Bio.Seq import Seq
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

Here is an example output: 

|   | Title                    | Score |      E-value |
|--:|:-------------------------|------:|-------------:|
| 0 | Staphylococcus aureus... | 100.0 | 1.510770e-15 |
| 1 | ...                      |   ... |          ... |


# 9. Two more exercises
The following two exercises are a bit longer and require a combination of the materials that we learned today (9.1) or dive into the world of proteins (9.2). The choice is yours as to which one might be more relevant.

## 9.1 Diagnosing Sickle Cell Anemia
[This link](https://krother.gitbooks.io/biopython-tutorial/content/sicklecell.html) will bring you to a great example exercise from Kristian Rother that combines all of the things that we learned today. 

Your goal is to develop an experimental test that reveals whether a patient suffers from the hereditary disease sickle cell anemia. The diagnostic test should use a restriction enzyme on a patient's DNA sample. For the test to work, you need to know exactly what genetic difference to test against. In this tutorial, you will use Biopython to find out.

The idea is to compare DNA and protein sequences of sickle cell and healthy globin, and to try out different restriction enzymes on them.

This tutorial consists of four parts:

1. Use the module Bio.Entrez to retrieve DNA and protein sequences from NCBI databases.
2. Use the module Bio.SeqIO to read, write, and filter information in sequence files.
3. Use the modules Bio.Seq and Bio.SeqRecord to extract exons, transcribe and translate them to protein sequences.
4. Use the module re to identify restriction sites. Regular expressions are not part of the course.

## 9.2 Protein plots
Make two 3D plots of protein structures using the matplotlib pyplot library. For this you can use a Biopython module to retrieve the protein's PDB data and another one to parse it. 
1. The first one of the [human oxyhaemoglobin](https://www.rcsb.org/structure/1hho) chain A.
2. The second one with the superposition of chain B on top of chain A. 

Here is a quick way of plotting a 3D structure of the protein as well:

In [None]:
%pip install py3Dmol
import py3Dmol
view1 = py3Dmol.view(query='pdb:1HHO')
view1.setStyle({'cartoon':{'color':'spectrum'}})
view1

## Extra exercises
- Simple quality filtering for FASTQ files
- Trimming off primer sequences
- Trimming off adaptor sequences
- Histogram of sequence lengths



## Another exercise:
- Read the FASTA file `data/sample.fa` using Biopython and answer the following questions:
  - How many sequences are in the file?
  - What are the IDs and the lengths of the longest and the shortest sequences?
  - Select sequences longer than 500 bp. What is the average length of these sequences?
  - Calculate and print the percentage of GC in each of the sequences.
  - Write the sequences longer than 500 bp into a FASTA file named `long_sequences.fa` 

In [None]:
from Bio import SeqIO

# read the FASTA file named data/sample.fa
seq_records = list(SeqIO.parse('data/sample.fa', 'fasta'))

# find the number of sequences present in the file
num_seq = len(seq_records)
print('Total number of sequences:', num_seq)

In [None]:
# find IDs and lengths of the longest and the shortest sequences

# Create a pandas DataFrame for storing the SeqRecord objects, their IDs and their sequences
import pandas

seq_ids = []
seq_seqs = []
seq_objs = []

for seq in seq_records:
    seq_ids.append(seq.id)
    seq_seqs.append(str(seq.seq))
    seq_objs.append(seq)

seq_df = pandas.DataFrame({"id": seq_ids, "seq": seq_seqs, 'seqobj': seq_objs})

# Calculate the length of each sequence
seq_df['len'] = seq_df['seq'].apply(len)

# Find shortest and longest sequence ids
shortest = seq_df.sort_values("len", ascending=True).iloc[0]
longest = seq_df.sort_values("len", ascending=False).iloc[0]
print('Longest sequence is', longest['id'], 'with length', longest['len'], 'bp')
print('Shortest sequence is', shortest['id'], 'with length', shortest['len'], 'bp')

In [None]:
print(seq_df.head())

In [None]:
# Calculate the average length of sequences longer than 500bp
# Calculate and print the percentage of GC contents

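from Bio.SeqUtils import gc_fraction

# Calculate GC content as a percentage
# (gc_fraction returns a value between 0 and 1; the older GC function is no longer available in recent Biopython releases)
seq_df['gc'] = seq_df['seq'].apply(gc_fraction) * 100

# Filter sequences longer than 500 bp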
long_seq_df = seq_df[seq_df['len'] > 500]

print('Average length for sequences longer than 500bp is {}'.format(long_seq_df['len'].mean()))
print(long_seq_df[['id', 'gc']])

In [None]:
# Write the SeqRecord objects stored in long_seq_df to a file in FASTA format
SeqIO.write(long_seq_df['seqobj'], 'long_sequences.fa', 'fasta')

## References
- Example exercise from https://krother.gitbooks.io/biopython-tutorial/content/
- Example exercise from the Biopython Tutorial - Chapter 20: *Cookbook - Cool things to do with it*. 