# Handout 05
#### Sara Díaz del Ser

In [62]:
import numpy as np
import matplotlib.pyplot as plt
from termcolor import colored

In [63]:
# Data
ecoli_proteome = './data/ecoli-proteome.faa'
ecoli_orfs_sample = './data/ecoli-orfs-sample.ffn'
ecoli_orfs = './data/ecoli-orfs.ffn'
ecoli_genome = './data/ecoli-genome.fna'
ecoli_genome_sample = './data/ecoli-genome-sample.fna'
ecoli_genes_std= './data/ecoli-genes-standard.ffn'
ecoli_genes = './data/ecoli-genes.ffn'

### Ex. 1 _(6 pts)_ Reading and writing sequences in FASTA format

In previous exercises you were already tasked with reading (at least a single) sequence(s)
from FASTA files. Having a collection of functions available for reading and writing sequence
data is quite handy. There are a wide variety of formats available. However, the FASTA
format is both popular and simple. Thus, it is well suited for writing a set of functions for
dealing with sequences in this format. It does not matter whether the sequence(s) repesented
consist of DNA, RNA or amino acids; however, there might be some conventions on file name
extensions when dealing with one or the other type of sequence. More importantly, a file
may contain a single sequence or more than one sequence and functions for reading and
writing sequences in FASTA format should be able to deal with both cases:

#### (a) Reading multiple sequences from a file

A FASTA file can contain more than one sequence. Write a function ```fasta_list(filename)```
that reads all sequences from a FASTA file and returns a list of tuples, each tuple con-
taining the header as the first and the sequence as the second element.
Note, that a function written this way would normally first read the complete data
contained in the FASTA file and return all the data in a single data structure, i.e., a
list of tuples.


In [64]:
def fasta_list(filename:str) -> list:
	"""Reads all sequences from fasta file and returns a list of tuples containing of header and sequence"""
	with open(filename, 'r') as f:
		# Read all lines
		all_records = "".join(f.readlines()).split('>')

		# Split into headers and sequences
		fasta_list = [ (record.split('\n',1)[0], record.split('\n',1)[1].replace('\n','')) \
					   for record in all_records if record !='']

		print(colored(f'Found a total of {len(fasta_list)} sequences in {filename}', 'green'))
	return fasta_list

In [65]:
# Test function
a = fasta_list(ecoli_proteome)

[32mFound a total of 4141 sequences in ./data/ecoli-proteome.faa[0m


In [66]:
# Test function
b = fasta_list(ecoli_genes)

[32mFound a total of 4321 sequences in ./data/ecoli-genes.ffn[0m


####  (b) Using generators to read entries from FASTA files

FASTA files can be very large and calling the function fasta_list could end up using a
lot of memory. However, if you have a FASTA file (e.g. ```ecoli-genes.ffn```) containing a
number of sequences, often you just want to perform some operation for each sequence
separately (maybe something a simple as determine its length). In this case, it is not
really necessary to first store all the elements in memory, rather it is preferable to have
a mechanism that would yield one element (consisting of the header information and
the sequence) at a time without reading all the data first.


In [67]:
def fasta_generator(filename:str):
	"""Reads all sequences from fasta file and returns a list of tuples containing of header and sequence"""
	with open(filename, "r") as f:
		line = f.readline()
		while True:
			if line.startswith('>'):
				header = line.replace('\n','')
				# Read the rest of the lines as long as they're not headers
				seq = ''
				new_line = ''
				while not new_line.startswith('>'):
					seq = seq + str(new_line)
					try:
						new_line = next(f)
					except StopIteration:
						return
					line = new_line
				yield (header, seq.replace('\n',''))

In [68]:
# Try it out
g = fasta_generator(ecoli_genes)

In [69]:
print(next(g))
print(next(g))
print(next(g))

('>gi|556503834|ref|NC_000913.3|:190-255 Escherichia coli str. K-12 substr. MG1655, complete genome', 'ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA')
('>gi|556503834|ref|NC_000913.3|:337-2799 Escherichia coli str. K-12 substr. MG1655, complete genome', 'ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGTCAGGTGCCCG

In [70]:
# Now attempt to run this code:
name,seq = max(fasta_generator(ecoli_genes),key= lambda x: len(x[1]))
max_length = len(seq)
print("The longest gene is",name,"and contains",max_length,"nucleobases.")


The longest gene is >gi|556503834|ref|NC_000913.3|:2044938-2052014 Escherichia coli str. K-12 substr. MG1655, complete genome and contains 7077 nucleobases.


#### (c) Testing
Test the functions written in (a) and (b) on ```ecoli-proteome.faa```: How many amino
acid sequences are contained in the file? Determine the header and length of the shortest
and longest amino acid sequence of the file.

In [71]:
# a) Read as list
records_list = fasta_list(ecoli_proteome)

# Get longest and shortest
name_longest,seq_longest = max(records_list,key= lambda x: len(x[1]))
name_shortest,seq_shortest = min(records_list,key= lambda x: len(x[1]))

print(f"Shortest sequence found was:\n\t{name_shortest}\n\t(Size: {len(seq_shortest)} aminoacids)")
print(f"Longest sequence found was:\n\t{name_longest}\n\t(Size: {len(seq_longest)} aminoacids)")

[32mFound a total of 4141 sequences in ./data/ecoli-proteome.faa[0m
Shortest sequence found was:
	gi|16129226|ref|NP_415781.1| trp operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
	(Size: 14 aminoacids)
Longest sequence found was:
	gi|145698281|ref|NP_416485.4| putative adhesin [Escherichia coli str. K-12 substr. MG1655]
	(Size: 2358 aminoacids)


In [72]:
# b) Read as generator
print(colored(f'Found a total of {sum(1 for _ in fasta_generator(ecoli_proteome))} sequences in {ecoli_proteome}', 'green'))

name_longest,seq_longest = max(fasta_generator(ecoli_proteome),key= lambda x: len(x[1]))
name_shortest,seq_shortest = min(fasta_generator(ecoli_proteome),key= lambda x: len(x[1]))

print(f"Shortest sequence found was:\n\t{name_shortest}\n\t(Size: {len(seq_shortest)} aminoacids)")
print(f"Longest sequence found was:\n\t{name_longest}\n\t(Size: {len(seq_longest)} aminoacids)")

[32mFound a total of 4140 sequences in ./data/ecoli-proteome.faa[0m
Shortest sequence found was:
	>gi|16129226|ref|NP_415781.1| trp operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
	(Size: 14 aminoacids)
Longest sequence found was:
	>gi|145698281|ref|NP_416485.4| putative adhesin [Escherichia coli str. K-12 substr. MG1655]
	(Size: 2358 aminoacids)


#### (d) Writing FASTA sequences
Now write a function  ```write_fasta(outfile,header,sequence)`` that writes the sequence
with header to the opened file outfile in FASTA format. Make sure the the written
header line starts with > and that the sequence is split into lines containing exactly 70
symbols (the final line may contain less than 70 symbols but should not be empty).

If written this way, the function can also be used to write multiple sequences to a single
file. E.g., the following piece of code writes short amino acid sequences of an albatross,
a lumberjack, and a dead parrot to the file nudgenudge.faa.
```
with open("nudgenudge.faa","w") as f:
		write_fasta(f,"albatross","WHATFLAVQRISIT")
		write_fasta(f,"lumberjack","ISLEEPALLNIGHTANDIWQRKALLDAY")
		write_fasta(f,"deadparrot","NQRWEGIANPLVE")
```

In [81]:
def write_fasta(outfile:str, header:str, sequence:str):
	"""Writes the given sequence and its header to the output file in FASTA format"""
	print(f'>{header}', file=outfile)
	for i in range(0, len(sequence), 70):
		if sequence[i:i+70] != '':
			print(sequence[i:i+70], file=outfile)

In [82]:
# Test it
with open("nudgenudge.faa","w") as f:
		write_fasta(f,"albatross","WHATFLAVQRISIT")
		write_fasta(f,"lumberjack","ISLEEPALLNIGHTANDIWQRKALLDAY")
		write_fasta(f,"deadparrot","NQRWEGIANPLVE")

### (e) Complementary DNA
Write a function cdna(seq) that takes a DNA sequence as argument and returns the
complementary DNA sequence. Note that nucleotide sequences are written in 5' to 3'
direction. Your output should also give the sequence in the 5' to 3' direction! Note: You
can test your function on the small DNA sequence given in ```ecoli-genome-sample.fna```.

In [97]:
def cdna(seq:str) -> str:
	"""Takes DNA sequence and returns complementary"""
	switch = { 'A' : 'T', 'T': 'A', 'C' : 'G', 'G': 'C'}
	# Use reverse to output it in 5' -> 3'
	return "".join(reversed([switch[nt] for nt in seq]))

In [99]:
# Test it
seq = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
cdna(seq)

'GCTGCTATCAGACACTCTTTTTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCT'

### (f) Creating a module: Putting things together.
Put the definitions of the previous functions in a single file named fastatools.py.
If you would run this script by itself you would observe no effect because all it does
is define the functions. However, these functions might be actually useful as part of
another script. Or, if you run import it in the notebook the functions can be used from
within the notebook. You can use the import statement, which you already know from
Python standard modules such as math:
```
from fastatools import single_fasta_sequence
f = open("ecoli-genome.fna")
species,genome = single_fasta_sequence(f)
f.close()
print("The genome of",species,"contains",len(genome),"nucleotides.")
```
or
```
import fastatools
f=open("truth.faa","w")
fastatools.write_fasta(f,"theking","ELVISISALIVE")
fastatools.write_fasta(f,"liverpoolfour","PAVLISDEAD")
f.close()
```
Python will find and import fastatools.py if it is in the current directory. You may
want to use the functions defined here in the following exercise by importing them.

In [102]:
from fastatools import single_fasta_sequence, complementary, write_to_fasta, all_fasta_sequences

with open(ecoli_genome, 'r') as f:
	species,genome = single_fasta_sequence(f)

print("The genome of",species,"contains",len(genome),"nucleotides.")

ImportError: cannot import name 'single_fasta_sequence' from 'fastatools' (/Users/sara/PycharmProjects/plab-assigments/diazdelser-plab-5/fastatools.py)