# Session 10 practical

Combine all that we have learned so far:
- write functions
- use type hints
- write comments
- catch and/or raise errors

Attempt exercises 1 and 2 first, then the starred versions (and the double-starred versions only if you feel motivated!)

## Exercise 1: most common k-mers

Find and print the most common motifs consisting of 2, 3, and 4 amino acids (motifs of length k are known as k-mers), and their count in file `seq_long.txt`.
If multiple motifs are the most common, print them all. Example output:

`
['MT', 'TT']: 37
['MTT']: 10
`

(the formatting is not important here, only the count and motifs!)
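One way to handle the "print them all" requirement is to count every k-mer with `collections.Counter` and then keep each motif whose count equals the maximum. The sketch below is only an illustration: the helper name `most_common_kmers` and the toy sequences are made up here and are not taken from `seq_long.txt`.

```python
from collections import Counter

def most_common_kmers(sequences, k):
    """Return (sorted list of the most frequent k-mers, their shared count)."""
    counts = Counter()
    for seq in sequences:
        # Each sequence is handled separately, so no k-mer spans two sequences.
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        raise ValueError(f"no sequence is at least {k} residues long")
    top = max(counts.values())
    # Keep every motif that reaches the maximum, not just the first one found.
    return sorted(kmer for kmer, n in counts.items() if n == top), top

toy = ["MTTMT", "TTMT"]  # made-up sequences for illustration only
for k in (2, 3, 4):
    motifs, count = most_common_kmers(toy, k)
    print(f"{motifs}: {count}")
```

On the toy input, the 3-mers `TMT` and `TTM` tie with two occurrences each, so both are reported.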

In [204]:
from typing import TextIO
seq_fname = "../data/seq_long.txt"

def common_motif(fname: str, kmer_length: int = 2) -> tuple:
    '''
    Finds one of the most common motifs (k-mers) in a list of protein sequences and returns it with its count. Each line is treated as a separate protein, so k-mers never span two sequences.
    
    Takes two arguments:
    fname: Path of the file containing the protein sequences to check, one per line
    kmer_length: The length of the motif to identify (integer)
    '''
    with open(fname, 'r') as handle:
        seq_list = handle.read().splitlines()
    unique_keys = set()
    all_keys = []
    
    # slides a window of length kmer_length along each sequence:
    # all_keys keeps every k-mer occurrence, unique_keys keeps each distinct k-mer once
    for seq in seq_list:
        for index_start in range(len(seq) - kmer_length + 1):
            key = seq[index_start:index_start + kmer_length]
            all_keys.append(key)
            unique_keys.add(key)
            
    # finds total count of each unique kmer in every kmer identified
    motif_dict = {}
    for key in unique_keys:
        motif_dict[key] = all_keys.count(key)
        
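    # finds the maximum count and returns it with one motif that reaches it
    # (if several motifs tie for the maximum, only one of them is returned)
    if not motif_dict:
        raise ValueError("no sequence is at least kmer_length residues long")
    maximum = max(motif_dict, key=motif_dict.get), max(motif_dict.values())
    return maximum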

common_motif(seq_fname, 20)

('ARLMIREAFAEDSGRFTCSA', 1)

## Exercise 1*: longest non-unique k-mer

What is the length of the longest motif(s) that appear more than once? How many such motifs are there?

(This is hard. Feel free to ignore it!)

In [206]:
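def longest_common_motif(fname: str, kmer_length: int = 2) -> int:
    '''
    Returns the highest count reached by any motif of length kmer_length. A result above 1 means at least one motif of that length is present more than once.
    
    Takes two arguments:
    fname: Path of the file containing the protein sequences to check, one per line
    kmer_length: The length of motif to count (integer)
    '''
    with open(fname, 'r') as handle:
        seq_list = handle.read().splitlines()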
    unique_keys = set()
    all_keys = []
    
    # slides a window of length kmer_length along each sequence:
    # all_keys keeps every k-mer occurrence, unique_keys keeps each distinct k-mer once
    for seq in seq_list:
        for index_start in range(len(seq) - kmer_length + 1):
            key = seq[index_start:index_start + kmer_length]
            all_keys.append(key)
            unique_keys.add(key)
            
    # finds maximum count of the current k-mer length
    motif_dict = {}
    for key in unique_keys:
        motif_dict[key] = all_keys.count(key)
    count = max(motif_dict.values())
    return count

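# a repeated (k+1)-mer always contains a repeated k-mer, so the search can stop
# at the first length whose maximum count is 1
prev_count = 0
for k in range(2, 50):
    count = longest_common_motif(seq_fname, k)
    if count == 1:
        print ("Maximum length is", k-1, "with a count of", prev_count)
        break
    prev_count = count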

Maximum length is 19 with a count of 2
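To answer "how many are there?" it can help to keep the repeated motifs themselves rather than only a count. The following is a hedged sketch of that idea, not the reference solution: `longest_repeated_kmers` is a hypothetical helper, and it works on an in-memory list of sequences instead of a file.

```python
from collections import Counter

def longest_repeated_kmers(sequences):
    """Return (length, motifs) for the longest k-mers seen more than once.

    Returns (0, []) when nothing repeats at all.
    """
    best_len, best = 0, []
    k = 1
    while True:
        counts = Counter(
            seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
        )
        repeated = sorted(kmer for kmer, n in counts.items() if n > 1)
        if not repeated:
            # A repeated (k+1)-mer would contain a repeated k-mer, so stop here.
            return best_len, best
        best_len, best = k, repeated
        k += 1

length, motifs = longest_repeated_kmers(["MTTMT", "TTMT"])  # made-up input
print(length, motifs, len(motifs))
```

The loop needs no hard-coded upper bound on k: once k exceeds the longest sequence, no k-mers are produced and the function returns.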


## Exercise 1**: speed benchmarking

Can you beat my code? See if you can get a faster run time (I/O not included, i.e., I assume `seq` already contains the full sequence).

In [208]:
#%timeit find_longest_nonunique_kmer(seq)
# 50.6 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit common_motif(seq_fname, 2)


32.9 ms ± 7.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
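Outside a notebook, or to keep file reading out of the measurement, the standard-library `timeit` module can time a function on sequences that are already in memory. This is a minimal sketch under assumptions: `count_kmers` and the synthetic `sequences` list are placeholders invented for the example, so the timings it prints are not comparable to the numbers above.

```python
import timeit
from collections import Counter

def count_kmers(sequences, k):
    """Count every k-mer occurrence across all sequences."""
    return Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )

# Synthetic data built in memory, so no file I/O is included in the timing.
sequences = ["MTTMTAAGK" * 10] * 50

# Each entry of `timings` is the total time in seconds for `number` calls.
number = 10
timings = timeit.repeat(lambda: count_kmers(sequences, 2), number=number, repeat=7)
best_ms = min(timings) / number * 1000
print(f"best of 7 runs: {best_ms:.3f} ms per call")
```

Reporting the minimum over the repeats is a common choice because slower runs mostly reflect other activity on the machine.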


## Exercise 2: Normality

- Get the length of each sequence in file `multi_seqs.txt`
- Perform a normality test: can you conclude the sequence lengths are normally distributed?

In [None]:
sequences_fname = "../data/multi_seqs.txt"
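As a starting point, here is a sketch that uses only the standard library. It assumes the file holds one sequence per line (the actual layout of `multi_seqs.txt` is not shown here) and implements the Jarque-Bera test by hand, which compares the sample skewness and kurtosis with those of a normal distribution. Its p-value relies on a large-sample approximation, so treat it as rough for small samples; `scipy.stats` offers `shapiro`, `normaltest` and `jarque_bera` as ready-made alternatives.

```python
import math

def sequence_lengths(lines):
    """Length of every non-empty line, assuming one sequence per line."""
    return [len(line.strip()) for line in lines if line.strip()]

def jarque_bera(values):
    """Return (statistic, p_value) of the Jarque-Bera normality test."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Chi-square with 2 degrees of freedom has survival function exp(-x / 2).
    return statistic, math.exp(-statistic / 2)

toy_lines = ["MTT\n", "\n", "MTTAAGK\n", "MT\n", "MTTA\n"]  # made-up input
lengths = sequence_lengths(toy_lines)
statistic, p_value = jarque_bera(lengths)
print(lengths, statistic, p_value)
```

A small p-value is evidence against normality. A large p-value only means the test found no evidence against it, which is weaker than concluding the lengths are normally distributed.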

## Exercise 2*: Normality

Using scientific libraries, perform the following:
- obtain random variates from the following distributions: uniform, Cauchy, Laplace
- plot the resulting samples' distribution
- check whether the resulting samples are normally distributed
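The three steps above could be sketched with NumPy and SciPy as follows. The seed, the sample size of 1000 and the choice of 20 histogram bins are arbitrary assumptions made for the example, and `scipy.stats.normaltest` (D'Agostino-Pearson) is just one of several tests that could be used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
n = 1000  # arbitrary sample size chosen for this example

samples = {
    "uniform": rng.uniform(0.0, 1.0, size=n),
    "cauchy": rng.standard_cauchy(size=n),
    "laplace": rng.laplace(loc=0.0, scale=1.0, size=n),
}

results = {}
for name, values in samples.items():
    # Bin counts on 20 equal-width bins: the numbers a histogram plot would show.
    counts, edges = np.histogram(values, bins=20)
    statistic, p_value = stats.normaltest(values)
    results[name] = p_value
    print(f"{name:8s} p = {p_value:.3g}, tallest bin holds {counts.max()} of {n} points")
```

To draw the distributions rather than print bin counts, each array can be passed to `matplotlib.pyplot.hist`. The Cauchy sample usually needs a restricted x-range, because a few extreme values stretch the axis.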

## Exercise 2**: Normality

As you obtain more samples, statistical tests become more powerful.

How many random variates from a Laplace distribution are needed for the p-value to be below 0.5, 90% of the time?

And how would you plot it?
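One possible approach is simulation: for each sample size, draw many Laplace samples, test each one, and record the fraction whose p-value falls below the threshold. The sketch below assumes `scipy.stats.normaltest` as the test and uses hypothetical helpers (`rejection_rate`, `smallest_n`) with arbitrary defaults for the number of trials, the seed and the grid of sample sizes. The threshold `alpha` is passed explicitly, so any value can be tried, and the answer will depend on which test is used.

```python
import numpy as np
from scipy import stats

def rejection_rate(n, alpha, trials=200, seed=0):
    """Fraction of Laplace samples of size n whose normality p-value is below alpha."""
    rng = np.random.default_rng(seed)
    samples = rng.laplace(size=(trials, n))
    # axis=1 tests every row (one simulated sample per row) in a single call.
    p_values = stats.normaltest(samples, axis=1).pvalue
    return float(np.mean(p_values < alpha))

def smallest_n(alpha, target=0.9, sizes=range(20, 1001, 20)):
    """First sample size on the grid whose rejection rate reaches the target, else None."""
    for n in sizes:
        if rejection_rate(n, alpha) >= target:
            return n
    return None

for alpha in (0.5, 0.05):
    print(f"alpha = {alpha}: smallest n on the grid = {smallest_n(alpha)}")
```

For the plot, a natural choice is the rejection rate on the y-axis against the sample size on the x-axis, with a horizontal line at 0.9; the answer is where the curve first crosses that line.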