# Assignment 1: Python Basics

Proteins are clustered into families based on sequence similarity. A protein family is a group of proteins that share a common evolutionary origin, reflected by their related functions and similarities in sequence or structure. Sequences of proteins in a family are aligned to identify the conserved regions and the variations in the family. Such an alignment is called a multiple sequence alignment (MSA).

In this assignment, you will write Python code to process the MSA of a protein family. The MSA is stored in a text file in the [Stockholm format](https://en.wikipedia.org/wiki/Stockholm_format). A Stockholm-formatted file looks like this:

```
# STOCKHOLM 1.0
#=GF ID   EXAMPLE
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
//
```

The first line gives the version of the Stockholm format. Every line that starts with `#` is a comment and can be ignored. The remaining lines hold the aligned sequences of the proteins in the family, one sequence per line. Each of these lines contains the sequence name (including start and end positions) and the aligned sequence, separated by spaces. The alignment ends with a line containing `//`.
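To make the line layout concrete, here is a minimal sketch that classifies the lines of a tiny, made-up Stockholm snippet. The sequence names and residues are invented for illustration, and `classify_line` is a hypothetical helper, not part of any library. It is not a complete parser.

```python
def classify_line(line: str) -> str:
    """Return 'blank', 'comment', 'end', or 'sequence' for one line of a Stockholm file."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "comment"
    if stripped == "//":
        return "end"
    return "sequence"


# Invented example lines that follow the layout described above
example_lines = [
    "# STOCKHOLM 1.0",
    "#=GF ID   EXAMPLE",
    "seqA/1-5     MK-LV..A",
    "seqB/3-7     MKTLV..-",
    "//",
]

for line in example_lines:
    kind = classify_line(line)
    if kind == "sequence":
        # str.split() with no arguments splits on any run of whitespace
        name, aligned = line.split()
        print(kind, name, aligned)
    else:
        print(kind)
```

Calling `split()` without arguments handles the variable number of spaces between the name and the sequence.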

First, let us download a sample MSA file that we will use for this assignment. The following code downloads the MSA of the [protein family PF00041](https://www.ebi.ac.uk/interpro/entry/pfam/PF00041/) from the Pfam database and saves it to the file `PF00041_seed.txt` in the folder `data`. Within each aligned sequence, letters represent amino acids (e.g., `A` for alanine, `C` for cysteine), while `-` and `.` represent gaps.


In [8]:
import gzip
import os

import urllib3

pfam_id = "PF00041"
http = urllib3.PoolManager()
r = http.request(
    "GET",
    f"https://www.ebi.ac.uk/interpro/wwwapi//entry/pfam/{pfam_id}/?annotation=alignment:seed&download",
)
# The response body is gzip-compressed; decompress it and decode the bytes to text
data = gzip.decompress(r.data)
data = data.decode()

# Create the data folder if it does not exist yet
os.makedirs("./data", exist_ok=True)
with open(f"./data/{pfam_id}_seed.txt", "w") as file_handle:
    print(data, file=file_handle)

You can open the file `PF00041_seed.txt` in a text editor to see the content of the MSA file. In the following, you will write Python code to read this file and process the MSA.
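If you prefer to stay in Python, a quick way to check the download is to print the first few lines of the file. The sketch below is one way to do that; `preview_file` is a hypothetical helper name introduced here for illustration.

```python
from itertools import islice


def preview_file(path: str, num_lines: int = 5) -> list[str]:
    """Return the first num_lines lines of a text file, without trailing newlines."""
    with open(path, "r") as file_handle:
        # islice stops reading after num_lines lines, so large files are not loaded fully
        return [line.rstrip("\n") for line in islice(file_handle, num_lines)]
```

For example, `preview_file("./data/PF00041_seed.txt")` should return a list whose first entry is the `# STOCKHOLM 1.0` header, assuming the download above succeeded.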

## Part 1

1. Read the MSA file `PF00041_seed.txt` and store the sequences in a dictionary. The key of each item in the dictionary is the sequence name, and the value is the aligned sequence as a string. The sequence name should include the start and end positions of the sequence if provided. If the start and end positions are not provided, use the sequence name as is. Keep the gaps in the aligned sequences.

2. Write a function to compute the number of protein sequences that are longer than 100 amino acids, excluding gaps. Use the dictionary created in the previous step as input to this function.

3. Write functions to get the names of the protein sequences that have the largest and the smallest number of amino acids, excluding gaps. If several sequences are tied, you can report any of them. Use the dictionary created in the first step as input to these functions.
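The building blocks for steps 2 and 3 can be sketched on a toy dictionary. The example below is one possible approach on invented data, not the required solution; `toy_msa`, `count_residues`, and `GAP_CHARACTERS` are hypothetical names chosen for this illustration.

```python
# Gap symbols used in the aligned sequences, as described earlier
GAP_CHARACTERS = "-."


def count_residues(aligned_sequence: str) -> int:
    """Count the symbols in an aligned sequence that are not gap characters."""
    return sum(1 for symbol in aligned_sequence if symbol not in GAP_CHARACTERS)


# Invented alignment: three sequences with the same aligned length
toy_msa = {
    "seqA/1-5": "MK-LV..A",
    "seqB/3-9": "MKTLVGG-",
    "seqC/2-4": "M--L...A",
}

# Ungapped length of every sequence
lengths = {name: count_residues(seq) for name, seq in toy_msa.items()}
print(lengths)  # {'seqA/1-5': 5, 'seqB/3-9': 7, 'seqC/2-4': 3}

# max() and min() accept a key function; here the key is the ungapped length
longest_name = max(toy_msa, key=lambda name: count_residues(toy_msa[name]))
shortest_name = min(toy_msa, key=lambda name: count_residues(toy_msa[name]))
print(longest_name, shortest_name)  # seqB/3-9 seqC/2-4
```

When several sequences are tied, `max()` and `min()` return the first one encountered, which is consistent with reporting any of them.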

In [12]:
## msa is the dictionary that will store the MSA
msa = {}
with open(f"./data/{pfam_id}_seed.txt", "r") as file_handle:
    for line in file_handle:
        ######################################################################
        ## write code to parse the MSA file and store it in the dictionary msa
        
        None
        ######################################################################

In [13]:
def compute_num_of_long_proteins(msa: dict) -> int:
    """
    This function computes the number of proteins in the MSA that are longer than
    100 amino acids, excluding gaps
    """
    ####################################################################
    ## write your code here
    None
    ####################################################################

In [None]:
def get_longest_protein(msa: dict) -> str:
    """
    This function returns the name of the protein in the MSA that has the most
    amino acids, excluding gaps
    """
    ####################################################################
    ## write your code here
    None
    ####################################################################

In [14]:
def get_shortest_protein(msa: dict) -> str:
    """
    This function returns the name of the protein in the MSA that has the fewest
    amino acids, excluding gaps
    """
    ####################################################################
    ## write your code here
    None
    ####################################################################