# First steps with Python - exam
--------------------------------

## Instructions

To complete the exam, please solve the 3 exercises in this Jupyter notebook.

**Remember to:**  
* **Comment your code** to convey what you are trying to do and how you are trying to do it.
* Keep in mind that it's often useful to write your code **using intermediary functions**, which can help
  you write **cleaner and more readable code**.

**When you have completed the exam:**
* Rename the notebook to `firstName_lastName.exam.ipynb` before sending it to us. E.g. "Alice Smith" would
  rename the notebook to `alice_smith.exam.ipynb`.
* Send your notebook to `wandrille.duchemin@unibas.ch`.
* If you do not get a confirmation that we received your submission within 3 days, please send us a reminder.


<br>

# Exercise 1

Simulate rolling a die: generate random integers between 1 and 6 (inclusive) until you get the number 6.
1. You should print the number of random numbers generated **before** you got the number 6
   (i.e., the number of dice rolls, excluding the last one).
2. You should print the sum of all the numbers generated **before** you got the number 6.

> 🎯 **hint**: have a look at the `random` built-in module.
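
As a starting point for the hint above, here is a minimal sketch of a single draw. It assumes `random.randint`, which is only one of several functions in the module that could be used, and the variable name `roll` is purely illustrative. It shows one roll only and leaves the loop, the count and the sum to you.

```python
import random

# Draw one integer uniformly from 1 to 6; both bounds are included.
roll = random.randint(1, 6)
print(roll)
```

Running the cell several times should give different values, since the generator is not seeded here.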

<br>

# Exercise 2

The file `name_size.txt` contains data about the name and height of a large number of people.

* Each line corresponds to one person and has two fields: a person's name and their height. 
* Note that a given first name can appear several times in the file, meaning that several people in the sample bear that name.

Run the code cell below, then perform the requested tasks.

In [None]:
d_names = {}

with open("name_size.txt", mode="r") as input_file:
    for l in input_file:
        name, size = l.strip().split(",")
        if not name in d_names:
            d_names[name] = []
        d_names[name].append(float(size))


1. **Explain in a couple of sentences** what the code cell above is doing. (Feel free to run the code, look at its output, and experiment a bit to understand how it works.)

The following questions ask you to use the `d_names` object built by the code cell above.

2. **Find out the name** of the tallest person (i.e. the one associated with the maximum height).

3. **Find out which name**, among the ones starting with the letter `B`, **is the most popular** (i.e. the most frequent).

4. **Compute the average height** of people whose name contains more than 5 letters.


> 🦉 **Remember:** you can subdivide each question into smaller subtasks to achieve the requested result step by step.


<br>

# Exercise 3

We are given a Python script by a colleague that is supposed to work with a certain kind of data file containing information on proteins that contain strange repeat elements. These repeat elements can be approximately 20-30 amino acids long, and each element can be repeated several times per protein. To add to the complexity of this imaginary phenomenon, the proteins typically contain 3-10 different repeat elements.

`peptide_data.tsv` is an example of a data file that we know has the correct format to work with the script. The file is a simple tab-delimited text file, representing a table with the following 4 columns:

| Column name    | Description                                           |
| -------------- | ----------------------------------------------------- |
| protein_ID     | ID of the protein (unique for each protein).          |
| repeat_length  | Length of repeat element in nucleotides.              |
| num_repeat     | How many times this element is repeated.              |
| repeat_ID      | ID of the repeat element.                             |

<br>

However, the script (given below) has somehow accumulated many bugs (around 10) before it landed in our hands. The bugs are of different kinds, including but not limited to typos, missing lines and wrong types. Your task is to **fix the code and annotate your fixes**. Note that:

* The **_correct_** script should calculate the average size of repeated regions per protein.
* In the script's output file, the first letter of the protein IDs should be capitalized. 
* 🔥 **Important:** Please indicate any "fix" you do in the code with a comment that briefly describes 
  what you fixed and/or why it was needed.

As a reference, you are also given the file `output.tsv`, which contains the correct results, i.e. what the script produces once it is fixed and works as intended. You can compare your output to this file to check whether your script works.

In [None]:
def parse_line(line):
    """
    This function accepts a string containing a single line from an input file.
    The line must correspond to a line of a file with the following structure:
    
    1            2                3                 4
    protein_ID   HPAA_repeat_len  HPAA_num_repeat   HPAA_ID

    The function returns a tuple of protein ID and two integers: the length of 
    the repeat and the number of repeat element.
    (protein_id, repeat_len, num_repeat)
    """
    parsed = line.split(',')
    return parsed[1], parsed[2], parsed[3]


def calc_total_len(repeat_len, num_repeat):
    """Compute the product of repeat_len and num_repeat"""
    return repeat_len * num_repeat


# Initialize a new data dictionary
data = []


# Read the input file
with open("peptide_data.tsv") as infile:
    for line in infile:
        protein_id, repeat_len, num_repeat = parse_line(line)
        protein_id[0] = protein_id[0].upper()
        if protein_id not in data:
            data[protein_id] = []
    data[protein_id].append(calc_total_len(repeat_len, num_repeat))

# Write the output
with open("output2.tsv") as outfile:
    for protein in sorted(data.keys()):
        repeats = data[protein]
        total_num_repeats = 0
        total_size_repeats = 0
        for repeat_size in repeats
            total_num_repeats += 1
        total_size_repeats += repeat_size
        print(protein, total_size_repeats / total_num_repeats,
              sep="\t", file=oufile)
