## Week 1 Exercise - Calculating AT content
Exercise obtained and modified from https://pythonforbiologists.com


This exercise is going to involve a mixture of strings and numbers. Let's remind ourselves of the easiest way to calculate AT content:

$$\text{AT content} = \frac{A + T}{\text{length}}$$

There are three numbers we need to figure out: the number of A characters, the number of T characters, and the length of the sequence. We know that we can get the length of the sequence using the `len()` function, and we can count the number of A and T using the `count()` method. Here are a few lines of code that we think will calculate the numbers we need:

In [1]:
my_dna = "ACTGATCGATTACTTTTTTTTTTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')

At this point, it seems sensible to check that these lines work before we go any further. So rather than diving straight in and doing some calculations, let's print out these numbers so that we can eyeball them and see if they look approximately right. We'll have to remember to turn the numbers into strings using `str()` so that we can print them:

In [2]:
my_dna = "ACTGATCGATTACTTTTTTTTTTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')

print("length: " + str(length))
print("A count: " + str(a_count))
print("T count: " + str(t_count))

length: 54
A count: 13
T count: 26


That looks about right, but how do we know if it's exactly right? We could go through the sequence manually, base by base, and verify that there are thirteen As and twenty-six Ts, but that doesn't seem like a great use of our time. Also, what would we do if the sequence were 54 kilobases rather than 54 bases? A better idea is to run the exact same code with a much shorter test sequence, to verify that it works before going ahead and running it on the larger sequence.
Here's a version that uses a very short test sequence containing one A, two Ts, and one C:

In [3]:
test_dna = "ATTC"
length = len(test_dna)
a_count = test_dna.count('A')
t_count = test_dna.count('T')

print("length: " + str(length))
print("A count: " + str(a_count))
print("T count: " + str(t_count))

length: 4
A count: 1
T count: 2


Everything looks OK – we can probably go ahead and run the code on the long sequence. But wait; we know that the next step is going to involve doing some calculations using the numbers. If we switch back to the long sequence now, then we'll be in the same position as we were before – we'll end up with an answer for the AT content, but we won't know if it's the right one. 

A better plan is to stick with the short test sequence until we've written the whole program, and check that we get the right answer for the AT content (we can easily see by glancing at the test sequence that three of its four bases are A or T, so the AT content is 0.75). Here goes – we'll use the add and divide symbols from the exercise hint:

In [4]:
test_dna = "ATTC"
length = len(test_dna)
a_count = test_dna.count('A')
t_count = test_dna.count('T')

at_content = a_count + t_count / length
print("AT content is " + str(at_content))

AT content is 1.5


That doesn't look right. Looking back at the code we can see what has gone wrong – in the calculation, the division has taken precedence over the addition, so what we have actually calculated is:

$$A + \frac{T}{\text{length}}$$
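The precedence rule can be checked in isolation. The following is a minimal sketch that plugs in the counts from the test sequence `ATTC` (one A, two Ts, length 4); the names `as_written` and `explicit_grouping` are illustrative only and are not part of the exercise:

```python
a_count, t_count, length = 1, 2, 4  # counts for the test sequence "ATTC"

# Division binds more tightly than addition, so these two lines compute the same thing.
as_written = a_count + t_count / length
explicit_grouping = a_count + (t_count / length)

print(as_written == explicit_grouping)  # True
print(as_written)                       # 1.5
```

Writing the grouping out explicitly makes it easier to see that only the T count is being divided by the length.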

To fix it, all we need to do is add some parentheses around the addition, so that the line becomes:

```
at_content = (a_count + t_count) / length
```

Now we get the correct output for the test sequence:

In [5]:
test_dna = "ATTC"
length = len(test_dna)
a_count = test_dna.count('A')
t_count = test_dna.count('T')

at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

AT content is 0.75


We can now go ahead and run the program using the longer sequence, confident that the code is working and that the calculations are correct. Here's the final version:

In [6]:
# at_content.py

my_dna = "ACTGATCGATTACTTTTTTTTTTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')

at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

AT content is 0.7222222222222222
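One further sanity check on the long sequence, which is not part of the original exercise, is to count all four bases and confirm that the counts add up to the length. This is only a sketch, and it assumes the sequence is written in upper case with no characters other than A, T, G and C; the names `g_count`, `c_count` and `total` are introduced here for illustration:

```python
my_dna = "ACTGATCGATTACTTTTTTTTTTTGCTATCATACATATATATCGATGCGTTCAT"

# Count each of the four bases separately.
a_count = my_dna.count('A')
t_count = my_dna.count('T')
g_count = my_dna.count('G')
c_count = my_dna.count('C')

# If the sequence contains only A, T, G and C, the counts must add up to its length.
total = a_count + t_count + g_count + c_count
print(total == len(my_dna))
```

A mismatch here would suggest either a counting mistake or an unexpected character in the sequence.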


When trying to choose between different ways to write a program, always favour the solution that is clearest in intent and easiest to read. 


## What have we learned?


On the surface, this exercise is about manipulating DNA sequences. 

On a deeper level, however, the exercise is about learning to break down problems into individual steps which can be solved using the tools available to us. Even the simplest of problems requires using several different tools in the right order. The remainder of the exercises in this course – and nearly all the programs you will write in the future – will require you to break down problems in this way.

We've also learned a specific lesson: it's important to test code using simple inputs in order to check that it's giving the right answer.
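One way to make this kind of check repeatable is to write the expected answers down as `assert` statements. The sketch below wraps the calculation in a helper function, `get_at_content`, which is a hypothetical name introduced here and not something the exercise defines. It assumes Python 3, where `/` always gives a decimal result, and upper-case input:

```python
def get_at_content(dna):
    """Return the fraction of bases in dna that are A or T."""
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    return (a_count + t_count) / length

# Short inputs whose AT content is easy to work out by hand.
assert get_at_content("ATTC") == 0.75  # three of four bases are A or T
assert get_at_content("ATGC") == 0.5   # two of four bases are A or T
assert get_at_content("GGCC") == 0.0   # no A or T at all
print("all checks passed")
```

If a later change reintroduces the precedence mistake, the first `assert` fails immediately instead of silently printing a wrong number.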
