<h1 id="toctitle">Performance and benchmarking</h1>
<ul id="toc"/>

---

Distinct but related concepts:

- Measuring
    - Benchmarking (how long does something take)
        - CPU
        - Memory (less so)
    - Profiling (which bits are slow)
        - CPU
        - ~~Memory~~ (not today)
- Optimizing

As with most programming jobs, there is a range of tools, from simple to complex.

## Benchmarking

In approximate order of usefulness:

### Unix time
How long does our program take to run? On Linux/Mac we can do 

```
time somecommand
```

In IPython, prefix shell commands with `!`:


In [None]:
!date

Given output that looks like this:

```
real	0m0.490s
user	0m0.457s
sys	0m0.032s
```

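- real is the wall-clock time from start to finish (affected by busy systems and other programs)
- user is the CPU time spent executing our code in user mode
- sys is the CPU time the kernel spends on our behalf in system calls (file IO, memory, network); time spent merely waiting, for example on a slow disk, only shows up in real

user+sys is probably the most useful, since it is the total CPU time our program consumed.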
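
The same breakdown is available from inside Python. This is a minimal sketch, assuming the standard library's `os.times()` is an acceptable stand-in for the shell's `time`; the helper `cpu_and_wall` and the two toy workloads are made up for illustration:

```python
import os
import time

def cpu_and_wall(func):
    """Run func() and return (real, user, sys) seconds for this process."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    func()
    wall_after = time.perf_counter()
    cpu_after = os.times()
    return (wall_after - wall_before,
            cpu_after.user - cpu_before.user,
            cpu_after.system - cpu_before.system)

def busy():
    # CPU-bound work: should show up mostly as user time
    return sum(i ** 3 for i in range(200000))

def idle():
    # waiting, not computing: counts towards real time only
    time.sleep(0.2)

for name, func in [("busy", busy), ("idle", idle)]:
    real, user, sys_time = cpu_and_wall(func)
    print(f"{name}: real={real:.3f}s user={user:.3f}s sys={sys_time:.3f}s")
```

The busy loop should report user time close to real time, while the sleeping function should report almost no user or sys time, because waiting does not consume CPU.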

### Manual timing

Just measure the current time at the start of the code, then again at the end, and take the difference.

`time.time()` gives us the current time as a UNIX timestamp: the number of seconds since the epoch, midnight UTC on January 1st 1970 (don't ask).

In [None]:
import time 
time.time()

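On most systems this has very high resolution. For measuring short intervals, the standard library also offers `time.perf_counter()`, a monotonic clock with the highest available resolution.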

In [None]:
import time 
start = time.time() 

# print the sum of the first million cube numbers
x = 0 
for i in range(1000000): 
    x = x + i ** 3 
print(x) 
 
end = time.time() 
print(end - start) 

This is arguably better than using the `time` command, as it doesn't include Python startup time and so on. However, it is still affected by other processes.
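
Remembering to take both timestamps gets tedious when timing several blocks. One possible convenience, sketched here with a hypothetical `stopwatch` helper (not part of the standard library), wraps the same start/end arithmetic in a context manager and uses `time.perf_counter()`, the standard-library clock intended for measuring intervals:

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label, results):
    """Store the elapsed seconds for the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with stopwatch("sum of cubes", timings):
    # the sum of the first million cube numbers, as in the cell above
    x = 0
    for i in range(1000000):
        x = x + i ** 3

print(x)
print(f"{timings['sum of cubes']:.3f} seconds")
```

Collecting the results in a dictionary keeps the timing code out of the way of the code being measured.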

### `timeit` module

Python has a built-in module for timing. From the command line:

In [None]:
!python -m timeit "4 ** 10"

In [None]:
%%timeit
4 ** 10

Nice features:
- automatically runs the code many times to get an accurate measurement
- repeats the whole measurement several times (five by default in recent Python versions, three in older ones) and reports the best, which reduces the influence of other processes
- gives the answer in easy to read units:

In [None]:
!python -m timeit "12345 in list(range(1000000))"

In the above code, do we spend more time constructing the list or checking if the number is in it? (In Python 3 a bare `range` is lazy and cheap to create, so we wrap it in `list()` to build a real list.) Let's try just constructing the list:

In [None]:
!python -m timeit "list(range(1000000))"

Yep, it takes loads of time to construct the list. Separate that bit out with a setup (`-s`) command:

In [None]:
!python -m timeit -s "r=list(range(1000000))" "12345 in r"

In IPython, we have magic convenience functions:

In [None]:
%timeit 4 ** 10

In [None]:
r = list(range(1000000))
%timeit 12345 in r

In [None]:
%%timeit
# print the sum of the first million cube numbers
x = 0 
for i in range(1000000): 
    x = x + i ** 3 
#print(x) 
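
The same measurement can be made from ordinary Python code, which helps outside IPython. This is a minimal sketch using `timeit.repeat()` from the standard library; the loop counts (`number=200`, `repeat=5`) are arbitrary choices for illustration:

```python
import timeit

# setup runs once per measurement and is not included in the timing
setup = "r = list(range(1000000))"
stmt = "12345 in r"

# number = executions per measurement, repeat = independent measurements
number = 200
runs = timeit.repeat(stmt, setup=setup, number=number, repeat=5)

# the fastest run is the one least disturbed by other processes
best = min(runs) / number
print(f"best of {len(runs)}: {best * 1e6:.1f} microseconds per loop")
```

Taking the minimum rather than the mean follows the same reasoning as the command-line tool: slower runs are usually caused by interference, not by the code itself.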

`timeit` is useful for quickly checking which approach is faster. `timeit` case study: which way is faster to calculate AT content - counting a and t, or looking at each base and keeping a tally?

In [None]:
def at_count(dna): 
    return (dna.count('a') + dna.count('t')) / len(dna) 
 
def at_iter(dna): 
    a_count = 0 
    t_count = 0 
    for base in dna: 
        if base == 'a': 
            a_count = a_count + 1 
        elif base == 't': 
            t_count = t_count + 1 
    return (a_count + t_count) / len(dna) 

test_dna = 'atcgatcgatcatgatcggatcgtagctagcatctagtc' 
assert(at_count(test_dna) == at_iter(test_dna)) 

Which is faster?

In [None]:
%timeit at_count(test_dna)

In [None]:
%timeit at_iter(test_dna)

Hmmm, something odd is going on. Very short strings don't give reliable benchmarking results: fixed overheads such as the function call dominate, and optimizations inside CPython can skew the picture. Let's try a more realistic input:

In [None]:
import random
def random_dna(length):
    # lowercase bases, to match what at_count() and at_iter() look for
    return "".join([random.choice(['a','t','g','c']) for _ in range(length)])

In [None]:
random_dna(20)

Now we can compare the two functions:

In [None]:
%timeit at_count(random_dna(10000))
%timeit at_iter(random_dna(10000))

Looks about equal, but wait: what if most of the time is spent generating the random DNA sequence? This is fairer:

In [None]:
d = random_dna(10000)
%timeit at_count(d)
%timeit at_iter(d)

Summary: 

- getting timing right is hard
- `count()` is faster than iterating in Python (its counting loop runs in fast C code)
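
To see how input size changes the picture, one could time both functions over several lengths. This is a sketch under the assumption that lowercase random sequences are a fair test input; the `results` dictionary and the chosen lengths are illustrative only:

```python
import random
import timeit

def at_count(dna):
    return (dna.count('a') + dna.count('t')) / len(dna)

def at_iter(dna):
    a_count = 0
    t_count = 0
    for base in dna:
        if base == 'a':
            a_count = a_count + 1
        elif base == 't':
            t_count = t_count + 1
    return (a_count + t_count) / len(dna)

def random_dna(length):
    return "".join(random.choice("atgc") for _ in range(length))

results = {}
for length in (10, 1000, 100000):
    d = random_dna(length)  # generated once, outside the timed code
    assert at_count(d) == at_iter(d)
    for func in (at_count, at_iter):
        # best of 3 measurements, 20 calls each, reported per call
        seconds = min(timeit.repeat(lambda: func(d), number=20, repeat=3)) / 20
        results[(length, func.__name__)] = seconds
        print(f"{func.__name__:>8} length={length:>6}: {seconds * 1e6:10.2f} microseconds")
```

Generating each sequence before timing keeps the cost of `random_dna` out of the measurement.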

### Benchmarking memory

Here's the short story. Install `psutil` (see https://pypi.python.org/pypi/psutil):

`pip install psutil`

then

In [None]:
import psutil, os 
process = psutil.Process(os.getpid()) 
mem = process.memory_info().rss / 1024 / 1024 
print("Used this much memory: " + str(mem) + ' MB')

In [None]:
process.memory_info().rss


Problem: this is not much use in IPython notebooks, as the figure includes everything that has been executed so far in the session. For simple scripts, it works better.
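
Inside a notebook, one alternative is the standard library's `tracemalloc`, which only counts memory allocated by Python while tracing is switched on. This is a rough sketch, not a replacement for the process-level figure from `psutil`; the helper name `traced_memory_mb` is invented for this example:

```python
import tracemalloc

def traced_memory_mb(build):
    """Call build() and return (current, peak) traced memory in MB."""
    tracemalloc.start()
    result = build()  # keep a reference so the object is not freed early
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current / 1024 / 1024, peak / 1024 / 1024

current_mb, peak_mb = traced_memory_mb(lambda: list(range(200000)))
print(f"current: {current_mb:.2f} MB, peak: {peak_mb:.2f} MB")
```

The numbers will not match the resident set size reported by `psutil`, because `tracemalloc` ignores memory used by the interpreter itself and by anything allocated before tracing started.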

In [None]:
cmds = 'import psutil, os'
cmds += '\n' + 'process = psutil.Process(os.getpid())'
cmds += '\n' + 'mem = process.memory_info().rss / 1024 / 1024'
cmds += '\n' + 'print("Used this much memory: " + str(mem) + " MB")'
with open('check_mem.py', 'w') as f:
    f.write(cmds)

!python check_mem.py

This lets us investigate time/memory trade-offs. We know that checking whether a number is in a set is faster than checking whether it's in a list:

In [None]:
# build a real list: in Python 3 a bare range object has its own fast membership test
l = list(range(1000000))
s = set(l)
%timeit 12345 in l
%timeit 12345 in s

but how much longer does it take to create the data structure in the first place?

In [None]:
%timeit list(range(1000000))
%timeit set(range(1000000))

and how much more memory does it take to hold the set?

In [None]:
cmd1 = '\nimport psutil, os'
cmd2 = '\nprocess = psutil.Process(os.getpid())'
cmd2 += '\nmem = process.memory_info().rss / 1024 / 1024'
cmd2 += '\nprint("Used this much memory: " + str(mem) + " MB")'

# assign the result to a name so it is still alive when memory is measured
cmd_list = '\ndata = list(range(1000000))'
cmd_set = '\ndata = set(range(1000000))'

with open('list_mem.py', 'w') as f_list, open('set_mem.py', 'w') as f_set:
    f_list.write(cmd1 + cmd_list + cmd2)
    f_set.write(cmd1 + cmd_set + cmd2)
!python list_mem.py
!python set_mem.py

Conclusions:
- if we need to create a collection once then check membership many times, a set will be faster
- if we need to create many collections and check each only a few times, a set might be slower
- a set will use more (about x2) memory for these ranges

Of course, everything might be different for non-integers!
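
For a quick look at the containers alone, without launching separate scripts, `sys.getsizeof()` can be compared directly. This is only a sketch: `sys.getsizeof()` reports the size of the container object itself, not of the integers it refers to, so the ratio it shows will differ from the whole-process figures above.

```python
import sys

n = 1000000
as_list = list(range(n))
as_set = set(as_list)  # refers to the same integer objects as the list

# size of the container structures only, in MB
list_mb = sys.getsizeof(as_list) / 1024 / 1024
set_mb = sys.getsizeof(as_set) / 1024 / 1024

print(f"list container: {list_mb:.1f} MB")
print(f"set container:  {set_mb:.1f} MB")
```

The set is larger because its hash table keeps spare slots and stores a hash alongside each entry, which is the price paid for fast membership tests.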

## Profiling

Profiling is the process of taking an existing piece of code and identifying which bits are taking the time. 

Scenario: given

- a single long DNA sequence
- a collection of interesting 4-base motifs

we want to identify frequently occurring (say, more than 50 times) 4-base motifs in the sequence and divide them into ones that are also on the interesting list and ones that aren't.



In [None]:
# create a random dna sequence
dna = random_dna(10000)

# create 100 random interesting motifs
motifs = [random_dna(4) for _ in range(100)]

In [None]:
%%timeit
# standard kmer counting code to identify frequent chunks
frequent_chunks = [] 
for start in range(len(dna) - 3): 
    chunk = dna[start:start + 4] 
    if dna.count(chunk) > 50: 
        frequent_chunks.append(chunk) 

# now check each chunk to see if it's in the list of motifs
for chunk in frequent_chunks: 
    if chunk in motifs: 
        print(chunk + " is frequent and interesting") 
    else: 
        print(chunk + " is frequent but not interesting")

How can we speed this program up? We know that checking to see if an element is in a list is slow, so let's change it to a set:

In [None]:
# create up to 100 random interesting motifs (a set drops any duplicates)
motifs = {random_dna(4) for _ in range(100)}

In [None]:
%%timeit
# standard kmer counting code to identify frequent chunks
frequent_chunks = [] 
for start in range(len(dna) - 3): 
    chunk = dna[start:start + 4] 
    if dna.count(chunk) > 50: 
        frequent_chunks.append(chunk) 

# now check each chunk to see if it's in the set of motifs
for chunk in frequent_chunks: 
    if chunk in motifs: 
        print(chunk + " is frequent and interesting") 
    else: 
        print(chunk + " is frequent but not interesting")
        
print(len(frequent_chunks))
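
Timing the whole cell tells us how long it takes, but not where the time goes. One way to find out is the standard library's `cProfile`. The sketch below assumes the cell's logic has been split into two hypothetical functions, `find_frequent_chunks` and `classify`, so that each shows up separately in the report; the `print` calls are left out to keep the output readable:

```python
import cProfile
import io
import pstats
import random

def random_dna(length):
    return "".join(random.choice("atgc") for _ in range(length))

def find_frequent_chunks(dna, min_count=50):
    # same kmer counting loop as in the cell above
    frequent_chunks = []
    for start in range(len(dna) - 3):
        chunk = dna[start:start + 4]
        if dna.count(chunk) > min_count:
            frequent_chunks.append(chunk)
    return frequent_chunks

def classify(frequent_chunks, motifs):
    # split the frequent chunks by membership in the motif collection
    interesting, boring = [], []
    for chunk in frequent_chunks:
        (interesting if chunk in motifs else boring).append(chunk)
    return interesting, boring

def run(dna, motifs):
    chunks = find_frequent_chunks(dna)
    return chunks, classify(chunks, motifs)

dna = random_dna(10000)
motifs = {random_dna(4) for _ in range(100)}

profiler = cProfile.Profile()
profiler.enable()
chunks, (interesting, boring) = run(dna, motifs)
profiler.disable()

# show the five entries with the largest cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Comparing the cumulative time of `find_frequent_chunks` with that of `classify` in the report shows which of the two steps is worth optimizing first.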

In [None]:
# ignore this cell, it's for loading custom js code
from IPython.core.display import Javascript
Javascript(filename="custom.js")

In [None]:
# ignore this cell, it's for loading custom css code
from IPython.core.display import HTML
HTML(filename="custom.css")