## `lab12`—Optimizing over Time

❖ Objectives

-   Understand the effect of code structure and algorithm choice on code run time.

### Relative Efficiency

Since there are many ways to solve most problems in computational science, let's investigate one such case.  If you have a list of objects and need to find duplicates, how would you go about it?  Naïvely, you can simply search through the list, comparing each object with every other one and collecting the duplicates in a list.

In [None]:
numbers = [12, 15, 12, 2, 6, 7, 1, 2, 2]
duplicates = []

for i in range(0, len(numbers)):  # note that we are making the range explicit here
    for j in range(0, len(numbers)):
        if i == j:  # don't compare numbers[0] to numbers[0], etc.
            continue
        if numbers[i] == numbers[j]:
            duplicates.append(numbers[j])

print(duplicates)

That turns out to be overkill for this problem:  each member of a set of duplicates sees all of the others as well, so you get repeated entries rather than just the duplicated numbers.  (That is, we want the list `duplicates` to contain only `[12, 2]` rather than `[12, 12, 2, 2, 2, 2, 2, 2]`.)

-   Compose a function `find_duplicates` which accepts a list `values`.  The function `find_duplicates` should return only the numbers which have duplicates (and only one of each).  Use the code above (without modifying the algorithm); you will also need a comparison of the type
        
        values[j] not in duplicates
    
    to catch multiple duplicates.
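The `not in` comparison can be tried on its own before you build it into your function.  Here is a minimal sketch; the names `seen` and `candidates` are made up for this illustration and are not part of the lab:

```python
# Hypothetical illustration of the `not in` membership test.
seen = []
candidates = [2, 2, 12, 2, 12]

for value in candidates:
    # `value not in seen` is True only the first time a value shows up.
    if value not in seen:
        seen.append(value)

print(seen)  # [2, 12]
```

The same test, applied to `duplicates`, keeps a value from being appended more than once.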

In [None]:
# Write your function here.  This includes any necessary import statements.

In [None]:
# it should pass this test---do NOT edit this cell
from nose.tools import assert_equal

test_values = ['a', 'a', 'b', 'd', 'a', 'b', 'c', 'd', 'a']
test_result = find_duplicates(test_values)
assert_equal(type(test_result), list, msg="\nYour function doesn't return a list.")
assert_equal(len(test_result), 3, msg="\nYour function does not return the correct number of duplicate items.")
assert_equal(test_result, ['a', 'b', 'd'], msg="\nYour function returns the wrong items as duplicates.")

print('All tests passed successfully.')

A simple optimization can be made to improve this code, particularly for large lists.  We are comparing each value to all others, *even the ones that have already been checked*.  This means that in many cases, we already know that something is a duplicate.  In graphical form, we are doing this:

<img src="./img/duplicate-all.png" width="40%;" />

when we could be doing this:

<img src="./img/duplicate-half.png" width="40%;" />

because we *already checked* the previous (left-hand) values for duplicateness.

The relevant change:  switch the range of the `j` loop from `(0, len(values))` to `(i+1, len(values))`.

-   Compose a new function `better_find_duplicates` which makes this change.

In [None]:
# Write your function here.  This includes any necessary import statements.

In [None]:
# it should pass this test---do NOT edit this cell
from nose.tools import assert_equal

test_values = ['a', 'a', 'b', 'd', 'a', 'b', 'c', 'd', 'a']
test_result = better_find_duplicates(test_values)
assert_equal(type(test_result), list, msg="\nYour function doesn't return a list.")
assert_equal(len(test_result), 3, msg="\nYour function does not return the correct number of duplicate items.")
assert_equal(test_result, ['a', 'b', 'd'], msg="\nYour function returns the wrong items as duplicates.")

print('All tests passed successfully.')

What we intend to do now is compare the relative speeds of these two functions for large sets of data.  A reasonable hypothesis is that the second function, `better_find_duplicates`, is twice as fast as `find_duplicates`, since it requires only half as many tests.  We'll see if we can confirm or discard this guess.
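The "half as many tests" part of that hypothesis can be checked by counting before any timing is done.  This is a small sketch under stated assumptions:  the helper `count_comparisons` is hypothetical, and it counts only the equality tests, ignoring the cost of `not in` and of appending.

```python
def count_comparisons(n, start_at_next):
    """Count inner-loop equality tests for a list of length n.

    start_at_next=False mimics range(0, n) with the i == j skip;
    start_at_next=True mimics range(i+1, n).
    """
    count = 0
    for i in range(n):
        first_j = i + 1 if start_at_next else 0
        for j in range(first_j, n):
            if i == j:
                continue
            count += 1
    return count

full = count_comparisons(32, start_at_next=False)
half = count_comparisons(32, start_at_next=True)
print(full, half, half / full)  # 992 496 0.5
```

For a list of length $n$ the counts are $n(n-1)$ and $n(n-1)/2$, so the ratio of tests is exactly one half.  Whether the run time follows the same ratio is what the timing below is meant to find out.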

There are a few good ways to time code in Python.  You saw `time.time` in the lecture—it returns the current computer time, which we can use before and after a process.  A more automatic way is the `%timeit` command, which is explicitly designed to test functions repeatedly and report their run time.  (The percent sign in front means that this is a Jupyter notebook command, rather than a specifically Python command.)
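For reference, the `time.time` pattern looks roughly like this.  It is a minimal sketch, and `busy_work` is a hypothetical stand-in for whatever code you want to time:

```python
import time

def busy_work(n):
    """Hypothetical stand-in for the code being timed."""
    total = 0
    for k in range(n):
        total += k * k
    return total

start = time.time()              # wall-clock time before the work
result = busy_work(100_000)
elapsed = time.time() - start    # seconds elapsed

print('%.6f s' % elapsed)
```

A single measurement like this is noisy, which is one reason to prefer a tool that repeats the run for you.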

In [None]:
trial_values1 = list(range(16))*2
print(trial_values1)
%timeit find_duplicates(trial_values1)

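`%timeit` runs your code many times and summarizes the run times.  Older IPython versions report the shortest run time from that set, while newer ones report the mean and standard deviation.  (The result may also be affected by other programs running on your computer.)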

In [None]:
print(trial_values1)
%timeit better_find_duplicates(trial_values1)

On my machine, I had the following results:

| function | best time |
|----------|-----------|
| `find_duplicates` | 176 µs |
| `better_find_duplicates` | 99 µs |

$99/176 = 0.5625$, or roughly 1.8 times as fast.  That's pretty close to the predicted factor of two (half the time), and we could reasonably expect it to improve as the data set becomes larger.  (Ideal behavior is often approached with larger data sets, since the fixed overhead of the function call and return becomes a smaller proportion of the total work.)

Let's use a bigger data set.

In [None]:
trial_values2 = [5] * 1000
%timeit find_duplicates(trial_values2)
%timeit better_find_duplicates(trial_values2)

Note that `%timeit` (rather intelligently) opts for fewer loops since each loop is more intensive.  In this case, the ratio of my run times improves to $103/204 \approx 0.505$.

It can be shown that performance in this algorithm also increases when the list is sorted first (we assume that the sorting only need happen once).

-   Compose a function `sorted_find_duplicates` which performs as above, except that `values` is sorted before the search is made.  (You may use `values.sort()`, which modifies the list in-place.)
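Since "in-place" matters here, this short sketch contrasts `list.sort()` with the built-in `sorted()`.  The list `original` is made up for the illustration:

```python
original = ['d', 'a', 'c', 'a']

# sorted() returns a new list and leaves its argument alone.
copy_sorted = sorted(original)
print(original)     # ['d', 'a', 'c', 'a']
print(copy_sorted)  # ['a', 'a', 'c', 'd']

# list.sort() reorders the list itself and returns None.
returned = original.sort()
print(original)     # ['a', 'a', 'c', 'd']
print(returned)     # None
```

If you call `values.sort()` inside your function, the caller's list is reordered too.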

In [None]:
# Write your function here.  This includes any necessary import statements.

In [None]:
# it should pass this test---do NOT edit this cell
from nose.tools import assert_equal

test_values = ['a', 'a', 'b', 'd', 'a', 'b', 'c', 'd', 'a']
test_result = sorted_find_duplicates(test_values)
assert_equal(type(test_result), list, msg="\nYour function doesn't return a list.")
assert_equal(len(test_result), 3, msg="\nYour function does not return the correct number of duplicate items.")
assert_equal(test_result, ['a', 'b', 'd'], msg="\nYour function returns the wrong items as duplicates.")

print('All tests passed successfully.')

Now we'll try all three functions thus far—`find_duplicates`, `better_find_duplicates`, and `sorted_find_duplicates`—on a large random list.

In [None]:
import numpy.random as npr
trial_values3 = npr.randint(0,10,size=1000)
print(trial_values3)
%timeit find_duplicates(trial_values3)
%timeit better_find_duplicates(trial_values3)
%timeit sorted_find_duplicates(trial_values3)

Notice the pause after the first two runs of `%timeit`—this is the list being sorted during the first trial run of `sorted_find_duplicates`.  My performance is marginally better, on the order of $2\%$ in this case.

A very great improvement can often be achieved by moving from Python to C—in this case, NumPy, which has much of its actual numerical code written in the very efficient C language.

C is a *compiled* language, meaning that the code is converted directly into machine language before being run.  This gives it a great deal of power and efficiency, which is why many numerical applications are written in C.
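To get a feel for the difference, here is a rough sketch that sums the same array with a Python loop and with NumPy's compiled `sum`.  The helper `python_sum` and the array `data` are assumptions made for this example, and the timings you see will depend on your machine:

```python
import timeit
import numpy as np

data = np.arange(10_000, dtype=float)

def python_sum(values):
    """Add the elements one at a time in interpreted Python."""
    total = 0.0
    for v in values:
        total += v
    return total

# Time five calls of each approach; both compute the same number.
t_loop = timeit.timeit(lambda: python_sum(data), number=5)
t_numpy = timeit.timeit(lambda: data.sum(), number=5)
print('loop: %.6f s, NumPy: %.6f s' % (t_loop, t_numpy))
```

The speedup comes from doing the whole loop inside compiled code.  Wrapping a Python loop around a NumPy array, one element at a time, does not gain much.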

In [None]:
# although demonstrative, this is not a particularly efficient code
import numpy as np

def numpy_find_duplicates(values):
    duplicates = []
    values = np.array(values)
    values.sort()
    for i in range(0, len(values)):  # note that we are making the range explicit here
        for j in range(i+1, len(values)):  # j starts past i, so no i == j check is needed
            if values[i] == values[j] and values[j] not in duplicates:
                duplicates.append(values[j])
    
    return duplicates

In [None]:
import numpy.random as npr
trial_values3 = npr.randint(0,10,size=1000)
print(trial_values3)
%timeit find_duplicates(trial_values3)
%timeit better_find_duplicates(trial_values3)
%timeit sorted_find_duplicates(trial_values3)
%timeit numpy_find_duplicates(trial_values3)  # note that sorting has already taken place as well

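I'm going to throw one more into the mix:  this one uses the SciPy `itemfreq` function, which tells you how many times each item occurs in a list.  (`itemfreq` was deprecated in SciPy 1.0 and is missing from recent releases; `numpy.unique(values, return_counts=True)` gives the same information.)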

In [None]:
from scipy.stats import itemfreq

def scipy_find_duplicates(values):
    duplicates = []
    freqs = itemfreq(values)
    
    for i in freqs:
        if i[1] > 1:
            duplicates.append(i[0])
    
    return duplicates
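Here is a sketch of the same counting idea using `numpy.unique` with `return_counts=True`, which may be useful if `itemfreq` is unavailable in your SciPy installation.  The name `unique_find_duplicates` is hypothetical and is not used elsewhere in this lab:

```python
import numpy as np

def unique_find_duplicates(values):
    """Hypothetical variant: return the values that occur more than once."""
    # items holds the sorted distinct values, counts how often each occurs.
    items, counts = np.unique(values, return_counts=True)
    return items[counts > 1].tolist()

print(unique_find_duplicates([12, 15, 12, 2, 6, 7, 1, 2, 2]))  # [2, 12]
```

Note that the result comes back in sorted order rather than in order of first appearance.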

In [None]:
import numpy.random as npr
trial_values3 = npr.randint(0,10,size=1000)
print(trial_values3)
%timeit find_duplicates(trial_values3)
%timeit better_find_duplicates(trial_values3)
%timeit sorted_find_duplicates(trial_values3)
%timeit numpy_find_duplicates(trial_values3)  # note that sorting has already taken place as well
%timeit scipy_find_duplicates(trial_values3)

In this case, my values are:

| function | best time | ratio to worst |
|----------|-----------|----------------|
| `find_duplicates` | 287 ms | 100% |
| `better_find_duplicates` | 146 ms | 50.9% |
| `sorted_find_duplicates` | 144 ms | 50.2% |
| `numpy_find_duplicates` | 142 ms | 49.4% |
| `scipy_find_duplicates` | 75.6 µs | 0.03% |

In the last case, with `scipy`, there is some *serious* speedup taking place.  (NumPy could be similar if we used a more matrix-based algorithm, but the code gets messy so I'll spare you.)
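For the curious, one possible matrix-based version is sketched below.  This is an assumption about what such an algorithm could look like, not the approach alluded to above, and `matrix_find_duplicates` is a hypothetical name.  It builds an $n \times n$ comparison matrix, so it trades memory for speed:

```python
import numpy as np

def matrix_find_duplicates(values):
    """Hypothetical vectorized variant built on an n-by-n comparison matrix."""
    v = np.asarray(values)
    equal = v[:, None] == v[None, :]   # equal[i, j] is True when v[i] == v[j]
    counts = equal.sum(axis=1)         # occurrences of each element (itself included)
    return np.unique(v[counts > 1]).tolist()

print(matrix_find_duplicates([12, 15, 12, 2, 6, 7, 1, 2, 2]))  # [2, 12]
```

All of the comparisons happen inside NumPy, so no Python-level loop is left.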

### Algorithm Scaling

Commonly you need to know how one or more algorithms perform with respect to data set size $n$.  In this section, you're going to calculate run times for various scenarios and plot the resulting performance curves (this behavior is called scaling).

In order to store these data for plotting, we are going to use a single array or data table, paired with algorithm names (the functions) and values of $n$ (the data set sizes).

|                          | $n = 10$ | $n = 100$ | $n = 1\,000$ | $n = 10\,000$ | $n = 100\,000$ |
|--------------------------|---------|----------|-----------|------------|--------------|
| `find_duplicates`        |         |          |           |            |              |
| `better_find_duplicates` |         |          |           |            |              |
| `sorted_find_duplicates` |         |          |           |            |              |
| `numpy_find_duplicates`  |         |          |           |            |              |
| `scipy_find_duplicates`  |         |          |           |            |              |

Our goal is to fill in the values of this table and then plot it.

In [None]:
names = ['find_duplicates', 'better_find_duplicates', 'sorted_find_duplicates', 'numpy_find_duplicates', 'scipy_find_duplicates']
n = [10, 100, 1000, 10000, 100000]
table = np.zeros( (5, 5) )

The Python code analogue of `%timeit` is `timeit.timeit`.  By default, it doesn't know about the functions and variables in your greater Python environment or Jupyter notebook, so you have to redefine them in a `setup` string.  (In Python 3.5 and later you can instead pass `globals=globals()`.)  You also need to tell it how many times to run the code with `number`; keep this very low, around 1 to 10, until you understand the data set's behavior.  It is used as follows:

In [None]:
import timeit
timeit.timeit(stmt='sin(x)', setup='from numpy import sin; x=[5,6,7]', number=1000)
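As an alternative to a `setup` string, `timeit.timeit` accepts a `globals` argument (Python 3.5 and later) that names the namespace in which the statement runs.  In this minimal sketch, `square_all`, `sample`, and `namespace` are hypothetical names chosen for the illustration:

```python
import timeit

def square_all(values):
    """Hypothetical function that already exists in the current session."""
    return [v * v for v in values]

sample = list(range(100))

# Hand timeit exactly the names the statement needs.
# In a notebook, globals=globals() exposes everything defined so far instead.
namespace = {'square_all': square_all, 'sample': sample}
t = timeit.timeit(stmt='square_all(sample)', globals=namespace, number=1000)
print('%.6f s' % t)
```

The returned value is the total time for all `number` runs, not the time per run.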

We'll populate the cases one at a time for each value of $n$ in `n`.  (This means that most of your table will be zero as we start, but will fill in naturally as the following code executes.)

In [None]:
defn_find_duplicates='''you should copy and paste your definition of find_duplicates from above here in this string'''
defn_find_duplicates+='''
import numpy.random as npr
trial_values2 = npr.randint(0,10,size=1000)
'''

In [None]:
n_trials = 10
t_trials = timeit.timeit(stmt='find_duplicates(trial_values2)', setup=defn_find_duplicates, number=n_trials)
print('%.6f s'%(t_trials))

It's more efficient (for you, the user) to put this in a loop as well:

In [None]:
defn_find_duplicates='''you should copy and paste your definition of find_duplicates from above here in this string'''

In [None]:
n_trials = 10
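for i,num in enumerate(n):
    # num is substituted for %d below, i.e. as the exclusive upper bound of randint; size stays at 1000
    defn_find_duplicates+='''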
import numpy.random as npr
trial_values2 = npr.randint(0,%d,size=1000)
    '''%num
    t_trials = timeit.timeit(stmt='find_duplicates(trial_values2)', setup=defn_find_duplicates, number=n_trials)
    table[0,i] = (t_trials)

print(table)
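If you want the list length itself to vary from run to run, the size has to be the quantity that changes.  This is a minimal sketch under stated assumptions:  `count_equal_pairs` is a stand-in written for this example (not one of your graded functions), `sizes` is deliberately kept small because nested loops grow roughly quadratically, and the `globals` argument requires Python 3.5 or newer.

```python
import timeit
import numpy as np
import numpy.random as npr

def count_equal_pairs(values):
    """Hypothetical stand-in with the same nested-loop shape as the lab functions."""
    pairs = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                pairs += 1
    return pairs

sizes = [10, 100, 400]            # kept small on purpose
timings = np.zeros(len(sizes))

for k, size in enumerate(sizes):
    # The length changes from run to run; the value range stays at 0..9.
    trial = npr.randint(0, 10, size=size).tolist()
    namespace = {'count_equal_pairs': count_equal_pairs, 'trial': trial}
    timings[k] = timeit.timeit('count_equal_pairs(trial)', globals=namespace, number=3)

print(timings)
```

Before extending `sizes`, estimate the cost:  ten times the length means roughly a hundred times the run time for a nested-loop function.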

In [None]:
defn_better_find_duplicates='''you should copy and paste your definition of better_find_duplicates from above here in this string'''

In [None]:
n_trials = 10
for i,num in enumerate(n):
    defn_better_find_duplicates+='''
import numpy.random as npr
trial_values2 = npr.randint(0,%d,size=1000)
    '''%num
    t_trials = timeit.timeit(stmt='better_find_duplicates(trial_values2)', setup=defn_better_find_duplicates, number=n_trials)
    table[1,i] = (t_trials)

print(table)

In [None]:
defn_sorted_find_duplicates='''you should copy and paste your definition of sorted_find_duplicates from above here in this string'''


In [None]:
n_trials = 10
for i,num in enumerate(n):
    defn_sorted_find_duplicates+='''
import numpy.random as npr
trial_values2 = npr.randint(0,%d,size=1000)
    '''%num
    t_trials = timeit.timeit(stmt='sorted_find_duplicates(trial_values2)', setup=defn_sorted_find_duplicates, number=n_trials)
    table[2,i] = (t_trials)

print(table)

In [None]:
defn_numpy_find_duplicates='''you should copy and paste your definition of numpy_find_duplicates from above here in this string'''

In [None]:
n_trials = 10
for i,num in enumerate(n):
    defn_numpy_find_duplicates+='''
import numpy.random as npr
trial_values2 = npr.randint(0,%d,size=1000)
    '''%num
    t_trials = timeit.timeit(stmt='numpy_find_duplicates(trial_values2)', setup=defn_numpy_find_duplicates, number=n_trials)
    table[3,i] = (t_trials)

print(table)

In [None]:
defn_scipy_find_duplicates='''you should copy and paste your definition of scipy_find_duplicates from above here in this string'''

In [None]:
n_trials = 10
for i,num in enumerate(n):
    defn_scipy_find_duplicates+='''
import numpy.random as npr
trial_values2 = npr.randint(0,%d,size=1000)
    '''%num
    t_trials = timeit.timeit(stmt='scipy_find_duplicates(trial_values2)', setup=defn_scipy_find_duplicates, number=n_trials)
    table[4,i] = (t_trials)

print(table)

Now that the table is populated, let's plot the data a couple of different ways.  We'll do this a *bit* differently, and instead of just plotting directly we'll write a function which returns the plot.  Then we can plot it directly.

As an example, this block of code defines a function `plot_sin` which plots sine and cosine curves and returns the axes.  Note that our plotting code is becoming more complicated (since we are going to do more sophisticated things with it).  Now we have a `fig` variable (the figure, that is, the "whole thing") and an `axes` variable (the plot and data together).  It's the latter that we plot on, and that needs to be returned.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def plot_sin():
    # Get the data.
    x  = np.linspace(0,2*np.pi,100)
    y1 = np.sin(x)
    y2 = np.cos(x)
    
    # Plot the data.
    fig, axes = plt.subplots(nrows=1, ncols=1)
    axes.plot(x, y1, label='sine')
    axes.plot(x, y2, label='cosine')
    
    # Arrange plot features for the end viewer.
    axes.set_xlabel('n')
    axes.set_ylabel('t')
    axes.legend(loc='best')  # this makes the legend appear where it covers the fewest data points
    
    return axes

my_axes = plot_sin()

-   Compose a function `plot_lines` which accepts a table `my_table`, a list of functions used to generate the table `my_funcs`, and a list of data set sizes used to generate the table `my_n`.  This function should plot each row of the table data against the data set sizes, with each line's label coming from the proper row in `my_funcs` (as above).  A representative line may look like:
        
        axes.plot(my_n, my_table[2], label=my_funcs[2])
    
    Label the x- and y-axes with `'n'` and `'t'` and add a legend (as above).  The function should return the resulting `axes` (as above).

In [None]:
# Write your function here.  This includes any necessary import statements.
def plot_lines(my_table, my_funcs, my_n):
    pass

In [None]:
# it should pass this test---do NOT edit this cell
# these are worth five points
import matplotlib as mpl
from nose.tools import assert_equal, assert_is_not

test_axes = plot_lines(table, names, n)
assert_equal(isinstance(test_axes, mpl.axes.Axes), True, msg="\nYour function does not return axes.")
assert_equal(len(test_axes.lines), 5, msg="\nYour plot does not have the correct number of lines.")
assert_is_not(len(test_axes.xaxis.get_label_text()), 0, msg="\nYour plot does not have labels on the x-axis.")
assert_is_not(len(test_axes.yaxis.get_label_text()), 0, msg="\nYour plot does not have labels on the y-axis.")
assert_equal(test_axes.legend_.get_visible(), True, msg="\nYour plot does not have a legend.")

print('All tests passed successfully.')

The underlying behavior can also be illuminated by changing the plot type from a linear scale (1, 2, 3, etc.) to a *logarithmic* scale (1, 10, 100, etc.).  In Matplotlib, this can be accomplished by changing from using `plot` to using `loglog` (arguments stay the same).

-   Compose a function `plot_logs` which does the same as `plot_lines` above, but uses `loglog` instead.

In [None]:
# Write your function here.  This includes any necessary import statements.

In [None]:
# it should pass this test---do NOT edit this cell
# these are worth five points
import matplotlib as mpl
from nose.tools import assert_equal, assert_is_not

test_axes = plot_logs(table, names, n)
assert_equal(isinstance(test_axes, mpl.axes.Axes), True, msg="\nYour function does not return axes.")
assert_equal(len(test_axes.lines), 5, msg="\nYour plot does not have the correct number of lines.")
assert_is_not(len(test_axes.xaxis.get_label_text()), 0, msg="\nYour plot does not have labels on the x-axis.")
assert_is_not(len(test_axes.yaxis.get_label_text()), 0, msg="\nYour plot does not have labels on the y-axis.")
assert_equal(test_axes.legend_.get_visible(), True, msg="\nYour plot does not have a legend.")
#assert_equal(test_result, ['a', 'b', 'd'], msg="\nYour plot is not logarithmic on the x-axis.")
#ax.set_yscale("log") 

print('All tests passed successfully.')

The log-log plot conveniently lets you predict how your algorithm will perform as you move to larger and larger problems by looking at the plot trend.
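The reason this works is that a power law $t = c\,n^p$ becomes a straight line of slope $p$ on log-log axes.  The sketch below fits that slope and extrapolates from it.  It uses made-up timings that follow $t = c\,n^2$ exactly (they are not measurements), and the variable names are chosen for this example only:

```python
import numpy as np

# Synthetic timings that follow t = c * n**2 exactly (made-up data, not measurements).
n_values = np.array([10, 100, 1000, 10000], dtype=float)
t_values = 3e-7 * n_values**2

# Fit a straight line to log10(t) versus log10(n); the slope is the exponent p.
slope, intercept = np.polyfit(np.log10(n_values), np.log10(t_values), 1)
print(round(slope, 3))  # 2.0

# Extrapolate along the fitted line to a larger problem size.
t_predicted = 10**(intercept + slope * np.log10(100000))
print('%.1f s' % t_predicted)
```

With real timings the points will scatter around the line, especially at small $n$ where overhead dominates, so treat the fitted slope as an estimate.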

Such is the art of scaling.  You are equipped now to compare the behavior of various methods in your work, and to predict how that behavior will change as you move to bigger challenges and data sets.