# Class 9 - 6.5.18

# Testing and Test-Driven Development

## Introduction

We've already met one widely adopted programming technique which isn't used as much in the academy - object-oriented programming. Another such technique, ubiquitously used wherever code is written _outside_ the academy, is testing. In my opinion, this is a crucial topic in software development that isn't treated with the respect it deserves in the academy.

Tests are (usually) short pieces of code designed to assert that a small portion of your program does what you intend it to do. 

For example, suppose I have a class designed to perform calculations on some DataFrame that was created somewhere else, perhaps by adding some of its columns together, averaging them out and displaying the result. I want my code to be correct and deterministic, in the sense that a given input will always produce the same, correct output.

This isn't trivial even for the most basic functions in Python. Let's try to build a "wrapper" over `np.tile`:

In [1]:
# A reminder of np.tile()
import numpy as np


arr = np.array([1, 2, 3])
tiled = np.tile(arr, (2, 3))  # repeat it twice in the row dimension (axis=0), 
                              # and three times in the column dimension
tiled

array([[1, 2, 3, 1, 2, 3, 1, 2, 3],
       [1, 2, 3, 1, 2, 3, 1, 2, 3]])

In [2]:
# Our own implementation

class Tiler:
    """ Tile any number of objects on their first axis """
    def __init__(self, *args):
        self.to_tile = args

    def tile(self, reps=(1,)):
        self.tiled = np.tile(self.to_tile, reps=reps)
        return self.tiled

The above class can tile as many array inputs as you wish, and currently it's just a fancy wrapper over `np.tile()`. 


In [3]:
tiled = Tiler([1, 2, 3], [4, 5, 6]).tile()
print(tiled)

print('------')
tiled_again = Tiler([1, 2, 3], [4, 5, 6]).tile((2, 2))
print(tiled_again)

[[1 2 3]
 [4 5 6]]
------
[[1 2 3 1 2 3]
 [4 5 6 4 5 6]
 [1 2 3 1 2 3]
 [4 5 6 4 5 6]]


It's a very simple implementation, with the only "new" thing being the `*args` argument to the `__init__` function. `*args` means: pack into a tuple called `args` all the positional (non-keyword) arguments that the function receives beyond its named parameters. For example:

In [4]:
def f(a, b, *args):
    print(f"a is {a}")
    print(f"b is {b}")
    print(f"The rest of the data received is {args}")
    
f(1, 2, 3, 4, 'a')

a is 1
b is 2
The rest of the data received is (3, 4, 'a')


We use `*args` when we wish to write a function with a non-constant number of arguments. The "magic" is done by the star (`*`). The name `args` is just a convention, not a keyword. The star binds the extra positional inputs into a tuple. In our case, we wish the `Tiler` class to tile as many arrays as the user wishes. This is most easily done with the `*args` parameter. Like `*args`, Python also has `**kwargs`, which binds the keyword argument inputs into a dictionary.

So now that we understand the class completely - does it work?

It does work, but it's not _guaranteed_ to work. Some issues are clear and can be solved immediately, like some basic type-checking that we should implement:

In [5]:
class Tiler:
    """ 
    Tile any number of objects on their first axis.
    Returns: np.ndarray tiled across the first dimension.
    Errors: TypeError if supplied with invalid iterables
    """
    def __init__(self, *args):
        _to_tile = self._verify_inp(args)
        self.to_tile = _to_tile

    def tile(self, reps=(1,)):
        self.tiled = np.tile(self.to_tile, reps=reps)
        return self.tiled
    
    def _verify_inp(self, it):
        to_tile = []
        for item in it:
            if not isinstance(item, (list, np.ndarray)):
                raise TypeError(f'Item {item} not a valid iterable to tile.')
            to_tile.append(item)
        return to_tile

Now the class specifies its return type and what happens in case of an error. A good application will know to wrap the use of the class in a `try... except` block and catch any `TypeError`s:

In [6]:
# This works:
try:
    tiled = Tiler([1, 2, 3], [4, 5, 6]).tile()
except TypeError as e:
    print(f"TypeError: {e}")
print(tiled)

# This shouldn't
print("--------")
try:
    tiled_wrong = Tiler([1, 2, 3], {4, 5, 6}).tile()  # input is a set, not a list
except TypeError as e:
    print(f"TypeError: {e}")

[[1 2 3]
 [4 5 6]]
--------
TypeError: Item {4, 5, 6} not a valid iterable to tile.


Are we covered? Is this class no longer vulnerable to any type of non-legitimate input? Of course it is vulnerable, and some of us might already see a few corner cases:

In [7]:
try:
    tiled = Tiler([], np.array([1, 2, 3])).tile()
except TypeError as e:
    print(f"TypeError: {e}")

tiled  # what is that?

array([list([]), array([1, 2, 3])], dtype=object)

This array of lists and arrays is certainly not what we had in mind when tiling an empty list and a simple array. We can think about what exactly we expect this function to do, but this output will certainly not be on the list of possible outputs.
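One possible way to close this specific gap is to also check the length of each input. The sketch below is only an illustration: `verify_lengths` is a hypothetical helper that isn't part of the class above, and raising a `ValueError` for empty or unequal inputs is my assumption about the desired behavior.

```python
def verify_lengths(items):
    """ Hypothetical extra check: inputs must be non-empty and equally long """
    lengths = {len(item) for item in items}
    if 0 in lengths:
        raise ValueError("Can't tile an empty input.")
    if len(lengths) > 1:
        raise ValueError(f"Inputs have different lengths: {sorted(lengths)}.")
    return items


verify_lengths([[1, 2, 3], [4, 5, 6]])  # passes silently

try:
    verify_lengths([[], [1, 2, 3]])
except ValueError as e:
    print(f"ValueError: {e}")  # ValueError: Can't tile an empty input.
```

Even with such a check in place, we only covered the one corner case we happened to think of.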

I really don't have to convince you that software, even the most basic application, has bugs. But _thinking_ that we solved our bugs by the insertion of methods like `self._verify_inp` doesn't mean we actually solved them.

To this end we write tests for our program. You've already seen unit tests in your homework assignments. You were asked to make sure that your solution passes all tests before submitting. This exemplifies a way through which we can be more certain that our program does what we thought it was doing.

Tests are important to us for two reasons. The first one is that even simple programs are more complicated than we think. The `Tiler` class is a good example, but you might imagine that as soon as we add interfaces between classes, methods and functions, things might get a bit messy. For example, in the aviation industry, for each line of code in a piece of software, you may find about 8 lines dedicated to testing it.

Moreover, when we deal with user input - data files, parameters for a script - we should expect the unexpected, even if the main user is ourselves. Our future self in a few months will probably not remember the type of every parameter it has to enter.

The second reason is Python's dynamic nature, or _duck-typing._ If you want a function to accept only inputs of a single type, _you_ must be the one writing these assertions, either outside or inside the function. For example, a function that adds two numbers needs an `isinstance(value, (int, float))` somewhere near its top to avoid these mistakes. Statically-typed languages, like C, define a type for each variable. A function adding two integers simply cannot accept a non-integer input.

Python's dynamic nature is a blessing on many occasions, but it can sometimes be a real pain. This nature is the second important reason to write tests for our code. Many cases that in other programming languages would've resulted in a simple `TypeError` can cause major bugs in Python due to wrong input types.
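To make the point about explicit type checks concrete, here's a minimal sketch. Both function names are made up for this illustration. The first function relies on duck typing alone, while the second one has the `isinstance` check near its top.

```python
def add_anything(first, second):
    """ No checks - relies on duck typing """
    return first + second


def add_numbers(first, second):
    """ Adds two numbers, rejects any other input type """
    for value in (first, second):
        if not isinstance(value, (int, float)):
            raise TypeError(f"Expected a number, received {type(value).__name__}.")
    return first + second


print(add_anything("1", "2"))  # prints 12 - string concatenation, not addition

try:
    add_numbers("1", "2")
except TypeError as e:
    print(f"TypeError: {e}")  # TypeError: Expected a number, received str.
```

The unchecked version doesn't fail at all. It silently returns a wrong result, and that's the kind of bug which is hard to trace later.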

## Solution

### Test-Driven Development

Test-Driven Development, or TDD, is a very popular way to solve the problems we described. In TDD we reverse the process of writing code. We first write a test for the function we wish to write. It fails - because the function doesn't exist yet - and then we write the function until we're passing our test. We then add more tests to make sure that the function works properly.

Let's assume I wish to write a function that adds two positive integers. First I'll write a few basic tests for the function, and then I'll try to run them.

In [8]:
# Test functions always start with "test_..."
def test_basic_addition():
    first = 2
    second = 4
    result = 6
    return intadd(first, second) == result
    
def test_negative_inp1():
    try:
        result = intadd(-1, 1)
    except TypeError:
        return True
    else:
        return False

def test_negative_inp2():
    try:
        result = intadd(1, -1)
    except TypeError:
        return True
    else:
        return False

After the tests are written (and are failing, since the function `intadd` doesn't exist), we need to write the function so that it will comply with the tests we have.

In [9]:
def intadd(num1, num2):
    """ Non-negative integer addition """
    if (num1 < 0) or (num2 < 0):
        raise TypeError('Input must be positive.')
    return num1 + num2

In [10]:
print(test_basic_addition())
print(test_negative_inp1())
print(test_negative_inp2())

True
True
True


Our function works! See how each test deals with a unique sub-case of failure? It's much easier to debug and answer "what went wrong?" when your tests are so precise. A good test consists of a couple of lines of initialization, a couple of lines that call the tested function, and the `assert`ing line.

During the time it took me to write the implementation, I thought of a different edge case - floating point numbers input. Now I should write tests that will check this input type, they will fail - and then I can refactor my `intadd` function to deal with these cases.

In [11]:
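def test_float_inp():
    """ A shorter version - check both inputs in a single test function """
    for first, second in [(1., 2), (2, 1.)]:
        try:
            intadd(first, second)
        except TypeError:
            continue  # this input was rejected, as expected
        return False  # a float input was accepted without an exception
    return True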

In [12]:
print(test_float_inp())  # False!

False


In [13]:
def intadd(num1, num2):
    """ Non-negative integer addition """
    if (num1 < 0) or (num2 < 0):
        raise TypeError('Input must be positive.')
    if isinstance(num1, float) or isinstance(num2, float):
        raise TypeError('Input must be integer.')
    return num1 + num2

In [14]:
print(test_float_inp())  # True!

True


And the process continues - every time I think of more edge cases, I write failing tests and then refactor my own function.

Another important thing to note here is the way TDD forces me to "design the API" of my function. As you can see, when I wrote the test I had to think of the output I'll receive when I enter wrong inputs. In this case I thought it was most appropriate for the function to raise a `TypeError` - an exception that, if not caught, can terminate the execution of my application. This appeared to be the right choice here, but sometimes we can define a different result for faulty inputs. We could've returned `-1` if the input was invalid, for example.

With TDD, most people don't actually write the first tests before they write the function. TDD purists might insist on that, but for all intents and purposes, if you keep the development of the tests very closely tied to the development of the function itself - you're doing TDD.

### Actual usage - `pytest`

As you might have guessed, the Python ecosystem offers tools to automate the testing process. The standard library comes with a `unittest` module, and another famous one is `nosetests`. But the most popular (and advanced) library as of early 2018 is [`pytest`](https://docs.pytest.org/en/latest/), and that will be our library of choice.

In the `tests_demo` folder you can see how one structures the tests of a project.

After creating this file structure, you just call `pytest` in the command line after you `cd`ed into the folder containing the project - and all tests are run for you.

`pytest` has some advanced features. In the tests inside `tests_demo` you could already see the `parametrize` decorator, used to call the same test with several different inputs. `pytest` is smart enough to tell us which input out of the ones we entered caused the exception. In addition, you can also mark some tests as "expected to fail". Finally, it can also create automatic tests for you. [Here](https://docs.pytest.org/en/latest/parametrize.html#basic-pytest-generate-tests-example) you'll find the formal docs, and [here](https://hackebrot.github.io/pytest-tricks/create_tests_via_parametrization/) you can find a clear blog post explaining it.
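As a rough illustration of these features, here's how the `intadd` tests from the previous section might look in `pytest` style. This is a sketch and not the content of `tests_demo`: the test names and the chosen inputs are mine. It uses plain `assert` statements, `pytest.mark.parametrize`, `pytest.raises` and `pytest.mark.xfail`.

```python
import pytest


def intadd(num1, num2):
    """ Non-negative integer addition (copied from the TDD section) """
    if (num1 < 0) or (num2 < 0):
        raise TypeError('Input must be positive.')
    if isinstance(num1, float) or isinstance(num2, float):
        raise TypeError('Input must be integer.')
    return num1 + num2


@pytest.mark.parametrize("first, second, expected", [(2, 4, 6), (0, 0, 0), (10, 5, 15)])
def test_valid_inputs_are_added(first, second, expected):
    assert intadd(first, second) == expected


@pytest.mark.parametrize("first, second", [(-1, 1), (1, -1), (1., 2), (2, 1.)])
def test_invalid_inputs_raise_typeerror(first, second):
    # The test passes only if the block raises a TypeError
    with pytest.raises(TypeError):
        intadd(first, second)


@pytest.mark.xfail(reason="booleans are currently accepted as integers")
def test_boolean_inputs_raise_typeerror():
    with pytest.raises(TypeError):
        intadd(True, 1)
```

Each tuple in the `parametrize` list becomes a separate test case in the report, so a failure points at the exact input that caused it. The last test documents a known gap (`True` is treated as `1`) without failing the whole suite.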

#### A few extra points:

1. Tests should be as concise as possible. Their execution time should be minimal.
2. Run your test suite as often as possible. Minimal frequency is before each commit you make.
3. You should try hard to translate bugs into tests. These might be the most important tests you'll write.
4. Test names are long. It's fine - it's because we don't use docstrings for tests.
5. You can configure PyCharm to work with `pytest` in _File - Settings - Tools - Python Integrated Tools - Default test runner: py.test._

### Integration Testing

Unit tests repeatedly test the functions or methods you write under different inputs, and are the backbone of any reliable test suite. However, unit tests are not enough, since they don't check the interface between the different functions and classes in your application.

Integration tests are larger, heavier tests that take at least two components, or units, of your application, and make sure that they interact well with each other.

Obviously, if we start taking each two consecutive functions and write an integration test for that pair, and then continue with all three consecutive functions and write these integration tests, and so on - we'll never finish writing the damn application. That's why integration testing is used at crucial junctions of our application, between major classes for example.
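To illustrate the difference between the two kinds of tests, here's a small sketch with two made-up units, `parse_numbers` and `mean`. Neither belongs to any project mentioned here. The unit tests check each function on its own, while the integration test checks that the output of one is a valid input for the other.

```python
def parse_numbers(text):
    """ Unit 1 (hypothetical): turns '1, 2, 3' into [1.0, 2.0, 3.0] """
    return [float(item) for item in text.split(",")]


def mean(numbers):
    """ Unit 2 (hypothetical): arithmetic mean of a non-empty list """
    return sum(numbers) / len(numbers)


# Unit tests - each function on its own
def test_parse_numbers_handles_spaces():
    assert parse_numbers("1, 2,3") == [1.0, 2.0, 3.0]


def test_mean_of_three_numbers():
    assert mean([1.0, 2.0, 3.0]) == 2.0


# Integration test - the output of one unit is fed into the other
def test_mean_of_parsed_text():
    assert mean(parse_numbers("10, 20, 30")) == 20.0
```

If someone later changes `parse_numbers` to return strings, its own unit test would be updated along with it and `mean` would still pass its unit test, but the integration test would fail.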

## Exercise

Write a class that returns the difference between the Fibonacci series and the prime numbers series, up to some _n._ The output should be an array of numbers, n-items long, and a plot to accompany it. As an example, for $n=3$, I expect the array to be `np.array([-2, -2, -4])`. Write the class in a test-driven development style.

A reminder, the Fibonacci series is a series of numbers starting from (0, 1), with its next element being the sum of the previous two numbers: 0, 1, 1, 2, 3, 5, ... Prime numbers start from 2 and are only divisible by themselves and 1 without a remainder.

You decide on the class' interface and the details of implementation. You may use `numpy`, and I insist that you write at least 5 unit tests and 1 integration test for this class.

Don't try to implement things in a performant way, with fancy algorithms. The focus here is the unit tests and test-driven development.

### Exercise solutions are in the directory `fib_ser`

# Advanced and Performant Python

One of the best things about Python is how easy it is to get started with. The syntax is clear, it has all the basic features, things work as you expect them to, and life is generally pleasant. But Python also supports very advanced features, which make coding with Python an enjoyable experience even after you think you've learned everything there is to know of the language.

While technically you can write good Python code without using these features - it's sometimes a real shame not to use them.

## Generators

In a simple sense, perhaps simplistic, generators are iterators. Meaning, a generator is always an object you can iterate over. In Python you can iterate over most data structures, including dictionaries, lists, tuples and more - and so in this sense generators are similar. However, when we iterate over a list, for example, we're iterating over an existing data structure with existing items. Likewise for dictionaries - when we iterate over them, we're "handed" the dictionary's keys and values.

In [15]:
a_list = [0, 10, 20]
for item in a_list:
    print(item)

0
10
20


In [16]:
a_dict = dict(a=0, b=10, c= 20)
for key in a_dict:
    print(key, a_dict[key])

a 0
b 10
c 20


This is the first major difference between a generator and the other iterators. A generator is a _recipe_ to create the next item in the chain. A generator is a piece of code telling the Python interpreter how to create the next item, but it doesn't hold this item in memory yet. A simple example might be a list containing values from 0 to 1000. A generator of this list will not have 1000 cells with their values - it would have instructions on the number of cells, and how to calculate the next value.

We've already met a generator (well, kind of) - the `range()` function. When we tell Python to give us a range of numbers between 0 and 1000 by writing `range(1000)` - we're not actually generating the 1000 "cells" of values, but only the recipe. Let's see it in "action":

In [17]:
range(1000)  # a "range" object

range(0, 1000)

In [18]:
items = range(1000)
items

range(0, 1000)

In [19]:
import sys
sys.getsizeof(items)  # 48 bytes - not nearly enough to hold 1000 items

48

A simple 1000-element list isn't that heavy for a computer (but what about Arduinos?), but when lists get longer, with bigger arrays and massive data structures inside them, it's very inefficient to hold this amount of unused data in memory. 
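To see how the gap grows with the number of items, we can compare a `range` object with the list built from it. This is only a rough sketch: the exact numbers depend on your Python version and platform, and `sys.getsizeof` reports the size of the list's own storage, without the integer objects it points to.

```python
import sys

for n in (10, 1000, 100000):
    lazy = range(n)         # only the "recipe"
    eager = list(range(n))  # all the items, stored in memory
    print(n, sys.getsizeof(lazy), sys.getsizeof(eager))
```

On CPython, the size of the `range` object stays the same for every `n`, while the list keeps growing.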

Let's define our own generator:

In [20]:
def my_range(n):
    """ Yields the items from 0 to n - 1, one at a time """
    num = 0
    while num < n:
        yield num
        num += 1

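Calling a function that contains a `yield` statement doesn't run its body. Instead, it returns a generator object. This reserved keyword is what makes a function into a generator.

The body starts running only when Python's `next()` function is called on the generator. It runs until it reaches a `yield`, hands over the yielded value, and then holds, or "saves", its current state until the following `next()` call: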

In [21]:
new_range = my_range(3)

print(next(new_range))

print(next(new_range))

print(next(new_range))

print(next(new_range))

0
1
2


StopIteration: 

Each time `next()` is called, the function resumes from where it last stopped and keeps going until it reaches another `yield` statement, whose value `next()` returns. In the `my_range` function, while `num` is smaller than `n` the code will reach a `yield`. When we don't satisfy this condition anymore, the code skips the loop and reaches the end of the function. This results in a special `StopIteration` exception, used only in these special cases. This means you can catch this exception and know that your generator went through all of its items.

But calling `next()` multiple times isn't practical. Luckily, `for` loops implement this exact interface automatically, allowing us to use them instead of the tedious, repetitive `next()` calls:

In [22]:
looprange = my_range(10)

for item in looprange:
    print(item)

0
1
2
3
4
5
6
7
8
9


The `for` loop is also smart enough to catch the `StopIteration` exception and terminate the loop, without raising any "visible" exceptions. A `for` loop is the common way to iterate over generators.
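Roughly speaking, this is what a `for` loop does for us behind the scenes. The following sketch spells it out by hand with `iter()`, `next()` and a `try... except` block. It's a simplified picture of the iteration protocol, not the interpreter's actual implementation.

```python
def my_range(n):
    """ Yields the items from 0 to n - 1, one at a time """
    num = 0
    while num < n:
        yield num
        num += 1


collected = []
iterator = iter(my_range(3))  # for a generator, iter() returns the generator itself
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break  # the generator is exhausted, leave the loop quietly
    collected.append(item)

print(collected)  # [0, 1, 2]
```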

Generators don't allow much besides that. You can't print their contents directly, or index into them:

In [23]:
range2 = my_range(10)
print(range2)

<generator object my_range at 0x0000020902FDA518>


In [24]:
range2[3]

TypeError: 'generator' object is not subscriptable

Once used, generators are "depleted" - you can't reuse them. This is a major difference between a generator and a list, for example - you're not limited by the number of times you can iterate over a list.

In [25]:
for item in looprange:
    print(item)
# Doesn't return anything, because we already depleted looprange

In [26]:
next(looprange)  # immediately raises StopIteration

StopIteration: 
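If we do need to go over the items more than once, there are two simple options, sketched below: create a fresh generator for each pass, or store the items in a list once (assuming they fit in memory).

```python
def my_range(n):
    """ Yields the items from 0 to n - 1, one at a time """
    num = 0
    while num < n:
        yield num
        num += 1


gen = my_range(3)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to consume
print(first_pass, second_pass)  # [0, 1, 2] []

# Option 1: create a fresh generator for every pass
print(sum(my_range(3)), max(my_range(3)))  # 3 2

# Option 2: store the items once and reuse the list
stored = list(my_range(3))
print(sum(stored), max(stored))  # 3 2
```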

It's important to stress that all functions can become generators if they contain the `yield` statement:

In [27]:
def println():
    print("Hello, ")
    yield True
    print("World")
    yield False

In [28]:
gen2 = println()
a = next(gen2)

print(a)

Hello, 
True


In [29]:
b = next(gen2)
print(b)

World
False


Another way to create generators is "genexps", or generator expressions, which are very similar to list comprehensions:

In [30]:
nums = (2 * n for n in range(10))
nums

<generator object <genexpr> at 0x000002097BBA6620>

In [31]:
for num in nums:
    print(num)

0
2
4
6
8
10
12
14
16
18


The round brackets tell the interpreter that we're creating a generator here.

### Exercise
Write a generator function, or a piece of code including a generator, returning the `n` first Fibonacci numbers. The Fibonacci sequence starts with `1, 1` and the following item is always the addition of the last two items.

### Exercise solution below...

In [32]:
# Solution with a single function
def fib(n):
    """ Returns the first n Fibonacci numbers """
    a, b = 1, 1
    idx = 0
    while idx < n:
        yield a
        a, b = b, a + b
        idx += 1

ten = fib(10)
list(ten)

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

In [33]:
# Solution as a script
def fibn():
    """ Runs over all Fibonacci numbers """
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b
        
        
gen = fibn()
for i in range(10):
    print(next(gen))

1
1
2
3
5
8
13
21
34
55


The nice thing about this implementation is that it's infinite - it contains, and can generate all Fibonacci numbers. We could never do that with lists and regular functions.
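Since the generator never ends, we have to decide ourselves when to stop. Besides calling `next()` a fixed number of times, the standard library's `itertools` module has two helpers that fit this job, `islice` and `takewhile`. Here's a short sketch:

```python
import itertools


def fibn():
    """ Runs over all Fibonacci numbers """
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b


# islice takes a fixed number of items from the (infinite) generator
first_ten = list(itertools.islice(fibn(), 10))
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

# takewhile stops at the first item that fails the condition
below_100 = list(itertools.takewhile(lambda x: x < 100, fibn()))
print(below_100)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```

Note that calling `list(fibn())` directly would never finish.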

Generators are used widely in the Python universe. `pathlib` uses them everywhere, for example. For our use-cases, which involve "data-science" stuff, it's usually less important. However, in one of my projects I have a large 5D array which I need to create. This array can easily take up more than 1 GB of RAM when I'm dealing with longer experiments. That's why I decided to write a generator function that creates and populates this array, `yield`ing 4D substacks of it over time. It reduces memory usage by at least two orders of magnitude, and allows my code to run faster on home machines.
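The general idea can be sketched as follows. This is not the code from that project: the function name, the tiny array shape and the "computation" (filling each substack with its time index) are all placeholders I made up for the illustration.

```python
import numpy as np


def substacks(n_timepoints, shape=(2, 3, 4, 5)):
    """ Yields one 4D substack per time point instead of building the full 5D array """
    for t in range(n_timepoints):
        # Placeholder for the real computation of a single time point
        yield np.full(shape, t, dtype=np.float32)


running_sum = 0.0
for stack in substacks(10):
    # Only one 4D substack lives in memory at any given moment
    running_sum += float(stack.sum())

print(running_sum)  # 5400.0
```

The consumer processes each substack and moves on, so the full 5D array never has to exist in memory at once.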

## Decorators

Decorators are functions that receive functions as their arguments. When you wrap an existing function with another function - you've created a decorator. This feature is extensively used in web frameworks and in other important Python use cases, which is why it has a special syntax: `@decorator`. Let's look at an example:

Assume I have a large data-processing pipeline script, built out of many smaller functions, which unfortunately takes a long time to run. I wish to understand _why_ it's taking so long, so I decide to add a printed statement at the start and end of each function, so that I could see with my eyes where the code "hangs". This is how I implemented it:

In [34]:
def main_pipeline(fname):
    data = load_data(fname)
    processed = process_data(data)
    appended = append_data(processed)
    logged = log_data(appended)

def load_data(fname):
    print("Starting 'load_data'...")
    # ... Code ...
    print("Ending 'load_data'...")

def process_data(data):
    print("Starting 'process_data'...")
    # ... Code ...
    print("Ending 'process_data'...")
    
# And so on...

This is obviously very tedious. Even when I only have four functions, it's very repetitive and feels wrong. Moreover, it might not have solved my issue. My examination showed that all four functions take a considerable time to run, so I decide to profile the execution time of each function, to better understand which function is the most costly and optimize it first.

Here's how I redefined all functions to measure their execution time:

In [35]:
import time

def load_data(fname):
    print("Starting 'load_data'...")
    start_time = time.time()
    # ... Code ...
    print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
    print("Ending 'load_data'...")

def process_data(data):
    print("Starting 'process_data'...")
    start_time = time.time()
    # ... Code ...
    print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
    print("Ending 'process_data'...")
    
# And so on...

This works, but again - it's very repetitive. Also, if I decide that I want to see the execution time in seconds, and not milliseconds, I have to go through each function and re-implement it. Very tedious. 

The solution is to _decorate_ the function with `printer` and `timer` functions, which do exactly this job:

In [36]:
def printer(func):
    def inner_func(argument):
        print(f"Starting {func.__name__}...")
        result = func(argument)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func      


def timer(func):
    def inner_func(argument):
        start_time = time.time()
        result = func(argument)
        print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
        return result
    return inner_func

This looks complex at first, but it's really pretty simple. It uses the fact that functions in Python are objects, like any other element in the language. And because they're objects, they can be passed around as arguments:

In [37]:
def f(func):
    """ Runs the func function and prints 'hi' at the end """
    func()
    print("hi")
    
def print_hello():
    print("hello")
    

f(print_hello)

hello
hi


Like all objects, functions have attributes. Namely, they have the `__name__` attribute which contains... their name.

In [38]:
print(f.__name__)
print(print_hello.__name__)

f
print_hello


Now we know we can pass functions as arguments to other functions. Let's try to examine the `printer` and `timer` functions again.

Each of them is a function that receives a different, "unknown" function as its argument. So far - so good. Then it defines another function which "wraps" the original function with some actions, like printing or timing. This inner function runs the original function and returns the result. In essence, it creates a "new implementation" of that original function that does the exact same thing, but with the wrapping functionality (printing, timing, etc.). This new function (`inner_func`) can replace any instance of the original function without any trouble, since in essence it just calls it. It adds a couple of statements before and after that call, but the essential functionality remains unchanged.

Lastly, the outer function, which we call the decorator, returns the inner function as its return value. So this function receives a function as its argument and returns a new, improved function as its output. To use it, we just "rename" the existing functions:

In [39]:
load_data_printer = printer(load_data)
load_data_timed = timer(load_data)

process_data_printer = printer(process_data)
process_data_timed = timer(process_data)

We can obviously use this `timer` function on any function we wish to time. When we wish to time functions in seconds, rather than milliseconds, we'll just change this one instance of `timer` and be done with it, and likewise for `printer`.

The only small caveat here is the fact that we currently require the function we're replacing to have a single `argument` as its argument. This implementation detail is small but very impactful - it means that our decorator can only decorate functions that take a single argument. To remedy this we'll have to use `*args` and `**kwargs`:

### Detour - `*args`, `**kwargs`

You use `*args` and `**kwargs` when you're not sure how many arguments are used for a function. Actually, the syntax is only `*` and `**` - the words `args` and `kwargs` are used by convention. `args` is obviously arguments, or unnamed arguments given to a function. `kwargs` is keyword arguments, or arguments given as `key=value`. Let's see a simple example:

In [40]:
def f(required_argument, *args, **kwargs):
    print(required_argument)
    if args:
        print(args)
    
    if kwargs:
        for key, value in kwargs.items():
            print(key, value)

In [41]:
f()  # doesn't work - we have one required argument to the function

TypeError: f() missing 1 required positional argument: 'required_argument'

In [42]:
f('required')

required


In [43]:
f('required', 1, 2, 3)  # the second printed row is the args

required
(1, 2, 3)


In [44]:
f('required', 1, 2, 3, kw1='a', kw2='b')

required
(1, 2, 3)
kw1 a
kw2 b


We see that `args` is a tuple containing all unnamed parameters that were given to the function, in the order they were given.

`kwargs` is a dictionary, its keys being the keywords, and values - the given values. Here's another short example:

In [45]:
def f(a=1, b=2):
    print(a, b)

inputs = {'a': 10, 'b': 20}
f(**inputs)

10 20


What we see here is that the function's signature doesn't have to contain `*args` or `**kwargs`. The `**` operator "opens up" the input dictionary, allowing the `f()` function to use the parameters without any issues. This is how we're going to use it for our decorators - we'll redefine them as follows:

In [46]:
def printer(func):
    def inner_func(*args, **kwargs):
        print(f"Starting {func.__name__}...")
        result = func(*args, **kwargs)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func      


def timer(func):
    def inner_func(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
        return result
    return inner_func

We can now be sure that our functions will always run regardless of the number of inputs given to them. This makes us happy - but not completely happy. We still have to redefine all functions as we've seen before:

In [47]:
load_data_printer = printer(load_data)
load_data_timed = timer(load_data)

process_data_printer = printer(process_data)
process_data_timed = timer(process_data)

It still requires us to rename all instances of these functions in all places of the code, and when we're done with the printing and timing - we have to rename them back.

Why not "rename" the function back to its original name?

In [48]:
load_data = printer(load_data)
process_data = timer(process_data)

This idiom is common enough to have a built-in language syntax:

In [49]:
@timer
def load_data(fname):
    # ... Code ...
    pass

We can use multiple decorators for functions as well:

In [50]:
@printer
@timer
def process_data(data):
    # ... Code ...
    pass
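When decorators are stacked, the one closest to the function is applied first, so the cell above is equivalent to `process_data = printer(timer(process_data))`. The following sketch demonstrates the resulting order of execution with two made-up decorators, `outer` and `inner`, which record their actions in a list:

```python
calls = []


def outer(func):
    def inner_func(*args, **kwargs):
        calls.append("outer: before")
        result = func(*args, **kwargs)
        calls.append("outer: after")
        return result
    return inner_func


def inner(func):
    def inner_func(*args, **kwargs):
        calls.append("inner: before")
        result = func(*args, **kwargs)
        calls.append("inner: after")
        return result
    return inner_func


@outer
@inner
def work():
    calls.append("work")


work()
print(calls)
# ['outer: before', 'inner: before', 'work', 'inner: after', 'outer: after']
```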

When we wish to stop printing and timing our function, we simply delete this decorator in the relevant places.

Decorators allow more complex calls, like calling them with arguments, but we'll leave that topic for another day.
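One detail that's worth knowing even at this stage: after decoration, the name `load_data` points to `inner_func`, so attributes like `__name__` and the docstring are those of the wrapper. The standard library's `functools.wraps` is the common remedy. The sketch below compares the two variants; `printer_wrapped` is a name I made up for the comparison.

```python
import functools


def printer(func):
    def inner_func(*args, **kwargs):
        print(f"Starting {func.__name__}...")
        result = func(*args, **kwargs)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func


def printer_wrapped(func):
    @functools.wraps(func)  # copies __name__, __doc__ and more from func
    def inner_func(*args, **kwargs):
        print(f"Starting {func.__name__}...")
        result = func(*args, **kwargs)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func


@printer
def load_data(fname):
    """ Loads the data """


@printer_wrapped
def process_data(data):
    """ Processes the data """


print(load_data.__name__)     # inner_func
print(process_data.__name__)  # process_data
```

This matters as soon as decorators are stacked. A `printer` placed on top of a plain `timer` would otherwise print "Starting inner_func...".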

## `collections`, `itertools`

The Python standard library comes with many second-order tools that can make your life much easier. Many of the more useful ones are located in these two libraries - `collections` and `itertools`. Below I'll provide a brief tour of some of the more interesting features of these libraries.

### namedtuple

When you want a tiny object with named fields, but without the hassle of creating a fully-fledged class, you actually wish to generate a namedtuple:

In [51]:
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p1 = Point(0, 0)
p2 = Point(x=0, y=1)
p3 = Point(1, y=2)

print(p2)
print(p3.y)
print(p1[0])

Point(x=0, y=1)
2
0


You can access the data inside a `namedtuple` using either the positional index (`[0]`) or the name of that field (`x`). If all you wish to do is to keep a small record of something, `namedtuple` is your best option.
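A few more properties that make a `namedtuple` a convenient record are shown in the short sketch below: it unpacks like a regular tuple, it converts to a dictionary with `_asdict()`, and it's immutable, so "changing" a field means creating a new object with `_replace()`.

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)

x, y = p                  # unpacks like a regular tuple
print(x + y)              # 3
print(p._asdict())        # a mapping of the field names to their values
moved = p._replace(x=10)  # returns a new Point, p itself is unchanged
print(moved)              # Point(x=10, y=2)

try:
    p.x = 5               # fields can't be reassigned
except AttributeError as e:
    print(f"AttributeError: {e}")
```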

### defaultdict

A `defaultdict` is a dictionary that falls back to calling a predefined function if it doesn't find the key. For example:

In [52]:
d = dict(one=1, two=2)
print(d['one'])
print(d['three'])

1


KeyError: 'three'

Rather than raising a `KeyError`, a `defaultdict` runs a predefined function and stores its return value under the missing key:

In [53]:
from collections import defaultdict

d2 = defaultdict(list, one=1, two=2)
print(d2['one'])

1


However, when we call it with an unknown key:

In [54]:
d2['three']
d2

defaultdict(list, {'one': 1, 'three': [], 'two': 2})

It used the `list` "factory" to create a new list in that key. This is useful when grouping key-value pairs by their key:

In [55]:
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d2 = defaultdict(list)
for k, v in s:
    d2[k].append(v)

d2

defaultdict(list, {'blue': [2, 4], 'red': [1], 'yellow': [1, 3]})
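Any callable that takes no arguments can serve as the factory. Another common choice is `int`, which returns `0` and turns the `defaultdict` into a simple counter. The sketch below also shows that a membership test (`in`) doesn't create a key, while a plain lookup does.

```python
from collections import defaultdict

counts = defaultdict(int)  # int() returns 0
for letter in 'mississippi':
    counts[letter] += 1
print(dict(counts))   # {'m': 1, 'i': 4, 's': 4, 'p': 2}

print('z' in counts)  # False - membership tests don't create keys
counts['z']           # a plain lookup does
print('z' in counts)  # True
```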

### Chained iterables

If we wish to iterate over several iterables together, we can use the following method from the `itertools` module:

In [56]:
import itertools

chained = itertools.chain('abcd', 'efg')
for letter in chained:
    print(letter)

print('-----')
# Naive iteration over [[1, 2, 3, 4], [5, 6, 7, 8]] would result in two items - 
# two lists with four elements each:
for item in [[1, 2, 3, 4], [5, 6, 7, 8]]:
    print(item)

a
b
c
d
e
f
g
-----
[1, 2, 3, 4]
[5, 6, 7, 8]


In [57]:
# We wish to iterate over the numbers themselves
chained2 = itertools.chain.from_iterable([[1, 2, 3, 4], [5, 6, 7, 8]])
for number in chained2:
    print(number)

1
2
3
4
5
6
7
8


Note that the functions in `itertools` return lazy iterators: they don't build a list in memory, but produce their items one at a time, when asked for them.
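One practical consequence is that the objects `itertools` returns can be consumed only once, and that they can even describe infinite sequences. A minimal sketch of both points, with arbitrary example values:

```python
import itertools

chained = itertools.chain([1, 2, 3], [4, 5])

first_pass = list(chained)   # consumes the iterator
second_pass = list(chained)  # nothing is left to yield

print(first_pass)
print(second_pass)

# Laziness also allows working with infinite iterators:
# count() never ends, and islice() takes only its first five items
evens = itertools.islice(itertools.count(start=0, step=2), 5)
first_evens = list(evens)
print(first_evens)
```

If you need to go over the items more than once, convert the iterator to a list first, as done here with `list()`.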

### Permutations

In [58]:
list(itertools.permutations('ABCD', 2))

[('A', 'B'),
 ('A', 'C'),
 ('A', 'D'),
 ('B', 'A'),
 ('B', 'C'),
 ('B', 'D'),
 ('C', 'A'),
 ('C', 'B'),
 ('C', 'D'),
 ('D', 'A'),
 ('D', 'B'),
 ('D', 'C')]

### Combinations

In [59]:
list(itertools.combinations('ABCD', 2))

[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
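The difference between the two outputs above is whether order matters: `('A', 'B')` and `('B', 'A')` are two permutations but a single combination. The sketch below checks the number of results against the textbook formulas, using `math.perm` and `math.comb` (available since Python 3.8).

```python
import itertools
import math

items = 'ABCD'
r = 2

perms = list(itertools.permutations(items, r))
combs = list(itertools.combinations(items, r))

# Order matters for permutations: n! / (n - r)! results
print(len(perms), math.perm(len(items), r))

# Order doesn't matter for combinations: n! / (r! * (n - r)!) results
print(len(combs), math.comb(len(items), r))

# Ignoring the order of each permutation leaves exactly the combinations
unique = {tuple(sorted(p)) for p in perms}
print(unique == set(combs))
```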

## Multiprocessing

There are several ways to utilize parallel processing in Python. The easiest of all is multiprocessing, i.e. the use of several CPU cores to run jobs in parallel. This works best when the processes are independent of each other and don't have to share data.

A typical use case is when we have a list or an `np.array` holding data, and we wish to perform the same computation on each element of that list. If this computation is truly independent, the `multiprocessing` module has some very easy-to-use solutions.

```python
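import multiprocessing


def add_tuple(tup):
    """ Add the two elements of a tuple """
    return tup[0] + tup[1]


if __name__ == '__main__':
    # The guard keeps the worker processes from re-running this part
    # when they import the script (needed with the "spawn" start method)
    tups = [(0, 1), (2, 3), (4, 5), (6, 7)]
    # Pool() uses all available cores by default. Pass a number to change that
    with multiprocessing.Pool() as pool:
        result = pool.map(add_tuple, tups)
    print(result)  # [1, 5, 9, 13]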
```

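The code above often fails in IPython and Jupyter: the worker processes must be able to import the function they're asked to run, and a function that was defined in an interactive session can't always be imported by them (this is the case on Windows, for example). Luckily, the `ipyparallel` library offers similar functionality and is designed to work from within notebooks.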

The Python script `multiprocess.py` located in this `Classes` folder contains a working copy of this script.

Threading is Python's weak point because of the GIL (Global Interpreter Lock), which lets only one thread execute Python code at any given time, and we'll not discuss it in this class. Another approach to concurrency is asynchronous programming, which we'll also not cover, but is actually one of Python's strongest points.

## Numba

`numba` is a library designed to speed up numerical computations in Python. In many cases its use cases overlap with those of `numpy`, but it might be simpler for people without previous experience with arrays, since it lets you write plain loops. We'll jump right into an example and then discuss some of the magic:

In [60]:
from numba import jit
import numpy as np


@jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [61]:
arr = np.ones((10000, 10000))

%timeit sum2d(arr)
%timeit arr.sum()

140 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
145 ms ± 6.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The results above are, for lack of a better word, amazing. NumPy has been optimized for years and its core is written in C, yet `numba` matches it here (the two timings are within each other's error margins), seemingly just by decorating a simple, perhaps _simplistic_, Python loop.

This magic is powered by LLVM, an open-source compiler infrastructure that generates optimized machine code for many programming languages. `numba` infers the types of the function's arguments, translates the Python code to LLVM's intermediate representation and lets LLVM optimize it. The output is machine code that runs directly on the processor, bypassing the Python interpreter, which is why it can be so much faster than an ordinary Python loop.
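When no types are specified in the decorator, the compilation happens the first time the function is called, so the first call is slower than the following ones. The sketch below illustrates this with a hypothetical `sum_of_squares` function and a small `timed_call` helper, neither of which is part of `numba`. It assumes `numba` is installed. The `ImportError` fallback is only there so that the example still runs (as ordinary Python, without any speed-up) when it isn't.

```python
import time

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Fallback (an assumption, for portability): run as plain Python
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_call(n):
    """ Return the result and the wall-clock time of a single call """
    start = time.perf_counter()
    result = sum_of_squares(n)
    return result, time.perf_counter() - start


# With numba, the first call also compiles the function for integer input
first_result, first_time = timed_call(1_000_000)
# The second call reuses the compiled machine code
second_result, second_time = timed_call(1_000_000)

print(f'first call:  {first_time:.4f} s')
print(f'second call: {second_time:.4f} s')
```

This is also why `%timeit`, which runs the function many times, reports timings that are barely affected by the compilation step.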

Numba has more tricks up its sleeve. You can declare the input and output types in advance to squeeze out a bit more performance:

In [62]:
from numba import jit, float64
import numpy as np


@jit(float64(float64[:, :]))
def sum2d_inps(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [63]:
%timeit sum2d_inps(arr)

130 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can also use parallel looping:

In [64]:
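from numba import jit, prange, float64
import numpy as np


@jit([float64(float64[:, :])], nopython=True, parallel=True)
def sum2d_p(arr):
    M, N = arr.shape
    result = np.float64(0.0)
    # prange splits the outer loop between threads. numba parallelizes only
    # the outermost prange, so the inner loop is an ordinary range
    for i in prange(M):
        for j in range(N):
            result += arr[i,j]
    return result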

In [65]:
%timeit sum2d_p(arr)  # pretty cool

49.9 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


When every bit of performance matters, `numba` might be the way to go. For very complicated functions that use fancy linear algebra algorithms, it might be the case that `numba` doesn't support these methods yet. In these cases resort to basic `numpy` functions and wait till the `numba` developers implement that method - or do so yourself! `numba` is completely open source.

## Cython

When you wish to write performant code that relies heavily on the standard library, neither `numpy` nor `numba` will help you much. They require that you work with arrays, which are not as easy to work with as lists, for example. Dictionaries are also very helpful, but using them with the standard Python interpreter alone will hinder your performance considerably.

These are the cases where Cython shines. It allows you to write code with Python-like syntax and compile it ahead of time: Cython automatically translates your file into a `C` source file (e.g. `myfile.c`), which is then compiled into an extension module that Python can import. When your code calls a function that was written in Cython, it actually runs the optimized, compiled `C` function.

As stated, Cython requires you to compile your code before running the parent Python script. To do that, you have to create a `setup.py` file that tells the Cython compiler where to find the files in question.

A Cython source file has the `.pyx` extension, so `setup.py` should point to such a file. Here's a basic example of `setup.py`:

```python
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    ext_modules = cythonize('my_file.pyx'),
    # other setup.py options come here
)
```

Then navigate with your command line to the folder containing `setup.py` and run `python setup.py build_ext --inplace`. This tells Cython to "build", i.e. compile, the code in the `.pyx` file, and `--inplace` places the compiled module in this same directory, next to the source file.

An example can be found in the `cython_demo` folder. Let's see it here in action:

In [66]:
from cython_demo import plain_python, primes_cython

In [67]:
plain_python.primes_python(20)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

In [68]:
primes_cython.primes(20)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

In [69]:
%timeit plain_python.primes_python(1000)
%timeit primes_cython.primes(1000)

46.5 ms ± 3.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.95 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
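The source code of `cython_demo` isn't reproduced here. To give a rough idea of what is being timed, the sketch below is a hypothetical stand-in for the plain Python version, assuming (as the outputs above suggest) that `primes_python(n)` returns the first `n` prime numbers. The actual implementation in the demo folder may differ.

```python
def primes_python(n_primes):
    """ Return a list with the first `n_primes` prime numbers """
    primes = []
    candidate = 2
    while len(primes) < n_primes:
        # The candidate is prime if no smaller prime divides it
        for p in primes:
            if candidate % p == 0:
                break
        else:
            primes.append(candidate)
        candidate += 1
    return primes


print(primes_python(10))
```

A function like this is dominated by integer arithmetic inside nested loops. In a `.pyx` file, Cython lets you declare such variables with C types (using `cdef`), so the loops can be compiled to plain `C` operations instead of going through Python objects.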


In [70]:
rands = np.random.random((1000000))

In [71]:
%timeit rands[rands < 0.5]

13 ms ± 793 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [72]:
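@jit([float64[::1](float64[::1])], nopython=True, parallel=True)
def filter_larger(rands):
    """ Keep the values that are smaller than the threshold """
    arr = np.zeros_like(rands)
    thresh = 0.5
    last_idx = 0
    for idx in prange(len(rands)):
        if rands[idx] < thresh:
            arr[last_idx] = rands[idx]
            last_idx += 1
            
    return arr[:last_idx]

# Caution: last_idx is shared between the iterations of the parallel loop,
# so the iterations aren't independent. This probably hinders performance, and
# the content of the output isn't guaranteed to be correct.
# Replace prange with range for a safe, serial loop.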

In [73]:
%timeit filter_larger(rands)

12 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [74]:
from cython_filter_demo import filter_array

In [75]:
%timeit filter_array.filter_larger_cython(rands)

17.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [76]:
%prun filter_array.filter_larger_cython(rands)

 