# Class 10 - 27.5.19

# Advanced and Performant Python

One of the best things about Python is how easy it is to get started with. The syntax is clear, it has all the basic features, things work as you expect them to, and life is generally pleasant. But Python also supports very advanced features, which make coding with Python an enjoyable experience even after you think you've learned everything there is to know about the language.

While technically you can write good Python code without using these features - it's sometimes a real shame not to use them.

## Generators

In a simple sense, perhaps simplistic, generators are iterators. Meaning, a generator is always an object you can iterate over. In Python you can iterate over most data structures, including dictionaries, lists, tuples and more - and so in this sense generators are similar. However, when we iterate over a list, for example, we're iterating over an existing data structure with existing items. Likewise for dictionaries - when we iterate over them, Python "hands over" the dictionary's existing keys (and, through them, its values).

In [1]:
a_list = [0, 10, 20]
for item in a_list:
    print(item)

0
10
20


In [2]:
a_dict = dict(a=0, b=10, c=20)
for key in a_dict:
    print(key, a_dict[key])

a 0
b 10
c 20


This is the first major difference between a generator and the other iterators. A generator is a _recipe_ to create the next item in the chain. A generator is a piece of code telling the Python interpreter how to create the next item, but it doesn't hold this item in memory yet. A simple example might be a list containing values from 0 to 1000. A generator of this list will not have 1000 cells with their values - it would have instructions on the number of cells, and how to calculate the next value.

We've already met a generator (well, kind of) - the `range()` function. When we tell Python to give us a range of numbers between 0 and 1000 by writing `range(1000)` - we're not actually generating the 1000 "cells" of values, but only the recipe. Let's see it in "action":

In [3]:
range(1000)  # a "range" object

range(0, 1000)

In [4]:
items = range(1000)
items

range(0, 1000)

In [5]:
import sys
sys.getsizeof(items)  # 48 bytes - not nearly enough to hold 1000 items

48

A simple 1000-element list isn't that heavy for a computer (but what about Arduinos?), but when lists get longer, with bigger arrays and massive data structures inside them, it's very inefficient to hold this amount of unused data in memory. 
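To make the memory argument concrete, here is a small sketch comparing a fully built list with the `range` object describing the same values. The exact byte counts depend on the Python version and platform, so none are quoted here, and the variable names are only for illustration:

```python
import sys

n_items = 1_000_000

as_list = list(range(n_items))  # every item exists in memory
as_range = range(n_items)       # only the "recipe" is stored

# Note: getsizeof() on a list counts the list's own storage,
# not the integer objects it points to, so the real cost is even higher.
list_size = sys.getsizeof(as_list)
range_size = sys.getsizeof(as_range)

# A range object has the same size no matter how many items it describes
small_range_size = sys.getsizeof(range(10))
huge_range_size = sys.getsizeof(range(10**9))

print(f"list:  {list_size} bytes")
print(f"range: {range_size} bytes")
print(small_range_size == huge_range_size)
```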

Let's define our own generator:

In [6]:
def my_range(n):
    """ Yields the integers from 0 up to (but not including) n """
    num = 0
    while num < n:
        yield num
        num += 1

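Calling `my_range(3)` doesn't run any of the function's code yet; it only creates a generator object. The reserved keyword `yield` is what turns a function into a generator function.

The code starts running on the first call to Python's `next()` function and continues until it reaches a `yield`. There it hands back the yielded value and pauses, "saving" its current state until the following `next()` call: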

In [7]:
new_range = my_range(3)

print(next(new_range))

print(next(new_range))

print(next(new_range))

print(next(new_range))

0
1
2


StopIteration: 

Each time `next()` is called, the function resumes from where it paused and keeps going until it reaches the next `yield` statement. In the `my_range` function, as long as `num` is smaller than `n` the code will reach a `yield`. When this condition is no longer satisfied, the code leaves the loop and reaches the end of the function. This raises a special `StopIteration` exception, which signals that the iterator has no more items. This means you can catch this exception and know that your generator went through all of its items.

But calling `next()` multiple times isn't practical. Luckily, `for` loops implement this exact interface automatically, allowing us to use them instead of the tedious, repetitive `next()` calls:

In [19]:
looprange = my_range(10)

for item in looprange:
    print(item)

0
1
2
3
4
5
6
7
8
9


The `for` loop is also smart enough to catch the `StopIteration` exception and terminate the loop, without raising any "visible" exceptions. A `for` loop is the common way to iterate over generators.
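What the `for` loop does behind the scenes can be spelled out by hand. The following is a rough sketch of the equivalent `while` loop, not the interpreter's actual implementation; `collected` is just an illustrative name:

```python
def my_range(n):
    num = 0
    while num < n:
        yield num
        num += 1


collected = []
iterator = iter(my_range(3))  # a generator is its own iterator
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break  # the generator is exhausted, so leave the loop quietly
    collected.append(item)  # this is the "body" of the for loop

print(collected)  # [0, 1, 2]
```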

Generators don't allow much besides iteration. You can't print their contents directly, or index into them:

In [20]:
range2 = my_range(10)
print(range2)

<generator object my_range at 0x7f0dc84d0840>


In [21]:
range2[3]

TypeError: 'generator' object is not subscriptable

Once used, generators are "depleted" and you can't reuse them. This is a major difference between a generator and a list, for example - there's no limit on the number of times you can iterate over a list.

In [78]:
for item in looprange:
    print(item)
# Prints nothing, because we already depleted looprange

In [79]:
next(looprange)  # immediately raises StopIteration

StopIteration: 
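If the items are needed again, the usual options are to call the generator function once more, or to store the items in a list the first time around. A minimal sketch, with illustrative variable names:

```python
def my_range(n):
    num = 0
    while num < n:
        yield num
        num += 1


gen = my_range(3)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to consume

fresh = list(my_range(3))  # calling the function again gives a new generator

stored = [0, 1, 2]
list_twice = [list(stored), list(stored)]  # a list can be iterated repeatedly

print(first_pass)   # [0, 1, 2]
print(second_pass)  # []
print(fresh)        # [0, 1, 2]
```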

It's important to stress that any function becomes a generator function if it contains the `yield` statement:

In [22]:
def println():
    print("Hello, ")
    yield True
    print("World")
    yield False

In [23]:
gen2 = println()
a = next(gen2)

print(a)

Hello, 
True


In [24]:
b = next(gen2)
print(b)

World
False


Another way to create generators is "genexps", or generator expressions, which are very similar to list comprehensions:

In [25]:
nums = (
    2 * n 
    for n in range(10)
    )
nums

<generator object <genexpr> at 0x7f0da4118d68>

In [26]:
for num in nums:
    print(num)

0
2
4
6
8
10
12
14
16
18


The round brackets tell the interpreter that we're creating a generator here.
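For comparison, here is a short sketch of the same expression written once with square brackets and once with round ones. When a generator expression is the only argument of a function call, the extra pair of brackets can be dropped:

```python
squares_list = [n * n for n in range(5)]  # list comprehension: built immediately
squares_gen = (n * n for n in range(5))   # generator expression: nothing computed yet

is_list = isinstance(squares_list, list)
gen_has_no_len = not hasattr(squares_gen, "__len__")

# The generator expression is consumed item by item, without an intermediate list
total = sum(n * n for n in range(5))

print(squares_list)  # [0, 1, 4, 9, 16]
print(total)         # 30
```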

### Exercise
Write a generator function, or a piece of code including a generator, returning the first `n` Fibonacci numbers. The Fibonacci sequence starts with `0, 1` and each following item is always the sum of the last two items.

### Exercise solution below...

In [27]:
# Solution with a single function
def fib(n):
    """ Yields the first n Fibonacci numbers """
    a, b = 0, 1
    idx = 0
    while idx < n:
        yield a
        a, b = b, a + b
        idx += 1

ten = fib(10)
list(ten)

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

In [28]:
# Solution as a script
def fibn():
    """ Yields all Fibonacci numbers, one at a time, without end """
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b
        
        
gen = fibn()
for i in range(10):
    print(next(gen))

0
1
1
2
3
5
8
13
21
34


The nice thing about this implementation is that it's infinite - it contains, and can generate all Fibonacci numbers. We could never do that with lists and regular functions.
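An infinite generator can't be passed to `list()` directly, since that would never finish. One way to take a finite piece of it is the standard library's `itertools.islice` and `itertools.takewhile`; the sketch below is just one possible approach:

```python
import itertools


def fibn():
    """ Yields all Fibonacci numbers, one at a time, without end """
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Take the first ten items, then stop asking for more
first_ten = list(itertools.islice(fibn(), 10))

# Take items for as long as the condition holds
below_100 = list(itertools.takewhile(lambda x: x < 100, fibn()))

print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(below_100)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```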

Generators are used widely in the Python universe. `pathlib` uses them everywhere, for example. For our use-cases, which involve "data-science" stuff, it's usually less important. However, in one of my projects I have a large 5D array which I need to create. This array can easily weigh more than 1 GB RAM when I'm dealing with longer experiments. That's why I decided to write a generator function that creates and populates this array, `yield`ing 4D substacks of it over time. It reduces memory usage by at least two orders of magnitude, and allows my code to run faster on home machines.

## Decorators

Decorators are functions that receive a function as an argument and return a new function that wraps it. When you wrap an existing function with another function - you've created a decorator. This feature is used extensively in web frameworks and in other important Python use cases, which is why it has a special syntax: `@decorator`. Let's look at an example:

Assume I have a large data-processing pipeline script, built out of many smaller functions, which unfortunately takes a long time to run. I wish to understand _why_ it's taking so long, so I decide to add a printed statement at the start and end of each function, so that I could see with my eyes where the code "hangs". This is how I implemented it:

In [29]:
def main_pipeline(fname):
    data = load_data(fname)
    processed = process_data(data)
    appended = append_data(processed)
    logged = log_data(appended)

def load_data(fname):
    print("Starting 'load_data'...")
    # ... Code ...
    print("Ending 'load_data'...")

def process_data(data):
    print("Starting 'process_data'...")
    # ... Code ...
    print("Ending 'process_data'...")
    
# And so on...

This is obviously very tedious. Even when I only have four functions, it's very repetitive and feels wrong. Moreover, it might not have solved my issue. My examination showed that all four functions take a considerable time to run, so I decide to profile the execution time of each function, to better understand which function is the most costly and optimize it first.

Here's how I redefined all functions to measure their execution time:

In [30]:
import time

def load_data(fname):
    print("Starting 'load_data'...")
    start_time = time.time()
    # ... Code ...
    # time.time() returns seconds, so convert to milliseconds
    print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
    print("Ending 'load_data'...")

def process_data(data):
    print("Starting 'process_data'...")
    start_time = time.time()
    # ... Code ...
    # time.time() returns seconds, so convert to milliseconds
    print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
    print("Ending 'process_data'...")
    
# And so on...

This works, but again - it's very repetitive. Also, if I decide that I want to see the execution time in seconds, and not milliseconds, I have to go through each function and re-implement it. Very tedious. 

The solution is to _decorate_ the functions with `printer` and `timer` functions, which do exactly this job:

In [31]:
def printer(func):
    def inner_func(a):
        print(f"Starting {func.__name__}...")
        result = func(a)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func      


def timer(func):
    def inner_func(argument):
        start_time = time.time()
        result = func(argument)
        # time.time() returns seconds, so convert to milliseconds
        print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
        return result
    return inner_func

This looks complex at first, but it's really pretty simple. It uses the fact that functions in Python are objects, like any other element in the language. And because they're objects, they can be passed around as arguments:

In [32]:
def f(func):
    """ Runs the func function and prints 'hi' at the end """
    func()
    print("hi")
    
def print_hello():
    print("hello")
    

f(print_hello)

hello
hi


Like all objects, functions have attributes. Namely, they have the `__name__` attribute which contains... their name.

In [33]:
print(f.__name__)
print(print_hello.__name__)

f
print_hello


Now we know we can pass functions as arguments to other functions. Let's try to examine the `printer` and `timer` functions again.

Each of them is a function that receives a different, "unknown" function, as its argument. So far - so good. Then it defines another function which "wraps" the original function with some actions, like printing or timing. This inner function runs the original function and returns the result. In essence, it creates a "new implementation" of that original function that does the exact same thing, but with the wrapping functionality (printing, timing, etc.). This new function (`inner_func`) can replace any instance of the original function without any trouble, since in essence it just calls it. It adds a couple of statements before and after that call, but the essential functionality remains unchanged.

Lastly, the outer function, which we call the decorator, returns the inner function as its return value. So this function receives a function as its argument and returns a new, improved function as its output. To use it, we just "rename" the existing functions:

In [34]:
load_data_printer = printer(load_data)
load_data_timed = timer(load_data)

process_data_printer = printer(process_data)
process_data_timed = timer(process_data)

We can obviously use this `timer` function on any function we wish to time. When we wish to time functions in seconds, rather than milliseconds, we'll just change this one instance of `timer` and be done with it, and likewise for `printer`.

The only small caveat here is the fact that we currently require the function we're replacing to take exactly one argument. This implementation detail is small but very impactful - it means that our decorator will only successfully decorate functions that have a single argument. To remedy this we'll have to use `*args` and `**kwargs`:

### Detour - `*args`, `**kwargs`

You use `*args` and `**kwargs` when you're not sure how many arguments will be passed to a function. Actually, the syntax is only `*` and `**` - the words `args` and `kwargs` are used by convention. `args` stands for arguments, i.e. the unnamed (positional) arguments given to a function. `kwargs` stands for keyword arguments, i.e. arguments given as `key=value`. Let's see a simple example:

In [35]:
def f(required_argument, *args, **kwargs):
    print(required_argument)
    if args:
        print(args)
    
    if kwargs:
        for key, value in kwargs.items():
            print(key, value)

In [36]:
f()  # doesn't work - we have one required argument to the function

TypeError: f() missing 1 required positional argument: 'required_argument'

In [44]:
f('required')

required


In [45]:
f('required', 1, 2, 3)  # the second printed row is the args

required
(1, 2, 3)


In [46]:
f('required', 1, 2, 3, kw1='a', kw2='b')

required
(1, 2, 3)
kw1 a
kw2 b


We see that `args` is a tuple containing all unnamed parameters that were given to the function, in the order they were given.

`kwargs` is a dictionary, its keys being the keywords, and values - the given values. Here's another short example:

In [47]:
def f(a=1, b=2):
    print(a, b)

inputs = {'a': 10, 'b': 20}
f(**inputs)

10 20


What we see here is that the function's signature doesn't have to contain `*args` or `**kwargs`. The `**` operator "opens up" the input dictionary, allowing the `f()` function to use the parameters without any issues. This is how we're going to use it for our decorators - we'll redefine them as follows:
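The single `*` works the same way for positional arguments. A minimal sketch, in which `f` is redefined to return its arguments so that the result is easy to inspect:

```python
def f(a=1, b=2):
    return a, b


positional = [10, 20]
from_list = f(*positional)            # same as f(10, 20)

keywords = {'a': 10, 'b': 20}
from_dict = f(**keywords)             # same as f(a=10, b=20)

mixed = f(*[10], **{'b': 20})         # same as f(10, b=20)

print(from_list, from_dict, mixed)    # (10, 20) (10, 20) (10, 20)
```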

In [48]:
def printer(func):
    def inner_func(*args, **kwargs):
        print(f"Starting {func.__name__}...")
        result = func(*args, **kwargs)
        print(f"Ending {func.__name__}...")
        return result
    return inner_func      


def timer(func):
    def inner_func(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        # time.time() returns seconds, so convert to milliseconds
        print(f"It took the code {(time.time() - start_time) * 1000} milliseconds to run.")
        return result
    return inner_func

We can now be sure that our functions will always run regardless of the number of inputs given to them. This makes us happy - but not completely happy. We still have to redefine all functions as we've seen before:

In [49]:
load_data_printer = printer(load_data)
load_data_timed = timer(load_data)

process_data_printer = printer(process_data)
process_data_timed = timer(process_data)

It still requires us to rename all instances of these functions in all places of the code, and when we're done with the printing and timing - we have to rename them back.

Why not "rename" the function back to its original name?

In [50]:
load_data = printer(load_data)
process_data = timer(process_data)

This idiom is common enough to have a built-in language syntax:

In [51]:
@timer
def load_data(fname):
    # ... Code ...
    pass

We can use multiple decorators for functions as well:

In [52]:
@printer
@timer
def process_data(data):
    # ... Code ...
    pass

When we wish to stop printing and timing our function, we simply delete this decorator in the relevant places.
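When decorators are stacked, the one closest to the function is applied first, so `@outer` over `@inner` behaves like `add = outer(inner(add))`. The sketch below records the order of events in a list to show this. The names `outer`, `inner`, `add` and `events` are made up for the illustration, and `functools.wraps` is an addition that doesn't appear in the decorators above: it copies the original function's `__name__` onto the wrapper, which would otherwise report `inner_func`.

```python
import functools

events = []  # records the order in which things run


def outer(func):
    @functools.wraps(func)  # keep func.__name__ on the wrapper
    def inner_func(*args, **kwargs):
        events.append("outer: before")
        result = func(*args, **kwargs)
        events.append("outer: after")
        return result
    return inner_func


def inner(func):
    @functools.wraps(func)
    def inner_func(*args, **kwargs):
        events.append("inner: before")
        result = func(*args, **kwargs)
        events.append("inner: after")
        return result
    return inner_func


@outer
@inner
def add(a, b):
    events.append("add runs")
    return a + b


result = add(1, 2)
print(result)        # 3
print(events)
print(add.__name__)  # add
```

The outermost decorator starts first and finishes last, wrapping everything beneath it.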

Decorators allow more complex calls, like calling them with arguments, but we'll leave that topic for another day.

## `collections`, `itertools`

The Python standard library comes with many second-order tools that can make your life much easier. Many of the more useful ones are located in these two modules - `collections` and `itertools`. Below I'll provide a brief tour of some of the more interesting features of these modules.

### namedtuple

When you want a tiny object with named fields, but without the hassle of creating a fully-fledged class, you actually wish to generate a namedtuple:

In [53]:
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p1 = Point(0, 0)
p2 = Point(x=0, y=1)
p3 = Point(1, y=2)

print(p2)
print(p3.y)
print(p1[0])

Point(x=0, y=1)
2
0


You can access the data inside a `namedtuple` using either the positional index (`[0]`) or the name of that field (`.x`). If all you wish to do is to keep a small record of something, `namedtuple` is your best option.
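A few more properties of `namedtuple` are worth knowing: it unpacks like a regular tuple, it's immutable, and it has helper methods such as `_replace()` and `_asdict()`. A short sketch, with illustrative variable names:

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)

x, y = p                  # unpacks like any tuple
moved = p._replace(x=5)   # returns a new Point, p itself is unchanged
as_dict = p._asdict()     # field names mapped to values

try:
    p.x = 10              # fields can't be reassigned
    is_immutable = False
except AttributeError:
    is_immutable = True

print(moved)         # Point(x=5, y=2)
print(is_immutable)  # True
```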

### defaultdict

A `defaultdict` is a dictionary that falls back to calling a predefined function when it doesn't find a key. For comparison, here's what a regular dictionary does:

In [54]:
d = dict(one=1, two=2)
print(d['one'])
print(d['three'])

1


KeyError: 'three'

Rather than raising a `KeyError`, a `defaultdict` calls the predefined function and stores its return value under the missing key:

In [80]:
from collections import defaultdict

d2 = defaultdict(list, one=1, two=2)
print(d2['one'])

1


However, when we call it with an unknown key:

In [81]:
d2['three']
d2

defaultdict(list, {'one': 1, 'two': 2, 'three': []})

It used the `list` "factory" to create a new list under that key. This is useful when grouping key-value pairs by their key.

In [82]:
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d2 = defaultdict(list)
for k, v in s:
    d2[k].append(v)

d2

defaultdict(list, {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]})
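Other factories work just as well. With `int` as the factory, missing keys start at `0`, which makes counting straightforward. A small sketch, where the word being counted is arbitrary:

```python
from collections import defaultdict

counts = defaultdict(int)  # int() returns 0, the starting value of each counter
for letter in "mississippi":
    counts[letter] += 1

print(dict(counts))  # {'m': 1, 'i': 4, 's': 4, 'p': 2}
```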

### Chained iterables

If we wish to iterate over several iterables one after the other, we can use the following function from the `itertools` module:

In [83]:
import itertools

chained = itertools.chain('abcd', 'efg')
for letter in chained:
    print(letter)

print('-----')
# Naive iteration over [[1, 2, 3, 4], [5, 6, 7, 8]] would result in two items - 
# two lists with four elements each:
for item in [[1, 2, 3, 4], [5, 6, 7, 8]]:
    print(item)

a
b
c
d
e
f
g
-----
[1, 2, 3, 4]
[5, 6, 7, 8]


In [84]:
# We wish to iterate over the numbers themselves
chained2 = itertools.chain.from_iterable([[1, 2, 3, 4], [5, 6, 7, 8]])
for number in chained2:
    print(number)

1
2
3
4
5
6
7
8


Note that the `itertools` functions return lazy iterators which, like generators, produce their items one at a time as you iterate over them.

### Permutations

In [85]:
list(itertools.permutations('ABCD', 2))

[('A', 'B'),
 ('A', 'C'),
 ('A', 'D'),
 ('B', 'A'),
 ('B', 'C'),
 ('B', 'D'),
 ('C', 'A'),
 ('C', 'B'),
 ('C', 'D'),
 ('D', 'A'),
 ('D', 'B'),
 ('D', 'C')]

### Combinations

In [86]:
list(itertools.combinations('ABCD', 2))

[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
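The difference between the two outputs above is whether order matters: `permutations` treats `('A', 'B')` and `('B', 'A')` as different results, while `combinations` keeps only one of them. A quick sketch checking this:

```python
import itertools

perms = list(itertools.permutations('ABCD', 2))
combs = list(itertools.combinations('ABCD', 2))

print(len(perms))  # 12 ordered pairs: 4 * 3
print(len(combs))  # 6 unordered pairs: 4 * 3 / 2

reversed_in_perms = ('B', 'A') in perms  # True, order matters
reversed_in_combs = ('B', 'A') in combs  # False, ('A', 'B') already covers it
```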

## Multiprocessing

There are several ways to utilize parallel processing in Python. The easiest of all is multi-processing, i.e. the use of several CPU cores to run jobs in parallel. This is best used when each process is independent from the others, not having to share data between them. 

A typical use case is when we have a list or an `np.array` holding data, and we wish to perform the same computation on each element of that list. If this computation is truly independent, the `multiprocessing` module has some very easy-to-use solutions.

```python
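import multiprocessing


def add_tuple(tup):
    """Add the two items of a 2-tuple."""
    return tup[0] + tup[1]


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import
    tups = [(0, 1), (2, 3), (4, 5), (6, 7)]
    # Pool() defaults to one process per CPU core; pass a number to override
    with multiprocessing.Pool() as pool:
        result = pool.map(add_tuple, tups)
    print(result)  # [1, 5, 9, 13]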
```

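The code above may not work when typed directly into IPython or Jupyter, since on some platforms the worker processes can't import functions that were defined interactively. `ipyparallel` is a library built for parallel computing from within IPython and Jupyter, and it can be used for this kind of job instead.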

The Python script `multiprocess.py` located in this `Classes` folder contains a working copy of this script.

Threading is Python's weak point because of the GIL, and we'll not discuss it in this class. Another form of parallel processing is asynchronous programming, which we'll also not cover, but is actually one of Python's strongest points.

## Numba

`numba` is a special library designed to speed up Python's computation. In many cases it's comparable to `numpy` in terms of use cases, but it might be simpler for people without previous experience with arrays. We'll jump right into an example and then discuss some of the magic:

In [2]:
from numba import jit
import numpy as np


@jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [3]:
arr = np.ones((10000, 10000))

%timeit sum2d(arr)
%timeit arr.sum()

122 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
64.6 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The results above are, for lack of a better word, amazing. Numpy has been optimized for ages and runs compiled C code, yet here it is only about twice as fast (64.6 ms vs. 122 ms) as `numba`, which seemingly just decorates a simple, perhaps _simplistic_, Python loop, making it amazingly fast.

This magic happens with LLVM, an open-source compiler infrastructure project used to build fast compilers for many languages. `numba` translates the Python function to LLVM-suitable code and lets LLVM optimize this code for it. The output is machine code that runs directly on the processor, without the interpreter's per-operation overhead.

Numba has more tricks up its sleeve. You can define the input types to squeeze it a bit more:

In [4]:
from numba import jit, float64
import numpy as np


@jit(float64(float64[:, :]), nopython=True)
def sum2d_inps(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [5]:
%timeit sum2d_inps(arr)

121 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can also use parallel looping:

In [6]:
from numba import jit, float64, prange
import numpy as np


@jit([float64(float64[:, :])], parallel=True)
def sum2d_p(arr):
    M, N = arr.shape
    result = np.float64(0.0)
    for i in prange(M):
        for j in prange(N):
            result += arr[i,j]
    return result

In [7]:
%timeit sum2d_p(arr)  # pretty cool

50.3 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


When every bit of performance matters - `numba` might be the way to go. For very complicated functions that use fancy linear algebra algorithms, it might be the case that `numba` doesn't support these methods yet. On these occasions resort to basic `numpy` functions and wait till the `numba` developers implement that method - or do so yourself! `numba` is completely open source.

## Cython

When you wish to write performant code that utilizes significant parts of the standard library, as well as `numpy` and the scientific stack - neither `numpy` nor `numba` will help you. They require that you work with arrays, which are not as easy to work with as lists, for example. Dictionaries are also very helpful, but using them only with the standard Python interpreter will hinder your performance considerably.

These are the cases where Cython shines. It allows you to write code with Python-like syntax and compile it ahead-of-time to a `myfile.c` source file, written in `C` automatically. When your code calls a function that was written in Cython, it will actually turn to the optimized `C` function and use that function instead.

As stated, Cython requires you to compile your code before running the parent Python script. To do that, you have to create a `setup.py` file that tells the Cython compiler where to find the files in question.

A Cython file has the `.pyx` extension (e.g. `X.pyx`), so `setup.py` should point there. Here's a basic example of `setup.py`:

```python
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    ext_modules = cythonize('my_file.pyx'),
    # other setup.py options come here
)
```

Then you navigate with your command line to the folder containing `setup.py` and write `python setup.py build_ext --inplace`, which tells Cython to "build", i.e. compile, the code in the `.pyx` file and add it `inplace`, i.e. to this directory.

An example can be found in the `cython_demo` folder. Let's see it here in action:

In [8]:
from cython_demo import plain_python
from cython_demo.cython_demo import primes_cython

In [9]:
plain_python.primes_python(20)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

In [12]:
primes_cython.primes(20)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

In [13]:
%timeit plain_python.primes_python(1000)
%timeit primes_cython.primes(1000)

32.6 ms ± 839 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.32 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
rands = np.random.random((1000000))

In [15]:
%timeit rands[rands < 0.5]

8.32 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
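@jit(nopython=True, parallel=True)
def filter_larger(rands):
    arr = np.zeros_like(rands)
    thresh = 0.5
    last_idx = 0
    for idx in prange(len(rands)):
        if rands[idx] < thresh:
            arr[last_idx] = rands[idx]
            last_idx += 1
            
    return arr[:last_idx]

# Caution: every iteration reads and updates the shared last_idx counter, so the
# iterations are not independent. This is probably hindering the performance of
# the parallel loop, and the output may not match that of a sequential run.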

In [24]:
%timeit filter_larger(rands)

3.67 ms ± 456 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
from cython_filter_demo.cython_filter_demo import filter_array

In [20]:
%timeit filter_array.filter_larger_cython(rands)

7.38 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
%prun filter_array.filter_larger_cython(rands)

 

In [22]:
import cython_demo.cython_demo

## Memoization (Caching)

Yet another way to improve performance of your scripts, perhaps a more straightforward one, is memoization. This essentially means caching (saving) the results of computations done for a given set of parameters. Every time the function is called it first checks whether the result of the operation was already computed earlier, and if so it immediately returns it rather than re-computing it all over again.

Caching is extremely easy to do in Python. The standard library has a module called `functools` which contains several important functions that work on other functions, and one of them is `lru_cache`, where "LRU" stands for "least recently used". While it's not the only way to do memoization in Python - there are multiple third-party libraries that implement fancy memoization techniques - `lru_cache` is usually good enough.

Using it is extremely simple:

In [25]:
def fib(n):
    """ Calculate the nth Fibonacci sequence element """
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

The Fibonacci series is a classic example, since every computation of a new element in the sequence is built on previous calculations. The function above is a simple implementation using recursion, but it currently doesn't cache its result. Meaning that it has to re-compute all values whenever it's called.

To cache the result we simply have to add a decorator to it:

In [None]:
import functools


@functools.lru_cache()
def fib(n):
    """ Calculate the nth Fibonacci sequence element """
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

Let's look at the timings:

In [33]:
%timeit fib(60)

KeyboardInterrupt: 

In [31]:
%timeit fib(61)

91.2 ns ± 1.25 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


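Running the cached `fib(61)` takes almost no time: once a value has been computed, every later call with the same argument is just a lookup. The `KeyboardInterrupt` above comes from the uncached version, which recomputes the same sub-results over and over and had to be stopped manually.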
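To see where the time goes, one can count how many calls each version makes. The sketch below uses a much smaller input so that the uncached version finishes quickly; `fib_plain`, `fib_cached` and `calls` are illustrative names, and `cache_info()` is the statistics method that `lru_cache` attaches to the decorated function:

```python
import functools

calls = {"plain": 0}  # call counter for the uncached version


def fib_plain(n):
    calls["plain"] += 1
    if n < 2:
        return n
    return fib_plain(n - 1) + fib_plain(n - 2)


@functools.lru_cache()
def fib_cached(n):
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


plain_result = fib_plain(20)
cached_result = fib_cached(20)
info = fib_cached.cache_info()

print(plain_result, cached_result)  # 6765 6765
print(calls["plain"])               # 21891 calls without caching
print(info.misses)                  # 21 real computations, one per n in 0..20
print(info.hits)                    # 18 answers served from the cache
```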

# "Code Smells"

We'll now turn our attention to higher-minded concepts that you should pay attention to when creating software. The term above refers to elements in your code base that represent something which is not _wrong_, but can probably be better. That's what the "smell" means - it's like a bad feeling about the code, but it's not something which will tear down your application if it remains as is.

### Repetitive code

The rule of thumb here is once you understand that some piece of code will be re-used somewhere else, immediately extract it out into a function and call that function instead. This will help you test it for correctness, document it more thoroughly and improve the readability of the piece of code using this new function.

Here's an example, with the repetitive code on top and the refactored piece of code on the bottom:

In [None]:
import tifffile


# ...
# In the middle of some data analysis script
file_list_type_1 = ['a.tif', 'b.tif', 'c.tif']
for file in file_list_type_1:
    img = tifffile.imread(file)
    img -= img.min()
    img /= img.max()

# ...
file_list_type_2 = ['x.tif', 'y.tif', 'z.tif']
for file in file_list_type_2:
    img = tifffile.imread(file)
    img -= img.min()
    img /= img.max()

In [None]:
import numpy as np


def normalize_img(img: np.ndarray) -> np.ndarray:
    """ 
    Receives an image in the form of a numpy array, makes 
    it positive and normalizes it between 0 and 1 
    """
    img -= img.min()
    img /= img.max()
    return img

# ...
file_list_type_1 = ['a.tif', 'b.tif', 'c.tif']
for file in file_list_type_1:
    img = tifffile.imread(file)
    img = normalize_img(img)

# ...
file_list_type_2 = ['x.tif', 'y.tif', 'z.tif']
for file in file_list_type_2:
    img = tifffile.imread(file)
    img = normalize_img(img)

Even though the code in question is only two lines long - I decided to extract it into its own function. Besides the increased readability, I may have noticed in a later stage of my coding that this function isn't as harmless as it seems due to a possible **integer overflow.** So now the function is longer than two lines, and more tests have to be added. Coding these upgrades in the first case, where we didn't extract the code snippet, would've been double the work with a higher chance for bugs.
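As an illustration of what such an upgraded function might look like, here is one possible sketch. It assumes that converting to floating point first is an acceptable fix, and, unlike the version above, it returns a new array instead of modifying its input; the handling of constant images is also an assumption:

```python
import numpy as np


def normalize_img(img: np.ndarray) -> np.ndarray:
    """
    Receives an image in the form of a numpy array, makes
    it positive and normalizes it between 0 and 1.
    Returns a new float64 array; the input is left untouched.
    """
    # astype() makes a float copy, so integer inputs can't overflow
    img = img.astype(np.float64)
    img -= img.min()
    largest = img.max()
    if largest == 0:
        return img  # constant image: avoid dividing by zero
    img /= largest
    return img


raw = np.array([[10, 20], [30, 250]], dtype=np.uint8)
normalized = normalize_img(raw)
signed = normalize_img(np.array([-128, 127], dtype=np.int8))
flat = normalize_img(np.zeros((2, 2), dtype=np.uint8))

print(normalized.min(), normalized.max())  # 0.0 1.0
```

Since the logic lives in a single function, this change only had to be made once.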

### Long functions (with "block comments")

Ideally, functions should be about 10-20 lines long, i.e. a single function or method shouldn't be longer than your screen. There's no lower-bound limit, meaning that very short functions - 2-3 lines long, as seen above - are absolutely fine.

Long functions are hard to understand, hard to test and will usually contain several blocks with distinct purposes. There's absolutely no reason to group up blocks that have different "responsibilities" in a single function. On occasions in which we _do_ write these long functions, we sometimes like to add block comments to the function, like:

```python
###################################################
# This part deals with reading the data into memory
###################################################
data = tifffile.imread(...)
# ...

#############################################
# Find the active areas in the processed data
#############################################
# ...

```


Here's a contrived example:

In [None]:
import pathlib

import tifffile


def process_data(filename):
    # Checks for validity of data and reads it
    assert pathlib.Path(filename).exists()
    raw = tifffile.imread(filename)
    assert raw.ndim == 3
    assert raw.shape[0] > 1
    
    # Now we process the data
    summed = raw.sum(0)
    summed = (summed - summed.mean())
    summed /= summed.max()
    
    # ...

It should be clear to you that each of the code blocks, annotated by a comment, should be a different function.
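One possible way to split the contrived function is sketched below. The helper names (`read_stack`, `validate_stack`, `project_and_normalize`) are made up for this illustration, and `tifffile` is imported inside the reading function only so that the other helpers can be tried without it installed:

```python
import pathlib

import numpy as np


def read_stack(filename):
    """Reads a TIFF stack from disk."""
    import tifffile  # third-party; only needed for the actual file reading

    assert pathlib.Path(filename).exists()
    return tifffile.imread(filename)


def validate_stack(raw):
    """Checks that the data is a 3D stack with more than one frame."""
    assert raw.ndim == 3
    assert raw.shape[0] > 1


def project_and_normalize(raw):
    """Sums over the first axis, removes the mean and scales by the maximum."""
    summed = raw.sum(0)
    summed = summed - summed.mean()
    summed /= summed.max()
    return summed


def process_data(filename):
    raw = read_stack(filename)
    validate_stack(raw)
    return project_and_normalize(raw)


# The processing step can now be exercised on synthetic data, without any file
fake_stack = np.arange(24, dtype=float).reshape(2, 3, 4)
validate_stack(fake_stack)
projected = project_and_normalize(fake_stack)
print(projected.shape)  # (3, 4)
```

Each helper now has a single responsibility, and the comments that used to separate the blocks have turned into function names.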

### Objects that should be functions, functions that should be objects

A "healthy" code base will contain a mix of objects (with their methods) and functions. Using only one of these programming paradigms in a medium-to-large scale project is probably not the way to go. But how do you decide whether some code has to go into a function, or "deserves" its own object? Here are a couple of rules of thumb:

#### Helper functions aren't general
Assume you had a large function which you divided into many smaller functions as I suggested above. But you notice that these small functions aren't general, i.e. they don't deal with general tasks like reading a file from disk, or normalizing some image. Rather, they're goal is to run some calculation which is very specific for the current application you're working on. Perhaps they're automatically inserting experiment dates into a database you generated, or they're filtering files in a directory per some heuristic.

These tasks are well-defined and well-contained, so it's definitely a good idea to keep them as functions, but the thing they do will be used once, and once only - inside this application only. If you indeed have a couple of such functions, then you're probably dealing with something that should be an object, and these couple of functions should be its methods. Now your code base reflects your understanding of the code - the internal methods are indeed specific to this object alone, and are unusable outside of the current task at hand.

#### Long list of arguments

Whenever your algorithm has several functions that perform a task back to back, and they all take approximately the same arguments (number of pixels in the image, filename, the data array, etc.) then you should probably turn these functions into methods of an object, and just ditch the arguments by using `self`.

This refactoring into an object will also let you organize the code and improve its readability. As separate functions you might not remember which one to call first - do I first `divide_by_largest()` and then `find_most_popular()` or the other way around? As methods of an object you can order them inside a publicly available `main()` method, which exposes the only true way to use these functions.
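As a minimal sketch of this idea, here is how the two functions mentioned above might look as methods. The class name `PopularityAnalysis` and the bodies of the methods are invented for this illustration; only the method names come from the text.

```python
class PopularityAnalysis:
    """Groups steps that all need the same data (illustrative names only)."""

    def __init__(self, values):
        self.values = list(values)
        self.normalized = None
        self.most_popular = None

    def main(self):
        """The single public entry point: runs the steps in the right order."""
        self._divide_by_largest()
        self._find_most_popular()
        return self.most_popular

    def _divide_by_largest(self):
        largest = max(self.values)
        self.normalized = [value / largest for value in self.values]

    def _find_most_popular(self):
        # The value that appears most often after normalization
        self.most_popular = max(set(self.normalized), key=self.normalized.count)


analysis = PopularityAnalysis([2, 4, 4, 8])
print(analysis.main())  # 0.5
```

The shared data lives on `self`, so none of the steps needs an argument list, and the leading underscores signal that only `main()` is meant to be called from the outside.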

#### Objects with either one or two methods

Usually if you have an object which has no more than a couple of methods, it's best to just turn these methods into functions and use them instead. Objects create more boilerplate and clutter, and testing will be generally harder.

### Too many nested levels

Having too many nested levels in your code gives its readers a harder time - they have to remember the last condition that was met (or wasn't), and to understand its relation to the current condition. But how do we flatten such code? We have two main methods: early returns and "switch-like" statements.

#### Early returns

In [None]:
# BAD CODE BELOW:
def f(a, b, c, d):
    if a > b:
        c = func1(a)
        if c:
            print(f"C is {c}")
            d = []
            for item in c:
                d.append([m for m in item if m is not None])
        else:
            return None
        return d
    else:
        c = func1(b)
        if c:
            print(f"C is {c}")
            d = []
            for item in c:
                d.append([m for m in item if m is not None])
        else:
            return None
        return d

In [None]:
# BETTER
def f(a, b, c, d):
    c = func1(max(a, b))
    if not c:
        return None

    print(f"C is {c}")
    d = []
    for item in c:
        d.append([m for m in item if m is not None])
    return d

There are two things hiding here. The first is the use of the built-in `max()` function to drop the first `if` statement, since the two code paths are otherwise identical. The other important thing is the early return. Instead of asking `if c:` and then having a fully-indented code block below, we reverse the condition, asking `if not c: return None`, and then we can safely unindent the following code path, since we're sure that `c` has the right value for us. It's also easier to read: for all lines of code below the `if not c` condition you know that `c` is truthy (not `False`, `None` or empty), and there are no `else` clauses that would make it less obvious which condition we're really checking at this line of code.

#### Switch-like statement in Python

Programmers in other languages, including MATLAB, are usually aware of the `switch - case` statement which allows you to choose what to do based on a specific value of some variable during runtime. For example:

In [34]:
def my_func1():
    return 4

def my_func2(data):
    print(f"Data is {data}")

def my_func3(data):
    pass

In [35]:
### Doesn't work
data = my_func1()
switch data:
    case 4:
        my_func2(data)
    case 15:
        my_func3(data)
    # etc...

SyntaxError: invalid syntax (<ipython-input-35-c81ff2e86a1c>, line 3)

Python has no `switch` statement (the closest built-in construct, the `match` statement, only arrived in Python 3.10), but you can mimic this behavior using dictionaries! Here's an equivalent piece of code:

In [36]:
switch = {4: my_func2, 15: my_func3}
data = my_func1()
switch[data](data)

Data is 4


When we access the `switch` dictionary with the key `data`, we get back the function object which was mapped to that key. This is like running the following statement:

In [37]:
a = my_func3
a

<function __main__.my_func3(data)>

The variable `a` is just a reference to the function. Printing it doesn't call the function; we have to add parentheses in order for the function to be executed. And this is why we have the `(data)` part after `switch[data]` - the parentheses, with the argument inside them, make the actual function call happen.

Switch statements aren't too common out in the wild, but sometimes they best fit your mental model of the code. When that is the case, a dictionary is a suitable replacement for the missing `switch`. By the way, there are libraries which try to mimic a `switch` in a clearer manner.
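One thing the dictionary version lacks is a `default` branch: `switch[data]` raises a `KeyError` for an unmapped value. A small sketch of one way to handle that, using `dict.get()` with a fallback function (all the handler names here are invented for the example):

```python
def handle_four(data):
    return f"Data is {data}"


def handle_fifteen(data):
    return "Got fifteen"


def handle_default(data):
    """Plays the role of the `default` branch of a switch statement."""
    return f"No handler for {data!r}"


switch = {4: handle_four, 15: handle_fifteen}


def dispatch(data):
    # dict.get() returns the fallback when the key is missing,
    # instead of raising a KeyError like switch[data] would
    handler = switch.get(data, handle_default)
    return handler(data)


print(dispatch(4))   # Data is 4
print(dispatch(99))  # No handler for 99
```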

# Software Design Principles

The previous part dealt with low-level concepts with very clear "do's and don'ts". We'll now turn our heads to some higher-level concepts that come up when you think about the design of your software. Most of the ideas presented below come from the lectures and textbooks of Robert Martin, AKA Uncle Bob, one of the most influential voices in object-oriented design.

## Object Orthogonality, Encapsulation

In many cases objects interact with one another. In the case of some `ProcessData` class, which might process some instances of a `Data` class, that can contain a couple of `Series` and metadata, for example, we can see how `ProcessData` communicates with the data inside the `Data` class, modifying it further. 

A preliminary design might look like the following:

In [38]:
import numpy as np
import pandas as pd


class Data:
    """ Simple container for two Series and their metadata """
    def __init__(self, arr1: np.ndarray, arr2: np.ndarray, date: float):
        self.ser1 = pd.Series(arr1, dtype=np.uint8)
        self.ser2 = pd.Series(arr2, dtype=np.int16)
        self.metadata = dict(shape1=self.ser1.shape,
                             shape2=self.ser2.shape,
                             total=self.ser1.shape[0] + self.ser2.shape[0],
                             date=date)


class ProcessData:
    """ Pipeline to process twin Data instances """
    def __init__(self, data1: Data, data2: Data):
        self.data1 = data1
        self.data2 = data2
        self.result = []
        # Note how we reach directly into the internals of Data
        self.metadata = dict(index1=data1.ser1.index,
                             index2=data2.ser1.index,
                             metadata=data1.metadata)

    def process(self):
        self.result.extend([self.data1.ser1.sum(), self.data2.ser1.sum()])
        self.result.append([self.data1.ser1.mean() + self.data2.ser2.mean()])
        return self.result

We have here a `Data` class which serves as a container for two Series that are logically connected. It also simplifies the access to some of the metadata that comes with these Series.

We also have a `ProcessData` class that uses the `Data` instances to calculate some statistical properties and keep them for later use.

While this design works (which is important), it's flawed in the sense that the `ProcessData` object is very reliant on the implementation details of the `Data` class. How would you write tests for `ProcessData`? Many of the possible tests you may write are reliant on proper `Data` implementation. When higher-level objects are dependent on specific attributes of some lower-level module, we need to perform Dependency Inversion. This decoupling process can also be called "object orthogonality".

We'll make a couple of major changes to our design which will solve, step by step, the design issues we encountered.

First we'll create a new `DataContainer` class that holds `Data` instances, and redefine the `Data` class more appropriately:

In [39]:
class Data:
    """ Simple container for two Series and their metadata """
    def __init__(self, arr1: np.ndarray, arr2: np.ndarray, date: float):
        self._ser1 = pd.Series(arr1, dtype=np.uint8)
        self._ser2 = pd.Series(arr2, dtype=np.int16)
        self._metadata = dict(shape1=self._ser1.shape,
                              shape2=self._ser2.shape,
                              total=self._ser1.shape[0] + self._ser2.shape[0],
                              date=date)

    @property
    def data(self):
        """ Returns the actual data variables as an iterable"""
        result = [self._ser1, self._ser2]
        return result
    
    @property
    def metadata(self):
        return self._metadata
    
    def sum(self):
        return [x.sum() for x in self.data]


class DataContainer:
    """ Holds, in order, instances of Data """
    def __init__(self, datas):
        self._data = []
        self._metadata = {}
        try:
            for idx, data in enumerate(datas):
                if isinstance(data, Data):
                    self._data.append(data)
                    self._metadata[idx] = data.metadata
                else:
                    raise TypeError(f"TypeError: Data {data} isn't a 'Data' type.")
        except TypeError as e:
            print(e)
    
    @property
    def data(self):
        return self._data
    
    @property
    def metadata(self):
        return self._metadata
    
    def sum(self):
        result = []
        for data in self._data:
            result.append(data.sum())
        return result

First note the "new technical term": we introduce here the `@property` decorator. If we define some method as a property, it can be used like a regular attribute, except that it's read-only (unless we also define a setter for it):

In [40]:
class Trial:
    def __init__(self):
        self.two_as_attr = 2
    
    def two_as_method(self):
        return 2
    
    @property
    def two_as_prop(self):
        return 2

tr = Trial()

# Changing attributes is possible:
print(f"The original attribute: {tr.two_as_attr}")
tr.two_as_attr = 3
print(f"Attributes can be changed: {tr.two_as_attr}")
print("------")

# Using the regular method requires brackets
print(f"Using the method: {tr.two_as_method()}")
print("And of course, it can't be changed (immutable).")
print("------")

# Using a property "feels" like using an attribute:
print(f"As a property: {tr.two_as_prop}")  # no brackets
try:
    tr.two_as_prop = 3  # AttributeError
except AttributeError as e:
    print(f"AttributeError: {e} - properties can't be changed.")

The original attribute: 2
Attributes can be changed: 3
------
Using the method: 2
And of course, it can't be changed (immutable).
------
As a property: 2
AttributeError: can't set attribute - properties can't be changed.
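
The property above can't be assigned to because it only defines a getter. If assignment should be allowed, but only under some conditions, a setter can be attached to the same property. A short sketch, in which the `Measurement` class and its validation rule are invented for the illustration:

```python
class Measurement:
    def __init__(self, volts):
        self.volts = volts  # this assignment goes through the setter below

    @property
    def volts(self):
        return self._volts

    @volts.setter
    def volts(self, new_value):
        # The setter is the one place where new values get checked
        if new_value < 0:
            raise ValueError(f"volts must be non-negative, received {new_value}")
        self._volts = new_value


m = Measurement(2)
m.volts = 3
print(m.volts)  # 3
try:
    m.volts = -1
except ValueError as e:
    print(f"ValueError: {e}")
```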


But besides this new, exciting feature of Python, what else has changed with the implementation?

#### `Data`:
1. We redefined `Data`. The new object discourages outside code from changing the data it holds, and only offers a "view" of it. The properties have no setters, so once the object is created its attributes can't be rebound through the public interface. The single underscore before the variable names doesn't technically prevent access, but it's a widely respected convention that marks these attributes as internal. This idea is called _encapsulation_.

2. Furthermore, if we examine the `sum()` method, we see that it's now bound to the `Data` object itself. If we write it explicitly it makes sense: _The sum of the data is a bound method of our data - an intrinsic property of it._ If we ever decide to change how our data is stored, the `sum()` method should change accordingly, but no other object will be affected.


#### `DataContainer`:
1. The new `DataContainer` class _doesn't really know_ what it's holding. All it cares is that they're `Data` instances. It doesn't peek inside the methods of the different `Data` instances.

2. It doesn't allow access to the list of `Data` instances itself. It exposes a `data` property which returns the list. If we decide to change the internal implementation of `DataContainer`, users of this class wouldn't care as long as we keep the output of the `data` property similar. Even if the list is empty - it will always return something.

Let's see the redefined implementation of the `ProcessData` class:

In [41]:
class ProcessData:
    """ Pipeline to process twin Data instances """
    def __init__(self, datacont: DataContainer):
        self.datacont = datacont
        self.result = {}
        self.metadata = datacont.metadata
        
    def process(self):
        """ Mock processing pipeline """
        self.result['sum'] = self.datacont.sum()
        # Data has no mean() method, so we compute it from the exposed Series
        means = [[ser.mean() for ser in d.data] for d in self.datacont.data]
        self.result['mean'] = means
        return self.result

The code snippet above is now much cleaner than the one we had beforehand. It uses the "API" of the `DataContainer` in two ways - either using a fully-featured `sum()` method, or by (securely) accessing the data using the `data` property and running non-standard processing on it - mean calculation in our case.

The downside is the added class - more code to write, more tests, more imports at the top. But the added value is tremendous. Think how easy it is to add new functionality to the pipeline. Everything is flexible, allowing us to add a new `median()` method to the `DataContainer` class, for example. We can even change the internal structure of the `Data` class and still use the downstream class effectively.
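To make the `median()` idea concrete, here is a stripped-down sketch of the same three classes. As a simplifying assumption it stores plain lists instead of pandas Series and drops the metadata, so it runs with the standard library alone. It is not the design above, only a miniature of it.

```python
from statistics import median


class Data:
    def __init__(self, values1, values2):
        self._values1 = list(values1)
        self._values2 = list(values2)

    @property
    def data(self):
        return [self._values1, self._values2]

    def sum(self):
        return [sum(x) for x in self.data]

    def median(self):  # the newly added functionality
        return [median(x) for x in self.data]


class DataContainer:
    def __init__(self, datas):
        self._data = list(datas)

    @property
    def data(self):
        return self._data

    def sum(self):
        return [d.sum() for d in self._data]

    def median(self):  # only delegates, knows nothing about the storage
        return [d.median() for d in self._data]


class ProcessData:
    def __init__(self, datacont):
        self.datacont = datacont
        self.result = {}

    def process(self):
        self.result['sum'] = self.datacont.sum()
        self.result['median'] = self.datacont.median()
        return self.result


container = DataContainer([Data([1, 2, 3], [10, 20, 30]),
                           Data([4, 5, 6], [7, 8, 9])])
result = ProcessData(container).process()
print(result)  # {'sum': [[6, 60], [15, 24]], 'median': [[2, 20], [5, 8]]}
```

Adding `median()` touched `Data` and `DataContainer` only; `ProcessData` just calls the new method and never learns how the values are stored.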

## Classes as Data Types and Class Methods

Yet another fairly important use case for classes in Python is as user-defined types for particular data. Programming languages define for us the basic types of data - floating-point numbers, integers, strings and so on. But what if (some of) our data is not composed of these primitive types? Can we construct data types of our own?

For instance, assume I'm collecting data from participants in a study I'm running, and one of the data points I'm gathering is their age. How should I encode it?

The age of a person is not an integer number. It _can_ be thought of as a floating point number, but then being 41.9 means that your age is 41 years and almost eleven months, which isn't too obvious from just looking at 41.9, since the 9 could be interpreted as the month of September. We can try to write stuff like '41.9' or '41 years and 9 months' or '41.9.14' but it doesn't look so good.

**Instead,** what we should do is to write a class that defines an age:

In [82]:
class Age:
    """ The age of a person """
    
    def __init__(self, years, months=1, days=1):
        if (years < 0) or (years > 120):
            raise TypeError(f"Years should be a valid integer, received {years}")
        if (months < 1) or (months > 12) or (not isinstance(months, int)):
            raise TypeError(f"Months should be an integer between 1 and 12, received {months}")
        if (days < 1) or (days > 31):
            raise TypeError(f"Days should be an integer between 1 and 31, received {days}")
            
        self.years = int(years)
        self.months = months
        self.days = int(days)

Now the DataFrame or array containing our data can have a column of type `Age` which will contain meaningful data about the person's age. Notice how compact this class is. It doesn't contain the ID number of the person, nor their name. All it does is encode the age. It's important that each class we write has one specific purpose, and not more.

However, we're not quite done here. There is another possible "representation" of age and that is the date of birth. It's quite a natural requirement that given a date of birth - a string or a datetime object - our Age class will know how to generate a proper Age() instance. Similarly, given an Age() instance, we should be able to generate the person's date of birth. 

The second requirement is pretty easy - make a `get_dob()` method that returns the date of birth. But how should we approach the first requirement, of instantiating an Age() from a given date? Let's try to _refactor_ our class:

In [94]:
import datetime


class Age:
    """ The age of a person """
    
    cur_year = datetime.date.today().year
    
    @classmethod
    def from_str(cls, date_str):
        """ Instantiate from a string containing a date in the standard ISO format. """
        try:
            date = datetime.date.fromisoformat(date_str)
        except ValueError:
            raise
        else:    
            return cls(cls.cur_year - date.year, date.month, date.day)
        
    @classmethod
    def from_dob(cls, dob):
        """ Instantiates from a datetime.date or a datetime.datetime object """
        try:
            # Same convention as from_str: the age in years is the year difference
            years = cls.cur_year - dob.year
            months = dob.month
            days = dob.day
        except AttributeError:
            print(f"Input should be a datetime.datetime or a datetime.date instance. Received {dob} which is a {type(dob)}.")
            raise
        else:
            return cls(years, months, days)
    
    
    def __init__(self, years, months, days):
        """ Instantiate an instance of the class by directly inputting the age of the person """
        if (years < 0) or (years > 120):
            raise TypeError(f"Years should be a valid integer, received {years}")
        if (months < 1) or (months > 12) or (not isinstance(months, int)):
            raise TypeError(f"Months should be an integer between 1 and 12, received {months}")
        if (days < 1) or (days > 31):
            raise TypeError(f"Days should be an integer between 1 and 31, received {days}")
        self.years = years
        self.months = months
        self.days = days
        
    def __str__(self):
        return f"Age(years={self.years}, months={self.months}, days={self.days})"
        
    def get_dob(self):
        """ Returns the date of birth """
        return datetime.date(self.cur_year - self.years, self.months, self.days)

In [97]:
age = Age(42, 11, 1)
age.get_dob()
age2 = Age.from_str('2001-04-05')
print(age2)

Age(years=18, months=4, days=5)
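
The `Age` class above derives the years from the difference in calendar years alone. As a hedged sketch of how a class method could instead compute the elapsed years and months, here is a small stand-alone class. The name `ElapsedAge` and the optional `today` argument (added so that the result doesn't depend on the day the code is run) are assumptions made for this example.

```python
import datetime


class ElapsedAge:
    """Age expressed as whole years and months elapsed since birth."""

    def __init__(self, years, months):
        self.years = years
        self.months = months

    @classmethod
    def from_dob(cls, dob, today=None):
        """Alternative constructor: build the instance from a date of birth."""
        if today is None:
            today = datetime.date.today()
        total_months = (today.year - dob.year) * 12 + (today.month - dob.month)
        if today.day < dob.day:
            total_months -= 1  # the current month isn't complete yet
        if total_months < 0:
            raise ValueError(f"Date of birth {dob} is after {today}")
        years, months = divmod(total_months, 12)
        return cls(years, months)

    def __repr__(self):
        return f"ElapsedAge(years={self.years}, months={self.months})"


age = ElapsedAge.from_dob(datetime.date(2001, 4, 5), today=datetime.date(2019, 5, 27))
print(age)  # ElapsedAge(years=18, months=1)
```

Because `from_dob()` receives the class as `cls` and ends with `cls(...)`, a subclass of `ElapsedAge` would get instances of its own type from the same constructor.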


## Typestates

Typestates are a way to enforce the state of our data/application with strict types.


Let's assume I have 24 human volunteers in a combined fMRI + questionnaire study. I keep them all in a single DataFrame for brevity and ease-of-use, but in effect they're in different stages of my experiment. A few were just recruited last week, and I haven't even set a date for our first meeting. A few others were already scanned in the magnet once, but still have to go through my second questionnaire session. 

My application monitors these volunteers, alerts me of upcoming meeting dates, and (of course) analyzes the results of the questionnaires and scans.

The __correctness__ of this application can be enforced in many ways - tests, mock data, daily use - but here I choose to show another mechanism - typestates. The fact that the current status of each volunteer isn't specified with a simple string in a table, but is actually a different class altogether, is another way to make sure that I always receive the expected output from each method call.

In [98]:
import copy
import datetime
import pandas as pd


# Helper types
class Name:
    """ First and last name """
    # Implementation omitted


class Age:
    """ Special age type """
    # Implementation omitted


class FmriResult:
    """ Results from an fMRI scan """
    # Implementation omitted


# Volunteer types
class Volunteer:
    """ Base class for all volunteers in my project """
    def __init__(self, name: Name, age: Age, call_date: datetime.date, vol_id: int):
        self.name = name
        self.age = age
        self.call_date = call_date
        self.id = vol_id
        self.metadata = {}  # subclasses fill this with stage-specific data

    def __str__(self):
        return f"{self.name}, age {self.age}, first called at {self.call_date}."

    def update_df(self, records: pd.DataFrame):
        """ Add the instance to the dataframe containing the rest of the data """
        record = pd.DataFrame([dict(name=self.name, age=self.age, call_date=self.call_date,
                                    id=self.id, metadata=self.metadata, type=type(self),
                                    instance=copy.copy(self))])
        # DataFrames don't grow in place, so we return a new one
        return pd.concat([records, record], ignore_index=True)

    def remove_from_df(self, records: pd.DataFrame):
        """ Remove the instance from the volunteer records """
        return records[records.id != self.id]


class PreScanOne(Volunteer):
    """ Volunteer before the first session """
    loc = 0  # ordinal place in hierarchy

    def __init__(self, name: Name, age: Age, call_date: datetime.date, vol_id: int,
                 scan_one_date: datetime.date):
        super().__init__(name, age, call_date, vol_id)
        self.metadata = dict(scan_one_date=scan_one_date)

    def advance(self, result: FmriResult, next_date: datetime.date):
        """ Advance a PreScanOne to a PostScanOne """
        new = PostScanOne(self, result, next_date)
        return new


class PostScanOne(Volunteer):
    """ Volunteer after the first session """
    loc = 1

    def __init__(self, pre_volunteer: PreScanOne, scan_one_data: FmriResult,
                 scan_two_date: datetime.date):
        super().__init__(pre_volunteer.name, pre_volunteer.age, pre_volunteer.call_date, pre_volunteer.id)
        self.metadata = dict(pre_volunteer.metadata)  # copy, don't share the dict
        self.metadata['scan_one_data'] = scan_one_data
        self.metadata['scan_two_date'] = scan_two_date

    def advance(self, result: FmriResult, next_date: datetime.date):
        """ Advance a PostScanOne to a PreScanTwo """
        # PreScanTwo follows the same pattern, its definition is omitted here
        new = PreScanTwo(self, result, next_date)
        return new


# Examples of generic functions that use this interface
def advance_volunteer(old_vol, results: FmriResult, next_date: datetime.date,
                      records: pd.DataFrame):
    """
    Move volunteer to next step in the experiment, returning the new
    instance and records.
    """
    records = old_vol.remove_from_df(records)
    new_vol = old_vol.advance(results, next_date)
    records = new_vol.update_df(records)
    return new_vol, records


def process_data(records: pd.DataFrame):
    """ Run the same processing function over all fMRI data """
    results = []
    for vol in records['instance']:
        try:
            results.append(vol.process_data())
        except AttributeError:  # instance doesn't have data
            pass
    return results

This is long, but interesting, so let's try to break it down.

At the beginning we have a few helper classes which I merely declared, but didn't implement. These shouldn't look strange to you. We talked during class about how an `Age` type is an important example of defining our own types in a program, since it's neither an integer nor a floating point number.

The second part is the most interesting. We have a base class called `Volunteer` which contains basic information which is common to all experiment volunteers. But it's actually more than that - it also defines the _interfaces_ between the classes, it forces the classes to have specific attributes that will comply to this protocol, linking their behavior together.

The other two classes inherit from `Volunteer` and represent the first two steps in the "Volunteer path". The `loc` class variable signifies that. From phase one (`PreScanOne`) a volunteer can only advance forward (or drop out from the experiment) to step 2. And likewise from step 2 to 3 - you'll always find the same `.advance()` method that takes you to the next step, even though the implementation is slightly different. To handle the variability in the held data, we have the `metadata` attribute which can hold different parameters and datapoints.

The last part shows how to use such an interface. We have a function that advances an instance of a class "one step" to the next phase. We have a function that runs some processing on the data held inside the instances, and we can have as many functions (and classes) as we wish. It's completely extensible since the API is well-defined.
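The core of the idea fits in a few lines. The sketch below is a deliberately tiny, standard-library-only version with two invented states, `Recruited` and `Scanned`. It only shows that a method which makes no sense in a given state simply doesn't exist on that state's class.

```python
class Recruited:
    """A volunteer that hasn't been scanned yet: there is nothing to process."""

    def __init__(self, name):
        self.name = name

    def advance(self, scan_result):
        # The only way to obtain a Scanned instance is to go through here
        return Scanned(self.name, scan_result)


class Scanned:
    """A volunteer with scan data, so processing is now available."""

    def __init__(self, name, scan_result):
        self.name = name
        self.scan_result = scan_result

    def process_data(self):
        return sum(self.scan_result) / len(self.scan_result)


volunteer = Recruited("Dana")
print(hasattr(volunteer, "process_data"))  # False

volunteer = volunteer.advance([1.0, 2.0, 3.0])
print(type(volunteer).__name__, volunteer.process_data())  # Scanned 2.0
```

Calling `process_data()` on a `Recruited` instance fails immediately with an `AttributeError`, instead of silently working on missing data, which is exactly the failure that `process_data(records)` above catches and skips.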

## Helper Concepts and Libraries

In practice, good and clear software design can be aided by using unique Python features and packages. We'll review a few of the more prominent ones:

### Type Annotations and MyPy

Since version 3.6, Python allows the following syntax (annotations on function signatures are older, but annotating variables requires 3.6):

In [99]:
from typing import Tuple, Dict

def doer_of_stuffs(a: float, b: int, c: str = 'ccc') -> Tuple[str, Dict[int, float]]:
    """
    Does stuff to a, b, and c.
    Returns: A tuple of a string and a dictionary mapping ints to floats
    """
    a_helper: float = a + 2
    b_helper: float = b / 3
    int_a = int(a_helper)
    c2: str = c + c
    return c2, {b: a_helper, int_a: b_helper}

While a bit more verbose, these _type annotations_ make things clearer when dealing with large codebases. Knowing the defined type of your variables as they bounce around between modules and functions can help with the debugging process of your code tremendously.

Moreover, modern IDEs like PyCharm and VSCode will alert you to possible type errors before you run the code. For example:

In [100]:
def main():
    a = 3  # integer
    a /= 2  # now it's a float
    arr = np.array([1, 2, 3])
    
    # ... lots of code here
    
    b = arr[a]  # IndexError - cannot index with a float variable

PyCharm and VSCode can mark this `arr[a]` expression as a likely error before you ever run the code.

A more comprehensive approach is `mypy`, much of whose development took place at Dropbox, a company very reliant on its Python-based product. When the Dropbox codebase increased in size, its engineers wanted to keep using Python due to its amazing features, but avoid the problems that come with a dynamically-typed language, and so they invested heavily in `mypy`. In essence, it's a command-line tool that runs type checks on the entirety of your code base, verifying the type-correctness of your application. In many places a clean `mypy` report is required before committing changes to the code base.

`mypy` supports both comment-based type annotations for older versions of Python (Dropbox, as of early 2018, is still using Python 2.7) and the new style of type annotations shown above. A companion tool, `PyAnnotate`, can also generate type annotations from the types it observes while you run your application.

An example can be found in the `mypy_demo` folder.
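
To give a feeling for what such a check looks like, here is a small self-contained sketch. The function `mean_rate` is invented for this example and isn't part of the `mypy_demo` folder. The annotated code runs as regular Python, and the commented-out call at the bottom is the kind of mistake a static checker is expected to report without running anything.

```python
from typing import List


def mean_rate(spike_counts: List[int], duration: float) -> float:
    """Average number of spikes per second."""
    return sum(spike_counts) / duration


rate = mean_rate([3, 5, 4], 2.0)
print(rate)  # 6.0

# A static checker such as mypy is expected to reject the call below,
# since a str is passed where a float is annotated. Python itself would
# only complain at runtime, with a TypeError raised by the division:
# mean_rate([3, 5, 4], "2.0")
```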

### Enumerations

Python added enumeration support in version 3.4, and it's starting to pop up more and more in new code bases. An enumeration is a set of named, discrete possible values. Assuming I have a simple addition function:

In [101]:
def add_or_sub(a, b, add=True):
    """ Simple addition/subtraction """
    return a + b if add else a - b

The list of possible values for `a` and `b` is endless, so these cannot be enumerated. The `add` keyword is called a "flag", since it has two possible values - `True` and `False`. It's an enumeration of two possible values.

When we have more than two options, or when our two options aren't simply booleans, we can use an enumeration. Here's a simple example:

In [102]:
from enum import Enum


class Color(Enum):
    RED = 2
    GREEN = 1
    BLUE = 0
    BLACK = 'BLACK'
    
def return_color(c: Color, num: int) -> Color:
    ones = num % 10
    if ones == c.value:
        return c
    else:
        return Color.BLACK
    
ans = return_color(Color.RED, 12)
print(ans)

Color.RED


In the "real world" enumerations aren't too popular due to the fact that they were introduced very late. But a use-case could look like the following:

In [103]:
import pandas as pd


rng = pd.date_range('1/1/2018',periods=100, freq='D')  # 'D' is days
rng

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
               '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
               '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20',
               '2018-01-21', '2018-01-22', '2018-01-23', '2018-01-24',
               '2018-01-25', '2018-01-26', '2018-01-27', '2018-01-28',
               '2018-01-29', '2018-01-30', '2018-01-31', '2018-02-01',
               '2018-02-02', '2018-02-03', '2018-02-04', '2018-02-05',
               '2018-02-06', '2018-02-07', '2018-02-08', '2018-02-09',
               '2018-02-10', '2018-02-11', '2018-02-12', '2018-02-13',
               '2018-02-14', '2018-02-15', '2018-02-16', '2018-02-17',
               '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21',
               '2018-02-22', '2018-02-23', '2018-02-24', '2018-02-25',
      

In [104]:
rng = pd.date_range('1/1/2018',periods=100, freq='M')  # it can also be 'M'
rng

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31',
               '2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31',
               '2019-09-30', '2019-10-31', '2019-11-30', '2019-12-31',
               '2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
               '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
               '2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
               '2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
               '2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',
      

What are the possible values for the `freq` keyword? Day is `D`, month is `M`, Year will probably be `Y`. Are there any more keywords? Will `d` also work, or do I have to use capital `D`? Actually, checking the [official](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) documentation doesn't result in anything too useful.

This is where enumerations come into play. This could've been simpler if we could only choose a value from a list of possible values:

In [106]:
class DateRangeFreq(Enum):
    D = 'days'
    M = 'months'
    Y = 'years'

rng = pd.date_range('1/1/2018',periods=100, freq=pd.DateRangeFreq.D)  # doesn't actually work...

AttributeError: module 'pandas' has no attribute 'DateRangeFreq'

If we were unsure of the available parameters, we could import the `DateRangeFreq` object and inspect its possible values. As you can see, each member has a value associated with it. This value can be an integer, a string or even an arbitrary Python object.

Enumerations are still hard to find in the Python ecosystem. They're a recent addition, and Pythonistas are used to typing strings in their function parameters, and not enumerations. But in many other languages with native enum support these data structures are very frequent for this use case, as well as others. If you're writing a piece of code that is intended for a Python 3.4+ audience, I suggest you use enumerations liberally in your code.
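Since pandas doesn't offer such an enumeration, here is a small sketch of the idea that uses only the standard library. The `Step` enumeration and the `date_range()` helper are invented for this example and only loosely imitate `pd.date_range`.

```python
import datetime
from enum import Enum


class Step(Enum):
    DAY = datetime.timedelta(days=1)
    WEEK = datetime.timedelta(weeks=1)


def date_range(start, periods, step):
    """Return `periods` dates, starting at `start`, spaced by `step`."""
    if not isinstance(step, Step):
        raise TypeError(f"step must be one of {list(Step)}")
    return [start + i * step.value for i in range(periods)]


# The allowed options can be discovered by inspecting the enumeration
print([member.name for member in Step])  # ['DAY', 'WEEK']

dates = date_range(datetime.date(2018, 1, 1), 3, Step.WEEK)
print(dates[-1])  # 2018-01-15
```

A caller can no longer pass `'d'`, `'D'` or `'days'` and hope for the best: anything that isn't a member of `Step` is rejected with an error that lists the valid options.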

### `attrs` - Classes without boilerplate

Python classes are extremely useful, but they're also pretty verbose. They require you to write a lot of code for very basic operations.

For example, in the `__init__()` method you have to go through each variable in the function signature and assign it to an attribute of `self`:

In [107]:
class Example:
    def __init__(self, param1, param2, param3, param4):
        self.param1 = param1
        self.param2 = param2
        self.param3 = param3
        self.param4 = param4
    
    def my_method(self):
        """ Do stuff """
        pass

So many lines of repetitive code doing basically nothing. I didn't assert the types of the variables, I didn't do some basic pre-processing - this is called "boilerplate" code. Python requires me to write these tedious lines every time I create a class, and when classes get bigger and bigger, these assignments can be a hassle to write.

`attrs` to the rescue:

In [108]:
import attr
from attr.validators import instance_of


@attr.s
class ExampleTwo:
    param1 = attr.ib(validator=instance_of(int))
    param2 = attr.ib(validator=instance_of(float))
    param3 = attr.ib(default='no')
    param4 = attr.ib(default=attr.Factory(list))
    
    def my_method(self):
        """ Do stuff """
        pass

That's it. No `__init__` is required, each `paramX` variable is already assigned to `self.paramX`. It also allows the addition of validators, default values, converter functions (not shown), and it even implements the comparison methods (`__eq__`, `__gt__`, etc.) for you. It has a ton of other useful features which I won't go into right now, but you can be sure that it's a package worth using.
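A short usage sketch, assuming the `attrs` package is installed. It repeats the attribute definitions of `ExampleTwo` so that the cell runs on its own, and then exercises the generated `__init__`, `__repr__`, `__eq__` and the validators:

```python
import attr
from attr.validators import instance_of


@attr.s
class ExampleTwo:
    param1 = attr.ib(validator=instance_of(int))
    param2 = attr.ib(validator=instance_of(float))
    param3 = attr.ib(default='no')
    param4 = attr.ib(default=attr.Factory(list))


first = ExampleTwo(1, 2.0)
second = ExampleTwo(1, 2.0)

print(first)                           # the generated __repr__ lists every attribute
print(first == second)                 # the generated __eq__ compares attribute values
print(first.param4 is second.param4)   # the factory builds a fresh list per instance

try:
    ExampleTwo('1', 2.0)               # param1 must be an int
except TypeError as err:
    print('rejected:', err)
```

Using `attr.Factory(list)` rather than `default=[]` is what keeps the two instances from sharing the same list object.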

I can testify that 95% of the classes I write today are `attrs` classes, and the same goes for many fellow Pythonistas. I encourage you to read the [official documentation](http://www.attrs.org/en/stable/?badge=stable) and start using it ASAP.

### Dimensional analysis and units

When working with numbers that have units, it's usually a good idea to keep the physical unit attached to the value throughout the computation.

When you're measuring the local field potential using some electrode array, it's good practice to verify that throughout the entirety of your processing pipeline, the voltage values aren't divided by a number with units of time, because units of _[Volts] / [seconds]_ usually have no physical meaning there. It can also help you assert that your dF/F calculation is indeed dimensionless, and doesn't carry some other arbitrary units.
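To make the idea concrete, here is a toy sketch of dimension tracking in plain Python. It is not how any real units library is implemented, and the `Quantity` class and the dimension names are invented for this example. Each value carries a dictionary of dimension exponents: addition demands matching dimensions, and division subtracts exponents:

```python
class Quantity:
    """A number tagged with dimension exponents, e.g. {'voltage': 1}."""

    def __init__(self, value, dims):
        self.value = value
        # Drop zero exponents so that equal dimensions compare equal.
        self.dims = {name: power for name, power in dims.items() if power != 0}

    def __add__(self, other):
        if self.dims != other.dims:
            raise TypeError(f"cannot add {self.dims} to {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __truediv__(self, other):
        dims = dict(self.dims)
        for name, power in other.dims.items():
            dims[name] = dims.get(name, 0) - power
        return Quantity(self.value / other.value, dims)


volts = Quantity(3.0, {'voltage': 1})
baseline = Quantity(1.5, {'voltage': 1})
seconds = Quantity(2.0, {'time': 1})

ratio = (volts + baseline) / baseline   # voltage / voltage: exponents cancel
slope = volts / seconds                 # voltage / time: probably a mistake

print(ratio.value, ratio.dims)
print(slope.value, slope.dims)
```

A pipeline can then assert that `ratio.dims` is empty (dimensionless) and fail loudly if a stray division by time sneaks in, as it did for `slope`.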

There are many options in the Python world for dimensional analysis. If you're using Python to write symbolic math and solve equations, I suggest you use SymPy's `physics.units` module. Otherwise, use `pint`.

In [109]:
import pint


ureg = pint.UnitRegistry()
3 * ureg.meter + 4 * ureg.cm

In [110]:
measures = ureg.Quantity(np.random.random(100), 'volts')
print(measures)

[0.86391808 0.80978233 0.61946865 0.92476365 0.54612499 0.09720625 0.74498431 0.46146294 0.8292554  0.32929196 0.66626406 0.58078878 0.75559014 0.58448808 0.45797219 0.37495129 0.40365534 0.57607671 0.26639574 0.69993773 0.27043263 0.28197405 0.39358254 0.54878849 0.4536735  0.77949003 0.50405561 0.75888185 0.38107352 0.60872202 0.54252542 0.60435099 0.12799454 0.96368337 0.70291376 0.31704293 0.56331423 0.46653042 0.23144402 0.46126507 0.27680999 0.43026911 0.34848799 0.0747386  0.78686883 0.52757631 0.08994196 0.31799668 0.81047963 0.14054888 0.45391329 0.33311589 0.00282717 0.78593707 0.94349129 0.95387516 0.43304587 0.1066743  0.12133096 0.12211176 0.35031233 0.80722592 0.98821595 0.32115933 0.61608006 0.0114678 0.36467658 0.25797414 0.05165721 0.79903202 0.81460115 0.42354514 0.37769425 0.7671096  0.15520372 0.76805772 0.63531454 0.42907546 0.05934157 0.05175393 0.06850817 0.44843725 0.09708755 0.06521083 0.01864829 0.89428011 0.51148607 0.15315861 0.74256808 0.71719171 0.28400971

In [111]:
print(measures * 2)

[1.72783617 1.61956465 1.2389373  1.8495273  1.09224997 0.19441249 1.48996862 0.92292588 1.65851079 0.65858392 1.33252811 1.16157755 1.51118027 1.16897617 0.91594438 0.74990258 0.80731067 1.15215342 0.53279148 1.39987546 0.54086526 0.5639481  0.78716507 1.09757699 0.907347   1.55898006 1.00811122 1.5177637  0.76214704 1.21744404 1.08505085 1.20870197 0.25598907 1.92736675 1.40582752 0.63408587 1.12662847 0.93306084 0.46288804 0.92253014 0.55361997 0.86053822 0.69697599 0.1494772  1.57373766 1.05515263 0.17988391 0.63599337 1.62095926 0.28109776 0.90782658 0.66623179 0.00565434 1.57187415 1.88698258 1.90775031 0.86609175 0.2133486  0.24266192 0.24422352 0.70062467 1.61445184 1.97643189 0.64231865 1.23216012 0.02293559 0.72935316 0.51594828 0.10331442 1.59806403 1.62920231 0.84709027 0.75538849 1.5342192  0.31040744 1.53611544 1.27062908 0.85815091 0.11868314 0.10350786 0.13701633 0.89687451 0.19417509 0.13042166 0.03729659 1.78856022 1.02297214 0.30631723 1.48513617 1.43438343 0.5680194

In [112]:
amps = measures / (2 * ureg.ohm)  # I = V/R
amps.dimensionality

<UnitsContainer({'[current]': 1.0})>

In [113]:
amps.to('seconds')  # DimensionalityError

DimensionalityError: Cannot convert from 'volt / ohm' ([current]) to 'second' ([time])

For some projects this can be overkill, but for others it can save you from many "silent" bugs.

## Design vs. Productivity

Before we start exercising, one important note to remember: There's a thin line between under- and over-engineering. Very small scripting projects require almost no engineering at all. This might mean that after you gain a few extra months of experience in Python, the structure of the code for a small scripting job might be obvious to you right from the get-go. You'll know which data structures you'll have, whether or not you'll need a class or two, and how the user interface might go.

On the other hand, large applications which span at least a few thousand lines of code will always need _some_ form of pre-planning. It would be senseless not to draw a diagram of the main modules in your code and their interfaces. One can consider this to be common knowledge, or a simple programmer's instinct. Just like architects sit down and plan for months in advance what they're about to construct, programmers should spell out the architecture of their own programs. In no way will this guarantee you'll get the architecture right the first time, but the design might serve as a good set of building blocks when you start the refactoring process.

Problems mostly occur when you write medium-sized scripts, up to a couple thousand lines. These scripts usually start out small - a few functions that deal with file I/O and display of data - but can grow quite quickly once you start adding functionality. When the script was short you probably didn't even write tests, since you were sure you were handling some insignificant piece of code, and now it comes back to bite you.

It's hard to write rules for these occasions. When someone asks me for improved functionality in some short script I wrote, I sometimes tell them it will take more time than I think it should, since I want to devote time to refactoring the code, adding tests and making the new functionality feel more natural inside it.

It's also good practice to use classes to bind data and methods, even when you think they might be overkill. It's much easier to expand the functionality of a class than that of an assortment of functions.