Lesson 4: Pythonic programming
====================
---
Prof. James Sharpnack<br>
Statistics Department, UC Davis<br>
&copy; 2017

## Python built-in functions and iterables

Python has many built-in functions that are always available to the user; see [the built-in functions documentation](https://docs.python.org/3/library/functions.html).  You should learn their uses, since many are very helpful.  Most of them operate on broad classes of objects.  For example, `enumerate()` operates on any iterable, which includes containers such as lists, tuples, sets, and dictionaries.  The main types that you should know how to use extensively are list, tuple, set, dict, str, int, float, and function.  You can construct most of these types by calling the type, for example `int('134')`.  Running the cell below will show the docstring for the int type.

In [5]:
int?

Some of the most useful of these functions are `enumerate, len, max, min, sorted, range, reversed, sum, zip`.  These are all tied to iterables, which we should discuss here.  An iterable is an object that can be passed to the built-in `iter` function (it defines the `__iter__` method) to obtain an iterator, which is how for loops work in Python.  Containers are iterable, and the resulting iterator lets you step through the container's elements.  For example, the following two cells do the same thing.

In [6]:
all_primes = list([2,3,5,7,11,13])
for i in all_primes: #simple for loop
    print(i)

2
3
5
7
11
13


In [13]:
all_primes_iter = iter(all_primes)
while True:
    try:
        i = next(all_primes_iter)
        print(i)
    except StopIteration:
        break

2
3
5
7
11
13


The above script repeatedly calls the `next` function, which returns the current element and advances the iterator by one position.  When `next` cannot find any more elements, it raises a `StopIteration` exception.  The try/except idiom in Python is very handy: with it you can catch specific types of errors and perform other operations when they occur.  See [the Python errors and exceptions tutorial](https://docs.python.org/3/tutorial/errors.html) for more details.
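
As a small sketch of the try/except idiom outside of iteration, the helper below (the name `parse_int` and its `default` argument are made up for illustration) catches the `ValueError` that `int` raises on a string that is not a number, and returns a fallback value instead of stopping the program.

```python
def parse_int(text, default=None):
    """Return int(text), or `default` if text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for strings such as 'abc'
        return default

parsed = [parse_int(s) for s in ['134', 'abc', '-7']]
print(parsed)  # [134, None, -7]
```

Catching only `ValueError`, rather than using a bare `except:`, means that unrelated errors still surface.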

`len` will return the number of elements in a container, such as a list.  It will also give the length of a string.

`range(n,m)` will return an iterable that counts from n to m-1.  In Python 2.7, `range` returned a list and `xrange` returned the lazy iterable.  In Python 3 there is only `range`, and it returns the lazy iterable, so it no longer allocates the memory for a full list.
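
To make the memory point concrete, here is a short sketch in Python 3.  The variable names are arbitrary.  A `range` stores only its start, stop, and step, so even a very large one is cheap to create and still supports `len`, indexing, and membership tests.

```python
r = range(3, 8)
print(list(r))        # [3, 4, 5, 6, 7]
print(len(r))         # 5

big = range(10**9)    # no list of a billion ints is built here
print(len(big))       # 1000000000
print(big[-1])        # 999999999
print(10**8 in big)   # True
```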

`sorted(iterable[,key])` will return a new sorted list of the iterable's elements, ordered by the value of the key function on each element.  Designing an interesting key function lets you sort strings by length, by their second letter, etc.  This functionality is very handy.
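
The two key functions mentioned above might look like the following sketch (the word list is made up).  Note that `sorted` returns a new list and leaves the original alone, and that ties keep their original relative order.

```python
words = ['banana', 'fig', 'cherry', 'kiwi', 'apple']

by_length = sorted(words, key=len)                    # shortest first
by_second_letter = sorted(words, key=lambda w: w[1])  # compare only w[1]

print(by_length)         # ['fig', 'kiwi', 'apple', 'banana', 'cherry']
print(by_second_letter)  # ['banana', 'cherry', 'fig', 'kiwi', 'apple']
print(words)             # the original list is unchanged
```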

`max, min` will find the maximum and minimum of an iterable, and they support a key like the sorted function.
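
A brief sketch of `max` and `min` with and without a key, on a made-up word list.  Without a key, strings are compared alphabetically.  With `key=len` they are compared by length, and the first of any tied elements is returned.

```python
words = ['banana', 'fig', 'cherry', 'kiwi']

print(max(words), min(words))   # kiwi banana  (alphabetical order)

longest = max(words, key=len)   # 'banana' ties with 'cherry'; the first wins
shortest = min(words, key=len)
print(longest, shortest)        # banana fig
```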

`enumerate(iterable)` will return an iterator that yields tuples of the index along with the corresponding element of the iterable.  See the next cell.

In [21]:
for i,j in enumerate([1,4,9]):
    print(i,j)

0 1
1 4
2 9


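`reversed(seq)` returns an iterator over the elements of a sequence (such as a list, tuple, str, or range) in reverse order.  It does not modify the original, and it does not accept arbitrary iterables such as generators.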
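
A short sketch of how `reversed` behaves.  It hands back a one-shot iterator rather than a list, leaves its argument untouched, and needs a sequence to work with, so passing it a generator raises a `TypeError`.

```python
squares = [1, 4, 9]
rev = reversed(squares)    # an iterator, not a new list
first_pass = list(rev)     # [9, 4, 1]
second_pass = list(rev)    # [] because the iterator is now exhausted
print(first_pass, second_pass, squares)

try:
    reversed(i * i for i in range(3))  # a generator is not a sequence
except TypeError as err:
    print('TypeError:', err)
```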

`sum(iterable[,start])` will add the elements of the iterable to the initial value start.  For a list of numbers the default start of 0 is what you want.  A non-numeric start also works, as in the following example, where the second call concatenates lists.

In [28]:
print(sum(range(10)))
print(sum(([i] for i in range(10)), []))

45
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
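
`zip` appears in the list of useful functions above but is not demonstrated, so here is a minimal sketch with made-up data.  It pairs up the elements of several iterables position by position, and stops as soon as the shortest one runs out.

```python
primes = [2, 3, 5, 7]
squares = [p**2 for p in primes]

pairs = list(zip(primes, squares))
print(pairs)                   # [(2, 4), (3, 9), (5, 25), (7, 49)]

for p, s in zip(primes, squares):   # iterate over two lists in step
    print(p, s)

short = list(zip('abc', range(10)))
print(short)                   # [('a', 0), ('b', 1), ('c', 2)]

letter_index = dict(zip('abc', range(3)))  # build a dict from keys and values
print(letter_index)
```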


## Mutable or immutable: list or tuple?

You can group Python types into mutable and immutable, where the contents of immutable types cannot be altered.  Two mirror-image types are the tuple and list.  The tuple is immutable, so you cannot extend it, assign to its entries by index or slice, etc.

In [29]:
a_tup = (1,4,9)
a_tup[1] = 3 #this doesn't work

TypeError: 'tuple' object does not support item assignment

In [30]:
a_list = [1,4,9]
a_list[1] = 3
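
One practical consequence of mutability is sketched below with arbitrary variable names.  Two names can refer to the same list, so changing it through one name is visible through the other.  With an immutable type such as str, an operation like `+=` builds a new object instead.

```python
nums = [1, 4, 9]
alias = nums            # a second name for the same list object
alias.append(16)        # mutates the shared object in place
print(nums)             # [1, 4, 9, 16]

word = 'hot'
other = word
other += 'dog'          # builds a new string; word is untouched
print(word, other)      # hot hotdog

independent = list(nums)   # a separate (shallow) copy
independent.append(25)
print(nums)             # [1, 4, 9, 16]
```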

Other immutable types are `int, str, bool` and some mutable types are `list, array.array, set, dict`.  The contents of mutable types can be changed in place: for example, lists can be appended to and assigned to by index or slice, and sets and dicts can have entries added and removed.  A tuple is more than an immutable list.  It is also the natural partner of tuple unpacking, which can be very handy for working with functions and returning multiple items.

In [34]:
a, b = (1, 2) #parallel assignment
b, a = a, b #swap the values!
a, b, *c = range(5) #*c will grab the rest of the items
print(a,b,c)

0 1 [2, 3, 4]


In [35]:
def test_right_triangle(a,b,c):
    print(a**2 + b**2 == c**2)

In [38]:
trarg = (3,4,5)
test_right_triangle(*trarg) #here we passed multiple arguments from one tuple with unpacking

True
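
Unpacking is just as useful on the way out of a function.  In the sketch below, the hypothetical helper `min_max` returns two values as a tuple, and the caller unpacks them in a single assignment.

```python
def min_max(values):
    """Return the smallest and largest value as a 2-tuple."""
    return min(values), max(values)   # the comma builds the tuple

lo, hi = min_max([3, 1, 4, 1, 5])     # unpack the returned tuple
print(lo, hi)  # 1 5
```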


Tuples should also be thought of as records, as in a database.  They work well as records because you cannot change the order of the elements, giving each position a fixed meaning.  So if you have a dataset of (year, price, item name) then it would be best to organize it as a list of 3-tuples (until we start using pandas, that is).
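
Here is a sketch of the list-of-3-tuples layout.  The records themselves are invented for illustration.  Because each position has a fixed meaning, you can unpack a record in a for loop and use a position as a key for `min` or `sorted`.

```python
# made-up records: (year, price, item name)
sales = [(2015, 3.50, 'coffee'), (2016, 1.25, 'bagel'), (2017, 2.00, 'tea')]

for year, price, item in sales:       # unpack each record by position
    print(year, item, price)

cheapest = min(sales, key=lambda rec: rec[1])             # compare on price
newest_first = sorted(sales, key=lambda rec: rec[0], reverse=True)
print(cheapest)                       # (2016, 1.25, 'bagel')
print([rec[2] for rec in newest_first])  # ['tea', 'bagel', 'coffee']
```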

## List comprehensions and generator expressions

One of the more Pythonic things that you will learn is the use of list comprehensions and generator expressions.  The idea is that if you want to build a list from an iterable, while mapping and filtering the elements, then you should go with a list comprehension.  They are typically more succinct, often faster, and very flexible.  I did not introduce the `map` function above because a list comprehension is easier to write and more readable.  `map` will apply a function to each element of an iterable, and `filter` will remove those elements for which a function evaluates to false.  But we can do that with a list comp...

In [41]:
print([i//3 for i in range(100) if i % 3 == 0])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
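
For comparison, here is a sketch of the same computation written with `map` and `filter`.  Both versions produce the same list, but the `map`/`filter` version needs two lambdas and reads inside-out.

```python
with_comp = [i // 3 for i in range(100) if i % 3 == 0]

with_map_filter = list(
    map(lambda i: i // 3,                      # the mapping step
        filter(lambda i: i % 3 == 0,           # the filtering step
               range(100)))
)

print(with_comp == with_map_filter)  # True
```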


In the above we mapped the numbers 0-99 by integer division by 3 and filtered out those which are not divisible by 3.  Many times you do not want to create a list and waste the memory, and instead you want to just create an iterable.  A generator expression will do this for you.  It looks like a list comp, but you surround the expression with parentheses instead of square brackets.  This can then be passed to functions that take iterables.  You can also build a dictionary with the closely related dictionary comprehension, which uses curly braces and `key: value` pairs (as in two cells down).

In [43]:
sum((i//3 for i in range(100) if i % 3 == 0),100) #100+0+1+2+...+33

661
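
The memory saving can be checked roughly with `sys.getsizeof`, as in this sketch.  Exact byte counts vary between Python versions, so only the comparison is shown.  It also shows that a generator can be consumed only once.

```python
import sys

squares_list = [i * i for i in range(100000)]   # all values stored at once
squares_gen = (i * i for i in range(100000))    # values produced on demand

print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True

total = sum(squares_gen)      # consumes the generator
leftover = sum(squares_gen)   # nothing left, so this is 0
print(total == sum(squares_list), leftover)  # True 0
```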

In [44]:
{j:i for i,j in enumerate('hotdog')}

{'d': 3, 'g': 5, 'h': 0, 'o': 4, 't': 2}

## The right data structure for you

Different data structures are efficient for different operations.  The list is great if you want to look up an item based on its index.  It is inefficient for looking up the index of a given value, because `list.index` iterates through the list until the value is found.  The time this takes therefore scales with the length of the list (we say that this takes $O(n)$ time where $n$ is the length of the list).  If you want fast lookups of the index for a value you have two main options: a dictionary for reverse lookups, or a sorted list.

A dictionary is a hash table.  A hash table stores its entries in slots in memory, where the slot is determined by a built-in hash function applied to the key.  So if you call `hash('james')` then you see the hash value of a string, which determines where the entry for that key is stored.  This means that to find the value for a key, you only need to evaluate the hash function of the key, so lookups take constant time on average.  The drawback is that the hash table may take up more memory than strictly necessary.  You also need to use keys that are hashable, such as strings, ints, floats, and tuples of hashable items.

<img src="https://upload.wikimedia.org/wikipedia/commons/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg">
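
A small sketch of the hashability requirement, with made-up keys and values.  Strings, ints, floats, and tuples of these all work as dictionary keys, while a mutable list does not and raises a `TypeError`.

```python
lookup = {'james': 1, 42: 'int key', 3.14: 'float key', (1, 2): 'tuple key'}

# Within one Python session the same key always hashes to the same value.
print(hash('james') == hash('james'))   # True
print('james' in lookup, (1, 2) in lookup)  # True True

try:
    lookup[[1, 2]] = 'list key'         # lists are mutable, so not hashable
except TypeError as err:
    print('TypeError:', err)
```

The value of `hash('james')` itself is not shown because string hashes are randomized between Python sessions by default.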

A sorted list is just a sorted version of the original list used to construct it.  Because it is sorted you can look for values through the bisection method.  Bisection will look at the mid-point in the sorted list and determine if the query is greater than or less than that value.  Then you can rule out half of the list, and continue bisecting in the other half recursively until you find the value.  This means that it takes $O(\log n)$ time to do the lookup.  Look at the code below from the scrabble dictionary reverse lookup example.

In [46]:
def read_dictionary(filename="data/sowpods.txt"):
    """create a list of words in the scrabble dictionary"""
    with open(filename,'r') as scrabblefile:
        scrabble_dict = [word.strip() for word in scrabblefile]
    return scrabble_dict
    
scrabble_words = read_dictionary()

In [47]:
from sortedcontainers import SortedList

ind_dict = {word: i for i, word in enumerate(scrabble_words)} #init the dictionary
ind_sl = SortedList(scrabble_words) #init the sorted list

In [51]:
word = scrabble_words[12345]
%timeit ind_dict[word]
%timeit ind_sl.bisect(word)
%timeit scrabble_words.index(word)

The slowest run took 44.11 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 74.1 ns per loop
100000 loops, best of 3: 4.64 µs per loop
1000 loops, best of 3: 325 µs per loop
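
The bisection lookup itself can be sketched with the standard-library `bisect` module on a plain sorted list.  This is an illustration of the idea rather than the internals of `SortedList`, and the helper name `sorted_index` and the word list are made up.  `bisect_left` returns the position where the value would be inserted, so the code still has to check that the value is actually there.

```python
from bisect import bisect_left

def sorted_index(sorted_items, value):
    """Return the index of value in sorted_items, or -1 if it is absent."""
    i = bisect_left(sorted_items, value)   # O(log n) bisection search
    if i < len(sorted_items) and sorted_items[i] == value:
        return i
    return -1

words = sorted(['pear', 'apple', 'fig', 'kiwi', 'banana'])
print(words)                          # ['apple', 'banana', 'fig', 'kiwi', 'pear']
print(sorted_index(words, 'fig'))     # 2
print(sorted_index(words, 'grape'))   # -1
```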
