### I0u19a - Data Processing - KU Leuven
###### _Thomas Moerman, Jan Aerts_
![license](https://licensebuttons.net/l/by/3.0/88x31.png)

# **Functional Programming in Python**

Hello and welcome to the tutorial on data processing using functional programming (FP) concepts!

We'll be using [Jupyter](http://jupyter.org/) (you're looking at it) as a tool to walk you through a few examples. At the VDA-LAB, we like notebooks as a teaching tool because they allow you to experiment with code and data as you work your way through the document.

A few guidelines on the notebook itself:
* A notebook consists of *cells*, which are snippets of either text (markdown) or code (Python in this case).
* Cells can be executed by clicking the `[>]` "play" button, or by hitting shift-enter on the keyboard.
* You can navigate between cells either by clicking or by using the arrow buttons.

### **Objectives**

Many parallel and distributed data processing libraries and frameworks adopt concepts from functional programming. This notebook provides examples and exercises to familiarize students with these concepts. We will be using two libraries to illustrate them: [Toolz](http://toolz.readthedocs.io/en/latest/) and [Dask](https://dask.pydata.org/en/latest/).

> **Careful!** The examples here are intended as didactic material to teach FP concepts. They are not necessarily idiomatic Python, nor the best approach for every use case.

### **Contents**



---

* import the necessary Python libraries

In [146]:
import json
import toolz as tz
import dask.bag as db

---

## **1. FP basics**

When programming in a functional style, functions are the main abstraction for composing programs. When used with skill, they represent a **vocabulary** in which computational ideas can be elegantly expressed.

* functions transform data
* functions are reusable
* functions are composable
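
As a small illustration of composability, the sketch below builds a `compose` helper on top of `functools.reduce`. The names `compose`, `inc` and `double` are made up for this example. Toolz ships its own `compose` and `pipe` functions for the same purpose, so in practice you would not write this yourself.

```python
from functools import reduce

def compose(*functions):
    """Return a function that applies `functions` from right to left."""
    def composed(x):
        # reversed(): the last function listed is applied first
        return reduce(lambda acc, fn: fn(acc), reversed(functions), x)
    return composed

def inc(x):
    return x + 1

def double(x):
    return 2 * x

inc_then_double = compose(double, inc)  # double(inc(x))
double_then_inc = compose(inc, double)  # inc(double(x))

print(inc_then_double(5))  # 12
print(double_then_inc(5))  # 11
```

Note that the order of composition matters: the two composed functions give different results for the same input.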

### 1.1 Functions as arguments
* like objects, functions can be passed to other functions as arguments

In [200]:
def do_twice(my_function, arg):
    """
    Applies my_function twice to arg and returns the result.
    """
    
    once  = my_function(arg)
    twice = my_function(once)
    
    return twice

* we define a small helper __`inc`__ and pass it as the first argument to __`do_twice`__

In [194]:
def inc(x):
    return x + 1

do_twice(inc, 5)

7

* we can define a function _inline_ using a **`lambda`** expression

In [201]:
do_twice(lambda x: x + 1, 5)

7

### 1.2 Functions as return values

* functions can be returned as the resulting value of another function

In [213]:
def do_twice_v2(my_function):    
    # function defined within the scope of another function
    def inner(x):
        return my_function(my_function(x))
    
    return inner  # return the inner function

def do_twice_v3(my_function):    
    # same as above, but using a lambda expression
    return lambda x: my_function(my_function(x))

In [204]:
twice_inc = do_twice_v2(inc)

In [205]:
type(twice_inc)

function

In [206]:
twice_inc(7)

9

In [226]:
do_twice_v3(inc)(7)

9

> functions that take functions as arguments or return functions are called **higher-order functions**

### 1.3 Partial function application

* build a new function by filling in a subset of a function's arguments, using __`partial`__
* note: the order of arguments is important, because `partial` fills in positional arguments from left to right

In [229]:
twice_inc = tz.partial(do_twice, inc)

In [230]:
twice_inc(3)

5
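
To make the remark about argument order concrete, here is a minimal sketch using `functools.partial` (to our knowledge `tz.partial` is the same function, re-exported by Toolz). The `power` function is a made-up example.

```python
from functools import partial

def power(base, exponent):
    return base ** exponent

# positional arguments are filled in from the left, so this fixes base=2
two_to_the = partial(power, 2)
print(two_to_the(10))  # 1024

# a keyword argument lets us fix a later parameter instead: exponent=2
square = partial(power, exponent=2)
print(square(7))  # 49
```

If the argument you want to fix is not the leftmost one, either pass it by keyword as above or reorder the function's parameters.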

### 1.4 Curried functions

* currying a function automates partial application

In [217]:
@tz.curry
def do_twice_curried(fn, x):
    return fn(fn(x))

* still works as expected

In [221]:
do_twice_curried(inc, 8)

10

* if not all arguments are provided, a partially applied function is returned

In [231]:
twice_inc = do_twice_curried(inc)

In [232]:
twice_inc(4)

6
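
To get a feeling for what a currying decorator does behind the scenes, here is a deliberately simplified sketch. `simple_curry` is a hypothetical helper written for this notebook: it only handles positional arguments and is not how `tz.curry` is actually implemented.

```python
import inspect
from functools import wraps

def simple_curry(fn):
    # number of parameters the wrapped function expects
    n_params = len(inspect.signature(fn).parameters)

    @wraps(fn)
    def curried(*args):
        if len(args) >= n_params:
            return fn(*args)  # all arguments present: call the function
        # not enough arguments yet: remember them and wait for the rest
        return lambda *more: curried(*args, *more)

    return curried

@simple_curry
def do_twice(fn, x):
    return fn(fn(x))

def inc(x):
    return x + 1

print(do_twice(inc, 8))    # 10
twice_inc = do_twice(inc)  # partially applied
print(twice_inc(4))        # 6
```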

### 1.5 Purity
* see https://toolz.readthedocs.io/en/latest/purity.html
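
As a minimal illustration (the function names are made up for this example): a pure function computes its result from its arguments only and leaves everything else untouched, whereas an impure function has side effects such as modifying its input.

```python
def append_impure(items, x):
    items.append(x)     # side effect: modifies the caller's list
    return items

def append_pure(items, x):
    return items + [x]  # builds a new list, the input is left untouched

original = [1, 2]

result = append_pure(original, 3)
print(original, result)  # [1, 2] [1, 2, 3]

append_impure(original, 3)
print(original)          # [1, 2, 3]
```

Pure functions are easier to test and safe to run in parallel, which is one reason why distributed frameworks favour them.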

### 1.6 Laziness

* see https://toolz.readthedocs.io/en/latest/laziness.html
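
A small sketch of lazy evaluation using only the standard library: we map over an infinite sequence, which is only possible because nothing is computed until values are requested. The `calls` list and `logged_square` are helpers made up for this example, used to record when the function is actually called.

```python
from itertools import count, islice

calls = []

def logged_square(x):
    calls.append(x)  # record every actual call
    return x * x

squares = map(logged_square, count())  # infinite, but nothing is computed yet
print(calls)                           # []

first_three = list(islice(squares, 3)) # only now are 0, 1 and 2 squared
print(first_three)                     # [0, 1, 4]
print(calls)                           # [0, 1, 2]
```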

---

## **2. Functions operating on collections**

In a data processing context, **functions** and **collections** go hand in hand.

* functions to transform collections
* functions to aggregate or summarize collections
* map, filter, reduce are the workhorses, the most common higher-order functions that operate on collections

In [271]:
scientists = [{'first': 'Richard', 'last': 'Feynman',  'gender': 'M'}, 
              {'first': 'Marie',   'last': 'Curie',    'gender': 'F'},
              {'first': 'Paul',    'last': 'Stamets',  'gender': 'M'},
              {'first': 'Ada',     'last': 'Lovelace', 'gender': 'F'},
              {'first': 'Stephen', 'last': 'Hawking',  'gender': 'M'},
              {'first': 'Carolyn', 'last': 'Porco',    'gender': 'F'}]

### 2.1 Mapping

* we pass a function to `map`, which applies it to every entry in the collection
* the length of the output is equal to the length of the input

Let's check out the function signature:

```
map(func, *iterables) --> map object
```

In [280]:
tz.map?

Init signature: tz.map(self, /, *args, **kwargs)
Docstring:     
map(func, *iterables) --> map object

Make an iterator that computes the function using arguments from
each of the iterables.  Stops when the shortest iterable is exhausted.
Type:           type


In [272]:
def name_length(scientist):
    return len(scientist['last'])

name_lengths = tz.map(name_length, scientists)

* mapping is lazy
* the result is not immediately computed

In [273]:
name_lengths

<map at 0x7fcd0069de10>

* we turn the map into a collection by wrapping it into a list

In [274]:
list(name_lengths)

[7, 5, 7, 8, 7, 5]

* let's try with a lambda expression

In [275]:
loud_scientists = tz.map(lambda s: s['first'].upper(), scientists)

In [276]:
for entry in loud_scientists:
    print(entry)

RICHARD
MARIE
PAUL
ADA
STEPHEN
CAROLYN


#### **!! CAUTION !!**
* What happens if we iterate over the same lazy collection twice??

In [277]:
for entry in loud_scientists:
    print(entry)
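
The behaviour can be reproduced in isolation with a short, self-contained sketch (the list of names is just sample data): a `map` object is an iterator, so it can be consumed only once.

```python
names = ['Richard', 'Marie', 'Paul']

loud = map(str.upper, names)
first_pass = list(loud)
second_pass = list(loud)  # the iterator is already exhausted

print(first_pass)   # ['RICHARD', 'MARIE', 'PAUL']
print(second_pass)  # []

# if the result is needed more than once, materialize it into a list first
loud_list = list(map(str.upper, names))
print(loud_list == list(loud_list))  # True
```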

* we can map over multiple collections at once
* we need to pass in a function with a different signature

In [281]:
sums = tz.map(lambda x, y: x + y, [0, 1, 2, 3], [10, 11, 12, 13])

In [282]:
list(sums)

[10, 12, 14, 16]
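
As the docstring above mentions, mapping over several collections stops when the shortest one is exhausted. A quick sketch with made-up inputs, plus one possible way to pad the shorter input using `itertools.zip_longest`:

```python
from itertools import zip_longest
from operator import add

longer = [0, 1, 2, 3]
shorter = [10, 11]

# the surplus entries of the longer list are silently dropped
truncated = list(map(add, longer, shorter))
print(truncated)  # [10, 12]

# padding the shorter list with zeros keeps all entries
padded = [x + y for x, y in zip_longest(longer, shorter, fillvalue=0)]
print(padded)     # [10, 12, 2, 3]
```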

### 2.2 Filtering

* we pass a **predicate**, i.e. a function that takes an entry and returns a Boolean value

In [279]:
ladies_only = tz.filter(lambda s: s['gender'] == 'F', scientists)

list(ladies_only)

[{'first': 'Marie', 'gender': 'F', 'last': 'Curie'},
 {'first': 'Ada', 'gender': 'F', 'last': 'Lovelace'},
 {'first': 'Carolyn', 'gender': 'F', 'last': 'Porco'}]

### 2.3 Reducing

* a **reduction** collapses the collection

Let's check the function signature:

```
reduce(function, sequence[, initial]) -> value
```

* the result is a single value
* function a.k.a. `seqop`
* optional argument: `initial`

In [283]:
tz.reduce?

Docstring:
reduce(function, sequence[, initial]) -> value

Apply a function of two arguments cumulatively to the items of a sequence,
from left to right, so as to reduce the sequence to a single value.
For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5).  If initial is present, it is placed before the items
of the sequence in the calculation, and serves as a default when the
sequence is empty.
Type:      builtin_function_or_method


In [310]:
sum_numbers = tz.reduce(lambda x, y: x + y, range(1, 5))

sum_numbers

10

* or written more elegantly

In [311]:
from operator import add

tz.reduce(add, range(1, 5))

10

* in the next example, the accumulator (an int) and the entries (dicts) have different types
* we need to provide an initial value for the accumulator

In [292]:
sum_first_name_lengths = tz.reduce(lambda acc, s: acc + len(s['first']), scientists, 0)

sum_first_name_lengths

33
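
It can help to see a reduction written out as an ordinary loop. The sketch below is a hand-written approximation of `reduce` (the name `reduce_with_loop` and the `_missing` sentinel are made up for this example), compared against `functools.reduce`.

```python
from functools import reduce
from operator import add

_missing = object()  # sentinel meaning "no initial value was given"

def reduce_with_loop(function, sequence, initial=_missing):
    iterator = iter(sequence)
    if initial is _missing:
        try:
            acc = next(iterator)  # the first entry becomes the accumulator
        except StopIteration:
            raise TypeError('reduce of empty sequence with no initial value') from None
    else:
        acc = initial
    for entry in iterator:
        acc = function(acc, entry)  # fold the next entry into the accumulator
    return acc

print(reduce_with_loop(add, range(1, 5)))  # 10
print(reduce_with_loop(lambda acc, s: acc + len(s), ['Ada', 'Marie'], 0))  # 8
print(reduce_with_loop(add, [], 0))        # 0
```

Without an initial value the first entry serves as the accumulator, which only works when accumulator and entries have the same type.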

#### **Advanced example**

* we can implement filter with reduce

In [307]:
def filter_by_reducing(predicate, sequence):
    def maybe_append(acc, e):
        if predicate(e):
            acc.append(e)  # only append if the predicate evaluates positively
            
        return acc
    
    return tz.reduce(maybe_append, sequence, [])

filter_by_reducing(lambda s: s['gender'] == 'F', scientists)

[{'first': 'Marie', 'gender': 'F', 'last': 'Curie'},
 {'first': 'Ada', 'gender': 'F', 'last': 'Lovelace'},
 {'first': 'Carolyn', 'gender': 'F', 'last': 'Porco'}]
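
In the same spirit, `map` can also be expressed as a reduction. This sketch (the name `map_by_reducing` is made up) builds a new list at every step instead of appending in place. That keeps the reducing function pure, at the cost of extra copying, so treat it as an illustration rather than a recommendation.

```python
from functools import reduce

def map_by_reducing(function, sequence):
    def append_mapped(acc, e):
        return acc + [function(e)]  # new list, the accumulator is not mutated

    return reduce(append_mapped, sequence, [])

print(map_by_reducing(len, ['Feynman', 'Curie', 'Stamets']))  # [7, 5, 7]
```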

### 2.4 Aggregating

In [185]:
book = open('../data/dickens.txt')

In [186]:
loud_book = map(str.upper, book)
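
One typical aggregation over text is counting word frequencies. The sketch below uses two hard-coded lines as a stand-in for the file, and `collections.Counter` from the standard library. Toolz offers a `frequencies` function for the same task.

```python
from collections import Counter

# two sample lines standing in for the contents of the book
lines = ['It was the best of times,',
         'it was the worst of times,']

# lazily split lines into normalized words
words = (word.strip(',.').lower() for line in lines for word in line.split())

counts = Counter(words)

print(counts['times'])  # 2
print(counts['best'])   # 1
```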

## **1. Itertoolz**

Operations on iterables.

### **1.1 Mapping**

In [147]:
tz.concat?

Signature: tz.concat(seqs)
Docstring:
Concatenate zero or more iterables, any of which may be infinite.

An infinite sequence will prevent the rest of the arguments from
being included.

We use chain.from_iterable rather than ``chain(*seqs)`` so that seqs
can be a generator.

>>> list(concat([[], [1], [2, 3]]))
[1, 2, 3]

See also:
    itertools.chain.from_iterable  equivalent
File:      ~/work/batiskav/installs/anaconda3/lib/python3.5/site-packages/toolz/itertoolz.py
Type:      function



In [15]:
mapped = map(lambda x, y: x + y, [0, 1, 2], [4, 5, 6])

In [16]:
list(mapped)

[4, 6, 8]
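
Mapping and concatenating are often combined into a "flat map": each entry is mapped to a sequence, and the resulting sequences are concatenated. The sketch below uses `itertools.chain.from_iterable`, which the `concat` docstring above names as its equivalent. The sample lines are made up.

```python
from itertools import chain

nested = [[], [1], [2, 3]]
flat = list(chain.from_iterable(nested))
print(flat)   # [1, 2, 3]

# flat map: split every line into words, then concatenate the word lists
lines = ['functions transform data', 'functions are reusable']
words = list(chain.from_iterable(map(str.split, lines)))
print(words)  # ['functions', 'transform', 'data', 'functions', 'are', 'reusable']
```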

## 2. Dask Bag

TODO Dask intro

* unordered 
* repeats allowed

In [19]:
b = db.from_sequence(range(0, 100))

In [23]:
sum_by_even = b.foldby(key=lambda x: x % 2 == 0, 
                       binop=lambda x,y: x+y)

In [24]:
list(sum_by_even)

[(False, 2500), (True, 2450)]
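
Conceptually, `foldby` groups entries by key and reduces each group in a single pass. The sketch below mimics that on a plain sequence. `fold_by_key` is a hypothetical helper that ignores partitions entirely, so it is not a description of how Dask implements it.

```python
def fold_by_key(key, binop, sequence):
    """Group entries by key(entry) and reduce every group with binop."""
    result = {}
    for entry in sequence:
        k = key(entry)
        if k in result:
            result[k] = binop(result[k], entry)
        else:
            result[k] = entry  # the first entry of a group starts its accumulator
    return result

sum_by_even = fold_by_key(lambda x: x % 2 == 0,
                          lambda x, y: x + y,
                          range(0, 100))

print(sum_by_even)  # {True: 2450, False: 2500}
```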

In [360]:
m = {'bla': 5, 'test': 3}

In [362]:
set(m.keys())

{'bla', 'test'}

## 3. Two beers, please!

* following exercises use a dataset parsed from a `json` file
* we'll be using a **Dask bag** to compute some answers

In [382]:
def is_valid(raw_dict):
    return raw_dict['Percentagealcohol'] != 'NA'

def parse_beer(raw_dict):        
    return {'brewery': raw_dict['Brouwerij'],
            'brand': raw_dict['Merk'],
            'type': raw_dict['Soort'],
            'alcohol_pct': float(raw_dict['Percentagealcohol'])}

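# read the file inside a `with` block so that it is closed again afterwards
with open('../data/beers.json') as beers_file:
    beers_raw = db.from_sequence(json.load(beers_file))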

* count the number of invalid beer entries (those whose alcohol percentage is recorded as 'NA')

In [383]:
beers_raw.filter(predicate=lambda x: not is_valid(x)).count().compute()

12

* let's now get rid of invalid entries and parse them into a more convenient representation

In [384]:
beers = beers_raw.filter(is_valid).map(parse_beer).persist()

* quiz: why have we used the `persist` method above?

In [386]:
beers.take(2)

({'alcohol_pct': 6.0,
  'brand': '3 Schténg',
  'brewery': "Brasserie Grain d'Orge",
  'type': 'hoge gisting'},
 {'alcohol_pct': 5.6,
  'brand': '400',
  'brewery': "'t Hofbrouwerijke voor Brouwerij Montaigu",
  'type': 'blond'})

---

#### Exercise 1: **find the number of breweries**

* hint: to get unique results, use `distinct`

In [387]:
breweries = beers.map(lambda b: b['brewery']).distinct().count().compute()

breweries

360

#### Exercise 2: **find the strongest beer**
* hint: the Dask bag function for reducing is called `fold`

In [395]:
beers.fold(lambda b1, b2: b1 if b1['alcohol_pct'] > b2['alcohol_pct'] else b2).compute()

{'alcohol_pct': 26.0,
 'brand': 'Black Damnation V (Double Black)',
 'brewery': 'De Struise Brouwers bij Brouwerij Deca',
 'type': 'Russian Imperial Stout, Eisbockmethode'}

#### Exercise 3: **find the 3 most common alcohol percentages**

* sorting the entire collection is wasteful if we are only interested in the top 3
* hint: use the `topk` function

In [399]:
beers.map(lambda x: x['alcohol_pct']).frequencies().topk(3, key=lambda pair: pair[1]).compute()

[(8.0, 180), (6.5, 150), (5.0, 131)]
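
The same idea exists in the standard library: `heapq.nlargest` selects the top k entries without sorting everything. The sketch below uses a handful of made-up percentages and checks that the result matches the "sort everything, then slice" approach.

```python
import heapq
from collections import Counter

# made-up alcohol percentages, not taken from the beer dataset
values = [8.0, 6.5, 8.0, 5.0, 6.5, 8.0, 7.0]

frequencies = list(Counter(values).items())  # (value, count) pairs

def by_count(pair):
    return pair[1]

top_2 = heapq.nlargest(2, frequencies, key=by_count)
print(top_2)  # [(8.0, 3), (6.5, 2)]

# same answer as sorting the whole list first, but without the full sort
print(top_2 == sorted(frequencies, key=by_count, reverse=True)[:2])  # True
```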

#### Exercise 4: **find the brewery that makes the strongest beer on average**

* a nice trick to avoid naming collisions is to wrap functions in an outer function
* aggregations by key are done with `foldby`
* remember why you need `binop` and `combine` functions!

In [413]:
def bad_ass_brewery(beers):
    
    def binop(sum_count, beer):
        sum_, count_ = sum_count # unpack tuple
        return (sum_ + beer['alcohol_pct'], count_ + 1)
    
    def combine(sum_count_1, sum_count_2):
        sum_1, count_1 = sum_count_1
        sum_2, count_2 = sum_count_2
        return (sum_1 + sum_2, count_1 + count_2)
    
    def top_avg_alcohol_pct(brewery_sum_count):
        brewery, (sum_, count_) = brewery_sum_count
        return sum_ / count_                 
    
    return beers.foldby('brewery', binop=binop, initial=(0,0), combine=combine).topk(1, key=top_avg_alcohol_pct)

In [412]:
bad_ass_brewery(beers).compute()

[('Staminee De Garre (Brouwerij Van Steenberge)', (11.5, 1))]
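
Why both `binop` and `combine`? A bag is split into partitions. `binop` folds one entry (a dict) into an accumulator (a tuple) within a partition, while `combine` merges two accumulators coming from different partitions. The sketch below simulates this with two plain lists standing in for partitions. The data and the helpers `fold_partition` and `merge` are made up, and Dask's actual scheduling is not modelled.

```python
from functools import reduce

partition_1 = [{'brewery': 'A', 'alcohol_pct': 5.0},
               {'brewery': 'B', 'alcohol_pct': 8.0}]
partition_2 = [{'brewery': 'A', 'alcohol_pct': 7.0},
               {'brewery': 'A', 'alcohol_pct': 9.0}]

def binop(sum_count, beer):
    # accumulator (tuple) + entry (dict) -> accumulator
    sum_, count_ = sum_count
    return (sum_ + beer['alcohol_pct'], count_ + 1)

def combine(sum_count_1, sum_count_2):
    # accumulator + accumulator -> accumulator
    return (sum_count_1[0] + sum_count_2[0], sum_count_1[1] + sum_count_2[1])

def fold_partition(partition):
    result = {}
    for beer in partition:
        key = beer['brewery']
        result[key] = binop(result.get(key, (0, 0)), beer)
    return result

def merge(result_1, result_2):
    merged = dict(result_1)
    for key, sum_count in result_2.items():
        merged[key] = combine(merged[key], sum_count) if key in merged else sum_count
    return merged

totals = reduce(merge, map(fold_partition, [partition_1, partition_2]))
print(totals)  # {'A': (21.0, 3), 'B': (8.0, 1)}
```

Brewery `A` appears in both partitions, so its partial results can only be merged by `combine`. `binop` would not work there because it expects a beer, not a `(sum, count)` tuple.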

#### Exercise 5: **find the top 3 breweries with the most diverse range of beer types**

* think of which initial value to provide to the binop
* your result should also provide the beer types per brewery

In [418]:
s = {8}

In [419]:
if s:
    print('bla')

bla


In [447]:
def diverse_breweries(beers, top=3):
    
    def binop(set_of_types, beer):
        if set_of_types:
            result = set_of_types
        else:
            result = set()
            
        result.add(beer['type'])
        return result
    
    def combine(set_1, set_2):
        return set.union(set_1, set_2)
    
    def count_types(brewery_set_of_types):
        brewery, set_of_types = brewery_set_of_types
        return len(set_of_types)
    
    return beers.foldby('brewery', binop=binop, initial=None, combine=combine).topk(top, key=count_types)

In [448]:
diverse_breweries(beers, top=1).compute()

[('Brouwerij Huyghe',
  {'Erkend Belgisch Abdijbier',
   'Erkend Belgisch Abdijbier, tripel',
   'Speciale Belge',
   'abdijbier',
   'abdijbier, tripel',
   'blond',
   'bruin',
   'donker, hoge gisting',
   'fruitbier',
   'glutenvrij pilsener',
   'hoge gisting',
   'honingbier',
   'kerstbier',
   'pils',
   'pils, biologisch',
   'roodbruin',
   'sterk blond',
   'sterk donker',
   'tripel',
   'witbier'})]
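
The choice of initial value deserves care when the accumulator is mutable. The plain-Python sketch below (a sequential stand-in, not Dask's `foldby`; the data and helper names are made up) shows what can go wrong when one shared `set()` is used as initial value and mutated in place, and why creating a new set avoids the problem.

```python
def fold_by_key(key, binop, initial, sequence):
    """Sequential stand-in for a keyed fold; `initial` starts every group."""
    result = {}
    for entry in sequence:
        k = key(entry)
        result[k] = binop(result.get(k, initial), entry)
    return result

# made-up (brewery, type) pairs
beers = [('A', 'blond'), ('B', 'tripel'), ('A', 'pils')]

def brewery(beer):
    return beer[0]

def mutating_binop(types, beer):
    types.add(beer[1])        # modifies the one shared initial set
    return types

def fresh_binop(types, beer):
    return types | {beer[1]}  # builds a new set, the initial value stays empty

shared = fold_by_key(brewery, mutating_binop, set(), beers)
fresh = fold_by_key(brewery, fresh_binop, frozenset(), beers)

print(shared['A'] is shared['B'])  # True: both breweries share one set
print(sorted(fresh['A']))          # ['blond', 'pils']
print(sorted(fresh['B']))          # ['tripel']
```

The solution above sidesteps the issue differently: it starts from `None` and creates a new set inside `binop` the first time a brewery is seen.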