# PyFP - The `toolz` Package

## Overview
Toolz provides a set of utility functions for iterators, functions, and dictionaries. These functions interoperate well and form the building blocks of common data analytic operations. They extend the standard libraries itertools and functools and borrow heavily from the standard libraries of contemporary functional languages.

Toolz provides a suite of functions which have the following functional virtues:

* Composable: They interoperate due to their use of core data structures.
* Pure: They don’t change their inputs or rely on external state.
* Lazy: They don’t run until absolutely necessary, allowing them to support large streaming data sets.

Toolz functions are pragmatic. They understand that most programmers have deadlines.

* Low Tech: They’re just functions, no syntax or magic tricks to learn
* Tuned: They’re profiled and optimized
* Serializable: They support common solutions for parallel computing

This lets developers write powerful programs that solve complex problems with relatively simple code. This code can be easy to understand without sacrificing performance. Toolz enables this approach, commonly associated with functional programming, within a natural Pythonic style suitable for most developers.

## Related documents
* github: https://github.com/pytoolz/toolz
* docs: https://toolz.readthedocs.io/en/latest/

## How to get it up and running
You may need to set up a Python virtual environment and then install toolz with pip.

```
$ mkdir project
$ cd project
$ virtualenv .env
$ source .env/bin/activate
(.env) $ pip install toolz

```

## Import into project

In [1]:
from toolz import *

# Examples from docs

## Function Purity
We call a function pure if it meets the following criteria:

1. It does not depend on hidden state, or equivalently it only depends on its inputs.
2. Evaluation of the function does not cause side effects.

In short, the internal work of a pure function is isolated from the rest of the program.

In [2]:
# A pure function
def min(x, y):
    if x < y:
        return x
    else:
        return y


# An impure function
exponent = 2

def powers(L):
    for i in range(len(L)):
        L[i] = L[i]**exponent
    return L
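
For contrast, a pure counterpart of `powers` might look like the sketch below. The name `pure_powers` and its default argument are illustrative assumptions, not part of toolz. The exponent becomes an explicit input, and the function builds a new list instead of overwriting its argument.

```python
def pure_powers(seq, exponent=2):
    """Hypothetical pure version of `powers`: no hidden state, no mutation."""
    return [x ** exponent for x in seq]


original = [1, 2, 3]
squares = pure_powers(original)
print(squares)    # [1, 4, 9]
print(original)   # [1, 2, 3], the input is left unchanged
```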

## State
Impure functions are often more efficient but also require that the programmer “keep track” of the state of several variables. Keeping track of this state becomes increasingly difficult as programs grow in size. By eschewing state, programmers are able to conceptually scale out to solve much larger problems. The loss of performance is often negligible compared to the freedom to trust that your functions work as expected on your inputs.

Maintaining state provides efficiency at the cost of surprises. Pure functions produce no surprises and so lighten the mental load of the programmer.

## Laziness

Lazy iterators evaluate only when necessary. They allow us to semantically manipulate large amounts of data while keeping very little of it actually in memory. They act like lists but don’t take up space.

Example - A Tale of Two Cities
We open a file containing the text of the classic novel “A Tale of Two Cities” by Charles Dickens ([link](http://www.gutenberg.org/files/98/98-0.txt)).

```python
>>> book = open('tale-of-two-cities.txt')
>>> next(book)
"It was the best of times,"

>>> next(book)
"it was the worst of times,"
```


## Computation
We can lazily operate on lazy iterators without doing any actual computation. For example, let’s read the book in upper case:

```python
>>> from toolz import map  # toolz' map is lazy by default
>>> loud_book = map(str.upper, book)
>>> next(loud_book)
"IT WAS THE AGE OF WISDOM,"
>>> next(loud_book)
"IT WAS THE AGE OF FOOLISHNESS,"
```

It is as if we applied the function `str.upper` to every line of the book, yet that first line of code completes instantaneously. Python does the uppercasing work only when it becomes necessary, i.e. when you call `next` to ask for another line.
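
One way to observe this deferral is to record when the mapped function actually runs. The sketch below is only an illustration: it uses the builtin `map` (lazy in Python 3) on a small in-memory iterator standing in for the book, and the helper `shout` and the list `calls` are made-up names. `shout` is deliberately impure so that its calls can be observed.

```python
calls = []  # records every line that has actually been uppercased


def shout(line):
    # Hypothetical helper: impure on purpose, so we can see when it runs
    calls.append(line)
    return line.upper()


# A small iterator standing in for the open book file
lines = iter(["it was the best of times,\n", "it was the worst of times,\n"])
loud_lines = map(shout, lines)

print(calls)                  # [] because no work has been done yet
first_line = next(loud_lines)
print(first_line)             # IT WAS THE BEST OF TIMES,
print(len(calls))             # 1 because only the requested line was processed
```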

## Reductions
You can operate on lazy iterators just as you would with `lists`, `tuples`, or `sets`. You can use them in for loops as in

``` python
for line in loud_book:
    ...
```    
You can instantiate them all into memory by calling them with the constructors `list` or `tuple`.
``` python
loud_book = list(loud_book)
```

Of course, if they are very large then this might be unwise. Often we use laziness to avoid loading large datasets into memory at once. Many computations on large datasets don’t require access to all of the data at a single time. In particular, reductions (like `sum`) often take large amounts of sequential data (like `[1, 2, 3, 4]`) and produce much more manageable results (like `10`), and can do so just by viewing the data a little bit at a time. For example, we can count all of the letters in the Tale of Two Cities trivially using functions from `toolz`:

``` python
>>> from toolz import concat, frequencies
>>> letters = frequencies(concat(loud_book))
>>> letters
{'A': 48036,
 'B': 8402,
 'C': 13812,
 'D': 28000,
 'E': 74624,
 ...
```

In this case `frequencies` is a sort of reduction. At no time were more than a few hundred bytes of Tale of Two Cities necessarily in memory. We could just as easily have done this computation on the entire Gutenberg collection or on Wikipedia. In this case we are limited by the size and speed of our hard drive and not by the capacity of our memory.
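
To see why such a count needs so little memory, here is a simplified sketch of a frequency count written as a plain loop. It is not toolz’s implementation. The name `count_items` is an assumption for illustration, and `itertools.chain.from_iterable` plays the role that `concat` plays above.

```python
from itertools import chain


def count_items(seq):
    """Hypothetical stand-in for `frequencies`: tally items one at a time."""
    counts = {}
    for item in seq:  # only the current item and the tallies are held
        counts[item] = counts.get(item, 0) + 1
    return counts


# A tiny iterator of lines standing in for the open book file
lines = iter(["ab\n", "ba\n"])
letters = count_items(chain.from_iterable(lines))
print(letters)   # {'a': 2, 'b': 2, '\n': 2}
```

Only the dictionary of tallies grows, and its size depends on the number of distinct characters rather than on the length of the text.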

## Control Flow
Programming is hard when we think simultaneously about several concepts. Good programming breaks down big problems into small problems and builds up small solutions into big solutions. By this practice the need for simultaneous thought is restricted to only a few elements at a time.

All modern languages provide mechanisms to build data into data structures and to build functions out of other functions. The third element of programming, besides data and functions, is control flow. Building complex control flow out of simple control flow presents deeper challenges.

Each element in a computer program is either

* A variable or value literal like `x`, `total`, or `5`
* A function or computation like the `+` in `x + 1`, the function `fib` in `fib(3)`, the method `split` in `line.split(',')`, or the `=` in `x = 0`
* Control flow like `if`, `for`, or `return`

Here is a piece of code; see if you can label each term as either variable/value, function/computation, or control flow

In [3]:
def fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = b, a + b
    return b

Programming is hard when we have to juggle many code elements of each type at the same time. Good programming is about managing these three elements so that the developer is only required to think about a handful of them at a time. For example we might collect many integer variables into a list of integers or build a big function out of smaller ones. While we have natural ways to manage data and functions, control flow presents more of a challenge.

We organize our data into **data structures** like lists, dictionaries, or objects in order to group related data together – this allows us to manipulate large collections of related data as if we were only manipulating a single entity.

We **build large functions out of smaller ones**; enabling us to break up a complex task like doing laundry into a sequence of simpler tasks.

In [4]:
def do_laundry(clothes, coins):
    wet_clothes = wash(clothes, coins)
    dry_clothes = dry(wet_clothes, coins)
    return fold(dry_clothes)

Control flow is more challenging; how do we break down complex control flow into simpler pieces that fit in our brain? How do we encapsulate commonly recurring patterns?

Let’s motivate this with an example of a common control structure: applying a function to each element in a list. Imagine we want to download the HTML source for a number of webpages.

``` python
from urllib.request import urlopen

urls = ['http://www.google.com', 'http://www.wikipedia.com', 'http://www.apple.com']
html_texts = []
for item in urls:
    html_texts.append(urlopen(item))
```

Or maybe we want to compute the Fibonacci numbers on a particular set of integers

``` python
integers = [1, 2, 3, 4, 5]
fib_integers = []
for item in integers:
    fib_integers.append(fib(item))
```

These two unrelated applications share an identical control flow pattern. They apply a function (`urlopen` or `fib`) onto each element of an input list (`urls` or `integers`), appending the result onto an output list. Because this control flow pattern is so common we give it a name, `map`, and say that we map a function (like `urlopen`) onto a list (like `urls`).

Because Python can treat functions like variables, we can encode this control pattern into a higher-order function as follows:

```python
def map(function, sequence):
    output = []
    for item in sequence:
        output.append(function(item))
    return output
```

This allows us to simplify our code above to the following pithy solutions:


``` python
html_texts = map(urlopen, urls)
fib_integers = map(fib, integers)
```

Experienced Python programmers know that this control pattern is so common that it has been elevated to the status of syntax with the list comprehension:

``` python
html_texts = [urlopen(url) for url in urls]
```

# Currying
Traditionally, partial evaluation of functions is handled with the `partial` higher-order function from `functools`. Currying provides syntactic sugar for the same thing.

``` python
>>> from functools import partial
>>> from operator import mul
>>> from toolz import curry
>>> double = partial(mul, 2)    # Partial evaluation
>>> double = curry(mul)(2)      # Currying
```

This syntactic sugar is valuable when developers chain several higher order functions together.

Often when composing smaller functions to form big ones we need partial evaluation. 

In general:
``` python
>>> def f(x, y, z):
...     # Do stuff with x, y, and z
...     ...

>>> # partially evaluate f with known values a and b
>>> def g(z):
...     return f(a, b, z)

>>> # alternatively, we could use `partial`
>>> g = partial(f, a, b)
```

In this context currying is just syntactic sugar for partial evaluation. A curried function partially evaluates if it does not receive enough arguments to compute a result.

In [5]:
from toolz import curry

@curry              # We can use curry as a decorator
def mul(x, y):
    return x * y

double = mul(2)     # mul didn't receive enough arguments to evaluate
                    # so it holds onto the 2 and waits, returning a
                    # partially evaluated function, double
print(double(5))

10
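
To make the mechanism concrete, here is a minimal sketch of how a curry-like decorator could be written. It is not toolz’s implementation: `simple_curry` and `add3` are hypothetical names, and the sketch assumes the wrapped function takes only plain positional parameters (no defaults, keyword arguments, or `*args`).

```python
import inspect


def simple_curry(func):
    """Simplified currying; assumes `func` has only positional parameters."""
    n_params = len(inspect.signature(func).parameters)

    def curried(*args):
        if len(args) >= n_params:
            return func(*args)  # enough arguments: evaluate
        # too few arguments: hold onto them and wait for the rest
        return lambda *more: curried(*args, *more)

    return curried


@simple_curry
def add3(x, y, z):
    return x + y + z


print(add3(1)(2)(3))   # 6
print(add3(1, 2)(3))   # 6
print(add3(1, 2, 3))   # 6
```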


# Streaming Analytics
The toolz functions can be composed to analyze large streaming datasets. Toolz supports common analytics patterns like the selection, grouping, reduction, and joining of data through pure composable functions. These functions often have analogs to familiar operations in other data analytics platforms like SQL or Pandas.

Throughout this document we’ll use this simple dataset of accounts

In [6]:
accounts = [(1, 'Alice', 100, 'F'),  # id, name, balance, gender
            (2, 'Bob', 200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis', 50, 'M'),
            (5, 'Edith', 300, 'F')]

## Selecting with `map` and `filter`

Simple projection and linear selection from a sequence are achieved through the standard functions `map` and `filter`.

``` sql
SELECT name, balance
FROM accounts
WHERE balance > 150;
```

These functions correspond to the SQL commands `SELECT` and `WHERE`.

In [7]:
from toolz.curried import pipe, map, filter, get
pipe(accounts, filter(lambda acc: acc[2] > 150),
               map(get([1, 2])),
               list)

[('Bob', 200), ('Edith', 300)]
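
For comparison, the same selection can be written as a plain list comprehension. This sketch repeats the `accounts` data so that it runs on its own; the names `_id` and `_gender` are just placeholders for the columns we ignore.

```python
accounts = [(1, 'Alice', 100, 'F'),  # id, name, balance, gender
            (2, 'Bob', 200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis', 50, 'M'),
            (5, 'Edith', 300, 'F')]

# WHERE balance > 150 is the `if` clause; SELECT name, balance is the tuple
selected = [(name, balance)
            for (_id, name, balance, _gender) in accounts
            if balance > 150]
print(selected)   # [('Bob', 200), ('Edith', 300)]
```

The `pipe` version reads top to bottom in the order the data flows, which tends to pay off as more steps are chained together.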

## Split-apply-combine with `groupby` and `reduceby`
We separate split-apply-combine operations into the following two concepts

1. Split the dataset into groups by some property
2. Reduce each of the groups with some synopsis function

Toolz supports this common workflow with

1. a simple in-memory solution
2. a more sophisticated streaming solution.

### In Memory Split-Apply-Combine
The in-memory solution depends on the functions `groupby` to split and `valmap` to apply/combine.

``` sql
SELECT gender, SUM(balance)
FROM accounts
GROUP BY gender;
```

We first show these two functions piece by piece to show the intermediate groups.

In [8]:
from toolz import groupby, valmap, compose
from toolz.curried import get, pluck

groupby(get(3), accounts)

{'F': [(1, 'Alice', 100, 'F'), (5, 'Edith', 300, 'F')],
 'M': [(2, 'Bob', 200, 'M'), (3, 'Charlie', 150, 'M'), (4, 'Dennis', 50, 'M')]}

In [9]:
valmap(compose(sum, pluck(2)), _)  # `_` is the previous result in an interactive session

{'F': 400, 'M': 400}
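
As a rough picture of what these two steps do, the sketch below performs the same split and apply/combine with only the standard library. It is an illustration under simplifying assumptions, not how toolz implements `groupby` or `valmap`. Note that every record is stored in `groups` before any summing happens.

```python
from collections import defaultdict

accounts = [(1, 'Alice', 100, 'F'),  # id, name, balance, gender
            (2, 'Bob', 200, 'M'),
            (3, 'Charlie', 150, 'M'),
            (4, 'Dennis', 50, 'M'),
            (5, 'Edith', 300, 'F')]

# Split: collect whole records under their gender key (all held in memory)
groups = defaultdict(list)
for account in accounts:
    groups[account[3]].append(account)

# Apply/combine: reduce each group's balances with sum
totals = {gender: sum(acc[2] for acc in members)
          for gender, members in groups.items()}
print(totals)   # {'F': 400, 'M': 400}
```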

### Streaming Split-Apply-Combine
The groupby function collects the entire dataset in memory into a dictionary. While convenient, the groupby operation is not streaming and so this approach is limited to datasets that can fit comfortably into memory.

Toolz achieves streaming split-apply-combine with `reduceby`, a function that performs a simultaneous reduction on each group as the elements stream in. To understand this section you should first be familiar with the function `reduce` (a builtin in Python 2, available as `functools.reduce` in Python 3).

The reduceby operation takes a key function, like `get(3)` or `lambda x: x[3]`, and a binary operator like `add` or `lesser = lambda acc, x: acc if acc < x else x`. 

It applies the key function to each item in succession, accumulating a running total for each key by combining each new value with the previous total using the binary operator. It can’t accept full reduction operations like `sum` or `min`, as these require access to the entire group at once. Here is a simple example:

In [10]:
from toolz import reduceby

def iseven(n):
    return n % 2 == 0

def add(x, y):
    return x + y

reduceby(iseven, add, [1, 2, 3, 4])

{False: 4, True: 6}

The even numbers are added together `(2 + 4 = 6)` into group `True`, and the odd numbers are added together `(1 + 3 = 4)` into group `False`.

Note that we have to replace the reduction sum with the binary operator add. The incremental nature of add allows us to do the summation work as new data comes in. The use of binary operators like add over full reductions like sum enables computation on very large streaming datasets.

The challenge to using reduceby often lies in the construction of a suitable binary operator. Here is the solution for our accounts example that adds up the balances for each group:

```python
>>> binop = lambda total, account: total + account[2]
>>> reduceby(get(3), binop, accounts, 0)
{'F': 400, 'M': 400}
```

This construction supports datasets that are much larger than available memory. Only the output must be able to fit comfortably in memory and this is rarely an issue, even for very large split-apply-combine computations.
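
The reason only the output needs to fit in memory becomes clearer from a simplified sketch of the idea. The function `simple_reduceby` below is a hypothetical stand-in, not toolz’s implementation. It always requires an initial value, and the generator `stream_accounts` is a made-up source that yields one record at a time.

```python
def simple_reduceby(key, binop, seq, init):
    """Simplified sketch: keep one running total per key while streaming."""
    totals = {}
    for item in seq:  # each item is seen once and then discarded
        k = key(item)
        totals[k] = binop(totals.get(k, init), item)
    return totals


def stream_accounts():
    # Hypothetical data source yielding (id, name, balance, gender) records
    yield (1, 'Alice', 100, 'F')
    yield (2, 'Bob', 200, 'M')
    yield (3, 'Charlie', 150, 'M')
    yield (4, 'Dennis', 50, 'M')
    yield (5, 'Edith', 300, 'F')


result = simple_reduceby(lambda acc: acc[3],
                         lambda total, acc: total + acc[2],
                         stream_accounts(),
                         0)
print(result)   # {'F': 400, 'M': 400}
```

The dictionary holds one entry per group, so its size depends on the number of distinct keys rather than on the number of records.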

## Semi-Streaming join
We register multiple datasets together with join. Consider a second dataset storing addresses by ID

In [11]:
addresses = [(1, '123 Main Street'),  # id, address
             (2, '5 Adams Way'),
             (5, '34 Rue St Michel')]

We can join this dataset against our accounts dataset by specifying attributes which register different elements with each other; in this case they share a common first column, id.

```sql
SELECT accounts.name, addresses.address
FROM accounts, addresses
WHERE accounts.id = addresses.id;
```

In [12]:
from toolz import join, first

result = join(first, accounts,
              first, addresses)

for ((id, name, bal, gender), (id, address)) in result:
    print((name, address))

('Alice', '123 Main Street')
('Bob', '5 Adams Way')
('Edith', '34 Rue St Michel')


Join takes four main arguments: a left and right key function and a left and right sequence. It returns a sequence of pairs of matching items. In our case the return value of `join` is a sequence of pairs of tuples such that the first element of each tuple (the ID) is the same. In the example above we unpack this pair of tuples to get the fields that we want (`name` and `address`) from the result.

## Join on arbitrary functions / data

Those familiar with SQL are accustomed to this kind of join on columns. However, a functional join is more general than this; it doesn’t need to operate on tuples, and key functions do not need to get particular columns. In the example below we match numbers from two collections so that exactly one is even and one is odd.

In [13]:
def iseven(x):
    return x % 2 == 0
def isodd(x):
    return x % 2 == 1

list(join(iseven, [1, 2, 3, 4],
          isodd, [7, 8, 9]))

[(2, 7), (4, 7), (1, 8), (3, 8), (2, 9), (4, 9)]

## Semi-Streaming Join
The toolz `join` operation fully evaluates the left sequence and streams the right sequence through memory. Thus, if streaming support is desired, the larger of the two sequences should always occupy the right side of the join.

## Algorithmic Details
The semi-streaming join operation in toolz is asymptotically optimal. Computationally it is linear in the size of the input + output. In terms of storage the left sequence must fit in memory but the right sequence is free to stream.

The results are not normalized, as in SQL, in that they permit repeated values. If normalization is desired, consider composing with the function `unique` (note that `unique` is not fully streaming).
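
The cost profile described above can be illustrated with a simplified hash-join sketch. This is an assumption about the general technique rather than toolz’s actual code, and `simple_join` is a hypothetical name. The left sequence is indexed in a dictionary, then the right sequence is streamed past that index.

```python
from collections import defaultdict


def simple_join(leftkey, leftseq, rightkey, rightseq):
    """Simplified semi-streaming join: index the left, stream the right."""
    index = defaultdict(list)
    for item in leftseq:  # the whole left sequence is held in memory
        index[leftkey(item)].append(item)
    for right in rightseq:  # the right sequence is consumed one item at a time
        for left in index.get(rightkey(right), ()):
            yield (left, right)


def iseven(x):
    return x % 2 == 0


def isodd(x):
    return x % 2 == 1


pairs = list(simple_join(iseven, [1, 2, 3, 4],
                         isodd, iter([7, 8, 9])))
print(pairs)   # [(2, 7), (4, 7), (1, 8), (3, 8), (2, 9), (4, 9)]
```

Each left item is indexed once and each right item triggers one dictionary lookup, so the total work is proportional to the size of the input plus the number of pairs produced.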

## More Complex Example
The accounts example above connects two one-to-one relationships, accounts and addresses; there was exactly one name per ID and one address per ID. This need not be the case. The join abstraction is sufficiently flexible to join one-to-many or even many-to-many relationships. The following example finds city/person pairs where that person has a friend who has a residence in that city. This is an example of joining two many-to-many relationships, because a person may have many friends and because a friend may have many residences.

In [14]:
friends = [('Alice', 'Edith'),
           ('Alice', 'Zhao'),
           ('Edith', 'Alice'),
           ('Zhao', 'Alice'),
           ('Zhao', 'Edith')]

cities = [('Alice', 'NYC'),
          ('Alice', 'Chicago'),
          ('Dan', 'Sydney'),
          ('Edith', 'Paris'),
          ('Edith', 'Berlin'),
          ('Zhao', 'Shanghai')]

# Vacation opportunities
# In what cities do people have friends?
result = join(second, friends,
              first, cities)

for ((name, friend), (friend, city)) in sorted(unique(result)):
    print((name, city))

('Alice', 'Berlin')
('Alice', 'Paris')
('Alice', 'Shanghai')
('Edith', 'Chicago')
('Edith', 'NYC')
('Zhao', 'Chicago')
('Zhao', 'NYC')
('Zhao', 'Berlin')
('Zhao', 'Paris')


Join is computationally powerful:

* It is expressive enough to cover a wide set of analytics operations
* It runs in linear time relative to the size of the input and output
* Only the left sequence must fit in memory

# Tips and Tricks
Toolz functions can be combined to make functions that, while common, aren’t a part of toolz’s standard library. This section presents a few of these recipes.

### `keyjoin(leftkey, leftseq, rightkey, rightseq)`

In [15]:
from itertools import starmap
from toolz import join, merge

def keyjoin(leftkey, leftseq, rightkey, rightseq):
    return starmap(merge, join(leftkey, leftseq, rightkey, rightseq))

In [16]:
people = [{'id': 0, 'name': 'Anonymous Guy', 'location': 'Unknown'},
          {'id': 1, 'name': 'Karan', 'location': 'San Francisco'},
          {'id': 2, 'name': 'Matthew', 'location': 'Oakland'}]
hobbies = [{'person_id': 1, 'hobby': 'Tennis'},
           {'person_id': 1, 'hobby': 'Acting'},
           {'person_id': 2, 'hobby': 'Biking'}]

list(keyjoin('id', people, 'person_id', hobbies))

[{'id': 1,
  'name': 'Karan',
  'location': 'San Francisco',
  'person_id': 1,
  'hobby': 'Tennis'},
 {'id': 1,
  'name': 'Karan',
  'location': 'San Francisco',
  'person_id': 1,
  'hobby': 'Acting'},
 {'id': 2,
  'name': 'Matthew',
  'location': 'Oakland',
  'person_id': 2,
  'hobby': 'Biking'}]