Application
-------------

We load data from text files, parse the JSON, clean up the data, and save it again in a more convenient binary format (HDF5).

In [None]:
%load_ext snakeviz

In [None]:
from glob import glob
import json
import pandas as pd

In [None]:
%%time
filenames = sorted(glob('data/data-*.json'))

for fn in filenames:
    with open(fn) as f:
        data = [json.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    df.to_hdf(fn.replace('.json', '.h5'), key='/data')

Parallel Execution
--------------------

We use `concurrent.futures.ProcessPoolExecutor` to parallelize this operation.

We pull the body of the loop out into a separate function so that it can be handed to the executor.

Whenever we have code like the following:

```python
for x in L:
    f(x)
```

We can instead write it as follows:

```python
from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor()

list(e.map(f, L))
```
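
As a minimal, self-contained sketch of that rewrite, the example below uses a hypothetical `square` function and a short list in place of `f` and `L`. Two details are additions to the pattern above rather than part of it: the `with` block, which shuts the pool down once the work is finished, and the `if __name__ == "__main__":` guard, which scripts need on platforms that start worker processes by re-importing the main module.

```python
from concurrent.futures import ProcessPoolExecutor


def square(x):
    # Hypothetical stand-in for f; it is defined at module level
    # so that worker processes can look it up by name.
    return x * x


def sequential(items):
    # The original pattern: a plain loop, collecting the results.
    return [square(x) for x in items]


def parallel(items):
    # The same work spread over a pool of worker processes.
    # map() returns an iterator over the results, so list() collects
    # them, in the same order as the inputs.
    with ProcessPoolExecutor() as e:
        return list(e.map(square, items))


if __name__ == "__main__":
    L = [1, 2, 3, 4]
    print(sequential(L))
    print(parallel(L))
```

Results come back in input order, so `parallel(L)` and `sequential(L)` return the same list.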

In [None]:
def load_parse_store(fn):
    with open(fn) as f:
        data = [json.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    df.to_hdf(fn.replace('.json', '.h5'), key='/data')

In [None]:
%%time

from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor()

list(e.map(load_parse_store, filenames))

Parallelism isn't everything
--------------------------------

Knowing about and importing the optimized `ujson` library can be just as effective as multi-core execution, and the two can be combined.
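
Because `ujson` provides a `loads` function that can stand in for the standard library's, the swap can be a one-line import. The sketch below is one hedged way to compare the two: it falls back to the built-in `json` module if `ujson` is not installed, and the `parse_lines` and `time_parse` helpers and the sample records are hypothetical, made up for illustration. Actual speedups depend on the data and on the library versions.

```python
import json as stdlib_json
import time

try:
    import ujson as fast_json  # optional third-party dependency
except ImportError:
    fast_json = stdlib_json  # fall back so the sketch still runs


def parse_lines(lines, loads):
    # Parse one JSON document per line with the given loads function.
    return [loads(line) for line in lines]


def time_parse(lines, loads):
    # Return (parsed records, elapsed seconds) for a single pass.
    start = time.perf_counter()
    records = parse_lines(lines, loads)
    return records, time.perf_counter() - start


# Made-up sample data: 10,000 small line-delimited JSON records.
lines = [stdlib_json.dumps({"id": i, "name": "user%d" % i}) for i in range(10_000)]

slow_records, slow_seconds = time_parse(lines, stdlib_json.loads)
fast_records, fast_seconds = time_parse(lines, fast_json.loads)

print("json:  %.4f s" % slow_seconds)
print("ujson: %.4f s" % fast_seconds)  # same parser as above if ujson is missing
```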

In [None]:
import ujson as json

In [None]:
%%time
filenames = sorted(glob('data/data-*.json'))

for fn in filenames:
    with open(fn) as f:
        data = [json.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    df.to_hdf(fn.replace('.json', '.h5'), key='/data')

In [None]:
%%time

from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor()

list(e.map(load_parse_store, filenames))

History: multiprocessing.Pool
--------------------------------

Previously, people did multi-process computations with the `multiprocessing.Pool` object, which behaves more or less identically.

However, today most library designers are coordinating around the `concurrent.futures` interface, so it's wise to move over.

In [None]:
%%time 

from multiprocessing import Pool
p = Pool()

list(p.map(load_parse_store, filenames))
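
One practical consequence of the shared `concurrent.futures` interface is that code written against it does not need to know which kind of executor it was handed. The sketch below illustrates this with `ThreadPoolExecutor` from the standard library; `run_all` and `word_count` are hypothetical names chosen for illustration. Passing a `ProcessPoolExecutor` instead should work the same way, as long as the function and its arguments can be pickled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def word_count(text):
    # Hypothetical stand-in for a unit of work such as load_parse_store.
    return len(text.split())


def run_all(executor, func, items):
    # Works with any concurrent.futures.Executor: submit one task per
    # item, then gather the results as the tasks finish.
    futures = {executor.submit(func, item): item for item in items}
    results = {}
    for future in as_completed(futures):
        results[futures[future]] = future.result()
    return results


texts = ["a b c", "hello world", "one"]

with ThreadPoolExecutor(max_workers=2) as executor:
    counts = run_all(executor, word_count, texts)

print(counts)
```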