# Lesson 04: Concurrency & Parallelism

## 1. Parallelism

Compared to concurrency, parallelism is easier to use and is _usually_ easier to reason about and design for.

We can explore this through a data-processing scenario: scanning a very large CSV or JSON document and keeping only some of its columns or keys.
This is usually done to select a relevant subset from a very broad data set, often sourced from a third party.

e.g. You want to download the Wikipedia dataset and filter for actors, movie titles, and release years, so that you can make a simple but comprehensive list.

e.g. 2: You want to make a demo for this lesson, so you have to fake the data before demonstrating the filtering.
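
The filtering step in this scenario can be sketched in a few lines. This is a minimal illustration rather than the lesson's implementation: `filter_keys` and the sample field names are hypothetical, and the input is assumed to be NDJSON (one JSON object per line).

```python
import io
import json
from typing import Iterable, Iterator


def filter_keys(lines: Iterable[str], keep: set) -> Iterator[dict]:
    """Yield each JSON object reduced to the keys listed in `keep`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {k: v for k, v in record.items() if k in keep}


# Hypothetical sample standing in for a much larger file.
sample = io.StringIO(
    '{"title": "A", "year": 1999, "budget": 10}\n'
    '{"title": "B", "year": 2004, "budget": 25}\n'
)
filtered = list(filter_keys(sample, keep={"title", "year"}))
print(filtered)
```

Because `filter_keys` is a generator, it only holds one line in memory at a time, which matters once the input is far larger than this sample.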

### Example 1: Generating a text file

1. We need to generate a _very_ large NDJSON file (newline-delimited JSON). For simplicity's sake, every line is valid JSON and shares the same schema.

    ```json
    {"a": "B"}
    {"a": "C"}
    ```

1.1. There are many language- and OS-level optimisations around doing the _exact_ same thing, such as performing the same calculation over identical lines of a file. This means that we have to randomise the values in order to make a good test file.

    > use `faker`

In [4]:
from faker import (Faker, providers)
f = Faker()
f.add_provider(providers.misc)
f.add_provider(providers.geo)

In [5]:
def fkr_n(fkr, n): return [fkr() for _ in range(n)]

In [10]:
def gen_movie():
    return {
        "titleId":         f.uuid4(),
        "ordering":        f.random_int(),
        "title":           f.catch_phrase(),
        "region":          f.locale(),
        "language":        f.language_name(),
        "types":           fkr_n(f.name, 5),
        "attributes":      fkr_n(f.name, 5),
        "isOriginalTitle": f.boolean(),
        "titleType":       f.domain_name(),
        "primaryTitle":    f.catch_phrase(),
        "originalTitle":   ":".join([f.company(), f.catch_phrase()]),
        "isAdult":         f.boolean(),
        "startYear":       f.date(),
        "endYear":         f.year(),
        "runtimeMinutes":  f.random_int(),
        "genres":          fkr_n(f.country, 5),
        "tconst":          f.hex_color(),
        "directors":       fkr_n(f.name, 2),
        "writers":         fkr_n(f.name, 15),
        "actors":          fkr_n(f.name, 50),
    }

In [14]:
import json
from IPython.display import JSON
JSON(gen_movie())

<IPython.core.display.JSON object>

Woohoo! Now we just need to write this to a file.

Let's make a function that loops and `yield`s data.

In [19]:
class MovieTable:
    @staticmethod
    def records(fpath, n_records=10):
        # Lazily yield one fake movie at a time; `fpath` is accepted but not used yet.
        for _ in range(n_records):
            yield gen_movie()

def test(g): list(g)

In [20]:
%timeit test(MovieTable.records("/tmp/movies.ndjson", 20))

466 ms ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
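
The generator above yields records but does not write them anywhere yet. A minimal sketch of the writing step follows. `write_ndjson` and `fake_records` are hypothetical helpers, and `fake_records` is a stand-in for `gen_movie()` so that the block runs without `faker`.

```python
import json
import os
import random
import tempfile
from typing import Iterable, Iterator


def fake_records(n_records: int) -> Iterator[dict]:
    """Stand-in record generator; yields small random dicts."""
    for i in range(n_records):
        yield {"id": i, "score": random.random()}


def write_ndjson(fpath: str, records: Iterable[dict]) -> int:
    """Write one JSON document per line and return the number of lines written."""
    written = 0
    with open(fpath, "w", encoding="utf-8") as ostream:
        for record in records:
            ostream.write(json.dumps(record) + "\n")
            written += 1
    return written


# Write to a temporary directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.ndjson")
    n_written = write_ndjson(path, fake_records(5))
    with open(path, encoding="utf-8") as istream:
        lines = istream.read().splitlines()

print(n_written, len(lines))
```

Keeping generation and writing as separate functions means the same writer can be reused with any iterable of dicts, including `MovieTable.records(...)`.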


But I need to show you the CPU usage _per core_!

What about threading?

> Use psutil

In [24]:
from itertools import count
import json
import asyncio
from dataclasses import dataclass
import functools

import psutil

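@dataclass
class Timer:
    f: object                # zero-argument callable invoked once per tick
    sentinel: bool = False   # set to True to end the polling loop

    async def cpu(self):
        # Call `f` once a second until the sentinel flips or the task is cancelled.
        while not self.sentinel:
            self.f()
            await asyncio.sleep(1)

    def task(self):
        # Schedule the polling loop on the running event loop.
        return asyncio.create_task(self.cpu())

    async def stop(self):
        # Cooperative alternative to cancelling the task.
        self.sentinel = True

async def run_with_timer(f: functools.partial, t: Timer):
    tsk = t.task()   # start sampling in the background
    await f()        # run the workload to completion
    tsk.cancel()     # then stop the sampler

    try:
        await tsk
    except asyncio.CancelledError:
        print("finished")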

In [25]:
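class AMovieTable:
    @staticmethod
    async def records(fpath, n_records=10):
        with open(fpath, "w") as ostream:
            for _ in range(n_records):
                # One JSON document per line (NDJSON).
                print(json.dumps(gen_movie()), file=ostream)
                # Yield to the event loop so the CPU sampler gets a turn.
                await asyncio.sleep(0)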

def cpu():
    print("\t".join(map(str, psutil.cpu_percent(percpu=True))))
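
The `await asyncio.sleep(0)` inside `records` is what lets the background sampler run at all. The sketch below shows that effect using only the standard library. The names `sampler`, `worker`, and `count_ticks` are hypothetical, and the workload is a stand-in for record generation.

```python
import asyncio


async def sampler(ticks: list) -> None:
    """Background task: record one tick each time the event loop lets it run."""
    while True:
        ticks.append(1)
        await asyncio.sleep(0)


async def worker(n_steps: int, cooperative: bool) -> None:
    """Foreground workload that optionally yields after every step."""
    for _ in range(n_steps):
        sum(range(1000))  # stand-in for CPU-bound work
        if cooperative:
            await asyncio.sleep(0)  # hand control back to the event loop


async def count_ticks(cooperative: bool, n_steps: int = 100) -> int:
    """Run the worker next to the sampler and report how often the sampler ran."""
    ticks: list = []
    task = asyncio.create_task(sampler(ticks))
    await worker(n_steps, cooperative)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(ticks)


if __name__ == "__main__":
    # In a notebook, `await count_ticks(True)` can be used directly instead.
    print("cooperative ticks:", asyncio.run(count_ticks(True)))
    print("blocking ticks:   ", asyncio.run(count_ticks(False)))
```

Without the yield, the worker never suspends, so the sampler task is cancelled before it takes a single step and the tick count stays at zero. Both coroutines still share one thread, so this is concurrency, not parallelism.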

In [26]:
await run_with_timer(
    functools.partial(AMovieTable.records, "/tmp/movies.ndjson", 500),
    Timer(cpu)
)

1.5	1.6	1.8	1.9	1.9	1.5
0.0	1.0	0.0	1.0	0.0	100.0
1.9	1.9	1.9	0.0	1.9	100.0
1.9	2.8	1.9	0.9	0.9	99.1
0.0	0.0	0.0	0.9	1.0	100.0
0.9	1.9	0.9	0.0	1.0	100.0
0.9	0.0	0.0	0.9	0.9	100.0
2.8	3.7	0.9	1.0	0.0	100.0
0.0	0.9	0.9	0.0	0.9	100.0
2.8	2.8	0.0	1.0	1.0	100.0
0.9	0.0	0.0	0.9	1.0	100.0
1.0	2.8	1.9	0.0	1.0	100.0
0.0	0.0	0.0	0.0	0.9	100.0
finished
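
The two measurements in this section can be cross-checked with a little arithmetic. The sketch below only reuses figures already shown (466 ms for 20 records, and 500 records written). The variable names are illustrative, and the estimate ignores the extra cost of `json.dumps` and file I/O.

```python
timeit_seconds = 0.466   # mean from the %timeit run
timeit_records = 20      # records generated per %timeit loop
n_records = 500          # records written in the asyncio run

per_record = timeit_seconds / timeit_records
estimate = per_record * n_records
print(f"{per_record * 1000:.1f} ms per record, about {estimate:.1f} s for {n_records} records")
```

An estimate of a little under 12 seconds is in line with the 13 one-second samples printed above. In every sample after the first, only one core is saturated, which is what a single-threaded event loop would be expected to produce.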


In [23]:
!wc -l "/tmp/movies.ndjson" && ls -alh "/tmp/movies.ndjson"

500 /tmp/movies.ndjson
-rw-r--r-- 1 jovyan users 1.8M Nov 22 15:41 /tmp/movies.ndjson


In [None]:
asyncio.Queue(maxsize=0)