# Lesson 04: Concurrency & Parallelism

## 1. Parallelism

Compared to concurrency, parallelism is easier to use, and is _usually_ easier to reason about and design for.

We can explore this through a data-processing scenario: going through a very large CSV/JSON document and filtering out columns or keys.
This is typically done to select a relevant subset from a very broad data set, often one sourced from a third party.

e.g. 1: You want to download the Wikipedia dataset and filter for actors, movie titles and release years, so that you can make a simple and comprehensive list.

e.g. 2: You want to make a demo example for this lesson, so you have to fake the data before demonstrating the filtering.
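
The filtering step described above can be sketched with the standard library alone. This is a minimal illustration, not the lesson's own implementation: `filter_keys`, the `keep` parameter and the two-line sample are hypothetical names and data introduced here.

```python
import io
import json


def filter_keys(lines, keep):
    """Yield one dict per non-empty NDJSON line, keeping only the keys in `keep`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield {k: record[k] for k in keep if k in record}


# Hypothetical two-line sample standing in for a much larger file.
sample = io.StringIO(
    '{"title": "A", "startYear": 1999, "isAdult": false}\n'
    '{"title": "B", "startYear": 2004, "genres": ["x"]}\n'
)
filtered = list(filter_keys(sample, keep=("title", "startYear")))
```

Because the function reads one line at a time, it never needs the whole file in memory, which is what makes NDJSON convenient for very large inputs.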

### Example 1: Generating a text file

1. We need to generate a _very_ large NDJSON file (newline-delimited JSON). For simplicity's sake, every line is valid JSON and follows the same schema.

    ```json
    {"a": "B"}
    {"a": "C"}
    ```

1.1. There are many language- and OS-level optimisations for repeating the _exact_ same work, such as performing the same calculation on identical lines of a file. We therefore have to randomise the values to make a good test file.

    > use `faker`

In [1]:
from faker import (Faker, providers)
f = Faker()
list(map(f.add_provider, [providers.misc, providers.geo]))

[None, None]

In [2]:
[f.name() for _ in range(3)]

['Kimberly Scott', 'Mary Young', 'Alyssa Blackburn']

In [3]:
def fkr_n(fkr, n): return [fkr() for _ in range(n)]

In [4]:
def gen_movie():
    return {
        "titleId":         f.uuid4(),
        "ordering":        f.random_int(),
        "title":           f.catch_phrase(),
        "region":          f.locale(),
        "language":        f.language_name(),
        "types":           fkr_n(f.name, 5),
        "attributes":      fkr_n(f.name, 5),
        "isOriginalTitle": f.boolean(),
        "tconst":          f.uuid4(),
        "titleType":       f.domain_name(),
        "primaryTitle":    f.catch_phrase(),
        "originalTitle":   ":".join([f.company(), f.catch_phrase()]),
        "isAdult":         f.boolean(),
        "startYear":       f.date(),
        "endYear":         f.year(),
        "runtimeMinutes":  f.random_int(),
        "genres":          fkr_n(f.country, 5),
        "directors":       fkr_n(f.name, 2),
        "writers":         fkr_n(f.name, 40),
        "actors":          fkr_n(f.name, 120),
    }

In [5]:
import json
from IPython.display import JSON

In [6]:
JSON(gen_movie())

<IPython.core.display.JSON object>

Woohoo! Now we just need to write this to a file

In [7]:
class MovieTable:
    @staticmethod
    def records(fpath, n_records=10):
        # fpath is not used yet: this version only generates the records.
        for _ in range(n_records):
            yield gen_movie()

def test(g): list(g)

In [8]:
%timeit test(MovieTable.records("/tmp/movies.ndjson", 20))

321 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
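
The timing above covers record generation only. Writing the records out as NDJSON could look like the sketch below. It is self-contained, so `fake_record` is a hypothetical standard-library stand-in for `gen_movie()`, and `write_ndjson` and the temporary path are likewise assumptions made for illustration.

```python
import json
import os
import random
import tempfile
import uuid


def fake_record(rng):
    """Hypothetical stand-in for gen_movie(), using only the standard library."""
    return {
        "titleId": str(uuid.UUID(int=rng.getrandbits(128))),
        "ordering": rng.randint(0, 9999),
    }


def write_ndjson(path, records):
    """Write one JSON document per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
            count += 1
    return count


rng = random.Random(0)  # seeded so repeated runs produce the same file
path = os.path.join(tempfile.mkdtemp(), "movies.ndjson")
written = write_ndjson(path, (fake_record(rng) for _ in range(20)))
```

Passing a generator keeps memory use flat: each record is serialised and written before the next one is built.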


But I need to show you the CPU usage _per core_!

What about threading?

> use `psutil`

In [9]:
import asyncio
import json
from dataclasses import dataclass

import psutil

class AMovieTable:
    @staticmethod
    async def records(fpath, n_records=10):
        with open(fpath, "w") as ostream:
            for _ in range(n_records):
                print(json.dumps(gen_movie()), file=ostream, end="\n")
                # Yield to the event loop so the CPU sampler task can run.
                await asyncio.sleep(0)

@dataclass
class Timer:
    f: object
    sentinel: bool = False

    async def cpu(self):
        while not self.sentinel:
            self.f()
            await asyncio.sleep(1)

    def task(self):
        return asyncio.create_task(self.cpu())

    async def stop(self):
        self.sentinel = True
        
def cpu():
    print("\t".join(map(str, psutil.cpu_percent(percpu=True))))
    
async def run_with_timer(f, t):
    tsk = t.task()
    await f()
    tsk.cancel()

    try:
        await tsk
    except asyncio.CancelledError:
        print("finished")
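
The pattern in `run_with_timer` (start a background task, await the real work, then cancel the background task) can be tried in isolation. The sketch below is a simplified, standard-library-only version: `monitor`, `workload` and `run_with_monitor` are hypothetical names, and the monitor records ticks instead of calling `psutil`.

```python
import asyncio


async def monitor(samples, interval):
    """Append a tick to `samples` every `interval` seconds until cancelled."""
    while True:
        samples.append(len(samples))
        await asyncio.sleep(interval)


async def workload(steps):
    """Loop that hands control back to the event loop after every step."""
    total = 0
    for i in range(steps):
        total += i
        await asyncio.sleep(0)  # without this, monitor() never gets a turn
    return total


async def run_with_monitor(steps, interval=0.001):
    samples = []
    task = asyncio.create_task(monitor(samples, interval))
    result = await workload(steps)
    task.cancel()
    try:
        await task  # wait for the cancellation to be delivered
    except asyncio.CancelledError:
        pass
    return result, samples


# In a notebook, use `await run_with_monitor(1_000)` instead of asyncio.run().
result, samples = asyncio.run(run_with_monitor(1_000))
```

Both coroutines share a single thread, so the monitor only runs when the workload awaits. This is cooperative concurrency, not parallelism.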

In [12]:
import asyncio
from functools import partial

# t = Timer(cpu).task()
# await AMovieTable.records("/tmp/movies.ndjson", 100)
# t.cancel()

await run_with_timer(partial(AMovieTable.records, "/tmp/movies.ndjson", 1_000), Timer(cpu))

7.0	4.7	7.2	6.0	5.5	10.9	4.2	5.1	3.3	5.2	6.2	4.1	6.6	7.3	7.2	7.1
8.3	2.9	4.9	1.0	3.9	80.8	6.9	0.0	3.8	3.8	0.0	21.0	1.0	18.4	4.9	10.8
3.8	5.8	4.8	1.9	10.5	20.2	9.5	4.8	17.5	2.9	2.9	10.6	14.6	87.4	16.5	1.9
6.4	2.9	1.9	10.6	0.0	1.0	1.9	0.0	1.9	2.9	5.7	1.9	3.8	100.0	6.7	1.9
4.6	7.7	1.0	25.2	3.9	0.0	2.0	1.0	1.9	6.7	1.0	5.8	4.7	99.1	1.0	8.7
9.7	4.9	8.7	12.5	20.2	11.5	7.8	0.0	2.9	10.7	7.8	1.9	1.0	30.1	1.0	73.1
6.4	8.7	100.0	22.6	3.8	4.9	3.9	4.9	3.8	5.8	0.0	5.8	9.7	6.8	2.9	0.0
11.7	3.8	100.0	8.7	18.6	17.5	7.9	1.0	5.8	6.6	0.0	7.7	1.9	11.5	9.8	5.8
7.3	3.8	99.0	1.9	1.9	6.7	2.9	13.3	1.0	0.0	1.0	4.8	3.9	1.0	1.0	0.0
5.8	5.8	100.0	2.9	2.9	3.8	4.0	4.8	1.9	8.7	0.0	19.2	4.9	1.9	3.8	3.8
8.1	8.5	22.1	78.3	6.7	4.8	2.9	4.8	1.9	2.9	5.7	5.7	2.9	1.0	8.6	4.8
10.0	14.6	10.5	100.0	6.7	2.9	2.9	1.0	3.9	5.8	2.9	0.0	5.8	1.9	4.8	1.0
10.1	7.7	12.3	100.0	5.8	8.6	4.9	5.7	27.6	7.7	1.0	0.0	1.9	1.0	4.8	6.7
6.4	1.0	0.0	100.0	13.5	4.9	4.9	6.7	17.5	3.9	9.7	0.0	1.9	4.9	1.0	1.0
14.2	6.7	8.8	99.0	10.7	22.1	8.9	16.5	4.8	1.9	12.5	1

In [11]:
!wc -l "/tmp/movies.ndjson" && ls -alh "/tmp/movies.ndjson"

100 /tmp/movies.ndjson
-rw-r--r-- 1 jovyan users 349K Nov 22 15:08 /tmp/movies.ndjson
