# Lesson 04: Concurrency & Parallelism

## 1. Parallelism

Compared to concurrency, parallelism is easier to use, and is _usually_ easier to think about and design.

We can explore this through a data-processing scenario: going through a very large CSV/JSON document and keeping only certain columns or keys.
This is typically done to select a relevant subset from a very broad data set, often one sourced from a third party.

e.g. 1: You want to download the Wikipedia dataset and filter for actors, movie titles and release years, so that you can make a simple and comprehensive list.

e.g. 2: You want to build a demo for this lesson, so you have to fake the data before demonstrating the filtering.
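The filtering step in this scenario can be sketched roughly as follows. This is a minimal illustration, not the lesson's final implementation: the function name `filter_keys`, its `keep` parameter and the sample lines are all hypothetical, and it assumes every input line is a valid JSON object.

```python
import json


def filter_keys(line: str, keep: set[str]) -> str:
    """Parse one NDJSON line and re-serialise only the keys listed in `keep`."""
    record = json.loads(line)
    return json.dumps({k: v for k, v in record.items() if k in keep})


# Hypothetical sample input: two lines of a much larger file.
lines = [
    '{"title": "Example", "startYear": "1999", "region": "US"}',
    '{"title": "Another", "startYear": "2005", "region": "GB"}',
]

# Keep only the title and release year of each record.
filtered = [filter_keys(line, {"title", "startYear"}) for line in lines]
```

Each line is handled without reference to any other line, which is what makes this kind of job a natural fit for splitting across parallel workers.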

### Example 1: Generating a text file

1. We need to generate a _very_ large NDJSON file (newline-delimited JSON). For simplicity's sake, every line is valid JSON and follows the same schema.

    ```json
    {"a": "B"}
    {"a": "C"}
    ```

1.1. Languages and operating systems apply many optimisations when the _exact_ same work is repeated, such as performing the same calculation over identical line data. To make a good test file, we therefore have to randomise the values.

> Use `faker` to generate the random values.

The fake records below borrow their field names from three IMDb dataset files.

`title.akas.tsv.gz` contains the following information for titles:

- `titleId` (string): a tconst, an alphanumeric unique identifier of the title
- `ordering` (integer): a number to uniquely identify rows for a given `titleId`
- `title` (string): the localized title
- `region` (string): the region for this version of the title
- `language` (string): the language of the title
- `types` (array): enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- `attributes` (array): additional terms to describe this alternative title, not enumerated
- `isOriginalTitle` (boolean): 0: not original title; 1: original title

`title.basics.tsv.gz` contains the following information for titles:

- `tconst` (string): alphanumeric unique identifier of the title
- `titleType` (string): the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
- `primaryTitle` (string): the more popular title / the title used by the filmmakers on promotional materials at the point of release
- `originalTitle` (string): original title, in the original language
- `isAdult` (boolean): 0: non-adult title; 1: adult title
- `startYear` (YYYY): represents the release year of a title. In the case of TV Series, it is the series start year
- `endYear` (YYYY): TV Series end year. `\N` for all other title types
- `runtimeMinutes`: primary runtime of the title, in minutes
- `genres` (string array): includes up to three genres associated with the title

`title.crew.tsv.gz` contains the director and writer information for all the titles in IMDb. Fields include:

- `tconst` (string): alphanumeric unique identifier of the title
- `directors` (array of nconsts): director(s) of the given title
- `writers` (array of nconsts): writer(s) of the given title

```python
from faker import Faker

# Faker() loads the standard providers (including misc and geo) by default,
# so no explicit add_provider calls are needed here.
f = Faker()
```

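```python
# Call f.name once per item. map(f.name(), range(10)) would instead pass a
# single generated string as the "function" and fail as soon as it is consumed.
[f.name() for _ in range(10)]
```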

```python
def fkr_n(fkr, n):
    """Call the faker method `fkr` n times and return the results as a list."""
    return [fkr() for _ in range(n)]
```

```python
# One fake record combining the fields of the three dataset files, plus "actors".
movie = {
    "titleId":         f.uuid4(),
    "ordering":        f.random_int(),
    "title":           f.catch_phrase(),
    "region":          f.locale(),
    "language":        f.language_name(),
    "types":           fkr_n(f.name, 5),
    "attributes":      fkr_n(f.name, 5),
    "isOriginalTitle": f.boolean(),
    # "tconst" appears once only: a repeated key in a dict literal silently
    # overwrites the earlier value.
    "tconst":          f.uuid4(),
    "titleType":       f.domain_name(),
    "primaryTitle":    f.catch_phrase(),
    "originalTitle":   ":".join([f.company(), f.catch_phrase()]),
    "isAdult":         f.boolean(),
    "startYear":       f.year(),  # YYYY, matching endYear
    "endYear":         f.year(),
    "runtimeMinutes":  f.random_int(),
    "genres":          fkr_n(f.country, 5),
    "directors":       fkr_n(f.name, 2),
    "writers":         fkr_n(f.name, 40),
    "actors":          fkr_n(f.name, 120),
}
```

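```python
import json  # needed for the next step, when the record is written to a file
from IPython.display import JSON

# IPython's JSON display accepts the dict directly, so there is no need to
# serialise it to a string first.
JSON(movie)
```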

Woohoo! Now we just need to write this to a file.
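One way the write step could look is sketched below. It is only an assumption about how the lesson proceeds: `write_ndjson` and `fake_record` are hypothetical names, and `fake_record` is a small standard-library stand-in for the faker-based `movie` dict so that the block runs without extra dependencies.

```python
import json
import random
import uuid
from typing import Iterable


def fake_record() -> dict:
    """Hypothetical stand-in for the faker-based `movie` dict (far fewer fields)."""
    return {
        "titleId": str(uuid.uuid4()),
        "ordering": random.randint(0, 9999),
        "isOriginalTitle": random.choice([True, False]),
    }


def write_ndjson(path: str, records: Iterable[dict]) -> int:
    """Write one JSON object per line to `path`; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            # json.dumps emits no raw newlines by default, so each record
            # occupies exactly one line of the file.
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A call such as `write_ndjson("movies.ndjson", (fake_record() for _ in range(1_000_000)))` streams records from a generator, so the whole data set never has to sit in memory at once. The file name and record count here are illustrative only.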