Notebook Bag
============



## Preliminaries



This notebook requires the `mimesis` package.
Install it with `pip` or `conda`; for the latter, run
`conda install -c conda-forge mimesis`.
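
Before continuing, it can be worth confirming that the package is importable from the kernel this notebook runs in. The following is a minimal sketch using only the standard library; the variable name `mimesis_available` is purely illustrative.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path.
mimesis_available = importlib.util.find_spec("mimesis") is not None
print("mimesis installed:", mimesis_available)
```

If this prints `False`, the "Prepare Data" cell further down is likely to fail until the package has been installed into the same environment.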



## Start the Dask Client



Starting the Dask Client is optional. It provides a dashboard, which is useful for gaining insight into the computation.

The link to the dashboard becomes visible when you create the client below. We recommend keeping it open on one side of your screen while you use the notebook on the other. Arranging the windows can take some effort, but seeing both at the same time is very useful when learning.



In [2]:
from dask.distributed import Client, progress

client = Client(n_workers=4, threads_per_worker=1)
client


Client
Connection method: Cluster object, Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

LocalCluster
Dashboard: http://127.0.0.1:8787/status, Workers: 4
Total threads: 4, Total memory: 7.92 GiB
Status: running, Using processes: True

Scheduler
Comm: tcp://127.0.0.1:62967, Workers: 4
Dashboard: http://127.0.0.1:8787/status, Total threads: 4
Started: Just now, Total memory: 7.92 GiB

Worker
Comm: tcp://127.0.0.1:62990, Total threads: 1
Dashboard: http://127.0.0.1:62991/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62970
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-9u4hgyyp

Worker
Comm: tcp://127.0.0.1:62994, Total threads: 1
Dashboard: http://127.0.0.1:62997/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62972
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-9k8kz5a2

Worker
Comm: tcp://127.0.0.1:62993, Total threads: 1
Dashboard: http://127.0.0.1:62995/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62974
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-vlx2wz1q

Worker
Comm: tcp://127.0.0.1:62987, Total threads: 1
Dashboard: http://127.0.0.1:62988/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62976
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-2g82qvhr


Details of the scheduler (including its address) and of its workers:



In [3]:
client.scheduler_info()


Scheduler
Comm: tcp://127.0.0.1:62967, Workers: 4
Dashboard: http://127.0.0.1:8787/status, Total threads: 4
Started: Just now, Total memory: 7.92 GiB

Worker
Comm: tcp://127.0.0.1:62990, Total threads: 1
Dashboard: http://127.0.0.1:62991/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62970
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-9u4hgyyp
Tasks executing:, Tasks in memory:
Tasks ready:, Tasks in flight:
CPU usage: 0.0%, Last seen: Just now
Memory usage: 65.47 MiB, Spilled bytes: 0 B
Read bytes: 0.0 B, Write bytes: 0.0 B

Worker
Comm: tcp://127.0.0.1:62994, Total threads: 1
Dashboard: http://127.0.0.1:62997/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62972
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-9k8kz5a2
Tasks executing:, Tasks in memory:
Tasks ready:, Tasks in flight:
CPU usage: 0.0%, Last seen: Just now
Memory usage: 65.50 MiB, Spilled bytes: 0 B
Read bytes: 0.0 B, Write bytes: 0.0 B

Worker
Comm: tcp://127.0.0.1:62993, Total threads: 1
Dashboard: http://127.0.0.1:62995/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62974
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-vlx2wz1q
Tasks executing:, Tasks in memory:
Tasks ready:, Tasks in flight:
CPU usage: 0.0%, Last seen: Just now
Memory usage: 65.59 MiB, Spilled bytes: 0 B
Read bytes: 0.0 B, Write bytes: 0.0 B

Worker
Comm: tcp://127.0.0.1:62987, Total threads: 1
Dashboard: http://127.0.0.1:62988/status, Memory: 1.98 GiB
Nanny: tcp://127.0.0.1:62976
Local directory: C:\Users\Julian\AppData\Local\Temp\dask-scratch-space\worker-2g82qvhr
Tasks executing:, Tasks in memory:
Tasks ready:, Tasks in flight:
CPU usage: 0.0%, Last seen: Just now
Memory usage: 65.36 MiB, Spilled bytes: 0 B
Read bytes: 0.0 B, Write bytes: 0.0 B


Notes:

-   The scheduler's connection string (ip:port) can be used to connect to an existing cluster.
-   A cluster can be shut down with `client.shutdown()`.
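
The notes above can be sketched as follows. This is a minimal illustration, not the setup used in this notebook: it starts a separate in-process cluster (`processes=False`) with the dashboard disabled so that it is cheap to run, and the names `cluster`, `address`, `second_client` and `total` are illustrative.

```python
from dask.distributed import Client, LocalCluster

# A tiny in-process cluster; dashboard_address=None disables the dashboard.
cluster = LocalCluster(
    n_workers=1, threads_per_worker=1, processes=False, dashboard_address=None
)

# The scheduler address is the connection string another client can use.
address = cluster.scheduler_address
second_client = Client(address)

# Run a trivial task through the second client to show it is connected.
total = second_client.submit(sum, [1, 2, 3]).result()
print(address, total)

# close() only disconnects this client; shutdown() would also stop the
# scheduler and workers it is connected to.
second_client.close()
cluster.close()
```

For a cluster started elsewhere, the same pattern applies with the address reported by that cluster's scheduler.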



## Prepare Data



In [4]:
import dask
import json
import os

os.makedirs("data/bag", exist_ok=True)  # Create the data/bag directory

# Make records of people; all other parameters keep their default values
b = dask.datasets.make_people(npartitions=10, records_per_partition=1000)
b.map(json.dumps).to_textfiles("data/bag/*.json")  # Encode as JSON, write to disk


['c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/0.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/1.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/2.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/3.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/4.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/5.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/6.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/7.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/8.json',
 'c:/Users/Julian/VSCode/FHE/aim_adm/data/bag/9.json']

Take a quick look at the files that were written:



In [5]:
!ls -lah data/bag/*.json

'ls' is not recognized as an internal or external command,
operable program or batch file.
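
The `ls` command is not available in the default Windows shell, which is why the cell above fails on this machine. A portable alternative is to list the files from Python. The sketch below uses only the standard library; the helper name `list_files` is hypothetical and not part of Dask.

```python
from pathlib import Path


def list_files(directory, pattern="*.json"):
    """Return (file name, size in bytes) pairs for matching files, sorted by name."""
    return [(p.name, p.stat().st_size) for p in sorted(Path(directory).glob(pattern))]


# Prints nothing if data/bag does not exist or contains no JSON files.
for name, size in list_files("data/bag"):
    print(f"{size:>10,d}  {name}")
```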


## Load Data



First look at the raw data:



In [6]:
!head -n 2 data/bag/0.json

'head' is not recognized as an internal or external command,
operable program or batch file.
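
As with `ls`, `head` is not available in the default Windows shell. The first lines of a file can instead be read from Python. This is a minimal sketch using only the standard library; the helper name `head` is chosen here to mirror the shell command and is not a Dask function.

```python
from itertools import islice
from pathlib import Path


def head(path, n=2):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


sample = Path("data/bag/0.json")
if sample.exists():  # Only present after the "Prepare Data" cell has run
    for line in head(sample):
        print(line)
```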


Load it with dask:



In [7]:
import dask.bag as db

b = db.read_text("data/bag/*.json").map(json.loads)
b


dask.bag<loads, npartitions=10>

How many entries are there?



In [8]:
b.count().compute()


10000

Take two elements (from the first partition)



In [9]:
b.take(2)


({'age': 62,
  'name': ['Daniell', 'Montgomery'],
  'occupation': 'Chef',
  'telephone': '+13645830731',
  'address': {'address': '523 Cloud Court', 'city': 'Hamilton'},
  'credit-card': {'number': '3791 914896 95304', 'expiration-date': '12/18'}},
 {'age': 47,
  'name': ['Sammy', 'Alvarado'],
  'occupation': 'Wine Merchant',
  'telephone': '+1-440-896-2713',
  'address': {'address': '1087 Reuel Mews', 'city': 'Eagle Mountain'},
  'credit-card': {'number': '5172 9380 8616 9937',
   'expiration-date': '05/18'}})

Extract some information from each entry:



In [10]:
b.map(lambda record: record["occupation"]).take(2)


('Chef', 'Wine Merchant')

To get a list of all distinct occupations, use the bag's `distinct` method.



In [11]:
%%time
b.map(lambda record: record['occupation']).distinct().compute()

CPU times: total: 328 ms
Wall time: 1.49 s


['Chef',
 'Wine Merchant',
 'Knitter',
 'Typewriter Engineer',
 'Literary Editor',
 'Food Processor',
 'Nuclear Scientist',
 'Travel Consultant',
 'Circus Worker',
 'Medical Physicist',
 'Safety Officer',
 'Site Engineer',
 'Pilot',
 'Caterer',
 'Cab Driver',
 'Hot Foil Printer',
 'Greengrocer',
 'Professional Wrestler',
 'Town Clerk',
 'Recreational',
 'Tree Surgeon',
 'Building Estimator',
 'Street Trader',
 'Polisher',
 'Geologist',
 'Racehorse Groom',
 'Stockbroker',
 'Fuel Merchant',
 'Pharmacist',
 'Navigator',
 'Road Safety Officer',
 'Podiatrist',
 'Auditor',
 'Maid',
 'Chicken Chaser',
 'Rent Offcer',
 'Bar Manager',
 'Assistant Caretaker',
 'Typist',
 'Plant Engineer',
 'Machine Setter',
 'Nutritionist',
 'Jockey',
 'Import Consultant',
 'Palaeontologist',
 'Tailor',
 'Councillor',
 'Shop Assistant',
 'Advertising Assistant',
 'Botanist',
 'Stable Hand',
 'Systems Engineer',
 'Janitor',
 'Magistrate',
 'Stage Mover',
 'Florist',
 'Picture Editor',
 'Gaming Club Manager',
 'Fu

How does the following approach differ?



In [12]:
%%time
set(b.map(lambda record: record['occupation']).take(1000, npartitions=-1))

CPU times: total: 234 ms
Wall time: 1.65 s


{'Accounts Assistant',
 'Accounts Clerk',
 'Acoustic Engineer',
 'Actress',
 'Actuary',
 'Administration Manager',
 'Administration Staff',
 'Advertising Agent',
 'Advertising Assistant',
 'Advertising Manager',
 'Advertising Staff',
 'Aeronautical Engineer',
 'Aircraft Designer',
 'Aircraft Maintenance Engineer',
 'Airman',
 'Airport Controller',
 'Almoner',
 'Ambulance Controller',
 'Ambulance Driver',
 'Amusement Arcade Worker',
 'Animal Breeder',
 'Antique Dealer',
 'Applications Programmer',
 'Arbitrator',
 'Arborist',
 'Archaeologist',
 'Architect',
 'Area Manager',
 'Aromatherapist',
 'Art Critic',
 'Art Dealer',
 'Art Restorer',
 'Artist',
 'Assessor',
 'Assistant',
 'Assistant Caretaker',
 'Astrologer',
 'Attendant',
 'Au Pair',
 'Auction Worker',
 'Auctioneer',
 'Audiologist',
 'Audit Clerk',
 'Audit Manager',
 'Auditor',
 'Auto Electrician',
 'Bacon Curer',
 'Baggage Handler',
 'Bailiff',
 'Bakery Assistant',
 'Bakery Operator',
 'Balloonist',
 'Baptist Minister',
 'Bar Mana
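
One way to explore the question above is with a toy bag. `distinct` considers every record, whereas `take(k, npartitions=-1)` returns only the first `k` records, so a set built from it can miss values that occur only later in the data. The sketch below uses hypothetical toy data and the single-threaded scheduler to keep it cheap to run.

```python
import dask
import dask.bag as db

# Hypothetical toy data: 6 records split across 2 partitions.
jobs = db.from_sequence(
    ["Chef", "Pilot", "Chef", "Tailor", "Pilot", "Florist"], npartitions=2
)

# The synchronous scheduler runs everything in the current thread.
with dask.config.set(scheduler="synchronous"):
    all_jobs = set(jobs.distinct().compute())         # looks at every record
    sampled_jobs = set(jobs.take(3, npartitions=-1))  # keeps only 3 records

print(sorted(all_jobs))
print(sorted(sampled_jobs))
```

The sampled set is always a subset of the full set of distinct values; the two coincide only if every value happens to appear among the records that were taken.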

## Map, Filter, Aggregate



We can process this data by filtering it down to the records of interest, mapping functions over those records, and aggregating the results into a single value.



In [13]:
b.filter(lambda record: record["age"] > 30).take(2)  # Select only people over 30


({'age': 62,
  'name': ['Daniell', 'Montgomery'],
  'occupation': 'Chef',
  'telephone': '+13645830731',
  'address': {'address': '523 Cloud Court', 'city': 'Hamilton'},
  'credit-card': {'number': '3791 914896 95304', 'expiration-date': '12/18'}},
 {'age': 47,
  'name': ['Sammy', 'Alvarado'],
  'occupation': 'Wine Merchant',
  'telephone': '+1-440-896-2713',
  'address': {'address': '1087 Reuel Mews', 'city': 'Eagle Mountain'},
  'credit-card': {'number': '5172 9380 8616 9937',
   'expiration-date': '05/18'}})
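
The cell above shows only the filter step. A minimal sketch of all three steps together (filter, map, aggregate) might look as follows; it uses hypothetical toy records rather than the generated people data, and the single-threaded scheduler so that it runs quickly anywhere.

```python
import dask.bag as db

# Hypothetical toy records standing in for the people data.
people = db.from_sequence([{"age": 25}, {"age": 41}, {"age": 63}], npartitions=2)

mean_age_over_30 = (
    people.filter(lambda record: record["age"] > 30)  # filter: keep some records
    .map(lambda record: record["age"])  # map: extract one field per record
    .mean()  # aggregate: reduce to a single value
    .compute(scheduler="synchronous")
)
print(mean_age_over_30)  # 52.0
```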

## Chain Computations



It is common to chain many of these steps into one pipeline and to call `compute` or `take` only at the end.



In [14]:
result = (
    b.filter(lambda record: record["age"] > 30)
    .map(lambda record: record["occupation"])
    .frequencies(sort=True)
    .topk(15, key=1)
)
result


dask.bag<topk-aggregate, npartitions=1>

As with all lazy Dask collections, we need to call `compute` to actually evaluate the result. The `take` method used in earlier examples also triggers computation, but returns only the first few elements.



In [15]:
result.compute()


[('Toll Collector', 16),
 ('Technical Editor', 15),
 ('Public Relations Officer', 14),
 ('Literary Editor', 13),
 ('Hot Foil Printer', 13),
 ('Recreational', 13),
 ('Proprietor', 13),
 ('Soldier', 13),
 ('Training Advisor', 13),
 ('Caulker', 13),
 ('Plant Engineer', 12),
 ('Magistrate', 12),
 ('Ambulance Controller', 12),
 ('Transport Controller', 12),
 ('Illustrator', 12)]
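
The laziness described above can be made visible with a function that records each call. This is a minimal sketch on hypothetical toy data; it relies on the single-threaded scheduler so that the side effect happens in the current process, and the names `calls` and `record_call` are illustrative.

```python
import dask.bag as db

calls = []


def record_call(x):
    """Remember that the function ran, then return the doubled value."""
    calls.append(x)
    return x * 2


# Building the pipeline does not run record_call yet.
lazy = db.from_sequence([1, 2, 3], npartitions=1).map(record_call)
calls_before = len(calls)

# compute() executes the task graph and returns a plain Python list.
values = lazy.compute(scheduler="synchronous")
calls_after = len(calls)

print(calls_before, values, calls_after)  # 0 [2, 4, 6] 3
```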

## Convert to Dask DataFrames



Dask Bags are good for reading in initial data, doing a bit of pre-processing, and then handing off to a more efficient form such as Dask DataFrames. Dask DataFrames use pandas internally, so they can be much faster on numeric data and offer more sophisticated algorithms.

However, Dask DataFrames expect data organized as flat columns and do not support nested JSON data very well (Bag is better for this).

Here we define a function that flattens our nested data structure, map it across our records, and then convert the result to a Dask DataFrame.



In [16]:
def flatten(record):
    return {
        "age": record["age"],
        "occupation": record["occupation"],
        "telephone": record["telephone"],
        "credit-card-number": record["credit-card"]["number"],
        "credit-card-expiration": record["credit-card"]["expiration-date"],
        "name": " ".join(record["name"]),
        "street-address": record["address"]["address"],
        "city": record["address"]["city"],
    }


b.map(flatten).take(1)


({'age': 62,
  'occupation': 'Chef',
  'telephone': '+13645830731',
  'credit-card-number': '3791 914896 95304',
  'credit-card-expiration': '12/18',
  'name': 'Daniell Montgomery',
  'street-address': '523 Cloud Court',
  'city': 'Hamilton'},)

In [17]:
df = b.map(flatten).to_dataframe()
df.head()


|   | age | occupation | telephone | credit-card-number | credit-card-expiration | name | street-address | city |
|---|-----|------------|-----------|--------------------|------------------------|------|----------------|------|
| 0 | 62 | Chef | +13645830731 | 3791 914896 95304 | 12/18 | Daniell Montgomery | 523 Cloud Court | Hamilton |
| 1 | 47 | Wine Merchant | +1-440-896-2713 | 5172 9380 8616 9937 | 05/18 | Sammy Alvarado | 1087 Reuel Mews | Eagle Mountain |
| 2 | 42 | Knitter | +1-219-529-8230 | 3746 451311 86322 | 04/21 | Marlana Black | 89 Denslowe Pine | Patterson |
| 3 | 42 | Typewriter Engineer | +1-559-246-6819 | 4041 7628 4841 4742 | 11/20 | Thad Fuentes | 289 Spears Highway | Mundelein |
| 4 | 56 | Literary Editor | +14150580576 | 3403 294359 18802 | 10/17 | Jeremiah Ray | 28 Whitney Pine | Bloomington |


## Task



Count the number of people with expired credit cards.
Do not use the DataFrame from the previous section; instead, use
the original bag `b` and apply a filter to it.
Finally, provide a **pandas** DataFrame with columns
`name`, `street` and `city` that contains all these people
(note: calling `compute` on a Dask DataFrame returns a pandas DataFrame).
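
As a possible starting point (not a full solution), the expiration strings in the records have the form `MM/YY`, so a small predicate can decide whether a card has expired. The helper below is a sketch under two stated assumptions: two-digit years mean 20YY, and a card stays valid through the last day of its expiration month. The name `is_expired` is hypothetical.

```python
from datetime import date


def is_expired(expiration, today=None):
    """Return True if an 'MM/YY' expiration date lies before the current month.

    Assumes two-digit years mean 20YY and that a card remains valid
    through the last day of its expiration month.
    """
    today = today or date.today()
    month, year = expiration.split("/")
    return (2000 + int(year), int(month)) < (today.year, today.month)


print(is_expired("12/18", today=date(2024, 1, 15)))  # True
print(is_expired("01/24", today=date(2024, 1, 15)))  # False
```

A predicate like this could be passed to the bag's `filter` method by applying it to `record["credit-card"]["expiration-date"]`.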

