# unravel directories of notebooks with `dask`

this is another pass at using `dask` to load notebooks, with the ultimate intent of searching them.
in [searching-notebooks](oct/oct/2022-10-05-dask-search.ipynb), i first approached this task with some
keen `pandas` skills that were not so kind in `dask` land.
this document takes another pass, using clearer expressions to ravel a bunch
of notebooks into a `dask.dataframe`.

taking care to load notebooks as `dask.dataframe`s offers the power to apply
direct queries and to export to parquet, sqlite, duckdb, arrow, and more.

In [1]:
    import pandas, json, jsonpointer, orjson, dask.bag, dask.dataframe; from pathlib import Path
    from toolz.curried import *


## if you know the shape then define it

`dask` truly prefers explicit dtypes while `pandas` is more flexible.
`meta` holds our shape information for the cells, outputs, and displays

In [2]:
    class meta: 
        O = "object"
        ANY = None, O
        NB = [("cells", O), ("metadata", O), ("nbformat", int), ("nbformat_minor", int)]
        CELL = [
            ("cell_type", str), ("execution_count", int), ("id", str),
            ("metadata", O), ("outputs", O), ("source", str), ("cell_ct", int),]
        OUTPUT = [
            ("data", O), ("metadata", O), ("ename", str), ("evalue", str),
            ("text", str), ("execution_count", int), ("output_type", str), ("output_ct", int)]
        DISPLAY = [("type", str), ("value", str)]        
        new_nb = pandas.Series(index=map(first, NB), dtype="O")
        new_cell = pandas.Series(index=map(first, CELL), dtype="O")
        new_output = pandas.Series(index=map(first, OUTPUT), dtype="O")
        new_display = pandas.Series(index=map(first, DISPLAY), dtype="O")
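    def enumerate_list(x, key="cell_ct"):
        # tag each dict in the list with its position so the order survives the explode
        return [{key: i, **y} for i, y in enumerate(x)]

    def get_series(data, key="text", new=meta.new_output):
        # nbformat may store multi-line strings as lists of lines; join them into one string
        if key in data:
            data[key] = "".join(data[key])
        # fill a copy of the empty template; keys that are not in the template are ignored
        s = new.copy()
        s.update(data)
        return s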
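
to see what these two helpers do, here is a small sketch on hand-written cells. the helpers are repeated so the block runs on its own, and the trimmed `new_cell` template and the sample cells are assumptions for illustration, not real notebook data.

```python
import pandas

# a trimmed template row: the columns we expect, every value starts out missing
new_cell = pandas.Series(index=["cell_type", "source", "cell_ct"], dtype="O")


def enumerate_list(x, key="cell_ct"):
    # tag each dict with its position in the list
    return [{key: i, **y} for i, y in enumerate(x)]


def get_series(data, key="source", new=new_cell):
    # join a list of lines into one string, then fill a copy of the template
    if key in data:
        data[key] = "".join(data[key])
    s = new.copy()
    s.update(data)
    return s


# invented sample cells; "extra" is not part of the template
sample = [
    {"cell_type": "markdown", "source": ["# title\n", "text"], "extra": 1},
    {"cell_type": "code", "source": ["print(1)"]},
]

rows = [get_series(cell) for cell in enumerate_list(sample)]
frame = pandas.DataFrame(rows)
```

every row comes out with the same columns in the same order, which is the property `dask` needs when the shape is declared up front.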
        

off to the races as we load some data from our local files.

In [37]:
    WHERE = Path("oct")

the files we include start as, and remain, our index. in prior iterations there were a few set index operations, but we don't want to open files just to build an index because that is costly. we'll store other metadata on the dataframe as we unpack the notebook shapes.

In [70]:
    files = dask.bag.from_sequence(
        dict(file=str(x)) for x in WHERE.glob("*.ipynb")
    ).to_dataframe().set_index("file")
    files

oct/2022-10-05-dask-search.ipynb
oct/2022-10-06-github-open-source-stats.ipynb
...
oct/2022-11-21-1.ipynb
oct/test_nbconvert_html5.ipynb


`contents` loads our files into a dataframe containing the real cell contents.
each row is a file.

In [72]:
    contents = files.index.to_series().apply(
        compose_left(Path, Path.read_text, orjson.loads, partial(
            get_series, new=meta.new_nb)), meta=meta.NB)
    contents

Dask DataFrame Structure:
                                                cells  metadata  nbformat  nbformat_minor
npartitions=20
oct/2022-10-05-dask-search.ipynb               object    object     int64           int64
oct/2022-10-06-github-open-source-stats.ipynb     ...       ...       ...             ...
...                                               ...       ...       ...             ...
oct/2022-11-21-1.ipynb                            ...       ...       ...             ...
oct/test_nbconvert_html5.ipynb                    ...       ...       ...             ...
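
the same step can be sketched without `dask`. the block below writes two throwaway notebooks to a temporary directory (their contents are invented), keeps the file names as the index, and loads one row per file. it uses the standard `json` module instead of `orjson` so it only depends on `pandas`.

```python
import json
import tempfile
from pathlib import Path

import pandas

# the four top-level keys of a notebook file; the values here are placeholders
NOTEBOOK = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}

with tempfile.TemporaryDirectory() as tmp:
    where = Path(tmp)
    for name in ("a.ipynb", "b.ipynb"):
        cell = {"cell_type": "markdown", "metadata": {}, "source": name}
        (where / name).write_text(json.dumps(dict(NOTEBOOK, cells=[cell])))

    # the file names are the index; nothing is opened to build it
    files = sorted(str(p) for p in where.glob("*.ipynb"))

    # one row per file, one column per top-level notebook key
    contents = pandas.DataFrame(
        [json.loads(Path(f).read_text()) for f in files],
        index=pandas.Index(files, name="file"),
    )
```

the `cells` column still holds a whole list per file at this point; unpacking it is the next step.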


the `cells` are built by exploding the rows of the `contents`

In [73]:
    cells = contents.cells
    cells = cells.apply(enumerate_list, meta=meta.ANY)
    cells = cells.explode().apply(get_series, key="source", new=meta.new_cell, meta=meta.CELL)
    meta_cells = cells["metadata cell_ct".split()]
    cells.pop("metadata")

Dask Series Structure:
npartitions=20
oct/2022-10-05-dask-search.ipynb                 object
oct/2022-10-06-github-open-source-stats.ipynb       ...
                                                  ...  
oct/2022-11-21-1.ipynb                              ...
oct/test_nbconvert_html5.ipynb                      ...
Name: metadata, dtype: object
Dask Name: getitem, 16 graph layers
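
a hedged sketch of the explode step in plain `pandas`, with invented cells: number the cells per notebook, explode so each cell gets its own row, then widen the dicts into columns. the file index repeats for every cell, which is why `cell_ct` is carried along.

```python
import pandas

# one list of cells per notebook; the cells are invented for this sketch
contents = pandas.Series(
    [
        [{"cell_type": "markdown", "source": "# a"}, {"cell_type": "code", "source": "1 + 1"}],
        [{"cell_type": "code", "source": "import dask"}],
    ],
    index=pandas.Index(["a.ipynb", "b.ipynb"], name="file"),
    name="cells",
)

# remember each cell's position before the list structure is lost
numbered = contents.apply(lambda cells: [{"cell_ct": i, **c} for i, c in enumerate(cells)])

# one row per cell; the file name is repeated in the index
exploded = numbered.explode()

# widen each dict into columns
cells = pandas.DataFrame(exploded.tolist(), index=exploded.index)
```

together, the file index and `cell_ct` identify a cell uniquely, so no second index has to be set.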

now we deal with outputs, which include display_data, stdout, and stderr.

In [75]:
    outputs = cells["outputs cell_ct".split()].dropna(subset="outputs")
    outputs.outputs = outputs.outputs.apply(enumerate_list, key="output_ct", meta=meta.ANY)
    outputs = outputs.explode("outputs").dropna(subset="outputs")
    outputs = dask.dataframe.concat([
        outputs.pop("outputs").apply(get_series, key="text", new=meta.new_output, meta=meta.OUTPUT),
        outputs
    ], axis=1)
    meta_display = outputs["metadata cell_ct output_ct".split()]
    outputs.pop("metadata")
    outputs

Dask DataFrame Structure:
                                                 data   ename  evalue    text  execution_count  output_type  output_ct  cell_ct
npartitions=20
oct/2022-10-05-dask-search.ipynb               object  object  object  object            int64       object      int64    int64
oct/2022-10-06-github-open-source-stats.ipynb     ...     ...     ...     ...              ...          ...        ...      ...
...                                               ...     ...     ...     ...              ...          ...        ...      ...
oct/2022-11-21-1.ipynb                            ...     ...     ...     ...              ...          ...        ...      ...
oct/test_nbconvert_html5.ipynb                    ...     ...     ...     ...              ...          ...        ...      ...
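
the outputs step repeats the same pattern one level down. as a sketch in plain `pandas`, with an invented cell that has one stream output and one rich result: drop the cells without outputs, number the outputs, explode, and join the stream text. each row then carries both `cell_ct` and `output_ct`.

```python
import pandas

# two cells from one notebook; the second (markdown) cell has no outputs
cells = pandas.DataFrame(
    {
        "cell_ct": [0, 1],
        "outputs": [
            [
                {"output_type": "stream", "name": "stdout", "text": ["hello\n", "world\n"]},
                {"output_type": "execute_result", "data": {"text/plain": "2"}},
            ],
            None,
        ],
    },
    index=pandas.Index(["a.ipynb", "a.ipynb"], name="file"),
)

outputs = cells.dropna(subset=["outputs"]).copy()

# number the outputs within their cell, then give each output its own row
outputs["outputs"] = outputs["outputs"].apply(
    lambda items: [{"output_ct": i, **o} for i, o in enumerate(items)]
)
outputs = outputs.explode("outputs")

# pull a few fields out of each output dict
records = outputs.pop("outputs")
outputs["output_ct"] = [o["output_ct"] for o in records]
outputs["output_type"] = [o["output_type"] for o in records]
outputs["text"] = ["".join(o["text"]) if "text" in o else None for o in records]
```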


separating the standard out/error streams from the rich display data. there is probably more to do for managing the different types of outputs from the different reprs.

In [77]:
    display_data = outputs["data execution_count output_type output_ct".split()].dropna(subset="data")
    display_data["data"] = display_data["data"].apply(compose_left(dict.items, list), meta=meta.ANY)
    display_data = display_data.explode("data").dropna(subset="data")
    display_data = dask.dataframe.concat([
        display_data.pop("data").apply(
            compose_left(
                partial(zip, meta.new_display.index), dict, 
                partial(get_series, key=None, new=meta.new_display)
            ), meta=meta.DISPLAY), display_data], axis=1)
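
a rich output stores its representations in a `data` dict keyed by mime type. as a sketch of the step above in plain `pandas` (the outputs and their values are placeholders), each dict becomes a list of `(type, value)` pairs, which is exploded into one row per representation.

```python
import pandas

# three outputs; the middle one is a stream output without a data bundle
outputs = pandas.DataFrame(
    {
        "data": [
            {"text/plain": "<Figure>", "image/png": "<base64 bytes>"},
            None,
            {"text/html": "<b>hi</b>"},
        ],
        "output_ct": [0, 1, 0],
    },
    index=pandas.Index(["a.ipynb", "a.ipynb", "b.ipynb"], name="file"),
)

display_data = outputs.dropna(subset=["data"]).copy()

# dict -> list of (mime type, value) pairs -> one row per pair
display_data["data"] = display_data["data"].apply(lambda d: list(d.items()))
display_data = display_data.explode("data")

# split each pair into the two display columns
pairs = display_data.pop("data")
display_data["type"] = [t for t, _ in pairs]
display_data["value"] = [v for _, v in pairs]
```

an output with several representations, like the figure here, turns into several rows that share the same `output_ct`.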

## where to go from here

* extend to other files. the notebook format is a hypermedia document format.
* save to different formats. initially we think about parquet, while in theory from this dataframe
we could go further and imagine it being the seed for documentation.
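
one hedged sketch of the save-and-search idea, using plain `pandas` and the standard library `sqlite3` rather than `dask` or parquet. the tiny `cells` frame is invented; a computed `dask` result is a regular `pandas` frame, so the same two calls should apply to it.

```python
import sqlite3

import pandas

# a stand-in for the computed cells table; values are invented
cells = pandas.DataFrame(
    {
        "file": ["a.ipynb", "a.ipynb", "b.ipynb"],
        "cell_ct": [0, 1, 0],
        "cell_type": ["markdown", "code", "code"],
        "source": ["# notes on dask", "import pandas", "import dask.dataframe"],
    }
)

# save the table to an in-memory sqlite database
con = sqlite3.connect(":memory:")
cells.to_sql("cells", con, index=False)

# search the cell sources with a plain sql query
hits = pandas.read_sql_query(
    "select file, cell_ct from cells where source like '%dask%' order by file, cell_ct",
    con,
)
con.close()
```

the file name and `cell_ct` in each hit are enough to point back at the exact cell in the original notebook.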

In [79]:
    display(cells, outputs, display_data)

Dask DataFrame Structure:
                                               cell_type  execution_count      id  outputs  source  cell_ct
npartitions=20
oct/2022-10-05-dask-search.ipynb                  object            int64  object   object  object    int64
oct/2022-10-06-github-open-source-stats.ipynb        ...              ...     ...      ...     ...      ...
...                                                  ...              ...     ...      ...     ...      ...
oct/2022-11-21-1.ipynb                               ...              ...     ...      ...     ...      ...
oct/test_nbconvert_html5.ipynb                       ...              ...     ...      ...     ...      ...


Dask DataFrame Structure:
                                                 data   ename  evalue    text  execution_count  output_type  output_ct  cell_ct
npartitions=20
oct/2022-10-05-dask-search.ipynb               object  object  object  object            int64       object      int64    int64
oct/2022-10-06-github-open-source-stats.ipynb     ...     ...     ...     ...              ...          ...        ...      ...
...                                               ...     ...     ...     ...              ...          ...        ...      ...
oct/2022-11-21-1.ipynb                            ...     ...     ...     ...              ...          ...        ...      ...
oct/test_nbconvert_html5.ipynb                    ...     ...     ...     ...              ...          ...        ...      ...


Dask DataFrame Structure:
                                                 type   value  execution_count  output_type  output_ct
npartitions=20
oct/2022-10-05-dask-search.ipynb               object  object            int64       object      int64
oct/2022-10-06-github-open-source-stats.ipynb     ...     ...              ...          ...        ...
...                                               ...     ...              ...          ...        ...
oct/2022-11-21-1.ipynb                            ...     ...              ...          ...        ...
oct/test_nbconvert_html5.ipynb                    ...     ...              ...          ...        ...
