# Using Orcabridge

In this notebook, we will explore the basic usage of Orcabridge library.

Below we explore the usage of `orcabridge` package, enumerating the core components. Many of these will correspond directly to [core concepts](./01_orcabridge_core_concepts%20copy.ipynb) introduced in in [part 1](./01_orcabridge_core_concepts%20copy.ipynb).

In [1]:
%load_ext autoreload
%autoreload
# import orcabridge package
import orcabridge as ob
import polars as pl
import pyarrow as pa

In [None]:
from orcabridge.pod import TypedFunctionPod
from orcabridge.hashing.semantic_arrow_hasher import SemanticArrowHasher
from orcabridge.pod.core import CachedFunctionPod
from orcabridge.hashing.object_hashers import LegacyObjectHasher
from orcabridge.hashing.function_info_extractors import FunctionSignatureExtractor
from orcabridge.core.streams import SyncStreamFromLists
from orcabridge.types.registry import *

In [3]:
from orcabridge.store.arrow_data_stores import (
    ParquetArrowDataStore,
    demo_single_row_constraint,
)

In [4]:
import logging

logging.basicConfig(level=logging.INFO)
demo_single_row_constraint()

INFO:orcabridge.store.arrow_data_stores:Loading metadata index...
INFO:orcabridge.store.arrow_data_stores:Loaded metadata for 0 records
INFO:orcabridge.store.arrow_data_stores:Initialized lazy ParquetArrowDataStore at /tmp/tmprez6xpqd
INFO:orcabridge.store.arrow_data_stores:Added record experiments:dataset_A:entry_001_abcdef1234567890abcdef1234567890 with 1 rows
ERROR:orcabridge.store.arrow_data_stores:Schema mismatch for experiments/dataset_A:
ERROR:orcabridge.store.arrow_data_stores:  Existing user columns: ['_user_entry_id', 'timestamp', 'value', 'category']
ERROR:orcabridge.store.arrow_data_stores:  New user columns: ['entry_id', 'timestamp', 'value', 'category']
ERROR:orcabridge.store.arrow_data_stores:  Missing in new: {'_user_entry_id'}
ERROR:orcabridge.store.arrow_data_stores:  Extra in new: {'entry_id'}
INFO:orcabridge.store.arrow_data_stores:Shutting down ParquetArrowDataStore...
INFO:orcabridge.store.arrow_data_stores:Synced 1 dirty caches to disk
INFO:orcabridge.store.arrow

Testing Single-Row Constraint...

=== Testing Valid Single-Row Records ===
✓ Added single-row record entry_001_abcdef... (value: 100.0)


ValueError: Schema mismatch for experiments/dataset_A. Existing data has columns ['_user_entry_id', 'timestamp', 'value', 'category'], but new data has columns ['entry_id', 'timestamp', 'value', 'category']. All records in a source must have the same schema.

In [5]:
from datetime import datetime, timedelta

# Initialize store with single-row constraint enforcement
store = ParquetArrowDataStore(base_path="./data", duplicate_entry_behavior="overwrite")


INFO:orcabridge.store.arrow_data_stores:Loading metadata index...
INFO:orcabridge.store.arrow_data_stores:Loaded metadata for 1 records
INFO:orcabridge.store.arrow_data_stores:Initialized lazy ParquetArrowDataStore at ./data


In [None]:
# This works - single row
single_row_data = pa.table({"value": [46.0], "timestamp": [datetime.now()]})
store.add_record("experiments", "dataset_A", "entry_1245d...", single_row_data)

# This fails - multiple rows
multi_row_data = pa.table({"value": [1.0, 2.0, 3.0], "timestamp": [datetime.now()] * 3})

INFO:orcabridge.store.arrow_data_stores:Added record experiments:dataset_A:entry_123d... with 1 rows


In [11]:
store.force_sync()

In [9]:
data = store.get_all_records_as_polars("experiments", "dataset_A")

  user_columns = [col for col in lazy_frame.columns if col not in system_cols]


In [12]:
data.collect()

value,timestamp
f64,datetime[μs]
42.0,2025-06-16 09:05:29.641098
42.0,2025-06-16 09:06:46.449592


In [7]:
data.collect()

value,timestamp,entry_id
f64,datetime[μs],cat


In [9]:
store.add_record("experiments", "dataset_A", "entry_123abc...", multi_row_data)

ValueError: Each record must contain exactly 1 row, got 3 rows. This constraint ensures that for each source_name/source_id combination, there is only one valid entry per entry_id.

In [None]:
data_store = ParquetArrowDataStore(
    "./dataset", sync_interval_seconds=10, max_loaded_sources=30
)

INFO:orcabridge.store.arrow_data_stores:Loading metadata index...
INFO:orcabridge.store.arrow_data_stores:Loaded metadata for 0 records
INFO:orcabridge.store.arrow_data_stores:Initialized lazy ParquetArrowDataStore at ./dataset


In [None]:
data_store.add_record()

In [19]:
from pathlib import Path
from orcabridge.types.default import default_registry
from orcabridge.types.registry import PacketConverter

type_spec = {"name": str, "file": Path}
example_packet = {"name": "Edgar", "file": "sample.txt"}
example_packets = [
    {"name": "Edgar", "file": "sample.txt"},
    {"name": "Alice", "file": "sample2.txt"},
]

In [5]:
converter = PacketConverter(type_spec, registry=default_registry)

In [20]:
table = converter.to_arrow_table(example_packets)

In [21]:
table

pyarrow.Table
name: string
file: string
----
name: [["Edgar","Alice"]]
file: [["sample.txt","sample2.txt"]]

In [22]:
hasher.hash_table(table)

'2cda22e1527ecf01d1555ed90a0f0a2b40ec7f1034387b5f0f93afb38cf041cf'

In [23]:
processed_table = hasher._process_table_columns(table)

In [25]:
import inspect

In [None]:
def test(a: int = 5):
    return 5

In [29]:
for k, v in inspect.signature(test).parameters.items():
    print(k, v)

a a: int = 5


In [33]:
v.name

'a'

In [24]:
processed_table

pyarrow.Table
name: string
file: string
----
name: [["Edgar","Alice"]]
file: [["3a1f868f16c70867afdff05d9c7de3a6e573d2ade1ce6a48293d973f8ad68504","3a1f868f16c70867afdff05d9c7de3a6e573d2ade1ce6a48293d973f8ad68504"]]

In [15]:
z = hasher._sort_table_columns(processed_table)

In [16]:
processed_table

pyarrow.Table
name: string
file: string
----
name: [["Edgar"]]
file: [["f8efdb6bc4c7dc8eb7b439ba9b3d132733f0e73c4aa83748bb13f25133a43633"]]

In [None]:
hasher.hash_table(table)

'99ca6c8b436a17d888d65051ba7977eeec93323746e619db5b1c9ab53171566d'

In [18]:
hasher.hash_table(table)

'35d6c3043e95f2176c36188849655ac2a412562c6ccfcbec3a7a8a29a3a1eb44'

In [6]:
function_info_extractor = FunctionSignatureExtractor()
object_hasher = LegacyObjectHasher(function_info_extractor=function_info_extractor)

In [6]:
def product(x: float, y: int) -> float:
    return x * y

In [7]:
class ArrowPacketHasher:
    def hash_packet(self, packet):
        print(f"Requested to hash packet {packet}")
        return "test"

    def hash_arrow_packet(self, packet):
        print(f"Requested to hash arrow packet {packet}")
        return "test_arrow"


class MyArrowDataStore:
    def add_record(
        self,
        source_name: str,
        source_id: str,
        entry_id: str,
        arrow_data: pa.Table,
    ) -> pa.Table:
        print(
            f"Adding record to Arrow data store: {source_name}, {source_id}, {entry_id}: {arrow_data}"
        )
        return arrow_data

    def get_record(
        self, source_name: str, source_id: str, entry_id: str
    ) -> pa.Table | None:
        return None

    def get_all_records(self, source_name: str, source_id: str) -> pa.Table | None:
        """Retrieve all records for a given source as a single table."""
        return None

    def get_all_records_as_polars(
        self, source_name: str, source_id: str
    ) -> pl.LazyFrame | None:
        """Retrieve all records for a given source as a single Polars DataFrame."""
        return None


In [None]:
pod = TypedFunctionPod(product, "result")

In [11]:
pod.identity_structure()

('TypedFunctionPod',
 <function __main__.product(x: float, y: int) -> float>,
 ('result',))

In [12]:
stream = SyncStreamFromLists(
    [
        {"id": 0},
        {"id": 1},
        {"id": 2},
    ],
    [{"x": 3.0, "y": 4}, {"x": 5.0, "y": 6}, {"x": 7.0, "y": 8}],
)

In [13]:
for tag, packet in pod(stream):
    print((f"Tag: {tag}, Packet: {packet}"))

Tag: {'id': 0}, Packet: {'result': 12.0}
Tag: {'id': 1}, Packet: {'result': 30.0}
Tag: {'id': 2}, Packet: {'result': 56.0}


In [14]:
pod.output_converter.from_arrow_table(pod.output_converter.to_arrow_table(packet))

[{'result': 56.0}]

In [None]:
cached_pod = CachedFunctionPod(
    pod,
    object_hasher=object_hasher,
    arrow_hasher=ArrowPacketHasher(),
    result_store=MyArrowDataStore(),
    tag_store=MyArrowDataStore(),
)

In [16]:
cached_pod.function_pod.function(3.0, 4)

12.0

In [17]:
cached_pod.function_pod.function(3.0, 4)

12.0

In [18]:
for tag, packet in cached_pod(stream):
    print((f"Tag: {tag}, Packet: {packet}"))

Requested to hash arrow packet pyarrow.Table
x: double
y: int64
----
x: [[3]]
y: [[4]]
Requested to hash arrow packet pyarrow.Table
id: int64
__packet_key: string
----
id: [[0]]
__packet_key: [["test_arrow"]]
Adding record to Arrow data store: product, e9487308f083ecc170e2fe679ce0d30b70e7c9d7c59b532118944e461e40ba1f, test_arrow: pyarrow.Table
id: int64
__packet_key: string
----
id: [[0]]
__packet_key: [["test_arrow"]]
Requested to hash arrow packet pyarrow.Table
x: double
y: int64
----
x: [[3]]
y: [[4]]
Adding record to Arrow data store: product, e9487308f083ecc170e2fe679ce0d30b70e7c9d7c59b532118944e461e40ba1f, test_arrow: pyarrow.Table
result: double
----
result: [[12]]
Tag: {'id': 0}, Packet: {'result': 12.0}
Requested to hash arrow packet pyarrow.Table
x: double
y: int64
----
x: [[5]]
y: [[6]]
Requested to hash arrow packet pyarrow.Table
id: int64
__packet_key: string
----
id: [[1]]
__packet_key: [["test_arrow"]]
Adding record to Arrow data store: product, e9487308f083ecc170e2fe679c

In [3]:
from pathlib import Path


@typed_function_pod(["sum", "difference", "info_path"])
def add_and_subtract(a: int, b: int) -> tuple[int, int, Path]:
    """
    Adds and subtracts two integers.

    :param a: First integer.
    :param b: Second integer.
    :return: A tuple containing the sum and the difference of a and b.
    """
    return a + b, a - b, Path("local_info.json")

In [4]:
stream = SyncStreamFromLists(
    [{"name": "Edgar"}, {"name": "Alice"}, {"name": "Bob"}],
    [{"a": 5, "b": 3}, {"a": 10, "b": 2}, {"a": 7, "b": 4}],
)

In [5]:
converter = PacketConverter(
    add_and_subtract.function_output_types, add_and_subtract.registry
)

In [6]:
packets = [p for t, p in add_and_subtract(stream).flow()]

In [7]:
table = converter.to_arrow_table(packets)

In [8]:
table

pyarrow.Table
sum: int64
difference: int64
info_path: string
----
sum: [[8,12,11]]
difference: [[2,8,3]]
info_path: [["local_info.json","local_info.json","local_info.json"]]

In [11]:
import pyarrow as pa

In [20]:
tag = {"name": ["Edgar", "Names"], "age": 37}

In [21]:
tag["__packet_key"] = "some_unique_key"


In [22]:
from orcabridge.hashing.defaults import LegacyObjectHasher

LegacyObjectHasher().hash_to_hex(tag)


'51d2a5483d1623ba4582fb372a73bb7726caf1a9756af32778cbac1b8c16f6c1'

In [23]:
pa.Table.from_pylist([tag])

pyarrow.Table
name: list<item: string>
  child 0, item: string
age: int64
__packet_key: string
----
name: [[["Edgar","Names"]]]
age: [[37]]
__packet_key: [["some_unique_key"]]

In [9]:
converter.from_arrow_table(table)

[{'sum': 8, 'difference': 2, 'info_path': PosixPath('local_info.json')},
 {'sum': 12, 'difference': 8, 'info_path': PosixPath('local_info.json')},
 {'sum': 11, 'difference': 3, 'info_path': PosixPath('local_info.json')}]

In [11]:
for field in table.schema:
    print(field.metadata)

{}
{}
{b'semantic_type': b'path'}


In [7]:
import pyarrow as pa

In [13]:
pa.type(int)

AttributeError: module 'pyarrow' has no attribute 'type'

In [10]:
table = converter.to_arrow_table(packets[0])
table

pyarrow.Table
sum: int64
difference: int64
info_path: string
----
sum: [[8]]
difference: [[2]]
info_path: [["local_info.json"]]

In [None]:
import pyarrow as pa
import time


def benchmark_conversions():
    # Create test tables
    single_row = pa.table({"x": [42], "y": ["hello"], "z": [3.14]})
    multi_row = pa.table(
        {"x": list(range(1000)), "y": [f"item_{i}" for i in range(1000)]}
    )

    # Method 1: to_pydict() + post-processing
    def method1(table):
        pydict = table.to_pydict()
        if len(table) == 1:
            return {key: values[0] for key, values in pydict.items()}
        return pydict

    # Method 2: Direct scalar extraction
    def method2(table):
        if len(table) == 1:
            return {
                col_name: table.column(col_name)[0].as_py()
                for col_name in table.column_names
            }
        return table.to_pydict()

    # Method 3: Using pandas intermediate (generally slower)
    def method3(table):
        df = table.to_pandas()
        if len(df) == 1:
            return df.iloc[0].to_dict()
        return df.to_dict("list")

    # Benchmark single row
    print("Single row benchmarks:")
    for i, method in enumerate([method1, method2, method3], 1):
        start = time.time()
        for _ in range(10000):
            result = method(single_row)
        end = time.time()
        print(f"Method {i}: {end - start:.4f}s")

    print("\nMulti-row benchmarks:")
    for i, method in enumerate([method1, method2, method3], 1):
        start = time.time()
        for _ in range(1000):
            result = method(multi_row)
        end = time.time()
        print(f"Method {i}: {end - start:.4f}s")


In [24]:
benchmark_conversions()

Single row benchmarks:
Method 1: 0.3935s
Method 2: 0.2222s
Method 3: 4.3977s

Multi-row benchmarks:
Method 1: 2.4562s
Method 2: 2.4976s
Method 3: 1.6264s


In [11]:
converter.from_arrow_table(table)

[{'sum': 8, 'difference': 2, 'info_path': PosixPath('local_info.json')}]

In [None]:
b"semantic_type" in table.schema[2].metadata

True

In [None]:
k

In [12]:
converter.storage_type_info

{'sum': TypeInfo(python_type=<class 'int'>, arrow_type=DataType(int64), semantic_type='int'),
 'difference': TypeInfo(python_type=<class 'int'>, arrow_type=DataType(int64), semantic_type='int'),
 'info_path': TypeInfo(python_type=<class 'pathlib._local.Path'>, arrow_type=DataType(string), semantic_type='path')}

In [11]:
packets

[{'sum': 8, 'difference': 2, 'info_path': PosixPath('local_info.json')},
 {'sum': 12, 'difference': 8, 'info_path': PosixPath('local_info.json')},
 {'sum': 11, 'difference': 3, 'info_path': PosixPath('local_info.json')}]

In [44]:
store_data = to_store(packet)

In [45]:
arrow_packet = convert_packet_to_arrow_table(
    packet, add_and_subtract.function_output_types, add_and_subtract.registry
)

In [46]:
convert_arrow_tablet_to_packet

NameError: name 'convert_arrow_tablet_to_packet' is not defined

In [71]:
arrow_packet.schema

sum: int64
difference: int64
info_path: string

In [None]:
with pa.OSFile("arraydata.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema=arrow_packet.schema) as writer:
        writer.write(batch)

AttributeError: module 'pyarrow' has no attribute 'save_table'

In [None]:
arrow_packet

pyarrow.Table
sum: int64
difference: int64
info_path: string
----
sum: [[8]]
difference: [[2]]
info_path: [["local_info.json"]]

In [50]:
from deltalake import write_deltalake, DeltaTable

write_deltalake("tmp/another-table", arrow_packet, mode="append")

table = DeltaTable("tmp/another-table")

In [55]:
pl.DataFrame(table.to_pyarrow_table())

sum,difference,info_path
i64,i64,str
8,2,"""local_info.json"""


In [23]:
import polars as pl

In [None]:
pl.

AttributeError: 'deltalake._internal.Schema' object has no attribute 'to_pyarrow'

In [28]:
pl.DataFrame(table.to_pyarrow_table())

sum,difference
i64,i64
8,2
8,2
8,2


In [27]:
table.to_pyarrow_table()

pyarrow.Table
sum: int64
difference: int64
----
sum: [[8],[8],[8]]
difference: [[2],[2],[2]]

In [None]:
read_

In [10]:
add_and_subtract.registry

<orcabridge.types.registry.TypeRegistry at 0x7f80285901a0>

In [5]:
add_and_subtract(stream).flow()

[({'name': 'Edgar'}, {'sum': 8, 'difference': 2}),
 ({'name': 'Alice'}, {'sum': 12, 'difference': 8}),
 ({'name': 'Bob'}, {'sum': 11, 'difference': 3})]

In [6]:
add_and_subtract

FunctionPod:module:__main__ name:_original_add_and_subtract params:(a: int, b: int) returns:tuple[int, int] ⇒ ['sum', 'difference']

In [None]:
add_and_subtract.

<orcabridge.streams.SyncStreamFromGenerator at 0x7fe79052c050>

## Working with streams

`Stream` is fundamental to Orcapod data pipeline, representing *edges* in a directed acyclic graph (DAG) of an Orcapod pipeline. `Stream` is best thought of as a flowing stream of `packets` -- a unit of data in Oracpod. A `packet` is essentially a ditionary mapping argument names to a `pathset` (that is, one or more files with arbitrary nesting). Ultimately, a pod will receive and work on the `packet`, looking up the pathset that matches the expected argument names defined as the inputs into the pod. Before we explore creating and using `pod`, we will create a very basic `stream` called `GlobStream`, sourcing from a directory. A packet is formed for each file that matches the specified *glob* pattern.

Let's create a data source out of all `*.txt` files found in the folder `examples/dataset1`

In [15]:
%ls ../examples/dataset1

[0m[01;32mday1.txt[0m*  [01;32mday2.txt[0m*  [01;32mday3.txt[0m*  [01;32mday4.txt[0m*  [01;32mday6.txt[0m*


In [16]:
dataset1 = ob.GlobSource("txt_file", "../examples/dataset1", "*.txt")

We can then obtain `stream` from a `source` by invoking the source with `Source()`. The return `stream` acts as an iterator over the `packet` and its `tag`.
For convenience, `source` can be treated synonymously with a `stream`, allowing you to directly iterate over the content.

In [17]:
for tag, packet in dataset1():
    print(f"Packet {packet} with tag {tag}")

Packet {'txt_file': PosixPath('../examples/dataset1/day1.txt')} with tag {'file_name': 'day1'}
Packet {'txt_file': PosixPath('../examples/dataset1/day2.txt')} with tag {'file_name': 'day2'}
Packet {'txt_file': PosixPath('../examples/dataset1/day3.txt')} with tag {'file_name': 'day3'}
Packet {'txt_file': PosixPath('../examples/dataset1/day4.txt')} with tag {'file_name': 'day4'}
Packet {'txt_file': PosixPath('../examples/dataset1/day6.txt')} with tag {'file_name': 'day6'}


In [18]:
# equivalent to above but more natural without the need to call `dataset1()`
for tag, packet in dataset1:
    print(f"Packet {packet} with tag {tag}")

Packet {'txt_file': PosixPath('../examples/dataset1/day1.txt')} with tag {'file_name': 'day1'}
Packet {'txt_file': PosixPath('../examples/dataset1/day2.txt')} with tag {'file_name': 'day2'}
Packet {'txt_file': PosixPath('../examples/dataset1/day3.txt')} with tag {'file_name': 'day3'}
Packet {'txt_file': PosixPath('../examples/dataset1/day4.txt')} with tag {'file_name': 'day4'}
Packet {'txt_file': PosixPath('../examples/dataset1/day6.txt')} with tag {'file_name': 'day6'}


A few things to note. When creating the `GlobSource` we pass in the argument name to be associated with the `pathset` matching our glob pattern (`*.txt` in this case). By default, the `GlobSource` tags each packet with a key of `file_name` and value of the name of the file that was matched (minus the file extension). This behavior can be easily changed by passing in a custom function for tag generation at the time of `GlobSource` creation.

In [19]:
from pathlib import Path

dataset1_custom = ob.GlobSource(
    "data",
    "../examples/dataset1",
    "*.txt",
    tag_function=lambda x: {"date": Path(x).stem},
)

In [20]:
for tag, packet in dataset1_custom:
    print(f"Packet {packet} with tag {tag}")

Packet {'data': PosixPath('../examples/dataset1/day1.txt')} with tag {'date': 'day1'}
Packet {'data': PosixPath('../examples/dataset1/day2.txt')} with tag {'date': 'day2'}
Packet {'data': PosixPath('../examples/dataset1/day3.txt')} with tag {'date': 'day3'}
Packet {'data': PosixPath('../examples/dataset1/day4.txt')} with tag {'date': 'day4'}
Packet {'data': PosixPath('../examples/dataset1/day6.txt')} with tag {'date': 'day6'}


Custom tag function would allow one to extract information useful in controlling the flow of the data pipeline from the file path or even the file content. We will return to this a bit later.

In general, a packet is generated and starts flowing into a `stream` **only** when you ask for it by iterating through the elements. This allows for a series of streams and pods to be chained together without immediately invoking any computation.

Let's go ahead and load another source from a folder containing multiple `*.bin` files, representing data collected on different days.

In [21]:
dataset2 = ob.GlobSource("bin_data", "../examples/dataset2", "*.bin")

for tag, packet in dataset2:
    print(f"Packet {packet} with tag {tag}")

Packet {'bin_data': PosixPath('../examples/dataset2/session_day1.bin')} with tag {'file_name': 'session_day1'}
Packet {'bin_data': PosixPath('../examples/dataset2/session_day3.bin')} with tag {'file_name': 'session_day3'}
Packet {'bin_data': PosixPath('../examples/dataset2/session_day4.bin')} with tag {'file_name': 'session_day4'}
Packet {'bin_data': PosixPath('../examples/dataset2/session_day5.bin')} with tag {'file_name': 'session_day5'}


Now we have two streams to work with, let's explore how we can manipulate/control the flow of streams using `operations` and, specifically, `mapper` operations.

## Manipulating streams with `operations`

As defined ealier in the [core concepts](./01_orcabridge_core_concepts%20copy.ipynb#core-concepts), we refer to any computation/transformation that works on stream(s) as `operations` in the pipeline. If the Orcapod pipeline were to be viewed as a DAG, the `streams` would be the edges connecting *nodes* that are the `operations`. 

`Operations` can be divided into three categories based on their roles in the processing and manipulating streams. `Source`, `Mappers` and `Pods`.  We have already seen an example of `Source` earlier when we worked with `GlobSource`. Officially, `Source` is an `operation` that produces a `stream` without taking in any inputs. They are best thought of as entry points of data into the pipeline.



`Mappers` are `operations` that controls and alter the streams but *without generating or modifying new data files*. As we will see shortly, `mappers` work to alter the stream by alterning packet tags and/or packet content, but critically will never create or modify new files that were not already present somewhere in the stream feeding into the `mapper` node. While this might sound like an unnecessary restriction on what `mappers` can do, we will see that this property guarantees that *mappers can not ever alter the reproducibility of computational chains*.

The third category of `operations` are `Pods`, these operations are **allowed to generate and flow new files into the streams** *based on* inputs they receive from other streams. Aside from `Source`, which takes no inputs, `Pods` are the only operations that can introduce new files into the stream.

We will explore pods in great detail later. First let's get to know `mappers`.

### Controling data streams with `Mappers`

Once you have created a `source` from which streams can be formed, you can alter the stream by applying various `mappers`. More precisely, a `mapper` can work on tags and/or packets.

### Map packets
Likely one of the most common mapper operation to be found in Orcapod pipeline is `MapPackets` mapper. As the name implies, it let's you alter the keys (argument names) found in the `packet`.

In [22]:
print("Before mapping:")
for tag, packet in dataset1:
    print(f"Packet {packet} with tag {tag}")


# create a new stream mapping packet keys 'txt_file' to 'content'
packet_mapper = ob.MapPackets(key_map={"txt_file": "content"})

print("After mapping:")
for tag, packet in packet_mapper(dataset1):
    print(f"Mapped Packet {packet} with tag {tag}")

Before mapping:
Packet {'txt_file': PosixPath('../examples/dataset1/day1.txt')} with tag {'file_name': 'day1'}
Packet {'txt_file': PosixPath('../examples/dataset1/day2.txt')} with tag {'file_name': 'day2'}
Packet {'txt_file': PosixPath('../examples/dataset1/day3.txt')} with tag {'file_name': 'day3'}
Packet {'txt_file': PosixPath('../examples/dataset1/day4.txt')} with tag {'file_name': 'day4'}
Packet {'txt_file': PosixPath('../examples/dataset1/day6.txt')} with tag {'file_name': 'day6'}
After mapping:
Mapped Packet {'content': PosixPath('../examples/dataset1/day1.txt')} with tag {'file_name': 'day1'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day2.txt')} with tag {'file_name': 'day2'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day3.txt')} with tag {'file_name': 'day3'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day4.txt')} with tag {'file_name': 'day4'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day6.txt')} with tag {'file_name

You'd notice that for each packet, the key `txt_file` was replaced with `content` without altering the pointed `path` or the associated tag. As the keys of the packets will be used as the name of arguments when invoking pods on a stream, we will see that `MapPackets` are commonly used to *map* the correct path to the argument.

### Map tags
As we have already seen, each packet in the stream is associated with a tag, often derived from the data source. In the case of `GlobFileSource`, the tags are by default the name of the file that formed the packet. These tags are used to *transiently* identify the packet and will be used when matching packets across multiple streams (as we will see shortly in `Join` operation). You can manipulate the tags using `MapTags` operation, much like `MapKeys` but operating on the tags for each packaet under a uniform renaming rule.

In [23]:
tag_mapper = ob.MapTags(key_map={"file_name": "day"})

for tag, packet in tag_mapper(dataset1):
    print(tag, packet)

{'day': 'day1'} {'txt_file': PosixPath('../examples/dataset1/day1.txt')}
{'day': 'day2'} {'txt_file': PosixPath('../examples/dataset1/day2.txt')}
{'day': 'day3'} {'txt_file': PosixPath('../examples/dataset1/day3.txt')}
{'day': 'day4'} {'txt_file': PosixPath('../examples/dataset1/day4.txt')}
{'day': 'day6'} {'txt_file': PosixPath('../examples/dataset1/day6.txt')}


### Chaining operations

As you might expect, you can chain multiple operations one after another to construct a more complex stream. Below, we first apply the key mapping and then map tags.

In [24]:
packet_mapper = ob.MapPackets(key_map={"txt_file": "content"})
key_mapped_stream = packet_mapper(dataset1)

tag_mapper = ob.MapTags(key_map={"file_name": "day"})
tag_and_packet_mapped = tag_mapper(key_mapped_stream)

for tag, packet in tag_and_packet_mapped:
    print(f"Mapped Packet {packet} with tag {tag}")

Mapped Packet {'content': PosixPath('../examples/dataset1/day1.txt')} with tag {'day': 'day1'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day2.txt')} with tag {'day': 'day2'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day3.txt')} with tag {'day': 'day3'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day4.txt')} with tag {'day': 'day4'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day6.txt')} with tag {'day': 'day6'}


It's worth emphasizing again that all computations are triggered only when you iterate through the final stream `tag_and_key_mapped`

Although not recommended as it reduces readability, you can create and immediately apply `mapper` to achieve the same processing in a fewer lines of code (albeit, with worse readability):

In [25]:
# totally valid, but difficult to read and thus not recommended
for tag, packet in ob.MapTags(key_map={"file_name": "day"})(
    ob.MapPackets(key_map={"txt_file": "content"})(dataset1)
):
    print(f"Mapped Packet {packet} with tag {tag}")

Mapped Packet {'content': PosixPath('../examples/dataset1/day1.txt')} with tag {'day': 'day1'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day2.txt')} with tag {'day': 'day2'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day3.txt')} with tag {'day': 'day3'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day4.txt')} with tag {'day': 'day4'}
Mapped Packet {'content': PosixPath('../examples/dataset1/day6.txt')} with tag {'day': 'day6'}


### Joining multiple streams into a single stream
Now that we have looked at how you can manipulate a single stream, let's turn our eyes to how you can work with more than one streams together.

By the far the most common multi-stream operations will be to join two (or more) streams into a single, bigger stream. 
You can combine multiple streams into one by using `Join` operation, matching packets from each stream based on the matching tags. If tags from two streams have shared key, the value must be identical for all shared keys for the two packets to be matched. The matched packets are then merged into a one (typically larger) packet and shipped to the output stream.

Let's see what happens if we join `dataset1` and `dataset2`, where:

In [26]:
# dataset 1
print("Dataset 1:")
for tag, packet in dataset1:
    print(f"Tag: {tag}, Packet: {packet}")

# dataset 2
print("\nDataset 2:")
for tag, packet in dataset2:
    print(f"Tag: {tag}, Packet: {packet}")

Dataset 1:
Tag: {'file_name': 'day1'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day1.txt')}
Tag: {'file_name': 'day2'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day2.txt')}
Tag: {'file_name': 'day3'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day3.txt')}
Tag: {'file_name': 'day4'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day4.txt')}
Tag: {'file_name': 'day6'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day6.txt')}

Dataset 2:
Tag: {'file_name': 'session_day1'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day1.bin')}
Tag: {'file_name': 'session_day3'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day3.bin')}
Tag: {'file_name': 'session_day4'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day4.bin')}
Tag: {'file_name': 'session_day5'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day5.bin')}


Any guess what would happen?

In [27]:
join_op = ob.Join()

for tag, packet in join_op(dataset1, dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

You may be surprised to see that the joined stream is completely empty! This is because packets from both streams were tagged with key `file_name`, causing the `Join` to combine packets only if the value of `file_name` matches exactly. Since no filenames matched, the resulting stream was empty!

This is where we can make use of the other `mappers` to our advantage and achieve more useful join.

First, let's completely rename the tag key for one of the streams and see what would happen.

In [28]:
dataset1_retagged = ob.MapTags(key_map={"file_name": "day"})(dataset1)

for i, (tag, packet) in enumerate(join_op(dataset1_retagged, dataset2)):
    print(f"{i + 1:02d} Tag: {tag}, Packet: {packet}")

01 Tag: {'day': 'day1', 'file_name': 'session_day1'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day1.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day1.bin')}
02 Tag: {'day': 'day1', 'file_name': 'session_day3'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day1.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day3.bin')}
03 Tag: {'day': 'day1', 'file_name': 'session_day4'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day1.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day4.bin')}
04 Tag: {'day': 'day1', 'file_name': 'session_day5'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day1.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day5.bin')}
05 Tag: {'day': 'day2', 'file_name': 'session_day1'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day2.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day1.bin')}
06 Tag: {'day': 'day2', 'file_name': 'session_day3'}, Packet: {'txt_file': PosixPath(

We are now getting something -- in fact, quite a few things. If you look carefully at the `packet`, you'll notice that it now contains two keys/arguments -- `txt_file` and `bin_data`, combining the packets from the two datasets. 

The `tags` also now contain two keys `day` from the re-tagged dataset1 stream and `file_name` from unchanged dataset2 stream.

Since the two streams share no common tags, the `Join` operation results in *full-multiplexing* of two streams. With the streams from dataset1 and dataset2 containing 5 packet and 4 packets, respectively, you get $5 \times 4 = 20$ packets

However, it is not all too useful if all `Join` can do is to produce either 0 packet or a full combination of packets from two streams. The true value of `Join` lies in its ability to match two packets that are *related* to each other. 

In our example datasets, you likely noticed that files from both datasets are associated with a day. Let's now try to join the two dataset streams by matching by the day!

Although we could achieve the desired effect by changing how we load the source, passing in custom `tag_function` into `GlobSource`, let's achieve the same by using another `mapper` called `Transform`. `Transform` effectively combines `MapKey` and `MapTag` but further allows you to provide a function that will receive the tag and packet, one at a time, and return a (potentially modified) tag and/or packet, achieving the desired transformation.

In [30]:
def transform_dataset2(tag, packet):
    # Extract the second half of the filename containing day
    new_tag = {"day": tag["file_name"].split("_")[1]}
    return new_tag, packet


# Speical mappers like transform can be found in the orcabridge.mapper module
dataset2_transformer = ob.mapper.Transform(transform_dataset2)

retagged_dataset2 = dataset2_transformer(dataset2)

for tag, packet in retagged_dataset2:
    print(f"Tag: {tag}, Packet: {packet}")

Tag: {'day': 'day1'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day1.bin')}
Tag: {'day': 'day3'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day3.bin')}
Tag: {'day': 'day4'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day4.bin')}
Tag: {'day': 'day5'}, Packet: {'bin_data': PosixPath('../examples/dataset2/session_day5.bin')}


Now we have dataset2 packets tagged with `day`, let's `join`` with a mapped dataset1!

In [31]:
# change filename to day for dataset1
tag_mapper = ob.MapTags(key_map={"file_name": "day"})
retagged_dataset1 = tag_mapper(dataset1)

join_op = ob.Join()
joined_stream = join_op(retagged_dataset1, retagged_dataset2)

for tag, packet in joined_stream:
    print(f"Tag: {tag}, Packet: {packet}")

Tag: {'day': 'day1'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day1.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day1.bin')}
Tag: {'day': 'day3'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day3.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day3.bin')}
Tag: {'day': 'day4'}, Packet: {'txt_file': PosixPath('../examples/dataset1/day4.txt'), 'bin_data': PosixPath('../examples/dataset2/session_day4.bin')}


Nice! We have now formed a stream where packets from two streams are paired meaningfully based on matching `day`!

Now we have explored quite a bit on how to manipulate data stream using `mapper` operations, it's time to turn to the other half ot he operations: `pods`

## Introducing new files into stream with `Pod`

While `mapper` operations are useful in altering tags, packets, and in combining multiple streams, a data pipeline is not really useful if it cannot produce new resultsin the form of new data -- that is, introduce new files into the stream. This is precisely where `Pod` operations come in!

In fact, we have already been working with a `pod` all along -- `sources`. If you think about it, `sources` also introduce files into the stream. It is just special in that it takes no input streams (hence the name, `source`).

We now will explore how you can create a more common type of pod -- a *function* `pod` that takes in a stream and return a new stream potentially introducing entirely new data file!

### Working with `FunctionPod`

The easiest way to create a function-like `pod` is to create a `FunctionPod`, passing in a Python function. Let's start by creating a pod that will count the number of lines in a file.

We first define the function.

In [32]:
from os import PathLike


def count_lines(txt_file: PathLike) -> None:
    with open(txt_file, "r") as f:
        lines = f.readlines()
    print(f"File {txt_file} has {len(lines)} lines.")

Next we instantiate a function pod from the function.

In [33]:
# create a function pod
function_pod = ob.FunctionPod(count_lines, output_keys=[])

Once function pod is available, you can execute it on any compatible stream

In [34]:
# apply the function pod on a stream
processed_stream = function_pod(dataset1)

for tag, packet in processed_stream:
    print(f"Tag: {tag}, Packet: {packet}")

File ../examples/dataset1/day1.txt has 24 lines.
Tag: {'file_name': 'day1'}, Packet: {}
File ../examples/dataset1/day2.txt has 15 lines.
Tag: {'file_name': 'day2'}, Packet: {}
File ../examples/dataset1/day3.txt has 27 lines.
Tag: {'file_name': 'day3'}, Packet: {}
File ../examples/dataset1/day4.txt has 22 lines.
Tag: {'file_name': 'day4'}, Packet: {}
File ../examples/dataset1/day6.txt has 22 lines.
Tag: {'file_name': 'day6'}, Packet: {}


Notice that the returned `packet` is empty because the function returns no values. Such a function pod may still be useful for achieving computations/processing via *side effects* (e.g., submitting HTTP requests in the function body)l, but it is not the standard approach in performing computations where you'd want the results to persis.

Next, let's see how to achieve more common scenario where you perform some computation and you now would like to save the result into a file. Dataset2 binary actually contains a list of floats values. Let's define a function to compute a few statistics and save them to a file in a temporary directory.

In [35]:
import json
import tempfile

import numpy as np


def compute_stats(bin_file: PathLike, output_file=None):
    print("Computing stats for file:", bin_file)
    # create a temporary file to store the status and return the file path
    with open(bin_file, "rb") as f:
        data = f.read()
    data = np.frombuffer(data)
    print(data)
    data_stats = {}
    data_stats["mean"] = np.mean(data)
    data_stats["std"] = np.std(data)
    data_stats["min"] = np.min(data)
    data_stats["max"] = np.max(data)
    data_stats["n_elements"] = len(data)

    # if output_file is none, create a temporary file. Else, use the given output_file to save the data_stats
    if output_file is None:
        output_file = Path(tempfile.mkdtemp()) / "statistics.json"
    # write as json
    with open(output_file, "w") as f:
        json.dump(data_stats, f)
    return output_file

In [36]:
fp_stats = ob.FunctionPod(compute_stats, output_keys=["stats"])

# change the key from 'bin_data' to 'bin_file', matching the function's input
mapped_dataset2 = ob.MapPackets(key_map={"bin_data": "bin_file"})(dataset2)

for tag, packet in fp_stats(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

Computing stats for file: ../examples/dataset2/session_day1.bin
[-1.08209134 -0.66806394  0.42870206 -0.09321731 -3.14078305  1.33520433
  1.11085152  1.31931842 -1.19915697  0.07701737  1.30020807  0.27541194
  0.84430062  0.18236837 -0.83039631 -1.66166191  0.8720775  -1.72170657
 -0.01962253 -0.18050553  1.35478472  0.69928177  0.7314272  -0.06915687
 -0.08364667 -0.45551653  0.70752188  1.02283734 -0.18612795  0.8767394
 -1.542636    1.04685484 -2.1311672  -1.34874222  0.61977577 -0.33880262
  0.6624482   0.60257325 -3.04901544 -0.20685843 -0.08997232  0.88932232]
Tag: {'file_name': 'session_day1'}, Packet: {'stats': PosixPath('/tmp/tmpm2wka6il/statistics.json')}
Computing stats for file: ../examples/dataset2/session_day3.bin
[ 0.56114059 -1.34902274  1.0665563   0.71890802  0.65244834  1.04369548
  0.54872876  2.19365207  0.53864286 -1.44108823 -0.55651539  0.1603561
 -0.93869224  0.64645323 -1.08815337  1.40972393 -0.14662931  1.34692375
  0.38400938 -1.23004316  1.34426647 -0.07

Note that in our function `compute_stats`, the computed stats are saved as `json` file into a temporary file. While this works to pass data from one to another within the pipeline, the result cannot be easily retrieved outside of the immediate usage. In fact, the computation result is very likely to disappear in some time (afterall, it's a temporary file). In fact, if you were to execute the same computation by iterating the second time over `stats_stream`, you will see that it invokes the functions yet again, and produces an entirely different set of temporary files. Since the content of computation didn't change, this is cearly quite wasteful!

In [37]:
# everytime you run the following loop, new computations are performed and
# saved in a different set of temporary files
for tag, packet in fp_stats(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

Computing stats for file: ../examples/dataset2/session_day1.bin
[-1.08209134 -0.66806394  0.42870206 -0.09321731 -3.14078305  1.33520433
  1.11085152  1.31931842 -1.19915697  0.07701737  1.30020807  0.27541194
  0.84430062  0.18236837 -0.83039631 -1.66166191  0.8720775  -1.72170657
 -0.01962253 -0.18050553  1.35478472  0.69928177  0.7314272  -0.06915687
 -0.08364667 -0.45551653  0.70752188  1.02283734 -0.18612795  0.8767394
 -1.542636    1.04685484 -2.1311672  -1.34874222  0.61977577 -0.33880262
  0.6624482   0.60257325 -3.04901544 -0.20685843 -0.08997232  0.88932232]
Tag: {'file_name': 'session_day1'}, Packet: {'stats': PosixPath('/tmp/tmpciwa2xl_/statistics.json')}
Computing stats for file: ../examples/dataset2/session_day3.bin
[ 0.56114059 -1.34902274  1.0665563   0.71890802  0.65244834  1.04369548
  0.54872876  2.19365207  0.53864286 -1.44108823 -0.55651539  0.1603561
 -0.93869224  0.64645323 -1.08815337  1.40972393 -0.14662931  1.34692375
  0.38400938 -1.23004316  1.34426647 -0.07

In the next section we will see how we can have the computation restuls stored using storage-backed function pods.

### [Technical aside] Caching stream

**NOTE**: This section concerns an implementation detail of `Oracbridge` that is not fundamentally related to the design of the system. In particular, the issue described in this section (and the associated *solution*) is not relevant to the full-implementation that `Orcapod` will be. If you are reading this document primarily to understand the concepts essential to Orcapod, you are advised to skip this section entirely. However, if you intend to make use of `oracabridge` in an actual application, read on to learn critical limitations associated with single-producer single-consumer (SPSC) design of the `orcabridge` and how you can ameloiorate this using `CacheStream` mapper effectively within your pipeline.

In [38]:
# create a cache stream operation
cache_stream = ob.mapper.CacheStream()
# change the key from 'bin_data' to 'bin_file', matching the function's input
mapped_dataset2 = ob.MapPackets(key_map={"bin_data": "bin_file"})(dataset2)
stats_stream = fp_stats(mapped_dataset2)

# now cache the stream
cached_stream = cache_stream(stats_stream)

# iterate over the cached stream
for tag, packet in cached_stream:
    print(f"Tag: {tag}, Packet: {packet}")

HashableMixin.__hash__ called on CacheStream instance without identity_structure() implementation. Falling back to super().__hash__() which is not stable across sessions.


Computing stats for file: ../examples/dataset2/session_day1.bin
[-1.08209134 -0.66806394  0.42870206 -0.09321731 -3.14078305  1.33520433
  1.11085152  1.31931842 -1.19915697  0.07701737  1.30020807  0.27541194
  0.84430062  0.18236837 -0.83039631 -1.66166191  0.8720775  -1.72170657
 -0.01962253 -0.18050553  1.35478472  0.69928177  0.7314272  -0.06915687
 -0.08364667 -0.45551653  0.70752188  1.02283734 -0.18612795  0.8767394
 -1.542636    1.04685484 -2.1311672  -1.34874222  0.61977577 -0.33880262
  0.6624482   0.60257325 -3.04901544 -0.20685843 -0.08997232  0.88932232]
Tag: {'file_name': 'session_day1'}, Packet: {'stats': PosixPath('/tmp/tmpukvddhuv/statistics.json')}
Computing stats for file: ../examples/dataset2/session_day3.bin
[ 0.56114059 -1.34902274  1.0665563   0.71890802  0.65244834  1.04369548
  0.54872876  2.19365207  0.53864286 -1.44108823 -0.55651539  0.1603561
 -0.93869224  0.64645323 -1.08815337  1.40972393 -0.14662931  1.34692375
  0.38400938 -1.23004316  1.34426647 -0.07

The first time we iterate over the `cached_stream`, you see that the function `compute_stats` is getting executed as we'd expect. However, it's when running it the second time you'd notice something is different.

In [40]:
for tag, packet in cached_stream:
    print(f"Tag: {tag}, Packet: {packet}")

Tag: {'file_name': 'session_day1'}, Packet: {'stats': PosixPath('/tmp/tmpukvddhuv/statistics.json')}
Tag: {'file_name': 'session_day3'}, Packet: {'stats': PosixPath('/tmp/tmpat3rm4dk/statistics.json')}
Tag: {'file_name': 'session_day4'}, Packet: {'stats': PosixPath('/tmp/tmpuj3tiu8k/statistics.json')}
Tag: {'file_name': 'session_day5'}, Packet: {'stats': PosixPath('/tmp/tmp6yohu0pw/statistics.json')}


Since the output packets from `stats_stream` have been cached, iterating through `cached_stream` for the second time simply returned the cached packets without causing new computation. Although this may sound like a good way to prevent recomputing the same thing more than once, `CacheStream` comes with significant demerits. Since all observed packets are stored in memory, having too many `CacheStream` in the pipeline may be very memory resource heavy. Also, unlike store-backed function, as we'll see shortly, `CacheStream` stores the packets as seen from one iteration of the underlying stream. If the underlying stream would have produced new and diffirent packets (e.g., because additional `bin` files are added to the dataset), `CacheStream` won't be able to update itself without you explicitly clearing the cache. Finally, unlike storage backed function pod, computation is *not memoized* and thus same exact computation may still take place if two or more packets are identical in the content and thus would have yielded identical output.

## Using storage-backed function pod

Although the simple `FunctionPod` worked as expected, it's lack of ability to store computation results significantly limits its utility. You certainly wouldn't want to be computing everything from scratch if it can be avoided.

The good news is that you can easily equip a function pod with an ability to store and retrieve previously stored packets. All you have to do is create an instance of `DataStore` and pass it in at the construction of the `FunctionPod`.

Here we are going to configure and use `DirDataStore` where all `packets` and output `packet` contents are stored in a designated directory.

In [41]:
data_store = ob.DirDataStore("./pod_data")

In [42]:
# use default storage directory of './pod_data'. You could specify a different directory by passing `store_dir` argument
fp_stats_stored = ob.FunctionPod(
    compute_stats, output_keys=["stats"], data_store=data_store
)

Now your `FunctionPod` is equipped with an ability to store and retrieve stored packets!

In [43]:
for tag, packet in fp_stats_stored(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

Tag: {'file_name': 'session_day1'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-c63c34c5fefaa0f2e9bba3edcf6c861c/statistics.json'}
Tag: {'file_name': 'session_day3'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-9dfef842f88463f5145ab0d4c06e3938/statistics.json'}
Tag: {'file_name': 'session_day4'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-4f3dfa71356fe8f226be66aa8dffbc55/statistics.json'}
Tag: {'file_name': 'session_day5'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-26bffc293c82e14cde904274e0c63afd/statistics.json'}


As before, the very first time you run, all computations take place. Now watch what happens when you run it again.

In [44]:
for tag, packet in fp_stats_stored(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

Tag: {'file_name': 'session_day1'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-c63c34c5fefaa0f2e9bba3edcf6c861c/statistics.json'}
Tag: {'file_name': 'session_day3'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-9dfef842f88463f5145ab0d4c06e3938/statistics.json'}
Tag: {'file_name': 'session_day4'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-4f3dfa71356fe8f226be66aa8dffbc55/statistics.json'}
Tag: {'file_name': 'session_day5'}, Packet: {'stats': 'pod_data/compute_stats/15da3b08791f51d9/sha256-26bffc293c82e14cde904274e0c63afd/statistics.json'}


Notice that this time, the function `compute_stats` was **not** invoked. Rather the computation results from the previous run were *memoized* and *retrieved*, sparing us the unecessary computation!