# Using Orcabridge

In this notebook, we will explore the basic usage of the Orcabridge library.

Below we walk through the `orcabridge` package, enumerating its core components. Many of these correspond directly to the [core concepts](./01_orcabridge_core_concepts%20copy.ipynb) introduced in [part 1](./01_orcabridge_core_concepts%20copy.ipynb).

## Working with streams

`Stream` is fundamental to an Orcapod data pipeline, representing the *edges* in the directed acyclic graph (DAG) of the pipeline. A `Stream` is best thought of as a flowing stream of `packets` -- the unit of data in Orcapod. A `packet` is essentially a dictionary mapping argument names to a `pathset` (that is, one or more files with arbitrary nesting). Ultimately, a pod will receive and work on the `packet`, looking up the pathsets that match the argument names it expects as inputs. Before we explore creating and using a `pod`, we will create a very basic `stream` from a `GlobSource`, which sources its data from a directory. A packet is formed for each file that matches the specified *glob* pattern.

In [None]:
from orcabridge.source import GlobSource

Let's create a data source out of all `*.txt` files found in the folder `examples/dataset1`.

In [None]:
ls ../examples/dataset1

In [None]:
dataset1 = GlobSource("txt_file", "../examples/dataset1", "*.txt")

We can then obtain a `stream` from a `source` by calling it, as in `dataset1()`. The returned `stream` acts as an iterator over each `packet` and its `tag`.
For convenience, a `source` can be treated synonymously with a `stream`, allowing you to iterate directly over its content.

In [None]:
for tag, packet in dataset1():
    print(f"Packet {packet} with tag {tag}")

In [None]:
# equivalent to above but more natural without the need to call `dataset1()`
for tag, packet in dataset1:
    print(f"Packet {packet} with tag {tag}")

A few things to note. When creating the `GlobSource`, we pass in the argument name to be associated with the `pathset` matching our glob pattern (`*.txt` in this case). By default, `GlobSource` tags each packet with the key `file_name` and a value equal to the name of the matched file (minus the file extension). This behavior can easily be changed by passing a custom tag-generation function when creating the `GlobSource`.

In [None]:
from pathlib import Path

dataset1_custom = GlobSource(
    "data",
    "../examples/dataset1",
    "*.txt",
    tag_function=lambda x: {"date": Path(x).stem},
)

In [None]:
for tag, packet in dataset1_custom:
    print(f"Packet {packet} with tag {tag}")

A custom tag function lets you extract information useful for controlling the flow of the data pipeline from the file path or even the file content. We will return to this a bit later.
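To make the relationship between files, tags, and packets concrete, here is a minimal plain-Python sketch of what a glob-based source does conceptually. It is not how `GlobSource` is implemented; the helper `glob_packets` and the demo file names are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def default_tag(path):
    # Assumed default, mirroring the description above: file name without extension
    return {"file_name": Path(path).stem}


def glob_packets(name, directory, pattern, tag_function=default_tag):
    """Yield one (tag, packet) pair per file matching the glob pattern."""
    for path in sorted(Path(directory).glob(pattern)):
        yield tag_function(path), {name: path}


# Build a throwaway directory so the sketch runs anywhere
demo_dir = Path(tempfile.mkdtemp())
for day in ("day1", "day2"):
    (demo_dir / f"{day}.txt").write_text("hello\n")

default_tags = [tag for tag, _ in glob_packets("txt_file", demo_dir, "*.txt")]
custom_tags = [
    tag
    for tag, _ in glob_packets(
        "data", demo_dir, "*.txt", tag_function=lambda p: {"date": Path(p).stem}
    )
]
print(default_tags)
print(custom_tags)
```

The only difference between the two runs is the function that turns a path into a tag, which is why swapping it is enough to change how packets are later matched and routed.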

In general, a packet is generated and starts flowing into a `stream` **only** when you ask for it by iterating through the elements. This allows for a series of streams and pods to be chained together without immediately invoking any computation.

Let's go ahead and load another source from a folder containing multiple `*.bin` files, representing data collected on different days.

In [None]:
dataset2 = GlobSource("bin_data", "../examples/dataset2", "*.bin")

for tag, packet in dataset2:
    print(f"Packet {packet} with tag {tag}")

Now that we have two streams to work with, let's explore how we can manipulate and control the flow of streams using `operations` and, specifically, `mapper` operations.

## Manipulating streams with `operations`

As defined earlier in the [core concepts](./01_orcabridge_core_concepts%20copy.ipynb#core-concepts), we refer to any computation/transformation that works on stream(s) as `operations` in the pipeline. If the Orcapod pipeline were to be viewed as a DAG, the `streams` would be the edges connecting the *nodes*, which are the `operations`.

`Operations` can be divided into three categories based on their roles in processing and manipulating streams: `Source`, `Mappers` and `Pods`. We have already seen an example of a `Source` when we worked with `GlobSource`. Formally, a `Source` is an `operation` that produces a `stream` without taking in any inputs. Sources are best thought of as the entry points of data into the pipeline.



`Mappers` are `operations` that control and alter streams but *without generating or modifying data files*. As we will see shortly, `mappers` alter the stream by altering packet tags and/or packet content, but critically they never create new files or modify files that were already present somewhere in the stream feeding into the `mapper` node. While this might sound like an unnecessary restriction on what `mappers` can do, we will see that this property guarantees that *mappers can never alter the reproducibility of computational chains*.

The third category of `operations` is `Pods`. These operations are **allowed to generate and flow new files into the streams** *based on* inputs they receive from other streams. Aside from `Source`, which takes no inputs, `Pods` are the only operations that can introduce new files into the stream.

We will explore pods in great detail later. First let's get to know `mappers`.

### Controlling data streams with `Mappers`

Once you have created a `source` from which streams can be formed, you can alter the stream by applying various `mappers`. More precisely, a `mapper` can work on tags and/or packets.

### Map keys
Likely one of the most common mapper operations found in an Orcapod pipeline is the `MapKeys` mapper. As the name implies, it lets you alter the keys (argument names) found in the `packet`.

In [None]:
from orcabridge.mapper import MapKeys

print("Before mapping:")
for tag, packet in dataset1:
    print(f"Packet {packet} with tag {tag}")


# create a new stream mapping packet keys 'txt_file' to 'content'
key_mapper = MapKeys(key_map={"txt_file": "content"})

print("After mapping:")
for tag, packet in key_mapper(dataset1):
    print(f"Mapped Packet {packet} with tag {tag}")

You will notice that for each packet, the key `txt_file` was replaced with `content` without altering the `path` it points to or the associated tag. Since the keys of the packets are used as argument names when invoking pods on a stream, `MapKeys` is commonly used to *map* the correct path to the correct argument.
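The following plain-Python sketch illustrates why renaming keys matters. It models packets as ordinary dictionaries and is only a conceptual stand-in, not the `MapKeys` implementation; `map_keys` and `describe` are hypothetical names.

```python
def map_keys(stream, key_map):
    """Rename packet keys according to key_map, leaving tags and values untouched."""
    for tag, packet in stream:
        yield tag, {key_map.get(key, key): value for key, value in packet.items()}


def describe(content):
    # A stand-in for a function that expects an argument named `content`
    return f"would read {content}"


stream = [({"file_name": "day1"}, {"txt_file": "day1.txt"})]
mapped = list(map_keys(stream, {"txt_file": "content"}))

# Unpacking the packet as keyword arguments only works once the key
# matches the function's parameter name.
message = describe(**mapped[0][1])
print(mapped, message)
```

Calling `describe(**stream[0][1])` on the unmapped packet would raise a `TypeError`, because `txt_file` is not a parameter of `describe`.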

### Map tags
As we have already seen, each packet in the stream is associated with a tag, often derived from the data source. In the case of `GlobSource`, the tags are by default the name of the file that formed the packet. These tags are used to *transiently* identify the packet and are used when matching packets across multiple streams (as we will see shortly with the `Join` operation). You can manipulate the tags using the `MapTags` operation, which works much like `MapKeys` but operates on the tags of each packet under a uniform renaming rule.

In [None]:
from orcabridge.mapper import MapTags

tag_mapper = MapTags(tag_map={"file_name": "day"})

for tag, packet in tag_mapper(dataset1):
    print(tag, packet)

### Chaining operations

As you might expect, you can chain multiple operations one after another to construct a more complex stream. Below, we first apply the key mapping and then map tags.

In [None]:
key_mapper = MapKeys(key_map={"txt_file": "content"})
key_mapped_stream = key_mapper(dataset1)

tag_mapper = MapTags(tag_map={"file_name": "day"})
tag_and_key_mapped = tag_mapper(key_mapped_stream)

for tag, packet in tag_and_key_mapped:
    print(f"Mapped Packet {packet} with tag {tag}")

It's worth emphasizing again that all computations are triggered only when you iterate through the final stream `tag_and_key_mapped`.

You can also create and immediately apply each `mapper` to achieve the same processing in fewer lines of code, although this is not recommended because it reduces readability:
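The lazy behavior can be illustrated with ordinary Python generators. This is a conceptual sketch only, with hypothetical function names, and it does not reflect Orcabridge internals; it simply records when each step actually runs.

```python
calls = []


def source():
    for name in ("day1", "day2"):
        calls.append(f"read {name}")
        yield {"file_name": name}, {"txt_file": f"{name}.txt"}


def rename_tag(stream, old, new):
    for tag, packet in stream:
        calls.append(f"retag {tag[old]}")
        yield {new: tag[old]}, packet


# Chaining the steps builds the pipeline but runs nothing yet
chained = rename_tag(source(), "file_name", "day")
calls_before = list(calls)

# Iterating pulls packets through the whole chain, one at a time
results = list(chained)
calls_after = list(calls)
print(calls_before, calls_after)
```

Note that the work is interleaved per packet: each packet is read and retagged before the next one is read.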

In [None]:
# totally valid, but difficult to read and thus not recommended
for tag, packet in MapTags(tag_map={"file_name": "day"})(
    MapKeys(key_map={"txt_file": "content"})(dataset1)
):
    print(f"Mapped Packet {packet} with tag {tag}")

### Joining multiple streams into a single stream
Now that we have looked at how you can manipulate a single stream, let's turn our attention to how you can work with more than one stream at a time.

By far the most common multi-stream operation is joining two (or more) streams into a single, bigger stream.
You can combine multiple streams into one by using the `Join` operation, which matches packets from each stream based on their tags. If the tags from two streams share keys, the values must be identical for all shared keys for the two packets to be matched. The matched packets are then merged into one (typically larger) packet and shipped to the output stream.

Let's see what happens if we join `dataset1` and `dataset2`, where:

In [None]:
# dataset 1
print("Dataset 1:")
for tag, packet in dataset1:
    print(f"Tag: {tag}, Packet: {packet}")

# dataset 2
print("\nDataset 2:")
for tag, packet in dataset2:
    print(f"Tag: {tag}, Packet: {packet}")

Any guess what would happen?

In [None]:
from orcabridge.mapper import Join

join_op = Join()

for tag, packet in join_op(dataset1, dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

You may be surprised to see that the joined stream is completely empty! This is because packets from both streams were tagged with key `file_name`, causing the `Join` to combine packets only if the value of `file_name` matches exactly. Since no filenames matched, the resulting stream was empty!

This is where we can use the other `mappers` to our advantage and achieve a more useful join.

First, let's completely rename the tag key for one of the streams and see what would happen.

In [None]:
dataset1_retagged = MapTags(tag_map={"file_name": "day"})(dataset1)

for i, (tag, packet) in enumerate(join_op(dataset1_retagged, dataset2)):
    print(f"{i+1:02d} Tag: {tag}, Packet: {packet}")

We are now getting something -- in fact, quite a few things. If you look carefully at the `packet`, you'll notice that it now contains two keys/arguments -- `txt_file` and `bin_data`, combining the packets from the two datasets. 

The `tags` also now contain two keys: `day` from the re-tagged dataset1 stream and `file_name` from the unchanged dataset2 stream.

Since the two streams share no common tag keys, the `Join` operation results in *full-multiplexing* of the two streams. With each stream containing 4 packets, you get $4 \times 4 = 16$ packets.

However, `Join` would not be very useful if all it could do were to produce either zero packets or the full combination of packets from two streams. The true value of `Join` lies in its ability to match two packets that are *related* to each other.

In our example datasets, you likely noticed that files from both datasets are associated with a day. Let's now try to join the two dataset streams by matching on the day!

Although we could achieve the desired effect by changing how we load the source, passing a custom `tag_function` into `GlobSource`, let's achieve the same thing using another `mapper` called `Transform`. `Transform` effectively combines `MapKeys` and `MapTags` but further allows you to provide a function that receives the tag and packet, one pair at a time, and returns a (potentially modified) tag and/or packet, achieving the desired transformation.

In [None]:
from orcabridge.mapper import Transform


def transform_dataset2(tag, packet):
    # Extract the second half of the filename containing day
    new_tag = {"day": tag["file_name"].split("_")[1]}
    return new_tag, packet


dataset2_transformer = Transform(transform_dataset2)

retagged_dataset2 = dataset2_transformer(dataset2)

for tag, packet in retagged_dataset2:
    print(f"Tag: {tag}, Packet: {packet}")

Now that we have dataset2 packets tagged with `day`, let's `join` them with a mapped dataset1!

In [None]:
# change filename to day for dataset1
tag_mapper = MapTags(tag_map={"file_name": "day"})
retagged_dataset1 = tag_mapper(dataset1)

join_op = Join()
joined_stream = join_op(retagged_dataset1, retagged_dataset2)

for tag, packet in joined_stream:
    print(f"Tag: {tag}, Packet: {packet}")

Nice! We have now formed a stream where packets from two streams are paired meaningfully based on matching `day`!
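The three join outcomes seen so far (empty, full combination, and meaningful pairing) all follow from one matching rule. The sketch below restates that rule in plain Python; it is an illustration under the assumption that tags and packets behave like dictionaries, not the actual `Join` implementation, and the file names are made up.

```python
from itertools import product


def join(stream_a, stream_b):
    """Pair packets whose tags agree on every shared key, merging tags and packets."""
    for (tag_a, packet_a), (tag_b, packet_b) in product(list(stream_a), list(stream_b)):
        shared = tag_a.keys() & tag_b.keys()
        if all(tag_a[key] == tag_b[key] for key in shared):
            yield {**tag_a, **tag_b}, {**packet_a, **packet_b}


# Hypothetical stand-ins for the two datasets
txt = [({"file_name": f"day{i}"}, {"txt_file": f"day{i}.txt"}) for i in (1, 2)]
bins = [
    ({"file_name": f"session_day{i}"}, {"bin_data": f"session_day{i}.bin"})
    for i in (1, 2)
]

# Case 1: shared key `file_name` with differing values -> nothing matches
no_match = list(join(txt, bins))

# Case 2: no shared keys -> every packet pairs with every other packet
txt_by_day = [({"day": tag["file_name"]}, packet) for tag, packet in txt]
all_pairs = list(join(txt_by_day, bins))

# Case 3: shared key `day` with comparable values -> related packets pair up
bins_by_day = [({"day": tag["file_name"].split("_")[1]}, packet) for tag, packet in bins]
matched = list(join(txt_by_day, bins_by_day))

print(len(no_match), len(all_pairs), len(matched))
```

With two packets per stream, the no-shared-key case yields $2 \times 2 = 4$ pairs, the same multiplexing effect observed earlier with four packets per stream.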

Now that we have explored quite a bit of how to manipulate data streams using `mapper` operations, it's time to turn to the other half of the operations: `pods`.

## Introducing new files into stream with `Pod`

While `mapper` operations are useful for altering tags and packets and for combining multiple streams, a data pipeline is not really useful if it cannot produce new results in the form of new data -- that is, introduce new files into the stream. This is precisely where `Pod` operations come in!

In fact, we have already been working with a kind of `pod` all along -- `sources`. If you think about it, `sources` also introduce files into the stream. They are special only in that they take no input streams (hence the name, `source`).

We will now explore how you can create a more common type of pod -- a *function* `pod` that takes in a stream and returns a new stream, potentially introducing entirely new data files!

### Working with `FunctionPod`

The easiest way to create a function-like `pod` is to create a `FunctionPod`, passing in a Python function. Let's start by creating a pod that will count the number of lines in a file.

We first define the function.

In [None]:
from os import PathLike


def count_lines(txt_file: PathLike) -> None:
    with open(txt_file, "r") as f:
        lines = f.readlines()
    print(f"File {txt_file} has {len(lines)} lines.")

Next we instantiate a function pod from the function.

In [None]:
from orcabridge.pod import FunctionPod

# create a function pod
function_pod = FunctionPod(count_lines, output_keys=[])

Once the function pod is available, you can execute it on any compatible stream.

In [None]:
# apply the function pod on a stream
processed_stream = function_pod(dataset1)

for tag, packet in processed_stream:
    print(f"Tag: {tag}, Packet: {packet}")

Notice that the returned `packet` is empty because the function returns no values. Such a function pod may still be useful for performing computations or processing via *side effects* (e.g., submitting HTTP requests in the function body), but it is not the standard approach when you want the results to persist.

Next, let's look at the more common scenario in which you perform some computation and want to save the result to a file. Each dataset2 binary file actually contains a list of float values. Let's define a function that computes a few statistics and saves them to a file in a temporary directory.
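One way to picture how the output packet relates to `output_keys` is the sketch below. It is a guess at the general idea rather than the `FunctionPod` implementation: the helper `apply_function` is hypothetical, and the handling of single versus multiple return values is an assumption made for this illustration.

```python
def apply_function(stream, function, output_keys):
    """Call `function` on each packet and wrap its return value(s) in a new packet."""
    for tag, packet in stream:
        result = function(**packet)
        if len(output_keys) == 0:
            values = []  # nothing to capture, so the output packet is empty
        elif len(output_keys) == 1:
            values = [result]  # assumption: a single key captures the return value as-is
        else:
            values = list(result)  # assumption: multiple keys expect a sequence of values
        yield tag, dict(zip(output_keys, values))


def side_effect_only(txt_file):
    # Returns nothing, like `count_lines` above
    return None


def fake_output(txt_file):
    # Pretends to produce a new file and returns its (made-up) path
    return txt_file + ".out"


stream = [({"file_name": "day1"}, {"txt_file": "day1.txt"})]
empty_out = list(apply_function(stream, side_effect_only, output_keys=[]))
named_out = list(apply_function(stream, fake_output, output_keys=["result"]))
print(empty_out, named_out)
```

In both cases the tag passes through unchanged; only the packet content depends on what the function returns.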

In [None]:
import numpy as np
import tempfile
import json


def compute_stats(bin_file: PathLike, output_file=None):
    print("Computing stats for file:", bin_file)
    # read the raw bytes and interpret them as float64 values (NumPy's default dtype)
    with open(bin_file, "rb") as f:
        data = f.read()
    data = np.frombuffer(data)
    print(data)
    data_stats = {}
    data_stats["mean"] = np.mean(data)
    data_stats["std"] = np.std(data)
    data_stats["min"] = np.min(data)
    data_stats["max"] = np.max(data)
    data_stats["n_elements"] = len(data)

    # if output_file is none, create a temporary file. Else, use the given output_file to save the data_stats
    if output_file is None:
        output_file = Path(tempfile.mkdtemp()) / "statistics.json"
    # write as json
    with open(output_file, "w") as f:
        json.dump(data_stats, f)
    return output_file

In [None]:
fp_stats = FunctionPod(compute_stats, output_keys=["stats"])

# change the key from 'bin_data' to 'bin_file', matching the function's input
mapped_dataset2 = MapKeys(key_map={"bin_data": "bin_file"})(dataset2)

for tag, packet in fp_stats(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

Note that in our function `compute_stats`, the computed stats are saved as a `json` file in a temporary directory. While this works for passing data from one operation to another within the pipeline, the result cannot easily be retrieved outside of its immediate usage. In fact, the computation result is very likely to disappear after some time (after all, it's a temporary file). Moreover, if you execute the same computation by iterating a second time over `fp_stats(mapped_dataset2)`, you will see that it invokes the function yet again and produces an entirely different set of temporary files. Since the inputs to the computation didn't change, this is clearly quite wasteful!

In [None]:
# every time you run the following loop, new computations are performed and
# saved in a different set of temporary files
for tag, packet in fp_stats(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

In the next section we will see how we can have the computation results stored using storage-backed function pods.

### [Technical aside] Caching stream

**NOTE**: This section concerns an implementation detail of `orcabridge` that is not fundamentally related to the design of the system. In particular, the issue described in this section (and the associated *solution*) is not relevant to the full implementation that `Orcapod` will be. If you are reading this document primarily to understand the concepts essential to Orcapod, you are advised to skip this section entirely. However, if you intend to use `orcabridge` in an actual application, read on to learn about critical limitations associated with the single-producer single-consumer (SPSC) design of `orcabridge` and how you can ameliorate them by using the `CacheStream` mapper effectively within your pipeline.

In [None]:
from orcabridge.mapper import CacheStream

# create a cache stream operation
cache_stream = CacheStream()
# change the key from 'bin_data' to 'bin_file', matching the function's input
mapped_dataset2 = MapKeys(key_map={"bin_data": "bin_file"})(dataset2)
stats_stream = fp_stats(mapped_dataset2)

# now cache the stream
cached_stream = cache_stream(stats_stream)

# iterate over the cached stream
for tag, packet in cached_stream:
    print(f"Tag: {tag}, Packet: {packet}")

The first time we iterate over `cached_stream`, you can see that the function `compute_stats` is executed, as we'd expect. When you run it a second time, however, you'll notice that something is different.

In [None]:
for tag, packet in cached_stream:
    print(f"Tag: {tag}, Packet: {packet}")

Since the output packets from `stats_stream` have been cached, iterating through `cached_stream` a second time simply returns the cached packets without triggering new computation. Although this may sound like a good way to avoid recomputing the same thing more than once, `CacheStream` comes with significant drawbacks. Since all observed packets are stored in memory, having too many `CacheStream` operations in the pipeline can consume a lot of memory. Also, unlike a storage-backed function pod, as we'll see shortly, `CacheStream` stores the packets as seen during one iteration of the underlying stream. If the underlying stream would now produce new and different packets (e.g., because additional `bin` files were added to the dataset), `CacheStream` won't be able to update itself unless you explicitly clear the cache. Finally, unlike a storage-backed function pod, the computation is *not memoized*, so the exact same computation may still take place if two or more packets are identical in content and would therefore yield identical output.
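The trade-offs above can be seen in a small in-memory cache written in plain Python. This is a conceptual sketch and not the `CacheStream` implementation; the class `CachedStream`, its `clear` method, and the example packet are hypothetical.

```python
class CachedStream:
    """Remember every (tag, packet) pair from the first full pass over the upstream."""

    def __init__(self, upstream_factory):
        self._upstream_factory = upstream_factory
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # The first iteration runs the upstream and keeps everything in memory
            self._cache = list(self._upstream_factory())
        return iter(self._cache)

    def clear(self):
        # Without an explicit clear, upstream changes are never observed
        self._cache = None


runs = []


def expensive_stream():
    runs.append("run")  # record each time the upstream is actually executed
    yield {"day": "day1"}, {"stats": "stats_day1.json"}


cached = CachedStream(expensive_stream)
first = list(cached)
second = list(cached)
runs_after_two_passes = len(runs)

cached.clear()
third = list(cached)
runs_after_clear = len(runs)
print(runs_after_two_passes, runs_after_clear)
```

The cache is keyed on nothing at all: it replays one observed pass, which is why it cannot recognize that two different packets would lead to the same computation.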

## Using storage-backed function pod

Although the simple `FunctionPod` worked as expected, its inability to store computation results significantly limits its utility. You certainly wouldn't want to compute everything from scratch if it can be avoided.

This is where storage-backed function pods step in. As the name indicates, these are `FunctionPod`s with a storage back-end that allows the computation results to be **memoized**: if the same function is called again with identical inputs, the *memoized* result is returned instead of being computed again.

Let's take a look at a specific example of a storage-backed function pod, `FunctionPodWithDirStorage`, which, as the name suggests, stores the computation results in a directory you specify. If you omit the directory specification, it will automatically create the local directory `./pod_data` and store the results there.
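As a rough illustration of memoization backed by a directory, consider the sketch below. It is not how Orcabridge stores results: the helper `memoize_to_dir`, the JSON record format, and the choice to hash argument values (rather than, say, file contents) are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def memoize_to_dir(function, store_dir):
    """Wrap `function` so results are saved to and reloaded from `store_dir`."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)

    def wrapper(**kwargs):
        # Derive a stable key from the function name and its inputs
        key_source = json.dumps(
            {"function": function.__name__, "inputs": kwargs}, sort_keys=True
        )
        key = hashlib.sha256(key_source.encode()).hexdigest()
        record = store_dir / f"{key}.json"
        if record.exists():
            return json.loads(record.read_text())  # reuse the stored result
        result = function(**kwargs)
        record.write_text(json.dumps(result))
        return result

    return wrapper


calls = []


def word_count(text):
    calls.append(text)  # record each real invocation
    return {"words": len(text.split())}


stored = memoize_to_dir(word_count, tempfile.mkdtemp())
first = stored(text="a b c")
second = stored(text="a b c")
print(first, second, len(calls))
```

Because the key depends only on the function and its inputs, the stored result survives across iterations and even across sessions, as long as the directory is kept.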

In [None]:
from orcabridge.pod import FunctionPodWithDirStorage

# use default storage directory of './pod_data'. You could specify a different directory by passing `store_dir` argument
fp_stats_stored = FunctionPodWithDirStorage(compute_stats, output_keys=["stats"])

Once created, a storage-backed `FunctionPod` can be used in exactly the same way as a regular `FunctionPod`.

In [None]:
for tag, packet in fp_stats_stored(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

As before, the very first time you run, all computations take place. Now watch what happens when you run it again.

In [None]:
for tag, packet in fp_stats_stored(mapped_dataset2):
    print(f"Tag: {tag}, Packet: {packet}")

Notice that this time, the function `compute_stats` was **not** invoked. Rather, the computation results from the previous run were *memoized* and *retrieved*, sparing us the unnecessary computation!