# QoL Improving Features of `orcabridge`

In the [previous notebook](./02-02-advanced-usage.ipynb), we explored the `orcabridge` package and learned how to build and execute a simple pipeline using concepts like `streams`, `operations` and `pods`.

For example, we saw that we can define a function pod to wrap a function, and then feed it a stream whose packet keys have been mapped to the argument names the pod expects:

In [1]:
from orcabridge.source import GlobSource
from orcabridge.pod import FunctionPod
from orcabridge.mapper import MapPackets

source = GlobSource("text_file", "../examples/dataset1", "*.txt")


def process_data(data):
    # perform data processing on data
    # return result file path
    return "path/to/result/file"


fp_process = FunctionPod(process_data, ["output_data"])

packet_mapper = MapPackets({"text_file": "data"})  # map packet key text_file to data

# chain them together into a pipeline
mapped_stream = packet_mapper(source)
processed_data_stream = fp_process(mapped_stream)

processed_data_stream.head()  # see the first 5 packets

Tag: {'file_name': 'day1'}, Packet: {'output_data': 'path/to/result/file'}
Tag: {'file_name': 'day2'}, Packet: {'output_data': 'path/to/result/file'}
Tag: {'file_name': 'day3'}, Packet: {'output_data': 'path/to/result/file'}
Tag: {'file_name': 'day4'}, Packet: {'output_data': 'path/to/result/file'}


While creating each `mapper` and `pod` separately and then chaining them helps to rigorously define the data pipeline, it can admittedly get quite verbose and cumbersome.

Fortunately, `orcabridge` has a number of quality-of-life (QoL) features that help you create and combine `operations` and `streams` much more quickly, without losing any expressivity. In this notebook, we will explore these QoL features together.

## `function_pod` decorator for simple `FunctionPod` creation

We saw that we can use the `FunctionPod` class to wrap an existing function and associate `output_keys` with it, rigorously defining a `FunctionPod` object that can perform computations on streams of data.

Often, you will define a function with the sole intention of using it as a `FunctionPod`. In that case, you can simplify the `FunctionPod` creation by decorating the function with the `function_pod` decorator:

In [2]:
from orcabridge.pod import function_pod
import json
import tempfile
from pathlib import Path

json_source = GlobSource("json_file", "../examples/dataset2", "*.json")


@function_pod(["output_data"])
def extract_name_from_json(json_file):
    with open(json_file, "r") as f:
        data = json.load(f)
    output_data = {"info": ""}
    if "name" in data:
        output_data["info"] = data["name"]
    output_path = Path(tempfile.mkdtemp()) / "output.json"
    with open(output_path, "w") as f:
        json.dump(output_data, f)
    return output_path

With the above code, the decorator takes the decorated function and creates a `FunctionPod` with the specified output keys (`"output_data"` in this case).

The name `extract_name_from_json` now holds the resulting `FunctionPod`, which can be immediately applied to a stream.

In [3]:
extract_name_from_json(json_source).head()  # preview the first 5 packets

Tag: {'file_name': 'info_day1'}, Packet: {'output_data': PosixPath('/tmp/tmpn0nn30b3/output.json')}
Tag: {'file_name': 'info_day2'}, Packet: {'output_data': PosixPath('/tmp/tmpqg13bjib/output.json')}
Tag: {'file_name': 'info_day3'}, Packet: {'output_data': PosixPath('/tmp/tmp1f_08m5t/output.json')}
Tag: {'file_name': 'info_day4'}, Packet: {'output_data': PosixPath('/tmp/tmpq3x8a298/output.json')}
Tag: {'file_name': 'info_day5'}, Packet: {'output_data': PosixPath('/tmp/tmp_cma7686/output.json')}


If you need to access the original function, you can retrieve it through the `function` attribute:

In [4]:
output_path = extract_name_from_json.function("../examples/dataset2/info_day2.json")

with open(output_path, "r") as f:
    data = json.load(f)
    print(data)  # show the dictionary that the function wrote to the output file

{'info': 'Day 2 experiment'}
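
To build some intuition for what a decorator like `function_pod` does, here is a minimal, self-contained sketch of the general pattern. It is not `orcabridge`'s actual implementation: `ToyFunctionPod` and `toy_function_pod` are hypothetical names, and the sketch assumes a stream is simply an iterable of `(tag, packet)` pairs whose packet keys are passed to the function as keyword arguments.

```python
class ToyFunctionPod:
    """Hypothetical stand-in for a function pod (not orcabridge's class)."""

    def __init__(self, function, output_keys):
        self.function = function  # keep the original function reachable
        self.output_keys = list(output_keys)

    def __call__(self, stream):
        # Assumption: a stream is an iterable of (tag, packet) pairs.
        for tag, packet in stream:
            result = self.function(**packet)  # packet keys become argument names
            if len(self.output_keys) == 1:
                result = (result,)  # treat a single return value as one output
            yield tag, dict(zip(self.output_keys, result))


def toy_function_pod(output_keys):
    """Decorator factory: replaces the decorated function with a pod object."""

    def decorator(function):
        return ToyFunctionPod(function, output_keys)

    return decorator


@toy_function_pod(["shouted"])
def shout(text):
    return text.upper()


stream = [({"file_name": "day1"}, {"text": "hello"})]
results = list(shout(stream))
print(results)
print(shout.function("direct call"))  # the undecorated function is still available
```

The key point is that after decoration the name `shout` refers to the pod object rather than the plain function, which is why the original function has to be exposed through an attribute.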


## Mapping tags and packets with the `>>` operator

As you chain multiple pods together to form a complex pipeline, you are bound to make frequent use of `MapPackets` to *rename* the output key of one pod to the argument name expected by the next pod. We have already seen how this can be achieved by creating a specific instance of `MapPackets`, initialized with a dictionary that defines the name mapping.

Consider the following data source and function pod:

In [5]:
json_files = GlobSource("json_file", "../examples/dataset2", "*.json")


@function_pod(["line_count"])
def count_lines(text_file):
    with open(text_file, "r") as f:
        lines = f.readlines()
    line_count = len(lines)
    output_path = Path(tempfile.mkdtemp()) / "line_count.json"
    with open(output_path, "w") as f:
        json.dump({"line_count": line_count}, f)
    return output_path

If we want to apply the function pod to count and save the number of lines in the JSON files from the data source, we have to create a `MapPackets` that renames the packet key `"json_file"` into `"text_file"`, the argument name expected by the `count_lines` function.

In [6]:
json_to_text = MapPackets(
    {"json_file": "text_file"}
)  # map packet key json_file to text_file

Finally, we can chain them together into a functional pipeline:

In [7]:
line_info = count_lines(json_to_text(json_files))

line_info.head()  # preview the first 5 packets

Tag: {'file_name': 'info_day1'}, Packet: {'line_count': PosixPath('/tmp/tmpbrelyuro/line_count.json')}
Tag: {'file_name': 'info_day2'}, Packet: {'line_count': PosixPath('/tmp/tmp1mgotcqw/line_count.json')}
Tag: {'file_name': 'info_day3'}, Packet: {'line_count': PosixPath('/tmp/tmp8mjlafrx/line_count.json')}
Tag: {'file_name': 'info_day4'}, Packet: {'line_count': PosixPath('/tmp/tmpvs_r2obl/line_count.json')}
Tag: {'file_name': 'info_day5'}, Packet: {'line_count': PosixPath('/tmp/tmpha6qrjs2/line_count.json')}


This is all fine until you start having many more `pods` and `streams` in your pipeline that need to be connected together. Many of these connections would need a `MapPackets` `mapper` inserted for the function to work properly. That could be a lot of `MapPackets` to create!

Because `MapPackets` is such a common operation, `orcabridge` provides a very simple shortcut for creating a *mapped stream* from another stream using the right shift (`>>`) operator.

In [8]:
mapped_stream = json_files >> {"json_file": "text_file"}

mapped_stream.head()

count_lines(mapped_stream).head()

Tag: {'file_name': 'info_day1'}, Packet: {'text_file': PosixPath('../examples/dataset2/info_day1.json')}
Tag: {'file_name': 'info_day2'}, Packet: {'text_file': PosixPath('../examples/dataset2/info_day2.json')}
Tag: {'file_name': 'info_day3'}, Packet: {'text_file': PosixPath('../examples/dataset2/info_day3.json')}
Tag: {'file_name': 'info_day4'}, Packet: {'text_file': PosixPath('../examples/dataset2/info_day4.json')}
Tag: {'file_name': 'info_day5'}, Packet: {'text_file': PosixPath('../examples/dataset2/info_day5.json')}
Tag: {'file_name': 'info_day1'}, Packet: {'line_count': PosixPath('/tmp/tmp_o8blef4/line_count.json')}
Tag: {'file_name': 'info_day2'}, Packet: {'line_count': PosixPath('/tmp/tmp7s3mk_9p/line_count.json')}
Tag: {'file_name': 'info_day3'}, Packet: {'line_count': PosixPath('/tmp/tmpdv672rbb/line_count.json')}
Tag: {'file_name': 'info_day4'}, Packet: {'line_count': PosixPath('/tmp/tmpbvxkwo29/line_count.json')}
Tag: {'file_name': 'info_day5'}, Packet: {'line_count': PosixPa

That's it! Hopefully you'd agree that this is far more convenient than having to define your own `MapPackets` mapper. Using the `>>` operator, the whole pipeline reduces to a single line:

In [9]:
# preview the first 5 packets
count_lines(json_files >> {"json_file": "text_file"}).head()

Tag: {'file_name': 'info_day1'}, Packet: {'line_count': PosixPath('/tmp/tmpcscf2auv/line_count.json')}
Tag: {'file_name': 'info_day2'}, Packet: {'line_count': PosixPath('/tmp/tmpwtsrkvg4/line_count.json')}
Tag: {'file_name': 'info_day3'}, Packet: {'line_count': PosixPath('/tmp/tmpw4cj_kso/line_count.json')}
Tag: {'file_name': 'info_day4'}, Packet: {'line_count': PosixPath('/tmp/tmpyo6pc_fw/line_count.json')}
Tag: {'file_name': 'info_day5'}, Packet: {'line_count': PosixPath('/tmp/tmp3up8exm6/line_count.json')}


Not only is this simpler to type, we believe it actually makes the pipeline creation more intuitive and expressive of your intention!
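
If you are curious how an expression like `stream >> {...}` can work at all, Python lets a class define the `__rshift__` special method to give `>>` a custom meaning. The sketch below illustrates the idea with a hypothetical `ToyStream` class. It is an assumption-laden illustration, not how `orcabridge` is actually implemented, and it models a stream as a plain list of `(tag, packet)` pairs.

```python
class ToyStream:
    """Hypothetical stream holding (tag, packet) pairs (not orcabridge's class)."""

    def __init__(self, items):
        self.items = list(items)

    def __iter__(self):
        return iter(self.items)

    def __rshift__(self, key_map):
        # Python calls this method when evaluating `stream >> key_map`.
        if not isinstance(key_map, dict):
            return NotImplemented
        # Rename packet keys found in key_map; other keys are dropped here.
        renamed = [
            (tag, {key_map[k]: v for k, v in packet.items() if k in key_map})
            for tag, packet in self.items
        ]
        return ToyStream(renamed)  # return a new stream, leave the original intact


files = ToyStream([({"file_name": "info_day1"}, {"json_file": "info_day1.json"})])
mapped = files >> {"json_file": "text_file"}
print(list(mapped))
```

Returning a new stream instead of modifying the original keeps the source reusable, so the same source can feed several differently mapped branches of a pipeline.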

### Mapping tags and advanced mapping

We just saw how the right shift operator can be used to simplify the creation of a `MapPackets` operation. How about `MapTags`? We can get the equivalent of `MapTags` by using the same right shift (`>>`) operator, with the help of an additional function: `tag()`.

In [10]:
from orcabridge.mapper import tag, packet

(json_files >> tag({"file_name": "experiment_day"})).head()

Tag: {'experiment_day': 'info_day1'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day1.json')}
Tag: {'experiment_day': 'info_day2'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day2.json')}
Tag: {'experiment_day': 'info_day3'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day3.json')}
Tag: {'experiment_day': 'info_day4'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day4.json')}
Tag: {'experiment_day': 'info_day5'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day5.json')}


Now, if you were to closely inspect `MapTags` and `MapPackets`, you would see that they can take additional arguments such as `drop_unmapped`. The `tag()` and `packet()` helper functions let you specify those arguments as well while still using the `>>` operator.

In [11]:
# no packet key matches `data_file`: by default, this will lead to an empty packet
(json_files >> {"data_file": "file_path"}).head()

Tag: {'file_name': 'info_day1'}, Packet: {}
Tag: {'file_name': 'info_day2'}, Packet: {}
Tag: {'file_name': 'info_day3'}, Packet: {}
Tag: {'file_name': 'info_day4'}, Packet: {}
Tag: {'file_name': 'info_day5'}, Packet: {}


In [12]:
# you can preserve unmapped packet keys by using the `packet` function with `drop_unmapped=False`
(json_files >> packet({"data_file": "file_path"}, drop_unmapped=False)).head()

Tag: {'file_name': 'info_day1'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day1.json')}
Tag: {'file_name': 'info_day2'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day2.json')}
Tag: {'file_name': 'info_day3'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day3.json')}
Tag: {'file_name': 'info_day4'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day4.json')}
Tag: {'file_name': 'info_day5'}, Packet: {'json_file': PosixPath('../examples/dataset2/info_day5.json')}
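
The effect of `drop_unmapped` seen in the two cells above can be summarized with a small plain-Python sketch. The helper `rename_keys` is a hypothetical function written only for illustration. It is not part of `orcabridge`, but it reproduces the behavior shown in the outputs: unmapped keys are dropped by default and kept when `drop_unmapped=False`.

```python
def rename_keys(packet, mapping, drop_unmapped=True):
    """Hypothetical helper mimicking the key-renaming behavior shown above."""
    renamed = {}
    for key, value in packet.items():
        if key in mapping:
            renamed[mapping[key]] = value  # key is mapped: store under the new name
        elif not drop_unmapped:
            renamed[key] = value  # key is unmapped: keep it only when asked to
    return renamed


example_packet = {"json_file": "info_day1.json"}

# No key matches "data_file", so the default leaves an empty packet.
dropped = rename_keys(example_packet, {"data_file": "file_path"})

# With drop_unmapped=False the unmatched key passes through unchanged.
kept = rename_keys(example_packet, {"data_file": "file_path"}, drop_unmapped=False)

print(dropped)
print(kept)
```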
