In [1]:
import orcapod as op
import shutil

We will also make heavy use of PyArrow:

In [2]:
import pyarrow as pa

### Preparing the environment

In this notebook, we will create a local directory called `pipeline_data` and store results there. To get reproducible results, we start by making sure that this directory does not already exist.

In [3]:
shutil.rmtree("./pipeline_data", ignore_errors=True)

### Creating streams

At the moment, the only way to create a stream is by wrapping a PyArrow table.

In [4]:
table = pa.Table.from_pydict(
    {
        "a": [1, 2, 3],
        "b": ["x", "y", "z"],
        "c": [True, False, True],
        "d": [1.1, 2.2, 3.3],
    }
)

Use `op.streams.ImmutableTableStream` to turn the table into a stream. You also have to specify which columns are the tags; the remaining columns become the packet data.

In [5]:
stream = op.streams.ImmutableTableStream(table, tag_columns=["a", "b"])

### Working with streams

Once you have a stream, you can iterate through its (tag, packet) pairs:

In [6]:
for tag, packet in stream:
    print(f"Tag: {tag}, Packet: {packet}")

Tag: {'a': 1, 'b': 'x'}, Packet: {'c': True, 'd': 1.1}
Tag: {'a': 2, 'b': 'y'}, Packet: {'c': False, 'd': 2.2}
Tag: {'a': 3, 'b': 'z'}, Packet: {'c': True, 'd': 3.3}
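To make the tag/packet split concrete, here is a minimal plain-Python sketch of what iterating over such a stream yields. It does not use orcapod: `iter_tag_packet_pairs` is a hypothetical helper written only for illustration, and a plain column-oriented dictionary stands in for the PyArrow table.

```python
def iter_tag_packet_pairs(columns, tag_columns):
    """Yield (tag, packet) dict pairs from a column-oriented dict.

    Hypothetical helper for illustration; not part of orcapod.
    """
    n_rows = len(next(iter(columns.values())))
    for i in range(n_rows):
        row = {name: values[i] for name, values in columns.items()}
        # Tag columns identify the row; everything else is packet data.
        tag = {k: v for k, v in row.items() if k in tag_columns}
        packet = {k: v for k, v in row.items() if k not in tag_columns}
        yield tag, packet


columns = {
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
    "d": [1.1, 2.2, 3.3],
}
pairs = list(iter_tag_packet_pairs(columns, tag_columns=["a", "b"]))
first_tag, first_packet = pairs[0]
```

Each row contributes exactly one pair, so the three-row table above produces three (tag, packet) pairs.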


You can also get all (tag, packet) pairs as a list of tuples by calling `.flow()`:

In [7]:
stream.flow()

[(ArrowTag(data={'a': 1, 'b': 'x'}, meta_columns=0, context='std:v0.1.0:default'),
  ArrowPacket(data={'c': True, 'd': 1.1}, meta_columns=0, context='std:v0.1.0:default')),
 (ArrowTag(data={'a': 2, 'b': 'y'}, meta_columns=0, context='std:v0.1.0:default'),
  ArrowPacket(data={'c': False, 'd': 2.2}, meta_columns=0, context='std:v0.1.0:default')),
 (ArrowTag(data={'a': 3, 'b': 'z'}, meta_columns=0, context='std:v0.1.0:default'),
  ArrowPacket(data={'c': True, 'd': 3.3}, meta_columns=0, context='std:v0.1.0:default'))]

Every stream can be converted into a table with the `as_table()` method:

In [8]:
stream.as_table()

pyarrow.Table
a: int64
b: string
c: bool
d: double
----
a: [[1,2,3]]
b: [["x","y","z"]]
c: [[true,false,true]]
d: [[1.1,2.2,3.3]]

Optionally, you can pass arguments to `as_table` to include system columns in the table.

`include_source` adds a source column for each data (non-tag) column, named `_source_{column}`, that contains information about where that particular value originated.

In [9]:
stream.as_table(include_source=True)

pyarrow.Table
a: int64
b: string
c: bool
d: double
_source_c: large_string
_source_d: large_string
----
a: [[1,2,3]]
b: [["x","y","z"]]
c: [[true,false,true]]
d: [[1.1,2.2,3.3]]
_source_c: [[null,null,null]]
_source_d: [[null,null,null]]
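The `_source_{column}` naming pattern can be sketched in plain Python as follows. This is an illustration only: `with_source_columns` is a hypothetical helper, not an orcapod function, and it fills the new columns with `None` simply to mirror the nulls in the output above.

```python
def with_source_columns(columns, tag_columns):
    """Return a copy of `columns` plus one `_source_{name}` column per data column.

    Hypothetical helper for illustration; not part of orcapod.
    """
    n_rows = len(next(iter(columns.values())))
    result = dict(columns)
    for name in columns:
        if name not in tag_columns:
            # Tag columns get no source column; data columns get one each.
            result[f"_source_{name}"] = [None] * n_rows
    return result


columns = {
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
    "d": [1.1, 2.2, 3.3],
}
extended = with_source_columns(columns, tag_columns=["a", "b"])
column_names = list(extended)
```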

`include_content_hash` computes the content hash of each packet and includes it as the `_content_hash` column:

In [10]:
stream.as_table(include_content_hash=True)

pyarrow.Table
a: int64
b: string
c: bool
d: double
_content_hash: large_string
----
a: [[1,2,3]]
b: [["x","y","z"]]
c: [[true,false,true]]
d: [[1.1,2.2,3.3]]
_content_hash: [["arrow_v0.1@3de5f8a7b9a2fe5e6cc3c84e0368a21e807abe655b5a4dc58efc9b5487e3d9a8","arrow_v0.1@cc022b33fc80a6639d2051d6d19a0162a832ce309367e426433e7401390b6e20","arrow_v0.1@b0bb7434c813b4d5d7c3a5445a0ac3804739388a20a78d6d910b8c02d9ec5653"]]

Alternatively, you can pass a custom column name to use for the content hash column:

In [11]:
stream.as_table(include_content_hash="my_hash_values")

pyarrow.Table
a: int64
b: string
c: bool
d: double
my_hash_values: large_string
----
a: [[1,2,3]]
b: [["x","y","z"]]
c: [[true,false,true]]
d: [[1.1,2.2,3.3]]
my_hash_values: [["arrow_v0.1@3de5f8a7b9a2fe5e6cc3c84e0368a21e807abe655b5a4dc58efc9b5487e3d9a8","arrow_v0.1@cc022b33fc80a6639d2051d6d19a0162a832ce309367e426433e7401390b6e20","arrow_v0.1@b0bb7434c813b4d5d7c3a5445a0ac3804739388a20a78d6d910b8c02d9ec5653"]]
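The general idea of a versioned content hash (a prefix naming the hashing scheme, followed by a digest of the data) can be sketched as below. This is not orcapod's hasher: the canonical-JSON serialization, the choice of SHA-256, and the `sketch_v0` prefix are all assumptions made for illustration, so the digests will not match the values shown above.

```python
import hashlib
import json


def sketch_content_hash(packet, prefix="sketch_v0"):
    """Hash a packet dict deterministically and prepend a scheme prefix.

    Illustrative only; orcapod's real hashing scheme is not reproduced here.
    """
    # Sorting keys makes the serialization independent of insertion order.
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}@{digest}"


h1 = sketch_content_hash({"c": True, "d": 1.1})
h2 = sketch_content_hash({"d": 1.1, "c": True})   # same content, different key order
h3 = sketch_content_hash({"c": False, "d": 2.2})  # different content
```

Here `h1` and `h2` are equal because the content is the same, while `h3` differs. Keeping the scheme name in the prefix lets hashes produced by different hasher versions be told apart.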

Finally, `include_data_context` adds the data context as a `_context_key` column, which captures information such as the OrcaPod version and hasher version that were used when generating that packet.

In [12]:
stream.as_table(include_data_context=True)

pyarrow.Table
a: int64
b: string
c: bool
d: double
_context_key: large_string
----
a: [[1,2,3]]
b: [["x","y","z"]]
c: [[true,false,true]]
d: [[1.1,2.2,3.3]]
_context_key: [[null,null,null]]

### Tags and Packets

The tags and packets returned by a stream can be thought of as special dictionaries.

In [13]:
all_tags_and_packets = stream.flow()

In [14]:
tag, packet = all_tags_and_packets[0]

In [15]:
tag

ArrowTag(data={'a': 1, 'b': 'x'}, meta_columns=0, context='std:v0.1.0:default')

In [16]:
packet

ArrowPacket(data={'c': True, 'd': 1.1}, meta_columns=0, context='std:v0.1.0:default')

The elements of a tag or packet can be accessed just like dictionary entries:

In [17]:
tag["a"]

1

In [18]:
tag["b"]

'x'

In [19]:
packet["c"]

True

In [20]:
packet["d"]

1.1

They have a few methods that will provide additional insights:

In [21]:
# Returns typespec (dictionary of key to type)
packet.types()

{'c': bool, 'd': float}

In [22]:
# entry names as strings
packet.keys()

('c', 'd')

They can also be converted to an Arrow table by calling `as_table`:

In [23]:
packet.as_table()

pyarrow.Table
c: bool
d: double
----
c: [[true]]
d: [[1.1]]

The schema is also conveniently available:

In [24]:
packet.arrow_schema()

c: bool
d: double

You can also get a plain dictionary from a tag or packet with `as_dict`:

In [25]:
tag.as_dict()

{'a': 1, 'b': 'x'}
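The dictionary-like behaviour shown in the cells above can be mimicked with a small read-only mapping. The class below is a hypothetical stand-in written for illustration, not orcapod's `ArrowTag` or `ArrowPacket`; it only reproduces the key lookup, `keys()`, `types()`, and `as_dict()` behaviour seen in the outputs.

```python
from collections.abc import Mapping


class SketchPacket(Mapping):
    """Read-only mapping with a few introspection helpers.

    Hypothetical class for illustration; not part of orcapod.
    """

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def keys(self):
        # Return a tuple, matching the shape of the output shown above.
        return tuple(self._data)

    def types(self):
        # Map each key to the Python type of its value.
        return {key: type(value) for key, value in self._data.items()}

    def as_dict(self):
        # Hand back a copy so callers cannot mutate the internal state.
        return dict(self._data)


sketch = SketchPacket({"c": True, "d": 1.1})
value_c = sketch["c"]
key_names = sketch.keys()
type_spec = sketch.types()
```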

A packet contains some additional data, such as `source_info`:

In [26]:
packet.source_info()

{'c': None, 'd': None}

This additional data can be included when converting to a dict or table:

In [27]:
packet.as_dict(include_source=True)

{'c': True, 'd': 1.1, '_source_c': None, '_source_d': None}

In [28]:
packet.as_table(include_source=True)

pyarrow.Table
c: bool
d: double
_source_c: large_string
_source_d: large_string
----
c: [[true]]
d: [[1.1]]
_source_c: [[null]]
_source_d: [[null]]

The hash of a tag or packet can be computed with the `content_hash()` method. The result is cached so that it is not recomputed unnecessarily.

In [29]:
tag.content_hash()

'arrow_v0.1@6e1143896d73d370757811b52ceeea8d1d456cd30206416fbf81754e1cea5568'
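The caching behaviour described above follows a common compute-once pattern, sketched here in plain Python. `HashedRecord` is a hypothetical class for illustration, and the SHA-256-over-JSON hash is an assumption rather than orcapod's scheme; the point is only that the second call reuses the stored value.

```python
import hashlib
import json


class HashedRecord:
    """Record that computes its hash on first request and reuses it afterwards.

    Hypothetical class for illustration; not part of orcapod.
    """

    def __init__(self, data):
        self._data = dict(data)
        self._cached_hash = None
        self.hash_computations = 0  # counts how often the hash is actually computed

    def content_hash(self):
        if self._cached_hash is None:
            self.hash_computations += 1
            canonical = json.dumps(self._data, sort_keys=True)
            self._cached_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return self._cached_hash


record = HashedRecord({"a": 1, "b": "x"})
first = record.content_hash()
second = record.content_hash()  # served from the cache
```

Caching like this is only safe because the record is treated as immutable; if the data could change after hashing, the cached value would become stale.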

## Working with operators

We start getting into orcapod computation when we use operators. At the time of writing, only the `Join` operator is fully implemented, but more are to come very shortly.

Let's prepare two streams:

In [30]:
table1 = pa.Table.from_pydict(
    {
        "id": [0, 1, 4],
        "a": [1, 2, 3],
        "b": ["x", "y", "z"],
    }
)

table2 = pa.Table.from_pydict(
    {
        "id": [0, 1, 2],
        "c": [True, False, True],
        "d": [1.1, 2.2, 3.3],
    }
)

stream1 = op.streams.ImmutableTableStream(table1, tag_columns=["id"])
stream2 = op.streams.ImmutableTableStream(table2, tag_columns=["id"])

We now join the two streams by instantiating the Join operator and then passing in the two streams:

In [31]:
join = op.operators.Join()

In [32]:
joined_stream = join(stream1, stream2)

Calling an operator on stream(s) immediately checks that the input streams are compatible with the operator, but otherwise it does NOT trigger any computation. Computation occurs only when you try to **access the output stream's content via iteration, flow, or conversion to a table**.

In [33]:
for tag, packet in joined_stream:
    print(f"Tag: {tag}, Packet: {packet}")

Tag: {'id': 0}, Packet: {'a': 1, 'b': 'x', 'c': True, 'd': 1.1}
Tag: {'id': 1}, Packet: {'a': 2, 'b': 'y', 'c': False, 'd': 2.2}
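The result above behaves like an inner join on the shared tag key `id`: only ids 0 and 1 appear in both inputs. A minimal plain-Python sketch of that behaviour follows. `sketch_join` is a hypothetical helper, not orcapod's `Join`, and the nested loop is chosen for readability rather than efficiency.

```python
def sketch_join(left, right):
    """Inner-join two lists of (tag, packet) pairs on their shared tag keys.

    Hypothetical helper for illustration; not part of orcapod.
    """
    joined = []
    for left_tag, left_packet in left:
        for right_tag, right_packet in right:
            shared = left_tag.keys() & right_tag.keys()
            # Keep the pair only if every shared tag key has the same value.
            if all(left_tag[k] == right_tag[k] for k in shared):
                joined.append(
                    ({**left_tag, **right_tag}, {**left_packet, **right_packet})
                )
    return joined


left = [
    ({"id": 0}, {"a": 1, "b": "x"}),
    ({"id": 1}, {"a": 2, "b": "y"}),
    ({"id": 4}, {"a": 3, "b": "z"}),
]
right = [
    ({"id": 0}, {"c": True, "d": 1.1}),
    ({"id": 1}, {"c": False, "d": 2.2}),
    ({"id": 2}, {"c": True, "d": 3.3}),
]
result = sketch_join(left, right)
```

Rows with `id` 4 and `id` 2 are dropped because each appears in only one of the inputs.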


The output of the computation is automatically cached so that as long as you access the same output stream, you won't be triggering unnecessary recomputation!

In [34]:
joined_stream.as_table()

pyarrow.Table
id: int64
a: int64
b: string
c: bool
d: double
----
id: [[0,1]]
a: [[1,2]]
b: [["x","y"]]
c: [[true,false]]
d: [[1.1,2.2]]
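The combination of deferred computation and cached output can be sketched with a small wrapper class. `LazyStream` below is a hypothetical illustration, not orcapod's stream implementation: nothing runs when it is constructed, the first access triggers the computation, and later accesses reuse the stored result.

```python
class LazyStream:
    """Defer a computation until first access, then reuse the result.

    Hypothetical class for illustration; not part of orcapod.
    """

    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self.runs = 0  # counts how many times the computation actually ran

    def flow(self):
        if self._cache is None:
            self.runs += 1
            self._cache = list(self._compute())
        return self._cache

    def __iter__(self):
        return iter(self.flow())


def produce_pairs():
    yield {"id": 0}, {"a": 1}
    yield {"id": 1}, {"a": 2}


lazy = LazyStream(produce_pairs)
runs_before_access = lazy.runs  # still 0: construction does no work
first_pass = list(lazy)         # triggers the computation
second_pass = lazy.flow()       # served from the cache
```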

## Working with Function Pods

Now that we have explored the basics of streams, tags, packets, and operators (i.e. Join), it's time to explore the meat of `orcapod` -- `FunctionPod`s! Let's start by defining a very simple function pod that takes in two numbers and returns their sum.

In [35]:
@op.function_pod(output_keys=["sum"])
def add_numbers(a: int, b: int) -> int:
    """A simple function pod that adds two numbers."""
    return a + b

You'll notice that, aside from the `op.function_pod` decorator, this is nothing but an ordinary Python function with type hints! The type hints are crucial, however, as they are used by the `orcapod` system to validate the input streams into your pods and to predict whether the output of your pod can be fed into another operator/pod without issue.

Once you have a function pod defined, you can already use it on streams just like operators. Let's prepare a stream that has entries for `a` and `b` and then feed them into the function pod.
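To illustrate how type hints make this kind of validation possible, here is a minimal sketch that compares a function's annotations against the types available in an input. It uses only the standard library; `check_inputs` is a hypothetical helper, and the exact-type comparison is an assumption made for illustration rather than a description of orcapod's actual checks.

```python
from typing import get_type_hints


def check_inputs(func, available_types):
    """Return a list of problems found when matching inputs to `func`.

    Hypothetical helper for illustration; not part of orcapod.
    """
    hints = get_type_hints(func)
    hints.pop("return", None)  # only the parameters matter here
    problems = []
    for name, expected in hints.items():
        if name not in available_types:
            problems.append(f"missing input '{name}'")
        elif available_types[name] is not expected:
            actual = available_types[name].__name__
            problems.append(f"'{name}' is {actual}, expected {expected.__name__}")
    return problems


def add(a: int, b: int) -> int:
    return a + b


ok = check_inputs(add, {"a": int, "b": int})
bad = check_inputs(add, {"a": int, "b": str})
```

Because the check needs only the annotations and the input types, it can run before any data is processed.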

In [36]:
input_table = pa.Table.from_pydict(
    {
        "id": [0, 1, 2, 3, 4],
        "a": [1, 2, 3, 4, 5],
        "b": [10, 20, 30, 40, 50],
    }
)

input_stream = op.streams.ImmutableTableStream(input_table, tag_columns=["id"])

In [37]:
# run the stream through the function pod!
output_stream = add_numbers(input_stream)

And that's it! Believe it or not, that is all it takes to set up the computation. The actual computation will be triggered the first time you access the content of the output stream.

In [38]:
output_stream

KernelStream(kernel=FunctionPod:add_numbers(a: int, b: int)-> <class 'int'>, upstreams=(ImmutableTableStream(table=['id', 'a', 'b'], tag_columns=('id',)),))

In [39]:
for t, p in output_stream:
    print(f"Tag: {t}, Packet: {p}")

Tag: {'id': 0}, Packet: {'sum': 11}
Tag: {'id': 1}, Packet: {'sum': 22}
Tag: {'id': 2}, Packet: {'sum': 33}
Tag: {'id': 3}, Packet: {'sum': 44}
Tag: {'id': 4}, Packet: {'sum': 55}
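Conceptually, the pod applies the function to each packet, passes the tag through unchanged, and stores the return value under the name given in `output_keys`. The sketch below mimics that for plain dictionaries. `sketch_apply` is a hypothetical helper, not orcapod code, and its handling of several output keys (unpacking a returned tuple) is an assumption.

```python
def sketch_apply(func, output_keys, pairs):
    """Apply `func` to each packet, keeping the tag and naming the outputs.

    Hypothetical helper for illustration; not part of orcapod.
    """
    results = []
    for tag, packet in pairs:
        value = func(**packet)  # packet keys are matched to parameter names
        if len(output_keys) == 1:
            outputs = {output_keys[0]: value}
        else:
            # Assumption: with several output keys, the function returns a tuple.
            outputs = dict(zip(output_keys, value))
        results.append((tag, outputs))
    return results


def plain_add(a, b):
    return a + b


pairs = [({"id": i}, {"a": i + 1, "b": 10 * (i + 1)}) for i in range(5)]
sums = sketch_apply(plain_add, ["sum"], pairs)
```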


Simple, right?

## Chaining operators and pods into a pipeline

Now that we have seen how to define and run pods, it's time to put them together into a concrete pipeline. To do so, we will construct a `Pipeline` instance. When doing so, we have to pass in a place to save data to, so we will also prepare a data store.

In [40]:
data_store = op.stores.BatchedDeltaTableArrowStore(base_path="./pipeline_data")

pipeline = op.Pipeline(name="MyPipelin", pipeline_store=data_store)

Once we have the pipeline ready, we can define the pipeline by simply running and chaining operators and pods **inside the pipeline context**. Typically, you'd want to define your function pods beforehand:

In [41]:
@op.function_pod(output_keys=["sum"])
def add_numbers(a: int, b: int) -> int:
    """A simple function pod that adds two numbers."""
    return a + b


@op.function_pod(output_keys=["product"])
def multiply_numbers(a: int, b: int) -> int:
    """A simple function pod that multiplies two numbers."""
    return a * b


@op.function_pod(output_keys=["result"])
def combine_results(sum: int, product: int) -> str:
    """A simple function pod that combines results."""
    return f"Sum: {sum}, Product: {product}"

In [42]:
# now define the pipeline
with pipeline:
    sum_results = add_numbers(input_stream)
    product_results = multiply_numbers(input_stream)
    final_results = combine_results(sum_results, product_results)

You can access individual elements of the pipeline as attributes. By default, each attribute is named after the operator/pod name.

In [43]:
pipeline.add_numbers

PodNode(pod=FunctionPod:add_numbers)

Notice that each element of the pipeline is wrapped in a `Node`, either a `PodNode` or a `KernelNode`.

You can fetch results of the pipeline through these nodes. For example, you can access the saved results of the pipeline as a Polars dataframe by accessing the `df` attribute.

In [44]:
pipeline.add_numbers.df

You'll notice that `df` comes back empty because the pipeline is yet to run. Let's now trigger the pipeline to fill the nodes with computation results!

In [45]:
pipeline.run()

This will cause all nodes in the pipeline to run and store the results.

Now let's take a look at the computed results:

In [46]:
pipeline.add_numbers.df

id,sum
i64,i64
0,11
1,22
2,33
3,44
4,55


You now have the computations saved at each node!

### Labeling nodes in the pipeline

When constructing the pipeline, each invocation of an operator/pod results in a new node getting added, with the name of the node defaulting to the name of the operator/pod. If you use the same pod multiple times, then the nodes will be given names of the form `{pod_name}_0`, `{pod_name}_1`, and so on.

While this is a helpful default, you'll likely want to name each node explicitly so you can more easily understand what you are accessing within the pipeline. To achieve this, you can label each invocation with the `label=` argument in the call.

In [47]:
data_store = op.stores.BatchedDeltaTableArrowStore(base_path="./pipeline_data")

pipeline = op.Pipeline(name="MyPipelin", pipeline_store=data_store)

In [48]:
# now define the pipeline
with pipeline:
    sum_results = add_numbers(input_stream, label="my_summation")
    product_results = multiply_numbers(input_stream, label="my_product")
    final_results = combine_results(
        sum_results, product_results, label="my_final_result"
    )
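The naming rules described above can be sketched as a small function. This is a hypothetical illustration, not orcapod's implementation. In particular, it assumes that a pod used only once keeps its bare name and that a pod used several times without labels gets `_0`, `_1`, ... in invocation order; the text here does not spell out those details.

```python
from collections import Counter


def sketch_node_names(invocations):
    """Assign a node name to each (pod_name, label) invocation.

    Hypothetical helper for illustration; not part of orcapod.
    """
    unlabeled = Counter(name for name, label in invocations if label is None)
    seen = Counter()
    names = []
    for name, label in invocations:
        if label is not None:
            names.append(label)  # an explicit label always wins
        elif unlabeled[name] == 1:
            names.append(name)   # assumption: a single use keeps the bare name
        else:
            names.append(f"{name}_{seen[name]}")
            seen[name] += 1
    return names


names = sketch_node_names(
    [
        ("add_numbers", None),
        ("add_numbers", None),
        ("multiply_numbers", None),
        ("combine_results", "my_final_result"),
    ]
)
```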

In [49]:
pipeline.my_summation.df

id,sum
i64,i64
0,11
1,22
2,33
3,44
4,55


In [50]:
pipeline.my_product.df

id,product
i64,i64
0,10
1,40
2,90
3,160
4,250


In [51]:
pipeline.my_final_result.df

id,result
i64,str
0,"""Sum: 11, Product: 10"""
1,"""Sum: 22, Product: 40"""
2,"""Sum: 33, Product: 90"""
3,"""Sum: 44, Product: 160"""
4,"""Sum: 55, Product: 250"""


Notice that despite the pipeline having just been created, each node already has results filled in! This is because the results from the previous pipeline execution were fetched back from the data store. Critically, this happened only because Orcapod noticed that you had an identical pipeline with the same inputs and the same operators/pods, so the results can be reused as is. Had the structure of the pipeline been different, the wrong results would not have been loaded.
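One way to think about this kind of reuse is a store keyed by what a node computes and what it was fed. The sketch below is a conceptual illustration only. `node_key` and `SketchStore` are hypothetical names, the key derivation (SHA-256 over the pod name and input hashes) is an assumption, and orcapod's actual store layout and lookup are not reproduced here.

```python
import hashlib
import json


def node_key(pod_name, input_hashes):
    """Derive a key from the pod identity and the hashes of its inputs."""
    payload = json.dumps({"pod": pod_name, "inputs": list(input_hashes)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SketchStore:
    """In-memory result store that computes a result only for unseen keys.

    Hypothetical class for illustration; not part of orcapod.
    """

    def __init__(self):
        self._results = {}
        self.computations = 0

    def get_or_compute(self, key, compute):
        if key not in self._results:
            self.computations += 1
            self._results[key] = compute()
        return self._results[key]


store = SketchStore()
key_a = node_key("add_numbers", ["input-hash-1"])
first = store.get_or_compute(key_a, lambda: [11, 22, 33])
again = store.get_or_compute(key_a, lambda: [11, 22, 33])  # reused, not recomputed

# A different pod on the same input gets a different key and its own result.
key_b = node_key("multiply_numbers", ["input-hash-1"])
other = store.get_or_compute(key_b, lambda: [10, 40, 90])
```

Because the key changes whenever the pod or its inputs change, a structurally different pipeline never picks up results that belong to another one.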