This repository has been archived by the owner on Jul 3, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 37
WIP: hamilton + quokka / pyspark #269
Draft
skrawcz
wants to merge
5
commits into
main
Choose a base branch
from
quokka_example
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
So nothing special. It dos make things testable, but not an ideal UX.
skrawcz
changed the title
Adds example showing hamilton + quokka today
WIP: hamilton + quokka
Jan 5, 2023
Thoughts to think about:
|
Sketch of some code converting the hello world example # Data Loading
# Filtering is part of data loading -- do we also expose columns like this?
@extract_columns(*["l_quantity", "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus"])
def lineitem(qc: QuokkaContext, path: str,
filter: str = "l_shipdate <= date '1998-12-01' - interval '90' day") -> DataStream:
"""Loads and filters data from the lineitem table"""
ds: DataStream = qc.read_csv(path, sep="|", has_header=True)
if filter:
ds = ds.filter(filter)
return ds
# transforms we want people to write
def disc_price(l_extendedprice: pl.Series, l_discount: pl.Series) -> pl.Series:
"""Computes the discounted price"""
return l_extendedprice * (1 - l_discount)
def charge(l_extendedprice: pl.Series, l_discount: pl.Series, l_tax: pl.Series) -> pl.Series:
"""Computes the charge"""
return l_extendedprice * (1 - l_discount) * (1 + l_tax)
@groupby("l_returnflag", "l_linestatus", order_by=[...])
def grouped_lineitem(l_quantity: pl.Series, l_extendedprice: pl.Series,
disc_price: pl.Series, charge: pl.Series, l_discount: pl.Series,
l_returnflag: pl.Series, l_linestatus: pl.Series) -> GroupedDataStream:
pass
# maybe more subtly
def grouped_lineitem(l_returnflag: pl.Series, l_linestatus: pl.Series, *, l_quantity: pl.Series, l_extendedprice: pl.Series,
disc_price: pl.Series, charge: pl.Series, l_discount: pl.Series,
) -> GroupedDataStream:
pass |
This is highly experimental. Just getting a feel for: 1. the hamilton API one would need to write. 2. what we'd need to adjust within Hamilton to make things work as expected. There is a lot of hackyness. But basically it seems that with a graph adapter, assuming correctly node traversal, then at least for the case of a single datastream object we can correctly derive things. Edge cases to handle: - multiple input datastreams - joins -- what's the syntax for that? - schema validation -- we use that to understand what should or shouldn't be used. API thoughts: - groupby doesn't seem terrible, but really the question is: where do you house the group by logic? in the adapter? or in the function? - need to figure out some join syntax
skrawcz
force-pushed
the
quokka_example
branch
from
January 6, 2023 06:36
448d094
to
72d6411
Compare
Parking a thought -- what about just enabling hamilton type functions instead of with_column? Basically given a datastream, that's the input and the output is another datastream def disc_price(l_extendedprice: pl.Series, l_discount: pl.Series) -> pl.Series:
"""Computes the discounted price"""
return l_extendedprice * (1 - l_discount)
def charge(l_extendedprice: pl.Series, l_discount: pl.Series, l_tax: pl.Series) -> pl.Series:
"""Computes the charge"""
return l_extendedprice * (1 - l_discount) * (1 + l_tax)
def main(qc, path):
temp_module = ad_hoc_utils.create_temporary_module(disc_price, charge)
adapter = QuokkaGraphAdapter_V2(base.DictResult())
lineitem = qc.read_csv(path, sep="|", has_header=True)
d = lineitem.filter("l_shipdate <= date '1998-12-01' - interval '90' day")
dr = hamilton.Driver({}, temp_module, adapter=adapter)
d = dr.execute(["disc_price", "charge"])
f = d.groupby(["l_returnflag", "l_linestatus"], orderby=["l_returnflag", "l_linestatus"]).agg(
{
"l_quantity": ["sum", "avg"],
"l_extendedprice": ["sum", "avg"],
"disc_price": "sum",
"charge": "sum",
"l_discount": "avg",
"*": "count",
}
)
return f.collect() |
Hello world continues to work. Basically the next steps would be to: 1. create a join syntax. 2. implement some more examples that have a join. Other thoughts: - every decorator should tag a node that it produced by it.
Not terrible. Still need to design the join abstraction -- but validation that approach to quokka transferred to approach to pyspark.
This does not work.. Trying to replicate the tpc-h-3 benchmark.
Tweaking the above slightly: def disc_price(l_extendedprice: pl.Series, l_discount: pl.Series) -> pl.Series:
"""Computes the discounted price"""
return l_extendedprice * (1 - l_discount)
def charge(l_extendedprice: pl.Series, l_discount: pl.Series, l_tax: pl.Series) -> pl.Series:
"""Computes the charge"""
return l_extendedprice * (1 - l_discount) * (1 + l_tax)
def main(qc, path):
temp_module = ad_hoc_utils.create_temporary_module(disc_price, charge)
adapter = QuokkaGraphAdapter_V2()
lineitem = qc.read_csv(path, sep="|", has_header=True)
d = lineitem.filter("l_shipdate <= date '1998-12-01' - interval '90' day")
dr = hamilton.Driver({}, temp_module, adapter=adapter)
d = dr.execute(["disc_price", "charge"], inputs={c: d for c in d.schema}) # default is to append columns to passed in dataframe
f = d.groupby(["l_returnflag", "l_linestatus"], orderby=["l_returnflag", "l_linestatus"]).agg(
{
"l_quantity": ["sum", "avg"],
"l_extendedprice": ["sum", "avg"],
"disc_price": "sum",
"charge": "sum",
"l_discount": "avg",
"*": "count",
}
)
return f.collect() The QuokkaGraphAdapter_V2 adapter then intercepts and massages the internals appropriately to return a datastream with the extra columns. |
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Working on how to make Hamilton handle Quokka better.