We default to pandas, but if someone wants to use something like polars, this is a very quick way to support that. The assumption is that whatever dataframe library we're supporting allows "dictionary"-like access to get the columns out.
Polars is a competing dataframe library. This is a minimal example, based on the pandas hello world, that shows it's pretty easy to write polars code with Hamilton too. Note: we don't use the `select` syntax here. I don't know whether this is a best practice or not, but hopefully someone from the polars community can set us straight here.
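The "dictionary"-like access assumed above can be illustrated without either library. This is only a sketch: `ToyFrame` and `pull_columns` are hypothetical names invented for the example, not part of Hamilton, pandas, or polars.

```python
from typing import Any, Dict, List


class ToyFrame:
    """Hypothetical stand-in for a dataframe; it only supports frame["column"]."""

    def __init__(self, data: Dict[str, List[Any]]):
        self._data = data

    def __getitem__(self, column: str) -> List[Any]:
        return self._data[column]


def pull_columns(frame: Any, columns: List[str]) -> Dict[str, Any]:
    """Extract columns using only frame[name], the one operation assumed of any library."""
    return {name: frame[name] for name in columns}


frame = ToyFrame({"spend": [10, 20, 30], "signups": [1, 2, 3]})
extracted = pull_columns(frame, ["spend", "signups"])
```

Anything that answers `frame[name]` works with `pull_columns`, which is also why the approach breaks as soon as a library needs a different access pattern.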
OK, I'm all for these features, but we should be a little smarter about how we do it: this `extract_columns` approach is super brittle.
Some ideas:
- Use singledispatch (or something like that) and register types
- Define a set of operations on dataframes and implementations of them for each one
- Have multiple decorators, using something like `applies_to`, and then a dispatch-type approach.
Open to other ideas. Also, this assumes that we have to know everything when the decorator is instantiated, which is wrong. We can actually derive it when the decorator is called.
In fact, we may want to add a better validation step that runs when the decorator is called; our current one doesn't really do what we want.
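As a rough illustration of the first idea (registering types with `singledispatch`), here is a minimal sketch. `BracketFrame`, `MethodFrame`, and `get_column` are hypothetical stand-ins made up for this example; they are not Hamilton, pandas, or polars APIs.

```python
from functools import singledispatch
from typing import Any, Dict, List


class BracketFrame:
    """Hypothetical stand-in for a library whose frames support frame[name]."""

    def __init__(self, data: Dict[str, List[Any]]):
        self.data = data

    def __getitem__(self, name: str) -> List[Any]:
        return self.data[name]


class MethodFrame:
    """Hypothetical stand-in for a library that exposes a get_column method instead."""

    def __init__(self, data: Dict[str, List[Any]]):
        self.data = data

    def get_column(self, name: str) -> List[Any]:
        return self.data[name]


@singledispatch
def get_column(frame: Any, name: str) -> Any:
    """Fallback: fail loudly for types nobody has registered."""
    raise NotImplementedError(f"no column accessor registered for {type(frame).__name__}")


@get_column.register(BracketFrame)
def _get_column_bracket(frame: BracketFrame, name: str) -> List[Any]:
    return frame[name]


@get_column.register(MethodFrame)
def _get_column_method(frame: MethodFrame, name: str) -> List[Any]:
    return frame.get_column(name)
```

The decorator would then call `get_column(frame, name)` without knowing which library produced `frame`; supporting a new library means adding one more `register` call.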
self,
*columns: Union[Tuple[str, str], str],
fill_with: Any = None,
df_type: Type = pd.DataFrame,
OK, idea for the above. Take inspiration from `singledispatch`, and...
(1) Don't instantiate it on the dataframe type
(2) Register additional types in plugins
(3) Use dispatch on the node/DAG we get in
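The three steps above might look roughly like the following. This is a sketch under stated assumptions: `COLUMN_EXTRACTORS`, `register_extractor`, and `extract_columns_sketch` are hypothetical names, a plain `dict` stands in for a dataframe type, and the real `extract_columns` decorator works on Hamilton nodes rather than wrapping the function like this.

```python
from typing import Any, Callable, Dict, get_type_hints

# Hypothetical registry that plugins would append to: output type -> column accessor.
COLUMN_EXTRACTORS: Dict[type, Callable[[Any, str], Any]] = {}


def register_extractor(frame_type: type, extractor: Callable[[Any, str], Any]) -> None:
    """Step (2): a plugin registers how to pull a column out of its frame type."""
    COLUMN_EXTRACTORS[frame_type] = extractor


class extract_columns_sketch:
    def __init__(self, *columns: str):
        # Step (1): only column names are stored; no dataframe type is required here.
        self.columns = columns

    def __call__(self, fn: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
        # Step (3): dispatch on what we are handed, here the declared return type.
        return_type = get_type_hints(fn).get("return")
        for klass in getattr(return_type, "__mro__", ()):
            if klass in COLUMN_EXTRACTORS:
                extractor = COLUMN_EXTRACTORS[klass]
                break
        else:
            raise TypeError(f"no extractor registered for return type {return_type!r}")

        def expand(*args: Any, **kwargs: Any) -> Dict[str, Any]:
            frame = fn(*args, **kwargs)
            return {name: extractor(frame, name) for name in self.columns}

        return expand


# A plain dict plays the role of a dataframe type in this sketch.
register_extractor(dict, lambda frame, name: frame[name])


@extract_columns_sketch("a", "b")
def load_data() -> dict:
    return {"a": [1], "b": [2], "c": [3]}


columns = load_data()
```

Because the lookup happens in `__call__`, an unsupported return type is rejected when the decorator is applied to the function, which is also a natural place for the validation step.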
🤔 Not sure how we can do `singledispatch` here without some major surgery on `extract_columns`; we want classes that aren't in the same module (we don't want to add polars as a dependency of the main package), where the only difference is really the types in most cases, plus maybe a minor implementation change or two...
This can be solved with plugins and registration of extra implementations of decorators.
E.g., this is simple:
- The polars plugin appends all of its polars-specific decorator implementations
- The polars plugin tells us the condition for when to use them, depending on the nodes we're modifying
- We look to see what plugins are available and use the one that applies
I don't think `extract_columns` should have two different ways to call it? It's just going to get ugly.
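A minimal sketch of that plugin flow follows. Everything named here (`DecoratorPlugin`, `PLUGINS`, `register_plugin`, `find_plugin`, and the toy frame classes) is hypothetical and only illustrates the register-then-select idea; it is not how Hamilton implements plugins.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


class ToyPandasFrame(dict):
    """Hypothetical stand-in for a pandas DataFrame."""


class ToyPolarsFrame(dict):
    """Hypothetical stand-in for a polars DataFrame."""


@dataclass
class DecoratorPlugin:
    name: str
    matches: Callable[[type], bool]     # condition: does this plugin apply to the node's type?
    extract: Callable[[Any, str], Any]  # library-specific implementation


PLUGINS: List[DecoratorPlugin] = []


def register_plugin(plugin: DecoratorPlugin) -> None:
    """Called by each plugin package when it is imported."""
    PLUGINS.append(plugin)


def find_plugin(node_type: type) -> DecoratorPlugin:
    """Return the first registered plugin whose condition holds for the node type."""
    for plugin in PLUGINS:
        if plugin.matches(node_type):
            return plugin
    raise LookupError(f"no plugin applies to {node_type.__name__}")


register_plugin(DecoratorPlugin(
    name="pandas",
    matches=lambda t: issubclass(t, ToyPandasFrame),
    extract=lambda frame, column: frame[column],
))
register_plugin(DecoratorPlugin(
    name="polars",
    matches=lambda t: issubclass(t, ToyPolarsFrame),
    extract=lambda frame, column: frame[column],
))

chosen = find_plugin(ToyPolarsFrame)
```

With this shape the decorator keeps a single call signature, and the main package never has to import polars itself.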
I think we have several ways to implement a plugin.
It's mainly figuring out what should be abstracted.
So I like the idea of abstracting the "dataframe" instead of the decorator. That way we use "black box" delegation, and the decorator is simple, one for all dataframes, it's the black box delegation that takes care of the specific implementation/behavior... For example, this is the code I was prototyping to see how it would look in my latest commit.
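The prototype itself is not reproduced in this thread. As a generic illustration of abstracting the dataframe rather than the decorator, a sketch might look like the following; `FrameAdapter`, the two adapters, and `extract_via_adapter` are hypothetical names and are not taken from the commit.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class FrameAdapter(ABC):
    """Hypothetical "black box": the only interface the decorator would see."""

    @abstractmethod
    def column_names(self) -> List[str]:
        ...

    @abstractmethod
    def column(self, name: str) -> List[Any]:
        ...


class ColumnStoreAdapter(FrameAdapter):
    """Wraps data stored as {column name: list of values}."""

    def __init__(self, data: Dict[str, List[Any]]):
        self._data = data

    def column_names(self) -> List[str]:
        return list(self._data)

    def column(self, name: str) -> List[Any]:
        return self._data[name]


class RowStoreAdapter(FrameAdapter):
    """Wraps data stored as a list of {column name: value} rows."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows

    def column_names(self) -> List[str]:
        return list(self._rows[0]) if self._rows else []

    def column(self, name: str) -> List[Any]:
        return [row[name] for row in self._rows]


def extract_via_adapter(adapter: FrameAdapter, columns: List[str]) -> Dict[str, List[Any]]:
    """Library-agnostic extraction: validate first, then delegate to the adapter."""
    missing = [c for c in columns if c not in adapter.column_names()]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return {c: adapter.column(c) for c in columns}


by_column = extract_via_adapter(ColumnStoreAdapter({"a": [1, 2], "b": [3, 4]}), ["a"])
by_row = extract_via_adapter(RowStoreAdapter([{"a": 1, "b": 3}, {"a": 2, "b": 4}]), ["a"])
```

Both storage layouts yield the same result through one code path, which is the point of delegating to the adapter instead of branching inside the decorator.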
The anonymous functions here don't need to be type annotated. What we're returning is dependent on the Node specification.
a2a3df9 to 3febfb0
This shows how we might avoid requiring users to pass in the dataframe and series types. Instead, some functions register themselves, and `singledispatch` chooses the right implementation at runtime. This is conceptual code: we'd need to clean things up and add support for dask dataframes, etc.
3febfb0 to 6c0f5c6
This was completed in #273.
Polars is gaining traction, so we should have an example. This PR adds one that matches our hello world. It requires one adjustment to the `extract_columns` decorator to function. Other than that, the user currently has to create their own build result function, which seems fine for now, but it's something we could build a `sf-hamilton-polars` package to house.
Changes
How I tested this
Notes
Checklist