Skip to content
This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Polars example #263

Closed
wants to merge 4 commits into from
Closed

Polars example #263

wants to merge 4 commits into from

Conversation

skrawcz
Copy link
Collaborator

@skrawcz skrawcz commented Jan 1, 2023

Polars is gaining traction, we should have an example. This PR helps show an example that matches our hello world. It requires one adjust to the extract_columns decorator to function. Otherwise the user right now has to create their own build result function. Which seems fine for now, but it's something we could build a sf-hamilton-polars package for to house.

Changes

  1. prototypes change to extract_columns to handle providing the dataframe types.
  2. adds example polars code.

How I tested this

  1. Tested this locally.

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

We default to pandas, but if someone wants to use something like
polars, this is a very quick way to support that.

The assumption is that whatever dataframe library we're supporting,
allows for a "dictionary" like access to get the columns out.
Polars is a competing dataframe library. This is a minimal
example based off the pandas hello world that shows
it's pretty easy to right polars code with Hamilton too.

Note: we don't use the `select` syntax here. I don't know
whether this is a best practice or not. But hopefully someone
from the polars community could help set us straight here.
Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, all about these features but we should be a little smarter about how we do it -- this extract_columns approach is super brittle.

Some ideas:

  1. Use singledispatch (or something like that) and register types
  2. Define a set of operations on dataframes and implementations of them for each one
  3. Have multiple decorators, using something like applies_to and then a dispatch-type approach.

Open to other ideas. Also this is making the assumption that we should know everything on instantiation of the decorator, which is wrong. We can actually derive it on calling of the decorator.

In fact, we may want to add a better validation step that validates when calling it -- our current one doesn't really do what we want.

examples/polars/my_functions.py Outdated Show resolved Hide resolved
self,
*columns: Union[Tuple[str, str], str],
fill_with: Any = None,
df_type: Type = pd.DataFrame,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, idea for the above. Take inspiration from singledispatch, and...

(1) Don't instantiate it on the dataframe type
(2) register additional types in plugins
(3) use dispatch on the node/DAG we get in

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 not sure how we can do singledispatch here without some major surgery on extract_columns; we want classes that aren't in the same module (don't want to add polars as a dependency to main package), where the only difference is really types in most cases, maybe a minor implementation change or two...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be solved with plugins and registration of extra implementations of decorators.

E.G. this is simple:

  1. Polars plugin appends all its decorator implementations that are polars-specific
  2. Polars plugin tells us the condition of when to use them pending the nodes we're modifying
  3. We look to see what plugins are available and use the one that applies

I don't think extract_columns should have two different ways to call it? Its just going to get ugly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have several ways to implement a plugin.

It's mainly figuring out what should be abstracted.

So I like the idea of abstracting the "dataframe" instead of the decorator. That way we use "black box" delegation, and the decorator is simple, one for all dataframes, it's the black box delegation that takes care of the specific implementation/behavior... For example, this is the code I was prototyping to see how it would look in my latest commit.

hamilton/function_modifiers/expanders.py Outdated Show resolved Hide resolved
hamilton/function_modifiers/expanders.py Outdated Show resolved Hide resolved
The anonymous functions here don't need to be type annotated.
What we're returning is dependent on the Node specification.
@skrawcz skrawcz force-pushed the polars_example branch 2 times, most recently from a2a3df9 to 3febfb0 Compare January 2, 2023 04:32
To show how we might not require users to pass in
the dataframe and series types. Instead we have
some functions that register themselves,
in addition to using singledispatch which will
at runtime choose the right type.

This is conceptual code -- we'd need to clean things
up and add support for dask dataframes, etc.
@skrawcz skrawcz marked this pull request as draft January 2, 2023 15:24
@skrawcz
Copy link
Collaborator Author

skrawcz commented Jan 4, 2023

TODOs:

  • determine dataframe interface
  • refactor into separate packages

@skrawcz
Copy link
Collaborator Author

skrawcz commented Jan 28, 2023

This was completed in #273

@skrawcz skrawcz closed this Jan 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants