Skip to content
This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Decorator to specify value and dag node inputs #162

Closed
wangkev opened this issue Jul 23, 2022 · 17 comments · Fixed by #163
Closed

Decorator to specify value and dag node inputs #162

wangkev opened this issue Jul 23, 2022 · 17 comments · Fixed by #163
Assignees
Labels
enhancement New feature or request

Comments

@wangkev
Copy link

wangkev commented Jul 23, 2022

Is your feature request related to a problem? Please describe.
I am looking for a decorator to specify both value and dag node inputs. parameterized allows for value inputs, parameterized_inputs allows for dag node inputs, but there are no decorators that allow you to do both.

Describe the solution you'd like
A decorator that allows for both value and dag node inputs.

Describe alternatives you've considered
I tried just decorating a function with both decorators, but that is not supported.

Additional context

@skrawcz
Copy link
Collaborator

skrawcz commented Jul 23, 2022

@wangkev thanks for posting.

Do you mind providing a bit more information around your use case? Otherwise we've talked about this possibility -- we just haven't had the use case for it.

Feel free to post here -- or join us in slack to chat there.

@skrawcz skrawcz added the triage label for issues that need to be triaged. label Jul 23, 2022
@wangkev
Copy link
Author

wangkev commented Jul 24, 2022

There are a few use cases. For example, to implement an adder or shifter function.

def add(field: pd.Series, n: int):
    return field + n
def shift(field: pd.Series, window: int):
    return field.shift(window)

We would want to parameterize field and n/window.

More generally, many pandas.Series methods take as input other scalar arguments.

@elijahbenizzy
Copy link
Collaborator

So its basically a way to do 1 function where it explodes out to multiple use-cases, right?

It's an interesting tension between repetitive code (E.G. repeating multiple input/output parameterizations) and making everything readable. A common way we've seen this is something like:

def _shift_helper(**kwargs):
    # this is a slight limitation of the `@does` API -- happy to think this through a little more
    series_arg = [item for item in kwargs.values() if isinstance(kwarg, pd.Series)]
    window_args = [item for item in kwargs.values() if isinstance(kwarg, int)] 
    return series_arg.shift(window_arg)


@does(_shift_helper)
def shift_artifact_1(field_1: pd.Series, window: int=3) -> pd.Series:
    pass

@does(_shift_helper)
def shift_artifact_2(field_2: pd.Series, window: int=5) -> pd.Series:
    pass

# And so on

This is nice as everything is written down in a function but there's a lot of added code. I think when there gets to be enough parametrizations, what you suggested would be pretty nice.

E.G. something like:

@parametrized_everything_for_lack_of_a_better_name(
    field_1=({"field" : "field_1", "shift" : 3}, "field_1 shifted by 3"),
    field_2=({"field" : "field_2", "shift" : 5}, "field_2 shifted by 5"),
def shift(field: pd.Series, window: int):
    return field.shift(window)  

The cool part about this from an implementation perspective is we could simplify the code by gutting @parametrized and @parametrized_inputs , replacing it with subclasses/delegation of this new decorator.

Also, what exactly breaks with the two combined? Feels like that could be an approach as well.

@wangkev
Copy link
Author

wangkev commented Jul 25, 2022

Yes, exactly. I have settled on something like the former you mentioned for now, just to have something that works. The preferred, imo, is definitely the latter, which I find is clearer and more concise.

@elijahbenizzy
Copy link
Collaborator

Awesome, I'm inclined to agree. Might give this a stab in the next week -- if I do, would you mind testing out on a branch/giving feedback?

There are a few APIs I'm thinking about -- the interesting thing is we have to specify if an input is a literal (E.G. a string), or coming from an input. One option could be...

@parametrized_full(
    field_1=({"field" : dependency("field_1"), "shift" : 3}, "field_1 shifted by 3"),
    field_2=({"field" : dependency("field_2"), "shift" : 5}, "field_2 shifted by 5"),
def shift(field: pd.Series, window: int):
    return field.shift(window)  

Or its alternative:

@parametrized_full(
    field_1=({"field" : "field_1", "shift" : literal(3)}, "field_1 shifted by 3"),
    field_2=({"field" : "field_2", "shift" : literal(5)}, "field_2 shifted by 5"),
def shift(field: pd.Series, window: int):
    return field.shift(window)  

Any other ideas? I'm kinda into requiring this decorator to have both:

@parametrized_full(
    field_1=({"field" : dependency("field_1"), "shift" : literal(3)}, "field_1 shifted by 3"),
    field_2=({"field" : dependency("field_2"), "shift" : literal(5)}, "field_2 shifted by 5"),
def shift(field: pd.Series, window: int):
    return field.shift(window)  

The advantage of this is that it allows something really expressive, and easy delegation to @parametrized and @parametrized_inputs.

The annoying thing is that @parametrized should really be called @parametrized_values, and the new one should just be @parametrized.... Could be a good change for 2.0 but we haven't started thinking too much about that.

@wangkev
Copy link
Author

wangkev commented Jul 26, 2022

Absolutely - happy to test out and give feedback!

I also kind of like explicitly specifying a node vs. a literal. Although, I do find the expression API of polars to be quite elegant (https://pola-rs.github.io/polars-book/user-guide/dsl/expressions.html). This would be in favor of the first option, with only nodes explicitly specified.

Other thing I would note is whether it would make sense to keep the parameterized docstring approach as with parameterized_inputs now. Maybe it is more flexible with an explicit docstring.

@elijahbenizzy
Copy link
Collaborator

OK, had a lot of fun with this. Not complete yet, but on its way: #163.

polars looks awesome! hadn't seen it before, but definitely want to check it out -- those transformations look quite slick. I'm in favor of specifying both upstream and literal (and this currently would break otherwise), but I'd consider otherwise.

@elijahbenizzy
Copy link
Collaborator

Hey! Ready for review -- would love your thoughts on the API! #163

@skrawcz skrawcz added enhancement New feature or request and removed triage label for issues that need to be triaged. labels Aug 6, 2022
@skrawcz
Copy link
Collaborator

skrawcz commented Aug 6, 2022

@wangkev in case it's helpful, you can either checkout the branch and do pip install -e ., or install the branch directly from github. I believe this should work pip install "git+https://github.com/stitchfix/hamilton@full_parametrized". Instead of https you could do ssh.

@skrawcz skrawcz linked a pull request Aug 6, 2022 that will close this issue
12 tasks
@wangkev
Copy link
Author

wangkev commented Aug 6, 2022

Hi -- thanks for the update, this is neat! The overall functionality seems to be exactly what we discussed. I have a few thoughts from playing around:

  1. parameterize_values and parameterize_inputs do not like it when I pass in literal and upstream inputs. I would have expected that to work, i.e. to be able to be consistent with parameterize. Don't think it's a necessity, but did cause a bit of confusion in the beginning.
  2. Would it make sense to have consistent naming between literal/upstream and parameterize_values/parameterize_inputs?

Thanks again for the quick update on this!

@elijahbenizzy
Copy link
Collaborator

elijahbenizzy commented Aug 7, 2022

Hi -- thanks for the update, this is neat! The overall functionality seems to be exactly what we discussed. I have a few thoughts from playing around:

  1. parameterize_values and parameterize_inputs do not like it when I pass in literal and upstream inputs. I would have expected that to work, i.e. to be able to be consistent with parameterize. Don't think it's a necessity, but did cause a bit of confusion in the beginning.
  2. Would it make sense to have consistent naming between literal/upstream and parameterize_values/parameterize_inputs?

Thanks again for the quick update on this!

Thanks for taking it for a spin! Re: (1) the idea was to have it be a more speciifc API, but its hard to say it shouldn't accept them? E.G. why would it not allows upstream for @parameterize_inputs?

Re (2) slightly torn -- appreciate consistency but parameterize_upstream and parameterize_literal doesn't feel particularly readable. Perhaps we should replace upstream and literal with input and value, although those feel a little less clear to me... Thoughts?

@wangkev
Copy link
Author

wangkev commented Aug 8, 2022

Re: (1) the idea was to have it be a more speciifc API, but its hard to say it shouldn't accept them? E.G. why would it not allows upstream for @parameterize_inputs?

Agree.

Re (2) slightly torn -- appreciate consistency but parameterize_upstream and parameterize_literal doesn't feel particularly readable. Perhaps we should replace upstream and literal with input and value, although those feel a little less clear to me... Thoughts?

I was thinking the same thing about naming. Elsewhere in the documentation (https://github.com/stitchfix/hamilton/blob/main/basics.md?plain=1#L16), upstream/inputs is called a node, so maybe it makes sense to keep that terminology? Then it will be node vs value?

@elijahbenizzy
Copy link
Collaborator

elijahbenizzy commented Aug 8, 2022

Oh interesting. So:

{'foo' : node('foo_source'), 'bar' : value('bar')}?

Then keep @parameterize_inputs and @parameterize_values? I don't think @parameterize_node would make too much sense...

Maybe @parameterize_sources and @parameterize_values makes sense? Along with source(...) and value(...)?

source and value is what we're really changing...

Thanks for the feedback!

@skrawcz
Copy link
Collaborator

skrawcz commented Aug 8, 2022

🤔 I think this is a power user feature, so node is probably better than source? Though node will clash with the package & class name we have... hmm.

We should be aiming for naming where if someone has used it a couple of times, they wont need to look at the documentation to remember what the distinctions are...

So current options are:

  1. source & value.
  2. source & literal.
  3. input & value. input would override a built in term so I think this removes this option.
  4. input & literal. input would override a built in term so I think this removes this option.
  5. upstream & value.
  6. upstream & literal.
  7. dependency & value.
  8. dependency & literal.
  9. field & value.
  10. field & literal.

Any others?

@elijahbenizzy
Copy link
Collaborator

input can be namespaced, but agreed we want to steer clear from builtins. IMO:

  • @parameterize_sources and {'arg' : source('arg_source')}
  • @parameterize_values and {'arg' : value('arg_value')}

Are internally consistent, logical, and easy to understand. inputs are not specific enough, upstream feels hard to grok, dependency is a little too long. ALthough dependency and value would work as well...

@wangkev
Copy link
Author

wangkev commented Aug 9, 2022

Another option is field vs value.

@elijahbenizzy
Copy link
Collaborator

OK, so I think I'm going to go with the above:

  • @parameterize_sources and source(...)
  • @parameterize_values and value(...)

IMO its the clearest and most concise - any objections?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants