Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clone from non-parameters #1630

Merged
merged 2 commits into from
Nov 1, 2016

Conversation

ChrisBeaumont
Copy link
Contributor

Currently Task.clone only propagates values from source to target when both attributes are Parameters. This PR propagates data from source to target parameter regardless of whether the source attribute is a parameter or something else (property, class or instance-level attribute, etc).

The use case for this has come up in cases where we interface between generic and specific tasks. A simplified example, consider some generic classes to POST several files referenced in a manifest to an external url:

class Upload(luigi.Task):
    url = luigi.Parameter()
    file = luigi.Parameter()

class UploadFromManifest(luigi.Task):
    url = luigi.Parameter()
    manifest = luigi.Parameter()

    def run(self):
        files = open(manifest).readlines()
        yield [self.clone(cls=Upload, file=file) for file in files]   

as well as a more specific subtask where url is derived from a more fundamental parameter:

class MonthlyReport(UploadFromManifest):
    month = luigi.Parameter()
    manifest = luigi.Parameter()

    @property
    def url(self):
        return 'http://test.com/reports/%s' % self.month

This fails at the clone step because, although url is defined for MonthlyReport, it's not a parameter and thus not propagated to Upload

@erikbern
Copy link
Contributor

erikbern commented Apr 4, 2016

not sure about this. it seems too magic and i think it could lead to unexpected behavior later.

what's the issue you're trying to solve here? your example isn't very useful because url is directly derived from month

@ChrisBeaumont
Copy link
Contributor Author

it seems too magic and i think it could lead to unexpected behavior later.

Yeah maybe, the failing tests (which I haven't digested yet) might suggest this as well. Here's a slightly more fleshed out use case to describe where I'm coming from:

In practice the Upload and UploadFromManifest tasks (which I actually do use), have several more parameters:

class RequestMixin(object):
    url = luigi.Parameter()
    headers = DictParameter(default={})
    params = DictParameter(default={})
    username = luigi.Parameter(default=None, significant=False)
    password = luigi.Parameter(default=None, significant=False)

class Upload(luigi.Task, RequestMixin):
    payload_file = luigi.Parameter()

    def run(self):
        # builds and sends request, saves response content as task output

class UploadFromManifest(luigi.Task, RequestMixin):
    manifest = luigi.Parameter()
    def run(self):
         yield [self.clone(cls=Upload, payload_file=file) for file in open(self.manifest).readlines()]

we use the generic UploadFromManifest in several different pipelines to transfer data between services. In most concrete cases all of the parameters in in RequestMixin are either known at task-definition time, or derived from a more context-specific parameter. So they might be specifiable as something like:

class UploadToServiceFoo(UploadFromManifest):
    release_number = luigi.IntParameter()

    # static for this task
    username = settings.FooServiceUser
    password = settings.FooServicePassword
    headers = {'Content-Type': 'text/csv'}
    params = {'uploader: 'luigi'}

    #url derives from release_number
    @property
    def url(self):
        return "http://foo.com/%s" % self.release_number

This PR makes the above code functional, and I rather like its flat+declarative structure. Without it you'd need to write:

class UploadToService(luigi.WrapperTask):
    <same declarations as above>

    def requires(self):
         yield UploadManifest(username=self.username, password=self.password, headers=self.headers, params=self.params, url=self.url)

ie you lose the intended benefits of clone (not having to re-enumerate every parameter when describing dependencies). Furthermore the first example throws an exception like missing required parameter url, which surprised me since url is clearly defined by the task.

Another small consideration: the following code works currently (maybe you'd consider it an antipattern, I'm not sure):

class SpecificUpload(Upload):
    release_number = luigi.Parameter()

    username = settings.username
    password = settings.password
    headers = {'Content-Type':'text/csv'}    
    url = 'http://foo.com'

    @property
    def payload_file(self):
        return "%s.csv" % self.release_number

In other words, overriding parameters with class-level attributes or properties currently works, but propagating these overrides through clone does not. That feels a little asymmetrical to me.

However, maybe you disagree, or there's another mechanism for addressing this use case that I'm missing?

@Tarrasch
Copy link
Contributor

Tarrasch commented Apr 4, 2016

I'm quite positive to this change and I find this to make sense. We want the clone-behavior for Parameter-attributes to feel as similar as possible for other attributes.

Let me know what insights you can get from the failed test case.

@ChrisBeaumont
Copy link
Contributor Author

@Tarrasch @erikbern ok the problematic test case boils down to this:

class X(luigi.Task):
    n = luigi.IntParameter(default=42)

@inherits(X)
class Z(luigi.Task):
    n = None

    def requires(self):
        return self.clone_parent()

assert Z().requires().n == 42

or more simply

class X(luigi.Task):
    n = luigi.IntParameter(default=42)

class Z(luigi.Task):
    n = None

assert Z().clone(cls=X).n == 42

It was added in this commit by @erikbern

The intention of this PR is to change that behavior, so that n=None is treated as an override. Personally I find the behavior in this PR to be easier to reason about. However I don't know what the original intention of the failing test is, or whether it implies that other user code makes the same assumption as this test does. It's easy to rewrite the test failure so that it passes, but will wait to hear from you whether you think this behavior change is appropriate.

@erikbern
Copy link
Contributor

erikbern commented Apr 4, 2016

So the real issue you are trying to solve here is you want a subclass to be able to override base class parameter and fix them to certain value?

I haven't thought of this use case

@ChrisBeaumont
Copy link
Contributor Author

So the real issue you are trying to solve here is you want a subclass to be able to override base class parameter and fix them to certain value?

Well this is already possible -- a subclass can override base class parameters with constants (class attributes or properties). I'm proposing the override behavior for clone should behave the same way as it does for subclassing.

@ChrisBeaumont
Copy link
Contributor Author

I haven't thought of this use case

IMO the utility of this comes into play when you extract out common generic tasks. Because they are general-purpose, they often need several parameters to be fully specified. But often times those settings should not be considered "parametrize-able" in the downstream, special-purpose task. For example the Upload task takes url as a parameter, but in a downstream task UploadToFoo I don't want the user to be able to change the value of url from the command line -- maybe it always needs to be foo.com, or else the implementation of UploadToFoo doesn't make sense. Setting url=foo.com at the class level is a nice way to do this, and it would be nice if self.clone(Upload) passed this value along for me.

@Tarrasch
Copy link
Contributor

@ChrisBeaumont any updates on this?

@ChrisBeaumont
Copy link
Contributor Author

@Tarrasch in my opinion this PR is "final", and the question is whether you and @erikbern think it's appropriate to merge. If it's relevant we added this feature to Counsyl's Luigi installation several months ago, and have found it useful.

@Tarrasch Tarrasch merged commit 499f7fc into spotify:master Nov 1, 2016
@Tarrasch
Copy link
Contributor

Tarrasch commented Nov 1, 2016

Thanks @ChrisBeaumont. After reading this once and twice. It's clear to me that this new behavior is better and makes more sense. Usually the luigi dictators are pretty hesitant to changes to luigi core hence the prolonged discussion. I hope it wasn't discouraging and sorry for the very long delay in merging!

This was referenced Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants