Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add range functionality for monthly cadence. #2601

Merged
merged 2 commits into from Dec 13, 2018

Conversation

lallea
Copy link
Contributor

@lallea lallea commented Dec 9, 2018

Description

Add range functionality for monthly cadence.

Motivation and Context

RangeMonthly simplifies running backfilling jobs that run on a monthly basis, in a similar manner as RangeDaily, RangeHourly, etc.

Note that the PR adds a dependency on python-dateutil for date arithmetic.

Have you tested this? If so, how?

tox -e flake8
tox -e py27-core

No new test failures.

Also running in production since a few days back.

@ulzha, I think you might be a suitable reviewer for this one?

Bring in python-dateutil for date arithmetic.
@lallea
Copy link
Contributor Author

lallea commented Dec 9, 2018

Rebased.

Copy link
Collaborator

@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never used any of the Range functionality here. So i've limited my review accordingly.

Based around the descriptions of stop and months_forward, i'm confused at the expected behavior if both are left as their defaults. Could you explain?

luigi/tools/range.py Outdated Show resolved Hide resolved
default=None,
description="ending month, exclusive. Default: None - work forward forever")
months_back = luigi.IntParameter(
default=13, # Little over a year
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the motivation for a default which is a little over a year?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is arbitrary, but it is common to desire to keep a window of datasets back to slightly more than an interesting interval, e.g. in order to be able to compare with last year's numbers. RangeDaily and RangeHourly have 100 days window back - slightly more than a quarter.

luigi/tools/range.py Outdated Show resolved Hide resolved
@@ -40,6 +40,7 @@ def get_static_files(path):
'tornado>=4.0,<5',
# https://pagure.io/python-daemon/issue/18
'python-daemon<2.2.0',
'python-dateutil==2.7.5',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to lock to this specific version? Or < 2.8? or < 3.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the perspective of securing the software supply chain from rogue code, it is better to lock down the version. From the perspective of getting bugs fixed, <2.8 or <3.0 would be better. I'm happy to go either way.

Luigi seems not to have locked down other versions, so the latter would be more consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea unfortunately I which is better. It's a tradeoff a @lallea said

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: I think it's misguided to pin down with == here instead of with an an "at least" constraint, it will create more issues in the future than it will solve.

also see #2662 and #2679

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale for exact dependencies is to prevent scenarios such as https://thenewstack.io/attackers-up-their-game-with-latest-npm-package-compromise/.

With at least dependencies, every time we "pip install luigi" in a container image build, we risk getting a transitive rogue dependency. Given that Luigi typically runs with access to a data lake, such a scenario is a potential disaster.

That being said, it seems that the cure is worse than the disease in absence of a shading mechanism, and I agree that we should settle for at least dependencies. In JVM land, exact dependencies is the norm, but I guess it does not work out as well in Python. I guess that each build that uses Luigi has to take responsibility for the chain instead, e.g. by freezing transitive dependencies.

Copy link
Contributor

@soxofaan soxofaan Mar 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @lallea,
I share your concern of blindly pulling new versions and the security implications, but I think that should be handled at the application level (pip freeze, pipenv lock, ...), not at the library level ("lib A requires at least version X of lib B")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

There are some interesting differences from the JVM world, which I hadn't thought of earlier. Luigi is not really a library, but a framework with its own main function. It is uncommon to run Luigi and need to import other python libraries into the same interpreter, so version conflicts within the same process are rare.

Pip induces us to choose a single set of versions for a complete container image, however, so Luigi's dependencies may conflict with dependencies for e.g. pyspark or tensorflow, even in a case where Luigi runs them in separate processes. Having a virtual environment for each script is painful. Perhaps we should all use pex. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luigi is not really a library

I disagree here. Luigi is pretty useless as a standalone application. To leverage it you at least have to write a script (or app) where you do import luigi to define your tasks, which means that you are using it as a library, no?

FWIW: we practically never use luigi's built in main function and instead call luigi.build() from our own "main" function. Our app has a lot of other dependencies/libraries, which are loaded in same process as luigi.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are different ways to use Luigi, which is fine.

We always schedule luigi.main() from cron jobs, and never put any business logic in Luigi DSL. All non-trivial processing is kept in separate programs for better decoupling and testing. Hence, we have little need to import other libraries, and I didn't consider the potential conflict. Thanks again for pointing out and mending. :-)

Copy link
Contributor

@Tarrasch Tarrasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good to merge for me.


Unlike the Range* classes with shorter intervals, this class does not perform bulk optimisation.
It is assumed that the number of months is low enough not to motivate the increased complexity.
Hence, there is no class RangeMonthlyBase.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay for simplicity. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, that is why we use Luigi! :-)

@Tarrasch Tarrasch merged commit ca0aa9a into spotify:master Dec 13, 2018
@lallea lallea deleted the range-monthly branch December 13, 2018 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants