New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add range functionality for monthly cadence. #2601
Conversation
Bring in python-dateutil for date arithmetic.
Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have never used any of the Range functionality here. So i've limited my review accordingly.
Based around the descriptions of stop
and months_forward
, i'm confused at the expected behavior if both are left as their defaults. Could you explain?
default=None, | ||
description="ending month, exclusive. Default: None - work forward forever") | ||
months_back = luigi.IntParameter( | ||
default=13, # Little over a year |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the motivation for a default which is a little over a year
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is arbitrary, but it is common to desire to keep a window of datasets back to slightly more than an interesting interval, e.g. in order to be able to compare with last year's numbers. RangeDaily and RangeHourly have 100 days window back - slightly more than a quarter.
@@ -40,6 +40,7 @@ def get_static_files(path): | |||
'tornado>=4.0,<5', | |||
# https://pagure.io/python-daemon/issue/18 | |||
'python-daemon<2.2.0', | |||
'python-dateutil==2.7.5', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to lock to this specific version? Or < 2.8? or < 3.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the perspective of securing the software supply chain from rogue code, it is better to lock down the version. From the perspective of getting bugs fixed, <2.8 or <3.0 would be better. I'm happy to go either way.
Luigi seems not to have locked down other versions, so the latter would be more consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea unfortunately I which is better. It's a tradeoff a @lallea said
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rationale for exact dependencies is to prevent scenarios such as https://thenewstack.io/attackers-up-their-game-with-latest-npm-package-compromise/.
With at least dependencies, every time we "pip install luigi" in a container image build, we risk getting a transitive rogue dependency. Given that Luigi typically runs with access to a data lake, such a scenario is a potential disaster.
That being said, it seems that the cure is worse than the disease in absence of a shading mechanism, and I agree that we should settle for at least dependencies. In JVM land, exact dependencies is the norm, but I guess it does not work out as well in Python. I guess that each build that uses Luigi has to take responsibility for the chain instead, e.g. by freezing transitive dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @lallea,
I share your concern of blindly pulling new versions and the security implications, but I think that should be handled at the application level (pip freeze, pipenv lock, ...), not at the library level ("lib A requires at least version X of lib B")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right.
There are some interesting differences from the JVM world, which I hadn't thought of earlier. Luigi is not really a library, but a framework with its own main function. It is uncommon to run Luigi and need to import other python libraries into the same interpreter, so version conflicts within the same process are rare.
Pip induces us to choose a single set of versions for a complete container image, however, so Luigi's dependencies may conflict with dependencies for e.g. pyspark or tensorflow, even in a case where Luigi runs them in separate processes. Having a virtual environment for each script is painful. Perhaps we should all use pex. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Luigi is not really a library
I disagree here. Luigi is pretty useless as a standalone application. To leverage it you at least have to write a script (or app) where you do import luigi
to define your tasks, which means that you are using it as a library, no?
FWIW: we practically never use luigi's built in main function and instead call luigi.build()
from our own "main" function. Our app has a lot of other dependencies/libraries, which are loaded in same process as luigi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are different ways to use Luigi, which is fine.
We always schedule luigi.main()
from cron jobs, and never put any business logic in Luigi DSL. All non-trivial processing is kept in separate programs for better decoupling and testing. Hence, we have little need to import other libraries, and I didn't consider the potential conflict. Thanks again for pointing out and mending. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Good to merge for me.
|
||
Unlike the Range* classes with shorter intervals, this class does not perform bulk optimisation. | ||
It is assumed that the number of months is low enough not to motivate the increased complexity. | ||
Hence, there is no class RangeMonthlyBase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay for simplicity. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye, that is why we use Luigi! :-)
Description
Add range functionality for monthly cadence.
Motivation and Context
RangeMonthly simplifies running backfilling jobs that run on a monthly basis, in a similar manner as RangeDaily, RangeHourly, etc.
Note that the PR adds a dependency on python-dateutil for date arithmetic.
Have you tested this? If so, how?
tox -e flake8
tox -e py27-core
No new test failures.
Also running in production since a few days back.
@ulzha, I think you might be a suitable reviewer for this one?