Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace deprecated pykube(-ng) library with official Kubernetes python library #3093

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

AndrewFarley
Copy link

@AndrewFarley AndrewFarley commented Aug 3, 2021

Description

The pykube and pykube-ng libraries that Luigi depends on are flawed and don't seem to support more recent versions of Kubernetes. Both libraries appear to be abandonware, so we should pivot to the official python library for Kubernetes. This pivot will simplify some of the handlings of this library, as certain configuration values are already supported and configurable via environment variables, as is the widely accepted standard for Kubectl and its supported tools and toolchain. This also adds the following...

  • Better/more debug logging about what its doing to/in Kubernetes
  • Event logging before the pod starts up (so you can see what its doing, eg Pulling Image, etc)
  • Cleaned up code a bit
  • Able to detect cluster scaling up (contribution from @dav009)
  • Able to detect cluster scaling up won't work and failing
  • WIP Adding more robust tests to ensure we're handling all success and failure conditions properly

Motivation and Context

Luigi wasn't working for me on a number of my clusters, and after further investigation this was largely due to the outdated and deprecated libraries in-use. I've found numerous other projects which have pivoted over to the official library with a lot of success and used those largely as examples for how to do this.

I've added extra comments and documentation in all places that make sense for others to be able to more-easily understand how/what to do to use and adopt Kubernetes support for Luigi.

Have you tested this? If so, how?

I've updated the unit tests and run a few test pipelines successfully with all features.

Items to complete

  • Implement core feature-set and pivot to the new library
  • Add documentation
  • WIP Add/update the test coverage
  • WIP Add ability to watch logs during a run, not just after a run
  • Merge and publish a new version for all to enjoy

Comment on lines -53 to -61
auth_method = luigi.Parameter(
default="kubeconfig",
description="Authorization method to access the cluster")
kubeconfig_path = luigi.Parameter(
default="~/.kube/config",
description="Path to kubeconfig file for cluster authentication")
max_retrials = luigi.IntParameter(
default=0,
description="Max retrials in event of job failure")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note for reviewers. These things are safe to remove, and no longer need to live in code, as the upstream Kubernetes library supports handling them automatically via Environment Variables. The max_retrials also (unless I misunderstand) is completely unnecessary as it duplicates functionality with backoff_limit. Arguably, one config option here I could keep is auth_method so an end-user can "force" which auth method to use. But, the Kubernetes library should be able to automatically detect and cascade into the "mode" it needs to function.

Comment on lines 67 to 69
try:
kubernetes_api.config.load_incluster_config()
except Exception as e:
try:
kubernetes_api.config.load_kube_config()
except Exception as ex:
raise ex
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers: This is the "new" cascading automated configuration handling. If we're in Kubernetes, it will automatically detect this and use in-cluster configuration and API endpoints, if that fails, it will attempt to use the kubeconfig file and env vars that help support it. This logic could (arguably) be allowed to be forced one way or the other via a config variable.

Copy link
Collaborator

@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a few comments; once they're addressed, i'll approve

luigi/contrib/kubernetes.py Outdated Show resolved Hide resolved
luigi/contrib/kubernetes.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
test/contrib/kubernetes_test.py Outdated Show resolved Hide resolved
@AndrewFarley
Copy link
Author

Note: I am merging and fixing up the MR #3089 in this MR. Picture of proof of work, added tons of debug logging as well. MR will be updated shortly I believe addressing all issues...

Screen Shot 2021-08-30 at 3 07 28 AM

@AndrewFarley
Copy link
Author

@dav009 I merged(ish) your branch into mine logicall, but added code to detect if a scale up failed (aka, it won't fit even with scale up) which the previous code would infinite loop because it wasn't looking into the Kubernetes event stream. Pic of proof...

Screen Shot 2021-08-30 at 3 10 12 AM

@AndrewFarley
Copy link
Author

@dlstadther Please have another look and review code, but please don't merge yet, I've decided I need to add some further tests because it's really abysmal to not really have any decent ones here to ensure things work as they should, there's so many different kinds of failure conditions which are trying to be detected here and I think we should test them. I'm not fantastic with Pytest so it might be a bit slow, but I'm working on adding tests which I think are critical before merging this. Only able to work on this in free time, so if anyone else can/wants to help me with tests, please do. I've left some of my hacking at tests commented out.

@AndrewFarley
Copy link
Author

AndrewFarley commented Aug 30, 2021

And @dav009 please review the code here and make sure I encapsulated what you were trying to do with your MR. Overall, this MR will have a lot better QoL support for Kubernetes. There's still tons more things I think could/should be done in this library but mostly pedantic or improvements and could be in future merges instead. I'm still working on redoing some of the tests and will sorta include your tests once I fix them. Things I'd love to see in the future are for example on API failure or timeout or general internet jitter/brief unavailability some automated retry and/or exponential backoff mechanism.

@dav009
Copy link
Contributor

dav009 commented Aug 31, 2021

@AndrewFarley amazing job, thank you. I was also quite frustrated with how outdated this is. I will try to review this soonish.

😅 I made an internal library for my organisation that addresses these (old pykube) and also take into account other issues (logging). I am testing it at the moment. We were planning on releasing it at some point.

@dav009
Copy link
Contributor

dav009 commented Nov 12, 2021

@AndrewFarley In my team we ended up implementing this kubernetes contrib update as an independent package:
https://github.com/optimizely/kubeluigi

@AndrewFarley
Copy link
Author

Ping. Folks, I'm not using Luigi, but this MR still really needs to happen. If anyone has time to rebase and take over this PR, please let me know I'm happy to transfer my fork to you or add you permissions. I know a few folks are using my fork already because it's necessary if you want Kubernetes support, so perhaps one of you can take the reins?

@nlahmi-cymotive
Copy link

Hi, why was this not reviewed yet? I tested it locally and it seems to work great..

@AndrewFarley
Copy link
Author

AndrewFarley commented Jan 26, 2023

Hah, I wrote this 2 years ago and still no one merged it. Is this project dead? I'm really glad this client of mine I recommended to not choose Luigi. Eesh.

@AndrewFarley AndrewFarley force-pushed the replace-deprecated-pykube-library branch from d340099 to a1ef345 Compare January 26, 2023 09:42
@AndrewFarley
Copy link
Author

AndrewFarley commented Jan 26, 2023

I just rebased and re-pushed and validated this works on Kubernetes still. Hopefully, in the hopes and dreams of fellow engineers, someone can see it in their heart to merge this. Or, just don't use Kubernetes. :P Up to you

@dlstadther
Copy link
Collaborator

Apologies for this not being reviewed.

As far "is this project dead?" - there is not consistent maintainer involvement (myself included). The project is still owned by Spotify, but not actively developed (to my knowledge, mostly due to their articles related to their intent to replace Luigi with Flyte (note, this is just my personal assumption)). While I have permissions to merge PRs and push to PyPi, I have not used Luigi for development since changing employers a few years ago. I wish to see this project continue, but without an active community and additional maintainer participation, there's only so much I can do. I sympathize with your disappointment and I'm here to help as I reasonably can in my spare time.

Copy link
Collaborator

@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of review of this PR, I'm open to merging this code. However, 1 item prevents me and 1 item deters me:

  1. flake8 tests are failing
  2. a bunch of the unittests are commented out with TODO notes

I am willing to let point # 2 to slide so that master has a functional k8s module. But # 1 will need to be fixed before I can click the merge button.

@AndrewFarley
Copy link
Author

AndrewFarley commented Jan 28, 2023

@dlstadther I've just fixed the code styling to satisfy that need. I indeed also don't use Luigi and more, and I don't really have tons of time to write/update/etc the tests to validate that this works. I can confirm it did work for me once upon a time, and others above have recently checked its (basic) functionality. I can/should leave the TODOs in the test code commented out, in hopes that someone else can take up the reigns and add the tests.

At least the flake tests pass now - https://github.com/spotify/luigi/actions/runs/4034116089/jobs/6935056885

But it appears there are some other errors now? I don't have much time this weekend, but I will try to review this further sometime soon.

@dlstadther
Copy link
Collaborator

@AndrewFarley , thank you for the quick turnaround to address the failing flake8 tests. I'm good with getting this included once test results allow me to merge. I hope whomever will come use this updated k8s support within luigi will contribute back any improvements and test coverage items.

I just checked the current failures and they're all 1 failure:

  • the test class tries to load a kubeconfig and cannot find one - kubernetes_api.config.load_kube_config()

So looks to me like once a kube config is mocked (or however it is expected) for tests, that we'd be good to go for now.

@AndrewFarley
Copy link
Author

Well, in order for the Kubernetes stuff to work, it needs to initialize the library and config somehow. I will need to have a think and a google around of some creative way to workaround this for testing only which won't have K8s support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants