Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-commit hook for pipeline checks #59

Open
lorenzwalthert opened this issue Sep 21, 2023 · 6 comments · May be fixed by #62
Open

Pre-commit hook for pipeline checks #59

lorenzwalthert opened this issue Sep 21, 2023 · 6 comments · May be fixed by #62
Assignees
Labels
feature New feature or request

Comments

@lorenzwalthert
Copy link

I see you are already using pre-commit hooks in this repo, but you could also expose one. I could help with that (at least conceptually), I have contributed to pre-commit (top ten contributor) and maintain R hooks myself. The idea would be to run the validations on every commit. Folowing sagemaker projects, the user could supply a module that contains a functinon get_pipeline() that returns the pipeline.

@stiebels
Copy link
Owner

Hi @lorenzwalthert
That's an interesting idea! Do you perhaps have a link to a conceptually similar example/project for a pre-commit hook that validates a user-provided object (ideally, in Python)? Would love to look into that to get a better idea of how it'd work.

@lorenzwalthert
Copy link
Author

lorenzwalthert commented Sep 24, 2023

Do you perhaps have a link to a conceptually similar example/project for a pre-commit hook that validates a user-provided object (ideally, in Python)?

Not really, but https://github.com/pre-commit/pre-commit-hooks contains hooks with good practice. Here's how I'd imagine it to work for your package, with language: python:

Expose a CLI
Your python package exposes a command line executable that is installed during the installation process of your package (like pip install sagemaker-rightline). In your case, you could expose sagemaker-rightline with sub-command validate and paths to modules that contain a get_pipeline() function. E.g. with the CLI builder typer. That executable is a requirement for pre-commit, but most hooks use an executable that was not specifically deisnged for pre-commit, e.g. flake8, [ruff](https://github.com/astral-sh/ruff-pre-commit/blob/main/.pre-commit-hooks.yaml#L4) and other hooks. We could assume that in the same directory as the pipeline definition file, a file validation.py exists that defines the validation. Folliwng Sagemaker Projects, I think this is the structure they use.

git root/ 
  pipelines/
    {name of pipeline}/
      pipeline.py 
      validation.py

Then, we'd call sagemaker-rightline validate src/pipelines/inference/pipeline.py and that would then

  • load the pipeline definition through calling get_pipeline() from the pipeline.py file referenced above.
  • load the configuration from validation.py referenced above.
  • run the check.
  • emit the data frame with the results
  • throw non-zero exit status if one or more violatinons are found.

Define the pre-commit logic

Now that we have a command line utility to access the functionality from this package, we need to connect it to pre-commit.

  • Create a .pre-commit-config.yaml in this repo and specify language: python, entry_point: sagemaker-rightline validate and pass_filenames: false, types: [file, python] and id: validate. This means the CLI program sagemaker-rightline validate is invoked everytime a python file is staged for commit and the file names won't be passed to the CLI program (as it's not like a formatter hook that processes files 1 by 1, i.e. black file1.py file2.py), but rather, the user will specify them statically (see below).
  • Users will have to create a new entry in their .pre-commit-config.yaml:
    repos:
    -   repo: https://github.com/stiebels/sagemaker-rightline
        rev: main # or tag
        hooks:
        -   id: validate
             args: [validate src/pipelines/inference/pipeline.py, path/to/others.py]

What's ran effectively then is entry_point + args, and file-names are not passed to the hook since we specified son in the .pre-commit-hooks.config.

The hooks will then always run when you change a python file, not just the pipeline definition, since the pipeline definition itself could depend on other modules and you want to ensure changes there don't break your pipeline. If all dependencies are contained in the file that defines the pipeline, you could also implement a different logic. As long as the validations don't take very long, it should be fine to run that hook on each commit.

@stiebels
Copy link
Owner

Thanks a lot for this writeup! Appreciate it, @lorenzwalthert ! I'm having a bit of a busy week, but plan to get back in detail next week.

@stiebels
Copy link
Owner

stiebels commented Oct 26, 2023

Hi @lorenzwalthert ,
It took a while, but I started working on it. Got a basic version up and running. Still needs some re-design and a lot of refactoring in the next days/weeks. #62

@stiebels stiebels linked a pull request Oct 26, 2023 that will close this issue
4 tasks
@stiebels stiebels added the feature New feature or request label Oct 28, 2023
@stiebels stiebels self-assigned this Oct 28, 2023
@stiebels
Copy link
Owner

Hi @lorenzwalthert,
If you're keen, would be great if you could give the PR a look/review: #62
I think it's now in a state where it makes sense. Looking forward to your thoughts!

@lorenzwalthert
Copy link
Author

Sorry for the delay @stiebels. I hope I can find time next week to look at the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants