Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load DAGs per git-sync #177

Closed
6 tasks done
adwk67 opened this issue Oct 7, 2022 · 15 comments
Closed
6 tasks done

Load DAGs per git-sync #177

adwk67 opened this issue Oct 7, 2022 · 15 comments
Assignees
Labels
customer-request release/2023-04 release-note/action-required Denotes a PR that introduces potentially breaking changes that require user action. type/feature-new

Comments

@adwk67
Copy link
Member

adwk67 commented Oct 7, 2022

Issue #150 introduces one possibility for general management of external resources: for DAGs, the recommendation from Airflow is to use git-sync (see e.g. here for more info). This issue covers implementing git-sync in a container to regularly keep DAGs updated.

An example of this is given here. The airflow CRD can be changed to include an optional section shown below this (in airflow the roles are expected to have the same config, so the git-sync definition can be at top-level. This might need to be revisited if git-sync is a feature useful for other operators, such as nifi workflows etc.).

There are quite a number of parameters that can be set (see here and the sections that follow the link). Analagous to the SparkHistoryServerSpec the proposal is to define mandatory fields and expose the rest via a map;

...
  credentialsSecret: test-airflow-credentials
  gitSync:
    #---------------------------------------------
    # necessary and mandatory
    #---------------------------------------------
    name: git-sync
    repo: https://github.com/kubernetes/git-sync
    dagsDirectory: dags
    wait: 60
    #---------------------------------------------
    #  optional
    #---------------------------------------------
    gitSyncConf:
      - # depth: default 1
      ...
  webservers:
    roleGroups:
...

A git-sync container can only sync a single repo, so multiple repos would require a container each.
A git-sync block may occur at top-level (as would be the case for e.g. airflow), or at role level, if the product only requires external git resources for a specific component (i.e. the init-container will only be created for that role).

Acceptance critiera

  • Git-sync implemented and documented.
  • Strategy how to handle multiple git repos containing multiple DAGs developed and documented (the implementation of this is not covered by this ticket, but there should be a strategy to look at in a separate issue).
    • Strategy
      • implement gitsync as a list under a top-level cluster-config.
      • current implementation only considers the first element of this list (thus avoiding a breaking change when a list is processed)
      • when implementing multiple repositories
        • one gitsync container is required per repository endpoint
        • code will be written to consolidate these endpoints into a single folder to used as the DAG folder (possibly by using the webhook mechanism documented here)
  • The git-sync image must be available in off-line mode e.g. must be mirrored (image is incorporated into the airflow product images)
  • remove PVC-usage from tests and documentation.
    • N.B. inform @backstreetkiwi when this is complete so that node labels (needed for the airflow PVC test) can be removed from T2 cluster nodes

See stackabletech/docker-images#337

@fhennig
Copy link
Member

fhennig commented Jan 23, 2023

My understanding is that #150 needs to be implemented together with this, to allow for multiple different DAG loading mechanisms, is that right? @adwk67

Or is it possible to implement the git-sync mechanism without breaking changes?

Another question about my understanding: The operator would get some git links, and clone the repos when starting up, presumably into an emptyDir. Is that right? No provisioning of volumes by the user required.

@adwk67
Copy link
Member Author

adwk67 commented Feb 6, 2023

My understanding is that #150 needs to be implemented together with this, to allow for multiple different DAG loading mechanisms, is that right? @adwk67
Or is it possible to implement the git-sync mechanism without breaking changes?
Another question about my understanding: The operator would get some git links, and clone the repos when starting up, presumably into an emptyDir. Is that right? No provisioning of volumes by the user required.

  • This won't be a breaking change as we would be introducing a new CRD snippet, which should be applicable to other operators, too.
  • this would be instead of, or alongside External resources: replace PVC-usage with dependency sources #150 i.e. they are independent of each other
  • git-sync gets repo details and then periodically pulls changes to an emptyDir from where (in the case of airflow) DAGs can be mounted to the configured DAG folder

@lfrancke
Copy link
Member

Are you two in agreement now? If so I'm happy to move it on

@sbernauer
Copy link
Member

Just throwing in the idea of including all the stuff from k8s.gcr.io/git-sync/git-sync:v3.2.2 in our Airflow Image. (We can e.g. extract it out of the upstream image)
We were able to do so in all other products AFAIK, eliminating the need to specify any image for the user. Another benefit is not overseeing the image when it comes to building arm images

@adwk67
Copy link
Member Author

adwk67 commented Feb 20, 2023

Just throwing in the idea of including all the stuff from k8s.gcr.io/git-sync/git-sync:v3.2.2 in our Airflow Image

I like this idea and will aim to incorporate it.

@fhennig
Copy link
Member

fhennig commented Feb 20, 2023

I agree

@backstreetkiwi
Copy link
Contributor

Does that mean I can finally remove the node labels? (see unticked box in "acceptance criteria")?
paging @adwk67

@sbernauer
Copy link
Member

Just a side note: I we still rely on any nodeSelectors in any test for any operator we should probably use affinity instead - which is now supported :)

@adwk67
Copy link
Member Author

adwk67 commented Mar 21, 2023

Does that mean I can finally remove the node labels? (see unticked box in "acceptance criteria")?

Yes

@lfrancke
Copy link
Member

I just browsed through the docs and see that the "PersistentVolumeClaim" has gone missing. Has that feature been removed?

@adwk67
Copy link
Member Author

adwk67 commented Mar 22, 2023

I just browsed through the docs and see that the "PersistentVolumeClaim" has gone missing. Has that feature been removed?

Yes, it was problematic and was one of the reasons we wanted to have git-sync instead.

@lfrancke
Copy link
Member

In that case though that removal needs to be documented in the changelog

@adwk67
Copy link
Member Author

adwk67 commented Mar 24, 2023

The functionality is still there - we only removed the example/documentation. I've updated the changelog to reflect this in a separate PR.

@lfrancke
Copy link
Member

Okay, I'm not sure I do understand why we remove the docs entirely then?
Do we want to remove the functionality in the future?

@adwk67
Copy link
Member Author

adwk67 commented Mar 24, 2023

Using PVCs for this kind of customer-facing thing is problematic as it depends on what type of PVC-access (read-write-many etc.) is available on a given cloud and if RWX is not available (as is the case with Ionos, for example) then node selection needs to be used to make sure this approach works. I don't think we should give the impression that we recommend or support it. And we didn't add any functionality for this: we just documented how it could be done.

@lfrancke lfrancke added the release-note/action-required Denotes a PR that introduces potentially breaking changes that require user action. label Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
customer-request release/2023-04 release-note/action-required Denotes a PR that introduces potentially breaking changes that require user action. type/feature-new
Projects
Archived in project
6 participants