Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support fetching beam args at runtime from env vars. Fixes issue #3326 #3455

Closed
wants to merge 7 commits into from
Closed

Support fetching beam args at runtime from env vars. Fixes issue #3326 #3455

wants to merge 7 commits into from

Conversation

ConverJens
Copy link
Contributor

Currently, TFX only support passing Beam args as static values defined at pipeline compile time. This is undesirable since one Beam needs to pass credentials to in order to use certain file systems, e.g. S3.

This PR allows Beam args to be populated from environment variables, which in turn can be populated by k8s secrets, and hence leaves no credentials visible in code nor pipeline definition yaml files.

This PR fixes the issues for beam args highlighted in: #3326

@google-cla google-cla bot added the cla: yes label Mar 25, 2021
@ConverJens ConverJens changed the title Support fetching beam args at runtime from env vars. Fixes 3326 Support fetching beam args at runtime from env vars. Fixes issue #3326 Mar 25, 2021
@ConverJens
Copy link
Contributor Author

@kennethyang404 Build is now passing except for pre-existing lint error and one new lint error in the same file due to following the same convention.

@ConverJens
Copy link
Contributor Author

@kennethyang404 It would be awesome if there is enough time to get this into release 0.29.0!

@ConverJens
Copy link
Contributor Author

@kennethyang404 Do you have time to review this or can you pass it to someone else? Along with #3541 this eliminates all clear text password leaks.

@zhitaoli
Copy link
Contributor

I think a problem here is that the additional_pipeline_args cannot be overridden on each component, therefore this would force all components to use the same beam pipeline args, which conflicts with the intent in be20d08

A better, but possibly much more involved way to do this is:

  • support values from SECRET ENVs as place holders;
  • support placeholders in beam_pipeline_args.

What do you think, @kennethyang404 and @ConverJens ?

@ConverJens
Copy link
Contributor Author

@kennethyang404 @zhitaoli
I implemented this in the way that I saw would introduce the least amount of changes to the existing code.

I must admit that I don't understand how placeholders work in this context nor how the implementation would look like.

Would another option be to add another argument to the new beam_base_component? Something like beam_pipeline_args_from_env or similar?

The current implementation uses basically this logic but instead of introducing this new parameter, a specific keyword passed to additional_beam_args was choose.

What do you think?

@ConverJens
Copy link
Contributor Author

@neuromage @1025KB @jiyongjung0 WDYT? Merge or reimplement after the 0.30.0 release? I'm open for both but this is a blocker for us taking TFX into production since it leaks credentials.

@jiyongjung0
Copy link

IIUC, per-component beam_pipeline_arg of beam_base_component doesn't work with Kubeflow yet because base_beam_component only works with TFX IR and kubeflow support for TFX IR is WIP.
I also think that it would be best if the suggestion from Zhitao is implemented but it might be better to wait a bit until TFX IR support lands because the related codes will change significantly.

@ConverJens
Copy link
Contributor Author

@jiyongjung0 That seems reasonable. When is TFX IR expected to land? Next release, Q2 etc?
Maybe we can keep this PR open and once TFX IR support is released, we can have a discussion on the best way to implement this functionality.
WDYT?

@jiyongjung0
Copy link

TFX IR support of KubeflowDagRunner is in #3447 (It is hard to read due to lack of rebase.). So it won't take months I hope.

Your suggestion sounds good to me. 😄

@ConverJens
Copy link
Contributor Author

Sound promising and looking forward to it!

@github-actions
Copy link
Contributor

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the stale label May 30, 2021
@github-actions github-actions bot closed this Jun 5, 2021
@ConverJens
Copy link
Contributor Author

@zhitaoli @jiyongjung0 @kennethyang404
We have now migrated to TFX 1.0.0rc1 and I will be creating a new PR and make an attempt to implement the solution proposed by @zhitaoli, as it seems like a good and flexible way forward.

@jiyongjung0
Copy link

FYI, although #3447 was merged, it is not included in TFX 1.0.0 release branch. It will be included in 1.1.0 (or whatever version after 1.0.0). Thank you so much for the contribution!

@ConverJens
Copy link
Contributor Author

@jiyongjung0 Thank you for the information! As far as I can see, in TFX 1.0.0rc1 there is an per-component-option of adding beam args but you're saying that this doesn't work/has no effect if run on KubeFlow, correct?

Even if that's the case, hopefully it will be fine implementing placeholder support in beam args, even though they are on pipeline level.
WDYT?

@ConverJens
Copy link
Contributor Author

ConverJens commented Aug 16, 2021

@jiyongjung0 @zhitaoli
I have now started working on the PR for this and I'm basing it off the 1.2.0rc0 tag.

However, I'm unfamiliar with both the new TFX IR as well as placeholders. Do you have any ideas or pointer on how you would like to see this implemented? More specifically:

  1. how to add env var support for placeholders (in the case of k8s secrets, the user should be responsible for mounting these as the appropriate env var, e.g. using pipeline_operator_funcs)
  2. how to allow beam args to accept placeholders

A discussion around this would be very appreciated!

@ConverJens
Copy link
Contributor Author

I opened a new PR, perhaps we can continue the discussion there? #4147

@jiyongjung0
Copy link

@kennethyang404 , could you take a look on this PR again? I'm not familiar with the placeholder stuff...

@ConverJens
Copy link
Contributor Author

@jiyongjung0 @kennethyang404 Could you please check the other PR again? It was approved but the build failed since it timed out on pips dependency resolving. To get your attention I requested another review and not it looks as it was never approved.

Could you just kindly approve it again and give it more build time so it can finish and it can be merged? This has been going on for quite a while now. It would be really great if we can finally close it.

@ruoyu90
Copy link
Contributor

ruoyu90 commented Dec 10, 2021

@ConverJens curious about the value overwrite strategy upon conflicts: why do we choose to deprioritize the later-bound value (environment var)? Environment vars can be changed more flexibly and can serve as the same functionality as "flags". WDYT?

@ConverJens
Copy link
Contributor Author

@ruoyu90 Good point!

I chose this strategy since I deemed it the least risk of impacting existing workflows and that probably caused the least amount of confusion.

I agree that it might be more natural to allow env vars to take precedence.

Note that this PR is closed but a link to the new PR is referred to above and I think we settled on rasing an error in case of conflicting arguments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants