add google cloud storage funcs #722
Conversation
Thanks so much for this PR, Mark! This is a really strong start. I am still getting set up with GCP and working through security, but I have some comments in the meantime.
As part of this PR, please feel free to add yourself as a contributor in the DESCRIPTION file.
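For example, the entry would just be a person() record in the Authors@R field of DESCRIPTION (the role shown below is illustrative, not prescriptive):

person("Mark", "Edmondson", role = "ctb")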
I am having trouble getting set up with GCS. Apparently something is wrong with the service account I just created?

> library(googleCloudStorageR)
> gcs_setup()
ℹ ==Welcome to googleCloudStorageR v0.6.0 setup==
This wizard will scan your system for setup options and help you with any that are missing.
Hit 0 or ESC to cancel.
1: Create and download JSON service account key
2: Setup auto-authentication (JSON service account key)
3: Setup default bucket
Selection: 2
───────────────────────────────────────────────────────────────────────
Do you want to configure for all R sessions or just this project?
1: All R sessions (Recommended)
2: Project only
Selection: 1
───────────────────────────────────────────────────────────────────────
x No environment argument detected: GCS_AUTH_FILE
✓ Validated Client ID file /home/landau/.gcp/client_secret_CENSORED.apps.googleusercontent.com.json
✓ Found Client ID project: CENSORED
ℹ Using Client ID via GAR_CLIENT_JSON=/home/landau/.gcp/client_secret_CENSORED.apps.googleusercontent.com.json
───────────────────────────────────────────────────────────────────────
Do you want to provision a service account for your project?
1: Yes, I need a service account key
2: No, I already have one downloaded
Selection: 1
ℹ No roles specified to configure for service key
ℹ Creating service key file - choose service account name (Push enter for default 'googleauthr')
service account name: targets-test-service-account
ℹ Creating service account targets-test-service-account
✓ Setting client.id from /home/landau/.gcp/client_secret_CENSORED.apps.googleusercontent.com.json
ℹ 2021-12-12 21:31:54 > Request Status Code: 403
Error in value[[3L]](cond) :
API returned: Permission iam.serviceAccounts.get is required to perform this operation on service account projects/CENSORED/serviceAccounts/targets-test-service-account@CENSORED.iam.gserviceaccount.com.

But the "principal" for this service account has the "Custom Service Account Admin" role, which I previously created with permissions like …
Could you try it with the GitHub version?

> gcs_setup()
ℹ ==Welcome to googleCloudStorageR v0.6.0.9000 setup==
This wizard will scan your system for setup options and help you with any that are missing.
Hit 0 or ESC to cancel.
1: Create and download JSON service account key
2: Setup auto-authentication (JSON service account key)
3: Setup default bucket
Selection: 1
───────────────────────────────────────────────────────────────────────────────────────────────────────
ℹ Using local project .Renviron
✓ Validated Client ID file /Users/mark/auth/clients/project_id-desktop.json
✓ Found Client ID project: project_id
ℹ Using Client ID via GAR_CLIENT_JSON=/Users/mark/dev/auth/clients/project_id-desktop.json
───────────────────────────────────────────────────────────────────────────────────────────────────────
Do you want to provision a service account for your project?
1: Yes, I need a service account key
2: No, I already have one downloaded
Selection: 1
ℹ Creating service key file - choose service account name (Push enter for default 'googlecloudstorager')
service account name: targets-test-service-account
ℹ Creating service account targets-test-service-account
✓ Setting client.id from /Users/mark/auth/clients/project_id-desktop.json
Waiting for authentication in browser...
Press Esc/Ctrl + C to abort
Authentication complete.
ℹ 2021-12-13 08:34:59 > Request Status Code: 404
ℹ 2021-12-13 08:34:59 > Creating new service account: targets-test-service-account@project_id.iam.gserviceaccount.com
ℹ 2021-12-13 08:34:59 > Creating service accountId - targets-test-service-account
ℹ 2021-12-13 08:35:00 > Checking existing roles
ℹ 2021-12-13 08:35:00 > Granting roles: roles/storage.admin to accountIds: targets-test-service-account@project_id.iam.gserviceaccount.com
ℹ 2021-12-13 08:35:01 > Creating secret auth key for service account targets-test-service-account for project project_id
ℹ 2021-12-13 08:35:02 > Writing secret auth JSON key to googlecloudstorager-auth-key.json and adding to .gitignore
✓ Move googlecloudstorager-auth-key.json to a secure folder location outside of your working directory
Have you moved the file?
1: No
2: No way
3: Absolutely
Selection:
Unfortunately, I ran into the same 403 error with the development version.
Did you see the help video? That may help with some obscure but critical detail the docs may have left out ;)
Thanks Mark, that help video is fantastic! I think I am mostly set up now, except for a couple of failing tests:

> googleCloudRunner::cr_setup_test()
ℹ Perform deployments to test your setup is working. Takes around 5mins. ESC or 0 to skip.
✓ Successfully auto-authenticated via /home/landau/.gcp/targets-development-svc-acct-auth-key.json
✓ Validated authentication in GCE_AUTH_FILE
Select which deployments to test
1: All tests
2: Cloud Build - Docker
3: Cloud Run - plumber API with Pub/Sub
4: Cloud Build - R script
5: Cloud Scheduler - R script
Selection: 1
ℹ Attempting Docker deployment on Cloud Build via cr_deploy_docker()
2021-12-15 09:39:40 -- No objects found
ℹ 2021-12-15 09:39:40 > Dockerfile found in /home/landau/R/R-4.1.2/library/googleCloudRunner/example/
── #Deploy docker build for image: gcr.io/targets-development/example ──────────────────────────────────────────────────────────
── #Upload /home/landau/R/R-4.1.2/library/googleCloudRunner/example/ to gs://targets-development-bucket/example.tar.gz ───────
ℹ 2021-12-15 09:39:40 > Uploading example.tar.gz to targets-development-bucket/example.tar.gz
2021-12-15 09:39:40 -- File size detected as 885 bytes
ℹ 2021-12-15 09:39:42 > Cloud Build started - logs:
https://console.cloud.google.com/cloud-build/builds/1dd63d49-529d-4f1a-a670-c08c216e604c?project=756889153822
ℹ 2021-12-15 09:39:42 > Waiting for Cloud Build...
(|) Build time: [00:00:57] ( 2% of timeout: 600s)
ℹ 2021-12-15 09:40:45 > Build finished with status: FAILURE
ℹ 2021-12-15 09:40:45 > gcr.io/targets-development/example:latest and gcr.io/targets-development/example:$BUILD_ID
x Something is wrong with Cloud Build setup
ℹ Attempting deployment of plumber API on Cloud Run via cr_deploy_plumber()
ℹ 2021-12-15 09:40:45 > Uploading /home/landau/R/R-4.1.2/library/googleCloudRunner/example/ folder for Cloud Run
ℹ 2021-12-15 09:40:45 > Dockerfile found in /home/landau/R/R-4.1.2/library/googleCloudRunner/example/
── #Deploy docker build for image: gcr.io/targets-development/example ──────────────────────────────────────────────────────────
── #Upload /home/landau/R/R-4.1.2/library/googleCloudRunner/example/ to gs://targets-development-bucket/example.tar.gz ───────
ℹ 2021-12-15 09:40:45 > Uploading example.tar.gz to targets-development-bucket/example.tar.gz
2021-12-15 09:40:45 -- File size detected as 884 bytes
ℹ 2021-12-15 09:40:46 > Cloud Build started - logs:
https://console.cloud.google.com/cloud-build/builds/5a5468f3-4e57-4637-8f72-992f09ce7867?project=756889153822
ℹ 2021-12-15 09:40:46 > Waiting for Cloud Build...
(-) Build time: [00:00:46] ( 2% of timeout: 600s)
ℹ 2021-12-15 09:41:37 > Build finished with status: FAILURE
ℹ 2021-12-15 09:41:37 > gcr.io/targets-development/example:$BUILD_ID
ℹ 2021-12-15 09:41:37 > Error building Dockerfile
x Something is wrong with Cloud Run setup
ℹ Testing Cloud Build R scripts deployments via cr_deploy_r()
ℹ 2021-12-15 09:41:38 > Deploy R script cr_rscript_2021121639579298094138 to Cloud Build
ℹ 2021-12-15 09:41:38 > Cloud Build started - logs:
https://console.cloud.google.com/cloud-build/builds/38802327-9117-41d1-8f88-2eedaddf928f?project=756889153822
ℹ 2021-12-15 09:41:38 > Waiting for Cloud Build...
(\) Build time: [00:01:11] ( 2% of timeout: 600s)
ℹ 2021-12-15 09:42:55 > Build finished with status: SUCCESS
✓ Cloud Build R scripts deployed successfully
ℹ Testing scheduling R script deployments via cr_deploy_r(schedule = '* * * * *')
ℹ 2021-12-15 09:42:55 > Deploy R script cr_rscript_2021121639579375094255 to Cloud Build
ℹ 2021-12-15 09:42:55 > Scheduling R script on cron schedule: 15 21 * * *
✓ Scheduled Cloud Build R scripts deployed successfully
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Test summary ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
ℹ Something is wrong with Cloud Build setup
ℹ Something is wrong with Cloud Run setup
ℹ Cloud Build R scripts deployed successfully
ℹ Scheduled Cloud Build R scripts deployed successfully
✓ Deployment tests complete! Docker test build log:
Cloud Run Plumber build log:
R/utils_gcp.R (outdated)

  bucket = bucket,
  version = version
),
error = function(condition) {
Is there an error condition that would allow gcp_gcs_exists() to tell if the object really does not exist (as opposed to a configuration error etc.)? When testing locally, gcp_gcs_exists() at first silently and incorrectly returned FALSE because I had the wrong version of googleCloudStorageR installed and gcs_get_object() could not accept a generation argument. Detecting HTTP 400 errors has been helpful for AWS (line 15 in e144bdb):

http_400 = function(condition) {
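To make that concrete, here is a minimal sketch (not the code in this PR) of the GCS analogue: treat an HTTP 404 condition as "object does not exist" and let every other error propagate. gcp_gcs_head_true() stands in for whatever helper performs the metadata request.

gcp_gcs_exists_sketch <- function(key, bucket, version = NULL) {
  tryCatch(
    # Metadata-only lookup; auth or configuration errors still stop() loudly.
    gcp_gcs_head_true(key = key, bucket = bucket, version = version),
    http_404 = function(condition) FALSE
  )
}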
rlang::abort() has a class argument to give custom classes to errors, which can then be detected with tryCatch().
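Purely as an illustration (not code from this PR), a 404 could be signalled and caught like this:

classify_404 <- function() {
  # Attach a custom condition class so callers can branch on it.
  rlang::abort("API returned: No such object", class = "http_404")
}
tryCatch(classify_404(), http_404 = function(condition) FALSE)
#> [1] FALSE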
Just filed cloudyr/googleCloudStorageR#153.
I was able to run all the tests, and they pass. This PR looks great, I will merge when all the specific comments are addressed.
Great! The errors you got in set-up from …
That's odd, I cannot reproduce the lint either. But those trailing whitespaces could actually be in the call to …
There is also a lint in …
Yes, it was whitespace in the DESCRIPTION; merged main in as well.
Hi @wlandau, happy new year :) How is this looking now? I have worked on some nice DAG functions in the meantime which I think will be very useful - at the moment it uploads/downloads everything before and after builds, but if the GCS native storage were enabled it would speed things up a lot when big files are involved. I'm a little stuck when trying to enable only individual target steps to build - I have not quite got my head around how to handle the state between steps, but that is an edge case anyhow, I think. Here is example output from a workflow I have for another project now:
library(googleCloudRunner)
bs <- cr_buildstep_targets_multi(
last_id = "bigquery", # signals which step to download artifacts from
task_image = "gcr.io/xxx/xact-api:master",
task_args = list(secretEnv = "XACT_PW"))
cr_build_targets(
bs,
options = list(env = c("XACT_ORGANIZATIONID=1234",
"XACT_USER=USERNAME")),
execute = "now", # runs it immediatly vs make a file to run on CI/CD
availableSecrets = cr_build_yaml_secrets("XACT_PW","xact-pw")
)

The builds can run concurrently, so it is a big speed up for some pipelines; it costs nothing to run, no cloud servers to set up, etc. Build log:

ℹ 2022-01-08 21:38:40 > targets cloud location: gs://xxxx/xact-api
ℹ 2022-01-08 21:38:40 > Resolving targets::tar_manifest()
── # Building DAG: ─────────────────────────────────────────────────────────────────────────────
• [get previous _targets metadata] -> [lookup_sub_file]
• [] -> [cmd_args]
• [] -> [sx_mappings_file]
• [] -> [surveyid_file]
• [lookup_sub_file] -> [lookup_sub]
• [sx_mappings_file] -> [sx_mappings]
• [cmd_args, surveyid_file] -> [surveyIds]
• [surveyIds] -> [survey_as]
• [surveyIds] -> [lookup_id]
• [surveyIds] -> [survey_qs]
• [survey_as] -> [a_audit]
• [lookup_id, lookup_sub, survey_as, sx_mappings] -> [processed]
• [lookup_id, survey_qs, surveyIds] -> [q_audit]
• [cmd_args, processed] -> [bigquery]
• [q_audit] -> [q_audit_file]
• [bigquery] -> [ Upload Artifacts ]
....
ℹ 2022-01-08 21:38:47 > File size detected as 5.6 Mb
ℹ 2022-01-08 21:38:48 > Running Cloud Build for targets workflow in /Users/mark/dev/xxxx/xact-api
ℹ Cloud Build started - logs:
<https://console.cloud.google.com/cloud-build/builds/xxxx-4ddf-4186-859a-f46db1e65e03?project=278775085929>
✓ Build finished with status: SUCCESS and took ~[01m44s]
ℹ 2022-01-08 21:40:48 > Downloading to download_folder: /Users/mark/dev/xxx/_targets
✓ Saved xact-api/_targets/buildtime.txt to _targets/buildtime.txt ( 29 bytes )
✓ Saved xact-api/_targets/meta/meta to _targets/meta/meta ( 3.5 Kb )
✓ Saved xact-api/_targets/meta/process to _targets/meta/process ( 56 bytes )
✓ Saved xact-api/_targets/meta/progress to _targets/meta/progress ( 518 bytes )
✓ Saved xact-api/_targets/objects/a_audit to _targets/objects/a_audit ( 971 bytes )
✓ Saved xact-api/_targets/objects/bigquery to _targets/objects/bigquery ( 4.6 Kb )
✓ Saved xact-api/_targets/objects/cmd_args to _targets/objects/cmd_args ( 46 bytes )
✓ Saved xact-api/_targets/objects/lookup_id to _targets/objects/lookup_id ( 200 bytes )
✓ Saved xact-api/_targets/objects/lookup_sub to _targets/objects/lookup_sub ( 516 bytes )
✓ Saved xact-api/_targets/objects/processed to _targets/objects/processed ( 4.6 Kb )
✓ Saved xact-api/_targets/objects/q_audit to _targets/objects/q_audit ( 5.7 Kb )
✓ Saved xact-api/_targets/objects/q_audit_file to _targets/objects/q_audit_file ( 44 bytes )
✓ Saved xact-api/_targets/objects/surveyIds to _targets/objects/surveyIds ( 223 bytes )
✓ Saved xact-api/_targets/objects/survey_as to _targets/objects/survey_as ( 346.8 Kb )
✓ Saved xact-api/_targets/objects/survey_qs to _targets/objects/survey_qs ( 21.1 Kb )
✓ Saved xact-api/_targets/objects/sx_mappings to _targets/objects/sx_mappings ( 1.2 Kb )
── # Built targets on Cloud Build with status: SUCCESS ─────────────────────────────────────────
ℹ 2022-01-08 21:40:51 > Build artifacts downloaded to /Users/mark/dev/xxx/_targets
You too, Mark!
Almost there. The only remaining issue is https://github.com/ropensci/targets/pull/722/files#r769782900. I think it is important to somehow avoid false negatives in gcp_gcs_exists().
Nice!
Is the roadblock in …?
That's really cool! At some point, I think …
The custom http error is in now, I can move it to …
I couldn't get my head around how one would create a script that would just call CB within it but keep state. Perhaps shortcutting is what is needed - e.g. if someone wanted to run just one target on CB, so they have something like the sketch below. How would you replace the processing step?

process_cb <- function(big_data){
# something?
}
list(
tar_target(input, "file1", ...),
tar_target(process_cb, process(input)),
tar_target(output, write.csv(process))
)

or perhaps this is what …
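For what it is worth, a rough and untested sketch of what the body of such a function might do with googleCloudRunner - the image name, inline script, and state hand-off are all assumptions, and argument details may differ:

process_cb <- function(big_data) {
  # One build step that runs the heavy work inside a container on Cloud Build.
  step <- googleCloudRunner::cr_buildstep_r(
    "out <- process(readRDS('big_data.rds')); saveRDS(out, 'out.rds')",
    name = "gcr.io/my-project/my-image"  # hypothetical image with dependencies installed
  )
  build <- googleCloudRunner::cr_build(googleCloudRunner::cr_build_yaml(steps = step))
  googleCloudRunner::cr_build_wait(build)
  # The open question in this thread is the state hand-off: big_data would need
  # uploading first, and out.rds fetching back (e.g. via GCS) before returning it.
}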
Awesome!
Whatever you think is best, either seems like a good final resting place.
Yeah, one conceivable approach is for user code inside a target to call CB/CR.
But it has a couple of performance disadvantages:
The internal orchestration in …
I just tried 8204e42, and I am getting issues with the following:

bucket <- random_bucket_name()
projectId <- Sys.getenv("GCE_DEFAULT_PROJECT_ID")
googleCloudStorageR::gcs_create_bucket(bucket, projectId = projectId)
gcp_gcs_exists(key = "x", bucket = bucket) # expected FALSE with no error
#> Error: API returned: No such object: targets-test-bucket-22e53e537837ffab7e8f5e21a7bbd90da937ade61ea/x
traceback()
#> 17: stop("API returned: ", error_message, call. = FALSE)
#> 16: checkGoogleAPIError(req)
#> 15: doHttrRequest(req_url, request_type = http_header, the_body = the_body,
#> customConfig = customConfig, simplifyVector = simplifyVector)
#> 14: ob()
#> 13: googleCloudStorageR::gcs_get_object(key, bucket = bucket, meta = TRUE,
#> generation = version)
#> 12: withCallingHandlers(expr, message = function(c) if (inherits(c,
#> classes)) tryInvokeRestart("muffleMessage"))
#> 11: loud(googleCloudStorageR::gcs_get_object(key, bucket = bucket,
#> meta = TRUE, generation = version)) at utils_gcp.R#36
#> 10: gcp_gcs_head(key = key, bucket = bucket, version = version) at utils_gcp.R#51
#> 9: gcp_gcs_head_true(key = key, bucket = bucket, version = version)
#> 8: doTryCatch(return(expr), name, parentenv, handler)
#> 7: tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 6: tryCatchList(expr, classes, parentenv, handlers)
#> 5: tryCatch(gcp_gcs_head_true(key = key, bucket = bucket, version = version),
#> http_404 = function(condition) {
#> FALSE
#> }) at utils_gcp.R#12
#> 4: gcp_gcs_exists(key = "x", bucket = bucket)
#> 3: eval_bare(expr, quo_get_env(quo))
#> 2: quasi_label(enquo(object), label, arg = "object")
#> 1: expect_false(gcp_gcs_exists(key = "x", bucket = bucket))
It turned out it needed to be on …

tryCatch(gcs_get_object("blah", meta = TRUE), http_404 = function(x) FALSE)
ℹ 2022-01-10 21:10:29 > Request Status Code: 404
x Downloading blah ... failed
[1] FALSE

It should work now if the latest GitHub version of googleAuthR is installed?
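(For anyone following along, the development version can be installed from GitHub; the repository path below is the package's usual home rather than something stated in this thread.)

remotes::install_github("MarkEdmondson1234/googleAuthR")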
Awesome, works on my end now! Merging.
Woohoo, thanks!
Prework
Related GitHub issues and pull requests
Summary
Add functions for Google Cloud Storage versioning of objects. I copied across the AWS functions and modified them for GCS. All tests are passing locally, but to replicate you will need a GCS account set up.
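For context, a rough sketch (not the exact code in the PR) of the versioned metadata lookup the new helpers build on top of googleCloudStorageR, mirroring the call visible in the traceback earlier in the thread - object versions in GCS are addressed through the generation argument:

gcs_head_sketch <- function(key, bucket, version = NULL) {
  googleCloudStorageR::gcs_get_object(
    key,
    bucket = bucket,
    meta = TRUE,          # metadata only; do not download the object body
    generation = version  # a specific object version ("generation") in GCS
  )
}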
I also found a skip that was missing from the clustermq tests.
I wasn't sure where to put the test files because they didn't run in the testthat root folder. The skip will ensure the suite still passes where they are, but let me know if you would prefer the files to be copied in only when testing locally. I put them in both locations for now.