Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new board_gcs() for Google Cloud Storage #695

Merged
merged 11 commits into from
Jan 13, 2023
Merged

Conversation

juliasilge
Copy link
Member

Closes #572

@juliasilge
Copy link
Member Author

As I am working through this, I have a few questions that I hope you can help me with @MarkEdmondson1234.

The first is the logging/messaging that googleCloudStorageR does:

library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-dev")
b %>% pin_write(1:10, "small-numbers", type = "rds")
#> Creating new version '20221220T011855Z-4d1eb'
#> ℹ 2022-12-19 18:18:55 > File size detected as  189 bytes
#> 
#> ℹ 2022-12-19 18:18:55 > File size detected as  61 bytes
#> 
#> Writing to pin 'small-numbers'
b %>% pin_write(100:110, "big-numbers", type = "json")
#> Creating new version '20221220T011856Z-b89e8'
#> ℹ 2022-12-19 18:18:56 > File size detected as  187 bytes
#> 
#> ℹ 2022-12-19 18:18:56 > File size detected as  46 bytes
#> 
#> Writing to pin 'big-numbers'
b %>% pin_write(1:10, "small-numbers", type = "json")
#> Creating new version '20221220T011856Z-c3943'
#> ℹ 2022-12-19 18:18:57 > File size detected as  191 bytes
#> 
#> ℹ 2022-12-19 18:18:57 > File size detected as  23 bytes
#> 
#> Writing to pin 'small-numbers'

Created on 2022-12-19 with reprex v2.0.2

Is there a straightforward way for me to suppress this messaging? An option? I looked around but didn't find it quickly. I think it is getting too noisy, with the pins messages and the googleCloudStorageR messages together.

The second is about how I am supposed to use prefix and delimiter to get to a directory-like mode for listing objects. I can use prefix to, for example, find all the files that belong to one pin:

library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-dev")
googleCloudStorageR::gcs_list_objects(
  bucket = b$bucket,
  prefix = "small-numbers"
)
#>                                                      name      size
#> 1           small-numbers/20221220T011807Z-4d1eb/data.txt 189 bytes
#> 2  small-numbers/20221220T011807Z-4d1eb/small-numbers.rds  61 bytes
#> 3           small-numbers/20221220T011808Z-c3943/data.txt 191 bytes
#> 4 small-numbers/20221220T011808Z-c3943/small-numbers.json  23 bytes
#> 5           small-numbers/20221220T011855Z-4d1eb/data.txt 189 bytes
#> 6  small-numbers/20221220T011855Z-4d1eb/small-numbers.rds  61 bytes
#> 7           small-numbers/20221220T011856Z-c3943/data.txt 191 bytes
#> 8 small-numbers/20221220T011856Z-c3943/small-numbers.json  23 bytes
#>               updated
#> 1 2022-12-20 01:18:08
#> 2 2022-12-20 01:18:08
#> 3 2022-12-20 01:18:09
#> 4 2022-12-20 01:18:09
#> 5 2022-12-20 01:18:55
#> 6 2022-12-20 01:18:56
#> 7 2022-12-20 01:18:57
#> 8 2022-12-20 01:18:57

Created on 2022-12-19 with reprex v2.0.2

However, when I try to use delimiter = "/", that throws everything away, since they all have / in them:

library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-dev")
googleCloudStorageR::gcs_list_objects(
  bucket = b$bucket,
  prefix = "small-numbers",
  delimiter = "/"
)
#> ℹ 2022-12-19 18:22:56 > No objects found
#> data frame with 0 columns and 0 rows

Created on 2022-12-19 with reprex v2.0.2

I want to get something more like a list of directories. I can work around when I know the pin name, but I would like to get pin_list() working, largely so we can support the new workflow outlined in #692 for GCS. You can check out how that happens for Azure and AWS.

@MarkEdmondson1234
Copy link

suppressMessages() should do it for the upload feedback. If it needs more fine grain control it could be conditional on options(googleAuthR.verbose = 4) - the default is 3, 4 will suppress all messages.

The objects aren't really in folders more just names with / in them, so listing folders isn't directly supported. If folder structure is important perhaps make new buckets?

Prefix is only to filter results coming back that begin with x, so can do pseudo "list a folder" with "everything that starts with "folder/"

@juliasilge
Copy link
Member Author

@MarkEdmondson1234 it looks like googleCloudStorageR::gcs_get_object() does not respect googleAuthR.verbose:

withr::with_options(
  list(googleAuthR.verbose = 4),
  {
    googleCloudStorageR::gcs_auth("~/google-pins.json")
    googleCloudStorageR::gcs_get_object(
      object_name = "big-numbers/20221220T011808Z-b89e8/big-numbers.json",
      bucket = "pins-dev",
      saveToDisk = tempfile()
    )
  }
)
#> ℹ Downloading big-numbers/20221220T011808Z-b89e8/big-numbers.json
#> ✔ Saved big-numbers/20221220T011808Z-b89e8/big-numbers.json to /var/folders/hv/…
#> 
#> [1] TRUE

Created on 2022-12-20 with reprex v2.0.2

I believe other functions like googleCloudStorageR::gcs_upload() and googleCloudStorageR::gcs_list_objects() do respect the option. Would you like me to open an issue? I will use suppressMessages() for that one function for now.

@juliasilge
Copy link
Member Author

I'm going to keep pin_list() turned off for now, because it looks like we don't have the same support for handling the flat namespace that we do in Azure and AWS. Since we don't have pin_list(), that means we don't have the ability to make a manifest file. If we get requests for this, we can look into our options for implementing pin_list().

@juliasilge juliasilge marked this pull request as ready for review December 20, 2022 22:47
@juliasilge
Copy link
Member Author

@MarkEdmondson1234 do you have the bandwidth to do a review of this PR and see if you have any suggestions for better use of the GCS API?

R/board_gcs.R Outdated
Comment on lines 76 to 85
rlang::check_installed("sodium")
path_to_encrypted_json <- fs::path_package("pins", "secret", "pins-gcs-testing.json")
raw <- readBin(path_to_encrypted_json, "raw", file.size(path_to_encrypted_json))
pw <- Sys.getenv("PINS_GCS_PASSWORD", "")
json <- sodium::data_decrypt(
bin = raw,
key = sodium::sha256(charToRaw(pw)),
nonce = secret_nonce()
)
googleCloudStorageR::gcs_auth(json_file = rawToChar(json))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up taking the approach outlined in the gargle vignette "Managing tokens securely", and it seems to be working great!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice solution for local testing

Comment on lines +94 to +96
pin_list.pins_board_gcs <- function(board, ...) {
NA
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I discussed earlier in this PR, no pin_list() for the time being because we don't have good tooling for handling the flat namespace in a bucket. This is different from AWS and Azure. We can try coming back to this later if it is a high priority issue for folks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess to enable it this would need to download a list of all objects, then parse through it locally for those ending with /. Its doable, but could get slow if a very large (10k+) number of objects

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see how we handle that for AWS and Azure. Both of those rely on underlying functionality already available in paws.storage and AzureStor (ways to identify directory-like structure, common prefixes, etc). I don't think I want to move forward with writing that parsing code here in pins for now; maybe we can collaborate in the future on adding features like this to googleCloudStorageR if it is a high priority for users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use gcs_list_objects() plus a prefix? Because it returns all objects "recursively"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right:

library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-testing")
googleCloudStorageR::gcs_list_objects(bucket = b$bucket)
#>                                                         name      size
#> 1        big-numbers/20230110T204627Z-b89e8/big-numbers.json  46 bytes
#> 2                big-numbers/20230110T204627Z-b89e8/data.txt 187 bytes
#> 3              small-numbers/20230110T204626Z-4d1eb/data.txt 189 bytes
#> 4     small-numbers/20230110T204626Z-4d1eb/small-numbers.rds  61 bytes
#> 5              small-numbers/20230110T204628Z-c3943/data.txt 191 bytes
#> 6    small-numbers/20230110T204628Z-c3943/small-numbers.json  23 bytes
#> 7            test-kcH7P3OJjX/20230110T205041Z-6e4eb/data.txt 192 bytes
#> 8 test-kcH7P3OJjX/20230110T205041Z-6e4eb/test-kcH7P3OJjX.rds  44 bytes
#>               updated
#> 1 2023-01-10 20:46:28
#> 2 2023-01-10 20:46:28
#> 3 2023-01-10 20:46:26
#> 4 2023-01-10 20:46:27
#> 5 2023-01-10 20:46:29
#> 6 2023-01-10 20:46:29
#> 7 2023-01-10 20:50:41
#> 8 2023-01-10 20:50:41
googleCloudStorageR::gcs_list_objects(bucket = b$bucket, prefix = "big-numbers")
#>                                                  name      size
#> 1 big-numbers/20230110T204627Z-b89e8/big-numbers.json  46 bytes
#> 2         big-numbers/20230110T204627Z-b89e8/data.txt 187 bytes
#>               updated
#> 1 2023-01-10 20:46:28
#> 2 2023-01-10 20:46:28

Created on 2023-01-12 with reprex v2.0.2

In the packages for AWS and Azure, there is existing support for identifying directory-like structures or to find "common prefixes" but googleCloudStorageR doesn't have that as of today.

Comment on lines +120 to +121
paths <- fs::path_split(unique(fs::path_dir(resp$name)))
version_from_path(map_chr(paths, ~ .x[[length(.x)]]))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I did to deal with the flat namespace situation; suggestions welcome for a nicer option!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the concept of object versioning in GCS, does this match what pin_versions does? https://cloud.google.com/storage/docs/object-versioning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, one of the ideas of pins is to provide versioning, even if the data practitioner is not allowed to turn on GCS versioning because of, say, the bucket retention policy.

gcs_download <- function(board, key) {
path <- fs::path(board$cache, key)
if (!fs::file_exists(path)) {
suppressMessages(googleCloudStorageR::gcs_get_object(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one function doesn't seem to respect googleAuthR.verbose so for now we'll use suppressMessages().

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess its the use of the cli:: message bar - yes please open an issue as should be a quick fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path
}

gcs_file_exists <- function(board, name) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do this via gcs_get_object("obj_name", meta=TRUE) and then tryCatch(http_404 = FALSE) - will be much quicker and direct only inspecting HTTP headers and not content. Custom error types are supported in the HTTP responses to help this workflow. Listing objects could be slow for large amount of pins

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you're saying! Hmmm...

It's possible this function could have a better name, but it's not always looking for a specific object; sometimes it is looking for a directory-like thing. Given the information at hand when calling, for example, pin_exists(), I think we do need to list the objects in the bucket rather than get a specific object.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

R/board_gcs.R Outdated Show resolved Hide resolved
R/board_gcs.R Outdated Show resolved Hide resolved
R/board_gcs.R Outdated Show resolved Hide resolved
R/board_gcs.R Outdated Show resolved Hide resolved
R/board_gcs.R Outdated Show resolved Hide resolved
R/board_gcs.R Outdated Show resolved Hide resolved
Comment on lines +94 to +96
pin_list.pins_board_gcs <- function(board, ...) {
NA
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use gcs_list_objects() plus a prefix? Because it returns all objects "recursively"?

prefix = paste0(board$prefix, dir, "/")
)

if (nrow(resp) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably not needed unless when nrow() is a 0, there's no name column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that is the situation:

library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-testing")
resp <- googleCloudStorageR::gcs_list_objects(
  bucket = b$bucket,
  prefix = paste0(b$prefix, "cats-and-dogs", "/")
)
#> ℹ 2023-01-12 10:50:27 > No objects found
resp
#> data frame with 0 columns and 0 rows

Created on 2023-01-12 with reprex v2.0.2

R/board_gcs.R Outdated Show resolved Hide resolved
@juliasilge juliasilge merged commit 08d7dad into main Jan 13, 2023
@juliasilge juliasilge deleted the google-cloud-storage branch January 13, 2023 19:18
@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement modern google cloud storage board
3 participants