Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New force_identical_write arg #735

Merged
merged 12 commits into from
May 5, 2023
Merged

New force_identical_write arg #735

merged 12 commits into from
May 5, 2023

Conversation

juliasilge
Copy link
Member

Addresses #589

This does not solve all our hashing issues (for example, the pin_hash still only looks at the pin contents and not the metadata) but it is a small step toward a better hashing situation.

After talking with users, I think that an opt-in new argument is the best way forward (rather than, say, a new function or new default behavior). This new check_hash arg comes after the dots and defaults to FALSE. It does require another read of pin_meta() to get the old hash.

library(pins)
b <- board_temp(versioned = TRUE)
b %>% pin_write(1:10, "nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Creating new version '20230426T224520Z-7f884'
#> Writing to pin 'nice-numbers'

## to change timestamp:
Sys.sleep(2)

b %>% pin_write(1:10, "nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Warning: The hash of pin "nice-numbers" has not changed and will not be stored.

Created on 2023-04-26 with reprex v2.0.2

With this PR as is now, pin_write() will still return the name invisibly but does not store the pin and generates a warning. Thoughts?

@kmasiello
Copy link

Tested and works as expected.
If I'm a user, I can work with approach well, though I think the documentation should be explicit in noting that the hash is not affected by changed metadata. I like that it's an opt-in argument, which means I will be reading the docs to understand what I'm getting into.
There might be situations where the user would want the status (pinned or didn't pin) to be returned in addition to the pin name. Every use case I envision though, I can noodle through how I'd determine the pinning status by other means. I think it would be wise to keep our ears open as to whether this is a need from the community that we'd implement in the future.

@juliasilge
Copy link
Member Author

juliasilge commented May 1, 2023

Thank you so much @kmasiello! 🙌

There might be situations where the user would want the status (pinned or didn't pin) to be returned in addition to the pin name.

This is a good point. Right now, pin_write() only returns the name (like "julia.silge/nice-numbers") but we have been discussing in #592 how to better organize what should be returned. I'll add this info to that issue and we can see what would work best.

Do we think that now in this PR pin_write() should return something different than the name if it does not update the pin? Or should we wait for broader changes to what is returned?

cli::cli_warn(
"The hash of pin {.val {name}} has not changed and will not be stored."
)
return(invisible(name))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could return something different here, like NULL maybe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's correct; it should return what pin_write() usually returns.

@juliasilge
Copy link
Member Author

juliasilge commented May 1, 2023

Refined the warning message a smidge:

library(pins)
b <- board_connect()
#> Connecting to Posit Connect 2023.03.0 at <https://colorado.posit.co/rsc>
b %>% pin_write(1:10, "julia.silge/really-nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/really-nice-numbers'
b %>% pin_write(1:10, "julia.silge/really-nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Warning: 
#> ! The hash of pin "julia.silge/really-nice-numbers" has not changed.
#> • Your pin will not be stored.
b %>% pin_write(1:11, "julia.silge/really-nice-numbers", check_hash = TRUE)
#> Guessing `type = 'rds'`
#> Writing to pin 'julia.silge/really-nice-numbers'

Created on 2023-05-01 with reprex v2.0.2

@juliasilge juliasilge marked this pull request as ready for review May 1, 2023 18:34
@juliasilge juliasilge requested a review from hadley May 1, 2023 18:36
@kmasiello
Copy link

Do we think that now in this PR pin_write() should return something different than the name if it does not update the pin? Or should we wait for broader changes to what is returned?

My gut says this should be handled with the broader considerations of what we return. I don't want to see a bespoke solution here that isn't compatible with future work.

@kmasiello
Copy link

Additional suggestion and comments after reviewing with users -
check_hash may be better named as check_data_hash to emphasize that metadata is not checked.

Checking metadata for changes is an outstanding ask. A sample scenario is using check_hash = TRUE but wanting to update the metadata to reflect a "current as of" date. You would want to be able to indicate the pin is still fresh, even if the data hasn't changed.

@juliasilge
Copy link
Member Author

A sample scenario is using check_hash = TRUE but wanting to update the metadata to reflect a "current as of" date.

Can I make sure I am understanding this? It sounds like users are expressing a desire to:

  • not fill up their disk space with copies of the same file, and at the same time
  • have the metadata reflect the pin's freshness, like the last time a write was attempted (info more than just "latest")

This would require a more extensive change than in this PR, will be pretty tough in the near term, and likely would require special handling of different backends (S3 vs. Connect vs. folders vs. ...). As of now, updating the metadata but not the pin contents isn't something straightforward to implement. We could consider how to approach this problem in the future.

I tend to think we should stick with check_hash as the argument name since we will likely think through checking portions of the metadata as well (title, description, tags, and user metadata could all currently change without changing the pin hash).

I tend to think we should still go with the change as scoped here because it lets us offer a solution to the storing-many-copies problem. It does mean folks have to choose between having the metadata reflect the last time the pin contents really changed and the last time a write was attempted; this still seems better than the situation as it is now.

@kmasiello
Copy link

Agreed. As currently scoped, this helps the main use case tremendously.
If I'm very keen to only store the changed data frame but have a record of the last write attempt, I can bake this logging info into my upstream ETL process.
Similar to the other comment about returning the "wrote / didn't write" status, it will be good to keep ears open to use cases around metadata changes.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems so simple — I love it 😄

I also refreshed my memory of pin_hash(), and I'm confident it's ok to be used in this way. There is some chance that you might get the same hash value with different data values, but it's extremely small. (Or maybe this is better: if you hashed a dataset every day for two years, you're more likely to get hit by a meteor than get a hash collision.)

R/pin-read-write.R Outdated Show resolved Hide resolved
R/pin-read-write.R Outdated Show resolved Hide resolved
@@ -111,6 +116,17 @@ pin_write <- function(board, x,
)
meta$user <- metadata

if (check_hash) {
old_hash <- possibly_pin_meta(board, name)$pin_hash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we think this might fail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to let someone use check_hash = TRUE when they write a pin for the first time, before there is any metadata. old_hash returns as NULL here when there isn't any metadata.

Copy link
Member

@hadley hadley May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I think I'd prefer an explicit pin_exists() check? Is there some reason not to do that?

cli::cli_warn(
"The hash of pin {.val {name}} has not changed and will not be stored."
)
return(invisible(name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's correct; it should return what pin_write() usually returns.

@@ -64,6 +64,10 @@ pin_read <- function(board, name, version = NULL, hash = NULL, ...) {
#' use the default for `board`
#' @param tags A character vector of tags for the pin; most important for
#' discoverability on shared boards.
#' @param check_hash Check whether the pin contents are identical to the last
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider defaulting this to TRUE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's try it. A release is still a little ways out so we can see if it is a problem in an expected way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh actually @hadley the other reason I had not set it to TRUE originally is it adds one additional call to pin_meta(). Do we think this is OK? All the API calls can be especially painful on Connect, but maybe adding yet one more set isn't much marginal difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, I think it's probably still worth it. It definitely saves you a lot of time/space in the case where the data hasn't changed. We'll definitely need to promote that change heavily.

NEWS.md Outdated Show resolved Hide resolved
@@ -64,6 +64,10 @@ pin_read <- function(board, name, version = NULL, hash = NULL, ...) {
#' use the default for `board`
#' @param tags A character vector of tags for the pin; most important for
#' discoverability on shared boards.
#' @param check_hash Check whether the pin contents are identical to the last
#' version if one exists (using the hash), and then **do not store** the pin
#' again. This argument does not check the pin metadata, only the pin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording isn't quite right because the argument isn't checking anything. But I couldn't think of better wording off the top of my head.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I change the whole arg to compare_hash?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would actually be better because there is an internal function already called check_hash(). Take a look and see if you have feedback on the new name/docs.

pin_write(b, 1:5, "x", type = "rds")
pin_write(b, 1:6, "x", type = "rds")
pin_write(b, 1:7, "x", type = "rds")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did have to change a couple of tests if the default is compare_hash = TRUE

@@ -64,6 +64,10 @@ pin_read <- function(board, name, version = NULL, hash = NULL, ...) {
#' use the default for `board`
#' @param tags A character vector of tags for the pin; most important for
#' discoverability on shared boards.
#' @param compare_hash Compare the pin contents to the last version if one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an improvement, but I wondered about something even more direct like write_if_different? Or write_unchanged_data? What do you think?

Copy link
Member Author

@juliasilge juliasilge May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about force_identical_write? This one would default to FALSE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these ideas really speak to me (including mine), so just go with your heart 😄

@juliasilge juliasilge changed the title New check_hash arg New force_identical_write arg May 5, 2023
@juliasilge juliasilge merged commit cc3c160 into main May 5, 2023
11 of 13 checks passed
@juliasilge juliasilge deleted the check-hash-arg branch May 5, 2023 18:12
@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants