Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overwrite control #230

Closed
Martin-Jung opened this issue Aug 28, 2018 · 10 comments · Fixed by #266
Closed

Overwrite control #230

Martin-Jung opened this issue Aug 28, 2018 · 10 comments · Fixed by #266
Labels
feature a feature request or enhancement [up|down]load ⬆️ ⬇️ wip work in progress
Milestone

Comments

@Martin-Jung
Copy link

Two minor things:

  • Can the parameter to the local file for drive_update() be harmonized with the parameter for drive_upload() ? Currently the former is called "file" and the latter is called "path". Somewhat confusing...

  • Would it be possible to have an "overwrite = TRUE" parameter in drive_upload ? Or alternatively enable a fallback option (parameter "create = TRUE") in drive_update() so that it creates files instead of raising an error if the file is not existing. Or combine both functions into one ?

@DataStrategist
Copy link

The troubling thing about overwrite = TRUE being missing is that it's creating a bit of chaos in my google drive... it neither provides error nor overwrites, it just creates another file with the exact same filename (not even incrementing by 1,2, etc). This is a bit dramatic! Solvable?

@jennybc
Copy link
Member

jennybc commented Aug 28, 2018

My head is very far away from this right now but a quick reaction:

drive_upload() and drive_update() are separate for at least a couple of reasons:

  • They hit different endpoints of the API. Of course the purpose of a wrapper is to smooth over API issues, but these are very different operations as far as Drive is concerned.
  • "upload" is expressly meant to create a new file id, "update" is expressly meant to preserve an existing file id (because maybe something else is pointing to it) but modify its media and/or metadata.
  • The lack of clear "update" functionality has been an ongoing thorn in my side with googlesheets so I didn't want to repeat that mistake.

Here's why it's not a no-brainer for drive_upload() to offer to overwrite. Offer to overwrite ... what? File names and parent folders are a very lightweight construct in Drive. You can have multiple files with the same name, files can have multiple parents, it's basically the Wild West. It's not a normal file system. Unless the user gives me a file id (directly or indirectly), I don't know that I'm supposed to overwrite. That's what drive_update() is for.

I'm happy to make some usability improvements, but hopefully this gives you more context for the way things currently are. Can you refine the wish list in light of this?

@LucyMcGowan
Copy link
Member

re: the difference in parameters path (in drive_upload()) versus file (in drive_update()), I think they ought to stay different because they are really different things.

  • In drive_upload() the second argument can only be a path (the location they would like the media deposited) i.e. it cannot be a file they want to update.
  • In drive_update() it can only be the file they want to update.

We tried to be consistent in these parameter names such that when it is referencing an actual file we use file (like in drive_update(), drive_publish(), drive_share(), the first parameter of drive_mv() or drive_cp(), etc.) and when it is referencing a location on Drive, we use path (such as drive_upload(), the second parameter of drive_mv() or drive_cp(), etc.)

@Martin-Jung
Copy link
Author

Martin-Jung commented Aug 29, 2018

re: @jennybc @LucyMcGowan
Thanks for the quick response and comments. Very insightful.

I know that the google api is not a normal file system, yet my workflow is organized in a way assuming that it is. Most data on my google drive is organized in hierachical folders and I usually want to update a specific file in a specific folder. As mentioned by @DataStrategist as well I think that drive_upload creating a separate file with the same file name (even when a path is explicitly set) is counterinuitive (although not any different as when I upload the file manually through the google drive webinterace).

I think the functionality that I am missing in googledrive is something like a drive_sync function that combines both drive_upload and drive_update and only works by specifying an explict path on the google drive. So similar as what the backup and sync tool does. If the file in this path does not exist it should create it, otherwise overwrite it.

My current workflow looks like this:

# First I create the file by uploading it to a specific folder
drive_upload(media = "XX_results_latest.rds",
             path = "~/Project_ABC/DataDrop/XX_results_latest.rds")
# Now whenenver I want to update the latest version of this file I need to run separate call to drive_update
drive_update(media = "XX_results_latest.rds",
             file =  "~/Project_ABC/DataDrop/XX_results_latest.rds")


# It would be easier if there were a functionality like this, which either creates or updates the file in a single call
drive_sync(media =  "XX_results_latest.rds", 
            file = "~/Project_ABC/DataDrop/XX_results_latest.rds")

# IDEA: Could also be split by individual files oder entire folders ?
drive_sync_file(media = "myfolder/file.txt",
                file = "~/Project_ABC/DataDrop/file.txt")
drive_sync_folder(media = "myfolder/",
                  path = "~/Project_ABC/DataDrop/")

As mentioned in the beginning, this is only a suggestion. Thanks for all your work on the googledrive package.

@DataStrategist
Copy link

Actually @Martin-Jung I think when you say

although not any different as when I upload the file manually through the google drive webinterace).

that's not correct. My use case is that I have a process that does stuff on my VM and dumps out a file which should be uploaded to google drive. as the does stuff goes again, it should create a new file which should overwrite the first. If I do it manually and drag and drop a new version into drive, it DOES replace the file, providing the following info:
image

Therefore, in order to emulate that behavior, I think gs_upload() overwriting or not should be parametrized, with the default behavior being to overwrite the file, but provide an option to keep seperate, just as drive does.

However, @jennybc if you want to keep it as is, perhaps at very least when a user uploads something that already exists on the drive there should be a warning saying "By the way user, since that filename already existed and google drive is a crazy filesystem, I have created another file with that name"... just because perhaps having one filename is behavior that a reasonable user might expect? 😄

FWIW, I fixed the problem by drive_rm(FILENAME) before uploading. Now everything purrs like a kitten :)

@jennybc
Copy link
Member

jennybc commented Apr 26, 2019

Having just skimmed this and some other issues, I feel like an overwrite parameter might be the answer:

  • overwrite = NA is default, is current behaviour, does nothing
  • overwrite = TRUE and overwrite = FALSE give user a way to explicitly opt in or out. In either case, it authorizes us to hunt down any pre-existing file with the same name and trash it or abort, respectively. This search is potentially expensive, so we don't want to do it by default.

@jennybc jennybc added feature a feature request or enhancement [up|down]load ⬆️ ⬇️ labels Apr 26, 2019
jennybc added a commit that referenced this issue Aug 10, 2019
@jennybc jennybc added the wip work in progress label Aug 10, 2019
@jennybc jennybc added this to the v1.0.0 milestone Aug 10, 2019
jennybc added a commit that referenced this issue Aug 13, 2019
Truly closes #230
@jennybc
Copy link
Member

jennybc commented Aug 13, 2019

Hello People of Issue #230!

I'm working on a large PR (#266) that adds an overwrite argument to lots of functions.

And aided by that infrastructure and mindset, I've been able to add a new function:

drive_put <- function(media, path = NULL, name = NULL, ..., type = NULL, verbose = TRUE) {...}

which, in pseudo-code, does this:

target_filepath <- <determined from `path`, `name`, and `media`>
hits <- <get all Drive files at target_filepath>
if (no hits) {
  drive_upload(media, path, name, type, ..., verbose)
} else if (exactly 1 hit) {
  drive_update(hit, media, ..., verbose)
} else {
  ERROR
}

I'm not done yet, but I am to the point where input from people who try it and care about this stuff would be very useful.

Here's how to install from that PR:

devtools::install_gitub("tidyverse/googledrive#266")

@Martin-Jung
Copy link
Author

Cheers @jennybc , 👍
Just tested the new function and it does work for me with the example code below.

devtools::install_gitub("tidyverse/googledrive#266")
library(googledrive)

# Creates duplicate files
drive_upload(drive_example("chicken.csv"))
drive_upload(drive_example("chicken.csv"))

# Creates a single file
x <- drive_upload(drive_example("chicken.csv"))
drive_put(drive_example("chicken.csv"))

@jennybc
Copy link
Member

jennybc commented Aug 13, 2019

@Martin-Jung

So does drive_put() meet your needs? Along with the ability to specify overwrite = TRUE/FALSE in other functions?

@Martin-Jung
Copy link
Author

It does. The feature that I was specifically missing was a function to update an existing file on google drive. The drive_put function seems to do well. I haven't checked the other functions with overwrite parameters, but I am sure they will be handy too.

@jennybc jennybc changed the title Consistent naming and overwrite for drive_upload Overwrite control Aug 14, 2019
jennybc added a commit that referenced this issue Aug 14, 2019
* Refactor to reveal similarity in these functions

* Devote less attention to parent edge cases

In previous refactoring commit, I stopped checking whether both `path` and `parents` were specified in `drive_upload()`. I claim this is incredibly rare and not worth the code and headspace to maintain.

Let the behaviour be undefined / undescribed and, if it is wrong or surprising and actually matters, someone will speak up.

* Don't try to delete a file more than once

Tests added in the next commit revealed that, for filepath input to `drive_rm()`, we tried to delete each file n times, where n = the number of filepaths for that file. n can be greater than 1 if a file has multiple parents.

* Test that drive_update() can add a parent folder

I am removing checks and tests around unusual parent situations elsewhere. I intend to send people to drive_update() for all weird parent folder situations. So I want an explicit test that you can add a parent folder this way.

* Centralize more path handling

* Use rationalize_path_name()

* Deprecate `parent`, in favor of `path`

Revisit this later with lifecycle

* Changes in drive_create() mean this test hits the API now

* Test rationalize_path_name()

* Introduce `overwrite`

Closes #230

* Oops forgot drive_rename()

* Add NEWS section

* Oops, forgot drive_mv()

* Test check_for_overwrite() and, therefore, the `overwrite` arguments

* No, really, `parent` is a file ID

Refine tests

* Show `overwrite` in examples

* Add handy "recency" example for drive_find()

* Tweak NEWS

* Extract overwrite_hits()

* Maybe it's a little aggressive/dangerous to empty trash here

But I have seen that the presence of files on Drive, in the trash, can affect tests. Tests should technically leave the trash the way the found it and be "trash-aware". But sometimes it's easiest to just clean it out.

* Add drive_put()

Truly closes #230

* Test for drive_put()

* More doc updates re: drive_put()

* Put in the usual skips

* Add "local" now that so many other functions deal in a "remote" overwrite
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement [up|down]load ⬆️ ⬇️ wip work in progress
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants