
drive_ls returning variable number of files #277

Closed
wilvancleve opened this issue Sep 27, 2019 · 15 comments

@wilvancleve

I have a Team Drive directory with 570 pdf files in it.

I identify the directory in question using a dribble:

pdf_dribble <- drive_ls(pattern="application_pdfs", team_drive=as_id(season$id))

Each time I call drive_ls on that directory, I get a different (and usually wrong) number of files back:

application_files = drive_ls(path=as_dribble(pdf_dribble), pattern="pdf", n_max=10000)

Sometimes this will return 100 files, sometimes 500, sometimes 569, but almost never the correct number.

Any thoughts?

@jennybc
Member

jennybc commented Sep 30, 2019

Sounds possibly related to #272. I haven't experienced this myself yet and no one's provided, say, a dribble containing these weird results. So I'm currently at a bit of a loss re: tackling this. I'm in a "listening and thinking" phase ... 🤔

Maybe it will happen to me soon.

@wilvancleve
Author

Happy to provide test data. It's a large number of files (~550), and running drive_ls repeatedly with the same inputs can give different results on successive calls. Some calls appear to get abbreviated results, as though n_max is being ignored and Google is returning a subset for brevity's sake.

@jessjaco

It looks like do_paginated_request is giving inconsistent results. If you pause here and run the request a bunch of times, it returns differing numbers of items, for example:

do_paginated_request(request, n_max=Inf, n=function(x) length(x$files),verbose=FALSE) %>% 
  map("files") %>% 
  flatten %>% 
  as_dribble %>% 
  nrow
[1] 1722
do_paginated_request(request, n_max=Inf, n=function(x) length(x$files),verbose=FALSE) %>% 
  map("files") %>% 
  flatten %>% 
  as_dribble %>% 
  nrow
[1] 1022
do_paginated_request(request, n_max=Inf, n=function(x) length(x$files),verbose=FALSE) %>% 
  map("files") %>% 
  flatten %>% 
  as_dribble %>% 
  nrow
[1] 422
do_paginated_request(request, n_max=Inf, n=function(x) length(x$files),verbose=FALSE) %>% 
  map("files") %>% 
  flatten %>% 
  as_dribble %>% 
  nrow
[1] 322

Worse still, it appears to be returning duplicate results rather than incomplete ones:

do_paginated_request(request, n_max=Inf, n=function(x) length(x$files),verbose=FALSE) %>% 
  map("files") %>% 
  flatten %>% 
  duplicated(.$id) %>% 
  sum
[1] 84
do_paginated_request(request, n_max=Inf, n=function(x) length(x$files),verbose=FALSE) %>% 
  map("files") %>% 
  flatten %>% 
  duplicated(.$id) %>% 
  sum
[1] 0
do_paginated_request(request, n_max=Inf, n=function(x) length(x$files),verbose=FALSE) %>% 
  map("files") %>% 
  flatten %>% 
  duplicated(.$id) %>% 
  sum
[1] 13

I wonder if this can be reproduced using the bare API or if it's something in gargle.

@jennybc
Member

jennybc commented Jan 14, 2020

I still have yet to experience this phenomenon or get enough data to truly study it.

But I have formed an untestable hypothesis about the root cause and installed a fix 🤞

Needless to say, please open a new issue if you update to this dev version and still see the phenomenon.

@wilvancleve
Author

Installed via devtools (the master branch) and still seeing a variable number of files returned. Source folder has 280 files. First call returned 256, then a bunch of calls returned 280, then a final call returned 269.

Is there more detailed debug info I can provide?

@jennybc
Member

jennybc commented Jan 15, 2020

Is this with a plain vanilla drive_ls()?

Can you do some exploratory analysis / comparison of those different return values?

Can you verify that there are no duplicated file IDs (the id column)?

Is there anything you can say about the 280 - 269 = 11 or 280 - 256 = 24 files that are sometimes missing? Are they being edited, were they recently created, do they all live in one subfolder, etc etc?
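The comparison asked for above could be sketched roughly as follows. This is a minimal base R sketch, assuming each listing is a data frame with an `id` column (as the dribbles returned by drive_ls() are). `compare_listings()` and the placeholder `folder` are hypothetical names made up for illustration, not part of googledrive.

```r
# Compare two listings (data frames with an `id` column) and report
# duplicated IDs within each one and IDs that appear in only one of them.
compare_listings <- function(a, b) {
  list(
    n = c(a = nrow(a), b = nrow(b)),
    duplicated_in_a = unique(a$id[duplicated(a$id)]),
    duplicated_in_b = unique(b$id[duplicated(b$id)]),
    only_in_a = setdiff(a$id, b$id),
    only_in_b = setdiff(b$id, a$id)
  )
}

# Hypothetical usage against Drive (needs googledrive and auth; not run here).
# `folder` stands in for whatever dribble identifies the target folder.
# run1 <- googledrive::drive_ls(folder)
# run2 <- googledrive::drive_ls(folder)
# diff <- compare_listings(run1, run2)
# run1[run1$id %in% diff$only_in_a, ]  # inspect files missing from run2

# Offline illustration with made-up IDs
full <- data.frame(
  name = c("a.pdf", "b.pdf", "c.pdf"),
  id = c("id1", "id2", "id3"),
  stringsAsFactors = FALSE
)
partial <- full[c(1, 3), ]
result <- compare_listings(full, partial)
```

Looking at the rows behind `only_in_a` / `only_in_b` (names, creation times, parent folders) is one way to check whether the missing files have anything in common.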

@jennybc
Member

jennybc commented Jan 15, 2020

The root problem is that we are working with a paginated result, i.e. the results come in batches. I am now guarding against those pages containing replicated results. But I cannot guard against the inverse problem, which is that some results appear in no page (?).

I suspect that this is what you are now seeing, because you report no error.
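To make the pagination point concrete, here is a rough sketch of combining pages while checking for repeated file IDs. It is not googledrive's actual implementation; `combine_pages()` is a hypothetical helper, and it assumes each page is a list with a `files` element whose entries each carry an `id`, mirroring the `x$files` structure used earlier in this thread.

```r
# Flatten a list of pages into one list of files, dropping any file whose
# id was already seen in an earlier position. Warns when repeats are found.
combine_pages <- function(pages) {
  files <- unlist(lapply(pages, function(p) p$files), recursive = FALSE)
  ids <- vapply(files, function(f) f$id, character(1))
  dup <- duplicated(ids)
  if (any(dup)) {
    warning(sum(dup), " file(s) appeared more than once across pages; ",
            "keeping the first copy of each")
  }
  files[!dup]
}

# Made-up pages: "b" is repeated on the second page.
pages <- list(
  list(files = list(list(id = "a"), list(id = "b"))),
  list(files = list(list(id = "b"), list(id = "c")))
)
unique_files <- suppressWarnings(combine_pages(pages))
```

A check like this can only catch repeats. A file that never shows up on any page leaves no trace in the response, which is the inverse problem described above.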

@jennybc
Member

jennybc commented Jan 15, 2020

@wilvancleve Never mind, I can see what you see in #288. I have some homework to do, but this may be something to ask about upstream.

@spocks

spocks commented Apr 2, 2020

After weeks of debugging, I realized this issue is causing errors in my code. Is there any plan to fix it in the GitHub or CRAN version?

@jennybc
Member

jennybc commented Apr 2, 2020

@spocks drive_ls() is fixed on GitHub?

https://github.com/tidyverse/googledrive/blame/c8108a913d023b6f80ba07f75bdeb4e0b4520094/NEWS.md#L5

which is how this issue got closed above.

@spocks

spocks commented Apr 3, 2020

@jennybc I updated googledrive to the latest GitHub version (1.0.0.9000), but I still have this issue.

@jennybc
Member

jennybc commented Apr 3, 2020

Have you definitely restarted R?

Also, the version number is not definitive for dev versions, i.e. many source states have the same dev version. We generally only bump the dev version when a change is important for another (dev) package.

If you've ruled all of that out, please open a new issue.
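Since the dev version number alone does not pin down the source state, one way to check what is actually installed is to read the extra fields that GitHub installs via devtools/remotes normally record in the package's DESCRIPTION. The sketch below is only an illustration; `describe_install()` is a hypothetical helper, and it assumes the `RemoteRef` / `RemoteSha` fields are present, which is not the case for CRAN or base packages (they come back as NA).

```r
# Report the version plus the Git ref and commit SHA recorded at install
# time, if any. Fields missing from DESCRIPTION are returned as NA.
describe_install <- function(pkg) {
  desc <- utils::packageDescription(
    pkg,
    fields = c("Version", "RemoteRef", "RemoteSha")
  )
  c(
    Version = desc$Version,
    RemoteRef = desc$RemoteRef,
    RemoteSha = desc$RemoteSha
  )
}

# Illustration with a base package, which has a Version but no Remote* fields.
info <- describe_install("stats")

# For a GitHub install, compare RemoteSha against the commit that has the fix:
# describe_install("googledrive")
```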

@aleksandereiken

aleksandereiken commented Sep 22, 2021

Dear Jenny,
I still experience the issue described above, even with the googledrive dev version 2.0.0.9000. (I have restarted R and confirmed with sessionInfo() that this is the version loaded.)

My code is this:

# Authenticate
googledrive::drive_auth()

# Find IDs of team drive folders
ids_of_team_drives <- googledrive::shared_drive_find()

# Find team drive of interest
team_drive_id <- ids_of_team_drives[which(ids_of_team_drives$name == "my_team_drive"),]$id

# List folders within team drive of interest
folders_within_team_drive <- googledrive::drive_ls(
  googledrive::as_dribble(
    googledrive::as_id(
      team_drive_id
    )
  )
)

# Download tibble with content of "my_sub_folder"
files <- googledrive::drive_ls(
  googledrive::as_dribble(
    googledrive::as_id(
      folders_within_team_drive[which(folders_within_team_drive$name == "my_sub_folder"), ]$id
    )
  )
)

The folder "my_sub_folder" contains 126 JPEG images. However, each time I re-run the code above, the returned tibble has anywhere between 100 and 126 rows.

Thanks for looking into this when you have time. And thank you for an amazing package!!

@jennybc
Member

jennybc commented Sep 22, 2021

At this point, I think this is as "fixed" as I can make it. The root problem is this:

> The root problem is that we are working with a paginated result, i.e. the results come in batches. I am now guarding against those pages containing replicated results. But I cannot guard against the inverse problem, which is that some results appear in no page (?).

I.e. googledrive is faithfully presenting the results Google gives us, but sometimes those results are incomplete. 😕

My best advice re: what to do about this is in here:

#288

specifically:

#288 (comment)

Now, I realize you are already using drive_ls().

BTW your code is much harder to read (and write!) than it needs to be. This is presumably not the problem, but is still an improvement you might enjoy.

I think the above can be rewritten as:

target_team_drive <- shared_drive_get("my_team_drive")
target_folder <- drive_get("my_sub_folder", shared_drive = target_team_drive)
files <- drive_ls(target_folder)

You might try specifying the corpus in drive_ls(), i.e. drive_ls(target_folder, corpus = "allDrives"). I don't think that should matter and if it helped that would be interesting to me.
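If individual listings are sometimes incomplete, one possible mitigation (an assumption on my part, not something taken from #288) is to list repeatedly and keep the union of everything seen, stopping once several consecutive calls add nothing new. `list_until_stable()` below is a hypothetical helper, sketched for plain data frames with an `id` column. With real dribbles, the `rbind()` step may need replacing with a tibble-aware row bind.

```r
# Call `list_fun()` repeatedly, accumulating unique rows by `id`, until
# `stable_runs` consecutive calls add no new IDs or `max_tries` is reached.
list_until_stable <- function(list_fun, max_tries = 10, stable_runs = 3) {
  seen <- NULL
  unchanged <- 0
  for (i in seq_len(max_tries)) {
    combined <- rbind(seen, list_fun())
    combined <- combined[!duplicated(combined$id), , drop = FALSE]
    if (!is.null(seen) && nrow(combined) == nrow(seen)) {
      unchanged <- unchanged + 1
    } else {
      unchanged <- 0
    }
    seen <- combined
    if (unchanged >= stable_runs) break
  }
  seen
}

# Hypothetical usage against Drive (not run here):
# files <- list_until_stable(function() drive_ls(target_folder))

# Offline illustration: a fake lister that returns a different partial
# subset on its first call than on later calls.
all_files <- data.frame(id = paste0("id", 1:5), stringsAsFactors = FALSE)
make_flaky <- function() {
  calls <- 0
  function() {
    calls <<- calls + 1
    rows <- if (calls == 1) 1:3 else c(1, 2, 4, 5)
    all_files[rows, , drop = FALSE]
  }
}
files <- list_until_stable(make_flaky())
```

This costs extra API calls and still cannot recover a file that never appears in any response, so it narrows the problem rather than solving it.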

@aleksandereiken

Okay, thanks a lot for your time and comment. I will look into the q clause to see if it can somehow help. Thanks for the code review too! I will test it out and let you know if it helps.
