Get revision history of a document in Google Drive #218

benmarwick · 2018-06-09T18:30:00Z

Is this possible with the googledrive pkg? I can see a little Python repo here that seems to do something like this: https://github.com/larsks/gitdriver My use case is trying to quantify who added how much text to a google doc. And this fascinating essay hints at a few possibilities: http://features.jsomers.net/how-i-reverse-engineered-google-docs/

My efforts so far are documented here: https://gist.github.com/benmarwick/1feaa2b2f0d7bc5f7e97903b8ff92aed

jennybc · 2018-06-10T03:08:42Z

I've not waded into this myself (and only read but did not run the gist), but it's something I find quite interesting. The googledrive package does not provide any explicit or high-level support for the revision endpoints today. At the very least, I hope the low-level API functions are helpful and I'm definitely interested to hear/think more about whether there are high-level user-facing functions that could be useful. Or maybe this should be a separate set of case studies or a little package.

benmarwick · 2018-06-11T07:07:23Z

Yes, thanks to this pattern that I spotted in the googledrive source, I can get to the revisions via the Drive v2 API:

req2 <- build_request(
  path = "drive/v2/files/{fileId}/revisions",
  method = "GET",
  params = list(
    fileId = "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"
  ),
  token = drive_token()
)
revs2 <-  process_response(make_request(req2))

These patterns were a huge help! I've edited my gist to simplify my workflow and show where the pain points are (and add plots of the revision data I can get :). To summarise, for each revision, we can easily get the username, date and time, and export URLs for the file at that revision. But we cannot get any info about the contents of the file.

To workaround that, we can use the export URLs provided by the API to download all the revisions, then use R to count the words in each (or whatever variable we want to diff on) and compute the differences in word count between each file. These URLs look like this https://docs.google.com/feeds/download/documents/export/Export?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&revision=28367&exportFormat=txt

The worst part of the workflow I have come up with so far is that the only way I have found to download these revision files is with browseURL, and that is not good at all. This is because we can't control the download location, the destination filename, and different browsers have different behaviours, and it's really disruptive to the user when there are more than a few revisions to download because the browser actions take over the system. So that's not ideal.

It would be great if we can use drive_download or similar to download these revisions directly from R. I guess it would be some combination of new functions using generate_request and make_request, but I can't quite see how to do it. Am I on the right track?

LucyMcGowan · 2018-06-12T02:05:42Z

@benmarwick, I think you want to grab the revisionIds using the drive.revision.list endpoint like this:

## Get Revision Id
request <- generate_request(
  endpoint = "drive.revisions.list",
  params = list(
    fileId = file$id 
 )
)
res <- make_request(req)
revision <- process_response(res)
revisionId <- revision$revisions[[2]]$id

and then do something like this to actually pull the document contents:

## Get revision contents
request <- generate_request(
  endpoint = "drive.revisions.get",
  params = list(
    fileId = file$id,
    revisionId = revisionId
  )
)

res <- make_request(request)

benmarwick · 2018-06-12T06:27:54Z

Thanks for taking a look! Here's what I get with a public google doc with just two revisions:

# this doc is public, you can add more revisions if you like!
fileId <- "1QyrBDyzf2QZJ8aHhlICBqeaELlcp2sRmTFGJI8aRhwE"

## Get Revision Id
request <- generate_request(
  endpoint = "drive.revisions.list",
  params = list(
    fileId = fileId 
  )
)
res <- make_request(request)
revision <- process_response(res)

Yes, that works, here is the list of revisions

# 
# $`kind`
# [1] "drive#revisionList"
# 
# $revisions
# $revisions[[1]]
# $revisions[[1]]$`kind`
# [1] "drive#revision"
# 
# $revisions[[1]]$id
# [1] "1"
# 
# $revisions[[1]]$mimeType
# [1] "application/vnd.google-apps.document"
# 
# $revisions[[1]]$modifiedTime
# [1] "2015-06-19T02:04:11.377Z"
# 
# 
# $revisions[[2]]
# $revisions[[2]]$`kind`
# [1] "drive#revision"
# 
# $revisions[[2]]$id
# [1] "70"
# 
# $revisions[[2]]$mimeType
# [1] "application/vnd.google-apps.document"
# 
# $revisions[[2]]$modifiedTime
# [1] "2018-06-10T18:09:41.200Z"

But we can't get the doc's contents with the other method you proposed:

revisionId <- revision$revisions[[2]]$id

  ## Get revision contents
  request <- generate_request(
    endpoint = "drive.revisions.get",
    params = list(
      fileId = fileId,
      revisionId = revisionId,
      alt = "media"
    )
  )

res <- make_request(request)

We get this error message:


# Response [https://www.googleapis.com/drive/v3/files/1QyrBDyzf2QZJ8aHhlICBqeaELlcp2sRmTFGJI8aRhwE/revisions/70?alt=media]
# Date: 2018-06-12 06:23
# Status: 403
# Content-Type: application/json; charset=UTF-8
# Size: 382 B
# {
#   "error": {
#     "errors": [
#       {
#         "domain": "global",
#         "reason": "fileNotDownloadable",
#         "message": "Only files with binary content can be downloaded. Use Expor...
#     "locationType": "parameter",
#     "location": "alt"
#    }
# ...

Seems like only the v2 API gives the export URLs to specific versions of the google doc. V3, which you are using in the above, doesn't seem to give access to the actual content of the revisions.

But, when I look at the web traffic while navigating between revisions in a doc using File -> Version history -> See version history, I see that the main URL that delivers the content for a version looks like this:

https://docs.google.com/document/d/1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps/showrevision?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&end=53983&start=53898&smv=4&srfn=false&ern=false&token=AC4w5VgBbwN5o4jfoEoUNDwrkoTtT_b4vQ%3A1528698882690&ouid=102337699239108482617&includes_info_params=true

I can recognise the file ID, 1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps, and I see start= and end= that refer to revision IDs that I recognise from the API. I do not know what token= and ouid= are, but I guess something to do with authentication. The URL will work fine without them, e.g. this will still get the JSON in my browser:

https://docs.google.com/document/d/1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps/showrevision?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&end=53983&start=53898

When I run this URL in my browser I get a JSON file that is that specific revision, it matches what I see on the Google Docs revision page. Are there any clues in this URL that hint at a way to use googledrive to download these revision files? Because that would be heaps better than using browseURL to get them.

LucyMcGowan · 2018-06-12T12:56:25Z

Ah I had a typo above, I think it should work now for non-Google files, but you're right for some reason v3 doesn't support revision export for Google native files (https://issuetracker.google.com/u/1/issues/62825716) 😞. I am not sure this is exactly what you are looking for, but you can build that url with an httr query instead of having to navigate to it in the browser like this:

url <- httr::modify_url(
  url = "https://docs.google.com/feeds/download/documents/export/Export",
  query = list(
    id = file$id,
    revision = revisionId,
    exportFormat = "txt"
  )
)
x <- httr::GET(url)

benmarwick · 2018-06-12T17:25:24Z

Thank you very much, that seems to do it! Here's how I've put together your suggestions to see if I understand how to use them:

For a given google doc (in the native gdoc format, not a binary file), we can get a list of all revisions like this, as you show:

fileId <- "1QyrBDyzf2QZJ8aHhlICBqeaELlcp2sRmTFGJI8aRhwE"

## Get Revision Id
request <- generate_request(
  endpoint = "drive.revisions.list",
  params = list(
    fileId = fileId 
  ),
  token = drive_token()
)
res <- make_request(request)
revision <- process_response(res)

Now we want a vector of the revision ids so we can iterate over them to get the full content of the file at each revision:

# get all revision Ids
all_revision_Ids <- map_chr(revision$revisions, "id") %>% as.numeric()

Now we write a little function to export the contents of the google doc at a specific revision, using the httr methods you showed above. We choose to export the doc as a plain text file. I've added in the drive_token() so we can use this function with private or public documents:

library(httr)
get_content_of_revisions <- function(fileId, revisionId){

url <- modify_url(
  url = "https://docs.google.com/feeds/download/documents/export/Export",
  query = list(
    id = fileId,
    revision = revisionId,
    exportFormat = "txt"
  )
)
x <- GET(url, verbose(), drive_token())
}

Now we can use this function to contact the google drive API and get the content of each revision of the google doc:

all_revisions_contents <- map2(fileId, all_revision_Ids, get_content_of_revisions)

And finally we can convert the responses from the API into plain text, and tidy it a little bit, ready for some exploratory data analysis, etc.

library(rvest)
all_revisions_contents_text <- map(all_revisions_contents, 
                                   ~read_html(.x) %>% 
                                     html_text() %>% 
                                     str_squish())

The output here is a list where each item is a character vector with a length of one. Each item in the list holds the text of the google doc at a specific revision. Great! Much better than my approach of downloading the files via the browser, thanks again!

LucyMcGowan · 2018-06-12T19:12:19Z

@benmarwick: this is awesome, glad we've got it sorted. It'd be great to get this written up into an article if you'd be willing to submit a PR (it need not be long, what you've got in the comment above ☝️ with a bit of filler explanation would be great).

benmarwick · 2018-06-14T05:56:52Z

Thanks again, yes I'd be happy to submit a PR for an article to narrate this is bit more elegantly for other users.

LucyMcGowan · 2018-06-15T20:28:39Z

Perfect! @benmarwick see #219 -- thank you!

FrancoisR95 · 2025-02-07T12:54:07Z

Thank you for this code as remedy to the unfortunate absence of revision management in googledrive (still as of 2.1.1).

The above code still works with googledrive 2.1.1 but some functions have been renamed or removed in googledrive and their name need to be updated in the above code:
generate_request() => renamed into request_generate()
make_request() => renamed into request_make()
process_response() => removed. Instead, use gargle::response_process(), as done inside googledrive.

LucyMcGowan closed this as completed Jun 12, 2018

LucyMcGowan mentioned this issue Jun 15, 2018

📝 Write an article about retrieving revision content #219

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get revision history of a document in Google Drive #218

Get revision history of a document in Google Drive #218

benmarwick commented Jun 9, 2018 •

edited

Loading

jennybc commented Jun 10, 2018

benmarwick commented Jun 11, 2018

LucyMcGowan commented Jun 12, 2018 •

edited

Loading

benmarwick commented Jun 12, 2018 •

edited

Loading

LucyMcGowan commented Jun 12, 2018 •

edited

Loading

benmarwick commented Jun 12, 2018

LucyMcGowan commented Jun 12, 2018 •

edited

Loading

benmarwick commented Jun 14, 2018

LucyMcGowan commented Jun 15, 2018

FrancoisR95 commented Feb 7, 2025

Get revision history of a document in Google Drive #218

Get revision history of a document in Google Drive #218

Comments

benmarwick commented Jun 9, 2018 • edited Loading

jennybc commented Jun 10, 2018

benmarwick commented Jun 11, 2018

LucyMcGowan commented Jun 12, 2018 • edited Loading

benmarwick commented Jun 12, 2018 • edited Loading

LucyMcGowan commented Jun 12, 2018 • edited Loading

benmarwick commented Jun 12, 2018

LucyMcGowan commented Jun 12, 2018 • edited Loading

benmarwick commented Jun 14, 2018

LucyMcGowan commented Jun 15, 2018

FrancoisR95 commented Feb 7, 2025

benmarwick commented Jun 9, 2018 •

edited

Loading

LucyMcGowan commented Jun 12, 2018 •

edited

Loading

benmarwick commented Jun 12, 2018 •

edited

Loading

LucyMcGowan commented Jun 12, 2018 •

edited

Loading

LucyMcGowan commented Jun 12, 2018 •

edited

Loading