Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get revision history of a document in Google Drive #218

Closed
benmarwick opened this issue Jun 9, 2018 · 9 comments
Closed

Get revision history of a document in Google Drive #218

benmarwick opened this issue Jun 9, 2018 · 9 comments

Comments

@benmarwick
Copy link

benmarwick commented Jun 9, 2018

Is this possible with the googledrive pkg? I can see a little Python repo here that seems to do something like this: https://github.com/larsks/gitdriver My use case is trying to quantify who added how much text to a google doc. And this fascinating essay hints at a few possibilities: http://features.jsomers.net/how-i-reverse-engineered-google-docs/

My efforts so far are documented here: https://gist.github.com/benmarwick/1feaa2b2f0d7bc5f7e97903b8ff92aed

@jennybc
Copy link
Member

jennybc commented Jun 10, 2018

I've not waded into this myself (and only read but did not run the gist), but it's something I find quite interesting. The googledrive package does not provide any explicit or high-level support for the revision endpoints today. At the very least, I hope the low-level API functions are helpful and I'm definitely interested to hear/think more about whether there are high-level user-facing functions that could be useful. Or maybe this should be a separate set of case studies or a little package.

@benmarwick
Copy link
Author

Yes, thanks to this pattern that I spotted in the googledrive source, I can get to the revisions via the Drive v2 API:

req2 <- build_request(
  path = "drive/v2/files/{fileId}/revisions",
  method = "GET",
  params = list(
    fileId = "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"
  ),
  token = drive_token()
)
revs2 <-  process_response(make_request(req2))

These patterns were a huge help! I've edited my gist to simplify my workflow and show where the pain points are (and add plots of the revision data I can get :). To summarise, for each revision, we can easily get the username, date and time, and export URLs for the file at that revision. But we cannot get any info about the contents of the file.

To workaround that, we can use the export URLs provided by the API to download all the revisions, then use R to count the words in each (or whatever variable we want to diff on) and compute the differences in word count between each file. These URLs look like this https://docs.google.com/feeds/download/documents/export/Export?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&revision=28367&exportFormat=txt

The worst part of the workflow I have come up with so far is that the only way I have found to download these revision files is with browseURL, and that is not good at all. This is because we can't control the download location, the destination filename, and different browsers have different behaviours, and it's really disruptive to the user when there are more than a few revisions to download because the browser actions take over the system. So that's not ideal.

It would be great if we can use drive_download or similar to download these revisions directly from R. I guess it would be some combination of new functions using generate_request and make_request, but I can't quite see how to do it. Am I on the right track?

@LucyMcGowan
Copy link
Member

LucyMcGowan commented Jun 12, 2018

@benmarwick, I think you want to grab the revisionIds using the drive.revision.list endpoint like this:

## Get Revision Id
request <- generate_request(
  endpoint = "drive.revisions.list",
  params = list(
    fileId = file$id 
 )
)
res <- make_request(req)
revision <- process_response(res)
revisionId <- revision$revisions[[2]]$id

and then do something like this to actually pull the document contents:

## Get revision contents
request <- generate_request(
  endpoint = "drive.revisions.get",
  params = list(
    fileId = file$id,
    revisionId = revisionId
  )
)

res <- make_request(request)

@benmarwick
Copy link
Author

benmarwick commented Jun 12, 2018

Thanks for taking a look! Here's what I get with a public google doc with just two revisions:

# this doc is public, you can add more revisions if you like!
fileId <- "1QyrBDyzf2QZJ8aHhlICBqeaELlcp2sRmTFGJI8aRhwE"

## Get Revision Id
request <- generate_request(
  endpoint = "drive.revisions.list",
  params = list(
    fileId = fileId 
  )
)
res <- make_request(request)
revision <- process_response(res)

Yes, that works, here is the list of revisions

# 
# $`kind`
# [1] "drive#revisionList"
# 
# $revisions
# $revisions[[1]]
# $revisions[[1]]$`kind`
# [1] "drive#revision"
# 
# $revisions[[1]]$id
# [1] "1"
# 
# $revisions[[1]]$mimeType
# [1] "application/vnd.google-apps.document"
# 
# $revisions[[1]]$modifiedTime
# [1] "2015-06-19T02:04:11.377Z"
# 
# 
# $revisions[[2]]
# $revisions[[2]]$`kind`
# [1] "drive#revision"
# 
# $revisions[[2]]$id
# [1] "70"
# 
# $revisions[[2]]$mimeType
# [1] "application/vnd.google-apps.document"
# 
# $revisions[[2]]$modifiedTime
# [1] "2018-06-10T18:09:41.200Z"

But we can't get the doc's contents with the other method you proposed:

revisionId <- revision$revisions[[2]]$id

  ## Get revision contents
  request <- generate_request(
    endpoint = "drive.revisions.get",
    params = list(
      fileId = fileId,
      revisionId = revisionId,
      alt = "media"
    )
  )

res <- make_request(request)

We get this error message:


# Response [https://www.googleapis.com/drive/v3/files/1QyrBDyzf2QZJ8aHhlICBqeaELlcp2sRmTFGJI8aRhwE/revisions/70?alt=media]
# Date: 2018-06-12 06:23
# Status: 403
# Content-Type: application/json; charset=UTF-8
# Size: 382 B
# {
#   "error": {
#     "errors": [
#       {
#         "domain": "global",
#         "reason": "fileNotDownloadable",
#         "message": "Only files with binary content can be downloaded. Use Expor...
#     "locationType": "parameter",
#     "location": "alt"
#    }
# ...

Seems like only the v2 API gives the export URLs to specific versions of the google doc. V3, which you are using in the above, doesn't seem to give access to the actual content of the revisions.

But, when I look at the web traffic while navigating between revisions in a doc using File -> Version history -> See version history, I see that the main URL that delivers the content for a version looks like this:

https://docs.google.com/document/d/1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps/showrevision?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&end=53983&start=53898&smv=4&srfn=false&ern=false&token=AC4w5VgBbwN5o4jfoEoUNDwrkoTtT_b4vQ%3A1528698882690&ouid=102337699239108482617&includes_info_params=true

I can recognise the file ID, 1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps, and I see start= and end= that refer to revision IDs that I recognise from the API. I do not know what token= and ouid= are, but I guess something to do with authentication. The URL will work fine without them, e.g. this will still get the JSON in my browser:

https://docs.google.com/document/d/1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps/showrevision?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&end=53983&start=53898

When I run this URL in my browser I get a JSON file that is that specific revision, it matches what I see on the Google Docs revision page. Are there any clues in this URL that hint at a way to use googledrive to download these revision files? Because that would be heaps better than using browseURL to get them.

@LucyMcGowan
Copy link
Member

LucyMcGowan commented Jun 12, 2018

Ah I had a typo above, I think it should work now for non-Google files, but you're right for some reason v3 doesn't support revision export for Google native files (https://issuetracker.google.com/u/1/issues/62825716) 😞. I am not sure this is exactly what you are looking for, but you can build that url with an httr query instead of having to navigate to it in the browser like this:

url <- httr::modify_url(
  url = "https://docs.google.com/feeds/download/documents/export/Export",
  query = list(
    id = file$id,
    revision = revisionId,
    exportFormat = "txt"
  )
)
x <- httr::GET(url)

@benmarwick
Copy link
Author

Thank you very much, that seems to do it! Here's how I've put together your suggestions to see if I understand how to use them:

For a given google doc (in the native gdoc format, not a binary file), we can get a list of all revisions like this, as you show:

fileId <- "1QyrBDyzf2QZJ8aHhlICBqeaELlcp2sRmTFGJI8aRhwE"

## Get Revision Id
request <- generate_request(
  endpoint = "drive.revisions.list",
  params = list(
    fileId = fileId 
  ),
  token = drive_token()
)
res <- make_request(request)
revision <- process_response(res)

Now we want a vector of the revision ids so we can iterate over them to get the full content of the file at each revision:

# get all revision Ids
all_revision_Ids <- map_chr(revision$revisions, "id") %>% as.numeric()

Now we write a little function to export the contents of the google doc at a specific revision, using the httr methods you showed above. We choose to export the doc as a plain text file. I've added in the drive_token() so we can use this function with private or public documents:

library(httr)
get_content_of_revisions <- function(fileId, revisionId){

url <- modify_url(
  url = "https://docs.google.com/feeds/download/documents/export/Export",
  query = list(
    id = fileId,
    revision = revisionId,
    exportFormat = "txt"
  )
)
x <- GET(url, verbose(), drive_token())
}

Now we can use this function to contact the google drive API and get the content of each revision of the google doc:

all_revisions_contents <- map2(fileId, all_revision_Ids, get_content_of_revisions)

And finally we can convert the responses from the API into plain text, and tidy it a little bit, ready for some exploratory data analysis, etc.

library(rvest)
all_revisions_contents_text <- map(all_revisions_contents, 
                                   ~read_html(.x) %>% 
                                     html_text() %>% 
                                     str_squish())

The output here is a list where each item is a character vector with a length of one. Each item in the list holds the text of the google doc at a specific revision. Great! Much better than my approach of downloading the files via the browser, thanks again!

@LucyMcGowan
Copy link
Member

LucyMcGowan commented Jun 12, 2018

@benmarwick: this is awesome, glad we've got it sorted. It'd be great to get this written up into an article if you'd be willing to submit a PR (it need not be long, what you've got in the comment above ☝️ with a bit of filler explanation would be great).

@benmarwick
Copy link
Author

Thanks again, yes I'd be happy to submit a PR for an article to narrate this is bit more elegantly for other users.

@LucyMcGowan
Copy link
Member

Perfect! @benmarwick see #219 -- thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants