Get revision history of a document in Google Drive #218
Comments
I've not waded into this myself (and only read but did not run the gist), but it's something I find quite interesting. The googledrive package does not provide any explicit or high-level support for the revision endpoints today. At the very least, I hope the low-level API functions are helpful and I'm definitely interested to hear/think more about whether there are high-level user-facing functions that could be useful. Or maybe this should be a separate set of case studies or a little package.
Yes, thanks to this pattern that I spotted in the
These patterns were a huge help! I've edited my gist to simplify my workflow and show where the pain points are (and add plots of the revision data I can get :).

To summarise: for each revision, we can easily get the username, date and time, and export URLs for the file at that revision. But we cannot get any info about the contents of the file. To work around that, we can use the export URLs provided by the API to download all the revisions, then use R to count the words in each (or whatever variable we want to diff on) and compute the differences in word count between consecutive revisions. These URLs look like this

The worst part of the workflow I have come up with so far is that the only way I have found to download these revision files is with

It would be great if we can use
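A minimal sketch of the word-count step described above, in base R. It assumes the revision texts have already been downloaded into a list (oldest first) and that the user name for each revision is known. The function names `word_count`, `revision_deltas` and `words_by_user` are made up for illustration and are not part of googledrive.

```r
# Count whitespace-separated words in a single string.
word_count <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  sum(nzchar(words))
}

# Hypothetical inputs: `texts` holds one string per revision (oldest first),
# `users` holds the user name reported for each revision.
revision_deltas <- function(texts, users) {
  stopifnot(length(texts) == length(users))
  counts <- vapply(texts, word_count, numeric(1), USE.NAMES = FALSE)
  data.frame(
    revision = seq_along(texts),
    user = users,
    words = counts,
    # The first revision is credited with all of its words; later
    # revisions with the change relative to the previous one.
    delta = c(counts[1], diff(counts)),
    stringsAsFactors = FALSE
  )
}

# Net change in word count attributed to each user.
words_by_user <- function(deltas) {
  tapply(deltas$delta, deltas$user, sum)
}
```

Note that `delta` is a net change, so a revision that deletes text gets a negative value, and edits that replace words one-for-one count as zero. A character-level or diff-based measure would be needed to capture those.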
@benmarwick, I think you want to grab the
and then do something like this to actually pull the document contents:
Thanks for taking a look! Here's what I get with a public google doc with just two revisions:
Yes, that works, here is the list of revisions
But we can't get the doc's contents with the other method you proposed:
We get this error message:
It seems that only the v2 API gives export URLs for specific versions of a Google Doc. v3, which you are using above, doesn't seem to give access to the actual content of the revisions.

But when I look at the web traffic while navigating between revisions in a doc using File -> Version history -> See version history, I see that the main URL that delivers the content for a version looks like this:

I can recognise the file ID, 1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps, and I see start= and end= values that refer to revision IDs I recognise from the API. I do not know what token= and ouid= are, but I guess they have something to do with authentication. The URL works fine without them, e.g. this will still get the JSON in my browser:

When I run this URL in my browser I get a JSON file for that specific revision, and it matches what I see on the Google Docs revision page. Are there any clues in this URL that hint at a way to use googledrive to download these revision files? That would be heaps better than using browseURL to get them.
Ah, I had a typo above; I think it should work now for non-Google files. But you're right: for some reason v3 doesn't support revision export for Google native files (https://issuetracker.google.com/u/1/issues/62825716) 😞. I am not sure this is exactly what you are looking for, but you can build that URL with an httr query, instead of having to navigate to it in the browser, like this:
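A minimal sketch of such a request, not necessarily the code originally posted. The base URL is deliberately left as an argument (`export_base`), meant to be the part before the `?` of an export link returned by the v2 API. The query parameter names `id`, `revision` and `exportFormat` are assumptions modelled on those export links, so check them against a real link. `revision_query` and `get_revision` are hypothetical helper names.

```r
# Build the query for one revision of one file.
# Assumption: the export link accepts id, revision and exportFormat.
revision_query <- function(file_id, revision_id, format = "txt") {
  list(id = file_id, revision = revision_id, exportFormat = format)
}

# Send an authenticated GET request for one revision.
# `export_base` is supplied by the caller rather than hard-coded here.
get_revision <- function(export_base, file_id, revision_id, format = "txt") {
  httr::GET(
    url = export_base,
    query = revision_query(file_id, revision_id, format),
    googledrive::drive_token() # reuses the token googledrive already holds
  )
}
```

Passing the parameters through `query =` lets httr handle the URL encoding, and `googledrive::drive_token()` supplies the credentials so the request does not need to go through the browser.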
Thank you very much, that seems to do it! Here's how I've put together your suggestions to see if I understand how to use them: For a given google doc (in the native gdoc format, not a binary file), we can get a list of all revisions like this, as you show:
Now we want a vector of the revision ids so we can iterate over them to get the full content of the file at each revision:
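A sketch of these two steps using googledrive's low-level request helpers. It assumes that `drive.revisions.list` is among the endpoint ids reported by `drive_endpoints()` and that the parsed response carries a `revisions` element whose entries each have an `id` field, as the v3 API describes. `list_revisions` and `revision_ids` are illustrative names only.

```r
# Ask the Drive v3 API for the revisions of one file.
# Assumption: "drive.revisions.list" is a known endpoint id.
list_revisions <- function(file_id) {
  req <- googledrive::request_generate(
    endpoint = "drive.revisions.list",
    params = list(fileId = file_id)
  )
  googledrive::do_request(req) # returns the parsed response as a list
}

# Pull the revision ids out of the parsed response.
revision_ids <- function(revs) {
  vapply(revs$revisions, function(r) r$id, character(1))
}
```

For a file with many revisions the response may be paginated, in which case `googledrive::do_paginated_request()` would be the safer call. It returns one list per page, so the ids would need to be collected from each page.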
Now we write a little function to export the contents of the google doc at a specific revision, using the httr methods you showed above. We choose to export the doc as a plain text file. I've added in the
Now we can use this function to contact the google drive API and get the content of each revision of the google doc:
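One way to sketch this loop, assuming a vector of revision ids and some function that fetches a single revision (for example a wrapper around an `httr::GET` call). The name `fetch_all_revisions` and its `fetch_one` argument are hypothetical. Failed requests are recorded as `NULL`, so one bad revision does not stop the whole run.

```r
# Call `fetch_one` once per revision id and collect the results.
# `fetch_one` is any function taking a single revision id.
fetch_all_revisions <- function(ids, fetch_one) {
  out <- lapply(ids, function(id) {
    # A failed request yields NULL instead of aborting the loop.
    tryCatch(fetch_one(id), error = function(e) NULL)
  })
  names(out) <- ids # keep track of which result belongs to which revision
  out
}
```

Naming the results by revision id makes it easy to join the contents back to the revision metadata (user, timestamp) later.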
And finally we can convert the responses from the API into plain text, and tidy it a little bit, ready for some exploratory data analysis, etc.
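A sketch of this conversion step, assuming `responses` is a list of httr response objects whose bodies are plain text. The tidying here (dropping a leading byte-order mark if there is one, normalising line endings, trimming whitespace) is only one reasonable choice, and the helper names `tidy_text` and `responses_to_text` are illustrative.

```r
# Light clean-up of one exported text.
tidy_text <- function(x) {
  x <- sub("^\ufeff", "", x)               # drop a leading byte-order mark, if any
  x <- gsub("\r\n", "\n", x, fixed = TRUE) # normalise Windows line endings
  trimws(x)                                # strip leading/trailing whitespace
}

# Turn each response into a length-one character vector.
responses_to_text <- function(responses) {
  lapply(responses, function(resp) {
    tidy_text(httr::content(resp, as = "text", encoding = "UTF-8"))
  })
}
```

Setting `encoding = "UTF-8"` explicitly avoids httr having to guess the encoding of the response body.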
The output here is a list where each item is a character vector with a length of one. Each item in the list holds the text of the google doc at a specific revision. Great! Much better than my approach of downloading the files via the browser, thanks again!
@benmarwick: this is awesome, glad we've got it sorted. It'd be great to get this written up into an article if you'd be willing to submit a PR (it need not be long, what you've got in the comment above ☝️ with a bit of filler explanation would be great).
Thanks again, yes I'd be happy to submit a PR for an article to narrate this a bit more elegantly for other users.
Perfect! @benmarwick see #219 -- thank you!
Is this possible with the googledrive pkg? I can see a little Python repo here that seems to do something like this: https://github.com/larsks/gitdriver My use case is trying to quantify who added how much text to a google doc. And this fascinating essay hints at a few possibilities: http://features.jsomers.net/how-i-reverse-engineered-google-docs/
My efforts so far are documented here: https://gist.github.com/benmarwick/1feaa2b2f0d7bc5f7e97903b8ff92aed