Submission: googleLanguageR #127

Closed
3 tasks
MarkEdmondson1234 opened this issue Jun 18, 2017 · 39 comments

@MarkEdmondson1234

MarkEdmondson1234 commented Jun 18, 2017

Summary

  • What does this package do? (explain in 50 words or less):

This package contains functions for analysing language through the Google Cloud Machine Learning APIs

  • Paste the full DESCRIPTION file inside a code block below:
Package: googleLanguageR
This package lets you call the 'Google Cloud Translation API' for detection and translation of text. It also allows you to analyse text using the 'Google Natural Language API' for analysing sentiment, entities or syntax. You can also analyse sound files and transcribe them to text via the 'Google Cloud Speech API'.

Data extraction for the translation and Speech to text API calls, text analysis for the entity detection API call.

  • Who is the target audience?

Analysts working with sound files and/or text that need translation and/or text analysis such as sentiment, entity detection etc.

The Natural Language API is replicated by many R packages such as tidytext but it does offer a bigger trained dataset to work from than what an analyst can supply themselves.

For Speech to text, I'm not aware of any R packages that do this, especially with the Google one as it's pretty new. There may be other APIs that are called via R that do it.

For the Translation API, the recently released cld2 does language detection offline (and was the prompt for me coming across rOpenSci again) and is the recommended way to detect language first, to see if it's text that needs translation. For the translation itself, I'm not aware of any R packages; again, I'm fairly sure none call the new Google Translate API as it's fairly new. Note this is the one that uses neural nets, not the older one that is used online and has been available for ages.

Requirements

Confirm each of the following by checking the box. This package:

  • [x] does not violate the Terms of Service of any service it interacts with.
  • [x] has a CRAN and OSI accepted license.
  • [x] contains a README with instructions for installing the development version.
  • [x] includes documentation with examples for all functions.
  • [x] contains a vignette with examples of its essential functions and uses.
  • [x] has a test suite. (but only works locally at the moment due to authenticated API)
  • [x] has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov. (but see above, pending a fix around this)
  • [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • [x] Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • [x] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
    Tests fail online as not authenticated:
    OK: 0 SKIPPED: 0 FAILED: 3

    1. Error: NLP returns expected fields (@test_gl.R#7)
    2. Error: Speech recognise expected (@test_gl.R#20)
    3. Error: Translation works (@test_gl.R#34)
  • [x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
    I haven't a short, lowercase name for the package name. This is largely due to a standard set for all the Google API packages I've written to help aid discoverability in Google search. I've been bitten by similar packages calling the Google Analytics API that were called rga and RGA...

  • If this is a resubmission following rejection, please explain the change in circumstances:
    NA

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@noamross
Contributor

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks for your submission @MarkEdmondson1234! A few things I'd like to clear up before assigning reviewers:

  • CI and codecov checks are currently failing
  • It looks like this requires the dev version of googleAuthR (at least for testing?). I would add this to a DESCRIPTION Remotes: field so that
  • I was only able to get tests to run locally by adding a call to gl_auth() with an absolute path to the JSON credentials to testthat.R. I suspect this is not the way to do this. Can you add developer's instructions of how to set up the environment for local testing? A CONTRIBUTING.md file would be the appropriate place for this.
  • There are some test failures due to what looks like the fact that Google translations seem to change over time. Can you make these robust to these changes, or can the Google language API use a fixed model? One option might be to require translations to be within some range of the reference via stringdist. Or a more limited test would be to check the language of the output via cld2.
Failed ---------------------------------------------------------------------------------
1. Failure: Speech recognise expected (@test_gl.R#23) -------------------------------
result$transcript not equal to `test_result`.
1/1 mismatches
x[1]: "to administer medicine to animals Is frequent give very difficult matter and yet sometimes it's ne
x[1]: cessary to do so"
y[1]: "to administer medicine to animals Is frequent very difficult matter and yet sometimes it's necessa
y[1]: ry to do so"


2. Failure: Translation works (@test_gl.R#37) -----------------------------------------------------------
japan$translatedText not equal to `test_result`.
1/1 mismatches
x[1]: "薬を動物に投与することはしばしば非常に難しいことですが、時にはそれを行う必要があります"
y[1]: "動物に医薬品を投与することはしばしば非常に困難な問題ですが、時にはそれを行う必要があります"

Finally, here are the outputs from goodpractice::gp(). No major flags here, but we will want these points addressed by the end of the review.

── GP googleLanguageR ───────────────────────────
It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 60% of code lines are covered by test cases.  (Ed: 100% coverage is not required, but this gives reviewers guidance of where to look for code that may need additional testing).

    R/natural-language.R:68:NA
    R/natural-language.R:69:NA
    R/natural-language.R:70:NA
    R/natural-language.R:71:NA
    R/natural-language.R:72:NA
    ... and 79 more lines

  ✖ add a "URL" field to DESCRIPTION. It helps users find information about your package online. If your package does not have a homepage, add an URL to GitHub, or the CRAN package package page.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting services provide bug trackers for free,  https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are used it and it is easier to read your code for them if you use '<-'.

    R/translate.R:219:15

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines shorter than 80 characters

    R/auth.R:8:1
    R/natural-language.R:45:1
    R/natural-language.R:47:1
    R/natural-language.R:65:1
    R/speech.R:52:1
    ... and 18 more lines

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

    R/utilities.R:10:43

  ✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml These were most likely included in error. See section. ‘Package structure’ in the ‘Writing R Extensions’ manual. (Ed: This should be added to .Rbuildignore)
───────────────────────────────

@MarkEdmondson1234
Author

MarkEdmondson1234 commented Jun 24, 2017

Great thanks! Have sorted out the easy wins in this commit https://github.com/MarkEdmondson1234/googleLanguageR/commit/5e959de9c03f507d9ada8dc0bfa16486c31dddb7

Testing

OK, this is sorted now that the cache tests work.

Translations improving breaking tests

There are some test failures due to what looks like the fact that Google translations seem to change over time. Can you make these robust to these changes, or can the Google language API use a fixed model?

Hmm, that translation has improved since the test was written! I will look at how to fix the model version or something. The stringdist sounds like a good idea, is it this library?

goodpractice results

Ah, I fixed those before the submission, but forgot to commit. Doh.

@noamross
Contributor

noamross commented Jun 26, 2017

Thanks, Mark. Yes, that's the stringdist lib I was thinking of. Do let me know when you think you've made changes sufficient to send to reviewers. We'd like to make sure (a) that the package is going to be stable during the review so the reviewers have a constant code base to work with, and (b) that there aren't major anticipated changes after review. It's fine if you are using the dev version of googleAuthR if we expect this to be stable.

Also, tests currently fail without this auth, so I'm not sure the mock setup is working properly. (It looks like interactive login is still requested). This needs to be fixed before sending to review.

Some testing/auth considerations that can also wait until other reviewers weigh in: I'd make putting the credentials file path in GL_AUTH a standard auth method. This is generally best practice, so that once users do this they do not need to do additional auth in any session. (If some additional exchange is needed, maybe make this the default arg for gl_auth(), or running auth on package load if tokens are not found in some default storage location). I would split tests out into those requiring an account and those that don't, using testthat::skip_if_not() to check for GL_AUTH.
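The split suggested above could be sketched roughly as follows. This is an illustration only: has_gl_auth() is a made-up helper name, and the sketch assumes GL_AUTH holds the path to a credentials file, as described in the comment above.

```r
library(testthat)

# Hypothetical helper (not part of googleLanguageR): TRUE when the GL_AUTH
# environment variable names an existing credentials file.
has_gl_auth <- function() {
  path <- Sys.getenv("GL_AUTH", unset = "")
  nzchar(path) && file.exists(path)
}

test_that("integration: API round trip (needs credentials)", {
  # Skipped, not failed, on machines without credentials (e.g. CI)
  skip_if_not(has_gl_auth(), "GL_AUTH is not set to a credentials file")
  # ... calls that hit the live API would go here ...
  succeed()
})

test_that("unit: runs everywhere, no credentials needed", {
  expect_true(is.function(has_gl_auth))
})
```

With this layout the first test is reported as skipped wherever GL_AUTH is missing, while the second always runs.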

@MarkEdmondson1234
Author

MarkEdmondson1234 commented Jul 4, 2017

Hi @noamross , hope you got some holidays, just back from one myself :)

I think I have sorted all the testing issues.

I now have two types of tests: unit tests that use the mock caches; and integration tests that will call the API if you have GL_AUTH defined. If GL_AUTH is not present, it skips them.

The unit tests need no authentication, and pass on Travis/Codecov.

The integration tests are also using stringdist with the API responses, with a margin of error of 10 characters as they do seem variable, and pass locally.
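A tolerance check of this kind might look roughly like the sketch below. It is not the package's actual test code: it uses base R's utils::adist() (Levenshtein distance) as a stand-in for stringdist so that it runs without extra packages, the helper name within_edit_distance() is invented for illustration, and the two strings are the transcripts from the failure reported earlier in this thread.

```r
# Hypothetical helper (not part of googleLanguageR): TRUE when `actual` is
# within `max_dist` single-character edits of `expected`.
within_edit_distance <- function(actual, expected, max_dist = 10L) {
  stopifnot(is.character(actual), length(actual) == 1L,
            is.character(expected), length(expected) == 1L)
  # utils::adist() returns a matrix of generalized Levenshtein distances
  as.vector(utils::adist(actual, expected)) <= max_dist
}

returned <- paste("to administer medicine to animals Is frequent give very",
                  "difficult matter and yet sometimes it's necessary to do so")
reference <- paste("to administer medicine to animals Is frequent very",
                   "difficult matter and yet sometimes it's necessary to do so")

# The two transcripts differ only by the inserted word "give"
within_edit_distance(returned, reference)
```

Inside a testthat file the same helper could be wrapped in expect_true(), so a small drift in the API response no longer fails the test.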

I got some weird errors regarding Travis not being able to encode Japanese text used in the demos that I'd like sorted out, but it's not critical; I removed the Japanese examples for now.

@noamross
Contributor

noamross commented Jul 5, 2017

Thanks, @MarkEdmondson1234. Tests are all passing on my end and goodpractice::gp() gives a nice clean output (below for reviewer reference). I am now seeking reviewers.

Also, you can be the first to test our new dynamic RO package status badge! You can add the following to your README.

[![](https://ropensci.org/badges/127_status.svg)](https://github.com/ropensci/onboarding/issues/127)

It updates daily and shows "Under review" for packages at stage 2 (seeking reviewers) or higher and "Peer reviewed" once you are approved.

── GP googleLanguageR ───────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code
    in general. 71% of code lines are covered by test cases.

    R/natural-language.R:74:NA
    R/natural-language.R:75:NA
    R/natural-language.R:76:NA
    R/natural-language.R:77:NA
    R/natural-language.R:78:NA
    ... and 55 more lines

  ✖ avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    R/auth.R:8:1
    R/speech.R:52:1
    R/speech.R:91:1
    R/speech.R:93:1
    R/translate.R:24:1
    ... and 16 more lines

Also, FYI, some typos from devtools::spell_check()

> devtools::spell_check()
  WORD            FOUND IN
analyse         description:2,3
Analyse         gl_nlp.Rd:30
analysing       googleLanguageR.Rd:9, description:3
entites         gl_nlp.Rd:30
langauge        gl_translate_list.Rd:10
seperate        gl_translate_language.Rd:31

@MarkEdmondson1234
Author

Thanks! Shiny new badges. :) It says "unknown" for now, if that is right.

@noamross
Contributor

noamross commented Jul 9, 2017

Reviewers assigned:

Reviewer: @jooolia
Reviewer: @nealrichardson
Due date: 2017-07-31

@noamross
Contributor

A friendly reminder to @jooolia and @nealrichardson that your reviews are due in one week.

@nealrichardson

nealrichardson commented Jul 28, 2017

Review summary

Overall, I think the package is a nice contribution to the R community. The language APIs it wraps are quite powerful, and I can easily see how having a natural way to access them could greatly enhance the kind of work that one could do in R.

I do have some suggestions for improvements to the core functions of the package, as well as for the tests and documentation. A few of the suggestions are particularly significant.

Installation/Setup/Readme

  • Installation was smooth.
  • I had no trouble using gl_auth to specify a credentials file.
  • Minor note: "Useage" reads as a misspelling to me, but maybe it's a regionalism I'm not aware of.

Using the package

My general feedback on the key functions is that they could be improved by being more consistent in behavior among themselves, and by having documentation about the responses they return.

gl_nlp

I wanted to take a column of text data from a survey I had, free-form text responses that respondents gave, and do sentiment analysis on it.

df <- foreign::read.spss("survey.sav")
sentiment <- gl_nlp(df$comments)

I got a big long spew of log messages, and then a validation error that my input needed to be a length-1 character vector. So:

  • the docs should be explicit that you can only specify a single string, if that's the requirement (they say "character vector");
  • the function should validate its input before logging a message about the query;
  • I question the value of the logging in this case, or at least the default log level--it seems like it is just echoing my command;
  • I'd encourage supporting a vectorized input, even if it means you have to make N requests internally (though I'd hope that the API has a way of doing at least some of the queries more intelligently)
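To make the "validate first, then loop" suggestion concrete, here is a minimal sketch. It is not the package's implementation: analyse_each() and fake_nlp() are hypothetical names, and fake_nlp() is an offline stub standing in for a function that sends one string to the API.

```r
# Hypothetical wrapper, not the package's API: `analyse_one` stands in for a
# function that sends ONE string to the service and returns a list.
analyse_each <- function(texts, analyse_one) {
  # Validate first, so bad input fails before anything is logged or sent
  if (!is.character(texts) || length(texts) == 0L || anyNA(texts)) {
    stop("`texts` must be a non-empty character vector without NAs",
         call. = FALSE)
  }
  stopifnot(is.function(analyse_one))
  # One request per element; results come back in input order
  lapply(texts, analyse_one)
}

# Stub used only to make the sketch runnable offline
fake_nlp <- function(string) list(text = string, n_chars = nchar(string))

res <- analyse_each(c("Great survey", "Too long"), fake_nlp)
length(res)
```

The result is a list with one entry per input string, so a caller can pass a whole column without slicing it by hand.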

I then selected a single element from that column of text and sent it, and I got a quick response. Then I had to figure out what I got back.

  • It would really help if the docs described the contents of the object returned and how to interpret them, beyond saying "a list".

gl_translate_detect

I gave it the same column of text data as before.

lang <- gl_translate_detect(df$comments)

which errored with "Total URL must be less than 2000 characters"

  • What should one do in this case? A more constructive error message would help.
  • Why? Does the API restrict this way? I find that surprising, particularly since you're doing a POST request, which in principle could take a very large request body.

I selected a subset of the column and retried:

> lang <- gl_translate_detect(df$comments[995:997])
Detecting language: 1879 characters -                                                   SOME QUESTIONS DIDNT HAVE GOOD CHOICES. THE RIGHT                                                   ...
> lang
[[1]]
  confidence isReliable language
1          1      FALSE      und

[[2]]
  isReliable language confidence
1      FALSE       en  0.9981712

[[3]]
  isReliable language confidence
1      FALSE      und          1
  • Again, would really help if the man page documented the return object in more detail.
  • Seems like this would more logically return as a single data.frame rather than a list of 1-row data.frames. Much more easily analyzed and manipulated.
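For what it's worth, a caller can already collapse that list with base R, which hints at how little the function would need to do internally. The sketch below rebuilds the three results printed above by hand; nothing here calls the API.

```r
# The three one-row results printed above, rebuilt by hand
lang <- list(
  data.frame(confidence = 1, isReliable = FALSE, language = "und",
             stringsAsFactors = FALSE),
  data.frame(isReliable = FALSE, language = "en", confidence = 0.9981712,
             stringsAsFactors = FALSE),
  data.frame(isReliable = FALSE, language = "und", confidence = 1,
             stringsAsFactors = FALSE)
)

# rbind() on data frames matches columns by name, so the differing column
# order across the list elements does not matter
detected <- do.call(rbind, lang)
detected$language
```

The combined object has one row per input string and takes its column order from the first element.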

gl_translate_language

Interestingly, this function works differently from the others. First, my input was apparently too long again, but instead of erroring, it warned me and truncated:

Warning message:
In gl_translate_language(df$comments[995:997]) :
  Too many characters to send to API, only sending first 1500.  Got 1879

Second, as I suggested that gl_translate_detect should, this one does return a data.frame.

Consistency in a package is important. I favor the data.frame return for these functions. And I think I favor erroring and forcing the user to send a shorter request rather than automatically truncating because of the specter of rate limiting--I'd rather have the say on what requests are going through rather than having the package truncate my text potentially mid-word and burning through some of my daily character/request limit. Alternatively, if you get a vector input and can chunk the requests cleanly by vector elements (first 2 elements in one request, next 5 in the next, etc.) and then return a single object, perhaps it would be nicer--that way, you're concealing the constraints of the API from the R user, who shouldn't have to think about URL character lengths. Regardless, all functions should handle input length issues consistently.
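The chunking idea could be sketched along these lines. This is an assumption-laden illustration: chunk_by_chars() is a hypothetical helper, the 1500-character default is simply the figure from the warning above, and a single element that is too long on its own raises an error rather than being truncated.

```r
# Hypothetical helper: group consecutive vector elements so that each group
# stays within `max_chars` characters in total. Elements are never split.
chunk_by_chars <- function(x, max_chars = 1500L) {
  stopifnot(is.character(x), !anyNA(x))
  n <- nchar(x)
  if (any(n > max_chars)) {
    stop("Element(s) ", paste(which(n > max_chars), collapse = ", "),
         " exceed ", max_chars, " characters on their own", call. = FALSE)
  }
  group <- integer(length(x))
  g <- 0L
  used <- Inf  # forces the first element to open a new group
  for (i in seq_along(x)) {
    if (used + n[i] > max_chars) {
      g <- g + 1L
      used <- 0
    }
    used <- used + n[i]
    group[i] <- g
  }
  unname(split(x, group))
}

chunks <- chunk_by_chars(c("aaaa", "bb", "ccc", "d"), max_chars = 5L)
str(chunks)
```

Each chunk could then go out as one request and the responses be bound back together in input order, so the caller never has to think about the limit.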

gl_speech_recognise

  • Consider adding an Americanized alias (recognize)
  • The man page does not describe the function return at all.

Tests

  • Important: your mocks/cached responses appear to have your bearer auth token in them, which is a security risk for you. (Just readRDS one of the files and inspect the object.) At a minimum, if you have not already, you should reset/get a new token and kill that one since it's been made public on GitHub.
  • You may also want to consider using the httptest package to help you have cleaner API fixtures that don't include your auth token. httptest can also help you to get to 100% line coverage easily: you can assert that the right requests are made with the various API versions and "gs://" URLs (things that are among your untested lines currently) without needing to actually make all of those requests or maintain mocks for their responses. See the package vignette for some examples.
  • Even where you do have line coverage, the tests are thin, particularly in showing what the API response object looks like. This is particularly important because, as I mentioned above, the man pages also don't give much guidance on what the API responses look like. As a user, I'd like to have some kind of reference for what to expect that the functions return, and with respect to tests, some kind of assurance of the data contract. While the man pages ideally should describe the responses in more detail, you can also use tests to give more detailed examples that show the shape of a return object, possibly more naturally than you can in the @examples section of the docs.
  • You have non-ASCII text in test files, which if I recall correctly from a previous CRAN rejection of mine, will cause problems when they run your tests on Solaris. You can resolve this by moving the non-ASCII text to a separate file that gets sourced or otherwise read in and place that after a skip_on_cran. Just placing it in a test-file after a skip_on_cran isn't sufficient because the test file will fail to source. Cf. Crunch-io/rcrunch@7f74971

API Usage

These might point more to the capabilities of googleAuthR than this package.

  • Sys.sleep(getOption("googleLanguageR.rate_limit")) seems inefficient. You're adding 500ms to every request unnecessarily. Presumably the Google API will tell you if you've exceeded your limit by responding with a 429 response code, so instead, your API client could handle a 429 by sleeping and retrying, and otherwise proceed normally. That way, you only pay the penalty if you exceed the limit, not on every request. Let the server decide your rate limiting. There are examples out there of other packages that do this (twitteR and crunch are two examples I'm aware of).
  • It also seems that the operative logic in check_rate isn't exercised in tests.
  • I wonder if you could simplify some query param logic, as in the "translate" functions, if gar_api_generator handled it (rather than concatenating the query strings to the URL yourself, could delegate to httr).
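A retry-on-429 loop of the kind described in the first bullet might look like this. It is a sketch under stated assumptions: with_retry() and fake_request() are hypothetical, the request function is an offline stub rather than a googleAuthR or httr call, and the doubling wait times are an arbitrary choice.

```r
# Hypothetical retry wrapper. `do_request` is any zero-argument function that
# performs one HTTP call and returns a list with a `status` element.
with_retry <- function(do_request, max_tries = 5L, base_wait = 1,
                       sleep = Sys.sleep) {
  for (attempt in seq_len(max_tries)) {
    resp <- do_request()
    if (resp$status != 429L) {
      return(resp)  # success, or an error that is not about rate limiting
    }
    if (attempt < max_tries) {
      sleep(base_wait * 2^(attempt - 1))  # wait 1, 2, 4, ... seconds
    }
  }
  stop("Still rate limited after ", max_tries, " attempts", call. = FALSE)
}

# Offline stub: rate limited twice, then succeeds
calls <- 0L
fake_request <- function() {
  calls <<- calls + 1L
  list(status = if (calls <= 2L) 429L else 200L)
}

# A no-op `sleep` keeps the demonstration instant
resp <- with_retry(fake_request, sleep = function(seconds) invisible(NULL))
resp$status
```

Requests that are not rate limited return immediately, so the waiting cost is only paid when the server actually pushes back.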

Style/minutiae

  • Function imports: using importFrom rather than :: would make for more readable code IMO, and according to http://r-pkgs.had.co.nz/namespace.html#imports, it is marginally faster.
  • Consider defining true <- unbox(TRUE) to avoid repetition--that appears a lot. Consider also that auto_unbox=TRUE perhaps should be the default in your API client's JSON serializer and use I() to prevent unboxing on attributes that must always be arrays.
  • getOption takes a 'default' arg so https://github.com/MarkEdmondson1234/googleLanguageR/blob/master/R/utilities.R#L28-L29 can be condensed
  • Consider adding "docstring" comments or similar for internal functions that explain why they exist and what they do. Some have useful comments like that but others like myMessage would benefit from them.
  • I noticed what appeared to be a couple of superfluous sprintf/paste0 with single strings/no interpolation, e.g. https://github.com/MarkEdmondson1234/googleLanguageR/blob/master/R/translate.R#L24
  • Is it really necessary to assert_that on your inputs on internal functions? The cost of doing so is negligible when you're inside functions that make HTTP requests (network latency (and Sys.sleep(getOption("googleLanguageR.rate_limit"))) will dominate the running time), but it seems like overkill since you control the functions that are passing inputs to them.
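As an illustration of the getOption() point above: the "before" half below is a guess at the general pattern, not a copy of the linked lines, and the 0.5 second fallback is assumed from the 500ms figure mentioned earlier in this review.

```r
# Before (illustrative pattern): look the option up, then patch in a
# fallback by hand
rate_limit <- getOption("googleLanguageR.rate_limit")
if (is.null(rate_limit)) rate_limit <- 0.5

# After: let getOption() supply the fallback in one call
rate_limit <- getOption("googleLanguageR.rate_limit", default = 0.5)
```

Both forms give the same value; the second just drops the explicit NULL check.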

Hope this is helpful, and please let me know if you have any questions that I can clear up.

Package Review Checklist

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 9


@noamross
Contributor

Thank you for the excellent and thorough review, @nealrichardson! Quick request - could you append the reviewer template to your review? It just has some quick boxes to check.

@MarkEdmondson1234
Author

Wonderful feedback, thanks so much.

Important: your mocks/cached responses appear to have your bearer auth token in them

Hmm, blast. Back to the drawing board then, I think it's going to have to be just local tests until I figure that out. If https://github.com/nealrichardson/httptest helps me out, brilliant.

@jooolia
Member

jooolia commented Jul 31, 2017

Hi, my review will be one day late. :( Had a weekend that was unexpectedly without internet access. I will post my review tomorrow morning.
Julia

@jooolia
Member

jooolia commented Aug 1, 2017

Hi! Thanks for the opportunity to review this package. I found the package quite interesting and learned some new things by looking over the code and tests. Please see below for my review.

Review googleLanguageR

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README

  • It was clear that this package leverages the API, but I was not sure who the intended target audience is. Businesses? Researchers?

  • Installation instructions: for the development version of package and any non-standard dependencies in README

  • This was very clear and well thought out! I had never used these Google APIs before, but I was able to get set up and running by following the installation instructions.

library(googleLanguageR)
## Successfully authenticated via C:/Users/julia_lenovo/School/Reviews/Ropenscience/Test project-00e9fee42a1e.json
  • Vignette(s) demonstrating major functionality that runs successfully locally

The Readme.md has a brief introduction with a few examples, but the package could be improved with vignettes illustrating more interesting and elaborate use cases.

Also I think that the Readme.md could be improved by creating a Readme.Rmd since there were a few errors and inconsistencies in the code in the Readme.md. For example:

  • Line 131: japan <- gl_translate_language(result_brit_freq$transcript, target = "ja"). result_brit_freq is not yet a variable at that point, so this example doesn't work.

  • In "Language Detection" in the Readme.md, two calls (lines 147-156) to gl_translate_detect() produce results that have different structures. I could not reproduce this.

  • Function Documentation: for all exported functions in R help

All functions have help pages, however the explanations are very minimalist.

  • Examples for all exported functions in R Help that run successfully locally

There were very minimal examples in the exported functions.

Specifically:

  • gl_auth() has no example and help is very minimal. Would it be too repetitive to reuse some of the info from the Readme.md?

  • gl_nlp() - not all languages supported in this function. I think that this should be made clear. The value section could be more descriptive in describing the analyses returned.

  • gl_speech_recognise() - The descriptions were quite useful and well-documented for this function.

  • gl_translate_detect() - Nice that you mention another language package that works offline and for free.

  • gl_translate_language() - I think it would be helpful to explain that this function can also detect the language so that you can avoid a step if desired.

  • It would be helpful to have an example where the text is split up for translation. Does this have any effect on the translation? Perhaps not since the text is ~2000 characters long?

  • gl_translate_list() no example.

  • googleLanguageR - would be nice to put a link to the Readme.md (or to a vignette or two) that has examples.

  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Current guidelines are very nice and encouraging. It would be excellent to add in instructions about how to contribute in a respectful way, e.g. a Contributor Covenant Code of Conduct.

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

Further notes in section on code review.

  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

There are several different ways that functions are named in this package. There is Camel case, package.verb naming and some others, but not much of the recommended snake_case. Is this to match the Google API that you are using? If so, maybe all the functions and (as much as possible) the variables could be named in the same way for consistency.

Readme.md

  • The description of the package could be longer.
  • citation info is missing.
Code review:

auth.R - ok

googleLanguageR.R - ok

natural-language.R

  • For the assertthat test (line 60) should this test appear earlier, i.e. before using string?
  • function f(), which you use to post the API call, could use more descriptive naming.
  • Users should be told that only certain languages are allowed here. Is there a reason for this?

speech.R

  • same comment for function f()

translate.R

  • same comment for function f()
    test <- gl_translate_list()

  • Since gl_translate_list() returns a dataframe I think it would be nice to rename the function.

  • Why are all of these languages not supported by gl_nlp()?

  • gl_translate_detect()

gl_translate_detect("katten sidder på måtten")
## Detecting language: 33 characters - katten sidder på måtten...
## [[1]]
##   confidence isReliable language
## 1  0.1961327      FALSE       da
  • why does it say 33 characters and in the Readme it says 39?

  • gl_translate_language()

Wanted to see how it would do with html pages.

le_temps_ven = readLines('https://www.letemps.ch/monde/2017/08/01/venezuela-deux-principaux-opposants-ont-arretes')
result_le_temps <- gl_translate_language(le_temps_ven)
## Warning in gl_translate_language(le_temps_ven): Too many characters to send
## to API, only sending first 1500. Got 148098
## 2017-08-01 14:46:07 -- Translating: 1500 characters - <!DOCTYPE html><html lang="fr" dir="ltr">  <head>    <meta charset="utf-8" /><script type="text/jav<script>var _sf_startpt=(new Date()).getTime();</s<meta name="Generator" content="Drupal 8 (https://<meta name="MobileOptimized" content="width" /><meta name="HandheldFriendly" content="true" /><meta name="viewport" content="width=device-width,<meta name="twitter:image" content="https://assets<meta name="twitter:card" content="summary_large_i<meta name="twitter:site" content="@letemps" /><meta name="twitter:title" content="Venezuela: deu<meta name="twitter:description" content="Deux jou<script>ltx8="Disabled";</script><meta name="google-site-verification" content="z0t<meta name="p:domain_verify" content="d0baf6f70127<meta property="fb:pages" content="319393291423771<meta property="fb:app_id" content="10016088677790<meta name="theme-color" content="#930025" /><meta name="description" content="Deux jours aprè<meta property="og:title" content="Venezuela: deux<meta property="og:image" content="https://assets.<meta property="og:url" content="https://www.letem<meta property="og:description" content="Deux jour<meta property="og:type" content="article" /><link rel="amphtml" href="https://www.letemps.ch/n<link rel="shortcut icon" href="https://assets.let<link rel="image_src" href="https://assets.letemps<link rel="dns-prefetch" href="//assets.letemps.ch<link rel="dns-prefetch" href="//letemps.wemfbox.c<link rel="dns-prefetch" 
(trimmed output)
## Request failed [413]. Retrying in 1 seconds...
## Request failed [413]. Retrying in 2 seconds...
## 2017-08-01 14:46:12> Request Status Code: 413
## Warning: No JSON content found in request
## Error: lexical error: invalid char in json text.
##                                        <!DOCTYPE html> <html lang=en> 
##                      (right here) ------^

Not clear what format the html should be in. Would be useful to have an example. I will try a site without https.

library(RCurl)
## Loading required package: bitops
slate_rcurl <- getURL("http://www.slate.fr/story/149310/adieu-scaramucci-plus-beau-succes-administration-trump",
                ssl.verifypeer = FALSE)
result_slate <- gl_translate_language(slate_rcurl)
## Warning in gl_translate_language(slate_rcurl): Too many characters to send
## to API, only sending first 1500. Got 53747
## 2017-08-01 14:46:13 -- Translating: 1500 characters - 
## <!DOCTYPE html>
## <html lang="fr">
## <head>
##   <meta c...
result_slate
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 translatedText
## 1 \n<!DOCTYPE html>\n<html lang="fr">\n<head>\n  <meta charset="utf-8">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />\n\n  <link rel="shortcut icon" href="http://www.slate.fr/favicon.ico" />\n\n    <title>Adieu Scaramucci, le plus beau succès de l'administration Trump | Slate.fr</title>\n\n  <meta name="robots" content="index,follow">\n  <meta name="description" content="Ses histoires de bite vont terriblement nous manquer." />\n  <meta name="keywords" content="Monde,Donald Trump,communication politique,Stephen Bannon" />\n  \n  <link rel="canonical" href="http://www.slate.fr/story/149310/adieu-scaramucci-plus-beau-succes-administration-trump" />\n    \n  <link rel="apple-touch-icon" href="http://www.slate.fr/apple-touch-icon.png" />\n  <link rel="alternate" type="application/rss+xml" title="Slate.fr" href="/rss.xml" />\n\n    <meta property="og:title" content="Adieu Scaramucci, le plus beau succès de l'administration Trump" 
##   detectedSourceLanguage
## 1                     en

Does not detect the appropriate source language.

result_slate_fr <- gl_translate_language(slate_rcurl,
                                            source = "fr")
## Warning in gl_translate_language(slate_rcurl, source = "fr"): Too many
## characters to send to API, only sending first 1500. Got 53747
## 2017-08-01 14:46:15 -- Translating: 1500 characters - 
## <!DOCTYPE html>
## <html lang="fr">
## <head>
##   <meta c...
result_slate_fr
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      translatedText
## 1 \n<! DOCTYPE html>\n<Html lang = "en">\n<Head>\n  <Meta charset = "utf-8">\n  <Meta name = "viewport" content = "width = device-width, initial-scale = 1.0, maximum-scale = 1.0, user-scalable =\n\n  <Link rel = "shortcut icon" href = "http://www.slate.com/favicon.ico" />\n\n    <Title> Farewell Scaramucci, the Trump's greatest success | Slate.fr </ title>\n\n  <Meta name = "robots" content = "index, follow">\n  <Meta name = "description" content = "His dick stories are going to be terribly missed." />\n  <Meta name = "keywords" content = "World, Donald Trump, political communication, Stephen Bannon"\n  \n  <Link rel = "canonical" href = "http://www.slate.fr/story/149310/adieu-scaramucci-plus-beau-succes-administration-trump" />\n    \n  <Link rel = "apple-touch-icon" href = "http://www.slate.fr/apple-touch-icon.png" />\n  <Link rel = "alternate" type = "application / rss + xml" title = "Slate.fr" href = "/ rss.xml"\n\n    <Meta property = "og: title" content = "Farewell Scaramucci, the best success of the Trump administration"

Success with translation when source is specified. I was expecting the html to be stripped from the text when I first saw the option for using html. Could that be an option so that the text is more likely to fit within the constraints of the number of characters?

basic_html <- '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
    <title>A very basic HTML Page</title>
</head>
<body>
    Here I have created a very simple html page. Will it be translated?
</body>
</html>'

basic_html_tl <- gl_translate_language(basic_html,
                                       target = "fr",
                                            format = "html")
## 2017-08-01 14:46:15 -- Translating: 419 characters - <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Trans...
basic_html_tl
##                                                                                                                                                                                                                                                         translatedText
## 1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title> Une page HTML très simple </title></head><body> Ici, j&#39;ai créé une très simple page html. Est-ce que cela sera traduit? </body></html>
##   detectedSourceLanguage
## 1                     en

Ok this works. Neat, however I think there should be more information about how to use this with more complex pages.
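
One way to stay within the character limit is to pull out the visible text before translating. The following is a minimal sketch, assuming the rvest package is installed; selecting the `body` node is an illustrative choice rather than anything the package prescribes, and the translation call is left commented out because it needs authentication.

```r
library(rvest)

basic_html <- '<html>
<head><title>A very basic HTML Page</title></head>
<body>
    Here I have created a very simple html page. Will it be translated?
</body>
</html>'

# Parse the page and keep only the text inside <body> (an illustrative selector)
page <- read_html(basic_html)
body_text <- html_text(html_nodes(page, "body"))

# Collapse runs of whitespace so no characters are spent on layout
body_text <- trimws(gsub("\\s+", " ", body_text))
nchar(body_text)

# Requires authentication, so it is not run here:
# gl_translate_language(body_text, target = "fr")
```

For real pages a narrower selector (for example the article container) would probably be needed, since `body` still includes navigation and footer text.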

utilities.R

It would be useful to have a bit more documentation on what these functions are doing and why. Inline comments could be useful here.

test_unit.R

  • Could there be a test to see if you are connected to the internet? Or more specifically at least able to access the API? Because otherwise all of the tests can fail without a clear indication of why.

  • cat() : could message() be used instead?

  • NLP -

  • would it be useful to test that Unicode characters are always being parsed correctly? e.g. could check that the parsed string length is the same.

  • test for gl_translate_list() checks that it is a dataframe, but not really anything else about it.

  • could be good to add in some tests where you expect to get an error if the input or the options are not correctly specified.
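
A rough sketch of the connectivity check suggested above, assuming the testthat and curl packages are available. The helper name `skip_if_api_unreachable()` and the host name are hypothetical; it only checks that the host resolves, not that credentials are valid.

```r
library(testthat)

# Hypothetical helper: skip the current test when the API host cannot be resolved.
# The host name is an assumption; substitute whichever endpoint the tests call.
skip_if_api_unreachable <- function(host = "www.googleapis.com") {
  address <- curl::nslookup(host, error = FALSE)  # NULL when the lookup fails
  if (is.null(address)) {
    skip(paste("Cannot resolve", host, "- skipping tests that need the API"))
  }
  invisible(TRUE)
}

test_that("API-dependent test is skipped with a clear reason when offline", {
  skip_if_api_unreachable()
  expect_true(TRUE)  # placeholder for a real API expectation
})
```

With a guard like this, an offline run reports skipped tests with a reason instead of a wall of unrelated failures.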

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6


Review Comments

Overall I think that this package is very useful and most functions work as intended. I think the package could be greatly improved by more thorough documentation and examples.

@noamross
Contributor

noamross commented Aug 1, 2017

Thanks for your thorough review, @jooolia! @MarkEdmondson1234, let us know when you believe you have addressed these reviews so that reviewers can take a second look, but feel free to ask questions in the meantime.

@MarkEdmondson1234
Author

MarkEdmondson1234 commented Aug 2, 2017

This is great, thanks so much @jooolia - it may be a few weeks before I respond (as I have my own review to do!) but this will help a tonne.

@MarkEdmondson1234
Author

MarkEdmondson1234 commented Aug 11, 2017

I think I am ready with my review responses - please download from GitHub to review. Overall it's been really great and I have found it all really useful.

  • moved to httptest test suite, more tests
  • standardised function outputs to tibbles (or lists of tibbles where it makes sense)
  • Support vectorised inputs, and use a tryCatch to move too large inputs to a vectorised version if it can.
  • the translation API calls were actually using GET, which imposed limits it shouldn't have, fixed that
  • improved documentation on object outputs and added more examples
  • Shortened key function names to gl_speech, gl_translate so as not to confuse with gl_translate_languages which lists languages (former gl_translate_list)
  • Added vignettes and published website
  • removed rate limit and rely on API retries instead
  • moved to using importFrom instead of ::
  • Added examples of who and what problems the package could be useful for, and a code of conduct
  • changed myMessage to my_message
  • corrected input test of gl_nlp
  • renamed f() to call_api()
  • More support and examples on how to translate HTML, using rvest
  • moved handling unicode to httr internals

Things I kept despite flagging up:

  • Logging the string you pass in - I have found this really useful when passing a lot of text in to see how progress is going
  • assert_that on inputs to internal functions - this is a habit I have to mitigate bugs, and it works for me - when I tweak internal functions it catches me if I forget to remove an argument, for example, and as it's an API call I don't think it adds enough overhead to be significant.
  • function naming - All exported functions are snake_case, the only other internal function that is not is is.gcs which is to match is.character etc.
  • The messages in the tests need to be cat() instead of message() so they show up in the travis logs

help!

The CRAN checks all complete locally, but on Travis it complains about Package vignettes without corresponding PDF/HTML when building the vignettes - I've tried various combinations but can't get rid of it - any ideas?

I would rather it not build them on Travis, but the options in travis of
r_build_args: --no-build-vignettes --no-manual --no-resave-data
r_check_args: --no-build-vignettes --no-manual

...seem to have no effect.

@nealrichardson

Yep, I've pulled the latest and hope to get back to you by the weekend.

@MarkEdmondson1234
Author

@nealrichardson I forgot to update some tests, so the version now is the best one to review. Worked around the vignette issue as well (by avoiding API calls within them)

@jooolia
Member

jooolia commented Aug 18, 2017

Hi @MarkEdmondson1234!

Thanks for taking my comments into account and making many changes to the package. Notably I find that all aspects of the documentation are much improved (help pages, examples, vignettes, contributing, etc). I updated my previous comment to show that now I recommend approving this package. 👍

Cheers, Julia

@nealrichardson

@MarkEdmondson1234, this version of the package is a major step forward. I'm impressed with the new documentation and see lots of improvements in the code as well. In retrying the test examples from my initial review, however, I hit some new issues that I think count as bugs, and I ran into a few sharp corners where the messaging and error handling could be more hospitable to your users. I do think these issues should be addressed before the package is formally accepted and released, and I don't think they'll require much effort on your part to resolve--we're really close.

I also note at the bottom a few other suggestions for your consideration that you can take or leave, or decide to revisit after the review process.

New concerns

gl_nlp

Repeating my initial example, it now works on a text vector, sort of:

> sentiment <- gl_nlp(df$comments)
2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'
[SO MUCH MORE LOG]
2017-08-20 14:08:44 -- annotateText for 'I can't find a grade low enough to express my diss...'
Error in cbind_all(x) : Argument 2 must be length 7, not 8
2017-08-20 14:08:44 -- annotateText for '...'
2017-08-20 14:08:44 -- annotateText for 'The House Republicans are holding the Country Host...'
[SO MUCH MORE LOG]
2017-08-20 14:08:47 -- annotateText for 'Questions about Congress should take into account ...'
Error in cbind_all(x) : Argument 2 must be length 7, not 8
In addition: Warning message:
In call_api(the_body = body) :
  API Data failed to parse.  Returning parsed from JSON content.
                    Use this to test against your data_parse_function.
2017-08-20 14:08:48 -- annotateText for '...'
[SO MUCH MORE LOG]
2017-08-20 14:09:01 -- annotateText for 't was OK, but you should add the question: How nai...'
Error in cbind_all(x) : Argument 2 must be length 4, not 6
In addition: Warning message:
In call_api(the_body = body) :
  API Data failed to parse.  Returning parsed from JSON content.
                    Use this to test against your data_parse_function.
[SO MUCH MORE LOG, and more failures]
Warning message:
In call_api(the_body = body) :
  API Data failed to parse.  Returning parsed from JSON content.
                    Use this to test against your data_parse_function.
  • The errors in cbind_all(x) sound like they should have halted execution of the function, yet I got a result--what did I even get? And the warning about API Data failed to parse. Returning parsed from JSON content. Use this to test against your data_parse_function. doesn't sound like something a user of googleLanguageR should ever see or have to reason about.
  • It appears that the function makes API requests even if the strings are zero-length, all whitespace, or NA, and that seems like something that should be avoided, particularly if we're subject to API rate limiting. I'm guessing that's not a behavior you considered, but that's what users like me are for: stumbling onto weird edge cases and new uses. So, I suggest determining what the API would return for an empty string query and just return that object if the string is empty/missing, without making a request. (To test that with httptest, you can assert that no API request is made with expect_no_request.)
  • The output would be more useful if it were a list of data.frames/tbls where possible (if I understand the return correctly, it would be most reasonable for "documentSentiment" and "language"), rather than lists of lists of tbls. I passed the function a column of text to do sentiment analysis on, and I'd like a vector of equal length (or data.frame of the same number of rows) back for the sentiment. I'd rather not have to iterate over a list of lists to extract that vector.
  • Logging
    • Would it be possible to make it more obvious how to turn down the log level? You said you found it useful at its current level, but with this much text and this many requests, my console was just filled with my data printed back at me. That sounds useful as an "info" or "debug" log level that you can turn up to, but not the default--at least not the default that I'd like to be able to set for myself.
    • It seems that you're printing "..." even when the string being logged is not truncated, which is odd.
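
To illustrate the suggestion about empty strings, here is a minimal base R sketch. `is_blank()`, `call_for_non_blank()` and `fake_api()` are hypothetical names, and the fake function merely stands in for a real request so that the number of calls can be counted.

```r
# TRUE for NA, zero-length, or all-whitespace strings
is_blank <- function(x) {
  is.na(x) | !nzchar(trimws(x))
}

# Hypothetical wrapper: call `api_fun` only for non-blank strings and
# return a list aligned with the input (NULL where no request was made).
call_for_non_blank <- function(strings, api_fun) {
  out <- vector("list", length(strings))
  keep <- !is_blank(strings)
  out[keep] <- lapply(strings[keep], api_fun)
  out
}

# Stand-in for a real API call that counts how often it is invoked
n_requests <- 0
fake_api <- function(s) {
  n_requests <<- n_requests + 1
  list(text = s, n_chars = nchar(s))
}

comments <- c("", "   ", NA, "Some real feedback")
res <- call_for_non_blank(comments, fake_api)
length(res)   # same length as the input
n_requests    # only the non-blank string triggered a call
```

Keeping the output the same length as the input means the results can still be lined up against the original column.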

gl_translate_detect

Trying the same example as before with the full column of data,

> lang <- gl_translate_detect(df$comments)
Detecting language: 10126 characters [...SO MUCH MESSAGING...]
2017-08-21 19:13:22> Request Status Code: 400
Error: JSON fetch error: Too many text segments

It's a different error from before but the same problem--how do I fix it? I don't know how to interpret "JSON fetch error".

Repeating the request that worked last time (selecting a few rows) was good and fast, and response was in a useful shape, much nicer than before.

> lang <- gl_translate_detect(df$comments[995:997])
Detecting language: 255 characters - SOME QUESTIONS DIDNT HAVE GOOD CHOICES. THE RIGHT ...
> lang
# A tibble: 3 x 4
  confidence isReliable language
       <dbl>      <lgl>    <chr>
1  1.0000000      FALSE      und
2  0.9981712      FALSE       en
3  1.0000000      FALSE      und
# ... with 1 more variables: text <chr>

Friendly suggestions

Here are a few things to consider, or things to look into for the future, but that I don't consider gating issues.

Dependencies/installation

You've added several new package dependencies in this iteration, and I urge you to reconsider doing so. Following the analysis in that blog post, I see that while you declared 4 new dependencies, you're pulling a total of 8 new packages:

> setdiff(new, old)
[1] "dplyr"     "purrr"     "tibble"    "bindrcpp"  "glue"      "pkgconfig"
[7] "rlang"     "bindr"    

Most of those come with dplyr:

> setdiff(new, new_without_dplyr)
[1] "dplyr"     "bindrcpp"  "glue"      "pkgconfig" "bindr"    

This may sound academic, but I did have brief trouble installing from source the latest version of googleLanguageR, and I got an ominous-sounding warning about my dplyr and Rcpp versions:

neal:googleLanguageR neal.richardson$ R CMD INSTALL .
* installing to library ‘/Users/neal.richardson/R’
ERROR: dependency ‘purrr’ is not available for package ‘googleLanguageR’
* removing ‘/Users/neal.richardson/R/googleLanguageR’
neal:googleLanguageR neal.richardson$ R -e 'install.packages("purrr")''
Installing package into ‘/Users/neal.richardson/R’
(as ‘lib’ is unspecified)
trying URL 'http://cran.at.r-project.org/bin/macosx/el-capitan/contrib/3.4/purrr_0.2.3.tgz'
Content type 'application/x-gzip' length 231939 bytes (226 KB)
==================================================
downloaded 226 KB


The downloaded binary packages are in
	/var/folders/ly/dnxts6pd22q98109b0nsbq2h0000gp/T//RtmpLfYT7J/downloaded_packages
> 
neal:googleLanguageR neal.richardson$ R CMD INSTALL .
* installing to library ‘/Users/neal.richardson/R’
* installing *source* package ‘googleLanguageR’ ...
** R
** inst
** preparing package for lazy loading
Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
Please reinstall dplyr to avoid random crashes or undefined behavior.
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
Please reinstall dplyr to avoid random crashes or undefined behavior.
* DONE (googleLanguageR)

You've only imported 3 functions from dplyr: select, bind_cols, and full_join. If it were me (and this may be a minority perspective), I'd use the right incantations of [, cbind, and merge, respectively, and drop the dependency on dplyr to keep things simpler and cleaner for the end users of the package.
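
For reference, a small sketch of what those base R equivalents might look like. The data frames here are made up for illustration and are not taken from the package.

```r
left  <- data.frame(id = 1:3, text = c("a", "b", "c"), extra = c(TRUE, FALSE, TRUE),
                    stringsAsFactors = FALSE)
right <- data.frame(id = c(2L, 3L, 4L), score = c(0.5, -0.2, 0.9))

# dplyr::select(left, id, text) -> column subsetting with `[`
selected <- left[, c("id", "text"), drop = FALSE]

# dplyr::bind_cols(a, b) -> cbind(); rows must already be in the same order
bound <- cbind(selected, n_chars = nchar(selected$text))

# dplyr::full_join(left, right, by = "id") -> merge() with all = TRUE
joined <- merge(left, right, by = "id", all = TRUE)
joined
```

One difference to keep in mind: `merge()` sorts the result by the key by default, whereas `full_join()` keeps the row order of the left table.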

Loading and authenticating

After I resolved my trouble installing the package, I got an unexpected message when I loaded the package:

> library(googleLanguageR)
Setting scopes to https://www.googleapis.com/auth/cloud-platform
Set any additional scopes via options(googleAuthR.scopes.selected = c('scope1', 'scope2')) before loading library.

I have no idea what that means, what a scope is, and why I would want or should have more of them. Makes me wonder whether it's better not to show a startup message at all.

Anyway, I decided to proceed with repeating my testing from before, with my column of open text comments from a survey, starting with gl_nlp:

> sentiment <- gl_nlp(df$comments)
2017-08-20 14:06:39 -- annotateText for '...'
2017-08-20 14:06:39> No authorization yet in this session!
2017-08-20 14:06:39> No  .httr-oauth  file exists in current working directory.  Run gar_auth() to provide credentials.
2017-08-20 14:06:39> Token doesn't exist
Error: Invalid token
In addition: Warning message:
In checkTokenAPI(shiny_access_token) : Invalid local token

Whoops! That's a lot of (redundant?) messaging to tell me that I need to authenticate. Right, I remember now, I have a .json credential file I need to supply. Let me follow the instructions the error message gave me:

> gar_auth("creds.json")
Error in gar_auth("creds.json") : could not find function "gar_auth"

Oops! I'm guessing that's in googleAuthR, given the function prefix, so let's try that:

> googleAuthR::gar_auth("creds.json")
Error in read_cache_token(token_path = token) : 
  Cannot read token from alleged .rds file:
creds.json

Apparently not. So this is when I look back at the Readme and do it right.

> gl_auth("creds.json")

That could have gone a lot smoother. For the sake of your distracted future users, if you handled the auth failure more gracefully and gave the right helpful message, it would be a lot friendlier.

Tests

I still would ideally want to see higher test coverage. Looks like the majority of the lines not covered are in a rate-limiting function. Particularly since they're not hit most of the time in normal use, it would be nice to have some tests that exercise them just to prevent accidental breakage in the future when you're working on the package.

@noamross
Contributor

Thanks for the follow-up, @jooolia and @nealrichardson. @MarkEdmondson1234, I concur with Neal's follow-up requests. As for Neal's suggestions, I think you should address the tests and authentication especially, though authentication might be deferred if you have specific plans that are dependent on expected updates to googleAuthR (or use of gargle).

@MarkEdmondson1234
Author

Thanks again @nealrichardson it sounds like we're close :)

Logging - ok ok, I've turned it down :) This should just be a documentation issue, to get more/less logs you can use options(googleAuthR.verbose) - I set the NLP feedback down from 3 to 2. Default is 3, so you should see less, and I can get my more verbose logs back if I set options(googleAuthR.verbose = 2)

The errors in cbind_all(x) sound like they should have halted execution of the function, yet I got a result--what did I even get? And the warning about API Data failed to parse. Returning parsed from JSON content. Use this to test against your data_parse_function. doesn't sound like something a user of googleLanguageR should ever see or have to reason about.

Would you be able to share the data that got you that bug? The messaging you see should never be seen by a final end user but does output an object that can be used to see what went wrong when the response wasn't parsed properly.

The logs are useful to see that 0-length is being passed in:

2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'

Will stop that.

The output would be more useful if it were a list of data.frames/tbls where possible (if I understand the return correctly, it would be most reasonable for "documentSentiment" and "language"), rather than lists of lists of tbls. I passed the function a column of text to do sentiment analysis on, and I'd like a vector of equal length (or data.frame of the same number of rows) back for the sentiment. I'd rather not have to iterate over a list of lists to extract that vector.

Is this referring to when you only request one analysis, such as nlp_type = analyzeSentiment ? I don't see how to return just a data.frame of all the analysis given back otherwise (they are varying lengths and columns per text sent in)

> lang <- gl_translate_detect(df$comments)
Detecting language: 10126 characters [...SO MUCH MESSAGING...]
2017-08-21 19:13:22> Request Status Code: 400
Error: JSON fetch error: Too many text segments

That error should have been caught and turned into a multiple API request, may I take a look at the data you sent in to see what happened?
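
The intended fallback can be sketched roughly as follows. This is not the package's actual code: `fake_detect()` stands in for the batch request, and its segment limit and error text are illustrative assumptions.

```r
# Stand-in for a batch endpoint that rejects more than `max_segments` strings.
# The limit and the error text are illustrative, not the real API's values.
fake_detect <- function(strings, max_segments = 3) {
  if (length(strings) > max_segments) {
    stop("Too many text segments")
  }
  data.frame(text = strings, n_chars = nchar(strings), stringsAsFactors = FALSE)
}

# Try one batch request first; on failure, retry in chunks and stack the results
detect_with_fallback <- function(strings, chunk_size = 3) {
  tryCatch(
    fake_detect(strings),
    error = function(e) {
      message("Batch request failed (", conditionMessage(e), "); splitting into chunks")
      chunks <- split(strings, ceiling(seq_along(strings) / chunk_size))
      do.call(rbind, unname(lapply(chunks, fake_detect)))
    }
  )
}

res <- detect_with_fallback(c("one", "two", "three", "four", "five"))
nrow(res)
```

A real implementation would probably want to match on the specific error before retrying, so that unrelated failures (such as authentication problems) are not retried in chunks.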

Dependencies/installation

Hmm, to date I have kept tidyverse out of my packages due to its weight, even though I'm a big fan of it interactively, but thought I'd test the waters, particularly as rOpenSci embraces it in several of its packages.

Is adoption of tidyverse wide enough now that it can be assumed to be on every R user's computer already? I guess "not yet" is the answer :) I think the automatic type checking etc. of dplyr makes for safer code than the base alternatives, but dplyr in particular is weighty so I'll look to keep purrr and tibble but remove dplyr

Auth messages

Thanks for that, that is something I'll change to be less confusing in googleAuthR

Whilst I await gargle eagerly so I can move off the authentication responsibilities to that, the "AuthoR" part of googleAuthR will remain such as the API function factory and the skeleton package building, so any migration of code to gargle will be part of the next major version.

Tests

The rate limiting was stopped by that bug I raised on httptest with too large responses, but should be triggered now. I'll add more tests to try and cover the bugs raised on this thread and elsewhere, and going forwards.

@nealrichardson

@MarkEdmondson1234,
Sure, I've shared the text that threw these errors here. Just readRDS that file.

For gl_nlp, there are six strings that cause errors, as far as I can tell, and looking at them, I don't see what's special about them. You can see which they are by doing

comments <- readRDS(filename)
sentiment <- gl_nlp(comments)
comments[vapply(sentiment, is.null, logical(1))]

As for the shape of the return from gl_nlp, here's a (non-tidy) way of reshaping what currently returns to match what I'd expect:

out <- sapply(c("sentences", "tokens", "entities", "documentSentiment", "language"), 
    function (n) {
        results <- lapply(sentiment, "[[", n)
        if (n == "documentSentiment") {
            results <- do.call("rbind", results)
        } else if (n == "language") {
            results <- unlist(results)
        }
        return(results)
    }, simplify=FALSE)

That would make it a lot more natural to work with documentSentiment especially.

Re: dplyr, I had it installed, but it didn't match my Rcpp version. Apparently I had updated one at some point but not the other. Just an example of the kind of dependency complexity that it's all too easy to get into in any language but especially in R. And I agree that the type checking it provides is good, but the Google API should provide a strong enough data contract that you know what types you should be getting in the API responses.

Re: auth, maybe you can just catch the googleAuthR error and raise a different, more appropriate one (like one that points you at gl_auth instead of gar_auth).
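
A minimal sketch of that idea in base R. `with_auth_hint()` and `unauthenticated_call()` are hypothetical names, and matching on the text "Invalid token" is an assumption based on the error shown earlier in this thread.

```r
# Stand-in for an API call that fails the way an unauthenticated session would
unauthenticated_call <- function() {
  stop("Invalid token")
}

# Hypothetical wrapper: translate the low-level auth error into package-level advice
with_auth_hint <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      if (grepl("Invalid token", conditionMessage(e), fixed = TRUE)) {
        stop("Not authenticated. Run gl_auth() with the path to your JSON ",
             "credentials file before calling the API.", call. = FALSE)
      }
      stop(e)  # re-raise anything that is not an auth problem
    }
  )
}

msg <- tryCatch(with_auth_hint(unauthenticated_call()),
                error = function(e) conditionMessage(e))
msg
```

Matching on message text is brittle; if the underlying error carried a condition class, catching that class would be the more robust choice.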

Re: tests, you don't actually need a big mock file to trigger the behavior you're trying to test. You can delete most/all of the POST body from your mock (v2-38634c-POST.R), which makes the file small again. All that matters for httptest is that the filename matches the request/body, so it will load this file and get the "400 Text too long" error response you're trying to test the handling of.

Speaking of that .R mock file, it turns out that we've exposed your bearer token again :( The .json mocks that httptest records are clean, but you got a .R file because you were testing a non-200 response behavior, and the full httr response object contains your request headers, as we saw previously in your old .rds cache. Very sorry about that. So you'll want to kill that token again, and for good measure, you can delete the token string from v2-38634c-POST.R--replace it with something else, or nothing.
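
Until redaction is automatic, a token can be scrubbed from a recorded file by hand or with a small script. The sketch below uses only base R; `redact_bearer()` is a hypothetical helper, and the pattern is an assumption about how the header appears in the recorded file.

```r
# Replace anything that looks like a bearer token with a placeholder.
# The pattern is an assumption about how the header appears in the recorded file.
redact_bearer <- function(lines) {
  gsub("Bearer [A-Za-z0-9._~+/=-]+", "Bearer REDACTED", lines)
}

# Made-up example lines, not the contents of a real mock file
recorded <- c(
  'Authorization = "Bearer ya29.EXAMPLE-not-a-real-token_123"',
  'status_code = 400L'
)
redact_bearer(recorded)
```

Applied to a file, this would be something like `writeLines(redact_bearer(readLines(path)), path)`, where `path` points at the mock. Redacting the file does not make an already published token safe, so revoking it is still necessary.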

If it's any consolation, this has spurred me to finally implement token redacting by default in httptest's capture_requests context. I've been bitten by this before too and have had to hand-delete auth tokens from my mocks, which is no fun, but I hadn't come up with a good, clean way to do the sanitizing. I think I have a good approach now--branch started here, if you want to follow along/try it out. So hopefully this won't ever happen again to any of us.

@noamross
Contributor

Looks like things are coming along. Just a quick note about tidyverse dependencies: we don't take a strong position one way or the other. My personal experience is that, from a user perspective things are relatively smooth and it is usually installed, but from a developer perspective the rapid iteration and increasingly large number of connected packages result in greater maintenance burden.

@MarkEdmondson1234
Author

I ran through the above today and the latest version has addressed them, I think:

NLP output

Thanks very much for the comments dataset, I found it was triggering some weird API responses so dealt with those which would have affected the output shape. (many-to-one entities for one word - i.e. US Army = US and US Army)

All the 0-length strings in there are now skipped rather than wasting an API call.

The NLP output now puts each result into its own list. I found a neat purrr way to parse based on Neal's code:

  the_types <- c("sentences", "tokens", "entities", "documentSentiment")
  the_types <- setNames(the_types, the_types)
  out <- map(the_types, ~ map_df(api_results, .x))
  out$language <- map_chr(api_results, ~ if(is.null(.x)){ NA_character_ } else {.x$language})

The NA is there for language so it outputs something, so you can compare with your original input, otherwise it is a different length and you don't know which it applies to.

I think I'd like to keep the purrr/tibble dependency, I can always go back to base if it proves too troublesome in the future, and it does make nicer looking code.

I'm afraid I did put a bit of logging back in when calling the API with the number of characters passed, I just found it unnerving to watch a blinking cursor for long calls, and I think it can stop expensive mistakes happening for end users.

So the comment file parses into this:

all_types <- gl_nlp(comments)
## a lot of logs letting you know you are paying for calls to the API ;)

str(all_types, 1)
#List of 5
# $ sentences        :'data.frame':	242 obs. of  4 variables:
# $ tokens           :'data.frame':	2108 obs. of  17 variables:
# $ entities         :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	427 obs. of  7 variables:
# $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	189 obs. of  2 variables:
# $ language         : chr [1:1000] NA NA NA NA ...

If you want just sentiment:

sentiment <- gl_nlp(comments, nlp_type = "analyzeSentiment")
## a lot of logs letting you know you are paying for calls to the API ;)

str(sentiment, 1)
#List of 3
# $ sentences        :'data.frame':	242 obs. of  4 variables:
# $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	189 obs. of  2 variables:
# $ language         : chr [1:1000] NA NA NA NA ...

Translate error

I forgot to put the tryCatch in for the detect language too, so that is working as intended now:

detected <- gl_translate_detect(comments)
# 2017-08-31 14:00:06 -- Detecting language: 10120 characters
# Request failed [400]. Retrying in 1.7 seconds...
# Request failed [400]. Retrying in 3.8 seconds...
# 2017-08-31 14:00:13> Request Status Code: 400
# 2017-08-31 14:00:13 -- Attempting to split into several API calls
#2017-08-31 14:00:13 -- Detecting language: 0 characters
#2017-08-31 14:00:13 -- Detecting language: 0 characters
#2017-08-31 14:00:13 -- Detecting language: 0 characters
#2017-08-31 14:00:13 -- Detecting language: 12 characters
#2017-08-31 14:00:13 -- Detecting language: 19 characters
#2017-08-31 14:00:14 -- Detecting language: 0 characters

I would use gl_translate more though, as it's the same charge and detects the language in the same call.

googleAuthR messages

I got rid of a few that have built up over the years, should be a lot less now.

Tests

I'll use the comments file to expand these out a bit. Thanks for the warning about tokens, I've redacted and changed the key.

The too many characters per 100 seconds test I'll have a think about, perhaps I can just get rid of R's counting method and rely on the HTTP error, in which case the response can be tested against.

@nealrichardson

Ok. Errors seem resolved, so that's good. I still think the (new) gl_nlp return isn't quite right in this case though. In skipping the empty/missing strings, now the return value isn't a similar shape to the input. My input text vector is length 1000, but only "language" in the return value is length 1000. "documentSentiment" should have 1000 rows too, with 0's for the rows where the strings were empty (that's what the API returned). And for "sentences", "tokens", "entities", in the fully stacked tibbles the function now returns, there's no way (unless I'm missing it) to connect the rows in the response back to the input data. To resolve that, you could add another column to each that contained the input string, or you could add an integer index, or you could leave "sentences", "tokens", and "entities" as lists of tibbles instead of single tibbles each.
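The integer-index option could look roughly like the sketch below. It is only an illustration in base R: `per_input` is invented data, and `stack_with_index` and the `input_index` column are hypothetical names, not part of googleLanguageR.

```r
# Hypothetical per-input results: one data.frame per input string, NULL if empty
per_input <- list(
  data.frame(content = c("Thank you.", "It helps."), score = c(0.3, 0.2),
             stringsAsFactors = FALSE),
  NULL,
  data.frame(content = "great survey", score = 0.8, stringsAsFactors = FALSE)
)

# Tag each row with the position of the input it came from, then stack
stack_with_index <- function(results) {
  tagged <- lapply(seq_along(results), function(i) {
    res <- results[[i]]
    if (is.null(res) || nrow(res) == 0) return(NULL)
    cbind(input_index = i, res, stringsAsFactors = FALSE)
  })
  do.call(rbind, tagged)
}

stacked <- stack_with_index(per_input)
stacked$input_index
```

Because every stacked row carries `input_index`, the rows can be split or merged back onto the original vector even when one input yields several sentences.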

Do I read correctly Detecting language: 0 characters that means that gl_translate_detect will also make requests for empty strings?
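One way to avoid sending empty strings while still returning something the same length as the input is sketched below. This is an assumption about how it could be done, not how the package does it: `call_on_nonempty` is a hypothetical wrapper and `fake_detect` is a stand-in for a vectorised API call.

```r
# Hypothetical wrapper: call `f` only on non-empty strings, return NA elsewhere
call_on_nonempty <- function(strings, f) {
  ok <- !is.na(strings) & nzchar(trimws(strings))
  out <- rep(NA_character_, length(strings))
  if (any(ok)) out[ok] <- f(strings[ok])
  out
}

# `fake_detect` stands in for a real API call; it just labels everything "en"
fake_detect <- function(x) rep("en", length(x))

detected <- call_on_nonempty(c("", "great survey", NA, "  "), fake_detect)
detected
```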

Sorry my test data is so misbehaved that it triggers all of these issues, though that probably makes it good for testing since the world is a messy place.

@MarkEdmondson1234
Author

No worries, it's a good test :)

The original text (that was passed) is available in the $sentences$content list. After a few plays with it I'm thinking it would actually be better to throw an error if the input text is not in a suitable form (e.g. a zero-length "" like in the comments test set) and put the onus on the user to supply clean data. Do you think that's fair?
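As a rough sketch of what such an up-front check might look like (the helper name `check_text_input` and its messages are hypothetical, not something in the package):

```r
# Hypothetical helper, not part of googleLanguageR: stop early on unusable input
check_text_input <- function(string) {
  if (!is.character(string)) {
    stop("Input must be a character vector", call. = FALSE)
  }
  # Flag NA, "" and whitespace-only entries
  bad <- which(is.na(string) | !nzchar(trimws(string)))
  if (length(bad) > 0) {
    stop("Empty or missing text at position(s): ",
         paste(bad, collapse = ", "), call. = FALSE)
  }
  invisible(string)
}

# Clean input passes silently
check_text_input(c("great survey", "thanks"))

# Dirty input raises an error naming the offending positions
res <- tryCatch(
  check_text_input(c("ok", "", NA)),
  error = function(e) conditionMessage(e)
)
res
```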

The problem I'm seeing in trying to return something the same shape as the input vector is that the API returns a variable number of results for each type - for instance, for the comments data set it returns 242 valid sentences, 2108 tokens (words) and 427 entities. The only way I can see to associate each result with its original input would be to go back to the format it was in before (i.e. each API response has its own set of lists).

Perhaps it's just a documentation issue - if you want results per element of your own list, you can lapply/map the API call; if you want to analyse all the text together, that's what you get by default.

e.g.

comments <- readRDS("comments.rds")

## read all the text into single tibbles
all_analysis <- gl_nlp(comments)
str(all_analysis, 1)
List of 6
 $ sentences        :'data.frame':	242 obs. of  4 variables:
 $ tokens           :'data.frame':	2108 obs. of  17 variables:
 $ entities         :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	427 obs. of  7 variables:
 $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	189 obs. of  2 variables:
 $ language         : chr [1:1000] NA NA NA NA ...
 $ text             : chr [1:1000] NA NA NA NA ...

## get an analysis per comment
list_analysis <- lapply(comments, gl_nlp)
str(list_analysis[14:15], 2)
List of 2
 $ :List of 2
  ..$ language: chr NA
  ..$ text    : chr NA
 $ :List of 6
  ..$ sentences        :'data.frame':	1 obs. of  4 variables:
  ..$ tokens           :'data.frame':	2 obs. of  17 variables:
  ..$ entities         :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1 obs. of  5 variables:
  ..$ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1 obs. of  2 variables:
  ..$ language         : chr "en"
  ..$ text             : chr "great survey"

@nealrichardson

So, the use case I was envisioning that got us here was that I have this survey, and it has some open-text response variables in it, and I want to do sentiment analysis on that text and see what characteristics are associated with positive/negative sentiment. Naively, something like:

df$respondent_sentiment <- gl_nlp(df$comments)$documentSentiment$score
summary(lm(respondent_sentiment ~ age + gender + education + ideology, data=df))

I can't do that (at least not as cleanly) if "documentSentiment" may have fewer rows than the input data.

Here's a patch that gives me the shape of output I'd want:

diff --git a/R/natural-language.R b/R/natural-language.R
index 9ce55fa..9de3716 100644
--- a/R/natural-language.R
+++ b/R/natural-language.R
@@ -94,9 +94,10 @@ gl_nlp <- function(string,
   .x <- NULL
 
   ## map api_results so all nlp_types in own list
-  the_types <- c("sentences", "tokens", "entities", "documentSentiment")
+  the_types <- c("sentences", "tokens", "entities")
   the_types <- setNames(the_types, the_types)
-  out <- map(the_types, ~ map_df(api_results, .x))
+  out <- map(the_types, ~ map(api_results, .x))
+  out$documentSentiment <- map_df(api_results, ~ if(is.null(.x)){ data.frame(magnitude=NA_real_, score=NA_real_) } else {.x$documentSentiment})
   out$language <- map_chr(api_results, ~ if(is.null(.x)){ NA } else {.x$language})
 
   compact(out)

Using that, I get a shape like this:

> sent <- gl_nlp(comments)
> str(sent, 1)
List of 5
 $ sentences        :List of 1000
 $ tokens           :List of 1000
 $ entities         :List of 1000
 $ documentSentiment:'data.frame':	1000 obs. of  2 variables:
 $ language         : chr [1:1000] NA NA NA NA ...

I don't have a clear intuition of what I'd do with "sentences", "tokens", or "entities" in this context, so I left them as a list of tibbles (or NULL if the input is missing/empty, though arguably that should be a 0-row tibble). Leaving them as separate tibbles per input row lets the user decide if they want to do something separately to each or to stack them all up like you have it now. But if you stack, you can't go back the other way because there's nothing in "sentences", "tokens", or "entities" that connects back to the original row of input. $sentences$content doesn't serve that function if the input value contains more than one sentence:

> comments[1000]
[1] "Thank you so much for this survey. It gives me a chance to explain what im thinking about the government."
> sent$sentences[[1000]]
                                                                 content
1                                     Thank you so much for this survey.
2 It gives me a chance to explain what im thinking about the government.
  beginOffset magnitude score
1           0       0.3   0.3
2          35       0.2   0.2

But "documentSentiment" you can return in a single tibble/data.frame because the return is always one row per input entry.
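A small self-contained sketch of that one-row-per-input idea, using base R and invented data (the `fake_results` values are made up, and NULL stands for an empty input or failed call):

```r
# Hypothetical parsed responses; NULL where the input was empty or the call failed
fake_results <- list(
  list(documentSentiment = data.frame(magnitude = 0.5, score = 0.3)),
  NULL,
  list(documentSentiment = data.frame(magnitude = 0.9, score = -0.6))
)

# One row per input: fill gaps with NA instead of dropping them
sentiment_rows <- lapply(fake_results, function(x) {
  if (is.null(x)) {
    data.frame(magnitude = NA_real_, score = NA_real_)
  } else {
    x$documentSentiment
  }
})
sentiment <- do.call(rbind, sentiment_rows)

# The result lines up with the input, so it can be assigned as a column
df <- data.frame(comments = c("nice survey", "", "too long"),
                 stringsAsFactors = FALSE)
df$respondent_sentiment <- sentiment$score
df
```

Rows with NA sentiment are then dropped by `lm()` under its default NA handling, rather than silently misaligning the scores with the respondents.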

So, that's my use case and what I'd expect the output to look like. Does that sound reasonable to you?

Finally, it turns out that you still depend on dplyr. I only discovered this because I installed a development version of R this week and had to delete my R package library and reinstall everything in the process (a real joy). Apparently I had not yet reinstalled dplyr, and repeating the gl_nlp exercise, I got a bunch of errors like this:

2017-09-10 21:33:31 -- annotateText: 4 characters
Error : `map_df()` requires dplyr
In addition: Warning message:
In call_api(the_body = body) :
  API Data failed to parse.  Returning parsed from JSON content.
                    Use this to test against your data_parse_function.

So you should either replace your two calls to map_df with something else (perhaps the ol' do.call(rbind, ...) trick), or explicitly import dplyr so that some poor soul like me doesn't successfully install googleLanguageR but then fail to be able to use it.
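For reference, the do.call(rbind, ...) approach might be sketched like this in base R. The name `base_map_df` and the `fake_results` data are hypothetical and only serve to illustrate the idea.

```r
# Hypothetical base-R stand-in for map_df: apply f to each element, then row-bind
base_map_df <- function(x, f, ...) {
  pieces <- lapply(x, f, ...)
  # Drop NULLs so rbind only sees data.frames
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

# Invented data standing in for parsed API responses
fake_results <- list(
  list(entities = data.frame(name = "survey", salience = 0.7,
                             stringsAsFactors = FALSE)),
  NULL,
  list(entities = data.frame(name = c("government", "chance"),
                             salience = c(0.6, 0.4),
                             stringsAsFactors = FALSE))
)

entities <- base_map_df(fake_results, function(r) r$entities)
entities
```

One caveat with this trick: unlike dplyr::bind_rows, base rbind expects every data.frame to have the same column names, so responses with differing columns would need aligning first.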

@MarkEdmondson1234
Author

Definitely adding you as a contributor @nealrichardson - thanks so much for the code explaining the data shape, it feels the package is already at my usual 0.3.0 given this testing/feedback.

I've updated the changes and dropped the map_df which needed dplyr for a base version:

my_map_df <- function(.x, .f, ...){

  .f <- purrr::as_mapper(.f, ...)
  res <- map(.x, .f, ...)
  Reduce(rbind, res)

}

All tests changed to reflect the new (final?) NLP output shape.

@nealrichardson

Looks good to me! I (finally!) have checked the "I recommend approving this package" box on the list above. I do still encourage you to push for higher test coverage, but I won't condition my approval on that. Just something for you to keep in mind going forward.

Excellent work putting together and enriching this package. It provides a nice R interface around a useful API. I look forward to the occasion when I should need to use it.

@MarkEdmondson1234
Author

Thanks! I will keep iterating on the tests and documentation as I go.

May I add @nealrichardson and @jooolia as contributors? Are the below details ok?

c(person("Neal", "Richardson", role = "ctb",  email = "neal.p.richardson@gmail.com"),
  person("Julia", "Gustavsen", role = "ctb", email = "j.gustavsen@gmail.com"))

@noamross
Contributor

Approved! Thank you, @MarkEdmondson1234, @nealrichardson, and @jooolia for an excellent package and review process that has made it even better!

@MarkEdmondson1234, we generally encourage people to use the rev role for reviewers, as we see this role as distinct from contributors. You can see an example here: https://github.com/ropensci/bikedata/blob/master/DESCRIPTION#L7-L10 . I note that this has become an allowed role only recently, so you'll need to build the package using the development version of R for submittal to CRAN.

Also, you are getting this NOTE from R CMD check, which I'd recommend fixing before submittal to CRAN but won't hold up our end:

checking for portable file names ... NOTE
Found the following non-portable file paths:
  googleLanguageR/tests/testthat/mock/language.googleapis.com/v1/documents-annotateText-45bb10-POST.json
  googleLanguageR/tests/testthat/mock/language.googleapis.com/v1/documents-annotateText-89db45-POST.json
  googleLanguageR/tests/testthat/mock/speech.googleapis.com/v1/speech-longrunningrecognize-e400d0-POST.json
  googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2-0055ce-POST.json
  googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2-04272d-POST.json
  googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2-36b63d-POST.json
  googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2/detect-b110ba-POST.json
  googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2/detect-c6a946-POST.json
  googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2/languages-10be03.json
  googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2/languages-bfb420.json
Tarballs are only required to store paths of up to 100 bytes and cannot
store those of more than 256 bytes, with restrictions including to 100
bytes for the final component.
See section ‘Package structure’ in the ‘Writing R Extensions’ manual.
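A quick way to spot such paths before running R CMD check could look like the sketch below. It assumes the 100-byte threshold quoted in the NOTE; `flag_long_paths` is a hypothetical helper, and the paths are taken from this thread.

```r
# Hypothetical helper: return the paths longer than `limit` bytes
flag_long_paths <- function(paths, limit = 100L) {
  paths[nchar(paths, type = "bytes") > limit]
}

paths <- c(
  "googleLanguageR/tests/testthat/mock/translation.googleapis.com/language/translate/v2/detect-b110ba-POST.json",
  "googleLanguageR/R/natural-language.R"
)

too_long <- flag_long_paths(paths)
too_long
```

In a real package the input could come from something like `file.path("googleLanguageR", list.files("googleLanguageR", recursive = TRUE))`, run from the directory above the package.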

To-dos:

  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
  • Add the rOpenSci footer to the bottom of your README
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
  • Update any links in badges for CI and coverage to point to the ropensci URL. (We'll turn on the services on our end as needed)

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). Let us know if you are interested, and if so @stefaniebutland will be in touch about it.

@MarkEdmondson1234
Author

Thanks @nealrichardson ! It was all in all a very positive experience.

I have updated and transferred, and would be happy to do a blog post.

@MarkEdmondson1234
Author

May I ask where the package website will sit now? I can maintain a fork on my GitHub as that CNAME is set up with pages for code.markedmondson.me, or it can sit wherever rOpenSci has one.

@noamross
Contributor

@MarkEdmondson1234 rOpenSci doesn't have a domain for package websites. It's fine that it lives on both your personal page and ropensci.github.io/googleLanguageR (as it is already).

@stefaniebutland
Member

Hi @MarkEdmondson1234. Great to hear you'd like to write a post about your pkg. I've sent you some details via DM on rOpenSci slack

@MarkEdmondson1234
Author

A quick note for @noamross - the DESCRIPTION based on the example at https://github.com/ropensci/bikedata/blob/master/DESCRIPTION#L7-L10 got rejected from CRAN as it was missing angle brackets around the URL. I would modify the example to:

person("Bea", "Hernández", role = "rev",
       comment = "Bea reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/116>"),
person("Elaine", "McVey", role = "rev",
       comment = "Elaine reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/116>"),

@noamross
Contributor

Thanks, @MarkEdmondson1234!
