Submission: googleLanguageR #127
Editor checks:
Editor comments

Thanks for your submission @MarkEdmondson1234! A few things I'd like to clear up before assigning reviewers:
Finally, here are the outputs from
Great thanks! Have sorted out the easy wins in this commit https://github.com/MarkEdmondson1234/googleLanguageR/commit/5e959de9c03f507d9ada8dc0bfa16486c31dddb7

Testing

Ok this is sorted now the cache tests work.

Translations improving breaking tests
Hmm, that translation has improved since the test was written! I will look at how to fix the model version or something.

The goodpractice results

Ah, I fixed those before the submission, but forgot to commit. Doh.
Thanks, Mark. Yes, that's the stringdist lib I was thinking of. Do let me know when you think you've made sufficient changes to send to reviewers. We'd like to make sure (a) that the package is going to be stable during the review so the reviewers have a constant code base to work with, and (b) that there aren't major anticipated changes after review. It's fine if you are using the dev version of googleAuthR if we expect this to be stable. Also, tests currently fail without this auth, so I'm not sure the mock setup is working properly. (It looks like interactive login is still requested.) This needs to be fixed before sending to review. Some testing/auth considerations that can also wait until other reviewers weigh in: I'd make putting the credentials file path in
Hi @noamross, hope you got some holidays, just back from one myself :) I think I have sorted all the testing issues. I now have two types of tests: unit tests that use the mock caches, and integration tests that will call the API if you have The unit tests need no authentication, and pass on Travis/Codecov. The integration tests are also using I got some weird errors regarding Travis not being able to encode the Japanese text used in the demos that I'd like sorted out, but it's not critical; I removed the Japanese examples for now.
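The two-tier test layout described above could be sketched like this. This is a hedged illustration: the helper name `skip_if_no_auth()` is invented for the example, and `GL_AUTH` is assumed (from the thread) to be the environment variable holding the path to the credentials file.

```r
# Hedged sketch of the two-tier test setup described above.
# skip_if_no_auth() is an illustrative helper name; GL_AUTH is assumed
# to point at the service-account JSON credentials file.
library(testthat)

skip_if_no_auth <- function() {
  if (!nzchar(Sys.getenv("GL_AUTH"))) {
    skip("No GL_AUTH credentials found - skipping integration test")
  }
}

test_that("unit test: runs against the mock cache, no auth needed", {
  expect_true(TRUE)  # placeholder for a mock-cache expectation
})

test_that("integration test: calls the live API when GL_AUTH is set", {
  skip_if_no_auth()
  # live gl_nlp()/gl_translate_*() calls would go here
})
```

With this split, CI runs the unit tests unconditionally, while the integration tests silently skip unless credentials are present locally.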
Thanks, @MarkEdmondson1234. Tests are all passing on my end and Also, you can be the first to test our new dynamic RO package status badge! You can add the following to your README.
It updates daily and shows "Under review" for packages at stage 2 (seeking reviewers) or higher and "Peer reviewed" once you are approved.
Also, FYI, some typos from
Thanks! Shiny new badges. :) It says "unknown" for now, if that is right.
Reviewers assigned: Reviewer: @jooolia
A friendly reminder to @jooolia and @nealrichardson that your reviews are due in one week.
Review summary

Overall, I think the package is a nice contribution to the R community. The language APIs it wraps are quite powerful, and I can easily see how having a natural way to access them could greatly enhance the kind of work that one could do in R. I do have some suggestions for improvements to the core functions of the package, as well as for the tests and documentation. A few of the suggestions are particularly significant.

Installation/Setup/Readme
Using the package

My general feedback on the key functions is that they could be improved by being more consistent in behavior among themselves, and by having documentation about the responses they return.

gl_nlp

I wanted to take a column of text data from a survey I had, free-form text responses that respondents gave, and do sentiment analysis on it.
I got a big long spew of log messages, and then a validation error that my input needed to be a length-1 character vector. So:
I then selected a single element from that column of text and sent it, and I got a quick response. Then I had to figure out what I got back.
gl_language_detect

I gave it the same column of text data as before.
which errored with "Total URL must be less than 2000 characters"
I selected a subset of the column and retried:
gl_language_translate

Interestingly, this function works differently from the others. First, my input was apparently too long again, but instead of erroring, it warned me and truncated:
Second, as I suggested that gl_translate_detect should, this one does return a data.frame. Consistency in a package is important. I favor the data.frame return for these functions. And I think I favor erroring and forcing the user to send a shorter request rather than automatically truncating because of the specter of rate limiting--I'd rather have the say on what requests are going through rather than having the package truncate my text potentially mid-word and burning through some of my daily character/request limit. Alternatively, if you get a vector input and can chunk the requests cleanly by vector elements (first 2 elements in one request, next 5 in the next, etc.) and then return a single object, perhaps it would be nicer--that way, you're concealing the constraints of the API from the R user, who shouldn't have to think about URL character lengths. Regardless, all functions should handle input length issues consistently.

gl_speech_recognise
Tests
API Usage

These might point more to the capabilities of
Style/minutiae
Hope this is helpful, and please let me know if you have any questions that I can clear up.

Package Review Checklist

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
Documentation

The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 9
Thank you for the excellent and thorough review, @nealrichardson! Quick request - could you append the reviewer template to your review? It just has some quick boxes to check.
Wonderful feedback, thanks so much.
Hmm, blast. Back to the drawing board then; I think it's going to have to be just local tests until I figure that out. If https://github.com/nealrichardson/httptest helps me out, brilliant.
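For reference, the httptest record/replay pattern mentioned above looks roughly like this, using httptest's documented capture_requests() and with_mock_api() workflow; the gl_nlp() call shown is illustrative.

```r
# Rough sketch of the httptest record/replay workflow mentioned above;
# the gl_nlp() example call is illustrative.
library(httptest)
library(testthat)

# Run once, with real auth and network, to record fixtures to disk:
# capture_requests({
#   gl_nlp("The cat sat on the mat")
# })

# Tests then replay the recorded responses with no network or auth:
with_mock_api({
  test_that("gl_nlp parses a recorded response", {
    result <- gl_nlp("The cat sat on the mat")
    expect_type(result, "list")
  })
})
```

The recorded fixtures can be committed alongside the tests, so CI never needs credentials.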
Hi, my review will be one day late. :( Had a weekend that was unexpectedly without internet access. I will post my review tomorrow morning.
Hi! Thanks for the opportunity to review this package. I found the package quite interesting and learned some new things by looking over the code and tests. Please see below for my review.

Review googleLanguageR

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
Documentation

The package includes all the following forms of documentation:
library(googleLanguageR)
The Readme.md has a brief introduction with a few examples, but the package could be improved with vignettes illustrating more interesting and elaborate use cases. Also I think that the Readme.md could be improved by creating a Readme.Rmd since there were a few errors and inconsistencies in the code in the Readme.md. For example:
All functions have help pages, however the explanations are very minimalist.
There were very minimal examples in the exported functions. Specifically:
Current guidelines are very nice and encouraging. It would be excellent to add in instructions about how to contribute in a respectful way, e.g. a Contributor Covenant Code of Conduct.

Functionality
Further notes in section on code review.
There are several different ways that functions are named in this package. There is Camel case, package.verb naming and some others, but not much of the recommended snake_case. Is this to match the Google API that you are using? If so, maybe all the functions and (as much as possible) the variables could be named in the same way for consistency.

Readme.md
Code review:

auth.R - ok
googleLanguageR.R - ok
natural-language.R
speech.R
translate.R
gl_translate_detect("katten sidder på måtten")
Wanted to see how it would do with html pages.

le_temps_ven = readLines('https://www.letemps.ch/monde/2017/08/01/venezuela-deux-principaux-opposants-ont-arretes')
result_le_temps <- gl_translate_language(le_temps_ven)
Not clear what format the html should be in. Would be useful to have an example. I will try a site without https.

library(RCurl)
slate_rcurl <- getURL("http://www.slate.fr/story/149310/adieu-scaramucci-plus-beau-succes-administration-trump",
ssl.verifypeer = FALSE)
result_slate <- gl_translate_language(slate_rcurl)
result_slate
Does not detect the appropriate source language. result_slate_fr <- gl_translate_language(slate_rcurl,
source = "fr")
result_slate_fr
Success with translation when source is specified. I was expecting the html to be stripped from the text when I first saw the option for using html. Could that be an option so that the text is more likely to fit within the constraints of the number of characters?

basic_html <- '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>A very basic HTML Page</title>
</head>
<body>
Here I have created a very simple html page. Will it be translated?
</body>
</html>'
basic_html_tl <- gl_translate_language(basic_html,
target = "fr",
format = "html")
basic_html_tl
Ok this works. Neat, however I think there should be more information about how to use this with more complex pages.

utilities.R

It would be useful to have a bit more documentation on what these functions are doing and why. Inline comments could be useful here.

test_unit.R
Final approval (post-review)
Estimated hours spent reviewing: 6

Review Comments

Overall I think that this package is very useful and most functions work as intended. I think the package could be greatly improved by more thorough documentation and examples.
Thanks for your thorough review, @jooolia! @MarkEdmondson1234, let us know when you believe you have addressed these reviews so that reviewers can take a second look, but feel free to ask questions in the meantime.
This is great, thanks so much @jooolia - it may be a few weeks before I respond (as I have my own review to do!) but this will help a tonne.
I think I am ready with my review responses - please head to GitHub to review. Overall it's been really great and I have found it all really useful.
Things I kept despite flagging up:
help!

The CRAN checks all complete locally, but on Travis it complains about I would rather it not build them on Travis, but the options in travis of ...seem to have no effect.
Yep, I've pulled the latest and hope to get back to you by the weekend.
@nealrichardson I forgot to update some tests, so the version now is the best one to review. Worked around the vignette issue as well (by avoiding API calls within them)
Thanks for taking my comments into account and making many changes to the package. Notably I find that all aspects of the documentation are much improved (help pages, examples, vignettes, contributing, etc). I updated my previous comment to show that now I recommend approving this package. 👍 Cheers, Julia
@MarkEdmondson1234, this version of the package is a major step forward. I'm impressed with the new documentation and see lots of improvements in the code as well. In retrying the test examples from my initial review, however, I hit some new issues that I think count as bugs, and I ran into a few sharp corners where the messaging and error handling could be more hospitable to your users. I do think these issues should be addressed before the package is formally accepted and released, and I don't think they'll require much effort on your part to resolve--we're really close. I also note at the bottom a few other suggestions for your consideration that you can take or leave, or decide to revisit after the review process.

New concerns
gl_nlp

Repeating my initial example, it now works on a text vector, sort of:
gl_translate_detect

Trying the same example as before with the full column of data,
It's a different error from before but the same problem--how do I fix it? I don't know how to interpret "JSON fetch error". Repeating the request that worked last time (selecting a few rows) was good and fast, and the response was in a useful shape, much nicer than before.
Friendly suggestions

Here are a few things to consider, or things to look into for the future, but that I don't consider gating issues.

Dependencies/installation

You've added several new package dependencies in this iteration, and I urge you to reconsider doing so. Following the analysis in that blog post, I see that while you declared 4 new dependencies, you're pulling in a total of 8 new packages:
Most of those come with
This may sound academic, but I did have brief trouble installing from source the latest version of googleLanguageR, and I got an ominous-sounding warning about my dplyr and Rcpp versions:
You've only imported 3 functions from

Loading and authenticating

After I resolved my trouble installing the package, I got an unexpected message when I loaded the package:
I have no idea what that means, what a scope is, and why I would want or should have more of them. Makes me wonder whether it's better not to show a startup message at all. Anyway, I decided to proceed with repeating my testing from before, with my column of open text comments from a survey, starting with
Whoops! That's a lot of (redundant?) messaging to tell me that I need to authenticate. Right, I remember now, I have a .json credential file I need to supply. Let me follow the instructions the error message gave me:
Oops! I'm guessing that's in
Apparently not. So this is when I look back at the Readme and do it right.
That could have gone a lot smoother. For the sake of your distracted future users, if you handled the auth failure more gracefully and gave the right helpful message, it would be a lot friendlier.

Tests

I still would ideally want to see higher test coverage. Looks like the majority of the lines not covered are in a rate-limiting function. Particularly since they're not hit most of the time in normal use, it would be nice to have some tests that exercise them just to prevent accidental breakage in the future when you're working on the package.
Thanks for the follow-up, @jooolia and @nealrichardson. @MarkEdmondson1234, I concur with Neal's follow-up requests. As for Neal's suggestions, I think you should address the tests and authentication especially, though authentication might be deferred if you have specific plans that are dependent on expected updates to googleAuthR (or use of gargle).
Thanks again @nealrichardson, it sounds like we're close :)

Logging - ok ok, I've turned it down :) This should just be a documentation issue; to get more/less logs you can use

The errors in cbind_all(x) sound like they should have halted execution of the function, yet I got a result--what did I even get? And the warning about API Data failed to parse. Returning parsed from JSON content. Use this to test against your data_parse_function. doesn't sound like something a user of googleLanguageR should ever see or have to reason about. Would you be able to share the data that got you that bug?

The messaging you see should never be seen by a final end user but does output an object that can be used to see what went wrong when the response wasn't parsed properly. The logs are useful to see that 0-length is being passed in:

2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'
2017-08-20 14:08:37 -- annotateText for '...'

Will stop that.
Is this referring to when you only request one analysis, such as

> lang <- gl_translate_detect(df$comments)
Detecting language: 10126 characters [...SO MUCH MESSAGING...]
2017-08-21 19:13:22> Request Status Code: 400
Error: JSON fetch error: Too many text segments

That error should have been caught and turned into a multiple API request; may I take a look at the data you sent in to see what happened?

Dependencies/installation

Hmm, I have to date kept Is adoption of

Auth messages

Thanks for that, that is something I'll change to be less confusing in Whilst I await

Tests

The rate limiting was stopped by that bug I raised on
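The catch-and-split behaviour described above could be sketched like this. It is only an illustration: call_api() is a stand-in for the package's internal request function, and the error string matches the log shown earlier.

```r
# Illustrative sketch of catching the "Too many text segments" 400
# error and splitting the request into smaller batches; call_api() is
# a stand-in for the package's internal request function.
detect_with_split <- function(texts, call_api) {
  tryCatch(
    call_api(texts),
    error = function(e) {
      if (grepl("Too many text segments", conditionMessage(e)) &&
          length(texts) > 1) {
        # halve the batch and recurse until each request is small enough
        half <- ceiling(length(texts) / 2)
        c(detect_with_split(texts[seq_len(half)], call_api),
          detect_with_split(texts[-seq_len(half)], call_api))
      } else {
        stop(e)
      }
    }
  )
}
```

Recursing on halves keeps each element intact, so no text is ever truncated mid-word.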
@MarkEdmondson1234, For
As for the shape of the return from
That would make it a lot more natural to work with

Re: auth, maybe you can just catch the

Re: tests, you don't actually need a big mock file to trigger the behavior you're trying to test. You can delete most/all of the POST body from your mock (v2-38634c-POST.R), which makes the file small again. All that matters for

Speaking of that .R mock file, it turns out that we've exposed your bearer token again :( The .json mocks that

If it's any consolation, this has spurred me to finally implement token redacting by default in httptest's
Looks like things are coming along. Just a quick note about tidyverse dependencies: we don't take a strong position one way or the other. My personal experience is that, from a user perspective things are relatively smooth and it is usually installed, but from a developer perspective the rapid iteration and increasingly large number of connected packages result in greater maintenance burden.
I ran through the above today and the latest version has addressed them, I think:

NLP output

Thanks very much for the comments dataset; I found it was triggering some weird API responses, so I dealt with those, which would have affected the output shape (many-to-one entities for one word - i.e. US Army = US and US Army). All the 0-length strings in there are now skipped rather than wasting an API call. The NLP output now puts each result into its own list. I found a neat

the_types <- c("sentences", "tokens", "entities", "documentSentiment")
the_types <- setNames(the_types, the_types)
out <- map(the_types, ~ map_df(api_results, .x))
out$language <- map_chr(api_results, ~ if(is.null(.x)){ NA } else {.x$language})

The NA is there for language so it outputs something, so you can compare with your original input; otherwise it is a different length and you don't know which input it applies to. I think I'd like to keep the

I'm afraid I did put a bit of logging back in when calling the API, with the number of characters passed; I just found it unnerving to watch a blinking cursor for long calls, and I think it can stop expensive mistakes happening for end users.

So the comment file parses into this:

all_types <- gl_nlp(comments)
## a lot of logs letting you know you are paying for calls to the API ;)
str(all_types, 1)
#List of 5
# $ sentences :'data.frame': 242 obs. of 4 variables:
# $ tokens :'data.frame': 2108 obs. of 17 variables:
# $ entities :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 427 obs. of 7 variables:
# $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 189 obs. of 2 variables:
# $ language : chr [1:1000] NA NA NA NA ...

If you want just sentiment:

sentiment <- gl_nlp(comments, nlp_type = "analyzeSentiment")
## a lot of logs letting you know you are paying for calls to the API ;)
str(sentiment, 1)
#List of 3
# $ sentences :'data.frame': 242 obs. of 4 variables:
# $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 189 obs. of 2 variables:
# $ language : chr [1:1000] NA NA NA NA ...

Translate error

I forgot to put the

detected <- gl_translate_detect(comments)
# 2017-08-31 14:00:06 -- Detecting language: 10120 characters
# Request failed [400]. Retrying in 1.7 seconds...
# Request failed [400]. Retrying in 3.8 seconds...
# 2017-08-31 14:00:13> Request Status Code: 400
# 2017-08-31 14:00:13 -- Attempting to split into several API calls
#2017-08-31 14:00:13 -- Detecting language: 0 characters
#2017-08-31 14:00:13 -- Detecting language: 0 characters
#2017-08-31 14:00:13 -- Detecting language: 0 characters
#2017-08-31 14:00:13 -- Detecting language: 12 characters
#2017-08-31 14:00:13 -- Detecting language: 19 characters
#2017-08-31 14:00:14 -- Detecting language: 0 characters

I would use

googleAuthR messages

I got rid of a few that have built up over the years; there should be a lot fewer now.

Tests

I'll use the comments file to expand these out a bit. Thanks for the warning about tokens, I've redacted and changed the key. The too many characters per 100 seconds test I'll have a think about; perhaps I can just get rid of R's counting method and rely on the HTTP error, in which case the response can be tested against.
Ok. Errors seem resolved, so that's good. I still think the (new)

Do I read correctly

Sorry my test data is so misbehaved that it triggers all of these issues, though that probably makes it good for testing since the world is a messy place.
No worries, it's a good test :) The original text (that was passed) is available in the

The problem I'm seeing in trying to make the output the same shape as the input vector is that the API returns variable lengths for each element - for instance, for the comments data set it returns the 242 valid sentences; for tokens it's 2108 words; for entities it's 427. The only way I can see to associate each result with its original input would be to go back to the format it was before (e.g. each API response has its own set of lists). Perhaps it's just a documentation issue - if you want it with your own list, you can e.g.

comments <- readRDS("comments.rds")
## read all the text into single tibbles
all_analysis <- gl_nlp(comments)
str(all_analysis, 1)
List of 6
$ sentences :'data.frame': 242 obs. of 4 variables:
$ tokens :'data.frame': 2108 obs. of 17 variables:
$ entities :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 427 obs. of 7 variables:
$ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 189 obs. of 2 variables:
$ language : chr [1:1000] NA NA NA NA ...
$ text : chr [1:1000] NA NA NA NA ...
## get an analysis per comment
list_analysis <- lapply(comments, gl_nlp)
str(list_analysis[14:15], 2)
List of 2
$ :List of 2
..$ language: chr NA
..$ text : chr NA
$ :List of 6
..$ sentences :'data.frame': 1 obs. of 4 variables:
..$ tokens :'data.frame': 2 obs. of 17 variables:
..$ entities :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 5 variables:
..$ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
..$ language : chr "en"
..$ text : chr "great survey"
So, the use case I was envisioning that got us here was that I have this survey, and it has some open-text response variables in it, and I want to do sentiment analysis on that text and see what characteristics are associated with positive/negative sentiment. Naively, something like:
I can't do that (at least not as cleanly) if "documentSentiment" may have fewer rows than the input data. Here's a patch that gives me the shape of output I'd want:
Using that, I get a shape like this:
I don't have a clear intuition of what I'd do with "sentences", "tokens", or "entities" in this context, so I left them as a list of tibbles (or
But "documentSentiment" you can return in a single tibble/data.frame because the return is always one row per input entry. So, that's my use case and what I'd expect the output to look like. Does that sound reasonable to you? Finally, it turns out that you still depend on
So you should either replace your two calls to
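The one-row-per-input shape argued for above can be sketched as follows. fake_sentiment() is a stand-in for the real API call, and the column names follow the str() output shown earlier in the thread.

```r
# Sketch of the proposed shape: exactly one documentSentiment row per
# input element, with NA rows for empty strings where no API call is
# made. fake_sentiment() is a stand-in for the real API response.
comments <- c("great survey", "", "terrible survey")

fake_sentiment <- function(txt) {
  if (!nzchar(txt)) {
    return(data.frame(magnitude = NA_real_, score = NA_real_))
  }
  data.frame(magnitude = 0.5,
             score = if (grepl("great", txt)) 0.8 else -0.8)
}

documentSentiment <- do.call(rbind, lapply(comments, fake_sentiment))
stopifnot(nrow(documentSentiment) == length(comments))
# one row per input, so it can be cbind()-ed back onto the survey data
```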
Definitely adding you as a contributor @nealrichardson - thanks so much for the code explaining the data shape; it feels like the package is already at my usual 0.3.0 given this testing/feedback. I've updated the changes and dropped the

my_map_df <- function(.x, .f, ...){
.f <- purrr::as_mapper(.f, ...)
res <- map(.x, .f, ...)
Reduce(rbind, res)
}

All tests changed to reflect the new (final?) NLP output shape.
Looks good to me! I (finally!) have checked the "I recommend approving this package" box on the list above. I do still encourage you to push for higher test coverage, but I won't condition my approval on that. Just something for you to keep in mind going forward. Excellent work putting together and enriching this package. It provides a nice R interface around a useful API. I look forward to the occasion when I should need to use it.
Thanks! I will keep iterating on the tests and documentation as I go. May I add @nealrichardson and @jooolia as contributors? Are the below details ok?

c(person("Neal", "Richardson", role = "ctb", email = "neal.p.richardson@gmail.com"),
person("Julia", "Gustavsen", role = "ctb", email = "j.gustavsen@gmail.com"))
Approved! Thank you, @MarkEdmondson1234, @nealrichardson, and @jooolia for an excellent package and review process that has made it even better! @MarkEdmondson1234, we generally encourage people to use the

Also, you are getting this NOTE from R CMD check, which I'd recommend fixing before submittal to CRAN but won't hold up our end:
To-dos:
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). Let us know if you are interested, and if so @stefaniebutland will be in touch about it.
Thanks @nealrichardson! It was all in all a very positive experience. I have updated and transferred, and would be happy to do a blog post.
May I ask where the package website will sit now? I can maintain a fork on my GitHub as that
@MarkEdmondson1234 rOpenSci doesn't have a domain for package websites. It's fine that it lives on both your personal page and ropensci.github.io/googleLanguageR (as it is already).
Hi @MarkEdmondson1234. Great to hear you'd like to write a post about your pkg. I've sent you some details via DM on rOpenSci slack
A quick note for @noamross - the DESCRIPTION based on the example at https://github.com/ropensci/bikedata/blob/master/DESCRIPTION#L7-L10 got rejected from CRAN as it missed angle brackets around the URL; I would modify the example to:
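The exact corrected snippet isn't preserved in this thread; as a hedged illustration, a Description field with the angle brackets CRAN expects around bare URLs might look like this (the wording is invented, the repository URL is the package's own):

```
Description: Call Google Cloud machine learning APIs for text analysis.
    See <https://github.com/MarkEdmondson1234/googleLanguageR> for more
    information.
```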
Thanks, @MarkEdmondson1234!
Summary
This package contains functions for analysing language through the Google Cloud Machine Learning APIs
URL for the package (the development repository, not a stylized html page):
https://github.com/MarkEdmondson1234/googleLanguageR
Please indicate which category or categories from our package fit policies this package falls under *and why*? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
Data extraction for the translation and Speech to text API calls, text analysis for the entity detection API call.
Who is the target audience?

Analysts working with sound files and/or text that need translation and/or text analysis such as sentiment, entity detection etc.
How does yours differ or meet our criteria for best-in-category?
The Natural Language API is replicated by many R packages such as
tidytext
but it does offer a bigger trained dataset to work from than what an analyst can supply themselves.

The Speech-to-text: I'm not aware of any R packages that do this, especially with the Google one, as it's pretty new. There may be other APIs that are called via R that do it.
The Translation API, the recently released
cld2
does language detection offline (and was the prompt for me coming across rOpenSci again) and is a recommended way to detect language first, to see if it's text that needs translation. But for the translation itself I'm not aware of any R packages; I'm fairly sure none call the new Google Translate API, as it's fairly new - note this is the one that uses neural nets, not the older one that has been used online and available for ages.

Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md with a high-level description in the package root or in inst/.

Detail
[x] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

Tests fail online as not authenticated:
OK: 0 SKIPPED: 0 FAILED: 3
[x ] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
I don't have a short, lowercase name for the package. This is largely due to a standard set for all the Google API packages I've written, to help aid discoverability in Google search. I've been bitten by similar packages calling the Google Analytics API that were called rga and RGA...
If this is a resubmission following rejection, please explain the change in circumstances:
NA
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: