New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monkeylearn: natural language processing with the monkeylearn API #45

Closed
maelle opened this Issue May 19, 2016 · 45 comments

Comments

Projects
None yet
7 participants
@maelle
Copy link
Member

maelle commented May 19, 2016

    1. What does this package do? (explain in 50 words or less)
      This package provides access to the existing modules of the monkeylearn API, allowing for instance Named Entity Recognition, language detection, attribution of topics to texts.
    1. Paste the full DESCRIPTION file inside a code block below.
Package: monkeylearn
Type: Package
Title: Access to the Monkeylearn API for text classifiers and extractors
Version: 0.1.0
Authors@R: person("Maëlle", "Salmon", email = "maelle.salmon@yahoo.se", role = c("aut", "cre"))
Description: The package allows using some services of Monkeylearn which is
    a Machine Learning platform on the cloud that allows software companies and
    developers to easily extract actionable data from text.
License: GPL (>= 2)
LazyData: TRUE
URL: http://github.com/masalmon/monkeylearn
BugReports: http://github.com/masalmon/monkeylearn/issues
Encoding: UTF-8
RoxygenNote: 5.0.1
Suggests: testthat,
    knitr,
    rmarkdown
Imports: jsonlite, dplyr, httr
VignetteBuilder: knitr
    1. URL for the package (the development repository, not a stylized html page)
      https://github.com/masalmon/monkeylearn
    1. What data source(s) does it work with (if applicable)?
      It works with monkeylearn API
      http://monkeylearn.com/
      Note that one can register using one's Github account.
    1. Who is the target audience?
      Anyone that wants to use Natural Language Processing methods without e.g. first installing Java, or someone that already uses monkeylearn and wants to integrate it in a R workflow (but only for text processing, not for module development)
    1. Are there other R packages that accomplish the same thing? If so, what is different about yours?
      Yes the openNLP package provides several Natural Language Processing methods such as Named Entity Recognition but has a Java dependency that makes its installation difficult for some users.
    1. Check the box next to each policy below, confirming that you agree. These are mandatory.
    • [ x] This package does not violate the Terms of Service of any service it interacts with.
    • [ x] The repository has continuous integration with Travis CI and/or another service
    • [ x] The package contains a vignette
    • [x ] The package contains a reasonably complete README with devtools install instructions
    • [ x] The package contains unit tests
    • [ x] The package only exports functions to the NAMESPACE that are intended for end users
    1. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
    • Are there any package dependencies not on CRAN?
    • [ x] Do you intend for this package to go on CRAN?
    • [ x] Does the package have a CRAN accepted license?
    • Did devtools::check() produce any errors or warnings? If so paste them below.
    1. Please add explanations below for any exceptions to the above:
      It's not an explanation to any exception but I wanted to underline I'm not sure I've chosen the best solution for dealing with the throttle limit (I retry 5 times if there is a 429 API error which is throttle limit), so I'd very much appreciate feedback on this (and on anything else, obviously!).
    1. If this is a resubmission following rejection, please explain the change in circumstances.
@sckott

This comment has been minimized.

Copy link
Member

sckott commented May 19, 2016

Thanks for your submission @masalmon - We'll get back to you soon

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented May 24, 2016

Hi @masalmon, I'll be your handling editor and am currently seeking reviewers. FYI, this submission prompted discussion among the editors on two topics:

  • Fit: Previous API wrappers we've accepted had fit under accessing and depositing in databases, geospatial processing, and plotting services. This package fit none of those, but clearly fit in with a growing theme - extracting data from unstructured sources (pdftools and tabulizer being examples). We've added a new category to our policies document to explicitly include packages, API wrappers or not, in this category, without expanding to "any API wrapper".
  • API Stability: With so many processing-as-a-service startups in the market, its inevitable that APIs will go through periods of instability and maturation, and many may be short-lived, and thus package maintenance requirements may be high or packages may become obsolete. We came to no conclusion from this, but will be keeping an eye on API wrapper packages to see how our system copes with this churn.
@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented May 24, 2016

Cool, thanks. Actually geoparser also extracts data from unstructured text & depends on a processing-as-a-service startup although it's a geospatial thing too.

I know I'm creating work for myself later with API packages. ropenaq also breaks when the API changes... I hope I'll be able to cope & that I'll be able to find replacements for the APIs which will disappear 😸

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented May 25, 2016

Reviewers: @hilaryparker @leeper
Due date: 2016-06-15

@sckott

This comment has been minimized.

Copy link
Member

sckott commented Jun 15, 2016

@hilaryparker
Due date: 2016-06-15 - hey there, it's been 21 days, please get your review in soon, thanks 😺 (ropensci-bot)

@ropenscibot

This comment has been minimized.

Copy link

ropenscibot commented Jun 23, 2016

@hilaryparker
Due date: 2016-06-15 - hey there, it's been 29 days, please get your review in soon, thanks 😺 (ropensci-bot)

@ropenscibot

This comment has been minimized.

Copy link

ropenscibot commented Jul 9, 2016

@hilaryparker
Due date: 2016-06-15 - hey there, it's been 45 days, please get your review in soon, thanks 😺 (ropensci-bot)

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented Jul 11, 2016

Additional reviewer: @leeper

@leeper

This comment has been minimized.

Copy link
Member

leeper commented Jul 12, 2016

This is a slim and easy-to-use package that works exactly as described. The documentation is generally complete and all examples in docs and README run as expected. I think it's really close to being ready for acceptance, but there are a couple of issues and I have some suggestions on usability.

  1. I get an error when building vignettes from GitHub (build_vignettes = TRUE):

     Quitting from lines 35-41 (monkeylearn_intro.Rmd) 
     Error: processing vignette 'monkeylearn_intro.Rmd' failed with diagnostics:
     HTTP failure: 401
     Invalid token header. No credentials provided.
     Execution halted

    This seems to be due to requiring an API key for examples to run. For CRAN release, the code may need to be pre-executed so the vignettes can build. This issue also emerges on R CMD check in a few places:

    1. examples, which should probably be wrapped in \dontrun{}
    2. test suite, which probably needs to be run conditionally using whichever CRAN-identifying method is appropriate (version number, testthat's check, a simple if() based on API key availability, etc.).
  2. I also get a NAMESPACE note in R CMD check:

     * checking R code for possible problems ... NOTE
     monkeylearn_check: no visible binding for global variable 'text'
     monkeylearn_parse: no visible global function definition for 'is'
     Undefined global functions or variables:
       is text
     Consider adding
       importFrom("graphics", "text")
       importFrom("methods", "is")
     to your NAMESPACE file (and ensure that your DESCRIPTION Imports field
     contains 'methods').
    

    That is all pretty minor.

  3. The code itself also looks good and is very straightforward, seeing as there are only two functions: monkeylearn_extract() and monkeylearn_classify(). For both functions, the output is a list of two tbl_dfs. Only the first one "results" is really useful. The second, "headers", is basically request metadata. I think it would be more useful to return a single tbl_df and leave the "headers" information as attributes to that, in the form of a list. This will improve efficiency because the headers won't be run through both as.data.frame() and tbl_df() and will make the results easier to access.

  4. The only purpose for the headers response seems to be to track the x.query.limit.remaining HTTP header. I would handle this in two ways, both different from current behavior:

    1. Report this value when verbose = TRUE.
    2. Store the value in a non-exported package environment and have internal functions monkeylearn_get_extractor() and monkeylearn_get_classify() update the value after each request and then issue a warning when these values get low.
  5. The functions should use message() rather than print() for diagnostics when verbose = TRUE. This occurs in several places:

  6. The only major substantive suggestion I have for the package is to perhaps create some functions or variables to represent the classifier ID and extractor ID values for some common methods. For example, the README and examples require the user to input something like "ex_isnnZRbS" as the value of extractor_id ala monkeylearn_extract(request = text, extractor_id = "ex_y7BPYzNG"). If this could instead be something like monkeylearn_extract(request = text, extractor_id = .entity), monkeylearn_classify(request = text, classifier_id = .profanity), etc. it would greatly simplify the use of the most commonly needed methods.

  7. One really small code style comment: on if (), while (), etc. statements, I think Hadley, Google, and most style guides recommend spacing between brackets, so that:

     while(class(output) == "try-error" && try_number < 6){
     ...
     }

    becomes

     while (class(output) == "try-error" && try_number < 6) {
     ...
     }
  8. It might be useful to note in the README that MonkeyLearn supports registration through GitHub, which makes the registration process really easy.

  9. It might make the README a little flashier if it had some graphs. That's totally not necessary at all, but it seems like that captures peoples' attention better than pure text.

Overall, really easy to use and really nicely done.

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented Jul 12, 2016

Thank your for the review, @leeper!

@masalmon, you may yet receive another review, but go ahead and proceed with changes in the meantime.

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented Jul 12, 2016

One suggestion regarding outputting tbl_df()'s from your functions. Per the update to tibble and this comment from Hadley, it may make sense to replace tbl_df() (as well as as.data.frame()) with tibble::as_tibble() to anticipate future changes in the tidyverse. This won't increase the package weight as dplyr already depends on tibble. If you do so, specify tibble (>= 1.1) in Imports:.

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Jul 12, 2016

Many thanks!! 👍 I am on holidays this week so will implement the suggested changes next week. ☺

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Jul 20, 2016

Hi @leeper, thanks again, this will be very useful!

I have only questions about 2 points:

  • I'm not sure I understand what you mean by "I think it would be more useful to return a single tbl_df and leave the "headers" information as attributes to that, in the form of a list"? Do you mean adding the information from request as columns to the results data frame instead of having them separately? So that there would be a column "x.query.limit.request.queries" with the same value for the whole data frame? But you say "list", so I'm pretty sure I'm misunderstanding your suggestion.
  • And does the following point mean you think that the remaining number of queries should not be part of the output, or that there should be an additional warning?

I had chosen to output a list of data frames because I did it this way for ropenaq but obviously I'm open to changes. 😀

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented Jul 20, 2016

@masalmon R objects have an "attributes" component that stores metadata such as dimensions and names, but also can hold arbitrary R objects as metadata. See this section in Advanced R, the R Manual section, ?attributes and ?attr. So you can attach the headers to your output and access it with attributes(output)$headers or attr(output, "headers"). I believe @leeper is suggesting that since this header data isn't inherently tabular and consists of a few values, just leaving it as a list makes more sense than converting it to a data frame/tibble.

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Jul 20, 2016

@noamross cool, thank you!

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Jul 20, 2016

Okay so I now see why it (maybe) made sense to have headers as a tibble: when the input is a vector of texts (and not a single text), the headers is a table with one line per original text. It would be harder to have this if headers were a list. I had the same strategy in geoparser.

I'd tend to keep this because I don't know any better solution for this case, and if I do, I'd replace the current "text index" in the headers and results tables by a md5 checksum as I currently do in geoparser.

Any thought?

@leeper

This comment has been minimized.

Copy link
Member

leeper commented Jul 26, 2016

Yup, that works. Then there's a consistent structure regardless of whether it's one call or many.

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Jul 28, 2016

The function now outputs a data.frame with headers as a df attribute. 😄

I am still not sure about the following part of my code in both functions:

    output <- tryCatch(monkeylearn_get_extractor(request_part, key, extractor_id))
    # for the case when the server returns nothing
    # try 5 times, not more
    try_number <- 1
    while(class(output) == "try-error" && try_number < 6) {
      message(paste0("Server returned nothing, trying again, try number", i))
      Sys.sleep(2^try_number)
      output <- tryCatch(monkeylearn_get_extractor(request_part, key, extractor_id))
      try_number <- try_number + 1
    }

    # check the output -- if it is 429 try again (throttle limit)
    while(!monkeylearn_check(output)) {
      output <- monkeylearn_get_extractor(request_part, key, extractor_id)
    }

So if the server returns nothing I re-try up to 5 times with each time more waiting time. And if the response from the server has a 429 code which indicates throttle limit, I wait 60 seconds.

I feel this is quite clumsy, maybe I could make this better? Any suggestion would be welcome.

Note that I read https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html "For many APIs, the common approach is to retry API calls that return something in the 500 range. However, when doing this, it’s extremely important to make sure to do this with some form of exponential backoff: if something’s wrong on the server-side, hammering the server with retries may make things worse, and may lead to you exhausting quota (or hitting other sorts of rate limits). A common policy is to retry up to 5 times, starting at 1s, and each time doubling and adding a small amount of jitter (plus or minus up to, say, 5% of the current wait time)."

Now another question. @leeper now that headers is a part of the output as an attribute of the data.frame, do you still think reporting x.query.limit.remaining HTTP when verbose=TRUE makes sense? Now I feel that the users themselves could implement something in their code based on x.query.limit.limitand x.query.limit.remaining, which I will make more explicit in the documentation.

@leeper

This comment has been minimized.

Copy link
Member

leeper commented Jul 28, 2016

I'd say write an example of how to check rate limiting and if it's really complicated, perhaps write it into the package. If it turns out to be simple, then just leave it as an example and/or note it in the README.

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Jul 30, 2016

I feel I can now answer the review comments. Thanks a lot, @leeper! And thanks @noamross for your contributions to the discussion!

Something I have done independently of this feedback was to add a small vignette where I do a small analysis of the book Little Women.

  1. Great point! I have added a chunk at the beginning of both vignettes now.
  2. Thanks, now there is no NOTE.
  3. Now the output is a data.frame with a data.frame as attributes for representing the headers.
  4. In the documentation of monkeylearn_classify and monkeylearn_extract and in the README+vignette I explain how to get the remaining number of calls, so the user could use them as they wish.
  5. Thanks, I replaced print with message.
  6. Thanks, this is a great suggestion. This is what I have done:
  • For classifiers, when I asked again about a list of all modules, Monkeylearn was kind enough to add an endpoint, which they have not done for extractors because there are less of them. I added the function monkeylearn_classifiers which returns a data.frame of all classifiers, or only of private ones, with their IDs, names and other information such as the language. So the user could use this table to extract the IDs of the classifiers they want to use.
  • For extractors, in the README and vignette I only mention that one can go to the website to browse existing extractors and then save their ID as an environment variable. The reason why I did not add a data.frame of some extractors myself as a .RData is that I would have to choose which characteristics of the extractors to write in this table and for some applications there are different extractors (e.g. sentiment analysis trained on Tweets vs on hotel reviews) and my list would soon be out of date (which by the way is bad for the README and vignette too)... and also I not-so-secretly hope there'll be an endpoint for getting all of them, which might happen when one can customize extractors. Is it ok if I simply wait for a future version of the API before making the extractors IDs easier to find?
  1. I added the missing spaces.
  2. I added the sentence "Note that MonkeyLearn supports registration through GitHub, which makes the registration process really easy.".
  3. As soon as I find a really good idea, I'll add that (probably as a advertisement for a supplementary vignette about one use case)! My current idea is to maybe use the R interface to the Urban Dictionary's API, get random definitions, and look for profanity and abuse detection, before plotting how many of the definitions have a high score. I am open to any cooler suggestion! In the meantime, is it ok I don't have any graph in the README?

Thanks again for the feedback, I'm really happy to be able to improve the package!

@leeper

This comment has been minimized.

Copy link
Member

leeper commented Jul 30, 2016

Nice. Looks good to me!

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 1, 2016

@leeper I had added you to the DESCRIPTION file without asking first, I'm sorry. Are you ok with being in the DESCRIPTION?

@leeper

This comment has been minimized.

Copy link
Member

leeper commented Aug 1, 2016

Sure thing.

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented Aug 2, 2016

I was running final checks and I'm getting an error sometimes when running line 94 of the guternbergr vignette. It looks like the vignettes don't get tested on Travis?:

> keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                 extractor_id = "ex_y7BPYzNG",
                                params = list(max_keywords = 3))
Error in match.names(clabs, names(xi)) : 
  names do not match previous names

This appears to arise in extractor.R#97: headers <- rbind(headers, output$headers). When I dive in using options(error=recover) I find that these two objects have different names:

Browse[1]> str(headers)
'data.frame':   1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:38:54 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49101": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ content.length               : Factor w/ 1 level "11274": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr  "ff08da4717ca6fadcc5496aa51aad2ea" "0c368463a9e7135209a3ac1eaf3e6120" "1edee1e8747131a45212df1bc007001d" "b765601afb831ea7adb5a89b2d3f5803" ...
Browse[1]> str(output$headers)
'data.frame':   1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:48:50 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49081": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ transfer.encoding            : Factor w/ 1 level "chunked": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr  "144b8166ed539c3f712583460143d534" "c562796c5666bb759bbf97b9bc2963c9" "55b5bd0e6b4cd459e574db7e86b88407" "69c7ebf70662c78722ba1a7fccea1de2" ...

One has transfer.encoding and one has content.length, which have different information. I don't know why these are different in different calls. If they should be, dplyr::bind_rows() can handle this and just put NAs for missing values.

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 3, 2016

Oh thanks a lot @noamross , fixing this. Actually vignettes seem to get tested on Travis: https://travis-ci.org/masalmon/monkeylearn/builds/148549225

@feconroses is it normal for headers to sometimes contain transfer.encoding and sometimes content.length?

maelle added a commit to ropensci/monkeylearn that referenced this issue Aug 3, 2016

@feconroses

This comment has been minimized.

Copy link

feconroses commented Aug 4, 2016

@masalmon in all honesty, these headers aren't really relevant from a MonkeyLearn point of view. These headers (transfer.encoding and content.length) are generated automatically by nginx, they aren't really relevant for the text analysis made by MonkeyLearn or its process.

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 5, 2016

@feconroses ok thanks a lot!

@noamross has my commit solved the bug on your PC?

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented Aug 5, 2016

Yes, that has solved it, all checks now passing!

I hope you don't mind one more tweak at the end of a long review:. We're starting to use Gabor Csardi's goodpractice package to check for antipatterns in code. Running goodpractice::gp() yielded two small suggestions. After these you should be good to go.

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data.
    Consider using vapply() instead.

    R/utils.R:97:44
    R/utils.R:105:25
    R/utils.R:112:28

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error
    prone and result 1:0 if the expression on the right hand side is zero. Use seq_len() or seq_along()
    instead.

    R/classify.R:45:12
    R/extractor.R:70:12
@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 5, 2016

Oh nice I didn't know about this package, I'll soon make the corresponding corrections. Thank you!

maelle added a commit to ropensci/monkeylearn that referenced this issue Aug 5, 2016

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 5, 2016

@noamross now done! :-)

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 5, 2016

Thanks for your patience and help, @noamross, now I really got rid of sapply.

@noamross

This comment has been minimized.

Copy link
Collaborator

noamross commented Aug 5, 2016

Accepted! Thank you @masalmon and @leeper for your review and follow-up. @masalmon, I believe you know the drill from here.

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 5, 2016

Awesome! Yes I do. Thanks again @leeper and @noamross !

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 5, 2016

@sckott could you please create the monkeylearn project in the ropenscilabs Appveyor account and add me as an user? thank you!

@feconroses

This comment has been minimized.

Copy link

feconroses commented Aug 5, 2016

Thanks @masalmon, you are awesome and thanks for making this happen! @leeper, @noamross and @sckott thanks for your time and help :)

@sckott

This comment has been minimized.

Copy link
Member

sckott commented Aug 5, 2016

@masalmon the appveyor is set up

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Aug 5, 2016

Wow this was quick, thanks a lot @sckott !

@maelle

This comment has been minimized.

Copy link
Member Author

maelle commented Sep 11, 2016

For info the package is on CRAN now. Thanks again for the review!

@mdsumner

This comment has been minimized.

Copy link

mdsumner commented Sep 11, 2016

Well done!

On Sun, 11 Sep 2016, 17:35 Maëlle Salmon notifications@github.com wrote:

For info the package is on CRAN now. Thanks again for the review!


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#45 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AD6tb4yotFZ8lV3DRWPM7KZQpsHF4H5Pks5qo69CgaJpZM4IiCUj
.

Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment