
Graceful handling of errors in vectorised inputs #55

Closed
engti opened this issue Feb 27, 2019 · 9 comments

@engti

engti commented Feb 27, 2019

I am trying to loop through a data frame of reviews, some of which seem to fall below the API's minimum token threshold.

When I try to do something like this:

for (i in 1:nrow(df_filtered[1:10, ])) {
  tmp <- safely(gl_nlp(df_filtered$review_text[i]))

  api_result[[as.character(df_filtered$id[i])]] <- tmp

  print(paste0("Index: ", i, " Status: ", length(tmp)))
}

I get errors like:

2019-02-27 14:50:58 -- annotateText: 65 characters
Request failed [400]. Retrying in 1 seconds...
Request failed [400]. Retrying in 1 seconds...
2019-02-27 14:51:03> Request Status Code: 400
Scopes: https://www.googleapis.com/auth/cloud-language https://www.googleapis.com/auth/cloud-platform
Method: service_json
Error: API returned: Invalid text content: too few tokens (words) to process.

I don't think the whole loop should error out because of one bad call. I am using the safely() function from purrr, but is there a best-practice guide for dealing with these situations somewhere?

Thanks.

@MarkEdmondson1234
Collaborator

Did you try sending in the column of text as is? The function is vectorised so it should cope with it, and a tryCatch() in the function should handle errors gracefully. If not, let me know - please try this code and report back what it does:

results <- gl_nlp(df_filtered$review_text)
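
If you still prefer an explicit loop in the meantime, note that purrr::safely() wraps the function itself rather than the call; a minimal sketch of that pattern (df_filtered and its columns are assumed from the snippet above):

# Wrap gl_nlp once; each call then returns list(result = ..., error = ...)
# instead of stopping the loop on failure.
library(purrr)

safe_nlp <- safely(gl_nlp)
results  <- map(df_filtered$review_text, safe_nlp)

# Pull out the successful responses (NULL where the call errored)
ok <- map(results, "result")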

@engti
Author

engti commented Feb 27, 2019

Thanks, Mark, for the quick response.

I tried it, but upon hitting an error it exits rather than proceeding gracefully. I did manage to get it working, though, by keeping only rows with at least 20 words in them and converting all text to UTF-8, although it was a fiddly process (roughly as sketched below the log). Let me know if I should close this issue, or if you'd like to know more.

2019-02-27 20:18:33 -- annotateText: 14 characters
Auto-refreshing stale OAuth token.
Request failed [400]. Retrying in 1 seconds...
Request failed [400]. Retrying in 2.5 seconds...
2019-02-27 20:18:41> Request Status Code: 400
Scopes: https://www.googleapis.com/auth/cloud-language https://www.googleapis.com/auth/cloud-platform
Method: service_json
Error: API returned: Invalid text content: too few tokens (words) to process.
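
For reference, the workaround looked roughly like this - a sketch only, with the column names assumed and stringi::stri_count_words() / iconv() standing in for whatever word count and encoding conversion you prefer:

# Keep only reviews with at least 20 words and force UTF-8 before sending
library(stringi)

df_ok <- df_filtered[stri_count_words(df_filtered$review_text) >= 20, ]
df_ok$review_text <- iconv(df_ok$review_text, to = "UTF-8")

results <- gl_nlp(df_ok$review_text)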

@MarkEdmondson1234
Collaborator

OK, good to know, thanks - I will keep the issue open to make the failures more graceful.

@MarkEdmondson1234 MarkEdmondson1234 changed the title Handling errors in a loop using Graceful handling of errors in vectorised inputs Feb 27, 2019
@engti
Author

engti commented Feb 28, 2019

Many thanks, Mark. Let me know if you need me to test anything in the future.

@thisisnickb

thisisnickb commented Jul 2, 2019

I'd just like to add that I'm having the same issue and that, unless I have my tryCatch() loop coded incorrectly, I'm also getting the same sort of failure.

This code:

# Use just instances with at least 25 words of text (arbitrary cutoff)
filelist <- lapply(filelist, function(x) subset(x, WordCount > 24))

#### Push the data up to Google and get the results back ####
# Create the storage list
output <- rep(list(NA), length(ids))
names(output) <- as.numeric(ids)

# Run the data through
tryCatch(
  {
    for (i in 1:length(ids)) {
      output[[i]] <- gl_nlp(as.character(filelist[[i]]$Content))
    }
  }
)

ultimately produces this error:

[screenshot of the error output]

@MarkEdmondson1234
Collaborator

MarkEdmondson1234 commented Jul 2, 2019 via email

@MarkEdmondson1234
Collaborator

MarkEdmondson1234 commented Jul 2, 2019

The above scenarios should be handled better in version 0.2.0.9000, now on GitHub (install via remotes::install_github("ropensci/googleLanguageR")).

47c0666

For example, the calls below will carry on if there are 400 errors in the first responses:

library(googleLanguageR)
gl_nlp(c("the rain in spain falls mainly on the plain", "err", "", NA))
2019-07-02 22:08:00 -- annotateText: 43 characters
2019-07-02 22:08:01> Request Status Code: 400
2019-07-02 22:08:01 -- Error processing string: 'the rain in spain falls mainly on the plain' API returned: Invalid text content: too few tokens (words) to process.
2019-07-02 22:08:01 -- annotateText: 3 characters
2019-07-02 22:08:02> Request Status Code: 400
2019-07-02 22:08:02 -- Error processing string: 'err' API returned: Invalid text content: too few tokens (words) to process.

This gives a response like the one below:

$sentences
$sentences[[1]]
[1] "#error -  API returned: Invalid text content: too few tokens (words) to process."

$sentences[[2]]
[1] "#error -  API returned: Invalid text content: too few tokens (words) to process."

$sentences[[3]]
[1] "#error - zero length string"

$sentences[[4]]
[1] "#error - zero length string"


$tokens
$tokens[[1]]
[1] "#error -  API returned: Invalid text content: too few tokens (words) to process."

$tokens[[2]]
[1] "#error -  API returned: Invalid text content: too few tokens (words) to process."

$tokens[[3]]
[1] "#error - zero length string"

$tokens[[4]]
[1] "#error - zero length string"


$entities
$entities[[1]]
[1] "#error -  API returned: Invalid text content: too few tokens (words) to process."

$entities[[2]]
[1] "#error -  API returned: Invalid text content: too few tokens (words) to process."

$entities[[3]]
[1] "#error - zero length string"

$entities[[4]]
[1] "#error - zero length string"


$language
[1] "#error -  API returned: Invalid text content: too few tokens (words) to process."
[2] "#error -  API returned: Invalid text content: too few tokens (words) to process."
[3] "#error - zero length string"                                                     
[4] "#error - zero length string"                                                     

$text
[1] "the rain in spain falls mainly on the plain"
[2] "err"                                        
[3] ""                                           
[4] NA                                           

$documentSentiment
# A tibble: 4 x 2
  magnitude score
      <dbl> <dbl>
1        NA    NA
2        NA    NA
3        NA    NA
4        NA    NA

$classifyText
# A tibble: 4 x 2
  name  confidence
  <chr>      <int>
1 NA            NA
2 NA            NA
3 NA            NA
4 NA            NA

Note you do not need to loop through indexes etc. to pass multiple texts to the API; send in the vector and it will make one API call per text element. It skips API calls for empty strings or NA elements.
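
A minimal sketch of that pattern (the data frame and column names are assumed; the "#error" prefix matches the output shown above):

# One vectorised call instead of a loop over rows
results <- gl_nlp(df_filtered$review_text)

# Elements that failed carry the "#error" marker, e.g. in the $language slot
failed <- grepl("^#error", results$language)
df_filtered$review_text[failed]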

@thisisnickb

Fixed - many thanks!

@MarkEdmondson1234
Collaborator

One thing I have just realised is that the "too few tokens (words) to process" error only occurs if you include classifyText in the request, e.g. if you use the annotateText default that includes all methods. You can get entity analysis for text of any length if you specify only that,

e.g.

gl_nlp(c("the rain in spain falls mainly on the plain", "err", "", NA), nlp_type = "analyzeEntities")

See https://cloud.google.com/natural-language/docs/reference/rest/v1/documents/classifyText
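
As a sketch of how to combine this with a word-count check, so short texts still get entity analysis while longer ones go through the full annotateText default (the 20-word cutoff and column name are assumptions, mirroring the workaround earlier in the thread):

# Route short texts to entity analysis only; longer texts to the default annotateText
library(stringi)

texts <- df_filtered$review_text
short <- stri_count_words(texts) < 20

entities_only <- gl_nlp(texts[short], nlp_type = "analyzeEntities")
full_results  <- gl_nlp(texts[!short])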
