
query causes R to crash #247

Closed
ras44 opened this issue May 14, 2018 · 20 comments
Labels
reprex

Comments


@ras44 ras44 commented May 14, 2018

I run into an issue where R crashes if I download a large table. I'm not sure if everything is being stored in memory, but I'm using a GCE instance with 15GB of RAM.

Here's a quick reprex:

query <- "
  SELECT *
  FROM `fh-bigquery.dbpedia.place`
"

query_exec(
  project = project,
  query = query,
  max_pages = Inf,
  page_size = 500000,
  use_legacy_sql = FALSE
)
# this correctly fails with a responseTooLarge error

bq_table_download(
  bq_project_query(project, whisker.render(query, config)),
  page_size = 200
)
# this continues downloading with page_size 200 and then eventually R crashes

I'm not sure if this is related to #169.

In my case, I'm querying a table with thousands of features and would like to pull it into a data frame in R. Please let me know if I can provide any additional information. Since I'm on RStudio Server on a GCE instance with oob auth, I'm unable to create reprexes with the reprex library.


@hadley hadley commented May 14, 2018

It's likely to be a bug with the C++ code that turns bq json into data frames. Can you please try and narrow down the cause? It's likely to be one cell that's causing the problem, so I'd suggest first doing a search on the rows (e.g. try max_pages = 1), and then figure out what column is the source.

Alternatively, if you know how to use a C debugger, I can help you figure out the source that way.


@ras44 ras44 commented May 14, 2018

I can confirm the query works if I use query_exec, so I think your hunch is probably correct. The difficulty with debugging on max_pages=1 is that it downloads about 75% of the data before failing. That might indicate some kind of nefarious memory leak where memory isn't being freed. If you have recommendations for debugging via a C debugger, I'd be happy to devote some time to that, as the speed of bq_table_download will save me time in the long run.


@ras44 ras44 commented May 14, 2018

The crash actually happens during the "Retrieving data" step, not the parse step. Do you think it would still be related to the C++ code in that case?


@hadley hadley commented May 14, 2018

Oh hmmmm, that is quite surprising, so a C debugger might be the way to go. What platform are you on?


@ras44 ras44 commented May 14, 2018

I'm on a CentOS Linux platform via ssh, so I'm working primarily via the command line (aside from RStudio Server). I've used gdb in a very limited way with compiled C/C++ programs, but since the C code here is wrapped within R libs, I'm not sure how I would use something like gdb. If there's something within RStudio that I'm unaware of, I'd be happy to learn.


@hadley hadley commented May 14, 2018

Ok, it shouldn't be too hard. Start R with:

R --debugger=gdb
run

Then run code as usual. When you hit the error, get a backtrace with bt and paste it here. I'm not sure if you'll have the package installed with debug symbols, but let's start with the backtrace and we can figure it out from there.


@ras44 ras44 commented May 15, 2018

Thanks for the guidance.

When I run with the debugger, I am able to see this response from the API:

Complete
Billed: 0 B
Downloading 639,450 rows in 3198 pages.
Downloading data [---------------------------------------------]   0% eta: 14h
Error: Exceeded rate limits: Your project: 168476834607 exceeded quota for tabledata.list bytes per second per project. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors [rateLimitExceeded]

However, when I run in RStudio, it crashes and I receive a pop-up that says something along the lines of "the previous R session terminated unexpectedly".


@hadley hadley commented May 15, 2018

Hmmmm, typically that indicates there's some mismatch of installed C++ libraries, so you're getting a crash instead of an error. The next step (unfortunately) is to try reinstalling all your packages.


@ras44 ras44 commented May 16, 2018

Thanks for your advice. I am using packrat to manage installed libs on a per-project basis. After updating bindr and bindrcpp I'm unable to reproduce a crash.

@ras44 ras44 closed this as completed May 16, 2018

@ras44 ras44 commented May 16, 2018

Of course, the next time I ran it, it crashed, again during the download step.

Something strange that I noticed is that when I run the same query that previously crashed without packrat managing the libraries for the project, it seems to work. Your comments regarding the libraries make me suspect that something strange is happening with library management in packrat.

@ras44 ras44 reopened this May 16, 2018

@stPhena stPhena commented Aug 7, 2018

I have exactly the same issue. I assume, after reading the documentation, that it is because bq_table_download() does not allow specifying a rate limit. BigQuery allows a maximum of 60 MB/s and 150,000 rows/second when downloading data.

Traceback:
 1: bq_parse_files(schema_path, page_paths, n = page_info$n_rows, quiet = bq_quiet(quiet))
 2: bq_table_download(bq_table(...), page_size = 1000, quiet = FALSE)
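Given those documented limits, one way to reason about a safe page_size is a back-of-the-envelope calculation from the average row size and the number of parallel connections. This is a standalone sketch, not part of bigrquery; the function name and the assumption that all connections fetch a page per second are mine:

```r
# Rough heuristic (not part of bigrquery): pick a page_size that stays under
# BigQuery's documented tabledata.list quotas of 60 MB/s and 150,000 rows/s,
# given an estimated average row size in bytes and the number of parallel
# connections. Assumes, pessimistically, one page per connection per second.
safe_page_size <- function(avg_row_bytes, max_connections = 6,
                           bytes_per_sec = 60e6, rows_per_sec = 150e3) {
  by_bytes <- floor(bytes_per_sec / (avg_row_bytes * max_connections))
  by_rows  <- floor(rows_per_sec / max_connections)
  max(1L, min(by_bytes, by_rows))
}

# e.g. a wide table with ~5 kB rows and 6 connections:
safe_page_size(avg_row_bytes = 5000)  # 2000 rows per page
```

For very wide tables the byte limit dominates, which matches the reports above of wide tables needing a small page_size.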


@ras44 ras44 commented Aug 17, 2018

It looks like this issue is related to BigQuery quotas. It manifests itself in a variety of ways, with the following ad-hoc solutions:

Too much data returned (large tables with many fields)

  • Limits
    • Maximum bytes per second per project returned by calls to tabledata.list: 60 MB/second
    • Maximum rows per second per project returned by calls to tabledata.list: 150,000/second
  • Solution
    • reduce page_size: bq_table_download(..., page_size=200)
    • reduce max_connections: bq_table_download(..., page_size=100, max_connections=1)

Too many requests (large tables with few fields)

  • Limits
    • API requests per second, per user — 100
  • Solution
    • increase page_size: bq_table_download(..., page_size=2000)
    • reduce max_connections: bq_table_download(..., page_size=100, max_connections=1)
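Until bigrquery rate-limits internally, another stopgap is to retry the call with exponential backoff when a quota error surfaces. This is a sketch, not an official API; the with_retry helper and the error-message matching are assumptions based on the rateLimitExceeded message shown earlier in this thread:

```r
# Sketch of a stopgap retry wrapper (not part of bigrquery): retry a call with
# exponential backoff when the error message looks like a BigQuery quota error.
with_retry <- function(fun, retries = 5, wait = 2) {
  for (i in seq_len(retries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (!grepl("rateLimitExceeded|Exceeded rate limits",
               conditionMessage(result))) {
      stop(result)  # not a quota error; re-raise immediately
    }
    Sys.sleep(wait * 2^(i - 1))  # back off: 2s, 4s, 8s, ...
  }
  stop("giving up after ", retries, " attempts")
}

# Hypothetical usage:
# df <- with_retry(function() bq_table_download(tb, page_size = 200))
```

Note this only helps with the error case; it wouldn't prevent the hard crash discussed below.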


@ras44 ras44 commented Aug 29, 2018

I've had some luck diagnosing by adding a very rudimentary callback:

bq_download_fail_callback <- function(page, path) {
  function(err) {
    print(err)
    print(page)
    print(path)
  }
}

and adding:

fail = bq_download_fail_callback(i, paths[[i]])

here:

bigrquery/R/bq-download.R

Lines 126 to 129 in a3ef603

curl::multi_add(handle,
  done = bq_download_callback(i, paths[[i]], progress),
  pool = pool
)

While this doesn't resolve the issue (I believe it would require some kind of smart rate-limiting implementation in bigrquery), it does at least give the user some output showing that there was an error. Otherwise, R crashes without any output.

It may be helpful to abort the download if one of the requests fails, as this might prevent the program from crashing.
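To illustrate the mechanism in isolation: curl's multi_add() does accept a fail callback alongside done, which is what the snippet above wires in. A minimal standalone sketch, unrelated to BigQuery (the unreachable URL is only there to force a failure):

```r
library(curl)

# Demonstrate curl's fail callback: the request to an unreachable port fails,
# the fail handler fires with the error message, and multi_run() returns
# instead of the failure passing silently.
pool <- new_pool()
failed <- FALSE
h <- new_handle(url = "http://127.0.0.1:1/")  # port 1: connection refused
multi_add(h,
  done = function(res) message("done: ", res$status_code),
  fail = function(msg) {
    failed <<- TRUE
    message("request failed: ", msg)
  },
  pool = pool
)
multi_run(pool = pool)
failed  # TRUE
```

In bq-download.R, a fail handler like this could set a flag that the download loop checks so remaining handles are not drained after a failure; that part is a design suggestion, not existing bigrquery behavior.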


@hadley hadley commented Jan 22, 2019

Reprex:

library(bigrquery)
tbl <- as_bq_table("fh-bigquery.dbpedia.place")
bq_table_download(tbl, page_size = 200)


@hadley hadley commented Jan 23, 2019

The above query works fine for me. Does anyone else have an example that crashes for them?

@hadley hadley added the reprex label Jan 23, 2019

@ras44 ras44 commented Jan 23, 2019

This has been a difficult one to produce a consistent reprex for. Going off the hypothesis that this is related to quotas, could you try increasing page_size? Failing that, perhaps some variation of page_size and max_connections?


@hadley hadley commented Jan 23, 2019

I definitely see various errors, but I can't get R to crash.


@ras44 ras44 commented Jan 23, 2019

I think this was originally posted running R 3.4.x and with packrat managing the libraries. Have you created a packrat project? I would be happy to test it out again as I'm running 3.5.x now and can set up a packrat project pretty easily.


@hadley hadley commented Jan 23, 2019

If you only see it with packrat, I think it's highly likely it's a package incompatibility problem, not a bug in bigrquery (especially since I can't recreate it locally).

@hadley hadley closed this as completed Jan 23, 2019

@gavinsteininger gavinsteininger commented Feb 12, 2020

I am getting the same issue with R 3.6.2 GUI 1.70 El Capitan build (7735). The crashes are intermittent, with a tendency to happen if I either have not queried for a while (an hour or so) or the query is big (100k+ rows), but they happen at other times as well.
