
Support for really large queries #675

Closed
swapneshgandhi opened this issue Jun 26, 2020 · 8 comments

@swapneshgandhi

Hi,
We are running into OOM errors for large Druid queries we run in Maha. The response sizes are huge, upwards of 10 GB to even 200 GB in some cases. It seems that the Druid response is being loaded completely into memory.
Is there a way to use streaming or spill these large results to disk?

@pranavbhole
Contributor

Hello @swapneshgandhi, we faced the same problem and fixed it via OffHeapRowList, which stores the data in a RocksDB instance.
https://github.com/yahoo/maha/blob/master/core/src/main/scala/com/yahoo/maha/core/query/OffHeapRowList.scala

OffHeapRowList is optional; you enable it by setting a param in the QueryPipelineContext and passing it while instantiating the query pipeline factory.
Hope it helps.
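The idea behind OffHeapRowList can be sketched as a row list that spills rows to disk as they arrive, so heap usage stays flat regardless of result size. The following is an illustrative Java sketch under stated assumptions, not Maha's actual API: the real OffHeapRowList is Scala and backed by RocksDB, while this toy version uses a plain temp file and a tab-separated encoding.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Consumer;

// Illustrative sketch only: Maha's real OffHeapRowList (Scala) persists rows in
// RocksDB; this toy version spills tab-separated rows to a temp file to show the
// principle of keeping heap usage flat regardless of result size.
public class DiskSpillRowList {
    private final Path spillFile;
    private final BufferedWriter writer;
    private int rowCount = 0;

    public DiskSpillRowList() {
        try {
            spillFile = Files.createTempFile("rowlist", ".spill");
            writer = Files.newBufferedWriter(spillFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each added row goes straight to disk instead of an in-memory buffer.
    // (Assumes values contain no tabs or newlines; a real implementation would
    // use a proper binary encoding.)
    public void addRow(List<String> row) {
        try {
            writer.write(String.join("\t", row));
            writer.write("\n");
            rowCount++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int size() { return rowCount; }

    // Stream rows back one at a time; the full result never lives on the heap.
    public void foreach(Consumer<List<String>> fn) {
        try {
            writer.flush();
            try (BufferedReader reader = Files.newBufferedReader(spillFile)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    fn.accept(Arrays.asList(line.split("\t", -1)));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        DiskSpillRowList rows = new DiskSpillRowList();
        rows.addRow(Arrays.asList("campaign_1", "1000"));
        rows.addRow(Arrays.asList("campaign_2", "2500"));
        rows.foreach(System.out::println);
    }
}
```

The design point is that `addRow` does O(1) heap work per row; only one row is materialized at a time on read-back, which is what makes multi-GB results tractable.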

@swapneshgandhi
Author

Thanks @pranavbhole, I looked into the code a bit and have a few questions:

  1. Does the OffHeapRowList only work for async requests? If so, can you explain how async requests work?
    I couldn't find a good example; it would be helpful to know how to run an async request.
  2. If we use OffHeapRowList, is the result format still the same?
  3. Can we use the RowCSVWriter along with OffHeapRowList for a CSV output format?

thanks.

@patelh
Collaborator

patelh commented Jul 1, 2020

@swapneshgandhi technically, the async workers never made it out to open source; that is something that still needs to be built out. However, you should be able to execute an async query like a sync query (the async workers get the request from a queue and execute it, and they use essentially the same query executors, except Hive). The async execution pipeline does not support some of the sync pipeline use cases. You need to change the request type to async. Once you create a QueryPipeline, it can be executed with a QueryExecutorContext. See the tests for the different executors.

@patelh
Collaborator

patelh commented Jul 1, 2020

@swapneshgandhi also, you can provide a CSVRowList to the QueryPipelineBuilder so it will output the results to a file without holding them in memory.
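The CSVRowList idea can be sketched as a sink that appends each row to the output file as it arrives, so the result set never accumulates in memory. This is a minimal illustrative Java sketch, not Maha's RowCSVWriter/CSVRowList API; the class name `CsvSink` and its methods are hypothetical.

```java
import java.io.*;
import java.nio.file.*;
import java.util.List;

// Illustrative sketch of the CSVRowList idea (not Maha's RowCSVWriter API):
// every row is appended to the target file immediately, so memory stays flat
// no matter how many rows the query returns.
public class CsvSink implements Closeable {
    private final BufferedWriter out;
    private int rowsWritten = 0;

    public CsvSink(Path target) {
        try {
            out = Files.newBufferedWriter(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Quote a field per RFC 4180 if it contains a delimiter, quote, or newline.
    private static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public void writeRow(List<String> row) {
        try {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(escape(row.get(i)));
            }
            out.write(sb.toString());
            out.write("\n");
            rowsWritten++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int rowsWritten() { return rowsWritten; }

    @Override
    public void close() {
        try { out.close(); } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```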

@patelh
Collaborator

patelh commented Jul 1, 2020

@swapneshgandhi see DefaultQueryPipelineFactory

@patelh
Collaborator

patelh commented Jul 1, 2020

@swapneshgandhi and see QueryPipeline.csvRowList(...)

@upendrareddy

Hi @patelh / @pranavbhole,

Thanks for your prompt response. I work with Swapnesh, and I just need a few clarifications regarding handling large queries with CSV output.

After receiving the response from Druid, if we use OffHeapRowList it stores the response in RocksDB. Is the CSV conversion a streaming implementation from RocksDB, or does it use the Druid response (from memory) directly?

Regards

@patelh
Collaborator

patelh commented Aug 10, 2020

@upendrareddy the existing Druid query executor uses asynchttpclient underneath for request/response. It gets the response from Druid as a string, parses the JSON string to an AST, and constructs the internal response object, a RowList. The RowList has multiple implementations; one is OffHeapRowList and another is CSVRowList, both of which have lower memory overhead. The optimal solution would be not to parse the Druid result into an AST but to read it as an InputStream and use a low-level parser API to parse the response. If you'd like to contribute this change, please do; it would be a great improvement for large Druid responses.
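The contrast described above can be sketched in two methods: the current approach buffers the entire response as one string before parsing, while the proposed improvement consumes the InputStream record by record. A production version would drive a streaming JSON parser (e.g. Jackson's JsonParser) over the stream, since Druid responds with JSON; this dependency-free sketch uses one line per record so it stays self-contained.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Sketch of incremental response handling. A real Druid response is JSON, so a
// production version would feed the InputStream to a streaming JSON parser
// (e.g. Jackson's JsonParser); here each line stands in for one record to keep
// the example dependency-free.
public class StreamingResponseReader {

    // Memory-heavy approach: the whole response becomes one String before any
    // parsing happens, so heap usage scales with response size.
    public static String readFully(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            in.transferTo(buf);
            return buf.toString(StandardCharsets.UTF_8.name());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streaming approach: each record is handed off as soon as it is read, so
    // heap usage is bounded by one record, not the full response.
    public static int streamRecords(InputStream in, Consumer<String> onRecord) {
        int count = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String record;
            while ((record = reader.readLine()) != null) {
                onRecord.accept(record);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] fakeResponse = "row1\nrow2\nrow3\n".getBytes(StandardCharsets.UTF_8);
        int n = streamRecords(new ByteArrayInputStream(fakeResponse), System.out::println);
        System.out.println("records=" + n);
    }
}
```

With the streaming shape, each parsed record could be appended directly to an OffHeapRowList or CSVRowList, which is what would make very large Druid responses tractable end to end.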
