Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393

Description

@tustvold
Contributor

Is your feature request related to a problem or challenge?

DataFusion performs CPU-bound work within async closures. This causes problems when IO runs on the same async runtime: such schedulers are cooperative, so a task that computes for a long time without yielding prevents the runtime from servicing IO. This leads to errors such as apache/arrow-rs-object-store#272.
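
To illustrate the general idea (this is only a rough sketch, not DataFusion's or tokio's API), the code below keeps CPU-bound work off the thread that would otherwise be servicing IO by handing it to a dedicated worker thread over channels. It uses only the standard library so that it runs without dependencies. The names `cpu_bound_sum` and `run_on_cpu_thread` are hypothetical stand-ins. In a real tokio application the same shape would more likely be a second runtime or thread pool reserved for CPU work.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for CPU-bound query work (not a DataFusion API).
fn cpu_bound_sum(n: u64) -> u64 {
    (1..=n).sum()
}

// Runs every job on a dedicated CPU thread and returns the results in
// submission order. The calling thread, which plays the role of the IO
// thread here, only sends requests and receives results.
fn run_on_cpu_thread(jobs: Vec<u64>) -> Vec<u64> {
    let (job_tx, job_rx) = mpsc::channel::<u64>();
    let (result_tx, result_rx) = mpsc::channel::<u64>();

    // Dedicated worker: the only place where the heavy computation runs.
    let worker = thread::spawn(move || {
        for n in job_rx {
            if result_tx.send(cpu_bound_sum(n)).is_err() {
                break; // the receiving side went away, so stop early
            }
        }
    });

    let count = jobs.len();
    for n in jobs {
        job_tx.send(n).expect("CPU worker exited early");
    }
    // Closing the job channel lets the worker's loop terminate.
    drop(job_tx);

    let results: Vec<u64> = result_rx.iter().take(count).collect();
    worker.join().expect("CPU worker panicked");
    results
}

fn main() {
    let results = run_on_cpu_thread(vec![10, 100, 1_000]);
    println!("{:?}", results);
}
```

A single worker keeps the results in order. A real setup would size the CPU pool to the number of cores and keep the IO runtime small, but that tuning is outside what this sketch shows.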

Describe the solution you'd like

I think at the very least this needs to be better documented; I couldn't find any mention of it in the DataFusion documentation after a cursory search.

I also think a more holistic approach would be valuable. As it stands, the use of async within DataFusion acts as a massive footgun that encourages users to intermix IO and CPU work in a way that is at best inefficient. That can be tracked as a separate follow-on task.

Describe alternatives you've considered

No response

Additional context

No response

Activity

alamb commented on Sep 9, 2024

@alamb
Contributor

I recommend two things:

  1. Write a blog post with background on why using two thread pools matters with DataFusion, plus examples of how to do it.
  2. Add documentation with a short summary that links to the blog post for the details.

changed the title from "Document DataFusion Threading" to "Document DataFusion Threading (and how to separate IO and CPU bound work)" on Sep 9, 2024

ozankabak commented on Oct 7, 2024

@ozankabak
Contributor

I think it'd be great to have good documentation on this.

alamb commented on Oct 25, 2024

@alamb
Contributor

> I think it'd be great to have good documentation on this.

100% agree -- @itsjunetime and @tustvold are working on a bit of it in apache/arrow-rs#6612. I'll try to help with the documentation as well.

changed the title from "Document DataFusion Threading (and how to separate IO and CPU bound work)" to "Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work)" on Nov 11, 2024

self-assigned this on Nov 14, 2024

alamb commented on Nov 16, 2024

@alamb
Contributor

Documentation

I hope to work on the example a bit more shortly
