Describe Data Placement to Guide Distribution of Tasks #1722

Draft: wants to merge 2 commits into base: master
2 changes: 2 additions & 0 deletions rfcs/proposed/numa_support/README.md
@@ -147,6 +147,8 @@ resources that are near to the data they access. oneTBB already provides low-lev
`tbb::info` and `tbb::task_arena`, but we should up-level this support into the high-level algorithms,
flow graph and containers where appropriate.

See the [sub-RFC for describing data placement](describe-data-placement.org).

### Improved out-of-the-box performance for high-level oneTBB features.

For high-level oneTBB features that are modified to provide improved NUMA support, we can try to
72 changes: 72 additions & 0 deletions rfcs/proposed/numa_support/describe-data-placement.org
@@ -0,0 +1,72 @@
#+TITLE: Describing Data Placement to Guide Distribution of Tasks

* Introduction
By default, oneTBB makes no assumptions about the placement of the data processed by a parallel
algorithm. One potential side effect, related to the way the memory prefetcher works, is that
adjacent parallel iterations run faster when executed on nearby CPU cores. There is also an
interface for guiding the scheduler to execute whole parallel algorithms on certain CPU cores/NUMA
nodes, but it requires the user to explicitly manage separate task arenas, including submitting
work into them.

This RFC approaches the problem from the other side by proposing a possibly less intrusive and, at
the same time, more intuitive interface for describing a work distribution that minimizes latency
effects when working with non-uniform memory.
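
For reference, the explicit-arena approach mentioned above can be expressed with the current public
oneTBB API (~tbb::info::numa_nodes~ and NUMA-constrained ~tbb::task_arena~). The sketch below is
illustrative, with the iteration-space split chosen by hand, which is exactly the manual management
this RFC aims to reduce:
#+begin_src C++
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>
#include <cstddef>
#include <vector>

// Increment every element, pinning each chunk of the work to one NUMA node.
void process_per_numa(std::vector<float>& data) {
    std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_nodes.size());
    std::vector<tbb::task_group> groups(numa_nodes.size());

    // One arena per NUMA node, constrained to that node's cores.
    for (std::size_t i = 0; i < numa_nodes.size(); ++i)
        arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]));

    // Manually split the iteration space and submit each piece to its arena.
    std::size_t chunk = data.size() / numa_nodes.size();
    for (std::size_t i = 0; i < numa_nodes.size(); ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = (i + 1 == numa_nodes.size()) ? data.size() : begin + chunk;
        arenas[i].execute([&, i, begin, end] {
            groups[i].run([&, begin, end] {
                tbb::parallel_for(begin, end,
                                  [&](std::size_t j) { data[j] += 1.0f; });
            });
        });
    }
    // Wait for every per-node piece to finish.
    for (std::size_t i = 0; i < numa_nodes.size(); ++i)
        arenas[i].execute([&, i] { groups[i].wait(); });
}
#+end_src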

* Proposal
The proposal is to add an interface that allows specifying how the parallel iteration space is
associated with different banks of memory, i.e., describing which memory is accessed by each
parallel iteration.


Sketches of a possible interface for the mapping, i.e., the iteration <=> memory bank association:
1. Passing iteration numbers denoting the split points.

Example:
#+begin_src C++
{4096, 8192, 16384, 32768}
#+end_src

Open question:
- How to indicate split points if the memory is distributed in a non-monotonic way, e.g., NUMA 0
  contains memory for the {4096, 8192} interval and NUMA 1 for {0, 4096}.

2. Intervals associated with an id.

Examples:
#+begin_src C++
{numa_id : tbb::blocked_range, ...}
{/*numa_id*/0 : {/*begin*/0, /*end*/4096 [, grain_size?]}, /*numa_id*/1 : {4096, 8192}, ...}
#+end_src
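
To illustrate how the two sketches relate, the following self-contained helper (a hypothetical
representation in plain standard C++, not part of any proposed API) derives the split points of
sketch 1 from the per-node interval map of sketch 2:
#+begin_src C++
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical representation of sketch 2: numa_id -> [begin, end) interval.
using placement_map = std::map<int, std::pair<std::size_t, std::size_t>>;

// Derive the split points of sketch 1 from the interval map of sketch 2.
// The interval boundaries are sorted and deduplicated, so the information
// about *which* node owns each interval is lost: sketch 1 alone cannot
// express a non-monotonic node <=> interval assignment.
std::vector<std::size_t> split_points(const placement_map& placement) {
    std::set<std::size_t> points;
    for (const auto& [numa_id, interval] : placement) {
        (void)numa_id; // the node id does not survive the conversion
        points.insert(interval.first);
        points.insert(interval.second);
    }
    return {points.begin(), points.end()};
}
#+end_src
For example, ~split_points({{0, {0, 4096}}, {1, {4096, 8192}}})~ yields ~{0, 4096, 8192}~, and
swapping which node owns which interval produces the same result.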


The overall idea is to give the scheduler hints so that it can aim at:
- splitting the tasks so that no single task covers parallel iterations from different memory
  banks;
- assigning the tasks to nearby threads, taking the existing thread placement into account.


Some sketches of where the user could declare the association:
1. Extending ~tbb::blocked_range~
#+begin_src C++
tbb::blocked_range(begin, end, grainsize, /*association*/{numa_id : {begin, end [, grain_size?]}, ...})
#+end_src
2. Map of blocked ranges ~{numa_id : tbb::blocked_range}~ passed to parallel algorithm
3. Extending ~tbb::partitioner~
#+begin_src C++
partitioner(/*association*/{numa_id : tbb::blocked_range, ...})
partitioner(/*association*/{/*numa_id*/0 : {/*begin*/0, /*end*/4096 [, grain_size?]}, 1 : {4096, 8192}, ...})
#+end_src
4. Prefilled instance of ~tbb::affinity_partitioner~ or explicit setter
#+begin_src C++
// Proposed: prefill the association at construction time, or set it later
tbb::affinity_partitioner ap1(
{0 : {0, 512}, 1 : {512, 1024}, 2 : {1024, 1536}, 3 : {1536, 2048}}
);

tbb::affinity_partitioner ap2;
ap2.hint({0 : {0, 512}, 1 : {512, 1024}, 2 : {1024, 1536}, 3 : {1536, 2048}});
#+end_src
5. Something close to the API provided by [[https://github.com/oneapi-src/distributed-ranges][Distributed Ranges]] to ensure better interoperability
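
To make the intended end-to-end usage concrete, the pseudocode below combines sketch 4 with
~tbb::parallel_for~; the ~hint()~ setter and the brace-initialized association follow the proposal
above and do not exist in oneTBB today:
#+begin_src C++
// Pseudocode: assume `data` spans 2048 elements whose pages were
// first-touched on four NUMA nodes, 512 elements per node.
tbb::affinity_partitioner ap;
ap.hint({0 : {0, 512}, 1 : {512, 1024}, 2 : {1024, 1536}, 3 : {1536, 2048}});

tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, 2048),
    [&](const tbb::blocked_range<std::size_t>& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            data[i] *= 2.0f;
    },
    ap); // the scheduler may use the hint to keep each sub-range
         // on threads close to its memory bank
#+end_src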

* Open Questions
<TBD>