
[RFC] [multi-core] #[shared] and placement of code and data #211

Closed
japaric opened this issue Jun 29, 2019 · 9 comments
Labels
RFC This issue needs your input! S-accepted This RFC has been accepted but not yet implemented
Milestone: v0.5.0

japaric commented Jun 29, 2019

This RFC only affects the multi-core modes proposed in RFC #204.

Background

Multi-core Cortex-M devices usually have several memory regions, each one
connected to a different bus (AHB port in ARM Cortex-M terms) with the goal of
reducing memory contention. Different cores can access different memory regions
without contention and with predictable performance. Contention occurs when two
or more cores try to access the same memory region at once: one core is given
priority over the other, and the second core perceives a delay in its memory
accesses.

To keep performance predictable it is important that applications carefully
place their resources (code and data) in a way that minimizes contention. To
give an example of the importance of memory placement, consider the following
homogeneous multi-core RTFM application running on the LPC55S69, a device with
two Cortex-M33 cores, one Flash bank and 5 RAM regions.

#[rtfm::app(cores = 2, device = lpc55s6x)]
const APP: () = {
    // response time = interrupt entry -> this breakpoint
    #[task(core = 0)]
    fn a0(cx: a0::Context) {
        asm::bkpt();
    }

    #[init(core = 1, spawn = [a0])]
    fn init(cx: init::Context) {
        cx.spawn.a0();
    }

    // busy loop used to create contention on the Flash
    #[cfg(contention)]
    #[idle(core = 1)]
    fn idle(cx: idle::Context) -> ! {
        loop {
            asm::nop();
        }
    }
};

If all code is placed in Flash then the response time of task a0 -- measured
here as the time from interrupt entry to the breakpoint -- varies depending on
whether the second core is doing any work or sleeping, because the second core
also fetches its instructions from Flash.

Without contention (when cfg(contention) evaluates to false) the response
time is 26 clock cycles; with contention it is 31 clock cycles -- roughly 20%
slower. These numbers correspond to a configuration where the first core is
given higher access priority to the Flash.

Proposal

Specify how the framework places functions and data in memory, in a way that
lets end users minimize memory contention in their applications.

Detailed design

Placement of resources and functions

This RFC proposes that we specify the location of functions and static
variables as follows:

All functions and static [mut] variables that need to be shared between the
cores will be placed in shared memory; everything else will be placed in
memory local to the core that uses it.

In other words, the default is to place items in core-local memory.

Examples of items placed in local memory:

  • User code, all functions inside the #[rtfm::app] module
  • Interrupt handlers, which are generated by the framework and used as hardware
    tasks and task dispatchers -- these jump into user code
  • static mut resources -- all of them are core local
  • static resources that are not shared between cores
  • Task local data, the static mut variables that appear at the beginning of
    the body of a task, #[init] or #[idle]
  • The timer queue, which is always core-local
  • Buffers, free queues and ready queues used for core-local message passing

Examples of items placed in shared memory:

  • static resources shared between cores
  • Buffers, free queues and ready queues used for cross-core message passing
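As an illustrative sketch of the rule (imports elided, as in the other
examples in this RFC):

#[rtfm::app(cores = 2, /* .. */)]
const APP: () = {
    // accessed by tasks on both cores -> shared memory
    static FLAG: AtomicBool = AtomicBool::new(false);

    // owned by core #0 -> core #0 local memory
    static mut COUNTER: u32 = 0;

    #[task(core = 0, resources = [FLAG, COUNTER])]
    fn t0(cx: t0::Context) { /* .. */ }

    #[task(core = 1, resources = [FLAG])]
    fn t1(cx: t1::Context) { /* .. */ }
};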

#[shared]

A special #[shared] attribute will be added to the syntax. This attribute can
only be applied to static [mut] variables within the #[rtfm::app] module. Its
semantics are to override the placement rule defined in the previous section:
the attribute forces the variable to be located in shared memory.

The goal of this attribute is to reduce memory contention. Consider the following
contrived example:

#[rtfm::app(cores = 2, /* .. */)]
const APP: () = {
    #[init(core = 0, spawn = [a1])]
    fn init(cx: init::Context) {
        #[shared]
        static mut Y: [u8; 128] = [0; 128];

        cx.spawn.a1(Y);
    }

    #[task(core = 0)]
    fn a0(cx: a0::Context) {
        // located on the stack
        let mut x = [0; 128];

        // use `x`
    }

    #[task(core = 1)]
    fn a1(cx: a1::Context, y: &'static mut [u8; 128]) {
        // use `y`
    }
};

Without the #[shared] attribute the execution of task a1 could result in high
memory contention. The reason is that its argument y would be located in core
#0's local memory, so any operation on y would cause contention on that memory
region, which is used by all tasks running on core #0 that use stack-allocated
variables, "spill" registers or access core-local static variables.

Using the #[shared] attribute greatly reduces the memory contention that task
a1 can cause, limiting it to the moments when tasks running on core #0
concurrently access shared memory.

This seemingly artificial scenario can easily arise when one uses a lock-free
memory pool or any other form of dynamic memory allocation. The backing storage
for such allocators should be placed in #[shared] memory if allocations are
likely to cross the core boundary, as sketched below.
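A sketch of that pattern -- Pool is a hypothetical stand-in for whatever
lock-free allocator is in use:

#[rtfm::app(cores = 2, /* .. */)]
const APP: () = {
    #[init(core = 0)]
    fn init(cx: init::Context) {
        // backing storage for the memory pool; marked #[shared] because
        // allocations may be sent to tasks running on core #1
        #[shared]
        static mut MEMORY: [u8; 1024] = [0; 1024];

        // hand the shared buffer to the (hypothetical) allocator
        // POOL.grow(MEMORY);
    }
};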

Implementation

The realization of the design varies depending on the multi-core mode being
used.

Homogeneous

In homogeneous multi-core mode all items are placed in shared memory by
default. This applies to all items declared outside the #[rtfm::app] module,
including external crates (dependencies). To override this default the framework
will make use of these custom core-local linker sections:

  • .text_{i}
  • .uninit_{i}
  • .bss_{i}
  • .data_{i}

Where {i} indicates which core-local memory the section should be placed in.
.uninit_{i} is used for buffers (which hold e.g. message payloads), .bss_{i}
for queues (free queue, ready queue, timer queue) and .data_{i} for all
core-local resources -- as there's no 100% reliable way to tell from the AST
whether a constructor evaluates to all zeros.

#[rtfm::app(cores = 2, /* .. */)]
const APP: () = {
    // section = .data_0 (core #0 local)
    static mut X: u64 = 0;
    // section = .data_1 (core #1 local)
    static mut Y: u64 = 0;
    // section = .bss (shared)
    static Z: AtomicU32 = AtomicU32::new(0);

    // section = .text_0 (core #0 local)
    #[init(core = 0)]
    fn init(cx: init::Context) { /* .. */ }

    // section = .text_0 (core #0 local)
    #[idle(core = 0, resources = [X, Z])]
    fn idle(cx: idle::Context) -> ! { /* .. */ }

    // section = .text_1 (core #1 local)
    #[task(core = 1, resources = [Y, Z])]
    fn a1(cx: a1::Context) { /* .. */ }
};

With these linker sections, authors of linker scripts can control the placement
of functions and data. Using the LPC55S69 as an example, one could write the
following linker script to place all shared variables in region SRAM2 and
dedicate regions SRAM0 and SRAM1 to cores #0 and #1, respectively. To work
around the lack of a second Flash bank, the code executed by core #1 is placed
in the SRAM1 region.

MEMORY
{
  FLASH : ORIGIN = 0x00000000, LENGTH = 630K /* for core #0 and shared code */
  SRAM0 : ORIGIN = 0x20000000, LENGTH = 64K  /* for core #0 data */
  SRAM1 : ORIGIN = 0x20010000, LENGTH = 64K  /* for core #1 code and data */
  SRAM2 : ORIGIN = 0x20020000, LENGTH = 64K  /* for shared data */
  /* .. */
}

/* NOTE omitting `ALIGN` instructions for simplicity */
SECTIONS
{
  .text :
  {
    *(.text_0.*);     /* core #0 code */
    *(.text .text.*); /* shared code */
  } > FLASH

  /* core #0 data */
  .data_0 :            { *(.data_0.*); }     > SRAM0 AT > FLASH

  /* core #0 zero-initialized data */
  .bss_0 (NOLOAD) :    { *(.bss_0.*); }      > SRAM0

  /* core #0 uninitialized data */
  .uninit_0 (NOLOAD) : { *(.uninit_0.*); }   > SRAM0

  /* core #1 code */
  .text_1 :            { *(.text_1.*); }     > SRAM1 AT > FLASH

  /* core #1 data */
  .data_1 :            { *(.data_1.*); }     > SRAM1 AT > FLASH

  /* core #1 zero-initialized data */
  .bss_1 (NOLOAD) :    { *(.bss_1.*); }      > SRAM1

  /* core #1 uninitialized data */
  .uninit_1 (NOLOAD) : { *(.uninit_1.*); }   > SRAM1

  /* shared data */
  .data :              { *(.data .data.*); } > SRAM2
  .bss (NOLOAD) :      { *(.bss .bss.*); }   > SRAM2
  .uninit (NOLOAD) :   { *(.uninit.*); }     > SRAM2
}

#[shared]

The effect of the #[shared] attribute on code generation is to not place the
specified variable in a custom (core-local) linker section, so it ends up in
the default, shared sections.

#[rtfm::app(cores = 2, /* .. */)]
const APP: () = {
    #[init(core = 0)]
    fn init(cx: init::Context) {
        // section = .data_0 (core #0 local)
        static mut X: [u8; 1024] = [0; 1024];

        // section = .bss (shared)
        #[shared]
        static mut Y: [u8; 1024] = [0; 1024];
    }
};

Heterogeneous

Heterogeneous multi-core mode is implemented on top of μAMP, whose default is
the opposite of the homogeneous multi-core mode's: all items, including
dependencies, are core-local, and to place something in shared memory one must
opt in using the #[microamp::shared] attribute.

No custom linker sections are used in heterogeneous multi-core mode.

#[shared]

The effect of the #[shared] attribute is to add the #[microamp::shared]
attribute to the generated static mut variable.
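Roughly, the expansion looks like this (a sketch; the names and paths of the
generated items differ):

// what the user writes inside the #[rtfm::app] module
#[shared]
static mut X: u32 = 0;

// what the framework emits in heterogeneous mode
#[microamp::shared]
static mut X: u32 = 0;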

Drawbacks

This complicates the process of writing linker scripts for homogeneous devices,
as the author needs to consider how to best map the many linker sections.
However, an author who wants to keep things as simple as possible can merge all
core-local sections into the default, shared memory sections as shown below.

SECTIONS
{
  /* core-local and shared code */
  .text            : { *(.text .text.* .text_0.* .text_1.*); } > FLASH

  /* core-local and shared data */
  .bss (NOLOAD)    : { *(.bss .bss.* .bss_0.* .bss_1.*); }     > SRAM0
  .data            : { *(.data .data.* .data_0.* .data_1.*); } > SRAM0 AT > FLASH
  .uninit (NOLOAD) : { *(.uninit.* .uninit_0.* .uninit_1.*); }  > SRAM0
}

However, this will result in high memory contention.

Final remarks

Even with the help of the framework it is easy to run into unintended memory
contention in homogeneous multi-core mode, because code sharing is the default.
Consider this example for the LPC55S69, using the linker script from the
"Homogeneous" section:

#[rtfm::app(cores = 2, device = lpc55s6x)]
const APP: () = {
    // section = .text_0 -> FLASH
    #[idle(core = 0)]
    fn i0(cx: i0::Context) -> ! {
        loop {
            // ..

            foo();

            // ..
        }
    }

    // section = .text_1 -> SRAM1
    #[idle(core = 1)]
    fn i1(cx: i1::Context) -> ! {
        loop {
            // ..

            foo();

            // ..
        }
    }
};

// section = .text -> FLASH
fn foo() {
    // ..
}

foo will be placed in Flash because that's the default for this mode. If foo
is not inlined into i1 then core #1 will run some code from Flash, causing
memory contention with core #0, which runs all its code from Flash.

There are some ways around this issue, like placing the shared code (.text) in
SRAM3 to at least never cause contention on Flash (sketched below), or using
instruction caches for the shared code, if the device has them. However, the
best way to solve the issue may be to use the heterogeneous multi-core mode,
even if the device is homogeneous, as that mode doesn't allow sharing of code,
only of data.
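For instance, a variation on the earlier linker script (assuming an SRAM3
region has been declared in the MEMORY block):

/* core #0 code stays in Flash */
.text_0 : { *(.text_0.*); }     > FLASH

/* shared code is loaded into SRAM3 so neither core fetches it from Flash */
.text   : { *(.text .text.*); } > SRAM3 AT > FLASH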


cc @Disasm this may be of interest to you. Out of curiosity, are (instruction)
caches mandatory on SMP RISC-V devices? Do devices like the K210 usually have
caches?

I'm not familiar with (instruction) caches on Cortex-M devices (I think that in
the ARMv7-M line only Cortex-M7 devices have them -- I don't know whether
ARMv8-M devices do) but it seems to me that with the linker script from the
"Homogeneous" section one could (read-only) cache the whole .text section, which
contains all the shared code (and probably also the .rodata section), on the
second core only, to prevent all contention on Flash memory -- at least in the
case of the LPC55S69.

@japaric japaric added the RFC This issue needs your input! label Jun 29, 2019
@japaric japaric added this to the v0.5.0 milestone Jun 29, 2019
Disasm commented Jun 30, 2019

@japaric As far as I can see, the RISC-V specs only say what to do if you have caches, but do not require their presence. However, in real-world devices both I$ and D$ caches are present: K210, FU540 and even FE310.
That's why this Flash memory contention is strange to me: if the flash memory is cached or dual-banked this contention effect should be insignificant. If you really need to take this into account (e.g. for real-time applications), you can calculate worst-case timings for flash access.

The #[shared] attribute looks extremely helpful for non-uniform memory architectures (even with homogeneous mode). For example, the FU540 contains core-local DTIM and ITIMs visible to all the cores, but with non-uniform access times across them. In this case #[shared] may indicate that the corresponding variable shouldn't be placed in the core-local region. Maybe it's also a good idea to add a #[local(core = X)] attribute to assign a variable to one of the core-local regions.

LPC55S6x User Manual mentions something called "FMC flash cache", but without any details.


japaric commented Jul 4, 2019

@Disasm thanks for the info

if the flash memory is cached or dual-banked this contention effect should be insignificant

I agree.

However, in real-world devices both I$ and D$ caches are present: K210, FU540 and even FE310.

Is the I$ cache enabled by default in multi-core devices like the K210? Or does it need some setup after a power on reset?

In any case it seems that devices with caches would be OK with merging the input .text_{i} and .text sections into a single output .text section, provided that the I$ has been enabled for the whole section.

LPC55S6x User Manual mentions something called "FMC flash cache", but without any details.

(That chip is rather new and the manual is a bit lacking and has some errors in some parts.)

The ARMv8-M architecture does define caches and registers to perform cache operations but it seems that only the Cortex-M35P cores have built-in caches; the LPC55S6x has 2 Cortex-M33 cores.

Reading the CLIDR (Cache Level ID Register) on an actual device returned all zeros, which indicates that no (I or D) caches exist at any of the 7 possible levels.

For the LPC54114 (heterogeneous, M4F + M0+), NXP recommends in one of their application notes that one of the cores run all its code from RAM, as the device has only one Flash bank and no caches. I haven't found a similar recommendation / application note for the LPC55S6x.

Maybe it's also a good idea to add a #[local(core = X)] attribute to assign a variable to one of the core-local regions.

Note that this RFC pertains only to items declared within the #[app] module and generated by the macro and everything is already core local by default. Letting the user place resources or tasks in core-local memory seems like it would be wrong in most cases (like placing a resource owned by core #0 in memory local to core #1).

Perhaps what you want is something like rust-embedded/cortex-m-rt#164 that can be used in libraries but that's not tied to a particular architecture? (And you can always use #[link_section] to finely control where things go, provided that you are very careful.)
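For illustration, manually placing a static with #[link_section] might look
like this (a sketch; the section name must match one defined by your linker
script):

use core::mem::MaybeUninit;

// pin a buffer to core #1's uninitialized RAM section
// (".uninit_1" must exist in the linker script)
#[link_section = ".uninit_1"]
static mut SCRATCH: MaybeUninit<[u8; 256]> = MaybeUninit::uninit();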


japaric commented Jul 8, 2019

@korken89 @TeXitoi any thoughts on this RFC? It only involves the experimental multi-core API, which we are allowed to change in backwards-incompatible ways during the v0.5.x releases, so this may not be what we stabilize in the end; but I think the flexibility it allows will let us collect more data while the feature remains experimental (and it's required to get good performance on homogeneous, cache-less devices like the LPC55S69).


korken89 commented Jul 8, 2019

Overall I think this is a great addition!

One question that I'm fairly sure is not an issue but that I got stuck on: #[shared] expands to #[microamp::shared] -- what does this mean in practice?
The thought that comes to my mind is: how will the linker know that a shared variable has the same address in both binaries, given that the linker makes no guarantees about the order of linked variables relative to their order in the original source files?
I guess there is a trick here I am not seeing that fixes the address of shared variables across binaries?


japaric commented Jul 8, 2019

I think we can FCP (merge) this then.


@korken89 the microamp framework does the heavy lifting. This blog post describes how it works, but the TL;DR is that #[microamp::shared] places all variables in a single (input) linker section that is forced (using a linker script) to be kept in each compiled binary (cargo-microamp compiles the source code several times, once for each core). Using a single linker section ensures that no variable is GC-ed or reordered by the linker; cargo-microamp also has a post-build sanity check that inspects the output ELF files and verifies that all the #[shared] variables (which live in the output .shared linker section) have the same addresses and sizes.
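The kind of linker script rule involved might look like this (a sketch; the
memory region name is illustrative):

/* force every #[shared] variable into one output section that is kept
   (not GC-ed) and placed at the same address in every image */
.shared : ALIGN(4)
{
  KEEP(*(.shared));
} > SHARED_RAM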


TeXitoi commented Jul 8, 2019

Even if I'm not aware of these kinds of devices, this proposal seems clean and flexible.

OK for me.


korken89 commented Jul 9, 2019

Thanks for the clarification @japaric!
I'll look more into cargo-microamp to get a better understanding, but generating specific linker scripts should indeed do the trick!

japaric referenced this issue in rtic-rs/rtic-syntax Jul 11, 2019
see japaric/cortex-m-rtfm#211
bors bot referenced this issue in rtic-rs/rtic-syntax Jul 11, 2019
13: implement RFCs 211 and 212 r=japaric a=japaric

japaric/cortex-m-rtfm#211 and japaric/cortex-m-rtfm#212

Co-authored-by: Jorge Aparicio <jorge@japaric.io>

japaric commented Aug 20, 2019

The FCP has passed; this RFC is now in the accepted state. Implementation is in PR #205.

@japaric japaric added the S-accepted This RFC has been accepted but not yet implemented label Aug 20, 2019
bors bot added a commit that referenced this issue Sep 15, 2019
205: rtfm-syntax refactor + heterogeneous multi-core support r=japaric a=japaric

this PR implements RFCs #178, #198, #199, #200, #201, #203 (only the refactor
part), #204, #207, #211 and #212.

most cfail tests have been removed because the test suite of `rtfm-syntax`
already tests what was being tested here. The `rtfm-syntax` crate also has tests
for the analysis pass which we didn't have here -- that test suite contains a
regression test for #183.

the remaining cfail tests have been upgraded into UI tests so we can more
thoroughly check / test the error messages presented to the end user.

the cpass tests have been converted into plain examples

EDIT: I forgot, there are some examples of the multi-core API for the LPC541xx in [this repository](https://github.com/japaric/lpcxpresso54114)

people that would like to try out this API but have no hardware can try out the
x86_64 [Linux port] which also has multi-core support.

[Linux port]: https://github.com/japaric/linux-rtfm

closes #178 #198 #199 #200 #201 #203 #204 #207 #211 #212 
closes #163 
cc #209 (documents how to deal with errors)

Co-authored-by: Jorge Aparicio <jorge@japaric.io>

japaric commented Sep 15, 2019

Done in PR #205

@japaric japaric closed this as completed Sep 15, 2019