Skip to content

Conversation

@robhancocksed
Copy link
Contributor

On the Xilinx MPSoC (Cortex-R5) platform, erratic operation was often seen when an operation which disabled the dcache, such as sys_reboot, was performed. Usually this manifested as an undefined instruction trap due to the CPU jumping to an invalid memory address.

It appears the problem was due to dirty cache lines being present at the time the cache is disabled. Once the cache is disabled, the CPU will ignore the cache contents and read the possibly out-of-date data in main memory. Likewise, since the cache was being cleaned after it was already disabled, if the CPU had already written through changes to some memory locations, cleaning the cache at that point would potentially overwrite those changes with older data.

The fact that the arch_dcache_flush_and_invd_all function was being called to do the cleaning and invalidation also contributed to this problem, because it is a non-inline function which means the compiler will generate memory writes to the stack when the function is called and returns. Corruption of the stack can result in the CPU ending up jumping to garbage addresses when trying to return from functions.

To avoid this problem, the cache is now cleaned prior to the dcache being disabled. This is done by directly calling the L1C_CleanDCacheAll function, which, as it is declared as force inline, should help ensure there are no memory writes, which would create new dirty cache lines, between the cache cleaning and disabling the cache.

Ideally, for maximum safety, the cache cleaning and cache disabling should be done in assembler code, to guarantee that there are no memory writes generated by the compiler during these operations. However, the present change does appear to solve this issue.

On the Xilinx MPSoC (Cortex-R5) platform, erratic operation was often
seen when an operation which disabled the dcache, such as sys_reboot,
was performed. Usually this manifested as an undefined instruction trap
due to the CPU jumping to an invalid memory address.

It appears the problem was due to dirty cache lines being present at the
time the cache is disabled. Once the cache is disabled, the CPU will
ignore the cache contents and read the possibly out-of-date data in main
memory. Likewise, since the cache was being cleaned after it was already
disabled, if the CPU had already written through changes to some memory
locations, cleaning the cache at that point would potentially overwrite
those changes with older data.

The fact that the arch_dcache_flush_and_invd_all function was being
called to do the cleaning and invalidation also contributed to this
problem, because it is a non-inline function which means the compiler
will generate memory writes to the stack when the function is called and
returns. Corruption of the stack can result in the CPU ending up jumping
to garbage addresses when trying to return from functions.

To avoid this problem, the cache is now cleaned and invalidated prior to
the dcache being disabled. This is done by directly calling the
L1C_CleanInvalidateDCacheAll function, which, as it is declared as force
inline, should help ensure there are no memory accesses, which would
populate new cache lines, between the cache cleaning and disabling the
cache.

Ideally, for maximum safety, the cache cleaning and cache disabling
should be done in assembler code, to guarantee that there are no memory
accesses generated by the compiler during these operations. However, the
present change does appear to solve this issue.

Signed-off-by: Robert Hancock <robert.hancock@calian.com>
@robhancocksed robhancocksed force-pushed the cortex-a-r-fix-cache-disable-corruption branch from b5f9482 to 8ec959f Compare April 25, 2025 17:54
@robhancocksed robhancocksed requested a review from JarmouniA April 25, 2025 18:00
@robhancocksed
Copy link
Contributor Author

The twister automated tests are showing a failure of integration.kernel on qemu_cortex_a53/qemu_cortex_a53/smp. However, running the same test locally on this branch passes. I think it is some kind of spurious failure as the binary generated for that test does not even contain/call the code modified here.

@JarmouniA
Copy link
Contributor

However, running the same test locally on this branch passes.

Did you build it with west twister ... or west build ...?

@robhancocksed
Copy link
Contributor Author

However, running the same test locally on this branch passes.

Did you build it with west twister ... or west build ...?

I built and ran the test successfully locally using "west twister".

@kartben kartben merged commit 0e24841 into zephyrproject-rtos:main May 13, 2025
28 checks passed
@robhancocksed robhancocksed deleted the cortex-a-r-fix-cache-disable-corruption branch May 13, 2025 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: ARM ARM (32-bit) Architecture area: Cache

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants