This repository has been archived by the owner on Apr 13, 2019. It is now read-only.

icache not flushed correctly #140

Closed
zeldin opened this issue Jul 11, 2018 · 8 comments

@zeldin

zeldin commented Jul 11, 2018

It appears that the flush_icache VDSO is not working correctly on Freedom U540.
The following test program:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
 
static void *allocate_code(void)
{
  void *addr = aligned_alloc(4096, 4096);
  mprotect(addr, 4096, PROT_EXEC | PROT_READ | PROT_WRITE);
  memset(addr, 0, 4096);
  return addr;
}
 
static void generate_code(void *p, int n)
{
  uint16_t *code = p;
 
  /* Assumes 0 <= n <= 31 */
  code[0] = 0x4501 | (n<<2); /* c.li a0, n */
  code[1] = 0x8082; /* c.jr ra */
 
  __builtin___clear_cache(code, code+16);
}
 
static int call_code(void *p)
{
  return (*(int (*)(void))p)();
}
 
int main()
{
  void *addr = allocate_code();
 
  int a, b;
 
  generate_code(addr, 3);
  a = call_code(addr);
 
  generate_code(addr, 7);
  b = call_code(addr);
 
  printf("Results: %d %d\n", a, b);
 
  return 0;
}

prints "3 3" when run on the HiFive Unleashed, but should print "3 7".

I'm building the kernel from the riscv-linux-4.15 branch.

@palmer-dabbelt
Contributor

I don't have time to look into this right now, but I'm assuming it's probably some sort of mismatch between GCC, glibc, and Linux as to how the fence is supposed to end up getting emitted. Do you mind checking:

  • The generated ELF, to make sure it calls into glibc to perform the fence. You should see something like call __riscv_flush_icache in the generated object. The code that generates this in the current GCC lives in riscv.md and looks like this:
;; Expand in-line code to clear the instruction cache between operand[0] and
;; operand[1].
(define_expand "clear_cache"
  [(match_operand 0 "pmode_register_operand")
   (match_operand 1 "pmode_register_operand")]
  ""
{
#ifdef ICACHE_FLUSH_FUNC
  emit_library_call (gen_rtx_SYMBOL_REF (Pmode, ICACHE_FLUSH_FUNC),
             LCT_NORMAL, VOIDmode, operands[0], Pmode,
             operands[1], Pmode, const0_rtx, Pmode);
#else
  emit_insn (gen_fence_i ());
#endif
  DONE;
})
  • Checking the generated glibc to see if __riscv_flush_icache actually calls into the vDSO. There's some ifunc magic going on here so it can be hard to read, but the code that does it lives in flush-icache.c in glibc and looks like
static int
__riscv_flush_icache_syscall (void *start, void *end, unsigned long int flags)
{
  return INLINE_SYSCALL (riscv_flush_icache, 3, start, end, flags);
}

static func_type
__lookup_riscv_flush_icache (void)
{
  PREPARE_VERSION_KNOWN (linux_version, LINUX_4_15);

  func_type func = _dl_vdso_vsym ("__vdso_flush_icache", &linux_version);

  /* If there is no vDSO entry then call the system call directly.  All Linux
     versions provide the vDSO entry, but QEMU's user-mode emulation doesn't
     provide a vDSO.  */
  if (!func)
    func = &__riscv_flush_icache_syscall;

  return func;
}
  • Checking to see if the kernel's vDSO actually calls into sys_riscv_flush_icache at some point. The code in Linux to handle this lives in sys_riscv.c and looks like
#ifdef CONFIG_SMP
/*
 * Allows the instruction cache to be flushed from userspace.  Despite RISC-V
 * having a direct 'fence.i' instruction available to userspace (which we
 * can't trap!), that's not actually viable when running on Linux because the
 * kernel might schedule a process on another hart.  There is no way for
 * userspace to handle this without invoking the kernel (as it doesn't know the
 * thread->hart mappings), so we've defined a RISC-V specific system call to
 * flush the instruction cache.
 *
 * sys_riscv_flush_icache() is defined to flush the instruction cache over an
 * address range, with the flush applying to either all threads or just the
 * caller.  We don't currently do anything with the address range, that's just
 * in there for forwards compatibility.
 */
SYSCALL_DEFINE3(riscv_flush_icache, uintptr_t, start, uintptr_t, end,
    uintptr_t, flags)
{
    struct mm_struct *mm = current->mm;
    bool local = (flags & SYS_RISCV_FLUSH_ICACHE_LOCAL) != 0;

    /* Check the reserved flags. */
    if (unlikely(flags & ~SYS_RISCV_FLUSH_ICACHE_ALL))
        return -EINVAL;

    flush_icache_mm(mm, local);

    return 0;
}
#endif
  • Checking to see that Linux correctly performs the remote fences. There's some bookkeeping in here to avoid fences on harts that don't need them, and that could be broken. This code lives in smp.c and looks like
/*
 * Performs an icache flush for the given MM context.  RISC-V has no direct
 * mechanism for instruction cache shoot downs, so instead we send an IPI that
 * informs the remote harts they need to flush their local instruction caches.
 * To avoid pathologically slow behavior in a common case (a bunch of
 * single-hart processes on a many-hart machine, ie 'make -j') we avoid the
 * IPIs for harts that are not currently executing a MM context and instead
 * schedule a deferred local instruction cache flush to be performed before
 * execution resumes on each hart.
 */
void flush_icache_mm(struct mm_struct *mm, bool local)
{
    unsigned int cpu;
    cpumask_t others, *mask;

    preempt_disable();

    /* Mark every hart's icache as needing a flush for this MM. */
    mask = &mm->context.icache_stale_mask;
    cpumask_setall(mask);
    /* Flush this hart's I$ now, and mark it as flushed. */
    cpu = smp_processor_id();
    cpumask_clear_cpu(cpu, mask);
    local_flush_icache_all();

    /*
     * Flush the I$ of other harts concurrently executing, and mark them as
     * flushed.
     */
    cpumask_andnot(&others, mm_cpumask(mm), cpumask_of(cpu));
    local |= cpumask_empty(&others);
    if (mm != current->active_mm || !local)
        sbi_remote_fence_i(others.bits);
    else {
        /*
         * It's assumed that at least one strongly ordered operation is
         * performed on this hart between setting a hart's cpumask bit
         * and scheduling this MM context on that hart.  Sending an SBI
         * remote message will do this, but in the case where no
         * messages are sent we still need to order this hart's writes
         * with flush_icache_deferred().
         */
        smp_mb();
    }

    preempt_enable();
}

Thanks, and sorry for the headaches!

@zeldin
Author

zeldin commented Jul 11, 2018

Sure.

The generated ELF, to make sure it calls into glibc to perform the fence

It does. Here is the generated code for the line that calls __builtin___clear_cache:

/home/marcus/hack/hi5cache/main.c:23
   10588:       fe843783                ld      a5,-24(s0)
   1058c:       02078713                addi    a4,a5,32
   10590:       fe843783                ld      a5,-24(s0)
   10594:       4601                    li      a2,0
   10596:       85ba                    mv      a1,a4
   10598:       853e                    mv      a0,a5
   1059a:       7ce140ef                jal     ra,24d68 <__riscv_flush_icache>

Checking the generated glibc to see if __riscv_flush_icache actually calls into the vDSO.

Yes. There is first some code to check if something called cached_func.7305 is non-NULL (I assume this is generated by the INLINE_SYSCALL macro), but if it is NULL the code continues and further down there is indeed a call to _dl_vdso_vsym. I also checked that the binary does contain the string __vdso_flush_icache.

Then I had a clever idea: I ran strace on the binary. This was the result:

execve("./main", ["./main"], 0x3fffacf030 /* 46 vars */) = 0
brk(NULL)                               = 0x75000
brk(0x75f70)                            = 0x75f70
uname({sysname="Linux", nodename="vanille", ...}) = 0
readlinkat(AT_FDCWD, "/proc/self/exe", "/home/marcus/hack/hi5cache/main", 4096) = 31
brk(0x96f70)                            = 0x96f70
brk(0x97000)                            = 0x97000
mprotect(0x77000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
riscv_flush_icache(0x77000, 0x77020, 0) = -1 ENOSYS (Function not implemented)
riscv_flush_icache(0x77000, 0x77020, 0) = -1 ENOSYS (Function not implemented)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
write(1, "Results: 3 7\n", 13Results: 3 7
)          = 13
exit_group(0)                           = ?
+++ exited with 0 +++

Huh? Function not implemented? That doesn't sound right... (The extra interference of ptrace actually made the program output the correct result, hilariously enough :-)

Checking to see if the kernel's vDSO actually calls into sys_riscv_flush_icache at some point. The code in Linux to handle this lives in sys_riscv.c and looks like

Yup, looks exactly the same for me. I'm at 2751b6a, which is almost the tip of the riscv-linux-4.15 branch.

Checking to see that Linux correctly performs the remote fences. There's some bookkeeping in here to avoid fences on harts that don't need them, and that could be broken. This code lives in smp.c and looks like

Well, my test program is single-threaded so that should not be an issue. The ENOSYS seems like a much more likely candidate. :-)
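
If it helps to isolate things, the kernel side can also be poked directly, bypassing both glibc and the vDSO. A minimal sketch, assuming the installed kernel headers define __NR_riscv_flush_icache (if they don't, the number would have to be taken from the kernel tree):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
  /* flags == 0 asks for a flush visible to all threads; the address
     range is currently ignored by the kernel (see the comment in
     sys_riscv.c above), so NULL/NULL is fine for this probe. */
  long ret = syscall(__NR_riscv_flush_icache, NULL, NULL, 0UL);
  if (ret < 0)
    printf("riscv_flush_icache: %s\n", strerror(errno));
  else
    printf("riscv_flush_icache: ok\n");
  return 0;
}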

@zeldin
Author

zeldin commented Jul 11, 2018

I wrote another test program, which I compiled with -Wl,-wrap,_dl_vdso_vsym in order to check that _dl_vdso_vsym is actually called (man, a working native gdb would be swell... ;-)):

#include <stdio.h>

void *__wrap__dl_vdso_vsym(const char *a, void *b)
{
  void *p;
  void *(*real_dl_vdso_vsym) (const char *name, void *version) =
    (void*)0x1057e;

  printf(">>VDSO_VSYM \"%s\"\n", a);
  p = real_dl_vdso_vsym(a, b);
  printf("<<VDSO_VSYM \"%s\" %p\n", a, p);
  return p;
}

int main()
{
  __builtin___clear_cache(main, main);

  return 0;
}

The address 0x1057e was taken from an nm of the resulting binary (the actual symbol is hidden...).
The result of running this program was:

>>VDSO_VSYM "__vdso_flush_icache"
<<VDSO_VSYM "__vdso_flush_icache" 0x200000083c

So yes, _dl_vdso_vsym("__vdso_flush_icache") is called, and it does return non-NULL. Checking with strace reveals that the failing syscall is still invoked.
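
For completeness, one way to cross-check that the returned pointer really lies inside the vDSO mapping is to compare it with the base address the kernel hands over in the auxiliary vector. A small sketch using getauxval():

#include <stdio.h>
#include <elf.h>
#include <sys/auxv.h>

int main(void)
{
  /* AT_SYSINFO_EHDR is the load address of the vDSO's ELF header; a
     symbol returned by _dl_vdso_vsym() should sit somewhere within the
     vDSO pages just above it. */
  unsigned long vdso_base = getauxval(AT_SYSINFO_EHDR);
  printf("vDSO base: %#lx\n", vdso_base);
  return 0;
}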

@zeldin
Author

zeldin commented Jul 11, 2018

I looked at the disassembly of __riscv_flush_icache a little more closely, and what it does is this:

40(sp) = __stack_chk_guard
a5 = cached_func.7305
if a5 != NULL {
  label1:
  (*a5)(arg0, arg1, arg2)
  if 40(sp) != __stack_chk_guard
     __stack_chk_fail()
  return
}
var.hash = 182943605
var.hidden = 1
var.name = "LINUX_4.15"
if var.hash != _dl_elf_hash (var.name)
  __assert_fail("linux_version.hash == _dl_elf_hash (linux_version.name)", "../sysdeps/unix/sysv/linux/riscv/flush-icache.c", 29, "__lookup_riscv_flush_icache")
var.filename = NULL
a5 = _dl_vdso_vsym("__vdso_flush_icache", &var)
if a5 != NULL {
 cached_func.7305 = a5
 goto label1
}
abort()

So unlike the code you pasted, there is no fallback to __riscv_flush_icache_syscall: the function either calls the vDSO or abort()s.

Everything points to the actual vDSO being called, which in turn makes an unsupported system call. I reran strace with -i, and sure enough:

[0000002000000844] riscv_flush_icache(0x10550, 0x10550, 0) = -1 ENOSYS (Function not implemented)

0x2000000844 is 8 bytes after the address returned by _dl_vdso_vsym().

@zeldin
Author

zeldin commented Jul 11, 2018

Hm, I think I see the problem. In arch/riscv/kernel/syscall_table.c, this happens:

#undef __SYSCALL
#define __SYSCALL(nr, call)     [nr] = (call),

void *sys_call_table[__NR_syscalls] = {
        [0 ... __NR_syscalls - 1] = sys_ni_syscall,
#include <asm/unistd.h>
};

arch/riscv/include/asm/unistd.h does indeed include <uapi/asm/syscalls.h>, which contains the definition of the syscall. However, arch/riscv/include/uapi/asm/syscalls.h also contains an include guard _ASM__UAPI__SYSCALLS_H, and because the file has already been included once before, the syscall never gets added to the sys_call_table. Removing the include guard should fix the issue.
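
The mechanism in miniature, as a two-file sketch (hypothetical file and symbol names) showing how the guard turns the second, table-building inclusion into a no-op:

/* syscalls_demo.h: hypothetical stand-in for uapi/asm/syscalls.h */
#ifndef SYSCALLS_DEMO_H            /* the problematic include guard */
#define SYSCALLS_DEMO_H
__SYSCALL(0, demo_call)
#endif

/* table_demo.c: hypothetical stand-in for syscall_table.c */
#include <stdio.h>

/* First inclusion (standing in for the earlier one pulled in via
   <asm/unistd.h>): __SYSCALL just declares the handler. */
#define __SYSCALL(nr, call) void call(void);
#include "syscalls_demo.h"

void demo_call(void) { }

/* Second inclusion: __SYSCALL is redefined to emit table entries, but
   the guard is already defined, so nothing is emitted. */
#undef __SYSCALL
#define __SYSCALL(nr, call) [nr] = (call),
void *table[1] = {
        [0] = 0,                   /* the sys_ni_syscall analogue */
#include "syscalls_demo.h"
};

int main(void)
{
  /* With the guard in place this prints a null pointer: the entry was
     never filled in.  Drop the guard and it prints &demo_call instead. */
  printf("table[0] = %p\n", table[0]);
  return 0;
}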

@zeldin
Author

zeldin commented Jul 11, 2018

I removed the include guard from arch/riscv/include/uapi/asm/syscalls.h and rebuilt the kernel. Now the test program works correctly. 😀

@aswaterman
Contributor

aswaterman commented Jul 12, 2018 via email
