Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Null pointer check for destination to store new vl of Fault-Only-First Loads intrinsics #153

Closed
pcwang-thead opened this issue Jun 21, 2022 · 13 comments

Comments

@pcwang-thead
Copy link

There are two outputs in Fault-Only-First Loads intrinsics: 1) loaded vector; 2) new vl. For example: vint8mf8_t vle8ff_v_i8mf8 (const int8_t *base, size_t *new_vl, size_t vl);, it returns a vector of vint8mf8_t and stores new vl to new_vl.

Sometimes, we just want to ignore the new vl. In previous GCC implementation[1], we won't store new vl to destination new_vl if it is a null pointer. It is common that user will pass null pointer and expect that new vl will be thrown away.

if (new_vl)								\
    {									\
    if (__riscv_xlen == 32)						\
      *new_vl = __builtin_riscv_vreadvlsi ();				\
    else								\
      *new_vl = __builtin_riscv_vreadvldi ();				\
    }

However, we found LLVM has different behavior when we switched to LLVM and some of ours codebases got compilation failures or runtime errors. The reason is that LLVM doesn't do any null pointer checking for destination to store new vl and store to null pointer is an undefined behavior. See also [2] for more previous discussions.

My proposal is that we should do null pointer check for destination to store new vl of Fault-Only-First Loads and document it.
Or we should add a note for users that new_vl should not be a null pointer as @zakk0610 proposed.

References:
[1] https://github.com/riscv-collab/riscv-gcc/blob/riscv-gcc-10.1-rvv-dev/gcc/config/riscv/riscv_vector.h#L289
[2] https://reviews.llvm.org/D126461

@kito-cheng
Copy link
Collaborator

I seconded your proposal, null-pointer checking could be easier removed in the optimization flow, so I am prefer this way (oh, okay, and the GCC implementation is I did :P ), and I believe there is practical scenario since you guys hit this issue.

@topperc
Copy link
Collaborator

topperc commented Jun 21, 2022

What is the use case for not wanting the new_vl value? Don't you need to know how many elements were read to be able to use the loaded data?

@zhongjuzhe
Copy link

zhongjuzhe commented Jun 21, 2022

Trying the solution you gave, you will end up with the assembly as below:

test_vleff_save_new_vl_to_not_null:
vsetvli zero,a3,e8,mf8,ta,mu
vle8ff.v v1,0(a1)
beq a2,zero,.L4
csrr a5, vl
sd a5,0(a2)
.....

I don't think users want to see the branch instructions here. If you really have the cases that users don't want the new_vl value. I think we should try another solution.

Actually,I tried the Clang/LLVM. It seems to work well: https://godbolt.org/z/8r5vznc3o
Please correct me if I misunderstand something, thanks.

@nick-knight
Copy link
Collaborator

I have the same question as Craig. Can anyone please provide a real-world use-case where new_vl is unused by the C intrinsics programmer?

@pcwang-thead
Copy link
Author

What is the use case for not wanting the new_vl value? Don't you need to know how many elements were read to be able to use the loaded data?

I would like to use Stream Cipher(https://en.wikipedia.org/wiki/Stream_cipher) as an example, but after thinking, I think it isn't a typical use case. And I asked my colleagues who use these intrinsics, the answer is that they barely use Fault-Only-First loads, let alone a real-world use-case where new_vl is unused. Most of our failed cases are for extreme testing. So, I have to say, it's hard to find one.

As for Stream Cipher, there are about two important parameters: 1) a byte stream(file stream, network stream, etc) to encrpyt/decrypt, donotes plaintext and ciphertext. 2) an infinite(theoretically) secret key stream, donates key. The encrpyt/decrypt operations are just an XOR between plaintext/ciphertext and key, that is, ciphertext = plaintext XOR key, plaintext = ciphertext XOR key. This can be used for Full-Disk Encryption or something like Transparent Data Encryption, in which we will encrypt/decrypt a whole memory block and we don't really care how many bytes we have processed. But I think this can be done without Fault-Only-First loads.

@pcwang-thead
Copy link
Author

Trying the solution you gave, you will end up with the assembly as below:

test_vleff_save_new_vl_to_not_null: vsetvli zero,a3,e8,mf8,ta,mu vle8ff.v v1,0(a1) beq a2,zero,.L4 csrr a5, vl sd a5,0(a2) .....

I don't think users want to see the branch instructions here. If you really have the cases that users don't want the new_vl value. I think we should try another solution.

That is because we don't know whether new_vl is null or not, so the branch instruction can't be optimized. See also my previous comment: https://reviews.llvm.org/D126461#3569155.

Actually,I tried the Clang/LLVM. It seems to work well: https://godbolt.org/z/8r5vznc3o Please correct me if I misunderstand something, thanks.

As you can see, the assembly are wrong for test_vleff_save_new_vl_to_null and test_vleff_save_new_vl_to_cross_func_nullptr, instructions after vleff have been deleted. As a result, an infinite recursion(crashed with stack overflow) could be constructed by accident, like https://godbolt.org/z/Yq4b8j7ad:

#include <stddef.h>
#include <riscv_vector.h>

int8_t a[]={1,2,3};

vint8mf8_t __attribute__((noinline))
test_vleff_save_new_vl_to_null(const int8_t *base, size_t vl) {
  return vle8ff_v_i8mf8(base, NULL, vl);
}

int main(){
  test_vleff_save_new_vl_to_null(a, 3);
  return 0;
}

The assembly are:

test_vleff_save_new_vl_to_null:         # @test_vleff_save_new_vl_to_null
        vsetvli zero, a1, e8, mf8, ta, mu
        vle8ff.v        v8, (a0)
# recursion here.
main:                                   # @main
        lui     a0, %hi(a)
        addi    a0, a0, %lo(a)
        li      a1, 3
        call    test_vleff_save_new_vl_to_null

@pcwang-thead
Copy link
Author

I tried to implement strlen example in rvv spec doc(https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#unit-stride-fault-only-first-loads), I think both strlen_1 and strlen_2 have the same behavior if I understand and implement correctly (correct me if I am wrong).

https://godbolt.org/z/cvTaYn5Tn

#include <riscv_vector.h>

size_t strlen_1(const char *str) {
  int first;
  size_t gvl = vsetvlmax_e8m8();
  const char *start = str;
  do {
    vint8m8_t v = vle8ff_v_i8m8(str, NULL, gvl);
    vbool1_t m = vmseq_vx_i8m8_b1(v, '\0', gvl);
    first = vfirst_m_b1(m, gvl);
    str += gvl;
  } while (first < 0);
  return (str - start) + (first - gvl);
}

size_t strlen_2(const char *str) {
  int first;
  size_t new_vl;
  size_t gvl = vsetvlmax_e8m8();
  const char *start = str;
  do {
    vint8m8_t v = vle8ff_v_i8m8(str, &new_vl, gvl);
    vbool1_t m = vmseq_vx_i8m8_b1(v, '\0', gvl);
    first = vfirst_m_b1(m, gvl);
    str += new_vl;
  } while (first < 0);
  return (str - start) + (first - new_vl);
}

I don't know if it is a normal use case, but I think this may be an example where new_vl is unused and we just want Fault-Only-First feature to allow speculative loads.

@nick-knight
Copy link
Collaborator

nick-knight commented Jun 22, 2022

Thanks for taking the time to write up this example. I see two problems with strlen_1 which I believe cause its behavior to differ from strlen_2.

The first problem is this code segment:

vint8m8_t v = vle8ff_v_i8m8(str, NULL, gvl);
vbool1_t m = vmseq_vx_i8m8_b1(v, '\0', gvl);
first = vfirst_m_b1(m, gvl);

Suppose that the V-register allocated to v is all zeros on function entry, the FF load returns "early" (with new_vl < gvl), and the length-new_vl substring does not contain \0. In this scenario, strlen_1 may return an incorrect value for the string length (depending on tail policy and how tail-agnostic is implemented).

Of course, you could avoid this problem by, say, initializing v to (say) all-ones. Suppose this is the case. Then the second problem becomes this code segment:

str += gvl;

Suppose that the true end of the string occurs between new_vl and gvl. You are bumping the pointer past the end of the string, so the subsequent FF load will start after the end of the string. Then strlen_1 will return an incorrect value for the string length.

I don't see a simple way to avoid both of these problems without checking new_vl. Perhaps you could initialize v with a sentinel char that we can use to determine new_vl (with extra cost), although I suspect this violates assumptions of libc strlen, since the user must ensure their string excludes the sentinel.

EDIT: @topperc just pointed out to me in a personal communication, vector loads have the following property:

Load instructions may overwrite active destination vector register group elements past the element index at which the trap is reported. Similarly, fault-only-first load instructions may update active destination elements past the element that causes trimming of the vector length (but not past the original vector length). The values of these spurious updates do not have to correspond to the values in memory at the addressed memory locations.

So the mitigations I proposed are probably insufficient. This doesn't change my conclusion that strlen_1 is broken.

@zakk0610
Copy link
Collaborator

You also can see spec example https://github.com/riscv/riscv-v-spec/blob/master/example/strlen.s#L12, it reads the new_vl to bump the pointer as Nick mention above.

@nick-knight
Copy link
Collaborator

Similarly to the exp argument of frexp, I don't think the RVV intrinsics API needs to explicitly mention the case when new_vl is null: I think it is clear from the API that this risks undefined behavior. This implies that both GCC's and LLVM's implementations are conformant.

If, in the future, we discover a use-case for null new_vl, then we can extend the API (and LLVM's implementation) to allow this, without breaking existing code. Or, if we decide it is important to mandate a certain type of runtime exception on null new_vl, then we can again do so (changing GCC's implementation) without breaking existing code, because code that depends on undefined behavior is arguably broken already.

So my suggestion is to not change anything right now.

@pcwang-thead
Copy link
Author

After thinking, I agree that there are some differences between strlen_1 and strlen_2, but not so deadly.

The first problem is this code segment:

vint8m8_t v = vle8ff_v_i8m8(str, NULL, gvl);
vbool1_t m = vmseq_vx_i8m8_b1(v, '\0', gvl);
first = vfirst_m_b1(m, gvl);

Suppose that the V-register allocated to v is all zeros on function entry, the FF load returns "early" (with new_vl < gvl), and the length-new_vl substring does not contain \0. In this scenario, strlen_1 may return an incorrect value for the string length (depending on tail policy and how tail-agnostic is implemented).

Of course, you could avoid this problem by, say, initializing v to (say) all-ones. Suppose this is the case.

Under your given hypothesis, the behavior of strlen_1 and strlen_2 are (elements of v shows in tables, trap is where we get a trap, all values after trap are values of spurious updates):

  1. If the values of spurious updates are all zeros.
a b c d e trap(0) 0 0

Both strlen_1 and strlen_2 return a length 5 that string ends at trap position(which means we consider a trap as '\0').

  1. If the values of spurious updates are not all zeros(but with at least one).
a b c d e trap(f) g 0

Both strlen_1 and strlen_2 return a wrong length 7 that string ends at position after trap.

  1. If the values of spurious updates are all non-zeros.
a b c d e trap(f) g h

For strlen_2, the next FF load will start at position where caused a trap in previous FF load(that is f), so the trap is actually taken at this time since element 0 raises an exception as shown below.

trap(f) g h i j k l m

For strlen_1, the next FF load will start at position after h, which has a huge possibility that it will cause a trap since we got a trap before. This is the difference between strlen_1 and strlen_2.

trap(i) j k l m n o p

Actually, I may have implemented strlen in the wrong way. I think it should be like:

size_t strlen_3(const char *str) {
  int first;
  size_t new_vl;
  size_t gvl = vsetvlmax_e8m8();
  const char *start = str;
  do {
    vint8m8_t v = vle8ff_v_i8m8(str, &new_vl, gvl);
    vbool1_t m = vmseq_vx_i8m8_b1(v, '\0', new_vl); // pass new_vl here.
    first = vfirst_m_b1(m, new_vl); // pass new_vl here.
    str += new_vl;
  } while (first < 0);
  return (str - start) + (first - new_vl);
}

For strlen_3, there is no difference for situation 1, 2 and 3. strlen_3 will start a new FF load and take a trap for element 0 at position where we got a trap in previous FF load, which is the same as strlen in libc I think.

Then the second problem becomes this code segment:

str += gvl;

Suppose that the true end of the string occurs between new_vl and gvl. You are bumping the pointer past the end of the string, so the subsequent FF load will start after the end of the string. Then strlen_1 will return an incorrect value for the string length.

I think it is impossible that the true end of the string occurs after an address which causes a trap, isn't it?

@topperc
Copy link
Collaborator

topperc commented Jun 23, 2022

I think it is impossible that the true end of the string occurs after an address which causes a trap, isn't it?

Suppose the end of the string is in a page that belongs to your process, but is not currently backed by physical memory. It has been swapped to disk and needs to be paged back in when accessed. Here's what will happen.

Suppose the start of that page occurs somewhere after element 0. The load will detect the page fault for the unavailable page. The vl will be trimmed to the elements that were able to be read and returned in new_vl. A well implemented loop will see that the 0 wasn't found in the first new_vl elements. The pointer will be advanced by new_vl and a new load issued on the next loop iteration. Now the start of the unmapped page is in element 0. This will trap to the operating system. The operating system will fetch the page from disk and put it into physical memory. The user process will be restarted and will reissue the load with a vstart of 0. This time the load will succeed without trapping since the page has now been mapped by the OS.

The fault only first load can't tell the difference between a page that the process is allowed to access but isn't mapped and one that the process is not allowed to access. The vl will be trimmed for either case. It's not until the start of the page is at element 0 that the OS will get involved to distinguish the two cases.

@pcwang-thead
Copy link
Author

I think it is impossible that the true end of the string occurs after an address which causes a trap, isn't it?

Suppose the end of the string is in a page that belongs to your process, but is not currently backed by physical memory. It has been swapped to disk and needs to be paged back in when accessed. Here's what will happen.

Suppose the start of that page occurs somewhere after element 0. The load will detect the page fault for the unavailable page. The vl will be trimmed to the elements that were able to be read and returned in new_vl. A well implemented loop will see that the 0 wasn't found in the first new_vl elements. The pointer will be advanced by new_vl and a new load issued on the next loop iteration. Now the start of the unmapped page is in element 0. This will trap to the operating system. The operating system will fetch the page from disk and put it into physical memory. The user process will be restarted and will reissue the load with a vstart of 0. This time the load will succeed without trapping since the page has now been mapped by the OS.

The fault only first load can't tell the difference between a page that the process is allowed to access but isn't mapped and one that the process is not allowed to access. The vl will be trimmed for either case. It's not until the start of the page is at element 0 that the OS will get involved to distinguish the two cases.

Oh, I get it! Thanks! I think I missed something here.

@nick-knight
I discussed with my colleagues and we agree that runtime null pointer check is not necessary if new_vl won't be null.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants