Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help debugging dot product #44

Closed
stevobailey opened this issue Mar 15, 2022 · 3 comments
Closed

Help debugging dot product #44

stevobailey opened this issue Mar 15, 2022 · 3 comments

Comments

@stevobailey
Copy link
Contributor

stevobailey commented Mar 15, 2022

Can you help me debug a dot product issue? I am taking two vectors, 9 16-bit elements each, and performing a dot product on them. I am using a 16-bit to 32-bit multiply, followed by a regular reduction sum (not width widening). Below is the relevant assembly.

800129cc <vect_dotProduct>:
800129cc: c500f757            vsetivli  a4,1,e32,m1,ta,mu
800129d0: 5e003457            vmv.v.i v8,0 
800129d4: c50d                  beqz  a0,800129fe <vect_dotProduct+0x32>
800129d6: 04057757            vsetvli a4,a0,e8,m1,ta,mu
800129da: 04877057            vsetvli zero,a4,e16,m1,ta,mu
800129de: 0205d487            vle16.v v9,(a1)
800129e2: 02065507            vle16.v v10,(a2)
800129e6: ee952657            vwmul.vv  v12,v9,v10
800129ea: 01107057            vsetvli zero,zero,e32,m2,tu,mu
800129ee: 02c42457            vredsum.vs  v8,v12,v8
800129f2: 00171793            slli  a5,a4,0x1
800129f6: 95be                  add a1,a1,a5
800129f8: 8d19                  sub a0,a0,a4
800129fa: 963e                  add a2,a2,a5
800129fc: fd69                  bnez  a0,800129d6 <vect_dotProduct+0xa>
800129fe: c500f057            vsetivli  zero,1,e32,m1,ta,mu
80012a02: 0206e427            vse32.v v8,(a3)
80012a06: 8082                  ret

I see it loading the vectors correctly, then multiplying them correctly. However, when it gets to the reduction sum, the pipe_in_ctrl_i.vl_part_0 signal is high. This means the following line of RTL never actually adds the elements into the result. I see the elements appearing one-by-one in the elem_q signal, but the result stays 0.

result_d = ~pipe_in_ctrl_i.vl_part_0 ? (elem_q + reduct_val) : reduct_val;

What is the next step in debugging this? What sets the vl_part_0 signal high? Thanks!

@michael-platzer
Copy link
Contributor

Thanks for reporting this and for the detailed analysis!

The vl_part and vl_part_0 signals encode the VL (vector length) for the part of the vector currently being processed. They are derived from the vl_q and vl_0_q signals in vproc_core.sv.

The encoding is as follows: The value in vl_q is the vector length in bytes (rather than in elements) minus 1. As an example, vsetivli t0,3,e16,m1 (i.e., selecting a vector length of 3 for 16-bit elements) would result in vl_q being set to 5 (3*2-1). Since a VL of 0 cannot be encoded in vl_q, the extra bit vl_0_q is used as a flag to indicate that the content of vl_q is invalid and the vector length is 0 instead.

When one of the execution units processes a part of a vector register, a vl_part and vl_part_0 signal are derived from vl_q and vl_0_q. The encoding is the same, except that these signals only cover the part that is currently being processed (so vl_part_0 would initially be low and vl_part indicates how many bytes of the current part are still covered by the VL setting and once the part being processed is beyond the VL, then vl_part_0 is set).

I initially selected that (rather unintuitive) encoding because it allowed for some optimizations, although I am not sure if that still applies. Maybe it might be better to switch to a simpler encoding for VL.

I ran your code and the reason why vl_part_0 is high is that vl_0_q is set in vproc_core.sv. That in turn is because with some recent refactoring I moved a conditional branch in the code which updates vl_q and vl_0_q upon a vsetvl instruction and forgot to take the special case into account that applies when both rd and rs1 are x0 (keeping the current VL). That is why the instruction vsetvli zero,zero,e32,m2,tu,mu incorrectly sets vl_0_q.

Apologies for that, I fixed it now and hopefully your code should work now.

@stevobailey
Copy link
Contributor Author

Thanks, looks good on my end. Is there an easy way to add a test case for this? How do you generate your test code and data? I'm happy to create PRs for new tests for all these little bugs as I find them.

@michael-platzer
Copy link
Contributor

That would be awesome! New test cases can be added by creating a new assembly (*.S) file in one of the sub-directories of test/. Currently, most of them are unit tests for the individual execution units, grouped into sub-directories named according to the respective unit.

The test/csr/ sub-directory is intended to contain tests for instructions manipulating the vector CSR. Currently, it only contains two test cases. A new test case for checking that the special vsetvl variant for keeping the current VL is handled correctly would best fit under this directory.

The way the test cases work is by executing the main function in the assembly file, which usually reads data from the memory section between vdata_start and vdata_end, modifies it, and writes it back (although some test cases might only write to that section). The test framework then checks whether the final content of that memory section corresponds to the expected reference, which is also encoded in the *.S file (it is the content between vref_start and vref_end).

A way to implement a test case for the special vsetvl variant would be to first set the VL to some value using a regular vsetvl, then use the special variant and then verify that subsequent vector instructions still process the expected number of vector elements. For instance, the following instructions should overwrite the first 11 bytes after vdata_start with 0x5a.

    la              a0, vdata_start

    li              t0, 11
    vsetvli         t0, t0, e16,m2

    vsetvli         x0, x0, e8,m1

    li              t0, 0x5a
    vmv.v.x         v0, t0
    vse8.v          v0, (a0)

Therefore, in order to verify that this code has executed correctly, the memory content between vref_start and vref_end should match the initial content between vdata_start and vdata_end expect that the first 11 bytes are replaced by 0x5a. The initial content between vdata_start and vdata_end can be any values, either generated randomly or simply copied over from another test file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants