Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable libsleefdft and libsleefquad on riscv64 #503

Merged
merged 1 commit into from
Mar 7, 2024

Conversation

sh1boot
Copy link
Contributor

@sh1boot sh1boot commented Jan 17, 2024

Continuation of #496

Mostly just the in-fill of src/arch/helperrvv.h.

src/arch/helperrvv.h Outdated Show resolved Hide resolved
src/arch/helperrvv.h Outdated Show resolved Hide resolved
src/arch/helperrvv.h Outdated Show resolved Hide resolved
src/arch/helperrvv.h Outdated Show resolved Hide resolved
src/arch/helperrvv.h Outdated Show resolved Hide resolved
@sh1boot
Copy link
Contributor Author

sh1boot commented Jan 20, 2024

Enabling inline headers will come next: rivosinc#4

It refactors this code a little bit, but I assume it's better to draw a line under this before moving on.

src/arch/helperrvv.h Outdated Show resolved Hide resolved
@luhenry
Copy link
Contributor

luhenry commented Jan 22, 2024

@blapie could we please get a review on that, IMO it should be ready for RISC-V. Also it's passing on CI: https://github.com/rivosinc/sleef/actions/runs/7596179815

@blapie
Copy link
Collaborator

blapie commented Jan 22, 2024

Hi! Will have a look this week! Thanks for the contribution already!

Copy link
Collaborator

@blapie blapie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible to me, but might need to have a second look because atm I'm mostly just trusting you on the technicalities.

src/dft/vectortype.h Outdated Show resolved Hide resolved
.github/workflows/build_and_test.yml Show resolved Hide resolved
@blapie blapie mentioned this pull request Jan 25, 2024
@sh1boot
Copy link
Contributor Author

sh1boot commented Jan 30, 2024

Soliciting review from @ericlove and @GlassOfWhiskey. I understand you were contributors at the beginning of this PR.

src/dft/vectortype.h Outdated Show resolved Hide resolved
@sh1boot
Copy link
Contributor Author

sh1boot commented Feb 3, 2024

I'm not sure about some of the cast/truncate implementations, here. vint is a different size from vmask (not sure why, is that ok?), and when casting between the two there are a mix of different policies for which bits to throw away.

Unfortunately if I change the implementation the tests still pass.

What's the expectation for functions that go between vint and vmask?

This has come up where I've extended the logic for quad support, where I did indeed get test failures, but only in RVVM2 tests, where the truncation frees a whole register which the compiler can reuse. In RVVM1 bugs can be hidden because vint uses half a register and if you truncate and un-truncate the data can survive.

@blapie
Copy link
Collaborator

blapie commented Feb 6, 2024

I'm not sure about some of the cast/truncate implementations, here. vint is a different size from vmask (not sure why, is that ok?), and when casting between the two there are a mix of different policies for which bits to throw away.

What do you mean by "there are a mix of different policies"? Do you mean cross vector extensions in SLEEF?
Not sure, if vint and vmask are meant to be of the same size or not, each vector extension could handle them differently. SVE for instance uses the same datatype for both, but it drops the most significant bits I believe, maybe you could use the same approach for truncating/casting?

Unfortunately if I change the implementation the tests still pass.

What's the expectation for functions that go between vint and vmask?

I'm not aware of any particular expectation, as long as the choice is consistent, but my understanding of the dft part of the library is limited.

@ericlove
Copy link
Contributor

ericlove commented Feb 6, 2024

I'm not sure about some of the cast/truncate implementations, here. vint is a different size from vmask (not sure why, is that ok?), and when casting between the two there are a mix of different policies for which bits to throw away.

Sorry for taking so long to join this conversation, but just thought it might be useful to chime in with my memory of why vint and vmask are the way they are in the RVV port (warning: this memory is reconstructed because it was long paged out...take it with a grain of salt).

The vmask type is constrained by the definition of vcast_vm_i_i to require 64-bit elements, even in single precision code in sleefsimdsp.c. Unfortunately, this interacts badly with one of the design principles behind the RVV port, which is that all types should have the same VLMAX (VECTLENSP here), meaning that, for SP (assuming RVVM1 for example), vmask needs to have LMUL=2, while for DP, where the base type vdouble (vfloat64m1_t) already has 64-bit elements, it needs LMUL=1.

Meanwhile, vint follows a different logic. It is used only in the DP codes, but for some reason is constrained to have 32-bit elements. I no longer know why this is--either I ran into bugs in the original port when trying to widen it to 64 bits, or you have just found an opportunity to fix a major incoherence in the RVV design; you'll have to try it out to know which (and either way, we can certainly add the glaringly absent comment that should accompany it). But given this is the case now, it means that to keep VLMAX constant at VECTLENDP, vint needs to have half the LMUL of other 64-bit types, including vmask.

Hopefully one of these constraints I described is unnecessary, and can be relaxed to simplify both the original RVV design, and your work on the DFT extensions.

@sh1boot
Copy link
Contributor Author

sh1boot commented Feb 9, 2024

To clarify what I mean about 'different policies', here's the list of conversions to or from the vmask type that I found (supposing a configuration where m1 represents the minimum 128 bits):

vreinterpret_vf_vm: vncvt_x  256 -> 128
vreinterpret_vm_vf: vwcvtu_x  128 -> 256
vreinterpret_vd_vm: bitcast  128 -> 128
vreinterpret_vm_vd: bitcast  128 -> 128
vreinterpret_vm_vi64: bitcast  128 -> 128
vreinterpret_vi64_vm: bitcast  128 -> 128
vreinterpret_vm_vu64: bitcast  128 -> 128
vreinterpret_vu64_vm: bitcast  128 -> 128

vcast_vm_i64: splat  64 -> 128
vcast_vm_u64: splat  64 -> 128
vcast_vm_i_i: splat  (32+32) -> 128  or (32+32) -> 256 for single-precision
vcast_vm_vi: vwcvt_x  64 -> 128
vcastu_vm_vi: vwsll_x  64 -> 128
vcast_vi_vm: vncvt_x  128 ->64
vcastu_vi_vm: vnsrl 32  128 -> 64

Here I think vncvt_x will truncate on a per-lane basis (64-bit down to 32-bit for both single and double precision configurations), vwcvt_x will sign extend (and vwcvtu_x will zero-extend) on a per-lane basis (32-bit up to 64-bit), and vnsrl 32 will truncate on a per-lane basis like vncvt but selecting the top 32 bits instead of the bottom 32 bits. splat just repeats a 64-bit pattern until the destination is full.

Though that's in my local work tree after I've refactored a bit. Comparing to the version in the current pull request, I see two implementations for vcast_vi_vm(), and the single-precision one is a truncate operation which is definitely wrong. However, I can't think where that came from (evidently I added it but I don't see why I would because I don't think it's used?).

vcast_vm_i_i gives me pause, because it takes a pair of 64-bit inputs but still only creates a 64-bit pattern out of those before splatting. [edit] looking at the usage, it's clearly expected that the arguments are truncated to 32-bit, even though the interface accepts 64-bit.

It looks like vcastu_vi_vm(vcastu_vm_vi(x)) would always return zero, which seems surprising as well. [edit] This is fine, I overlooked the vsll outside of the vwcvt_x.

But all the tests pass.

I'll try rationalising the size of vmask and vint for single-precision and see which options work. If there's a more conventional arrangement then I guess I'll switch to that instead.

@sh1boot
Copy link
Contributor Author

sh1boot commented Feb 10, 2024

The vmask type is constrained by the definition of vcast_vm_i_i to require 64-bit elements, even in single precision code in sleefsimdsp.c.

The way it's used in sleefsimdsp.c it looks like it's just creating pairs of complementary floats, so it's not unreasonable to cast it back to a 32-bit type having the same VECTLENSP as float32 operations -- then it works out to be the same bit width for DP and SP cases. It just needs a temporary 64-bit splat while the pair of floats are unrolled.

Working that change through piece by piece it looks like all (or just most?) of the compatibility workarounds between SP and DP disappear. So I guess I'll do that.

@sh1boot
Copy link
Contributor Author

sh1boot commented Feb 10, 2024

Looks like the first step in rebasing automatically closed this PR? Anyway, I've rebased it, now, and fixed or understood everything that I was worried about.

@sh1boot sh1boot reopened this Feb 10, 2024
@sh1boot sh1boot changed the title Enable libsleefdft on riscv64 Enable libsleefdft and libsleefquad on riscv64 Feb 11, 2024
@sh1boot
Copy link
Contributor Author

sh1boot commented Feb 16, 2024

I've carved out some of this patch into #520 and #521. I'm not sure how github works, here, but I assume that this patch now depends on those PRs going in first, and when they do this diff will shrink?

#522 to follow.

Diff from #521: rivosinc/sleef@dev/shosie/fix-rvv-warnings...dev/shosie/fix-rvv-dft

@blapie
Copy link
Collaborator

blapie commented Mar 1, 2024

Ok, all 4 PRs make sense for the most part. Really appreciate you linking to the diffs, although my eyes started to burn a bit after 2 PRs...
Almost ready to merge #520, #521, #503 and #522 (in that order).
Just requesting more comments here #521 (comment)
If you feel like the code could use a bit more commenting please go ahead if that helps future maintenance.

@blapie
Copy link
Collaborator

blapie commented Mar 7, 2024

@sh1boot This one need a rebase now.

Co-Authored-By:    Simon Hosie <shosie@rivosinc.com>
@luhenry
Copy link
Contributor

luhenry commented Mar 7, 2024

@blapie I've rebased it and updated this PR. The diff between before and after the rebase is the same: https://github.com/shibatch/sleef/compare/82411c1f1883a528471db11c7439b56857038de8..b1c49df4ebe7163f6ab6c553586d0840a46f627e.

@blapie blapie merged commit e0fbd34 into shibatch:master Mar 7, 2024
31 checks passed
@luhenry luhenry deleted the dev/shosie/fix-rvv-dft branch March 25, 2024 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants