This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Cost of vsetvl Instructions #642

Closed
daweili1226 opened this issue Feb 18, 2021 · 6 comments

Comments

@daweili1226

For the V extension, I see many vsetvl instructions in the code examples. They are executed on every iteration and may update global settings, which could create bubbles in out-of-order CPU models while later instructions wait for the vsetvl to commit (correct me if I'm wrong). The V design, however, seems to assume that vsetvl instructions are cheap in a CPU pipeline. I guess there are docs or previous discussions on the cost of vsetvl instructions. Can somebody explain that cost?
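
For context, the pattern in question is the strip-mined loop, where every iteration asks vsetvl for a vector length covering the elements that remain. A minimal Python sketch of that control flow follows. The names `strip_mine`, `avl`, and `vlmax` are illustrative, and the `min(avl, vlmax)` rule is a simplifying assumption about how vl is chosen, not a statement of what every implementation does.

```python
def strip_mine(n, vlmax):
    """Return the vl chosen on each iteration of a strip-mined loop.

    Models one vsetvl per iteration. vl = min(avl, vlmax) is an assumed,
    simplified selection rule.
    """
    vls = []
    avl = n  # application vector length: elements still to process
    while avl > 0:
        vl = min(avl, vlmax)  # stands in for: vsetvl vl, avl, vtype
        vls.append(vl)        # vector loads/ops/stores on vl elements go here
        avl -= vl
    return vls


if __name__ == "__main__":
    print(strip_mine(10, 4))
```

With `n = 10` and `vlmax = 4` the loop runs three times with vl of 4, 4, and 2, so vsetvl executes once per iteration, which is the frequency the question is about.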

@kasanovic
Collaborator

Implementations should assume vsetvl instructions will be frequent, and design microarchitectures accordingly. In particular, OoO implementations should not wait until commit and flush the pipeline on vl and vtype CSR updates. The values can be renamed/bypassed before commit.
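
One way to picture "renamed/bypassed before commit" is a front end that keeps a speculative copy of the current configuration and attaches it to each vector op as it is decoded. The toy model below is only a sketch under stated assumptions: `VConfig`, `decode`, and the tuple instruction format are hypothetical, the AVL is treated as already known (in real hardware it comes from a scalar register), and LMUL is fixed at 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VConfig:
    vl: int   # active vector length
    sew: int  # element width in bits (one field of vtype)


def decode(stream, vlen_bits=128):
    """Tag each vector op with the speculative (vl, sew) in effect at decode.

    No pipeline flush is modeled: a vsetvli just replaces the front end's
    current configuration, and later ops pick up the new value.
    """
    current = None
    tagged = []
    for insn in stream:
        if insn[0] == "vsetvli":
            _, avl, sew = insn
            vlmax = vlen_bits // sew  # assumes LMUL = 1
            current = VConfig(vl=min(avl, vlmax), sew=sew)
        else:
            tagged.append((insn[0], current))
    return tagged


if __name__ == "__main__":
    program = [("vsetvli", 10, 32), ("vadd",), ("vsetvli", 3, 64), ("vmul",)]
    for op, cfg in decode(program):
        print(op, cfg)
```

In this model each vector op carries its own copy of the configuration, so a later vsetvli does not disturb ops that are already in flight.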

@daweili1226
Author

Thanks Krste. So those vsetvl instructions updating CSRs would be costly, and software should therefore avoid frequent updates inside loop iterations.

@kasanovic
Collaborator

I'd state it as software will want to update these CSRs a lot, so hardware should make updates fast.

@jnk0le

jnk0le commented Feb 19, 2021

This is why vtype and vl can only be updated by vsetvl and fault-only-first loads. It's unavoidable in a VLA (vector-length-agnostic) architecture.

A typical OoO core can execute vsetvl long before the following vector ops are issued to the pipelines. On a "cold start" there will be just a few cycles of lag.
Simpler cores can decouple the vector execution units. As a quick example, Andes' NX27V has its vector unit skewed to after the scalar commit point, so there is virtually no delay from vsetvl to execution.

@brucehoult
Contributor

@daweili1226 note that a future 64-bit encoding of vector instructions is likely to include the vtype in every instruction.

You should maybe look at the current vsetvli instruction and its associated CSR as effectively providing a few more opcode bits than fit in the current 32-bit encoding, so the current vtype should be carried along with each instruction in the pipeline.

Assuming you want maximum performance, anyway. If your machine's vector instructions are executed over a few beats then the overhead of a bubble after each vsetvli might not be too awful.
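
To put rough numbers on the "few beats" remark, here is a back-of-the-envelope sketch. It assumes one vsetvli per loop body, vector ops that run back to back without overlap, and a fixed stall after the vsetvli. The function name and every figure are hypothetical, not measurements of any real core.

```python
def bubble_overhead(bubble_cycles, vector_ops, beats_per_op):
    """Fraction of loop-body cycles lost to the stall after one vsetvli.

    Assumes vector ops execute serially, each taking beats_per_op cycles.
    """
    useful = vector_ops * beats_per_op
    return bubble_cycles / (useful + bubble_cycles)


if __name__ == "__main__":
    # Hypothetical: 2-cycle bubble, 4 vector ops per iteration.
    for beats in (1, 4, 8):
        print(beats, round(bubble_overhead(2, 4, beats), 3))
```

Under these assumptions a 2-cycle bubble costs 2/6 of the loop body (about 33%) when each of four ops takes one beat, but only 2/18 (about 11%) when each takes four beats, which is the sense in which multi-beat execution hides the stall.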

@kasanovic
Collaborator

I don't see anything actionable here, so closing the issue.


4 participants