Cost of vsetvl Instructions #642

daweili1226 · 2021-02-18T02:20:27Z

For the V extension, I see many vsetvl instructions in the code examples. They are executed on every loop iteration and may update global settings, which could create bubbles in out-of-order CPU models while later instructions wait for the vsetvl to commit (correct me if I'm wrong). The V design, however, seems to assume that vsetvl instructions are cheap in a CPU pipeline. I guess there are docs or previous discussions on the cost of vsetvl instructions. Can somebody explain the cost?
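
To make the pattern in question concrete, here is a minimal Python model of a strip-mined vector add in which vsetvl runs once per loop iteration. It is a sketch under stated assumptions: `VLMAX = 8`, the helper names `vsetvl` and `vector_add`, and the rule "grant min(AVL, VLMAX)" are all illustrative simplifications, not taken from the spec examples or from any implementation.

```python
# Toy model of a strip-mined vector add. VLMAX and the helper names are
# assumptions for illustration, not taken from any real implementation.

VLMAX = 8  # assumed number of elements per vector register group


def vsetvl(avl, vlmax=VLMAX):
    """Return the vl granted for an application vector length (AVL).

    Simplification: always grant min(avl, vlmax). The spec permits other
    choices when vlmax < avl < 2 * vlmax.
    """
    return min(avl, vlmax)


def vector_add(a, b):
    """Add two equal-length lists strip by strip, counting vsetvl executions."""
    n = len(a)
    out = [0] * n
    i = 0
    vsetvl_count = 0
    while i < n:
        vl = vsetvl(n - i)          # executed once per loop iteration
        vsetvl_count += 1
        for j in range(i, i + vl):  # stands in for vle/vadd/vse on vl elements
            out[j] = a[j] + b[j]
        i += vl
    return out, vsetvl_count
```

With 20 elements and the assumed VLMAX of 8, the loop runs three strips (8, 8, and 4 elements), so vsetvl executes three times. The final strip is the one where vl actually changes.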

kasanovic · 2021-02-18T07:52:20Z

Implementations should assume vsetvl instructions will be frequent, and design microarchitectures accordingly. In particular, OoO implementations should not wait until commit and flush the pipeline on vl and vtype CSR updates. The values can be renamed/bypassed before commit.

daweili1226 · 2021-02-19T01:10:27Z

Thanks, Krste. So those vsetvl instructions that update CSRs would be costly, and software should therefore avoid frequent updates inside loop iterations.

kasanovic · 2021-02-19T02:12:16Z

I'd state it as: software will want to update these CSRs a lot, so hardware should make updates fast.

jnk0le · 2021-02-19T04:13:08Z

This is a reason why vtype and vl can only be updated by vsetvl and fault-only-first loads. It is unavoidable in a VLA (vector-length-agnostic) architecture.

Average OoO cores can execute vsetvl long before the following vector ops are issued to the pipelines. On a "cold start" there will be just a few cycles of lag.
Simpler cores can decouple the vector execution units. As a quick example, Andes' NX27V has its vector unit skewed to after the scalar commit point, so there is virtually no delay from vsetvl to execution.

brucehoult · 2021-03-30T22:29:49Z

@daweili1226 note that a future 64-bit encoding of vector instructions is likely to include the vtype in every instruction.

It may help to view the current vsetvli instruction and its associated CSR as effectively providing a few more opcode bits than fit in the current 32-bit encoding; the current vtype should then be carried along with each instruction in the pipeline.

Assuming you want maximum performance, anyway. If your machine's vector instructions are executed over a few beats, then the overhead of a bubble after each vsetvli might not be too awful.
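
The idea of treating vtype as extra opcode bits can be sketched as a decode stage that snapshots the most recent vtype onto every vector op, so nothing downstream has to read a global CSR. This is only an illustration: the tuple-based instruction stream, the `VType` and `TaggedOp` classes, and the `decode` function are hypothetical, and only integer LMUL values are modelled.

```python
# Sketch only: the instruction tuples and field names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class VType:
    sew: int   # selected element width in bits
    lmul: int  # register group multiplier (integer values only here)


@dataclass(frozen=True)
class TaggedOp:
    opcode: str
    vtype: VType  # snapshot taken at decode, travels with the op


def decode(stream):
    """Attach the most recent vtype to every vector op, in program order."""
    current = None
    tagged = []
    for instr in stream:
        if instr[0] == "vsetvli":
            _, sew, lmul = instr
            current = VType(sew, lmul)  # updates decode-time state, no flush
        else:
            if current is None:
                raise ValueError("vector op before any vsetvli")
            tagged.append(TaggedOp(instr[0], current))
    return tagged
```

For a stream such as `[("vsetvli", 32, 1), ("vadd.vv",), ("vsetvli", 8, 4), ("vmul.vv",)]`, each vector op carries the vtype that was in effect when it was decoded. The sketch covers vtype only: in the vsetvli form the vtype setting is an immediate, so it is known at decode, whereas vl depends on a register value and would need renaming or bypassing as described earlier in the thread.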

kasanovic · 2021-06-04T18:46:08Z

I don't see anything actionable here, so closing the issue.

kasanovic closed this as completed Jun 4, 2021
