This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Performance implications of zeroing past vl #157

Closed
solomatnikov opened this issue May 2, 2019 · 22 comments

Comments

@solomatnikov

Zeroing past vl implies that a vector instruction takes the same number of cycles as in the case of vl==VLMAX, if the vector microarchitecture is limited by the write-port bandwidth of the vector register file.

This can be especially bad if vector code is written with LMUL==8 but used for relatively short vectors. For instance, the saxpy example uses LMUL==8; with VLEN==512 and 4 lanes of 32-bit elements, every vector instruction would take 4*8 cycles because of the write-port bottleneck, even when vl==16.

@aswaterman
Member

There's a note that addresses this issue:

"For zeroing tail updates, implementations with temporally long vector registers, either with or without register renaming, will be motivated to add microarchitectural state to avoid actually writing zeros to all tail elements, but this is a relatively simple microarchitectural optimization. For example, one bit per element group or a quantized VL can be used to track the extent of zeroing. An element group is the set of elements comprising the smallest atomic unit of execution in the microarchitecture (often equivalent to the width of the physical datapath in the machine). The microarchitectural state for an element group indicates that zero should be returned for the element group on a read, and that zero should be substituted in for any masked-off elements in the group on the first write to that element group (after which the element group zero bit can be cleared)."

It's rather annoying, but it's not expensive.

@ccelio

ccelio commented May 2, 2019

I would expect nearly all microarchitectures to use some sort of is_zero register file that is not subject to the same bandwidth requirements as the vector register file.

@solomatnikov
Author

Why not leave the elements past vl in an undefined state, just like the whole vector register after reset? If SW erroneously uses such elements, it would get a wrong result anyway.

This should work for implementations with and without register renaming.

@aswaterman
Member

I think that’s a defensible position. Presumably the reason the TG came to this design was to avoid the additional implementation-defined behavior that would be subtly exposed by buggy software.

@solomatnikov
Author

I think the spec should be changed to allow undefined state for elements past vl. Of course, zero or previous state would be allowed too.

Of course, one bit per element group to track the extent of zeroing can be implemented, but I think the overhead could be non-trivial because of the fan-out of such flops.

For example, in a simplistic implementation with a 512-bit-wide datapath and a 128-entry vector register file, there would be a 128-to-1 mux with a fanout of 512 across the whole datapath. The same logic has to be replicated for each read port of the vector register file (4 ports minimum for a reasonable design). This can be challenging for physical design.

Of course, one can do a different implementation with the flops replicated per lane, e.g. per 64-bit lane. Then the number of flops would be 8*128 with a fanout of 64, and there is still quite a bit of wiring across each lane.

I don't think it's worth doing without clear compelling reason.

@jnk0le

jnk0le commented May 2, 2019

Implementation-defined behaviour usually means security holes.

@solomatnikov
Author

Abstract generalizations like this do not make an argument. Modern processors have many parts with implementation-defined behavior and a lot more state, e.g. data caches and branch predictors, that can be security holes. Yet no sane processor designer would get rid of data caches and branch predictors because the resulting design would not be competitive.

In this case, preventing data leaks/security holes is simple: on a context switch, SW has to clear all registers anyway to prevent information leaks; zeroing past vl neither helps with nor eliminates this. How would not zeroing past vl be a security hole?

@aswaterman
Member

I agree with @solomatnikov this can be defined in a way that does not open a security hole: e.g., the unpredictable state must be a deterministic function of the architectural state that's visible to the executing privilege mode. This would permit both preserve-past-VL and zero-past-VL without permitting architecturally visible leakage from a different security domain.

I talked to @kasanovic about this today, and he said the TG's principal concern was software inadvertently relying on the implementation's behavior. In particular, zero-past-VL and preserve-past-VL are both useful behaviors in some situations, and it's easy to imagine a software developer accidentally relying on whichever one the development machine provides. We could then end up in a situation where software runs only under one discipline or the other, risking the adoption of a de-facto standard.

@kasanovic
Collaborator

The data path fanout is not that bad. In a design with static-logic read ports from flops, the read port is simply not enabled from any row, so the OR-tree produces zero. The gating can be done on the read-port address, not on the data.

@kasanovic
Collaborator

We have discussed previously, but have not specified in this version of the spec, a way to "disable" the vector unit when not in use to save context-switch overhead (or even to enable power gating). This would reuse the zeroing logic to clear state.

@jnk0le

jnk0le commented May 3, 2019

In this case prevention of data leaks/security holes is simple - on a context switch SW has to clear all registers anyway to prevent information leaks, zeroing past vl does not help or eliminate this. How would not zeroing past vl be a security hole?

There is a possibility of leaks within the thread context (a kind of use-after-free) that can be elevated by software written and "debugged" on renaming architectures, as pointed out by Andrew. If that's not enough, one can ultimately exploit vector-capable JavaScript JIT compilers.
I think this approach is valid, but we need to be careful before such software is written/compiled.

@vbx-glemieux

vbx-glemieux commented May 3, 2019 via email

@solomatnikov
Author

The data path fanout is not that bad. In a design with static-logic read ports from flops, the read port is simply not enabled from any row, so the OR-tree produces zero. The gating can be done on the read-port address, not on the data.

This is not true in general, i.e. if the vector register file is generated by a compiler.

Also, even for a flop-based register file, it is better to hold the previous value on the output of the read port when the port is disabled, to minimize switching activity and power. I think the common case is that a read port is used ~50% of cycles. Forcing the output to zero would double the switching activity.

@kasanovic
Collaborator

kasanovic commented May 3, 2019 via email

@solomatnikov
Author

Krste replied via email:

"In this case [a register file generated by a compiler] there is much less fanout to worry about, especially in terms of area."

"This is a very pessimistic assumption in terms of switching activity. Most bits are zeros."

Is it true for floating-point values?



@billhuffman

From my point of view, we need a rule for tail elements or we'll have software incompatibilities. "Leave as it was" is very bad for renamed designs, which leads to "zero." Then the question is whether "zero" is bad for any hardware designs.

As an alternative to Krste's OR-trees for read ports, I would suggest that reading the "zero file" a cycle earlier than the vector register file would remove the fanout difficulties that might otherwise exist. That would also allow easily zeroing in the last mux stage of the read port, which would avoid the switching activity Alex has mentioned.

Seems to me "zero" is the better answer and I think there are reasonable hardware structures to accomplish it.

 Bill

@solomatnikov
Author

From my point of view, we need a rule for tail elements or we'll have software incompatibilities. "Leave as it was" is very bad for renamed designs, which leads to "zero." Then the question is whether "zero" is bad for any hardware designs.

Yes, zeroing past vl adds a lot of complexity to simple implementations, which will be the majority, at least initially.

Tracking dependencies for RAW, WAW, and chaining becomes significantly more complicated, because a single beat can write a variable number of elements. And these checks are required for good/competitive performance.

For example, a typical vector implementation can have separate memory and arithmetic pipelines with 4 lanes and VLMAX==16. The arithmetic pipeline executes an FMA with vl==16 while the memory pipeline executes a vector load with vl==7 writing the same vector register. The last beat of the vector load writes 12 elements (3 loaded plus 9 zeroed), so the WAW check/stall logic becomes more complicated.

Segment vector loads and stores make it even more complicated, because segment memory ops write or read multiple vector registers (up to 8). And segment vector loads and stores are necessary to achieve good performance for many kernels/applications.

Lots of extra complexity without clear benefit.

As an alternative to Krste's OR trees for read ports,

What @kasanovic suggested does not help with timing or fanout; it actually makes them worse.

I would suggest that reading the "zero file" a cycle earlier than the vector register file would remove the fanout difficulties that might otherwise exist. That would also allow easily zeroing in the last mux stage of the read port, which would avoid the switching activity Alex has mentioned.

This would complicate the dependency and chaining logic even more, because reading the "zero file" a cycle earlier requires a lot of special cases in the logic. Is the "zero file" also written a cycle earlier? Or must an extra stall cycle be added? Or a special bypass?

Seems to me "zero" is the better answer and I think there are reasonable hardware structures to accomplish it.

 Bill

@billhuffman

billhuffman commented May 30, 2019 via email

@solomatnikov
Author

Ping @kasanovic

@HanKuanChen
Contributor

I think "leave tail as it was" is more convenient and makes more sense for software developers.

Take dot_prod as an example:

float32_t dot_prod(const float32_t *src1, const float32_t *src2, uint32_t len);

If the rule is "leave tail as it was", then using vfmacc.vv in the loop and doing a single vfredosum.vs at the end is intuitive:

	vsetvli x0, x0, e32, m8
	vmv.v.i v16, 0
	vmv.s.x v24, 0
loop:
	beqz len, end
	vsetvli new_vl, len, e32, m8
	vlw.v v0, (src1)
	vlw.v v8, (src2)
	vfmacc.vv v16, v0, v8
	sub len, len, new_vl
	slli mem_offset, new_vl, 2
	add src1, src1, mem_offset
	add src2, src2, mem_offset
	j loop
end:
	vsetvli x0, x0, e32, m8
	vfredosum.vs v24, v16, v24
	# get result in v24[0]

However, if the rule is "leave tail as 0", then the vfredosum.vs must move into the loop, which reduces performance:

	vsetvli x0, x0, e32, m8
	vmv.v.i v16, 0
	vmv.s.x v24, 0
loop:
	beqz len, end
	vsetvli new_vl, len, e32, m8
	vlw.v v0, (src1)
	vlw.v v8, (src2)
	vfmul.vv v16, v0, v8
	vfredosum.vs v24, v16, v24
	sub len, len, new_vl
	slli mem_offset, new_vl, 2
	add src1, src1, mem_offset
	add src2, src2, mem_offset
	j loop
end:
	# get result in v24[0]

@kasanovic
Collaborator

kasanovic commented Sep 2, 2019 via email

@kasanovic
Collaborator

Decided to go with tail elements undisturbed in 0.8


8 participants