
Add dotproduct assembly documentation and godbolt links #270

Open · wants to merge 2 commits into base: master

Conversation

miguelraz
Contributor

@miguelraz miguelraz commented Apr 1, 2022

Not yet finished but I wanted to save my work for a bit.

Adding a bunch of text to README.md, with some (may I say) nicely curated Rust godbolt links and displays.

stdsimd docs don't yet have a "voice/tone", let me know if it needs a course correction.

@miguelraz miguelraz changed the title Add dotproduct assembly Add dotproduct assembly documentation and godbolt links Apr 1, 2022

This example code takes the dot product of two vectors. You multiply each pair of elements and then add all of the products together.
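For illustration (the PR's actual snippet isn't reproduced in this thread, so this is a minimal stand-in for the scalar version):

```rust
// Scalar dot product: multiply each pair of elements, then sum the products.
fn dot_prod_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    // 1*5 + 2*6 + 3*7 + 4*8 = 70
    println!("{}", dot_prod_scalar(&a, &b));
}
```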

The easiest way to inspect the assembly of the `scalar` code versions (the non-SIMD versions) is to [click this link](https://rust.godbolt.org/z/xM9Mxb14n) for a *mise en place* of what is going on.
Member

I think it would be better to avoid non-english phrases since not everyone knows French (i guess? idk what that phrase means).



1. SIMD comes in many flavors (instructions sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have a SIMD vector that can hold 512 bytes at a time at most on your CPU.
Member

Suggested change
1. SIMD comes in many flavors (instructions sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have a SIMD vector that can hold 512 bytes at a time at most on your CPU.
1. SIMD comes in many flavors (instructions sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have any SIMD vector registers that can hold 512-bits at a time on your CPU.


1. SIMD comes in many flavors (instructions sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have a SIMD vector that can hold 512 bytes at a time at most on your CPU.
2. You can switch between different instruction sets by changing the `#![target-feature(...)]` macro above the function, as well as declaring it unsafe.
Member

Suggested change
2. You can switch between different instruction sets by changing the `#![target-feature(...)]` macro above the function, as well as declaring it unsafe.
2. You can switch between different instruction sets by both changing the `#![target-feature(...)]` macro above the function and declaring it unsafe.

declaring it unsafe by itself doesn't change the target features.
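A sketch of the reviewer's point on stable Rust (note the attribute's real spelling is `#[target_feature(enable = "...")]` rather than `#![target-feature(...)]`; the `avx2` feature and runtime check here are x86-64-specific, so they're gated by `cfg`):

```rust
// Baseline version, compiled for the default target features.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

// The attribute alone widens the instruction set the compiler may use in this
// function; it must be paired with `unsafe fn`, and the *caller* must confirm
// the CPU actually has the feature before calling it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let (a, b) = ([1.0f32, 2.0, 3.0, 4.0], [5.0f32, 6.0, 7.0, 8.0]);
    let mut result = dot(&a, &b);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // The runtime check is what makes this unsafe call sound.
            result = unsafe { dot_avx2(&a, &b) };
        }
    }
    println!("{result}");
}
```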


1. SIMD comes in many flavors (instructions sets). These (like `sse`, `sse4.1`, `avx2`) describe the hardware capabilities of your current CPU. That is, if you don't have `avx512`, you physically do not have a SIMD vector that can hold 512 bytes at a time at most on your CPU.
2. You can switch between different instruction sets by changing the `#![target-feature(...)]` macro above the function, as well as declaring it unsafe.
3. Inside Godbolt, you can hover over an instruction to display a tooltip of what it says. Try hovering your mouse over `mulps` and reading what it says.
Member

I suggest phrasing in terms of "what the instruction does" rather than "what it says".

2. You can switch between different instruction sets by changing the `#![target-feature(...)]` macro above the function, as well as declaring it unsafe.
3. Inside Godbolt, you can hover over an instruction to display a tooltip of what it says. Try hovering your mouse over `mulps` and reading what it says.

We need to find a way to reduce the amount of *data movement*. We're not doing enough arithmetic to justify all the moving of floats into and out of the `xmm` registers. This isn't surprising if we stop and look at the code for a bit: `dot_prod_simd_0` loads 4 floats from `a` into an `xmm` register, then the corresponding 4 floats from `b`, multiplies them (the efficient part), and then does a `reduce_sum`. In general, SIMD reductions inside a tight loop are a perf anti-pattern, and you should try to figure out a way to make those reductions `element-wise` and not `vector-wise`. This is what we see in the following snippet:
Member

element-wise vs. vector-wise reductions -- not clear, should be rephrased, maybe by describing what they do rather than naming them.
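One way to make the distinction concrete (sketched here with plain arrays on stable Rust instead of the nightly `f32x4` type, and assuming the slice lengths are a multiple of 4):

```rust
// "Vector-wise": collapse the 4 lanes to a scalar on every iteration.
// The horizontal add inside the hot loop is the anti-pattern.
fn dot_prod_reduce_each(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        let mut lane_sum = 0.0f32;
        for i in 0..4 {
            lane_sum += ca[i] * cb[i];
        }
        sum += lane_sum; // horizontal reduction every iteration
    }
    sum
}

// "Element-wise": keep a 4-lane accumulator and reduce only once at the end.
fn dot_prod_lanewise(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        for i in 0..4 {
            acc[i] += ca[i] * cb[i]; // stays in the lanes
        }
    }
    acc.iter().sum() // single horizontal reduction after the loop
}

fn main() {
    let (a, b) = ([1.0f32, 2.0, 3.0, 4.0], [5.0f32, 6.0, 7.0, 8.0]);
    println!("{} {}", dot_prod_reduce_each(&a, &b), dot_prod_lanewise(&a, &b));
}
```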


-----

Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come form knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:
Member

"can cut swaths in the data movement overheads xmm registers can carry" -- unclear, should be rephrased.
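On stable Rust the scalar form of the same idea is `f32::mul_add` (a fused multiply-add); the portable-SIMD vector types expose an analogous lane-wise operation. A small sketch:

```rust
// x.mul_add(y, acc) computes x * y + acc with a single rounding step,
// and compiles to one FMA instruction on targets that support it.
fn dot_prod_fma(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())
        .fold(0.0f32, |acc, (x, y)| x.mul_add(*y, acc))
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [5.0f32, 6.0, 7.0, 8.0];
    println!("{}", dot_prod_fma(&a, &b)); // 70
}
```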

@Urhengulas Urhengulas left a comment


I just spotted two potential typos while reading through your PR 😄


In `dot_prod_simd_1`, we tried out the `fold` patter from our previous `scalar` code snippet examples. This pattern, when implemented via SIMD instructions naively, means that for every `f32x4` `element`-wise multiplication, we accumulate into a (initially `0` valued `f32x4` SIMD vector) and then finally do a `reduce_sum` at the end to get the final result. This


Suggested change
In `dot_prod_simd_1`, we tried out the `fold` patter from our previous `scalar` code snippet examples. This pattern, when implemented via SIMD instructions naively, means that for every `f32x4` `element`-wise multiplication, we accumulate into a (initially `0` valued `f32x4` SIMD vector) and then finally do a `reduce_sum` at the end to get the final result. This
In `dot_prod_simd_1`, we tried out the `fold` pattern from our previous `scalar` code snippet examples. This pattern, when implemented via SIMD instructions naively, means that for every `f32x4` `element`-wise multiplication, we accumulate into a (initially `0` valued `f32x4` SIMD vector) and then finally do a `reduce_sum` at the end to get the final result. This
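The fold pattern described above can be sketched on stable Rust with a plain `[f32; 4]` standing in for the nightly `f32x4` type; `chunks_exact` also exposes the tail elements that don't fill a full chunk, which `dot_prod_simd_1` would need to handle for arbitrary lengths:

```rust
fn dot_prod_fold(a: &[f32], b: &[f32]) -> f32 {
    let chunks_a = a.chunks_exact(4);
    let chunks_b = b.chunks_exact(4);
    // The remainders borrow from the original slices, so we can grab them
    // before the chunk iterators are consumed by the fold.
    let (tail_a, tail_b) = (chunks_a.remainder(), chunks_b.remainder());
    // Fold a 4-lane accumulator (initially all zeros) over the chunks.
    let acc = chunks_a.zip(chunks_b).fold([0.0f32; 4], |mut acc, (ca, cb)| {
        for i in 0..4 {
            acc[i] += ca[i] * cb[i];
        }
        acc
    });
    // One `reduce_sum`-style horizontal add at the end, plus the scalar tail.
    acc.iter().sum::<f32>() + tail_a.iter().zip(tail_b).map(|(x, y)| x * y).sum::<f32>()
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let b = [1.0f32, 1.0, 1.0, 1.0, 2.0];
    println!("{}", dot_prod_fold(&a, &b)); // 1 + 2 + 3 + 4 + 10 = 20
}
```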


-----

Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come form knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:


Probably this should be "can come from", not "can come form".

Suggested change
Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come form knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:
Now we will exploit the `mul_add` instruction. Open [this link to view the snippets side by side once again](https://rust.godbolt.org/z/vPTqG13vK). We've started off with a simple computation: adding and multiplying. Even though the arithmetic operations are not complicated, the performance payoff can come from knowing specific hardware capabilities like `mul_add`: in a single instruction, it can multiply 2 SIMD vectors and add them into a 3rd, which can cut swaths in the data movement overheads `xmm` registers can carry. Other instructions like inverse square roots are available (which are very popular for physics calculations), and it can get oodles more complex depending on the problem - there's published algorithms with `shuffles`, `swizzles` and `casts` for [decoding UTF8](https://arxiv.org/pdf/2010.03090.pdf), all in SIMD registers and with fancy table lookups. We won't talk about those here, but we just want to point out that firstly, reading the books can pay off drastically, and second, we're starting small to show the concepts, like using `mul_add` in the next snippet:
