Bug in glm_mat4_inv_sse2 #289

deadwanderer · 2023-04-07T00:45:14Z

PREFACE: I completely lack both the linear algebra and SIMD skills to fix this; I merely note the issue's existence!

I was calculating the inverse transpose of the model matrix to properly transform normals for my lighting shader, and noted that the inverse transpose was wrong when I calculated on the CPU with cglm. I compared the results with calculating it in the shader (OpenGL) and in GLM, and found that the issue occurred in glm_mat4_inv, when using SSE (thus following the glm_mat4_inv_sse2 path.

When I manually commented out the SSE path and forced the non-SIMD path in glm_mat4_inv, the calculations came out correct, matching OpenGL shader and GLM results.

Here is the code I used to calculate a sample model matrix, and its inverse transpose, along with the code to print out resulting matrices at each step (NOTE: I've flipped/corrected? the mat4 printing (#288), so the matrices print out column-major, to match glm's output which I was comparing. I also increased precision to 6 decimal places, again to match glm's output):

mat4 model;
mat4 inverse;
mat4 transpose;

// Create an identity matrix for the model
glm_mat4_identity(model);
fprintf(stderr, "Identity matrix:\n");
glm_mat4_print(model, stderr);

// Translate the model matrix
glm_translate(model, (vec3){2.0f, 5.0f, -15.0f});
fprintf(stderr, "Translated matrix:\n");
glm_mat4_print(model, stderr);

// Apply a rotation to the model matrix in the x and y dimensions.
glm_rotate(model, glm_rad(45.0f), (vec3){1.0f, 0.5f, 0.0f});
fprintf(stderr, "Model matrix:\n");
glm_mat4_print(model, stderr);

// Invert the matrix, and store in inverse
glm_mat4_inv(model, inverse);
fprintf(stderr, "Inverse matrix:\n");
glm_mat4_print(inverse, stderr);

// Transpose the inverse matrix, and store in transpose
glm_mat4_transpose_to(inverse, transpose);
fprintf(stderr, "Transpose matrix:\n");
glm_mat4_print(transpose, stderr);

And here are the resulting values.

Identity matrix:
  |  1.000000  0.000000  0.000000  0.000000  |
  |  0.000000  1.000000  0.000000  0.000000  |
  |  0.000000  0.000000  1.000000  0.000000  |
  |  0.000000  0.000000  0.000000  1.000000  |
Translated matrix:
  |  1.000000  0.000000  0.000000   0.000000  |
  |  0.000000  1.000000  0.000000   0.000000  |
  |  0.000000  0.000000  1.000000   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Model matrix:
  |  0.941421  0.117157 -0.316228   0.000000  |
  |  0.117157  0.765685  0.632456   0.000000  |
  |  0.316228 -0.632456  0.707107   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Inverse matrix:
  |  0.967995  0.120464  0.325154  -0.000000  |
  | -0.120464 -0.787298  0.650308  -0.000000  |
  | -0.325154  0.650308  0.727066  -0.000000  |
  |  7.415617 -5.577195 -13.507222  -1.028227  |
Transpose matrix:
  |  0.967995 -0.120464  -0.325154  7.415617  |
  |  0.120464 -0.787298   0.650308 -5.577195  |
  |  0.325154  0.650308   0.727066 -13.507222  |
  | -0.000000 -0.000000  -0.000000 -1.028227  |

When I forced the non-SIMD path, I got the following results:

Identity matrix:
  |  1.000000  0.000000  0.000000  0.000000  |
  |  0.000000  1.000000  0.000000  0.000000  |
  |  0.000000  0.000000  1.000000  0.000000  |
  |  0.000000  0.000000  0.000000  1.000000  |
Translated matrix:
  |  1.000000  0.000000  0.000000   0.000000  |
  |  0.000000  1.000000  0.000000   0.000000  |
  |  0.000000  0.000000  1.000000   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Model matrix:
  |  0.941421  0.117157 -0.316228   0.000000  |
  |  0.117157  0.765685  0.632456   0.000000  |
  |  0.316228 -0.632456  0.707107   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Inverse matrix:
  |  0.941421  0.117157  0.316228  -0.000000  |
  |  0.117157  0.765685 -0.632456   0.000000  |
  | -0.316228  0.632456  0.707107  -0.000000  |
  | -7.212046  5.424091  13.136425   1.000000  |
Transpose matrix:
  |  0.941421  0.117157  -0.316228 -7.212046  |
  |  0.117157  0.765685   0.632456  5.424091  |
  |  0.316228 -0.632456   0.707107  13.136425  |
  | -0.000000  0.000000  -0.000000  1.000000  |

For comparison, here's the glm code I used:

// Create an identity matrix for the model.
glm::mat4 model = glm::mat4(1.0f);
fprintf(stderr, "Identity matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Translate the matrix
model = glm::translate(model, glm::vec3(2.0f, 5.0f, -15.0f));
fprintf(stderr, "Translated matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Rotate the matrix along the x and y axes
model = glm::rotate(model, glm::radians(45.0f), glm::vec3(1.0f, 0.5f, 0.0f));
fprintf(stderr, "Model matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Invert the model matrix
glm::mat4 inverse = glm::inverse(model);
fprintf(stderr, "Inverse matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(inverse).c_str());

// Transpose the inverted matrix.
glm::mat4 transpose = glm::transpose(inverse);
fprintf(stderr, "Transpose matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(transpose).c_str());

And here are the glm results (which match the non-SIMD path of glm_mat4_inv, slight formatting differences notwithstanding):

Identity matrix:
  |  1.000000, 0.000000, 0.000000, 0.000000 |
  |  0.000000, 1.000000, 0.000000, 0.000000 |
  |  0.000000, 0.000000, 1.000000, 0.000000 |
  |  0.000000, 0.000000, 0.000000, 1.000000 |
Translated matrix:
  |  1.000000, 0.000000, 0.000000, 0.000000 |
  |  0.000000, 1.000000, 0.000000, 0.000000 |
  |  0.000000, 0.000000, 1.000000, 0.000000 |
  |  2.000000, 5.000000, -15.000000, 1.000000 |
Model matrix:
  |  0.941421, 0.117157, -0.316228, 0.000000 |
  |  0.117157, 0.765685, 0.632456, 0.000000 |
  |  0.316228, -0.632456, 0.707107, 0.000000 |
  |  2.000000, 5.000000, -15.000000, 1.000000 |
Inverse matrix:
  |  0.941421, 0.117157, 0.316228, -0.000000 |
  |  0.117157, 0.765685, -0.632456, 0.000000 |
  | -0.316228, 0.632456, 0.707107, -0.000000 |
  | -7.212046, 5.424091, 13.136425, 1.000000 |
Transpose matrix:
  |  0.941421, 0.117157, -0.316228, -7.212046 |
  |  0.117157, 0.765685, 0.632456, 5.424091 |
  |  0.316228, -0.632456, 0.707107, 13.136425 |
  | -0.000000, 0.000000, -0.000000, 1.000000 |

So pretty clearly, something goes wonky and wrong in the SSE path for glm_mat4_inv.

The text was updated successfully, but these errors were encountered:

recp · 2023-04-07T08:04:28Z

Hi @deadwanderer

Thanks for the feedbacks,

I'll take a look at asap, many thanks, if anyone could find the issue before me a PR would also be helpful

recp · 2023-04-07T21:36:08Z

This is ARM64 output (macOS M1 Max and Virtual Windows11_ARM MSVC are similar):

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

This is x86 output :

(Virtual Windows11_ARM MSVC)

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

I got similar results. I'll try to test on my Intel Mac later. What's your environment may I ask? Also do you have latest cglm ?

recp · 2023-04-07T22:01:23Z

glm_arch_print_name(stderr || stdout) which is added with glm_arch_print_name could be helpful to print selected simd path in general. glm_arch_print_name can be called at top to see what is the arch...

deadwanderer · 2023-04-08T00:16:02Z

Thanks for the response. Using my example code again, against a fresh pull of the cglm repository, and calling the glm_arch_print_name function at the top, gave the following results:

cglm arch: x86 SSE*

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.96799 -0.12046 -0.32515   7.41562  |
  |  0.12046 -0.78730  0.65031  -5.57719  |
  |  0.32515  0.65031  0.72707 -13.50722  |
  | -0.00000 -0.00000 -0.00000  -1.02823  |

Transpose matrix:
Matrix (float4x4):
  |  0.96799  0.12046   0.32515 -0.00000  |
  | -0.12046 -0.78730   0.65031 -0.00000  |
  | -0.32515  0.65031   0.72707 -0.00000  |
  |  7.41562 -5.57719 -13.50722 -1.02823  |

deadwanderer · 2023-04-08T00:17:41Z

Oh, and I'm on Windows 10 desktop.

gottfriedleibniz · 2023-04-08T13:45:27Z

I suspect the compiler is converting mixed +/-0.0f constants like this, into -0.0f, -0.0f, -0.0f, -0.0f or 0.0f, 0.0f, 0.0f, 0.0f. Are you compiling with /fp:fast by chance?

deadwanderer · 2023-04-08T16:54:50Z

That was it! I removed /fp:fast, and am now getting identical results on SSE and non-SSE paths.

Thank you, @recp and @gottfriedleibniz for all the troubleshooting and help!

gottfriedleibniz · 2023-04-08T17:47:19Z

Ideally, functions which mix +/-0.0f should be robust enough against fp:fast. Especially when they still work with ffast-math on gcc/clang.

Given that the variable is used for extracting the sign of a float. I wonder if replacing those constants with its hex equivalent 0x80000000u, 0x0u, ... (then casting _mm_castsi128_ps) would be a more robust alternative. Otherwise, MSVC pragmas are an option.

All uses of -0.0f should probably be looked at, I guess. Maybe.

recp · 2023-04-08T21:46:54Z

@deadwanderer thanks, @gottfriedleibniz good catch, many thanks!

I used /fp:fast but got similar results (SSE2 path on virtual Windows11 arm insider, VS Community 2022 (ARM 64-bit) - 17.6.0 Preview 2.0), not sure how to reproduce the issue 🤷‍♂️

Given that the variable is used for extracting the sign of a float. I wonder if replacing those constants with its hex equivalent 0x80000000u, 0x0u, ... (then casting _mm_castsi128_ps) would be a more robust alternative. Otherwise, MSVC pragmas are an option.

+1 for this. It is better to make it work on compilers including msvc :)

Otherwise, MSVC pragmas are an option.

Which pragmas may I ask? How to make it better with msvc pragmas on msvc? A PR to make improvements as you mentioned would also be awesome :)

gottfriedleibniz · 2023-04-08T23:16:34Z

Finally had time to examine this, fp:fast is folding both expressions into a single constant:

x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);

Resulting in with fp:fast:

movaps  xmm15, XMMWORD PTR __xmm@80000000000000008000000000000000
// ...
xorps   xmm4, xmm15
// ...
xorps   xmm13, xmm15
// ...
xorps   xmm14, xmm15
// ...
xorps   xmm12, xmm15

Without fp:fast:

shufps  xmm0, xmm10, 153                    ; 00000099H
movaps  xmm11, xmm4
movaps  XMMWORD PTR x9$1$[rsp], xmm0
// ...
xorps   xmm4, xmm10
// ...
xorps   xmm14, xmm10
// ...
xorps   xmm15, XMMWORD PTR x9$1$[rsp]
// ...
xorps   xmm13, XMMWORD PTR x9$1$[rsp]

Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this. Although, there may be some superior alternative that I am unaware of (and other cases, esp. after inlining).

Which pragmas may I ask?

For reference: pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.

recp · 2023-04-09T08:36:31Z

@gottfriedleibniz many thanks

Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this.

any chance to get a PR? or I'll do it asap

For reference: pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.

thanks

recp · 2023-04-09T08:54:55Z

Finally had time to examine this, fp:fast is folding both expressions into a single constant:

x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);

also to avoid similar scenarios, any idea why fp:fast option doing this since both x8 and x9 are used in different places?

gottfriedleibniz · 2023-04-17T23:08:55Z

any chance to get a PR? or I'll do it asap

You can.

any idea why fp:fast option doing this since both x8 and x9 are used in different places?

I suppose some byproduct of subexpression elimination with fp:fast not having to honour signed zeros. I have not looked into this any further.

recp · 2023-04-20T20:52:51Z

@gottfriedleibniz sure, I'll asap, thanks for the infos

recp · 2023-04-21T17:25:45Z

@gottfriedleibniz I created a PR for this, it would be nice to get a review: #291

@deadwanderer is it possible to test that PR with /fp:fast to validate that is working as expected (previous tests)?

thanks

deadwanderer · 2023-04-21T22:40:05Z

Here are my results using that PR's changes:

cglm arch: x86 SSE*

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

Looks good to me! Thanks so much, @recp and @gottfriedleibniz!

gottfriedleibniz · 2023-04-21T23:42:19Z

Seems good. Passes my tests.

recp · 2023-04-22T07:12:58Z

@deadwanderer @gottfriedleibniz many thanks, that PR is merged.

recp added bug simd sse labels Apr 7, 2023

deadwanderer closed this as completed Apr 8, 2023

recp reopened this Apr 8, 2023

recp mentioned this issue Apr 21, 2023

simd, sse: use 0x80000000 insteaf of -0.f to fix fastmath on msvc #291

Merged

recp closed this as completed in #291 Apr 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in glm_mat4_inv_sse2 #289

Bug in glm_mat4_inv_sse2 #289

deadwanderer commented Apr 7, 2023

recp commented Apr 7, 2023

recp commented Apr 7, 2023 •

edited

recp commented Apr 7, 2023

deadwanderer commented Apr 8, 2023 •

edited

deadwanderer commented Apr 8, 2023

gottfriedleibniz commented Apr 8, 2023

deadwanderer commented Apr 8, 2023

gottfriedleibniz commented Apr 8, 2023 •

edited

recp commented Apr 8, 2023 •

edited

gottfriedleibniz commented Apr 8, 2023

recp commented Apr 9, 2023

recp commented Apr 9, 2023

gottfriedleibniz commented Apr 17, 2023

recp commented Apr 20, 2023

recp commented Apr 21, 2023 •

edited

deadwanderer commented Apr 21, 2023 •

edited

gottfriedleibniz commented Apr 21, 2023

recp commented Apr 22, 2023

Bug in glm_mat4_inv_sse2 #289

Bug in glm_mat4_inv_sse2 #289

Comments

deadwanderer commented Apr 7, 2023

recp commented Apr 7, 2023

recp commented Apr 7, 2023 • edited

recp commented Apr 7, 2023

deadwanderer commented Apr 8, 2023 • edited

deadwanderer commented Apr 8, 2023

gottfriedleibniz commented Apr 8, 2023

deadwanderer commented Apr 8, 2023

gottfriedleibniz commented Apr 8, 2023 • edited

recp commented Apr 8, 2023 • edited

gottfriedleibniz commented Apr 8, 2023

recp commented Apr 9, 2023

recp commented Apr 9, 2023

gottfriedleibniz commented Apr 17, 2023

recp commented Apr 20, 2023

recp commented Apr 21, 2023 • edited

deadwanderer commented Apr 21, 2023 • edited

gottfriedleibniz commented Apr 21, 2023

recp commented Apr 22, 2023

recp commented Apr 7, 2023 •

edited

deadwanderer commented Apr 8, 2023 •

edited

gottfriedleibniz commented Apr 8, 2023 •

edited

recp commented Apr 8, 2023 •

edited

recp commented Apr 21, 2023 •

edited

deadwanderer commented Apr 21, 2023 •

edited