Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in glm_mat4_inv_sse2 #289

Closed
deadwanderer opened this issue Apr 7, 2023 · 18 comments · Fixed by #291
Closed

Bug in glm_mat4_inv_sse2 #289

deadwanderer opened this issue Apr 7, 2023 · 18 comments · Fixed by #291

Comments

@deadwanderer
Copy link

PREFACE: I completely lack both the linear algebra and SIMD skills to fix this; I merely note the issue's existence!

I was calculating the inverse transpose of the model matrix to properly transform normals for my lighting shader, and noted that the inverse transpose was wrong when I calculated on the CPU with cglm. I compared the results with calculating it in the shader (OpenGL) and in GLM, and found that the issue occurred in glm_mat4_inv, when using SSE (thus following the glm_mat4_inv_sse2 path.

When I manually commented out the SSE path and forced the non-SIMD path in glm_mat4_inv, the calculations came out correct, matching OpenGL shader and GLM results.

Here is the code I used to calculate a sample model matrix, and its inverse transpose, along with the code to print out resulting matrices at each step (NOTE: I've flipped/corrected? the mat4 printing (#288), so the matrices print out column-major, to match glm's output which I was comparing. I also increased precision to 6 decimal places, again to match glm's output):

mat4 model;
mat4 inverse;
mat4 transpose;

// Create an identity matrix for the model
glm_mat4_identity(model);
fprintf(stderr, "Identity matrix:\n");
glm_mat4_print(model, stderr);

// Translate the model matrix
glm_translate(model, (vec3){2.0f, 5.0f, -15.0f});
fprintf(stderr, "Translated matrix:\n");
glm_mat4_print(model, stderr);

// Apply a rotation to the model matrix in the x and y dimensions.
glm_rotate(model, glm_rad(45.0f), (vec3){1.0f, 0.5f, 0.0f});
fprintf(stderr, "Model matrix:\n");
glm_mat4_print(model, stderr);

// Invert the matrix, and store in inverse
glm_mat4_inv(model, inverse);
fprintf(stderr, "Inverse matrix:\n");
glm_mat4_print(inverse, stderr);

// Transpose the inverse matrix, and store in transpose
glm_mat4_transpose_to(inverse, transpose);
fprintf(stderr, "Transpose matrix:\n");
glm_mat4_print(transpose, stderr);

And here are the resulting values.

Identity matrix:
  |  1.000000  0.000000  0.000000  0.000000  |
  |  0.000000  1.000000  0.000000  0.000000  |
  |  0.000000  0.000000  1.000000  0.000000  |
  |  0.000000  0.000000  0.000000  1.000000  |
Translated matrix:
  |  1.000000  0.000000  0.000000   0.000000  |
  |  0.000000  1.000000  0.000000   0.000000  |
  |  0.000000  0.000000  1.000000   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Model matrix:
  |  0.941421  0.117157 -0.316228   0.000000  |
  |  0.117157  0.765685  0.632456   0.000000  |
  |  0.316228 -0.632456  0.707107   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Inverse matrix:
  |  0.967995  0.120464  0.325154  -0.000000  |
  | -0.120464 -0.787298  0.650308  -0.000000  |
  | -0.325154  0.650308  0.727066  -0.000000  |
  |  7.415617 -5.577195 -13.507222  -1.028227  |
Transpose matrix:
  |  0.967995 -0.120464  -0.325154  7.415617  |
  |  0.120464 -0.787298   0.650308 -5.577195  |
  |  0.325154  0.650308   0.727066 -13.507222  |
  | -0.000000 -0.000000  -0.000000 -1.028227  |

When I forced the non-SIMD path, I got the following results:

Identity matrix:
  |  1.000000  0.000000  0.000000  0.000000  |
  |  0.000000  1.000000  0.000000  0.000000  |
  |  0.000000  0.000000  1.000000  0.000000  |
  |  0.000000  0.000000  0.000000  1.000000  |
Translated matrix:
  |  1.000000  0.000000  0.000000   0.000000  |
  |  0.000000  1.000000  0.000000   0.000000  |
  |  0.000000  0.000000  1.000000   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Model matrix:
  |  0.941421  0.117157 -0.316228   0.000000  |
  |  0.117157  0.765685  0.632456   0.000000  |
  |  0.316228 -0.632456  0.707107   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Inverse matrix:
  |  0.941421  0.117157  0.316228  -0.000000  |
  |  0.117157  0.765685 -0.632456   0.000000  |
  | -0.316228  0.632456  0.707107  -0.000000  |
  | -7.212046  5.424091  13.136425   1.000000  |
Transpose matrix:
  |  0.941421  0.117157  -0.316228 -7.212046  |
  |  0.117157  0.765685   0.632456  5.424091  |
  |  0.316228 -0.632456   0.707107  13.136425  |
  | -0.000000  0.000000  -0.000000  1.000000  |

For comparison, here's the glm code I used:

// Create an identity matrix for the model.
glm::mat4 model = glm::mat4(1.0f);
fprintf(stderr, "Identity matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Translate the matrix
model = glm::translate(model, glm::vec3(2.0f, 5.0f, -15.0f));
fprintf(stderr, "Translated matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Rotate the matrix along the x and y axes
model = glm::rotate(model, glm::radians(45.0f), glm::vec3(1.0f, 0.5f, 0.0f));
fprintf(stderr, "Model matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Invert the model matrix
glm::mat4 inverse = glm::inverse(model);
fprintf(stderr, "Inverse matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(inverse).c_str());

// Transpose the inverted matrix.
glm::mat4 transpose = glm::transpose(inverse);
fprintf(stderr, "Transpose matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(transpose).c_str());

And here are the glm results (which match the non-SIMD path of glm_mat4_inv, slight formatting differences notwithstanding):

Identity matrix:
  |  1.000000, 0.000000, 0.000000, 0.000000 |
  |  0.000000, 1.000000, 0.000000, 0.000000 |
  |  0.000000, 0.000000, 1.000000, 0.000000 |
  |  0.000000, 0.000000, 0.000000, 1.000000 |
Translated matrix:
  |  1.000000, 0.000000, 0.000000, 0.000000 |
  |  0.000000, 1.000000, 0.000000, 0.000000 |
  |  0.000000, 0.000000, 1.000000, 0.000000 |
  |  2.000000, 5.000000, -15.000000, 1.000000 |
Model matrix:
  |  0.941421, 0.117157, -0.316228, 0.000000 |
  |  0.117157, 0.765685, 0.632456, 0.000000 |
  |  0.316228, -0.632456, 0.707107, 0.000000 |
  |  2.000000, 5.000000, -15.000000, 1.000000 |
Inverse matrix:
  |  0.941421, 0.117157, 0.316228, -0.000000 |
  |  0.117157, 0.765685, -0.632456, 0.000000 |
  | -0.316228, 0.632456, 0.707107, -0.000000 |
  | -7.212046, 5.424091, 13.136425, 1.000000 |
Transpose matrix:
  |  0.941421, 0.117157, -0.316228, -7.212046 |
  |  0.117157, 0.765685, 0.632456, 5.424091 |
  |  0.316228, -0.632456, 0.707107, 13.136425 |
  | -0.000000, 0.000000, -0.000000, 1.000000 |

So pretty clearly, something goes wonky and wrong in the SSE path for glm_mat4_inv.

@recp
Copy link
Owner

recp commented Apr 7, 2023

Hi @deadwanderer

Thanks for the feedbacks,

I'll take a look at asap, many thanks, if anyone could find the issue before me a PR would also be helpful

@recp
Copy link
Owner

recp commented Apr 7, 2023

This is ARM64 output (macOS M1 Max and Virtual Windows11_ARM MSVC are similar):

Screenshot 2023-04-08 at 12 32 00 AM

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

This is x86 output :

Screenshot 2023-04-08 at 12 31 51 AM

(Virtual Windows11_ARM MSVC)
Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

I got similar results. I'll try to test on my Intel Mac later. What's your environment may I ask? Also do you have latest cglm ?

@recp
Copy link
Owner

recp commented Apr 7, 2023

glm_arch_print_name(stderr || stdout) which is added with glm_arch_print_name could be helpful to print selected simd path in general. glm_arch_print_name can be called at top to see what is the arch...

@deadwanderer
Copy link
Author

deadwanderer commented Apr 8, 2023

Thanks for the response. Using my example code again, against a fresh pull of the cglm repository, and calling the glm_arch_print_name function at the top, gave the following results:

cglm arch: x86 SSE*

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.96799 -0.12046 -0.32515   7.41562  |
  |  0.12046 -0.78730  0.65031  -5.57719  |
  |  0.32515  0.65031  0.72707 -13.50722  |
  | -0.00000 -0.00000 -0.00000  -1.02823  |

Transpose matrix:
Matrix (float4x4):
  |  0.96799  0.12046   0.32515 -0.00000  |
  | -0.12046 -0.78730   0.65031 -0.00000  |
  | -0.32515  0.65031   0.72707 -0.00000  |
  |  7.41562 -5.57719 -13.50722 -1.02823  |

@deadwanderer
Copy link
Author

Oh, and I'm on Windows 10 desktop.

@gottfriedleibniz
Copy link

I suspect the compiler is converting mixed +/-0.0f constants like this, into -0.0f, -0.0f, -0.0f, -0.0f or 0.0f, 0.0f, 0.0f, 0.0f. Are you compiling with /fp:fast by chance?

@deadwanderer
Copy link
Author

That was it! I removed /fp:fast, and am now getting identical results on SSE and non-SSE paths.

Thank you, @recp and @gottfriedleibniz for all the troubleshooting and help!

@gottfriedleibniz
Copy link

gottfriedleibniz commented Apr 8, 2023

Ideally, functions which mix +/-0.0f should be robust enough against fp:fast. Especially when they still work with ffast-math on gcc/clang.

Given that the variable is used for extracting the sign of a float. I wonder if replacing those constants with its hex equivalent 0x80000000u, 0x0u, ... (then casting _mm_castsi128_ps) would be a more robust alternative. Otherwise, MSVC pragmas are an option.

All uses of -0.0f should probably be looked at, I guess. Maybe.

@recp
Copy link
Owner

recp commented Apr 8, 2023

@deadwanderer thanks, @gottfriedleibniz good catch, many thanks!

I used /fp:fast but got similar results (SSE2 path on virtual Windows11 arm insider, VS Community 2022 (ARM 64-bit) - 17.6.0 Preview 2.0), not sure how to reproduce the issue 🤷‍♂️

Screenshot 2023-04-09 at 12 43 22 AM

Given that the variable is used for extracting the sign of a float. I wonder if replacing those constants with its hex equivalent 0x80000000u, 0x0u, ... (then casting _mm_castsi128_ps) would be a more robust alternative. Otherwise, MSVC pragmas are an option.

+1 for this. It is better to make it work on compilers including msvc :)

Otherwise, MSVC pragmas are an option.

Which pragmas may I ask? How to make it better with msvc pragmas on msvc? A PR to make improvements as you mentioned would also be awesome :)

@recp recp reopened this Apr 8, 2023
@gottfriedleibniz
Copy link

Finally had time to examine this, fp:fast is folding both expressions into a single constant:

x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);

Resulting in with fp:fast:

movaps  xmm15, XMMWORD PTR __xmm@80000000000000008000000000000000
// ...
xorps   xmm4, xmm15
// ...
xorps   xmm13, xmm15
// ...
xorps   xmm14, xmm15
// ...
xorps   xmm12, xmm15

Without fp:fast:

shufps  xmm0, xmm10, 153                    ; 00000099H
movaps  xmm11, xmm4
movaps  XMMWORD PTR x9$1$[rsp], xmm0
// ...
xorps   xmm4, xmm10
// ...
xorps   xmm14, xmm10
// ...
xorps   xmm15, XMMWORD PTR x9$1$[rsp]
// ...
xorps   xmm13, XMMWORD PTR x9$1$[rsp]

Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this. Although, there may be some superior alternative that I am unaware of (and other cases, esp. after inlining).

Which pragmas may I ask?

For reference: pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.

@recp
Copy link
Owner

recp commented Apr 9, 2023

@gottfriedleibniz many thanks

Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this.

any chance to get a PR? or I'll do it asap

For reference: pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.

thanks

@recp
Copy link
Owner

recp commented Apr 9, 2023

Finally had time to examine this, fp:fast is folding both expressions into a single constant:

x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);

also to avoid similar scenarios, any idea why fp:fast option doing this since both x8 and x9 are used in different places?

@gottfriedleibniz
Copy link

any chance to get a PR? or I'll do it asap

You can.

any idea why fp:fast option doing this since both x8 and x9 are used in different places?

I suppose some byproduct of subexpression elimination with fp:fast not having to honour signed zeros. I have not looked into this any further.

@recp
Copy link
Owner

recp commented Apr 20, 2023

@gottfriedleibniz sure, I'll asap, thanks for the infos

@recp
Copy link
Owner

recp commented Apr 21, 2023

@gottfriedleibniz I created a PR for this, it would be nice to get a review: #291

@deadwanderer is it possible to test that PR with /fp:fast to validate that is working as expected (previous tests)?

thanks

@deadwanderer
Copy link
Author

deadwanderer commented Apr 21, 2023

Here are my results using that PR's changes:

cglm arch: x86 SSE*

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

Looks good to me! Thanks so much, @recp and @gottfriedleibniz!

@gottfriedleibniz
Copy link

Seems good. Passes my tests.

@recp recp closed this as completed in #291 Apr 22, 2023
@recp
Copy link
Owner

recp commented Apr 22, 2023

@deadwanderer @gottfriedleibniz many thanks, that PR is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants