Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

webgl: Fix NaN issue #6828

Merged
merged 12 commits into from Sep 29, 2022
Merged

webgl: Fix NaN issue #6828

merged 12 commits into from Sep 29, 2022

Conversation

qjia7
Copy link
Collaborator

@qjia7 qjia7 commented Sep 9, 2022

Fix #6822

Problem
1: On some GPUs, even if a and b are both non-NaN, the value of isNaN in vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0)); are still larger than 0., which misleads all values become NAN.
2: After resolving NAN issue, the result is still incorrect. It seems that the isnan_custom is not well supported on the problem GPU. After switching back to builtin isnan, everything works well.

Solution:
Use the bool type bvec4 instead of float type vec4 to calculate isNaN to avoid the the float precision issue when comparing with zero.
Meanwhile, add an env flag WEBGL_ISNAN_CUSTOM to allow user to specify which isnan to use.

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.


This change is Reviewable

@qjia7 qjia7 changed the title [Test] Try to find the bug of WEBGL_USE_SHAPES_UNIFORMS webgl: Use bvec4 isNaN checking instead of vec4 isNaN checking Sep 15, 2022
@qjia7
Copy link
Collaborator Author

qjia7 commented Sep 15, 2022

@vladmandic @shurshilov
Please have the last test with this PR. I also make it ready for review.
The verified solution #6822 (comment):

vec4 nanValue = a;
bvec4 isNaN = isnan(a);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

nanValue = b;
isNaN = isnan(b);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

return result;

The solution in this PR:

  bvec4 isNaNA = isnan(a);
  bvec4 isNaNB = isnan(b);
  bvec4 isNaN = bvec4(isNaNA.x || isNaNB.x, isNaNA.y || isNaNB.y, isNaNA.z || isNaNB.z, isNaNA.w || isNaNB.w);
  result.r = isNaN.r ? NAN : result.r;
  result.g = isNaN.g ? NAN : result.g;
  result.b = isNaN.b ? NAN : result.b;
  result.a = isNaN.a ? NAN : result.a;

  return result;

I think the second one should also be workable. But if not, I need to revert this change Use NAN instead of nanValue.xxx

Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qjia7 thank you for fixing the NaN check, seems there are still some discussion on the validity of the fix? can you add some comment on the snippets to explain why the check needs to be breaking up. thanks!

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, 5 of 5 files at r3, 7 of 7 files at r4, 6 of 6 files at r5, 9 of 9 files at r6.
Reviewable status: 0 of 1 approvals obtained (waiting on @lina128)

@qjia7 qjia7 changed the title webgl: Use bvec4 isNaN checking instead of vec4 isNaN checking webgl: Fix NaN issue Sep 20, 2022
Copy link
Collaborator Author

@qjia7 qjia7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pyu10055 Breaking up the checks won't resolve the problem. And our previous verified solution is a misleading, which used the same test url (which totally removing NAN checking) by accident. So it's reasonable that whether to break up the checks, the result is similar.
After further debugging, it shows that isnan_custom doesn't work well on the problem gpu. But if we switch back to the shader builtin isnan, everything works well. So the current fixing is to allow the user to specify which isnan to use. Please take another look. Thanks.

Reviewable status: 0 of 1 approvals obtained (waiting on @lina128)

Copy link
Collaborator Author

@qjia7 qjia7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, please hold on the review. I find another issue. Will ping you when it's ready.

Reviewable status: 0 of 1 approvals obtained (waiting on @lina128)

Copy link
Collaborator Author

@qjia7 qjia7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ready for review. It turns out the previous issue is my test case problem.

Reviewable status: 0 of 1 approvals obtained (waiting on @lina128)

@qjia7 qjia7 requested a review from pyu10055 September 21, 2022 00:17
@qjia7
Copy link
Collaborator Author

qjia7 commented Sep 23, 2022

Kindly ping you guys. Also add @Linchenn in case you are in vacation.

@qjia7 qjia7 requested a review from Linchenn September 23, 2022 00:51
Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

Reviewable status: 0 of 1 approvals obtained (waiting on @lina128, @Linchenn, and @qjia7)


tfjs-backend-webgl/src/flags_webgl.ts line 276 at r8 (raw file):

 * doesn't have the builtin isnan.
 */
ENV.registerFlag('WEBGL_ISNAN_CUSTOM', () => false);

should this default to true to match the previous behavior?


tfjs-backend-webgl/src/glsl_version.ts line 78 at r8 (raw file):

      #define isnan(value) isnan_custom(value)
    ` :
                                                             '';

The formatting seems to be weird, is it auto-formatted?

@vladmandic
Copy link
Contributor

should this default to true to match the previous behavior?

if previous behavior is proven to be broken on some gpus, why?

afaik, only reason for isnan_custom is because webgl1 does not have native isnan, so why not enable it for webgl1 and disable otherwise. we should avoid creating new flags that nobody will remember few months from now.

Copy link
Collaborator

@lina128 lina128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Jiajia!

Reviewable status: 0 of 1 approvals obtained (waiting on @Linchenn, @pyu10055, and @qjia7)


tfjs-backend-webgl/src/flags_webgl.ts line 276 at r8 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

should this default to true to match the previous behavior?

If it's only useful for webgl2, should this flag be set to true if webgl1 is true?

Copy link
Collaborator Author

@qjia7 qjia7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @Linchenn, @pyu10055, and @vladmandic)


tfjs-backend-webgl/src/flags_webgl.ts line 276 at r8 (raw file):

Previously, lina128 (Na Li) wrote…

If it's only useful for webgl2, should this flag be set to true if webgl1 is true?

webgl1 is not controlled by this flag. Since webgl1 doesn't have a builtin isnan, we always use a custom isnan. For webgl2, previously, we use a customized isnan, which is proved to be problem on some GPUs. So I add this flag to use builtin isnan for webgl2 by default. Meanwhile, I keep the original customized isnan for testing if necessary. I agree with @vladmandic's opinion. Maybe I can rename it to WEBGL2_ISNAN_CUSTOM to more reflect the meaning.


tfjs-backend-webgl/src/glsl_version.ts line 78 at r8 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

The formatting seems to be weird, is it auto-formatted?

Yes, it's auto-formatted.

Copy link
Collaborator Author

@qjia7 qjia7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @Linchenn, @pyu10055, and @vladmandic)


tfjs-backend-webgl/src/flags_webgl.ts line 276 at r8 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

webgl1 is not controlled by this flag. Since webgl1 doesn't have a builtin isnan, we always use a custom isnan. For webgl2, previously, we use a customized isnan, which is proved to be problem on some GPUs. So I add this flag to use builtin isnan for webgl2 by default. Meanwhile, I keep the original customized isnan for testing if necessary. I agree with @vladmandic's opinion. Maybe I can rename it to WEBGL2_ISNAN_CUSTOM to more reflect the meaning.

Or I can totally remove the customized isnan for webgl2 since we have the builtin isnan ?

@qjia7 qjia7 requested a review from pyu10055 September 27, 2022 01:11
@qjia7 qjia7 requested a review from lina128 September 27, 2022 01:11
Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128, @Linchenn, @pyu10055, and @vladmandic)


tfjs-backend-webgl/src/flags_webgl.ts line 276 at r8 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

Or I can totally remove the customized isnan for webgl2 since we have the builtin isnan ?

got it, if we believe using build-in isnan will fix the error and not creating any new issues, we can remove this flag.

Copy link
Collaborator Author

@qjia7 qjia7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128, @Linchenn, @pyu10055, and @vladmandic)


tfjs-backend-webgl/src/flags_webgl.ts line 276 at r8 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

got it, if we believe using build-in isnan will fix the error and not creating any new issues, we can remove this flag.

Thanks Ping. Let's keep it for a while to see if any bug is reported by the builtin isnan. If everything works well, I will submit another PR to totally remove the customized one.

@hujiajie
Copy link
Contributor

@qjia7 It sounds the correctness of the builtin/polyfilled isnan can be programmatically probed with a small snippet and then the right one could be picked up automatically without "creating new flags that nobody will remember few months from now", or no?

@qjia7
Copy link
Collaborator Author

qjia7 commented Sep 28, 2022

@qjia7 It sounds the correctness of the builtin/polyfilled isnan can be programmatically probed with a small snippet and then the right one could be picked up automatically without "creating new flags that nobody will remember few months from now", or no?

In fact, It's hard to construct a small case to reflect the isnan issue. The current solution is to always use the builtin isnan. I know you may have concern about the comment here Use custom isnan definition to work across differences between implementations on various platforms.. But I think it's compare to the old custom isnan [return (val > 0. || val < 1. || val == 0.) ? false : true; ]. So I prefer we directly use the builtin isnan for simplification.

@qjia7 qjia7 merged commit 39dd70a into tensorflow:master Sep 29, 2022
@qjia7 qjia7 deleted the fix_shape_uniforms branch September 29, 2022 00:55
koyykdy added a commit to CodeSmithDSMLProjects/tfjs that referenced this pull request Sep 29, 2022
* Customize setTimeout (tensorflow#6694)

If the setTimeout nesting level is greater than 5 and timeout is less
than 4ms, timeout will be clamped to 4ms, which hurts the perf. A custom
setTimeout is provided to mitigate the perf impact.

BUG: tensorflow#6687
Co-authored-by: Na Li <linazhao@google.com>

* Upgrade windows BrowserStack chrome to 104 (tensorflow#6866)

* webgpu: Disable importExternalTexture (tensorflow#6868)

WebGPU Working Group recently found some problem with
importExtenalTexture in spec, so we have to disable it temporarily.

* Refactored Resizing Layer Unit Tests (#38)

* Rescaling Preprocessing Layer
Co-authored-by:
David Kim (@koyykdy) <dok098@ucsd.edu>
Brian Zheng (@Brianzheng123) <brianzheng345@gmail.com>

* PR issues resolved

* linting and PR issues resolved
Co-authored-by: Adam Lang (@AdamLang96) <adamglang96@gmail.com>
Co-authored-by: (@Brianzheng123) <brianzheng345@gmail.com>

* initial implementation of image preprocessing: resizing layer, and associated unit tests. Comments and refactoring for image scaling layer

* refactoring in computeOutputShape for image resizing layer

* Unit tests for image resizing preprocessing layer expanded and refactored

* refactored unit tests for resizing layer

* Preprocessing-Resizing layer unit test expansion and refactoring.
Co-authored-by: Adam Lang <@AdamLang96> (adamglang96@gmail.com)

* cleaning up commit diffs

* cleaning up commit diffs

* PR commit suggestions accepted - code refactored to reflect changes

* resizing layer unit test refactoring

Co-authored-by: AdamLang96 <45542095+AdamLang96@users.noreply.github.com>

* Linting issue resolved: unused import statement culled (#39)

* Rescaling Preprocessing Layer
Co-authored-by:
David Kim (@koyykdy) <dok098@ucsd.edu>
Brian Zheng (@Brianzheng123) <brianzheng345@gmail.com>

* PR issues resolved

* linting and PR issues resolved
Co-authored-by: Adam Lang (@AdamLang96) <adamglang96@gmail.com>
Co-authored-by: (@Brianzheng123) <brianzheng345@gmail.com>

* initial implementation of image preprocessing: resizing layer, and associated unit tests. Comments and refactoring for image scaling layer

* refactoring in computeOutputShape for image resizing layer

* Unit tests for image resizing preprocessing layer expanded and refactored

* refactored unit tests for resizing layer

* Preprocessing-Resizing layer unit test expansion and refactoring.
Co-authored-by: Adam Lang <@AdamLang96> (adamglang96@gmail.com)

* cleaning up commit diffs

* cleaning up commit diffs

* PR commit suggestions accepted - code refactored to reflect changes

* resizing layer unit test refactoring

* linting issues resolved: unusued import statement culled

Co-authored-by: AdamLang96 <45542095+AdamLang96@users.noreply.github.com>

* Update jasmine_util.ts (tensorflow#6872)

FIX

* webgl: Fix NaN issue (tensorflow#6828)

Fix tensorflow#6822

Problem
1: On some GPUs, even if a and b are both non-NaN, the value of isNaN in vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0)); are still larger than 0., which misleads all values become NAN.
2: After resolving NAN issue, the result is still incorrect. It seems that the isnan_custom is not well supported on the problem GPU. After switching back to builtin isnan, everything works well.

Solution:
Use the bool type bvec4 instead of float type vec4 to calculate isNaN to avoid the the float precision issue when comparing with zero.
Meanwhile, add an env flag WEBGL2_ISNAN_CUSTOM to allow user to specify which isnan to use.

* Upgrade nodejs to 18.7.0 (tensorflow#6863)

* Upgrade nodejs to 18.7.0

* Fix hash table test string not passed as base64

* fixed prelu fusing code that pre-maturely neg the const on multiply (tensorflow#6876)

Co-authored-by: RajeshT <43972606+rthadur@users.noreply.github.com>

* Update tfjs-layers/src/layers/preprocessing/image_resizing.ts

Co-authored-by: Matthew Soulanille <matthew@soulanille.net>

Co-authored-by: Yang Gu <yang.gu@intel.com>
Co-authored-by: Na Li <linazhao@google.com>
Co-authored-by: Matthew Soulanille <msoulanille@google.com>
Co-authored-by: AdamLang96 <45542095+AdamLang96@users.noreply.github.com>
Co-authored-by: Linchenn <40653845+Linchenn@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajia.qin@intel.com>
Co-authored-by: Ping Yu <4018+pyu10055@users.noreply.github.com>
Co-authored-by: RajeshT <43972606+rthadur@users.noreply.github.com>
Co-authored-by: Matthew Soulanille <matthew@soulanille.net>

Co-authored-by: Yang Gu <yang.gu@intel.com>
Co-authored-by: Na Li <linazhao@google.com>
Co-authored-by: Matthew Soulanille <msoulanille@google.com>
Co-authored-by: AdamLang96 <45542095+AdamLang96@users.noreply.github.com>
Co-authored-by: Linchenn <40653845+Linchenn@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajia.qin@intel.com>
Co-authored-by: Ping Yu <4018+pyu10055@users.noreply.github.com>
Co-authored-by: RajeshT <43972606+rthadur@users.noreply.github.com>
Co-authored-by: Matthew Soulanille <matthew@soulanille.net>
koyykdy added a commit to CodeSmithDSMLProjects/tfjs that referenced this pull request Sep 29, 2022
* Update jasmine_util.ts (tensorflow#6872)

FIX

* webgl: Fix NaN issue (tensorflow#6828)

Fix tensorflow#6822

Problem
1: On some GPUs, even if a and b are both non-NaN, the value of isNaN in vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0)); are still larger than 0., which misleads all values become NAN.
2: After resolving NAN issue, the result is still incorrect. It seems that the isnan_custom is not well supported on the problem GPU. After switching back to builtin isnan, everything works well.

Solution:
Use the bool type bvec4 instead of float type vec4 to calculate isNaN to avoid the the float precision issue when comparing with zero.
Meanwhile, add an env flag WEBGL2_ISNAN_CUSTOM to allow user to specify which isnan to use.

* Upgrade nodejs to 18.7.0 (tensorflow#6863)

* Upgrade nodejs to 18.7.0

* Fix hash table test string not passed as base64

* fixed prelu fusing code that pre-maturely neg the const on multiply (tensorflow#6876)

Co-authored-by: RajeshT <43972606+rthadur@users.noreply.github.com>

Co-authored-by: Linchenn <40653845+Linchenn@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajia.qin@intel.com>
Co-authored-by: Matthew Soulanille <msoulanille@google.com>
Co-authored-by: Ping Yu <4018+pyu10055@users.noreply.github.com>
Co-authored-by: RajeshT <43972606+rthadur@users.noreply.github.com>
koyykdy added a commit to CodeSmithDSMLProjects/tfjs that referenced this pull request Sep 29, 2022
* Update jasmine_util.ts (tensorflow#6872)

FIX

* webgl: Fix NaN issue (tensorflow#6828)

Fix tensorflow#6822

Problem
1: On some GPUs, even if a and b are both non-NaN, the value of isNaN in vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0)); are still larger than 0., which misleads all values become NAN.
2: After resolving NAN issue, the result is still incorrect. It seems that the isnan_custom is not well supported on the problem GPU. After switching back to builtin isnan, everything works well.

Solution:
Use the bool type bvec4 instead of float type vec4 to calculate isNaN to avoid the the float precision issue when comparing with zero.
Meanwhile, add an env flag WEBGL2_ISNAN_CUSTOM to allow user to specify which isnan to use.

* Upgrade nodejs to 18.7.0 (tensorflow#6863)

* Upgrade nodejs to 18.7.0

* Fix hash table test string not passed as base64

* fixed prelu fusing code that pre-maturely neg the const on multiply (tensorflow#6876)

Co-authored-by: RajeshT <43972606+rthadur@users.noreply.github.com>

Co-authored-by: Linchenn <40653845+Linchenn@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajia.qin@intel.com>
Co-authored-by: Matthew Soulanille <msoulanille@google.com>
Co-authored-by: Ping Yu <4018+pyu10055@users.noreply.github.com>
Co-authored-by: RajeshT <43972606+rthadur@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[webgl] enabling WEBGL_USE_SHAPES_UNIFORMS results in model execution returning null values
5 participants