a global configuration to throw error on NaN #1274

NumPy has a global setting where, instead of silently returning NaN, operations can raise an error (a FloatingPointError):
https://docs.scipy.org/doc/numpy/reference/generated/numpy.seterr.html

We should have something like this in pytorch (maybe). This task is to scope out that work, or reject the feature request.

Comments
This might be helpful, following the example from their docs:

>>> np.int16(32000) * np.int16(3)
30464
>>> old_settings = np.seterr(all='warn', over='raise')
>>> np.int16(32000) * np.int16(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FloatingPointError: overflow encountered in short_scalars

Such error raising might be helpful for pytorch when a user is playing with quantization / deep compression. This is one thing I can think of. |
please please please |
FWIW I've implemented a wrapped version of torch that does its best to emulate this sort of behavior: https://github.com/samuela/kindling/blob/master/kindling/nan_police.py.
I've added it to my mini-toolkit of pytorch utilities, kindling. |
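For reference, the core of that wrapping approach can be sketched in a few lines. This is a hedged sketch of the idea, not the actual nan_police implementation; nan_check and safe_log are hypothetical names:

import torch

def nan_check(fn):
    # Wrap a tensor-returning function; raise as soon as its output contains NaN.
    def wrapped(*args, **kwargs):
        out = fn(*args, **kwargs)
        if isinstance(out, torch.Tensor) and torch.isnan(out).any():
            raise RuntimeError(f"NaN produced by {fn.__name__}")
        return out
    return wrapped

safe_log = nan_check(torch.log)
safe_log(torch.tensor([1.0, 2.0]))  # fine
safe_log(torch.tensor([-1.0]))      # log(-1) is NaN -> raises RuntimeError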
please add this feature |
please, this would be great! |
Great feature for debugging. Any updates? |
Anomaly detection provides this feature: https://pytorch.org/docs/stable/autograd.html#torch.autograd.detect_anomaly
However, it's not cheap. @samuela the simulated version you have is not cheap either, right? Shall we close this feature request with a pointer to detect_anomaly? |
@soumith No, I don't imagine that it's particularly fast, but I also haven't benchmarked it. I use it when debugging these sorts of errors and then go back to regular torch after figuring things out. |
In that case, I'm closing the issue. The global configurable switch to detect anomalies, including NaNs, is torch.autograd.set_detect_anomaly(True).
It is going to be really slow, but there isn't really something else we are aiming to do that'll be any faster. This can also be used as a context manager for a limited set of statements:

from torch import autograd

with autograd.detect_anomaly():
    inp = torch.rand(10, 10, requires_grad=True)
    out = run_fn(inp)
    out.backward()

Full documentation at: https://pytorch.org/docs/stable/autograd.html#torch.autograd.detect_anomaly |
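A minimal illustration of the global switch mentioned above; torch.autograd.set_detect_anomaly is the real API, but the failing example is ours, not from the thread:

import torch

# Enable anomaly detection globally: every subsequent backward pass checks
# intermediate gradients for NaN and raises with a traceback pointing at
# the forward op that produced them.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor(0.0, requires_grad=True)
y = torch.sqrt(x) * 0   # forward output is a clean 0 ...
y.backward()            # ... but SqrtBackward computes 0/0 = NaN -> RuntimeError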
It doesn't seem to raise an error on NaN in the forward pass, or does it? |
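As far as we know, detect_anomaly raises on NaNs produced during backward and does not inspect forward outputs. One way to catch forward-pass NaNs is a forward hook on every module; register_forward_hook is the real nn.Module API, while check_nans is a hypothetical helper:

import torch
import torch.nn as nn

def check_nans(module, inputs, output):
    # Forward hook: fires after each module's forward and inspects the output.
    if isinstance(output, torch.Tensor) and torch.isnan(output).any():
        raise RuntimeError(f"NaN in forward output of {module.__class__.__name__}")

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
for m in model.modules():
    m.register_forward_hook(check_nans)

model(torch.randn(2, 4))  # raises only if some layer emits NaN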