New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally ignore utf-8 decoding error when converting std::string to python str. #97282
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97282
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 5 PendingAs of commit 237daaa: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D43970697 |
… python str. (#97282) Summary: X-link: pytorch/pytorch#97282 Pull Request resolved: pytorch#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: 37da8270cfd4ae11a43aeb7ab7093edd7d800cee
This pull request was exported from Phabricator. Differential Revision: D43970697 |
dd7a4c0
to
f73b6b2
Compare
… python str. (pytorch#97282) Summary: Pull Request resolved: pytorch#97282 X-link: pytorch/text#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070 Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: 3202461664252f309ff9a63b35faaf642e92a81a
This pull request was exported from Phabricator. Differential Revision: D43970697 |
… python str. (pytorch#97282) Summary: Pull Request resolved: pytorch#97282 X-link: pytorch/text#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070 Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: 88954cce2894909bab6a9f7f26d8a41d9652d9fc
f73b6b2
to
1a8a1c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
- a few nits (inline)
- can you make sure the lint is passing? one of the links shows it's failing but I can't quite find it. (in the pytorch repo, lint failures will get reverted)
torch/script.h
Outdated
@@ -5,6 +5,7 @@ | |||
#include <torch/csrc/autograd/custom_function.h> | |||
#include <torch/csrc/autograd/generated/variable_factories.h> | |||
#include <torch/csrc/autograd/grad_mode.h> | |||
#include <torch/csrc/jit/python/utf8_decoding_ignore.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed to for registering bindings in the follow-up change in torchtext/csrc/register_torchbindings.cpp, since it only includes torch/script.h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm can we just include torch/csrc/jit/python/utf8_decoding_ignore.h
in register_torchbindings.cpp?
not a big deal but since this is sort of a private api, we probably don't need to include it in torch/script.h (as you can see the current script.h includes a pretty minimal list of .h files)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wierd. It fails with
fbcode/pytorch/text/torchtext/csrc/register_torchbindings.cpp:215:20: error: no member named 'setUTF8DecodingIgnore' in namespace 'torch::jit'
&torch::jit::setUTF8DecodingIgnore);
when utf8_decoding_ignore.h
is moved from torch/script.h to torchtext/csrc/register_torchbindings.cpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm and this is after adding the .h include to both torchtext/csrc/register_pybindings.cpp and torchtext/csrc/register_torchingbindings.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worked. I accidentally added it to files under /xplat 😅
|
… python str. (#97282) Summary: X-link: pytorch/pytorch#97282 Pull Request resolved: pytorch#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: 4988147e6905d1fb8096bf6d172ab8d8952b49b0
This pull request was exported from Phabricator. Differential Revision: D43970697 |
1a8a1c4
to
96ce7aa
Compare
… python str. (pytorch#97282) Summary: Pull Request resolved: pytorch#97282 X-link: pytorch/text#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705 Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: 262b3e9165e50d893a72f162705956102f1143bc
… python str. (#97282) Summary: X-link: pytorch/pytorch#97282 Pull Request resolved: pytorch#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: 89cee96315440bb54f5d5e70665ac2fc4ee75e1b
This pull request was exported from Phabricator. Differential Revision: D43970697 |
… python str. (pytorch#97282) Summary: Pull Request resolved: pytorch#97282 X-link: pytorch/text#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705 Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: a871d2537a6c3aa26f1858be2484320f92184e37
96ce7aa
to
adfa1ef
Compare
… python str. (pytorch#97282) Summary: Pull Request resolved: pytorch#97282 X-link: pytorch/text#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705 Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: 872c1ea41870d885ad52c39a93735afa43020525
adfa1ef
to
59cf210
Compare
This pull request was exported from Phabricator. Differential Revision: D43970697 |
… python str. (pytorch#97282) Summary: Pull Request resolved: pytorch#97282 X-link: pytorch/text#2126 When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705 Reviewed By: Nayef211 Differential Revision: D43970697 fbshipit-source-id: d54a7c527d702bec37544442f9ce6e2d441ddc47
This pull request was exported from Phabricator. Differential Revision: D43970697 |
59cf210
to
237daaa
Compare
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
… python str. (#97282) Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070 Reviewed By: Nayef211 Differential Revision: D43970697 Pull Request resolved: pytorch/pytorch#97282 Approved by: https://github.com/davidberard98
… python str. (#97282) Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument. Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070 Reviewed By: Nayef211 Differential Revision: D43970697 Pull Request resolved: pytorch/pytorch#97282 Approved by: https://github.com/davidberard98
Summary: Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282). Reviewed By: Nayef211 Differential Revision: D44315169 fbshipit-source-id: fa0f2fb02fb65c389743e32455dd5875d4e0324a
…ytorch#2128) Summary: Pull Request resolved: pytorch#2128 Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282). Reviewed By: Nayef211 Differential Revision: D44315169 fbshipit-source-id: a4a60549c8945c8e8594f4f2cc9e7c99d105128d
…2128) Summary: Pull Request resolved: #2128 Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282). Reviewed By: Nayef211 Differential Revision: D44315169 fbshipit-source-id: d42fcacafd429cf586c631faf826abc172b173d3
…ytorch#2128) Summary: Pull Request resolved: pytorch#2128 Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282). Reviewed By: Nayef211 Differential Revision: D44315169 fbshipit-source-id: d42fcacafd429cf586c631faf826abc172b173d3
#2134) * Optionally ignore utf-8 decoding error for scripted C++ tokenizers. (#2128) Summary: Pull Request resolved: #2128 Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282). Reviewed By: Nayef211 Differential Revision: D44315169 fbshipit-source-id: d42fcacafd429cf586c631faf826abc172b173d3 * Linter fixes --------- Co-authored-by: Shuming Hu <smhu@meta.com>
Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.
Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070
Reviewed By: Nayef211
Differential Revision: D43970697