Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally ignore utf-8 decoding error when converting std::string to python str. #97282

Closed
wants to merge 1 commit into from

Conversation

shuminghu
Copy link
Contributor

Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

@pytorch-bot
Copy link

pytorch-bot bot commented Mar 21, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97282

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 5 Pending

As of commit 237daaa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 21, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: shuminghu / name: Shuming Hu (dd7a4c0)

@pytorch-bot pytorch-bot bot added the release notes: jit release notes category label Mar 21, 2023
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

shuminghu added a commit to shuminghu/text that referenced this pull request Mar 21, 2023
… python str. (#97282)

Summary:
X-link: pytorch/pytorch#97282

Pull Request resolved: pytorch#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 37da8270cfd4ae11a43aeb7ab7093edd7d800cee
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 21, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 3202461664252f309ff9a63b35faaf642e92a81a
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 21, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 88954cce2894909bab6a9f7f26d8a41d9652d9fc
@shuminghu shuminghu self-assigned this Mar 21, 2023
Copy link
Contributor

@davidberard98 davidberard98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

  • a few nits (inline)
  • can you make sure the lint is passing? one of the links shows it's failing but I can't quite find it. (in the pytorch repo, lint failures will get reverted)

torch/script.h Outdated
@@ -5,6 +5,7 @@
#include <torch/csrc/autograd/custom_function.h>
#include <torch/csrc/autograd/generated/variable_factories.h>
#include <torch/csrc/autograd/grad_mode.h>
#include <torch/csrc/jit/python/utf8_decoding_ignore.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to for registering bindings in the follow-up change in torchtext/csrc/register_torchbindings.cpp, since it only includes torch/script.h.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm can we just include torch/csrc/jit/python/utf8_decoding_ignore.h in register_torchbindings.cpp?

not a big deal but since this is sort of a private api, we probably don't need to include it in torch/script.h (as you can see the current script.h includes a pretty minimal list of .h files)

Copy link
Contributor Author

@shuminghu shuminghu Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wierd. It fails with

fbcode/pytorch/text/torchtext/csrc/register_torchbindings.cpp:215:20: error: no member named 'setUTF8DecodingIgnore' in namespace 'torch::jit'
      &torch::jit::setUTF8DecodingIgnore);

when utf8_decoding_ignore.h is moved from torch/script.h to torchtext/csrc/register_torchbindings.cpp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm and this is after adding the .h include to both torchtext/csrc/register_pybindings.cpp and torchtext/csrc/register_torchingbindings.cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worked. I accidentally added it to files under /xplat 😅

@shuminghu
Copy link
Contributor Author

The lint fix has already been applied. It's from the formatting of adjacent but un-modified code.

shuminghu added a commit to shuminghu/text that referenced this pull request Mar 22, 2023
… python str. (#97282)

Summary:
X-link: pytorch/pytorch#97282

Pull Request resolved: pytorch#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 4988147e6905d1fb8096bf6d172ab8d8952b49b0
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 22, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 262b3e9165e50d893a72f162705956102f1143bc
shuminghu added a commit to shuminghu/text that referenced this pull request Mar 22, 2023
… python str. (#97282)

Summary:
X-link: pytorch/pytorch#97282

Pull Request resolved: pytorch#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 89cee96315440bb54f5d5e70665ac2fc4ee75e1b
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 22, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: a871d2537a6c3aa26f1858be2484320f92184e37
shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 22, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 872c1ea41870d885ad52c39a93735afa43020525
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: d54a7c527d702bec37544442f9ce6e2d441ddc47
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

@davidberard98 davidberard98 added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 22, 2023
@shuminghu
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 23, 2023
… python str. (#97282)

Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

Pull Request resolved: pytorch/pytorch#97282
Approved by: https://github.com/davidberard98
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 27, 2023
… python str. (#97282)

Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

Pull Request resolved: pytorch/pytorch#97282
Approved by: https://github.com/davidberard98
shuminghu added a commit to shuminghu/text that referenced this pull request Mar 27, 2023
Summary: Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282).

Reviewed By: Nayef211

Differential Revision: D44315169

fbshipit-source-id: fa0f2fb02fb65c389743e32455dd5875d4e0324a
shuminghu added a commit to shuminghu/text that referenced this pull request Mar 27, 2023
…ytorch#2128)

Summary:
Pull Request resolved: pytorch#2128

Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282).

Reviewed By: Nayef211

Differential Revision: D44315169

fbshipit-source-id: a4a60549c8945c8e8594f4f2cc9e7c99d105128d
facebook-github-bot pushed a commit to pytorch/text that referenced this pull request Mar 28, 2023
…2128)

Summary:
Pull Request resolved: #2128

Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282).

Reviewed By: Nayef211

Differential Revision: D44315169

fbshipit-source-id: d42fcacafd429cf586c631faf826abc172b173d3
Nayef211 pushed a commit to Nayef211/text that referenced this pull request Mar 29, 2023
…ytorch#2128)

Summary:
Pull Request resolved: pytorch#2128

Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282).

Reviewed By: Nayef211

Differential Revision: D44315169

fbshipit-source-id: d42fcacafd429cf586c631faf826abc172b173d3
Nayef211 added a commit to pytorch/text that referenced this pull request Mar 29, 2023
#2134)

* Optionally ignore utf-8 decoding error for scripted C++ tokenizers. (#2128)

Summary:
Pull Request resolved: #2128

Binding and test to make sure we can use 'ignore' option for utf-8 decoding added to pytorch in D43970697( pytorch/pytorch#97282).

Reviewed By: Nayef211

Differential Revision: D44315169

fbshipit-source-id: d42fcacafd429cf586c631faf826abc172b173d3

* Linter fixes

---------

Co-authored-by: Shuming Hu <smhu@meta.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged release notes: jit release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants