Optionally ignore UTF-8 decoding errors when converting std::string to Python str. #2126

Closed

shuminghu
Contributor

Summary: When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.
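
To illustrate the difference between strict and 'ignore' decoding, here is a minimal Python-level sketch. It is not the patch itself: the actual change is on the C++/pybind11 side and its diff is not shown in this thread. The byte string and the helper names `decode_strict` and `decode_ignore` are hypothetical, chosen only to mimic a tokenizer output that cuts a multi-byte character in half.

```python
# Hypothetical tokenizer output: the 3-byte UTF-8 encoding of the euro sign
# is 0xE2 0x82 0xAC; here the last byte is missing, as if a token boundary
# fell in the middle of the character.
raw = b"price: \xe2\x82"


def decode_strict(data: bytes) -> str:
    """Decode with the default error handler (errors="strict")."""
    return data.decode("utf-8")


def decode_ignore(data: bytes) -> str:
    """Decode while silently dropping bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")


# Strict decoding rejects the truncated sequence.
try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding keeps the valid prefix and drops the dangling bytes.
lenient = decode_ignore(raw)

print(strict_failed)
print(repr(lenient))
```

The trade-off is that 'ignore' discards the invalid bytes without any signal to the caller, so the resulting str may be shorter than the original byte sequence. That may be why the title describes the behavior as optional.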

Reviewed By: Nayef211

Differential Revision: D43970697

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43970697

shuminghu added a commit to shuminghu/text that referenced this pull request Mar 21, 2023
… python str. (#97282)

Summary:
X-link: pytorch/pytorch#97282

Pull Request resolved: pytorch#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 37da8270cfd4ae11a43aeb7ab7093edd7d800cee
shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 21, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 3202461664252f309ff9a63b35faaf642e92a81a
shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 21, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 88954cce2894909bab6a9f7f26d8a41d9652d9fc
shuminghu added a commit to shuminghu/text that referenced this pull request Mar 22, 2023
… python str. (#97282)

Summary:
X-link: pytorch/pytorch#97282

Pull Request resolved: pytorch#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 4988147e6905d1fb8096bf6d172ab8d8952b49b0
shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 22, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 262b3e9165e50d893a72f162705956102f1143bc
shuminghu added a commit to shuminghu/text that referenced this pull request Mar 22, 2023
… python str. (#97282)

Summary:
X-link: pytorch/pytorch#97282

Pull Request resolved: pytorch#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 89cee96315440bb54f5d5e70665ac2fc4ee75e1b
shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 22, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: a871d2537a6c3aa26f1858be2484320f92184e37
shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 22, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: 872c1ea41870d885ad52c39a93735afa43020525
@shuminghu shuminghu closed this Mar 22, 2023
shuminghu added a commit to shuminghu/pytorch that referenced this pull request Mar 22, 2023
… python str. (pytorch#97282)

Summary:
Pull Request resolved: pytorch#97282

X-link: pytorch/text#2126

When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. The default pybind11 cast decodes them as strict UTF-8. We relax the decoding by using the 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4503599786612705

Reviewed By: Nayef211

Differential Revision: D43970697

fbshipit-source-id: d54a7c527d702bec37544442f9ce6e2d441ddc47