Add libtorchtext cpp example #1817

Merged: 6 commits into pytorch:main on Jul 8, 2022

Conversation

@Nayef211 (Contributor) commented on Jul 7, 2022

Reference Issue #1644

Description

  • Adding an example that demonstrates how to use the libtorchtext and libtorch libraries in a C++ application
  • The example uses a scripted GPT2BPETokenizer and shows that tokenization results are consistent between Python and C++

The example is inspired by the libtorchaudio examples in the torchaudio repo and the tokenizer example from @mreso.
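For readers skimming the PR, the C++ side of such an example boils down to loading the scripted tokenizer with the TorchScript C++ API and calling it on a sample string. The sketch below is only illustrative; the command-line interface, the tokenizer.pt file name, and the input sentence are assumptions, not necessarily what this PR's main.cpp contains.

```cpp
#include <torch/script.h>

#include <iostream>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
  if (argc != 2) {
    std::cerr << "Usage: tokenizer <path-to-scripted-tokenizer.pt>" << std::endl;
    return 1;
  }

  // Load the tokenizer that was scripted and saved from Python with
  // torch.jit.script(tokenizer).save("tokenizer.pt").
  torch::jit::script::Module tokenizer = torch::jit::load(argv[1]);

  // Call the scripted module's forward method on a sample sentence.
  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(std::string("Hello world! How are you?"));
  torch::jit::IValue tokens = tokenizer.forward(inputs);

  // For a GPT-2 BPE tokenizer the output is a list of tokens, which can
  // be compared against the output of the same scripted module in Python.
  std::cout << tokens << std::endl;
  return 0;
}
```

Because the scripted module references torchtext's custom C++ classes, the binary also needs to be linked against libtorchtext (in addition to libtorch) so those operators are registered before torch::jit::load runs.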

@Nayef211 (Contributor, Author) commented on Jul 7, 2022

@parmeet one thing I wanted to double-check with you is whether we should reuse the gpt2_bpe_encoder.json and gpt2_bpe_vocab.bpe already in the test/asset folder instead of checking them into the example folder. I thought it would be cleaner if all artifacts for the example were self-contained in one folder, but I could go either way here.

@Nayef211 marked this pull request as ready for review on July 7, 2022 14:45
@mthrok (Contributor) left a comment:


gpt2_bpe_vocab.bpe and gpt2_bpe_encoder.json appear to be a second copy of the same assets already checked in at https://github.com/pytorch/text/tree/main/test/asset.

This kind of trend is one of the reasons I suggested not checking in such large assets. #1462 (comment)

@parmeet (Contributor) commented on Jul 7, 2022

I think instead we could just ask users to download the artifacts in the README? I agree with @mthrok that we should avoid checking in artifacts.

wget https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe

wget https://download.pytorch.org/models/text/gpt2_bpe_encoder.json

@Nayef211 (Contributor, Author) commented on Jul 7, 2022

> I think instead we could just ask users to download the artifacts in the README? I agree with @mthrok that we should avoid checking in artifacts.
>
> wget https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe
> wget https://download.pytorch.org/models/text/gpt2_bpe_encoder.json

Thanks for the feedback @mthrok and @parmeet. I've removed the assets and added instructions on how to download them.

@Nayef211 requested a review from @mthrok on July 7, 2022 17:02
@parmeet (Contributor) left a comment:


Thanks @Nayef211 for adding this example application, LGTM!!

Review thread on examples/libtorchtext/tokenizer/main.cpp (outdated, resolved)
@Nayef211 merged commit cf94d30 into pytorch:main on Jul 8, 2022
@Nayef211 deleted the example/libtorchtext branch on July 8, 2022 03:33
@JiedaokouWangguan commented:

Hi @Nayef211, I was wondering if we need a CMakeLists.txt in examples/libtorchtext/tokenizer/? I was not able to run the script. I'm sorry if the question is too dumb, I'm very new to CMake.

@Nayef211 (Contributor, Author) commented:

> Hi @Nayef211, I was wondering if we need a CMakeLists.txt in examples/libtorchtext/tokenizer/? I was not able to run the script. I'm sorry if the question is too dumb, I'm very new to CMake.

@JiedaokouWangguan thanks for calling out this issue. I think it should be resolved by #1908. Would you be able to check out that PR locally and see whether you can get the example working?
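For anyone hitting the same build question, a standalone CMakeLists.txt for an example like this essentially needs to find libtorch and link the executable against both libtorch and libtorchtext. The sketch below is only a hedged illustration, not the file added in #1908; the target name, the TORCHTEXT_LIBRARY variable, and the paths are assumptions to adapt to your own install.

```cmake
cmake_minimum_required(VERSION 3.18)
project(tokenizer_example)

# libtorch ships a CMake package config, so find_package(Torch) provides
# TORCH_LIBRARIES and TORCH_CXX_FLAGS.
find_package(Torch REQUIRED)

add_executable(tokenizer main.cpp)

# TORCHTEXT_LIBRARY is an assumed cache variable; pass it on the command
# line, e.g. -DTORCHTEXT_LIBRARY=/path/to/libtorchtext.so, pointing at
# wherever your libtorchtext build or install placed the shared library.
target_link_libraries(tokenizer "${TORCH_LIBRARIES}" "${TORCHTEXT_LIBRARY}")

set_property(TARGET tokenizer PROPERTY CXX_STANDARD 17)
```

Depending on the platform and linker, you may also need to keep libtorchtext from being dropped as an unused dependency (for example with -Wl,--no-as-needed on Linux) so its custom tokenizer operators are registered when the scripted module is loaded.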

@JiedaokouWangguan commented:

Thanks for the quick fix! Let me give it a try.
