Add libtorchtext cpp example #1817

Merged (6 commits, Jul 8, 2022)

2 changes: 2 additions & 0 deletions examples/libtorchtext/.gitignore
@@ -0,0 +1,2 @@
build
**/*.pt
11 changes: 11 additions & 0 deletions examples/libtorchtext/CMakeLists.txt
@@ -0,0 +1,11 @@
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(libtorchtext_cpp_example)

set(BUILD_TORCHTEXT_PYTHON_EXTENSION OFF CACHE BOOL "Build Python binding")

find_package(Torch REQUIRED)
message(STATUS "libtorchtext CMakeLists: ${TORCH_CXX_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_subdirectory(../.. libtorchtext)
add_subdirectory(tokenizer)
22 changes: 22 additions & 0 deletions examples/libtorchtext/README.md
@@ -0,0 +1,22 @@
# Libtorchtext Examples

- [Tokenizer](./tokenizer)

## Build

The example applications in this directory depend on `libtorch` and `libtorchtext`. If you have a working `PyTorch`
installation, you already have `libtorch`. Please refer to
[this tutorial](https://pytorch.org/tutorials/advanced/torch_script_custom_classes.html) for an introduction to using
`libtorch` and TorchScript.
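
For orientation, the core pattern that tutorial covers, loading a scripted module and calling its `forward` method from
C++, looks roughly like the sketch below (the file name and input string are illustrative); the
[tokenizer example](./tokenizer/main.cpp) follows the same pattern.

```cpp
#include <torch/script.h>

#include <iostream>

int main() {
  // Load a module previously saved with torch.jit.save in Python.
  torch::jit::script::Module module = torch::jit::load("tokenizer.pt");

  // forward() takes the module's inputs as a vector of IValues;
  // here, a single string input.
  c10::IValue output = module.forward({c10::IValue("hello world")});
  std::cout << output << std::endl;
  return 0;
}
```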

`libtorchtext` is the library of torchtext's C++ components, without the Python bindings. It is not currently
distributed as a prebuilt package, so it is built alongside the example applications.

To build `libtorchtext` and the example applications, run the following commands.

```bash
chmod +x build.sh # give script execute permission
./build.sh
```

For the usage of each application, refer to the corresponding application directory.
18 changes: 18 additions & 0 deletions examples/libtorchtext/build.sh
@@ -0,0 +1,18 @@
#!/usr/bin/env bash

set -eux

this_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
build_dir="${this_dir}/build"

mkdir -p "${build_dir}"
cd "${build_dir}"

# make sure third-party submodules are checked out
git submodule update --init --recursive
cmake \
-DCMAKE_PREFIX_PATH="$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')" \
-DRE2_BUILD_TESTING:BOOL=OFF \
-DBUILD_TESTING:BOOL=OFF \
-DSPM_ENABLE_SHARED=OFF \
..
cmake --build .
42 changes: 42 additions & 0 deletions examples/libtorchtext/tokenizer/README.md
@@ -0,0 +1,42 @@
# Tokenizer

This example demonstrates how you can use torchtext's `GPT2BPETokenizer` in a C++ environment.

## Steps

### 1. Download necessary artifacts

First, we download the `gpt2_bpe_vocab.bpe` and `gpt2_bpe_encoder.json` artifacts, both of which are needed to
construct the `GPT2BPETokenizer` object.

```bash
curl -O https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe
curl -O https://download.pytorch.org/models/text/gpt2_bpe_encoder.json
```

### 2. Create tokenizer TorchScript file

Next, we create the tokenizer object and save it as a TorchScript object. We also print the output of the tokenizer on
a sample sentence and verify that the output is the same before and after saving and re-loading the tokenizer. In the
next steps we will load and execute the tokenizer in our C++ application. The C++ code is found in
[`main.cpp`](./main.cpp).

```bash
tokenizer_file="tokenizer.pt"
python create_tokenizer.py --tokenizer-file "${tokenizer_file}"
```

### 3. Build the application

Please refer to [the top level README](../README.md).

### 4. Run the application

Now we run the C++ application `tokenizer` with the TorchScript object we created in Step 2. The application runs the
tokenizer on the same sample sentence (hard-coded in [`main.cpp`](./main.cpp)), so we can verify that the output
matches that of Step 2.

From [the top level directory](../), run:

```bash
./build/tokenizer/tokenize "tokenizer/${tokenizer_file}"
```
29 changes: 29 additions & 0 deletions examples/libtorchtext/tokenizer/create_tokenizer.py
@@ -0,0 +1,29 @@
from argparse import ArgumentParser

import torch
from torchtext import transforms


def main(args):
    tokenizer_file = args.tokenizer_file
    sentence = "The green grasshopper jumped over the fence"

    # create tokenizer object
    encoder_json = "gpt2_bpe_encoder.json"
    bpe_vocab = "gpt2_bpe_vocab.bpe"
    tokenizer = transforms.GPT2BPETokenizer(encoder_json_path=encoder_json, vocab_bpe_path=bpe_vocab)

    # script and save tokenizer
    tokenizer = torch.jit.script(tokenizer)
    print(tokenizer(sentence))
    torch.jit.save(tokenizer, tokenizer_file)

    # load saved tokenizer and verify outputs match
    t = torch.jit.load(tokenizer_file)
    print(t(sentence))
    assert tokenizer(sentence) == t(sentence)


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--tokenizer-file", default="tokenizer.pt", type=str)
    main(parser.parse_args())
25 changes: 25 additions & 0 deletions examples/libtorchtext/tokenizer/main.cpp
@@ -0,0 +1,25 @@
#include <torch/script.h>

#include <iostream>
#include <string>
#include <vector>

int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: tokenize <path-to-scripted-tokenizer-file>\n";
    return -1;
  }

  std::cout << "Loading model...\n";

  torch::jit::script::Module module;
  try {
    module = torch::jit::load(argv[1]);
  } catch (const c10::Error& e) {
    std::cerr << "Error loading the model: " << e.what() << "\n";
    return -1;
  }

  torch::NoGradGuard no_grad; // ensures that autograd is off
  // Run the scripted tokenizer on a sample sentence. forward() takes the
  // module's inputs as a vector of IValues.
  torch::jit::IValue tokens_ivalue = module.forward(
      std::vector<c10::IValue>(1, "The green grasshopper jumped over the fence"));
  std::cout << "Result: " << tokens_ivalue << std::endl;

  return 0;
}
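
The application prints the returned `IValue` directly. If the individual tokens are needed as C++ strings, a sketch
along the following lines should work, assuming the scripted tokenizer returns a TorchScript `List[str]`:

```cpp
// Pull the individual tokens out of the generic IValue.
// Assumes tokens_ivalue holds a TorchScript List[str].
c10::List<c10::IValue> list = tokens_ivalue.toList();
std::vector<std::string> tokens;
tokens.reserve(list.size());
for (size_t i = 0; i < list.size(); ++i) {
  const c10::IValue elem = list.get(i);
  tokens.push_back(elem.toStringRef());
}
for (const auto& token : tokens) {
  std::cout << token << '\n';
}
```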