
Error loading saved tokenizer #1255

Closed
imagineer258 opened this issue Mar 15, 2021 · 6 comments

@imagineer258

imagineer258 commented Mar 15, 2021

❓ Questions and Help

Description

When I try to load a saved tokenizer from torchtext, I get the following error:

Loading model...
terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():  
Unknown type name '__torch__.torch.classes.torchtext.RegexTokenizer':
Serialized   File "code/__torch__/torchtext/experimental/transforms.py", line 6
  training : bool
  _is_full_backward_hook : Optional[bool]
  regex_tokenizer : __torch__.torch.classes.torchtext.RegexTokenizer
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  def forward(self: __torch__.torchtext.experimental.transforms.RegexTokenizer,
    line: str) -> List[str]:

Aborted (core dumped)

I saved the tokenizer with the following code:

import torch
from torchtext.experimental.transforms import regex_tokenizer

tokenizer = regex_tokenizer([])  # empty patterns list; still builds a scriptable RegexTokenizer
tokenizer_scripted = torch.jit.script(tokenizer)
tokenizer_scripted.save("tokenizer.pt")

and I'm trying to load it back in C++ with:

#include <torch/script.h>

#include <iostream>
#include <string>
#include <vector>

int main(int argc, const char* argv[]) {

    std::cout << "Loading model...\n";

    torch::jit::script::Module module;
    try {
        module = torch::jit::load(argv[1]);
    } catch (const c10::Error& e) {
        std::cerr << "Error loading the model: " << e.what() << '\n';
        return -1;
    }

    torch::NoGradGuard no_grad; // ensures that autograd is off

    // forward() takes a vector of IValues; wrap the input string in one.
    std::vector<torch::jit::IValue> inputs;
    inputs.emplace_back(std::string("test@gmail.com 00000001"));
    torch::jit::IValue tokens_ivalue = module.forward(inputs);
    std::cout << "result " << tokens_ivalue << '\n';

    return 0;
}

I'm assuming I have to link against the torchtext C++ code somewhere in my CMakeLists.txt, but I'm not sure how to do that. I tried adding the following, but it didn't help:

set_target_properties(TorchText PROPERTIES IMPORTED_LOCATION <path_to_torchtext.so>)
target_link_libraries(project TorchText)
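
A quick way to narrow this down is to load the artifact back in Python first; if the round trip below succeeds, the .pt file itself is fine and the C++ failure comes from the torchtext custom class not being registered in the C++ process. A minimal sanity check, assuming the save snippet above was already run:

import torch
import torchtext  # importing torchtext registers the custom RegexTokenizer class

# If this works, the failure is specific to the C++ loader.
loaded = torch.jit.load("tokenizer.pt")
print(loaded("test@gmail.com 00000001"))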

@zhangguanheng66
Contributor

We don't have support for torch.jit.load in C++. cc @mthrok

@mthrok
Contributor

mthrok commented Jun 23, 2021

Hi @imagineer258

Sorry for not getting back to you sooner. The notification was buried in my inbox.
Currently, torchtext does not support C++ usage.
The C++ code has a PyBind11 (and Python) dependency, so you would need to link against libpython, which is not trivial to do.

There was a plan to clean up the code to remove the Python dependency and move the build to a CMake-based one, but that has not happened yet. Before doing so we need to add tests for macOS; I was working on this in #1235 and #1300, but got busy.

@parmeet If you would like to support this, we can discuss the plan.

@parmeet
Contributor

parmeet commented Jun 23, 2021

@mthrok sounds good! let's take a look at this.

@parmeet
Contributor

parmeet commented Jun 24, 2021

Hi @imagineer258, until we do the cleanup and find a more convenient way to inject the torchtext dependency into C++, you can try manually linking against the torchtext shared library (you can find _torchtext.so in the installed package directory, or you can build it from source) and the Python library.

Below is a sample CMake file that should do the job:

cmake_minimum_required(VERSION 3.1 FATAL_ERROR)
project(torchtext_kernels)
find_package(Torch REQUIRED)
find_package(PythonLibs REQUIRED)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

# Import the prebuilt torchtext extension; point IMPORTED_LOCATION at the actual _torchtext.so.
add_library(_torchtext SHARED IMPORTED)
set_property(TARGET _torchtext PROPERTY IMPORTED_LOCATION PATH/TO/_torchtext.so)

add_executable(torchtext_kernels torchtext_kernels.cpp)

# Link the extension plus libpython (required by its PyBind11 dependency) and libtorch.
target_link_libraries(torchtext_kernels _torchtext ${PYTHON_LIBRARIES} "${TORCH_LIBRARIES}")

cc: @mthrok
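
To fill in the paths above, a small helper (a sketch, assuming standard pip installs of torch and torchtext) can print the values to pass to CMake:

import os
import torch
import torchtext

# Pass this to CMake as -DCMAKE_PREFIX_PATH so find_package(Torch) succeeds.
print(torch.utils.cmake_prefix_path)
# The installed package directory typically contains _torchtext.so, which the
# IMPORTED_LOCATION in the CMake file above should point at.
print(os.path.dirname(torchtext.__file__))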

mreso added a commit to mreso/test_scriptable_tokenizer that referenced this issue Jan 19, 2022
@parmeet
Contributor

parmeet commented Jun 23, 2022

I believe this can now be resolved by linking the C++ binary against the torchtext library using the CMake build. cc: @Nayef211

@Nayef211
Contributor

That's right, @parmeet. #1644 is the issue that describes the new build system, and the example added in #1817 showcases how to load a saved GPT2BPE tokenizer in C++.
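
For completeness, a rough sketch of producing such an artifact from Python, assuming the GPT2BPETokenizer transform in torchtext.transforms (the asset paths below are placeholders, and this is not the exact code from #1817):

import torch
from torchtext.transforms import GPT2BPETokenizer

# Placeholder paths: supply local copies of the GPT-2 BPE encoder and vocab files.
tokenizer = GPT2BPETokenizer(
    encoder_json_path="gpt2_bpe_encoder.json",
    vocab_bpe_path="gpt2_bpe_vocab.bpe",
)
torch.jit.script(tokenizer).save("gpt2_tokenizer.pt")  # then load this from C++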
