
Error loading saved tokenizer #1255

Closed
imagineer258 opened this issue Mar 15, 2021 · 6 comments

@imagineer258

imagineer258 commented Mar 15, 2021

❓ Questions and Help

Description

When I try to load a saved tokenizer from torchtext, I get the following error:

Loading model...
terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():  
Unknown type name '__torch__.torch.classes.torchtext.RegexTokenizer':
Serialized   File "code/__torch__/torchtext/experimental/transforms.py", line 6
  training : bool
  _is_full_backward_hook : Optional[bool]
  regex_tokenizer : __torch__.torch.classes.torchtext.RegexTokenizer
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  def forward(self: __torch__.torchtext.experimental.transforms.RegexTokenizer,
    line: str) -> List[str]:

Aborted (core dumped)

I saved the tokenizer with the following code:

import torch
from torchtext.experimental.transforms import regex_tokenizer

tokenizer = regex_tokenizer([])  # empty patterns list; still builds a scriptable RegexTokenizer
tokenizer_scripted = torch.jit.script(tokenizer)
tokenizer_scripted.save("tokenizer.pt")

and I'm trying to load it back in C++ with:

#include <torch/script.h>

#include <iostream>
#include <string>
#include <vector>

int main(int argc, const char* argv[]) {

    std::cout << "Loading model...\n";

    torch::jit::script::Module module;
    try {
        module = torch::jit::load(argv[1]);
    } catch (const c10::Error& e) {
        std::cerr << "Error loading the model: " << e.what() << '\n';
        return -1;
    }

    torch::NoGradGuard no_grad; // ensures that autograd is off

    // forward() takes a vector of IValues; wrap the input string in one.
    std::vector<torch::jit::IValue> inputs;
    inputs.emplace_back(std::string("test@gmail.com 00000001"));
    torch::jit::IValue tokens_ivalue = module.forward(inputs);
    std::cout << "result " << tokens_ivalue << '\n';

    return 0;
}

I'm assuming I have to link against the torchtext C++ code somewhere in my CMakeLists.txt, but I'm not sure how to do that. I tried adding the following, but it didn't help:

set_target_properties(TorchText PROPERTIES IMPORTED_LOCATION <path_to_torchtext.so>)
target_link_libraries(project TorchText)
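
A quick way to narrow this down is to load the artifact back in Python first; if the round trip below succeeds, the .pt file itself is fine and the C++ failure comes from the torchtext custom class not being registered in the C++ process. A minimal sanity check, assuming the save snippet above was already run:

import torch
import torchtext  # importing torchtext registers the custom RegexTokenizer class

# If this works, the failure is specific to the C++ loader.
loaded = torch.jit.load("tokenizer.pt")
print(loaded("test@gmail.com 00000001"))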

@zhangguanheng66
Contributor

We don't have support for torch.jit.load in C++. cc @mthrok

@mthrok
Contributor

mthrok commented Jun 23, 2021

Hi @imagineer258

Sorry for not getting back to you sooner. The notification was buried in my inbox.
Currently, torchtext does not support C++ usage.
The C++ code has a PyBind11 (and Python) dependency, so you would need to link against libpython, which is not trivial to do.

There was a plan to clean up the code to remove the Python dependency and move the build to a CMake-based one, but that has not happened yet. Before doing so we need to add tests for macOS; I was working on this in #1235 and #1300, but got busy.

@parmeet If you would like to support this, we can discuss the plan.

@parmeet
Contributor

parmeet commented Jun 23, 2021

@mthrok sounds good! let's take a look at this.

@parmeet
Contributor

parmeet commented Jun 24, 2021

Hi @imagineer258, until we do the cleanup and find a more convenient way to inject the torchtext dependency into C++, you can try manually linking against the torchtext shared library (you can find _torchtext.so in the installed package directory, or you can build it from source) and the Python library.

Below is a sample CMake file that should do the job:

cmake_minimum_required(VERSION 3.1 FATAL_ERROR)
project(torchtext_kernels)
find_package(Torch REQUIRED)
find_package(PythonLibs REQUIRED)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

# Import the prebuilt torchtext extension; point IMPORTED_LOCATION at the actual _torchtext.so.
add_library(_torchtext SHARED IMPORTED)
set_property(TARGET _torchtext PROPERTY IMPORTED_LOCATION PATH/TO/_torchtext.so)

add_executable(torchtext_kernels torchtext_kernels.cpp)

# Link the extension plus libpython (required by its PyBind11 dependency) and libtorch.
target_link_libraries(torchtext_kernels _torchtext ${PYTHON_LIBRARIES} "${TORCH_LIBRARIES}")

cc: @mthrok
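
To fill in the paths above, a small helper (a sketch, assuming standard pip installs of torch and torchtext) can print the values to pass to CMake:

import os
import torch
import torchtext

# Pass this to CMake as -DCMAKE_PREFIX_PATH so find_package(Torch) succeeds.
print(torch.utils.cmake_prefix_path)
# The installed package directory typically contains _torchtext.so, which the
# IMPORTED_LOCATION in the CMake file above should point at.
print(os.path.dirname(torchtext.__file__))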

mreso added a commit to mreso/test_scriptable_tokenizer that referenced this issue Jan 19, 2022
@parmeet
Contributor

parmeet commented Jun 23, 2022

I believe this can now be resolved by linking the C++ binary against the torchtext library using the CMake build. cc: @Nayef211

@Nayef211
Contributor

That's right, @parmeet. #1644 is the issue that describes the new build system, and the example added in #1817 showcases how to load a saved GPT2BPE tokenizer in C++.
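
For completeness, a rough sketch of producing such an artifact from Python, assuming the GPT2BPETokenizer transform in torchtext.transforms (the asset paths below are placeholders, and this is not the exact code from #1817):

import torch
from torchtext.transforms import GPT2BPETokenizer

# Placeholder paths: supply local copies of the GPT-2 BPE encoder and vocab files.
tokenizer = GPT2BPETokenizer(
    encoder_json_path="gpt2_bpe_encoder.json",
    vocab_bpe_path="gpt2_bpe_vocab.bpe",
)
torch.jit.script(tokenizer).save("gpt2_tokenizer.pt")  # then load this from C++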
