Labels: bug (Something isn't working)
Description
In my environment, using the LLaMA fast tokenizer raises an error about protobuf:
```
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
```
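Workaround 2 from the error message can be applied without changing any installed packages. A minimal sketch (the variable must be set before `transformers` or any generated `*_pb2` module is imported):

```python
import os

# Must be set before importing transformers or any *_pb2 module: it forces
# protobuf's pure-Python parser, which skips the descriptor check that raises
# the TypeError above (at the cost of slower protobuf parsing).
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Only after this point would you import the tokenizer, e.g.:
# from transformers import AutoTokenizer
```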
While downgrading protobuf removed the error, it slowed tokenizer initialization down by roughly 8x:
- Initialization with fast tokenizer & protobuf==3.20.3

```
real    4m18.476s
user    3m52.706s
sys     0m27.644s
```

- Initialization with slow tokenizer

```
real    0m27.620s
user    0m8.011s
sys     0m19.237s
```
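As a sanity check, the quoted `real` times can be compared directly; this is pure arithmetic on the numbers above, not a new measurement, and it puts the fast-tokenizer-with-downgraded-protobuf path at roughly 9x the slow tokenizer's wall-clock time:

```python
def to_seconds(t: str) -> float:
    """Convert a shell `time`-style value such as '4m18.476s' to seconds."""
    minutes, rest = t.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))

fast_with_pb3203 = to_seconds("4m18.476s")  # fast tokenizer, protobuf==3.20.3
slow_tokenizer = to_seconds("0m27.620s")    # slow tokenizer

# Ratio of wall-clock times from the report above.
slowdown = fast_with_pb3203 / slow_tokenizer
```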