The output tensor's data type is not torch.long when the input text is empty. #36277
Comments
Hi @wangzhen0518, does this happen with all tokenizer classes, or just a specific one you tested?
I have only tested it on the tokenizer of the Qwen series models. Here is the complete code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-1.5B-Instruct')
t = tokenizer('', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.float32
```
I've investigated further and I believe this is caused by the behaviour of `torch.tensor`: when it receives an empty list, there are no elements to infer a dtype from, so it falls back to PyTorch's default dtype, `torch.float32`.
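That fallback can be seen with plain PyTorch, independent of transformers (a minimal demonstration, not from the original comment):

```python
import torch

# An empty (nested) list gives torch.tensor nothing to infer a dtype from,
# so it falls back to the default dtype, float32.
print(torch.tensor([[]]).dtype)      # torch.float32
print(torch.tensor([[1, 2]]).dtype)  # torch.int64 (torch.long) once elements exist
```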
Yes, you are right, I noticed that behaviour as well.

Actually, I'm currently working with an LLM integrated into an environment. When concatenating the environment's response (which is sometimes empty) to the LLM's output tensor of type torch.long, the concatenation unexpectedly changes the resulting tensor's dtype to torch.float32 when the response is empty. This dtype mismatch subsequently causes an error when feeding the concatenated tensor back into the LLM.

I know this can be easily resolved by adding a type conversion when concatenating the tensors, but shouldn't the tokenizer's behaviour remain consistent, i.e., shouldn't the default data type of the return values stay unchanged as long as no errors occur during tokenization?
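For context, the type-conversion workaround mentioned above might look like this (a minimal sketch reusing the `tokenizer` from the snippet above; the tensor names are illustrative, not from the original code):

```python
import torch

llm_output = torch.tensor([[1, 2, 3]], dtype=torch.long)          # token ids from the model
env_response = tokenizer('', return_tensors='pt')['input_ids']    # float32 when empty

# Cast before concatenating so the result stays torch.long.
combined = torch.cat([llm_output, env_response.to(torch.long)], dim=-1)
print(combined.dtype)  # torch.int64
```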
Hmm, I think it might make sense, but I'm unsure if the expected dtypes are stored anywhere, so this could be a tricky PR. I suspect the only way we could do this is to have a dict of common tokenizer output names like `input_ids` and `attention_mask` mapped to their expected dtypes.

cc @ArthurZucker for tokenizers - would you support a PR to add that, or does it add complexity for not enough gain?
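Such a mapping might look like the following (a hypothetical sketch of the idea, not code from transformers; the names `EXPECTED_DTYPES` and `as_tensor_with_expected_dtype` are made up for illustration):

```python
import torch

# Hypothetical mapping from common tokenizer output names to expected dtypes.
EXPECTED_DTYPES = {
    "input_ids": torch.long,
    "attention_mask": torch.long,
    "token_type_ids": torch.long,
    "special_tokens_mask": torch.long,
}

def as_tensor_with_expected_dtype(key, value):
    """Create a tensor, forcing the known dtype for recognised keys."""
    tensor = torch.tensor(value)
    expected = EXPECTED_DTYPES.get(key)
    if expected is not None and tensor.numel() == 0:
        # Only override when dtype inference had nothing to work with
        # (empty input); non-empty outputs keep their inferred dtype.
        tensor = tensor.to(expected)
    return tensor

print(as_tensor_with_expected_dtype("input_ids", [[]]).dtype)  # torch.int64
```

Gating the override on emptiness keeps normal dtype inference untouched and only patches the ambiguous empty-input case.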
Thanks! Can we just modify the function at transformers/src/transformers/tokenization_utils_base.py, lines 719 to 777 (at commit c0f8d05)?
For example, just explicitly specify the dtype when creating tensors in the PyTorch branch:

```python
elif tensor_type == TensorType.PYTORCH:
    if not is_torch_available():
        raise ImportError("Unable to convert output to PyTorch tensors format, PyTorch is not installed.")
    import torch

    is_tensor = torch.is_tensor

    def as_tensor(value, dtype=None):
        if isinstance(value, list) and isinstance(value[0], np.ndarray):
            return torch.from_numpy(np.array(value)).to(torch.long)
        return torch.tensor(value, dtype=torch.long)
```
Hi @wangzhen0518, the problem is that some tokenizers might return outputs that shouldn't be `torch.long`, so we can't hardcode that dtype for every value.
System Info

transformers version: 4.48.1

Who can help?

@ArthurZucker and @itazap
Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The output tensor's data type is not torch.long when the input text is empty.
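A minimal reproduction (the same snippet discussed in the comments above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-1.5B-Instruct')
t = tokenizer('', return_tensors='pt')
print(t['input_ids'].dtype)  # torch.float32, expected torch.long
```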
Expected behavior

The returned `input_ids` tensor should have dtype `torch.long`, even when the input text is empty.