
ValueError: Token None for key pad_token should be a str or an AddedToken instance #5

Closed
FlatMapIO opened this issue Dec 2, 2023 · 6 comments


@FlatMapIO

Failed to load deepseek-ai/deepseek-llm-7b-base (a Llama 2 architecture model). Is the following code necessary? Shouldn't the HF tokenizer handle this automatically according to tokenizer_config.json?

```python
tokenizer.add_special_tokens({"pad_token" : tokenizer.unk_token});
tokenizer.pad_token = tokenizer.unk_token
config = model.config.update({"pad_token_id" : tokenizer.unk_token_id});
```
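For context, the failure mode can be reproduced without transformers at all. The sketch below uses a hypothetical `MiniTokenizer` stand-in (a simplification of the check in `transformers.tokenization_utils_base`, which also accepts `AddedToken` instances) to show why passing a null `unk_token` as the `pad_token` raises this exact error:

```python
# MiniTokenizer is a hypothetical stand-in, NOT the real transformers class.
# It mimics only the type check that produces the reported ValueError.
class MiniTokenizer:
    def __init__(self, unk_token=None):
        self.unk_token = unk_token
        self.pad_token = None

    def add_special_tokens(self, tokens):
        for key, value in tokens.items():
            # The real check also allows AddedToken; simplified here to str.
            if not isinstance(value, str):
                raise ValueError(
                    f"Token {value} for key {key} should be a str "
                    "or an AddedToken instance"
                )
            setattr(self, key, value)

# A llama-style tokenizer whose config defines no unk_token:
tok = MiniTokenizer(unk_token=None)
try:
    tok.add_special_tokens({"pad_token": tok.unk_token})
except ValueError as e:
    print(e)  # reproduces the reported error
```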

```
File "/workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/unsloth/models/llama.py", line 599, in FastLlamaModel.from_pretrained
    586 model = AutoModelForCausalLM.from_pretrained(
    587     model_name,
    588     device_map = device_map,
   (...)
    591     token = token,
    592 )
    593 tokenizer = AutoTokenizer.from_pretrained(
    594     model_name,
    595     model_max_length = max_seq_length,
    596     padding_side = "right",
    597     token = token,
...
File "/workspaces/unsloth-train-playground/.venv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 962
    962     if isinstance(value, (str)):
    963         # for legacy purpose we default to stripping. `False` depends on this
    964         value = AddedToken(value, rstrip=False, lstrip=False, normalized=False, special=True)

ValueError: Token None for key pad_token should be a str or an AddedToken instance
```
@danielhanchen (Contributor)

I shall fix this ASAP! Thanks for the report!

@danielhanchen (Contributor)

@FlatMapIO Possibly fixed this! It would be awesome if you could check the preliminary version out! Just reinstall / update :))

@sieu-n commented Dec 6, 2023

Is there a reason why the tokenizer is returned at all from from_pretrained? Just curious.

@Redix8 commented Apr 16, 2024

@danielhanchen

Is it really fixed? I think it's happening because the tokenizer's unk_token is null, and the code tries to add pad_token with that null value:

```python
if hasattr(tokenizer, "unk_token"):
```

So effectively it runs:

```python
unk_token = None
tokenizer.add_special_tokens({"pad_token": unk_token})
```
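A guarded version of this logic might look like the sketch below. This is an illustration of the fix Redix8 is describing, not the actual unsloth patch: it checks that each candidate token is not `None` before using it as the pad token, and falls back to a fresh `<|PAD|>` string (a hypothetical choice) when neither unk_token nor eos_token exists. The `Dummy` class is a stand-in for a real tokenizer:

```python
def set_pad_token(tokenizer):
    """Pick a pad token without assuming unk_token exists.

    Sketch only -- not the actual unsloth fix. hasattr() alone is not
    enough, because the attribute can exist but hold None.
    """
    if getattr(tokenizer, "pad_token", None) is not None:
        return tokenizer.pad_token
    for candidate in ("unk_token", "eos_token"):
        value = getattr(tokenizer, candidate, None)
        if value is not None:          # guard against null tokens
            tokenizer.pad_token = value
            return value
    tokenizer.pad_token = "<|PAD|>"    # last resort: a brand-new token
    return tokenizer.pad_token


# Dummy stand-in mimicking a tokenizer whose unk_token is null:
class Dummy:
    pad_token = None
    unk_token = None
    eos_token = "</s>"

print(set_pad_token(Dummy()))  # falls back to eos_token: </s>
```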

@deniz-birlikci

Working on a quick fix now. As Redix8 said, it is failing with DeepSeek Math when the unk_token is null, so I'm adding another conditional check at line 151 and making a pull request.

@deniz-birlikci

@Redix8 in the meantime, you can use the following approach to get around this issue until it is fixed:

  1. Load the tokenizer directly into Colab.
  2. Use the pad_token argument of transformers.AutoTokenizer.from_pretrained to specify a new pad token:

```python
new_tokenizer = transformers.AutoTokenizer.from_pretrained(
    "deepseek-ai/deepseek-math-7b-rl",
    pad_token="<|PAD|>",
    padding_side="right",
    use_fast=True,
    trust_remote_code=True,
)
```

  3. Save the new tokenizer to a local directory:

```python
new_tokenizer.save_pretrained("save_directory", push_to_hub=False)
```

  4. Load the transformer model and save it to the same directory:

```python
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-math-7b-rl")
model.save_pretrained("save_directory", push_to_hub=False)
```

  5. Pass this new local directory to unsloth's from_pretrained function.
