Description
System Info
- transformers version: 4.46.3
- Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.17
- Python version: 3.8.20
- Huggingface_hub version: 0.26.1
- Safetensors version: 0.4.5
- Accelerate version: 1.0.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: NVIDIA RTX A6000
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am currently rewriting the generate_progressively function for my custom model class. My goal is to enable the model to generate results progressively by concatenating the initial input_ids with each element of the compress_outputs sequence in turn. Specifically:
- In the first iteration, the model generates results by concatenating input_ids with the first element of compress_outputs.
- In the second iteration, it concatenates input_ids with the first two elements of compress_outputs to generate results.
- This process continues until the last element of the compress_outputs sequence is included.
To improve efficiency, I want to leverage caching, as the majority of the concatenated input in each iteration has already been used to compute past_key_values. Below is the code snippet for the function I implemented. In this context, self.model refers to mistral-7b-chat-v0.2.
@torch.no_grad()
def generate_progressively(
    self,
    input_ids,
    attention_mask,
    compress_outputs,
    **kwargs,
):
    results = []
    compress_output_count = compress_outputs.size(1)
    batch_size = input_ids.size(0)

    # Embed the original prompt once; the loop below appends one compressed
    # embedding per iteration.
    inputs_embs = self.base.model.embed_tokens(input_ids)

    # Prefill the cache with the original prompt.
    prompt_cache = DynamicCache()
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        use_cache=True,
        past_key_values=prompt_cache,
    )
    prompt_cache = outputs.past_key_values

    for compress_ind in range(compress_output_count):
        # Extend the cache with the next compressed embedding.
        current_compress_outputs = compress_outputs[:, compress_ind: compress_ind + 1, :].type_as(input_ids)
        outputs = self.model(
            input_ids=None,
            inputs_embeds=current_compress_outputs,
            use_cache=True,
            past_key_values=prompt_cache,
        )
        prompt_cache = outputs.past_key_values

        # Full (prompt + compressed-so-far) embeddings and the matching attention mask.
        inputs_embs = torch.cat([inputs_embs, current_compress_outputs], dim=1)
        attention_mask = torch.cat(
            [attention_mask, torch.ones(batch_size, 1, device=input_ids.device)], dim=1
        )

        # Generate from the cached prefix; this is the call that fails.
        generated_outputs = self.base.generate(
            inputs_embeds=inputs_embs,
            attention_mask=attention_mask,
            use_cache=True,
            past_key_values=prompt_cache,
            return_dict_in_generate=True,
            **kwargs,
        )
        results.append(generated_outputs.sequences)
    return results
When I execute this code, it throws an error. The error occurs at line 393 of transformers/generation/utils.py, specifically in the prepare_inputs_for_generation function.
The problematic line of code is:
if inputs_embeds is not None and cache_position[0] == 0:
The error message is: IndexError: index 0 is out of bounds for dimension 0 with size 0.
I traced the execution of the code; here is a detailed breakdown of the issue:
The error occurs in transformers/generation/utils.py. Initially, the program enters the self._sample function and then proceeds to the self._get_initial_cache_position function.
Within this function, the following lines:
if not is_torchdynamo_compiling():
    cache_position = cache_position[past_length:]
cause cache_position to become an empty tensor, which then triggers the IndexError in a subsequent step.
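To make the failure concrete, here is a minimal sketch of that arithmetic (assuming, as I understand _get_initial_cache_position, that cache_position starts as a 0..seq_len-1 range over the inputs_embeds length and past_length is the cache's current sequence length; the numbers are made up):
import torch

seq_len = 10       # length of the inputs_embeds passed to generate (prompt + compressed embeddings so far)
past_length = 10   # prompt_cache.get_seq_length(): the same positions are already in the cache

cache_position = torch.arange(seq_len)         # tensor([0, 1, ..., 9])
cache_position = cache_position[past_length:]  # tensor([]) -- nothing left to compute
cache_position[0]                              # IndexError: index 0 is out of bounds for dimension 0 with size 0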
Even if I manage to fix the issue with cache_position, another problem arises later in the self.prepare_inputs_for_generation function.
The relevant code is as follows:
if not self.config.is_encoder_decoder:
    if inputs_embeds is not None and cache_position[0] == 0:
        model_inputs[input_ids_key] = None
        model_inputs["inputs_embeds"] = inputs_embeds
    else:
        model_inputs[input_ids_key] = input_ids.clone(memory_format=torch.contiguous_format)
        model_inputs["inputs_embeds"] = None
In my case, I provide only inputs_embeds and past_key_values, and since cache_position[0] is not 0, the code takes the else branch and tries to build model_inputs[input_ids_key] from input_ids. However, since input_ids is None, this leads to further errors.
Under the current implementation of the generate function in transformers, is it possible to use only inputs_embeds and past_key_values for generation? How can I modify my implementation to achieve progressive generation with caching as intended? Are there specific guidelines for correctly managing cache_position and ensuring compatibility with inputs_embeds?
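For reference, the only documented cache-reuse pattern I could find passes input_ids (not inputs_embeds) to generate together with the prefilled cache. A rough sketch of that pattern is below (model_id, prompt_prefix, and suffixes are placeholders, not my actual setup); I don't see how to adapt it to precomputed embeddings such as compress_outputs:
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prefill a cache once with the shared prefix.
prefix_inputs = tokenizer(prompt_prefix, return_tensors="pt").to(model.device)
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(**prefix_inputs, past_key_values=prompt_cache).past_key_values

results = []
for suffix in suffixes:
    # generate still receives the full token sequence (prefix + suffix) as input_ids,
    # plus a deep copy of the prefix cache so the shared cache is not mutated in place.
    full_inputs = tokenizer(prompt_prefix + suffix, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **full_inputs,
        past_key_values=copy.deepcopy(prompt_cache),
        max_new_tokens=20,
    )
    results.append(outputs)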
Expected behavior
My primary objective is to progressively generate outputs by leveraging caching (past_key_values) to improve efficiency.