
Issue with Progressive Generation Using inputs_embeds and past_key_values #35707

Closed

Superbooming opened this issue Jan 15, 2025 · 17 comments

@Superbooming

Superbooming commented Jan 15, 2025

System Info

  • transformers version: 4.46.3
  • Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.17
  • Python version: 3.8.20
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA RTX A6000

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am currently rewriting the generate_progressively function for my custom model class. My goal is to enable the model to generate results progressively by concatenating the initial input_ids with each element of the compress_outputs sequence in turn. Specifically:

  1. In the first iteration, the model generates results by concatenating input_ids with the first element of compress_outputs.
  2. In the second iteration, it concatenates input_ids with the first and second elements of compress_outputs (the first two elements) to generate results.
  3. This process continues until the last element of the compress_outputs sequence is included.

To improve efficiency, I want to leverage caching, as the majority of the concatenated input in each iteration has already been used to compute past_key_values. Below is the code snippet for the function I implemented. In this context, self.model refers to mistral-7b-chat-v0.2.

# Module-level imports assumed by this method:
import torch
from transformers import DynamicCache

@torch.no_grad()
def generate_progressively(
        self,
        input_ids,
        attention_mask,
        compress_outputs,
        **kwargs,
):
    results = []
    compress_output_count = compress_outputs.size(1)
    batch_size = input_ids.size(0)

    # Embed the initial prompt and prefill the cache with a single forward pass.
    inputs_embs = self.base.model.embed_tokens(input_ids)
    prompt_cache = DynamicCache()
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        use_cache=True,
        past_key_values=prompt_cache,
    )
    prompt_cache = outputs.past_key_values

    for compress_ind in range(compress_output_count):
        # Feed the next compressed embedding through the model to extend the cache.
        current_compress_outputs = compress_outputs[:, compress_ind: compress_ind+1, :].type_as(input_ids)
        outputs = self.model(
            input_ids=None,
            inputs_embeds=current_compress_outputs,
            use_cache=True,
            past_key_values=prompt_cache,
        )
        prompt_cache = outputs.past_key_values

        inputs_embs = torch.cat([inputs_embs, current_compress_outputs], dim=1)
        attention_mask = torch.cat([attention_mask, torch.ones(batch_size, 1, device=input_ids.device)], dim=1)

        # Generate from the full embedded prefix while reusing the prefilled cache.
        generated_outputs = self.base.generate(
            inputs_embeds=inputs_embs,
            attention_mask=attention_mask,
            use_cache=True,
            past_key_values=prompt_cache,
            return_dict_in_generate=True,
            **kwargs,
        )
        results.append(generated_outputs.sequences)
    return results

When I execute this code, the program throws an error during execution. The error occurs at line 393 in transformers/generation/utils.py, specifically in the prepare_inputs_for_generation function.
The problematic line of code is:

if inputs_embeds is not None and cache_position[0] == 0:

The error message is: IndexError: index 0 is out of bounds for dimension 0 with size 0.

I traced the execution of the code, and here's a detailed breakdown of the issue:
The error occurs in transformers/generation/utils.py. Initially, the program enters the self._sample function and then proceeds to the self._get_initial_cache_position function.
Within this function, the following line:

if not is_torchdynamo_compiling():
    cache_position = cache_position[past_length:]

causes the cache_position slice to become empty, resulting in an IndexError in subsequent steps.
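
For illustration, here is a minimal standalone sketch (not the actual Transformers internals, just the slicing pattern described above, with made-up numbers) showing why the slice ends up empty once the cache already covers the full length of inputs_embeds:

import torch

# Hypothetical setup: the cache already holds the whole concatenated prefix,
# so past_length equals the number of positions computed for inputs_embeds.
past_length = 12                                  # tokens already stored in past_key_values
cache_position = torch.arange(12)                 # one position per inputs_embeds token
cache_position = cache_position[past_length:]     # -> tensor([], dtype=torch.int64)
print(cache_position.shape)                       # torch.Size([0])
# cache_position[0] now raises:
# IndexError: index 0 is out of bounds for dimension 0 with size 0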

Even if I manage to fix the issue with cache_position, another problem arises later in the self.prepare_inputs_for_generation function.
The relevant code is as follows:

if not self.config.is_encoder_decoder:
    if inputs_embeds is not None and cache_position[0] == 0:
        model_inputs[input_ids_key] = None
        model_inputs["inputs_embeds"] = inputs_embeds
    else:
        model_inputs[input_ids_key] = input_ids.clone(memory_format=torch.contiguous_format)
        model_inputs["inputs_embeds"] = None

In my case, I provide only inputs_embeds and past_key_values, and since cache_position[0] is not 0, the code attempts to set model_inputs[input_ids_key] using input_ids. However, since input_ids is None, this results in further issues.

Under the current implementation of the generate function in transformers, is it possible to use only inputs_embeds and past_key_values for generation? How can I modify my implementation to achieve progressive generation with caching as intended? Are there specific guidelines for correctly managing cache_position and ensuring compatibility with inputs_embeds?

Expected behavior

My primary objective is to progressively generate outputs by leveraging caching (past_key_values) to improve efficiency.

@zucchini-nlp
Member

Seems to be the same as #34678, and someone is working on it, as per the last comment.

@Superbooming
Author

Yes, it seems to be the same. I'll keep track of it. Thanks!

@haixuanTao

I get the same error: IndexError: index 0 is out of bounds for dimension 0 with size 0 with Qwen/Qwen2.5-VL-3B-Instruct, and #35890 currently does not solve this issue.

@zucchini-nlp
Member

@haixuanTao Yes, I believe that for Qwen2-VL and some other models the proposed change has to be applied in the overridden code as well.

@yaswanth19
Contributor

@haixuanTao Try it now - It should work!! 🤗

@haixuanTao

haixuanTao commented Feb 5, 2025

Hey thanks for the quick update!

Using your current branch, I get:

  File "/Users/xaviertao/Documents/work/dora/examples/vlm/.venv/bin/dora-qwen2-5-vl", line 8, in <module>
    sys.exit(main())
  File "/Users/xaviertao/Documents/work/dora/node-hub/dora-qwen2-5-vl/dora_qwen2_5_vl/main.py", line 233, in main
    response, history, past_key_values = generate(
  File "/Users/xaviertao/Documents/work/dora/node-hub/dora-qwen2-5-vl/dora_qwen2_5_vl/main.py", line 109, in generate
    outputs = model.generate(
  File "/Users/xaviertao/Documents/work/dora/examples/vlm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/xaviertao/Documents/work/dora/examples/vlm/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2228, in generate
    result = self._sample(
  File "/Users/xaviertao/Documents/work/dora/examples/vlm/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 3202, in _sample
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/Users/xaviertao/Documents/work/dora/examples/vlm/.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1882, in prepare_inputs_for_generation
    or (is_torchdynamo_compiling() or cache_position[-1] >= input_ids.shape[1])  # Exception 3
IndexError: index -1 is out of bounds for dimension 0 with size 0

Maybe I'm missing something in my dev env?

@qinggangwu

(quoting @haixuanTao's comment and traceback above)

I encountered the same problem. The root cause is that the chat_templates of Qwen2-VL and Qwen2-VL-Instruct differ, which causes the input_ids to be empty before inference.
So the solution is to replace the chat_template.json of Qwen2-VL with the chat_template.json of the Instruct model. It solved my problem.

@haixuanTao

haixuanTao commented Feb 6, 2025

Agreed that this seems unrelated to the current issue, but I could not find how to change the chat_template.

I tried:

from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, StaticCache

DEFAULT_PATH = "Qwen/Qwen2.5-VL-3B-Instruct"


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    DEFAULT_PATH, torch_dtype="auto", device_map="auto"
)

frames = ["https://teachlikeachampion.org/wp-content/uploads/good2.jpg"]
processor = AutoProcessor.from_pretrained(DEFAULT_PATH)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            }
            for image in frames
        ]
        + [
            {"type": "text", "text": "abc"},
        ],
    },
]
tmp_history = messages
# Preparation for inference
chat_template = "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
text = processor.apply_chat_template(
    tmp_history,
    tokenize=False,
    add_generation_prompt=True,
    chat_template=chat_template,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

past_key_values = None

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=past_key_values,
    return_dict_in_generate=True,
)

past_key_values = outputs.past_key_values

## Error should happen below
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=past_key_values,
    return_dict_in_generate=True,
)

## Error
#   File "/Users/xaviertao/Documents/work/dora/examples/vlm/.venv/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1882, in prepare_inputs_for_generation
#    or (is_torchdynamo_compiling() or cache_position[-1] >= input_ids.shape[1])  # Exception 3
# IndexError: index -1 is out of bounds for dimension 0 with size 0

Without much luck

@ArthurZucker
Collaborator

I think you need to pass cache_position to indicate where you are in the generation:

## Error should happen below
outputs = model.generate(
    **inputs,
    cache_position=torch.tensor([past_key_values.get_seq_length()], dtype=torch.long, device=model.device),
    max_new_tokens=128,
    past_key_values=past_key_values,
    return_dict_in_generate=True,
)

I admit that it's really not optimal 😿

@haixuanTao

haixuanTao commented Feb 10, 2025

I tried:

## Error should happen below
outputs = model.generate(
    **inputs,
    cache_position=torch.tensor([past_key_values.get_seq_length()], dtype=torch.long, device=model.device),
    max_new_tokens=128,
    past_key_values=past_key_values,
    return_dict_in_generate=True,
)

and it didn't work.

I tried replacing the cache_position within the problematic function and I got the following result:

So it seems that cache_position within prepare_inputs_for_generation is sometimes an empty tensor, which is different from torch.tensor([past_key_values.get_seq_length()], dtype=torch.long, device=model.device):

  • At the very beginning, it looks something like:
cache_position= tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19], device='mps:0')
cache_position_from_arthur= tensor([0], device='mps:0')
  • at the very end:
cache_position tensor([], device='mps:0', dtype=torch.int64)
cache_position_from_arthur tensor([46], device='mps:0')
cache_position tensor([], device='mps:0', dtype=torch.int64)
cache_position_from_arthur tensor([47], device='mps:0')
cache_position tensor([], device='mps:0', dtype=torch.int64) 
cache_position_from_arthur tensor([48], device='mps:0')

The empty tensor is what triggers the initial error.

def prepare_inputs_for_generation(
    self,
    input_ids,
    past_key_values=None,
    attention_mask=None,
    inputs_embeds=None,
    cache_position=None,
    position_ids=None,
    use_cache=True,
    pixel_values=None,
    pixel_values_videos=None,
    image_grid_thw=None,
    video_grid_thw=None,
    second_per_grid_ts=None,
    **kwargs,
):
    # Overwritten -- in specific circumstances we don't want to forward image inputs to the model

    # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
    # Exception 1: when passing input_embeds, input_ids may be missing entries
    # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
    # Exception 3: with synced GPUs cache_position may go out of bounds, but we only want dummy token in that case.
    #              (we can't check exception 3 while compiling)
    # Exception 4: If input_embeds are passed then slice it through `cache_position`, to keep only the unprocessed tokens and
    # generate the first token for each sequence. Later use the generated Input ids for continuation.
    cache_position_from_arthur = torch.tensor([past_key_values.get_seq_length()], dtype=torch.long, device=self.device)
    cache_position = cache_position if cache_position is not None and len(cache_position) > 0 else cache_position_from_arthur

Sadly, using your suggestion, Arthur, I couldn't figure out how to fix this issue, and with the code above I couldn't find a combination that worked for me.

@zucchini-nlp
Member

zucchini-nlp commented Feb 10, 2025

@haixuanTao the issue was fixed in the linked PR. The code you provided uses the same inputs for continuing generation, whereas the new input_ids should be the concatenation of the initial prompt and the generated text.

Adding this at the end will work:

outputs = model.generate(
    outputs.sequences, # This line uses the whole input sequence
    max_new_tokens=128,
    past_key_values=past_key_values,
    return_dict_in_generate=True,
)

@ArthurZucker
Collaborator

We might need to update our API to be a bit more friendly, no? 🤗

@zucchini-nlp
Member

@ArthurZucker yeah, we have it documented tbh. This question has been raised several times and stems from the fact that model.forward() and model.generate() expect different input shapes when caching (due to legacy reasons). With @gante we decided not to open that can of worms yet, but passing only the unprocessed tokens makes the most sense.
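
To make the convention concrete, here is a minimal sketch for a decoder-only text model (the checkpoint name is only an illustration): model.generate() receives the full sequence seen so far together with the cache and slices off the cached prefix internally, while model.forward() would receive only the new tokens.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tok("The quick brown fox", return_tensors="pt").to(model.device)

# First call: returns both the generated sequences and the KV cache.
out = model.generate(**inputs, max_new_tokens=16, return_dict_in_generate=True)

# Continuation: pass the FULL sequence so far (prompt + generated text) together
# with the cache; do NOT pass only the new tokens as you would for model.forward().
out2 = model.generate(
    out.sequences,
    past_key_values=out.past_key_values,
    max_new_tokens=16,
    return_dict_in_generate=True,
)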

@haixuanTao

haixuanTao commented Feb 19, 2025

So, I got the inference to work using the code you provided, but the inference time gets longer each time I redo the same prompt with the same KV cache, when I would expect it to stay the same.

I haven't really dug into every nook and cranny of why, but here's my code:

import copy
import time

import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, StaticCache

DEFAULT_PATH = "Qwen/Qwen2.5-VL-3B-Instruct"


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    DEFAULT_PATH, torch_dtype="auto", device_map="auto"
)

frames = ["https://teachlikeachampion.org/wp-content/uploads/good2.jpg"]
processor = AutoProcessor.from_pretrained(DEFAULT_PATH)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            }
            for image in frames
        ]
        + [
            {"type": "text", "text": "abc"},
        ],
    },
]
tmp_history = messages
# Preparation for inference
text = processor.apply_chat_template(
    tmp_history,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

past_key_values = None

now = time.time()
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=past_key_values,
    return_dict_in_generate=True,
)
init_outputs = outputs
sequence = processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
print(sequence)
print(time.time() - now)

now = time.time()

past_key_values = outputs.past_key_values


def redo_inference():
    frames = [
        "https://www.hellyhansen.com/media/catalog/product/5/3/53851_787-1-onbody13.jpg?quality=90&bg-color=255,255,255&fit=bounds&height=&width="
    ]

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                }
                for image in frames
            ]
            + [
                {"type": "text", "text": "Is there a hoodie?"},
            ],
        },
    ]
    tmp_history = messages
    # Preparation for inference
    text = processor.apply_chat_template(
        tmp_history,
        tokenize=False,
        add_generation_prompt=True,
    )

    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    kv = copy.deepcopy(past_key_values)
    tmp = copy.deepcopy(outputs.sequences)
    moutputs = model.generate(
        torch.concatenate(
            [tmp, inputs["input_ids"]], dim=1
        ),  # This line uses the whole input sequence
        max_new_tokens=128,
        past_key_values=kv,
        return_dict_in_generate=True,
    )

    sequence = processor.batch_decode(moutputs.sequences, skip_special_tokens=True)[0]
    print(sequence)
    print(time.time() - now)


redo_inference()
redo_inference()
redo_inference()

Outputs:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.85s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
system
You are a helpful assistant.
user
abc
assistant
The image you provided is an emoji with a thumbs-up gesture, which typically represents approval or agreement. The emoji has a cheerful expression with wide eyes and a big smile. This type of emoji is often used to convey positive feedback or to indicate that something is good or acceptable.
6.706953048706055 # <---------------------- Inference time without KV cache
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
418 3051 36
system
You are a helpful assistant.
user
abc
assistant
The image you provided is an emoji with a thumbs-up gesture, which typically represents approval or agreement. The emoji has a cheerful expression with wide eyes and a big smile. This type of emoji is often used to convey positive feedback or to indicate that something is good or acceptable.system
You are a helpful assistant.
user
Is there a hoodie?
assistant
No, there is no hoodie in the image. The image shows a cartoon character giving a thumbs-up gesture.
8.745131969451904 # <---------------------- Inference time 1 time KV Cache
418 3051 36
system
You are a helpful assistant.
user
abc
assistant
The image you provided is an emoji with a thumbs-up gesture, which typically represents approval or agreement. The emoji has a cheerful expression with wide eyes and a big smile. This type of emoji is often used to convey positive feedback or to indicate that something is good or acceptable.system
You are a helpful assistant.
user
Is there a hoodie?
assistant
No, there is no hoodie in the image. The image shows a cartoon character giving a thumbs-up gesture.
16.76 # <---------------------- Inference time 2 time KV Cache
418 3051 36
system
You are a helpful assistant.
user
abc
assistant
The image you provided is an emoji with a thumbs-up gesture, which typically represents approval or agreement. The emoji has a cheerful expression with wide eyes and a big smile. This type of emoji is often used to convey positive feedback or to indicate that something is good or acceptable.system
You are a helpful assistant.
user
Is there a hoodie?
assistant
No, there is no hoodie in the image. The image shows a cartoon character giving a thumbs-up gesture.
24.8 # <---------------------- Inference time 3 time KV Cache

@zucchini-nlp
Member

@haixuanTao The script you provided re-uses the initial prompt 3 times with the same input, and each generation takes about 8 seconds. I think that is expected, since we're not changing anything and each new generation has inputs["input_ids"] new tokens to process.

To see whether there is a speed-up, try generating with no cache and compare against generation with the cached initial prompt. Make sure that in both cases the total input text processed is identical in length.
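
For reference, a rough sketch of that comparison for a plain text-only causal LM (for the VLM above, the un-cached call would also need the image inputs); the checkpoint and prompts are placeholders:

import copy
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

first = tok("Describe the weather today.", return_tensors="pt").to(model.device)
out = model.generate(**first, max_new_tokens=64, return_dict_in_generate=True)

follow_up = tok(" Is it going to rain?", add_special_tokens=False, return_tensors="pt").to(model.device)
full_ids = torch.cat([out.sequences, follow_up.input_ids], dim=1)

# Same total input length in both cases; only the second call reuses the cache.
t0 = time.time()
model.generate(full_ids, max_new_tokens=64)        # no cache: the whole sequence is prefilled
print("no cache:", time.time() - t0)

t0 = time.time()
model.generate(full_ids, max_new_tokens=64,
               past_key_values=copy.deepcopy(out.past_key_values))  # cached prefix: only the follow-up tokens are prefilled
print("cached:  ", time.time() - t0)

The deepcopy mirrors the script above: generate() mutates the cache in place, so a copy keeps the benchmark repeatable.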

@haixuanTao

Ok indeed! Sorry dumb mistake.

I can confirm that this now works on my Mac M3, shaving off about one second compared to running without the KV cache!

Thanks!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
