
Issue with Progressive Generation Using inputs_embeds and past_key_values #35707

Closed
@Superbooming

Description

System Info

  • transformers version: 4.46.3
  • Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.17
  • Python version: 3.8.20
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA RTX A6000

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am currently rewriting the generate_progressively function for my custom model class. My goal is to enable the model to generate results progressively by concatenating the initial input_ids with each element of the compress_outputs sequence in turn. Specifically:

  1. In the first iteration, the model generates results by concatenating input_ids with the first element of compress_outputs.
  2. In the second iteration, it concatenates input_ids with the first two elements of compress_outputs to generate results.
  3. This process continues until the last element of the compress_outputs sequence is included.

To improve efficiency, I want to leverage caching, as the majority of the concatenated input in each iteration has already been used to compute past_key_values. Below is the code snippet for the function I implemented. In this context, self.model refers to mistral-7b-chat-v0.2.

    # assumes: import torch; from transformers import DynamicCache
    @torch.no_grad()
    def generate_progressively(
            self,
            input_ids,
            attention_mask,
            compress_outputs,
            **kwargs,
    ):
        results = []
        compress_output_count = compress_outputs.size(1)
        batch_size = input_ids.size(0)

        # Prefill the cache over the initial prompt.
        inputs_embs = self.base.model.embed_tokens(input_ids)
        prompt_cache = DynamicCache()
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            use_cache=True,
            past_key_values=prompt_cache,
        )
        prompt_cache = outputs.past_key_values

        for compress_ind in range(compress_output_count):
            # Extend the cache by one position with the next compressed embedding.
            current_compress_outputs = compress_outputs[:, compress_ind: compress_ind + 1, :].type_as(input_ids)
            outputs = self.model(
                input_ids=None,
                inputs_embeds=current_compress_outputs,
                use_cache=True,
                past_key_values=prompt_cache,
            )
            prompt_cache = outputs.past_key_values

            # Accumulate the full embedding sequence and attention mask seen so far.
            inputs_embs = torch.cat([inputs_embs, current_compress_outputs], dim=1)
            attention_mask = torch.cat([attention_mask, torch.ones(batch_size, 1, device=input_ids.device)], dim=1)

            # Generate from the accumulated embeddings, reusing the prefilled cache.
            generated_outputs = self.base.generate(
                inputs_embeds=inputs_embs,
                attention_mask=attention_mask,
                use_cache=True,
                past_key_values=prompt_cache,
                return_dict_in_generate=True,
                **kwargs,
            )
            results.append(generated_outputs.sequences)
        return results

When I run this code, it throws an error at line 393 of transformers/generation/utils.py, inside the prepare_inputs_for_generation function.
The problematic line of code is:

if inputs_embeds is not None and cache_position[0] == 0:

The error message is: IndexError: index 0 is out of bounds for dimension 0 with size 0.

I traced the execution of the code; here is a detailed breakdown of the issue:
The error originates in transformers/generation/utils.py. The program first enters the self._sample function and then proceeds to self._get_initial_cache_position.
Within that function, the following code:

if not is_torchdynamo_compiling():
    cache_position = cache_position[past_length:]

slices cache_position down to an empty tensor, because the past_length of the pre-filled cache already covers every position of the inputs_embeds I pass in; the empty tensor then triggers the IndexError in the subsequent prepare_inputs_for_generation call.
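
To make the failure concrete, here is a minimal numeric sketch of that slicing (the length 20 is made up and stands in for the prompt plus the compressed embeddings already covered by the cache):

import torch

seq_len = 20                                    # length of the inputs_embeds passed to generate()
past_length = 20                                # the DynamicCache already covers all of these positions
cache_position = torch.arange(seq_len)          # tensor([0, 1, ..., 19])
cache_position = cache_position[past_length:]   # tensor([], dtype=torch.int64) -- empty
# cache_position[0] then raises: IndexError: index 0 is out of bounds for dimension 0 with size 0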

Even if I manage to fix the issue with cache_position, another problem arises later in the self.prepare_inputs_for_generation function.
The relevant code is as follows:

if not self.config.is_encoder_decoder:
    if inputs_embeds is not None and cache_position[0] == 0:
        model_inputs[input_ids_key] = None
        model_inputs["inputs_embeds"] = inputs_embeds
    else:
        model_inputs[input_ids_key] = input_ids.clone(memory_format=torch.contiguous_format)
        model_inputs["inputs_embeds"] = None

In my case, I provide only inputs_embeds and past_key_values, and since cache_position[0] is not 0, the code falls into the else branch and tries to set model_inputs[input_ids_key] from input_ids. However, input_ids is None here, so the input_ids.clone(...) call cannot succeed.

Under the current implementation of the generate function in transformers, is it possible to use only inputs_embeds and past_key_values for generation? How can I modify my implementation to achieve progressive generation with caching as intended? Are there specific guidelines for correctly managing cache_position and ensuring compatibility with inputs_embeds?
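
For reference, a minimal fallback sketch (assuming I drop past_key_values entirely and let generate re-encode the concatenated embeddings on every iteration) would be to replace the generate call in the loop with the one below; it sidesteps the cache_position checks, but it gives up exactly the caching I am trying to achieve:

            # Fallback sketch: no past_key_values, so a fresh cache starts at position 0 and
            # prepare_inputs_for_generation takes the inputs_embeds branch.
            generated_outputs = self.base.generate(
                inputs_embeds=inputs_embs,
                attention_mask=attention_mask,
                use_cache=True,
                return_dict_in_generate=True,
                **kwargs,
            )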

Expected behavior

My primary objective is to progressively generate outputs by leveraging caching (past_key_values) to improve efficiency.
