
Inference #619

Merged: 7 commits into main on Mar 31, 2024

Conversation

@kartikayk (Contributor) commented Mar 29, 2024

Context

Our current story for inference is sub-optimal; this PR takes a first pass at fixing that. The inference code is heavily inspired by gpt-fast, though it doesn't yet include the compile-related functionality. I'll add that in a follow-up PR. This PR focuses on getting inference to actually work end to end.

Changelog

  • Update the way we set up the KV cache, including removing curr_pos, which was confusing and never made much sense to me. I've replaced it with input_pos (naming consistent with gpt-fast), which does exactly what you'd expect: it's a tensor holding the current position(s). When we first start inference, it contains the positions of all tokens in the prompt, since the K and V tensors for those need to be computed and cached. See the sketch after this list.
  • Update all affected components and tests.
  • Remove the current GenerationUtils class and the logit_transforms file, replacing them with stand-alone utilities for generation.
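
A minimal sketch of how input_pos drives prefill vs. incremental decoding. This assumes a decoder whose forward accepts input_pos and whose attention layers write K/V into the cache at those positions (as in this PR); the greedy argmax is just for illustration and is not the recipe's sampling code.

import torch

def greedy_generate(model, prompt: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    bsz, prompt_len = prompt.shape

    # Prefill: input_pos covers every prompt position, so all of their K/V
    # entries are written into the cache in a single forward pass.
    input_pos = torch.arange(0, prompt_len)
    logits = model(prompt, input_pos=input_pos)            # [bsz, prompt_len, vocab]
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)

    # Incremental decode: feed one token at a time, advancing input_pos by one
    # so each step only computes (and caches) K/V for the new position.
    generated = [next_token]
    input_pos = torch.tensor([prompt_len])
    for _ in range(max_new_tokens - 1):
        logits = model(next_token, input_pos=input_pos)    # [bsz, 1, vocab]
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        input_pos += 1
    return torch.cat(generated, dim=-1)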

Test plan

  • All unit tests and recipe tests pass
  • Output generation is sensible for a 13B model

[screenshot: sample generated output from the 13B model]

Comparison with gpt-fast: Without compile, our generation speed is on par with gpt-fast. Adding compile support is beyond the scope of this PR

gpt-fast:

[screenshot: gpt-fast generation speed]

torchtune:

[screenshot: torchtune generation speed]

pytorch-bot bot commented Mar 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/619

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 24e9ff0 with merge base 08f8235:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (authors need to sign the CLA before a PR can be reviewed) on Mar 29, 2024
# input has shape [b, s, d]
bsz, seq_len, _ = x.shape

# self.wqkv.weight.data = torch.cat([self.q_proj.weight, self.k_proj.weight, self.v_proj.weight ])
Contributor:

Old code?

@@ -166,6 +170,8 @@ def forward(
k = self.k_proj(x)
v = self.v_proj(x)

# pdb.set_trace()
Contributor:

🙃

k_val (Tensor): New k value.
v_val (Tensor): New v value.
def update(self, input_pos, k_val, v_val) -> Tuple[Tensor, Tensor]:
# input_pos: [S], k_val: [B, H, S, D]
Contributor:

Love this comment

Contributor:

WOOOOOOOOOOO

model.setup_caches(max_batch_size=1, dtype=self._dtype)
return model

def _multinomial_sample_one_no_sync(self, probs_sort):
Contributor:

Can you leave a comment here explaining this?

model, cur_token, input_pos, temperature, top_k
)
input_pos += 1
new_tokens.append(next_token.clone())
Contributor:

Do we care about performance here? How expensive is the double clone operation?

Contributor (author):

The memory impact of this is quite minimal; these are on the order of ~300 ints.

@joecummings (Contributor):

Should we also add a test ensuring that this does in fact speed up inference? Something like running the same inference twice, once with the cache and once without, and checking the time difference?
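
A rough sketch of what such a test could look like. Everything here is illustrative: make_model, generate_fn, and prompt are assumed fixtures, generate_fn stands in for whichever generation entry point ends up public, and the assertion is deliberately loose since wall-clock timings are noisy.

import time
import torch

def time_generation(make_model, generate_fn, prompt, max_new_tokens, use_cache):
    # Build a fresh model, optionally enable the KV cache, and time one
    # generation pass end to end.
    model = make_model().eval()
    if use_cache:
        model.setup_caches(max_batch_size=1, dtype=torch.bfloat16)
    start = time.perf_counter()
    with torch.no_grad():
        generate_fn(model, prompt, max_new_tokens=max_new_tokens)
    return time.perf_counter() - start

def test_kv_cache_speeds_up_inference(make_model, generate_fn, prompt):
    no_cache = time_generation(make_model, generate_fn, prompt, 64, use_cache=False)
    cached = time_generation(make_model, generate_fn, prompt, 64, use_cache=True)
    # Assert a strict improvement rather than a specific ratio.
    assert cached < no_cache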

# Sampling via the exponential trick: argmax(probs / Exp(1)) draws an index
# with probability proportional to probs, without the device sync that
# torch.multinomial would trigger.
q = torch.empty_like(probs_sort).exponential_(1)
return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)

def _logits_to_probs(self, logits: torch.Tensor, temperature: float, top_k: int):
Contributor:

We're pretty inconsistent with where our sampling code lives. I'd be in favor of putting it all in one file.

Contributor (author):

I don't want to generalize this code prematurely. For now I expect it to stay limited to this recipe, but we can generalize it in a follow-up if that makes sense.

Contributor:

A couple of questions here:

  1. Does this obviate the need for logits_transforms.py? It seems we are now handling all of that in the recipe. (As a follow-up, if we are going to keep it, can we at least drop the LogitsTransform ABC? It's literally just a Callable[FloatTensor, FloatTensor].)
  2. I notice we are missing top-p sampling, which we otherwise have support for. Any particular reason for omitting it?

Contributor (author):

I removed the logits_transform.py file completely. As for top-p, I didn't really find a good reference implementation; maybe we can add it back if it's needed?
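
For reference, a standard top-p (nucleus) filter is just a cumulative-probability cutoff applied to the sorted logits before sampling. A minimal sketch (not part of this PR, and not torchtune API):

import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    # Keep the smallest set of tokens whose cumulative probability reaches
    # top_p; everything else is masked to -inf before sampling.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = probs.cumsum(dim=-1)
    # Drop a token only if the tokens ranked above it already cover top_p.
    mask = (cum_probs - probs) > top_p
    sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
    # Scatter the filtered values back to the original vocab order.
    filtered = torch.full_like(logits, float("-inf"))
    return filtered.scatter(-1, sorted_idx, sorted_logits)

The filtered logits could then go through the existing temperature / softmax / sample path unchanged.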

dtype: bf16
seed: 1234

temperature: 0.8
Contributor:

Can you leave a comment on this PR explaining how you chose these hyperparams?

path: /tmp/llama2/tokenizer.model

# Generation arguments
prompt: "Hello, my name is"
Contributor:

Maybe it would make sense to have this be a common default prompt. (Coming back to update this comment with examples)


# Model arguments
model:
_component_: torchtune.models.llama2.llama2_13b
Member:

Should we call it generate_13b?

Contributor (author):

Not really; the model itself doesn't change for inference. The inference script is responsible for setting up the caches and then calling eval() on the model to disable all of the stochastic operations.
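
Roughly, the setup pattern in the recipe looks like this (a sketch; checkpoint loading is omitted, and the dtype just mirrors the config above):

import torch
from torchtune.models.llama2 import llama2_13b

model = llama2_13b()          # same builder used for training
# ... load checkpoint weights here ...
model.setup_caches(max_batch_size=1, dtype=torch.bfloat16)  # allocate KV caches
model.eval()                  # disable dropout and other train-time behavior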

pytorch_model-00002-of-00003.bin,
pytorch_model-00003-of-00003.bin
]
recipe_checkpoint: null
Member:

Is this pretty much always going to be null? We don't really need to pass in recipe state into generate, IIUC?

Contributor (author):

Oh yeah, good point. I should remove this.


# [b, n_h, s, h_d]
q = q.transpose(1, 2)
k = k.transpose(1, 2)
v = v.transpose(1, 2)

# Update key-value cache
if self.kv_cache is not None:
Member:

Just for my understanding, any benefit in updating kv-cache post transpose? Does it enable better access patterns or the like?

Contributor (author):

It's a good question - let me think a bit about it

 if incremental_decode:
-    outputs = self.decoder_lm(input_ids, curr_pos=prev_pos)
+    outputs = self.decoder_lm(input_ids, input_pos=input_pos)
 else:
     outputs = self.decoder_lm(input_ids)
 if logits_accessor:
Contributor:

I know it's not part of these changes, but I find the name "accessor" a bit confusing tbh. Is the idea here to do some sort of slicing along the vocab dim or something? What is the difference between this and logits_transforms? Is it just that one is applied before softmax and the other is applied after?

Contributor:

OK coming back to this now I'm a bit confused (prob should have read the full recipe code first). Why isn't this file deleted entirely? Seems you've covered all the generation logic in the recipe already. Or am I missing something?

Contributor (author):

No particular reason; I was leaving this as a follow-up. Let me think about it; maybe I should fold that change into this PR.

temperature: float,
top_k: int,
) -> torch.Tensor:
# input_pos: [B, S]
Contributor:

One thing that's confusing to me.. do we support batch generation? The way we pass the prompt seems to indicate that we don't, but then comments like this would indicate that we do.

Contributor (author):

There's nothing fundamentally stopping multi-sample generation; the inference recipe just doesn't currently handle it.

-    def forward(
-        self, tokens: Tensor, mask: Optional[Tensor] = None, curr_pos: int = 0
-    ) -> Tensor:
+    def setup_caches(self, max_batch_size: int, dtype: torch.dtype) -> None:
Contributor:

I'm curious about the choice to define this as a method on the TransformerDecoder class (especially with the addition of mandatory max_seq_len, num_heads, head_dim which are only needed for inference). Why not just define a utility method, e.g.

def setup_caches(
    model: TransformerDecoder,
    max_batch_size: int,
    max_seq_len: int,
    num_heads: int,
    head_dim: int,
):
    # same body as setup_caches but with self -> model

Then optionally call this on the builder gated behind a flag. The slight drawback is that we do add one extra param to builders so it's not a 1:1 swap on the config side as it is now (maybe there's a way to do that though?). But honestly I think that's OK? Like if we are changing the model arch a bit for inference (which we are) there's no harm in being explicit about that.

I think the benefit is that anyone using TransformerDecoder directly (which should be a decent percentage of the people adding new models) does not have to worry about these extra params that are really only relevant for inference.

Contributor (author):

This is a really interesting point.

I think you touch on this a bit. The reason I think having a method on the class makes sense is that the behavior of forward changes depending on whether you're calling it during training or during inference, and I'd rather the model handle this based on its internal state (e.g. causal_mask). If this were a utility, you'd also need to pass the causal_mask around explicitly on every call, when it can easily be handled by the class itself. I also think having the decoder's forward take just the input and the corresponding positions is nicer than explicitly passing the mask (though that might be a bit of a personal preference).

I'm actually not too concerned about num_heads, head_dim, and max_seq_len, because the model already has this information in its components; it's related to the model, isn't it? In fact, I think this is strictly better than passing in something like max_batch_size, which was only ever used during inference :) Let me know if that makes sense.
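
To illustrate the pattern being described (internal state set up once by setup_caches, then consumed by forward so callers never pass a mask), here is a minimal toy sketch; it is not the actual torchtune implementation:

import torch
from torch import nn, Tensor
from typing import Optional

class TinyDecoder(nn.Module):
    def __init__(self, max_seq_len: int = 16):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.causal_mask: Optional[Tensor] = None  # populated by setup_caches

    def setup_caches(self, max_batch_size: int, dtype: torch.dtype) -> None:
        # The real model would also allocate per-layer KV caches here, using
        # the num_heads / head_dim / max_seq_len it already knows about.
        self.causal_mask = torch.tril(
            torch.ones(self.max_seq_len, self.max_seq_len, dtype=torch.bool)
        )

    def forward(self, tokens: Tensor, input_pos: Optional[Tensor] = None) -> Tensor:
        if self.causal_mask is not None and input_pos is not None:
            # Inference: slice the precomputed mask for the positions being
            # processed; the caller only supplies tokens and input_pos.
            mask = self.causal_mask[input_pos]
        else:
            # Training: no cache, rely on the usual causal attention path.
            mask = None
        # ... attention layers would consume tokens, mask, and input_pos ...
        return tokens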

return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)

def _logits_to_probs(self, logits: torch.Tensor, temperature: float, top_k: int):
logits = logits / max(temperature, 1e-5)
Contributor:

nit: maybe just add a value check on temperature somewhere?
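
For instance, a simple guard along these lines (illustrative; exact placement TBD):

if temperature <= 0.0:
    raise ValueError(f"temperature must be positive, got {temperature}")

The max(temperature, 1e-5) clamp above already prevents division by zero at sampling time, so a check like this would mainly surface negative or accidentally-zero configs earlier.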

@@ -103,7 +94,7 @@ def llama2(
             v_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),
             output_proj=nn.Linear(embed_dim, embed_dim, bias=False),
             pos_embeddings=rope,
-            kv_cache=kv_cache,
+            kv_cache=None,
Contributor:

Is this a useless arg now? Seems if we are building KV caches at the decoder level it will always be none?

Contributor (author):

Hmm, yeah, it defaults to None, so I don't think it needs to be passed in this call.

Comment on lines +51 to +52
k_out[:, :, input_pos] = k_val
v_out[:, :, input_pos] = v_val
Contributor:

If I understand correctly, is it now the case that max_batch_size really just means batch size? If so, maybe we should update its name

Contributor (author):

I don't know why it was ever called max_batch_size, but this seems to be the convention. Let me do some more research on this.

Comment on lines +20 to +22
num_heads (int): number of heads. We take num_heads instead of num_kv_heads because
the cache is created after we've expanded the key and value tensors to have the
same shape as the query tensor. See attention.py for more details
Contributor:

So this makes sense to me, but given that you didn't really change what we were doing on the attention side of things, how was this working properly before?

Contributor (author):

I think this is a case where we haven't actually run inference with num_kv_heads != num_heads. If we had, this would have broken.
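
To make the docstring concrete, here is a small shape-only sketch of the expansion it refers to (illustrative, not the exact attention.py code): with grouped-query attention, each KV head is repeated so k and v match the query tensor's head count before the cache update.

import torch

bsz, seq_len = 1, 8
num_heads, num_kv_heads, head_dim = 32, 8, 128

q = torch.randn(bsz, num_heads, seq_len, head_dim)
k = torch.randn(bsz, num_kv_heads, seq_len, head_dim)
v = torch.randn(bsz, num_kv_heads, seq_len, head_dim)

# Repeat each KV head num_heads // num_kv_heads times so k/v line up with q.
q_per_kv = num_heads // num_kv_heads
k = k.repeat_interleave(q_per_kv, dim=1)   # -> [bsz, num_heads, seq_len, head_dim]
v = v.repeat_interleave(q_per_kv, dim=1)

Only these expanded tensors get written into the [B, H, S, D] cache, which is why the cache is sized with num_heads rather than num_kv_heads.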

t0 = time.perf_counter()
generated_tokens, _ = self._decode_n_tokens(
self._model,
next_token.view(1, -1),
Contributor:

What is the view actually doing here? Isn't this just a single token?

Contributor:

Follow-up, and just a general comment on a bunch of these methods: it'd be nice to add docstrings with examples for each one (e.g. given a prompt of length n, prefill fills the KV cache's first n entries and returns token n+1; something similar for _decode_one_token and/or _decode_n_tokens).

Contributor (author):

The view converts this into the [bsz, seq_len] form that the transformer expects; for a single token that's a [1, 1] tensor.
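
Concretely (shapes only; the token id is made up):

import torch

next_token = torch.tensor([42])     # a single sampled token id, shape [1]
batched = next_token.view(1, -1)    # shape [1, 1], i.e. [bsz=1, seq_len=1]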

Comment on lines 114 to 116
with torch.backends.cuda.sdp_kernel(
enable_flash=False, enable_mem_efficient=False, enable_math=True
):
Contributor:

I assume this is for performance reasons? Might be good to add a bit of detail on why we're doing this though (is it because the math implementation is faster when seq len and batch size are small?)

top_k: int,
):
new_tokens, new_probs = [], []
for i in range(num_new_tokens):
Contributor:

Do we always generate a fixed # of tokens? What if e.g. we get EOS before then?

Contributor (author):

Currently we don't respect the EOS token. Let me add that functionality
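
A minimal sketch of what respecting EOS inside the decode loop could look like (decode_one_token, eos_id, and max_new_tokens are illustrative stand-ins, not the exact recipe API, and this assumes batch size 1):

new_tokens = []
for _ in range(max_new_tokens):
    next_token, _ = decode_one_token(model, cur_token, input_pos, temperature, top_k)
    input_pos += 1
    new_tokens.append(next_token.clone())
    if next_token.item() == eos_id:
        # Stop as soon as the model emits the end-of-sequence token.
        break
    cur_token = next_token.view(1, -1)

Prefill would stay unchanged; only the decode loop needs the check.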

logger = utils.get_logger("DEBUG")


class InferenceRecipe:
Contributor:

Can probably add documentation here around what's supported and what's not (e.g. explicitly call out we don't have speculative decoding, we support temperature and top-k sampling, etc.)

@@ -22,6 +22,7 @@
     validate_no_params_on_meta_device,
     wrap_fsdp,
 )
+from ._generation import generate  # noqa
Contributor:

nit: why do you need the #noqa?

Contributor:

Please, someone add an analogous ignore to our .flake8 so we can delete all these godforsaken noqas in our __init__ files.

Contributor (author):

I know you've been asking for this @ebsmothers! I'm going to be a pain and punt this to a follow up!

logger.info(
f"Time for inference: {t:.02f} sec total, {tokens_sec:.02f} tokens/sec"
)
logger.info(f"Memory used: {torch.cuda.max_memory_allocated() / 1e9:.02f} GB")
Contributor:

For my own understanding, why are you including this information?

Contributor (author):

This seems to be a standard for every generation tool. I added it for my own debugging, but then realized it might be useful for folks running this recipe as well. Anything look out of place?

@joecummings (Contributor) left a comment:

kewl!

@kartikayk merged commit c543a5b into main on Mar 31, 2024
20 checks passed
@kartikayk deleted the inference branch on Mar 31, 2024 at 23:52
tcapelle pushed a commit to tcapelle/torchtune that referenced this pull request on Apr 5, 2024