Modularize Llama2 #137
Conversation
Not a complete review, just one comment for thought.
output_proj (nn.Module): projection layer for output.
pos_embeddings (Optional[nn.Module]): positional embeddings layer, e.g. RotaryPositionalEmbeddings.
    If not specified, then no positional embeddings are used.
kv_cache (Optional[KVCache]): KVCache object used to cache key and value.
curious about the tradeoffs here. Before, we'd detect when running for inference and enable kv cache optimization out of the box. Now we're relying on users to pass this in. This could cause some friction when trying to switch btwn training and inference.
cc @ebsmothers on this as well
I may be being dense here, but before didn't we just use a different flag (max_batch_size) to control whether KV caching was enabled?
Before, we needed them to include a max_batch_size for this to be enabled. I think if we pick-and-choose which modules we include by default and which ones we allow to be passed in, it signals inconsistent design principles.
I agree we should strive for a consistent design here. Ideally we have some flag that's relatively clear and indicates whether a component (be it self-attention or full transformer decoder) is in KV caching mode. Either way, idk if we have to tackle it in this PR.
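As a rough illustration of what such a toggle could look like (setup_caches and caches_enabled here are hypothetical names, not existing torchtune APIs): the module stays cache-free by default for training, and inference code opts in explicitly.

```python
from typing import Optional

import torch
from torch import nn


class KVCache(nn.Module):
    """Illustrative pre-allocated key/value cache."""

    def __init__(self, batch_size: int, max_seq_len: int, num_heads: int, head_dim: int) -> None:
        super().__init__()
        self.register_buffer("k", torch.zeros(batch_size, max_seq_len, num_heads, head_dim))
        self.register_buffer("v", torch.zeros(batch_size, max_seq_len, num_heads, head_dim))


class SelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.kv_cache: Optional[KVCache] = None  # None => training mode, no caching

    def setup_caches(self, batch_size: int, max_seq_len: int) -> None:
        # Hypothetical switch into KV-caching (inference) mode, called once before generation
        self.kv_cache = KVCache(batch_size, max_seq_len, self.num_heads, self.head_dim)

    @property
    def caches_enabled(self) -> bool:
        # Single explicit flag for forward() to branch on
        return self.kv_cache is not None
```

Training code would never touch setup_caches; a generation script would call it once before decoding and the same forward() could branch on caches_enabled.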
@@ -13,8 +13,8 @@
import torch

from torchtune.models.llama2.tokenizer import Tokenizer
from torchtune.models.llama2.transformer import TransformerDecoder
from torchtune.modules.tokenizer import Tokenizer
I know it's not the main point of this PR, but wonder if tokenizers could/should go in something like torchtune/transforms instead
I'm not opposed - good for a follow-up.
torchtune/models/__init__.py
Outdated
@@ -9,7 +9,7 @@
import torch
from torch.nn import Module

from torchtune.models.llama2.models import llama2_7b, llama2_tokenizer
from .llama2 import llama2_7b, llama2_tokenizer, small_test_ckpt  # noqa
Why relative import? And why do we need to import small_test_ckpt here?
This is pretty standard for at least __init__.py files. See https://github.com/pytorch/pytorch/blob/main/torch/optim/__init__.py for a single example, but there are more.
small_test_ckpt was used in tests. I can move it out of here to private if desired.
Let's stick w/absolute imports IMO. They're much cleaner and should probably be a design principle / coding best practice if it's not already.
For everything except __init__.py files, I absolutely agree. I think the assumption is that the __init__.py for a specific submodule will ALWAYS be in the same place. That way, any changes to higher-level folder structure won't mess up these simple imports.
Yeah I am OK with the relative imports; I think they are often used to define the importable modules from a given package. I see you defined __all__ in the other __init__.py file, so this makes sense to me. Btw to scrap the noqa tags you can try this:
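One way this typically works (an illustration, since the original example isn't shown here): flake8/pyflakes treats names listed in __all__ as used, so re-exported imports in an __init__.py no longer need per-line noqa tags.

```python
# torchtune/modules/__init__.py (sketch)
from .position_embeddings import RotaryPositionalEmbeddings
from .rms_norm import RMSNorm
from .tokenizer import Tokenizer
from .transformer import TransformerDecoder, TransformerDecoderLayer

# Listing the names here marks the imports above as intentional re-exports,
# so F401 ("imported but unused") is not raised and the noqa comments can go.
__all__ = [
    "RotaryPositionalEmbeddings",
    "RMSNorm",
    "Tokenizer",
    "TransformerDecoder",
    "TransformerDecoderLayer",
]
```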
torchtune/models/llama2.py
Outdated
)


def small_test_ckpt(vocab_size: int) -> TransformerDecoder:
vocab_size not used
This has been moved to unit tests in main
torchtune/models/llama2.py
Outdated
return tokenizer


class Llama2FeedForward(nn.Module):
This is one case where I do not like the usage of an extra class wrapping FeedForward. This is basically just a passthrough for everything but activation, I don't think we should be nesting modules here
nit on name, technically all layer types are feedforward if they're not recurrent. Isn't this either a Linear layer or an MLP?
torchtune/models/llama2.py
Outdated
    (1, 1, seq_len, seq_len), float("-inf"), device=tokens.device
)
mask = torch.triu(mask, diagonal=curr_pos + 1)
return self.model(tokens, mask, curr_pos)
So once we get all the way up to the top level model, if I wanna get an individual MLP block I will need to do self.model.layers[0].layer.mlp.ff, right? Imo this is unintuitive.
This is a consequence of utilizing classes as opposed to builder functions. Not sure if this discussion was officially closed. I see the confusion here in accessing the class through this method.
torchtune/models/llama2.py
Outdated
if seq_len > 1 and self.max_batch_size is not None:
    mask = torch.full(
        (1, 1, seq_len, seq_len), float("-inf"), device=tokens.device
    )
    mask = torch.triu(mask, diagonal=curr_pos + 1)
Can we offload mask construction to a utility or something?
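A sketch of such a utility (the name get_causal_mask is illustrative), pulling the logic from the snippet above into one place:

```python
from typing import Optional

import torch


def get_causal_mask(
    seq_len: int, curr_pos: int = 0, device: Optional[torch.device] = None
) -> torch.Tensor:
    """Additive causal attention mask: positions above the (offset) diagonal get -inf."""
    mask = torch.full((1, 1, seq_len, seq_len), float("-inf"), device=device)
    return torch.triu(mask, diagonal=curr_pos + 1)
```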
torchtune/modules/attention.py
Outdated
@@ -198,17 +180,19 @@ def forward(
k = k.expand(bsz, seq_len, self.num_kv_heads, q_per_kv, self.head_dim)
v = v.expand(bsz, seq_len, self.num_kv_heads, q_per_kv, self.head_dim)

# Apply RoPE embeddings
# if self.pos_embeddings is not None:
Why is this commented out? I think in init you have pos_embeddings as optional
Will be fixed.
Bump
@@ -13,8 +13,8 @@
Is it worth making this a recipe? It could be tested then and take advantage of the cli
I'm still not 100% clear on the purpose of this file tbh. As is it reads like a parity check across inference with no KV caching, inference with KV caching, and the HF version of the model. If that's the case we should not make it a recipe. But if we drop the transformers dep and make the KV caching configurable I agree this would make a nice recipe
This file should be deleted eventually. Before MVP
torchtune/models/llama2.py
Outdated
super().__init__()
self.max_batch_size = max_batch_size
token_embeddings = nn.Embedding(vocab_size, embed_dim)
layer = Llama2DecoderLayer(
If this layer type was passed in, it would make this code more modular. A user could write a custom Llama decoder layer and pass that in.
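A sketch of that idea (names are illustrative): the builder takes a layer factory instead of hard-coding Llama2DecoderLayer, so a custom decoder layer only changes the call site.

```python
from typing import Callable

from torch import nn


def build_layers(num_layers: int, layer_factory: Callable[[], nn.Module]) -> nn.ModuleList:
    # Each factory call produces an independently initialized layer (no shared weights)
    return nn.ModuleList([layer_factory() for _ in range(num_layers)])


# Usage with a hypothetical custom layer:
# layers = build_layers(num_layers=32, layer_factory=lambda: MyLlamaDecoderLayer(...))
```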
self.w1 = linear_class(dim, hidden_dim, bias=False)
self.w2 = linear_class(hidden_dim, dim, bias=False)
self.w3 = linear_class(dim, hidden_dim, bias=False)
Need to change these (I think this is what you were mentioning earlier?)
Bumping this comment
Does this not work? Would we want to do more than just allow a different linear class?? What interface does LoRALinear support?
I would pass three different nn.Modules instead of dim and hidden_dim. Also the way you're passing linear_class rn it is not an nn.Module, it is just a type (since you have not actually initialized it outside of FeedForward).
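A sketch of that suggestion (not the code in this PR, just an illustration): the three projections are constructed by the caller and passed in as modules, so swapping in something like a LoRALinear, or anything else with a Linear-like forward, only changes the call site.

```python
import torch
from torch import nn


class FeedForward(nn.Module):
    """Gated feed-forward block that receives its projection layers pre-constructed."""

    def __init__(self, w1: nn.Module, w2: nn.Module, w3: nn.Module, activation: nn.Module) -> None:
        super().__init__()
        self.w1, self.w2, self.w3 = w1, w2, w3
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU-style gating: down(act(gate(x)) * up(x))
        return self.w2(self.activation(self.w1(x)) * self.w3(x))


# The caller decides what the projections are (plain nn.Linear here):
dim, hidden_dim = 4096, 11008
ff = FeedForward(
    w1=nn.Linear(dim, hidden_dim, bias=False),
    w2=nn.Linear(hidden_dim, dim, bias=False),
    w3=nn.Linear(dim, hidden_dim, bias=False),
    activation=nn.SiLU(),
)
```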
@@ -91,37 +99,19 @@ def __init__(
if attn_dropout < 0 or attn_dropout > 1:
    raise ValueError(f"attn_dropout ({embed_dim}) must be between 0.0 and 1.0")
It might be nice to offload all these checks to a self._validate_parameters method or something to keep the init clean.
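For example (a sketch; the specific checks shown are representative, and note that the quoted message above interpolates embed_dim where attn_dropout was presumably intended):

```python
from torch import nn


class LlamaSelfAttention(nn.Module):
    def __init__(
        self, embed_dim: int, num_heads: int, num_kv_heads: int, attn_dropout: float = 0.0
    ) -> None:
        super().__init__()
        self._validate_parameters(embed_dim, num_heads, num_kv_heads, attn_dropout)
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.attn_dropout = attn_dropout

    @staticmethod
    def _validate_parameters(
        embed_dim: int, num_heads: int, num_kv_heads: int, attn_dropout: float
    ) -> None:
        # All input validation lives in one place so __init__ stays readable
        if attn_dropout < 0 or attn_dropout > 1:
            raise ValueError(f"attn_dropout ({attn_dropout}) must be between 0.0 and 1.0")
        if embed_dim % num_heads != 0:
            raise ValueError(f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})")
        if num_heads % num_kv_heads != 0:
            raise ValueError(f"num_heads ({num_heads}) must be divisible by num_kv_heads ({num_kv_heads})")
```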
layer: TransformerDecoderLayer,
num_layers: int,
norm: nn.Module,
output: nn.Linear,
Could also just be generic nn.Module (e.g. if someone wants to use an MLP instead of a single linear). Not a huge deal either way though
# shape: [b, s, d]
h = self.tok_embeddings(tokens)

if seq_len > 1 and self.layers[0].attn.kv_cache is not None:
In general I would try to avoid accessing lower-level components' attributes as much as possible. I know you've typed things all the way down so this isn't gonna directly break anything, but still..
This sucks and should be fixed; will do in a follow-up PR.
Can you add a todo?
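One possible shape for that follow-up (hypothetical method name): the decoder owns a small helper so forward() never reaches into a layer's attention module directly.

```python
from torch import nn


class TransformerDecoder(nn.Module):
    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers

    def caches_are_enabled(self) -> bool:
        # TODO (from review): replace the direct self.layers[0].attn.kv_cache access
        # in forward() with this method (or a flag set at construction time)
        return any(getattr(layer.attn, "kv_cache", None) is not None for layer in self.layers)
```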
torchtune/models/llama2.py
Outdated
num_kv_heads=32,
embed_dim=4096,
max_seq_len=2048,
max_batch_size=32,  # Need to figure out the actual default used by Llama2
Should depend on the user's setup, right? (I.e. how much memory they have) Also does this mean that the default here is for inference mode? (Not that there's anything wrong with that, but we may want to distinguish this somehow)
from .position_embeddings import RotaryPositionalEmbeddings  # noqa
from .rms_norm import RMSNorm  # noqa
from .tokenizer import Tokenizer  # noqa
from .transformer import TransformerDecoder, TransformerDecoderLayer  # noqa
[no need to address in this PR!]
As an illustration of what I'm advocating for in #25 (comment), these imports in this __init__.py file would look like:
from ._tokenizer import Tokenizer  # noqa
from ._transformer import TransformerDecoder, TransformerDecoderLayer  # noqa
And everything else stays unchanged. This allows us to write whatever we want within _tokenizer.py or _transformer.py without worrying about whether what we're writing should be public or private.
torchtune/models/llama2.py
Outdated
hidden_dim = 4 * int(2 * embed_dim / 3)
# Round hidden dimension to nearest multiple of `multiple_of`
multiple_of = 256
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
This is one case where I think a feedforward-specific builder will be useful. Otherwise we are repeating this logic a lot (or using magic #s as in the transformer decoder test). Why not let FeedForward take three nn.Modules + activation as args, then provide a single feedforward builder that takes hidden_dim and embed_dim and does all the math?
Done!
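A sketch of what such a builder could look like (name illustrative), reusing the hidden-dim arithmetic quoted above together with the module-injection FeedForward sketched earlier in this thread:

```python
from torch import nn


def llama2_feed_forward(embed_dim: int, multiple_of: int = 256) -> "FeedForward":
    # Llama2 sizes the hidden dim at 2/3 of 4 * embed_dim, rounded up to a
    # multiple of `multiple_of`; e.g. embed_dim=4096 -> hidden_dim=11008
    hidden_dim = 4 * int(2 * embed_dim / 3)
    hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
    # FeedForward here refers to the module-injection sketch shown earlier
    return FeedForward(
        w1=nn.Linear(embed_dim, hidden_dim, bias=False),
        w2=nn.Linear(hidden_dim, embed_dim, bias=False),
        w3=nn.Linear(embed_dim, hidden_dim, bias=False),
        activation=nn.SiLU(),
    )
```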
TorchTune seeks to leverage the power of native PyTorch including PyTorch-native distributed APIs such as FSDP and Tensor Parallelism. To train Llama-2 models using TorchTune, checkpoints must be converted as a result.
"checkpoints must be converted as a result."
converted from what to what?
This will be remedied in a follow-up PR.
A few more comments but overall looks good. There are still a bunch of open comments so please make sure to address those (I explicitly bumped the couple most important ones imo). Thanks for your diligence in working through all these changes and addressing everything! Accepting now so you're not blocked on me
torchtune/models/llama2.py
Outdated
num_kv_heads = num_kv_heads if num_kv_heads else num_heads
qkv_dim = (num_heads + 2 * num_kv_heads) * head_dim
layers = nn.ModuleList()
for _ in range(num_layers):
(Not blocking for this PR) There are gonna be a lot of similar for loops if we aren't using deepcopy, which I don't love. Short-term: maybe some simple utility to slightly reduce boilerplate? Longer-term I would really like to figure out a better solution though.
Fixed back to _get_clones.
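For reference, the _get_clones pattern (the same idea as the private helper in torch.nn.modules.transformer) deep-copies a template module so every layer gets its own parameter storage:

```python
import copy

from torch import nn


def _get_clones(module: nn.Module, n: int) -> nn.ModuleList:
    # deepcopy gives each clone distinct parameters (different data_ptrs),
    # unlike appending the same instance n times
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])


# layers = _get_clones(Llama2DecoderLayer(...), num_layers)
```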
@@ -202,11 +201,21 @@ def compare_attention(
attn_out_ref = attn_ref(input_t, freq_cis, mask)

# current implementation; initialize with constant to compare outputs
attn = LlamaSelfAttention(
head_dim = embed_dim // num_heads
Not to be too much of a broken record, but this block (plus L122-152 of compare_decoder_layer.py, plus several blocks of test_attention.py, plus L69-94 of test_transformer_decoder.py) is why I would like to see builder functions for intermediate components as well.
Co-authored-by: Evan Smothers <ebs@fb.com> Co-authored-by: Danielle Pintz <38207072+daniellepintz@users.noreply.github.com>
attn_dropout: float = 0.0,
max_batch_size: Optional[int] = None,
norm_eps: float = 1e-6,
):
Missing docstring
Summary
Based on RFC principles outlined in #102, llama2 (and derived models) can be built using principles of composability and shared modules.
Changelog
- Added shared components to the modules/ __init__.py so as to be importable from the modules/ dir
- Built llama2 components based on these modules in a single file
- pytorch-multimodal/llama2-7b-01052024
Testing
- pytest tests/
- pytest recipes/
Notes
- The llama2 builder function constructs a TransformerDecoder model from all classes, rather than a bunch of builder functions. This is the hybrid approach mentioned by @ebsmothers and @kartikayk and requires more feedback if this is the direction we want to go in.
- A single layer instance can't simply be reused inside TransformerDecoder b/c it initializes to the same data_ptr. Therefore, a big change is that the for loop takes place in the llama2 builder function. Not sure if this is ideal. Curious to hear thoughts.
Docs