# Model architecture

PhyAGI provides a transformer-based language model called MixFormer, whose components can be swapped to suit different tasks. The model follows the Hugging Face API, which makes it compatible with training frameworks such as Hugging Face, PyTorch Lightning, and DeepSpeed.

In this guide, we will explore the customization options for MixFormer within the PhyAGI framework and learn how to develop new components to innovate upon the existing architecture.

Let's start by creating a basic instance of MixFormer for causal language modeling:

In [1]:
from phyagi.models.mixformer_sequential import MixFormerSequentialConfig, MixFormerSequentialForCausalLM

config = MixFormerSequentialConfig(n_layer=2)
model = MixFormerSequentialForCausalLM(config)

print(model)

MixFormerSequentialForCausalLM(
  (layers): Sequential(
    (0): Embedding(
      (wte): Embedding(50304, 1024)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (1): ParallelBlock(
      (ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (mixer): MHA(
        (rotary_emb): RotaryEmbedding()
        (Wqkv): FusedDense(in_features=1024, out_features=3072, bias=True)
        (out_proj): FusedDense(in_features=1024, out_features=1024, bias=True)
        (inner_attn): FlashCrossAttention(
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (mlp): FusedMLP(
        (mlp): FusedMLP(
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (resid_dropout): Dropout(p=0.0, inplace=False)
    )
    (2): ParallelBlock(
      (ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (mixer): MHA(
        (rotary_emb): RotaryEmbedding()
    

## Choosing the architecture

MixFormer is composed of the following components:

* `embedding`: Embeds input tokens.

* `block`: Processes the embedded tokens (typically referred to as a decoder layer).

* `mixer`: Mixes the embedded tokens (e.g., attention mechanisms).

* `mlp`: Processes the mixer's output.

* `norm`: Normalizes the tokens (e.g., `LayerNorm`).

* `head`: Processes the block's output and may include components like a linear layer and cross-entropy loss.
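
As a rough illustration of how a `block` combines the other components, the sketch below contrasts two conventional ways of wiring a mixer and an MLP around a residual stream: sequentially (pre-norm, GPT-2 style) and in parallel (GPT-J/CodeGen style). It uses plain Python callables on scalars instead of tensors. The function names are hypothetical, and the exact wiring inside phyagi's `sequential` and `parallel` blocks is an assumption.

```python
from typing import Callable

Fn = Callable[[float], float]


def sequential_block(x: float, norm_1: Fn, norm_2: Fn, mixer: Fn, mlp: Fn) -> float:
    """Pre-norm sequential wiring: the MLP sees the mixer's residual output."""
    x = x + mixer(norm_1(x))
    x = x + mlp(norm_2(x))
    return x


def parallel_block(x: float, norm: Fn, mixer: Fn, mlp: Fn) -> float:
    """Parallel wiring: mixer and MLP read the same normalized input."""
    h = norm(x)
    return x + mixer(h) + mlp(h)


# Toy stand-ins for the real modules (hypothetical, for illustration only).
def identity(v: float) -> float:
    return v


def double(v: float) -> float:
    return 2.0 * v


def triple(v: float) -> float:
    return 3.0 * v


seq_out = sequential_block(1.0, identity, identity, double, triple)
par_out = parallel_block(1.0, identity, double, triple)
print(seq_out, par_out)  # 12.0 6.0
```

The two wirings give different results for the same sub-modules because, in the sequential case, the MLP operates on a residual stream that already contains the mixer's contribution.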

You can select each component from a variety of available implementations:

In [2]:
from phyagi.models.mixformer_sequential.blocks import BLOCKS
from phyagi.models.mixformer_sequential.blocks.embeddings import EMBEDDINGS
from phyagi.models.mixformer_sequential.blocks.mixers import MIXERS
from phyagi.models.mixformer_sequential.blocks.mlps import MLPS
from phyagi.models.mixformer_sequential.blocks.norms import NORMS
from phyagi.models.mixformer_sequential.blocks.heads import HEADS, LOSSES

print(f"Embeddings: {list(EMBEDDINGS.keys())}")
print(f"Blocks: {list(BLOCKS.keys())}")
print(f"Mixers: {list(MIXERS.keys())}")
print(f"MLPs: {list(MLPS.keys())}")
print(f"Norms: {list(NORMS.keys())}")
print(f"Heads: {list(HEADS.keys())}")
print(f"Losses: {list(LOSSES.keys())}")

Embeddings: ['default', 'positional']
Blocks: ['parallel', 'sequential', 'xyz']
Mixers: ['mha', 'conv1d']
MLPs: ['glu', 'fused_mlp', 'mlp', 'deep_mlp']
Norms: ['torch', 'low_precision', 'rms', 'flash_rms']
Heads: ['causal_lm', 'seq_cls']
Losses: ['causal_lm', 'seq_cls']


To build a custom architecture, you need to define the `block_cls` (string), `mixer` (dict), `mlp` (dict), and `norm` (dict) arguments in the `architecture` dictionary when configuring `MixFormerSequentialConfig`:

In [3]:
config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "sequential",
        "mixer": {
            "mixer_cls": "mha",
        },
        "mlp": {
            "mlp_cls": "mlp",
        },
        "norm": {
            "norm_cls": "torch",
        },
    },
)

model = MixFormerSequentialForCausalLM(config)
print(model)

MixFormerSequentialForCausalLM(
  (layers): Sequential(
    (0): Embedding(
      (wte): Embedding(50304, 1024)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (1): SequentialBlock(
      (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (attn): MHA(
        (rotary_emb): RotaryEmbedding()
        (Wqkv): FusedDense(in_features=1024, out_features=3072, bias=True)
        (out_proj): FusedDense(in_features=1024, out_features=1024, bias=True)
        (inner_attn): FlashCrossAttention(
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (mlp): MLP(
        (act): NewGELUActivation()
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
    )
    (2): SequentialBlock(
    

It is important to note that embeddings and heads are specified differently. Embeddings are set using `embd_layer` due to legacy reasons, and the head is created directly in the task-specific class (e.g., `MixFormerSequentialForCausalLM`).
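
Because every `*_cls` value is looked up by name, a typo in the `architecture` dictionary is an easy mistake to make. The sketch below checks the class names against the registry keys printed earlier. `find_unknown_classes` and the hard-coded `REGISTRY_KEYS` are hypothetical helpers for illustration, not part of phyagi; in practice the keys would come from `BLOCKS`, `MIXERS`, `MLPS`, and `NORMS`.

```python
from typing import Any, Dict, List

# Registry keys copied from the printed output above (assumed to be current).
REGISTRY_KEYS: Dict[str, List[str]] = {
    "block": ["parallel", "sequential", "xyz"],
    "mixer": ["mha", "conv1d"],
    "mlp": ["glu", "fused_mlp", "mlp", "deep_mlp"],
    "norm": ["torch", "low_precision", "rms", "flash_rms"],
}


def find_unknown_classes(architecture: Dict[str, Any]) -> List[str]:
    """Return one entry for every class name that is not a known registry key."""
    problems = []

    # `block_cls` is a plain string at the top level of the dictionary.
    block_cls = architecture.get("block_cls")
    if block_cls is not None and block_cls not in REGISTRY_KEYS["block"]:
        problems.append(f"block_cls={block_cls!r}")

    # `mixer`, `mlp`, and `norm` are nested dictionaries with a `<name>_cls` key.
    for name in ("mixer", "mlp", "norm"):
        cls_name = architecture.get(name, {}).get(f"{name}_cls")
        if cls_name is not None and cls_name not in REGISTRY_KEYS[name]:
            problems.append(f"{name}_cls={cls_name!r}")

    return problems


good = {"block_cls": "sequential", "mixer": {"mixer_cls": "mha"}, "norm": {"norm_cls": "torch"}}
bad = {"block_cls": "sequental", "mlp": {"mlp_cls": "gelu"}}

print(find_unknown_classes(good))  # []
print(find_unknown_classes(bad))   # ["block_cls='sequental'", "mlp_cls='gelu'"]
```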

The following examples show how to configure MixFormer to replicate well-known architectures such as CodeGen, LLaMA, and GPT. Each example adjusts the configuration to match the properties of the respective architecture.

### CodeGen

In [4]:
config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "parallel",
        "mixer": {
            "mixer_cls": "mha",
            # Additional keyword arguments can be used; see `phyagi-sdk/models/blocks/mixers/mha.py`
        },
        "mlp": {
            "mlp_cls": "mlp",
        },
        "norm": {
            "norm_cls": "torch",
        },
    }
)

model = MixFormerSequentialForCausalLM(config)
print(model)

MixFormerSequentialForCausalLM(
  (layers): Sequential(
    (0): Embedding(
      (wte): Embedding(50304, 1024)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (1): ParallelBlock(
      (ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (mixer): MHA(
        (rotary_emb): RotaryEmbedding()
        (Wqkv): FusedDense(in_features=1024, out_features=3072, bias=True)
        (out_proj): FusedDense(in_features=1024, out_features=1024, bias=True)
        (inner_attn): FlashCrossAttention(
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (mlp): MLP(
        (act): NewGELUActivation()
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
      )
      (resid_dropout): Dropout(p=0.0, inplace=False)
    )
    (2): ParallelBlock(
      (ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (mixer): MHA(
        (rotary_emb): RotaryEmbedding()
        (Wqkv

### LLaMA

The LLaMA architecture needs a head without `bias`, which is configured through the `head` entry of the `architecture` dictionary. As a fail-safe, `head_cls` must match the task-specific class: for example, `head_cls` set to `causal_lm` only works when the model is initialized with `MixFormerSequentialForCausalLM`.

In [5]:
config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "sequential",
        "mixer": {
            "mixer_cls": "mha",
        },
        "mlp": {
            "mlp_cls": "glu",
            "act_fn": "silu",
            "n_inner": 5456,  # near 2/3 * 4 * 2048; int(2/3 * 4 * n_embd) is 2730 for n_embd=1024
        },
        "norm": {
            "norm_cls": "rms",
        },
        "head": {
            "head_cls": "causal_lm",
            "use_bias": False,
        }
    }
)

model = MixFormerSequentialForCausalLM(config)
print(model)

MixFormerSequentialForCausalLM(
  (layers): Sequential(
    (0): Embedding(
      (wte): Embedding(50304, 1024)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (1): SequentialBlock(
      (ln_1): RMSLayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ln_2): RMSLayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (attn): MHA(
        (rotary_emb): RotaryEmbedding()
        (Wqkv): FusedDense(in_features=1024, out_features=3072, bias=True)
        (out_proj): FusedDense(in_features=1024, out_features=1024, bias=True)
        (inner_attn): FlashCrossAttention(
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (mlp): GLU(
        (act): SiLU()
        (fc1): Linear(in_features=1024, out_features=10912, bias=False)
        (fc2): Linear(in_features=5456, out_features=1024, bias=False)
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
    )
    (2): SequentialBlock(
      (l
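
The `n_inner` value in the LLaMA configuration relates to the common GLU sizing rule `2/3 * 4 * n_embd`. The sketch below evaluates that rule for two embedding sizes. `glu_inner_size` and its `multiple_of` parameter are hypothetical, and the rounding to a multiple of 16 is only a guess at how 5456 was chosen. The last line shows that `fc1` in the printed `GLU` has `2 * n_inner` output features, presumably one half for the gate and one for the value (an assumption about the implementation).

```python
def glu_inner_size(n_embd: int, multiple_of: int = 1) -> int:
    """Evaluate int(2/3 * 4 * n_embd), rounded down to a multiple of `multiple_of`."""
    raw = int(2 / 3 * 4 * n_embd)
    return raw - raw % multiple_of


print(glu_inner_size(1024))      # 2730: the rule applied to n_embd=1024
print(glu_inner_size(2048, 16))  # 5456: the rule applied to n_embd=2048, multiple of 16

n_inner = 5456
print(2 * n_inner)  # 10912, the out_features of fc1 in the printed GLU
```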

### GPT

In [6]:
config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "sequential",
        "mixer": {
            "mixer_cls": "mha",
            "window_size": (128, 128)
        },
        "mlp": {
            "mlp_cls": "fused_mlp",
            "n_inner": 4096
        }
    }
)

model = MixFormerSequentialForCausalLM(config)
print(model)

MixFormerSequentialForCausalLM(
  (layers): Sequential(
    (0): Embedding(
      (wte): Embedding(50304, 1024)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (1): SequentialBlock(
      (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (attn): MHA(
        (rotary_emb): RotaryEmbedding()
        (Wqkv): FusedDense(in_features=1024, out_features=3072, bias=True)
        (out_proj): FusedDense(in_features=1024, out_features=1024, bias=True)
        (inner_attn): FlashCrossAttention(
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (mlp): FusedMLP(
        (mlp): FusedMLP(
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
    )
    (2): SequentialBl

## Heterogeneous architectures

MixFormer also supports heterogeneous architectures that employ different types of blocks at different layers. The following configuration illustrates this capability:

In [7]:
config = MixFormerSequentialConfig(
    n_layer=1,
    architecture=[
        {
            "block_cls": "sequential",
            "mixer": {
                "mixer_cls": "mha",
                "window_size": 128,
            },
            "mlp": {
                "mlp_cls": "fused_mlp"
            }
        },
        {
            "block_cls": "parallel",
            "mixer": {
                "mixer_cls": "mha",
            },
            "mlp": {
                "mlp_cls": "glu"
            }
        }
    ]
)

model = MixFormerSequentialForCausalLM(config)
print(model)

MixFormerSequentialForCausalLM(
  (layers): Sequential(
    (0): Embedding(
      (wte): Embedding(50304, 1024)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (1): SequentialBlock(
      (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (attn): MHA(
        (rotary_emb): RotaryEmbedding()
        (Wqkv): FusedDense(in_features=1024, out_features=3072, bias=True)
        (out_proj): FusedDense(in_features=1024, out_features=1024, bias=True)
        (inner_attn): FlashCrossAttention(
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (mlp): FusedMLP(
        (mlp): FusedMLP(
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
    )
    (2): ParallelBloc

The model will automatically adjust the `config.n_layer` to accommodate the actual number of layers specified in the configuration.
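
This adjustment can be pictured with a small sketch. `resolve_n_layer` is a hypothetical helper, not phyagi's actual code: it assumes that a list-valued `architecture` defines one block per entry, while a single dictionary is repeated `n_layer` times.

```python
from typing import Any, Dict, List, Union

Architecture = Union[Dict[str, Any], List[Dict[str, Any]]]


def resolve_n_layer(n_layer: int, architecture: Architecture) -> int:
    """Return the effective number of blocks for a given architecture."""
    if isinstance(architecture, list):
        # Heterogeneous case: one block per entry, overriding `n_layer`.
        return len(architecture)
    # Homogeneous case: the same block definition is reused `n_layer` times.
    return n_layer


heterogeneous = [
    {"block_cls": "sequential", "mixer": {"mixer_cls": "mha"}},
    {"block_cls": "parallel", "mixer": {"mixer_cls": "mha"}},
]

print(resolve_n_layer(1, heterogeneous))              # 2
print(resolve_n_layer(2, {"block_cls": "parallel"}))  # 2
```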

# New architectures

MixFormer is designed with extensibility in mind, allowing for the straightforward addition of new blocks, mixers, MLPs, and other components.

## Mixer

To create a new mixer, adhere to the following guidelines:

* The class `__init__` method must accept a `config` argument.

* The `forward` method must return a tensor with the same dimensions as the input tensor.
  
Here's an example of implementing a `Conv1dMixer`:

In [8]:
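from typing import Optional, Tuple

import torch
from transformers import PretrainedConfig


class Conv1dMixer(torch.nn.Module):
    def __init__(self, config: PretrainedConfig, kernel_size: int = 3, layer_idx: Optional[int] = None) -> None:
        super().__init__()

        # Depthwise convolution (one filter per channel). Padding both sides by
        # `kernel_size - 1` lets `forward` keep only the causal outputs.
        self.conv = torch.nn.Conv1d(
            in_channels=config.n_embd,
            out_channels=config.n_embd,
            kernel_size=kernel_size,
            padding=kernel_size - 1,
            groups=config.n_embd,
        )

    def forward(self, x: torch.FloatTensor, **kwargs) -> Tuple[torch.FloatTensor, None]:
        # Conv1d expects (batch, channels, seq_len), so swap the last two dimensions.
        out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        # Truncate to the input length; the second tuple element is unused here.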
        return out[:, :x.shape[1]], None
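
The combination of `padding=kernel_size-1` and the final slice makes the convolution causal: each output position depends only on the current and earlier positions. The pure-Python sketch below mimics that pad-then-truncate step for a single channel, so the property can be checked without PyTorch. `causal_conv1d` is a hypothetical stand-in, not the mixer itself.

```python
from typing import List


def causal_conv1d(xs: List[float], weights: List[float]) -> List[float]:
    """Cross-correlate `xs` with `weights` using symmetric zero padding of
    len(weights) - 1, then keep only the first len(xs) outputs."""
    k = len(weights)
    padded = [0.0] * (k - 1) + xs + [0.0] * (k - 1)
    full = [
        sum(w * padded[t + j] for j, w in enumerate(weights))
        for t in range(len(padded) - k + 1)
    ]
    # The full output has len(xs) + k - 1 entries; entry t only reads
    # xs[t - k + 1 .. t], so keeping the first len(xs) entries is causal.
    return full[: len(xs)]


weights = [0.5, 0.25, 0.25]
a = causal_conv1d([1.0, 2.0, 3.0, 4.0], weights)
b = causal_conv1d([1.0, 2.0, 3.0, 100.0], weights)  # only the last input differs

print(len(a))          # 4, same length as the input
print(a[:3] == b[:3])  # True: earlier outputs ignore the changed last input
```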

To make the new mixer accessible within `phyagi`, you can:

1. Submit a pull request to include the mixer in the `phyagi-sdk/models/blocks/mixers/__init__.py` file, making it available to all users.

2. Register it dynamically by adding a new key to the `MIXERS` dictionary:

In [9]:
from phyagi.models.mixformer_sequential.blocks.mixers import MIXERS

MIXERS["conv1d"] = Conv1dMixer

After registration, you can confirm its availability by listing the keys of `MIXERS` and instantiate it via the `mixer_cls` argument in the `architecture` dictionary:

In [10]:
from phyagi.models.mixformer_sequential import MixFormerSequentialConfig, MixFormerSequentialForCausalLM
from phyagi.models.mixformer_sequential.blocks.mixers import MIXERS

print(list(MIXERS.keys()))

config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "sequential",
        "mixer": {
            "mixer_cls": "conv1d",
        }
    }
)

model = MixFormerSequentialForCausalLM(config)
print(model(torch.zeros((1, 1024), dtype=torch.long)).logits)

['mha', 'conv1d']
tensor([[[-0.3493, -0.5180,  0.3647,  ...,  0.1448,  0.1304, -0.9054],
         [-0.5930, -0.2169,  0.6949,  ...,  0.1118,  0.0732, -0.5457],
         [-0.6067,  0.0864,  0.8189,  ...,  0.4726, -0.0754, -0.3408],
         ...,
         [-0.4264,  0.4139,  0.7655,  ...,  0.2481, -0.4173, -0.3329],
         [-0.4264,  0.4139,  0.7655,  ...,  0.2481, -0.4173, -0.3329],
         [-0.4264,  0.4139,  0.7655,  ...,  0.2481, -0.4173, -0.3329]]],
       grad_fn=<ViewBackward0>)


Extra keys in the `mixer` dictionary serve as keyword arguments for the mixer class, allowing for customization, such as changing the `kernel_size`:

In [11]:
config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "sequential",
        "mixer": {
            "mixer_cls": "conv1d",
            "kernel_size": 11
        }
    }
)

model = MixFormerSequentialForCausalLM(config)
assert model.layers[1].attn.conv.kernel_size == (11,)
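
One way to picture how the extra keys reach the constructor is a registry lookup followed by keyword-argument forwarding. The sketch below is a simplified, hypothetical version of that pattern in plain Python. `build_from_registry`, `ToyConfig`, `ToyConv`, and `TOY_MIXERS` are illustrative names, and phyagi's own `get_mixer` may work differently.

```python
from typing import Any, Dict


class ToyConfig:
    """Minimal stand-in for a model config (hypothetical)."""

    def __init__(self, n_embd: int) -> None:
        self.n_embd = n_embd


class ToyConv:
    """Stand-in component: takes `config` plus its own keyword arguments."""

    def __init__(self, config: ToyConfig, kernel_size: int = 3) -> None:
        self.n_embd = config.n_embd
        self.kernel_size = kernel_size


TOY_MIXERS: Dict[str, type] = {"conv1d": ToyConv}


def build_from_registry(
    registry: Dict[str, type], config: ToyConfig, component: Dict[str, Any], cls_key: str
) -> Any:
    """Look up the class named by `cls_key` and forward the remaining keys."""
    kwargs = dict(component)  # copy so the caller's dictionary is left untouched
    cls_name = kwargs.pop(cls_key)
    if cls_name not in registry:
        raise ValueError(f"unknown {cls_key}: {cls_name!r}")
    return registry[cls_name](config, **kwargs)


mixer = build_from_registry(
    TOY_MIXERS, ToyConfig(n_embd=1024), {"mixer_cls": "conv1d", "kernel_size": 11}, "mixer_cls"
)
print(mixer.kernel_size)  # 11
```

Keys that the constructor does not accept would raise a `TypeError` in this sketch, which surfaces misspelled options early.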

## MLP

The same principles apply when creating a new MLP:

* The class `__init__` method must accept a `config` argument.

* The `forward` method must return a tensor with the same dimensions as the input tensor.

Here's an example of a `DeepMLP`:

In [12]:
from typing import Optional

from transformers.activations import ACT2FN


class DeepMLP(torch.nn.Module):
    def __init__(self, config: PretrainedConfig, n_layer: int = 3, n_inner: Optional[int] = None) -> None:
        super().__init__()

        self.n_inner = n_inner or config.n_inner or 4*config.n_embd
        self.act = ACT2FN[config.activation_function]

        layers = [torch.nn.Linear(config.n_embd, self.n_inner)]
        layers += [torch.nn.Linear(self.n_inner, self.n_inner) for _ in range(n_layer - 2)]
        layers += [torch.nn.Linear(self.n_inner, config.n_embd)]
        self.layers = torch.nn.ModuleList(layers)
    
    def forward(self, x: torch.FloatTensor, **kwargs) -> torch.FloatTensor:
        for layer in self.layers[:-1]:
            x = self.act(layer(x))
        
        return self.layers[-1](x)
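
To see which linear layers `DeepMLP` ends up with for a given `n_layer`, the sketch below reproduces the list construction with plain `(in_features, out_features)` tuples. `deep_mlp_shapes` and `count_parameters` are hypothetical helpers. The first mirrors the three `layers` lines above and also shows that any `n_layer` below 2 still yields two layers, because `range(n_layer - 2)` is then empty.

```python
from typing import List, Tuple


def deep_mlp_shapes(n_embd: int, n_inner: int, n_layer: int = 3) -> List[Tuple[int, int]]:
    """Mirror DeepMLP's layer list as (in_features, out_features) pairs."""
    shapes = [(n_embd, n_inner)]
    shapes += [(n_inner, n_inner) for _ in range(n_layer - 2)]
    shapes += [(n_inner, n_embd)]
    return shapes


def count_parameters(shapes: List[Tuple[int, int]]) -> int:
    """Weights plus biases of the corresponding torch.nn.Linear layers."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


shapes = deep_mlp_shapes(n_embd=1024, n_inner=4096, n_layer=4)
print(shapes)  # [(1024, 4096), (4096, 4096), (4096, 4096), (4096, 1024)]
print(count_parameters(shapes))  # 41956352
print(len(deep_mlp_shapes(1024, 4096, n_layer=1)))  # 2
```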


In [13]:
from phyagi.models.mixformer_sequential.blocks.mlps import MLPS

MLPS["deep_mlp"] = DeepMLP
print(list(MLPS.keys()))

['glu', 'fused_mlp', 'mlp', 'deep_mlp']


In [14]:
config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "sequential",
        "mixer": {
            "mixer_cls": "conv1d"
        },
        "mlp": {
            "mlp_cls": "deep_mlp",
            "n_layer": 4,
        }
    }
)

model = MixFormerSequentialForCausalLM(config)
print(model(torch.zeros((1, 1024), dtype=torch.long)).logits)

tensor([[[-0.1301,  0.2461,  0.1236,  ..., -0.1361, -0.0135, -0.0081],
         [-0.2107, -0.3234,  0.3190,  ...,  0.2561, -0.5194,  0.2247],
         [ 0.3618, -0.3681,  0.2278,  ...,  0.1237, -0.3493, -0.2407],
         ...,
         [ 0.1465, -0.7555,  0.0601,  ..., -0.0617, -0.2871, -0.1053],
         [ 0.1465, -0.7555,  0.0601,  ..., -0.0617, -0.2871, -0.1053],
         [ 0.1465, -0.7555,  0.0601,  ..., -0.0617, -0.2871, -0.1053]]],
       grad_fn=<ViewBackward0>)


## Block

To create a new block component:

* The class `__init__` method must include a `config` argument.

* Optionally, it may include a `block_idx` argument.

* Preferably, it should accept `mlp` and `mixer` arguments.

* The `forward` method must output a tensor maintaining the input's dimensions.

Example of a custom block implementation:

In [15]:
from typing import Any, Dict, Tuple, Optional, Union
from phyagi.models.mixformer_sequential.blocks.mixers import get_mixer
from phyagi.models.mixformer_sequential.blocks.mlps import get_mlp


class BlockXyz(torch.nn.Module):
    def __init__(self, config: PretrainedConfig, mixer: Dict[str, Any], mlp: Dict[str, Any], block_idx: int = None, **kwargs) -> None:
        super().__init__()

        self.mixer = get_mixer(config, mixer_config=mixer)
        self.mlp = get_mlp(config, mlp_config=mlp)
        self.block_idx = block_idx
    
    def forward(self, x: Union[torch.FloatTensor, Tuple[torch.FloatTensor]], **kwargs) -> Tuple[
        torch.FloatTensor,
        Optional[torch.FloatTensor],
        Optional[torch.FloatTensor],
        Optional[Tuple[torch.FloatTensor, torch.FloatTensor]],
    ]:
        x = self.mixer(x)[0]
        x = self.mlp(x)
        
        return (x, None, None)


In [16]:
from phyagi.models.mixformer_sequential.blocks import BLOCKS

BLOCKS["xyz"] = BlockXyz
print(list(BLOCKS.keys()))

['parallel', 'sequential', 'xyz']


In [17]:
config = MixFormerSequentialConfig(
    n_layer=2,
    architecture={
        "block_cls": "xyz",
        "mixer": {
            "mixer_cls": "conv1d",
            "kernel_size": 11
        },
        "mlp": {
            "mlp_cls": "deep_mlp",
            "n_layer": 4,
        }
    }
)

model = MixFormerSequentialForCausalLM(config)
print(model(torch.zeros((1, 1024), dtype=torch.long)).logits)

tensor([[[-0.3602, -0.9025,  0.0893,  ..., -0.3397, -0.6635,  0.4196],
         [-0.3642, -0.9088,  0.0790,  ..., -0.3380, -0.6663,  0.4473],
         [-0.3450, -0.9090,  0.1206,  ..., -0.3576, -0.6528,  0.4615],
         ...,
         [-0.3618, -0.9667,  0.1016,  ..., -0.2696, -0.7573,  0.4785],
         [-0.3618, -0.9667,  0.1016,  ..., -0.2696, -0.7573,  0.4785],
         [-0.3618, -0.9667,  0.1016,  ..., -0.2696, -0.7573,  0.4785]]],
       grad_fn=<ViewBackward0>)


The same customization approach can be applied to other components of MixFormer, such as embeddings, heads, and losses.