# Constructing a NeMo model

NeMo "Models" comprise a few key components. We'll go through them in this order: neural network architecture, dataset and data loaders, preprocessing and postprocessing, optimizer and scheduler, and other infrastructure (tokenizers, LM configs, data augmentation, etc.).

To make this slightly challenging, let's port a model from the NLP domain. Transformers are all the rage, with BERT and his friends from Sesame Street forming the core infrastructure for many NLP tasks.

An excellent implementation of one such model, GPT, can be found in the `minGPT` repo: https://github.com/karpathy/minGPT. We will attempt to port minGPT to NeMo.

## Constructing the Neural Network Architecture

First on the list is the neural network that forms the backbone of the NeMo model.


In [2]:
import torch
import nemo
from nemo.core import NeuralModule
from nemo.core import typecheck

[NeMo W 2023-03-22 10:48:37 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.


`NeuralModule` is a subclass of `torch.nn.Module` that brings a few additional capabilities:
1. `Typing`: adds support for `Neural Type Checking` to the model. `Typing` is optional but quite useful.
2. `Serialization`: works with `OmegaConf` config dicts and YAML config files. All `NeuralModules` inherently support serialization/deserialization from such config dicts.
3. `FileIO`: an optional file serialization system.

In [3]:
class MyEmptyModule(NeuralModule):

    def forward(self):
        print("Neural Module - hello world")
        

In [4]:
x = MyEmptyModule()
x()

Neural Module - hello world


## Neural types

Almost all NeMo components inherit the class `Typing`. `Typing` is a simple class that adds 2 properties to the class that inherits it: `input_types` and `output_types`. A NeuralType is simply a semantic tensor. It contains info regarding the semantic shape the tensor should hold, as well as the semantic info of what the tensor represents.

So what semantic info does such a typed tensor contain? 

Across the DL domain, we often encounter cases where tensor shapes match but the semantics don't match at all. For example, let's take a look at the following rank 3 tensors:

In [5]:
# Case 1
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)
x = torch.randint(high=10, size=(1,5))
print("x: ", x)
print("embedding(x) :", embedding(x).shape)

x:  tensor([[0, 8, 9, 0, 3]])
embedding(x) : torch.Size([1, 5, 30])


In [6]:
# Case 2
lstm = torch.nn.LSTM(1, 30, batch_first=True)
x = torch.randn(1, 5, 1)
print("x: ", x)
print("lstm(x) :", lstm(x)[0].shape)

x:  tensor([[[ 0.8871],
         [-0.1232],
         [-0.7368],
         [-2.8013],
         [ 0.5705]]])
lstm(x) : torch.Size([1, 5, 30])


The ability to recognize that the 2 tensors do not represent the same semantic info is precisely why we utilize Neural Types. A Neural Type contains both the shape and the semantic concept of what the tensor represents. If we performed a neural type check between the 2 outputs above, it would raise an error saying that they are semantically different things (they are `INCOMPATIBLE` with each other).
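To make the idea concrete without depending on NeMo, here is a minimal plain-Python sketch. `SemanticType`, `Comparison`, and `compare` are hypothetical stand-ins invented for this illustration; they are not NeMo's classes and only mimic the idea of pairing axes with a semantic label.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Comparison(Enum):
    """Hypothetical comparison outcomes used only in this sketch."""
    SAME = 0
    INCOMPATIBLE = 1


@dataclass(frozen=True)
class SemanticType:
    """A description of the axes plus a label for what the values mean."""
    axes: Tuple[str, ...]   # e.g. ('B', 'T', 'C')
    element: str            # e.g. 'EmbeddedTextType'

    def compare(self, other: "SemanticType") -> Comparison:
        # Both the axes and the meaning must agree.
        if self.axes == other.axes and self.element == other.element:
            return Comparison.SAME
        return Comparison.INCOMPATIBLE


embedding_out = SemanticType(axes=("B", "T", "C"), element="EmbeddedTextType")
lstm_out = SemanticType(axes=("B", "T", "C"), element="EncodedRepresentation")

# Same axes, different meaning: the comparison reports a mismatch.
result = embedding_out.compare(lstm_out)
print(result)
```

A shape check alone would accept both values, since the axes are identical; only the semantic label tells them apart.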

## Neural Types - Usage

Neural Types are one of the core foundations of NeMo. While they are entirely *optional* and not intrusive, NeMo takes great care to support them so that there is no semantic incompatibility between the components that users combine.

In [7]:
from nemo.core.neural_types import NeuralType
from nemo.core.neural_types import *

In [8]:
class EmbeddingModule(NeuralModule):
    def __init__(self) -> None:
        super().__init__()
        self.embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)
    
    @typecheck()
    def forward(self, x):
        return self.embedding(x)
    
    @property
    def input_types(self):
        return {
            'x': NeuralType(axes=('B','T'), elements_type=Index())
        }

    @property
    def output_types(self):
        return {
            'y': NeuralType(axes=('B','T','C'), elements_type=EmbeddedTextType())
        }


Let's discuss how we added type checking support to the above class:
1. `forward` has a decorator `@typecheck()` on it
2. `input_types` and `output_types` properties are defined

Let's expand on each of the above steps.
- `@typecheck()` is a simple decorator that takes any class that inherits `Typing` (`NeuralModule` does this for us) and adds the 2 default properties of `input_types` and `output_types`, which by default return `None`.

The `@typecheck()` decorator's explicit use ensures that, by default, neural type checking is **disabled**. NeMo does not wish to intrude on the development process of models. So users can 'opt-in' to type checking by overriding the 2 properties. Therefore, the decorator ensures that users are not burdened with type checking before they wish to have it.

So what is `@typecheck()`? Simply put, you can wrap **any** function of a class that inherits `Typing` with this decorator, and it will look up the definition of the types of that class and enforce them. Typically, `torch.nn.Module` subclasses only implement `forward()`, so it is most common to wrap that method, but `@typecheck()` is a very flexible decorator.
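The opt-in behaviour described above can be sketched in plain Python. This is a toy decorator written for illustration, not NeMo's `@typecheck()`: the name `typecheck_sketch` and the classes `Untyped` and `Typed` are assumptions, and the "types" here are just strings.

```python
import functools


def typecheck_sketch(method):
    """Toy decorator: enforce declared inputs only when types are declared."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        declared = getattr(self, "input_types", None)
        if declared is None:
            # No types declared: checking is off, call straight through.
            return method(self, *args, **kwargs)
        if args:
            raise TypeError("typed methods must be called with keyword arguments")
        unknown = set(kwargs) - set(declared)
        if unknown:
            raise TypeError(f"unexpected arguments: {sorted(unknown)}")
        return method(self, **kwargs)
    return wrapper


class Untyped:
    input_types = None  # default: type checking stays disabled

    @typecheck_sketch
    def forward(self, x):
        return x * 2


class Typed(Untyped):
    @property
    def input_types(self):
        # Opting in: declare the expected argument names.
        return {"x": "Index"}


print(Untyped().forward(3))   # positional call allowed, checking is off
print(Typed().forward(x=3))   # keyword call passes the check
```

The decorator itself never changes; a class switches checking on simply by overriding the property, which is the design choice the paragraph above describes.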

As we see above, `@typecheck()` enforces the types. How, then, do we provide this type info to NeMo?

By overriding the `input_types` and `output_types` properties of the class, we can return a dictionary mapping a string name to a `NeuralType`. In the above case, we define a `NeuralType` with 2 components.
- `axes`: this is the semantic info carried by the axes themselves. The most common axes info is given in single-character notation.\
  `B`: Batch\
  `C` / `D`: Channel/Dimension (treated the same)\
  `T`: Time\
  `H` / `W`: Height/Width
- `elements_type`: this is the semantic info of "what the tensor represents". All such types are derived from the basic `ElementType`, and merely subclassing `ElementType` allows us to build a hierarchy of custom semantic types that can be used by NeMo.

Here we declare that the input has an `elements_type` of `Index` (index of the character in the vocabulary) and that the output has an `elements_type` of `EmbeddedTextType` (the text embedding).

In [9]:
embedding_module = EmbeddingModule()

In [10]:
from typing import *

class LSTMModule(NeuralModule):
    def __init__(self) -> None:
        super().__init__()
        self.lstm = torch.nn.LSTM(1,30,batch_first=True)

    @typecheck()
    def forward(self, x):
        return self.lstm(x)

    @property
    def input_types(self) -> Optional[Dict[str, NeuralType]]:
        return {
            'x': NeuralType(axes=('B', 'T', 'C'), elements_type=SpectrogramType())
        }
    
    @property
    def output_types(self) -> Optional[Dict[str, NeuralType]]:
        return {
            'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation())
        }

We changed the input to be a rank 3 tensor, now representing a `SpectrogramType`. We intentionally keep it generic: it can accept a `MelSpectrogramType` or a `MFCCSpectrogramType` as its input!

The output of an LSTM is now an `EncodedRepresentation`. Practically, this can be the output of a CNN layer, a Transformer block, or in this case, an LSTM layer.
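Declaring the generic `SpectrogramType` as the input works because more specific types can derive from it. The sketch below uses plain Python classes as stand-ins for the NeMo element types of the same names, and `accepts` is a hypothetical helper; it illustrates the inheritance idea only and is not how NeMo performs its comparison.

```python
class SpectrogramType:
    """Stand-in for a generic spectrogram element type."""


class MelSpectrogramType(SpectrogramType):
    """Stand-in for a more specific mel spectrogram type."""


class MFCCSpectrogramType(SpectrogramType):
    """Stand-in for a more specific MFCC spectrogram type."""


class EncodedRepresentation:
    """Stand-in for an unrelated element type."""


def accepts(declared, provided):
    """True when `provided` is `declared` or one of its descendants."""
    return issubclass(provided, declared)


print(accepts(SpectrogramType, MelSpectrogramType))     # specific fits generic
print(accepts(SpectrogramType, MFCCSpectrogramType))    # specific fits generic
print(accepts(SpectrogramType, EncodedRepresentation))  # unrelated type
print(accepts(MelSpectrogramType, SpectrogramType))     # generic does not fit specific
```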

In [11]:
lstm_module = LSTMModule()

---

Now for the test

In [12]:
# Case 1 [ERROR Cell]
x1 = torch.randint(high=10, size=(1,5))
print("x:",x1)
print("embedding(x):",embedding_module(x1).shape)

x: tensor([[3, 9, 3, 2, 9]])


TypeError: All arguments must be passed by kwargs only for typed methods

You might be wondering why we get a `TypeError` right off the bat. This `TypeError` is raised by design.

Positional arguments can cause significant issues during model development, mostly when the model/module design is not finalized. To reduce the potential for mistakes caused by wrong positional arguments and to enforce the names of arguments provided to the function, `Typing` requires you to **call all of your type-checked functions by kwargs only**.
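A small plain-Python illustration of the risk: when a signature is reordered during development, positional calls silently change meaning, while keyword calls do not. The functions `attention_v1` and `attention_v2` are hypothetical and exist only for this example.

```python
def attention_v1(query, key):
    """First draft of a function signature."""
    return f"query={query}, key={key}"


def attention_v2(key, query):
    """Later draft: the author reordered the parameters."""
    return f"query={query}, key={key}"


# A positional call silently changes meaning after the refactor.
positional_before = attention_v1("q", "k")
positional_after = attention_v2("q", "k")

# A keyword call means the same thing under both signatures.
keyword_before = attention_v1(query="q", key="k")
keyword_after = attention_v2(query="q", key="k")

print(positional_before, "|", positional_after)
print(keyword_before, "|", keyword_after)
```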

In [13]:
# Case 1
print("x:", x1)
print("embedding(x): ",embedding_module(x=x1).shape)

x: tensor([[3, 9, 3, 2, 9]])
embedding(x):  torch.Size([1, 5, 30])


In [15]:
# Case 2: ERROR Cell
x2 = torch.randn(1, 5, 1)
print("x:", x2)
print("lstm(x): ",lstm_module(x=x2).shape)

x: tensor([[[-0.2252],
         [ 0.7333],
         [-1.1834],
         [ 1.6252],
         [ 2.0465]]])


TypeError: Number of output arguments provided (2) is not as expected. It should be larger or equal than 1 and less or equal than 1.
This can be either because insufficient/extra number of output NeuralTypes were provided,or the provided NeuralTypes {'y': NeuralType(axis=(batch, time, dimension), element_type=EncodedRepresentation)} should enable container support (add '[]' to the NeuralType definition)

Inside our `LSTMModule` class, we declare the output types to be a single NeuralType: an `EncodedRepresentation` of shape [B,T,C].

But the output of an LSTM layer is a tuple of 1) the encoded representation of shape [B,T,C] and 2) another tuple containing 2 state values, the hidden state `h` and the cell state `c`, each of shape [num_layers*num_directions, B, C]!

So the neural type system raises an error saying that the number of output arguments does not match what is expected.

Let's fix the above

In [16]:
class CorrectLSTMModule(LSTMModule):
    @property
    def output_types(self) -> Optional[Dict[str, NeuralType]]:
        return {
            'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation()),
            'h_c': [NeuralType(axes=('D', 'B', 'C'), elements_type=EncodedRepresentation())]
        }

Note that for the `h_c` neural type, we wrap the `NeuralType` in a list `[]`. NeMo, by default, assumes that each `NeuralType` corresponds to a single returned value. However, an LSTM returns its 2 state tensors together as a tuple.

So we inform NeMo that this particular `NeuralType` is a single-dimensional list of items, and that each element of this list shares the same `NeuralType` and has the same shape.

NeMo then ensures that `h_c` is always a list of tensors. It will not check how many items are in the list, but it will ensure that the returned value is a list containing zero or more items, and that each of these items shares the same `NeuralType`.
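The container rule can be sketched in plain Python as follows. `Tagged` and `matches` are hypothetical names made up for this illustration, and labels are plain strings; the sketch only mirrors the rule stated above (any number of items, each of the same type) and is not NeMo's implementation.

```python
from dataclasses import dataclass


@dataclass
class Tagged:
    """A value paired with a semantic label (stand-in for a typed tensor)."""
    label: str
    payload: object


def matches(value, declared):
    """Toy check. `declared` is a label, or a one-item list wrapping a label."""
    if isinstance(declared, list):
        # Container case: zero or more items, each matching the inner label.
        if not isinstance(value, (list, tuple)):
            return False
        return all(matches(item, declared[0]) for item in value)
    # Single-value case: exactly one tagged value with the declared label.
    return isinstance(value, Tagged) and value.label == declared


y = Tagged("EncodedRepresentation", [0.1, 0.2])
h = Tagged("EncodedRepresentation", [0.3])
c = Tagged("EncodedRepresentation", [0.4])

print(matches(y, "EncodedRepresentation"))         # single value
print(matches((h, c), ["EncodedRepresentation"]))  # container of two items
print(matches((), ["EncodedRepresentation"]))      # empty container also passes
print(matches((h, c), "EncodedRepresentation"))    # container where one value is expected
```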

In [17]:
lstm_module = CorrectLSTMModule()

In [18]:
# Case 2
x2 = torch.randn(1,5,1)
y2, (h, c) = lstm_module(x=x2)
print("x: ",x2)
print("lstm(x): ", y2.shape) # The output of the LSTM RNN
print("hidden_state (h):", h.shape) # The final hidden state h of the LSTM
print("hidden_state (c):", c.shape) # The final cell state c of the LSTM


x:  tensor([[[-1.8756],
         [-0.7585],
         [ 1.0798],
         [ 0.8792],
         [-0.3083]]])
lstm(x):  torch.Size([1, 5, 30])
hidden_state (h): torch.Size([1, 1, 30])
hidden_state (c): torch.Size([1, 1, 30])


When `output_types` is overridden and valid torch tensors are returned, the attribute `neural_type` is attached to these tensors.

In [19]:
emb_out = embedding_module(x=x1)
lstm_out = lstm_module(x=x2)[0]

assert hasattr(emb_out, 'neural_type')
assert hasattr(lstm_out, 'neural_type')

In [20]:
print("Embedding tensor : ", emb_out.neural_type)
print("LSTM tensor : ", lstm_out.neural_type)

Embedding tensor :  axes: (batch, time, dimension); elements_type: EmbeddedTextType
LSTM tensor :  axes: (batch, time, dimension); elements_type: EncodedRepresentation


In [21]:
emb_out.neural_type.compare(lstm_out.neural_type)

<NeuralTypeComparisonResult.INCOMPATIBLE: 6>

In [22]:
emb_out.neural_type == lstm_out.neural_type

<NeuralTypeComparisonResult.INCOMPATIBLE: 6>

## Neural Types - Limitations



In [23]:
embedding_module = EmbeddingModule()
x1 = torch.randint(high=10, size=(1,5))

# Attach correct neural type
x1.neural_type =  NeuralType(('B','T'), Index())

print("embedding(x): ", embedding_module(x=x1).shape)

embedding(x):  torch.Size([1, 5, 30])


In [24]:
# Attach wrong neural type [ERROR CELL]
x1.neural_type = NeuralType(('B','T'), LabelsType())
print("embedding(x): ", embedding_module(x=x1).shape)


TypeError: NeuralTypeComparisonResult.INCOMPATIBLE :
Input type expected : axes: (batch, time); elements_type: Index
Input type found : axes: (batch, time); elements_type: LabelsType
Argument: x

## Let's create the minGPT components

Now that we have a somewhat firm grasp of neural type checking, let's begin porting the minGPT example code.

In [4]:
import math
from typing import List, Set, Dict, Tuple, Optional

import torch
import torch.nn as nn
from torch.nn import functional as F

## Creating Element types

Until now, we have used the Neural Types provided by the NeMo core. But we need not be restricted to the pre-defined element types!
Users have total flexibility in defining a hierarchy of element types as they please.

In [26]:
class AttentionType(EncodedRepresentation):
    """Basic Attention Element Type"""

class SelfAttentionType(AttentionType):
    """Self attention element type"""

class CausalSelfAttentionType(SelfAttentionType):
    """Causal Self Attention Element Type"""

## Creating the modules

Neural Modules are generally top-level modules, but they can be used at any level of the module hierarchy.

For demonstration, we will treat an encoder comprising a block of Causal Self Attention modules as a typed Neural Module. Of course, we can also treat each causal self attention layer itself as a neural module if we require it, but top-level modules are generally preferred.

In [5]:
class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use a torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here
    """

    def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop) -> None:
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        # key, query, value projections for all heads
        self.key = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        self.query = nn.Linear(n_embd, n_embd)
        # Regularization
        self.attn_drop = nn.Dropout(attn_pdrop)
        self.resid_drop = nn.Dropout(resid_pdrop)
        # output projection
        self.proj = nn.Linear(n_embd, n_embd)
        # Causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size))
                                        .view(1, 1, block_size, block_size))
        
    def forward(self, x, layer_mask=None):
        B, T, C = x.size()

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1,2) # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1,2) # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1,2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y
    
class Block(nn.Module):
    '''An unassuming Transformer block'''

    def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, block_size, n_head, attn_pdrop, resid_pdrop)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(resid_pdrop)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
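To see what the causal mask in `CausalSelfAttention` does to the attention weights, here is a minimal sketch in plain Python (no torch) for a single head and a single sequence. The function name `causal_attention_weights` is an assumption for this example; it mirrors only the mask and softmax steps, not the projections or dropout.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with a lower-triangular mask."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may only attend to positions 0..i; the rest become -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform raw scores make the effect of the mask easy to read.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but every entry above the diagonal is exactly zero, so a position never receives information from positions to its right.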

## Building the NeMo Model

Let's start by inheriting `ModelPT`, the core class of a PyTorch NeMo model, which inherits from the PyTorch Lightning `LightningModule`.

- The NeMo equivalent of `torch.nn.Module` is `NeuralModule`
- The NeMo equivalent of the `LightningModule` is `ModelPT`

In [2]:
import pytorch_lightning as ptl
from nemo.core import ModelPT
from omegaconf import OmegaConf

[NeMo W 2023-03-23 10:57:28 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.


Let's construct the bare minimum implementation of the NeMo model - just the constructor, the initializer of weights, and the forward method.

In [6]:
class PTLGPT(ptl.LightningModule):
    def __init__(self,
                #  Model definition args
                vocab_size: int, # size of the vocabulary (number of possible tokens)
                block_size: int, # length of the model's context window in time
                n_layer: int, # depth of the model; number of Transformer blocks in sequence
                n_embd: int, # the "width" of the model, number of channels in each Transformer
                n_head: int, # number of heads in each multi-head attention inside each Transformer block
                # model optimization args
                learning_rate: float = 3e-4, # the base learning rate of the model
                weight_decay: float = 0.1, # amount of regularizing L2 weight decay on MatMul ops
                betas: Tuple[float, float] = (0.9, 0.95), # momentum terms (betas) for the Adam optimizer
                embd_prop: float = 0.1, # \in [0,1]: amount of dropout on input embedding
                resid_pdrop: float = 0.1, # \in [0,1]:amount of dropout in each residual connection
                attn_pdrop: float = 0.1, # \in [0,1]:amount of dropout on the attention matrix
                ):
        super().__init__()

        # save these for optimizer init later
        self.learning_rate = learning_rate
        self.weight_decay = weight_decay
        self.betas = betas

        # input embedding stem: drop(content+position)
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))
        self.drop = nn.Dropout(embd_prop)
        # deep transformer: just a sequence of transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) for _ in range(n_layer)])
        # decoder: at the end one more layernorm and decode the answers
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f

        self.block_size = block_size
        self.apply(self._init_weights)

        print("number of parameters: %e" % sum(p.numel() for p in self.parameters()))

    def forward(self, idx):
        b, t = idx.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted"

        # forward the GPT model
        token_embedding = self.tok_emb(idx) # each index maps to a (learnable) vector
        position_embedding = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector
        x = self.drop(token_embedding + position_embedding)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        return logits
    
    def get_block_size(self):
        return self.block_size
    
    def _init_weights(self, module):
        """
        Vanilla model initialization:
        - all matmul weights ~ N(0, 0.02) and biases set to zero
        - all LayerNorm post-normalization scaling set to identity, so weight=1 bias=0
        """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)


In [7]:
m = PTLGPT(vocab_size=100, block_size=32, n_layer=1, n_embd=32, n_head=4)
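As a sanity check on the architecture, the number of trainable parameters for this configuration can be worked out by hand from the layer shapes. The sketch below is plain Python; the helper names are illustrative, and it assumes the usual parameter counts for `nn.Linear` (weights plus optional bias), `nn.LayerNorm` (one scale and one shift per channel), and `nn.Embedding` (one vector per entry).

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def layernorm_params(n):
    """One scale and one shift value per channel."""
    return 2 * n


def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    """Count trainable parameters from the layer shapes in the model class."""
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    attention = 4 * linear_params(n_embd, n_embd)  # key, query, value, proj
    mlp = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
    block = 2 * layernorm_params(n_embd) + attention + mlp
    head = linear_params(n_embd, vocab_size, bias=False)
    return tok_emb + pos_emb + n_layer * block + layernorm_params(n_embd) + head


total = gpt_param_count(vocab_size=100, block_size=32, n_layer=1, n_embd=32)
print(total)
```

For `vocab_size=100, block_size=32, n_layer=1, n_embd=32` this gives 20192 parameters. `n_head` does not appear because splitting channels across heads adds no weights, and the causal mask is registered as a buffer, so `self.parameters()` does not include it.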