In [None]:
'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

```markdown
# Large Language Models and Generative AI Systems

## Introduction

Communication is an essential aspect of human life, and the ability to express thoughts and ideas through language is a distinctive trait of our species. Writing, a significant breakthrough in human history, allowed us to preserve and transmit information over long distances and time.

Since the 1950s, artificial intelligence (AI) researchers have been fascinated by the idea of building machines that communicate. The Turing Test, proposed by Alan Turing, suggests that a machine can be considered intelligent if it can converse with a human without being recognized as a machine. The first chatbot, ELIZA, was introduced in the 1960s; it could match patterns and mimic conversation, but it lacked true intelligence.

It took over 60 years to reach a point where we can debate intelligence, originality, and novelty in AI systems. This progress is largely due to generative modeling, which has enabled machines to understand and generate human-like language.

Natural language processing (NLP) is a field that focuses on building machines that can manipulate human language, specifically text. NLP tasks include text classification, text correction, machine translation, semantic analysis, text generation, text summarization, named entity recognition, information retrieval, question answering, and chatbots.

The main challenge in NLP is representing text in a way that is useful for downstream tasks. Classical NLP methods focused on syntactic structures, but semantics is crucial for proper communication. The Word2Vec method, introduced by [2], revolutionized NLP by allowing neural networks to learn word embeddings from raw text, capturing semantic meaning.

## Large Language Models (LLMs)

### What are Large Language Models?

Large Language Models (LLMs) are language models parameterized by neural networks with millions or billions of weights. They have been developed to address the challenges of representing text and understanding its semantic meaning.

LLMs have had a significant impact on NLP, enabling machines to learn and generate human-like language. They have been applied to various tasks, including text classification, machine translation, and text generation.

### Natural Language Processing and Deep Learning

The combination of deep learning and NLP has given rise to LLMs. Deep learning has enabled the development of powerful neural network architectures that can learn complex representations of text. These models can capture semantic meaning and generate coherent and contextually relevant language.

### Multimodality and Generative AI Systems (GenAISys)

While LLMs focus on processing text, the field of Generative AI Systems (GenAISys) aims to develop multimodal AI systems that can understand and generate content across various modalities, including text, images, audio, and more.

GenAISys builds upon the advancements in LLMs and extends them to other modalities, enabling machines to understand and generate content in a more holistic and human-like manner.

## Conclusion

Large Language Models have revolutionized natural language processing and opened up new possibilities for AI systems. By combining deep learning and NLP, LLMs have enabled machines to learn and generate human-like language, leading to significant advancements in various NLP tasks.

The field of Generative AI Systems takes this a step further by aiming to develop multimodal AI systems that can understand and generate content across multiple modalities. This approach brings us closer to creating AI systems that can communicate and interact with humans in a more natural and intuitive way.



```

# Large Language Models (LLMs)

## General Architectures of LLMs

LLMs have three main types of architectures:

1. **Encoders**: These LLMs take a piece of text (a string) and return an encoding, which is a numerical representation of the input. Encoders have access to the entire input at every point during processing, so the network needs no causality constraint. They produce their output in a single forward pass during both training and inference.

2. **Decoders**: This class of LLMs is used for generating new text. They are autoregressive models, meaning the neural networks used to parameterize them must be causal: the prediction for a token may depend only on the tokens that precede it. Sampling is iterative, one token at a time, which typically makes decoders slower than encoders at inference.

3. **Encoder-Decoders and Encoder-Encoders**: LLMs can be conditional. In that case, one encoder processes the conditioning, and a second network either encodes the input text (encoder-encoder) or generates new text (encoder-decoder). Figure 11.2a shows a schematic representation of an encoder or decoder, while Figure 11.2b illustrates an encoder-encoder or encoder-decoder.
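
To make the causality requirement on decoders concrete, here is a minimal sketch of the lower-triangular mask that lets position `t` see only positions `0..t`. The helper name `causal_mask` is illustrative and does not come from the source.

```python
def causal_mask(num_tokens):
    """Return a T x T matrix; entry [t][s] is True if position t may attend to position s."""
    return [[s <= t for s in range(num_tokens)] for t in range(num_tokens)]


mask = causal_mask(4)
for row in mask:
    # 'x' marks an allowed connection, '.' a blocked (future) one
    print("".join("x" if allowed else "." for allowed in row))
```

An encoder would use an all-`True` matrix instead, since it may look at the entire input.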

## Parameterizations

The parameterization of LLMs is a crucial aspect, and it involves using hierarchical, deep neural networks. The choice of neural network architecture depends on the specific requirements of the LLM.

### Recurrent Neural Networks (RNNs)

RNNs were among the first neural networks used for language modeling due to their intrinsic sequential structure. They were used to formulate RNN-based decoders [4] and encoder-decoders [5]. However, RNNs tend to forget: information from early tokens fades as the sequence grows, which makes long-range dependencies hard to model.

### Convolutional Neural Networks (CNNs)

CNNs were successfully used in the first encoders [7] and later in decoders [8] and encoder-decoders [9]. The combination of CNNs and RNNs showed improvements in language modeling [6]. However, CNN-based language models also faced scaling issues.

### Transformers

The introduction of transformers [10] was a significant breakthrough for LLMs. Transformers are built from (multi-head) attention layers combined with layer normalization and positional embeddings. They scale well, which has led to models with billions of weights. However, attention has quadratic time and memory complexity in the number of tokens.
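
The quadratic cost can be checked with simple arithmetic: each attention head computes one score per pair of tokens. The sketch below counts those scores; the function name and the sequence lengths are illustrative assumptions.

```python
def attention_score_entries(num_tokens, num_heads=1):
    """Number of attention scores per layer: one T x T matrix per head."""
    return num_heads * num_tokens * num_tokens


for T in (1_000, 2_000, 4_000):
    print(T, attention_score_entries(T))
```

Doubling the number of tokens quadruples the number of scores, which is why long contexts are expensive.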

### Lean Large Language Models (L3M)

There is a growing focus on making LLMs leaner, known as L3M. The main aspects of obtaining L3M include:

1. **Quantization**: Reducing the number of bits per weight to lower storage and memory requirements.
2. **Faster Training and Inference**: Various methods have been proposed to speed up training and inference, such as FlashAttention [11] and the linear transformer [13].
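
As a rough illustration of quantization, the sketch below maps floating-point weights to 8-bit integers with a single scale factor and then maps them back. The symmetric per-tensor scheme and the function names are assumptions chosen for simplicity; the source does not prescribe a particular scheme.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [qi * scale for qi in q]


weights = [0.12, -0.5, 0.03, 0.49, -0.27]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_err)
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error of at most half the scale.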

### Selective State Space Models (S3Ms)

S3Ms are built on state space models and operate like RNNs at inference time, offering fully parallelizable training. They provide an alternative approach to transformer-based language models.
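
The two views of a state space model can be sketched with a scalar, non-selective linear model (a deliberate simplification). The recurrent form processes one token at a time, like an RNN, while the unrolled form computes every output independently. Selective variants make the parameters depend on the input and rely on a parallel scan rather than a fixed kernel; that part is not shown here. All names and values below are illustrative.

```python
def ssm_recurrent(x, a, b, c):
    """RNN-style evaluation: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def ssm_unrolled(x, a, b, c):
    """Same outputs, each computed independently: y_t = sum_j (c * a**j * b) * x_{t-j}."""
    kernel = [c * (a ** j) * b for j in range(len(x))]
    return [sum(kernel[j] * x[t - j] for j in range(t + 1)) for t in range(len(x))]


x = [1.0, 0.5, -0.25, 2.0]
y_rec = ssm_recurrent(x, a=0.9, b=0.5, c=2.0)
y_par = ssm_unrolled(x, a=0.9, b=0.5, c=2.0)
print(y_rec)
print(y_par)
```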

## Conclusion

The parameterization and architecture of LLMs are critical aspects that determine their performance and efficiency. The choice of neural network architecture depends on the specific requirements and constraints of the LLM. Transformers have revolutionized LLMs, but there is ongoing research to make them more efficient and scalable.


# Learning Large Language Models (LLMs)

## Training Procedure of Foundation Models (FMs) and LLMs

The training procedure of LLMs typically involves two stages:

1. **Pre-training**: This initial stage aims to prepare an LLM for further tasks. The model is trained using either the masked loss or the negative log-likelihood. The masked loss is used for encoders, while the negative log-likelihood is typically used for decoders. Pre-training requires a large amount of data so that the model learns general patterns, such as grammar, word co-occurrences, or the syntax of specific programming languages.

2. **Fine-tuning**: A pre-trained model is further specialized on another dataset for a downstream task. This stage involves fine-tuning the LLM on specific data, such as legal data for generating legal documents or a new programming language. LLMs can also be fine-tuned for various tasks like text summarization, Q&A, text classification, sentiment analysis, and more.

## Losses and Objectives

During the training process, different losses and objectives are used depending on the task:

- **Masked Loss**: Used for encoders, it aims to reconstruct masked tokens from the input.
- **Negative Log-Likelihood**: Typically utilized by decoders, it minimizes the negative log-likelihood of the generated output.
- **Additional Neural Network**: For certain tasks, an additional neural network is optimized using a different objective, denoted as $\mathcal{L}_{\text{pred}}(\theta, \phi)$.
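
The first two objectives differ mainly in which positions contribute to the loss. The sketch below assumes the model's predicted distributions are already given as lists of probabilities; the function names and numbers are illustrative.

```python
import math


def nll(probs, targets):
    """Autoregressive loss: probs[t] is the predicted distribution for token t given tokens < t."""
    return -sum(math.log(p[y]) for p, y in zip(probs, targets))


def masked_loss(probs, targets, masked_positions):
    """Masked loss: only the masked-out positions contribute to the objective."""
    return -sum(math.log(probs[t][targets[t]]) for t in masked_positions)


# Toy vocabulary of 3 token values and a sequence of 3 tokens.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
targets = [0, 1, 2]
print(nll(probs, targets))
print(masked_loss(probs, targets, masked_positions=[1]))
```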

## Combining Losses

It is possible to combine different losses, such as $\mathcal{L}_{\alpha}(\theta) = \mathcal{L}(\theta) + \alpha \mathcal{L}_{\text{masked}}(\theta)$, which can be seen as a penalized negative log-likelihood objective. This idea has been utilized in pre-training LLMs for various problems simultaneously [25] or for pre-training LLMs for molecules [26].

## Fine-tuning Challenges and Solutions

Fine-tuning LLMs can be challenging because of their size: updating every weight is computationally expensive and time-consuming. To address this issue, techniques like Low-Rank Adaptation (LoRA) [27] have been proposed. LoRA introduces two learnable low-rank matrices, $A$ and $B$, whose shared inner dimension (the rank $r$) is much smaller than the dimensions of $W$, and trains only these while keeping the original weight matrix $W$ fixed. The forward pass is then calculated as:

$$h_l = W h_{l-1} + B A h_{l-1}$$

LoRA trains only a small number of additional weights, typically less than 1% of the LLM's weight count. This makes fine-tuning far cheaper while achieving results comparable to full fine-tuning.
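
A minimal, dependency-free sketch of the LoRA forward pass and of the resulting weight budget follows. The shapes ($W$: $d \times d$, $A$: $r \times d$, $B$: $d \times r$), the zero initialization of $B$, and the values `d=4096`, `r=8` are assumptions chosen for illustration.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def lora_forward(W, A, B, h):
    """h_l = W h_{l-1} + B A h_{l-1}; W stays frozen, only A and B would be trained."""
    base = matvec(W, h)
    update = matvec(B, matvec(A, h))
    return [x + y for x, y in zip(base, update)]


def lora_fraction(d, r):
    """Trainable LoRA weights (A: r x d, B: d x r) relative to a d x d matrix W."""
    return (2 * d * r) / (d * d)


# Tiny example with d = 3 and r = 1. B starts at zero, so the adapted layer initially equals W h.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.1]]
B = [[0.0], [0.0], [0.0]]
h = [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, h))
print(lora_fraction(d=4096, r=8))
```

With these illustrative sizes the adapter holds well under 1% of the weights of the matrix it adapts.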

## Conclusion

The training procedure of LLMs involves pre-training and fine-tuning stages, with different losses and objectives depending on the task. Fine-tuning LLMs can be computationally intensive, but techniques like LoRA provide an efficient solution by updating only a small subset of the model's weights.

# teenyGPT: A Tiny Implementation of a Decoder LLM

## Loss Function

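The loss function used in teenyGPT is the negative log-likelihood, implemented in the `LossFun` class. Its `forward` method takes the model's output (`y_model`) and the true labels (`y_true`). Because it wraps `nn.NLLLoss`, `y_model` must contain log-probabilities (the model applies `log_softmax` before returning), not raw logits. With `reduction='sum'` the per-token losses are summed over the whole batch; with `reduction='mean'` they are summed per sequence and then averaged over the batch.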

```python
class LossFun(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss = nn.NLLLoss(reduction='none')

    def forward(self, y_model, y_true, reduction='sum'):
        # y_model: B(atch) x T(okens) x V(alues)
        # y_true: B x T
        B, T, V = y_model.size()

        y_model = y_model.view(B * T, V)
        y_true = y_true.view(B * T)

        loss_matrix = self.loss(y_model, y_true)  # B*T

        if reduction == 'sum':
            return torch.sum(loss_matrix)
        elif reduction == 'mean':
            loss_matrix = loss_matrix.view(B, T)
            return torch.mean(torch.sum(loss_matrix, 1))
        else:
            raise ValueError("Reduction could be either 'sum' or 'mean'.")
```
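
The `'mean'` reduction in `LossFun` averages per-sequence sums over the batch rather than averaging over individual tokens. The following dependency-free sketch reproduces that behavior on a small table of per-token losses; the function name `nll_reduce` and the numbers are illustrative.

```python
def nll_reduce(token_losses, reduction="sum"):
    """token_losses: B x T nested list of per-token negative log-likelihoods."""
    per_sequence = [sum(row) for row in token_losses]
    if reduction == "sum":
        return sum(per_sequence)
    if reduction == "mean":
        return sum(per_sequence) / len(per_sequence)
    raise ValueError("Reduction could be either 'sum' or 'mean'.")


token_losses = [[1.0, 2.0, 3.0], [0.5, 0.5, 1.0]]
print(nll_reduce(token_losses, "sum"))   # 8.0
print(nll_reduce(token_losses, "mean"))  # 4.0
```

A per-token average would give 8.0 / 6 instead, so the reported `'mean'` value grows with sequence length.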

## Transformer Block

The transformer block is the core component of teenyGPT. It is built on PyTorch's `nn.MultiheadAttention` layer. The `TransformerBlock` class takes the embedding dimension (`num_emb`), a width multiplier for the MLP (`num_neurons`; the hidden layer has `num_neurons * num_emb` units), and the number of attention heads (`num_heads`).

```python
class TransformerBlock(nn.Module):
    def __init__(self, num_emb, num_neurons, num_heads=4):
        super().__init__()

        # hyperparameters
        self.D = num_emb
        self.H = num_heads
        self.neurons = num_neurons

        # components
        self.msha = nn.MultiheadAttention(embed_dim=self.D, num_heads=self.H, batch_first=True)
        self.layer_norm1 = nn.LayerNorm(self.D)
        self.layer_norm2 = nn.LayerNorm(self.D)

        self.mlp = nn.Sequential(
            nn.Linear(self.D, self.neurons * self.D),
            nn.GELU(),
            nn.Linear(self.neurons * self.D, self.D)
        )

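    def forward(self, x, causal=True):
        # Multi-Head Self-Attention
        # Boolean mask: True above the diagonal blocks attention to future tokens.
        T = x.shape[1]
        attn_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1) if causal else None
        x_attn, _ = self.msha(x, x, x, attn_mask=attn_mask, need_weights=False)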

        # LayerNorm
        x = self.layer_norm1(x_attn + x)

        # MLP
        x_mlp = self.mlp(x)

        # LayerNorm
        x = self.layer_norm2(x_mlp + x)

        return x
```

## teenyGPT Class

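The `teenyGPT` class defines the forward pass for the transformer and the sampling procedure. It also includes an auxiliary metric, top-one reconstruction accuracy, which checks whether the most probable token predicted at each position matches the next token of the input. Note that `top1_rec` returns the sum of per-sequence accuracies over the batch, so it must be divided by the number of sequences to obtain an average.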

```python
class teenyGPT(nn.Module):
    def __init__(self, num_tokens, num_token_vals, num_emb, num_neurons, num_heads=2, dropout_prob=0.1, num_blocks=10, device='cpu'):
        super().__init__()

        # hyperparameters
        self.device = device
        self.num_tokens = num_tokens
        self.num_token_vals = num_token_vals
        self.num_emb = num_emb
        self.num_blocks = num_blocks

        # embedding layer
        self.embedding = torch.nn.Embedding(num_token_vals, num_emb)

        # positional embedding
        self.positional_embedding = nn.Embedding(num_tokens, num_emb)

        # transformer blocks
        self.transformer_blocks = nn.ModuleList()
        for _ in range(num_blocks):
            self.transformer_blocks.append(
                TransformerBlock(num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads)
            )

        # output layer (logits + softmax)
        self.logits = nn.Sequential(nn.Linear(num_emb, num_token_vals))

        # dropout layer
        self.dropout = nn.Dropout(dropout_prob)

        # loss function
        self.loss_fun = LossFun()

    def transformer_forward(self, x, causal=True, temperature=1.0):
        # x: B(atch) x T(okens)

        # embedding of tokens
        x = self.embedding(x)  # B x T x D

        # embedding of positions
        pos = torch.arange(0, x.shape[1], dtype=torch.long).unsqueeze(0).to(self.device)
        pos_emb = self.positional_embedding(pos)

        # dropout of embedding of inputs
        x = self.dropout(x + pos_emb)

        # transformer blocks
        for block in self.transformer_blocks:
            x = block(x, causal=causal)

        # output logits
        out = self.logits(x)

        return F.log_softmax(out / temperature, 2)

    @torch.no_grad()
    def sample(self, batch_size=4, temperature=1.0):
        x_seq = np.asarray([[self.num_token_vals - 1] for i in range(batch_size)])

        # sample next tokens
        for i in range(self.num_tokens - 1):
            xx = torch.tensor(x_seq, dtype=torch.long, device=self.device)
            x_log_probs = self.transformer_forward(xx, temperature=temperature)
            x_i_sample = torch.multinomial(torch.exp(x_log_probs[:, i]), 1).to(self.device)
            x_seq = np.concatenate((x_seq, x_i_sample.to('cpu').detach().numpy()), 1)

        return x_seq

    @torch.no_grad()
    def top1_rec(self, x, causal=True):
        x_prob = torch.exp(self.transformer_forward(x, causal=True))[:, :-1, :].contiguous()
        _, x_rec_max = torch.max(x_prob, dim=2)
        return torch.sum(torch.mean((x_rec_max.float() == x[:, 1:].float().to(self.device)).float(), 1).float())

    def forward(self, x, causal=True, temperature=1.0, reduction='mean'):
        log_prob = self.transformer_forward(x, causal=causal, temperature=temperature)
        return self.loss_fun(log_prob[:, :-1].contiguous(), x[:, 1:].contiguous(), reduction=reduction)
```
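
The `temperature` argument divides the logits before the softmax. A small dependency-free sketch of its effect follows; the function name and the logits are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the most likely token, while temperatures above 1 flatten the distribution and make samples more diverse.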

## Results

With 128 neurons in MLPs, 8 attention heads, 4 transformer blocks, and an embedding size of 32, teenyGPT has approximately 1 million weights. The model's performance is illustrated in Figure 11.3 and Tables 11.3 and 11.4. As you can see, even with a small amount of data and a limited number of weights, teenyGPT trains successfully and achieves reasonable performance.
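
The weight count can be checked by hand. The sketch below tallies the parameters of the layers defined above, assuming the values `num_tokens=100` and `num_token_vals=20` from the notebook's example-usage cell (the experiments behind the reported figures may have used different ones). The helper names are illustrative.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of a fully connected layer."""
    return d_in * d_out + d_out


def block_params(num_emb, num_neurons):
    # query, key, value and output projections of multi-head attention
    attention = 4 * linear_params(num_emb, num_emb)
    # scale and shift for each of the two LayerNorms
    layer_norms = 2 * 2 * num_emb
    # the MLP widens to num_neurons * num_emb hidden units
    hidden = num_neurons * num_emb
    mlp = linear_params(num_emb, hidden) + linear_params(hidden, num_emb)
    return attention + layer_norms + mlp


def teenygpt_params(num_tokens, num_token_vals, num_emb, num_neurons, num_blocks):
    embeddings = num_token_vals * num_emb + num_tokens * num_emb
    output = linear_params(num_emb, num_token_vals)
    return embeddings + num_blocks * block_params(num_emb, num_neurons) + output


total = teenygpt_params(num_tokens=100, num_token_vals=20, num_emb=32, num_neurons=128, num_blocks=4)
print(total)
```

Almost all of the weights sit in the MLPs, because their hidden width is `num_neurons * num_emb`. The number of heads does not change the count.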

![image.png](attachment:image.png)
Figure 11.3: Examples of results: (a) Negative log-likelihood calculated on the validation set. (b) Top-one reconstruction accuracy calculated on the validation set.

## Conclusion

The teenyGPT implementation demonstrates the feasibility of training a small decoder LLM with a limited amount of data and a relatively small number of weights. Despite its simplicity, teenyGPT achieves reasonable performance, showcasing the potential of LLMs in natural language processing tasks.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class LossFun(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss = nn.NLLLoss(reduction='none')

    def forward(self, y_model, y_true, reduction='sum'):
        B, T, V = y_model.size()
        y_model = y_model.view(B * T, V)
        y_true = y_true.view(B * T)
        loss_matrix = self.loss(y_model, y_true)
        if reduction == 'sum':
            return torch.sum(loss_matrix)
        elif reduction == 'mean':
            loss_matrix = loss_matrix.view(B, T)
            return torch.mean(torch.sum(loss_matrix, 1))
        else:
            raise ValueError("Reduction could be either 'sum' or 'mean'.")

class TransformerBlock(nn.Module):
    def __init__(self, num_emb, num_neurons, num_heads=4):
        super().__init__()
        self.D = num_emb
        self.H = num_heads
        self.neurons = num_neurons
        # batch_first=True so inputs are B x T x D, matching teenyGPT.transformer_forward
        self.msha = nn.MultiheadAttention(embed_dim=self.D, num_heads=self.H, batch_first=True)
        self.layer_norm1 = nn.LayerNorm(self.D)
        self.layer_norm2 = nn.LayerNorm(self.D)
        self.mlp = nn.Sequential(
            nn.Linear(self.D, self.neurons * self.D),
            nn.GELU(),
            nn.Linear(self.neurons * self.D, self.D)
        )

    def forward(self, x, causal=True):
        # Boolean mask: True above the diagonal blocks attention to future tokens
        T = x.shape[1]
        attn_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1) if causal else None
        x_attn, _ = self.msha(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.layer_norm1(x_attn + x)
        x_mlp = self.mlp(x)
        x = self.layer_norm2(x_mlp + x)
        return x



class teenyGPT(nn.Module):
    def __init__(self, num_tokens, num_token_vals, num_emb, num_neurons, num_heads=2, dropout_prob=0.1, num_blocks=10, device='cpu'):
        super().__init__()
        self.device = device
        self.num_tokens = num_tokens
        self.num_token_vals = num_token_vals
        self.num_emb = num_emb
        self.num_blocks = num_blocks
        self.embedding = torch.nn.Embedding(num_token_vals, num_emb)
        self.positional_embedding = nn.Embedding(num_tokens, num_emb)
        self.transformer_blocks = nn.ModuleList()
        for _ in range(num_blocks):
            self.transformer_blocks.append(
                TransformerBlock(num_emb=num_emb, num_neurons=num_neurons, num_heads=num_heads)
            )
        self.logits = nn.Sequential(nn.Linear(num_emb, num_token_vals))
        self.dropout = nn.Dropout(dropout_prob)
        self.loss_fun = LossFun()

    def transformer_forward(self, x, causal=True, temperature=1.0):
        x = self.embedding(x)
        pos = torch.arange(0, x.shape[1], dtype=torch.long).unsqueeze(0).to(self.device)
        pos_emb = self.positional_embedding(pos)
        x = self.dropout(x + pos_emb)
        for i in range(self.num_blocks):
            x = self.transformer_blocks[i](x)
        out = self.logits(x)
        return F.log_softmax(out / temperature, 2)

    @torch.no_grad()
    def sample(self, batch_size=4, temperature=1.0):
        x_seq = np.asarray([[self.num_token_vals - 1] for i in range(batch_size)])
        for i in range(self.num_tokens - 1):
            xx = torch.tensor(x_seq, dtype=torch.long, device=self.device)
            x_log_probs = self.transformer_forward(xx, temperature=temperature)
            x_i_sample = torch.multinomial(torch.exp(x_log_probs[:, i]), 1).to(self.device)
            x_seq = np.concatenate((x_seq, x_i_sample.to('cpu').detach().numpy()), 1)
        return x_seq

    @torch.no_grad()
    def top1_rec(self, x, causal=True):
        x_prob = torch.exp(self.transformer_forward(x, causal=True))[:, :-1, :].contiguous()
        _, x_rec_max = torch.max(x_prob, dim=2)
        return torch.sum(torch.mean((x_rec_max.float() == x[:, 1:].float().to(self.device)).float(), 1).float())

    def forward(self, x, causal=True, temperature=1.0, reduction='mean'):
        log_prob = self.transformer_forward(x, causal=causal, temperature=temperature)
        return self.loss_fun(log_prob[:, :-1].contiguous(), x[:, 1:].contiguous(), reduction=reduction)

# Example usage
num_tokens = 100
num_token_vals = 20
num_emb = 32
num_neurons = 128
num_heads = 8
dropout_prob = 0.1
num_blocks = 4
device = 'cpu'

model = teenyGPT(num_tokens, num_token_vals, num_emb, num_neurons, num_heads, dropout_prob, num_blocks, device)

# Example input data
x = torch.randint(0, num_token_vals, (4, num_tokens))

# Forward pass
loss = model(x)
print(f"Loss: {loss.item()}")

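# Sampling (switch to eval mode so dropout is disabled)
model.eval()
samples = model.sample(batch_size=4)
print(f"Sampled sequences: {samples}")

# Top-one reconstruction accuracy; top1_rec returns the sum of per-sequence
# accuracies over the batch, so divide by the batch size to get an average
acc = model.top1_rec(x) / x.shape[0]
print(f"Top-one reconstruction accuracy: {acc.item()}")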


In [None]:
import math
import random

# Dependency-free toy mirror of the PyTorch classes above (plain lists, untrained
# random weights). Simplifications: attention is single-head with no learned
# projections, the MLP hidden width is num_neurons, and dropout is omitted.

def linear(x, w, b):
    # x: D_in vector, w: D_out x D_in matrix, b: D_out vector
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def log_softmax(v):
    m = max(v)
    lse = m + math.log(sum(math.exp(a - m) for a in v))
    return [a - lse for a in v]

def init_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class LossFun:
    def forward(self, y_model, y_true, reduction='sum'):
        # y_model: B x T x V log-probabilities, y_true: B x T token ids
        per_seq = [-sum(lp[t] for lp, t in zip(seq_lp, seq_t)) for seq_lp, seq_t in zip(y_model, y_true)]
        if reduction == 'sum':
            return sum(per_seq)
        elif reduction == 'mean':
            return sum(per_seq) / len(per_seq)
        raise ValueError("Reduction could be either 'sum' or 'mean'.")

class TransformerBlock:
    def __init__(self, num_emb, num_neurons, num_heads=4, rng=None):
        rng = rng or random.Random(0)
        self.D = num_emb
        self.H = num_heads  # kept for interface parity; unused in this single-head toy
        self.neurons = num_neurons
        self.w1 = init_matrix(rng, num_neurons, num_emb)
        self.b1 = [0.0] * num_neurons
        self.w2 = init_matrix(rng, num_emb, num_neurons)
        self.b2 = [0.0] * num_emb

    def forward(self, x):
        # x: T x D (one sequence); residual connection + layer norm after each sub-layer
        x_attn = self.msha(x)
        x = [self.layer_norm([a + b for a, b in zip(r, s)]) for r, s in zip(x_attn, x)]
        x_mlp = [self.mlp(r) for r in x]
        return [self.layer_norm([a + b for a, b in zip(r, s)]) for r, s in zip(x_mlp, x)]

    def msha(self, x):
        # causal scaled dot-product self-attention with q = k = v = x
        out = []
        for t, q in enumerate(x):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(self.D) for k in x[:t + 1]]
            w = [math.exp(s) for s in log_softmax(scores)]
            out.append([sum(wi * v[d] for wi, v in zip(w, x)) for d in range(self.D)])
        return out

    def layer_norm(self, v, eps=1e-5):
        mu = sum(v) / len(v)
        var = sum((a - mu) ** 2 for a in v) / len(v)
        return [(a - mu) / math.sqrt(var + eps) for a in v]

    def mlp(self, v):
        h = linear(v, self.w1, self.b1)
        h = [0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0))) for a in h]  # GELU
        return linear(h, self.w2, self.b2)

class teenyGPT:
    # dropout_prob and device are accepted for interface parity but unused; seed is added for reproducibility
    def __init__(self, num_tokens, num_token_vals, num_emb, num_neurons, num_heads=2, dropout_prob=0.1, num_blocks=10, device='cpu', seed=0):
        self.rng = random.Random(seed)
        self.num_tokens = num_tokens
        self.num_token_vals = num_token_vals
        self.embedding = init_matrix(self.rng, num_token_vals, num_emb)
        self.positional_embedding = init_matrix(self.rng, num_tokens, num_emb)
        self.transformer_blocks = [TransformerBlock(num_emb, num_neurons, num_heads, self.rng) for _ in range(num_blocks)]
        self.logits_w = init_matrix(self.rng, num_token_vals, num_emb)
        self.logits_b = [0.0] * num_token_vals
        self.loss_fun = LossFun()

    def __call__(self, x, temperature=1.0, reduction='mean'):
        return self.forward(x, temperature, reduction)

    def transformer_forward(self, x, temperature=1.0):
        # x: B x T token ids -> B x T x V log-probabilities
        out = []
        for seq in x:
            h = [[e + p for e, p in zip(self.embedding[tok], self.positional_embedding[t])] for t, tok in enumerate(seq)]
            for block in self.transformer_blocks:
                h = block.forward(h)
            out.append([log_softmax([z / temperature for z in linear(r, self.logits_w, self.logits_b)]) for r in h])
        return out

    def forward(self, x, temperature=1.0, reduction='mean'):
        log_prob = self.transformer_forward(x, temperature)
        # the output at position t predicts token t + 1
        return self.loss_fun.forward([lp[:-1] for lp in log_prob], [seq[1:] for seq in x], reduction)

    def sample(self, batch_size=4, temperature=1.0):
        # every sequence starts with the last token value, as in the PyTorch version
        x_seq = [[self.num_token_vals - 1] for _ in range(batch_size)]
        for _ in range(self.num_tokens - 1):
            x_log_probs = self.transformer_forward(x_seq, temperature=temperature)
            for seq, lp in zip(x_seq, x_log_probs):
                probs = [math.exp(a) for a in lp[-1]]
                seq.append(self.rng.choices(range(self.num_token_vals), weights=probs)[0])
        return x_seq

    def top1_rec(self, x):
        # average fraction of positions where the most probable token equals the next input token
        acc = 0.0
        for seq, lp in zip(x, self.transformer_forward(x)):
            pred = [row.index(max(row)) for row in lp[:-1]]
            acc += sum(p == t for p, t in zip(pred, seq[1:])) / len(pred)
        return acc / len(x)

# Example usage (sizes kept small because pure Python is slow)
num_tokens, num_token_vals, num_emb, num_neurons = 8, 20, 8, 16
num_heads, dropout_prob, num_blocks, device = 2, 0.1, 2, 'cpu'

model = teenyGPT(num_tokens, num_token_vals, num_emb, num_neurons, num_heads, dropout_prob, num_blocks, device)

# Example input data: 4 sequences of token ids
x = [[(b + t) % num_token_vals for t in range(num_tokens)] for b in range(4)]

print(f"Loss: {model(x)}")
print(f"Sampled sequences: {model.sample(batch_size=4)}")
print(f"Top-one reconstruction accuracy: {model.top1_rec(x)}")
