### WORK IN PROGRESS


### Introduction

Welcome to this tutorial on implementing a Transformer in PyTorch! In it, you'll build, step by step, one of the most widely used architectures in deep learning.

Transformers have become the standard architecture for many natural language processing tasks, such as machine translation, text generation, and sentiment analysis, because self-attention lets every position in a sequence draw on every other position, which makes long-range dependencies and context easier to capture. Implementing a Transformer from scratch will deepen your understanding of deep learning and prepare you to tackle a wide range of sequence-based tasks.

Throughout this tutorial, we'll break down the core components of the Transformer architecture, including self-attention mechanisms, positional encodings, and feed-forward networks. You'll learn how to build each component step-by-step using PyTorch.




<p align="center">
<img src="transformerEncoder.png" alt="Drawing" style="width: 400px;"/>
</p>

Each Transformer layer is composed of two parts: multi-head attention and a multilayer perceptron (MLP).

<p align="center">
<img src="transformerLayer.png" alt="Drawing" style="width: 400px;"/>
</p>

## Table of Contents

1. [Multilayer Perceptron (MLP)](#Multilayer-Perceptron-(MLP))
2. [Multi-Head Attention](#Multi-Head-Attention)
3. [Transformer Blocks](#Transformer-Blocks)
4. [Positional Embedding](#Positional-Embedding)

In [None]:
import torch
import torch.nn as nn

### 1. Multilayer Perceptron (MLP)

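In this section, we'll look at the role of the Multilayer Perceptron (MLP) within the Transformer architecture. The MLP is applied to each position in the sequence independently, using the same weights at every position: it expands each token vector from `dim` to `dim * ratio` features, applies a non-linearity, and projects the result back to `dim`. Attention mixes information across positions; the MLP then transforms the features at each position.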
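
To make "each position independently" concrete, here is a minimal sketch. It assumes a toy stand-in built with `nn.Sequential` and small, arbitrary sizes (`dim = 8`, `hidden_dim = 32`), not the tutorial's own class. Because `nn.Linear` acts only on the last dimension, reordering the positions of the input should simply reorder the outputs in the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only (assumption, not from the tutorial)
dim, hidden_dim = 8, 32

# Stand-in position-wise MLP: Linear -> GELU -> Linear
mlp = nn.Sequential(
    nn.Linear(dim, hidden_dim),
    nn.GELU(),
    nn.Linear(hidden_dim, dim),
)

x = torch.randn(2, 5, dim)            # (batch, sequence length, dim)
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the 5 positions

out = mlp(x)                # process the original sequence
out_perm = mlp(x[:, perm])  # process the reordered sequence

# If positions are handled independently, permuting the input
# permutes the output in exactly the same way.
position_wise = torch.allclose(out[:, perm], out_perm, atol=1e-6)
print(position_wise)
```

The same check would fail for a layer that mixes information across positions without positional symmetry, such as a convolution over the sequence axis.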

In [None]:
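class Mlp(nn.Module):
    """Position-wise feed-forward network: Linear -> activation -> Dropout -> Linear."""

    def __init__(self, dim=768, ratio=4.0, dropout=0.0, activation=nn.GELU):
        super().__init__()

        self.dim = dim
        self.ratio = ratio
        self.dropout = dropout
        # `activation` is a module class (e.g. nn.GELU); it is instantiated here
        self.activation = activation()

        # Expand to the hidden width, then project back to `dim`
        hidden_dim = int(dim * ratio)
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

        self.drop = nn.Dropout(p=dropout)

    def forward(self, x):
        # x: (..., dim); nn.Linear acts on the last dimension only,
        # so every position is transformed with the same weights
        x = self.fc1(x)
        x = self.activation(x)
        x = self.drop(x)
        x = self.fc2(x)

        return x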
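
As a quick sanity check of the sizes implied by the defaults above (`dim=768`, `ratio=4.0`), the sketch below rebuilds just the two linear layers as standalone modules so that it runs on its own. The batch size and sequence length are arbitrary assumptions; inside the notebook, `Mlp()` could be called on the same input instead.

```python
import torch
import torch.nn as nn

dim, ratio = 768, 4.0
hidden_dim = int(dim * ratio)  # 3072 with the default values

# Standalone copies of the two linear layers, for illustration only
fc1 = nn.Linear(dim, hidden_dim)
fc2 = nn.Linear(hidden_dim, dim)

x = torch.randn(2, 16, dim)  # (batch, sequence length, dim): toy sizes
out = fc2(nn.GELU()(fc1(x)))

# Weights plus biases of both layers
n_params = sum(p.numel() for layer in (fc1, fc2) for p in layer.parameters())

print(out.shape)  # same shape as the input
print(n_params)   # 768*3072 + 3072 + 3072*768 + 768
```

The output keeps the input shape, which is what allows the MLP to sit inside a residual connection, and the hidden expansion accounts for nearly all of the roughly 4.7 million parameters.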
