# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
**See the full paper [here](https://arxiv.org/pdf/1810.04805.pdf).**

## Architecture
For the architecture, the BERT paper refers to the original implementation of the multi-layer bidirectional
Transformer encoder described in [Vaswani et al. (2017)](https://arxiv.org/pdf/1706.03762.pdf). BERT uses only the encoder stack; it has no decoder.
So for this part, I am following the architecture described in that paper.

![architecture](../img/model-architecture.png)

### Requirements

In [1]:
import torch
import copy
from torch import nn

### BERT Encoder Stack
* BERT takes as input a sequence of plain-text tokens.
* The output is a representation vector for each token, with the size of the hidden layers.
* BERT is a stack of bidirectional Transformer encoder layers.

In [2]:
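class Bert(nn.Module):
    def __init__(self, encoder, stack_size=6):
        super(Bert, self).__init__()
        # Each layer is an independent deep copy of the given encoder.
        self.encoderLayer = nn.ModuleList()
        for i in range(stack_size):
            self.encoderLayer.append(copy.deepcopy(encoder))
    def forward(self, tokens):
        # Pass the tokens through every encoder layer exactly once, in order.
        # (Calling encoderLayer[0] before the loop would apply the first layer twice.)
        representation = tokens
        for encoder in self.encoderLayer:
            representation = encoder(representation)
        return representation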
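
The stacking idea can be checked in isolation with a small sketch. The class name `EncoderStack` and the use of `nn.Linear` as a placeholder layer are assumptions made for illustration only; a real layer would be a Transformer encoder. The sketch shows that the shape is preserved through the stack and that deep copies start with equal weights but do not share parameters.

```python
import copy
import torch
from torch import nn

class EncoderStack(nn.Module):
    """Hypothetical stand-in for the stack above: N deep-copied layers applied in order."""
    def __init__(self, layer, stack_size=6):
        super().__init__()
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(stack_size)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

torch.manual_seed(0)
hidden_size = 8
# nn.Linear is only a placeholder layer here, not a Transformer encoder.
stack = EncoderStack(nn.Linear(hidden_size, hidden_size), stack_size=3)
tokens = torch.randn(5, hidden_size)   # 5 token vectors
out = stack(tokens)
print(out.shape)

# Deep copies begin with identical values but own separate parameter storage,
# so the layers can diverge during training.
first, second = stack.layers[0], stack.layers[1]
shares_storage = first.weight.data_ptr() == second.weight.data_ptr()
print(shares_storage)
```

Because every layer maps `hidden_size` to `hidden_size`, any number of layers can be chained; a layer whose output size differs from its input size could not be stacked this way.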

### Encoder

* The encoder is composed of two modules: the attention module and the feed-forward network module.

* The implementation below processes the tokens sequentially, but the computation for each token is independent of the others' results and could be performed concurrently.

In [3]:
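class Encoder(nn.Module):
    def __init__(self, hidden_size, output_size, attention):
        super(Encoder, self).__init__()
        # Each encoder keeps its own copy of the attention module.
        self.attention = copy.deepcopy(attention)
        # To stack encoders, output_size must equal hidden_size.
        self.ffnn = nn.Linear(hidden_size, output_size)
    def forward(self, tokens):
        z_representations = []
        for t in tokens:
            # Self-attention: the token is the query; the whole sequence provides keys and values.
            z = self.attention(t, tokens, tokens)
            z_representations.append(self.ffnn(z))
        return torch.stack(z_representations)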
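
The per-token loop can be replaced by one matrix operation, which is how the positions get computed concurrently in practice. The sketch below uses plain scaled dot-product attention without learned projections; the function names `attend_one` and `attend_all` are hypothetical and are not part of the notebook's classes. It checks that the looped and the batched versions agree.

```python
import math
import torch

def attend_one(query, keys, values):
    """Scaled dot-product attention for a single query vector of shape (hidden,)."""
    scores = keys @ query / math.sqrt(query.shape[-1])    # (seq_len,)
    weights = torch.softmax(scores, dim=-1)
    return weights @ values                                # (hidden,)

def attend_all(queries, keys, values):
    """The same computation for every query at once."""
    scores = queries @ keys.T / math.sqrt(queries.shape[-1])  # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)                    # each row sums to 1
    return weights @ values                                    # (seq_len, hidden)

torch.manual_seed(0)
tokens = torch.randn(5, 8)   # 5 tokens, hidden size 8

# One token at a time, as in the loop above.
looped = torch.stack([attend_one(t, tokens, tokens) for t in tokens])
# All tokens in a single matrix product.
batched = attend_all(tokens, tokens, tokens)

same = torch.allclose(looped, batched, atol=1e-5)
print(same)
```

The batched form does the same arithmetic, but lets the backend parallelise over positions instead of iterating in Python.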

### Multi-Head Attention
![attention](../img/multihead-attention.png)

In [4]:
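class MultiHeadAttention(nn.Module):
    # Placeholder: the attention computation is not implemented yet.
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
    def forward(self, query, key, value):
        # Should return the attention output for `query` over (`key`, `value`).
        raise NotImplementedError("multi-head attention is not implemented yet")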
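
As a point of comparison, here is a minimal sketch of multi-head attention in the style of Vaswani et al. (2017). It is not the notebook's implementation: the class name `MultiHeadAttentionSketch`, the constructor arguments, and the choice to take unbatched `(seq_len, hidden_size)` inputs are all assumptions made for illustration. It also omits masking and dropout.

```python
import math
import torch
from torch import nn

class MultiHeadAttentionSketch(nn.Module):
    """Hypothetical multi-head attention over unbatched (seq_len, hidden_size) inputs."""
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        assert hidden_size % num_heads == 0, "hidden_size must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Learned projections for queries, keys, values, and the merged output.
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def _split(self, x):
        # (seq_len, hidden) -> (num_heads, seq_len, head_dim)
        return x.view(x.shape[0], self.num_heads, self.head_dim).transpose(0, 1)

    def forward(self, query, key, value):
        q = self._split(self.q_proj(query))
        k = self._split(self.k_proj(key))
        v = self._split(self.v_proj(value))
        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)  # (heads, q_len, k_len)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v                                           # (heads, q_len, head_dim)
        # Concatenate the heads back into one vector per position.
        merged = heads.transpose(0, 1).reshape(query.shape[0], -1)    # (q_len, hidden)
        return self.out_proj(merged)

torch.manual_seed(0)
attention = MultiHeadAttentionSketch(hidden_size=8, num_heads=2)
tokens = torch.randn(5, 8)
out = attention(tokens, tokens, tokens)   # self-attention: query = key = value
print(out.shape)
```

Unlike the per-token call in the encoder above, this sketch takes the whole sequence as the query, so it would be called once per layer and not once per token.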

### Feed Forward Network
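
A minimal sketch of the position-wise feed-forward network, assuming the two-layer form from Vaswani et al. (2017) with the GELU activation that BERT uses in place of ReLU. The class name `PositionwiseFeedForward` and the parameter name `inner_size` are hypothetical; BERT sets the inner size to four times the hidden size.

```python
import torch
from torch import nn

class PositionwiseFeedForward(nn.Module):
    """Hypothetical two-layer feed-forward block applied to each position separately."""
    def __init__(self, hidden_size, inner_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner_size),   # expand
            nn.GELU(),                            # non-linearity used by BERT
            nn.Linear(inner_size, hidden_size),   # project back to hidden_size
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
hidden_size = 8
ffn = PositionwiseFeedForward(hidden_size, inner_size=4 * hidden_size)
tokens = torch.randn(5, hidden_size)
out = ffn(tokens)
print(out.shape)

# "Position-wise": each row is transformed on its own, so processing one
# token alone gives the same result as processing it within the sequence.
row_matches = torch.allclose(out[2], ffn(tokens[2]), atol=1e-6)
print(row_matches)
```

Because the block maps `hidden_size` back to `hidden_size`, it can be used inside a layer that is stacked several times.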

### Attention

## Pre-Training & Fine-Tuning



## Experimentation