<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/ml/bert_encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Understanding BERT Encoding

**BERT (Bidirectional Encoder Representations from Transformers)** is a transformer-based model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. Here's a breakdown of how BERT encoding works:

### How BERT Reads Text

1. **Bidirectional Context**: Traditional sequence models such as LSTMs process text sequentially, either left-to-right or right-to-left. BERT instead uses self-attention, so every token's representation is conditioned on both its left and right context at the same time.

2. **Transformers**: BERT is based on the Transformer architecture, which uses self-attention mechanisms to weigh the importance of different words in a sentence regardless of their position. This allows it to build a rich contextual representation of words.
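
To make the idea of "weighing the importance of different words" concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It makes a simplifying assumption: the queries, keys, and values are the input vectors themselves, whereas a real BERT layer first applies learned projection matrices. The 2-dimensional vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the inputs themselves;
    BERT would first apply learned Q/K/V projections.
    """
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Score this token against every token, left and right alike.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of all value vectors.
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights

# Hypothetical 2-dimensional vectors for three tokens.
tokens = ["i", "love", "programming"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, weights = self_attention(vectors)
for tok, w in zip(tokens, weights):
    print(tok, [round(x, 3) for x in w])
```

Each row of `weights` sums to 1 and has a non-zero entry for every token in the sequence, which is what lets a token draw on context from both directions.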

### Encoding Process

1. **Tokenization**:
   - The text is split into tokens. BERT uses WordPiece tokenization, where words are split into subwords or characters when necessary.
   - Special tokens are added: `[CLS]` at the beginning of a sequence and `[SEP]` at the end of a sequence or to separate different sequences.

   Example:
   ```
   Input: "I love programming."
   Tokenized: ['[CLS]', 'i', 'love', 'programming', '.', '[SEP]']
   ```

2. **Embedding**:
   - **Token Embeddings**: Each token is converted into a dense vector.
   - **Segment Embeddings**: For tasks involving pairs of sentences, segment embeddings differentiate the two sentences.
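   - **Positional Embeddings**: Since self-attention has no inherent notion of word order, a learned positional embedding is added for each position in the sequence. The token, segment, and positional embeddings are summed element-wise to form the input to the first transformer layer.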

   Example:
   ```
   Token Embedding:      [v_CLS, v_i, v_love, v_programming, v_., v_SEP]
   Segment Embedding:    [v_segment_A, v_segment_A, v_segment_A, v_segment_A, v_segment_A, v_segment_A]
   Positional Embedding: [v_pos_0, v_pos_1, v_pos_2, v_pos_3, v_pos_4, v_pos_5]
   Combined Embedding:   [v_combined_0, v_combined_1, v_combined_2, v_combined_3, v_combined_4, v_combined_5]
   ```

3. **Self-Attention Mechanism**:
   - Each token attends to every other token in the sequence to build a contextual understanding.
   - This is done using multiple attention heads, allowing the model to focus on different parts of the sentence simultaneously.

4. **Transformer Layers**:
   - The combined embeddings are passed through multiple transformer layers. Each layer consists of a multi-head self-attention mechanism followed by position-wise feed-forward neural networks.
   - The output of each layer is passed to the next, allowing the model to build increasingly complex representations.
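
Steps 1 and 2 above (tokenization and embedding) can be sketched in plain Python. This is an illustrative toy, not BERT's actual implementation: the vocabulary, the hidden size of 4, and the random embedding tables are all assumptions. The toy vocabulary deliberately omits "programming" so that the greedy longest-match-first subword split is visible; with the real vocabulary the word stays whole, as in the example above.

```python
import random

# Hypothetical toy vocabulary; real BERT vocabularies hold tens of thousands of entries.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "i", "love", "program", "##ming", "."]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def wordpiece(word):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in TOKEN_TO_ID:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text):
    """Lowercase, split off '.', apply WordPiece, and add [CLS]/[SEP]."""
    words = text.lower().replace(".", " .").split()
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word))
    return tokens + ["[SEP]"]

HIDDEN = 4    # toy hidden size (assumption)
MAX_POS = 16  # toy maximum sequence length (assumption)
rng = random.Random(0)

def random_table(rows):
    """Stand-in for a learned embedding table."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]

token_emb = random_table(len(VOCAB))
segment_emb = random_table(2)
position_emb = random_table(MAX_POS)

def embed(tokens, segment_ids):
    """Element-wise sum of token, segment, and position embeddings.

    The layer normalization and dropout that follow in BERT are omitted.
    """
    combined = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t, s, p = token_emb[TOKEN_TO_ID[tok]], segment_emb[seg], position_emb[pos]
        combined.append([a + b + c for a, b, c in zip(t, s, p)])
    return combined

tokens = tokenize("I love programming.")
combined = embed(tokens, [0] * len(tokens))
print(tokens)  # ['[CLS]', 'i', 'love', 'program', '##ming', '.', '[SEP]']
print(len(combined), len(combined[0]))  # 7 4
```

The `combined` list holds one vector per token; it is this sequence that the self-attention and transformer layers in steps 3 and 4 operate on.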

### Connecting Sentences

- BERT handles a pair of sentences by placing a `[SEP]` token after each one and giving each sentence its own segment embedding. During pre-training, the Next Sentence Prediction objective teaches it to model the relationship between the two sentences; fine-tuning on tasks like Question Answering builds on this.

   Example:
   ```
   Sentence A: "How are you?"
   Sentence B: "I am fine."
   Tokenized: ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']
   ```
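
As a small sketch of how a sentence pair might be assembled, the helper below joins two already-tokenized sentences and builds the matching segment ids. The function name `build_pair_input` is hypothetical; the layout (segment 0 for `[CLS]`, sentence A, and the first `[SEP]`; segment 1 for sentence B and the final `[SEP]`) follows the usual BERT convention.

```python
def build_pair_input(tokens_a, tokens_b):
    """Join two token lists BERT-style and return (tokens, segment_ids)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0: [CLS], sentence A, first [SEP]. Segment 1: sentence B, final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    ["how", "are", "you", "?"],
    ["i", "am", "fine", "."],
)
for tok, seg in zip(tokens, segment_ids):
    print(f"{tok:>6}  segment {seg}")
```

The segment ids select which segment embedding is added to each token, which is how the model can tell the two sentences apart beyond the `[SEP]` markers.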

### Integration into Neural Networks

**Using BERT in Downstream Tasks**:

1. **Feature Extraction**:
   - BERT can be used as a fixed feature extractor where the encoded representations of the input text are taken from a specific layer and used in downstream tasks.
   - The `[CLS]` token's representation is often used as a summary of the entire sequence for classification tasks.

   Example:
   ```
   Encoded Representation: [v_CLS, v_how, v_are, v_you, v_?, v_SEP, v_i, v_am, v_fine, v_., v_SEP]
   ```

2. **Fine-Tuning**:
   - BERT can be fine-tuned for specific tasks by adding task-specific layers on top of the pre-trained model and training the entire architecture.
   - For classification, a dense layer followed by a softmax layer can be added on top of the `[CLS]` token's output.

   Example:
   ```
   Model Architecture:
   Input: [CLS] How are you? [SEP] I am fine. [SEP]
   BERT Encoder Layers
   Output (of [CLS] token): v_CLS
   Dense Layer: Dense(v_CLS)
   Softmax Layer: Softmax(Dense(v_CLS))
   ```
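
The classification head described above can be sketched in a few lines of plain Python. Everything numeric here is an assumption for illustration: the 4-dimensional encoder outputs, the two-class weight matrix `W`, and the bias `b` are invented values, whereas in practice the encoder output comes from BERT and the head's weights are learned during fine-tuning.

```python
import math

def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
            for row, b_j in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder output: one vector per token (only the first three shown).
encoded = [
    [0.2, -0.1, 0.5, 0.3],   # [CLS]
    [0.1, 0.4, -0.2, 0.0],   # how
    [0.3, 0.1, 0.0, -0.5],   # are
]

# Feature extraction: take the [CLS] vector as the sequence summary.
v_cls = encoded[0]

# Hypothetical weights for a 2-class head.
W = [[0.5, -0.3, 0.8, 0.1],
     [-0.4, 0.2, -0.6, 0.3]]
b = [0.0, 0.1]

probs = softmax(dense(v_cls, W, b))
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

When BERT is used as a fixed feature extractor, only `W` and `b` would be trained; when fine-tuning, the gradients also flow back into the encoder that produced `encoded`.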

### Examples

1. **Single Sentence Encoding**:
   - Input: "The quick brown fox."
   - Tokenized: ['[CLS]', 'the', 'quick', 'brown', 'fox', '.', '[SEP]']
   - Embedding + Self-Attention -> Encoded Representation

2. **Sentence Pair Encoding**:
   - Input: "What is your name?" "My name is BERT."
   - Tokenized: ['[CLS]', 'what', 'is', 'your', 'name', '?', '[SEP]', 'my', 'name', 'is', 'bert', '.', '[SEP]']
   - Embedding + Self-Attention -> Encoded Representation

### Conclusion

BERT's ability to read text bidirectionally and its use of transformer self-attention allow it to create rich, contextual representations of text. These encoded representations can be used as features in downstream tasks or fine-tuned with additional layers to improve performance on specific tasks.