# Week 2 study notes

## Encoder Architecture

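An encoder takes an input sequence of variable length and encodes it into vector representations: one fixed-size contextual vector per token, which can be pooled into a single fixed-length vector for the whole input.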

In transformers, the encoder architecture is designed around self-attention and feedforward (output of one layer is the input for the next) neural networks, without relying on any recurrent mechanisms, such as loops or feedback connections.

- **Multi-head self-attention**: computes attention scores that define how much each token relates to every other token in the sequence. Each head can capture a different kind of relationship, so the output reflects global dependencies between tokens regardless of the distance between them. An example might be a connection between two tokens because they are both nouns.
- **Position-wise feedforward network**: applies the same non-linear transformation to each position separately (so every token is treated identically). This helps extract meaning for each specific token and is done in parallel across positions. It enriches the local (per-token) context, complementing the global attention.

The two representations (global and local) are combined through residual connections: each sub-layer's output is added back to its input and the result is normalized (layer normalization). We can then use the content-rich encoding to fine-tune our models for downstream tasks.
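To make the data flow concrete, here is a minimal NumPy sketch of one encoder layer. It is an illustration under simplifying assumptions, not BERT's actual implementation: it uses a single attention head (real models run several heads in parallel and concatenate them), a ReLU activation, layer normalization without learnable scale and shift, and random weights. All names and sizes (`encoder_layer`, `d_model`, `d_ff`, and so on) are made up for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Scaled dot-product attention for a single head.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len): token-to-token scores
    weights = softmax(scores)                # each row sums to 1
    return weights @ v, weights

def feed_forward(x, w1, b1, w2, b2):
    # The same two-layer network is applied to every position independently.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, p):
    attn_out, weights = self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    x = layer_norm(x + attn_out)             # add & norm after attention
    ffn_out = feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"])
    x = layer_norm(x + ffn_out)              # add & norm after feedforward
    return x, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16            # toy sizes, chosen for readability
params = {
    "w_q": rng.normal(size=(d_model, d_model)),
    "w_k": rng.normal(size=(d_model, d_model)),
    "w_v": rng.normal(size=(d_model, d_model)),
    "w1": rng.normal(size=(d_model, d_ff)),
    "b1": np.zeros(d_ff),
    "w2": rng.normal(size=(d_ff, d_model)),
    "b2": np.zeros(d_model),
}
tokens = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
encoded, attn_weights = encoder_layer(tokens, params)
print(encoded.shape)  # (4, 8): one vector per input token
```

The output keeps one vector per token, and no step depends on processing the tokens in order, which is what allows the whole sequence to be handled in parallel.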


### ELMo (Embeddings from Language Models)
ELMo, on the other hand, uses a bidirectional LSTM (Long Short-Term Memory) architecture. LSTM is a type of recurrent neural network (RNN) that processes text sequentially, taking into account the order of words in a sentence. It has limited awareness of the entire sentence's context, especially when it comes to words far away from the current word in the sequence.

### BERT (Bidirectional Encoder Representations from Transformers)
BERT is a transformer-based model that came after ELMo. It's designed to understand the context of a word by looking at both the words that come before and after it in a sentence. This bidirectional understanding of words allows BERT to capture richer and more nuanced context because it can consider the entire sentence when processing each word.

#### Pre-training

BERT is pre-trained on unlabeled text to learn generally useful language patterns. This creates reusable encoder layers that provide significant performance gains when fine-tuned on downstream tasks, which makes BERT a versatile starting point. Pre-training uses two objectives:

- **masked language modeling**: During pre-training, BERT randomly selects 15% of the input tokens, masks them, and trains the model to predict the original tokens using the full context from both directions. This allows the model to implicitly learn complex relationships between all words across long distances.
- **next sentence prediction**: This task feeds sentence pairs as input, and BERT must decide whether the second sentence actually follows the first or is just a random one. This objective teaches discourse-level relationships between sentences.
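The masking step can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: it works on whitespace-split words instead of WordPiece tokens, and it selects each token independently with probability 0.15 rather than picking exactly 15% of positions. The 80/10/10 split between `[MASK]`, a random token, and the unchanged token comes from the BERT paper rather than from these notes. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (model_inputs, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected for prediction
        labels[i] = token                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=1)
print(masked)
print(labels)
```

Only positions with a non-`None` label contribute to the masked language modeling loss.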

### BERT-based LLMs

These cater to different use cases and are trained on different datasets. They all build on the BERT encoder architecture, but the pre-training data, the fine-tuning tasks, and in some cases the size or layout of the network are different.

- **[DistilBERT](https://huggingface.co/distilbert-base-uncased)**: 40% fewer parameters, achieved through knowledge distillation. This makes it faster and cheaper to run, while still retaining 97% of BERT's performance.
- **[ALBERT](https://huggingface.co/albert-base-v2)**: focuses on factorized embedding and cross-layer sharing to achieve parameter reduction.
- **[RoBERTa](https://huggingface.co/roberta-base)**: removes next sentence prediction and trains on longer sequences. It's adept at handling case-sensitive distinctions, and therefore suitable for tasks requiring nuanced language understanding and sentiment analysis.
- **[BioBERT](https://huggingface.co/alvaroalon2/biobert_diseases_ner)**: trained on biomedical text, it's useful for biomedical text mining tasks such as biomedical named entity recognition (NER).
- **[SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased)**: trained on scientific text, it's useful for scientific text mining tasks such as scientific named entity recognition (NER).
- **[BERTweet](https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis)**: trained on Twitter data, it's useful for sentiment analysis and emotion detection on tweets.
- **MobileBERT**: designed for mobile devices, it's a compressed version of BERT achieved through knowledge distillation and network architecture modifications such as bottleneck structures. This design enables MobileBERT to run inference efficiently, making it well suited to resource-limited devices.
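As a worked example of one of these parameter-reduction ideas, the arithmetic below sketches ALBERT's factorized embedding: instead of one `vocab_size x hidden_size` embedding matrix, it uses a smaller `vocab_size x embedding_size` matrix followed by an `embedding_size x hidden_size` projection. The sizes (30,000 tokens, hidden size 768, embedding size 128) are assumptions chosen to resemble a base-sized model, and the helper `embedding_params` is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without factorization."""
    if embedding_size is None:
        # Standard embedding: one matrix of shape (vocab_size, hidden_size).
        return vocab_size * hidden_size
    # Factorized: (vocab_size, embedding_size) plus (embedding_size, hidden_size).
    return vocab_size * embedding_size + embedding_size * hidden_size

full = embedding_params(30_000, 768)
factorized = embedding_params(30_000, 768, embedding_size=128)
print(full)        # 23040000
print(factorized)  # 3938304
```

Under these assumed sizes the embedding table shrinks to roughly a sixth of its original size. Cross-layer sharing reduces parameters further by reusing one set of layer weights across all encoder layers.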

### BERT fine-tuning

BERT leverages its pre-trained knowledge to learn a new task from relatively little task-specific data and training. This is done through a process called fine-tuning, which adds a small task-specific layer (for example, a classification head) on top of the pre-trained encoder. The additional layer is randomly initialized and learned through backpropagation. In standard fine-tuning, the pre-existing weights of the BERT model are also adjusted, usually with a small learning rate, to maximize performance on the specific task at hand. Keeping the pre-trained layers fixed and training only the new layer is a cheaper alternative, often at some cost in accuracy.

The loss function computes the cross-entropy between the predicted class probabilities and the true labels, guiding the optimization process to refine the model's ability to tell the classes apart (for example, different sentiment categories). For evaluation, metrics such as classification accuracy and the F1 score are used.
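A small worked example may help show what the cross-entropy loss rewards and penalizes. This is a plain-Python sketch for a three-class problem; the function names are illustrative, and deep learning libraries compute the same quantity in batched, numerically optimized form.

```python
import math

def softmax(logits):
    # Convert raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_label):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[true_label])

def batch_cross_entropy(batch_logits, labels):
    # Mean loss over a batch of examples.
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

confident_right = cross_entropy([5.0, 0.0, 0.0], true_label=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], true_label=1)
undecided = cross_entropy([0.0, 0.0, 0.0], true_label=2)
print(confident_right, confident_wrong, undecided)
```

A confident correct prediction gives a loss close to zero, a confident wrong one gives a large loss, and a uniform guess over three classes gives `ln(3)`.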

## Fine-tuning BERT for Natural Language Inference with Hugging Face


Steps taken:

- The tokenizer converts raw text into numeric token ids, truncating longer sequences to a maximum length.
- We then remove the original text columns and format the data as torch tensors containing the labels, input ids, token type ids, and attention masks.
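To show what these fields contain, here is a toy sketch of how a sentence pair is laid out and how a batch is padded. It is a hand-rolled illustration of the idea, not the Hugging Face tokenizer or collator: the word ids are made up, truncation is omitted, and the special-token ids are assumed to match those of `bert-base-uncased` (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102). The helpers `encode_pair` and `pad_batch` are hypothetical.

```python
CLS, SEP, PAD = 101, 102, 0  # assumed special-token ids (bert-base-uncased)

def encode_pair(premise_ids, hypothesis_ids):
    # Layout: [CLS] premise [SEP] hypothesis [SEP]
    input_ids = [CLS] + premise_ids + [SEP] + hypothesis_ids + [SEP]
    # Segment 0 covers [CLS], the premise, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(premise_ids) + 2) + [1] * (len(hypothesis_ids) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}

def pad_batch(examples):
    # Pad every example to the length of the longest one in this batch.
    longest = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for ex in examples:
        n_real = len(ex["input_ids"])
        n_pad = longest - n_real
        batch["input_ids"].append(ex["input_ids"] + [PAD] * n_pad)
        batch["token_type_ids"].append(ex["token_type_ids"] + [0] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch

first = encode_pair([7, 8, 9], [10, 11])  # made-up word ids
second = encode_pair([12], [13])
batch = pad_batch([first, second])
print(batch["input_ids"][1])       # [101, 12, 102, 13, 102, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Padding per batch rather than to one global maximum length keeps short batches small, which is the reason to use a padding collator instead of padding at tokenization time.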


```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
import evaluate
import numpy as np

# load the MNLI dataset and initialize the tokenizer
raw_datasets = load_dataset("glue", "mnli")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# tokenize each premise/hypothesis pair, truncating to the model's max length
tokenized_datasets = raw_datasets.map(
    lambda x: tokenizer(x["premise"], x["hypothesis"], truncation=True),
    batched=True,
)

# drop the raw text columns and return only the model inputs as torch tensors
tokenized_datasets = tokenized_datasets.remove_columns(["premise", "hypothesis"])
tokenized_datasets.set_format(
    type="torch",
    columns=["input_ids", "token_type_ids", "attention_mask", "label"],
)

# pad each batch to the length of its longest sequence so examples can be batched together
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# load the pre-trained BERT base model and add a 3-class classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
```

Output (download progress bars omitted). The MNLI splits contain:

- train: 392,702 examples
- validation_matched: 9,815 examples
- validation_mismatched: 9,832 examples
- test_matched: 9,796 examples
- test_mismatched: 9,847 examples

```python
# load the evaluation metrics once, rather than on every evaluation call
f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")


def compute_metrics(eval_preds):
    # unpack the model outputs (logits) and the reference labels
    logits, labels = eval_preds

    # convert logits to predicted class ids
    predictions = np.argmax(logits, axis=-1)

    # macro-averaged F1: the unweighted mean of the per-class F1 scores
    f1_score = f1_metric.compute(
        predictions=predictions, references=labels, average="macro"
    )["f1"]

    # accuracy: the fraction of predictions that match the labels
    accuracy_score = accuracy_metric.compute(
        predictions=predictions, references=labels
    )["accuracy"]

    return {"F1": f1_score, "Accuracy": accuracy_score}
```
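For intuition about the two metrics, they can also be computed by hand. The plain-Python sketch below uses a made-up set of six predictions for a three-class task; the helper names are illustrative and this is not the `evaluate` library's implementation.

```python
def accuracy(predictions, references):
    # Fraction of predictions that match the reference labels.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def macro_f1(predictions, references):
    # Compute F1 per class, then take the unweighted mean across classes.
    labels = sorted(set(references) | set(predictions))
    f1_scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

references = [0, 1, 2, 0, 1, 2]   # made-up true labels
predictions = [0, 1, 1, 0, 2, 2]  # made-up model predictions
print(accuracy(predictions, references))  # 4 of 6 correct
print(macro_f1(predictions, references))  # mean of per-class F1: 1.0, 0.5, 0.5
```

Macro averaging gives every class the same weight regardless of how many examples it has, so it can diverge from accuracy when classes are imbalanced.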

```python
# initialize the trainer with default training arguments and the data collator
training_args = TrainingArguments("test-nli")
trainer = Trainer(
    model=model,
    args=training_args,
    # 1,000-example subsets keep the run short
    train_dataset=tokenized_datasets["train"].select(range(1000)),
    eval_dataset=tokenized_datasets["validation_matched"].select(range(1000)),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# train the model
trainer.train()

# evaluate the model on the validation_matched subset (not a held-out test set)
eval_results = trainer.evaluate()
print(eval_results)
```

Output (progress bars and download messages omitted):

```
{'train_runtime': 195.7826, 'train_samples_per_second': 15.323, 'train_steps_per_second': 1.915, 'train_loss': 0.6678671061197917, 'epoch': 3.0}

{'eval_loss': 1.0557687282562256, 'eval_F1': 0.6361470867450364, 'eval_Accuracy': 0.637, 'eval_runtime': 31.3464, 'eval_samples_per_second': 31.902, 'eval_steps_per_second': 3.988, 'epoch': 3.0}
```





## Hugging Face Hub for MLOps

Centralised platform for storing, versioning and streamlining the management of transformer models. It also integrates with logging utilities like TensorBoard.

The Hub allows you to store models alongside their training code, creating a comprehensive record of each model version's development journey. These records include the exact code used to produce each model version, which improves reproducibility and lets teams restore and compare previous versions. It also enables the direct loading of hosted models into applications through well-structured pipelines for inference.

- **model cards**: provide a summary of the model's intended use case, training data, and performance metrics. This helps users understand the model's capabilities and limitations, and how it should be used.
- **model checkpoints**: allow you to save and load model weights, parameters, and other configuration details. This makes it easy to share models with others, and to restore previous versions of a model.
- **training scripts**: allow you to train and fine-tune models on your own data. This is useful for adapting pre-trained models to your specific use case, or for training new models from scratch.



## Lexicon

**Cross-Entropy**: Cross-entropy is a way to measure how wrong or right our predictions are compared to the actual correct answers in classification tasks.
*Resource:* [Cross-Entropy Explained](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)

**Backpropagation**: Backpropagation is a technique that helps a neural network learn from its mistakes by adjusting its internal parameters.
*Resource:* [Backpropagation Explained](https://www.geeksforgeeks.org/backpropagation-neural-networks/)

**Loss Function**: A loss function quantifies how well or poorly a model is performing on a task. It's used to guide the training process.
*Resource:* [Understanding Loss Functions](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html)

**Unsupervised Learning**: Unsupervised learning is a type of machine learning where the model finds patterns in data without explicit labels or supervision.
*Resource:* [Unsupervised Learning and Data Clustering](https://towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a)

**LSTM (Long Short-Term Memory) Architecture**: LSTM is a type of neural network architecture that's good at handling sequences of data, like text or time series, by remembering important information for longer periods.
*Resource:* [Understanding LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

**Bidirectional Encoder**: A bidirectional encoder is a part of models like BERT that understand language by considering both the words that come before and after a given word.
*Resource:* [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

**Feedforward Neural Network**: A feedforward neural network is a type of neural network where data moves in one direction, from input to output, without loops. It's used for various machine learning tasks.
*Resource:* [Feedforward Neural Networks](https://en.wikipedia.org/wiki/Feedforward_neural_network)