# Week 1 study notes

## Transformers

The original transformer uses an encoder-decoder architecture.
The **Encoder** reads in a sequence of tokens and generates a representation for each. It applies multi-headed self-attention to model relationships between all tokens simultaneously. The **Decoder** consumes these representations to generate output text. By calculating attention weights between tokens, transformers can focus on the parts of the text most relevant to each individual token.
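
To make the attention-weight idea concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. The toy embeddings are made-up numbers, and the token vectors are reused directly as queries, keys, and values; a real transformer learns separate projections for each and runs several heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy embeddings for three tokens (assumed values, for illustration only).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token attends to every other token. Here the first token puts more weight on the similar second token than on the dissimilar third one.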

Paper on Structured Attention Networks: https://arxiv.org/pdf/1702.00887.pdf

### Encoder-only Architecture

- removing the decoder optimizes the model for natural language understanding tasks, where generation is not required.
- encoder-only models excel at tasks like classification, NER, question answering, and semantic search.
- often more compact than generative models, yet still powerful on understanding tasks
- example: BERT

### Decoder-only Architecture

- removing the encoder optimizes the model for natural language generation tasks, where the model is required to produce text.
- decoder-only models excel at tasks like text completion, dialogue response generation, summarization, and translation.
- can lag behind encoder models of similar size on some understanding tasks, because each token only attends to the tokens before it.
- example: GPT-3

### Encoder-Decoder Architecture

- can tackle a wide range of tasks, including those that require both natural language understanding and generation.
- example: T5
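
One practical difference between these families is which tokens self-attention is allowed to see. The sketch below builds the two mask shapes; the helper name `attention_mask` is hypothetical and not taken from any library. Encoders such as BERT attend in both directions, while decoders such as GPT-3 use a causal mask so each token only sees itself and earlier tokens. In an encoder-decoder model, the decoder additionally attends to the encoder's output.

```python
def attention_mask(num_tokens, causal):
    """Return mask[i][j] == True when token i may attend to token j."""
    return [
        [(j <= i) if causal else True for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

# Encoder-style: every token sees every other token.
encoder_mask = attention_mask(3, causal=False)

# Decoder-style: token i only sees positions 0..i.
decoder_mask = attention_mask(3, causal=True)

for row in decoder_mask:
    print(["x" if allowed else "." for allowed in row])
```

The causal mask is what lets a decoder be trained to predict the next token without peeking at it.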


Further reading: https://sebastianraschka.com/blog/2023/llm-reading-list.html


## Fine Tuning LLMs

- LLMs are pre-trained on large datasets and then fine-tuned on a specific task.
- pre-trained models are initially trained on a general domain dataset using self-supervised learning.
- this provides a deep understanding of language structure and meaning.
- however, if used as is, the model may not perform well on specific tasks, producing outputs that are generic, incoherent, or even inappropriate.
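
As an illustration of what "self-supervised" means in the notes above, here is a rough sketch of how next-token training pairs can be built from raw text with no human labels. Whitespace tokenization and the `context_size` parameter are simplifying assumptions; real LLMs use subword tokenizers and much longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) training pairs from unlabeled tokens."""
    pairs = []
    for i in range(1, len(tokens)):
        # The context is the window of tokens just before position i.
        context = tokens[max(0, i - context_size):i]
        # The label is simply the next token in the text itself.
        pairs.append((context, tokens[i]))
    return pairs

text = "the cat sat on the mat"
pairs = next_token_pairs(text.split(), context_size=3)

for context, target in pairs:
    print(context, "->", target)
```

The supervision signal comes entirely from the text, which is why pre-training can scale to very large general-domain datasets.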

**Benefits**
- cost effective (smaller model can be deployed, don't have to train from scratch)
- privacy (don't have to share data with third parties or rely on black box solutions)

### Framework for LLM fine-tuning

#### Overview

1. Research
    - understand the problem space and associated risks
    - translate the problem into specific NLP tasks
    - research existing models and architectures
    - establish strong baselines (measure critical aspects like accuracy, fairness, latency)
    - set concrete goals that align with business objectives
2. Development
    - prepare, analyze, and document datasets (look for potential bias, document filtering steps, human review of labels)
    - fine-tune base models (track experiments)
    - evaluate fine-tuned models on unseen datasets, covering accuracy as well as fairness and latency
3. Production
    - MLOps
        - model optimization
        - model versioning
        - CI/CD
        - model monitoring
        - retraining
    - pre-deployment testing
        - check bias and fairness
        - edge cases
        - security (adversarial attacks)
        - user feedback and A/B testing
    - monitor post-deployment
        - continuous monitoring
        - user feedback & usage metrics
        - retraining
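
The baseline step under Research and the evaluation step under Development could be sketched roughly as below. Everything here is hypothetical: the `evaluate` helper, the tiny held-out set, and the keyword baseline are stand-ins for illustration. In practice the same function would be run on both the baseline and the fine-tuned model so the numbers are comparable, with a fairness metric added alongside.

```python
import time

def evaluate(predict, examples):
    """Return accuracy and mean per-example latency (seconds) for a predict function."""
    correct = 0
    start = time.perf_counter()
    for text, label in examples:
        if predict(text) == label:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(examples),
        "latency_s": elapsed / len(examples),
    }

# Hypothetical held-out sentiment examples (not from any real dataset).
held_out = [
    ("great product", "pos"),
    ("terrible service", "neg"),
    ("really great", "pos"),
    ("not good", "neg"),
    ("good value", "pos"),
]

def keyword_baseline(text):
    """Trivial baseline: predict positive only if the word 'great' appears."""
    return "pos" if "great" in text else "neg"

baseline = evaluate(keyword_baseline, held_out)
print(baseline)
```

Recording a simple baseline like this before fine-tuning gives the concrete goals in the Research phase something to be measured against.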

