## Architectures

### 1. Transformer-Based Architectures


These are the backbone of most modern GenAI systems.

🔹 **Encoder-Only (e.g., BERT, RoBERTa)**

- Purpose: Understanding tasks (e.g., classification, NER)
- Architecture: Only the encoder stack of the Transformer
- Not generative: Trained to fill in masked tokens using context from both directions, so it does not produce free-form text on its own
- Use cases: Embedding generation, sentence classification, question answering (extractive)
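
One common way to turn an encoder's per-token outputs into a single sentence embedding is mean pooling over the non-padding tokens. The sketch below assumes the token vectors have already been produced by some encoder; `mean_pool` and the toy vectors are illustrative names and values, not part of any library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is truthy (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Hypothetical encoder outputs: two real tokens followed by one padding token.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
sentence_embedding = mean_pool(token_vectors, attention_mask)
```

The padding vector is ignored, so the embedding depends only on the real tokens.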

🔹 **Decoder-Only (e.g., GPT, LLaMA, Mistral)**

- Purpose: Text generation
- Architecture: Only the decoder stack with masked self-attention
- Autoregressive: Predicts the next token given previous ones
- Use cases: Chatbots, story generation, code completion
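
The two ideas above, masked self-attention and autoregressive prediction, can be sketched in a few lines. This is a toy illustration only: `TOY_NEXT` is a hypothetical lookup table standing in for a trained decoder, and the attention scores are made-up numbers.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    peak = max(s for s, keep in zip(scores, mask_row) if keep)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_NEXT = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}


def toy_next_token(tokens):
    return TOY_NEXT.get(tokens[-1], "</s>")


def generate(prompt, max_new_tokens=10):
    """Autoregressive loop: append one predicted token at a time until end-of-sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)
        tokens.append(nxt)
        if nxt == "</s>":
            break
    return tokens


mask = causal_mask(3)
first_row_weights = masked_softmax([1.0, 2.0, 3.0], mask[0])  # first token sees only itself
generated = generate(["<s>"])
```

A real decoder conditions each prediction on the whole prefix through attention; the lookup table here only uses the last token to keep the loop easy to follow.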
    
🔹 **Encoder-Decoder (Seq2Seq) (e.g., T5, BART, FLAN-T5)**

- Purpose: Text-to-text tasks (translation, summarization, Q&A)
- Architecture: Full Transformer with both encoder and decoder
- Flexible: Input is encoded, and decoder generates output conditioned on it
- Use cases: Translation, summarization, style transfer
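
The phrase "decoder generates output conditioned on it" refers to cross-attention: each decoder position queries the encoder's outputs. A minimal sketch follows, assuming for simplicity that the encoder states serve as both keys and values (real models apply separate learned projections); all vectors are made-up values.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, encoder_states):
    """One decoder query attends over all encoder states.

    Simplification: encoder states are used directly as keys and values.
    Returns (attention weights, context vector).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, state)) / scale for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical outputs for two source tokens
decoder_query = [5.0, 0.0]                 # hypothetical decoder state
weights, context = cross_attention(decoder_query, encoder_states)
```

The query aligns with the first encoder state, so most of the attention weight, and therefore most of the context vector, comes from the first source token.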


![Architectures](Components_of_Transformer_Architecture-3546166406.png)

Ref: https://datasciencedojo.com/blog/transformer-models-types-their-uses/

### 🧠 2. Diffusion Models (for Images, Audio, Video)

These models generate data by learning to reverse a gradual noising process: training adds noise to real samples, and the model learns to remove it.
    
🔹 Examples: DALL·E 2, Stable Diffusion, Imagen

- Process: Start with noise → denoise step-by-step to generate image
- Text-to-image: Often conditioned on text embeddings (from CLIP or T5)
- Use cases: Image generation, inpainting, super-resolution
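
The noising and denoising relationship can be sketched with the closed-form forward step used in DDPM-style models. This is a sketch under stated assumptions: the linear beta schedule values are illustrative, and the true noise is passed in where a trained network's noise prediction would normally go, so it shows the algebra of one step rather than a full sampler.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, noise, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]


def estimate_x0(xt, predicted_noise, alpha_bar):
    """Invert the forward formula given an estimate of the noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * n) / a for x, n in zip(xt, predicted_noise)]


rng = random.Random(0)
x0 = [0.5, -1.0, 0.25]                      # toy "image" with three pixels
alpha_bars = alpha_bar_schedule(1000)
noise = [rng.gauss(0.0, 1.0) for _ in x0]
xt = add_noise(x0, noise, alpha_bars[500])  # noisy sample at step 500
# Assumption: a perfect noise predictor. A real model would supply its own estimate here.
recovered = estimate_x0(xt, noise, alpha_bars[500])
```

With a perfect noise estimate the clean sample is recovered exactly; in practice the estimate is imperfect, which is why sampling proceeds over many small steps.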


![Diffusion Architecture](Stable_Diffusion_architecture-3150774982.png)

Ref: https://jalammar.github.io/illustrated-stable-diffusion/

### 3. Autoencoders & Variants

🔹 **Variational Autoencoders (VAEs)**
- Purpose: Learn latent representations for generation
- Probabilistic: Encodes each input as a distribution over latent variables (typically a Gaussian mean and variance) rather than a single point
- Use cases: Image generation, anomaly detection
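
Two pieces make the "encodes input into a distribution" idea concrete: sampling a latent vector via the reparameterization trick, and the KL term that pulls the encoded distribution toward a standard normal. The sketch below assumes a diagonal Gaussian posterior; the `mu` and `log_var` values are made up rather than produced by a trained encoder.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, with eps ~ N(0, 1) and sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.3, -0.7], [0.0, -1.0]  # hypothetical encoder outputs
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Writing the sample as a deterministic function of `mu`, `log_var`, and external noise is what lets gradients flow back into the encoder during training.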
  
🔹 **Denoising Autoencoders**
- Trained to reconstruct corrupted input
- Used in: Pretraining (e.g., BART corrupts text and learns to reconstruct)
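
A denoising objective needs pairs of (corrupted input, original target). The sketch below uses simple independent token masking as the corruption; BART's actual noising functions are richer, so treat this as a simplified stand-in. `corrupt` and the `<mask>` token are illustrative names.

```python
import random


def corrupt(tokens, mask_prob, rng, mask_token="<mask>"):
    """Replace each token with mask_token independently with probability mask_prob."""
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]


original = ["the", "cat", "sat", "on", "the", "mat"]
noisy = corrupt(original, mask_prob=0.3, rng=random.Random(0))
training_pair = (noisy, original)  # (model input, reconstruction target)
```

The model is trained to map the first element of `training_pair` back to the second.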


Ref: https://pyimagesearch.com/2023/10/02/a-deep-dive-into-variational-autoencoders-with-pytorch/

![VAE](vae-diagram-1-2048x1126-1442333120.jpg)

### 4. Retrieval-Augmented Generation (RAG)

Combines retrieval with generation to improve factual accuracy.

- Architecture:
    - Retriever encodes the query and fetches relevant documents (e.g., using vector search)
    - Generator produces output conditioned on the query plus the retrieved context
- Examples: RAG (Facebook), RETRO (DeepMind), Atlas
- Use cases: Open-domain QA, chatbots with knowledge grounding
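
The retrieve-then-generate flow can be sketched without any external services. In this minimal example the embeddings are hand-written hypothetical vectors (a real system would compute them with an embedding model and store them in a vector index), and the final step only builds the prompt that would be passed to a generator.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the texts of the k entries most similar to the query. index: list of (text, vector)."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question so generation is grounded in them."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical document embeddings, written by hand for illustration.
INDEX = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The Nile flows through Egypt.", [0.0, 0.2, 0.9]),
    ("France borders Spain.", [0.7, 0.6, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of the question
passages = retrieve(query_vec, INDEX, k=2)
prompt = build_prompt("What is the capital of France?", passages)
```

Only the two France-related passages reach the prompt; the unrelated passage is filtered out by the similarity ranking.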


![RAG](EnterpriseRAG-2925482421.png)

Ref: https://medium.com/enterprise-rag/an-introduction-to-rag-and-simple-complex-rag-9c3aa9bd017b

### 🧮 5. Mixture of Experts (MoE)

- Architecture: Multiple expert subnetworks; only a few are activated per input
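- Scalable: Total parameter count can grow with the number of experts while per-token compute stays roughly constant, because each token runs through only the selected experts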
- Examples: GLaM (Google), Switch Transformer
- Use cases: Efficient large-scale generation
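
The "only a few are activated per input" behaviour comes from a router that picks the top-k experts and mixes their outputs. A minimal sketch, assuming scalar inputs and toy expert functions; in a real layer the router logits come from a learned projection of the token, whereas here they are passed in as hypothetical fixed values.

```python
import math


def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their scores.

    Returns {expert_index: weight}; experts outside the top k get no weight.
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by gate weight."""
    gate = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gate.items())


# Toy experts: each is just a scalar function standing in for a feed-forward block.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

gate = top_k_gate(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
```

Experts 0 and 2 are never called for this token, which is where the compute saving comes from.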


![MOE](moe_block-1959552305.png)

Ref: https://huggingface.co/blog/moe

### 🧠 6. Multimodal Architectures

Designed to handle multiple input types (text, image, audio).

🔹 Examples:
- CLIP: Connects images and text in a shared embedding space
- Flamingo, GPT-4V: Vision-language models
- Gemini, Kosmos: Unified models for text, image, and more
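
A shared embedding space means an image vector and a caption vector can be compared directly. The sketch below shows CLIP-style zero-shot scoring with cosine similarity followed by a softmax; the embeddings are hypothetical hand-written values (a real system would obtain them from the image and text encoders), and the temperature value is illustrative.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_scores(image_vec, text_vecs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities between one image and each caption."""
    img = normalize(image_vec)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) / temperature for t in text_vecs]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


captions = ["a photo of a dog", "a photo of a cat"]
text_vecs = [[0.9, 0.1], [0.1, 0.9]]  # hypothetical caption embeddings
image_vec = [0.8, 0.3]                # hypothetical image embedding

probs = zero_shot_scores(image_vec, text_vecs)
best_caption = captions[probs.index(max(probs))]
```

The image vector points in nearly the same direction as the first caption's vector, so that caption receives the highest score.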


![Multi-modal](multimodal-arch.jpg)

Ref: https://slds-lmu.github.io/seminar_multimodal_dl/c02-00-multimodal.html