
简体中文 (Simplified Chinese)

llm-scratch-pytorch

llm-scratch-pytorch: beginner-friendly code focused on understanding the fundamentals of PyTorch and implementing LLMs from scratch, step by step.

Table of Contents

  • Installation
  • Process
  • Reference
  • Tools
  • Acknowledgments

Installation

To install the required dependencies, run:

pip install -r requirements.txt

If you have a CUDA-capable GPU and want to use it for acceleration, make sure the appropriate CUDA toolkit and drivers are installed. PyTorch detects CUDA automatically when it is available; you can verify this from Python with:

import torch
print(torch.cuda.is_available())

If this prints True, your environment is ready for GPU acceleration.
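
Note that models and tensors still need to be moved to the GPU explicitly. A minimal device-selection sketch (the same idea as the "auto-detect the device" step in the checklist below; the tensor here is just a placeholder):

import torch

# Prefer CUDA, fall back to Apple-silicon MPS, otherwise use the CPU.
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"using device: {device}")

# Models and tensors only run on the accelerator once moved there explicitly.
x = torch.randn(4, 4).to(device)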

If you don't have a local GPU environment, we recommend using the Runpod cloud environment (referral link: https://runpod.io?ref=4dzcggxy). By signing up through this link, you'll get:

  • A one-time random credit bonus of $5-$500 when you sign up through the referral link and spend $10
  • Instant access to GPU resources to get started right away

Process

  • [✅] grad basis
  • [✅] partial derivatives
  • [✅] compute graph
  • [✅] forward & backward
  • [✅] torch_variables_grad_inplace_operation
  • [✅] retain_graph
  • [✅] exploring the GPT-2 (124M) OpenAI checkpoint
  • [✅] SECTION 1: implementing the GPT-2 nn.Module
  • [✅] loading the huggingface/GPT-2 parameters
  • [✅] implementing the forward pass to get logits
  • [✅] sampling init, prefix tokens, tokenization
  • [✅] sampling loop
  • [✅] sample, auto-detect the device
  • [✅] let’s train: data batches (B,T) → logits (B,T,C)
  • [✅] cross entropy loss
  • [✅] optimization loop: overfit a single batch
  • [✅] data loader lite
  • [✅] parameter sharing wte and lm_head
  • [✅] model initialization: std 0.02, residual init
  • [✅] SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
  • [✅] Tensor Cores, timing the code, TF32 precision, 333ms
  • [✅] float16, gradient scalers, bfloat16, 300ms
  • [✅] torch.compile, Python overhead, kernel fusion, 130ms
  • [✅] flash attention, 96ms
  • [✅] nice/ugly numbers. vocab size 50257 → 50304, 93ms
  • [✅] SECTION 3: hyperparameters, AdamW, gradient clipping
  • [✅] learning rate scheduler: warmup + cosine decay (see the sketch after this list)
  • [✅] batch size schedule, weight decay, FusedAdamW, 90ms
  • [✅] gradient accumulation (see the sketch after this list)
  • [❌] distributed data parallel (DDP)
  • [❌] datasets used in GPT-2, GPT-3, FineWeb (EDU)
  • [❌] validation data split, validation loss, sampling revive
  • [❌] evaluation: HellaSwag, starting the run
  • [❌] SECTION 4: results in the morning! GPT-2, GPT-3 repro
  • [❌] shoutout to llm.c, equivalent but faster code in raw C/CUDA
  • [❌] summary, phew, build-nanogpt github repo
  • [❌] Introduction
  • [❌] LLaMA Architecture
  • [❌] Embeddings
  • [❌] Coding the Transformer
  • [❌] Rotary Positional Embedding
  • [❌] RMS Normalization
  • [❌] Encoder Layer
  • [❌] Self Attention with KV Cache
  • [❌] Grouped Query Attention
  • [❌] Coding the Self Attention
  • [❌] Feed Forward Layer with SwiGLU
  • [❌] Model weights loading
  • [❌] Inference strategies
  • [❌] Greedy Strategy
  • [❌] Beam Search
  • [❌] Temperature
  • [❌] Random Sampling
  • [❌] Top K
  • [❌] Top P
  • [❌] Coding the Inference
  • [❌] Multi-Head Attention
  • [❌] Why Flash Attention
  • [❌] Safe Softmax
  • [❌] Online Softmax
  • [❌] Online Softmax (Proof)
  • [❌] Block Matrix Multiplication
  • [❌] Flash Attention forward (by hand)
  • [❌] Flash Attention forward (paper)
  • [❌] Intro to CUDA with examples
  • [❌] Tensor Layouts
  • [❌] Intro to Triton with examples
  • [❌] Flash Attention forward (coding)
  • [❌] LogSumExp trick in Flash Attention 2
  • [❌] Derivatives, gradients, Jacobians
  • [❌] Autograd
  • [❌] Jacobian of the MatMul operation
  • [❌] Jacobian through the Softmax
  • [❌] Flash Attention backwards (paper)
  • [❌] Flash Attention backwards (coding)
  • [❌] Triton Autotuning
  • [❌] Triton tricks: software pipelining
  • [❌] Running the code
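
For the "learning rate scheduler: warmup + cosine decay" entry above, the schedule ramps the learning rate up linearly during warmup and then decays it along a cosine curve toward a floor. A minimal sketch, with illustrative hyperparameters rather than the exact values used in this repository:

import math

max_lr = 6e-4          # illustrative peak learning rate
min_lr = max_lr * 0.1  # illustrative floor after decay
warmup_steps = 10
max_steps = 50

def get_lr(step):
    # 1) linear warmup toward max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past max_steps, hold the minimum learning rate
    if step > max_steps:
        return min_lr
    # 3) otherwise, cosine decay from max_lr down to min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

At each training step you would assign get_lr(step) to the "lr" of every parameter group before calling optimizer.step().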
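
The "gradient accumulation" entry amounts to summing gradients over several micro-batches before a single optimizer step, so a large effective batch fits in limited GPU memory. A self-contained sketch with a placeholder model and random stand-in data:

import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 10)   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
grad_accum_steps = 4              # illustrative number of micro-batches per optimizer step

optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = torch.randn(8, 10), torch.randn(8, 10)   # stand-in micro-batch
    loss = F.mse_loss(model(x), y)
    # Divide by the number of micro-batches so the accumulated gradient
    # matches the mean over the full effective batch.
    (loss / grad_accum_steps).backward()
optimizer.step()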

Reference

Key educational resources and implementations that inspired this work:

  • PyTorch Grad Tutorials: Practical examples demonstrating PyTorch's automatic differentiation system and gradient computation.
  • Computation Graph Visualization: Interactive notebook explaining how PyTorch constructs and traverses computation graphs during backpropagation.
  • Forward & Backward Pass: Step-by-step walkthrough of neural network forward/backward operations with PyTorch internals.
  • NanoGPT Implementation: Andrej Karpathy's minimal GPT implementation that clearly demonstrates transformer architecture essentials.
  • PyTorch LLaMA: Clean, hackable implementation of the LLaMA architecture in pure PyTorch for educational purposes.
  • Triton & CUDA Flash Attention: Reference implementation of Flash Attention using Triton and CUDA, providing efficient attention mechanisms for large language models.

Tools

We recommend the following tools to help with model development and optimization:

  • Tokenizer: An interactive tokenizer playground that helps visualize and understand how text gets tokenized, useful for prompt engineering and debugging.
  • KV Cache Size Calculator: A handy calculator for estimating GPU memory requirements of key-value caches in transformer models, crucial for optimizing inference performance (a rough sizing formula is sketched below).
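
As a rough cross-check of the calculator above, the key-value cache takes about 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. A tiny helper, using illustrative, roughly LLaMA-7B-like numbers in fp16:

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Factor of 2: both keys and values are cached at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# 32 layers, 32 KV heads, head_dim 128, 4096-token context, batch 1, fp16 -> 2.0 GiB
print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30, "GiB")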

Acknowledgments

We sincerely appreciate the following individuals and organizations for their contributions and inspiration:

  • PyTorch: For building the foundational deep learning framework that powers this project. The PyTorch team’s dedication to open-source innovation has been invaluable.
  • chunhuizhang: Thank you for your technical insights and collaborative efforts, which helped improve critical components of this work.
  • Andrej Karpathy: Your pioneering research and educational contributions in AI have deeply influenced our approach. We’re grateful for your leadership.
  • yihong0618: Your creative implementations and open-source spirit have motivated us to push boundaries. Thank you for sharing your work with the world.
  • OpenAI: For advancing the field with groundbreaking research and tools that empower projects like ours. Your commitment to ethical AI development sets a vital example.
  • Hugging Face: For democratizing NLP with open models and libraries that accelerated our development. Your inclusive ecosystem sets a benchmark for the field.
  • Umar Jamil: Your thoughtful feedback and technical expertise have strengthened this project. Thank you for your dedication and collaborative spirit.

This work thrives on the collective effort of open-source contributors; we aim to honor their spirit of collaboration.
