Created by Roberto H Luna - Open source education for transformer technology
This repository contains complete from-scratch implementations of both Translation Transformers and Large Language Models (LLMs), designed specifically for educational purposes. Learn how modern AI works by building it yourself!
🔥 What's Inside:
- Translation Model: Encoder-decoder transformer (English ↔ Italian)
- Language Model: GPT-style decoder-only transformer for text generation
- Attention Visualization: See what the models actually learn
- Educational Examples: Step-by-step concept explanations
This implementation faithfully follows the architecture from "Attention is All You Need" by Vaswani et al., with educational extensions for modern LLM understanding.
This implementation helps you understand:
- Multi-head attention mechanism and why it's revolutionary
- Positional encoding and how transformers handle sequence order
- Encoder-decoder architecture for sequence-to-sequence tasks (translation)
- Decoder-only architecture for autoregressive generation (LLMs)
- Training dynamics of both translation and language models
- Attention visualization to see what the models learn
- Text generation and sampling strategies
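As a taste of the sampling-strategy material, here is a minimal temperature-sampling sketch. It is standalone NumPy, not the repository's code — the function names are illustrative:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a next-token id from raw logits with temperature scaling.

    temperature < 1.0 sharpens the distribution (more greedy),
    temperature > 1.0 flattens it (more diverse).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 0.5, 1.0]
greedy = int(np.argmax(logits))  # greedy decoding is the temperature → 0 limit
```

Greedy decoding always picks the argmax; sampling with a temperature trades determinism for diversity, which is why generation scripts usually expose it as a knob.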
```
Input Embeddings + Positional Encoding
                ↓
        Encoder Stack (6 layers)
      ┌─────────────────────┐
      │ Multi-Head Attention│
      │ + Residual & Norm   │
      ├─────────────────────┤
      │ Feed Forward        │
      │ + Residual & Norm   │
      └─────────────────────┘
                ↓
        Decoder Stack (6 layers)
      ┌─────────────────────┐
      │ Masked Self-Attn    │
      │ + Residual & Norm   │
      ├─────────────────────┤
      │ Cross Attention     │
      │ + Residual & Norm   │
      ├─────────────────────┤
      │ Feed Forward        │
      │ + Residual & Norm   │
      └─────────────────────┘
                ↓
          Linear + Softmax
```
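The "Input Embeddings + Positional Encoding" stage uses the paper's sinusoidal scheme. A minimal NumPy sketch of the same computation that `PositionalEncoding` in `model.py` performs (standalone and illustrative, not the repository's exact code):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(seq_len=350, d_model=512)
# pe is added element-wise to the token embeddings before the first layer
```

Because each position gets a unique, smoothly varying pattern of sines and cosines, the otherwise order-blind attention layers can recover sequence order.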
```
Input Embeddings + Positional Encoding
                ↓
        Decoder Stack (12+ layers)
      ┌─────────────────────┐
      │ Causal Self-Attn    │
      │ + Residual & Norm   │
      ├─────────────────────┤
      │ Feed Forward        │
      │ + Residual & Norm   │
      └─────────────────────┘
                ↓
      Language Head → Next Token
```
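The "Causal Self-Attn" box relies on a lower-triangular mask so position *t* can only attend to positions ≤ *t*. The repository exposes a `causal_mask` helper in `llm_model.py`; this standalone NumPy version is an illustrative sketch, not that exact function:

```python
import numpy as np

def causal_mask(size):
    """Boolean (size, size) mask: True where attention is allowed."""
    return np.tril(np.ones((size, size), dtype=bool))

mask = causal_mask(4)
# Row t is True for columns 0..t: token t attends only to itself and the past.
# Before the softmax, masked-out scores are set to -inf so they receive weight 0.
```

This single mask is the whole difference between bidirectional encoder attention and autoregressive generation.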
- `model.py` - Complete transformer architecture implementation
  - `LayerNormalization` - Layer normalization for stable training
  - `MultiHeadAttentionBlock` - The heart of the transformer
  - `FeedForwardBlock` - Position-wise feed-forward networks
  - `PositionalEncoding` - Sinusoidal positional embeddings
  - `EncoderBlock` & `DecoderBlock` - Complete encoder/decoder layers
  - `Transformer` - Full translation model combining all components
- `llm_model.py` - GPT-style language model implementation
  - `LLMDecoderBlock` - Simplified decoder for language modeling
  - `LanguageModel` - Complete LLM with text generation
  - `causal_mask` - Autoregressive attention masking
  - `build_language_model` - LLM factory function
- `train.py` - Translation model training script with validation
- `train_llm.py` - Language model training script with text generation
- `dataset.py` - Bilingual dataset processing and tokenization
- `config.py` - Model and training configurations
- `translate.py` - Interactive translation interface
- `inference.py` - Model evaluation and testing
- `attention.py` - 🎨 Attention visualization (must-see!)
- `beam_search.py` - Advanced decoding strategies
- `educational_examples.py` - 📚 Step-by-step concept explanations
- `llm_playground.py` - 🎮 Interactive LLM experimentation
- `llm_attention.py` - 🔍 LLM-specific attention visualization
- `model_comparison.py` - ⚖️ Architecture comparison tool
- `dataset_preparation.py` - 🛠️ Complete dataset preprocessing pipeline
- `data_examples.py` - 📝 Dataset structure examples for different domains
- `train_wb.py` - Training with Weights & Biases integration
- `local_train.py` - Local training setup
```bash
pip install -r requirements.txt
```

```bash
# Translation model training
python train.py

# Language model training
python train_llm.py

# With W&B logging
python train_wb.py
```

```bash
# Translate text (EN ↔ IT)
python translate.py

# Generate text with LLM: load your trained LLM, then call generate_text
python -c "from train_llm import generate_text, torch"
```

```bash
# See dataset structure examples
python data_examples.py

# Prepare custom dataset
python dataset_preparation.py
```

```bash
# Step-by-step concept explanations
python educational_examples.py

# Interactive LLM playground
python llm_playground.py

# LLM attention patterns
python llm_attention.py

# Compare architectures
python model_comparison.py
```

```bash
python attention.py
```

- `dataset_preparation.py` - Full preprocessing pipeline with quality filtering
- `data_examples.py` - Real dataset structure examples for 5+ domains
- Text cleaning, tokenization, and quality analysis tools
- `educational_examples.py` - Step-by-step concept walkthroughs
- `llm_playground.py` - Real-time text generation experimentation
- `model_comparison.py` - Side-by-side architecture analysis
- `attention.py` - Translation model attention patterns
- `llm_attention.py` - LLM causal attention analysis
- Interactive exploration of different layers and heads
- Prompt vs generation attention comparison
Every operation includes:
- Tensor shape comments (e.g., `# (batch, seq_len, d_model)`)
- Mathematical formulas from the paper
- Step-by-step transformations

```python
# Apply the attention formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V
attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
```

- Standard training with TensorBoard
- Weights & Biases integration for experiment tracking
- Local training for development
Visual side-by-side comparison of:
- Translation models vs LLMs
- Parameter count analysis across model sizes
- Task suitability comparisons
- Computational complexity analysis
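For intuition about the parameter-count analysis, here is a back-of-the-envelope estimator. It is a deliberately rough approximation (it counts only attention and feed-forward weight matrices per layer, ignoring embeddings, biases, norms, and cross-attention) and is not the repository's comparison code:

```python
def approx_transformer_params(d_model, num_layers, d_ff=None):
    """Rough per-layer weight count: 4 * d_model^2 for the Q/K/V/output
    projections plus 2 * d_model * d_ff for the feed-forward block."""
    d_ff = d_ff or 4 * d_model          # the paper's default ratio
    attn = 4 * d_model * d_model
    ffn = 2 * d_model * d_ff
    return num_layers * (attn + ffn)

# 12 layers at d_model=512 (roughly this repo's 6 encoder + 6 decoder layers)
total = approx_transformer_params(512, 12)  # → 37748736, i.e. ~38M weights
```

Even this crude count shows why doubling `d_model` roughly quadruples the parameter budget: every term scales with `d_model` squared.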
Dataset quality is THE most important factor for LLM performance. This repository includes comprehensive tools for preparing high-quality training data:
```bash
python data_examples.py
```

Shows 5 different dataset formats:
- General Text - Books, articles, web content
- Code Generation - Programming instruction-response pairs
- Conversational - Multi-turn chat data
- Instruction Following - Task-oriented examples
- Domain-Specific - Medical, legal, technical content
```bash
python dataset_preparation.py
```

Full pipeline including:
- Text Analysis - Character/word distributions, quality metrics
- Cleaning - Remove URLs, normalize whitespace, filter low-quality
- Quality Filtering - Repetition detection, language identification
- Tokenization - BPE or WordLevel with custom vocabularies
- Dataset Creation - Proper train/val/test splits with HuggingFace format
- Quality > Quantity - 10K high-quality examples beat 100K poor ones
- Diversity Matters - Mix sources, topics, and writing styles
- Domain-Specific - Adapt preprocessing for your target use case
- Tokenization Strategy - Choose BPE for efficiency, WordLevel for interpretability
- Proper Splits - Prevent overfitting with clean validation sets
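The cleaning and filtering ideas above can be sketched in plain Python. This is a simplified stand-in for a full pipeline like `dataset_preparation.py`; the thresholds and function names are illustrative, not the repository's actual API:

```python
import re

def clean_text(text):
    """Remove URLs and normalize whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def passes_quality(text, min_words=5, max_repeat_ratio=0.5):
    """Reject very short texts and texts dominated by repeated words."""
    words = text.split()
    if len(words) < min_words:
        return False
    return len(set(words)) / len(words) >= (1 - max_repeat_ratio)

raw = ["Visit   https://example.com   for more!", "spam spam spam spam spam spam"]
cleaned = [clean_text(t) for t in raw]
kept = [t for t in cleaned if passes_quality(t)]  # both examples get filtered out
```

Real pipelines layer on language identification, deduplication, and tokenizer-aware length filtering, but the shape is the same: clean first, then filter, then split.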
```python
from dataset_preparation import DatasetPreparator

# Initialize preparator
prep = DatasetPreparator("my_dataset")

# Analyze raw data
analysis = prep.analyze_raw_text(my_texts)

# Clean and filter
cleaned = prep.quality_filter(my_texts)

# Create tokenizer and dataset
tokenizer = prep.create_tokenizer(cleaned, vocab_size=10000)
datasets = prep.create_training_dataset(cleaned, tokenizer)
```

Edit `config.py` to experiment with different settings:
```python
{
    "d_model": 512,    # Model dimension
    "num_heads": 8,    # Number of attention heads
    "num_layers": 6,   # Number of encoder/decoder layers
    "seq_len": 350,    # Maximum sequence length
    "dropout": 0.1,    # Dropout rate
    "lang_src": "en",  # Source language
    "lang_tgt": "it",  # Target language
}
```

- Loss curves - Monitor convergence
- BLEU scores - Translation quality
- Character/Word Error Rates - Accuracy metrics
Look for these interesting patterns in attention visualizations:
- Diagonal patterns in self-attention (attending to nearby words)
- Alignment patterns in cross-attention (source-target word relationships)
- Specialized heads that focus on different linguistic phenomena
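Under the hood, the weights you visualize are just the softmax of the scaled dot-product scores. A minimal NumPy sketch of how such a weight matrix is obtained (standalone, not the repository's plotting code in `attention.py`):

```python
import numpy as np

def attention_weights(query, key, mask=None):
    """softmax(Q K^T / sqrt(d_k)), optionally with a boolean allow-mask."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)      # (q_len, k_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~0 weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 64))
k = rng.normal(size=(5, 64))
w = attention_weights(q, k)
# Each row of w sums to 1; rendering w as a heatmap is exactly what produces
# the diagonal and alignment patterns described above.
```

Applying a lower-triangular mask here zeroes the upper triangle of the heatmap, which is why LLM attention plots always look staircase-shaped.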
- Read the transformer paper
- Understand the `MultiHeadAttentionBlock` class
- Run `attention.py` to visualize attention
- Train on a small dataset
- Experiment with different model sizes
- Implement custom decoding strategies
- Add new positional encoding schemes
- Try different datasets/language pairs
- Implement optimization techniques (gradient clipping, warm-up)
- Add regularization methods
- Experiment with model architectures
- Scale to larger datasets
- Default Model Size: 512 dimensions, 8 heads, 6 layers
- Vocabulary: Built using HuggingFace tokenizers
- Dataset: OPUS Books (English ↔ Italian)
- Training: Adam optimizer with label smoothing
- Xavier initialization for stable training
- Residual connections prevent vanishing gradients
- Layer normalization for training stability
- Causal masking in decoder for autoregressive generation
- OOM Errors: Reduce batch size or sequence length
- Slow Convergence: Check learning rate and warm-up schedule
- Poor Translation: Ensure sufficient training data and epochs
- Blank Plots: Check model weights are loaded correctly
- Unclear Patterns: Try different layers/heads or longer training
- Attention is All You Need - Original paper
- The Illustrated Transformer - Excellent visual guide
- Annotated Transformer - Harvard's implementation guide
This is an educational project by Roberto H Luna, created to help more people understand transformer technology through open source learning! Contributions that improve education are welcome:
- Better documentation and comments
- Additional visualization tools
- More educational examples
- Bug fixes and improvements
- New model architectures for learning
MIT License - Feel free to use for educational purposes!
Created by Roberto H Luna with the mission of making transformer technology accessible to everyone through open source education. Special thanks to the authors of "Attention is All You Need" and the broader AI research community.
Happy Learning! 🎉
Remember: The goal isn't just to run the code, but to understand how transformers and LLMs work under the hood.
Learn → Build → Teach → Repeat 🚀