A modular transformer inference engine built from scratch in PyTorch.
MiniLLM is a modular transformer inference engine for understanding and engineering the internals of modern large language model (LLM) systems.
The project explores:
- Transformer architecture
- Self-attention mechanisms
- Autoregressive generation
- KV cache optimization
- Quantized inference
- Runtime benchmarking
- Local GPU inference
- FastAPI-based serving
Instead of relying entirely on external APIs, MiniLLM focuses on building and understanding the inference pipeline itself.
┌────────────────────┐
│    User Prompt     │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│     Tokenizer      │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│  Embedding Layer   │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ Transformer Blocks │
│   Self Attention   │
│  Feed Forward Net  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ Inference Runtime  │
│  KV Cache Engine   │
│ Sampling Pipeline  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│  Generated Tokens  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│  FastAPI Serving   │
└────────────────────┘
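
To make the "KV Cache Engine" stage of the pipeline concrete, here is a minimal sketch of why caching works during token-by-token decoding. It is not MiniLLM's actual implementation: the `KVCache` class, the `attend` helper, and the toy vectors are all hypothetical, and plain Python lists stand in for PyTorch tensors to keep the example dependency-free. The idea is that each decode step appends only the newest key and value, then attends over everything cached so far, which should give the same result as recomputing attention over the whole prefix.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Hypothetical single-head cache: one key and one value per generated position."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append only the newest position, then attend over everything cached so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy per-position (query, key, value) vectors. A real model would compute
# these from token embeddings through learned projections.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in steps]

# Reference: recompute attention over the full prefix at the last position.
last_query = steps[-1][0]
full_output = attend(last_query, [k for _, k, _ in steps], [v for _, _, v in steps])
```

Without the cache, every step would recompute keys and values for the entire prefix, so the cache trades memory (one key/value pair per position, per layer, per head) for less repeated work per generated token.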
- Token embeddings
- Positional embeddings
- Multi-head self-attention
- Feedforward layers
- Residual connections
- Layer normalization
- Autoregressive generation
- Token-by-token generation
- Greedy decoding
- Top-K sampling
- Top-P sampling
- Temperature scaling
- CUDA acceleration
- CPU/GPU execution support
- KV cache support
- Runtime benchmarking
- Throughput analysis
- Latency profiling
- VRAM monitoring
- Quantization experiments
- FastAPI inference server
- Async request handling
- Local deployment pipeline
- Streaming generation support
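
The decoding strategies listed above (greedy decoding, temperature scaling, Top-K, and Top-P) can be combined in a single sampling step. The following is a minimal sketch under stated assumptions, not the project's actual sampler: the function name `sample_next_token`, its parameters, and the convention that `temperature=0` means greedy are all hypothetical, and it operates on a plain list of logits rather than a PyTorch tensor.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits.

    temperature == 0 selects greedy decoding; top_k == 0 and top_p == 1.0
    disable the respective filters.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-K: keep only the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-P (nucleus): keep the smallest prefix whose cumulative
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices treats the weights as relative, so the surviving
    # probabilities do not need to be renormalized by hand.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
greedy = sample_next_token(logits, temperature=0)
sampled = sample_next_token(
    logits, temperature=0.8, top_k=3, top_p=0.9, rng=random.Random(0)
)
```

Lower temperatures sharpen the distribution toward the most likely token, while Top-K and Top-P both cut off the low-probability tail: Top-K by a fixed count, Top-P by cumulative probability mass.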
| Layer | Technology |
|---|---|
| Programming Language | Python |
| Deep Learning Framework | PyTorch |
| GPU Acceleration | CUDA |
| API Framework | FastAPI |
| Numerical Computing | NumPy |
| Deployment | Docker |
| Vector Search | FAISS (planned) |
miniLLM/
│
├── README.md
├── requirements.txt
├── .gitignore
├── LICENSE
│
├── src/
│   ├── model/
│   ├── tokenizer/
│   ├── inference/
│   ├── training/
│   ├── config/
│   └── utils/
│
├── api/
│   ├── routes/
│   ├── schemas/
│   └── main.py
│
├── benchmarks/
├── tests/
├── docs/
├── scripts/
└── assets/
- Build transformer blocks from scratch
- Implement custom tokenizer
- Add KV cache optimization
- Build optimized inference runtime
- Benchmark latency and throughput
- Add quantized inference support
- Implement FastAPI serving layer
- Add streaming token generation
- Retrieval-Augmented Generation (RAG)
- FAISS vector database integration
- Concurrent inference batching
- Distributed inference experiments
- Dockerized deployment
- Local web-based inference UI
- NVIDIA RTX 4060 Laptop GPU
- CUDA-enabled PyTorch
- Python 3.x
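
As an illustration of the latency and throughput benchmarking mentioned earlier, a small timing helper might look like the sketch below. Everything here is an assumption for illustration: `benchmark_generation`, its parameters, and the `fake_generate` stand-in are hypothetical names, and the real benchmarks in `benchmarks/` may be structured differently.

```python
import time


def benchmark_generation(generate_fn, prompt, max_new_tokens, warmup=1, runs=3):
    """Time a generation callable and report mean latency and tokens per second.

    generate_fn(prompt, max_new_tokens) is assumed to return a list of new tokens.
    """
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        generate_fn(prompt, max_new_tokens)

    latencies, token_counts = [], []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt, max_new_tokens)
        latencies.append(time.perf_counter() - start)
        token_counts.append(len(tokens))

    total_time = sum(latencies)
    return {
        "mean_latency_s": total_time / runs,
        "tokens_per_s": sum(token_counts) / total_time if total_time > 0 else float("inf"),
        "runs": runs,
    }


def fake_generate(prompt, max_new_tokens):
    # Stand-in for a real model call: sleeps briefly per "token".
    out = []
    for i in range(max_new_tokens):
        time.sleep(0.001)
        out.append(i)
    return out


stats = benchmark_generation(fake_generate, "hello", max_new_tokens=5)
```

When timing on a GPU, note that CUDA kernels launch asynchronously, so calling `torch.cuda.synchronize()` before reading the timer is generally needed for wall-clock numbers to reflect the actual compute time.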
Most projects stop at consuming LLM APIs.
MiniLLM focuses on understanding and engineering the systems behind transformer inference, runtime optimization, scalable serving, and local AI infrastructure.