MiniLLM

A modular transformer inference engine built from scratch with PyTorch.

Overview

MiniLLM is a modular transformer inference engine built to study and engineer the internals of modern large language model (LLM) systems.

The project explores:

Transformer architecture
Self-attention mechanisms
Autoregressive generation
KV cache optimization
Quantized inference
Runtime benchmarking
Local GPU inference
FastAPI-based serving

Instead of relying entirely on external APIs, MiniLLM focuses on building and understanding the inference pipeline itself.

System Architecture

                 ┌────────────────────┐
                 │    User Prompt     │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │     Tokenizer      │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │   Embedding Layer  │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Transformer Blocks │
                 │  Self Attention    │
                 │ Feed Forward Net   │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Inference Runtime  │
                 │  KV Cache Engine   │
                 │ Sampling Pipeline  │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Generated Tokens   │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ FastAPI Serving    │
                 └────────────────────┘

Features

Transformer Architecture

Token embeddings
Positional embeddings
Multi-head self-attention
Feedforward layers
Residual connections
Layer normalization
Autoregressive generation
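
As an illustration of the attention component listed above, here is a minimal single-head causal self-attention sketch. It is written in dependency-free Python for readability; the project itself targets PyTorch tensors, and the function names (`softmax`, `causal_self_attention`) are hypothetical, not taken from the MiniLLM source.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (each a list of floats).
    Position t only attends to positions 0..t, which is what
    makes autoregressive generation possible.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for t in range(len(q)):
        # Scores against the current and all earlier positions only.
        scores = [
            scale * sum(a * b for a, b in zip(q[t], k[s]))
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0]]
out = causal_self_attention(x, x, x)
```

The first position can only see itself, so its output equals its own value vector; later positions mix in earlier ones. A multi-head version would run several such heads on projected slices of the input and concatenate the results.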

Inference Runtime

Token-by-token generation
Greedy decoding
Top-K sampling
Top-P sampling
Temperature scaling
CUDA acceleration
CPU/GPU execution support
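
The decoding strategies above can be combined in a single sampling step. The following is a rough sketch, assuming raw logits arrive as a plain list of floats; `sample_next_token` and its parameters are illustrative names, not the project's actual API.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token id from raw logits.

    temperature <= 0 selects greedy decoding (argmax).
    top_k > 0 keeps only the k most likely tokens.
    top_p < 1 keeps the smallest set of most likely tokens
    whose cumulative probability reaches top_p.
    """
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k > 0:
        order = order[:top_k]

    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices accepts unnormalized weights.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 0.5]
greedy_id = sample_next_token(logits, temperature=0.0)
sampled_id = sample_next_token(logits, temperature=0.8, top_k=2, top_p=0.95)
```

Token-by-token generation then amounts to calling a step like this in a loop, appending the chosen id to the context each time until a stop condition is met.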

Optimization & Systems Engineering

KV cache support
Runtime benchmarking
Throughput analysis
Latency profiling
VRAM monitoring
Quantization experiments
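
To show the idea behind KV cache support, here is a small sketch for a single attention head, again in plain Python rather than PyTorch. The `KVCache` class and `decode_step` function are hypothetical names used for illustration only.

```python
import math


class KVCache:
    """Keeps the key and value vectors of every token processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * v[j] for e, v in zip(exps, values))
        for j in range(len(values[0]))
    ]


def decode_step(cache, q, k, v):
    """Process one new token: store its key/value, then attend over the cache.

    Only the new token's q, k, v are computed per step; earlier keys and
    values are reused instead of being recomputed from the full sequence.
    """
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


# Each tuple is (query, key, value) for one generated token.
qkv = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]
cache = KVCache()
outs = [decode_step(cache, q, k, v) for q, k, v in qkv]
```

The trade-off is memory for compute: the cache grows by one key and one value per layer and head for every generated token, which is one reason VRAM monitoring sits alongside it in the list above.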

API & Serving

FastAPI inference server
Async request handling
Local deployment pipeline
Streaming generation support
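
Streaming generation can be modeled as an async generator that yields one token at a time. The sketch below uses only `asyncio`; `fake_next_token` is a placeholder for a real model step, and all names are assumptions rather than the project's code. In a FastAPI app, a generator of this shape could be wrapped in `StreamingResponse`.

```python
import asyncio


def fake_next_token(step):
    """Placeholder for a real model forward pass plus sampling."""
    return f"tok{step}"


async def stream_tokens(prompt, max_new_tokens=5):
    """Yield generated tokens one by one instead of returning them all at once."""
    for step in range(max_new_tokens):
        # Hand control back to the event loop so other requests can progress.
        await asyncio.sleep(0)
        yield fake_next_token(step)


async def collect(prompt, max_new_tokens=5):
    """Consume the stream; a server would forward each token to the client."""
    return [tok async for tok in stream_tokens(prompt, max_new_tokens)]


tokens = asyncio.run(collect("Hello", max_new_tokens=3))
```

A real model step is CPU- or GPU-bound, so a production handler would likely need to run it off the event loop (for example in a worker thread) to keep request handling responsive.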

Tech Stack

Layer                      Technology
Programming Language       Python
Deep Learning Framework    PyTorch
GPU Acceleration           CUDA
API Framework              FastAPI
Numerical Computing        NumPy
Deployment                 Docker
Vector Search              FAISS (planned)

Project Structure

miniLLM/
│
├── README.md
├── requirements.txt
├── .gitignore
├── LICENSE
│
├── src/
│   ├── model/
│   ├── tokenizer/
│   ├── inference/
│   ├── training/
│   ├── config/
│   └── utils/
│
├── api/
│   ├── routes/
│   ├── schemas/
│   └── main.py
│
├── benchmarks/
├── tests/
├── docs/
├── scripts/
└── assets/

Current Goals

Build transformer blocks from scratch
Implement custom tokenizer
Add KV cache optimization
Build optimized inference runtime
Benchmark latency and throughput
Add quantized inference support
Implement FastAPI serving layer
Add streaming token generation

Planned Extensions

Retrieval-Augmented Generation (RAG)
FAISS vector database integration
Concurrent inference batching
Distributed inference experiments
Dockerized deployment
Local web-based inference UI

Development Environment

NVIDIA RTX 4060 Laptop GPU
CUDA-enabled PyTorch
Python 3.x

Motivation

Most projects stop at consuming LLM APIs.

MiniLLM focuses on understanding and engineering the systems behind transformer inference, runtime optimization, scalable serving, and local AI infrastructure.
