
⚒️ AutoLLM Forge

Forge Your Perfect Model - A beautiful, production-ready full-stack platform for automated LLM fine-tuning with QLoRA, AI-powered hyperparameter optimization, and real-time training monitoring.


🎬 Welcome Screen


Beautiful welcome screen explaining the 5-step fine-tuning process

✨ Features

🎯 Core Capabilities

  • ⚡ QLoRA Fine-Tuning: 4-bit quantization with NormalFloat (NF4) for 75% memory reduction
  • 🤖 Model Analysis: Deep inspection of 1000+ Hugging Face models with VRAM estimation
  • 🧠 AI Hyperparameter Tuning: 8-tier intelligent recommendations based on model size and hardware
  • 📊 Real-Time Progress: Live training metrics with detailed progress messages every 2 seconds
  • 💾 Memory Optimized: Train 7B models on 16GB VRAM, 13B on 24GB, 70B on 80GB
  • 💻 Code Generation: Production-ready inference scripts, Gradio apps, FastAPI servers, and documentation
  • 📦 Complete Export: One-click download of fine-tuned models with all deployment files
  • 📊 Automated Evaluation: Built-in perplexity testing and automatic Model Card generation
  • 🧪 Experiment Tracking: Track metrics, loss curves, and artifacts for every training run
  • 🎨 Modern UI: Beautiful dark theme with glassmorphism effects and smooth animations
  • 🔐 Secure: Hugging Face token authentication for gated models (Llama, Gemma, etc.)

🏗️ Technical Stack

Backend (Python)

  • FastAPI 0.109 - High-performance async API framework
  • PyTorch 2.1+ - Deep learning framework with CUDA support
  • Transformers 4.37+ - Hugging Face model loading and training
  • PEFT 0.8+ - Parameter-Efficient Fine-Tuning (LoRA)
  • bitsandbytes 0.42+ - 4-bit/8-bit quantization
  • Accelerate 0.26+ - Distributed training utilities
  • Pydantic 2.5 - Data validation and settings

Frontend (TypeScript)

  • Next.js 14.2 - React framework with App Router
  • TypeScript 5.3 - Type-safe JavaScript
  • Tailwind CSS 3.4 - Utility-first styling
  • Radix UI - Accessible component primitives
  • Zustand 4.5 - State management
  • Lucide React - Beautiful icons
  • Framer Motion 11 - Smooth animations

Fine-Tuning Pipeline

  • QLoRA: 4-bit NF4 quantization + double quantization
  • Paged AdamW 8-bit: Memory-efficient optimizer
  • Gradient Checkpointing: Reduce memory footprint
  • Mixed Precision: FP16/BF16 training
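
These four techniques come together in a bitsandbytes quantization config. A minimal sketch using the standard transformers API (not the repository's exact code):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style 4-bit quantization: NF4 data type plus double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                # NormalFloat4
    bnb_4bit_use_double_quant=True,           # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,     # mixed precision (torch.bfloat16 on Ampere+)
)

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                                   # any causal LM works here
    quantization_config=bnb_config,
    device_map="auto",
)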

Use Cases

User | Benefit
ML Engineers | Fine-tune 7B-13B models on consumer GPUs (RTX 3090/4090), up to 70B on A100s
Researchers | Rapid experimentation with 8-tier hyperparameter optimization
Startups | Deploy custom models without expensive cloud infrastructure
Educators | Interactive teaching tool for LLM fine-tuning concepts
Enterprises | Standardized, reproducible fine-tuning pipelines

💡 Why QLoRA?

QLoRA (Quantized Low-Rank Adaptation) makes large model fine-tuning accessible:

Feature | Benefit | Example
75% Memory Reduction | 4-bit vs FP16 | 7B model weights: 14GB → ~3.5GB VRAM
NF4 Quantization | Minimal accuracy loss | Maintains model quality
Paged Optimizers | No OOM errors | Stable training on consumer GPUs
Double Quantization | Extra memory savings | Nested quantization of the quantization constants
Consumer Hardware | No cloud costs | RTX 3090/4090 sufficient for 13B models
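
The memory arithmetic behind the table is simple: weight memory scales linearly with bits per parameter. A back-of-the-envelope estimate (weights only; activations, gradients, and optimizer state come on top, and the helper name is illustrative):

def weight_vram_gb(num_params: float, bits: int) -> float:
    """VRAM for the model weights alone: params x (bits / 8) bytes."""
    return num_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: {weight_vram_gb(7e9, bits):.1f} GB")
# 7B model @ 16-bit: 14.0 GB
# 7B model @ 8-bit:  7.0 GB
# 7B model @ 4-bit:  3.5 GB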

Supported Models

✅ Llama 2/3 (7B, 13B, 70B)
✅ Mistral (7B, Mixtral 8x7B)
✅ Gemma (2B, 7B)
✅ Falcon (7B, 40B)
✅ GPT-2/Neo/J
✅ Bloom (1.7B, 3B, 7B)
✅ Any Hugging Face causal LM

📋 Requirements

Minimum System Requirements

Component | Requirement
OS | Windows 10/11, Linux (Ubuntu 20.04+), macOS
RAM | 16GB (32GB recommended)
Storage | 20GB free space
GPU | CUDA-capable (optional but recommended)
Python | 3.10+
Node.js | 18+

GPU Recommendations

GPU | VRAM | Max Model Size | Training Speed
RTX 3060 | 12GB | 7B | ~2h/epoch
RTX 3090 | 24GB | 13B | ~1h/epoch
RTX 4090 | 24GB | 13B | ~45min/epoch
A100 (40GB) | 40GB | 33B | ~30min/epoch
A100 (80GB) | 80GB | 70B | ~2h/epoch
CPU only | N/A | 7B | ~12h/epoch ⚠️

Software Dependencies

Backend:

  • Python 3.10+
  • CUDA 11.8+ (for GPU)
  • PyTorch 2.1+ with CUDA
  • bitsandbytes 0.42+ (4-bit quantization)

Frontend:

  • Node.js 18+
  • npm or yarn or pnpm

Quick Start

1️⃣ Clone Repository

git clone https://github.com/royxlead/autollmforge-python.git
cd autollmforge-python

2️⃣ Backend Setup

cd backend

# Create virtual environment
python -m venv venv

# Activate (Windows)
venv\Scripts\activate

# Activate (Linux/Mac)
source venv/bin/activate

# Install PyTorch with CUDA (Windows/Linux)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install dependencies
pip install -r requirements.txt

# Create .env file
echo "HF_TOKEN=your_huggingface_token_here" > .env
# Add other settings as needed

# Start backend
python main.py

✅ Backend running at: http://localhost:8000
📚 API docs at: http://localhost:8000/docs

3️⃣ Frontend Setup

Open a new terminal (keep backend running):

cd frontend

# Install dependencies
npm install
# or: yarn install
# or: pnpm install

# Start frontend
npm run dev

✅ Frontend running at: http://localhost:3000

4️⃣ First Training Job

  1. Open http://localhost:3000
  2. Click "Start Your Journey"
  3. Search for a model (e.g., "gpt2")
  4. Upload a dataset (JSON format with "text" field)
  5. Get AI recommendations
  6. Start training!
  7. Watch real-time progress
  8. Download your fine-tuned model

🔑 Hugging Face Token Setup

For gated models (Llama, Gemma, etc.):

  1. Go to https://huggingface.co/settings/tokens
  2. Create a new token with read access
  3. Add to backend/.env:
    HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
  4. Restart backend
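
Alternatively, the token can be cached locally with huggingface_hub (transformers reads the cached token automatically; the backend's .env remains the primary mechanism here):

from huggingface_hub import login

# Stores the token in the local Hugging Face cache (~/.cache/huggingface)
login(token="hf_xxxxxxxxxxxxx")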

🐳 Docker (Alternative)

# Coming soon
docker-compose up

📖 User Guide

🔍 Step 1: Model Analysis

  1. Search for any Hugging Face model:

    • "gpt2" - Small model for testing
    • "google/gemma-2b" - Efficient 2B model
    • "meta-llama/Llama-2-7b-hf" - Popular 7B model
  2. Click "Analyze Model" to fetch:

    • Architecture details
    • Parameter count
    • VRAM requirements (inference & training)
    • Supported tasks
  3. Review metrics and click "Select Model"

📊 Step 2: Dataset Upload

Dataset Format Required:

[
  {"text": "Your first training example..."},
  {"text": "Your second training example..."},
  {"text": "Your third training example..."}
]
  1. Drag & drop your JSON file (or browse)

  2. System validates:

    • ✅ JSON format
    • "text" field exists
    • ✅ Sample count
    • ✅ Token statistics
  3. View dataset preview and stats

  4. Click "Continue"
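
To sanity-check a file locally before uploading, a small script can mirror these rules. A sketch (an illustrative pre-check, not the server-side validator):

import json

def precheck_dataset(path: str) -> None:
    """Check for a non-empty JSON array of objects with a 'text' string."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list) or not data:
        raise ValueError("dataset must be a non-empty JSON array")
    for i, sample in enumerate(data):
        if not isinstance(sample, dict) or not isinstance(sample.get("text"), str):
            raise ValueError(f"sample {i} must be an object with a 'text' string")
    print(f"OK: {len(data)} samples")

precheck_dataset("test.json")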

⚙️ Step 3: Hyperparameter Tuning

  1. Click "Get AI Recommendations"

  2. System analyzes:

    • Model size
    • Dataset size
    • Available VRAM
    • Compute tier
  3. Review 8-tier recommendations:

    • Learning rate
    • Batch size
    • LoRA rank (r)
    • Epochs
    • Gradient accumulation
  4. Adjust manually (optional)

  5. Click "Start Training"

🔥 Step 4: Training Monitor

Real-time updates every 2 seconds:

  • 📥 Model Download: "Downloading model with 4-bit quantization..."
  • 🔧 Quantization: "Preparing model for QLoRA training..."
  • 📊 Dataset: "Tokenizing training dataset..."
  • 🚀 Training: "Step 45/156 (29%) | Loss: 0.3456"
  • ✅ Complete: "Training completed successfully!"

Live Metrics:

  • Current step / Total steps
  • Loss value (updates every 10 steps)
  • Learning rate
  • Samples per second
  • GPU memory usage
  • Estimated time remaining

📦 Step 5: Code Export

Four types of production-ready code are generated:

  1. Inference Script (inference.py)

    • Load fine-tuned model with 4-bit quantization
    • Generate text with customizable parameters
    • Error handling and device management
  2. Gradio App (gradio_app.py)

    • Interactive web UI
    • 4 parameter controls (temperature, top_p, max_length, repetition_penalty)
    • Example prompts included
  3. FastAPI Server (api_server.py)

    • REST API with CORS
    • Pydantic validation
    • Health checks and OpenAPI docs
    • Production-ready deployment
  4. README (README.md)

    • Installation instructions
    • Quick start guide
    • API documentation
    • Troubleshooting tips

Actions:

  • 📋 Copy code to clipboard
  • 💾 Download individual files
  • 📦 Export All - ZIP with model + all code

🧪 CLI Experiment Runner

For automated experiments without the UI, use the CLI runner:

# 1. Create experiment config
echo '{
  "job_name": "llama2-experiment-1",
  "training_config": {
    "model_id": "meta-llama/Llama-2-7b-hf",
    "dataset_id": "data/train.json",
    "num_epochs": 3,
    "learning_rate": 2e-4,
    "use_lora": true
  }
}' > experiment_config.json

# 2. Run experiment
python backend/scripts/run_experiment.py experiment_config.json

This will:

  1. Analyze the model
  2. Validate the dataset
  3. Run training with progress logging
  4. Automatically evaluate the model (perplexity, loss)
  5. Generate a Model Card (model_card.json)

📁 Project Structure

autollmforge-python/
├── backend/                          # FastAPI Backend (Python)
│   ├── main.py                       # API server & routes
│   ├── config.py                     # Settings & environment
│   ├── requirements.txt              # Python dependencies
│   ├── .env                          # Environment variables (create this)
│   │
│   ├── models/
│   │   └── schemas.py                # Pydantic data models
│   │
│   ├── services/                     # Business logic
│   │   ├── model_analyzer.py         # HF model analysis
│   │   ├── hyperparameter_optimizer.py  # 8-tier AI recommendations
│   │   ├── dataset_processor.py      # Dataset validation
│   │   ├── training_service.py       # QLoRA training pipeline
│   │   ├── quantization_service.py   # Post-training quantization
│   │   ├── eval_service.py           # Model evaluation & cards
│   │   └── code_generator.py         # Production code templates
│   │
│   ├── scripts/
│   │   └── run_experiment.py         # CLI experiment runner
│   │
│   ├── utils/
│   │   ├── hf_utils.py               # Hugging Face helpers
│   │   ├── compute_estimator.py      # VRAM/time estimation
│   │   └── logger.py                 # Logging configuration
│   │
│   └── storage/                      # Auto-created directories
│       ├── datasets/                 # Uploaded datasets
│       ├── outputs/                  # Fine-tuned models
│       └── cache/                    # Model cache
│
├── frontend/                         # Next.js Frontend (TypeScript)
│   ├── app/
│   │   ├── layout.tsx                # Root layout
│   │   ├── page.tsx                  # Welcome screen
│   │   └── globals.css               # Global styles
│   │
│   ├── components/                   # React components
│   │   ├── ModelAnalysis.tsx         # Step 1: Model selection
│   │   ├── DatasetUpload.tsx         # Step 2: Dataset upload
│   │   ├── HyperparameterTuning.tsx  # Step 3: Hyperparameters
│   │   ├── Training.tsx              # Step 4: Training monitor
│   │   └── CodeGeneration.tsx        # Step 5: Code export
│   │
│   ├── store/
│   │   └── pipelineStore.ts          # Zustand state management
│   │
│   ├── types/
│   │   └── index.ts                  # TypeScript definitions
│   │
│   ├── package.json                  # Node dependencies
│   └── .env.local                    # Frontend config (create this)
│
├── .gitignore                        # Git ignore rules
└── README.md                         # This file

🔌 API Reference

Core Endpoints

Method | Endpoint | Description
POST | /api/analyze-model | Analyze a Hugging Face model
GET | /api/models/popular | Get popular model list
POST | /api/upload-dataset | Upload & validate dataset
POST | /api/recommend-hyperparameters | Get AI recommendations
POST | /api/start-training | Start QLoRA training job
GET | /api/training-progress/{job_id} | Get training progress
POST | /api/cancel-training/{job_id} | Cancel training job
GET | /api/training-jobs | List all training jobs
WS | /ws/training/{job_id} | Real-time training updates
POST | /api/generate-code | Generate deployment code
GET | /api/download-model/{job_id} | Download fine-tuned model
GET | /api/download-package/{job_id} | Download complete ZIP
POST | /api/quantize | Quantize existing model
GET | /api/experiment/{job_id}/eval | Get evaluation metrics
POST | /api/experiment/{job_id}/evaluate | Run evaluation & generate Model Card
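
For the WebSocket endpoint, a minimal Python client sketch (assuming the server pushes JSON payloads shaped like the REST progress response shown below; the websockets package is a third-party dependency):

import asyncio
import json
import websockets  # pip install websockets

async def watch_training(job_id: str) -> None:
    uri = f"ws://localhost:8000/ws/training/{job_id}"
    async with websockets.connect(uri) as ws:
        async for message in ws:              # iterate until the server closes
            update = json.loads(message)
            print(update.get("progress_message", update))

asyncio.run(watch_training("your-job-id"))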

Example: Start Training

curl -X POST http://localhost:8000/api/start-training \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "model_id": "gpt2",
      "dataset_id": "your-dataset.json",
      "num_epochs": 3,
      "learning_rate": 0.0002,
      "batch_size": 4,
      "use_lora": true,
      "lora_config": {
        "r": 16,
        "lora_alpha": 32
      }
    },
    "job_name": "my-first-training"
  }'

Example: Get Progress

curl http://localhost:8000/api/training-progress/{job_id}

Response:

{
  "job_id": "089b4602-1275-4a...",
  "status": "running",
  "current_step": 45,
  "total_steps": 156,
  "current_epoch": 1,
  "train_loss": 0.3456,
  "learning_rate": 0.0002,
  "samples_per_second": 15.34,
  "progress_message": "🔥 Step 45/156 (29%) | Loss: 0.3456 | LR: 2.00e-04"
}
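
The same endpoint can be polled from Python until the job reaches a terminal state. A sketch (the terminal status values are assumptions based on the response above):

import time
import requests

def poll_progress(job_id: str, interval: float = 2.0) -> dict:
    url = f"http://localhost:8000/api/training-progress/{job_id}"
    while True:
        progress = requests.get(url, timeout=10).json()
        print(progress.get("progress_message", progress.get("status")))
        if progress.get("status") in {"completed", "failed", "cancelled"}:  # assumed terminal states
            return progress
        time.sleep(interval)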

📚 Full API Documentation: http://localhost:8000/docs (when backend is running)

⚙️ Configuration

Backend Environment (backend/.env)

# Required: Hugging Face Token for gated models
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx

# API Settings
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=true
ENVIRONMENT=development

# CORS (for frontend)
ALLOWED_ORIGINS=http://localhost:3000,http://localhost:3001

# Storage Paths
HF_CACHE_DIR=./cache/huggingface
MODELS_DIR=./storage/models
DATASETS_DIR=./storage/datasets
OUTPUTS_DIR=./storage/outputs
TEMP_DIR=./storage/temp

# Training Configuration
MAX_CONCURRENT_TRAININGS=2
DEFAULT_DEVICE=cuda
MIXED_PRECISION=fp16
USE_QLORA_BY_DEFAULT=true
DEFAULT_QUANTIZATION=4bit

# Logging
LOG_LEVEL=INFO
LOG_FILE=./logs/app.log

Frontend Environment (frontend/.env.local)

NEXT_PUBLIC_API_URL=http://localhost:8000

Hyperparameter Optimization Tiers

The system uses 8 intelligence tiers for recommendations:

Tier | Model Size | Batch Size | LoRA Rank | Learning Rate | Use Case
1 | <1B | 16 | 8 | 3e-4 | Testing/experimentation
2 | 1-3B | 8 | 16 | 2e-4 | Small models
3 | 3-7B | 4 | 16 | 2e-4 | Standard fine-tuning
4 | 7-13B | 2 | 32 | 1.5e-4 | Larger models
5 | 13-30B | 1 | 64 | 1e-4 | Very large models
6 | 30-65B | 1 | 64 | 8e-5 | Huge models
7 | 65-100B | 1 | 128 | 5e-5 | Massive models
8 | >100B | 1 | 256 | 3e-5 | Extreme scale

Automatically adjusted based on available VRAM and dataset size
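
A hypothetical sketch of how such a tier lookup might be keyed on parameter count (the real optimizer also weighs VRAM and dataset size; all names here are illustrative):

TIERS = [
    (1,   dict(batch_size=16, lora_r=8,   lr=3e-4)),    # tier 1: <1B
    (3,   dict(batch_size=8,  lora_r=16,  lr=2e-4)),    # tier 2: 1-3B
    (7,   dict(batch_size=4,  lora_r=16,  lr=2e-4)),    # tier 3: 3-7B
    (13,  dict(batch_size=2,  lora_r=32,  lr=1.5e-4)),  # tier 4: 7-13B
    (30,  dict(batch_size=1,  lora_r=64,  lr=1e-4)),    # tier 5: 13-30B
    (65,  dict(batch_size=1,  lora_r=64,  lr=8e-5)),    # tier 6: 30-65B
    (100, dict(batch_size=1,  lora_r=128, lr=5e-5)),    # tier 7: 65-100B
]
EXTREME = dict(batch_size=1, lora_r=256, lr=3e-5)       # tier 8: >100B

def recommend(params_billions: float) -> dict:
    for upper_bound, config in TIERS:
        if params_billions < upper_bound:
            return config
    return EXTREME

print(recommend(7))  # falls in tier 4: batch_size=2, lora_r=32, lr=1.5e-4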

🧪 Advanced Features

8-Tier Hyperparameter Intelligence

Automatically optimizes based on:

  • Model architecture and size
  • Dataset characteristics
  • Available compute resources
  • Task complexity

Real-Time Progress Tracking

The backend publishes updates at every stage of the pipeline:

  • 📥 Model download progress
  • 🔧 Quantization status
  • 📊 Dataset processing
  • 🔥 Training metrics (loss, LR, speed)
  • 💾 Model saving

Frontend polls every 2 seconds for smooth UI updates.

Production-Ready Code Generation

Inference Script Features:

  • 4-bit quantization loading
  • Device auto-detection (CUDA/CPU)
  • Configurable generation parameters
  • Error handling
  • Memory optimization
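
The generated script's broad shape is sketched below (the base model and adapter path are placeholders; the actual generated file may differ):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE_MODEL = "gpt2"                                    # placeholder base model
ADAPTER_DIR = "./storage/outputs/my-first-training"    # hypothetical adapter path

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_DIR)   # attach the LoRA adapter

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))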

Gradio App Features:

  • Interactive web interface
  • Real-time text generation
  • Parameter sliders
  • Example prompts
  • One-command launch

FastAPI Server Features:

  • RESTful API
  • CORS configuration
  • Pydantic validation
  • Async request handling
  • OpenAPI documentation
  • Health check endpoint

Memory Optimization Techniques

  1. 4-bit Quantization: NF4 (NormalFloat 4-bit)
  2. Double Quantization: Quantize quantization constants
  3. Paged Optimizers: Prevent OOM with automatic offloading
  4. Gradient Checkpointing: Trade compute for memory
  5. Mixed Precision: FP16/BF16 training
  6. LoRA: Train only 0.1% of parameters
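
In transformers/PEFT terms, these techniques map onto settings roughly as follows (a sketch, not the repository's exact training code):

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    task_type="CAUSAL_LM",                  # train only the small low-rank adapters
)

training_args = TrainingArguments(
    output_dir="./storage/outputs/example",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,          # effective batch of 16 at batch-of-2 memory
    gradient_checkpointing=True,            # recompute activations instead of storing them
    optim="paged_adamw_8bit",               # paged 8-bit AdamW avoids optimizer OOMs
    fp16=True,                              # mixed precision (use bf16=True on Ampere+)
    num_train_epochs=3,
)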

🧪 Testing & Validation

Quick Test with GPT-2

Perfect for testing the complete pipeline:

# 1. Start both servers (backend + frontend)

# 2. Go to http://localhost:3000

# 3. Model Analysis
Search: "gpt2"
Click: "Analyze Model"

# 4. Dataset Upload
# Create test dataset:
echo '[{"text":"Hello world"},{"text":"Test example"}]' > test.json
# Upload test.json

# 5. Get Recommendations
Click: "Get AI Recommendations"
# Should show: Tier 1, batch_size=16, r=8

# 6. Start Training
Click: "Start Training"
# Watch real-time progress updates

# 7. Export Code
Click through: Inference → Gradio → API → README
Download all

Expected Duration: ~2-3 minutes on GPU, ~10-15 minutes on CPU

Verify Installation

# Check backend
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Check PyTorch GPU
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

# Check bitsandbytes
python -c "import bitsandbytes; print('✅ bitsandbytes OK')"

🐛 Troubleshooting

❌ Backend Won't Start

Problem: ModuleNotFoundError or import errors

# Solution 1: Reinstall dependencies
pip install --upgrade pip
pip install -r requirements.txt

# Solution 2: Install PyTorch separately
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Solution 3: Check Python version
python --version  # Should be 3.10+

Problem: Port 8000 already in use

# Windows
netstat -ano | findstr :8000
taskkill /PID <pid> /F

# Linux/Mac
lsof -ti:8000 | xargs kill -9

❌ Training Fails Immediately

Problem: CUDA out of memory

Solutions:
1. Reduce batch_size (try 1 or 2)
2. Enable gradient_checkpointing
3. Increase gradient_accumulation_steps
4. Use smaller model (e.g., 2B instead of 7B)

Problem: "Cannot access gated repo"

# Add HF token to backend/.env
HF_TOKEN=hf_xxxxxxxxxxxxx

# Restart backend

Problem: bitsandbytes not working (Windows)

# Use pre-built wheels
pip uninstall bitsandbytes
pip install bitsandbytes --prefer-binary

# Or use Windows-specific build
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl

❌ Frontend Issues

Problem: "Failed to fetch" errors

Solutions:
1. Check backend is running (http://localhost:8000/health)
2. Verify CORS settings in backend/.env
3. Check browser console for errors
4. Disable browser extensions (ad blockers)

Problem: Progress not updating

Solutions:
1. Check browser console for WebSocket errors
2. Verify job_id is valid
3. Check backend logs for training progress
4. Refresh page and restart training

❌ No GPU Detected

Problem: torch.cuda.is_available() returns False

# Check NVIDIA driver
nvidia-smi

# Check PyTorch CUDA version
python -c "import torch; print(torch.version.cuda)"

# Reinstall PyTorch with correct CUDA
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

💬 Still Having Issues?

  1. Check GitHub Issues
  2. Review logs: backend/logs/app.log
  3. Check browser console (F12)
  4. Open a new issue with:
    • Error message
    • System specs (OS, GPU, RAM)
    • Python/Node versions
    • Steps to reproduce

🚀 Deployment

Production Checklist

  • Change ENVIRONMENT=production in backend/.env
  • Use strong HF_TOKEN with appropriate permissions
  • Configure proper ALLOWED_ORIGINS for CORS
  • Set up HTTPS (SSL certificates)
  • Configure persistent storage for models
  • Set up monitoring and alerts
  • Configure log rotation
  • Test all endpoints with production data
  • Set up backup strategy for fine-tuned models
  • Configure rate limiting (if public)
  • Set up error tracking (Sentry, etc.)

Deployment Options

Option 1: Cloud VM (Recommended for GPU)

AWS EC2 / GCP Compute Engine / Azure VM

# Instance requirements:
- GPU instance (g4dn.xlarge, n1-standard-4-k80, etc.)
- 50GB+ storage
- Ubuntu 20.04+

# Setup:
1. SSH into instance
2. Install NVIDIA drivers + CUDA
3. Clone repo and setup (same as Quick Start)
4. Use systemd for auto-restart
5. Configure nginx as reverse proxy
6. Setup SSL with Let's Encrypt

Option 2: Railway / Render (CPU-only)

# For CPU-only training (slower but cheaper)
1. Connect GitHub repo
2. Set environment variables
3. Deploy backend and frontend separately
4. Use persistent volumes for storage

Option 3: Docker (Coming Soon)

# docker-compose.yml
services:
  backend:
    build: ./backend
    ports: ["8000:8000"]
    volumes: ["./storage:/app/storage"]
    
  frontend:
    build: ./frontend
    ports: ["3000:3000"]

Scaling Considerations

Users | Setup | Hardware
1-10 | Single server | 1x GPU instance
10-100 | Load balancer + 2 servers | 2x GPU instances
100-1000 | Kubernetes cluster | GPU node pool
1000+ | Multi-region + queue system | Dedicated infrastructure

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Hugging Face for Transformers and PEFT
  • FastAPI and Next.js teams
  • Open source community


🗺️ Roadmap

  • Docker containerization with docker-compose
  • Distributed training across multiple GPUs
  • Additional quantization methods (AWQ, GGUF)
  • Model merging and ensemble capabilities
  • Automated evaluation benchmarks
  • Cloud deployment templates (AWS, GCP, Azure)
  • Custom model architecture support
  • Multi-modal fine-tuning (vision + language)
  • Dataset preprocessing pipelines
  • Integration with Weights & Biases / MLflow

⚒️ Built with passion for the AI community

Star this repo if you find it useful!

🐛 Report Bug · ✨ Request Feature · 📖 Documentation

Made with ❤️ using FastAPI, Next.js, and QLoRA
