Where code speaks, too | A lightweight framework for running AI language models locally
- Local AI Models - Run GGUF format models (TinyLlama, Phi-2, etc.)
- REST API - FastAPI-based server with comprehensive endpoints
- Session Management - Track conversation context across requests
- Model Lifecycle - Lazy loading, auto-unload, idle detection
- Download Manager - Direct HuggingFace Hub integration
- Runtime Orchestrator - llama.cpp backend with process management
- ✅ 100% Local - No cloud dependencies, complete privacy
- ✅ Lightweight - ~15MB framework, models load on-demand
- ✅ Fast - 14+ tokens/sec on CPU, faster with GPU
- ✅ Easy Setup - Python + llama.cpp, minimal dependencies
- ✅ Production Ready - Full test suite (100% passing), error handling, logging
Minimum:
- OS: Windows 10/11, Linux, macOS
- RAM: 8 GB (16 GB recommended)
- Storage: 10 GB free space
- Python: 3.8 or higher

Recommended:
- RAM: 16 GB+
- GPU: CUDA-compatible (optional, for 10x+ speed)
- Storage: 20 GB+ (for multiple models)
git clone https://github.com/zombiecoder1/zombie-coder-local-ai-ollama.git
cd zombie-coder-local-ai-ollama
pip install -r requirements.txt

Download llama.cpp from the official releases.
Place binary in:
C:\model\config\llama.cpp\server.exe (Windows)
~/model/config/llama.cpp/server (Linux/Mac)
python model_server.py

Server starts at: http://localhost:8155
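Once the server is up, you can confirm it is healthy and see what hardware was detected. A minimal check using Python's requests package (the same endpoints are listed in the API section below):

```python
import requests

# Confirm the server is running and inspect detected hardware
print(requests.get("http://localhost:8155/health").json())
print(requests.get("http://localhost:8155/system/info").json())
```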
curl -X POST http://localhost:8155/download/start \
-H "Content-Type: application/json" \
-d '{
"model_name": "tinyllama-gguf",
"repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
}'
# Check progress
curl http://localhost:8155/download/status/tinyllama-gguf
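Larger models take a while to download, so it can be handy to poll the status endpoint from a script. A minimal sketch; the response fields checked here (`status`, `progress`) are assumptions, see doc/README.md for the exact schema:

```python
import time
import requests

BASE = "http://localhost:8155"

# Poll the download status until it reports a terminal state.
# NOTE: the "status" and "progress" field names are assumptions.
while True:
    info = requests.get(f"{BASE}/download/status/tinyllama-gguf").json()
    print(info.get("status"), info.get("progress"))
    if info.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(5)
```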
# Load model
curl -X POST "http://localhost:8155/runtime/load/tinyllama-gguf?threads=4"
# Generate text
curl -X POST http://localhost:8155/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama-gguf",
"prompt": "Write a Python hello world program"
}'
# Unload model
curl -X POST http://localhost:8155/runtime/unload/tinyllama-gguf

Detailed documentation is available in the doc/ folder:
- API Documentation - Complete API reference
- System Integration - Integration guide
- Current Capabilities - Features & limitations
- Verification Report - Test results
cd test
python run_core_tests.py

✅ 5/5 Core Tests Passing (100%)
- 01_preflight_check.py - System health check
- 02_model_lifecycle.py - Model operations
- 03_api_standard_check.py - API compatibility
- 04_ui_data_integrity.py - Data validation
- 05_integrated_session_test.py - Session management
Test Coverage: See test/README.md
GET /health # Server health
GET /system/info # Hardware detection
GET /models/installed # List local models
GET /models/available # Recommended models

GET /runtime/status # Current state
POST /runtime/load/{model} # Load model
POST /runtime/unload/{model} # Unload model
GET /runtime/config # Configuration

POST /api/generate # Generate text
GET /api/tags # List models

POST /api/session/start # Create session
GET /api/session/status/{id} # Check session
POST /api/session/end/{id} # End session

POST /download/start # Start download
GET /download/status/{model} # Check progress
POST /download/cancel/{model} # Cancel download

Full API docs: doc/README.md
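The session endpoints above can be combined into a simple conversation lifecycle from Python. A minimal sketch; the `session_id` response field is an assumption, check doc/README.md for the actual schema:

```python
import requests

BASE = "http://localhost:8155"

# Create a session to keep conversation context across requests
session = requests.post(f"{BASE}/api/session/start").json()
session_id = session.get("session_id")  # field name is an assumption

# Inspect the session, then end it when the conversation is over
print(requests.get(f"{BASE}/api/session/status/{session_id}").json())
requests.post(f"{BASE}/api/session/end/{session_id}")
```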
| Metric | CPU (4 cores) | GPU (CUDA) |
|---|---|---|
| Load Time | 3-4 seconds | 2-3 seconds |
| Tokens/sec | 14 tokens/s | 150+ tokens/s |
| Memory Usage | 2-3 GB | 1-2 GB |
| First Token | ~60ms | ~20ms |
Tested on: Intel i5-4590, 16GB RAM, Windows 10
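These figures will vary with hardware, model, and quantization. A rough way to sanity-check throughput on your own machine is to time a single generation; the word count below is only a proxy for tokens, so treat the result as an estimate:

```python
import time
import requests

# Time one generation request and estimate throughput
start = time.time()
resp = requests.post(
    "http://localhost:8155/api/generate",
    json={"model": "tinyllama-gguf", "prompt": "Write a Python hello world program"},
).json()
elapsed = time.time() - start

text = resp["runtime_response"]["content"]
print(f"~{len(text.split()) / elapsed:.1f} words/sec ({elapsed:.1f}s total)")
```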
┌─────────────────────────────────────────┐
│ FastAPI Server (Port 8155) │
├─────────────────────────────────────────┤
│ ┌────────┐ ┌────────┐ ┌──────────┐ │
│ │ Models │ │Download│ │ Session │ │
│ │ API │ │Manager │ │ Manager │ │
│ └────────┘ └────────┘ └──────────┘ │
├─────────────────────────────────────────┤
│ Runtime Orchestrator │
│ (Model Loading, Process Management) │
├─────────────────────────────────────────┤
│ llama.cpp Server (Dynamic) │
│ (Port 8080+) │
└─────────────────────────────────────────┘
import requests
# Load model
requests.post(
"http://localhost:8155/runtime/load/tinyllama-gguf",
params={"threads": 4}
)
# Generate text
response = requests.post(
"http://localhost:8155/api/generate",
json={
"model": "tinyllama-gguf",
"prompt": "Explain quantum computing"
}
)
print(response.json()['runtime_response']['content'])
// Generate text
const response = await fetch('http://localhost:8155/api/generate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'tinyllama-gguf',
prompt: 'What is AI?'
})
});
const data = await response.json();
console.log(data.runtime_response.content);

More examples: examples/
MODEL_SERVER_PORT=8155 # API server port
MODELS_DIR=/path/to/models # Models directory
HUGGINGFACE_HUB_TOKEN=token # HF token (optional)
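These variables can be exported in your shell or set programmatically before launching the server. A minimal launch sketch using Python's subprocess; the values shown are placeholders:

```python
import os
import subprocess

# Launch model_server.py with a custom port and models directory
env = {
    **os.environ,
    "MODEL_SERVER_PORT": "8155",
    "MODELS_DIR": "/path/to/models",
}
subprocess.run(["python", "model_server.py"], env=env, check=True)
```

The table below suggests which model and quantization to start with for a given amount of system RAM.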
| System RAM | Model | Quantization | Speed |
|---|---|---|---|
| 8 GB | TinyLlama 1.1B | Q2_K | 10-15 t/s |
| 16 GB | Phi-2 2.7B | Q4_K_M | 8-12 t/s |
| 32 GB | Llama-2 7B | Q5_K_M | 5-10 t/s |
# Check runtime config
curl http://localhost:8155/runtime/config
# Verify model exists
curl http://localhost:8155/models/installed

- Use quantized models (Q2_K, Q4_K)
- Increase thread count: ?threads=8 (see the reload sketch after this list)
- Enable GPU acceleration (if available)
- Use smaller quantization (Q2_K instead of Q8_0)
- Use smaller model (TinyLlama vs Phi-2)
- Close other applications
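For example, to apply a higher thread count you can unload the model and load it again with the new setting (a sketch using the runtime endpoints above; 8 threads assumes roughly an 8-core CPU):

```python
import requests

BASE = "http://localhost:8155"

# Reload the model with more CPU threads
requests.post(f"{BASE}/runtime/unload/tinyllama-gguf")
requests.post(f"{BASE}/runtime/load/tinyllama-gguf", params={"threads": 8})
```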
C:\model/
├── doc/ # 📚 Documentation
│ ├── README.md # API reference
│ ├── SYSTEMS_INTEGRATION.md
│ ├── CURRENT_CAPABILITIES.md
│ └── VERIFICATION_COMPLETE.md
├── test/ # 🧪 Test suite
│ ├── 01_preflight_check.py
│ ├── 02_model_lifecycle.py
│ ├── 03_api_standard_check.py
│ ├── 04_ui_data_integrity.py
│ ├── 05_integrated_session_test.py
│ ├── run_core_tests.py
│ └── README.md
├── static/ # 🌐 Web UI
├── models/ # 🤖 Downloaded models
├── logs/ # 📝 Server logs
├── data/ # 💾 Database
├── model_server.py # 🚀 Main server
├── router.py # 🔄 Runtime orchestrator
├── downloader.py # ⬇️ Download manager
├── db_manager.py # 💾 Model registry
├── system_detector.py # 🖥️ Hardware detection
└── requirements.txt # 📦 Dependencies
Contributions welcome! See CONTRIBUTING.md
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
cd test && python run_core_tests.py

MIT License - see LICENSE
Created by: Sahon Srabon
Organization: Developer Zone
Website: https://zombiecoder.my.id/
Contact: infi@zombiecoder.my.id
- llama.cpp - Inference engine
- FastAPI - Web framework
- HuggingFace Hub - Model repository
- GitHub Issues: Report issues
- Email: infi@zombiecoder.my.id
- Phone: +880 1323-626282
Made with ❤️ by ZombieCoder
Building AI tools that respect privacy and run anywhere