This project provides a robust REST API built with FastAPI and Docker to manage and interact with llama.cpp-based BitNet model instances. It allows developers and researchers to programmatically control `llama-cli` processes for automated testing, benchmarking, and interactive chat sessions.
It serves as a backend replacement for the Electron-BitNet project, offering enhanced performance, scalability, and persistent chat sessions.
- Session Management: Start, stop, and check the status of multiple persistent `llama-cli` and `llama-server` chat sessions.
- Batch Operations: Initialize, shut down, and chat with multiple instances in a single API call.
- Interactive Chat: Send prompts to running BitNet sessions and receive cleaned model responses (see the sketch after this list).
- Model Benchmarking: Programmatically run benchmarks and calculate perplexity on GGUF models.
- Resource Estimation: Estimate maximum server capacity based on available system RAM and CPU threads.
- VS Code Integration: Connects directly to GitHub Copilot Chat as a tool via the Model Context Protocol.
- Automatic API Docs: Interactive API documentation powered by Swagger UI and ReDoc.
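As a rough sketch of what a session lifecycle could look like over HTTP — the endpoint paths and payload fields below are illustrative assumptions, not the project's actual routes; consult the generated Swagger UI at `/docs` for the real schema:

```sh
# Hypothetical routes for illustration only -- the real paths and request
# bodies are listed in the auto-generated docs at /docs.
curl -X POST http://127.0.0.1:8080/sessions/start \
  -H "Content-Type: application/json" \
  -d '{"model": "app/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf", "threads": 2}'

curl -X POST http://127.0.0.1:8080/sessions/chat \
  -H "Content-Type: application/json" \
  -d '{"session_id": "example", "prompt": "Hello!"}'

curl -X POST http://127.0.0.1:8080/sessions/stop \
  -H "Content-Type: application/json" \
  -d '{"session_id": "example"}'
```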
- FastAPI for the core web framework.
- Uvicorn as the ASGI server.
- Docker for containerization and easy deployment.
- Pydantic for data validation and settings management.
- fastapi-mcp for VS Code Copilot tool integration.
- Docker Desktop
- Conda (or another Python environment manager)
- Python 3.10+
Create and activate a Conda environment:
```sh
conda create -n bitnet python=3.11
conda activate bitnet
```
Install the Hugging Face CLI tool to download the models:
```sh
pip install -U "huggingface_hub[cli]"
```
Download Microsoft's official BitNet model:
```sh
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir app/models/BitNet-b1.58-2B-4T
```
Running with Docker is the easiest and recommended way to run the application.
- Build the Docker image:

  ```sh
  docker build -t fastapi_bitnet .
  ```

- Run the Docker container. This command runs the container in detached mode (`-d`) and maps port 8080 on your host to port 8080 in the container:

  ```sh
  docker run -d --name ai_container -p 8080:8080 fastapi_bitnet
  ```
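To confirm the container came up, check its status and probe the docs endpoint (standard Docker and FastAPI behavior, nothing project-specific):

```sh
docker ps --filter name=ai_container   # container should be listed as "Up"
docker logs ai_container               # should show Uvicorn startup messages
curl -s http://127.0.0.1:8080/docs -o /dev/null -w "%{http_code}\n"  # expect 200
```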
For development, you can run the application directly with Uvicorn, which enables auto-reloading.
```sh
uvicorn app.main:app --host 0.0.0.0 --port 8080 --reload
```
Once the server is running, you can access the interactive API documentation:
- Swagger UI: http://127.0.0.1:8080/docs
- ReDoc: http://127.0.0.1:8080/redoc
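FastAPI also serves the raw OpenAPI schema as JSON, which is handy for generating clients or feeding other tooling:

```sh
# /openapi.json is served by FastAPI by default
curl -s http://127.0.0.1:8080/openapi.json | python -m json.tool | head
```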
You can connect this API directly to VS Code's Copilot Chat to create and interact with models.
- Run the application using Docker or locally.
- In VS Code, open the Copilot Chat panel.
- Click the wrench icon ("Configure Tools...").
- Scroll to the bottom and select `+ Add MCP Server`, then choose `HTTP`.
- Enter the URL: `http://127.0.0.1:8080/mcp`
Copilot will now be able to use the API to launch and chat with BitNet instances.
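If you prefer configuration-as-code over clicking through the UI, recent VS Code releases can also pick up an MCP server from a workspace `.vscode/mcp.json` file. A minimal sketch, assuming VS Code's `servers` / `type: http` schema:

```sh
# Declare the MCP server in the workspace instead of via the Copilot Chat UI.
mkdir -p .vscode
cat > .vscode/mcp.json <<'EOF'
{
  "servers": {
    "bitnet": {
      "type": "http",
      "url": "http://127.0.0.1:8080/mcp"
    }
  }
}
EOF
```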
For a more integrated experience, check out the companion VS Code extension:
- GitHub: https://github.com/grctest/BitNet-VSCode-Extension
- Marketplace: https://marketplace.visualstudio.com/items?itemName=nftea-gallery.bitnet-vscode-extension
This project is licensed under the MIT License. See the LICENSE file for details.