A Claude Code-inspired AI coding assistant that runs entirely on your local machine. Powered by llama.cpp and local language models, Local Coder provides an interactive CLI for asking coding questions, getting explanations, and editing code files—all without sending your code to external servers.
- 🤖 Local AI Model: Runs completely offline using llama.cpp with GGUF models
- 💬 Interactive Chat: Multi-turn conversations with streaming markdown-rendered responses
- 🔧 Agentic Tool Use: The model can read files, list directories, and search code autonomously via MCP
- 📝 Code Editing: Request code changes—the model reads target files itself before applying edits
- 📁 File Context: Reference files in your prompts using `@filename` syntax
- 🎨 Beautiful Terminal UI: Syntax-highlighted markdown rendering with code blocks
- ⚡ GPU Acceleration: Leverages your GPU for fast inference
- 💾 Persistent Config: Model settings saved to `~/.local-coder/config.json`
- 📄 CONTEXT.md Auto-Injection: Generate a project context file the model reads automatically
- 🔌 MCP Server Mode: Expose local-coder as an MCP tool server for Claude Code, Cursor, and other agents
- Python 3.10 or higher
- Node.js (required for MCP filesystem integration)
- [Optional but recommended] CUDA-compatible GPU for faster inference
- ~5GB disk space for the default model
You can run Local Coder either natively or using Docker.
```bash
git clone https://github.com/yourusername/local-coder.git
cd local-coder
pip install -r requirements.txt
```

Note: If you have a CUDA-compatible GPU, install llama-cpp-python with GPU support:
```bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

For Metal (macOS with Apple Silicon):

```bash
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

Download a GGUF model file and put it in the root of the project. The example in this project uses the Qwen2.5-Coder 7B model.
```bash
# Using huggingface-cli (recommended)
pip install huggingface-hub
huggingface-cli download Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --local-dir . \
  --local-dir-use-symlinks False
```

Or download manually from Hugging Face.
Alternative Models: You can use any GGUF model. Popular options include:
- Qwen2.5-Coder (recommended for coding)
- DeepSeek-Coder
- CodeLlama
Whichever GGUF model you choose, do not commit it to the repo; it should be gitignored.
```bash
python main.py models --set ./your-model-name.gguf
```

This saves the path to `~/.local-coder/config.json` so you don't need to configure it again.
```bash
git clone https://github.com/yourusername/local-coder.git
cd local-coder
```

Download your GGUF model file (see step 3 in Native Installation above).
```bash
docker build -t local-coder .
```

Interactive chat:
```bash
docker run -it --rm \
  -v $(pwd)/your-model.gguf:/app/models/model.gguf:ro \
  local-coder
```

Ask a question:
```bash
docker run -it --rm \
  -v $(pwd)/your-model.gguf:/app/models/model.gguf:ro \
  local-coder ask "your question"
```

For GPU support and more Docker options, see DOCKER.md.
Start an interactive session for multi-turn conversations:
```bash
python main.py chat
```

Example session:
```
Connecting to MCP filesystem server...
MCP connected (11 tools available)
Starting interactive chat session. Type /exit to quit.

You: How does the agentic loop in @agent.py work?
Assistant: [reads the file via MCP, then explains it]

You: /exit
Goodbye!
```
Features in chat mode:
- Reference files with `@filename` syntax (pre-loads file contents)
- Responses are rendered as formatted markdown
- The model can also call filesystem tools itself via MCP
- Conversation history is kept for up to 10 turns
- Use `--no-mcp` to disable MCP and run without filesystem tools
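Conceptually, the 10-turn history cap can be sketched as a simple trim that keeps the system message plus the most recent exchanges. This is an illustrative helper (`trim_history` is a hypothetical name, not the project's actual code):

```python
def trim_history(messages, max_turns=10):
    """Keep the system message plus the most recent `max_turns`
    user/assistant exchanges (two messages per turn).
    Illustrative sketch; the real implementation may differ."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]
```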
Slash commands (type these at the You: prompt):
| Command | Description |
|---|---|
| `/exit` | Quit the chat session |
| `/model` | Show the current model and interactively switch to another GGUF file |
| `/md` | Explore the project and generate a CONTEXT.md file the model reads on startup |
```
You: /model
Current model: qwen2.5-coder-7b-instruct-q4_k_m.gguf
Path: /Users/you/local-coder/qwen2.5-coder-7b-instruct-q4_k_m.gguf

Available GGUF models in current directory:
  1. codellama-7b.gguf

Enter a path to a .gguf file to switch models, or press Enter to keep the current model:
> 1
```
You can enter a number from the list, type a file path directly, or press Enter to keep the current model.
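Interpreting that input could be as simple as the following (a hypothetical helper, not the actual code):

```python
def resolve_model_choice(user_input, listed, current):
    """Return the chosen model path: a number from the listed models,
    a path typed directly, or the current model on empty input."""
    s = user_input.strip()
    if not s:
        return current
    if s.isdigit() and 1 <= int(s) <= len(listed):
        return listed[int(s) - 1]
    return s
```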
```
You: /md
Generating CONTEXT.md by exploring the project...
Reading project files...
Generating CONTEXT.md content...

[preview of generated markdown]

Write CONTEXT.md? [y/N]: y
CONTEXT.md has been created. It will be auto-injected into future prompts.
```
Once CONTEXT.md exists in the project root, its contents are automatically included in the system prompt for every `ask`, `chat`, and `edit` call so the model has persistent project context.
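A minimal sketch of what that auto-injection might look like (the function name and prompt layout here are illustrative, not the actual code in `prompt_builder.py`):

```python
from pathlib import Path

def build_system_prompt(base_prompt, project_root="."):
    """Append CONTEXT.md contents to the system prompt when the file
    exists; otherwise return the base prompt unchanged. Hypothetical."""
    ctx = Path(project_root) / "CONTEXT.md"
    if ctx.is_file():
        return f"{base_prompt}\n\n# Project context\n{ctx.read_text()}"
    return base_prompt
```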
Ask a single question without entering interactive mode:
```bash
python main.py ask "What's the difference between a list and a tuple?"
```

With file context:
```bash
python main.py ask "Explain what @main.py does"
```

Disable MCP for a quick offline answer:
```bash
python main.py ask "What is a decorator?" --no-mcp
```

The model will call filesystem tools as needed to answer questions about your codebase.
Request changes to a code file. The model reads the file automatically before making changes:
```bash
python main.py edit "Add error handling to @helpers.py"
```

Edit command options:

- `--max-tokens N` or `-n N`: Set max response length (default: 2048)
Example workflow:
```bash
# The model reads helpers.py via MCP, applies changes, then summarizes what it did
python main.py edit "Add docstrings to all functions in @helpers.py"

# Larger changes may need more tokens
python main.py edit "Refactor @agent.py to use async/await throughout" -n 4096
```

View the currently configured model and switch to a different one:
```bash
# Show current model configuration
python main.py models

# Set a different model (persisted to ~/.local-coder/config.json)
python main.py models --set ./codellama-7b.Q4_K_M.gguf
```

Model settings are stored at `~/.local-coder/config.json`:
```json
{
  "model_path": "/path/to/your-model.gguf",
  "n_ctx": 8192,
  "n_gpu_layers": -1
}
```

Use `python main.py models --set <path>` to update the model path. Edit the JSON file directly to change `n_ctx` or `n_gpu_layers`.
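In Python, loading and updating such a config file might look roughly like this (a hedged sketch; `config.py`'s real function names may differ):

```python
import json
from pathlib import Path

DEFAULTS = {"model_path": None, "n_ctx": 8192, "n_gpu_layers": -1}

def load_config(path):
    """Merge saved settings over the defaults; a missing file means defaults."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.is_file():
        cfg.update(json.loads(p.read_text()))
    return cfg

def set_model_path(path, model_path):
    """Persist a new model path, creating the config directory if needed."""
    cfg = load_config(path)
    cfg["model_path"] = str(Path(model_path).resolve())
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(cfg, indent=2))
    return cfg
```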
Edit `n_ctx` in `~/.local-coder/config.json`:
```json
{
  "n_ctx": 16384
}
```

Larger values allow longer conversations and bigger files, but use more VRAM.
Control how many model layers run on GPU by editing `n_gpu_layers`:

```
{ "n_gpu_layers": -1 }   // All layers on GPU (recommended)
{ "n_gpu_layers": 0 }    // CPU only
{ "n_gpu_layers": 20 }   // First 20 layers on GPU, rest on CPU
```

Control response length per command:
```bash
python main.py chat --max-tokens 1024           # Longer responses in chat
python main.py ask "question" --max-tokens 256  # Shorter responses
```

Reference files in your prompts using `@`:
```bash
# Single file
python main.py ask "Explain @main.py"

# Multiple files
python main.py ask "How do @main.py and @helpers.py work together?"

# With paths
python main.py ask "Review @src/utils/parser.py for bugs"
```

The tool automatically:
- Reads the file contents
- Injects them into the prompt with proper formatting
- Handles missing files gracefully with warnings
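The reference expansion could be implemented roughly like this (a hypothetical helper; the actual logic lives in `helpers.py` and may differ):

```python
import re
from pathlib import Path

def expand_file_refs(prompt):
    """Find @path tokens in the prompt, load the referenced files as
    fenced blocks, and collect warnings (instead of failing) for
    missing files. Illustrative sketch only."""
    refs = re.findall(r"@([\w./-]+)", prompt)
    context, warnings = [], []
    for ref in refs:
        p = Path(ref)
        if p.is_file():
            context.append(f"### {ref}\n```\n{p.read_text()}\n```")
        else:
            warnings.append(f"warning: @{ref} not found, skipping")
    return context, warnings
```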
In addition to @filename syntax, the model can also browse your filesystem autonomously using MCP tools—so you can ask "what files are in this project?" without pre-loading anything.
Local Coder connects to `@modelcontextprotocol/server-filesystem` as a subprocess, giving the model a set of filesystem tools it can call during inference:
| Tool | What the model uses it for |
|---|---|
| `read_file` | Read any file before answering or editing |
| `list_directory` | Browse the project structure |
| `search_files` | Find code patterns across files |
| `write_file` | Apply edits to files |
| `create_directory`, `move_file`, etc. | Other filesystem operations |
The server is started automatically when you run `ask`, `chat`, or `edit`. Pass `--no-mcp` to skip it:
```bash
python main.py chat --no-mcp
python main.py ask "quick math question" --no-mcp
```

Node.js is required for MCP. If Node is not installed, the server will be unavailable and the model will fall back to answering without filesystem access.
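A simple availability probe for that fallback might just check the PATH for Node (illustrative; not necessarily how `mcp_client.py` does it):

```python
import shutil

def mcp_available():
    """The MCP filesystem server is an npm package, so it needs Node
    on the PATH; return whether a `node` executable was found."""
    return shutil.which("node") is not None
```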
Local Coder ships with a skill package (local-coder.skill) that lets you use your local model as a tool from Claude Code, Cursor, or any other MCP-compatible agent.
Once the MCP server is running, it provides five tools to the host agent:
| Tool | Description |
|---|---|
| `ask` | Single-turn Q&A with optional file context |
| `chat` | Stateful multi-turn conversation (use `session_id` to continue) |
| `edit` | Code editing—model reads files, applies changes, returns a summary |
| `get_model` | Inspect current model path, context size, GPU layers |
| `set_model` | Switch to a different GGUF model file |
```bash
/path/to/local-coder/llm/bin/python /path/to/local-coder/skills/local-coder/scripts/server.py
```

Add to `~/.claude/settings.json` (or a project-level `.claude/settings.json`):
```json
{
  "mcpServers": {
    "local-coder": {
      "command": "/absolute/path/to/local-coder/llm/bin/python",
      "args": ["/absolute/path/to/local-coder/skills/local-coder/scripts/server.py"],
      "cwd": "/absolute/path/to/local-coder"
    }
  }
}
```

Restart Claude Code—the `ask`, `chat`, `edit`, `get_model`, and `set_model` tools will appear in the available tools list.
Add to `.cursor/mcp.json` in your workspace (or `~/.cursor/mcp.json` globally):
```json
{
  "mcpServers": {
    "local-coder": {
      "command": "/absolute/path/to/local-coder/llm/bin/python",
      "args": ["/absolute/path/to/local-coder/skills/local-coder/scripts/server.py"],
      "cwd": "/absolute/path/to/local-coder"
    }
  }
}
```

For a full setup walkthrough see `skills/local-coder/references/setup.md`.
Problem: `FileNotFoundError: Model file not found`
Solution: Run `python main.py models` to check the configured path, then use `python main.py models --set ./your-model.gguf` to update it.
Problem: MCP unavailable — no tools will be available
Solution: Install Node.js. The MCP filesystem server is an npm package that requires Node to run.
Problem: Inference is slow despite having a GPU
Solution: Reinstall llama-cpp-python with GPU support:

```bash
# For NVIDIA/CUDA
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# For Apple Silicon
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

Problem: CUDA out of memory or system crashes
Solution:
- Reduce `n_ctx` in `~/.local-coder/config.json`
- Use a smaller quantized model (e.g., Q4_K_M instead of Q6_K)
- Reduce `n_gpu_layers` to offload some layers to CPU
Problem: Responses are slow
Solution:
- Ensure GPU acceleration is enabled
- Use a smaller model
- Reduce `max_tokens` for shorter responses
- Lower quantization (Q4_K_M is faster than Q6_K)
```
local-coder/
├── main.py              # CLI entry point (ask, chat, edit, models commands + slash commands)
├── agent.py             # Agentic tool-calling loop (up to 10 iterations)
├── helpers.py           # Utility functions (file parsing, @reference handling)
├── prompt_builder.py    # LLM prompt construction (injects CONTEXT.md automatically)
├── config.py            # Persistent config (~/.local-coder/config.json)
├── mcp_client.py        # MCP filesystem client (wraps @modelcontextprotocol/server-filesystem)
├── tools.py             # Built-in tool implementations (read_file, list_directory, etc.)
├── skills/
│   └── local-coder/
│       ├── SKILL.md     # Skill metadata and usage guide
│       ├── references/
│       │   └── setup.md # Agent setup guide (Claude Code, Cursor, etc.)
│       └── scripts/
│           └── server.py # MCP server entry point for use by other agents
├── local-coder.skill    # Packaged skill archive
├── requirements.txt     # Python dependencies
└── *.gguf               # Your downloaded model file
```
- User Input: Commands are parsed by Typer in `main.py`
- File References: `@filename` patterns are extracted and files are loaded by `helpers.py`
- Prompt Building: Context, CONTEXT.md, and instructions are formatted by `prompt_builder.py`
- MCP Connection: `mcp_client.py` starts the filesystem server as a subprocess and discovers tools
- Agentic Loop: `agent.py` calls the LLM repeatedly until it produces a final text answer (no more tool calls), up to 10 iterations
- Tool Calls: When the LLM calls a filesystem tool, `mcp_client.py` forwards it to the MCP server and returns the result
- Output Rendering: Rich library renders the final markdown response
The model operates in a loop rather than a single shot:
```
User message → LLM → tool call? → execute tool → append result → LLM → tool call? → ... → final text answer
```
This allows the model to read files, search for patterns, or list directories before answering—no need to pre-load everything with @filename.
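The loop above can be sketched in a few lines, with `llm` and `tools` as stand-ins for the real llama.cpp and MCP plumbing (this is an illustration, not `agent.py` itself):

```python
def run_agent(llm, tools, messages, max_iterations=10):
    """Call the model, execute any tool call it emits, feed the result
    back, and stop when it produces plain text. `llm` maps a message
    list to a reply dict; `tools` maps tool names to callables."""
    for _ in range(max_iterations):
        reply = llm(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # final text answer
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": str(result)})
    return "Stopped after max iterations without a final answer."
```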
Contributions are welcome! Areas for improvement:
- Support for more model formats
- Configuration file for settings
- Conversation history persistence
- Better error messages
- Additional commands (code search, refactoring, etc.)
- Unit tests
As the CLAUDE.md entry in the .gitignore suggests, I used AI to help build this project. Feel free to use AI tools to help you contribute to the repo. Whatever you use, make sure to test your changes locally before pushing.
MIT License - feel free to use this project for any purpose.
- Built with llama-cpp-python
- Inspired by Claude Code
- Terminal UI powered by Rich
- CLI framework: Typer
- Filesystem tools via MCP
Your code never leaves your machine. All inference happens locally using your hardware.