PhoneLlama

Turn your Android phone into a GPU-accelerated local inference host.

PhoneLlama is a fork of Google AI Edge Gallery that keeps the full native on-device accelerated inference path, adds better model management, and exposes a local OpenAI-compatible REST API that any standard client (Open WebUI, DeerFlow, LM Studio, Jan, curl) can call over localhost or the local network.

Why PhoneLlama

Google AI Edge Gallery demonstrates excellent on-device inference using the LiteRT runtime with GPU delegation — often faster than CPU-bound llama.cpp builds in Termux. PhoneLlama builds on that foundation and adds:

OpenAI-compatible HTTP API: /v1/chat/completions, /v1/models, SSE streaming
Easier model management: curated catalog of verified working models, one-tap activation
Local network serving: optional LAN exposure, with ZeroTier VPN support for remote access
Server status UI: active model, request count, server controls, thermal warnings
Web status page: browse the server at http://PHONE-IP:8888/ from any browser
Stability fixes: smart context trimming mitigates the native KV-cache overflow crash present in upstream, and a foreground service keeps inference alive in the background

Requirements

Android 10+ device with a decent GPU (tested on Pixel Fold)
6 GB RAM minimum; 12 GB recommended for larger models
Android Studio, or the Gradle wrapper (./gradlew assembleDebug) with JDK 17+, to build

Build

# Clone
git clone https://github.com/thebitcoinman/phonellama.git
cd phonellama

# Create local.properties pointing at your Android SDK
echo "sdk.dir=/path/to/your/Android/Sdk" > local.properties

# Build debug APK
export JAVA_HOME=/path/to/jdk-21   # JDK 17+ required
./gradlew assembleDebug

# Install
adb install -r app/build/outputs/apk/debug/app-debug.apk

Usage

1. Download a model

Open the Models tab and download one of the listed models. Recommended for most use cases:

| Model | Size | Best for |
| --- | --- | --- |
| Qwen2.5-1.5B-Instruct | ~1 GB | Speed; great for agent tool-calling |
| Qwen3-0.6B | ~585 MB | Smallest, fast with `/no_think` |
| Phi-4-Mini-Instruct | ~3.6 GB | Best reasoning quality (12 GB RAM) |

2. Activate a model

Tap a downloaded model → Set Active. Once the model has loaded, the API server routes requests to it; switching models takes 5–15 seconds.

3. Start the API server

Open the Server tab → toggle Enable API Server.

Default: localhost:8888 only
Enable LAN mode to expose on your local network
The screen shows the exact URL and sample snippets

4. Make API calls

# List models
curl http://127.0.0.1:8888/v1/models

# Chat completion
curl http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Streaming
curl http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2.5-1.5B-Instruct", "messages": [{"role":"user","content":"Hello"}], "stream": true}'
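
The same calls can be made from a script. The following is a minimal Python sketch using only the standard library. The helper names (`build_chat_request`, `extract_reply`, `chat`) are made up for illustration, and the response shape (`choices[0].message.content`) is assumed to match the standard OpenAI format rather than taken from PhoneLlama's source.

```python
import json
import urllib.request

# Default localhost address used in the curl examples above.
BASE_URL = "http://127.0.0.1:8888/v1"


def build_chat_request(model, user_text, base_url=BASE_URL, stream=False):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    if stream:
        payload["stream"] = True
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style response body.

    Assumes the standard layout: choices[0].message.content.
    """
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(model, user_text, base_url=BASE_URL, timeout_s=120):
    """Send one user message and return the reply (needs a running server)."""
    req = build_chat_request(model, user_text, base_url)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return extract_reply(resp.read())
```

With the server running, `chat("Qwen2.5-1.5B-Instruct", "Hello!")` should return the model's reply as a string. For LAN mode, pass `base_url="http://PHONE-IP:8888/v1"`.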

5. Web status page

Browse to http://PHONE-IP:8888/ for a live status dashboard. Over USB, run adb forward tcp:8888 tcp:8888 first and use http://127.0.0.1:8888/ instead.

API Reference

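The /v1/models and /v1/chat/completions endpoints follow the OpenAI API. /activate, /health, and the status page are PhoneLlama-specific additions.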

`GET /v1/models`

Returns the list of installed, working models.

{
  "object": "list",
  "data": [
    {"id": "Qwen2.5-1.5B-Instruct", "object": "model", "owned_by": "edge-host", "active": true}
  ]
}
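
A client can use this response to discover which model is currently loaded. The sketch below parses the sample payload shown above. The function names are illustrative, and `active` is treated as a PhoneLlama-specific field because it is not part of the standard OpenAI model object.

```python
import json

# Sample payload copied from the response shown above.
SAMPLE = """
{
  "object": "list",
  "data": [
    {"id": "Qwen2.5-1.5B-Instruct", "object": "model",
     "owned_by": "edge-host", "active": true}
  ]
}
"""


def installed_model_ids(models_response):
    """Return the ids of all models listed in a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def active_model_id(models_response):
    """Return the id of the model flagged active, or None if none is."""
    for entry in models_response.get("data", []):
        if entry.get("active"):
            return entry["id"]
    return None
```

Using `.get("active")` rather than indexing keeps the helper tolerant of entries that omit the flag.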

`POST /v1/chat/completions`

Standard OpenAI chat completions. Supports:

messages array (full conversation history, as DeerFlow / LangChain send)
stream: true for SSE streaming
max_tokens override
tools array for function calling (the model's tool calls are returned as structured JSON; the client executes them)
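
For `stream: true`, a client has to reassemble the reply from server-sent events. The sketch below assumes PhoneLlama emits the standard OpenAI chunk format (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`). That format is an assumption based on the compatibility claim, not something verified against the server, and `iter_stream_text` is a hypothetical helper name.

```python
import json


def iter_stream_text(lines):
    """Yield content fragments from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI format
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

The response object returned by `urllib.request.urlopen` can be iterated line by line, so it can be passed directly as `lines`.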

`POST /activate`

Switch the active model without restarting the server.

curl -X POST http://127.0.0.1:8888/activate \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-4-Mini-Instruct"}'

`GET /health`

Returns {"status": "ok", "model": "..."}.
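
Since a model switch takes several seconds, a client may want to poll /health after calling /activate before sending requests. The sketch below does that with an injectable fetch function so it can be exercised without a device. It assumes /health reports the new model name only once loading has finished, which this README does not confirm. `fetch_health` and `wait_for_model` are illustrative names.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://127.0.0.1:8888"):
    """GET /health and return the parsed JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_for_model(model, get_health=fetch_health, timeout_s=60.0,
                   interval_s=1.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_health() until it reports status "ok" for the given model.

    Returns True on success, False if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            health = get_health()
        except OSError:
            health = {}  # server busy or briefly unreachable while loading
        if health.get("status") == "ok" and health.get("model") == model:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_for_model("Phi-4-Mini-Instruct")` right after the POST to /activate.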

`GET /` or `GET /ui`

Returns the HTML status dashboard.

Connecting external clients

Open WebUI / LM Studio / Jan

Set the base URL to http://PHONE-IP:8888/v1. Leave the API key blank or use phonellama.

DeerFlow (YAML config)

models:
  - name: phone_llama
    display_name: PhoneLlama on-device
    use: langchain_openai:ChatOpenAI
    model: phonellama
    api_key: phonellama
    base_url: http://PHONE-IP:8888/v1

Remote access via ZeroTier

1. Install ZeroTier on the phone and join your network.
2. In the PhoneLlama Server tab, enable LAN mode.
3. Use the ZeroTier IP shown in the app as the base URL.

PhoneLlama detects ZeroTier status and shows a red banner and a notification if the VPN drops.

Architecture

┌──────────────────────────────────────────────────────┐
│                  PhoneLlama App                      │
│                                                      │
│  ┌──────────────┐   ┌──────────────────────────┐    │
│  │  Model UI    │   │  EdgeServer (NanoHTTPD)  │    │
│  │  (Jetpack    │   │  /v1/chat/completions    │    │
│  │   Compose)   │   │  /v1/models              │    │
│  └──────┬───────┘   └──────────┬───────────────┘    │
│         │                      │                    │
│  ┌──────▼──────────────────────▼───────────────┐    │
│  │          ModelManagerViewModel               │    │
│  │   model registry · download · activation    │    │
│  └──────────────────────┬──────────────────────┘    │
│                          │                          │
│  ┌───────────────────────▼──────────────────────┐   │
│  │    LiteRT / Google AI Edge Runtime (JNI)     │   │
│  │    GPU-delegated inference · KV cache        │   │
│  └──────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────┘

Key files added on top of upstream Edge Gallery:

| File | Purpose |
| --- | --- |
| `edgeserver/EdgeServer.kt` | NanoHTTPD server, OpenAI routing, smart context trimming |
| `edgeserver/EdgeServerManager.kt` | Lifecycle, ZeroTier detection, notifications |
| `edgeserver/EdgeServerScreen.kt` | Server tab UI, ZeroTier banner |
| `edgeserver/EdgeServerService.kt` | Foreground service for background inference |
| `data/PhoneLlamaCatalog.kt` | Curated extended model catalog |
| `ui/common/DeviceStatsBar.kt` | RAM/CPU/thermal overlay |
| `ui/rammanager/RamManagerSheet.kt` | Memory management controls |
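
To give a feel for what smart context trimming might involve, here is a small Python sketch of one common approach: keep system messages, then keep the most recent turns that fit a token budget. It is not the Kotlin logic in `EdgeServer.kt`, which may work differently. The function names, the `max_tokens` parameter, and the four-characters-per-token estimate are all assumptions made for illustration.

```python
def estimate_tokens(message):
    """Rough size estimate: ~4 characters per token plus a small overhead."""
    return len(message.get("content") or "") // 4 + 4


def trim_messages(messages, max_tokens):
    """Drop the oldest non-system messages until the estimate fits max_tokens.

    System messages are always kept, and so is the newest message, even if
    it alone exceeds the budget.
    """
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    budget = max_tokens - sum(estimate_tokens(m) for m in system)

    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if cost > budget and kept:
            break  # everything older is dropped as well
        kept.append(msg)
        budget -= cost
    kept.reverse()  # restore chronological order
    return system + kept
```

Trimming from the oldest end preserves the turns most relevant to the next reply, at the cost of the model forgetting early context.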

Known limitations

One model active at a time — switching models takes 5–15 seconds
No embeddings endpoint — not yet mapped from the LiteRT runtime
No tool execution — the model returns structured tool-call JSON; the client executes tools
Crash in liblitertlm_jni.so — mitigated via smart context trimming (see EdgeServer.kt); if you encounter a crash, reduce max context or report with logcat output
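
Because tool execution is left to the client, a caller needs a small dispatch loop of its own. The sketch below assumes the assistant message carries tool calls in the standard OpenAI shape (`tool_calls[i].function.name` plus a JSON string in `arguments`). Whether PhoneLlama's output matches that shape exactly should be checked against a real response. `get_time`, `TOOLS`, and `run_tool_calls` are hypothetical names.

```python
import json


def get_time(timezone="UTC"):
    """Hypothetical example tool; a real client would register its own."""
    return f"12:00 in {timezone}"


# Maps tool names (as advertised in the request's tools array) to callables.
TOOLS = {"get_time": get_time}


def run_tool_calls(assistant_message, tools=TOOLS):
    """Run each requested tool and return "tool" role messages to send back."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        handler = tools.get(fn["name"])
        if handler is None:
            output = f"error: unknown tool {fn['name']}"
        else:
            # In the OpenAI format, arguments arrive as a JSON-encoded string.
            args = json.loads(fn.get("arguments") or "{}")
            output = handler(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call.get("id"),
            "content": str(output),
        })
    return results
```

The returned messages are appended to the conversation, after the assistant message that requested them, and the whole history is sent in the next /v1/chat/completions request.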

Differences from upstream Edge Gallery

| Feature | Edge Gallery (upstream) | PhoneLlama |
| --- | --- | --- |
| API server | ❌ | ✅ OpenAI-compatible |
| Model catalog | Google's curated list | Extended with community LiteRT-LM models |
| App name / branding | Google AI Edge Gallery | PhoneLlama |
| Background serving | ❌ | ✅ Foreground service |
| Network status | ❌ | ✅ ZeroTier detection + alerts |
| Web status UI | ❌ | ✅ at `http://PHONE-IP:PORT/` |
| Context overflow fix | ❌ (crashes) | ✅ Smart message trimming |
| RAM manager | ❌ | ✅ |
| Package ID | `com.google.ai.edge.gallery` | `com.phonellama.app` |

Contributing

PRs welcome. Particularly interested in:

Additional verified LiteRT-LM models for the catalog
Embeddings endpoint once LiteRT exposes the API
iOS port if LiteRT supports it

Please test on-device before submitting model additions.

License

Apache 2.0 — same as Google AI Edge Gallery.

PhoneLlama is an independent fork and is not affiliated with or endorsed by Google.
