
zuza

License: MIT Android API 28+ llama.cpp Tests Models

a warm little assistant that lives on your phone.

zuza is a private AI chat app for Android. Every conversation runs entirely on your device — no cloud, no account, no telemetry, no data leaving your phone. Pick an open-weight model from a curated catalog, download it over Wi-Fi, and chat. Works on budget phones that most AI apps refuse to touch.

Built with llama.cpp vendored natively for ARM, Jetpack Compose for the UI, and Room for persistence. One APK, seven CPU backend variants, zero external services.


How it works

```mermaid
graph LR
    U[User] -->|types| C[Compose UI]
    C -->|prompt| Z[zuza::Engine<br/>C++ / llama.cpp]
    Z -->|tokens| C
    C -->|persist| R[(Room DB)]
    M[GitHub manifest] -.->|catalog refresh| C
    HF[HuggingFace CDN] -.->|model download| D[DownloadService]
    D -->|.gguf file| Z
```
  1. User types a message in the Compose UI
  2. The chat screen builds a prompt using the active model's chat template
  3. The prompt is fed through zuza::Engine (a C++ wrapper around llama.cpp's inference API) via JNI
  4. Tokens stream back one at a time into the UI — Compose recomposes on each token
  5. Conversations persist to a Room database; the model's KV cache is reused across turns
  6. On context overflow (~60% of the window), a background summarization condenses older turns into a prose paragraph so the model never loses context
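The token-streaming path in steps 3–5 can be sketched in plain Kotlin. This is an illustrative stand-in, not zuza's real API — `FakeEngine` and `TokenListener` are hypothetical names standing in for the JNI-backed engine:

```kotlin
// Illustrative stand-in for zuza::Engine's streaming path: the engine
// emits tokens one at a time through a callback, and the UI layer
// appends each token to the visible message (Compose would recompose
// on each append). FakeEngine/TokenListener are hypothetical names.
fun interface TokenListener {
    fun onToken(token: String)
}

class FakeEngine(private val reply: List<String>) {
    // Mimics the decode/sample loop: one callback per generated token.
    fun generate(prompt: String, listener: TokenListener) {
        for (tok in reply) listener.onToken(tok)
    }
}

fun main() {
    val visible = StringBuilder()
    val engine = FakeEngine(listOf("Hi", " there", "!"))
    engine.generate("hello") { visible.append(it) } // UI recomposes per token
    println(visible) // Hi there!
}
```

The callback shape is what lets Compose render partial answers: each `onToken` mutates observable state, so the bubble grows token by token instead of waiting for the full reply.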

Features

Inference

  • Fat APK with 7 CPU variants — ships libggml-cpu-android_armv8.0_1.so through libggml-cpu-android_armv9.2_2.so. Runtime dispatch picks the best one for the current SoC: baseline for Cortex-A53, DOTPROD/FP16 for A76-class, i8mm/SVE2 for A715+.
  • Dynamic context window — n_ctx scales with device RAM (1024 on 3 GB → 8192 on 12 GB+) so budget phones aren't OOM'd and flagships get the full window.
  • Multi-turn KV cache reuse — turn 2+ appends to the existing cache via continueConversation; only the new user turn is tokenized + decoded.
  • Background summarization — when the cache crosses 60% capacity, older turns are summarized into a prose paragraph during the idle window between turns. The summary replaces those turns in the next prompt rebuild. If the user outruns the background path, an inline fallback fires at 85%.
  • Qwen 3 think-tag parsing — <think>...</think> blocks are hidden; the UI shows a pulsing "thinking" indicator until the real answer starts streaming.
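The RAM → n_ctx mapping can be sketched as a simple breakpoint table. Only the endpoints (1024 at 3 GB, 8192 at 12 GB+) come from this README; the intermediate steps are illustrative guesses, not ContextBudget.kt's actual table:

```kotlin
// Sketch of dynamic n_ctx selection from device RAM. The 1024/8192
// endpoints are from the README; the 4096/2048 middle steps are
// assumed, not the app's real breakpoints.
const val GIB = 1024L * 1024 * 1024

fun contextBudget(totalRamBytes: Long): Int = when {
    totalRamBytes >= 12 * GIB -> 8192 // flagships get the full window
    totalRamBytes >= 8 * GIB  -> 4096 // assumed intermediate step
    totalRamBytes >= 6 * GIB  -> 2048 // assumed intermediate step
    else                      -> 1024 // 3 GB-class budget phones
}
```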

Models

  • Remote catalog — the app fetches a JSON manifest from zuza-chat/models on GitHub at startup. Adding a model = editing the JSON; every install picks it up on next refresh. Falls back to a bundled list if unreachable.
  • Resumable downloads — HTTP Range + ETag resume. Partial .part files survive app kills and network drops; re-tap Download to continue.
  • Foreground download service — downloads run in an Android foreground service with an ongoing notification. Lock the screen, minimize the app, use other apps — the download keeps going.
  • GGUF magic validation — every downloaded file is checked for the GGUF magic bytes before it's declared ready. Silent CDN corruption or zero-filled responses are caught, not silently loaded.
  • RAM fit warnings — the picker checks totalMem against each model's runtime footprint and shows tight / won't fit badges with confirmation dialogs.
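The GGUF magic check above amounts to reading the first four bytes of the downloaded file: every valid GGUF file begins with the ASCII bytes "GGUF". A minimal sketch (`hasGgufMagic` is an illustrative name, not the app's actual function):

```kotlin
import java.io.File

// Sketch of GGUF magic validation: a valid .gguf file starts with the
// four ASCII bytes "GGUF". A zero-filled or truncated download fails
// this check and is never handed to the engine.
fun hasGgufMagic(file: File): Boolean {
    val header = ByteArray(4)
    file.inputStream().use { input ->
        if (input.read(header) != 4) return false // too short to be GGUF
    }
    return header.contentEquals("GGUF".toByteArray(Charsets.US_ASCII))
}
```

Checking magic bytes rather than file size is what catches the zero-filled-response failure mode: a corrupt CDN reply can have exactly the right length while containing no model at all.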

Personality

  • Three-tier system prompt — tiny models (LFM 2 350M) get a one-sentence prompt; mid-size models (Gemma 3 1B, Llama 3.2 1B) get a standard prompt; large models (Qwen 3 1.7B, Gemma 3 4B, Gemma 4) get a rich personality with backstory, opinions, and anti-corporate-speak guards.
  • Name personalization — first-run onboarding asks "what should I call you?"; the system prompt weaves the name into its instructions. Changeable in Settings.
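The three-tier selection reduces to picking a prompt tier from model size. The tier names follow the list above; the boundary values and the `tierFor` shape are illustrative, not SystemPrompt.kt's real API:

```kotlin
// Sketch of three-tier system prompt selection by model size.
// Boundaries (500M / 1.5B) are assumed for illustration.
enum class PromptTier { TINY, STANDARD, RICH }

fun tierFor(paramsMillions: Int): PromptTier = when {
    paramsMillions < 500  -> PromptTier.TINY     // e.g. LFM 2 350M
    paramsMillions < 1500 -> PromptTier.STANDARD // e.g. Gemma 3 1B, Llama 3.2 1B
    else                  -> PromptTier.RICH     // e.g. Qwen 3 1.7B and up
}
```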

UI / UX

  • Onboarding — first-launch welcome screen with name input, skip option, and privacy tagline.
  • Dark UI — cerulean accent, Inter variable font, Lucide bot icon.
  • Markdown rendering — assistant bubbles render bold, italic, code, fenced blocks, lists, headers via compose-richtext.
  • Saved conversations — Room-backed persistence. Every chat is reachable from the drawer with title, message count, and timestamp.
  • Settings — thread count, temperature, max tokens, cellular data toggle, your name, catalog source URL, manual refresh.

Model catalog

| Model | Size | Template | Personality | RAM | Speed (budget) |
|---|---|---|---|---|---|
| LFM 2 350M | 219 MB | ChatML | tiny | 3 GB+ | ~8–12 tok/s |
| Gemma 3 1B | 769 MB | Gemma 3 | standard | 4 GB+ | ~3–4 tok/s |
| Llama 3.2 1B | 770 MB | Llama 3 | standard | 4 GB+ | ~3–4 tok/s |
| Qwen 3 1.7B | 1.19 GB | ChatML | rich | 4 GB+ | ~1–1.5 tok/s |
| Gemma 3 4B | 2.32 GB | Gemma 3 | rich | 6 GB+ | ~0.5–1 tok/s |
| Gemma 4 E2B | 3.22 GB | Gemma 4 | rich | 6 GB+ | ~0.3–0.6 tok/s |
| Gemma 4 E4B | 5.03 GB | Gemma 4 | rich | 8 GB+ | flagship only |

Templates are implemented in Kotlin (ChatTemplate.kt), not via llama.cpp's built-in Jinja engine. Each variant has full unit tests. Gemma 4 uses different turn markers (<|turn> / <turn|>) from Gemma 3 (<start_of_turn> / <end_of_turn>) — mixing them up causes the model to hallucinate fake dialogues, which is how we discovered the difference.
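The marker difference can be made concrete. The marker strings below come from the paragraph above; the enum shape and `wrap()` signature are illustrative, not ChatTemplate.kt's real interface:

```kotlin
// Sketch of the Gemma 3 vs Gemma 4 turn-marker difference. Mixing the
// two sets is the failure mode described above (hallucinated dialogues).
enum class TurnMarkers(val open: String, val close: String) {
    GEMMA3("<start_of_turn>", "<end_of_turn>"),
    GEMMA4("<|turn>", "<turn|>");

    // Illustrative wrapper: one turn = open marker, role, body, close marker.
    fun wrap(role: String, text: String): String = "$open$role\n$text$close\n"
}
```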

The remote manifest lives at zuza-chat/models. See docs/MODELS.md for the schema, per-model notes, and how to add entries.


Architecture

```
chat.zuza/
├── MainActivity.kt                    app entry point + top-level state
├── engine/
│   ├── Zuza.kt                        singleton over libzuza.so (load/generate/stop)
│   ├── ZuzaNative.kt                  JNI external fun declarations
│   ├── ZuzaParams.kt                  ZuzaLoadParams + ZuzaGenParams
│   ├── ContextBudget.kt               dynamic n_ctx from device RAM
│   ├── ContextSummarizer.kt           two-tier memory: summarize older turns
│   ├── BudgetChecker.kt               soft (60%) / hard (85%) threshold logic
│   └── TokenEstimator.kt              rough char-to-token heuristic
├── data/
│   ├── catalog/
│   │   ├── ChatTemplate.kt            LLAMA3 / CHATML / PHI3 / GEMMA3 / GEMMA4
│   │   ├── ModelInfo.kt               data class for a single catalog entry
│   │   ├── ModelCatalog.kt            active catalog (mutableStateOf, replaceable)
│   │   ├── CatalogJson.kt             JSON parser for the remote manifest
│   │   └── RemoteCatalog.kt           fetch + cache + bootstrap logic
│   ├── download/
│   │   ├── DownloadService.kt         foreground service for background downloads
│   │   ├── DownloadStateRegistry.kt   process-wide SnapshotStateMap of progress
│   │   ├── ModelRepository.kt         HTTP Range + ETag resume, .part → .gguf
│   │   ├── DownloadState.kt           sealed interface (NotDownloaded/Downloading/...)
│   │   └── ResumeStore.kt             per-model ETag + totalBytes persistence
│   ├── conversations/
│   │   ├── ConversationStore.kt       save / load / delete + legacy JSON migration
│   │   └── room/                      DAO, entities, migrations, ZuzaDatabase
│   └── preferences/
│       └── SettingsStore.kt           SharedPreferences-backed flows
├── ui/
│   ├── theme/Theme.kt                 Inter + cerulean palette
│   ├── common/                        DottedRule, ZuzaSeal, CircleIconButton
│   ├── onboarding/OnboardingScreen.kt first-run name input
│   ├── chat/                          ChatScreen + 8 focused siblings
│   ├── models/                        ModelsScreen + ModelPill + RamFit
│   ├── drawer/                        ZuzaDrawer + row composables
│   ├── settings/                      SettingsScreen + controls
│   └── about/                         AboutScreen
├── util/DeviceRam.kt                  Context.deviceTotalRamBytes()
└── cpp/
    ├── zuza_engine.h                  public C++ class (zuza::Engine)
    ├── zuza_engine.cpp                implementation (load/generate/poll/stop)
    ├── zuza_jni.cpp                   thin JNI marshalling (~95 lines)
    ├── CMakeLists.txt                 build config + multi-variant toggle
    └── llama/                         vendored llama.cpp (untouched)
```

```mermaid
graph TD
    subgraph "Kotlin"
        MA[MainActivity] --> CS[ChatScreen]
        MA --> MS[ModelsScreen]
        MA --> SS[SettingsScreen]
        MA --> OS[OnboardingScreen]
        CS --> Z[Zuza singleton]
        CS --> CSm[ContextSummarizer]
        MS --> DS[DownloadService]
        DS --> MR[ModelRepository]
        DS --> DSR[DownloadStateRegistry]
        Z --> CB[ContextBudget]
        Z --> BC[BudgetChecker]
        MA --> RC[RemoteCatalog]
        RC --> MC[ModelCatalog]
        RC --> CJ[CatalogJson]
    end
    subgraph "C++ / NDK"
        Z --> ZN[ZuzaNative]
        ZN --> ZE[zuza::Engine]
        ZE --> LC[llama.cpp]
    end
    subgraph "Storage"
        MR --> FS[(filesDir/models/*.gguf)]
        CS --> Room[(Room: zuza.db)]
        RC --> Cache[(filesDir/catalog.json)]
    end
    subgraph "Network"
        MR --> HF[HuggingFace CDN]
        RC --> GH[GitHub raw]
    end
```

No ViewModels, no DI framework. State is hoisted to MainActivity in idiomatic Compose fashion. Unit-testable logic lives in pure Kotlin files with zero Compose or Android dependencies, covered by 149 JVM tests.


Building

Requirements:

  • Android Studio Ladybug or newer (or just the command-line SDK tools)
  • Android NDK r27.1.12297006
  • CMake 3.22+
  • JDK 17
  • ~5 GB disk (llama.cpp compiles seven CPU variant .so files)
```sh
# Clone
git clone https://github.com/zuza-chat/zuza.git
cd zuza

# Build
./gradlew :app:assembleDebug

# Install on a connected device
./gradlew :app:installDebug

# Run the tests (149 JVM tests, ~3s)
./gradlew :app:testDebugUnitTest
```

First build takes 1–2 minutes (llama.cpp compiles once per variant). Incremental builds are seconds.


Tests

149 pure-JVM unit tests, no Robolectric, no emulator, ~3 seconds total.

| Area | Tests | What they cover |
|---|---|---|
| ChatTemplate | 12 | Every template variant × begin/continue/wrap; GEMMA4 marker regression |
| PromptBuilder | 9 | Double-user-turn regression, summary-aware rebuild, empty-turns guard |
| SystemPrompt | 12 | Anonymous, personalized, rich, tiny; format guards; length caps |
| ContextBudget | 15 | RAM breakpoints, boundary conditions, degenerate inputs |
| TokenEstimator | 5 | Scaling, overhead, conservative bias |
| BudgetChecker | 11 | Hard/soft thresholds, custom ratios, zero-nCtx |
| ContextSummarizer | 8 | Prompt builder purity, system prompt constant, gen params |
| ConversationStore | 15 | CRUD, ordering, summary round-trip, legacy JSON migration |
| ModelRepository | 9 | 200/206/200-fallback/404, GGUF magic, resume, delete |
| AssistantContent | 9 | Think-tag parsing, nested tags, incomplete streams |
| RamFit | 10 | Fine/Tight/WontFit thresholds |
| CatalogJson | 15 | Happy path, schema versions, per-entry validation, round-trip |
| RemoteCatalog | 12 | Remote success, network error, cache fallback, corrupt cache |

```sh
# Run a single test class
./gradlew :app:testDebugUnitTest --tests "chat.zuza.engine.ContextBudgetTest"

# Run all tests in a package
./gradlew :app:testDebugUnitTest --tests "chat.zuza.data.catalog.*"
```

Device support

| Tier | Example devices | Models that work well |
|---|---|---|
| Budget (3 GB) | Redmi 14, Galaxy A16, HONOR X5 | LFM 2 350M, Gemma 3 1B |
| Mid-range (6–8 GB) | Pixel 10a, Galaxy A56, Poco X7, Nothing Phone 2a | + Llama 3.2 1B, Qwen 3 1.7B, Gemma 3 4B |
| Upper mid (8–12 GB) | Pixel 10, Galaxy S25, OnePlus 13 | + Gemma 4 E2B |
| Flagship (12–16 GB) | Galaxy S26 Ultra, Pixel 10 Pro, OnePlus 14 | Full catalog including Gemma 4 E4B |

Privacy

zuza has exactly two network calls:

  1. Model downloads — HTTPS GET to HuggingFace CDN URLs listed in the manifest
  2. Catalog refresh — HTTPS GET to raw.githubusercontent.com to fetch the model list JSON

That's it. No analytics, no crash reporter, no telemetry, no feature flags, no account system, no server-side anything. The ACCESS_NETWORK_STATE permission detects metered connections for the cellular data warning. Inspect the source — there's nothing else to find.


Contributing

See docs/CONTRIBUTING.md for how to add a model, implement a new chat template, or extend the UI.


Third-party code

| Dependency | License | What it does |
|---|---|---|
| llama.cpp | MIT | Vendored under cpp/llama/, compiled natively for ARM |
| Inter | OFL 1.1 | Variable font for all text |
| Lucide | MIT | Bot icon (res/drawable/ic_robot.xml) |
| compose-richtext | Apache 2.0 | Markdown rendering in assistant bubbles |
| OkHttp MockWebServer | Apache 2.0 | Test-only: HTTP server for ModelRepository + RemoteCatalog tests |

License

MIT — do whatever you want with the code, keep the copyright notice intact.
