Offline AI Chat for Android - Run llama.cpp locally in Java with no internet, no cloud, full privacy.
Offline.AI is an Android app that runs llama.cpp models fully on-device, written in Java and integrated through JNI (Java Native Interface).
It provides an offline AI chat experience: no internet required, no cloud inference, and complete data privacy.
This project showcases how to embed llama.cpp inside an Android app, load GGUF models, and perform text generation locally.
⚠️ Note: This is a proof-of-concept (POC) intended for learning and experimentation, not a production-ready app.
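To give a feel for the Java-to-native boundary, the JNI bridge for a setup like this typically boils down to a small class of native method declarations. The sketch below is illustrative only: the class name `LlamaBridge`, the library name `llama-android`, and the method signatures are assumptions, not necessarily what this repo uses.

```java
// Hypothetical JNI bridge sketch; the real bridge class in this repo may differ.
public class LlamaBridge {
    static {
        // Loads the native library built from src/main/cpp by CMake and linked against llama.cpp.
        System.loadLibrary("llama-android"); // assumed library name
    }

    // Implemented in C++ via JNI on top of llama.cpp.
    public native long loadModel(String ggufPath);                    // returns an opaque context handle
    public native String generate(long handle, String prompt, int maxTokens);
    public native void freeModel(long handle);
}
```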
- ✅ 100% offline LLM inference: no network calls
- ✅ Java + JNI bridge to native llama.cpp
- ✅ Local model loader for GGUF models
- ✅ Streaming chat interface built with RecyclerView
- ✅ Works on Android 12+ (API 31+)
- ✅ Privacy-first design: your data never leaves your phone
If you don't want to build, you can simply download the APK from the repository's Releases section, install it on your device, and run it directly.
Settings → Security → allow installing apps from unknown sources (if prompted).
- Clone llama.cpp (the native engine used by the app):

  ```bash
  git clone https://github.com/ggml-org/llama.cpp
  ```
- Android Studio (Electric Eel or newer) with Android NDK and CMake components installed.
- Android device running Android 12+ (API 31+), with at least 4-6 GB of RAM recommended.
- A GGUF model (e.g., Llama 3.2 1B).
```bash
git clone https://github.com/weaktogeek/llama.cpp-android-java.git
cd llama.cpp-android-java
git checkout main
```

- File → Open… → select the project folder.
- Let Gradle sync and the NDK/CMake components finish downloading if prompted.
- Open the `llama` module → `src/main/cpp/` → `CMakeLists.txt`.
- At line 36, update the path that points to your local `llama.cpp` build directory (the repo you cloned in Prerequisites). Example (adjust to your machine):

  ```cmake
  # Example: point this to your local llama.cpp build dir
  set(LLAMA_BUILD_DIR "/absolute/path/to/llama.cpp/build-llama")
  ```

- If `build-llama` does not exist yet, create it or adjust the path to the correct native source/build location within your cloned `llama.cpp` repo.
- Click Sync Project with Gradle Files.
- Select a physical device (recommended) or compatible emulator (x86_64, plenty of RAM).
- Click Run ▶ to build and install the app.
When the app launches:
- Grant storage permission (used only to let you pick model files from device storage; a minimal picker sketch follows this list).
- Prepare a GGUF model (example: `llama-3.2-1b-instruct.Q4_K_M.gguf`).
- Place it anywhere accessible on your device (e.g., `Downloads/`).
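Behind the "pick model files from device storage" step, Android apps usually go through the Storage Access Framework rather than raw file paths. The snippet below is only a sketch of what that can look like; the activity name, the request code, and the copy-to-app-storage note are assumptions, not necessarily how this app does it.

```java
import android.app.Activity;
import android.content.Intent;
import android.net.Uri;
import android.os.Bundle;

// Sketch: let the user pick a GGUF model via the system document picker.
public class ModelPickerActivity extends Activity {
    private static final int PICK_GGUF = 42; // hypothetical request code

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        Intent intent = new Intent(Intent.ACTION_OPEN_DOCUMENT);
        intent.addCategory(Intent.CATEGORY_OPENABLE);
        // GGUF has no registered MIME type, so accept any binary file.
        intent.setType("application/octet-stream");
        startActivityForResult(intent, PICK_GGUF);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == PICK_GGUF && resultCode == RESULT_OK && data != null) {
            Uri modelUri = data.getData();
            // The native loader expects a plain file path, so the picked URI is
            // typically copied or resolved into app storage before the JNI call.
        }
    }
}
```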
- Open the app and grant storage permission.
- Download a small GGUF model (e.g., Llama 3.2 1B) to your device.
- Tap **Load Model** and select the downloaded `.gguf` file.
- Wait for initialization; once the model is loaded, you're good to go.
- Enter your prompt and chat locally; all inference stays on-device (a rough sketch of the underlying JNI calls follows this list).
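Since loading and generation are long blocking native calls, they belong on a worker thread rather than the UI thread. Here is a minimal sketch of that flow, reusing the hypothetical `LlamaBridge` class from the earlier snippet; the app's actual threading and UI-update code may differ.

```java
import android.os.Handler;
import android.os.Looper;

// Sketch: run model load + generation off the main thread and post the reply back to the UI.
// Assumes the hypothetical LlamaBridge class shown earlier; loading per request is simplified.
public class ChatRunner {
    public interface ReplyListener {
        void onReply(String text);
    }

    private final LlamaBridge bridge = new LlamaBridge();
    private final Handler mainHandler = new Handler(Looper.getMainLooper());

    public void ask(String modelPath, String prompt, ReplyListener listener) {
        new Thread(() -> {
            long ctx = bridge.loadModel(modelPath);           // slow: reads the whole GGUF file
            String reply = bridge.generate(ctx, prompt, 256); // blocking native call
            bridge.freeModel(ctx);
            mainHandler.post(() -> listener.onReply(reply));  // back on the UI thread
        }).start();
    }
}
```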
⏳ Initial load may take some time depending on device performance and model size.
- **Build fails: NDK/CMake not found**
  Open Android Studio → SDK Manager → SDK Tools → install NDK, CMake, and LLDB.
- **`CMakeLists.txt` path error at line 36**
  Make sure `LLAMA_BUILD_DIR` (or the equivalent variable) points to your actual local path of the `llama.cpp` repo (e.g., `/Users/you/dev/llama.cpp/build-llama`).
- **App crashes on model load**
  Use a smaller model (e.g., 1B or 3B quantized GGUF), close background apps, and ensure 4-6 GB of free RAM.
- **Very slow inference**
  Smaller/quantized models run faster. Multi-threading and acceleration toggles may be limited in this POC.
- Add multi-threaded inference settings
- Add token streaming with partial text updates (see the sketch after this list)
- UI: voice input + TTS reply
- Support model quantization selector
- Optional Vulkan acceleration toggle
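For the token-streaming item above, one possible shape (purely illustrative; nothing like this exists in the current POC) is a per-token callback that the native layer invokes and that the UI consumes on the main thread:

```java
import android.os.Handler;
import android.os.Looper;

// Illustrative only: a callback the JNI layer could invoke once per decoded token.
public interface TokenCallback {
    void onToken(String piece); // called from the native/worker thread
    void onDone();
}

// Example consumer that appends partial text on the main thread
// (e.g., updating the last chat bubble in the RecyclerView).
class StreamingConsumer implements TokenCallback {
    private final Handler mainHandler = new Handler(Looper.getMainLooper());
    private final StringBuilder partial = new StringBuilder();

    @Override
    public void onToken(String piece) {
        mainHandler.post(() -> {
            partial.append(piece);
            // adapter.updateLastMessage(partial.toString()); // hypothetical adapter call
        });
    }

    @Override
    public void onDone() {
        mainHandler.post(() -> {
            // mark the streamed message as complete
        });
    }
}
```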
- No network calls
- No analytics or telemetry
- All prompts and generations stay on-device
- ggml-org/llama.cpp
- Android NDK, CMake, JNI