diff --git a/docs/source/llm/export-llm-optimum.md b/docs/source/llm/export-llm-optimum.md index e2c8ee14743..b7de8d99689 100644 --- a/docs/source/llm/export-llm-optimum.md +++ b/docs/source/llm/export-llm-optimum.md @@ -45,15 +45,11 @@ Optimum ExecuTorch supports a wide range of model architectures including decode For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models). -## Export Methods +## CLI Export -Optimum ExecuTorch offers two ways to export models: +The `optimum-cli` command is the recommended way to export Hugging Face models. It provides a single invocation that downloads the model from the Hub, applies the configured optimizations, and writes the resulting `.pte` file. -### Method 1: CLI Export - -The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format. - -#### Basic Export +### Basic Export ```bash optimum-cli export executorch \ @@ -63,7 +59,7 @@ optimum-cli export executorch \ --output_dir="./smollm2_exported" ``` -#### With Optimizations +### With Optimizations Add custom SDPA, KV cache optimization, and quantization: @@ -79,7 +75,7 @@ optimum-cli export executorch \ --output_dir="./smollm2_exported" ``` -#### Available CLI Arguments +### Available CLI Arguments Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`. @@ -156,8 +152,8 @@ print(generated_text) After verifying your model works correctly, deploy it to device: - [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime -- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices -- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices +- [Running on Android](run-on-android.md) - Java APIs for the `executorch-android` AAR (sample app: [LlamaDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)) +- [Running on iOS](run-on-ios.md) - Objective-C and Swift APIs for the `ExecuTorchLLM` framework (sample app: [etLLM](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple)) ## Performance diff --git a/docs/source/llm/getting-started.md b/docs/source/llm/getting-started.md index 95caae6ddd9..1985a610cae 100644 --- a/docs/source/llm/getting-started.md +++ b/docs/source/llm/getting-started.md @@ -25,6 +25,6 @@ Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) expor ### Running - [Running with C++](run-with-c-plus-plus.md) -- [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) +- [Running on Android](run-on-android.md) - [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md) -- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) +- [Running on iOS](run-on-ios.md) diff --git a/docs/source/llm/run-on-android.md b/docs/source/llm/run-on-android.md new file mode 100644 index 00000000000..81abd6a79d5 --- /dev/null +++ b/docs/source/llm/run-on-android.md @@ -0,0 +1,202 @@ +# Running LLMs on Android + +ExecuTorch's LLM-specific runtime components provide an experimental Java interface around the core C++ LLM runtime, available through the 
`executorch-android` AAR.
+
+## Prerequisites
+
+Make sure you have your model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide.
+
+To add the `executorch-android` library to your app, see [Using ExecuTorch on Android](../using-executorch-android.md). The LLM runner classes are bundled inside the same AAR as the generic `Module` API.
+
+## Runtime API
+
+Once the `executorch-android` AAR is on your classpath, you can import the LLM runner classes from the `org.pytorch.executorch.extension.llm` package.
+
+### Importing
+
+```java
+import org.pytorch.executorch.extension.llm.LlmModule;
+import org.pytorch.executorch.extension.llm.LlmModuleConfig;
+import org.pytorch.executorch.extension.llm.LlmGenerationConfig;
+import org.pytorch.executorch.extension.llm.LlmCallback;
+```
+
+### LlmModule
+
+The `LlmModule` class provides a simple Java interface for loading a text-generation model, configuring its tokenizer, generating token streams, and stopping execution. It also supports multimodal models that accept image and audio inputs alongside a text prompt.
+
+This API is experimental and subject to change.
+
+#### Initialization
+
+Create an `LlmModule` by specifying paths to your serialized model (`.pte`) and tokenizer files. For text-only models, the simple constructor is enough:
+
+```java
+LlmModule module = new LlmModule(
+    "/data/local/tmp/llama-3.2-instruct.pte",
+    "/data/local/tmp/tokenizer.model",
+    0.8f);
+```
+
+For finer control (multimodal model type, BOS/EOS handling, supplementary data files, load mode), use `LlmModuleConfig` with the fluent builder:
+
+```java
+LlmModuleConfig config = LlmModuleConfig.create()
+    .modulePath("/data/local/tmp/llama-3.2-instruct.pte")
+    .tokenizerPath("/data/local/tmp/tokenizer.model")
+    .temperature(0.8f)
+    .modelType(LlmModuleConfig.MODEL_TYPE_TEXT)
+    .loadMode(LlmModuleConfig.LOAD_MODE_MMAP)
+    .build();
+
+LlmModule module = new LlmModule(config);
+```
+
+Available load modes are `LOAD_MODE_FILE`, `LOAD_MODE_MMAP` (default), `LOAD_MODE_MMAP_USE_MLOCK`, and `LOAD_MODE_MMAP_USE_MLOCK_IGNORE_ERRORS`. Available model types are `MODEL_TYPE_TEXT`, `MODEL_TYPE_TEXT_VISION`, and `MODEL_TYPE_MULTIMODAL`.
+
+Construction itself is lightweight and does not load the program data immediately.
+
+#### Loading
+
+Explicitly load the model before generation to avoid paying the load cost during your first `generate` call.
+
+```java
+int status = module.load();
+if (status != 0) {
+  // Handle load failure (status is an ExecuTorch runtime error code).
+}
+```
+
+If you skip this step, the model is loaded lazily on the first `generate` call.
+
+#### Generating
+
+Generate tokens from a text prompt by passing an `LlmCallback` that receives each token as it is produced. The same callback also receives a JSON-encoded statistics string when generation completes.
+
+```java
+LlmCallback callback = new LlmCallback() {
+  @Override
+  public void onResult(String token) {
+    // Called once per generated token. Append to your UI buffer here.
+    System.out.print(token);
+  }
+
+  @Override
+  public void onStats(String statsJson) {
+    // Called once when generation finishes. See extension/llm/runner/stats.h
+    // for the field definitions.
+    System.out.println("\n" + statsJson);
+  }
+
+  @Override
+  public void onError(int errorCode, String message) {
+    // Called if the runtime reports an error during generation.
+  }
+};
+
+module.generate("Once upon a time", callback);
+```
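+
+Since `generate()` streams tokens through the callback and blocks until generation finishes, a common pattern is to run it on a background executor and accumulate the streamed tokens into a complete response. The helper below is a minimal sketch of that pattern built on the calls shown above; the class and method names are illustrative and not part of the AAR:
+
+```java
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+
+// Hypothetical helper, not part of the executorch-android API.
+final class LlmRunnerHelper {
+  private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();
+
+  // Runs generate() off the calling thread and forwards each token to a UI callback.
+  static void generateAsync(LlmModule module, String prompt, LlmCallback uiCallback) {
+    EXECUTOR.execute(() -> {
+      StringBuilder transcript = new StringBuilder();
+      module.generate(prompt, new LlmCallback() {
+        @Override
+        public void onResult(String token) {
+          transcript.append(token);   // keep the full response
+          uiCallback.onResult(token); // stream to the UI as it arrives
+        }
+
+        @Override
+        public void onStats(String statsJson) {
+          uiCallback.onStats(statsJson);
+        }
+
+        @Override
+        public void onError(int errorCode, String message) {
+          uiCallback.onError(errorCode, message);
+        }
+      });
+      // transcript now holds the complete generated text.
+    });
+  }
+}
+```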
+
+For full control over generation parameters, use `LlmGenerationConfig`:
+
+```java
+LlmGenerationConfig genConfig = LlmGenerationConfig.create()
+    .seqLen(2048)
+    .temperature(0.8f)
+    .echo(false)
+    .build();
+
+module.generate("Once upon a time", genConfig, callback);
+```
+
+`LlmGenerationConfig` exposes `echo`, `maxNewTokens`, `seqLen`, `temperature`, `numBos`, `numEos`, and `warming`. Defaults match the C++ `GenerationConfig` documented in [Running LLMs with C++](run-with-c-plus-plus.md).
+
+#### Stopping Generation
+
+If you need to interrupt a long-running generation, call `stop()` from another thread (or from inside the `onResult` callback):
+
+```java
+module.stop();
+```
+
+Note that `generate()` runs synchronously on the calling thread, so invoke it off the main thread (for example, on a `HandlerThread` or via a `java.util.concurrent.Executor`) and call `stop()` from elsewhere.
+
+#### Resetting
+
+To clear the prefilled tokens from the KV cache and reset the start position to 0, call:
+
+```java
+module.resetContext();
+```
+
+This is equivalent to `reset()` on the iOS runner and on the C++ `IRunner`.
+
+### Multimodal Inputs
+
+For models declared as `MODEL_TYPE_TEXT_VISION` or `MODEL_TYPE_MULTIMODAL`, image and audio data are provided through dedicated prefill methods. After prefilling all modalities, call `generate()` with the text prompt to produce the response.
+
+#### Images
+
+Raw uint8 pixel data in CHW order can be supplied as an `int[]`, or as a direct `ByteBuffer` to avoid JNI array copies:
+
+```java
+// As int[]
+int[] pixels = ...; // length == channels * height * width
+module.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3);
+
+// As direct ByteBuffer (preferred for large images)
+ByteBuffer buffer = ByteBuffer.allocateDirect(3 * 336 * 336);
+buffer.put(rawBytes).rewind();
+module.prefillImages(buffer, 336, 336, 3);
+```
+
+Pre-normalized float pixel data is also supported, both as a `float[]` and as a direct `ByteBuffer` in native byte order:
+
+```java
+float[] normalized = ...; // length == channels * height * width
+module.prefillImages(normalized, 336, 336, 3);
+
+ByteBuffer floatBuffer = ByteBuffer
+    .allocateDirect(3 * 336 * 336 * Float.BYTES)
+    .order(ByteOrder.nativeOrder());
+// fill floatBuffer with normalized values, then:
+module.prefillNormalizedImage(floatBuffer, 336, 336, 3);
+```
+
+#### Audio
+
+Preprocessed audio features (for example, mel spectrograms produced by a Whisper preprocessor) can be supplied as `byte[]` or `float[]`:
+
+```java
+module.prefillAudio(features, /*batchSize=*/1, /*nBins=*/128, /*nFrames=*/3000);
+```
+
+Raw audio samples can be supplied with `prefillRawAudio`:
+
+```java
+module.prefillRawAudio(samples, /*batchSize=*/1, /*nChannels=*/1, /*nSamples=*/16000);
+```
+
+#### Generating with Multimodal Prefill
+
+After prefilling each modality, run `generate()` with the text prompt as usual:
+
+```java
+module.prefillImages(pixels, 336, 336, 3);
+module.generate("What's in this image?", callback);
+```
+
+For text-vision models, a convenience overload accepts the image and prompt together:
+
+```java
+module.generate(
+    pixels, /*width=*/336, /*height=*/336, /*channels=*/3,
+    "What's in this image?",
+    /*seqLen=*/768,
+    callback,
+    /*echo=*/false);
+```
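+
+Putting the pieces together, a single multimodal chat turn looks like the following. This is a minimal sketch that reuses only the calls shown above; the paths, the 336x336 image shape, and the `pixels` and `callback` variables from the earlier examples are placeholders to adapt to your own exported model:
+
+```java
+// Configure and load a multimodal module (paths are placeholders).
+LlmModuleConfig mmConfig = LlmModuleConfig.create()
+    .modulePath("/data/local/tmp/multimodal-model.pte")
+    .tokenizerPath("/data/local/tmp/tokenizer.model")
+    .modelType(LlmModuleConfig.MODEL_TYPE_MULTIMODAL)
+    .build();
+LlmModule mm = new LlmModule(mmConfig);
+
+if (mm.load() != 0) {
+  throw new RuntimeException("Failed to load multimodal model");
+}
+
+// One chat turn: prefill the image, then generate from the text prompt.
+mm.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3);
+mm.generate("Describe this image in one sentence.", callback);
+
+// Clear the KV cache before starting an unrelated conversation.
+mm.resetContext();
+```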
+
+## Demo
+
+See the [Llama Android demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) in `executorch-examples` for an end-to-end project that wires `LlmModule`, `LlmCallback`, and a `HandlerThread` into a chat UI.
diff --git a/docs/source/llm/working-with-llms.md b/docs/source/llm/working-with-llms.md
index e4088efd12b..ce6daff6ce8 100644
--- a/docs/source/llm/working-with-llms.md
+++ b/docs/source/llm/working-with-llms.md
@@ -15,5 +15,6 @@ export-llm-optimum
 export-custom-llm
 run-with-c-plus-plus
 build-run-llama3-qualcomm-ai-engine-direct-backend
+run-on-android
 run-on-ios
 ```
diff --git a/docs/source/using-executorch-export.md b/docs/source/using-executorch-export.md
index d37dfae2ef7..30f2a22368e 100644
--- a/docs/source/using-executorch-export.md
+++ b/docs/source/using-executorch-export.md
@@ -45,6 +45,10 @@ Commonly used hardware backends are listed below. For mobile, consider using XNN
 The export process takes in a standard PyTorch model, typically a `torch.nn.Module`. This can be an custom model definition, or a model from an existing source, such as TorchVision or HuggingFace. See [Getting Started with ExecuTorch](getting-started.md) for an example of lowering a TorchVision model.
 
+:::{tip}
+Exporting a model from the [Hugging Face Hub](https://huggingface.co/models)? Use the [Optimum ExecuTorch](llm/export-llm-optimum.md) integration. It wraps the export and lowering steps below in a single CLI invocation and supports a wide range of decoder, encoder, multimodal, and seq2seq architectures out of the box.
+:::
+
 Model export is done from Python. This is commonly done through a Python script or from an interactive Python notebook, such as Jupyter or Colab. The example below shows instantiation and inputs for a simple PyTorch model. The inputs are prepared as a tuple of torch.Tensors, and the model can run with these inputs.
 
 ```python