18 changes: 7 additions & 11 deletions docs/source/llm/export-llm-optimum.md
@@ -45,15 +45,11 @@ Optimum ExecuTorch supports a wide range of model architectures including decode

For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models).

## Export Methods
## CLI Export

Optimum ExecuTorch offers two ways to export models:
The `optimum-cli` command is the recommended way to export Hugging Face models. It provides a single invocation that downloads the model from the Hub, applies the configured optimizations, and writes the resulting `.pte` file.

### Method 1: CLI Export

The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format.

#### Basic Export
### Basic Export

```bash
optimum-cli export executorch \
@@ -63,7 +59,7 @@ optimum-cli export executorch \
--output_dir="./smollm2_exported"
```

#### With Optimizations
### With Optimizations

Add custom SDPA, KV cache optimization, and quantization:

@@ -79,7 +75,7 @@ optimum-cli export executorch \
--output_dir="./smollm2_exported"
```

#### Available CLI Arguments
### Available CLI Arguments

Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`.

@@ -156,8 +152,8 @@ print(generated_text)
After verifying your model works correctly, deploy it to device:

- [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime
- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices
- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices
- [Running on Android](run-on-android.md) - Java APIs for the `executorch-android` AAR (sample app: [LlamaDemo](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android))
- [Running on iOS](run-on-ios.md) - Objective-C and Swift APIs for the `ExecuTorchLLM` framework (sample app: [etLLM](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple))

## Performance

4 changes: 2 additions & 2 deletions docs/source/llm/getting-started.md
@@ -25,6 +25,6 @@ Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) expor

### Running
- [Running with C++](run-with-c-plus-plus.md)
- [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)
- [Running on Android](run-on-android.md)
- [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md)
- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple)
- [Running on iOS](run-on-ios.md)
202 changes: 202 additions & 0 deletions docs/source/llm/run-on-android.md
@@ -0,0 +1,202 @@
# Running LLMs on Android

ExecuTorch's LLM-specific runtime components provide an experimental Java interface around the core C++ LLM runtime, available through the `executorch-android` AAR.

## Prerequisites

Make sure you have your model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide.

To add the `executorch-android` library to your app, see [Using ExecuTorch on Android](../using-executorch-android.md). The LLM runner classes are bundled inside the same AAR as the generic `Module` API.

## Runtime API

Once the `executorch-android` AAR is on your classpath, you can import the LLM runner classes from the `org.pytorch.executorch.extension.llm` package.

### Importing

```java
import org.pytorch.executorch.extension.llm.LlmModule;
import org.pytorch.executorch.extension.llm.LlmModuleConfig;
import org.pytorch.executorch.extension.llm.LlmGenerationConfig;
import org.pytorch.executorch.extension.llm.LlmCallback;
```

### LlmModule

The `LlmModule` class provides a simple Java interface for loading a text-generation model, configuring its tokenizer, generating token streams, and stopping execution. It also supports multimodal models that accept image and audio inputs alongside a text prompt.

This API is experimental and subject to change.

#### Initialization

Create an `LlmModule` by specifying paths to your serialized model (`.pte`) and tokenizer files. For text-only models, the simple constructor is enough:

```java
LlmModule module = new LlmModule(
"/data/local/tmp/llama-3.2-instruct.pte",
"/data/local/tmp/tokenizer.model",
0.8f);
```

For finer control (multimodal model type, BOS/EOS handling, supplementary data files, load mode), use `LlmModuleConfig` with the fluent builder:

```java
LlmModuleConfig config = LlmModuleConfig.create()
    .modulePath("/data/local/tmp/llama-3.2-instruct.pte")
    .tokenizerPath("/data/local/tmp/tokenizer.model")
    .temperature(0.8f)
    .modelType(LlmModuleConfig.MODEL_TYPE_TEXT)
    .loadMode(LlmModuleConfig.LOAD_MODE_MMAP)
    .build();

LlmModule module = new LlmModule(config);
```

Available load modes are `LOAD_MODE_FILE`, `LOAD_MODE_MMAP` (default), `LOAD_MODE_MMAP_USE_MLOCK`, and `LOAD_MODE_MMAP_USE_MLOCK_IGNORE_ERRORS`. Available model types are `MODEL_TYPE_TEXT`, `MODEL_TYPE_TEXT_VISION`, and `MODEL_TYPE_MULTIMODAL`.
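
For example, a multimodal checkpoint could be configured with a different model type and load mode (a sketch; the file names are placeholders for your own exported model and tokenizer):

```java
// Hypothetical paths; substitute your own exported multimodal model and tokenizer.
LlmModuleConfig mmConfig = LlmModuleConfig.create()
    .modulePath("/data/local/tmp/multimodal-model.pte")
    .tokenizerPath("/data/local/tmp/tokenizer.model")
    .modelType(LlmModuleConfig.MODEL_TYPE_MULTIMODAL)
    .loadMode(LlmModuleConfig.LOAD_MODE_FILE)
    .build();

LlmModule mmModule = new LlmModule(mmConfig);
```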

Construction itself is lightweight and does not load the program data immediately.

#### Loading

Explicitly load the model before generation to avoid paying the load cost during your first `generate` call.

```java
int status = module.load();
if (status != 0) {
  // Handle load failure (status is an ExecuTorch runtime error code).
}
```

If you skip this step, the model is loaded lazily on the first `generate` call.

#### Generating

Generate tokens from a text prompt by passing an `LlmCallback` that receives each token as it is produced. The same callback also receives a JSON-encoded statistics string when generation completes.

```java
LlmCallback callback = new LlmCallback() {
  @Override
  public void onResult(String token) {
    // Called once per generated token. Append to your UI buffer here.
    System.out.print(token);
  }

  @Override
  public void onStats(String statsJson) {
    // Called once when generation finishes. See extension/llm/runner/stats.h
    // for the field definitions.
    System.out.println("\n" + statsJson);
  }

  @Override
  public void onError(int errorCode, String message) {
    // Called if the runtime reports an error during generation.
  }
};

module.generate("Once upon a time", callback);
```

For full control over generation parameters, use `LlmGenerationConfig`:

```java
LlmGenerationConfig genConfig = LlmGenerationConfig.create()
    .seqLen(2048)
    .temperature(0.8f)
    .echo(false)
    .build();

module.generate("Once upon a time", genConfig, callback);
```

`LlmGenerationConfig` exposes `echo`, `maxNewTokens`, `seqLen`, `temperature`, `numBos`, `numEos`, and `warming`. Defaults match the C++ `GenerationConfig` documented in [Running LLMs with C++](run-with-c-plus-plus.md).
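
For instance, to cap the number of newly generated tokens regardless of prompt length, set `maxNewTokens` (a sketch; it assumes the builder exposes a setter named after the field, which is not shown in the example above):

```java
// Assumed setter: maxNewTokens mirrors the field name listed above.
LlmGenerationConfig shortReply = LlmGenerationConfig.create()
    .maxNewTokens(64)   // stop after 64 generated tokens
    .temperature(0.7f)
    .build();

module.generate("Give me a one-line summary.", shortReply, callback);
```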

#### Stopping Generation

If you need to interrupt a long-running generation, call `stop()` from another thread (or from inside the `onResult` callback):

```java
module.stop();
```

Generation also runs synchronously on the calling thread, so make sure you invoke `generate()` off the main thread (for example, on a `HandlerThread` or via a `java.util.concurrent.Executor`).
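
A minimal pattern is sketched below, using a single-threaded `java.util.concurrent.ExecutorService` to keep the blocking `generate()` call off the main thread (the executor name and lifecycle handling are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One background thread dedicated to LLM work.
ExecutorService llmExecutor = Executors.newSingleThreadExecutor();

// Runs synchronously on the background thread; tokens arrive via the callback.
llmExecutor.execute(() -> module.generate("Once upon a time", callback));

// From the UI thread, interrupt an in-flight generation if needed.
module.stop();

// When the hosting component is destroyed:
llmExecutor.shutdown();
```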

#### Resetting

To clear the prefilled tokens from the KV cache and reset the start position to 0, call:

```java
module.resetContext();
```

This is the equivalent of `reset()` on the iOS runner and `reset()` on the C++ `IRunner`.
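
For example, a simple multi-turn flow can reuse one `LlmModule` across conversations and reset the context whenever the user starts a new chat (a sketch using only the calls shown above):

```java
// First conversation: the prompt and generated tokens accumulate in the KV cache.
module.generate("Summarize the plot of Hamlet.", callback);
module.generate("Now shorten that to one sentence.", callback);

// New conversation: drop the cached tokens and start again from position 0.
module.resetContext();
module.generate("Translate 'good morning' to French.", callback);
```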

### Multimodal Inputs

For models declared as `MODEL_TYPE_TEXT_VISION` or `MODEL_TYPE_MULTIMODAL`, image and audio data are provided through dedicated prefill methods. After prefilling all modalities, call `generate()` with the text prompt to produce the response.

#### Images

Raw uint8 pixel data in CHW order can be supplied as an `int[]`, or as a direct `ByteBuffer` to avoid JNI array copies:

```java
// As int[]
int[] pixels = ...; // length == channels * height * width
module.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3);

// As direct ByteBuffer (preferred for large images)
ByteBuffer buffer = ByteBuffer.allocateDirect(3 * 336 * 336);
buffer.put(rawBytes).rewind();
module.prefillImages(buffer, 336, 336, 3);
```
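
If your image starts out as an Android `Bitmap`, one way to build the CHW `int[]` is sketched below (it assumes a 3-channel RGB input at the resolution the model expects; `bitmap` is a placeholder for your already-resized image):

```java
import android.graphics.Bitmap;
import android.graphics.Color;

// Convert an ARGB Bitmap into channel-major (CHW) uint8 values stored in an int[].
int width = bitmap.getWidth();
int height = bitmap.getHeight();
int[] argb = new int[width * height];
bitmap.getPixels(argb, 0, width, 0, 0, width, height);

int[] chw = new int[3 * height * width];
for (int y = 0; y < height; y++) {
  for (int x = 0; x < width; x++) {
    int idx = y * width + x;
    int pixel = argb[idx];
    chw[idx] = Color.red(pixel);                        // channel 0 (R)
    chw[height * width + idx] = Color.green(pixel);     // channel 1 (G)
    chw[2 * height * width + idx] = Color.blue(pixel);  // channel 2 (B)
  }
}

module.prefillImages(chw, width, height, 3);
```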

Pre-normalized float pixel data is also supported, both as a `float[]` and as a direct `ByteBuffer` in native byte order:

```java
float[] normalized = ...; // length == channels * height * width
module.prefillImages(normalized, 336, 336, 3);

ByteBuffer floatBuffer = ByteBuffer
    .allocateDirect(3 * 336 * 336 * Float.BYTES)
    .order(ByteOrder.nativeOrder());
// Fill the buffer with normalized values through a float view, then prefill:
floatBuffer.asFloatBuffer().put(normalized);
module.prefillNormalizedImage(floatBuffer, 336, 336, 3);
```

#### Audio

Preprocessed audio features (for example mel spectrograms produced by a Whisper preprocessor) can be supplied as `byte[]` or `float[]`:

```java
module.prefillAudio(features, /*batchSize=*/1, /*nBins=*/128, /*nFrames=*/3000);
```

Raw audio samples can be supplied with `prefillRawAudio`:

```java
module.prefillRawAudio(samples, /*batchSize=*/1, /*nChannels=*/1, /*nSamples=*/16000);
```

#### Generating with Multimodal Prefill

After prefilling each modality, run `generate()` with the text prompt as usual:

```java
module.prefillImages(pixels, 336, 336, 3);
module.generate("What's in this image?", callback);
```

For text-vision models, a convenience overload accepts the image and prompt together:

```java
module.generate(
    pixels, /*width=*/336, /*height=*/336, /*channels=*/3,
    "What's in this image?",
    /*seqLen=*/768,
    callback,
    /*echo=*/false);
```

## Demo

See the [Llama Android demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) in `executorch-examples` for an end-to-end project that wires `LlmModule`, `LlmCallback`, and a `HandlerThread` into a chat UI.
1 change: 1 addition & 0 deletions docs/source/llm/working-with-llms.md
@@ -15,5 +15,6 @@ export-llm-optimum
export-custom-llm
run-with-c-plus-plus
build-run-llama3-qualcomm-ai-engine-direct-backend
run-on-android
run-on-ios
```
4 changes: 4 additions & 0 deletions docs/source/using-executorch-export.md
@@ -45,6 +45,10 @@ Commonly used hardware backends are listed below. For mobile, consider using XNN

The export process takes in a standard PyTorch model, typically a `torch.nn.Module`. This can be a custom model definition, or a model from an existing source, such as TorchVision or HuggingFace. See [Getting Started with ExecuTorch](getting-started.md) for an example of lowering a TorchVision model.

:::{tip}
Exporting a model from the [Hugging Face Hub](https://huggingface.co/models)? Use the [Optimum ExecuTorch](llm/export-llm-optimum.md) integration. It wraps the export and lowering steps below in a single CLI invocation and supports a wide range of decoder, encoder, multimodal, and seq2seq architectures out of the box.
:::

Model export is done from Python. This is commonly done through a Python script or from an interactive Python notebook, such as Jupyter or Colab. The example below shows instantiation and inputs for a simple PyTorch model. The inputs are prepared as a tuple of torch.Tensors, and the model can run with these inputs.

```python