Merged
19 changes: 10 additions & 9 deletions extension/llm/runner/llm_runner_helper.cpp
@@ -183,7 +183,8 @@ std::unique_ptr<TextLLMRunner> create_text_llm_runner(
     std::unique_ptr<::tokenizers::Tokenizer> tokenizer,
     std::optional<const std::string> data_path,
     float temperature,
-    const std::string& method_name) {
+    const std::string& method_name,
+    Module::LoadMode load_mode) {
   if (data_path.has_value()) {
     std::vector<std::string> data_files;
     data_files.push_back(data_path.value());
@@ -193,15 +194,17 @@ std::unique_ptr<TextLLMRunner> create_text_llm_runner(
         std::move(data_files),
         temperature,
         nullptr,
-        method_name);
+        method_name,
+        load_mode);
   }
   return create_text_llm_runner(
       model_path,
       std::move(tokenizer),
       std::vector<std::string>(),
       temperature,
       nullptr,
-      method_name);
+      method_name,
+      load_mode);
 }
 
 std::unique_ptr<TextLLMRunner> create_text_llm_runner(
@@ -210,7 +213,8 @@ std::unique_ptr<TextLLMRunner> create_text_llm_runner(
     std::vector<std::string> data_files,
     float temperature,
     std::unique_ptr<::executorch::runtime::EventTracer> event_tracer,
-    const std::string& method_name) {
+    const std::string& method_name,
+    Module::LoadMode load_mode) {
   // Sanity check tokenizer
   if (!tokenizer || !tokenizer->is_loaded()) {
     ET_LOG(Error, "Tokenizer is null or not loaded");
@@ -221,13 +225,10 @@ std::unique_ptr<TextLLMRunner> create_text_llm_runner(
   std::unique_ptr<Module> module;
   if (data_files.size() > 0) {
     module = std::make_unique<Module>(
-        model_path,
-        data_files,
-        Module::LoadMode::File,
-        std::move(event_tracer));
+        model_path, data_files, load_mode, std::move(event_tracer));
   } else {
     module = std::make_unique<Module>(
-        model_path, Module::LoadMode::File, std::move(event_tracer));
+        model_path, load_mode, std::move(event_tracer));
   }
 
   // Get metadata from Module
14 changes: 12 additions & 2 deletions extension/llm/runner/llm_runner_helper.h
@@ -96,6 +96,10 @@ ET_EXPERIMENTAL std::unordered_set<uint64_t> get_eos_ids(
  * @param temperature Optional temperature parameter for controlling randomness
  * (deprecated)
  * @param method_name Name of the method to execute in the model
+ * @param load_mode Loading strategy for the model file. Defaults to
+ * MmapUseMlockIgnoreErrors which uses mmap to avoid loading the entire
+ * model into RAM and attempts to pin pages with mlock for lower inference
+ * latency, gracefully falling back to standard mmap if mlock is unavailable.
  * @return std::unique_ptr<TextLLMRunner> Initialized TextLLMRunner instance, or
  * nullptr on failure
  */
@@ -104,7 +108,8 @@ ET_EXPERIMENTAL std::unique_ptr<TextLLMRunner> create_text_llm_runner(
     std::unique_ptr<::tokenizers::Tokenizer> tokenizer,
     std::optional<const std::string> data_path,
     float temperature = -1.0f,
-    const std::string& method_name = "forward");
+    const std::string& method_name = "forward",
+    Module::LoadMode load_mode = Module::LoadMode::MmapUseMlockIgnoreErrors);
Contributor

Why the ignore-errors variant over MmapUseMlock?

Contributor Author

@psiddh psiddh Feb 23, 2026

For LLM runners, the codebase uses only MmapUseMlockIgnoreErrors, never MmapUseMlock. I think that is the right default.

  • MmapUseMlock: mlock failure → logs error, unmaps the pages, returns Error::NotSupported.
    The model fails to load entirely. (Stricter!)
  • MmapUseMlockIgnoreErrors: mlock failure → logs at Debug level and continues. The model
    loads via normal mmap, just without pages pinned in RAM.

For LLM runners, a hard failure is almost never the right behavior. A model that fails to load at all is a bad
experience for the end user.

Using mmap-based loading in our LLM runners avoids loading the entire model into RAM
upfront, which reduces peak memory usage and OOM risk. The MmapUseMlockIgnoreErrors variant
additionally attempts to pin pages in memory for better inference latency, and falls back
to standard mmap if the system can't support it. That gives us both benefits without
hard failures.

Member

Realistically, I think that we're unlikely to actually be able to lock the entire LLM PTE on most systems, so using the base mmap might be better? The mlock ignore errors variant seems functionally fine, though. It'll just fall through to mmap pretty much 100% of the time.
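
One common reason `mlock` can fail on a multi-gigabyte file is the per-process locked-memory limit (`RLIMIT_MEMLOCK`). The sketch below is a rough heuristic for checking a file size against that limit, and it rests on assumptions: the helper names `fits_memlock_limit` and `can_probably_mlock` are hypothetical, privileged processes may not be bound by the limit, and memory that is already locked is not accounted for.

```cpp
#include <sys/resource.h>

#include <cstdint>

// Returns true if `bytes` fits within the given soft limit.
// RLIM_INFINITY means the limit is not enforced.
bool fits_memlock_limit(uint64_t bytes, rlim_t soft_limit) {
  return soft_limit == RLIM_INFINITY ||
      bytes <= static_cast<uint64_t>(soft_limit);
}

// Heuristic only: compares `bytes` against the current soft
// RLIMIT_MEMLOCK value. It does not guarantee that mlock will succeed.
bool can_probably_mlock(uint64_t bytes) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
    return false;
  }
  return fits_memlock_limit(bytes, lim.rlim_cur);
}
```

If a check like this returns false for the model file, the MmapUseMlockIgnoreErrors path would be expected to end up behaving like plain mmap, which matches the observation above.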

psiddh marked this conversation as resolved.

 /**
  * @brief Creates a TextLLMRunner instance with dependency injection
@@ -120,6 +125,10 @@ ET_EXPERIMENTAL std::unique_ptr<TextLLMRunner> create_text_llm_runner(
  * (deprecated)
  * @param event_tracer Optional event tracer for profiling
  * @param method_name Name of the method to execute in the model
+ * @param load_mode Loading strategy for the model file. Defaults to
+ * MmapUseMlockIgnoreErrors which uses mmap to avoid loading the entire
+ * model into RAM and attempts to pin pages with mlock for lower inference
+ * latency, gracefully falling back to standard mmap if mlock is unavailable.
  * @return std::unique_ptr<TextLLMRunner> Initialized TextLLMRunner instance, or
  * nullptr on failure
  */
@@ -129,7 +138,8 @@ ET_EXPERIMENTAL std::unique_ptr<TextLLMRunner> create_text_llm_runner(
     std::vector<std::string> data_files = {},
     float temperature = -1.0f,
     std::unique_ptr<::executorch::runtime::EventTracer> event_tracer = nullptr,
-    const std::string& method_name = "forward");
+    const std::string& method_name = "forward",
+    Module::LoadMode load_mode = Module::LoadMode::MmapUseMlockIgnoreErrors);
 
 /**
  * @brief Creates a MultimodalRunner instance with dependency injection