
mutex issue on Mac only for release 1.21.X only #24579


Open
giorgosHadji opened this issue Apr 28, 2025 · 6 comments

Comments

@giorgosHadji

Describe the issue

Hello,

I am trying to use the latest onnxruntime release (1.21.X, tried both .0 and .1). It works fine on Unix and Windows, but on Mac I get the following error at the end of inference, when everything seems to be shutting down:

libc++abi: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
There are no code changes on my side; the issue is only present on the latest onnxruntime release and only on Mac.
On previous releases everything works just fine; the only thing that changed is the onnxruntime version.

Googling suggests it has something to do with static mutexes and the order in which they get destroyed/accessed - see more here.
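
To illustrate that failure mode, here is a minimal sketch of my own (not onnxruntime code; all names are made up): a static object whose destructor locks a mutex can outlive the mutex, and on libc++ the lock then surfaces as exactly the system_error shown above.

#include <mutex>

std::mutex& SharedMutex() {
  // Function-local static: constructed on first use, destroyed during exit.
  static std::mutex m;
  return m;
}

struct LateUser {
  ~LateUser() {
    // If `m` was already destroyed when this runs, locking it is undefined
    // behavior; on libc++ it typically reports
    // "mutex lock failed: Invalid argument".
    std::lock_guard<std::mutex> lock(SharedMutex());
  }
};

static LateUser g_late_user;  // constructed before main()

int main() {
  SharedMutex();  // mutex constructed after g_late_user, so it is destroyed first
  return 0;       // at exit, ~LateUser locks an already-destroyed mutex
}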

This happens in both the CPU case and the GPU (CoreML) case.

I only have access to macOS 12 and macOS 13, and both fail. Both arm64 and x86 fail in all scenarios.

It is important to note that I can compile and link against onnxruntime just fine; it is at runtime that the issue appears, I believe more specifically when inference has finished and things are being destructed/shut down.

To reproduce

At the moment I can't share a model to reproduce this, and I'm not sure I will ever get approval to do so.
I'm sharing this to see if anyone has run into it or if it brings something to mind.

Urgency

No response

Platform

Mac

OS Version

macOS 12 and macOS 13

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.21.0 and 1.21.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

github-actions bot added the ep:CoreML (issues related to CoreML execution provider) label on Apr 28, 2025
@fs-eire
Contributor

fs-eire commented Apr 29, 2025

Since there is no model and no repro steps, it may be difficult to investigate.

Before more info is shared, could you please verify whether a simple test model works (e.g., this model: https://github.com/onnx/onnx/blob/v1.16.2/onnx/backend/test/data/node/test_abs/model.onnx)?

If the runtime error still occurs with this test model, perhaps it would be OK to share your code and repro steps?
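
Something along these lines should be enough to check on the CPU EP (a rough sketch; the input name and shape are read from the model rather than hardcoded):

#include <onnxruntime_cxx_api.h>
#include <iostream>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "abs-test");
  Ort::SessionOptions opts;
  Ort::Session session(env, "model.onnx", opts);  // the test_abs model linked above

  // Query input/output names and the input shape from the model itself.
  Ort::AllocatorWithDefaultOptions alloc;
  auto input_name = session.GetInputNameAllocated(0, alloc);
  auto output_name = session.GetOutputNameAllocated(0, alloc);
  auto shape = session.GetInputTypeInfo(0).GetTensorTypeAndShapeInfo().GetShape();

  size_t count = 1;
  for (auto d : shape) count *= static_cast<size_t>(d);
  std::vector<float> data(count, -1.0f);  // Abs(-1) should produce all 1s

  auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value input = Ort::Value::CreateTensor<float>(
      mem, data.data(), data.size(), shape.data(), shape.size());

  const char* in_names[] = {input_name.get()};
  const char* out_names[] = {output_name.get()};
  auto outputs = session.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);

  std::cout << "first output value: " << outputs[0].GetTensorData<float>()[0] << std::endl;
  return 0;  // the reported crash, if any, would happen after this, during static destruction
}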

fs-eire removed the ep:CoreML (issues related to CoreML execution provider) label on Apr 30, 2025
@xenova
Contributor

xenova commented May 5, 2025

I can confirm that I get this error too with onnxruntime-node dev build: https://www.npmjs.com/package/onnxruntime-node/v/1.22.0-dev.20250418-c19a49615b

libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

It seems to occur at shutdown, after loading any model with the WebGPU EP.

@fs-eire
Contributor

fs-eire commented May 5, 2025

@xenova does it happen on CPU EP or CoreML EP?

@xenova
Contributor

xenova commented May 6, 2025

It doesn't happen on the CPU EP, but with the CoreML EP I get a different error:

Context leak detected, msgtracer returned -1

Here's a reproduction:

import { AutoTokenizer, MusicgenForConditionalGeneration, RawAudio } from '@huggingface/transformers';

// Load tokenizer and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/musicgen-small');
const model = await MusicgenForConditionalGeneration.from_pretrained('Xenova/musicgen-small', {
  dtype: {
    text_encoder: 'q4',
    decoder_model_merged: 'q4',
    encodec_decode: 'fp32',
  },
  device: "webgpu", // or "coreml"
});

// Prepare text input
const prompt = 'a light and cheerly EDM track, with syncopated drums, aery pads, and strong emotions bpm: 130';
const inputs = tokenizer(prompt);

// Generate audio
const audio_values = await model.generate({
  ...inputs,
  max_new_tokens: 500,
  do_sample: true,
  guidance_scale: 3,
});

// (Optional) Write the output to a WAV file
const audio = new RawAudio(audio_values.data, model.config.audio_encoder.sampling_rate);
audio.save('musicgen.wav');

throw new Error('dummy error'); // If we throw a dummy error here, the libc++abi error appears at exit.

But the libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument message appears to be printed whenever the process exits with a non-zero status code (?). You should be able to reproduce this with any model, not just musicgen.

@clouds56

clouds56 commented May 6, 2025

I think I'm on the CPU EP, with this code:

#include <iostream>
#include <cassert>
#include <onnxruntime_cxx_api.h>
#include <onnxruntime_c_api.h>
using namespace std;

int main() {
  // C++ API equivalent (RAII-managed, released automatically):
  // auto env = Ort::Env(ORT_LOGGING_LEVEL_WARNING, "test");

  // C API: create an OrtEnv but never release it.
  auto api = OrtGetApiBase()->GetApi(ORT_API_VERSION);
  auto env_ptr = (OrtEnv*)nullptr;
  auto result = api->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env_ptr);
  assert(result == nullptr && env_ptr != nullptr);
  // api->ReleaseEnv(env_ptr);
  cout << "Hello, World!" << endl;
  return 0;  // aborts during static destruction because the env was never released
}

Uncomment api->ReleaseEnv(env_ptr); and it works.

Update:
No model is needed to reproduce the abort:

  • C++ API auto env = Ort::Env(ORT_LOGGING_LEVEL_WARNING, "test"); works, since RAII releases it
  • C API api->CreateEnv(...) followed by api->ReleaseEnv(env_ptr); works
  • C API api->CreateEnv(...) without a manual release fails; this can happen in a framework that keeps a static OrtEnv singleton (see the sketch below)
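
A rough sketch of that failing singleton pattern (my own illustration; the names are made up, not any particular framework's code):

#include <onnxruntime_cxx_api.h>
#include <memory>

// Framework-style singleton that owns the environment for the whole process.
static std::unique_ptr<Ort::Env> g_env;

Ort::Env& GetEnv() {
  if (!g_env) {
    g_env = std::make_unique<Ort::Env>(ORT_LOGGING_LEVEL_WARNING, "singleton");
  }
  return *g_env;
}

int main() {
  GetEnv();   // create the env; nothing ever calls g_env.reset()
  return 0;   // g_env is destroyed during static destruction, after onnxruntime's
              // own statics may already be gone, reproducing the abort
}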

@fs-eire
Copy link
Contributor

fs-eire commented May 15, 2025

Found the root cause:

node::errors::TriggerUncaughtException() calls into node::Exit(), which eventually calls into libsystem_c.dylib!__cxa_finalize_ranges (destruction of static/global variables in libonnxruntime.dylib) without calling the finalizer set by napi_set_instance_data. The later destruction of the std::unique_ptr<Ort::Env> (in onnxruntime_binding.node) then refers to an already-destroyed static mutex and crashes.

If the program exits normally (not via node::errors::TriggerUncaughtException()), the finalizer gets called correctly and the std::unique_ptr<Ort::Env> is reset to nullptr, so the problem does not occur.

I am not sure whether this behavior is expected (the finalizer set by napi_set_instance_data not being called).

I am still trying to figure out how to fix this.
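
For reference, a rough sketch of the pattern involved (illustrative only, not the actual onnxruntime-node binding source; names are made up):

#include <node_api.h>
#include <onnxruntime_cxx_api.h>
#include <memory>

// The binding keeps the environment alive behind a smart pointer.
static std::unique_ptr<Ort::Env> g_ort_env;

// Finalizer registered with napi_set_instance_data. On a normal exit this runs
// before libonnxruntime's statics are torn down, so destruction order is safe.
static void CleanupOrtEnv(napi_env /*env*/, void* /*data*/, void* /*hint*/) {
  g_ort_env.reset();
}

static napi_value Init(napi_env env, napi_value exports) {
  g_ort_env = std::make_unique<Ort::Env>(ORT_LOGGING_LEVEL_WARNING, "node-binding");
  // Ask Node to call CleanupOrtEnv when the addon instance is torn down.
  // If node::Exit() skips this finalizer, g_ort_env is instead destroyed during
  // __cxa_finalize and touches a static mutex that no longer exists.
  napi_set_instance_data(env, g_ort_env.get(), CleanupOrtEnv, nullptr);
  return exports;
}

NAPI_MODULE(onnxruntime_binding, Init)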
