feat(server): refactor concurrency model and enhance task management #3257

Draft · wants to merge 1 commit into master

Conversation

@sachaarbonel (Contributor) commented on Jun 16, 2025

This pull request makes significant enhancements to the server implementation in whisper.cpp, especially around concurrency and request handling. Here’s a summary of the key changes:

Major Changes

1. Multi-threaded Task Queue for Inference

  • Introduces a new WhisperTaskQueue class, which manages a queue of inference tasks and processes them using multiple worker threads (the number of workers is now configurable via a new --workers server argument); a minimal sketch of this shape follows the list below.
  • All incoming POST requests for transcription are now handled asynchronously via this queue, allowing multiple requests to be processed in parallel.
  • Each task encapsulates all request information and handles its own completion signaling and abort reasons (e.g., client disconnect, server shutdown).
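
A minimal sketch of the producer/consumer shape described above; WhisperTaskQueue is the PR's class name, but the members, method names, and bodies below are illustrative assumptions, not the PR's actual code:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative sketch only: a fixed pool of workers draining a queue of tasks.
class WhisperTaskQueue {
public:
    explicit WhisperTaskQueue(size_t n_workers) {
        for (size_t i = 0; i < n_workers; ++i) {
            workers_.emplace_back([this] { worker_loop(); });
        }
    }

    ~WhisperTaskQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto & w : workers_) {
            w.join();
        }
    }

    // Producer side: the HTTP handler enqueues a task and returns quickly.
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    // Consumer side: each worker pops tasks and runs them (inference happens here).
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) {
                    return;
                }
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task(); // the real task runs whisper_full(...) and signals completion
        }
    }

    std::deque<std::function<void()>> tasks_;
    std::mutex                        mutex_;
    std::condition_variable           cv_;
    bool                              stop_ = false;
    std::vector<std::thread>          workers_;
};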

2. Graceful Shutdown and Robust Handling

  • The server now handles shutdowns more gracefully: ongoing and queued tasks are notified and can be aborted with meaningful error responses (503 for server shutdown, 499 for client disconnect); see the status-mapping sketch after this list.
  • The task queue is properly destroyed and re-created when a new model is loaded or on shutdown, ensuring clean resource management.
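
A rough sketch of the status mapping described above; AbortReason and NotAborted appear in the diff, while the other enumerator names and the abort_status helper are assumptions:

// Illustrative: translate a task's abort reason into an HTTP status code.
enum class AbortReason { NotAborted, ClientDisconnected, ServerShutdown };

static int abort_status(AbortReason reason) {
    switch (reason) {
        case AbortReason::ClientDisconnected: return 499; // client went away mid-request
        case AbortReason::ServerShutdown:     return 503; // server is shutting down
        default:                              return 200; // task completed normally
    }
}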

3. Refactoring of Main Server Logic

  • The request handler for audio transcription (POST /inference_path) is refactored to create a fresh copy of the parameter set for each request, improving thread safety; a rough sketch of the per-request copy appears after this list.
  • The actual inference is now performed inside the worker threads, so the server remains responsive to new requests.
  • The code for response formatting (plain text, SRT, VTT, verbose JSON) is encapsulated within the task result logic.
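
A rough sketch of the per-request parameter copy, assuming the server's cpp-httplib routing; the struct fields, route string, and handler body are placeholders rather than the PR's code:

#include <string>
#include "httplib.h"

// Placeholder stand-in for whisper_params; the real struct has many more fields.
struct whisper_params { int n_threads = 4; std::string language = "en"; };

int main() {
    httplib::Server svr;
    whisper_params default_params; // parsed once from the command line in the real server

    svr.Post("/inference", [&](const httplib::Request & /*req*/, httplib::Response & res) {
        whisper_params task_params = default_params; // fresh copy per request, safe to mutate
        // ... override fields from the multipart form data, wrap everything in a task,
        //     enqueue it, wait for the worker to signal completion, format the response ...
        res.set_content("ok", "text/plain");
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}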

4. Command-Line Interface and Usability Improvements

  • Adds a --workers N argument to control the number of concurrent worker threads for handling requests; a parsing sketch appears after this list.
  • Usage help text and parameter parsing are updated accordingly.
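
A self-contained sketch of how such a flag is typically parsed; the parse_workers function, its default, and the validation are illustrative assumptions rather than the PR's parser:

#include <cstdlib>
#include <cstring>

// Illustrative: scan argv for "--workers N" and fall back to a default otherwise.
static int parse_workers(int argc, char ** argv, int default_workers) {
    int n_workers = default_workers;
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "--workers") == 0) {
            n_workers = std::atoi(argv[i + 1]);
        }
    }
    return n_workers > 0 ? n_workers : default_workers;
}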

5. Minor Fixes and Improvements

  • Replaces std::atomic_flag with a more idiomatic std::atomic<bool> for the termination signal.
  • Reduces use of global and shared state, improving modularity and reliability.

Example: New Option

./server --workers 4

This will launch the server with 4 worker threads for handling requests concurrently.

@danbev (Collaborator) left a comment

I've done an initial pass over this and wanted to leave some early feedback (or feedback in chunks at least, so there are not too many comments in one go), and I'll go through this in more detail.

@@ -41,10 +43,10 @@ const std::string vjson_format = "verbose_json";
const std::string vtt_format = "vtt";

std::function<void(int)> shutdown_handler;
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;

I think that std::atomic_flag might be more appropriate here, as it is guaranteed to be lock-free, and it is only used in a signal handler where we don't want any potential blocking. So I'd prefer to keep this as is.
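
For reference, a minimal sketch of the signal-handler pattern being described here, reusing the shutdown_handler and is_terminating names from the hunk above; the handler body is illustrative rather than the file's exact code:

#include <atomic>
#include <csignal>
#include <cstdlib>
#include <functional>

// std::atomic_flag is the only atomic type guaranteed to be lock-free, so
// test_and_set() is safe to call from a signal handler without risking a block.
std::function<void(int)> shutdown_handler;
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;

void signal_handler(int sig) {
    if (is_terminating.test_and_set()) {
        std::_Exit(1); // a second interrupt while shutdown is in progress: exit immediately
    }
    shutdown_handler(sig); // first interrupt: request a graceful shutdown
}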

const httplib::Request* request_ptr; // For abort callback
std::atomic<AbortReason> abort_reason{AbortReason::NotAborted};
std::atomic<bool>* stop_flag_ptr{nullptr};


Nit: remove the empty space (not showing here, but visible using command-line git). There are a few more of these in the file, but I won't list them all to avoid cluttering the review.

WhisperTask() = default;

// Move constructor
WhisperTask(WhisperTask&& other) noexcept

Nit: remove the trailing whitespace (not showing here, but visible using command-line git). There are a few more of these in the file, but I won't list them all to avoid cluttering the review.

std::vector<std::vector<float>> pcmf32s;
whisper_params params;
std::string filename;
const httplib::Request* request_ptr; // For abort callback

Nit: pointers/references are preferred to be "non-leaning", i.e. const httplib::Request * request_ptr rather than const httplib::Request* request_ptr. There are a few more of these.


class WhisperTaskQueue {
public:
WhisperTaskQueue(struct whisper_context* ctx, size_t n_workers = 2)

Just wondering about the default of 2 for n_workers, as the default value is 1 elsewhere in the code; perhaps this should also be 1?

wparams.token_timestamps = !task.params.no_timestamps && task.params.response_format == vjson_format;
wparams.no_context = task.params.no_context;
wparams.suppress_nst = task.params.suppress_nst;


Perhaps also add the VAD parameters here, similar to what is done in the existing code:

wparams.vad = params.vad;
wparams.vad_model_path = params.vad_model.c_str();
wparams.vad_params.threshold = params.vad_threshold;
wparams.vad_params.min_speech_duration_ms = params.vad_min_speech_duration_ms;
wparams.vad_params.min_silence_duration_ms = params.vad_min_silence_duration_ms;
wparams.vad_params.max_speech_duration_s = params.vad_max_speech_duration_s;
wparams.vad_params.speech_pad_ms = params.vad_speech_pad_ms;
wparams.vad_params.samples_overlap = params.vad_samples_overlap;

std::mutex queue_mutex_;
std::condition_variable queue_cv_;
std::mutex whisper_mutex_; // Protect whisper context access
std::atomic<bool> stop_flag_;

I think stop_flag_ could just be a plain bool, as the accesses to it are all guarded by the lock used with queue_cv_ (queue_mutex_).
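
A minimal sketch of that suggestion; queue_mutex_, queue_cv_, and stop_flag_ are the members from the snippet above, while the class name, methods, and pending_ queue are illustrative:

#include <condition_variable>
#include <deque>
#include <mutex>

// Illustrative: stop_flag_ can be a plain bool as long as every access happens
// while holding queue_mutex_, the same mutex used with queue_cv_.
class QueueStopSketch {
public:
    void request_stop() {
        {
            std::lock_guard<std::mutex> lock(queue_mutex_);
            stop_flag_ = true;               // written under the lock
        }
        queue_cv_.notify_all();
    }

    bool wait_for_work() {
        std::unique_lock<std::mutex> lock(queue_mutex_);
        queue_cv_.wait(lock, [this] { return stop_flag_ || !pending_.empty(); }); // read under the lock
        return !stop_flag_;
    }

private:
    std::mutex              queue_mutex_;
    std::condition_variable queue_cv_;
    bool                    stop_flag_ = false; // no std::atomic needed
    std::deque<int>         pending_;           // placeholder for the real task queue
};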

@sachaarbonel (Contributor, Author)

Let's hold off on this one. I ran some benchmarks and tracing with NVIDIA Nsight, and the CPU is not the bottleneck.

@sachaarbonel marked this pull request as draft on June 18, 2025 at 08:53.