Workflows fail to stop while workflow inference profiling is enabled (readonly filesystem) #1110

Open
1 of 2 tasks
dagleaves opened this issue Mar 25, 2025 · 10 comments
Labels
bug Something isn't working

Comments

@dagleaves
Contributor

dagleaves commented Mar 25, 2025

Search before asking

  • I have searched the Inference issues and found no similar bug report.

Bug

profiling_directory: str = "./inference_profiling",

Because the default profiling directory is "./inference_profiling" (which resolves to /app/inference_profiling inside the container) and workflow profiling is enabled by default, stopping a workflow inference pipeline fails: the shutdown path errors when it tries to write the profiler trace to that folder on a read-only filesystem. Disabling workflow profiling via the ENABLE_WORKFLOWS_PROFILING=False environment variable is sufficient to resolve the issue.

That is only a workaround, though. I assume a proper fix would be to write the profiler output to a writable location such as /tmp? Not sure.
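
For illustration, here is a minimal sketch of what such a fallback could look like (the helper name and the use of the system temp directory are my assumption, not the library's actual implementation):

# Illustrative sketch only - not the actual inference code.
# The idea: fall back to a writable location when the configured profiling
# directory cannot be created, e.g. on a read-only container filesystem.
import os
import tempfile

def resolve_profiling_directory(preferred: str = "./inference_profiling") -> str:
    try:
        os.makedirs(preferred, exist_ok=True)
        return preferred
    except OSError:
        # e.g. [Errno 30] Read-only file system: '/app/inference_profiling'
        fallback = os.path.join(tempfile.gettempdir(), "inference_profiling")
        os.makedirs(fallback, exist_ok=True)
        return fallback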

Environment

Inference: 0.44.0
OS: Ubuntu Server 24.04

Minimal Reproducible Example

Preview any workflow and attempt to stop it.

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@dagleaves dagleaves added the bug Something isn't working label Mar 25, 2025
@dagleaves
Contributor Author

dagleaves commented Mar 25, 2025

Looks like stopping inference causes all future API requests to time out, even with profiling disabled. The /terminate call returns a 200 OK and then the server locks up. No logs or errors show up though. Might be a different issue. Unclear.

@PawelPeczek-Roboflow
Collaborator

hi there - thanks for reporting. Could you provide more context? Are you running video inference or inference against images? What is the sequence of actions that leads to the error?

@dagleaves
Contributor Author

Sure thing.

I set up the template Time in Zone workflow. I am running video inference on an RTSP stream using that workflow. I am able to start inference successfully, and the workflow runs. However, when I try to terminate the pipeline, one of the following cases occurs:

Case 1

If ENABLE_WORKFLOWS_PROFILING is set to True (default), the pipeline will fail to terminate.

ERROR:inference:{"event": "Could not handle Command. request_id=5a0effd2dd1c434795a68701251881cf, error=[Errno 30] Read-only file system: '/app/inference_profiling', error_type=internal_error, public_error_message=Unknown internal error. Raise this issue providing as much of a context as possible: https://github.com/roboflow/inference/issues", "timestamp": "2025-03-26 14:56.02", "exception": "Traceback (most recent call last):\n  File \"/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py\", line 134, in _handle_command\n    return self._terminate_pipeline(request_id=request_id)\n  File \"/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py\", line 407, in _terminate_pipeline\n    self._execute_termination()\n  File \"/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py\", line 433, in _execute_termination\n    self._inference_pipeline.join()\n  File \"/app/inference/core/interfaces/stream/inference_pipeline.py\", line 896, in join\n    self._on_pipeline_end()\n  File \"/app/inference/core/interfaces/stream/utils.py\", line 107, in on_pipeline_end\n    save_workflows_profiler_trace(\n  File \"/app/inference/core/interfaces/stream/utils.py\", line 123, in save_workflows_profiler_trace\n    os.makedirs(directory, exist_ok=True)\n  File \"/usr/lib/python3.10/os.py\", line 225, in makedirs\n    mkdir(name, mode)\nOSError: [Errno 30] Read-only file system: '/app/inference_profiling'", "filename": "inference_pipeline_manager.py", "func_name": "_handle_error", "lineno": 565}
Traceback (most recent call last):
 File "/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py", line 134, in _handle_command
   return self._terminate_pipeline(request_id=request_id)
 File "/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py", line 407, in _terminate_pipeline
   self._execute_termination()
 File "/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py", line 433, in _execute_termination
   self._inference_pipeline.join()
 File "/app/inference/core/interfaces/stream/inference_pipeline.py", line 896, in join
   self._on_pipeline_end()
 File "/app/inference/core/interfaces/stream/utils.py", line 107, in on_pipeline_end
   save_workflows_profiler_trace(
 File "/app/inference/core/interfaces/stream/utils.py", line 123, in save_workflows_profiler_trace
   os.makedirs(directory, exist_ok=True)
 File "/usr/lib/python3.10/os.py", line 225, in makedirs
   mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/app/inference_profiling'

ERROR:inference:{"positional_args": [{"status": "failure", "error_type": "internal_error", "error_class": "OSError", "error_message": "[Errno 30] Read-only file system: '/app/inference_profiling'", "public_error_message": "Unknown internal error. Raise this issue providing as much of a context as possible: https://github.com/roboflow/inference/issues"}], "event": "Malformed response returned by termination command, '%s'", "timestamp": "2025-03-26 14:56.02", "filename": "app.py", "func_name": "check_process_health", "lineno": 385}

Case 2

If ENABLE_WORKFLOWS_PROFILING is set to False, the pipeline terminates successfully. After that, I cannot start the workflow again: if I try, the inference_pipeline endpoints hang. I have confirmed that the pipeline no longer exists (it is not found in /list). If I do not try to start the workflow again, the other inference_pipeline endpoints keep working (only /initialise hangs).

The request never reaches the handler function. I added logging before anything is done in the /initialise handler, and it is never hit. Very weird.

Replication

  1. Set up workflow local deployment
  2. Start workflow
  3. Terminate workflow
  4. Try to start workflow again
  5. All inference_pipeline endpoints get gateway timeout

Inference code for reference

from inference_sdk import InferenceHTTPClient
import atexit
import time

client = InferenceHTTPClient(
    api_url="http://192.168.1.197:9001",
    api_key=""
)


max_fps = 1
result = client.start_inference_pipeline_with_workflow(
    video_reference=["rtsp://192.168.0.197:554/cam/realmonitor?channel=1&subtype=1"],
    workspace_name="local",
    workflow_id="time-in-zone",
    max_fps=max_fps,
    results_buffer_size=5,  # results are consumed from in-memory buffer - optionally you can control its size
)
print(result)
pipeline_id = result["context"]["pipeline_id"]

# Terminate the pipeline when the script exits
atexit.register(lambda: client.terminate_inference_pipeline(pipeline_id))

while True:
    result = client.consume_inference_pipeline_result(pipeline_id=pipeline_id)

    if not result["outputs"] or not result["outputs"][0]:
        # still initializing
        time.sleep(1 / max_fps)
        continue

    output = result["outputs"][0]
    print(output["time_in_zone"])

    time.sleep(1 / max_fps)

@dagleaves
Contributor Author

dagleaves commented Mar 26, 2025

Looks like Case 2 might be hanging while waiting for a response from the pipeline manager socket when initializing the pipeline the second time. I am able to get logs to appear if I hit a different endpoint (e.g. /workflows/definition/schema).

Successful command

INFO:     192.168.1.197:46840 - "GET /inference_pipelines/8292d80f-1c09-4c0c-a979-ee042579bc2f/consume HTTP/1.1" 200 OK
Connecting to 127.0.0.1 7070
Established connection to 127.0.0.1 7070
Sent message to 127.0.0.1 7070
Received response from 127.0.0.1 to 7070
INFO:     192.168.1.197:46852 - "POST /inference_pipelines/8292d80f-1c09-4c0c-a979-ee042579bc2f/terminate HTTP/1.1" 200 OK

Hanging initialize command

Start of initialize pipeline function
Connecting to 127.0.0.1 7070
Established connection to 127.0.0.1 7070
Sent message to 127.0.0.1 7070
INFO:     192.168.0.13:55790 - "GET /workflows/definition/schema HTTP/1.1" 200 OK

Edit: I have confirmed this is the case. I set STREAM_MANAGER_OPERATIONS_TIMEOUT=5 and the command does time out.

inference.core.interfaces.stream_manager.api.errors.ConnectivityError: Could not communicate with InferencePipeline Manager
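
For reference, the failing sequence boils down to the sketch below (a minimal sketch reusing the same InferenceHTTPClient setup as the reproduction code above; the RTSP URL is a placeholder):

# Same client as in the reproduction code above; the RTSP URL is a placeholder.
first = client.start_inference_pipeline_with_workflow(
    video_reference=["rtsp://<camera-address>/stream"],
    workspace_name="local",
    workflow_id="time-in-zone",
)
client.terminate_inference_pipeline(first["context"]["pipeline_id"])  # returns 200 OK

# With default settings this second call hangs (gateway timeout); with
# STREAM_MANAGER_OPERATIONS_TIMEOUT=5 set on the server it fails instead,
# and the server logs the ConnectivityError shown above.
second = client.start_inference_pipeline_with_workflow(
    video_reference=["rtsp://<camera-address>/stream"],
    workspace_name="local",
    workflow_id="time-in-zone",
)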

@PawelPeczek-Roboflow
Collaborator

ok, cool - thanks for all of the evidence - I will take a look when I have a moment, but cannot promise a strict deadline. Is that something you can live with for a while (I mean using the workaround)?

@PawelPeczek-Roboflow
Collaborator

this PR should bring the solution: #1123
if we fail to bump numpy to 2.0, we will provide a separate PR with just this patch

@PawelPeczek-Roboflow
Collaborator

hi there - would you be able to verify on your end?

@PawelPeczek-Roboflow
Collaborator

sorry - I was not fully awake there. We did not merge the change with the fix into main before the release, as it was on a branch with other changes I had not finished. Sorry for the confusion.

@dagleaves
Contributor Author

dagleaves commented Apr 22, 2025

The fix does seem to resolve the profiling issue. However, I am still only able to run a single pipeline before it hangs: it hangs when trying to start again. I can open a separate issue for that - thanks for the profiling fix.

@PawelPeczek-Roboflow
Collaborator

we had a PR this week that may have addressed this issue - please open a new issue with more details if the problem is still there after the next release (planned for the end of this week)
