Workflows fail to stop while workflow inference profiling is enabled (readonly filesystem) #1110

Open
1 of 2 tasks
dagleaves opened this issue Mar 25, 2025 · 10 comments
Labels
bug Something isn't working

Comments

@dagleaves
Contributor

dagleaves commented Mar 25, 2025

Search before asking

  • I have searched the Inference issues and found no similar bug report.

Bug

profiling_directory: str = "./inference_profiling",

Because the default profiling directory is "./inference_profiling" (which resolves to /app/inference_profiling inside the container) and workflow profiling is enabled by default, stopping a workflow inference pipeline fails: the shutdown path errors when it tries to write the profiler trace to that folder on a read-only filesystem. Disabling workflow profiling via the ENABLE_WORKFLOWS_PROFILING=False environment variable is sufficient to resolve the issue.

That is only a workaround, though. I assume a proper fix would be to write the profiler output to a writable location such as /tmp? Not sure.
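
For illustration, here is a minimal sketch of what such a fallback could look like (the helper name and the use of the system temp directory are my assumption, not the library's actual implementation):

# Illustrative sketch only - not the actual inference code.
# The idea: fall back to a writable location when the configured profiling
# directory cannot be created, e.g. on a read-only container filesystem.
import os
import tempfile

def resolve_profiling_directory(preferred: str = "./inference_profiling") -> str:
    try:
        os.makedirs(preferred, exist_ok=True)
        return preferred
    except OSError:
        # e.g. [Errno 30] Read-only file system: '/app/inference_profiling'
        fallback = os.path.join(tempfile.gettempdir(), "inference_profiling")
        os.makedirs(fallback, exist_ok=True)
        return fallback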

Environment

Inference: 0.44.0
OS: Ubuntu Server 24.04

Minimal Reproducible Example

Preview any workflow and attempt to stop it.

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@dagleaves dagleaves added the bug Something isn't working label Mar 25, 2025
@dagleaves
Contributor Author

dagleaves commented Mar 25, 2025

Looks like stopping inference causes all future API requests to time out, even with profiling disabled. The /terminate call returns a 200 OK and then the server locks up. No logs or errors show up though. Might be a different issue. Unclear.

@PawelPeczek-Roboflow
Collaborator

hi there - thanks for reporting. Could you provide more context? Are you running video inference or inference against images? What is the sequence of actions that leads to the error?

@dagleaves
Contributor Author

Sure thing.

I set up the template Time in Zone workflow. I am running video inference on an RTSP stream using that workflow. I am able to start inference successfully, and the workflow runs. However, when I try to terminate the pipeline, one of the following cases occurs:

Case 1

If ENABLE_WORKFLOWS_PROFILING is set to True (default), the pipeline will fail to terminate.

ERROR:inference:{"event": "Could not handle Command. request_id=5a0effd2dd1c434795a68701251881cf, error=[Errno 30] Read-only file system: '/app/inference_profiling', error_type=internal_error, public_error_message=Unknown internal error. Raise this issue providing as much of a context as possible: https://github.com/roboflow/inference/issues", "timestamp": "2025-03-26 14:56.02", "exception": "Traceback (most recent call last):\n  File \"/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py\", line 134, in _handle_command\n    return self._terminate_pipeline(request_id=request_id)\n  File \"/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py\", line 407, in _terminate_pipeline\n    self._execute_termination()\n  File \"/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py\", line 433, in _execute_termination\n    self._inference_pipeline.join()\n  File \"/app/inference/core/interfaces/stream/inference_pipeline.py\", line 896, in join\n    self._on_pipeline_end()\n  File \"/app/inference/core/interfaces/stream/utils.py\", line 107, in on_pipeline_end\n    save_workflows_profiler_trace(\n  File \"/app/inference/core/interfaces/stream/utils.py\", line 123, in save_workflows_profiler_trace\n    os.makedirs(directory, exist_ok=True)\n  File \"/usr/lib/python3.10/os.py\", line 225, in makedirs\n    mkdir(name, mode)\nOSError: [Errno 30] Read-only file system: '/app/inference_profiling'", "filename": "inference_pipeline_manager.py", "func_name": "_handle_error", "lineno": 565}
Traceback (most recent call last):
 File "/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py", line 134, in _handle_command
   return self._terminate_pipeline(request_id=request_id)
 File "/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py", line 407, in _terminate_pipeline
   self._execute_termination()
 File "/app/inference/core/interfaces/stream_manager/manager_app/inference_pipeline_manager.py", line 433, in _execute_termination
   self._inference_pipeline.join()
 File "/app/inference/core/interfaces/stream/inference_pipeline.py", line 896, in join
   self._on_pipeline_end()
 File "/app/inference/core/interfaces/stream/utils.py", line 107, in on_pipeline_end
   save_workflows_profiler_trace(
 File "/app/inference/core/interfaces/stream/utils.py", line 123, in save_workflows_profiler_trace
   os.makedirs(directory, exist_ok=True)
 File "/usr/lib/python3.10/os.py", line 225, in makedirs
   mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/app/inference_profiling'

ERROR:inference:{"positional_args": [{"status": "failure", "error_type": "internal_error", "error_class": "OSError", "error_message": "[Errno 30] Read-only file system: '/app/inference_profiling'", "public_error_message": "Unknown internal error. Raise this issue providing as much of a context as possible: https://github.com/roboflow/inference/issues"}], "event": "Malformed response returned by termination command, '%s'", "timestamp": "2025-03-26 14:56.02", "filename": "app.py", "func_name": "check_process_health", "lineno": 385}

Case 2

If ENABLE_WORKFLOWS_PROFILING is set to False, the pipeline terminates successfully. After that, I cannot start the workflow again: if I try, the inference_pipeline endpoints hang. I have confirmed that the pipeline no longer exists (it is not found in /list). If I do not try to start the workflow again, the other inference_pipeline endpoints keep working (only /initialise hangs).

The request never reaches the handler function. I added logging before anything is done in the /initialise handler, and it is never hit. Very weird.

Replication

  1. Set up workflow local deployment
  2. Start workflow
  3. Terminate workflow
  4. Try to start workflow again
  5. All inference_pipeline endpoints get gateway timeout

Inference code for reference

from inference_sdk import InferenceHTTPClient
import atexit
import time

client = InferenceHTTPClient(
    api_url="http://192.168.1.197:9001",
    api_key=""
)


max_fps = 1
result = client.start_inference_pipeline_with_workflow(
    video_reference=["rtsp://192.168.0.197:554/cam/realmonitor?channel=1&subtype=1"],
    workspace_name="local",
    workflow_id="time-in-zone",
    max_fps=max_fps,
    results_buffer_size=5,  # results are consumed from in-memory buffer - optionally you can control its size
)
print(result)
pipeline_id = result["context"]["pipeline_id"]

# Terminate the pipeline when the script exits
atexit.register(lambda: client.terminate_inference_pipeline(pipeline_id))

while True:
    result = client.consume_inference_pipeline_result(pipeline_id=pipeline_id)

    if not result["outputs"] or not result["outputs"][0]:
        # still initializing
        time.sleep(1 / max_fps)
        continue

    output = result["outputs"][0]
    print(output["time_in_zone"])

    time.sleep(1 / max_fps)

@dagleaves
Contributor Author

dagleaves commented Mar 26, 2025

Looks like Case 2 might be hanging while waiting for a response from the pipeline manager socket when initializing the pipeline the second time. I am able to get logs to appear if I hit a different endpoint (e.g. /workflows/definition/schema).

Successful command

INFO:     192.168.1.197:46840 - "GET /inference_pipelines/8292d80f-1c09-4c0c-a979-ee042579bc2f/consume HTTP/1.1" 200 OK
Connecting to 127.0.0.1 7070
Established connection to 127.0.0.1 7070
Sent message to 127.0.0.1 7070
Received response from 127.0.0.1 to 7070
INFO:     192.168.1.197:46852 - "POST /inference_pipelines/8292d80f-1c09-4c0c-a979-ee042579bc2f/terminate HTTP/1.1" 200 OK

Hanging initialize command

Start of initialize pipeline function
Connecting to 127.0.0.1 7070
Established connection to 127.0.0.1 7070
Sent message to 127.0.0.1 7070
INFO:     192.168.0.13:55790 - "GET /workflows/definition/schema HTTP/1.1" 200 OK

Edit: I have confirmed this is the case. I set STREAM_MANAGER_OPERATIONS_TIMEOUT=5 and the command does time out.

inference.core.interfaces.stream_manager.api.errors.ConnectivityError: Could not communicate with InferencePipeline Manager
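
For reference, the failing sequence boils down to the sketch below (a minimal sketch reusing the same InferenceHTTPClient setup as the reproduction code above; the RTSP URL is a placeholder):

# Same client as in the reproduction code above; the RTSP URL is a placeholder.
first = client.start_inference_pipeline_with_workflow(
    video_reference=["rtsp://<camera-address>/stream"],
    workspace_name="local",
    workflow_id="time-in-zone",
)
client.terminate_inference_pipeline(first["context"]["pipeline_id"])  # returns 200 OK

# With default settings this second call hangs (gateway timeout); with
# STREAM_MANAGER_OPERATIONS_TIMEOUT=5 set on the server it fails instead,
# and the server logs the ConnectivityError shown above.
second = client.start_inference_pipeline_with_workflow(
    video_reference=["rtsp://<camera-address>/stream"],
    workspace_name="local",
    workflow_id="time-in-zone",
)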

@PawelPeczek-Roboflow
Collaborator

ok, cool - thanks for all of the evidence - I will take a look when I have a moment, but cannot promise a strict deadline. Is that something you can live with for a while (I mean using the workaround)?

@PawelPeczek-Roboflow
Collaborator

this PR should bring the solution: #1123
if we fail to bump numpy to 2.0, we will provide a separate PR with just this patch

@PawelPeczek-Roboflow
Collaborator

hi there - would you be able to verify on your end?

@PawelPeczek-Roboflow
Collaborator

sorry - I was not fully awake there. We did not merge the change with the fix into main before the release, as it was on a branch with other changes I had not finished. Sorry for the confusion.

@dagleaves
Contributor Author

dagleaves commented Apr 22, 2025

The fix does seem to resolve the profiling issue. However, I am still only able to run a single pipeline before it hangs: it hangs when trying to start again. I can open a separate issue for that - thanks for the profiling fix.

@PawelPeczek-Roboflow
Collaborator

we had a PR this week that may have addressed this issue - please open a new issue with more details if the problem is still there after the next release (planned for the end of this week)
