Check ICE State before suspending Orchestrator #3614


Open
leszko wants to merge 2 commits into master

Conversation

@leszko (Contributor) commented Jun 6, 2025

Before sending an "out segment timeout" error (and suspending the Orchestrator), first check whether the ICE connection is in the connected state.
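For illustration, a minimal sketch of the idea, assuming pion/webrtc for the peer connection (import path shown for v4; adjust to the vendored version). `onOutSegmentTimeout` and `suspendOrch` are hypothetical names, not actual go-livepeer identifiers:

```go
package server

import (
	"github.com/pion/webrtc/v4" // assumed; adjust to the vendored pion version
)

// onOutSegmentTimeout is a hypothetical hook called when no output
// segment has arrived within the deadline. Only suspend the orchestrator
// if our own ingest is still ICE-connected; otherwise the missing output
// is likely an input problem.
func onOutSegmentTimeout(pc *webrtc.PeerConnection, suspendOrch func()) {
	if pc.ICEConnectionState() == webrtc.ICEConnectionStateConnected {
		suspendOrch()
		return
	}
	// ICE is not connected: input stopped, so do not penalize the
	// orchestrator for the missing output segments.
}
```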

@github-actions bot added labels on Jun 6, 2025: go (Pull requests that update Go code), AI (Issues and PR related to the AI-video branch)

codecov bot commented Jun 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.

Project coverage is 30.77426%. Comparing base (e098564) to head (bcc5662).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/ai_live_video.go | 0.00000% | 9 Missing ⚠️ |
| media/whip_connection.go | 0.00000% | 8 Missing ⚠️ |
| media/whip_server.go | 0.00000% | 2 Missing ⚠️ |
| server/ai_mediaserver.go | 0.00000% | 2 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

```diff
@@                 Coverage Diff                 @@
##              master       #3614         +/-   ##
===================================================
- Coverage   30.78230%   30.77426%   -0.00804%
===================================================
  Files            153         153
  Lines          45916       45928         +12
===================================================
  Hits           14134       14134
- Misses         30958       30970         +12
  Partials         824         824
```
| Files with missing lines | Coverage Δ |
|---|---|
| server/ai_process.go | 1.67926% <ø> (ø) |
| media/whip_server.go | 0.00000% <0.00000%> (ø) |
| server/ai_mediaserver.go | 4.71597% <0.00000%> (-0.00506%) ⬇️ |
| media/whip_connection.go | 76.70455% <0.00000%> (-3.65259%) ⬇️ |
| server/ai_live_video.go | 0.00000% <0.00000%> (ø) |

Continue to review the full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e098564...bcc5662.


@leszko leszko marked this pull request as ready for review June 6, 2025 08:43
@leszko leszko requested review from j0sh and mjh1 June 6, 2025 08:43
```diff
@@ -809,6 +809,7 @@ func (ls *LivepeerServer) CreateWhip(server *media.WHIPServer) http.Handler {
 		whepURL = "http://localhost:8889/" // default mediamtx output
 	}
 	whepURL = whepURL + streamName + "-out/whep"
+	conn := server.CreateWHIP(ctx, ssr, whepURL, w, r)
```
Contributor:

AFAICT it looks OK to move this line up/earlier, but it may be worth @j0sh confirming.

@j0sh (Collaborator):

We should preserve the original ordering.

Swapping them would block things like stream auth and orch selection, which is undesirable, since CreateWHIP is not guaranteed to complete quickly.

If the ICE state is still needed, it should be wired through the WHIPConnection struct, which is already available in this scope via the whipConn var. A bit unwieldy, but that's why we have the distinction between MediaState and WHIPConnection.
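For illustration, a minimal sketch of what wiring the ICE state through WHIPConnection could look like. The struct shape, field, and method names here are assumptions for the sake of the example, not the actual media package API:

```go
package media

import (
	"sync"

	"github.com/pion/webrtc/v4" // assumed pion version
)

// WHIPConnection is an illustrative shape only; the real
// media.WHIPConnection differs.
type WHIPConnection struct {
	mu sync.Mutex
	pc *webrtc.PeerConnection // set once the peer connection is established
}

func (w *WHIPConnection) setPeerConnection(pc *webrtc.PeerConnection) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pc = pc
}

// ICEConnected reports whether the underlying peer connection is in the
// "connected" ICE state. With something like this, the HTTP handler keeps
// its original ordering and later code consults the state via whipConn.
func (w *WHIPConnection) ICEConnected() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.pc != nil && w.pc.ICEConnectionState() == webrtc.ICEConnectionStateConnected
}
```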

@j0sh (Collaborator) left a comment:

Thanks for this PR - in addition to the comment about reordering the conn, is this PR still needed now that we have #3613? Specifically, I am curious about what exactly this PR fixes, e.g. what are the scenarios where:

  1. we time out via the 15 second segment timeout
  2. we do not have an ICE timeout
  3. the state is not connected and we haven't cleaned up the stream

BTW this may still produce spurious suspensions because the ICE state doesn't immediately transition to connected - here are the previous states.

But rather than checking for specific ICE states, which I feel are bound to change, we probably should tighten up our overall notion of when a stream is actually connected and when it isn't; I started a bit of that work in #3590. This might also involve stricter ICE (peer connection) checks, e.g. explicitly closing on the disconnected and failed states, but we need to do a little more verification before going that far; I'll try to get that in shortly.
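For illustration, a hedged sketch of the stricter peer-connection handling mentioned above, assuming pion/webrtc; `closeOnICEFailure` and the cleanup hook are hypothetical:

```go
package media

import (
	"log"

	"github.com/pion/webrtc/v4" // assumed pion version
)

// closeOnICEFailure tears the connection down on the disconnected/failed
// ICE states instead of waiting for a segment timeout. Note that pion can
// recover from "disconnected" back to "connected", so closing there is
// deliberately aggressive and needs the extra verification mentioned in
// the review.
func closeOnICEFailure(pc *webrtc.PeerConnection, cleanup func()) {
	pc.OnICEConnectionStateChange(func(state webrtc.ICEConnectionState) {
		switch state {
		case webrtc.ICEConnectionStateDisconnected, webrtc.ICEConnectionStateFailed:
			if err := pc.Close(); err != nil {
				log.Printf("error closing peer connection: %v", err)
			}
			cleanup() // hypothetical hook to stop the rest of the stream
		}
	})
}
```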


@leszko (Contributor, Author) commented Jun 10, 2025

Thanks for the comments Josh!

> Thanks for this PR - in addition to the comment about reordering the conn, is this PR still needed now that we have #3613? Specifically, I am curious about what exactly this PR fixes, e.g. what are the scenarios where:
>
> 1. we time out via the 15 second segment timeout
> 2. we do not have an ICE timeout
> 3. the state is not connected and we haven't cleaned up the stream

I think we still need it. Look at this example:

```
12:52:24.504 ice connection state changed state=disconnected
12:52:33.889 copy operation timed out: context deadline exceeded
12:52:33.889 ice connection state changed state=closed
```

So we may either further reduce the ICE timeout or use the ICE state to decide what to do when the out segment does not arrive by the given time. From the user's perspective, waiting 15s for the out segment is a lot. Wdyt?
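For illustration, a sketch of that decision logic, reusing the hypothetical ICEConnected helper sketched earlier in this thread; the channel and callbacks are placeholders, not go-livepeer code:

```go
package media

import "time"

// waitForOutSegment consults the ICE state on the 15s segment timeout
// before deciding who is at fault. suspendOrch and stopStream are
// placeholders for whatever action the caller takes.
func waitForOutSegment(outSeg <-chan struct{}, conn *WHIPConnection, suspendOrch, stopStream func()) {
	select {
	case <-outSeg:
		// output arrived in time; nothing to do
	case <-time.After(15 * time.Second):
		if conn.ICEConnected() {
			// input was still flowing, so the orchestrator is the
			// likely culprit: suspend (or swap) it
			suspendOrch()
		} else {
			// ingest dropped (state=disconnected in the log above);
			// don't blame the orchestrator for missing output
			stopStream()
		}
	}
}
```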

> But rather than checking for specific ICE states, which I feel are bound to change, we probably should tighten up our overall notion of when a stream is actually connected and when it isn't; I started a bit of that work in #3590. This might also involve stricter ICE (peer connection) checks, e.g. explicitly closing on the disconnected and failed states, but we need to do a little more verification before going that far; I'll try to get that in shortly.

Yeah, I think that's a good idea. We need some general conditions and rules for when a stream counts as "connected", and likewise for how long we "tolerate" out segments not coming back to the user before we stop or restart the whole pipeline.

@leszko leszko requested a review from j0sh June 10, 2025 07:32
@j0sh (Collaborator) commented Jun 11, 2025

> ```
> 12:52:24.504 ice connection state changed state=disconnected
> 12:52:33.889 copy operation timed out: context deadline exceeded
> 12:52:33.889 ice connection state changed state=closed
> ```

> use the ICE state to decide what to do when the out segment does not arrive by the given time. From the user's perspective, waiting 15s for the out segment is a lot

In this case, input stopped coming in (state=disconnected). So we are not going to get output segments, of course.

It feels odd to attribute stream termination to "no output segments" when the issue is that we actually had no input. We really should tighten up our criteria for what's considered connected and tear down the rest of the stream sooner than that.

Consider the case when we do orchestrator swapping (instead of stream termination like we do now). The output segment times out, so we swap orchestrators ... then what? Wait some more until we swap again or have an official disconnect?

> And the same for how long we "tolerate" out segments not coming back to the user before we stop or restart the whole pipeline.

Agreed, but we should also draw a clear distinction for when the issue actually comes from the O itself rather than from some other input problem.

Labels: AI (Issues and PR related to the AI-video branch), go (Pull requests that update Go code)

3 participants