
Camera stops recording after a few days #209

Closed
robin-thoni opened this issue Mar 30, 2022 · 20 comments
Labels
bug, rust (Rust backend work required)

Comments

@robin-thoni

Describe the bug
One of my cameras (the same as in #197) stops recording after a few days, with the following logs:

moonfire_1  | I20220328 20:09:03.324 sync-/var/lib/moonfire-nvr/media/main moonfire_db::db] Flush 10362 (why: 120 sec after start of 59 seconds cam01-rt02-main recording 1/30498):
moonfire_1  | /var/lib/moonfire-nvr/media/main: added 5M 859K 280B in 2 recordings (1/30498, 1/30499), deleted 0B in 0 (), GCed 0 recordings ().
moonfire_1  | /var/lib/moonfire-nvr/media/sub: added 2M 812K 120B in 1 recordings (2/30498), deleted 4M 215K 99B in 2 (2/30481, 2/30482), GCed 2 recordings (2/30479, 2/30480).
moonfire_1  | W20220328 20:09:38.034 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: sleeping for PT1S after error: stream ended
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | W20220328 20:09:38.039 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: sleeping for PT1S after error: stream ended
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | I20220328 20:09:39.034 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | I20220328 20:09:39.039 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | W20220328 20:09:39.338 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: sleeping for PT1S after error: [172.16.12.2:33708(me)->10.15.9.1:554@2022-03-28T20:09:39, 1302@2022-03-28T20:09:39 ch=0, stream=0, ssrc=11339322, seq=0000b8c6] Wrong ssrc; expecting ssrc=Some(76bbcb4e) seq=Some(b8c6)
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | W20220328 20:09:39.346 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: sleeping for PT1S after error: [172.16.12.2:33706(me)->10.15.9.1:554@2022-03-28T20:09:39, 39474@2022-03-28T20:09:39 ch=0, stream=0, ssrc=11339322, seq=00006747] Out-of-order packet or large loss; expecting ssrc=Some(11339322) seq=Some(b8e4)
moonfire_1  | 
moonfire_1  | (set environment variable RUST_BACKTRACE=1 to see backtraces)
moonfire_1  | I20220328 20:09:40.338 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | I20220328 20:09:40.338 s-cam01-rt02-main moonfire_nvr::streamer] cam01-rt02-main: waiting up to 63.999408706s for TEARDOWN or expiration of 1 stale sessions
moonfire_1  | I20220328 20:09:40.346 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: Opening input: rtsp://cam01.lesse.rthoni.com/streaming/channels/1
moonfire_1  | I20220328 20:09:40.346 s-cam01-rt02-sub moonfire_nvr::streamer] cam01-rt02-sub: waiting up to 63.991108989s for TEARDOWN or expiration of 1 stale sessions
moonfire_1  | I20220328 20:10:23.392 sync-/var/lib/moonfire-nvr/media/sub moonfire_db::db] Flush 10363 (why: 120 sec after start of 59 seconds cam01-rt02-sub recording 2/30499):
moonfire_1  | /var/lib/moonfire-nvr/media/main: added 1M 698K 287B in 1 recordings (1/30500), deleted 0B in 0 (), GCed 0 recordings ().
moonfire_1  | /var/lib/moonfire-nvr/media/sub: added 3M 602K 854B in 2 recordings (2/30499, 2/30500), deleted 2M 337K 240B in 1 (2/30483), GCed 2 recordings (2/30481, 2/30482).

Expected behavior
The camera should record indefinitely, or restart if the stream dies for some reason.

Screenshots
Screenshot from 2022-03-30 10-31-38
Screenshot from 2022-03-30 10-31-25
Screenshot from 2022-03-30 10-33-59

Server (please complete the following information):

$ docker ps | grep moonfire
ddfdaeb31c49        scottlamb/moonfire-nvr:v0.7.1                     "/usr/local/bin/moon…"    2 days ago          Up 35 minutes           127.0.0.1:35280->8080/tcp                                                                             docker-moonfire-server_moonfire_1

$ docker images | grep moonfire
scottlamb/moonfire-nvr                          v0.7.1                  d124dc9d87cc        3 months ago        755MB

Camera (please complete the following information):
Model: Trendnet TV-IP311PI
Firmware Version: V5.0.1 140303

@scottlamb
Owner

scottlamb commented Mar 30, 2022

Are you using TCP or UDP to talk with the cameras? (The default is TCP, so it's TCP unless you picked UDP within the camera setup dialog.)

sleeping for PT1S after error: stream ended

This says the camera decided to end the stream for whatever reason. I don't think there's anything Moonfire could do differently here.

sleeping for PT1S after error: [172.16.12.2:33708(me)->10.15.9.1:554@2022-03-28T20:09:39, 1302@2022-03-28T20:09:39 ch=0, stream=0, ssrc=11339322, seq=0000b8c6] Wrong ssrc; expecting ssrc=Some(76bbcb4e) seq=Some(b8c6)

I might add some more debug logging to narrow this down. Not sure how familiar you are with RTSP, but it's layered on top of another protocol, RTP. RTP packets all have an ssrc that more or less identifies the stream. Moonfire was expecting one ssrc (either because it was promised in an RTSP header or because a previous packet used that ssrc; extra debug logging could tell us which) and got a packet for a different one.
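
To make that concrete, here's a minimal sketch of the kind of check involved (made-up names; not Retina's actual code):

fn check_ssrc(expected: &mut Option<u32>, pkt_ssrc: u32) -> Result<(), String> {
    match *expected {
        // The first packet (or an ssrc promised in the Transport header)
        // pins the stream's ssrc.
        None => {
            *expected = Some(pkt_ssrc);
            Ok(())
        }
        Some(e) if e == pkt_ssrc => Ok(()),
        // A mismatch means the packet belongs to some other (possibly stale) session.
        Some(e) => Err(format!("Wrong ssrc; expecting ssrc={:08x}, got {:08x}", e, pkt_ssrc)),
    }
}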

When this happens, Moonfire suspects it's due to a bug in old versions of a popular RTSP server implementation called live555. In short, the server doesn't tear down TCP sessions properly, and the camera can end up sending several of them over the same connection, which cascades into problems with overload and then another bug that causes the connection to become corrupt after the camera's kernel TCP send buffer fills. There's a whole saga about this at scottlamb/retina#17.

Moonfire tries to avoid this cascade by waiting for the old session to go away before trying again, which is why you see this message:

waiting up to 63.999408706s for TEARDOWN or expiration of 1 stale sessions

What happens after that? There should be another attempt around 20:10:44.

@robin-thoni
Author

I'm using TCP explicitly in the config

What happens after that? There should be another attempt around 20:10:44.

Nothing, that's the end of the logs

This says the camera decided to end the stream for whatever reason. I don't think there's anything Moonfire could do differently here.

If the camera kills the stream, that's ok for me, as long as Moonfire retries indefinitely to reconnect, which doesn't seem to be the case here. Is that expected?

@scottlamb
Owner

Nothing, that's the end of the logs

😞

It kept running but didn't log any more? Was this the only camera Moonfire was configured to record?

If the camera kills the stream, that's ok for me, as long as Moonfire retries indefinitely to reconnect, which doesn't seem to be the case here. Is that expected?

No, that part is definitely broken then.

This sounds similar to #184, which I'd thought I'd fixed. I'll take another look.

@scottlamb
Owner

Reread the code in question; nothing's jumping out at me yet.

If you don't mind, it might help to enable more logging. Where there are --env lines in your nvr wrapper script, you can add one like --env=MOONFIRE_LOG=info,moonfire_nvr=debug,retina=debug, then do a nvr stop; nvr rm; nvr run to make it take effect. When a connection is dropped, there should then be a sequence of debug log lines from retina::client::teardown that narrow down the problem.

@scottlamb added the bug and rust (Rust backend work required) labels on Mar 31, 2022
@robin-thoni
Author

Yeah, it kept running and recording the second camera. Just no recordings/logs for that one anymore.

I just enabled the logs, and I'm now getting logs like this:

moonfire_1  | D20220401 10:23:19.197 tokio-runtime-worker retina::client] SETUP response: Response {
moonfire_1  |     version: V1_0,
moonfire_1  |     status: Ok,
moonfire_1  |     reason_phrase: "OK",
moonfire_1  |     headers: Headers(
moonfire_1  |         {
moonfire_1  |             HeaderName(
moonfire_1  |                 "CSeq",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "3",
moonfire_1  |             ),
moonfire_1  |             HeaderName(
moonfire_1  |                 "Date",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "Fri, Apr 01 2022 10:23:19 GMT",
moonfire_1  |             ),
moonfire_1  |             HeaderName(
moonfire_1  |                 "Session",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "1353878211;timeout=60",
moonfire_1  |             ),
moonfire_1  |             HeaderName(
moonfire_1  |                 "Transport",
moonfire_1  |             ): HeaderValue(
moonfire_1  |                 "RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=32887ea3;mode=\"play\"",
moonfire_1  |             ),
moonfire_1  |         },
moonfire_1  |     ),
moonfire_1  |     body: b"",
moonfire_1  | }

Is that what you're looking for? If so, I'll send you everything once it happens again

@robin-thoni
Author

Ah, second problem: the second camera has actually stopped recording since I restarted Moonfire on March 19th:

Screenshot from 2022-04-01 10-28-53

Screenshot from 2022-04-01 10-32-15

As you can see, there's a 0/1 second recording, then nothing. I've restarted multiple times just now, but the recording won't restart. The camera is working/streaming fine outside of Moonfire.

I attached the logs since I restarted a few minutes ago.

Probably a different bug, though?

@scottlamb
Owner

scottlamb commented Apr 1, 2022

Ahh! Got it. Those logs helped, and I believe both cameras' problems are due to the same bug.

From this log line:

W20220401 10:23:19.275 s-cam02-rt02-sub moonfire_nvr::streamer] cam02-rt02-sub: sleeping for PT1S after error: [172.16.12.2:54278(me)->10.15.9.2:554@2022-04-01T10:23:19, 1302@2022-04-01T10:23:19 ch=0, stream=0, ssrc=32887ea3, seq=000090b1] Wrong ssrc; expecting ssrc=Some(4c29bf0c) seq=Some(90b1)

I see that it reached this code here:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/rtp.rs#L135

in which it sees a strange ssrc and assumes it's due to a stale session from the live555 TCP bug, as described here:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/mod.rs#L70-L74

And we see that Moonfire is waiting for this session to go away:

I20220401 10:23:20.275 s-cam02-rt02-sub moonfire_nvr::streamer] cam02-rt02-sub: waiting up to 63.999485425s for TEARDOWN or expiration of 1 stale sessions

but it never does [edit: removed confused ideas based on looking at the wrong version of the code] because you're running Moonfire v0.7.1, which predates my fix to #184.

Also, your camera isn't advertising itself as using a buggy version of live555 (even though it is using such a version, or some other software with a similar bug), or we'd have seen this line logged:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/mod.rs#L1320-L1322

Anyway, I think there are several changes to make to improve this situation:

  • fix Retina to time out these "unknown session id" sessions. This is a straight-up bug on my part. This is already fixed in v0.7.2 and beyond.
  • also tweak Retina's logging to give a warning when it sees the live555 bug on a camera not already known to be problematic
  • maybe plumb through the tool attribute from the SDP into a log message and/or the result of the Test button in the camera UI, so we can see what it says. Your camera is buggy, and it might be good to find out how it advertises itself, as we do for other buggy live555 cameras. (You could also grab this yourself via ffmpeg or a packet capture.)
  • start trying to avoid this situation by sending a TEARDOWN even when using TCP. I already do this for the live555 buggy cameras. Maybe we should do that explicitly for however your camera advertises itself, or always make at least one quick try before dropping the TCP session, and/or add a pull-down in the config UI for TeardownPolicy (right now Moonfire always uses Auto); see the sketch after this list.
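
To sketch what that Auto policy amounts to (a hypothetical distillation of the behavior described above, not Retina's real logic, which lives in src/client/mod.rs):

enum TeardownPolicy {
    Auto,
    Always,
    Never,
}

// UDP sessions always get a TEARDOWN; for TCP, one is currently sent only
// when the server is known to have the live555 bug. The checklist above
// proposes at least one quick attempt for TCP regardless.
fn should_send_teardown(policy: TeardownPolicy, is_udp: bool, known_buggy: bool) -> bool {
    match policy {
        TeardownPolicy::Always => true,
        TeardownPolicy::Never => false,
        TeardownPolicy::Auto => is_udp || known_buggy,
    }
}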

In the meantime, a couple of ways you can work around this problem:

  • switch from TCP to UDP, and/or
  • just shut down Moonfire, manually wait a couple minutes for the old sessions to go away, and then start it again.

@robin-thoni
Author

Ah, glad it helped!

Thanks for taking the time to chase that bug! Unfortunately, I have little to no knowledge to help you fix it...

In the meantime, I'm working on (I already have a working PoC) a small gateway to monitor Moonfire through Zabbix, with automatic discovery, thanks to the Moonfire API. That allows me to trigger a problem when a camera hasn't had any recording for 5 minutes.

@jlpoolen
Contributor

jlpoolen commented Apr 1, 2022

Hi Robin,
"Poc"?? and could you expound upon your implementation of an alert system? Perhaps through the mailing list for this project?
Thank you,
John

@robin-thoni
Author

Hi John,
I meant Proof of Concept. That's a very simple gateway that will authenticate to Moonfire and do the following:

  • Discover the cameras and return them in a Zabbix-friendly way.
  • Return interesting data for each camera: basically startDate, endDate, and duration (a rough sketch follows this list).
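
Roughly like this (a minimal sketch of the discovery half, not the actual PoC; it assumes the reqwest crate with the blocking and json features plus serde_json, skips authentication, and the /api/ field names should be double-checked against the API docs):

use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the top-level API JSON; the real gateway authenticates first.
    let api: Value = reqwest::blocking::get("http://moonfire.example:8080/api/")?.json()?;
    // Map each camera to a Zabbix low-level-discovery entry ({#MACRO} keys).
    let discovery: Vec<Value> = api["cameras"]
        .as_array()
        .map(|cams| {
            cams.iter()
                .map(|c| json!({
                    "{#CAMERA_UUID}": c["uuid"],
                    "{#CAMERA_NAME}": c["shortName"],
                }))
                .collect()
        })
        .unwrap_or_default();
    println!("{}", json!(discovery));
    Ok(())
}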

The second part is a Zabbix template to connect Zabbix to it. That's the end result:

Screenshot from 2022-04-01 18-59-45

Screenshot from 2022-04-01 19-00-59

I plan to release it shortly

@scottlamb
Owner

Unfortunately, I have little to no knowledge to help you to fix it...

No worries. If you'd like to learn more (about Rust, RTSP, etc.), let me know. Otherwise, I've got this.

In the meantime, I'm working on (I already have a working PoC) a small gateway to monitor Moonfire through Zabbix, with automatic discovery, thanks to the Moonfire API. That allows me to trigger a problem when a camera hasn't had any recording for 5 minutes.

Neat! I don't know Zabbix, but in general: I'm open to adding support for white-box monitoring to Moonfire itself also. We can discuss more in a separate issue. I've been meaning for a while to export metrics in the Prometheus exposition format. It looks like Zabbix supports this format also.
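
For reference, the exposition format is plain text, one sample per line; a hypothetical Moonfire metric (made-up name) might look like this, which would let Zabbix alert when a stream stops advancing:

# HELP moonfire_recorded_seconds_total Seconds of video recorded, per stream.
# TYPE moonfire_recorded_seconds_total counter
moonfire_recorded_seconds_total{camera="cam01-rt02",stream="main"} 1814400
moonfire_recorded_seconds_total{camera="cam01-rt02",stream="sub"} 1814392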

@scottlamb
Owner

Actually, come to think of it, even if we did go on to launch the background task, that only tears down sessions with a known id, not these special live555 stale sessions with unknown ids. My fix for #184 wasn't complete. I started sending TEARDOWN requests for TCP sessions when we know the camera is buggy, which I guess has mostly avoided the problem since then for @jlpoolen. But once a stale session with unknown id is noted, we never clean it up. Oops.

Huh. On second (third? fourth? I've lost track) look, I have code that's supposed to do this here:

https://github.com/scottlamb/retina/blob/9b4f50231cefe3029df5e544d4e28eee6548dcd5/src/client/mod.rs#L259-L280

and it seems to be working in my best attempt to reproduce locally (an old Reolink camera with this bug, a hardcoded change to TeardownPolicy::Never, quick restart so there's a bad session to clean up). Huh. I may just have to release a version with more logging to figure this out.

@robin-thoni
Author

I may just have to release a version with more logging to figure this out.

Simply push a Docker image under a different name; I'll run and try it until it fails. As you can see, it can take 1 to 2 weeks.

I'm open to adding support for white-box monitoring to Moonfire itself also

Nice to hear that! I'll have a deeper look then.

@scottlamb
Owner

Oh! Moonfire NVR 0.7.1 is older than I realized. It predates my fix for #184. I should have been looking at Retina version v0.3.4 rather than latest when trying to figure this out. That explains why it didn't recover after 65 seconds. If you upgrade to the latest Moonfire NVR (note you'll need to adapt your nvr script and create a /etc/moonfire-nvr.toml file), this behavior will be better.
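
(For reference, that file is TOML, shaped roughly like the following; check the repo's config guide for the authoritative key names:)

[[binds]]
ipv4 = "0.0.0.0:8080"
allowUnauthenticatedPermissions = { viewVideo = true }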

The other items in my checklist still apply.

@robin-thoni
Author

I'll do the upgrade (tomorrow?) and see if it helps.

BTW, would you mind making use of GitHub releases? That would make it much easier to track new releases.

@scottlamb
Owner

I've done most of the coding in Retina on these checklist items (but haven't prepared either a Retina or Moonfire release yet). In the process of testing the changes in Moonfire NVR, I noticed a pre-existing panic on shutdown; see #206.

GitHub releases: sounds reasonable. Filed #210.

@robin-thoni
Author

I've updated both of my installations to 0.7.3. I'll update this issue when/if anything bad happens with it.

scottlamb added a commit to scottlamb/retina that referenced this issue Apr 12, 2022
@scottlamb
Owner

The newest commit addresses each of the checklist items. I plan to make a new release after fixing #206 and possibly #217.

Regarding this item:

start trying to avoid this situation by sending a TEARDOWN even when using TCP. I already do this for the live555 buggy cameras. Maybe we should do that explicitly for however your camera advertises itself, or always make at least one quick try before dropping the TCP session, and/or add a pull-down in the config UI for TeardownPolicy (right now Moonfire always uses Auto).

I adjusted the Auto policy to make that one quick try; more detail here. If you take a look at your camera's tool attribute (which will be exposed through the nvr config's "Test" button in the next release), I might tweak Retina to make repeated attempts for it, as it does for known-bad servers, depending on how distinct the attribute value seems. (If it doesn't advertise a tool at all, this won't work, of course.) I haven't added that pull-down to Moonfire but still could.

@scottlamb
Owner

v0.7.4 released.

@robin-thoni
Author

I just updated to 0.7.4. Before that, 0.7.3 had successfully recorded for 3 weeks in a row without any noticeable issue.

This is what it shows:

  • Trendnet TV-IP311PI:
    Screenshot from 2022-04-23 15-40-47

  • Trendnet TV-IP315PI:
    Screenshot from 2022-04-23 15-44-15

Thanks for your hard work!
