Performing an "ldmadmin restart" in LDM 6.13.13 and previous can result in queue corruption, data loss #89
sebenste changed the title on Mar 27, 2021
Are there any relevant messages in the LDM log file?
Unfortunately, no.
Gilbert
Is there any evidence of what the problem might be? Does an "ldmadmin restart" indicate anything?
No. That’s why this has been going on for…well, decades. I can’t reproduce it consistently, and it doesn’t happen often. All I know is that to trip the bug, you have to do an ldmadmin restart. Sometimes it happens after a single restart, sometimes after several, and sometimes never. I can’t see any pattern to this.
An ldmadmin restart doesn’t indicate anything, but the queue becomes corrupt minutes to hours (or even a day) after it gets restarted. That’s the crazy thing about this…
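Because the corruption is silent and the only visible symptom is that products stop being written to disk, one way to catch it early is a simple mtime watchdog. The sketch below is an illustration, not part of the LDM: the function name `recent_writes`, the data directory, and the 15-minute threshold in the usage comment are all assumptions to adapt to a real pqact output tree.

```shell
#!/bin/sh
# Watchdog sketch for the symptom described above: the queue goes bad
# silently, and the only visible sign is that decoded products stop
# appearing on disk. Succeeds (exit 0) if at least one file under the
# given directory was modified within the last N minutes.
recent_writes() {
    dir=$1      # directory pqact writes decoded products into (assumed)
    minutes=$2  # staleness threshold in minutes (assumed)
    [ -n "$(find "$dir" -type f -mmin "-$minutes" 2>/dev/null | head -n 1)" ]
}

# Example cron usage (hypothetical path and threshold):
#   if ! recent_writes /data/ldm 15; then
#       echo "no products written in 15 minutes; queue may be corrupt" >&2
#   fi
```

Run from cron, this would flag the "products stop writing" state within the chosen threshold instead of hours later.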
How have you determined that the queue becomes corrupt?
Products stop being written to disk. Whenever there is queue corruption, that always happens. And remaking the queue and restarting the LDM always fixes it, without exception.
Gilbert
Do you have any processes running outside of the LDM's process group (i.e., *not* executed by an EXEC entry in the LDM configuration-file) that insert products into the queue?
It happens on servers that do and do not insert products into the queue. It will even do it on our NOAAport ingester at our dish.
Gilbert
> Products stop being written to disk. Whenever there is queue corruption, that always happens.

If products stop being written to disk, then there should be at least one associated log message (for example, indicating that a pqact(1) process terminated). Would you please check again?
Unfortunately, the log has since been wiped out, so I can’t check. But I will double-check when it happens again. Sorry about that…
Gilbert
This was fixed in 6.13.14. Closing ticket.
OS: CentOS 7, fully updated
This bug has actually been around for many years, but I was hoping it was vanquished. I guess not...
We were bitten by this bug this evening. When executing an "ldmadmin restart", the LDM starts normally, but within hours the queue becomes corrupt and data stops being written to physical media. You must then stop the LDM, remake the queue, and start it again; this fixes the issue. Every time you delete the queue after stopping the LDM, everything is fine upon restart. But occasionally, if you only do an "ldmadmin restart", the queue becomes corrupt. In my experience, this is more likely to happen if:
You are running a high-volume feed with a high product count (think Level 2 radar or CONDUIT)
You do multiple restarts of the LDM, spaced hours or more apart
This does NOT happen, ever, if the queue is deleted and remade before restarting the LDM, even if you do this:
ldmadmin stop
ldmadmin clean
ldmadmin delqueue
ldmadmin restart
Or this:
ldmadmin stop
ldmadmin delqueue
ldmadmin mkqueue
ldmadmin start
It only happens when doing a straight "ldmadmin restart" command, nothing before or after it. Furthermore, it may not happen until hours after a restart.
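The workaround described above (always deleting and remaking the queue around a restart, never relying on a bare "ldmadmin restart") can be wrapped in a small script. This is a sketch only: the `safe_restart` function name and the `DRY_RUN` guard are conveniences invented here for illustration, not ldmadmin features; the command sequence itself is the one reported in this thread.

```shell
#!/bin/sh
# Safe-restart sketch based on the workaround in this thread: instead of a
# bare "ldmadmin restart", stop the LDM, delete and remake the queue, then
# start. DRY_RUN=1 echoes the commands instead of executing them (a guard
# added for illustration; it is not an ldmadmin feature).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

safe_restart() {
    run ldmadmin stop     # stop all LDM processes
    run ldmadmin delqueue # delete the (possibly corrupt) product queue
    run ldmadmin mkqueue  # recreate a fresh queue
    run ldmadmin start    # start the LDM against the new queue
}

# Usage on a live system:
#   safe_restart
# Preview the sequence without touching anything:
#   DRY_RUN=1 safe_restart
```

Until running a fixed release (6.13.14 or later, per the close of this ticket), substituting this sequence for every plain restart would sidestep the corruption path entirely.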