Performing an "ldmadmin restart" in LDM 6.13.13 and previous can result in queue corruption, data loss #89
sebenste changed the title on Mar 27, 2021
Are there any relevant messages in the LDM log file?
Unfortunately, no.
Gilbert
Is there any evidence of what the problem might be? Does an "ldmadmin restart" indicate anything?
No. That’s why this has been going on for…well, decades. I can’t reproduce it consistently, and it doesn’t happen often. All I know is that to trip the bug, you have to do an ldmadmin restart. Sometimes it happens after a single restart, sometimes after several, and sometimes never. I can’t see any pattern to this.
An ldmadmin restart doesn’t indicate anything, but the queue becomes corrupt minutes to hours (or even a day) after it gets restarted. That’s the crazy thing about this…
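Because the corruption is silent and the only visible symptom is that products stop being written to disk, one way to catch it early is a simple mtime watchdog. The sketch below is an illustration, not part of the LDM: the function name `recent_writes`, the data directory, and the 15-minute threshold in the usage comment are all assumptions to adapt to a real pqact output tree.

```shell
#!/bin/sh
# Watchdog sketch for the symptom described above: the queue goes bad
# silently, and the only visible sign is that decoded products stop
# appearing on disk. Succeeds (exit 0) if at least one file under the
# given directory was modified within the last N minutes.
recent_writes() {
    dir=$1      # directory pqact writes decoded products into (assumed)
    minutes=$2  # staleness threshold in minutes (assumed)
    [ -n "$(find "$dir" -type f -mmin "-$minutes" 2>/dev/null | head -n 1)" ]
}

# Example cron usage (hypothetical path and threshold):
#   if ! recent_writes /data/ldm 15; then
#       echo "no products written in 15 minutes; queue may be corrupt" >&2
#   fi
```

Run from cron, this would flag the "products stop writing" state within the chosen threshold instead of hours later.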
How have you determined that the queue becomes corrupt?
Products stop being written to disk. Whenever there is queue corruption, that always happens. And remaking the queue and restarting the LDM always fixes it, without exception.
Gilbert
Do you have any processes running outside of the LDM's process group (i.e., *not* executed by an EXEC entry in the LDM configuration-file) that insert products into the queue?
It happens on servers that do and do not insert products into the queue. It will even do it on our NOAAport ingester at our dish.
Gilbert
> Products stop being written to disk. Whenever there is queue corruption, that always happens.

If products stop being written to disk, then there should be at least one associated log message (for example, indicating that a pqact(1) process terminated). Would you please check again?
Unfortunately, the log has since been wiped out, so I can’t check. But I will double-check when it happens again. Sorry about that…
Gilbert
This was fixed in 6.13.14. Closing ticket.
OS: CentOS 7, fully updated
This bug has actually been around for many years, but I was hoping it was vanquished. I guess not...
We were bitten by this bug this evening. When executing an "ldmadmin restart", the LDM starts normally, but within hours the queue becomes corrupt and data stops being written to physical media. You must then stop the LDM, remake the queue, and start it again; this fixes the issue. Every time you delete the queue after stopping the LDM, everything is fine upon restart. But occasionally, if you only do an "ldmadmin restart", the queue becomes corrupt. In my experience, this is more likely to happen if:
You are running a high-volume feed with a high product count (think Level 2 radar or CONDUIT)
You do multiple restarts of the LDM, spaced hours or more apart
This does NOT happen, ever, if the queue is deleted and remade before restarting the LDM, even if you do this:
ldmadmin stop
ldmadmin clean
ldmadmin delqueue
ldmadmin restart
Or this:
ldmadmin stop
ldmadmin delqueue
ldmadmin mkqueue
ldmadmin start
It only happens when doing a straight "ldmadmin restart" command, nothing before or after it. Furthermore, it may not happen until hours after a restart.
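The workaround described above (always deleting and remaking the queue around a restart, never relying on a bare "ldmadmin restart") can be wrapped in a small script. This is a sketch only: the `safe_restart` function name and the `DRY_RUN` guard are conveniences invented here for illustration, not ldmadmin features; the command sequence itself is the one reported in this thread.

```shell
#!/bin/sh
# Safe-restart sketch based on the workaround in this thread: instead of a
# bare "ldmadmin restart", stop the LDM, delete and remake the queue, then
# start. DRY_RUN=1 echoes the commands instead of executing them (a guard
# added for illustration; it is not an ldmadmin feature).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

safe_restart() {
    run ldmadmin stop     # stop all LDM processes
    run ldmadmin delqueue # delete the (possibly corrupt) product queue
    run ldmadmin mkqueue  # recreate a fresh queue
    run ldmadmin start    # start the LDM against the new queue
}

# Usage on a live system:
#   safe_restart
# Preview the sequence without touching anything:
#   DRY_RUN=1 safe_restart
```

Until running a fixed release (6.13.14 or later, per the close of this ticket), substituting this sequence for every plain restart would sidestep the corruption path entirely.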