fix: jobsdb panics during recovery after backup failure(s) #3580
Conversation
Codecov Report

Coverage Diff (master vs. #3580):

             master    #3580    +/-
==========================================
- Coverage   68.08%   67.96%   -0.13%
==========================================
  Files         318      318
  Lines       50271    50274      +3
==========================================
- Hits        34229    34169     -60
- Misses      13816    13875     +59
- Partials     2226     2230      +4

View full report in Codecov by Sentry.
if err := jd.WithTx(func(tx *Tx) error {
	opID, err := jd.JournalMarkStartInTx(tx, backupDSOperation, opPayload)
	if err != nil {
		return fmt.Errorf("mark start of backup operation: %w", err)
	}
	if err := jd.backupDS(ctx, backupDSRange); err != nil {
		return fmt.Errorf("backup dataset: %w", err)
	}
	if err := jd.journalMarkDoneInTx(tx, opID); err != nil {
		return fmt.Errorf("mark end of backup operation: %w", err)
	}
	return nil
}); err != nil {
One of the usages of the journal is detecting whether a job got interrupted, so that a half-completed job can be recovered later.
With this approach here, since backupDS is not using the transaction, we effectively add a journal entry only once the backup job is completed. I am not sure how this will impact our recovery logic, or whether it is useful to use the journal pattern like this.
It looks like we are using the journal to detect incomplete backups and then run the following code:
func (jd *HandleT) removeTableJSONDumps() {
backupPathDirName := "/rudder-s3-dumps/"
tmpDirPath, err := misc.CreateTMPDIR()
jd.assertError(err)
files, err := filepath.Glob(fmt.Sprintf("%v%v_job*", tmpDirPath+backupPathDirName, jd.tablePrefix))
jd.assertError(err)
for _, f := range files {
err = os.Remove(f)
jd.assertError(err)
}
}
The cleanup is trivial and we can probably run it after every start, regardless of whether the backup succeeded. The OS tmp folder can also take care of this cleanup on a system restart.
I would propose to completely remove journaling for backups; we will get rid of a lot of code/complexity.
NOTE: We need to think a bit more about backupDropDSOperation, the next section.
Currently journaling is mostly used for logging; that's the reason we haven't removed it yet. The journal table can provide useful information for debugging jobsdb issues.
As for the recovery logic, we may remove it altogether once we are confident we no longer need it.
Still, wrapping with transactions doesn't help:
- For debugging, you will never know whether something was started and later interrupted.
- The recovery logic gets disabled by this change, so by keeping the code around we are just adding more confusion to the mix.
We can remove the recovery logic altogether; we can also do this as part of another PR.
Journaling we can keep, so as to have a comprehensive list of jobsdb operations and when they happened.
As for removeTableJSONDumps, yes, we can run it every time jobsdb starts.
Sounds fair.
Does it make sense to use the journal mark start/done inside the same transaction?
Note that during recovery, entries can be deleted:
if undoOp {
	sqlStatement = fmt.Sprintf(`DELETE from "%s_journal" WHERE id=$1`, jd.tablePrefix)
Do you have any concerns about using logging instead of a journal? Is it a reliability concern?
In the long run we might end up dropping the journal tables as well.
For the time being, and since we already have the journaling code in place and it is working properly, it is more straightforward to look in a single table to find out which operations completed successfully in jobsdb, when, and for how long, compared to searching through a pile of logs.
I would keep the journal mark start/done operations as is for now and:
- Remove all journal recovery code from jobsdb in a separate PR.
- Assess in a few months' time whether we still prefer having the journal tables, or whether they are not that important and can be replaced with logs.
Description
The BACKUP_DS journal operation is currently being marked started and completed without a transaction. Because of this, we create multiple journal start entries for BACKUP_DS whenever we get an error in the backupDS function. With this change, any error within the transaction rolls back the journal entry.
Notion Ticket
https://www.notion.so/rudderstacks/backup-failure-causing-jobsdb-panic-adf15a01e88241699be18ca53d9ab38c