fix: jobsdb panics during recovery after backup failure(s) #3580

BonapartePC · 2023-07-04T11:55:22Z

Description

The journal operation BACKUP_DS is currently marked as started and completed without a transaction. As a result, we write multiple journal start entries for BACKUP_DS when the backupDS function returns an error. With this change, any error within the transaction rolls back the journal entry.

Notion Ticket

https://www.notion.so/rudderstacks/backup-failure-causing-jobsdb-panic-adf15a01e88241699be18ca53d9ab38c

Security

The code changed/added as part of this pull request won't create any security issues with how the software is being used.

codecov · 2023-07-04T12:19:46Z

Codecov Report

Patch coverage: 25.92% and project coverage change: -0.13 ⚠️

Comparison is base (9e7f117) 68.08% compared to head (6665bd7) 67.96%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3580      +/-   ##
==========================================
- Coverage   68.08%   67.96%   -0.13%     
==========================================
  Files         318      318              
  Lines       50271    50274       +3     
==========================================
- Hits        34229    34169      -60     
- Misses      13816    13875      +59     
- Partials     2226     2230       +4

Impacted Files	Coverage Δ
jobsdb/backup.go	`70.03% <25.92%> (-0.37%)`	⬇️

... and 8 files with indirect coverage changes

jobsdb/backup.go

lvrach · 2023-07-05T11:12:16Z

+			if err := jd.WithTx(func(tx *Tx) error {
+				opID, err := jd.JournalMarkStartInTx(tx, backupDSOperation, opPayload)
+				if err != nil {
+					return fmt.Errorf("mark start of backup operation: %w", err)
+				}
+				if err := jd.backupDS(ctx, backupDSRange); err != nil {
+					return fmt.Errorf("backup dataset: %w", err)
+				}
+				if err := jd.journalMarkDoneInTx(tx, opID); err != nil {
+					return fmt.Errorf("mark end of backup operation: %w", err)
+				}
+				return nil
+			}); err != nil {


One use of the journal is detecting whether a job was interrupted, so that a half-completed job can be recovered later.

With this approach, since backupDS does not use the transaction, we effectively add the journal entry only once the backup job has completed. I am not sure how this will impact our recovery logic, or whether it is useful to use the journal pattern like this.

It looks like we are using journal to detect incomplete backups and run the following code:

func (jd *HandleT) removeTableJSONDumps() {
	backupPathDirName := "/rudder-s3-dumps/"
	tmpDirPath, err := misc.CreateTMPDIR()
	jd.assertError(err)
	files, err := filepath.Glob(fmt.Sprintf("%v%v_job*", tmpDirPath+backupPathDirName, jd.tablePrefix))
	jd.assertError(err)
	for _, f := range files {
		err = os.Remove(f)
		jd.assertError(err)
	}
}

The cleanup is trivial and we can probably run it on every start, regardless of whether the previous backup succeeded. The OS tmp folder can also take care of this cleanup during a system restart.

I would propose removing journaling for backups completely; we would get rid of a lot of code and complexity.

NOTE: We need to think a bit more about backupDropDSOperation, the next section.

Currently journaling is mostly used for logging, which is why we haven't removed it yet. The journal table can provide useful information for debugging jobsdb issues.

As for the recovery logic, we may remove it altogether once we are confident we no longer need it.

Still, wrapping with transactions doesn't help.

For debugging you will never know if something was started and later interrupted.

The recovery logic gets disabled with this change, so by keeping the code there we are just adding more confusion into the mix.

We can remove the recovery logic altogether; we can also do this as part of another PR.
Journaling we can keep, so as to have a comprehensive list of jobsdb operations and when they happened.
As for removeTableJSONDumps, yes, we can run it every time jobsdb starts.

Sounds fair.

Does it make sense to use journaling mark start/done inside the same transaction?

Note that during recovery, entries can be deleted:

if undoOp {
	sqlStatement = fmt.Sprintf(`DELETE from "%s_journal" WHERE id=$1`, jd.tablePrefix)

Do you have any concerns about using logging instead of a journal? Is it a reliability concern?

In the long run we might end up dropping journal tables as well.

For the time being, since we already have the journaling code in place and it is working properly, it is more straightforward to look in a single table to find out which operations completed successfully in jobsdb, when, and for how long, than to search through a pile of logs.

I would keep the journal mark start/done operations as they are for now and:

Remove all journal recovery code from jobsdb in a separate PR.

Assess in a few months' time whether we still prefer having the journal tables, or whether they are not that important and can be replaced with logs.

fix: backup panic should mark journal in tx

9e71882

github-actions bot added the server-team label Jul 4, 2023

Merge branch 'master' into fix.backupPanic

f67255a

BonapartePC requested review from lvrach, cisse21 and atzoum July 4, 2023 12:13

atzoum reviewed Jul 5, 2023

BonapartePC and others added 2 commits July 5, 2023 14:26

address comments

a4082bb

Merge branch 'master' into fix.backupPanic

57f9368

BonapartePC requested a review from atzoum July 5, 2023 09:08

atzoum changed the title ~~fix: backup panic should mark journal in tx~~ fix: jobsdb panics during recovery after backup failure(s) Jul 5, 2023

atzoum reviewed Jul 5, 2023

lvrach reviewed Jul 5, 2023

BonapartePC and others added 2 commits July 5, 2023 16:56

fix error

d7fbea4

Merge branch 'master' into fix.backupPanic

6665bd7

atzoum approved these changes Jul 5, 2023

BonapartePC requested a review from lvrach July 5, 2023 14:57

cisse21 approved these changes Jul 6, 2023

atzoum merged commit abd9c8c into master Jul 6, 2023
37 checks passed

atzoum deleted the fix.backupPanic branch July 6, 2023 12:12

devops-github-rudderstack mentioned this pull request Jul 6, 2023

chore: release 1.11.0 #3593

Merged

This was referenced Jul 6, 2023

chore: prerelease 1.11.0-rc.1 #3594

Merged

chore: prerelease 1.11.0-rc.2 #3624

Merged
