Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: jobsdb panics during recovery after backup failure(s) #3580

Merged
merged 6 commits into from
Jul 6, 2023

Conversation

BonapartePC
Copy link
Contributor

@BonapartePC BonapartePC commented Jul 4, 2023

Description

Journal operationBACKUP_DS is currently being marked and completed without a transaction. Due to this, we are making multiple entries of journal start for BACKUP_DS when we get an error in backupDs function. With this change, any error within a transaction should roll back the change.

Notion Ticket

https://www.notion.so/rudderstacks/backup-failure-causing-jobsdb-panic-adf15a01e88241699be18ca53d9ab38c

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@codecov
Copy link

codecov bot commented Jul 4, 2023

Codecov Report

Patch coverage: 25.92% and project coverage change: -0.13 ⚠️

Comparison is base (9e7f117) 68.08% compared to head (6665bd7) 67.96%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3580      +/-   ##
==========================================
- Coverage   68.08%   67.96%   -0.13%     
==========================================
  Files         318      318              
  Lines       50271    50274       +3     
==========================================
- Hits        34229    34169      -60     
- Misses      13816    13875      +59     
- Partials     2226     2230       +4     
Impacted Files Coverage Δ
jobsdb/backup.go 70.03% <25.92%> (-0.37%) ⬇️

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

jobsdb/backup.go Outdated Show resolved Hide resolved
jobsdb/backup.go Outdated Show resolved Hide resolved
jobsdb/backup.go Outdated Show resolved Hide resolved
jobsdb/backup.go Outdated Show resolved Hide resolved
jobsdb/backup.go Outdated Show resolved Hide resolved
@BonapartePC BonapartePC requested a review from atzoum July 5, 2023 09:08
@atzoum atzoum changed the title fix: backup panic should mark journal in tx fix: jobsdb panics during recovery after backup failure(s) Jul 5, 2023
jobsdb/backup.go Outdated Show resolved Hide resolved
Comment on lines +75 to +87
if err := jd.WithTx(func(tx *Tx) error {
opID, err := jd.JournalMarkStartInTx(tx, backupDSOperation, opPayload)
if err != nil {
return fmt.Errorf("mark start of backup operation: %w", err)
}
if err := jd.backupDS(ctx, backupDSRange); err != nil {
return fmt.Errorf("backup dataset: %w", err)
}
if err := jd.journalMarkDoneInTx(tx, opID); err != nil {
return fmt.Errorf("mark end of backup operation: %w", err)
}
return nil
}); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the usages of the journal is detecting if a job got interrupted, so it could recover a half-completed job later.

With this approach here, since backupDS is not using the transaction we effectively add a journal entry once the backup job is completed. I am not sure how this will impact our recovery logic, and if it is useful to use the journal pattern like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we are using journal to detect incomplete backups and run the following code:

func (jd *HandleT) removeTableJSONDumps() {
	backupPathDirName := "/rudder-s3-dumps/"
	tmpDirPath, err := misc.CreateTMPDIR()
	jd.assertError(err)
	files, err := filepath.Glob(fmt.Sprintf("%v%v_job*", tmpDirPath+backupPathDirName, jd.tablePrefix))
	jd.assertError(err)
	for _, f := range files {
		err = os.Remove(f)
		jd.assertError(err)
	}
}

The cleanup is trivial and we can probably run it after every start, regardless of a successful backup or not. The os tmp folder can also take care of this cleanup doing a system restart.

I would propose to completely remove journaling for backups, we will get right of a lot of code/complexity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: We need to think a bit more about backupDropDSOperation, the next section.

Copy link
Contributor

@atzoum atzoum Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently journaling is mostly used for logging, that's the reason we haven't removed it yet. Journal table can provide useful information for debugging jobsdb issues.

Recovery logic, we may remove it altogether once we are confident we no longer need it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wrapping with transactions doesn't help.

  • For debugging you will never know if something was started and later interrupted.
  • The recovery logic gets disabled with this change, so by keeping the code there we are just adding more confusion into the mix

Copy link
Contributor

@atzoum atzoum Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove recovery logic altogether, we can also do this as part of a another PR.
Journaling we can keep, so as to have a comprehensive list of jobdb operations and when they happened
As for removeTableJSONDumps, yes we can run it every time jobsdb starts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fair.

Does it make sense to use journaling mart start/done inside the same transaction?

Note that during recovery, entries can be deleted:

	if undoOp {
		sqlStatement = fmt.Sprintf(`DELETE from "%s_journal" WHERE id=$1`, jd.tablePrefix)

Do you have any concerns about using logging instead of a journal? Is it a reliability concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the long run we might end up dropping journal tables as well.

For the time being and since we already have the journaling code in place and it is working properly, it is more straightforward to look in a single table for finding out what operations happened successfully in jobsdb, when and for how long, compared to searching in a pile of logs.

I would keep journal mark start/done operations as is for now and:

  • Remove all journal recovery code from jobsdb in a separate PR.
  • Assess in a few months time whether we still prefer having the journal tables or they are not that important and replace them with logs

@BonapartePC BonapartePC requested a review from lvrach July 5, 2023 14:57
@atzoum atzoum merged commit abd9c8c into master Jul 6, 2023
37 checks passed
@atzoum atzoum deleted the fix.backupPanic branch July 6, 2023 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants