
fix: jobsdb panics during recovery after backup failure(s) #3580

Merged · 6 commits · Jul 6, 2023
61 changes: 32 additions & 29 deletions jobsdb/backup.go
@@ -72,37 +72,40 @@
 		opPayload, err := json.Marshal(&backupDS)
 		jd.assertError(err)
 
-		opID, err := jd.JournalMarkStart(backupDSOperation, opPayload)
-		if err != nil {
-			return fmt.Errorf("mark start of backup operation: %w", err)
-		}
-		err = jd.backupDS(ctx, backupDSRange)
-		if err != nil {
-			return fmt.Errorf("backup dataset: %w", err)
-		}
-		err = jd.JournalMarkDone(opID)
-		if err != nil {
-			return fmt.Errorf("mark end of backup operation: %w", err)
+		if err := jd.WithTx(func(tx *Tx) error {
+			opID, err := jd.JournalMarkStartInTx(tx, backupDSOperation, opPayload)
+			if err != nil {
+				return fmt.Errorf("mark start of backup operation: %w", err)
+			}
+			if err := jd.backupDS(ctx, backupDSRange); err != nil {
+				return fmt.Errorf("backup dataset: %w", err)
+			}
+			if err := jd.journalMarkDoneInTx(tx, opID); err != nil {
+				return fmt.Errorf("mark end of backup operation: %w", err)
+			}
+			return nil
+		}); err != nil {
Comment on lines +75 to +87
Member:

One of the usages of the journal is detecting whether a job got interrupted, so that a half-completed job can be recovered later.

With this approach here, since backupDS is not using the transaction, we effectively add a journal entry only once the backup job is completed. I am not sure how this will impact our recovery logic, or whether it is useful to use the journal pattern like this.
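
(Illustrative sketch, not part of the diff, error handling elided: the old pattern commits the start mark before backupDS runs, so a crash mid-backup leaves a visible started-but-not-done journal entry; the transactional pattern commits both marks together only after backupDS has already succeeded, so an interrupted backup leaves no journal trace.)

	// Old pattern: the start mark commits on its own, before the (long-running) backup.
	// A crash during backupDS leaves a "started but not done" journal entry behind,
	// which is what the recovery logic looks for.
	opID, _ := jd.JournalMarkStart(backupDSOperation, opPayload)
	_ = jd.backupDS(ctx, backupDSRange)
	_ = jd.JournalMarkDone(opID)

	// New pattern: start and done marks commit atomically, and only after backupDS
	// has succeeded, so an interrupted backup never shows up in the journal at all.
	_ = jd.WithTx(func(tx *Tx) error {
		opID, err := jd.JournalMarkStartInTx(tx, backupDSOperation, opPayload)
		if err != nil {
			return err
		}
		if err := jd.backupDS(ctx, backupDSRange); err != nil {
			return err
		}
		return jd.journalMarkDoneInTx(tx, opID)
	})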

Member:

It looks like we are using the journal to detect incomplete backups and run the following code:

func (jd *HandleT) removeTableJSONDumps() {
	backupPathDirName := "/rudder-s3-dumps/"
	tmpDirPath, err := misc.CreateTMPDIR()
	jd.assertError(err)
	files, err := filepath.Glob(fmt.Sprintf("%v%v_job*", tmpDirPath+backupPathDirName, jd.tablePrefix))
	jd.assertError(err)
	for _, f := range files {
		err = os.Remove(f)
		jd.assertError(err)
	}
}

The cleanup is trivial and we can probably run it after every start, regardless of whether the backup succeeded or not. The OS tmp folder can also take care of this cleanup upon a system restart.

I would propose completely removing journaling for backups; we would get rid of a lot of code/complexity.
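
(A minimal sketch of that proposal; the helper name below is hypothetical and not part of this PR:)

	// Hypothetical startup hook: always remove leftover local dump files,
	// instead of gating the cleanup on journal-based crash recovery.
	func (jd *HandleT) cleanupBackupDumpsOnStartup() {
		jd.removeTableJSONDumps() // safe to run even if the previous backup completed normally
	}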

Member:

NOTE: We need to think a bit more about backupDropDSOperation, covered in the next section of the diff.

Contributor (@atzoum), Jul 5, 2023:

Currently, journaling is mostly used for logging; that's the reason we haven't removed it yet. The journal table can provide useful information for debugging jobsdb issues.

As for the recovery logic, we may remove it altogether once we are confident we no longer need it.

Member:

Still, wrapping with transactions doesn't help:

  • For debugging, you will never know if something was started and later interrupted.
  • The recovery logic gets disabled with this change, so by keeping the code there we are just adding more confusion into the mix.

Contributor (@atzoum), Jul 5, 2023:

We can remove the recovery logic altogether; we can also do this as part of another PR.
Journaling we can keep, so as to have a comprehensive list of jobsdb operations and when they happened.
As for removeTableJSONDumps, yes, we can run it every time jobsdb starts.

Member:

Sounds fair.

Does it make sense to use journal mark start/done inside the same transaction?

Note that during recovery, entries can be deleted:

	if undoOp {
		sqlStatement = fmt.Sprintf(`DELETE from "%s_journal" WHERE id=$1`, jd.tablePrefix)

Do you have any concerns about using logging instead of a journal? Is it a reliability concern?

Contributor:

In the long run we might end up dropping journal tables as well.

For the time being, since we already have the journaling code in place and it is working properly, it is more straightforward to look in a single table to find out which operations happened successfully in jobsdb, when, and for how long, compared to searching through a pile of logs.

I would keep journal mark start/done operations as is for now and:

  • Remove all journal recovery code from jobsdb in a separate PR.
  • Assess in a few months' time whether we still prefer having the journal tables, or whether they are not that important and can be replaced with logs.

+			return err
 		}
 
-		// drop dataset after successfully uploading both jobs and jobs_status to s3
-		opID, err = jd.JournalMarkStart(backupDropDSOperation, opPayload)
-		if err != nil {
-			return fmt.Errorf("mark start of drop backup operation: %w", err)
-		}
-		// Currently, we retry uploading a table for some time & if it fails. We only drop that table & not all `pre_drop` tables.
-		// So, in situation when new table creation rate is more than drop. We will still have pipe up issue.
-		// An easy way to fix this is, if at any point of time exponential retry fails then instead of just dropping that particular
-		// table drop all subsequent `pre_drop` table. As, most likely the upload of rest of the table will also fail with the same error.
-		err = jd.dropDS(backupDS)
-		if err != nil {
-			return fmt.Errorf(" drop dataset: %w", err)
-		}
-		err = jd.JournalMarkDone(opID)
-		if err != nil {
-			return fmt.Errorf("mark end of drop backup operation: %w", err)
-		}
-		return nil
+		return jd.WithTx(func(tx *Tx) error {
+			// drop dataset after successfully uploading both jobs and jobs_status to s3
+			opID, err := jd.JournalMarkStartInTx(tx, backupDropDSOperation, opPayload)
+			if err != nil {
+				return fmt.Errorf("mark start of drop backup operation: %w", err)
+			}
+			// Currently, we retry uploading a table for some time & if it fails. We only drop that table & not all `pre_drop` tables.
+			// So, in situation when new table creation rate is more than drop. We will still have pipe up issue.
+			// An easy way to fix this is, if at any point of time exponential retry fails then instead of just dropping that particular
+			// table drop all subsequent `pre_drop` table. As, most likely the upload of rest of the table will also fail with the same error.
+			if err := jd.dropDSInTx(tx, backupDS); err != nil {
+				return fmt.Errorf(" drop dataset: %w", err)
+			}
+			if err := jd.journalMarkDoneInTx(tx, opID); err != nil {
+				return fmt.Errorf("mark end of drop backup operation: %w", err)
+			}
+			return nil
+		})
 	}
 
 	if err := loop(); err != nil && ctx.Err() == nil {
 		if !jd.skipMaintenanceError {