Implement cataloger current diff using scanners #790

Merged
merged 40 commits into from
Oct 19, 2020

Conversation

@nopcoder nopcoder commented Oct 8, 2020

Use a DB entity scanner to go over the parent and child branches and perform the diff operation.
The scanner interface operates as an iterator that scans a branch or a lineage.
The diff implementation uses the scanner to compare the two branches and writes the diff result into a temporary table.
Iteration supports starting after a given path, and the diff loop can limit its output to X records for cases where pagination is needed.
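
To make the description above concrete, here is a minimal in-memory sketch of an iterator-style scanner and a diff loop that walks two scanners in path order. It is not the code from this PR: `Entry`, `Scanner`, `sliceScanner`, `Difference` and `diffScanners` are hypothetical names, the scanners read from sorted slices rather than the database, and the "added"/"removed"/"changed" labels assume the second scanner is the side whose changes are being reported.

```go
package main

// Entry is a hypothetical, trimmed-down catalog entry.
type Entry struct {
	Path     string
	Checksum string
}

// Scanner is an assumed iterator shape: Next advances, Value returns the
// entry at the current position, Err reports a failure after iteration stops.
type Scanner interface {
	Next() bool
	Value() *Entry
	Err() error
}

// sliceScanner iterates over entries that are already sorted by path.
type sliceScanner struct {
	entries []Entry
	idx     int
}

// newSliceScanner positions the scanner so that the first Next() lands on the
// first entry whose path is strictly greater than after.
func newSliceScanner(sorted []Entry, after string) *sliceScanner {
	i := 0
	for i < len(sorted) && sorted[i].Path <= after {
		i++
	}
	return &sliceScanner{entries: sorted, idx: i - 1}
}

func (s *sliceScanner) Next() bool {
	if s.idx+1 >= len(s.entries) {
		return false
	}
	s.idx++
	return true
}

func (s *sliceScanner) Value() *Entry {
	if s.idx < 0 || s.idx >= len(s.entries) {
		return nil
	}
	return &s.entries[s.idx]
}

func (s *sliceScanner) Err() error { return nil }

// Difference is one row of the diff output.
type Difference struct {
	Type string // "added", "removed" or "changed"
	Path string
}

// diffScanners merges two path-ordered scanners. It returns at most limit
// differences (limit <= 0 means unlimited) and reports whether more remain.
func diffScanners(left, right Scanner, limit int) ([]Difference, bool, error) {
	var out []Difference
	hasL, hasR := left.Next(), right.Next()
	for hasL || hasR {
		var d *Difference
		switch {
		case !hasR || (hasL && left.Value().Path < right.Value().Path):
			// Path exists only on the left side.
			d = &Difference{Type: "removed", Path: left.Value().Path}
			hasL = left.Next()
		case !hasL || right.Value().Path < left.Value().Path:
			// Path exists only on the right side.
			d = &Difference{Type: "added", Path: right.Value().Path}
			hasR = right.Next()
		default:
			// Same path on both sides: report only if the content differs.
			if left.Value().Checksum != right.Value().Checksum {
				d = &Difference{Type: "changed", Path: left.Value().Path}
			}
			hasL, hasR = left.Next(), right.Next()
		}
		if d == nil {
			continue
		}
		if limit > 0 && len(out) == limit {
			return out, true, nil // one more difference exists beyond the page
		}
		out = append(out, *d)
	}
	if err := left.Err(); err != nil {
		return nil, false, err
	}
	if err := right.Err(); err != nil {
		return nil, false, err
	}
	return out, false, nil
}
```

In this sketch a caller would fetch the next page by creating both scanners with `after` set to the last path returned by the previous call.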

@nopcoder nopcoder self-assigned this Oct 8, 2020
@nopcoder nopcoder added the area/cataloger Improvements or additions to the cataloger label Oct 8, 2020
@nopcoder nopcoder added this to In progress in lakeFS via automation Oct 8, 2020
@nopcoder nopcoder moved this from In progress to Review in progress in lakeFS Oct 8, 2020
- keep original options and setting defaults if needed

codecov-io commented Oct 11, 2020

Codecov Report

Merging #790 into master will increase coverage by 0.18%.
The diff coverage is 78.11%.

@@            Coverage Diff             @@
##           master     #790      +/-   ##
==========================================
+ Coverage   42.92%   43.11%   +0.18%     
==========================================
  Files         135      136       +1     
  Lines       10571    10648      +77     
==========================================
+ Hits         4538     4591      +53     
- Misses       5443     5460      +17     
- Partials      590      597       +7     
Impacted Files Coverage Δ
catalog/diff.go 18.51% <ø> (ø)
catalog/views.go 97.01% <ø> (-1.65%) ⬇️
catalog/cataloger_merge.go 60.81% <66.66%> (-0.53%) ⬇️
catalog/cataloger_diff.go 66.66% <73.59%> (+9.43%) ⬆️
catalog/db_lineage_scanner.go 77.55% <77.55%> (ø)
catalog/db_branch_scanner.go 90.32% <90.32%> (ø)
catalog/db_scanner.go 100.00% <100.00%> (ø)
catalog/model.go 76.92% <100.00%> (+15.38%) ⬆️
catalog/cataloger_create_entry.go 94.73% <0.00%> (-5.27%) ⬇️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1f3cd72...5f1f6e5.

@arielshaqed arielshaqed left a comment

Just requests to clarify some oddities that make it hard for me to read. Nothing blocking.

@arielshaqed arielshaqed left a comment

Cool, thanks!

You may want someone more qualified to approve too, I mostly used this review to clear up my ignorance

@nopcoder nopcoder changed the title Implement cataloger diff using scanners Implement cataloger current diff using scanners Oct 15, 2020

@arielshaqed arielshaqed left a comment

Thanks!

We're storing CTIDs inside tables. This is very scary, because TFM warns against using them for anything long-term. I know that the current implementation only writes them to an effectively-temporary table so it might be OK, but we do need a tonne of documentation around this:

  1. All functions that might write a CTID to a table some time during their execution should carry a warning. Specifically when creating such a dangerous table.
  2. User-level documentation must warn never ever to run VACUUM FULL on the database if there is any chance of a concurrent merge. The fact that we would never do such a thing does not mean everyone follows the same ops guidelines, so we have to tell everyone they have to do it our way. E.g. user X might have their own small private lakeFS instance (we actually encourage this) and naïvely suppose that this is safe because it is fast. We should ensure that they know enough to be safe.
  3. If there is any chance of such tables surviving and being re-used then we must document backup strategy etc.

Very sorry, but I am requesting changes with this as the primary blocker. It cannot be solved by code because it is not about code, it is about dangerous future code or dangerous current ops. I am treating warning all devs calling these functions, and any users running VACUUM FULL, as a threat to user data integrity. So: if they are not such please let me know and I will reconsider.

}

diffRec := &diffResultRecord{
SourceBranch: parentID,
Contributor

I don't understand: isn't this ID fixed on each record, and equal to a parameter that the caller passed in? (If so, can we just not include it?)

Contributor Author

Right, it is a bug: it should take the branch ID from the parent entry's branch.
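
A minimal sketch of the fix being described, under assumptions: the field names `SourceBranch`, `DiffType`, `Entry` and `EntryCtid` come from the snippets quoted in this review, while the field types, the `BranchID` field on `Entry` and the `newDiffRecord` helper are hypothetical. The idea is that an entry found while scanning a lineage may live on an ancestor branch, so the record takes the branch from the entry itself rather than from the `parentID` argument.

```go
package main

// Entry is a hypothetical, trimmed-down catalog entry. BranchID is assumed to
// hold the branch that owns the entry, which in a lineage scan may be an
// ancestor of the branch being diffed.
type Entry struct {
	BranchID int64
	Path     string
}

// diffResultRecord mirrors the field names quoted in the review; the types
// are assumptions made for this sketch.
type diffResultRecord struct {
	SourceBranch int64
	DiffType     int
	Entry        Entry
	EntryCtid    *string
}

// newDiffRecord (hypothetical helper) builds a record whose SourceBranch is
// taken from the parent entry, not from a parameter fixed for the whole diff.
func newDiffRecord(parentEntry Entry, diffType int, ctid *string) *diffResultRecord {
	return &diffResultRecord{
		SourceBranch: parentEntry.BranchID,
		DiffType:     diffType,
		Entry:        parentEntry,
		EntryCtid:    ctid,
	}
}
```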

}
ins := psql.Insert(tableName).Columns("source_branch", "diff_type", "path", "entry_ctid")
for _, rec := range batch {
ins = ins.Values(rec.SourceBranch, rec.DiffType, rec.Entry.Path, rec.EntryCtid)
Contributor

I don't understand. You write a CTID of another table into a table? Then everything up to the top-level exported functions has to carry a comment saying it is only useful inside a transaction (with an appropriately high isolation level), otherwise this is unsafe in the presence of a concurrent VACUUM FULL. At the very least, add warning lines to all system documentation never to VACUUM FULL the tables.
Similarly, this imposes immediate restrictions either on any implementation of metadata retention (which deletes entries) or of all code that uses the result -- the CTID might point at nothing.
Also, please document the restrictions on restoring from backups: I assume most backup methods will have to invalidate ctid on restore, so the result of writing these diff entries must be trashed during backup/restore.

The manual has this to say:

ctid

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.

Our reliance on CTIDs is becoming a danger to data integrity. This technical debt might be acceptable because currently there is no breakage that you or I can see. But at the very least it requires precise exact documentation.

Contributor Author

This is the same mechanism as today: the cataloger's Merge calls the diff implementation inside the merge transaction, and for entries that need to be updated or added as part of the diff, we use the ctid to select the full entry information.
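
The pattern discussed in this thread, writing ctids and reading them back within a single transaction, could be sketched roughly as follows. This is an illustration only, not the lakeFS implementation: `withTx` and `selectByCtidQuery` are hypothetical helpers, `catalog_entries` is an assumed table name, the `::tid` cast assumes the `entry_ctid` column may be stored as text, and repeatable read is just one reading of the reviewer's "appropriately high isolation level".

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// withTx (hypothetical helper) runs fn inside one transaction so that any
// ctid written and later read by fn never outlives that transaction.
// The isolation level chosen here is an assumption.
func withTx(ctx context.Context, db *sql.DB, fn func(tx *sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}

// selectByCtidQuery (hypothetical helper) builds a query that resolves the
// ctids stored in a diff-result table back to full entries. diffTable must be
// an internally generated name, never user input, because it is interpolated
// into the SQL text.
func selectByCtidQuery(diffTable string) string {
	return fmt.Sprintf(
		"SELECT e.* FROM %s d JOIN catalog_entries e ON e.ctid = d.entry_ctid::tid",
		diffTable)
}
```

Inside `fn`, a caller would fill the temporary diff table and then run `tx.QueryContext(ctx, selectByCtidQuery(name))` before committing, so the stored ctids are used only while the transaction that produced them is still open.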

@nopcoder
Contributor Author

> You may want someone more qualified to approve too, I mostly used this review to clear up my ignorance

Thanks for everything. I also have Tzahi going over the implementation to check that it matches the current SQL one.

@arielshaqed arielshaqed left a comment

Thanks! This sets the R number for CTID-20 exposure to <0.8, meaning we should be able to recover from this plague. :-)

- Entry Entry
- EntryCtid *string
+ Entry Entry // Partially filled. Path is always set.
+ EntryCtid *string // CTID of the modified/added entry. Do not use outside of catalog diff-by-iterators. https://github.com/treeverse/lakeFS/issues/831
Contributor

Cool!

Contributor

"Drop a marker and keep going" :-)

@nopcoder nopcoder merged commit 0f923de into master Oct 19, 2020
lakeFS automation moved this from Review in progress to Done Oct 19, 2020
@nopcoder nopcoder deleted the feature/diff-with-scanner branch February 4, 2021 12:41