Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repair with hashes #2925

Merged
merged 47 commits into from
Sep 6, 2019
Merged

Repair with hashes #2925

merged 47 commits into from
Sep 6, 2019

Conversation

mobyvb
Copy link
Member

@mobyvb mobyvb commented Aug 30, 2019

What:
Create a specialized ecclient for the repairer that does additional verification:

  • Gets uplink-signed piece hashes and signed original order limits from storagenodes during download
  • Verifies piece hash and order limit signature
  • Calculates piece hash of downloaded data from storagenode, and verifies that it matches the uplink-signed one
  • (optional?) Verifies that piece public key on original order limit matches the one used for the other pieces
  • Reconstructs segment with downloaded pieces, and repairs to new nodes in the same way we currently do (ecRepairer.Repair does the same thing as ecClient.Repair, which can be removed now)

Why:
Repair is expensive so we want to download from the smallest possible number of storagenodes to reconstruct the segment. However, if we use the minimum RS number of pieces, we have no way of automatically detecting errors in pieces, and the reconstructed segment could be incorrect without our knowledge. By using hashes to check the validity of pieces before we decode the segment, we can be sure of the authenticity of a piece, and use the minimum number of pieces to get the segment.

Unfortunately, this requires us to bring each piece entirely into memory since we need to download the entire piece before we can verify the hash.

https://storjlabs.atlassian.net/browse/V3-2396
https://storjlabs.atlassian.net/browse/V3-2395
https://storjlabs.atlassian.net/browse/V3-2394

Please describe the tests:

  • Test 1:
  • Test 2:

Please describe the performance impact:

Code Review Checklist (to be filled out by reviewer)

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?

@cla-bot cla-bot bot added the cla-signed label Aug 30, 2019
@mobyvb mobyvb changed the title Repair with hash Repair with hashes Aug 30, 2019
@mobyvb mobyvb marked this pull request as ready for review September 4, 2019 19:02
@mobyvb mobyvb requested a review from a team September 4, 2019 19:02
mniewrzal
mniewrzal previously approved these changes Sep 5, 2019
Copy link
Contributor

@mniewrzal mniewrzal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?

Copy link
Contributor

@navillasa navillasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

satellite/repair/repair_test.go Outdated Show resolved Hide resolved
satellite/repair/repair_test.go Show resolved Hide resolved
satellite/repair/repair_test.go Show resolved Hide resolved
satellite/repair/repairer/ec.go Show resolved Hide resolved
return successfulNodes, successfulHashes, nil
}

// copied from ecclient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we keeping these comments long-term? I'm not sure if they're necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they can be removed. I was hoping we could get rid of duplicate code here and in ecclient, but I don't know of a good way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe someone else has thoughts on this? maybe could make a ec_helpers.go file or something?

@VinozzZ
Copy link
Contributor

VinozzZ commented Sep 5, 2019

looks good, I would just spend a moment to think if we can somehow test that e.g. hash doesn't match

test added

Copy link
Contributor

@mniewrzal mniewrzal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks nice! I just would like to know something about last changes (see comments)

satellite/repair/repairer/segments.go Show resolved Hide resolved
@@ -163,7 +163,7 @@ func (checker *Checker) updateIrreparableSegmentStatus(ctx context.Context, poin

// we repair when the number of healthy pieces is less than or equal to the repair threshold
// except for the case when the repair and success thresholds are the same (a case usually seen during testing)
if numHealthy > redundancy.MinReq && numHealthy <= redundancy.RepairThreshold && numHealthy < redundancy.SuccessThreshold {
if numHealthy >= redundancy.MinReq && numHealthy <= redundancy.RepairThreshold && numHealthy < redundancy.SuccessThreshold {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need this change? and rest in this file?

Copy link
Member Author

@mobyvb mobyvb Sep 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before repair with hashes, a segment was considered irreparable if it had MinReq or fewer pieces. This is because we needed at least MinReq+1 pieces for the reed solomon error correction to work correctly. Now that we are verifying pieces with hashes before decoding into a segment, the minimum number exactly is sufficient to repair correctly.
The changes in this file ensure that we

  1. Add to repair queue if we have exactly the minimum healthy number of pieces
  2. Do not add to the irreparable db if we have exactly the minimum healthy number of pieces

mniewrzal
mniewrzal previously approved these changes Sep 6, 2019
@VinozzZ VinozzZ merged commit fb10815 into master Sep 6, 2019
@VinozzZ VinozzZ deleted the green/repair-with-hashes branch September 6, 2019 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants