Validate Repo Integrity #173

Merged 5 commits on Dec 10, 2019

Conversation

Member

@rlizzo rlizzo commented Dec 4, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

Validate the integrity of repo data and commit history.

Description

Describe your changes in detail:

  • cryptographic hash of all data (read from disk) and commit refs
  • validate integrity of commits, data, branches, and historical provenance.
  • working in manual tests, not automated yet.
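A minimal sketch of the data-hash check described above, not the code in this PR. The choice of `hashlib.blake2b`, the 20-byte digest size, and the helper names `digest` and `find_corrupted` are assumptions made for illustration only.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a hex digest for a blob of bytes (blake2b is an assumption)."""
    return hashlib.blake2b(data, digest_size=20).hexdigest()


def find_corrupted(records: dict) -> list:
    """Return the stored digests that no longer match their data.

    `records` is a hypothetical mapping of stored digest -> bytes read
    back from disk.
    """
    return [key for key, data in records.items() if digest(data) != key]


good = b"tensor-bytes"
store = {digest(good): good}
print(find_corrupted(store))           # nothing to report for intact data

store[digest(good)] = b"tensor-bytez"  # simulate on-disk corruption
print(find_corrupted(store))           # the mismatching digest is reported
```

The same recompute-and-compare idea would apply to commit refs: hash the stored ref content and compare it against the digest it is keyed by.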

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they are not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwork CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@rlizzo rlizzo added enhancement New feature or request WIP Don't merge; Work in Progress labels Dec 4, 2019
@rlizzo rlizzo added this to In progress in Release 0.5 Plan via automation Dec 4, 2019

codecov bot commented Dec 4, 2019

Codecov Report

Merging #173 into master will increase coverage by 0.18%.
The diff coverage is 97.45%.

@@            Coverage Diff             @@
##           master     #173      +/-   ##
==========================================
+ Coverage   95.22%   95.39%   +0.18%     
==========================================
  Files          66       69       +3     
  Lines       11881    12268     +387     
  Branches     1011     1067      +56     
==========================================
+ Hits        11313    11703     +390     
+ Misses        371      365       -6     
- Partials      197      200       +3
| Impacted Files | Coverage Δ |
|---|---|
| src/hangar/backends/selection.py | 90.48% <ø> (ø) ⬆️ |
| src/hangar/metadata.py | 95.28% <100%> (ø) ⬆️ |
| tests/test_repo_integrity_verification.py | 100% <100%> (ø) |
| src/hangar/constants.py | 100% <100%> (ø) ⬆️ |
| src/hangar/arrayset.py | 94.97% <100%> (-0.13%) ⬇️ |
| src/hangar/backends/hdf5_00.py | 91.44% <100%> (-0.23%) ⬇️ |
| src/hangar/remotes.py | 89.38% <100%> (+0.16%) ⬆️ |
| src/hangar/backends/remote_50.py | 92.5% <100%> (-0.98%) ⬇️ |
| src/hangar/remote/content.py | 93.2% <100%> (+0.07%) ⬆️ |
| src/hangar/records/hashmachine.py | 100% <100%> (ø) ⬆️ |

... and 33 more

Member Author

rlizzo commented Dec 5, 2019

@hhsecond can you take a look at this? The tests aren't complete, it needs documentation, and there are a few corner cases to take care of, but I'd like your opinion on whether I'm missing anything significant.

My TODO list right now is:

  • Handle cases where key / values are slightly malformed. We need to catch exceptions thrown by any parsing methods encountered on the way to the integrity module (I'll highlight concerning regions below)
  • Acquire the writer lock before running. These methods don't actually write anything, but it wouldn't be fun to deal with changing world state while the checks run.

Also, I'm not sure I want to be raising a RuntimeError at every point an issue is found. I like the current implementation because:

  1. We don't do any unnecessary work. As soon as an issue is encountered the world comes crashing down and everything stops.
  2. Delineating issues in the code via a prominent raise RuntimeError(msg) statement is rather nice in terms of readability. There's no guessing what the issue is or where it was found, since there's a detailed error message right there.

HOWEVER, raising errors in this fashion means that it's not really possible to generate a list of issues should there be multiple things wrong. In addition to stopping at the first error it encounters, this doesn't return a machine-readable description of the issue. In its current form, there isn't any reasonable path towards effective logging or automated attempts to recover data.
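One possible shape for the alternative, sketched under assumptions: each check yields structured issue records, and the caller decides whether to stop at the first one or collect them all. `IntegrityIssue`, `check_branch_heads`, and `verify` are hypothetical names for illustration, not anything in this PR.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class IntegrityIssue:
    """Machine-readable description of a single problem (hypothetical)."""
    check: str
    key: str
    message: str


def check_branch_heads(heads: dict, commits: set) -> Iterator[IntegrityIssue]:
    """Yield an issue for every branch whose head is not a known commit."""
    for branch, commit in heads.items():
        if commit not in commits:
            yield IntegrityIssue(
                "branch_heads", branch, f"head {commit!r} is not a known commit")


def verify(checks: Iterable[Callable[[], Iterable[IntegrityIssue]]],
           fail_fast: bool = True) -> List[IntegrityIssue]:
    """Run checks; raise on the first issue, or collect all of them."""
    issues = []
    for check in checks:
        for issue in check():
            if fail_fast:
                # Keeps the current behaviour: stop all work immediately.
                raise RuntimeError(issue.message)
            issues.append(issue)
    return issues


heads = {"master": "a1", "dev": "zz", "tmp": "yy"}
commits = {"a1", "b2"}
issues = verify([lambda: check_branch_heads(heads, commits)], fail_fast=False)
print([asdict(issue) for issue in issues])
```

With `fail_fast=True` this behaves like the current raise-on-first-error approach, while `fail_fast=False` returns records that could be logged or fed to a recovery tool.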

These seem like issues for the future, but I'm not sure if you have other thoughts?

@rlizzo rlizzo added this to the v0.5.0 milestone Dec 5, 2019
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Dec 5, 2019
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Dec 5, 2019
@rlizzo rlizzo added Awaiting Review Author has determined PR changes area nearly complete and ready for formal review. and removed WIP Don't merge; Work in Progress labels Dec 5, 2019
@rlizzo rlizzo self-assigned this Dec 5, 2019
Member Author

rlizzo commented Dec 5, 2019

TODOs complete and tests finished. When you have a chance, would love to hear your thoughts @hhsecond. No rush.

@tensorwerk tensorwerk deleted a comment from lgtm-com bot Dec 5, 2019

lgtm-com bot commented Dec 7, 2019

This pull request introduces 4 alerts when merging 68fbe4b into d267c0a - view on LGTM.com

new alerts:

  • 4 for Module-level cyclic import

Member

@hhsecond hhsecond left a comment


I ran through the code and found some very minor tweaks we could make. I'll try to finish before EOD.

src/hangar/dataloaders/tfloader.py
src/hangar/diagnostics/integrity.py (outdated)
src/hangar/repository.py (outdated)
src/hangar/records/hashs.py
src/hangar/records/hashs.py
- recs = self._traverse_all_hash_records()
- out = list(map(parsing.hash_data_raw_key_from_db_key, recs))
+ recs = self._traverse_all_hash_records(keys=True, values=False)
+ out = list(map(hash_data_raw_key_from_db_key, recs))
Member


List comprehension? Here and in a few other places where we have a map -> list conversion.

Member Author


A list comprehension is ~10% slower than map when the function is predefined. I've changed it where appropriate, but won't be making the change universally.

Member

@hhsecond hhsecond Dec 10, 2019


I think
list(map(somefun, recs)) > [somefun(val) for val in recs] > map(somefun, recs)
I mean, it's probably negligible but just saying if you want to do a comparison
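For anyone who wants to run that comparison, a small `timeit` sketch follows. `somefun` and `recs` are placeholders rather than the real parsing function and hash records, and the numbers will vary by machine and Python version, so no result is claimed here.

```python
import timeit


def somefun(val):
    """Placeholder for a predefined function such as a key parser."""
    return val + 1


recs = list(range(10_000))

# Both forms build the same list; only the construction strategy differs.
t_map = timeit.timeit(lambda: list(map(somefun, recs)), number=50)
t_comp = timeit.timeit(lambda: [somefun(val) for val in recs], number=50)

print(f"list(map(...)): {t_map:.4f}s")
print(f"comprehension:  {t_comp:.4f}s")
```

Note that a bare `map(somefun, recs)` is lazy in Python 3 and does no work until it is iterated, so timing it on its own is not a like-for-like comparison with the two list-building forms.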

Release 0.5 Plan automation moved this from In progress to Review in progress Dec 9, 2019

lgtm-com bot commented Dec 10, 2019

This pull request fixes 3 alerts when merging e38844f into d267c0a - view on LGTM.com

fixed alerts:

  • 3 for Unused local variable


lgtm-com bot commented Dec 10, 2019

This pull request fixes 3 alerts when merging 98123e9 into d267c0a - view on LGTM.com

fixed alerts:

  • 3 for Unused local variable

@rlizzo rlizzo merged commit 519d63b into tensorwerk:master Dec 10, 2019
Release 0.5 Plan automation moved this from Review in progress to Done Dec 10, 2019
Labels
Awaiting Review Author has determined PR changes area nearly complete and ready for formal review. enhancement New feature or request

2 participants