
BugFix: Removing All Samples in Arrayset Should Not Remove Aset Schema #159

Merged
merged 5 commits into tensorwerk:master from aset-cm-removal-limitation on Nov 11, 2019

Conversation

@rlizzo (Member) commented Nov 7, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

In previous versions of hangar, when the last sample in an arrayset was deleted, the arrayset schema would implicitly also be deleted (with no user notice).

```python
>>> len(co.arraysets)
1
>>> len(co.arraysets['foo'])
0
>>> co.arraysets['foo'][1] = np.arange(10)
>>> len(co.arraysets['foo'])
1
>>> del co.arraysets['foo'][1]
>>> len(co.arraysets['foo'])
Traceback (most recent call last):
  ...
ReferenceError: 'foo' no longer exists
>>> len(co.arraysets)
0
```

In addition to being a detriment to UX, the level at which this operation occurred (`ArraysetDataWriter`) meant that the change was never propagated up to the `Arraysets` class, which holds the only strong reference keeping the object alive. Since that strong reference was never deleted, the user could still hold a live weakref proxy to an object which should have been finalized.

In the worst case scenario (a context manager for any Arrayset open at the time the last sample was removed), backend file handles might not be closed and invalidated properly, which would force an exception on any set or get operation. Should this happen, a set on the `ArraysetDataWriter` could actually succeed: data would be saved to disk and a (valid) record reference written to the staging db. Though the sample was recorded, the record of the schema spec would have been removed from the staging area / commit refs as soon as the Arrayset implicitly deleted itself. Upon a later checkout of this (or a child) commit/branch, the schema spec corresponding to the added sample refs would not exist, and as such no Arrayset would be generated for the sample (even though a valid reference was present, and a valid schema spec may exist in the commit's ancestry).

```python
>>> co.arraysets['foo'][1] = np.arange(10)
>>> len(co.arraysets['foo'])
1
>>> aset = co.arraysets['foo']
>>> with co:
...     del co.arraysets['foo'][1]
>>> len(co.arraysets)
1
>>> with aset:
...     aset[2] = np.zeros(10)
>>> len(aset)
1
>>> co.commit('should not work')
foohash
>>> co.close()
>>> co = repo.checkout(write=True)
>>> len(co.arraysets)
0
>>> co.arraysets['foo']
Traceback (most recent call last):
  ...
KeyError: 'foo' does not exist
```

If it fixes an open issue, please link to the issue here:

Description

Describe your changes in detail:

  • Removing all samples from an arrayset no longer deletes the arrayset schema spec.
  • Initializing or removing an arrayset cannot be performed while any arrayset is open in a context manager.
  • Calculation of the schema hash digest (and all hash digest functions) was made more deterministic and moved into an isolated module.
  • Tests were added to ensure correct behavior.
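The deterministic schema hash mentioned above can be sketched roughly as follows. This is an illustrative stand-in, not Hangar's actual `hashmachine` API: the function name, field names, and digest size are assumptions. The key idea is serializing the schema fields in a canonical order before hashing, so the same spec always yields the same digest:

```python
import hashlib
import json

def schema_hash_digest(shape, dtype_num, variable_shape, backend):
    # Illustrative sketch (not Hangar's real implementation): serialize the
    # schema fields with fixed key ordering and separators so the digest does
    # not depend on dict insertion order or interpreter state.
    canonical = json.dumps(
        {
            'shape': list(shape),
            'dtype_num': int(dtype_num),
            'variable_shape': bool(variable_shape),
            'backend': str(backend),
        },
        sort_keys=True,
        separators=(',', ':'),
    )
    # blake2b with a small digest keeps record keys compact.
    return hashlib.blake2b(canonical.encode(), digest_size=6).hexdigest()
```

Because the serialization is canonical, two processes computing the digest for the same schema spec will always agree, which is what lets a re-added arrayset map back onto an existing schema record.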

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but were not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwork CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@rlizzo added the "Bug: Priority 1 (ANY chance of data/record corruption or loss)" and "Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review)" labels Nov 7, 2019
@rlizzo rlizzo requested a review from hhsecond November 7, 2019 22:09
@rlizzo rlizzo self-assigned this Nov 7, 2019
@codecov (codecov bot) commented Nov 7, 2019

Codecov Report

Merging #159 into master will increase coverage by 0.07%.
The diff coverage is 98.54%.

```
@@            Coverage Diff             @@
##           master     #159      +/-   ##
==========================================
+ Coverage   95.22%   95.29%   +0.07%
==========================================
  Files          63       64       +1
  Lines       11393    11548     +155
  Branches      974      977       +3
==========================================
+ Hits        10848    11004     +156
  Misses        362      362
+ Partials      183      182       -1
```
| Impacted Files | Coverage Δ |
|---|---|
| src/hangar/records/summarize.py | 93.58% <ø> (ø) ⬆️ |
| src/hangar/repository.py | 97.6% <ø> (ø) ⬆️ |
| src/hangar/metadata.py | 95.28% <100%> (+2.36%) ⬆️ |
| src/hangar/remote/server.py | 77.38% <100%> (-0.13%) ⬇️ |
| tests/test_checkout_arrayset_access.py | 99.24% <100%> (+0.11%) ⬆️ |
| src/hangar/remote/client.py | 80.52% <100%> (-0.17%) ⬇️ |
| tests/test_arrayset.py | 100% <100%> (ø) ⬆️ |
| src/hangar/records/hashmachine.py | 100% <100%> (ø) |
| src/hangar/arrayset.py | 95.1% <90.63%> (-0.45%) ⬇️ |
| ... and 1 more | |

@hhsecond (Member) left a comment:

LGTM

```diff
@@ -1093,7 +1084,8 @@ def items(self) -> Iterable[Tuple[str, Union[ArraysetDataReader, ArraysetDataWri
             Iterable[Tuple[str, Union[:class:`.ArraysetDataReader`, :class:`.ArraysetDataWriter`]]]
             returns two tuple of all all arrayset names/object pairs in the checkout.
         """
-        for asetN, asetObj in self._arraysets.items():
+        for asetN in list(self._arraysets.keys()):
```
Review comment (Member):
just curious: why an explicit list() here?

@rlizzo (Member, Author) replied Nov 11, 2019:

Yeah.. I hate it too... It's because of the following situation

```python
>>> for name, aset in co.arraysets.items():
...     del co.arraysets[name]
Traceback (most recent call last):
  ...
RuntimeError: dictionary changed size during iteration
```

While I normally subscribe to the belief that "if you mutate a data structure while iterating over it, you are living in a state of sin, and deserve whatever happens to you", it's hangar's responsibility to manage usage of this thing which kind of behaves like a dict and a class simultaneously. It seemed unfair to put the implications of an implementation detail on the user when it can be fixed so trivially...
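The fix can be demonstrated with a plain dict standing in for the Arraysets mapping: snapshotting the keys with `list()` decouples iteration from mutation, so deleting entries mid-loop is safe.

```python
d = {'foo': 1, 'bar': 2}

# Iterating the dict directly while deleting raises:
#   for k in d: del d[k]
#   RuntimeError: dictionary changed size during iteration

# list() materializes the keys up front, so the loop iterates over a
# snapshot while the dict itself shrinks underneath it.
for k in list(d.keys()):
    del d[k]

print(d)  # -> {}
```

The cost is one small list allocation per call, which is negligible next to the record bookkeeping each deletion performs anyway.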

```diff
@@ -1278,6 +1282,8 @@ def init_arrayset(self,

         Raises
         ------
+        PermissionError
+            If any enclosed arrayset is opned in a connection manager.
```
Review comment (Member):
opened

@rlizzo added the "Resolved" and "Bug: Priority 1 (ANY chance of data/record corruption or loss)" labels and removed the "Bug: Priority 1 (ANY chance of data/record corruption or loss)" and "Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review)" labels Nov 11, 2019
@rlizzo rlizzo merged commit d105e56 into tensorwerk:master Nov 11, 2019
@rlizzo rlizzo deleted the aset-cm-removal-limitation branch November 21, 2019 10:52