Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrayset Subsamples #179

Merged
merged 23 commits into from
Feb 4, 2020
Merged

Arrayset Subsamples #179

merged 23 commits into from
Feb 4, 2020

Conversation

rlizzo
Copy link
Member

@rlizzo rlizzo commented Jan 4, 2020

Motivation and Context

Why is this change required? What problem does it solve?:

This is a large PR which started with the motivation of allowing arraysets to contain subsamples under a common key. Though minimal work was needed for the technical implementation (with esentially no changes made to the hangar core record parsing, history traversal, or tensor storage backends), the integration of the API into the current model proved difficult, which required some major refactoring of what was previously known as ArraysetDataReader and ArraysetDataWriter classes.

Description

Describe your changes in detail:

Rather than try to combine every possible API method needed by flat and nested arrayset access into a frankenstein monster class, each access convention implements it's own API class methods (fully independent from one another). The appropriate constructors are selected based on the constains_subsamples argument in init_arrayset(). The argument is recorded in the schema so the correct type can be identified in subsequent checkouts.

I'm working on putting together a summary of the API. That will follow shortly.

At the moment, about half the tests for the new nested sample container are missing, and I need to re-evaluate some implementation details for how backend file handles are dealt with.

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they are not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwork CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@rlizzo rlizzo added enhancement New feature or request WIP Don't merge; Work in Progress Awaiting Review Author has determined PR changes area nearly complete and ready for formal review. labels Jan 4, 2020
@rlizzo rlizzo requested a review from hhsecond January 4, 2020 01:24
@codecov
Copy link

codecov bot commented Jan 4, 2020

Codecov Report

Merging #179 into master will decrease coverage by 0.15%.
The diff coverage is 96.43%.

@@            Coverage Diff            @@
##           master    #179      +/-   ##
=========================================
- Coverage   95.55%   95.4%   -0.15%     
=========================================
  Files          71      81      +10     
  Lines       12542   14252    +1710     
  Branches     1105    1293     +188     
=========================================
+ Hits        11984   13596    +1612     
- Misses        362     444      +82     
- Partials      196     212      +16
Impacted Files Coverage Δ
src/hangar/remote/content.py 93.2% <ø> (ø) ⬆️
tests/test_initiate.py 100% <100%> (ø) ⬆️
tests/test_branching.py 100% <100%> (ø) ⬆️
tests/test_commit_ref_verification.py 100% <100%> (ø) ⬆️
src/hangar/columns/utils.py 100% <100%> (ø)
src/hangar/txnctx.py 89.8% <100%> (+0.21%) ⬆️
tests/test_dataloaders.py 97.27% <100%> (-2.73%) ⬇️
tests/test_diff.py 99.69% <100%> (ø) ⬆️
tests/test_utils.py 100% <100%> (ø) ⬆️
src/hangar/records/heads.py 93.51% <100%> (+0.06%) ⬆️
... and 70 more

@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 5, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 5, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 5, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 5, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 5, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 5, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 5, 2020
@lgtm-com
Copy link

lgtm-com bot commented Jan 8, 2020

This pull request introduces 7 alerts when merging 9a1674f into fe746ab - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 1 for Variable defined multiple times
  • 1 for Wrong number of arguments in a call

@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 15, 2020
@codecov
Copy link

codecov bot commented Jan 16, 2020

Codecov Report

Merging #179 into master will increase coverage by 0.55%.
The diff coverage is 97.11%.

@@            Coverage Diff            @@
##           master    #179      +/-   ##
=========================================
+ Coverage   95.55%   96.1%   +0.55%     
=========================================
  Files          71      86      +15     
  Lines       12542   15252    +2710     
  Branches     1105    1403     +298     
=========================================
+ Hits        11984   14657    +2673     
- Misses        362     386      +24     
- Partials      196     209      +13
Impacted Files Coverage Δ
src/hangar/remote/content.py 93.2% <ø> (ø) ⬆️
tests/test_utils.py 100% <ø> (ø) ⬆️
tests/test_remotes.py 97.79% <ø> (-0.74%) ⬇️
src/hangar/records/hashmachine.py 100% <ø> (ø) ⬆️
tests/property_based/conftest.py 100% <ø> (ø) ⬆️
tests/test_nested_arrayset.py 99.43% <ø> (ø)
tests/test_commit_ref_verification.py 100% <100%> (ø) ⬆️
tests/property_based/test_pbt_arrayset.py 100% <100%> (ø) ⬆️
src/hangar/txnctx.py 89.8% <100%> (+0.21%) ⬆️
tests/test_initiate.py 100% <100%> (ø) ⬆️
... and 90 more

@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 16, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 16, 2020
@rlizzo rlizzo force-pushed the append-to-arraysets branch 2 times, most recently from bdf6b2e to d4225e4 Compare January 17, 2020 09:13
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 17, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 17, 2020
@tensorwerk tensorwerk deleted a comment from lgtm-com bot Jan 17, 2020
@rlizzo rlizzo force-pushed the append-to-arraysets branch 2 times, most recently from 677271d to cc9dbb7 Compare January 24, 2020 02:43
setup.py Outdated
from os.path import dirname
from os.path import join
from os.path import splitext
from os.path import basename, join, splitext
Copy link
Member

@hhsecond hhsecond Jan 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should use Path everywhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distutils does not accept Path-like objects as the source for compiled c extensions (needed for cython)

@@ -64,15 +60,15 @@ class ReaderCheckout(object):
"""

def __init__(self,
base_path: os.PathLike, labelenv: lmdb.Environment,
base_path: Path, labelenv: lmdb.Environment,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it could be both Path and str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this is internal to hangar. We expect a Path object.

click.echo(f'Initialized Arrayset: {aset.name}')
variable_shape=variable_,
contains_subsamples=subsamples_)
click.echo(f'Initialized Arrayset: {aset.arrayset}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO name was more intuitive than arrayset

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but when we update the internal naming from arrayset ->columns, this is going to have to be more descriptive than name. I'm actually going to break this NamedTuple class into column-type specific NamedTuple classes. It's going to be messy to refactor, this will just help in the future (it's a temporary name).

…lass no longer work, but only because they haven't been updated yet. There is a lot of room for code deduplication here
…_flat.py and aset_nested.py to arrayset_flat.py and arrayset_nested.py in order to have related file names align with arrayset.py file in editors when listing alphabetically
Added slots and removed some unused code. All tests pass.
…s __slots__ classes, added getstate and setstate to arraysets as well
@rlizzo rlizzo merged commit 698a792 into tensorwerk:master Feb 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Awaiting Review Author has determined PR changes area nearly complete and ready for formal review. enhancement New feature or request WIP Don't merge; Work in Progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants