
API Redesign #115

Merged
rlizzo merged 6 commits into tensorwerk:master from api-redesign on Sep 6, 2019

Conversation

Member

@rlizzo rlizzo commented Aug 20, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

To simplify the user-facing interface for arraysets and to provide some concept of a dataset as a view across arraysets.

NOTE: This initial PR is a proof of concept only, and will require extensive discussion before the final design is agreed upon

If it fixes an open issue, please link to the issue here:

Related to #79 and many conversations on the Hangar Users Slack Channel.

Description

Describe your changes in detail:

Added a CheckoutIndexer class which is inherited by ReaderCheckout and WriterCheckout to enable the following API (originally proposed by @lantiga and @elistevens):

dset = repo.checkout(write=True)
# get an arrayset of the dataset (i.e. a "column" of the dataset?)
aset = dset['foo']

# get a specific array from 'foo'
arr = aset['1']
# set it too
aset['1'] = arr

# get data from dset (returns a named tuple)
subarr = dset['foo', '1']
# and set into it
dset['foo', '1'] = subarr + 1

# get a sample of a dataset across 'foo' and 'bar' (returns a named tuple)
sample = dset[('foo', 'bar'), '1']

# get a sample of all arraysets in the checkout (returns a named tuple)
sample = dset[:, '1']
sample = dset[..., '1']

# get multiple samples
sample_ids = ['1', '2', '3']
batch = dset[('foo', 'bar'), sample_ids]
batch = dset[:, sample_ids]
batch = dset[..., sample_ids]
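
As a sketch of how such a mixin might dispatch on the key shape (illustrative only; the CheckoutIndexer name matches this PR, but the body and the _arraysets mapping are assumptions, not the actual implementation):

from collections import namedtuple

class CheckoutIndexer:
    """Illustrative sketch only -- not the actual Hangar implementation."""

    def __getitem__(self, index):
        if isinstance(index, str):
            # dset['foo'] -> the arrayset ("column") accessor itself
            return self._arraysets[index]
        aset_keys, sample_keys = index
        if aset_keys is Ellipsis or aset_keys == slice(None):
            # dset[:, ...] and dset[..., ...] select all arraysets
            aset_keys = tuple(self._arraysets.keys())
        elif isinstance(aset_keys, str):
            aset_keys = (aset_keys,)
        if isinstance(sample_keys, (list, tuple)):
            # multiple sample keys -> a batch (list of named tuples)
            return [self._sample_tuple(aset_keys, s) for s in sample_keys]
        return self._sample_tuple(aset_keys, sample_keys)

    def _sample_tuple(self, aset_keys, sample_key):
        res_t = namedtuple('Sample', aset_keys)
        return res_t(*(self._arraysets[a][sample_key] for a in aset_keys))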

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they are not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@rlizzo rlizzo added the needs decision (Discussion is ongoing to determine what to do.) and WIP (Don't merge; Work in Progress) labels Aug 20, 2019
@rlizzo rlizzo self-assigned this Aug 20, 2019
@codecov

codecov bot commented Aug 20, 2019

Codecov Report

Merging #115 into master will increase coverage by 0.22%.
The diff coverage is 95.91%.

@@            Coverage Diff             @@
##           master     #115      +/-   ##
==========================================
+ Coverage   92.25%   92.47%   +0.22%     
==========================================
  Files          50       51       +1     
  Lines        8622     9448     +826     
  Branches      843      930      +87     
==========================================
+ Hits         7954     8737     +783     
- Misses        492      517      +25     
- Partials      176      194      +18
Impacted Files                   Coverage Δ
src/hangar/repository.py         89.6%  <ø>      (ø) ⬆️
tests/test_dataloaders.py        98.45% <100%>   (ø) ⬆️
tests/test_arrayset.py           100%   <100%>   (ø) ⬆️
tests/test_checkout.py           99.81% <100%>   (ø) ⬆️
tests/test_remotes.py            98.42% <100%>   (ø) ⬆️
src/hangar/records/queries.py    91.88% <100%>   (ø) ⬆️
tests/conftest.py                97.22% <100%>   (ø) ⬆️
src/hangar/records/hashs.py      96.92% <100%>   (ø) ⬆️
tests/test_diff.py               99.69% <80%>    (ø) ⬆️
src/hangar/arrayset.py           89.22% <84.31%> (+0.48%) ⬆️
... and 4 more

@hhsecond
Member

hhsecond commented Aug 21, 2019

@rlizzo @lantiga I found it more intuitive to call

dset = repo.checkout()
single_sample = dset['foo'][1]
samples_nt = dset['foo', 'bar'][1]

It might add another layer over arraysets (something like GroupedAset from dataloaders) to handle the subindex. What do you think?
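
For illustration, a minimal sketch of what such a wrapper layer might look like (GroupedAsetView and its internals are hypothetical names invented here; GroupedAset in the dataloaders module is only the inspiration):

from collections import namedtuple

class GroupedAsetView:
    """Hypothetical wrapper grouping several arraysets for sub-indexing."""

    def __init__(self, asets):
        # asets: mapping of arrayset name -> arrayset accessor
        self._asets = dict(asets)
        self._res_t = namedtuple('Samples', sorted(self._asets))

    def __getitem__(self, sample_key):
        # pull the same sample key out of every grouped arrayset
        return self._res_t(**{name: aset[sample_key]
                              for name, aset in self._asets.items()})

# usage (hypothetical):
# samples_nt = GroupedAsetView({'foo': dset['foo'], 'bar': dset['bar']})['1']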

@rlizzo
Member Author

rlizzo commented Aug 21, 2019

@rlizzo @lantiga I found it more intuitive to call

sample = dset[('foo', 'bar')][1]

It might add another layer over arraysets (something like GroupedAset from dataloaders) to handle the subindex. What do you think?

I actually think that's slightly confusing, since to specify multiple arraysets or samples you have to write parentheses (foo, bar, baz) inside the bracket-style notation [(foo, bar, baz)]. However, if we removed the parentheses, specifying samples would look like:

# case 1
samples = dset['foo', 'bar']['1', '2']

# case 2
samples = dset[:]['1']

# case 3
samples = dset[...]['1', '2']

# case 4
samples = dset['foo']['1']

While case 2 and case 4 look OK, 1 and 3 look a bit odd to me (thinking as a numpy user). Having this nested sort of access structure doesn't really give off the same impression that the returned results are a view of sample names over arrayset names (to me at least).

With samples = dset[['foo', 'bar'], '1'], I know what I'm going to get, and it's less to type. @elistevens, I'd love to hear your input on this one as well.

Another Idea

We actually may want to take some minor inspiration from the xarray project, as it is the de-facto standard for working with labelled ndim arrays, or as they say:

xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

While we wouldn't want a one-to-one mapping of API and access conventions, their approach to positional indexing is essentially the same as the proposed samples = dset[['foo', 'bar'], '1'] method (substituting named indexes for positional int indexes). Vectorized indexing is the view of multiple samples over multiple arraysets, samples = dset[['foo', 'bar'], ['1', '2']], and multi-level indexing could be rather adaptable for us as well. Maybe:

# multi-level indexing
samples = dset[['foo', ['1', '2']], ['bar', ['3', '4']]]

We could also allow "dict style" indexing: samples = dset[{'foo': '1', 'bar': '1'}], though having too many options isn't always desirable.

I would specifically avoid their "Pandas style" named indexing (using .isel(...) and .sel(...)), but only because it's a lot of complexity to add to what is meant to be a "convenience" method.
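
For reference, the xarray conventions being compared here look roughly like the following (standard xarray API; the toy data is made up for illustration):

import xarray as xr

da = xr.DataArray(
    [[1, 2], [3, 4]],
    dims=("aset", "sample"),
    coords={"aset": ["foo", "bar"], "sample": ["1", "2"]},
)

# label-based indexing, analogous to dset[['foo', 'bar'], '1']
da.sel(aset="foo", sample="1")

# positional (integer) indexing
da.isel(aset=0, sample=0)

# vectorized indexing: pointwise selection along a new dimension
da.sel(aset=xr.DataArray(["foo", "bar"], dims="points"),
       sample=xr.DataArray(["1", "2"], dims="points"))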

One more Issue

xarray has also decided what happens with missing coordinate labels, which is an issue I was about to raise. We need to decide what happens in the following case:

# checkout has arraysets 'foo', 'bar', 'baz'.
# 'foo' has samples '1', '2'
# 'bar' has samples '1', '2'
# 'baz' has samples '1'    <-- missing '2'

# Case 1 (expected output)
>>> res = dset[:, '1']
>>> print(res)
(foo=np.array([1]), bar=np.array([1, 1]), baz=np.array([1, 1, 1]))

# Case 2A
>>> res = dset[:, '2']
KeyError: '2' not in arrayset 'baz'

# Case 2B
>>> res = dset[:, '2']
>>> print(res)
(foo=np.array([2]), bar=np.array([2, 2]), baz=None)

# Case 2C
>>> res = dset[:, '2']
>>> print(res)
(foo=np.array([2]), bar=np.array([2, 2]))

I was leaning toward suggesting Case 2C before I read the xarray spec, but it turns out that is the exact behavior they implement as well. Is there any reason we shouldn't adopt that pattern (regardless of the other conventions we decide on)?
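
A minimal sketch of Case 2C semantics, assuming a checkout exposing a hypothetical _arraysets mapping (name invented here) whose accessors support key membership tests:

from collections import namedtuple

def get_sample(checkout, sample_key):
    """Case 2C sketch: build the result from only the arraysets that
    actually contain sample_key; missing fields are simply dropped."""
    found = {name: aset[sample_key]
             for name, aset in checkout._arraysets.items()
             if sample_key in aset}
    res_t = namedtuple('Sample', sorted(found))
    return res_t(**found)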

@elistevens

elistevens commented Aug 21, 2019

I think that the talk of how to layer syntactic sugar over the underlying operations is happening way too early in the design process. Trying to decide how many square brackets you need to get the data out shouldn't happen until the release after you've got verbose methods implemented and released.

Also, keep in mind that 99% of your users are never going to use this API. Data will be written, committed, and then they'll call the API to convert it to a Dataset they can feed into their deep learning framework of choice. I'm not saying that because I think that it's okay to design it poorly, just that it's not a great use of time right now.

There's an unfortunate trend in a lot of the more research/data science tooling to paper over data problems, which I think runs entirely counter to the hangar "data integrity is paramount" ethos. If I ask for something, but some of the keys are missing, I should get an exception. (FWIW, I feel the same way about missing dimensions, but that's a different PR)

I do think that it's okay to have a keyword arg that enables silently discarding rows that don't meet the column constraints, but I should have to enable it explicitly (and I think that needing it points to deeper problems in the data pipeline).
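
Sketching that suggestion (the allow_missing flag and _arraysets mapping are names invented here, not proposed API): strict by default, with an explicit opt-in to drop incomplete samples.

def get_batch(checkout, sample_keys, allow_missing=False):
    """Raise on any missing key unless the caller explicitly opts in
    to silently dropping samples that don't span every arrayset."""
    batch = []
    for key in sample_keys:
        missing = [name for name, aset in checkout._arraysets.items()
                   if key not in aset]
        if missing and not allow_missing:
            raise KeyError(f"sample {key!r} missing from arraysets {missing}")
        if missing:
            continue  # explicitly permitted: drop the incomplete sample
        batch.append({name: aset[key]
                      for name, aset in checkout._arraysets.items()})
    return batch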

@lantiga
Member

lantiga commented Aug 21, 2019

Hey, great job @rlizzo! A few comments:

I agree with @rlizzo's comment on double brackets; I also started with @hhsecond's approach but switched to multi-indexing later on, as it was simpler to grasp. By that I mean that users coming to Hangar post-0.3 will primarily be exposed to this API for accessing / writing datasets, as it doesn't require any more terminology or object hierarchy than just what you check out, and it suffices for the majority of use cases.
I imagine all they will need to learn is that you check out something and just index into it with named indexes.

@rlizzo it's great that we can be consistent with xarray. I personally think 2C is the right call (I was originally thinking of returning None for that key); it makes the behavior of a dataset as an indexed subset of a checkout simpler (you can iterate over keys with an Ellipsis). We can add another method (like get) or an option in the square brackets à la R (yuck!) to make it throw.

@elistevens it's a good warning re: being too soon, but I don't agree here. One thing we know is that we need to simplify Hangar's API to grow our user base. I do think this is the API that 99% of users should use, because it won't require them to go deep into Hangar's design details to make it useful. In my head, the current API will be for advanced users.

@rlizzo
Member Author

rlizzo commented Sep 4, 2019

@hhsecond, can you take a look at this and let me know what you think about testing for these new access methods? I'm not sure if we should repeat many of the low-level tests or just write new ones for the high-level API.

@rlizzo rlizzo removed the needs decision (Discussion is ongoing to determine what to do.) label Sep 4, 2019
Member

@hhsecond hhsecond left a comment

Looks good so far. Should I take a thorough look once you finish making changes?

arrays stored in each arrayset via a common sample key. Each sample
key's values are returned as an individual element in the
List. The samples are returned in the same order they were received.

Member

It could also return a NamedTuple (instead of a List) in this case: dset[('foo', 'bar', 'baz'), '1']?

Member Author

It might be nice to have that in some cases, but since there isn't any way to set some modifiable flag during dict-style access, it would have to be permanent. Two levels of indexing to pull elements which could be stored sequentially in a list would drive me nuts...

Member

No, what I am saying is: it already returns a NamedTuple in the case dset[('foo', 'bar', 'baz'), '1'], which is not documented in the Returns field. Or am I missing something here?

src/hangar/checkout.py (outdated, resolved)
@rlizzo
Member Author

rlizzo commented Sep 5, 2019

Thanks for starting. I'm gonna say go ahead and give it a look now! I'll come back to the rest of the tests once you give it a once-over.

@hhsecond
Member

hhsecond commented Sep 5, 2019

@rlizzo We need something to replace the co.arraysets.keys() we had in the older API to get the names of all arraysets.

@hhsecond
Member

hhsecond commented Sep 5, 2019

Do we have a plan to enable users to loop over the dataset, where dataset = repo.checkout()?

@rlizzo
Member Author

rlizzo commented Sep 5, 2019

@rlizzo We need something to replace the co.arraysets.keys() we had in the older API to get the names of all arraysets.

Why? The standard API isn't going anywhere, and it's really important for working with anything other than a pre-built repo/dataset. Is there some reason we shouldn't just keep co.arraysets.keys()?
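
For reference, the existing access pattern in question (a hypothetical high-level spelling is sketched alongside; nothing here is decided):

co = repo.checkout()

# standard API: list all arrayset names
names = list(co.arraysets.keys())

# one possible high-level equivalent (hypothetical, not implemented):
# names = list(co.keys())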

@rlizzo
Member Author

rlizzo commented Sep 5, 2019

Do we have a plan to enable users to loop over the dataset, where dataset = repo.checkout()?

It's not on my radar right now, and I'm not sure what it would even return: arrayset keys? metadata? values?

I'm actually very wary of treating the Checkout object as a dataset in much more of a capacity than it is now. It has some super important methods (commit, diff, merge, reset_staging_area, etc.), on top of providing access to the standard Arrayset and Metadata class methods.

From my point of view, these changes are merely a convenience for a common use case users will face; they aren't changing the fundamental nature of what the Checkout object actually does. I just wouldn't want to send mixed signals about what each layer is responsible for.

Does that make sense? Maybe I'm seeing this wrong though. What's your point of view?

@rlizzo rlizzo added the Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review.) label and removed the WIP (Don't merge; Work in Progress) label Sep 5, 2019
@rlizzo
Member Author

rlizzo commented Sep 5, 2019

Documentation updated and all tests pass with good coverage. Ready to be merged!

@hhsecond
Member

hhsecond commented Sep 6, 2019

@rlizzo We need something to replace the co.arraysets.keys() we had in the older API to get the names of all arraysets.

Why? The standard API isn't going anywhere, and it's really important for working with anything other than a pre-built repo/dataset. Is there some reason we shouldn't just keep co.arraysets.keys()?

It is just more intuitive to get all the relevant information from the high-level API itself, IMO

@hhsecond
Member

hhsecond commented Sep 6, 2019

Do we have a plan to enable users to loop over the dataset, where dataset = repo.checkout()?

It's not on my radar right now, and I'm not sure what it would even return: arrayset keys? metadata? values?

I'm actually very wary of treating the Checkout object as a dataset in much more of a capacity than it is now. It has some super important methods (commit, diff, merge, reset_staging_area, etc.), on top of providing access to the standard Arrayset and Metadata class methods.

From my point of view, these changes are merely a convenience for a common use case users will face; they aren't changing the fundamental nature of what the Checkout object actually does. I just wouldn't want to send mixed signals about what each layer is responsible for.

Does that make sense? Maybe I'm seeing this wrong though. What's your point of view?

Yep, makes sense. Thanks!

src/hangar/checkout.py (outdated, resolved)
Member

@hhsecond hhsecond left a comment

LGTM

@rlizzo rlizzo removed the Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review.) label Sep 6, 2019
@rlizzo rlizzo merged commit ecc354a into tensorwerk:master Sep 6, 2019
@rlizzo rlizzo deleted the api-redesign branch October 8, 2019 06:18