
API Redesign #115

Merged
rlizzo merged 6 commits into tensorwerk:master from api-redesign on Sep 6, 2019

Conversation

Member

@rlizzo rlizzo commented Aug 20, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

To simplify the user-facing interface for arraysets and to provide some concept of a dataset as a view across arraysets.

NOTE: This initial PR is a proof of concept only, and will require extensive discussion before the final design is agreed upon

If it fixes an open issue, please link to the issue here:

Related to #79 and many conversations on the Hangar Users Slack Channel.

Description

Describe your changes in detail:

Added a CheckoutIndexer class which is inherited by ReaderCheckout and WriterCheckout to enable the following API (originally proposed by @lantiga and @elistevens):

dset = repo.checkout(write=True)
# get an arrayset of the dataset (i.e. a "column" of the dataset?)
aset = dset['foo']

# get a specific array from 'foo'
arr = aset['1']
# set it too
aset['1'] = arr

# get data from dset (returns a named tuple)
subarr = dset['foo', '1']
# and set into it
dset['foo', '1'] = subarr + 1

# get a sample of a dataset across 'foo' and 'bar' (returns a named tuple)
sample = dset[('foo', 'bar'), '1']

# get a sample of all arraysets in the checkout (returns a named tuple)
sample = dset[:, '1']
sample = dset[..., '1']

# get multiple samples
sample_ids = ['1', '2', '3']
batch = dset[('foo', 'bar'), sample_ids]
batch = dset[:, sample_ids]
batch = dset[..., sample_ids]
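
As a sketch of how such a mixin might dispatch on the key shape (illustrative only; the CheckoutIndexer name matches this PR, but the body and the _arraysets mapping are assumptions, not the actual implementation):

from collections import namedtuple

class CheckoutIndexer:
    """Illustrative sketch only -- not the actual Hangar implementation."""

    def __getitem__(self, index):
        if isinstance(index, str):
            # dset['foo'] -> the arrayset ("column") accessor itself
            return self._arraysets[index]
        aset_keys, sample_keys = index
        if aset_keys is Ellipsis or aset_keys == slice(None):
            # dset[:, ...] and dset[..., ...] select all arraysets
            aset_keys = tuple(self._arraysets.keys())
        elif isinstance(aset_keys, str):
            aset_keys = (aset_keys,)
        if isinstance(sample_keys, (list, tuple)):
            # multiple sample keys -> a batch (list of named tuples)
            return [self._sample_tuple(aset_keys, s) for s in sample_keys]
        return self._sample_tuple(aset_keys, sample_keys)

    def _sample_tuple(self, aset_keys, sample_key):
        res_t = namedtuple('Sample', aset_keys)
        return res_t(*(self._arraysets[a][sample_key] for a in aset_keys))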

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they are not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@rlizzo rlizzo added the needs decision (Discussion is ongoing to determine what to do.) and WIP (Don't merge; Work in Progress) labels Aug 20, 2019
@rlizzo rlizzo self-assigned this Aug 20, 2019
@codecov

codecov bot commented Aug 20, 2019

Codecov Report

Merging #115 into master will increase coverage by 0.22%.
The diff coverage is 95.91%.

@@            Coverage Diff             @@
##           master     #115      +/-   ##
==========================================
+ Coverage   92.25%   92.47%   +0.22%     
==========================================
  Files          50       51       +1     
  Lines        8622     9448     +826     
  Branches      843      930      +87     
==========================================
+ Hits         7954     8737     +783     
- Misses        492      517      +25     
- Partials      176      194      +18
Impacted Files                   Coverage Δ
src/hangar/repository.py         89.6%  <ø>      (ø) ⬆️
tests/test_dataloaders.py        98.45% <100%>   (ø) ⬆️
tests/test_arrayset.py           100%   <100%>   (ø) ⬆️
tests/test_checkout.py           99.81% <100%>   (ø) ⬆️
tests/test_remotes.py            98.42% <100%>   (ø) ⬆️
src/hangar/records/queries.py    91.88% <100%>   (ø) ⬆️
tests/conftest.py                97.22% <100%>   (ø) ⬆️
src/hangar/records/hashs.py      96.92% <100%>   (ø) ⬆️
tests/test_diff.py               99.69% <80%>    (ø) ⬆️
src/hangar/arrayset.py           89.22% <84.31%> (+0.48%) ⬆️
... and 4 more

@hhsecond
Member

hhsecond commented Aug 21, 2019

@rlizzo @lantiga I found it more intuitive to call

dset = repo.checkout()
single_sample = dset['foo'][1]
samples_nt = dset['foo', 'bar'][1]

It might add another layer over arraysets (something like GroupedAset from dataloaders) to handle the subindex. What do you think?
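
For illustration, a minimal sketch of what such a wrapper layer might look like (GroupedAsetView and its internals are hypothetical names invented here; GroupedAset in the dataloaders module is only the inspiration):

from collections import namedtuple

class GroupedAsetView:
    """Hypothetical wrapper grouping several arraysets for sub-indexing."""

    def __init__(self, asets):
        # asets: mapping of arrayset name -> arrayset accessor
        self._asets = dict(asets)
        self._res_t = namedtuple('Samples', sorted(self._asets))

    def __getitem__(self, sample_key):
        # pull the same sample key out of every grouped arrayset
        return self._res_t(**{name: aset[sample_key]
                              for name, aset in self._asets.items()})

# usage (hypothetical):
# samples_nt = GroupedAsetView({'foo': dset['foo'], 'bar': dset['bar']})['1']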

@rlizzo
Member Author

rlizzo commented Aug 21, 2019

@rlizzo @lantiga I found it more intuitive to call

sample = dset[('foo', 'bar')][1]

It might add another layer over arraysets (something like GroupedAset from dataloaders) to handle the subindex. What do you think?

I actually think that's slightly confusing, since to specify multiple arraysets or samples you have to write parentheses (foo, bar, baz) inside the bracket-style notation [(foo, bar, baz)]. However, if we removed the parentheses, specifying samples would look like:

# case 1
samples = dset['foo', 'bar']['1', '2']

# case 2
samples = dset[:]['1']

# case 3
samples = dset[...]['1', '2']

# case 4
samples = dset['foo']['1']

While case 2 and case 4 look OK, 1 and 3 look a bit odd to me (thinking as a numpy user). Having this nested sort of access structure doesn't really give off the same impression that the returned results are a view of sample names over arrayset names (to me at least).

With samples = dset[['foo', 'bar'], '1'], I know what I'm going to get, and it's less to type. @elistevens, I'd love to hear your input on this one as well.

Another Idea

We actually may want to take some minor inspiration from the xarray project, as it is the de-facto standard for working with labelled ndim arrays, or as they say:

xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

While we wouldn't want a one-to-one mapping of API and access conventions, their approach to positional indexing is essentially the same as the proposed samples = dset[['foo', 'bar'], '1'] method (substituting named indexes for positional int indexes). Vectorized indexing is the view of multiple samples over multiple arraysets, samples = dset[['foo', 'bar'], ['1', '2']], and multi-level indexing could be rather adaptable for us as well. Maybe:

# multi-level indexing
samples = dset[['foo', ['1', '2']], ['bar', ['3', '4']]]

We could also allow "dict style" indexing: samples = dset[{'foo': '1', 'bar': '1'}], though having too many options isn't always desirable.

I would specifically avoid their "Pandas style" named indexing (using .isel(...) and .sel(...)), but only because it's a lot of complexity to add to what is meant to be a "convenience" method.
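
For reference, the xarray conventions being compared here look roughly like the following (standard xarray API; the toy data is made up for illustration):

import xarray as xr

da = xr.DataArray(
    [[1, 2], [3, 4]],
    dims=("aset", "sample"),
    coords={"aset": ["foo", "bar"], "sample": ["1", "2"]},
)

# label-based indexing, analogous to dset[['foo', 'bar'], '1']
da.sel(aset="foo", sample="1")

# positional (integer) indexing
da.isel(aset=0, sample=0)

# vectorized indexing: pointwise selection along a new dimension
da.sel(aset=xr.DataArray(["foo", "bar"], dims="points"),
       sample=xr.DataArray(["1", "2"], dims="points"))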

One more Issue

xarray has also decided what happens with missing coordinate labels, which is an issue I was about to raise. We need to decide what happens in the following case:

# checkout has arraysets 'foo', 'bar', 'baz'.
# 'foo' has samples '1', '2'
# 'bar' has samples '1', '2'
# 'baz' has samples '1'    <-- missing '2'

# Case 1 (expected output)
>>> res = dset[:, '1']
>>> print(res)
(foo=np.array([1]), bar=np.array([1, 1]), baz=np.array([1, 1, 1]))

# Case 2A
>>> res = dset[:, '2']
KeyError: '2' not in arrayset 'baz'

# Case 2B
>>> res = dset[:, '2']
>>> print(res)
(foo=np.array([2]), bar=np.array([2, 2]), baz=None)

# Case 2C
>>> res = dset[:, '2']
>>> print(res)
(foo=np.array([2]), bar=np.array([2, 2]))

I was leaning toward suggesting Case 2C before I read the xarray spec, but it turns out that is the exact behavior they implement as well. Is there any reason we shouldn't adopt that pattern (regardless of the other conventions we decide on)?
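
A minimal sketch of Case 2C semantics, assuming a checkout exposing a hypothetical _arraysets mapping (name invented here) whose accessors support key membership tests:

from collections import namedtuple

def get_sample(checkout, sample_key):
    """Case 2C sketch: build the result from only the arraysets that
    actually contain sample_key; missing fields are simply dropped."""
    found = {name: aset[sample_key]
             for name, aset in checkout._arraysets.items()
             if sample_key in aset}
    res_t = namedtuple('Sample', sorted(found))
    return res_t(**found)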

@elistevens

elistevens commented Aug 21, 2019

I think that the talk of how to layer syntactic sugar over the underlying operations is happening way too early in the design process. Trying to decide how many square brackets you need to get the data out shouldn't happen until the release after you've got verbose methods implemented and released.

Also, keep in mind that 99% of your users are never going to use this API. Data will be written, committed, and then they'll call the API to convert it to a Dataset they can feed into their deep learning framework of choice. I'm not saying that because I think that it's okay to design it poorly, just that it's not a great use of time right now.

There's an unfortunate trend in a lot of the more research/data science tooling to paper over data problems, which I think runs entirely counter to the hangar "data integrity is paramount" ethos. If I ask for something, but some of the keys are missing, I should get an exception. (FWIW, I feel the same way about missing dimensions, but that's a different PR)

I do think that it's okay to have a keyword arg that enables silently discarding rows that don't meet the column constraints, but I should have to enable it explicitly (and I think that needing it points to deeper problems in the data pipeline).
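
Sketching that suggestion (the allow_missing flag and _arraysets mapping are names invented here, not proposed API): strict by default, with an explicit opt-in to drop incomplete samples.

def get_batch(checkout, sample_keys, allow_missing=False):
    """Raise on any missing key unless the caller explicitly opts in
    to silently dropping samples that don't span every arrayset."""
    batch = []
    for key in sample_keys:
        missing = [name for name, aset in checkout._arraysets.items()
                   if key not in aset]
        if missing and not allow_missing:
            raise KeyError(f"sample {key!r} missing from arraysets {missing}")
        if missing:
            continue  # explicitly permitted: drop the incomplete sample
        batch.append({name: aset[key]
                      for name, aset in checkout._arraysets.items()})
    return batch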

@lantiga
Member

lantiga commented Aug 21, 2019

Hey, great job @rlizzo! A few comments:

I agree with @rlizzo's comment on double brackets; I also started with @hhsecond's approach but switched to multi-indexing later on, as it was simpler to grasp. By that I mean that users coming to Hangar post-0.3 will primarily be exposed to this API for accessing / writing datasets, as it doesn't require any more terminology or object hierarchy than just what you check out, and it suffices for the majority of use cases.
I imagine all they will need to learn is that you check out something and just index into it with named indexes.

@rlizzo it's great that we can be consistent with xarray. I personally think 2C is the right call (I was originally thinking of returning None for that key); it makes the behavior of a dataset as an indexed subset of a checkout simpler (you can iterate over keys with an Ellipsis). We can add another method (like get) or an option in the square brackets à la R (yuck!) to make it throw.

@elistevens it's a good warning re: being too soon, but I don't agree here. One thing we know is that we need to simplify Hangar's API to grow our user base. I do think this is the API that 99% of users should use, because it won't require them to go deep into Hangar's design details to make it useful. In my head, the current API will be for advanced users.

@rlizzo
Member Author

rlizzo commented Sep 4, 2019

@hhsecond, can you take a look at this and let me know what you think about testing for these new access methods? I'm not sure if we should repeat many of the low-level tests or just write new ones for the high-level API.

@rlizzo rlizzo removed the needs decision (Discussion is ongoing to determine what to do.) label Sep 4, 2019
Member

@hhsecond hhsecond left a comment

Looks good so far. Should I take a thorough look once you finish making changes?

arrays stored in each arrayset via a common sample key. Each sample
key's values are returned as an individual element in the
List. The samples are returned in the same order they were received.

Member

It could also return a NamedTuple (instead of a List) in this case: dset[('foo', 'bar', 'baz'), '1']?

Member Author

It might be nice to have that in some cases, but since there isn't any way to set some modifiable flag during dict-style access, it would have to be permanent. Two levels of indexing to pull elements which could be stored sequentially in a list would drive me nuts...

Member

No, what I am saying is: it already returns a NamedTuple in the case dset[('foo', 'bar', 'baz'), '1'], which is not documented in the Returns field. Or am I missing something here?

src/hangar/checkout.py (outdated, resolved)
@rlizzo
Member Author

rlizzo commented Sep 5, 2019

Thanks for starting. I'm gonna say go ahead and give it a look now! I'll come back to the rest of the tests once you give it a once-over.

@hhsecond
Member

hhsecond commented Sep 5, 2019

@rlizzo We need something to replace the co.arraysets.keys() we had in the older API to get the names of all arraysets.

@hhsecond
Member

hhsecond commented Sep 5, 2019

Do we have a plan to enable users to loop over the dataset, where dataset = repo.checkout()?

@rlizzo
Member Author

rlizzo commented Sep 5, 2019

@rlizzo We need something to replace the co.arraysets.keys() we had in the older API to get the names of all arraysets.

Why? The standard API isn't going anywhere, and it's really important for working with anything other than a pre-built repo/dataset. Is there some reason we shouldn't just keep co.arraysets.keys()?
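
For reference, the existing access pattern in question (a hypothetical high-level spelling is sketched alongside; nothing here is decided):

co = repo.checkout()

# standard API: list all arrayset names
names = list(co.arraysets.keys())

# one possible high-level equivalent (hypothetical, not implemented):
# names = list(co.keys())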

@rlizzo
Member Author

rlizzo commented Sep 5, 2019

Do we have a plan to enable users to loop over the dataset, where dataset = repo.checkout()?

It's not on my radar right now, and I'm not sure what it would even return: arrayset keys? metadata? values?

I'm actually very wary of treating the Checkout object as a dataset in much more of a capacity than it is now. It has some super important methods (commit, diff, merge, reset_staging_area, etc.), on top of providing access to the standard Arrayset and Metadata class methods.

From my point of view, these changes are merely a convenience for a common use case users will face; they aren't changing the fundamental nature of what the Checkout object actually does. I just wouldn't want to send mixed signals about what each layer is responsible for.

Does that make sense? Maybe I'm seeing this wrong though. What's your point of view?

@rlizzo rlizzo added the Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review.) label and removed the WIP (Don't merge; Work in Progress) label Sep 5, 2019
@rlizzo
Member Author

rlizzo commented Sep 5, 2019

Documentation updated and all tests pass with good coverage. Ready to be merged!

@hhsecond
Member

hhsecond commented Sep 6, 2019

@rlizzo We need something to replace the co.arraysets.keys() we had in the older API to get the names of all arraysets.

Why? The standard API isn't going anywhere, and it's really important for working with anything other than a pre-built repo/dataset. Is there some reason we shouldn't just keep co.arraysets.keys()?

It is just more intuitive to get all the relevant information from the high-level API itself, IMO

@hhsecond
Member

hhsecond commented Sep 6, 2019

Do we have a plan to enable users to loop over the dataset, where dataset = repo.checkout()?

It's not on my radar right now, and I'm not sure what it would even return: arrayset keys? metadata? values?

I'm actually very wary of treating the Checkout object as a dataset in much more of a capacity than it is now. It has some super important methods (commit, diff, merge, reset_staging_area, etc.), on top of providing access to the standard Arrayset and Metadata class methods.

From my point of view, these changes are merely a convenience for a common use case users will face; they aren't changing the fundamental nature of what the Checkout object actually does. I just wouldn't want to send mixed signals about what each layer is responsible for.

Does that make sense? Maybe I'm seeing this wrong though. What's your point of view?

Yep, makes sense. Thanks!

src/hangar/checkout.py (outdated, resolved)
Member

@hhsecond hhsecond left a comment

LGTM

@rlizzo rlizzo removed the Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review.) label Sep 6, 2019
@rlizzo rlizzo merged commit ecc354a into tensorwerk:master Sep 6, 2019
@rlizzo rlizzo deleted the api-redesign branch October 8, 2019 06:18