1.0.0 labels #263

Merged
merged 15 commits into from Apr 5, 2016


jwittenbach
Contributor

Implementation

This PR implements labels, a new feature on the Series object that allows the user to keep track of the identity of the individual series that make up the Series object even through operations such as Series.filter and indexing (Series[...]). In analogy to how Series.index allows the user to keep tabs on the final dimension of the Series object, Series.labels allows the user to track the identities of the "base axes" (the non-final axes which, in spark mode, are distributed).

Assume we have a Series object named series with shape (x, y, z, t) or (n, t). We can attach a set of labels to these series with:

series.labels = labels

where labels is an array-like object of size (x, y, z) or (n) respectively.
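To make the shape requirement concrete, here is a minimal NumPy-only sketch. It does not use the actual Series class; `values` and the helper `check_labels` are hypothetical names introduced only for illustration, and the check shown is an assumption about what a sensible validation would look like, not the code in this PR.

```python
import numpy as np

# Hypothetical stand-in for the values of a Series with shape (x, y, z, t).
values = np.random.randn(2, 3, 4, 10)

# One label per individual series, so the shape matches the base axes (x, y, z).
labels = np.arange(2 * 3 * 4).reshape(2, 3, 4)

def check_labels(values, labels):
    """Return labels as an array if they match every axis except the last."""
    labels = np.asarray(labels)
    if labels.shape != values.shape[:-1]:
        raise ValueError("labels shape %s does not match base axes %s"
                         % (labels.shape, values.shape[:-1]))
    return labels

checked = check_labels(values, labels)
```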

With regard to how they affect the labels, operations on Series fall into three categories:

  1. Operations that are effectively a map do not change the structure of non-final dimensions and thus the labels are unaffected -- e.g. Series.map, Series.zscore, Series.between.
  2. Operations that are effectively a reduce combine all the individual series in the Series object, so the identities of the individual series are lost and the labels are dropped -- e.g. Series.reduce, Series.mean.
  3. Operations that are effectively a filter will drop some of the series. This is where labels are most useful, since they track the identities of the retained series. In these cases, the labels are updated to reflect the new structure of the Series object -- e.g. Series.filter and Series.__getitem__ (i.e. indexing).
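The three categories can be sketched on a plain NumPy array of shape (n, t). This is only an illustration of the intended semantics under the assumption that labels behave like an array aligned with the first axis; the variable names are made up and none of this is the Series implementation itself.

```python
import numpy as np

values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])    # shape (n, t) = (3, 3)
labels = np.array(["a", "b", "c"])      # one label per series

# 1. map-like: every series is transformed on its own, so labels carry over.
mapped = values - values.mean(axis=-1, keepdims=True)
mapped_labels = labels

# 2. reduce-like: the series are combined into one, so labels are dropped.
reduced = values.mean(axis=0)
reduced_labels = None

# 3. filter-like: some series are dropped, and labels are subset the same way.
keep = values.sum(axis=-1) > 10         # row sums are 6, 15, 24
filtered = values[keep]
filtered_labels = labels[keep]
```

Here `filtered_labels` still says which of the original series survived, which is the information that would otherwise be lost.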

A note on performance in spark mode:

In the distributed setting, determining which elements of the Series object were dropped/retained during a filter can be expensive. This effectively involves making two passes through the data: the first to determine which values will be dropped (a map) and a second to actually drop those values (a filter). When labels are set (i.e. not None), these two passes happen in a non-lazy fashion so that the labels can be updated appropriately (NB: filter is already non-lazy in this setting).
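A rough local sketch of the two-pass idea, using a Python list in place of the distributed records. The function `filter_with_labels` and its arguments are hypothetical and exist only to show why the mask from the first pass has to be materialized before the labels can be updated; the real distributed code path will differ.

```python
import numpy as np

def filter_with_labels(records, labels, func):
    """records: list of 1-d arrays; labels: one entry per record, or None."""
    # Pass 1 (a map): evaluate the predicate on every record and collect the
    # results. This has to be computed eagerly to know which labels to keep.
    mask = np.array([bool(func(r)) for r in records])
    # Pass 2 (a filter): actually drop the records that failed the predicate.
    kept = [r for r, m in zip(records, mask) if m]
    # Labels are only touched when they are set.
    new_labels = None if labels is None else np.asarray(labels)[mask]
    return kept, new_labels

records = [np.array([0, 1]), np.array([5, 6]), np.array([2, 2])]
kept, new_labels = filter_with_labels(records, [10, 20, 30],
                                      lambda r: r.sum() > 3)
```

When `labels` is None the mask could be skipped entirely and the filter applied in a single pass, which is why the extra cost only applies when labels are set.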

Indexing is similar to a filter in that records are dropped; however, which records will be dropped is knowable directly from the inputs, so updating the labels (like the indexing itself) is fast and the indexing operation remains lazy.
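As an illustration of why indexing is cheap, the labels can be updated by applying the base-axis part of the index directly to the labels array, with no pass over the data. The helper `index_with_labels` below is a hypothetical NumPy sketch that assumes the index consists only of integers and slices; it is not the `__getitem__` in this PR.

```python
import numpy as np

values = np.arange(24).reshape(4, 6)       # shape (n, t) = (4, 6)
labels = np.array([100, 101, 102, 103])    # one label per series

def index_with_labels(values, labels, index):
    """Index values and apply the base-axis part of the index to labels."""
    if not isinstance(index, tuple):
        index = (index,)
    # Only the entries for the non-final axes affect the labels.
    base_index = index[:values.ndim - 1]
    return values[index], labels[base_index]

sub_values, sub_labels = index_with_labels(
    values, labels, (slice(1, 3), slice(0, 2)))
```

The new labels depend only on `labels` and `index`, never on the values, which is what lets the indexing operation stay lazy.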

@jwittenbach
Contributor Author

@freeman-lab @sofroniewn would like your feedback on this when you get a chance

(tests are currently failing as this depends on an upstream PR in Bolt: bolt-project/bolt#84)

@sofroniewn

@jwittenbach sounds good, having a good chat about it with @freeman-lab right now. Curious about whether labels should be on images too -- imagine I want to drop some images in time, and I want to remember the original timestamp of each image somewhere so that when I make my movie I can display the correct time value

@jwittenbach
Contributor Author

@sofroniewn that's a great point. When I was writing up the PR, I almost mentioned that it would be fairly trivial to extend the behavior to Images as well.

@freeman-lab
Member

I think I agree with @sofroniewn, that the concept of a label per record is a generic concept, and based on a quick look at the code so far it looks like this could basically just be done on Data instead of Series and everything should "just work". Is that right?

@freeman-lab
Member

Oh great, we're all in agreement 👍

@jwittenbach
Contributor Author

Alright, moved the implementation of labels as well as all related functions up to Base so now they can be utilized by both Series and Images.

There's an interesting relationship between labels and index -- they do something slightly more complicated than trading roles when you go from Images to Series or vice-versa. But I don't think I want to take that on at the moment.

@freeman-lab
Member

@jwittenbach can you merge changes from master, i think we're ready to merge this in!

@jwittenbach
Contributor Author

Sure thing, will do!

3 participants