1.0.0 labels #263
@freeman-lab @sofroniewn would like your feedback on this when you get a chance (tests are currently failing as this depends on an upstream PR in Bolt: bolt-project/bolt#84).
@jwittenbach sounds good, having a good chat about it with @freeman-lab right now. Curious about whether labels should be on images too: imagine I want to drop some images in time, and I want to remember the original timestamp of each image somewhere so that when I make my movie I can display the correct time value.
@sofroniewn that's a great point. When I was writing up the PR, I almost mentioned that it would be fairly trivial to extend the behavior to
I think I agree with @sofroniewn, that the concept of a
Oh great, we're all in agreement 👍
Alright, moved the implementation of There's an interesting relationship between
@jwittenbach can you merge changes from master? I think we're ready to merge this in!
Sure thing, will do!
Implementation
This PR implements `labels`, a new feature on the `Series` object that allows the user to keep track of the identity of the individual series that make up the `Series` object, even through operations such as `Series.filter` and indexing (`Series[...]`). In analogy to how `Series.index` allows the user to keep tabs on the final dimension of the `Series` object, `Series.labels` allows the user to track the identities of the "base axes" (the non-final axes which, in `spark` mode, are distributed).

Assume we have a `Series` object named `series` with shape `(x, y, z, t)` or `(n, t)`. We can attach a set of labels to these series with:

where `labels` is an array-like object of size `(x, y, z)` or `(n)` respectively.

With regard to how they affect the `labels`, operations on `Series` fall into three categories:

- `map` operations do not change the structure of the non-final dimensions, so the `labels` are unaffected (e.g. `Series.map`, `Series.zscore`, `Series.between`).
- `reduce` operations combine all the individual series in the `Series` object, so the identities of the individual series are lost and the `labels` are dropped (e.g. `Series.reduce`, `Series.mean`).
- `filter` operations drop some of the series. This is where `labels` are most useful in tracking the identities of the retained series. In these cases, the `labels` are updated to reflect the new structure of the `Series` object (e.g. `Series.filter` and `Series.__getitem__`, i.e. indexing).

A note on performance in `spark` mode:

In the distributed setting, determining which elements of the `Series` object were dropped/retained during a `filter` can be expensive. This effectively involves making two passes through the data: the first to determine which values will be dropped (a `map`) and a second to actually drop those values (a `filter`). When `labels` are set (i.e. not `None`), these two passes happen in a non-lazy fashion so that the `labels` can be appropriately updated (NB: `filter` is already non-lazy in this setting).

Indexing is similar to a `filter` in that records are dropped; however, the specification of which records will be dropped is knowable directly from the inputs, so updating the `labels` (like the indexing itself) is fast and the indexing operation remains lazy.
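
To make the three categories concrete, here is a minimal pure-Python sketch of the bookkeeping described above. It is not the Thunder implementation: `LabeledSeries`, `reduce_mean`, and `_select` are hypothetical names invented for illustration, records are plain lists rather than distributed arrays, and nothing here is lazy. It only shows how labels could be carried through map-like, reduce-like, filter-like, and indexing operations, including the two-pass structure of a filter.

```python
# Hypothetical sketch: LabeledSeries is NOT the Thunder Series class.
class LabeledSeries:
    """Toy container holding one label per record (each record is one series)."""

    def __init__(self, records, labels=None):
        self.records = [list(r) for r in records]
        if labels is not None and len(labels) != len(self.records):
            raise ValueError("need exactly one label per record")
        self.labels = None if labels is None else list(labels)

    def map(self, func):
        # Map-like: the set of records is unchanged, so labels carry over as-is.
        return LabeledSeries([func(r) for r in self.records], self.labels)

    def reduce_mean(self):
        # Reduce-like: records are combined, so per-record identity (and labels) is lost.
        return [sum(col) / len(col) for col in zip(*self.records)]

    def filter(self, pred):
        # Pass 1 (a map): evaluate the predicate on every record to build a keep-mask.
        mask = [bool(pred(r)) for r in self.records]
        # Pass 2 (a filter): drop records and apply the same mask to the labels.
        return self._select([i for i, keep in enumerate(mask) if keep])

    def __getitem__(self, key):
        # Indexing: the kept positions follow from the key alone, without reading the data.
        idx = range(len(self.records))[key] if isinstance(key, slice) else [key]
        return self._select(list(idx))

    def _select(self, positions):
        labels = None if self.labels is None else [self.labels[i] for i in positions]
        return LabeledSeries([self.records[i] for i in positions], labels)


series = LabeledSeries([[1, 2, 3], [10, 20, 30], [-1, -2, -3]], labels=["a", "b", "c"])
doubled = series.map(lambda r: [2 * v for v in r])   # labels unchanged
positive = series.filter(lambda r: sum(r) > 0)       # labels follow the kept records
tail = series[1:]                                    # labels sliced with the records
averaged = series.reduce_mean()                      # plain list, no labels attached
```

In this toy version the filter has to look at every record before it knows which labels survive, whereas indexing can compute the surviving labels from the key alone. That mirrors the distinction drawn in the performance note, although the real cost in `spark` mode comes from the distributed passes, which this sketch does not model.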