Split stream class #15
Comments
Are you able to say more about how you would want to override map, sink,
etc?
…On Wed, May 31, 2017 at 4:49 PM, Christopher J. Wright <***@***.***> wrote:
Would it be possible to split the stream class into two classes?
The first class would hold the init, emit, child, and loop methods.
The second class would inherit from the first and implement map, sink,
buffer, etc.
Inspiration:
I have a very specific data topology and I need the various functions to
operate differently than they currently do. By splitting the class I'd be
able to use the same base class and just have to re-implement map, filter,
etc. for my data needs.
A related proposition: would it be possible to make the various internals
of Streams hot swappable?
I don't need to change all the methods; delay, for example, most likely
could stay the same, as could buffer. But map, filter, sliding_window, etc.
won't work for the event model data topology.
Thoughts?
@danielballan <https://github.com/danielballan>
The data exists as a generator which puts out (name, document) pairs.
For map I need to issue new start, descriptor, and stop documents. The mapped function only applies to the event documents, which are then issued as new event documents. So map needs to be aware that it applies the function only to the event documents. Similarly for filter: I need to issue new start, descriptor, and stop documents, while the true filtering only applies at the event level.
Sink is most likely ok.
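A sketch of what such an event-aware map could look like, written as a plain generator over (name, document) pairs. The document fields and the uuid-based reissuing here are assumptions for illustration, not the actual event model schema:

```python
import uuid

def event_model_map(func, docs):
    # Hypothetical sketch: apply ``func`` only to event documents,
    # reissuing all other documents with fresh uids.
    for name, doc in docs:
        if name == 'event':
            # transform only the event payload, producing a new event document
            yield name, dict(doc, data=func(doc['data']))
        else:
            # start/descriptor/stop pass through, reissued with a new uid
            yield name, dict(doc, uid=str(uuid.uuid4()))
```

This keeps the document bookkeeping in one place while the user-supplied function only ever sees event payloads.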
My first inclination is to try to resolve this with the approach in
#13
However I'm also somewhat busy and so that might not happen in the near
future.
In principle I'm not against subclassing. I would want there to be a
pretty clear rationale for which operations we keep in each class.
Alternatively you could just subclass Stream as it is and override the
methods that you care about.
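For illustration, overriding map in a subclass might look like the following sketch. A toy Stream stand-in is used here instead of the real streamz class so the snippet is self-contained; the real Stream's internals differ:

```python
class Stream:
    # Toy stand-in for the real Stream class: map applies func to each emitted item.
    def __init__(self):
        self.items = []
        self._subs = []

    def emit(self, x):
        self.items.append(x)
        for func, out in self._subs:
            out.emit(func(x))

    def map(self, func):
        out = type(self)()
        self._subs.append((func, out))
        return out

class DocumentStream(Stream):
    # Override map so func applies only to event documents in (name, doc) pairs.
    def map(self, func):
        def wrapper(pair):
            name, doc = pair
            return (name, func(doc)) if name == 'event' else (name, doc)
        return super().map(wrapper)
```

The subclass reuses all the plumbing and only changes what "apply a function" means for this data topology.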
…On Wed, May 31, 2017 at 5:01 PM, Christopher J. Wright <***@***.***> wrote:
Sink is most likely ok.
One of the issues with subclassing is still unclear to me; I tried to look at #13 but I'm not certain I understand it fully yet.
@CJ-Wright to give some context here, there are a number of cases where we want to have different map/filter/etc. behaviors (remote, batched, dataframes, with metadata, etc.). To add complexity, we sometimes want to apply multiple such behaviors at the same time (remote-batched). We're trying to think of a clean way to enable this kind of behavior more generally. (By "we" I mean myself, @ordirules, @danielballan, and now yourself.)
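The composition problem described here can be illustrated with plain function wrappers: hypothetical "batched" and "with metadata" behaviors that stack on an element-wise function (the names and shapes are made up for illustration, not streamz API):

```python
def batched(func):
    # Lift an element-wise function to operate on a batch (list) of elements.
    def wrapper(batch):
        return [func(x) for x in batch]
    return wrapper

def with_metadata(func):
    # Lift a function to operate on (payload, metadata) pairs, preserving metadata.
    def wrapper(pair):
        payload, meta = pair
        return func(payload), meta
    return wrapper

double = lambda x: 2 * x
# Compose both behaviors: a batched stream of (payload, metadata) pairs.
f = batched(with_metadata(double))
```

The hard design question is making such behaviors stack cleanly inside the Stream node types themselves, rather than as ad-hoc wrappers like these.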
Hmm, ok. I will try to think on it.
@CJ-Wright if I understand correctly, you want to use streams to treat the most general case possible of the event based architecture (the link is to provide context again for any newcomer). Correct? That is very interesting.
As a suggestion, the documents could be gathered into a tuple of the form (('start', startdoc), ('descriptor', descriptordoc), ('event', eventdoc), ('stop', stopdoc)) followed by a ….
I would also argue that treating streams in the most general case (start, stop, descriptor, event documents, where the number of events is unknown until we receive a stop) might not be a good idea.
Please correct me if I misunderstood anything. I'm interested to hear your ideas.
This just came across my mind. I think in your situation, you may want something similar to:

```python
sin = Stream()

# split the stream into streams of starts, events and stops
s_starts = sin.filter(lambda x: x[0] == 'start')
s_events = sin.filter(lambda x: x[0] == 'event')
s_stops = sin.filter(lambda x: x[0] == 'stop')

def myacc(state, doc):
    # accumulate docs of the form doc = (name, document)
    # into state: a dict of uids under which documents are saved
    name, document = doc
    uid = document['uid']
    if uid not in state:
        state[uid] = list()
    state[uid].append(document)
    return state

def myflushroutine(state, uid):
    # pops data off; what is returned here would be emitted by the accumulator
    return state.pop(uid)

# these will only emit when flushed
# (note: accumulate has no flush= keyword today; flush would need to be written)
s_start_accum = s_starts.accumulate(myacc, flush=myflushroutine)
s_events_accum = s_events.accumulate(myacc, flush=myflushroutine)

# here x[1]['uid'] is the general uid referring to the collection of events,
# so it could be some other index. I'm assuming: stop -> ('stop', dict(uid=N, ...)),
# and that when a stop emits, *all* of its events are assumed to have arrived
s_stops.map(lambda x: s_start_accum.flush(x[1]['uid']))
s_stops.map(lambda x: s_events_accum.flush(x[1]['uid']))

# this will only emit when all three arrive
sout = s_start_accum.zip(s_events_accum, s_stops)
```

Anyway, I think this is logic that would help you for the event based model. Happy to hear thoughts. For @mrocklin, would giving the accumulator a flush mechanism like this make sense?
Yes, I'd like to work with the event architecture (I don't know if I want the most general case yet (async everything), but moving in that direction may be good). I'd prefer not to have all the duplication if possible, especially as one may not even have a stop document when running the analysis, and in the most general case one could get different descriptors at different positions. I don't really understand what you mean by "…".
I misunderstood you (I thought you meant async); the code was for this, sorry. (I still think it's interesting how streams seem to handle that, in my opinion, quite nicely.) I'm mainly worried about the logic handling the full set of event documents being intertwined with other logic. I would still aggregate before doing anything else, but I'm definitely open to suggestions. I should explain some context, as I think we may have different use cases. In my case, the start, event and stop document stream is either a full image or a time series of images. Thus, reading in an image is basically the same as reading the full set of start, events and stop. The decision making is outside of this logic. We can thus consider a "unit" in the streams to be the start, events and stop packaged together, and aggregate them accordingly. I don't think we'll need to step outside that assumption in our case. It seems like this may not be the case for you.
Yeah, sorry, I should have specified async document generation/data acquisition. @danielballan and I had a discussion about whether the stream logic can be separated from the document logic. He made a compelling case as to why it might not be, which I am slowly coming around on. I feel that the aggregation is rather limiting: for in-line data processing we can't wait for the stop document to come in (for an experiment that takes hours, this could take a while). My current working model goes something along the lines of this:
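The general idea of processing events as they arrive, without waiting for the stop document, can be sketched as a simple incremental accumulator. This is an illustration only (the doc structure is a simplified assumption), not the working model referenced above:

```python
def running_mean():
    # Update on each event as it arrives; never wait for the stop document.
    count, total = 0, 0.0

    def update(pair):
        nonlocal count, total
        name, doc = pair
        if name == 'event':
            count += 1
            total += doc['data']
        # the current estimate is available immediately, mid-run
        return total / count if count else None

    return update
```

For an hours-long experiment, a consumer of this accumulator sees an up-to-date result after every event, rather than only after the stop document closes the run.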
I'm going to close this as the ….