Add support for "close" on transforms #53

thbar · 2018-03-22T09:42:58Z

Sources and destinations both support some form of "close":

sources because they are only called once via #each (so one can close as part of this)
destinations via #close

In contrast, transforms cannot currently close.

This will be interesting in particular with Kiba v2 StreamingRunner, to implement some forms of buffering in the transform itself (e.g: keeping N records before doing a grouped query, aggregating N records), in which case close is a way to ensure we flush out the buffer.

See https://stackoverflow.com/questions/49422860/transforming-a-table-into-a-hash-of-sets-using-kiba-etl for an example of use.

The text was updated successfully, but these errors were encountered:

sirfilip · 2018-05-17T13:05:51Z

Not sure that it is really needed cause you can always "cache" stuff inside an instance variable or some other cache store and flush it by returning the contents of the cache and return nil or call next if inside a block to aggregate. So it can be easy accomplished with the current interface.

thbar · 2018-05-25T05:27:59Z

@sirfilip thanks for your input! I'm still thinking through this, and be sure (since I'm always quite aggressively removing features) I'll wait to have an actual production requirement that shows it's really needed, before implementing this :-)

thbar · 2018-06-17T16:51:25Z

A first experimental implementation is available in #57. I'll give myself some time to play around with it and see how necessary it is for various scenarios.

thbar · 2018-06-17T16:53:03Z

FYI @kmayer, #57 may be of interest to you.

thbar · 2018-06-21T16:20:26Z

It just occurred to me (while working on another ETL), that #57 also allows to achieve sorting quite easily (here in memory):

class SortTransform
  def initialize(sort_by:)
    @sort_by = sort_by
    @buffer = []
  end

  def process(row)
    @buffer << row
    nil
  end
  
  def close
    @buffer.sort_by(@sort_by).each do |row|
      yield row
    end
  end
end

I will experiment more, but this looks promising.

* Add support for aggregating transforms (see #53) * Fix spec for MRI 2.0 * Add support for JRuby 1.7

thbar · 2018-09-05T15:19:52Z

#57 has been merged into master. This will be released at next round (Kiba 2.5.0).

ttilberg · 2018-11-17T14:19:31Z

@thbar in addition to sorting, this is going to help me solve the following problem:

A set of snapshot data might include multiple fact values from the same product. Some of my exports expect me to pick a specific value for each product, such as the min, avg, max, or most recent price. This will allow me to bake that into the middle of the pipeline, still using other transforms, and actual destinations, instead of the ole' instance-var-in-the-pipeline-doing--its-thing-in-the-post-process trick.

Cool addition, thanks.

thbar · 2018-11-17T14:32:46Z

instead of the ole' instance-var-in-the-pipeline-doing--its-thing-in-the-post-process trick

@ttilberg you made me laugh with this one 😄 it's such a common thing that it's worth having a long word like this!

Thanks for the feedback, and I can say I definitely share your findings.

I've been using this construct in various places now, and this is useful for a wide range of topics (e.g. data profiling that you can plug conditionally on/off without touching the rest of the pipeline).

And in occasions, you can easily plug stuff (still relying on instance variables though), like the upcoming LambdaTransform that I will soon add to kiba common:

transform LambdaTransform,
  on_init: -> { @count = 0 },
  on_process: -> (r) { @count += 1; r },
  on_close: -> { logger.info "Total #{@count} rows read from source" }

This is very convenient for quick profiling while iterating on the data at development time.

In all cases, thanks for taking the time to comment & provide feedback, appreciated!

thbar added a commit that referenced this issue Jun 17, 2018

Add support for aggregating transforms (see #53)

ebdeb14

thbar mentioned this issue Sep 5, 2018

Add support for aggregating transforms #57

Merged

thbar added a commit that referenced this issue Sep 5, 2018

Add support for aggregating transforms (#57)

a2b5573

* Add support for aggregating transforms (see #53) * Fix spec for MRI 2.0 * Add support for JRuby 1.7

thbar closed this as completed Sep 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for "close" on transforms #53

Add support for "close" on transforms #53

thbar commented Mar 22, 2018

sirfilip commented May 17, 2018

thbar commented May 25, 2018

thbar commented Jun 17, 2018 •

edited

thbar commented Jun 17, 2018 •

edited

thbar commented Jun 21, 2018

thbar commented Sep 5, 2018

ttilberg commented Nov 17, 2018

thbar commented Nov 17, 2018

Add support for "close" on transforms #53

Add support for "close" on transforms #53

Comments

thbar commented Mar 22, 2018

sirfilip commented May 17, 2018

thbar commented May 25, 2018

thbar commented Jun 17, 2018 • edited

thbar commented Jun 17, 2018 • edited

thbar commented Jun 21, 2018

thbar commented Sep 5, 2018

ttilberg commented Nov 17, 2018

thbar commented Nov 17, 2018

thbar commented Jun 17, 2018 •

edited

thbar commented Jun 17, 2018 •

edited