Allow transforms to yield more than one row #4

Closed
thbar opened this issue Apr 20, 2015 · 8 comments

Comments

thbar commented Apr 20, 2015

Currently, and unlike activewarehouse-etl, it is not possible to yield multiple rows from a transform.

I'd like to implement such a feature because it would be useful, but I need to think through the consequences first. For instance, it could work by yielding an Array of a specific type (so that a row itself could still be an Array, without risk of collision).
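
For illustration only, one possible shape for that idea (the MultipleRows name is made up, nothing here is an existing Kiba API): a dedicated wrapper class, distinct from Array, so a plain Array would remain usable as a regular row.

# Hypothetical sketch only - MultipleRows is not part of Kiba
MultipleRows = Class.new(Array)

transform do |row|
  # returning a MultipleRows instance would emit each element as a separate row
  MultipleRows.new([
    row.merge(variant: 'a'),
    row.merge(variant: 'b')
  ])
end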

beirigo commented May 14, 2015

Hi,
Would this be useful for batch-inserting multiple rows (in a MysqlDestination, for example)?

thbar commented May 14, 2015

Hi @marcosbeirigo! #4 is not required to achieve batch inserts. It is meant for scenarios where you need to denormalize one row into many rows. Imagine you had this CSV file as a source:

key,tag_1,tag_2,tag_3
xb167,green,expensive,available

and you wanted to transform each such row into:

key,tag
xb167,green
xb167,expensive
xb167,available

This currently requires a 3-pass ETL process with Kiba. Once a transform is able to return an array of rows, it will be supported in a single pass. Does that clear things up?
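
To make that concrete, here is the kind of transform that would then become possible (illustrative only, this does not work in Kiba today), using the columns from the CSV above:

transform do |row|
  # hypothetical: returning an array of rows would emit each of them downstream
  (1..3).map do |i|
    { key: row[:key], tag: row[:"tag_#{i}"] }
  end
end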

Now back to batch inserting: you can do it today, without #4, by adding batching support to a destination with something similar to this:

class MysqlDestination
  def initialize(xxx, batch_size:)
    @batch_size = batch_size
    @rows = []
  end

  def write(row)
    @rows << row
    flush_rows if @rows.size >= @batch_size
  end

  def flush_rows
    # do the batch write here, then empty the buffer
    # so the same rows are not written twice
    @rows.clear
  end

  def close
    # flush the remaining rows
    flush_rows
  end
end
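
For reference, such a destination is declared as usual in the ETL script (Kiba calls close on each destination at the end of the run, which is what triggers the final flush); db_config below is just a stand-in for your actual connection settings:

# db_config is a placeholder, replace it with your real connection settings
destination MysqlDestination, db_config, batch_size: 250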

Note that I may later introduce native support for batching in Kiba, maybe via some kind of middleware stacking like in Rack/Sidekiq, or similar.

Another way to get faster inserts is to use MySQL's bulk import. For this, you'd create a destination that outputs to a delimited file, then use a post_process step to call the MySQL bulk import (or a tool like embulk).
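
A minimal sketch of that second approach (CsvDestination, the file name and the table name are all made up for the example):

destination CsvDestination, 'tmp/products.csv'

post_process do
  # rely on MySQL's bulk loader once the file has been fully written
  sql = "LOAD DATA LOCAL INFILE 'tmp/products.csv' INTO TABLE products " \
        "FIELDS TERMINATED BY ',' IGNORE 1 LINES"
  system('mysql', '-e', sql, 'my_database') || raise('bulk import failed')
end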

Hope this helps :-)

beirigo commented May 14, 2015

Hi @thbar,
That makes sense.

I think the flush_rows approach works well enough for my needs.

Thanks for clarifying!

thbar commented May 14, 2015

@marcosbeirigo you're welcome 👍

thbar commented Jun 5, 2015

Adding some thoughts: some new semantics are needed to allow returning more than one row per transform (a common request).

One way to do this would be to explicitly use a different keyword, indicating that the transform is expected to yield zero or more rows, instead of having to return a single row as usual.

This would give:

# if :bought_for field is an array
denormalize do |row|
  row[:bought_for].each do |value|
    yield(row.dup.merge(bought_for: value))
  end
end

thbar commented Jun 21, 2015

Supporting this properly requires either a rewrite using fibers (I have a prototype) or a potential slowdown for everything else. Putting this in standby mode for now. I will likely implement #15 first.

thbar commented Jun 24, 2015

Closing this. For now, exploding multi-valued attributes can be done at the source level. See this article for a detailed how-to.
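
For the record, a rough sketch of that source-level approach (CSV handling simplified, columns named after the example earlier in this thread):

require 'csv'

class ExplodedCsvSource
  def initialize(filename)
    @filename = filename
  end

  def each
    CSV.foreach(@filename, headers: true, header_converters: :symbol) do |csv_row|
      row = csv_row.to_hash
      # emit one narrow row per tag_N column instead of a single wide row
      row.each do |column, value|
        next unless column.to_s.start_with?('tag_')
        yield(key: row[:key], tag: value)
      end
    end
  end
end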

thbar closed this as completed Jun 24, 2015
thbar commented Jan 23, 2018

Kiba v2 supports yielding multiple rows from a class transform. See https://github.com/thbar/kiba/releases/tag/v2.0.0.
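
For anyone landing here later, a rough sketch of what such a class transform can look like (columns reuse the example from earlier in this thread; depending on your Kiba version you may need to opt into the StreamingRunner, see the release notes linked above):

class ExplodeTags
  def process(row)
    (1..3).each do |i|
      yield(key: row[:key], tag: row[:"tag_#{i}"])
    end
    nil # nothing extra to return beyond the yielded rows
  end
end

# then in the ETL declaration:
transform ExplodeTags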
