Implementing ETL sources

Thibaut Barrère edited this page Feb 10, 2020 · 4 revisions

Kiba sources are components you can either implement yourself, or pick from other projects (such as Kiba Common and Kiba Pro).

Sources are the components responsible for extracting data.

Sources are classes implementing:

  • a constructor (to which Kiba passes the arguments provided in the DSL)
  • an each method (which yields rows one by one)

Rows are usually Hash instances, but could be other structures as long as the next steps of your pipeline know how to handle them.
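
As a minimal sketch of this contract, the class below takes its data in the constructor and yields each row from each. InMemorySource is a hypothetical name used only for illustration; it is not part of Kiba.

```ruby
# Hypothetical source (not part of Kiba): yields each element of an
# in-memory collection as a row.
class InMemorySource
  def initialize(rows)
    @rows = rows
  end

  # Yield rows one by one, as the source contract expects.
  def each
    @rows.each { |row| yield(row) }
  end
end

# Exercise the source directly, without Kiba.
source = InMemorySource.new([{ id: 1, name: 'Alice' }, { id: 2, name: 'Bob' }])
collected = []
source.each { |row| collected << row }
# collected now holds the two hashes, in order
```

Here the rows are Hash instances, but nothing in the class itself requires that.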

Since sources are classes, you can (and are encouraged to) unit test them and reuse them.

Here is a simple CSV source:

require 'csv'

class MyCsvSource
  attr_reader :input_file

  def initialize(input_file)
    @input_file = input_file
  end

  def each
    CSV.open(input_file, headers: true, header_converters: :symbol) do |csv|
      csv.each do |row|
        yield(row.to_hash)
      end
    end
  end
end
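
Because the source is a plain class, it can be exercised on its own, for instance from a unit test. The sketch below assumes a temporary fixture file with made-up content, and collects the rows by calling each directly. Kiba itself is not involved.

```ruby
require 'csv'
require 'tempfile'

# Same source as above, repeated so that this snippet runs on its own.
class MyCsvSource
  attr_reader :input_file

  def initialize(input_file)
    @input_file = input_file
  end

  def each
    CSV.open(input_file, headers: true, header_converters: :symbol) do |csv|
      csv.each do |row|
        yield(row.to_hash)
      end
    end
  end
end

# Write a small fixture file (the content is made up for this example).
fixture = Tempfile.new(['people', '.csv'])
fixture.write("First Name,Last Name\nJohn,Doe\nJane,Roe\n")
fixture.close

# Collect whatever the source yields.
rows = []
MyCsvSource.new(fixture.path).each { |row| rows << row }
fixture.unlink

# rows => [{ first_name: "John", last_name: "Doe" },
#          { first_name: "Jane", last_name: "Roe" }]
```

Note that the :symbol header converter turns the "First Name" header into the :first_name key.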

Once implemented, you can use your source within Kiba.parse:

job = Kiba.parse do
  source MyCsvSource, filename
  # SNIP
end

The first argument to source is the source class itself. Any remaining arguments are passed to the source constructor (initialize) when Kiba runs your pipeline.
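
To make the argument passing concrete, here is a rough plain-Ruby imitation of what a source declaration leads to. This is not Kiba's actual runner code, and RangeSource is a hypothetical source invented for the example.

```ruby
# Hypothetical source taking two constructor arguments.
class RangeSource
  def initialize(first, last)
    @first = first
    @last = last
  end

  def each
    (@first..@last).each { |n| yield({ number: n }) }
  end
end

# In a Kiba job this would be declared as:
#   source RangeSource, 1, 3
# The lines below only imitate, in plain Ruby, the effect of that declaration:
# the class is instantiated with the remaining arguments, then each is called.
source_class = RangeSource
source_args = [1, 3]
source = source_class.new(*source_args)

rows = []
source.each { |row| rows << row }
# rows => [{ number: 1 }, { number: 2 }, { number: 3 }]
```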

Where possible, open and close resources inside each, using the block form (as in this example), so that the resources are closed even if the pipeline is interrupted.
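
The following sketch, in plain Ruby with no Kiba involved, illustrates why the block form matters. LineSource and its last_io reader are hypothetical and exist only so the example can inspect the file handle afterwards; the raised error stands in for an interrupted pipeline.

```ruby
require 'tempfile'

# Hypothetical source: yields one row per line of a text file.
class LineSource
  attr_reader :path, :last_io

  def initialize(path)
    @path = path
  end

  def each
    # Block form: File.open closes the handle when the block exits,
    # including when an exception propagates out of it.
    File.open(path) do |io|
      @last_io = io # kept only so the example can check it later
      io.each_line { |line| yield({ line: line.chomp }) }
    end
  end
end

# Made-up fixture data for the example.
fixture = Tempfile.new('lines')
fixture.write("a\nb\nc\n")
fixture.close

source = LineSource.new(fixture.path)
seen = []
begin
  source.each do |row|
    raise 'simulated failure' if row[:line] == 'b'
    seen << row[:line]
  end
rescue RuntimeError
  # Swallowed here; a real pipeline would let the error surface.
end
fixture.unlink

closed = source.last_io.closed?
# closed => true, even though iteration stopped on the second line
```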

A few sources are available in Kiba Common, if you want to see how they are implemented.

Next: Implementing ETL transforms