Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.



Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

Real-world usage

Brighter Planet logo

We use data_miner for data science at Brighter Planet and in production at

The killer combination for us is:

  1. active_record_inline_schema - define table structure
  2. remote_table - download data and parse it
  3. errata - apply corrections in a transparent way
  4. data_miner (this library!) - import data idempotently


Check out the extensive documentation.

Quick start

You define data_miner blocks in your ActiveRecord models. For example, in app/models/country.rb:

class Country < ActiveRecord::Base
  self.primary_key = 'iso_3166_code'

  # the "col" class method is provided by a different library - active_record_inline_schema
  col :iso_3166_code                            # alpha-2 2-letter like GB
  col :iso_3166_numeric_code, :type => :integer # numeric like 826; aka UN M49 code
  col :iso_3166_alpha_3_code                    # 3-letter like GBR
  col :name

  data_miner do
    # auto_upgrade! is provided by active_record_inline_schema
    process :auto_upgrade!

    import("'s Country Codes to Country Names list",
           :url => '',
           :format => :delimited,
           :delimiter => '; ',
           :headers => false,
           :skip => 22) do
      key   :iso_3166_code, :field_number => 0
      store :iso_3166_alpha_3_code, :field_number => 1
      store :iso_3166_numeric_code, :field_number => 2
      store :name, :field_number => 5

Now you can run:

>> Country.run_data_miner!
=> nil

More advanced usage

The earth library has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

Model Highlights Reference
Aircraft parsing Microsoft Frontpage HTML (!) data_miner.rb
Airports forcing column names and use of :select block (Proc) data_miner.rb
Automobile model variants super advanced usage of "custom parser" and errata data_miner.rb
Country parsing CSV and a few other tricks data_miner.rb
EGRID regions parsing XLS data_miner.rb
Flight segment (stage) super advanced usage of POSTing form data data_miner.rb
Zip codes downloading a ZIP file and pulling an XLSX out of it data_miner.rb

And many more - look for the data_miner.rb file that corresponds to each model. Note that you would normally put the data_miner declaration right inside the ActiveRecord model file... it's kept separate in earth so that loading it is optional.



  • Make the tests real unit tests
  • sql steps shouldn't shell out if binaries are missing


Copyright (c) 2013 Seamus Abshere