bump patch - docs
seamusabshere committed May 4, 2012
1 parent 66b8117 commit d721d9d
Showing 17 changed files with 933 additions and 414 deletions.
12 changes: 5 additions & 7 deletions .gitignore
@@ -1,10 +1,8 @@
*.sw?
.DS_Store
coverage
rdoc
pkg
test/test.sqlite3
data_miner.log
/coverage
/rdoc
/pkg
Gemfile.lock
*.gem
test.log
/.yardoc
/doc
13 changes: 13 additions & 0 deletions CHANGELOG
@@ -1,3 +1,16 @@
2.0.2 / 2012-05-04

* Breaking changes

* Import descriptions are no longer optional
* Import options are no longer optional (but then, they never were)

* Enhancements

* Real documentation!
* Replace class-level mutexes with simple Thread.exclusive calls
* Simplified DataMiner::Dictionary

2.0.1 / 2012-04-18

* Enhancements
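Note on the 2.0.2 entry "Replace class-level mutexes with simple Thread.exclusive calls": the pattern involved is lazy initialization guarded by a lock. Below is a minimal sketch of that pattern, not the library's code. `Registry` is a hypothetical class, and the sketch deliberately keeps a per-instance `Mutex` rather than `Thread.exclusive`, because `Thread.exclusive` was deprecated in Ruby 2.3 and removed in Ruby 3.0, so the calls in this commit only run on older Rubies.

```ruby
require 'set'

# Hypothetical class, for illustration only; not part of data_miner.
class Registry
  def initialize
    @lock = Mutex.new
  end

  # Fast path: return the memoized set without taking the lock.
  # Slow path: take the lock, then re-check with ||= so only one thread builds the set.
  def model_names
    @model_names || @lock.synchronize do
      @model_names ||= Set.new
    end
  end
end

registry = Registry.new
# Ask for the set from several threads at once and collect what each one saw.
sets = 8.times.map { Thread.new { registry.model_names } }.map(&:value)
all_same = sets.all? { |s| s.equal?(registry.model_names) }
```

Every thread should end up holding the same `Set` object, which is the property the `@ivar || lock { @ivar ||= ... }` idiom in this commit is meant to guarantee.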
71 changes: 56 additions & 15 deletions README.markdown
@@ -1,6 +1,6 @@
# data_miner

Download and import XLS, ODS, XML, CSV, etc. into your ActiveRecord models.
Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

@@ -13,19 +13,23 @@ We use `data_miner` for [data science at Brighter Planet](http://brighterplanet.
* [Brighter Planet's reference data web service](http://data.brighterplanet.com)
* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)

The killer combination:
The killer combination for us is:

1. [`active_record_inline_schema`](https://github.com/seamusabshere/active_record_inline_schema) - define table structure
2. [`remote_table`](https://github.com/seamusabshere/remote_table) - download data and parse it
3. [`errata`](https://github.com/seamusabshere/errata) - apply corrections in a transparent way
4. [`data_miner`](https://github.com/seamusabshere/remote_table) (this library!) - import data idempotently

## Documentation

Check out the [extensive documentation](http://rdoc.info/github/seamusabshere/data_miner).

## Quick start

You define <tt>data_miner</tt> blocks in your ActiveRecord models. For example, in <tt>app/models/country.rb</tt>:
You define <code>data_miner</code> blocks in your ActiveRecord models. For example, in <code>app/models/country.rb</code>:

class Country < ActiveRecord::Base
self.primary_key = 'iso_3166_code'
self.primary_key = 'iso_3166_code'

data_miner do
import("OpenGeoCode.org's Country Codes to Country Names list",
@@ -44,20 +48,57 @@ You define <tt>data_miner</tt> blocks in your ActiveRecord models. For example,

Now you can run:

>> Country.run_data_miner!
=> nil
>> Country.run_data_miner!
=> nil

## More advanced usage

The [`earth` library](https://github.com/brighterplanet/earth) has dozens of real-life examples showing how to download, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

* https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/country/data_miner.rb - CSV and a few other tricks
* https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/egrid_region/data_miner.rb - XLS
* https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/zip_code.rb - pulling an XLSX out of a ZIP file
* https://github.com/brighterplanet/earth/blob/master/lib/earth/air/aircraft/data_miner.rb - parsing Microsoft Frontpage HTML
* https://github.com/brighterplanet/earth/blob/master/lib/earth/automobile/automobile_make_model_year_variant/data_miner.rb - super advanced usage showing "custom parser" and errata usage
* https://github.com/brighterplanet/earth/blob/master/lib/earth/air/flight_segment/data_miner.rb - super advanced usage showing submission of form data
* and many more - look for the `data_miner.rb` file that corresponds to each model.
The [`earth` library](https://github.com/brighterplanet/earth) has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

<table>
<tr>
<th>Model</th>
<th>Highlights</th>
<th>Reference</th>
</tr>
<tr>
<td><a href="http://data.brighterplanet.com/aircraft">Aircraft</a></td>
<td>parsing Microsoft Frontpage HTML (!)</td>
<td><a href="https://github.com/brighterplanet/earth/blob/master/lib/earth/air/aircraft/data_miner.rb">data_miner.rb</a></td>
</tr>
<tr>
<td><a href="http://data.brighterplanet.com/airports">Airports</a></td>
<td>forcing column names and use of <code>:select</code> block (<code>Proc</code>)</td>
<td><a href="https://github.com/brighterplanet/earth/blob/master/lib/earth/air/airport/data_miner.rb">data_miner.rb</a></td>
</tr>
<tr>
<td><a href="http://data.brighterplanet.com/automobile_make_model_year_variants">Automobile model variants</a></td>
<td>super advanced usage of "custom parser" and errata</td>
<td><a href="https://github.com/brighterplanet/earth/blob/master/lib/earth/automobile/automobile_make_model_year_variant/data_miner.rb">data_miner.rb</a></td>
</tr>
<tr>
<td><a href="http://data.brighterplanet.com/countries">Country</a></td>
<td>parsing CSV and a few other tricks</td>
<td><a href="https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/country/data_miner.rb">data_miner.rb</a></td>
</tr>
<tr>
<td><a href="http://data.brighterplanet.com/egrid_regions">EGRID regions</a></td>
<td>parsing XLS</td>
<td><a href="https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/egrid_region/data_miner.rb">data_miner.rb</a></td>
</tr>
<tr>
<td><a href="http://data.brighterplanet.com/flight_segments">Flight segment (stage)</a></td>
<td>super advanced usage of POSTing form data</td>
<td><a href="https://github.com/brighterplanet/earth/blob/master/lib/earth/air/flight_segment/data_miner.rb">data_miner.rb</a></td>
</tr>
<tr>
<td><a href="http://data.brighterplanet.com/zip_codes">Zip codes</a></td>
<td>downloading a ZIP file and pulling an XLSX out of it</td>
<td><a href="https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/zip_code.rb">data_miner.rb</a></td>
</tr>
</table>

And many more - look for the `data_miner.rb` file that corresponds to each model. Note that you would normally put the `data_miner` declaration right inside the ActiveRecord model file... it's kept separate in `earth` so that loading it is optional.
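The README describes `data_miner` as importing data idempotently, with `key` naming the identifying column and `store` naming the columns to write. As a rough illustration of what idempotent means here, the sketch below uses a hypothetical in-memory `FakeTable`. It is an assumption-laden stand-in, not how `data_miner` is implemented: rows are looked up by key, created if missing, and overwritten otherwise, so a second run leaves the table unchanged.

```ruby
# Hypothetical in-memory stand-in for a table; illustration only,
# not data_miner's implementation.
class FakeTable
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Find the row by its key value or create it, then overwrite the stored attributes.
  def import(records, key)
    records.each do |record|
      row = (@rows[record.fetch(key)] ||= {})
      row.merge!(record)
    end
    self
  end
end

source = [
  { :iso_3166_code => 'US', :name => 'United States' },
  { :iso_3166_code => 'FR', :name => 'France' }
]

table = FakeTable.new
table.import(source, :iso_3166_code)
first_pass = Marshal.load(Marshal.dump(table.rows)) # deep copy for comparison
table.import(source, :iso_3166_code)                # second run changes nothing
```

Because rows are addressed by key rather than appended, re-running the import against the same source yields the same two rows.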

## Authors

4 changes: 2 additions & 2 deletions data_miner.gemspec
@@ -7,8 +7,8 @@ Gem::Specification.new do |s|
s.authors = ["Seamus Abshere", "Andy Rossmeissl", "Derek Kastner"]
s.email = ["seamus@abshere.net"]
s.homepage = "https://github.com/seamusabshere/data_miner"
s.summary = %{Mine remote data into your ActiveRecord models.}
s.description = %q{Mine remote data into your ActiveRecord models. You can also convert units.}
s.summary = %{Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.}
s.description = %q{Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. You can also convert units.}

s.rubyforge_project = "data_miner"

38 changes: 26 additions & 12 deletions lib/data_miner.rb
@@ -14,7 +14,7 @@
end
end

require 'data_miner/active_record_extensions'
require 'data_miner/active_record_class_methods'
require 'data_miner/attribute'
require 'data_miner/script'
require 'data_miner/dictionary'
@@ -24,14 +24,13 @@
require 'data_miner/step/process'
require 'data_miner/run'

# A singleton class that holds global configuration for data mining.
#
# All of its instance methods are delegated to +DataMiner.instance+, so you can call +DataMiner.model_names+, for example.
#
# @see DataMiner::ActiveRecordClassMethods#data_miner Overview of how to define data miner scripts inside of ActiveRecord models.
class DataMiner
class << self
delegate :perform, :to => :instance
delegate :run, :to => :instance
delegate :logger, :to => :instance
delegate :logger=, :to => :instance
delegate :model_names, :to => :instance

# @private
def downcase(str)
defined?(::UnicodeUtils) ? ::UnicodeUtils.downcase(str) : str.downcase
Expand All @@ -48,16 +47,20 @@ def compress_whitespace(str)
end
end

MUTEX = ::Mutex.new
INNER_SPACE = /[ ]+/

include ::Singleton

attr_writer :logger

# Run data miner scripts on models identified by their names. Defaults to all models.
#
# @param [optional, Array<String>] model_names Names of models to be run.
#
# @return [Array<DataMiner::Run>]
def perform(model_names = DataMiner.model_names)
Script.uniq do
model_names.each do |model_name|
model_names.map do |model_name|
model_name.constantize.run_data_miner!
end
end
@@ -66,8 +69,11 @@ def perform(model_names = DataMiner.model_names)
# legacy
alias :run :perform

# Where DataMiner logs to. Defaults to +Rails.logger+ or +ActiveRecord::Base.logger+ if either is available.
#
# @return [Logger]
def logger
@logger || MUTEX.synchronize do
@logger || ::Thread.exclusive do
@logger ||= if defined?(::Rails)
::Rails.logger
elsif defined?(::ActiveRecord) and active_record_logger = ::ActiveRecord::Base.logger
@@ -79,12 +85,20 @@ def logger
end
end

# Names of the models that have defined a data miner script.
#
# @note Models won't appear here until the files containing their data miner scripts have been +require+'d.
#
# @return [Set<String>]
def model_names
@model_names || MUTEX.synchronize do
@model_names || ::Thread.exclusive do
@model_names ||= ::Set.new
end
end

class << self
delegate(*DataMiner.instance_methods(false), :to => :instance)
end
end

::ActiveRecord::Base.extend ::DataMiner::ActiveRecordExtensions
::ActiveRecord::Base.extend ::DataMiner::ActiveRecordClassMethods
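Note on the delegation change in `lib/data_miner.rb`: the per-method `delegate` lines are replaced by a single call that forwards every method defined directly on the class to the singleton instance. This is presumably why the `class << self` block moves to the bottom of the class, since `instance_methods(false)` only sees methods defined before it is called. A minimal standalone sketch of the same idea, assuming a hypothetical `Settings` class and using the standard library's `Forwardable` in place of ActiveSupport's `delegate`:

```ruby
require 'singleton'
require 'forwardable'

# Hypothetical class, for illustration only; not part of data_miner.
class Settings
  include Singleton

  attr_writer :log_level

  def log_level
    @log_level ||= :info
  end

  # Must come after the instance methods above, otherwise
  # instance_methods(false) would not include them yet.
  class << self
    extend Forwardable
    # Forward each public instance method defined directly on Settings
    # (here: log_level and log_level=) to the singleton instance.
    def_delegators :instance, *Settings.instance_methods(false)
  end
end

Settings.log_level = :debug
```

After this, `Settings.log_level` and `Settings.instance.log_level` read the same state, much as `DataMiner.model_names` is described as delegating to `DataMiner.instance`.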
108 changes: 108 additions & 0 deletions lib/data_miner/active_record_class_methods.rb
@@ -0,0 +1,108 @@
require 'active_record'
require 'lock_method'

class DataMiner
# Class methods that are mixed into models (i.e. ActiveRecord::Base)
module ActiveRecordClassMethods
# Access this model's script.
#
# @return [DataMiner::Script] This model's data miner script.
def data_miner_script
@data_miner_script || ::Thread.exclusive do
@data_miner_script ||= DataMiner::Script.new(self)
end
end

# Access to recordkeeping.
#
# @return [ActiveRecord::Relation] Records of running the data miner script.
def data_miner_runs
DataMiner::Run.scoped :conditions => { :model_name => name }
end

# Run this model's script.
#
# @return [DataMiner::Run]
def run_data_miner!
data_miner_script.perform
end

# Run the data miner scripts of parent associations. Useful for dependencies. Safe to call using +process+.
#
# @note Used extensively in https://github.com/brighterplanet/earth
#
# @example Since Provinces depend on Countries, make sure Countries are data mined first
# class Country < ActiveRecord::Base
# [...some data miner script...]
# end
# class Province < ActiveRecord::Base
# belongs_to :country
# data_miner do
# [...]
# process "make sure my dependencies have been loaded" do
# run_data_miner_on_parent_associations!
# end
# [...]
# end
# end
#
# @return [Array<DataMiner::Run>]
def run_data_miner_on_parent_associations!
reflect_on_all_associations(:belongs_to).reject do |assoc|
assoc.options[:polymorphic]
end.map do |non_polymorphic_belongs_to_assoc|
non_polymorphic_belongs_to_assoc.klass.run_data_miner!
end
end

# Define a data miner script.
#
# @param [optional, Hash] options
# @option options [TrueClass, FalseClass] :append (false) Add steps to existing data miner script instead of starting from scratch.
#
# @yield [] The block defining the steps.
#
# @see DataMiner::Script#import
# @see DataMiner::Script#process
# @see DataMiner::Script#tap
#
# @example Creating steps
# class MyModel < ActiveRecord::Base
# data_miner do
# process [...]
# import [...]
# import [...yes, it's ok to have more than one import step...]
# process [...]
# [...etc...]
# end
# end
#
# @example From the README
# class Country < ActiveRecord::Base
# self.primary_key = 'iso_3166_code'
# data_miner do
# import("OpenGeoCode.org's Country Codes to Country Names list",
# :url => 'http://opengeocode.org/download/countrynames.txt',
# :format => :delimited,
# :delimiter => '; ',
# :headers => false,
# :skip => 22) do
# key :iso_3166_code, :field_number => 0
# store :iso_3166_alpha_3_code, :field_number => 1
# store :iso_3166_numeric_code, :field_number => 2
# store :name, :field_number => 5
# end
# end
# end
#
# @return [nil]
def data_miner(options = {}, &blk)
DataMiner.model_names.add name
unless options[:append]
@data_miner_script = nil
end
data_miner_script.append_block blk
nil
end
end
end
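Note on the `:append` option documented above: calling `data_miner` again on the same model discards the existing script unless `:append => true` is passed. The sketch below mirrors that control flow with hypothetical stand-ins (`FakeScript`, `Scriptable`, `define_script`). These names are assumptions for illustration and are not the library's classes, and the locking from the real method is omitted.

```ruby
# Hypothetical stand-ins, for illustration only; not data_miner's classes.
class FakeScript
  attr_reader :blocks

  def initialize
    @blocks = []
  end

  def append_block(blk)
    @blocks << blk
  end
end

module Scriptable
  # Lazily build the script the first time it is needed.
  def script
    @script ||= FakeScript.new
  end

  # Without :append => true, any previously defined script is thrown away first.
  def define_script(options = {}, &blk)
    @script = nil unless options[:append]
    script.append_block blk
    nil
  end
end

class Widget
  extend Scriptable
  define_script { :first }
  define_script(:append => true) { :second } # keeps :first, adds :second
end

class Gadget
  extend Scriptable
  define_script { :first }
  define_script { :replacement } # starts from scratch
end
```

`Widget` ends up with both blocks in order, while `Gadget` keeps only the block from its last definition.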
38 changes: 0 additions & 38 deletions lib/data_miner/active_record_extensions.rb

This file was deleted.
