errata

Define an errata in table format (CSV) and then apply it to an arbitrary source. Inspired by RFC Errata, lets you keep your own errata in a transparent way.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

Inspiration

There's a process for reporting errata on RFCs:

screenshot of the RFC Editor

Example

Every errata has a table structure based on the IETF RFC Editor's "How to Report Errata".

| date | name | email | type | section | action | x | y | condition | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | meta | Intended use | | http://example.com/original-data-with-errors.xls | | | A hypothetical document that uses non-ISO country names |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /ANTIGUA & BARBUDA/ | ANTIGUA AND BARBUDA | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /BOLIVIA/ | BOLIVIA, PLURINATIONAL STATE OF | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /BOSNIA & HERZEGOVINA/ | BOSNIA AND HERZEGOVINA | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /BRITISH VIRGIN ISLANDS/ | VIRGIN ISLANDS, BRITISH | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /COTE D'IVOIRE/ | CÔTE D'IVOIRE | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /DEM\. PEOPLE'S REP\. OF KOREA/ | KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /DEM\. REP\. OF THE CONGO/ | CONGO, THE DEMOCRATIC REPUBLIC OF THE | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /HONG KONG SAR/ | HONG KONG | | |
| 2011-03-22 | Ian Hough | ian@brighterplanet.com | technical | Country Name | replace | /IRAN \(ISLAMIC REPUBLIC OF\)/ | IRAN, ISLAMIC REPUBLIC OF | | |

Which would be saved as a CSV:

date,name,email,type,section,action,x,y,condition,notes
2011-03-22,Ian Hough,ian@brighterplanet.com,meta,Intended use,,http://example.com/original-data-with-errors.xls,,,A hypothetical document that uses non-ISO country names
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/ANTIGUA & BARBUDA/,ANTIGUA AND BARBUDA,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BOLIVIA/,"BOLIVIA, PLURINATIONAL STATE OF",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BOSNIA & HERZEGOVINA/,BOSNIA AND HERZEGOVINA,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BRITISH VIRGIN ISLANDS/,"VIRGIN ISLANDS, BRITISH",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/COTE D'IVOIRE/,CÔTE D'IVOIRE,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/DEM\.  PEOPLE'S REP\. OF KOREA/,"KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/DEM\. REP\. OF THE CONGO/,"CONGO, THE DEMOCRATIC REPUBLIC OF THE",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/HONG KONG SAR/,HONG KONG,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/IRAN \(ISLAMIC REPUBLIC OF\)/,"IRAN, ISLAMIC REPUBLIC OF",,

And then used:

require 'errata'
require 'remote_table'

# Load the errata table, then the source data it should be applied to
errata = Errata.new(:url => 'http://example.com/errata.csv')
original = RemoteTable.new(:url => 'http://example.com/original-data-with-errors.xls')
original.each do |row|
  errata.correct! row # destructively correct each row
end
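
To make the idea of a "replace" erratum concrete, here is a minimal, self-contained sketch that uses only Ruby's standard `csv` library. It is not errata's actual implementation: the helper names `parse_pattern` and `apply_replacements!` are hypothetical, the two-row `ERRATA_CSV` is a cut-down copy of the table above, and it assumes that a matching `x` pattern is substituted with `y` inside the column named by `section`.

```ruby
require 'csv'

# Cut-down copy of the errata file shown above (assumption: two rows are enough to illustrate).
ERRATA_CSV = <<~'EOS'
  date,name,email,type,section,action,x,y,condition,notes
  2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BOLIVIA/,"BOLIVIA, PLURINATIONAL STATE OF",,
  2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/HONG KONG SAR/,HONG KONG,,
EOS

# Hypothetical helper: turn a "/.../" cell into a Regexp.
# Cells without surrounding slashes are treated as literal text.
def parse_pattern(cell)
  source = cell[%r{\A/(.*)/\z}m, 1]
  source ? Regexp.new(source) : Regexp.new(Regexp.escape(cell))
end

# Hypothetical helper: apply every "replace" erratum to a row (a Hash), in place.
def apply_replacements!(row, errata_rows)
  errata_rows.each do |erratum|
    next unless erratum['action'] == 'replace'
    column = erratum['section']
    value = row[column]
    next if value.nil?
    replacement = erratum['y'].to_s
    # Block form so backslashes in the replacement are taken literally.
    row[column] = value.gsub(parse_pattern(erratum['x'])) { replacement }
  end
  row
end

errata_rows = CSV.parse(ERRATA_CSV, headers: true).map(&:to_h)
row = { 'Country Name' => 'HONG KONG SAR' }
apply_replacements!(row, errata_rows)
puts row['Country Name'] # prints "HONG KONG"
```

Rows whose values match none of the patterns, or that lack the `section` column, pass through unchanged.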

UTF-8

errata assumes all input strings are UTF-8. Otherwise there can be problems with Ruby 1.9 and Regexp::FIXEDENCODING. Specifically, an ASCII-8BIT regexp might be applied to a UTF-8 string (or vice versa), raising Encoding::CompatibilityError.
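
The failure mode described above can be reproduced with plain Ruby, and one way to avoid it is to normalize values to UTF-8 before handing rows over. The sketch below is an illustration under assumptions, not something errata does for you: `to_utf8` is a hypothetical helper, and it assumes that strings tagged ASCII-8BIT actually contain UTF-8 bytes.

```ruby
# Hypothetical helper: return a UTF-8 copy of a string.
# Assumption: ASCII-8BIT (binary) strings really hold UTF-8 bytes and only
# need relabeling; strings in any other encoding are transcoded.
def to_utf8(value)
  str = value.dup
  if str.encoding == Encoding::ASCII_8BIT
    str.force_encoding(Encoding::UTF_8)
  else
    str.encode(Encoding::UTF_8)
  end
end

# A regexp built from a non-ASCII UTF-8 string has a fixed UTF-8 encoding.
pattern = Regexp.new("C\u00D4TE")

# The same bytes, but tagged as ASCII-8BIT, as some parsers may hand them over.
binary = "C\u00D4TE D'IVOIRE".b

begin
  binary =~ pattern
rescue Encoding::CompatibilityError => e
  puts "mismatch: #{e.class}"
end

# After normalizing, the match succeeds.
puts to_utf8(binary) =~ pattern ? 'matched' : 'no match'
```

Relabeling with `force_encoding` does not change any bytes, so it is only appropriate when the data really is UTF-8; for data in another known encoding, transcoding with `encode` is the safer path.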

More advanced usage

The earth library has dozens of real-life examples showing errata in action:

| Model | Reference | Errata file |
| --- | --- | --- |
| Country | data_miner.rb | wri_errata.csv |
| Aircraft | data_miner.rb | faa_errata.csv |
| Airports | data_miner.rb | openflights_errata.csv |
| Automobile model variants | data_miner.rb | feg_errata.csv |

Real-world usage

We use errata for data science at Brighter Planet and in production at

The killer combination:

  1. active_record_inline_schema - define table structure
  2. remote_table - download data and parse it
  3. errata (this library!) - apply corrections in a transparent way
  4. data_miner - import data idempotently

Authors

Copyright

Copyright (c) 2012 Brighter Planet. See LICENSE for details.
