
initial commit

commit 2503ca8669761895f2d38e3b47dc470ca0be0e25 (0 parents), committed by @tilo on Jul 29, 2012
Showing with 281 additions and 0 deletions.
  1. +8 −0 .gitignore
  2. +1 −0 .rvmrc
  3. +4 −0 Gemfile
  4. +23 −0 LICENSE
  5. +97 −0 README.md
  6. +2 −0 Rakefile
  7. +9 −0 lib/extensions/hash.rb
  8. +4 −0 lib/smarter_csv.rb
  9. +113 −0 lib/smarter_csv/smarter_csv.rb
  10. +3 −0 lib/smarter_csv/version.rb
  11. +17 −0 smarter_csv.gemspec
8 .gitignore
@@ -0,0 +1,8 @@
+*~
+#*#
+*old
+*.bak
+*.gem
+.bundle
+Gemfile.lock
+pkg/*
1 .rvmrc
@@ -0,0 +1 @@
+rvm gemset use smarter_csv
4 Gemfile
@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+
+# Specify your gem's dependencies in smarter_csv.gemspec
+gemspec
23 LICENSE
@@ -0,0 +1,23 @@
+Copyright (c) 2012 Tilo Sloboda
+
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
97 README.md
@@ -0,0 +1,97 @@
+# SmarterCSV
+
+`smarter_csv` is a Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, suitable for direct processing with Mongoid or ActiveRecord,
+and parallel processing with Resque or Sidekiq.
+
+`smarter_csv` has lots of optional features:
+ * able to process large CSV-files
+ * able to chunk the input from the CSV file, to avoid loading the whole CSV file into memory
+ * able to return a Hash for each line of the CSV file, so we can quickly use the results to create MongoDB or ActiveRecord entries, or for further processing with Resque
+ * able to pass a block to the method, so data from the CSV file can be directly processed (e.g. Resque.enqueue )
+ * able to handle a more flexible input format, where comments are possible, and col_sep / row_sep can be set to any character sequence, including control characters
+ * able to re-map CSV "column names" to Hash-keys of your choice (normalization)
+ * able to ignore "columns" in the input (delete columns)
+ * able to eliminate nil or empty fields from the result hashes
+
+## Why?
+
+Ruby's CSV library's API is pretty old, and its processing of CSV-files, returning Arrays of Arrays, feels 'very close to the metal'. The output is not easy to use - especially if you want to create database records from it. Another shortcoming is that Ruby's CSV library does not have good support for huge CSV-files; e.g. there is no support for 'chunking' and/or parallel processing of the CSV-content (e.g. with Resque or Sidekiq).
+
+As the existing CSV libraries didn't fit my needs, I wrote my own CSV processing - specifically for use in connection with Rails ORMs like Mongoid, MongoMapper or ActiveRecord. In those ORMs you can easily pass a hash with attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept arrays of such hashes, so a large number of records can be created quickly with just one call.
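+
+For instance (an illustrative sketch - `User` is a hypothetical Mongoid or ActiveRecord model, not part of this gem):
+
+    # a hash of attribute/value pairs creates one record
+    User.create( {:first_name => 'Ada', :last_name => 'Lovelace'} )
+
+This is exactly the shape of hash which `smarter_csv` returns for each row of a CSV file.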
+
+## Example 1: Reading a CSV-File in one Chunk, returning one Array of Hashes:
+
+ filename = '/tmp/input_file.txt' # TAB delimited file, each row ending with Control-M
+ recordsA = SmarterCSV.process_csv(filename, {:col_sep => "\t", :row_sep => "\cM"})
+
+ => returns an array of hashes
+
+## Example 2: Populate a MySQL or MongoDB Database with SmarterCSV:
+
+ # without using chunks:
+ filename = '/tmp/some.csv'
+ n = SmarterCSV.process_csv(filename, {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}) do |array|
+      # we're passing a block in, to process each resulting hash / row (the block takes an array of hashes)
+ # when chunking is not enabled, there is only one hash in each array
+ MyModel.create( array.first )
+ end
+
+ => returns number of chunks / rows we processed
+
+
+## Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
+
+ # using chunks:
+ filename = '/tmp/some.csv'
+ n = SmarterCSV.process_csv(filename, {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}, :chunk_size => 100}) do |array|
+ # we're passing a block in, to process each resulting hash / row (block takes array of hashes)
+ # when chunking is enabled, there are up to :chunk_size hashes in each array
+ MyModel.collection.insert( array ) # insert up to 100 records at a time
+ end
+
+ => returns number of chunks we processed
+
+
+## Example 4: Reading a CSV-like File, and Processing it with Resque:
+
+ filename = '/tmp/strange_db_dump'   # a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes)
+ n = SmarterCSV.process_csv(filename, {:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
+        :chunk_size => 5 , :key_mapping => {:export_date => nil, :name => :genre}}) do |x|
+      puts "Resque.enqueue( ResqueWorkerClass, #{x.size}, #{x.inspect} )" # simulate processing each chunk
+ end
+ => returns number of chunks
+
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+ gem 'smarter_csv'
+
+And then execute:
+
+ $ bundle
+
+Or install it yourself as:
+
+ $ gem install smarter_csv
+
+## Usage
+
+In its simplest form, `SmarterCSV.process_csv` takes a filename and returns an Array of Hashes, one Hash per data row; see the examples above for chunking, key mapping and custom separators.
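+
+A minimal sketch (the file path and the 'email' column are illustrative, not part of the gem):
+
+    require 'smarter_csv'
+
+    # each data row becomes a Hash, keyed by the symbolized CSV header names
+    users = SmarterCSV.process_csv('/tmp/users.csv')
+    users.each{|user| puts user[:email] }   # assumes the CSV file has an 'email' header column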
+
+## Contributing
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Added some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
+
+
+## See also:
+
+ http://www.unixgods.org/~tilo/Ruby/process_csv_as_hashes.html
+
+ https://gist.github.com/3101950
+
2 Rakefile
@@ -0,0 +1,2 @@
+#!/usr/bin/env rake
+require "bundler/gem_tasks"
9 lib/extensions/hash.rb
@@ -0,0 +1,9 @@
+# the following extension for class Hash is needed (from Facets of Ruby library):
+
+class Hash
+ def self.zip(keys,values) # from Facets of Ruby library
+ h = {}
+ keys.size.times{ |i| h[ keys[i] ] = values[i] }
+ h
+ end
+end
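+
+# Illustrative usage (not part of the library): keys are paired with values by index,
+# and keys without a matching value map to nil - which is how short CSV rows
+# (with missing trailing fields) show up in the result:
+#
+#   Hash.zip([:a, :b, :c], [1, 2])   # => {:a => 1, :b => 2, :c => nil}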
4 lib/smarter_csv.rb
@@ -0,0 +1,4 @@
+require "smarter_csv/version"
+require "extensions/hash.rb"
+require "smarter_csv/smarter_csv.rb"
+
113 lib/smarter_csv/smarter_csv.rb
@@ -0,0 +1,113 @@
+require "extensions/hash.rb"
+
+module SmarterCSV
+ # this reads and processes a "generalized" CSV file and returns the contents either as an Array of Hashes,
+  # or an Array of Arrays containing Hashes, or processes Chunks of Hashes via a given block
+ #
+  # SmarterCSV.process_csv supports the following options:
+ # * :col_sep : column separator , which defaults to ','
+ # * :row_sep : row separator or record separator , defaults to system's $/ , which defaults to "\n"
+  # * :quote_char : quotation character , defaults to '"' (currently only used to strip quote characters from the header line)
+ # * :comment_regexp : regular expression which matches comment lines , defaults to /^#/ (see NOTE about the CSV header)
+ # * :chunk_size : if set, determines the desired chunk-size (defaults to nil, no chunk processing)
+ # * :remove_empty_fields : remove fields which have nil or empty strings as values (default: true)
+ #
+ # NOTES about CSV Headers:
+ # - as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
+ # - the first line with the CSV header may or may not be commented out according to the :comment_regexp
+  # - any occurrences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
+ # - any of the keys in the header line will be converted to Ruby symbols before being used in the returned Hashes
+ #
+ # NOTES on Key Mapping:
+ # - keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes
+ # can be better used internally in our application (e.g. when directly creating MongoDB entries with them)
+  # - if you want to completely delete a key, then map it to nil or to '' ; such keys will automatically be deleted from any result Hash
+ #
+ # NOTES on the use of Chunking and Blocks:
+  # - chunking can be VERY USEFUL if used in combination with passing a block to SmarterCSV.process_csv FOR LARGE FILES
+  # - if you pass a block to SmarterCSV.process_csv, that block will be executed and given an Array of Hashes as the parameter.
+ # If the chunk_size is not set, then the array will only contain one Hash.
+ # If the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
+ # This can be very useful when passing chunked data to a post-processing step, e.g. through Resque
+ #
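+  # Illustrative call (the filename and option values are examples, not defaults):
+  #
+  #   SmarterCSV.process_csv('/tmp/data.csv', :chunk_size => 100) do |chunk|
+  #     chunk.each{|hash| MyModel.create(hash) }   # each chunk holds up to 100 hashes
+  #   end
+  #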
+
+ def self.process_csv(filename, options={}, &block)
+ default_options = {:col_sep => ',' , :row_sep => $/ , :quote_char => '"', :remove_empty_fields => true,
+                       :comment_regexp => /^#/, :chunk_size => nil , :key_mapping => nil
+ }
+ options = default_options.merge(options)
+ headerA = []
+ result = []
+ old_row_sep = $/
+ begin
+ $/ = options[:row_sep]
+ f = File.open(filename, "r")
+
+ # process the header line in the CSV file..
+ # the first line of a CSV file contains the header .. it might be commented out, so we need to read it anyhow
+      headerA = f.readline.sub(options[:comment_regexp],'').chomp(options[:row_sep]).split(options[:col_sep]).map{|x| x.gsub(options[:quote_char],'').gsub(/\s+/,'_').to_sym}
+ key_mappingH = options[:key_mapping]
+
+ # do some key mapping on the keys in the file header
+ # if you want to completely delete a key, then map it to nil or to ''
+      if key_mappingH.is_a?(Hash) && ! key_mappingH.empty?
+ headerA.map!{|x| key_mappingH.has_key?(x) ? (key_mappingH[x].nil? ? nil : key_mappingH[x].to_sym) : x}
+ end
+
+ # in case we use chunking.. we'll need to set it up..
+ if ! options[:chunk_size].nil? && options[:chunk_size].to_i > 0
+ use_chunks = true
+ chunk_size = options[:chunk_size].to_i
+ chunk_count = 0
+ chunk = []
+      else
+        use_chunks = false
+        chunk_count = 0  # without chunking we count the processed rows, so a block-call still returns a count
+      end
+
+ # now on to processing all the rest of the lines in the CSV file:
+      while ! f.eof?   # we can't use f.readlines() here, because that would read the whole file into memory at once (and set eof to true)
+ line = f.readline # read one line.. this uses the input_record_separator $/ which we set previously!
+ next if line =~ options[:comment_regexp] # ignore all comment lines if there are any
+        line.chomp!    # will use $/ , which is set to options[:row_sep]
+
+ dataA = line.split(options[:col_sep])
+ hash = Hash.zip(headerA,dataA) # from Facets of Ruby library
+ # make sure we delete any key/value pairs from the hash, which the user wanted to delete:
+ hash.delete(nil); hash.delete(''); hash.delete(:"") # delete any hash keys which were mapped to be deleted
+ hash.delete_if{|k,v| v.nil? || v =~ /^\s*$/} if options[:remove_empty_fields]
+
+ if use_chunks
+ chunk << hash # append temp result to chunk
+
+          if chunk.size >= chunk_size || f.eof?   # if the chunk is full, or EOF is reached
+ # do something with the chunk
+ if block_given?
+ yield chunk # do something with the hashes in the chunk in the block
+ else
+ result << chunk # not sure yet, why anybody would want to do this without a block
+ end
+ chunk_count += 1
+ chunk = [] # initialize for next chunk of data
+ end
+ # while a chunk is being filled up we don't need to do anything else here
+
+ else # no chunk handling
+ if block_given?
+            yield [hash]      # do something with the hash in the block (better to use chunking here)
+            chunk_count += 1  # count each row processed through the block
+ else
+ result << hash
+ end
+ end
+ end
+    ensure
+      $/ = old_row_sep  # make sure this stupid global variable is always reset to its previous value after we're done!
+      f.close if f && ! f.closed?  # also close the input file
+    end
+    if block_given?
+      return chunk_count  # when we process through a block, we only care how many chunks (or rows) we processed
+    else
+      return result  # returns either an Array of Hashes, or an Array of Arrays of Hashes (if in chunked mode)
+    end
+  end
+
+end
3 lib/smarter_csv/version.rb
@@ -0,0 +1,3 @@
+module SmarterCSV
+ VERSION = "1.0.0.pre1"
+end
17 smarter_csv.gemspec
@@ -0,0 +1,17 @@
+# -*- encoding: utf-8 -*-
+require File.expand_path('../lib/smarter_csv/version', __FILE__)
+
+Gem::Specification.new do |gem|
+  gem.authors       = ["Tilo Sloboda"]
+  gem.email         = ["tilo.sloboda@gmail.com"]
+ gem.description = %q{Ruby Gem for smarter importing of CSV Files as Array(s) of Hashes, with optional features for processing large files in parallel, embedded comments, unusual field- and record-separators, flexible mapping of CSV-headers to Hash-keys}
+ gem.summary = %q{Ruby Gem for smarter importing of CSV Files (and CSV-like files), with lots of optional features, e.g. chunked processing for huge CSV files}
+ gem.homepage = ""
+
+ gem.files = `git ls-files`.split($\)
+ gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
+ gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
+ gem.name = "smarter_csv"
+ gem.require_paths = ["lib"]
+ gem.version = SmarterCSV::VERSION
+end
