A simple IO library for using Clojure's reducers
Java Clojure
Latest commit f99267b Apr 16, 2017 @thebusby committed on GitHub Merge pull request #9 from mangolas/fix-chunk-seq-on-large-csv
Fix a bug on rec-seq failing on very big files due interger overflow


Clojars Build Status


Iota is a Clojure library for handling large text files in memory, and offers the following benefits;

  • Tuned for Clojure's reducers, letting you reduce over large files quickly.
  • Uses Java NIO's mmap() for rapid IO and handling files larger than available memory.
  • Efficiently stores data as it is represented in the file, and only converts to java.lang.String when necessary.
  • Offers efficient indexing and caching that emulates Clojure's native vector and seq data structures.
  • Adjustable buffer sizes for IO and caching, enabling tuning for specific data sets.

Why write this library?
I wanted to be able to use Clojure reducers against large text files to speed up data processing, and without needing more than 10% memory overhead. Due to Java's inefficient storage of Strings, I found that a 1GB TSV file consumed 10GB of RAM when loaded line by line into a Clojure vector.


Iota offers iota/seq and iota/vec for two different use cases.
Both treat a line, as delimited by a byte separator (default is newline), as an element.

Differences iota/seq iota/vec
On creation Quick, mmap's the file, and stops Slow, mmap's the file and iterates throught the entire file to generate an index
Sequential access Scans the buffer for the next byte separator Quick, N records are read at once and cached
Random access O(N), just don't Quick, O(1) via index
Via reducers Buffer is divided in half repeatedly until it is smaller than specified size, and then entire buffer is converted to String[] for processing treated exactly like a Clojure vector, but each thread gets it's own cache.
  • If you'll only be reading the entire file at a time, then use iota/seq.
  • If you need random access across the file, then use iota/vec.
  • If you need random access and line numbers, then iota/numbered-vec.
  • WARNING If you're not using reducers with iota/seq, use Clojure's line-seq instead!
  • for iota/vec and iota/seq, empty lines will return nil.
  • for iota/numbered-vec, empty lines will return the line number as a String.


(def file-vec (iota/vec filename)) ;; Map the file into memory, and generate index of lines. Slow.
(def file-seq (iota/seq filename)) ;; Map the file into memory. Quick.

;; Returns first line of file
(first file-vec) 
(first file-seq)

(last file-vec) ;; Returns last line of file

(nth file-vec 2) ;; Returns the 3rd line of the file

;; Count number of non-empty fields in TSV file
(->> (iota/seq filename)
     (clojure.core.reducers/filter identity) ;; filter out empty lines
     (clojure.core.reducers/map  #(->> (clojure.string/split % #"[\t]" -1)
                                       (filter (fn [^String s] (not (.isEmpty s)))) ;; Remove empty fields
     (clojure.core.reducers/fold +))

;; Skips the first line of the file, good for ignoring a header
(iota/subvec file-vec 1) 
(rest file-seq) 

Known issues;

  • Records must be delimited by a single byte value, hence 2 Byte encodings like UTF-16 and UCS-2 can't be parsed correctly.


Iota artifacts are available on Clojars with instructions for leiningen, gradle, and maven.


MIT http://opensource.org/licenses/MIT

I'd also like to thank my employer Gracenote, for allowing me to create this open source port.

Copyright (C) 2012-2013 Alan Busby