Iota

Iota is a Clojure library for handling large text files in memory, and offers the following benefits;

Tuned for Clojure's reducers, letting you reduce over large files quickly.
Uses Java NIO's mmap() for rapid IO and handling files larger than available memory.
Efficiently stores data as it is represented in the file, and only converts to java.lang.String when necessary.
Offers efficient indexing and caching that emulates Clojure's native vector and seq data structures.
Adjustable buffer sizes for IO and caching, enabling tuning for specific data sets.

Why write this library?
I wanted to be able to use Clojure reducers against large text files to speed up data processing, and without needing more than 10% memory overhead. Due to Java's inefficient storage of Strings, I found that a 1GB TSV file consumed 10GB of RAM when loaded line by line into a Clojure vector.

Details

Iota offers iota/seq and iota/vec for two different use cases.
Both treat a line, as delimited by a byte separator (default is newline), as an element.

Differences	iota/seq	iota/vec
On creation	Quick, mmap's the file, and stops	Slow, mmap's the file and iterates throught the entire file to generate an index
Sequential access	Scans the buffer for the next byte separator	Quick, N records are read at once and cached
Random access	O(N), just don't	Quick, O(1) via index
Via reducers	Buffer is divided in half repeatedly until it is smaller than specified size, and then entire buffer is converted to String[] for processing	treated exactly like a Clojure vector, but each thread gets it's own cache.

Advice

If you'll only be reading the entire file at a time, then use iota/seq.
If you need random access across the file, then use iota/vec.
If you need random access and line numbers, then iota/numbered-vec.
WARNING If you're not using reducers with iota/seq, use Clojure's line-seq instead!

Note

for iota/vec and iota/seq, empty lines will return nil.
for iota/numbered-vec, empty lines will return the line number as a String.

Usage

(def file-vec (iota/vec filename)) ;; Map the file into memory, and generate index of lines. Slow.
(def file-seq (iota/seq filename)) ;; Map the file into memory. Quick.

;; Returns first line of file
(first file-vec) 
(first file-seq)

(last file-vec) ;; Returns last line of file

(nth file-vec 2) ;; Returns the 3rd line of the file

;; Count number of non-empty fields in TSV file
(->> (iota/seq filename)
     (clojure.core.reducers/filter identity) ;; filter out empty lines
     (clojure.core.reducers/map  #(->> (clojure.string/split % #"[\t]" -1)
                                       (filter (fn [^String s] (not (.isEmpty s)))) ;; Remove empty fields
                                       count))
     (clojure.core.reducers/fold +))

;; Skips the first line of the file, good for ignoring a header
(iota/subvec file-vec 1) 
(rest file-seq)

Known issues;

Records must be delimited by a single byte value, hence 2 Byte encodings like UTF-16 and UCS-2 can't be parsed correctly.

Artifacts

Iota artifacts are available on Clojars with instructions for leiningen, gradle, and maven.

License

MIT http://opensource.org/licenses/MIT

I'd also like to thank my employer Gracenote, for allowing me to create this open source port.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
src		src
test/iota		test/iota
.travis.yml		.travis.yml
README.md		README.md
project.clj		project.clj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Iota

Details

Advice

Note

Usage

Known issues;

Artifacts

License

About

Releases

Packages

Contributors 3

Languages

thebusby/iota

Folders and files

Latest commit

History

Repository files navigation

Iota

Details

Advice

Note

Usage

Known issues;

Artifacts

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages