
Parser combinator framework consumes an unnecessary amount of memory #319

Open
@scabug

Description

Since all Readers provided by the parser combinator framework use a PagedSeq internally, these parsers are effectively unusable for large files: the PagedSeq never releases elements that have already been parsed.
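
A minimal sketch of why this happens (assuming the stock StreamReader / PagedSeqReader classes from scala.util.parsing.input; the input string here is made up for illustration): StreamReader wraps the java.io.Reader in a PagedSeq[Char], and every position derived from the resulting Reader still points into that same sequence, so pages that have already been consumed stay reachable.

import scala.collection.immutable.PagedSeq
import scala.util.parsing.input.PagedSeqReader

// StreamReader(reader) builds essentially this internally:
val seq   = PagedSeq.fromReader(new java.io.StringReader("header\nbody"))
val start = new PagedSeqReader(seq)
// Advancing the position does not free anything: `later` still references the same
// PagedSeq, so the pages holding "header\n" remain reachable while parsing continues.
val later = start.drop(7)
println(later.first)   // 'b'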

For example, consider parsing a 1 GB file from which you only need a small portion of the information (you may want to skip headers, comments, etc.). The PagedSeq will hold on to the whole 1 GB until parsing finishes and the GC can finally step in.

Example code:

import scala.collection.immutable.PagedSeq   // used internally by parseAll(p, reader)
import scala.util.parsing.combinator._
import scala.util.parsing.input._

// virtual file reader (simulates a ~400 MB file full of 't' characters)
def in = new java.io.Reader {
  var buffersRead = 0
  def read(cbuf: Array[Char], offset: Int, len: Int) = {
    if (buffersRead < 100000) {
      java.util.Arrays.fill(cbuf, offset, offset + len, 't')
      buffersRead += 1
      len
    } else -1   // end of stream after ~100000 buffers
  }
  def close() {}
}

def parser = new RegexParsers {
  var gcCountdown = 0
  // A parser that succeeds with the current character and then skips ahead 1024
  // positions; it forces a full GC every 10000 invocations to show that the
  // already-consumed pages are still not reclaimed.
  def tt = new Parser[Char] {
    def apply(in: Input) = {
      gcCountdown += 1
      if (gcCountdown > 10000) {
        System.gc()
        gcCountdown = 0
      }
      if (in.atEnd)
        Failure("", in)
      else
        Success(in.first, in.drop(1024))
    }
  }
  // Parse the whole stream and return how many characters were kept.
  def go(in: java.io.Reader) = parseAll(tt.*, in).get.size
}
println(parser.go(in))

If you watch memory usage with a tool such as jvisualvm, you will notice that this process consumes about 800 MB of RAM, even though the parser only produces about 400 KB worth of characters (it keeps one character and then skips 1024 positions each step).
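
For anyone who wants to confirm this without attaching jvisualvm, here is a rough measurement sketch of my own (the sampler thread and the peakMb counter are illustrative additions, not part of the framework): poll the heap from a daemon thread while the parse runs and report the peak.

import java.util.concurrent.atomic.AtomicLong

// Record the largest used-heap value seen while parsing is in progress.
val peakMb = new AtomicLong(0L)
val sampler = new Thread(new Runnable {
  def run(): Unit = while (true) {
    val rt   = Runtime.getRuntime
    val used = (rt.totalMemory - rt.freeMemory) / (1024L * 1024L)
    if (used > peakMb.get) peakMb.set(used)
    Thread.sleep(100)   // sample every 100 ms
  }
})
sampler.setDaemon(true)   // don't keep the JVM alive after parsing
sampler.start()

println(parser.go(in))
println(s"peak heap observed: ~${peakMb.get} MB")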
