Speed up projects.json file #273

lizmat · 2018-09-20T09:12:22Z

Please find below the code I wrote to hyper the parsing of the "projects.json" file. This drops the parsing from 2.2 seconds to 1 second on my machine (without the new quick parser on rakudo) and to 0.3 with the new quick parser.

It depends on how the "projects.json" file is generated: the first line must be [, the last line must be ], and the entries in between must be separated by lines consisting of just a comma (,).

I'm not sure where to hook that into zef, so I'm adding it here as an issue.

use Test;
use nqp;

my $file = "projects.json";
my $meta = $file.IO.slurp;
my $then = now;
my @old := Rakudo::Internals::JSON.from-json($meta);
say "Old: {now - $then}";

my class Chunkify does Iterator {
    has $.iterator;
    has str @!lines;
    has int $!done;

    method TWEAK() {
        die "did not see beginning" unless $!iterator.pull-one eq '['
    }

    method !done() {
        my \chunk = @!lines.join(" ");
        @!lines = ();
        chunk
    }

    method pull-one() {
        if $!done {
            IterationEnd
        }
        else {
            until (my \pulled = $!iterator.pull-one) =:= IterationEnd {
                if pulled eq ',' {
                    return self!done;
                }
                elsif pulled eq ']' {
                    $!done = 1;
                    return self!done;
                }
                else {
                    @!lines.push(pulled);
                }
            }
            die 'did not see end';
        }
    }
}

sub parse-projects($file) {
    Seq.new(Chunkify.new( iterator => $file.IO.lines.iterator ))
      .hyper.map: { Rakudo::Internals::JSON.from-json($_) }
}

$then = now;
my @new = parse-projects($file);
say "New: {now - $then}";
#dd @new;

#is-deeply @new[$_], @old[$_] for ^@old;


This parser only handles hashes, arrays and strings and will return Nil if *anything* goes wrong. It is now plugged into "R:I:JSON.from-json", which first attempts the very fast parser; if that fails, it falls back to the original, slow, grammar-based parser. How much faster: at least 10x faster!
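The try-fast-then-fall-back behaviour described here can be sketched roughly as follows. This is a minimal illustration and not the code from the Rakudo commit: `fast-parse`, `slow-parse` and `parse-json` are hypothetical names, and the toy fast path only recognizes a plain quoted string without escapes. The point is that a `Nil` result is undefined, so the `//` (defined-or) operator hands the input to the generic parser.

```raku
# Hypothetical stand-in for a fast, limited parser: it only accepts a
# quoted string without escapes and returns Nil for anything else.
sub fast-parse(Str $text) {
    $text ~~ /^ '"' (<-["\\]>*) '"' $/ ?? ~$0 !! Nil
}

# The generic parser, used here as the slow fallback.
sub slow-parse(Str $text) {
    Rakudo::Internals::JSON.from-json($text)
}

# Try the fast path first; Nil is undefined, so // falls through.
sub parse-json(Str $text) {
    fast-parse($text) // slow-parse($text)
}

say parse-json('"zef"');      # handled by the toy fast path
say parse-json('[1, 2, 3]');  # falls back to the generic parser
```

With this shape the caller never needs to know which parser produced the result, which is what keeps the change invisible to zef.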

ugexe · 2018-09-20T12:32:42Z

Technically this works, but only for the ecosystem data specifically generated and served by me. It wouldn’t work on the ecosystem data generated by ecosystem-api.p6c.org (a default) or any darkpan-like.

niner · 2018-09-20T12:47:10Z

On Thursday, 20 September 2018 14:32:43 CEST Nick Logan wrote:
> Technically this works, but only for the ecosystem data specifically generated and served by me. It wouldn’t work on the ecosystem data generated by ecosystem-api.p6c.org (a default) or any darkpan-like.

There's a way out: you can use the slow, generic JSON parser after downloading the ecosystem data from wherever, and then have it generate a more easily parseable version suitable for the fast parser. That's a low-cost in-between solution that would get you most of the benefit of what I'd do, which is just putting the data into an SQLite database after downloading and parsing it.
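The in-between solution could look roughly like this: parse once with the generic parser, then write the entries back out in the layout that the line-based chunker from the first comment expects. This is a sketch only. `write-chunkable` and the file name are made up for illustration, and it assumes that `Rakudo::Internals::JSON.to-json` accepts `:!pretty` to emit compact JSON without newlines.

```raku
# Hypothetical helper: rewrite a parsed list of project hashes as
# "[" / entry / "," / entry / ... / "]" lines.
sub write-chunkable(@projects, $target) {
    my @encoded = @projects.map: {
        # Assumption: :!pretty yields compact JSON without newlines.
        Rakudo::Internals::JSON.to-json($_, :!pretty)
    }
    $target.IO.spurt: "[\n" ~ @encoded.join("\n,\n") ~ "\n]\n";
}

# Illustrative usage with made-up data and a made-up file name.
my @projects = { name => "Foo", version => "0.1" },
               { name => "Bar", version => "0.2" };
write-chunkable(@projects, $*TMPDIR.add("projects-chunkable.json"));
```

The rewritten file is still valid JSON, so the generic parser can read it too; only the line layout is normalized.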

ugexe · 2018-09-20T13:18:14Z

I'm content with trading slower parsing for less technical debt.

lizmat · 2018-09-20T13:39:52Z

Please find below another version of the parallel parsing. This version does not depend on any particular line layout: it parses both CPAN and ecosystem JSON files. On my machine this reduces the parse time from 2.3 seconds to about 0.9 seconds. I think this presents a reasonable speed increase / technical debt tradeoff, as it only chunks the JSON files and then uses whatever parser is around to parse the chunks in parallel. The only thing it expects is the first non-whitespace char to be a [ and the last non-whitespace char to be a ].

use Test;
use nqp;

my $file = "projects2.json";
my $meta = $file.IO.slurp;
my $then = now;
my @old := Rakudo::Internals::JSON.from-json($meta);
say "Old: {now - $then}";

my class Chunkify does Iterator {
    has str $.json;
    has int $!pos;

    my int $backslash   = ord('\\');
    my int $open-array  = ord('[');
    my int $close-array = ord(']');
    my int $open-hash   = ord('{');
    my int $close-hash  = ord('}');

    method TWEAK() {
        die "did not see beginning"
          unless nqp::ordat($!json,$!pos) == $open-array;
    }

    method pull-one() {
        my int $ord;
        my int $start;
        my int $open-hashes;
        my int $pos   = $!pos;
        my int $chars = nqp::chars($!json);

        while ++$pos < $chars {
            $ord = nqp::ordat($!json,$pos);
            if $ord == $backslash {
                # no action, we ignore backslashes
            }
            elsif $ord == $close-hash {
                unless --$open-hashes {
                    die "no start found" unless $start;
                    $!pos = $pos;
                    return nqp::substr($!json,$start,$pos - $start + 1)
                }
            }
            elsif $ord == $open-hash {
                $start = $pos unless $open-hashes++;
            }
        }
        nqp::ordat($!json,$pos - 1) == $close-array
          ?? IterationEnd
          !! die 'did not see end';
    }
}

sub parse-projects($file) {
    Seq.new(Chunkify.new( json => $file.IO.slurp.trim ))
      .hyper(:32batch).map: { Rakudo::Internals::JSON.from-json($_) }
}

$then = now;
my @new = parse-projects($file);
say "New: {now - $then}";
#dd @new;

#is-deeply @new[$_], @old[$_] for ^@old;
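One caveat with the chunker above: it counts every { and } it sees, including any that occur inside string values, so a brace in a string would throw off the depth count. A possible string-aware variant is sketched below. It is not part of the posted code: `chunk-objects` is a hypothetical name, and it uses plain Raku string methods rather than nqp ops, trading speed for readability.

```raku
# Sketch: split a JSON array of objects into one string per top-level
# object, ignoring braces that appear inside string literals.
sub chunk-objects(Str $json) {
    my @chunks;
    my int $depth = 0;
    my int $start = -1;
    my Bool $in-string = False;
    my Bool $escaped   = False;

    for $json.comb.kv -> $i, $c {
        if $in-string {
            if $escaped      { $escaped   = False }  # char after a backslash
            elsif $c eq '\\' { $escaped   = True  }  # start of an escape
            elsif $c eq '"'  { $in-string = False }  # closing quote
        }
        elsif $c eq '"' { $in-string = True }
        elsif $c eq '{' { $start = $i if $depth++ == 0 }
        elsif $c eq '}' {
            # Back at depth zero: a complete top-level object.
            if --$depth == 0 {
                @chunks.push: $json.substr($start, $i - $start + 1);
            }
        }
    }
    @chunks
}

# Each chunk could then be handed to whatever parser is available.
.say for chunk-objects('[{"a":"}"},{"b":{"c":1}}]');
```

The same in-string and escape tracking could be folded into the iterator's `pull-one` loop if the native-op speed is needed.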

lizmat · 2018-09-20T13:59:32Z

Never mind: I just realized we could hook this into R:I:JSON.from-json, which would relieve zef of the technical debt.

lizmat · 2018-09-20T15:28:52Z

rakudo/rakudo@c9432c2072

ugexe · 2018-09-20T15:44:07Z

Hmm, but I do not see a speed difference between 2018.08 and blead for `time zef search HTTP`

fwiw I get ~7s with or without the JSON improvement, and 4.8s with JSON::Fast

ugexe · 2018-09-20T20:31:55Z

I see a speedup for `time perl6 -e 'from-json("projects1.json".IO.slurp)'` (2.5s to 1.6s), but for whatever reason that doesn't seem to improve speed for `zef search ...`

lizmat closed this as completed Sep 20, 2018
