Weird memory usage of random-find benchmark #23

Open
squeek502 opened this issue Jun 3, 2022 · 6 comments

@squeek502
Contributor

squeek502 commented Jun 3, 2022

The maxrss results of the random-find benchmark have some anomalous spikes for unknown reasons:

[graph: random-find maxrss over time]

From IRC:

<squeek502> andrewrk, is it expected that the deflate improvements would lead to a speedup in a lot of the benchmarks (zig-fmt, ast-check-os, etc)?
<andrewrk> mmm I don't think so. there are no dependencies on std lib deflate in the compiler
<squeek502> strange, there's a big spike down at exactly that PR being merged for a lot of benchmarks
<andrewrk> is it around jan 25? that also happens to be the end of the backfill script, so perhaps there was some kind of externality accidentally leaking into the measurements?
<squeek502> ah yeah it is, that makes more sense
<andrewrk> it's most noticeable on Max RSS of random-find
<andrewrk> I have no idea what is happening in this graph
<andrewrk> but the big dip is the end of the backfill script
<andrewrk> ummmm maybe it's the hash table of the records.csv file in the backfill script itself, and we're unintentionally including the parent process memory in the measurement
<andrewrk> so as it goes backwards over time, more measurements collected, more hash table memory
<squeek502> ah that would be interesting, wouldn't explain the CPU instruction count/wall clock/etc dropping though
<andrewrk> agreed, but at least it shows you where the script stopped running
<squeek502> zig-fmt benchmark is the most noticeable for the instruction count/wall clock
<andrewrk> hmm maybe are we accidentally using the (more recent) zig exe that is running the backfill script, instead of using the freshly built zig exe from the relevant commit

I've tried testing the theory that the memory usage of the backfill script itself is being included and I ultimately don't think that's what's happening.

To test this, I added:

    // allocate 100MiB to attempt to skew the results to check if the backfill process is included
    const dummy = try arena.alloc(u8, 100 * 1024 * 1024);
    _ = dummy;

to backfill.zig and then ran the benchmarks once with collect-measurements.zig and then once with backfill.zig for the same commit (e498fb155051f548071da1a13098b8793f527275). Here's the maxrss of random-find for each:

collect-measurements: 2408
backfill: 2404

Interestingly, values around 2400 only showed up in the original gotta-go-fast benchmarking runs, and it seems like all results since then (most of them produced by the backfill script, AFAIK) have been getting larger and larger, for no discernible reason, since the benchmark and the hash map implementation haven't changed much in that time. Also, if you look at the graph, there are actually quite a few of these mini-spikes, which presumably line up with previous runs of the backfill script.

So, still no clue what's going on here, but hopefully I've narrowed it down slightly. I'll probably try running the backfill script locally with just random-find next, to see if I can reproduce the increasing maxrss results.

@squeek502
Contributor Author

squeek502 commented Jun 3, 2022

Can't seem to reproduce it when just running random-find:

maxrss timestamp benchmark_name commit_hash commit_timestamp zig_version
2408 1654218527 random-find 288e89b606b46328a5ab358b2eef2c5dc277bc8f 1654127128 0.10.0-dev.2472+288e89b60
2408 1654218803 random-find a4cdb49a58f62776b73ad873b25243e65ac29266 1654124353 0.10.0-dev.2471+a4cdb49a5
2404 1654219072 random-find b82cccc9e9b2230097f81fecec12ac0fdae97518 1654123401 0.10.0-dev.2464+b82cccc9e
2408 1654219344 random-find b095aa6986badc3b8c2255ad2c824ca4ea9959d9 1654117558 0.10.0-dev.2463+b095aa698
2408 1654219617 random-find de14fba2478019ba6070407502b422d349ebf754 1654109580 0.10.0-dev.2462+de14fba24
2408 1654219893 random-find 69323fc1432146bf166175060dea9d0248dbeba4 1654107559 0.10.0-dev.2461+69323fc14
2404 1654220171 random-find cbb806da6e0b8f2f4d0d12ca57618c291acfa0c0 1654062999 0.10.0-dev.2460+cbb806da6

maxrss is consistent across commits when I run the backfill script locally. Will try running the full set of benchmarks to see if memory usage is somehow leaking across different benchmarks or something (seems impossible, but might as well rule it out).

EDIT: Also doesn't occur when running the full set of benchmarks, I still get ~2400 consistently.

@andrewrk
Member

andrewrk commented Jun 3, 2022

To test this, I added:

Note that you need to poke every page (i.e. memset all the bytes) otherwise it won't count towards the RSS.

@squeek502
Contributor Author

squeek502 commented Jun 3, 2022

Note that you need to poke every page (i.e. memset all the bytes) otherwise it won't count towards the RSS.

Changed the added code to:

    // allocate 100MiB to attempt to skew the results to check if the backfill process is included
    var dummy = try arena.alloc(u8, 100 * 1024 * 1024);
    std.mem.set(u8, dummy, 1);
    std.debug.print("last byte is {}\n", .{dummy[dummy.len - 1]});

and verified that "last byte is 1" was printed.

Still got the same result: 2404 maxrss for random-find when using the backfill script.

@squeek502
Contributor Author

squeek502 commented Jun 3, 2022

Ok, finally got something. I downloaded the records.csv from ziglang.org/perf, deleted the entries for e498fb155051f548071da1a13098b8793f527275 and then ran:

$ZIG run collect-measurements.zig -- records-upstream.csv $ZIG $COMMIT_SHA1 $COMMIT_TIMESTAMP

for e498fb155051f548071da1a13098b8793f527275. random-find had a reported maxrss of 11272.

So, it's the collect-measurements.zig memory usage that's being included, and it's increasing with the size of the csv file (since it does readFileAlloc on the csv file). random-find is the only benchmark that's noticeably affected, since its real maxrss is low enough to be dwarfed by the readFileAlloc allocation.
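
For reference, the pattern in question looks roughly like this (a paraphrased sketch, not the actual collect-measurements.zig code; the allocator setup and size limit are just for illustration):

    const std = @import("std");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        // The whole records.csv is read into memory up front, so the RSS of the
        // parent process grows with the size of the csv. In collect-measurements.zig
        // that memory is still resident when the benchmark child process runs,
        // which is how it ends up in the child's reported maxrss.
        const csv = try std.fs.cwd().readFileAlloc(allocator, "records.csv", std.math.maxInt(usize));
        defer allocator.free(csv);

        std.debug.print("csv is {d} bytes\n", .{csv.len});
    }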

EDIT: Double checked this by adding the dummy allocation code from my previous comment into collect-measurements.zig. The maxrss of random-find jumped to 103000 so this is definitely what's happening.

@squeek502
Contributor Author

This seems to be the relevant note from getrusage(2):

Resource usage metrics are preserved across an execve(2).
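
A minimal sketch of that behavior in isolation (this isn't how the harness actually measures anything, and RUSAGE_CHILDREN is hard-coded from <sys/resource.h> rather than taken from std):

    const std = @import("std");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        // Stand-in for the csv held in memory by the parent: touch every byte
        // so the pages actually count towards RSS.
        const dummy = try allocator.alloc(u8, 100 * 1024 * 1024);
        defer allocator.free(dummy);
        std.mem.set(u8, dummy, 1);

        const pid = try std.os.fork();
        if (pid == 0) {
            // Child: does no work of its own. In the real setup this is where the
            // benchmark binary would be execve'd, and per the note above the
            // maxrss accumulated so far would survive that execve.
            std.os.exit(0);
        }
        _ = std.os.waitpid(pid, 0);

        // Resource usage of waited-for, terminated children.
        const RUSAGE_CHILDREN: i32 = -1; // value from <sys/resource.h>
        const ru = std.os.getrusage(RUSAGE_CHILDREN);

        // ru.maxrss is in kilobytes on Linux; if the fork inheritance is what's
        // going on, this should be close to 100 MiB even though the child never
        // allocated anything itself.
        std.debug.print("child maxrss: {d} KiB\n", .{ru.maxrss});
    }

If that's right, the child's reported maxrss should come out near 100 MiB despite the child doing nothing, which would line up with the 103000 result above.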

@acheronfail

I've been trying to measure the max_rss of some very small, hand-crafted assembly programs, and ran into the same issue with getrusage(2)!

Although what I've been doing is unrelated here, this might be a good read for you: https://jkz.wtf/random-linux-oddity-1-ru_maxrss, since it mentions the parts of the Linux source where the metrics are maintained.
