Add a task to show statisticts about serialized size/source size for the top 100 gems #1532

eregon · 2023-09-16T12:14:23Z

Currently it's:

Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  79450610
total serialized/total source: 0.881

Stats of ratio serialized/source per file:
average: 1.083
median:  1.074
1st quartile: 0.782
3rd quartile: 1.358
min - max: 0.040 - 4.000

total serialized/total source is slightly better than in #741 (reply in thread) which was 0.91 (for protobuf)

However the new stats reveal than on average and median, serialized size > source size (by ~8%), which is kind of unexpected.
So likely some files are big and have many comments or so and then for those serialized size is much smaller than source size and that could explain the total sizes being better for serialized.

Note this is without #1450 (so with all location fields), I'll post the stats with that here later.

eregon · 2023-09-16T12:19:41Z

It's a lot better with #1450:

Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  59596096
total serialized/total source: 0.661

Stats of ratio serialized/source per file:
average: 0.774
median:  0.757
1st quartile: 0.560
3rd quartile: 0.967
min - max: 0.036 - 3.556

With that serialized size is ~77% of source size on average.

kddnewton · 2023-09-19T17:22:03Z

.github/workflows/main.yml

@@ -170,6 +170,12 @@ jobs:
      run: bundle exec rake lex:topgems
    - name: Parse Top 100 Gems
      run: bundle exec rake parse:topgems
+    - name: Serialized size stats with all fields


This is an interesting metric, but I don't think we want it to run on every CI run. The topgems check is already one of the longest-running. I'm fine keeping the task around though.

This only takes like 2 seconds to execute, so I think it has no observable impact on CI time.

To be precise, the 3 commands below respectively take, locally on my machine: 2s, 6s, 2s.
I'd like to keep this check in CI because it's a very convenient way to check if serialized size got better or worse or same for a PR.

On this CI run https://github.com/ruby/yarp/actions/runs/6207594674/job/16852941189 it's 3+19+3 seconds.

I wonder if rake lex:topgems could be optimized further, that takes 4m22s in that run.

A quick profile with stackprof/autorun gives:

stackprof /tmp/stackprof20230919-104519-jjlzxp.dump ================================== Mode: wall(1000) Samples: 21520 (90.63% miss rate) GC: 13603 (63.21%) ================================== TOTAL (pct) SAMPLES (pct) FRAME 8441 (39.2%) 8441 (39.2%) (marking) 7520 (34.9%) 7520 (34.9%) (sweeping) 5530 (25.7%) 5530 (25.7%) Thread#join 3 (0.0%) 3 (0.0%) Kernel#require 15963 (74.2%) 2 (0.0%) (garbage collection) 5530 (25.7%) 0 (0.0%) Array#map 7890 (36.7%) 0 (0.0%) YARP.parallelize ...

Which looks strange, like non-main threads are not included or so.
That does seem like a lot of GC though.
Also ENV.fetch("WORKERS") { 16 } is probably way too high for GitHub Actions, maybe it could use Etc.nprocessors or so, or it might be more efficient to not parallelize that at all on CRuby (might just increase GVL contention).

@kddnewton I can move this to a separate CI job, then it won't make this one any longer. Would that address your concern?

…the top 100 gems

eregon · 2023-09-19T17:33:01Z

I've rebased on main so we can see the actual values with YARP_SERIALIZE_ONLY_SEMANTICS_FIELDS=1

eregon requested a review from kddnewton September 16, 2023 12:14

eregon force-pushed the serialized-size-stats branch 2 times, most recently from 9abd366 to 2b02b4e Compare September 16, 2023 12:24

eregon mentioned this pull request Sep 16, 2023

Only keep semantic fields in Java, i.e. skip location fields #1450

Merged

eregon force-pushed the serialized-size-stats branch from 2b02b4e to 0ec7688 Compare September 16, 2023 13:12

kddnewton requested changes Sep 19, 2023

View reviewed changes

Add a task to show statisticts about serialized size/source size for …

291177e

…the top 100 gems

eregon force-pushed the serialized-size-stats branch from 0ec7688 to 291177e Compare September 19, 2023 17:31

eregon mentioned this pull request Sep 20, 2023

consider using 32-bit offsets for start and end in yp_location_t #1566

Open

kddnewton approved these changes Sep 22, 2023

View reviewed changes

kddnewton merged commit c9620ec into ruby:main Sep 22, 2023
46 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a task to show statisticts about serialized size/source size for the top 100 gems #1532

Add a task to show statisticts about serialized size/source size for the top 100 gems #1532

eregon commented Sep 16, 2023 •

edited

Loading

eregon commented Sep 16, 2023 •

edited

Loading

kddnewton Sep 19, 2023

eregon Sep 19, 2023

eregon Sep 19, 2023

eregon Sep 19, 2023

eregon Sep 19, 2023

eregon Sep 19, 2023

eregon Sep 21, 2023

eregon commented Sep 19, 2023

Add a task to show statisticts about serialized size/source size for the top 100 gems #1532

Add a task to show statisticts about serialized size/source size for the top 100 gems #1532

Conversation

eregon commented Sep 16, 2023 • edited Loading

eregon commented Sep 16, 2023 • edited Loading

kddnewton Sep 19, 2023

Choose a reason for hiding this comment

eregon Sep 19, 2023

Choose a reason for hiding this comment

eregon Sep 19, 2023

Choose a reason for hiding this comment

eregon Sep 19, 2023

Choose a reason for hiding this comment

eregon Sep 19, 2023

Choose a reason for hiding this comment

eregon Sep 19, 2023

Choose a reason for hiding this comment

eregon Sep 21, 2023

Choose a reason for hiding this comment

eregon commented Sep 19, 2023

eregon commented Sep 16, 2023 •

edited

Loading

eregon commented Sep 16, 2023 •

edited

Loading