Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary Size #583

Closed
cramertj opened this issue May 31, 2019 · 15 comments · Fixed by #613
Closed

Binary Size #583

cramertj opened this issue May 31, 2019 · 15 comments · Fixed by #613

Comments

@cramertj
Copy link
Member

This isn't yet focused on any particular issue, but if the maintainers are okay with it, I'd like to use this as an issue to track improvements to binary size. Today, a simple hello world build with --release and stripped is ~240KB. Adding regex::Regex::new("x").unwrap() to that binary and recompiling + restripping gives ~1.25MB-- that's an increase of over 1 megabyte, which is quite large when targeting resource-constrained systems.

cargo bloat --release output:

 File  .text     Size        Crate Name
12.9%  65.9% 379.6KiB              [1519 Others]
 1.7%   8.5%  49.2KiB regex_syntax regex_syntax::ast::parse::ParserI<P>::parse_with_comments
 0.6%   2.8%  16.4KiB        regex regex::exec::ExecBuilder::build
 0.5%   2.4%  14.1KiB regex_syntax <regex_syntax::hir::translate::TranslatorI as regex_syntax::as...
 0.4%   2.0%  11.6KiB regex_syntax regex_syntax::ast::parse::ParserI<P>::parse_escape
 0.3%   1.8%  10.3KiB regex_syntax regex_syntax::unicode::class
 0.3%   1.7%   9.8KiB        regex regex::compile::Compiler::c
 0.3%   1.7%   9.8KiB regex_syntax <regex_syntax::hir::translate::TranslatorI as regex_syntax::as...
 0.3%   1.5%   8.6KiB    [Unknown] read_line_info
 0.3%   1.4%   8.0KiB          std _ZN9libunwind10CFI_ParserINS_17LocalAddressSpaceEE17parseInstr...
 0.2%   1.1%   6.5KiB    [Unknown] elf_add
 0.2%   1.1%   6.3KiB regex_syntax regex_syntax::hir::literal::suffixes
 0.2%   1.1%   6.3KiB          std __rdos_backtrace_dwarf_add
 0.2%   1.1%   6.2KiB        regex aho_corasick::nfa::Builder::build
 0.2%   1.1%   6.2KiB regex_syntax regex_syntax::hir::literal::prefixes
 0.2%   0.8%   4.8KiB        regex regex::compile::Compiler::compile
 0.2%   0.8%   4.7KiB          std rustc_demangle::demangle
 0.2%   0.8%   4.7KiB          std <rustc_demangle::Demangle as core::fmt::Display>::fmt
 0.2%   0.8%   4.5KiB regex_syntax regex_syntax::ast::visitor::visit
 0.1%   0.7%   4.2KiB regex_syntax regex_syntax::ast::visitor::visit
 0.1%   0.7%   4.1KiB        regex regex::compile::Compiler::c_class
19.5% 100.0% 575.8KiB              .text section size, the file size is 2.9MiB

cargo bloat --release --crates output:

Compiling ...
Analyzing target/release/x

 File  .text     Size Name
 7.6%  39.0% 224.4KiB regex_syntax
 6.2%  31.7% 182.4KiB std
 3.8%  19.5% 112.5KiB regex
 1.3%   6.4%  36.9KiB [Unknown]
 0.4%   1.9%  11.2KiB aho_corasick
 0.2%   0.8%   4.5KiB memchr
 0.1%   0.4%   2.2KiB utf8_ranges
 0.0%   0.2%   1.4KiB ucd_util
 0.0%   0.0%     273B x
19.5% 100.0% 575.8KiB .text section size, the file size is 2.9MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.
@BurntSushi
Copy link
Member

BurntSushi commented Jun 1, 2019

I'm aware of the large binary size. Could you help me understand the purpose of this ticket? Is it a request to reduce the binary size?

Without doing any actual work, my guess is that the large binary size is attributable to the large Unicode tables in regex-syntax and due to the fairly aggressive function inlining in regex proper. The former is hard to reduce without losing functionality and the latter is hard to reduce without losing performance. I'm sure there are probably some wins here, but I doubt they'd be dramatic.

It would be helpful to put the binary size into context, e.g., by comparing it with other mature regex engines with similar performance goals such as PCRE2.

(FWIW, one of my longer term goals is to provide a "lite" regex library which compiles faster and has a smaller binary size, but gives up performance and potentially some functionality, such as Unicode support.)

Moreover, I don't think regex can even work on resource constrained systems today, in practice. At least, not where 1MB would be an issue, since the standard library is a hard dependency at present. At some point, I'd like to add a no-std-with-alloc mode, and that might open some doors and put some pressure on reducing the binary size.

@RReverser
Copy link
Contributor

RReverser commented Jun 1, 2019

Today, a simple hello world build with --release

In addition to already said above, it's worth noting that if you're targetting small size, using the default release config doesn't make much sense for comparison, as it focuses on balanced speed of compilation+execution.

Instead, adding these few lines to Cargo.toml can already shave off ~320 KB (30% compared to original 1 MB on my machine) of a sample binary:

[profile.release]
lto = true
codegen-units = 1
opt-level = "z"

@ignatenkobrain
Copy link
Contributor

@BurntSushi thanks a lot for explanation! Using tricks from here allowed me to reduce size from 1.4M to 1M. I think having "lite" regex version would be amazing.

And thanks for your work :)

@cramertj
Copy link
Member Author

cramertj commented Jun 3, 2019

@RReverser

In addition to already said above, it's worth noting that if you're targetting small size, using the default release config doesn't make much sense for comparison, as it focuses on balanced speed of compilation+execution.

Right-- we already use opt-level = "z" and other techniques to reduce binary size in our real builds-- the description in the issue was meant to be a minimal reproducer demonstrating that the binary size is quite large.

@BurntSushi

Could you help me understand the purpose of this ticket? Is it a request to reduce the binary size?

I certainly would like to reduce the binary size, but I'm not intending to request anything from you aside from a place to discuss possible improvements and track ongoing work. I'm planning to spend some time investigating possible improvements, and I figured an issue on this repo would be a good place to discuss and share progress with others who may have the same goal.

Without doing any actual work, my guess is that the large binary size is attributable to the large Unicode tables in regex-syntax and due to the fairly aggressive function inlining in regex proper. The former is hard to reduce without losing functionality and the latter is hard to reduce without losing performance. I'm sure there are probably some wins here, but I doubt they'd be dramatic.

Yes, I think so as well (WRT unicode tables). One option I had considered was investigating the possibility of a regex-syntax-free regex frontend that used a procedural macro to parse the regular expression and transform it. This obviously wouldn't solve all the usecases of the existing crate, and so would at best be supplemental and at worst be an entirely separate project.

Moreover, I don't think regex can even work on resource constrained systems today, in practice. At least, not where 1MB would be an issue, since the standard library is a hard dependency at present. At some point, I'd like to add a no-std-with-alloc mode, and that might open some doors and put some pressure on reducing the binary size.

In Fuchsia, we have a large number of separate Rust binaries that range from 250-750KB, all of which use the standard library. Adding a Regex::new to any of them increases the binary size by an additional 200KB. Percentage-wise, that's a huge increase and means that we basically can't allow widespread use of the regex crate if we want to keep size down.

@BurntSushi
Copy link
Member

Adding a Regex::new to any of them increases the binary size by an additional 200KB. Percentage-wise, that's a huge increase and means that we basically can't allow widespread use of the regex crate if we want to keep size down.

It still seems like small potatoes to me. Could you say more about the decision procedure here? Could you also say the size at which regex would have to be in order to permit widespread use?

I kind of feel like your decision procedure here is pretty opaque, so without a target, it's a bit tricky to know how to make progress. For example, if you said the crate had to reduce its size by an order of magnitude, then it would be easy for me to say that regex is probably not the right tool for you. But if shaving off a few dozen KB would be enough, then some low hanging fruit might be enough.

For example, we could consider adding a unicode feature to regex (and regex-syntax) which, when disabled, dropped all of the Unicode features. As another example, we could consider adding a inline feature which, when disabled, got rid of most (or all) of the inline(always) directives in regex. I don't think either of those things is particularly difficult, but could potentially have a dramatic impact on binary size. Of course, this depends on whether you're okay with losing Unicode features and some runtime performance. (There may be some middle ground here too. regex-syntax bundles a lot of Unicode stuff, but not all Unicode features have equivalent utility. So perhaps there are different levels of Unicode support that we could offer.)

One option I had considered was investigating the possibility of a regex-syntax-free regex frontend that used a procedural macro to parse the regular expression and transform it.

That's interesting, but it's not obvious to me that it's a clear win. At that point, the compiled regex program will become part of your binary, and regex programs aren't themselves exactly small. There's probably also some implementation difficulties in getting this to work, but they are perhaps not insurmountable.

@cramertj
Copy link
Member Author

cramertj commented Jun 3, 2019

It still seems like small potatoes to me. Could you say more about the decision procedure here? Could you also say the size at which regex would have to be in order to permit widespread use?

Perhaps I'm not explaining clearly: for our programs, adding a use of regex increases their size by 25-50% (~200KB addition to a 250-500KB program). 50KB would probably be small enough to consider general usage. Unfortunately I can't be too precise about our specific project goals, but to give some sense of scale, single digit megabyte increases or decreases to the system matter quite a bit (enough to be worth spending multiple engineering days on them). With that in mind, the per-target effort reduced by including the regex crate has to be more significant than the effort required to reduce other parts of the system by an equivalent size.

As another example, we could consider adding a inline feature which, when disabled, got rid of most (or all) of the inline(always) directives in regex.

This sounds like a great thing to experiment with! I think the biggest wins are probably likely to come from dropping unnecessary copies of the unicode tables (e.g. in binaries whose regular expressions don't require them), but I think that's a much harder to achieve goal, and minimizing #[inline(always)] would be great, especially in light of our existing uses of LTO / PGO / opt-level=z, which hopefully would allow LLVM to make the right decisions about what level of inlining is right for us.

@BurntSushi
Copy link
Member

@cramertj All righty. I'll do some experimenting soonish w.r.t. inlining and Unicode and see what I can come up with. Longer term, the "lite" regex crate will probably be the better bet, although I was still planning on using regex-syntax for that. Even aside from the Unicode tables, the parser itself is fairly sizable too unfortunately. But I haven't looked at any concrete numbers in a long time, so it might not be an issue in practice.

Also, I feel obliged to mention regex-automata here. If you aren't using Unicode and can sacrifice a bit of runtime performance, then you can embed fully compiled sparse DFAs directly into your binary. Unicode screws the pooch here since it inflates the size of the DFAs by quite a bit, but if you disable Unicode (via (?-u)), then the DFAs can get pretty small. The actual runtime requirements are very light-weight and were designed with no_std in mind. There are unfortunately several niceties that you lose (capturing groups, and look-around assertions, e.g., ^$\b). Here's an example: https://docs.rs/regex-automata/0.1.6/regex_automata/#example-deserialize-a-dfa

@BurntSushi
Copy link
Member

@cramertj

Okay, so as far as I can tell, a good chunk of the binary size (~60KB) belongs
to just the parser itself, according to cargo bloat. This is before any
Unicode tables are used at all, and the parser doesn't force any inlining,
so I'm not sure what more I can do on my end. The only solution I can see
to this is to write another parser that is somehow smaller. Perhaps poorer
error handling, or maybe a parser that produces HIR directly. (The overall
regex compilation process is like this: concrete syntax -> AST -> HIR ->
NFA byte codes. The AST and HIR translation steps are the responsibility of
regex-syntax. The AST is meant to be a nearly round-trippable representation
of the concrete syntax, where as the HIR condenses a lot of stuff and makes it
much more amenable to analysis.)

The next thing I tried was removing all of the Unicode tables from
regex-syntax, and replacing them with empty slices. This overall decreased the
binary size by about 400KB, although this seems to be mostly opaque to cargo bloat. This seems to be about in the ballpark I'd expect, since the
approximate total source file size of the Unicode tables is 600KB. From a quick
glance, we could probably shave off 100-200KB without removing some of the more
"essential" Unicode features. (i.e., Keeping case folding, general categories,
scripts and \w, but dropping most of the random boolean properties in addition
to things like the word/grapheme/sentence breaking properties.)

Moreover, removing the Unicode tables shaves off about 2 seconds of compile
time. (With the baseline being ~12s, so it's not that much.)

The next thing I tried was to remove all of the pertinent inline(always)
annotations through the regex engine. This sped up compile time for regex by
4s (~20s to ~16s). Measuring the binary size for this is a bit trickier, since
to really gauge how much inlining is inflating the binary size, you need to
invoke multiple code paths. So I ended up using this program:

use regex::Regex;

fn main() {
    let data = std::fs::read("wat").unwrap();
    assert!(regex::bytes::Regex::new(r"\w").unwrap().is_match(&data));

    let data = std::fs::read("wat").unwrap();
    let _ = regex::bytes::Regex::new(r"\w").unwrap().find(&data).unwrap();

    let data = std::fs::read("wat").unwrap();
    let _ = regex::bytes::Regex::new(r"\w").unwrap().captures(&data).unwrap();

    let data = std::fs::read("wat").unwrap();
    let sdata = String::from_utf8(data).unwrap();
    assert!(Regex::new(r"\w").unwrap().is_match(&sdata));

    let data = std::fs::read("wat").unwrap();
    let sdata = String::from_utf8(data).unwrap();
    let _ = Regex::new(r"\w").unwrap().find(&sdata).unwrap();

    let data = std::fs::read("wat").unwrap();
    let sdata = String::from_utf8(data).unwrap();
    let _ = Regex::new(r"\w").unwrap().captures(&sdata).unwrap();
}

This hits on the three major code paths in the regex engine (is_match, find
and captures), and then multiplies it by invoking byte regexes as well.

Overall, removing all of the inline(always) annotations ended up reducing the
total binary size by about 100KB. And cargo bloat seems to confirm that.
Here's its output with the inline(always) annotations:

 File  .text     Size Crate
 9.0%  39.7% 295.2KiB regex
 6.6%  29.0% 215.8KiB regex_syntax
 5.2%  22.7% 168.8KiB std
 1.1%   4.8%  35.7KiB [Unknown]
 0.3%   1.5%  11.2KiB aho_corasick
 0.1%   0.6%   4.5KiB thread_local
 0.1%   0.6%   4.5KiB memchr
 0.1%   0.5%   3.5KiB regex_binary_size
 0.1%   0.3%   2.2KiB utf8_ranges
 0.0%   0.2%   1.4KiB ucd_util
22.7% 100.0% 742.7KiB .text section size, the file size is 3.2MiB

And then with the annotations removed:

 File  .text     Size Crate
 6.8%  34.0% 215.8KiB regex_syntax
 5.9%  29.5% 187.7KiB regex
 5.3%  26.6% 169.0KiB std
 1.1%   5.6%  35.7KiB [Unknown]
 0.4%   1.8%  11.2KiB aho_corasick
 0.1%   0.7%   4.5KiB thread_local
 0.1%   0.7%   4.5KiB memchr
 0.1%   0.6%   3.5KiB regex_binary_size
 0.1%   0.3%   2.2KiB utf8_ranges
 0.0%   0.2%   1.4KiB ucd_util
20.1% 100.0% 635.5KiB .text section size, the file size is 3.1MiB

So that's a reduction of about 36%. Which seems pretty good. This is further
confirmed by looking at the per-function breakdown. With inlining:

11.2%  49.5% 367.9KiB                   [1522 Others]
 1.6%   7.2%  53.7KiB             regex <regex::exec::ExecNoSync as regex::re_trait::RegularExpression>::captures_read_at
 1.6%   6.9%  51.6KiB      regex_syntax regex_syntax::ast::parse::ParserI<P>::parse_with_comments
 0.5%   2.4%  17.7KiB             regex regex::re_unicode::Regex::find_at
 0.5%   2.4%  17.7KiB             regex regex::re_bytes::Regex::find_at
 0.5%   2.2%  16.2KiB             regex regex::exec::ExecBuilder::build
 0.4%   1.8%  13.1KiB      regex_syntax <regex_syntax::hir::translate::TranslatorI as regex_syntax::ast::visitor::Visitor>::visit_post
 0.4%   1.6%  11.6KiB      regex_syntax regex_syntax::ast::parse::ParserI<P>::parse_escape
 0.4%   1.5%  11.5KiB             regex regex::re_unicode::Regex::shortest_match_at
 0.4%   1.5%  11.5KiB             regex regex::re_bytes::Regex::shortest_match_at
 0.3%   1.3%   9.8KiB      regex_syntax <regex_syntax::hir::translate::TranslatorI as regex_syntax::ast::visitor::Visitor>::visit_class_set_item_post
 0.3%   1.3%   9.8KiB             regex regex::compile::Compiler::c
 0.3%   1.2%   8.8KiB         [Unknown] read_line_info
 0.2%   0.9%   6.8KiB               std __rdos_backtrace_dwarf_add
 0.2%   0.9%   6.8KiB         [Unknown] elf_add
 0.2%   0.9%   6.3KiB      regex_syntax regex_syntax::hir::literal::suffixes

And now without inlining:

11.4%  56.7% 360.5KiB                   [1550 Others]
 1.6%   8.1%  51.6KiB      regex_syntax regex_syntax::ast::parse::ParserI<P>::parse_with_comments
 0.5%   2.6%  16.2KiB             regex regex::exec::ExecBuilder::build
 0.4%   2.1%  13.1KiB      regex_syntax <regex_syntax::hir::translate::TranslatorI as regex_syntax::ast::visitor::Visitor>::visit_post
 0.4%   1.8%  11.6KiB      regex_syntax regex_syntax::ast::parse::ParserI<P>::parse_escape
 0.3%   1.5%   9.8KiB      regex_syntax <regex_syntax::hir::translate::TranslatorI as regex_syntax::ast::visitor::Visitor>::visit_class_set_item_post
 0.3%   1.5%   9.8KiB             regex regex::compile::Compiler::c
 0.3%   1.4%   8.8KiB         [Unknown] read_line_info
 0.2%   1.1%   6.8KiB               std __rdos_backtrace_dwarf_add
 0.2%   1.1%   6.8KiB         [Unknown] elf_add
 0.2%   1.0%   6.3KiB      regex_syntax regex_syntax::hir::literal::suffixes
 0.2%   1.0%   6.2KiB             regex aho_corasick::nfa::Builder::build
 0.2%   1.0%   6.2KiB      regex_syntax regex_syntax::hir::literal::prefixes
 0.2%   0.8%   5.0KiB               std rustc_demangle::demangle
 0.2%   0.8%   4.8KiB               std <rustc_demangle::legacy::Demangle as core::fmt::Display>::fmt

You can see here that the various
find_at/captures_read_at/shortest_match_at functions have disappeared
off the list of biggest functions, which is pretty much what I would have
expected to see.

But a 100KB still looks to be an overall fairly small fraction of the total
binary size, unfortunately. Overall, if we combine disabling inlining with the
removal of all Unicode tables, then we bring the size of the release binary
(for the above program, with six different regex searches) down from 3.7MB to
3.1MB. If I remove the regex dep entirely and just compile a "Hello, world!"
program, then that yields a 2.4MB binary. So overall, the regex crate still
leads to a 700KB increase in binary size after applying the two "easy"
suggestions. And that's the best case---I imagine you'll probably want some
Unicode support, so that 700KB figure is probably on the low end.

It sounds to me like 700KB is not good enough for your use case. Is that right?
If so, then I'm not too sure what I can do on my end. Given that the parser
alone is already over 50KB, your budget appears to be unreasonable to meet
unfortunately. If you need to optimize binary size that much, then it sounds
like you probably need a special purposed regex engine built for it, likely by
sacrificing features and performance.

for our programs, adding a use of regex increases their size by 25-50% (~200KB addition to a 250-500KB program).

N.B. As my numbers show above, I'm seeing a much greater increase in
binary size than you are here. Based on your original comment, I think you
are looking at stripped binary size, which makes sense. So in my figures above,
the baseline hello world for me is 207KB, the status quo with the regex
searches is 1.4MB and the binary size without aggressive inlining or Unicode
tables is 843KB. That figure looks a little better, but it's still a 600KB jump. And
these still don't quite line up with your figures, but they're closer than my numbers
above. Either way, it still seems like a far cry from a 50KB increase.

@BurntSushi
Copy link
Member

BurntSushi commented Sep 2, 2019

OK, I've begun to make some progress on this particular issue. In light of the recent focus on dependency weight, I decided to try my hand at making a "regex lite" a reality. But instead of achieving that by building a completely separate crate (which was my original intent), I'm now thinking that it can be done by use of crate features. This overlaps a great deal with this issue, so I decided to just tackle them both at the same time. The high level idea is that it should be possible to strip regex down to its bare essentials: a non-Unicode aware parser, no literal optimizations, no aggressive inlining, no lazy DFA and no fast caching. Collectively, that should reduce compile times, reduce binary size and reduce the dependency count of regex down to exactly one unavoidable crate: regex-syntax. Of course, this would only happen when all features are disabled. But, it turns out that when all features are disabled, what you're left with is a perfectly serviceable regex library for a lot of use cases.

Thus far, I've done the work to make all Unicode data optional. Effectively, this surfaces itself via additional errors when building a regex. For example, if you turn off the Unicode case tables, but try to compile (?i)a, then that will result in a regex compilation error because the i flag is Unicode-aware by default. Instead, you can either enable those tables or use non-Unicode-aware case insensitivity, i.e., (?i-u)a. This same logic applies to all other Unicode-aware constructs, which are mostly centered around the \p{...} syntax.

An alternative strategy would be to permit all these constructs to continue to compile, but simply omit any additional Unicode processing. Aside from certain things being a little weird, the biggest problem with this is that someone might disable Unicode features, write (?i)k, and rely on the fact that it only matches ASCII characters. But if those Unicode features get enabled by anything else in the dependency graph, that behavior change will impact code that assumed no Unicode support. This would be very bad I think, so it's better to just return an error. Stated differently, there is a key principle at work here: crate features should never modify the match semantics of any regular expressions. The only change permitted is the space of valid regexes: adding features can only expand that space by adding more valid regexes. Conversely, removing features can only result in a smaller number of valid regexes.

Thus far, my planned feature breakdown for regex-syntax is as follows:

  • unicode -
    Enables all Unicode features. This feature is enabled by default, and will
    always cover all Unicode features, even if more are added in the future.
  • unicode-age -
    Provide the data for the
    Unicode Age property.
    This makes it possible to use classes like \p{Age:6.0} to refer to all
    codepoints first introduced in Unicode 6.0
  • unicode-bool -
    Provide the data for numerous Unicode boolean properties. The full list
    is not included here, but contains properties like Alphabetic, Emoji,
    Lowercase, Math, Uppercase and White_Space.
  • unicode-case -
    Provide the data for case insensitive matching using
    Unicode's "simple loose matches" specification.
  • unicode-gencat -
    Provide the data for
    Uncode general categories.
    This includes, but is not limited to, Decimal_Number, Letter,
    Math_Symbol, Number and Punctuation.
  • unicode-perl -
    Provide the data for supporting the Unicode-aware Perl character classes,
    corresponding to \w, \s and \d. Note that if this feature is disabled,
    the \s and \d character classes are still available if the unicode-bool
    and unicode-gencat features are enabled, respectively.
  • unicode-script -
    Provide the data for
    Unicode scripts and script extensions.
    This includes, but is not limited to, Arabic, Cyrillic, Hebrew,
    Latin and Thai.
  • unicode-segment -
    Provide the data necessary to provide the properties used to implement the
    Unicode text segmentation algorithms.
    This enables using classes like \p{gcb=Extend}, \p{wb=Katakana} and
    \p{sb=ATerm}.

I plan to re-export these features from regex proper, and then add several other features for controlling inlining, literal optimizations, the lazy DFA and fast caching.

With just dropping the Unicode data, compile times improve by 30% in debug mode, and 15% in release mode:

$ time cargo b
   Compiling regex-syntax v0.6.11 (/home/andrew/code/rust/regex/regex-syntax)
    Finished dev [unoptimized + debuginfo] target(s) in 4.41s

real    4.432
user    7.258
sys     0.286
maxmem  441 MB
faults  0

$ cargo clean

$ time cargo b --no-default-features
   Compiling regex-syntax v0.6.11 (/home/andrew/code/rust/regex/regex-syntax)
    Finished dev [unoptimized + debuginfo] target(s) in 3.12s

real    3.140
user    5.745
sys     0.281
maxmem  357 MB
faults  0

$ cargo clean

$ time cargo b --release
   Compiling regex-syntax v0.6.11 (/home/andrew/code/rust/regex/regex-syntax)
    Finished release [optimized + debuginfo] target(s) in 6.85s

real    6.865
user    21.404
sys     0.397
maxmem  618 MB
faults  0

$ cargo clean

$ time cargo b --no-default-features --release
   Compiling regex-syntax v0.6.11 (/home/andrew/code/rust/regex/regex-syntax)
    Finished release [optimized + debuginfo] target(s) in 5.76s

real    5.776
user    18.550
sys     0.259
maxmem  506 MB
faults  0

I was kind of hoping for a better improvement here, but as my analysis above showed, it turns out that the regex parser and its supporting infrastructure is pretty beefy all on its own.

@ethanpailes
Copy link
Contributor

The high level idea is that it should be possible to strip regex down to its bare essentials: a non-Unicode aware parser, no literal optimizations, no aggressive inlining, no lazy DFA and no fast caching

Does this mean dropping extra matching engines and literal optimizations as well? I didn't see any features for turning off different parts of the back end, which is why I ask.

On the face of it, it seems like it could be desirable to strip everything down to just the PikeVM. If you do end up deciding to provide feature flags for each of the extra matching engines, it means I should add one to the onepass patch right?

@BurntSushi
Copy link
Member

Does this mean dropping extra matching engines and literal optimizations as well? I didn't see any features for turning off different parts of the back end, which is why I ask.

Yeah, the knobs don't exist. I'm in the process of adding them.

I don't necessarily want a knob for every matching engine. For instance, I don't see a reason to provide a knob to turn off the backtracker. The code is small. The lazy DFA, on the other hand, is substantially larger.

On the face of it, it seems like it could be desirable to strip everything down to just the PikeVM. If you do end up deciding to provide feature flags for each of the extra matching engines, it means I should add one to the onepass patch right?

I would hold off on making changes to onepass. My very loose plan at the moment is to figure out how to bring the onepass engine into regex-automata, which should, at some point, become a dependency of regex. (Although, likely an optional dependency.)

I'm sorry I haven't been able to merge it yet, but I just haven't been able to push myself to do it because I'm so concerned about bugs in the existing infrastructure, which is extremely brittle. But that's a separate conversation...

@ethanpailes
Copy link
Contributor

Gocha. Definitely not trying to rush you.

Do you mean existing infrastructure in the rest of the regex crate or in the onepass patch? Is it anything I can help with?

@BurntSushi
Copy link
Member

BurntSushi commented Sep 2, 2019

Do you mean existing infrastructure in the rest of the regex crate or in the onepass patch? Is it anything I can help with?

No, not onepass. I mean the existing infrastructure. I think I'm at a point where trying to split up the work would be counter-productive. There's too much shifting, and the full picture isn't quite clear to me yet. I also expect this to take months or maybe even a year to really finish. Basically, I'm looking to solve a lot of the fundamental issues that plague regex right now: hard to reason about optimizations, better literal infrastructure, more principled look-around support, make $ support CRLF, and so on. It turns out that these require some fairly deep changes into how everything is tied together.

Once that base is there, I'm really hoping things will open up for many more optimization opportunities, including things like onepass.

Thank you for your offer though. :-)

@ethanpailes
Copy link
Contributor

Ok got it. Those sort of architectural decisions do seem like something you have to do on your own.

BurntSushi added a commit that referenced this issue Sep 2, 2019
This commit refactors the way this library handles Unicode data by
making it completely optional. Several features are introduced which
permit callers to select only the Unicode data they need (up to a point
of granularity).

An important property of these changes is that presence of absence of
crate features will never change the match semantics of a regular
expression. Instead, the presence or absence of a crate feature can only
add or subtract from the set of all possible valid regular expressions.

So for example, if the `unicode-case` feature is disabled, then
attempting to produce `Hir` for the regex `(?i)a` will fail. Instead,
callers must use `(?i-u)a` (or enable the `unicode-case` feature).

This partially addresses #583 since it permits callers to decrease
binary size.
BurntSushi added a commit that referenced this issue Sep 2, 2019
This commit refactors the way this library handles Unicode data by
making it completely optional. Several features are introduced which
permit callers to select only the Unicode data they need (up to a point
of granularity).

An important property of these changes is that presence of absence of
crate features will never change the match semantics of a regular
expression. Instead, the presence or absence of a crate feature can only
add or subtract from the set of all possible valid regular expressions.

So for example, if the `unicode-case` feature is disabled, then
attempting to produce `Hir` for the regex `(?i)a` will fail. Instead,
callers must use `(?i-u)a` (or enable the `unicode-case` feature).

This partially addresses #583 since it permits callers to decrease
binary size.
@BurntSushi
Copy link
Member

All right, I've done what I can I think. I've managed to decrease the binary size overhead of regex by about an order of magnitude once all features are disabled (1.3M -> 332K). On top of that, from-scratch compile times improve by about a factor of 2, along with trimming the dependency tree down to a single crate (regex-syntax).

I spent a couple additional hours trying to see if I could shrink the binary size even more. But the tools available to me to debug and fix that sort of issue appear to be fairly limited. The parser is definitely taking up a chunk of space, but there's really no one huge thing left that's being reported by cargo bloat. To me, that suggests that if binary size needs to shrink further, than you probably need a purpose-built regex engine for it (likely by sacrificing something). Still though, some of the sizes of the parsing routines look quite big to me, so perhaps there is still something I'm missing there.

I've posted a bit more of an analysis in #613.

BurntSushi added a commit that referenced this issue Sep 3, 2019
This commit refactors the way this library handles Unicode data by
making it completely optional. Several features are introduced which
permit callers to select only the Unicode data they need (up to a point
of granularity).

An important property of these changes is that presence of absence of
crate features will never change the match semantics of a regular
expression. Instead, the presence or absence of a crate feature can only
add or subtract from the set of all possible valid regular expressions.

So for example, if the `unicode-case` feature is disabled, then
attempting to produce `Hir` for the regex `(?i)a` will fail. Instead,
callers must use `(?i-u)a` (or enable the `unicode-case` feature).

This partially addresses #583 since it permits callers to decrease
binary size.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants