Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex::replace* methods taking &str rather than Cow<str> harms efficiency #676

Closed
chris-morgan opened this issue May 11, 2020 · 14 comments
Closed
Labels

Comments

@chris-morgan
Copy link
Member

The Regex::replace* methods return a Cow<str> for efficiency, but can't take a Cow<str>. This harms efficiency in cases like multiple sequential regular expressions:

|text: &str /* or Cow<str> even */| -> Cow<str> {
    let text = regex1.replace(text, replacement);
    let text = regex2.replace(text, replacement);
    text
}

Instead you’re stuck with this which necessarily wastes an allocation:

|text: &str| -> String {
    let text = regex1.replace(text, replacement);
    let text = regex2.replace(&text, replacement);
    text.into_owned()
}
@RReverser
Copy link
Contributor

Note that, as a workaround, you can write own helper to expand Cow:

fn replace_cow<'a>(cow: Cow<'a, str>, regex: &Regex, replacement: &str) -> Cow<'a, str> {
    match cow {
        Cow::Borrowed(s) => regex.replace(s, replacement),
        Cow::Owned(s) => Cow::Owned(regex.replace(&s, replacement).into_owned()),
    }
}

@BurntSushi
Copy link
Member

@chris-morgan Could you please post a complete example that compiles? Along with your desired variant that doesn't compile?

It would also help if you could suggest your desired fix.

@chris-morgan
Copy link
Member Author

@RReverser That still does an unnecessary clone in the owned case when there are no matches.

@BurntSushi Full trivial example that makes an unnecessary allocation in order to compile (otherwise beth references local variable aleph):

use std::borrow::Cow;

fn sub(input: &str) -> String {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m");
    let beth = regex.replace(&aleph, "x");
    beth.into_owned()
}

fn main() {
    println!("{}", sub("foo"));
}

What I would like to work, and which would make no allocations in the second replace call:

use std::borrow::Cow;

fn sub(input: &str) -> Cow<str> {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m");
    let beth = regex.replace(aleph, "x");
    beth
}

fn main() {
    println!("{}", sub("foo"));
}

The general nature of the proposed alteration would be that the type of the text parameter to the Regex::replace* methods change from &'t str to impl Into<Cow<'t, str>>. impl Trait is in Rust 1.26, and regex looks to use an MSRV of Rust 1.28, so it should be acceptable. Because there’s already one generic parameter, I don’t believe a second can be added compatibly (can’t do T: Into<Cow<'t, str>> = &'t str on fn bounds to make it optional).

I’m not confident that this would not be a breaking change, because I think it could conceivably mess with inference. If it’s deemed a breaking change, then adding new methods is all I can think of, which is eww.

The changes within the body of replacen would be straightforward: the addition of a let text = text.into();, the changing of two text to &text, and the removal of two Cow::Borrowed(…) wrappings around text.

@RReverser
Copy link
Contributor

That still does an unnecessary clone in the owned case when there are no matches.

Do you mean when input is already a Cow::Owned? Then no, it doesn't. Note that into_owned doesn't create a clone in that case, so original data will be returned back.

@chris-morgan
Copy link
Member Author

Yes it does: &s, you only passed an &str into replace, so it returned a Cow::Borrowed.

@RReverser
Copy link
Contributor

Right, good point, the second case also needs to be expanded.

FWIW a very similar problem is the reason I've created cow-utils. While it's not strictly Regex-oriented, you can use its cow_replace method with a Regex on nightly thanks to the Pattern API.

If there is enough interest, it should be quite easy to add a special feature to support regex on stable Rust as well.

@RReverser
Copy link
Contributor

One limitation is that it still intentionally accepts only &str.

The reason for that is that consuming Cow can harm efficiency of other edge cases, where right now it can reuse slice of intermediate strings instead.

@BurntSushi
Copy link
Member

I see. Thanks for elaborating. That makes things a lot clearer.

My general opinion on Cow is honestly pretty low. I occasionally regret even making replace return a Cow in the first place. But it is decently motivated: the case when no replacements occur is pretty common. My distaste for Cow is mostly that it is a somewhat unergonomic structure, and has subtleties to it (like getting a &str from a Cow<'a, str> doesn't mean your &str has a lifetime of 'a). Using Cow also results in users having to pretty commonly tack on an into_owned() at the end of every replace call, which I find somewhat annoying.

I think the suggestion of impl Into<Cow> is probably not plausible. Firstly, making another part of replace generic is really not something I want to do. It adds a lot of complexity to the type signature of what really should be a simple function. Secondly, my guess is that it would be a significant breaking change. For example, this code compiles when using &'t str, but using impl<Into<Cow<'t, str>>> causes it to break. I would expect there to be more issues here, because of inference.

In terms of non-breaking changes, I think that only leaves us with adding new replacement APIs. As you hinted at, I also find this unsatisfactory. I think there are probably already too many replacement APIs. Adding more for a fairly specialized use case would be unfortunate. Personally, if I were going to add more replacement APIs, then I think I would ditch Cow altogether and go with a simpler API like the one in the aho-corasick crate:

fn replace_with(&self, haystack: &str, replacement: R, dst: &mut String)

This way, you don't need to mess with Cow at all. Unfortunately, the problem with this approach is that if there are no matches, then you still need to copy everything to dst. Which does stink.

With all of that said, I actually think your solution is much simpler: don't use regex's replace APIs. Implementing your own replace should be pretty easy. It should only be a few lines of code.

@BurntSushi
Copy link
Member

I'm going to close this since it's partially not feasible because of it being a breaking change, and I also personally find the increased complexity of the API not worth it.

@damooo
Copy link

damooo commented Nov 14, 2021

@chris-morgan ,
We can handle this case of sequential substitution, with-no-allocation-if-no-match using RegexSet.
We can first match with RegexSet, and if there are no matches, just return cow<str>. If there are matches, then perform sequential-replacements. In the case of match, last allocation is any-way-necessary.

@xkr47
Copy link
Contributor

xkr47 commented Nov 28, 2022

Here is a workaround, based on rust-lang/rust#65143 (comment) — it's ugly because it relies on the implementation detail that a borrowed string currently only ever returned when nothing was replaced — in theory replacing beginning or end of a string with empty could return a borrowed substring of the original but feel pretty confident that optimizing for these cases is never going to happen in the replace* functions.

let orig = String::from("Hello");
let result = regex.replace(&orig, "foo");
let result = if let Cow::Owned(_) = result {
    result.into_owned()
} else {
    orig
};

As I see it, the if expression can't be extracted to a helper function because the borrowed orig is referenced in result and thus its ownership cannot also be passed as an argument to that helper function. But a macro could surely do the trick.

Applied for the earlier example by @chris-morgan at #676 (comment) assuming I understood the request correctly:

fn sub(input: &str) -> String {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m").into_owned();
    let beth = regex.replace(&aleph, "x");
    if let Cow::Owned(_) = beth {
        beth.into_owned()
    } else {
        aleph
    }
}

@chris-morgan
Copy link
Member Author

chris-morgan commented Nov 28, 2022

Good thinking, I like that workaround. As written, you’ve ended up with an unnecessary .into_owned() there—you already had the String by destructuring, you just discarded it instead of binding and using it. I’d also then switch to using match, which I find clearer (and shorter in conventional formatting):

let result = match result {
    Cow::Owned(result) => result,
    Cow::Borrowed(_) => orig,
};

As you say, the if expression can’t be extracted—but you can extract it along with the replace call. and so here’s an extension trait that defines new variants of replace, replace_all and replacen that take Cow<str> instead of &str:

use std::borrow::Cow;
use regex::{Regex, Replacer};

/// Extension methods for `Regex` that operate on `Cow<str>` instead of `&str`.
pub trait RegexCowExt {
    /// [`Regex::replace`], but taking text as `Cow<str>` instead of `&str`.
    fn replace_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str>;

    /// [`Regex::replace_all`], but taking text as `Cow<str>` instead of `&str`.
    fn replace_all_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str>;

    /// [`Regex::replacen`], but taking text as `Cow<str>` instead of `&str`.
    fn replacen_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, limit: usize, rep: R) -> Cow<'t, str>;
}

impl RegexCowExt for Regex {
    fn replace_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str> {
        match self.replace(&text, rep) {
            Cow::Owned(result) => Cow::Owned(result),
            Cow::Borrowed(_) => text,
        }
    }

    fn replace_all_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, rep: R) -> Cow<'t, str> {
        match self.replace_all(&text, rep) {
            Cow::Owned(result) => Cow::Owned(result),
            Cow::Borrowed(_) => text,
        }
    }

    fn replacen_cow<'t, R: Replacer>(&self, text: Cow<'t, str>, limit: usize, rep: R) -> Cow<'t, str> {
        match self.replacen(&text, limit, rep) {
            Cow::Owned(result) => Cow::Owned(result),
            Cow::Borrowed(_) => text,
        }
    }
}

And sample usage (playground) with my earlier example plus returning Cow<str> because why not:

use path::to::RegexCowExt;  // if not in the same module

fn sub(input: &str) -> Cow<str> {
    let regex = regex::Regex::new("f").unwrap();
    let aleph = regex.replace(input, "m");
    let beth = regex.replace_cow(aleph, "x");
    beth
}

fn main() {
    println!("{}", sub("foo"));
}

A few possible options here:

  1. Land these as full regular methods. Fairly sure Andrew won’t want that.
  2. Land this as an extension trait in the regex crate. Doubt Andrew will go for that either, but it’d be nice.
  3. Publish the extension trait as a separate crate, regex-cow or such. Can’t be bothered, personally.
  4. Just leave the code here so anyone that wants it can schnaffle it. (License: I reckon it’s ineligible for copyright protection, it’s glue code with no meaningful variation possible, but I’ll offer it under the same terms as regex crate if you want.)

Option four it is, so far as I’m concerned!

@xkr47
Copy link
Contributor

xkr47 commented Nov 28, 2022

I thought the contents of the Owned and Borrowed variants are not pub so they couldn't be extracted.. but apparently it works, thanks. I love your RegexCowExt! ❤️

@chris-morgan
Copy link
Member Author

chris-morgan commented Nov 28, 2022

Enum variants and their fields are always public. The closest you get to private is #[doc(hidden)]. If it was private like you can do with structs, you couldn’t even write _there.

(Also if it was private Rust wouldn’t tell you what the fields were at all, only that there were other private fields.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants