
Fix exponential execution time by memoizing format_expr #5139

Merged
merged 1 commit on Apr 17, 2022

Conversation

HKalbasi
Member

fix #4476
fix #4867
fix #5128

The problem in those issues is that when max_width is tight, rustfmt has to check all possible formattings, which is exponential in the input size. By memoizing the format_expr function, the worst-case execution time becomes O(n * max_width^3). I tried the code from those issues and it completed in a fraction of a second.
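The idea can be sketched in miniature (the name `QueryId` mirrors the PR; the `expr`/`width` fields and the "fits or not" rule below are simplified stand-ins, not rustfmt's actual types or logic):

```rust
use std::collections::HashMap;

// Simplified stand-in for rustfmt's QueryId: an expression plus the
// width budget (Shape) it is being formatted under.
#[derive(Clone, PartialEq, Eq, Hash)]
struct QueryId {
    expr: String,
    width: usize,
}

// Without memoization, a formatter that retries sub-expressions under
// several widths re-solves identical (expr, width) queries exponentially
// often. Caching Option<String> (None meaning "doesn't fit") bounds the
// work to one computation per distinct query.
fn format_expr(
    expr: &str,
    width: usize,
    cache: &mut HashMap<QueryId, Option<String>>,
) -> Option<String> {
    let key = QueryId { expr: expr.to_string(), width };
    if let Some(cached) = cache.get(&key) {
        return cached.clone(); // cache hit: a clone instead of a recomputation
    }
    // Placeholder "formatting": succeed only if the text fits the budget.
    let result = if expr.len() <= width {
        Some(expr.to_string())
    } else {
        None
    };
    cache.insert(key, result.clone());
    result
}
```

The clone on a hit (and the clone when storing) is the space/time trade-off debated later in this thread.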

@HKalbasi
Member Author

Tests are failing strangely. I think my QueryId is not unique enough. What needs to be added?

@calebcartwright
Member

Tests are failing strangely. I think my QueryId is not unique enough. What needs to be added?

I can't say definitively whether there actually is a combination that will prove sufficiently unique, so you'll probably have to continue experimenting to gauge feasibility. Perhaps the expression's NodeId?

I also wouldn't be terribly keen on introducing a mutexed global; would it be feasible to attach the map to the rewrite context somehow?

@HKalbasi
Member Author

I can't say definitively whether there actually is a combination that will prove sufficiently unique, so you'll probably have to continue experimenting to gauge feasibility. Perhaps the expression's NodeId?

An expression's node id is shared between all expressions of a def id. But I found that if I exclude macros (memoization doesn't help there anyway) it passes the tests. Probably the span value is unique outside of macros.

I also wouldn't be terribly keen on introducing a mutexed global; would it be feasible to attach the map to the rewrite context somehow?

I tried adding a RefCell to the rewrite context (last commit), but it doesn't improve performance and those issues still take minutes. But the very same code with a global mutex works. I don't know what is happening.

@calebcartwright
Member

I tried adding a RefCell to the rewrite context (last commit), but it doesn't improve performance and those issues still take minutes. But the very same code with a global mutex works. I don't know what is happening.

Sorry, I haven't had a chance to dig into this in detail and probably won't for the next couple of weeks due to the holidays. I'd suspect that's a case of a cache miss with the former, perhaps due to some slightly different logic being needed vs. the global.

Overall I think you're on to something here, so I really appreciate you digging in! I would like to continue iterating on the implementation specifics though, as well as incorporating some tests.

@HKalbasi force-pushed the master branch 2 times, most recently from a888951 to 6ff4a71 on December 17, 2021 13:03
@HKalbasi
Member Author

The problem was that the rewrite context doesn't live long enough, so I moved the map one layer up and it is now fixed.

as well as incorporating some tests

I added the examples from those issues as regular tests, so a regression would cause cargo test to not terminate in a reasonable time (but it won't fail outright). A global timeout for all tests would make it fail.

@calebcartwright
Member

Fantastic thank you!

src/rewrite.rs Outdated
pub(crate) span: SpanData,
}

pub(crate) type Memoize = Option<HashMap<QueryId, Option<String>>>;
Contributor

Is the outer Option necessary?

Member Author

Thanks for reviewing! Yes: in the None case the function should clear the store at the end, but with Some(empty) it should fill the store and keep it.

Contributor

Hmm wouldn't it be the same to do clean = map.is_empty()?

Member Author

*map = Some(HashMap::default());
clean = true;

If we set clean = map.is_empty(), then *map = Some(HashMap::default()); would have to become inserting some dummy entry into the map (which isn't good); otherwise subsequent calls would also receive an empty map and clear it, when they shouldn't.
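For readers following along, the control flow being discussed looks roughly like this (a paraphrase with placeholder types and placeholder "work", not the PR's exact code):

```rust
use std::collections::HashMap;

type QueryId = u64; // placeholder for the real QueryId
type Memoize = Option<HashMap<QueryId, Option<String>>>;

// Paraphrase of the discussed logic: the outermost call finds None,
// installs an empty map, and is therefore responsible for clearing it
// at the end; nested calls find Some(map) and must leave it in place.
// A plain `clean = map.is_empty()` check cannot distinguish "no cache
// active" from "cache active but still empty".
fn memoized_rewrite(map: &mut Memoize, query_id: QueryId) -> Option<String> {
    let mut clean = false;
    if map.is_none() {
        *map = Some(HashMap::default());
        clean = true; // this call owns the cache's lifetime
    }
    if let Some(cached) = map.as_ref().unwrap().get(&query_id) {
        return cached.clone();
    }
    // Placeholder for the recursive formatting work.
    let result = Some(format!("formatted-{}", query_id));
    map.as_mut().unwrap().insert(query_id, result.clone());
    if clean {
        *map = None; // only the owning (outermost) call clears the store
    }
    result
}
```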

Contributor

I see. I wonder if there is a cleaner solution here, such as clearing the map after each Item is formatted.

src/rewrite.rs Outdated (resolved)
src/rewrite.rs Outdated (resolved)
src/rewrite.rs Outdated (resolved)
@camsteffen
Contributor

An expression's node id is shared between all expressions of a def id.

Maybe this is because ids aren't assigned until macro expansion (code), which I assume doesn't run for rustfmt? If that's the case, I wonder if it would be worthwhile to make a change upstream so that node ids can be assigned without macro expansion.

@calebcartwright
Member

Thanks again @HKalbasi! A few questions/asks:

  • Could you please incorporate some inline comments explaining the why and the thinking? Even after just a couple of weeks I found I'd forgotten a lot of context, and I can only imagine someone looking at this for the first time months or years from now.
  • I feel like we should be able to avoid at least one of the clones, most likely once we have a hit. Would using a mutable borrow on the map and remove_entry instead of get here do the trick?
  • Any thoughts on some of the additional optimizations @camsteffen suggested, e.g. in #5139 (comment)?

@HKalbasi
Member Author

I feel like we should be able to avoid at least one of the clones, most likely once we have a hit. Would using a mutable borrow on the map and remove_entry instead of get here do the trick?

Nothing prevents it from hitting more than once, so if we remove it from the map on the first hit, we'd need to calculate it again. If the clones are too costly, I think we can use Rc, since the caller just reads the result, but I don't feel it is necessary.

Any thoughts on some of the additional optimizations @camsteffen suggested

I don't think that's an optimization, but it would make the code cleaner. I'm not confident about where to clear the cache so that it clears enough but not more, and in the current approach all the memoization-related code is in the same function, so let's keep it as is. I tried to explain why there is an Option<HashMap> in the comments.

The second one (RefCell vs Cell) was an optimization, and it's done.

@calebcartwright
Member

calebcartwright commented Jan 29, 2022

re: my question in #5139 (comment) and 👇

Nothing prevents it from hitting more than once, so if we remove it from the map on the first hit, we'd need to calculate it again.

I was trying to think of a scenario where we'd have to recompute a Some(...) with the exact same shape (and I still can't see that), but I think I could see a scenario where it's a None that we could potentially hit twice. Does that sound right or is there a scenario for the former you can think of?

In short, I'm trying to be mindful of time vs. space complexity, especially since the originating issues occur in only the most extreme of circumstances derived from generated code whereas this change is going to impact quite literally everything. As such, I do want to be mindful of clones. If there's no way around it then so be it, but it's certainly worth the thought exercise

@HKalbasi
Member Author

HKalbasi commented Jan 29, 2022

I'm not sure at this point whether we ever hit a Some in the cache. If there is no such case, we could use a HashSet instead of a HashMap and return None if we have calculated it before. But would it format correctly if this conjecture turned out to be wrong? (Edit: I tried it and tests failed.)

In short, I'm trying to be mindful of time vs. space complexity, especially since the originating issues occur in only the most extreme of circumstances derived from generated code whereas this change is going to impact quite literally everything. As such, I do want to be mindful of clones. If there's no way around it then so be it, but it's certainly worth the thought exercise

Yes, now I see the second clone is pure overhead in normal cases. Rc is still an option.

.and_then(|state| {
state.sequence(|state| {
self::space(state)
.and_then(|state| super::hidden::skip(state))
Member

The indentation here is quite off. I can't say offhand whether that's directly related, because we don't really have a good frame of reference (as IIRC rustfmt on master is crashing on this input), but it could be indicative of a shape issue. The human-level diff comparison is also tough to do by eye given the heavy ident reuse.

Would you be willing to try to boil this down to a smaller repro, starting with the containing function at the same indentation level, to see if we can rule out any shape-related oddities in the map entries? If it turns out to just be a chain issue (which wouldn't surprise me in the least) that'll be good to know, but it definitely doesn't need to be resolved here.

Member Author

Any example needs to hit the max_width limit, either through indentation or by decreasing the limit. The example in #4867 is fairly minimal, and I tried to minimize it further:

fn func() {
    state . rule (Rule :: myrule , | state | { state . sequence (| state | { state . sequence (| state | { state . match_string ("abc") . and_then (| state | { super :: hidden :: skip (state) }) . and_then (| state | { state . match_string ("def") }) }) . and_then (| state | { super :: hidden :: skip (state) }) . and_then (| state | { state . sequence (| state | { state . optional (| state | { state . sequence (| state | { state . match_string ("abc") . and_then (| state | { super :: hidden :: skip (state) }) . and_then (| state | { state . match_string ("def") }) }) . and_then (| state | { state . repeat (| state | { state . sequence (| state | { super :: hidden :: skip (state) . and_then (| state | { state . sequence (| state | { state . match_string ("abc") }) }) }) }) }) }) }) }) }) });
}

This takes 7s on my system (with max_width=70) and then returns no solution. Examples that have a solution within the max_width limit are faster, since the solution is not the last thing we try, so minimizing them is harder because each step halves the execution time. Does this example help?

Member

Sorry but I don't understand your response in relation to the question that was posed. The issue here is that the emitted formatting is wrong. The runtime performance of how long it takes to produce the incorrect formatting is orthogonal.

Do you have any insight into why this is producing an incorrect result?

Member Author

Ah, I completely overlooked the selected lines and misunderstood your message. This seems very bad; I will look at it.

Member

no worries, and thank you!

Member Author

It seems that in lines 192-207 of the output the indentation unit is 2 spaces, but in the rest it is 4 spaces; any ideas?

Member

The input had two spaces IIRC, so perhaps it's somehow "successfully" falling back to the span contents of the input? FWIW, it might be easier on human eyes to butcher the input formatting a bit more and/or use some different idents, given how much similarity there is (that would make differences between input and output stand out more).

Member Author

You are right: if I change the spaces to 1, the output uses 1 as well. And I can reproduce it with the rustfmt installed on my machine after ~50s, so it is not caused by this PR.

Member

Okay, thank you for confirming. That strikes me as very odd behavior considering that we're dealing with a chain (a massive one, but a chain nevertheless) and how chain formatting behaves in less... intense scenarios, but as long as it's not something being introduced/modified as part of this PR, we should be good.

Member Author

Slightly minimized, takes 3s on rustfmt master: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=d36a8d8e3ddc4b6b0223144a4b73ded9

Increasing max_width fixes the problem, so there is a possibility that this problem has the same root cause as the long running time.

@calebcartwright
Member

I'm not sure at this point whether we ever hit a Some in the cache. If there is no such case, we could use a HashSet instead of a HashMap and return None if we have calculated it before. But would it format correctly if this conjecture turned out to be wrong? (Edit: I tried it and tests failed.)

I won't speak to the set vs. map piece, though to the original question, the only way rustfmt would ever format one of these expressions incorrectly would be if we'd done something egregious and were storing the wrong Some(String...) for the wrong query id. In practice the real worst case would be returning a None speciously, in which case rustfmt simply emits the original span contents of the expression (functionally skipping attempting to format that expression). Not ideal of course, but it's something done elsewhere on other code paths so certainly not disastrous either.

Yes, now I see the second clone is pure overhead in normal cases. Rc is still an option.

Will defer to your preference, but I want to note the execution time numbers I'm seeing locally with your current version (clone) and with the adjustment I suggested, removing the entry from the map via a mutable ref so as to avoid the need for the clone in the return case:

(note I've elided the --check contents as it's irrelevant noise in this context):

Current with the get + clone:

$ time target/debug/rustfmt tests/target/performance/issue-5128.rs --check
.....

real	0m0.632s
user	0m0.619s
sys	0m0.012s

With take_entry skipping the clone:

$ time target/debug/rustfmt tests/target/performance/issue-5128.rs --check
.....

real	0m0.635s
user	0m0.622s
sys	0m0.012s

And using the original/source input, as-is with get + clone

$ time target/debug/rustfmt tests/source/performance/issue-5128.rs --check
.....

real	0m8.926s
user	0m8.822s
sys	0m0.069s

with take_entry:

$ time target/debug/rustfmt tests/source/performance/issue-5128.rs --check
.....

real	0m8.776s
user	0m8.734s
sys	0m0.041s

I realize the above doesn't exactly qualify as a statistically significant analysis, but I'm seeing the same pattern consistently across multiple runs, and for the other input files as well. This gives me sufficient confidence that the extra clone has no substantial execution time impact, at least not in any context known to us today, and that we should go with one of the discussed approaches to avoid it.

@calebcartwright
Member

Also just want to say thanks again for all your efforts on this! I think it'll be a great step forward and address several known problems, but also want to be very careful and thoughtful given the code paths that are getting touched.

@HKalbasi
Member Author

HKalbasi commented Feb 3, 2022

Third hits are not impossible, we can see by testing this (it fails):

if let Some(r) = map.remove(&query_id) {
    map.insert(query_id, Some("we don't hit again".to_string()));
    context.memoize.set(Some(map)); // restore map in the memoize cell for other users
    return r;
}

Your numbers show that things we calculate exactly twice outnumber things we calculate more times. The existence of third hits can theoretically lead to exponential time, though it probably isn't important in practical cases. Anyway, that clone is on a cold path in normal code (issue-5128.rs isn't normal code), and what we did before memoizing was to recalculate the string from the ground up, which is definitely more costly than cloning, and that was fine.

The clone that is IMO more important, and that adds overhead relative to the previous state for normal code, is the second clone:

if let Some(mut map) = context.memoize.take() {
    map.insert(query_id, r.clone()); // insert the result in the memoize map
    context.memoize.set(Some(map)); // so it won't be computed again
}

In normal cases, most expressions won't even hit once, so their clones are wasted. I don't have a solution for that. I tried Rc and it doesn't work either, because some (most?) of the callers want to mutate the result (for example, to append something), so they need to clone the actual data anyway.

Also just want to say thanks again for all your efforts on this! I think it'll be a great step forward and address several known problems, but also want to be very careful and thoughtful given the code paths that are getting touched.

I completely understand your concerns, and thanks for maintaining rustfmt!

@calebcartwright
Member

The clone that is IMO more important, and that adds overhead relative to the previous state for normal code, is the second clone

Yes, though I'd basically accepted that we're going to have to introduce at least one, as one of the inherent drawbacks of memoization. While I agree that the second clone won't always be at play in normal (non-generated, non-borderline-pathological 😆) code, I feel like it may be with certain expressions in scenarios such as RHS comparisons.

Third hits are not impossible, we can see by testing this (it fails):

That's a much more practical way of testing it than my trying to mentally execute code flows!

Based on what we're seeing here I'm starting to suspect there's probably a middle ground somewhere that would both be a marked improvement in the generated code cases while also still minimizing unnecessary clones in more standard cases, even if the result is still less than optimal from an asymptotic complexity pov.

I wonder if we could use a more advanced type for the map values that included some kind of hit counter, used to conditionally decide whether to clone and attach the result to the map entry, and potentially to conditionally remove the entry as well, splitting the difference on some of the cloning I was previously referring to.
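One hypothetical shape for that idea (illustrative only; `CacheEntry`, `max_hits`, and the eviction rule are inventions for this sketch, not anything concretely proposed in the PR):

```rust
use std::collections::HashMap;

// Hypothetical cache value carrying a hit counter, so lookups can decide
// between cloning the stored result (more hits expected) and moving it
// out of the map (treating this as the last expected hit).
struct CacheEntry {
    result: Option<String>,
    hits: u32,
}

struct Cache {
    map: HashMap<u64, CacheEntry>,
    max_hits: u32, // hits allowed before the entry is evicted
}

impl Cache {
    // Outer Option: None = miss (caller computes and calls `store`);
    // Some(r) = hit, where r is the memoized formatting result.
    fn lookup(&mut self, id: u64) -> Option<Option<String>> {
        let keep = match self.map.get_mut(&id) {
            Some(entry) => {
                entry.hits += 1;
                entry.hits < self.max_hits
            }
            None => return None,
        };
        if keep {
            self.map.get(&id).map(|e| e.result.clone())
        } else {
            // Last expected hit: move the value out instead of cloning.
            self.map.remove(&id).map(|e| e.result)
        }
    }

    fn store(&mut self, id: u64, result: Option<String>) {
        self.map.insert(id, CacheEntry { result, hits: 0 });
    }
}
```

With `max_hits = 1` this degenerates to the remove-on-first-hit variant discussed above; larger values trade memory for fewer recomputations on repeated hits.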

@HKalbasi
Member Author

I measured the execution time of the system and idempotence tests (I know they are not necessarily good data; it was just easy to add a timer in run_test_with) in some scenarios, excluding the tests in this PR. Results have a ±0.03 second error; I tried to take the mean over multiple runs.

  1. Without caching:
Time elapsed in system: 0.579 seconds
Time elapsed in idempotence: 0.704 seconds
  2. This PR at its current state (it seems to help on normal code as well!):
Time elapsed in system: 0.531 seconds
Time elapsed in idempotence: 0.640 seconds
  3. Removing from the map at first hit:
Time elapsed in system: 0.544 seconds
Time elapsed in idempotence: 0.632 seconds

Results are mixed, but I think 3 is slightly faster. The difference is the cost of the second clone, minus the cost of recalculating third hits.

I tried to make a version of 1 with an additional clone, but the optimizer seems to catch me and remove the clone.

Based on what we're seeing here I'm starting to suspect there's probably a middle ground somewhere that would both be a marked improvement in the generated code cases while also still minimizing unnecessary clones in more standard cases, even if the result is still less than optimal from an asymptotic complexity pov.

I'm fine with removing from the map on the first hit, but the data above doesn't show that it makes a noticeable difference.

And I want to mention Rc/Cow again: using it would require changes across most of the codebase (so it's out of scope for this PR), but it would probably save the first clones in addition to the second ones.

@calebcartwright
Member

calebcartwright commented Apr 1, 2022

I'd put this on hold for a bit as the release we had pending was already large and I didn't want to try to pull this in on top, but am circling back to it now.

I am, however, unsure what you were trying to test/assert with your last comment, as it seems duplicative of what was established earlier relative to timing: my prior point was that the run/execution time was largely the same regardless of whether there was an extra clone, so I'd rather avoid allocating increasingly long strings that provide no benefit at runtime, nor IMO any benefit at the source-code level.

And I want to mention Rc/Cow again, using it needs some change in most of the codebase (so out of scope for this PR), but it probably saves the first clones in addition to the second clones.

Apologies if I forgot to respond to this thread previously, but I saw it and agreed completely that I don't want to go down this route.

@calebcartwright
Member

calebcartwright commented Apr 1, 2022

Also, the review comment I left in #5139 (comment) is still outstanding and is something we'd need clarity on before proceeding. I can't tell if the shape is just wrong or if we've perhaps got a mapping issue, but something is wrong there.

If we can come to a resolution on that then I think I'll be good to move forward with this as-is, though I'll probably spend a little time afterwards doing more experimentation.

@HKalbasi
Member Author

HKalbasi commented Apr 2, 2022

so I'd rather avoid allocating increasingly long strings that provide no benefit at runtime, nor IMO any benefit at the source-code level.

I don't think its effect on memory usage would be more than ten megabytes even for machine-generated code (we can measure it), so is it that bad?

@calebcartwright
Member

I don't think its effect on memory usage would be more than ten megabytes even for machine-generated code (we can measure it), so is it that bad?

This is an increasingly moot topic of debate because as noted previously the only outstanding issue needing resolution is relative to this change producing incorrect formatting.

I'm not requesting that you make any changes on this piece; however, I want to stress that we don't want to incorporate unhelpful/unnecessary steps into the codebase, even when they don't clear some threshold of being "bad enough" to matter; we very much have a strategy of "only do that which is necessary".

There are plenty of cases, both easily contrived ones and practical ones within the rustfmt codebase, where some minimally (or even negligibly) impactful step could be taken unnecessarily, e.g. requiring a clone instead of working with a slice. However, we'd still want to avoid such an unnecessary action regardless, i.e. go with the slice instead of the unnecessary clone.

I'm simply pointing out that the exact same principle is applicable here: we need to have a rationale for the use of a clone; we don't want to use clones freely/loosely and then have to rationalize their removal/avoidance.

@calebcartwright
Member

Alrighty, let's give this a try!

@calebcartwright calebcartwright merged commit a37d3ab into rust-lang:master Apr 17, 2022
@ytmimi
Contributor

ytmimi commented Jun 22, 2022

@calebcartwright given that we just had to revert these changes in #5403 do we want to reopen all the linked issues?

@HKalbasi
Member Author

Ah, sorry for breaking rustfmt.

If I understand correctly, the problem is that the result is not a function of just the shape and the expr. Would including the fields of RewriteContext (except config, parse_sess, and snippet_provider, which seem to be global) in QueryId be enough to solve this problem?
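A sketch of what that widened key might look like (the extra fields below are merely illustrative of the kind of per-context flags RewriteContext carries; whether any particular set of fields would actually be sufficient is exactly the open question):

```rust
// Hypothetical widened cache key: besides the span and shape, include any
// contextual state that can change the formatting result. The field names
// below are illustrative, not a verified-complete list.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct QueryId {
    span_lo: u32,
    span_hi: u32,
    shape_width: usize,
    shape_indent: usize,
    inside_macro: bool,     // illustrative contextual flag
    is_if_else_block: bool, // illustrative contextual flag
}
```

Any context field left out of the key makes two genuinely different queries collide in the cache, which is the failure mode behind the revert.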

@calebcartwright
Member

No worries! This is a fairly tricky area and I didn't foresee this either.

I suspect there will be more to it, and that we may need a more advanced/conditional type of memoization flow like the one I was alluding to earlier in these discussions (albeit with even more conditions).
