[NTOS:CC][NTOS:MM] Restore cache trimming functionality #5237

I-Tom-I · 2023-04-12T22:36:26Z

Purpose

Restore cache trimming functionality which allows evicting pages from the cache when low on memory.

Proposed changes

Restore CcRosTrimCache() which is called from MiBalancerThread().
Change KeSetEvent() increment to EVENT_INCREMENT in MmRebalanceMemoryConsumers() to ensure the balancer runs as soon as possible. MmRebalanceMemoryConsumers() is called by MiDecrementAvailablePages() when pages are critically low (MmAvailablePages < MmMinimumFreePages) meaning the balancer should be scheduled immediately, as there might not be enough pages for the balancer itself (ASSERT(MmAvailablePages >= 32);) later on.

HBelusca · 2023-04-12T22:50:44Z

If CcRosTrimCache is added back, this means it has been removed in the past. Can you mention which commit removed it originally?

ntoskrnl/mm/balance.c

ntoskrnl/cc/view.c

ntoskrnl/mm/balance.c

JoachimHenze · 2023-04-13T12:42:26Z

PR5237 on top of 0.4.15-dev-5938-g77d4653:
https://build.reactos.org/#/builders/9/builds/33783 the logs do show a lot of additional logspam.
VBox https://reactos.org/testman/compare.php?ids=86835,86837 LGTM
KVM https://reactos.org/testman/compare.php?ids=86834,86836 one additional reboot at mshtml:htmldoc, but that one is most likely unrelated. I have seen it randomly appearing on the unpatched releases/0.4.14 also on the testbots.

I am atm testing the PR in practice:
In the current state of this PR the following 2 lines are spammed once every second in first stage setup even when user is absolutely doing nothing (e.g. during language selection):
(/ntoskrnl/cc/view.c:473) CcRosTrimCache(Target 4294848615)
(/ntoskrnl/cc/view.c:575) Evicted 0 cache pages
We should mute them before merging!

I do like the PR.

HBelusca · 2023-04-13T12:48:22Z

1st-commit log:

-which was removed in 0.4.15-dev-1717-gd8cdb89fb03.
+which was removed in commit d8cdb89fb.

(the 0.4.15-*** prefix doesn't mean anything and is just a build number, unrelated to commits)

ntoskrnl/cc/view.c

JoachimHenze · 2023-04-13T13:09:46Z

 0.4.15-dev-1717-gd8cdb89fb03

just write a space after the g and the link will be clickable as intended, and even preserves the human readable time-content provided by "0.4.15-dev-1717-g".
0.4.15-dev-1717-g d8cdb89

tkreuzer · 2023-04-13T18:17:56Z

Just want to point out that while this PR looks like a good fix (I haven't done a detailed review yet), it can by itself not entirely fix the kernel crash, it just makes it much more unlikely. The part that will prevent any such crash is here: #5238. That PR addresses the lower level issue (preventing the page fault handler from crashing, when running out of pages), but it relies on either this PR or #5237 to get the pages back. Even all 3 might make sense together. Testers are welcome to test combinations of them.

binarymaster · 2023-07-29T11:17:14Z

#5238 is now merged, should this PR be closed as superseded by it?

Daniel7689 · 2023-07-29T17:35:59Z

This PR should be merged because I was successful in doing what reactosfanboy (@JoachimHenze) tried doing in the following screenshot of a comment in https://jira.reactos.org/browse/CORE-17624.

JoachimHenze · 2023-07-30T23:32:46Z

I would recommend to rerun all the bots on PR5237, now that PR5238 was merged already, to get new build artifacts from their combined result. I can't find the button in github though.

HBelusca · 2023-07-31T14:39:02Z

I would recommend to rerun all the bots on PR5237, now that PR5238 was merged already, to get new build artifacts from their combined result. I can't find the button in github though.

First, this PR should be rebased on top of current master, in order to reliably test this.

Restore CcRosTrimCache() which was removed in commit d8cdb89.

Change KeSetEvent() increment to EVENT_INCREMENT in MmRebalanceMemoryConsumers() to ensure the balancer runs as soon as possible. MmRebalanceMemoryConsumers() is called by MiDecrementAvailablePages() when pages are critically low (i.e. MmAvailablePages < MmMinimumFreePages) meaning the balancer should be scheduled immediately, as there might not be enough pages for the balancer itself (ASSERT(MmAvailablePages >= 32);) later on.

binarymaster · 2023-08-05T11:02:50Z

Rebased on top of master.

ThFabba · 2023-08-05T15:56:19Z

Here's my comment on why I thought adding this back seemed sensible for now:
https://jira.reactos.org/browse/CORE-17624?focusedId=131622&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-131622

Okay, what's going on here is that 0.4.15-dev-1717-gd8cdb89fb03 removed the MC_CACHE memory consumer from legacy Mm, and with it the CcRosTrimCache function responsible for evicting pages from the cache when we're low on memory.
This is correct in terms of Windows compatibility, because cache pages shouldn't need to be explicitly evicted. Instead:

the balancer thread should trim the cache working set, moving pages to either the modified or the standby list

the modified page writer thread will take pages from the modified list, write them to disk and move them to the standby list

any code needing a physical page can directly take it from the standby list

One of Jérôme's open PRs, #3300, enables working set trimming in the balancer thread.
However, we don't have a cache working set yet, nor do we have the modified page writer. So in the mean time, we'll probably need an in-between solution that will trim the cache when necessary – ideally something that can later be used as a basis for the modified page writer.

Also note that none of this is new code, it's simply restoring the exact implementation that was removed in Jérôme's commit. So while we can revisit questionable parts of it, I'd suggest taking it mostly as-is rather than reviewing it like new code.

Anyway, I don't expect this to be a permanent fix or a 100% fix -- but it would serve as a "regression shield" while the proper implementation is being worked on.

Doug-Lyons · 2023-08-07T18:04:43Z

ntoskrnl/mm/balance.c

@@ -372,6 +379,14 @@ MiBalancerThread(PVOID Unused)
                    InitialTarget = MiTrimMemoryConsumer(i, InitialTarget);
                }

+                /* Trim cache */
+                Target = max(InitialTarget, MiMinimumAvailablePages - MmAvailablePages);


This line results in a request for a HUGE amount of pages requested to be freed form CcRosTrimCache as can be seem from @JoachimHenze's comment including his debug output above showing the requested pages:

(/ntoskrnl/cc/view.c:473) CcRosTrimCache(Target >>4294848615<<)

This subtract for this one case is in the wrong order. I suggest this change to fix all cases:

Target = max(InitialTarget, abs(MiMinimumAvailablePages - MmAvailablePages));

It may actually be worth studying both cases:

MiMinimumAvailablePages < MmAvailablePages

and MmAvailablePages < MiMinimumAvailablePages , which is what seems to happen.

Note that elsewhere in that file we can see this:
https://git.reactos.org/?p=reactos.git;a=blob;f=ntoskrnl/mm/balance.c;hb=8532f1874a34ae41818a770f8c8bd8a77b70e21a#l111

111 if (MmAvailablePages < MiMinimumAvailablePages) 112 { 113 /* Global page limit exceeded */ 114 Target = (ULONG)max(Target, MiMinimumAvailablePages - MmAvailablePages); 115 } 116 else if (MiMemoryConsumers[Consumer].PagesUsed > MiMemoryConsumers[Consumer].PagesTarget) 117 { 118 /* Consumer page limit exceeded */ 119 Target = max(Target, MiMemoryConsumers[Consumer].PagesUsed - MiMemoryConsumers[Consumer].PagesTarget); 120 }

I agree that changing the code at this location to match what you show from above is a good solution here. Thanks.

Doug-Lyons · 2023-08-07T18:34:44Z

ntoskrnl/mm/balance.c

@@ -293,7 +293,8 @@ MmRebalanceMemoryConsumers(VOID)
 {
    // if (InterlockedCompareExchange(&PageOutThreadActive, 0, 1) == 0)
    {
-        KeSetEvent(&MiBalancerEvent, IO_NO_INCREMENT, FALSE);
+        /* Reset the event and boost balancer priority */
+        KeSetEvent(&MiBalancerEvent, EVENT_INCREMENT, FALSE);


I have done much testing with this PR and what I found is that this change does not improve the outcome. The commented line just above appears to have been an attempt to limit the times that this Event can be consecutively reset. I added a DPRINT1 to balance.c at the entry of MmRebalanceMemoryConsumers and you can see the effect in the debug log here;

(ntoskrnl/mm/ARM3/pfnlist.c:126) Running low on pages: 80 remaining (ntoskrnl/mm/balance.c:301) MmRebalanceMemoryConsumers. (ntoskrnl/mm/ARM3/pfnlist.c:126) Running low on pages: 79 remaining (ntoskrnl/mm/balance.c:301) MmRebalanceMemoryConsumers. (ntoskrnl/mm/ARM3/pfnlist.c:126) Running low on pages: 78 remaining (ntoskrnl/mm/balance.c:301) MmRebalanceMemoryConsumers. (ntoskrnl/mm/ARM3/pfnlist.c:126) Running low on pages: 77 remaining (ntoskrnl/mm/balance.c:301) MmRebalanceMemoryConsumers. (ntoskrnl/mm/ARM3/pfnlist.c:126) Running low on pages: 76 remaining (ntoskrnl/mm/balance.c:301) MmRebalanceMemoryConsumers. . . .

Since pfnlist.c is a high priority function, it consistently hammers the KeSetEvent function resulting in multiple resets with no good outcome.

My suggestion is to uncomment and fix line # 294 above by changing it to the following:

-// if (InterlockedCompareExchange(&PageOutThreadActive, 0, 1) == 0) + if (InterlockedCompareExchange(&PageOutThreadActive, 1, 0) == 0)

After this fix, the Event can work as expected without being constantly reset.

The way PageOutThreadActive is currently used is not technically safe -- if you uncomment that line, it may be that the balancer thread misses a wake-up. The set-event-if-interlocked pattern usually requires a lock in order to work right.
That may not be important for the balancer, given that it runs on a timer anyway, but 6f2b94c seems to indicate that this change was helping somehow.

Anyway, I'd prefer to separate this out from the cache trimming, and have more focused discussion/testing around this if someone strongly feels that either of the two changes is necessary.

(On a separate note, this boolean is pointless -- we could just make the event a NotificationEvent to achieve the same effect)

For posterity, the exact race condition around PageOutThreadActive looks like this:

the balancer thread is inactive (PageOutThreadActive=0) and somebody calls MmRebalanceMemoryConsumers

the InterlockedCompareExchange goes through, it sets PageOutThreadActive to 1 and signals the event

MiBalancerThread wakes up and runs its loop. It finishes the loop but gets unscheduled before it decrements PageOutThreadActive

another thread calls MmRebalanceMemoryConsumers. PageOutThreadActive is still 1 so it does not set the event

MiBalancerThread decreases PageOutThreadActive and waits

MiBalancerEvent did not get set so the thread remains until the next wake-up.

I just wanted to point out that my suggested changed was not just to remove the comment marker.
Also, this line needs to have the '1' changed to a '0' and the '0' changed to a '1' in the InterlockedCompareExchange.
I modified the comment above to show the Previous line with a '-' followed by the New line with a '+'.
Hopefully this makes the changes more obvious. Just removing the commenting does not fix its intent.
It is fine to leave this line as it is for now. It can be revisited at a later time. Thanks.

Thanks! Right, with 0, 1 the event would never get set... so that explains why Jérôme had to remove it

Doug-Lyons · 2023-08-07T18:53:35Z

I would recommend to rerun all the bots on PR5237, now that PR5238 was merged already, to get new build artifacts from their combined result. I can't find the button in github though.

First, this PR should be rebased on top of current master, in order to reliably test this.

@HBelusca, I have run the bots on almost this exact code in PR #5527 with only two small changes. There results were as follows:

PR5237-05.patch JID66166 on top of 0.4.15-dev-6415-g21e30f3

VBox: https://reactos.org/testman/compare.php?ids=88675,88680 LGTM
KVM: https://reactos.org/testman/compare.php?ids=88676,88679 LGTM

The only changes were that these results included the Lock Fix from PR #5527 and did NOT include the change from the above new line # 297 here;

KeSetEvent(&MiBalancerEvent, EVENT_INCREMENT, FALSE);

Doug-Lyons

Looks good to me.

JoachimHenze · 2023-08-14T18:39:15Z

I do recommend to commit the following patch: https://jira.reactos.org/browse/CORE-17624?focusedId=140032&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-140032
instead of pure PR5237.
The patch in there entirely fixes that issue, while PR5237 does only slightly improve further.

Doug-Lyons

Suggest that commit #2 should be reverted in accordance with discussion on Mattermost Development here https://chat.reactos.org/reactos/pl/wfk4g746xid69gpgise7j99d5o
ThFabba [1:13 PM]

I'm okay with the part that restores the old code. The priority boost change is more contentious and should be separated out.

I-Tom-I · 2023-09-05T18:24:19Z

This PR will be closed once #5630 is merged. Also my reasoning behind the priority change was that it simply did not work without it when tested using only this patch. Things may have changed in the meantime though, allowing it to be used without the priority change in newer revisions.

Doug-Lyons · 2023-09-05T18:34:57Z

This PR will be closed once #5630 is merged. Also my reasoning behind the priority change was that it simply did not work without it when tested using only this patch. Things may have changed in the meantime though, allowing it to be used without the priority change in newer revisions.

@I-Tom-I Thanks very much for your comments explaining your reasons. I appreciate your PR's and look forward to more.

JoachimHenze · 2023-09-06T11:35:45Z

Closed, because superseded. Thank you anyway for the contribution.

I-Tom-I requested review from HeisSpiter, ThFabba and tkreuzer as code owners April 12, 2023 22:36

github-actions bot added the kernel&hal Code changes to the ntoskrnl and HAL label Apr 12, 2023

HBelusca requested review from zefklop and HBelusca April 12, 2023 22:49

HBelusca reviewed Apr 12, 2023

View reviewed changes

ntoskrnl/mm/balance.c Outdated Show resolved Hide resolved

JoachimHenze reviewed Apr 12, 2023

View reviewed changes

ntoskrnl/mm/balance.c Outdated Show resolved Hide resolved

ntoskrnl/mm/balance.c Outdated Show resolved Hide resolved

JoachimHenze reviewed Apr 12, 2023

View reviewed changes

ntoskrnl/cc/view.c Outdated Show resolved Hide resolved

JoachimHenze requested changes Apr 12, 2023

View reviewed changes

ntoskrnl/mm/balance.c Outdated Show resolved Hide resolved

I-Tom-I force-pushed the master branch 3 times, most recently from 4096d74 to 6f553a1 Compare April 12, 2023 23:48

JoachimHenze approved these changes Apr 13, 2023

View reviewed changes

JoachimHenze added bugfix For bugfix PRs. no squash merge Author has no full name in GitHub profile, either merge by rebase or manually labels Apr 13, 2023

JoachimHenze reviewed Apr 13, 2023

View reviewed changes

ntoskrnl/cc/view.c Outdated Show resolved Hide resolved

JoachimHenze reviewed Apr 13, 2023

View reviewed changes

ntoskrnl/cc/view.c Outdated Show resolved Hide resolved

JoachimHenze reviewed Apr 13, 2023

View reviewed changes

ntoskrnl/cc/view.c Outdated Show resolved Hide resolved

I-Tom-I force-pushed the master branch from 88f9559 to e55c937 Compare April 13, 2023 17:01

tkreuzer mentioned this pull request Apr 13, 2023

Fix kernel crash, when copying large files #5238

Merged

2 tasks

I-Tom-I force-pushed the master branch from e55c937 to d3d1b68 Compare April 13, 2023 21:02

binarymaster added this to New PRs in ReactOS PRs via automation Jun 10, 2023

Doug-Lyons mentioned this pull request Aug 5, 2023

[NTOS:CC] Fix locking of CcRosVacbIncRefCount in CcRosCreateVacb #5527

Merged

ThFabba and others added 2 commits August 5, 2023 13:58

[NTOS:CC] Add back CcRosTrimCache. CORE-17624

ecddb73

Restore CcRosTrimCache() which was removed in commit d8cdb89.

binarymaster force-pushed the master branch from d3d1b68 to bf56344 Compare August 5, 2023 11:01

This comment was marked as outdated.

Sign in to view

Doug-Lyons reviewed Aug 7, 2023

View reviewed changes

ReactOS PRs automation moved this from New PRs to Approved by reviewers Aug 13, 2023

Doug-Lyons approved these changes Aug 13, 2023

View reviewed changes

Doug-Lyons requested changes Aug 24, 2023

View reviewed changes

Doug-Lyons mentioned this pull request Aug 30, 2023

[NTOS:CC][NTOS:MM] Add back CcRosTrimCache and add Delay for MM to recover memory #5630

Merged

JoachimHenze closed this Sep 6, 2023

ReactOS PRs automation moved this from Approved by reviewers to Done Sep 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NTOS:CC][NTOS:MM] Restore cache trimming functionality #5237

[NTOS:CC][NTOS:MM] Restore cache trimming functionality #5237

I-Tom-I commented Apr 12, 2023

HBelusca commented Apr 12, 2023

JoachimHenze commented Apr 13, 2023 •

edited

HBelusca commented Apr 13, 2023

JoachimHenze commented Apr 13, 2023

tkreuzer commented Apr 13, 2023

binarymaster commented Jul 29, 2023

Daniel7689 commented Jul 29, 2023 •

edited

JoachimHenze commented Jul 30, 2023

HBelusca commented Jul 31, 2023

binarymaster commented Aug 5, 2023

This comment was marked as outdated.

ThFabba commented Aug 5, 2023

Doug-Lyons Aug 7, 2023 •

edited

HBelusca Aug 7, 2023

Doug-Lyons Aug 7, 2023

Doug-Lyons Aug 7, 2023 •

edited

ThFabba Aug 12, 2023

Doug-Lyons Aug 12, 2023

ThFabba Aug 13, 2023

Doug-Lyons commented Aug 7, 2023

Doug-Lyons left a comment

JoachimHenze commented Aug 14, 2023

Doug-Lyons left a comment

I-Tom-I commented Sep 5, 2023

Doug-Lyons commented Sep 5, 2023

JoachimHenze commented Sep 6, 2023

[NTOS:CC][NTOS:MM] Restore cache trimming functionality #5237

[NTOS:CC][NTOS:MM] Restore cache trimming functionality #5237

Conversation

I-Tom-I commented Apr 12, 2023

Purpose

Proposed changes

HBelusca commented Apr 12, 2023

JoachimHenze commented Apr 13, 2023 • edited

HBelusca commented Apr 13, 2023

JoachimHenze commented Apr 13, 2023

tkreuzer commented Apr 13, 2023

binarymaster commented Jul 29, 2023

Daniel7689 commented Jul 29, 2023 • edited

JoachimHenze commented Jul 30, 2023

HBelusca commented Jul 31, 2023

binarymaster commented Aug 5, 2023

This comment was marked as outdated.

ThFabba commented Aug 5, 2023

Doug-Lyons Aug 7, 2023 • edited

Choose a reason for hiding this comment

HBelusca Aug 7, 2023

Choose a reason for hiding this comment

Doug-Lyons Aug 7, 2023

Choose a reason for hiding this comment

Doug-Lyons Aug 7, 2023 • edited

Choose a reason for hiding this comment

ThFabba Aug 12, 2023

Choose a reason for hiding this comment

Doug-Lyons Aug 12, 2023

Choose a reason for hiding this comment

ThFabba Aug 13, 2023

Choose a reason for hiding this comment

Doug-Lyons commented Aug 7, 2023

Doug-Lyons left a comment

Choose a reason for hiding this comment

JoachimHenze commented Aug 14, 2023

Doug-Lyons left a comment

Choose a reason for hiding this comment

I-Tom-I commented Sep 5, 2023

Doug-Lyons commented Sep 5, 2023

JoachimHenze commented Sep 6, 2023

JoachimHenze commented Apr 13, 2023 •

edited

Daniel7689 commented Jul 29, 2023 •

edited

Doug-Lyons Aug 7, 2023 •

edited

Doug-Lyons Aug 7, 2023 •

edited