
Wasted memory in inode cache and dentry cache #14506

Closed
avikivity opened this issue Jul 4, 2023 · 49 comments · Fixed by #16952

Comments

@avikivity
Member

avikivity commented Jul 4, 2023

An sstable is composed of 7 files. They will be opened in close proximity. Two will remain open, five will be closed.

After some time, the kernel will run out of memory. It will start evicting non-open files. But because 2 out of 7 files are still open, memory fragmentation will prevent the slabs containing those inodes/dentries from being reclaimed.

Let's take for example 200k sstables. That's 200k*1k*7 = 1.4GB memory used, of which 200k*1k*5 = 1GB is wasted.
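
A back-of-the-envelope check of those figures, assuming roughly 1k of slab memory (inode + dentry) per cached file, as above:

    #include <cstdio>

    int main() {
        // Illustrative numbers from the paragraph above; ~1k per cached file
        // is an approximation, not a measured value.
        constexpr long sstables = 200'000;
        constexpr long files_per_sstable = 7;   // 2 stay open (Data, Index), 5 are closed
        constexpr long open_files = 2;
        constexpr long bytes_per_file = 1'000;

        constexpr long total  = sstables * files_per_sstable * bytes_per_file;
        constexpr long wasted = sstables * (files_per_sstable - open_files) * bytes_per_file;
        std::printf("total: %.1f GB, wasted: %.1f GB\n", total / 1e9, wasted / 1e9);
        // prints: total: 1.4 GB, wasted: 1.0 GB
    }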

@avikivity
Member Author

Possible solutions:

  • increase vm.vfs_cache_pressure sysctl dramatically; unlikely to work since there is no memory pressure during boot
  • change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.
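
A minimal sketch of the second option, using plain POSIX open/close rather than the actual loader code; the component list and directory layout are simplified assumptions for illustration:

    #include <fcntl.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    static const std::vector<std::string> short_lived = {
        "-TOC.txt", "-Statistics.db", "-Summary.db", "-Filter.db", "-CompressionInfo.db"};
    static const std::vector<std::string> long_lived = {"-Data.db", "-Index.db"};

    void load_sstables(const std::vector<std::string>& prefixes) {
        // Pass 1: open and immediately close everything that is read only once,
        // so the short-lived inodes/dentries get allocated next to each other.
        for (const auto& p : prefixes) {
            for (const auto& c : short_lived) {
                int fd = ::open((p + c).c_str(), O_RDONLY);
                if (fd >= 0) {
                    // ... parse the component here ...
                    ::close(fd);
                }
            }
        }
        // Pass 2: open the files that stay open for the sstable's lifetime.
        for (const auto& p : prefixes) {
            for (const auto& c : long_lived) {
                int fd = ::open((p + c).c_str(), O_RDONLY);
                (void)fd;   // kept open; ownership handed to the sstable object (omitted)
            }
        }
    }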

@avikivity
Member Author

/cc @xemul @tchaikov

@raphaelsc
Member

The 2nd proposed solution is easy to implement.

Another improvement I'd like to make is to reopen the file descriptor on the shard that really owns it: slab allocation is NUMA-aware today, so we can benefit from a fair distribution of resources across NUMA nodes.

Actually, by just reopening the file descriptor, I guess we'll cause a natural defragmentation of XFS's inode and dentry caches?! Worth testing.

@xemul
Contributor

xemul commented Jul 4, 2023

It will start evicting non-open files. But because 2 out of 7 files are still open, memory fragmentation will prevent the slabs containing those inodes/dentries from being reclaimed.

More sstables will appear and the kernel will re-use the holes in partially-filled slabs?

@avikivity
Member Author

It will start evicting non-open files. But because 2 out of 7 files are still open, memory fragmentation will prevent the slabs containing those inodes/dentries from being reclaimed.

More sstables will appear and the kernel will re-use the holes in partially-filled slabs?

With a very high-traffic node it may defragment itself after a while, but the odds are low. These are 16k slabs, so 4 sstables in the same slab need to be deleted.

@xemul
Contributor

xemul commented Jul 4, 2023

Actually, by just reopening the file descriptor, I guess we'll cause a natural defragmentation of XFS's inode and dentry caches?! Worth testing.

Closing a file doesn't cause its dentry+inode to be evicted. Quite the opposite -- the kernel caches dentries for closed files, and the respective inodes are pinned in memory too. Eviction (called "reclaim" in the kernel) happens either on page allocation failure (not small-object allocation failure) or in a background thread (but I'm not very confident about the thread; background reclaim has been very different across kernel versions).
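
One way to observe this, assuming an XFS data directory (a sketch; /proc/slabinfo is usually readable by root only): dump the dentry and xfs_inode slab lines before and after closing a batch of files and note that the active-object counts do not drop on close, only under reclaim.

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream slabinfo("/proc/slabinfo");
        std::string line;
        while (std::getline(slabinfo, line)) {
            // Lines start with the cache name, followed by active_objs,
            // num_objs, objsize, objperslab, pagesperslab, ...
            if (line.rfind("dentry", 0) == 0 || line.rfind("xfs_inode", 0) == 0) {
                std::cout << line << '\n';
            }
        }
    }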

@avikivity
Member Author

And with a low-traffic node, we're stuck with 25% utilization.

@avikivity
Member Author

Yes, the idea is to allocate long-lived inodes together, and short-lived inodes together, so that if/when the short-lived inodes are reclaimed they free a slab.
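
A toy model of that argument (assumed numbers: 17 objects per 16k inode slab, 7 inodes per sstable of which 2 are pinned, sequential slab filling; not a measurement):

    #include <cstdio>
    #include <vector>

    constexpr int objs_per_slab = 17;

    // Count slabs that contain only short-lived (reclaimable) objects.
    // pinned[i] is nonzero if the i-th allocated object is long-lived.
    static int fully_reclaimable_slabs(const std::vector<int>& pinned) {
        int free_slabs = 0;
        for (size_t i = 0; i < pinned.size(); i += objs_per_slab) {
            bool slab_pinned = false;
            for (size_t j = i; j < pinned.size() && j < i + objs_per_slab; ++j) {
                if (pinned[j]) {
                    slab_pinned = true;
                }
            }
            free_slabs += !slab_pinned;
        }
        return free_slabs;
    }

    int main() {
        constexpr int sstables = 1000;
        std::vector<int> interleaved, grouped;
        for (int i = 0; i < sstables; ++i) {
            // Per sstable: 2 long-lived inodes (Data, Index) + 5 short-lived ones.
            interleaved.insert(interleaved.end(), {1, 1, 0, 0, 0, 0, 0});
        }
        grouped.insert(grouped.end(), sstables * 2, 1);   // all long-lived first
        grouped.insert(grouped.end(), sstables * 5, 0);   // all short-lived after

        std::printf("interleaved: %d fully reclaimable slabs\n", fully_reclaimable_slabs(interleaved));
        std::printf("grouped:     %d fully reclaimable slabs\n", fully_reclaimable_slabs(grouped));
        // interleaved: 0 fully reclaimable slabs; grouped: most of them (294 of 412 here).
    }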

@xemul
Contributor

xemul commented Jul 4, 2023

There's the kmem controller in the cgroups area. When allocations come from different groups, the kmem cache dedicates different slab pages to each. If threads can live in different kmem groups and this thing still works 🤞 it can be an option too

@xemul
Contributor

xemul commented Jul 4, 2023

With a very high-traffic node it may defragment itself after a while, but the odds are low.

I don't get why. If the traffic is low, we have 25% utilization of a few pages, so it's hardly a problem. If the traffic is high, then those 5 files that are opened once (or twice) for a short time should be reclaimed and then re-allocated by newly appearing sstables.

@avikivity
Member Author

Traffic is low ≠ few files. We've seen a misconfigured TWCS table with ~200k sstables. That translated to ~800k files and ~1GB slab, instead of 200MB slab.

@avikivity
Member Author

Unicode fail, I don't understand why the =/= character (U+2260) can't be copied here.

@avikivity
Member Author

≢ <- this does work

@raphaelsc
Member

Unicode fail, I don't understand why the =/= character (U+2260) can't be copied here.

That's why email is better. I was able to see your symbol from my mail reader :-)

@avikivity
Member Author

⍯ <- this too

@avikivity
Member Author

Maybe it's just misrendered locally.

@michoecho
Contributor

Unicode fail, I don't understand why the =/= character (U+2260) can't be copied here.

You mean the ≠ in low ≠ few above? I'm seeing it normally on GitHub. Maybe you have font problems.

@avikivity
Member Author

In your quote, the first ≠ is misrendered and I can select individual glyphs (but it's also kerned!). The second ≠ is rendered correctly (so for sure font related, since the second instance uses a different font).

@xemul
Contributor

xemul commented Jul 4, 2023

Ah, now I see it. It's ~two-step. First, the inode cache is populated with a [used, used, unused, unused, unused, unused, unused]* pattern, then OOM comes and translates it into a [used, used, freed, freed, freed, freed, freed]* one :(

I remember that Christopher tried to do something about it long ago, e.g. -- https://lwn.net/Articles/371892/

This could be a justification for rewriting Linux in C++ with the ability to std::move() dentries+inodes during slab reclaim 🦸

@avikivity
Member Author

Yes, about 4000 years ago @tgrabiec demonstrated the problem (IIRC on seastar memcached), and everyone has been busy admiring LSA ever since.

avikivity added a commit to avikivity/scylladb that referenced this issue Jul 4, 2023
Our usage of inodes is dual:

 - the Index.db and Data.db components are pinned in memory as
   the files are open
 - all other components are read once and never looked at again

As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.

Ref scylladb#14506
@avikivity
Member Author

↑ The patch implements the first proposal, but it won't solve the problem without the second, due to fragmentation.
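
For reference, the knob being tuned here is exposed via procfs; a sketch of setting it directly (an illustration of the kernel interface, not necessarily how the patch applies it):

    #include <fstream>

    int main() {
        // Needs root; equivalent to `sysctl -w vm.vfs_cache_pressure=2000`.
        std::ofstream f("/proc/sys/vm/vfs_cache_pressure");
        if (f) {
            f << 2000 << '\n';
        }
    }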

@avikivity
Member Author

← celebrating Unicode and the Compose key

@xemul
Contributor

xemul commented Jul 4, 2023

change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.

There are ~20 dentries per slab and ~15+ xfs inodes per slab (21 and 17 in the original issue's kernel). Opening all files in one go, then closing all but Index and Data, will leave the [used, used, unused, unused, unused, unused, unused] pattern in the slab, so even 20x harder pressure on slabs won't help reclaim this page. So what's the point?
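
For context, those per-slab counts follow from typical object sizes (assumed values; the authoritative numbers are the objsize and objperslab columns in /proc/slabinfo on the actual host):

    #include <cstdio>

    int main() {
        // ~192-byte dentries in a 4 KiB slab, ~960-byte xfs_inodes in a 16 KiB slab.
        std::printf("dentries per slab:   %d\n", 4096 / 192);    // 21
        std::printf("xfs_inodes per slab: %d\n", 16384 / 960);   // 17
    }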

@raphaelsc
Member

change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.

There are ~20 dentries per slab and ~15+ xfs inodes per slab (21 and 17 in the original issue's kernel). Opening all files in one go, then closing all but Index and Data, will leave the [used, used, unused, unused, unused, unused, unused] pattern in the slab, so even 20x harder pressure on slabs won't help reclaim this page. So what's the point?

I think the point is to open the long-lived files first and the short-lived ones later. So it will be used used used ... unused unused ...; then reclaiming will have an easy job freeing from the back

@raphaelsc
Member

change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.

There are ~20 dentries per slab and ~15+ xfs inodes per slab (21 and 17 in the original issue's kernel). Opening all files in one go, then closing all but Index and Data, will leave the [used, used, unused, unused, unused, unused, unused] pattern in the slab, so even 20x harder pressure on slabs won't help reclaim this page. So what's the point?

I think the point is to open the long-lived files first and the short-lived ones later. So it will be used used used ... unused unused ...; then reclaiming will have an easy job freeing from the back

But note that patching the sstable loader only defers the issue, as newer sstables are introduced one at a time, so we cannot prevent fragmentation in the long run

@avikivity
Member Author

In the long run sstables replace each other. A deleted inode is freed from its slab and the next sstable to be created will take its place.

The problem only happens during initial load.

@xemul
Contributor

xemul commented Jul 4, 2023

I think the point is to open the long-lived files first and the short-lived ones later.

Long-lived files from several sstables first, then the short-lived ones from the same set

newer sstables are introduced one at a time,

Exactly

so we cannot prevent fragmentation in the long run

... unless long-lived files are eventually closed; AFAIU, we expect that compaction should provide that in the long run

@avikivity
Member Author

I was thinking of opening ALL the short-lived components first. We have to open -TOC.txt first, so at that point we may as well open all the other short-lived components.

Then a second pass to open Index/Data.

xemul pushed a commit that referenced this issue Jul 10, 2023
Our usage of inodes is dual:

 - the Index.db and Data.db components are pinned in memory as
   the files are open
 - all other components are read once and never looked at again

As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.

Ref #14506

Closes #14509
@avikivity
Member Author

We might also tune vm.min_slab_ratio

min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone.  On Zone reclaim
(fallback from the local zone occurs) slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This insures that the slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.

If we set it to 1%, then as soon as we reach 1% we'll evict inodes and new sstables will start to fill the holes. 5% is too close to our 7% reserve.

@xemul
Contributor

xemul commented Jul 12, 2023

LKML might appreciate an fcntl extension (or an O_SOMETHING flag) that would prevent a file's dentry and inode from being cached

@DoronArazii added this to the Backlog milestone Aug 22, 2023
@bhalevy
Member

bhalevy commented Jan 4, 2024

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today
  2. close the data and index files of all sstables in the directory (while keeping the metadata components from pass one)
  3. reopen the data and index files

For newly written sstables (by memtable flush or compaction) we can consider a background crawler in the sstables manager that will process a batch of sstables, closing and reopening the data and index in each.
We'll need to synchronize this with readers, and probably allow on-demand opening of the data and index streams if needed.

@avikivity
Member Author

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

  2. close the data and index files of all sstables in the directory (while keeping the metadata components from pass one)
  3. reopen the data and index files

For newly written sstables (by memtable flush or compaction) we can consider a background crawler in the sstables manager that will process a batch of sstables, closing and reopening the data and index in each. We'll need to synchronize this with readers, and probably allow on-demand opening of the data and index streams if needed.

Opening and closing does nothing for the inode cache.

I'm less worried about runtime; there's constant deletion of sstables, so inode slots will get reused.

@avikivity
Member Author

I'm less worried about runtime; there's constant deletion of sstables, so inode slots will get reused.

Well, except in the streaming/repair case.

@bhalevy
Member

bhalevy commented Jan 4, 2024

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.
Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

  2. close the data and index files of all sstables in the directory (while keeping the metadata components from pass one)
  3. reopen the data and index files

For newly written sstables (by memtable flush or compaction) we can consider a background crawler in the sstables manager that will process a batch of sstables, closing and reopening the data and index in each. We'll need to synchronize this with readers, and probably allow on-demand opening of the data and index streams if needed.

Opening and closing does nothing for the inode cache.

I'm less worried about runtime; there's constant deletion of sstables, so inode slots will get reused.

@avikivity
Member Author

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.

No. Closing files doesn't remove them from the inode cache. Even if it did, they would get placed in the same slots.

Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

Or vice versa, first open only data/index and then open all the rest.

@bhalevy
Member

bhalevy commented Jan 4, 2024

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.

No. Closing files doesn't remove them from the inode cache. Even if it did, they would get placed in the same slots.

Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

Or vice versa, first open only data/index and then open all the rest.

I think that delayed loading of data/index aligns better with the trajectory towards tiered storage.

@avikivity
Member Author

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.

No. Closing files doesn't remove them from the inode cache. Even if it did, they would get placed in the same slots.

Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

Or vice versa, first open only data/index and then open all the rest.

I think that delayed loading of data/index aligns better with the trajectory towards tiered storage.

I don't think it matters; it's localized to sstable_directory.

@bhalevy
Member

bhalevy commented Jan 4, 2024

@denesb, @raphaelsc how's that for an introductory project for @lkshminarayanan?

@denesb
Contributor

denesb commented Jan 5, 2024

@denesb, @raphaelsc how's that for an introductory project for @lkshminarayanan?

Looks good to me. It should be a good introduction to how we load sstable files.

lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Jan 23, 2024
Update the SSTable loader mechanism to allow loading the SSTable files
in two phases. First, load the files of all SSTables in a directory,
which will be closed immediately after loading. Later, load the
respective data and index files of all those SSTables. Loading them in
two phases will help reduce memory fragmentation that can occur in the
dentry/inode cache due to interleaving the opening of files that will be
closed immediately and the files that will be held open for a longer
duration.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Jan 26, 2024
During startup, the contents of the data directory are verified to ensure
that they have the right owner and permissions. Verifying all the
contents, which includes files that will be read and closed immediately,
and files that will be held open for longer durations, together, can
lead to memory fragmentation in the dentry/inode cache.

Mitigate this by updating the verification in such a way that these two
sets of files are verified separately, ensuring their separation in
the dentry/inode cache.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Jan 30, 2024
…ectory contents

During startup, the contents of the data directory are verified to ensure
that they have the right owner and permissions. Verifying all the
contents, which includes files that will be read and closed immediately,
and files that will be held open for longer durations, together, can
lead to memory fragmentation in the dentry/inode cache.

Prevent this by updating the verification in such a way that these two
sets of files are verified separately, ensuring their separation in
the dentry/inode cache.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
denesb added a commit that referenced this issue Feb 1, 2024
…fying data directory contents' from Lakshmi Narayanan Sreethar

During startup, the contents of the data directory are verified to ensure that they have the right owner and permissions. Verifying all the contents, which includes files that will be read and closed immediately, and files that will be held open for longer durations, together, can lead to memory fragmentation in the dentry/inode cache.

Mitigate this by updating the verification in such a way that these two sets of files are verified separately, ensuring their separation in the dentry/inode cache.

Fixes #14506

Closes #16952

* github.com:scylladb/scylladb:
  directories: prevent inode cache fragmentation by orderly verifying data directory contents
  directories: skip verifying data directory contents during startup
  directories: co-routinize create_and_verify
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
…ata directory contents

During startup, the contents of the data directory are verified to ensure
that they have the right owner and permissions. Verifying all the
contents, which includes files that will be read and closed immediately,
and files that will be held open for longer durations, together, can
lead to memory fragmentation in the dentry/inode cache.

Prevent this by updating the verification in such a way that these two
sets of files are verified separately, ensuring their separation in
the dentry/inode cache.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>