
Wasted memory in inode cache and dentry cache #14506

Closed
avikivity opened this issue Jul 4, 2023 · 49 comments · Fixed by #16952

Comments

@avikivity
Member

avikivity commented Jul 4, 2023

An sstable is composed of 7 files. They will be opened in close proximity. Two will remain open, five will be closed.

After some time, the kernel will run out of memory. It will start evicting non-open files. But because 2 out of 7 files are still open, memory fragmentation will prevent the slabs containing those inodes/dentries from being reclaimed.

Let's take for example 200k sstables. That's 200k*1k*7 = 1.4GB memory used, of which 200k*1k*5 = 1GB is wasted.
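
A back-of-the-envelope check of those figures, assuming roughly 1k of slab memory (inode + dentry) per cached file, as above:

    #include <cstdio>

    int main() {
        // Illustrative numbers from the paragraph above; ~1k per cached file
        // is an approximation, not a measured value.
        constexpr long sstables = 200'000;
        constexpr long files_per_sstable = 7;   // 2 stay open (Data, Index), 5 are closed
        constexpr long open_files = 2;
        constexpr long bytes_per_file = 1'000;

        constexpr long total  = sstables * files_per_sstable * bytes_per_file;
        constexpr long wasted = sstables * (files_per_sstable - open_files) * bytes_per_file;
        std::printf("total: %.1f GB, wasted: %.1f GB\n", total / 1e9, wasted / 1e9);
        // prints: total: 1.4 GB, wasted: 1.0 GB
    }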

@avikivity
Member Author

Possible solutions:

  • increase vm.vfs_cache_pressure sysctl dramatically; unlikely to work since there is no memory pressure during boot
  • change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.
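
A minimal sketch of the second option, using plain POSIX open/close rather than the actual loader code; the component list and directory layout are simplified assumptions for illustration:

    #include <fcntl.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    static const std::vector<std::string> short_lived = {
        "-TOC.txt", "-Statistics.db", "-Summary.db", "-Filter.db", "-CompressionInfo.db"};
    static const std::vector<std::string> long_lived = {"-Data.db", "-Index.db"};

    void load_sstables(const std::vector<std::string>& prefixes) {
        // Pass 1: open and immediately close everything that is read only once,
        // so the short-lived inodes/dentries get allocated next to each other.
        for (const auto& p : prefixes) {
            for (const auto& c : short_lived) {
                int fd = ::open((p + c).c_str(), O_RDONLY);
                if (fd >= 0) {
                    // ... parse the component here ...
                    ::close(fd);
                }
            }
        }
        // Pass 2: open the files that stay open for the sstable's lifetime.
        for (const auto& p : prefixes) {
            for (const auto& c : long_lived) {
                int fd = ::open((p + c).c_str(), O_RDONLY);
                (void)fd;   // kept open; ownership handed to the sstable object (omitted)
            }
        }
    }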

@avikivity
Member Author

/cc @xemul @tchaikov

@raphaelsc
Member

The 2nd proposed solution is easy to implement.

Another improvement I'd like to make is to reopen the file descriptor on the shard that really owns it: slab allocation is NUMA-aware today, so we can benefit from a fair distribution of resources across NUMA nodes.

Actually, by just reopening the file descriptor, I guess we'll cause a natural defragmentation of XFS's inode and dentry caches?! Worth testing.

@xemul
Contributor

xemul commented Jul 4, 2023

It will start evicting non-open files. But because 2 out of 7 files are still open, memory fragmentation will prevent the slabs containing those inodes/dentries from being reclaimed.

More sstables will appear and the kernel will re-use the holes in partially-filled slabs?

@avikivity
Member Author

It will start evicting non-open files. But because 2 out of 7 files are still open, memory fragmentation will prevent the slabs containing those inodes/dentries from being reclaimed.

More sstables will appear and the kernel will re-use the holes in partially-filled slabs?

With a very high-traffic node it may defragment itself after a while, but the odds are low. These are 16k slabs, so 4 sstables in the same slab need to be deleted.

@xemul
Contributor

xemul commented Jul 4, 2023

Actually, by just reopening the file descriptor, I guess we'll cause a natural defragmentation of XFS's inode and dentry caches?! Worth testing.

Closing a file doesn't cause its dentry+inode to be evicted. Quite the opposite -- the kernel caches dentries for closed files, and the respective inodes are pinned in memory too. Eviction (called "reclaim" in the kernel) happens either on page allocation failure (not small-object allocation failure) or in a background thread (but I'm not very confident about the thread; background reclaim has been very different across kernel versions).
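
One way to observe this, assuming an XFS data directory (a sketch; /proc/slabinfo is usually readable by root only): dump the dentry and xfs_inode slab lines before and after closing a batch of files and note that the active-object counts do not drop on close, only under reclaim.

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream slabinfo("/proc/slabinfo");
        std::string line;
        while (std::getline(slabinfo, line)) {
            // Lines start with the cache name, followed by active_objs,
            // num_objs, objsize, objperslab, pagesperslab, ...
            if (line.rfind("dentry", 0) == 0 || line.rfind("xfs_inode", 0) == 0) {
                std::cout << line << '\n';
            }
        }
    }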

@avikivity
Member Author

And with a low-traffic node, we're stuck with 25% utilization.

@avikivity
Member Author

Yes, the idea is to allocate long-lived inodes together, and short-lived inodes together, so that if/when the short-lived inodes are reclaimed they free a slab.
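
A toy model of that argument (assumed numbers: 17 objects per 16k inode slab, 7 inodes per sstable of which 2 are pinned, sequential slab filling; not a measurement):

    #include <cstdio>
    #include <vector>

    constexpr int objs_per_slab = 17;

    // Count slabs that contain only short-lived (reclaimable) objects.
    // pinned[i] is nonzero if the i-th allocated object is long-lived.
    static int fully_reclaimable_slabs(const std::vector<int>& pinned) {
        int free_slabs = 0;
        for (size_t i = 0; i < pinned.size(); i += objs_per_slab) {
            bool slab_pinned = false;
            for (size_t j = i; j < pinned.size() && j < i + objs_per_slab; ++j) {
                if (pinned[j]) {
                    slab_pinned = true;
                }
            }
            free_slabs += !slab_pinned;
        }
        return free_slabs;
    }

    int main() {
        constexpr int sstables = 1000;
        std::vector<int> interleaved, grouped;
        for (int i = 0; i < sstables; ++i) {
            // Per sstable: 2 long-lived inodes (Data, Index) + 5 short-lived ones.
            interleaved.insert(interleaved.end(), {1, 1, 0, 0, 0, 0, 0});
        }
        grouped.insert(grouped.end(), sstables * 2, 1);   // all long-lived first
        grouped.insert(grouped.end(), sstables * 5, 0);   // all short-lived after

        std::printf("interleaved: %d fully reclaimable slabs\n", fully_reclaimable_slabs(interleaved));
        std::printf("grouped:     %d fully reclaimable slabs\n", fully_reclaimable_slabs(grouped));
        // interleaved: 0 fully reclaimable slabs; grouped: most of them (294 of 412 here).
    }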

@xemul
Contributor

xemul commented Jul 4, 2023

There's the kmem controller in the cgroups area. When allocations come from different groups, the kmem cache dedicates different slab pages to each. If threads can live in different kmem groups and this thing still works 🤞 it can be an option too

@xemul
Contributor

xemul commented Jul 4, 2023

With a very high-traffic node it may defragment itself after a while, but the odds are low.

I don't get why. If the traffic is low, we have 25% utilization of a few pages, so it's hardly a problem. If the traffic is high, then those 5 files that are opened once (or twice) for a short time should be reclaimed and then re-allocated by newly appearing sstables.

@avikivity
Member Author

Traffic is low ≠ few files. We've seen a misconfigured TWCS table with ~200k sstables. That translated to ~800k files and ~1GB slab, instead of 200MB slab.

@avikivity
Member Author

Unicode fail, I don't understand why the =/= character (U+2260) can't be copied here.

@avikivity
Member Author

≢ <- this does work

@raphaelsc
Member

Unicode fail, I don't understand why the =/= character (U+2260) can't be copied here.

That's why email is better. I was able to see your symbol from my mail reader :-)

@avikivity
Member Author

⍯ <- this too

@avikivity
Member Author

Maybe it's just misrendered locally.

@michoecho
Contributor

Unicode fail, I don't understand why the =/= character (U+2260) can't be copied here.

You mean the ≠ in low ≠ few above? I'm seeing it normally on GitHub. Maybe you have font problems.

@avikivity
Member Author

In your quote, the first ≠ is misrendered and I can select individual glyphs (but it's also kerned!). The second ≠ is rendered correctly (so for sure font related, since the second instance uses a different font).

@xemul
Contributor

xemul commented Jul 4, 2023

Ah, now I see it. It's ~two-step. First, the inode cache is populated with a [used, used, unused, unused, unused, unused, unused]* pattern, then OOM comes and translates it into a [used, used, freed, freed, freed, freed, freed]* one :(

I remember that Christopher tried to do something about it long ago, e.g. -- https://lwn.net/Articles/371892/

This could be a justification for rewriting Linux in C++ with the ability to std::move() dentries+inodes during slab reclaim 🦸

@avikivity
Member Author

Yes, about 4000 years ago @tgrabiec demonstrated the problem (IIRC on seastar memcached), and everyone has been busy admiring LSA ever since.

avikivity added a commit to avikivity/scylladb that referenced this issue Jul 4, 2023
Our usage of inodes is dual:

 - the Index.db and Data.db components are pinned in memory as
   the files are open
 - all other components are read once and never looked at again

As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.

Ref scylladb#14506
@avikivity
Member Author

↑ The patch implements the first proposal, but it won't solve the problem without the second, due to fragmentation.
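
For reference, the knob being tuned here is exposed via procfs; a sketch of setting it directly (an illustration of the kernel interface, not necessarily how the patch applies it):

    #include <fstream>

    int main() {
        // Needs root; equivalent to `sysctl -w vm.vfs_cache_pressure=2000`.
        std::ofstream f("/proc/sys/vm/vfs_cache_pressure");
        if (f) {
            f << 2000 << '\n';
        }
    }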

@avikivity
Member Author

← celebrating Unicode and the Compose key

@xemul
Contributor

xemul commented Jul 4, 2023

change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.

There are ~20 dentries per slab and ~15+ xfs inodes per slab (21 and 17 in the original issue's kernel). Opening all files in one go, then closing all but Index and Data, will leave the [used, used, unused, unused, unused, unused, unused] pattern in the slab, so even 20x harder pressure on slabs won't help reclaim this page. So what's the point?
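
For context, those per-slab counts follow from typical object sizes (assumed values; the authoritative numbers are the objsize and objperslab columns in /proc/slabinfo on the actual host):

    #include <cstdio>

    int main() {
        // ~192-byte dentries in a 4 KiB slab, ~960-byte xfs_inodes in a 16 KiB slab.
        std::printf("dentries per slab:   %d\n", 4096 / 192);    // 21
        std::printf("xfs_inodes per slab: %d\n", 16384 / 960);   // 17
    }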

@raphaelsc
Member

change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.

There are ~20 dentries per slab and ~15+ xfs inodes per slab (21 and 17 in the original issue's kernel). Opening all files in one go, then closing all but Index and Data, will leave the [used, used, unused, unused, unused, unused, unused] pattern in the slab, so even 20x harder pressure on slabs won't help reclaim this page. So what's the point?

I think the point is to open the long-lived files first and the short-lived ones later. So it will be used used used ... unused unused ...; then reclaiming will have an easy job freeing from the back

@raphaelsc
Member

change the sstable loading system to change the temporal locality of sstable opening; for example, first open and immediately close all components except Index/Data, then load normally. This will put all reclaimable inodes/dentries close together.

There are ~20 dentries per slab and ~15+ xfs inodes per slab (21 and 17 in the original issue's kernel). Opening all files in one go, then closing all but Index and Data, will leave the [used, used, unused, unused, unused, unused, unused] pattern in the slab, so even 20x harder pressure on slabs won't help reclaim this page. So what's the point?

I think the point is to open the long-lived files first and the short-lived ones later. So it will be used used used ... unused unused ...; then reclaiming will have an easy job freeing from the back

But note that patching the sstable loader only defers the issue, as newer sstables are introduced one at a time, so we cannot prevent fragmentation in the long run

@avikivity
Member Author

In the long run sstables replace each other. A deleted inode is freed from its slab and the next sstable to be created will take its place.

The problem only happens during initial load.

@xemul
Contributor

xemul commented Jul 4, 2023

I think the point is to open the long-lived files first and the short-lived ones later.

Long-lived files from several sstables first, then the short-lived ones from the same set

newer sstables are introduced one at a time,

Exactly

so we cannot prevent fragmentation in the long run

... unless long-lived files are eventually closed; AFAIU, we expect that compaction should provide that in the long run

@avikivity
Member Author

I was thinking of opening ALL the short-lived components first. We have to open -TOC.txt first, so at that point we may as well open all the other short-lived components.

Then a second pass to open Index/Data.

xemul pushed a commit that referenced this issue Jul 10, 2023
Our usage of inodes is dual:

 - the Index.db and Data.db components are pinned in memory as
   the files are open
 - all other components are read once and never looked at again

As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.

Ref #14506

Closes #14509
@avikivity
Member Author

We might also tune vm.min_slab_ratio

min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone.  On Zone reclaim
(fallback from the local zone occurs) slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This insures that the slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.

If we set it to 1%, then as soon as we reach 1% we'll evict inodes and new sstables will start to fill the holes. 5% is too close to our 7% reserve.

@xemul
Contributor

xemul commented Jul 12, 2023

LKML might appreciate an fcntl extension (or an O_SOMETHING flag) that would prevent a file's dentry and inode from being cached

@DoronArazii added this to the Backlog milestone Aug 22, 2023
@bhalevy
Member

bhalevy commented Jan 4, 2024

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today
  2. close the data and index files of all sstables in the directory (while keeping the metadata components from pass one)
  3. reopen the data and index files

For newly written sstables (by memtable flush or compaction) we can consider a background crawler in the sstables manager that will process a batch of sstables, closing and reopening the data and index in each.
We'll need to synchronize this with readers, and probably allow on-demand opening of the data and index streams if needed.

@avikivity
Member Author

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

  2. close the data and index files of all sstables in the directory (while keeping the metadata components from pass one)
  3. reopen the data and index files

For newly written sstables (by memtable flush or compaction) we can consider a background crawler in the sstables manager that will process a batch of sstables, closing and reopening the data and index in each. We'll need to synchronize this with readers, and probably allow on-demand opening of the data and index streams if needed.

Opening and closing does nothing for the inode cache.

I'm less worried about runtime; there's constant deletion of sstables, so inode slots will get reused.

@avikivity
Member Author

I'm less worried about runtime; there's constant deletion of sstables, so inode slots will get reused.

Well, except in the streaming/repair case.

@bhalevy
Member

bhalevy commented Jan 4, 2024

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.
Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

  2. close the data and index files of all sstables in the directory (while keeping the metadata components from pass one)
  3. reopen the data and index files

For newly written sstables (by memtable flush or compaction) we can consider a background crawler in the sstables manager that will process a batch of sstables, closing and reopening the data and index in each. We'll need to synchronize this with readers, and probably allow on-demand opening of the data and index streams if needed.

Opening and closing does nothing for the inode cache.

I'm less worried about runtime; there's constant deletion of sstables, so inode slots will get reused.

@avikivity
Member Author

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.

No. Closing files doesn't remove them from the inode cache. Even if it did, they would get placed in the same slots.

Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

Or vice versa, first open only data/index and then open all the rest.

@bhalevy
Member

bhalevy commented Jan 4, 2024

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.

No. Closing files doesn't remove them from the inode cache. Even if it did, they would get placed in the same slots.

Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

Or vice versa, first open only data/index and then open all the rest.

I think that delayed loading of data/index aligns better with the trajectory towards tiered storage.

@avikivity
Member Author

Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:

  1. open all sstables normally as we do today

This is enough to place the in-memory and in-disk inodes together, ruining the ability to reclaim slabs.

I thought that the next step would resolve the fragmentation issue.

No. Closing files doesn't remove them from the inode cache. Even if it did, they would get placed in the same slots.

Do we need to open the data and index files only after the first phase that will only check for their existence and load the sstables metadata?

Or vice versa, first open only data/index and then open all the rest.

I think that delayed loading of data/index aligns better with the trajectory towards tiered storage.

I don't think it matters; it's localized to sstable_directory.

@bhalevy
Member

bhalevy commented Jan 4, 2024

@denesb, @raphaelsc how's that for an introductory project for @lkshminarayanan?

@denesb
Contributor

denesb commented Jan 5, 2024

@denesb, @raphaelsc how's that for an introductory project for @lkshminarayanan?

Looks good to me. It should be a good introduction to how we load sstable files.

lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Jan 23, 2024
Update the SSTable loader mechanism to allow loading the SSTable files
in two phases. First, load the files of all SSTables in a directory,
which will be closed immediately after loading. Later, load the
respective data and index files of all those SSTables. Loading them in
two phases will help reduce memory fragmentation that can occur in the
dentry/inode cache due to interleaving the opening of files that will be
closed immediately and the files that will be held open for a longer
duration.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Jan 26, 2024
During startup, the contents of the data directory are verified to ensure
that they have the right owner and permissions. Verifying all the
contents, which includes files that will be read and closed immediately,
and files that will be held open for longer durations, together, can
lead to memory fragmentation in the dentry/inode cache.

Mitigate this by updating the verification in such a way that these two
sets of files are verified separately, ensuring their separation in
the dentry/inode cache.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue Jan 30, 2024
…ectory contents

During startup, the contents of the data directory are verified to ensure
that they have the right owner and permissions. Verifying all the
contents, which includes files that will be read and closed immediately,
and files that will be held open for longer durations, together, can
lead to memory fragmentation in the dentry/inode cache.

Prevent this by updating the verification in such a way that these two
sets of files are verified separately, ensuring their separation in
the dentry/inode cache.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
denesb added a commit that referenced this issue Feb 1, 2024
…fying data directory contents' from Lakshmi Narayanan Sreethar

During startup, the contents of the data directory are verified to ensure that they have the right owner and permissions. Verifying all the contents, which includes files that will be read and closed immediately, and files that will be held open for longer durations, together, can lead to memory fragmentation in the dentry/inode cache.

Mitigate this by updating the verification in such a way that these two sets of files are verified separately, ensuring their separation in the dentry/inode cache.

Fixes #14506

Closes #16952

* github.com:scylladb/scylladb:
  directories: prevent inode cache fragmentation by orderly verifying data directory contents
  directories: skip verifying data directory contents during startup
  directories: co-routinize create_and_verify
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
…ata directory contents

During startup, the contents of the data directory are verified to ensure
that they have the right owner and permissions. Verifying all the
contents, which includes files that will be read and closed immediately,
and files that will be held open for longer durations, together, can
lead to memory fragmentation in the dentry/inode cache.

Prevent this by updating the verification in such a way that these two
sets of files are verified separately, ensuring their separation in
the dentry/inode cache.

Fixes scylladb#14506

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>