
cloud_storage: holding open files prevents filesystem reclaiming space (full cache drive) #3334

Closed
jcsp opened this issue Dec 21, 2021 · 11 comments · Fixed by #3348 or #3392
Assignees: jcsp
Labels: area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working)

Comments

jcsp (Contributor) commented Dec 21, 2021

This system had been subject to a bunch of previous work, but the workload running at the time of failure was 64 random readers trying to read from a topic with 1G segments. This was intended to check whether a 50GB cache limit would be enforced against 64GB of concurrent IO, but it ended up surfacing a separate issue.

The cache XFS filesystem is full, even though listing out all the files and summing their sizes gives a total within the configured limit.

The limit is set to 50GB:

  cloud_storage_cache_directory: /var/lib/shadow-index-cache
  cloud_storage_cache_size: "50000000000"

The filesystem is in a state where du and df disagree:

$ df -h /var/lib/shadow-index-cache/
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        750G  750G   76K 100% /var/lib/shadow-index-cache

$ du -sh /var/lib/shadow-index-cache/
40G	/var/lib/shadow-index-cache/

$ du -sh --apparent-size /var/lib/shadow-index-cache/
41G	/var/lib/shadow-index-cache/

There are far more tiny files/dirs than there are actual log segments in cache:

redpanda@rp-7985e61-1:/var/lib/shadow-index-cache$ ls -lh `find ./ -name "*.log" ` | wc -l
39
redpanda@rp-7985e61-1:/var/lib/shadow-index-cache$ find ./ | wc -l
115876

To its credit, redpanda isn't crashing, but it is permanently failing to fulfil read requests:

[pod/rp-7985e61-1/redpanda] ERROR 2021-12-21 11:05:04,518 [shard 2] cloud_storage - [fiber34~34~136~0|1|58866ms] - remote.cc:107 - System error std::__1::__fs::filesystem::filesystem_error (error system:28, filesystem error: open failed: No space left on device [/var/lib/shadow-index-cache/b3fe651d/kafka/verifier-10-s-gig/1_666/336780-1-v1.log_2_772.part])
[pod/rp-7985e61-1/redpanda] WARN  2021-12-21 11:05:04,518 [shard 2] cloud_storage - [fiber34~34~136~0|1|58866ms] - remote.cc:462 - Downloading segment from {violet-desert-kangaroo-vectorized-archival}, {failed}, segment at {"b3fe651d/kafka/verifier-10-s-gig/1_666/336780-1-v1.log"} not available
jcsp added the kind/bug (Something isn't working) and area/cloud-storage (Shadow indexing subsystem) labels Dec 21, 2021
Lazin (Contributor) commented Dec 21, 2021

Maybe we're putting too much pressure on the filesystem by keeping the random prefix here. I think it's safe to just remove it.

jcsp (Contributor, Author) commented Dec 21, 2021

Deleting directories didn't free any blocks, but killing redpanda released all of them, bringing df roughly back in line with du, so it looks like redpanda was failing to close some files.

jcsp changed the title from "cloud_storage: fails to enforce cache size limit if filesystem is full but file sizes appear within bound" to "cloud_storage: holding open files prevents filesystem reclaiming space (full cache drive)" Dec 21, 2021
Lazin (Contributor) commented Dec 21, 2021

Maybe there were a lot of .part files? It looks like there were only 39 completed downloads but more than 100K files in total. The other files could only be .part files, which are incomplete downloads.

jcsp (Contributor, Author) commented Dec 21, 2021

.part files would have shown up in the du sums -- the higher block counts in df indicate content that wasn't linked into the filesystem tree at all (like deleted files still held open by a running process).

Because the cache cleanup function runs independently, it makes sense that nothing is poking things like the reader cache to close their open handles, although the sheer number was surprising (a 750GB filesystem filled up with a mixture of 1GB and 64MB segments).
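
For reference, the held-open files can be confirmed without killing the process: unlinked-but-still-open files show up as "(deleted)" symlink targets under /proc/<pid>/fd (something like lsof +L1 against the redpanda process shows the same). A minimal standalone scanner, purely illustrative and not part of redpanda:

// Illustrative diagnostic: list descriptors of a process that still point at
// deleted files. On Linux, /proc/<pid>/fd/<n> is a symlink whose target gets
// " (deleted)" appended once the file has been unlinked while still open.
#include <filesystem>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <pid>\n";
        return 1;
    }
    namespace fs = std::filesystem;
    fs::path fd_dir = fs::path("/proc") / argv[1] / "fd";
    for (const auto& entry : fs::directory_iterator(fd_dir)) {
        std::error_code ec;
        fs::path target = fs::read_symlink(entry.path(), ec);
        if (ec) {
            continue; // the fd may have been closed while we were iterating
        }
        if (target.native().find(" (deleted)") != std::string::npos) {
            std::cout << entry.path().filename().string() << " -> "
                      << target.string() << '\n';
        }
    }
    return 0;
}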

Lazin (Contributor) commented Dec 21, 2021

Got it, so in some cases we open files without closing them properly; they get deleted from the cache, but their inodes and data blocks are not reclaimed.
Yep, most likely related to the previous bug.
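
That is standard POSIX behaviour and is at the root of the symptom: once the cache trim unlinks a segment file, du and find no longer see it, but the data blocks only come back when the last open descriptor is closed. A minimal standalone demonstration (illustrative only; the file name and write size are arbitrary, and it is only meaningful if run on the cache mount):

// Illustrative only: shows that an unlinked file's blocks are reclaimed
// only when the last open descriptor is closed.
#include <fcntl.h>
#include <sys/statvfs.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

static unsigned long long free_bytes(const char* path) {
    struct statvfs vfs{};
    statvfs(path, &vfs);
    return (unsigned long long)vfs.f_bavail * vfs.f_frsize;
}

int main() {
    const char* dir = ".";              // run from the cache mount point
    const char* file = "./unlink-demo.tmp";

    int fd = open(file, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;
    std::vector<char> buf(1 << 20, 'x');
    for (int i = 0; i < 256; ++i) {     // write ~256 MiB
        if (write(fd, buf.data(), buf.size()) < 0) return 1;
    }
    fsync(fd);

    unlink(file);                       // no longer visible to du/find...
    std::printf("after unlink: %llu free bytes\n", free_bytes(dir));

    close(fd);                          // ...but blocks come back only here
    std::printf("after close:  %llu free bytes\n", free_bytes(dir));
    return 0;
}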

VadimPlh (Contributor) commented Dec 30, 2021

From my test:
I did not see these errors in the log.
The cache is set to 20 GB.

But it also contains data from an old topic with a 10 MB segment size.

cloud_storage_cache_directory: /var/lib/shadow-index-cache
  cloud_storage_cache_size: "20000000000"


redpanda@rp-a766973-0:/$ df -h /var/lib/shadow-index-cache/
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        750G  347G  404G  47% /var/lib/shadow-index-cache
redpanda@rp-a766973-0:/$ du -sh /var/lib/shadow-index-cache/
17G	/var/lib/shadow-index-cache/
redpanda@rp-a766973-0:/$ du -sh --apparent-size /var/lib/shadow-index-cache/
17G	/var/lib/shadow-index-cache/
redpanda@rp-a766973-0:/var/lib/shadow-index-cache$ ls -lh `find ./ -name "*.log" ` | wc -l
17507
redpanda@rp-a766973-0:/var/lib/shadow-index-cache$  find ./ | wc -l
70442

Lazin (Contributor) commented Dec 30, 2021

The find output is larger in the second case because find prints all subdirectories, and we don't delete directories from the cache.
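
For completeness, here is a quick way to break that count down by entry type. This is purely an illustrative sketch, not something in redpanda:

// Illustrative: count regular files vs directories under the cache root.
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    std::size_t files = 0, dirs = 0, other = 0;
    for (const auto& e : fs::recursive_directory_iterator(
           "/var/lib/shadow-index-cache",
           fs::directory_options::skip_permission_denied)) {
        if (e.is_directory()) {
            ++dirs;
        } else if (e.is_regular_file()) {
            ++files;
        } else {
            ++other;
        }
    }
    std::cout << "regular files: " << files << ", directories: " << dirs
              << ", other: " << other << '\n';
}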

Lazin (Contributor) commented Dec 30, 2021

Another interesting thing is that the file names are different since we added the term to the name, but the cache is still able to find and delete them anyway.

jcsp (Contributor, Author) commented Jan 4, 2022

On 21.11.3-si-beta5, I can reproduce the disk space leak (i.e. divergence between df and du).

I'm going to work on an automated test for this, with whatever instrumentation we need to check it directly rather than waiting for large clusters to run for some time.
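
One shape such a check could take, as a rough sketch only: compare what statvfs reports as used (df's view) against the sum of visible file sizes under the cache directory (du's view). The paths, slack threshold, and overall structure here are assumptions for illustration, not the test that was actually written:

// Illustrative leak check: a large gap between filesystem-level usage and
// the sum of visible file sizes suggests blocks pinned by open, unlinked files.
#include <sys/statvfs.h>
#include <cstdint>
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    const char* cache_dir = "/var/lib/shadow-index-cache";

    struct statvfs vfs{};
    if (statvfs(cache_dir, &vfs) != 0) return 2;
    std::uint64_t fs_used
      = std::uint64_t(vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize;

    std::uint64_t visible = 0;
    for (const auto& e : fs::recursive_directory_iterator(
           cache_dir, fs::directory_options::skip_permission_denied)) {
        if (e.is_regular_file()) {
            visible += e.file_size();
        }
    }

    constexpr std::uint64_t slack = 1ull << 30; // 1 GiB of slack, arbitrary
    if (fs_used > visible + slack) {
        std::cerr << "leak suspected: filesystem reports " << fs_used
                  << " bytes used but only " << visible
                  << " bytes of visible files\n";
        return 1;
    }
    return 0;
}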

jcsp reopened this Jan 4, 2022
Lazin (Contributor) commented Jan 4, 2022 via email

jcsp self-assigned this Jan 4, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 4, 2022
The garbage collection of remote_segment objects
was conditional on the reference count in the
segment's sharded_ptr being 1 (i.e. nothing else
is using it).

This is not true if there are entries in the `readers`
attribute of materialized_segment_state, because
readers hold a reference to the segment.

Deal with this by proactively evicting readers
from any materialized_segment_state objects
whose atime makes them eligible for removal.

Fixes: redpanda-data#3334

Signed-off-by: John Spray <jcs@vectorized.io>
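
Reduced to a compilable sketch, the approach reads roughly as follows. The names remote_segment, materialized_segment_state, readers, and atime follow the commit text; everything else (std::shared_ptr standing in for the segment's sharded_ptr, the maybe_gc signature, the surrounding loop) is illustrative rather than the actual redpanda implementation:

// Sketch of the fix described above: evict readers from stale segments first,
// so the segment's reference count can actually drop to 1 and be collected.
#include <chrono>
#include <list>
#include <memory>

struct remote_segment {};                        // stands in for the object holding the open cache file

struct remote_segment_reader {
    std::shared_ptr<remote_segment> segment;     // readers keep the segment (and its file handle) alive
};

struct materialized_segment_state {
    std::shared_ptr<remote_segment> segment;
    std::list<std::unique_ptr<remote_segment_reader>> readers;
    std::chrono::steady_clock::time_point atime; // last access time
};

void maybe_gc(std::list<materialized_segment_state>& segments,
              std::chrono::steady_clock::duration max_idle) {
    const auto now = std::chrono::steady_clock::now();
    for (auto it = segments.begin(); it != segments.end();) {
        const bool stale = (now - it->atime) > max_idle;
        if (stale) {
            // Previously GC skipped any segment with use_count() > 1, which was
            // always true while readers existed, so nothing was ever released.
            it->readers.clear();
        }
        if (stale && it->segment.use_count() == 1) {
            it = segments.erase(it);             // last reference gone: the file can be closed
        } else {
            ++it;
        }
    }
}

int main() {
    std::list<materialized_segment_state> segments;
    auto& st = segments.emplace_back();
    st.segment = std::make_shared<remote_segment>();
    st.atime = std::chrono::steady_clock::now() - std::chrono::hours(1);

    auto reader = std::make_unique<remote_segment_reader>();
    reader->segment = st.segment;                // the reader pins the segment
    st.readers.push_back(std::move(reader));

    maybe_gc(segments, std::chrono::minutes(10));
    return segments.empty() ? 0 : 1;             // 0: the stale segment was reclaimed
}
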
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 4, 2022
jcsp closed this as completed in #3392 Jan 5, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 5, 2022
(cherry picked from commit 42945cb)
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 11, 2022
(cherry picked from commit 42945cb)
jcsp (Contributor, Author) commented Jan 11, 2022

Backporting to 21.11.x here:
