
cloud_storage: holding open files prevents filesystem reclaiming space (full cache drive) #3334

Closed
jcsp opened this issue Dec 21, 2021 · 11 comments · Fixed by #3348 or #3392
Assignees: jcsp
Labels: area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working)

Comments

jcsp (Contributor) commented Dec 21, 2021

This system had been subject to a bunch of previous work, but the workload running at the time of failure was 64 random readers trying to read from a topic with 1G segments. This was intended to check whether a 50GB cache limit would be enforced against 64GB of concurrent IO, but it ended up surfacing a separate issue.

The cache XFS filesystem is full, even though listing out all the files and summing their sizes gives a total within the configured limit.

The limit is set to 50GB:

  cloud_storage_cache_directory: /var/lib/shadow-index-cache
  cloud_storage_cache_size: "50000000000"

The filesystem is in a state where du and df disagree:

$ df -h /var/lib/shadow-index-cache/
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        750G  750G   76K 100% /var/lib/shadow-index-cache

$ du -sh /var/lib/shadow-index-cache/
40G	/var/lib/shadow-index-cache/

$ du -sh --apparent-size /var/lib/shadow-index-cache/
41G	/var/lib/shadow-index-cache/

There are far more tiny files/dirs than there are actual log segments in cache:

redpanda@rp-7985e61-1:/var/lib/shadow-index-cache$ ls -lh `find ./ -name "*.log" ` | wc -l
39
redpanda@rp-7985e61-1:/var/lib/shadow-index-cache$ find ./ | wc -l
115876

To its credit, redpanda isn't crashing, but it is permanently failing to fulfil read requests:

[pod/rp-7985e61-1/redpanda] ERROR 2021-12-21 11:05:04,518 [shard 2] cloud_storage - [fiber34~34~136~0|1|58866ms] - remote.cc:107 - System error std::__1::__fs::filesystem::filesystem_error (error system:28, filesystem error: open failed: No space left on device [/var/lib/shadow-index-cache/b3fe651d/kafka/verifier-10-s-gig/1_666/336780-1-v1.log_2_772.part])
[pod/rp-7985e61-1/redpanda] WARN  2021-12-21 11:05:04,518 [shard 2] cloud_storage - [fiber34~34~136~0|1|58866ms] - remote.cc:462 - Downloading segment from {violet-desert-kangaroo-vectorized-archival}, {failed}, segment at {"b3fe651d/kafka/verifier-10-s-gig/1_666/336780-1-v1.log"} not available
jcsp added the kind/bug (Something isn't working) and area/cloud-storage (Shadow indexing subsystem) labels Dec 21, 2021
Lazin (Contributor) commented Dec 21, 2021

Maybe we're putting too much pressure on the filesystem by keeping the random prefix here. I think it's safe to just remove it.

jcsp (Contributor, Author) commented Dec 21, 2021

Deleting directories didn't free any blocks, but killing redpanda released all of them, bringing df roughly back in line with du, so it looks like redpanda was failing to close some files.

jcsp changed the title from "cloud_storage: fails to enforce cache size limit if filesystem is full but file sizes appear within bound" to "cloud_storage: holding open files prevents filesystem reclaiming space (full cache drive)" Dec 21, 2021
Lazin (Contributor) commented Dec 21, 2021

Maybe there were a lot of .part files? It looks like there were only 39 completed downloads but more than 100K files in total. The other files could only be .part files, which are incomplete downloads.

jcsp (Contributor, Author) commented Dec 21, 2021

.part files would have shown up in the du sums -- the higher block counts in df indicate content that wasn't linked into the filesystem tree at all (like deleted files still held open by a running process).

Because the cache cleanup function runs independently, it makes sense that nothing is poking things like the reader cache to close their open handles, although the sheer number was surprising (a 750GB filesystem filled up with a mixture of 1GB and 64MB segments).
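
For reference, the held-open files can be confirmed without killing the process: unlinked-but-still-open files show up as "(deleted)" symlink targets under /proc/<pid>/fd (something like lsof +L1 against the redpanda process shows the same). A minimal standalone scanner, purely illustrative and not part of redpanda:

// Illustrative diagnostic: list descriptors of a process that still point at
// deleted files. On Linux, /proc/<pid>/fd/<n> is a symlink whose target gets
// " (deleted)" appended once the file has been unlinked while still open.
#include <filesystem>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <pid>\n";
        return 1;
    }
    namespace fs = std::filesystem;
    fs::path fd_dir = fs::path("/proc") / argv[1] / "fd";
    for (const auto& entry : fs::directory_iterator(fd_dir)) {
        std::error_code ec;
        fs::path target = fs::read_symlink(entry.path(), ec);
        if (ec) {
            continue; // the fd may have been closed while we were iterating
        }
        if (target.native().find(" (deleted)") != std::string::npos) {
            std::cout << entry.path().filename().string() << " -> "
                      << target.string() << '\n';
        }
    }
    return 0;
}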

Lazin (Contributor) commented Dec 21, 2021

Got it, so in some cases we open files without closing them properly; they get deleted from the cache, but their inodes and data blocks are not reclaimed.
Yep, most likely related to the previous bug.
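
That is standard POSIX behaviour and is at the root of the symptom: once the cache trim unlinks a segment file, du and find no longer see it, but the data blocks only come back when the last open descriptor is closed. A minimal standalone demonstration (illustrative only; the file name and write size are arbitrary, and it is only meaningful if run on the cache mount):

// Illustrative only: shows that an unlinked file's blocks are reclaimed
// only when the last open descriptor is closed.
#include <fcntl.h>
#include <sys/statvfs.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

static unsigned long long free_bytes(const char* path) {
    struct statvfs vfs{};
    statvfs(path, &vfs);
    return (unsigned long long)vfs.f_bavail * vfs.f_frsize;
}

int main() {
    const char* dir = ".";              // run from the cache mount point
    const char* file = "./unlink-demo.tmp";

    int fd = open(file, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;
    std::vector<char> buf(1 << 20, 'x');
    for (int i = 0; i < 256; ++i) {     // write ~256 MiB
        if (write(fd, buf.data(), buf.size()) < 0) return 1;
    }
    fsync(fd);

    unlink(file);                       // no longer visible to du/find...
    std::printf("after unlink: %llu free bytes\n", free_bytes(dir));

    close(fd);                          // ...but blocks come back only here
    std::printf("after close:  %llu free bytes\n", free_bytes(dir));
    return 0;
}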

VadimPlh (Contributor) commented Dec 30, 2021

From my test:
I did not see these errors in the log.
The cache is set to 20 GB.

But it also contains data from an old topic with a 10 MB segment size.

cloud_storage_cache_directory: /var/lib/shadow-index-cache
  cloud_storage_cache_size: "20000000000"


redpanda@rp-a766973-0:/$ df -h /var/lib/shadow-index-cache/
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        750G  347G  404G  47% /var/lib/shadow-index-cache
redpanda@rp-a766973-0:/$ du -sh /var/lib/shadow-index-cache/
17G	/var/lib/shadow-index-cache/
redpanda@rp-a766973-0:/$ du -sh --apparent-size /var/lib/shadow-index-cache/
17G	/var/lib/shadow-index-cache/
redpanda@rp-a766973-0:/var/lib/shadow-index-cache$ ls -lh `find ./ -name "*.log" ` | wc -l
17507
redpanda@rp-a766973-0:/var/lib/shadow-index-cache$  find ./ | wc -l
70442

Lazin (Contributor) commented Dec 30, 2021

The find output is larger in the second case because find prints all subdirectories, and we don't delete directories from the cache.
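
For completeness, here is a quick way to break that count down by entry type. This is purely an illustrative sketch, not something in redpanda:

// Illustrative: count regular files vs directories under the cache root.
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    std::size_t files = 0, dirs = 0, other = 0;
    for (const auto& e : fs::recursive_directory_iterator(
           "/var/lib/shadow-index-cache",
           fs::directory_options::skip_permission_denied)) {
        if (e.is_directory()) {
            ++dirs;
        } else if (e.is_regular_file()) {
            ++files;
        } else {
            ++other;
        }
    }
    std::cout << "regular files: " << files << ", directories: " << dirs
              << ", other: " << other << '\n';
}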

Lazin (Contributor) commented Dec 30, 2021

Another interesting thing is that the file names are different since we added the term to the name, but the cache is still able to find and delete them anyway.

jcsp (Contributor, Author) commented Jan 4, 2022

On 21.11.3-si-beta5, I can reproduce the disk space leak (i.e. divergence between df and du).

I'm going to work on an automated test for this, with whatever instrumentation we need to check it directly rather than waiting for large clusters to run for some time.
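
One shape such a check could take, as a rough sketch only: compare what statvfs reports as used (df's view) against the sum of visible file sizes under the cache directory (du's view). The paths, slack threshold, and overall structure here are assumptions for illustration, not the test that was actually written:

// Illustrative leak check: a large gap between filesystem-level usage and
// the sum of visible file sizes suggests blocks pinned by open, unlinked files.
#include <sys/statvfs.h>
#include <cstdint>
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    const char* cache_dir = "/var/lib/shadow-index-cache";

    struct statvfs vfs{};
    if (statvfs(cache_dir, &vfs) != 0) return 2;
    std::uint64_t fs_used
      = std::uint64_t(vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize;

    std::uint64_t visible = 0;
    for (const auto& e : fs::recursive_directory_iterator(
           cache_dir, fs::directory_options::skip_permission_denied)) {
        if (e.is_regular_file()) {
            visible += e.file_size();
        }
    }

    constexpr std::uint64_t slack = 1ull << 30; // 1 GiB of slack, arbitrary
    if (fs_used > visible + slack) {
        std::cerr << "leak suspected: filesystem reports " << fs_used
                  << " bytes used but only " << visible
                  << " bytes of visible files\n";
        return 1;
    }
    return 0;
}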

jcsp reopened this Jan 4, 2022
Lazin (Contributor) commented Jan 4, 2022 via email

jcsp self-assigned this Jan 4, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 4, 2022
The garbage collection of remote_segment objects
was conditional on the reference count in the
segment's sharded_ptr being 1 (i.e. nothing else
is using it).

This is not true if there are entries in the `readers`
attribute of materialized_segment_state, because
readers hold a reference to the segment.

Deal with this by proactively evicting readers
from any materialized_segment_state objects
whose atime makes them eligible for removal.

Fixes: redpanda-data#3334

Signed-off-by: John Spray <jcs@vectorized.io>
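
Reduced to a compilable sketch, the approach reads roughly as follows. The names remote_segment, materialized_segment_state, readers, and atime follow the commit text; everything else (std::shared_ptr standing in for the segment's sharded_ptr, the maybe_gc signature, the surrounding loop) is illustrative rather than the actual redpanda implementation:

// Sketch of the fix described above: evict readers from stale segments first,
// so the segment's reference count can actually drop to 1 and be collected.
#include <chrono>
#include <list>
#include <memory>

struct remote_segment {};                        // stands in for the object holding the open cache file

struct remote_segment_reader {
    std::shared_ptr<remote_segment> segment;     // readers keep the segment (and its file handle) alive
};

struct materialized_segment_state {
    std::shared_ptr<remote_segment> segment;
    std::list<std::unique_ptr<remote_segment_reader>> readers;
    std::chrono::steady_clock::time_point atime; // last access time
};

void maybe_gc(std::list<materialized_segment_state>& segments,
              std::chrono::steady_clock::duration max_idle) {
    const auto now = std::chrono::steady_clock::now();
    for (auto it = segments.begin(); it != segments.end();) {
        const bool stale = (now - it->atime) > max_idle;
        if (stale) {
            // Previously GC skipped any segment with use_count() > 1, which was
            // always true while readers existed, so nothing was ever released.
            it->readers.clear();
        }
        if (stale && it->segment.use_count() == 1) {
            it = segments.erase(it);             // last reference gone: the file can be closed
        } else {
            ++it;
        }
    }
}

int main() {
    std::list<materialized_segment_state> segments;
    auto& st = segments.emplace_back();
    st.segment = std::make_shared<remote_segment>();
    st.atime = std::chrono::steady_clock::now() - std::chrono::hours(1);

    auto reader = std::make_unique<remote_segment_reader>();
    reader->segment = st.segment;                // the reader pins the segment
    st.readers.push_back(std::move(reader));

    maybe_gc(segments, std::chrono::minutes(10));
    return segments.empty() ? 0 : 1;             // 0: the stale segment was reclaimed
}
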
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 4, 2022
jcsp closed this as completed in #3392 Jan 5, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 5, 2022
(cherry picked from commit 42945cb)
jcsp added a commit to jcsp/redpanda that referenced this issue Jan 11, 2022
(cherry picked from commit 42945cb)
jcsp (Contributor, Author) commented Jan 11, 2022

Backporting to 21.11.x here:
