Skip to content

Resolving the issue of fdcheck reporting errors when an epoll file descriptor (fd) is closed #16431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 26, 2025

Conversation

Donny9
Copy link
Contributor

@Donny9 Donny9 commented May 23, 2025

Summary

Resolving the issue of fdcheck reporting errors when an epoll file descriptor (fd) is closed.

The primary cause of the error is as follows:
When a task exits by calling exit, it closes all file descriptors (fds) associated with the current task.
When the epoll fd is closed, if there are still fds in the epoll queue that have values greater than the epoll fd, these fds may be closed first. This can lead to fdcheck failures when call epoll_close
To resolve this, we should use a file-based approach instead of directly managing fds in epoll.

Impact

fix epoll fdcheck issue

Testing

open vela monkey

…tors

Signed-off-by: dongjiuzhu1 <dongjiuzhu1@xiaomi.com>
@github-actions github-actions bot added Area: File System File System issues Size: M The size of the change in this PR is medium labels May 23, 2025
@xiaoxiang781216 xiaoxiang781216 requested a review from pussuw May 23, 2025 08:52
@pussuw
Copy link
Contributor

pussuw commented May 23, 2025

Why is fdcheck implemented in the kernel, the kernel should not have to care about such things. Shouldn't it be implemented at the user interface i.e. libc ?

@Donny9
Copy link
Contributor Author

Donny9 commented May 23, 2025

Why is fdcheck implemented in the kernel, the kernel should not have to care about such things. Shouldn't it be implemented at the user interface i.e. libc ?

@pussuw The implementation of fdcheck originally resides in libc https://github.com/apache/nuttx/blob/889a26db1d229ee462bc8ea075c2330dc5f79b94/libs/libc/misc/lib_fdcheck.c

But we need to perform the conversion from file descriptor (fd) to file structure pointer (filep) within the kernel, as this conversion cannot be done in user space. And, the parameter for system calls is an fd, nothing else, and the filep is also a kernel data structure.

@pussuw
Copy link
Contributor

pussuw commented May 23, 2025

Why is fdcheck implemented in the kernel, the kernel should not have to care about such things. Shouldn't it be implemented at the user interface i.e. libc ?

@pussuw The implementation of fdcheck originally resides in libc https://github.com/apache/nuttx/blob/889a26db1d229ee462bc8ea075c2330dc5f79b94/libs/libc/misc/lib_fdcheck.c

But we need to perform the conversion from file descriptor (fd) to file structure pointer (filep) within the kernel, as this conversion cannot be done in user space. And, the parameter for system calls is an fd, nothing else, and the filep is also a kernel data structure.

Yes the problem here is that the data is in filep. After #16361 you could do this magic in userspace.

@jlaitine
Copy link
Contributor

I completely agree with @pussuw here! There is nice cleanup available for the FDs in PR #16361, which fixes the operation to be acc. to the posix.

Please don't inject debug functionality which should NOT be inside kernel in there. Integrate first the #16361, and then implement the debug functionality in where it belongs. That is, in the kernel-userspace interface, not directly in the kernel.

@Donny9
Copy link
Contributor Author

Donny9 commented May 23, 2025

Why is fdcheck implemented in the kernel, the kernel should not have to care about such things. Shouldn't it be implemented at the user interface i.e. libc ?

@pussuw The implementation of fdcheck originally resides in libc https://github.com/apache/nuttx/blob/889a26db1d229ee462bc8ea075c2330dc5f79b94/libs/libc/misc/lib_fdcheck.c
But we need to perform the conversion from file descriptor (fd) to file structure pointer (filep) within the kernel, as this conversion cannot be done in user space. And, the parameter for system calls is an fd, nothing else, and the filep is also a kernel data structure.

Yes the problem here is that the data is in filep. After #16361 you could do this magic in userspace.

@pussuw #16361 would definitely be the best if we could cover this type of scenario.
However, the changes in this current Pull Request (PR) are quite extensive, so it's unlikely to be merged into the mainline quickly. There's a possibility of stability issues, so we should ensure its reliability through some testing.
Therefore, I will pick it up internally and conduct LTP (Linux Test Project) functional testing, 24-hour Monkey stress testing, as well as functional testing on multiple platforms, before approving it.

Therefore, our priority is to ensure that the current design is sound. You can continue to rework based on #16361 .

@pussuw
Copy link
Contributor

pussuw commented May 23, 2025

Why is fdcheck implemented in the kernel, the kernel should not have to care about such things. Shouldn't it be implemented at the user interface i.e. libc ?

@pussuw The implementation of fdcheck originally resides in libc https://github.com/apache/nuttx/blob/889a26db1d229ee462bc8ea075c2330dc5f79b94/libs/libc/misc/lib_fdcheck.c
But we need to perform the conversion from file descriptor (fd) to file structure pointer (filep) within the kernel, as this conversion cannot be done in user space. And, the parameter for system calls is an fd, nothing else, and the filep is also a kernel data structure.

Yes the problem here is that the data is in filep. After #16361 you could do this magic in userspace.

@pussuw #16361 would definitely be the best if we could cover this type of scenario. However, the changes in this current Pull Request (PR) are quite extensive, so it's unlikely to be merged into the mainline quickly. There's a possibility of stability issues, so we should ensure its reliability through some testing. Therefore, I will pick it up internally and conduct LTP (Linux Test Project) functional testing, 24-hour Monkey stress testing, as well as functional testing on multiple platforms, before approving it.

Therefore, our priority is to ensure that the current design is sound. You can continue to rework based on #16361 .

Thanks, I do really prefer fixing the file descriptor subsystem like @xiaoxiang781216 suggested in #16326 . It is better to do a comprehensive fix that as far as I can see, will solve most of your problems as well.

GUIDINGLI
GUIDINGLI previously approved these changes May 23, 2025
@Donny9
Copy link
Contributor Author

Donny9 commented May 23, 2025

Why is fdcheck implemented in the kernel, the kernel should not have to care about such things. Shouldn't it be implemented at the user interface i.e. libc ?

@pussuw The implementation of fdcheck originally resides in libc https://github.com/apache/nuttx/blob/889a26db1d229ee462bc8ea075c2330dc5f79b94/libs/libc/misc/lib_fdcheck.c
But we need to perform the conversion from file descriptor (fd) to file structure pointer (filep) within the kernel, as this conversion cannot be done in user space. And, the parameter for system calls is an fd, nothing else, and the filep is also a kernel data structure.

Yes the problem here is that the data is in filep. After #16361 you could do this magic in userspace.

@pussuw #16361 would definitely be the best if we could cover this type of scenario. However, the changes in this current Pull Request (PR) are quite extensive, so it's unlikely to be merged into the mainline quickly. There's a possibility of stability issues, so we should ensure its reliability through some testing. Therefore, I will pick it up internally and conduct LTP (Linux Test Project) functional testing, 24-hour Monkey stress testing, as well as functional testing on multiple platforms, before approving it.
Therefore, our priority is to ensure that the current design is sound. You can continue to rework based on #16361 .

Thanks, I do really prefer fixing the file descriptor subsystem like @xiaoxiang781216 suggested in #16326 . It is better to do a comprehensive fix that as far as I can see, will solve most of your problems as well.

@pussuw @jlaitine
Actually, this current PR doesn't just address the issue with fdcheck.
Firstly, optimizing the file descriptors in epoll to be managed as files is quite reasonable, because fds shouldn't exist in the kernel in their raw form; they should be converted into file structures or the subsequent struct fd representation. 4629fc5
Secondly, we've removed poll_fdsetup to ensure that everyone uses interfaces with reference counting to access fds, which is much safer."0cb3037
The other two modifications are minor and only target the usage of file references.

@pussuw
Copy link
Contributor

pussuw commented May 23, 2025

Why is fdcheck implemented in the kernel, the kernel should not have to care about such things. Shouldn't it be implemented at the user interface i.e. libc ?

@pussuw The implementation of fdcheck originally resides in libc https://github.com/apache/nuttx/blob/889a26db1d229ee462bc8ea075c2330dc5f79b94/libs/libc/misc/lib_fdcheck.c
But we need to perform the conversion from file descriptor (fd) to file structure pointer (filep) within the kernel, as this conversion cannot be done in user space. And, the parameter for system calls is an fd, nothing else, and the filep is also a kernel data structure.

Yes the problem here is that the data is in filep. After #16361 you could do this magic in userspace.

@pussuw #16361 would definitely be the best if we could cover this type of scenario. However, the changes in this current Pull Request (PR) are quite extensive, so it's unlikely to be merged into the mainline quickly. There's a possibility of stability issues, so we should ensure its reliability through some testing. Therefore, I will pick it up internally and conduct LTP (Linux Test Project) functional testing, 24-hour Monkey stress testing, as well as functional testing on multiple platforms, before approving it.
Therefore, our priority is to ensure that the current design is sound. You can continue to rework based on #16361 .

Thanks, I do really prefer fixing the file descriptor subsystem like @xiaoxiang781216 suggested in #16326 . It is better to do a comprehensive fix that as far as I can see, will solve most of your problems as well.

@pussuw @jlaitine Actually, this current PR doesn't just address the issue with fdcheck. Firstly, optimizing the file descriptors in epoll to be managed as files is quite reasonable, because fds shouldn't exist in the kernel in their raw form; they should be converted into file structures or the subsequent struct fd representation. 4629fc5 Secondly, we've removed poll_fdsetup to ensure that everyone uses interfaces with reference counting to access fds, which is much safer."0cb3037 The other two modifications are minor and only target the usage of file references.

Yes, I agree with patches 1 and 4, this is why I did not comment on them, those could be merged with no disputes from my side

@xiaoxiang781216
Copy link
Contributor

@pussuw here is our plan:

  1. Merge the file handle related fix found in our internal stress test before your file handle restructure patch
  2. Resolve the conflict in your patch base on item 1 by @Donny9
  3. Run all internal function and stress tests to ensure all work as before by @Donny9

Do you think this plan reasonable?

@xiaoxiang781216
Copy link
Contributor

xiaoxiang781216 commented May 24, 2025

I completely agree with @pussuw here! There is nice cleanup available for the FDs in PR #16361, which fixes the operation to be acc. to the posix.

Yes, all agree #16361 is the right direction to go.

Please don't inject debug functionality which should NOT be inside kernel in there. Integrate first the #16361, and then implement the debug functionality in where it belongs. That is, in the kernel-userspace interface, not directly in the kernel.

The patchset doesn't change or add any new debugging functionality, but fix the nasty bug found by fdsan/fdcheck.
@jlaitine, we can improve the implementation detail of fdsan/fdcheck after merging this and #16361.

Allow users to operate poll in the kernel using the file_poll
approach, as file is protected with reference counting,
making it more secure.

Signed-off-by: dongjiuzhu1 <dongjiuzhu1@xiaomi.com>
@Donny9
Copy link
Contributor Author

Donny9 commented May 26, 2025

@pussuw please review. For the previous two patches, we made enhancements or implementations based on your PR.

@xiaoxiang781216 xiaoxiang781216 requested a review from pussuw May 26, 2025 04:23
@pussuw
Copy link
Contributor

pussuw commented May 26, 2025

@pussuw here is our plan:

  1. Merge the file handle related fix found in our internal stress test before your file handle restructure patch
  2. Resolve the conflict in your patch base on item 1 by @Donny9
  3. Run all internal function and stress tests to ensure all work as before by @Donny9

Do you think this plan reasonable?

Yes, thank you. I'm OoO until tomorrow so don't have time to read github much until then.

@pussuw
Copy link
Contributor

pussuw commented May 26, 2025

@pussuw please review. For the previous two patches, we made enhancements or implementations based on your PR.

Sorry, it will go to tomorrow

Copy link
Contributor

@pussuw pussuw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xiaoxiang781216 xiaoxiang781216 merged commit 4f57ebc into apache:master May 26, 2025
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: File System File System issues Size: M The size of the change in this PR is medium
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants