Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventfd with epoll EPOLLET event doesn't match Linux #2673

Closed
ppopth opened this issue Jan 19, 2023 · 5 comments
Closed

Eventfd with epoll EPOLLET event doesn't match Linux #2673

ppopth opened this issue Jan 19, 2023 · 5 comments
Labels
Type: Bug Error or flaw producing unexpected results

Comments

@ppopth
Copy link
Contributor

ppopth commented Jan 19, 2023

Describe the issue
When using eventfd with epoll with EPOLLET flag set, the result of epoll_wait in the real Linux is different from the one in Shadow

To Reproduce
Run the following code in Linux and in Shadow and see the difference.

#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

#define handle_error(msg) \
           do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *thread_fn(void *ptr);

int timeout1 = 2;
int timeout2 = 4;
int efd, epollfd;

void main() {
     pthread_t thread1, thread2;
     int iret1, iret2;
#define MAX_EVENTS 10
     struct epoll_event ev, events[MAX_EVENTS];
     int nfds;

     epollfd = epoll_create1(EPOLL_CLOEXEC);

     efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
     if (efd == -1)
         handle_error("eventfd");

     ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
     ev.data.fd = efd;
     if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1)
         handle_error("epoll_ctl: efd");

     iret1 = pthread_create(&thread1, NULL, thread_fn, (void*) &timeout1);
     iret2 = pthread_create(&thread2, NULL, thread_fn, (void*) &timeout2);

     for (;;) {
         nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
         if (nfds == -1)
             handle_error("epoll_wait");

         for (int n = 0; n < nfds; ++n) {
             if (events[n].data.fd == efd) {
                 uint64_t u;
                 printf("event found\n");
             }
         }
         fflush(stdout);
     }
     pthread_join(thread1, NULL);
     pthread_join(thread2, NULL); 

     printf("Thread 1 returns: %d\n",iret1);
     printf("Thread 2 returns: %d\n",iret2);
     exit(0);
}

void *thread_fn(void *ptr) {
    int timeout = *((int *)ptr);
    sleep(timeout);
    printf("%d\n", timeout);
    fflush(stdout);

    uint64_t inc = 10;
    write(efd, &inc, sizeof(uint64_t));
}

In Linux, the output will be

2
event found
4
event found

But in Shadow, the output is

2
event found
4

Operating System (please complete the following information):

  • OS and version: Ubuntu 20.04.5 LTS
  • Kernel version: Linux thinkpad-t14-g2 5.14.0-1055-oem # 62-Ubuntu SMP Wed Nov 30 04:54:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Shadow (please complete the following information):

  • Version: Shadow v2.3.0-0-gb09b00ed 2022-11-29--13:01:25
  • Which plug-ins you are using: The one shown above

Additional context

@ppopth ppopth added the Type: Bug Error or flaw producing unexpected results label Jan 19, 2023
@ppopth
Copy link
Contributor Author

ppopth commented Jan 19, 2023

TCP sockets seem to be the same https://github.com/ppopth/tcp_epoll_shadow

@stevenengler
Copy link
Contributor

Thanks for the bug report!

So I think this comes down to the question "what is an edge?" with respect to Linux files. Shadow uses the definition "the file was not readable/writable, and has now become readable/writable". It seems that this is the definition that was originally intended for Linux, but some files would wake the edge-triggered epoll even when it already had data, so programmers started relying on this unintended behaviour.

This is literally an epoll() confusion about what an "edge" is.

An edge is not "somebody wrote more data". An edge is "there was no data, now there is data".

And a level triggered event is also not "somebody wrote more data". A level-triggered signal is simply "there is data".

Notice how neither edge nor level are about "more data". One is about the edge of "no data" -> "some data", and the other is just a "data is available".

Sadly, it seems that our old "we'll wake things up whether needed or not" implementation ended up being something that people thought was edge-triggered semantics.

https://lwn.net/ml/linux-kernel/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/

Since Shadow uses this "there was no data, now there is data" definition for an edge, it doesn't wake the second epoll_wait() after the second write() unless the main thread was also to do a read().

It seems that the WSL also originally took this same approach as Shadow: microsoft/WSL#2462. There's also a LWN article about this: https://lwn.net/Articles/864947/

We should probably change Shadow to follow the Linux behaviour here, even if it's not "correct". The issue is that Shadow's status listeners don't have a way of listening for "READABLE" -> "READABLE" status changes, so this might be tricky to fix.

@stevenengler
Copy link
Contributor

Do you have a real-world application that depends on this behaviour?

We discussed this today, and while we probably do want to make these changes, they're not currently high on our priority list. This is probably something we'd look at when migrating our Epoll file objects to rust. We're looking at this issue from the perspective that Shadow already "does the right thing", but just isn't exactly compatible with Linux's idiosyncrasies.

There is a possible downside to making this change. Waking up the epoll_wait each time the file is written to will cause a lot more wakeups of the managed process, and for many applications these wakeups may just be extra spurious wakeups. This could cause Shadow to have worse performance in these situations, and if applications aren't relying on this Linux-specific behaviour, there isn't any benefit to having this worse performance. We would need some benchmarks to understand the performance implications. We could make this behaviour a configuration option that users opt into or out of, but this would increase Shadow's maintenance burden, so we probably don't want to do that.

@ppopth
Copy link
Contributor Author

ppopth commented Jan 19, 2023

I encountered this while running the command lcli eth1-genesis from https://github.com/sigp/lighthouse. What it does is that it only writes to the eventfd but never read it and uses only epoll to get the notification. I'm not sure it's the behavior of the binary or some libraries inside it.

We're looking at this issue from the perspective that Shadow already "does the right thing", but just isn't exactly compatible with Linux's idiosyncrasies

I think the most important thing is the intention of the application developers, i.e. how the developers want the applications to behave.

What the developers usually do when they write the applications is they have to test them against the machines which most of the time are Linux. If the machines don't behave correctly and then make the application behave incorrectly, they just change the application to be consistent with the machine.

I think "doing the right thing" is the right thing to do, if the developers (or we encourage them to) test the applications against both Linux and Shadow. If so, we can keep doing the right thing because the applications will behave correctly in Shadow anyway. But if the developers test the applications only against Linux (and we will assume it will be that way forever), there is a good chance that the applications will not behave correctly in Shadow (and the intention mentioned above is violated) if Shadow behaves differently from Linux.

There is a possible downside to making this change. Waking up the epoll_wait each time the file is written to will cause a lot more wakeups of the managed process, and for many applications these wakeups may just be extra spurious wakeups

I don't think it will cause worse performance. I think only when the managed process will waken up more often is when the managed process already knows there is data but still epolls the fd.

This doesn't quite make sense to me. Why does the managed process need to epoll the fd while it can just read the data? The only answer I can think of is that "the managed process doesn't want to read the data". It probably just wants to get notified when there is more data. This will reduce the number of syscalls the process needs to make (by calling only epoll_wait).

In my opinion, I will conclude that applications that don't rely on this Linux-specific behavior will not epoll the fd while they already know there is data anyway, so there will be no penalty.

PS> However, I decided to patch Shadow to make it similar to Linux myself anyway. Because I think it's quite easier for me to patch Shadow than to patch the application and I also have a plan to use Shadow with many more applications. I think it's quite convenient for me not to think whether this behavior will be an issue again.

@stevenengler stevenengler changed the title Eventfd with epoll EPOLLET event Eventfd with epoll EPOLLET event doesn't match Linux Jun 7, 2023
@stevenengler
Copy link
Contributor

This should be fixed by #3243.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Error or flaw producing unexpected results
Projects
None yet
Development

No branches or pull requests

2 participants