@agherzan
Contributor

@agherzan agherzan commented Feb 1, 2016

Signed-off-by: Andrei Gherzan <andrei@resin.io>
@agherzan
Contributor Author

agherzan commented Feb 1, 2016

We have a simple EGL & OMX application which runs fine on the host (rpi2). When running the same binary in a container, the application behaves strangely (lacks transitions and other animation components) while "vcdbg log msg" complains with:

025819.942: *** No KHAN handle found for pid 24
025836.615: *** No KHAN handle found for pid 24
025853.285: *** No KHAN handle found for pid 24

When the container is run without a new PID namespace, everything goes back to normal.

I don't really understand the entire stack, as I failed to see how this pid is passed and used in userland and/or vchiq, but I pushed the patch as it is now. If someone can give me a short overview of this graphics architecture and how the pid member is used, it would be great.

@pelwell
Contributor

pelwell commented Feb 2, 2016

When a process dies it is vital that any resources it used are released. VCHIQ is the comms interface between the application and the graphics APIs, and it tells the GPU the PID associated with each client to enable the multiple connections from each process to be grouped as a bundle.
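The grouping described above can be pictured with a small sketch. This is not the VCHIQ implementation; `struct conn`, `conn_open` and `conn_release_all` are hypothetical names used only to illustrate why each connection carries its owner's PID: when that process dies, everything tagged with its PID can be released in one pass.

```c
#include <sys/types.h>

#define MAX_CONNS 16

/* One hypothetical client connection, tagged with the PID that opened it. */
struct conn {
    int   in_use;
    pid_t owner;
};

static struct conn conns[MAX_CONNS];

/* Record a connection for `owner`; returns its slot index, or -1 if full. */
int conn_open(pid_t owner) {
    for (int i = 0; i < MAX_CONNS; i++) {
        if (!conns[i].in_use) {
            conns[i].in_use = 1;
            conns[i].owner = owner;
            return i;
        }
    }
    return -1;
}

/* Release every connection owned by `owner`; returns how many were freed. */
int conn_release_all(pid_t owner) {
    int freed = 0;
    for (int i = 0; i < MAX_CONNS; i++) {
        if (conns[i].in_use && conns[i].owner == owner) {
            conns[i].in_use = 0;
            freed++;
        }
    }
    return freed;
}
```

If two parts of the stack disagree about which PID identifies a process, the lookup in `conn_release_all` (or any similar match) would find nothing, which is consistent with the "No KHAN handle found" messages reported above.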

I'm new to PID namespaces, but I can see what the aim is. However, I'm confused by your use of task_pid_vnr which returns the "virtual" (namespaced) PID - I would have expected you to use task_pid_nr to get the global pid, unless the problem is that some other part of the GL client APIs is using the virtual PID and failing to get a match. If so, I would have thought that changing that API/driver would be better.

I'm concerned that by switching to the virtual PIDs we may accidentally get a collision between processes in different namespaces; explain to me why that isn't the case.

@petrosagg

Hi @pelwell. We co-discovered this issue with @agherzan.

> However, I'm confused by your use of task_pid_vnr which returns the "virtual" (namespaced) PID - I would have expected you to use task_pid_nr to get the global pid,

Using task_pid_nr yields the same result as the original current->pid, i.e. the global PID, so it wouldn't change anything.

> unless the problem is that some other part of the GL client APIs is using the virtual PID and failing to get a match. I would have thought that changing that API/driver would be better.

Indeed, we also think that this is the case. I tried to find where the PID is passed from the EGL libraries to the driver but I couldn't find where this happens as it's a big and unfamiliar codebase for me.

On the EGL side however, the process will only be able to see its virtual PID. You can't get IDs of a parent namespace from inside a child namespace. This means that the driver will either have to do accounting based on a tuple (ns_id, pid) or the EGL part will have to use some kind of proxy in the kernel module to do the translation for it.
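A minimal sketch of the tuple idea, assuming a hypothetical `struct client_key`. Using the inode number of /proc/<pid>/ns/pid as `ns_id` is an assumption made for illustration; the thread does not say how the driver would obtain a namespace identifier.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical client key: PID namespace identifier plus the PID inside it. */
struct client_key {
    unsigned long ns_id;  /* assumed: inode number of /proc/<pid>/ns/pid */
    pid_t         pid;    /* PID as seen inside that namespace */
};

/* Matching on the bare PID: clients from different namespaces can collide. */
bool same_client_by_pid(struct client_key a, struct client_key b) {
    return a.pid == b.pid;
}

/* Matching on the (ns_id, pid) tuple: equal only if both parts agree. */
bool same_client_by_tuple(struct client_key a, struct client_key b) {
    return a.ns_id == b.ns_id && a.pid == b.pid;
}
```

Two containers that each run their application as PID 1 compare equal under `same_client_by_pid` but not under `same_client_by_tuple`, which is the distinction the accounting would need.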

> I'm concerned that by switching to the virtual PIDs we may accidentally get a collision between processes in different namespaces; explain to me why that isn't the case.

You're right and it is the case that there will be collisions. This patch is mostly provided to demonstrate the problem.

@pelwell
Contributor

pelwell commented Feb 2, 2016

Thanks for that clarification. I'm going to close this PR in case somebody gets tempted to merge it, but this is an issue we should look at.

@pelwell pelwell closed this Feb 2, 2016
@petrosagg

Would a minimal test.c program that demonstrates the problem help with debugging it?

@pelwell
Contributor

pelwell commented Feb 2, 2016

Of course - anything that helps us to focus on the problem is helpful.

@petrosagg

Hey @pelwell. Here is an easy way of reproducing this problem.

First, compile the hello_triangle.bin example. If you run it by itself you should see a cube with some textures rotating on the screen.

Next, use this little launcher to run the same hello_triangle.bin file. NOTE: The launcher needs to run as root. It first creates a PID namespace and then calls execlp on the hello_triangle binary. This time only a couple of frames are rendered and then it hangs forever. From the EGL point of view the PID will be 1, but from the VideoCore point of view it will be the real global PID.

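#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define FILENAME "./hello_triangle.bin"
#define STACK_SIZE (1024 * 1024)

/* Stack for the cloned child; it grows down, so clone() is given its top. */
static char child_stack[STACK_SIZE];

/* Runs as PID 1 of the new PID namespace, then replaces itself with the demo. */
static int childFunc(void *arg) {
    (void) arg;
    printf("childFunc(): PID  = %ld\n", (long) getpid());
    printf("childFunc(): PPID = %ld\n", (long) getppid());

    execlp(FILENAME, FILENAME, (char *) NULL);
    perror("execlp");   /* Only reached if execlp failed */
    return 1;
}

int main(void) {
    pid_t child_pid;

    /* CLONE_NEWPID needs CAP_SYS_ADMIN, hence the need to run as root. */
    child_pid = clone(childFunc, child_stack + STACK_SIZE, CLONE_NEWPID | SIGCHLD, NULL);
    if (child_pid == -1) {
        perror("clone");
        return EXIT_FAILURE;
    }

    printf("PID returned by clone(): %ld\n", (long) child_pid);

    if (waitpid(child_pid, NULL, 0) == -1) {
        perror("waitpid");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}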
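To see both PIDs for one process, the `NSpid:` field of /proc/<pid>/status (available since Linux 4.1) lists the PID in each nested PID namespace, outermost first. The helpers below are a sketch for inspecting that field; `parse_nspid` and `read_own_nspid` are names made up for this example and are not part of the launcher or of any library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse an "NSpid:" line as found in /proc/<pid>/status.
 * Stores the leftmost (outermost visible) PID in *outer and the rightmost
 * (innermost namespace) PID in *inner. Returns the number of PIDs listed,
 * or -1 if the line is not an NSpid line.
 */
int parse_nspid(const char *line, long *outer, long *inner) {
    if (strncmp(line, "NSpid:", 6) != 0)
        return -1;

    const char *p = line + 6;
    char *end;
    int count = 0;

    for (;;) {
        long v = strtol(p, &end, 10);
        if (end == p)           /* no more numbers on the line */
            break;
        if (count == 0)
            *outer = v;
        *inner = v;
        count++;
        p = end;
    }
    return count;
}

/* Read NSpid for the calling process; returns -1 if the field is unavailable. */
int read_own_nspid(long *outer, long *inner) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    int n = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        n = parse_nspid(line, outer, inner);
        if (n >= 0)
            break;
    }
    fclose(f);
    return n;
}
```

Called from a process started by the launcher, and assuming /proc is still the host's procfs, `read_own_nspid` should report the global PID in `outer` and 1 in `inner`, which are the two values the VideoCore side and the EGL side respectively end up with.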
