Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FAKETIME_SKIP_CMDS does not seem to work in certain automake cases #355

Closed
dragon512 opened this issue Nov 24, 2021 · 22 comments
Closed

FAKETIME_SKIP_CMDS does not seem to work in certain automake cases #355

dragon512 opened this issue Nov 24, 2021 · 22 comments

Comments

@dragon512
Copy link

This might be a user error issue, or a question might be a real issue. I am trying to use libfaketime to help with certain reproducible issues I have when building cmake and automaker-based projects. In general, this tool seems to work great. I use the value of

FAKETIME_SKIP_CMDS=ls

to address automake configure script issues related to "sane" build environments. However, I have found a few cases related to components that use jemalloc in which configure will lock up when the configure script tries to check for jemalloc values. This seems to be easy to work around as the configure script is trying to run a conftest program. I tried adding this to the FAKETIME_SKIP_CMDS variable, however, it does not seem to be accepted. I did

FAKETIME_SKIP_CMDS=ls,conftest

As stated this locks up as described in issue #327 In this case I was compiling Apache Trafficserver (https://github.com/apache/trafficserver) with jemalloc support. I am unclear if I need to use a different value to allow conftest to be ignored. Ideally, conftest should be ignored in most cases as it testing some value we normally don't need to tweak with libfaketime. For me, this would address the jemalloc issue for building these components as I just need to ignore the tests the configure script is compiling and running.

@wolfcw
Copy link
Owner

wolfcw commented Nov 24, 2021

Thanks for pointing this out, interesting use case you have there!

Indeed we have a long-standing incompatibility with jemalloc, as discussed in some of the (closed) issues you already pointed to, and back then we concluded that there is no viable way on libfaketime's side to fix this problem.

The FAKETIME_SKIP_CMDS functionality works by comparing the name of the running process (__progname or argv[0]) against the list given in the environment variable, and assuming that conftest is run under this name, you're doing nothing wrong at all.

I think the problem here is that FAKETIME_SKIP_CMDS works like that: libfaketime is triggered by some time-related system call, determines that it was triggered by a process that should be skipped, and thus becomes inactive, meaning it won't fake any results of the real time-related system calls.

However, even for a "skipped command", libfaketime is LD_PRELOADED, and thus clashes with jemalloc during initialisation, leading to the lock-up.

So, unfortunately, using FAKETIME_SKIP_CMDS is not sufficient to make libfaketime work with programs using jemalloc. Instead, you'd have to avoid libfaketime being LD_PRELOADED to such commands/programs at all.

I know that this requires a much more fine-grained approach to start some programs with libfaketime LD_PRELOADed and other without in any non-trivial build process, and this is unsatisfying, but this sort of control must be exerted from outside libfaketime, because once libfaketime becomes active, it already is too late.

One possible approach would be a wrapper, so programs are not directly started, but through this wrapper, and the wrapper decides (based on the name of the program to start) whether libfaketime should be LD_PRELOADed or not. That's not very elegant, and adds another layer of complexity to the build process that might not be worth the effort, but it's also somewhat simple and straight-forward.

@dragon512
Copy link
Author

Still need to do more testing but the master branch with the PR for jemalloc seems to solve the issue.

@dragon512
Copy link
Author

Just an update I am using the patch for libjemalloc. I am not sure if this was merged or not.

It seems to solve the issue for me... it seem to be gone from the PR page. However the issues I am seeing now is that in -j based builds I tend to see with high job/thread values is

libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory libfaketime: sem_name was /faketime_sem_43239, created locally: false libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory libfaketime: sem_name was /faketime_sem_43239, created locally: false

I don't know if there is something that can be done to address this.

@wolfcw
Copy link
Owner

wolfcw commented Jan 10, 2022

The PR (#333) was closed by @ronrother without merging as it does not (and probably can't) work in all cases.

I can't tell how close your new problem is related to this (if you applied the PR). Basically, during initialisation libfaketime accesses semaphores and shared memory for inter-process synchronisation of settings. It tries 3 times to either access existing ones or to re-create those, which are supposed to already exist (because their filename was passed in the environment variable FAKETIME_SHARED). If all attempts fail, libfaketime exits with error code 1. You could try to increase the number of attempts in

if (nt > 3)

or add short delays before re-tries. Usually, the cleanest solution is to use libfaketime on a more long-lived parent process (whois PID is then used as part of the semaphore filename)

@dragon512
Copy link
Author

I will give it a try and let you know what I see.
For the stack I am building we are looking at https://github.com/microsoft/mimalloc as we are seeing much better speeds with this. I don't believe I am seeing issues with this as I am with jemalloc.

I am happy that I have 100% reproducible builds now! it just the random failures I am seeing which seem related to fixing the jemalloc issues

@dragon512
Copy link
Author

dragon512 commented Jan 12, 2022

I am still testing.. but so far it is still failing... I admit I don't understand why we need to have a shared semaphore in this code. Can you explain a little? It seems that it would be simpler without this. I am sure there is a good reason related to the spawning process or something. I am getting in the logs stuff like this:

libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false
libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false
libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false
libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false
libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false
libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false
libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false
libfaketime: In ft_shm_init(), sem_open failed and recreation attempts failed: No such file or directory
libfaketime: sem_name was /faketime_sem_20741, created locally: false

which suggests some process making lots of subprocesses and they are failing as they are referring to a parent process PID value? ( I think that is what is happening)

Is the path being used bad? it seem making a file in root '/' is going to fail

@dragon512
Copy link
Author

dragon512 commented Jan 12, 2022

Ok .. looking around and have a better idea now... I don't this code works as you stated...

the ft_shim_init() functions check for the ENV var as you stated and does the open call as it "should" exist from a running parent process with the

if (SEM_FAILED == (shared_sem = sem_open(sem_name, 0)))

This fails so it does the else statement of

else{
ft_shm_init();
return;
}

however, it just calls itself. There is no recreation attempt. it just tries to reopen an existing item that we know does not exist. Looking at what should be called to recreate the item being tested for

@dragon512
Copy link
Author

I have been digging into the issue more today. I think I see the issue. The ft_shm_create() is being called. It fails silently with a "File exists" error. It is important to note the PID used is the current process, not the parent. We then return back to the ft_shm_init() call and try to open the parent PID value object. This fails as the parent is dead and clean up before the child got to process the information from the process. The loop with n is pointless at the moment, as it does not address the issue of creating the correct PID object that init is looking for.

My question is do you feel we should tweak this to try to create an object with the parent PIP defined via FAKETIME_SHARED or should FAKETIME_SHARED be cleared and try to recreate it based on the child PID?

@wolfcw
Copy link
Owner

wolfcw commented Jan 13, 2022

ft_shm_create() is supposed to set FAKETIME_SHARED to the new value, and ft_shm_init() re-reads FAKETIME_SHARED after its call to ft_shm_create(). So, actually, the current (child) PID is used, which makes sense given that the parent PID apparently no longer exists.

@dragon512
Copy link
Author

Right.. so the fix is to re create() via the parent or the child. The current logic tries to create on the child then returns without FAKETIME_SHARED being updated to the child value.. so it fails as it always mismatched. It seems if i read between the lines the expected logic is to use the child pid and have FAKETIME_SHARED updated correctly

@wolfcw
Copy link
Owner

wolfcw commented Jan 13, 2022

Can you check your /dev/shm/ whether those are old, stale semaphore files? If ft_shm_create() is called through ft_shm_init() because the explicitly passed parent's semaphore file is no longer available, it's somewhat unusual that a semaphore file with the current (child) PID already exists, since it's supposed to be freshly created.

@dragon512
Copy link
Author

I have lots of files .. 1000s

@wolfcw
Copy link
Owner

wolfcw commented Jan 13, 2022

Please delete them. They must be left over from something that went wrong (so ft_cleanup() was not called), and they explain the silent fails you observe, but this is not a pid vs. parent pid issue with a lost update of the FAKETIME_SHARED in ft_shm_create().

@dragon512
Copy link
Author

I applied this patch

--- a/src/libfaketime.c
+++ b/src/libfaketime.c
@@ -519,6 +519,8 @@ static void ft_shm_init (void)
     }
     if (SEM_FAILED == (shared_semR = sem_open(sem_name, 0))) /* gone stale? */
     {
+      /* clear the FAKETIME_SHARED so we use the child PID value */
+      unsetenv("FAKETIME_SHARED");
       ft_shm_create();
       ft_shared_env = getenv("FAKETIME_SHARED");
     }

This seems to have solved the issue. I feel I need to test it more.. but the issue has not shown itself again so far. As far as files in /dev/shm I am still seeing files show up here, however, I seem to see less of them. From what I can see it seem rpmbuild is causing stale files to show up. I going to apply this locally and test it out in our CI for a day to see what happens.

@wolfcw
Copy link
Owner

wolfcw commented Jan 13, 2022

There should not be any semaphore files except for those processes which are still running. It'd be interesting to figure out why they are left over, especially whether it's application crashes or a libfaketime-internal error that causes abnormal process termination.

I get the intention of your patch and appreciate it. However, ft_shm_create() never reads, but only writes the environment variable, so I don't think this can have any effect at all.

@dragon512
Copy link
Author

I think the main issue is that on short processes there is a race in which the parent dies and the children are still running. This errors out in the child as the FAKETIME_SHARED values are invalid, and the process tries to open a file that does not exist, and if it tries to re-create the file it uses a PID value that is a mismatch. This is clearly a bug at the moment in the main code.

@wolfcw
Copy link
Owner

wolfcw commented Jan 13, 2022

I'm really failing to follow you on where you clearly see a bug. libfaketime tries to open the file whose name was passed to it explicitly through an environment variable by its parent process. If the semaphore file does not exist any more, because the parent process already cleaned up and exited, a new file is created, obviously using the current process' PID to make it unique, because the current process could become the parent process of yet other spawned processes later. Using the parent's PID instead of its own PID for this purpose seems like a horrible idea, because that other process has already exited and its PID might be re-used by a completely different process sooner or later.

You're right that one basic assumption is that libfaketime initialisation is run while the parent process still exists. That's usually the case when applications are started, e.g., via an interactive shell, or when a wrapper like faketime is used. If that's not going to work in your build environment, you can patch out anything that sets FAKETIME_SHARED in the first place to simply avoid this overall issue; but this also means synchronisation of settings between parent and child processes won't work then. Depending on whether you need that, you'll have to come up with a different naming scheme for the files, one which does not use a PID for a sufficient degree of uniqueness.

@dragon512
Copy link
Author

Using the parent's PID instead of its own PID for this purpose seems like a horrible idea, because that other process has already exited and its PID might be re-used by a completely different process sooner or later.

I agree. The patch I had above clears the value of FAKETIME_SHARED so when the ft_shm_create() is called it uses the child pid value and correctly then sets the FAKETIME_SHARED value. For some reason the code was before not resetting the value of the FAKETIME_SHARED to the new child pid value, and was always trying to use the parent value. This is why it failed.

@dragon512
Copy link
Author

dragon512 commented Jan 13, 2022

Just as a note. I did change the n value to 100. it was clear in the dump of test that the parent ID was always used, not the new child ID value that was made in ft_shm_create(). Everytime the ft_shm_init() was called within the process it was getting the old parent ID value from the getenv() call. I was testing this on centos 7 and rhel 8

@wolfcw
Copy link
Owner

wolfcw commented Jan 14, 2022

ft_shm_create() is supposed to set FAKETIME_SHARED to the new value anyhow and I think it would be important to find out why it does not in your case (or in general, if it's buggy). However, clearing FAKETIME_SHARED before the call to ft_shm_create() works on the symptoms, not the root cause. Except for selected few code paths where ft_shm_create() can fail silently, it already should print any errors it faces on stderr, so I'm wondering what exactly is going wrong there and why clearning the environment variable beforehand actually changes anything in your case.

You mentioned ft_shm_create() silently failing before, which should only occur when the newly to create file (with the child PID) already exists. This was explainable with 1000s of stale files in your /dev/shm, for which we also do not know yet why they were left there. Silently failing explains why changing the number of attempts from 3 to 100 makes no difference. One option would be to set the environment variable to the new but already existing filename in this case in ft_shm_create(). But again, this means that we are in a setup where the parent process unexpectedly already exited, and the child process trips over an old, stale file based on its PID. Several things must have gone wrong already by this time, and ignoring them by force might not be the best idea. If you consider it sufficient to simply ignore those stale files in your case, I suggest to rather try with removing O_EXCL from the oflag of sem_open() in ft_shm_create() instead of setting the environment variable to the name of an old, stale file.

@dragon512
Copy link
Author

So for me at the moment everything seems to be working. The CI was set up to just run over and over again and test the outputs.. not one error. I agree with what you are saying, and was not sure what was wrong. However, it is hard to debug as you can imagine and it might be related to the jemalloc patch as well. I have to deal with some other items today, but I think I can post here a small example of the setup I have, that would at least show the leaking of the files in /dev/shm

@wolfcw
Copy link
Owner

wolfcw commented Jan 14, 2022

Good to know, thanks. Files are typically left in /dev/shm when the process crashes and libfaketime can't clean up as part of a regular program exit. The main question here in libfaketime context is whether libfaketime causes the crash (then there is a bug) or the application crashes by itself (then there is nothing libfaketime can do). In a high-frequently used build environment, one typical option is to clean up /dev/shm occasionally, either at the beginning of build processes if there are not several running in parallel, or by using find or a similar tool to identify and delete files that are, e.g., older than 1 hour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants