
Crashes during parallel tasks #100

Open
pabloyoyoista opened this issue Mar 11, 2022 · 10 comments

Comments

@pabloyoyoista

pabloyoyoista commented Mar 11, 2022

I have a testing setup that generates AppStream data with appstream-generator on Alpine, and I am seeing occasional crashes during parallel operations. The errors happen rarely, I don't have a good reproducer, and I have no clear idea which packages the generator was processing when they occurred. I know that issues of the "this isn't working!" kind aren't very useful, so my goal is rather to ask how I could debug this, or what would be needed to narrow the error down. I am also happy to help with debugging in any way possible.

The error output looks like this:

core.exception.RangeError@../src/asgen/engine.d(532): Range violation
----------------
??:? onRangeError [0x7fa7dcd9cc90]
??:? _d_arraybounds [0x7fa7dcd9d370]
??:? /usr/bin/appstream-generator [0x55e77d5783e0]
??:? void std.parallelism.ParallelForeach!(asgen.backends.interfaces.Package[]).ParallelForeach.opApply(scope int delegate(ref asgen.backends.interfaces.Package)).doIt() [0x55e77d57ed30]
??:? void std.parallelism.TaskPool.executeWorkLoop() [0x7fa7dd0bcbd0]
??:? thread_entryPoint [0x7fa7dcdc2d60]

Edit: there seems to be another variation of the crash. Unfortunately, I still haven't managed to get a core dump:

core.exception.RangeError@../src/asgen/engine.d(532): Range violation
----------------
??:? onRangeError [0x7f5709254c90]
??:? _d_arraybounds [0x7f5709255370]
??:? /usr/bin/appstream-generator [0x55e16500a3f0]
??:? void std.parallelism.ParallelForeach!(asgen.backends.interfaces.Package[]).ParallelForeach.opApply(scope int delegate(ref asgen.backends.interfaces.Package)).doIt() [0x55e165010d40]
??:? void std.parallelism.submitAndExecute(std.parallelism.TaskPool, void delegate()) [0x7f5709575ea0]
??:? int std.parallelism.ParallelForeach!(asgen.backends.interfaces.Package[]).ParallelForeach.opApply(scope int delegate(ref asgen.backends.interfaces.Package)) [0x55e165008d60]
??:? void asgen.engine.Engine.exportIconTarballs(asgen.config.Suite, immutable(char)[], asgen.backends.interfaces.Package[]) [0x55e165009ea0]
??:? bool asgen.engine.Engine.processSuiteSection(asgen.config.Suite, const(immutable(char)[]), asgen.reportgenerator.ReportGenerator) [0x55e16500b1c0]
??:? void asgen.engine.Engine.run(immutable(char)[]) [0x55e16500c360]
??:? _Dmain [0x55e164fbee00]
@ximion
Owner

ximion commented Mar 11, 2022

It's annoying that this only happens occasionally... It would help to get a backtrace: you could run the generator under GDB permanently, with debug symbols present, and have it generate a backtrace automatically on error. Or use systemd-coredump for debugging, which is an awesome tool for issues like this!
Also, make sure you are on the latest appstream-generator version, 0.8.7.
I also don't see how you could get a range violation there, since the associative-array access is guarded by a synchronized statement. But you could try replacing the line synchronized (this) iconTarFiles[iconSize.toString] ~= path; with synchronized iconTarFiles[iconSize.toString] ~= path; and see if that makes a difference: it changes the synchronization from a lock tied to just the current object to a global lock, so nothing else will run in parallel while that statement executes.
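To make the suggested change concrete, here is a minimal D sketch of the pattern being discussed. The class name, method, and map key are assumptions for illustration; only the two synchronized forms are taken from the comment above.

```d
import std.parallelism : parallel;

// Hypothetical, reduced version of the accumulation pattern in
// engine.d; the real code lives in Engine.exportIconTarballs.
class Engine
{
    string[][string] iconTarFiles;

    void collect(string[] paths)
    {
        foreach (path; parallel(paths))
        {
            // Original form: the monitor lock is tied to this object,
            // so only calls on the same Engine instance are serialized.
            //   synchronized (this) iconTarFiles["64x64"] ~= path;

            // Suggested variant: a bare `synchronized` statement takes
            // a single global lock, so no two such statements anywhere
            // in the program run concurrently.
            synchronized iconTarFiles["64x64"] ~= path;
        }
    }
}
```

The trade-off is contention: the global lock is coarser and slower, but it rules out any interleaving with other code paths that mutate the same associative array under a different monitor.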

@pabloyoyoista
Author

Ok, thank you! systemd isn't really available on Alpine, but I'll try to figure out a way to add debug symbols and get a core dump, or use GDB to extract a backtrace. I will report my findings! If that doesn't work, I'll follow up on the synchronization change you mention.

@ximion
Owner

ximion commented Apr 10, 2022

Could this issue actually be related to #101? Can you check whether that patch fixes your issue?

@pabloyoyoista
Author

I have tried updating to 0.8.8, but it looks like 922c210 introduces a subtle test dependency on appstream >= 0.15.3. We don't have Meson 0.62 in Alpine, so I wonder if disabling the tests would be the recommended way to go here?

@ximion
Owner

ximion commented Apr 12, 2022

I have tried updating to 0.8.8, but it looks like 922c210 introduces a subtle test dependency on appstream >= 0.15.3. We don't have Meson 0.62 in Alpine, so I wonder if disabling the tests would be the recommended way to go here?

I would either

  1. Get Meson 0.62+ into Alpine
  2. Revert the test fix for now and fix it once you have AppStream 0.15.3

That's a bit better than disabling the tests and then forgetting that they are disabled; you can drop the reverted patch once AppStream 0.15.3 has landed.

@pabloyoyoista
Author

The Alpine folks got Meson 0.62 in, so I have done some testing. I have been trying to get a backtrace with gdb, but it doesn't seem to resolve the symbols correctly. I have seen the problem at least once since the upgrade, though, so it might not be entirely gone. I will keep testing, but I am quite slow in the process due to other tasks and my lack of experience debugging something like this. Sorry about that.

@minlexx

minlexx commented Apr 26, 2022

Issue #101 mentions a stack size problem; this is one of the differences between musl and glibc: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Thread-stack-size

@pabloyoyoista
Author

Just to follow up: I have had the generator running under gdb with the following script for more than a week, and still no crashes. I am sharing it here because I am not sure whether I am doing something wrong...

handle SIGUSR1 nostop noprint
handle SIGUSR2 nostop noprint

catch signal SIGSEGV
commands 1
  backtrace full
  shell touch /cache/export/logs/$(date "+%Y%m%d").fail
end

run
quit

@ximion
Owner

ximion commented May 13, 2022

This looks pretty much like what I was doing a long time ago on Ubuntu, so I think your gdb commands are fine. It's just weird that the crashes are gone then!

@pabloyoyoista
Author

Ok, thank you! Let's see if I manage to capture it. Otherwise, I guess blindly increasing the stack size, as Alexei pointed out, could be an option...
