Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot print backtraces within a fiber #76

Closed
davexunit opened this issue Jan 26, 2023 · 13 comments
Closed

Cannot print backtraces within a fiber #76

davexunit opened this issue Jan 26, 2023 · 13 comments

Comments

@davexunit
Copy link

At Spritely we've been experiencing a strange issue with fibers >= 1.1.0 where we cannot print backtraces for errors that we catch within a fiber. Below is a reproducer program. This program should catch an error and display a backtrace:

(use-modules (fibers)
             (fibers conditions)
             (ice-9 control))

(run-fibers
 (lambda ()
   (spawn-fiber
    (lambda ()
      (call/ec
       (lambda (abort)
         (with-exception-handler
             (lambda (e)
               (setenv "COLUMNS" "72")
               (display-backtrace (make-stack #t)
                                  (current-error-port))
               (abort))
           (lambda ()
             (error "oops")))))))
   (wait (make-condition))))

However, instead it hangs after printing a partial backtrace:

In fibers.scm:
   163:13 13 (run-fibers _ #:hz _ #:scheduler _ #:parallelism _ # _ # …)
     84:3 12 (%run-fibers _ _ _ _)
In fibers/interrupts.scm:
     69:4 11 (with-interrupts/thread-cputime _ _ _ _)
In fibers/scheduler.scm:
   314:26 10

The CPU running the program then goes to 100% usage. It always hangs on the fibers/scheduler.scm stack frame.

Interestingly, reverting commit 84addfb resolves the issue.

Bonus: You might be wondering why (setenv "COLUMNS" "72") is in the code above. It's because otherwise we encounter what seems to be a bug in Guile:

Uncaught exception in task:
(skipping a bunch of frames...)
In system/repl/debug.scm:
    72:40  1 (_)
In ice-9/boot-9.scm:
  1685:16  0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
In procedure string->number: Wrong type argument in position 1 (expecting string): #f

This is a weird one!

@cwebber
Copy link
Collaborator

cwebber commented Jan 26, 2023

That's #65, #58, and this issue all affected by 84addfb

I'm advocating for reverting 84addfb here.

@aconchillo
Copy link
Collaborator

Hi! I'd love to have some time to review and understand this. I've been too busy at work and no time for Guile lately 😞. Please, bear with me! 🙏

@wingo
Copy link
Owner

wingo commented Apr 19, 2023

It sounds to me like stack maps are somehow at fault here. I have seen some backtrace-printing problems. But unless I misunderstand, it's not a fibers bug. Unfortunately the original report doesn't repro for me!

@davexunit
Copy link
Author

davexunit commented Apr 19, 2023

Hmm, that's a bummer. Well, I took another look at this issue and I have a new reproducer that doesn't involve backtraces or stack frames at all.

(use-modules (fibers)
             (fibers conditions)
             (fibers scheduler))

(define done? (make-condition))

(run-fibers
 (lambda ()
   (spawn-fiber
    (lambda ()
      (pk (current-scheduler)) ; does not terminate for me :(
      (signal-condition! done?)))
   (wait done?)))

There is circularity in the scheduler object and this program just prints endlessly until I kill it. I ran this program against Guile built from main and fibers built from master to make sure I could still reproduce it using the freshest code.

Can anyone else reproduce this? I guess it's a bug with cycle detection in Guile's printer? I don't know how to explain why this bug would happen for me but not others.

@cbaines
Copy link

cbaines commented Apr 21, 2023

Can anyone else reproduce this? I guess it's a bug with cycle detection in Guile's printer? I don't know how to explain why this bug would happen for me but not others.

Yeah, I see the same behavior.

Also, with the original reproducer you posted, it doesn't hang forever for me, but does pause for quite a few seconds at the point you indicated, which could be because it's getting stuck trying to print the scheduler record.

Maybe a quick way to avoid this to set a custom printer for the scheduler record? #82

@wingo
Copy link
Owner

wingo commented Apr 21, 2023

Printers should be better certainly. But the backtrace code should probably abort when it has enough characters, instead of truncating the string result to the available characters.

@davexunit
Copy link
Author

davexunit commented Apr 21, 2023

How many cores do you have @cbaines? I wonder if mine prints for so much longer because I have 20 cores.

So how do these next steps sound?

  1. Merge a simpler printer for schedulers like scheduler: Set a simple custom record printer. #82 and close this issue.
  2. File a bug with Guile about fixing the truncation code in the backtrace printer.

@cbaines
Copy link

cbaines commented Apr 21, 2023

How many cores do you have @cbaines? I wonder if mine prints for so much longer because I have 20 cores.

I was testing on a machine with 8 cores, but I'm think I'm having this problem acutely on a machine with 32, so you're probably right that cores makes the problem worse.

@wingo
Copy link
Owner

wingo commented Jun 2, 2023

Regarding backtraces and circular data structures, the wip-custom-ports branch in Guile should fix this. @civodul is it ok to merge?

@wingo
Copy link
Owner

wingo commented Jun 6, 2023

So the hang is the circular data structure printing that is fixed in wip-custom-ports, AFAIU. But the other part of this bug is the COLUMNS error. Minimal reproducer:

(with-exception-handler
  (lambda (e)
    (catch #t
      (lambda () (error "bar"))
      (lambda args #f))
   (error "qux"))
  (lambda () (error "foo")))

Expected result is a backtrace in the REPL at the (error "qux"). Actual result is a backtrace at (error "bar"). Not sure why.

@davexunit
Copy link
Author

@wingo Yeah! That's exactly the behavior I was seeing. Strange, huh?

@wingo
Copy link
Owner

wingo commented Jun 6, 2023

I know why :) working on a fix...

@wingo
Copy link
Owner

wingo commented Jun 7, 2023

Fixed in wip-soft-ports in Guile: https://git.savannah.gnu.org/gitweb/?p=guile.git;a=commit;h=359d061974139adf7c08cc47ae086f2b1039b1f9

Will merge to Guile soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants