
Feature request: don't die on panic #625

Closed
smacker opened this issue Feb 22, 2019 · 14 comments
Labels: proposal (proposal for new additions or changes)

Comments

@smacker (Contributor) commented Feb 22, 2019

It would be nice to wrap the query handler in a defer/recover() so that when a query panics, the server returns an error to the client instead of exiting.

Similar to how it usually works with HTTP servers:

func(w http.ResponseWriter, r *http.Request) {
	defer func() {
		if rp := recover(); rp != nil {
			log.Errorf("panic: %v", rp)
			http.Error(w, http.StatusText(http.StatusInternalServerError), http.StatusInternalServerError)
		}
	}()

	next.ServeHTTP(w, r) // actual handler
}
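For context, here is a minimal, self-contained sketch of that pattern as a middleware. The recoverer name and the /boom route are hypothetical, and it uses the standard library log rather than a structured logger:

package main

import (
	"log"
	"net/http"
)

// recoverer wraps next and turns panics from downstream handlers
// into 500 responses instead of dropping the connection.
func recoverer(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rp := recover(); rp != nil {
				log.Printf("panic: %v", rp)
				http.Error(w, http.StatusText(http.StatusInternalServerError), http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("boom") // simulated handler bug
	})
	log.Fatal(http.ListenAndServe(":8080", recoverer(mux)))
}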
@ajnavarro (Contributor) commented Feb 25, 2019

I remember we had this functionality in previous versions of the vitess server; maybe they implemented it another way in newer versions. @kuba-- could you have a look?

@kuba-- (Contributor) commented Feb 25, 2019

Looks like it's still there: https://github.com/src-d/go-vitess/blob/34b2260c9dc53f8eb12195c8eb7a2bd273e548c6/mysql/server.go#L257

But I'll take a closer look.
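For reference, the general shape of such a catch-all recover in a per-connection handler looks roughly like this. This is a sketch of the pattern only, not the actual vitess code; safeHandle is a hypothetical name:

package server

import (
	"log"
	"net"
	"runtime/debug"
)

// safeHandle shows the shape of the listener-level catch-all recover:
// a panic kills only this connection's goroutine, not the server.
func safeHandle(conn net.Conn, handle func(net.Conn)) {
	defer func() {
		if x := recover(); x != nil {
			log.Printf("caught panic:\n%v\n%s", x, debug.Stack())
		}
		conn.Close()
	}()
	handle(conn) // read and dispatch commands on conn
}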

@kuba-- kuba-- self-assigned this Feb 25, 2019
@kuba-- (Contributor) commented Feb 25, 2019

After testing, it looks like recover still works fine.
I hardcoded a panic("boooooooo") in the group by plan and ran this query:

SELECT 
r.repository_id, SUM(ARRAY_LENGTH(SPLIT(b.blob_content, '\n'))) AS lines_count
FROM refs r
NATURAL JOIN commit_blobs ct
NATURAL JOIN blobs b
GROUP BY r.repository_id
ORDER BY lines_count DESC

mysql server panicked:

INFO[0007] ConnectionClosed: client 1                   
ERRO[0007] mysql_server caught panic:
boooooooo
/usr/local/opt/go/libexec/src/runtime/panic.go:513 (0x402cdf8)
	gopanic: reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/group_by.go:80 (0x45bd829)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*GroupBy).RowIter: panic("boooooooo")
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/project.go:64 (0x45c64d3)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*Project).RowIter: i, err := p.Child.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/sort.go:88 (0x45cfa9c)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*Sort).RowIter: i, err := s.UnaryNode.Child.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/process.go:50 (0x45c48d0)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*QueryProcess).RowIter: iter, err := p.Child.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/engine.go:106 (0x4851eaa)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server%2ev0.(*Engine).Query: iter, err := analyzed.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/server/handler.go:90 (0x4852f48)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/server.(*Handler).ComQuery: schema, rows, err := h.e.Query(ctx, query)
gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql/conn.go:730 (0x476460e)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql.(*Conn).handleNextCommand: err := handler.ComQuery(c, query, func(qr *sqltypes.Result) error {
gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql/server.go:430 (0x4773f07)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql.(*Listener).handle: err := c.handleNextCommand(l.handler)
/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1333 (0x405a660)
	goexit: BYTE	$0x90	// NOP 
INFO[0007] NewConnection: client 2                      
INFO[0007] audit trail                                   action=authentication address="127.0.0.1:51298" success=true system=audit user=root

but gitbase is still running.

@ajnavarro (Contributor) commented:
Related to this issue, src-d/gitbase#705: maybe the panic is thrown in another goroutine? Could you check whether you are able to stop gitbase by introducing a panic in the go-git LRU cache?

@kuba-- (Contributor) commented Feb 25, 2019

Yes, everything that runs in goroutines (e.g. squash joins, go-git) may crash gitbase.
Basically, inside func (it *exchangeRowIter) start(), any of the it.iterPartitions(partitions) calls may crash gitbase.
I'm going to add a recover to mysql, in func (it *exchangeRowIter) iterPartition(p sql.Partition).

This issue needs to be transferred to the mysql project.
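To illustrate why these panics escape the listener's recover: a deferred recover only covers its own goroutine, so a panic in a spawned worker kills the whole process. A minimal demonstration, unrelated to the gitbase code:

package main

import (
	"fmt"
	"time"
)

func main() {
	defer func() {
		// This recover never fires: it only covers main's goroutine.
		if x := recover(); x != nil {
			fmt.Println("recovered:", x)
		}
	}()

	go func() {
		panic("panic in a worker goroutine") // crashes the whole process
	}()

	time.Sleep(time.Second)
}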

@kuba-- kuba-- transferred this issue from src-d/gitbase Feb 25, 2019
@creachadair commented:
Be careful about swallowing panics high up in the call stack. You may avert a process termination, but the heap is now in an unknown and likely invalid state. You are likely to resume only to crash again later, much further away from the original problem, and such failures are very hard to debug.

As a rule, try to avoid handling panics that you didn't throw. If code dereferences a nil pointer or reads off the end of a slice, the program is broken, and crashing is actually preferable. Otherwise the experience for the user is potentially much worse: it could lead to weird data corruption, broken transactions, and other infelicities. I mention this because I've had to debug this kind of failure, and it's much worse than the problem it is meant to solve, IMO.

@erizocosmico (Contributor) commented:
As @creachadair says, I'd rather catch the panic earlier and have a proper stack trace than keep the server running in a potentially incorrect state.

@kuba-- (Contributor) commented Feb 25, 2019

We already have the main recover mechanism in the listener's handler (implemented by vitess).
Of course, adding another one to every goroutine does not make sense. On the other hand, root cause analysis showed that go-git's object LRU cache may panic.
I'm totally against implementing recover at the cache level (in my opinion the cache should be fixed).

I'm not a big fan of recover in defers, so the question is: how do we plan to solve this mystery (@ajnavarro, @smola)? Based on my analysis, func (it *exchangeRowIter) iterPartition(p sql.Partition) looks like the boundary between the internals and the session.

Personally, I would do nothing (so far): just integrate go-git with @jfontan's fix, src-d/go-git#1076, and monitor the situation in case something like that happens again.

@ajnavarro (Contributor) commented Feb 26, 2019

So right now the problem is in the partitions.

I think if we panic in one of the partitions, we should stop the whole query. This will mimic the previous query behavior without parallelization. Problems related to inconsistencies should be minimal. WDYT?

@creachadair commented Feb 26, 2019

> So right now the problem is in the partitions.
>
> I think if we panic in one of the partitions, we should stop the whole query. This will mimic the previous query behavior without parallelization. Problems related to inconsistencies should be minimal. WDYT?

Using a panic to handle a controlled unwinding event (e.g., a parse or query that fails deep inside, where the only thing you need to do is get control back to the top so you can report and resume) seems like a reasonable use. I'd generally advise defining your own panic signal type, e.g.

type queryFailed string // hypothetical; I have no idea what the real API looks like.
func query(ctx context.Context, q *Query) (_ *Result, err error) {
   defer func() {
      if x := recover(); x != nil {
         if s, ok := x.(queryFailed); ok {
            err = errors.New(string(s))
         } else {
            panic(x)
         }
      }
   }()
   return q.Run(ctx, new(Result)) // whatever
}
…
if badThing(happened) {
  panic(queryFailed("they're destroying our city!"))
}

or words to that effect. That way you only catch the panics you care about.

@kuba-- (Contributor) commented Feb 27, 2019

@creachadair - what you proposed is what we already have (via vitess).
Now we need something similar for the partition iterators (they run in goroutines, so you cannot catch their panics on the main thread). Moreover, it's hard to define a panic interface, because we don't panic on purpose.
What I propose in #626 is just a simple recover in an already existing defer, which closes all iterators and appends the recovery error to the error channel.
So the outside process can receive an error and follow the error scenario.
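A minimal sketch of that idea (the real change is in #626; iterPartition here is a hypothetical stand-in, not the actual go-mysql-server signature): each worker goroutine recovers its own panics and reports them on an error channel, so the reader can close iterators and surface the error to the client:

package main

import "fmt"

// iterPartition recovers its own panics and converts them into errors
// on errs, instead of letting them kill the process.
func iterPartition(id int, errs chan<- error, done chan<- int) {
	defer func() {
		if x := recover(); x != nil {
			errs <- fmt.Errorf("partition %d: recovered from panic: %v", id, x)
			return
		}
		done <- id
	}()
	if id == 2 {
		panic("boooooooo") // simulated bug deep in the iterator
	}
}

func main() {
	errs := make(chan error, 3)
	done := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		go iterPartition(i, errs, done)
	}
	for i := 0; i < 3; i++ {
		select {
		case err := <-errs:
			fmt.Println("error:", err) // follow the error scenario
		case id := <-done:
			fmt.Println("partition done:", id)
		}
	}
}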

@kuba-- kuba-- added the proposal proposal for new additions or changes label Feb 27, 2019
@creachadair commented Feb 27, 2019

> @creachadair - what you proposed is what we already have (via vitess).

As I understand it, Vitess made the same mistake that a lot of HTTP implementations do: It logs and swallows any panic, not just controlled ones.

> Now we need something similar for the partition iterators (they run in goroutines, so you cannot catch their panics on the main thread). Moreover, it's hard to define a panic interface, because we don't panic on purpose.

Right, and that's specifically not the kind of panic I'm talking about. That's a program integrity violation, and swallowing that panic will leave the heap in a corrupt state with unknown future effects.

> What I propose in #626 is just a simple recover in an already existing defer, which closes all iterators and appends the recovery error to the error channel.
> So the outside process can receive an error and follow the error scenario.

Obviously it's not my decision, but I would not recommend this approach at all. I don't know what the trigger for these panics is, but swallowing them isn't going to prevent the program from being broken; it will only change where it breaks (and perhaps make it less obvious).

@kuba-- (Contributor) commented Feb 27, 2019

Personally, I think in the case of the partition iterator it should be fine if we just close the iterator. The only concern I have is that recover will stop stack unwinding.
@ajnavarro - so maybe if we catch a panic, we restart the mysql process by sending a user-defined signal. Here is a simple prototype:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"path/filepath"
	"syscall"
)

func foo() {
	defer func() {
		if x := recover(); x != nil {
			fmt.Println("recover:", x)
			// Tell the main goroutine to restart the process.
			syscall.Kill(syscall.Getpid(), syscall.SIGUSR2)
		}
	}()

	panic("foo")
}

func fork() {
	exe, _ := os.Executable()
	dir := filepath.Dir(exe)

	// Spawn a child process inheriting stdin/stdout/stderr.
	if _, err := os.StartProcess(exe, []string{exe}, &os.ProcAttr{
		Dir: dir,
		Files: []*os.File{
			os.Stdin,
			os.Stdout,
			os.Stderr,
		},
		Sys: &syscall.SysProcAttr{},
	}); err != nil {
		fmt.Println("restart failed:", err)
	}
}

func main() {
	signalCh := make(chan os.Signal, 1)
	signal.Notify(signalCh, syscall.SIGUSR2)

	go foo()

	// Block until a recovered panic asks for a restart.
	<-signalCh
	fork()
}

Let me know what you think.

@creachadair commented Feb 27, 2019

If you don't want to stop unwinding, you can just panic again in the handler:

if x := recover(); x != nil {
   cleanup(whatever)
   panic(x)  // resume panicking
}

Maybe that helps simplify things a bit?
