
Feature request: don't die on panic #625

Closed
smacker opened this issue Feb 22, 2019 · 14 comments
Labels: proposal (proposal for new additions or changes)

Comments

@smacker (Contributor) commented Feb 22, 2019

It would be nice to wrap the query handler in a defer/recover() so that when a query panics, the server returns an error to the client instead of exiting.

Similar to how it usually works with HTTP servers:

func(w http.ResponseWriter, r *http.Request) {
	defer func() {
		if rp := recover(); rp != nil {
			log.Errorf("panic: %v", rp)
			http.Error(w, http.StatusText(http.StatusInternalServerError), http.StatusInternalServerError)
		}
	}()

	next.ServeHTTP(w, r) // actual handler
}
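For context, here is a minimal, self-contained sketch of that pattern as a middleware. The recoverer name and the /boom route are hypothetical, and it uses the standard library log rather than a structured logger:

package main

import (
	"log"
	"net/http"
)

// recoverer wraps next and turns panics from downstream handlers
// into 500 responses instead of dropping the connection.
func recoverer(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rp := recover(); rp != nil {
				log.Printf("panic: %v", rp)
				http.Error(w, http.StatusText(http.StatusInternalServerError), http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("boom") // simulated handler bug
	})
	log.Fatal(http.ListenAndServe(":8080", recoverer(mux)))
}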
@ajnavarro (Contributor) commented Feb 25, 2019

I remember we had this functionality in previous versions of the vitess server; maybe they implemented it another way in newer versions. @kuba-- could you have a look?

@kuba-- (Contributor) commented Feb 25, 2019

Looks like it's still there: https://github.com/src-d/go-vitess/blob/34b2260c9dc53f8eb12195c8eb7a2bd273e548c6/mysql/server.go#L257

But I'll take a closer look.
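For reference, the general shape of such a catch-all recover in a per-connection handler looks roughly like this. This is a sketch of the pattern only, not the actual vitess code; safeHandle is a hypothetical name:

package server

import (
	"log"
	"net"
	"runtime/debug"
)

// safeHandle shows the shape of the listener-level catch-all recover:
// a panic kills only this connection's goroutine, not the server.
func safeHandle(conn net.Conn, handle func(net.Conn)) {
	defer func() {
		if x := recover(); x != nil {
			log.Printf("caught panic:\n%v\n%s", x, debug.Stack())
		}
		conn.Close()
	}()
	handle(conn) // read and dispatch commands on conn
}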

@kuba-- kuba-- self-assigned this Feb 25, 2019
@kuba-- (Contributor) commented Feb 25, 2019

After testing, it looks like recover still works fine.
I hardcoded a panic("boooooooo") in the group by plan and ran this query:

SELECT 
r.repository_id, SUM(ARRAY_LENGTH(SPLIT(b.blob_content, '\n'))) AS lines_count
FROM refs r
NATURAL JOIN commit_blobs ct
NATURAL JOIN blobs b
GROUP BY r.repository_id
ORDER BY lines_count DESC

mysql server panicked:

INFO[0007] ConnectionClosed: client 1                   
ERRO[0007] mysql_server caught panic:
boooooooo
/usr/local/opt/go/libexec/src/runtime/panic.go:513 (0x402cdf8)
	gopanic: reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/group_by.go:80 (0x45bd829)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*GroupBy).RowIter: panic("boooooooo")
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/project.go:64 (0x45c64d3)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*Project).RowIter: i, err := p.Child.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/sort.go:88 (0x45cfa9c)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*Sort).RowIter: i, err := s.UnaryNode.Child.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/process.go:50 (0x45c48d0)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*QueryProcess).RowIter: iter, err := p.Child.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/engine.go:106 (0x4851eaa)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server%2ev0.(*Engine).Query: iter, err := analyzed.RowIter(ctx)
gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/server/handler.go:90 (0x4852f48)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/server.(*Handler).ComQuery: schema, rows, err := h.e.Query(ctx, query)
gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql/conn.go:730 (0x476460e)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql.(*Conn).handleNextCommand: err := handler.ComQuery(c, query, func(qr *sqltypes.Result) error {
gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql/server.go:430 (0x4773f07)
	com/src-d/gitbase/vendor/gopkg.in/src-d/go-vitess.v1/mysql.(*Listener).handle: err := c.handleNextCommand(l.handler)
/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1333 (0x405a660)
	goexit: BYTE	$0x90	// NOP 
INFO[0007] NewConnection: client 2                      
INFO[0007] audit trail                                   action=authentication address="127.0.0.1:51298" success=true system=audit user=root

but gitbase is still running.

@ajnavarro (Contributor) commented:
Related to this issue, src-d/gitbase#705: maybe the panic is thrown in another goroutine? Could you check whether you are able to stop gitbase by introducing a panic in the go-git LRU cache?

@kuba-- (Contributor) commented Feb 25, 2019

Yes, everything that runs in goroutines (e.g. squash joins, go-git) may crash gitbase.
Basically, inside func (it *exchangeRowIter) start(), any of the it.iterPartitions(partitions) calls may crash gitbase.
I'm going to add a recover to mysql, in func (it *exchangeRowIter) iterPartition(p sql.Partition).

This issue needs to be transferred to the mysql project.
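To illustrate why these panics escape the listener's recover: a deferred recover only covers its own goroutine, so a panic in a spawned worker kills the whole process. A minimal demonstration, unrelated to the gitbase code:

package main

import (
	"fmt"
	"time"
)

func main() {
	defer func() {
		// This recover never fires: it only covers main's goroutine.
		if x := recover(); x != nil {
			fmt.Println("recovered:", x)
		}
	}()

	go func() {
		panic("panic in a worker goroutine") // crashes the whole process
	}()

	time.Sleep(time.Second)
}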

@kuba-- kuba-- transferred this issue from src-d/gitbase Feb 25, 2019
@creachadair commented:
Be careful about swallowing panics high up in the call stack. You may avert a process termination, but the heap is now in an unknown and likely invalid state. You are likely to resume only to crash again later, much further away from the original problem, and such failures are very hard to debug.

As a rule, try to avoid handling panics that you didn't throw. If code dereferences a nil pointer or reads off the end of a slice, the program is broken, and crashing is actually preferable. Otherwise the experience for the user is potentially much worse: it could lead to weird data corruption, broken transactions, and other infelicities. I mention this because I've had to debug this kind of failure, and it's much worse than the problem it is meant to solve, IMO.

@erizocosmico (Contributor) commented:
As @creachadair says, I'd rather catch the panic earlier and have a proper stack trace than keep the server running in a potentially incorrect state.

@kuba-- (Contributor) commented Feb 25, 2019

We already have the main recover mechanism in the listener's handler (implemented by vitess).
Of course, adding another one to every goroutine does not make sense. On the other hand, root cause analysis showed that go-git's object LRU cache may panic.
I'm totally against implementing recover at the cache level (in my opinion the cache should be fixed).

I'm not a big fan of recover in defers, so the question is: how do we plan to solve this mystery (@ajnavarro, @smola)? Based on my analysis, func (it *exchangeRowIter) iterPartition(p sql.Partition) looks like the boundary between the internals and the session.

Personally, I would do nothing (so far): just integrate go-git with @jfontan's fix, src-d/go-git#1076, and monitor the situation in case something like that happens again.

@ajnavarro (Contributor) commented Feb 26, 2019

So right now the problem is in the partitions.

I think if we panic in one of the partitions, we should stop the whole query. This will mimic the previous query behavior without parallelization. Problems related to inconsistencies should be minimal. WDYT?

@creachadair commented Feb 26, 2019

> So right now the problem is in the partitions.
>
> I think if we panic in one of the partitions, we should stop the whole query. This will mimic the previous query behavior without parallelization. Problems related to inconsistencies should be minimal. WDYT?

Using a panic to handle a controlled unwinding event (e.g., a parse or query that fails deep inside, where the only thing you need to do is get control back to the top so you can report and resume) seems like a reasonable use. I'd generally advise defining your own panic signal type, e.g.

type queryFailed string // hypothetical; I have no idea what the real API looks like.
func query(ctx context.Context, q *Query) (_ *Result, err error) {
   defer func() {
      if x := recover(); x != nil {
         if s, ok := x.(queryFailed); ok {
            err = errors.New(string(s))
         } else {
            panic(x)
         }
      }
   }()
   return q.Run(ctx, new(Result)) // whatever
}
…
if badThing(happened) {
  panic(queryFailed("they're destroying our city!"))
}

or words to that effect. That way you only catch the panics you care about.

@kuba-- (Contributor) commented Feb 27, 2019

@creachadair - what you proposed is what we already have (via vitess).
Now we need something similar for the partition iterators (they run in goroutines, so you cannot catch their panics on the main thread). Moreover, it's hard to define a panic interface, because we don't panic on purpose.
What I propose in #626 is just a simple recover in an already existing defer, which closes all iterators and appends the recovery error to the error channel.
So the outside process can receive an error and follow the error scenario.
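A minimal sketch of that idea (the real change is in #626; iterPartition here is a hypothetical stand-in, not the actual go-mysql-server signature): each worker goroutine recovers its own panics and reports them on an error channel, so the reader can close iterators and surface the error to the client:

package main

import "fmt"

// iterPartition recovers its own panics and converts them into errors
// on errs, instead of letting them kill the process.
func iterPartition(id int, errs chan<- error, done chan<- int) {
	defer func() {
		if x := recover(); x != nil {
			errs <- fmt.Errorf("partition %d: recovered from panic: %v", id, x)
			return
		}
		done <- id
	}()
	if id == 2 {
		panic("boooooooo") // simulated bug deep in the iterator
	}
}

func main() {
	errs := make(chan error, 3)
	done := make(chan int, 3)
	for i := 1; i <= 3; i++ {
		go iterPartition(i, errs, done)
	}
	for i := 0; i < 3; i++ {
		select {
		case err := <-errs:
			fmt.Println("error:", err) // follow the error scenario
		case id := <-done:
			fmt.Println("partition done:", id)
		}
	}
}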

@kuba-- kuba-- added the proposal proposal for new additions or changes label Feb 27, 2019
@creachadair commented Feb 27, 2019

> @creachadair - what you proposed is what we already have (via vitess).

As I understand it, Vitess made the same mistake that a lot of HTTP implementations do: It logs and swallows any panic, not just controlled ones.

> Now we need something similar for the partition iterators (they run in goroutines, so you cannot catch their panics on the main thread). Moreover, it's hard to define a panic interface, because we don't panic on purpose.

Right, and that's specifically not the kind of panic I'm talking about. That's a program integrity violation, and swallowing that panic will leave the heap in a corrupt state with unknown future effects.

> What I propose in #626 is just a simple recover in an already existing defer, which closes all iterators and appends the recovery error to the error channel.
> So the outside process can receive an error and follow the error scenario.

Obviously it's not my decision, but I would not recommend this approach at all. I don't know what the trigger for these panics is, but swallowing them isn't going to prevent the program from being broken; it will only change where it breaks (and perhaps make it less obvious).

@kuba-- (Contributor) commented Feb 27, 2019

Personally, I think in the case of the partition iterator it should be fine if we just close the iterator. The only concern I have is that recover will stop stack unwinding.
@ajnavarro - so maybe if we catch a panic, we restart the mysql process by sending a user-defined signal. Here is a simple prototype:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"path/filepath"
	"syscall"
)

func foo() {
	defer func() {
		if x := recover(); x != nil {
			fmt.Println("recover:", x)
			// Tell the main goroutine to restart the process.
			syscall.Kill(syscall.Getpid(), syscall.SIGUSR2)
		}
	}()

	panic("foo")
}

func fork() {
	exe, _ := os.Executable()
	dir := filepath.Dir(exe)

	// Spawn a child process inheriting stdin/stdout/stderr.
	if _, err := os.StartProcess(exe, []string{exe}, &os.ProcAttr{
		Dir: dir,
		Files: []*os.File{
			os.Stdin,
			os.Stdout,
			os.Stderr,
		},
		Sys: &syscall.SysProcAttr{},
	}); err != nil {
		fmt.Println("restart failed:", err)
	}
}

func main() {
	signalCh := make(chan os.Signal, 1)
	signal.Notify(signalCh, syscall.SIGUSR2)

	go foo()

	// Block until a recovered panic asks for a restart.
	<-signalCh
	fork()
}

Let me know what you think.

@creachadair commented Feb 27, 2019

If you don't want to stop unwinding, you can just panic again in the handler:

if x := recover(); x != nil {
   cleanup(whatever)
   panic(x)  // resume panicking
}

Maybe that helps simplify things a bit?
