
Sentry error reporting doesn't quite work #350

Closed

wh5a opened this issue May 17, 2014 · 32 comments

Comments

@wh5a commented May 17, 2014

In rqworker.py the main process creates a Sentry client and passes it to register_sentry. However, when an exception is raised in the worker process, the client inherited from the parent no longer works: no error is logged to Sentry.
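
For reference, the setup described above looks roughly like this (a paraphrased sketch, not the exact rqworker.py source; the DSN is a placeholder):

from raven import Client
from rq.contrib.sentry import register_sentry

# In the parent worker process, before a work horse is forked for each job:
client = Client('https://public:secret@app.getsentry.com/1234')
register_sentry(client, worker)  # worker is the rq.Worker instance created earlier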

@nvie (Collaborator) commented May 18, 2014

Hey @wh5a, I made a fix for this on this branch. I've seen this unreliable Sentry delivery first-hand recently and wasn't sure what the root cause was, so thanks for reporting.

Could you test if this new setup works for you? I haven't given it a shot yet. If it works, I'll put this in prod myself after the weekend, and release RQ 0.4.6 with this change applied.

Note that this is a backward-incompatible change: you'll now need to pass register_sentry() the DSN string instead of a fully initialized client instance. I didn't see another way around it.
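
In other words, the registration call would change roughly like this (a sketch; the exact new signature is whatever the linked branch defines, and the DSN is a placeholder):

# Before: pass a fully initialized Raven client
register_sentry(Client('https://public:secret@app.getsentry.com/1234'), worker)

# After: pass the DSN string, presumably so a fresh client can be created after the fork
register_sentry('https://public:secret@app.getsentry.com/1234', worker)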

@wh5a (Author) commented May 18, 2014

Hmm, I thought the failure was due to forking, but your patch still didn't fix it. Now I'm puzzled...

@selwin (Collaborator) commented May 18, 2014

@nvie I'd actually like to see contrib.sentry deprecated and to encourage the use of a more standard logging mechanism instead. I have an example logging configuration using django-rq here: https://github.com/ui/django-rq.

Thoughts?

@selwin (Collaborator) commented May 18, 2014

As a side note, I'm also using Sentry as my error logger in production (via django-rq) and I've never had any problems with it.

@wh5a (Author) commented May 19, 2014

@selwin I copied your LOGGING setting and created a Django function that intentionally triggers an exception. Then I used django_rq.enqueue to enqueue this erratic function, but I didn't get any logs in Sentry. What did I do wrong?
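
For context, the test looked roughly like this (a minimal sketch; the function name is made up):

# tasks.py -- a job that always fails, used only to test error reporting
def broken_task():
    raise RuntimeError('intentional failure to test Sentry logging')

# enqueued from a Django shell or view
import django_rq
django_rq.enqueue(broken_task)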

@selwin (Collaborator) commented May 19, 2014

@wh5a can you show me your logging configuration (don't forget to censor out sensitive data)? Also, did you run rqworker using python manage.py rqworker instead of the rqworker script built into RQ?

@nvie (Collaborator) commented May 19, 2014

@selwin I like your suggestion a lot, thanks.

@wh5a (Author) commented May 19, 2014

@selwin good point. I was directly running rqworker. But when I switched to python manage.py rqworker, the rqworker process died the moment it received a job:

08:31:51 Registering birth of worker send-clone.19216
08:31:51 RQ worker started, version 0.4.5
08:31:51
08:31:51 *** Listening on default...
08:31:51 Sent heartbeat to prevent worker timeout. Next one should arrive within 420 seconds.
08:32:22 default: m2m.blah.oo() (c10ebc9e-66bb-48c6-b9d8-48aa5ff41f31)
08:32:22 Sent heartbeat to prevent worker timeout. Next one should arrive within 420 seconds.
08:32:22 Registering death
OSError: [Errno 12] Cannot allocate memory

My logging configuration is identical to the one in https://github.com/ui/django-rq/blob/master/README.rst.

@selwin (Collaborator) commented May 19, 2014

From the OSError in the log you provided, it seems like your box doesn't have enough memory.

@wh5a (Author) commented May 19, 2014

@selwin that doesn't seem to be the real error. The box still has plenty of memory, and the error only happens when the worker is started through manage.py.

@selwin (Collaborator) commented May 19, 2014

How much memory does your box have? Workers started with manage.py rqworker will consume more memory than RQ's built-in version, since they also need to load your Django app into memory.

@wh5a (Author) commented May 19, 2014

@selwin Sorry, you were right. It was really running out of memory. After killing a few processes I didn't get the OOM, but for some reason I didn't get the Sentry log either.

@wh5a (Author) commented May 19, 2014

It seems to be the same issue when going through the logging system: exceptions in the worker process don't get logged to Sentry for some reason.

@selwin (Collaborator) commented May 19, 2014

Mind copy-pasting the relevant logging configuration so I can take a look?


@wh5a (Author) commented May 19, 2014

I had the same logging configuration as your README:

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "rq_console": {
            "format": "%(asctime)s %(message)s",
            "datefmt": "%H:%M:%S",
        },
    },
    "handlers": {
        "rq_console": {
            "level": "DEBUG",
            "class": "rq.utils.ColorizingStreamHandler",
            "formatter": "rq_console",
            "exclude": ["%(asctime)s"],
        },
        # If you use sentry for logging
        'sentry': {
            'level': 'ERROR',
            'class': 'raven.contrib.django.handlers.SentryHandler',
        },
    },
    'loggers': {
        "rq.worker": {
            "handlers": ["rq_console", "sentry"],
            "level": "DEBUG"
        },
    }
}

@selwin (Collaborator) commented May 19, 2014

Did you set up Sentry's DSN and the other related settings correctly? You can test it by pointing Django's error logger at Sentry and seeing whether it logs correctly.

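For example, one hedged way to test this with the configuration above is to attach the same 'sentry' handler to Django's request error logger and trigger a 500 error:

'loggers': {
    'django.request': {
        'handlers': ['sentry'],
        'level': 'ERROR',
        'propagate': False,
    },
    # ...plus the existing "rq.worker" logger from the config above
},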

@wh5a (Author) commented May 19, 2014

Yes, I have verified that the main process can log to Sentry, but the worker process cannot.

@wh5a (Author) commented May 19, 2014

I thought it could be a bug in the raven library, but I created a simple test and it worked fine in the child process:

from os import fork

if fork():
    print('parent')
else:
    print('child')
    from raven import Client
    client = Client()
    try:
        2 / 0
    except:
        client.captureException()

@nvie (Collaborator) commented May 19, 2014

Could you immediately call os._exit() after doing the Sentry call? This mimics what happens in the worker. I haven't tried it myself yet, but I have a hunch that Sentry is doing async delivery and the child process is terminated before the event gets a chance to be delivered.
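
For illustration, the suggestion amounts to extending the earlier fork test roughly like this (a sketch; the client is assumed to pick up its DSN from the SENTRY_DSN environment variable):

import os
from raven import Client

if os.fork():
    print('parent')
else:
    print('child')
    client = Client()  # DSN assumed to come from the environment
    try:
        2 / 0
    except ZeroDivisionError:
        client.captureException()
    os._exit(0)  # exit immediately, like the work horse does; async delivery may get cut off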

@wh5a (Author) commented May 20, 2014

@nvie Yes, I believe you're on the right track. Calling os._exit(0) would stop sentry delivery.

@nvie (Collaborator) commented May 20, 2014

Could you try changing your Sentry DSN to 'sync+https://...', i.e. prefixing it with sync+? I can confirm this just worked for me.
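
For clarity, the only change is to the DSN string itself; a hedged sketch with placeholder credentials:

from raven import Client

# "sync+" selects Raven's synchronous transport instead of the threaded default
client = Client('sync+https://public:secret@app.getsentry.com/1234')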

@nvie (Collaborator) commented May 20, 2014

Note: I'm not advocating that you do this permanently. This is not the final solution, but it definitely confirms the theory that this is the underlying issue.

It is not the final solution because, if Sentry is down, this will block the worker processes, preventing them from terminating (and thus stopping job execution unnecessarily).

Here are a few solutions that we could consider:

  1. Letting the work horse process always live a little bit longer after an exception has occurred, say 2-5 seconds, so there is some time to deliver the Sentry event. This would be as simple as adding a time.sleep() right after all exceptions have been handled, but before the os._exit() call (see the sketch after this list).
  2. Letting the work horse process live a little bit longer after the exception handler, but detecting when the event has been delivered so we can exit sooner and not block the worker unnecessarily. This would be more optimal, but I'm unsure whether Raven can inform us about this.
  3. Letting the main worker deliver the Sentry events in a thread. This is going to be harder to implement due to RQ's worker architecture, though.
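
A rough sketch of option 1, just to make the idea concrete (not actual RQ code; the function and constant are hypothetical):

import os
import time

SENTRY_DELIVERY_GRACE = 2  # seconds; hypothetical grace period

def exit_work_horse():
    # ...all exception handlers have already run at this point...
    time.sleep(SENTRY_DELIVERY_GRACE)  # give the async transport a window to deliver the event
    os._exit(0)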

Option 2 sounds like the most desirable one, but how would this work if we move Sentry handling to the more standard logging approach, @selwin? I don't think we can interact with Raven directly anymore that way, so we cannot know whether the event has been delivered.

Which one has your preference?

@wh5a (Author) commented May 20, 2014

Yes, indeed running synchronously made the sentry delivery go through. Would it block job execution forever? I thought there would be a timeout after which the worker would be killed anyway? Or do you just not like the unnecessary timeout?

@nvie (Collaborator) commented May 20, 2014

That, and the timeout itself is what causes the exception in the first place, after which the Sentry delivery is due, so it's a bit of a corner case.

@selwin (Collaborator) commented May 20, 2014

Unfortunately, I'm -1 on all three options. I think we should just document that if Sentry is used to handle errors, the sync transport should be used.

This is, after all, due to the way raven is designed: it doesn't play well with anything that uses os.fork(). I'm not too worried about Sentry being down either. Being a cloud-based logging service, there's always a risk that the service is unreachable, and the user is probably familiar with those risks.

I never ran into these issues because I'm running raven < 3.6 in production (3.6 switched to a threaded transport by default).

@wh5a (Author) commented May 20, 2014

@selwin Some people run self-hosted Sentry servers, which may not be as reliable as the official cloud.

@selwin (Collaborator) commented May 21, 2014

Then they should be aware of the risks of hosting their own error logging service. Just kidding ;)

We already have a mechanism in place to kill long-running processes using timeouts, and blocking Sentry calls seem to fall into that category of problem.

Apart from that, perhaps we should encourage the use of custom Worker, Job or Queue classes? We could then document these custom workers/jobs or pull them into contrib.
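
As one possible shape for this, a hedged sketch of a custom exception handler pushed onto a worker that reports failures to Sentry explicitly (placeholder DSN; it uses RQ's exception-handler hook rather than a Worker subclass):

from raven import Client
from redis import Redis
from rq import Queue, Worker

sentry = Client('sync+https://public:secret@app.getsentry.com/1234')

def report_to_sentry(job, exc_type, exc_value, traceback):
    # send the failed job's exception to Sentry using the synchronous transport
    sentry.captureException(exc_info=(exc_type, exc_value, traceback))
    return True  # let the next handler run (e.g. moving the job to the failed queue)

connection = Redis()
worker = Worker([Queue(connection=connection)], connection=connection)
worker.push_exc_handler(report_to_sentry)
worker.work()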

What do you think @nvie?


@nvie (Collaborator) commented May 21, 2014

@selwin I think you're right. The timeout I wanted to instate can actually be specified in the DSN string as well, if you encode it as a URL param, ?timeout=2. So I think we should indeed do this. I'll change RQ to raise a warning when a DSN is used without the sync+ prefix, and release 0.4.6 with that. I'd like to postpone the switch to the logging-based Sentry configuration to the 0.5 release, since that will require some more work for most users. Thanks for the help, all!
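
So a DSN combining both settings might look like this (placeholder credentials):

sync+https://public:secret@app.getsentry.com/1234?timeout=2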

@selwin (Collaborator) commented May 21, 2014

@wh5a do you mind copy-pasting a working django-rq logging configuration using Raven's sync transport? I want to update django-rq's documentation to make sure it works with newer Raven versions.

@wh5a (Author) commented May 21, 2014

@selwin I decided to run rqworker directly rather than going through the django-rq wrapper, since, as you pointed out, it's more memory-efficient. The change was simply to prefix the DSN with sync+.
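
Concretely, assuming the DSN is supplied through an rqworker settings module (e.g. loaded with rqworker -c settings), the only change is the sync+ prefix (placeholder credentials):

SENTRY_DSN = 'sync+https://public:secret@app.getsentry.com/1234'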

@selwin (Collaborator) commented May 21, 2014

@wh5a alright, cool. I'm closing this issue since this is caused by Raven's transport method rather than RQ itself.

@selwin closed this as completed May 21, 2014
@nvie (Collaborator) commented May 21, 2014

Cool, I'll release RQ 0.4.6 in an hour.
