
Sentry error reporting doesn't quite work #350

Closed

wh5a opened this issue May 17, 2014 · 32 comments

Comments

@wh5a commented May 17, 2014

In rqworker.py the main process creates a Sentry client and passes it to register_sentry. However, when an exception is raised in the worker process, the client inherited from the parent no longer works: no error is logged to Sentry.
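
For reference, the setup described above looks roughly like this (a paraphrased sketch, not the exact rqworker.py source; the DSN is a placeholder):

from raven import Client
from rq.contrib.sentry import register_sentry

# In the parent worker process, before a work horse is forked for each job:
client = Client('https://public:secret@app.getsentry.com/1234')
register_sentry(client, worker)  # worker is the rq.Worker instance created earlier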

@nvie (Collaborator) commented May 18, 2014

Hey @wh5a, I made a fix for this on this branch. I've seen this unreliable Sentry delivery first-hand recently and wasn't sure what the root cause was, so thanks for reporting.

Could you test if this new setup works for you? I haven't given it a shot yet. If it works, I'll put this in prod myself after the weekend, and release RQ 0.4.6 with this change applied.

Note that this is a backward-incompatible change: you'll now need to pass register_sentry() the DSN string instead of a fully initialized client instance. I didn't see another way around it.
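
In other words, the registration call would change roughly like this (a sketch; the exact new signature is whatever the linked branch defines, and the DSN is a placeholder):

# Before: pass a fully initialized Raven client
register_sentry(Client('https://public:secret@app.getsentry.com/1234'), worker)

# After: pass the DSN string, presumably so a fresh client can be created after the fork
register_sentry('https://public:secret@app.getsentry.com/1234', worker)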

@wh5a (Author) commented May 18, 2014

Hmm, I thought the failure was due to forking, but your patch still didn't fix it. Now I'm puzzled...

@selwin (Collaborator) commented May 18, 2014

@nvie I'd actually like to see contrib.sentry deprecated and to encourage the use of a more standard logging mechanism instead. I have an example logging configuration using django-rq here: https://github.com/ui/django-rq.

Thoughts?

@selwin (Collaborator) commented May 18, 2014

As a side note, I'm also using Sentry as my error logger in production (via django-rq) and I've never had any problems with it.

@wh5a (Author) commented May 19, 2014

@selwin I copied your LOGGING setting and created a Django function that intentionally triggers an exception. Then I used django_rq.enqueue to enqueue this erratic function, but I didn't get any logs in Sentry. What did I do wrong?
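
For context, the test looked roughly like this (a minimal sketch; the function name is made up):

# tasks.py -- a job that always fails, used only to test error reporting
def broken_task():
    raise RuntimeError('intentional failure to test Sentry logging')

# enqueued from a Django shell or view
import django_rq
django_rq.enqueue(broken_task)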

@selwin (Collaborator) commented May 19, 2014

@wh5a can you show me your logging configuration (don't forget to censor out sensitive data)? Also, did you run rqworker using python manage.py rqworker instead of the rqworker script built into RQ?

@nvie (Collaborator) commented May 19, 2014

@selwin I like your suggestion a lot, thanks.

@wh5a (Author) commented May 19, 2014

@selwin good point. I was directly running rqworker. But when I switched to python manage.py rqworker, the rqworker process died the moment it received a job:

08:31:51 Registering birth of worker send-clone.19216
08:31:51 RQ worker started, version 0.4.5
08:31:51
08:31:51 *** Listening on default...
08:31:51 Sent heartbeat to prevent worker timeout. Next one should arrive within 420 seconds.
08:32:22 default: m2m.blah.oo() (c10ebc9e-66bb-48c6-b9d8-48aa5ff41f31)
08:32:22 Sent heartbeat to prevent worker timeout. Next one should arrive within 420 seconds.
08:32:22 Registering death
OSError: [Errno 12] Cannot allocate memory

My logging configuration is identical to the one in https://github.com/ui/django-rq/blob/master/README.rst.

@selwin (Collaborator) commented May 19, 2014

From the OSError in the log you provided, it seems like your box doesn't have enough memory.

@wh5a (Author) commented May 19, 2014

@selwin that doesn't seem to be the real error. The box still has plenty of memory, and the error only happens when the worker is started through manage.py.

@selwin (Collaborator) commented May 19, 2014

How much memory does your box have? Workers started with manage.py rqworker will consume more memory than RQ's built-in version, since they also need to load your Django app into memory.

@wh5a (Author) commented May 19, 2014

@selwin Sorry, you were right. It was really running out of memory. After killing a few processes I didn't get the OOM, but for some reason I didn't get the Sentry log either.

@wh5a (Author) commented May 19, 2014

It seems to be the same issue when going through the logging system: exceptions in the worker process don't get logged to Sentry for some reason.

@selwin (Collaborator) commented May 19, 2014

Mind copy-pasting the relevant logging configuration so I can take a look?


@wh5a (Author) commented May 19, 2014

I had the same logging configuration as your README:

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "rq_console": {
            "format": "%(asctime)s %(message)s",
            "datefmt": "%H:%M:%S",
        },
    },
    "handlers": {
        "rq_console": {
            "level": "DEBUG",
            "class": "rq.utils.ColorizingStreamHandler",
            "formatter": "rq_console",
            "exclude": ["%(asctime)s"],
        },
        # If you use sentry for logging
        'sentry': {
            'level': 'ERROR',
            'class': 'raven.contrib.django.handlers.SentryHandler',
        },
    },
    'loggers': {
        "rq.worker": {
            "handlers": ["rq_console", "sentry"],
            "level": "DEBUG"
        },
    }
}

@selwin (Collaborator) commented May 19, 2014

Did you set up Sentry's DSN and the other related settings correctly? You can test it by pointing Django's error logger at Sentry and seeing whether it logs correctly.

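For example, one hedged way to test this with the configuration above is to attach the same 'sentry' handler to Django's request error logger and trigger a 500 error:

'loggers': {
    'django.request': {
        'handlers': ['sentry'],
        'level': 'ERROR',
        'propagate': False,
    },
    # ...plus the existing "rq.worker" logger from the config above
},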

@wh5a (Author) commented May 19, 2014

Yes, I have verified that the main process can log to Sentry, but the worker process cannot.

@wh5a (Author) commented May 19, 2014

I thought it could be a bug in the raven library, but I created a simple test and it worked fine in the child process:

from os import fork

if fork():
    print('parent')
else:
    print('child')
    from raven import Client
    client = Client()
    try:
        2 / 0
    except:
        client.captureException()

@nvie (Collaborator) commented May 19, 2014

Could you immediately call os._exit() after doing the Sentry call? This mimics what happens in the worker. I haven't tried it myself yet, but I have a hunch that Sentry is doing async delivery and the child process is terminated before the event gets a chance to be delivered.
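
For illustration, the suggestion amounts to extending the earlier fork test roughly like this (a sketch; the client is assumed to pick up its DSN from the SENTRY_DSN environment variable):

import os
from raven import Client

if os.fork():
    print('parent')
else:
    print('child')
    client = Client()  # DSN assumed to come from the environment
    try:
        2 / 0
    except ZeroDivisionError:
        client.captureException()
    os._exit(0)  # exit immediately, like the work horse does; async delivery may get cut off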

@wh5a (Author) commented May 20, 2014

@nvie Yes, I believe you're on the right track. Calling os._exit(0) would stop sentry delivery.

@nvie (Collaborator) commented May 20, 2014

Could you try changing your Sentry DSN to 'sync+https://...', i.e. prefixing it with sync+? I can confirm this just worked for me.
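
For clarity, the only change is to the DSN string itself; a hedged sketch with placeholder credentials:

from raven import Client

# "sync+" selects Raven's synchronous transport instead of the threaded default
client = Client('sync+https://public:secret@app.getsentry.com/1234')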

@nvie (Collaborator) commented May 20, 2014

Note: I'm not advocating that you do this permanently. This is not the final solution, but it definitely confirms the theory that this is the underlying issue.

It is not the final solution because, if Sentry is down, this will block the worker processes, preventing them from terminating (and thus stopping job execution unnecessarily).

Here are a few solutions that we could consider:

  1. Letting the work horse process always live a little bit longer after an exception has occurred, say 2-5 seconds, so there is some time to deliver the Sentry event. This would be as simple as adding a time.sleep() right after all exceptions have been handled, but before the os._exit() call (see the sketch after this list).
  2. Letting the work horse process live a little bit longer after the exception handler, but detecting when the event has been delivered so we can exit sooner and not block the worker unnecessarily. This would be more optimal, but I'm unsure whether Raven can inform us about this.
  3. Letting the main worker deliver the Sentry events in a thread. This is going to be harder to implement due to RQ's worker architecture, though.
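
A rough sketch of option 1, just to make the idea concrete (not actual RQ code; the function and constant are hypothetical):

import os
import time

SENTRY_DELIVERY_GRACE = 2  # seconds; hypothetical grace period

def exit_work_horse():
    # ...all exception handlers have already run at this point...
    time.sleep(SENTRY_DELIVERY_GRACE)  # give the async transport a window to deliver the event
    os._exit(0)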

Option 2 sounds like the most desirable one, but how would this work if we move Sentry handling to the more standard logging approach, @selwin? I don't think we can interact with Raven directly anymore that way, so we cannot know whether the event has been delivered.

Which one has your preference?

@wh5a (Author) commented May 20, 2014

Yes, indeed running synchronously made the sentry delivery go through. Would it block job execution forever? I thought there would be a timeout after which the worker would be killed anyway? Or do you just not like the unnecessary timeout?

@nvie (Collaborator) commented May 20, 2014

That, and the timeout itself is what causes the exception in the first place, after which the Sentry delivery is due, so it's a bit of a corner case.

@selwin (Collaborator) commented May 20, 2014

Unfortunately, I'm -1 on all three options. I think we should just document that if Sentry is used to handle errors, the sync transport should be used.

This is, after all, due to the way raven is designed: it doesn't play well with anything that uses os.fork(). I'm not too worried about Sentry being down either. Being a cloud-based logging service, there's always a risk that the service is unreachable, and the user is probably familiar with those risks.

I never ran into these issues because I'm running raven < 3.6 in production (3.6 switched to a threaded transport by default).

@wh5a (Author) commented May 20, 2014

@selwin Some people run self-hosted Sentry servers, which may not be as reliable as the official cloud.

@selwin (Collaborator) commented May 21, 2014

Then they should be aware of the risks of hosting their own error logging service. Just kidding ;)

We already have a mechanism in place to kill long-running processes using timeouts, and blocking Sentry calls seem to fall into that category of problem.

Apart from that, perhaps we should encourage the use of custom Worker, Job or Queue classes? We could then document these custom workers/jobs or pull them into contrib.
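
As one possible shape for this, a hedged sketch of a custom exception handler pushed onto a worker that reports failures to Sentry explicitly (placeholder DSN; it uses RQ's exception-handler hook rather than a Worker subclass):

from raven import Client
from redis import Redis
from rq import Queue, Worker

sentry = Client('sync+https://public:secret@app.getsentry.com/1234')

def report_to_sentry(job, exc_type, exc_value, traceback):
    # send the failed job's exception to Sentry using the synchronous transport
    sentry.captureException(exc_info=(exc_type, exc_value, traceback))
    return True  # let the next handler run (e.g. moving the job to the failed queue)

connection = Redis()
worker = Worker([Queue(connection=connection)], connection=connection)
worker.push_exc_handler(report_to_sentry)
worker.work()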

What do you think @nvie?


@nvie (Collaborator) commented May 21, 2014

@selwin I think you're right. The timeout I wanted to instate can actually be specified in the DSN string as well, if you encode it as a URL param, ?timeout=2. So I think we should indeed do this. I'll change RQ to raise a warning when a DSN is used without the sync+ prefix, and release 0.4.6 with that. I'd like to postpone the switch to the logging-based Sentry configuration to the 0.5 release, since that will require some more work for most users. Thanks for the help, all!
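
So a DSN combining both settings might look like this (placeholder credentials):

sync+https://public:secret@app.getsentry.com/1234?timeout=2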

@selwin (Collaborator) commented May 21, 2014

@wh5a do you mind copy-pasting a working django-rq logging configuration using Raven's sync transport? I want to update django-rq's documentation to make sure it works with newer Raven versions.

@wh5a (Author) commented May 21, 2014

@selwin I decided to run rqworker directly rather than going through the django-rq wrapper, since, as you pointed out, it's more memory-efficient. The change was simply to prefix the DSN with sync+.
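
Concretely, assuming the DSN is supplied through an rqworker settings module (e.g. loaded with rqworker -c settings), the only change is the sync+ prefix (placeholder credentials):

SENTRY_DSN = 'sync+https://public:secret@app.getsentry.com/1234'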

@selwin (Collaborator) commented May 21, 2014

@wh5a alright, cool. I'm closing this issue since this is caused by Raven's transport method rather than RQ itself.

@selwin closed this as completed May 21, 2014
@nvie (Collaborator) commented May 21, 2014

Cool, I'll release RQ 0.4.6 in an hour.
