process_spider_exception() not invoked for generators #220

Closed
Mimino666 opened this issue Jan 3, 2013 · 18 comments

Mimino666 (Contributor) commented Jan 3, 2013

When a spider implements its parse() callback as a generator (i.e. it returns results with yield) and an exception is raised inside parse(), the spider middleware's process_spider_exception() is not called.

The reason is that the exception is only raised later, when the generator is consumed inside one of the spider middlewares, so it never goes through the exception-handling chain.
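
A minimal way to reproduce it (only a sketch: the middleware and spider are made up for illustration, and the middleware is assumed to be enabled via SPIDER_MIDDLEWARES):

import scrapy

class LoggingMiddleware:
    # A spider middleware that should be notified about callback exceptions.
    def process_spider_exception(self, response, exception, spider):
        spider.logger.error("process_spider_exception called: %r", exception)
        return []

class BrokenSpider(scrapy.Spider):
    name = "broken"
    start_urls = ["http://example.com"]

    def parse(self, response):
        yield {"url": response.url}
        raise ValueError("boom")

The ValueError shows up in the log, but LoggingMiddleware.process_spider_exception() never runs, because parse() is a generator and the exception only surfaces while its output is being iterated.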

dangra (Member) commented Jan 7, 2013

A bug for sure. Do you have an idea of what could be done to fix it?

Mimino666 (Contributor, Author) commented Jan 8, 2013

A question to start with: what is the desired behavior?

  1. When the exception is raised in the spider, all previously yielded results are lost and only the exception is passed through the spider middleware's process_spider_exception().
  2. Results yielded before the exception was raised are processed normally, i.e. some kind of sublist of them is created and passed through the spider middleware's process_spider_output(); the exception is then raised and passed through process_spider_exception().

On one hand, it seems more intuitive to process the results yielded before the error normally. On the other hand, invoking process_spider_output() and process_spider_exception() for the same request somehow doesn't feel right.

With either solution, it seems to me that it can't be done without first collecting all the results from the generator, in order to trigger the exception if there is one. Which kind of defeats the purpose if the reason for using a generator is that there are too many results to fit in memory.

dangra (Member) commented Jan 9, 2013

More like option 2, but without collecting all generator results into a list.

There are two kinds of spider errors: one where the callback fails before the first result is yielded (this is what Scrapy catches now), and another where it raises after the generator has already yielded some results (the one you pointed out).

For the second case, I think we can call process_spider_output until the generator fails and then call process_spider_exception, even if both end up being called for the same request-response pair.

def spidercallback(maxn, failat, failfast):
    # simulates a spider callback implemented as a generator
    if failfast:
        1 / 0  # fails before the first result is yielded

    for n in range(maxn):
        if n == failat:
            1 / 0  # fails after some results were yielded
        yield n

def errtrap(testid, exc):
    print(testid, 'ERROR', exc)

def scrapercall(testid, maxn=10, failat=2, failfast=False):
    # simulates the scraper consuming the callback output
    try:
        for n in spidercallback(maxn, failat, failfast):
            print(testid, 'OK', n)
    except Exception as exc:
        errtrap(testid, exc)

scrapercall(1, failat=2)
scrapercall(2, failfast=True)

outputs:

1 OK 0
1 OK 1
1 ERROR division by zero
2 ERROR division by zero
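
That behavior could be implemented without materializing the generator into a list: a wrapping generator forwards items as they are produced and, once the callback raises, hands the exception to the middleware. Only a sketch with invented names, not Scrapy's actual code:

def wrap_exceptions(result, response, spider, middleware):
    # Yield items from `result` until it raises, then let the middleware
    # try to recover via process_spider_exception().
    it = iter(result)
    while True:
        try:
            item = next(it)
        except StopIteration:
            return
        except Exception as exc:
            # Items yielded so far were already passed downstream.
            recovered = middleware.process_spider_exception(response, exc, spider)
            if recovered is not None:
                yield from recovered
            return
        yield item

Each middleware's process_spider_output() would then consume the wrapped iterable lazily, so nothing has to fit in memory at once.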

Blender3D commented Jul 12, 2013

Has there been any progress with this bug? It's still present in the current version of Scrapy.

dangra (Member) commented Jul 13, 2013

it's still present

dangra added the easy label Apr 22, 2014

lezorich commented Feb 17, 2015

Hi! I want to start contributing to scrapy by fixing this issue. I want to fix the bug using the solution proposed by @dangra: call process_spider_output until the generator fails and then call the spider middleware's process_spider_exception.

I have been looking at the scrapy source code for a while now and I think the fix must go somewhere in here 1, 2, but I still can't figure out exactly where it should go. I have also already looked at the Twisted Deferred documentation, so I know more or less how it works :)

From what I have understood, the spider middleware's process_spider_output and process_spider_exception are called here:

dfd = mustbe_deferred(process_spider_input, response)
dfd.addErrback(process_spider_exception)
dfd.addCallback(process_spider_output)

But the spider's parse() returns a generator, right? So process_spider_output is called, because no exception has been raised yet. Since process_spider_output is called, the scraper thinks that everything is fine, and each spider middleware's process_spider_output is invoked; that is where the exception finally surfaces.
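
A standalone sketch of that behavior in plain Twisted (no Scrapy involved; the callback names are made up):

from twisted.internet import defer

def gen_callback(response):
    # A generator callback: its body does not run until the result is iterated.
    yield "item-1"
    raise ValueError("boom")

def errback(failure):
    print("errback saw:", failure.value)

def consume(result):
    # Only here, while iterating, does the exception actually fire.
    return list(result)

d = defer.succeed("fake response")
d.addCallback(gen_callback)   # returns a generator object, nothing raised yet
d.addErrback(errback)         # skipped: there is no failure at this point
d.addCallback(consume)        # the ValueError propagates from this callback
d.addErrback(lambda f: print("later errback saw:", f.value))

The first errback never fires, just like process_spider_exception in the chain above.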

I think I have a good understanding of what's happening under the hood, but I still feel I'm missing something; that's why I wanted to ask if someone can guide me a little here.

Thanks!!

souravsingh commented Apr 5, 2015

I would like to work on this issue. How should I start with it?

jeebb commented Jul 14, 2017

To whom it may concern, I used the spider_error signal as a workaround for this issue (https://doc.scrapy.org/en/latest/topics/signals.html).
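
For example, connecting a handler from the spider itself (a sketch; the spider and method names are invented):

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = "myspider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires whenever a callback raises, even from a generator.
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        return spider

    def on_spider_error(self, failure, response, spider):
        self.logger.error("Callback failed for %s: %s", response.url, failure.value)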

yalopov commented Jul 18, 2017

I was trying to make a custom spider middleware to handle exceptions with the process_spider_exception method, but it still doesn't work.

@jeebb that saved me, thanks!

dangra (Member) commented Jul 19, 2017

#2061 from @elacuesta is in the right direction IMHO.

dalepo commented Oct 2, 2017

@jeebb can you provide an example?
I tried the following:

self.crawler.signals.send_catch_log(failure=ex, signal=signals.spider_error, response=response, spider=self)

But process_spider_exception still did not get called.

amineallani commented Dec 14, 2017

I recommend using decorators:

First, you need to define the decorator at the top of the file (before the spider class):

def handle_exceptions(function):
    def parse_wrapper(spider, response):
        try:
            for result in function(spider, response):
                yield result
        except YourException:
            # handle the exception here, e.g. start over from the first start request
            yield next(spider.start_requests())
    return parse_wrapper

To use it:

@handle_exceptions
def parse(self, response):
    ...  # the usual parsing code that may raise YourException
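
Note that because the wrapper iterates the decorated callback inside the try block, items yielded before the failure have already been passed on; this is essentially option 2 from the discussion above, just handled inside the spider instead of the middleware chain.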

ariasuni commented Jul 9, 2018

I want to be able to log the response headers, text, and meta when any exception occurs so that I can analyze it later on. It would be really useful for websites that have random bugs, or to see information about the bug without having to launch the spider again.

Do I really need to put a decorator on each and every parse function I use?

amineallani commented Jul 9, 2018

Each error-prone function where you need to catch the exception needs to be decorated.

elacuesta (Member) commented Jul 9, 2018

For the record, this decorator approach is only one of the available workarounds while this problem is still present. If you only need to log information about the exceptions, personally I think the best option would be to write an extension (https://doc.scrapy.org/en/latest/topics/extensions.html) that listens to the spider_error signal (https://doc.scrapy.org/en/latest/topics/signals.html#spider-error), as pointed out by @jeebb.
Keep in mind that the purpose of the process_spider_exception method is to be able to recover from exceptions by returning items or requests, while the signals feature only allows listeners to be notified about events.
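
Such an extension might look like this (only a sketch: the class name and log format are invented, it is assumed to be enabled through the EXTENSIONS setting, and response.text could be logged the same way):

import logging

from scrapy import signals

logger = logging.getLogger(__name__)

class SpiderErrorLogger:
    # Logs details of the failing response whenever a spider callback raises.

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        logger.error(
            "Spider %s failed on %s: %s\nheaders=%r\nmeta=%r",
            spider.name, response.url, failure.value,
            response.headers, response.meta,
        )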

ariasuni commented Jul 13, 2018

@elacuesta Thank you for your help, it’s working for me!

Lukas0907 added commits to Lukas0907/feeds that referenced this issue on Sep 27 and Sep 28, 2018
PhilipGarnero commented Dec 6, 2018

I came across the exact same issue, but I also wanted errors happening in the item pipelines to be dispatched to Sentry. I ended up using signals for both problems. Here is the code for anyone having the same issue:

# Imports assumed here (they were omitted from the original snippet):
from raven import Client  # the Sentry client from the `raven` package

from scrapy import signals

# `settings` is assumed to be the project settings object exposing SENTRY_DSN.

# Custom signal used to catch errors from pipelines
pipeline_error = object()

class SentryMiddleware(object):
    """This class is used to catch errors happening in the spiders or the pipelines"""
    def __init__(self):
        self.client = Client(dsn=settings.SENTRY_DSN)

    @classmethod
    def from_crawler(cls, crawler):
        inst = cls()
        # There is a bug that prevents errors to be received via `process_spider_exception`
        # so we use an internal signal automatically dispatched in case of errors instead
        crawler.signals.connect(inst.process_spider_exception_from_signal, signal=signals.spider_error)
        # We listen to a custom made signal we implemented in order to receive errors from pipelines
        crawler.signals.connect(inst.process_pipeline_exception_from_signal, signal=pipeline_error)
        return inst

    def process_spider_exception(self, response, exception, spider):
        self.client.extra_context({"page_url": response.url, "spider_name": spider.name})
        # sys.exc_info() is not working because the exception has been handled already
        # so we build a fake exc info (type, value, traceback)
        self.client.captureException(exc_info=(type(exception), exception, exception.__traceback__))

    def process_spider_exception_from_signal(self, response, failure, spider, **kwargs):
        exception = failure.value
        self.process_spider_exception(response, exception, spider)

    def process_pipeline_exception_from_signal(self, exception, item, pipeline, spider, **kwargs):
        self.client.extra_context({"item": item, "pipeline_name": pipeline.__class__.__name__, "spider_name": spider.name})
        # sys.exc_info() is not working because the exception has been handled already
        # so we build a fake exc info (type, value, traceback)
        self.client.captureException(exc_info=(type(exception), exception, exception.__traceback__))


class AcknowledgeErrorsPipeline(object):
    """
    Every pipeline we use should be subclassed from this one in order for sentry to receive errors.
    Note: This pipeline does not work alone, the `SentryMiddleware` needs to be enabled in order for errors to be sent
    Note: It does not work with pipelines that retrigger requests because
          the request callbacks would not be wrapped with `handle_exceptions` (but you can still wrap them yourself)
    """
    def __init_subclass__(cls, **kwargs):
        # We use this new python feature in order to avoid using a full fledged metaclass
        # for the simple purpose of wrapping a method
        cls.process_item = AcknowledgeErrorsPipeline.handle_exceptions(cls.process_item)
        super().__init_subclass__(**kwargs)

    @classmethod
    def from_crawler(cls, crawler):
        inst = cls()
        inst.signals = crawler.signals
        return inst

    @staticmethod
    def handle_exceptions(process_item):
        def wrapper(self, item, spider, **kwargs):
            try:
                return process_item(self, item, spider, **kwargs)
            except Exception as e:
                # If we catch an exception, we send it as a signal so that `SentryMiddleware` can receive it
                self.signals.send_catch_log(
                    signal=pipeline_error,
                    exception=e, item=item,
                    pipeline=self, spider=spider
                )
                # And we raise it again to prevent disturbing the normal process
                raise e
        return wrapper
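
For this to receive anything, the middleware has to be enabled. A sketch of the settings, with a hypothetical dotted path (it could equally be registered under EXTENSIONS, since it only relies on signals):

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.SentryMiddleware": 100,
}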

lucywang000 pushed a commit ("fixed import error in docs for tests") to lucywang000/scrapy that referenced this issue Feb 24, 2019