Option to stop spider on errors #47

Closed

andrewbaxter opened this issue Oct 5, 2018 · 16 comments

@andrewbaxter

andrewbaxter commented Oct 5, 2018

I think it would be nice to have validation errors treated like other spider errors: increment the error counter, and optionally stop the spider with an error condition if it goes over an error count.

I'd primarily like to stop the spider so I made that the title, but having both would be nice.

@rennerocha
Collaborator

You can define a suite that will be executed periodically using the SPIDERMON_PERIODIC_MONITORS setting and include a monitor that stops the spider if an error threshold is reached.

The number of validation errors is included in the spidermon/validation/fields/errors stat (and all validation-related stats are included under spidermon/validation/*), so I believe that is enough to track validation errors without mixing them with the spider execution errors.

Does it solve your problem? If not, we can try to figure out a better way to handle these two scenarios.
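
Roughly, such a suite could look like this (the monitor/suite names, the stat key, and the threshold of 10 are illustrative, and exact Spidermon attribute names may differ slightly between versions):

# monitors.py -- illustrative sketch of a periodic monitor that stops the crawl
from spidermon import Monitor, MonitorSuite, monitors


@monitors.name("Validation error count")
class ValidationErrorCountMonitor(Monitor):
    def test_validation_error_count(self):
        # Stat incremented by Spidermon's item validation pipeline
        errors = self.data.stats.get("spidermon/validation/fields/errors", 0)
        self.assertLess(errors, 10, msg="Too many validation errors")


class PeriodicValidationSuite(MonitorSuite):
    monitors = [ValidationErrorCountMonitor]

    def on_monitors_failed(self, result):
        # Stop the crawl when the check fails (assumes the suite was built
        # with the running crawler, as the Spidermon Scrapy extension does)
        self.crawler.stop()

# settings.py
SPIDERMON_ENABLED = True
SPIDERMON_PERIODIC_MONITORS = {
    "myproject.monitors.PeriodicValidationSuite": 60,  # run every 60 seconds
}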

@andrewbaxter
Author

andrewbaxter commented Dec 5, 2018

Sorry, this seems like it overlaps a bit with #50 and I probably forgot about this ticket when making the other.

That would probably work, but I think it would be nice to use existing channels:

  1. Because it doesn't require extra programming/testing, just a setting
  2. Because it's easier to understand the composition of existing tools (or rather, it's not clear why it wouldn't work this way)
  3. The response to errors would be faster

@Gallaecio
Member

Closing as per #50 (comment)

@andrewbaxter
Author

Sorry @Gallaecio 😓 I rushed and accidentally closed #50 - alright to reopen this?

@Gallaecio reopened this Apr 5, 2019
@raphapassini
Contributor

I think this is a very good candidate for the next release.
We can add a new built-in monitor to close the spider when a certain number of errors happens.

I would like to suggest a new setting:
SPIDERMON_CLOSESPIDER_BY_STATS, where each key is a stats key and the value is the maximum number of errors allowed for that stat, for example:

SPIDERMON_CLOSESPIDER_BY_STATS = {
    "spidermon/validation/": 10,
}

We can base our close_spider method on the one from the built-in memory usage extension:
https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/memusage.py#L77
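
For illustration, a rough sketch of such an extension (the setting and class names just follow the proposal above; none of this exists yet):

# extensions.py -- sketch of the proposed SPIDERMON_CLOSESPIDER_BY_STATS setting,
# modeled on Scrapy's built-in CloseSpider / MemoryUsage extensions
from scrapy import signals
from scrapy.exceptions import NotConfigured


class CloseSpiderByStats:
    def __init__(self, crawler):
        self.crawler = crawler
        self.thresholds = crawler.settings.getdict("SPIDERMON_CLOSESPIDER_BY_STATS")
        if not self.thresholds:
            raise NotConfigured
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def item_scraped(self, item, spider):
        stats = self.crawler.stats.get_stats()
        for prefix, limit in self.thresholds.items():
            # Sum every numeric stat under the configured prefix,
            # e.g. all "spidermon/validation/" counters
            total = sum(
                value for key, value in stats.items()
                if key.startswith(prefix) and isinstance(value, int)
            )
            if total >= limit:
                self.crawler.engine.close_spider(spider, "closespider_by_stats")

It would then be enabled through the EXTENSIONS setting like any other Scrapy extension.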

@rennerocha
Collaborator

Maybe expressions monitors can be used to achieve this, and we can use the pattern described in the periodic monitors documentation. I am not yet quite sure that a built-in monitor is the best solution, as we may want to close the spider based on different assertions.

@rosheen33

@rennerocha @raphapassini What about SPIDERMON_CLOSESPIDER_BY_EXPRESSIONS_MONITOR, in which we check periodically for errors, like we do in periodic monitors, and end the spider if a certain limit is reached, while allowing all the expressions that we allow in the Expressions Monitor?

@rosheen33

rosheen33 commented Jun 20, 2019

@rennerocha @raphapassini

Maybe something like this, roughly?

class SpiderPeriodicMonitorSuite(MonitorSuite):
    monitors_m = {
        "tests": [
            {"expression": "stats['item_scraped_count'] < 30"},
        ]
    }
    monitors = [create_monitor_class_from_dict(monitors_m, monitor_class=ExpressionsMonitor)]

    def on_monitors_failed(self, result):
        self._crawler.stop()


SPIDERMON_PERIODIC_MONITORS = {
    'myproject.monitors.SpiderPeriodicMonitorSuite': 10,
}

@rosheen33 self-assigned this Jun 20, 2019
@Gallaecio
Member

If the same can be achieved with periodic monitors, as @rennerocha suggests, I agree with him that it may be better to use them instead. They give great flexibility, allowing complex expression evaluations and also making it possible to react in different ways (not just closing the spider).

We could extend the periodic monitors documentation page, instead. Maybe cover the specific case of stopping a spider on errors.

@andrewbaxter
Author

andrewbaxter commented Jun 24, 2019

I'm not against monitors per se but it's a significant extra burden on getting set up. I think the discussion from #50 is getting dragged in here since the proposed solutions are similar, so continuing down that path:

Suppose you start with a spider and you have a rough schema - to get from that to errors in the log you need to:

  1. Add spidermon dep, jsonschema dep
  2. Write a 15 line monitor suite, maybe a monitor for the suite, put it in a new module, add it to the project
  3. Add 4-5 settings

Figuring this out, if you're not well versed in Spidermon, requires looking at multiple documentation pages and copying bits and pieces of several examples. Getting any of the settings wrong (or missing a setting) can lead to Spidermon doing nothing.

There are a lot of projects that don't need all of Spidermon's features right away, just validation, or just one small check or whatever, and all this setup is a fairly high hurdle.

Compare that to writing a new pipeline to do validation with jsonschema directly and log the errors: one new dependency (jsonschema), one setting, 15 lines of code in an existing file (pipelines.py), and it's easier to verify operation because there are no layers of indirection between Scrapy and your own code in the pipeline.
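
For concreteness, a rough sketch of that pipeline-only approach (the schema and the stat key are placeholders):

# pipelines.py -- sketch of the plain jsonschema validation pipeline described above
import logging

from jsonschema import Draft7Validator

logger = logging.getLogger(__name__)

ITEM_SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}


class JsonSchemaValidationPipeline:
    def __init__(self):
        self.validator = Draft7Validator(ITEM_SCHEMA)

    def process_item(self, item, spider):
        for error in self.validator.iter_errors(dict(item)):
            # Log each schema violation and count it in the crawl stats
            logger.error("Validation error in item: %s", error.message)
            spider.crawler.stats.inc_value("validation/errors")
        return item

Enabled with a single ITEM_PIPELINES entry.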

Making a new feature for every use case obviously isn't a good solution, but putting complete examples in the documentation for every use case doesn't seem great either and would lead to lots of difficult-to-update duplicate code there.

Perhaps there are some basic use cases that could be made into 1-setting features, although that would make gradually increasing usage difficult. Or we could integrate more with Scrapy's error handling mechanisms?

@rosheen33

@Gallaecio @rennerocha @raphapassini
Please have a look at @andrewbaxter's comment above.

@Gallaecio
Member

Well, I guess this may be a common-enough scenario to have its own option. Worth a pull request, at least; we can discuss further over an implementation proposal.

@rosheen33

rosheen33 commented Sep 23, 2019

Proposed solution (PR): #216

I have added a new setting in Spidermon, as proposed by @raphapassini above:

SPIDERMON_CLOSESPIDER_BY_STATS = {
    "spidermon/validation/": 10,
}

Note: documentation has not been added yet.

I have implemented the solution
Please have a look
cc. @Gallaecio @rennerocha @andrewbaxter

@rennerocha
Collaborator

rennerocha commented Sep 24, 2019

As @andrewbaxter said, it is not a good solution to create new features for every use case, and we need to consider that a Spidermon user is a developer who is able to create her/his own monitors with custom (and more complex) checks that match their scenarios. So we need to keep new settings and automatic validation as simple as possible.

I didn't like the SPIDERMON_CLOSESPIDER_BY_STATS format as suggested by @raphapassini. A stat is not necessarily an integer, so we could create some confusion if we allow the user to include stats other than the validation stat. We should also avoid handling more specific use cases other than verifying the more generic spidermon/validation/items/errors stat.

I believe that a more specific setting such as SPIDERMON_CLOSESPIDER_VALIDATIONERRORCOUNT, following the same idea as the CloseSpider extension, would be a better way to achieve this.

So we add SPIDERMON_CLOSESPIDER_VALIDATIONERRORCOUNT = 10 to our settings and the spider will be closed when it reaches 10 validation errors. Anything more complex than that (checking specific fields, aggregating stats, etc.) needs a custom monitor.

Also, instead of adding this logic inside the pipeline, as @rosheen33 did, maybe it is a better idea to follow the same pattern as the CloseSpider extension (L39), which verifies the number of errors only when an item_scraped signal is sent (making it possible for users to create their own monitors that also change the validation stat).
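
A rough sketch of that signal-based check (again, the setting name is just the proposal above, not something Spidermon ships today):

# Sketch of the proposed check, connected to item_scraped like CloseSpider does
from scrapy import signals
from scrapy.exceptions import NotConfigured


class CloseSpiderOnValidationErrors:
    def __init__(self, crawler):
        self.crawler = crawler
        self.max_errors = crawler.settings.getint(
            "SPIDERMON_CLOSESPIDER_VALIDATIONERRORCOUNT"
        )
        if not self.max_errors:
            raise NotConfigured
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def item_scraped(self, item, spider):
        # The stat is incremented elsewhere (e.g. by the validation pipeline
        # or a custom monitor); here we only check the threshold
        errors = self.crawler.stats.get_value("spidermon/validation/items/errors", 0)
        if errors >= self.max_errors:
            self.crawler.engine.close_spider(spider, "closespider_validationerrorcount")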

What do you all think? :-)

@ejulio
Contributor

ejulio commented Oct 7, 2019

Just throwing my 2 cents here.
I agree that it must be something simple and specific, not completely generic (at first).
So a concrete setting should be the best place to start.
The only thing I don't like about SPIDERMON_CLOSESPIDER_VALIDATIONERRORCOUNT is that VALIDATIONERRORCOUNT is not _ separated.
This is something that always gets me confused with Scrapy settings, and I need to go to the documentation to get it right.
But if it is a pattern, we should follow it.
Also, maybe, this kind of feature would be better suited to Scrapy than to Spidermon.
As we already have CLOSESPIDER_ITEMCOUNT, it would be a matter of having CLOSESPIDER_ERRORCOUNT, since it is not a direct feature of monitoring per se.
Finally, if we follow the approach of having it in Spidermon, I'd add to @rennerocha's comment above that we should connect to signals other than item_scraped, as the callback may throw an error, so we should run on that signal as well as others.

@VMRuiz
Collaborator

VMRuiz commented May 7, 2024

CLOSESPIDER_ERRORCOUNT is already implemented in Scrapy, so we can close this.
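
For reference, enabling it is a single Scrapy setting (the threshold value here is just an example):

# settings.py
CLOSESPIDER_ERRORCOUNT = 10  # close the spider once this many errors occur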

@VMRuiz closed this as completed May 7, 2024