Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to make validation errors ERRORs in job #50

Closed
andrewbaxter opened this issue Oct 25, 2018 · 16 comments
Closed

Option to make validation errors ERRORs in job #50

andrewbaxter opened this issue Oct 25, 2018 · 16 comments

Comments

@andrewbaxter
Copy link

At the moment validation errors can be added to the items in the _validation key, but there doesn't seem to be any way to make it more obvious that some errors are incorrect.

@rennerocha
Copy link
Collaborator

What do you mean by more obvious? Do you have some scenario and an idea on how to improve this?

@andrewbaxter
Copy link
Author

Thanks, forgot about these tickets! Basically right now even if there are dozens of validation errors, the job will finish with "0 errors" in the scrapy cloud page. In practice this makes it easy to think a job was okay when it actually had lots of data problems.

@raphapassini
Copy link
Contributor

@andrewbaxter can you test this again with the branch master?
I'm pretty sure that the errors are being reported as log.ERRORS and thus going to show up in Scrapy Cloud.

@Gallaecio
Copy link
Member

@andrewbaxter @raphapassini @rennerocha I also believe that is the case with the latest version of spidermon. Does that mean #50 here and #47 can be closed?

@andrewbaxter
Copy link
Author

Yep, this looks good to me! Thanks!

@andrewbaxter
Copy link
Author

I guess I'll go ahead and close it.

@andrewbaxter
Copy link
Author

Sorry @Gallaecio :( I was looking at the wrong logs...

Here's what my logs look like when I get a validation error currently:

2019-04-05 20:25:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://scrapinghub.com/>
{'v': 'hi', '_validation': defaultdict(<class 'list'>, {'v': ['Invalid int']})}
2019-04-05 20:25:45 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 20:25:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 876,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 18973,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 5, 11, 25, 45, 200465),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 9,
 'memusage/max': 64413696,
 'memusage/startup': 64413696,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spidermon/validation/fields': 1,
 'spidermon/validation/fields/errors': 1,
 'spidermon/validation/fields/errors/invalid_int': 1,
 'spidermon/validation/fields/errors/invalid_int/v': 1,
 'spidermon/validation/items': 1,
 'spidermon/validation/items/errors': 1,
 'spidermon/validation/validators': 1,
 'spidermon/validation/validators/item/jsonschema': True,
 'start_time': datetime.datetime(2019, 4, 5, 11, 25, 42, 717363)}
2019-04-05 20:25:45 [scrapy.core.engine] INFO: Spider closed (finished)

My settings are:

SPIDERMON_ENABLED = True
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}
SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True
SPIDERMON_VALIDATION_SCHEMAS = [
    'xyz/schema.json',
]

I also tried without ADD_ERRORS_TO_ITEMS, but in that case I get neither an ERROR nor a warning... in fact, it's not clear what the validation does here other than increase a job stat.

@andrewbaxter andrewbaxter reopened this Apr 5, 2019
@rosheen33
Copy link

@andrewbaxter What kind of error you want here ?
'spidermon/validation/items/errors': 1,
This field already describes how many items have validation errors

@rosheen33
Copy link

rosheen33 commented Jun 19, 2019

@andrewbaxter If you want to get a warning on items, you can add the following setting
SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS = True
https://spidermon.readthedocs.io/en/latest/item-validation.html#spidermon-validation-drop-items-with-errors

@andrewbaxter
Copy link
Author

I think this was more about how the validation errors are logged/interact with the logging system.

Ideally, I'd like to see

'log_count/ERROR': 1,

and also have the errors appear in the logs with log level ERROR.

That way, say, in Scrapy Cloud, if you look at a job in the dashboard it will show "1 error" with a link, and when you click on it it will take you to the logs page with the ERROR filter already applied.

Right now even if there's lots of validation issues the job appears as finished with 0 errors.

@rosheen33
Copy link

@andrewbaxter Maybe we can convert the Warning into an error on the above setting ?
@rennerocha @raphapassini @Gallaecio What are your thoughts on this one ?

@rennerocha
Copy link
Collaborator

In my opinion, log/ERROR stat is not the right place to include the validation failures. It should be reserved to errors like code exceptions that affects the spider execution (not the data it was returned). I can see scenarios where is acceptable to have some number of validation failures and including them as errors may confuse the user.

When a monitor fails, it includes an error line in the logs, so I'd create a monitor that fails if spidermon/validation/items/errors is not equal to zero. When it fails, you'll see one error in Scrapy Cloud, and then you can check it.

Enabling SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS can help you to identify the items with problems (just look for the ones that have a _validation field filled).

@rosheen33
Copy link

rosheen33 commented Jun 19, 2019

@rennerocha I agree
That is exactly what we do in the maintenance projects [Attach a monitor to see how many items are not conforming to the validators]
It is better to have a single error rather then a huge number of errors.

So should we close this issue ?

@andrewbaxter
Copy link
Author

I think the severity of incorrect items depends on the project, and in projects where data quality needs to be high it's as significant an issue as any other error.

How about making whether it's an error or warning an option?

@rosheen33
Copy link

@rennerocha @raphapassini Do we need a change here ?

@rennerocha
Copy link
Collaborator

Validation failures should not be logged as errors and added to log_count/ERROR stat. As mentioned before we have spidermon/validation/items/errors stat and SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS setting to add the validation error information into the items scraped so we can use monitors to raise errors when some threshold is not satisfied.

If we want a monitor to check the stats and then raise an error in the logs if it fails, we have the built-in monitor ItemValidationMonitor that can be easily included into the project. For more sophisticated validations, a custom monitor should be created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants