Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support restricting the coverage monitor fields #416

Closed
curita opened this issue Aug 30, 2023 · 4 comments
Closed

Support restricting the coverage monitor fields #416

curita opened this issue Aug 30, 2023 · 4 comments

Comments

@curita
Copy link
Member

curita commented Aug 30, 2023

Background

Currently, the coverage monitor tracks and reports the coverage of all fields, including nested fields (i.e., keys inside dictionary values specifically). It follows all nested field levels.

This sometimes isn't desired, especially for dictionaries that have non-standardized keys. The coverage tracking can stay the same if there's a fixed schema for those dictionaries or too many nested levels. This is particularly troublesome when running spiders in ScrapyCloud, as there's a hard limit on the stats's storage size.

Alternatives

  1. New setting to skip the coverage tracking on fields defined by name or regex. It should support nested field definitions as well.
  2. Support limiting the nested levels of coverage, like SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS, but for dictionaries.
  3. Allow the possibility of storing the coverage tracking stats somewhere else. For example, each <key: field count> pair could be stored as a separate entry in a ScrapyCloud collection to avoid triggering its size limitation.

All of them could be viable and coexist as they address different parts of the problem.

Let me know what you think!

@rennerocha
Copy link
Collaborator

@curita

Skipping the coverage of certain patterns may not solve the problem, considering that for data not well structured, we may get keys unexpected from our targets. Regarding the stats limit in Scrapy Cloud, maybe changing what HubStorageStatsCollector uploads as stats is a more reliable alternative and is specific to the platform, so we don't introduce new Zyte specific code to Spidermon.

@rennerocha
Copy link
Collaborator

rennerocha commented Sep 19, 2023

The alternative of SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS looks promising in my point of view. We can set the "depth" where we want to check the coverage, so with an item like:

{
    "field_1": "some_value",
    "field_2": "some_value",
    "field_3": {
        "field_3_1": "some_value",
        "field_3_2": "some_value",
    }
}

If SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS = 1, so the stats will be like:

{
    "spidermon_field_coverage/dict/field1/": "some_value",
    "spidermon_field_coverage/dict/field2/": "other_value",
    "spidermon_field_coverage/dict/field3/": "other_value",
}

If SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS = 2, so the stats will be like:

{
    "spidermon_field_coverage/dict/field1/": "some_value",
    "spidermon_field_coverage/dict/field2/": "other_value",
    "spidermon_field_coverage/dict/field3/": "other_value",
   "spidermon_field_coverage/dict/field3/field_3_1": "other_value",
   "spidermon_field_coverage/dict/field3/field_3_2": "other_value",
}

What do you think???

@curita
Copy link
Member Author

curita commented Oct 10, 2023

I like the depth approach the most 👍 Mauricio implemented SPIDERMON_DICT_FIELDS_COVERAGE_LEVELS as a separate setting in his PR, though both SPIDERMON_LIST_FIELDS_COVERAGE_LEVELS and SPIDERMON_DICT_FIELDS_COVERAGE_LEVELS could be unified into a single setting if we consider so.

@VMRuiz
Copy link
Collaborator

VMRuiz commented May 7, 2024

Fixed by #433

@VMRuiz VMRuiz closed this as completed May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants