Logs seem to be coming from the wrong spider #162

@Yoyoda75

Description

Hello,

After PR #156 I noticed strange behavior in spider log records. Here is the extension I use to add the spider's name to each log record (based on this Stack Overflow discussion):

import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CustomLogExtension:
    """
    Prefixes each log record with the name of the spider that emitted it.
    """

    class ContentFilter(logging.Filter):
        """
        Creates a filter that fetches the spider's name from the log record.
        """
        def filter(self, record):
            record.spider_name = ''
            # fetch the spider's name from the record, if present
            if hasattr(record, 'spider'):
                record.spider_name = record.spider.name

            return True

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise NotConfigured otherwise
        if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
            raise NotConfigured

        # instantiate the extension object
        ext = cls()

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        """Prefixes the spider's name to every log emitted."""

        formatter = logging.Formatter('[%(spider_name)s] %(asctime)s [%(name)s] %(levelname)s: %(message)s')
        # add the new format and filter to all the handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(self.ContentFilter())
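For reference, this is roughly how the extension is wired into the project settings; the module path `myproject.extensions` is an assumption, not taken from the issue:

```python
# settings.py -- minimal sketch of enabling the extension above
# (the module path "myproject.extensions" is hypothetical).
EXTENSIONS = {
    "myproject.extensions.CustomLogExtension": 500,
}
# Read by from_crawler() via crawler.settings.getbool(); without it the
# extension raises NotConfigured and stays disabled.
CUSTOM_LOG_EXTENSION = True
```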

With regular spiders, this produces the desired effect. Here is an extract of my logs (I launch all spiders concurrently through a CrawlerProcess):

[f5] 2023-01-27 17:29:13,110 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api-u.f5.com/support/kb-articles/K43357358> (referer: https://api-u.f5.com/support/fedsearch/v2/rss?page=1&results=100&source=kbarticles&source=techcomm&documentType=Security%20Advisory&firstPublishedDateStart=2022-01-01&linkBack=https://support.f5.com/csp/new-updated-articles)
[a10] 2023-01-27 17:29:13,117 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://support.a10networks.com/support/security-advisories> (referer: None)
[huawei] 2023-01-27 17:29:13,122 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.huawei.com/en/psirt/all-bulletins?page=2> (referer: None)
[moxa_edr_810] 2023-01-27 17:29:13,129 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moxa.com/en/support/product-support/security-advisory/edr-g903-g902-810-secure-routers-vulnerability-(1)> (referer: https://www.moxa.com/en/support/product-support/security-advisory/security-advisories-all?psid=48041)
[nutanix] 2023-01-27 17:29:13,183 [scrapy.core.engine] INFO: Closing spider (finished)
[nutanix] 2023-01-27 17:29:13,188 [scrapy.core.engine] INFO: Spider closed (finished)
[wapt] 2023-01-27 17:29:13,216 [scrapy.core.engine] INFO: Closing spider (finished)
[wapt] 2023-01-27 17:29:13,224 [scrapy.core.engine] INFO: Spider closed (finished)
[roundcube] 2023-01-27 17:29:13,254 [scrapy.core.engine] INFO: Closing spider (finished)
[roundcube] 2023-01-27 17:29:13,259 [scrapy.core.engine] INFO: Spider closed (finished)

However, with scrapy-playwright spiders I get only the name of the last spider that called spider_opened. Here I launched spiders "bmc", "veeam", "hpe" and "checkpoint" in that order:

[checkpoint] 2023-01-27 17:40:23,791 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,807 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,825 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,839 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://communities.bmc.com/blogs/application-security-news> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,842 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,851 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.veeam.com/knowledge-base.html?type=security> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,858 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://support.hpe.com/connect/s/securitybulletinlibrary?language=en_US> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,865 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsecurityalerts=> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:24,145 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsecurityalerts=> (referrer: None)
[checkpoint] 2023-01-27 17:40:24,222 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.veeam.com/knowledge-base.html?type=security> (referrer: None)

After running print(record.spider.name, record.__dict__) in my filter function, I noticed that every record has "checkpoint" as its spider.name, with CheckpointSpider attached as the emitting spider:

checkpoint {'name': 'scrapy-playwright', 'msg': '[Context=%s] Response: <%i %s> (referrer: %s)', 'args': ('default', 200, 'https://support.hpe.com/connect/s/sfsites/c/resource/DCECommunityStyles/fonts/DCEIcons.ttf?7abo6h', None), 'levelname': 'DEBUG', 'levelno': 10, 'pathname': 'removed for privacy', 'filename': 'handler.py', 'module': 'handler', 'exc_info': None, 'exc_text': None, 'stack_info': None, 'lineno': 499, 'funcName': '_log_request', 'created': 1674838086.8665972, 'msecs': 866.0, 'relativeCreated': 80859.15327072144, 'thread': 140704363071680, 'threadName': 'MainThread', 'processName': 'MainProcess', 'process': 81449, 'spider': <CheckpointSpider 'checkpoint' at 0x10b5dd5d0>, 'spider_name': ''}
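The inspection can be reproduced standalone: build a LogRecord by hand, attach a stand-in spider object (the `FakeSpider` class below is purely illustrative), and run it through the same filter. The filter copies whatever spider is attached to the record, so the prefix is only as correct as the `spider` attribute itself:

```python
import logging


class ContentFilter(logging.Filter):
    """Same filter as in the extension: copy the attached spider's name."""

    def filter(self, record):
        record.spider_name = ""
        if hasattr(record, "spider"):
            record.spider_name = record.spider.name
        return True


class FakeSpider:
    """Stand-in for the spider object scrapy-playwright attaches to records."""

    name = "checkpoint"


# Hand-built record mimicking the scrapy-playwright DEBUG line above.
record = logging.LogRecord(
    name="scrapy-playwright",
    level=logging.DEBUG,
    pathname="handler.py",
    lineno=499,
    msg="[Context=%s] Response: <%i %s>",
    args=("default", 200, "https://example.com"),
    exc_info=None,
)
record.spider = FakeSpider()
ContentFilter().filter(record)
print(record.spider_name)  # prints "checkpoint"
```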

This behavior is strange because:

  1. Only the "hpe" spider should make requests to the hpe.com domain;
  2. The allowed_domains setting should prevent any other spider from requesting the hpe.com domain;
  3. The filter works on regular spiders, and I don't see why it wouldn't work on scrapy-playwright spiders.

I don't know enough about scrapy-playwright's core to tell where this issue comes from. However, as a wild guess, I set DOWNLOAD_HANDLERS to scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler in my general Scrapy settings instead of in the custom_settings dictionary of my parent scrapy-playwright spider (which the other playwright spiders inherit from), and the issue then spread to every spider, even the regular ones.
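For clarity, this is the global placement described above, as a settings fragment; the spider-scoped variant puts the same DOWNLOAD_HANDLERS dict inside the parent spider's custom_settings instead:

```python
# settings.py -- global placement (the variant that spread the issue to
# every spider, including the regular ones).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```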

I can provide more information if necessary.
Thanks

Labels: bug (Something isn't working)