Logs seem to be coming from the wrong spider #162

@Yoyoda75

Description

Hello,

After PR #156 I noticed strange behavior in spider log records. Here is the extension I use to add the spider's name to each log record (based on this Stack Overflow discussion):

import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CustomLogExtension:
    """
    Prefixes each log record with the name of the spider that emitted it.
    """

    class ContentFilter(logging.Filter):
        """
        Creates a filter that fetches the spider's name from the log record.
        """
        def filter(self, record):
            record.spider_name = ''
            # fetch the spider's name from the record, if present
            if hasattr(record, 'spider'):
                record.spider_name = record.spider.name

            return True

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise NotConfigured otherwise
        if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
            raise NotConfigured

        # instantiate the extension object
        ext = cls()

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        """Prefixes the spider's name to every log emitted."""

        formatter = logging.Formatter('[%(spider_name)s] %(asctime)s [%(name)s] %(levelname)s: %(message)s')
        # add the new format and filter to all the handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(self.ContentFilter())
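For reference, this is roughly how the extension is wired into the project settings; the module path `myproject.extensions` is an assumption, not taken from the issue:

```python
# settings.py -- minimal sketch of enabling the extension above
# (the module path "myproject.extensions" is hypothetical).
EXTENSIONS = {
    "myproject.extensions.CustomLogExtension": 500,
}
# Read by from_crawler() via crawler.settings.getbool(); without it the
# extension raises NotConfigured and stays disabled.
CUSTOM_LOG_EXTENSION = True
```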

With regular spiders, this produces the desired effect. Here is an extract of my logs (I launch all spiders concurrently through a CrawlerProcess):

[f5] 2023-01-27 17:29:13,110 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api-u.f5.com/support/kb-articles/K43357358> (referer: https://api-u.f5.com/support/fedsearch/v2/rss?page=1&results=100&source=kbarticles&source=techcomm&documentType=Security%20Advisory&firstPublishedDateStart=2022-01-01&linkBack=https://support.f5.com/csp/new-updated-articles)
[a10] 2023-01-27 17:29:13,117 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://support.a10networks.com/support/security-advisories> (referer: None)
[huawei] 2023-01-27 17:29:13,122 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.huawei.com/en/psirt/all-bulletins?page=2> (referer: None)
[moxa_edr_810] 2023-01-27 17:29:13,129 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moxa.com/en/support/product-support/security-advisory/edr-g903-g902-810-secure-routers-vulnerability-(1)> (referer: https://www.moxa.com/en/support/product-support/security-advisory/security-advisories-all?psid=48041)
[nutanix] 2023-01-27 17:29:13,183 [scrapy.core.engine] INFO: Closing spider (finished)
[nutanix] 2023-01-27 17:29:13,188 [scrapy.core.engine] INFO: Spider closed (finished)
[wapt] 2023-01-27 17:29:13,216 [scrapy.core.engine] INFO: Closing spider (finished)
[wapt] 2023-01-27 17:29:13,224 [scrapy.core.engine] INFO: Spider closed (finished)
[roundcube] 2023-01-27 17:29:13,254 [scrapy.core.engine] INFO: Closing spider (finished)
[roundcube] 2023-01-27 17:29:13,259 [scrapy.core.engine] INFO: Spider closed (finished)

However, with scrapy-playwright spiders I get only the name of the last spider that called spider_opened. Here I launched spiders "bmc", "veeam", "hpe" and "checkpoint" in that order:

[checkpoint] 2023-01-27 17:40:23,791 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,807 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,825 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,839 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://communities.bmc.com/blogs/application-security-news> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,842 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,851 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.veeam.com/knowledge-base.html?type=security> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,858 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://support.hpe.com/connect/s/securitybulletinlibrary?language=en_US> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,865 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsecurityalerts=> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:24,145 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsecurityalerts=> (referrer: None)
[checkpoint] 2023-01-27 17:40:24,222 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.veeam.com/knowledge-base.html?type=security> (referrer: None)

After running print(record.spider.name, record.__dict__) in my filter function, I noticed that every record has "checkpoint" as its spider.name, with CheckpointSpider attached as the emitting spider:

checkpoint {'name': 'scrapy-playwright', 'msg': '[Context=%s] Response: <%i %s> (referrer: %s)', 'args': ('default', 200, 'https://support.hpe.com/connect/s/sfsites/c/resource/DCECommunityStyles/fonts/DCEIcons.ttf?7abo6h', None), 'levelname': 'DEBUG', 'levelno': 10, 'pathname': 'removed for privacy', 'filename': 'handler.py', 'module': 'handler', 'exc_info': None, 'exc_text': None, 'stack_info': None, 'lineno': 499, 'funcName': '_log_request', 'created': 1674838086.8665972, 'msecs': 866.0, 'relativeCreated': 80859.15327072144, 'thread': 140704363071680, 'threadName': 'MainThread', 'processName': 'MainProcess', 'process': 81449, 'spider': <CheckpointSpider 'checkpoint' at 0x10b5dd5d0>, 'spider_name': ''}
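The inspection can be reproduced standalone: build a LogRecord by hand, attach a stand-in spider object (the `FakeSpider` class below is purely illustrative), and run it through the same filter. The filter copies whatever spider is attached to the record, so the prefix is only as correct as the `spider` attribute itself:

```python
import logging


class ContentFilter(logging.Filter):
    """Same filter as in the extension: copy the attached spider's name."""

    def filter(self, record):
        record.spider_name = ""
        if hasattr(record, "spider"):
            record.spider_name = record.spider.name
        return True


class FakeSpider:
    """Stand-in for the spider object scrapy-playwright attaches to records."""

    name = "checkpoint"


# Hand-built record mimicking the scrapy-playwright DEBUG line above.
record = logging.LogRecord(
    name="scrapy-playwright",
    level=logging.DEBUG,
    pathname="handler.py",
    lineno=499,
    msg="[Context=%s] Response: <%i %s>",
    args=("default", 200, "https://example.com"),
    exc_info=None,
)
record.spider = FakeSpider()
ContentFilter().filter(record)
print(record.spider_name)  # prints "checkpoint"
```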

This behavior is strange because:

  1. Only the "hpe" spider should make requests to the hpe.com domain;
  2. The allowed_domains setting should prevent any other spider from requesting the hpe.com domain;
  3. The filter works on regular spiders, and I don't see why it wouldn't work on scrapy-playwright spiders.

I don't know enough about scrapy-playwright's core to tell where this issue comes from. However, as a wild guess, I set DOWNLOAD_HANDLERS to scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler in my general Scrapy settings instead of in the custom_settings dictionary of my parent scrapy-playwright spider (which the other playwright spiders inherit from), and the issue then spread to every spider, even the regular ones.
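For clarity, this is the global placement described above, as a settings fragment; the spider-scoped variant puts the same DOWNLOAD_HANDLERS dict inside the parent spider's custom_settings instead:

```python
# settings.py -- global placement (the variant that spread the issue to
# every spider, including the regular ones).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```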

I can provide more information if necessary.
Thanks

Labels: bug (Something isn't working)