Hello,
After PR #156 I noticed strange behavior in spider log records. Here is the extension I use to add the spider's name to each log record (based on this Stack Overflow discussion):
```python
import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CustomLogExtension:
    """Prefixes the spider's name to every log record emitted."""

    class ContentFilter(logging.Filter):
        """Filter that fetches the spider's name from the log record."""

        def filter(self, record):
            record.spider_name = ''
            # records logged with extra={'spider': ...} carry the spider object
            if hasattr(record, 'spider'):
                record.spider_name = record.spider.name
            return True

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise NotConfigured otherwise
        if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
            raise NotConfigured
        # instantiate the extension object
        ext = cls()
        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        # return the extension object
        return ext

    def spider_opened(self, spider):
        """Prefixes the spider's name to every log emitted."""
        formatter = logging.Formatter(
            '[%(spider_name)s] %(asctime)s [%(name)s] %(levelname)s: %(message)s'
        )
        # add the new format and filter to all the root handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(self.ContentFilter())
```

On regular spiders, this produces the desired effect. Here is an extract of my logs (I'm launching all spiders asynchronously through a `CrawlerProcess`; a sketch of the setup follows the extract):
```
[f5] 2023-01-27 17:29:13,110 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api-u.f5.com/support/kb-articles/K43357358> (referer: https://api-u.f5.com/support/fedsearch/v2/rss?page=1&results=100&source=kbarticles&source=techcomm&documentType=Security%20Advisory&firstPublishedDateStart=2022-01-01&linkBack=https://support.f5.com/csp/new-updated-articles)
[a10] 2023-01-27 17:29:13,117 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://support.a10networks.com/support/security-advisories> (referer: None)
[huawei] 2023-01-27 17:29:13,122 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.huawei.com/en/psirt/all-bulletins?page=2> (referer: None)
[moxa_edr_810] 2023-01-27 17:29:13,129 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moxa.com/en/support/product-support/security-advisory/edr-g903-g902-810-secure-routers-vulnerability-(1)> (referer: https://www.moxa.com/en/support/product-support/security-advisory/security-advisories-all?psid=48041)
[nutanix] 2023-01-27 17:29:13,183 [scrapy.core.engine] INFO: Closing spider (finished)
[nutanix] 2023-01-27 17:29:13,188 [scrapy.core.engine] INFO: Spider closed (finished)
[wapt] 2023-01-27 17:29:13,216 [scrapy.core.engine] INFO: Closing spider (finished)
[wapt] 2023-01-27 17:29:13,224 [scrapy.core.engine] INFO: Spider closed (finished)
[roundcube] 2023-01-27 17:29:13,254 [scrapy.core.engine] INFO: Closing spider (finished)
[roundcube] 2023-01-27 17:29:13,259 [scrapy.core.engine] INFO: Spider closed (finished)
```
However, with scrapy-playwright spiders I get only the name of the last spider that called `spider_opened`. Here I launched the spiders "bmc", "veeam", "hpe" and "checkpoint", in that order:

```
[checkpoint] 2023-01-27 17:40:23,791 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,807 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,825 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,839 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://communities.bmc.com/blogs/application-security-news> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,842 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
[checkpoint] 2023-01-27 17:40:23,851 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.veeam.com/knowledge-base.html?type=security> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,858 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://support.hpe.com/connect/s/securitybulletinlibrary?language=en_US> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:23,865 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsecurityalerts=> (resource type: document, referrer: None)
[checkpoint] 2023-01-27 17:40:24,145 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsecurityalerts=> (referrer: None)
[checkpoint] 2023-01-27 17:40:24,222 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://www.veeam.com/knowledge-base.html?type=security> (referrer: None)
```
After running `print(record.spider.name, record.__dict__)` in my filter function, I noticed that every log record has "checkpoint" as its `spider.name` and a `CheckpointSpider` instance attached as its spider:

```
checkpoint {'name': 'scrapy-playwright', 'msg': '[Context=%s] Response: <%i %s> (referrer: %s)', 'args': ('default', 200, 'https://support.hpe.com/connect/s/sfsites/c/resource/DCECommunityStyles/fonts/DCEIcons.ttf?7abo6h', None), 'levelname': 'DEBUG', 'levelno': 10, 'pathname': 'removed for privacy', 'filename': 'handler.py', 'module': 'handler', 'exc_info': None, 'exc_text': None, 'stack_info': None, 'lineno': 499, 'funcName': '_log_request', 'created': 1674838086.8665972, 'msecs': 866.0, 'relativeCreated': 80859.15327072144, 'thread': 140704363071680, 'threadName': 'MainThread', 'processName': 'MainProcess', 'process': 81449, 'spider': <CheckpointSpider 'checkpoint' at 0x10b5dd5d0>, 'spider_name': ''}
```
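The instrumentation is just a `print` added to the filter; a minimal sketch:

```python
class ContentFilter(logging.Filter):
    def filter(self, record):
        record.spider_name = ''
        if hasattr(record, 'spider'):
            # debug: inspect which spider object each record actually carries
            print(record.spider.name, record.__dict__)
            record.spider_name = record.spider.name
        return True
```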
This is strange behavior, as:

- The "hpe" spider should only make requests related to the hpe.com domain;
- The `allowed_domains` attribute should prevent any other spider from making a request to the hpe.com domain (see the sketch after this list);
- The filter works on regular spiders, and I don't see why it wouldn't work on scrapy-playwright spiders.
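For reference, each spider declares its domains roughly like this; the class below is an illustrative sketch, not my exact spider, and relies on Scrapy's default OffsiteMiddleware to drop off-domain requests:

```python
import scrapy


class HpeSpider(scrapy.Spider):
    name = "hpe"
    # with the default OffsiteMiddleware enabled, requests to domains
    # outside this list are filtered out
    allowed_domains = ["hpe.com"]
    start_urls = ["https://support.hpe.com/connect/s/securitybulletinlibrary?language=en_US"]
```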
I don't know enough about scrapy-playwright's internals to tell where this issue could be coming from. However, as a wild guess, I set `DOWNLOAD_HANDLERS` to `scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler` in my general Scrapy settings instead of in the `custom_settings` dictionary of my parent scrapy-playwright spider (which the other playwright spiders inherit from), and the issue replicated to every spider, even the regular ones. Both configurations are sketched below.
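This is roughly what the two variants look like (the parent class name is illustrative):

```python
# variant 1: project-wide, in settings.py -- this reproduces the issue on every spider
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

```python
# variant 2: only in the parent playwright spider, via custom_settings
import scrapy


class PlaywrightBaseSpider(scrapy.Spider):
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }
```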
I can provide more information if necessary.
Thanks