-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ItemLoader: support non-TextResponse #5145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@Gallaecio is this a good issue for my first time? |
@markhv-code Possibly, yes. |
@Gallaecio I would like to take this issue please! |
Thar’s great! No need to ask for permission 🙂 |
@Gallaecio could I be assigned this issue? |
We only assign issues very rarely. In general we are not against multiple people working on the same issue, as long as they are aware that there’s someone else working on it, and your comments here should be enough for people to figure that out. |
@Gallaecio I was just checking this, I didn't properly understand the "binary response part"...But to handle the support issue def __init__(self, item=None, selector=None, response=None, parent=None, **context):
if not isinstance(response,TextResponse): #My change
response = None
if selector is None and response is not None:
selector = self.default_selector_class(response)
context.update(response=response) |
@Gallaecio I am not able to reproduce the issue. Spider codeimport scrapy
from scrapy.crawler import CrawlerProcess
from itemloaders import ItemLoader
class ItemLoaderTestSpider(scrapy.Spider):
name = "ItemLoader"
custom_settings = {"DOWNLOAD_DELAY":1}
def start_requests(self):
yield scrapy.Request(url='https://github.com/favicon.ico', callback=self.parse)
yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', callback=self.parse)
yield scrapy.Request(url='https://raw.githubusercontent.com/scrapy/scrapy/master/.gitattributes', callback=self.parse)
def parse(self, response):
mil = ItemLoader(response=response)
mil.add_value('response_url', response.url)
mil.add_value('response_type', str(type(response)))
mil.add_value('data', response.body)
yield mil.load_item()
process = CrawlerProcess()
process.crawl(ItemLoaderTestSpider)
process.start() Log output
|
@GeorgeA92 You are testing the upstream >>> from scrapy.http.response import Response
>>> from scrapy.loader import ItemLoader
>>> ItemLoader(response=Response(url='https://example.org'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/adrian/proxectos/scrapy/scrapy/loader/__init__.py", line 86, in __init__
selector = self.default_selector_class(response)
File "/home/adrian/proxectos/scrapy/scrapy/selector/unified.py", line 78, in __init__
text = response.text
File "/home/adrian/proxectos/scrapy/scrapy/http/response/__init__.py", line 129, in text
raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text @calicomills That avoids the exception, but prevents the |
@Gallaecio I reproduced the issue. On scrapy/scrapy/http/response/text.py Lines 131 to 136 in e63188c
It means that instantiation of response.selector object will be on response.css or response.xpath (not in __init__ ).
In case if |
scrapy/scrapy/http/response/init.py
The error is caused when we try to access |
Just from looking at the traceback, my first approach would be to see if it makes sense to catch |
Hi, I'm a master's student in computer sience and engineering at Linköping University. Currently I'm looking into getting started with open source development. Is Scrapy a good first project to start in? I have knowledge Python. |
Scrapy is an active project, and if you start a pull request to make a change we will provide you feedback. But the number of “good first issues” in Scrapy that are free for the taking is close to 0, and Scrapy is a complex Python project, which heavily relies on Twisted. So getting started may not be trivial, depending on which issue you want to address. But if you have the time and the motivation, we are here to help you through it. Feel free to create a pull request in whatever state, and ask questions there. |
…n-ItemLoader Fix bug #5145 - Removing Selector for Response's that are not Http or Xml
At the moment,
ItemLoader(response=response)
fails ifresponse
is not aTextResponse
instance.Passing a binary response can still be useful, though. For example, to allow processors to access the response from their loader context, and hence be able to report the source URL (
response.url
) when reporting input issues.The text was updated successfully, but these errors were encountered: