Exception with non-UTF-8 Content-Type #5914
Comments
I can't confirm this. E.g. this is from curl:
I see the same in a browser.
I wrote about the headers of the file itself:
If I use the requests library in Python, the response from the server arrives without any errors.
I don't think these matter to any of the libraries you mentioned. Even your
Sure, it doesn't try to decode the response, unlike Scrapy. The actual problem is that the
I use this code to download documents from this site:
Description
I am trying to download a document from the link http://pravo.gov.ru/proxy/ips/?savertf=&link_id=0&nd=128284801&intelsearch=&firstDoc=1&page=all
Everything works fine in the browser, but when I try to automate this process through Scrapy, everything breaks down.
Steps to Reproduce
1. Create a new spider
scrapy genspider test pravo.gov.ru
2. Paste code
OR
run
scrapy fetch "http://pravo.gov.ru/proxy/ips/?savertf=&link_id=0&nd=128284801&intelsearch=&firstDoc=1&page=all"
Expected behavior: The document on the link is downloaded
Actual behavior:
2023-04-28 00:07:35 [scrapy.core.scraper] ERROR: Error downloading <GET http://pravo.gov.ru/proxy/ips/?savertf=&link_id=0&nd=128284801&intelsearch=&firstDoc=1&page=all>
Traceback (most recent call last):
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\twisted\internet\defer.py", line 1693, in _inlineCallbacks
result = context.run(
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\twisted\python\failure.py", line 518, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\core\downloader\middleware.py", line 52, in process_request
return (yield download_func(request=request, spider=spider))
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\core\downloader\handlers\http11.py", line 501, in _cb_bodydone
respcls = responsetypes.from_args(headers=headers, url=url, body=result["body"])
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\responsetypes.py", line 113, in from_args
cls = self.from_headers(headers)
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\responsetypes.py", line 75, in from_headers
cls = self.from_content_type(
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\responsetypes.py", line 55, in from_content_type
mimetype = to_unicode(content_type).split(";")[0].strip().lower()
File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\utils\python.py", line 97, in to_unicode
return text.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 33: invalid continuation byte
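The failure in the last frames of the traceback can be sketched without Scrapy or a network connection. The header bytes below are hypothetical (the raw Content-Type value sent by the server is not shown in this report), and `mimetype_strict` and `mimetype_tolerant` are illustrative names, not Scrapy functions. The strict variant mirrors the decode-then-split line from `responsetypes.py`; the tolerant variant is only one possible way to avoid the crash, not necessarily how Scrapy should or does handle it.

```python
# Hypothetical Content-Type value containing cp1251-encoded Cyrillic bytes.
# The real header bytes from the server are an unknown here.
raw_content_type = (
    b'text/html; charset="windows-1251"; name="\xcf\xf0\xe8\xec\xe5\xf0.rtf"'
)


def mimetype_strict(value: bytes) -> str:
    # Same shape as the failing line in the traceback:
    # decode the whole header as UTF-8, then keep the part before ";".
    return value.decode("utf-8").split(";")[0].strip().lower()


def mimetype_tolerant(value: bytes) -> str:
    # latin-1 maps every byte to a code point, so decoding never raises.
    # The MIME type itself is ASCII, so it survives unchanged.
    return value.decode("latin-1").split(";")[0].strip().lower()


try:
    mimetype_strict(raw_content_type)
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc)

print(mimetype_tolerant(raw_content_type))
```

The byte 0xcf is a valid UTF-8 lead byte, but the byte after it is not a continuation byte, which is why the strict path raises "invalid continuation byte" while the tolerant path still yields `text/html`.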
Reproduces how often: 100% of runs
Versions
Scrapy : 2.8.0
lxml : 4.9.2.0
libxml2 : 2.9.12
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.1
Twisted : 22.10.0
Python : 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
pyOpenSSL : 23.1.1 (OpenSSL 3.1.0 14 Mar 2023)
cryptography : 40.0.2
Platform : Windows-10-10.0.19044-SP0
Additional context
These documents are encoded in cp1251, which is clearly indicated in their headers:
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="windows-1251"
The same behavior occurs when trying to save a file using FilesPipeline.
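For reference, the two headers quoted above describe content that is quoted-printable on the wire and cp1251 once unwrapped. A minimal sketch of decoding such content with the standard library follows; the sample bytes are made up for illustration and are not taken from the actual document.

```python
import quopri

# Hypothetical quoted-printable sample; each =XX is one cp1251 byte.
encoded = b"=CF=F0=E8=E2=E5=F2, Scrapy"

# Step 1: undo the quoted-printable transfer encoding to get raw bytes.
raw = quopri.decodestring(encoded)

# Step 2: interpret the raw bytes with the charset named in Content-Type.
text = raw.decode("windows-1251")

print(text)
```

Decoding the same raw bytes as UTF-8 would fail, since cp1251 Cyrillic bytes are generally not valid UTF-8 sequences.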