
Exception with non-UTF-8 Content-Type #5914

Closed
Keeyahto opened this issue Apr 27, 2023 · 4 comments · Fixed by #5917

Comments

Keeyahto commented Apr 27, 2023

Description

I am trying to download a document from the link http://pravo.gov.ru/proxy/ips/?savertf=&link_id=0&nd=128284801&intelsearch=&firstDoc=1&page=all
Everything works fine in the browser, but when I try to automate this process with Scrapy, it breaks down.

Steps to Reproduce

1. Create a new spider: scrapy genspider test pravo.gov.ru
2. Paste this code:

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["pravo.gov.ru"]
    start_urls = ["http://pravo.gov.ru/proxy/ips/"]

    def parse(self, response):
        yield scrapy.Request(
            self.start_urls[0] + '?savertf=&link_id=0&nd=128284801&intelsearch=&firstDoc=1&page=all',
            callback=self.test_parse,
            encoding='cp1251')  # the same behavior without this line

    def test_parse(self, response):
        pass
3. Run the spider, or alternatively run:
   scrapy fetch "http://pravo.gov.ru/proxy/ips/?savertf=&link_id=0&nd=128284801&intelsearch=&firstDoc=1&page=all"

Expected behavior: the document at the link is downloaded.

Actual behavior:

2023-04-28 00:07:35 [scrapy.core.scraper] ERROR: Error downloading <GET http://pravo.gov.ru/proxy/ips/?savertf=&link_id=0&nd=128284801&intelsearch=&firstDoc=1&page=all>
Traceback (most recent call last):
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\twisted\internet\defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\twisted\python\failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\core\downloader\middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\core\downloader\handlers\http11.py", line 501, in _cb_bodydone
    respcls = responsetypes.from_args(headers=headers, url=url, body=result["body"])
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\responsetypes.py", line 113, in from_args
    cls = self.from_headers(headers)
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\responsetypes.py", line 75, in from_headers
    cls = self.from_content_type(
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\responsetypes.py", line 55, in from_content_type
    mimetype = to_unicode(content_type).split(";")[0].strip().lower()
  File "C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\scrapy\utils\python.py", line 97, in to_unicode
    return text.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 33: invalid continuation byte

Reproduces how often: 100% runs

Versions

Scrapy : 2.8.0
lxml : 4.9.2.0
libxml2 : 2.9.12
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.1
Twisted : 22.10.0
Python : 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
pyOpenSSL : 23.1.1 (OpenSSL 3.1.0 14 Mar 2023)
cryptography : 40.0.2
Platform : Windows-10-10.0.19044-SP0

Additional context

These documents are encoded in cp1251, which is clearly indicated in their headers:
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="windows-1251"
The same behavior occurs when trying to save the file using FilesPipeline.
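A minimal sketch, independent of Scrapy, reproduces the failure with the header bytes shown in the traceback above: the byte 0xCF is a valid CP1251 character but an invalid UTF-8 continuation byte.

```python
# Content-Type header bytes as served by pravo.gov.ru (taken from the
# traceback above); 0xCF is "П" in CP1251 but invalid mid-sequence in UTF-8.
header = b'application/x-download; filename=\xcf-593-24_04_2023.rtf'

try:
    header.decode("utf-8")
    raised = False
except UnicodeDecodeError as exc:
    raised = True
    print(exc)  # 'utf-8' codec can't decode byte 0xcf in position 33: ...

print(raised)                   # True
print(header.decode("cp1251"))  # application/x-download; filename=П-593-24_04_2023.rtf
```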

wRAR (Member) commented Apr 28, 2023

> which is clearly indicated in their headers

I can't confirm this. E.g. this is from curl:

< HTTP/1.1 200 OK
< Server: nginx
< Date: Fri, 28 Apr 2023 07:50:55 GMT
< Content-Type: application/x-download; filename=Ï-593-24_04_2023.rtf
< Content-Length: 15961
< Connection: keep-alive
< Content-Disposition: attachment; filename=Ï-593-24_04_2023.rtf

I see the same in a browser.

Keeyahto (Author) commented

> which is clearly indicated in their headers
>
> I can't confirm this. E.g. this is from curl:
>
> < HTTP/1.1 200 OK
> < Server: nginx
> < Date: Fri, 28 Apr 2023 07:50:55 GMT
> < Content-Type: application/x-download; filename=Ï-593-24_04_2023.rtf
> < Content-Length: 15961
> < Connection: keep-alive
> < Content-Disposition: attachment; filename=Ï-593-24_04_2023.rtf
>
> I see the same in a browser.

I wrote about the headers of the file itself:

------=_NextPart_01CAD650.0093E2A0
Content-Location: file:///C:/B1334631/001.htm
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="windows-1251"

If I use the requests library in Python, the response from the server comes back without any errors.

import requests
response = requests.get('http://pravo.gov.ru/proxy/ips/?savertf=&link_id=5&nd=128284801&&page=all')
print(response.content.decode('cp1251'))
print(response.headers)

MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_01CAD650.0093E2A0" [...]
{'Server': 'nginx', 'Date': 'Fri, 28 Apr 2023 08:51:06 GMT', 'Content-Type': 'application/x-download; filename=Ï-593-24_04_2023.rtf', 'Content-Length': '15961', 'Connection': 'keep-alive', 'Content-Disposition': 'attachment; filename=Ï-593-24_04_2023.rtf'}
I don't understand why an encoding error occurs specifically when making the same request with Scrapy:

scrapy.Request('http://pravo.gov.ru/proxy/ips/?savertf=&link_id=5&nd=128284801&&page=all')
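The difference comes down to header handling: requests never tries to decode headers as UTF-8, and effectively treats raw header bytes as latin-1, which maps every byte to a code point and so never raises. A small illustration with the same header bytes:

```python
# The same Content-Type bytes the server sends. Decoding as latin-1 cannot
# fail (all 256 byte values map to code points), which is why requests
# displays "Ï", while a UTF-8 decode raises on the 0xCF byte.
raw = b'application/x-download; filename=\xcf-593-24_04_2023.rtf'

print(raw.decode("latin-1"))  # filename=Ï-... (what requests reports)
print(raw.decode("cp1251"))   # filename=П-... (the intended Cyrillic letter)
```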

@Keeyahto Keeyahto reopened this Apr 28, 2023
wRAR (Member) commented Apr 28, 2023

> I wrote about the headers of the file itself:

I don't think these matter to any of the libraries you mentioned. Even your requests sample prints them as a part of the response body.

> If I use the requests library in Python, the response from the server comes back without any errors.

Sure, it doesn't try to decode the response, unlike Scrapy.

The actual problem is that the Content-Type header value is in CP1251 (I guess?): the exception happens while getting the response MIME type, because the code assumes the header is UTF-8 while it is actually b'application/x-download; filename=\xcf-593-24_04_2023.rtf'.

We can specify the latin1 encoding instead of the implicit utf-8 when converting the Content-Type header value to str (in scrapy.http.response.text.TextResponse._headers_encoding(), scrapy.http.response.text.TextResponse._body_inferred_encoding() and scrapy.responsetypes.ResponseTypes.from_content_type()); this will prevent the exceptions in this case, and I don't think we can do anything better.

I also don't think we need the file name from this header value (or from the Content-Disposition header value, which has the same problem): we don't support Content-Disposition: attachment, and we return the raw response body directly, so we don't need to guess the encoding for unquoted non-ASCII file names.
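A self-contained sketch of this direction (the helper name is hypothetical, not Scrapy's actual code): decoding the header as latin-1 cannot raise, and the MIME type portion that matters here is plain ASCII, so nothing is lost.

```python
# Hypothetical helper showing the latin-1 approach for extracting the
# MIME type from a raw Content-Type header value.
def mimetype_from_content_type(content_type: bytes) -> str:
    # latin-1 maps all 256 byte values, so this decode can never raise,
    # unlike the implicit utf-8 decode that caused the reported exception.
    text = content_type.decode("latin-1")
    return text.split(";")[0].strip().lower()

header = b'application/x-download; filename=\xcf-593-24_04_2023.rtf'
print(mimetype_from_content_type(header))  # application/x-download
```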

@wRAR wRAR changed the title Error when trying to save a file in the wrong encoding Exception with non-UTF-8 Content-Type Apr 28, 2023
Keeyahto (Author) commented

I use this code to download documents from this site:

yield scrapy.Request(
    self.start_urls[0] + '?savertf=&link_id=5&nd=128284801&&page=all',
    callback=self.test_parse,
    meta={'download_file': True},
)

import requests

import scrapy


class FileDownloaderMiddleware:
    def process_request(self, request, spider):
        if request.meta.get('download_file'):
            # Fetch with requests to bypass Scrapy's response-type
            # detection (note: this blocking call stalls the reactor).
            response = requests.get(request.url)
            response.raise_for_status()
            return scrapy.http.Response(
                url=request.url,
                body=response.content,
                headers=response.headers,
            )
        return None
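This works because requests hands back raw bytes without the header decoding that raised in Scrapy. Once the raw body is available, it can be decoded as CP1251, as the embedded part headers declare. A sketch with a hypothetical body fragment (not the real document):

```python
# Hypothetical fragment of the multipart body: the part header declares
# windows-1251, and the Cyrillic payload decodes with that codec.
body = b'Content-Type: text/html; charset="windows-1251"\r\n\r\n\xcf\xf0\xe8\xea\xe0\xe7'

decoded = body.decode("cp1251")
print(decoded)  # the part header followed by the Cyrillic word "Приказ"
```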
