
UnicodeEncodeError encountered on a few cases #71

@clippered


I'm not sure how to describe this, but a few HTML bodies have encoding errors. Encoding them raises UnicodeEncodeError at https://github.com/scrapy-plugins/scrapy-playwright/blob/master/scrapy_playwright/handler.py#L288
So it looks like these websites do not actually follow the encodings they declare.
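
A minimal sketch of how this kind of failure can arise, using only the standard library. The page content and the `iso-8859-1` declaration are made-up assumptions for illustration, not taken from any of the affected sites:

```python
# Hypothetical page: the declared charset is ISO-8859-1, but the text
# contains a character (the euro sign, U+20AC) that ISO-8859-1 cannot represent.
declared_encoding = "iso-8859-1"  # assumed value, standing in for the detected encoding
body_str = "<html><body>caf\u00e9 costs 3 \u20ac</body></html>"

error = None
try:
    body_str.encode(declared_encoding)
except UnicodeEncodeError as exc:
    # This is the same exception type reported in this issue.
    error = exc

print(type(error).__name__)
```

The browser hands back the page content as an already-decoded string, so any character outside the declared charset makes the re-encoding step fail.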

For now, I've patched it with:

--- handler.py.orig	2022-03-22 17:28:32.000000000 +1100
+++ handler.py.mod	2022-03-22 17:16:08.000000000 +1100
@@ -279,7 +279,11 @@
         headers = Headers(response.headers)
         headers.pop("Content-Encoding", None)
         encoding = _get_response_encoding(headers, body_str) or "utf-8"
-        body = body_str.encode(encoding)
+        try:
+            body = body_str.encode(encoding)
+        except UnicodeEncodeError:
+            encoding = html_body_declared_encoding(body_str) or "utf-8"
+            body = body_str.encode(encoding)
         respcls = responsetypes.from_args(headers=headers, url=page.url, body=body)
         return respcls(
             url=page.url,

Maybe you have a better way of handling this case, though.
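
One possible generalisation, as a rough sketch rather than a proposed change to the handler: try a list of candidate encodings in order and fall back to UTF-8 at the end. The helper name `encode_with_fallback` and its signature are hypothetical and not part of scrapy-playwright. Note that in the patch above the second `encode` call can still raise if the encoding declared in the HTML body also cannot represent the text, which is the case this sketch tries to cover:

```python
from typing import Iterable, Optional, Tuple


def encode_with_fallback(
    body_str: str, candidates: Iterable[Optional[str]]
) -> Tuple[bytes, str]:
    """Hypothetical helper: encode with the first candidate that works.

    Returns the encoded body and the name of the encoding actually used.
    """
    for encoding in candidates:
        if not encoding:
            continue  # skip missing candidates (e.g. nothing was declared)
        try:
            return body_str.encode(encoding), encoding
        except (UnicodeEncodeError, LookupError):
            # UnicodeEncodeError: the text does not fit this encoding.
            # LookupError: the encoding name is unknown to Python.
            continue
    # Last resort: UTF-8 can represent any text without lone surrogates.
    return body_str.encode("utf-8"), "utf-8"


# Made-up example: the header says ISO-8859-1, the body declares nothing.
body_str = "<html><body>3 \u20ac</body></html>"
body, used = encode_with_fallback(body_str, ["iso-8859-1", None])
print(used)
```

Returning the encoding that was actually used matters because the response object should be built with the same encoding the bytes were produced with.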


Labels: bug
