Closed
Labels: bug (Something isn't working)
Description
I'm not sure how to describe this, but a few HTML bodies have encoding errors. Encoding them raises UnicodeEncodeError at https://github.com/scrapy-plugins/scrapy-playwright/blob/master/scrapy_playwright/handler.py#L288
It looks like these websites serve content that does not actually match the encoding they declare.
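To make the failure concrete, here is a minimal sketch that reproduces the same kind of error outside the handler. The declared encoding and the page text are made-up values for illustration, not taken from any of the affected sites.

```python
# Hypothetical example values: a page that declares ISO-8859-1 but whose
# rendered text contains a character that charset cannot represent.
declared_encoding = "iso-8859-1"
body_str = "<html><body>Price: 10 \u20ac</body></html>"  # U+20AC is the euro sign

error = None
try:
    # Same operation as in the handler: str -> bytes using the declared encoding.
    body_str.encode(declared_encoding)
except UnicodeEncodeError as exc:
    error = exc
    print(f"{type(exc).__name__}: {exc}")
```

Because the browser has already decoded the page into a Python string, re-encoding it with the declared charset only works if every character in the rendered text exists in that charset.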
For now, I've patched it with:
--- handler.py.orig 2022-03-22 17:28:32.000000000 +1100
+++ handler.py.mod 2022-03-22 17:16:08.000000000 +1100
@@ -279,7 +279,11 @@
         headers = Headers(response.headers)
         headers.pop("Content-Encoding", None)
         encoding = _get_response_encoding(headers, body_str) or "utf-8"
-        body = body_str.encode(encoding)
+        try:
+            body = body_str.encode(encoding)
+        except UnicodeEncodeError:
+            encoding = html_body_declared_encoding(body_str) or "utf-8"
+            body = body_str.encode(encoding)
         respcls = responsetypes.from_args(headers=headers, url=page.url, body=body)
         return respcls(
             url=page.url,
But maybe you have a better way of handling this case.
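One possible alternative, offered only as a sketch: fall back straight to UTF-8 whenever the declared encoding cannot represent the text, and return the encoding that was actually used so the response can report it consistently. The helper name `encode_body` and its signature are hypothetical and are not part of scrapy-playwright.

```python
from typing import Tuple


def encode_body(body_str: str, encoding: str, fallback: str = "utf-8") -> Tuple[bytes, str]:
    """Encode body_str, falling back when the declared encoding does not fit.

    Hypothetical helper for illustration. Returns the encoded bytes together
    with the name of the encoding that was actually used.
    """
    try:
        return body_str.encode(encoding), encoding
    except (UnicodeEncodeError, LookupError):
        # UnicodeEncodeError: the text has characters outside the declared charset.
        # LookupError: the declared charset name is unknown to Python.
        # UTF-8 can represent every code point except lone surrogates.
        return body_str.encode(fallback), fallback


# The declared charset fits, so it is kept.
plain_body, plain_encoding = encode_body("<p>ok</p>", "iso-8859-1")

# The euro sign is not in ISO-8859-1, so the fallback is used instead.
euro_body, euro_encoding = encode_body("<p>10 \u20ac</p>", "iso-8859-1")
```

Unlike the patch above, this does not consult the HTML body for a second declared encoding. It assumes that whatever encoding ends up being used is also the one passed on with the response, so that the body decodes back to the same text.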