French characters "à" are not converted correctly #500

Jeremytijal · 2022-06-28T09:27:53Z

Hi Sophie,

I noticed that some non-English (French) content ends up not being converted correctly. As a random example, try scraping https://ensemblesurleterrain.bouyguestelecom.fr (result: https://test-webscrapper.netlify.app).

You'll see that the page is saved with a bunch of � characters in place of the "à"

This problem existed before the last PR. I thought it would be solved, but it was not.

s0ph1e · 2022-06-29T08:40:28Z

Hi @Jeremytijal 👋

Thank you for reporting an issue.
Could you please check whether this happens with the latest version, 5.2.0? I'll be able to test it myself later today or tomorrow.

Jeremytijal · 2022-06-29T10:44:47Z

Hi @s0ph1e,

I tested it with the latest version 5.2.0 and I have the same problem.

s0ph1e · 2022-06-30T21:28:05Z

Yep, now I see it. The fix works when the HTML response contains a charset in the content-type header. In this case there is no charset in the header, so the file is saved as binary, as it was in previous versions.
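To make the header case concrete, here is a minimal sketch of reading a charset from a content-type value and falling back to binary when none is present. The helper names `charsetFromContentType` and `pickEncoding` are hypothetical and are not taken from the library's source.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns a lower-cased charset name, or null when the header has no charset.
function charsetFromContentType(contentType) {
  if (typeof contentType !== 'string') return null;
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical fallback mirroring the behaviour described in this comment:
// without a charset in the header, the body is treated as binary.
function pickEncoding(contentType) {
  return charsetFromContentType(contentType) || 'binary';
}

console.log(pickEncoding('text/html; charset=UTF-8'));
console.log(pickEncoding('text/html'));
```

The second call shows the situation reported here: the header carries a MIME type but no charset, so nothing tells the scraper how to decode the text.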

Looks like it makes sense to use utf8 for html files if there was no charset. @phawxby WDYT (as the author of the latest fix)?

phawxby · 2022-06-30T21:58:52Z

I think it's risky to assume a charset unless we know for sure what it is. What if it's ANSI, ASCII, etc.? We could break existing users. However, that then brings up a new issue: there are multiple ways to specify the charset of a file.

1. On the response header, like we use now.
2. On a meta tag for HTML.
3. As a rule in CSS.
4. And, although deprecated, as an attribute.

I think we have 2 options.

1. Default to 'utf-8' for all text types, which could potentially break things for existing users, especially for those scraping older applications.
2. Add some basic additional rules which we switch to based on the content type. I think 2 & 3 above are going to cover 99% of use cases. I think this is the best approach personally.

The best place to do this is here.

1. If the encoding is still binary after checking the headers then get the mime type of the response.
2. Add a switch based on mimetype for css/html.
3. Add 2 new functions, getMimeFromHtml and getMimeFromCss.
   a. HTML: Use cheerio to parse the response body and see if you find a <meta charset="utf-8">
   b. CSS: The CSS spec is incredibly strict, you should just be able to do .includes('@charset "UTF-8"');
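The steps above could look roughly like the sketch below. It is an illustration only, not the code that was eventually merged: the function names `getCharsetFromHtml`, `getCharsetFromCss` and `getEncodingFromBody` are hypothetical, and it uses regular expressions instead of cheerio so that it runs without dependencies.

```javascript
// Hypothetical helper: find a charset declared in an HTML meta tag.
// Matches both <meta charset="utf-8"> and the older
// <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
function getCharsetFromHtml(html) {
  const match = /<meta[^>]*?\bcharset\s*=\s*["']?([\w-]+)/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: a CSS @charset rule is only valid at the very start
// of the stylesheet, so the pattern is anchored there.
function getCharsetFromCss(css) {
  const match = /^@charset "([^"]+)";/.exec(css);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical dispatcher: switch on the MIME type and sniff the body.
// The buffer is read as latin1 so every byte maps to one character and the
// ASCII markup can be searched before the real encoding is known.
function getEncodingFromBody(mimeType, bodyBuffer) {
  const text = bodyBuffer.toString('latin1');
  switch (mimeType) {
    case 'text/html':
      return getCharsetFromHtml(text);
    case 'text/css':
      return getCharsetFromCss(text);
    default:
      return null;
  }
}

// Usage: a UTF-8 page served without a charset in the header.
const body = Buffer.from('<html><head><meta charset="utf-8"></head><body>à</body></html>', 'utf8');
const charset = getEncodingFromBody('text/html', body);
const decoded = charset ? new TextDecoder(charset).decode(body) : body.toString('latin1');
console.log(charset, decoded.includes('à'));
```

When no charset is found the sketch keeps the byte-for-byte fallback instead of assuming UTF-8, which matches the concern about not breaking existing users.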

I would take a stab at it, but I go on vacation in a few days and I'm intentionally not taking a laptop.

s0ph1e · 2022-07-01T07:11:43Z

I like the idea of getting the encoding from the HTML or CSS content; it sounds better than using utf8 as the default and should not be difficult to do.

Most probably I will not have time to implement it in the near future, so I would really appreciate any help with it.

Thank you for your input and enjoy your vacation without laptop @phawxby 🏖️🌴

phawxby mentioned this issue Jul 1, 2022

feat: add parsing of response body for encoding #501

Closed

s0ph1e mentioned this issue Aug 29, 2022

Use encoding from resource text #504

Merged

s0ph1e closed this as completed in #504 Aug 29, 2022
