French characters "à" are not converted correctly #500
Comments
Hi @Jeremytijal 👋 Thank you for reporting an issue.
Hi @s0ph1e, I tested it with the latest version 5.2.0 and I have the same problem.
Yep, now I see it. The fix works when the HTML response contains a charset in the ... Looks like it makes sense to use utf8 for HTML files if there was no charset. @phawxby WDYT (as the author of the latest fix)?
I think it's risky to assume a charset unless we know for sure what it is. What if it's ANSI, ASCII, etc.? We could break existing users. However, that then brings about a new issue: there are multiple ways to specify the charset of a file.
I think we have 2 options.
The best place to do this is here.
I would take a stab at it, but I go on vacation in a few days and I'm intentionally not taking a laptop.
I like the idea of getting the encoding from HTML or CSS content. It sounds better than using utf8 as default and should not be difficult to do. Most probably I will not have time to implement it in the near future, so I really appreciate any help with it. Thank you for your input and enjoy your vacation without laptop @phawxby 🏖️🌴
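For anyone picking this up, here is a rough sketch of what "getting the encoding from HTML or CSS content" could look like. It is written in Python only to illustrate the idea and is not the project's code: `sniff_charset`, `decode_body`, and the `fallback` parameter are hypothetical names, and the choice of fallback is exactly the judgment call debated above.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; not part of the library in this thread.
# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)
# A CSS @charset rule is only meaningful at the very start of the file.
_CSS_CHARSET = re.compile(rb'@charset\s+"([A-Za-z0-9_.:-]+)"\s*;')


def sniff_charset(body: bytes, kind: str) -> Optional[str]:
    """Return the charset declared inside the content, or None if absent."""
    head = body[:1024]  # assumption: the declaration sits near the start
    if kind == "html":
        match = _META_CHARSET.search(head)
    else:
        match = _CSS_CHARSET.match(head)
    return match.group(1).decode("ascii").lower() if match else None


def decode_body(
    body: bytes,
    kind: str,
    header_charset: Optional[str] = None,
    fallback: str = "utf-8",  # assumption: the default is debatable
) -> str:
    """Decode using the header charset, then the in-content one, then fallback."""
    charset = header_charset or sniff_charset(body, kind) or fallback
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back rather than fail.
        return body.decode(fallback, errors="replace")


html = '<html><head><meta charset="ISO-8859-1"></head><body>voilà</body></html>'
print(decode_body(html.encode("latin-1"), "html"))
```

A real implementation would probably also want to handle a byte order mark and charset labels that the decoder does not recognise under that exact name.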
Hi Sophie,
I noticed that some non-English (French) content ends up not getting converted correctly. As a random example, try scraping this page: https://ensemblesurleterrain.bouyguestelecom.fr. The result: https://test-webscrapper.netlify.app
You'll see that the page is saved with a bunch of � characters in place of the "à".
This problem existed before the last PR. I thought it would be solved, but it was not.
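A minimal sketch of how this kind of corruption can arise, assuming the page bytes are ISO-8859-1 and get decoded as UTF-8 (I have not confirmed that this is what the page actually serves; Python is used here only for illustration):

```python
# Assumption: the source bytes are ISO-8859-1 (Latin-1); not confirmed for this page.
raw = "à".encode("latin-1")  # a single byte, 0xE0

# 0xE0 on its own is an incomplete UTF-8 sequence, so a lenient UTF-8 decoder
# substitutes U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were written in restores the letter.
right = raw.decode("latin-1")

print(wrong == "\ufffd", right == "à")
```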