expose html.parser.unescape #47176
Comments
There is currently a private method inside of html.parser.HTMLParser to Additionally, many websites don't use proper unicode or iso-8859-1 The unescaping logic was slightly simplified too. This is my first Python patch submission, so please let me know if I've A new test case was also added for this functionality. |
The plan is to add html.escape(). Adding html.unescape() wouldn't hurt. |
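For readers following the thread, here is a minimal sketch of how the two functions discussed here behave once both exist in the stdlib. It assumes Python 3.4 or later, where `html.unescape()` is available alongside `html.escape()`; the sample string is only an illustration.

```python
import html

original = '<a href="x">Tom & Jerry</a>'

# escape() replaces &, < and >; with quote=True (the default) it also
# replaces the double and single quote characters.
escaped = html.escape(original)
print(escaped)

# unescape() converts named and numeric character references back to text.
restored = html.unescape(escaped)
print(restored)
```

For this input the round trip returns the original string, since every character that `escape()` rewrites is one that `unescape()` recognizes.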
Trying to run the test and I get:- c:\py3k\Lib>..\PCbuild\python_d.exe test\test_htmlparser.py |
It's using the old Python 2 unicode string literal syntax. It also doesn't keep to 80 cols. I'd also rather continue using a lazily initialized dict instead of catching a KeyError for &apos;. I also feel that with the changes to Unicode in py3k, the cp1252 stuff won't work as desired and should be cut. === Is anyone still interested in html.unescape or html.escape anyway? Every web framework seems to have its own support routines already. Otherwise I'd recommend close -> wontfix. |
msg110657 recommends close -> wontfix. Does anybody want this kept open or can it be closed? |
I'm not sure that using a hardcoded mapping CP1252 => unicode is a good idea. |
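To make the CP1252 question concrete, the sketch below shows why numeric references in the 128-159 range are ambiguous: the same byte value decodes differently under Windows-1252 and ISO-8859-1. The last line assumes Python 3.4 or later, where `html.unescape()` follows the HTML5 parsing rules for such references; it is not taken from the patch under review.

```python
import html

# Byte 0x96 is an en dash in Windows-1252 but a C1 control character
# in ISO-8859-1 (latin-1).
cp1252_char = bytes([0x96]).decode("cp1252")
latin1_char = bytes([0x96]).decode("latin-1")

# Pages authored with Windows-1252 in mind often write the en dash as &#150;.
# html.unescape() in Python 3.4+ maps that reference to U+2013.
unescaped = html.unescape("&#150;")

print(repr(cp1252_char), repr(latin1_char), repr(unescaped))
```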
I don't think Django includes an HTML unescape. I'm not familiar with other frameworks. So I'd still find this useful to include in the stdlib. |
New patch attached, tested against Python 3.2. This is my first Python patch so apologies if I've done something wrong here. Feedback appreciated! Changes:
|
I added comments on Rietveld. One more thing. For now the html module is very simple and has no dependencies. The patch adds an import of re and html.escapes and relatively heavy re.compile operations. Because the html module is implicitly imported when any of the html submodules is imported, this can affect code that doesn't use unescape(). However, a cure for this problem (lazy import and initialization) may be worse than the problem itself; perhaps we should live with it. |
Here's an updated patch that addresses comments on rietveld and adds a few more tests and docs. Regarding your concern:
Overall I don't think it's a big problem. As a side note, the "if '&' in s:" in the unescape function could be removed -- I'm not sure it brings any real advantage. This could/should be proved by benchmarks. |
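One way to run the benchmark suggested here is sketched below, assuming a simplified unescape that handles named references only. The functions `unescape_plain` and `unescape_checked` and the pattern are illustrative assumptions, not the code from the patch; `html.entities.html5` is the stdlib table of HTML5 named references.

```python
import re
import timeit
from html.entities import html5

# Simplified pattern (an assumption): named references ending in ';' only.
_charref = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def _replace(m):
    # Leave unknown names untouched.
    return html5.get(m.group(1), m.group(0))


def unescape_plain(s):
    """Always run the regex substitution."""
    return _charref.sub(_replace, s)


def unescape_checked(s):
    """Skip the substitution when the string contains no '&'."""
    if "&" not in s:
        return s
    return _charref.sub(_replace, s)


if __name__ == "__main__":
    samples = [
        ("no entities", "plain text " * 50),
        ("with entities", "a &amp; b &lt; c " * 50),
    ]
    for label, text in samples:
        for func in (unescape_plain, unescape_checked):
            elapsed = timeit.timeit(lambda: func(text), number=10_000)
            print(f"{label:14} {func.__name__:18} {elapsed:.4f}s")
```

Both variants return the same results, so only the timings differ; which one wins depends on how often the input actually contains an ampersand.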
Here is the last iteration with a few minor tweaks and a couple more tests. |
LGTM. |
New changeset 7b9235852b3b by Ezio Melotti in branch 'default': |
Fixed, thanks for the reviews! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.