Commit a17c57e

gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser (GH-137837)
* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"
1 parent 07912f8 commit a17c57e

4 files changed: +163 -114 lines changed
Doc/library/html.parser.rst

Lines changed: 20 additions & 13 deletions
@@ -15,14 +15,18 @@
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

-.. class:: HTMLParser(*, convert_charrefs=True)
+.. class:: HTMLParser(*, convert_charrefs=True, scripting=False)

    Create a parser instance able to parse invalid markup.

-   If *convert_charrefs* is ``True`` (the default), all character
-   references (except the ones in ``script``/``style`` elements) are
+   If *convert_charrefs* is true (the default), all character
+   references (except the ones in elements like ``script`` and ``style``) are
    automatically converted to the corresponding Unicode characters.

+   If *scripting* is false (the default), the content of the ``noscript``
+   element is parsed normally; if it's true, it's returned as is without
+   being parsed.
+
    An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
    when start tags, end tags, text, comments, and other markup elements are
    encountered. The user should subclass :class:`.HTMLParser` and override its
@@ -37,6 +41,9 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
    .. versionchanged:: 3.5
       The default value for argument *convert_charrefs* is now ``True``.

+   .. versionchanged:: 3.14.1
+      Added the *scripting* parameter.
+

 Example HTML Parser Application
 -------------------------------
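
Note on the *scripting* parameter documented in the hunks above: a minimal sketch of how the default behaviour can be observed. The `EventLogger` class and the `noscript_events` helper are hypothetical names introduced for illustration only. The helper passes `scripting` to the constructor only when asked to, so the default call also runs on versions that predate this commit.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record handler calls as (kind, value) tuples (illustrative helper)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def noscript_events(markup, scripting=None):
    # Only forward the keyword when requested: older versions do not accept it.
    kwargs = {} if scripting is None else {"scripting": scripting}
    parser = EventLogger(**kwargs)
    parser.feed(markup)
    parser.close()
    return parser.events


# Default (scripting is false): the content of noscript is parsed normally.
print(noscript_events("<noscript><b>hi</b></noscript>"))
# [('start', 'noscript'), ('start', 'b'), ('data', 'hi'), ('end', 'b'), ('end', 'noscript')]
```

On a build that includes this commit, calling `noscript_events(markup, scripting=True)` should, per the documentation text above, report the markup inside `noscript` as unparsed data instead of as start and end tags. On a version without the parameter the constructor raises `TypeError` for the unknown keyword.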
@@ -161,24 +168,24 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
 .. method:: HTMLParser.handle_data(data)

    This method is called to process arbitrary data (e.g. text nodes and the
-   content of ``<script>...</script>`` and ``<style>...</style>``).
+   content of elements like ``script`` and ``style``).


 .. method:: HTMLParser.handle_entityref(name)

    This method is called to process a named character reference of the form
    ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
-   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
-   ``True``.
+   (e.g. ``'gt'``).
+   This method is only called if *convert_charrefs* is false.


 .. method:: HTMLParser.handle_charref(name)

    This method is called to process decimal and hexadecimal numeric character
    references of the form :samp:`&#{NNN};` and :samp:`&#x{NNN};`. For example, the decimal
    equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
-   in this case the method will receive ``'62'`` or ``'x3E'``. This method
-   is never called if *convert_charrefs* is ``True``.
+   in this case the method will receive ``'62'`` or ``'x3E'``.
+   This method is only called if *convert_charrefs* is false.


 .. method:: HTMLParser.handle_comment(data)
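
Note on the reworded `handle_entityref` and `handle_charref` entries above: a short sketch of which handlers fire for each value of *convert_charrefs*. The `RefCollector` class and the `collect` helper are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class RefCollector(HTMLParser):
    """Record data and character-reference handler calls (illustrative helper)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def collect(markup, convert_charrefs):
    parser = RefCollector(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


text = "a &gt; b &#62; c"

# convert_charrefs true: references are converted and arrive via handle_data.
print(collect(text, True))
# [('data', 'a > b > c')]

# convert_charrefs false: the reference handlers are called instead.
print(collect(text, False))
# [('data', 'a '), ('entityref', 'gt'), ('data', ' b '), ('charref', '62'), ('data', ' c')]
```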
@@ -292,8 +299,8 @@ Parsing an element with a few attributes and a title:
    Data     : Python
    End tag  : h1

-The content of ``script`` and ``style`` elements is returned as is, without
-further parsing:
+The content of elements like ``script`` and ``style`` is returned as is,
+without further parsing:

 .. doctest::

@@ -304,10 +311,10 @@ further parsing:
    End tag  : style

    >>> parser.feed('<script type="text/javascript">'
-   ...             'alert("<strong>hello!</strong>");</script>')
+   ...             'alert("<strong>hello! &#9786;</strong>");</script>')
    Start tag: script
         attr: ('type', 'text/javascript')
-   Data     : alert("<strong>hello!</strong>");
+   Data     : alert("<strong>hello! &#9786;</strong>");
    End tag  : script

 Parsing comments:
@@ -336,7 +343,7 @@ correct char (note: these 3 references are all equivalent to ``'>'``):

 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
 :meth:`~HTMLParser.handle_data` might be called more than once
-(unless *convert_charrefs* is set to ``True``):
+if *convert_charrefs* is false:

 .. doctest::

Lib/html/parser.py

Lines changed: 18 additions & 6 deletions
@@ -127,17 +127,25 @@ class HTMLParser(_markupbase.ParserBase):
     argument.
     """

-    CDATA_CONTENT_ELEMENTS = ("script", "style")
+    # See the HTML5 specs section "13.4 Parsing HTML fragments".
+    # https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments
+    # CDATA_CONTENT_ELEMENTS are parsed in RAWTEXT mode
+    CDATA_CONTENT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")
     RCDATA_CONTENT_ELEMENTS = ("textarea", "title")

-    def __init__(self, *, convert_charrefs=True):
+    def __init__(self, *, convert_charrefs=True, scripting=False):
         """Initialize and reset this instance.

-        If convert_charrefs is True (the default), all character references
+        If convert_charrefs is true (the default), all character references
         are automatically converted to the corresponding Unicode characters.
+
+        If *scripting* is false (the default), the content of the
+        ``noscript`` element is parsed normally; if it's true,
+        it's returned as is without being parsed.
         """
         super().__init__()
         self.convert_charrefs = convert_charrefs
+        self.scripting = scripting
         self.reset()

     def reset(self):
@@ -172,7 +180,9 @@ def get_starttag_text(self):
     def set_cdata_mode(self, elem, *, escapable=False):
         self.cdata_elem = elem.lower()
         self._escapable = escapable
-        if escapable and not self.convert_charrefs:
+        if self.cdata_elem == 'plaintext':
+            self.interesting = re.compile(r'\z')
+        elif escapable and not self.convert_charrefs:
             self.interesting = re.compile(r'&|</%s(?=[\t\n\r\f />])' % self.cdata_elem,
                                           re.IGNORECASE|re.ASCII)
         else:
@@ -444,8 +454,10 @@ def parse_starttag(self, i):
             self.handle_startendtag(tag, attrs)
         else:
             self.handle_starttag(tag, attrs)
-            if tag in self.CDATA_CONTENT_ELEMENTS:
-                self.set_cdata_mode(tag)
+            if (tag in self.CDATA_CONTENT_ELEMENTS or
+                (self.scripting and tag == "noscript") or
+                tag == "plaintext"):
+                self.set_cdata_mode(tag, escapable=False)
             elif tag in self.RCDATA_CONTENT_ELEMENTS:
                 self.set_cdata_mode(tag, escapable=True)
         return endpos
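
Note on the `parse_starttag` and `set_cdata_mode` hunks above: the new branch can be read as a small decision table. The following is a minimal standalone sketch of that table, not the library's code. The `content_mode` function and its string labels are hypothetical and exist only for illustration; the two tuples copy the values shown in the diff.

```python
# Values copied from CDATA_CONTENT_ELEMENTS / RCDATA_CONTENT_ELEMENTS in the diff.
RAWTEXT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")
RCDATA_ELEMENTS = ("textarea", "title")


def content_mode(tag, scripting=False):
    """Return an illustrative label for how the content after <tag> is read.

    Hypothetical helper: html.parser exposes no such function.
    """
    tag = tag.lower()
    if tag == "plaintext":
        # The diff sets the "interesting" pattern to end-of-input only,
        # so everything after the start tag is treated as text.
        return "plaintext"
    if tag in RAWTEXT_ELEMENTS or (scripting and tag == "noscript"):
        # set_cdata_mode(tag, escapable=False)
        return "rawtext"
    if tag in RCDATA_ELEMENTS:
        # set_cdata_mode(tag, escapable=True)
        return "rcdata"
    return "normal"


for name in ("xmp", "title", "noscript", "plaintext", "p"):
    print(name, content_mode(name), content_mode(name, scripting=True))
```

In this sketch only `noscript` changes its label when `scripting` is true, which matches the single `self.scripting` check in the diff.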
