Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTMLParser : A auto-tolerant parsing mode #43346

Closed
kxroberto mannequin opened this issue May 11, 2006 · 24 comments
Closed

HTMLParser : A auto-tolerant parsing mode #43346

kxroberto mannequin opened this issue May 11, 2006 · 24 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@kxroberto
Copy link
Mannequin

kxroberto mannequin commented May 11, 2006

BPO 1486713
Nosy @freddrake, @terryjreedy, @orsenthil, @ezio-melotti, @merwok, @bitdancer
Files
  • HTMLParser_tolerant.patch
  • HTMLParser_tolerant_py24.patch
  • HTMLParser_tolerant_py26.patch
  • test_htmlparser_tolerant.patch: test case
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-12-03.04:10:56.507>
    created_at = <Date 2006-05-11.17:19:36.000>
    labels = ['type-feature', 'library']
    title = 'HTMLParser : A auto-tolerant parsing mode'
    updated_at = <Date 2011-11-16.13:17:02.792>
    user = 'https://bugs.python.org/kxroberto'

    bugs.python.org fields:

    activity = <Date 2011-11-16.13:17:02.792>
    actor = 'ezio.melotti'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-12-03.04:10:56.507>
    closer = 'r.david.murray'
    components = ['Library (Lib)']
    creation = <Date 2006-05-11.17:19:36.000>
    creator = 'kxroberto'
    dependencies = []
    files = ['7243', '7244', '18623', '18624']
    hgrepos = []
    issue_num = 1486713
    keywords = ['patch']
    message_count = 24.0
    messages = ['50232', '50233', '50234', '50235', '113366', '114659', '114682', '114773', '114786', '114796', '115031', '115115', '115624', '121674', '123174', '123247', '147692', '147698', '147755', '147756', '147763', '147765', '147766', '147767']
    nosy_count = 9.0
    nosy_names = ['fdrake', 'terry.reedy', 'jjlee', 'orsenthil', 'kxroberto', 'ezio.melotti', 'Neil Muller', 'eric.araujo', 'r.david.murray']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1486713'
    versions = ['Python 3.2']

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented May 11, 2006

    Changes:

    • Now allows missing spaces between attributes as its
      often seen on the web like this :

    <script type="text/javascript"language="JavaScript1.1">

    That like broke the whole parsing before.

    • A fully auto-tolerant mode (HTMLParser.tolerant=1)
      was added. It should hopefully NEVER break HTML parsing
      on the level of HTMLParser, but recover and continue
      the parsing smartly. The mode was tested extensively
      with complex pages. The tolerant mode is guaranted to
      finish all HTML stuff only during HTMLParser.close() /
      goahead(end=True) - yet that was the same (stucking)
      policy before.
      Maybe steep: I have switched ON the tolerant mode by
      default, as this is, what in 99.9% of cases one wants
      to have.
      (I've maybe 20 applications for HTMLParser - None like
      the unrecoverable breaks with Exceptions)
      During tolerant mode the virtual .warning(message,i,k)
      is called instead of error - by default this just
      counts .warning_count up. This framework should even
      enable to write po HTML checkers

    • The patch was generated against py2.3 (still the
      "good/base" Python for me) and also fixes a regexp-bug
      (which already was fixed in py2.4.2). Yet the patch
      works also against py2.4/2.5 - 2 locations where py24
      trivially changed to %r/repr may grumble.

    -robert

    @kxroberto kxroberto mannequin added stdlib Python modules in the Lib dir labels May 11, 2006
    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented May 23, 2006

    Logged In: YES
    user_id=972995

    Python 2.4 version of the patch added.

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented May 23, 2006

    Logged In: YES
    user_id=972995

    (and works also for Python2.5)

    @jjlee
    Copy link
    Mannequin

    jjlee mannequin commented Jan 30, 2007

    This badly needs unit tests.

    @devdanzin devdanzin mannequin added type-feature A feature request or enhancement labels Mar 21, 2009
    @terryjreedy
    Copy link
    Member

    terryjreedy commented Aug 9, 2010

    This needs to be checked for applicability to 3.x.
    Do beautifulsoup and other programs cover this ground (tolerant parsing of junk html)?

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Aug 22, 2010

    I think this should be closed as have other similar requests in the last few days.

    @bitdancer
    Copy link
    Member

    bitdancer commented Aug 22, 2010

    I disagree (and might disagree with those other closings but I haven't noticed them I guess). BeautifulSoup does *not* cover this ground, it is broken in 3.x because of the lack of a tolerant HTML parser in the stdlib (it used to use sgmlib, which is now gone). BeautifulSoup would probably very much like to have this tolerant mode.

    It probably shouldn't be the default, though, for backward compatibility reasons :(

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented Aug 24, 2010

    for me a parser which cannot be feed with HTML from outside (which I cannot edit myself) has not much use at all.
    attached my current patch (vs. py26) - many changes meanwhile.
    and a test case.

    I've put the default to strict mode, but ...

    @bitdancer
    Copy link
    Member

    bitdancer commented Aug 24, 2010

    2.6 is now in security-fix-only mode. Since this is a new feature, it can only go into 3.2.

    Can you provide a patch against py3k trunk?

    I've only glanced at the patch briefly, but one thing that concerns me is 'warning file'. I suppose that either the logging module or perhaps the warnings module should be used instead. We should look at how other stdlib modules handle this kind of thing. Or perhaps warnings shouldn't be generated at all, since the default will be strict and therefore the programmer has consciously selected tolerant mode.

    One stdlib model we could follow is the model of the email module: have a 'defects' attribute that collects the errors. email6, by the way, is going to have both 'tolerant' and 'strict' modes, and in that case the default is tolerant (and always has been) in respect for Postel's law, which is enshrined in the email RFCs. If the HTTP standards have a similar recommendation to accept "dirty" input when possible, we could make an argument for changing HTMLParser's default to tolerant.

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Aug 24, 2010

    I agree that a tolerant mode would be good (and often requested). String encoding and decoding also have strict and forgiving modes, so this seems close to a policy.

    Unit tests with example snippets that properly fail strict mode and pass the new tolerant mode are needed both to completely review and apply a patch.

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented Aug 26, 2010

    I'm not working with Py3. don't how much that module is different in 3.
    unless its going into a py2 version, I'll leave the FR so far to the py3 community

    @bitdancer
    Copy link
    Member

    bitdancer commented Aug 27, 2010

    For anyone who does want to work on this (and I do, but it will be quite a while before I can) see also bpo-6191.

    @bitdancer
    Copy link
    Member

    bitdancer commented Sep 5, 2010

    See also bpo-1058305, which may be a duplicate.

    @NeilMuller
    Copy link
    Mannequin

    NeilMuller mannequin commented Nov 20, 2010

    bpo-975556 and bpo-1046092 look like they should also be superseded by this.

    @bitdancer
    Copy link
    Member

    bitdancer commented Dec 3, 2010

    I have committed a version of this patch, without the warnings, using the keyword 'strict=True' as the default, and with a couple added heuristics from other similar issues, in r86952.

    kxroberto, if you want to supply your full name, I'll add you to Misc/ACKS. Thanks for the original patch.

    @bitdancer
    Copy link
    Member

    bitdancer commented Dec 3, 2010

    A note for the curious: I changed the keyword name from 'tolerant' to 'strict' because the stdlib has other examples of 'strict' as a keyword, but the word 'tolerant' appears nowhere in the documentation and certainly not as a keyword. So it seemed better to remain consistent with existing practice. This would be even better if the default value of 'strict' was False, but unfortunately we can't do that.

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented Nov 15, 2011

    I looked at the new patch http://hg.python.org/lookup/r86952 for Py3 (regarding the extended tolerance and local backporting to Python2.7):

    What I miss are the calls of a kind of self.warning(msg,i,k) function in non-strict/tolerant mode (where self.error is called in strict mode). Such function could be empty or could be a silent simple counter (like in the old patch) - and could be easily sub-classed for advanced use.
    I often want at least the possibilty of a HTML error log - so the HTML author (sometimes its me myself) can be noticed to get it more strict on the long run ;-) ...

    @ezio-melotti
    Copy link
    Member

    ezio-melotti commented Nov 15, 2011

    The HTMLParser is not suitable for validation, even the strict mode allows some non valid markup (and it might be removed soon).
    Also I don't think it's easy to call a self.warnings() without trying the strict mode first. The tolerant parsing just allow more things, without making any distinction between valid and not.

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented Nov 16, 2011

    Well in many browsers for example there is a internal warning and error log (window). Which yet does not (need to) claim to be a official W3C checker. It has positive effect on web stabilization.
    For example just looking now I see the many HTML and CSS warnings and errors about the sourceforge site and this bug tracker in the Browsers log - not believing that the log covers the bugs 100% ;-)

    The events of warnings are easily available here, and calling self.warning, as it was, costs quite nothing. I don't see a problem for non-users of this feature. And most code using HTMLParser also emits warnings on the next higher syntax level, so to not have a black box...

    As I used a tolerant version of HTMLParser for about a decade, I can say the warnings are of the same value in many apps and use case, as to be able to have look into a Browsers syntax log.
    The style of stretching a argument to black<->white is not reasonable here in the world of human edited HTML ;-)

    @ezio-melotti
    Copy link
    Member

    ezio-melotti commented Nov 16, 2011

    The strict/tolerant mode mainly works by using either a strict or a tolerant regex. If the markup is invalid, the strict regex doesn't match and it gives an error. The tolerant regex will match both valid and invalid markup at the same time, without distinctions, and that's why there's no way to emit a warning for these cases. I think there are a couple of places where a warning could be emitted, but that would just cover a small percentage of the errors. Even if we find a way to emit a warning for everything allowed by the tolerant mode that fails on strict, it won't still cover all the possible errors, that's why I think tools like validators and conformance checkers (or even the warning/error logs) should be used instead.

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented Nov 16, 2011

    The old patch warned already the majority of real cases - except the missing white space between attributes.

    "The tolerant regex will match both":
    locatestarttagend_tolerant: The main and frequent issue on the web here is the missing white space between attributes (with enclosed values). And there is the new tolerant comma between attributes, which however I have not seen so far anywhere (the old warning machanism and attrfind.match would have already raised it at "junk chars ..." event.
    Both issues can be easily warned (also/already) at quite no cost by the slightly extended regex below (when the 2 new non-pseudo regex groups are check against <>None in check_for_whole_start_tag).
    Or missing whitespace could be warned (multiple times) at attrfind time.

    attrfind_tolerant : I see no point in the old/"strict" attrfind. (and the difference is guessed 0.000% of real cases). attrfind_tolerant could become the only attrfind.

    --

    locatestarttagend_tolerant = re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      (?:(?:\s+|(\s*))                   # optional whitespace before attribute name
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
             )
             (?:\s*(,))*                   # possibly followed by a comma
           )?
         )
       )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)
    attrfind_tolerant = re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

    #s='<abc a="b,+"c="d"e=f>text'
    #s='<abc a="b,+" c="d"e=f>text'
    s='<abc a="b,+",c="d" e=f>text'

    m = locatestarttagend_tolerant.search(s)
    print m.group()
    print m.groups()
    #if m.group(1) is not None: self.warning('space missing ...
    #if m.group(2) is not None: self.warning('comma between attr...
    m = attrfind_tolerant.search(s, 5)
    print m.group()
    print m.groups()

    @ezio-melotti
    Copy link
    Member

    ezio-melotti commented Nov 16, 2011

    Note that the regex and the way the parser considers the commas changed in 16ed15ff0d7c (it now considers them as the name of a value-less attribute), so adding a group for the comma is no longer doable.

    In theory, the approach you suggest might work, but if we want some warning mechanism it should be generic enough to work with all kind of invalid markup. In addition this adds complexity to already complex regular expressions, so there should be a valid use case for this.
    Also keep in mind that HTMLParser won't do any check about the validity of the elements' names or attributes' names/values, or even if they are nested/closed correctly, so even with a comprehensive set of warnings, you won't still be able to use HTMLParser to validate your pages.

    @kxroberto
    Copy link
    Mannequin Author

    kxroberto mannequin commented Nov 16, 2011

    16ed15ff0d7c was not in current stable py3.2 so I missed it..

    When the comma is now raised as attribute name, then the problem is anyway moved to the higher level anyway - and is/can be handled easily there by usual methods.
    (still I guess locatestarttagend_tolerant matches a free standing comma extra after an attribute)

    "should be generic enough to work with all kind of invalid markup": I think we would be rather complete then (->missing space issue)- at least regarding %age of real cases. And it could be improved with few touches over time if something missing. 100% is not the point unless it shall drive the official W3C checker. The call of self.warning, as in old patch, doesn't cost otherwise and I see no real increase of complexity/cpu-time.

    "HTMLParser won't do any check about the validity of the elements' names or attributes' names/values": yes thats of course up to the next level handler (BTDT)- thus the possibilty of error handling is not killed. Its about what HTMLParser _hides_ irrecoverably.

    "there should be a valid use case for this": Almost any app which parses HTML (self authored or remote) can have (should have?) a no-fuzz/collateral warn log option. (->no need to make a expensive W3C checker session). I mostly have this in use as said, as it was anyway there.

    Well, as for me, I use anyway a private backport to Python2 of this. I try to avoid Python3 as far as possible. (No real plus, too much problems) So for me its just about joining Python4 in the future perhaps - which can do true CPython multithreading, stackless, psyco/static typing ... and print statement again without typing so many extra braces ;-)
    I considered extra libs like the HTML tidy binding, but this is all too much fuzz for most cases. And HTMLParser has already quite everything, with the few calls inserted ..

    @ezio-melotti
    Copy link
    Member

    ezio-melotti commented Nov 16, 2011

    16ed15ff0d7c was not in current stable py3.2 so I missed it..

    It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently you might have missed it).

    When the comma is now raised as attribute name, then the problem is
    anyway moved to the higher level anyway - and is/can be handled easily
    there by usual methods.

    The next level could/should validate the name of the attribute and determine that ',' is not a valid attribute name, so in this case there's no warning to raise here (actually you could detect that it's not a-zA-Z (or whatever the specs say) and raise a more general warning even at this level, but no information is lost here about this).

    100% is not the point unless it shall drive the official W3C checker.

    I'm still not sure that having 70-80% is useful (unless we can achieve 100% on this level and leave the rest to an upper layer). If you think this is doable you could try to first identify what errors should be detected by this layer, see if they are all detectable and then propose a patch.

    The call of self.warning, as in old patch, doesn't cost otherwise and
    I see no real increase of complexity/cpu-time.

    The extra complexity is mainly in the already complex regular expressions, and also in the list of 'if' that will have to check the content of the groups to report the warnings. These changes are indeed not too invasive, but they still make the code more complicated.

    Almost any app which parses HTML (self authored or remote) can have
    (should have?) a no-fuzz/collateral warn log option. (->no need to
    make a expensive W3C checker session).

    I think the original goal of HTMLParser was parsing mostly-valid HTML. People started reporting issues with less-valid HTML, and these issues got fixed to make it able to parse non-valid HTML. AFAIK it never followed strictly any HTML standard, and it just provided a best-effort way to get data out of an HTML page. So, I would consider doing validation or even being a building block for a conforming parser out of the scope of the module.

    I mostly have this in use as said, as it was anyway there.

    If 'this' refers to some kind of warning system, what do you do with these warnings? Do you fix them, avoid using the w3c validator (or any other conforming validator) and consider a mostly-valid page good enough? Or do you fix them, and then you also check with the w3c validator?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants