Tokenizer corner cases #2

Closed
SimonSapin opened this Issue Apr 18, 2012 · 2 comments

SimonSapin commented Apr 18, 2012

Now that the descendant selector bugs are fixed (unless I missed
something), the remaining issues that I see are:

  1. The current tokenizer for Symbol uses something like the '\w' regex,
    while a CSS IDENT token can contain any non-ASCII character (including
    U+00A0 no-break space, for example), can contain backslash-escapes, but
    cannot start with a digit.
  2. Unicode white space (like U+00A0) is treated as white space (either
    ignored or read as a descendant combinator) but should not be: in CSS
    it is part of an identifier (related to 1).
  3. 2n+1 or similar strings (arguments to :nth-child()) are tokenized as
    Symbol objects, and are then accepted by the parser as element types,
    class names, IDs, etc.
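
To make points 1 to 3 concrete, here is a rough sketch of what a stricter identifier pattern could look like. It is loosely modelled on the CSS 2.1 grammar and is not the tokenizer's actual code; the names `IDENT_RE`, `CSS_WHITESPACE_RE` and `match_ident` are made up for illustration, and "non-ASCII" is assumed to mean anything above U+007F.

```python
import re

# Assumption: "non-ASCII" means any code point above U+007F.
NONASCII = r'[^\x00-\x7f]'
# Hex escape: backslash, 1 to 6 hex digits, one optional trailing white space.
UNICODE_ESCAPE = r'\\[0-9a-fA-F]{1,6}(?:\r\n|[ \n\r\t\f])?'
# Simple escape: backslash followed by any non-hex, non-newline character.
SIMPLE_ESCAPE = r'\\[^\n\r\f0-9a-fA-F]'
ESCAPE = '(?:%s|%s)' % (UNICODE_ESCAPE, SIMPLE_ESCAPE)

# An identifier may start with an optional hyphen, but never with a digit.
NMSTART = '(?:[_a-zA-Z]|%s|%s)' % (NONASCII, ESCAPE)
NMCHAR = '(?:[_a-zA-Z0-9-]|%s|%s)' % (NONASCII, ESCAPE)
IDENT_RE = re.compile('-?%s%s*' % (NMSTART, NMCHAR))

# CSS white space is only space, tab, CR, LF and form feed. Unlike '\s' on
# Unicode strings, this does not match U+00A0.
CSS_WHITESPACE_RE = re.compile(r'[ \t\r\n\f]+')


def match_ident(text, pos=0):
    """Return the identifier starting at `pos`, or None if there is none."""
    match = IDENT_RE.match(text, pos)
    return match.group() if match else None
```

With a pattern along these lines, `match_ident(u'a\u00a0b')` keeps the no-break space inside the identifier, and `match_ident('2n+1')` returns None, so an :nth-child() argument would need its own token type instead of being accepted as a Symbol.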

I think that any valid (for CSS) selector that only uses ASCII without
backslash-escapes should be fine now, so maybe this is not really a
problem ...

SimonSapin commented Apr 20, 2012

https://bugs.launchpad.net/lxml/+bug/754636 is a special case of this bug. parse(u'.test\u201d') fails but should not. The Unicode character should be part of the class.

SimonSapin commented May 21, 2012

Actually, the valid CSS escaping would be .test\201d. The \u form is a Python thing.
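
For illustration, decoding the CSS hex escape form could look something like the sketch below. The helper name `unescape_unicode` is hypothetical and not part of the library. It only handles hex escapes, not simple escapes such as `\.`, and it will raise ValueError for values above U+10FFFF.

```python
import re

# Backslash, 1 to 6 hex digits, then one optional white space character that
# terminates the escape and is consumed with it.
_UNICODE_ESCAPE = re.compile(r'\\([0-9a-fA-F]{1,6})(?:\r\n|[ \n\r\t\f])?')


def unescape_unicode(css_text):
    """Replace CSS hex escapes such as \\201d with the character they name."""
    return _UNICODE_ESCAPE.sub(
        lambda match: chr(int(match.group(1), 16)), css_text)
```

Under these assumptions, `unescape_unicode(r'.test\201d')` gives the same string as the Python literal `u'.test\u201d'`, which is the class name the parser should end up with either way.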
