Token-based HTML parser #1079
}
while (f != EOF && ! (d == '-' && e == '-' && f == '>'));

if (skipComments || f == EOF)
this would skip the comment and return EOF when skipComments == TRUE and f == EOF, is that really wanted? I'd rather see returning TOKEN_COMMENT, and let the next read return TOKEN_EOF.
But it's an invalid TOKEN_COMMENT in this case as the comment wasn't terminated by --> but rather by EOF.
sure, so it might depend on how it's used; but here it drops the comment content if the comment is unterminated, while it doesn't if it is.
It shouldn't matter. I introduced TOKEN_COMMENT because I was fighting with cases like
<foo>abc<!--comment-->def<!--comment-->ghi</bar>
Inside readTokenText() I don't know whether the < is tag start or comment start but at this point I cannot read more characters because I'd need multi-level-ungetc() which we don't have. On the other hand if the following readToken() just silently skipped the comment, I wouldn't know I can expect another text behind it.
If the comment token is at the end of input, it doesn't matter much if TOKEN_COMMENT or EOF is returned because no other text is behind it.
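A minimal sketch of the point above, not the parser's actual code: it reads from an in-memory string, so unlike the real parser it can look ahead freely and needs no ungetc(). `Input`, `readToken()` and `readJoinedText()` as written here, and every token kind other than TOKEN_COMMENT and TOKEN_EOF, are hypothetical names chosen for illustration. It shows why reporting comments as their own token lets the caller know that more text may follow, and that an unterminated comment at end of input is harmless either way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; only TOKEN_COMMENT and TOKEN_EOF appear in the discussion. */
typedef enum { TOKEN_EOF, TOKEN_TEXT, TOKEN_TAG_START, TOKEN_COMMENT } TokenType;

/* Hypothetical input cursor over a NUL-terminated string. */
typedef struct {
	const char *src;
	size_t pos;
} Input;

/* Reads one token; for TOKEN_TEXT the text is copied into buf (truncated if needed). */
static TokenType readToken (Input *in, char *buf, size_t bufSize)
{
	size_t n = 0;

	assert (in != NULL && buf != NULL && bufSize > 0);
	buf[0] = '\0';

	if (in->src[in->pos] == '\0')
		return TOKEN_EOF;

	if (in->src[in->pos] != '<')
	{
		/* Plain text runs until the next '<' or the end of input. */
		while (in->src[in->pos] != '\0' && in->src[in->pos] != '<')
		{
			if (n + 1 < bufSize)
				buf[n++] = in->src[in->pos];
			in->pos++;
		}
		buf[n] = '\0';
		return TOKEN_TEXT;
	}

	if (strncmp (in->src + in->pos, "<!--", 4) == 0)
	{
		/* Skip to just past "-->", or to end of input if unterminated. */
		const char *end = strstr (in->src + in->pos + 4, "-->");
		in->pos = end ? (size_t) (end - in->src) + 3 : strlen (in->src);
		return TOKEN_COMMENT;
	}

	in->pos++;	/* consume '<' */
	return TOKEN_TAG_START;
}

/* Joins text runs that are separated only by comments. */
static void readJoinedText (Input *in, char *out, size_t outSize)
{
	char piece[64];

	assert (out != NULL && outSize > 0);
	out[0] = '\0';
	for (;;)
	{
		size_t save = in->pos;
		TokenType t = readToken (in, piece, sizeof piece);

		if (t == TOKEN_TEXT)
			strncat (out, piece, outSize - strlen (out) - 1);
		else if (t != TOKEN_COMMENT)
		{
			in->pos = save;	/* leave the tag (or EOF) for the caller */
			break;
		}
	}
}
```

With the input from the example above, readJoinedText() collects the three text pieces around the two comments into a single string and stops at the closing tag. If readToken() silently skipped comments instead, the caller would see a tag-like `<` after `abc` and have no signal that `def` still belongs to the same text.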
Looks mostly good at first read. |
	vStringDelete (text);
}

static void findHtmlTags ()
missing void in the argument list :)
@techee, I have a request about the new html parser. Currently the html parser tags javascript functions embedded in a html file. This is applicable to inline stylesheets; I would like you to run CSS parser from your html parser. |
About <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>CDATA tests</title>
</head>
<body>
<div>
<h1><![CDATA[Pears & apples]]></h1>
<![CDATA[<h1>Bug</h1>]]>
<![CDATA[<h1>]]>
<h1>Cherries</h1>
<h1>Strawberries</h1>
</div>
</body>
</html> However, it might not be so important to support |
In this case I'll probably skip the support of CDATA for the first version of the parser - it can be done later. |
@masatake Sure, will have a look at it. |
@masatake Just had a brief look at the Promise API and have two questions:
where both languages are mixed on the same line?
|
Yes.
You can specify it to
No. I would like to learn more about this topic. Could you show an example? |
Pre-processing could work, but it would require a fairly more subtle API, writing to a (memory) stream and then using that as the input for the next parser. That could work just fine, but it might require some non-trivial changes I imagine. Though probably not much more than what would be required to have a full in-memory library API :) |
I was thinking about something simple like the possibility for the hosting parser to register a callback like
which would decide whether the input character should be passed to the embedded language's parser or not. But after thinking about it we couldn't use it in this form for HTML because we learn whether something is a comment too late (after reading 4 characters) and we couldn't ungetc() them so the embedded parser can read them. Anyway, not parsing HTML comments within |
As an experiment I'm adding two hooks to rearrange input(char or string) passed to a parser. |
I see. I will not implement a char-based preprocessor for a while. (I will implement a line-based preprocessor for a different purpose.) If you have a question about makePromise, feel free to ask me. No one has eaten the dog food yet.
@masatake Done (and it works, cool!). I added a new function to read.c to get the offset from the beginning of the line because computing it in the parser would be annoying (and any other parser using the promise API would have to do the same). Please let me know if it's OK this way or if it should be modified some way. @b4n Hopefully all your comments are addressed. If the patches are OK, I could squash some of the fixes to the initial commit. |
{
	unsigned char *base = (unsigned char *) vStringValue (File.line);
	int ret = File.currentLine - base - File.ungetchIdx;
	return ret >= 0 ? ret : 0;
unfortunately if ungetchIdx is greater than File.currentLine - base (i.e. is on the previous line), both getInputLineNumber() and getInputLineOffset() will be wrong…
Right, I noticed that too but it shouldn't be a practical problem as one typically ungetc()s either real code characters (new token) or whitespace characters (EOLs to stay at the previous line) but not both of them at the same time. So there shouldn't be any forgotten non-whitespace characters from the previous line and we'll only miss some whitespace characters which don't matter.
Agreed it's unlikely to really matter (even more so as it's likely to be 1 byte at most), but we never know until it bites :)
Anyway, not a blocker, it's a mere comment on a current limitation.
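The limitation discussed in this thread can be illustrated with a small standalone sketch. `ReaderState` and `lineOffset()` are hypothetical stand-ins for the reader's internal state and the new function in read.c, not the actual ctags definitions; the arithmetic mirrors the diff fragment above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the reader state used in the diff above. */
typedef struct {
	const unsigned char *lineStart;	/* start of the current line buffer */
	const unsigned char *current;	/* next unread byte in that buffer */
	int ungetchIdx;			/* number of pushed-back characters */
} ReaderState;

/* Offset of the logical read position from the start of the current line.
 * If more characters were pushed back than have been consumed on this line,
 * the true position is on the previous line; the result is clamped to 0,
 * which is the inaccuracy pointed out in the review. */
static int lineOffset (const ReaderState *r)
{
	ptrdiff_t ret;

	assert (r != NULL);
	ret = (r->current - r->lineStart) - r->ungetchIdx;
	return ret >= 0 ? (int) ret : 0;
}
```

For example, with four bytes consumed and one pushed back, the offset is 3. With nothing consumed on the new line and one pushed-back character from the previous line, the computation would give -1 and is reported as 0 instead.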
Test results fail, maybe you forgot to include an args.ctags? |
Better :) LGBI |
@techee, I sent you an invitation to join the Universal-ctags organization. Can I ask you to maintain the HTML parser?
Sure. I can also maintain the Go parser as I wrote some patches for it in the past and I'm quite familiar with it. |
Should I squash the commits before it gets merged? |
The following tests passed.
|
I'm surprised that the API works well. None has tested it. I read the test case. CSS, JavaScript and HTML are included. Excellent! Thank you, @techee. From my view, a code reader, capturing a Could you merge The rest of changes about new html parser, could you use I'm quite happy if you update the Promise API (http://docs.ctags.io/en/latest/internal.html#promise-api). |
@masatake Yep, can make the changes you propose. What do you mean by
? |
What is a "reference tag"? |
I'm sorry. What I should have written is "could you use HTML as PREFIX for the commit headers?".
See http://docs.ctags.io/en/latest/news.html#reference-tags
This can be tagged as
This is just an idea. The kind and role design should be done by people who know this area well :-P I forgot to write one thing. You can add a document about the HTML parser to http://docs.ctags.io/en/latest/parsers.html .
@techee, the reference tag is just an idea for future development. Of course you can merge the current change.
Yes, I understood it that way and won't do it right now. I'll just write some documentation, push the squashed commit here so you can have a look if everything is alright and it can get merged then. |
aff7904 to d522fab
@masatake Done (without the implementation). Does it look OK like this? |
case '=':
	token->type = TOKEN_EQUAL;
	break;
For the case '>', you put { and } around the statements after the case label. However, for the case '=', you don't. Could you do the same in both (either putting the braces or not)?
About the commit log:
This is incorrect now:-) |
I pointed out two things: a style issue and an item in the commit log.
@techee, thank you for updating internal.rst. |
- independent of newline location
- adds tags for h1, h2, h3 headings
- detects comments
- offers good error recovery when run on invalid input
- faster (3.2s vs 5.5s when run on HTML boost documentation)
- reasonably easy extensibility for additional tag and attribute indexing
- JavaScript and CSS parsing using the "promise" API and dedicated parsers
@masatake Fixed. OK to merge? |
@techee, yes. |
Thank you. |
This is a new token-based HTML parser. Apart from the usual advantages like comment detection and EOL independence the main goals of creating it were:
I didn't really want to spend much time on parsing embedded javascript code so the parser just simulates the original regex for the javascript functions with all its limitations.
It should be quite easy to further extend the parser and detect additional tags and attributes.
cc @b4n