Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token-based HTML parser #1079

Merged
merged 2 commits into from Aug 27, 2016
Merged

Token-based HTML parser #1079

merged 2 commits into from Aug 27, 2016

Conversation

techee
Copy link
Contributor

@techee techee commented Aug 16, 2016

This is a new token-based HTML parser. Apart from the usual advantages like comment detection and EOL independence the main goals of creating it were:

  • support of additional tags we have in Geany (h1, h2, h3)
  • speed - for universal-ctags it brings the parsing speed of the boost library html documentation from 5.5s to 3.2s (in Geany the speed difference is even bigger because for h1, h2, h3 tags we have additional 3 regexes which make the parsing speed about 2x slow in the original regex-based parser)
  • good error recovery when run on invalid input - this is important for Geany where using real XML parser would be probably too strict during typing and would end with an error so the symbol list would be probably empty most of the time

I didn't really want to spend much time on parsing embedded javascript code so the parser just simulates the original regex for the javascript functions with all its limitations.

It should be quite easy to further extend the parser and detect additional tags and attributes.

cc @b4n

}
while (f != EOF && ! (d == '-' && e == '-' && f == '>'));

if (skipComments || f == EOF)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would skip the comment and return EOF when skipComments == TRUE and f == EOF, is that really wanted? I'd rather see returning TOKEN_COMMENT, and let the next read return TOKEN_EOF.

Copy link
Contributor Author

@techee techee Aug 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's an invalid TOKEN_COMMENT in this case as the comment wasn't terminated by --> but rather by EOF.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, so it might depend on how it's used; but here it drops the comment content if the comment is unterminated, while it doesn't if it is.

Copy link
Contributor Author

@techee techee Aug 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't matter. I introduced TOKEN_COMMENT because I was fighting with cases like

<foo>abc<!--comment-->def<!--comment-->ghi</bar>

Inside readTokenText() I don't know whether the < is tag start or comment start but at this point I cannot read more characters because I'd need multi-level-ungetc() which we don't have. On the other hand if the following readToken() just silently skipped the comment, I wouldn't know I can expect another text behind it.

If the comment token is at the end of input, it doesn't matter much if TOKEN_COMMENT or EOF is returned because no other text is behind it.

@b4n
Copy link
Member

b4n commented Aug 16, 2016

Looks mostly good at first read.

@b4n b4n added the Parsers label Aug 16, 2016
vStringDelete (text);
}

static void findHtmlTags ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing void in the argument list :)

@masatake
Copy link
Member

@techee, I have a request about the new html parser.

Currently the html parser tags javascript functions embedded in a html file.
Could you remove the code for tagging them from your parser?
Instead, I would like you to use real Javascript parser implemented in javascript.c.
Promise API (http://docs.ctags.io/en/latest/internal.html#promise-api) I designed and implmented can be uesd for run a subparser(Javascript) from a parser(html). yacc.c is a good example to know the Promies API.

This is applicable to inline stylesheets; I would like you to run CSS parser from your html parser.

@b4n
Copy link
Member

b4n commented Aug 16, 2016

About CDATA, I propose this test case:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>CDATA tests</title>
</head>
<body>
  <div>

  <h1><![CDATA[Pears & apples]]></h1>
  <![CDATA[<h1>Bug</h1>]]>
  <![CDATA[<h1>]]>
  <h1>Cherries</h1>
  <h1>Strawberries</h1>

  </div>
</body>
</html>

However, it might not be so important to support <![CDATA[ as it's apparently not valid in HTML, but only in XML and SGML (which includes XHTML served as XML, but not XHTML served as text/html). And actually if you just open this file in Firefox or Chromium they won't display the expected content.
So it might not be worth the work implementing <![CDATA[ support for true XHTML.

@techee
Copy link
Contributor Author

techee commented Aug 16, 2016

However, it might not be so important to support <![CDATA[ as it's apparently not valid in HTML, but only in XML and SGML (which includes XHTML served as XML, but not XHTML served as text/html). And actually if you just open this file in Firefox or Chromium they won't display the expected content.

In this case I'll probably skip the support of CDATA for the first version of the parser - it can be done later.

@techee
Copy link
Contributor Author

techee commented Aug 16, 2016

@masatake Sure, will have a look at it.

@techee
Copy link
Contributor Author

techee commented Aug 16, 2016

@masatake Just had a brief look at the Promise API and have two questions:

  • Is it possible to specify position on the line where the input for the embedded language starts? E.g. can it handle something like:
<script>function foo(){}</script>

where both languages are mixed on the same line?

  • Can the host language pre-process the input for the embedded language? Within the <script> block there could be a HTML comment removing parts of the javascript code.

@masatake
Copy link
Member

masatake commented Aug 17, 2016

Is it possible to specify position on the line where the input for the embedded language starts? E.g. can it handle something like:...

Yes.

int  makePromise   (const char *parser,
                    unsigned long startLine, int startCharOffset,
                    unsigned long endLine, int endCharOffset,
                    unsigned long sourceLineOffset);

You can specify it to startCharOffset.
The byte offset counted from the start of line should be specified to startCharOffset.
(Not from the start of file.)
For specifying the end position of the area, you can use endCharOffset.

Can the host language pre-process the input for the embedded language? Within the <script> block there could be a HTML comment removing parts of the javascript code.

No. I would like to learn more about this topic. Could you show an example?

@b4n
Copy link
Member

b4n commented Aug 17, 2016

Pre-processing could work, but it would require a fairly more subtle API, writing to a (memory) stream and then using that as the input for the next parser. That could work just fine, but it might require some non-trivial changes I imagine. Though probably not much more than what would be required to have a full in-memory library API :)

@techee
Copy link
Contributor Author

techee commented Aug 17, 2016

I was thinking about something simple like the possibility for the hosting parser to register a callback like

boolean filter_character(int c, void *user_data);

which would decide whether the input character should be passed to the embedded language's parser or not. But after thinking about it we couldn't use it in this form for HTML because we learn whether something is a comment too late (after reading 4 characters) and we couldn't ungetc() them so the embedded parser can read them.

Anyway, not parsing HTML comments within <script> isn't a big problem, everyone's using javascript comments there.

@masatake
Copy link
Member

As an experiment I'm adding two hooks to rearrange input(char or string) passed to a parser.
They change the result of calling getcFromInputFile and readLineFromInputFile.

@masatake
Copy link
Member

input-text-stream

There is a weakness; the regex match is run after "char preprocessing" is done. Ideally regex match should be run before the char level preprocessing.

@masatake
Copy link
Member

masatake commented Aug 18, 2016

Anyway, not parsing HTML comments within <script> isn't a big problem, everyone's using javascript comments there.

I see. I will not implmenet char based preprocessor for awhile. (I will implement line based preprocessor for different purpose.)

I you have a question about makePromise, fell free to ask me. None has eaten the dog food ever.
If it is seriously trouble-full, change the role with me. I will try to add makePromise with your hints about new html parser.

@techee
Copy link
Contributor Author

techee commented Aug 20, 2016

@masatake Done (and it works, cool!). I added a new function to read.c to get the offset from the beginning of the line because computing it in the parser would be annoying (and any other parser using the promise API would have to do the same). Please let me know if it's OK this way or if it should be modified some way.

@b4n Hopefully all your comments are addressed. If the patches are OK, I could squash some of the fixes to the initial commit.

{
unsigned char *base = (unsigned char *) vStringValue (File.line);
int ret = File.currentLine - base - File.ungetchIdx;
return ret >= 0 ? ret : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately if ungetchIdx is greater than File.currentLine - base (i.e. is on the previous line), both getInputLineNumber() and getInputLineOffset() will be wrong…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I noticed that too but it shouldn't be a practical problem as one typically ungec()s either real code characters (new token) or whitespace characters (EOLs to stay at the previous line) but not both of them at the same time. So there shouldn't be any forgotten non-whitespace characters from the previous line and we'll only miss some whitespace characters which don't matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it's unlikely to really matter (even more so as it's likely to be 1 byte at most), but we never know until it bites :)

Anyway, not a blocker, it's a mere comment on a current limitation.

@b4n
Copy link
Member

b4n commented Aug 20, 2016

Test results fail, maybe you forgot to include an args.ctags?

@b4n
Copy link
Member

b4n commented Aug 20, 2016

Better :) LGBI

@masatake
Copy link
Member

@techee, I sent an invitation for joining to Universal-ctags organization. Can I ask you to maintain html parser?

@techee
Copy link
Contributor Author

techee commented Aug 21, 2016

@techee, I sent an invitation for joining to Universal-ctags organization. Can I ask you to maintain html parser?

Sure. I can also maintain the Go parser as I wrote some patches for it in the past and I'm quite familiar with it.

@techee
Copy link
Contributor Author

techee commented Aug 21, 2016

Should I squash the commits before it gets merged?

@masatake
Copy link
Member

Following tests are passed.

make units VG=1 LANGUAGES=HTML
make chop VG=1 LANGUAGES=HTML
make fuzz LANGUAGES=HTML VG=1
make noise LANGUAGES=HTML 

@masatake
Copy link
Member

I'm surprised that the API works well. None has tested it.

I read the test case. CSS, JavaScript and HTML are included. Excellent! Thank you, @techee.
@b4n, thank you for reviewing the changes.

From my view, a code reader, capturing a <link> as a reference tag will be next step.

Could you merge c90cd07 first? When merging c90cd07, could you change the prefix of commit title from read to main?

The rest of changes about new html parser, could you use HTML?
They should be squashed into one. In addition could you write about docs/news.html?
The new parser should be categorized to Fully rewritten parsers.

I'm quite happy if you update the Promise API (http://docs.ctags.io/en/latest/internal.html#promise-api).

@techee
Copy link
Contributor Author

techee commented Aug 21, 2016

@masatake Yep, can make the changes you propose. What do you mean by

The rest of changes about new html parser, could you use HTML?

?

@techee
Copy link
Contributor Author

techee commented Aug 21, 2016

From my view, a code reader, capturing a as a reference tag will be next step.

What is a "reference tag"?

@masatake
Copy link
Member

@masatake Yep, can make the changes you propose. What do you mean by

The rest of changes about new html parser, could you use HTML?
?

I'm sorry. What I should write is "could oyu use HTML as PREFIX for the commit headers?".
e.g. HTML: Return TOKEN_COMMENT also when EOF is reached.

What is a "reference tag"?

See http://docs.ctags.io/en/latest/news.html#reference-tags

<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head> 

This can be tagged as

theme.css ... /^<link rel="stylesheet" type="text/css" href="theme.css/;" kind:link role:stylesheet

This is just an idea. The kind and role design should be done by people who knows well this area:-P

I forgot writing one thing. You can add a document about the HTML parser to http://docs.ctags.io/en/latest/parsers.html .

@masatake
Copy link
Member

@techee, the reference tag is just an idea for future development. Of couse you can merge the current change.

@techee
Copy link
Contributor Author

techee commented Aug 23, 2016

@techee, the reference tag is just an idea for future development. Of couse you can merge the current change.

Yes, I understood it that way and won't do it right now. I'll just write some documentation, push the squashed commit here so you can have a look if everything is alright and it can get merged then.

@techee techee force-pushed the html_parser branch 2 times, most recently from aff7904 to d522fab Compare August 25, 2016 14:41
@techee
Copy link
Contributor Author

techee commented Aug 25, 2016

@masatake Done (without the implementation). Does it look OK like this?

case '=':
token->type = TOKEN_EQUAL;
break;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the case '>', you put { and } around the statements after the case label.
However, the case '=', you don't. Could you do the same (putting or not putting)?

@masatake
Copy link
Member

About the commit log:

- javascript function parsing is still based on regular expression
      matching with all its limitations

This is incorrect now:-)

@masatake
Copy link
Member

I pointed two, a style issue and an item in the commit log.
Other than the two, LGTM.

@masatake
Copy link
Member

@techee, thank you for updating internal.rst.

- independent of newline location
- adds tags for h1, h2, h3 headings
- detects comments
- offers good error recovery when run on invalid input
- faster (3.2s vs 5.5s when run on HTML boost documentation)
- reasonably easy extensibility for additional tag and attribute
  indexing
- JavaScript and CSS parsing using the "promise" API and
  dedicated parsers
@techee
Copy link
Contributor Author

techee commented Aug 27, 2016

@masatake Fixed. OK to merge?

@masatake
Copy link
Member

@techee, yes.

@techee techee merged commit 55d0686 into universal-ctags:master Aug 27, 2016
@masatake
Copy link
Member

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants