Token-based HTML parser #1079

techee · 2016-08-16T19:45:26Z

This is a new token-based HTML parser. Apart from the usual advantages like comment detection and EOL independence the main goals of creating it were:

support of additional tags we have in Geany (h1, h2, h3)
speed - for universal-ctags it brings the parsing speed of the boost library html documentation from 5.5s to 3.2s (in Geany the speed difference is even bigger because for h1, h2, h3 tags we have additional 3 regexes which make the parsing speed about 2x slow in the original regex-based parser)
good error recovery when run on invalid input - this is important for Geany where using real XML parser would be probably too strict during typing and would end with an error so the symbol list would be probably empty most of the time

I didn't really want to spend much time on parsing embedded javascript code so the parser just simulates the original regex for the javascript functions with all its limitations.

It should be quite easy to further extend the parser and detect additional tags and attributes.

cc @b4n

b4n · 2016-08-16T20:03:51Z

parsers/html.c

+						}
+						while (f != EOF && ! (d == '-' && e == '-' && f == '>'));
+
+						if (skipComments || f == EOF)


this would skip the comment and return EOF when skipComments == TRUE and f == EOF, is that really wanted? I'd rather see returning TOKEN_COMMENT, and let the next read return TOKEN_EOF.

But it's an invalid TOKEN_COMMENT in this case as the comment wasn't terminated by --> but rather by EOF.

sure, so it might depend on how it's used; but here it drops the comment content if the comment is unterminated, while it doesn't if it is.

It shouldn't matter. I introduced TOKEN_COMMENT because I was fighting with cases like

<foo>abcdefghi</bar>

Inside readTokenText() I don't know whether the < is tag start or comment start but at this point I cannot read more characters because I'd need multi-level-ungetc() which we don't have. On the other hand if the following readToken() just silently skipped the comment, I wouldn't know I can expect another text behind it.

If the comment token is at the end of input, it doesn't matter much if TOKEN_COMMENT or EOF is returned because no other text is behind it.

b4n · 2016-08-16T20:22:41Z

Looks mostly good at first read.

b4n · 2016-08-16T20:32:32Z

parsers/html.c

+		vStringDelete (text);
+}
+
+static void findHtmlTags ()


missing void in the argument list :)

masatake · 2016-08-16T20:59:42Z

@techee, I have a request about the new html parser.

Currently the html parser tags javascript functions embedded in a html file.
Could you remove the code for tagging them from your parser?
Instead, I would like you to use real Javascript parser implemented in javascript.c.
Promise API (http://docs.ctags.io/en/latest/internal.html#promise-api) I designed and implmented can be uesd for run a subparser(Javascript) from a parser(html). yacc.c is a good example to know the Promies API.

This is applicable to inline stylesheets; I would like you to run CSS parser from your html parser.

b4n · 2016-08-16T21:08:02Z

About CDATA, I propose this test case:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>CDATA tests</title>
</head>
<body>
  <div>

  <h1><![CDATA[Pears & apples]]></h1>
  <![CDATA[<h1>Bug</h1>]]>
  <![CDATA[<h1>]]>
  <h1>Cherries</h1>
  <h1>Strawberries</h1>

  </div>
</body>
</html>

However, it might not be so important to support <![CDATA[ as it's apparently not valid in HTML, but only in XML and SGML (which includes XHTML served as XML, but not XHTML served as text/html). And actually if you just open this file in Firefox or Chromium they won't display the expected content.
So it might not be worth the work implementing <![CDATA[ support for true XHTML.

techee · 2016-08-16T21:20:15Z

However, it might not be so important to support <![CDATA[ as it's apparently not valid in HTML, but only in XML and SGML (which includes XHTML served as XML, but not XHTML served as text/html). And actually if you just open this file in Firefox or Chromium they won't display the expected content.

In this case I'll probably skip the support of CDATA for the first version of the parser - it can be done later.

techee · 2016-08-16T21:20:59Z

@masatake Sure, will have a look at it.

techee · 2016-08-16T21:52:18Z

@masatake Just had a brief look at the Promise API and have two questions:

Is it possible to specify position on the line where the input for the embedded language starts? E.g. can it handle something like:

<script>function foo(){}</script>

where both languages are mixed on the same line?

Can the host language pre-process the input for the embedded language? Within the <script> block there could be a HTML comment removing parts of the javascript code.

masatake · 2016-08-17T01:44:05Z

Is it possible to specify position on the line where the input for the embedded language starts? E.g. can it handle something like:...

Yes.

int  makePromise   (const char *parser,
                    unsigned long startLine, int startCharOffset,
                    unsigned long endLine, int endCharOffset,
                    unsigned long sourceLineOffset);

You can specify it to startCharOffset.
The byte offset counted from the start of line should be specified to startCharOffset.
(Not from the start of file.)
For specifying the end position of the area, you can use endCharOffset.

Can the host language pre-process the input for the embedded language? Within the <script> block there could be a HTML comment removing parts of the javascript code.

No. I would like to learn more about this topic. Could you show an example?

b4n · 2016-08-17T12:11:15Z

Pre-processing could work, but it would require a fairly more subtle API, writing to a (memory) stream and then using that as the input for the next parser. That could work just fine, but it might require some non-trivial changes I imagine. Though probably not much more than what would be required to have a full in-memory library API :)

techee · 2016-08-17T14:33:16Z

I was thinking about something simple like the possibility for the hosting parser to register a callback like

boolean filter_character(int c, void *user_data);

which would decide whether the input character should be passed to the embedded language's parser or not. But after thinking about it we couldn't use it in this form for HTML because we learn whether something is a comment too late (after reading 4 characters) and we couldn't ungetc() them so the embedded parser can read them.

Anyway, not parsing HTML comments within <script> isn't a big problem, everyone's using javascript comments there.

masatake · 2016-08-18T05:13:46Z

As an experiment I'm adding two hooks to rearrange input(char or string) passed to a parser.
They change the result of calling getcFromInputFile and readLineFromInputFile.

masatake · 2016-08-18T05:53:07Z

There is a weakness; the regex match is run after "char preprocessing" is done. Ideally regex match should be run before the char level preprocessing.

masatake · 2016-08-18T06:01:36Z

Anyway, not parsing HTML comments within <script> isn't a big problem, everyone's using javascript comments there.

I see. I will not implmenet char based preprocessor for awhile. (I will implement line based preprocessor for different purpose.)

I you have a question about makePromise, fell free to ask me. None has eaten the dog food ever.
If it is seriously trouble-full, change the role with me. I will try to add makePromise with your hints about new html parser.

techee · 2016-08-20T20:59:32Z

@masatake Done (and it works, cool!). I added a new function to read.c to get the offset from the beginning of the line because computing it in the parser would be annoying (and any other parser using the promise API would have to do the same). Please let me know if it's OK this way or if it should be modified some way.

@b4n Hopefully all your comments are addressed. If the patches are OK, I could squash some of the fixes to the initial commit.

b4n · 2016-08-20T21:06:44Z

main/read.c

+{
+	unsigned char *base = (unsigned char *) vStringValue (File.line);
+	int ret = File.currentLine - base - File.ungetchIdx;
+	return ret >= 0 ? ret : 0;


unfortunately if ungetchIdx is greater than File.currentLine - base (i.e. is on the previous line), both getInputLineNumber() and getInputLineOffset() will be wrong…

Right, I noticed that too but it shouldn't be a practical problem as one typically ungec()s either real code characters (new token) or whitespace characters (EOLs to stay at the previous line) but not both of them at the same time. So there shouldn't be any forgotten non-whitespace characters from the previous line and we'll only miss some whitespace characters which don't matter.

Agreed it's unlikely to really matter (even more so as it's likely to be 1 byte at most), but we never know until it bites :)

Anyway, not a blocker, it's a mere comment on a current limitation.

b4n · 2016-08-20T21:22:57Z

Test results fail, maybe you forgot to include an args.ctags?

b4n · 2016-08-20T21:32:01Z

Better :) LGBI

masatake · 2016-08-21T13:00:36Z

@techee, I sent an invitation for joining to Universal-ctags organization. Can I ask you to maintain html parser?

techee · 2016-08-21T13:12:52Z

@techee, I sent an invitation for joining to Universal-ctags organization. Can I ask you to maintain html parser?

Sure. I can also maintain the Go parser as I wrote some patches for it in the past and I'm quite familiar with it.

techee · 2016-08-21T13:15:23Z

Should I squash the commits before it gets merged?

masatake · 2016-08-21T13:32:07Z

Following tests are passed.

make units VG=1 LANGUAGES=HTML
make chop VG=1 LANGUAGES=HTML
make fuzz LANGUAGES=HTML VG=1
make noise LANGUAGES=HTML

masatake · 2016-08-21T13:51:08Z

I'm surprised that the API works well. None has tested it.

I read the test case. CSS, JavaScript and HTML are included. Excellent! Thank you, @techee.
@b4n, thank you for reviewing the changes.

From my view, a code reader, capturing a <link> as a reference tag will be next step.

Could you merge c90cd07 first? When merging c90cd07, could you change the prefix of commit title from read to main?

The rest of changes about new html parser, could you use HTML?
They should be squashed into one. In addition could you write about docs/news.html?
The new parser should be categorized to Fully rewritten parsers.

I'm quite happy if you update the Promise API (http://docs.ctags.io/en/latest/internal.html#promise-api).

techee · 2016-08-21T17:48:28Z

@masatake Yep, can make the changes you propose. What do you mean by

The rest of changes about new html parser, could you use HTML?

?

techee · 2016-08-21T17:53:29Z

From my view, a code reader, capturing a as a reference tag will be next step.

What is a "reference tag"?

masatake · 2016-08-22T00:54:30Z

@masatake Yep, can make the changes you propose. What do you mean by

The rest of changes about new html parser, could you use HTML?
?

I'm sorry. What I should write is "could oyu use HTML as PREFIX for the commit headers?".
e.g. HTML: Return TOKEN_COMMENT also when EOF is reached.

What is a "reference tag"?

See http://docs.ctags.io/en/latest/news.html#reference-tags

<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>

This can be tagged as

theme.css ... /^<link rel="stylesheet" type="text/css" href="theme.css/;" kind:link role:stylesheet

This is just an idea. The kind and role design should be done by people who knows well this area:-P

I forgot writing one thing. You can add a document about the HTML parser to http://docs.ctags.io/en/latest/parsers.html .

masatake · 2016-08-22T22:54:22Z

@techee, the reference tag is just an idea for future development. Of couse you can merge the current change.

techee · 2016-08-23T07:39:30Z

@techee, the reference tag is just an idea for future development. Of couse you can merge the current change.

Yes, I understood it that way and won't do it right now. I'll just write some documentation, push the squashed commit here so you can have a look if everything is alright and it can get merged then.

techee · 2016-08-25T15:11:40Z

@masatake Done (without the implementation). Does it look OK like this?

masatake · 2016-08-27T00:07:14Z

parsers/html.c

+		case '=':
+			token->type = TOKEN_EQUAL;
+			break;
+


For the case '>', you put { and } around the statements after the case label.
However, the case '=', you don't. Could you do the same (putting or not putting)?

masatake · 2016-08-27T00:09:40Z

About the commit log:

- javascript function parsing is still based on regular expression
      matching with all its limitations

This is incorrect now:-)

masatake · 2016-08-27T00:10:34Z

I pointed two, a style issue and an item in the commit log.
Other than the two, LGTM.

masatake · 2016-08-27T00:11:04Z

@techee, thank you for updating internal.rst.

- independent of newline location - adds tags for h1, h2, h3 headings - detects comments - offers good error recovery when run on invalid input - faster (3.2s vs 5.5s when run on HTML boost documentation) - reasonably easy extensibility for additional tag and attribute indexing - JavaScript and CSS parsing using the "promise" API and dedicated parsers

techee · 2016-08-27T11:47:52Z

@masatake Fixed. OK to merge?

masatake · 2016-08-27T14:36:06Z

@techee, yes.

masatake · 2016-08-29T15:58:52Z

Thank you.

b4n reviewed Aug 16, 2016
View reviewed changes

b4n added the Parsers label Aug 16, 2016

b4n reviewed Aug 16, 2016
View reviewed changes

parsers/html.c

vStringDelete (text);

}

static void findHtmlTags ()

Copy link

Member

b4n Aug 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing void in the argument list :)

b4n reviewed Aug 20, 2016
View reviewed changes

techee force-pushed the html_parser branch 2 times, most recently from aff7904 to d522fab Compare August 25, 2016 14:41

masatake reviewed Aug 27, 2016
View reviewed changes

techee added 2 commits August 27, 2016 13:30

docs: Describe getInputLineOffset()

9580539

techee force-pushed the html_parser branch from d522fab to 9580539 Compare August 27, 2016 11:33

techee merged commit 55d0686 into universal-ctags:master Aug 27, 2016

Token-based HTML parser #1079

Token-based HTML parser #1079

Conversation

techee commented Aug 16, 2016

b4n Aug 16, 2016

Choose a reason for hiding this comment

techee Aug 16, 2016 • edited by b4n

Choose a reason for hiding this comment

b4n Aug 16, 2016

Choose a reason for hiding this comment

techee Aug 16, 2016 • edited

Choose a reason for hiding this comment

b4n commented Aug 16, 2016

b4n Aug 16, 2016

Choose a reason for hiding this comment

masatake commented Aug 16, 2016

b4n commented Aug 16, 2016 • edited

techee commented Aug 16, 2016

techee commented Aug 16, 2016

techee commented Aug 16, 2016 • edited

masatake commented Aug 17, 2016 • edited

b4n commented Aug 17, 2016

techee commented Aug 17, 2016

masatake commented Aug 18, 2016

masatake commented Aug 18, 2016

masatake commented Aug 18, 2016 • edited

techee commented Aug 20, 2016

b4n Aug 20, 2016

Choose a reason for hiding this comment

techee Aug 20, 2016

Choose a reason for hiding this comment

b4n Aug 20, 2016

Choose a reason for hiding this comment

b4n commented Aug 20, 2016

b4n commented Aug 20, 2016

masatake commented Aug 21, 2016

techee commented Aug 21, 2016

techee commented Aug 21, 2016

masatake commented Aug 21, 2016

masatake commented Aug 21, 2016

techee commented Aug 21, 2016

techee commented Aug 21, 2016

masatake commented Aug 22, 2016

masatake commented Aug 22, 2016

techee commented Aug 23, 2016

techee commented Aug 25, 2016

masatake Aug 27, 2016

Choose a reason for hiding this comment

masatake commented Aug 27, 2016

masatake commented Aug 27, 2016

masatake commented Aug 27, 2016

techee commented Aug 27, 2016

masatake commented Aug 27, 2016

masatake commented Aug 29, 2016

techee Aug 16, 2016 •

edited by b4n

techee Aug 16, 2016 •

edited

b4n commented Aug 16, 2016 •

edited

techee commented Aug 16, 2016 •

edited

masatake commented Aug 17, 2016 •

edited

masatake commented Aug 18, 2016 •

edited