Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow mio_new_mio() to have size 0 #1951

Merged
merged 2 commits into from Dec 25, 2018
Merged

Conversation

techee
Copy link
Contributor

@techee techee commented Dec 15, 2018

Consider for instance the following HTML code where a subparser is called
on the contents of the <script> tags.

<script language="Javascript" src="./test.js"></script>
<script language="JavaScript">
...
</script>

Since the first <script> is empty, mio_new_mio() is called with 0 size
but with the current semantics of the API it creates MIO until the end of
the file so the whole source file to EOF is parsed. While it seems
no extra tags are generated in this case, the parsing happens
unnecessarily.

Instead of using 0, make size signed and use -1 when we want to create a
MIO till the end of the source MIO.


I actually noticed this problem in Geany rather than in ctags where tags started getting duplicated because of the multiple parsing. I haven't investigated enough to learn why there aren't duplicate tags in ctags in this case but I added some traces to the parsers and the extra parsing really happens when 0 is passed as the size in io_new_mio().

Also, I think it's cleaner to allow 0-sized MIO from the API point of view,

Consider for instance the following HTML code where a subparser is called
on the contents of the <script> tags.

<script language="Javascript" src="./test.js"></script>
<script language="JavaScript">
...
</script>

Since the first <script> is empty, mio_new_mio() is called with 0 size
but with the current semantics of the API it creates MIO until the end of
the file so the whole source file to EOF is parsed. While it seems
no extra tags are generated in this case, the parsing happens
unnecessarily.

Instead of using 0, make size signed and use -1 when we want to create a
MIO till the end of the source MIO.
@coveralls
Copy link

coveralls commented Dec 15, 2018

Coverage Status

Coverage decreased (-0.007%) to 84.882% when pulling ee887ea on techee:mio_new_mio into 64f7d61 on universal-ctags:master.

@techee
Copy link
Contributor Author

techee commented Dec 15, 2018

Funny - notice how coverage of the javascript parser went down because of this patch. It's probably because a lot of HTML was parsed as javascript before and these code paths aren't covered now.

techee added a commit to techee/geany that referenced this pull request Dec 15, 2018
Copy link
Member

@b4n b4n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, and I totally agree that there's no reason not to allow a 0-sized MIO object here; and empty stream is a perfectly valid stream.

Other changes look good, but for the comment about the loop.

last = tmp;
if (last)
endCharOffset = strlen ((const char *)last);
endCharOffset = strlen ((const char *)tmp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trickyness of that loop might warrant a comment for nobody to come by it and "improve" it like it was before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done.

Though if anyone tries to make such an "improvement" the "bom" test should fail because of the change in MIO - but of course the tests can change in the future and the comment definitely doesn't hurt as this was totally not obvious to me when I started looking at the code.

The change in mio_new_mio() exposed a problem in yacc parser where
epilogue parsing didn't work correctly and the reason it worked was
only thanks to the strange mio_new_mio() semantics.

First, in runYaccParser() the line shouldn't be read because this is
the first (and possibly last) line of epilogue and this way it gets
ignored and offsets for the promise API aren't calculated for it.
Because of this change, we have to add 1 to c_start and c_source_start
inside make_promise_for_epilogue() so we get real start of epilogue.

Second, the loop

while ((tmp = readLineFromInputFile ()))
	last = tmp;

didn't do what the author thought it was doing. The idea behind it
probably was to "remember" the last line and subsequently to compute
its length. This however doesn't work this way - because
readLineFromInputFile() returns a const pointer into an internal
buffer inside ctags main, the last time readLineFromInputFile()
is called and when it returns NULL this value is cleared so no matter
what the "last" variable contained, it would always be an empty string
after the loop finishes.

This patch just moves the calculation of endCharOffset inside the loop
where the line value is still valid. It means some unnecessary strlen()
calculation of all lines in the epilogue but I don't think it has a
big overall impact on the performance.

The unfortunate side-effect of this patch is that if a tag appears
on the last line of the file, we can't say whether the line ends with
\n or not because readLineFromInputFile() silently discards them. This
means that the test in which there's a tag at the last line doesn't
contain $ in the pattern. This problem didn't appear before the change
in MIO because thanks to the various bugs in the yacc parser, the start
line was equal to the end line and both offsets were 0 so the size
for mio_new_mio() was 0 and it was created till the end of the file.
@masatake
Copy link
Member

I will review this to understand how I did wrong.

@masatake
Copy link
Member

LGTM. Sorry for taking your time for fixing my broken code.

@masatake
Copy link
Member

BTW, I would like ctags to make a tag for ./test.js. It can be captured as a reference tag.

Since I introduced reference tag, I have tried to capture language objects specified with import, include, require, source, etc. as reference tags without no serious use case.

However, now I find a good and critical use case.
New option -r, semantic-recursion. This is similar to -R. -R makes ctags parse given directory tree recursively. -r makes ctags parse import-include-require-source relation recursively.

Let show me example (foo.html):

<html>
  <body>
    <script language="Javascript" src="test.js"/>
    <script language="Javascript" src="http://example.com/bar.js"/>
    </body>
</html>

For the about input, run ctags with "-r -rpath . foo.html".
ctags parsrs foo.html. Then it parsers test.js in the current directory.
Finally ctags does download http://example.com/bar.js and parser the downloaded file.

techee added a commit to techee/geany that referenced this pull request Dec 17, 2018
@masatake masatake merged commit 5728abe into universal-ctags:master Dec 25, 2018
elextr pushed a commit to geany/geany that referenced this pull request Apr 6, 2019
…gs (#2018)

* Use latest version of htable

* Use latest version of mio

* Use latest version of objpool

* Use latest version of ptrarray

* Use latest version of vstring

This also requires adding trashbox.c/h which is now used by vstring and
inline macros from inline.h.

* Rename fieldSpec to fieldDefinition

See b56bd065123d69087acd6f202499d71a86a7ea7a upstream.

* Rename kindOption to kindDefinition

See e112e8ab6e0933b5bd7922e0dfb969b1f28c60fa upstream

* Rename kinds field in parserDefinition to kindTable

See 09ae690face8b5cde940e2d7cf40f8860381067b upstream.

* Rename structure fields about field in parserDefinition

See a739fa5fb790bc349a66b2bee0bf42cf289994e8 upstream.

* Use kindIndex instead of kindDefinition

This patch replaces kindDefinition related entries from sTagEntryInfo
with kindIndex so kinds are referenced indirectly using the index. For
more info please refer to commits:

16a2541c0698bd8ee03c1be8172ef3191f6e695a
f92e6bf2aeb21fd6b04756487f98d0eefa16d9ce

Some other changes had to be made to make the sources compile (without
bringing all the diffs from upstream). At some places, which aren't used
by Geany, only stub implementations have been created.

In particular, the regex parser has been disabled (for now?) because its
current implementation doesn't allow accessing kindDefinitions using
index and allowing this would require big changes in its implementation.
The affected parsers are Cobol, ActionScript and HTML. For HTML we can
use the token-based parser from upstream, and we should consider
whether Cobol and ActionScript are worth the effort to maintain a separate
regex implementation using GRegex (IMO these languages are dead enough
not to justify the extra effort).

The patch also disables tests for languages using regex parsers.

* Rename roleDesc to roleDefinition

See 1345725842c196cc0523ff60231192bcd588961b upstream. Since we don't care
about roles in Geany, we don't have to do the additional stuff the upstream
patch does.

* Add XTAG_ANONYMOUS used by jscript

See 0e4c5d4a0461bc8d9616fe3b97d75b91d014246e upstream.

* Include stdint.h in entry.h

* Don't use hash value as an Anonymous field identifier

Instead of something like "Anonymous0ab283cd9402" use sequential integer
values like "Anonymous1".

* Call anonReset in main part

See 3c91b1ea509df238feb86c9cbd552b621e462653 upstream.

* Use upstream javascript parser

* Use upstream css parser

* Create correctly sized MIO for 0 size

See universal-ctags/ctags#1951

* Always enable promise API and subparsers for Geany

* Support subparsers in Geany and add HTML parser demonstrating this feature

This feature requires several changes:

1. Propagating language of the tag from ctags to Geany so we know whether
the tag comes from a master parser or a subparser.

2. We need to address the problem that tag types from a subparsers can
clash with tag types from master parsers or other subparsers used by the
master parser. For instance, HTML and both its css and javascript
subparsers use tm_tag_class_t but HTML uses it for <h2> headings, and
css and javascript for classes. Representing all of them using
tm_tag_class_t would lead to complete mess where all of these types would
for instance be listed in the same branch of the tree in the sidebar.

To avoid this problem, this patch adds another mapping for subparsers where
each tag type can be mapped to another tag type (which isn't used neither
by master parser or other subparsers). To avoid unwanted clashes with other
parsers, only tags explicitly mentioned in such mappings are added to tag
manager; other subparser tags are discarded.

For HTML this patch introduces mapping only for tm_tag_function_t (which
in this case maps to the same type) to mimick the previous HTML parser
behavior but other javascript and css tag types can be added this way
in the future too.

3. Since in most of the code Geany and tag manager assume that tags from
one file use the same language, subparser's tags are modified to have the
same language like the master parser.

4. HTML parser itself was copied from upstream without any modifications.
Tests were fixed as the parser now correctly ignores comments.

* Rename truncateLine field of tagEntryInfo

See 0e70b22791877322598f03ecbe3eb26a6b661001 upstream. Needed for Fortran
parser.

* Add dummy mbcs.h and trace.h

Included by javascript parser.

* Introduce an accessor to `locate' field of `Option'

See fb5ef68859f71ff2949f1d9a7cab7515f523532f upstream. Needed for Fortran.

* Add numarray.c/h

Needed by various parsers.

* Add getLanguageForFilename() and getLanguageForCommand()

See

416c5e6b8807feaec318d7f8addbb4107370c187
334e072f9d6d9954ebd3eb89bbceb252c20ae9dd

upstream. Needed for Sh parser.

* txt2tags: Fix scope separator definition and re-enable tests

* Rename rest.c to rst.c to match upstream filename

* Use upstream asciidoc and rst parsers

* Add asciidoc and rst unit tests

* Rename conf.c to iniconf.c to match upstream filename

* Add tests of conf, diff, md parsers from universal ctags

* Add more ctags unit tests

This patch adds unit tests for: nsis, docbook, haskell, haxe, abaqus, vala,
abc.

The only missing unit tests are for GLSL and Ferite parsers which
however share the implementation with the C parser and should be
reasonably well covered by other C-like language tests.

The tests were put together from various tutorials and help of the
languages in order to cover the tags these parsers generate. No guarantee
they'd compile with real parsers.

* Rename latex.c to tex.c to match upstream filename

* Rename entry points of parsers to match upstream names

* Initialize trashbox

* Add newline to the end of file
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants