
<ref> elements (and probably other HTML-like tags) inside list items can seemingly contain newlines #86

Open
kristian-clausal opened this issue Aug 11, 2023 · 8 comments

Comments

@kristian-clausal
Collaborator

This is annoying, because of the structure of our parser.

If we have the source (from comprise/English):

# {{...}} To [[compose]]; to [[constitute]].<ref group="usage">Traditionally, the whole comprised its parts, ... an increasingly frequent and accepted usage.</ref><ref group="usage">In the passive voice, ... in this sense always requires {{m|en|of}}).

</ref> {{defdate|from the late 18th c.}}
#: {{ux|en|The whole is '''comprised''' of the parts.}}

the "from the late 18th c." template is still part of the same line as the preceding list item. If we do this:

# {{...}} To [[compose]]; to [[constitute]].<ref group="usage">Traditionally, the whole comprised its parts, ... an increasingly frequent and accepted usage.</ref><ref group="usage">In the passive voice, ... in this sense always requires {{m|en|of}}).</ref>
{{defdate|from the late 18th c.}}
#: {{ux|en|The whole is '''comprised''' of the parts.}}

with a newline before the defdate template, it behaves as expected: the "from the late 18th c." text goes on a new line and breaks the list into two new lists.

The problem is, as usual, that the way we have to parse wikitext is different from what happens when wikitext is processed. We need to keep some data that is discarded during processing, like the fact that templates exist at all, while processing (probably) just involves a ton of template-expansion passes, with the final product parsed afterwards. We do parsing in between, which breaks a ton of HTML-tag-related stuff, because tags can appear out of templates whenever those want to generate new structure.
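
As a toy illustration of that ordering problem (the template name and body here are entirely made up), a <ref> that is balanced after expansion can be unbalanced in the raw source, so counting tags before expansion gives the wrong answer:

```python
# Hypothetical template whose body contains a closing tag; MediaWiki
# only ever sees balanced tags because it expands templates first.
TEMPLATES = {"close-note": "rest of the note.</ref>"}

source = 'Definition.<ref group="usage">A note {{close-note}} more text'

def expand(text: str) -> str:
    # Crude stand-in for MediaWiki's template-expansion passes.
    for name, body in TEMPLATES.items():
        text = text.replace("{{" + name + "}}", body)
    return text

print(source.count("<ref"), source.count("</ref>"))      # 1 0: unbalanced
expanded = expand(source)
print(expanded.count("<ref"), expanded.count("</ref>"))  # 1 1: balanced
```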

This needs to be hacked together to get lists working again; I have a vague idea, and hopefully it's as simple as it is vague.

@kristian-clausal kristian-clausal transferred this issue from tatuylonen/wiktextract Aug 11, 2023
@kristian-clausal
Collaborator Author

3952eda

This is a partial fix.

If you have something like:

# Line 1, <ref> ref 1...
</ref> line 1 continues
# Line 2...

The </ref> end tag no longer breaks the list into two, as it did before, even though it is at the start of a line.

HOWEVER, this doesn't fix this:

# Line 1, <ref> ref 1...

</ref> line 1 continues
# Line 2...

or this:

# Line 1, <ref>
ref 1...
</ref> line 1 continues
# Line 2...

We're parsing things one token at a time, and look-ahead (especially look-ahead this complicated) is annoying. I'll see if I can figure anything out, but at the moment it seems quite possibly impossible.

@kristian-clausal
Collaborator Author

At the moment, this is a bit too difficult to fix completely.

The issue is that HTML tags are too 'free'. HTML tags (or HTML-like tag entities like nowiki or ref) can be generated by templates, for example, so you can never trust that the tags you see are the tags that will be there at the very end.

Because we do piecemeal parsing to keep template data (i.e. "there was a template here that generated this"), tags can't be trusted to be correct. String parsing to find matching tags is not possible in this case, so lookahead is not possible. It could be done partially, and we do look ahead for '' italics and ''' bold markup, but people do not put those in separate templates like they do with HTML tags.

If we try to do something at the time we parse tokens (i.e. "here's a </ref> token, what do we do?"), we can't do lookahead because there is no buffer of tokens (yet), and the parser stack is incomplete because we are generating the parser stack in this very step. We can look backwards, which is what the patch I linked above does, but because we break lists at certain points and pop the stack to do so, we can't undo that if we later come across a valid end tag.
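
A minimal sketch of that irreversibility (hypothetical node names and functions, not the actual wikitextprocessor code): once the list nodes have been popped to break the list, a later </ref> has nothing to reattach to:

```python
# Parser stack as it looks while we are inside a list item that
# contains an unclosed <ref>.
stack = ["ROOT", "list", "list_item", "ref"]

def break_list() -> None:
    # A blank line (or similar trigger) breaks the list: its nodes are
    # popped and the subtree is finalized. This is the point of no return.
    while stack[-1] in ("list", "list_item", "ref"):
        stack.pop()

def on_end_tag(name: str) -> None:
    if name in stack:
        # Normal case: pop up to and including the matching open node.
        while stack[-1] != name:
            stack.pop()
        stack.pop()
    else:
        # The matching open node was already finalized; we cannot splice
        # the continuation back into the now-closed list item.
        print(f"stray </{name}>, open nodes: {stack}")

break_list()
on_end_tag("ref")  # stray </ref>, open nodes: ['ROOT']
```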

Recently, I had to fix a problem with whole pages being consumed by tags that were never closed. That fix lives in the token-parsing section of the code and uses the "close this list" mechanism, so we can't really do anything that goes against it: instead of affecting a small bit of the page (like this issue does), we would basically turn whole articles, including entries in other languages, into nothing. Otherwise, if we could trust Wiktionary editors to never forget closing tags, we could just keep the tag element open until we hit an end tag.

But because MediaWiki handles this issue gracefully (and we can't), there's no cost to editors leaving out end tags. Annoying, and possibly unfixable unless we overhaul how we parse things.

@xxyzz
Collaborator

xxyzz commented Aug 11, 2023

Recently I've been looking into how other wikitext parsers work (https://www.mediawiki.org/wiki/Alternative_parsers). Most use a manually written parser, like our code does, while others, like Sweble, use a parser generator. The latter seems more flexible; I'm reading the TatSu library's documentation and trying to assess how it could be used to parse wikitext.
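
For a feel of the parser-generator approach, here's a toy TatSu grammar (purely illustrative, nowhere near a real wikitext grammar, and assuming TatSu's compile/parse API):

```python
import tatsu

# Recognizes only a flat "#"-item list; real wikitext needs far more.
GRAMMAR = r"""
    @@grammar::ToyList
    @@whitespace::/[\t ]+/

    start = { item }+ $ ;
    item  = '#' text:/[^\n]*/ '\n' ;
"""

parser = tatsu.compile(GRAMMAR)
ast = parser.parse("# first item\n# second item\n")
print(ast)
```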

@xxyzz
Collaborator

xxyzz commented Aug 14, 2023

MediaWiki's new runtime, Parsoid, already uses a PEG parser called wikipeg to parse the context-free part; here is the grammar: https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/src/Wt2Html/Grammar.pegphp
We could use this grammar, but it'd be simpler if we could parse the HTML output of the Parsoid API. This HTML is annotated with RDFa data: templates are expanded, and the template name and parameters are included in the data-mw attribute. They also have a specification for the HTML format: https://www.mediawiki.org/wiki/Specs/HTML

The Parsoid API can be tested from https://en.wiktionary.org/api/rest_v1/; use the GET /page/html/{title} endpoint.
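
Something like this sketch (assuming the requests package; the User-Agent string is just a placeholder) is enough to poke at that endpoint:

```python
import requests

title = "comprise"  # the page from the example above
resp = requests.get(
    f"https://en.wiktionary.org/api/rest_v1/page/html/{title}",
    headers={"User-Agent": "wikitextprocessor-issue-86-test"},
    timeout=30,
)
resp.raise_for_status()
# Parsoid HTML; template expansions carry their origin in data-mw
# attributes, e.g. {"parts":[{"template":{"target":{"wt":"defdate"},...
print(resp.text[:500])
```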

And the HTML dump file with RDFa data can be downloaded from https://dumps.wikimedia.org/other/enterprise_html/runs/

@xxyzz
Collaborator

xxyzz commented Aug 16, 2023

The latest HTML dump files are missing many pages, and it hasn't been fixed yet: https://phabricator.wikimedia.org/T305407

Files created in 20230620 and 20230701 are usable.

@kristian-clausal
Collaborator Author

We're not using the HTML dump, so it should be fine.

@xxyzz
Collaborator

xxyzz commented Aug 16, 2023

I'm trying to extract data from the HTML dump file. Using XPath is really convenient, and the dump also contains data created by the LanguageConverter feature. Currently I can't see many drawbacks compared to reimplementing Parsoid.
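
For example, here's a sketch with lxml (the data-mw layout follows the Specs/HTML page linked above; the fragment is a made-up stand-in for real dump content) that pulls template names out of Parsoid-style HTML:

```python
import json
from lxml import html

# Minimal stand-in for a fragment from the dump; real pages have the
# same shape, just bigger.
page = ('<p><span data-mw=\'{"parts":[{"template":{"target":'
        '{"wt":"defdate"},"params":{},"i":0}}]}\'>...</span></p>')

doc = html.fromstring(page)
for node in doc.xpath("//*[@data-mw]"):
    data = json.loads(node.get("data-mw"))
    for part in data.get("parts", []):
        # Parts can be plain strings; template invocations are dicts.
        if isinstance(part, dict) and "template" in part:
            print(part["template"]["target"]["wt"])  # -> defdate
```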

@kristian-clausal
Collaborator Author

Talk with Tatu about this before you start working on it.
