<ref> elements (and probably other HTML-like tags) inside list items can seemingly contain newlines #86
This is a partial fix. If you have something like:
HOWEVER, this doesn't fix this:
or this:
We're parsing things one token at a time, and look-ahead (especially look-ahead this complicated) is annoying. I'll take a look and see if I can figure anything out, but at the moment it seems possibly impossible.
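For illustration, bounded lookahead over a one-token-at-a-time stream can be sketched with a small buffering wrapper. This is a hypothetical sketch, not the parser's actual code:

```python
from collections import deque

class PeekableTokens:
    """Wrap a token iterator with bounded lookahead (illustrative sketch)."""

    def __init__(self, tokens):
        self._it = iter(tokens)
        self._buf = deque()

    def peek(self, n=0):
        # Fill the buffer until we can see n tokens ahead (or run out).
        while len(self._buf) <= n:
            try:
                self._buf.append(next(self._it))
            except StopIteration:
                return None
        return self._buf[n]

    def next(self):
        # Consume from the buffer first, then the underlying iterator.
        if self._buf:
            return self._buf.popleft()
        return next(self._it, None)

stream = PeekableTokens(["<ref>", "text", "\n", "</ref>"])
assert stream.peek(2) == "\n"    # look ahead without consuming
assert stream.next() == "<ref>"  # consumption order is unchanged
```

The catch, as described above, is that buffering tokens only helps when the tokens can be trusted, which template expansion prevents.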
At the moment, this is a bit too difficult to fix completely. The issue is that HTML tags are too 'free'. HTML tags (or HTML-like tag entities like nowiki, ref, etc.) can be generated by templates, for example, so you can never trust that the tags you see are the tags that will be there at the very end. Because we do piecemeal parsing to keep template data (i.e. "there was a template here that generated this"), tags can't be trusted to be correct. String parsing to find matching tags is not possible in this case, so lookahead is not possible. It could be done partially, and we do lookahead with …

If we try to do something during the time that we parse tokens (i.e. "here's a …"):

Recently, I had to fix a problem with whole pages being consumed by tags that weren't being closed. This is handled in the token-parsing section of the code, using the "close this list" mechanism, so we can't really do anything that goes against that: instead of affecting a small bit of the page (like this issue does), we would basically turn whole articles, including entries in other languages, into nothing.

Otherwise, if we could trust Wiktionary editors not to forget closing tags, we could just keep the tag element open until we hit an end tag. But because MediaWiki handles this issue gracefully (and we can't), there's no cost to editors leaving out end tags. Annoying, and possibly unfixable unless we do an overhaul of how we parse things.
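The "force-close unclosed tags at a structural boundary" idea mentioned above can be sketched roughly like this (the function name and token shapes are hypothetical, not the project's real API):

```python
def force_close_open_tags(open_tags, make_end_token):
    """Illustrative sketch: when a structural boundary is reached (e.g. the
    end of a list item or section), synthesize end tokens for any tags the
    editor left unclosed, innermost first, so a missing </ref> cannot
    swallow the rest of the page."""
    closed = []
    while open_tags:
        closed.append(make_end_token(open_tags.pop()))
    return closed

# Hypothetical usage: a ref tag still open when its list item ends.
assert force_close_open_tags(["ref"], lambda t: f"</{t}>") == ["</ref>"]
```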
Recently I've been looking into how other wikitext parsers work (https://www.mediawiki.org/wiki/Alternative_parsers). Most use a manually written parser like our code does, while others like Sweble use a parser generator. The latter seems more flexible; I'm reading the TatSu library's documentation and trying to assess how it could be used to parse wikitext.
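For context, a parser generator like TatSu is driven by an EBNF-style grammar instead of hand-written parsing code. A purely hypothetical fragment for a ref-like element might look like this (rule names invented for illustration; see the TatSu docs for the actual grammar syntax):

```
start = ref $ ;
ref   = '<ref>' text '</ref>' ;
text  = /[^<]*/ ;
```

Whether such a grammar can cope with template-generated tags is exactly the open question discussed in this thread.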
MediaWiki's new runtime parser, Parsoid, already uses a PEG parser called wikipeg to parse the context-free part; here is the grammar: https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/src/Wt2Html/Grammar.pegphp

The Parsoid API can be tested from https://en.wiktionary.org/api/rest_v1/, using the … endpoint. And the HTML dump files with RDFa data can be downloaded from https://dumps.wikimedia.org/other/enterprise_html/runs/
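As a minimal offline sketch of using that REST API: `/page/html/{title}` is the standard REST v1 route that returns Parsoid HTML (the helper function name here is invented):

```python
from urllib.parse import quote

def parsoid_html_url(title, base="https://en.wiktionary.org/api/rest_v1"):
    """Build the REST v1 URL that returns Parsoid HTML for a page.
    /page/html/{title} is the standard REST v1 route."""
    return f"{base}/page/html/{quote(title, safe='')}"

url = parsoid_html_url("comprise")
assert url == "https://en.wiktionary.org/api/rest_v1/page/html/comprise"
# The actual fetch (e.g. urllib.request.urlopen(url)) is omitted so the
# sketch stays offline-friendly.
```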
The latest HTML dump files are missing many pages, and this hasn't been fixed yet: https://phabricator.wikimedia.org/T305407. The files created on 20230620 and 20230701 are usable.
We're not using the HTML dump, so it should be fine.
I'm trying to extract data from the HTML dump file; using XPath is really convenient, and the dump also contains data created by the LanguageConverter feature. Currently I can't see many drawbacks compared to reimplementing Parsoid.
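A small stdlib sketch of the XPath approach. The HTML fragment below is invented, but Parsoid HTML really does mark template output with RDFa attributes such as `typeof="mw:Transclusion"`; lxml would give full XPath support, while `xml.etree` supports a useful subset:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a Parsoid HTML fragment from the dump.
fragment = """
<section>
  <p><span typeof="mw:Transclusion">from the late 18th c.</span></p>
</section>
"""

root = ET.fromstring(fragment)
# ElementTree's limited XPath includes attribute predicates like this one.
spans = root.findall(".//span[@typeof='mw:Transclusion']")
assert [s.text for s in spans] == ["from the late 18th c."]
```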
Talk with Tatu about this before you start working on it. |
This is annoying, because of the structure of our parser.
If we have the source (from comprise/English):
the "from the late 18th c." template is still part of the same line as the preceding list item. If we do this:
with a newline before the defdate template, it behaves as expected and the "from the late 18th c." text is on a new line and breaks the table into two new tables.
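The line-based rule at work here can be sketched roughly as follows. The splitting logic is a simplified illustration (real wikitext lists have more markers and nesting), and the sample wikitext is hypothetical:

```python
def split_list_items(wikitext):
    """Sketch: in wikitext, a list item runs to the end of its physical
    line, so content pushed onto the next line (no leading '#') falls
    outside the list and starts a new block."""
    items, blocks = [], []
    for line in wikitext.splitlines():
        if line.startswith("#"):
            items.append(line[1:].strip())
        elif line.strip():
            blocks.append(line.strip())
    return items, blocks

same_line = "# To be made up of.{{defdate|from the late 18th c.}}"
next_line = "# To be made up of.\n{{defdate|from the late 18th c.}}"

# Same line: the template stays inside the list item.
assert split_list_items(same_line)[0] == [
    "To be made up of.{{defdate|from the late 18th c.}}"
]
# Newline first: the template becomes a separate block outside the list.
assert split_list_items(next_line)[1] == ["{{defdate|from the late 18th c.}}"]
```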
The problem is, as usual, that the way we have to parse wikitext needs to be different from what happens when wikitext is processed normally. We need to keep some data that is discarded during normal processing, like the fact that templates exist at all, while normal processing just (probably) involves a ton of template-expansion passes, with the final product parsed afterwards. We do parsing in between, which breaks a ton of HTML-tag-related stuff, because tags can appear from templates at will to generate new structure.
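A toy sketch of what "piecemeal parsing that keeps template data" means: split the text into tokens so the fact that a template was here survives into later stages (illustrative only; the real tokenizer is far more involved and handles nested braces):

```python
import re

# Matches a non-nested {{...}} template invocation.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")

def tokenize(text):
    """Split wikitext into ('text', ...) and ('template', ...) tokens."""
    tokens, pos = [], 0
    for m in TEMPLATE_RE.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        tokens.append(("template", m.group()))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens

assert tokenize("# def {{defdate|late 18th c.}}") == [
    ("text", "# def "),
    ("template", "{{defdate|late 18th c.}}"),
]
```

Because a template token is opaque at this stage, any HTML tags it might expand to are invisible to the parser, which is exactly why tag matching can't be trusted.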
This needs to be hacked together to get lists working again; I have a vague idea, and hopefully it's as simple as it is vague.