HTML code in English word senses #33
Comments
The word senses for benzylidene are also not being parsed correctly.
"translations": [
  {
    "code": "ca",
    "lang": "Catalan",
    "sense": "=\">\nC6H5-CH=",
    "tags": [
      "masculine"
    ],
    "word": "benzilidè"
  },
  {
    "code": "fi",
    "lang": "Finnish",
    "sense": "=\">\nC6H5-CH=",
    "word": "bentsylideeni"
  }
],
There are probably a couple of different things going on here. I'll first take a look at the ones where the
I've now kludged together a fix for the issue with italics and bolded text. It was breaking the parsing of table attributes because we assumed table attributes could only be raw strings, but they can now contain other nodes. Some of the other things break because of this too, but oh haaaaaa.
The whole mess with the interleaved trans-top and multi-trans templates seems to be what breaks things. I'm going to try to kludge something together using a regex to detect lines that look like attributes, and see if we can just brute-force it that way.
I gave up on trying to fix the underlying problem (some things being parsed too early and passed on as parsed nodes instead of escaped strings; I can't even figure out where that HTML-escaping is taking place!) and made a band-aid. When we're checking whether the children of a node should be parsed for attributes, we now de-parse those nodes and turn them back into wikitext. This wikitext (if it comes from a node and not from the top-level strings surrounding it) is escaped with html.escape and concatenated. The resulting string is then checked to see if a regex matches (for
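The band-aid described above could look roughly like the following sketch. It is only an illustration under assumptions: the `Node` class, the `MARKUP` table, and the `ATTR_RE` pattern are hypothetical stand-ins, not the project's actual classes or regex. Only `html.escape` and `re` are real standard-library calls.

```python
import html
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed wikitext node."""
    kind: str  # e.g. "italic" or "bold"
    children: List[Union[str, "Node"]] = field(default_factory=list)


# Assumed mapping from node kind back to its wikitext delimiter.
MARKUP = {"italic": "''", "bold": "'''"}

# Assumed pattern: one or more name=value pairs, values quoted or bare.
ATTR_RE = re.compile(r'^\s*(?:[A-Za-z][\w-]*\s*=\s*(?:"[^"]*"|[^\s"|]+)\s*)+$')


def to_wikitext(item: Union[str, Node]) -> str:
    """De-parse a node (recursively) back into wikitext."""
    if isinstance(item, str):
        return item
    delim = MARKUP.get(item.kind, "")
    return delim + "".join(to_wikitext(c) for c in item.children) + delim


def attribute_candidate(children: List[Union[str, Node]]) -> str:
    """Concatenate children; text that came from a node is HTML-escaped,
    top-level strings are kept as they are."""
    parts = []
    for child in children:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.append(html.escape(to_wikitext(child)))
    return "".join(parts)


def looks_like_attributes(children: List[Union[str, Node]]) -> bool:
    """Check the concatenated string against the attribute-like regex."""
    return ATTR_RE.match(attribute_candidate(children)) is not None
```

With this sketch, a child list such as `['class="wikitable" style="width:100%"']` is treated as attributes, while a sense title containing an italic node is not.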
The kaikki regeneration started after my earlier commit, so the italic and bolded translation sense titles should be corrected tomorrow. The newest commit came after the regeneration had started, so you'll see more corrections on Saturday, fingers crossed.
In the meantime I added some of the old behavior back to the code. It was much lighter on resources, because it basically just involved checking whether there is only 1 (one) child node and whether that child node is a string (and a bunch of string concatenation in the background before that, but still). It worked 99% of the time and we can return early with the result, so why not? It definitely shouldn't break things any more than they were broken before. Technically, it could be worse than the newest version committed yesterday, because its formatting is not being checked by the regex, but still.
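As a rough sketch of that early return (the function name and signature here are hypothetical, not taken from the project's code):

```python
from typing import List, Optional


def fast_path_attribute_text(children: List[object]) -> Optional[str]:
    """Return the attribute text directly when the node has exactly one
    child and that child is a plain string; otherwise return None so the
    caller can fall back to the slower de-parse-and-regex check."""
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]
    return None
```

Note that this path skips the regex check entirely, which is the trade-off mentioned above.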
I've separated out the benzylidene post as its own issue; that's a different kettle of fish. The opening issue with the parsing of italics and other nodes within table attributes seems to be OK for now, so I'm closing this thread as completed.
Some HTML code got into the English dump.
Here's an example from wall (sense: The butterfly Lasiommata megera.)
More examples: