HTML code in English word senses #33

Closed
lggruspe opened this issue Mar 22, 2023 · 8 comments

@lggruspe

Some HTML code got into the English dump.

Here's an example from wall (sense: The butterfly Lasiommata megera.)

{
   "code": "sw",
   "lang": "Swahili",
   "sense": "butterfly Lasiommata megera\n class=\"translations\" role=\"presentation\" style=\"width:100%;\" data-gloss=\"butterfly Lasiommata megera\"",
   "word": "kuta"
}

More examples:

  • love (A climbing plant, Clematis vitalba)
  • rose (A plant or species in the rose family. (Rosaceae))
  • they
  • read (Used after a euphemism to introduce the intended, more blunt meaning of a term)
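A rough way to look for more affected entries is to scan the dump for translation senses that contain what look like leaked table attributes. This is only a sketch: it assumes the dump is read as JSON lines with one entry per line, and the `LEAKED_ATTR` pattern and the function names are made up for illustration, not part of the project. It would not catch cases like the benzylidene one, where no attribute name survives.

```python
import json
import re

# Heuristic (an assumption, not the project's own check): leaked table
# attributes show up inside a sense as name="value" pairs.
LEAKED_ATTR = re.compile(r'\b(?:class|role|style|data-[\w-]+)\s*=\s*"')


def suspicious_translations(entry):
    """Yield translations of one entry whose 'sense' looks like HTML residue."""
    for tr in entry.get("translations", []):
        if LEAKED_ATTR.search(tr.get("sense", "")):
            yield tr


def scan_lines(lines):
    """Scan an iterable of JSON lines and return (entry word, translation word) hits."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        for tr in suspicious_translations(entry):
            hits.append((entry.get("word"), tr.get("word")))
    return hits
```

Passing an open file object to `scan_lines` should work, since iterating over a file yields its lines.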
@lggruspe
Author

The word senses for benzylidene are also not being parsed correctly.

"translations": [
    {
      "code": "ca",
      "lang": "Catalan",
      "sense": "=\">\nC6H5-CH=",
      "tags": [
        "masculine"
      ],
      "word": "benzilidè"
    },
    {
      "code": "fi",
      "lang": "Finnish",
      "sense": "=\">\nC6H5-CH=",
      "word": "bentsylideeni"
    }
  ],

@kristian-clausal
Collaborator

There are probably a couple of different things going on here. I'll first take a look at the ones where the '' breaks things.

@kristian-clausal
Collaborator

kristian-clausal commented Mar 23, 2023

I've now kludged together a fix for the issue with italics and bolded stuff. It was breaking the parsing of table attributes because we assumed table attributes could only be raw strings, and now they can contain other nodes. Some of the other things break because of this too, but

oh
my
god

look at this template

haaaaaa.

@kristian-clausal
Collaborator

The whole mess with the interleaved trans-top and multi-trans templates seems to be what breaks things. I'm going to try to kludge something together using regex to detect lines that look like attributes, and see if we can just brute-force it that way.

@kristian-clausal
Collaborator

I gave up on trying to fix the underlying problem (some things being parsed too early and being passed on as parsed nodes instead of escaped strings; I can't even figure out where that HTML-escaping is taking place!!!!) and made a band-aid.

When we're checking whether the children of a node should be parsed for attributes, we now de-parse those nodes and turn them back into wikitext. This wikitext (if it comes from a node and not from the top-level strings surrounding it) is escaped with html.escape, and everything is concatenated into one string.

This string is then checked against a regex that matches what we expect attribute assignments to look like (the a=1 b="b" c='c' formats). If it matches, the string is passed on as usual to the attribute parsing function, which takes a string and which is where I stole most of this regex from after my own started breaking down.
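Roughly, the band-aid described above could be sketched like this. It is not the actual code: the `Node` class, the `children_as_attr_string` name and the regex are stand-ins invented for illustration. In this sketch, escaping the node-derived text keeps quote characters that came from a node from ending an attribute value early.

```python
import html
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a parsed node, not the parser's real class."""
    wikitext: str  # the source text the node was parsed from


# One attribute assignment: a=1, b="b" or c='c' (illustrative pattern only).
ATTR = r"""[^\s="'<>/]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+)"""
# The whole string must consist of whitespace-separated assignments.
ATTR_LINE = re.compile(rf"\s*{ATTR}(?:\s+{ATTR})*\s*")


def children_as_attr_string(children):
    """Return the children as one attribute string, or None if they don't look like attributes."""
    parts = []
    for child in children:
        if isinstance(child, str):
            parts.append(child)  # top-level strings are kept as they are
        else:
            parts.append(html.escape(child.wikitext))  # de-parsed node, escaped
    text = "".join(parts)
    return text if ATTR_LINE.fullmatch(text) else None
```

A string that comes back from this check would then go to the existing attribute parsing function; a `None` means the children are treated as ordinary content.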

@kristian-clausal
Collaborator

The kaikki regeneration started after my earlier commit, so tomorrow the italic and bold translation sense titles should be corrected. The newest commit came after the regeneration had started, so you'll see more corrections on Saturday, fingers crossed.

@kristian-clausal
Collaborator

kristian-clausal commented Mar 24, 2023

In the meantime I added some of the old behavior back to the code. It was much lighter on resources, because it basically just involved checking whether there is exactly one child node and whether that child node is a string (plus a bunch of string concatenation in the background before that, but still). It worked 99% of the time and lets us return early with the result, so why not? It definitely shouldn't break things any more than they were broken before. Technically, it could be worse than the newest version committed yesterday, because its formatting is not being checked by the regex, but still.
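As a sketch of that early return (the names `attrs_text` and `slow_path` are invented here, and the real code surely looks different):

```python
def attrs_text(children, slow_path):
    """Return the text to parse for attributes, or whatever slow_path decides."""
    # Fast path: exactly one child and it is a plain string. It is returned
    # as-is, so its formatting is not checked by any regex.
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]
    # Anything else (several children, or a parsed node) takes the heavier
    # de-parse, escape and regex route.
    return slow_path(children)
```

The trade-off is the one mentioned above: the common case skips the expensive work, at the cost of letting an odd single string through unchecked.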

@kristian-clausal
Collaborator

I've separated out the benzylidene post as its own issue; that's a different kettle of fish. The opening issue with the parsing of italics and other nodes within table attributes seems to be ok for now, so I'm closing this thread as completed.
