don't classify desc as romanization if it is in data["categories"] #230

jmviz · 2023-04-12T20:14:05Z

Noticed some spurious romanizations. Consider, for example, Reconstruction:Latin/sufferio. parse_word_head() was processing the cleaned version of {{la-verb|4|*sufferiō|*sufferīv|*suffert}} {{lb|la|Proto-Romance}}, which is *sufferiō (present infinitive *sufferīre, perfect active *sufferīvī, supine *suffertum); fourth conjugation (Proto-Romance), and was classifying Proto-Romance as a romanization. That can be ruled out, since {{lb}} generates the category Proto-Romance, which is in data["categories"]. This PR adds a check so that a desc that also appears in data["categories"] is not treated as a romanization.
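For readers skimming the thread, the idea of the check can be sketched roughly as below. This is only an illustration, not the code in the PR: looks_like_romanization() is a hypothetical stand-in for whatever classification wiktextract really performs, and is_romanization() is a made-up helper name. Only the data["categories"] lookup reflects what is described above.

```python
def looks_like_romanization(desc: str) -> bool:
    # Hypothetical stand-in heuristic (an assumption, not wiktextract's logic):
    # treat non-empty ASCII text made of letters, hyphens and spaces as a
    # possible romanization.
    return bool(desc) and all(
        c.isascii() and (c.isalpha() or c in "- ") for c in desc
    )


def is_romanization(desc: str, data: dict) -> bool:
    # The check described in this PR: a desc that is also one of the page's
    # categories is never accepted as a romanization.
    if desc in data.get("categories", ()):
        return False
    return looks_like_romanization(desc)


data = {"categories": ["Proto-Romance"]}
print(is_romanization("Proto-Romance", data))  # rejected: it is a category
print(is_romanization("sufferio", data))       # still passes the heuristic
```

With the stand-in heuristic alone, Proto-Romance would be accepted; the category lookup is what rules it out.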

Now the spurious romanization is gone, but a DEBUG: unrecognized head form: Proto-Romance message gets logged instead, because the desc ends up not being classified at all. I left this alone as I'm not sure what the general system is for these classifications. (It's possible this check should be moved up higher in the for desc_i, desc in enumerate(new_desc): loop, depending on what that system is...)

(There happens to be an unrelated Lua execution error on this page when invoking {{VL-conj-4th}}, but that is neither here nor there for this PR.)

kristian-clausal · 2023-04-13T04:41:33Z

This seems like a sensible change; a situation where a romanization and a category are identical and present on the same exact page (especially when a vast majority of categories are at least a couple of words) should be extremely rare or even non-existent.

kristian-clausal · 2023-04-13T05:14:41Z

After testing a change that skips the debug message (inserting the category check inside the romanization branch, which would have continue'd the loop and bypassed the debug message), I realized the correct way to fix this particular example is to add Proto-Romance to uppercase_tags in tags.py, so that Proto-Romance gets picked up as a tag. That way it doesn't end up in romanizations, and the debug message is no longer triggered.

But that doesn't mean the code here is wrong in general; it is still true that if we see a romanization that looks like a category from that page it shouldn't be used as a romanization. In that case, the debug message is useful information.
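A rough sketch of how the two fixes interact, under stated assumptions: UPPERCASE_TAGS_SKETCH is a hypothetical stand-in for uppercase_tags in tags.py, classify_desc_sketch() and looks_like_romanization() are invented for illustration, and the real loop in parse_word_head() has more branches than shown here.

```python
# Hypothetical stand-in for uppercase_tags in tags.py (assumption).
UPPERCASE_TAGS_SKETCH = {"Proto-Romance"}


def looks_like_romanization(desc: str) -> bool:
    # Hypothetical heuristic, not wiktextract's real classification.
    return bool(desc) and all(
        c.isascii() and (c.isalpha() or c in "- ") for c in desc
    )


def classify_desc_sketch(desc: str, categories: list) -> str:
    # 1. Known tags win, so a tag never reaches the romanization branch.
    if desc in UPPERCASE_TAGS_SKETCH:
        return "tag"
    # 2. Only a desc that is not a page category may be a romanization.
    if desc not in categories and looks_like_romanization(desc):
        return "romanization"
    # 3. Everything else is left unclassified; this is the case where the
    #    "unrecognized head form" debug message is still useful.
    return "unrecognized"


print(classify_desc_sketch("Proto-Romance", ["Proto-Romance"]))
print(classify_desc_sketch("Some Category", ["Some Category"]))
print(classify_desc_sketch("sufferio", ["Proto-Romance"]))
```

In this sketch a category-like desc that is not a known tag falls through to "unrecognized", which is where the debug message would flag it for follow-up.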

kristian-clausal · 2023-04-13T06:53:31Z

Also was able to figure out the Lua thing:
tatuylonen/wikitextprocessor@0543061
tatuylonen/wikitextprocessor#35

It was an issue with super-weird code in Module:VL-verb; the __tostring metamethod was used to return data as tables (instead of strings), which Lua 5.1 did not prohibit, but which later versions of tostring() (the function that calls __tostring) do.

don't classify desc as romanization if it is in data["categories"]

1bdc3db

kristian-clausal merged commit f33e75d into tatuylonen:master Apr 13, 2023

kristian-clausal mentioned this pull request Apr 13, 2023

Lua: differences in __tostring() between 5.1 and later? tatuylonen/wikitextprocessor#35

Closed

jmviz mentioned this pull request Apr 14, 2023

some more spurious romanizations #233

Closed
