don't classify desc as romanization if it is in data["categories"] #230

Merged (1 commit), Apr 13, 2023

@jmviz (Contributor) commented Apr 12, 2023

I noticed some spurious romanizations. Consider, for example, Reconstruction:Latin/sufferio. parse_word_head() was going through the cleaned version of {{la-verb|4|*sufferiō|*sufferīv|*suffert}} {{lb|la|Proto-Romance}}, which is *sufferiō (present infinitive *sufferīre, perfect active *sufferīvī, supine *suffertum); fourth conjugation (Proto-Romance), and was classifying Proto-Romance as a romanization. That classification can be ruled out because {{lb}} generates the category Proto-Romance, which is in data["categories"]. This PR adds a check so that a desc found in data["categories"] is not treated as a romanization.

The spurious romanization is now gone, but a DEBUG: unrecognized head form: Proto-Romance message gets logged instead, because the desc ends up not being classified at all. I left this alone, as I'm not sure what the general system is for these classifications. (It's possible this check should be moved higher up in the for desc_i, desc in enumerate(new_desc): loop, depending on what that system is...)
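For illustration, here is a minimal sketch of the check described above. It is not the actual wiktextract code: split_descs, the looks_like_romanization parameter, and the sample desc "sufferio" are hypothetical names and data invented for this example. Only data["categories"] and the Proto-Romance case come from the PR description.

```python
def split_descs(descs, data, looks_like_romanization=lambda desc: True):
    """Return (romanizations, unrecognized) for a list of head descs.

    looks_like_romanization is a hypothetical stand-in for whatever test
    the real parser applies; by default it accepts every desc.
    """
    categories = set(data.get("categories", ()))
    romanizations, unrecognized = [], []
    for desc in descs:
        # A desc that matches a category generated on the same page
        # is not accepted as a romanization.
        if looks_like_romanization(desc) and desc not in categories:
            romanizations.append(desc)
        else:
            # Corresponds to the "unrecognized head form" debug case.
            unrecognized.append(desc)
    return romanizations, unrecognized


# Assumed sample input: the category comes from {{lb|la|Proto-Romance}}.
data = {"categories": ["Proto-Romance"]}
roms, unknown = split_descs(["Proto-Romance", "sufferio"], data)
print(roms)     # ['sufferio']
print(unknown)  # ['Proto-Romance']
```

In this sketch the rejected desc simply falls through to the unrecognized list, which matches the behavior described above: the romanization disappears, but the desc is still left unclassified.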

(There happens to be an unrelated Lua execution error on this page when invoking {{VL-conj-4th}}, but that is neither here nor there for this PR.)

@kristian-clausal (Collaborator)

This seems like a sensible change: a romanization and a category that are identical and present on the exact same page should be extremely rare or even nonexistent, especially since the vast majority of categories are at least a couple of words long.

@kristian-clausal merged commit f33e75d into tatuylonen:master on Apr 13, 2023
@kristian-clausal (Collaborator) commented Apr 13, 2023

After testing a change to skip the debug message (by inserting the category check inside the romanization branch, which would have continue'd the loop and bypassed the debug message), I realized that the correct fix for this particular example is to add Proto-Romance to uppercase_tags in tags.py, so that Proto-Romance gets picked up as a tag. That way, it doesn't end up in romanizations and the debug message is no longer triggered.

But that doesn't mean the code here is wrong in general; it is still true that if we see a romanization that looks like a category from that page it shouldn't be used as a romanization. In that case, the debug message is useful information.
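As a rough sketch of why the tag entry resolves this case while the category check stays useful, the ordering can be modeled as below. This is an assumption-laden illustration, not the real parser: classify_desc_sketch, the "tags" and "romanizations" keys, and the returned debug list are hypothetical. Only uppercase_tags, data["categories"], and the debug message text are taken from the discussion.

```python
def classify_desc_sketch(desc, data, uppercase_tags):
    """Classify one head desc, updating data; return any debug messages."""
    debug = []
    if desc in uppercase_tags:
        # Known tag: recorded as a tag, never considered a romanization.
        data.setdefault("tags", []).append(desc)
    elif desc in data.get("categories", ()):
        # Matches a page category: rejected as a romanization, and the
        # debug message flags that the desc was left unclassified.
        debug.append(f"unrecognized head form: {desc}")
    else:
        data.setdefault("romanizations", []).append(desc)
    return debug


# Without the tag entry, the category check rejects the desc and logs it.
before = {"categories": ["Proto-Romance"]}
print(classify_desc_sketch("Proto-Romance", before, set()))
# ['unrecognized head form: Proto-Romance']

# With Proto-Romance listed as a tag, it is picked up silently.
after = {"categories": ["Proto-Romance"]}
print(classify_desc_sketch("Proto-Romance", after, {"Proto-Romance"}))
# []
print(after["tags"])  # ['Proto-Romance']
```

The debug message in the second branch is what surfaces other descs that may still need a tag entry of their own.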

@kristian-clausal (Collaborator)

Also was able to figure out the Lua thing:
tatuylonen/wikitextprocessor@0543061
tatuylonen/wikitextprocessor#35

It was an issue with some very unusual code in Module:VL-verb: the __tostring metamethod was used to pass data in the form of tables (instead of strings). Lua 5.1 did not prohibit this, but later versions of tostring() (the function that calls __tostring) do.
