Do not linktrail if following text is not [a-z]? #414
Merged
kristian-clausal merged 3 commits into main on Mar 9, 2026
See wiktextract issue #1604: tatuylonen/wiktextract#1604
https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

This should not be merged as is, because it will create problems in other extractors that might rely on different behavior.

In the best-case scenario, there might be two different camps:
1) Languages that use spaces and want to do linktrailing
2) Languages without spaces that can't do linktrailing

If this is the case, we might be able to get away with a kludge that checks whether the script of the last character in the link matches the script of the first character after the link.
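A rough sketch of what such a script-matching kludge could look like. This is not code from the PR: `rough_script` and `should_linktrail` are hypothetical names, and the first word of the Unicode character name is only a crude stand-in for the real Unicode script property, which the Python standard library does not expose.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this is a crude proxy, e.g. 'LATIN SMALL LETTER A' -> 'LATIN',
    'CJK UNIFIED IDEOGRAPH-6F22' -> 'CJK'. Unnamed characters yield ''.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else ""


def should_linktrail(link_text: str, following: str) -> bool:
    """Return True if the text right after a link looks like the same script
    as the end of the link text (hypothetical helper, not the Wtp API)."""
    if not link_text or not following:
        return False
    last = rough_script(link_text[-1])
    first = rough_script(following[0])
    return last != "" and last == first
```

With this proxy, `[[cat]]s` would trail and `[[englishword]]` followed by kanji would not. Note that it would also refuse to trail kana after kanji, since the proxy treats hiragana and CJK ideographs as different scripts, which may or may not be the desired behavior.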
ea35175 to ecb885e
See wiktextract issue #1604: tatuylonen/wiktextract#1604
https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

This adds a new attribute to `Wtp` that contains a `re.Pattern` object used for pattern-matching these kinds of suffixed links. Modify `Wtp.linktrailing_re` to change the behavior to match how the Wikimedia project being parsed handles linktrailing.

English uses `[a-z]+`. Our default implementation uses `\w+`, which should be fine most of the time. Languages without spaces seem to use the English `[a-z]+`, which makes sense: in `[[englishword]]KANJI` the kanji characters should not be consumed into the link, but `\w+` does consume them.
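To illustrate the difference between the two patterns, here is a minimal standalone sketch. It does not use `Wtp` itself; `split_linktrail`, `ASCII_TRAIL`, and `WORD_TRAIL` are hypothetical names introduced only for this example.

```python
import re

# Hypothetical stand-ins for the two candidate linktrail patterns.
ASCII_TRAIL = re.compile(r"[a-z]+")  # the English-style trail
WORD_TRAIL = re.compile(r"\w+")      # the `\w+` default described above


def split_linktrail(following: str, trail_re: re.Pattern) -> tuple[str, str]:
    """Split the text right after `]]` into (trail, rest).

    The trail is whatever `trail_re` matches at the very start of
    `following`; it is empty if the pattern does not match there.
    """
    m = trail_re.match(following)
    if m is None:
        return "", following
    return m.group(0), following[m.end():]


# `[[cat]]s are nice`: both patterns pull the "s" into the link.
print(split_linktrail("s are nice", ASCII_TRAIL))
print(split_linktrail("s are nice", WORD_TRAIL))

# `[[englishword]]漢字`: only `\w+` consumes the kanji.
print(split_linktrail("漢字", ASCII_TRAIL))
print(split_linktrail("漢字", WORD_TRAIL))
```

In Python 3, `\w` in a `str` pattern matches Unicode word characters, which is why the kanji are swallowed by the `\w+` variant.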
We have a `NAMESPACEE` field in `parserfns` (`{{{NAMESPACEE}}}`,
it's unimplemented) which pisses off the linter for some
reason.
2b9a20e to 980bb47