Template regex in Wtp._encode() can't match templates that has -{}- argument #59

Closed
xxyzz opened this issue Jun 15, 2023 · 1 comment · Fixed by #60

xxyzz (Collaborator) commented Jun 15, 2023

The Chinese Wiktionary template Ja-romanization of uses -{}- as a placeholder for the first unnamed argument of the module form of/templates; MediaWiki replaces -{}- with an empty string. But the template regex here

    # Replace template invocation
    text = re.sub(
        r"(?si)\{" + MAGIC_NOWIKI_CHAR + r"?\{(("
        r"\{\|[^{}]*?\|\}|"
        r"\}[^{}]|"
        r"[^{}](\{[^{}|])?"
        r")+?)\}" + MAGIC_NOWIKI_CHAR + r"?\}",
        repl_templ,
        text,
    )

can't match the #invoke call, so the whole call ends up in the output as plain text.

ja-romanization of template wikitext: {{#invoke:form of/templates|form_of_t|-{}-|withcap=1|lang=ja|noprimaryentrycat=}} 的[[罗马字]]转写

I tried replacing -{}- with a whitespace character before calling _encode, but I guess some Lua code strips the resulting empty argument and then throws a "parameter 1 is required" error. Maybe a more complex regex could solve this bug.

Example of affected page: https://zh.wiktionary.org/wiki/manga#日語
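
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the library's code path: it rebuilds the regex quoted in this issue, and the value given to MAGIC_NOWIKI_CHAR is an assumed stand-in (the real constant is defined in wikitextprocessor/common.py). The names TEMPLATE_RE, plain and broken are only for illustration.

```python
import re

# Assumed stand-in value; the real constant lives in wikitextprocessor/common.py.
MAGIC_NOWIKI_CHAR = chr(0x0010203E)

# Same pattern as quoted in the issue, compiled once for reuse.
TEMPLATE_RE = re.compile(
    r"(?si)\{" + MAGIC_NOWIKI_CHAR + r"?\{(("
    r"\{\|[^{}]*?\|\}|"      # a {| ... |} table
    r"\}[^{}]|"              # a lone "}" followed by a non-brace
    r"[^{}](\{[^{}|])?"      # a non-brace, optionally "{" + non-brace/non-pipe
    r")+?)\}" + MAGIC_NOWIKI_CHAR + r"?\}"
)

# An ordinary invocation without braces in its arguments is matched.
plain = "{{#invoke:form of/templates|form_of_t|x|lang=ja}}"

# The wikitext from the affected template: no alternative can consume
# the "{" in "-{}-", because the character after it is "}".
broken = (
    "{{#invoke:form of/templates|form_of_t|-{}-|withcap=1|lang=ja"
    "|noprimaryentrycat=}} 的[[罗马字]]转写"
)

plain_match = TEMPLATE_RE.search(plain)
broken_match = TEMPLATE_RE.search(broken)
print(plain_match is not None, broken_match is None)
```

The third alternative only accepts "{" when the next character is neither a brace nor a pipe, which is why "-{" followed by "}" stops the match.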

kristian-clausal (Collaborator) commented Jun 15, 2023

We could steal a MAGIC character slot, replace -{}- with it, and then change it back as needed.

In common.py we have:
https://github.com/tatuylonen/wikitextprocessor/blob/03d2312b6c59ece309c99e5b013d2dfe96473637/wikitextprocessor/common.py#LL8C1-L17C2

The magic characters and the range are just unused Unicode characters that should be safe to sprinkle into strings.

Here we could add a MAGIC_ZH_PLACEHOLDER by shifting MAGIC_FIRST to 0x00102040 and giving 0x0010203f to MAGIC_ZH_PLACEHOLDER, then cast it to a character in MAGIC_ZH_PLACEHOLDER_CH and do string replacements back and forth between the magic character and -{}- as needed.

(Note to self: replace magic values with MAGIC_BASE_VALUE + N.)

This is what I did when I needed to fix an issue with single-quote (') apostrophes inside double-quoted HTML attributes when parsing.
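
The placeholder idea from this comment could look roughly like the sketch below. The constant names and code points are the ones proposed above; the helper functions hide_zh_placeholder and restore_zh_placeholder are hypothetical names for illustration and are not part of wikitextprocessor. The later commit referenced in this thread removed the temporary magic character in favour of a different fix, so treat this only as an illustration of the suggestion.

```python
# Code points as proposed in the comment above (assumption: MAGIC_FIRST is
# shifted up by one so that 0x0010203F becomes free for the placeholder).
MAGIC_ZH_PLACEHOLDER = 0x0010203F
MAGIC_ZH_PLACEHOLDER_CH = chr(MAGIC_ZH_PLACEHOLDER)
MAGIC_FIRST = 0x00102040

ZH_PLACEHOLDER = "-{}-"


def hide_zh_placeholder(text: str) -> str:
    """Hypothetical helper: swap -{}- for the magic character before encoding."""
    return text.replace(ZH_PLACEHOLDER, MAGIC_ZH_PLACEHOLDER_CH)


def restore_zh_placeholder(text: str) -> str:
    """Hypothetical helper: put -{}- back where the magic character was."""
    return text.replace(MAGIC_ZH_PLACEHOLDER_CH, ZH_PLACEHOLDER)


wikitext = "{{#invoke:form of/templates|form_of_t|-{}-|withcap=1|lang=ja}}"
hidden = hide_zh_placeholder(wikitext)
restored = restore_zh_placeholder(hidden)
```

After hiding, the only braces left in the string are the outer template delimiters, so a brace-sensitive regex no longer trips over the argument; restoring gives back the original wikitext.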

xxyzz added a commit to xxyzz/wikitextprocessor that referenced this issue Jun 15, 2023
Chinese Wiktionary template `ja-romanization of` uses `-{}-` as a
module argument, but the template regex in `Wtp._encode()` couldn't
match the #invoke string that contains `-{}-`. Fixes tatuylonen#59.
xxyzz added a commit to xxyzz/wikitextprocessor that referenced this issue Jul 5, 2023
…rgument

a proper fix for tatuylonen#59, remove magic char added in previous temporary
fix.

2 participants