Template [[Fichier:Lycium shawii.jpg|vignette|...]]
is misinterpreted?
#198
Comments
Oh, that is weird. I'll take a look.
And with
The underlying issue is that clean_value should be removing
We need to add new data for each language code again to take into account localizations of
The pipe is caused by a broken regex:

    print(f"Before repl_link_bars: {title=}")
    title = re.sub(
        r"(?s)\[\[\s*([^][|<>]+?)\s*\|"
        r"\s*(([^][|]|\[[^]]*\])+?)"
        r"(\s*\|\s*(([^]|]|\[[^]]*\])+?))*\s*\]\]",
        repl_link_bars,
        title,
    )
    print(f"After repl_link_bars: {title=}")

that I'm going to take a look at now. It was never caught because File and Image links were being discarded. clean_value discards image links because they're not part of the normal text body; they sit in a floating div of their own. This should be handled in clean_node; either using
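To make it easier to see what that regex captures on a link like the one in this report, here is a minimal sketch. The sample string is an assumption (the real caption is elided in the report), and the sketch only inspects the match groups; it does not reproduce what repl_link_bars does with them.

```python
import re

# Regex copied from the re.sub call quoted above.
LINK_BARS_RE = re.compile(
    r"(?s)\[\[\s*([^][|<>]+?)\s*\|"
    r"\s*(([^][|]|\[[^]]*\])+?)"
    r"(\s*\|\s*(([^]|]|\[[^]]*\])+?))*\s*\]\]"
)

# Assumed, simplified input: a file link with one option and a plain caption.
sample = "[[Fichier:Lycium shawii.jpg|vignette|Lycium shawii appelé Gharqad]]"

m = LINK_BARS_RE.search(sample)
assert m is not None

target = m.group(1)        # text before the first pipe
first_arg = m.group(2)     # first pipe-separated argument
last_repeat = m.group(4)   # last iteration of the repeated group, separator included
last_arg = m.group(5)      # argument text inside that last iteration

print(f"{target=}")
print(f"{first_arg=}")
print(f"{last_repeat=}")
print(f"{last_arg=}")
```

Note that the repeated outer group (group 4) contains the separator pipe itself, while group 5 holds only the argument text, so a replacement callback has to pick its group carefully.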
The underlying issue here is, in retrospect, that clean_node and clean_value are Wiktextract-specific implementations. They're not from wikitextprocessor! So technically they would need to be forked for Wikipedia...
The
This should now be fixed, but I haven't implemented any changes to the issue with
Do not trust that the output of Fichier links will be there in the future!
Forgot to close this.
I use the French Wikipedia dump file frwiki-latest-pages-articles.xml.bz2 and generated the SQLite db file fr-wiki-latest.db from it. It seems that the template [[Fichier:Lycium shawii.jpg|vignette|...]] is misinterpreted.
Code:
Output:
The | at the beginning of the sentence should not be present. The result should be:
Lycium shawii appelé Gharqad qui a donné son nom au cimetière d’al Baqi à Médine.
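For illustration, this is roughly the behaviour I would expect, written as a standalone sketch. Everything here is hypothetical: FILE_NAMESPACES and file_link_caption are names made up for this example and are not part of Wiktextract or wikitextprocessor, and the sketch assumes the caption contains no nested [[...]] links.

```python
import re
from typing import Optional

# Hypothetical table of file-namespace names per language code.
# "Fichier" is the French name seen in this issue; the rest is an assumption.
FILE_NAMESPACES = {
    "en": ("File", "Image"),
    "fr": ("Fichier", "File", "Image"),
}


def file_link_caption(link: str, lang_code: str) -> Optional[str]:
    """Return the last parameter of a file link, or None if `link` is not one.

    Simplifications: the caption must not contain nested links, and the
    last parameter is assumed to be the caption.
    """
    names = FILE_NAMESPACES.get(lang_code, ("File", "Image"))
    alternatives = "|".join(re.escape(name) for name in names)
    m = re.fullmatch(rf"(?is)\[\[\s*(?:{alternatives})\s*:(.*)\]\]", link)
    if m is None:
        return None
    parts = m.group(1).split("|")
    if len(parts) < 2:
        return ""  # file name only, no caption
    return parts[-1].strip()


link = "[[Fichier:Lycium shawii.jpg|vignette|Lycium shawii appelé Gharqad]]"
print(file_link_caption(link, "fr"))
print(file_link_caption(link, "en"))
```

With the French table the link is recognised and only the caption text comes back, with no leading pipe; with the English table the same link is not recognised as a file link at all, which is why the namespace names need to be known per language code.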