Template [[Fichier:Lycium shawii.jpg|vignette|...]] is misinterpreted? #198

Closed
LeMoussel opened this issue Feb 12, 2024 · 7 comments

Comments

@LeMoussel
Contributor

LeMoussel commented Feb 12, 2024

I use the French Wikipedia dump file frwiki-latest-pages-articles.xml.bz2 and generated the SQLite db file fr-wiki-latest.db from it. It seems that the file link [[Fichier:Lycium shawii.jpg|vignette|...]] is misinterpreted.

Code:

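# Import paths as used by wikitextprocessor and wiktextract at the time of
# this report; adjust them if the modules have since moved.
from wikitextprocessor import Wtp
from wikitextprocessor.parser import print_tree
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext

wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)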

wxr = WiktextractContext(wtp, WiktionaryConfig())

wiki_page_body = """[[Fichier:Lycium shawii.jpg|vignette|''[[Lycium]] shawii'' appelé Gharqad qui a donné son nom au cimetière d’[[al Baqi]] à [[Médine]].]]"""
wiki_page_title = "Test"

wxr.wtp.start_page(wiki_page_title)
wiki_data = wxr.wtp.parse(
    text=wiki_page_body,
    expand_all=True,
)

print_tree(wiki_data, 2)

text = clean_node(
    wxr=wxr,
    sense_data=None,
    wikinode=wiki_data
)

print(text)

Output:

  ROOT [['Arabie saoudite']]
    LINK [['Fichier:Lycium shawii.jpg'], ['vignette'], [<ITALIC(){} <LINK(['Lycium']){} >, ' shawii'>, ' appelé Gharqad qui a donné son nom au cimetière d’', <LINK(['al Baqi']){} >, ' à ', <LINK(['Médine']){} >, '.']]
|Lycium shawii appelé Gharqad qui a donné son nom au cimetière d’al Baqi à Médine.

The | at the beginning of the sentence should not be present.
The result should be: Lycium shawii appelé Gharqad qui a donné son nom au cimetière d’al Baqi à Médine.

@kristian-clausal
Collaborator

Oh, that is weird. I'll take a look.

@LeMoussel
Contributor Author

And with [[Fichier:Ikhwan.jpg|vignette|gauche|Troupes des [[Ikhwan (Arabie saoudite)|Ikhwâns]].]]
we have |Ikhwâns.]]

@kristian-clausal
Collaborator

The underlying issue is that clean_value should be removing File: and Image: links. If I add Fichier: to the list where they are skipped, the image is returned as an empty string, which is technically a fix.

We need to add new data for each language code again to take into account localizations of File and Image.
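As a rough sketch of what such per-language data could look like: the table and the function name below are hypothetical and are not part of wiktextract or wikitextprocessor. Only a few languages are listed, and a real table would more likely be generated from each wiki's site info than written by hand.

```python
# Hypothetical table: localized names of the File/Image namespace per language code.
# The canonical English names "File" and "Image" are accepted on every wiki.
FILE_NAMESPACE_PREFIXES = {
    "en": ("File", "Image"),
    "fr": ("Fichier", "Image", "File"),
    "de": ("Datei", "Bild", "File", "Image"),
}


def is_file_link(target: str, lang_code: str) -> bool:
    """Return True if a link target such as 'Fichier:Foo.jpg' is in the file namespace."""
    # Unknown language codes fall back to the canonical English names.
    prefixes = FILE_NAMESPACE_PREFIXES.get(lang_code, ("File", "Image"))
    namespace, sep, _ = target.strip().partition(":")
    if not sep:
        # No colon at all, so this is an ordinary page link.
        return False
    return namespace.strip().casefold() in {p.casefold() for p in prefixes}


print(is_file_link("Fichier:Lycium shawii.jpg", "fr"))
print(is_file_link("Ikhwan (Arabie saoudite)", "fr"))
```

With a lookup like this, the place that currently skips File: and Image: links could ask for the prefixes of the current lang_code instead of using a hard-coded list.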

The pipe is caused by a broken regex:

        print(f"Before repl_link_bars: {title=}")
        title = re.sub(
            r"(?s)\[\[\s*([^][|<>]+?)\s*\|"
            r"\s*(([^][|]|\[[^]]*\])+?)"
            r"(\s*\|\s*(([^]|]|\[[^]]*\])+?))*\s*\]\]",
            repl_link_bars,
            title,
        )
        print(f"After repl_link_bars: {title=}")

that I'm going to take a look at now. It was never caught because File and Image links were being discarded.
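For comparison, here is a minimal sketch of a non-regex way to split link arguments that stays correct when a caption contains nested links. This is not the fix that was committed; split_top_level_pipes is a hypothetical helper that only illustrates why tracking bracket depth avoids the stray pipe.

```python
def split_top_level_pipes(inner: str) -> list[str]:
    """Split the text between the outer [[ and ]] on '|' characters
    that are not inside a nested [[...]] link."""
    parts = []
    current = []
    depth = 0  # nesting depth of [[...]] inside the outer link
    i = 0
    while i < len(inner):
        pair = inner[i:i + 2]
        if pair == "[[":
            depth += 1
            current.append(pair)
            i += 2
        elif pair == "]]" and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif inner[i] == "|" and depth == 0:
            # A pipe at depth 0 separates two arguments of the outer link.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    parts.append("".join(current))
    return parts


inner = "Fichier:Ikhwan.jpg|vignette|gauche|Troupes des [[Ikhwan (Arabie saoudite)|Ikhwâns]]."
for part in split_top_level_pipes(inner):
    print(repr(part))
```

The pipe inside [[Ikhwan (Arabie saoudite)|Ikhwâns]] sits at depth 1, so it stays attached to the caption instead of being treated as another argument separator.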

clean_value discards image links because they're not part of the normal text body; they are rendered in a floating div of their own.

This should be handled in clean_node: either by using collect_links to collect image and file links, or by adding a new parameter that collects image alt text, which is then put into sense_data (a misnomer here, because we're handling Wikipedia).
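To make the second option more concrete, here is a standalone sketch of pulling file links out of the body text while keeping their captions in a separate list. It does not touch the real clean_node or clean_value; pop_file_links and its prefixes parameter are hypothetical, and it assumes, as a simplification, that the caption is the last top-level argument of the link.

```python
def pop_file_links(text, prefixes=("File", "Image", "Fichier")):
    """Return (text_without_file_links, captions) for raw wikitext."""
    wanted = tuple(p.casefold() + ":" for p in prefixes)
    kept, captions = [], []
    i = 0
    while i < len(text):
        if not text.startswith("[[", i):
            kept.append(text[i])
            i += 1
            continue
        # Scan to the matching ]] while tracking nesting depth and the
        # position of the last pipe that belongs to the outer link.
        depth, j, last_pipe = 0, i, -1
        while j < len(text):
            if text.startswith("[[", j):
                depth += 1
                j += 2
            elif text.startswith("]]", j):
                depth -= 1
                j += 2
                if depth == 0:
                    break
            else:
                if text[j] == "|" and depth == 1:
                    last_pipe = j
                j += 1
        if depth != 0:
            # Unbalanced brackets: keep the rest verbatim and stop.
            kept.append(text[i:])
            break
        inner = text[i + 2:j - 2]
        if inner.lstrip().casefold().startswith(wanted):
            # File link: drop it from the body and keep only the caption.
            if last_pipe != -1:
                captions.append(text[last_pipe + 1:j - 2].strip())
        else:
            # Ordinary link: leave it in place for later cleaning.
            kept.append(text[i:j])
        i = j
    return "".join(kept), captions


body, captions = pop_file_links(
    "[[Fichier:Ikhwan.jpg|vignette|gauche|Troupes des "
    "[[Ikhwan (Arabie saoudite)|Ikhwâns]].]] Voir [[Médine]]."
)
print(repr(body))
print(captions)
```

The collected captions are still raw wikitext, so they would need the same link cleaning as the body before being stored anywhere.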

@kristian-clausal
Collaborator

The underlying issue here is, in retrospect, that clean_node and clean_value are Wiktextract-specific implementations. They're not from wikitextprocessor! So technically they would need to be forked for Wikipedia...

@kristian-clausal
Collaborator

The Ikhwâns example is still broken, looking into it. It's the regex, it usually is...

@kristian-clausal
Collaborator

kristian-clausal commented Feb 12, 2024

This should now be fixed, but I haven't implemented any changes to the issue with Fichier:. The commit just fixes the issue with badly cleaned values, so I'm leaving this open.

Do not trust that the output of Fichier links will be there in the future!

@kristian-clausal
Collaborator

Forgot to close this.


2 participants