Added multilingual support for the Wikipedia reader. #12616

Merged 2 commits on Apr 6, 2024
@@ -1,4 +1,5 @@
"""Simple reader that reads wikipedia."""

from typing import Any, List

from llama_index.core.readers.base import BasePydanticReader
@@ -27,15 +28,22 @@ def __init__(self) -> None:
     def class_name(cls) -> str:
         return "WikipediaReader"

-    def load_data(self, pages: List[str], **load_kwargs: Any) -> List[Document]:
+    def load_data(
+        self, pages: List[str], lang_code: str = "en", **load_kwargs: Any
+    ) -> List[Document]:
         """Load data from the input directory.

         Args:
             pages (List[str]): List of pages to read.

+            lang_code (str): Language code for Wikipedia. Defaults to English. Valid Wikipedia language codes
+            can be found at https://en.wikipedia.org/wiki/List_of_Wikipedias.
         """
         import wikipedia

+        if lang_code.lower() != "en":
+            # Sets, without checking the validity of, the language code for Wikipedia.

Contributor

Can we add some validation here? It looks like we can (and maybe should) use wikipedia.languages() to check whether the supplied lang_code is actually supported, and raise an error if it isn't.

Contributor Author

@nerdai thanks for your comment. That sounds like a good idea. How about the following?

if lang_code.lower() in wikipedia.languages().keys():
    wikipedia.set_lang(lang_code.lower())
else:
    raise SomeError(with_a_message)

What error would you like to raise? wikipedia.exceptions.WikipediaException? The error message should be something like "The provided language prefix for Wikipedia is not supported. Check supported languages at https://en.wikipedia.org/wiki/List_of_Wikipedias."

Alternatively, would you like it to silently fall back to "en" if the prefix is not supported instead of raising an exception?

Since I use the word prefix instead of code in the error message to be consistent with the wikipedia package, I will also change the function argument from lang_code to lang_prefix.

Looks good?
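
The two behaviours weighed in this comment, raising on an unsupported prefix versus silently falling back to "en", can be compared in a small sketch. The helper name resolve_lang_prefix, its strict flag, and the hard-coded SUPPORTED mapping are hypothetical and exist only for illustration; in the reader itself the mapping would come from wikipedia.languages(), which needs network access.

```python
from typing import Mapping

# Hypothetical stand-in for the mapping returned by wikipedia.languages();
# the real call queries Wikipedia over the network.
SUPPORTED: Mapping[str, str] = {"en": "English", "de": "Deutsch", "fr": "français"}


def resolve_lang_prefix(
    lang_prefix: str, supported: Mapping[str, str], strict: bool = True
) -> str:
    """Return the lowercased prefix if supported.

    For an unsupported prefix: raise ValueError when strict is True,
    otherwise fall back to "en".
    """
    normalized = lang_prefix.lower()
    if normalized in supported:
        return normalized
    if strict:
        raise ValueError(
            f"Language prefix '{lang_prefix}' for Wikipedia is not supported."
        )
    return "en"
```

Raising makes a typo such as "ge" visible immediately, whereas the fallback would quietly return English pages for it.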

Contributor

Yeah, we can raise a ValueError here. The message you suggested seems reasonable to me.

lang_prefix > lang_code 🙏

Thanks!

Contributor Author

Thanks, I think the following is better than checking wikipedia.languages().keys():

if lang_prefix.lower() in wikipedia.languages():
    wikipedia.set_lang(lang_prefix.lower())
else:
    raise ValueError(
        f"Language prefix '{lang_prefix}' for Wikipedia is not supported. Check supported languages at https://en.wikipedia.org/wiki/List_of_Wikipedias."
    )
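
The .keys() call can be dropped because, for a dict, the in operator already tests membership among the keys, so both spellings give the same answer. A minimal illustration, using a small hand-written dict as an assumed stand-in for the prefix-to-name mapping that wikipedia.languages() returns:

```python
# Assumed stand-in for the prefix-to-name dict returned by wikipedia.languages().
languages = {"en": "English", "de": "Deutsch"}

# `in` on a dict checks its keys, so the explicit .keys() call is redundant.
direct = "de" in languages
via_keys = "de" in languages.keys()

# Neither form searches the values.
value_hit = "Deutsch" in languages
```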

Contributor

Agreed!

Contributor Author

@nerdai thanks, code pushed.

+            wikipedia.set_lang(lang_code)

         results = []
         for page in pages:
             wiki_page = wikipedia.page(page, **load_kwargs)
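
One point worth checking in this hunk: wikipedia.set_lang changes a module-level setting, so the chosen language persists across calls. With the != "en" guard shown in the diff, a call for "de" followed by a call that relies on the default "en" would apparently skip set_lang and keep reading German pages. The sketch below reproduces that behaviour offline with a hypothetical FakeWikipedia stub (not the real package); load_guarded and load_always are illustrative names, not part of the reader.

```python
class FakeWikipedia:
    """Hypothetical stub that mimics a persistent, module-level language setting."""

    def __init__(self) -> None:
        self.lang = "en"

    def set_lang(self, prefix: str) -> None:
        self.lang = prefix.lower()


def load_guarded(wiki: FakeWikipedia, lang_code: str = "en") -> str:
    # Mirrors the guard in the diff: the language is only set when not English.
    if lang_code.lower() != "en":
        wiki.set_lang(lang_code)
    return wiki.lang


def load_always(wiki: FakeWikipedia, lang_code: str = "en") -> str:
    # Alternative: always set the language, so the default resets earlier state.
    wiki.set_lang(lang_code)
    return wiki.lang


wiki = FakeWikipedia()
first = load_guarded(wiki, "de")
second = load_guarded(wiki)  # default "en", but the guard skips set_lang
third = load_always(wiki)
```

In this stub, second still reports "de" while third reports "en", which suggests setting the language unconditionally once the prefix has been validated.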
@@ -28,7 +28,7 @@ license = "MIT"
maintainers = ["jerryjliu"]
name = "llama-index-readers-wikipedia"
readme = "README.md"
-version = "0.1.3"
+version = "0.1.4"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"