Wikipedia #6

Open
richardjharris opened this issue Mar 27, 2022 · 1 comment

Comments

richardjharris commented Mar 27, 2022

An obvious source of decent-quality, freely usable sentences is Wikipedia/Wiktionary. It would not be that difficult to download a database dump and extract them, and it would be useful to have non-fictional sentences to complement the existing fictional ones.

tchin25 commented Sep 27, 2022

@rsimmons if you ever do this, I happen to work at Wikipedia and can give some pointers. Just off the top of my head:

  • The dumps are in XML. You will probably want to keep only pages with <ns>0</ns>, since that namespace indicates an article page and not something like a user page or a category page.
  • The actual text is in a format called Wikitext. It's a bit of a pain to extract sentences from directly, so it's better to use a Wikitext parser [2] to turn it into HTML and then grab the text from the p tags.
  • You can get well-written articles by filtering the Wikitext for {{Good article}} or {{Featured article}}. This means virtually no vandalism, at the cost of eliminating 99.86% of all articles, since the requirements for a good article are strict (that still leaves a good 1900 articles, though). There are more lenient tags in en-wiki, but it seems like they're not used in ja-wiki.
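
As a rough illustration of the pointers above, here is a minimal Python sketch using only the standard library. It is not code from this project. The names `iter_quality_articles`, `ParagraphText`, and `QUALITY_TEMPLATE` are made up for the example. It assumes the dump follows the usual `<page>` / `<ns>` / `<revision><text>` layout and that the quality templates appear literally as `{{Good article}}` or `{{Featured article}}`, which may not hold for ja-wiki. The Wikitext-to-HTML step is not shown; `ParagraphText` assumes the HTML has already been produced by a Wikitext parser.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: template names are taken verbatim from the comment above;
# other wikis (e.g. ja-wiki) may spell them differently.
QUALITY_TEMPLATE = re.compile(
    r"\{\{\s*(Good|Featured) article\s*(\||\}\})", re.IGNORECASE
)


def local_name(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_quality_articles(dump_file):
    """Yield (title, wikitext) for ns-0 pages that carry a quality template."""
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ns = text = None
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "ns":
                ns = node.text
            elif name == "text":
                text = node.text or ""
        # <ns>0</ns> marks an article page (not a user or category page).
        if ns == "0" and text and QUALITY_TEMPLATE.search(text):
            yield title, text
        # Drop the page's children so a large dump is not held in memory.
        elem.clear()


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element in rendered HTML."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)
```

To try it, open a decompressed dump in binary mode and loop over `iter_quality_articles(f)`. Splitting the collected paragraphs into sentences is a separate step and is left out here.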

If you have any questions, feel free to ask.
