An obvious source of decent quality, freely usable sentences is Wikipedia/Wiktionary. It would not be that difficult to download a database dump and extract them. And it would be useful to have non-fictional sentences to complement the existing fictional ones.
@rsimmons if you ever do this, I happen to work at Wikipedia and can give some pointers. Just off the top of my head:
The dumps are in XML. You will probably want to filter on the namespace <ns>0</ns>, since that indicates an article page rather than something like a user page or a category page.
The actual text is in a format called Wikitext. It's a bit of a pain to extract sentences from it directly, so it's better to use a Wikitext parser[2] to turn it into HTML and then grab the text from the p tags.
You can get well-written articles by filtering the Wikitext for {{Good article}} or {{Featured article}}. This means virtually no vandalism, at the cost of eliminating 99.86% of all articles, because the requirements for a good article are strict. That still leaves a good 1900 articles, though. There are more lenient tags in en-wiki, but it seems they're not used in ja-wiki.
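For what it's worth, the namespace and template filters above could be sketched roughly like this, using only the Python standard library. The function names (`iter_articles`, `local_name`), the tiny `SAMPLE` dump, and the export namespace URI in it are made up for illustration, and the template markers are plain substring matches, so the exact strings would need checking against the wiki you're actually processing. It doesn't attempt the Wikitext-to-HTML step at all.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_file, required_templates=()):
    """Yield (title, wikitext) for main-namespace (<ns>0</ns>) pages.

    required_templates is a hypothetical convenience parameter: if it is
    non-empty, only pages whose wikitext contains at least one of the
    given marker strings are yielded.
    """
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Collect the first <title>, <ns> and <text> found under this page.
        fields = {}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in ("title", "ns", "text") and name not in fields:
                fields[name] = child.text or ""
        elem.clear()  # drop this page's children once they have been read
        if fields.get("ns", "").strip() != "0":
            continue  # user pages, categories, etc.
        text = fields.get("text", "")
        if required_templates and not any(t in text for t in required_templates):
            continue
        yield fields.get("title", ""), text


# A made-up miniature dump, only to show the shape of the input.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Tokyo</title><ns>0</ns>
    <revision><text>{{Good article}} Tokyo is the capital of Japan.</text></revision></page>
  <page><title>User:Example</title><ns>2</ns>
    <revision><text>A user page.</text></revision></page>
  <page><title>Kyoto</title><ns>0</ns>
    <revision><text>Kyoto is a city in Japan.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    stream = io.BytesIO(SAMPLE.encode("utf-8"))
    for title, _text in iter_articles(stream, ("{{Good article}}", "{{Featured article}}")):
        print(title)
```

For a real dump you would pass an open file instead of the `BytesIO`, for example the object returned by `bz2.open(path, "rb")` for a compressed dump. The parser reads the dump one page at a time instead of loading it all at once, which matters because the full dumps are large.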