Scripts that extract a word corpus from OpenStreetMap, Wikipedia, and Wikidata, targeting South-East Asian and Indic languages.
- OpenStreetMap-derived data is licensed under the Open Data Commons Open Database License (ODbL). See https://www.openstreetmap.org/copyright
- Wikipedia-derived data is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA). See https://en.wikipedia.org/wiki/Wikipedia:Copyrights
- Wikidata-derived data is licensed under the Creative Commons CC0 License. See https://www.wikidata.org/wiki/Wikidata:Licensing
Download corpus with duplicates. These files contain only text that is not Latin, Greek, Cyrillic, or CJK, as defined in the file latin_greek_cyrillic_cjk.py.
- osm-corpus-with-duplicates.txt.zip (19M)
- wikipedia-corpus-with-duplicates.txt.zip (816M)
- wikidata-corpus-with-duplicates.txt.zip (118M)
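The authoritative definition of the excluded scripts lives in latin_greek_cyrillic_cjk.py and is not reproduced here. As a rough illustration only, a character-level check could be built on Unicode character names. The names `EXCLUDED_PREFIXES`, `is_excluded_char`, and `keep_word` are hypothetical, and the real script may well use explicit code point ranges instead.

```python
import unicodedata

# Assumption: Unicode name prefixes standing in for the excluded scripts.
EXCLUDED_PREFIXES = ("LATIN", "GREEK", "CYRILLIC", "CJK")


def is_excluded_char(ch: str) -> bool:
    """True if the character's Unicode name starts with an excluded script name."""
    name = unicodedata.name(ch, "")  # "" for characters without a name
    return name.startswith(EXCLUDED_PREFIXES)


def keep_word(word: str) -> bool:
    """True if the word has at least one letter and no letter from an excluded script."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and not any(is_excluded_char(ch) for ch in letters)
```

With this sketch, `keep_word("ಕನ್ನಡ")` is true, while `keep_word("hello")` and `keep_word("Москва")` are false.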
Download corpus for a single script without duplicates:
- Arabic
- Bengali
- Devanagari
- Gujarati
- Gurmukhi
- Kannada
  - osm-kannada-corpus.txt.zip (119K)
  - wikipedia-kannada-corpus.txt.zip (1.7M)
  - wikidata-kannada-corpus.txt.zip (293K)
- Khmer
  - osm-khmer-corpus.txt.zip (73K)
  - wikipedia-khmer-corpus.txt.zip (12M)
  - wikidata-khmer-corpus.txt.zip (137K)
- Malayalam
- Myanmar
- Oriya
  - osm-oriya-corpus.txt.zip (7.5K)
  - wikipedia-oriya-corpus.txt.zip (2.2M)
  - wikidata-oriya-corpus.txt.zip (167K)
- Tamil
  - osm-tamil-corpus.txt.zip (53K)
  - wikipedia-tamil-corpus.txt.zip (27M)
  - wikidata-tamil-corpus.txt.zip (688K)
- Telugu
If you need some other language or script, please open an Issue on GitHub...
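Once an archive is downloaded, it can be read without unpacking it to disk. The sketch below assumes each zip holds one or more UTF-8 `.txt` files with one word per line. That layout is an assumption, not something this README states, and `read_corpus_words` is a hypothetical helper name.

```python
import io
import zipfile


def read_corpus_words(zip_source):
    """Yield non-empty lines from every .txt member of a zip archive.

    zip_source may be a file path or a binary file-like object.
    Assumption: members are UTF-8 text with one word per line.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.endswith(".txt"):
                continue
            with archive.open(member) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    word = line.strip()
                    if word:
                        yield word
```

For example, `sum(1 for _ in read_corpus_words("osm-kannada-corpus.txt.zip"))` would count the words in the Kannada OpenStreetMap archive, provided the file follows the assumed layout.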
Download data sources:
cd osm/
python3 download.py
cd ../wikidata/
python3 download.py
cd ../wikipedia/
python3 download.py
Extract non-Latin/Greek/Cyrillic/CJK text from sources with:
python3 extract.py
Generate word corpus with duplicates with:
python3 generate_corpus.py
Filter the corpus for a single script:
python3 filter_by_script.py
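How filter_by_script.py selects words is not described here. One plausible approach, shown purely as a sketch, keeps words whose characters all fall inside a script's Unicode block and drops repeats, which matches the "without duplicates" per-script downloads. The `SCRIPT_RANGES` table and the function below are hypothetical; only the block ranges themselves come from the Unicode standard.

```python
# Unicode block ranges for a few of the scripts listed above.
SCRIPT_RANGES = {
    "kannada": (0x0C80, 0x0CFF),
    "khmer": (0x1780, 0x17FF),
    "oriya": (0x0B00, 0x0B7F),
    "tamil": (0x0B80, 0x0BFF),
}


def filter_by_script(words, script):
    """Return unique words made only of code points in the script's block.

    Words are returned in first-seen order.
    """
    low, high = SCRIPT_RANGES[script]
    seen = set()
    result = []
    for word in words:
        in_block = bool(word) and all(low <= ord(ch) <= high for ch in word)
        if in_block and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```

This strict check also rejects words containing zero-width joiners (U+200C, U+200D), which do occur in Indic text, so a real implementation would likely need to allow a few characters outside the block.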