
wikiextract

This package creates an artificial dataset of Hindi grammatical errors. It depends on the wikiextractor package from https://github.com/attardi/wikiextractor. It does the following:

  1. Downloads a "current" Hindi Wikipedia dump. Note that any corpus of Hindi sentences, such as HindiMonocorp, could be used instead, but we have not tried this.
  2. Extracts meaningful sentences from the Wikipedia dump.
  3. Taking these as 'correct' sentences, creates erroneous counterparts using heuristics and rule-based algorithms (see insert_error.py); a minimal illustrative sketch follows this list.
  4. For the error-correction task, the erroneous sentences serve as the source side and the correct sentences as the target side.
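
The corruption rules themselves live in insert_error.py; the following is only a minimal sketch of the general idea, using a made-up rule (confusing Hindi postpositions) rather than the package's actual heuristics:

    # Hypothetical sketch of step 3: corrupting a correct sentence with a
    # rule-based substitution. The real rules in insert_error.py differ.
    import random

    # Hindi postpositions that are plausibly confusable (illustrative only).
    POSTPOSITIONS = ["ने", "को", "से", "में", "पर"]

    def insert_error(sentence: str, p: float = 0.3) -> str:
        """Return a possibly-erroneous copy of a correct sentence by
        replacing one postposition with a different one."""
        tokens = sentence.split()
        candidates = [i for i, t in enumerate(tokens) if t in POSTPOSITIONS]
        if candidates and random.random() < p:
            i = random.choice(candidates)
            tokens[i] = random.choice([w for w in POSTPOSITIONS if w != tokens[i]])
        return " ".join(tokens)

    correct = "राम ने सीता को किताब दी"
    print(insert_error(correct))  # e.g. "राम से सीता को किताब दी"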

To run, simply:

  1. run bash setup.sh
  2. run bash run_hiwiki.sh
  3. optionally, convert the resulting hiwiki.augmented.edits file to your preferred format using the tools in scripts/

Note that you may need to update a link inside setup.sh, since Wikimedia deletes old dumps.

We do not consider spelling errors to be grammatical errors and therefore avoid generating them. Due to inadequacies in the POS tagger we use, however, some spelling errors are generated regardless; in fact, much of the code in insert_error.py is devoted to circumventing these inadequacies. A hypothetical example of such a guard is sketched below.
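
As an illustration only (the actual checks in insert_error.py are more involved), one way to avoid emitting non-words when a wrong POS tag triggers the wrong rule is to accept a corrupted token only if it appears in a vocabulary built from the extracted text:

    # Hypothetical guard against accidental spelling errors: only keep a
    # corrupted token if it is still a real word seen in the corpus.
    VOCAB = set()  # in practice, filled from the extracted Wikipedia text

    def load_vocab(corpus_path: str) -> None:
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                VOCAB.update(line.split())

    def safe_corrupt(token: str, corrupted: str) -> str:
        # If a rule produced a non-word (e.g. a wrong POS tag triggered
        # the wrong inflection), fall back to the original token.
        return corrupted if corrupted in VOCAB else token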

The output of run_hiwiki.sh is a file hiwiki.augmented.edits, which contains the parallel corpus of Hindi errors in the .edits format. You can convert it to whatever format you like using the tools in scripts/; an illustrative conversion is sketched below.
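
For instance, a converter to plain parallel source/target files might look like the following. The line layout assumed here (one tab-separated erroneous/correct pair per line) is an assumption for illustration, not the documented .edits format; use the converters in scripts/ for real work:

    # Illustrative converter only: assumes (hypothetically) one
    # tab-separated erroneous/correct pair per line of the .edits file.
    def edits_to_parallel(edits_path: str, src_path: str, tgt_path: str) -> None:
        with open(edits_path, encoding="utf-8") as edits, \
             open(src_path, "w", encoding="utf-8") as src, \
             open(tgt_path, "w", encoding="utf-8") as tgt:
            for line in edits:
                erroneous, correct = line.rstrip("\n").split("\t")
                src.write(erroneous + "\n")  # source side: erroneous sentence
                tgt.write(correct + "\n")    # target side: correct sentence

    edits_to_parallel("hiwiki.augmented.edits", "hiwiki.src", "hiwiki.tgt")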
