A tool for extracting plain text and internal Wikipedia links from Wikipedia dumps

samuelbroscheit/wikiextractor-wikimentions

WikiExtractor for WikiMentions

This is a modified version of the great WikiExtractor that adds an option to extract the internal Wikipedia links from each article.

If you run the following command, with enwiki-XXXXXXXX-pages-articles1.xml-XXXXXXXX.bz2 replaced by the path to an actual dump file,

python WikiExtractor.py --json --filter_disambig_pages --processes 2 --collect_links enwiki-XXXXXXXX-pages-articles1.xml-XXXXXXXX.bz2 -o test

then each article's dictionary contains an additional field 'internal_links'. Please see this notebook for a HOWTO and a code snippet for reading the data.
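
As a rough illustration of reading the output, here is a minimal sketch. It assumes the output directory (test in the command above) contains plain-text files with one JSON object per line, as the upstream WikiExtractor writes with --json and no compression. The helper name iter_articles is hypothetical and not part of this repository, and the layout of the 'internal_links' value is not described here, so the sketch does not interpret it.

```python
import json
from pathlib import Path


def iter_articles(output_dir):
    """Yield one dict per article from every file under output_dir.

    Assumption: each file holds one JSON object per line (uncompressed
    --json output). Adjust if your output is laid out differently.
    """
    for path in sorted(Path(output_dir).rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)


def get_internal_links(article):
    """Return the 'internal_links' field, or None if it is absent.

    The structure of the value is left uninterpreted on purpose.
    """
    return article.get("internal_links")
```

Calling iter_articles("test") after the command above should then yield the article dictionaries one at a time, and get_internal_links(article) returns whatever was stored under 'internal_links' for that article.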

For the full README, please consult https://github.com/attardi/wikiextractor. Note, however, that I have not tested my modifications with options other than those shown above.

Languages

  • Python 93.7%
  • Jupyter Notebook 6.3%