sentiteam/wikipedia-crawler

wikipedia-crawler

Extracts plain text from a series of Wikipedia articles and saves it to a local text file.

The goal is to collect text samples of a specific language on a specific topic, which can be used for computational analysis in linguistics (word frequency, distribution, etc.), or to generate word lists for any of the 294 languages on Wikipedia.

Usage:

python3 wikipedia-crawler.py https://en.wikipedia.org/wiki/Biology

Generates output.txt, extracting only this single article. Add parameters to crawl further:

--output=biology.txt --articles=10 --interval=5

Generates biology.txt, crawling 10 articles related to Biology. The request interval is set to 5 seconds (the default). A session log containing all visited URLs is saved as session_biology.txt. Running again with the same output file reuses the same session file.
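The resume behavior can be sketched roughly as follows. This is not the actual crawler code, just an illustration of the idea: the session file stores one visited URL per line, so a rerun with the same output name can skip articles it has already fetched (the function names and file layout here are assumptions).

```python
from pathlib import Path

def load_session(output_name: str) -> set:
    """Return the set of URLs already visited in a previous run, if any."""
    session_file = Path(f"session_{output_name}")
    if session_file.exists():
        return set(session_file.read_text().splitlines())
    return set()

def mark_visited(output_name: str, url: str) -> None:
    """Append a freshly crawled URL to the session log."""
    with open(f"session_{output_name}", "a") as f:
        f.write(url + "\n")

# Usage: record a visit, then resume and check it is skipped.
mark_visited("biology.txt", "https://en.wikipedia.org/wiki/Biology")
visited = load_session("biology.txt")
print("https://en.wikipedia.org/wiki/Biology" in visited)  # True
```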

In this example the initial article is Biology; the crawler then continues extracting related pages: Natural Science, Evolution, ...
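The extraction step above — pulling plain text plus links to related articles out of a page — can be sketched with the standard library's html.parser. This is a simplified illustration, not the crawler's implementation; the heuristic of following only `/wiki/...` links without a colon (to skip namespaced pages like `File:` or `Help:`) is an assumption.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect paragraph text and internal article links from Wikipedia HTML."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []  # plain-text samples for the output file
        self.links = []       # candidate related articles to crawl next

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")
        elif tag == "a":
            href = dict(attrs).get("href", "")
            # Internal article links look like /wiki/Title; skip
            # namespaced pages such as /wiki/File:... (assumption).
            if href.startswith("/wiki/") and ":" not in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data

# Usage on a tiny HTML fragment:
html = '<p>Biology is the study of <a href="/wiki/Life">life</a>.</p>'
parser = ArticleParser()
parser.feed(html)
print(parser.paragraphs)  # ['Biology is the study of life.']
print(parser.links)       # ['/wiki/Life']
```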

Dependencies:

pip install -r requirements.txt
