sentiteam/wikipedia-crawler

wikipedia-crawler

Extracts plain text from a series of Wikipedia articles and saves it to a local text file.

The goal is to collect text samples of a specific language on a specific topic, which can be used for computational analysis in linguistics (word frequency, distribution, etc.), or to generate word lists for any of the 294 languages on Wikipedia.

Usage:

python3 wikipedia-crawler.py https://en.wikipedia.org/wiki/Biology

Generates output.txt, extracting only this single article. Add parameters to crawl further:

--output=biology.txt --articles=10 --interval=5

Generates biology.txt, crawling 10 articles related to Biology. The request interval is set to 5 seconds (the default). A session log containing all visited URLs is saved as session_biology.txt. Running again with the same output file reuses the same session file.
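The resume behavior can be sketched roughly as follows. This is not the actual crawler code, just an illustration of the idea: the session file stores one visited URL per line, so a rerun with the same output name can skip articles it has already fetched (the function names and file layout here are assumptions).

```python
from pathlib import Path

def load_session(output_name: str) -> set:
    """Return the set of URLs already visited in a previous run, if any."""
    session_file = Path(f"session_{output_name}")
    if session_file.exists():
        return set(session_file.read_text().splitlines())
    return set()

def mark_visited(output_name: str, url: str) -> None:
    """Append a freshly crawled URL to the session log."""
    with open(f"session_{output_name}", "a") as f:
        f.write(url + "\n")

# Usage: record a visit, then resume and check it is skipped.
mark_visited("biology.txt", "https://en.wikipedia.org/wiki/Biology")
visited = load_session("biology.txt")
print("https://en.wikipedia.org/wiki/Biology" in visited)  # True
```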

In this example the initial article is Biology; the crawler then continues extracting related pages: Natural Science, Evolution, ...
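The extraction step above — pulling plain text plus links to related articles out of a page — can be sketched with the standard library's html.parser. This is a simplified illustration, not the crawler's implementation; the heuristic of following only `/wiki/...` links without a colon (to skip namespaced pages like `File:` or `Help:`) is an assumption.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect paragraph text and internal article links from Wikipedia HTML."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []  # plain-text samples for the output file
        self.links = []       # candidate related articles to crawl next

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")
        elif tag == "a":
            href = dict(attrs).get("href", "")
            # Internal article links look like /wiki/Title; skip
            # namespaced pages such as /wiki/File:... (assumption).
            if href.startswith("/wiki/") and ":" not in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data

# Usage on a tiny HTML fragment:
html = '<p>Biology is the study of <a href="/wiki/Life">life</a>.</p>'
parser = ArticleParser()
parser.feed(html)
print(parser.paragraphs)  # ['Biology is the study of life.']
print(parser.links)       # ['/wiki/Life']
```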

Dependencies:

pip install -r requirements.txt
