simple-english/WiktionaryCrawler

 
 

WiktionaryCrawler

This crawler parses Wiktionary pages (which, sadly, tend to be a mess) into the speling format, which can be used by programs that require these wordlists.

Dependencies

$ sudo pip install urlnorm

Depending on your language, you may need to install more dependencies.

Here is the list of language-specific dependencies:

  • zh (Chinese, simplified and traditional): sudo pip install mafan BeautifulSoup4
  • th (Thai): sudo pip install BeautifulSoup4
  • lo (Lao): sudo pip install BeautifulSoup4
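
To check which of these packages are still missing before a run, something like the following may help. It is only a sketch and not part of the crawler: the `missing_packages` helper and the two lookup tables are hypothetical, the import names (`bs4` for BeautifulSoup4, `mafan`, `urlnorm`) are assumed to match the pip packages listed above, and the snippet assumes Python 3.6 or later, which may not be the interpreter the crawler itself targets.

```python
import importlib.util

# pip package name -> importable module name (assumed mapping, taken from the list above)
BASE_DEPS = {"urlnorm": "urlnorm"}
LANG_DEPS = {
    "zh": {"mafan": "mafan", "BeautifulSoup4": "bs4"},
    "th": {"BeautifulSoup4": "bs4"},
    "lo": {"BeautifulSoup4": "bs4"},
}


def missing_packages(lang, is_installed=None):
    """Return the pip package names required for `lang` that cannot be imported.

    `is_installed` is a hypothetical hook (module name -> bool) so the check
    can be exercised without touching the real environment.
    """
    if is_installed is None:
        def is_installed(module):
            return importlib.util.find_spec(module) is not None

    # Every language needs the base dependency; some need extras on top.
    required = dict(BASE_DEPS)
    required.update(LANG_DEPS.get(lang, {}))
    return sorted(pkg for pkg, module in required.items() if not is_installed(module))


if __name__ == "__main__":
    for lang in ("zh", "th", "lo"):
        missing = missing_packages(lang)
        if missing:
            print(f"{lang}: sudo pip install {' '.join(missing)}")
        else:
            print(f"{lang}: all dependencies present")
```

Languages not listed in the table fall back to the base dependency only, which matches the statement that extra packages are needed just for some languages.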

Running the crawler

$ python main.py

That's all you have to do. All configuration is done in config.py.

General config

Refer to General config for more details.

How it works

Refer to How it works for more details.
