Python Urban Dictionary Scraper (aka ubscrape)
This Python script tries to scrape and store every word and definition from Urban Dictionary.
$ . venv/bin/activate
$ pip install -r requirements.txt
$ python ubscrape-runner.py # [args]
pip install ubscrape
ubscrape --help # shows options
ubscrape --scrape # begins scraping all of Urban Dictionary, starting by adding words to the database
ubscrape --define hello # defines hello and prints it; also verifies your network connection
ubscrape --define-all # begins defining all words that are stored locally
ubscrape --dump # dumps all existing definitions to .json files
ubscrape --dump --out OUT # specifies an output directory for --dump
ubscrape --report # shows the progress in defining all the locally stored words
ubscrape --clear --force # deletes the locally stored words and definitions
How ubscrape Works
ubscrape goes through the page indices looking for every word (https://www.urbandictionary.com/browse.php?character=A, https://www.urbandictionary.com/browse.php?character=A&page=2, etc). ubscrape adds these words to a SQLite database in a
ubscrape goes through every row in the database and looks it up (https://www.urbandictionary.com/define.php?term=Magic%20Carpet%20Ride) and adds the definitions to a
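The two kinds of URL shown above can be built with the standard library. This is a minimal sketch, not ubscrape's actual code: `browse_url` and `define_url` are hypothetical helper names, and the URL patterns are inferred only from the example links above.

```python
from urllib.parse import quote, urlencode

BASE = "https://www.urbandictionary.com"


def browse_url(character: str, page: int = 1) -> str:
    """Index page for one starting character (hypothetical helper)."""
    params = {"character": character}
    if page > 1:
        # In the example links, the first page carries no page parameter.
        params["page"] = page
    return f"{BASE}/browse.php?{urlencode(params)}"


def define_url(term: str) -> str:
    """Definition page for one term (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, matching the example link.
    return f"{BASE}/define.php?{urlencode({'term': term}, quote_via=quote)}"
```

Calling `define_url("Magic Carpet Ride")` reproduces the example link above, and `browse_url("A", 2)` reproduces the second index link.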
When ubscrape has added every definition for a word, it flags the word as complete and moves on to the next word.
When every word in ubscrape is complete, it dumps the SQLite database to JSON. Each letter gets its own folder, and its definitions are grouped into files of roughly 50 MB each. Each file is named after the first and last word it contains (firstword-lastword.json).
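The grouping step could look roughly like the sketch below. It is an illustration under assumptions, not the project's implementation: `group_by_size` is a hypothetical name, entries are assumed to arrive as sorted `(word, definitions)` pairs, and "50 MB" is taken to mean 50 * 1024 * 1024 bytes of serialized JSON.

```python
import json


def group_by_size(entries, max_bytes=50 * 1024 * 1024):
    """Split (word, definitions) pairs into groups of roughly max_bytes each.

    Returns a list of (filename, {word: definitions}) tuples.
    """
    groups, current, size = [], [], 0
    for word, definitions in entries:
        # Approximate each entry's cost by its serialized size.
        entry_size = len(json.dumps({word: definitions}).encode("utf-8"))
        if current and size + entry_size > max_bytes:
            groups.append(current)
            current, size = [], 0
        current.append((word, definitions))
        size += entry_size
    if current:
        groups.append(current)
    # Name each file after the first and last word it holds.
    return [(f"{g[0][0]}-{g[-1][0]}.json", dict(g)) for g in groups]
```

Each returned dictionary can then be written with `json.dump` into the folder for its letter. Words containing characters that are not valid in file names (such as `/`) would need escaping first, which this sketch does not attempt.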
If ubscrape crashes or fails, it will restart and try to redo as little work as possible.
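A completion flag stored next to each word is what makes this kind of cheap restart possible. The sketch below shows the idea with the standard `sqlite3` module; the table name `word`, the columns `name` and `complete`, and both helper functions are assumptions made for illustration, not the schema ubscrape actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would use a file on disk
# Hypothetical schema: one row per word, plus a flag set once it is defined.
conn.execute(
    "CREATE TABLE word (name TEXT PRIMARY KEY, complete INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT OR IGNORE INTO word (name) VALUES (?)",
    [("hello",), ("Magic Carpet Ride",), ("hello",)],  # the duplicate is ignored
)


def next_incomplete(conn):
    """Return the next word still needing definitions, or None when done."""
    row = conn.execute(
        "SELECT name FROM word WHERE complete = 0 ORDER BY name LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_complete(conn, name):
    """Flag a word as fully defined so a restart skips it."""
    conn.execute("UPDATE word SET complete = 1 WHERE name = ?", (name,))
    conn.commit()
```

After a crash, a loop that keeps calling `next_incomplete` picks up at the first word not yet flagged, so only the word that was in progress is repeated.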
- Do we want examples as well as definitions?
- Add support for dumping at the same time as scraping, making it less linear.
Cannot take escaped unicode characters as input to
ubscrape --define \u2764\ufe0f does not work.
ubscrape --define ❤️❤️ DOES work.
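If the argument reaches Python as the literal text `\u2764\ufe0f` (for example when it is single-quoted in the shell), it is a 12-character ASCII string rather than the emoji. One possible workaround, sketched here as an assumption and not something ubscrape currently does, is to decode such sequences before the lookup; `decode_escapes` is a hypothetical helper.

```python
def decode_escapes(term: str) -> str:
    """Turn literal \\uXXXX sequences into the characters they name.

    Input that already contains non-ASCII characters, or that is not a
    valid escape sequence, is returned unchanged.
    """
    try:
        return term.encode("ascii").decode("unicode_escape")
    except UnicodeError:
        return term
```

Note that this also interprets other backslash sequences such as `\n`, so a term that genuinely contains a backslash would be altered.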
Cannot dump to JSON while it's defining words.
- Using multiprocessing pool
Time of 100 words:
real 0m13.341s
real 0m12.922s
real 0m12.606s
Time of 0 words (testing for initialization):
real 0m3.033s
real 0m3.171s
real 0m2.893s
~13 and ~3 seconds seem good enough for an estimate. Subtracting the ~3 seconds of initialization, 100 words take about 10 seconds, so 1.9 million words take about 0.19 million seconds.
0.19 * 10 ^ 6 seconds / 60 sec/min / 60 min/hr / 24 hr/day = ~2.2 days
I could run it on my laptop for 6 hours a day, or I could run it on the school computers and get it done in two days (checking twice a day on progress).
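The estimate above can be reproduced in a few lines. The inputs are the rounded timings and the 1.9 million word count quoted in this section; the variable names are only for illustration.

```python
SECONDS_PER_DAY = 60 * 60 * 24

startup_s = 3   # ~3 s measured with 0 words (initialization only)
batch_s = 13    # ~13 s measured with 100 words
per_100_words_s = batch_s - startup_s  # ~10 s of actual work per 100 words

total_words = 1_900_000
total_s = total_words / 100 * per_100_words_s
days = total_s / SECONDS_PER_DAY

print(f"{total_s:,.0f} s is about {days:.1f} days")
```

This gives 190,000 seconds, or about 2.2 days of continuous running.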
- Testing before building:
python -m ubscrape --version
python -m ubscrape --scrape # for a bit
python -m ubscrape --define hello
python -m ubscrape --define-all # for a bit
python -m ubscrape --dump
Bump version number in
Activate your virtual environment, make sure everything is installed.
python ubscrape/setup.py sdist bdist_wheel
Note to self
- Activate global environment with . ~/global_venv/bin/activate before the next step.
twine upload --repository-url https://test.pypi.org/legacy/ dist/* (test)
twine upload dist/* (real)
Download and test:
pip install -i https://test.pypi.org/simple/ ubscrape==0.5 (test)
pip install ubscrape (real)