OpenWebText

This project is a clone of the GPT-2 WebText dataset as outlined in the OpenAI paper. It is still very much a work in progress.

Huge thanks to jcpeterson for letting me use his download code. His version of OpenWebText is super well written, so please check it out!

Dependencies

Pipenv and Python 3.

To install python dependencies:

pipenv install

Newspaper Dependencies:

On Ubuntu:

sudo apt-get install libxml2-dev libxslt-dev

On OS X:

brew install libxml2 libxslt

Usage

Run the two scripts in order:

pipenv run python get_urls.py

pipenv run python download.py

Resulting files will be deposited in data/ with format {domain}-{sha256 hash of url}.txt.
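
As a rough illustration of that naming scheme, here is a minimal sketch of how such a file path could be built. The function name `output_path` is hypothetical, and the sketch assumes that "domain" means the URL's network location and that the hash is the hex digest of the UTF-8 encoded URL; the actual logic in download.py may differ.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def output_path(url: str, data_dir: str = "data") -> Path:
    """Build a path of the form data/{domain}-{sha256 hash of url}.txt."""
    # Assumption: "domain" is the URL's network location, e.g. "example.com".
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex digest of the UTF-8 encoded URL.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(data_dir) / f"{domain}-{url_hash}.txt"


if __name__ == "__main__":
    print(output_path("https://example.com/some/article"))
```

Because the hash is derived from the full URL, two pages from the same domain get distinct file names, and the same URL always maps to the same file.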

Enjoy!
