Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
GitHub is where the world builds software
Millions of developers and companies build, ship, and maintain their software on GitHub — the largest and most advanced development platform in the world.
Here’s a download link for all of bookcorpus as of Sept 2020 #27
You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21
it contains 18k plain text files. The results are very high quality. I spent about a week fixing the epub2txt script, which you can find at https://github.com/shawwn/scrap named “epub2txt-all”. (not epub2txt.)
The new script:
The tarball download link (see tweet above) also includes the original ePub URLs, updated for September 2020, which ended up being about 2k more than the URLs in this repo. But they’re hard to crawl. I do have the epub files, but I’m reluctant to distribute them for obvious reasons.
Sorry for the download problems. It should be fixed now. My server was running out of space due to 128GB of google cloud logs.
Ideally the zip file could be mirrored elsewhere. I'd set up a torrent, but I've never done that before. If someone has a good walkthrough, feel free to link it, else I'll research it someday.
@shawwn This seems excellent and I can't wait to snag a copy of the files!
Unfortunately I'm running into failed downloads now as well (likely due to log proliferation again I'd presume -- incidentally, while I know nothing about setting up torrents, I'd be happy to help out with a stop-gap scripted daemon that cleans logs to keep them in check if that appeals).
@SeanVody and everyone else:
I am delighted to announce that, in cooperation with the-eye.eu, bookcorpus now has a reliable, stable download link that I expect will work for years to come:
(It's bit-for-bit identical to the file in my original tweet.)
However, anyone who is looking for bookcorpus will undoubtedly be interested in everything else. I urge you to take a peek: https://the-eye.eu/public/AI/pile_preliminary_components
In addition to bookcorpus (books1.tar.gz), it also has:
This is possible thanks to two organizations. First and foremost, thank you to the-eye.eu. They have a wonderful community (see discord), and they are extremely interested in archiving data for the benefit of humanity.
Secondly, thank you to "The Pile", which is the project that has been meticulously gathering and preparing this training data. Join their discord if you're interested in ML: https://www.eleuther.ai/get-involved
You now have OpenAI-grade training data at your fingertips; do with it as you please.
books3.tar.gz seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it's "all of libgen", but it's purely conjecture. Nonetheless, books3 is "all of bibliotik", which is possibly useful to anyone doing NLP work.
I have tried to carefully and rigorously prepare the data in books3; e.g. all of the files are already preprocessed with
If you have high quality datasets that you wish to make available to ML researchers, please DM me (@theshawwn) or reach out to The Pile.
@jorditg It's mostly English, but if anyone discovers a trove of foreign .epub files, please DM me. I am quite interested in doing various foreign language versions.
By the way, you can use the epub to txt converter on your own .epub files. I would be curious if it works well enough on foreign epubs, since sadly I speak only southern Texas, ya'll.
Small correction @shawwn - it is