
Here’s a download link for all of bookcorpus as of Sept 2020 #27

Open

shawwn opened this issue Sep 5, 2020 · 11 comments
shawwn commented Sep 5, 2020

You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21

It contains 18k plain text files. The results are very high quality. I spent about a week fixing the epub2txt script, which you can find at https://github.com/shawwn/scrap under the name “epub2txt-all” (not epub2txt).

The new script:

  1. Correctly preserves structure, matching the table of contents very closely;

  2. Correctly renders tables of data (by default html2txt produces mostly garbage-looking results for tables);

  3. Correctly preserves code structure, so that source code and similar things are visually coherent;

  4. Converts escaped numbered lists from “1\.” back to “1.”;

  5. Runs the full text through ftfy.fix_text() (which is what OpenAI does for GPT), replacing Unicode apostrophes with ASCII apostrophes;

  6. Expands Unicode ellipses to “...” (three separate ASCII characters). See the sketch after this list for what steps 4–6 amount to.
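
Here is a minimal Python sketch of that normalization pass, assuming ftfy is installed (pip install ftfy). The function name and exact replacements are illustrative, not the actual epub2txt-all code:

import re
import ftfy

def normalize_book_text(text):
    # Fix mojibake and normalize Unicode. With its default settings,
    # ftfy.fix_text() also uncurls smart quotes, which covers the
    # Unicode-apostrophe replacement described above.
    text = ftfy.fix_text(text)
    # Expand the Unicode ellipsis into three separate ASCII periods.
    text = text.replace("\u2026", "...")
    # Un-escape numbered lists: "1\." becomes "1."
    text = re.sub(r"(?m)^(\s*\d+)\\\.", r"\1.", text)
    return text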

The tarball download link (see tweet above) also includes the original ePub URLs, updated for September 2020, which ended up being about 2k more than the URLs in this repo. But they’re hard to crawl. I do have the epub files, but I’m reluctant to distribute them for obvious reasons.

soskek (Owner) commented Sep 5, 2020

@shawwn Excellent work! It looks great, and I've added a reference to it in this repo's README!

ZonglinY commented Sep 24, 2020

@shawwn Thanks for your efforts! However, I run into a 'network error' when using the link. Has anyone succeeded in downloading it?

richarddwang commented Sep 30, 2020

@shawwn This is exciting! But I also encountered a failed download.

shawwn (Author) commented Oct 2, 2020

@ZonglinY @richarddwang

Sorry for the download problems. It should be fixed now; my server was running out of space due to 128GB of Google Cloud logs.

Ideally the tarball could be mirrored elsewhere. I'd set up a torrent, but I've never done that before. If someone has a good walkthrough, feel free to link it; otherwise I'll research it someday.

SeanVody commented Oct 19, 2020

@shawwn This seems excellent and I can't wait to snag a copy of the files!

Unfortunately, I'm running into failed downloads now as well (likely due to log proliferation again, I'd presume). Incidentally, while I know nothing about setting up torrents, I'd be happy to help out with a stop-gap scripted daemon that cleans up the logs to keep them in check, if that appeals.

shawwn (Author) commented Oct 25, 2020

@SeanVody and everyone else:

I am delighted to announce that, in cooperation with the-eye.eu, bookcorpus now has a reliable, stable download link that I expect will work for years to come:

https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz

(It's bit-for-bit identical to the file in my original tweet.)

However, anyone who is looking for bookcorpus will undoubtedly be interested in everything else hosted there. I urge you to take a peek: https://the-eye.eu/public/AI/pile_preliminary_components

In addition to bookcorpus (books1.tar.gz), it also has:

  • books3.tar.gz (37GB), aka "all of bibliotik in plain .txt form", aka 197,000 books processed in exactly the same way as I did for bookcorpus here. So basically 11x bigger.

  • github.tar (100GB), a huge amount of code for training purposes

  • Many other delightful datasets, all of which are extremely high quality.

This is possible thanks to two organizations. First and foremost, thank you to the-eye.eu. They have a wonderful community (see their Discord), and they are extremely interested in archiving data for the benefit of humanity.

Secondly, thank you to "The Pile", the project that has been meticulously gathering and preparing this training data. Join their Discord if you're interested in ML: https://www.eleuther.ai/get-involved

You now have OpenAI-grade training data at your fingertips; do with it as you please.

books3.tar.gz seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect books2 is "all of libgen", but that's purely conjecture. Nonetheless, books3 is "all of bibliotik", which is possibly useful to anyone doing NLP work.

I have tried to carefully and rigorously prepare the data in books3; e.g. all of the files are already preprocessed with ftfy.fix_text(), as OpenAI does.

If you have high quality datasets that you wish to make available to ML researchers, please DM me (@theshawwn) or reach out to The Pile.

jorditg commented Oct 26, 2020

Great!

Do we have any information about the language percentages of the dataset, or should it be considered a mainly-English dataset?

shawwn (Author) commented Oct 26, 2020

@jorditg It's mostly English, but if anyone discovers a trove of foreign .epub files, please DM me. I am quite interested in doing various foreign language versions.

By the way, you can use the epub-to-txt converter on your own .epub files. I would be curious whether it works well enough on foreign epubs, since sadly I speak only southern Texas, y'all.

Aurametrix added a commit to Aurametrix/Alg that referenced this issue Oct 28, 2020
books
turnkit commented Nov 7, 2020

+1 torrent.

shawwn (Author) commented Nov 17, 2020

Happy to announce that bookcorpus was just merged into huggingface's Datasets library as bookcorpusnew, thanks to @vblagoje: huggingface/datasets#856

vblagoje commented Nov 21, 2020

Small correction, @shawwn: it is bookcorpusopen. Whoever wants to use Shawn's bookcorpus in HuggingFace Datasets simply has to:

from datasets import load_dataset
d = load_dataset('bookcorpusopen', split="train")

And then continue to use the dataset d like any other HF dataset. See the manual for more details, or the dataset card for this version of bookcorpus.
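
For example, a quick sanity check after loading (a sketch; the "title" and "text" column names below are the ones listed on the bookcorpusopen dataset card):

from datasets import load_dataset

d = load_dataset('bookcorpusopen', split="train")
print(d)                   # prints the number of rows and the column names
print(d[0]["title"])       # each example is one book: its title...
print(d[0]["text"][:200])  # ...and the full text (first 200 characters shown)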
