How to download the newest version of dataset without duplicate files? #15

Closed
qiaogh97 opened this issue Nov 10, 2021 · 2 comments

Comments

@qiaogh97

Hi, @rom1504
I know there are three versions of the parquet files as below.

| Version | Parquet file size | Hash value | Total size (samples) |
| --- | --- | --- | --- |
| 1.0 | 1.6G | 5b54c5d5 | 400 million |
| 2.0 | 3.6G | 03f11a48 | 800 million |
| 3.0 | 4.9G | f27692e1 | 1.1 billion |

So I wonder whether the parquet files in the different versions are in one-to-one correspondence.
I have downloaded the 400 million version of the dataset. What should I do if I'd like to download the newest version without downloading the files I already have again?

@rom1504
Owner

rom1504 commented Nov 10, 2021

Hi,
All three versions you mention are free of duplicates, and each one is a subset of the next, i.e. version 2 contains version 1, and version 3 contains version 2.

Only the 400M version (the first one) is properly released by us (that's the one we call laion400m) and you can get it from https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/ or https://www.kaggle.com/romainbeaumont/laion400m

The other two versions you mention are work in progress and are not yet fully ready for use (for example, versions 2 and 3 are not fully randomly shuffled, unlike version 1, and random shuffling is an important property for using the dataset).
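
For anyone who still wants to experiment with a larger work-in-progress version, here is a minimal sketch of how the nesting described above could be used to skip samples that were already downloaded. It assumes each metadata row carries the image URL under a column named `URL` (an assumption, check the actual parquet schema), and the function name `rows_not_yet_downloaded` and the example rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def rows_not_yet_downloaded(
    old_rows: Iterable[Dict[str, str]],
    new_rows: Iterable[Dict[str, str]],
    key: str = "URL",  # assumed name of the column holding the image URL
) -> Iterator[Dict[str, str]]:
    """Yield rows of the newer metadata whose key is absent from the older one."""
    # Collect every key present in the version that was already downloaded.
    already_have = {row[key] for row in old_rows}
    # Keep only the rows of the newer version that are not in that set.
    for row in new_rows:
        if row[key] not in already_have:
            yield row


# Tiny hypothetical stand-ins for rows read from the two versions' parquet files.
v1 = [{"URL": "http://example.com/a.jpg"}, {"URL": "http://example.com/b.jpg"}]
v2 = v1 + [{"URL": "http://example.com/c.jpg"}]

to_fetch = list(rows_not_yet_downloaded(v1, v2))
print(to_fetch)
```

Holding hundreds of millions of URLs in an in-memory set takes a lot of RAM, so at full scale this would probably need to work on hashed keys or be done chunk by chunk. The sketch only illustrates the set-difference idea.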

We will release a larger version of the dataset, with a few billion samples, in a few months.

Do you have any deadlines or planned uses for the larger dataset (larger than 400M) on your side?

@qiaogh97
Author

It doesn't matter, I don't have any deadlines.
