Failed to download all of LAION-400M #277

Closed
zw615 opened this issue Feb 8, 2023 · 10 comments

Comments

@zw615

zw615 commented Feb 8, 2023

Hi, I have been trying to download LAION-400M using the instructions you provided, but the download is incomplete.
By a rough estimate, the success rate is about 0.83-0.85, so for a 400M-sample dataset I actually get 350M+ samples. Here is the content of a typical stats.json file:

"HTTP Error 404: Not Found": 594,
        "success": 8542,
        "HTTP Error 503: Service Temporarily Unavailable": 11,
        "HTTP Error 403: Forbidden": 141,
        "HTTP Error 503: Service Unavailable": 17,
        "<urlopen error [Errno 113] No route to host>": 8,
        "Use of image disallowed by X-Robots-Tag directive": 30,
        "HTTP Error 401: Unauthorized": 9,
        "<urlopen error [Errno -2] Name or service not known>": 139,
        "HTTP Error 400: Bad Request": 31,
        "HTTP Error 500: Internal Server Error": 12,
        "Image decoding error": 83,
        "HTTP Error 404: File Not Found": 14,
        "<urlopen error [Errno 111] Connection refused>": 10,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)>": 31,
        "HTTP Error 521: ": 4,
        "HTTP Error 530: ": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)>": 18,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn.twistarticle.com'. (_ssl.c:1131)>": 1,
        "Remote end closed connection without response": 5,
        "<urlopen error [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)>": 5,
        "URL can't contain control characters. '/tz/ItemImages/Games/Game Boy Advance/Mega%20Man%20Battle%20Network.jpg' (found at least ' ')": 1,
        "<urlopen error [Errno -5] No address associated with hostname>": 17,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn.webfronts.com'. (_ssl.c:1131)>": 2,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn1.freevap.ch'. (_ssl.c:1131)>": 1,
        "URL can't contain control characters. '/thumbnail.asp?file=assets/images/pfly images/sweater happy skull_thumbnail.jpg&maxx=150&maxy=0' (found at least ' ')": 1,
        "<urlopen error [Errno -3] Temporary failure in name resolution>": 62,
        "HTTP Error 523: ": 4,
        "'ascii' codec can't encode character '\\xf1' in position 44: ordinal not in range(128)": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'today.law.harvard.edu'. (_ssl.c:1131)>": 1,
        "HTTP Error 502: Bad Gateway": 3,
        "timed out": 15,
        "The read operation timed out": 42,
        "<urlopen error timed out>": 64,
        "HTTP Error 422: Unprocessable Entity": 3,
        "HTTP Error 403: Access Denied": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'preview.mp3mixx.com'. (_ssl.c:1131)>": 1,
        "HTTP Error 503: first byte timeout": 7,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.loccie.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1131)>": 5,
        "HTTP Error 415: Unsupported Media Type": 1,
        "<urlopen error EOF occurred in violation of protocol (_ssl.c:1131)>": 4,
        "OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:798: error: (-215:Assertion failed) !buf.empty() in function 'imdecode_'\n": 5,
        "HTTP Error 410: Gone": 9,
        "HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.\nThe last 30x error message was:\nMoved Permanently": 3,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.flowerschennai.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'r8zlusvr.rocketcdn.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1131)>": 3,
        "HTTP Error 503: Service Unavailable: Back-end server is at capacity": 3,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'dcspestcontrol.com'. (_ssl.c:1131)>": 1,
        "[Errno 104] Connection reset by peer": 1,
        "HTTP Error 308: Permanent redirect": 1,
        "URL can't contain control characters. '/les%20goupes/T/Triumph (CAN)/The Sport of Kings/The Sport of Kings.jpg' (found at least ' ')": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'partysurprise.co.za'. (_ssl.c:1131)>": 1,
        "HTTP Error 404: Not found": 1,
        "HTTP Error 404: The specified resource does not exist.": 1,
        "HTTP Error 404: ": 2,
        "<urlopen error [Errno 101] Network is unreachable>": 1,
        "HTTP Error 500: Domain Not Found": 1,
        "URL can't contain control characters. '/s/files/1/0035/0306/3153/products/JR_NaturalJacksFlatSandal_Midnight_B_512x512@2x.jpg?v=1574368035 2x' (found at least ' ')": 1,
        "HTTP Error 503: Backend is unhealthy": 1,
        "HTTP Error 404: The specified blob does not exist.": 2,
        "HTTP Error 520: status code 520": 1,
        "HTTP Error 308: Permanent Redirect": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'spotted.tv'. (_ssl.c:1131)>": 1,
        "HTTP Error 429: Too Many Requests": 3,
        "URL can't contain control characters. '/th?q=Diy Wood Wine Holder' (found at least ' ')": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'images.lightingandfanpros.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.masedomani.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'gamefaqs1.cbsistatic.com'. (_ssl.c:1131)>": 1,
        "HTTP Error 403: The specified account is disabled.": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'images.celebrateexpress.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.pencalenickhouse.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.footesmusic.com'. (_ssl.c:1131)>": 1,
        "<urlopen error [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1131)>": 1,
        "unknown url type: '2428'": 1,
        "<urlopen error _ssl.c:1114: The handshake operation timed out>": 1

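A fragment like this can be summarized in a few lines of Python. The sketch below is only an illustration and not part of img2dataset: it assumes the status counts have already been loaded into a plain dict of message to count, the helper names `categorize` and `summarize_statuses` are hypothetical, and the grouping rules are a rough heuristic based on the messages shown here.

```python
from collections import Counter


def categorize(status):
    """Map a raw status message to a coarse category (heuristic)."""
    s = status.lower()
    if s == "success":
        return "success"
    if s.startswith("http error"):
        return "http"
    if ("name or service not known" in s
            or "name resolution" in s
            or "no address associated" in s):
        return "dns"
    if "ssl" in s or "certificate" in s:
        return "ssl"
    if "timed out" in s or "timeout" in s:
        return "timeout"
    return "other"


def summarize_statuses(status_counts):
    """Return the share of each category among all attempted downloads."""
    total = sum(status_counts.values())
    if total == 0:
        return {}
    by_category = Counter()
    for status, count in status_counts.items():
        by_category[categorize(status)] += count
    return {name: count / total for name, count in by_category.items()}


if __name__ == "__main__":
    # A few entries copied from the fragment; the full dict works the same way.
    sample = {
        "success": 8542,
        "HTTP Error 404: Not Found": 594,
        "<urlopen error [Errno -2] Name or service not known>": 139,
        "<urlopen error timed out>": 64,
    }
    for name, share in sorted(summarize_statuses(sample).items()):
        print(f"{name}: {share:.3f}")
```

Grouping by category makes it easier to see whether failures are dominated by dead links (HTTP 404/410, which cannot be recovered) or by DNS and timeout errors (which may depend on the local setup).
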
Also, according to the OpenCLIP example code, they get a total of 41455 shards for LAION-400M, but I only get 41408 shards, which is 47 fewer. I used the default number_sample_per_shard=10000, so I am not sure why there is a difference.
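
For scale, the 47-shard gap can be bounded with simple arithmetic. This sketch assumes only that each shard covers at most number_sample_per_shard input rows; it does not explain where the gap comes from.

```python
samples_per_shard = 10_000  # number_sample_per_shard used in this issue
expected_shards = 41_455    # count reported for the OpenCLIP example
actual_shards = 41_408      # count obtained in this download

missing_shards = expected_shards - actual_shards
# Upper bound on input rows covered by the missing shards.
max_missing_rows = missing_shards * samples_per_shard
# Share of the expected input rows that this bound represents.
share = max_missing_rows / (expected_shards * samples_per_shard)

print(missing_shards, max_missing_rows, f"{share:.4%}")
```

Under that assumption the shard gap accounts for at most 470,000 input rows, far less than the roughly 15% of samples lost to failed downloads.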

Is that normal? How can I download all 400M samples? Thanks a lot!

BTW, I searched and found a similar issue, where it is suggested to set up knot resolver for DNS resolution. I did set up knot resolver exactly as described in the doc and checked it with dig @localhost google.com, so I think the problem is not the DNS resolver.
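
Note that `dig @localhost` queries the local resolver directly, whereas Python's urllib (visible in the `urlopen error` messages) resolves names through the system resolver. A complementary check is to resolve a sample of hostnames from Python itself. The sketch below is a minimal illustration; the helper names `can_resolve` and `resolution_rate` are hypothetical, and the `resolver` parameter exists only so the function can be exercised without network access.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves through the given resolver."""
    try:
        resolver(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror covers messages such as "Name or service not known".
        return False


def resolution_rate(hostnames, resolver=socket.getaddrinfo):
    """Fraction of hostnames that resolve; 0.0 for an empty input."""
    hostnames = list(hostnames)
    if not hostnames:
        return 0.0
    resolved = sum(can_resolve(h, resolver) for h in hostnames)
    return resolved / len(hostnames)


if __name__ == "__main__":
    # Replace with hostnames taken from the URL column of the metadata.
    print(resolution_rate(["google.com", "github.com"]))
```

Running this on the download machine, with hostnames sampled from the metadata, gives a rough idea of how many failures are DNS-related before starting a full download.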

@rom1504
Owner

rom1504 commented Feb 8, 2023 via email

@faithfulnguyen

Hi, I tried to download, but each tar file is only ~20MB and contains only ~2200 files. This is the command I used:

img2dataset --url_list laion400m-meta --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format webdataset \
    --output_folder laion400m-data --processes_count 16 --thread_count 128 --image_size 256 \
    --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True

@rom1504
Owner

rom1504 commented Sep 12, 2023 via email

@faithfulnguyen

No, I haven't set up knot resolver. I will set it up and try again, thank you.

@faithfulnguyen

faithfulnguyen commented Sep 14, 2023

After installing knot and bind9, the tar files are larger than 20MB but still do not reach 270MB; they are ~100MB. This is the output I got during the download:

total   - success: 0.375 - failed to download: 0.621 - failed to resize: 0.004 - images per sec: 10 - count: 10000
worker  - success: 0.370 - failed to download: 0.627 - failed to resize: 0.003 - images per sec: 10 - count: 10000
total   - success: 0.372 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 20 - count: 20000
4it [17:22, 142.27s/it]worker  - success: 0.369 - failed to download: 0.627 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 29 - count: 30000
worker  - success: 0.374 - failed to download: 0.620 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total   - success: 0.372 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 39 - count: 40000
6it [17:25, 60.78s/it]worker  - success: 0.367 - failed to download: 0.629 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 49 - count: 50000
worker  - success: 0.374 - failed to download: 0.623 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.371 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 58 - count: 60000
11it [17:32, 10.15s/it]worker  - success: 0.368 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 68 - count: 70000
worker  - success: 0.359 - failed to download: 0.636 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total   - success: 0.369 - failed to download: 0.626 - failed to resize: 0.004 - images per sec: 77 - count: 80000
worker  - success: 0.365 - failed to download: 0.630 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total   - success: 0.369 - failed to download: 0.627 - failed to resize: 0.004 - images per sec: 87 - count: 90000
worker  - success: 0.358 - failed to download: 0.638 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.368 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 96 - count: 100000
worker  - success: 0.363 - failed to download: 0.633 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.367 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 106 - count: 110000
15it [17:37,  3.16s/it]worker  - success: 0.354 - failed to download: 0.641 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.366 - failed to download: 0.629 - failed to resize: 0.004 - images per sec: 115 - count: 120000
worker  - success: 0.356 - failed to download: 0.639 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.366 - failed to download: 0.630 - failed to resize: 0.004 - images per sec: 125 - count: 130000
worker  - success: 0.362 - failed to download: 0.635 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total   - success: 0.365 - failed to download: 0.631 - failed to resize: 0.004 - images per sec: 134 - count: 140000
worker  - success: 0.359 - failed to download: 0.638 - failed to resize: 0.003 - images per sec: 10 - count: 10000
total   - success: 0.365 - failed to download: 0.631 - failed to resize: 0.004 - images per sec: 144 - count: 150000
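
Progress output in this format can be tracked programmatically. The sketch below is a minimal illustration that assumes the line layout shown in this log (it is not an img2dataset API, and `parse_progress` is a hypothetical helper). It uses a regex search rather than a full-line match because some lines are prefixed with tqdm output.

```python
import re

# Matches both "total" and "worker" lines, even when prefixed by tqdm text.
LINE_RE = re.compile(
    r"(?P<scope>total|worker)\s+- success: (?P<success>[\d.]+)"
    r" - failed to download: (?P<failed_download>[\d.]+)"
    r" - failed to resize: (?P<failed_resize>[\d.]+)"
    r" - images per sec: (?P<images_per_sec>\d+)"
    r" - count: (?P<count>\d+)"
)


def parse_progress(log_text):
    """Extract one dict per progress line; other lines are skipped."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        rows.append({
            "scope": match["scope"],
            "success": float(match["success"]),
            "failed_download": float(match["failed_download"]),
            "failed_resize": float(match["failed_resize"]),
            "images_per_sec": int(match["images_per_sec"]),
            "count": int(match["count"]),
        })
    return rows


if __name__ == "__main__":
    log = (
        "worker  - success: 0.359 - failed to download: 0.638"
        " - failed to resize: 0.003 - images per sec: 10 - count: 10000\n"
        "total   - success: 0.365 - failed to download: 0.631"
        " - failed to resize: 0.004 - images per sec: 144 - count: 150000\n"
    )
    totals = [row for row in parse_progress(log) if row["scope"] == "total"]
    print(totals[-1])
```

Filtering on the "total" rows gives the running success rate, which makes it easy to compare runs with different DNS or concurrency settings.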

How can I fix this?

Update: this is the log from wandb:

[image]

@rom1504
Owner

rom1504 commented Sep 14, 2023 via email

@faithfulnguyen

Yes, I tried using only knot, but the failure rate was ~0.6 for each shard. I added more DNS servers to /etc/resolv.conf and downloaded again, and the results seem better than before.

@cailk

cailk commented Dec 14, 2023

Yes, I tried using only knot, but the failure rate was ~0.6 for each shard. I added more DNS servers to /etc/resolv.conf and downloaded again, and the results seem better than before.

Hello, I recently faced a similar problem with the download rate, and knot resolver didn't help. Could you please share more details on how you modified the DNS settings to improve the download rate? Many thanks!

@faithfulnguyen

Yes, I modified /etc/resolv.conf because of the internet setup at my company.
Here are some public DNS servers I used to download the dataset:

nameserver 8.8.8.8
nameserver 8.8.4.4
nameserver 76.76.2.0
nameserver 76.76.10.0
nameserver 9.9.9.9
nameserver 1.1.1.1
nameserver 1.0.0.1
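
One caveat worth checking: the glibc stub resolver reads at most three `nameserver` lines (MAXNS in <resolv.h>), so on a glibc-based system the entries after the third would be ignored. Whether that applies depends on the system; a setup that routes lookups through a local caching resolver may behave differently. The sketch below is a minimal illustration with hypothetical helper names that splits a resolv.conf into the entries within and beyond that limit.

```python
MAXNS = 3  # glibc limit on nameserver entries; assumes the glibc resolver


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in file order, skipping comments."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop anything after '#' or ';' (simplified comment handling).
        line = line.split("#", 1)[0].split(";", 1)[0]
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def split_by_limit(servers, limit=MAXNS):
    """Split servers into (within limit, beyond limit)."""
    return servers[:limit], servers[limit:]


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        used, ignored = split_by_limit(parse_nameservers(handle.read()))
    print("within limit:", used)
    print("beyond limit:", ignored)
```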

One more thing: you could try reducing the process and thread counts to increase the success rate. Because my bandwidth is limited, I had to adapt the config to my environment;
I used --processes_count 2 --thread_count 32. This tradeoff between speed and success rate is just from my own experiments, so I'm not sure the config will work for you, but you could give it a try. Hope that helps.
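
To compare concurrency settings without retyping the command, it can be assembled in Python. This is a minimal sketch: the flags and values are the ones quoted earlier in this thread, `build_img2dataset_command` is a hypothetical helper, and the defaults of 2 processes and 32 threads are just the values reported above, not a recommendation.

```python
import shlex


def build_img2dataset_command(processes_count=2, thread_count=32):
    """Build the img2dataset argument list with adjustable concurrency."""
    return [
        "img2dataset",
        "--url_list", "laion400m-meta",
        "--input_format", "parquet",
        "--url_col", "URL",
        "--caption_col", "TEXT",
        "--output_format", "webdataset",
        "--output_folder", "laion400m-data",
        "--processes_count", str(processes_count),
        "--thread_count", str(thread_count),
        "--image_size", "256",
        "--save_additional_columns", '["NSFW","similarity","LICENSE"]',
        "--enable_wandb", "True",
    ]


if __name__ == "__main__":
    command = build_img2dataset_command()
    # Print a shell-safe version; pass `command` to subprocess.run to execute.
    print(shlex.join(command))
```

Trying a setting on a small subset of the metadata first and checking the reported success rate is cheaper than restarting a full download.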

@vishaal27

Yes, I also observed that by reducing the number of processes, my success rate went up quite a bit and I stopped getting DNS errors like: "<urlopen error [Errno -2] Name or service not known>".

@rom1504 rom1504 closed this as completed Jan 21, 2024