Failed to download all of LAION-400M #277
Hi,
I think that's expected: there is roughly 1% link rot per month, and it has
already been more than 8 months since the initial release, so the success rate
went from 95% to 87%.
You may be able to increase this a little by changing the user agent.
You could also get more samples by downloading laion2B-en instead.
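The link-rot estimate above can be checked with a little arithmetic. This is a minimal sketch, assuming the 1% monthly loss compounds on the links that are still alive; the function name is hypothetical and only used for illustration.

```python
def expected_success_rate(initial_rate: float, monthly_rot: float, months: int) -> float:
    """Success rate after `months`, if a fraction `monthly_rot` of live links dies each month."""
    return initial_rate * (1.0 - monthly_rot) ** months


# Figures taken from the reply above: 95% at release, 1% rot per month, 8 months.
rate = expected_success_rate(0.95, 0.01, 8)
print(f"{rate:.3f}")  # prints 0.877
```

Compounding gives about 87.7%, which agrees with the simpler linear estimate of 95 - 8 = 87.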
On Wed, Feb 8, 2023, 01:22 Zeyu Wang wrote:
Hi, I have been trying to download LAION-400M using the instructions you
provided; however, the download is incomplete.
As a rough estimate, the success rate is about 0.83-0.85, so for a
400M-sample dataset I actually get 350+M samples. Here is the content of a
typical stats.json file:
"HTTP Error 404: Not Found": 594,
"success": 8542,
"HTTP Error 503: Service Temporarily Unavailable": 11,
"HTTP Error 403: Forbidden": 141,
"HTTP Error 503: Service Unavailable": 17,
"<urlopen error [Errno 113] No route to host>": 8,
"Use of image disallowed by X-Robots-Tag directive": 30,
"HTTP Error 401: Unauthorized": 9,
"<urlopen error [Errno -2] Name or service not known>": 139,
"HTTP Error 400: Bad Request": 31,
"HTTP Error 500: Internal Server Error": 12,
"Image decoding error": 83,
"HTTP Error 404: File Not Found": 14,
"<urlopen error [Errno 111] Connection refused>": 10,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)>": 31,
"HTTP Error 521: ": 4,
"HTTP Error 530: ": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)>": 18,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn.twistarticle.com'. (_ssl.c:1131)>": 1,
"Remote end closed connection without response": 5,
"<urlopen error [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)>": 5,
"URL can't contain control characters. '/tz/ItemImages/Games/Game Boy Advance/Mega%20Man%20Battle%20Network.jpg' (found at least ' ')": 1,
"<urlopen error [Errno -5] No address associated with hostname>": 17,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn.webfronts.com'. (_ssl.c:1131)>": 2,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'cdn1.freevap.ch'. (_ssl.c:1131)>": 1,
"URL can't contain control characters. '/thumbnail.asp?file=assets/images/pfly images/sweater happy skull_thumbnail.jpg&maxx=150&maxy=0' (found at least ' ')": 1,
"<urlopen error [Errno -3] Temporary failure in name resolution>": 62,
"HTTP Error 523: ": 4,
"'ascii' codec can't encode character '\\xf1' in position 44: ordinal not in range(128)": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'today.law.harvard.edu'. (_ssl.c:1131)>": 1,
"HTTP Error 502: Bad Gateway": 3,
"timed out": 15,
"The read operation timed out": 42,
"<urlopen error timed out>": 64,
"HTTP Error 422: Unprocessable Entity": 3,
"HTTP Error 403: Access Denied": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'preview.mp3mixx.com'. (_ssl.c:1131)>": 1,
"HTTP Error 503: first byte timeout": 7,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.loccie.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1131)>": 5,
"HTTP Error 415: Unsupported Media Type": 1,
"<urlopen error EOF occurred in violation of protocol (_ssl.c:1131)>": 4,
"OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:798: error: (-215:Assertion failed) !buf.empty() in function 'imdecode_'\n": 5,
"HTTP Error 410: Gone": 9,
"HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.\nThe last 30x error message was:\nMoved Permanently": 3,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.flowerschennai.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'r8zlusvr.rocketcdn.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1131)>": 3,
"HTTP Error 503: Service Unavailable: Back-end server is at capacity": 3,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'dcspestcontrol.com'. (_ssl.c:1131)>": 1,
"[Errno 104] Connection reset by peer": 1,
"HTTP Error 308: Permanent redirect": 1,
"URL can't contain control characters. '/les%20goupes/T/Triumph (CAN)/The Sport of Kings/The Sport of Kings.jpg' (found at least ' ')": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'partysurprise.co.za'. (_ssl.c:1131)>": 1,
"HTTP Error 404: Not found": 1,
"HTTP Error 404: The specified resource does not exist.": 1,
"HTTP Error 404: ": 2,
"<urlopen error [Errno 101] Network is unreachable>": 1,
"HTTP Error 500: Domain Not Found": 1,
"URL can't contain control characters. ***@***.***?v=1574368035 2x' (found at least ' ')": 1,
"HTTP Error 503: Backend is unhealthy": 1,
"HTTP Error 404: The specified blob does not exist.": 2,
"HTTP Error 520: status code 520": 1,
"HTTP Error 308: Permanent Redirect": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'spotted.tv'. (_ssl.c:1131)>": 1,
"HTTP Error 429: Too Many Requests": 3,
"URL can't contain control characters. '/th?q=Diy Wood Wine Holder' (found at least ' ')": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'images.lightingandfanpros.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.masedomani.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'gamefaqs1.cbsistatic.com'. (_ssl.c:1131)>": 1,
"HTTP Error 403: The specified account is disabled.": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'images.celebrateexpress.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.pencalenickhouse.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.footesmusic.com'. (_ssl.c:1131)>": 1,
"<urlopen error [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1131)>": 1,
"unknown url type: '2428'": 1,
"<urlopen error _ssl.c:1114: The handshake operation timed out>": 1
Also, according to the OpenCLIP example code, they get a total of
41455 shards for LAION-400M, but I only get 41408 shards, which is 47
fewer. I used the default number_sample_per_shard=10000, so I am
not sure why there is this difference.
Is that normal? How can I download all of the 400M samples? Thanks a
lot!
BTW, I searched and found a similar issue (#242), where it is
suggested to set up knot resolver for DNS resolution. However, I did set up
knot resolver exactly as described in the doc and checked it with
dig @localhost google.com, so I think the problem is not the DNS resolver.
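To see which kind of failure dominates in a stats.json listing like the one above, the raw status messages can be grouped into broad categories. This is only a sketch: the category names and matching rules are assumptions made for illustration, and the sample dictionary copies just a handful of the entries shown above.

```python
from collections import Counter

# A few entries copied from the stats.json listing above (not the full file).
sample_status = {
    "success": 8542,
    "HTTP Error 404: Not Found": 594,
    "HTTP Error 403: Forbidden": 141,
    "<urlopen error [Errno -2] Name or service not known>": 139,
    "<urlopen error [Errno -3] Temporary failure in name resolution>": 62,
    "<urlopen error timed out>": 64,
    "Image decoding error": 83,
}


def categorize(status: str) -> str:
    """Map a raw status message to a coarse, hypothetical category."""
    s = status.lower()
    if s == "success":
        return "success"
    if s.startswith("http error 4"):
        return "http_4xx"
    if s.startswith("http error 5"):
        return "http_5xx"
    if "name or service not known" in s or "name resolution" in s or "no address associated" in s:
        return "dns"
    if "ssl" in s or "certificate" in s:
        return "tls"
    if "timed out" in s or "timeout" in s:
        return "timeout"
    return "other"


def group_statuses(status_counts: dict) -> Counter:
    """Sum the counts of all messages that fall into the same category."""
    grouped = Counter()
    for status, count in status_counts.items():
        grouped[categorize(status)] += count
    return grouped


grouped = group_statuses(sample_status)
for category, count in grouped.most_common():
    print(category, count)
```

In this subset, dead links (HTTP 4xx) outnumber DNS failures by more than three to one, which fits the link-rot explanation better than a resolver problem.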
Hi, I tried to download, but each tar file is only ~20MB and contains only ~2200 files. This is the command I used:
Did you set up knot resolver?
On Tue, Sep 12, 2023, 13:27 faithfulnguyen wrote:
Hi, I tried to download, but each tar file is only ~20MB and contains only
~2200 files. This is the command I used:
img2dataset --url_list laion400m-meta --input_format "parquet" \
  --url_col "URL" --caption_col "TEXT" --output_format webdataset \
  --output_folder laion400m-data --processes_count 16 --thread_count 128 --image_size 256 \
  --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True
No, I didn't set up knot resolver. I will set it up and try again, thank you.
After installing knot and bind9, the tar files grew beyond 20MB but still did not reach 270MB; they are ~100MB. This is the message I got during the download process:
How can I fix this error? Update: this is the log from wandb
You should set up only knot resolver, not bind9, and also disable your
previous resolver (e.g. systemd-resolved).
You can make sure this is working by looking at the CPU usage of Knot in
top/htop.
Then you can check the error reasons in wandb or in the JSON files in the
output folder.
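Checking the error reasons across many shards is easier with a small script. The sketch below assumes that each shard writes a file matching `*_stats.json` in the output folder and that the per-status counts sit under a `status_dict` key; both are assumptions about the file layout, so adjust the pattern and key if your files differ. The function name is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def aggregate_status(output_folder: str) -> Counter:
    """Sum per-status counts across every *_stats.json file in output_folder."""
    totals = Counter()
    for stats_path in sorted(Path(output_folder).glob("*_stats.json")):
        with open(stats_path, encoding="utf-8") as f:
            stats = json.load(f)
        # Assumption: status message -> count pairs are stored under "status_dict".
        totals.update(stats.get("status_dict", {}))
    return totals


if __name__ == "__main__":
    # "laion400m-data" is the output folder used in the command quoted in this thread.
    for status, count in aggregate_status("laion400m-data").most_common(10):
        print(count, status)
```

Sorting by count shows at a glance whether DNS errors, timeouts, or plain 404s account for most of the failures.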
On Thu, Sep 14, 2023, 02:54 faithfulnguyen wrote:
After installing knot and bind9, the tar files grew beyond 20MB but still
did not reach 270MB; they are ~100MB. This is the message I got during the
download process:
total - success: 0.375 - failed to download: 0.621 - failed to resize: 0.004 - images per sec: 10 - count: 10000
worker - success: 0.370 - failed to download: 0.627 - failed to resize: 0.003 - images per sec: 10 - count: 10000
total - success: 0.372 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 20 - count: 20000
4it [17:22, 142.27s/it]worker - success: 0.369 - failed to download: 0.627 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 29 - count: 30000
worker - success: 0.374 - failed to download: 0.620 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total - success: 0.372 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 39 - count: 40000
6it [17:25, 60.78s/it]worker - success: 0.367 - failed to download: 0.629 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 49 - count: 50000
worker - success: 0.374 - failed to download: 0.623 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.624 - failed to resize: 0.004 - images per sec: 58 - count: 60000
11it [17:32, 10.15s/it]worker - success: 0.368 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.371 - failed to download: 0.625 - failed to resize: 0.004 - images per sec: 68 - count: 70000
worker - success: 0.359 - failed to download: 0.636 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total - success: 0.369 - failed to download: 0.626 - failed to resize: 0.004 - images per sec: 77 - count: 80000
worker - success: 0.365 - failed to download: 0.630 - failed to resize: 0.005 - images per sec: 10 - count: 10000
total - success: 0.369 - failed to download: 0.627 - failed to resize: 0.004 - images per sec: 87 - count: 90000
worker - success: 0.358 - failed to download: 0.638 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.368 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 96 - count: 100000
worker - success: 0.363 - failed to download: 0.633 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.367 - failed to download: 0.628 - failed to resize: 0.004 - images per sec: 106 - count: 110000
15it [17:37, 3.16s/it]worker - success: 0.354 - failed to download: 0.641 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.366 - failed to download: 0.629 - failed to resize: 0.004 - images per sec: 115 - count: 120000
worker - success: 0.356 - failed to download: 0.639 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.366 - failed to download: 0.630 - failed to resize: 0.004 - images per sec: 125 - count: 130000
worker - success: 0.362 - failed to download: 0.635 - failed to resize: 0.004 - images per sec: 10 - count: 10000
total - success: 0.365 - failed to download: 0.631 - failed to resize: 0.004 - images per sec: 134 - count: 140000
worker - success: 0.359 - failed to download: 0.638 - failed to resize: 0.003 - images per sec: 10 - count: 10000
total - success: 0.365 - failed to download: 0.631 - failed to resize: 0.004 - images per sec: 144 - count: 150000
Is this normal?
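Progress lines like the ones above can be parsed to track how the success rate evolves during a run. This is a minimal sketch whose regular expression is written against the line format as it appears in this thread; the real logger may use different spacing, so treat the pattern and the function name as assumptions.

```python
import re

# Matches both "total" and "worker" lines, even when a progress-bar prefix precedes them.
LOG_LINE = re.compile(
    r"(?P<scope>total|worker)\s+-\s+success: (?P<success>[\d.]+)"
    r" - failed to download: (?P<failed_download>[\d.]+)"
    r" - failed to resize: (?P<failed_resize>[\d.]+)"
    r" - images per sec: (?P<images_per_sec>\d+)"
    r" - count: (?P<count>\d+)"
)


def parse_progress(lines):
    """Return one dict per recognised progress line; unrecognised lines are skipped."""
    records = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue
        records.append({
            "scope": m["scope"],
            "success": float(m["success"]),
            "failed_download": float(m["failed_download"]),
            "failed_resize": float(m["failed_resize"]),
            "images_per_sec": int(m["images_per_sec"]),
            "count": int(m["count"]),
        })
    return records


# Three lines copied from the log above.
sample = [
    "total - success: 0.375 - failed to download: 0.621 - failed to resize: 0.004 - images per sec: 10 - count: 10000",
    "4it [17:22, 142.27s/it]worker - success: 0.369 - failed to download: 0.627 - failed to resize: 0.004 - images per sec: 10 - count: 10000",
    "total - success: 0.365 - failed to download: 0.631 - failed to resize: 0.004 - images per sec: 144 - count: 150000",
]
records = parse_progress(sample)
last_total = [r for r in records if r["scope"] == "total"][-1]
print(last_total["success"], last_total["count"])
```

A success rate that stays flat around 0.37 from the first shard on, as in this log, suggests a systematic cause rather than a few bad shards.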
Yes, I tried using only knot, but the failure rate was still ~0.6 for each shard. I added more DNS into the
Hello, I recently faced a similar problem with the download rate, and knot resolver did not help. Could you please share more details or steps on how you modified the DNS to improve the download rate? Many thanks!
Yes, I modified the file
One more thing: you could try reducing the process and thread counts to increase the success rate. Because my bandwidth is limited, I needed to change the config to suit my environment.
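The suggestion above to lower the process and thread counts could be scripted so that different settings are easy to try. The sketch below only builds the argument list for the command quoted earlier in this thread; the values 4 and 32 are illustrative assumptions, not recommended settings, and the helper name is hypothetical.

```python
import shlex


def build_img2dataset_command(processes_count: int, thread_count: int) -> list[str]:
    """Build the img2dataset command from this thread with adjustable parallelism."""
    return [
        "img2dataset",
        "--url_list", "laion400m-meta",
        "--input_format", "parquet",
        "--url_col", "URL",
        "--caption_col", "TEXT",
        "--output_format", "webdataset",
        "--output_folder", "laion400m-data",
        "--processes_count", str(processes_count),
        "--thread_count", str(thread_count),
        "--image_size", "256",
        "--save_additional_columns", '["NSFW","similarity","LICENSE"]',
        "--enable_wandb", "True",
    ]


# Assumed lower settings for a bandwidth-limited machine; tune for your environment.
command = build_img2dataset_command(processes_count=4, thread_count=32)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(command, check=True)` once img2dataset is installed.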
Yes, I also observed that by reducing the number of processes, my success rate went up quite a bit and I stopped getting DNS errors like: