Query Regarding Performance Optimisation for Large-Scale Downloads #374

Open
yihong1120 opened this issue Jan 4, 2024 · 1 comment

@yihong1120

Dear img2dataset Maintainers,

I hope this message finds you well. I am reaching out to discuss a potential enhancement in the performance of img2dataset when dealing with large-scale image downloads. While the tool performs admirably, I believe there is room for further optimisation, particularly when operating on datasets exceeding 100 million images.

I have observed that the download speed tends to fluctuate, and at times, the CPU and bandwidth utilisation do not reach their full potential. This observation leads me to ponder whether additional parallelisation strategies or more efficient resource allocation could be implemented.

Moreover, I have a few suggestions that might contribute to the tool's efficiency:

  1. Introducing a dynamic thread management system that can adapt to the current network and CPU load.
  2. Implementing a more sophisticated DNS caching mechanism to reduce the overhead of DNS lookups (a rough sketch of one possible approach follows this list).
  3. Exploring the possibility of integrating with a CDN or other network optimisation services to enhance download speeds globally.
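
To make suggestion 2 a little more concrete, a minimal sketch of an in-process DNS cache with a fixed TTL could look like the following. Nothing here exists in img2dataset today; the class and parameter names are purely illustrative, and a real implementation would also need to cope with DNS load balancing and per-record TTLs.

```python
import socket
import threading
import time


class DNSCache:
    """Tiny in-process DNS cache with a fixed TTL (illustrative only).

    Entries expire after `ttl` seconds so that changing
    domain-to-IP mappings are eventually picked up again.
    """

    def __init__(self, ttl=300):
        self.ttl = ttl
        self._cache = {}  # hostname -> (ip, expiry timestamp)
        self._lock = threading.Lock()

    def resolve(self, hostname):
        now = time.time()
        with self._lock:
            entry = self._cache.get(hostname)
            if entry is not None and entry[1] > now:
                return entry[0]
        # Do the lookup outside the lock so a slow resolver does not block
        # every other download thread.
        ip = socket.gethostbyname(hostname)
        with self._lock:
            self._cache[hostname] = (ip, now + self.ttl)
        return ip


cache = DNSCache(ttl=300)
ip = cache.resolve("example.com")   # first call performs a real lookup
ip = cache.resolve("example.com")   # second call is served from the cache
```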

I am keen to hear your thoughts on these suggestions and would be delighted to contribute to the development of these enhancements.

Thank you for your time and the excellent work you have done with img2dataset. It is a vital tool for the machine learning community, and I am excited about its potential evolution.

Best regards,
yihong1120

@rom1504
Owner

rom1504 commented Jan 4, 2024

Hi Yihong,

Your suggestions make a lot of sense.
I am interested in any change that would improve the speed further.

In particular

  1. Dynamic thread management would be interesting. I tried implementing timeouts in the past with no great success; starting more threads when some are stuck might help (a rough sketch of that idea follows this list). One limiting factor, however, is the operating system's capacity to open enough TCP connections.
  2. Yes, that is in fact something some users are currently looking into, as DNS lookup is difficult in some environments. I recommend Knot Resolver in the readme, but I would appreciate any built-in solution for this. I tried static resolving in the past but hit issues (DNS load balancing needs to be handled, and some domain-to-IP mappings change often).
  3. I'm curious about your ideas for the CDN integration. Do you mean an externally hosted service or a custom CDN implementation?
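
To illustrate the "start more threads when some are stuck" idea from point 1, a rough sketch could look like the following. This is not img2dataset code; `run_with_watchdog`, its parameters, and the heartbeat scheme are made up for illustration, and the worker cap stands in for the OS limit on open TCP connections.

```python
import queue
import threading
import time


def run_with_watchdog(jobs, worker_fn, max_workers=64, stall_seconds=30):
    """Process `jobs` with worker threads, adding a worker whenever an
    existing one has not reported progress for `stall_seconds`."""
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    heartbeats = {}               # worker id -> last time it picked up a job
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return
            with lock:
                heartbeats[worker_id] = time.time()
            try:
                worker_fn(job)    # e.g. download and resize one image
            finally:
                job_queue.task_done()

    threads = []

    def spawn(worker_id):
        with lock:
            heartbeats[worker_id] = time.time()
        t = threading.Thread(target=worker, args=(worker_id,), daemon=True)
        t.start()
        threads.append(t)

    initial = min(16, max_workers)
    for i in range(initial):
        spawn(i)

    # Watchdog: if any worker appears stuck past the deadline, start one more,
    # up to max_workers (in practice bounded by how many TCP connections the
    # operating system allows).
    next_id = initial
    while next_id < max_workers and not job_queue.empty():
        time.sleep(stall_seconds)
        now = time.time()
        with lock:
            stuck = any(now - ts > stall_seconds for ts in heartbeats.values())
        if stuck:
            spawn(next_id)
            next_id += 1

    job_queue.join()              # wait for the remaining in-flight jobs
```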

Regardless, I encourage you to try out any ideas.

To get reproducible speed results, I have found that using a shard from a large dataset works well. I would usually use a shard from laion400m or laion2B-en, but since those are currently down, you may use coyo700m as a replacement.
Usually, running the tool for a few minutes and looking at the metrics on wandb is an efficient way to compare changes (an example invocation is sketched below).
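
For example, a benchmark run along those lines could look like the sketch below. It assumes the Python API documented in the img2dataset readme (the `download` function with parameters such as `processes_count`, `thread_count`, and `enable_wandb`); the shard path and column names are placeholders to adapt to whichever dataset shard you use, and the exact parameter set should be checked against the readme.

```python
from img2dataset import download

# Download a single shard for a few minutes and compare throughput on the
# wandb dashboard (enable_wandb=True).
download(
    url_list="shard_00000.parquet",   # placeholder: one shard of e.g. coyo700m
    input_format="parquet",
    url_col="url",                    # placeholder column names; adapt to the shard
    caption_col="text",
    output_folder="bench_output",
    output_format="webdataset",
    image_size=256,
    processes_count=16,
    thread_count=64,
    enable_wandb=True,
)
```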
