Make process spawning configurable: either multiprocess, either spark or dask #20
With the current architecture, it should be pretty natural to make it possible to choose between a multiprocessing pool and a Spark or Dask distributed environment.
https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L337 at least do it at the file level, so this can be a pure mapper; follow the same idea as rom1504/clip-retrieval#79 (comment)
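A sketch of what a configurable distributor could look like, with the same pure per-file mapper running either on a local multiprocessing pool or on a Spark cluster. All names here (`multiprocessing_distributor`, `pyspark_distributor`, the `square` stand-in mapper) are hypothetical, and the pyspark path assumes an already-built SparkSession:

```python
# Hypothetical sketch: run one pure mapper over shards on either backend.
from multiprocessing import Pool

def square(x):  # stand-in for a real per-shard download function
    return x * x

def multiprocessing_distributor(process_count, mapper, shards):
    """Map `mapper` over `shards` with a local process pool."""
    with Pool(process_count) as pool:
        return pool.map(mapper, shards)

def pyspark_distributor(spark, mapper, shards):
    """Map `mapper` over `shards` on an existing Spark cluster."""
    rdd = spark.sparkContext.parallelize(shards, len(shards))
    return rdd.map(mapper).collect()
```

Since the mapper is pure, either backend can retry a failed shard without coordination.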
https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/distributed_backends/distributed_backend.py can also be an interesting inspiration
https://github.com/horovod/horovod/blob/386be429b1417a1f6cb5e715bbe36efd2e74f402/horovod/spark/runner.py#L244 is a good trick to let the user build their own Spark context
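Following that trick, the entry point could accept an optional user-built session and only create a local default when none is supplied. A minimal sketch (`get_spark_session` is a hypothetical helper, not the actual img2dataset API):

```python
def get_spark_session(user_session=None):
    """Return the user's SparkSession if given, else build a local default."""
    if user_session is not None:
        return user_session
    from pyspark.sql import SparkSession  # lazy import: pyspark stays optional
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("img2dataset")
        .getOrCreate()
    )
```

The lazy import keeps pyspark an optional dependency for users who stay on multiprocessing.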
to move forward on this, moving the reader to the executor level could be good
Spark Streaming can handle a streaming collection of files in a folder.
The third solution is the best: https://www.bogotobogo.com/Hadoop/BigData_hadoop_Apache_Spark_Streaming.php
https://github.com/criteo/cluster-pack/tree/master/examples/spark-with-S3 may be helpful to create a pyspark session, but it should probably not be included by default; instead, put it behind an option or even ship it as an example script / let the user create the session as they prefer
OK, we now have pyspark support. The next step here is to actually try running it on some pyspark clusters.
Standalone mode:
https://spark.apache.org/downloads.html
https://spark.apache.org/docs/latest/spark-standalone.html
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
on master:
on nodes:
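A rough sketch of the standalone setup, following the Spark standalone docs; the master host name, port, and exact archive version are assumptions:

```shell
# fetch and unpack Spark on every machine
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar xf spark-3.2.0-bin-hadoop3.2.tgz
cd spark-3.2.0-bin-hadoop3.2

# on master: starts the master and logs its spark://<master-host>:7077 URL
./sbin/start-master.sh

# on nodes: point each worker at the master URL
./sbin/start-worker.sh spark://<master-host>:7077
```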
Make sure all writers overwrite, so this works well with Spark's task-retry feature (just delete the file if it already exists).
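A minimal sketch of such an idempotent write (function and parameter names are illustrative):

```python
import os

def write_shard(output_path, payload):
    """Overwrite-style write: a retried Spark task first deletes any
    partial file left by a failed attempt, then writes from scratch."""
    if os.path.exists(output_path):
        os.remove(output_path)
    with open(output_path, "wb") as f:
        f.write(payload)
```

Because the function always produces the same file from the same input, running it twice is harmless, which is exactly what Spark's retries require.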
-> not obvious how to run standalone: how to send the environment to the other nodes; where to write? (how to set up a distributed fs locally)
maybe using sshfs could work
Since this is just a mapper, it could also be possible to build a Docker image and spawn it once per input file, like https://blog.iron.io/docker-iron-io-super-easy-batch-processing/
possibly reconsider a streaming-based approach to eliminate the concept of files from most of the pipeline
consider yielding examples in the downloader and moving the writer's aggregation to the distributor level (not the driver, but an abstraction on top of the downloader running in the workers)
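One way to sketch this: the downloader becomes a generator yielding individual examples, and a wrapper running inside each worker regroups the stream into shards for the writer (all names here are hypothetical):

```python
def download(urls):
    """Yield one example per url; real code would fetch the image here."""
    for key, url in enumerate(urls):
        yield {"key": key, "url": url}

def aggregate_into_shards(examples, shard_size):
    """Worker-side aggregation: regroup a stream of examples into shards."""
    shard = []
    for example in examples:
        shard.append(example)
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:  # flush the last, possibly partial, shard
        yield shard
```

With this split, the driver never touches example data; only the worker-side wrapper decides shard boundaries.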
https://github.com/intel-analytics/analytics-zoo looks really good |
This could potentially be made easier by having a service handle the http/dns part, returning the original image and letting the img2dataset job do the resizing and packaging. Pipeline is:
The download part may be complicated to scale beyond 1000 requests/s due to DNS, so maybe it's better to let that part be done by a service.
consider making two-way shared file systems not required (can be done by distributing the shards via pyspark/python serialization instead of arrow + save to file system)
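A sketch of that idea, assuming a shard is just a small list of urls (names illustrative): shard contents travel through Spark's own serialization, so workers never read shard files from a shared filesystem.

```python
def make_shards(urls, shard_size):
    """Split the url list into (shard_id, urls) tuples that pyspark can
    ship to workers via its own serialization (no arrow files needed)."""
    return [
        (shard_id, urls[start:start + shard_size])
        for shard_id, start in enumerate(range(0, len(urls), shard_size))
    ]

# distribution would then look like (hypothetical `download_shard`):
#   spark.sparkContext.parallelize(make_shards(urls, 10000)).map(download_shard)
```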
This is almost done now.
https://github.com/rom1504/img2dataset/blob/main/examples/distributed_img2dataset_tutorial.md is the guide. It works, but it's a bit complex; I would also like to propose these alternatives:
AWS EMR on EKS is actually rather painful to set up; I'm considering instead going the raw EC2 route. The options are to document the Spark setup in this case, or to add a no-Spark option (which would require implementing robustness).
Writing to s3 (and hdfs) from any machine is working just fine now. I believe the only additional thing I will try here is a pure ssh-based strategy, to make it easier for people to run in distributed mode.
This is working. A little troublesome to set up, but overall working!
May at least be useful to better control memory usage.