
Adding ray as a distributor #272

Merged · 15 commits into rom1504:main · Aug 20, 2023
Conversation

Vaishaal (Contributor)

For running large jobs on AWS I found a couple problems with the spark backend.

  1. Spark runs things stage-wise by default, so either you need a huge subjob_size and wait for the reader to finish before the job starts, or you run stage by stage but are then vulnerable to stragglers in each stage. In my experience the stragglers slowed the pipeline down by 2-10x depending on how unlucky you got with a shard.
  2. When you are running on a non-dedicated cluster, spark doesn't handle autoscaling well, so you basically need to pay for num_cpus x total runtime, whereas ray natively autoscales the number of CPU nodes based on queue size and download speed. This is also nice when you get an error (as you often do for these large DL jobs), because you haven't wasted money keeping the cluster alive.
  3. As of right now the ray aws utilities are much nicer to work with than the native spark ones.

This PR adds a ray distributor (with the same interface as the spark and multiprocessing distributors), an example launch script, and a cluster_config.yaml file for people who want to spin up their own AWS cluster.

Using this I was able to get over 200k images/second on a cluster of 100 m5.24xlarges consistently for 24 hours.
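
For readers unfamiliar with that interface, here is a minimal sketch of what a Ray-based distributor of this shape could look like; the function name `ray_distributor` and the task format are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch of a Ray distributor; names and the task format are
# assumptions, not the implementation that was merged.
import ray


def ray_distributor(worker, input_iterator):
    """Run `worker` over every item yielded by `input_iterator` on a Ray cluster."""
    # Attach to an already-running cluster (e.g. one launched from cluster_config.yaml).
    ray.init(address="auto", ignore_reinit_error=True)

    # Each item becomes one Ray task; Ray schedules tasks across all available
    # nodes as they free up, so there is no stage barrier and a straggler only
    # delays its own shard.
    remote_worker = ray.remote(worker)
    futures = [remote_worker.remote(item) for item in input_iterator]

    # Block until every shard has been processed.
    ray.get(futures)
```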

@Vaishaal force-pushed the ray_pr branch 2 times, most recently from 1d93049 to b315c84 on January 17, 2023 03:49
@rom1504 added this to Needs triage in PR Triage on Mar 4, 2023
Vaishaal (Contributor, Author) commented May 3, 2023

@rom1504 can you take a look and merge this :)

rom1504 (Owner) commented May 3, 2023

Looks pretty good. Any way to add a test for this distributor?
It would guarantee it keeps working with future changes.

@rom1504 moved this from Needs triage to Important to finish in PR Triage on May 28, 2023
jelech commented Jun 6, 2023

Hi, is there any plan to merge this PR?

rom1504 (Owner) commented Jul 15, 2023

@jelech I want to, but I would prefer we add some testing, as otherwise it'll eventually get broken over time.

Vaishaal (Contributor, Author) commented Jul 16, 2023 via email

rom1504 (Owner) commented Aug 6, 2023

I think we need this: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html

e.g. in a GH Action, run:

pip install ray[default]
ray start --head --port=6379
ray start --address=127.0.0.1:6379
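
Once the head node is up, Python code (the tests or the distributor itself) can attach to it; a minimal sketch, assuming the cluster was started with the commands above:

```python
import ray

# Attach to the cluster started by `ray start --head --port=6379`;
# "auto" discovers the local head node (an explicit address also works).
ray.init(address="auto", ignore_reinit_error=True)

# Sanity check that the nodes and CPUs started above are visible.
print(ray.cluster_resources())
```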

rom1504 (Owner) commented Aug 6, 2023

then adapt https://github.com/rom1504/img2dataset/blob/main/tests/test_main.py#L363

plus either automatically start a ray cluster in the main code if there is none

or do it only in tests
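
For the tests-only option, a hedged sketch of what such a setup could look like (the fixture name, scope, and test body are assumptions, not the code that was merged):

```python
# Hypothetical pytest fixture giving tests a local Ray cluster if none is running;
# a sketch of the "do it only in tests" option, not the merged implementation.
import pytest
import ray


@pytest.fixture(scope="session")
def ray_cluster():
    # Start a small single-node cluster for the test session if none is running.
    if not ray.is_initialized():
        ray.init(num_cpus=2, ignore_reinit_error=True)
    yield
    ray.shutdown()


def test_ray_distributor(ray_cluster):
    # Placeholder: invoke img2dataset with the ray distributor here, mirroring
    # the existing multiprocessing/pyspark cases in tests/test_main.py.
    assert ray.is_initialized()
```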

rom1504 (Owner) commented Aug 6, 2023

ok so I added tests

I'll rework the doc here a little bit and then we're good for merging

Vaishaal (Contributor, Author) commented Aug 7, 2023

Nice, thanks so much!

@rom1504 merged commit 171e3cd into rom1504:main on Aug 20, 2023
4 checks passed
PR Triage automation moved this from Important to finish to Closed Aug 20, 2023
rom1504 (Owner) commented Aug 20, 2023

thanks for the PR!
