
Adding ray as a distributor #272

Merged · 15 commits into rom1504:main · Aug 20, 2023
Conversation

Vaishaal (Contributor)

For running large jobs on AWS I found a couple problems with the spark backend.

  1. Spark runs things stage-wise by default, so either you need a huge subjob_size and wait for the reader to finish before the job starts, or you run stage by stage but are then vulnerable to stragglers in each stage. In my experience the stragglers slowed the pipeline down by 2-10x depending on how unlucky you got with a shard.
  2. When you are running on a non-dedicated cluster, spark doesn't handle autoscaling well, so you basically need to pay for num_cpus x total runtime, whereas ray natively autoscales the number of CPU nodes based on queue size and download speed. This is also nice when you get an error (as you often do for these large DL jobs), because you haven't wasted money keeping the cluster alive.
  3. As of right now the ray aws utilities are much nicer to work with than the native spark ones.

This PR adds a ray distributor (with the same interface as the spark and multiprocessing distributors), an example launch script, and a cluster_config.yaml file for people who want to spin up their own AWS cluster.

Using this I was able to get over 200k images/second on a cluster of 100 m5.24xlarges consistently for 24 hours.
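
For readers unfamiliar with that interface, here is a minimal sketch of what a Ray-based distributor of this shape could look like; the function name `ray_distributor` and the task format are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch of a Ray distributor; names and the task format are
# assumptions, not the implementation that was merged.
import ray


def ray_distributor(worker, input_iterator):
    """Run `worker` over every item yielded by `input_iterator` on a Ray cluster."""
    # Attach to an already-running cluster (e.g. one launched from cluster_config.yaml).
    ray.init(address="auto", ignore_reinit_error=True)

    # Each item becomes one Ray task; Ray schedules tasks across all available
    # nodes as they free up, so there is no stage barrier and a straggler only
    # delays its own shard.
    remote_worker = ray.remote(worker)
    futures = [remote_worker.remote(item) for item in input_iterator]

    # Block until every shard has been processed.
    ray.get(futures)
```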

@Vaishaal force-pushed the ray_pr branch 2 times, most recently from 1d93049 to b315c84 on January 17, 2023 03:49
@rom1504 added this to Needs triage in PR Triage on Mar 4, 2023
Vaishaal (Contributor, Author) commented May 3, 2023

@rom1504 can you take a look and merge this :)

rom1504 (Owner) commented May 3, 2023

Looks pretty good. Any way to add a test for this distributor?
It would guarantee it keeps working with future changes.

@rom1504 moved this from Needs triage to Important to finish in PR Triage on May 28, 2023
jelech commented Jun 6, 2023

Hi, is there any plan to merge this PR?

rom1504 (Owner) commented Jul 15, 2023

@jelech I want to, but I would prefer we add some testing, as otherwise it'll eventually get broken over time.

Vaishaal (Contributor, Author) commented Jul 16, 2023 via email

rom1504 (Owner) commented Aug 6, 2023

I think we need this: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html

e.g. in a GH Action, run:

pip install ray[default]
ray start --head --port=6379
ray start --address=127.0.0.1:6379
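
Once the head node is up, Python code (the tests or the distributor itself) can attach to it; a minimal sketch, assuming the cluster was started with the commands above:

```python
import ray

# Attach to the cluster started by `ray start --head --port=6379`;
# "auto" discovers the local head node (an explicit address also works).
ray.init(address="auto", ignore_reinit_error=True)

# Sanity check that the nodes and CPUs started above are visible.
print(ray.cluster_resources())
```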

rom1504 (Owner) commented Aug 6, 2023

then adapt https://github.com/rom1504/img2dataset/blob/main/tests/test_main.py#L363

plus either automatically start a ray cluster in the main code if there is none

or do it only in tests
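
For the tests-only option, a hedged sketch of what such a setup could look like (the fixture name, scope, and test body are assumptions, not the code that was merged):

```python
# Hypothetical pytest fixture giving tests a local Ray cluster if none is running;
# a sketch of the "do it only in tests" option, not the merged implementation.
import pytest
import ray


@pytest.fixture(scope="session")
def ray_cluster():
    # Start a small single-node cluster for the test session if none is running.
    if not ray.is_initialized():
        ray.init(num_cpus=2, ignore_reinit_error=True)
    yield
    ray.shutdown()


def test_ray_distributor(ray_cluster):
    # Placeholder: invoke img2dataset with the ray distributor here, mirroring
    # the existing multiprocessing/pyspark cases in tests/test_main.py.
    assert ray.is_initialized()
```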

rom1504 (Owner) commented Aug 6, 2023

ok so I added tests

I'll rework the doc here a little bit and then we're good for merging

Vaishaal (Contributor, Author) commented Aug 7, 2023

Nice, thanks so much!

@rom1504 merged commit 171e3cd into rom1504:main on Aug 20, 2023
4 checks passed
PR Triage automation moved this from Important to finish to Closed Aug 20, 2023
rom1504 (Owner) commented Aug 20, 2023

thanks for the PR!
