Introduce tensor sharding #14
Conversation
Looks great, thanks Jiewen!
mesh = xs.Mesh(device_ids, (num_devices, 1))
sharding_spec = xs.ShardingSpec(mesh, (0, 1))
elif self.args.spmd_tensor_sharding > 0:
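The `elif` branch above is where the new tensor-sharding path starts. As a rough orientation, here is a minimal sketch of what that branch could look like, assuming torch_xla's experimental SPMD API (`xs.Mesh` / `xs.ShardingSpec`, as in the diff); variable names other than those in the diff are illustrative, not the exact trainer code:

```python
# Minimal sketch, not the exact PR code: build the 2D (fsdp, tensor) mesh
# described in this PR and shard the input batch along the fsdp axis only.
import numpy as np
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs

num_devices = xm.xrt_world_size()
device_ids = np.arange(num_devices)

tensor = 2                          # value passed via --spmd_tensor_sharding
fsdp = num_devices // tensor        # fsdp dimension derived from device count
mesh = xs.Mesh(device_ids, (fsdp, tensor))

# Inputs are (batch, seq); partition the batch dim on the fsdp axis, replicate seq.
input_sharding_spec = xs.ShardingSpec(mesh, (0, None))
```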
Specifying `batch` or `fsdp` sharding will silently override the tensor parallelism. Can we assert that these flags are exclusive, since `tensor_sharding` implies FSDP/batch sharding?
Actually, it's intended: you can do `tensor_sharding` on the weights and batch sharding on the input.
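To illustrate the intent of this reply (weights sharded across both mesh axes, inputs sharded only along the fsdp/batch axis), here is a hedged sketch using `xs.mark_sharding`; shapes and axis names are made up for the example and only the partition specs matter:

```python
# Hedged illustration of "tensor_sharding on weights, batch sharding on input".
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs

num_devices = xm.xrt_world_size()
mesh = xs.Mesh(np.arange(num_devices), (num_devices // 2, 2), ('fsdp', 'tensor'))

weight = torch.randn(1024, 1024).to(xm.xla_device())
inputs = torch.zeros(8, 512, dtype=torch.long).to(xm.xla_device())

xs.mark_sharding(weight, mesh, (0, 1))     # 2D: rows on fsdp, cols on tensor
xs.mark_sharding(inputs, mesh, (0, None))  # batch dim on fsdp, seq replicated
```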
Oh I see, so we can run 2D FSDP by specifying `--spmd_batch_sharding` and e.g. `--spmd_tensor_sharding 4`. I think specifying `--spmd_fsdp_sharding` with `--spmd_tensor_sharding 4` will always ignore the `tensor_sharding` though, right?
What I mean is that you can specify `--spmd_batch_sharding --spmd_tensor_sharding 4`, but not `--spmd_fsdp_sharding --spmd_tensor_sharding 4`. Do you think that's clear? If not, I can do a follow-up.
Got it, thanks Jiewen! I think what we have now is fine. We can follow up later to make the sharding more standard, like MaxText's `ici_*_parallelism` and `dcn_*_parallelism` parameters.
Yea, for sure. Does HybridMesh do anything for you in a single slice?
Yeah, it will rearrange the tiling assignment to optimize for the ICI connections. I would say we should always use HybridMesh, even for a single slice.
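For reference, a sketch of the HybridMesh suggestion for the single-slice case; the argument names follow torch_xla's experimental SPMD API, but treat them as an assumption rather than the exact code this PR would add:

```python
import torch_xla.experimental.xla_sharding as xs

mesh = xs.HybridMesh(
    ici_mesh_shape=(2, 2),          # (fsdp, tensor) within one slice, e.g. a v4-8
    dcn_mesh_shape=(1, 1),          # single slice, so no cross-slice dimension
    axis_names=('fsdp', 'tensor'),
)
# HybridMesh reorders device ids so the mesh axes line up with the ICI links,
# which is the tiling-assignment optimization mentioned above.
```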
Trying to see if you have the MFU numbers to compare. I will make a change later.
I'll do a quick test on v4-8 to get the MFU difference.
No worries. It's not a priority.
Thanks Jon for approving.
Summary:
This pull request introduces a new way to do sharding that allows weights to be sharded on a two-dimensional mesh, i.e., (fsdp, tensor), with the input sharded along the fsdp dimension.
To enable it, pass `--spmd_tensor_sharding 2`, where 2 is the tensor dimension; the fsdp dimension is calculated automatically as num_devices // 2.

Test Plan:
Tested on a v4-8 with a 2B LLaMA.
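For concreteness, a small worked example of the num_devices // 2 calculation for the test plan above, assuming the 4 chips of a v4-8 are each visible as one device:

```python
# Worked example of the auto-derived fsdp dimension for --spmd_tensor_sharding 2.
num_devices = 4                 # v4-8: 4 chips visible as devices (assumption)
tensor = 2                      # from --spmd_tensor_sharding 2
fsdp = num_devices // tensor    # 4 // 2 = 2
print((fsdp, tensor))           # (2, 2): weights sharded on a (fsdp, tensor) mesh
```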