This repository was archived by the owner on Sep 10, 2025. It is now read-only.

@fduwjj fduwjj commented Jul 2, 2024

In TorchChat, the device was only ever set to "cuda", so every rank used cuda:0, which led to CUDA OOM during checkpoint loading. With this fix, I can now run all the way until the prompt shows up. However, we now have to enter input many times, for each rank, so that is something we need to solve next.
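
For illustration, the per-rank device selection can be sketched roughly as below. This is a minimal sketch, not the code in this PR: `device_for_rank` is a hypothetical helper, and it assumes the process was launched by torchrun, which sets the `LOCAL_RANK` environment variable for each worker.

```python
import os


def device_for_rank(base_device="cuda", env=None):
    """Return a device string that is unique per local rank.

    Hypothetical helper for illustration only. Assumes torchrun has set
    LOCAL_RANK; falls back to rank 0 when the variable is missing.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if base_device == "cuda":
        # A bare "cuda" resolves to the current device, which defaults to
        # cuda:0, so every rank would otherwise land on the same GPU.
        return f"cuda:{local_rank}"
    # Non-CUDA devices (e.g. "cpu") are returned unchanged.
    return base_device


if __name__ == "__main__":
    print(device_for_rank())
```

In a real run, one would typically also call `torch.cuda.set_device(local_rank)` before loading the checkpoint, so that allocations default to the rank's own GPU.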

Also, for the TP part, we need to use plain tensor parallel (TP), not sequence parallel as we did for training.
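
To make the TP idea concrete, here is a toy sketch in plain Python (no PyTorch) of the usual column-wise then row-wise sharding of a two-layer MLP. It is an illustration under assumptions, not torchchat's implementation: all function names are hypothetical, the loop over ranks stands in for separate processes, and the final sum stands in for an all-reduce.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def relu(v):
    """Elementwise ReLU."""
    return [max(0, a) for a in v]


def split_columns(w, parts):
    """Shard a matrix column-wise into `parts` equal slices."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def split_rows(w, parts):
    """Shard a matrix row-wise into `parts` equal slices."""
    step = len(w) // parts
    return [w[p * step:(p + 1) * step] for p in range(parts)]


def tensor_parallel_mlp(x, w1, w2, world_size):
    """Compute relu(x @ w1) @ w2 with w1 column-sharded and w2 row-sharded."""
    w1_shards = split_columns(w1, world_size)
    w2_shards = split_rows(w2, world_size)
    partials = []
    for rank in range(world_size):
        # Each rank sees the full input and produces its slice of the hidden layer.
        hidden = relu(matmul(x, w1_shards[rank]))
        # The row-sharded second layer yields a partial sum of the full output.
        partials.append(matmul(hidden, w2_shards[rank]))
    # Stand-in for an all-reduce (sum) across ranks.
    return [sum(vals) for vals in zip(*partials)]


if __name__ == "__main__":
    x = [1, 2]
    w1 = [[1, -2, 3, 0], [0, 0, -2, 2]]
    w2 = [[1, 0], [2, 1], [0, 3], [1, 1]]
    print(tensor_parallel_mlp(x, w1, w2, world_size=2))
```

Note that the input is replicated on every rank here; sequence parallel would additionally shard activations along the sequence dimension.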

To test distributed inference (DI) with torchrun, just run ./distributed/run_dist_inference.sh, which launches the DI program.

pytorch-bot bot commented Jul 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/877

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 655ea0f with merge base c716548:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the "CLA Signed" label Jul 2, 2024
@fduwjj fduwjj requested review from kartikayk and lessw2020 July 2, 2024 18:22
@fduwjj fduwjj changed the title from "[Dist][Inference] U-haul TP and distribute utils code to TorchChat" to "[Distributed Inference] Make torch run work for torchchat and fix TP bugs" Jul 2, 2024
@lessw2020 lessw2020 left a comment

Thanks for adding this, especially the OOM (device 0) fix.
Tiny nit: update the one TP comment and remove the reference to sequence parallel, since it's not being used now.

@fduwjj fduwjj merged commit 7973c2a into main Jul 2, 2024
vmpuri pushed a commit that referenced this pull request Jul 8, 2024
…bugs (#877)

* [Distributed Inference] Make torch run work for torchchat
malfet pushed a commit that referenced this pull request Jul 17, 2024
…bugs (#877)

* [Distributed Inference] Make torch run work for torchchat
Labels

CLA Signed This label is managed by the Meta Open Source bot.

4 participants