[LLM] Fix mixtral example for Azure #3017

Merged on Jan 25, 2024 (5 commits)

Collaborator

@Michaelvll Michaelvll commented Jan 23, 2024

Fixes #2905
Fix the Mixtral example on Azure by removing the problematic default NCCL config file (/etc/nccl.conf).

With this PR in place, once #2434 is merged we should be able to switch the examples to disk_tier: best, so that Azure can be included among the candidate clouds.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --disk-tier none -c test-mixtral --cloud azure llm/mixtral/serve.yaml
    • sky launch -c test-gcp-mixtral --cloud gcp --use-spot llm/mixtral/serve.yaml
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@@ -34,6 +34,9 @@ setup: |
  pip list | grep megablocks || pip install megablocks

run: |
  # Remove the default nccl.conf which causes failure on Azure
  sudo mv /etc/nccl.conf /etc/nccl.conf.bak || true
Collaborator

What is this file for? Will this have any side-effects?

Collaborator

Also, is this a common issue for other workloads? Should we add it in azure-ray.yml.j2 instead?

Collaborator Author

This is a good point! On Azure VMs with multiple GPUs this file is misconfigured, which breaks distributed workloads that use NCCL. We should include this fix in azure-ray.yml.j2.
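
For illustration, a minimal sketch of how the workaround could be made safe to re-run if it is moved into a shared setup step. The helper name backup_conf is hypothetical and is not taken from this PR; only the path /etc/nccl.conf and the .bak suffix come from the diff in this PR.

```shell
# Hypothetical helper (not from the PR): move a config file aside if it exists.
# Calling it again after the file has been moved is a no-op.
backup_conf() {
  conf="$1"
  if [ -f "$conf" ]; then
    # Keep a backup next to the original so the change can be reverted by hand.
    mv -f "$conf" "$conf.bak"
  fi
  return 0
}
```

On a VM the target would be /etc/nccl.conf, and the move would need root privileges, as the sudo in the diff suggests. Checking for the file first plays the same role as the `|| true` in the original one-liner: a missing file does not fail the setup.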

Collaborator Author

Fixed. PTAL.

Collaborator

@MaoZiming MaoZiming left a comment

LGTM

Collaborator

@cblmemo cblmemo left a comment

LGTM

@Michaelvll Michaelvll merged commit f6b6b6e into master Jan 25, 2024
38 checks passed
@Michaelvll Michaelvll deleted the fix-mixtral-example-for-azure branch January 25, 2024 01:01
Successfully merging this pull request may close these issues.

[Example] Mixtral example fail to work on Azure VM due to NCCL error
3 participants