Update on "[FSDP][3/N] Unify fully_shard auto wrap"
This moves `fully_shard` to use `_auto_wrap()` just like `FullyShardedDataParallel`. This means that `fully_shard` goes through the `_init_param_handle_from_module()` path (i.e. 1 `fully_shard` per "wrap"), removing the need for `_init_param_handles_from_module()` (which was 1 `fully_shard` for all "wraps" of a given policy). `_auto_wrap()` simply calls `fully_shard` on target submodules.

This includes several important fixes:
- We should register the pre/post-forward hooks on the module regardless of whether it has managed parameters.
- We can permit `_module_handles` to return `[]` in the composable path (for when the module has no managed parameters).
- We should unify the paths for `_get_buffers_and_dtypes_for_computation()` (previously, the composable path was buggy in some cases).
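
As a rough illustration of the "1 `fully_shard` per wrap" flow described above, here is a toy sketch in plain Python. It does not use PyTorch: `ToyModule`, `toy_fully_shard`, and `toy_auto_wrap` are hypothetical stand-ins, and the real `_auto_wrap()` and `fully_shard` do far more. The sketch only shows the traversal order, that an empty handle list is allowed, and that hooks are registered either way.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyModule:
    """Hypothetical stand-in for a module: a name, its own params, and children."""
    name: str
    params: List[str] = field(default_factory=list)
    children: List["ToyModule"] = field(default_factory=list)
    handles: List[List[str]] = field(default_factory=list)  # loosely like `_module_handles`
    hooks_registered: bool = False
    wrapped: bool = False


def _collect_unmanaged_params(module: ToyModule) -> List[str]:
    """Gather params of `module` and of any descendants not already wrapped."""
    collected = list(module.params)
    for child in module.children:
        if not child.wrapped:
            collected.extend(_collect_unmanaged_params(child))
    return collected


def toy_fully_shard(module: ToyModule) -> None:
    """One call per wrap: claim the parameters nobody below has claimed yet."""
    managed = _collect_unmanaged_params(module)
    # An empty handle list is permitted when there is nothing to manage.
    module.handles = [managed] if managed else []
    # Hooks are registered whether or not the module manages parameters.
    module.hooks_registered = True
    module.wrapped = True


def toy_auto_wrap(root: ToyModule, policy: Callable[[ToyModule], bool]) -> None:
    """Apply `toy_fully_shard` to each target submodule, descendants first."""
    for child in root.children:
        toy_auto_wrap(child, policy)
        if policy(child):
            toy_fully_shard(child)


model = ToyModule(
    "root",
    params=["root.w"],
    children=[ToyModule("block0", params=["block0.w"]), ToyModule("act")],
)
toy_auto_wrap(model, policy=lambda m: m.name in {"block0", "act"})
toy_fully_shard(model)  # the outer call handles whatever the submodules left over
```

In this toy setup, `act` ends up with no handles but still has its hooks registered, which mirrors the first two fixes listed above.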

[ghstack-poisoned]
awgu committed Jul 7, 2023
2 parents 954b142 + 2a2951c commit c501068
Showing 206 changed files with 4,798 additions and 2,522 deletions.
4 changes: 3 additions & 1 deletion .ci/docker/common/install_base.sh
@@ -71,7 +71,9 @@ install_ubuntu() {
libtool \
vim \
unzip \
gdb
gdb \
libxml2-dev \
libxslt-dev

# Should resolve issues related to various apt package repository cert issues
# see: https://github.com/pytorch/pytorch/issues/65931
4 changes: 3 additions & 1 deletion .ci/docker/common/install_inductor_benchmark_deps.sh
@@ -9,6 +9,7 @@ function install_huggingface() {
version=$(get_pinned_commit huggingface)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "transformers==${version}"
}

@@ -17,8 +18,9 @@ function install_timm() {
commit=$(get_pinned_commit timm)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}"
}

install_huggingface
# install_timm
# install_timm
5 changes: 5 additions & 0 deletions .ci/docker/requirements-ci.txt
@@ -269,3 +269,8 @@ pytest-cpp==2.3.0
#Description: This is used by pytest to invoke C++ tests
#Pinned versions: 2.3.0
#test that import:

z3-solver
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
2 changes: 2 additions & 0 deletions .ci/pytorch/common_utils.sh
@@ -194,6 +194,7 @@ function install_huggingface() {
version=$(get_pinned_commit huggingface)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "transformers==${version}"
}

@@ -202,6 +203,7 @@ function install_timm() {
commit=$(get_pinned_commit timm)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}"
}

3 changes: 3 additions & 0 deletions .ci/pytorch/win-test.sh
@@ -37,6 +37,9 @@ fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0

# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver

run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do
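
Because the comment above calls Z3 an optional dependency, code that uses it would normally check for it before importing. A minimal sketch of such a guard follows, assuming only that the `z3-solver` package is imported as `z3`; the helper name `has_optional_dependency` and the `HAS_Z3` flag are hypothetical and are not taken from this commit.

```python
import importlib.util


def has_optional_dependency(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# The z3-solver distribution installs a module named `z3`.
HAS_Z3 = has_optional_dependency("z3")

if HAS_Z3:
    print("z3 is available; solver-backed checks can run")
else:
    print("z3 is not installed; solver-backed checks are skipped")
```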
2 changes: 1 addition & 1 deletion .github/ci_commit_pins/vision.txt
@@ -1 +1 @@
2d4484fba0f45637f68adadf5a056a6147642aa4
23b0938f897f4003bb26c086115de94b9976cb9f
2 changes: 1 addition & 1 deletion .github/ci_commit_pins/xla.txt
@@ -1 +1 @@
73392fc2a6c9ec40cba968ea66754514346ac79f
b799e206c6ffdee3aae3415ea9fe8edcabccb7a5
1 change: 1 addition & 0 deletions .github/requirements/pip-requirements-macOS.txt
@@ -24,3 +24,4 @@ filelock==3.6.0
sympy==1.11.1
pytest-cpp==2.3.0
rockset==1.0.3
z3-solver==4.12.2.0
23 changes: 15 additions & 8 deletions .github/scripts/filter_test_configs.py
@@ -73,6 +73,7 @@ def is_cuda_or_rocm_job(job_name: Optional[str]) -> bool:
BUILD_AND_TEST_JOB_NAME = "build-and-test"
JOB_NAME_CFG_REGEX = re.compile(r"(?P<job>[\w-]+)\s+\((?P<cfg>[\w-]+)\)")
EXCLUDED_BRANCHES = ["nightly"]
MEM_LEAK_LABEL = "enable-mem-leak-check"


class IssueType(Enum):
@@ -495,6 +496,12 @@ def perform_misc_tasks(

    set_output("reenabled-issues", ",".join(get_reenabled_issues(pr_body=pr_body)))

    if MEM_LEAK_LABEL in labels:
        # Enable mem leak check if label is added
        for config in test_matrix.get("include", []):
            if is_cuda_or_rocm_job(job_name):
                config["mem_leak_check"] = "mem_leak_check"


def main() -> None:
args = parse_args()
@@ -558,14 +565,6 @@ def main() -> None:
args.workflow, args.job_name, filtered_test_matrix
)

# Set the filtered test matrix as the output
set_output("test-matrix", json.dumps(filtered_test_matrix))

filtered_test_matrix_len = len(filtered_test_matrix.get("include", []))
# and also put a flag if the test matrix is empty, so subsequent jobs can
# quickly check it without the need to parse the JSON string
set_output("is-test-matrix-empty", filtered_test_matrix_len == 0)

pr_body = get_pr_info(int(pr_number)).get("body", "") if pr_number else ""

perform_misc_tasks(
@@ -575,6 +574,14 @@ def main() -> None:
pr_body=pr_body,
)

# Set the filtered test matrix as the output
set_output("test-matrix", json.dumps(filtered_test_matrix))

filtered_test_matrix_len = len(filtered_test_matrix.get("include", []))
# and also put a flag if the test matrix is empty, so subsequent jobs can
# quickly check it without the need to parse the JSON string
set_output("is-test-matrix-empty", filtered_test_matrix_len == 0)


if __name__ == "__main__":
main()
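
The reordering in `main()` above appears to matter because the new label check mutates the test matrix in place, so the `test-matrix` output has to be serialized after `perform_misc_tasks` runs. The sketch below illustrates that ordering under simplified assumptions: `is_gpu_job` is a hypothetical stand-in for `is_cuda_or_rocm_job`, `apply_label_tweaks` is a hypothetical reduction of the new block, and the job name is made up.

```python
import json
from typing import Any, Dict, Set

MEM_LEAK_LABEL = "enable-mem-leak-check"


def is_gpu_job(job_name: str) -> bool:
    # Hypothetical simplification of `is_cuda_or_rocm_job`.
    return "cuda" in job_name or "rocm" in job_name


def apply_label_tweaks(
    labels: Set[str], test_matrix: Dict[str, Any], job_name: str
) -> None:
    """Mutate `test_matrix` in place when the mem leak label is present."""
    if MEM_LEAK_LABEL in labels and is_gpu_job(job_name):
        for config in test_matrix.get("include", []):
            config["mem_leak_check"] = "mem_leak_check"


matrix = {"include": [{"config": "default", "shard": 1}]}

# Old order: serialize first, then tweak. The output misses the new key.
serialized_too_early = json.dumps(matrix)
apply_label_tweaks({MEM_LEAK_LABEL}, matrix, "example-cuda-job")  # made-up job name
# New order: serialize after the tweak, so the output carries the key.
serialized_after = json.dumps(matrix)
```

Serializing before the mutation would silently drop `mem_leak_check` from the output that downstream jobs read, which is presumably why the `set_output` calls were moved below `perform_misc_tasks`.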
2 changes: 1 addition & 1 deletion .github/scripts/generate_binary_build_matrix.py
@@ -241,7 +241,7 @@ def generate_wheels_matrix(
"pytorch_extra_install_requirements": "nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.8.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
17 changes: 11 additions & 6 deletions .github/workflows/_win-build.yml
@@ -48,6 +48,17 @@ jobs:
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
steps:
# Duplicated in win-test because this MUST go before a checkout
- name: Enable git symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main

@@ -65,12 +76,6 @@ jobs:
call C:\Jenkins\Miniconda3\Scripts\activate.bat C:\Jenkins\Miniconda3
call "C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" x64
# Duplicated in win-test because this MUST go before a checkout
- name: Enable git symlinks on Windows
shell: bash
run: |
git config --global core.symlinks true
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
8 changes: 7 additions & 1 deletion .github/workflows/_win-test.yml
@@ -36,11 +36,17 @@ jobs:
runs-on: ${{ matrix.runner }}
timeout-minutes: 300
steps:
- name: Enable git symlinks on Windows
# Duplicated in win-build because this MUST go before a checkout
- name: Enable git symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main

Expand Down

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit c501068

Please sign in to comment.