Update on "[JIT] memory planning base with naive strategy"
This PR adds static memory planning for traces with shape info. Three strategies are implemented so far, stacked on top of this base PR. They are:

* Linear scan (a heuristic based on https://www.usenix.org/legacy/events/vee05/full_papers/p132-wimmer.pdf)
* Greedy by size (a heuristic that allocates largest tensors first)
* Greedy by operator breadth (a heuristic that allocates largest tensors for operators with largest breadths first)

The latter two are based on https://arxiv.org/pdf/2001.03288.pdf.
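To make the greedy-by-size idea concrete, here is a minimal sketch under illustrative names and types (the PR's actual interfaces are described below): sort requests largest-first, then give each one the lowest offset that doesn't collide with an already-placed request whose lifetime overlaps.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Request {
  size_t begin, end;  // lifetime [begin, end], inclusive
  size_t size;        // bytes required
  size_t offset = 0;  // assigned slab offset (output)
};

bool lifetimesOverlap(const Request& a, const Request& b) {
  return a.begin <= b.end && b.begin <= a.end;
}

void planGreedyBySize(std::vector<Request>& reqs) {
  // Largest allocations get first pick of offsets.
  std::vector<Request*> order;
  for (auto& r : reqs) order.push_back(&r);
  std::sort(order.begin(), order.end(),
            [](const Request* a, const Request* b) { return a->size > b->size; });

  std::vector<Request*> placed;
  for (Request* r : order) {
    // Only requests with overlapping lifetimes constrain r's offset.
    std::vector<Request*> conflicts;
    for (Request* p : placed) {
      if (lifetimesOverlap(*p, *r)) conflicts.push_back(p);
    }
    std::sort(conflicts.begin(), conflicts.end(),
              [](const Request* a, const Request* b) { return a->offset < b->offset; });
    // First-fit: slide past each conflicting region until a gap fits.
    size_t offset = 0;
    for (const Request* c : conflicts) {
      if (offset + r->size <= c->offset) break;  // gap before c is big enough
      offset = std::max(offset, c->offset + c->size);
    }
    r->offset = offset;
    placed.push_back(r);
  }
}
```

Roughly speaking, greedy-by-operator-breadth is the same placement loop with a different ordering key (operator breadth first, then size), while linear scan instead walks requests in lifetime order and recycles the offsets of expired ranges.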

Differential Revision: [D30769100](https://our.internmc.facebook.com/intern/diff/D30769100)

This stack of PRs has gone through a lot of revisions (as you can see from the number of commits), and this module probably should've had a proper design doc, but here we are, so I'll briefly summarize the design.

# Memory Planner

We first perform alias analysis to find all of the values/tensors we'll manage allocations for (and their lifetimes). This reuses functionality from static runtime. With these in hand we can delegate to one of several planning strategies; the interface for these planners is a sorted map from lifetime (i.e., `[a, b]`) to the size of the required memory. Note that the required sizes need to be procured by some other mechanism (e.g. shape analysis, LTC, or runtime profiling). One technical detail is that we explicitly enforce unique lifetimes through the type

```cpp
struct UniqueLiveRange {
  LiveRange lvr;
  std::string id;
};
```

This anticipates extending the planner to GPU, where concurrent ops can make memory requests with identical lifetimes. Currently the `id` is the `debugName` of the output tensor, but later on it will be something like a stack trace (i.e. something that uniquely identifies where the allocation request was made *semantically*).

The planner produces a sequence of (`offset`, `size`) assignments corresponding to those lifetimes. Once a plan is constructed it is validated, and then a graph pass inserts nodes that implement slab allocation, memory slicing, and slab freeing.
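In other words (illustrative types only, not the PR's exact signatures), a strategy maps a sorted map of unique lifetimes and sizes to planned regions, and validation amounts to checking that no two allocations overlap in both time and space. A minimal sketch:

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

struct LiveRange {
  size_t begin, end;
};

struct UniqueLiveRange {
  LiveRange lvr;
  std::string id;  // disambiguates identical lifetimes (e.g. concurrent GPU ops)
};

// Order by lifetime first, then id, so identical lifetimes stay distinct keys.
struct UlvrLess {
  bool operator()(const UniqueLiveRange& a, const UniqueLiveRange& b) const {
    if (a.lvr.begin != b.lvr.begin) return a.lvr.begin < b.lvr.begin;
    if (a.lvr.end != b.lvr.end) return a.lvr.end < b.lvr.end;
    return a.id < b.id;
  }
};

struct Region {
  size_t offset, size;
};

// Strategy input: lifetime -> required bytes. Output: lifetime -> (offset, size).
using SortedLiveRangeMap = std::map<UniqueLiveRange, size_t, UlvrLess>;
using Plan = std::map<UniqueLiveRange, Region, UlvrLess>;

// A plan is valid iff no two allocations overlap in both lifetime and slab region.
bool validatePlan(const Plan& plan) {
  for (auto i = plan.begin(); i != plan.end(); ++i) {
    for (auto j = std::next(i); j != plan.end(); ++j) {
      bool overlapTime = i->first.lvr.begin <= j->first.lvr.end &&
                         j->first.lvr.begin <= i->first.lvr.end;
      bool overlapSpace = i->second.offset < j->second.offset + j->second.size &&
                          j->second.offset < i->second.offset + i->second.size;
      if (overlapTime && overlapSpace) return false;
    }
  }
  return true;
}
```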

# Registering Ops

We add three new primitive ops: `prim::AllocateSlab`, `prim::AllocateTensor`, and `prim::ReleaseSlab`. One general uncertainty I have is around the right way to perform the allocation and the freeing, since there seem to be several equivalent ways (e.g. you can swap storage with `tensor->storage().set_data_ptr_noswap`, like static runtime does).
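For illustration, here is roughly what the storage-swap variant looks like (a sketch in the spirit of static runtime, not necessarily what `prim::AllocateTensor` ships with; the helper name is mine): the output tensor's storage is pointed at `slab_base + offset` with a no-op deleter, so the slab, not the tensor, owns the memory.

```cpp
#include <ATen/ATen.h>
#include <cstddef>

void bindTensorToSlab(at::Tensor& t, void* slab_base, size_t offset, size_t nbytes) {
  void* region = static_cast<char*>(slab_base) + offset;
  // No-op deleter: freeing is prim::ReleaseSlab's job, not this tensor's.
  at::DataPtr ptr(region, region, [](void*) {}, t.device());
  t.storage().set_data_ptr_noswap(std::move(ptr));
  t.storage().set_nbytes(nbytes);
}
```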

# Memory Observer

This is functionality for collecting data about allocations and frees at runtime. It's modeled on kineto but adds `FrameNodeId`, a means of figuring out which node triggered which allocation (currently kineto only knows about dispatcher calls rather than graph nodes).
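For concreteness, the observer conceptually emits records along these lines (field names here are my illustration, not the PR's exact definitions):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Ties an allocation event back to the graph node being interpreted —
// the piece kineto alone can't provide.
struct FrameNodeId {
  size_t pc;                // interpreter program counter at the event
  std::string node_header;  // printable identification of the node
};

enum class MemEventKind { Allocate, Free };

struct MemEvent {
  uint64_t time_ns;    // when the event happened
  uintptr_t addr;      // address of the (de)allocated region
  size_t size;         // bytes allocated or freed
  MemEventKind kind;
  FrameNodeId frame;   // which node triggered this event
};
```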

Lumped in with this PR it serves only to validate test cases, and might therefore seem a tad over-engineered. In reality it's a rebase that started life as a necessary component of "memoization"-based planning. Indeed, it's modeled on kineto because it started as additions to kineto that became too orthogonal to kineto's purpose.

# Odds and ends

`valid_add` and `valid_sub` check for overflow, since we're doing arithmetic on `size_t`. On GCC and Clang we have `__builtin_add_overflow` and `__builtin_sub_overflow`, but not on MSVC, so there we fall back to the "dumber" checks `a + b >= a` and `a >= b` (respectively). It could be argued that we should just drop the more opaque builtins completely, but I'm not sure.
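A minimal sketch of helpers in this spirit (the PR's actual signatures may differ; the `std::optional` returns are my choice):

```cpp
#include <cstddef>
#include <optional>

std::optional<size_t> valid_add(size_t a, size_t b) {
#if defined(__GNUC__) || defined(__clang__)
  size_t out;
  if (__builtin_add_overflow(a, b, &out)) return std::nullopt;
  return out;
#else
  // Unsigned overflow wraps, so a wrapped sum is strictly less than a.
  return (a + b >= a) ? std::optional<size_t>(a + b) : std::nullopt;
#endif
}

std::optional<size_t> valid_sub(size_t a, size_t b) {
#if defined(__GNUC__) || defined(__clang__)
  size_t out;
  if (__builtin_sub_overflow(a, b, &out)) return std::nullopt;
  return out;
#else
  // size_t subtraction underflows exactly when a < b.
  return (a >= b) ? std::optional<size_t>(a - b) : std::nullopt;
#endif
}
```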

`PreprocessGraph` is made public because the interpreter preprocesses graphs before running them (changing SSA names, amongst other things), which would make it impossible to verify that the planner planned successfully. With `PreprocessGraph` we can anticipate and reconcile those changes to the graph.

[ghstack-poisoned]
makslevental committed Nov 3, 2021
2 parents 796d334 + 31912fb commit c545281
Showing 2,442 changed files with 156,895 additions and 58,050 deletions.
2 changes: 1 addition & 1 deletion .azure_pipelines/job_templates/prepare-build-template.yml
@@ -46,7 +46,7 @@ steps:
curl -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output .\tmp_bin\sccache.exe
curl -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output .\tmp_bin\sccache-cl.exe
copy .\tmp_bin\sccache.exe .\tmp_bin\nvcc.exe
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.3/randomtemp.exe --output .\tmp_bin\randomtemp.exe
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.4/randomtemp.exe --output .\tmp_bin\randomtemp.exe
displayName: Install sccache and randomtemp
condition: not(eq(variables.CUDA_VERSION, ''))
4 changes: 1 addition & 3 deletions .azure_pipelines/job_templates/set-environment-variables.yml
@@ -120,9 +120,7 @@ steps:
Write-Host "##vso[task.setvariable variable=CMAKE_LIBRARY_PATH;]$(Build.SourcesDirectory)\mkl\lib;$env:CMAKE_LIBRARY_PATH"
Write-Host "##vso[task.setvariable variable=ADDITIONAL_PATH;]$(Build.SourcesDirectory)\tmp_bin"
Write-Host "##vso[task.setvariable variable=SCCACHE_IDLE_TIMEOUT;]1500"
Write-Host "##vso[task.setvariable variable=RANDOMTEMP_EXECUTABLE;]$(Build.SourcesDirectory)\tmp_bin\nvcc.exe"
Write-Host "##vso[task.setvariable variable=CUDA_NVCC_EXECUTABLE;]$(Build.SourcesDirectory)\tmp_bin\randomtemp.exe"
Write-Host "##vso[task.setvariable variable=RANDOMTEMP_BASEDIR;]$(Build.SourcesDirectory)\tmp_bin"
Write-Host "##vso[task.setvariable variable=CMAKE_CUDA_COMPILER_LAUNCHER;]$(Build.SourcesDirectory)/tmp_bin/randomtemp.exe;$(Build.SourcesDirectory)/tmp_bin/sccache.exe"
displayName: Set MKL, sccache and randomtemp environment variables
# View current environment variables
3 changes: 2 additions & 1 deletion .circleci/cimodel/data/binary_build_data.py
@@ -63,7 +63,8 @@ def get_processor_arch_name(gpu_version):
],
)),
windows=(
[v for v in dimensions.GPU_VERSIONS if v not in dimensions.ROCM_VERSION_LABELS],
# Stop building Win+CU102, see https://github.com/pytorch/pytorch/issues/65648
[v for v in dimensions.GPU_VERSIONS if v not in dimensions.ROCM_VERSION_LABELS and v != "cuda102"],
OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
2 changes: 1 addition & 1 deletion .circleci/cimodel/data/dimensions.py
@@ -7,9 +7,9 @@
]

ROCM_VERSIONS = [
"4.0.1",
"4.1",
"4.2",
"4.3.1",
]

ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
23 changes: 0 additions & 23 deletions .circleci/cimodel/data/pytorch_build_data.py
@@ -10,19 +10,6 @@
]),
]),
# TODO: bring back libtorch test
("7", [X("3.6")]),
]),
("clang", [
("7", [
("3.6", [
("asan", [
(True, [
("shard_test", [XImportant(True)]),
]),
]),
("onnx", [XImportant(True)]),
]),
]),
]),
("cuda", [
("10.2", [
@@ -52,7 +39,6 @@
("9", [
("3.6", [
("xla", [XImportant(True)]),
("vulkan", [XImportant(True)]),
]),
]),
]),
@@ -145,7 +131,6 @@ def child_constructor(self):
"build_only": BuildOnlyConfigNode,
"shard_test": ShardTestConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode,
"coverage": CoverageConfigNode,
"pure_torch": PureTorchConfigNode,
"slow_gradcheck": SlowGradcheckConfigNode,
}
@@ -289,14 +274,6 @@ def child_constructor(self):
return ImportantConfigNode


class CoverageConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_coverage"] = node_name

def child_constructor(self):
return ExperimentalFeatureConfigNode


class ImportantConfigNode(TreeConfigNode):
def modify_label(self, label):
return "IMPORTANT=" + str(label)
27 changes: 0 additions & 27 deletions .circleci/cimodel/data/pytorch_build_definitions.py
@@ -239,7 +239,6 @@ def instantiate_configs(only_slow_gradcheck):
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
is_asan = fc.find_prop("is_asan") or False
is_coverage = fc.find_prop("is_coverage") or False
is_noarch = fc.find_prop("is_noarch") or False
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
@@ -284,10 +283,6 @@ def instantiate_configs(only_slow_gradcheck):
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")

if is_coverage:
parms_list_ignored_for_docker_image.append("coverage")
python_version = fc.find_prop("pyver")

if is_noarch:
parms_list_ignored_for_docker_image.append("noarch")

@@ -357,28 +352,6 @@ def instantiate_configs(only_slow_gradcheck):
tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)

if (
compiler_name != "clang"
and not rocm_version
and not is_libtorch
and not is_vulkan
and not is_pure_torch
and not is_noarch
and not is_slow_gradcheck
and not only_slow_gradcheck
and not build_only
):
distributed_test = Conf(
c.gen_build_name("") + "distributed",
[],
is_xla=False,
restrict_phases=["test"],
is_libtorch=False,
is_important=True,
parent_build=c,
)
c.dependent_tests.append(distributed_test)

config_list.append(c)

return config_list
6 changes: 0 additions & 6 deletions .circleci/cimodel/data/simple/android_definitions.py
@@ -90,12 +90,6 @@ def gen_tree(self):
["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build"],
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch_android_gradle_custom_build_single",
[DOCKER_REQUIREMENT_NDK],
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit",
"pytorch_android_gradle_custom_build_single",
10 changes: 5 additions & 5 deletions .circleci/cimodel/data/simple/binary_smoketest.py
@@ -120,9 +120,9 @@ def gen_tree(self):
),
SmoketestJob(
"binary_windows_build",
["wheel", "3.7", "cu102"],
["wheel", "3.7", "cu113"],
None,
"binary_windows_wheel_3_7_cu102_build",
"binary_windows_wheel_3_7_cu113_build",
is_master_only=True,
),

@@ -144,11 +144,11 @@
),
SmoketestJob(
"binary_windows_test",
["wheel", "3.7", "cu102"],
["wheel", "3.7", "cu113"],
None,
"binary_windows_wheel_3_7_cu102_test",
"binary_windows_wheel_3_7_cu113_test",
is_master_only=True,
requires=["binary_windows_wheel_3_7_cu102_build"],
requires=["binary_windows_wheel_3_7_cu113_build"],
extra_props={
"executor": "windows-with-nvidia-gpu",
},
27 changes: 5 additions & 22 deletions .circleci/cimodel/data/simple/docker_definitions.py
@@ -4,38 +4,21 @@
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN


# TODO: make this generated from a matrix rather than just a static list
# NOTE: All hardcoded docker image builds have been migrated to GHA
IMAGE_NAMES = [
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7",
"pytorch-linux-bionic-py3.6-clang9",
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9",
"pytorch-linux-bionic-py3.8-gcc9",
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
"pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
"pytorch-linux-xenial-py3-clang5-asan",
"pytorch-linux-xenial-py3-clang7-asan",
"pytorch-linux-xenial-py3-clang7-onnx",
"pytorch-linux-xenial-py3.8",
"pytorch-linux-xenial-py3.6-clang7",
"pytorch-linux-xenial-py3.6-gcc5.4", # this one is used in doc builds
"pytorch-linux-xenial-py3.6-gcc7.2",
"pytorch-linux-xenial-py3.6-gcc7",
"pytorch-linux-bionic-rocm4.1-py3.6",
"pytorch-linux-bionic-rocm4.2-py3.6",
"pytorch-linux-bionic-rocm4.3.1-py3.6",
]

# This entry should be an element from the list above
# This should contain the image matching the "slow_gradcheck" entry in
# pytorch_build_data.py
SLOW_GRADCHECK_IMAGE_NAME = "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"

def get_workflow_jobs(only_slow_gradcheck=False):
def get_workflow_jobs(images=IMAGE_NAMES, only_slow_gradcheck=False):
"""Generates a list of docker image build definitions"""
ret = []
for image_name in IMAGE_NAMES:
for image_name in images:
if image_name.startswith('docker-'):
image_name = image_name.lstrip('docker-')
if only_slow_gradcheck and image_name is not SLOW_GRADCHECK_IMAGE_NAME:
continue

6 changes: 6 additions & 0 deletions .circleci/cimodel/data/simple/ios_definitions.py
@@ -75,6 +75,12 @@ def gen_tree(self):
IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={
"op_list": "mobilenetv2.yaml",
"lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(XCODE_VERSION, ArchVariant("x86_64", "coreml"), is_org_member_context=False, extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True)))}),
]


33 changes: 0 additions & 33 deletions .circleci/cimodel/data/simple/mobile_definitions.py
@@ -4,12 +4,6 @@

import cimodel.lib.miniutils as miniutils
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_ASAN,
DOCKER_REQUIREMENT_ASAN,
DOCKER_IMAGE_NDK,
DOCKER_REQUIREMENT_NDK
)


class MobileJob:
@@ -52,33 +46,6 @@ def gen_tree(self):


WORKFLOW_DATA = [
MobileJob(
DOCKER_IMAGE_ASAN,
[DOCKER_REQUIREMENT_ASAN],
["build"]
),

# Use LLVM-DEV toolchain in android-ndk-r19c docker image
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["custom", "build", "dynamic"]
),

MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["custom", "build", "static"]
),

# Use LLVM-DEV toolchain in android-ndk-r19c docker image
# Most of this CI is already covered by "mobile-custom-build-dynamic" job
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["code", "analysis"],
True
),
]


