
Failures in cuda11.7-py3.10-gcc7-sm86-periodic-dynamo-benchmarks #93847

Closed

atalman opened this issue Feb 1, 2023 · 6 comments
Labels: oncall: pt2, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

atalman (Contributor) commented Feb 1, 2023

While migrating our CI from CUDA 11.6 to CUDA 11.7 (#93406), I see multiple failures in the cuda11.7-py3.10-gcc7-sm86-periodic-dynamo-benchmarks workflow.

GitHub workflow failure: https://github.com/pytorch/pytorch/actions/runs/4060149836/jobs/6989215115

aot_eager_all internal link:

Error: cuda eval  maml                                [2023-02-01 03:15:46,456] torch._dynamo.utils: [ERROR] Accuracy failed: allclose not within tol=0.0001

Error: cuda train tinynet_a                           [2023-02-01 04:32:59,763] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00769, (ref-fp64): 0.00072 and shape=torch.Size([32])
Error: [2023-02-01 04:32:59,763] torch._dynamo.utils: [ERROR] Accuracy failed for key name bn1.weight.grad
FAIL

Error: cuda train gernet_l                            [2023-02-01 04:20:23,425] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.02019, (ref-fp64): 0.00534 and shape=torch.Size([640])
Error: [2023-02-01 04:20:23,425] torch._dynamo.utils: [ERROR] Accuracy failed for key name stages.3.0.shortcut.bn.running_var
FAIL

Error: cuda train gluon_xception65                    [2023-02-01 04:21:35,059] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00408, (ref-fp64): 0.00054 and shape=torch.Size([728])
Error: [2023-02-01 04:21:35,060] torch._dynamo.utils: [ERROR] Accuracy failed for key name mid.block17.rep.bn1.weight.grad
FAIL

dynamic_aot_eager_torchbench internal link:

Error: cuda eval  maml                                [2023-02-01 03:15:46,456] torch._dynamo.utils: [ERROR] Accuracy failed: allclose not within tol=0.0001

dynamic_aot_eager_timm 1 internal link:

Error: cuda train gernet_l                            [2023-02-01 03:27:46,292] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.02019, (ref-fp64): 0.00534 and shape=torch.Size([640])
Error: [2023-02-01 03:27:46,292] torch._dynamo.utils: [ERROR] Accuracy failed for key name stages.3.0.shortcut.bn.running_var
FAIL
Error: cuda train gluon_xception65                    [2023-02-01 03:30:28,459] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00408, (ref-fp64): 0.00054 and shape=torch.Size([728])
Error: [2023-02-01 03:30:28,459] torch._dynamo.utils: [ERROR] Accuracy failed for key name mid.block17.rep.bn1.weight.grad
FAIL

dynamic_aot_eager_timm 2 internal link:

Error: cuda train tinynet_a                           [2023-02-01 03:47:04,591] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00769, (ref-fp64): 0.00072 and shape=torch.Size([32])
Error: [2023-02-01 03:47:04,591] torch._dynamo.utils: [ERROR] Accuracy failed for key name bn1.weight.grad
FAIL
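
For context on the log format above: the accuracy harness runs each model as compiled, as eager, and as an eager float64 golden reference, and a parameter/gradient key fails when the compiled result drifts from the fp64 golden value much more than the eager result does. Below is a minimal sketch of that comparison; the `multiplier` value and helper names are illustrative assumptions, and the real logic lives in torch._dynamo.utils.same() with more special cases.

```python
# Simplified sketch of the RMSE-vs-fp64 accuracy check behind the log lines above.
import torch

def rmse(a: torch.Tensor, b: torch.Tensor) -> float:
    # root-mean-square error, computed in float64
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2)).item()

def accuracy_ok(res, ref, fp64_ref, multiplier: float = 2.0) -> bool:
    # res: compiled output, ref: eager output, fp64_ref: eager output run in float64.
    # The compiled result may deviate from the fp64 golden value by at most
    # `multiplier` times as much as eager does (2.0 is an assumed value here,
    # not necessarily the benchmark suite's exact constant).
    res_error = rmse(res, fp64_ref)   # "RMSE (res-fp64)" in the log
    ref_error = rmse(ref, fp64_ref)   # "RMSE (ref-fp64)" in the log
    return res_error <= multiplier * ref_error
```

For tinynet_a's bn1.weight.grad, for instance, the log shows RMSE (res-fp64) = 0.00769 against RMSE (ref-fp64) = 0.00072, roughly a 10x gap, which is why that key is reported as failing.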

cc @ezyang @soumith @msaroufim @wconstab @ngimel @bdhirsh @malfet @ptrblck

Versions

CI 31.01.2023

desertfire (Contributor) commented:

I wonder what the inductor test results look like. One difference I noticed is that we have turned on dynamic shapes for the aot_eager tests but not for the inductor tests, so I wonder if this has something to do with dynamic shapes. It could also just come from numerical differences between CUDA versions.
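
One way to probe that hypothesis locally is to run the same training step through the aot_eager backend with dynamic shapes on and off and compare a BatchNorm gradient against eager. This is only a sketch: the toy model, input sizes, and use of torch.compile's `dynamic=` flag are assumptions standing in for the actual benchmark runner invocation.

```python
# Sketch: does toggling dynamic shapes change aot_eager's BN gradient vs eager?
# Assumes a CUDA device, matching the failing CI job.
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
eager = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1),
    torch.nn.BatchNorm2d(32),   # the failing keys in the logs are all BN params/stats
    torch.nn.ReLU(),
).cuda()
x = torch.randn(8, 3, 32, 32, device="cuda")
y = torch.randn(8, 32, 32, 32, device="cuda")

def bn_weight_grad(run_fn, module_with_params):
    # one forward/backward step; read the BN weight grad off the underlying module
    module_with_params.zero_grad()
    F.mse_loss(run_fn(x), y).backward()
    return module_with_params[1].weight.grad.detach().clone()

ref = bn_weight_grad(eager, eager)
for dynamic in (False, True):
    m = copy.deepcopy(eager)
    compiled = torch.compile(m, backend="aot_eager", dynamic=dynamic)
    res = bn_weight_grad(compiled, m)
    diff = torch.sqrt(torch.mean((res - ref) ** 2)).item()
    print(f"dynamic={dynamic}: RMSE(bn.weight.grad vs eager) = {diff:.2e}")
```

If the non-dynamic run shows the same drift, that would point at CUDA-version numerics rather than dynamic shapes.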

ezyang (Contributor) commented Feb 1, 2023

The difference is that the dynamic jobs are sharded but the non-dynamic ones are not. You can see that the errors are exactly the same when you concatenate the dynamic shards together, so this is not related to dynamic shapes.

malfet (Contributor) commented Feb 1, 2023

@atalman why internal links, when the logs are available publicly on S3?

atalman (Contributor, Author) commented Feb 1, 2023

Here is the direct link to the failed workflow on GitHub: https://github.com/pytorch/pytorch/actions/runs/4060149836/jobs/6989215115

ezyang (Contributor) commented Feb 4, 2023

When I re-ran the failed dynamic jobs, they passed. I deleted the skips in #94114.

So someone just needs to look at the non-dynamic case.
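
For whoever picks up the non-dynamic case, a quick way to see which parameter gradients drift after compilation is to diff every named gradient between an eager and an aot_eager run, mirroring the "Accuracy failed for key name …" lines above. The sketch below uses a toy conv + BN model as a stand-in for the timm models named in the logs.

```python
# Sketch: per-parameter gradient drift report for a non-dynamic aot_eager run.
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
).cuda()
x = torch.randn(4, 3, 32, 32, device="cuda")
target = torch.randint(0, 10, (4,), device="cuda")

def grads(module_with_params, run_fn):
    # one forward/backward; return a {param name: grad} snapshot
    module_with_params.zero_grad()
    F.cross_entropy(run_fn(x), target).backward()
    return {n: p.grad.detach().clone() for n, p in module_with_params.named_parameters()}

ref = grads(model, model)                              # eager reference
m = copy.deepcopy(model)
res = grads(m, torch.compile(m, backend="aot_eager"))  # non-dynamic compile

# Largest drift first, analogous to the failing "key name bn1.weight.grad" entries.
for err, key in sorted(
    ((torch.sqrt(torch.mean((res[k] - ref[k]) ** 2)).item(), k) for k in ref),
    reverse=True,
):
    print(f"{key:20s} rmse={err:.3e}")
```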

albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Feb 7, 2023
desertfire added a commit that referenced this issue Mar 5, 2023
Pull Request resolved: #96049
desertfire added seven more commits that referenced this issue on Mar 5–6, 2023
pytorchmergebot pushed a commit that referenced this issue Mar 8, 2023
Pull Request resolved: #96049
pytorchmergebot pushed two more commits that referenced this issue Mar 8, 2023
desertfire added a commit that referenced this issue May 8, 2023
Pull Request resolved: #96049
desertfire added two more commits that referenced this issue May 8, 2023
pytorchmergebot pushed a commit that referenced this issue May 10, 2023
clrpackages pushed a commit to clearlinux-pkgs/pytorch that referenced this issue Oct 26, 2023
….1.0

ALi (1):
      implement __dir__ for dynamo (#102480)

Aaron Bockover (3):
      Introduce torch.onnx.dynamo_export API (#97920)
      Add Thiago Crepaldi (ONNX) to CODEOWNERS (#103894)
      [ONNX] Support `torch.compile(backend="onnxrt", options=OrtBackendOptions(...))` (#107973)

Aaron Enye Shi (14):
      Manual submodule update: kineto and libfmt bazel issue (#94756) (#95535)
      [Profiler] Add export_memory_timeline to save memory timeline plot to file (#96137)
      [Profiler] Memory timeline to show actual timestamps (#96535)
      [Kineto] Improve Config Options for Input Shapes, Memory, Stack, Flops, and Modules - Part 1 (#97380)
      [Kineto] Improve Config Options Part 2 - update to new Kineto Submodule (#97556)
      [Profiler][Easy] Fix typo in Profiler report input shapes (#99430)
      [Profiler] Support HTML plot output for profiler export_memory_timeline API (#99751)
      [Profiler] Fix HTML plot output for profiler export_memory_timeline (#101316)
      [Profiler] Workaround CUPTI Lazy Reinit and CUDA Graphs crash in CUDA 11 (#101879)
      [Profiler] Update Kineto Submodule (#103031)
      [Profiler] Include more uncategorized events in memory profile (#101200)
      [Profiler][Easy] Add log msg to assertEqual for flaky test_memory_timeline_no_id (#103326)
      [Profiler] Fix flaky test_memory_timeline_no_id (#103441)
      [Profiler][Memory] Export raw timestamped events in export_memory_timeline_raw (#105094)

Aaron Gokaslan (34):
      [BE] Add flake8-logging-format linter (#94840)
      [BE]: Merge startswith calls - rule PIE810 (#96754)
      Remove unnecessary items() call in zero_grad (#97040)
      [BE] Remove unnecessary dict comprehensions (#97116)
      [BE] Update flake8-comprehensions to 3.11.1 (#97671)
      [BE]: Update flake8 and plugins and fix bugs (#97795)
      Update ufmt to v2.1.0 (#97900)
      [BE] Enable flake8-comprehension rule C417 (#97880)
      [BE] Enable flake8-simplify checks (#97984)
      [BE] Update flake8-comprehensions and adapt to rule C418 (#99178)
      [BE] Update python versions for black formatter config (#99827)
      [BE] Enable C419 rule for any all shortcircuiting (#99890)
      Update Cutlass to v3.1 (#94188)
      [BE] Update cutlass with NVIDIA upstream changes to 3.1 (#100333)
      [BE] Fix flake8 B027 errors - missing abstractmethod decorator (#100715)
      [BE]: enable PLE error codes in ruff and fix bugs (#101079)
      [BE]: Bugfix functorch and some generic typing improvements (#101337)
      [BE]: Cleanup deprecated stdlib imports (UP006,UP035) (#101361)
      [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
      [BE] switch fprintf to fmt::print (#104640)
      Update cutlass submodule to stable 3.1 from RC (#104638)
      Update cuDNN frontend submodule to v9.1 (#104847)
      [BE]: Apply ruff PERF fixes to torch (#104917)
      Fix merged lintrunner error (#105005)
      Update pybind11 submodule to 2.11.0 (#105245)
      [BE]: Update Ruff to 0.0.280 (#105724)
      [BE]: Enable ruff rules PIE807 and PIE810 (#106218)
      Update submodule NCCL to v2.18.3 (#104993)
      [BE]: Update ruff to 0.285 (#107519)
      [BE]: Apply PYI autofixes to various types (#107521)
      [BE]: Update cudnn_frontend submodule to v0.9.2. (#107525)
      [BE]: Add PYI files to ruff lintrunner (#107524)
      [BE]: Update ruff to 0.285 (#107519)
      Update nccl submodule to 2.18.5 (#107883)

Abhishek Jindal (1):
      Correct typo for NCCL_MAJOR (#99482)

Adnan Akhundov (9):
      Enable addmm + GELU epilogue fusion via cuBLASLt (#103811)
      Fix test_addmm_gelu assertion on Windows CUDA (#104031)
      Unify GELU tanh approximation in _addmm_activation GPU back-end (#104061)
      [inductor] addmm + ReLU / GELU fusion pass (#104132)
      [inductor] Enable mypy checking in lowering.py (#105317)
      Skip Triton templates in MM max autotune with zero-size inputs (#106865)
      [inductor] Adjust dynamic SMEM limit when above default in AOT (#107601)
      [reland][inductor] Adjust dynamic SMEM limit when above default in AOT (#107814)
      [inductor] Add cat + split_with_sizes elimination pass (#107956)

Aiden Nibali (1):
      Default permissions for torch.hub downloads (#82869)

Aidyn-A (9):
      [TorchScript] Fix torch.cuda._exchange_device (#95306)
      [inductor] fix typos in test_torchinductor.py (#96233)
      [CUDA12] Autograd engine use current device only (#92354)
      [CUDA12] set_device change (#94864)
      [CUDA12] set_device change (#94864)
      [NCCL] Use OptionalCUDAGuard in ProcessGroupNCCL::WorkNCCL::synchronizeInternal (#98895)
      Fix test_multiple_devices_randint_cuda (#99775)
      [CI] Enable UCC in CI (#100395)
      [MPI] Allow previously initialized (#105023)

Akila Premachandra (1):
      [dynamo] Fix TimmRunner typo in benchmarks (#104052)

Akinori Mitani (1):
      Update torch.arange doc. (#99963)

Alan Ji (2):
      fix some typos (#106018)
      remove the duplicate method `is_private_use1` in class Device (#107198)

Albert Chen (2):
      [PT][FSDP] Combine _utils.py into _common_utils.py [1/3] (#105857)
      [PT][FSDP] Combine _utils.py into _common_utils.py [2/2] (#106181)

Aleksandar Samardžić (13):
      Add CUTLASS-based MM for structured sparse linear operator (#100485)
      Implement adding bias vector into structured sparse linear operator (#100881)
      Add activation functions (ReLU  and SiLU for now) for structured sparse linear operator (#101339)
      Add missing decompositons/lowerings for logical/bitwise operators (#102566)
      Implement adding bias vector into structured sparse linear operator (#100881)
      Add activation functions (ReLU  and SiLU for now) for structured sparse linear operator (#101339)
      Fix autograd issue with identity conversions (#92022)
      Support bfloat16 dtype for CUTLASS-based semi-structured sparsity (#103978)
      Add semi-structured sparse conversions (#103830)
      Update sparse semi-structured linear operator (#104608)
      Make conversions from/to sparse semi-structured always @torch.compile-d (#105272)
      Remove CUTLASS extensions merged upstream (#107612)
      Remove CUTLASS extensions merged upstream (#107612)

Aleksei Nikiforov (27):
      test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)
      Fix TestBufferProtocolCPU::test_byte_to_int_cpu test on Big Endian (#96424)
      Revert "test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)" (#97759)
      TensorExpr eval: fix copying variables from pointers on big endian systems (#96951)
      Fix saving and loading pickle files on Big Endian systems (#95881)
      Reintroduce s390x SIMD support (#99057)
      Fix loading data on different encoding (#94503)
      Remove little endian asserts (#99713)
      Remove inclusion of non-existent header on s390x (#99870)
      Fix byteswapping (#99869)
      S390x tests (#99871)
      Don't apply _Py_OPCODE twice (#97986)
      ASAN: fix use-after-free (#101064)
      ASAN: fix use-after-free (#101400)
      s390x zvector: implement expm1 for complex vectorized types (#99872)
      s390x SIMD: Propagate NaN in minimum and maximum operations (#99716)
      s390x simd: disable functions with out-of-bounds reads (#102266)
      ASAN: fix heap-buffer-overflow (#101970)
      S390x clang fixes for SIMD (#100874)
      s390x simd: ensure that vectorized complex constructor behaves same to x86 (#103426)
      s390x simd: switch clamp min and max order (#103849)
      s390x simd: update abs() functions for vectors of complex numbers (#103850)
      s390x SIMD: propagate NaN value in clamp functions (#102978)
      Add functions to get and set default endianness in load() functions (#101973)
      s390x: fix special_hermite_polynomial_h for '+/-inf' and '+/-nan' (#104705)
      Make sure that little endian is default case when __BYTE_ORDER__ is not defined (#104249)
      When byteorder record is missing load as little endian by default (#108523)

Alex Settle (1):
      Add sequence_nr to aot_autograd to map forward ops to their corresponding backward ops (#103129)

Alexander Jipa (1):
      fixing named tensor unflatten example (#106921)

Alexander Pivovarov (5):
      Fixe some typos (#105869)
      Fix some typos, mostly "that that" (#106901)
      Fix rst formatting in dynamo/guards-overview doc (#107275)
      Use compiled model in torch.compiler_get_started (#107267)
      Fix rst formatting in torch.compiler_troubleshooting.rst (#107360)

Alexander Yermolovich (1):
      [llvm-17][ORC] Fix for move most ORC APIs to ExecutorAddr, introduce ExecutorSymbolDef. (#98811)

Alexis Thual (1):
      Fix Wishart distribution documentation (#95816)

Ali Kamali (1):
      Fixing a bug where allocating a 4GB block results in using 8GB of memory (#95827)

Ali Moezzi (2):
      Merge original module attributes with attributes assigned by __setattr__ (#102910)
      Fix lr_scheduler serialization contains bound methods issue (#102627)

AllenTiTaiWang (54):
      [ONNX] Add bloom ops (#94878)
      [ONNX] Support aten::bit_wise_not in fx-onnx exporter (#94919)
      [ONNX] Refactor validation op-level (#94920)
      [ONNX] Set shape/type into torchscript (#96349)
      [ONNX] Support converting fx graph with symbolic shape to ONNX (#96350)
      [ONNX] Refactor op level debugging (#97494)
      [ONNX] Fix scalar elements in op.Concat (#98509)
      [ONNX] Skip flaky dynamic tests before ORT==1.15 in fx exporter (#98856)
      [ONNX] Support aten::unflatten in torchscript exporter (#99056)
      [ONNX] Refactor ShapeInferenceWithFakeTensor to fill  metavalue into the original gm (#98760)
      [ONNX] Support aten::scaled_dot_product_attention in torchscript exporter (#99658)
      [ONNX] Add additional_test_kwargs into test_fx_to_onnx_with_onnxruntime.py (#99434)
      [ONNX] Bump onnx-script version with imported module renaming (#99926)
      [ONNX] Support aten::tile in torchscript exporter (#99927)
      [ONNX] Support aten::atan2 in torchscript exporter (#100040)
      [ONNX] Add test_fx_op_consistency.py (#99465)
      [ONNX] Refactor test_op_consistenct.py and test_fx_op_consistency.py (#100172)
      [ONNX] Add xfail into subtests of op consistency and retire fixme (#100173)
      [ONNX] Set tracing_mode through options.dynamic_shapes and enable dynamic tests in test_fx_to_onnx_runtime.py (#100212)
      [ONNX] Add RemoveConstantInputStep to adapt torch inputs to ONNX inputs (#100252)
      [ONNX] Add supported ops into test_fx_op_consistency - 1st batch (#100265)
      [ONNX] Diagnostic 'log' and 'log_and_raise_if_error' (#100407)
      [ONNX] Diagnostic to show all unsupported call_functions (#100451)
      [ONNX] Support aten::broadcast_to (#101833)
      [ONNX] Bump onnx submodule to release 1.14.0 (#101809)
      [ONNX] Introduce FX-ONNX dispatcher (#100660)
      [ONNX] Bump ORT version to 1.15.0 (#102248)
      [ONNX] Support aten::scatter_reduce (#102048)
      [ONNX] Support aten::logit (#102377)
      [ONNX] Add FX exporter MaxPool tests (#102773)
      [ONNX] Support aten::atleast_1d and aten::atleast_2d and aten::atleast_3d (#103061)
      [ONNX] Support aten::hstack and aten::vstack (#102872)
      [ONNX] FX Dispatcher Test (#103971)
      [ONNX] Separate fx _type_utils from torchscript exporter (#103942)
      [ONNX] Add op_level_debugging rule on validate_op_between_ort_torch (#104268)
      [ONNX] Create stand alone diagnostic rule on nearest match finding in dispatcher (#104267)
      [ONNX] Fix third party custom operator support in torchscript exporter (#104785)
      [ONNX] Refactor FX Registry and Support Custom Operator in FX Exporter (#103943)
      [ONNX] Apply param_manipulation.py from onnx-script to op validation and dispatcher (#104679)
      [ONNX] Enable attribute type checking in onnx dispatcher (#105104)
      [ONNX] Refactor AvgPool to support dynamic shapes (#105683)
      [ONNX] Support torch.device in FX exporter (#105757)
      [ONNX] Register list/tuple/dict to format_argumment and refactor fx.Node format_argument in diagnostics (#105263)
      [ONNX] Support ONNXFakeContext with op_level_debug (#105874)
      [ONNX] Fix the warnings of `aten overload fallback to default` in onnx dispatcher (#105972)
      [ONNX] Add comment on test_view_dynamic_zero_dim (#105950)
      [ONNX] Clean up outdated skip ort < 1.15 decorator in tests (#105951)
      [ONNX] Support complex in FX exporter (#100554)
      [ONNX] Expose OnnxRegistry publicly (#106140)
      [ONNX] Refactor perfect/nearest match criteria to allow optional inputs and disallow mismatch attributes (#106478)
      [ONNX] Update xfail reasons in fx runtime tests (#107257)
      [ONNX] Exclude FXSymbolicTracer from _assert_fake_tensor_mode (#107712)
      [ONNX] Add huggingface models into CI tests (#107247)
      [ONNX] Support constant tensors in FakeMode exporting (#107836)

Amadeusz Skrzypczak (3):
      Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)
      Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)
      Add missing hpu check to is_any_autocast_enabled (#106539)

Amr Elshennawy (2):
      Initial commit of collective_utils (#101037)
      Revert D46920584: Multisect successfully blamed D46920584 for test or build failures (#104269) (#104302)

Andres Lugo (1):
      [ROCm] Enable hipsolver unit tests for batched linalg drivers (#106620)

Andres Lugo-Reyes (5):
      Enable hipSOLVER in ROCm builds (#97370)
      Use hipsolver for default svd case on ROCm (#103540)
      [ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)
      Use hipsolver for default svd case on ROCm (#103540)
      [ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)

Andrew Gallagher (4):
      [caffe2/tools/autograd] Fix non-determinism in code gen (#101287)
      [caffe2/torchgen] Fix codegen non-determinism (#101286)
      [caffe2/tools/autograd] Fix non-determinism in code gen (#101425)
      [PyTorch][Dispatcher] Fix destruction order fiasco crash (#104393)

Andrew Gu (89):
      [BE] Simplify `Source.is_nn_module`; add some types (#95292)
      [FSDP][Docs] Re-add why reg. post-bwd hook on 1st forward (#95326)
      [FSDP] Save `_fsdp_states` on root (#95343)
      [FSDP] Save `_all_handles`; `_all_fsdp_states` to root (#95465)
      [MTA] Skip size-0 tensors in `multi_tensor_apply` (#94655)
      [BE][DDPOptimizer] De-dup `p` and `param` (#95654)
      [FSDP][Docs] Per-device NCCL stream is per PG (#95705)
      [Easy] Fix typo "steams" -> "streams" (#95706)
      [FSDP] Speed up first iter order check (#96146)
      [FSDP] Speed up first iter order check (part 2) (#96220)
      [FSDP] Add unsafe setattr gated by env var (#96326)
      [Autograd] `expand_as` instead of `clone` to get `AccumulateGrad` (#96356)
      [FSDP] Relax `sharded_grad` assert to allow IDLE (#96584)
      [FSDP] Reduce CPU overhead (#96958)
      [FSDP][1/N] Rename "flattened parameter" to "flat parameter" (#97661)
      [FSDP][2/N] Rename "flattened parameter" -> "flat parameter" (pt. 2) (#97662)
      [FSDP][3/N] Minor fixes (rename, assert message) (#97663)
      [FSDP][4/N] Document `use_orig_params: bool` (#97664)
      [FSDP][5/N] Lift `FSDPParamInfo` to use `FlatParamHandle` (#97665)
      [FSDP][6/N] Rename param/module name helpers for clarity (#97666)
      [FSDP][7/N] Add alignment padding for `use_orig_params=True` (#97667)
      [FSDP][8/N] Simplify addr padding internals (#97796)
      [FSDP][Docs] Tidy up FSDP ctor docs (#97979)
      [FSDP][Easy] Minor cleanups to `_runtime_utils.py` (#97980)
      [FSDP] Do not `_unshard` if already prefetched (#97981)
      [FSDP] Allow non-uniform `requires_grad` for `use_orig_params=True` (#98221)
      [FSDP] Use correct handle training state when prefetching (#98249)
      [FSDP] Skip `_use_sharded_views()` for `SHARD_GRAD_OP` (#98250)
      [FSDP][Easy] Remove unused `requires_grad_mask` (#98299)
      [FSDP] Add skip writeback check gated by env var (#98300)
      [FSDP] Only move current FSDP's states to GPU during init (#98319)
      [FSDP][Docs] Add warning about forward saving param refs (#98320)
      [SyncBatchNorm] Support running with low precision parameters (#98332)
      [Dynamo] De-dup graph inputs (#98775)
      [Easy] Reuse `source` variable in `wrap_tensor` (#98845)
      [AOTAutograd] Fix is-duplicate check in de-dup guard logic (#98932)
      [FSDP] Auto-pad for no `pad()` in post-bwd hook (`use_orig_params=True`) (#99054)
      [FSDP] Set `NCCL_DESYNC_DEBUG=0` for FSDP unit tests (#99916)
      Add frame summary to for/while loop backedge log message (#100045)
      [FSDP] Subtest sharding strategy in test_fsdp_grad_acc.py (#100178)
      [FSDP] Remove unneeded disable of tf32 (#100179)
      [Dynamo] Fix staticmethods for FSDP (#100117)
      [FSDP] Fix `use_orig_params=True`, CPU offload, `no_sync()` (#100180)
      [Easy][FSDP] Clarify `_use_unsharded_grad_views` comment (#100359)
      [FSDP] Do not `sys.exit(0)` explicitly at end of unit test (#100645)
      [FSDP] Reshard frozen params in backward (#101982)
      [FSDP][Easy] Remove redundant var def in test (#103270)
      [FSDP] Fix `device_id` when buffer-only module (#103504)
      Fix composable `checkpoint(use_reentrant=True)` with multi args (#103590)
      Silence `has_cuda` deprecation in optim (#103610)
      De-register forward hooks upon exiting flop counter context (#103744)
      [Easy][FSDP] Fix "column" -> "row" in PG example (#103975)
      [FSDP] Support unfreezing params for reshard-only hook (#104186)
      [FSDP] Fix `ignored_states` doc (#104253)
      [FSDP] Validate `ignored_modules`, `ignored_states` (#104273)
      [Easy][FSDP] Remove misleading asserts (#104274)
      [FSDP] Rework meta device init (#104189)
      [FSDP] Annotate modules for `fully_shard` (#104363)
      [FSDP][1/N] Move wrapper `ModuleWrapPolicy` to new path (#104346)
      [FSDP][2/N][Easy] Prepare `_auto_wrap` for `fully_shard` (#104407)
      [FSDP][3/N] Unify `fully_shard` auto wrap (#104408)
      [FSDP][4/N] Remove `_get_fully_sharded_module_to_states` (#104409)
      [FSDP][5/N] Unblock `ignored_states` + auto wrap (for now) (#104418)
      [FSDP] Default `limit_all_gathers=True` (#104900)
      [FSDP][Easy] Rename streams; add back stream sharing test (#104966)
      [FSDP] Fix skip-sharded-views + mixed precision (#105346)
      [FSDP][Easy] nit follow-ups to handle refactor (#105738)
      [FSDP][Docs] Tidy up FSDP ctor/api docs (#105847)
      [FSDP][Docs] Make model/optim state dict configs visible in docs (#105848)
      Revert "Simplify handle indexing (#105006)" (#105984)
      [FSDP] Add `record_function` for explicit prefetching (#105985)
      [FSDP][Easy] Move post-bwd hook logging to own func (#106032)
      [FSDP][Easy] Rename to `_comm_hook`, `_comm_hook_state` (#106033)
      Add @penguinwu to distributed codeowners (#105945)
      [FSDP] Improve `test_fsdp_hybrid_shard_basic_setup` (#106072)
      [FSDP] Add HSDP parity unit test (#106131)
      [FSDP] Optimize away intermediate `div_` for HSDP (#106034)
      Remove @penguinwu from distributed codeowners (#106322)
      [FSDP][Easy] Move `_FSDPState` attrs to avoid comment confusion (#106392)
      [FSDP][6/N] Check valid param freezing for `ModuleWrapPolicy` (#104427)
      [FSDP][7/N] Add warning about frozen params (#104967)
      [FSDP][Easy] Allow `ModuleWrapPolicy` to take `Iterable` (#104999)
      [FSDP][8/N] Replace `_FSDPPolicy.policy` with `_Policy._run_policy` (#104969)
      [FSDP][9/N] Introduce `CustomPolicy` (#104986)
      [FSDP][Easy] `zeros` -> `empty` for immediately freed tensors (#106857)
      [FSDP] Fix train -> EMA -> eval with mixed precision (#106858)
      [FSDP] Break up `_post_backward_hook` into smaller funcs (#106068)
      [FSDP] Enable async all-reduce for HSDP (#106080)
      [FSDP][Docs] Add note on `NCCL_CROSS_NIC=1` for HSDP (#107784)

Andrew M. James (2):
      Expand sparse.softmax zero nnz tests to cover cases of previously reported FPE. (#95646)
      Sparse Compressed mm avoid creating temp sparse (#104062)

Andrew Or (13):
      [quant][test] Fix broken PT2 import, add warnings (#102644)
      [reland][quant][test] Fix broken PT2 import, add warnings (#102819)
      [quant][pt2] Fix convert in Conv + BN QAT fusion (#102224)
      [quant][pt2] Add test for inplace add (#102867)
      [quant][pt2] Add prepare QAT test for resnet18 (#103020)
      [quant][pt2] Fix convert in Conv + BN + ReLU QAT fusion (#102993)
      [quant][pt2] Fix no conv bias in convert QAT (#103298)
      [quant][pt2] Handle literal conv args in convert QAT (#103731)
      [quant][pt2] Fix QAT convert for resnet18 (#103759)
      [quant][pt2] Update special qspecs after QAT rewrite (#103970)
      [quant][pt2] Add prepare QAT test for mobilenetv2 (#104068)
      [quant][pt2] Fix QAT convert for mobilenetv2 (#104110)
      make python decomp for native_batch_norm CompositeImplicitAutograd, remove native_batch_norm from core aten opset (#107791)

Andrey Talman (14):
      Update release related information (#101819)
      [Release] Add FAQ explaining release terms (#102618)
      Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
      [CI] Release only changes for 2.1 release (#108053)
      [CI] Release only chnages use anaconda token for test env (#108064)
      [MPS] Fix `.item()` for multi-dim scalar (#107913) (#108410)
      Add check for out of range pointer. (#107510) (#108649)
      Release only change, test against test channel (#108688)
      Prerequisite of ATen/native/utils header for C++ extension (#109013) (#109106)
      Fix CUDA-12 wheel loading on AmazonLinux  (#109291)
      [release-2.1] Make numpy dependency optional for torch.compile (#109608)
      Remove torchtext from Build Official Docker images (#109799) (#109803)
      [release only] Docker build - Setup release specific variables (#109809)
      [CI] Add `torch.compile` works without numpy test (#109624) (#109818)

Andrii Grynenko (2):
      [data_loader] Enable overriding signal handler in DataLoader.cpp (#101816)
      [data_loader] Extra signal handlers in DataLoader.cpp should be added on top rather than replacing defaults (#103164)

Andy Rock (5):
      fix _slice_meta's shape calculation (#98326)
      consider `CALL_FINALLY` non-jumping in `stacksize_analysis` (#103621)
      Support bit shifting `SymInt`s (#104318)
      fix `hash_storage`'s padding calculation (#105036)
      fix `upsample_nearest` decompositions for `uint8` tensors (#106675)

Angela Yi (54):
      [dynamo] Fix list contains check (#95092)
      Deepcopy output node metadata (#95426)
      Remove fake inputs from control flow (#95988)
      Remove hacky python dispatcher fallthrough (#96635)
      [fx] Replace literals with placeholder helper (#97683)
      [fx] Subgraph rewriter matching on attributes (#98604)
      [fx] Minor bug fix for SubgraphMatcher when ignoring literals (#98458)
      [export] Constraints API (#98433)
      [export] Constraints API (#98433)
      [dynamo] FakeTensor comparison with "is" instead of "==" (#99134)
      [fx] Variatic arg matching (#99431)
      [exir][delegate] torch.ops.call_delegate (#92562)
      [fx] Remove replace_literals_with_placeholders (#99728)
      [export] Move verifier over to export from torch/fx (#100019)
      [export] ExportPassBase + view_copy pass (#100000)
      [export] Port over const prop pass (#100102)
      Decompose arange.default to arange.start_step (#99739)
      [export] Migrate internal verifier to subclass export/verifier
      Partition modules (#98628)
      [docs] Docs for writing ATen IR passes + FX Pattern matching (#100577)
      [export] Pickle result of export (#100423)
      [export] Pickle of ExportGraphModule (#100620)
      [export] Fix cond for pass_base (#100836)
      [export] Pickle of ExportGraphModule (#100924)
      [fx] Better replacements finder in subgraph rewriter (#100556)
      [fx] Better replacements finder in subgraph rewriter (#100556)
      [dynamo] Change dimension constraint summary to log.info (#101584)
      Add aten.searchsorted.Tensor meta kernel (#101637)
      [export] Error when constraining on static values (#101655)
      [export] ExportedProgram (#102259)
      [export] Rename graph_module.py to exported_program.py (#102260)
      [export] Cleanup constraints (#102666)
      Serialize pytree to string v2 (#102708)
      [export] Change equality constraints to list of tuples (#102998)
      [export] Initial serialization v2 (#102707)
      [export] Initial deserialization v2 (#102716)
      [dynamo] Fix Autograd Function Classmethod bug (#103175)
      [export] Serialize symbolic values (#103273)
      [export] Serialize metadata (#103274)
      [export] Make pass base composable (#103701)
      [exir] Initial serialization (#103763)
      [export] Serialize optional tensors (#104723)
      [export] Fix deserialization of symint (#104722)
      [export] Fix serialize nn_module_stack (#104721)
      [export] Make serializer more composable (#104816)
      [export] Allow optional call-spec (#105041)
      [export] Allow optional call-spec (#105179)
      [export] Remove eliminate_dead_code (#105875)
      [exir] Update exir.pass_base to use export.pass_base (#106647)
      [export] Remove setter for graph_module (#106651)
      [export][reland] ExportedProgram.transform updates graph_signature automatically (#107792)
      [export] torch.export landing page (#108783) (#108962)
      [export] Fix export arg type declaration (#109060) (#109064)
      [fx][split] Copy node metadata for placeholders (#107981) (#109297)

Animesh Jain (85):
      [minifier] cuda.synchronize to better detect IMA (#97962)
      [dynamo][graph break fix] inplace add for empty tuple (#97923)
      [dynamo] Fix bug with torch._dynamo.skip (#98862)
      [dynamo] Raise exception on incorrect usage of disallow_in_graph (#98892)
      [dynamo] Remove _dynamo.skip and fold it in _dynamo.disable (#98899)
      Functionalization of torch.rand/rand_like ops (#97377)
      Python binding to set/get CUDA rng state offset (#98965)
      [cuda rng] Making offset calculation independent of device properties (#98988)
      Reland of "Python binding to set/get CUDA rng state offset" (#99565)
      [dynamo] disallow_in_graph bugfix (#99600)
      [easy] iterate dict with sorted keys for accuracy checking (#99793)
      [inductor][non determinism] Disable autotuning when determinisitic mode is ON (#99851)
      [philox_rand] Dynamic shape support (#99290)
      [inductor] Lowering of rngprims philox_rand (#99289)
      [dynamo][hf_bigbird] Actually graph break on tensor.unsqueeze_/resize_ (#99986)
      [minifier][after dynamo] clone inputs while retaining gradness (#100066)
      [philox_rand] Add decomps (#100206)
      [dynamo] Compile torchvision augmentations (#100292)
      [decomp] Bad accuracy for elu_backward (#100284)
      [dynamo] Graph break on a list referencing self (#100296)
      [dynamo][moco] Disallow_in_graph distributed APIs (#100071)
      [dashboard] higher tolerance for AlbertForQuestionAnswering (#100277)
      [dynamo] Hide guard_fail_hook behind a flag to improve cache lookup time (+10% DebertaV2) (#100590)
      summarize graph breaks (#100696)
      [dynamo] Activation checkpointing as higher order op (#101028)
      [dynamo][moco] Save global torch state to restore on graph break (#101201)
      adding moco to CI (#101098)
      [dynamo] Activation checkpoint higher order ops - Reland 101028 (#101790)
      [dynamo] Bugfix for unspecialized nn module variable (#101859)
      [dynamo] Minor refactor to use is_allowed to decide inlining of NNModule methods (#101910)
      [dynamo][higher order op] Support nn.Module calls (#102022)
      [partitioner] fix for rng ops (#102123)
      [aot_autograd][functional_rng] Change calling convention (#102344)
      [dynamo] Some torchrec_dlrm related fixes (#101953)
      [minifier] add missing import (#102521)
      [fx] Fix repr when arg is an OpOverload (#102547)
      [benchmark] Flag to switch on activation checkpointing for HF models (#102557)
      [dynamo][higher order op] Bugfixes to pass graph.lint (#102448)
      [inductor][pattern matcher] Retain meta tags (#102462)
      [benchmarks] Use train mode for accuracy checks for HF models (#102578)
      [benchmarks] Torchbench llama is not suitable for training (#103094)
      Update torchbench pin - torchrec_dlrm moved to canary (#103383)
      [activation checkpointing] Higher order functional rng op wrappers (#102934)
      [activation checkpointing] Tagging based min cut partitioner (#103357)
      [activation checkpoint][dynamo] Wrap AC into Tag based higher order op (#102935)
      [benchmark][compile] Limit number of bounding boxes to 5 (#103413)
      [benchmark] hf_T5_base - torchbench original batchsize too large (#103442)
      [inductor] Fix tags for inductor random ops (#103648)
      [min-cut partitioner] Disable a heuristic if graph has recomputable ops (#103635)
      [debugging] aot_eager backend to use the min-cut partitioner (#103555)
      [inductor] Limit window for horizontal fusion (#104024)
      [dynamo] FSDP + AC + torch.compile (#103953)
      [dynamo][higher order op] Relaxing too restrictive check for output to be a list/tuple of tensors (#104221)
      [dynamo][ac] Minor refactor for better code organization and a bugfix (#104276)
      [dynamo] Lazy disable_dynamo API out-of-dynamo  (#104317)
      [export] Dont run export guard hook when there is no graph (#104383)
      [dynamo][ac] Remove disable monkeypatching of utils.checkpoint (#104397)
      [dynamo] Organize higherorderops variable trackers (#104565)
      [dynamo] Reland #104317 - Lazy disable_dynamo API out-of-dynamo (#104664)
      [dynamo][ac] Reland #104397 - Remove disable monkeypatching of utils.checkpoint (#104665)
      [dynamo][ddp][ac] Fallback to single bucket when higher order op (#104639)
      [dynamo] Dataclass variables with default field (#104840)
      [dynamo] Maintainable code - Move decorators in a separate file (#105070)
      [dynamo] Maintainable code - Move export impl to a different file (#105071)
      [dynamo] Reland Move decorators into decorators.py (#105273)
      [dynamo] Bugfix for enums (#105306)
      [dynamo] Support defaults for namedtuples (#105341)
      [dynamo][rewrite_asserts] Insert assertion msg in bytecode only when needed (#105549)
      [dynamo][higher order ops] Bugfix for kwargs support (#105699)
      [partitioners][ac][dynamic] Fix output signature of fwd with symints (#105771)
      [dynamo][constant] Kwargs already supported for str methods (#105785)
      [logs] Share same formatter between trace_source and other Dynamo loggers (#106493)
      [dynamo] use cache size to detect recompilation (#106878)
      [dynamo] Readability - Rename name to get_frame_name (#106880)
      [dynamo][fallback] Fallback to eager when backend fails with fake tensor exceptions (#107179)
      [dynamo][eval_frame] Unify cache entry and frame_state on the same co_extra index (#106917)
      [dynamo][eval_frame] Set destroy_extra_state deleter as part of co_extra (#107117)
      [dynamo][eval frame] Make CacheEntry a PyObject (#107405)
      [dynamo] Continue on fbgemm import fail (#107622)
      [dynamo] Store originating source in the Guard object (#107634)
      [dynamo][guards] Use dict for storing weakrefs (#107645)
      [dynamo] bugfix - make module setattr more restrictive (#107828)
      [Dynamo] cache_size policy (#107496)
      [inductor][ac] preserve recompute tags through pattern matching (#107742)
      [activation checkpointing] Add default autocast keys to functional rng wrappers (#107934)

Annwesh Barik (1):
      [efficiency_camp] Vector Realloc Optimize caffe2::BinaryElementwiseWithArgsOp::DoRunWithType (#100631)

Anthony Alayo (1):
      Prefixing DeviceType with c10 namespace to avoid name collisions (#104364)

Anton Bushuiev (1):
      Fix device handling in  `nn.utils.rnn.unpad_sequence` (#98042)

Antoni Viros i Martin (3):
      Add a unit test for negative torch.arange() incorrect numerical behavior with dynamic shapes (#97926)
      Refactory bits for the codegen cache (#103452)
      Add wait_tensor so print always has a correct result for AsyncCollectiveTensor (#107808)

Anupam Bhatnagar (1):
      Adding allocated and reserved memory values to memory timline view. (#107056)

Arthur (1):
      Correct LBFGS tolerance_grad doc string (#99792)

Ashok Kumar Kannan (2):
      Fix missing mandatory device_type argument in autocast docstring (#97223)
      Enable mypy check for torch/_inductor/codegen/common.py (#106199)

Ashwin Hari (1):
      Allow ORT backend for DTensor (#101914)

Atharva Kavitkar (1):
      Corrected grammar in contribution guide (#93014)

Austin (1):
      Expose intended public constraints. Fixes #106386 (#106458)

Avi Verma (1):
      Do not materialize entire randperm in RandomSampler (#103339)

Avik Chaudhuri (23):
      debug shape guards (#95848)
      record caller frame instead of function frame (#96882)
      dynamic range constraint API (#98779)
      [cond] error on closed over variables (#99367)
      suggest constraints to specify for export based on generated shape guards (#98463)
      relax restriction on cond branches calling closed functions (#100013)
      dynamic equality constraint (#99993)
      misc. fixes to constraints warnings and errors (#100745)
      fix precision error in constraint solver (#101307)
      work around precision error in constraint solver (#101607)
      remove default lower bound in dynamic_dim suggestions (#101636)
      [easy] refactor signature flattening transform (#101886)
      group constraints by arg (#101815)
      group constraints by arg (#102096)
      do not raise when constraint locals are not in signature (#102198)
      equality assertions (#102256)
      fix soundness bug with unsupported constraints (#102897)
      error on bad input to equality constraint (#107311)
      do not raise constraint violation on trivial guards (#107470)
      fix symint meta val (#107491)
      constraint violation error messages (#107790)
      remove redundant dynamic_dim (#107815)
      improved error message for IO mismatch (#107907)

BJ Hargrave (6):
      Fix CPU bitwise shifts for out-of-limit shift values (#96659)
      Use unordered NEQ comparison for vec512 operator!= implementations (#97466)
      Add itemsize and nbytes properties to Tensor (#98322)
      Fix test_mps for macos 13.3 (#98739)
      Fix CPU vectorized eq and ne operations for complex types (#97374)
      Enable bitwise shift operations tests (#97150)

Bartosz Szmelczynski (1):
      Extend assert statement to include ListVariable (#100841)

Barys Skarabahaty (1):
      [caffe2] Create deterministic zip archives (#102903)

Bas Aarts (1):
      [ONNX] Export Relu6 without using Relu (#99022)

Bearnardd (1):
      Add dtype check baddbmm (#102659)

Ben Lawrence (1):
      Deallocate workspace on thread exit (#102276)

Benjamin Ghaemmaghami (1):
      Fix split module interaction with dead code (#104554)

Benson Ma (1):
      [T153220354] Fix header inclusions in c10 (#1541) (#101846)

Bert Maher (21):
      [pt2][inductor] Ignore trace.upload_tar when pickling config (#96519)
      [inductor] Allow `tensors` kwarg in sink_cat_after_pointwise (#97019)
      [inductor] Move fx-fusion tests to a separate file (#97028)
      [inductor] Add scaled_dot_product_attention to fallback kernels (#93339)
      [inductor] Fix shape padding (#99917)
      [dynamo] Make bytecode logging off-by-default (#100093)
      [dynamo] Add ddp_graphs artifact (#100021)
      Rename percentiles to quantiles in triton.testing.do_bench (#100477)
      [inductor] TARGETS for all inductor tests (#100744)
      [inductor] Do not try to shape-pad symbolic-sized tensors (#100738)
      Reduce fake_tensor create_mode logging (#101074)
      [inductor] Test for shape padding (#100493)
      [dynamo] Skip tests that are broken in fbcode (#101217)
      [inductor] Update qualname and module for wrapped testcases (#101975)
      [functorch] Get test_functionalize to run on FB infra (#102695)
      [functorch] Remove test_functionalize (#103748)
      [pt2][test] Loosen stack trace check in test (#104902)
      Skip test_indirect_device_assert in fbcode (#105065)
      [inductor] Enable vectorization in fbcode (#105756)
      [inductor] Make OpenMP work in fbcode (#105777)
      [inductor] Make AOT CPU Inductor work in fbcode (#106225)

Bin Bao (114):
      [Inductor][CI] Remove hf_GPT2_large from CPU inference test (#95473)
      [CI] Specify more torch.backends.cudnn options to reduce non-determinism (#95478)
      [CI] Do not compare two eager run results against fp64 result (#95616)
      [Inductor] Support sparse_grad for torch.gather (#95490)
      [CI] Force clear triton cache between running each test (#95729)
      [CI] Change the way tests are triggered with dynamo and inductor (#94539)
      [CI] Reduce the frequency of running inductor-perf-test-nightly (#95778)
      [CI] Increate the timeout limit for benchmark test (#95787)
      [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
      [CI] Further tighten the checking of two eager runs (#95902)
      [CI] Skip xcit_large_24_p8_224 in TIMM (#96048)
      [CI] Make inductor-perf-test-nightly produce data for dashboard (#95685)
      [CI] Use CUDA 11.8 to run inductor benchmark tests (#96059)
      [CI] Avoid calling torch.use_deterministic_algorithms for some models (#96245)
      [CI] Add a workflow for quick perf comparison (#96166)
      [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
      [CI] Use different subdirectories for amp and float32 nightly perf run (#96470)
      [CI] Change compile_threads to 1 when running benchmark accuracy test on CI (#96195)
      [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
      [CI] switch torchbench to a pinned version (#96553)
      [inductor] Consolidate codegen functions in sizevars.py into wrapper.py (#96654)
      [reland][CI] switch torchbench to a pinned version (#96782)
      [CI] Revert https://github.com/pytorch/pytorch/pull/96195 (#96897)
      [inductor] Refactor memory management code in wrapper codegen (#96768)
      [CI] Change tests used by the new dashboard (#96986)
      [CI] Fix perf_nightly output file naming error (#97263)
      [inductor] Make the original ATen info dumped in alphabetical order (#97261)
      [CI] Turn on debug logging for dla102 and gernet_l (#97307)
      [CI] Add a missing dtype flag in nightly perf run (#97357)
      [inductor] Fix a multi-gpu context error (#97398)
      [CI] Experiment with a newer CUDA driver (#96904)
      [CI] Add missing --cold-start-latency for the dashboard run (#97547)
      [CI] Reduce perf nightly run frequency and bump up its timeout limit (#97682)
      [CI] Run benchmark test with dynamo_eager in periodic (#97543)
      [inductor] Refactor cpp_wrapper to be an attribute of GraphLowering (#97709)
      [CI] Bump up torchbench version to fix dynamo graph breaks in transformers (#98003)
      Add a --inference flag to dynamo benchmark script (#98173)
      [CI] Add inference run for the performance dashboard (#98174)
      [inductor] Add an AOT mode for the Triton backend (#98214)
      Skip gat, gcn and sage for TorchBench CUDA test (#98244)
      [CI] Mark mobilenet_v3_large as nondeterministic (#98314)
      [inductor] Fix a perf regression caused by https://github.com/pytorch/pytorch/pull/98214 (#98343)
      [CI] Mark sebotnet33ts_256 as nondeterministic (#98356)
      [CI] Update update_expected.py to make it generate a combined csv file (#98407)
      [inductor] Combine CppWrapperCodeGen and CppAotWrapperCodeGen (#98088)
      [inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98264)
      [reland][inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98534)
      [CI] Mark vision_maskrcnn as NONDETERMINISTIC (#98570)
      [inductor] Consolidata kernel and cpp_kernel for wrapper codegen (#98741)
      [inductor] Consolidate constant_args and cpp_constant_args (#98742)
      [inductor] Support IndexPutFallback in cpp_wrapper (#98972)
      [CI] Use expected accuracy csv files to check benchmark test status (#98839)
      [CI] Remove inductor skip list for timm_models (#98840)
      [CI] Collect inductor max-autotune performance every Sunday (#99387)
      [CI] Remove inductor skip list for Huggingface (#99375)
      [CI] Change max-autotune's output file name (#99754)
      [CI] Pause perf data collection for max-autotune (#99829)
      [inductor] Fix AOTInductor (#99203)
      [inductor] Add cpp_wrapper support for FallbackKernel (#99887)
      [inductor] Support mixed device in cpp wrapper (#99950)
      [CI] Replace timm_efficientdet with timm_vision_transformer in smoketest (#100106)
      [CI] Start to collect inference perf with cpp_wrapper ON (#100187)
      [inductor] Use decomposition for smooth_l1_loss_backward (#100242)
      Add aten.smooth_l1_loss_backward to core_aten_decompositions (#100267)
      [inductor] Remove redundant model copy when running with cpp_wrapper (#100275)
      [CI] Change the dashboard run to once a day (#100499)
      [inductor] Change the default value of layout (#100254)
      [reland][CI] Start to collect inference perf with cpp_wrapper ON (#100187) (#100502)
      Add a rst doc for the performance dashboard (#100592)
      [inductor] Support FallbackKernel in cpp wrapper codegen (#100553)
      [inductor] Move cpp wrapper trigger logic to inner_compile (#100611)
      [CI] Delete skips from https://github.com/pytorch/pytorch/issues/93847 (#96049)
      [CI] Run test_multi_gpu in test_inductor_distributed (#100135)
      [CI] Add workflow_dispatch.inputs to control dashboard runs (#101279)
      [CI] Change dashboard workflow inputs type to boolean (#101308)
      [CI] Introduce dashboard-tag to pass dashboard run configs (#101320)
      [CI] Fix a dashboard command line string formatting bug (#101325)
      [inductor] Move cpp wrapper dynamic shapes test to test_cpp_wrapper (#102017)
      [inductor] Refactor generate_kernel_call (#102018)
      [inductor] Add more dynamic shapes support for CudaWrapperCodeGen (#102019)
      [inductor] Move two cpu tests to test_cpu_repro.py (#101887)
      [dynamo] Add astunparse dependency (#102120)
      Fix an issue where checking sameness throw an exception (#102279)
      [inductor] Support precomputed_sizes in CppWrapperCodeGen (#102083)
      [inductor] Support multiple symbolic numel expr in CudaWrapperCodeGen (#102093)
      [inductor] Revert a CI remedy for Triton compilation error (#102541)
      [inductor] Fix a cpp wrapper codegen issue for _scaled_dot_product_efficient_attention (#102624)
      [inductor] Fix a cpp_wrapper issue when fx_passes modified fx graph (#102851)
      [inductor] Support select_algorithm with cpp_wrapper (#103003)
      [inductor] Turn off autotune_cublasLt for cpp_wrapper (#103004)
      [inductor] fix a numel expr codegen issue (#103005)
      [dashboard] Bring back inference perf measurement as nightly (#103151)
      [dynamo] Support OrderedDict constructor with kwargs (#103192)
      [dashboard] Allocate 4 shards for torchbench (#103280)
      [CI] Update inference accuracy test (#103361)
      [inductor] Make clone_graph copy node name as well (#103409)
      [inductor] Fix an expression printer issue during generate_return (#103557)
      [inductor] Store real inputs to be used for cpp wrapper codegen (#103289)
      [reland][inductor] Make clone_graph copy node name as well (#103688)
      [CI] Switch inference accuracy and performance tests to bfloat16 (#103535)
      [CI] Fix a bug that bfloat16 is also used for dashboard training run (#103816)
      Change how AOTInductor's fx input is produced (#104123)
      [CI] Add DALLE2_pytorch to FORCE_AMP_FOR_FP16_BF16_MODELS (#104283)
      [inductor] Relax custom op schema checking for cpp_wrapper (#104349)
      [inductor] Register an op for mm_plus_mm (#104835)
      [inductor] fix a custom_op test problem (#104972)
      [reland][inductor] Register an op for mm_plus_mm (#105153)
      [reland][inductor] fix a custom_op test problem (#105234)
      Add aot_inductor as a test backend for benchmarking (#105221)
      [inductor] Allow specify a subdir to store .so and .cubin files (#105466)
      [inductor] Fix an AOTInductor missing output issue (#105496)
      [inductor] Fix AOTInductor output issues (#105773)
      [dashboard] Replace cpp_wrapper with aot_inductor on the perf dashboard (#106077)
      [CI] Delete .github/ci_commit_pins/huggingface.txt (#107729)

Boris Fomitchev (1):
      Fix int() casting in torch.nn.RNN to have correctly traced JIT and ONNX graph. (#92970)

Bosheng Zhang (Daniel) (1):
      Update Documentation for TripletMarginLoss (#105115)

Bowen Bao (2):
      [ONNX] Cache AutoTokenizer in CI for test (#104233)
      [ONNX] Make unsupported node analysis result deterministic (#105231)

BowenBao (76):
      [ONNX] Enable skipped gpt2 test (#94930)
      [ONNX] Support 'dtype' argument for 'aten::norm' (#95637)
      Type annotate `dynamo.export` (#95742)
      Refactor unittest around dynamo.export wrt function signature (#95850)
      Dynamo.export to preserve names of args & kwargs (#95851)
      [ONNX] Move symbolic export to separate file (#95650)
      [ONNX] Skip doctest `torch.onnx._internal.fx` if ImportError (#95686)
      [ONNX] Merge 'initializers' into 'TorchScriptGraph' (#95676)
      [ONNX] Bump onnx submodule to release 1.13.1 from rc2 (#96325)
      [ONNX] Export logical_not (#96315)
      [ONNX][Diagnostics] Speed up 'python_call_stack' by 'traceback' (#96348)
      [ONNX] Move graph transform functions to 'passes' (#95664)
      [ONNX] Preserve stacktrace info for decomp (#95929)
      Bump black version to 23.1.0 (#96578)
      [ONNX] 'Transform' as base class for passes (#95935)
      [ONNX] Introduce 'Functionalization' for fx exporter (#98245)
      [ONNX] Enable xdoctests in CI (#98546)
      [ONNX] Safely set node name for 'replace_placeholder_name_and_target' (#98633)
      [ONNX] Remove duplicated code from previous rebase (#99072)
      [ONNX] Introduce Input/Ouptut adapter; Switch to 'DynamoExporter' (#98421)
      [ONNX] Support aten.stack for dynamo_export (#99191)
      [ONNX] Retire 'DynamoOptimizeExporter' (#99202)
      [ONNX] Run ONNX tests as part of standard run_test script (#99215)
      [ONNX] Fix missing import numpy for docs example (#99663)
      [ONNX] Cover 'undiscoverable' ops 'torch.ops.aten' (#99682)
      [ONNX] Improve diagnostics performance (#99936)
      [ONNX] Fix type annotation for 'fx_to_onnxscript' (#100050)
      [ONNX] Skip flaky dynamic test in CI (#100297)
      Update docstring for dynamo.export tracing_mode (#100205)
      [ONNX] Introduce 'diagnostics' to 'dynamo_export' api (#99668)
      [ONNX] Remove 'diagnose_step' (#99944)
      [ONNX] Non-global diagnostic context (#100219)
      [ONNX] Update 'Functionalize' pass to support pre-decomp graph; Drop 'aten_graph' arg for 'DynamoExporter' (#99667)
      [ONNX] Refactor diagnose_call message_formatter signature (#100299)
      Different csv headers by bench mode on infra error (#103134)
      Tidy __all__ under torch._refs (#103712)
      [ONNX] Bench torch.onnx.dynamo_export and torch.onnx.export under dynamo bench (#103135)
      [ONNX][TypePromo] Materialize type promotion rules (#104063)
      [ONNX][TypePromo] Explicit type promotion pass (#104064)
      [ONNX][TypePromo] aten.div (#104229)
      [ONNX][TypePromo] Simplify  API `_run_node_and_set_meta` (#104720)
      [ONNX] Remove unnecessary deepcopy on args in 'DynamoExport' (#104736)
      [ONNX] Restore readable names for parameters and buffers (#104493)
      [ONNX] Fix exported onnx initializer name (#104741)
      Training skip list should not be applied on inference bench (#104738)
      [ONNX][TypePromo] Introduce ReductionTypePromotionRule (#104491)
      [ONNX] Allow None as operator argument (#105040)
      [ONNX] Support 'aten::randint' in torchscript onnx exporter (#105089)
      [ONNX] Fix UnsupportedFxNodesAnalysis after onnx dispatcher changes (#105156)
      [ONNX] Fix aten::cat export when arg include parameters (#105373)
      [ONNX] Suppress ORT warnings in unit tests (#105624)
      Enable intellisense for _dynamo, _inductor and onnx by importing under type_checking guard (#105361)
      [ONNX] Passes to reuse existing fake mode if possible (#105764)
      [ONNX] Export module as function (#105618)
      [ONNX] Diagnostic option 'warnings_as_errors' (#105886)
      [ONNX] Add primitives formatting for diagnostics (#105889)
      [ONNX] Detailed diagnostics for 'perfect_match_inputs' (#105892)
      [ONNX] Limit number of elements to display for list/tuple/dict in diagnostics (#106048)
      [ONNX] Log instead of crash when 'tabulate' is not installed (#106228)
      [ONNX] Do not run 'deduplicate_initializers' when 'keep_initializers_as_inputs' is True (#96320)
      [ONNX] Support type promoting sym number representing scalar output (#106178)
      Register ONNX exporter under PT2 logging (#105989)
      [ONNX] Remove legacy diagnostic printing (#106498)
      [ONNX] Turn on batch norm related unittest (#105769)
      [ONNX] Migrate to PT2 logging (#106592)
      [ONNX] Public diagnostic options for 'dynamo_export' (#106741)
      [ONNX] Fix diagnostic log and add unittest (#107158)
      [ONNX] Set 'Generic[Diagnostic]' as base class for 'DiagnosticContext' (#107165)
      [ONNX] Relax not exist assertion for 'register_pytree_node' (#107245)
      [ONNX] Add unittest for exporting embedding_bag (#105862)
      [ONNX] Re-purpose 'name' field of GraphProto (#107408)
      [ONNX] Enclose package info for modules exported as local functions (#107409)
      [ONNX] Clean up diagnostic rules (#107653)
      [ONNX] More debug logging from fx to onnx (#107654)
      [ONNX] Enable 'ExportOutput.save' for models larger than 2GB (#107904)
      [ONNX] Remove API reference for TorchScript export diagnostics (#107979)

Brian (2):
      Fix RenamePlanner documentation (#107535)
      Fix FP16Planner documentation (#107620)

Brian Coutinho (4):
      [pytorch][2/3] Pytorch profiler permits CPU events with CUPTI Range profiler mode (#97048)
      [profiler] add option for kineto synchronization events in the trace (#105187)
      [pytorch profiler] fix profiler test for windows (#106156)
      [pytorch] Disable CUDA sync events by default (#106723)

Brian Hirsh (52):
      avoid extra copies in batchnorm inference by introducing a new op, _native_batch_norm_legit_no_training (#94946)
      hotfix for memory leak in aot autograd induced by saving tensors for backward (#95101)
      fix spurious aot autograd warning (#95521)
      fix primtorch handling for sub.scalar with alpha and float64 arg (#95421)
      fix embedding_backward_dense decomp with broadcasting (#95499)
      better error message when functionalization cant handle op (#95392)
      allow privateuse1 key to be used with legacy constructor (#95748)
      aot autograd: dont allow symint outputs to get tangents in the bw graph (#96219)
      aot autograd: handle detach() and no_grad() mutations on input (#95980)
      [aot autograd] merge all outputs of functionalization analysis into single metadata (#95991)
      [aot_autograd] only perform functionalization analysis pass once (#95992)
      aot autograd refactor: make all synthetic base logic layered in a single location (#96235)
      aot_autograd: dont requires_grad on tangents (#96339)
      aot autograd: consolidate metadata (#96340)
      [aot autograd] refactor to make functionalization self-contained (#96341)
      [aot autograd] avoid cloning some inputs unnecessarily when they dont require grad (#96342)
      [draft for discussion] add per-dispatch key modes (#97052)
      [aot autograd] refactor to make functionalization self-contained (#96341)
      make_fx, make pre_autograd a kwarg (#97559)
      dont bake in defaults when tracing *_like factories (#97564)
      aot_autograd: avoid using intermediate_base logic unnecessarily (#97786)
      AOTAutograd: fix 'Trying to backward through the graph a second time' error (#98960)
      aot_autograd: more logging on metadata asserts (#99177)
      fix per-dispatchkey-mode caching bug (#98030)
      change torch._dynamo.export(aten_graph=...) to allow pre_autograd tracing (#98031)
      functionalization: error during mutations on mem overlap (#99919)
      move SchemaCheckMode to torch/_subclasses (#99743)
      Get SchemaCheckMode to error on ops that return inputs directly. Expose as a dynamo backend, eager_debug (#99744)
      [aot_autograd] proper handling for when outputs are aliased but have identical size/stride/offset metadata (#100430)
      [aot autograd] fix de-dupping metadata computation bug (#100431)
      [inductor] fix incorrect strides in copy() decomp, fix hf_LongFormer + hf_BigBird errors (#100115)
      aot_autograd: factor out runtime epilogue from aot_dispatch_base (#100586)
      [AOTAutograd] add export entrypoints (#100587)
      separate out dynamo .requires_grad and .is_grad_enabled guards (#100570)
      fix inference_mode with torch.compile (#101219)
      separate out dynamo .requires_grad and .is_grad_enabled guards (#100570)
      fix inference_mode with torch.compile (#101219)
      aotautograd: fix mutation bug when input is noncontiguous (#102767)
      change pre_autograd to pre_dispatch tracing (#101818)
      [AOTAutograd] make _unsafe_view() logic happen during the runtime epilogue (#103919)
      fix inference mode / PyDispatcher / Functionalize interaction (#103275)
      Reland of https://github.com/pytorch/pytorch/pull/101818 (#103888)
      pre_dispatch tracing: support autocast and no_grad/enable_grad ctx managers, add a pre_dispatch_eager dynamo backend (#103024)
      kill inductor.config.disable_cpp_codegen in internal (#104351)
      [hotfix inductor test] disable cpp vectorization codegen in fbcode for inductor (#104560)
      [inductor] fix incorrect strides in copy() decomp, fix hf_LongFormer + hf_BigBird errors (#100115)
      faketensor: prevent deepcopy from cloning FakeTensorMode (#104476)
      AOTAutograd: correctness fix when tracing custom autograd functions that alias inputs (#102992)
      AOTAutograd: allow input mutations on inputs that are non-contiguous (#106460)
      Add some support for detecting false aliasing in AOTAutograd (#106461)
      Allow storage() to work on python tensor subclasses, but error on future data accesses (#107417)
      allow result of at::for_blob to advertise as resizeable (for tensor subclasses) (#107416)

Bruce Jiang (1):
      Support third-party devices to use the init_process_group method with… (#107113)

Bug Hunter Yan (8):
      Fix typos in torch/fx/_compatibility.py (#97618)
      Add custom backend case for storage and automatically generate storage attributes. (#98478)
      Fix a minor bug about method generation. (#99704)
      extend serialization for tensor metadata (#99808)
      Extend storage create for custom storageImpl (#100237)
      Fix device normalization of automatically generate methods for custom backends.  (#101796)
      extend serialization for tensor metadata (#99808)
      add torch_api (#108617)

CURTLab (1):
      Fixed cmake mkl lib path in caffe2 public (#105525)

CYuxian (3):
      [onnx] Convert aten::flatten with 0d input to onnx Reshape and 1d to Identity (#104089)
      [onnx] Fix output shape mismatch issue of max_pool (#106270)
      [ONNX] Return input itself for non-fp inputs and support decimals for aten::round op (#107920)

Cao Doan (2):
      Fix nullable-to-nonnull-conversion warnings (#106232)
      [nullability] Suppress -Wnullable-to-nonnull-conversion errors in caffe2 (#107418)

Cao E (1):
      Add channels_last3d support for mkldnn conv and mkldnn deconv (#95271) (#108216)

CaoE (6):
      Add scalar conversion using avx instructions for half (#102140)
      add channel last 3d support for batch_norm on CPU (#97774)
      Add backward check for test_memory_format (#106104)
      Add scalar conversion using avx instructions for half (#102140)
      Add backward check for test_memory_format (#106104)
      add channel last 3d support for maxpool3d on CPU (#97775)

Carl Lemaire (1):
      Harmonize BCELoss example to F.binary_cross_entropy (#95178)

Catherine Lee (81):
      [mergebot] Fix for pagination error (#95333)
      Remove mentions of distributed/_shard/test_replicated_tensor (#95632)
      Add super().setUp() in test_symbolic_shape_analysis (#95336)
      Run tests in USE_PYTEST_LIST through run_tests (#95659)
      Run more tests through pytest (#95844)
      Run _nvfuser/test_torchscript serially (#95951)
      Reduce pytest blocklist (#96016)
      Use GH cache for sccache on GH mac runners (#96142)
      Set ref for linux_job checkout in lint (#96317)
      --subprocess for pytest (#96210)
      Remove on_green and mandatory_only (#96400)
      Remove land checks in trymerge (#96401)
      Reduce pytest blocklist part 2 (#96397)
      Remove duplicate windows job (#96552)
      Remove pytest block list (#96698)
      [mergebot] An ignore current flag (#96756)
      [ci] Onnx test 3->2 shards (#97383)
      Update xla pin merge rule for python3.8 (#97371)
      Remove non existent files in multigpu tests (#97393)
      [easy] Update xla hash pin merge rule (#97700)
      Allow -ic when no pending jobs (#97707)
      Update vision pinned hash (#97706)
      Add arm tests to mps workflow (#97279)
      Add unstable workflow to upload test stats (#97918)
      Retry at test file level (#97506)
      Skip test_batch_norm in test_jit_fuser_te for asan (#98016)
      [ci][easy] Only print remaining logs if test step ran (#97713)
      Clean up duplicate function run_test.py (#97914)
      Print test times for pytest in verbose mode (#98028)
      Retry at test file level (#97506)
      Add slow workflow to upload test stats workflow (#98447)
      [experiment] More procs in CI (#98098)
      Set up automated hash pinning for triton (#97568)
      Remove filter step (#98969)
      [easy] Fix upload test stats after master -> main switch (#99924)
      stepcurrent (#98035)
      Rename master -> main in docs workflow (#100022)
      Add asan slow test shard (#99925)
      Fix triton auto update pin workflow (#100211)
      File level rerun changes (#100200)
      Add comment link to revert message (#100276)
      Update to reruns + timeouts in run_test.py (#100412)
      Revert "PyTorch -> C++17 (#98209)" (#100497)
      Add unstable-periodic to upload test stats (#100751)
      Fix get_reordered_tests in run_test.py (#100752)
      Add inductor as a test disable group (#101448)
      Don't run libtorch tests on slow test shard (#101429)
      Run dynamo tests in parallel (#101432)
      Check for pytest extensions in run_test (#100916)
      No cpp + step current (#102001)
      Quick fix for keep-going + reruns (#102569)
      Quick fix for keep-going + reruns (#102569)
      Add print statements to debug sharding error (#102713)
      Fix rocm sharding (#102871)
      Dont run test files that are already run in test_optim (#103017)
      Reenable disabled tests by pr body (#103790)
      Exclude _nvfuser from test collection (#104003)
      Reenable disabled tests by pr body (#103790)
      Pin pytest linux dependencies in docker (#104281)
      Add libxml2 and libxslt in docker image (#104663)
      Label for mem leak check (#104643)
      Pin pillow (#104760)
      Revert "[inductor] fix a custom_op test problem (#104972)" (#105149)
      Revert "[inductor] Register an op for mm_plus_mm (#104835)" (#105150)
      Upload all test stats only if the workflow is from pytorch/pytorch main (#105087)
      Close non existent disable issues (#105096)
      Fix docs not showing error, remove circleci docs scripts (#105678)
      Add pull request target to bc lint (#106065)
      Bot message changes for -f and rebase (#106150)
      Reordering tests experiment (#106347)
      Close non existent disable issues complete rollout (#106923)
      Mark test_lstm_packed as slow (#107048)
      Mark test_gradient_extreme_cases as slow for inductor (#107189)
      Reordering tests experiment (#106347)
      Add deployment environment for docs and upload test stats (#107318)
      Use default build env and test config for test times (#107325)
      Run check api rate limit on ephemeral runner (#107621)
      Slightly more flexible naming system for disable + slow tests (#104002)
      Always import test selection tools (#107644)
      Remove aws ossci metrics upload keys from rocm (#107613)
      Move conda uploads into environment (#107807)

CedricPicron (1):
      Use L1 loss for Smooth L1 loss with beta=0 (#97022)
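
A minimal sketch (not part of the changelog) of the behavior #97022 refers to, assuming only standard torch.nn.functional APIs: with beta=0 the quadratic region of Smooth L1 loss vanishes, so it reduces to plain L1 loss and can be dispatched to it directly.

    import torch
    import torch.nn.functional as F

    x = torch.randn(4, 3)
    y = torch.randn(4, 3)

    # With beta == 0 the 0.5 * (x - y)**2 / beta branch is never taken,
    # so smooth_l1_loss agrees with l1_loss elementwise.
    assert torch.allclose(F.smooth_l1_loss(x, y, beta=0.0), F.l1_loss(x, y))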

Chan Ger Hean (1):
      Get OutOfMemoryError to inherit from RuntimeError (#99786)
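
A minimal sketch (not part of the changelog) of what #99786 enables, assuming a CUDA-enabled build of torch: torch.cuda.OutOfMemoryError derives from RuntimeError, so broad except RuntimeError handlers in existing code still catch allocator OOMs.

    import torch

    # The subclass relationship can be checked without touching a GPU.
    assert issubclass(torch.cuda.OutOfMemoryError, RuntimeError)

    # Hypothetical oversized allocation: the specific handler fires first,
    # but a plain RuntimeError handler would also have caught it.
    try:
        torch.empty(10**13, device="cuda")
    except torch.cuda.OutOfMemoryError:
        print("caught as OutOfMemoryError")
    except RuntimeError as e:
        print(f"caught as RuntimeError: {e}")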

Chao Yang (2):
      enforce dtype (#102802)
      enforce `dtype` (reland) (#102996)

Charles David Hernandez (1):
      fixing internal test failure on non sm_80 machines (#107340)

Charlie West-Taylor (1):
      Add autocast support for IPU (#103890)

Charlie Yan (3):
      [1/n] Consolidate `replicate` and `DDP`: setup ufmt for `distributed.py` (#96597)
      [2/n] Consolidate `replicate` and `DDP`: split `forward` function (#96658)
      [3/n] Consolidate `replicate` and `DDP`: update `replicate` to reuse functions in `DDP` (#96660)

Chase (1):
      [DataLoader] Add context to NotImplementedErrors in dataset.py (#100667)

Chen Lai (1):
      add get buffer from exported program (#107809)

Chien-Chin Huang (50):
      [SPMD] Pull the minimal working distribute API and SPMD module to PyTorch (#94802)
      [SPMD] Introduce the cross-iteration graph optimization framework (#94803)
      [FSDP][optim_state_dict] Fix a memory leakage in optim_state_dict (#96263)
      [FSDP][optim_state_dict] Copy step tensor so that each parameter has its own step (#96313)
      [SPMD] Add defunctionalize_optimizer feature (#96323)
      [FSDP][optim_state_dict] Make FSDP optim_state_dict aware of DDP prefix (#96415)
      [SPMD] Make the IterGraphModule less verbose and more profiling friendly (#96969)
      [FSDP][optim_state_dict] Print out more useful error message for optim_state_dict (#96860)
      [FSDP][optim_state_dict] Consolidate the arguments and logic of optim_state_dict and optim_state_dict_to_load (#96534)
      [DTensor] Fix the default PG condition for DeviceMesh (#97384)
      [SPMD] Make compile cache the compilation result and add option to perform transformation (#97836)
      [SPMD] Allow IterGraph support a more general subgraph movement (#98360)
      [SPMD] Add the default graph module transformation that is applied after tracing and expansion (#98182)
      [SPMD] Add a dump_graphs_to_files utils to facilitate graph transformation debug (#98284)
      [SPMD] Introduce graph_optimization_pass and comm_fusion_with_cat (#98285)
      [SPMD] Introduce schedule_comm_wait (#98578)
      [SPMD] Add optimizer states and steps to the return (#98579)
      [SPMD] Introduce remove_copy_for_optimizer optimization (#98580)
      [SPMD] Expedite the allreduce call before doing comm_fusion (#98922)
      [FSDP] Include duplicate parameters and modules when calling named_parameters and named_modules (#98912)
      [SPMD] Upstream partial_lower (#99069)
      [SPMD] Remove the unused code (#99075)
      [SPMD] Move some functions to IterGraphModule.setup() (#99076)
      [SPMD] Implement split_fused_optimizer to split one fused_optimizer node to two (#98784)
      [FSDP] Ensure that customized non tensor optimizer state can be saved (#99214)
      [FSDP][optim_state_dict][Easy] Temporarily disable rank0_only=True for use_orig_params case (#99354)
      [SPMD] Upstream iter_move_grads_and_optimizers (#98785)
      [SPMD] Allow users to dynamically pass the last_iter to IterGraphModule (#99575)
      [FSDP][optim_state_dict] Support rank0_only when use_orig_params is on (#99624)
      [FSDP][optim_state_dict] Consolidate rank0_only load logic (#99647)
      [FSDP][BE] Remove unused code (#99731)
      [FSDP][Reland] Include duplicate parameters and modules when calling named_parameters and named_modules  (#99448)
      [SPMD] Add arange and zeros to default factory ops (#100037)
      [SPMD] Add embedding dense backward prop rule for postional embedding (#100038)
      [SPMD][Easy] Add time counter in graph_optimization_pass (#99969)
      [SPMD] Introduce prerequisites to graph_optimization_pass (#99970)
      [FSDP][state_dict] Restore the state_dict_config for NO_SHARD (#100855)
      [FSDP][state_dict] Make sharded_state_dict work with composable fully_shard (#100856)
      [SPMD][BE] Remove the legacy tracing code (#100858)
      [Dynamo] Support methods of NamedTuple (#103217)
      [FSDP][optim_state_dict] Avoid calling optim.state_dict() to get the initial
      [FSDP][optim_state_dict] Cleanup the unused optimizer state_dict APIs (#103781)
      [FSDP][state_dic…
@williamwen42
Member

Is this resolved?

@ezyang ezyang closed this as completed Dec 31, 2023