Releases: microsoft/mscclpp
MSCCL++ v0.7.0
What's Changed
- Move pipeline to official org by @Binyang2014 in #406
- Disable CuMemMap check for ROCm by @Binyang2014 in #411
- NVLS support for NCCL API by @Binyang2014 in #410
- Supporting multi-node executors in NCCL API by @caiomcbr in #412
- Fix synchronization in allreduce8 kernel by @dsidler in #407
- Add ncclBcast / ncclBroadcast support by @SreevatsaAnantharamu in #419
- Update README by @Binyang2014 in #414
- Fix nccl-test failure issue by @Binyang2014 in #421
- Tackle build warnings by @chhwang in #422
- trigger ci for release branches by @Binyang2014 in #426
- Fix CI trigger issue by @Binyang2014 in #428
- Fix typos in the pipeline by @chhwang in #420
- Update version number by @Binyang2014 in #433
- Enhance the nccl error message handling by @seagater in #434
- [NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl by @PedramAlizadeh in #399
- Fix azure pipeline by @Binyang2014 in #437
- Add GpuBuffer class by @chhwang in #423
- Fix CMake build messages by @chhwang in #443
- Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit by @caiomcbr in #415
- Fix Python binding of exceptions by @chhwang in #444
- Auto-update version numbers in CMakeLists.txt by @chhwang in #450
- Resolve cuMemMap error by @Binyang2014 in #451
- Manage runtime environments by @chhwang in #452
- Lazily create streams for CudaIpcConnection by @chhwang in #449
- Fix PR #449 by @chhwang in #453
- Merge mscclpp-lang to mscclpp project by @Binyang2014 in #442
- Renaming channels by @chhwang in #436
- Add multi-nodes example & update doc by @Binyang2014 in #455
- Adjusting BFS to seek circular dependencies in the msccl-tools DAG by @caiomcbr in #459
- remove unnecessary sync by @Binyang2014 in #461
- Support ReduceScatter in the NCCL interface by @caiomcbr in #460
- Updating MSCCLLang Examples by @caiomcbr in #462
- Disable channel cache by @seagater in #463
- Adjusting AllGather Collective in MSCCLLang by @caiomcbr in #466
- Adding Read Put Packet operation at Executor by @caiomcbr in #441
- NPKit Support to Read Put Packet Operation by @caiomcbr in #471
- Adjust NPKit IB Event by @caiomcbr in #472
- Fix minor typos and errors in documentation by @RyoYang in #474
- Improving Get Operation at MSCCLLang by @caiomcbr in #475
- Fix memory OOM issue by @Binyang2014 in #479
- Mark mscclpp-test as deprecated in the doc by @chhwang in #478
- Update allgather fallback algo by @Binyang2014 in #476
- Add min operation for allreduce by @Binyang2014 in #481
- NCCL API CI Test for ReduceScatter by @caiomcbr in #465
- Fix correctness issue when mscclppDisableChannelCache set to true by @Binyang2014 in #483
- nccl/rccl integration by @seagater in #469
- Fix reduceMin failure issue by @Binyang2014 in #486
- Reduce Operation Support to the Executor by @caiomcbr in #484
- Add CI test for fallback allgather, allreduce, broadcast and reducescatter to NCCL operations by @seagater in #485
- Remove the requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED for NVLS support by @Binyang2014 in #489
- Add CUDA 12.8 images by @chhwang in #488
- Add a devcontainer configuration by @chhwang in #490
- Fix CMake installation in Dockerfile for arm64 by @chhwang in #491
- Export mscclpp GpuBuffer to dlpack format by @Binyang2014 in #492
- Fix the virtual address mapping issue of cuMemMap in fallback code by @seagater in #501
- Improve signal/wait performance and fix barrier issue by @Binyang2014 in #499
- Fix performance issue introduced in PR #499 by @Binyang2014 in #505
- Add flag to disable nvls by @Binyang2014 in #500
- Optimized allreduce fallback for ~10KB sizes by @chhwang in #506
- Automatic creation of Scratch Buffer at MSCCLLang by @caiomcbr in #510
- Use implicit ctors for default device ctors by @chhwang in #512
- apps/nccl: fix a bug in allreduce kernels for graph mode by @nusislam in #502
- Revised MemoryChannel interfaces by @chhwang in #508
- Fix #508 by @chhwang in #515
- Add NVLS based fallback algo by @Binyang2014 in #507
- Enhance Collective Check at MSCCLLang by @caiomcbr in #511
- Support ibv_reg_dmabuf_mr for buffer allocated by cuMemMalloc by @seagater in #513
- Fix the issue of echo message for nccl fallback in CI test by @seagater in #520
- Asynchronous setup by @chhwang in #514
- Adding maxSpinCount to port channel flush by @Binyang2014 in #518
- Fix device assert by @chhwang in #522
- Fix #514 by @chhwang in #521
- Add a CMake option MSCCLPP_GPU_ARCHS by @chhwang in #525
- Update citations by @chhwang in #524
- Set Up a CI Pipeline for H100 by @Binyang2014 in #526
- Properly setting up the device in Ethernet Connection by @caiomcbr in #527
- Add device semaphore API by @Binyang2014 in #523
- Address NVCC warning #20012-D by @chhwang in #528
- Rename ChannelTrigger fields and check field values in debug builds by @chhwang in #529
- DLPack fixes by @chhwang in #537
- Improved documentation & minor interface revision by @chhwang in #541
- Use a stream pool for gpuCalloc*() by @chhwang in #509
- Multi-stream CUDA IPC by @chhwang in #326
- Fix #509 by @chhwang in #546
- Fix build processes by @chhwang in #545
- Do not use tail replica by default by @chhwang in #544
- DeviceSemaphore fix by @Binyang2014 in #553
- Fix some typos in docs by @Edenzzzz in #555
- New FIFO test by @chhwang in #558
- FIFO improvements by @chhwang in #557
- Fix #557 by @chhwang in #560
- Support connection between local endpoints by @chhwang in #561
- Fix multi-nodes CI pipeline by @Binyang2014 in #564
- Support any GPUs per node for NCCL_API by @Binyang2014 in #566
- Fix pytest failure by @Binyang2014 in #567
- Fix a FIFO correctness bug by @chhwang in #549
- New semaphore constructors by @chhwang in #559
- Revise NVLS interface by @chhwang in #458
- update readme & bump version by @Binyang2014 in #550
New Contribu...
MSCCL++ v0.6.0
Highlight
- Improved NCCL API integration in MSCCL++ for better performance and usability
- Enhanced execution plan-based executor in MSCCL++
- Fixed several bugs to improve stability and reliability
What's Changed
- Add support for different vector sizes in multimem instructions by @roshandathathri in #332
- NCCL API Executor Integration by @caiomcbr in #331
- Fix missing import in executor test by @yzygitzh in #334
- bfloat16 support by @chhwang in #336
- Dynamically load libibverbs by @caiomcbr in #337
- Auto-tune vector sizes for NVLS allreduce6 by @roshandathathri in #338
- Make ibverbs optional at compile time by @chhwang in #340
- ProxyChannel Support in Executor by @caiomcbr in #342
- Support executors to send packets over ProxyChannel by @caiomcbr in #344
- Fix for ROCm 6.0 by @chhwang in #347
- Fix bug for construct semaphore by @Binyang2014 in #341
- Add proxy channel related operations by @Binyang2014 in #351
- Add CI for rocm by @Binyang2014 in #346
- Tune threads per block for mscclpp executor by @Binyang2014 in #345
- Fix NPKit exit event offset by @yzygitzh in #356
- Use IB transport flags only when an IB device exists by @chhwang in #355
- Update ROCm CI by @chhwang in #357
- Fixing RegisterMemory Allocation for ProxyChannels by @caiomcbr in #353
- Fix NCCL API bugs by @chhwang in #363
- Perf optimization & support clipping by @chhwang in #364
- Fix copyright messages by @chhwang in #367
- [Doc] mscclpp docs by @Binyang2014 in #348
- Executor AllGather In-Place Support by @caiomcbr in #365
- Fix algo repo name by @Binyang2014 in #369
- Update docker image for cuda12.4 by @Binyang2014 in #370
- Fix in-place all-gather input buffer in executor_test by @yzygitzh in #372
- [docs] fix quickstart link by @jeffra in #374
- Add kernel-based verification for executor_test by @yzygitzh in #378
- Lazily create the context stream by @chhwang in #381
- Fixing Bug Const Offset in Execution Plan by @caiomcbr in #380
- Fix light load bug by @Binyang2014 in #379
- Small Adjust in Test Data AllGather at Executor Test by @caiomcbr in #384
- Fix missing packet parameter for executor by @yzygitzh in #385
- NVLS support for msccl++ executor by @Binyang2014 in #375
- Fix typo by @Binyang2014 in #389
- Improve CMake options by @chhwang in #376
- Fixing Message Boundary AllReduce Fallback Code by @caiomcbr in #391
- Fix mscclpp_benchmark by @Binyang2014 in #392
- Add cross threadblock barrier by @Binyang2014 in #383
- AllGather Executor Support in NCCL Interface by @caiomcbr in #393
- Providing reduce-scatter test support by @caiomcbr in #390
- Select algo according to json config by @Binyang2014 in #396
- Add connection events for NPKit by @yzygitzh in #386
- Revised ProxyChannel interfaces by @chhwang in #400
- Setup pipeline for mscclpp over nccl by @Binyang2014 in #401
- Exception Max Number Operation per Tb by @caiomcbr in #405
- Reduce memory usage for scratch buffer by @Binyang2014 in #403
- [Cherry-pick] Move pipeline to official org (#406) by @Binyang2014 in #416
- [Cherry-pick] trigger ci for release branches (#426) by @Binyang2014 in #427
- [Cherry-pick] Disable CuMemMap check for ROCm (#411) by @Binyang2014 in #424
- [Cherry-pick] NVLS support for NCCL API (#410) by @Binyang2014 in #425
- [Cherry-pick] Fix nccl-test failure issue (#421) by @Binyang2014 in #429
Full Changelog: v0.5.2...v0.6.0
MSCCL++ v0.5.2
What's Changed
- Add C++ executor test by @chhwang in #304
- Cumulative Updates by @Binyang2014 in #309
- Add NPKit GPU event support by @yzygitzh in #310
- Fix NPKit support for AMD by @yzygitzh in #312
- Add "packet type" option for executor test by @Binyang2014 in #313
- Add support for multicast reduce instruction by @roshandathathri in #316
- Update quickstart.md by @angelica-moreira in #314
- Simplify/improve barrier in AllReduce6 by @roshandathathri in #317
- Support NCCL APIs by @caiomcbr in #319
- Update allreduce_bench.py by @angelica-moreira in #318
- Separate NPKit CPU timestamp access from different blocks for AMD platform by @yzygitzh in #321
- AllReduce Kernel for Small Messages by @caiomcbr in #322
- Resolve clang++ warnings by @chhwang in #325
- Support to write packets via uint2 by @Binyang2014 in #327
- Double buffering for NCCL APIs by @caiomcbr in #324
- v0.5.2 by @chhwang in #328
New Contributors
- @angelica-moreira made their first contribution in #314
- @caiomcbr made their first contribution in #319
Full Changelog: v0.5.1...v0.5.2
MSCCL++ v0.5.1
MSCCL++ v0.5.0
What's Changed
- Fix a typo name by @chhwang in #286
- Add executor to execute schedule-plan file by @Binyang2014 in #283
- Allow binding allocated memory to NVLS multicast pointer by @roshandathathri in #290
- Separate headers for GPU data types by @chhwang in #291
- Refactoring NVLS interfaces by @chhwang in #293
- Include GPU data types only for kernel code by @chhwang in #292
- Ethernet support by @chhwang in #284
- Resolve multi-nodes test failure issue by @Binyang2014 in #295
- Move pipeline to Azure org by @Binyang2014 in #296
- Optimized the execution kernel by @Binyang2014 in #294
- Allow obtaining cuda stream handle from PyTorch stream when launching kernel by @aashaka in #297
- v0.5.0 by @chhwang in #298
New Contributors
- @roshandathathri made their first contribution in #290
Full Changelog: v0.4.3...v0.5.0
MSCCL++ v0.4.3
What's Changed
- Add optional prefix to installation paths by @chhwang in #235
- Fix #235 by @chhwang in #239
- Check nvidia_peermem during runtime by @chhwang in #234
- Do not check value of __HIP_PLATFORM_AMD__ by @chhwang in #240
- Fix crash in static variable deconstructor by @Binyang2014 in #238
- Update interface to let user change fifo size by @Binyang2014 in #243
- Mask each field of the trigger by @chhwang in #244
- Minor improvement on device syncer by @chhwang in #231
- remove make pylib-copy command by @Binyang2014 in #249
- Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 by @aashaka in #251
- Add putWithSignal() latency tests by @chhwang in #246
- NVLS support. by @saeedmaleki in #250
- Fix wrong offset calculation by @chhwang in #257
- Fix NVLS support by @chhwang in #258
- Allow MSCCL++ CommGroup to take PyTorch tensors in args by @aashaka in #255
- Fix multi-nodes test failure by @Binyang2014 in #262
- Allow semaphores and memory to be registered separately in ProxyService by @aashaka in #264
- Remove cuda-python from project by @Binyang2014 in #245
- Fix the comm.py for nvls by @saeedmaleki in #267
- New packet format & optimizations by @chhwang in #256
- Fix multi-node ci pipeline by @Binyang2014 in #272
- add launch_bounds for mscclpp_test by @Binyang2014 in #273
- Fix bootstrapping mechanism by @chhwang in #278
- v0.4.3 by @chhwang in #279
Full Changelog: v0.4.2...v0.4.3
MSCCL++ v0.4.2
MSCCL++ v0.4.1
What's Changed
- Fix performance downgrade issue & update doc by @Binyang2014 in #229
- Add a documentation issue template by @chhwang in #230
Full Changelog: v0.4.0...v0.4.1
MSCCL++ v0.4.0
- Add Python benchmark
- Update documentation
- Add ROCm support
- Bug fixes
See details from #160.
MSCCL++ v0.3.0
- Updated interfaces
- Add Python bindings and interfaces
- Add Python unit tests
- Add more configurable parameters
- Add a new single-node AllReduce kernel
- Fix bugs
See details from #89.
Full Changelog: v0.2.0...v0.3.0