Releases · microsoft/mscclpp

12 Jul 08:10

chhwang

v0.7.0

5e991cf

MSCCL++ v0.7.0 Latest

What's Changed

Move pipeline to official org by @Binyang2014 in #406
Disable CuMemMap check for ROCm by @Binyang2014 in #411
NVLS support for NCCL API by @Binyang2014 in #410
Supporting multi-node executors in NCCL API by @caiomcbr in #412
Fix synchronization in allreduce8 kernel by @dsidler in #407
Add ncclBcast / ncclBroadcast support by @SreevatsaAnantharamu in #419
Update README by @Binyang2014 in #414
Fix nccl-test failure issue by @Binyang2014 in #421
Tackle build warnings by @chhwang in #422
trigger ci for release branches by @Binyang2014 in #426
Fix CI trigger issue by @Binyang2014 in #428
Fix typos in the pipeline by @chhwang in #420
Update version number by @Binyang2014 in #433
Enhance the nccl error message handling by @seagater in #434
[NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl by @PedramAlizadeh in #399
Fix azure pipeline by @Binyang2014 in #437
Add GpuBuffer class by @chhwang in #423
Fix CMake build messages by @chhwang in #443
Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit by @caiomcbr in #415
Fix Python binding of exceptions by @chhwang in #444
Auto-update version numbers in CMakeLists.txt by @chhwang in #450
Resolve cuMemMap error by @Binyang2014 in #451
Manage runtime environments by @chhwang in #452
Lazily create streams for CudaIpcConnection by @chhwang in #449
Fix PR #449 by @chhwang in #453
Merge mscclpp-lang to mscclpp project by @Binyang2014 in #442
Renaming channels by @chhwang in #436
Add multi-nodes example & update doc by @Binyang2014 in #455
Adjusting BFS to seek circular dependencies in the msccl-tools DAG by @caiomcbr in #459
remove unnecessary sync by @Binyang2014 in #461
Support ReduceScatter in the NCCL interface by @caiomcbr in #460
Updating MSCCLLang Examples by @caiomcbr in #462
Disable channel cache by @seagater in #463
Adjusting AllGather Collective in MSCCLLang by @caiomcbr in #466
Adding Read Put Packet operation at Executor by @caiomcbr in #441
NPKit Support to Read Put Packet Operation by @caiomcbr in #471
Adjust NPKit IB Event by @caiomcbr in #472
Fix minor typos and errors in documentation by @RyoYang in #474
Improving Get Operation at MSCCLLang by @caiomcbr in #475
Fix memory OOM issue by @Binyang2014 in #479
Mark mscclpp-test as deprecated in the doc by @chhwang in #478
Update allgather fallback algo by @Binyang2014 in #476
Add min operation for allreduce by @Binyang2014 in #481
NCCL API CI Test for ReduceScatter by @caiomcbr in #465
Fix correctness issue when mscclppDisableChannelCache set to true by @Binyang2014 in #483
nccl/rccl integration by @seagater in #469
Fix reduceMin failure issue by @Binyang2014 in #486
Reduce Operation Support to the Executor by @caiomcbr in #484
Add CI test for fallback allgather, allreduce, broadcast and reducescatter to NCCL operations by @seagater in #485
Remove the requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED for NVLS support by @Binyang2014 in #489
Add CUDA 12.8 images by @chhwang in #488
Add a devcontainer configuration by @chhwang in #490
Fix CMake installation in Dockerfile for arm64 by @chhwang in #491
Export mscclpp GpuBuffer to dlpack format by @Binyang2014 in #492
Fix the virtual address mapping issue of cuMemMap in fallback code by @seagater in #501
Improve signal/wait performance and fix barrier issue by @Binyang2014 in #499
Fix performance issue introduced in PR #499 by @Binyang2014 in #505
Add flag to disable nvls by @Binyang2014 in #500
Optimized allreduce fallback for ~10KB sizes by @chhwang in #506
Automatic creation of Scratch Buffer at MSCCLLang by @caiomcbr in #510
Use implicit ctors for default device ctors by @chhwang in #512
apps/nccl: fix a bug in allreduce kernels for graph mode by @nusislam in #502
Revised MemoryChannel interfaces by @chhwang in #508
Fix #508 by @chhwang in #515
Add NVLS based fallback algo by @Binyang2014 in #507
Enhance Collective Check at MSCCLLang by @caiomcbr in #511
Support ibv_reg_dmabuf_mr for buffer allocated by cuMemMalloc by @seagater in #513
Fix the issue of echo message for nccl fallback in CI test by @seagater in #520
Asynchronous setup by @chhwang in #514
Adding maxSpinCount to port channel flush by @Binyang2014 in #518
Fix device assert by @chhwang in #522
Fix #514 by @chhwang in #521
Add a CMake option MSCCLPP_GPU_ARCHS by @chhwang in #525
Update citations by @chhwang in #524
Set Up a CI Pipeline for H100 by @Binyang2014 in #526
Properly setting up the device in Ethernet Connection by @caiomcbr in #527
Add device semaphore API by @Binyang2014 in #523
Address NVCC warning #20012-D by @chhwang in #528
Rename ChannelTrigger fields and check field values in debug builds by @chhwang in #529
DLPack fixes by @chhwang in #537
Improved documentation & minor interface revision by @chhwang in #541
Use a stream pool for gpuCalloc*() by @chhwang in #509
Multi-stream CUDA IPC by @chhwang in #326
Fix #509 by @chhwang in #546
Fix build processes by @chhwang in #545
Do not use tail replica by default by @chhwang in #544
DeviceSemaphore fix by @Binyang2014 in #553
Fix some typos in docs by @Edenzzzz in #555
New FIFO test by @chhwang in #558
FIFO improvements by @chhwang in #557
Fix #557 by @chhwang in #560
Support connection between local endpoints by @chhwang in #561
Fix multi-nodes CI pipeline by @Binyang2014 in #564
Support any GPUs per node for NCCL_API by @Binyang2014 in #566
Fix pytest failure by @Binyang2014 in #567
Fix a FIFO correctness bug by @chhwang in #549
New semaphore constructors by @chhwang in #559
Revise NVLS interface by @chhwang in #458
update readme & bump version by @Binyang2014 in #550

New Contribu...

Contributors

SreevatsaAnantharamu, dsidler, and 8 other contributors

23 Dec 19:01

Binyang2014

v0.6.0

11e6202

MSCCL++ v0.6.0

Highlight

Improved NCCL API integration in MSCCL++ for better performance and usability
Enhanced execution plan-based executor in MSCCL++
Fixed several bugs to improve stability and reliability

What's Changed

Add support for different vector sizes in multimem instructions by @roshandathathri in #332
NCCL API Executor Integration by @caiomcbr in #331
Fix missing import in executor test by @yzygitzh in #334
bfloat16 support by @chhwang in #336
Dynamically load libibverbs by @caiomcbr in #337
Auto-tune vector sizes for NVLS allreduce6 by @roshandathathri in #338
Make ibverbs optional at compile time by @chhwang in #340
ProxyChannel Support in Executor by @caiomcbr in #342
Support executors to send packets over ProxyChannel by @caiomcbr in #344
Fix for ROCm 6.0 by @chhwang in #347
Fix bug for construct semaphore by @Binyang2014 in #341
Add proxy channel related operations by @Binyang2014 in #351
Add CI for rocm by @Binyang2014 in #346
Tune threads per block for mscclpp executor by @Binyang2014 in #345
Fix NPKit exit event offset by @yzygitzh in #356
Use IB transport flags only when an IB device exists by @chhwang in #355
Update ROCm CI by @chhwang in #357
Fixing RegisterMemory Allocation for ProxyChannels by @caiomcbr in #353
Fix NCCL API bugs by @chhwang in #363
Perf optimization & support clipping by @chhwang in #364
Fix copyright messages by @chhwang in #367
[Doc] mscclpp docs by @Binyang2014 in #348
Executor AllGather In-Place Support by @caiomcbr in #365
Fix algo repo name by @Binyang2014 in #369
Update docker image for cuda12.4 by @Binyang2014 in #370
Fix in-place all-gather input buffer in executor_test by @yzygitzh in #372
[docs] fix quickstart link by @jeffra in #374
Add kernel-based verification for executor_test by @yzygitzh in #378
Lazily create the context stream by @chhwang in #381
Fixing Bug Const Offset in Execution Plan by @caiomcbr in #380
Fix light load bug by @Binyang2014 in #379
Small Adjust in Test Data AllGather at Executor Test by @caiomcbr in #384
Fix missing packet parameter for executor by @yzygitzh in #385
NVLS support for msccl++ executor by @Binyang2014 in #375
Fix typo by @Binyang2014 in #389
Improve CMake options by @chhwang in #376
Fixing Message Boundary AllReduce Fallback Code by @caiomcbr in #391
Fix mscclpp_benchmark by @Binyang2014 in #392
Add cross threadblock barrier by @Binyang2014 in #383
AllGather Executor Support in NCCL Interface by @caiomcbr in #393
Providing reduce-scatter test support by @caiomcbr in #390
Select algo according to json config by @Binyang2014 in #396
Add connection events for NPKit by @yzygitzh in #386
Revised ProxyChannel interfaces by @chhwang in #400
Setup pipeline for mscclpp over nccl by @Binyang2014 in #401
Exception Max Number Operation per Tb by @caiomcbr in #405
Reduce memory usage for scratch buffer by @Binyang2014 in #403
[Cherry-pick] Move pipeline to official org (#406) by @Binyang2014 in #416
[Cherry-pick] trigger ci for release branches (#426) by @Binyang2014 in #427
[Cherry-pick] Disable CuMemMap check for ROCm (#411) by @Binyang2014 in #424
[Cherry-pick] NVLS support for NCCL API (#410) by @Binyang2014 in #425
[Cherry-pick] Fix nccl-test failure issue (#421) by @Binyang2014 in #429

New Contributors

@jeffra made their first contribution in #374

Full Changelog: v0.5.2...v0.6.0

Contributors

jeffra, chhwang, and 4 other contributors

16 Jul 00:37

chhwang

v0.5.2

40cb196

MSCCL++ v0.5.2

What's Changed

Add C++ executor test by @chhwang in #304
Cumulative Updates by @Binyang2014 in #309
Add NPKit GPU event support by @yzygitzh in #310
Fix NPKit support for AMD by @yzygitzh in #312
Add "packet type" option for executor test by @Binyang2014 in #313
Add support for multicast reduce instruction by @roshandathathri in #316
Update quickstart.md by @angelica-moreira in #314
Simplify/improve barrier in AllReduce6 by @roshandathathri in #317
Support NCCL APIs by @caiomcbr in #319
Update allreduce_bench.py by @angelica-moreira in #318
Separate NPKit CPU timestamp access from different blocks for AMD platform by @yzygitzh in #321
AllReduce Kernel for Small Messages by @caiomcbr in #322
Resolve clang++ warnings by @chhwang in #325
Support to write packets via uint2 by @Binyang2014 in #327
Double buffering for NCCL APIs by @caiomcbr in #324
v0.5.2 by @chhwang in #328

New Contributors

@angelica-moreira made their first contribution in #314
@caiomcbr made their first contribution in #319

Full Changelog: v0.5.1...v0.5.2

Contributors

chhwang, Binyang2014, and 4 other contributors

26 May 21:32

chhwang

v0.5.1

cddffbc

MSCCL++ v0.5.1

What's Changed

Upgrade gtest by @chhwang in #300
Rename executor.cpp to executor_py.cpp by @chhwang in #301
Fix assert declaration & add a compile test by @chhwang in #303
Fix security issue by @Binyang2014 in #305
v0.5.1 by @chhwang in #308

Full Changelog: v0.5.0...v0.5.1

Contributors

chhwang and Binyang2014

04 May 23:53

chhwang

v0.5.0

9c2a960

MSCCL++ v0.5.0

What's Changed

Fix a typo name by @chhwang in #286
Add executor to execute schedule-plan file by @Binyang2014 in #283
Allow binding allocated memory to NVLS multicast pointer by @roshandathathri in #290
Separate headers for GPU data types by @chhwang in #291
Refactoring NVLS interfaces by @chhwang in #293
Include GPU data types only for kernel code by @chhwang in #292
Ethernet support by @chhwang in #284
Resolve multi-nodes test failure issue by @Binyang2014 in #295
Move pipeline to Azure org by @Binyang2014 in #296
Optimized the execution kernel by @Binyang2014 in #294
Allow obtaining cuda stream handle from PyTorch stream when launching kernel by @aashaka in #297
v0.5.0 by @chhwang in #298

New Contributors

@roshandathathri made their first contribution in #290

Full Changelog: v0.4.3...v0.5.0

Contributors

chhwang, Binyang2014, and 2 other contributors

27 Mar 18:55

chhwang

v0.4.3

1a7cb98

MSCCL++ v0.4.3

What's Changed

Add optional prefix to installation paths by @chhwang in #235
Fix #235 by @chhwang in #239
Check nvidia_peermem during runtime by @chhwang in #234
Do not check value of __HIP_PLATFORM_AMD__ by @chhwang in #240
Fix crash in static variable destructor by @Binyang2014 in #238
Update interface to let user change fifo size by @Binyang2014 in #243
Mask each field of the trigger by @chhwang in #244
Minor improvement on device syncer by @chhwang in #231
remove make pylib-copy command by @Binyang2014 in #249
Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 by @aashaka in #251
Add putWithSignal() latency tests by @chhwang in #246
NVLS support. by @saeedmaleki in #250
Fix wrong offset calculation by @chhwang in #257
Fix NVLS support by @chhwang in #258
Allow MSCCL++ CommGroup to take PyTorch tensors in args by @aashaka in #255
Fix multi-nodes test failure by @Binyang2014 in #262
Allow semaphores and memory to be registered separately in ProxyService by @aashaka in #264
Remove cuda-python from project by @Binyang2014 in #245
Fix the comm.py for nvls by @saeedmaleki in #267
New packet format & optimizations by @chhwang in #256
Fix multi-node ci pipeline by @Binyang2014 in #272
add launch_bounds for mscclpp_test by @Binyang2014 in #273
Fix bootstrapping mechanism by @chhwang in #278
v0.4.3 by @chhwang in #279

New Contributors

@aashaka made their first contribution in #251

Full Changelog: v0.4.2...v0.4.3

Contributors

chhwang, Binyang2014, and 2 other contributors

20 Dec 12:25

chhwang

v0.4.2

f1605b7

MSCCL++ v0.4.2

What's Changed

Include cstdint in packet_device.hpp by @chhwang in #233
Fix & improve perf for ROCm by @chhwang in #232
v0.4.2 by @chhwang in #236

Full Changelog: v0.4.1...v0.4.2

Contributors

chhwang

06 Dec 02:14

chhwang

v0.4.1

c15a166

MSCCL++ v0.4.1

What's Changed

Fix performance downgrade issue & update doc by @Binyang2014 in #229
Add a documentation issue template by @chhwang in #230

Full Changelog: v0.4.0...v0.4.1

Contributors

chhwang and Binyang2014

24 Nov 09:09

chhwang

v0.4.0

351b95b

MSCCL++ v0.4.0

Add Python benchmark
Update documentation
Add ROCm support
Bug fixes

See details in #160.

11 Oct 14:37

chhwang

v0.3.0

8c0f9e8

MSCCL++ v0.3.0

Updated interfaces
Add Python bindings and interfaces
Add Python unit tests
Add more configurable parameters
Add a new single-node AllReduce kernel
Fix bugs

See details in #89.

Full Changelog: v0.2.0...v0.3.0
