## Script for testing HCCL communication bandwidth across 8 NPUs

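- **Average test time**: an algorithm first runs m warm-up iterations and then n measured iterations. Timing starts with the first measured iteration, and the result is the average time per iteration over those n runs.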

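- **Algorithm bandwidth**: the allocated buffer size divided by the average time. The figure therefore covers data transfer, computation, and memory copies.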
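The two definitions above can be sketched in a few lines of Python. This is only an illustration: `benchmark` and `algorithm_bandwidth` are hypothetical names, the timed operation is any Python callable, and reporting in GB/s with 1 GB = 1e9 bytes is an assumption rather than something the test binaries are known to do.

```python
import time


def benchmark(op, warmup, iters):
    """Run `op` `warmup` times untimed, then `iters` times timed.

    Returns the mean wall-clock time per timed iteration, in seconds.
    """
    for _ in range(warmup):
        op()
    # Timing starts only after the warm-up iterations have finished.
    start = time.perf_counter()
    for _ in range(iters):
        op()
    return (time.perf_counter() - start) / iters


def algorithm_bandwidth(mem_size_bytes, avg_time_s):
    """Algorithm bandwidth in GB/s (assumes 1 GB = 1e9 bytes)."""
    return mem_size_bytes / avg_time_s / 1e9
```

For example, a 2,000,000,000-byte buffer that takes 0.5 s per iteration on average gives `algorithm_bandwidth(2_000_000_000, 0.5) == 4.0`.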

In [1]:
import os, time, utils
from glob import glob

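# Root directory the program runs from
root_path = utils.get_rootpath()
log_path = f"{root_path}/log"
# Starting buffer size (in bytes)
test_mem_start = 2048 * 1024 * 1024
# Final buffer size (in bytes)
test_mem_stop = 2048 * 1024 * 1024
# Growth factor between successive sizes
test_mem_factor = 2
# Names of the collective operations to test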
test_func = [
    "all_gather",
    "all_reduce",
    "alltoall",
    "alltoallv",
    "broadcast",
    "reduce",
    "reduce_scatter",
]

### A. Clean the working environment

In [2]:
print(f"root_path: {root_path}")
os.system(f"rm -rf {log_path}")
os.chdir(root_path)
os.makedirs(log_path)

root_path: /root/workdir/hccl_test


### B. Rebuild all HCCL test programs

In [None]:
os.chdir(f"{root_path}/hccl")
os.system("make clean")
os.system("make")
os.chdir(root_path)

## Build the test list

### A. Generate the list automatically from the parameters

In [3]:
mem_list = []
mem_size = test_mem_start
while mem_size <= test_mem_stop:
    for func in test_func:
        mem_list.append([func, mem_size])
    mem_size = mem_size * test_mem_factor
mem_count = len(mem_list)
print(f"mem_list len: {mem_count}")

mem_list len: 7


### B. Specify the list manually

In [4]:
mem_list = [
    # ["all_gather", 2048 * 1024 * 1024],
    ["all_reduce", 2048 * 1024 * 1024],
    # ["alltoall", 2048 * 1024 * 1024],
    # ["alltoallv", 2048 * 1024 * 1024],
    # ["broadcast", 2048 * 1024 * 1024],
    # ["reduce", 2048 * 1024 * 1024],
    # ["reduce_scatter", 2048 * 1024 * 1024],
]
mem_count = len(mem_list)
print(f"mem_list len: {mem_count}")

mem_list len: 1


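## Run the tests

### A. Profiling test

This test uses the msprof tool to capture a performance trace while running each operation with the given parameters, then merges the results from multiple nodes into one. The final output is a timeline JSON file in the log folder, which can be imported directly into a trace analysis tool.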
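The merge step can be pictured with a small sketch. It assumes each per-device timeline is a JSON array of trace events (objects with fields such as `name` and `ts`); that layout is an assumption about the files, and `merge_trace_files` is a hypothetical helper, not the implementation behind `utils.merge_timeline`.

```python
import json
from pathlib import Path


def merge_trace_files(paths, dst):
    """Concatenate several trace-event JSON arrays into one file.

    Each event gets an `args["source"]` entry naming the file it came
    from, so events from different devices stay distinguishable.
    Returns the number of merged events.
    """
    merged = []
    for path in sorted(paths):
        events = json.loads(Path(path).read_text())
        for event in events:
            # Copy the existing args (if any) and record the source file.
            args = dict(event.get("args", {}), source=Path(path).name)
            merged.append({**event, "args": args})
    Path(dst).write_text(json.dumps(merged))
    return len(merged)
```

Tagging each event with its source file, rather than rewriting process or thread IDs, keeps the sketch independent of how the real trace files number their devices.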

In [5]:
utils.load_env()
npus = os.environ["HCCL_TEST_USE_DEVS"].split(",")
for i in range(len(mem_list)):
    # Set up the basic variables and print the progress prefix
    global_start = time.time()
    local_start = time.time()
    [func, mem_size] = mem_list[i]
    prt_precent = f"{((i + 1.0) / mem_count * 100):02.2f}%"
    prt_prefix = f"{i + 1:03d} / {len(mem_list):03d} ({prt_precent})"
    print(f"{prt_prefix} >> run   {func}_test in {utils.get_size(mem_size)}...", end="")
    # Set up the variables needed by the following steps
    mpirun_template = f"{root_path}/script/mpirun.template"
    mpirun_script = f"{log_path}/mpirun_script.sh"
    msprof_template = f"{root_path}/script/msprof.template"
    mpirun_args = {
        "npus": len(npus),
        "mem_size": mem_size,
        "exec": f"{root_path}/bin/{func}_test",
        "log_path": f"{log_path}/{func}_{utils.get_size(mem_size)}_{len(npus)}npus.log",
    }
    msprof_args = {
        "script_path": mpirun_script,
        "prof_path": f"{log_path}/tmp",
        "log_path": f"{log_path}/mpirun_script.log",
    }
    # Remove old temporary files, then start the profiling run
    os.system(f"rm -rf {log_path}/tmp")
    utils.get_script(mpirun_template, mpirun_args, mpirun_script)
    os.system(utils.get_script(msprof_template, msprof_args))
    print(f"done in {utils.get_time(local_start)}")
    # Merge the device-side timeline files
    local_start = time.time()
    print(f"{prt_prefix} >> merge {func} timeline json...", end="")
    msprof_timeline_src = f"{log_path}/tmp/PROF_*"
    msprof_timeline_dst = f"{log_path}/{func}_{utils.get_size(mem_size)}"
    utils.merge_timeline(glob(msprof_timeline_src), msprof_timeline_dst)
    print(f"done in {utils.get_time(local_start)}")
    print(f"{prt_prefix} >> proc  {func} total use: {utils.get_time(global_start)}")

001 / 001 (100.00%) >> run   all_reduce_test in 2.0GB...done in 00:02:59
001 / 001 (100.00%) >> merge all_reduce timeline json...done in 00:00:24
001 / 001 (100.00%) >> proc  all_reduce total use: 00:03:23
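The output above shows sizes rendered as `2.0GB` and durations as `HH:MM:SS`. A minimal sketch of formatters that produce the same strings follows. `format_size` and `format_elapsed` are hypothetical stand-ins; the actual `utils.get_size` and `utils.get_time` are not shown here and may differ.

```python
import time


def format_size(num_bytes):
    """Render a byte count with a binary unit, e.g. 2147483648 -> '2.0GB'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}TB"


def format_elapsed(start, now=None):
    """Render the whole seconds elapsed since `start` as HH:MM:SS.

    `now` defaults to the current time; passing it explicitly makes the
    function easy to check.
    """
    elapsed = int((time.time() if now is None else now) - start)
    hours, rest = divmod(elapsed, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

With these definitions, `format_size(2048 * 1024 * 1024)` returns `'2.0GB'` and `format_elapsed(0, now=179)` returns `'00:02:59'`.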


### B. Plain test

In [None]:
utils.load_env()
npus = os.environ["HCCL_TEST_USE_DEVS"].split(",")
for i in range(len(mem_list)):
    # Initialize variables
    local_start = time.time()
    [func, mem_size] = mem_list[i]
    prt_precent = f"{((i + 1.0) / mem_count * 100):02.2f}%"
    prt_prefix = f"{i + 1:03d} / {len(mem_list):03d} ({prt_precent})"
    print(f"{prt_prefix} >> {func} in {utils.get_size(mem_size)}...", end="")
    # Set the paths and run the script
    script_path = f"{root_path}/script/mpirun.template"
    args_dict = {
        "npus": len(npus),
        "mem_size": mem_size,
        "exec": f"{root_path}/bin/{func}_test",
        "log_path": f"{log_path}/{func}_{len(npus)}npus.log",
    }
    os.system(utils.get_script(script_path, args_dict))
    print(f"done in {utils.get_time(local_start)}")
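Both cells build a shell command by filling a template with a dictionary of arguments. The sketch below shows one way such a step could work using Python's standard `string.Template`. The `$name` placeholder syntax, the example template text, and the `render_script` helper are all assumptions, since the contents of `mpirun.template` and the behavior of `utils.get_script` are not shown here.

```python
from string import Template


def render_script(template_text, args, out_path=None):
    """Fill `$name` placeholders in `template_text` from `args`.

    If `out_path` is given, the rendered script is also written there.
    Returns the rendered text so it can be passed to a shell.
    """
    # substitute() raises KeyError if a placeholder has no matching key.
    script = Template(template_text).substitute(args)
    if out_path is not None:
        with open(out_path, "w") as handle:
            handle.write(script)
    return script


# Hypothetical template; the real mpirun.template may look different.
example_template = "mpirun -n $npus $exec $mem_size > $log_path"
example_args = {
    "npus": 8,
    "mem_size": 1024,
    "exec": "/bin/all_reduce_test",
    "log_path": "/tmp/all_reduce.log",
}
command = render_script(example_template, example_args)
```

Using `substitute` rather than `safe_substitute` makes a missing argument fail loudly instead of leaving a literal `$name` in the command.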