## Imports

In [1]:
import os

## System Configuration Files

The system configuration lists are named using the `system_cfg_files_{# GPUs}{,_2,_3}` format, where
- `{# GPUs}` is the number of GPUs in the target cluster
- `{,_2,_3}` further groups clusters by the number of devices in each node (noted by the `# Node Size ...` comment, where present). This grouping matters when customizing the parallelization strategies used in `task_cfgs/cloud/dlrm_train_cloud.json`.
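
The naming scheme above can be sketched as a small helper that builds a list's variable name from a GPU count and a group index. The function name `cfg_list_name` and its `group` parameter are illustrative assumptions and are not part of the original notebook.

```python
def cfg_list_name(num_gpus, group=1):
    """Return the variable name for a GPU count and node-size group.

    group=1 maps to the unsuffixed list, group=2 to `_2`, group=3 to `_3`.
    (Hypothetical helper for illustration only.)
    """
    if group not in (1, 2, 3):
        raise ValueError("group must be 1, 2, or 3")
    suffix = "" if group == 1 else f"_{group}"
    return f"system_cfg_files_{num_gpus}{suffix}"


print(cfg_list_name(128))     # system_cfg_files_128
print(cfg_list_name(512, 3))  # system_cfg_files_512_3
```

Once the lists in the next cell are defined, a lookup such as `globals()[cfg_list_name(128, 2)]` would return the matching list, although assigning the variable by name, as the run script does, is just as workable.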

In [2]:
system_cfg_files_64 = [
    'system_cfgs/cloud_A/cloud_A_a100.80_64.json',
    'system_cfgs/cloud_A/cloud_A_h100.80_64.json',
    'system_cfgs/cloud_B/cloud_B_a100.40_64.json',
    'system_cfgs/cloud_C/cloud_C_a100.40_64.json',
]

# Node Size 4
system_cfg_files_64_2 = [
    'system_cfgs/cloud_B/cloud_B_a100.80_64.json',
]

system_cfg_files_128 = [
    'system_cfgs/cloud_A/cloud_A_a100.40_128.json',
    'system_cfgs/cloud_A/cloud_A_h100.80_128.json',
    'system_cfgs/cloud_B/cloud_B_a100.40_128.json',
    'system_cfgs/cloud_C/cloud_C_a100.40_128.json',
]

# Node Size 4
system_cfg_files_128_2 = [
    'system_cfgs/cloud_B/cloud_B_a100.80_128.json',
]

# Node Size 16
system_cfg_files_128_3 = [
    'system_cfgs/cloud_C/cloud_C_a100.40m_128.json',
]

system_cfg_files_256 = [
    'system_cfgs/cloud_A/cloud_A_v100.32_256.json',
    'system_cfgs/cloud_A/cloud_A_a100.40_256.json',
    'system_cfgs/cloud_A/cloud_A_h100.80_256.json',
    'system_cfgs/cloud_B/cloud_B_v100.32_256.json',
    'system_cfgs/cloud_B/cloud_B_a100.40_256.json',
    'system_cfgs/cloud_C/cloud_C_a100.40_256.json',
]

# Node Size 4
system_cfg_files_256_2 = [
    'system_cfgs/cloud_B/cloud_B_a100.80_256.json',
]

# Node Size 16
system_cfg_files_256_3 = [
    'system_cfgs/cloud_C/cloud_C_a100.40m_256.json',
]

system_cfg_files_512 = [
    'system_cfgs/cloud_A/cloud_A_v100.16_512.json',
    'system_cfgs/cloud_A/cloud_A_v100.32_512.json',
    'system_cfgs/cloud_A/cloud_A_a100.40_512.json',
    'system_cfgs/cloud_A/cloud_A_h100.80_512.json',
    'system_cfgs/cloud_B/cloud_B_v100.32_512.json',
    'system_cfgs/cloud_B/cloud_B_a100.40_512.json',
    'system_cfgs/cloud_C/cloud_C_v100.16_512.json',
    'system_cfgs/cloud_C/cloud_C_a100.40_512.json',
]

# Node Size 4
system_cfg_files_512_2 = [
    'system_cfgs/cloud_B/cloud_B_v100.16_512.json',
    'system_cfgs/cloud_B/cloud_B_a100.80_512.json',
]

# Node Size 16
system_cfg_files_512_3 = [
    'system_cfgs/cloud_C/cloud_C_a100.40m_512.json',
]

system_cfg_files_1024 = [
    'system_cfgs/cloud_A/cloud_A_v100.16_1024.json',
    'system_cfgs/cloud_A/cloud_A_v100.32_1024.json',
    'system_cfgs/cloud_A/cloud_A_a100.40_1024.json',
    'system_cfgs/cloud_A/cloud_A_h100.80_1024.json',
    'system_cfgs/cloud_B/cloud_B_v100.32_1024.json',
    'system_cfgs/cloud_B/cloud_B_a100.40_1024.json',
    'system_cfgs/cloud_C/cloud_C_v100.16_1024.json',
    'system_cfgs/cloud_C/cloud_C_a100.40_1024.json',
]

# Node Size 4
system_cfg_files_1024_2 = [
    'system_cfgs/cloud_B/cloud_B_v100.16_1024.json',
    'system_cfgs/cloud_B/cloud_B_a100.80_1024.json',
]

# Node Size 16
system_cfg_files_1024_3 = [
    'system_cfgs/cloud_C/cloud_C_a100.40m_1024.json',
]

system_cfg_files_2048 = [
    'system_cfgs/cloud_A/cloud_A_v100.16_2048.json',
    'system_cfgs/cloud_A/cloud_A_v100.32_2048.json',
    'system_cfgs/cloud_A/cloud_A_a100.40_2048.json',
    'system_cfgs/cloud_A/cloud_A_h100.80_2048.json',
    'system_cfgs/cloud_B/cloud_B_v100.32_2048.json',
    'system_cfgs/cloud_B/cloud_B_a100.40_2048.json',
    'system_cfgs/cloud_C/cloud_C_v100.16_2048.json',
    'system_cfgs/cloud_C/cloud_C_a100.40_2048.json',
]

# Node Size 4
system_cfg_files_2048_2 = [
    'system_cfgs/cloud_B/cloud_B_v100.16_2048.json',
    'system_cfgs/cloud_B/cloud_B_a100.80_2048.json',
]

# Node Size 16
system_cfg_files_2048_3 = [
    'system_cfgs/cloud_C/cloud_C_a100.40m_2048.json',
]
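
The file names in the lists above appear to follow a `cloud_<provider>_<gpu>.<variant>_<# GPUs>.json` pattern. The sketch below splits a path into those fields under that assumption; the pattern is inferred from the entries listed here rather than documented, and `parse_system_cfg` is a hypothetical helper. The field after the dot (e.g. `40`, `80`, `40m`) is kept as an opaque string.

```python
import re

# Assumed layout: cloud_<provider>_<gpu>.<variant>_<num_gpus>.json
_CFG_RE = re.compile(r"cloud_([A-Z])_([a-z]\d+)\.([0-9a-z]+)_(\d+)\.json$")


def parse_system_cfg(path):
    """Split a system configuration path into its name components."""
    match = _CFG_RE.search(path)
    if match is None:
        raise ValueError(f"unrecognized system config name: {path}")
    provider, gpu, variant, num_gpus = match.groups()
    return {
        "provider": provider,
        "gpu": gpu,
        "variant": variant,
        "num_gpus": int(num_gpus),
    }


info = parse_system_cfg("system_cfgs/cloud_C/cloud_C_a100.40m_128.json")
print(info)
# {'provider': 'C', 'gpu': 'a100', 'variant': '40m', 'num_gpus': 128}
```

A helper like this could be used to check that every entry in a list has the GPU count its variable name implies.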

## Main Run Script

Run this cell to emulate training performance on the target cloud instance clusters.

In [3]:
model_cfg_file = 'model_cfgs/dlrm/dlrm_a.json'
task_cfg_file = "task_cfgs/cloud/dlrm_train_cloud.json"
system_cfg_files = system_cfg_files_128  # change this to any of the system configuration lists defined above

for system_cfg_file in system_cfg_files:
    # run_model.py and all config paths are resolved relative to the parent directory
    os.system(
        f"python ../run_model.py --model-cfg-file '../{model_cfg_file}' "
        f"--system-cfg-file '../{system_cfg_file}' --task-cfg-file '../{task_cfg_file}'"
    )

**************************************************
Model Name: DLRM_A
Parameters: 801.17 B (0.04% dense, 99.96% sparse).
Size: 1602.98 GB (1.28 GB dense, 1601.70 GB sparse).
FLOPs: 638.08 MFLOPs (31.90 MFLOPs per MLP layer) per sample.
Lookup Bytes: 11.55 MB per sample.
**************************************************
**************************************************
System Name: CloudA-128-p4d.24xlarge
16 nodes with 8 devices each
Effective FLOPs:
	FP64: 6.84 TFLOPS per device / 0.88 PFLOPS system-wide
	FP/TF32: 109.98 TFLOPS per device / 14.08 PFLOPS system-wide
	FP/BF16: 219.96 TFLOPS per device / 28.15 PFLOPS system-wide
	INT8: 439.92 TOPS per device / 56.31 POPS system-wide
Memory:
	Capacity: 40.00 GB per device / 5.12 TB system-wide
	Bandwidth: 1555.00 GB/s per device / 199.04 TB/s system-wide
Effective Unidirectional BW per GPU:
	Intra-Node: 135.00 GB/s
	Inter-Node: 1.94 GB/s
Effective Communication Collectives BW:
	All to All: 1.94 GB/s
	All Reduce: 21.43 GB/s
	All Gather: 4

- Note that when you point `system_cfg_files` in the main run script at a list with a different node size (as indicated by its `# Node Size ...` comment), you may also have to update the task configuration file `task_cfgs/cloud/dlrm_train_cloud.json` to match.
- With its default settings, the code block above generates task throughputs that match the entries highlighted in yellow in the reference '[Artifact Evaluation] Cloud Provider Results' sheet.
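
As an alternative to `os.system`, the command in the main run script could be assembled as an argument list and passed to `subprocess.run`, which avoids manual shell quoting. The following is a minimal sketch under that assumption: `build_run_command` and its `prefix` parameter are hypothetical, while the script name and the three flags are taken from the loop above.

```python
def build_run_command(model_cfg, system_cfg, task_cfg, prefix=".."):
    """Build the run_model.py argument list (hypothetical helper)."""
    return [
        "python", f"{prefix}/run_model.py",
        "--model-cfg-file", f"{prefix}/{model_cfg}",
        "--system-cfg-file", f"{prefix}/{system_cfg}",
        "--task-cfg-file", f"{prefix}/{task_cfg}",
    ]


cmd = build_run_command(
    "model_cfgs/dlrm/dlrm_a.json",
    "system_cfgs/cloud_A/cloud_A_a100.40_128.json",
    "task_cfgs/cloud/dlrm_train_cloud.json",
)
print(" ".join(cmd))  # inspect the command before running it
```

Passing the list to `subprocess.run(cmd, check=True)` would then raise `subprocess.CalledProcessError` if a run exits with a non-zero status, instead of silently continuing to the next system configuration.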