🐛 Describe the bug
I am fine-tuning the Llama-3.2-11B-Vision-Instruct model with the llama-cookbook repo on H100s. The total FLOP count reported with PyTorch 2.5 and 2.6 is ~186 TFLOPs, but with the PyTorch nightly (2.7) it is ~2900 TFLOPs (batch size is 2). Could you please take a look at whether the FLOP counting here is correct? Both numbers look wrong: if the total really were 2900 TFLOPs, the implied throughput per H100 would exceed 700 TFLOPS/s, which is absurdly high.
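For context, here is the back-of-the-envelope arithmetic behind that estimate, as a small Python sketch (the per-step FLOP count and step time are the approximate values from the nightly run below; it assumes the counter's total is per rank per optimizer step, and uses the roughly 989 TFLOPS dense BF16 peak for an H100 SXM):

# Rough sanity check of the implied per-GPU throughput (approximate numbers).
flops_per_step = 2900e12   # ~2900 TFLOPs reported per rank per step on the nightly
step_time_s = 4.1          # ~4.1 s per iteration with the FLOP counter enabled
h100_bf16_peak = 989e12    # H100 SXM dense BF16 peak, roughly 989 TFLOPS

achieved = flops_per_step / step_time_s
print(f"implied throughput: {achieved / 1e12:.0f} TFLOPS/s per GPU")  # ~707
print(f"implied MFU: {achieved / h100_bf16_peak:.0%}")                # >70%, implausible for FSDP fine-tuning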
Below are the steps to reproduce:
git clone https://github.com/meta-llama/llama-cookbook.git
docker run --name llama-cookbook --shm-size=64g --gpus all -it --rm -v /home/dougljia:/home/dougljia -e HF_HOME=/home/dougljia/model nvcr.io/nvidia/pytorch:24.12-py3
cd /home/dougljia/llama-cookbook
pip install -U pip setuptools
pip install -e .
pip install huggingface_hub transformers fire
huggingface-cli login --token <replace with your token>
torchrun --nnodes 1 --nproc_per_node 8 getting-started/finetuning/finetuning.py --enable_fsdp --lr 1e-6 --num_epochs 1 --batch_size_training 2 \
--model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" \
--custom_dataset.test_split "test" --custom_dataset.file "getting-started/finetuning/datasets/ocrvqa_dataset.py" --run_validation False --save_model False --batching_strategy padding \
--flop_counter --flop_counter_start 10 --max_train_step 15 --fsdp_activation_checkpointing
The output will be:
# Module FLOP % Total
# ----------------------------------------------------- -------- ---------
# FullyShardedDataParallel 189.021T 100.00%
# - aten.convolution 0.019T 0.01%
# - aten.bmm 0.000T 0.00%
# - aten.mm 83.731T 44.30%
# - aten._scaled_dot_product_cudnn_attention 34.430T 18.21%
# - aten.addmm 27.784T 14.70%
# - aten._scaled_dot_product_cudnn_attention_backward 43.037T 22.77%
# - aten.convolution_backward 0.019T 0.01%
# FullyShardedDataParallel._fsdp_wrapped_module 189.021T 100.00%
# - aten.convolution 0.019T 0.01%
# - aten.bmm 0.000T 0.00%
# - aten.mm 83.731T 44.30%
# - aten._scaled_dot_product_cudnn_attention 34.430T 18.21%
# - aten.addmm 27.784T 14.70%
# - aten._scaled_dot_product_cudnn_attention_backward 43.037T 22.77%
# - aten.convolution_backward 0.019T 0.01%
# Training Epoch: 1/1, step 14/112 completed (loss: 0.4789400100708008): 13%|███▎ | 15/112 [01:41<10:56, 6.77s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.3038587272167206): 13%|███▎ | 15/112 [01:40<10:52, 6.73s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.7101249694824219): 13%|███▎ | 15/112 [01:40<10:48, 6.69s/it]
# Max CUDA memory allocated was 69 GB
# Max CUDA memory reserved was 77 GB
# Peak active CUDA memory was 69 GB
# CUDA Malloc retries : 0
# CPU Total Peak Memory consumed during the train (max): 3 GB
# Epoch 1: train_perplexity=1.0954, train_epoch_loss=0.0911, epoch time 103.91605499701109s
# training params are saved in /home/dougljia/llama-cookbook/finetuned_model/fine-tuned-meta-llama/Llama-3.2-11B-Vision-Instruct/train_params.yaml
# Key: avg_train_prep, Value: 1.095421552658081
# Key: avg_train_loss, Value: 0.0911393016576767
# Key: avg_epoch_time, Value: 103.91605499701109
# Key: avg_checkpoint_time, Value: 4.4002081267535686e-07
# Key: model_tflops, Value: 31.987115589590758
If you instead install the PyTorch nightly in the same docker image:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126
cd /home/dougljia/llama-cookbook
pip install -U pip setuptools
pip install -e .
pip install huggingface_hub transformers fire
huggingface-cli login --token <replace with your token>
torchrun --nnodes 1 --nproc_per_node 8 getting-started/finetuning/finetuning.py --enable_fsdp --lr 1e-6 --num_epochs 1 --batch_size_training 2 \
--model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" \
--custom_dataset.test_split "test" --custom_dataset.file "getting-started/finetuning/datasets/ocrvqa_dataset.py" --run_validation False --save_model False --batching_strategy padding \
--flop_counter --flop_counter_start 10 --max_train_step 15 --fsdp_activation_checkpointing
The output will be:
# Training Epoch: 1/1, step 14/112 completed (loss: 0.39315265417099): 13%|███▌ | 15/112 [01:02<06:44, 4.17s/it]
# Module FLOP % Total
# --------------------------------------------------------- --------- ---------
# FullyShardedDataParallel 2855.461T 100.00%
# - aten.convolution 0.289T 0.01%
# - aten.bmm 0.002T 0.00%
# - aten.mm 1274.943T 44.65%
# - aten._scaled_dot_product_efficient_attention 516.971T 18.10%
# - aten.addmm 416.754T 14.59%
# - aten._scaled_dot_product_efficient_attention_backward 646.214T 22.63%
# - aten.convolution_backward 0.289T 0.01%
# FullyShardedDataParallel._fsdp_wrapped_module 2855.461T 100.00%
# - aten.convolution 0.289T 0.01%
# - aten.bmm 0.002T 0.00%
# - aten.mm 1274.943T 44.65%
# - aten._scaled_dot_product_efficient_attention 516.971T 18.10%
# - aten.addmm 416.754T 14.59%
# - aten._scaled_dot_product_efficient_attention_backward 646.214T 22.63%
# - aten.convolution_backward 0.289T 0.01%
# Training Epoch: 1/1, step 14/112 completed (loss: 1.0819196701049805): 13%|███▎ | 15/112 [01:01<06:37, 4.10s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.5718942880630493): 13%|███▎ | 15/112 [01:02<06:45, 4.18s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.7083172798156738): 13%|███▎ | 15/112 [01:02<06:42, 4.15s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.29959213733673096): 13%|███▏ | 15/112 [01:01<06:35, 4.08s/it]
# Max CUDA memory allocated was 69 GB
# Max CUDA memory reserved was 75 GB
# Peak active CUDA memory was 69 GB
# CUDA Malloc retries : 2
# CPU Total Peak Memory consumed during the train (max): 3 GB
# Epoch 1: train_perplexity=1.0953, train_epoch_loss=0.0910, epoch time 63.54467297301744s
# training params are saved in /home/dougljia/llama-cookbook/finetuned_model/fine-tuned-meta-llama/Llama-3.2-11B-Vision-Instruct/train_params.yaml
# Key: avg_train_prep, Value: 1.0953115224838257
# Key: avg_train_loss, Value: 0.09103880822658539
# Key: avg_epoch_time, Value: 63.54467297301744
# Key: avg_checkpoint_time, Value: 1.8998980522155762e-07
# Key: model_tflops, Value: 725.586666664745
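If it helps with triage, the per-op breakdown also shows that the dispatched attention op changed between the two runs (aten._scaled_dot_product_cudnn_attention on 2.5/2.6 vs aten._scaled_dot_product_efficient_attention on the nightly). A much smaller repro along these lines (a hypothetical sketch, not the cookbook's code; shapes are only illustrative) run under both versions should show whether the SDPA FLOP formulas alone explain the gap:

import torch
import torch.nn.functional as F
from torch.utils.flop_counter import FlopCounterMode

# Illustrative shapes only (batch, heads, seq, head_dim); run the same script
# under torch 2.5/2.6 and the nightly and diff the printed per-op FLOP counts.
q = torch.randn(2, 32, 4096, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

with FlopCounterMode(display=True) as flop_counter:
    out = F.scaled_dot_product_attention(q, k, v)
    out.sum().backward()

print(flop_counter.get_total_flops())

Note that which SDPA backend gets picked can differ between builds, which is part of what changed between the two runs above.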
Additionally, with the FLOP counter on, each iteration takes about 4.1 s, but without the counter each step takes only about 1.8 s. Does this imply that the actual throughput is more than 1400 TFLOPS/s per GPU (which is not possible)?
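The same back-of-the-envelope check with the counter-off step time (again assuming the counted total is per rank per optimizer step):

flops_per_step = 2855.461e12  # per-rank total reported by the counter on the nightly
step_time_s = 1.8             # per-step time with the FLOP counter disabled
print(f"{flops_per_step / step_time_s / 1e12:.0f} TFLOPS/s per GPU")  # ~1586, well above H100 peak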
Could you please take a look? Thank you!
Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3
Nvidia driver version: 560.35.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4800.17
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @robieta @chaekit @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise