🐛 Describe the bug
I am fine-tuning the Llama-3.2-11B-Vision-Instruct model with the llama-cookbook repo on H100s. The total FLOP count reported with PyTorch 2.5 and 2.6 is ~186 TFLOPs, but with the PyTorch nightly (2.7) it is ~2900 TFLOPs (batch size is 2). Could you please take a look at whether the FLOP counting here is correct? Both numbers look wrong: if the total really were 2900 TFLOPs, the implied throughput per H100 would exceed 700 TFLOPS/s, which is absurdly high.
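For context, here is the back-of-the-envelope arithmetic behind that estimate, as a small Python sketch (the per-step FLOP count and step time are the approximate values from the nightly run below; it assumes the counter's total is per rank per optimizer step, and uses the roughly 989 TFLOPS dense BF16 peak for an H100 SXM):

# Rough sanity check of the implied per-GPU throughput (approximate numbers).
flops_per_step = 2900e12   # ~2900 TFLOPs reported per rank per step on the nightly
step_time_s = 4.1          # ~4.1 s per iteration with the FLOP counter enabled
h100_bf16_peak = 989e12    # H100 SXM dense BF16 peak, roughly 989 TFLOPS

achieved = flops_per_step / step_time_s
print(f"implied throughput: {achieved / 1e12:.0f} TFLOPS/s per GPU")  # ~707
print(f"implied MFU: {achieved / h100_bf16_peak:.0%}")                # >70%, implausible for FSDP fine-tuning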
Below are the steps to reproduce:
git clone https://github.com/meta-llama/llama-cookbook.git
docker run --name llama-cookbook --shm-size=64g --gpus all -it --rm -v /home/dougljia:/home/dougljia -e HF_HOME=/home/dougljia/model nvcr.io/nvidia/pytorch:24.12-py3
cd /home/dougljia/llama-cookbook
pip install -U pip setuptools
pip install -e .
pip install huggingface_hub transformers fire
huggingface-cli login --token <replace with your token>
torchrun --nnodes 1 --nproc_per_node 8 getting-started/finetuning/finetuning.py --enable_fsdp --lr 1e-6 --num_epochs 1 --batch_size_training 2 \
--model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" \
--custom_dataset.test_split "test" --custom_dataset.file "getting-started/finetuning/datasets/ocrvqa_dataset.py" --run_validation False --save_model False --batching_strategy padding \
--flop_counter --flop_counter_start 10 --max_train_step 15 --fsdp_activation_checkpointing
The output will be:
# Module FLOP % Total
# ----------------------------------------------------- -------- ---------
# FullyShardedDataParallel 189.021T 100.00%
# - aten.convolution 0.019T 0.01%
# - aten.bmm 0.000T 0.00%
# - aten.mm 83.731T 44.30%
# - aten._scaled_dot_product_cudnn_attention 34.430T 18.21%
# - aten.addmm 27.784T 14.70%
# - aten._scaled_dot_product_cudnn_attention_backward 43.037T 22.77%
# - aten.convolution_backward 0.019T 0.01%
# FullyShardedDataParallel._fsdp_wrapped_module 189.021T 100.00%
# - aten.convolution 0.019T 0.01%
# - aten.bmm 0.000T 0.00%
# - aten.mm 83.731T 44.30%
# - aten._scaled_dot_product_cudnn_attention 34.430T 18.21%
# - aten.addmm 27.784T 14.70%
# - aten._scaled_dot_product_cudnn_attention_backward 43.037T 22.77%
# - aten.convolution_backward 0.019T 0.01%
# Training Epoch: 1/1, step 14/112 completed (loss: 0.4789400100708008): 13%|███▎ | 15/112 [01:41<10:56, 6.77s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.3038587272167206): 13%|███▎ | 15/112 [01:40<10:52, 6.73s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.7101249694824219): 13%|███▎ | 15/112 [01:40<10:48, 6.69s/it]
# Max CUDA memory allocated was 69 GB
# Max CUDA memory reserved was 77 GB
# Peak active CUDA memory was 69 GB
# CUDA Malloc retries : 0
# CPU Total Peak Memory consumed during the train (max): 3 GB
# Epoch 1: train_perplexity=1.0954, train_epoch_loss=0.0911, epoch time 103.91605499701109s
# training params are saved in /home/dougljia/llama-cookbook/finetuned_model/fine-tuned-meta-llama/Llama-3.2-11B-Vision-Instruct/train_params.yaml
# Key: avg_train_prep, Value: 1.095421552658081
# Key: avg_train_loss, Value: 0.0911393016576767
# Key: avg_epoch_time, Value: 103.91605499701109
# Key: avg_checkpoint_time, Value: 4.4002081267535686e-07
# Key: model_tflops, Value: 31.987115589590758
If you instead install the PyTorch nightly in the same docker image:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126
cd /home/dougljia/llama-cookbook
pip install -U pip setuptools
pip install -e .
pip install huggingface_hub transformers fire
huggingface-cli login --token <replace with your token>
torchrun --nnodes 1 --nproc_per_node 8 getting-started/finetuning/finetuning.py --enable_fsdp --lr 1e-6 --num_epochs 1 --batch_size_training 2 \
--model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" \
--custom_dataset.test_split "test" --custom_dataset.file "getting-started/finetuning/datasets/ocrvqa_dataset.py" --run_validation False --save_model False --batching_strategy padding \
--flop_counter --flop_counter_start 10 --max_train_step 15 --fsdp_activation_checkpointing
The output will be:
# Training Epoch: 1/1, step 14/112 completed (loss: 0.39315265417099): 13%|███▌ | 15/112 [01:02<06:44, 4.17s/it]
# Module FLOP % Total
# --------------------------------------------------------- --------- ---------
# FullyShardedDataParallel 2855.461T 100.00%
# - aten.convolution 0.289T 0.01%
# - aten.bmm 0.002T 0.00%
# - aten.mm 1274.943T 44.65%
# - aten._scaled_dot_product_efficient_attention 516.971T 18.10%
# - aten.addmm 416.754T 14.59%
# - aten._scaled_dot_product_efficient_attention_backward 646.214T 22.63%
# - aten.convolution_backward 0.289T 0.01%
# FullyShardedDataParallel._fsdp_wrapped_module 2855.461T 100.00%
# - aten.convolution 0.289T 0.01%
# - aten.bmm 0.002T 0.00%
# - aten.mm 1274.943T 44.65%
# - aten._scaled_dot_product_efficient_attention 516.971T 18.10%
# - aten.addmm 416.754T 14.59%
# - aten._scaled_dot_product_efficient_attention_backward 646.214T 22.63%
# - aten.convolution_backward 0.289T 0.01%
# Training Epoch: 1/1, step 14/112 completed (loss: 1.0819196701049805): 13%|███▎ | 15/112 [01:01<06:37, 4.10s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.5718942880630493): 13%|███▎ | 15/112 [01:02<06:45, 4.18s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.7083172798156738): 13%|███▎ | 15/112 [01:02<06:42, 4.15s/it]
# Training Epoch: 1/1, step 14/112 completed (loss: 0.29959213733673096): 13%|███▏ | 15/112 [01:01<06:35, 4.08s/it]
# Max CUDA memory allocated was 69 GB
# Max CUDA memory reserved was 75 GB
# Peak active CUDA memory was 69 GB
# CUDA Malloc retries : 2
# CPU Total Peak Memory consumed during the train (max): 3 GB
# Epoch 1: train_perplexity=1.0953, train_epoch_loss=0.0910, epoch time 63.54467297301744s
# training params are saved in /home/dougljia/llama-cookbook/finetuned_model/fine-tuned-meta-llama/Llama-3.2-11B-Vision-Instruct/train_params.yaml
# Key: avg_train_prep, Value: 1.0953115224838257
# Key: avg_train_loss, Value: 0.09103880822658539
# Key: avg_epoch_time, Value: 63.54467297301744
# Key: avg_checkpoint_time, Value: 1.8998980522155762e-07
# Key: model_tflops, Value: 725.586666664745
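If it helps with triage, the per-op breakdown also shows that the dispatched attention op changed between the two runs (aten._scaled_dot_product_cudnn_attention on 2.5/2.6 vs aten._scaled_dot_product_efficient_attention on the nightly). A much smaller repro along these lines (a hypothetical sketch, not the cookbook's code; shapes are only illustrative) run under both versions should show whether the SDPA FLOP formulas alone explain the gap:

import torch
import torch.nn.functional as F
from torch.utils.flop_counter import FlopCounterMode

# Illustrative shapes only (batch, heads, seq, head_dim); run the same script
# under torch 2.5/2.6 and the nightly and diff the printed per-op FLOP counts.
q = torch.randn(2, 32, 4096, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

with FlopCounterMode(display=True) as flop_counter:
    out = F.scaled_dot_product_attention(q, k, v)
    out.sum().backward()

print(flop_counter.get_total_flops())

Note that which SDPA backend gets picked can differ between builds, which is part of what changed between the two runs above.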
Additionally, with the FLOP counter on, each iteration takes about 4.1 s, but without the counter each step takes only about 1.8 s. Does this imply that the actual throughput is more than 1400 TFLOPS/s per GPU (which is not possible)?
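The same back-of-the-envelope check with the counter-off step time (again assuming the counted total is per rank per optimizer step):

flops_per_step = 2855.461e12  # per-rank total reported by the counter on the nightly
step_time_s = 1.8             # per-step time with the FLOP counter disabled
print(f"{flops_per_step / step_time_s / 1e12:.0f} TFLOPS/s per GPU")  # ~1586, well above H100 peak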
Could you please take a look? Thank you!
Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3
Nvidia driver version: 560.35.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4800.17
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @robieta @chaekit @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise