@ProExpertProg (Collaborator) commented Sep 11, 2025

Adding benchmark to existing script, with support for all shapes. Meant to help out #24342.

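B200 results are below. Apart from per-tensor dynamic scales (group shape (-1, -1)), where the CUDA kernel is about 5.2x faster, the torch.compile implementation is faster than both the CUDA and the Triton kernels. Its advantage over CUDA reaches 1.57x geomean (1/0.636), and the gap to Triton is larger still.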

Speedup over Torch (Compiled)
 col_major group_shape     CUDA   Triton
     False    (-1, -1) 5.188913      inf
     False     (1, -1) 0.928436      inf
     False     (1, 64) 0.636413 0.264822
     False    (1, 128) 0.851472 0.414274
      True     (1, 64) 0.654376 0.293654
      True    (1, 128) 0.910548 0.471288
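
For readers who want to check the summary numbers, here is a minimal sketch of how such a geomean speedup could be computed from the timing columns. It assumes the speedup is the Torch (Compiled) time divided by the other implementation's time, aggregated with a geometric mean, and that a recorded time of 0 (as in the Triton column for some group shapes) yields `inf`. The helper name `geomean_speedup` is hypothetical and is not taken from the benchmark script; the three sample rows are copied from the full results for col_major=False, group_shape=(1, 64).

```python
import math


def geomean_speedup(baseline_times, impl_times):
    """Geometric mean of baseline/impl time ratios.

    Hypothetical helper for illustration only. A value below 1 means the
    baseline (here: Torch (Compiled)) is faster. Returns inf if any impl
    time is 0, which is assumed to mean the implementation was not run.
    """
    if any(t == 0 for t in impl_times):
        return math.inf
    logs = [math.log(b / t) for b, t in zip(baseline_times, impl_times)]
    return math.exp(sum(logs) / len(logs))


# Three rows from the full results (col_major=False, group_shape=(1, 64)).
torch_compiled = [2.762939, 25.334297, 4948.124695]
cuda = [3.026017, 45.476465, 7708.298683]
triton = [2.850835, 144.326158, 20053.744316]

print("CUDA vs Torch (Compiled):  ", geomean_speedup(torch_compiled, cuda))
print("Triton vs Torch (Compiled):", geomean_speedup(torch_compiled, triton))
```

Because only three rows are used, the printed values will not match the summary table, which aggregates over every batch size and hidden size for a given layout and group shape.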
Full results
QuantFP8 performance:
     hidden_size  batch_size  col_major group_shape  Torch (Compiled)         CUDA        Triton
0              1           1      False    (-1, -1)          2.883091    12.171749      0.000000
1             16           1      False    (-1, -1)          7.367291    13.908626      0.000000
2             64           1      False    (-1, -1)         12.385259    14.125941      0.000000
3            128           1      False    (-1, -1)         19.442213    14.427254      0.000000
4            256           1      False    (-1, -1)         32.477913    14.494847      0.000000
5            512           1      False    (-1, -1)         57.917011    16.473432      0.000000
6           1024           1      False    (-1, -1)        110.735627    18.549175      0.000000
7           2048           1      False    (-1, -1)        210.799133    28.107615      0.000000
8           4096           1      False    (-1, -1)        417.107831    47.828042      0.000000
9              1          16      False    (-1, -1)          7.372255    13.896552      0.000000
10            16          16      False    (-1, -1)         32.464888    14.503759      0.000000
11            64          16      False    (-1, -1)        110.742910    18.602913      0.000000
12           128          16      False    (-1, -1)        210.841858    28.089034      0.000000
13           256          16      False    (-1, -1)        416.955492    47.807567      0.000000
14           512          16      False    (-1, -1)       1246.435165   100.707671      0.000000
15          1024          16      False    (-1, -1)       2532.836505   219.299457      0.000000
16          2048          16      False    (-1, -1)       5073.546727   425.454264      0.000000
17          4096          16      False    (-1, -1)      10333.471775  1023.561022      0.000000
18             1          32      False    (-1, -1)          9.229187    13.885136      0.000000
19            16          32      False    (-1, -1)         57.914544    16.429869      0.000000
20            64          32      False    (-1, -1)        210.633479    28.114413      0.000000
21           128          32      False    (-1, -1)        416.955492    47.903653      0.000000
22           256          32      False    (-1, -1)       1248.450089   100.709837      0.000000
23           512          32      False    (-1, -1)       2533.972604   218.814915      0.000000
24          1024          32      False    (-1, -1)       5073.210716   425.471658      0.000000
25          2048          32      False    (-1, -1)      10333.040237  1023.609037      0.000000
26          4096          32      False    (-1, -1)      20645.311356  2024.583975      0.000000
27             1          64      False    (-1, -1)         12.379711    14.163193      0.000000
28            16          64      False    (-1, -1)        110.958742    18.575772      0.000000
29            64          64      False    (-1, -1)        416.910959    48.109178      0.000000
30           128          64      False    (-1, -1)       1251.018651   100.739310      0.000000
31           256          64      False    (-1, -1)       2534.432002   218.634367      0.000000
32           512          64      False    (-1, -1)       5073.498726   425.464008      0.000000
33          1024          64      False    (-1, -1)      10338.255882  1023.516531      0.000000
34          2048          64      False    (-1, -1)      20644.271851  2024.601380      0.000000
35          4096          64      False    (-1, -1)      41266.143799  4024.530570      0.000000
36             1         128      False    (-1, -1)         19.457667    14.396398      0.000000
37            16         128      False    (-1, -1)        210.838531    28.155110      0.000000
38            64         128      False    (-1, -1)       1242.978160   100.497325      0.000000
39           128         128      False    (-1, -1)       2534.605707   219.093453      0.000000
40           256         128      False    (-1, -1)       5073.034604   425.403471      0.000000
41           512         128      False    (-1, -1)      10335.391998  1023.513048      0.000000
42          1024         128      False    (-1, -1)      20648.096085  2024.623950      0.000000
43          2048         128      False    (-1, -1)      41268.096924  4024.717331      0.000000
44          4096         128      False    (-1, -1)       8275.111675  8024.319967      0.000000
45             1           1      False     (1, -1)          3.134108     4.060020      0.000000
46            16           1      False     (1, -1)          3.427785     4.277234      0.000000
47            64           1      False     (1, -1)          3.520462     4.386433      0.000000
48           128           1      False     (1, -1)          3.609057     4.489328      0.000000
49           256           1      False     (1, -1)          4.421333     5.036036      0.000000
50           512           1      False     (1, -1)          5.965148     6.602702      0.000000
51          1024           1      False     (1, -1)          8.923592    10.081602      0.000000
52          2048           1      False     (1, -1)         15.122080    15.959404      0.000000
53          4096           1      False     (1, -1)         28.560640    29.943732      0.000000
54             1          16      False     (1, -1)          3.437647     4.305638      0.000000
55            16          16      False     (1, -1)          4.416621     5.032333      0.000000
56            64          16      False     (1, -1)          8.946080    10.084736      0.000000
57           128          16      False     (1, -1)         15.127093    15.956857      0.000000
58           256          16      False     (1, -1)         28.503059    29.912759      0.000000
59           512          16      False     (1, -1)         62.006400    63.806423      0.000000
60          1024          16      False     (1, -1)        122.140799   124.085050      0.000000
61          2048          16      False     (1, -1)        241.689558   244.872914      0.000000
62          4096          16      False     (1, -1)        674.907613   681.216407      0.000000
63             1          32      False     (1, -1)          3.472305     4.315280      0.000000
64            16          32      False     (1, -1)          5.970980     6.595621      0.000000
65            64          32      False     (1, -1)         15.113600    15.950691      0.000000
66           128          32      False     (1, -1)         28.523089    29.903118      0.000000
67           256          32      False     (1, -1)         62.005386    63.812394      0.000000
68           512          32      False     (1, -1)        122.147629   124.088054      0.000000
69          1024          32      False     (1, -1)        241.692217   244.890334      0.000000
70          2048          32      False     (1, -1)        674.927592   681.224394      0.000000
71          4096          32      False     (1, -1)       1346.823978  1357.751989      0.000000
72             1          64      False     (1, -1)          3.526970     4.387588      0.000000
73            16          64      False     (1, -1)          8.937098    10.084751      0.000000
74            64          64      False     (1, -1)         28.604990    29.990388      0.000000
75           128          64      False     (1, -1)         61.903499    63.724211      0.000000
76           256          64      False     (1, -1)        122.232325   124.162003      0.000000
77           512          64      False     (1, -1)        241.673470   244.848408      0.000000
78          1024          64      False     (1, -1)        674.841619   681.260395      0.000000
79          2048          64      False     (1, -1)       1346.858406  1357.757568      0.000000
80          4096          64      False     (1, -1)       2691.024017  2709.640026      0.000000
81             1         128      False     (1, -1)          3.605126     4.490015      0.000000
82            16         128      False     (1, -1)         15.113505    15.956237      0.000000
83            64         128      False     (1, -1)         62.000323    63.801745      0.000000
84           128         128      False     (1, -1)        122.064168   124.038590      0.000000
85           256         128      False     (1, -1)        241.680418   244.889308      0.000000
86           512         128      False     (1, -1)        674.919200   681.264400      0.000000
87          1024         128      False     (1, -1)       1346.860027  1357.774401      0.000000
88          2048         128      False     (1, -1)       2690.929604  2709.835243      0.000000
89          4096         128      False     (1, -1)       5376.534271  5411.731148      0.000000
90             1           1       True     (1, 64)          2.763574     3.034534      2.843692
91            16           1       True     (1, 64)          4.485459     3.360267      3.157333
92            64           1       True     (1, 64)          4.638418     3.624858      4.616264
93           128           1       True     (1, 64)          4.761500     4.147604      6.776809
94           256           1       True     (1, 64)          5.471333     5.721745     11.125818
95           512           1       True     (1, 64)          6.551305     8.487796     19.830472
96          1024           1       True     (1, 64)          9.289565    14.399479     37.414587
97          2048           1       True     (1, 64)         14.391912    25.775028     72.494545
98          4096           1       True     (1, 64)         26.188800    50.029220    144.374143
99             1          16       True     (1, 64)          4.491404     3.391747      3.130383
100           16          16       True     (1, 64)          5.460396     5.707412     11.117843
101           64          16       True     (1, 64)          9.313032    14.416736     37.431817
102          128          16       True     (1, 64)         14.407560    25.772228     72.499678
103          256          16       True     (1, 64)         26.124585    49.950143    144.322203
104          512          16       True     (1, 64)         55.314238   108.841251    291.141254
105         1024          16       True     (1, 64)        107.397452   216.629575    579.961861
106         2048          16       True     (1, 64)        210.952398   430.700790   1158.461178
107         4096          16       True     (1, 64)        612.862926  1052.128004   2509.212017
108            1          32       True     (1, 64)          4.554939     3.442185      3.467605
109           16          32       True     (1, 64)          6.555915     8.487949     19.836612
110           64          32       True     (1, 64)         14.390330    25.777934     72.491218
111          128          32       True     (1, 64)         26.200851    50.026226    144.356278
112          256          32       True     (1, 64)         55.381849   108.857566    291.198090
113          512          32       True     (1, 64)        107.496791   216.879459    580.223532
114         1024          32       True     (1, 64)        210.949333   430.684090   1158.512901
115         2048          32       True     (1, 64)        612.821080  1052.111999   2509.225965
116         4096          32       True     (1, 64)       1220.242791  2100.859642   5015.915871
117            1          64       True     (1, 64)          4.640990     3.627710      4.601067
118           16          64       True     (1, 64)          9.301277    14.401637     37.422345
119           64          64       True     (1, 64)         26.184000    50.024022    144.369103
120          128          64       True     (1, 64)         55.397333   108.856229    291.203826
121          256          64       True     (1, 64)        107.441274   216.847098    580.218820
122          512          64       True     (1, 64)        210.967398   430.737771   1158.439524
123         1024          64       True     (1, 64)        612.847286  1052.199654   2509.089947
124         2048          64       True     (1, 64)       1220.352712  2100.958564   5015.500069
125         4096          64       True     (1, 64)       2435.514190  4197.494507  10028.480053
126            1         128       True     (1, 64)          4.756784     4.148005      6.747685
127           16         128       True     (1, 64)         14.403556    25.767069     72.494712
128           64         128       True     (1, 64)         55.505582   108.979418    291.173010
129          128         128       True     (1, 64)        107.228561   216.699322    580.036724
130          256         128       True     (1, 64)        210.961998   430.722152   1158.431053
131          512         128       True     (1, 64)        612.867269  1052.191320   2508.980036
132         1024         128       True     (1, 64)       1220.253198  2101.005034   5015.872002
133         2048         128       True     (1, 64)       2435.470494  4197.084808  10028.879642
134         4096         128       True     (1, 64)       4975.279999  8390.416145  20054.080009
135            1           1      False     (1, 64)          2.762939     3.026017      2.850835
136           16           1      False     (1, 64)          2.970894     3.286393      3.116082
137           64           1      False     (1, 64)          3.079456     3.553044      4.595833
138          128           1      False     (1, 64)          3.204202     4.065799      6.740989
139          256           1      False     (1, 64)          3.912238     5.604339     11.124909
140          512           1      False     (1, 64)          5.043629     7.930898     19.832728
141         1024           1      False     (1, 64)          7.921280    13.187439     37.406577
142         2048           1      False     (1, 64)         13.305237    23.488458     72.477510
143         4096           1      False     (1, 64)         25.334297    45.476465    144.326158
144            1          16      False     (1, 64)          2.971765     3.316764      3.122925
145           16          16      False     (1, 64)          3.918891     5.584376     11.110261
146           64          16      False     (1, 64)          7.939918    13.205882     37.428651
147          128          16      False     (1, 64)         13.298080    23.488266     72.490606
148          256          16      False     (1, 64)         25.242139    45.407283    144.281244
149          512          16      False     (1, 64)         55.477978    98.384418    291.165622
150         1024          16      False     (1, 64)        109.047316   195.192896    579.942114
151         2048          16      False     (1, 64)        215.345359   387.919674   1158.402836
152         4096          16      False     (1, 64)        621.794128   967.087364   2509.188056
153            1          32      False     (1, 64)          2.999146     3.360393      3.446298
154           16          32      False     (1, 64)          5.043040     7.924601     19.809549
155           64          32      False     (1, 64)         13.304490    23.483716     72.490968
156          128          32      False     (1, 64)         25.351674    45.472836    144.342253
157          256          32      False     (1, 64)         55.448247    98.423913    291.184482
158          512          32      False     (1, 64)        109.250355   195.436285    580.162357
159         1024          32      False     (1, 64)        215.357850   387.921581   1158.435765
160         2048          32      False     (1, 64)        621.742566   967.111053   2509.225965
161         4096          32      False     (1, 64)       1240.242087  1930.636009   5015.887976
162            1          64      False     (1, 64)          3.072762     3.553000      4.596042
163           16          64      False     (1, 64)          7.937796    13.190251     37.413803
164           64          64      False     (1, 64)         25.305347    45.485774    144.355202
165          128          64      False     (1, 64)         55.459518    98.418832    291.171102
166          256          64      False     (1, 64)        109.258610   195.414715    580.223981
167          512          64      False     (1, 64)        215.344790   387.894402   1158.416916
168         1024          64      False     (1, 64)        621.843931   967.107849   2509.143949
169         2048          64      False     (1, 64)       1240.365194  1930.536032   5016.236067
170         4096          64      False     (1, 64)       2476.624055  3856.629372  10028.928280
171            1         128      False     (1, 64)          3.209663     4.061244      6.744589
172           16         128      False     (1, 64)         13.308408    23.484122     72.473600
173           64         128      False     (1, 64)         55.608907    98.541654    291.173252
174          128         128      False     (1, 64)        109.015318   195.239761    580.039529
175          256         128      False     (1, 64)        215.332869   387.912636   1158.418824
176          512         128      False     (1, 64)        621.805869   967.174416   2509.175897
177         1024         128      False     (1, 64)       1240.225419  1930.506627   5016.344070
178         2048         128      False     (1, 64)       2476.455255  3856.664022  10029.567719
179         4096         128      False     (1, 64)       4948.124695  7708.298683  20053.744316
180            1           1       True    (1, 128)          2.758435     3.043615      2.862564
181           16           1       True    (1, 128)          4.687816     3.360487      3.082453
182           64           1       True    (1, 128)          4.806833     3.485621      3.549333
183          128           1       True    (1, 128)          4.876295     3.537790      4.668267
184          256           1       True    (1, 128)          5.417000     4.259854      6.941365
185          512           1       True    (1, 128)          6.561702     5.844019     11.408527
186         1024           1       True    (1, 128)          8.844835     9.290385     20.606741
187         2048           1       True    (1, 128)         13.603574    15.730864     38.884781
188         4096           1       True    (1, 128)         24.508309    29.680863     76.973999
189            1          16       True    (1, 128)          4.654515     3.389630      3.083699
190           16          16       True    (1, 128)          5.410435     4.250747      6.946697
191           64          16       True    (1, 128)          8.844215     9.302109     20.635326
192          128          16       True    (1, 128)         13.597890    15.728170     38.860225
193          256          16       True    (1, 128)         24.463149    29.647506     76.932181
194          512          16       True    (1, 128)         51.645074    67.246371    156.497100
195         1024          16       True    (1, 128)         98.864721   132.336001    310.735233
196         2048          16       True    (1, 128)        193.024938   261.698594    620.206987
197         4096          16       True    (1, 128)        575.487041   714.297939   1432.509005
198            1          32       True    (1, 128)          4.720646     3.384399      3.160629
199           16          32       True    (1, 128)          6.589617     5.844182     11.409422
200           64          32       True    (1, 128)         13.612622    15.727933     38.853933
201          128          32       True    (1, 128)         24.503304    29.681223     76.953454
202          256          32       True    (1, 128)         51.589389    67.193798    156.517457
203          512          32       True    (1, 128)         98.974946   132.435554    311.055486
204         1024          32       True    (1, 128)        193.039614   261.693619    620.181145
205         2048          32       True    (1, 128)        575.506248   714.307476   1432.536006
206         4096          32       True    (1, 128)       1146.398697  1424.682115   2862.574100
207            1          64       True    (1, 128)          4.808653     3.486566      3.536356
208           16          64       True    (1, 128)          8.852301     9.289909     20.614112
209           64          64       True    (1, 128)         24.528000    29.737940     76.993302
210          128          64       True    (1, 128)         51.584862    67.195727    156.485455
211          256          64       True    (1, 128)         98.999913   132.440394    311.061087
212          512          64       True    (1, 128)        193.000862   261.681505    620.164133
213         1024          64       True    (1, 128)        575.446396   714.263787   1432.569027
214         2048          64       True    (1, 128)       1146.431999  1424.560848   2862.714052
215         4096          64       True    (1, 128)       2287.967975  2844.389280   5722.555876
216            1         128       True    (1, 128)          4.885333     3.536037      4.676622
217           16         128       True    (1, 128)         13.604716    15.722652     38.857116
218           64         128       True    (1, 128)         51.715833    67.393788    156.562606
219          128         128       True    (1, 128)         98.904182   132.288668    310.808379
220          256         128       True    (1, 128)        193.009881   261.704651    620.221496
221          512         128       True    (1, 128)        575.487351   714.327348   1432.586968
222         1024         128       True    (1, 128)       1146.414170  1424.435365   2862.849951
223         2048         128       True    (1, 128)       2287.958145  2844.369782   5722.423792
224         4096         128       True    (1, 128)       4566.914717  5683.860064  11441.159725
225            1           1      False    (1, 128)          2.765305     3.044451      2.853063
226           16           1      False    (1, 128)          3.220044     3.284058      3.075064
227           64           1      False    (1, 128)          3.309514     3.390130      3.531404
228          128           1      False    (1, 128)          3.406923     3.450224      4.652833
229          256           1      False    (1, 128)          3.946000     4.128372      6.932044
230          512           1      False    (1, 128)          5.103674     5.725761     11.398578
231         1024           1      False    (1, 128)          7.211200     8.609532     20.590400
232         2048           1      False    (1, 128)         11.854333    14.519183     38.856909
233         4096           1      False    (1, 128)         22.307041    27.416357     76.930134
234            1          16      False    (1, 128)          3.232615     3.311902      3.104344
235           16          16      False    (1, 128)          3.948167     4.125146      6.911830
236           64          16      False    (1, 128)          7.222465     8.623327     20.616870
237          128          16      False    (1, 128)         11.852206    14.519950     38.847461
238          256          16      False    (1, 128)         22.295834    27.390898     76.899306
239          512          16      False    (1, 128)         49.187169    61.770049    156.497410
240         1024          16      False    (1, 128)         95.324130   121.285744    310.716690
241         2048          16      False    (1, 128)        187.868772   239.616783    620.237935
242         4096          16      False    (1, 128)        566.639695   670.621267   1432.497978
243            1          32      False    (1, 128)          3.245624     3.310083      3.150945
244           16          32      False    (1, 128)          5.119360     5.723191     11.404182
245           64          32      False    (1, 128)         11.849796    14.518216     38.842607
246          128          32      False    (1, 128)         22.305292    27.424782     76.940182
247          256          32      False    (1, 128)         49.139918    61.770921    156.477429
248          512          32      False    (1, 128)         95.543319   121.374891    311.054472
249         1024          32      False    (1, 128)        187.874202   239.607905    620.220369
250         2048          32      False    (1, 128)        566.629245   670.680581   1432.590961
251         4096          32      False    (1, 128)       1129.735360  1337.570381   2862.668037
252            1          64      False    (1, 128)          3.314980     3.388164      3.536653
253           16          64      False    (1, 128)          7.212735     8.611956     20.597099
254           64          64      False    (1, 128)         22.354400    27.455734     76.961758
255          128          64      False    (1, 128)         49.142905    61.770176    156.480172
256          256          64      False    (1, 128)         95.530608   121.370670    311.042801
257          512          64      False    (1, 128)        187.858049   239.621327    620.180653
258         1024          64      False    (1, 128)        566.666659   670.606823   1432.489991
259         2048          64      False    (1, 128)       1129.783374  1337.552023   2862.748027
260         4096          64      False    (1, 128)       2255.399411  2670.203209   5722.807884
261            1         128      False    (1, 128)          3.400404     3.449124      4.650247
262           16         128      False    (1, 128)         11.853120    14.515769     38.839272
263           64         128      False    (1, 128)         49.261225    61.902635    156.534682
264          128         128      False    (1, 128)         95.339283   121.236208    310.799493
265          256         128      False    (1, 128)        187.863189   239.609483    620.258510
266          512         128      False    (1, 128)        566.599210   670.548299   1432.632983
267         1024         128      False    (1, 128)       1129.825262  1337.493610   2862.669945
268         2048         128      False    (1, 128)       2255.497859  2670.243168   5723.012209
269         4096         128      False    (1, 128)       4566.255887  5332.851219  11441.559792

tahsintunan and others added 5 commits September 6, 2025 00:30
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@mergify mergify bot added the performance Performance-related issues label Sep 11, 2025

mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
