Problem about bandwidth test #6

Open
blueWatermelonFri opened this issue Apr 16, 2024 · 5 comments

Comments

@blueWatermelonFri

blueWatermelonFri commented Apr 16, 2024

I tried to test bandwidth with the cuda-stream benchmark. My device is a 4060 Ti, whose memory bandwidth is 288 GB/s.

I changed the parameter max_buffer_size from 128l * 1024 * 1024 + 2 to 1024 * 1024, as follows:

#include "../MeasurementSeries.hpp"
#include "../dtime.hpp"
#include "../gpu-error.h"
#include <iomanip>
#include <iostream>

using namespace std;

const int64_t max_buffer_size = 1024 * 1024;  // changed from 128l * 1024 * 1024 + 2
double *dA, *dB, *dC, *dD;

The results are much larger than 288 GB/s, as shown below:

blockSize   threads       %occ  |                init       read       scale     triad       3pt        5pt
       32        1088      4 %  |  GB/s:         108         67        130        191        119        100
       64        2176    8.3 %  |  GB/s:         209        128        250        362        225        184
       96        3264   12.5 %  |  GB/s:         294        184        357        514        320        266
      128        4352   16.7 %  |  GB/s:         380        233        453        654        400        313
      160        5440   20.8 %  |  GB/s:         443        279        541        773        466        275
      192        6528   25.0 %  |  GB/s:         506        323        620        880        533        308
      224        7616   29.2 %  |  GB/s:         577        357        697        968        589        313
      256        8704   33.3 %  |  GB/s:         646        391        765       1072        634        419
      288        9792   37.5 %  |  GB/s:         697        419        818       1141        634        329
      320       10880   41.7 %  |  GB/s:         733        454        885       1227        673        356
      352       11968   45.8 %  |  GB/s:         800        479        932       1287        745        415
      384       13056   50.0 %  |  GB/s:         800        506        984       1362        782        430
      416       14144   54.2 %  |  GB/s:         800        525       1020       1436        800        430
      448       15232   58.3 %  |  GB/s:         838        558       1050       1476        818        430
      480       16320   62.5 %  |  GB/s:         800        563       1117       1530        863        430
      512       17408   66.7 %  |  GB/s:         800        596       1154       1575        885        425
      544       18496   70.8 %  |  GB/s:         800        601       1193       1624        908        430
      576       19584   75.0 %  |  GB/s:         800        623       1203       1675        932        430
      608       20672   79.2 %  |  GB/s:         800        646       1235       1675        957        430
      640       21760   83.3 %  |  GB/s:         800        646       1245       1730        957        425
      672       22848   87.5 %  |  GB/s:         800        670       1291       1730        908        430
      704       23936   91.7 %  |  GB/s:         800        670       1291       1730        957        430
      736       25024   95.8 %  |  GB/s:         800        697       1340       1804        957        430
      768       26112  100.0 %  |  GB/s:         800        697       1340       1745        957        425

When I don't change the parameters, the test results are normal. The same phenomenon occurred on my other A800 machine.

So have you ever had that happen to you?

@te42kyfo
Owner

At 1024 * 1024 elements, the total data volume per array is 1024 * 1024 * sizeof(double) = 8MB. The 4060 Ti has 32MB of L2 cache, so even the triad test, which uses three arrays, can fit its entire data set into the L2 cache. The same goes for the A800, with its 2x40MB of L2 cache.

You have essentially changed the test into a L2 cache bandwidth benchmark! :-)
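For illustration, a back-of-the-envelope check of that working set (a minimal sketch; 1024 * 1024 is the reduced max_buffer_size and 32 MiB is the L2 size discussed above):

#include <cstdint>
#include <cstdio>

int main() {
    const int64_t elements = 1024 * 1024;                       // reduced max_buffer_size
    const int64_t bytes_per_array = elements * sizeof(double);  // 8 MiB per array
    const int64_t triad_bytes = 3 * bytes_per_array;            // triad touches 3 arrays -> 24 MiB
    const int64_t l2_bytes = 32l * 1024 * 1024;                 // 32 MiB L2 (4060 Ti)
    printf("per array: %lld MiB, triad working set: %lld MiB, fits in L2: %s\n",
           (long long)(bytes_per_array >> 20), (long long)(triad_bytes >> 20),
           triad_bytes <= l2_bytes ? "yes" : "no");
    return 0;
}

With the original 128l * 1024 * 1024 + 2 elements, each array is about 1 GiB, far larger than any L2, which is why those runs report actual memory bandwidth.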

@blueWatermelonFri
Author

@te42kyfo Thanks for your answer. When I flush the L2 cache, the results return to normal.
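(For reference, one common way to flush L2 between measurements is to stream through a scratch buffer larger than the cache. This is a minimal sketch under that assumption, not the benchmark's actual code; flushL2 and the 128 MiB scratch size are hypothetical names/values.)

#include <cuda_runtime.h>

// Hypothetical helper: overwrite a scratch buffer larger than L2 so that
// previously cached benchmark arrays are evicted before the next timed kernel.
void flushL2(void *d_scratch, size_t scratch_bytes) {
    cudaMemsetAsync(d_scratch, 0, scratch_bytes, 0);
    cudaDeviceSynchronize();
}

int main() {
    const size_t scratch_bytes = 128ull * 1024 * 1024;  // comfortably larger than 32/40 MiB L2
    void *d_scratch = nullptr;
    cudaMalloc(&d_scratch, scratch_bytes);
    flushL2(d_scratch, scratch_bytes);                   // call before each timed kernel launch
    cudaFree(d_scratch);
    return 0;
}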

But I still have a question: my device is an A800 80GB PCIe. How do you know the L2 cache is 2x40MB? Where did you see that?

@te42kyfo
Owner

As far as I could google, the A800 is made from two GA100 chips, each of which has 40MB of cache. You should also be able to query that as the device property "l2CacheSize" in the cudaDeviceProp structure.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0
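For reference, a minimal sketch of that query with the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("%s: L2 cache size = %d bytes (%.1f MiB)\n",
           prop.name, prop.l2CacheSize, prop.l2CacheSize / (1024.0 * 1024.0));
    return 0;
}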

@blueWatermelonFri
Author

Hey @te42kyfo, I ran the deviceQuery sample from the CUDA Samples; part of the output is shown below:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A800 80GB PCIe"
  CUDA Driver Version / Runtime Version          12.2 / 12.1
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 81051 MBytes (84987740160 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1512 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes

The results show that the L2 cache size is 40MB, but as far as I could google, the L2 cache size of the A800 80GB is 80MB. Where did you get the following information?

the A800 is made from two GA100 chips, each of which has 40MB of cache

@te42kyfo
Owner

You are right, my info was faulty. The A800 is just one model based on the GA100 chip, which has 40MB of L2. I only googled it quickly because I hadn't encountered that SKU before, and drew the wrong conclusion.
