# Resisting Change: The Relationship between Volta and Ampere


<img src="https://imgs.xkcd.com/comics/data_trap.png"
     alt="It is important to make sure your analysis destroys as much data as it produces" />
It is important to make sure your analysis destroys as much data as it produces

In [None]:
import sys 
sys.path.append('../pystencils')
sys.path.append('../genpredict')

from predict_metrics import *
from meas_db import MeasDB

meas_db = MeasDB("3dstencils.db")

from measured_metrics import MeasuredMetrics, ResultComparer
from plot_utils import *


In [None]:
predValuesV100 = dict()
measValuesV100 = dict()

device = DeviceVolta()
print(device.name)

def nextBlockSize():
    for xblock in [4, 8, 16, 32, 64, 128, 256, 512]:
        for yblock in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
            for zblock in [1, 2, 4, 8, 16, 32, 64]:
                if xblock*yblock*zblock not in [512, 1024]:
                    continue
                yield (xblock, yblock, zblock)    


r = 4

for blockingFactors in [(1,1,1)]:
    for block in nextBlockSize():
        
        key = (r, *block, blockingFactors)

        lc, basic, meas = meas_db.getEntry(r, block, blockingFactors, device)
       
        if basic is None or meas is None:
            continue
            
        metrics = DerivedMetrics(lc, basic, device, meas)

        measValuesV100[key] = meas
        predValuesV100[key] = metrics

        print(str(lc), end="")
        print(str(basic), end="--\n")
        rc = ResultComparer(meas, metrics)
        print(str(rc))              

        print()
 

In [None]:
volumeScatterPlot([(k[1:4], measValuesV100[k].memLoad, predValuesV100[k].memLoadV1, k[4]) for k in measValuesV100], "V100 Memory Load Volumes V1")
volumeScatterPlot([(k[1:4], measValuesV100[k].memLoad, predValuesV100[k].memLoadV3, k[4], predValuesV100[k].memLoadV1) for k in measValuesV100], "V100 Memory Load Volumes V4")

The memory load volumes are well predicted. The thread block sizes shown in purple require modeling the interaction with previous waves, which the second graph includes.

In [None]:
volumeScatterPlot([(k[1:4], measValuesV100[k].L2Load, predValuesV100[k].L2LoadV1, k[4]) for k in measValuesV100], "V100 Stencil L2 Load Volumes V1")
volumeScatterPlot([(k[1:4], measValuesV100[k].L2Load, predValuesV100[k].L2LoadV2, k[4], predValuesV100[k].L2LoadV1) for k in measValuesV100], "V100 Stencil L2 Load Volumes V2")

The L2 volumes are similarly well predicted. Capacity misses cause a slight underprediction in the first graph. The second graph adds a capacity miss prediction, which slightly overpredicts.
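
To put a number on "slightly" under- or overpredicted, one option is to summarize the predicted-to-measured ratio over all thread block sizes. The sketch below is not part of `genpredict`: `prediction_ratios` and the sample volumes are hypothetical. In this notebook the pairs would come from `measValuesV100[k].L2Load` and `predValuesV100[k].L2LoadV1`.

```python
import math


def prediction_ratios(pairs):
    """Return (geometric mean, min, max) of the predicted/measured ratios.

    `pairs` is an iterable of (measured, predicted) values. Entries with a
    non-positive value are skipped because the ratio is undefined for them.
    """
    ratios = [p / m for m, p in pairs if m > 0 and p > 0]
    if not ratios:
        raise ValueError("no valid (measured, predicted) pairs")
    geo_mean = math.exp(sum(math.log(x) for x in ratios) / len(ratios))
    return geo_mean, min(ratios), max(ratios)


# Hypothetical volumes in bytes per lattice update, not measured data.
sample = [(40.0, 36.0), (50.0, 50.0), (80.0, 72.0)]
geo_mean, lowest, highest = prediction_ratios(sample)
print(f"geo. mean {geo_mean:.3f}, range {lowest:.2f} to {highest:.2f}")
```

A ratio below 1 means the model predicts less traffic than was measured. The geometric mean is used so that a factor of 2 too high and a factor of 2 too low cancel out.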

In [None]:
volumeScatterPlot([(k[1:4], measValuesV100[k].L1Wavefronts*32, predValuesV100[k].L1Cycles, k[4]) for k in measValuesV100], "V100 L1 Cycles")

The L1 cycles needed to fulfill a warp's memory requests are consistent, but underpredicted.

In [None]:
categories = ["L1", "L2", "RAM"]

for r in range(0,5):
    
    keys = [k for k in measValuesV100 if k[0] == r]
    if len(keys) == 0: 
        continue
        
    volumeScatterPlot([(k[1:4], measValuesV100[k].lups, predValuesV100[k].perfV3, categories[predValuesV100[k].limV3], predValuesV100[k].perfV2) for k in keys], "V100 Predicted Roofline range " + str(r) + " V3")
    volumeScatterPlot([(k[1:4], measValuesV100[k].lups, predValuesV100[k].perfPheno, categories[predValuesV100[k].limPheno], predValuesV100[k].perfV4) for k in keys], "V100 Pheno Roofline range " + str(r) + " Pheno" )

The performance model ranks the thread block sizes consistently, though it generally overpredicts. The phenomenological roofline model performs very similarly. The deficiencies are not in the volumes, but in the performance model. The best performing group of thread block sizes is well identified, see the following cell:

In [None]:
# Sort ascending by measured performance, so the best block sizes come last.
print("measured best")
top = [(m, measValuesV100[m].lups) for m in measValuesV100]
top.sort(key=lambda x: x[1])
for t in top[-10:]:
    print("{: >15s}: {:.2f}".format(str(t[0][1:4]), t[1]))

print()

# Same ranking for the predicted performance.
print("predicted best")
top = [(m, predValuesV100[m].perfV4) for m in predValuesV100]
top.sort(key=lambda x: x[1])
for t in top[-10:]:
    print("{: >15s}: {:.2f}".format(str(t[0][1:4]), t[1]))
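
The roofline-style ranking used above can be illustrated with a minimal sketch. It assumes the simplest form of the model, in which each memory level limits performance to its bandwidth divided by the data volume per lattice update, and the slowest level wins. How `DerivedMetrics` actually computes `perfV3` and `limV3` is not shown in this notebook. The function name and the per-update volumes are hypothetical. The bandwidths are the V100 figures from the comparison table in the A100 section, and L1 is left out because that table lists no L1 bandwidth.

```python
def roofline(volumes, bandwidths):
    """Return (performance in GLup/s, name of the limiting memory level).

    volumes:    bytes moved per lattice update, keyed by memory level
    bandwidths: GB/s, keyed by the same memory levels
    """
    # GB/s divided by B/Lup gives GLup/s for each level.
    limits = {level: bandwidths[level] / volumes[level] for level in volumes}
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


v100_bandwidths = {"L2": 2500.0, "RAM": 800.0}  # GB/s, from the table
volumes = {"L2": 50.0, "RAM": 20.0}             # B/Lup, hypothetical

perf, limiter = roofline(volumes, v100_bandwidths)
print(f"{perf:.1f} GLup/s, limited by {limiter}")
```

With these made-up volumes the L2 cache would allow 50 GLup/s and DRAM 40 GLup/s, so DRAM is the limiter.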


# A100: High Level Differences

|  | V100 | A100 | Change |
|--|------|------|--------|
| SMs | 80 SMs | 108 SMs | +35% |
| Clocks | 1.38 GHz | 1.41 GHz | +2% |
| L1 cache | 128 kB | 192 kB | +50% |
| L2 cache | 6 MB | 40 MB | +567% |
| DRAM (scale) | 800 GB/s | 1400 GB/s | +75% |
| L2 BW | 2500 GB/s | 4500 GB/s | +80% |
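
The relative changes can be recomputed from the two value columns. The values below are copied from the table. The dictionary and function names are only illustrative.

```python
# (V100, A100) pairs taken from the table above.
specs = {
    "SMs": (80, 108),
    "clocks [GHz]": (1.38, 1.41),
    "L1 cache [kB]": (128, 192),
    "L2 cache [MB]": (6, 40),
    "DRAM [GB/s]": (800, 1400),
    "L2 BW [GB/s]": (2500, 4500),
}


def percent_change(old, new):
    """Relative increase from old to new, in percent."""
    return (new / old - 1.0) * 100.0


changes = {name: percent_change(v100, a100) for name, (v100, a100) in specs.items()}
for name, change in changes.items():
    print(f"{name:>15s}: {change:+.0f}%")
```

Note that going from 6 MB to 40 MB of L2 cache is a factor of about 6.67, which corresponds to an increase of about 567%.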



## Cache Bandwidths (read only)
<img src="cache1.svg"/>

Both the L1 and the L2 cache are larger on the A100. The usable L2 capacity is not the full 40 MB, but rather 20 MB, because the L2 cache is split into two partitions.

## Pointer Chase Latency
<img src="cache-latency.svg"/>

There is a distinct second plateau, due to hits in the far cache partition.
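
For readers unfamiliar with the method: a pointer chase measures latency by making every load depend on the result of the previous one. The sketch below shows only the access pattern, in plain Python and with hypothetical function names. The benchmark behind the figure runs on the GPU and its source is not part of this notebook. Sattolo's algorithm is used here so that the chain forms a single cycle covering the whole buffer.

```python
import random


def build_chain(n, seed=0):
    """Return a list where chain[i] is the next index to visit.

    Sattolo's algorithm yields a permutation with exactly one cycle, so the
    chase touches every element before it returns to the start.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps, start=0):
    """Follow the chain for `steps` dependent loads; return the visited indices."""
    visited = []
    idx = start
    for _ in range(steps):
        idx = chain[idx]  # the next address depends on the value just loaded
        visited.append(idx)
    return visited


chain = build_chain(1024)
visited = chase(chain, len(chain))
print(len(set(visited)), visited[-1])
```

In a real measurement, the elapsed time divided by the number of steps gives the average latency per load, and sweeping the buffer size produces plateaus like the ones in the figure.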

In [None]:
predValuesA100 = dict()
measValuesA100 = dict()

device = DeviceAmpere()
print(device.name)

def nextBlockSize():
    for xblock in [4, 8, 16, 32, 64, 128, 256, 512]:
        for yblock in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
            for zblock in [1, 2, 4, 8, 16, 32, 64]:
                if xblock*yblock*zblock not in [512, 1024]:
                    continue
                yield (xblock, yblock, zblock)    


r = 4

for blockingFactors in [(1,1,1)]:
    for block in nextBlockSize():
        
        key = (r, *block, blockingFactors)

        lc, basic, meas = meas_db.getEntry(r, block, blockingFactors, device)
       
        if basic is None or meas is None:
            continue
            
        metrics = DerivedMetrics(lc, basic, device, meas)

        measValuesA100[key] = meas
        predValuesA100[key] = metrics

        print(str(lc), end="")
        print(str(basic), end="--\n")
        rc = ResultComparer(meas, metrics)
        print(str(rc))              

        print()


In [None]:
volumeScatterPlot([(k[1:4], measValuesA100[k].memLoad, predValuesA100[k].memLoadV3, k[4], predValuesA100[k].memLoadV1) for k in measValuesA100], "Memory Load Volumes V4")

There is a group of thread block sizes, mostly shallow sizes with a small z extent, that is strongly underpredicted. This is due to the large L2 cache, which allows for layer-condition-like effects. The V100 vs. A100 graph shows that exactly these thread block sizes have lower balances on the A100 than on the V100.
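
As a rough illustration of why cache size matters here, the sketch below evaluates the classic layer condition for a 3D stencil of range `r`: reuse across layers works if `2*r + 1` x-y layers of the grid fit into the cache. This is a simplified, CPU-style formulation and only an assumption about the mechanism on the GPU. The grid size and the double precision element size are hypothetical, and no safety factor is applied. The capacities are the 6 MB of the V100 and the 20 MB usable on the A100 mentioned above.

```python
MiB = 1024 * 1024


def layer_condition_holds(nx, ny, r, cache_bytes, bytes_per_elem=8):
    """True if 2*r + 1 x-y layers of an nx * ny grid fit into the cache."""
    layers = 2 * r + 1
    return layers * nx * ny * bytes_per_elem <= cache_bytes


nx = ny = 512  # hypothetical grid extent, not taken from the measurements
r = 4          # stencil range used in this notebook

results = {
    "V100, 6 MB": layer_condition_holds(nx, ny, r, 6 * MiB),
    "A100, 20 MB usable": layer_condition_holds(nx, ny, r, 20 * MiB),
}
for name, holds in results.items():
    print(f"{name}: layer condition holds = {holds}")
```

For this made-up grid, nine layers take 18 MiB, which fits into the usable A100 capacity but not into the V100 L2 cache.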

In [None]:
volumeScatterPlot([(k[1:4], measValuesA100[k].memLoad, measValuesV100[k].memLoad, k[4]) for k in measValuesA100], "Memory Load Volumes V1 V100 vs A100")

In [None]:
volumeScatterPlot([(k[1:4], measValuesA100[k].L2Load, predValuesA100[k].L2LoadV1, k[4]) for k in measValuesA100], "A100 Stencil L2 Load Volumes V1")
volumeScatterPlot([(k[1:4], measValuesA100[k].L2Load_tex, predValuesA100[k].L2LoadV1, k[4]) for k in measValuesA100], "A100 Stencil L2 Load Volumes V1 (tex counter)")

The L2 cache data volume is underpredicted by an almost constant factor. Using a different performance counter, one that excludes the traffic between the L2 cache partitions, makes the prediction very accurate.

In [None]:
volumeScatterPlot([(k[1:4], measValuesA100[k].L2Load_tex, measValuesV100[k].L2Load_tex, k[4]) for k in measValuesA100], "L2 Load Volumes V100 vs A100")

The V100's smaller L1 cache leads to slightly higher L2 cache volumes.

## Gapped Stream
<img src="cache-gapped.svg"/>

In [None]:
categories = ["L1", "L2", "RAM"]

for r in range(0,5):
    
    keys = [k for k in measValuesA100 if k[0] == r]
    if len(keys) == 0: 
        continue
        
    volumeScatterPlot([(k[1:4], measValuesA100[k].lups, predValuesA100[k].perfV3, categories[predValuesA100[k].limV3]) for k in keys], "A100 Predicted Roofline range " + str(r) + " V3")
    volumeScatterPlot([(k[1:4], measValuesA100[k].lups, predValuesA100[k].perfPheno, categories[predValuesA100[k].limPheno], predValuesA100[k].perfV4) for k in keys], "A100 Pheno Roofline range " + str(r) + " Pheno" )

In [None]:
scatterPlot([(k[1:4], measValuesA100[k].lups, measValuesV100[k].lups, categories[predValuesA100[k].limPheno]) for k in keys], "Measured Performance A100 vs V100" )

In [None]:
print("measured Best")
top = [(m, measValuesA100[m].lups) for m in measValuesA100]
top.sort(key = lambda x : x[1])
for t in top[-10:]:
    print("{: >15s}: {:.2f}".format(str(t[0][1:4]), t[1]  ))

print()

print("predicted best")
top = [(m, predValuesA100[m].perfPheno) for m in predValuesA100]
top.sort(key = lambda x : x[1])
for t in top[-10:]:
    print("{: >15s}: {:.2f}".format(str(t[0][1:4]), t[1]  ))
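
Instead of comparing the two printed lists by eye, the agreement can be summarized as the overlap between the measured and the predicted top group. This is a minimal sketch with a hypothetical helper name and made-up performance values. It is not part of `genpredict`.

```python
def top_k_overlap(measured, predicted, k=10):
    """Fraction of the k best measured keys that are also among the k best predicted.

    Both arguments map a configuration key to a performance value,
    where higher is better.
    """
    if k <= 0 or not measured:
        raise ValueError("need k > 0 and at least one measurement")
    best_measured = set(sorted(measured, key=measured.get)[-k:])
    best_predicted = set(sorted(predicted, key=predicted.get)[-k:])
    return len(best_measured & best_predicted) / len(best_measured)


# Made-up values keyed by thread block size, for illustration only.
measured = {(32, 4, 4): 10.0, (64, 4, 2): 12.0, (16, 8, 4): 9.0, (128, 2, 2): 11.0}
predicted = {(32, 4, 4): 13.0, (64, 4, 2): 15.0, (16, 8, 4): 14.0, (128, 2, 2): 12.5}

overlap = top_k_overlap(measured, predicted, k=2)
print(f"top-2 overlap: {overlap:.2f}")
```

In this notebook the inputs could be built as `{k: measValuesA100[k].lups for k in measValuesA100}` and `{k: predValuesA100[k].perfPheno for k in predValuesA100}`.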
