<a href="https://colab.research.google.com/github/sambitmishra98/PyFR-ideal-performance/blob/main/performance_projection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Projected performance computation from mesh

This document compares the expected performance of PyFR with its actual, measured performance.

The expected performance is estimated by working out the inputs, outputs, and floating-point operations of every kernel used in the computation.

The actual performance is measured in two ways: by timing with `perf_counter()`, and by reading the wall-time stored in the solution file. For example, the actual performance of Intel Max GPUs is reported in a [Google Docs file](https://docs.google.com/document/d/1yX7JqTTsXRikTtzRon-03TgRGceByce075N4-Ptp7cI/edit?usp=sharing) (restricted access). The latter method was used to benchmark PyFR on A100 GPUs in the paper [Scaling Study of Flow Simulations on Composable Cyberinfrastructure](https://doi.org/10.1145/3569951.3597565).
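
As a rough illustration of the timing approach, a section of code can be wrapped with `time.perf_counter()` calls. The sketch below is not PyFR code: `time_call` and the workload passed to it are hypothetical names used only to show the pattern.

```python
import time

def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds of fn(*args) over `repeats` runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical workload standing in for one solver step
elapsed = time_call(sum, range(100_000))
print(f"Best of 3 runs: {elapsed:.6f} s")
```

Taking the best of several runs reduces the influence of one-off effects such as warm-up or other processes on the machine.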

In [104]:
# Global variables

size_of_float = 4 
precision_size = {'single': size_of_float, 
                  'double': 2*size_of_float}

etypes = ['tet', 'pyr', 'pri', 'hex',]
element_counts = {etype: 0 for etype in etypes}


## Performance details from configuration file

The settings below follow the configuration file format described in the [PyFR documentation](https://pyfr.readthedocs.io/en/latest/user_guide.html#configuration-file-ini).
Only those relevant to the performance computation are declared.


In [147]:
# [backend]
precision = 'double'

# [solver]
system = 'navier-stokes'
order = 2

# [solver-time-integrator]
scheme = 'rk4'
tstart = 0
tend = 1.0001
dt = 0.0001

## Processing data from mesh


Details of how the mesh size may be obtained can be found in the plugin `pyfr/plugins/benchmark.py` on the `benchmark` branch of [sambitmishra98/PyFR.git](https://github.com/sambitmishra98/PyFR.git).


In [148]:
ndims = 3
nvars = ndims + 2                   # [ρ, ρu, ρv, (ρw), ρE]

element_counts['hex'] = 86**3


## Degrees of Freedom (DoFs) calculation

An analysis of Flux Reconstruction schemes on Tetrahedral elements may be found in [this paper](https://doi.org/10.1007/s10915-016-0204-y). In general, we have ...

In [149]:
def edof(etype, order):
    """Return (Nu, Nf, Nu*Nf, Nu*Nu) for one element of type `etype`.

    Nu is the number of solution points and Nf the number of flux points.
    """
    if etype == 'tri':
        Nu = (order+1)*(order+2)//2
        Nf = 3*(order+1)
    elif etype == 'quad':
        Nu =   (order+1)**2
        Nf = 4*(order+1)
    elif etype == 'tet':
        Nu = (order+3)*edof('tri', order)[0]//3
        Nf =         4*edof('tri', order)[0]
    elif etype == 'pyr':
        Nu = (2*order+3)*edof('tri', order)[0]//3
        Nf =           4*edof('tri', order)[0] +   edof('quad', order)[0]
    elif etype == 'pri':
        Nu = (order+1)*edof('tri', order)[0]
        Nf =         2*edof('tri', order)[0] + 3*edof('quad', order)[0]
    elif etype == 'hex':
        Nu = (order+1)*edof('quad', order)[0]
        Nf =         6*edof('quad', order)[0]
    else:
        raise NotImplementedError(f"Element type '{etype}' is not implemented yet")

    NNuf = Nu*Nf    # Entries in a matrix of size Nu x Nf
    NNuu = Nu*Nu    # Entries in a matrix of size Nu x Nu

    # All divisions above are exact, so every count is already an int
    return Nu, Nf, NNuf, NNuu


In [150]:
# Get total number of degrees of freedom on the basis of the element counts and the element type
dofs_s = sum(n*edof(etype, order)[0] for etype, n in element_counts.items())
dofs_f = sum(n*edof(etype, order)[1] for etype, n in element_counts.items())
M_sf   = sum(n*edof(etype, order)[2] for etype, n in element_counts.items())
M_ss   = sum(n*edof(etype, order)[3] for etype, n in element_counts.items())
print(f"Total number of degrees of freedom: \nsolution points: \t\t{dofs_s} \nFlux points: \t\t\t{dofs_f} \nsolution --> flux matrices: \t{M_sf} \nscalar--> scalar matrices: \t{M_ss}")

Total number of degrees of freedom: 
solution points: 		17173512 
Flux points: 			34347024 
solution --> flux matrices: 	927369648 
scalar--> scalar matrices: 	463684824
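
As a quick cross-check of the totals above, a hexahedral element of order $p$ has $(p+1)^3$ solution points, and each of its six faces carries $(p+1)^2$ flux points. The sketch below recomputes the two DoF totals for the $86^3$ hexahedral mesh at order 2. The helper name `hex_dofs` is introduced here for illustration only.

```python
def hex_dofs(n_elements, order):
    """Return (solution DoFs, flux DoFs) for a mesh made only of hexahedra."""
    n_upts = (order + 1)**3        # solution points per element
    n_fpts = 6*(order + 1)**2      # flux points per element (6 faces)
    return n_elements*n_upts, n_elements*n_fpts

upts, fpts = hex_dofs(86**3, 2)
print(upts, fpts)   # 17173512 34347024
```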


## Storage

The number of registers required by each explicit RK scheme follows `stepper_nregs` in `pyfr/integrators/std/steppers.py`.

One additional register may be needed for the previous solution; this has not been confirmed.

In [151]:
rk_registers = {'euler': 1, 
                'rk4'  : 4,}

In [152]:
# Storage values in bytes
solution_storages = {
       'scalar u': precision_size[precision]*nvars*dofs_s*rk_registers[scheme],
       'vector u': precision_size[precision]*nvars*dofs_s*ndims,
     'S matrix u': precision_size[precision]*dofs_s*ndims**2,
        '1/|J| u': precision_size[precision]*dofs_s,
                    }     

# Flux
flux_storages = {
     'scalar f': precision_size[precision]*nvars*dofs_f,
     'vector f': precision_size[precision]*nvars*dofs_f*ndims if system == 'navier-stokes' else 0,
}
# Interface-normal storages
interface_normal_storages = {
           'n/|n| i': precision_size[precision]*dofs_f*ndims/2,
             '|n| i': precision_size[precision]*dofs_f,
     'scalar view i': precision_size[ 'single']*dofs_f*2,
     'vector view i': precision_size[ 'single']*dofs_f*ndims if system == 'navier-stokes' else 0,
}

storages = solution_storages|flux_storages|interface_normal_storages
storages['total'] = sum(storages.values())

for key, val in storages.items():
    print(f"{key}: {val/1024**3:.3f} GiB")


scalar u: 2.559 GiB
vector u: 1.919 GiB
S matrix u: 1.152 GiB
1/|J| u: 0.128 GiB
scalar f: 1.280 GiB
vector f: 3.839 GiB
n/|n| i: 0.384 GiB
|n| i: 0.256 GiB
scalar view i: 0.256 GiB
vector view i: 0.384 GiB
total: 12.156 GiB
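
One natural use of this total is to check whether the case fits on a single device. The sketch below is a minimal example under two assumptions that do not come from the source: the 48 GB capacity of the Intel Data Center GPU Max 1100 quoted later in this document is treated as 48 × 1024³ bytes, and only 90% of it is taken to be usable. `fits_on_device` and `headroom` are hypothetical names.

```python
GIB = 1024**3

def fits_on_device(required_bytes, capacity_bytes, headroom=0.9):
    """True if the requirement fits within a fraction `headroom` of device memory."""
    return required_bytes <= headroom*capacity_bytes

required = 12.156*GIB      # total printed above, in units of 1024**3 bytes
capacity = 48*GIB          # assumed device capacity
print(fits_on_device(required, capacity))            # True
print(f"{required/capacity:.1%} of device memory")   # 25.3% of device memory
```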


## Over one sweep of the integrator ...
The total number of operations performed depends on the kernels that are run.
The inputs, outputs, and operations of each kernel therefore need to be understood.

### ... total number of floating point operations performed 

In [153]:
# Store dictionary of computations, inputs and outputs

soln_flux_matrix_computations = sum([n*edof(k,order)[0]*edof(k,order)[1] for k, n in element_counts.items()])*2*nvars
soln_soln_matrix_computations = sum([n*edof(k,order)[0]*edof(k,order)[0] for k, n in element_counts.items()])*2*nvars*ndims

if ndims == 2:
    if system == 'euler':
        non_Ms = {'Gradcoru':         0, 'Tflux':  44*dofs_s, 'Rsolves':  92*dofs_f//2,}
    else:
        non_Ms = {'Gradcoru': 32*dofs_f, 'Tflux':  91*dofs_s, 'Rsolves': 200*dofs_f//2,}
elif ndims == 3:
    if system == 'euler':
        non_Ms = {'Gradcoru':         0, 'Tflux': 105*dofs_s, 'Rsolves': 140*dofs_f//2,}
    else:
        non_Ms = {'Gradcoru': 90*dofs_s, 'Tflux': 189*dofs_s, 'Rsolves': 269*dofs_f//2,}
#                                                         https://github.com/sambitmishra98/PyFR/blob/benchmark/pyfr/.................................
# Computations
Ms = {'M0'  : soln_flux_matrix_computations,                                            # disu,            u*f      solvers/baseadvec/elements.py#L73
      'M132': soln_soln_matrix_computations,                                            # qptsu,           dims*u*u solvers/baseadvec/elements.py#L90
      'M3'  : soln_flux_matrix_computations,                                            # tdivtpcorf,      u*f      solvers/baseadvec/elements.py#L97
      'M460': soln_soln_matrix_computations       if system == 'navier-stokes' else 0,  # tgradpcoru_upts, u*u      solvers/baseadvecdiff/elements.py#L34 
      'M6'  : soln_flux_matrix_computations*ndims if system == 'navier-stokes' else 0,  # tgradcoru_upts,  dims*u*f solvers/baseadvecdiff/elements.py#L38
      'M5'  : soln_flux_matrix_computations*ndims if system == 'navier-stokes' else 0,  # mul,             dims*u*f solvers/baseadvecdiff/elements.py#L68
}

others={'Conu'    : 0,
        'Rcpdjac' : nvars*dofs_s, 
}

kernels = Ms|non_Ms|others

# Print all kernel costs with names and values aligned (1 GFLOP = 1e9 FLOP)

print(*[f"{k:<10}: {v/1e9:>10.2f} GFLOP" for k, v in kernels.items()], sep='\n')


print(f"\n\nGFLOP, stage: \t{sum(kernels.values())/1e9:6.2f}")
print(f"Matrices: \t{sum(Ms.values())    /1e9:>6.2f} \t{sum(Ms.values())    /sum(kernels.values())*100:>5.2f}%,\n"
      f"Others  : \t{sum(non_Ms.values())/1e9:>6.2f} \t{sum(non_Ms.values())/sum(kernels.values())*100:>5.2f}%,\n"
      )


M0        :       9.27 GFLOP
M132      :      13.91 GFLOP
M3        :       9.27 GFLOP
M460      :      13.91 GFLOP
M6        :      27.82 GFLOP
M5        :      27.82 GFLOP
Gradcoru  :       1.55 GFLOP
Tflux     :       3.25 GFLOP
Rsolves   :       4.62 GFLOP
Conu      :       0.00 GFLOP
Rcpdjac   :       0.09 GFLOP


GFLOP, stage: 	111.51
Matrices: 	102.01 	91.48%,
Others  : 	  9.41 	 8.44%,



### ... total reads and writes performed

Matrix multiplications are the costliest part, accounting for about 91% of the FLOP count per stage in the estimate above. Vector multiplications are therefore ignored here, and only the M-related kernels are considered.


In [140]:
vector_reads  = {
    'M0'  :       dofs_s,
    'M132': ndims*dofs_s,
    'M3'  :       dofs_f,
    'M460':       dofs_s,
    'M6'  : ndims*dofs_f,
    'M5'  : ndims*dofs_f,
    }

cached_matrix_reads = {
    'M0'  :       M_sf,
    'M132': ndims*M_ss,
    'M3'  :       M_sf,
    'M460':       M_ss,
    'M6'  : ndims*M_sf,
    'M5'  : ndims*M_sf,
    }

vector_writes = {
    'M0'  :       dofs_f,
    'M132':       dofs_s,
    'M3'  :       dofs_s,
    'M460':       dofs_s,
    'M6'  :       dofs_s,
    'M5'  :       dofs_s,
    }

communication_per_timestep = max(sum(vector_reads.values()) + sum(cached_matrix_reads.values()), sum(vector_writes.values()))

print(f"Vector reads (in GB): {sum(vector_reads.values())/1e9}")
print(f"Matrix reads (in GB): {sum(cached_matrix_reads.values())/1e9}")
print(f"Total writes (in GB): {sum(vector_writes.values())/1e9}")


Vector reads (in GB): 0.206082144
Matrix reads (in GB): 44.513743104
Total writes (in GB): 0.103041072


## Performance comparison

The reported peak performance of each accelerator is taken from its specification sheet.

In [141]:
memory_size = { # in GB
    'Intel(R) Data Center GPU Max 1100': 48,
}
memory_bandwidth = { # in GB/s
    'Intel(R) Data Center GPU Max 1100': 1228.8,
}

theoretical_single_precision_performance = { # in TFLOP/s
 'Intel(R) Data Center GPU Max 1100': 22.22,                  # https://www.techpowerup.com/gpu-specs/data-center-gpu-max-1100.c4066
}

theoretical_double_precision_performance = { # in TFLOP/s
 'Intel(R) Data Center GPU Max 1100': 22.22,                  # https://www.techpowerup.com/gpu-specs/data-center-gpu-max-1100.c4066
}


## Practical performance



The number of stages computed per time step by each explicit RK scheme is taken from `stepper_order` in `pyfr/integrators/std/steppers.py`.

In [142]:
rk_stage_computations = {'euler': 1, 
                         'rk4'  : 4,}
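
Combined with the time-integration settings declared earlier, the stage count gives the number of right-hand-side evaluations in a full run. The sketch below assumes a fixed time step and one right-hand-side evaluation per stage. `n_steps` and `n_rhs_evals` are names introduced here for illustration.

```python
tstart, tend, dt = 0, 1.0001, 0.0001     # values from [solver-time-integrator]
stages = {'euler': 1, 'rk4': 4}          # same values as rk_stage_computations

n_steps = round((tend - tstart)/dt)      # fixed time step assumed
n_rhs_evals = n_steps*stages['rk4']      # one evaluation per stage assumed
print(n_steps, n_rhs_evals)              # 10001 40004
```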

### Tested performance of simulations (TGV)

In [143]:
practical_single_precision_performance = { # GDoF/s
 'Intel(R) Data Center GPU Max 1100': 2.25,                 # https://docs.google.com/document/d/1yX7JqTTsXRikTtzRon-03TgRGceByce075N4-Ptp7cI/edit?usp=sharing
}

practical_double_precision_performance = { # GDoF/s
 'Intel(R) Data Center GPU Max 1100': 1.25,                 # https://docs.google.com/document/d/1yX7JqTTsXRikTtzRon-03TgRGceByce075N4-Ptp7cI/edit?usp=sharing
}


In [144]:
for dp in practical_double_precision_performance.values():
    print(f"Double precision performance: {dp} GDoF/s")
    print(f"Rate counting every RK stage: {dp*rk_stage_computations[scheme]} GDoF/s")

Double precision performance: 1.25 GDoF/s
Rate counting every RK stage: 5.0 GDoF/s
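
To compare a measured rate with the quoted peak, a DoF rate has to be converted into FLOP/s using the operation count per solution point. The sketch below rebuilds that count from the per-kernel formulas used earlier, for hexahedra with the 3D Navier-Stokes system. It assumes, without confirmation from the source, that the rate passed in counts one solution point per RK stage evaluation; if the measurement counts something else (for example, every conserved variable separately), the input must be rescaled first. `flop_per_point_stage` and `sustained_tflops` are names introduced here.

```python
def flop_per_point_stage(order):
    """FLOP per solution point per RK stage, hexahedra, 3D Navier-Stokes."""
    nvars, ndims = 5, 3
    n_upts = (order + 1)**3          # solution points per element
    n_fpts = 6*(order + 1)**2        # flux points per element
    # Matrix kernels: M0, M3 (u*f), M6, M5 (dims*u*f), M132, M460 (dims*u*u)
    matmul = 2*nvars*(n_fpts*(2 + 2*ndims) + n_upts*2*ndims)
    # Pointwise kernels: Gradcoru, Tflux, Rsolves (per flux-point pair), Rcpdjac
    pointwise = 90 + 189 + 269*n_fpts/(2*n_upts) + nvars
    return matmul + pointwise

def sustained_tflops(point_stage_rate, order):
    """Convert solution-point stage evaluations per second into TFLOP/s."""
    return point_stage_rate*flop_per_point_stage(order)/1e12

# Illustrative input: 1.25e9 point-stage evaluations per second (assumed meaning)
achieved = sustained_tflops(1.25e9, order=2)
print(f"{flop_per_point_stage(2):.0f} FLOP per point per stage")   # 6493
print(f"{achieved:.2f} TFLOP/s, {achieved/22.22:.1%} of the 22.22 TFLOP/s peak")
```

Under these assumptions the estimate comes to about 8.12 TFLOP/s, roughly 36.5% of the quoted double-precision peak. The figure is only as reliable as the assumed meaning of the measured rate.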

