## Exercise: Use LIKWID to Count FLOPs

First, let's check that LIKWID is working. The following cells should run without errors and print the performance groups that LIKWID supports on this machine.

In [1]:
using LIKWID

In [2]:
PerfMon.supported_groups()

Dict{String, LIKWID.GroupInfoCompact} with 29 entries:
  "L2CACHE"        => L2CACHE => L2 cache miss rate/ratio
  "MEM"            => MEM => L3 cache bandwidth in MBytes/s
  "CYCLE_ACTIVITY" => CYCLE_ACTIVITY => Cycle Activities
  "BRANCH"         => BRANCH => Branch prediction miss rate/ratio
  "FLOPS_SP"       => FLOPS_SP => Single Precision MFLOP/s
  "RECOVERY"       => RECOVERY => Recovery duration
  "DIVIDE"         => DIVIDE => Divide unit information
  "L2"             => L2 => L2 cache bandwidth in MBytes/s
  "FALSE_SHARE"    => FALSE_SHARE => False sharing
  "L3"             => L3 => L3 cache bandwidth in MBytes/s
  "L3CACHE"        => L3CACHE => L3 cache miss rate/ratio
  "UOPS_EXEC"      => UOPS_EXEC => UOPs execution
  "CYCLE_STALLS"   => CYCLE_STALLS => Cycle Activities (Stalls)
  "ICACHE"         => ICACHE => Instruction cache miss rate/ratio
  "TLB_INSTR"      => TLB_INSTR => L1 Instruction TLB miss rate/ratio
  "MEM_DP"         => MEM_DP => Overview of arithmetic and m

Great, you're set up!

**You can find the instructions for this exercise/tutorial here:**   
https://juliaperf.github.io/LIKWID.jl/dev/tutorials/counting_flops/

In [3]:
# ...Your code goes here...

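# DAXPY kernel: z = a*x + y, written as a single fused in-place broadcast
daxpy!(z, a, x, y) = z .= a .* x .+ y

const N = 10_000
const a = 3.141
const x = rand(N)
const y = rand(N)
const z = zeros(N)

# Warm-up call, so that compilation is not part of the measurement below
daxpy!(z, a, x, y);

using LIKWID
# Run the kernel while counting the events of the FLOPS_DP performance group
metrics, events = @perfmon "FLOPS_DP" daxpy!(z, a, x, y);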


Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│                                    Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│                        INSTR_RETIRED_ANY │  10149.0 │
│                    CPU_CLK_UNHALTED_CORE │  32480.0 │
│                     CPU_CLK_UNHALTED_REF │  94500.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │      0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │   5000.0 │
└──────────────────────────────────────────┴──────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │ 3.78731e-5 │
│ Runtime unhalted [s] │ 1.25313e-5 │
│          Clock [MHz] │    890.852 │
│                  CPI │    3.20032 │
│         DP [MFLOP/s] │     528.08 │
│     AVX DP [MFLOP/s] │     528.08 │
│     Packed [MUOPS/s] │     132.02 

In [4]:
# Metrics of the first (here: only) measured thread
flops_per_second = first(metrics["FLOPS_DP"])["DP [MFLOP/s]"] * 1e6
runtime = first(metrics["FLOPS_DP"])["Runtime (RDTSC) [s]"]
# Total FLOP count = rate * time
NFLOPs_actual = round(Int, flops_per_second * runtime)

20000
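
As a sanity check, the measured value can be compared with what the kernel should do on paper: `daxpy!` performs one multiplication and one addition per element, so `2 * N` FLOPs in total. The sketch below is not part of the tutorial. The helper names `expected_daxpy_flops` and `flops_from_events` are hypothetical, and the weights 1, 2, and 4 are an assumption (one double per scalar event, two per 128-bit packed event, four per 256-bit packed event) that happens to be consistent with the numbers printed above. It does not need LIKWID to run, because the event counts and the runtime are copied from the output.

```julia
# Hypothetical helpers for illustration; they are not part of LIKWID.jl.

# daxpy does one multiplication and one addition per element.
expected_daxpy_flops(n) = 2 * n

# Assumption: each counted event stands for 1, 2, or 4 double-precision
# operations (scalar, 128-bit packed, 256-bit packed, respectively).
flops_from_events(; scalar, packed128, packed256) =
    round(Int, scalar + 2 * packed128 + 4 * packed256)

# Event counts and runtime copied from the @perfmon output above.
nflops_events = flops_from_events(scalar = 0.0, packed128 = 0.0, packed256 = 5000.0)
nflops_expected = expected_daxpy_flops(10_000)
mflops = nflops_events / 3.78731e-5 * 1e-6

println("expected FLOPs:    ", nflops_expected)
println("FLOPs from events: ", nflops_events)
println("DP [MFLOP/s]:      ", round(mflops, digits = 2))
```

Both counts come out at 20000, which agrees with `NFLOPs_actual`. That only 256-bit packed events were counted also suggests that the broadcast was compiled to AVX instructions rather than scalar ones.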