## Speed test of Tullio.jl on GPUs and CPUs

In [1]:
using CUDA, KernelAbstractions, Tullio, Zygote

Tullio.jl is not yet heavily optimized for GPUs, but it works.
For 2D arrays the GPU speedup is modest; for 3D arrays it is somewhat better.
Several small tests follow below.
See also the Discourse thread [Fast GPU kernels differentiable with Zygote](https://discourse.julialang.org/t/fast-gpu-kernels-differentiable-with-zygote/56756?u=roflmaostc).

In [87]:
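# All four variants compute the same scalar: the sum over (i, j, k) of the
# forward-difference gradient magnitude. They differ only in how much of the
# reduction Tullio performs itself and how much is left to `sum`.

# reg: Tullio reduces over all three indices down to a scalar.
reg(x) = @tullio res = sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +
                            abs2(x[i, j, k] - x[i, j+1, k]) +
                            abs2(x[i, j, k] - x[i, j, k+1]))

# reg2: Tullio fills a 3D array without reducing; `sum` does the whole reduction.
reg2(x) = sum(@tullio res[i, j, k] := sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +
                                           abs2(x[i, j, k] - x[i, j+1, k]) +
                                           abs2(x[i, j, k] - x[i, j, k+1])))

# reg3: Tullio reduces over i into a 2D array; `sum` reduces over j and k.
reg3(x) = sum(@tullio res[j, k] := sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +
                                        abs2(x[i, j, k] - x[i, j+1, k]) +
                                        abs2(x[i, j, k] - x[i, j, k+1])))

# reg4: Tullio reduces over i and j into a vector; `sum` reduces over k.
reg4(x) = sum(@tullio res[k] := sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +
                                     abs2(x[i, j, k] - x[i, j+1, k]) +
                                     abs2(x[i, j, k] - x[i, j, k+1])))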

reg4 (generic function with 1 method)
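
Before comparing timings, it can help to confirm that the Tullio expressions compute what is intended. The following is a minimal sketch of such a check on the CPU. The names `reg_ref`, `reg3_check`, and `x_small` are hypothetical and not part of the original notebook; `reg_ref` writes the same forward-difference sum with views and broadcasting, relying on the fact that Tullio restricts each shifted index to the range where all accesses are in bounds.

```julia
using Tullio

# Hypothetical reference implementation (an assumption for illustration):
# the same sum of forward-difference magnitudes, written with views.
function reg_ref(x)
    c  = @view x[1:end-1, 1:end-1, 1:end-1]  # x[i, j, k]
    dx = @view x[2:end,   1:end-1, 1:end-1]  # x[i+1, j, k]
    dy = @view x[1:end-1, 2:end,   1:end-1]  # x[i, j+1, k]
    dz = @view x[1:end-1, 1:end-1, 2:end]    # x[i, j, k+1]
    return sum(sqrt.(abs2.(c .- dx) .+ abs2.(c .- dy) .+ abs2.(c .- dz)))
end

# Same expression as reg3, repeated here so the block runs on its own.
reg3_check(x) = sum(@tullio res[j, k] := sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +
                                              abs2(x[i, j, k] - x[i, j+1, k]) +
                                              abs2(x[i, j, k] - x[i, j, k+1])))

# Small Float64 array so the comparison is not dominated by rounding error.
x_small = randn(Float64, 8, 9, 10)
@show reg_ref(x_small) ≈ reg3_check(x_small)
```

The same comparison could be repeated for the other variants, or on a `CuArray` when a GPU is available.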

In [55]:
x = randn(Float16, (512, 512, 64));
x_c = CuArray(x);

In [95]:
@time reg(x)
@time reg2(x)
@time reg3(x)
@time reg4(x)
@time Zygote.gradient(reg, x);
@time Zygote.gradient(reg2, x);
@time Zygote.gradient(reg3, x);
@time Zygote.gradient(reg4, x);

  0.022839 seconds (225 allocations: 10.766 KiB)
  0.081545 seconds (198 allocations: 31.386 MiB)
  0.022012 seconds (329 allocations: 81.703 KiB)
  0.019439 seconds (215 allocations: 9.734 KiB)
  0.069363 seconds (597 allocations: 32.031 MiB)
  0.125310 seconds (594 allocations: 63.409 MiB)
  0.085667 seconds (614 allocations: 32.094 MiB)
  0.070661 seconds (504 allocations: 32.024 MiB)
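
A single `@time` call measures one run and can be noisy. As a hedged alternative, assuming the BenchmarkTools.jl package is installed, the sketch below takes the minimum time over many runs. The names `reg2_bench` and `x_bench` are hypothetical, and the array is deliberately smaller than the one used in this notebook so the example runs quickly.

```julia
using BenchmarkTools, Tullio

# Same expression as reg2, repeated here so the block runs on its own.
reg2_bench(x) = sum(@tullio res[i, j, k] := sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +
                                                 abs2(x[i, j, k] - x[i, j+1, k]) +
                                                 abs2(x[i, j, k] - x[i, j, k+1])))

# Hypothetical test array, smaller than the notebook's 512×512×64 input.
x_bench = randn(Float32, 64, 64, 16)

# `$` interpolates the global variable so its lookup is not part of the timing.
# @belapsed returns the minimum elapsed time in seconds over many samples.
t_cpu = @belapsed reg2_bench($x_bench)
println("minimum CPU time: ", t_cpu, " s")

# On a GPU (assumes CUDA.jl and KernelAbstractions.jl are loaded and a device
# is available), the call would need explicit synchronization:
# x_c   = CuArray(x_bench)
# t_gpu = @belapsed CUDA.@sync reg2_bench($x_c)
```

The minimum over many samples excludes compilation and is less sensitive to garbage-collection pauses than a single measurement.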


In [96]:
@CUDA.time reg2(x_c)
@CUDA.time reg3(x_c)
@CUDA.time reg4(x_c)
@CUDA.time Zygote.gradient(reg2, x_c);
@CUDA.time Zygote.gradient(reg3, x_c);
@CUDA.time Zygote.gradient(reg4, x_c);

  0.022923 seconds (220 CPU allocations: 8.875 KiB) (3 GPU allocations: 31.377 MiB, 0.04% gc time)
  0.006365 seconds (248 CPU allocations: 8.031 KiB) (3 GPU allocations: 62.941 KiB, 0.11% gc time)
  0.084296 seconds (180 CPU allocations: 6.891 KiB) (2 GPU allocations: 128 bytes, 0.01% gc time)
  0.006248 seconds (473 CPU allocations: 19.516 KiB) (5 GPU allocations: 94.754 MiB, 0.11% gc time)
  0.008904 seconds (470 CPU allocations: 18.875 KiB) (5 GPU allocations: 32.123 MiB, 0.11% gc time)
  0.076914 seconds (433 CPU allocations: 17.359 KiB) (4 GPU allocations: 32.000 MiB, 0.01% gc time)
