# Lecture 5: Parallel programming in Julia

by Valentin Churavy, based on material for MIT 18.337
https://slides.com/valentinchuravy/julia-parallelism

## Levels of parallelism

1. Instruction level parallelism *(Not a topic today)*
2. Vector instructions *(Not a topic today)*
3. Threading (shared-memory)
4. Distributed
5. Accelerators e.g. GPGPU

## Benchmarking methodology

1. Measure first! Don't try to guess the performance of your code.
2. If you don't measure, you can't improve.
3. Computers are noisy! Many people report the lowest (minimum) runtime, since noise can only make a run slower.
4. Read the performance tips in the Julia manual.
5. Don't benchmark in global scope.
6. Global variables are performance pitfalls.

### Steps
1. Check for type instabilities with `@code_warntype`
2. Benchmark using `@btime` and `@benchmark` from `BenchmarkTools.jl`
3. Use Julia profiler and `ProfileView.jl`
4. Use the memory allocation tracker

### About global scope
The value of a global variable, and therefore its type, might change at any point. This makes it difficult, if not impossible, for the compiler to reason about and optimize code that uses global variables.

Julia uses functions as its compilation unit and any code that is performance critical or being benchmarked should be inside a function.
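To make the point about globals concrete, here is a minimal sketch using only Base. The names `scale`, `scaled_sum_global` and `scaled_sum_arg` are made up for illustration. Both functions compute the same value, but the first reads a non-constant global inside its loop while the second receives the value as an argument.

```julia
# Non-constant global: its type may change at any time, so code reading it
# cannot be specialized on that type.
scale = 2.0

# Reads the global directly inside the hot loop.
function scaled_sum_global(xs)
    s = 0.0
    for x in xs
        s += scale * x
    end
    return s
end

# Receives the value as an argument, so the compiler knows its concrete type.
function scaled_sum_arg(xs, scale)
    s = 0.0
    for x in xs
        s += scale * x
    end
    return s
end

xs = rand(1000)
scaled_sum_global(xs); scaled_sum_arg(xs, scale)  # run once to compile both

# The global version typically allocates on every iteration; the argument
# version typically does not. Exact numbers depend on the Julia version.
println(@allocated scaled_sum_global(xs))
println(@allocated scaled_sum_arg(xs, scale))
```

Running `@code_warntype scaled_sum_global(xs)` should flag the accumulator as not concretely typed, which is the type instability mentioned in the steps above.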

In [1]:
using BenchmarkTools

@benchmark sin(1)

In [2]:
@benchmark sum(rand(1000))

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.864 μs (0.00% GC)
  median time:      2.438 μs (0.00% GC)
  mean time:        2.803 μs (6.68% GC)
  maximum time:     175.963 μs (93.93% GC)
  --------------
  samples:          10000
  evals/sample:     10

In [2]:
@benchmark sum($(rand(1000)))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     69.056 ns (0.00% GC)
  median time:      69.160 ns (0.00% GC)
  mean time:        70.701 ns (0.00% GC)
  maximum time:     478.892 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     976

Always interpolate inputs into your benchmark to measure the part of your code that you are interested in.

# Using the Julia profiler
```
using Profile # Standard library; provides @profile

@profile fun() # Profile a specific function call (`fun` is a placeholder)
Profile.clear() # Clear the recorded profile
Profile.print() # Print the recorded profile
Profile.print(C=true) # Print the profile including calls into C

# The textual output of the profiler can be hard to understand
# ProfileView.jl gives you a graphical representation
using ProfileView
ProfileView.view()
```

In [7]:
#import Pkg
#Pkg.add("FFTW") 
using FFTW

In [1]:
?mapslices

search: mapslices



```
mapslices(f, A; dims)
```

Transform the given dimensions of array `A` using function `f`. `f` is called on each slice of `A` of the form `A[...,:,...,:,...]`. `dims` is an integer vector specifying where the colons go in this expression. The results are concatenated along the remaining dimensions. For example, if `dims` is `[1,2]` and `A` is 4-dimensional, `f` is called on `A[:,:,i,j]` for all `i` and `j`.

# Examples

```jldoctest
julia> a = reshape(Vector(1:16),(2,2,2,2))
2×2×2×2 Array{Int64,4}:
[:, :, 1, 1] =
 1  3
 2  4

[:, :, 2, 1] =
 5  7
 6  8

[:, :, 1, 2] =
  9  11
 10  12

[:, :, 2, 2] =
 13  15
 14  16

julia> mapslices(sum, a, dims = [1,2])
1×1×2×2 Array{Int64,4}:
[:, :, 1, 1] =
 10

[:, :, 2, 1] =
 26

[:, :, 1, 2] =
 42

[:, :, 2, 2] =
 58
```


In [17]:
function profile_test(n)
    for i = 1:n
        A = randn(100,100,20)# 100x100x20 random array 
        m = maximum(A)
        Afft = fft(A)
        Am = mapslices(sum, A, dims=2)
        B = A[:,:,5]
        Bsort = mapslices(sort, B, dims=1)
        b = rand(100)
        C = B.*b
    end
end

profile_test(1);  # run once to trigger compilation

In [23]:
#Pkg.add("Profile")
using Profile
Profile.clear()  # in case we have any previous profiling data
@profile profile_test(100)

In [27]:
#Pkg.add("ProfileView")

In [26]:
using ProfileView

┌ Info: Precompiling ProfileView [c46f51b8-102a-5cf2-8d2c-8597cb0e0da7]
└ @ Base loading.jl:1260


In [28]:
ProfileView.view()



In [29]:
ProfileView.view(C=true)


## Gaining additional insights

Other profiling tools
1. https://github.com/cstjean/FProfile.jl
2. https://github.com/cstjean/TraceCalls.jl
3. Learn about Linux's `perf`

Code Analyzer
1. https://github.com/vchuravy/IACA.jl

## Threading
Julia's threading model is based on a fork-join approach and is still considered experimental.
(So experimental, in fact, that these benchmarks were run with https://github.com/JuliaLang/julia/pull/24688.)

Fork-join describes the control flow that a group of threads undergoes: execution is forked, and an anonymous function is then run across all threads.

All threads then have to join together before serial execution continues.

### Hardware

In [35]:
Sys.iswindows()

true

In [38]:
Sys.iswindows() && run(`wmic cpu`)
#Sys.islinux() && run(`lscpu`)

AddressWidth  Architecture  AssetTag                Availability  Caption                                 Characteristics  ConfigManagerErrorCode  ConfigManagerUserConfig  CpuStatus  CreationClassName  CurrentClockSpeed  CurrentVoltage  DataWidth  Description                             DeviceID  ErrorCleared  ErrorDescription  ExtClock  Family  InstallDate  L2CacheSize  L2CacheSpeed  L3CacheSize  L3CacheSpeed  LastErrorCode  Level  LoadPercentage  Manufacturer  MaxClockSpeed  Name                                      NumberOfCores  NumberOfEnabledCore  NumberOfLogicalProcessors  OtherFamilyDescription  PartNumber              PNPDeviceID  PowerManagementCapabilities  PowerManagementSupported  ProcessorId       ProcessorType  Revision  Role  SecondLevelAddressTranslationExtensions  SerialNumber            SocketDesignation  Status  StatusInfo  Stepping  SystemCreationClassName  SystemName     ThreadCount  UniqueId  UpgradeMethod  Version  VirtualizationFirmwareEnabled  VMMonitorModeExt

Process(`wmic cpu`, ProcessExited(0))

```sh
> lscpu
...
Thread(s) per core:  2
Core(s) per socket:  2
...
Model name:          Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
...
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
...
Flags: ... avx2 ...
```

```sh
> lscpu
...
Thread(s) per core:  2
Core(s) per socket:  12
...
Model name:          AMD Ryzen Threadripper 1920X 12-Core Processor
...
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
...
Flags: ... avx2 ...
```

In [7]:
using Base.Threads

In [8]:
nthreads()

1

In [9]:
#ENV["JULIA_NUM_THREADS"] = 4 

In [10]:
nthreads()

1
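The thread count is fixed when Julia starts, so setting `ENV["JULIA_NUM_THREADS"]` inside a running session would not change `nthreads()`. The sketch below assumes Julia was launched with more than one thread (it also runs with a single thread) and shows the fork-join pattern on a loop whose iterations are independent. `squares_threaded` is a made-up name for illustration.

```julia
using Base.Threads

# Assumption: Julia was started with several threads, for example
#   JULIA_NUM_THREADS=4 julia     (environment variable, read at startup)
#   julia -t 4                    (command-line flag, Julia 1.5 and later)
println("running with ", nthreads(), " thread(s)")

function squares_threaded(n)
    out = zeros(Int, n)
    # Fork: the iterations are split across the available threads.
    @threads for i in 1:n
        out[i] = i^2          # each iteration writes to its own slot, so no race
    end
    # Join: execution continues here only after all iterations have finished.
    return out
end

squares_threaded(8)
```

Because every iteration writes to a different element of `out`, no synchronization is needed here.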

Special care needs to be taken if the loop body has side effects or accesses global state. (This includes I/O and random numbers.)

In [25]:
function rand_init1(A)
    @threads for i in 1:length(A)
        A[i] = rand()
    end
end

rand_init1 (generic function with 1 method)

In [32]:
using Random: GLOBAL_RNG, MersenneTwister
import Future

In [33]:
function rand_init2(rngs, A)
    @threads for i in 1:length(A)
        A[i] = rand(rngs[threadid()])
    end
end

rand_init2 (generic function with 1 method)

In [38]:
# Based on https://github.com/bkamins/KissThreading.jl/blob/8675f55ef9469fccf808a44237bd5f0bbb02b950/src/KissThreading.jl#L5-L15
function create_rngs()
    # One independent stream per thread: start from a freshly seeded generator
    # and jump each following generator far ahead of the previous one.
    # On Julia 1.x, `Future.randjump` takes a `MersenneTwister` plus a step
    # count and returns a single new generator.
    rngs = Vector{MersenneTwister}(undef, nthreads())
    rngs[1] = MersenneTwister()
    for tid in 2:nthreads()
        rngs[tid] = Future.randjump(rngs[tid-1], big(10)^20)
    end
    return rngs
end

create_rngs (generic function with 1 method)

In [41]:
?Future.randjump

```
randjump(r::MersenneTwister, steps::Integer) -> MersenneTwister
```

Create an initialized `MersenneTwister` object, whose state is moved forward (without generating numbers) from `r` by `steps` steps. One such step corresponds to the generation of two `Float64` numbers. For each different value of `steps`, a large polynomial has to be generated internally. One is already pre-computed for `steps=big(10)^20`.


In [39]:
basic_rngs = [MersenneTwister(rand(UInt64)) for i in 1:nthreads()]
proper_rngs = create_rngs();

In [16]:
A = zeros(10_000);

In [17]:
@benchmark rand_init1($A)

BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     76.773 μs (0.00% GC)
  median time:      149.930 μs (0.00% GC)
  mean time:        150.099 μs (0.00% GC)
  maximum time:     198.623 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [18]:
@benchmark rand_init2($basic_rngs, $A)

BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     23.261 μs (0.00% GC)
  median time:      24.307 μs (0.00% GC)
  mean time:        24.292 μs (0.00% GC)
  maximum time:     145.660 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [19]:
@benchmark rand_init2($proper_rngs, $A)

BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     22.975 μs (0.00% GC)
  median time:      24.237 μs (0.00% GC)
  mean time:        24.254 μs (0.00% GC)
  maximum time:     50.971 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

## Atomics and Locks

In [20]:
acc = 0
@threads for i in 1:10_000
    global acc
    acc += 1
end

In [21]:
acc

2534

In [22]:
acc = Atomic{Int64}(0)
@threads for i in 1:10_000
    atomic_add!(acc, 1)
end

In [23]:
acc

Base.Threads.Atomic{Int64}(10000)

For locks take a look at the [manual](https://docs.julialang.org/en/latest/stdlib/parallel/#Synchronization-Primitives-1)
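As a minimal sketch of the lock-based alternative, the counter from above can be protected with a `ReentrantLock` from Base. The function name `locked_count` is made up for illustration. For a single integer counter an atomic is usually the cheaper choice; a lock becomes useful when the critical section updates more than one value.

```julia
using Base.Threads

function locked_count(n)
    total = Ref(0)            # shared mutable counter
    lk = ReentrantLock()
    @threads for i in 1:n
        # Only one thread at a time may execute the body of `lock ... do`;
        # the lock is released automatically when the block exits.
        lock(lk) do
            total[] += 1
        end
    end
    return total[]
end

locked_count(10_000)
```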

# A useful trick
```julia
@threads for id in 1:nthreads()
    #each thread does something
end
```

In [49]:
?let

search: let delete! isletter deleteat! selectdim length collect mutable struct



```
let
```

`let` statements allocate new variable bindings each time they run. Whereas an assignment modifies an existing value location, `let` creates new locations. This difference is only detectable in the case of variables that outlive their scope via closures. The `let` syntax accepts a comma-separated series of assignments and variable names:

```julia
let var1 = value1, var2, var3 = value3
    code
end
```

The assignments are evaluated in order, with each right-hand side evaluated in the scope before the new variable on the left-hand side has been introduced. Therefore it makes sense to write something like `let x = x`, since the two `x` variables are distinct and have separate storage.


In [24]:
function threaded_sum(arr)
   @assert length(arr) % nthreads() == 0
   let results = zeros(eltype(arr), nthreads())
       @threads for tid in 1:nthreads()
           # split work
           acc = zero(eltype(arr))
           len = div(length(arr), nthreads())
           domain = ((tid-1)*len +1):tid*len
           @inbounds for i in domain
               acc += arr[i]    
           end
           results[tid] = acc
       end
       sum(results)
   end
end

threaded_sum (generic function with 1 method)

In [25]:
data = rand(3*2^19);

In [26]:
@btime sum($data)

  409.636 μs (0 allocations: 0 bytes)


786472.3368683406

In [27]:
@btime threaded_sum($data)

  440.189 μs (2 allocations: 160 bytes)


786472.3368683412

| Threads | Skylake | AMD TR |
| --- | --- | --- |
| Base `sum` | 514.476 μs | 430.409 μs |
| 1 | 1.578 ms | 1.206 ms |
| 2 | 831.411 μs | 575.872 μs |
| 4 | 417.656 μs | 294.724 μs |
| 6 | X | 215.986 μs |
| 12 | X | 109.536 μs |
| 24 | X | 57.197 μs |

If your `@threads` version running on one thread is not as fast as the non-`@threads` version, something is off... but yay for linear scaling.
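`threaded_sum` asserts that the array length is divisible by `nthreads()`. A possible variant without that restriction is sketched below. The name `threaded_sum_any` and the chunking formula are illustrative assumptions, not part of the original lecture, and the code assumes a standard 1-based `Vector`.

```julia
using Base.Threads

function threaded_sum_any(arr)
    n = nthreads()
    len = length(arr)
    results = zeros(eltype(arr), n)
    @threads for tid in 1:n
        # Chunk `tid` covers lo:hi; consecutive chunks tile 1:len exactly,
        # and a chunk may be empty when len < n.
        lo = div((tid - 1) * len, n) + 1
        hi = div(tid * len, n)
        acc = zero(eltype(arr))
        @inbounds for i in lo:hi
            acc += arr[i]
        end
        results[tid] = acc   # indexed by the loop variable, not by threadid()
    end
    return sum(results)
end

threaded_sum_any(rand(1_000_003))
```

Indexing `results` by the loop variable rather than by `threadid()` keeps the code correct regardless of which thread ends up running which chunk.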

In [42]:
?@inbounds

```
@inbounds(blk)
```

Eliminates array bounds checking within expressions.

In the example below the in-range check for referencing element `i` of array `A` is skipped to improve performance.

```julia
function sum(A::AbstractArray)
    r = zero(eltype(A))
    for i = 1:length(A)
        @inbounds r += A[i]
    end
    return r
end
```

!!! warning
    Using `@inbounds` may return incorrect results/crashes/corruption for out-of-bounds indices. The user is responsible for checking it manually. Only use `@inbounds` when it is certain from the information locally available that all accesses are in bounds.



In [44]:
zero(Float64)

0.0

In [46]:
function sum(A::AbstractArray)
    r = zero(eltype(A))
    for i = 1:length(A)
        @inbounds r += A[i]
    end
    return r
end

sum (generic function with 1 method)

In [48]:
?@simd

```
@simd
```

Annotate a `for` loop to allow the compiler to take extra liberties to allow loop re-ordering

!!! warning
    This feature is experimental and could change or disappear in future versions of Julia. Incorrect use of the `@simd` macro may cause unexpected results.


The object iterated over in a `@simd for` loop should be a one-dimensional range. By using `@simd`, you are asserting several properties of the loop:

  * It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
  * Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.

In many cases, Julia is able to automatically vectorize inner for loops without the use of `@simd`. Using `@simd` gives the compiler a little extra leeway to make it possible in more situations. In either case, your inner loop should have the following properties to allow vectorization:

  * The loop must be an innermost loop
  * The loop body must be straight-line code. Therefore, [`@inbounds`](@ref) is   currently needed for all array accesses. The compiler can sometimes turn   short `&&`, `||`, and `?:` expressions into straight-line code if it is safe   to evaluate all operands unconditionally. Consider using the [`ifelse`](@ref)   function instead of `?:` in the loop if it is safe to do so.
  * Accesses must have a stride pattern and cannot be "gathers" (random-index   reads) or "scatters" (random-index writes).
  * The stride should be unit stride.

!!! note
    The `@simd` does not assert by default that the loop is completely free of loop-carried memory dependencies, which is an assumption that can easily be violated in generic code. If you are writing non-generic code, you can use `@simd ivdep for ... end` to also assert that:


  * There exists no loop-carried memory dependencies
  * No iteration ever waits on a previous iteration to make forward progress.


In [28]:
function threaded_sum2(arr)
   @assert length(arr) % nthreads() == 0
   let results = zeros(eltype(arr), nthreads())
       @threads for tid in 1:nthreads()
           # split work
           acc = zero(eltype(arr))
           len = div(length(arr), nthreads())
           domain = ((tid-1)*len +1):tid*len
           @inbounds @simd for i in domain
               acc += arr[i]    
           end
           results[tid] = acc
       end
       sum(results)
    end
end

threaded_sum2 (generic function with 1 method)

In [29]:
@btime threaded_sum2($data)

  189.639 μs (2 allocations: 160 bytes)


786472.3368683401

| Threads | Skylake | AMD TR |
| --- | --- | --- |
| Base `sum` | 514.476 μs | 430.409 μs |
| 1 | 493.384 μs | 401.755 μs |
| 2 | 282.030 μs | 73.408 μs |
| 4 | 230.988 μs | 37.541 μs |
| 6 | X | 29.185 μs |
| 12 | X | 16.491 μs |
| 24 | X | 17.693 μs |

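Hyperthreading: going from 12 to 24 threads on the Threadripper (12 physical cores) gives no further speedup.

There is also a superlinear speedup from 1 to 2 threads on the Threadripper, due to a cache effect. (The data is 12 MB; 2×L3 = 16 MB.)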

## An example

In [30]:
using LinearAlgebra  # provides `det`

function myfun(rng::MersenneTwister)
    s = 0.0
    N = 10000
    for i = 1:N
        s += det(randn(rng, 3,3))
    end
    s/N
end


myfun (generic function with 1 method)

In [31]:
function bench(rgi)
    a = zeros(1000)
    @threads for i = 1:length(a)
        a[i] = myfun(rgi[threadid()])
    end
end

bench (generic function with 1 method)

In [32]:
rgi = [MersenneTwister(rand(UInt)) for _ in 1:nthreads()];

In [33]:
@btime bench($rgi)

  3.837 s (46897551 allocations: 3.56 GiB)


### Steps I took to optimize this code

1. Memory allocations in hot-loop
2. Eliminate allocs caused by rand
3. Investigate how det is implemented
4. Implement det!
5. Remove overhead to library call
6. Use profiling tools
7. Start using StaticArrays

Full story here: https://hackmd.io/s/BkyZ5Mmbb

In [34]:
using StaticArrays

In [35]:
function myfun_fast(rng::MersenneTwister)
    s = 0.0
    N = 10000
    for i in 1:N
        s += det(randn(rng, SMatrix{3, 3}))
    end
    s/N
end

myfun_fast (generic function with 1 method)

In [36]:
function bench_fast(rgi)
    a = zeros(1000)
    @threads for i in 1:length(a)
        @inbounds a[i] = myfun_fast(rgi[threadid()])
    end
end

bench_fast (generic function with 1 method)

In [37]:
rgi_fast = create_rngs();

In [38]:
result = @btime bench_fast($rgi_fast)

  365.109 ms (2 allocations: 7.98 KiB)


In [2]:
?@spawnat

No documentation found.

Binding `@spawnat` does not exist.


## Distributed computing and accelerated computing
### The Julia way!

Julia supports various forms of distributed computing.

1. A native master-worker system based on remote procedure calls
2. MPI through `MPI.jl`
3. `DistributedArrays.jl`

Julia also has support for GPU accelerated computing

1. Low-level (C kernel) based operations: `OpenCL.jl` and `CUDAdrv.jl`
2. Low-level (Julia kernel) based operations through `CUDAnative.jl` and
3. High-level vendor-specific abstractions: `CuArrays.jl` and `CLArrays.jl`
4. High-level libraries like `ArrayFire.jl` and `GPUArrays.jl`

### The Julia way! Tell us where your data is and your program will follow.

#### `broadcast` example

In [4]:
import Pkg

In [5]:
using Distributed

In [8]:
?addproc

search: addprocs

Couldn't find addproc
Perhaps you meant addprocs or nprocs


No documentation found.

Binding `addproc` does not exist.


In [12]:

addprocs(5) # adds 5 local worker processes, like raising the N in `julia -p N` by 5

5-element Array{Int64,1}:
 22
 23
 24
 25
 26

In [16]:
nprocs()

26

In [17]:
nworkers()

25

In [18]:
myid()

1
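The master-worker system can be sketched with a few calls from the `Distributed` standard library. `workers`, `remotecall_fetch` and `pmap` do not appear in the original notebook; they are used here as one plausible way to illustrate remote procedure calls, and the example assumes workers can be started on the local machine.

```julia
using Distributed

# Add two local workers only if none exist yet (assumption: local machine).
nprocs() == 1 && addprocs(2)

# Remote procedure call: ask each worker to run `myid()` and send back the result.
ids = [remotecall_fetch(myid, w) for w in workers()]
println("worker ids: ", ids)

# `pmap` hands the individual function calls out to the available workers.
squares = pmap(x -> x^2, 1:10)
println(squares)
```

Functions defined in your own code must exist on every worker before they can be called remotely, which is what `@everywhere` is for.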

In [14]:
#Pkg.add("DistributedArrays")

   Updating registry at `C:\Users\Victor\.julia\registries\General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
  Installed DistributedArrays ─ v0.6.4
   Updating `C:\Users\Victor\.julia\environments\v1.4\Project.toml`
  [aaf54ef3] + DistributedArrays v0.6.4
   Updating `C:\Users\Victor\.julia\environments\v1.4\Manifest.toml`
  [aaf54ef3] + DistributedArrays v0.6.4


In [20]:
#Pkg.add("CuArrays")

  Resolving package versions...
  Installed NNPACK_jll ──────── v2018.6.22+0
  Installed CpuId ───────────── v0.2.2
  Installed FoldingTrees ────── v1.0.0
  Installed GPUCompiler ─────── v0.2.0
  Installed VectorizationBase ─ v0.12.24
  Installed GPUArrays ───────── v3.4.1
  Installed LLVM ────────────── v1.7.0
  Installed CuArrays ────────── v2.2.2
  Installed SLEEFPirates ────── v0.5.5
  Installed CEnum ───────────── v0.3.0
  Installed Cthulhu ─────────── v1.2.0
  Installed Adapt ───────────── v1.1.0
  Installed NNlib ───────────── v0.7.3
  Installed CUDAdrv ─────────── v6.3.0
  Installed CUDAnative ──────── v3.2.0
  Installed SIMDPirates ─────── v0.8.16

In [21]:
@everywhere using DistributedArrays
using CuArrays

┌ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
└ @ Base loading.jl:1260
│         You will be able to use only the default Julia NNlib backend
└ @ NNlib C:\Users\Victor\.julia\packages\NNlib\sSn9M\src\NNlib.jl:14


LoadError: MethodError: no method matching @everywhere(::LineNumberNode, ::Module)
Closest candidates are:
  @everywhere(::LineNumberNode, ::Module, !Matched::Any) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\macros.jl:192
  @everywhere(::LineNumberNode, ::Module, !Matched::Any, !Matched::Any) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\macros.jl:197

In [22]:
B = ones(10_000) ./ 2;
A = ones(10_000) .* π;

In [23]:
C = 2 .* A ./ B;
all(C .≈ 4*π)

true

In [27]:
typeof(C)

Array{Float64,1}

In [28]:
dB = distribute(B);
dA = distribute(A);

In [29]:
dC = 2 .* dA ./ dB;
all(dC .≈ 4*π)

true

In [30]:
typeof(dC) # a 1-d distributed array of Float64 whose local chunks are Array{Float64,1}

DArray{Float64,1,Array{Float64,1}}

In [33]:
cuB = CuArray(B);
cuA = CuArray(A);

In [48]:
cuC = 2 .* cuA ./ cuB;
# Disclaimer: on Julia v0.6 some operations, e.g. `sin`, don't work here. Use `CUDAnative.sin` instead.
all(cuC .≈ 4*π)



true

In [49]:
typeof(cuC)

CuArray{Float64,1}

#### `power_method` example

In [44]:
using LinearAlgebra

In [59]:
using BenchmarkTools

In [60]:
nprocs()

26

In [45]:
function power_method(M, v)
    for i in 1:100
        v = M*v        # repeatedly creates a new vector and destroys the old v
        v /= norm(v)
    end
    
    return v, norm(M*v) / norm(v)  # or  (M*v) ./ v
end

power_method (generic function with 1 method)
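The comment in `power_method` notes that every iteration creates a new vector. A possible in-place variant is sketched below using `mul!` from `LinearAlgebra` and a scratch buffer created with `similar`. The name `power_method!` and the `iterations` keyword are illustrative assumptions, and the sketch has only been reasoned through for plain `Array`s, not for `CuArray` or `DArray`.

```julia
using LinearAlgebra

function power_method!(v, M; iterations = 100)
    tmp = similar(v)              # scratch buffer, allocated once
    for _ in 1:iterations
        mul!(tmp, M, v)           # tmp = M*v, written into the existing buffer
        v .= tmp ./ norm(tmp)     # normalize and store back into v in place
    end
    # Same eigenvalue estimate as in power_method above.
    return v, norm(M * v) / norm(v)
end

M0 = [2.0 1.0; 1.0 1.0]
v0 = rand(2)
vec, lambda = power_method!(v0, M0)
```

Note that `power_method!` overwrites its first argument, following the Julia convention that a trailing `!` marks a mutating function.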

In [46]:
M = [2. 1; 1 1]
v = rand(2)

2-element Array{Float64,1}:
 0.11178024138382048
 0.8609828805378716

In [61]:
@btime power_method(M, v)

  9.900 μs (202 allocations: 18.88 KiB)


([0.85065080835204, 0.5257311121191336], 2.618033988749895)

In [48]:
cuM = CuArray(M);
cuv = CuArray(v);

In [62]:
curesult = @btime power_method(cuM, cuv)

  14.156 ms (8015 allocations: 253.42 KiB)


([0.85065080835204, 0.5257311121191336], 2.618033988749895)

In [51]:
typeof(curesult)

Tuple{CuArray{Float64,1,Nothing},Float64}

In [52]:
dM = distribute(M);
dv = distribute(v);

In [63]:
result = @btime power_method(dM, dv)

  189.778 ms (104527 allocations: 4.47 MiB)


([0.85065080835204, 0.5257311121191336], 2.618033988749895)

In [55]:
typeof(result)

Tuple{DArray{Float64,1,Array{Float64,1}},Float64}

In [56]:
#?similar

In [57]:
#?findfirst

# MPI tutorial

## The problem: diffusion in a two-dimensional domain
http://www.claudiobellei.com/2018/09/30/julia-mpi/

### MPI commands
`MPI.Init()` - initializes the execution environment  
`MPI.COMM_WORLD` - represents the communicator, i.e., all processes available through the MPI application (every communication must be linked to a communicator)  
`MPI.Comm_rank(MPI.COMM_WORLD)` - determines the internal rank (id) of the process  
`MPI.Barrier(MPI.COMM_WORLD)` - blocks execution until all processes have reached this routine  
`MPI.Bcast!(buf, n_buf, rank_root, MPI.COMM_WORLD)` - broadcasts the buffer `buf` with size `n_buf` from the process with rank `rank_root` to all other processes in the communicator `MPI.COMM_WORLD`  
`MPI.Waitall!(reqs)` - waits for all MPI requests to complete (a request is a handle, in other words a reference, to an asynchronous message transfer)  
`MPI.REQUEST_NULL` - specifies that a request is not associated with any ongoing communication  
`MPI.Gather(buf, rank_root, MPI.COMM_WORLD)` - gathers the contents of `buf` from all processes onto the receiving process `rank_root`  
`MPI.Isend(buf, rank_dest, tag, MPI.COMM_WORLD)` - the message `buf` is sent asynchronously from the current process to the `rank_dest` process, with the message tagged with the `tag` parameter  
`MPI.Irecv!(buf, rank_src, tag, MPI.COMM_WORLD)` - asynchronously receives a message tagged `tag` from the source process of rank `rank_src` into the local buffer `buf`  
`MPI.Finalize()` - terminates the MPI execution environment  
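A few of these commands can be combined into a small program. This is a minimal sketch that assumes `MPI.jl` is installed and working; the file name `mpi_hello.jl` is hypothetical, and `MPI.Comm_size` is an additional MPI.jl function not listed above. The `Bcast!` call omits the optional count argument.

```julia
# Save as e.g. mpi_hello.jl (hypothetical file name) and launch with
#   mpirun -n 3 julia mpi_hello.jl
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)     # id of this process, starting at 0
nranks = MPI.Comm_size(comm)     # total number of processes

# Rank 0 fills the buffer; after Bcast! every rank holds the same values.
buf = rank == 0 ? collect(1.0:4.0) : zeros(4)
MPI.Bcast!(buf, 0, comm)

println("rank $rank of $nranks has buf = $buf")

MPI.Barrier(comm)                # wait until every rank has printed
MPI.Finalize()
```

Each process runs the whole script; only the value of `rank` differs between them.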

## MPI.jl stable documentation 
https://juliaparallel.github.io/MPI.jl/stable/examples/03-reduce/  
Run an example with `mpirun -n 3 julia MPI_hello.j`  
examples in /home/schang21/0721juliatest/ on PSI

In [67]:
#import Pkg; Pkg.add("MPI")
import MPI

  Resolving package versions...
  Installed OpenMPI_jll ────── v4.0.2+2
  Installed MicrosoftMPI_jll ─ v10.1.2+3
  Installed MPICH_jll ──────── v3.3.2+10
  Installed MPI ────────────── v0.15.0
   Updating `C:\Users\Victor\.julia\environments\v1.4\Project.toml`
  [da04e1cc] + MPI v0.15.0
   Updating `C:\Users\Victor\.julia\environments\v1.4\Manifest.toml`
  [da04e1cc] + MPI v0.15.0
  [7cb0a576] + MPICH_jll v3.3.2+10
  [9237b28f] + MicrosoftMPI_jll v10.1.2+3
  [fe0851c0] + OpenMPI_jll v4.0.2+2
   Building MPI → `C:\Users\Victor\.julia\packages\MPI\k7f4E\deps\build.log`
┌ Info: Precompiling MPI [da04e1cc-30fd-572f-bb4f-1f8673147195]
└ @ Base loading.jl:1260


In [68]:
?MPI.Bcast!

```
Bcast!(buf[, count=length(buf)], root::Integer, comm::Comm)
```

Broadcast the first `count` elements of the buffer `buf` from `root` to all processes.

# External links

  * `MPI_Bcast` man page: [OpenMPI](https://www.open-mpi.org/doc/current/man3/MPI_Bcast.3.php), [MPICH](https://www.mpich.org/static/docs/latest/www3/MPI_Bcast.html)


In [72]:
methods(MPI.bcast)

In [73]:
?mod

search: mod modf mod1 module mod2pi Module chmod invmod fldmod fldmod1



```
mod(x::Integer, r::AbstractUnitRange)
```

Find `y` in the range `r` such that $x ≡ y (mod n)$, where `n = length(r)`, i.e. `y = mod(x - first(r), n) + first(r)`.

See also: [`mod1`](@ref).

# Examples

```jldoctest
julia> mod(0, Base.OneTo(3))
3

julia> mod(3, 0:2)
0
```

!!! compat "Julia 1.3"
    This method requires at least Julia 1.3.


---

```
mod(x, y)
rem(x, y, RoundDown)
```

The reduction of `x` modulo `y`, or equivalently, the remainder of `x` after floored division by `y`, i.e. `x - y*fld(x,y)` if computed without intermediate rounding.

The result will have the same sign as `y`, and magnitude less than `abs(y)` (with some exceptions, see note below).

!!! note
    When used with floating point values, the exact result may not be representable by the type, and so rounding error may occur. In particular, if the exact result is very close to `y`, then it may be rounded to `y`.


```jldoctest
julia> mod(8, 3)
2

julia> mod(9, 3)
0

julia> mod(8.9, 3)
2.9000000000000004

julia> mod(eps(), 3)
2.220446049250313e-16

julia> mod(-eps(), 3)
3.0
```

---

```
rem(x::Integer, T::Type{<:Integer}) -> T
mod(x::Integer, T::Type{<:Integer}) -> T
%(x::Integer, T::Type{<:Integer}) -> T
```

Find `y::T` such that `x` ≡ `y` (mod n), where n is the number of integers representable in `T`, and `y` is an integer in `[typemin(T),typemax(T)]`. If `T` can represent any integer (e.g. `T == BigInt`), then this operation corresponds to a conversion to `T`.

# Examples

```jldoctest
julia> 129 % Int8
-127
```


In [75]:
mod(-1,3)

2

In [76]:
mod(0,3)

0
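One place where this floored `mod` is handy in an MPI setting is wrapping neighbor ranks around a periodic domain. Whether the linked tutorial does exactly this is an assumption; the helper `ring_neighbors` below is hypothetical and only illustrates the arithmetic, with ranks counted from 0 as in MPI.

```julia
# Hypothetical helper: left and right neighbors of `rank` on a periodic ring
# of `nranks` processes (0-based ranks).
ring_neighbors(rank, nranks) = (mod(rank - 1, nranks), mod(rank + 1, nranks))

# With 3 processes, rank 0 has rank 2 on its left because mod(-1, 3) == 2.
for r in 0:2
    left, right = ring_neighbors(r, 3)
    println("rank $r: left = $left, right = $right")
end
```

Using `rem` (or `%`) instead would give `-1` for the left neighbor of rank 0, since `rem` keeps the sign of its first argument.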

In [77]:
?fill!

search: fill! fill dfill cufill finally findall



```
fill!(A, x)
```

Fill array `A` with the value `x`. If `x` is an object reference, all elements will refer to the same object. `fill!(A, Foo())` will return `A` filled with the result of evaluating `Foo()` once.

# Examples

```jldoctest
julia> A = zeros(2,3)
2×3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> fill!(A, 2.)
2×3 Array{Float64,2}:
 2.0  2.0  2.0
 2.0  2.0  2.0

julia> a = [1, 1, 1]; A = fill!(Vector{Vector{Int}}(undef, 3), a); a[1] = 2; A
3-element Array{Array{Int64,1},1}:
 [2, 1, 1]
 [2, 1, 1]
 [2, 1, 1]

julia> x = 0; f() = (global x += 1; x); fill!(Vector{Int}(undef, 3), f())
3-element Array{Int64,1}:
 1
 1
 1
```

---

```
fill!(cb::CircularBuffer, data)
```

Grows the buffer up-to capacity, and fills it entirely. It doesn't overwrite existing elements.
