# Parallel and Distributed Computing

**Multithreading** refers to running multiple threads concurrently within a single process, where all threads share the memory of that process. **Multiprocessing** refers to running multiple processes concurrently, each with its own memory space and possibly on different processors or machines; each process can itself run one or more threads.

![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*hZ3guTdmDMXevFiT5Z3VrA.png)


## Multithreading
Multithreading is a programming technique that allows **multiple threads** of execution to run concurrently within a single process. Julia provides built-in support for multithreading, making it easy to write concurrent code. To use multithreading in Julia, you can use the `Threads` module, which ships with Julia.

The number of execution threads is controlled either by using the `-t`/`--threads` command line argument 

```shell
julia --threads 10 my_script.jl
```

or by using the `JULIA_NUM_THREADS` environment variable. This can also be changed in the VS Code settings.


When both `JULIA_NUM_THREADS` and `-t`/`--threads` are specified, then `-t`/`--threads` takes precedence.

The number of threads can either be specified as an integer (`--threads=4`) or as auto (`--threads=auto`), where auto sets the number of threads to the number of local CPU threads.

To check the number of threads available:

In [1]:
Threads.nthreads()

5


Multithreading in Julia is **super easy**: just put `Threads.@threads` in front of the loop you want to parallelize.

Here is a toy example where each element of an array is set to the id of the thread that processed that iteration.

In [2]:
a = zeros(10)

Threads.@threads for i = 1:10
    a[i] = Threads.threadid()
end
println(a)

[4.0, 4.0, 5.0, 5.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]


Here is another example to make sure that `Threads.@threads` does indeed run code in parallel: we compare a simple loop against a loop using `Threads.@threads`.

In [4]:
myfun() = sleep(1)
function myfun_loop()
    for i = 1:10
        myfun()
    end
end

using BenchmarkTools
@btime myfun_loop()

  10.022 s (358 allocations: 10.03 KiB)


In [5]:
function myfun_loop_multithreading()
    Threads.@threads for i = 1:10
        myfun()
    end
end

@btime myfun_loop_multithreading()


  2.004 s (143 allocations: 5.84 KiB)


### Be careful with race conditions!
A race condition occurs when multiple processes or threads access and modify shared resources or data concurrently without synchronization. The outcome of the program then depends on the order and timing of execution, which can lead to unexpected and undesirable results. In the example below, several threads call `push!` on the same array at the same time, so some elements are lost.

In [9]:
a = Float64[]
Threads.@threads for i in 1:1000
    push!(a, i)
end
println(length(a))
@assert length(a) == 1000

567


AssertionError: length(a) == 1000

#### `lock`

The `lock` function can be used to prevent race conditions: only one thread at a time can execute the code inside the `lock(lk) do ... end` block.


In [10]:
a = []
lk = ReentrantLock()
Threads.@threads for i in 1:100
    x = i^2
    lock(lk) do
        push!(a, x)
    end
end
println(length(a)) # == 100

100
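
When the number of iterations is known in advance, the lock can often be avoided altogether: preallocate the output and let each iteration write to its own index. Here is a minimal sketch of that idea (the name `squares` is only chosen for illustration):

```julia
# Preallocate the output: each iteration writes to a distinct index,
# so no two threads ever modify the same element and no lock is needed.
squares = Vector{Int}(undef, 100)

Threads.@threads for i in eachindex(squares)
    squares[i] = i^2
end

println(length(squares))
```

Unlike the `push!` version, the results also end up in a deterministic order, since position `i` always holds the result of iteration `i`.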


### Overhead
Parallelization can improve performance, but starting and scheduling threads has a cost. For multithreading to be worthwhile, you need a reasonably large amount of "real work". Conversely, with small workloads, the parallel version might be slower than the serial version.

In [11]:
N = 2^30
x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N);  # a vector filled with 2.0


In [12]:
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

using BenchmarkTools
@btime sequential_add!($y, $x)


  2.104 s (0 allocations: 0 bytes)


In [13]:
function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@btime parallel_add!($y, $x)



  1.559 s (34 allocations: 2.73 KiB)


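In the above, without overhead, we'd expect a 5-fold decrease in run time since we have 5 threads. However, we get much less than that. Besides threading overhead, this element-wise addition does very little computation per element, so it is probably limited by memory bandwidth rather than by the number of threads.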
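
The opposite regime, where the work is too small for threading to pay off, can be explored with the sketch below. It repeats the two functions so that it is self-contained, and uses very small vectors (the names `x_small`, `y_seq` and `y_par` and the length of 100 are arbitrary choices for illustration). No timings are shown because they depend on the machine and on the number of threads.

```julia
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

x_small = fill(1.0f0, 100)
y_seq = fill(2.0f0, 100)
y_par = fill(2.0f0, 100)

# Warm-up calls on copies, so that compilation time is not measured below
sequential_add!(copy(y_seq), x_small)
parallel_add!(copy(y_par), x_small)

t_seq = @elapsed sequential_add!(y_seq, x_small)
t_par = @elapsed parallel_add!(y_par, x_small)
println("sequential: ", t_seq, " s, threaded: ", t_par, " s")
```

A single `@elapsed` measurement is noisy; `@btime` from `BenchmarkTools` gives more reliable numbers if you want to compare the two versions carefully.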

### 👍 How I use multithreading in my simulations

I typically have an expensive function that I want to call multiple times with different arguments.



In [14]:
function simul(noise, batch_size)
    # do something with noise and batch_size
    sleep(1)
    return randn()
end

simul (generic function with 1 method)


`simul` does some simulation based on the `noise` and `batch_size` parameters, then returns the simulation result.

I want to loop through all combinations of the proposed arguments. Let's do so by creating a dictionary `pars` for each combination of arguments, and adding it to an array `pars_arr`.


In [15]:
pars_arr = Dict[]

noises = [0.1, 0.2, 0.3]
batch_sizes = [1000, 2000, 3000]

for noise in noises, batch_size in batch_sizes
    pars = Dict()
    pars["noise"] = noise
    pars["batch_size"] = batch_size
    push!(pars_arr, pars)
end

We'll also create a `DataFrame` to store the results.


In [16]:
using DataFrames
df_results = DataFrame("Result" => [],
                    "noise" => [],
                    "batch_size" => [])

| Row | Result | noise | batch_size |
|:----|:-------|:------|:-----------|
|     | Any    | Any   | Any        |


Here is how I would run the simulations.


In [17]:
using ProgressMeter
progr = Progress(length(pars_arr), showspeed = true, barlen = 10)

loc = Threads.ReentrantLock()

Threads.@threads for k in 1:length(pars_arr)
    p = pars_arr[k]
    noise = p["noise"]
    batch_size = p["batch_size"]
    try
        out = simul(noise, batch_size)
        lock(loc) do
            push!(df_results, (out, noise, batch_size));
        end
    catch e
        println("problem with p = $(pars_arr[k])")
        println(e)
    end
    next!(progr)
end

Progress:  22%|██▎       |  ETA: 0:00:06 ( 0.80  s/it)

Progress:  67%|██████▋   |  ETA: 0:00:01 ( 0.43  s/it)

Progress:  78%|███████▊  |  ETA: 0:00:01 ( 0.40  s/it)

Progress: 100%|██████████| Time: 0:00:02 ( 0.31  s/it)


I like using `ProgressMeter` to get a sense of how far along my computation is.


### Atomic operations
Note that you can also use atomic operations; see the [dedicated section](https://docs.julialang.org/en/v1/manual/multi-threading/#Atomic-Operations) in the Julia documentation. Atomic operations serve a similar purpose to `lock`: they may be faster, but they are limited to simple updates on primitive values, such as incrementing a counter.
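
As a minimal sketch, here is a shared counter incremented from several threads with `Threads.Atomic` and `Threads.atomic_add!` (the variable name `counter` is only for illustration):

```julia
# A counter that can be safely updated from several threads
counter = Threads.Atomic{Int}(0)

Threads.@threads for i in 1:1000
    # Indivisible read-modify-write: no increment can be lost
    Threads.atomic_add!(counter, 1)
end

println(counter[])  # prints 1000
```

With a plain `Int` and `counter += 1` instead, increments from different threads could overwrite each other, as in the `push!` example above.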




## Multi-processing

### `Distributed`
Julia also has a built-in library for distributed parallel computing, called `Distributed`. Although it is generally more difficult to deploy than multithreading, it may be useful on certain occasions. Distributed computing is useful when you have a lot of work that cannot be split among multiple threads and needs to be distributed across multiple machines.

Monte Carlo simulations are another good use case for distributed computing.




`julia -p 4` starts Julia with `4` worker processes on the local machine. Alternatively, you can add workers from within Julia:
```julia
using Distributed
addprocs(4)  # add 4 worker processes
```

The most straightforward way of performing distributed computing is using  `pmap`. A good tutorial on how to use `pmap` can be found [here](https://github.com/Arpeggeo/julia-distributed-computing).
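
A minimal sketch of `pmap` on the local machine could look as follows. The function `slow_square`, the number of workers, and the input range are hypothetical and chosen only for illustration:

```julia
using Distributed

# Start 2 local worker processes if none have been added yet (arbitrary count)
nprocs() == 1 && addprocs(2)

# Functions called on the workers must be defined on every process.
# `slow_square` is a hypothetical stand-in for an expensive computation.
@everywhere function slow_square(x)
    sleep(0.1)
    return x^2
end

# pmap hands each element to an available worker and
# returns the results in the order of the inputs
results = pmap(slow_square, 1:8)
println(results)  # prints [1, 4, 9, 16, 25, 36, 49, 64]
```

Note the `@everywhere` macro: workers are separate processes with their own memory, so any function or package they need must be loaded on each of them.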

Note that [ClusterManagers.jl](https://github.com/JuliaParallel/ClusterManagers.jl) may be useful for distributed computing.




#### `MPI.jl`
The `MPI.jl` package provides a Julia interface to MPI (Message Passing Interface). MPI is a low-level message-passing standard that lets processes running on different nodes of a distributed system communicate. It may be a better choice than `Distributed` due to its interoperability, customization options, performance, and scalability on large-scale systems. If you have never heard of it, then forget about it!




## GPU computing

Multiple dispatch allows your code to be executed on GPUs! A function written for a generic `AbstractArray` works on both CPU and GPU arrays. Here is how.



### GPU programming with CUDA

In [5]:

function myfun(a::AbstractArray, b::AbstractArray)
    return sum(a.^2 * b)
end

# generate CPU arrays
a = rand(Float32, 1000, 1000)
b = rand(Float32, 1000, 1000)

using BenchmarkTools
@btime myfun($a, $b)

  5.458 ms (4 allocations: 7.63 MiB)


1.6680128f8

In [6]:
using CUDA

@assert CUDA.functional()

for d in devices()
    println(d)
end
CUDA.device!(1)
CUDA.current_device()


CuDevice(0)
CuDevice(1)
CuDevice(2)
CuDevice(3)
CuDevice(4)
CuDevice(5)
CuDevice(6)
CuDevice(7)


CuDevice(1): NVIDIA TITAN RTX

In [7]:
a_cuda = CUDA.rand(1000, 1000)
b_cuda = CUDA.rand(1000, 1000)


1000×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.834464    0.457023   0.717635     …  0.484234  0.397357   0.0324202
 0.288913    0.796186   0.255861        0.265453  0.427988   0.0370142
 0.122341    0.97253    0.976035        0.77248   0.706358   0.260713
 0.351675    0.446846   0.671586        0.920287  0.500553   0.374924
 0.59903     0.312281   0.092035        0.35717   0.295004   0.95406
 0.782895    0.702707   0.226901     …  0.995633  0.707188   0.841076
 0.831006    0.409586   0.991196        0.725811  0.400745   0.936713
 0.680695    0.935281   0.69477         0.928192  0.585817   0.400571
 0.658063    0.521759   0.968131        0.439888  0.717022   0.961326
 0.352665    0.412369   0.176933        0.404161  0.574798   0.888975
 ⋮                                   ⋱                       
 0.639501    0.0262803  0.377526        0.923417  0.340922   0.543892
 0.539695    0.839126   0.24328         0.596974  0.550674   0.181971
 0.800403    0.437032   0.109061        0.7

In [8]:
@btime myfun($a_cuda, $b_cuda)

  214.489 μs (153 allocations: 7.11 KiB)


1.6638253f8


### GPU programming on macOS
```julia
using Metal
a_mtl = MtlArray(a)
b_mtl = MtlArray(b)

@btime myfun($a_mtl, $b_mtl)
```


### Additional resources and acknowledgements
- [Discourse category: Julia at scale](https://discourse.julialang.org/c/domain/parallel/34)
- [Further explanations on Multithreading vs Multiprocessing computing](https://towardsdatascience.com/multithreading-and-multiprocessing-in-10-minutes-20d9b3c6a867)
- [Julia multi threading](https://docs.julialang.org/en/v1/manual/multi-threading/)
- [CUDA.jl documentation](https://github.com/JuliaGPU/CUDA.jl)