/
data-parallelism.jl
116 lines (80 loc) · 2.71 KB
/
data-parallelism.jl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
#=
# Data parallelism
We'll look at two examples of using [data
parallelism](https://juliafolds.github.io/data-parallelism/) to speed up
computation using multiple CPUs
## Example 1: reducing a list
=#
using PureFun, FLoops, BenchmarkTools
using PureFun.Chunky
# We are tasked with optimizing the following function:
sequential(xs) = mapreduce(round ∘ sin, +, xs);
#=
For any applications that require reducing over an entire collection, it makes
sense to reach for a [`PureFun.Chunky.@list`](@ref):
=#
Chunky.@list(ChunkyRandomAccessList,
list = PureFun.RandomAccess.List,
chunk = PureFun.Contiguous.StaticChunk{8})
#=
We start out by benchmarking the current solution:
```julia
julia> @btime(sequential(xs),
setup = xs = ChunkyRandomAccessList(rand(Int, 5_000)),
evals = 10, samples=300);
107.854 μs (1256 allocations: 58.81 KiB)
```
The lists in PureFun.jl implement the
[`SplittablesBase`](https://juliafolds.github.io/SplittablesBase.jl/stable/)
interface, and as a result can be used with [a variety of parallel
algorithms](https://juliafolds.github.io/data-parallelism/). Here we use
[`Floops.jl`](https://juliafolds.github.io/FLoops.jl/dev/) to parallelize our
reduction:
=#
function parallel(xs)
@floop for x in xs
@reduce s += round(sin(x))
end
return s
end;
# A quick test to make sure the function works as expected:
test = ChunkyRandomAccessList(rand(Int, 100_000))
@assert sequential(test) == parallel(test)
#=
And indeed, we see a decent speedup:
```julia
julia> @btime(parallel(xs),
setup = xs = ChunkyRandomAccessList(rand(Int, 5_000)),
evals = 10, samples = 300)
53.725 μs (793 allocations: 28.46 KiB)
```
## Example 2: parallel heap sort
=#
using PureFun, Folds, BenchmarkTools
using PureFun.Pairing
# If comparisons are expensive, then constructing a heap can get pretty slow:
xs = rand(Int, 1_000);
#=
```julia
julia> @btime Pairing.Heap($xs, Base.Order.By(sin));
45.333 μs (2998 allocations: 93.69 KiB)
```
In [How to Think about Parallel Programming:
Not!](https://youtu.be/dPK6t7echuA) Guy Steele describes a useful strategy for
parallel programming:
- from each input construct a *singleton* solution
- merge solutions using an *associative* merge operation
We can use that strategy to construct a heap:
=#
const ∅ = Pairing.Heap{Int}(Base.Order.By(sin))
singleton(x) = push(∅, x)
test_seq = Pairing.Heap(xs, Base.Order.By(sin))
test_par = Folds.mapreduce(singleton, merge, xs, init=∅)
@assert all(test_seq .== test_par)
#=
And once again, we see a decent speedup:
```julia
julia> @btime Folds.mapreduce(singleton, merge, $xs, init=∅);
18.625 μs (7016 allocations: 221.66 KiB)
```
=#