<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Multithreaded Algorithms</title>
<meta name="description" content="">
<!-- Mobile viewport optimized: h5bp.com/viewport -->
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="css/style.css">
<!-- More ideas for your <head> here: h5bp.com/d/head-Tips -->
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</head>
<body>
<div class="navbar navbar-fixed-bottom">
<div class="navbar-inner">
<div class="container">
<a class="brand" href="#home-unit">Multithreaded Algorithms</a>
<ul class="nav pull-right">
<li><a href="#introduction-unit">Introduction</a></li>
<li><a href="#matrices-unit">Matrix Multiplication</a></li>
<li><a href="#merge-sort-unit">Merge Sort</a></li>
<li><a href="#glossary-unit">Glossary</a></li>
</ul>
</div>
</div>
</div>
<div class="container hero-unit" id="home-unit">
<h1>Multithreaded Algorithms</h1>
<p>An introduction to multithreaded algorithms written by Shantanu Bala.<br />This page is an honors enrichment project for Dr. Andrea Richa's CSE310 Spring 2012 class.</p>
</div>
<div class="container" id="introduction-unit">
<div class="row separated">
<h1>Introduction</h1>
</div>
<div class="row separated">
<div class="span4">
<h3>What are they?</h3>
<p><strong>Multithreaded algorithms</strong> are algorithms that are specifically designed to solve a problem while leveraging multiple CPU cores. This increases the <strong>throughput</strong> of an application by allowing parts of the algorithm to be executed in parallel.</p>
</div>
<div class="span4">
<h3>How do they work?</h3>
<p>The <strong>throughput</strong>, or the number of times an algorithm can solve a problem within a given span of time, increases with multithreading because the main problem is split into subproblems whose solutions do not depend on one another (see the sketch below).</p>
</div>
<div class="span4">
<h3>What are limitations?</h3>
<p>Parallel algorithms are subject to Amdahl's Law, which describes the limits of multithreading as a method of increasing throughput. Because an algorithm's instructions can rarely be executed 100% in parallel, the throughput gain from multithreading is bounded.</p>
</div>
</div>
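<div class="row separated">
<div class="span12">
<p>As a minimal, self-contained Python sketch of this idea (the function <strong>work</strong> and the chunk sizes here are invented purely for illustration), independent subproblems can be handed to separate processes and their results combined once all of them finish:</p>
<pre class="prettyprint">
from concurrent.futures import ProcessPoolExecutor

def work(chunk):
    # A stand-in for a CPU-bound subproblem that does not depend
    # on any other subproblem's result.
    return sum(i * i for i in chunk)

if __name__ == '__main__':
    # Split one large problem into four independent subproblems.
    chunks = [range(n, n + 250000) for n in range(0, 1000000, 250000)]
    with ProcessPoolExecutor() as pool:
        # Each chunk can run on its own CPU core.
        total = sum(pool.map(work, chunks))
    print(total)
</pre>
</div>
</div>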
<div class="row separated">
<div class="span12">
<h3>How does Amdahl's Law affect throughput?</h3>
</div>
<div class="span6">
<p>Although the total computational power of a processor increases with the addition of each CPU core, the <em>actual</em> increase in an algorithm's throughput follows an asymptotic curve, as shown in the graph on the right [<a href="http://en.wikipedia.org/wiki/Amdahl%27s_law">source</a>].</p>
<p>This asymptotic curve is the basis for Amdahl's Law: because some portion of any algorithm is inherently sequential, an algorithm can never be executed 100% in parallel. As a result, each additional CPU core increases throughput by less than the previously added cores did.</p>
<p>The green, purple, red, and blue lines in the graph show the maximum speedup afforded by additional CPU cores when 95%, 90%, 75%, and 50% of a program's instructions, respectively, can be parallelized.</p>
<p>When writing a multithreaded algorithm, it is important to understand what proportion of the algorithm's instructions can be executed in parallel without affecting the result. It is equally important to understand the capabilities of the hardware that will run the algorithm: a multithreaded program will be <em>slower</em> on a single CPU core, because additional resources are spent managing the individual threads.</p>
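<p>Formally, if a fraction <em>P</em> of a program's execution can be parallelized and <em>N</em> cores are available, Amdahl's Law bounds the overall speedup:
<br />
$$ S(N) = \frac{1}{(1 - P) + P/N} $$
For example, with <em>P</em> = 0.95 the speedup can never exceed 1/(1 - 0.95) = 20, no matter how many cores are added.</p>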
</div>
<div class="span6">
<div class="thumbnail">
<img src="img/amdahl.png" alt="Graph of Amdahl's Law">
</div>
</div>
</div>
<div class="row separated">
<div class="span12">
<h3>Supercomputers</h3>
</div>
<div class="span4">
<p>One excellent example of highly-parallelized hardware is the IBM Blue Gene architecture for supercomputers. Along with other supercomputers, implementations of the Blue Gene architecture have produced data centers with nearly 300,000 CPU cores processing data simultaneously [<a href="http://en.wikipedia.org/wiki/Blue_gene">source</a>].</p>
<p>Currently, one of the primary applications of multithreaded algorithms is the processing of extremely large data sets. When dealing with terabytes of raw data, researchers at institutions like the Argonne National Laboratory must develop multithreaded algorithms for massively parallel hardware in order to avoid prohibitively long processing times (processing a large data set can take years on a single core!).</p>
</div>
<div class="span8">
<div id="blue-gene-carousel" class="carousel">
<div class="carousel-inner">
<div class="active item">
<img src="img/blue-gene-center.jpg" alt="IBM Blue Gene Data Center">
<div class="carousel-caption">
<p>A photograph of an IBM Blue Gene data center at Argonne National Laboratory.</p>
</div>
</div>
<div class="item">
<img src="img/blue-gene-cabinet.jpg" alt="IBM Blue Gene Server Cabinet">
<div class="carousel-caption">
<p>An up-close image of a Blue Gene server cabinet from Argonne National Laboratory.</p>
</div>
</div>
<div class="item">
<img src="img/blue-gene-card.jpg" alt="IBM Blue Gene Server Board">
<div class="carousel-caption">
<p>Arrays of IBM Blue Gene processors used in the server cabinets of a data center.</p>
</div>
</div>
</div>
<!-- Carousel nav -->
<a class="carousel-control left" href="#blue-gene-carousel" data-slide="prev">‹</a>
<a class="carousel-control right" href="#blue-gene-carousel" data-slide="next">›</a>
</div>
</div>
</div>
<div class="row separated">
<div class="span12"><h3>Computer Graphics: Graphics Cards</h3></div>
<div class="span6">
<p>This video is a demonstration by Adam Savage and Jamie Hyneman from the television series Mythbusters. Adam and Jamie demonstrate the use of multi-core processors and parallelization in Graphics Processing Units (GPUs), which enables GPUs to display graphical information with a much higher throughput than an ordinary CPU.</p>
</div>
<div class="span6">
<iframe width="460" height="264" src="http://www.youtube.com/embed/0udMBdo0Rac" frameborder="0" allowfullscreen></iframe>
</div>
</div>
<div class="row">
<div class="span12"><h3>Machine Learning: Neural Networks</h3></div>
<div class="span6">
<p>This video is an introduction to a concept from Machine Learning: Neural Networks. Neural networks operate similarly to neurons in the human brain. Since machine-learning processes like neural networks require lengthy <em>training</em>, in which the data provided to the algorithm is used to build a computational model, multithreading can make robots and other computer systems more responsive by increasing the throughput of their learning algorithms.</p>
</div>
<div class="span6">
<iframe width="460" height="342" src="http://www.youtube.com/embed/DG5-UyRBQD4" frameborder="0" allowfullscreen></iframe>
</div>
</div>
</div>
<div class="container" id="matrices-unit">
<div class="row separated">
<h1>Matrix Multiplication</h1>
</div>
<div class="row separated">
<div class="span12"><h3>Pseudocode for Multithreaded Matrix Multiplication</h3></div>
<div class="span7">
<pre class="prettyprint">
function multiplyMatrices(C, A, B):
    n = A.rows
    if n == 1:
        C[1][1] = A[1][1] * B[1][1]
    else:
        T = new Matrix[n][n]
        a1 = new Matrix[n/2][n/2] = A[1][1]...A[n/2][n/2]
        a2 = new Matrix[n/2][n/2] = A[n/2+1][1]...A[n][n/2]
        a3 = new Matrix[n/2][n/2] = A[1][n/2+1]...A[n/2][n]
        a4 = new Matrix[n/2][n/2] = A[n/2+1][n/2+1]...A[n][n]
        b1 = new Matrix[n/2][n/2] = B[1][1]...B[n/2][n/2]
        b2 = new Matrix[n/2][n/2] = B[n/2+1][1]...B[n][n/2]
        b3 = new Matrix[n/2][n/2] = B[1][n/2+1]...B[n/2][n]
        b4 = new Matrix[n/2][n/2] = B[n/2+1][n/2+1]...B[n][n]
        c1 = new Matrix[n/2][n/2] = C[1][1]...C[n/2][n/2]
        c2 = new Matrix[n/2][n/2] = C[n/2+1][1]...C[n][n/2]
        c3 = new Matrix[n/2][n/2] = C[1][n/2+1]...C[n/2][n]
        c4 = new Matrix[n/2][n/2] = C[n/2+1][n/2+1]...C[n][n]
        t1 = new Matrix[n/2][n/2] = T[1][1]...T[n/2][n/2]
        t2 = new Matrix[n/2][n/2] = T[n/2+1][1]...T[n][n/2]
        t3 = new Matrix[n/2][n/2] = T[1][n/2+1]...T[n/2][n]
        t4 = new Matrix[n/2][n/2] = T[n/2+1][n/2+1]...T[n][n]
        spawn multiplyMatrices(c1, a1, b1)
        spawn multiplyMatrices(c3, a1, b3)
        spawn multiplyMatrices(c2, a2, b1)
        spawn multiplyMatrices(c4, a2, b3)
        spawn multiplyMatrices(t1, a3, b2)
        spawn multiplyMatrices(t3, a3, b4)
        spawn multiplyMatrices(t2, a4, b2)
        multiplyMatrices(t4, a4, b4)
        sync
        parallel for i = 1 to n:
            parallel for j = 1 to n:
                C[i][j] = C[i][j] + T[i][j]
</pre>
</div>
<div class="span5">
<p><br />Lines 3-4 of this pseudocode handle the base case where two <em>1x1</em> matrices are being multiplied. In this situation, the result is just the product of the first elements of the respective matrices.</p>
<p><strong>T</strong> is a temporary matrix.</p>
<p><br /><br /><br /><br />The matrices <strong>A</strong>, <strong>B</strong>, <strong>C</strong>, and <strong>T</strong> are split into 4 even <em>(n/2)x(n/2)</em> submatrices.</p>
<p><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />Parallel recursive calls to <strong>multiplyMatrices</strong> are spawned, multiplying the submatrices from the previous step such that <strong>A</strong> and <strong>B</strong> are broken down until the base case (the multiplication of two <em>1x1</em> matrices) is reached.</p>
<p><br /><br /><br />The <strong>sync</strong> forces the parallel recursive calls to <strong>multiplyMatrices</strong> to be completed before the results are combined.</p>
<p>Since each entry of a <a href="http://en.wikipedia.org/wiki/Matrix_multiplication">matrix product</a> is the dot product of a row of the first matrix and a column of the second, the values present in <strong>T</strong> and <strong>C</strong> must be added together to arrive at the final answer.</p>
</div>
<div class="span6">
<h3>Explanation of Matrix Multiplication</h3>
<p>The following diagram represents the subproblems of matrix multiplication: the dot product of the row and column vectors of two matrices [<a href="http://en.wikipedia.org/wiki/Matrix_multiplication">source</a>].</p>
<p><img src="img/matrix-mult-diagram.png" alt="Matrix Multiplication"></p>
<p>For more information, see <a href="http://en.wikipedia.org/wiki/Matrix_multiplication">Wikipedia</a> and the <a href="http://www.khanacademy.org/math/linear-algebra/v/matrix-multiplication--part-1">Khan Academy</a>.</p>
<p>Below, you will find Python code for a run-time demonstration of multithreaded matrix multiplication, along with sample output from 20 iterations of single-threaded and multithreaded multiplication of two 200x200 matrices. Rather than the recursive splitting shown in the pseudocode, the demonstration divides the first matrix into bands of rows and multiplies each band by the second matrix on its own thread. The benchmark times indicate total CPU time, including the overhead required by the OS for spawning and synchronizing threads; note also that CPython's global interpreter lock prevents pure-Python threads from executing CPU-bound code in parallel, which is why the threaded version is consistently slower here.</p>
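<p>The <strong>spawn</strong> and <strong>sync</strong> keywords in the pseudocode map naturally onto Python's <code>threading</code> module. As a minimal sketch (an illustration only; the helper names <strong>spawn</strong> and <strong>sync</strong> are invented here, not part of the demonstration below):</p>
<pre class="prettyprint">
import threading

def spawn(fn, *args):
    # "spawn": begin executing fn(*args) on a new thread and
    # immediately return a handle to the caller.
    t = threading.Thread(target=fn, args=args)
    t.start()
    return t

def sync(handles):
    # "sync": block until every spawned call has completed.
    for t in handles:
        t.join()

# Example: run two independent calls in parallel, then wait for both.
def report(name):
    print(name, 'finished')

handles = [spawn(report, 'left'), spawn(report, 'right')]
sync(handles)
</pre>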
</div>
<div class="span6">
<h3>Analysis of the Algorithm</h3>
<p>
On a single-core CPU, the total amount of <strong>work</strong> done by the multithreaded matrix multiplication algorithm presented above is described by the recurrence:
<br />
$$ M_{1}(n) = 8M_{1}(n/2) + \Theta(n^2) = \Theta(n^3) $$
</p>
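<p>The closed form follows from case 1 of the master theorem, because the cost of the eight recursive multiplications dominates the cost of the parallel addition loop:
<br />
$$ n^{\log_{2} 8} = n^{3} = \Omega(n^{2 + \epsilon}) $$
</p>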
<p>
On a processor with an unlimited number of cores, the algorithm's <strong>span</strong> is modeled by the recurrence below, where the logarithmic term is the span of the doubly nested parallel <strong>for</strong> loop:<br />
$$ M_{\infty}(n) = M_{\infty}(n/2) + \Theta(\log n) = \Theta(\log^2 n) $$
</p>
<p>As indicated by the above analysis, the <strong>parallelism</strong> of the algorithm is:<br />
$$ M_{1}(n)/M_{\infty}(n) = \Theta(n^3 / \log^2 n)$$
</p>
<p>Since matrix multiplication relies on the values of subproblems that are easily calculated in parallel, its parallelism is much higher than that of most parallel algorithms: for 1000x1000 matrices, for example, the parallelism is roughly <em>10<sup>9</sup>/10<sup>2</sup> = 10<sup>7</sup></em>.</p>
</div>
</div>
<div class="row separated">
<div class="span6">
<h3>Run-time Demonstration</h3>
<pre class="prettyprint">
import threading
import random
import time
MATRIX_SIZE = (200, 200)
TOTAL_ITERATIONS = 20
def zero_matrix(m,n):
new_matrix = [[0 for row in range(n)] for col in range(m)]
return new_matrix
def random_matrix(m,n):
new_matrix = [[random.random() for row in range(n)] for col in range(m)]
return new_matrix
def print_matrix(matrix):
for col in matrix:
print col
def mult(matrix1,matrix2):
if len(matrix1[0]) != len(matrix2):
print 'Matrices must be m*n and n*p to multiply!'
else:
new_matrix = zero_matrix(len(matrix1),len(matrix2[0]))
for i in range(len(matrix1)):
for j in range(len(matrix2[0])):
for k in range(len(matrix2)):
new_matrix[i][j] += matrix1[i][k]*matrix2[k][j]
return new_matrix
class ThreadedMult(threading.Thread):
def __init__(self, m1, m2, **kwargs):
self.m1 = m1
self.m2 = m2
self.result = []
threading.Thread.__init__(self, **kwargs)
def run(self):
m1 = self.m1
m2 = self.m2
n = len(m1)
if len(m1[0]) != len(m2):
print 'Matrices must be square to multiply!'
elif n < 10:
self.result = mult(m1, m2)
else:
step = n / 10
i = 0
threads = []
while (i + step) < n:
threads.append(ThreadedMult(m1[i:i+step], m2))
i += step
if (n - i) > 0:
threads.append(ThreadedMult(m1[i:n], m2))
for t in threads:
t.start()
for t in threads:
t.join()
self.result.extend(t.result)
total_utime = 0
total_ttime = 0
for y in range(TOTAL_ITERATIONS):
m1 = random_matrix(MATRIX_SIZE[0], MATRIX_SIZE[1])
m2 = random_matrix(MATRIX_SIZE[1], MATRIX_SIZE[0])
start = time.time()
single_mult = mult(m1, m2)
utime = time.time() - start
print "Regular implementation:", utime, "seconds."
threaded_sample = ThreadedMult(m1, m2)
start = time.time()
threaded_sample.start()
threaded_sample.join()
threaded_mult = threaded_sample.result
ttime = time.time() - start
print "Threaded implementation:", ttime, "seconds."
print "Threading overhead: ", ttime - utime, "seconds."
equal = True
for i in range(len(single_mult)):
for j in range(len(single_mult[0])):
equal = equal and single_mult[i][j] == threaded_mult[i][j]
if equal:
print "The results are equal."
else:
print "Thre results are not equal."
total_utime += utime
total_ttime += ttime
print "Unthreaded:Threaded ratio: ", total_utime / total_ttime
</pre>
</div>
<div class="span6">
<h3>Sample Output of Demo</h3>
<pre class="prettyprint">
Regular implementation: 2.11618208885 seconds.
Threaded implementation: 3.23017311096 seconds.
Threading overhead: 1.11399102211 seconds.
The results are equal.
Regular implementation: 2.14634609222 seconds.
Threaded implementation: 3.14311385155 seconds.
Threading overhead: 0.996767759323 seconds.
The results are equal.
Regular implementation: 2.14657688141 seconds.
Threaded implementation: 3.23033189774 seconds.
Threading overhead: 1.08375501633 seconds.
The results are equal.
Regular implementation: 2.15448904037 seconds.
Threaded implementation: 3.18466806412 seconds.
Threading overhead: 1.03017902374 seconds.
The results are equal.
Regular implementation: 2.16370606422 seconds.
Threaded implementation: 3.26226806641 seconds.
Threading overhead: 1.09856200218 seconds.
The results are equal.
Regular implementation: 2.17773604393 seconds.
Threaded implementation: 3.29016304016 seconds.
Threading overhead: 1.11242699623 seconds.
The results are equal.
Regular implementation: 2.1634311676 seconds.
Threaded implementation: 3.20348000526 seconds.
Threading overhead: 1.04004883766 seconds.
The results are equal.
Regular implementation: 2.13924813271 seconds.
Threaded implementation: 3.2506248951 seconds.
Threading overhead: 1.11137676239 seconds.
The results are equal.
Regular implementation: 2.16290187836 seconds.
Threaded implementation: 3.20742702484 seconds.
Threading overhead: 1.04452514648 seconds.
The results are equal.
Regular implementation: 2.16222405434 seconds.
Threaded implementation: 3.20069313049 seconds.
Threading overhead: 1.03846907616 seconds.
The results are equal.
Regular implementation: 2.11181497574 seconds.
Threaded implementation: 3.24382519722 seconds.
Threading overhead: 1.13201022148 seconds.
The results are equal.
Regular implementation: 2.12940382957 seconds.
Threaded implementation: 3.2539498806 seconds.
Threading overhead: 1.12454605103 seconds.
The results are equal.
Regular implementation: 2.17440009117 seconds.
Threaded implementation: 3.20231103897 seconds.
Threading overhead: 1.0279109478 seconds.
The results are equal.
Regular implementation: 2.12763500214 seconds.
Threaded implementation: 3.24842190742 seconds.
Threading overhead: 1.12078690529 seconds.
The results are equal.
Regular implementation: 2.1404299736 seconds.
Threaded implementation: 3.19953393936 seconds.
Threading overhead: 1.05910396576 seconds.
The results are equal.
Regular implementation: 2.14696788788 seconds.
Threaded implementation: 3.25090122223 seconds.
Threading overhead: 1.10393333435 seconds.
The results are equal.
Regular implementation: 2.22095990181 seconds.
Threaded implementation: 3.26174402237 seconds.
Threading overhead: 1.04078412056 seconds.
The results are equal.
Regular implementation: 2.1863899231 seconds.
Threaded implementation: 3.23105788231 seconds.
Threading overhead: 1.04466795921 seconds.
The results are equal.
Regular implementation: 2.1539850235 seconds.
Threaded implementation: 3.20885300636 seconds.
Threading overhead: 1.05486798286 seconds.
The results are equal.
Regular implementation: 2.16126418114 seconds.
Threaded implementation: 3.21673417091 seconds.
Threading overhead: 1.05546998978 seconds.
The results are equal.
Unthreaded:Threaded ratio: 0.667791512001
</pre>
</div>
</div>
</div>
<div class="container" id="merge-sort-unit">
<div class="row separated">
<h1>Merge Sort</h1>
</div>
<div class="row separated">
<div class="span12"><h3>Pseudocode for Multithreaded Merge Sort</h3></div>
<div class="span7">
<pre class="prettyprint">
function mergeSort(A, p, r, B, s):
    n = r - p + 1
    if n == 1:
        B[s] = A[p]
    else:
        T = new Array[1...n]
        q = Math.floor((p + r) / 2)
        q2 = q - p + 1
        spawn mergeSort(A, p, q, T, 1)
        mergeSort(A, q+1, r, T, q2 + 1)
        sync
        merge(T, 1, q2, q2 + 1, n, B, s)
</pre>
<pre class="prettyprint">
function merge(T, p1, r1, p2, r2, A, p3):
    n1 = r1 - p1 + 1
    n2 = r2 - p2 + 1
    if n1 < n2:
        temp = p1
        p1 = p2
        p2 = temp
        temp = r1
        r1 = r2
        r2 = temp
        temp = n1
        n1 = n2
        n2 = temp
    if n1 == 0:
        return
    else:
        q1 = Math.floor((p1 + r1) / 2)
        q2 = binarySearch(T[q1], T, p2, r2)
        q3 = p3 + (q1 - p1) + (q2 - p2)
        A[q3] = T[q1]
        spawn merge(T, p1, q1 - 1, p2, q2 - 1, A, p3)
        merge(T, q1 + 1, r1, q2, r2, A, q3 + 1)
        sync
</pre>
<pre class="prettyprint">
function binarySearch(x, T, p, r):
    low = p
    high = Math.max(p, r+1)
    while low < high:
        mid = Math.floor((low + high) / 2)
        if x <= T[mid]:
            high = mid
        else:
            low = mid + 1
    return high
</pre>
</div>
<div class="span5">
<p><br /><br />This multithreaded implementation of the <a href="http://en.wikipedia.org/wiki/Merge_sort">Merge Sort algorithm</a> is very similar to the regular implementation, except that the recursive calls to <strong>mergeSort</strong> are <strong>spawn</strong>ed on a new thread.</p>
<p><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />This multithreaded implementation of <strong>merge</strong> is very similar to the regular implementation, except that the recursive calls to <strong>merge</strong> are <strong>spawn</strong>ed on a new thread.</p>
<p>The first <strong>if</strong> statement ensures that the first input subarray is at least as long as the second, swapping the two sets of bounds whenever the second is longer.</p>
<p><br /><br /><br /><br />The second <strong>if</strong> statement handles the base case, in which both input subarrays are empty (since <em>n1</em> is at least <em>n2</em>, <em>n1 = 0</em> implies both are empty).</p>
<p><br /><br />The <strong>sync</strong> forces the parallel recursive calls to <strong>merge</strong> to be completed before the results are combined.</p>
<p><br /><br /><br /><br /><br /><br /><br /><br />This is a standard implementation of the <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm">Binary Search algorithm</a>.</p>
</div>
<div class="span6">
<h3>Analysis of the Algorithm</h3>
<p>
On a single-core CPU, the work done by the <strong>mergeSort</strong> function presented above is described by the recurrence:
<br />
$$ M_{1}(n) = 2M_{1}(n/2) + \Theta(n) = \Theta(n \log n)$$
</p>
<p>
On a processor with an unlimited number of cores, and with the merge step performed serially, the <strong>span</strong> of the multithreaded Merge Sort algorithm is modeled by the recurrence:<br />
$$ M_{\infty}(n) = M_{\infty}(n/2) + \Theta(n) = \Theta(n) $$
</p>
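<p>The linear span arises because the per-level costs of the recursion form a geometric series:
<br />
$$ n + n/2 + n/4 + \cdots = \Theta(n) $$
</p>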
<p>As indicated by the above analysis, the <strong>parallelism</strong> of the algorithm is:<br />
$$ M_{1}(n)/M_{\infty}(n) = \Theta(\log n)$$
</p>
<p>
On a single-core CPU, the work done by the <strong>merge</strong> function presented above is described by the recurrence below. A recursive call to <strong>merge</strong> operates on <em>at most 3n/4</em> elements, because the longer subarray is split at its median and the shorter subarray contains at most <em>n/2</em> elements (giving at most <em>n/2 + n/4 = 3n/4</em> elements per call):
<br />
$$ M_{1}(n) = M_{1}(\alpha n) + M_{1}((1 - \alpha)n) + O(\log n) = \Theta(n)$$
</p>
<p>
On a processor with an unlimited number of cores, the <strong>span</strong> of the multithreaded <strong>merge</strong> function is modeled by the recurrence:<br />
$$ M_{\infty}(n) = M_{\infty}(3n/4) + \Theta(\log n) = \Theta(\log^2 n) $$
</p>
<p>As indicated by the above analysis, the <strong>parallelism</strong> of <strong>merge</strong> is:<br />
$$ M_{1}(n)/M_{\infty}(n) = \Theta(n/\log^2 n)$$
</p>
<p>Combining the two analyses, <strong>mergeSort</strong> built on this multithreaded <strong>merge</strong> has the span recurrence:<br />
$$ M_{\infty}(n) = M_{\infty}(n/2) + \Theta(\log^2 n) = \Theta(\log^3 n) $$
so its overall parallelism improves to:<br />
$$ M_{1}(n)/M_{\infty}(n) = \Theta(n \log n/\log^3 n) = \Theta(n/\log^2 n)$$
</p>
</div>
<div class="span6"><h3>Difficulties Related to Binary Search</h3>
<p>The Binary Search algorithm, unlike Merge Sort or Matrix Multiplication, does not split its main problem (finding the index of a value) into independent subproblems. Because Binary Search stores the previously searched bounds in the <strong>high</strong> and <strong>low</strong> variables of the pseudocode above, the loop cannot be parallelized: each comparison in the <strong>while</strong> loop depends on the result of the previous one.</p>
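<p>For reference, here is a direct Python translation of the pseudocode above (a sketch for illustration). Each iteration's bounds depend on the comparison made in the previous iteration, which is precisely what prevents the loop from running in parallel:</p>
<pre class="prettyprint">
def binary_search(x, t, p, r):
    # Returns the first index in t[p..r] at which x could be
    # inserted while keeping t sorted (mirrors the pseudocode).
    low = p
    high = max(p, r + 1)
    while low < high:
        mid = (low + high) // 2
        # This comparison uses bounds computed by the previous
        # iteration, so the iterations form a sequential chain.
        if x <= t[mid]:
            high = mid
        else:
            low = mid + 1
    return high
</pre>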
</div>
<div class="span12">
<h3>Diagrams of Merge Sort Execution</h3>
<p>The following two diagrams demonstrate the Merge Sort algorithm when handled "by hand."</p>
</div>
<div class="span8">
<h5>Merge Sort Subproblems</h5>
<p><img src="img/merge-sort-tree.png" alt="Merge Sort"></p>
</div>
<div class="span4">
<h5>Animation of Splitting and Merging</h5>
<p><img src="img/Merge-sort-example-300px.gif" alt="Merge Sort"></p>
</div>
</div>
<div class="row separated">
<div class="span12">
<h3>Video of Merge Sort</h3>
</div>
</div>
<div class="row">
<div class="span6">
<p>To the right, you can view a video demonstration of a multithreaded implementation of the Merge Sort algorithm executed on a list of 50,000 elements with 512 parallel execution threads.</p>
<p>The sorted "list" in the video is a diagonal line across the screen, and the unsorted list in the beginning is the resulting image when each of the pixels from the line are horizontally shuffled to different locations. As the merge operations of Merge Sort complete, the many short "sorted" diagonal segments are combined into fewer, longer ones until a single sorted diagonal line is formed.</p>
<p>Below, you will find Python code for a run-time demonstration of multithreaded merge sort, along with sample output from 20 iterations of single-threaded and multithreaded sorting of 200,000 random elements. The demonstration spawns a pair of threads per split until the sublists fall below 1,000 elements, at which point it sorts sequentially. The benchmark times indicate total CPU time, including the overhead required by the OS for spawning and synchronizing threads; as with the matrix demonstration, CPython's global interpreter lock keeps pure-Python threads from running CPU-bound code in parallel, so the threaded version is slower.</p>
</div>
<div class="span6">
<iframe width="460" height="342" src="http://www.youtube.com/embed/pTeUZUnPDFI" frameborder="0" allowfullscreen></iframe>
</div>
</div>
<div class="row separated">
<div class="span6">
<h3>Run-time Demonstration</h3>
<pre class="prettyprint">
import threading
import random
import time

LIST_SIZE = 200000
TOTAL_ITERATIONS = 20

def merge(left, right):
    # Merge two sorted lists into one sorted list.
    result = []
    while left and right:
        if left[0] <= right[0]:
            result.append(left.pop(0))
        else:
            result.append(right.pop(0))
    if left:
        result.extend(left)
    if right:
        result.extend(right)
    return result

def merge_sort(l):
    # Standard single-threaded merge sort.
    if len(l) < 2:
        return l
    middle = len(l) // 2
    left = merge_sort(l[:middle])
    right = merge_sort(l[middle:])
    return merge(left, right)

class ThreadedSort(threading.Thread):
    def __init__(self, lst, **kwargs):
        self.lst = lst
        threading.Thread.__init__(self, **kwargs)

    def run(self):
        l = self.lst
        if len(l) < 1000:
            # Below this cutoff the thread overhead outweighs any
            # gain, so fall back to the sequential sort.
            self.lst = merge_sort(l)
        else:
            # Sort each half on its own thread, then merge.
            middle = len(l) // 2
            left_thread = ThreadedSort(l[:middle])
            right_thread = ThreadedSort(l[middle:])
            left_thread.start()
            right_thread.start()
            left_thread.join()
            right_thread.join()
            self.lst = merge(left_thread.lst, right_thread.lst)

total_utime = 0
total_ttime = 0
for y in range(TOTAL_ITERATIONS):
    sample_list = []
    for i in range(LIST_SIZE):
        sample_list.append(random.randint(0, LIST_SIZE))
    start = time.time()
    single_sorted = merge_sort(sample_list)
    utime = time.time() - start
    print("Regular implementation:", utime, "seconds.")
    threaded_sample = ThreadedSort(sample_list)
    start = time.time()
    threaded_sample.start()
    threaded_sample.join()
    threaded_sorted = threaded_sample.lst
    ttime = time.time() - start
    print("Threaded implementation:", ttime, "seconds.")
    print("Threading overhead: ", ttime - utime, "seconds.")
    equal = True
    for x in range(LIST_SIZE):
        if single_sorted[x] != threaded_sorted[x]:
            equal = False
            break
    if equal:
        print("The results are equal.")
    else:
        print("The results are not equal.")
    total_utime += utime
    total_ttime += ttime
print("Unthreaded:Threaded ratio: ", total_utime / total_ttime)
</pre>
</div>
<div class="span6">
<h3>Sample Output of Demo</h3>
<pre class="prettyprint">
Regular implementation: 5.53684401512 seconds.
Threaded implementation: 6.70835995674 seconds.
Threading overhead: 1.17151594162 seconds.
The results are equal.
Regular implementation: 5.64260601997 seconds.
Threaded implementation: 6.72791099548 seconds.
Threading overhead: 1.08530497551 seconds.
The results are equal.
Regular implementation: 5.62852311134 seconds.
Threaded implementation: 6.78940796852 seconds.
Threading overhead: 1.16088485718 seconds.
The results are equal.
Regular implementation: 5.65552401543 seconds.
Threaded implementation: 6.85554814339 seconds.
Threading overhead: 1.20002412796 seconds.
The results are equal.
Regular implementation: 5.78146505356 seconds.
Threaded implementation: 6.8217048645 seconds.
Threading overhead: 1.04023981094 seconds.
The results are equal.
Regular implementation: 5.61178684235 seconds.
Threaded implementation: 6.76777505875 seconds.
Threading overhead: 1.1559882164 seconds.
The results are equal.
Regular implementation: 5.57701802254 seconds.
Threaded implementation: 6.93167209625 seconds.
Threading overhead: 1.35465407372 seconds.
The results are equal.
Regular implementation: 5.60108113289 seconds.
Threaded implementation: 6.90342903137 seconds.
Threading overhead: 1.30234789848 seconds.
The results are equal.
Regular implementation: 5.59894990921 seconds.
Threaded implementation: 6.79823184013 seconds.
Threading overhead: 1.19928193092 seconds.
The results are equal.
Regular implementation: 5.59718108177 seconds.
Threaded implementation: 6.90508317947 seconds.
Threading overhead: 1.3079020977 seconds.
The results are equal.
Regular implementation: 5.59603214264 seconds.
Threaded implementation: 7.0819568634 seconds.
Threading overhead: 1.48592472076 seconds.
The results are equal.
Regular implementation: 5.57651782036 seconds.
Threaded implementation: 6.81592392921 seconds.
Threading overhead: 1.23940610886 seconds.
The results are equal.
Regular implementation: 5.57425713539 seconds.
Threaded implementation: 6.7655711174 seconds.
Threading overhead: 1.19131398201 seconds.
The results are equal.
Regular implementation: 5.6062130928 seconds.
Threaded implementation: 6.88873195648 seconds.
Threading overhead: 1.28251886368 seconds.
The results are equal.
Regular implementation: 5.55545091629 seconds.
Threaded implementation: 6.74864506721 seconds.
Threading overhead: 1.19319415092 seconds.
The results are equal.
Regular implementation: 5.58515095711 seconds.
Threaded implementation: 6.80785298347 seconds.
Threading overhead: 1.22270202637 seconds.
The results are equal.
Regular implementation: 5.5728058815 seconds.
Threaded implementation: 6.84219384193 seconds.
Threading overhead: 1.26938796043 seconds.
The results are equal.
Regular implementation: 5.6099550724 seconds.
Threaded implementation: 6.81285405159 seconds.
Threading overhead: 1.20289897919 seconds.
The results are equal.
Regular implementation: 5.69153785706 seconds.
Threaded implementation: 6.83303999901 seconds.
Threading overhead: 1.14150214195 seconds.
The results are equal.
Regular implementation: 5.9380300045 seconds.
Threaded implementation: 7.82567191124 seconds.
Threading overhead: 1.88764190674 seconds.
The results are equal.
Unthreaded:Threaded ratio: 0.81766802697
</pre>
</div>
</div>
<div class="row">
<div class="span12">
<h3>Comparison of Parallel Sorting Algorithms</h3>
</div>
</div>
<div class="row">
<div class="span6">
<p>To the right, you can view a video comparison of the execution times and functionality of single-threaded Merge Sort, multithreaded Quick Sort, multithreaded Merge Sort, and multithreaded Merge Sort with multithreaded Merge as executed on a list of 1,000 elements with 256 parallel execution threads.</p>
<p>The sorted "list" in the video is a diagonal line across the screen, and the unsorted list in the beginning is the resulting image when each of the pixels from the line are horizontally shuffled to different locations. As the sorting operation is completed, a single "sorted" diagonal line is formed.</p>
</div>
<div class="span6">
<iframe width="460" height="342" src="http://www.youtube.com/embed/SdgBjy1GWa0" frameborder="0" allowfullscreen></iframe>
</div>
</div>
</div>
<div class="container" id="glossary-unit">
<div class="row separated">
<h1>Glossary</h1>
</div>
<div class="row separated">
<div class="span3">
<h5>Serial Algorithm</h5>
<p>
An algorithm designed with instructions executed one-after-the-other.
</p>
</div>
<div class="span3">
<h5>Parallel Algorithm</h5>
<p>
An algorithm designed with portions of its instructions executed simultaneously.
</p>
</div>
<div class="span3">
<h5>Shared Memory</h5>
<p>
Memory that is accessible to all cores of a computer's processor.
</p>
</div>
<div class="span3">
<h5>Distributed Memory</h5>
<p>
Memory that is divided between the cores of a computer's processor and is private to each core.
</p>
</div>
</div>
<div class="row">
<div class="span3">
<h5>Multiprocessor</h5>
<p>
A processor in a computer that contains <strong>multiple cores</strong> that execute instructions simultaneously.
</p>
</div>
<div class="span3">
<h5>Static Threading</h5>
<p>
A software abstraction in which the programmer explicitly manages a fixed set of threads that share memory while executing code on two or more CPU cores.
</p>
</div>
<div class="span3">
<h5>Concurrency Platform</h5>
<p>
A layer of software that coordinates, schedules, and manages parallel-computing resources.
</p>
</div>
<div class="span3">
<h5>Dynamic Multithreading</h5>
<p>
The programmatic specification of parallelism with automatic handling of communication protocols and load balancing.
</p>
</div>
</div>
<div class="row">
<div class="span3">
<h5>Spawn</h5>
<p>
A pseudocode keyword that is used to indicate the delegation of a function call to another thread.
</p>
</div>
<div class="span3">
<h5>Sync</h5>
<p>
A pseudocode keyword that is used to indicate the need to wait for the return value of a function call delegated to a separate thread.
</p>
</div>
<div class="span3">
<h5>Parallel</h5>
<p>
A pseudocode keyword that is used to indicate the execution of a programmatic loop in parallel.
</p>
</div>
<div class="span3">
<h5>Nested Parallelism</h5>
<p>
This term describes the <strong>spawning</strong> of a thread from the execution of a function that was previously spawned on a separate thread.
</p>
</div>
</div>
<div class="row">
<div class="span3">
<h5>Parent Thread</h5>
<p>
This term describes the function call or thread that spawns another thread.
</p>
</div>
<div class="span3">
<h5>Child Thread</h5>
<p>
This term describes a function call or thread that was spawned by another thread.
</p>
</div>
<div class="span3">
<h5>Scheduler</h5>
<p>
This is the component of the <strong>concurrency platform</strong> that determines the time at which any particular thread is actually executed in parallel.
</p>
</div>
<div class="span3">
<h5>Strand</h5>
<p>
A chain of instructions that cannot be executed in parallel and are grouped together for execution in the same thread.
</p>
</div>
</div>
<div class="row">
<div class="span3">
<h5>Work</h5>
<p>
The total time to execute a computation on a single processor.
</p>
</div>
<div class="span3">
<h5>Span</h5>
<p>
The time required to execute the longest chain of dependent <strong>strands</strong> (the critical path) in a program.
</p>
</div>
<div class="span3">
<h5>Speedup</h5>
<p>
The ratio of a program's execution time on a single processor core to its execution time on a multiprocessor.
</p>
</div>
<div class="span3">
<h5>Parallelism</h5>
<p>
The speedup attainable when a program is executed on a machine with an unlimited number of processor cores.
</p>
</div>
</div>
<div class="row">
<div class="span3">
<h5>Slackness</h5>
<p>
The factor by which the parallelism of a program exceeds the number of processors.
</p>
</div>
</div>
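<div class="row">
<div class="span12">
<h5>Relating Work, Span, Speedup, Parallelism, and Slackness</h5>
<p>The terms above are connected by simple ratios, using the notation from the analyses earlier on this page (the subscript of <strong>M</strong> denotes the number of available processor cores):
<br />
$$ \text{speedup} = M_{1}(n)/M_{P}(n) \qquad \text{parallelism} = M_{1}(n)/M_{\infty}(n) \qquad \text{slackness} = \frac{M_{1}(n)/M_{\infty}(n)}{P} $$
</p>
</div>
</div>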
</div>
<div class="footer">
<div class="container">
<div class="row">
<div class="span8">
© 2012 Shantanu Bala
</div>
<div class="span4">
<a href="http://creativecommons.org/licenses/by-sa/3.0/">Some rights reserved.</a>
</div>
</div>
</div>
</div>
<br /><br /><br /><br />
<script src="js/jquery-1.7.2.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<script src="js/home.js"></script>
</body>
</html>