WEBVTT
00:00:00.001 --> 00:00:03.820
The largest machine ever built is the Large Hadron Collider at CERN.
00:00:03.820 --> 00:00:10.100
Its primary goal was the discovery of the Higgs boson, the fundamental particle which gives all objects mass.
00:00:10.100 --> 00:00:18.140
The LHC team actually achieved this audacious goal in 2012, winning them the Nobel Prize in physics in the process.
00:00:18.140 --> 00:00:25.220
Today on Talk Python to Me, Kyle Cranmer is here to share how Python was at the core of this amazing achievement.
00:00:25.760 --> 00:00:31.640
This is episode number 29, recorded Thursday, September 24th, 2015.
00:00:51.520 --> 00:01:05.240
Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
00:01:05.240 --> 00:01:09.360
This is your host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy.
00:01:09.360 --> 00:01:15.820
Keep up with the show and listen to past episodes at talkpython.fm and follow the show on Twitter via @talkpython.
00:01:16.520 --> 00:01:19.780
This episode is brought to you by Hired and CodeShip.
00:01:19.780 --> 00:01:25.120
Thank them for supporting the show on Twitter via @Hired_HQ and @CodeShip.
00:01:25.120 --> 00:01:31.060
I don't have much news to share this week, but I am both honored and thrilled to bring you this episode,
00:01:31.060 --> 00:01:32.800
and I can't wait for you to listen to it.
00:01:32.800 --> 00:01:34.940
So, let's get right to the interview.
00:01:34.940 --> 00:01:37.160
Let me introduce Kyle.
00:01:37.160 --> 00:01:44.340
Kyle Cranmer is an American physicist and professor at New York University at the Center for Cosmology and Particle Physics
00:01:44.340 --> 00:01:48.340
and affiliated faculty member at NYU's Center for Data Science.
00:01:48.340 --> 00:01:54.480
He is an experimental particle physicist working primarily on the Large Hadron Collider based in Geneva, Switzerland.
00:01:54.480 --> 00:02:00.540
Cranmer popularized a collaborative statistical modeling approach and developed statistical methodology,
00:02:00.540 --> 00:02:06.060
which was used extensively for the discovery of the Higgs boson at the LHC in July 2012.
00:02:06.060 --> 00:02:08.300
Kyle, welcome to the show.
00:02:08.300 --> 00:02:10.180
Thank you. Thank you. It's a pleasure to be here.
00:02:10.780 --> 00:02:17.340
Yeah, we have some amazing science and programming to talk about today, so I'm really excited to dig into all these topics with you.
00:02:17.340 --> 00:02:20.020
Yeah, no, I'm excited to see where it goes.
00:02:20.020 --> 00:02:21.600
Yeah, for sure.
00:02:21.600 --> 00:02:26.640
So, let's, you know, we're going to talk about the Large Hadron Collider,
00:02:26.640 --> 00:02:31.540
about using Python for scientific research and all those sorts of things,
00:02:31.580 --> 00:02:33.620
as well as some other cool projects that you've got going on.
00:02:33.620 --> 00:02:41.500
But people like to know how folks like you, in your position, kind of got started, and they like to hear the background.
00:02:41.500 --> 00:02:47.580
So, maybe we could start with, you know, what got you interested in physics and what got you interested in programming and how do you get to where you are?
00:02:48.020 --> 00:02:53.200
I've been interested in, you know, in physics, you know, since I was a kid, not really knowing that that's what it was called.
00:02:53.200 --> 00:03:00.860
But later on, you know, I think, I guess it was in high school is when I really realized that it was physics that I wanted to do.
00:03:01.460 --> 00:03:09.420
I grew up in Arkansas and, you know, Arkansas is not exactly known for, like, leading the tide of physicists and computer scientists in the world.
00:03:09.420 --> 00:03:15.300
But they had started a special math and science high school that it was public school, but you actually lived there.
00:03:15.300 --> 00:03:21.600
And when I was there, I was just surrounded by all sorts of people, kind of the nerds and geeks of Arkansas.
00:03:21.600 --> 00:03:23.260
And it was really a special time.
00:03:23.260 --> 00:03:29.980
So, during that time, I, you know, got even more into physics, but it's also, that's when I was first exposed to serious programming.
00:03:30.880 --> 00:03:41.940
So, actually, even Python, like in 95, actually 94, I guess, I had a friend that was into early web things and he was playing with Zope and the Zope, you know, object database.
00:03:41.940 --> 00:03:49.560
And so, I started working with him and did some early web projects and that was kind of my first exposure to Python.
00:03:49.560 --> 00:03:51.320
So, that was a long time ago.
00:03:51.320 --> 00:03:58.060
And then that's, it's funny how much those experiences kind of keep being revisited today.
00:03:58.060 --> 00:04:00.620
Yeah, I'm sure you keep coming back to that.
00:04:00.780 --> 00:04:06.780
You know, basically, programming these days seems like a required skill to be a physicist.
00:04:06.780 --> 00:04:14.060
Yeah, no, well, it depends on what you do, but definitely for what we do, programming is a required skill.
00:04:14.060 --> 00:04:22.400
And unfortunately, it can, you know, for people that don't have those strengths, it really takes away from their ability to try to do the physics that they want to do.
00:04:23.060 --> 00:04:31.020
So, you know, for incoming graduate students, you usually see a pretty big divide between people that have some programming skills and people that don't.
00:04:31.020 --> 00:04:36.800
And usually, the people that don't will catch up a little bit later, but, you know, you lose time and that's unfortunate.
00:04:36.800 --> 00:04:37.500
Right.
00:04:37.560 --> 00:04:43.920
I'm sure it's like a huge scramble, like, oh my gosh, I got to learn all this programming stuff too because, you know, we have projects or whatever, right?
00:04:44.400 --> 00:04:44.940
Right, right.
00:04:44.940 --> 00:04:56.180
It also colors a lot of the flavor about how we approach computing because somehow you have this enormous computing problem that you need to deal with and you would like to do it as nicely as possible.
00:04:56.180 --> 00:05:02.720
But it also can't be too fancy or the bulk of the physicists might not be able to understand what's going on.
00:05:02.720 --> 00:05:10.860
You have older physicists from the Fortran days and you have younger physicists that maybe never took any, you know, programming courses, like any serious programming courses.
00:05:10.860 --> 00:05:16.800
So things have to be somehow kept simple but still work for the difficult problems.
00:05:16.800 --> 00:05:18.800
And so it's a difficult balance to strike.
00:05:18.800 --> 00:05:19.860
Yeah, I'm sure.
00:05:20.660 --> 00:05:24.820
So let's talk a little bit about what you guys are doing at the Large Hadron Collider.
00:05:24.820 --> 00:05:30.000
And first of all, you know, congratulations on the Higgs boson discovery.
00:05:30.000 --> 00:05:30.580
That's amazing.
00:05:30.580 --> 00:05:31.720
Oh, thank you.
00:05:31.720 --> 00:05:35.580
No, it was, yeah, many, many years of work.
00:05:35.580 --> 00:05:39.480
And when it finally came, it was a huge treat.
00:05:39.480 --> 00:05:39.940
I don't know.
00:05:39.940 --> 00:05:44.500
It's funny to have such a big thing like that happen fairly early in your career.
00:05:44.500 --> 00:05:47.180
It's like, now what?
00:05:47.180 --> 00:05:47.960
Yeah.
00:05:47.960 --> 00:05:53.800
So, but yeah, so at the LHC, you know, we have this huge collider that's in Switzerland.
00:05:53.800 --> 00:05:57.220
It's about 17 miles around and underground.
00:05:57.220 --> 00:06:03.580
There are all these superconducting magnets that help protons go, you know, bend them in a circle at essentially the speed of light.
00:06:03.580 --> 00:06:06.320
And they're colliding together all the time.
00:06:06.420 --> 00:06:11.480
And they smack into each other and they make, you know, a lot of new particles that come flying out.
00:06:11.480 --> 00:06:13.660
And they hit our detector.
00:06:13.660 --> 00:06:17.280
And our detector, you can think of sort of like a digital camera.
00:06:17.280 --> 00:06:19.100
You know, it's like basically a bunch of pixels.
00:06:19.100 --> 00:06:22.860
And the particles smack into it and you get an image.
00:06:22.860 --> 00:06:24.500
But it's a 3D image.
00:06:24.500 --> 00:06:25.680
So it's a 3D detector.
00:06:25.980 --> 00:06:29.380
And the detector is like the size of a 12-story building.
00:06:29.380 --> 00:06:30.580
So, yeah.
00:06:30.580 --> 00:06:43.520
I think that, you know, when you just hear about particle colliders and especially LHC, you have maybe this idea of like a tube where things are shooting around.
00:06:43.520 --> 00:06:45.080
And, you know, how big does the tube have to be?
00:06:45.080 --> 00:06:45.860
Not that big.
00:06:46.120 --> 00:06:53.260
But the actual experiments, the detectors, I was blown away when I learned about how big they are.
00:06:53.260 --> 00:06:54.800
Like you said, 12 stories.
00:06:54.800 --> 00:06:55.640
These things are huge.
00:06:55.640 --> 00:06:56.280
Yeah.
00:06:56.280 --> 00:06:56.860
No, they are.
00:06:56.860 --> 00:07:04.960
And the range of scales is pretty crazy because we have to be able to track where these particles go very precisely.
00:07:04.960 --> 00:07:09.140
So, like, close to where they interact, you know, we're measuring things at the, like, micron level.
00:07:09.680 --> 00:07:14.480
And then they fly out over, like, the size of a building.
00:07:14.480 --> 00:07:18.260
And we're still measuring where they're going at this very precise level.
00:07:18.260 --> 00:07:20.920
But, you know, it's just this gargantuan thing.
00:07:20.920 --> 00:07:24.280
So you have to align it properly and all sorts of challenges there.
00:07:24.280 --> 00:07:30.820
And there are about 100 million, you know, well, a few hundred million electronic readouts coming out of this beast.
00:07:30.820 --> 00:07:35.580
So, you know, it's like a 100-megapixel camera or something like that.
00:07:35.580 --> 00:07:38.760
And we're taking 40 million photos every second.
00:07:38.760 --> 00:07:41.300
That's a stunning amount of data.
00:07:41.300 --> 00:07:42.800
That is a stunning amount of data.
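NOTE
The figures just quoted (about 100 million readout channels, 40 million snapshots per second) can be multiplied out as a rough sketch of the raw data rate. The one-byte-per-channel size is an assumption made only for this illustration; it is not stated in the conversation.
```python
# Back-of-envelope estimate of the raw, unfiltered detector output.
# Figures quoted in the conversation:
READOUT_CHANNELS = 100_000_000      # "about 100 million" electronic readouts
SNAPSHOTS_PER_SECOND = 40_000_000   # "40 million photos every second"
# ASSUMPTION (not from the conversation): one byte per channel per snapshot.
BYTES_PER_CHANNEL = 1
def raw_rate_bytes_per_second(channels, snapshots_per_second, bytes_per_channel):
    """Return the uncompressed data rate in bytes per second."""
    return channels * snapshots_per_second * bytes_per_channel
rate = raw_rate_bytes_per_second(READOUT_CHANNELS, SNAPSHOTS_PER_SECOND, BYTES_PER_CHANNEL)
print(f"{rate / 1e15:.0f} petabytes per second under these assumptions")
```
Under that assumed channel size the product comes to petabytes every second, which is why the data has to be reduced right at the detector.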
00:07:42.800 --> 00:07:56.140
And so if you, you know, we have to slap special electronics, like, straight onto the detector to be able to start preprocessing it and compressing it and sort of, you know, coming up with some way to deal with the data volume.
00:07:56.140 --> 00:08:04.340
Because, you know, it's something, it's, you know, there are just totally staggering numbers about the data flow that's coming straight out of the detector.
00:08:04.340 --> 00:08:06.640
So how do you capture and store that?
00:08:06.700 --> 00:08:11.900
Do you store that, like, on hardware right on, like, ATLAS in the machines?
00:08:11.900 --> 00:08:15.400
Or do you, like, get that into, like, a cluster of servers?
00:08:15.400 --> 00:08:16.440
Or what happens?
00:08:16.440 --> 00:08:17.440
Right, right.
00:08:17.540 --> 00:08:27.320
So we have a kind of a hierarchical online real-time system for tossing away, you know, the majority of the data.
00:08:27.320 --> 00:08:35.000
So we have to actually, we write algorithms that have to look at the data and, in real time, decide, does this look interesting or not?
00:08:35.000 --> 00:08:41.500
And so we go from the sort of 40 million a second through this, like, three levels of filtering down.
00:08:41.500 --> 00:08:47.300
And then we get to the point that we save something like a few hundred of these collisions every second.
00:08:48.000 --> 00:08:55.260
And that turns into, you know, several petabytes a year of data that we actually analyze later.
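NOTE
A minimal sketch of a multi-level filter of the kind described here, where each level only sees what the previous level kept. The event fields, thresholds, and function names are all hypothetical and chosen only to show the cascade; the real trigger's selection criteria are not described in the conversation.
```python
import random
# Every field name and threshold below is hypothetical.
def level1(event):
    """Fast, coarse check, run on every event."""
    return event["energy"] > 50.0
def level2(event):
    """Slightly more detailed check, run only on level-1 survivors."""
    return event["n_tracks"] >= 2
def level3(event):
    """Most expensive check, run only on level-2 survivors."""
    return event["energy"] * event["n_tracks"] > 400.0
TRIGGER_LEVELS = [level1, level2, level3]
def run_trigger(events, levels=TRIGGER_LEVELS):
    """Yield only events accepted by every level; all() stops at the first rejection."""
    for event in events:
        if all(level(event) for level in levels):
            yield event
def fake_events(n, seed=0):
    """Generate n made-up events with a steeply falling energy spectrum."""
    rng = random.Random(seed)
    for _ in range(n):
        yield {"energy": rng.expovariate(1 / 20.0), "n_tracks": rng.randint(0, 5)}
kept = list(run_trigger(fake_events(100_000)))
print(f"kept {len(kept)} of 100000 simulated events")
```
Ordering the levels from cheapest to most expensive means most events are rejected before the costly checks ever run.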
00:08:55.260 --> 00:08:56.660
That's amazing.
00:08:56.660 --> 00:08:59.800
It's got to be a little stressful to work on that initial filtering algorithm.
00:08:59.800 --> 00:09:03.240
Because what if you threw away the Higgs boson before you discovered it, right?
00:09:03.240 --> 00:09:04.140
That's right.
00:09:04.140 --> 00:09:04.320
Yeah.
00:09:04.320 --> 00:09:08.620
No, people, we always worry that we're, you know, kind of throwing the baby out with the bathwater.
00:09:09.580 --> 00:09:13.620
And sorry about the living in New York here.
00:09:13.620 --> 00:09:14.740
Yeah, no worries.
00:09:14.740 --> 00:09:18.540
So the, yeah, we call that thing the trigger.
00:09:18.540 --> 00:09:21.700
And, you know, that's something that I worked on a bit.
00:09:21.700 --> 00:09:29.000
It's true that, like, if we don't find anything else in this next run of the LHC, you know, a lot of people will think exactly that.
00:09:29.000 --> 00:09:33.960
That maybe, you know, the way the trigger was configured, we were throwing away the interesting stuff.
00:09:33.960 --> 00:09:36.000
But luckily, we're not stuck to that.
00:09:36.000 --> 00:09:38.360
You know, we can go and we can change it and things like that.
00:09:38.360 --> 00:09:40.200
But that is the worry.
00:09:40.200 --> 00:09:41.120
Sure.
00:09:41.120 --> 00:09:41.500
Yeah.
00:09:41.500 --> 00:09:45.540
I mean, you still have the time spent and the energy and all that, right?
00:09:45.540 --> 00:09:46.140
Yeah.
00:09:46.140 --> 00:09:46.660
Yeah, sure.
00:09:46.660 --> 00:09:48.560
You can rerun it, of course.
00:09:48.560 --> 00:09:54.180
But, you know, you got to, I suspect time is a valuable thing on that machine.
00:09:54.180 --> 00:09:55.640
Yeah.
00:09:55.640 --> 00:09:56.280
No, for sure.
00:09:56.280 --> 00:09:57.540
It's expensive to run.
00:09:57.540 --> 00:09:59.300
So, absolutely.
00:09:59.300 --> 00:10:00.740
Yeah.
00:10:00.740 --> 00:10:07.660
So, you know, I have a lot of listeners who are scientists and physicists and data science and so on.
00:10:07.660 --> 00:10:12.180
But a lot of them who are probably not.
00:10:12.180 --> 00:10:23.200
And so, I wanted to make a movie recommendation and a book recommendation just for people to, you know, if they want to kind of set the stage and learn the background, you know, as part of this whole thing we're talking about.
00:10:23.200 --> 00:10:26.780
I wanted to recommend the Particle Fever documentary.
00:10:26.780 --> 00:10:27.680
Have you seen this?
00:10:28.380 --> 00:10:28.880
Yeah, yeah.
00:10:28.880 --> 00:10:30.780
I actually have a credit in that movie.
00:10:30.780 --> 00:10:35.180
I worked with them quite a bit.
00:10:35.180 --> 00:10:45.020
And at one point there was a scene that was shot in my office, but they ended up having to cut it because it didn't really fit well with the, you know, it was a good choice.
00:10:45.120 --> 00:10:46.300
But it was painful.
00:10:46.300 --> 00:10:47.780
But they were nice.
00:10:47.780 --> 00:11:00.740
I worked with them a fair amount and got to go to the, like, you know, the opening in Sheffield at a documentary film festival and hang out with the producers and the whole crew.
00:11:00.740 --> 00:11:03.660
But it's a great film.
00:11:04.020 --> 00:11:08.340
I think it definitely, it's good for a non-physics audience also.
00:11:08.340 --> 00:11:09.500
It's not a technical film.
00:11:09.500 --> 00:11:18.260
It just basically captures what it's like to be inside one of these experiments and the sort of stress and the, you know, the drama associated to it.
00:11:18.260 --> 00:11:21.520
I think it's really one of the best science documentaries ever.
00:11:21.520 --> 00:11:22.900
I absolutely agree with you.
00:11:22.900 --> 00:11:30.540
I think it really captures the excitement, the imagination, the drama in a way that, you know, anybody could appreciate.
00:11:30.540 --> 00:11:32.440
And so, I definitely recommend people watch that.
00:11:32.600 --> 00:11:36.560
It's available for streaming on Netflix and iTunes and other places.
00:11:36.560 --> 00:11:44.980
And then the other thing is the book called Present at the Creation: Discovering the Higgs Boson by Amir Aczel.
00:11:44.980 --> 00:11:45.960
I messed up his name.
00:11:45.960 --> 00:11:47.400
But that's also really good.
00:11:47.400 --> 00:11:51.580
So, people who are out there and want to learn more about what we're talking about, I think I recommend those.
00:11:51.580 --> 00:11:52.920
Okay, great.
00:11:52.920 --> 00:11:54.640
Yeah, I actually haven't read that second book.
00:11:54.640 --> 00:11:58.000
Yeah, I really enjoyed that book as well.
00:11:58.000 --> 00:12:00.420
It predates the Higgs Boson.
00:12:00.420 --> 00:12:02.160
So, it's like a lot of anticipation.
00:12:02.480 --> 00:12:02.800
So, that's cool.
00:12:02.800 --> 00:12:03.020
I see.
00:12:03.020 --> 00:12:03.940
Okay, great.
00:12:03.940 --> 00:12:11.320
Maybe we could talk a little bit about, like, the really big picture of software at the LHC.
00:12:11.320 --> 00:12:14.480
Because there's not just one team and there's not just one experiment.
00:12:14.480 --> 00:12:15.620
There's how many detectors?
00:12:15.620 --> 00:12:16.700
Are there seven detectors?
00:12:16.700 --> 00:12:17.920
Right.
00:12:17.920 --> 00:12:23.140
So, well, there are two really big kind of multipurpose particle detectors, ATLAS and CMS.
00:12:23.140 --> 00:12:24.260
And I'm on ATLAS.
00:12:24.640 --> 00:12:31.440
And those two experiments have, you know, in the neighborhood, a little bit more than 3,000 physicists working on them.
00:12:32.060 --> 00:12:35.340
So, you know, so it's a, there are big groups of people.
00:12:35.340 --> 00:12:45.620
And then there are two other experiments that are, you know, slightly smaller in scale, but they do, you know, and slightly more specialized in terms of the physics that they do.
00:12:45.940 --> 00:12:51.020
And then there are several other smaller dedicated experiments that are quite a bit smaller.
00:12:51.580 --> 00:12:55.820
And so, I don't, you know, it depends on how you count a little bit.
00:12:55.820 --> 00:13:05.520
But usually, you know, there's sort of the two big multipurpose detectors and two other more specialized ones that are the dominant, like, LHC experiments.
00:13:05.520 --> 00:13:06.560
Okay, cool.
00:13:06.560 --> 00:13:15.260
And so, maybe from, like, the higher level or larger scale, like, the thing that actually runs the machine down into the experiments, down into more, like, the data processing details.
00:13:15.400 --> 00:13:18.680
Could you give us a picture, like, what the software looks like there, what you guys are doing?
00:13:18.680 --> 00:13:19.820
Sure, yeah.
00:13:19.820 --> 00:13:23.600
So, I mean, it's mainly, you know, we have a whole bunch of collisions.
00:13:23.600 --> 00:13:31.180
And each collision, you know, if you think of this metaphor of it being like an image, you know, it's like a pipeline for doing a bunch of image processing, you know.
00:13:31.180 --> 00:13:39.360
And you're looking for, you're trying to find the collisions that, you know, maybe have evidence of some new particle.
00:13:39.360 --> 00:13:50.400
So, you have lots of teams of people that are looking for different things, and each of those teams will develop a little pipeline to process the data to try to, you know, to search for what they want.
00:13:50.400 --> 00:14:00.120
Also, to put it into perspective a little bit, we had a quadrillion, you know, a couple quadrillion collisions total at the LHC.
00:14:00.120 --> 00:14:08.180
And when we discovered the Higgs, it was, you know, of the order of 100 or 1,000 of those collisions that were the interesting ones.
00:14:08.460 --> 00:14:11.180
So, it's a huge needle in a haystack problem.
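NOTE
To put rough numbers on the needle in a haystack, the quoted figures can be divided out. This is only a sketch: "a couple quadrillion" is taken here to mean two quadrillion, which is an assumption.
```python
# ASSUMPTION: "a couple quadrillion" is read as exactly 2 * 10**15 collisions.
TOTAL_COLLISIONS = 2 * 10**15
def one_in(total, interesting):
    """Return N such that roughly one collision in N is an interesting one."""
    return total // interesting
# "of the order of 100 or 1,000" interesting collisions
for interesting in (100, 1_000):
    print(f"{interesting} interesting collisions: about 1 in {one_in(TOTAL_COLLISIONS, interesting):,}")
```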
00:14:11.180 --> 00:14:17.420
But it's also not really like a data mining kind of just generally looking for something weird in the data.
00:14:17.420 --> 00:14:30.080
We have theories that tell us, you know, what to look for, which is good because they're such small little deviations in the data that it would be basically impossible to find if you didn't have a good guide.
00:14:30.320 --> 00:14:41.280
And then this processing chain, because there's so much data and performance is such an issue, most of it, well, several years ago, the decision was to write most of the software in C++.
00:14:41.280 --> 00:14:44.880
You know, C++ has also evolved a ton during the time.
00:14:44.880 --> 00:14:49.900
Are you using like, are people using like C++11 and those types of things?
00:14:50.140 --> 00:14:50.640
Right.
00:14:50.640 --> 00:14:59.880
So, the different experiments kind of, you know, move to these new, you know, new standards and new computing technologies kind of at different paces.
00:14:59.880 --> 00:15:01.540
There's a lot of worry.
00:15:01.540 --> 00:15:07.800
You know, it's generally a pretty conservative attitude, you know, but we are making those kinds of transitions.
00:15:08.580 --> 00:15:14.040
But, you know, you just, it has to go through a lot of vetting before we make a big jump like that.
00:15:14.040 --> 00:15:33.520
We also usually have a very homogeneous computing environment in terms of like operating systems and things like that because, you know, we run into issues where, you know, you don't want to have to be worrying about like floating point arithmetic in your kernel or something, you know, when you're, because it's, so, so we just tried it.
00:15:33.520 --> 00:15:42.420
Yeah, so it's a little bit funny, you know, that CERN was responsible for sort of developing the web browser, right, you know, and HTML and things like that.
00:15:42.420 --> 00:15:45.700
And so, they had this huge win of, you know, where the web was born.
00:15:45.700 --> 00:15:51.420
And then that was followed by this idea of like, okay, we had the web, now we're going to have grid computing.
00:15:51.420 --> 00:15:53.760
And there was a lot of money poured into it.
00:15:53.760 --> 00:15:59.160
And the promise of the grid basically turned into what has happened with the cloud.
00:15:59.560 --> 00:16:09.400
And, but in, you know, and then within high energy physics, we do have the grid, but it's kind of like a huge global, you know, batch system in some sense.
00:16:09.400 --> 00:16:20.020
So, it tends to be, you know, more uniform and things like that than what, you know, at first people were working really hard to be able to work over very heterogeneous computing environments.
00:16:20.020 --> 00:16:24.320
But, you know, that all evolved over, you know, more than a decade.
00:16:24.320 --> 00:16:25.960
Yeah, I'm sure.
00:16:25.960 --> 00:16:29.740
I used to do a lot of work in sort of scientific computing and visualization.
00:16:29.740 --> 00:16:35.920
And it's super hard to do reproducibility and checking stuff, you know.
00:16:35.920 --> 00:16:36.620
Right.
00:16:36.620 --> 00:16:41.940
If you've got a sufficiently complicated series of mathematical steps, you can apply to something.
00:16:42.180 --> 00:16:46.680
You know, like, if it's so complicated, how do you know when you're right or not?
00:16:46.680 --> 00:16:46.880
Right.
00:16:46.880 --> 00:16:52.420
You know, how do you know when you're discovering something new versus, oh, it's like I expected or whatever, right?
00:16:52.420 --> 00:16:53.020
Right.
00:16:53.020 --> 00:17:01.880
Well, we're working a lot right now on trying to address the sort of reproducibility, you know, issues kind of specific and the challenges associated to our field.
00:17:02.460 --> 00:17:05.960
And there are a lot of challenges because there's so much data and software is very complicated.
00:17:05.960 --> 00:17:06.940
Yeah.
00:17:06.940 --> 00:17:13.640
So the core algorithms tend to all be C++, but they're, you know, they're organized into lots of, you know, lots of different tools.
00:17:13.640 --> 00:17:19.480
And, you know, you have a way of kind of composing this pipeline between different processing algorithms.
00:17:19.480 --> 00:17:29.960
And in the end, the configuration of that thing is such a beast that that's the first place where you see Python happening is that we have a way of kind of, you know, doing introspection on all of the tools.
00:17:30.040 --> 00:17:33.820
And then we just represent their configuration in terms of Python objects.
00:17:33.820 --> 00:17:40.400
And then there's a whole separate layer of, you know, of computing, I mean, of programming, which is just essentially the configuration.
00:17:40.400 --> 00:17:55.600
And that includes both this trigger, that online system that's tossing out the data, as well as the, you know, the people that are analyzing the data, how they, you know, configure all these tools to be able to process the kajillions of events into something that's more manageable.
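NOTE
The idea of representing tool configuration as Python objects might be sketched as follows. This is not the Athena API: the class names, property names, and default values are all hypothetical, and the compiled tools that would consume these settings are left out entirely.
```python
class Configurable:
    """Python-side stand-in for a compiled tool; it only holds settings."""
    defaults = {}
    def __init__(self, name, **overrides):
        unknown = set(overrides) - set(self.defaults)
        if unknown:
            # Reject typos early instead of silently ignoring them.
            raise AttributeError(f"{type(self).__name__} has no properties {sorted(unknown)}")
        self.name = name
        self.settings = {**self.defaults, **overrides}
    def properties(self):
        """Introspection: report every property and its current value."""
        return dict(self.settings)
class TrackFinder(Configurable):   # hypothetical tool
    defaults = {"min_pt_gev": 0.5, "max_eta": 2.5}
class JetBuilder(Configurable):    # hypothetical tool
    defaults = {"cone_radius": 0.4}
# The job configuration is just an ordered list of configured tools.
sequence = [TrackFinder("tracks", min_pt_gev=1.0), JetBuilder("jets")]
for tool in sequence:
    print(tool.name, tool.properties())
```
Keeping the configuration in plain Python objects means it can be inspected, compared, and generated with ordinary code, separately from the heavy processing.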
00:17:55.600 --> 00:17:57.980
Now, that sounds really interesting.
00:17:57.980 --> 00:18:05.860
When I was doing some research, it seemed like one of the major pieces used in ATLAS was this thing called Athena.
00:18:05.860 --> 00:18:07.080
Right.
00:18:07.080 --> 00:18:07.860
That's right.
00:18:07.860 --> 00:18:16.880
That's the kind of name of the C++ framework that we use that also includes the way that it builds the Python bindings for configuring all the tools.
00:18:17.580 --> 00:18:24.620
And yeah, so that's, yeah, I've spent more hours than I'd like to admit doing programming in that framework.
00:18:24.620 --> 00:18:35.920
But then what's also interesting, I think a lot of your audience will find interesting, is that once you've used that huge, heavyweight data processing pipeline, usually you get to something quite a bit smaller.
00:18:36.260 --> 00:18:41.000
And that's where a lot of the more interactive and exploratory part of the data analysis happens.
00:18:41.000 --> 00:18:47.900
And at that stage, a lot of people, well, people stop using things like Athena for the most part.
00:18:48.140 --> 00:18:53.440
And that's where you start using, see people using Python a lot more in terms of data analysis.
00:18:53.440 --> 00:19:00.340
And so it's an interesting transition because people are always arguing about where do you make that swap, you know?
00:19:00.340 --> 00:19:01.360
Sure.
00:19:01.360 --> 00:19:01.920
Yeah.
00:19:02.640 --> 00:19:05.140
I suspect you guys probably do a lot of IPython.
00:19:05.140 --> 00:19:05.980
Is that true?
00:19:05.980 --> 00:19:08.700
Well, you would think that more people would.
00:19:08.700 --> 00:19:26.420
I guess part of it is that it's still, even at that stage, you still have so much data to process that the kinds of things that people end up wanting to do are, you know, well suited to having like, you know, programs that, you know, that look really like programs that run.
00:19:26.420 --> 00:19:33.140
And they might be Python based, but, you know, you kind of sort of batch systemy, you run over this thing, and then you get some results and look at them.
00:19:33.140 --> 00:19:35.940
There are times when you're doing something very interactive.
00:19:35.940 --> 00:19:51.920
And so years ago, the team at CERN that makes this tool called ROOT, which is like kind of the dominant data analysis package in high energy physics, came up with something like an interpreter because you want to sit there and have this feedback loop, right?
00:19:51.920 --> 00:19:55.740
You know, like a REPL, where you can, you know, type commands, see plots.
00:19:55.740 --> 00:19:58.560
And that was actually done amazingly.
00:19:58.560 --> 00:20:01.940
They wrote a C++ interpreter many, many years ago.
00:20:01.940 --> 00:20:08.760
And so you actually write these commands in C++, and then they're interpreted and executed on the fly.
00:20:08.760 --> 00:20:11.120
That's actually pretty interesting by itself, isn't it?
00:20:11.120 --> 00:20:12.260
It is interesting.
00:20:12.260 --> 00:20:17.700
It, of course, had all sorts of issues, and C++ wasn't really meant for doing that, but it worked practically.
00:20:18.360 --> 00:20:27.960
And now they've gone through and they have a much heavier duty version of this interpreter that's based on, you know, Cling and more modern, like, compiling, compiler technologies and things.
00:20:27.960 --> 00:20:32.660
But Python, obviously, is another way to go with that, which is nice.
00:20:42.740 --> 00:20:45.460
This episode is brought to you by Hired.
00:20:45.460 --> 00:20:51.940
Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
00:20:51.940 --> 00:21:01.100
Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.
00:21:01.100 --> 00:21:07.460
Typically, candidates receive five or more offers in just the first week, and there are no obligations, ever.
00:21:07.460 --> 00:21:09.540
Sounds pretty awesome, doesn't it?
00:21:09.540 --> 00:21:11.600
Well, did I mention there's a signing bonus?
00:21:11.600 --> 00:21:15.680
Everyone who accepts a job from Hired gets a $2,000 signing bonus.
00:21:15.680 --> 00:21:20.040
And as Talk Python listeners, it gets way sweeter.
00:21:20.040 --> 00:21:27.600
Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $4,000.
00:21:27.600 --> 00:21:29.300
Opportunity's knocking.
00:21:29.300 --> 00:21:32.920
Visit Hired.com slash Talk Python To Me and answer the call.
00:21:32.920 --> 00:21:45.300
So people started moving to the Python.
00:21:45.300 --> 00:21:52.220
Well, some people started moving to the Python way of doing things, I don't know, whatever, you know, eight years ago or something like that.
00:21:52.220 --> 00:21:55.760
But the field is kind of split between, you know, which way.
00:21:55.760 --> 00:21:59.640
And then since then, like, things like IPython and the IPython notebook have come around.
00:21:59.640 --> 00:22:03.980
I think that that's great, especially from the point of view of this, like, reproducibility.
00:22:03.980 --> 00:22:07.020
So what we're working on now is, we give tons of talks.
00:22:07.020 --> 00:22:09.220
If you go to CERN, we have this agenda system.
00:22:09.220 --> 00:22:15.040
And you can see that there are, like, hundreds of thousands of presentations happening within these experiments every year.
00:22:15.040 --> 00:22:16.280
That's excellent.
00:22:16.680 --> 00:22:23.340
Yeah, so one of the things, but they're always, like, PowerPoint or, you know, Keynote or whatever, or LaTeX-based PDF presentations.
00:22:23.340 --> 00:22:32.720
And you read about what someone's doing, but it's not very handy for, like, trying to have reproducibility or for another graduate student to pick up where someone left off.
00:22:32.720 --> 00:22:45.600
So we have this effort now to try to make it so that the agenda system can, you know, can basically display notebooks directly so people can upload their IPython notebook directly and visualize it.
00:22:45.600 --> 00:22:49.480
And then if someone else thinks it's interesting, they can, you know, download it and execute it.
00:22:49.940 --> 00:23:04.880
There are efforts to try to make it so that the whole computing environment associated with that notebook can be, you know, packaged up, because they usually aren't just standalone Python notebooks with, like, a SciPy dependency.
00:23:04.880 --> 00:23:06.140
They have a bunch of dependencies.
00:23:06.140 --> 00:23:11.000
So if you can package that all up, that's very handy.
00:23:11.000 --> 00:23:12.760
So there are tools like Binder now.
00:23:12.760 --> 00:23:14.560
There's a tool called Everware.
00:23:14.940 --> 00:23:21.060
And previously, there was something like SageMath, and these all allowed you to sort of execute a notebook, you know.
00:23:21.060 --> 00:23:26.900
But the problem was how do you get all these, you know, these software dependencies packaged up?
00:23:26.900 --> 00:23:28.460
And now that problem is starting to be solved.
00:23:28.460 --> 00:23:29.200
Right.
00:23:29.200 --> 00:23:30.360
Oh, that's really excellent.
00:23:30.360 --> 00:23:41.260
Because I can imagine you guys have so much data and maybe these back-end systems you've got to reach into to actually work with the data that you're trying to, you know, do physics on.
00:23:41.260 --> 00:23:41.900
That's right.
00:23:41.900 --> 00:23:48.540
You can't just take the program and hand it out, you know, like, oh, and here's our, you know, 50 gigs of data and you've got to get it this way, right?
00:23:48.540 --> 00:23:49.120
Right.
00:23:49.120 --> 00:23:57.080
And not only that, there's also things like databases that say, like, how was the detector aligned on Friday, you know, November 25th or something.
00:23:57.080 --> 00:24:03.460
So there are all these databases involved that you have to connect to for the software to run.
00:24:03.640 --> 00:24:06.760
And that's also through tons of authentication layers.
00:24:06.760 --> 00:24:09.700
So it's a huge pain in the butt, basically.
00:24:09.700 --> 00:24:12.560
But people are solving it.
00:24:12.560 --> 00:24:15.880
And I think that will be a huge change.
00:24:15.880 --> 00:24:23.720
And the Project Jupyter people, you know, luckily had this great foresight to separate the notebook from the background kernel.
00:24:23.920 --> 00:24:29.700
So we're actually also writing a kernel based on this C++ interpreter of ROOT.
00:24:29.700 --> 00:24:34.980
So it still looks like notebooks and all the display and everything is the same.
00:24:34.980 --> 00:24:39.720
But in the background, instead of Python, it's the C++ interpreter, which is, you know, interesting.
00:24:40.260 --> 00:24:44.440
Yeah, I mean, that certainly opens it up to a much wider audience.
00:24:44.440 --> 00:24:53.180
Like, you're saying, like, the group that's working directly with Athena and so on, they can just, you know, possibly start using IPython or what do you call them?
00:24:53.180 --> 00:24:54.600
Call them Jupyter notebooks now?
00:24:54.600 --> 00:24:54.960
I don't know.
00:24:54.960 --> 00:24:56.840
I'm not really sure what the naming is.
00:24:56.840 --> 00:25:00.860
Yeah, the front end, kind of the language agnostic part is now Project Jupyter.
00:25:01.980 --> 00:25:18.480
But it's great because we have people like Fernando Perez, who's, you know, leading this effort as part of this advisory board for a project that we got, a grant we got from the National Science Foundation to try to take the tools that have been developed in high energy physics, which are mainly very siloed.
00:25:18.480 --> 00:25:20.040
You know, it's like we're trying to solve our problem.
00:25:20.040 --> 00:25:20.960
It's a very hard problem.
00:25:20.960 --> 00:25:22.600
And we don't have a lot of extra time or money.
00:25:22.600 --> 00:25:23.160
Right.
00:25:23.160 --> 00:25:25.520
And then, but now we've done some nice things.
00:25:25.520 --> 00:25:30.800
So let's try to open that up, make it more interoperable with, like, the scientific Python world.
00:25:31.460 --> 00:25:33.160
And it's definitely a two-way street.
00:25:33.160 --> 00:25:35.260
There are lots of other great tools out there that we don't use.
00:25:35.260 --> 00:25:39.080
So we're working on improving the interoperability of all of these things.
00:25:39.080 --> 00:25:41.840
Yeah, I think that's going to be good for science all over.
00:25:41.840 --> 00:25:45.460
And the Jupyter guys just got a huge grant.
00:25:45.460 --> 00:25:50.460
I'm not sure all the folks that contributed, but it was millions, like six million or something like that.
00:25:50.460 --> 00:25:50.960
Do you remember?
00:25:50.960 --> 00:25:57.620
I don't know the number, but they rightfully have been getting some support because they're doing some great things.
00:25:57.620 --> 00:26:01.180
And yeah, I'm really happy to see that.
00:26:01.180 --> 00:26:07.960
So, you know, people will start building C++ and, you know, I guess Ruby and maybe imagine Fortran.
00:26:07.960 --> 00:26:08.420
I don't know.
00:26:08.420 --> 00:26:11.000
That probably is important somewhere in science.
00:26:11.000 --> 00:26:13.080
But I try to not touch that stuff.