WEBVTT
00:00:00.001 --> 00:00:06.840
Machine learning allows computers to find hidden insights without being explicitly programmed where to look or what to look for.
00:00:06.840 --> 00:00:14.060
Thanks to the work of some dedicated developers, Python has one of the best machine learning platforms out there called Scikit-Learn.
00:00:14.060 --> 00:00:19.180
In this episode, Alexandre Gramfort is here to tell us about scikit-learn and machine learning.
00:00:19.180 --> 00:00:25.460
This is Talk Python to Me, number 31, recorded Friday, September 25, 2015.
00:00:25.460 --> 00:00:37.200
I'm a developer in many senses of the word, because I make these applications, but I also use these verbs to make this music.
00:00:37.200 --> 00:00:41.740
I construct it line by line, just like when I'm coding another software design.
00:00:41.740 --> 00:00:47.960
In both cases, it's about design patterns. Anyone can get the job done, it's the execution that matters.
00:00:47.960 --> 00:00:53.500
I have many interests, sometimes they conflict, but creativity can usually be a benefit.
00:00:53.760 --> 00:01:00.700
Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
00:01:00.700 --> 00:01:04.820
This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy.
00:01:04.820 --> 00:01:11.280
Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via at Talk Python.
00:01:11.280 --> 00:01:15.840
This episode is brought to you by Hired and Codeship.
00:01:15.840 --> 00:01:21.180
Thank them for supporting the show on Twitter via at Hired underscore HQ and at Codeship.
00:01:22.640 --> 00:01:24.880
Hey, everyone. Thanks for listening today.
00:01:24.880 --> 00:01:27.980
Let me introduce Alexander so we can get right to the interview.
00:01:27.980 --> 00:01:38.560
Alexandre Gramfort is currently an assistant professor at Télécom ParisTech and a scientific consultant for the CEA Neurospin Brain Imaging Center.
00:01:38.560 --> 00:01:49.600
His work is on statistical machine learning, signal and image processing, optimization, scientific computing, and software engineering, with a primary focus on functional brain imaging.
00:01:50.220 --> 00:01:56.360
Before joining Télécom ParisTech, he worked at the Martinos Center for Biomedical Imaging at Harvard in Boston.
00:01:56.360 --> 00:02:02.280
He's also an active member of the Center for Data Science at Université Paris-Saclay.
00:02:02.280 --> 00:02:04.760
Alexander, welcome to the show.
00:02:04.760 --> 00:02:05.960
Thank you. Hi.
00:02:06.660 --> 00:02:11.180
Hi. I'm really excited to talk about machine learning and scikit-learn with you today.
00:02:11.180 --> 00:02:17.820
It's something I know almost nothing about, so it's going to be a great chance for me to learn along with everyone else who's listening in.
00:02:17.820 --> 00:02:21.140
So hopefully I'll be able to give relevant answers.
00:02:21.140 --> 00:02:22.760
Yeah, I'm sure that you will.
00:02:23.660 --> 00:02:27.820
All right, so we're going to talk all about machine learning, but before we get there, let's hear your story.
00:02:27.820 --> 00:02:29.060
How did you get into programming in Python?
00:02:29.060 --> 00:02:35.660
Well, I've done a lot of scientific computing and scientific programming over the last maybe 10 to 15 years.
00:02:35.660 --> 00:02:40.660
I started my undergrad in computer science, doing a lot of signal and image processing.
00:02:40.660 --> 00:02:45.700
Well, like most people with that background, I've done a lot of MATLAB in my previous life.
00:02:46.000 --> 00:02:49.300
Yes, I've done a lot of MATLAB too. I know about the .m files.
00:02:49.300 --> 00:02:56.560
And I switched teams for my postdoc.
00:02:56.560 --> 00:02:59.960
Basically, I did a PhD in computer science applied to brain imaging.
00:02:59.960 --> 00:03:05.300
And I switched to a different team where basically I was surrounded by people working with Python.
00:03:05.300 --> 00:03:08.120
And basically, I got into it and switched.
00:03:08.120 --> 00:03:12.500
In one week, MATLAB was gone from my life.
00:03:14.040 --> 00:03:16.060
But it's been maybe five years now.
00:03:16.060 --> 00:03:20.260
And yeah, that's kind of the historical part.
00:03:20.260 --> 00:03:22.220
Do you miss MATLAB?
00:03:22.220 --> 00:03:23.600
Not really.
00:03:23.600 --> 00:03:25.760
Me either.
00:03:25.760 --> 00:03:29.720
There are some cool things about it, but...
00:03:29.720 --> 00:03:34.540
Yeah, I still have students who insist on working with me in MATLAB.
00:03:34.540 --> 00:03:38.760
So I still have to do stuff in MATLAB for supervision.
00:03:38.760 --> 00:03:42.440
But not really when I have the choice.
00:03:43.080 --> 00:03:44.200
Yeah, if you get a choice, of course.
00:03:44.200 --> 00:03:53.640
I think one of the things that's really a drawback about specialized systems like MATLAB is that it's very hard to build finished, production-ready products.
00:03:53.640 --> 00:03:55.220
You can do research.
00:03:55.220 --> 00:03:56.040
You can learn.
00:03:56.040 --> 00:03:57.220
You can write papers.
00:03:57.220 --> 00:03:59.120
You can even test algorithms.
00:03:59.120 --> 00:04:06.940
But if you want to get something that's running on data centers on its own, probably MATLAB is, you know, you could make it work, but it's not generally the right choice.
00:04:06.940 --> 00:04:07.740
Definitely.
00:04:07.740 --> 00:04:08.220
Yeah.
00:04:08.220 --> 00:04:08.920
Yeah.
00:04:09.040 --> 00:04:21.260
And so things like, you know, I think that explains a lot of the growth of Python in this whole data science, scientific computing world, along with great toolkits like scikit-learn, right?
00:04:21.260 --> 00:04:22.080
Yes.
00:04:22.080 --> 00:04:26.740
I mean, definitely the way scikit-learn is now used.
00:04:27.740 --> 00:04:35.020
The fact that the Python stack allows you to make this production type of code is a clear win for everyone.
00:04:36.300 --> 00:04:46.240
So before we get into the details of scikit-learn and how you work with it and all the features it has, let's just, you know, in a really broad way, talk about machine learning.
00:04:46.240 --> 00:04:47.420
Like, what is machine learning?
00:04:47.420 --> 00:04:54.400
I would say the simple example of machine learning is trying to predict something from previous data.
00:04:54.400 --> 00:04:58.400
So what people would call supervised learning.
00:04:58.400 --> 00:05:07.380
And there are plenty of examples of this in everyday life, like your mailbox that predicts for you whether an email is spam or ham.
00:05:07.380 --> 00:05:17.000
And that's basically a system that learns from previous data how to make an informed choice and give you a prediction.
00:05:17.000 --> 00:05:20.700
And that's basically the most simple way of seeing machine learning.
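NOTE
A minimal sketch of the supervised spam/ham setup just described, with made-up emails and labels; the particular pipeline (CountVectorizer plus MultinomialNB) is one common choice, not something prescribed in the episode.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline
  # Hypothetical training data: observations (emails) paired with labels.
  emails = ["win a free prize now", "cheap meds, click here",
            "lunch meeting tomorrow at noon", "quarterly report attached"]
  labels = ["spam", "spam", "ham", "ham"]
  # Turn raw text into count vectors, then fit a Naive Bayes classifier.
  model = make_pipeline(CountVectorizer(), MultinomialNB())
  model.fit(emails, labels)
  print(model.predict(["free prize inside"]))  # -> ['spam']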
00:05:21.540 --> 00:05:30.960
And basically you see machine learning problems framed this way in all contexts, from industry to academic science.
00:05:30.960 --> 00:05:33.840
And, I mean, there are many examples.
00:05:33.840 --> 00:05:43.940
And basically, the other big class of problems that you see in machine learning is not really these prediction problems.
00:05:43.940 --> 00:05:57.120
You're trying to make sense of raw data where you don't have labels like spam or ham; you just have data and you want to figure out what the structure is, what types of insight you can get from it.
00:05:57.120 --> 00:06:02.500
And that's, I would say, the other big class of problem that machine learning addresses.
00:06:02.500 --> 00:06:05.860
Yeah, so there's that general classification.
00:06:06.380 --> 00:06:21.940
I guess with the first category you were talking about, like spam filters, other things that maybe fall into that realm would be credit card fraud, maybe trading stocks, these kinds of binary do-it, don't-do-it decisions based on examples.
00:06:21.940 --> 00:06:26.660
What's that called, is it structured learning, or what's the term?
00:06:26.660 --> 00:06:30.400
The common name is supervised learning.
00:06:30.400 --> 00:06:31.780
Supervised learning, that's right.
00:06:31.780 --> 00:06:38.600
Yeah, so basically you have pairs of training observations that are the data and their corresponding labels.
00:06:38.600 --> 00:06:41.440
So text and the label would be spam or ham.
00:06:41.440 --> 00:06:45.460
So this is basically binary classification.
00:06:45.460 --> 00:06:49.740
The other types of machine learning problems you have is, for example, regression.
00:06:49.740 --> 00:06:54.500
You want to predict the price of a house and you know the number of square feet.
00:06:54.500 --> 00:06:57.280
You know the number of rooms.
00:06:57.280 --> 00:06:59.100
You know exactly what the location is.
00:06:59.520 --> 00:07:03.840
And so you have a bunch of variables that describe your house or apartment.
00:07:03.840 --> 00:07:05.820
And from this you want to predict the price.
00:07:05.820 --> 00:07:10.380
And that's another example, where now the price is a continuous variable.
00:07:10.380 --> 00:07:11.400
It's not binary.
00:07:11.400 --> 00:07:13.800
This is what people call regression.
00:07:13.800 --> 00:07:17.000
And this is another big class of supervised learning problem.
00:07:17.000 --> 00:07:17.620
Right.
00:07:17.700 --> 00:07:34.220
So you might know through the real estate data, all the houses in the neighborhood that have sold in the last two years, the ones that have sold last month, all their variables and dimensions, if you will, like number of bathrooms, number of bedrooms, square feet, or square meters.
00:07:34.220 --> 00:07:39.720
You could feed it into the system to train it.
00:07:40.180 --> 00:07:44.840
And then you could say, well, now I have a house with two bathrooms and three bedrooms.
00:07:44.840 --> 00:07:46.460
And right here, what's it worth?
00:07:46.460 --> 00:07:46.800
Right?
00:07:46.800 --> 00:07:47.520
Exactly.
00:07:47.520 --> 00:07:55.660
That's basically a typical example and also a typical data set that we use in scikit-learn that basically illustrates the concept of regression with a similar problem.
00:07:55.660 --> 00:07:56.540
Right.
00:07:56.660 --> 00:08:01.900
We'll talk more about it, but scikit-learn comes with some pre-built data sets.
00:08:01.900 --> 00:08:03.940
And one of them is the Boston housing market, right?
00:08:03.940 --> 00:08:04.660
Exactly.
00:08:04.660 --> 00:08:05.360
That's the one.
00:08:05.360 --> 00:08:05.840
Yeah.
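NOTE
A rough sketch of the regression workflow being described. The Boston housing dataset mentioned here shipped with scikit-learn at the time but has since been removed, so this sketch swaps in the bundled California housing data; the steps are the same.
  from sklearn.datasets import fetch_california_housing
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import train_test_split
  # Houses described by numeric variables; the target is a continuous price.
  data = fetch_california_housing()  # downloads the data on first use
  X_train, X_test, y_train, y_test = train_test_split(
      data.data, data.target, random_state=0)
  reg = LinearRegression().fit(X_train, y_train)
  print(reg.score(X_test, y_test))  # R^2 on held-out houses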
00:08:05.840 --> 00:08:08.840
How much data do you have to give it?
00:08:08.840 --> 00:08:14.920
Like, suppose I want to try to estimate the value of my house, which, you know, at least in the United States, we have this service called Zillow.
00:08:14.920 --> 00:08:16.880
So they're doing way more.
00:08:16.880 --> 00:08:18.640
I'm sure they're running something like this, actually.
00:08:19.100 --> 00:08:25.160
But suppose I wanted to take it upon myself to, like, grab the real estate data and try to estimate the value of my home.
00:08:25.160 --> 00:08:30.580
How many houses would I have to give it before it would start to be reasonable?
00:08:30.580 --> 00:08:32.660
Well, that's a tough question.
00:08:32.660 --> 00:08:35.380
And I guess there's no simple answer.
00:08:35.380 --> 00:08:44.060
I mean, there's the rule that you can see on the scikit-learn cheat sheet that says if you have fewer than 50 observations, then go get more data.
00:08:45.600 --> 00:08:48.460
But I guess it's also a simplified answer.
00:08:48.460 --> 00:08:50.640
It depends on the difficulty of the task.
00:08:50.640 --> 00:08:55.300
So at the end of the day, often for these types of problems, you want to know something.
00:08:55.300 --> 00:08:58.500
And this can be easy or hard.
00:08:58.500 --> 00:09:00.000
You cannot really know before trying.
00:09:00.000 --> 00:09:07.440
And typically for regression you'd say, okay, if I predict within plus or minus 10%, that's maybe good enough for my application.
00:09:07.440 --> 00:09:08.980
And maybe you need less data.
00:09:08.980 --> 00:09:12.000
If you want to be super accurate, you need more data.
00:09:12.000 --> 00:09:17.080
But the question of how much data is really hard to answer without actually trying it on real data.
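NOTE
One practical way to "try with actual data", as suggested here, is a learning curve: score the model on growing training sizes and see whether more data still helps. A sketch on synthetic data (the Ridge model and the sizes are arbitrary illustrative choices):
  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import learning_curve
  # Synthetic regression data stands in for real housing records.
  X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
  # If the validation score is still climbing at the largest size,
  # collecting more data would probably pay off.
  sizes, train_scores, valid_scores = learning_curve(
      Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
  print(sizes, valid_scores.mean(axis=1))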
00:09:17.080 --> 00:09:18.380
Yeah, I can imagine.
00:09:18.380 --> 00:09:26.780
It probably also depends on the variability of the data, the accuracy of the data, how many variables you're trying to give it.
00:09:26.780 --> 00:09:40.080
So if you just tried to base it on the square footage or square meters of your house, that one variable, maybe it's easier to predict than, you know, 20 components that describe your house, right?
00:09:40.680 --> 00:09:46.760
The thing is, the more variables you have, the more you can hope to get.
00:09:46.760 --> 00:09:53.520
Now it's not as simple as this, because if variables are not informative, then they're basically adding noise to your problem.
00:09:53.520 --> 00:10:02.780
So you want as many variables as possible describing your data in order to capture the weak signals.
00:10:02.780 --> 00:10:06.560
But sometimes variables are just not relevant or predictive.
00:10:06.560 --> 00:10:10.240
And so you want to remove them from the prediction problem.
00:10:10.240 --> 00:10:11.920
Okay, that makes sense.
00:10:11.920 --> 00:10:24.700
So I was looking into what are some of the novel uses of machine learning in order to sort of have some things to ask you about and just see what's out there.
00:10:25.580 --> 00:10:27.620
What are ones that come to mind for you?
00:10:27.620 --> 00:10:29.340
And then I'll give you some that I found on my list.
00:10:29.340 --> 00:10:36.860
Maybe I'm biased because I'm really into using machine learning for scientific data and academic problems.
00:10:36.860 --> 00:10:47.080
But I guess the things that are really academic breakthroughs reaching everybody are related to computer vision and NLP these days, and probably also speech.
00:10:47.080 --> 00:10:58.460
So these types of systems that try to predict something from speech signals or from images, like describing to you what the contents are, what types of objects you can find.
00:10:58.460 --> 00:11:01.600
And for NLP you have like machine translation.
00:11:01.600 --> 00:11:07.680
We did a show with OpenCV and the whole Python angle there.
00:11:07.680 --> 00:11:11.780
There was a lot of really cool stuff on medical imaging going on there.
00:11:11.780 --> 00:11:13.780
Does that have to do with scikit-learn as well?
00:11:14.420 --> 00:11:29.420
Well, you have people doing medical imaging using scikit-learn, basically extracting features from MR images, magnetic resonance images, or CT scanners, or also like EEG brain signals.
00:11:29.420 --> 00:11:39.580
They're using scikit-learn as the prediction tool, deriving features from their raw data.
00:11:40.440 --> 00:11:45.280
And that reaches, of course, clinical applications in some contexts.
00:11:45.280 --> 00:11:57.100
Maybe automatic systems that say, hey, this looks like it could be cancer or it could be some kind of problem, bring the attention of an expert who could actually look at it and say, yes, no, something like this?
00:11:57.100 --> 00:11:57.960
Yeah, exactly.
00:11:58.060 --> 00:12:19.060
It's like helping diagnosis: trying to help the clinician isolate something that looks weird or suspicious in the data, to focus the time of the physician and the clinician onto this particular part of the data, to see what's going on and whether the patient is suffering from something.
00:12:19.640 --> 00:12:19.960
Right.
00:12:19.960 --> 00:12:20.640
That's really cool.
00:12:20.640 --> 00:12:36.240
I mean, maybe you could take previous biopsies and invasive things that have happened to other people and their pictures and their outcomes and say, look, you have basically the same features and we did this test and the machine believes that you actually don't have a problem.
00:12:36.240 --> 00:12:37.800
So, you know, probably don't worry about it.
00:12:37.800 --> 00:12:39.400
We'll just watch this or something like that, right?
00:12:39.400 --> 00:12:45.860
Yeah, I mean, on this line of thought, there was recently a Kaggle competition using retina pictures.
00:12:45.860 --> 00:12:50.880
So, like people suffering from diabetes usually have problems with retinas.
00:12:50.880 --> 00:13:05.000
And so, you can take pictures of retinas from hundreds of people and see if you can build a system that predicts something about the patient and the state of the disease from these images.
00:13:05.000 --> 00:13:08.040
And this is typically done by pooling data from multiple people.
00:13:08.040 --> 00:13:09.140
That's really cool.
00:13:09.140 --> 00:13:15.140
I've heard of these Kaggle competitions or challenges before in various places.
00:13:15.140 --> 00:13:15.620
What is that?
00:13:15.620 --> 00:13:33.540
So, it's basically a website that allows you to organize these types of supervised learning problems, where a company or an organization, an NGO, whatever, has data and is trying to build a system, a predictive system.
00:13:33.980 --> 00:13:45.100
And they ask Kaggle to set this up, which basically means for Kaggle putting the training data set online and giving this to data scientists.
00:13:45.100 --> 00:13:52.020
And they basically then spend time building a predictive system that is evaluated on new data on which to get a score.
00:13:52.360 --> 00:14:01.820
And that allows you to see how the system works on new data and to rank basically the data scientists that are playing with the system.
00:14:01.820 --> 00:14:07.100
It's kind of an open innovation approach in data science.
00:14:07.100 --> 00:14:08.680
That's really cool.
00:14:08.920 --> 00:14:10.600
So, that's just Kaggle.com.
00:14:10.600 --> 00:14:11.260
Yes.
00:14:11.260 --> 00:14:13.160
K-A-G-G-L-E.com.
00:14:13.160 --> 00:14:13.640
Exactly.
00:14:13.640 --> 00:14:14.380
Yeah.
00:14:14.380 --> 00:14:14.780
Very nice.
00:14:14.780 --> 00:14:34.200
Some of the other ones that I sort of ran across while I was looking around that were pretty cool was one is some guys at Cornell University built machine learning algorithms to listen for the sound of whales in the ocean and use them in real time to help ships avoid running into whales.
00:14:34.760 --> 00:14:35.760
That's pretty awesome, right?
00:14:35.760 --> 00:14:36.040
Yeah.
00:14:36.040 --> 00:14:36.600
Yeah.
00:14:36.600 --> 00:14:43.100
There was a Kaggle competition on these whale sounds maybe a couple of years ago.
00:14:43.100 --> 00:14:49.980
And it was – I mean, not many data scientists have experience, like, listening to whales.
00:14:49.980 --> 00:14:53.460
So, kind of nobody really knows what this type of data is like.
00:14:53.920 --> 00:15:01.860
And I remember this presentation from the winner basically saying how to win a Kaggle competition without knowing anything about the data.
00:15:01.860 --> 00:15:03.320
It's kind of a provocative talk.
00:15:03.320 --> 00:15:04.860
That is cool.
00:15:04.860 --> 00:15:12.820
But showing how you can basically build a predictive system by just looking at the data and trying to make sense out of it without really being an expert in the field.
00:15:13.220 --> 00:15:13.400
Yeah.
00:15:13.400 --> 00:15:17.160
That's probably a really valuable skill as a data scientist to have, right?
00:15:17.160 --> 00:15:19.140
Because you can be an expert, but not in everything.
00:15:19.140 --> 00:15:29.500
Another one that was interesting was IBM working on something to look at the handwritten notes of physicians.
00:15:29.500 --> 00:15:30.520
Uh-huh.
00:15:30.520 --> 00:15:37.060
And then it would predict how likely the person those notes were about was to have a heart attack.
00:15:37.060 --> 00:15:37.560
Yeah.
00:15:37.560 --> 00:15:48.960
In the clinical world, it's true that a lot of information is actually raw text, like handwritten notes, but also raw text in the system.
00:15:48.960 --> 00:15:56.860
For machine learning, that's a particularly difficult problem because it's what we call unstructured data.
00:15:57.340 --> 00:16:08.780
So you need to – typically for scikit-learn to work on these types of data, you need to do something extra to basically come up with a structure or come up with features that allow you to predict something.
00:16:08.780 --> 00:16:10.080
Sure.
00:16:10.080 --> 00:16:15.360
And so both of those two examples that I brought up have really interesting data origin problems.
00:16:15.980 --> 00:16:30.560
So if I give you an MP3 of a whale or an audio stream of a whale, how do you turn that into numbers that go into the machine even to train it?
00:16:30.560 --> 00:16:36.020
And then similarly with handwriting, how do you – you've got to do handwriting recognition.
00:16:36.020 --> 00:16:40.840
You've then got to sort of understand what the handwriting means.
00:16:41.100 --> 00:16:42.180
And there's a lot of levels.
00:16:42.180 --> 00:16:46.540
How do you take this data and actually get it into something like scikit-learn?
00:16:46.540 --> 00:16:56.720
So scikit-learn expects that every observation, we also call it a sample or a data point, is basically described by a vector, like a vector of values.
00:16:57.580 --> 00:17:10.480
So if you take the sound of the whale, you can say, okay, the sound in the MP3 is just a set of floating point values, one per time sample, really a time-domain signal that you get for a few seconds of data.
00:17:10.480 --> 00:17:15.520
It's probably not the best way to get a good predictive system.
00:17:15.680 --> 00:17:25.140
You want to do some feature transformation, change the input to get something that brings features that are more powerful for scikit-learn and the learning system.
00:17:25.140 --> 00:17:38.760
And you would typically do this with a time-frequency transform, things like spectrograms, trying to extract features that are, for example, invariant to some aspects of the data, like frequencies or time shifts.
00:17:38.860 --> 00:17:43.280
So there's probably a bit of pre-processing to do on these raw signals.
00:17:43.280 --> 00:17:48.500
And then once you have your vector, you can use the scikit-learn machinery to build your predictive system.
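NOTE
A sketch of that pre-processing step: turning a raw audio signal into a fixed-length feature vector via a spectrogram, using scipy.signal rather than scikit-learn (the synthetic tone and the log-power summary are illustrative choices only).
  import numpy as np
  from scipy import signal
  fs = 8000  # sample rate in Hz
  t = np.arange(0, 3.0, 1.0 / fs)
  audio = np.sin(2 * np.pi * 440 * t)  # stand-in for a real whale recording
  # Time-frequency transform: the columns of Sxx are spectra over time.
  freqs, times, Sxx = signal.spectrogram(audio, fs=fs, nperseg=256)
  # Summarize into one vector per clip, as scikit-learn estimators expect.
  features = np.log(Sxx + 1e-10).mean(axis=1)
  print(features.shape)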
00:17:48.500 --> 00:17:53.320
How much of that pre-processing is in the tool set?
00:17:53.320 --> 00:17:55.860
So it depends on the type of data.
00:17:55.860 --> 00:17:59.780
Typically for signals, there's nothing really specific in scikit-learn.
00:17:59.780 --> 00:18:05.540
You would probably use scipy.signal or any type of signal processing Python code that you find online.
00:18:06.320 --> 00:18:14.040
I would say for other types of data, like text, in scikit-learn there is something called the feature extraction module.
00:18:14.040 --> 00:18:22.980
In the feature extraction module, you have something for text; probably the biggest part of feature extraction is really text processing.
00:18:22.980 --> 00:18:28.900
And you have some stuff also for images, but it's quite limited.
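NOTE
A quick sketch of both parts of sklearn.feature_extraction mentioned here, with made-up inputs: tf-idf vectors from text, and 2-D patches from an image.
  import numpy as np
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.feature_extraction import image
  # Text: turn raw documents into a sparse tf-idf matrix.
  docs = ["patient reports chest pain", "no symptoms at follow-up"]
  X_text = TfidfVectorizer().fit_transform(docs)
  print(X_text.shape)  # (2 documents, vocabulary size)
  # Images: sample small patches from a (here random) grayscale image.
  img = np.random.rand(64, 64)
  patches = image.extract_patches_2d(img, patch_size=(8, 8), max_patches=100)
  print(patches.shape)  # (100, 8, 8)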
00:18:29.580 --> 00:18:33.860
We should probably introduce what scikit-learn is and get into the details of that.
00:18:33.860 --> 00:18:38.240
But I have one more sort of example to let people know about that I think is pretty cool.
00:18:38.240 --> 00:18:42.200
On show 16, I talked to Roy Rappaport from Netflix.
00:18:42.200 --> 00:18:51.700
And Netflix has a tremendously large cloud computing infrastructure to power all of their – you know, basically their movie system, right?
00:18:51.700 --> 00:18:53.460
And everything behind the scenes there.
00:18:53.460 --> 00:19:09.440
And they have so many virtual machine instances and services running on them, and then different types of devices accessing services on those machines, that they said it's almost impossible to manually determine if there's, you know, some edge case where there's a problem.
00:19:10.220 --> 00:19:18.980
And so they actually set up machine learning to monitor their infrastructure and then tell them if there's some kind of problem in real time.
00:19:18.980 --> 00:19:19.820
Yeah.
00:19:19.820 --> 00:19:22.280
So I think that's really a cool use of it as well.
00:19:33.460 --> 00:19:36.140
This episode is brought to you by Hired.
00:19:36.140 --> 00:19:42.600
Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
00:19:42.600 --> 00:19:51.760
Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.
00:19:51.760 --> 00:19:58.120
Typically, candidates receive five or more offers in just the first week, and there are no obligations ever.
00:19:58.120 --> 00:20:00.220
Sounds pretty awesome, doesn't it?
00:20:00.220 --> 00:20:02.280
Well, did I mention there's a signing bonus?
00:20:02.600 --> 00:20:06.360
Everyone who accepts a job from Hired gets a $2,000 signing bonus.
00:20:06.360 --> 00:20:10.700
And for Talk Python listeners, it gets way sweeter.
00:20:10.700 --> 00:20:18.280
Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $4,000.
00:20:18.280 --> 00:20:20.000
Opportunity's knocking.
00:20:20.000 --> 00:20:23.600
Visit Hired.com slash Talk Python To Me and answer the call.
00:20:31.740 --> 00:20:36.320
Yeah, that's a very cool thing to do.
00:20:36.320 --> 00:20:46.900
And actually, many industries and many companies are looking for these types of systems that they call anomaly detection or failure prediction.
00:20:46.900 --> 00:20:51.920
And it's getting a big use case for machine learning, indeed.
00:20:52.480 --> 00:20:56.740
The Netflix guys were actually using scikit-learn, not just some other machine learning system.
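NOTE
The episode doesn't say which estimator Netflix used; as one illustration of the anomaly-detection idea in today's scikit-learn, here is a sketch with IsolationForest on made-up server metrics.
  import numpy as np
  from sklearn.ensemble import IsolationForest
  rng = np.random.RandomState(0)
  # Rows are instances; columns could be CPU load and request latency.
  normal = rng.normal(loc=0.5, scale=0.1, size=(200, 2))
  weird = np.array([[0.95, 0.98]])  # one misbehaving instance
  detector = IsolationForest(random_state=0).fit(normal)
  print(detector.predict(weird))  # -> [-1], flagged as an anomaly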
00:20:56.740 --> 00:20:59.040
So let's get to the details of that.
00:20:59.040 --> 00:20:59.760
What's scikit-learn?
00:20:59.760 --> 00:21:00.880
Where did it come from?
00:21:00.880 --> 00:21:06.940
So scikit-learn is probably the biggest machine learning library that you can find in the Python world.
00:21:07.120 --> 00:21:16.900
So it dates back to almost 10 years ago, when David Cournapeau was doing a Google Summer of Code to kickstart the scikit-learn project.
00:21:16.900 --> 00:21:24.360
And then for a few years, there was a French guy called Matthieu Brucher who took on the project.
00:21:24.700 --> 00:21:28.480
But it was kind of a one-guy project for many years.
00:21:28.480 --> 00:21:45.020
And in 2010, with colleagues at INRIA in France, we decided to basically try to start from this state of scikit-learn and make it bigger and really try to build a community around this.
00:21:46.460 --> 00:21:56.000
So these people are Gaël Varoquaux and Fabian Pedregosa, and also somebody you may have heard of in the machine learning world, Olivier Grisel.
00:21:56.000 --> 00:22:03.240
And so that was pretty much 2010, so five years ago.
00:22:03.240 --> 00:22:05.800
And basically it took off pretty quickly.
00:22:05.800 --> 00:22:16.440
After, I would say, a year, scikit-learn had more than 10 core developers, way beyond the initial lab where it started.
00:22:17.300 --> 00:22:18.740
That's really excellent.
00:22:18.740 --> 00:22:24.200
Yeah, I mean, it's definitely an absolutely mainstream project that people are using in production these days.
00:22:24.200 --> 00:22:26.500
So congratulations to everyone on that.
00:22:26.500 --> 00:22:26.960
That's great.
00:22:26.960 --> 00:22:27.600
Thank you.
00:22:27.600 --> 00:22:28.020
Yeah.
00:22:28.020 --> 00:22:37.420
And so the name scikit-learn comes from the fact that it's basically an extension to the SciPy ecosystem, right?
00:22:37.420 --> 00:22:48.700
So the SciPy stack is NumPy for numerical processing, SciPy for scientific stuff, Matplotlib, IPython, SymPy for symbolic math, and Pandas, right?
00:22:48.700 --> 00:22:50.200
And then there's these extensions.
00:22:50.200 --> 00:22:51.480
Yes.
00:22:51.960 --> 00:22:55.500
So basically the kind of division is that you cannot put everything in SciPy.
00:22:55.500 --> 00:22:57.200
SciPy is already a big project.
00:22:57.200 --> 00:23:03.680
And the idea of the scikits was to build extensions around SciPy that are more domain-specific.
00:23:03.680 --> 00:23:08.220
Also, it's easier to contribute to a smaller project.
00:23:08.220 --> 00:23:16.960
So basically the barrier to entry for newcomers is much lower when you contribute to a scikit than to SciPy, which is a fairly big project now.
00:23:16.960 --> 00:23:21.940
Yeah, there's so much support for the whole SciPy system, right?
00:23:22.180 --> 00:23:27.300
So it's much better to just build on that than try to duplicate anything in, say, NumPy or whatever.
00:23:27.300 --> 00:23:28.240
Exactly.
00:23:28.240 --> 00:23:37.100
I mean, there are a lot of efforts to see what could be NumPy 2.0, what's going to be the future of it, and how to extend it.
00:23:37.100 --> 00:23:44.780
I mean, a lot of people are thinking of what's next because, I mean, NumPy is almost 10 years old, probably more than 10 years old now.
00:23:44.780 --> 00:23:49.120
And, yeah, people are trying to see also how it can evolve.
00:23:49.120 --> 00:23:49.680
Sure.
00:23:49.680 --> 00:23:50.660
That makes a lot of sense.
00:23:51.320 --> 00:23:57.000
So speaking of evolving and going forward, what are the plans with scikit-learn?
00:23:57.000 --> 00:23:57.860
Where is it going?
00:23:57.860 --> 00:24:04.420
So I would say in terms of features, I mean, scikit-learn is really in the consolidation stage.
00:24:04.420 --> 00:24:06.600
scikit-learn is five years old.
00:24:06.600 --> 00:24:09.060
The API is pretty much settled.
00:24:09.060 --> 00:24:20.260
There are a few things here and there that we have to deal with now, due to early API decisions that need to be fixed.
00:24:20.460 --> 00:24:44.060
And I guess the big objective is to basically do scikit-learn 1.0, the first fully stable release in terms of API, because that's something we've been talking about between the core developers for, I mean, more than two years now: coming out with this 1.0 version that stabilizes every part of the API.
00:24:44.060 --> 00:24:44.540
Right.
00:24:44.540 --> 00:24:49.200
One final major cleanup, if you can, and then stabilizing it, yeah?
00:24:49.520 --> 00:24:49.920
Exactly.
00:24:49.920 --> 00:24:49.960
Exactly.
00:24:49.960 --> 00:25:01.160
And in terms of new features, I mean, there's always a lot of cool stuff around, and you see the number of pull requests coming into scikit-learn.
00:25:01.160 --> 00:25:02.280
It's pretty crazy.
00:25:02.280 --> 00:25:07.040
And it takes, I would say, a huge maintenance and reviewing effort.
00:25:07.540 --> 00:25:14.100
So features are coming into scikit-learn slowly now, much more slowly than they used to, but I guess it's normal for a project that is getting big.
00:25:14.100 --> 00:25:16.060
Yeah, it's definitely getting big.
00:25:16.060 --> 00:25:22.480
It has 7,600 stars and 4,500 forks on GitHub, so that's pretty awesome.
00:25:22.480 --> 00:25:23.040
Yeah.
00:25:23.040 --> 00:25:24.500
It has 457 contributors.
00:25:24.500 --> 00:25:24.880
Cool.
00:25:25.440 --> 00:25:30.740
Yeah, I would say for every release we get, I mean, we try to release every six months.
00:25:30.740 --> 00:25:36.060
And for every release we get a big number of contributors.
00:25:36.060 --> 00:25:44.020
So maybe we could do like a survey of the modules of scikit-learn, just the important ones that come to mind.
00:25:44.020 --> 00:25:45.660
What are the moving parts in there?
00:25:46.180 --> 00:25:52.720
So I would say maybe the part I know the most, the module that I maintain the most, which is the linear model module.
00:25:52.720 --> 00:25:58.760
And recently the efforts on the linear models were to scale it up.
00:25:58.760 --> 00:26:07.780
Basically, trying to learn these linear models in an out-of-core fashion, to be able to scale to data that does not fit in RAM.
00:26:07.780 --> 00:26:15.040
And that's part of the, I would say, part of the plan for this linear model module in scikit-learn.
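NOTE
A sketch of the out-of-core idea: SGDClassifier's partial_fit consumes one chunk at a time, so the full dataset never has to fit in RAM. The random chunks here stand in for data streamed from disk or logs.
  import numpy as np
  from sklearn.linear_model import SGDClassifier
  clf = SGDClassifier()
  classes = np.array([0, 1])  # all labels must be declared up front
  rng = np.random.RandomState(0)
  for _ in range(10):  # imagine each chunk read from a huge log file
      X_chunk = rng.rand(1000, 20)
      y_chunk = (X_chunk[:, 0] > 0.5).astype(int)
      clf.partial_fit(X_chunk, y_chunk, classes=classes)
  X_test = rng.rand(100, 20)
  print(clf.score(X_test, (X_test[:, 0] > 0.5).astype(int)))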
00:26:15.400 --> 00:26:15.960
That's cool.
00:26:15.960 --> 00:26:17.420
So what kind of problems do you solve with that?
00:26:17.420 --> 00:26:24.640
The types of problems where you have a, like, humongous number of samples and potentially a large number of features.
00:26:24.640 --> 00:26:31.360
So there are not so many applications where you get that many samples, but it's typically text or log files.
00:26:31.360 --> 00:26:37.460
These types of industry problems where you collect a lot of samples on a regular basis.
00:26:38.140 --> 00:26:46.720
You have these examples also if you monitor an industrial system, like if you want to do what we discussed before about, like, predictive maintenance.
00:26:46.720 --> 00:26:49.360
That's probably a use case where this can be useful.
00:26:50.620 --> 00:27:00.820
Probably the other, like, module that also attracts a lot of effort these days is the Ensemble module, and especially the tree module.
00:27:00.820 --> 00:27:13.080
So for models like Random Forest or Gradient Boosting, which are very popular models that have been helping people to win Kaggle competitions for the last few years.
00:27:13.560 --> 00:27:17.020
Yeah, I've heard a lot about these forests and so on.
00:27:17.020 --> 00:27:18.600
Can you talk a little bit about what that is?
00:27:19.080 --> 00:27:32.140
So a random forest basically is a set of decision trees that you pool together to get a prediction that is more accurate.
00:27:32.140 --> 00:27:36.840
More accurate because it has less variance in technical terms.
00:27:36.840 --> 00:27:46.720
And the way it works is you try to basically build decision trees from a subset of data, a subset of samples, subset of features in a clever way.
00:27:46.720 --> 00:27:50.540
And then you pool all these trees into one big predictive model.
00:27:50.540 --> 00:27:59.080
And, for example, if you do binary classification and you train a thousand trees, for a new observation you ask the thousand trees:
00:27:59.080 --> 00:27:59.900
What's the label?
00:27:59.900 --> 00:28:01.200
Is it positive or negative?
00:28:01.200 --> 00:28:05.200
And then you basically count the number of trees that are saying positive.
00:28:05.200 --> 00:28:08.200
And if you have more trees saying positive, then you predict positive.
00:28:08.200 --> 00:28:11.600
That's kind of the basic idea of random forest.
00:28:11.600 --> 00:28:13.280
And it turns out to be super powerful.
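NOTE
A sketch of that majority vote with scikit-learn's RandomForestClassifier on synthetic data, using a thousand trees to match the example above; predict_proba shows the averaged per-tree class estimates behind the vote.
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  X, y = make_classification(n_samples=300, n_features=8, random_state=0)
  # Each tree is grown on a bootstrap sample with random feature subsets.
  forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)
  print(forest.predict(X[:2]))        # majority vote across the trees
  print(forest.predict_proba(X[:2]))  # averaged class estimates per tree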
00:28:13.280 --> 00:28:14.460
That's really cool.
00:28:14.460 --> 00:28:23.120
Well, it seems to me like it would bring in kind of different perspectives or taking different components or parts of a problem into account.
00:28:23.120 --> 00:28:28.940
So some of the trees look at some features and maybe the other trees look at other features.
00:28:28.940 --> 00:28:31.820
And then they can combine in some important way.
00:28:31.820 --> 00:28:32.660
Exactly.
00:28:32.660 --> 00:28:33.200
Yeah.
00:28:33.200 --> 00:28:36.820
Another one that I see coming up is the SVM module.
00:28:36.820 --> 00:28:37.860
What's that one do?
00:28:38.800 --> 00:28:52.240
So SVM is a very popular machine learning approach that was basically, I mean, very big in the 90s and 10 years ago, and still gets some traction.
00:28:52.240 --> 00:29:10.000
And basically, the idea of a support vector machine, which is what SVM stands for, is to be able to use kernels on the data and basically solve linear problems in an abstract space where you project your raw data.
00:29:10.000 --> 00:29:11.360
Let me try to give an example.
00:29:11.560 --> 00:29:18.320
If you take a graph, or if you take a string, that's not naturally something that can be represented by a vector.
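NOTE
A sketch of the kernel idea with data that is numeric but not linearly separable: two concentric circles that an RBF-kernel SVC separates by implicitly projecting into a richer space. The dataset and the kernel are illustrative choices, not the graph/string kernels being discussed.
  from sklearn.datasets import make_circles
  from sklearn.svm import SVC
  X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
  # No straight line separates these classes in the raw 2-D space.
  clf = SVC(kernel="rbf").fit(X, y)
  print(clf.score(X, y))  # near-perfect on this toy problem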