00:00:00 Machine learning allows computers to find hidden insights without being explicitly programmed where to look or what to look for.
00:00:06 Thanks to the work of some dedicated developers, Python has one of the best machine learning platforms out there called Scikit-Learn.
00:00:14 In this episode, Alexandre Gramfort is here to tell us about Scikit-Learn and machine learning.
00:00:19 This is Talk Python to Me, number 31, recorded Friday, September 25, 2015.
00:00:25 I'm a developer in many senses of the word, because I make these applications, but I also use these verbs to make this music.
00:00:37 I construct it line by line, just like when I'm coding another software design.
00:00:41 In both cases, it's about design patterns. Anyone can get the job done, it's the execution that matters.
00:00:47 I have many interests, sometimes they conflict, but creativity can usually be a benefit.
00:00:53 Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
00:01:00 This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy.
00:01:04 Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via at Talk Python.
00:01:11 This episode is brought to you by Hired and Codeship.
00:01:15 Thank them for supporting the show on Twitter via at Hired underscore HQ and at Codeship.
00:01:22 Hey, everyone. Thanks for listening today.
00:01:24 Let me introduce Alexandre so we can get right to the interview.
00:01:27 Alexandre Gramfort is currently an assistant professor at Télécom ParisTech and a scientific consultant for the CEA Neurospin Brain Imaging Center.
00:01:38 His work is on statistical machine learning, signal and image processing optimization, scientific computing, and software engineering with a primary focus in brain functional imaging.
00:01:50 Before joining Télécom ParisTech, he worked at the Martinos Center for Biomedical Imaging at Harvard in Boston.
00:01:56 He's also an active member of the Center for Data Science at Université Paris-Saclay.
00:02:02 Alexandre, welcome to the show.
00:02:04 Thank you. Hi.
00:02:06 Hi. I'm really excited to talk about machine learning and scikit-learn with you today.
00:02:11 It's something I know almost nothing about, so it's going to be a great chance for me to learn along with everyone else who's listening in.
00:02:17 So hopefully I'll be able to give relevant answers.
00:02:21 Yeah, I'm sure that you will.
00:02:23 All right, so we're going to talk all about machine learning, but before we get there, let's hear your story.
00:02:27 How did you get into programming in Python?
00:02:29 Well, I've done a lot of scientific computing and scientific programming over the last maybe 10 to 15 years.
00:02:35 I started my undergrad in computer science, doing a lot of signal and image processing.
00:02:40 Well, like these types of people, I've done a lot of MATLAB in my previous life.
00:02:46 Yes, I've done a lot of MATLAB too. I know about the .m files.
00:02:49 And I switched teams for my postdoc.
00:02:56 Basically, I did a PhD in computer science applied to brain imaging.
00:02:59 And I switched to a different team where basically I was surrounded by people working with Python.
00:03:05 And basically, I got into it and switched.
00:03:08 In one week, MATLAB was gone from my life.
00:03:14 But it's been maybe five years now.
00:03:16 And yeah, that's kind of the historical part.
00:03:20 Do you miss MATLAB?
00:03:22 Not really.
00:03:23 Me either.
00:03:25 There are some cool things about it, but...
00:03:29 Yeah, I still have students who insist on working with me in MATLAB.
00:03:34 So I have to still do stuff in MATLAB for supervision.
00:03:38 But not really when I have the choice.
00:03:43 Yeah, if you get a choice, of course.
00:03:44 I think one of the real drawbacks of specialized systems like MATLAB is that it's very hard to build finished, production-ready products.
00:03:53 You can do research.
00:03:55 You can learn.
00:03:56 You can write papers.
00:03:57 You can even test algorithms.
00:03:59 But if you want to get something that's running on data centers on its own, probably MATLAB is, you know, you could make it work, but it's not generally the right choice.
00:04:06 Definitely.
00:04:07 Yeah.
00:04:08 Yeah.
00:04:09 And so things like, you know, I think that explains a lot of the growth of Python in this whole data science, scientific computing world, along with great toolkits like scikit-learn, right?
00:04:21 Yes.
00:04:22 I mean, definitely the way scikit-learn is now used.
00:04:27 The fact that the Python stack allows you to make this production type of code is a clear win for everyone.
00:04:36 So before we get into the details of scikit-learn and how you work with it and all the features it has, let's just, you know, in a really broad way, talk about machine learning.
00:04:46 Like, what is machine learning?
00:04:47 I would say the simple example of machine learning is trying to predict something from previous data.
00:04:54 So what people would call supervised learning.
00:04:58 And there are plenty of examples of this in everyday life, like your mailbox that predicts for you whether an email is spam or ham.
00:05:07 And that's basically a system that learns from previous data how to make an informed choice and give you a prediction.
00:05:17 And that's basically the most simple way of seeing machine learning.
00:05:21 And basically you see machine learning problems framed this way in all contexts, from industry to academic science.
00:05:30 And, I mean, there are many examples.
00:05:33 And in terms of the other types of problems that you see in machine learning, it's not really these prediction problems.
00:05:43 It's trying to make sense of raw data where you don't have labels like spam or ham; you just have data and you want to figure out what the structure is, what types of insight you can get from it.
00:05:57 And that's, I would say, the other big class of problem that machine learning addresses.
00:06:02 Yeah, so there's that general classification.
00:06:06 I guess with the first category you were talking about, like spam filters and other things that maybe fall into that realm would be like credit card fraud, maybe trading stocks, these kind of binary, do it, don't do it, based on examples.
00:06:21 Is that something that's called structured learning, or what's the term?
00:06:26 The common name is supervised learning.
00:06:30 Supervised learning, that's right.
00:06:31 Yeah, so basically you have pairs of training observations that are the data and their corresponding labels.
00:06:38 So text and the label would be spam or ham.
00:06:41 Or you can also see, this is basically binary classification.
00:06:45 The other types of machine learning problems you have is, for example, regression.
00:06:49 You want to predict the price of a house and you know the number of square feet.
00:06:54 You know the number of rooms.
00:06:57 You know what's exactly the location.
00:06:59 And so you have a bunch of variables that describe your house or apartment.
00:07:03 And from this you want to predict the price.
00:07:05 And that's another example where now it seems the price is a continuous variable.
00:07:10 It's not binary.
00:07:11 This is what people call regression.
00:07:13 And this is another big class of supervised learning problem.
00:07:17 Right.
00:07:17 So you might know through the real estate data, all the houses in the neighborhood that have sold in the last two years, the ones that have sold last month, all their variables and dimensions, if you will, like number of bathrooms, number of bedrooms, square feet, or square meters.
00:07:34 You could feed it into the system to train it.
00:07:40 And then you could say, well, now I have a house with two bathrooms and three bedrooms.
00:07:44 And right here, what's it worth?
00:07:46 Right?
00:07:46 Exactly.
00:07:47 That's basically a typical example and also a typical data set that we use in scikit-learn that basically illustrates the concept of regression with a similar problem.
00:07:55 Right.
00:07:56 We'll talk more about it, but scikit-learn comes with some pre-built data sets.
00:08:01 And one of them is the Boston housing market, right?
00:08:03 Exactly.
00:08:04 That's the one.
00:08:05 Yeah.
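
For illustration, here is a minimal sketch of the regression workflow just described, following scikit-learn's standard fit/predict API. Note that the Boston dataset (load_boston) shipped with scikit-learn when this episode aired but has since been removed from recent releases, so this sketch uses the California housing data as a stand-in (fetch_california_housing downloads the data on first use):

    # A hedged sketch: train a regression model on housing data, predict prices.
    from sklearn.datasets import fetch_california_housing
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    data = fetch_california_housing()  # features describing each house/district
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=0)

    model = LinearRegression().fit(X_train, y_train)  # "train" on past sales
    print(model.predict(X_test[:5]))    # estimated prices for unseen houses
    print(model.score(X_test, y_test))  # R^2 score on held-out data
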
00:08:05 How much data do you have to give it?
00:08:08 Like, suppose I want to try to estimate the value of my house, which, you know, at least in the United States, we have this service called Zillow.
00:08:14 So they're doing way more.
00:08:16 I'm sure they're running something like this, actually.
00:08:19 But suppose I wanted to take it upon myself to, like, grab the real estate data and try to estimate the value of my home.
00:08:25 How many houses would I have to give it before it would start to be reasonable?
00:08:30 Well, that's a tough question.
00:08:32 And I guess there's no simple answer.
00:08:35 I mean, there's this rule you can see on the scikit-learn cheat sheet that says if you have fewer than 50 observations, then go get more data.
00:08:45 But I guess it's also a simplified answer.
00:08:48 It depends on the difficulty of the task.
00:08:50 So at the end of the day, often for these types of problems, you want to know something.
00:08:55 And this can be easy or hard.
00:08:58 You cannot really know before trying.
00:09:00 And typically for regression you'd say, okay, if I can predict within plus or minus 10%, that's maybe good enough for my application.
00:09:07 And maybe you need less data.
00:09:08 If you want to be super accurate, you need more data.
00:09:12 But the question of how much is really hard to answer without trying it on actual data.
00:09:17 Yeah, I can imagine.
00:09:18 It probably also depends on the variability of the data, the accuracy of the data, how many variables you're trying to give it.
00:09:26 So if you just tried to base it on square footage or square meters of your house, that one variable, maybe it's easier to predict than with, you know, 20 components that describe your house, right?
00:09:40 So the thing is, the more variables you have, the more you can hope to get.
00:09:46 Now it's not as simple as this, because if variables are not informative, then they're basically adding noise to your problem.
00:09:53 So you want as many variables as possible describing your data in order to capture the weak signals.
00:10:02 But sometimes variables are just not relevant or predictive.
00:10:06 And so you want to remove them from the prediction problem.
00:10:10 Okay, that makes sense.
00:10:11 So I was looking into what are some of the novel uses of machine learning in order to sort of have some things to ask you about and just see what's out there.
00:10:25 What are ones that come to mind for you?
00:10:27 And then I'll give you some that I found on my list.
00:10:29 Maybe I'm biased because I'm really into using machine learning for scientific data and academic problems.
00:10:36 But I guess the academic breakthroughs that are reaching everybody are really related to computer vision and NLP these days, and probably also speech.
00:10:47 So these types of systems try to predict something from speech signals or from images, like describing the contents, what types of objects you can find.
00:10:58 And for NLP you have like machine translation.
00:11:01 We did a show with OpenCV and the whole Python angle there.
00:11:07 There was a lot of really cool stuff on medical imaging going on there.
00:11:11 Does that have to do with scikit-learn as well?
00:11:14 Well, you have people doing medical imaging using scikit-learn, basically extracting features from MR images, magnetic resonance images, or CT scanners, or also like EEG brain signals.
00:11:29 And they're using EEG – sorry, they're using scikit-learn as the prediction tool, deriving features from their raw data.
00:11:40 And that reaches, of course, clinical applications in some contexts.
00:11:45 Maybe automatic systems that say, hey, this looks like it could be cancer or it could be some kind of problem, bring the attention of an expert who could actually look at it and say, yes, no, something like this?
00:11:57 Yeah, exactly.
00:11:58 It's like helping diagnosis, trying to help the clinician isolate something that looks weird or suspicious in the data, to focus the time of the physician and the clinician onto this particular part of the data to see what's going on and whether the patient is suffering from something.
00:12:19 Right.
00:12:19 That's really cool.
00:12:20 I mean, maybe you could take previous biopsies and invasive things that have happened to other people and their pictures and their outcomes and say, look, you have basically the same features and we did this test and the machine believes that you actually don't have a problem.
00:12:36 So, you know, probably don't worry about it.
00:12:37 We'll just watch this or something like that, right?
00:12:39 Yeah, I mean, on this line of thought, there was recently a Kaggle competition using retina pictures.
00:12:45 So, like, people suffering from diabetes usually have problems with their retinas.
00:12:50 And so, you can take pictures of retinas from hundreds of people and see if you can build a system that predicts something about the patient and the state of the disease from these images.
00:13:05 And this is typically done by pooling data from multiple people.
00:13:08 That's really cool.
00:13:09 I've heard this Kaggle competition or challenges before in various places looking at it.
00:13:15 What is that?
00:13:15 So, it's basically a website that allows you to organize these types of supervised learning problems, where a company or an organization, an NGO, whatever, has data and is trying to build a system, a predictive system.
00:13:33 And they ask Kaggle to set this up, which basically means for Kaggle putting the training data set online and giving this to data scientists.
00:13:45 And they basically then spend time building a predictive system that is evaluated on new data on which to get a score.
00:13:52 And that allows you to see how the system works on new data and to rank basically the data scientists that are playing with the system.
00:14:01 It's kind of an open innovation approach in data science.
00:14:07 That's really cool.
00:14:08 So, that's just Kaggle.com.
00:14:10 Yes.
00:14:11 K-A-G-G-L-E.com.
00:14:13 Exactly.
00:14:13 Yeah.
00:14:14 Very nice.
00:14:14 Some of the other ones that I sort of ran across while I was looking around that were pretty cool was one is some guys at Cornell University built machine learning algorithms to listen for the sound of whales in the ocean and use them in real time to help ships avoid running into whales.
00:14:34 That's pretty awesome, right?
00:14:35 Yeah.
00:14:36 Yeah.
00:14:36 There was a Kaggle competition on these whale sounds maybe a couple of years ago.
00:14:43 And it was a – I mean, not many data scientists have experienced, like, listening to whales.
00:14:49 So, kind of, nobody really knows what these types of data look like.
00:14:53 And I remember this presentation from the winner basically saying how to win a Kaggle competition without knowing anything about the data.
00:15:01 It's kind of a provocative talk.
00:15:03 That is cool.
00:15:04 But showing how you can basically build a predictive system by just looking at the data and trying to make sense out of it without really being an expert in the field.
00:15:13 Yeah.
00:15:13 That's probably a really valuable skill as a data scientist to have, right?
00:15:17 Because you can be an expert, but not in everything.
00:15:19 Some other ones that were interesting was IBM was working on something to look at the handwritten notes of physicians.
00:15:29 Uh-huh.
00:15:30 And then it would predict how likely the person those notes were about was to have a heart attack.
00:15:37 Yeah.
00:15:37 In the clinical world, it's true that a lot of information is actually raw text, like manually written notes, but also raw text in the system.
00:15:48 For machine learning, that's a particularly difficult problem because it's what we call unstructured data.
00:15:57 So you need to – typically for scikit-learn to work on these types of data, you need to do something extra to basically come up with a structure or come up with features that allow you to predict something.
00:16:08 Sure.
00:16:10 And so both of those two examples that I brought up have really interesting data origin problems.
00:16:15 So if I give you an MP3 of a whale or an audio stream of a whale, how do you turn that into numbers that go into the machine even to train it?
00:16:30 And then similarly with handwriting, how do you – you've got to do handwriting recognition.
00:16:36 You've got to then do sort of understanding what the handwriting means.
00:16:41 And there's a lot of levels.
00:16:42 How do you take this data and actually get it into something like scikit-learn?
00:16:46 So scikit-learn expects that every observation, we also call it a sample or a data point, is basically described by a vector, like a vector of values.
00:16:57 So if you take the sound of the whale, you can say, okay, the sound in the MP3 is just a set of floating point values, one per time sample, really time-domain signals that you get for a few seconds of data.
00:17:10 It's probably not the best way to get a predictive – a good predictive system.
00:17:15 You want to do some feature transformation, change the input to get something that brings features that are more powerful for scikit-learn and the learning system.
00:17:25 And you would typically do this with a time-frequency transform, things like spectrograms, trying to extract features that are, for example, invariant to some aspects of the data, like frequencies or time shifts.
00:17:38 So there's probably a bit of pre-processing to do on these raw signals.
00:17:43 And then once you have your vector, you can use the scikit-learn machinery to build your predictive system.
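
To make that concrete, here is a hedged sketch of the kind of pre-processing being described: turning a raw audio signal into time-frequency features with SciPy, then flattening them into the one-vector-per-observation shape scikit-learn expects. The synthetic sine wave is a stand-in for a real whale recording:

    import numpy as np
    from scipy.signal import spectrogram

    fs = 2000                            # sampling rate, Hz
    t = np.arange(0, 2.0, 1 / fs)
    audio = np.sin(2 * np.pi * 220 * t)  # placeholder for a whale call clip

    # Time-frequency transform: a frequencies-by-time-bins array for this clip.
    freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=256)

    # Flatten into a single feature vector: one row of values per observation.
    feature_vector = np.log1p(Sxx).ravel()
    print(feature_vector.shape)
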
00:17:48 How much of that pre-processing is in the tool set?
00:17:53 So it depends for what types of data.
00:17:55 Typically for signals, there's nothing really specific in scikit-learn.
00:17:59 You would probably use scipy signal or any types of signal processing Python code that you find online.
00:18:06 I would say for other types of data, like text, in scikit-learn there is something called the feature extraction module.
00:18:14 And in the feature extraction module, you have something for text; text processing is really probably the biggest part of feature extraction.
00:18:22 And you have some stuff also for images, but it's quite limited.
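
As a sketch of the text side of that module, here is how raw strings become a numeric matrix that an estimator can learn from. The tiny spam/ham corpus and its labels are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["win money now", "meeting at noon", "cheap money win", "lunch at noon"]
    labels = [1, 0, 1, 0]                # 1 = spam, 0 = ham (toy data)

    vectorizer = CountVectorizer()       # maps words to count features
    X = vectorizer.fit_transform(texts)  # sparse document-term matrix

    clf = MultinomialNB().fit(X, labels)
    print(clf.predict(vectorizer.transform(["free money now"])))
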
00:18:29 We should probably introduce what scikit-learn is and get into the details of that.
00:18:33 But I have one more sort of example to let people know about that I think is pretty cool.
00:18:38 On show 16, I talked to Roy Rappaport from Netflix.
00:18:42 And Netflix has a tremendously large cloud computing infrastructure to power all of their – you know, basically their movie system, right?
00:18:51 And everything behind the scenes there.
00:18:53 And they have so many virtual machine instances and services running on them, and then different types of devices accessing services on those machines, that they said it's almost impossible to manually determine if there's, you know, some edge case where there's a problem.
00:19:10 And so they actually set up machine learning to monitor their infrastructure and then tell them if there's some kind of problem in real time.
00:19:18 Yeah.
00:19:19 So I think that's really a cool use of it as well.
00:19:33 This episode is brought to you by Hired.
00:19:36 Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
00:19:42 Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.
00:19:51 Typically, candidates receive five or more offers in just the first week, and there are no obligations ever.
00:19:58 Sounds pretty awesome, doesn't it?
00:20:00 Well, did I mention there's a signing bonus?
00:20:02 Everyone who accepts a job from Hired gets a $2,000 signing bonus.
00:20:06 And as Talk Python listeners, it gets way sweeter.
00:20:10 Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $4,000.
00:20:18 Opportunity's knocking.
00:20:20 Visit Hired.com slash Talk Python To Me and answer the call.
00:20:31 Yeah, that's a very cool thing to do.
00:20:36 And actually, many industries and many companies are looking for these types of systems that they call anomaly detection or failure prediction.
00:20:46 And it's becoming a big use case for machine learning, indeed.
00:20:52 The Netflix guys were actually using scikit-learn, not just some other machine learning system.
00:20:56 So let's get to the details of that.
00:20:59 What's scikit-learn?
00:20:59 Where did it come from?
00:21:00 So scikit-learn is probably the biggest machine learning library that you can find in the Python world.
00:21:07 So it dates back almost 10 years, to when David Cournapeau was doing a Google Summer of Code to kickstart the scikit-learn project.
00:21:16 And then for a few years, there was a French guy called Matthieu Brucher who took on the project.
00:21:24 But it was kind of a one-guy project for many years.
00:21:28 And in 2010, with colleagues at INRIA in France, we decided to basically try to start from this state of scikit-learn and make it bigger and really try to build a community around this.
00:21:46 So these people are Gaël Varoquaux and Fabian Pedregosa, and also somebody you may have heard of in the machine learning world, Olivier Grisel.
00:21:56 And so that was pretty much 2010, so five years ago.
00:22:03 And basically it took on pretty quickly.
00:22:05 After, I would say, a year of scikit-learn, we had more than 10 core developers way beyond the initial lab where it started.
00:22:17 That's really excellent.
00:22:18 Yeah, I mean, it's definitely an absolutely mainstream project that people are using in production these days.
00:22:24 So congratulations to everyone on that.
00:22:26 That's great.
00:22:26 Thank you.
00:22:27 Yeah.
00:22:28 And so the name scikit-learn comes from the fact that it's basically an extension to SciPy, right?
00:22:37 So there's NumPy for numerical processing, SciPy for scientific stuff, Matplotlib, IPython, SymPy for symbolic math, and Pandas, right?
00:22:48 And then there's these extensions.
00:22:50 Yes.
00:22:51 So basically the kind of division is that you cannot put everything in SciPy.
00:22:55 SciPy is already a big project.
00:22:57 And the idea of the SciKits were to build extensions around SciPy that are more domain-specific.
00:23:03 Also, it's kind of also easier to contribute to a smaller project.
00:23:08 So basically the barrier of entry for newcomers is much lower when you contribute to a Scikit than to SciPy, which is a fairly big project now.
00:23:16 Yeah, there's so much support for the whole SciPy system, right?
00:23:22 So it's much better to just build on that than try to duplicate anything and say NumPy or whatever.
00:23:27 Exactly.
00:23:28 I mean, there are a lot of efforts to see what NumPy 2.0 could be and what's going to be the future of it and how to extend it.
00:23:37 I mean, a lot of people are thinking of what's next because, I mean, NumPy is almost 10 years old, probably more than 10 years old now.
00:23:44 And, yeah, people are trying to see also how it can evolve.
00:23:49 Sure.
00:23:49 That makes a lot of sense.
00:23:51 So speaking of evolving and going forward, what are the plans with scikit-learn?
00:23:57 Where is it going?
00:23:57 So I would say in terms of features, I mean, scikit-learn is really in the consolidation stage.
00:24:04 scikit-learn is five years old.
00:24:06 The API is pretty much settled.
00:24:09 There are a few things here and there that we have to deal with now, due to early decisions in terms of API, that need to be fixed.
00:24:20 And I guess the big objective is to basically do scikit-learn 1.0, like the first stable, fully stable release in terms of API because that's something that we've been talking about between the core developers for, I mean, more than two years now, coming with this 1.0 version that stabilizes every part of the API.
00:24:44 Right.
00:24:44 One final major cleanup, if you can, and then stabilizing it, yeah?
00:24:49 Exactly.
00:24:49 Exactly.
00:24:49 And in terms of new features, I mean, there's always a lot of cool stuff around, and you see the number of pull requests that are coming in on top of scikit-learn.
00:25:01 It's pretty crazy.
00:25:02 And I would say a huge maintainer's effort and reviewing effort.
00:25:07 So features are coming in slowly now in scikit-learn, much more slowly than they used to, but I guess that's normal for a project that is getting big.
00:25:14 Yeah, it's definitely getting big.
00:25:16 It has 7,600 stars and 4,500 forks on GitHub, so that's pretty awesome.
00:25:22 Yeah.
00:25:23 It has 457 contributors.
00:25:24 Cool.
00:25:25 Yeah, I would say for every release we get, I mean, we try to release every six months.
00:25:30 And for every release we get a big number of contributors.
00:25:36 So maybe we could do like a survey of the modules of scikit-learn, just the important ones that come to mind.
00:25:44 What are the moving parts in there?
00:25:46 So I would say maybe the one I know the most, which is the module that I maintain the most, which is the linear model.
00:25:52 And recently the efforts on the linear models were to scale them up.
00:25:58 Basically, trying to learn these linear models in an out-of-core fashion to be able to scale to data that does not fit in RAM.
00:26:07 And that's part of the, I would say, part of the plan for this linear model module in scikit-learn.
00:26:15 That's cool.
00:26:15 So what kind of problems do you solve with that?
00:26:17 The types of problems where you have, like, a humongous number of samples and potentially a large number of features.
00:26:24 So there are not so many applications where you get that many samples, but that's typically text or log files.
00:26:31 These types of industry problems where you collect a lot of samples on a regular basis.
00:26:38 You have these examples also if you monitor an industrial system, like if you want to do what we discussed before about, like, predictive maintenance.
00:26:46 That's probably a use case where this can be useful.
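
A minimal sketch of that out-of-core idea: scikit-learn's SGDClassifier exposes partial_fit, so a linear model can be trained on chunks of data instead of loading everything into RAM. The random chunk generator here is a stand-in for streaming reads from a log file or database:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def stream_chunks(n_chunks=10, chunk_size=1000, n_features=20, seed=0):
        # Stand-in for reading successive batches from disk.
        rng = np.random.RandomState(seed)
        for _ in range(n_chunks):
            X = rng.randn(chunk_size, n_features)
            y = (X[:, 0] > 0).astype(int)
            yield X, y

    clf = SGDClassifier()
    for X, y in stream_chunks():
        clf.partial_fit(X, y, classes=[0, 1])  # classes required on first call

    X_new, y_new = next(stream_chunks(n_chunks=1, seed=1))
    print(clf.score(X_new, y_new))
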
00:26:50 Probably the other, like, module that also attracts a lot of effort these days is the Ensemble module, and especially the tree module.
00:27:00 So for models like Random Forest or Gradient Boosting, which are very popular models that have been helping people to win Kaggle competitions for the last few years.
00:27:13 Yeah, I've heard a lot about these forests and so on.
00:27:17 Can you talk a little bit about what that is?
00:27:19 So a random forest basically is a set of decision trees that you pool together to get a prediction that is more accurate.
00:27:32 More accurate because it has less variance, in technical terms.
00:27:36 And the way it works is you basically try to build decision trees from a subset of the data, a subset of samples, a subset of features, in a clever way.
00:27:46 And then you pool all these trees into one big predictive model.
00:27:50 And, for example, if you do binary classification and you train a thousand trees, you ask the thousand trees about a new observation.
00:27:59 What's the label?
00:27:59 Is it positive or negative?
00:28:01 And then you basically count the number of trees that are saying positive.
00:28:05 And if you have more trees saying positive, then you predict positive.
00:28:08 That's kind of the basic idea of random forest.
00:28:11 And it turns out to be super powerful.
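
Here is a hedged sketch of that voting idea with RandomForestClassifier, on synthetic data just to have something to fit:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 1000 trees, each grown on a bootstrap sample with randomized feature
    # splits; the forest's prediction is the vote across all trees.
    forest = RandomForestClassifier(n_estimators=1000, random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))
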
00:28:13 That's really cool.
00:28:14 Well, it seems to me like it would bring in kind of different perspectives or taking different components or parts of a problem into account.
00:28:23 So some of the trees look at some features and maybe the other trees look at other features.
00:28:28 And then they can combine in some important way.
00:28:31 Exactly.
00:28:32 Yeah.
00:28:33 Another one that I see coming up is the SVM module.
00:28:36 What's that one do?
00:28:38 So SVM is a very popular machine learning approach that was, I mean, very big in the 90s and 10 years ago, and still gets some traction.
00:28:52 And basically, the idea of a support vector machine, which is what SVM stands for, is to be able to use kernels on the data and basically solve linear problems in an abstract space where you project your raw data.
00:29:10 Let me try to give an example.
00:29:11 If you take a graph, or if you take a string, that's not naturally something that can be represented by a vector.
00:29:18 And when you do an SVM, you have a tool, which is a kernel that allows you to compare these observations, like a kernel between strings, a kernel between graphs.
00:29:27 And once you define this kernel, and this kernel needs to satisfy some properties that I'm going to skip, then you can use this SVM to do classification but also regression.
00:29:37 This is what you have in the SVM module of scikit-learn, which is basically a very clever and efficient binding of an underlying library called LibSVM.
00:29:47 Okay, excellent.
00:29:48 And is that used more in the unsupervised world?
00:29:51 It's completely supervised.
00:29:52 When you do SVM, it's classification or regression that's supervised.
00:29:55 There's one use case of SVM in an unsupervised setting, which is what we call the one-class SVM.
00:30:02 So you just have one class, which basically means that you don't have labels, you just have data, and you're trying to see which data points are the least like the others.
00:30:11 That's more like an anomaly detection problem, or we call it also novelty detection or outlier detection.
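
Both flavors in a short, hedged sketch: a kernelized SVC for supervised classification and a OneClassSVM for novelty/outlier detection (the coordinates are toy data):

    import numpy as np
    from sklearn.svm import SVC, OneClassSVM

    # Supervised: classify two clusters with an RBF kernel.
    X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]])
    y = [0, 0, 0, 1, 1, 1]
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict([[1.5, 1.5], [9.5, 9.5]]))

    # Unsupervised: flag the observations least like the rest.
    novelty = OneClassSVM(nu=0.1).fit(X)
    print(novelty.predict([[1, 1], [50, 50]]))  # +1 = inlier, -1 = outlier
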
00:30:17 Maybe we could talk a little bit about some of the algorithms.
00:30:21 As a non-expert in sort of the data science machine learning field, I go in there and I see all these cool algorithms and graphs, but I don't really know what would I do with that.
00:30:31 On the site, it says there's all these algorithms it supports.
00:30:34 So, for example, it supports dimensionality reduction.
00:30:38 Like, what kind of problems would I bring that in for?
00:30:41 I guess it's hard to summarize.
00:30:44 There are hundreds and hundreds of pages in the scikit-learn documentation, so I'm trying to give you a big picture without too much technical detail: when these algorithms are useful, what they're useful for, what the hypotheses are, and what kind of output you can hope to get.
00:31:02 It's one of the strengths of the Scikit-Learn documentation, by the way.
00:31:05 And so to answer your question about dimensionality reduction, I would say the 101 way of doing it is principal component analysis, where you're trying to extract the subspace that captures the most variance in the data.
00:31:22 And that can be used to do visualization of the data in low dimension.
00:31:26 If you do a PCA in two or three dimensions, then you can look at your observation as a scatterplot in two or three D.
00:31:33 And that's basically visualization.
00:31:35 But you can also use this to reduce the size of your data set, maybe without losing too much predictive power.
00:31:43 So you take your big data set, you run a PCA, and then you reduce the dimension.
00:31:48 And then suddenly you have a learning problem, which is on smaller data, because you basically reduce the number of features.
00:31:55 Those are kind of the standard approaches: visualization, or reducing the data set to get more efficient learning in terms of computing time, but sometimes also better prediction.
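
A minimal PCA sketch along those lines: project high-dimensional observations down to two components, ready for a scatterplot or a smaller learning problem (random data as a stand-in):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)       # 200 observations, 50 features

    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)  # now 200 x 2, ready to plot
    print(X_2d.shape, pca.explained_variance_ratio_)
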
00:32:08 Okay, that makes a lot of sense.
00:32:10 That's really cool.
00:32:10 So like if we went back to my house example, maybe I was feeding like the length of the driveway and the number of trees in the yard.
00:32:18 And it might turn out that neither of those have any effect on house prices.
00:32:22 So we could reduce it to a smaller problem by having this whole PCA go, look, those don't matter.
00:32:27 Throw that part out.
00:32:28 It's really about the number of bathrooms and the square footage or something.
00:32:32 Well, yes and no.
00:32:36 That's kind of the idea.
00:32:36 Okay, but in this example of Boston, the prediction of houses, you want to reduce the dimension in an informed way.
00:32:44 Because the number of trees in the yard can be informative for something, but maybe not to predict the price of the apartment or price of the house.
00:32:52 So when you do dimensionality reduction in the context of supervised learning, that can also be what you call feature selection, basically selecting the predictive features, which ultimately leads to a reduced data set because you remove features.
00:33:05 But that would be in a supervised context.
00:33:07 When you do PCA, you're really in an unsupervised way.
00:33:10 You don't know what are the labels.
00:33:11 You just want to figure out what's the variance.
00:33:14 Where is the variance in the data coming from?
00:33:16 On which axis and which direction should I look to see the structure?
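
For the supervised variant he mentions, feature selection, scikit-learn scores each feature against the labels; here is a hedged sketch with SelectKBest on synthetic regression data where only some features actually matter:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression

    # 10 features, only 5 of which actually drive the target.
    X, y = make_regression(n_samples=200, n_features=10,
                           n_informative=5, random_state=0)

    selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
    X_reduced = selector.transform(X)  # keep the 5 best-scoring columns
    print(selector.get_support())      # boolean mask of kept features
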
00:33:20 Another thing that is in there are ensemble methods for combining multiple supervised models.
00:33:29 What's the story there?
00:33:30 That sounds cool.
00:33:30 So random forest is an example of ensemble methods.
00:33:36 When you have an ensemble, it basically means that you're taking a lot of classifiers or a lot of regressors and you combine them into a bag of predictors, a bag of models, or an ensemble of models.
00:33:50 And then you make them collaborate in order to build a better prediction.
00:33:53 And random forest is basically an ensemble of trees.
00:33:57 But you can also do an ensemble of neural networks.
00:34:02 You can do an ensemble of whatever models you want to pool.
00:34:07 And that turns out to be in practice often a very efficient approach.
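
The "bag of models" idea in code: a hedged sketch using VotingClassifier to make three different estimators collaborate by majority vote:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, random_state=0)

    ensemble = VotingClassifier(estimators=[
        ("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier()),
        ("svm", SVC()),
    ], voting="hard")  # hard voting = majority vote
    ensemble.fit(X, y)
    print(ensemble.predict(X[:5]))
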
00:34:11 Yeah, like we were saying, the more perspectives, different models, it seems like that's a really good idea.
00:34:18 So you mentioned neural networks.
00:34:20 Yes.
00:34:21 So Scikit-Learn has support for neural networks as well?
00:34:23 Well, you have a multilayer perceptron, which is like the basic neural network.
00:34:29 I mean, these days in neural network, people talk about deep learning.
00:34:32 I've heard about it.
00:34:33 That's about the extent of it.
00:34:34 What's deep learning?
00:34:35 This episode is brought to you by Codeship.
00:34:53 Codeship has launched Organizations. Create teams, set permissions for specific team members,
00:34:58 and improve collaboration in your continuous delivery workflow.
00:35:01 Maintain centralized control of your organization's projects and teams with Codeship's new organizations plan.
00:35:07 And as Talk Python listeners, you can save 20% off any premium plan for the next three months.
00:35:12 Just use the code TALKPYTHON, all caps, no spaces.
00:35:17 Check them out at codeship.com and tell them thanks for supporting the show on Twitter, where they're @codeship.
00:35:28 So deep learning is basically neural networks 2.0, where you take neural networks and you stack more layers.
00:35:35 So kind of the story there is that for many years, people were kind of stuck with networks of two or three layers.
00:35:44 So not very deep.
00:35:46 And part of the issue is that it was really hard to train something that had more layers.
00:35:51 In terms of research, there were two things that came up. The first is that we got access to more data,
00:35:57 which means that we can train bigger and more complex models.
00:36:01 But there were also some breakthroughs in learning these models that allowed people to avoid overfitting,
00:36:09 being able to learn these big models because you have more data,
00:36:14 but also because of clever ways to prevent overfitting.
00:36:17 And that basically led to deep learning these days.
00:36:19 Oh, very interesting.
00:36:20 Yeah, that's been one of the problems with neural networks, right?
00:36:23 Is that if you teach it too much, then it only knows, you know, just the things you've taught it or something, right?
00:36:27 Exactly.
00:36:28 It basically learns by heart what you provide as training observations and ends up being very bad when you provide new observations.
00:36:38 Want to talk a little bit about the datasets that come built in there?
00:36:41 Uh-huh.
00:36:42 We've talked a little bit about the Boston one, and that's the Boston house prices for regression.
00:36:47 What I hear coming up a lot is one called Iris.
00:36:50 Is that like your eye itself?
00:36:54 So Iris is the dataset that we use to illustrate all the classification problems.
00:37:00 It's really a very common dataset that turned out to have a good license, so we could ship it with scikit-learn,
00:37:07 and basically we built most of the examples using this Iris dataset, which is also very much used in textbooks of machine learning.
00:37:15 So that was kind of the default choice, and it speaks to people because you understand the problem that you're trying to solve,
00:37:23 and it's rich enough and not too big, so we can make all these examples run super fast and build nice documentation.
00:37:30 That's very cool.
00:37:30 What is the dataset?
00:37:31 What exactly is it about?
00:37:33 So with the Iris dataset, you're trying to predict the type of plant using, for example, the sepal length and the sepal width.
00:37:44 So you have a number of features that describe the plant, and you're trying to predict which one among three it is.
00:37:51 So it's a three-label, three-class classification problem.
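
The Iris problem in a few lines, roughly as it appears throughout the scikit-learn examples (a nearest-neighbors classifier here, though any estimator would do):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()  # 150 flowers, 4 measurements each, 3 species
    clf = KNeighborsClassifier().fit(iris.data, iris.target)

    # Predict the species from sepal/petal length and width (cm).
    print(iris.target_names[clf.predict([[5.1, 3.5, 1.4, 0.2]])])
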
00:37:55 Yeah, that's cool.
00:37:56 Enough data to not just be a linear model or something, a single variable model, but not too much?
00:38:02 Exactly.
00:38:04 It's not completely linear, but not too hard at the same time.
00:38:10 Right.
00:38:10 If you get 20 variables, that's probably too much to deal with.
00:38:13 Then one is on diabetes.
00:38:14 What about diabetes does that dataset represent?
00:38:17 Do you know?
00:38:18 I'm actually not really sure what's the – no, it's a regression problem.
00:38:24 It's used a lot in the linear model module, especially for the sparse regression models, because, I mean, these sparse regression models are trying to extract the predictive features.
00:38:34 I guess in the diabetes dataset, you try to find something related to diabetes, and you're interested in finding the most predictive features.
00:38:41 What are the best features?
00:38:43 And then that's part of the reason I think we're using it.
00:38:46 And then another one is digits, which is kind of meant to model images, right?
00:38:51 One of the early, I would say, breakthroughs of machine learning was this work in the 90s where Yann LeCun and other people were trying to build a system that could predict what digit was present on the screen or in the image.
00:39:10 So it's a very old machine learning problem where you start from a picture or an image of a digit that is handwritten, and you're trying to predict what it is from zero to nine.
00:39:20 And it's an example that people can easily grasp in order to understand what machine learning is.
00:39:26 You give me an image, and I'll predict something between zero and nine.
00:39:30 And historically, when we did the first version of the scikit-learn website, we had something like seven or eight lines of Python code that were running classification of digits.
00:39:41 So that was kind of the motivating example where we said, okay, scikit-learn is machine learning made easy.
00:39:46 And here it is, an example.
00:39:48 It's ten lines of code classifying digits.
00:39:51 And that was basically the punchline.
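
That front-page demo, reconstructed as a hedged sketch (the exact historical lines may have differed, but this is the spirit of it):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    digits = load_digits()  # 8x8 grayscale images of handwritten digits, 0-9
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, random_state=0)

    clf = SVC(gamma=0.001).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on unseen digits
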
00:39:53 Solving this old hard problem in a nice, simple way, right?
00:39:57 Yeah.
00:39:57 You know, lately, there's been a lot of talk about artificial intelligence, and especially from people like Elon Musk and Stephen Hawking,
00:40:08 saying that maybe we should be concerned about artificial intelligence and things like that.
00:40:14 So one of my first questions around this area is, is machine learning the same thing as artificial intelligence?
00:40:20 Depends who you ask.
00:40:23 Okay.
00:40:24 Sure.
00:40:25 No, I mean, AI was basically the early name of trying to teach a computer to do something.
00:40:34 I mean, it dates back from the 60s and 70s, where basically in the US, for example, at MIT, you had labs that are basically called AI labs.
00:40:42 And machine learning is, I would say, kind of a more restricted set of problems compared to AI.
00:40:53 With AI, say you want to work with text or linguistics: you want to build a system that understands language.
00:41:02 That would be an AI problem.
00:41:05 But machine learning is kind of a saying, okay, I've got a loss function.
00:41:08 I want to optimize my criteria.
00:41:10 I've got something that I want to train my system on.
00:41:15 And in a sense, you teach a system to learn.
00:41:17 And so you create some kind of intelligence.
00:41:21 But it's, I would say, a simpler thing to claim than intelligence, which is kind of a hard concept.
00:41:28 That's maybe part of my personal answer to this.
00:41:31 Yeah, no, it's a great answer.
00:41:33 Just from my limited exposure to it, it seems like machine learning is more about classification and prediction,
00:41:39 whereas the AI concept has a strong autonomous component that is just completely lacking in machine learning.
00:41:47 Yeah, I guess I would say, I would explain it simply like this, exactly.
00:41:52 What things have you seen people using scikit-learn for that surprised you?
00:41:58 Or like, wow, you guys are doing that?
00:42:00 That's amazing.
00:42:03 So on scikit-learn, we have this testimonial page where we typically ask companies or institutes that are using scikit-learn to write a couple of sentences to say, okay, what they're using scikit-learn for and why they think it's great.
00:42:23 I'm trying to find this.
00:42:25 And I remember there was this, I think, a dating website.
00:42:30 Saying that they were using scikit-learn to optimize dates between people.
00:42:36 That was great.
00:42:38 That was like a funny one.
00:42:40 That is funny.
00:42:41 So there may be people out there who are married and maybe even babies who are born because of scikit-learn.
00:42:46 Yeah, that would be great.
00:42:49 I'm going to add this to my resume.
00:42:52 It's awesome.
00:42:53 Matchmaker.
00:42:54 So if people want to get started with scikit-learn, they're out there listening, they're like, this is awesome.
00:43:00 Where do I start?
00:43:01 What would you recommend for sort of getting into this whole world of machine learning and getting started with scikit-learn in particular?