WEBVTT
00:00:00.001 --> 00:00:06.840
Machine learning allows computers to find hidden insights without being explicitly programmed where to look or what to look for.
00:00:06.840 --> 00:00:14.060
Thanks to the work of some dedicated developers, Python has one of the best machine learning platforms out there called Scikit-Learn.
00:00:14.060 --> 00:00:19.180
In this episode, Alexandre Gramfort is here to tell us about scikit-learn and machine learning.
00:00:19.180 --> 00:00:25.460
This is Talk Python to Me, number 31, recorded Friday, September 25, 2015.
00:00:25.460 --> 00:00:37.200
I'm a developer in many senses of the word, because I make these applications, but I also use these verbs to make this music.
00:00:37.200 --> 00:00:41.740
I construct it line by line, just like when I'm coding another software design.
00:00:41.740 --> 00:00:47.960
In both cases, it's about design patterns. Anyone can get the job done, it's the execution that matters.
00:00:47.960 --> 00:00:53.500
I have many interests, sometimes they conflict, but creativity can usually be a benefit.
00:00:53.760 --> 00:01:00.700
Welcome to Talk Python to Me, a weekly podcast on Python, the language, the libraries, the ecosystem, and the personalities.
00:01:00.700 --> 00:01:04.820
This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy.
00:01:04.820 --> 00:01:11.280
Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via at Talk Python.
00:01:11.280 --> 00:01:15.840
This episode is brought to you by Hired and Codeship.
00:01:15.840 --> 00:01:21.180
Thank them for supporting the show on Twitter via at Hired underscore HQ and at Codeship.
00:01:22.640 --> 00:01:24.880
Hey, everyone. Thanks for listening today.
00:01:24.880 --> 00:01:27.980
Let me introduce Alexander so we can get right to the interview.
00:01:27.980 --> 00:01:38.560
Alexandre Gramfort is currently an assistant professor at Télécom ParisTech and a scientific consultant for the CEA Neurospin Brain Imaging Center.
00:01:38.560 --> 00:01:49.600
His work is on statistical machine learning, signal and image processing, optimization, scientific computing, and software engineering, with a primary focus on functional brain imaging.
00:01:50.220 --> 00:01:56.360
Before joining Télécom ParisTech, he worked at the Martinos Center for Biomedical Imaging at Harvard in Boston.
00:01:56.360 --> 00:02:02.280
He's also an active member of the Center for Data Science at Université Paris-Saclay.
00:02:02.280 --> 00:02:04.760
Alexander, welcome to the show.
00:02:04.760 --> 00:02:05.960
Thank you. Hi.
00:02:06.660 --> 00:02:11.180
Hi. I'm really excited to talk about machine learning and scikit-learn with you today.
00:02:11.180 --> 00:02:17.820
It's something I know almost nothing about, so it's going to be a great chance for me to learn along with everyone else who's listening in.
00:02:17.820 --> 00:02:21.140
So hopefully I'll be able to give relevant answers.
00:02:21.140 --> 00:02:22.760
Yeah, I'm sure that you will.
00:02:23.660 --> 00:02:27.820
All right, so we're going to talk all about machine learning, but before we get there, let's hear your story.
00:02:27.820 --> 00:02:29.060
How did you get into programming in Python?
00:02:29.060 --> 00:02:35.660
Well, I've done a lot of scientific computing and scientific programming over the last maybe 10 to 15 years.
00:02:35.660 --> 00:02:40.660
I started my undergrad in computer science, doing a lot of signal and image processing.
00:02:40.660 --> 00:02:45.700
Well, like most people with that background, I've done a lot of MATLAB in my previous life.
00:02:46.000 --> 00:02:49.300
Yes, I've done a lot of MATLAB too. I know about the .m files.
00:02:49.300 --> 00:02:56.560
And I switched teams for my postdoc.
00:02:56.560 --> 00:02:59.960
Basically, I did a PhD in computer science applied to brain imaging.
00:02:59.960 --> 00:03:05.300
And I switched to a different team where basically I was surrounded by people working with Python.
00:03:05.300 --> 00:03:08.120
And basically, I got into it and switched.
00:03:08.120 --> 00:03:12.500
In one week, MATLAB was gone from my life.
00:03:14.040 --> 00:03:16.060
But it's been maybe five years now.
00:03:16.060 --> 00:03:20.260
And yeah, that's kind of the historical part.
00:03:20.260 --> 00:03:22.220
Do you miss MATLAB?
00:03:22.220 --> 00:03:23.600
Not really.
00:03:23.600 --> 00:03:25.760
Me either.
00:03:25.760 --> 00:03:29.720
There are some cool things about it, but...
00:03:29.720 --> 00:03:34.540
Yeah, I still have students who insist on working with me in MATLAB.
00:03:34.540 --> 00:03:38.760
So I still have to do stuff in MATLAB for supervision.
00:03:38.760 --> 00:03:42.440
But not really when I have the choice.
00:03:43.080 --> 00:03:44.200
Yeah, if you get a choice, of course.
00:03:44.200 --> 00:03:53.640
I think one of the things that's really a drawback about specialized systems like MATLAB is that it's very hard to build finished, production-ready products.
00:03:53.640 --> 00:03:55.220
You can do research.
00:03:55.220 --> 00:03:56.040
You can learn.
00:03:56.040 --> 00:03:57.220
You can write papers.
00:03:57.220 --> 00:03:59.120
You can even test algorithms.
00:03:59.120 --> 00:04:06.940
But if you want to get something that's running on data centers on its own, probably MATLAB is, you know, you could make it work, but it's not generally the right choice.
00:04:06.940 --> 00:04:07.740
Definitely.
00:04:07.740 --> 00:04:08.220
Yeah.
00:04:08.220 --> 00:04:08.920
Yeah.
00:04:09.040 --> 00:04:21.260
And so things like, you know, I think that explains a lot of the growth of Python in this whole data science, scientific computing world, along with great toolkits like scikit-learn, right?
00:04:21.260 --> 00:04:22.080
Yes.
00:04:22.080 --> 00:04:26.740
I mean, definitely the way scikit-learn is now used.
00:04:27.740 --> 00:04:35.020
The fact that the Python stack allows you to make this production type of code is a clear win for everyone.
00:04:36.300 --> 00:04:46.240
So before we get into the details of scikit-learn and how you work with it and all the features it has, let's just, you know, in a really broad way, talk about machine learning.
00:04:46.240 --> 00:04:47.420
Like, what is machine learning?
00:04:47.420 --> 00:04:54.400
I would say the simple example of machine learning is trying to predict something from previous data.
00:04:54.400 --> 00:04:58.400
So what people would call supervised learning.
00:04:58.400 --> 00:05:07.380
And there are plenty of examples of this in everyday life, like your mailbox that predicts for you whether an email is spam or ham.
00:05:07.380 --> 00:05:17.000
And that's basically a system that learns from previous data how to make an informed choice and give you a prediction.
00:05:17.000 --> 00:05:20.700
And that's basically the most simple way of seeing machine learning.
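NOTE
A minimal sketch of the supervised spam/ham setup just described, with made-up emails and labels; the particular pipeline (CountVectorizer plus MultinomialNB) is one common choice, not something prescribed in the episode.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.pipeline import make_pipeline
  # Hypothetical training data: observations (emails) paired with labels.
  emails = ["win a free prize now", "cheap meds, click here",
            "lunch meeting tomorrow at noon", "quarterly report attached"]
  labels = ["spam", "spam", "ham", "ham"]
  # Turn raw text into count vectors, then fit a Naive Bayes classifier.
  model = make_pipeline(CountVectorizer(), MultinomialNB())
  model.fit(emails, labels)
  print(model.predict(["free prize inside"]))  # -> ['spam']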
00:05:21.540 --> 00:05:30.960
And basically you see machine learning problems framed this way in all contexts, from industry to academic science.
00:05:30.960 --> 00:05:33.840
And, I mean, there are many examples.
00:05:33.840 --> 00:05:43.940
And basically, the other big class of problems that you see in machine learning is not really these prediction problems.
00:05:43.940 --> 00:05:57.120
You're trying to make sense of raw data where you don't have labels like spam or ham; you just have data and you want to figure out what the structure is, what types of insight you can get from it.
00:05:57.120 --> 00:06:02.500
And that's, I would say, the other big class of problem that machine learning addresses.
00:06:02.500 --> 00:06:05.860
Yeah, so there's that general classification.
00:06:06.380 --> 00:06:21.940
I guess with the first category you were talking about, like spam filters, other things that maybe fall into that realm would be credit card fraud, maybe trading stocks, these kinds of binary do-it, don't-do-it decisions based on examples.
00:06:21.940 --> 00:06:26.660
What's that called, is it structured learning, or what's the term?
00:06:26.660 --> 00:06:30.400
The common name is supervised learning.
00:06:30.400 --> 00:06:31.780
Supervised learning, that's right.
00:06:31.780 --> 00:06:38.600
Yeah, so basically you have pairs of training observations that are the data and their corresponding labels.
00:06:38.600 --> 00:06:41.440
So text and the label would be spam or ham.
00:06:41.440 --> 00:06:45.460
So this is basically binary classification.
00:06:45.460 --> 00:06:49.740
The other types of machine learning problems you have is, for example, regression.
00:06:49.740 --> 00:06:54.500
You want to predict the price of a house and you know the number of square feet.
00:06:54.500 --> 00:06:57.280
You know the number of rooms.
00:06:57.280 --> 00:06:59.100
You know exactly what the location is.
00:06:59.520 --> 00:07:03.840
And so you have a bunch of variables that describe your house or apartment.
00:07:03.840 --> 00:07:05.820
And from this you want to predict the price.
00:07:05.820 --> 00:07:10.380
And that's another example, where now the price is a continuous variable.
00:07:10.380 --> 00:07:11.400
It's not binary.
00:07:11.400 --> 00:07:13.800
This is what people call regression.
00:07:13.800 --> 00:07:17.000
And this is another big class of supervised learning problem.
00:07:17.000 --> 00:07:17.620
Right.
00:07:17.700 --> 00:07:34.220
So you might know through the real estate data, all the houses in the neighborhood that have sold in the last two years, the ones that have sold last month, all their variables and dimensions, if you will, like number of bathrooms, number of bedrooms, square feet, or square meters.
00:07:34.220 --> 00:07:39.720
You could feed it into the system to train it.
00:07:40.180 --> 00:07:44.840
And then you could say, well, now I have a house with two bathrooms and three bedrooms.
00:07:44.840 --> 00:07:46.460
And right here, what's it worth?
00:07:46.460 --> 00:07:46.800
Right?
00:07:46.800 --> 00:07:47.520
Exactly.
00:07:47.520 --> 00:07:55.660
That's basically a typical example and also a typical data set that we use in scikit-learn that basically illustrates the concept of regression with a similar problem.
00:07:55.660 --> 00:07:56.540
Right.
00:07:56.660 --> 00:08:01.900
We'll talk more about it, but scikit-learn comes with some pre-built data sets.
00:08:01.900 --> 00:08:03.940
And one of them is the Boston housing market, right?
00:08:03.940 --> 00:08:04.660
Exactly.
00:08:04.660 --> 00:08:05.360
That's the one.
00:08:05.360 --> 00:08:05.840
Yeah.
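NOTE
A rough sketch of the regression workflow being described. The Boston housing dataset mentioned here shipped with scikit-learn at the time but has since been removed, so this sketch swaps in the bundled California housing data; the steps are the same.
  from sklearn.datasets import fetch_california_housing
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import train_test_split
  # Houses described by numeric variables; the target is a continuous price.
  data = fetch_california_housing()  # downloads the data on first use
  X_train, X_test, y_train, y_test = train_test_split(
      data.data, data.target, random_state=0)
  reg = LinearRegression().fit(X_train, y_train)
  print(reg.score(X_test, y_test))  # R^2 on held-out houses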
00:08:05.840 --> 00:08:08.840
How much data do you have to give it?
00:08:08.840 --> 00:08:14.920
Like, suppose I want to try to estimate the value of my house, which, you know, at least in the United States, we have this service called Zillow.
00:08:14.920 --> 00:08:16.880
So they're doing way more.
00:08:16.880 --> 00:08:18.640
I'm sure they're running something like this, actually.
00:08:19.100 --> 00:08:25.160
But suppose I wanted to take it upon myself to, like, grab the real estate data and try to estimate the value of my home.
00:08:25.160 --> 00:08:30.580
How many houses would I have to give it before it would start to be reasonable?
00:08:30.580 --> 00:08:32.660
Well, that's a tough question.
00:08:32.660 --> 00:08:35.380
And I guess there's no simple answer.
00:08:35.380 --> 00:08:44.060
I mean, there's the rule that you can see on the scikit-learn cheat sheet that says if you have fewer than 50 observations, then go get more data.
00:08:45.600 --> 00:08:48.460
But I guess it's also a simplified answer.
00:08:48.460 --> 00:08:50.640
It depends on the difficulty of the task.
00:08:50.640 --> 00:08:55.300
So at the end of the day, often for these types of problems, you want to know something.
00:08:55.300 --> 00:08:58.500
And this can be easy or hard.
00:08:58.500 --> 00:09:00.000
You cannot really know before trying.
00:09:00.000 --> 00:09:07.440
And typically for regression you'd say, okay, if I predict within plus or minus 10%, that's maybe good enough for my application.
00:09:07.440 --> 00:09:08.980
And maybe you need less data.
00:09:08.980 --> 00:09:12.000
If you want to be super accurate, you need more data.
00:09:12.000 --> 00:09:17.080
But the question of how much data is really hard to answer without actually trying it on real data.
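NOTE
One practical way to "try with actual data", as suggested here, is a learning curve: score the model on growing training sizes and see whether more data still helps. A sketch on synthetic data (the Ridge model and the sizes are arbitrary illustrative choices):
  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import learning_curve
  # Synthetic regression data stands in for real housing records.
  X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
  # If the validation score is still climbing at the largest size,
  # collecting more data would probably pay off.
  sizes, train_scores, valid_scores = learning_curve(
      Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
  print(sizes, valid_scores.mean(axis=1))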
00:09:17.080 --> 00:09:18.380
Yeah, I can imagine.
00:09:18.380 --> 00:09:26.780
It probably also depends on the variability of the data, the accuracy of the data, how many variables you're trying to give it.
00:09:26.780 --> 00:09:40.080
So if you just tried to base it on the square footage or square meters of your house, that one variable, maybe it's easier to predict than, you know, 20 components that describe your house, right?
00:09:40.680 --> 00:09:46.760
The thing is, the more variables you have, the more you can hope to get.
00:09:46.760 --> 00:09:53.520
Now it's not as simple as this, because if variables are not informative, then they're basically adding noise to your problem.
00:09:53.520 --> 00:10:02.780
So you want as many variables as possible describing your data in order to capture the weak signals.
00:10:02.780 --> 00:10:06.560
But sometimes variables are just not relevant or predictive.
00:10:06.560 --> 00:10:10.240
And so you want to remove them from the prediction problem.
00:10:10.240 --> 00:10:11.920
Okay, that makes sense.
00:10:11.920 --> 00:10:24.700
So I was looking into what are some of the novel uses of machine learning in order to sort of have some things to ask you about and just see what's out there.
00:10:25.580 --> 00:10:27.620
What are ones that come to mind for you?
00:10:27.620 --> 00:10:29.340
And then I'll give you some that I found on my list.
00:10:29.340 --> 00:10:36.860
Maybe I'm biased because I'm really into using machine learning for scientific data and academic problems.
00:10:36.860 --> 00:10:47.080
But I guess the things that are really academic breakthroughs reaching everybody are related to computer vision and NLP these days, and probably also speech.
00:10:47.080 --> 00:10:58.460
So these types of systems that try to predict something from speech signals or from images, like describing to you what the contents are, what types of objects you can find.
00:10:58.460 --> 00:11:01.600
And for NLP you have like machine translation.
00:11:01.600 --> 00:11:07.680
We did a show with OpenCV and the whole Python angle there.
00:11:07.680 --> 00:11:11.780
There was a lot of really cool stuff on medical imaging going on there.
00:11:11.780 --> 00:11:13.780
Does that have to do with scikit-learn as well?
00:11:14.420 --> 00:11:29.420
Well, you have people doing medical imaging using scikit-learn, basically extracting features from MR images, magnetic resonance images, or CT scanners, or also like EEG brain signals.
00:11:29.420 --> 00:11:39.580
They're using scikit-learn as the prediction tool, deriving features from their raw data.
00:11:40.440 --> 00:11:45.280
And that reaches, of course, clinical applications in some contexts.
00:11:45.280 --> 00:11:57.100
Maybe automatic systems that say, hey, this looks like it could be cancer or it could be some kind of problem, bring the attention of an expert who could actually look at it and say, yes, no, something like this?
00:11:57.100 --> 00:11:57.960
Yeah, exactly.
00:11:58.060 --> 00:12:19.060
It's like helping diagnosis: trying to help the clinician isolate something that looks weird or suspicious in the data, to focus the time of the physician and the clinician onto this particular part of the data, to see what's going on and whether the patient is suffering from something.
00:12:19.640 --> 00:12:19.960
Right.
00:12:19.960 --> 00:12:20.640
That's really cool.
00:12:20.640 --> 00:12:36.240
I mean, maybe you could take previous biopsies and invasive things that have happened to other people and their pictures and their outcomes and say, look, you have basically the same features and we did this test and the machine believes that you actually don't have a problem.
00:12:36.240 --> 00:12:37.800
So, you know, probably don't worry about it.
00:12:37.800 --> 00:12:39.400
We'll just watch this or something like that, right?
00:12:39.400 --> 00:12:45.860
Yeah, I mean, on this line of thought, there was recently a Kaggle competition using retina pictures.
00:12:45.860 --> 00:12:50.880
So, like people suffering from diabetes usually have problems with retinas.
00:12:50.880 --> 00:13:05.000
And so, you can take pictures of retinas from hundreds of people and see if you can build a system that predicts something about the patient and the state of the disease from these images.
00:13:05.000 --> 00:13:08.040
And this is typically done by pooling data from multiple people.
00:13:08.040 --> 00:13:09.140
That's really cool.
00:13:09.140 --> 00:13:15.140
I've heard of these Kaggle competitions or challenges before in various places.
00:13:15.140 --> 00:13:15.620
What is that?
00:13:15.620 --> 00:13:33.540
So, it's basically a website that allows you to organize these types of supervised learning problems, where a company or an organization, an NGO, whatever, has data and is trying to build a system, a predictive system.
00:13:33.980 --> 00:13:45.100
And they ask Kaggle to set this up, which basically means for Kaggle putting the training data set online and giving this to data scientists.
00:13:45.100 --> 00:13:52.020
And they basically then spend time building a predictive system that is evaluated on new data on which to get a score.
00:13:52.360 --> 00:14:01.820
And that allows you to see how the system works on new data and to rank basically the data scientists that are playing with the system.
00:14:01.820 --> 00:14:07.100
It's kind of an open innovation approach in data science.
00:14:07.100 --> 00:14:08.680
That's really cool.
00:14:08.920 --> 00:14:10.600
So, that's just Kaggle.com.
00:14:10.600 --> 00:14:11.260
Yes.
00:14:11.260 --> 00:14:13.160
K-A-G-G-L-E.com.
00:14:13.160 --> 00:14:13.640
Exactly.
00:14:13.640 --> 00:14:14.380
Yeah.
00:14:14.380 --> 00:14:14.780
Very nice.
00:14:14.780 --> 00:14:34.200
Some of the other ones that I sort of ran across while I was looking around that were pretty cool was one is some guys at Cornell University built machine learning algorithms to listen for the sound of whales in the ocean and use them in real time to help ships avoid running into whales.
00:14:34.760 --> 00:14:35.760
That's pretty awesome, right?
00:14:35.760 --> 00:14:36.040
Yeah.
00:14:36.040 --> 00:14:36.600
Yeah.
00:14:36.600 --> 00:14:43.100
There was a Kaggle competition on these whale sounds maybe a couple of years ago.
00:14:43.100 --> 00:14:49.980
And it was – I mean, not many data scientists have experience, like, listening to whales.
00:14:49.980 --> 00:14:53.460
So, kind of nobody really knows what this type of data is like.
00:14:53.920 --> 00:15:01.860
And I remember this presentation from the winner basically saying how to win a Kaggle competition without knowing anything about the data.
00:15:01.860 --> 00:15:03.320
It's kind of a provocative talk.
00:15:03.320 --> 00:15:04.860
That is cool.
00:15:04.860 --> 00:15:12.820
But showing how you can basically build a predictive system by just looking at the data and trying to make sense out of it without really being an expert in the field.
00:15:13.220 --> 00:15:13.400
Yeah.
00:15:13.400 --> 00:15:17.160
That's probably a really valuable skill as a data scientist to have, right?
00:15:17.160 --> 00:15:19.140
Because you can be an expert, but not in everything.
00:15:19.140 --> 00:15:29.500
Another one that was interesting was IBM working on something to look at the handwritten notes of physicians.
00:15:29.500 --> 00:15:30.520
Uh-huh.
00:15:30.520 --> 00:15:37.060
And then it would predict how likely the person those notes were about was to have a heart attack.
00:15:37.060 --> 00:15:37.560
Yeah.
00:15:37.560 --> 00:15:48.960
In the clinical world, it's true that a lot of information is actually raw text, like handwritten notes, but also raw text in the system.
00:15:48.960 --> 00:15:56.860
For machine learning, that's a particularly difficult problem because it's what we call unstructured data.
00:15:57.340 --> 00:16:08.780
So you need to – typically for scikit-learn to work on these types of data, you need to do something extra to basically come up with a structure or come up with features that allow you to predict something.
00:16:08.780 --> 00:16:10.080
Sure.
00:16:10.080 --> 00:16:15.360
And so both of those two examples that I brought up have really interesting data origin problems.
00:16:15.980 --> 00:16:30.560
So if I give you an MP3 of a whale or an audio stream of a whale, how do you turn that into numbers that go into the machine even to train it?
00:16:30.560 --> 00:16:36.020
And then similarly with handwriting, how do you – you've got to do handwriting recognition.
00:16:36.020 --> 00:16:40.840
You've then got to sort of understand what the handwriting means.
00:16:41.100 --> 00:16:42.180
And there's a lot of levels.
00:16:42.180 --> 00:16:46.540
How do you take this data and actually get it into something like scikit-learn?
00:16:46.540 --> 00:16:56.720
So scikit-learn expects that every observation, we also call it a sample or a data point, is basically described by a vector, like a vector of values.
00:16:57.580 --> 00:17:10.480
So if you take the sound of the whale, you can say, okay, the sound in the MP3 is just a set of floating point values, one per time sample, really a time-domain signal that you get for a few seconds of data.
00:17:10.480 --> 00:17:15.520
It's probably not the best way to get a good predictive system.
00:17:15.680 --> 00:17:25.140
You want to do some feature transformation, change the input to get something that brings features that are more powerful for scikit-learn and the learning system.
00:17:25.140 --> 00:17:38.760
And you would typically do this with a time-frequency transform, things like spectrograms, trying to extract features that are, for example, invariant to some aspects of the data, like frequencies or time shifts.
00:17:38.860 --> 00:17:43.280
So there's probably a bit of pre-processing to do on these raw signals.
00:17:43.280 --> 00:17:48.500
And then once you have your vector, you can use the scikit-learn machinery to build your predictive system.
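NOTE
A sketch of that pre-processing step: turning a raw audio signal into a fixed-length feature vector via a spectrogram, using scipy.signal rather than scikit-learn (the synthetic tone and the log-power summary are illustrative choices only).
  import numpy as np
  from scipy import signal
  fs = 8000  # sample rate in Hz
  t = np.arange(0, 3.0, 1.0 / fs)
  audio = np.sin(2 * np.pi * 440 * t)  # stand-in for a real whale recording
  # Time-frequency transform: the columns of Sxx are spectra over time.
  freqs, times, Sxx = signal.spectrogram(audio, fs=fs, nperseg=256)
  # Summarize into one vector per clip, as scikit-learn estimators expect.
  features = np.log(Sxx + 1e-10).mean(axis=1)
  print(features.shape)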
00:17:48.500 --> 00:17:53.320
How much of that pre-processing is in the tool set?
00:17:53.320 --> 00:17:55.860
So it depends on the type of data.
00:17:55.860 --> 00:17:59.780
Typically for signals, there's nothing really specific in scikit-learn.
00:17:59.780 --> 00:18:05.540
You would probably use scipy.signal or any type of signal processing Python code that you find online.
00:18:06.320 --> 00:18:14.040
I would say for other types of data, like text, in scikit-learn there is something called the feature extraction module.
00:18:14.040 --> 00:18:22.980
In the feature extraction module, you have something for text; probably the biggest part of feature extraction is really text processing.
00:18:22.980 --> 00:18:28.900
And you have some stuff also for images, but it's quite limited.
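NOTE
A quick sketch of both parts of sklearn.feature_extraction mentioned here, with made-up inputs: tf-idf vectors from text, and 2-D patches from an image.
  import numpy as np
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.feature_extraction import image
  # Text: turn raw documents into a sparse tf-idf matrix.
  docs = ["patient reports chest pain", "no symptoms at follow-up"]
  X_text = TfidfVectorizer().fit_transform(docs)
  print(X_text.shape)  # (2 documents, vocabulary size)
  # Images: sample small patches from a (here random) grayscale image.
  img = np.random.rand(64, 64)
  patches = image.extract_patches_2d(img, patch_size=(8, 8), max_patches=100)
  print(patches.shape)  # (100, 8, 8)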
00:18:29.580 --> 00:18:33.860
We should probably introduce what scikit-learn is and get into the details of that.
00:18:33.860 --> 00:18:38.240
But I have one more sort of example to let people know about that I think is pretty cool.
00:18:38.240 --> 00:18:42.200
On show 16, I talked to Roy Rappaport from Netflix.
00:18:42.200 --> 00:18:51.700
And Netflix has a tremendously large cloud computing infrastructure to power all of their – you know, basically their movie system, right?
00:18:51.700 --> 00:18:53.460
And everything behind the scenes there.
00:18:53.460 --> 00:19:09.440
And they have so many virtual machine instances and services running on them, and then different types of devices accessing services on those machines, that they said it's almost impossible to manually determine if there's, you know, some edge case where there's a problem.
00:19:10.220 --> 00:19:18.980
And so they actually set up machine learning to monitor their infrastructure and then tell them if there's some kind of problem in real time.
00:19:18.980 --> 00:19:19.820
Yeah.
00:19:19.820 --> 00:19:22.280
So I think that's really a cool use of it as well.
00:19:33.460 --> 00:19:36.140
This episode is brought to you by Hired.
00:19:36.140 --> 00:19:42.600
Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities.
00:19:42.600 --> 00:19:51.760
Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company.
00:19:51.760 --> 00:19:58.120
Typically, candidates receive five or more offers in just the first week, and there are no obligations ever.
00:19:58.120 --> 00:20:00.220
Sounds pretty awesome, doesn't it?
00:20:00.220 --> 00:20:02.280
Well, did I mention there's a signing bonus?
00:20:02.600 --> 00:20:06.360
Everyone who accepts a job from Hired gets a $2,000 signing bonus.
00:20:06.360 --> 00:20:10.700
And for Talk Python listeners, it gets way sweeter.
00:20:10.700 --> 00:20:18.280
Use the link Hired.com slash Talk Python To Me, and Hired will double the signing bonus to $4,000.
00:20:18.280 --> 00:20:20.000
Opportunity's knocking.
00:20:20.000 --> 00:20:23.600
Visit Hired.com slash Talk Python To Me and answer the call.
00:20:31.740 --> 00:20:36.320
Yeah, that's a very cool thing to do.
00:20:36.320 --> 00:20:46.900
And actually, many industries and many companies are looking for these types of systems that they call anomaly detection or failure prediction.
00:20:46.900 --> 00:20:51.920
And it's getting a big use case for machine learning, indeed.
00:20:52.480 --> 00:20:56.740
The Netflix guys were actually using scikit-learn, not just some other machine learning system.
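NOTE
The episode doesn't say which estimator Netflix used; as one illustration of the anomaly-detection idea in today's scikit-learn, here is a sketch with IsolationForest on made-up server metrics.
  import numpy as np
  from sklearn.ensemble import IsolationForest
  rng = np.random.RandomState(0)
  # Rows are instances; columns could be CPU load and request latency.
  normal = rng.normal(loc=0.5, scale=0.1, size=(200, 2))
  weird = np.array([[0.95, 0.98]])  # one misbehaving instance
  detector = IsolationForest(random_state=0).fit(normal)
  print(detector.predict(weird))  # -> [-1], flagged as an anomaly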
00:20:56.740 --> 00:20:59.040
So let's get to the details of that.
00:20:59.040 --> 00:20:59.760
What's scikit-learn?
00:20:59.760 --> 00:21:00.880
Where did it come from?
00:21:00.880 --> 00:21:06.940
So scikit-learn is probably the biggest machine learning library that you can find in the Python world.
00:21:07.120 --> 00:21:16.900
So it dates back to almost 10 years ago, when David Cournapeau was doing a Google Summer of Code to kickstart the scikit-learn project.
00:21:16.900 --> 00:21:24.360
And then for a few years, there was a French guy called Matthieu Brucher who took on the project.
00:21:24.700 --> 00:21:28.480
But it was kind of a one-guy project for many years.
00:21:28.480 --> 00:21:45.020
And in 2010, with colleagues at INRIA in France, we decided to basically try to start from this state of scikit-learn and make it bigger and really try to build a community around this.
00:21:46.460 --> 00:21:56.000
So these people are Gaël Varoquaux and Fabian Pedregosa, and also somebody you may have heard of in the machine learning world, Olivier Grisel.
00:21:56.000 --> 00:22:03.240
And so that was pretty much 2010, so five years ago.
00:22:03.240 --> 00:22:05.800
And basically it took off pretty quickly.
00:22:05.800 --> 00:22:16.440
After, I would say, a year, scikit-learn had more than 10 core developers, way beyond the initial lab where it started.
00:22:17.300 --> 00:22:18.740
That's really excellent.
00:22:18.740 --> 00:22:24.200
Yeah, I mean, it's definitely an absolutely mainstream project that people are using in production these days.
00:22:24.200 --> 00:22:26.500
So congratulations to everyone on that.
00:22:26.500 --> 00:22:26.960
That's great.
00:22:26.960 --> 00:22:27.600
Thank you.
00:22:27.600 --> 00:22:28.020
Yeah.
00:22:28.020 --> 00:22:37.420
And so the name scikit-learn comes from the fact that it's basically an extension to the SciPy ecosystem, right?
00:22:37.420 --> 00:22:48.700
So the SciPy stack is NumPy for numerical processing, SciPy for scientific stuff, Matplotlib, IPython, SymPy for symbolic math, and Pandas, right?
00:22:48.700 --> 00:22:50.200
And then there's these extensions.
00:22:50.200 --> 00:22:51.480
Yes.
00:22:51.960 --> 00:22:55.500
So basically the kind of division is that you cannot put everything in SciPy.
00:22:55.500 --> 00:22:57.200
SciPy is already a big project.
00:22:57.200 --> 00:23:03.680
And the idea of the scikits was to build extensions around SciPy that are more domain-specific.
00:23:03.680 --> 00:23:08.220
Also, it's easier to contribute to a smaller project.
00:23:08.220 --> 00:23:16.960
So basically the barrier to entry for newcomers is much lower when you contribute to a scikit than to SciPy, which is a fairly big project now.
00:23:16.960 --> 00:23:21.940
Yeah, there's so much support for the whole SciPy system, right?
00:23:22.180 --> 00:23:27.300
So it's much better to just build on that than try to duplicate anything in, say, NumPy or whatever.
00:23:27.300 --> 00:23:28.240
Exactly.
00:23:28.240 --> 00:23:37.100
I mean, there are a lot of efforts to see what could be NumPy 2.0, what's going to be the future of it, and how to extend it.
00:23:37.100 --> 00:23:44.780
I mean, a lot of people are thinking of what's next because, I mean, NumPy is almost 10 years old, probably more than 10 years old now.
00:23:44.780 --> 00:23:49.120
And, yeah, people are trying to see also how it can evolve.
00:23:49.120 --> 00:23:49.680
Sure.
00:23:49.680 --> 00:23:50.660
That makes a lot of sense.
00:23:51.320 --> 00:23:57.000
So speaking of evolving and going forward, what are the plans with scikit-learn?
00:23:57.000 --> 00:23:57.860
Where is it going?
00:23:57.860 --> 00:24:04.420
So I would say in terms of features, I mean, scikit-learn is really in the consolidation stage.
00:24:04.420 --> 00:24:06.600
scikit-learn is five years old.
00:24:06.600 --> 00:24:09.060
The API is pretty much settled.
00:24:09.060 --> 00:24:20.260
There are a few things here and there that we have to deal with now, due to early API decisions that need to be fixed.
00:24:20.460 --> 00:24:44.060
And I guess the big objective is to basically do scikit-learn 1.0, the first fully stable release in terms of API, because that's something we've been talking about between the core developers for, I mean, more than two years now: coming out with this 1.0 version that stabilizes every part of the API.
00:24:44.060 --> 00:24:44.540
Right.
00:24:44.540 --> 00:24:49.200
One final major cleanup, if you can, and then stabilizing it, yeah?
00:24:49.520 --> 00:24:49.920
Exactly.
00:24:49.920 --> 00:24:49.960
Exactly.
00:24:49.960 --> 00:25:01.160
And in terms of new features, I mean, there's always a lot of cool stuff around, and you see the number of pull requests coming into scikit-learn.
00:25:01.160 --> 00:25:02.280
It's pretty crazy.
00:25:02.280 --> 00:25:07.040
And it takes, I would say, a huge maintenance and reviewing effort.
00:25:07.540 --> 00:25:14.100
So features are coming into scikit-learn slowly now, much more slowly than they used to, but I guess it's normal for a project that is getting big.
00:25:14.100 --> 00:25:16.060
Yeah, it's definitely getting big.
00:25:16.060 --> 00:25:22.480
It has 7,600 stars and 4,500 forks on GitHub, so that's pretty awesome.
00:25:22.480 --> 00:25:23.040
Yeah.
00:25:23.040 --> 00:25:24.500
It has 457 contributors.
00:25:24.500 --> 00:25:24.880
Cool.
00:25:25.440 --> 00:25:30.740
Yeah, I would say for every release we get, I mean, we try to release every six months.
00:25:30.740 --> 00:25:36.060
And for every release we get a big number of contributors.
00:25:36.060 --> 00:25:44.020
So maybe we could do like a survey of the modules of scikit-learn, just the important ones that come to mind.
00:25:44.020 --> 00:25:45.660
What are the moving parts in there?
00:25:46.180 --> 00:25:52.720
So I would say maybe the part I know the most, the module that I maintain the most, which is the linear model module.
00:25:52.720 --> 00:25:58.760
And recently the efforts on the linear models were to scale it up.
00:25:58.760 --> 00:26:07.780
Basically, trying to learn these linear models in an out-of-core fashion, to be able to scale to data that does not fit in RAM.
00:26:07.780 --> 00:26:15.040
And that's part of the, I would say, part of the plan for this linear model module in scikit-learn.
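NOTE
A sketch of the out-of-core idea: SGDClassifier's partial_fit consumes one chunk at a time, so the full dataset never has to fit in RAM. The random chunks here stand in for data streamed from disk or logs.
  import numpy as np
  from sklearn.linear_model import SGDClassifier
  clf = SGDClassifier()
  classes = np.array([0, 1])  # all labels must be declared up front
  rng = np.random.RandomState(0)
  for _ in range(10):  # imagine each chunk read from a huge log file
      X_chunk = rng.rand(1000, 20)
      y_chunk = (X_chunk[:, 0] > 0.5).astype(int)
      clf.partial_fit(X_chunk, y_chunk, classes=classes)
  X_test = rng.rand(100, 20)
  print(clf.score(X_test, (X_test[:, 0] > 0.5).astype(int)))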
00:26:15.400 --> 00:26:15.960
That's cool.
00:26:15.960 --> 00:26:17.420
So what kind of problems do you solve with that?
00:26:17.420 --> 00:26:24.640
The types of problems where you have a, like, humongous number of samples and potentially a large number of features.
00:26:24.640 --> 00:26:31.360
So there are not so many applications where you get that many samples, but it's typically text or log files.
00:26:31.360 --> 00:26:37.460
These types of industry problems where you collect a lot of samples on a regular basis.
00:26:38.140 --> 00:26:46.720
You have these examples also if you monitor an industrial system, like if you want to do what we discussed before about, like, predictive maintenance.
00:26:46.720 --> 00:26:49.360
That's probably a use case where this can be useful.
00:26:50.620 --> 00:27:00.820
Probably the other, like, module that also attracts a lot of effort these days is the Ensemble module, and especially the tree module.
00:27:00.820 --> 00:27:13.080
So for models like Random Forest or Gradient Boosting, which are very popular models that have been helping people to win Kaggle competitions for the last few years.
00:27:13.560 --> 00:27:17.020
Yeah, I've heard a lot about these forests and so on.
00:27:17.020 --> 00:27:18.600
Can you talk a little bit about what that is?
00:27:19.080 --> 00:27:32.140
So a random forest basically is a set of decision trees that you pool together to get a prediction that is more accurate.
00:27:32.140 --> 00:27:36.840
More accurate because it has less variance in technical terms.
00:27:36.840 --> 00:27:46.720
And the way it works is you try to basically build decision trees from a subset of data, a subset of samples, subset of features in a clever way.
00:27:46.720 --> 00:27:50.540
And then you pool all these trees into one big predictive model.
00:27:50.540 --> 00:27:59.080
And, for example, if you do binary classification and you train a thousand trees, for a new observation you ask the thousand trees:
00:27:59.080 --> 00:27:59.900
What's the label?
00:27:59.900 --> 00:28:01.200
Is it positive or negative?
00:28:01.200 --> 00:28:05.200
And then you basically count the number of trees that are saying positive.
00:28:05.200 --> 00:28:08.200
And if you have more trees saying positive, then you predict positive.
00:28:08.200 --> 00:28:11.600
That's kind of the basic idea of random forest.
00:28:11.600 --> 00:28:13.280
And it turns out to be super powerful.
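NOTE
A sketch of that majority vote with scikit-learn's RandomForestClassifier on synthetic data, using a thousand trees to match the example above; predict_proba shows the averaged per-tree class estimates behind the vote.
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  X, y = make_classification(n_samples=300, n_features=8, random_state=0)
  # Each tree is grown on a bootstrap sample with random feature subsets.
  forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)
  print(forest.predict(X[:2]))        # majority vote across the trees
  print(forest.predict_proba(X[:2]))  # averaged class estimates per tree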
00:28:13.280 --> 00:28:14.460
That's really cool.
00:28:14.460 --> 00:28:23.120
Well, it seems to me like it would bring in kind of different perspectives or taking different components or parts of a problem into account.
00:28:23.120 --> 00:28:28.940
So some of the trees look at some features and maybe the other trees look at other features.
00:28:28.940 --> 00:28:31.820
And then they can combine in some important way.
00:28:31.820 --> 00:28:32.660
Exactly.
00:28:32.660 --> 00:28:33.200
Yeah.
00:28:33.200 --> 00:28:36.820
Another one that I see coming up is the SVM module.
00:28:36.820 --> 00:28:37.860
What's that one do?
00:28:38.800 --> 00:28:52.240
So SVM is a very popular machine learning approach that was basically, I mean, very big in the 90s and 10 years ago, and still gets some traction.
00:28:52.240 --> 00:29:10.000
And basically, the idea of a support vector machine, which is what SVM stands for, is to be able to use kernels on the data and basically solve linear problems in an abstract space where you project your raw data.
00:29:10.000 --> 00:29:11.360
Let me try to give an example.
00:29:11.560 --> 00:29:18.320
If you take a graph, or if you take a string, that's not naturally something that can be represented by a vector.
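NOTE
A sketch of the kernel idea with data that is numeric but not linearly separable: two concentric circles that an RBF-kernel SVC separates by implicitly projecting into a richer space. The dataset and the kernel are illustrative choices, not the graph/string kernels being discussed.
  from sklearn.datasets import make_circles
  from sklearn.svm import SVC
  X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
  # No straight line separates these classes in the raw 2-D space.
  clf = SVC(kernel="rbf").fit(X, y)
  print(clf.score(X, y))  # near-perfect on this toy problem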