<a href="https://colab.research.google.com/github/xpertdesh/ml-class21/blob/main/labs/intro_to_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to scikit-learn (sklearn)
#### Part of the [Inquiryum Machine Learning Fundamentals Course](http://inquiryum.com/machine-learning/)


![](https://upload.wikimedia.org/wikipedia/commons/0/05/Scikit_learn_logo_small.svg)

sklearn is a Python library that implements many machine learning algorithms.

The best bet, as with probably everything else, is just to google what you want. For example, if you want to learn how to use kNN (k nearest neighbors) with sklearn, google **sklearn knn** and you will find the information. The good thing about the sklearn documentation is that it provides a short example of how to use each algorithm.

That said, let's get started.

Suppose we want a small dataset of the heights of women who play in the WNBA or who are gymnasts. We already have a dataset of women whose sport is basketball, gymnastics, or track, so we will read that file in, filter out the track athletes, and drop the weight column.


In [None]:
import pandas as pd
from pandas import DataFrame
athlete = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/athletes.csv', index_col='Name')
athletes = athlete[((athlete.Sport == 'Basketball') | (athlete.Sport == 'Gymnastics'))][['Sport', 'Height']]
athletes = athletes.sort_index()
athletes

Name,Sport,Height
Asuka Teramoto,Gymnastics,54
Brittainey Raven,Basketball,72
Chen Nan,Basketball,78
Elena Delle Donne,Basketball,77
Gabby Douglas,Gymnastics,49
Jennifer Lacy,Basketball,75
Laurie Hernandez,Gymnastics,60
Linlin Deng,Gymnastics,54
Madison Kocian,Gymnastics,62
Nakia Sanford,Basketball,76


## Classifying sport based on height
Ok, this is going to be a super easy task, but suppose I want to build a kNN classifier where I will give someone's height and the classifier will say whether their sport is gymnastics or basketball. 

First, I will import the kNN algorithm and make an instance of it.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}

The line

    knn = KNeighborsClassifier(n_neighbors=3)

makes an instance of a k nearest neighbor classifier with k=3.

As the name of the method suggests, the line

    knn.get_params()
    
displays the parameters of the classifier. In this case:

```
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}
 ```
 
I won't explain all the parameters now, but notice that the metric is Minkowski, and the power or `p` of Minkowski is 2 making it Euclidean distance. The number of neighbors `n_neighbors` is 3. 


---


Most classifiers in sklearn want the labels (the thing we are trying to predict) to be a separate parameter from the features. So, let's create a DataFrame for the features (in this case there is only one but this isn't usually the case). And we will create a Series for the labels. First, `athletes_features` 


In [None]:
athletes_features = athletes[['Height']]
athletes_features

Name,Height
Asuka Teramoto,54
Brittainey Raven,72
Chen Nan,78
Elena Delle Donne,77
Gabby Douglas,49
Jennifer Lacy,75
Laurie Hernandez,60
Linlin Deng,54
Madison Kocian,62
Nakia Sanford,76


Great! Now, one for the athletes' labels. Again, we are trying to predict what sport they play, so we will use the Sport column.

In [None]:
athletes_labels =  athletes['Sport']
athletes_labels

Name
Asuka Teramoto       Gymnastics
Brittainey Raven     Basketball
Chen Nan             Basketball
Elena Delle Donne    Basketball
Gabby Douglas        Gymnastics
Jennifer Lacy        Basketball
Laurie Hernandez     Gymnastics
Linlin Deng          Gymnastics
Madison Kocian       Gymnastics
Nakia Sanford        Basketball
Nikki Blue           Basketball
Qiushuang Huang      Gymnastics
Rebecca Tunney       Gymnastics
Seimone Augustus     Basketball
Shanna Crossley      Basketball
Shavonte Zellous     Basketball
Simone Biles         Gymnastics
Viktoria Komova      Gymnastics
Name: Sport, dtype: object

Again, the features are a Pandas DataFrame and the labels are a Pandas Series.

We train the knn classifier we created by using the `fit` method as follows:

In [None]:
knn.fit(athletes_features, athletes_labels)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

Most machine learning (ML) algorithms build a model. From the examples in the training set, they build an internal representation, a model, of the relationship between the features and the labels. Once the algorithm builds this model it 'forgets' the individual data instances. Such ML algorithms are called eager learners. On the other hand, a lazy learner does not build a model beforehand. It remembers all the training instances and when it needs to make a prediction on a new instance it then processes all the training data. 

kNN is a lazy learner. During the learning phase, during `fit`, it simply remembers the data (it remembers that Gabby Douglas' height is 49 inches and Elena Delle Donne's is 77). In the traditional kNN that we have worked through by hand, when we want to make a prediction, we calculate the distance between the new instance and every instance in our training set (how close is this new instance to Gabby Douglas? To Elena Delle Donne? And to everyone else in our small dataset?). As you can imagine, this takes some time. So fitting is fast because it doesn't actually build a model, but predicting is slow.
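The brute force search described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how sklearn implements it: the function name `nearest_neighbors` is made up for this example, and the list holds just a handful of the athletes whose heights appear in the table above.

```python
# A subset of the training data shown earlier: (name, height in inches, sport).
training = [
    ('Gabby Douglas', 49, 'Gymnastics'),
    ('Laurie Hernandez', 60, 'Gymnastics'),
    ('Madison Kocian', 62, 'Gymnastics'),
    ('Brittainey Raven', 72, 'Basketball'),
    ('Jennifer Lacy', 75, 'Basketball'),
    ('Elena Delle Donne', 77, 'Basketball'),
]

def nearest_neighbors(height, data, k=3):
    """Return the k training rows closest to height as (distance, name, sport)."""
    # Nothing was precomputed when the data was stored, so every row
    # is visited on every single query. That is why prediction is slow.
    distances = [(abs(h - height), name, sport) for name, h, sport in data]
    return sorted(distances)[:k]

for distance, name, sport in nearest_neighbors(74, training):
    print(name, sport, distance)
```

With only six rows the cost is invisible, but the work grows with the size of the training set for every prediction we ask for.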

### Three kNN algorithms
There are three basic kNN algorithms: 

* **brute**, which uses the brute force method we described above.
* **kd_tree**, which creates a binary tree when you are fitting. This makes prediction faster.
* **ball_tree**, which also creates a binary tree during training (fitting).

Since this is our first machine learning algorithm, we won't get too bogged down in learning about kd trees and ball trees, but let's cover a few things. First, since the algorithm needs to construct a binary tree when it fits the data, fitting takes longer but with that binary tree, prediction is faster. That is the trade off. For both trees, a binary tree is constructed that divides the training data into a number of sets that have a specific size limit known as leaf size, which is a hyperparameter of the algorithm. When we specify a large leaf size, the depth of the constructed binary tree will be shallow and the amount of time constructing the tree will be reduced. When we specify a smaller leaf size, training time will be increased but predictions will be faster. 

When we did `knn.get_params()` we saw

```
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}
```

The first line:

```
'algorithm': 'auto',
```

as you can guess, displays the algorithm used. The options are:

* `brute`
* `kd_tree`
* `ball_tree`
* `auto`

The default is `auto` which simply allows the algorithm itself to determine the best algorithm to use based on the training data.

We can specify the algorithm by using the `algorithm` hyperparameter:

```
knn = KNeighborsClassifier(algorithm='brute', n_neighbors=3)
```
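As a quick sanity check, here is a small sketch showing that the choice of search algorithm changes how the neighbors are found, not which label comes out. The lists below are an assumption of this example: they contain only the 15 athletes whose heights are printed in this notebook, not the full CSV file.

```python
from sklearn.neighbors import KNeighborsClassifier

# Heights (in inches) that appear in the tables of this notebook.
gymnast_heights = [49, 54, 54, 57, 58, 60, 61, 61, 62]
basketball_heights = [68, 72, 75, 76, 77, 78]

# sklearn wants one row per example, so each height becomes a one-item list.
heights = [[h] for h in gymnast_heights + basketball_heights]
sports = (['Gymnastics'] * len(gymnast_heights)
          + ['Basketball'] * len(basketball_heights))

predictions = {}
for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    model = KNeighborsClassifier(algorithm=algorithm, n_neighbors=3)
    model.fit(heights, sports)
    # Classify someone who is 74 inches tall and someone who is 65 inches tall.
    predictions[algorithm] = list(model.predict([[74], [65]]))

print(predictions)
```

All three entries should agree. On a dataset this tiny there is no speed difference worth measuring, which is why leaving `algorithm='auto'` is a reasonable default.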

The next line

```
'leaf_size': 30,
```

displays the leaf size which we just talked about. 30 is the default value. The optimal value is dependent on the problem. Leaf size only matters for the two tree-based algorithms (the brute force method builds no tree), so you can set it by


```
knn = KNeighborsClassifier(algorithm='kd_tree', leaf_size=10, n_neighbors=3)
```
The next line
```
'metric': 'minkowski',
```

specifies the distance metric we are using. Minkowski is the default.  Its strength is in working with real values. If you are dealing with boolean or integer values other metrics may be a better fit.

Finally,
```
'weights': 'uniform'}
```

As we know, when we use k=3, the three closest neighbors get a vote in determining the classification of the new instance. If 2 say 'basketball' and 1 says 'gymnastics', we classify the new instance as basketball. In this case, all three neighbors have an equal, or uniform, vote. That's what this `'weights': 'uniform'` specifies. But suppose we want the closest neighbor to have more weight in the vote than the others. In that case we could use `weights='distance'`, where the weight of each vote is the inverse of the distance.
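To see what the two voting schemes do to the same query, here is a minimal sketch that compares them for someone who is 65 inches tall. As an assumption of this example, the training lists hold only the 15 athletes whose heights are printed in this notebook.

```python
from sklearn.neighbors import KNeighborsClassifier

# Heights (in inches) that appear in the tables of this notebook.
gymnast_heights = [49, 54, 54, 57, 58, 60, 61, 61, 62]
basketball_heights = [68, 72, 75, 76, 77, 78]
heights = [[h] for h in gymnast_heights + basketball_heights]
sports = (['Gymnastics'] * len(gymnast_heights)
          + ['Basketball'] * len(basketball_heights))

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform')
uniform.fit(heights, sports)
weighted = KNeighborsClassifier(n_neighbors=3, weights='distance')
weighted.fit(heights, sports)

# Columns follow uniform.classes_, which is sorted: Basketball, then Gymnastics.
uniform_proba = uniform.predict_proba([[65]])[0]
weighted_proba = weighted.predict_proba([[65]])[0]
print(uniform.classes_)
print(uniform_proba)
print(weighted_proba)
```

The three nearest heights are 62 and 68 (both 3 inches away) and 61 (4 inches away). With uniform weights gymnastics gets 2 of 3 votes. With distance weights the votes are 1/3, 1/3 and 1/4, so gymnastics gets (1/3 + 1/4) out of 11/12, or 7/11, which is a little less than before because its second vote comes from farther away.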


 This long description was intended to peel away some of the mystery, and not to bore or confuse you. 


----
Back to the matter at hand ...

```
knn.fit(athletes_features, athletes_labels)
```

We could have done all that in one line:

    knn.fit(athletes[['Height']], athletes['Sport'])
    
but I just wanted to show you a common convention: naming the features `_features` and the classes those examples belong to either `_labels` or `_class`. We could also have named them `features_athletes` and so on; anything that makes it clear works.

### let's classify something

Nneka Ogwumike is 6'2" or 74 inches. Let's see what our classifier predicts:

In [None]:
print(knn.predict([[74]]))

['Basketball']


Ok. That is good considering Nneka Ogwumike is the 2016 MVP for the WNBA.

We can also ask the classifier for the probability that Nneka is a basketball player:

In [None]:
print(knn.predict_proba([[74]]))

[[1. 0.]]


Ok. The probability that she is a basketball player is 1.0 and the probability that she is a gymnast is 0.0. How did we get that probability? Well, `k` was 3, so we used the three nearest neighbors and all of them were basketball players. If 2 were basketball players and 1 a gymnast, the probability would be about 66.7%. There is no magic here.
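The counting behind those probabilities fits in a few lines of plain Python. This is a sketch of the idea for uniform weights only, and the helper name `vote_probabilities` is made up for this example.

```python
from collections import Counter

def vote_probabilities(neighbor_labels):
    """Turn the labels of the k nearest neighbors into vote fractions."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return {label: count / total for label, count in counts.items()}

# All three neighbors are basketball players.
print(vote_probabilities(['Basketball', 'Basketball', 'Basketball']))

# Two basketball players and one gymnast.
print(vote_probabilities(['Basketball', 'Basketball', 'Gymnastics']))
```

A label that gets no votes simply does not show up in the dictionary, whereas sklearn reports it as 0.0 in its own column.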

Cool. Let's try Leilani Mitchell who is 5'5" tall (or Svetlana Khorkina who is also the same height):

In [None]:
print(knn.predict([[65]]))
print(knn.predict_proba([[65]]))

['Gymnastics']
[[0.33333333 0.66666667]]


Ok. So here our classifier predicts gymnastics, but it is only about 0.67 confident. Why?

For that let's look at the athletes in our training set sorted by height. 

In [None]:
athletes.sort_values(by='Height')

Name,Sport,Height
Gabby Douglas,Gymnastics,49
Asuka Teramoto,Gymnastics,54
Linlin Deng,Gymnastics,54
Simone Biles,Gymnastics,57
Rebecca Tunney,Gymnastics,58
Laurie Hernandez,Gymnastics,60
Qiushuang Huang,Gymnastics,61
Viktoria Komova,Gymnastics,61
Madison Kocian,Gymnastics,62
Nikki Blue,Basketball,68


We are using kNN with a k of 3 and we are trying to classify someone who is 65 inches tall.  
<h3 style="color:red">3 Nearest Neighbors</h3>
<span style="color:red">What sports do the 3 nearest neighbors play?</span>
Double click this cell, enter the data, and shift-enter to render the cell

Sport | Euclidean Distance
 :---: | :--: |
  x |  0   
  x | 0 
  x | 0 
  |
  
Why did I ask for Euclidean Distance? Well, again, when we created the classifier and then used the `get_params` method, the classifier returned:

```
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}           
```

You see that it uses the Minkowski distance. The `p` parameter is the power parameter for Minkowski and you see the default value is 2. When p=1 Minkowski distance is the Manhattan Distance; when p=2 it is the Euclidean distance. And when we created the classifier we set `k`, the number of nearest neighbors for kNN, to 3 (the default is 5). So the three nearest neighbors 'vote' on the label to give that example. For 5'5" Svetlana Khorkina, 2 neighbors voted gymnastics and 1 voted basketball, so gymnastics won.

As we just discussed, for knn we have a number of parameters we can use to modify the classifier. For example, to build a Manhattan Distance kNN classifier with a k of 1:

In [None]:
knnOne = KNeighborsClassifier(n_neighbors=1, p= 1)
knnOne.fit(athletes[['Height']], athletes['Sport'])

KNeighborsClassifier(n_neighbors=1, p=1)

And using `sklearn`, it is quite easy to build and use a variety of classifiers. For example, although you might know nothing about Gaussian Naive Bayes classifiers, you can build one of those without even knowing much about the algorithm:

In [None]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(athletes[['Height']], athletes['Sport'])

GaussianNB()

In [None]:
clf.predict([[70]])

array(['Basketball'], dtype='<U10')

So regardless of algorithm, the steps were

1. create the classifier. For ex., `knn = KNeighborsClassifier()`
2. fit the classifier. For ex., `knn.fit(athletes_features, athletes_labels)`
3. use the classifier to make predictions on new data. For ex., `knn.predict([[70]])`


Ok. back to kNN.



# return to kNN
We still have our `knn` classifier:

In [None]:
knn

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

### training and test sets

When we `fit` the classifier:

    knn.fit(athletes_features, athletes_labels)

We trained the classifier on a set of data that we use for training, and not surprisingly, this is called the **training set**.  So perhaps a better variable name would be:

    athletes_training_features   athletes_training_labels

To see how **good** a classifier we made (more on the meaning of good later) we use a set of data called the **test** set. 

Let's create a tiny test set now:
    
    



In [None]:
athletes_test_features = DataFrame({'Height': [74, 65, 65]}, index=['Nneka Ogwumike', 'Svetlana Khorkina', 'Leilani Mitchell'])
athletes_test_features

Unnamed: 0,Height
Nneka Ogwumike,74
Svetlana Khorkina,65
Leilani Mitchell,65


and now we can make predictions on everyone in our test set:

In [None]:
knn.predict(athletes_test_features)

array(['Basketball', 'Gymnastics', 'Gymnastics'], dtype=object)

sweet!

## the non-coding part of the notebook
First, here is a confession. I know zip about sports. I've never had an interest and don't watch any sporting event. So if I make some horrendous mistake in these descriptions you may want to cut me some slack. I am using sports as an example because 

1. height and weight are easy things to talk about (and it is harder to come up with easy features for musicians), and 
2. it is easy to get the height and weight of sports people (it's much, much, tougher to get the height and weight of dancers for example). 

Let's go back to thinking about height and weight as features. Allyson Michelle Felix is among the fastest women on the planet (she won 6 Olympic Gold Medals). Her height is 5'5" and she weighs 121 pounds. Courtney Williams is a guard for the Connecticut Sun WNBA team. She is 5'8" and weighs 136 pounds.  Here is the chart so you can see those numbers:


 person | height | weight
 :---: |  :---: | :---: 
Allyson Michelle Felix | 65 | 121
Courtney Williams | 68 | 136


Now I want to classify an athlete who is 5'4 and weighs 130. What is your gut feeling? Do you think she is a track person or a WNBA player?

My thinking is that she is track since she seems too short for a basketball player (and plus I know that those are the stats for Carmelita 'The Jet' Jeter, the fastest woman on the planet), although the shortest person in the WNBA, Shannon Denise Bobbitt, is only 5'2".  But if we classify someone who is 5'4" by using the Manhattan Distance:

    distance(Carmelita, Allyson) = abs(64 - 65) + abs(130 - 121) = 1 + 9 = 10
    distance(Carmelita, Courtney) = abs(64 - 68) + abs(130 - 136) = 4 + 6 = 10
    
Carmelita comes out exactly as close to the basketball player as to the track athlete, so the classifier has no reason to pick track. And if we used Euclidean Distance (about 9.06 to Allyson versus 7.21 to Courtney), we'd pick that she was a basketball player.
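Plugging the numbers from the chart above into a small distance function is a quick way to check this kind of arithmetic. The function below is a plain Python sketch of the Minkowski distance (p=1 is Manhattan, p=2 is Euclidean); the name `minkowski` and the tuple layout are choices made for this example.

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length tuples of numbers."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# (height in inches, weight) taken from the chart and the text above.
carmelita = (64, 130)
allyson = (65, 121)
courtney = (68, 136)

for name, athlete in (('Allyson', allyson), ('Courtney', courtney)):
    manhattan = minkowski(carmelita, athlete, p=1)
    euclidean = minkowski(carmelita, athlete, p=2)
    print(name, manhattan, round(euclidean, 2))
```

Either way the weight difference does most of the talking: the 9 pounds between Carmelita and Allyson count for far more than the 1 inch between them.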
    

So this is sort of a bummer.  It's the same problem that I mentioned in the kNN video. Suppose I had a matchmaking site and had this misguided idea that the best relationships are between people who are about the same age and have about the same salary. And I have 2 guys:

 Guys | age | salary
 ---: | :---: | :---: 
 Mr. Cool | 26 | 80,000
 Old Dude | 67 | 115,000
 
 And I am trying to match up Ann who is 28 and earns 100k. 
 
 The Manhattan Distance between Ann and Mr. Cool is 2 + 20k = 20,002.
 
 The Manhattan Distance between Ann and Old Dude is 39 + 15k  = 15,039
 
 So, sadly, our algorithm would recommend the old dude to Ann. This is again a bummer. The problem in both examples is that the range of values in one column is far larger than the range in another column. In our case, the weight column values are much larger than the height values.
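To make the imbalance concrete, this little sketch splits each Manhattan distance into what age contributes and what salary contributes. The helper name `contributions` is made up for this example.

```python
def contributions(a, b):
    """Per-feature absolute differences; their sum is the Manhattan distance."""
    return [abs(x - y) for x, y in zip(a, b)]

# (age, salary) from the table above.
ann = (28, 100_000)
mr_cool = (26, 80_000)
old_dude = (67, 115_000)

for name, guy in (('Mr. Cool', mr_cool), ('Old Dude', old_dude)):
    age_part, salary_part = contributions(ann, guy)
    print(name, age_part, salary_part, age_part + salary_part)
```

A 39 year age gap moves the total by 39, while a 5,000 difference in salary gaps moves it by 5,000. Age is effectively ignored.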
 
### rescaling

A solution to this problem is to rescale the values so the values in all columns range from 0 to 1. There are other (and possibly better) ways to rescale but let's start with this simple one for now.

##### the formula for minmax rescaling:

### $$x'= \frac{x-x_{min}}{x_{max}-x_{min}}$$

Let's look at our simple example of Ann, Mr. Cool and Old Dude:

Person | Age | Salary
 ---: | :---: | :---: 
 Mr. Cool | 26 | 80,000
 Old Dude | 67 | 115,000
 Ann   | 28 | 100,000

The minimum value of the age column is 26 and the max is 67. Let's say I want to normalize Mr. Cool's age:


### $$x'_{Mr.Cool}= \frac{x_{Mr.Cool}-x_{min}}{x_{max}-x_{min}} = \frac{26-26}{67-26} = \frac{0}{41} = 0$$

Ann's normalized age:

### $$x'_{Ann}= \frac{x_{Ann}-x_{min}}{x_{max}-x_{min}} = \frac{28-26}{67-26} = \frac{2}{41} = 0.048$$

Old Dude's normalized age:

### $$x'_{OldDude}= \frac{x_{OldDude}-x_{min}}{x_{max}-x_{min}} = \frac{67-26}{67-26} = \frac{41}{41} = 1$$


<h3 style="color:red">Normalize Salary</h3>
<span style="color:red">Can you normalize the values in the salary column?</span>

Double click this cell, enter the data, and shift-enter to render this markdown cell

Person | Age | Salary
 ---: | :---: | :---: 
 Mr. Cool | 0 | 80,000
Old Dude | 1 | 115,000
 Ann   | 0.048 | 100,000

## It's pretty easy to do this in straight Python:



In [None]:
age = [26, 67, 28]

def scale(arr):
    return [(x - min(arr))/ (max(arr) - min(arr)) for x in arr]

scale(age)


[0.0, 1.0, 0.04878048780487805]

## Using the min-max scale method in sklearn
It's even easier to do it for pandas DataFrames. 
First, let's make a dataframe from the data we have been using.

In [None]:
simple = DataFrame({'age': [26, 67, 28], 'salary': [80000, 115000, 100000]}, index=['Mr. Cool', 'Old Dude', 'Ann'])
simple

Unnamed: 0,age,salary
Mr. Cool,26,80000
Old Dude,67,115000
Ann,28,100000


ok. and now let's scale those values:

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
simple[['age', 'salary']] =  scaler.fit_transform(simple[['age', 'salary']] )
simple

Unnamed: 0,age,salary
Mr. Cool,0.0,0.0
Old Dude,1.0,1.0
Ann,0.04878,0.571429


Cool.  Now when I try to find the Manhattan distance from Ann to both Mr. Cool and Old Dude I get:

###   $$distance_{Ann,Mr.Cool} = \left|.048 - 0.0\right| + \left|0.57-0\right| = .048 + 0.57 = 0.618$$

###   $$distance_{Ann,OldDude} = \left|.048 - 1.0\right| + \left|0.57-1\right| = .952 + 0.43 = 1.382$$

#### Now, fortunately, Ann is closer to Mr. Cool!
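If you want to check those two distances without a calculator, here is a plain Python sketch that rescales both columns and then measures the Manhattan distance. The helper names `min_max_scale` and `manhattan` are made up for this example, and because it keeps full precision instead of the rounded values used above, it lands on about 0.62 and 1.38.

```python
# (age, salary) for each person, same numbers as the DataFrame above.
people = {
    'Mr. Cool': (26, 80_000),
    'Old Dude': (67, 115_000),
    'Ann': (28, 100_000),
}

def min_max_scale(values):
    """Rescale a list of numbers so the smallest is 0 and the largest is 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

names = list(people)
ages = min_max_scale([people[n][0] for n in names])
salaries = min_max_scale([people[n][1] for n in names])
scaled = {n: (a, s) for n, a, s in zip(names, ages, salaries)}

to_cool = manhattan(scaled['Ann'], scaled['Mr. Cool'])
to_old = manhattan(scaled['Ann'], scaled['Old Dude'])
print(round(to_cool, 2), round(to_old, 2))
```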





<h1 style="color:red">A bigger challenge: The Iris Data Set</h1>


<img src="https://upload.wikimedia.org/wikipedia/commons/1/1e/IMG_7911-Iris_virginica.jpg" width="250" />

We are going to use the Iris Dataset, one of the standard data mining data sets. It was introduced by R. A. Fisher in 1936 and has been available in the UCI Machine Learning Repository since 1988.  The data set contains 3 classes of 50 instances each:

1. Iris Setosa 
2. Iris Versicolour 
3. Iris Virginica (the picture above)

There are only 4 attributes or features:

1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 

Here is an example of the data:

Sepal Length|Sepal Width|Petal Length|Petal Width|Class
:--: | :--: |:--: |:--: |:--: 
5.3|3.7|1.5|0.2|Iris-setosa
5.0|3.3|1.4|0.2|Iris-setosa
5.0|2.0|3.5|1.0|Iris-versicolor
5.9|3.0|4.2|1.5|Iris-versicolor
6.3|3.4|5.6|2.4|Iris-virginica
6.4|3.1|5.5|1.8|Iris-virginica

The job of the classifier is to determine the class of an instance (the type of Iris) based on the values of the attributes.

I will pause a moment while you load the dataset from

    https://raw.githubusercontent.com/zacharski/ml-class/master/data/irisTrain.csv
    
<h3 style="color:red">Load the data</h3>

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/irisTrain.csv')
iris

Unnamed: 0,Sepal Length,Sepal Width,Petal Length,Petal Width,Class
0,5.4,3.7,1.5,0.2,Iris-setosa
1,4.8,3.4,1.6,0.2,Iris-setosa
2,4.8,3.0,1.4,0.1,Iris-setosa
3,4.3,3.0,1.1,0.1,Iris-setosa
4,5.8,4.0,1.2,0.2,Iris-setosa
...,...,...,...,...,...
115,6.7,3.0,5.2,2.3,Iris-virginica
116,6.3,2.5,5.0,1.9,Iris-virginica
117,6.5,3.0,5.2,2.0,Iris-virginica
118,6.2,3.4,5.4,2.3,Iris-virginica


Let's take a look at the data and plot out the petal length and width of the irises:

In [None]:
import bokeh.plotting as bpl
import bokeh.models as bmo
from bokeh.palettes import d3
bpl.output_notebook()
source = bpl.ColumnDataSource(iris)

# use whatever palette you want...
palette = d3['Category10'][len(iris['Class'].unique())]
color_map = bmo.CategoricalColorMapper(factors=iris['Class'].unique(),
                                   palette=palette)

# create figure and plot
p = bpl.figure(title="The Petal Length and Width of Different Iris Classes", x_axis_label="Petal Length", y_axis_label="Petal Width")
p.scatter(x='Petal Length', y='Petal Width',
          color={'field': 'Class', 'transform': color_map},
          legend_field='Class', source=source)
bpl.show(p)


The data seems easy to classify since the classes are nicely separated!

<h2 style="color:red">Build a kNN classifier and use it to classify instances</h2>
<span style="color:red">The instances I would like you to classify are at:</span>.  

     https://raw.githubusercontent.com/zacharski/ml-class/master/data/irisTest.csv
    


At this point you should have 4 data structures: the training features, the training classes or labels, the test features and the test labels.

#### First, create a classifier using Euclidean distance and k=3

In [None]:
# TO DO


#### Next, `fit` the training set data

In [None]:
# TO DO


<h2 style="color:red">Accuracy</h2>

I am defining accuracy as:

### $$accuracy= \frac{totalNumberCorrectlyClassified}{totalNumberOfCasesInOurTestSet} $$

So if we have 40 items in our test set and we classified 30 correctly we have an accuracy of 0.75 or 75%
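Written out by hand, that definition is just a count and a division. The sketch below uses made-up labels (`'a'` and `'b'`) purely to reproduce the 30-out-of-40 example; it is not the sklearn method you are asked to find.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test cases where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# 40 hypothetical test items, all truly class 'a'.
true_labels = ['a'] * 40
# The imaginary classifier gets 30 right and 10 wrong.
predicted_labels = ['a'] * 30 + ['b'] * 10

print(accuracy(true_labels, predicted_labels))
```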

<span style="color:red">I would like you to calculate the accuracy</span>.  

`sklearn` has a method that does this. To see how to use it google `sklearn accuracy`



In [None]:
# your code here


<h2 style="color:red">A few questions</h2>

        
1. Instead of using all 4 features, do you get as good an accuracy with just using sepal length and width?
2. What about using just petal length and width?
3. Do you get as good a performance when using Manhattan distance?

Please show your work