# Module 3 Fundamental Algorithms II

# Introduction Slide
### Module 3 Fundamental Algorithms II
- K-nearest Neighbors
- Support Vector Machine
- Random Forest

---
## Introduction Script
Hello and welcome. 

In this module, you will learn three more supervised machine learning algorithms. Lesson 1 introduces k-nearest neighbors, lesson 2 introduces the support vector machine, and lesson 3 introduces random forest. As in the last module, the lesson notebooks often include some Python code that creates plots to help you understand some concepts. You are not required to understand that code. You should, however, understand how the algorithms work in general and, more importantly, know how to apply them using Python scripts. As we demonstrated in the last module, you just need to follow the same few standard steps to apply machine learning models.

You also need to know the key hyperparameters of each machine learning model and the value options for those key hyperparameters.

As I mentioned before, please watch the video to learn the concepts behind the algorithms, and more importantly, go through the lesson notebooks and practice as much as you can.

---
## Lesson 1: Introduction to K-nearest Neighbors

---
### Slide 1
#### K-nearest Neighbors

- Supervised Learning
- For both classification and regression problems
- No model built (memory demanding)
- Predict with class/value of K nearest neighbors
- Feature Scaling is important



### Slide 1 Script

In this lesson, we will learn k-nearest neighbors, a very intuitive machine learning algorithm.

Most machine learning algorithms create a model by fitting to a training dataset. For example, linear regression creates a model that is defined by the intercept and the coefficient of each feature. A decision tree creates a tree that is defined by the conditions at each node.

K-nearest neighbors, or KNN, on the other hand, doesn't build any model from the training data. Instead, it stores all training data points in an efficient way. To classify a new data point, KNN simply finds the training data points that are closest to it and uses these neighbors to predict the outcome for the new data point.

To find the neighbors of a data point, we need to calculate the distance between two data points from their features. A feature with a much larger numeric range would otherwise dominate that distance, so it's important to scale the features when using KNN.

KNN can be used for both classification and regression problems. We'll demonstrate how KNN classification works in the next slide.

### Slide 2

<img src='images/knn2.png' width=400>


### Slide 2 Script

In this image, the dots are training data points. There are two different classes in the dataset, yellow and blue. The red star is a new data point we are trying to classify.

In KNN, k represents the number of neighbors, which is chosen in advance. Assume we choose k equal to 3. We then find the 3 data points closest to the red star. In this case, they are 2 blue dots and 1 yellow dot, so we classify the red star with the majority class, which is blue.
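The voting procedure described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `knn_predict` and the coordinates are hypothetical, chosen so that the 3 nearest neighbors are 2 blue dots and 1 yellow dot. It is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Straight-line (Euclidean) distance from the new point to every training point
    distances = np.linalg.norm(train_points - new_point, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the classes among those neighbors and return the most common one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical coordinates, made up for this illustration
train_points = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # blue dots
    [3.0, 2.5], [5.0, 5.0], [6.0, 5.0],   # yellow dots
])
train_labels = np.array(["blue", "blue", "blue", "yellow", "yellow", "yellow"])
red_star = np.array([2.5, 2.0])

prediction = knn_predict(train_points, train_labels, red_star, k=3)
print(prediction)
```

With these made-up points, the three nearest neighbors of the red star are one yellow dot and two blue dots, so the vote goes to blue.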

The concept of KNN is very simple, but there are several things that are worth further discussion.

The first is how to calculate the distance between two data points. In this image, we use the straight-line distance between two dots. This is the so-called Euclidean distance. There are many other ways to calculate distance, and we will not discuss the geometry in this lesson. But one thing is certain: no matter how we calculate distance, all features must be on the same scale or in the same unit for the calculation to be meaningful. So it's very important that we scale the features when we use KNN.
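To see why scale matters, here is a small sketch with made-up data (ages in years and incomes in dollars are assumptions for illustration only). Before scaling, the income column dominates the distance. After standardizing each column to zero mean and unit variance, the nearest neighbor of point A changes.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars
points = np.array([
    [25.0, 50_000.0],   # A, the point whose neighbors we want
    [60.0, 50_500.0],   # B, very different age, similar income
    [26.0, 51_000.0],   # C, similar age, slightly different income
    [45.0, 60_000.0],   # D
])


def euclidean(a, b):
    """Straight-line distance between two points."""
    return np.sqrt(np.sum((a - b) ** 2))


# Unscaled: the income difference swamps the age difference
d_ab_raw = euclidean(points[0], points[1])
d_ac_raw = euclidean(points[0], points[2])

# Standardize each column: subtract its mean, divide by its standard deviation
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
d_ab_scaled = euclidean(scaled[0], scaled[1])
d_ac_scaled = euclidean(scaled[0], scaled[2])

print(f"raw:    A-B = {d_ab_raw:.1f}, A-C = {d_ac_raw:.1f}")
print(f"scaled: A-B = {d_ab_scaled:.2f}, A-C = {d_ac_scaled:.2f}")
```

On the raw data B looks closer to A than C does, even though B is 35 years older. After scaling, C becomes the closer neighbor.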

---
### Slide 3 (Not needed?)

#### Distance Measurement
- Euclidean
- Manhattan
- Minkowski


### Slide 3 Script
The first thing is how to calculate the distance between two data points. In the image on the previous slide, we took the distance to be the straight-line distance between two dots. This is the so-called Euclidean distance. There are many other ways to calculate distance, such as the Manhattan and Minkowski distances, and we will not discuss the geometry in more detail. But one thing is clear: no matter how we calculate distance, all features must be on the same scale or in the same unit for the calculation to be meaningful. So it's very important that we scale the features when we use KNN.
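As a rough illustration of the three distance measurements listed on the slide, the sketch below computes them for two made-up points. The Minkowski distance has a parameter `p`; it reduces to the Manhattan distance when `p` is 1 and to the Euclidean distance when `p` is 2. The helper name `minkowski` is chosen here for illustration.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p between points a and b."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


# Two hypothetical points
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)   # sum of absolute differences: 3 + 4
euclidean = minkowski(a, b, p=2)   # straight-line distance: sqrt(9 + 16)
order_3 = minkowski(a, b, p=3)     # a higher-order Minkowski distance

print(manhattan, euclidean, order_3)
```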

---
### Slide 4

#### Choose Proper K
- Can’t be too small
- Can’t be too large
- Better be odd

<img src='images/knn.png' width=500>

### Slide 4 Script
The other thing is how to determine k. In the previous slide we chose k equal to 3. What happens when we choose a different k?

In this slide, we demonstrate the classification with k equal to 3 and k equal to 6. As shown in the image, when k is 3, the red star is classified as blue since there are 2 blue dots and 1 yellow dot among the 3 closest neighbors. But when we use the 6 closest neighbors for the classification, the red star is classified as yellow since there are now more yellow dots among the 6 neighbors.

This shows the importance of choosing a proper k. There is a programmatic way to help us choose the best k, which we will discuss in future lessons. In this lesson, we will just introduce some general rules for choosing the value of k.

First, k can't be too small, since a very small k amplifies the impact of noise.

Second, k can't be too large, since a very large k biases the classification toward the majority class. For example, if we choose k equal to the total number of dots in this image, which is 9, then no matter where the new data point is, it will be classified as blue, since blue is the majority class of the training dataset.

It is also better for k to be odd if there is an even number of outcomes. This is to avoid a tie. There are several other rules, which we discuss in more detail in the lesson notebook.
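The effect of k can be reproduced with a small sketch. The 9 coordinates below are hypothetical (they are not read from the slide image) and are arranged so that blue is the overall majority class, with 5 blue and 4 yellow dots.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: 5 blue dots and 4 yellow dots
points = np.array([
    [0.5, 0.0], [0.0, 0.6], [3.0, 3.0], [3.5, 3.0], [3.0, 3.5],   # blue
    [0.7, 0.0], [-1.0, 0.0], [0.0, -1.1], [-0.85, -0.85],         # yellow
])
labels = np.array(["blue"] * 5 + ["yellow"] * 4)
red_star = np.array([[0.0, 0.0]])

predictions = {}
for k in (3, 6, 9):
    knc = KNeighborsClassifier(n_neighbors=k)
    knc.fit(points, labels)
    # predict returns an array with one label per query point
    predictions[k] = knc.predict(red_star)[0]
    print(f"k={k}: {predictions[k]}")
```

With this arrangement, k=3 finds 2 blue and 1 yellow neighbors, k=6 finds 2 blue and 4 yellow, and k=9 uses every dot, so the prediction falls back to the majority class.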

---
### Slide 5 (Not needed?)
#### Hyperparameters
- `n_neighbors`: Number of neighbors to use, default is 5.
- `metric` : the distance metric to use. The default metric is `'minkowski'`.

### Slide 5 Script


---
### Slide 6
#### Scikit Learn KNeighborsClassifier model
```
from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(d_train, l_train)
score = 100.0 * knc.score(d_test, l_test)
```


### Slide 6 Script

Scikit-learn defines both a k-nearest neighbors classifier and a k-nearest neighbors regressor, for classification and regression problems respectively. We follow the same standard steps to apply the KNN classifier and the KNN regressor.

As we mentioned above, for a k-nearest neighbors model we need to determine the value of k, which is represented by the hyperparameter n_neighbors. If we don't want to use the default method to calculate distance, we can also set metric, which is the hyperparameter for the distance calculation method.
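As a minimal sketch of setting both hyperparameters, the example below combines feature scaling and the classifier in one pipeline. The iris dataset, the split size, and the choice of `'manhattan'` are assumptions made for illustration; they stand in for the `d_train` and `l_train` variables used on the slide.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset for the illustration
data, labels = load_iris(return_X_y=True)
d_train, d_test, l_train, l_test = train_test_split(
    data, labels, test_size=0.4, random_state=23)

# Scale the features first, then classify with 5 neighbors and Manhattan distance
knc = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric='manhattan'),
)
knc.fit(d_train, l_train)
score = 100.0 * knc.score(d_test, l_test)
print(f"KNN accuracy: {score:.1f}%")
```

Putting the scaler inside the pipeline means the scaling learned from the training data is applied automatically to the test data.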

---
### Slide 7 (Not needed)
#### Scikit Learn KNeighborsRegressor model
```
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor(n_neighbors=5)
knr.fit(x_train, y_train)
y_pred = knr.predict(x_test)
```

---
## Lesson 2: Introduction to Support Vector Machine

---
### Slide 1
#### Support Vector Machine
- Supervised learning
- For both classification and regression problems
- Performs classification by finding the hyperplane that maximizes the margin between the two classes
- Feature scaling is important



---
### Slide 1 Script
In this lesson, we will learn about the support vector machine, a supervised learning algorithm that can be used for both classification and regression problems. A support vector machine performs classification by finding the hyperplane that maximizes the margin between the two classes. We will demonstrate the concept in the next slide.

### Slide 2
<img src='images/svm_iris.png' width=400>

### Slide 2 Script
Here we use a subset of the iris dataset to demonstrate the support vector machine.

The two species of iris, setosa and versicolor, are plotted in this two-feature space. We can easily separate the two species visually, since setosa has a shorter petal width than versicolor. But how do we separate them with an algorithm?

The support vector machine does this by finding a hyperplane, represented by the gray line in the image, that separates the two species. There could be many such hyperplanes, and we need to find the best one, which is the one with the maximum margin to the two classes. To calculate the margin, we use the data points in each class that are closest to the hyperplane. Those data points are called support vectors. We then define the boundaries of the margin with the support vectors. By maximizing the distance between the two boundaries, we find the best hyperplane. This is similar to linear regression, where we find the best parameters by minimizing the overall error.

So in a support vector machine model, only the support vectors matter, and all other training data points are negligible. The model classifies new data points using the support vectors only. Compared with k-nearest neighbors, introduced in lesson 1, which stores all the training data, a support vector machine is much less demanding on storage or memory.
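A fitted scikit-learn `SVC` exposes the support vectors it kept through its `support_vectors_` attribute. The sketch below assumes the two petal features of setosa and versicolor; the exact features used in the slide image may differ.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
# Keep only setosa (0) and versicolor (1), and only the two petal features
mask = iris.target < 2
data = iris.data[mask][:, 2:4]   # petal length, petal width
labels = iris.target[mask]

svc = SVC(kernel='linear')
svc.fit(data, labels)

# Only these points define the separating hyperplane
n_support = len(svc.support_vectors_)
train_score = svc.score(data, labels)
print(f"{n_support} of {len(data)} training points are support vectors")
```

Only a small fraction of the training points end up as support vectors, which is why the fitted model needs far less storage than a KNN model on the same data.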

---
### Slide 3
#### Support Vector Machine Kernel
- linear
- rbf: radial basis function

<img src='images/svm_rbf.png' width=500>

### Slide 3 Script
The classes in the iris example above are linearly separable, which means we can separate them with a straight line. When we have a dataset that is not linearly separable, we need to transform the dataset first. In this slide, the left image shows the original dataset with two features, x and y. The two classes are separated by a circle.

If we transform the features of the dataset, using the distance to the center of the circle as the new x and the angle with the x axis as the new y, we get the new representation of the dataset shown in the right image. Now we can use a straight line to separate the two classes.

In the scikit-learn support vector machine model, we can enable this kind of transformation by setting the kernel hyperparameter. A dataset like the one in this slide can be handled with the radial basis function, or rbf, kernel. There are other kernels, which we briefly introduce in the lesson notebook. You can choose the best kernel programmatically with the help of cross-validation, which we will discuss in future lessons.
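The sketch below illustrates the idea on synthetic data generated with `make_circles` (an assumption; this is not the dataset behind the slide image). It compares a linear kernel on the original features, an rbf kernel on the original features, and a linear kernel on the manually transformed features. Note that the rbf kernel does not literally compute distance and angle; it is a different transformation that also makes this kind of data separable.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: one class on an inner circle, the other on an outer circle
data, labels = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=23)

# A straight line cannot separate the circles in the original feature space
linear_score = SVC(kernel='linear').fit(data, labels).score(data, labels)

# The rbf kernel handles the circular boundary
rbf_score = SVC(kernel='rbf').fit(data, labels).score(data, labels)

# Manual transformation: distance to the center and angle with the x axis
radius = np.sqrt(data[:, 0] ** 2 + data[:, 1] ** 2)
angle = np.arctan2(data[:, 1], data[:, 0])
polar = np.column_stack([radius, angle])
polar_score = SVC(kernel='linear').fit(polar, labels).score(polar, labels)

print(f"linear: {linear_score:.2f}, rbf: {rbf_score:.2f}, polar + linear: {polar_score:.2f}")
```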

---
### Slide 4 (Not needed)
#### Hyperparameters
- `kernel`: controls the kernel used to transform the data into a linear space. Options include `linear`, `rbf`.
- `C`: penalty term for regularization, setting this high reduces the effects of regularization.
- `class_weight`: determines how unbalanced classes are handled.
- `random_state`: seed for random number generation, setting this ensures reproducibility.


---
### Slide 3 Script



---
### Slide 4
#### Scikit Learn SVC model
```
from sklearn.svm import SVC

svc = SVC(kernel='linear', random_state=23)
svc = svc.fit(d_train_sc, l_train)
score = svc.score(d_test_sc, l_test)
```


### Slide 4 Script

You should be pretty familiar with this piece of Python code by now. SVC stands for support vector classifier. Scikit-learn also defines a support vector regressor, which is named SVR. SVC and SVR are applied in the same way as other scikit-learn machine learning models. The most important hyperparameter for a support vector machine model is kernel. In this sample code we choose a linear kernel. By the way, the default kernel is the radial basis function.

---
### Slide 5 (Not needed)
#### Scikit Learn SVR model
```
from sklearn.svm import SVR

auto_model = SVR()
auto_model = auto_model.fit(x_train_ss, y_train)
score = auto_model.score(x_test_ss, y_test)
```

---
### Slide 5 Script


---
## Lesson 3: Introduction to Bagging and Random Forest

---
### Slide 1
#### Bagging

<img src='images/dt-rjb-2.png' width=500>

---
### Slide 1 Script
In module 2, we introduced the decision tree. Unless we limit its depth, a decision tree can classify every data point in the training dataset with a combination of conditions. However, this doesn't mean it can classify new data points accurately, since the conditions derived from the training data may not apply to new data. This problem is very common in machine learning: a model captures every detail in the training data, including the noise, but fails to generalize beyond it. This is called overfitting.

One of the biggest problems with the decision tree is that it's prone to overfitting. A simple approach to overcoming overfitting is to train many decision trees, each on a different subset of the data, and to average the resulting predictions. This process is known as bootstrap aggregation, which is often shortened to bagging.

This image demonstrates the bagging process. The big box on top represents the whole dataset. We can choose a subset of the original dataset as a new dataset. This new dataset not only has fewer data points, but its data points also have only a subset of the features. For example, if the original dataset has 10 features and 10000 data points, a subset may have only 5 features and 100 data points. We can randomly create many subsets from the original dataset. The subsets are independent of each other, which means they may have different features and data points. One important rule is that when a data point is picked for a subset, it remains in the original dataset so that it can be picked again. This is called sampling with replacement.

Next, we train a decision tree on each subset, use each tree to make a prediction on new data, and then combine the results of all the trees to make the final prediction.
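The steps above can be sketched by hand with NumPy and individual decision trees. This is a simplified illustration, not scikit-learn's own implementation; the iris dataset and the values of `n_trees`, `n_rows`, and `n_cols` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data, labels = load_iris(return_X_y=True)
d_train, d_test, l_train, l_test = train_test_split(
    data, labels, test_size=0.4, random_state=23)

rng = np.random.default_rng(23)
n_trees, n_rows, n_cols = 25, 60, 2   # hypothetical subset sizes

all_votes = []
for _ in range(n_trees):
    # Sample rows with replacement, so the same data point can be picked again
    rows = rng.choice(len(d_train), size=n_rows, replace=True)
    # Each subset also keeps only a random subset of the features
    cols = rng.choice(d_train.shape[1], size=n_cols, replace=False)
    tree = DecisionTreeClassifier(random_state=23)
    tree.fit(d_train[rows][:, cols], l_train[rows])
    # Each tree predicts on the new data using its own feature subset
    all_votes.append(tree.predict(d_test[:, cols]))

# Combine the trees with a majority vote for each test point
all_votes = np.array(all_votes)   # shape: (n_trees, number of test points)
predicted = np.array([np.bincount(votes).argmax() for votes in all_votes.T])
score = np.mean(predicted == l_test)
print(f"Bagged accuracy: {score:.2f}")
```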

You can apply bagging with different machine learning algorithms. In this lesson, we will introduce random forest, which is built from many decision trees trained on random subsets of the data. That's why it's called a random forest.

---
### Slide 2
#### Random Forest
- Supervised learning
- For both classification and regression problems
- Can handle categorical features (numerical)
- Feature scaling is not needed


### Slide 2 Script

Random forest inherits many properties from the decision tree because it's essentially a group of decision trees. It can be used for both classification and regression problems. As with a decision tree, we don't need to create dummy variables for categorical features, and we don't need to scale continuous features.

---
### Slide 3
#### Random Forest Classifier
```
from sklearn.ensemble import RandomForestClassifier

adult_model = RandomForestClassifier(n_estimators=10)
adult_model = adult_model.fit(d_train, l_train)
score = adult_model.score(d_test, l_test)
```

### Slide 3 Script
The scikit-learn random forest classifier and regressor are applied just like the other machine learning algorithms we've learned. Two important random forest hyperparameters are n_estimators and max_features. n_estimators defines the number of decision trees in the forest, while max_features limits the number of features each tree may use. In scikit-learn, that limit applies each time a tree looks for the best split. In the sample code in this slide, we set n_estimators to 10 and take the default value for max_features, which means we rely on the model to choose a proper value based on the dataset. Older scikit-learn releases named this default 'auto', while recent releases use 'sqrt' for the classifier. In the lesson 3 notebook we briefly discuss the hyperparameter values. We will learn how to choose the best values programmatically in future lessons.
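As a minimal sketch of setting both hyperparameters explicitly, the example below uses the iris dataset as a stand-in for the `d_train` and `l_train` variables on the slide. The value `max_features=2` is an arbitrary choice for illustration, not a recommended setting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in dataset for the illustration
data, labels = load_iris(return_X_y=True)
d_train, d_test, l_train, l_test = train_test_split(
    data, labels, test_size=0.4, random_state=23)

# 10 trees, each considering at most 2 features when it splits a node
model = RandomForestClassifier(n_estimators=10, max_features=2, random_state=23)
model = model.fit(d_train, l_train)
score = model.score(d_test, l_test)

# The fitted trees are available in the estimators_ attribute
n_trees = len(model.estimators_)
print(f"{n_trees} trees, accuracy: {score:.2f}")
```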

---
### Slide 4 (Not needed)
#### Random Forest Regressor
```
from sklearn.ensemble import RandomForestRegressor

auto_model = RandomForestRegressor(random_state=23)
auto_model = auto_model.fit(ind_train, dep_train)
score = auto_model.score(ind_test, dep_test)
```

### Slide 4 Script

# Module 3 Review

### Slide 1
#### Module 3 Review
- K-nearest Neighbors
- Support Vector Machine
- Random Forest

### Slide 1 Script
In this module, we learned 3 more machine learning algorithms, k-nearest neighbors, support vector machine and random forest. 

K-nearest neighbors uses the closest data points to classify a new data point. No model is built. Instead, the whole training dataset is stored and used to classify new data points. Thus k-nearest neighbors places a high demand on storage or memory. Choosing a proper k, or number of neighbors, is critical for a k-nearest neighbors model.

On the other hand, a support vector machine stores only the support vectors, which are the training data points used to define the boundaries of the hyperplane. For a support vector machine model, the most important hyperparameter is kernel, which defines the method used to separate the classes. Common kernels are the linear kernel and the radial basis function, or rbf, kernel.

Both k-nearest neighbors and the support vector machine require distance calculations, so feature scaling is important for the two algorithms.

Random forest is an ensemble learning method that trains many decision trees on subsets of the original training dataset, then aggregates the predictions of all the decision trees to make the final prediction. Random forest helps mitigate the overfitting problem of the decision tree and normally provides more accurate results. As with a decision tree, we don't need to scale features or create dummy variables for categorical features.


### Slide 2
#### Module 3 Assignment Review

```
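#1. Import metrics module
from sklearn import metrics
# Arguments are the true labels first, then the predicted labels
metrics.accuracy_score(l_test, predicted)
```

```
#2. Import method from metrics module
from sklearn.metrics import accuracy_score
# Arguments are the true labels first, then the predicted labels
accuracy_score(l_test, predicted)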
```

### Slide 2 Script
This module's assignment asks you to construct some machine learning models and calculate the accuracy score of each trained model using the scikit-learn metrics module. You might already know this, but I want to demonstrate different ways to call a module function here.

To calculate the accuracy score, for example, you can import the metrics module from scikit-learn, then call the accuracy_score function with the module name metrics as a prefix. Or you can import the accuracy_score function directly from the scikit-learn metrics module, then call the function by name.

We show both ways in this slide and we also use both ways in our lesson notebooks.

The assignment problems are fairly straightforward. Just remember to work on the problems in order.

Good luck.