# Module 2: Supervised Machine Learning 

## Supervised Machine Learning

- There are <b><em>known</em></b> input and output (outcomes) pairs
- Use these known examples (training set) to train a machine learning model
- Apply the trained model to predict the outcomes for given input values (test set)
- Evaluate the performance of the machine learning model

## Classification and Regression
- <b>Classification</b>: predict a discrete class label (binary or multiclass)
- <b>Regression</b>: predict a continuous number

## Generalization, Overfitting, and Underfitting

- <b>Generalization</b>: a model can make accurate predictions on unseen data
- <b>Overfitting</b>: fit a complicated model too closely to the specific characteristics of the training set. The performance of the model is poor in the test set.
- <b>Underfitting</b>: fit a simple model which cannot capture the essential aspects of the variability in the data. The performance of the model is poor in both training and test sets.

<div style="text-align:center"><img style="width:200%" src="fitting.png"></div>
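As a rough illustration of underfitting and overfitting, the sketch below trains decision trees of increasing depth on the breast cancer dataset and compares training and test accuracy. The depths `1`, `4`, and `None` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42
)

# Depths are illustrative assumptions: very shallow, moderate, unconstrained.
scores = {}
for depth in (1, 4, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

A large gap between training and test accuracy suggests overfitting, while low accuracy on both suggests underfitting.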


## Supervised Machine Learning Algorithms
- K-Nearest Neighbors
- Linear Models
- Decision Trees and Random Forest
- Support Vector Machines
- Neural Networks
- ...

## Decision Trees

- Decision trees are used for classification and regression
- Decision trees go through a hierarchy of if/else questions and make a decision eventually
- The CART (Classification and Regression Tree) algorithm used by `Scikit-Learn` produces binary trees.
- Decision Trees are intuitive, and their predictions are easily interpretable. 
- Decision trees require very little data preparation.

## Training a Decision Tree

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [2]:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

In [3]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))

Accuracy on training set: 0.988


In [4]:
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

Accuracy on test set: 0.951


## Making Predictions

To classify a new data point, we start at the root node (at the top) and answer the binary question at each node, moving to the left or right child accordingly, until we reach a leaf. That leaf gives the predicted class.
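The traversal can be sketched by hand using the fitted tree's `tree_` attribute. The helper `predict_one` below is hypothetical (written for illustration only); it walks from the root to a leaf and should agree with `predict`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42
)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

def predict_one(clf, x):
    """Walk from the root to a leaf (hypothetical helper, for illustration)."""
    t = clf.tree_
    x = np.asarray(x, dtype=np.float32)  # scikit-learn trees work on float32 inputs
    node = 0                             # the root is node 0
    while t.children_left[node] != -1:   # -1 marks a leaf
        if x[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    # The predicted class is the majority class among the leaf's training samples
    return clf.classes_[np.argmax(t.value[node][0])]

manual = np.array([predict_one(clf, x) for x in X_test])
print("matches clf.predict:", (manual == clf.predict(X_test)).all())
```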

## The CART Algorithm

The CART algorithm works by first splitting the training set into two subsets using a single feature and a threshold. Once it has successfully split the initial training data, it does the same thing to both subsets, recursively. It stops recursing once it reaches the maximum allowed tree depth (the `max_depth` hyper-parameter), or if it cannot find a split that reduces impurity.
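A minimal sketch of a single split search is shown below. It assumes Gini impurity as the impurity measure (the default criterion of `DecisionTreeClassifier`); the functions `gini` and `best_split` and the toy data are hypothetical and only illustrate the idea, not Scikit-Learn's actual implementation.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array (assumed impurity measure)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature, threshold, cost) minimizing the weighted child impurity."""
    m, n = X.shape
    best = (None, None, gini(y))  # baseline: impurity without any split
    for k in range(n):
        values = np.unique(X[:, k])
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints between sorted values
        for t in thresholds:
            left = X[:, k] <= t
            cost = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / m
            if cost < best[2]:  # keep only splits that reduce impurity
                best = (k, t, cost)
    return best

# Hypothetical toy data: both features can separate the classes perfectly
X_toy = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
y_toy = np.array([0, 0, 1, 1])
feature, threshold, cost = best_split(X_toy, y_toy)
print(feature, threshold, cost)
```

A full tree builder would call `best_split` recursively on each subset until `max_depth` is reached or no split lowers the cost.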

## Regularization Hyperparameters

**Decision Trees make very few assumptions about the training data**. If left unconstrained, a decision tree will adapt itself to perfectly fit the training data, which leads to overfitting.

We can restrict the maximum depth of the decision tree, among other regularization hyper-parameters:
- `min_samples_split`: The minimum number of samples a node must have for it to split.
- `min_samples_leaf`: The minimum number of samples a leaf must have.
- `min_weight_fraction_leaf`: Same as `min_samples_leaf`, but expressed as a fraction of the total number of weighted instances.
- `max_leaf_nodes`: the maximum number of leaf nodes.
- `max_features`: The maximum number of features that are evaluated for any split.
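The effect of such constraints can be sketched by comparing an unconstrained tree with a regularized one. The value `min_samples_leaf=10` is an arbitrary illustrative choice, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42
)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# min_samples_leaf=10 is an assumed, illustrative value
regularized = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", unconstrained), ("regularized", regularized)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")

# Size of each leaf in the regularized tree (leaves have children_left == -1)
leaf_sizes = regularized.tree_.n_node_samples[regularized.tree_.children_left == -1]
print("smallest leaf:", leaf_sizes.min())
```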

## Regression

Decision Trees are also capable of performing regression tasks. When performing regression, the prediction at each node is a value, not a class.

In [5]:
from sklearn.tree import DecisionTreeRegressor
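
A minimal sketch of a regression tree follows, assuming a synthetic noisy quadratic dataset (the data, the seed, and `max_depth=2` are illustrative assumptions). A tree of depth 2 has at most four leaves, so it can output at most four distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): y = x^2 plus a little Gaussian noise
rng = np.random.default_rng(42)
X_quad = rng.uniform(-0.5, 0.5, size=(200, 1))
y_quad = X_quad[:, 0] ** 2 + 0.025 * rng.standard_normal(200)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

# Each prediction is the mean target value of the training samples in one leaf
predictions = tree_reg.predict(np.linspace(-0.5, 0.5, 50).reshape(-1, 1))
print(np.unique(predictions))
```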

## Instability

Decision Trees have a few limitations:
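- Decision Trees are sensitive to training set rotation, because their decision boundaries are always orthogonal to a feature axis. One way to limit this problem is to use PCA (Principal Component Analysis), which often results in a better orientation of the training data.
- Decision Trees are sensitive to small variations in the training data. Because the training algorithm is stochastic, you might even get different models for the same training dataset (unless you set `random_state`). Random Forests limit this instability by averaging the predictions of many decision trees.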
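
As a sketch of the averaging idea, the example below compares a single tree with a `RandomForestClassifier` on the breast cancer dataset. The number of trees (`n_estimators=100`) and the seeds are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42
)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# The forest trains many trees on bootstrap samples and combines their predictions
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_acc = single_tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"single tree test accuracy: {tree_acc:.3f}")
print(f"random forest test accuracy: {forest_acc:.3f}")
```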

# Support Vector Machines (SVM)

- SVM is a powerful ML model that is capable of performing Classification and Regression. 

- SVMs are particularly suited for complex small-to-medium sized datasets.

**The Fundamental Idea Behind Support Vector Machines**

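- For classification, the premise of SVMs is to find a decision boundary that <b>maximizes the distance</b> between itself and the nearest training points (the widest possible "street") while minimizing the number of margin violations, i.e., instances that end up inside the street or on the wrong side of it.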

- For regression, the objective is reversed: the SVM fits a street that stays <b>as close as possible</b> to the training instances, so that most of them lie on it. Violations in this case are the data points that fall "outside" of the street.

**Support Vectors**

- A support vector is a training instance that lies on the edge of the street and therefore determines its boundary, hence it's considered a "support" for it. Any instance that is not a support vector has no influence on the decision boundary.
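
A small sketch of this property: after fitting a linear `SVC`, the indices of the support vectors are available in `support_`, and refitting on those instances alone should give the same predictions. The choice of classes (Setosa vs. Versicolor), the two petal features, and `C=100` are illustrative assumptions.

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris["target"] < 2               # keep Setosa and Versicolor only (assumption)
X_sv = iris["data"][mask][:, [2, 3]]    # petal length, petal width
y_sv = iris["target"][mask]

svm = SVC(kernel="linear", C=100).fit(X_sv, y_sv)
print("training instances:", len(X_sv))
print("support vectors:", len(svm.support_))

# Refit on the support vectors alone and compare predictions on the full set
svm_sv_only = SVC(kernel="linear", C=100).fit(X_sv[svm.support_], y_sv[svm.support_])
same = (svm.predict(X_sv) == svm_sv_only.predict(X_sv)).all()
print("same predictions:", same)
```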

**It is important to scale the input when using SVMs**

- SVMs are sensitive to feature scales. If the features are not scaled, the algorithm tends to neglect features with small scales, and the resulting street is narrower than it could be. With properly scaled features (e.g., using `StandardScaler`), the street can be fitted evenly across all features.
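
A quick sketch of what `StandardScaler` does, using a hypothetical toy array whose two features live on very different scales: after scaling, each feature has zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: first feature in single digits, second in tens of thousands
X_raw = np.array([[1.0, 50_000.0],
                  [5.0, 20_000.0],
                  [3.0, 80_000.0],
                  [5.0, 60_000.0]])

X_scaled = StandardScaler().fit_transform(X_raw)
print("means:", X_scaled.mean(axis=0))
print("stds: ", X_scaled.std(axis=0))
```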

## Linear SVM Classification

The fundamental idea behind SVMs can be explained with the following picture:

<div style="text-align:center;"><img style="width:66%;" src="SVM_example.png" /></div>

We can see that the two classes can be separated easily by a straight line (linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. 

The dashed line model is so bad that it doesn't even separate the two groups linearly. 

The other two models work perfectly on the plotted training set but their boundaries are so close to the training data points that they'll probably not perform well on unseen data.

In contrast, the model on the right not only separates the training data linearly, it also stays as far as possible from the data points of both classes. Thus, it will likely perform well on unseen data.

In [6]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [7]:
iris = datasets.load_iris()

In [8]:
iris['data'].shape

(150, 4)

In [9]:
X = iris['data'][:, [2,3]]  # Petal Length, Petal Width
y = (iris['target'] == 2).astype(np.float64)  # Iris Virginica
X.shape, y.shape

((150, 2), (150,))

In [10]:
svm_clf = Pipeline([
    ('Scaler', StandardScaler()),
    ('Linear_svc', LinearSVC(C=1, loss='hinge'))
])

In [11]:
svm_clf.fit(X, y)

In [12]:
svm_clf.predict([[5.5, 1.7]])

array([1.])

## NonLinear SVM Classification

Many datasets are not even close to being linearly separable. One approach to handling non-linear modeling is to add more features, such as polynomial features. In some cases this can result in linearly separable datasets.

The following is an example of an original non-linearly separable dataset with only one feature $x_{1}$ (on the left), and an augmented linearly separable dataset with an added feature $x_{2}=x_{1}^{2}$ (on the right):

<div style="text-align:center;"><img style="width:66%;" src="nonlinear_to_linear.png" /></div>
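
The same idea can be sketched numerically. The nine points and their labels below are assumptions chosen to mimic the figure (they are not taken from it), and a linear `SVC` with `C=100` stands in for any linear classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed 1D dataset: class 1 in the middle, class 0 on both sides
x1 = np.arange(-4, 5, dtype=float)        # -4, -3, ..., 4
y_1d = (np.abs(x1) <= 2).astype(int)

X_1d = x1.reshape(-1, 1)                  # original feature only
X_2d = np.c_[x1, x1 ** 2]                 # add x2 = x1^2

acc_1d = SVC(kernel="linear", C=100).fit(X_1d, y_1d).score(X_1d, y_1d)
acc_2d = SVC(kernel="linear", C=100).fit(X_2d, y_1d).score(X_2d, y_1d)
print(f"accuracy with x1 only:    {acc_1d:.2f}")
print(f"accuracy with x1 and x2:  {acc_2d:.2f}")
```

With only $x_{1}$, no single threshold can isolate the middle class; with $x_{2}$ added, a horizontal line in the $(x_{1}, x_{2})$ plane separates the two classes.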

In [13]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

In [14]:
X, y = make_moons(n_samples=100, noise=0.15)

In [15]:
X

array([[ 0.48941112, -0.52156111],
       [ 2.13593337, -0.00646819],
       [-0.9593727 ,  0.77105072],
       [ 2.09137354,  0.71192099],
       [ 1.01961399, -0.42097565],
       [ 1.42640862, -0.25706359],
       [ 2.06875838,  0.46997931],
       [ 0.85316271,  0.17780892],
       [-0.97713348,  0.15463284],
       [-0.72222232,  0.2968266 ],
       [ 1.79083738, -0.15196019],
       [-0.70573884,  0.25771783],
       [ 0.09249781,  0.93169391],
       [ 0.92505462,  0.78137093],
       [ 0.60115386, -0.45585128],
       [ 1.79840872,  0.46638158],
       [-1.12341819,  0.14484513],
       [-0.11819399,  1.28947113],
       [ 0.06558436,  0.11229213],
       [ 0.86578511,  0.84204781],
       [ 0.80124032,  0.54241667],
       [-0.84970907,  0.55867949],
       [ 0.1247648 ,  0.11632774],
       [ 0.58286682,  0.38316864],
       [-0.78622713,  1.1007139 ],
       [-0.1627151 ,  0.36334335],
       [ 1.86199124, -0.14628406],
       [ 0.74373956,  0.6671408 ],
       [-0.8574683 ,

In [16]:
y

array([1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1], dtype=int64)

In [17]:
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])

In [18]:
polynomial_svm_clf.fit(X, y)

In [19]:
polynomial_svm_clf.score(X, y)

1.0

The following figure shows the decision boundaries of the model. Because we added polynomial features, the boundaries are now non-linear in the original feature space:

<div style="text-align:center;"><img style="width:50%;" src="polynomial_svms.png" /></div>

### Polynomial Kernels

A mathematical technique called the **kernel trick** makes it possible to get the same result as if we had added many polynomial features, without actually adding them.
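
A small worked check of the idea, assuming a degree-2 polynomial kernel of the form $(a \cdot b + 1)^{2}$ on 2D points: the kernel value equals the dot product of explicitly expanded feature vectors, yet the kernel never builds them. The helpers `phi` and `poly_kernel` are hypothetical names used only for this illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D point (illustrative helper)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

def poly_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel computed directly in the original 2D space."""
    return (np.dot(a, b) + coef0) ** degree

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # dot product in the 6D expanded space
implicit = poly_kernel(a, b)        # same value, computed in 2D
print(explicit, implicit)
```

Here $a \cdot b = 1$, so both quantities equal $(1 + 1)^{2} = 4$ up to floating-point rounding.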

In [20]:
from sklearn.svm import SVC

In [21]:
poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5))
])

In [22]:
poly_kernel_svm_clf.fit(X, y)

This model trains an SVM classifier using a third-degree polynomial kernel.

If our model is overfitting, we might want to decrease the polynomial degree.

If it's underfitting, it might be a good idea to increase the degree.

`coef0` controls how much the model is influenced by high-degree polynomials vs. low degree polynomials.

The following figure shows the previously trained model (on the left) vs. a more complex model of kernel degree 10:

<div style="text-align:center;"><img style="width:66%;" src="kernel_trick.png" /></div>

### Similarity Features

Another technique to tackle non-linear problems is to add features computed using a **similarity function**, which measures how much each instance resembles a particular landmark.

For example, let's take the 1D dataset discussed earlier & add two landmarks to it at $x_{1}=-2$ and $x_{1}=1$, as showcased in the left plot of:

<div style="text-align:center;"><img style="width:66%;" src="similarity_measures.png" /></div>

The similarity function is the **Gaussian Radial Basis Function (RBF)**. 

As we can see from the plot on the right, the instances become linearly separable using only the similarity features.
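
The similarity features can be computed by hand. The sketch below assumes the Gaussian RBF in the form $\phi_{\gamma}(x, \ell) = \exp(-\gamma (x - \ell)^{2})$, a value of $\gamma = 0.3$, and nine evenly spaced 1D points; none of these values are stated in the text above, so treat them as illustrative.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Gaussian RBF similarity between 1D values x and a landmark."""
    return np.exp(-gamma * (x - landmark) ** 2)

x1 = np.arange(-4, 5, dtype=float)   # assumed 1D dataset: -4, -3, ..., 4
gamma = 0.3                          # assumed value

x2 = gaussian_rbf(x1, -2.0, gamma)   # similarity to the landmark at x1 = -2
x3 = gaussian_rbf(x1, 1.0, gamma)    # similarity to the landmark at x1 = 1

# The instance x1 = -1 is at distance 1 from the first landmark and 2 from the second
i = np.flatnonzero(x1 == -1.0)[0]
print(f"x1 = -1 -> x2 = {x2[i]:.2f}, x3 = {x3[i]:.2f}")
```

Each similarity lies between 0 (far from the landmark) and 1 (exactly at the landmark).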

### Gaussian RBF Kernel

The similarity features method can be useful with many ML algorithms. The problem is that with very large datasets we end up with a very large feature space. Once again, the kernel trick makes it possible to get the same result as if we had added the additional features, without actually adding them.

In [23]:
from sklearn.svm import SVC

In [24]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])

In [25]:
rbf_kernel_svm_clf.fit(X, y)

Let's take a look at the predictions space with the training set instances (bottom left is the trained model above), others correspond to different hyper-parameter configurations:

<div style="text-align:center;"><img style="width:66%;" src="training_rbfs.png" /></div>

Increasing $\gamma$ makes the decision boundary more irregular, wiggling around individual instances.
- Increasing $\gamma$ increases model variance (may lead to overfitting).
- Decreasing $\gamma$ increases model bias (may lead to underfitting).
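
A rough sketch of this effect on the moons dataset: with a very large $\gamma$ the model can follow individual training instances, so its training accuracy should be at least as high as with a very small $\gamma$. The values `0.01`, `100`, and `C=1000`, as well as the helper `rbf_train_accuracy`, are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_m, y_m = make_moons(n_samples=100, noise=0.15, random_state=42)

def rbf_train_accuracy(gamma, C=1000):
    """Fit a scaled RBF-kernel SVC and return its training accuracy."""
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C)),
    ])
    return model.fit(X_m, y_m).score(X_m, y_m)

acc_low = rbf_train_accuracy(gamma=0.01)   # smooth, almost linear boundary
acc_high = rbf_train_accuracy(gamma=100)   # very irregular boundary
print(f"gamma=0.01: train accuracy = {acc_low:.2f}")
print(f"gamma=100:  train accuracy = {acc_high:.2f}")
```

Training accuracy alone cannot reveal overfitting; comparing it against accuracy on held-out data would be the natural next step.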

We should always try the linear kernel first. If the training set is not too large, we should also try the Gaussian RBF kernel.

`LinearSVC` doesn't support the kernel trick. Its algorithm takes longer if we ask for higher precision, which is controlled by the tolerance hyper-parameter $\epsilon$ (called `tol` in `Scikit-Learn`).

`SVC` is based on `libsvm`, which supports the kernel trick. It gets dreadfully slow when the number of training instances gets large. This algorithm is good for small to medium-sized datasets and scales well with the number of features.

## SVM Regression

- SVMs also support linear and nonlinear regression. 
- SVM regression tries to fit **as many instances as possible** on the street while limiting margin violations.
- The width of the street is controlled by the hyper-parameter $\epsilon$.

<div style="text-align:center;"><img style="width:66%;" src="SVM_regression.png" /></div>

- `SVR` and `LinearSVR` in `sklearn` are for SVM Regression
- `LinearSVR` scales linearly with the size of the training set, while `SVR` is much slower (just like `LinearSVC` & `SVC`).

In [26]:
from sklearn.svm import LinearSVR

In [27]:
svm_reg = LinearSVR(epsilon=1.5)

In [28]:
svm_reg.fit(X, y)
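
To see the role of $\epsilon$, the sketch below fits `LinearSVR` on synthetic linear data with two street widths and measures the fraction of training instances that fall inside the street. The data, the seed, the two $\epsilon$ values, and the helper `fraction_inside` are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVR

# Synthetic data (assumption): y = 4 + 3x plus Gaussian noise
rng = np.random.default_rng(42)
X_lin = 2 * rng.random((50, 1))
y_lin = 4 + 3 * X_lin[:, 0] + rng.standard_normal(50)

def fraction_inside(epsilon):
    """Fraction of training instances within epsilon of the fitted line."""
    reg = LinearSVR(epsilon=epsilon, random_state=42, max_iter=10_000)
    reg.fit(X_lin, y_lin)
    residuals = np.abs(y_lin - reg.predict(X_lin))
    return np.mean(residuals <= epsilon)

narrow = fraction_inside(0.5)
wide = fraction_inside(1.5)
print(f"epsilon=0.5: {narrow:.0%} of instances inside the street")
print(f"epsilon=1.5: {wide:.0%} of instances inside the street")
```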

We can also use a kernelized SVM model for nonlinear regression:

In [29]:
from sklearn.svm import SVR

In [30]:
svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1, gamma='auto')

In [31]:
svm_poly_reg.fit(X, y)