In [3]:
#Chapter 5. Support Vector Machines

#A support vector machine (SVM) is a powerful and versatile machine learning model, capable of 
#performing linear or nonlinear classification, regression, and even novelty detection. SVMs shine 
#with small to medium-sized nonlinear datasets (i.e., hundreds to thousands of instances), especially 
#for classification tasks. However, they don’t scale very well to very large datasets, as you will see.
#This chapter will explain the core concepts of SVMs, how to use them, and how they work. Let’s jump right in!


#Linear SVM Classification
#The fundamental idea behind SVMs is best explained with some visuals. Figure 5-1 shows part of the iris 
#dataset that was introduced at the end of Chapter 4. The two classes can clearly be separated 
#with a straight line (they are linearly separable). The left plot shows the decision boundaries of 
#three possible linear classifiers. The model whose decision boundary is represented by the dashed 
#line is so bad that it does not even separate the classes properly. The other two models work 
#perfectly on this training set, but their decision boundaries come so close to the instances that
#these models will probably not perform as well on new instances. In contrast, the solid line in 
#the plot on the right represents the decision boundary of an SVM classifier; this line not only
#separates the two classes but also stays as far away from the closest training instances as 
#possible. You can think of an SVM classifier as fitting the widest possible street (represented
#by the parallel dashed lines) between the classes. This is called large margin classification.


#Notice that adding more training instances “off the street” will not affect the decision boundary at all:
#it is fully determined (or “supported”) by the instances located on the edge of the street. These
#instances are called the support vectors (they are circled in Figure 5-1).
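#A minimal sketch of the point above, assuming the setosa and versicolor classes and the two petal
#features (an illustrative choice, not necessarily the exact data behind Figure 5-1). It uses SVC with
#a linear kernel and a large C to approximate a hard margin, then checks that adding instances well
#off the street leaves the boundary essentially unchanged. The extra points are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris.target.values

# Keep only setosa (0) and versicolor (1), which are linearly separable
mask = (y == 0) | (y == 1)
X, y = X[mask], y[mask]

# A large C approximates a hard margin; a tight tol makes the two fits comparable
svm_clf = SVC(kernel="linear", C=1000, tol=1e-6).fit(X, y)
support_vectors = svm_clf.support_vectors_
print("Support vectors:\n", support_vectors)

# Hypothetical extra instances located far "off the street", on the correct sides
X_extra = np.array([[1.0, 0.1], [6.0, 2.0]])
y_extra = np.array([0, 1])
svm_clf2 = SVC(kernel="linear", C=1000, tol=1e-6).fit(
    np.vstack([X, X_extra]), np.concatenate([y, y_extra]))

# The weights and bias should match up to solver tolerance
boundary_unchanged = (
    np.allclose(svm_clf.coef_, svm_clf2.coef_, atol=1e-3)
    and np.allclose(svm_clf.intercept_, svm_clf2.intercept_, atol=1e-3)
)
print("Boundary unchanged:", boundary_unchanged)
```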


#SVMs are sensitive to the feature scales, as you can see in Figure 5-2. In the left plot, the vertical 
#scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. 
#After feature scaling (e.g., using Scikit-Learn’s StandardScaler), the decision boundary in the right 
#plot looks much better.
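#To make the effect of feature scales concrete, here is a small sketch on a made-up four-instance
#dataset whose second feature has a much larger scale than the first (the values are assumptions chosen
#for illustration). It compares the weights and street width learned with and without StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: feature 2 spans tens of units, feature 1 only a few
Xs = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])
ys = np.array([0, 0, 1, 1])

# Without scaling
svm_raw = SVC(kernel="linear", C=100).fit(Xs, ys)

# With scaling: each feature gets zero mean and unit variance
scaler = StandardScaler()
Xs_scaled = scaler.fit_transform(Xs)
svm_scaled = SVC(kernel="linear", C=100).fit(Xs_scaled, ys)

for name, clf in [("unscaled", svm_raw), ("scaled", svm_scaled)]:
    w = clf.coef_[0]
    # The street width is 2 / ||w||, measured in that model's feature space
    print(name, "weights:", w, "street width:", 2 / np.linalg.norm(w))
```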

#Soft Margin Classification

#If we strictly impose that all instances must be off the street and on the correct side, this is called 
#hard margin classification. There are two main issues with hard margin classification. First, it only 
#works if the data is linearly separable. Second, it is sensitive to outliers. Figure 5-3 shows the 
#iris dataset with just one additional outlier: on the left, it is impossible to find a hard margin; 
#on the right, the decision boundary ends up very different from the one we saw in Figure 5-1 without 
#the outlier, and the model will probably not generalize as well.


#To avoid these issues, we need to use a more flexible model. The objective is to find a good balance
#between keeping the street as large as possible and limiting the margin violations (i.e., instances 
#that end up in the middle of the street or even on the wrong side). This is called soft margin classification.

## In other words: keeping the street as wide as possible favors a model that generalizes well,
## while limiting margin violations keeps the number of misclassified training instances low.


#When creating an SVM model using Scikit-Learn, you can specify several hyperparameters, including the 
#regularization hyperparameter C. If you set it to a low value, then you end up with the model on the 
#left of Figure 5-4. With a high value, you get the model on the right. As you can see, reducing C makes 
#the street larger, but it also leads to more margin violations. In other words, reducing C results 
#in more instances supporting the street, so there’s less risk of overfitting. But if you reduce it 
#too much, then the model ends up underfitting, as seems to be the case here: the model with C=100 
#looks like it will generalize better than the one with C=1.
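#The trade-off controlled by C can be sketched numerically. This example swaps in SVC(kernel="linear")
#rather than LinearSVC (an assumption made here because its objective penalizes only the weights, which
#makes the street width easy to interpret). It measures the street width 2 / ||w|| in the scaled feature
#space and counts the instances that fall inside the street or on the wrong side.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2).values  # Iris virginica

y_signed = np.where(y, 1, -1)
street_width = {}
n_violations = {}
for C in (1, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    clf.fit(X, y)
    w = clf[-1].coef_[0]
    street_width[C] = 2 / np.linalg.norm(w)
    # A margin violation has y * score < 1 (small slack for solver tolerance)
    scores = clf.decision_function(X)
    n_violations[C] = int(np.sum(y_signed * scores < 1 - 1e-3))

for C in (1, 100):
    print(f"C={C}: street width={street_width[C]:.3f}, "
          f"margin violations={n_violations[C]}")
```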



#The following Scikit-Learn code loads the iris dataset and trains a linear SVM classifier to detect
#Iris virginica flowers. The pipeline first scales the features, then uses a LinearSVC with C=1:


from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2) #Iris virginica

In [6]:
svm_clf = make_pipeline(StandardScaler(),
                       LinearSVC(C=1, random_state=42))
svm_clf

In [7]:
svm_clf.fit(X, y)

In [8]:
X_new = [[5.5, 1.7], [5.0, 1.5]]
svm_clf.predict(X_new)

array([ True, False])

In [9]:
#The first plant is classified as an Iris virginica, while the second is not. Let’s look at the scores 
#that the SVM used to make these predictions. These measure the signed distance between each instance 
#and the decision boundary (up to a constant factor, the norm of the weight vector):

svm_clf.decision_function(X_new)



array([ 0.66163411, -0.22036063])
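#As a quick check of how the scores relate to the predictions, the sketch below refits the same pipeline
#and verifies that predict() returns True exactly where decision_function() is positive. Dividing the
#scores by the norm of the weight vector gives a geometric distance in the scaled feature space (this
#last step is an addition for illustration, not something the original cells compute).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2).values  # Iris virginica

svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)

X_new = np.array([[5.5, 1.7], [5.0, 1.5]])
scores = svm_clf.decision_function(X_new)
predictions = svm_clf.predict(X_new)

# The predicted class is simply the sign of the score
print("scores:", scores)
print("predictions:", predictions)

# Rescale the scores by ||w|| to get distances in the scaled feature space
w = svm_clf[-1].coef_[0]
distances = scores / np.linalg.norm(w)
print("distances:", distances)
```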

In [11]:
#Unlike LogisticRegression, LinearSVC doesn’t have a predict_proba() method to estimate the class 
#probabilities. That said, if you use the SVC class (discussed shortly) instead of LinearSVC, 
#and if you set its probability hyperparameter to True, then the model will fit an extra model at 
#the end of training to map the SVM decision function scores to estimated probabilities. Under 
#the hood, this requires using 5-fold cross-validation to generate out-of-sample predictions for
#every instance in the training set, then training a LogisticRegression model, so it will slow down
#training considerably. After that, the predict_proba() and predict_log_proba() methods will be available.
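#A short sketch of the option described above, assuming a linear kernel and C=1 (illustrative choices).
#With probability=True, the fitted SVC exposes predict_proba(); expect training to be noticeably slower
#on larger datasets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2).values  # Iris virginica

# probability=True fits an extra calibration model at the end of training
svc_proba = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=1, probability=True, random_state=42))
svc_proba.fit(X, y)

X_new = np.array([[5.5, 1.7], [5.0, 1.5]])
proba = svc_proba.predict_proba(X_new)  # columns follow svc_proba.classes_
print(svc_proba.classes_)
print(proba.round(3))
```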

#Nonlinear SVM Classification
#Although linear SVM classifiers are efficient and often work surprisingly well, many datasets are not 
#even close to being linearly separable. One approach to handling nonlinear datasets is to add more features,
#such as polynomial features (as we did in Chapter 4); in some cases this can result in a linearly separable
#dataset. Consider the lefthand plot in Figure 5-5: it represents a simple dataset with just one feature, x1.
#This dataset is not linearly separable, as you can see. But if you add a second feature x2 = (x1)², the
#resulting 2D dataset is perfectly linearly separable.
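#The idea can be sketched on a tiny 1D dataset (the nine values and labels below are assumptions chosen
#to resemble the description, not necessarily the data plotted in Figure 5-5). A linear SVM cannot fit
#the 1D data perfectly, but it can once the squared feature is added.

```python
import numpy as np
from sklearn.svm import SVC

# Class 1 sits in the middle, class 0 on both sides: not separable by one threshold
X1D = np.linspace(-4, 4, 9).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])

# Add the second feature x2 = (x1)^2
X2D = np.c_[X1D, X1D ** 2]

acc_1d = SVC(kernel="linear", C=100).fit(X1D, y).score(X1D, y)
acc_2d = SVC(kernel="linear", C=100).fit(X2D, y).score(X2D, y)
print("training accuracy with x1 only:", acc_1d)
print("training accuracy with x1 and x1^2:", acc_2d)
```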


#To implement this idea using Scikit-Learn, you can create a pipeline containing a PolynomialFeatures 
#transformer (discussed in “Polynomial Regression”), followed by a StandardScaler and a LinearSVC classifier.
#Let’s test this on the moons dataset, a toy dataset for binary classification in which the data points are
#shaped as two interleaving crescent moons (see Figure 5-6). You can generate this dataset using the make_moons()
#function:


from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = make_pipeline(
          PolynomialFeatures(degree=3),
          StandardScaler(),
          LinearSVC(C=10, max_iter=10_000, random_state=42)
        )

polynomial_svm_clf.fit(X, y)

In [12]:
#Polynomial Kernel
#Adding polynomial features is simple to implement and can work great with all sorts of machine 
#learning algorithms (not just SVMs). That said, at a low polynomial degree this method cannot deal
#with very complex datasets, and with a high polynomial degree it creates a huge number of features,
#making the model too slow.

#Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the
#kernel trick (which is explained later in this chapter). The kernel trick makes it possible to get
#the same result as if you had added many polynomial features, even with a very high degree, without
#actually having to add them. This means there’s no combinatorial explosion of the number of features.
#This trick is implemented by the SVC class. Let’s test it on the moons dataset:


from sklearn.svm import SVC

poly_kernel_svm_clf = make_pipeline(StandardScaler(),
                                   SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm_clf.fit(X, y)
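#To see why the kernel trick works, here is a small numeric check for a second-degree polynomial kernel
#in 2D. The explicit mapping phi() below is a standard textbook construction written out for illustration
#(the helper name and the two sample vectors are assumptions). The dot product of the mapped vectors
#equals (a·b + 1)², so the kernel gives the same result without ever building the extra features.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi(x):
    """Explicit degree-2 feature map matching the kernel (a.b + 1)^2 in 2D."""
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2])

a = np.array([0.5, -1.0])
b = np.array([2.0, 3.0])

explicit = phi(a) @ phi(b)      # dot product in the 6D mapped space
kernel = (a @ b + 1) ** 2       # same value, computed in the original 2D space

# Scikit-Learn's polynomial kernel computes (gamma * <a, b> + coef0) ** degree
sk_kernel = polynomial_kernel(a.reshape(1, -1), b.reshape(1, -1),
                              degree=2, gamma=1, coef0=1)[0, 0]
print(explicit, kernel, sk_kernel)
```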

In [13]:
#Similarity Features
#Another technique to tackle nonlinear problems is to add features computed using a similarity function, 
#which measures how much each instance resembles a particular landmark, as we did in Chapter 2 when we added
#the geographic similarity features. For example, let’s take the 1D dataset from earlier and add two landmarks
#to it at x1 = –2 and x1 = 1 (see the left plot in Figure 5-8). Next, we’ll define the similarity function to
#be the Gaussian RBF with γ = 0.3. This is a bell-shaped function varying from 0 (very far away from the landmark)
#to 1 (at the landmark). Now we are ready to compute the new features. For example, let’s look at the instance 
#x1 = –1: it is located at a distance of 1 from the first landmark and 2 from the second landmark. Therefore,
#its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈ 0.30. The plot on the right
#in Figure 5-8 shows the transformed dataset (dropping the original features). As you can see, it is now
#linearly separable.
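#The arithmetic in this example can be reproduced in a few lines. The helper gaussian_rbf() is a
#hypothetical name introduced here; it computes exp(–γ × distance²) for a 1D instance and landmark.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Gaussian RBF similarity between a 1D instance and a landmark."""
    return np.exp(-gamma * (x - landmark) ** 2)

gamma = 0.3
x1 = -1.0
x2 = gaussian_rbf(x1, landmark=-2.0, gamma=gamma)  # distance 1 from the landmark
x3 = gaussian_rbf(x1, landmark=1.0, gamma=gamma)   # distance 2 from the landmark
print(round(x2, 2), round(x3, 2))
```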


#You may wonder how to select the landmarks. The simplest approach is to create a landmark at the location
#of each and every instance in the dataset. Doing that creates many dimensions and thus increases the chances
#that the transformed training set will be linearly separable. The downside is that a training set with m
#instances and n features gets transformed into a training set with m instances and m features (assuming you
#drop the original features). If your training set is very large, you end up with an equally large number of
#features.
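#A sketch of the "one landmark per instance" approach, assuming the moons dataset and γ = 0.3 (both
#illustrative choices). Scikit-Learn's rbf_kernel() computes all pairwise similarities at once, which
#turns m instances with n features into m instances with m features.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Each column j holds the similarity of every instance to landmark j (instance j)
X_sim = rbf_kernel(X, X, gamma=0.3)

print("original shape:", X.shape)
print("transformed shape:", X_sim.shape)
```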

#Gaussian RBF Kernel
#Just like the polynomial features method, the similarity features method can be useful with any machine learning
#algorithm, but it may be computationally expensive to compute all the additional features (especially on large
#training sets). Once again the kernel trick does its SVM magic, making it possible to obtain a similar result as
#if you had added many similarity features, but without actually doing so. Let’s try the SVC class with the
#Gaussian RBF kernel:

rbf_kernel_svm_clf = make_pipeline(StandardScaler(),
                                  SVC(kernel="rbf", gamma=5, C=0.001))
rbf_kernel_svm_clf.fit(X, y)



In [None]:
#This model is represented at the bottom left in Figure 5-9. The other plots show models trained 
#with different values of hyperparameters gamma (γ) and C. Increasing gamma makes the bell-shaped curve
#narrower (see the lefthand plots in Figure 5-8). As a result, each instance’s range of influence is 
#smaller: the decision boundary ends up being more irregular, wiggling around individual instances. 
#Conversely, a small gamma value makes the bell-shaped curve wider: instances have a larger range of
#influence, and the decision boundary ends up smoother. So γ acts like a regularization hyperparameter: 
#if your model is overfitting, you should reduce γ; if it is underfitting, you should increase γ 
#(similar to the C hyperparameter).
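#One way to explore these hyperparameters is to train the four combinations of a small and large gamma
#with a small and large C, then compare them. The sketch below reports the training accuracy and the
#number of support vectors for each model; the specific values mirror those mentioned above, and no
#particular accuracy figures are claimed.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

results = {}
for gamma in (0.1, 5):
    for C in (0.001, 1000):
        clf = make_pipeline(StandardScaler(),
                            SVC(kernel="rbf", gamma=gamma, C=C))
        clf.fit(X, y)
        train_acc = clf.score(X, y)
        n_sv = int(clf[-1].n_support_.sum())  # support vectors across both classes
        results[(gamma, C)] = (train_acc, n_sv)

for (gamma, C), (train_acc, n_sv) in results.items():
    print(f"gamma={gamma}, C={C}: train accuracy={train_acc:.2f}, "
          f"support vectors={n_sv}")
```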

#Other kernels exist but are used much more rarely. Some kernels are specialized for specific data structures.
#String kernels are sometimes used when classifying text documents or DNA sequences (e.g., using the string
#subsequence kernel or kernels based on the Levenshtein distance).




#SVM Regression
#To use SVMs for regression instead of classification, the trick is to tweak the objective: instead of trying
#to fit the largest possible street between two classes while limiting margin violations, SVM regression tries
#to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the
#street). The width of the street is controlled by a hyperparameter, ε. Figure 5-10 shows two linear SVM 
#regression models trained on some linear data, one with a small margin (ε = 0.5) and the other with a larger
#margin (ε = 1.2).
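#A minimal sketch of linear SVM regression with Scikit-Learn's LinearSVR, using randomly generated
#linear data (the data-generating formula and noise level are assumptions for illustration). It trains
#one model per ε value and counts how many training instances end up off the street.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Noisy linear data: y = 4 + 3 * x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((50, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=50)

off_street = {}
for epsilon in (0.5, 1.2):
    svm_reg = make_pipeline(
        StandardScaler(),
        LinearSVR(epsilon=epsilon, dual=True, random_state=42))
    svm_reg.fit(X, y)
    # An instance is off the street when its residual exceeds epsilon
    residuals = np.abs(y - svm_reg.predict(X))
    off_street[epsilon] = int(np.sum(residuals > epsilon))

print(off_street)
```

#A wider street (larger ε) should leave fewer instances outside it, which is what the counts illustrate.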


#So, making predictions with a linear SVM classifier is quite straightforward. How about training? This requires
#finding the weights vector w and the bias term b that make the street, or margin, as wide as possible while limiting
#the number of margin violations. Let’s start with the width of the street: to make it larger, we need to make w