- Step from Logistic regression to SVM
We want to modify the cost function of logistic regression, replacing the smooth log loss with a piecewise-linear cost that is exactly zero beyond a margin. This is called the hinge loss.
- We replace the first and second (log-loss) terms of logistic regression with the corresponding hinge costs, cost1(θᵀx) and cost0(θᵀx)
- We remove the (1 / m) factor because scaling the cost by a constant does not change the minimizing θ
- Instead of A + λB, we use CA + B
- The parameter C plays a role similar to (1 / λ)
- When C = (1 / λ), the two optimization problems give the same parameters θ
- Compared to logistic regression:
  - the SVM does not output a probability
  - we get a direct prediction of 1 or 0 instead (see the sketch below)
  - If θᵀx >= 0: hθ(x) = 1
  - If θᵀx < 0: hθ(x) = 0
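As a minimal numpy sketch of the C·A + B objective and the direct 0/1 prediction described above (the helper names cost1, cost0, svm_cost and predict are illustrative, not from any library):

```python
import numpy as np

def cost1(z):
    # hinge-style cost for y = 1: zero once theta^T x >= 1
    return np.maximum(0, 1 - z)

def cost0(z):
    # hinge-style cost for y = 0: zero once theta^T x <= -1
    return np.maximum(0, 1 + z)

def svm_cost(theta, X, y, C):
    z = X @ theta
    A = np.sum(y * cost1(z) + (1 - y) * cost0(z))  # data term, no 1/m factor
    B = 0.5 * np.sum(theta[1:] ** 2)               # regularizer, skip the bias theta_0
    return C * A + B

def predict(theta, X):
    # direct 1/0 prediction, no probability
    return (X @ theta >= 0).astype(int)
```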
Moving from logistic regression to the SVM, the learned decision boundary has a large margin.
Here is an example of what the large margin looks like:
- If C is huge, we would want A (the first term) = 0 to minimize the cost function
  - If y = 1: A = 0 requires θᵀx >= 1
  - If y = 0: A = 0 requires θᵀx <= -1
- The optimization problem then boils down to minimizing the latter (regularization) term only
Using labels yi ∈ {−1, +1} and writing the linear score as w·x + b, the last two constraints can be compacted to yi(w·xi + b) ≥ 1.
This is a quadratic program
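As a sketch only, this quadratic program can be posed directly with a generic convex-optimization library such as cvxpy (cvxpy is not mentioned in these notes, just one convenient way to write a QP; the toy data and the {−1, +1} labels below are hypothetical):

```python
import numpy as np
import cvxpy as cp

# toy linearly separable data, labels in {-1, +1} (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```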
For very high-dimensional problems the data are sometimes linearly separable, but in the general case they are not; and even when they are, we might prefer a solution that better separates the bulk of the data while ignoring a few noisy outlier documents.
We therefore introduce slack variables ξi. Here is an illustration:
- Minimize (1/2)∥w∥^2 + C Σi ξi subject to
  yi(w·xi + b) ≥ 1 − ξi,  ξi ≥ 0
- For very large values of the hyper-parameter C, this expression minimizes ∥w∥^2 under the constraint that all training examples are correctly classified with a margin.
- Large C: low bias, high variance. Small C: high bias, low variance (see the sketch after this list).
- If C is infinitely large, we recover the hard margin.
- Smaller values of C relax this constraint and produce markedly better results on noisy problems (addressing overfitting).
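A small scikit-learn sketch of how C controls this trade-off (scikit-learn wraps libsvm/liblinear; the synthetic blobs dataset is purely illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# noisy, roughly linearly separable data (illustrative)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # large C -> few margin violations tolerated (low bias, high variance)
    # small C -> wider, softer margin (high bias, low variance)
    print(f"C={C:>6}: #support vectors = {clf.n_support_.sum()}")
```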
The picture below also shows how the SVM objective function changes from hard margin to soft margin.
Solving the primal problem, we obtain the optimal w, but know nothing about the αi. In order to classify a query point x, we need to explicitly compute the scalar product wᵀx, which may be expensive if d is large.
The Representer Theorem states that the solution w can always be written as a linear combination of the training data: w = Σi αi yi xi.
Solving the dual problem, we obtain the αi (where αi = 0 for all but a few points, the support vectors). In order to classify a query point x we calculate f(x) = wᵀx + b = Σi αi yi (xiᵀx) + b.
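In scikit-learn (an assumed choice here; the notes themselves only mention libsvm/liblinear) the fitted dual solution is exposed as clf.dual_coef_ (holding yi·αi for the support vectors) and clf.support_vectors_, so the decision value for a query point can be recomputed from inner products with the support vectors alone; a sketch:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_query = X[0]

# f(x) = sum_i alpha_i y_i <x_i, x> + b, summing over support vectors only
manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_query) + clf.intercept_
print(manual, clf.decision_function(x_query.reshape(1, -1)))  # the two values agree
```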
Reference: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf
- Assume N is the number of training points and d is the dimension of the feature vector x.
- We need to learn d parameters in the primal form and N in the dual form. If N << d it is more efficient to solve for α than for w, since to classify we only have to calculate quantities that depend on the inner products between x and the points in the training set.
- Moreover, we saw earlier that the αi will all be zero except for the support vectors. Thus many of the terms in the sum above are zero, and we really only need to find the inner products between x and the support vectors.
If we apply a kernel, the dual form is also better: suppose we map the original d-dimensional space to a higher-dimensional space D.
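Because the dual only ever touches the data through inner products, we can hand the solver a precomputed Gram matrix instead of raw features; a scikit-learn sketch using kernel='precomputed' (the dataset is illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_train, y_train = make_blobs(n_samples=100, centers=2, random_state=2)
X_test = X_train[:5]

# Gram matrices: only inner products between points are needed
K_train = X_train @ X_train.T   # shape (n_train, n_train)
K_test = X_test @ X_train.T     # shape (n_test, n_train)

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print(clf.predict(K_test))
```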
Why don't we use the dual form for logistic regression?
- Logistic regression does not have support vectors: essentially every training point ends up with a non-zero αi, so the dual form would have to store all the αi and all the training points x, which is a lot of data.
- The primal form only needs w.
more from http://www.ritchieng.com/machine-learning-svms-support-vector-machines/
Given the data, is there a different or better choice of the features f1, f2, f3 … fn? We also see that using high order polynomials is computationally expensive
- Linear: K(x, y) = xᵀy
- Polynomial: K(x, y) = (xᵀy + 1)^d
- Sigmoid: K(x, y) = tanh(a·xᵀy + b)
- RBF: K(x, y) = exp(−γ∥x − y∥^2)
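The four kernels above written out as plain Python/numpy functions (the parameter values d, a, b and γ are free choices, shown only for illustration):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1) ** d

def sigmoid_kernel(x, y, a=0.5, b=-1.0):
    return np.tanh(a * (x @ y) + b)

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))
```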
- We would normally use an SVM software package (liblinear, libsvm, etc.) to solve for the parameters θ
- You need to specify the following:
- Choice of parameter C
- Choice of kernel (similarity function)
  - No kernel (essentially a "linear kernel"): predict "y = 1" if θᵀx >= 0. Use this when n (the number of features) is large and m (the number of training examples) is small
  - Gaussian kernel: for this kernel we have to choose σ^2. Use this when n (the number of features) is small and/or m (the number of training examples) is large (see the sketch below)
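In scikit-learn terms (an illustrative mapping, not something stated in the notes), "no kernel" corresponds to LinearSVC or SVC(kernel='linear'), while the Gaussian kernel is SVC(kernel='rbf') with gamma playing the role of 1/(2σ^2); a sketch on a synthetic dataset:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# "no kernel": linear decision boundary
linear_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0)).fit(X, y)

# Gaussian (RBF) kernel: choose C and gamma ~ 1 / (2 * sigma^2)
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5)).fit(X, y)

print(linear_clf.score(X, y), rbf_clf.score(X, y))
```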
In practice, the reason that SVMs tend to be resistant to over-fitting, even in cases where the number of attributes is greater than the number of observations, is that they use regularization.
- The key to avoiding over-fitting lies in careful tuning of the regularization parameter C, the misclassification penalty
- In the case of non-linear SVMs, it also lies in careful choice of the kernel and tuning of the kernel parameters. If you are using an RBF kernel you can try a wider value for σ (see the grid-search sketch at the end of this section)
- The key thing to note is that if there is a huge number of training examples, a Gaussian kernel takes a long time
- The optimization problem of an SVM is a convex problem, so you will always find the global minimum
- Neural Network: non-convex, may find local optima
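As noted above, tuning C (and γ for the Gaussian kernel) is the main defence against over-fitting; a cross-validated grid search is the usual way to do it. A minimal scikit-learn sketch on an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],          # misclassification penalty
    "gamma": [0.001, 0.01, 0.1, 1],  # RBF width; gamma ~ 1 / (2 * sigma^2)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```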