### SVM for Non-Linear Data Sets

An example of non-linear data is:

![SVM's for Non-Linear Data Sets](./img/non_linear_svm.png)

In this case we cannot find a straight line that separates the apples from the lemons. So how can we solve this problem? We will use the Kernel Trick!

The basic idea is that when a data set is not separable in its current dimensions, we add another dimension in the hope that the data becomes separable there.

The example above is in 2D and it is inseparable, but maybe in 3D there is a gap between the apples and the lemons, maybe there is a level difference, so apples are on level one and lemons are on level two. In this case we can easily draw a separating hyperplane (in 3D a hyperplane is a plane) between level 1 and 2.

Let's assume that we add another dimension called x3, and that in this new dimension each point is placed at the height given by the formula x3 = x1² + x2².

If we plot the surface defined by x1² + x2², we will get something like this:

![3d_SVM](./img/3d_svm.png)

Now we have to map the apples and lemons (which are just simple points) to this new space. 

What did we do? We just used a transformation in which we added levels based on distance. 

A point at the origin sits on the lowest level. As we move away from the origin, we climb the hill (moving from the center of the surface towards its edges), so the level of the points gets higher.

Now if we consider that the origin is the lemon at the center, we will have something like this:

![Transformed SVM](./img/transformed_svm.png)

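Now we can easily separate the two classes. Transformations of this kind are what kernels provide: a kernel function measures similarity as if the data had been mapped into the higher-dimensional space, without computing the mapping explicitly.
Popular kernels are: the Polynomial kernel, the Gaussian kernel (a Radial Basis Function, RBF, kernel), the Laplace RBF kernel, the Sigmoid kernel, the ANOVA RBF kernel, etc.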
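
As a minimal sketch of the lifting step described above, the snippet below places hypothetical lemons on a small circle around the origin and hypothetical apples on a larger circle, then adds the third coordinate x3 = x1² + x2². The coordinates, the radii, and the helper names `lift` and `ring` are assumptions chosen for illustration; they are not taken from the figures.

```python
import math


def lift(point):
    """Map (x1, x2) to (x1, x2, x3) with x3 = x1**2 + x2**2."""
    x1, x2 = point
    return (x1, x2, x1 ** 2 + x2 ** 2)


def ring(radius, n=8):
    """Return n evenly spaced points on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / n),
         radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


# Hypothetical data: lemons near the origin, apples surrounding them.
lemons = ring(1.0)
apples = ring(3.0)

# The new coordinate x3 is the "level" of each point.
lemon_levels = [lift(p)[2] for p in lemons]
apple_levels = [lift(p)[2] for p in apples]

# Any horizontal plane x3 = threshold between the two groups separates them.
threshold = (max(lemon_levels) + min(apple_levels)) / 2
print(max(lemon_levels), min(apple_levels), threshold)
```

In 2D no straight line separates a circle of points from the ring around it, but after the lift a single flat plane at height `threshold` does.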

Another example would be:

![](./img/1d_svm.png)

After using the kernel and after all the transformations we will get:

![](./img/transformed_1d_kernel.png)

So after the transformation, we can easily delimit the two classes using just a single line.

In real-life applications we won’t have a simple straight line; we will have lots of curves and high dimensions. In some cases there won’t be two hyperplanes that separate the data with no points between them, so we need a trade-off: some tolerance for outliers.

Fortunately the SVM algorithm has a so-called regularization parameter to configure the trade-off and to tolerate outliers.

#### Regularization

The regularization parameter (called `C` in Python’s scikit-learn library) tells the SVM optimization how much you want to avoid misclassifying each training example.

If C is higher, the optimization will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be lower.

On the other hand, if C is low, the margin will be large, even if some training examples are misclassified. This is shown in the following two diagrams:

![](./img/reg_svm.png)

As you can see in the image, when C is low the margin is wider (so the boundary has fewer curves and does not strictly follow the data points), even though two apples were classified as lemons. When C is high, the boundary is full of curves and all the training data is classified correctly.


**Note:** even if all the training data is correctly classified, this doesn’t mean that increasing C will always improve accuracy on new data, because a boundary that follows the training points too closely can overfit.
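
To make the trade-off concrete, here is a small sketch that assumes the standard soft-margin objective, 0.5·‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)), and compares two hand-picked candidate boundaries on made-up 1-D data. The data, the two candidates, and the function names are illustrative assumptions, not the output of an SVM solver.

```python
def soft_margin_objective(w, b, xs, ys, C):
    """0.5 * w**2 + C * (sum of hinge losses), for 1-D inputs."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys))
    return 0.5 * w ** 2 + C * hinge


# Toy 1-D data: the +1 point at x = -1 sits close to the -1 class.
xs = [-3.0, -2.0, -1.0, 2.0, 3.0]
ys = [-1, -1, 1, 1, 1]

# Two hand-picked candidates (illustrative, not optimizer output):
wide = (0.5, 0.0)    # margin width 2/|w| = 4, but misclassifies x = -1
narrow = (2.0, 3.0)  # margin width 2/|w| = 1, classifies every point correctly


def preferred(C):
    """Return the name of the candidate with the lower objective for this C."""
    j_wide = soft_margin_objective(*wide, xs, ys, C)
    j_narrow = soft_margin_objective(*narrow, xs, ys, C)
    return "wide" if j_wide < j_narrow else "narrow"


for C in (0.1, 10.0):
    print(C, preferred(C))
```

With a small C the penalty for the misclassified point is cheap, so the wide margin has the lower objective; with a large C the same mistake becomes expensive and the narrow, error-free boundary is preferred.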

#### Examples of SVM kernels

- Polynomial kernel
It is popular in image processing.
Equation is:

![](./img/polynomial-kernel.png)

where d is the degree of the polynomial.

- Gaussian kernel
It is a general-purpose kernel; used when there is no prior knowledge about the data. Equation is:

![](./img/gaussian-kernel.png)

- Sigmoid kernel
We can use it as a proxy for neural networks. Equation is:

![](./img/sigmoid-kernel.png)
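
Because the equations above are only available as images, the following sketch assumes the common textbook forms of these three kernels: (x·y + c)^d for the polynomial kernel, exp(−‖x − y‖² / (2σ²)) for the Gaussian kernel, and tanh(α·x·y + c) for the sigmoid kernel. The parameter names (`degree`, `coef0`, `sigma`, `alpha`) and their default values are assumptions and may differ from the notation in the figures.

```python
import math


def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree"""
    return (dot(x, y) + coef0) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, y, alpha=0.01, coef0=0.0):
    """tanh(alpha * (x . y) + coef0)"""
    return math.tanh(alpha * dot(x, y) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(polynomial_kernel(u, v))
print(gaussian_kernel(u, v))
print(sigmoid_kernel(u, v))
```

Each function takes two points in the original space and returns a single similarity score, which is all the SVM needs from the higher-dimensional space.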

### Understanding the Multiple Linear Regression

**Linear regression** is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables).

The case of one independent variable is called simple linear regression; for more than one, the process is called **multiple linear regression.**

#### Formula and calculation of multiple linear regression

$$
\begin{aligned}
&y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i \\
&\textbf{where, for } i = 1, \dots, n \textbf{ observations:} \\
&y_i = \text{dependent variable} \\
&x_{i1}, \dots, x_{ip} = \text{explanatory variables} \\
&\beta_0 = \text{y-intercept (constant term)} \\
&\beta_1, \dots, \beta_p = \text{slope coefficients for each explanatory variable} \\
&\epsilon_i = \text{the model's error term (also known as the residuals)}
\end{aligned}
$$
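
As a rough illustration of how the coefficients in this formula can be estimated, the sketch below fits them by ordinary least squares using the normal equations, in plain Python. The helper names `fit_ols` and `predict` and the noise-free toy data (generated from y = 1 + 2·x1 − 3·x2) are assumptions made for this example; in practice a numerical library would normally do this step.

```python
def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ..., bp] via the normal equations."""
    # Prepend a column of ones so that beta[0] acts as the intercept.
    A = [[1.0] + [float(v) for v in row] for row in X]
    k = len(A[0])
    # Build the augmented system (A^T A | A^T y).
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(A, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict(beta, row):
    """Evaluate beta0 + beta1 * x1 + ... + betap * xp for one observation."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Toy data generated without noise from y = 1 + 2*x1 - 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in X]

beta = fit_ols(X, y)
print([round(b, 6) for b in beta])
print(predict(beta, [3, 2]))
```

Since the toy data contains no error term, the fitted coefficients should match the generating ones up to floating-point rounding.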

#### Assumptions of multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression:

- Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

- Independence of observations: the observations in the dataset were collected using statistically valid methods, and there are no hidden relationships among variables.

  In multiple linear regression, it is possible that some of the independent variables are correlated with one another, so it is important to check for this before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.

- Normality: the residuals (the errors of the model) follow a normal distribution.

- Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
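
The correlation check mentioned under independence of observations could be sketched as follows. The predictor names `a`, `b`, `c` and their values are made up for illustration, and the 0.6 cut-off simply mirrors the approximate threshold quoted above.

```python
import math
from itertools import combinations


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical predictors: "b" is exactly twice "a", "c" is only weakly related.
predictors = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [3.0, 1.0, 5.0, 2.0, 4.0],
}
THRESHOLD = 0.6  # approximate r^2 cut-off from the text

# Collect every pair of predictors whose squared correlation is too high.
flagged = [
    (p, q)
    for p, q in combinations(predictors, 2)
    if pearson_r(predictors[p], predictors[q]) ** 2 > THRESHOLD
]
print(flagged)
```

For each flagged pair, only one of the two variables would be kept in the regression model.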

##### Example: The Wine Dataset

In [2]:

    # import libraries
    import pandas as pd
    import numpy as np

In [4]:

    df = pd.read_csv('./data/winequality-red.csv')

    df.head()

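| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |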


In [5]:

    # quality -- statistic
    df.quality.describe()

Output:

    count    1599.000000
    mean        5.636023
    std         0.807569
    min         3.000000
    25%         5.000000
    50%         6.000000
    75%         6.000000
    max         8.000000
    Name: quality, dtype: float64

Definition: a wine is labelled "good" if its quality score is greater than 6.5, and "bad" otherwise.
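
The labelling rule above can be written as a small function. The name `label_quality` and the sample scores are illustrative assumptions: the first five scores mirror the rows shown by `df.head()`, while the last two are made up so that both labels appear.

```python
def label_quality(score, threshold=6.5):
    """Return "good" for scores above the threshold, otherwise "bad"."""
    return "good" if score > threshold else "bad"


# First five values as in df.head(); 7 and 8 are added for illustration only.
sample_scores = [5, 5, 5, 6, 5, 7, 8]
labels = [label_quality(s) for s in sample_scores]
print(labels)
```

With the DataFrame loaded earlier, the same function could be applied to the whole column, for example with `df["quality"].apply(label_quality)`.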

In [6]:
#build a regression model to predict the wine quality.

##### Example: The Glass data

In [9]:

    df_glass = pd.read_csv('./data/glass.csv')

    df_glass.head()

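| Unnamed: 0 | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |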


In [10]:

    df_glass.Type.value_counts()

Output:

    2    76
    1    70
    7    29
    3    17
    5    13
    6     9
    Name: Type, dtype: int64

In [11]:
#build a classification model to predict the glass type

#### What Are Degrees of Freedom?


Degrees of freedom refers to the maximum number of logically independent values (values that are free to vary) in a data sample.

- Consider a data sample consisting of, for the sake of simplicity, five positive integers. The values could be any number with no known relationship between them. 

- This data sample would, theoretically, have five degrees of freedom.

- Four of the numbers in the sample are {3, 8, 5, and 4} and the average of the entire data sample is revealed to be 6.

- Since the five values must sum to 5 × 6 = 30 and the four known values sum to 20, the fifth number has to be 10. It can be nothing else; it does not have the freedom to vary. So the degrees of freedom for this data sample is 4.

The degrees of freedom equal the size of the data sample minus one:

$$
\begin{aligned}
&\text{D}_\text{f} = N - 1 \\
&\textbf{where:} \\
&\text{D}_\text{f} = \text{degrees of freedom} \\
&N = \text{sample size}
\end{aligned}
$$
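
The worked example above can be checked with a few lines of Python. The function names `missing_value` and `degrees_of_freedom` are hypothetical helpers introduced only for this sketch.

```python
def missing_value(known, mean, n):
    """The one value that is fixed once the mean and the other n - 1 values are known."""
    return mean * n - sum(known)


def degrees_of_freedom(n):
    """Degrees of freedom of a sample of size n: N - 1."""
    return n - 1


known = [3, 8, 5, 4]  # the four values that were free to vary
fifth = missing_value(known, mean=6, n=5)
print(fifth, degrees_of_freedom(5))
```

Once the mean is fixed, only four of the five numbers can be chosen freely, which is what the N − 1 in the formula expresses.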