# Support Vector Regression

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

## Dataset and business problem description

This dataset contains information on 50 companies, including their expenses, profits, and which state they operate in. The challenge is to determine how each of the available factors contributes to the profit level of each company.

## Intuition

### Compared to linear regression

![image.png](attachment:image.png)

Ordinary Least Squares, as discussed previously, is used to derive the linear regression line.

Support Vector Regression instead produces a 'tube': a regression line in the middle, with a band extending $\epsilon$ above and below it for a total height of $2\epsilon$. This band is called the $\epsilon$-Insensitive Tube; we disregard the error for all points inside it.

This adds a 'buffer' to the model, making it more robust, and lets the modeller choose how tolerant the model is of errors by setting an acceptable error margin (the width of the $\epsilon$-Insensitive Tube).

For points outside the tube, we measure the error as the distance between each point and the tube itself. These distances are called slack variables, denoted $\xi_i$ for points above the tube and $\xi_i^*$ for points below it.
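The error measure described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `epsilon_insensitive_loss` and the sample values are made up for this example and do not come from the dataset.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return the slack for each point: 0 inside the tube, distance to the tube outside."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Points within epsilon of the prediction contribute no error;
    # points outside contribute their distance to the nearest tube boundary.
    return np.maximum(0.0, residual - epsilon)

# Hypothetical values chosen only to show both cases.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 4.5, 3.0])
slack = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(slack)
```

The first two points fall inside the tube and get zero slack; the last two fall outside and are penalised only for the part of the residual that exceeds $\epsilon$.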

### Assumptions of linear regression

These assumptions need to be checked before you build your linear regression model to ensure that the resulting model is valid.

1. Linearity: the relationship between the dependent and independent variable(s) is linear. If the relationship is not linear then you'll need to use a non-linear model.
2. Homoscedasticity: the variance of the residuals does not depend on the value of the independent variable (i.e. the variance remains the same throughout the dataset). If the residual variance changes with X, then the model is not optimal since it isn't capturing all the predictive information in the data.
3. Multivariate normality: residuals are normally distributed around the mean value.
4. Independence of errors: observations are independent of one another. This can be tested with the Durbin-Watson statistic.
5. Lack of multicollinearity: when independent variables are correlated with each other, there will be issues in interpreting the coefficients of the model. This can be tested with the Variance Inflation Factor method.
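As a rough sketch of the last two checks, the Durbin-Watson statistic and the Variance Inflation Factor (VIF) can both be computed directly with NumPy. The helper names and the synthetic data below are assumptions made for illustration; in practice a statistics library would normally provide these diagnostics.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: x3 is deliberately almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - x2 + rng.normal(size=200)

# Fit OLS with an intercept and compute the residuals.
design = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

dw = durbin_watson(residuals)
vifs = variance_inflation_factors(X)
print(dw, vifs)
```

The Durbin-Watson statistic lies between 0 and 4, with values near 2 suggesting no first-order autocorrelation. A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; here `x1` and `x3` should stand out while `x2` should not.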

### Dummy variables & intuition

The business problem in this lesson is to see whether spending on R&D, administration, or marketing, or the state the startup operates in, is related to the profit the startup generates. How would you go about creating a model to understand the relationship between these variables and profit? We can use a multiple linear regression for this.

Multiple linear regression is a generalisation of simple linear regression to include multiple independent variables: $y = b_0 + b_1x_1 + \dots + b_Nx_N$. 

The coefficients have a similar interpretation to those in simple linear regression, where:

* $b_0$ is still the intercept: the value of the dependent variable when all independent variables are 0.
* $b_n$ is the gradient of variable $x_n$: the change in the dependent variable for every unit change in $x_n$, holding all other independent variables constant.


When adding categorical variables to the model, you'll need to turn them into dummy variables, being careful not to fall into the dummy variable trap: use $n-1$ dummy variables when there are $n$ categories. The category you leave out then becomes the 'default' value, and its effect is absorbed into the intercept.
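A minimal sketch of this encoding using `pandas.get_dummies` with `drop_first=True`. The column names and the category labels `A`, `B`, `C` are placeholders invented for this example, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical mini-frame with one numeric and one categorical column.
df = pd.DataFrame({
    "Marketing": [10.0, 20.0, 30.0, 40.0],
    "State": ["A", "B", "C", "A"],
})

# drop_first=True keeps n-1 dummy columns for n categories,
# which avoids the dummy variable trap.
encoded = pd.get_dummies(df, columns=["State"], drop_first=True)
print(encoded)
```

With three categories, only two dummy columns (`State_B` and `State_C`) are produced. A row where both are zero belongs to the omitted category `A`, whose effect ends up in the intercept.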

### Understanding the P-value - statistical significance

The intuition behind hypothesis testing is that you have two alternate universes:

* $H_0$: Your null hypothesis, the 'default' universe; and,
* $H_1$: Your alternative hypothesis, the universe where the thing you're trying to prove is true.

Put simply, the P-value is the probability of seeing results at least as extreme as the ones you found, given that we're in a universe where the null hypothesis is true. The significance level ($\alpha$) is a P-value threshold below which the observed result is considered so unlikely under $H_0$ that you reject the null hypothesis. In other words, if we really do live in the 'default' universe, this procedure will wrongly reject $H_0$ only a fraction $\alpha$ of the time (5% of the time for $\alpha = 0.05$). Note that this is not the same as being $(1-\alpha)$ sure that $H_0$ is false.
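To make this concrete, here is a small sketch using a coin-flip example, which is an illustration invented for this note rather than part of the course dataset. The null hypothesis is that the coin is fair, and the P-value is computed exactly from the binomial distribution.

```python
from math import comb

def two_sided_binomial_p_value(heads, flips):
    """P-value for H0 'the coin is fair', given `heads` heads in `flips` tosses."""
    # Under H0, seeing k heads has probability C(flips, k) / 2**flips.
    probs = [comb(flips, k) / 2 ** flips for k in range(flips + 1)]
    observed = probs[heads]
    # Add up every outcome that is at least as unlikely as the observed one.
    return sum(p for p in probs if p <= observed)

p_value = two_sided_binomial_p_value(heads=9, flips=10)
alpha = 0.05
reject_null = p_value < alpha
print(p_value, reject_null)
```

Nine heads in ten flips gives a P-value of 22/1024 (about 0.021), which is below $\alpha = 0.05$, so the null hypothesis of a fair coin would be rejected at that level.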

### Building a model (step-by-step)

We have a lot of independent variables; we'll need to decide which ones to keep and which to discard. Why would we want to narrow the variable list down?

* Garbage in, garbage out: there's no guarantee that more variables = better model; especially if there turns out to be multicollinearity in your predictors.
* Explainability: you'll want to be able to explain the effect that each variable has on the outputs; this is hard to do when you have a lot of useless variables.

This course presents 5 methods of building models (stepwise regression refers to methods 2-4):

<div class="alert alert-block alert-warning">
    <b>On the validity of these methods</b><br/> 
    Most of the methods this course presents for finding the 'ideal' model in regression are stepwise selection processes. There are <a href="https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df">arguments against this method of selection</a>: the F-test used to assess the fit of the regression model (which differs from the t-test used to test each individual coefficient) is designed for a single hypothesis test, not many in sequence. As a result of this violation, the following are true:
    <ul>
        <li>Standard errors are biased towards 0 (i.e. overly optimistic)</li>
        <li>p-values are biased towards 0 (i.e. overly optimistic)</li>
        <li>Parameter estimates are biased away from 0</li>
        <li>Models are too complex</li>
    </ul><br/>
    Alternatives presented include the use of LASSO regression and Least Angle Regression to arrive at a subset of predictors that are useful and predictive.

Back to the course material...
</div>

#### All-in

Throw all your variables in. When would you do this?

* You have prior knowledge that all available variables are useful predictors; or,
* You're required (legislation/business rules) to put these variables in; or,
* You're preparing for backward elimination.

#### Backward elimination

1. Select a significance level as a threshold with which to keep a variable in the model (generally $\alpha=0.05$).
2. Fit full model with all possible predictors.
3. Look at the predictor with the highest P-value. If this P-value exceeds the threshold go to Step 4. Otherwise you are done.
4. Remove the predictor and re-fit the model. Go back to Step 3.
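The steps above can be sketched as follows. This is a minimal illustration rather than the course's implementation: it assumes NumPy and SciPy are available, computes the coefficient P-values by hand from the usual OLS t-test, and runs on synthetic data with made-up variable names.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test P-values for each column of X in an OLS fit of y on X."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = (resid @ resid) / (n - k)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the coefficients
    t_stats = coef / np.sqrt(np.diag(cov))
    return 2.0 * stats.t.sf(np.abs(t_stats), df=n - k)

def backward_elimination(X, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the highest P-value while it exceeds alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p_values = ols_p_values(design, y)[1:]  # skip the intercept
        worst = int(np.argmax(p_values))
        if p_values[worst] <= alpha:
            break                               # every remaining predictor is significant
        keep.pop(worst)                         # remove it and re-fit
    return [names[i] for i in keep]

# Synthetic data: only x1 and x2 actually drive y.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
names = ["x1", "x2", "noise1", "noise2"]

selected = backward_elimination(X, y, names)
print(selected)
```

The intercept is always kept, and only the predictors are candidates for removal. Because each step tests at level $\alpha$, a pure-noise column can still survive occasionally by chance.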

#### Forward selection

A much more complex procedure.

1. Select a significance level to enter the model (generally $\alpha=0.05$).
2. Fit a simple regression model $y \sim x_n$ for each of the $N$ candidate variables. Select the model whose variable has the lowest P-value.
3. Keep the variable selected in Step 2, and fit all possible models that add one more of the remaining variables (for example, $y \sim x_1 + x_n$ if $x_1$ was selected). Select the model in which the newly added variable has the lowest P-value. If that P-value is smaller than $\alpha$, keep the variable and repeat this step. Otherwise, discard the latest variable (the one with P > $\alpha$); the model from the previous iteration is your final model.
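A sketch of forward selection under the same kind of assumptions as any hand-rolled example: NumPy and SciPy are assumed to be available, the P-value of the newly added variable comes from the standard OLS t-test, and the data and variable names are synthetic.

```python
import numpy as np
from scipy import stats

def last_coef_p_value(X, y):
    """P-value of the t-test on the last column of X in an OLS fit of y on X."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = (resid @ resid) / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stat = coef[-1] / np.sqrt(cov[-1, -1])
    return 2.0 * stats.t.sf(abs(t_stat), df=n - k)

def forward_selection(X, y, names, alpha=0.05):
    """Add the most significant remaining predictor until none has P < alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Try each remaining variable on top of the ones already selected.
        p_values = []
        for j in remaining:
            design = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            p_values.append(last_coef_p_value(design, y))
        best = int(np.argmin(p_values))
        if p_values[best] >= alpha:
            break                               # the best candidate is not significant
        selected.append(remaining.pop(best))
    return [names[i] for i in selected]

# Synthetic data: only x1 and x2 actually drive y.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
names = ["x1", "x2", "noise1", "noise2"]

selected = forward_selection(X, y, names)
print(selected)
```

Each pass fits one model per remaining candidate, so the number of fits grows with the number of variables still outside the model.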

#### Bidirectional elimination

1. Select a significance level to enter the model, and a significance level to stay in the model.
2. Perform the next step of forward selection (new variables must have P-value < significance level to enter).
3. Perform all steps of backward elimination.
4. Add another variable via forward selection and then run all steps of backward elimination again.
5. Once you get to the point where no new variables can enter and no old variables can exit, you are done.

#### Score comparison

The most resource-intensive approach.

1. Select a criterion of goodness-of-fit (for example, Akaike's Information Criterion).
2. Construct all possible models (for $N$ variables there are $2^N-1$ combinations, so 10 variables means 1,023 possible models).
3. Select the model with the best criterion.
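The exhaustive search can be sketched with `itertools.combinations`. This example assumes a Gaussian linear model, for which AIC can be written (up to an additive constant) as $n \ln(\mathrm{RSS}/n) + 2k$, where $k$ is the number of fitted coefficients. The function names and data are made up for illustration.

```python
import numpy as np
from itertools import combinations

def gaussian_aic(X, y):
    """AIC of an OLS fit, up to an additive constant: n * ln(RSS / n) + 2k."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    return n * np.log(rss / n) + 2 * k

def best_subset(X, y, names):
    """Fit every non-empty subset of predictors and return the one with the lowest AIC."""
    n, p = X.shape
    best_score, best_cols, n_models = np.inf, (), 0
    for size in range(1, p + 1):
        for cols in combinations(range(p), size):
            design = np.column_stack([np.ones(n), X[:, list(cols)]])
            score = gaussian_aic(design, y)
            n_models += 1
            if score < best_score:
                best_score, best_cols = score, cols
    return [names[i] for i in best_cols], n_models

# Synthetic data: only x1 and x2 actually drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
names = ["x1", "x2", "noise1", "noise2"]

best, n_models = best_subset(X, y, names)
print(best, n_models)
```

With 4 candidate predictors the loop fits $2^4 - 1 = 15$ models, which shows how quickly the cost grows as variables are added.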

We'll focus on using the backward elimination process in this course since it's generally the fastest of all of these methods.

## Building the model
### Importing the dataset

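Note that there is no need to scale the features for a linear regression fitted by ordinary least squares, since the coefficient for each variable adjusts to the scale of that feature. Support Vector Regression, by contrast, is sensitive to feature scale, so features are usually standardised before fitting an SVR model.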
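The scaling claim for ordinary least squares can be checked numerically. The sketch below uses synthetic data and an arbitrary pair of scale factors, both chosen only for illustration: rescaling a feature changes its coefficient by the inverse factor but leaves the predictions unchanged.

```python
import numpy as np

def fit_ols(X, y):
    """Fit OLS with an intercept; return the coefficients and fitted values."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

# Synthetic data with two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Arbitrary rescaling: multiply feature 0 by 1000 and feature 1 by 0.01.
scale = np.array([1000.0, 0.01])
coef_raw, pred_raw = fit_ols(X, y)
coef_scaled, pred_scaled = fit_ols(X * scale, y)

print(np.allclose(pred_raw, pred_scaled))
print(coef_raw[1:], coef_scaled[1:] * scale)
```

This invariance is specific to unregularised least squares. Models whose fit depends on distances between points or on a penalty on the coefficient size do not share it.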