In [1]:
import numpy as np
import pandas as pd

# Quiz 4 #

### Q.1. Support Vectors ###
Which of the following best describes support vectors?

* A linear combination of the input features
* A linear combination of the input samples (observations)
* All data points
* Data points that lie closest to the decision surface -> TRUE
* Data points that lie furthest away from the decision surface
* Features with the highest importance
* Features with the least importance (aka those which we could afford to get rid of)
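
To make the definition concrete, here is a minimal NumPy sketch on a hand-made toy set. It assumes the maximum-margin boundary is the line $x_1 = 1$ (this holds for the six hypothetical points below, because the closest pair across the two classes is (0, 0) and (2, 0)) and shows that the support vectors are the points nearest to that line.

```python
import numpy as np

# Hypothetical toy data: two linearly separable classes in 2D
X = np.array([[2, 0], [3, 1], [4, 0],      # class +1
              [0, 0], [-1, 1], [-2, 0]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# Assumed maximum-margin hyperplane w.x + b = 0, i.e. the line x1 = 1
w = np.array([1.0, 0.0])
b = -1.0

# Perpendicular distance of every point to the decision surface
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# Support vectors are the points sitting on the margin (minimum distance)
is_support = np.isclose(distances, distances.min())
support_vectors = X[is_support]

print(distances)         # [1. 2. 3. 1. 2. 3.]
print(support_vectors)   # rows [2, 0] and [0, 0]
```

Moving or deleting any of the other four points (without crossing the margin) would leave this boundary unchanged, which is why only the closest points matter.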

### Q.2. Linear SVM Classifier ###
True or false? For a linearly separable dataset with two classes, logistic regression and linear SVM will produce the same decision boundary.
* False 

Logistic regression passes a linear function of the features through a logistic (sigmoid) function to estimate the probability that a new entry falls in a class. It models the relationship between the categorical dependent variable and one or more independent variables, and its linear decision boundary is simply a consequence of thresholding that probability. Every training point influences the fitted coefficients.

A linear SVM does not estimate probabilities. It looks for the boundary with the maximum margin, which is determined only by the support vectors. Because the two methods optimize different objectives, they generally produce different boundaries, even when both separate the classes perfectly.
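
One way to see why the two boundaries differ is to compare the per-sample losses the two models minimize. The sketch below is an illustration only and assumes labels in {-1, +1}, so that `margin = y * f(x)`. The hinge loss (SVM) is exactly zero for points beyond the margin, while the logistic loss stays positive for every point.

```python
import numpy as np

def hinge_loss(margin):
    """SVM loss: zero once a point is beyond the margin (y * f(x) >= 1)."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: positive for every finite margin."""
    return np.log1p(np.exp(-margin))

# margin = y * f(x); larger means further on the correct side of the boundary
margins = np.array([-1.0, 0.0, 1.0, 3.0])
hinge = hinge_loss(margins)
logistic = logistic_loss(margins)

for m, h, l in zip(margins, hinge, logistic):
    print(f"margin={m:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
```

Points with zero hinge loss do not pull on the SVM boundary at all, whereas in logistic regression even well-classified points still contribute a small gradient.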

### Q.3. SVM Accuracy ###
You train a linear SVM classifier on three features. It learns the decision boundary given by the plane $3x_1 - 2x_2 + 5x_3 = 0$. The table below represents your testing data. What is the accuracy of your model on this data?
* 60%

In [2]:
test_data = pd.DataFrame(np.array([
         [-5, 2, 1, 0], 
         [3, 17, 11, 0], 
         [16, 18, -2, 1],
         [4, -2, -9, 0],
         [-5, 4, -10, 1]]),
        columns = ['x1', 'x2', 'x3', 'y'])
test_features = test_data.drop('y', axis = 1)
test_labels = test_data['y']

# Signed score of each sample relative to the learned plane 3*x1 - 2*x2 + 5*x3 = 0
scores = test_features.values @ np.array([3, -2, 5])
# Samples on the non-negative side of the plane are assigned to class 1
model_predicts = (scores >= 0).astype(int)

# Accuracy = correct predictions (TP + TN) / all predictions
accuracy = np.mean(model_predicts == test_labels.values)
print(f"The classification accuracy is {accuracy * 100:.0f}%.")

The classification accuracy is 60%.
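
The same result can be checked by hand. Evaluating $3x_1 - 2x_2 + 5x_3$ for each test row, and assuming (as the code does) that a non-negative value means class 1:

$$
\begin{aligned}
3(-5) - 2(2) + 5(1) &= -14 < 0 &&\Rightarrow \hat{y} = 0,\ y = 0 \ \text{(correct)} \\
3(3) - 2(17) + 5(11) &= 30 \ge 0 &&\Rightarrow \hat{y} = 1,\ y = 0 \ \text{(wrong)} \\
3(16) - 2(18) + 5(-2) &= 2 \ge 0 &&\Rightarrow \hat{y} = 1,\ y = 1 \ \text{(correct)} \\
3(4) - 2(-2) + 5(-9) &= -29 < 0 &&\Rightarrow \hat{y} = 0,\ y = 0 \ \text{(correct)} \\
3(-5) - 2(4) + 5(-10) &= -73 < 0 &&\Rightarrow \hat{y} = 0,\ y = 1 \ \text{(wrong)}
\end{aligned}
$$

Three of the five predictions match the labels, which gives $3 / 5 = 60\%$.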


### Q.4. RBF Kernel ###
Given a training set with 3 features and 10 examples, you train an SVM with a Gaussian (RBF) kernel. How many dimensions does the kernel function "project" the data into?

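* Infinitely many

The Gaussian (RBF) kernel $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ corresponds to an inner product in an infinite-dimensional feature space, regardless of the number of input features or training examples. The mapping is never computed explicitly; only the kernel values between pairs of samples are needed (the kernel trick).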

Cover's theorem states that given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some non-linear transformation.
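
A short NumPy sketch can illustrate why the feature count alone does not bound the dimensionality of the kernel's feature space. Using 10 randomly generated (hypothetical) examples with 3 features and an assumed $\gamma = 1$, the linear Gram matrix has rank at most 3, whereas the RBF Gram matrix of distinct points has full rank, so the implicit feature space must have at least as many dimensions as there are examples. The kernel is evaluated directly on pairs of samples; the feature map itself is never built.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 hypothetical examples, 3 features

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2), with no explicit feature map."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

K_rbf = rbf_kernel(X, X)   # 10 x 10 Gram matrix
K_lin = X @ X.T            # linear-kernel Gram matrix, for comparison

rank_lin = np.linalg.matrix_rank(K_lin)
rank_rbf = np.linalg.matrix_rank(K_rbf)
print(K_rbf.shape, rank_lin, rank_rbf)
```

The rank comparison is only a lower bound on the dimensionality; adding more distinct examples keeps raising the rank of the RBF Gram matrix.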

### Q.5. Hyperparameters ###
Which of the following statements about SVMs is / are true?
* A polynomial kernel can accept fractional degrees (e.g. 1.5) which allows us to use roots (square root, etc.) of feature columns -> TRUE (tested with sklearn)
* Like logistic regression, an SVM classifier outputs the probability that each sample belongs to a certain class -> FALSE
* A polynomial kernel can be of any degree > 1
* Using a linearly separable dataset, a gaussian SVM will always lead to overfitting
* Decreasing C leads to more training error -> TRUE
* A polynomial kernel can be at most of degree 3 (cubic) -> FALSE
* Increasing C leads to more training error -> FALSE
* Decreasing C leads to a "wider margin" -> TRUE
* Increasing C leads to a "wider margin" -> FALSE

Notes:

From lectures: C is the penalty for misclassification ($C = \frac{1}{\lambda}$); a smaller value is less strict, which means more regularization.

The C parameter controls how much you want to punish your model for each misclassified point for a given curve:

| Large values of C | Small values of C |
|------|------|
| Large effect of noisy points | Low effect of noisy points |
| A plane with very few misclassifications is given precedence | Planes that separate the points well are found, even if there are some misclassifications |
| Less training error | More training error |
| Narrower margin | Wider margin |
| More strict | Less strict |
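
The behaviour summarized in the table follows from the soft-margin objective. In its standard textbook form (the notation $w$, $b$, $\xi_i$ is an assumption here, not taken from the lectures):

$$
\min_{w,\, b,\, \xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0
$$

The margin width is $2 / \lVert w \rVert$. A large $C$ makes the slack terms $\xi_i$ expensive, so the optimizer accepts a larger $\lVert w \rVert$ (a narrower margin) to reduce training errors. A small $C$ lets the $\frac{1}{2}\lVert w \rVert^2$ term dominate, which gives a wider margin at the cost of more margin violations.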


### Q.6. k-Nearest Neighbors ###
Which of the following is / are true about kNN?
* Decreasing k leads to higher bias -> FALSE (a smaller k lowers bias and raises variance, i.e. local overfitting)
* kNN is much more computationally expensive to train than to predict new data points -> FALSE
* kNN can only be used for clustering -> FALSE
* kNN can be used for value imputation -> TRUE
* kNN can only describe the training set; it cannot predict new data points -> FALSE
* k has to be a number strictly greater than 1 -> FALSE (we can have kNN with k = 1)
* kNN can only be used for classification -> FALSE

Note:
kNN can be used for classification, regression, Voronoi tiling/tessellation, etc.
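
As a rough illustration of several points from this question, here is a minimal NumPy sketch of kNN (not a library implementation; the data and function names are hypothetical). "Training" amounts to storing the data, all the work happens at prediction time, `k = 1` is valid, and the same neighbour search serves both classification and regression (the regression variant is also the basic idea behind kNN imputation).

```python
import numpy as np
from collections import Counter

def k_nearest(X_train, x, k):
    """Indices of the k training points closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(dists, kind="stable")[:k]

def knn_classify(X_train, y_train, x, k=1):
    """Majority vote among the k nearest neighbours."""
    idx = k_nearest(X_train, x, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]

def knn_regress(X_train, y_train, x, k=1):
    """Mean target of the k nearest neighbours (usable to impute a missing value)."""
    idx = k_nearest(X_train, x, k)
    return float(np.mean(y_train[idx]))

# Hypothetical toy data: "training" is nothing more than storing these arrays
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
targets = np.array([1.0, 1.2, 0.8, 10.0, 10.5, 9.5])

query = np.array([0.2, 0.1])
label_k1 = knn_classify(X_train, labels, query, k=1)   # k = 1 is allowed
label_k3 = knn_classify(X_train, labels, query, k=3)
value_k3 = knn_regress(X_train, targets, query, k=3)
print(label_k1, label_k3, value_k3)
```

With `k = 1` the prediction depends on a single neighbour, so it follows the training data very closely (low bias, high variance); a larger `k` averages over more neighbours and smooths the prediction.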

### Q.7. Anomaly Detection ###
True or false? When training a learning algorithm to perform anomaly detection, we need at least one observation of the anomalous class.
* True
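* False -> correct answer (detectors such as a one-class SVM or an isolation forest can be fit on normal observations only)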
* Anomaly detection can never be a supervised learning task
* True only if we perform feature selection; false otherwise
* True only if we use an ensemble of algorithms (e.g. random forest, AdaBoost); false otherwise
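
As an illustration of why anomalous examples are not required for training, the sketch below fits a simple per-feature Gaussian model on normal observations only and flags new points that deviate strongly from it. The data, the z-score rule, and the threshold are all assumptions chosen for illustration, not a method prescribed by the quiz.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical training set: normal observations only, no anomalies
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# "Training" = estimating what normal data looks like
mu = X_normal.mean(axis=0)
sigma = X_normal.std(axis=0)

def is_anomaly(x, threshold=4.0):
    """Flag x if any feature lies more than `threshold` standard deviations from the mean.
    The threshold of 4 is an arbitrary assumption for this sketch."""
    z = np.abs((np.asarray(x, dtype=float) - mu) / sigma)
    return bool(np.max(z) > threshold)

typical_flag = is_anomaly([0.1, -0.2])   # close to the training distribution
outlier_flag = is_anomaly([8.0, 8.0])    # far from anything seen in training
print(typical_flag, outlier_flag)
```

Labelled anomalies, when available, are still useful for choosing the threshold and for evaluation, but the model itself is fit without them.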