# Minimize Empirical Risk

Empirical risk minimization combines two ingredients:

* a loss function to minimize
* a regularizer

These notes start from the unconstrained SVM objective and explain how it can be reinterpreted as an instance of empirical risk minimization.

## Soft margin SVM

$$\min_{w,b}\; C \sum_{i=1}^{n} \max\left(1 - y_i (w^\top x_i + b),\, 0\right) + \lVert w \rVert_2^2$$

Here $x_i$ and $y_i$ are the i-th example and label from `xTr` and `yTr`.

hinge loss: $\sum_{i=1}^{n} \max\left(1 - y_i (w^\top x_i + b),\, 0\right)$

L2 regularizer: $\lVert w \rVert_2^2$ (squared norm of w)
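The soft-margin objective can be sketched in NumPy. This is a minimal sketch, assuming `xTr` is an n×d array with one example per row and `yTr` holds labels in {-1, +1}; the function name `svm_objective` and the toy data are made up for illustration.

```python
import numpy as np

def svm_objective(w, b, xTr, yTr, C):
    """Soft-margin SVM objective: C * (sum of hinge losses) + squared L2 norm of w."""
    margins = yTr * (xTr @ w + b)        # y_i * (w . x_i + b) for every example
    hinge = np.maximum(1 - margins, 0)   # per-example hinge loss
    return C * hinge.sum() + w @ w       # loss term + L2 regularizer

# Toy data (hypothetical): two points per class
xTr = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]])
yTr = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.5])
b = 0.0

value = svm_objective(w, b, xTr, yTr, C=1.0)
print(value)
```

With these toy values every point sits on or beyond the margin, so the hinge term is zero and only the regularizer contributes.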


**This (loss function + regularizer) is the empirical risk minimization framework**

Simplify: 

$$\min_{w,b}\; \frac{C}{n} \sum_{i=1}^{n} \ell\left(h_w(x_i),\, y_i\right) + r(w)$$

loss: $\sum_{i=1}^{n} \ell\left(h_w(x_i),\, y_i\right)$, where $h_w(x_i) = w^\top x_i + b$ is the prediction

r: regularizer

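The trade-off constant is often written as λ (lambda) and placed in front of the regularizer instead of the loss. The two constants play opposite roles: C multiplies the loss, so a large C means we care mostly about the loss. λ multiplies the regularizer, so when λ is large the regularizer is important, and when λ is small we care mostly about the loss. The notes below call this constant "C (lambda)".

In that case

$$\min_{w,b}\; \frac{1}{n} \sum_{i=1}^{n} \ell\left(h_w(x_i),\, y_i\right) + \lambda\, r(w)$$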
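A minimal sketch of this framework in NumPy, for the form where the trade-off constant multiplies the regularizer. It assumes one example per row of `xTr` and labels in {-1, +1}. The names `empirical_risk`, `hinge`, `l2`, and `lam` are illustrative choices, not from the notes; any loss and regularizer with the same signatures could be plugged in.

```python
import numpy as np

def empirical_risk(w, b, xTr, yTr, loss, regularizer, lam):
    """Average loss over the training set plus lam times the regularizer."""
    predictions = xTr @ w + b
    return loss(predictions, yTr).mean() + lam * regularizer(w)

def hinge(pred, y):
    """Per-example hinge loss."""
    return np.maximum(1 - pred * y, 0)

def l2(w):
    """Squared L2 norm of the weights."""
    return w @ w

# Toy data (hypothetical)
xTr = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
yTr = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.5])

small_lam = empirical_risk(w, 0.0, xTr, yTr, hinge, l2, lam=0.1)
large_lam = empirical_risk(w, 0.0, xTr, yTr, hinge, l2, lam=10.0)
print(small_lam, large_lam)
```

The average hinge loss is the same in both calls; only the weight on the regularizer changes the total, which is why a large constant in front of `r(w)` pushes the optimizer toward small weights.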

## To avoid overfitting

When the dataset is small, make C (lambda) large (so: high variance means the model is overfitting, increase lambda)

When the dataset is large, make C (lambda) smaller (so: if variance is already low and the model is underfitting, decrease lambda)


## Types of Loss Functions

### hinge loss

$\max\left(1 - h_w(x_i)\, y_i,\; 0\right)^p$

With p = 1 this is the standard hinge loss; with p = 2 it is the squared hinge loss.

### squared hinge loss

Pro: easier to optimize (it is differentiable everywhere)

Con: Doesn't handle mislabeled data well (large margin violations are squared)

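### log loss

$\log\left(1 + e^{-h_w(x_i)\, y_i}\right)$

Pro: It's the loss used by logistic regression, so the model outputs can be interpreted as probabilities

Pro: Handles mislabeled data better than exponential or squared hinge loss, because the penalty grows only about linearly for badly misclassified points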

### Exponential loss

$e^{-h_w(x_i)\, y_i}$

Pro: Can create a strong classifier with few iterations (as in AdaBoost)

Con: Does not handle mislabeled data well 

### Zero One Loss

For each data point, is the sign of the prediction the same as the sign of the label?

* If yes, loss == 0
* If no, loss == 1

Used to view training and testing error 
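The classification losses listed above can all be written as functions of the margin `z = h_w(x) * y`. The sketch below assumes labels in {-1, +1}; the function names are made up for illustration, and treating `z == 0` as an error in the zero-one loss is a convention chosen here.

```python
import numpy as np

def hinge_loss(z, p=1):
    """Hinge loss (p=1) or squared hinge loss (p=2) of the margin z."""
    return np.maximum(1 - z, 0) ** p

def log_loss(z):
    """Log loss: log(1 + exp(-z))."""
    return np.log1p(np.exp(-z))

def exponential_loss(z):
    """Exponential loss: exp(-z)."""
    return np.exp(-z)

def zero_one_loss(z):
    """1 where prediction and label disagree in sign (z <= 0), else 0."""
    return (z <= 0).astype(float)

# Margins for four hypothetical points: three correct, one misclassified
z = np.array([2.0, 1.0, 0.5, -1.0])

print("hinge        ", hinge_loss(z))
print("squared hinge", hinge_loss(z, p=2))
print("log          ", log_loss(z))
print("exponential  ", exponential_loss(z))
print("zero-one     ", zero_one_loss(z))
print("training error:", zero_one_loss(z).mean())
```

Averaging the zero-one loss over a dataset gives the error rate, which is why it is used for reporting even though the other losses are the ones actually optimized.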




## Hinge Loss:

Pros: It's typically used for Support Vector Machines (SVM) and is fairly robust to outliers, because the penalty grows only linearly with the size of the margin violation. It doesn't penalize misclassifications as harshly when the prediction is close to the decision boundary, which can be beneficial in certain situations.

Cons: Hinge loss doesn't provide a probabilistic interpretation because its output isn't a probability. In the presence of noisy (mislabelled) data, it might lead to a higher misclassification rate because it cares about getting things right up to a margin, but no further.

## Squared Hinge Loss:

Pros: Squared hinge loss is a modified version of hinge loss where the error is squared. It penalizes large margin violations more heavily than the regular hinge loss and is differentiable everywhere, which can lead to better performance when outliers are infrequent.

Cons: The squared term can lead to more sensitivity to outliers. When there's a lot of noise in the data, squared hinge loss can lead to worse performance due to this heightened sensitivity.

## Log Loss (Cross-Entropy Loss):

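Pros: Log loss is a standard loss function for binary classification problems, and models trained with it provide probabilities as output. It is sensitive to the confidence of each prediction, not only to whether the predicted class is right, which is useful when probability estimates matter (for example, on imbalanced datasets).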

Cons: It can be heavily impacted by outliers or mislabelled data because it penalizes confident and wrong predictions harshly.

## Exponential Loss:

Pros: Exponential loss is used in boosting algorithms like AdaBoost. It places higher emphasis on misclassified points, making the model focus more on difficult instances.

Cons: Due to its nature, it is very sensitive to outliers or mislabelled data. A noisy dataset could significantly deteriorate the model's performance.

## Zero-One Loss:

Pros: Zero-One Loss gives the total number of misclassifications, making it easy to interpret.

Cons: It’s non-convex and discontinuous, and its gradient is zero wherever it exists, so it can’t be used in optimization algorithms that require the gradient. Additionally, it doesn't consider the confidence of predictions, and misclassifications close to the decision boundary are penalized the same as those far away. In the presence of noise, each mislabelled point adds at most 1 to the loss, but the loss is hard to minimize directly, so in practice it is used for evaluation rather than training.
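To make the outlier comparisons above concrete, the sketch below evaluates each loss on a single confidently wrong prediction. The margin value `-5.0` is an arbitrary illustrative choice, standing in for something like a mislabelled point far from the boundary.

```python
import numpy as np

z = -5.0  # margin h_w(x) * y of a confidently wrong prediction (hypothetical value)

# Penalties listed from mildest to harshest for this point
penalties = {
    "zero-one": 1.0 if z <= 0 else 0.0,
    "log": float(np.log1p(np.exp(-z))),
    "hinge": max(1 - z, 0),
    "squared hinge": max(1 - z, 0) ** 2,
    "exponential": float(np.exp(-z)),
}

for name, value in penalties.items():
    print(f"{name:>14}: {value:.3f}")
```

Zero-one loss is capped at 1, log and hinge loss grow roughly linearly in the margin violation, while squared hinge and exponential loss grow much faster, which is the source of their sensitivity to mislabelled data.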

# Loss Functions for Regression

## squared loss

$\left(h(x_i) - y_i\right)^2$

Pro: Predicts the mean label of similar examples, making it good for noisy data (like predicting housing prices)

Con: Outliers can warp predictions 

## absolute loss

$\lvert h(x_i) - y_i \rvert$

Con: Not differentiable at 0 (which has to be mitigated via the optimizer)

## Huber Loss

absolute difference if difference is large

squared difference if difference is small

Pro: Differentiable everywhere, including at 0 (unlike absolute loss)

Pro: Good with noisy data
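The three regression losses can be sketched as follows. The Huber form used here, with a threshold `delta` and a factor of 1/2 on the quadratic branch, is one common parameterization and is an assumption; the notes only say "squared if small, absolute if large". Function names and data are illustrative.

```python
import numpy as np

def squared_loss(pred, y):
    return (pred - y) ** 2

def absolute_loss(pred, y):
    return np.abs(pred - y)

def huber_loss(pred, y, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it (assumed parameterization)."""
    err = np.abs(pred - y)
    return np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))

# Hypothetical predictions and labels; the last pair is a large error
pred = np.array([1.0, 2.0, 10.0])
y = np.array([1.5, 2.0, 0.0])

print("squared :", squared_loss(pred, y))
print("absolute:", absolute_loss(pred, y))
print("huber   :", huber_loss(pred, y))
```

On the large error the squared loss is an order of magnitude bigger than the other two, while on the small error Huber follows the squared loss.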

## Squared Loss (L2 Loss): 

This is the most commonly used loss function for regression. It is defined as the square of the difference between the actual and predicted values. The loss increases quadratically with the difference, making it sensitive to outliers.

Pros: It's differentiable, making it easier to compute the gradient for optimization. It also heavily penalizes large errors due to the squaring, which can be desirable in some applications.

Cons: It's sensitive to outliers or noise in the data. A single bad data point can greatly influence the model's parameters. This is because squaring the error amplifies the effect of large errors.

## Absolute Loss (L1 Loss): 

This loss function is defined as the absolute difference between the actual and predicted values.

Pros: It's less sensitive to outliers and noise as compared to squared loss because it doesn't square the error. Therefore, it can be a better choice if your data contains many outliers.

Cons: It's not differentiable at zero, which can make the optimization harder. Also, it might not be the best choice for data where we need to heavily penalize large errors, as it treats small and large errors linearly.

## Huber Loss: 

This loss function can be seen as a combination of squared and absolute loss. For small errors, it behaves as a squared loss and for large errors, it behaves as an absolute loss.

Pros: It combines the best properties of the L1 and L2 loss. It's differentiable and less sensitive to outliers. It also transitions smoothly between the small-error and large-error cases, so it doesn't have the kink (non-differentiable point) that L1 loss has at zero.

Cons: It introduces an additional parameter (the threshold between the L1 and L2 behaviors), which needs to be tuned.

As for mislabelled data in a regression context, this is equivalent to outliers or noise in the target variable. Absolute loss and Huber loss handle these situations better than squared loss due to their reduced sensitivity to large errors.

Regarding small losses, squared loss tends to de-emphasize them (since the square of a number less than 1 is smaller than the original number), while absolute loss treats them linearly. Huber loss behaves like squared loss for small losses, but this depends on the threshold parameter.
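A small numerical check of the outlier behaviour described above: when the best single constant prediction is searched for, squared loss lands on the mean of the labels and absolute loss lands on the median. The labels and the grid search are illustrative assumptions, chosen only to keep the sketch short.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # hypothetical labels; the last is an outlier

# Try every constant prediction on a fine grid and average each loss over the labels
candidates = np.linspace(0.0, 100.0, 10001)  # step of 0.01
sq_risk = ((candidates[:, None] - y) ** 2).mean(axis=1)
abs_risk = np.abs(candidates[:, None] - y).mean(axis=1)

best_sq = candidates[sq_risk.argmin()]
best_abs = candidates[abs_risk.argmin()]

print("best constant under squared loss :", best_sq, "(mean   =", y.mean(), ")")
print("best constant under absolute loss:", best_abs, "(median =", np.median(y), ")")
```

The single outlier drags the squared-loss prediction far from the bulk of the data, while the absolute-loss prediction stays with the majority of the labels.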

# l1 vs l2 Regularizer

Regularization is a technique used to prevent overfitting by adding an additional penalty term to the loss function that constrains the complexity of the model. This penalty term encourages the model to have smaller weights, leading to simpler models that generalize better to unseen data. 

## l1 regularizer 

The L1 regularizer adds the sum of the absolute values of the weights (the L1 norm, i.e. the Manhattan distance of w from the origin) to the loss function.

This has the effect of driving some of the weights to zero, resulting in a sparse model where some of the input features are effectively ignored. This can be useful if you believe that many input features are irrelevant or if you want a model that is easier to interpret.

Constraint form: $\lVert w \rVert_1 = \sum_j \lvert w_j \rvert \le B$, where B is the Budget

L1 is used in the gene sequencing example. When the Budget is low, it filters down to the genes responsible for the condition

The constraint region is a "diamond" around the origin that the weights must stay inside

## l2 regularizer 

The L2 regularizer adds the sum of the squares of the weights (the squared Euclidean norm of w) to the loss function.

Constraint form: $\lVert w \rVert_2^2 = \sum_j w_j^2 \le B$, where B is the Budget

This encourages the model to have small weights, but unlike L1 regularization, it does not drive them to zero. This means that all input features will typically be used by the model, but their impact will be moderated.

L2 regularization is more stable than L1 regularization, but it may include unnecessary features and result in a model that is harder to interpret.

L2 is used in SVM

The constraint region is a "circle" around the origin that the weights must stay inside
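One way to see why L1 produces exact zeros while L2 only shrinks is to look at a single weight at a time. The sketch below is an illustrative simplification, not from the notes: it solves the one-dimensional problems `0.5*(w - a)**2 + lam*|w|` and `0.5*(w - a)**2 + lam*w**2` in closed form, where `a` stands for the unregularized weight. The names `l1_shrink` and `l2_shrink` are made up.

```python
import numpy as np

def l1_shrink(a, lam):
    """argmin over w of 0.5*(w - a)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_shrink(a, lam):
    """argmin over w of 0.5*(w - a)**2 + lam*w**2 (uniform shrinkage)."""
    return a / (1.0 + 2.0 * lam)

a = np.array([3.0, 0.4, -0.2, -2.0])   # hypothetical unregularized weights
lam = 0.5

w_l1 = l1_shrink(a, lam)
w_l2 = l2_shrink(a, lam)

print("L1:", w_l1, "nonzero:", np.count_nonzero(w_l1))
print("L2:", w_l2, "nonzero:", np.count_nonzero(w_l2))
```

Under L1, any weight whose magnitude is below `lam` is set exactly to zero; under L2, every weight is scaled by the same factor and none reaches zero.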





## Combinations of loss function / regularizer

### Squared Loss + L2 Regularization (Ridge Regression): 

This is a common combination used in linear regression problems. The squared loss function gives a measure of the model's prediction error, and the L2 regularizer helps prevent overfitting by penalizing large weights. This combination leads to a smooth loss surface and a unique solution: both terms are differentiable and convex, and the squared L2 norm is strictly convex, so the sum has exactly one minimizer.
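Because the ridge objective is smooth and has a unique minimizer, it can be solved in closed form. The sketch below assumes the objective `||X w - y||^2 + lam * ||w||^2` with no bias term and uses synthetic data; `ridge_fit` is a hypothetical helper name.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 by solving (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (assumption): linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_small = ridge_fit(X, y, lam=0.01)
w_large = ridge_fit(X, y, lam=100.0)

# The gradient of the objective should vanish at the solution
grad = 2 * X.T @ (X @ w_small - y) + 2 * 0.01 * w_small

print("lam=0.01 :", w_small, "norm", np.linalg.norm(w_small))
print("lam=100  :", w_large, "norm", np.linalg.norm(w_large))
```

Increasing `lam` shrinks the norm of the solution, which is the "penalizing large weights" effect described above.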

### Squared Loss + L1 Regularization (Lasso Regression): 

Lasso regression is also commonly used in linear regression problems. The L1 regularizer induces sparsity in the solution, making it useful for feature selection.

### Hinge Loss + L2 Regularization (SVM): 

This combination is used in Support Vector Machines. The hinge loss function is suitable for binary classification tasks, and the L2 regularizer helps prevent overfitting.
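As a rough illustration of this combination, the sketch below trains a linear SVM by plain subgradient descent on `C * sum(hinge) + ||w||^2`. Subgradient descent with a fixed step size is only one simple option chosen here for brevity, not necessarily how SVMs are trained in practice; `train_linear_svm`, the step size, and the toy data are all assumptions.

```python
import numpy as np

def train_linear_svm(xTr, yTr, C=1.0, lr=0.1, steps=200):
    """Subgradient descent on C * sum(hinge) + ||w||^2 (illustrative only)."""
    n, d = xTr.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = yTr * (xTr @ w + b)
        active = margins < 1                      # points violating the margin
        # Subgradient: regularizer term minus contribution of the active points
        grad_w = 2 * w - C * (xTr[active].T @ yTr[active])
        grad_b = -C * yTr[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data (hypothetical)
xTr = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
yTr = np.array([1.0, 1.0, -1.0, -1.0])

w, b = train_linear_svm(xTr, yTr)
predictions = np.sign(xTr @ w + b)
print("w =", w, "b =", b)
print("training error:", np.mean(predictions != yTr))
```

Only points with margin below 1 contribute to the loss gradient; the remaining points affect the update only through the regularizer.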

### Cross-Entropy Loss + L2 Regularization (Logistic Regression): 

Logistic regression uses the cross-entropy loss function, which is suitable for binary and multi-class classification problems. Adding an L2 regularizer helps prevent overfitting.
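For the binary case with labels in {-1, +1}, the link between the log loss and probabilities can be checked numerically: the log loss of a point equals the negative log of the probability that the sigmoid assigns to its true label. The scores and labels below are made-up values.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

h = np.array([-2.0, 0.0, 3.0])   # raw scores h_w(x) for three hypothetical points
y = np.array([-1.0, 1.0, 1.0])   # labels in {-1, +1}

p_positive = sigmoid(h)          # modelled P(y = +1 | x)
p_true = sigmoid(y * h)          # probability assigned to the actual label
log_loss = np.log1p(np.exp(-y * h))

print("P(y=+1|x):", p_positive)
print("log loss :", log_loss)
print("-log P(true label):", -np.log(p_true))
```

The last two printed rows agree, so minimizing the log loss is the same as maximizing the likelihood of the observed labels.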

### Cross-Entropy Loss + L1 Regularization: 

This combination can be useful in situations where feature selection is important, as the L1 regularizer can induce a sparse solution, reducing the importance of less informative features.

### Huber Loss + L2/L1 Regularization (Robust Regression): 

Huber loss is less sensitive to outliers than squared loss. It's a combination of squared loss for smaller errors and absolute loss for larger errors. Huber Loss combined with L1 or L2 regularization can provide robust regression models which are less influenced by outliers.