# LLMs (Background)

Recall Mitchell's definition of a machine learning algorithm:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

For LLMs, as the name suggests, the class of tasks T concerns language.


## Performance

Loss functions quantify how well our model's predictions align with the actual observed values, providing a measure of the model's performance.

## Why We Need Loss Functions

1. **Quantification of Error**: Loss functions give us a numerical measure of how far off our predictions are from the true values.

2. **Optimization Target**: They provide a clear objective for optimization algorithms to minimize, guiding the learning process.

3. **Model Comparison**: Loss functions allow us to compare different models or model configurations objectively.

4. **Problem-Specific Evaluation**: Different problems call for different notions of error, so the loss function can be chosen to match the problem at hand.

## Categorization of Errors

Errors in machine learning can be categorized in several ways:

1. **By Problem Type**:
   - Regression errors
   - Classification errors

2. **By Error Magnitude**:
   - Absolute error
   - Squared error

3. **By Error Direction**:
   - Overestimation
   - Underestimation

4. **By Importance**:
   - Weighted errors
   - Unweighted errors
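The categories above can be made concrete with a few lines of code. The following is a minimal sketch, not a standard API: the function name `describe_errors` and the dictionary keys are hypothetical and chosen only for illustration. It computes, per example, the signed error (direction), the absolute and squared errors (magnitude), and a weighted absolute error (importance).

```python
def describe_errors(y_true, y_pred, weights=None):
    """Return per-example error descriptions (illustrative helper, not a library function)."""
    if weights is None:
        # Unweighted case: every example counts equally.
        weights = [1.0] * len(y_true)
    rows = []
    for y, y_hat, w in zip(y_true, y_pred, weights):
        signed = y_hat - y  # > 0: overestimation, < 0: underestimation
        if signed > 0:
            direction = "over"
        elif signed < 0:
            direction = "under"
        else:
            direction = "exact"
        rows.append({
            "signed": signed,
            "absolute": abs(signed),
            "squared": signed ** 2,
            "weighted_absolute": w * abs(signed),
            "direction": direction,
        })
    return rows


rows = describe_errors([1.0, 2.0], [3.0, 1.5], weights=[1.0, 2.0])
```

Here the first prediction overestimates by 2.0 and the second underestimates by 0.5; the second example's weight of 2.0 doubles its contribution to a weighted total.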

## Major Loss Functions

### 1. Mean Squared Error (MSE)

**Use Case**: Regression problems

**Formula**: $MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

**Explanation**: MSE calculates the average squared difference between predicted and actual values. It heavily penalizes large errors due to squaring.

**Pros**: 
- Differentiable
- Penalizes larger errors more

**Cons**: 
- Sensitive to outliers
- Not robust to label noise
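As a minimal sketch of the MSE formula in plain Python (the function name `mean_squared_error` is chosen here for illustration and is not tied to any particular library), the example below also shows how a single large error dominates the result.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared) / len(squared)


# Two perfect predictions and one that is off by 3:
# the squared errors are 0, 0, and 9, so the mean is 3.0.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
```

The single error of 3 contributes 9 to the sum, which is why outliers and mislabeled targets can pull an MSE-trained model strongly toward themselves.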

### 2. Mean Absolute Error (MAE)

**Use Case**: Regression problems

**Formula**: $MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$

**Explanation**: MAE calculates the average absolute difference between predicted and actual values.

**Pros**: 
- Less sensitive to outliers than MSE
- Easier to interpret

**Cons**: 
- Not differentiable where the error is exactly zero
- The minimizer may not be unique
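A minimal sketch of MAE in plain Python follows; the function name `mean_absolute_error` and the toy data are assumptions made for illustration. The comparison at the end uses the fact that a constant prediction equal to the median of the targets minimizes MAE, which is one way to see its robustness to outliers.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


targets = [1.0, 2.0, 100.0]  # 100.0 plays the role of an outlier
mean_value = sum(targets) / len(targets)

# Predicting the median (2.0) everywhere versus predicting the mean everywhere.
median_fit = mean_absolute_error(targets, [2.0] * len(targets))
mean_fit = mean_absolute_error(targets, [mean_value] * len(targets))
```

With these targets, the median-based prediction gives the lower MAE, because the outlier drags the mean far away from the two typical values.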

### 3. Huber Loss

**Use Case**: Regression problems, especially with outliers

**Formula**: 
$L_\delta(y, \hat{y}) = \begin{cases}
    \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta \\
    \delta(|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise}
\end{cases}$

**Explanation**: Huber loss combines the best properties of MSE and MAE. It's quadratic for small errors and linear for large errors.

**Pros**: 
- Robust to outliers
- Differentiable everywhere

**Cons**: 
- Requires tuning of the $\delta$ parameter
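The piecewise formula translates directly into code. This is a per-example sketch under the assumption of scalar inputs; the function name `huber_loss` and the default `delta=1.0` are illustrative choices, not values prescribed by the text.

```python
def huber_loss(y, y_hat, delta=1.0):
    """Huber loss for one example: quadratic near zero, linear in the tails."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    residual = abs(y - y_hat)
    if residual <= delta:
        # Small error: behaves like (half of) the squared error.
        return 0.5 * residual ** 2
    # Large error: grows linearly, like the absolute error.
    return delta * (residual - 0.5 * delta)


small = huber_loss(0.0, 0.5)   # quadratic branch
large = huber_loss(0.0, 3.0)   # linear branch
```

At a residual exactly equal to $\delta$, both branches give $\frac{1}{2}\delta^2$, so the loss is continuous where the two pieces meet.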

### 4. Binary Cross-Entropy

**Use Case**: Binary classification problems

**Formula**: $BCE = -\frac{1}{n} \sum_{i=1}^n [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$

**Explanation**: Measures the performance of a classification model whose output is a probability value between 0 and 1.

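**Pros**: 
- Trains the model to output probabilities rather than only hard labels
- Pairs naturally with a sigmoid output layer

**Cons**: 
- Can be numerically unstable for predictions near 0 or 1
- Without class weighting, the majority class can dominate the loss on imbalanced datasets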
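A minimal sketch of binary cross-entropy in plain Python is given below. The clipping constant `eps` is an assumption added here to guard against `log(0)`; it is one common workaround for the instability near 0 and 1, not the only one (computing the loss from raw logits is another).

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; y_true holds 0/1 labels, y_pred holds probabilities."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability away from exactly 0 and 1 so both logs stay finite.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# A maximally uncertain prediction (0.5) costs log(2) per example.
uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
```

A confident wrong prediction, such as probability 0.0 for a positive example, yields a large but finite loss thanks to the clipping.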

### 5. Categorical Cross-Entropy

**Use Case**: Multi-class classification problems

**Formula**: $CCE = -\sum_{i=1}^n \sum_{j=1}^m y_{ij} \log(\hat{y}_{ij})$

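**Explanation**: Generalizes binary cross-entropy to $m$ classes. Here $n$ is the number of examples, $y_{ij}$ is 1 if example $i$ belongs to class $j$ and 0 otherwise, and $\hat{y}_{ij}$ is the predicted probability of class $j$ for example $i$. The loss measures the dissimilarity between the true distribution and the predicted distribution. As written, it sums over examples; dividing by $n$ gives the per-example average used in the BCE formula above.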

**Pros**: 
- Suitable for multi-class problems
- Works well with softmax activation

**Cons**: 
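- Can be numerically unstable when a predicted probability is close to 0
- Sensitive to mislabeled examples, since a confident wrong prediction is penalized heavily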
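The following sketch computes categorical cross-entropy from raw scores (logits) rather than from probabilities. Combining softmax and the logarithm into a single log-softmax step, with the maximum subtracted first, is a common way to keep the computation numerically stable. The function names are illustrative, and the loss is summed over examples to match the formula; divide by the number of examples for a mean.

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed stably via log-sum-exp."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def categorical_cross_entropy(one_hot_targets, logits_batch):
    """Sum over examples of -sum_j y_ij * log(softmax(logits_i)_j)."""
    total = 0.0
    for targets, logits in zip(one_hot_targets, logits_batch):
        log_probs = log_softmax(logits)
        total -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total


# Equal logits over three classes give a uniform prediction, costing log(3).
uniform = categorical_cross_entropy([[1, 0, 0]], [[0.0, 0.0, 0.0]])
```

Because the exponentials are taken of differences from the maximum logit, even a logit of 1000 does not overflow.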

### 6. Hinge Loss

**Use Case**: Support Vector Machines, margin-based classification

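**Formula**: $L(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y})$, where $y \in \{-1, +1\}$ is the true label and $\hat{y}$ is the classifier's raw (unthresholded) score.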

**Explanation**: Penalizes predictions that are incorrect or not confident enough. Used in maximum-margin classifiers.

**Pros**: 
- Encourages larger margins between classes
- Works well for binary classification

**Cons**: 
- Not probabilistic
- Can be sensitive to class imbalance
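A per-example sketch of hinge loss follows. It assumes labels encoded as -1 or +1 and a raw real-valued score as the prediction; the function name `hinge_loss` is chosen for illustration.

```python
def hinge_loss(y, score):
    """Hinge loss for one example with label y in {-1, +1} and a raw score."""
    if y not in (-1, 1):
        raise ValueError("label must be -1 or +1")
    # The loss is zero only when the score is on the correct side with margin >= 1.
    return max(0.0, 1.0 - y * score)


confident_correct = hinge_loss(1, 2.0)   # margin satisfied
weakly_correct = hinge_loss(1, 0.5)      # correct side, but inside the margin
wrong = hinge_loss(-1, 0.5)              # wrong side of the boundary
```

The second case shows the "not confident enough" penalty: the prediction has the right sign, yet it still incurs a loss because it falls inside the margin.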

### 7. Focal Loss

**Use Case**: Imbalanced classification problems

**Formula**: $FL(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t)$

Where $p_t$ is the model's estimated probability for the correct class, $\alpha_t$ is a balancing factor, and $\gamma$ is the focusing parameter.

**Explanation**: Addresses class imbalance by down-weighting the loss for well-classified examples.

**Pros**: 
- Handles class imbalance effectively
- Focuses on hard examples

**Cons**: 
- Requires tuning of additional parameters
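A per-example sketch of focal loss is shown below. The defaults `alpha_t=0.25` and `gamma=2.0` are illustrative starting points rather than values taken from the text, and the `eps` clipping is an added assumption to avoid `log(0)`.

```python
import math


def focal_loss(p_t, alpha_t=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one example, given the probability assigned to the correct class."""
    # Keep p_t inside (0, 1] so the logarithm stays finite.
    p_t = min(max(p_t, eps), 1.0)
    # (1 - p_t) ** gamma shrinks the loss for well-classified examples (p_t near 1).
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9)   # well-classified example, heavily down-weighted
hard = focal_loss(0.1)   # poorly classified example, keeps most of its loss
```

Setting `gamma=0.0` and `alpha_t=1.0` removes the modulating factor, and the expression reduces to the ordinary cross-entropy term $-\log(p_t)$.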

## Conclusion

Choosing the right loss function is crucial for effective model training. It depends on the specific problem, the nature of the data, and the desired properties of the model's predictions. Understanding the characteristics of different loss functions allows data scientists and machine learning engineers to make informed decisions in model design and optimization.


## Learning
