![image.png](attachment:1ee5b98b-aa04-46e7-9160-e0d80e06c23e.png)

## Regression

**Assumptions:**
1. Target data should be continuous
2. Features should be independent

### Types of regression:
1. Linear regression
2. Decision Tree
3. Support Vector Regression
4. Random Forest Regressor

### Performance Metrics

`Example:` Consider a regression problem where we want to predict housing prices based on various features. After applying our regression model to a test dataset of 100 houses, we obtained the following results:

```
Actual Prices:    [250000, 300000, 350000, ...]
Predicted Prices: [260000, 295000, 360000, ...]
```

#### Mean Absolute Error (MAE)

Mean Absolute Error is a commonly used metric for regression problems. It measures the average absolute difference between the predicted and actual values. MAE provides an easily interpretable measure of how close the predictions are to the actual values.

$MAE = \frac{1}{n} \sum |y - \hat{y}|$

where:

- `n` is the number of instances
- `y` represents the actual values
- `ŷ` represents the predicted values

MAE is simple to understand and compute. Because it does not square the errors, every error contributes in proportion to its size, which makes MAE less sensitive to large errors (outliers) than MSE.

`Example:` For the problem above, we calculate the MAE as follows:

$MAE = \frac{|250{,}000 - 260{,}000| + |300{,}000 - 295{,}000| + |350{,}000 - 360{,}000| + \dots}{100}$

The MAE provides an average of the absolute differences between the actual and predicted values. In this example, it measures the average difference between the predicted and actual housing prices.

- **When to use:** MAE is a common metric for evaluating regression models and is suitable when you want to understand the average magnitude of the errors without considering their direction.
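
The MAE formula can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `mean_absolute_error` is an assumption, and only the three price pairs listed in the example are used because the remaining 97 houses are not shown.

```python
def mean_absolute_error(actual, predicted):
    """Average of |y - y_hat| over all instances."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


# Only the first three houses from the example; the rest are not listed.
actual_prices = [250_000, 300_000, 350_000]
predicted_prices = [260_000, 295_000, 360_000]

mae = mean_absolute_error(actual_prices, predicted_prices)
print(f"MAE on the three listed houses: {mae:,.2f}")
```

On these three houses the absolute errors are 10,000, 5,000, and 10,000, so the predictions are off by about 8,333 dollars on average.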

#### Mean Squared Error (MSE)

Mean Squared Error is another popular metric for regression evaluation. It measures the average of the squared differences between the predicted and actual values. MSE gives more weight to larger errors compared to MAE.

$MSE = \frac{1}{n} \sum (y - \hat{y})^2$

Because each error is squared before averaging, a single large error can dominate the MSE. The result is expressed in squared units of the target variable (e.g., dollars squared), so it is not directly interpretable on the original scale.

`Example:` Using the same housing price regression example, we calculate the MSE:

$MSE = \frac{(250{,}000 - 260{,}000)^2 + (300{,}000 - 295{,}000)^2 + (350{,}000 - 360{,}000)^2 + \dots}{100}$

MSE calculates the average of the squared differences between the predicted and actual values. It penalizes larger errors more heavily than MAE.

- **When to use:** MSE is widely used in regression tasks and is beneficial when you want to emphasize larger errors and penalize them more compared to smaller errors.
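
A small sketch of the MSE calculation, again restricted to the three listed price pairs (the other 97 are elided in the example). The function name `mean_squared_error` is an illustrative choice.

```python
def mean_squared_error(actual, predicted):
    """Average of (y - y_hat)^2 over all instances."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


# Only the first three houses from the example; the rest are not listed.
actual_prices = [250_000, 300_000, 350_000]
predicted_prices = [260_000, 295_000, 360_000]

# Per-house squared errors, to show how squaring re-weights them.
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual_prices, predicted_prices)]
mse = mean_squared_error(actual_prices, predicted_prices)

print(squared_errors)
print(f"MSE on the three listed houses: {mse:,.0f} (squared dollars)")
```

A 10,000 error is twice as large as a 5,000 error, but after squaring it contributes four times as much to the MSE.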

#### Root Mean Squared Error (RMSE)

RMSE is derived from MSE and is widely used due to its interpretability in the original scale of the target variable. It represents the square root of the average of the squared differences between the predicted and actual values.

$RMSE = \sqrt{\frac{1}{n} \sum (y - \hat{y})^2}$

RMSE shares the same properties as MSE but provides a more easily understandable and interpretable metric.

`Example:` Based on the previous example, we calculate the RMSE as follows:

$RMSE = \sqrt{MSE}$

RMSE provides the square root of the MSE, giving a measure of the average magnitude of the errors in the original unit of the target variable (e.g., dollars in the housing price example).

- **When to use:** RMSE is a commonly used metric for regression problems and is especially useful when you want to interpret the errors in the original unit of the target variable.
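
As a rough sketch, RMSE is just the square root of the MSE. The function name `root_mean_squared_error` is an assumption for illustration, and only the three listed price pairs are used.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error, in the units of the target."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Only the first three houses from the example; the rest are not listed.
actual_prices = [250_000, 300_000, 350_000]
predicted_prices = [260_000, 295_000, 360_000]

rmse = root_mean_squared_error(actual_prices, predicted_prices)
print(f"RMSE on the three listed houses: {rmse:,.2f} dollars")
```

The result is back in dollars, so it can be compared directly with the prices themselves.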

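#### R-squared (Coefficient of Determination)

R-squared (R²) is a statistical metric that represents the proportion of variance in the target variable explained by the regression model. It indicates how well the model fits the data. A value of 1 means a perfect fit and 0 means the model does no better than always predicting the mean. It usually falls between 0 and 1, but it can be negative when the model fits worse than the mean, for example on unseen test data. A higher R² value indicates a better fit.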

$R^2 = 1 - \frac{SSE}{SST}$

where:

- `SSE` is the sum of squared residuals, $\sum (y - \hat{y})^2$
- `SST` is the total sum of squares, $\sum (y - \bar{y})^2$, where $\bar{y}$ is the mean of the actual values

R-squared measures the goodness of fit but does not consider the complexity of the model or the number of predictors.

`Example:` In the housing price regression example, we can calculate R-squared as follows:

Calculate the total sum of squares (SST):

$SST = \sum (\text{Actual Prices} - \text{mean}(\text{Actual Prices}))^2$

Calculate the residual sum of squares (SSE):

$SSE = \sum (\text{Actual Prices} - \text{Predicted Prices})^2$

Calculate the R-squared error:

$R^2 = 1 - \frac{SSE}{SST}$

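R-squared measures the proportion of the variance in the target variable that can be explained by the regression model. A value of 1 indicates that the model explains all the variability in the target variable, and a value of 0 indicates that it does no better than predicting the mean. Negative values are possible when the model performs worse than that baseline.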

- **When to use:** R-squared is commonly used to assess the goodness of fit of a regression model. It provides an indication of how well the model fits the data.
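
The SST, SSE, and R² steps can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `r_squared` is hypothetical, only the three listed price pairs are used, and the `reversed_prices` list is made-up data added purely to show a poorly fitting model.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SST."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    sst = sum((y - mean_actual) ** 2 for y in actual)
    if sst == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1 - sse / sst


# Only the first three houses from the example; the rest are not listed.
actual_prices = [250_000, 300_000, 350_000]
predicted_prices = [260_000, 295_000, 360_000]
r2 = r_squared(actual_prices, predicted_prices)
print(f"R^2 on the three listed houses: {r2:.3f}")

# Hypothetical bad model (not from the example): predictions in reverse order.
reversed_prices = [350_000, 300_000, 250_000]
r2_bad = r_squared(actual_prices, reversed_prices)
print(f"R^2 for the reversed predictions: {r2_bad:.1f}")
```

For the listed houses, SSE is 225,000,000 and SST is 5,000,000,000, giving an R² of 0.955. The reversed predictions are worse than simply predicting the mean price, so their R² is negative.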

## Classification

**Assumptions:**
1. Target data should be discrete/categorical.
2. Features should be independent

### Types of classification:
1. Logistic Regression
2. Decision Tree Classifier
3. Support Vector Machine
4. Random Forest Classifier

### Performance Metrics

`Example:` Let's consider a binary classification problem where we want to predict whether an email is spam (positive) or not spam (negative). After applying our model to a test dataset of 100 emails, we evaluate its predictions with the metrics below.

#### Confusion Matrix

A confusion matrix is a tabular representation that summarizes the performance of a classification model by counting the number of correct and incorrect predictions for each class.

For a binary classification problem, the confusion matrix consists of four values:

- True Positive (TP): The number of positive instances correctly predicted as positive.
- True Negative (TN): The number of negative instances correctly predicted as negative.
- False Positive (FP): The number of negative instances incorrectly predicted as positive (Type I error).
- False Negative (FN): The number of positive instances incorrectly predicted as negative (Type II error).

A confusion matrix allows for a more detailed analysis of the model's performance beyond simple accuracy. It is particularly useful when dealing with imbalanced datasets or when the cost of false positives and false negatives varies.

`Example:` Confusion matrix for the example above is as follows:

| | Predicted Spam | Predicted Not Spam |
|---|:-:|:-:|
| Actual Spam | 70 | 5 |
| Actual Not Spam | 10 | 15 |

In this example, the confusion matrix provides a detailed breakdown of the model's predictions, showing the true positives (70), true negatives (15), false positives (10), and false negatives (5).

- **When to use:** The confusion matrix is widely used for evaluating classification models, especially when dealing with imbalanced datasets or when the costs of false positives and false negatives are different.
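
The four counts can be tallied from paired labels as sketched below. The function name `confusion_counts` is an assumption, and the label lists are synthetic: they are constructed only so that the counts match the table above, since the individual emails are not given.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (tp, tn, fp, fn) for a binary classification problem."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # spam correctly flagged
        elif a != positive and p != positive:
            tn += 1  # legitimate email correctly passed
        elif a != positive and p == positive:
            fp += 1  # legitimate email flagged as spam (Type I error)
        else:
            fn += 1  # spam that slipped through (Type II error)
    return tp, tn, fp, fn


# Synthetic labels built to reproduce the table: 75 actual spam, 25 actual not spam.
actual = ["spam"] * 75 + ["not spam"] * 25
predicted = ["spam"] * 70 + ["not spam"] * 5 + ["spam"] * 10 + ["not spam"] * 15

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```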

#### Accuracy

Accuracy is a commonly used evaluation metric that measures the overall correctness of a classification model. It calculates the proportion of correctly predicted instances (TP and TN) out of the total number of instances.

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

While accuracy provides a general idea of the model's performance, it may not be suitable for imbalanced datasets, where the class distribution is uneven.

`Example:` Using the same example, we calculate the accuracy of the model:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{70 + 15}{100} = 0.85$

The model achieves an accuracy of 85%, meaning that 85 out of 100 emails were correctly classified.

- **When to use:** Accuracy is a commonly used metric when the class distribution is balanced, and both types of errors (false positives and false negatives) have similar costs.
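
A minimal sketch of the accuracy formula using the counts from the spam example. The function name `accuracy` is illustrative, and the second call uses a hypothetical imbalanced dataset (not from the example) to show why accuracy can mislead.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one instance is required")
    return (tp + tn) / total


# Counts from the spam example.
spam_accuracy = accuracy(tp=70, tn=15, fp=10, fn=5)
print(f"Accuracy: {spam_accuracy:.2f}")

# Hypothetical imbalanced case: 5 positives, 95 negatives, and a model
# that always predicts "negative". It never finds a positive instance.
always_negative_accuracy = accuracy(tp=0, tn=95, fp=0, fn=5)
print(f"Accuracy of the always-negative model: {always_negative_accuracy:.2f}")
```

The always-negative model scores 0.95 despite missing every positive case, which is why accuracy alone is a weak signal on imbalanced data.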

#### Precision (Positive Predictive Value)

Precision is a measure of how many of the positive predictions made by the model are actually correct. It quantifies the ability of the model to avoid false positives.

$Precision = \frac{TP}{TP + FP}$

A high precision means that when the model predicts the positive class, it is usually right, so few false alarms are raised.

`Example:` Continuing with the previous example, we calculate the precision of the model:

$Precision = \frac{TP}{TP + FP} = \frac{70}{70 + 10} = 0.875$

The precision of the model is 0.875, indicating that among the emails predicted as spam, 87.5% were actually spam.

- **When to use:** Precision is useful when the cost of false positives is high, such as in spam filtering or fraud detection, where it is crucial to minimize false alarms.
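
The precision calculation can be sketched as follows. The function name `precision` is illustrative, and returning 0.0 when the model makes no positive predictions is a convention assumed here, not something the text specifies.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumed convention: no positive predictions were made
    return tp / predicted_positive


# Counts from the spam example: 70 true positives, 10 false positives.
spam_precision = precision(tp=70, fp=10)
print(f"Precision: {spam_precision:.3f}")
```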

#### Recall (Sensitivity or True Positive Rate)

Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive instances that are correctly identified by the model.

$Recall = \frac{TP}{TP + FN}$

Recall is essential in scenarios where the cost of false negatives is high, such as disease diagnosis, where missing positive cases can have severe consequences.

`Example:` In the same email spam classification example, we calculate the recall of the model:

$Recall = \frac{TP}{TP + FN} = \frac{70}{70 + 5} ≈ 0.933$

The recall, also known as sensitivity, is approximately 93.3%. This means that the model correctly identifies around 93.3% of the actual spam emails.

- **When to use:** Recall is important when the cost of false negatives is high, such as in disease diagnosis, where missing positive cases can have severe consequences.
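
A matching sketch for recall. The function name `recall` is illustrative, and returning 0.0 when there are no actual positives is an assumed convention.

```python
def recall(tp, fn):
    """Share of actual positives that the model correctly identified."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumed convention: the data contains no positives
    return tp / actual_positive


# Counts from the spam example: 70 true positives, 5 false negatives.
spam_recall = recall(tp=70, fn=5)
print(f"Recall: {spam_recall:.3f}")
```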

#### F1 Score

The F1 score is the harmonic mean of precision and recall. It combines both into a single balanced measure that is high only when precision and recall are both high.

$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$

The F1 score ranges from 0 to 1, with 1 representing the best possible performance.

`Example:` Using the values from precision and recall, we calculate the F1 score:

$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \approx 2 \times \frac{0.875 \times 0.933}{0.875 + 0.933} \approx 0.903$

The F1 score for the model is approximately 0.903, which combines the precision and recall values into a single metric.

- **When to use:** The F1 score is useful when you need a single number that balances precision and recall, for example on imbalanced datasets where accuracy alone can be misleading.
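
The F1 calculation can be sketched directly from the confusion matrix counts. The function name `f1_score_from_counts` is hypothetical, and returning 0.0 when precision and recall are both zero is an assumed convention.

```python
def f1_score_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when both components are zero
    return 2 * precision * recall / (precision + recall)


# Counts from the spam example.
spam_f1 = f1_score_from_counts(tp=70, fp=10, fn=5)
print(f"F1 score: {spam_f1:.3f}")
```

The same value follows from the equivalent form $\frac{2\,TP}{2\,TP + FP + FN} = \frac{140}{155}$, which is a quick way to check the result by hand.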
