# Model Evaluation in Regression Models

**Introduction to Model Evaluation**

Welcome to the exploration of model evaluation. The goal of regression is to predict unknown cases accurately, so a model must be evaluated after it is built. This video covers two evaluation approaches, "Train and Test on the Same Dataset" and "Train/Test Split," and briefly introduces K-fold cross-validation.

**Train and Test on the Same Dataset**

*Overview:*
- The entire dataset is used for training.
- A portion of that same data is then selected for testing, and its labels are hidden from the model while it predicts.
- The labels (actual values) of the test portion are used only as ground truth, to be compared with the predictions.

![alt text](image-16.png)

*Evaluation Metrics:*
- Accuracy is calculated by comparing the predicted values ($\hat{y}$) with the actual values ($y$).
- One simple error measure is the average absolute difference between the predicted and actual values across the test rows.

![alt text](image-17.png)

*Pros and Cons:*
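- Simple, but the model has already seen every test row, so the result measures training accuracy rather than out-of-sample accuracy.
- Training accuracy tends to be high while out-of-sample accuracy is low, which is the signature of overfitting.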

![alt text](image-18.png)
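
The gap between training accuracy and out-of-sample accuracy can be illustrated with a minimal sketch. It assumes synthetic data and a deliberately flexible polynomial model fitted with NumPy; the data, the polynomial degree, and the helper names `make_data` and `mean_abs_error` are illustrative assumptions, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_rows):
    """Hypothetical data: a linear trend plus random noise."""
    x = rng.uniform(-1.0, 1.0, size=n_rows)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=n_rows)
    return x, y

def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return float(np.mean(np.abs(y_true - y_pred)))

x_seen, y_seen = make_data(20)    # rows used for training AND testing
x_new, y_new = make_data(200)     # fresh rows standing in for unknown cases

# Deliberately flexible model (degree-8 polynomial) to provoke overfitting.
coeffs = np.polyfit(x_seen, y_seen, deg=8)

train_error = mean_abs_error(y_seen, np.polyval(coeffs, x_seen))
out_of_sample_error = mean_abs_error(y_new, np.polyval(coeffs, x_new))

print(f"error on the training rows: {train_error:.3f}")
print(f"error on unseen rows:       {out_of_sample_error:.3f}")
```

Under these assumptions, the error measured on the rows the model was trained on should come out smaller than the error on the fresh rows, which is why evaluating on the training data gives an overly optimistic picture.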

**Train/Test Split Approach**

*Overview:*
- The dataset is split into a training set (e.g., rows 0-5) and a testing set (e.g., rows 6-9).
- The model is trained on the training set and tested on the separate testing set. The two sets are mutually exclusive.

![alt text](image-19.png)

*Benefits:*
- Gives a more realistic estimate of out-of-sample accuracy, because the testing rows were not used to train the model.
- Exposes the overfitting that the "Train and Test on the Same Dataset" approach hides.

![alt text](image-20.png)

*Considerations:*
- Ensures the model has no prior knowledge of testing set outcomes.
- Essential for real-world applicability.
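
A train/test split can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the 10-row dataset is synthetic, the 60/40 ratio mirrors the rows 0-5 / rows 6-9 example, and `split_rows` is a hypothetical helper (libraries such as scikit-learn provide their own `train_test_split`).

```python
import numpy as np

def split_rows(n_rows, test_fraction=0.4, seed=0):
    """Return two disjoint index arrays: (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical 10-row dataset: one feature and a noisy linear target.
rng = np.random.default_rng(1)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=10)

train_idx, test_idx = split_rows(len(x))

# Fit a straight line using the training rows only.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Evaluate on rows the model has never seen.
y_hat = slope * x[test_idx] + intercept
test_mae = float(np.mean(np.abs(y[test_idx] - y_hat)))
print(f"train rows: {sorted(train_idx.tolist())}")
print(f"test rows:  {sorted(test_idx.tolist())}")
print(f"out-of-sample MAE: {test_mae:.3f}")
```

Shuffling before splitting is a design choice made here so that the split does not depend on the original row order; a fixed split by row position would also satisfy the mutually exclusive requirement.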


**K-Fold Cross-Validation**

*Overview:*
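- Addresses the main weakness of a single train/test split: the measured accuracy depends heavily on which rows happen to land in the training and testing sets.
- The dataset is divided into K folds (e.g., K=4).
- Each fold is used once as the testing set while the remaining folds are used for training, and the K accuracy results are averaged.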

![alt text](image-21.png)

*Benefits:*
- Reduces the dependence of the result on any one particular split.
- Provides a more consistent estimate of out-of-sample accuracy.
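
The fold-and-average procedure can be sketched as follows. This is a minimal NumPy illustration, assuming synthetic data, a straight-line model, and mean absolute error as the score; `k_fold_mae` is a hypothetical helper name, not a library function.

```python
import numpy as np

def k_fold_mae(x, y, k=4, seed=0):
    """Return the per-fold MAE values and their average."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        y_hat = slope * x[test_idx] + intercept
        scores.append(float(np.mean(np.abs(y[test_idx] - y_hat))))
    return scores, float(np.mean(scores))

# Hypothetical data: 40 rows with a noisy linear relationship.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=40)

fold_scores, average_score = k_fold_mae(x, y, k=4)
print("per-fold MAE:", [round(s, 3) for s in fold_scores])
print("average MAE: ", round(average_score, 3))
```

Each row appears in exactly one testing fold, so every row contributes to the averaged score once.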

*Limitation:*
- A detailed exploration of K-fold cross-validation is beyond this course's scope.

# Model Evaluation Metrics for Regression

**Introduction to Accuracy Metrics**

Welcome to the exploration of accuracy metrics for model evaluation in regression. This video covers the metrics used to assess the performance of a regression model and what each one tells us.

**Understanding Model Evaluation Metrics**

*Overview:*
- Evaluation metrics are essential for explaining a model's performance.
- They provide insights into areas requiring improvement during model development.
- The primary focus is on comparing actual values with predicted values to calculate accuracy.

**Common Regression Model Evaluation Metrics**

*1. Mean Absolute Error (MAE):*
- Definition: Mean of the absolute value of errors.
- Interpretation: Represents the average error.
- Ease of Understanding: Simple and straightforward.

*2. Mean Squared Error (MSE):*
- Definition: Mean of the squared errors.
- Significance: Emphasizes large errors due to the squared term.
- Widely Used: Often preferred over MAE in the data science community, because squaring the error terms penalizes large errors much more heavily than small ones.

*3. Root Mean Squared Error (RMSE):*
- Definition: Square root of the mean squared error.
- Significance: Interpretable in the same units as the response vector (Y units).
- Popularity: Widely used for its ease of interpretation.

*4. Relative Absolute Error (RAE):*
- Definition: Total absolute error normalized by the total absolute error of a simple predictor, i.e., one that always predicts the mean of the actual values.
- Significance: Provides a relative measure of error.

*5. Relative Squared Error (RSE):*
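- Definition: Total squared error normalized by the total squared error of the simple predictor.
- Significance: Very similar to RAE, and widely adopted in the data science community because it is used to calculate R-squared.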

*6. R-squared:*
- Definition: Not an error measure in itself, but a popular metric for model accuracy, calculated as $R^2 = 1 - RSE$.
- Significance: Represents how close the data values are to the fitted regression line.
- Interpretation: A higher R-squared indicates a better fit of the model to the data.
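
For reference, the metrics above are commonly written as follows, where $y_j$ is an actual value, $\hat{y}_j$ the corresponding prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of rows. The notation is an assumption based on the standard definitions rather than a copy of the course slides.

$$
\begin{aligned}
\text{MAE} &= \frac{1}{n}\sum_{j=1}^{n} \lvert y_j - \hat{y}_j \rvert \\
\text{MSE} &= \frac{1}{n}\sum_{j=1}^{n} (y_j - \hat{y}_j)^2 \\
\text{RMSE} &= \sqrt{\text{MSE}} \\
\text{RAE} &= \frac{\sum_{j=1}^{n} \lvert y_j - \hat{y}_j \rvert}{\sum_{j=1}^{n} \lvert y_j - \bar{y} \rvert} \\
\text{RSE} &= \frac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{n} (y_j - \bar{y})^2} \\
R^2 &= 1 - \text{RSE}
\end{aligned}
$$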

![alt text](image-22.png)
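
The six metrics can be computed directly with NumPy. The sketch below is a minimal illustration under stated assumptions: `regression_metrics` is a hypothetical helper, the four data points are made up, and the "simple predictor" is taken to be the mean of the actual values.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, RAE, RSE and R-squared for two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    # Errors of the simple predictor that always outputs the mean of y_true.
    baseline = y_true - y_true.mean()

    mse = float(np.mean(errors ** 2))
    rse = float(np.sum(errors ** 2) / np.sum(baseline ** 2))
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "RAE": float(np.sum(np.abs(errors)) / np.sum(np.abs(baseline))),
        "RSE": rse,
        "R2": 1.0 - rse,
    }

# Made-up actual and predicted values, for illustration only.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 11])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these made-up values the errors are 1, 0, -1 and -2, so MAE is 1.0, MSE is 1.5, RAE is 0.5, RSE is 0.3 and R-squared is 0.7.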

# Multiple Linear Regression

**Simple Linear Regression vs. Multiple Linear Regression**

- *Simple Linear Regression:* Utilizes one independent variable to estimate a dependent variable, e.g., predicting CO₂ emission using engine size.
  
- *Multiple Linear Regression:* Incorporates multiple independent variables in predicting the dependent variable, e.g., predicting CO₂ emission using engine size and the number of cylinders.

**Applications of Multiple Linear Regression**

Multiple linear regression is applicable in two scenarios:
1. Identifying the strength of effects: Determines the impact of independent variables on the dependent variable.
   - Example: Examining the effects of revision time, test anxiety, lecture attendance, and gender on student exam performance.
2. Predicting the impact of changes: Understands how the dependent variable changes with alterations in independent variables.
   - Example: Predicting how a person's blood pressure changes with variations in body mass index while holding other factors constant.

**Understanding Multiple Linear Regression**

- Predicts a continuous variable using multiple independent variables.
- The target value (Y) is a linear combination of independent variables (X).



**Model Representation**

The general form of the model:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n$$

In vector form:

$$\hat{y} = \theta^T x$$

where $\theta = [\theta_0, \theta_1, \ldots, \theta_n]^T$ is the parameter vector and $x = [1, x_1, \ldots, x_n]^T$ is the feature vector, with a leading 1 that multiplies the intercept $\theta_0$.


![alt text](image-23.png)
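
The equivalence between the written-out sum and the vector form can be checked numerically. This is a minimal sketch with made-up parameter and feature values; prepending a 1 to the feature vector is the usual way to let $\theta_0$ act as the intercept.

```python
import numpy as np

# Made-up parameters: theta_0 (intercept), theta_1, theta_2.
theta = np.array([1.0, 2.0, 3.0])

# Made-up feature values x_1 and x_2 for one row.
features = np.array([4.0, 5.0])

# Written-out form: theta_0 + theta_1 * x_1 + theta_2 * x_2.
y_hat_sum = theta[0] + theta[1] * features[0] + theta[2] * features[1]

# Vector form: prepend 1 so that theta_0 is multiplied by 1.
x = np.concatenate(([1.0], features))
y_hat_dot = float(theta @ x)

print(y_hat_sum, y_hat_dot)  # both are 1 + 8 + 15 = 24.0

# The same idea for many rows at once: a column of ones plus the features.
feature_rows = np.array([[4.0, 5.0], [0.0, 1.0]])
X = np.column_stack([np.ones(len(feature_rows)), feature_rows])
y_hat_all = X @ theta
print(y_hat_all)
```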

**Optimizing Parameters in Multiple Linear Regression**

- The goal is to minimize the Mean Squared Error (MSE) to achieve the best-fit hyperplane.
- MSE is the mean of the squared errors, i.e., the average squared discrepancy between predicted and actual values.

![alt text](image-24.png)

**Parameter Estimation Methods**

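1. *Ordinary Least Squares:* Estimates the coefficients that minimize MSE directly, using linear algebra (matrix) operations. Because these operations become slow on large datasets, it is best suited to smaller ones (as a rule of thumb, fewer than 10,000 rows).
2. *Optimization Algorithm (e.g., **Gradient Descent**):* Starts from initial coefficient values and iteratively adjusts them to reduce the error. Suitable for larger datasets.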
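
Both estimation methods can be sketched on the same synthetic problem. This is a minimal illustration, not a production implementation: the data, the true coefficients, the learning rate, and the iteration count are all assumptions chosen so that plain batch gradient descent converges.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic data generated from known (made-up) coefficients.
true_theta = np.array([4.0, 2.0, -1.0])           # intercept, theta_1, theta_2
features = rng.normal(size=(n_rows, 2))
X = np.column_stack([np.ones(n_rows), features])  # leading column of ones
y = X @ true_theta + rng.normal(0.0, 0.1, size=n_rows)

# 1. Ordinary least squares via a linear algebra solver.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2. Batch gradient descent on the MSE.
theta_gd = np.zeros(X.shape[1])
learning_rate = 0.1
for _ in range(1000):
    residuals = X @ theta_gd - y
    gradient = (2.0 / n_rows) * (X.T @ residuals)  # gradient of the MSE
    theta_gd -= learning_rate * gradient

print("OLS:             ", np.round(theta_ols, 3))
print("gradient descent:", np.round(theta_gd, 3))
```

On a well-conditioned problem like this one, the two approaches should arrive at essentially the same coefficients; the difference lies in how the computation scales as the dataset grows.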

**Prediction Phase**

Once the parameters are found, making a prediction is as simple as evaluating the linear model equation for a specific set of inputs.

Example linear model: $\hat{y}$ = 125 + 6.2 * `engine_size` + 14 * `cylinders` + $\ldots$

![alt text](image-25.png)
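
As a minimal sketch of the prediction step, the function below evaluates only the two terms that are written out in the example model. The terms hidden behind the ellipsis are unknown and are left out, so the result is illustrative only. The function name `predict_co2` and the input values are hypothetical.

```python
def predict_co2(engine_size, cylinders):
    """Evaluate the written-out part of the example model.

    The example model ends in "...", so any further terms are unknown
    and deliberately omitted here.
    """
    intercept = 125.0
    return intercept + 6.2 * engine_size + 14.0 * cylinders

# Hypothetical car: engine size 2.4 with 4 cylinders.
prediction = predict_co2(engine_size=2.4, cylinders=4)
print(f"{prediction:.2f}")  # 125 + 6.2 * 2.4 + 14 * 4 = 195.88
```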

**Concerns and Considerations**

- **Overfitting:** Adding too many independent variables may lead to overfitting, resulting in a model too specific to the dataset.
- **Variable Selection:** The number of independent variables should be chosen judiciously based on theoretical justification rather than using all fields.
- **Categorical Variables:** Categorical variables can be incorporated by converting them into numerical variables.
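
One common way to convert a categorical variable into numerical variables is one-hot (dummy) encoding. The sketch below is a minimal NumPy illustration; the `fuel_type` column and the `one_hot` helper are hypothetical, and libraries such as pandas offer their own equivalents.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix) with one 0/1 column per category."""
    categories, codes = np.unique(values, return_inverse=True)
    # Row i of the identity matrix is the encoding for category i.
    return categories, np.eye(len(categories))[codes]

# Hypothetical categorical column.
fuel_type = ["petrol", "diesel", "petrol", "electric"]

categories, encoded = one_hot(fuel_type)
print(categories.tolist())  # categories in sorted order
print(encoded)
```

When the model also has an intercept, one of the resulting columns is usually dropped, because the full set of columns always sums to 1 and is therefore redundant with the intercept term.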

**Checking for Linearity**

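Multiple linear regression assumes a linear relationship between the dependent variable and each independent variable. One way to check this is to draw a scatter plot of the dependent variable against each independent variable and look for a roughly straight-line trend.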
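
A scatter-plot check might look like the following sketch. It assumes Matplotlib is installed and uses synthetic data; the column names `engine_size`, `cylinders`, and `co2_emission` are placeholders rather than the course dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two independent variables and the target.
columns = {
    "engine_size": rng.uniform(1.0, 6.0, size=100),
    "cylinders": rng.integers(3, 9, size=100).astype(float),
}
co2_emission = (
    125.0
    + 6.2 * columns["engine_size"]
    + 14.0 * columns["cylinders"]
    + rng.normal(0.0, 5.0, size=100)
)

# One scatter plot per independent variable, sharing the y-axis.
fig, axes = plt.subplots(1, len(columns), figsize=(10, 4), sharey=True)
for ax, (name, values) in zip(axes, columns.items()):
    ax.scatter(values, co2_emission, s=12)
    ax.set_xlabel(name)
axes[0].set_ylabel("co2_emission")
fig.tight_layout()
# Use fig.savefig("linearity_check.png") or plt.show() to view the result.
```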
