# Linear Regression:
Linear regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (predictors). The objective is to find a linear equation that best describes this relationship.

Linear regression is widely used for predictive modeling, inferential statistics, and understanding relationships in data. Its applications include forecasting sales, assessing risk, and analyzing the impact of different variables on a target outcome.

## 1. Simple Linear Regression:

Simple linear regression is linear regression with one independent variable, the explanatory variable (feature), and one dependent variable, the response variable (target). In simple linear regression, the dependent variable is continuous.

For example, you might want to know how a tree’s height (independent variable) affects the number of leaves it has (dependent variable).

The simple linear regression equation is:
![image.png](attachment:a8531c61-1e7d-4f6e-9318-153c47e29a97.png)

- m is the slope of the line
- b is the intercept

In data science, by contrast, the equation is usually written as:
![image.png](attachment:11ef4d00-2daf-4e86-818b-84c4196b46aa.png)

If we were using only the slope-intercept equation, we would find the slope m by measuring the change in y over the change in x between two points on the line.

Then, once we have found the slope, we would find the y-intercept b by substituting the coordinates of one point on the line into the equation and solving for b. This final step gives the point where the line crosses the y-axis.

This doesn’t work in regression because real data is noisy: in general, no single line goes through all the points, which is why we instead look for the line of best fit. Fortunately, there are neat, closed-form equations for the slope and intercept of that line.
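
To see why the two-point method breaks down, here is a minimal sketch. The data points and the helper name `line_through` are invented for illustration. It applies the slope-intercept procedure described above to two different pairs of points from the same noisy dataset:

```python
# Hypothetical noisy data, invented for illustration.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def line_through(p, q):
    """Return (slope, intercept) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)  # change in y over change in x
    b = y1 - m * x1            # substitute one point and solve for b
    return m, b

# Two different pairs of points give two different lines.
m_a, b_a = line_through(points[0], points[1])
m_b, b_b = line_through(points[2], points[3])

print("first pair: ", m_a, b_a)
print("second pair:", m_b, b_b)
```

Each pair of points produces its own slope and intercept, so no single choice of two points describes the whole dataset. That is the gap the line of best fit is meant to fill.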

The slope can be calculated by multiplying the correlation r by the ratio of the standard deviation of y to the standard deviation of x. This makes intuitive sense because we are essentially converting the correlation coefficient back into the units of the original variables. In the equation below, a refers to the slope, and sy and sx refer to the standard deviations of y and x, respectively.
![image.png](attachment:786e6809-f471-451e-9a35-cbd66f4066ce.png)

The intercept of the line of best fit for simple linear regression can be calculated after we calculate the slope. We do this by subtracting the product of the slope and the mean of x from the mean of y. In the equation below, i refers to the y-intercept, and the straight line over the x and y values is a way of referring to the mean of x and y, respectively; we refer to these terms as x-bar and y-bar.
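
For reference, the two formulas described in the text can be written out as follows, using the same symbols (a for the slope, i for the intercept, r for the correlation, and x-bar and y-bar for the means):

$$
a = r \cdot \frac{s_y}{s_x}, \qquad i = \bar{y} - a\,\bar{x}
$$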

![image.png](attachment:52af9a3a-febd-4e1d-9725-d3f2e910e23e.png)
![image.png](attachment:df0d9093-6365-4aae-9ecc-eb7272a46766.png)

Remember that the standard deviation is the square root of the variance, so instead of referring to the standard deviation of y and the standard deviation of x, we could also refer to the square root of the variance of y and the square root of the variance of x. The variance itself is the average of the squared deviations from the mean.

![image.png](attachment:f1f796af-4be9-4672-9130-e1d2459d0545.png)

In the equation for the slope a above, we could write out sy and sx using the full formula for the standard deviation, and we could also write out the longer form of the equation for the correlation r. We could then cross-multiply and simplify by removing common terms, ending up with the following set of equations for the slope and intercept. The point here is less about showing how one equation turns into the other and more about stressing that both are equivalent, since you might see one or the other.
![image.png](attachment:547e15e7-edde-4fef-945e-bc08e0426352.png)
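
The equivalence of the two forms can be checked numerically. The sketch below assumes NumPy is available and uses made-up numbers for tree heights and leaf counts. The expanded form is assumed to be the usual sum-of-products expression, slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²:

```python
import numpy as np

# Hypothetical data: tree heights (x) and leaf counts (y), invented for illustration.
x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([40.0, 55.0, 95.0, 130.0, 170.0])

# Form 1: slope = r * (s_y / s_x), intercept = y_bar - slope * x_bar.
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * (np.std(y, ddof=1) / np.std(x, ddof=1))
intercept = y.mean() - slope_from_r * x.mean()

# Form 2: slope = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar) ** 2).
dx = x - x.mean()
dy = y - y.mean()
slope_from_sums = np.sum(dx * dy) / np.sum(dx ** 2)

# Cross-check against NumPy's own least-squares line fit.
np_slope, np_intercept = np.polyfit(x, y, deg=1)

print(slope_from_r, slope_from_sums, np_slope)
print(intercept, np_intercept)
```

All three slope values should agree up to floating-point rounding, as should the two intercepts.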

During training, the model calculates the weight (slope) and bias (intercept) that produce the best fit.
![image.png](attachment:b8dccd47-dcdc-4907-9379-45e806c399bb.png)

**How to interpret the slope and the intercept**
- The **intercept** tells you where the regression line crosses the y-axis. In practical terms, it represents the value of the dependent variable when the independent variable is zero. It’s important to know that the intercept is not always interpretable. In the earlier tree example, the intercept would be the predicted number of leaves for a tree with a height of zero, which has no sensible real-world meaning.
- The **slope** indicates how much the dependent variable is expected to change with a one-unit increase in the independent variable. A positive slope suggests a positive relationship, where the dependent variable increases as the independent variable increases. A negative slope indicates the opposite.

### Simple linear regression model assumptions:
**1. Linearity:** The relationship between the independent and dependent variables must be linear. If the relationship is non-linear, the model won’t capture it well.

**2. Independence of Errors:** Residuals should be independent of each other. This means there should be no patterns or correlations between the residuals. This is something to watch for closely in time-ordered data.

**3. Homoscedasticity:** The residuals should have constant variance across all values of the independent variable. If the variance changes (heteroscedasticity), predictions in certain ranges of x may become less accurate.

**4. Normality of Residuals:** Residuals should ideally follow a normal (Gaussian) distribution. This is important for statistical testing and for stating levels of confidence in our estimates. It’s less critical for making predictions.

## Multiple linear regression:
Multiple linear regression extends the concept of simple linear regression to model relationships between a dependent variable and multiple independent variables. This allows us to analyze more complex datasets where one predictor alone does not sufficiently explain the variation in the dependent variable.
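
As a minimal sketch of what this looks like in practice, the example below fits a model with two predictors using ordinary least squares. It assumes NumPy is available. The data is synthetic, and the "true" coefficients (3.0, 2.0, -1.5) are arbitrary values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: y depends on two predictors plus noise.
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares: find coef minimizing ||X_design @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slopes = coef[0], coef[1:]

print("intercept:", intercept)
print("slopes:   ", slopes)
```

Each slope is interpreted as the expected change in the dependent variable for a one-unit increase in that predictor, holding the other predictors fixed.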

### Multicollinearity:
When multiple features are highly correlated, they are redundant, meaning that they are essentially giving the model the same information. This situation is referred to as multicollinearity. While multicollinearity doesn’t always impact the accuracy of predictive models, it complicates feature selection and interpretation, especially in linear regression and related models.

To detect multicollinearity, you can calculate the variance inflation factor (VIF) for each predictor. To mitigate it, you can remove highly correlated variables or use regularization techniques such as ridge regression and lasso regression.

![image.png](attachment:a8d5fa0e-b8a5-4129-9574-3b94724cd3a8.png)

where $R_i^2$ is the R² value obtained when the predictor $X_i$ is regressed against all other predictors in the model. A higher VIF means the predictor is more strongly correlated with the other variables.

- VIF = 1: no multicollinearity (ideal scenario).
- VIF < 5: low to moderate multicollinearity (generally acceptable).
- VIF > 5: high multicollinearity (consider removing or combining correlated variables).
- VIF > 10: severe multicollinearity (strongly suggests variable redundancy).
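
The VIF calculation can be sketched directly from its definition, VIF = 1 / (1 − Rᵢ²). The helper name `vif` and the synthetic features below are assumptions made for illustration, and NumPy is assumed to be available. Libraries such as statsmodels provide their own implementation:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    n, p = X.shape
    factors = []
    for i in range(p):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress predictor i on all other predictors (with an intercept).
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic features, invented for illustration.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to the others

features = np.column_stack([x1, x2, x3])
v = vif(features)
print(v)
```

In this setup the first two features should receive very large VIF values, while the third should stay close to 1.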
  
We can also check the correlation matrix of the features. A pairwise correlation with an absolute value above 0.8 (that is, above 0.8 or below -0.8) indicates that two features are highly correlated.
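
A minimal sketch of this check, assuming NumPy is available. The feature names and data are invented for illustration, and the 0.8 threshold is the rule of thumb mentioned above:

```python
import numpy as np

# Synthetic features, invented for illustration.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.2, size=300)  # strongly related to a
c = rng.normal(size=300)                 # unrelated to a and b
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]  # hypothetical feature names

# rowvar=False treats each column as one variable.
corr = np.corrcoef(X, rowvar=False)

# Report each feature pair once (upper triangle) whose |correlation| exceeds the threshold.
threshold = 0.8
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(flagged)
```

Only the pair of near-duplicate features should be flagged here.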

## Evaluation metrics
Apart from visual analysis, quantitative metrics are essential for assessing how well the linear regression model fits the data. Some key evaluation metrics include:

**1. R-squared (R²):** This metric tells you how much of the variance in the dependent variable can be explained by the independent variables. An R² value closer to 1 indicates that the model explains a large proportion of the variance, while a value closer to 0 means the model doesn’t fit the data well.
![image.png](attachment:8038551b-8713-48eb-940e-df9340e49fd2.png)
- R² = 1: The model perfectly explains all the variance in the target variable.
- R² = 0: The model explains none of the variance; predictions are no better than simply using the mean.
- R² < 0: The model performs worse than simply using the mean, indicating a poor fit.
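
The three cases above can be reproduced with a small sketch. It assumes the standard definition R² = 1 − SS_res / SS_tot. The function name `r_squared` and the numbers are invented for illustration:

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)              # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                # predictions match exactly
mean_only = r_squared(y_true, [6.0] * 4)           # always predict the mean
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])    # predictions in reversed order

print(perfect, mean_only, worse)
```

The reversed predictions are further from the targets than the mean is, which is how R² ends up below zero.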


**2. Adjusted R-squared:** This metric adjusts R² for the number of predictors in the model. It is useful for comparing models with different numbers of predictors. 
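
A common form of the adjustment is 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of samples and p is the number of predictors. The sketch below uses that formula with made-up values to show how the same R² is penalized more when more predictors are used:

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Same R² and sample size, different numbers of predictors (illustrative values).
adj_small = adjusted_r_squared(0.90, n_samples=50, n_predictors=2)
adj_large = adjusted_r_squared(0.90, n_samples=50, n_predictors=10)

print(adj_small, adj_large)
```

Both adjusted values fall below the raw R² of 0.90, and the model with ten predictors is penalized more than the model with two.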

**3. Root Mean Squared Error (RMSE):** RMSE measures the typical magnitude of the errors in the model’s predictions. It is simply the square root of the mean squared error (MSE); taking the square root puts the metric back in the same units as the dependent variable. A lower RMSE indicates better predictive performance.
![image.png](attachment:6d9887c3-27c6-445d-82c0-e012d8f951bf.png)

**4. Mean Absolute Error (MAE):** MAE also measures the average magnitude of the errors in the model’s predictions, but unlike RMSE, it does not penalize large errors as heavily. Like RMSE, it is in the same units as the dependent variable, and it is easy to interpret as the average absolute error.
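
The difference between the two metrics is easiest to see on a small example. In the sketch below (function names and numbers invented for illustration), both sets of predictions have the same total absolute error, but one of them concentrates it in a single large mistake:

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 8.0, 12.0, 8.0]      # four errors of size 2
one_big_error = [10.0, 10.0, 10.0, 18.0]  # a single error of size 8

print("MAE: ", mae(y_true, even_errors), mae(y_true, one_big_error))
print("RMSE:", rmse(y_true, even_errors), rmse(y_true, one_big_error))
```

MAE is identical for both sets of predictions, while RMSE is twice as large for the set containing the single large error.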

### Residual analysis
Residual analysis is a core part of linear regression diagnostics. It helps to check the assumptions of linear regression and identify any violations. The residuals should be randomly scattered around zero, with no discernible pattern. Here's where visual tools like residual plots and Q-Q plots come into play:

**Residual Plot:** A scatter plot of the residuals versus the predicted values (or independent variables) helps to check if there’s any relationship that the model hasn’t accounted for. Ideally, the plot should show a random spread with no clear pattern, which suggests that the model has captured the linear relationship well.

**Q-Q Plot (Quantile-Quantile Plot):** A Q-Q plot helps to assess whether the residuals follow a normal distribution. For the statistical tests and confidence intervals of linear regression to be valid, the residuals should be approximately normally distributed. Points that fall close to a straight line on the Q-Q plot indicate normality.
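
The data behind both plots can be computed with the standard library alone. The sketch below uses synthetic data and a hypothetical helper named `pearson`. It builds the (predicted, residual) pairs for a residual plot and the (theoretical quantile, sample quantile) pairs for a Q-Q plot. The plotting positions (k + 0.5) / n are one common convention, not the only one:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)

# Synthetic data, invented for illustration: a linear trend plus Gaussian noise.
xs = [i / 10 for i in range(100)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.5) for x in xs]

# Fit the line of best fit with the closed-form slope and intercept.
x_bar, y_bar = mean(xs), mean(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Residual plot data: residual = observed value minus predicted value.
predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# Q-Q plot data: sorted standardized residuals vs. theoretical normal quantiles.
n = len(residuals)
scale = stdev(residuals)
sample_q = sorted(r / scale for r in residuals)
theory_q = [NormalDist().inv_cdf((k + 0.5) / n) for k in range(n)]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    num = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    den = (sum((u - a_bar) ** 2 for u in a) * sum((v - b_bar) ** 2 for v in b)) ** 0.5
    return num / den

# A value near 1 means the Q-Q points lie close to a straight line.
qq_corr = pearson(theory_q, sample_q)
print(mean(residuals), qq_corr)
```

Passing `predicted` and `residuals`, or `theory_q` and `sample_q`, to any scatter-plot function gives the two diagnostic plots described above.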

## Optimizing a Linear Regression Model - Various Approaches:

Learning/training a linear regression model essentially means estimating the values of the coefficients/parameters used in the representation with the data you have.

MAE (L1 loss) and MSE (L2 loss) are also called cost functions. A cost function outputs a single number representing the cost, or score, associated with the current set of weights. Here, the goal is to minimize the MSE so that the model fits the data more accurately.
1. **Gradient descent**: Gradient descent optimizes the values of the coefficients by iteratively reducing the error of the model on your training data.
    It works by starting with random values for each coefficient. The squared error is calculated for each pair of input and output values and summed. The coefficients are then updated in the direction that reduces the error, with a learning rate used as a scale factor on the size of the update. The process is repeated until a minimum sum of squared errors is reached or no further improvement is possible.

    The term α (learning rate) is very important here since it determines the size of the step taken on each iteration of the procedure.
    There are two common variants of gradient descent:

   - The method that looks at every example in the entire training set on every step is called __batch gradient descent__.
   - The method that repeatedly runs through the training set and, each time it encounters a training example, updates the parameters according to the gradient of the error with respect to that single training example only is called __stochastic gradient descent__ (also incremental gradient descent).

2. **Regularization:**
    1. **Lasso Regression:** adds a penalty term proportional to the sum of the absolute values of the coefficients (also called L1 regularization).
    2. **Ridge Regression:** adds a penalty term proportional to the sum of the squared coefficients (also called L2 regularization).
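
The approaches in the list above can be sketched for a one-feature model as follows. This is an illustration under stated assumptions, not a reference implementation. The function name `gradient_descent`, the default learning rate, the number of epochs, and the toy data are all invented. Coefficients start at zero rather than at random values to keep the run reproducible. The optional `l2` argument adds a ridge-style penalty `l2 * w**2` on the weight only. Lasso is not shown, because the absolute-value penalty is not differentiable at zero and is usually handled with other solvers such as coordinate descent.

```python
import random

def gradient_descent(xs, ys, lr=0.02, epochs=2000, l2=0.0, stochastic=False, seed=0):
    """Fit y ≈ w * x + b by minimizing MSE plus an optional l2 * w**2 penalty."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # starting point (zero here; random starts also work)
    n = len(xs)
    data = list(zip(xs, ys))
    for _ in range(epochs):
        if stochastic:
            # Stochastic: update after every single training example.
            rng.shuffle(data)
            for x, y in data:
                err = (w * x + b) - y
                w -= lr * (2 * err * x + 2 * l2 * w)
                b -= lr * (2 * err)
        else:
            # Batch: average the gradient over the whole training set, then update once.
            grad_w = sum(2 * ((w * x + b) - y) * x for x, y in data) / n + 2 * l2 * w
            grad_b = sum(2 * ((w * x + b) - y) for x, y in data) / n
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data lying exactly on y = 2x + 1, invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_batch, b_batch = gradient_descent(xs, ys)
w_sgd, b_sgd = gradient_descent(xs, ys, stochastic=True)
w_ridge, b_ridge = gradient_descent(xs, ys, l2=1.0)

print("batch:", w_batch, b_batch)
print("sgd:  ", w_sgd, b_sgd)
print("ridge:", w_ridge, b_ridge)
```

On this noise-free data, both variants should recover a weight near 2 and a bias near 1, while the ridge penalty shrinks the weight toward zero. If the learning rate is set too high the updates can diverge, and if it is too low convergence becomes slow.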
   