Linear Regression is one of the simplest machine learning algorithms and is used extensively in predictive analytics. It is mostly used for forecasting and for estimating how a change in one variable is associated with a change in another. It is a linear model, i.e., it assumes a linear relationship between the dependent variable (Y) and one or more independent variables (X). In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.

This algorithm falls under supervised learning, i.e., the values of the dependent variable in the training dataset are known.

Linear Regression models have many real-world applications across sectors, such as sales prediction in Retail, housing price prediction in Real Estate, weather prediction in Meteorology, stock price prediction in Finance, and predicting disease onset from biological factors in Healthcare. Understanding the intuition behind the model and how to implement it will therefore help us solve problems in predictive analytics.


## **Table of contents:**
1.	Linear Regression Assumptions
2.	Mathematics behind the model  
   2.1	Simple Linear Regression   
   2.2	Multiple Linear Regression  
    
3.	Evaluation metrics

4.	Hands-on Example


# **1. Linear Regression Assumptions:**
Regression is a parametric approach. ‘Parametric’ means it makes assumptions about the data for the purpose of analysis. If a dataset does not satisfy these assumptions, the model will perform poorly on it. For this reason, it is essential to validate the assumptions for a successful regression analysis.

The important assumptions in regression analysis are:
1.	**Linearity:** It is assumed that there is a linear relationship between the dependent (Y) and independent (X) variable(s).

2.	**Autocorrelation:** There should be no correlation between the residual (error) terms. If some correlation is present then it means that the model is unable to identify some relationship in the data.

3.	**Multicollinearity:**  There should be no strong correlation between the independent variables. If the independent variables are moderately or highly correlated, it becomes difficult to determine which variable is actually contributing to the prediction of the dependent (response) variable.

4.	**Homoskedasticity:**  The error terms must have constant variance. Non-constant variance in the error terms is called heteroskedasticity, and it often arises in the presence of outliers. When it occurs, the confidence intervals for out-of-sample predictions tend to be unrealistically wide or narrow.

5.	**Normality:** The error terms must be normally distributed. A non-normal distribution of the errors suggests that there are a few unusual data points which must be studied closely to build a better model.
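
Some of these assumptions can be checked numerically. The following is a minimal sketch on synthetic data (the data, the variable names and the helper `durbin_watson` are illustrative assumptions, not part of the wine example used later). It looks at the correlation between two predictors as a rough multicollinearity check and computes the Durbin-Watson statistic on the residuals, where values close to 2 suggest little first-order autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two predictors and a linear response
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Fit by least squares; the column of ones models the intercept
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Multicollinearity: pairwise correlation between the predictors
predictor_corr = np.corrcoef(x1, x2)[0, 1]


def durbin_watson(e):
    """Durbin-Watson statistic: roughly 2 when residuals show no lag-1 autocorrelation."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


dw = durbin_watson(residuals)

print("Correlation between predictors:", round(predictor_corr, 3))
print("Durbin-Watson statistic:", round(dw, 3))
print("Mean of residuals:", residuals.mean())
```

A plot of the residuals against the predicted values is a common complementary check for linearity and constant variance.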


# **2. Mathematics behind the model:**
Even though Python libraries can perform regression analysis in a single line of code, it is important to know the mathematics behind the model. Only when you know how a model works from scratch will you be able to tune it to your problem statement and dataset to get the desired result.

There are two kinds of variables in a linear regression model:
* The input or independent or predictor variable(s) is the input for the model and it helps in predicting the output variable. It is represented as X.
* The output or dependent variable(s) is the output of the model i.e., the variable that we want to predict. It is represented as Y.


## **2.1 Simple linear regression:**
When there is one input variable/independent variable (X) then it is called simple linear regression.

The simple linear regression equation looks like this:
![image.png](attachment:image.png)
The main idea behind this model is to fit a straight line to the data. To get the best-fit line, we have to find the values of the coefficients/parameters β0 and β1 that minimize the error between the predicted and the actual values.
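
For reference, the standard textbook form of this model, written with the coefficients β0 and β1 used in the text, is given below. The error term ε and the notation Y_e for the value predicted by the fitted line are assumptions of this sketch; the figure may use slightly different notation.

```latex
Y = \beta_0 + \beta_1 X + \varepsilon, \qquad Y_e = \beta_0 + \beta_1 X
```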

So, how do we find the optimum values for β0 and β1? Simple linear regression can be solved using Ordinary Least Squares (OLS), a statistical method for estimating the model parameters.


### *Ordinary Least Squares:*

![image.png](attachment:image.png)


 
 


Ordinary Least Squares (OLS) regression is a statistical method of analysis that estimates the relationship between one or more independent variables and a dependent variable. The objective of the least squares method is to find the values of β0 and β1 that minimise the sum of the squared differences between the actual value Y and the predicted value Yₑ.

![image.png](attachment:image.png)
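
For a single predictor, the least squares estimates have a well-known closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means. Here is a minimal sketch with NumPy, using a small made-up dataset (the data and variable names are assumptions for illustration):

```python
import numpy as np

# Small illustrative dataset (an assumption): y is roughly 2 + 3x with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: covariance of x and y divided by the variance of x
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: forces the line through the point of means
beta_0 = y_mean - beta_1 * x_mean

y_e = beta_0 + beta_1 * x          # predicted values
sse = np.sum((y - y_e) ** 2)       # the quantity OLS minimises

print("beta_0:", beta_0, "beta_1:", beta_1, "SSE:", sse)
```

Any other choice of β0 and β1 would give a larger sum of squared errors on this data.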

## **2.2 Multiple Linear Regression:**
When there are multiple input variables/independent variables (X1, X2, X3, …), it is called multiple linear regression.
The multiple linear regression equation looks like this: 
![image.png](attachment:image.png)

In simple linear regression, we used the OLS method to find the best-fit line. Here we have more than one predictor variable, so the simple one-variable formulas for β0 and β1 no longer apply directly.
We can still perform Ordinary Least Squares regression using one of the following approaches:
* Solving the model parameters analytically (Normal Equations method)  
* Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, etc.)  


### *Normal Equation (closed-form solution)*
This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values for the model parameters. It means that all of the data must be available and you must have enough memory to fit the data and perform matrix operations. So this method should be preferred for smaller datasets.

![image.png](attachment:image.png)
For very large datasets, especially those with many features, computing the inverse of XᵀX is costly, and in some cases the inverse does not exist (the matrix is singular, i.e., non-invertible, for example under perfect multicollinearity). In such cases, the Gradient Descent approach explained below is preferred.
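
The closed-form solution is usually written as θ = (XᵀX)⁻¹ Xᵀ y. The sketch below applies it to synthetic data with NumPy (the data, the true parameter values and the variable names are assumptions for illustration). It solves the linear system rather than forming the inverse explicitly, and also shows the pseudo-inverse, which still returns a solution when XᵀX is singular.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption): 100 samples, 3 features, known true parameters
m, n_features = 100, 3
X = rng.normal(size=(m, n_features))
true_theta = np.array([4.0, 1.5, -2.0, 0.5])   # intercept first, then one weight per feature
X_b = np.column_stack([np.ones(m), X])          # prepend a column of ones for the intercept
y = X_b @ true_theta + rng.normal(scale=0.1, size=m)

# Normal equation: theta = (X^T X)^(-1) X^T y
# Solving the linear system avoids forming the inverse explicitly
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# If X^T X is singular (e.g. perfect multicollinearity), the pseudo-inverse still gives a solution
theta_pinv = np.linalg.pinv(X_b) @ y

print("Estimated parameters:", theta)
```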

### *Gradient Descent:*
Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Error/Cost here represents the average of the squared errors between the predicted and the actual values. This error is expressed as a function of the model parameters, called the Mean Squared Error (MSE) cost function. So, the objective of Gradient Descent here is to minimize the MSE cost function.

Suppose you are standing on top of a hill in dense fog and can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ and moves in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum.

![image.png](attachment:image.png)

An important parameter in the Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.


![image.png](attachment:image.png)
On the other hand, if the learning rate is too high, the algorithm can overshoot the minimum and diverge, in which case it never reaches the minimum.

![image.png](attachment:image.png)
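
The effect of the learning rate can be seen on a toy problem. This sketch minimises f(θ) = θ², whose gradient is 2θ, with three different learning rates (the function `run_gradient_descent`, the toy cost function and the chosen rates are assumptions for illustration, not the regression cost itself):

```python
def run_gradient_descent(learning_rate, n_steps=20, theta=10.0):
    """Minimise f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(n_steps):
        gradient = 2 * theta
        theta = theta - learning_rate * gradient
    return theta


# A very small rate, a moderate rate, and a rate that is too high
for lr in (0.01, 0.4, 1.1):
    print(f"learning rate {lr}: theta after 20 steps = {run_gradient_descent(lr):.4f}")
```

With the small rate, θ is still far from the minimum at 0 after 20 steps; with the moderate rate it is essentially there; with the high rate each step overshoots and θ moves further away.
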
To implement the Gradient Descent, you need to compute the gradient of the cost function with respect to the parameter vector θ. Here gradient means how much the cost function will change if there is a small change in the parameter vector θ. To get the gradient, we need to take the partial derivative of the cost function.

The MSE cost function is defined as,
![image.png](attachment:image.png)
where m is the number of samples in the dataset.
  


The partial derivative of the cost function is,
![image.png](attachment:image.png)
The above equation computes the partial derivative for one model parameter at a time. Using the vectorized form instead, we can compute all of the partial derivatives in one go.

The vectorized form looks like,
![image.png](attachment:image.png)
The gradient vector, ∇_θ MSE(θ), contains all the partial derivatives of the cost function (one for each model parameter).

The update rule to get the updated weights/parameters is given below,
![image.png](attachment:image.png)
Where α is the learning rate hyperparameter.
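
Putting the pieces together, batch Gradient Descent for linear regression can be sketched as follows. The sketch assumes the standard MSE definition with a 1/m factor, for which the vectorized gradient is (2/m) Xᵀ(Xθ − y); the synthetic data, the learning rate and the number of iterations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption): y = 4 + 3x plus noise
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=m)
X_b = np.column_stack([np.ones(m), X])   # column of ones for the intercept

alpha = 0.1            # learning rate
n_iterations = 1000
theta = np.zeros(2)    # initial parameters

for _ in range(n_iterations):
    errors = X_b @ theta - y                  # prediction error for every sample
    gradients = (2 / m) * X_b.T @ errors      # vectorised gradient of the MSE
    theta = theta - alpha * gradients         # update rule

# Closed-form least squares solution for comparison
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print("Gradient descent:", theta)
print("Least squares:   ", theta_exact)
```

On this small problem both approaches arrive at practically the same parameters, which is a useful sanity check for a hand-written implementation.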

This sums up the methods available for finding the parameters of the Linear Regression model. From a practical point of view, we don’t need to write the algorithm from scratch every time we want to apply linear regression. Python provides a machine learning library called scikit-learn, which contains a linear regression implementation that we can use with a single line of code, as you will see in the hands-on example.

# **3. Evaluation Metrics:**
So far you have learned how the linear regression model works and what its parameters are. Once you have trained your model and obtained the predicted output, you need to check how good the predictions are compared to the actual output. If the predictive power of the model is very poor, you need to go back and tune the model or use a different algorithm so that the model’s error is reduced, thereby increasing its predictive power.

The most commonly used evaluation metrics for linear regression are Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
![image.png](attachment:image.png)
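
These three metrics are simple to compute by hand, which helps to see what each one measures. A minimal sketch with NumPy on made-up values (the arrays `y_true` and `y_pred` are assumptions for illustration):

```python
import numpy as np

# Illustrative values (an assumption), not taken from the wine dataset
y_true = np.array([3.0, 5.0, 6.0, 7.0])
y_pred = np.array([2.5, 5.0, 7.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # square root brings the error back to the units of y

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalise large errors more heavily than MAE does.
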
Let’s apply this linear regression model to a real dataset and see how it works!


# **4. Hands-on Example:**
In this section we will see how to apply linear regression using Python. We will work on a multiple linear regression problem because almost all real-world problems have more than two variables. The steps for multiple linear regression are the same as those for simple linear regression.

The dataset we are going to use here is the red wine quality dataset. This dataset is related to red variants of the Portuguese “Vinho Verde” wine. 

This hands-on example is only meant to show how to apply a linear regression model using scikit-learn, so I do not go deep into the Exploratory Data Analysis (EDA) part.

Let’s start our coding:


In [None]:
# Importing all the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Importing the dataset from your file directory where you have downloaded your csv file

dataset = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
dataset.head()

Let's check the number of rows and columns in the dataset.

In [None]:
dataset.shape

This dataset contains 1599 rows and 12 columns. Each row in this dataset represents a wine and each column represents a feature of that wine.

Now, let's check for any missing values in the dataset.

In [None]:
dataset.isnull().sum()

As you can see from the output, there are no missing values in the dataset. If any were present, we would need to handle them before proceeding to the next step, because the algorithm cannot work with missing values: they must either be removed or imputed with other values.

Now, we will check the descriptive statistics for the dataset to get an understanding about the data:

In [None]:
dataset.describe()

From the above statistics, we get an overall picture of each variable in the dataset: the count, mean, standard deviation, minimum and maximum values, etc.

In this dataset, we are going to predict the quality of the wine, so the dependent variable is 'quality'. All the remaining columns are independent/predictor variables, the characteristics of the wine, which we will use to predict the quality.

So, now let's separate the independent and dependent variables in the dataset.

In [None]:
# Every column except the last is a predictor; the last column is 'quality'
X = dataset.iloc[:, :-1]
Y = dataset.iloc[:, -1]

Now we have X and Y separately. Next, we split them into a training set and a test set, because once our linear regression model is trained we need some unseen data to check how it performs.

We will split the complete dataset into 70% training data and 30% test data.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

As you can see from the output, the training set has 1119 observations and the test set has 480 observations.

Now, let's train our model:

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

As I have already mentioned, fitting a linear regression model takes just one line of code.
Now the model is trained and ready to predict the values for the test set.
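
Before predicting, it can be instructive to look at the parameters the model has learned. A fitted scikit-learn `LinearRegression` exposes them as `intercept_` (the β0 term) and `coef_` (one weight per input column). The sketch below uses stand-in data so that it runs on its own; the column names `feature_a` and `feature_b`, the weights and the variable names are hypothetical. In this notebook the same attributes would be read from `regressor`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Stand-in data (an assumption): two hypothetical features with known weights
features = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
target = 5.0 + 2.0 * features["feature_a"] - 1.0 * features["feature_b"]

model = LinearRegression()
model.fit(features, target)

# intercept_ plays the role of β0; coef_ holds one weight per input column
coefficients = pd.Series(model.coef_, index=features.columns)
print("Intercept:", model.intercept_)
print(coefficients)
```

Each coefficient can be read as the change in the predicted value for a one-unit increase in that feature, with the other features held fixed.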

In [None]:
Y_pred = regressor.predict(X_test)

Let's compare the actual and the predicted values for the test set.

In [None]:
compare = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
compare.head(10)

As you can see from the output, the model's predictions are reasonably close to the actual values. Let's plot the actual and predicted values for a better understanding. To keep the plot readable, we will take only the first 25 observations from the test set.

In [None]:
df = compare.head(25)
df.plot(kind='bar',figsize=(10,8))

The final step is to evaluate the performance of the model. We'll use evaluation metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

Thus we have successfully applied a Linear Regression model using scikit-learn on the red wine quality dataset. This example is only meant to give you the gist of how to perform regression analysis.


I hope this article helps you understand the linear regression model in detail. There are still some topics to be covered with respect to Linear Regression, such as Polynomial Regression, Regularization techniques, and optimization algorithms like Stochastic Gradient Descent (SGD), Batch Gradient Descent and Mini-Batch Gradient Descent, but that’s for the next article. Till then, Happy Learning!
