## Linear Regression

Linear regression is a widely used statistical technique for modeling the relationship between a dependent variable and one or more independent variables. It assumes that the relationship is linear and finds the best-fit line by minimizing the squared differences between the predicted and actual values. It is used both to make predictions and to understand how each independent variable affects the dependent variable.

In simple linear regression, we consider a single independent variable and a single dependent variable. The relationship between the variables can be represented by the equation:

$y = mx + c$

Where:

- `y` is the dependent variable
- `x` is the independent variable
- `c` is the y-intercept (the value of `y` when `x` is 0)
- `m` is the slope (the change in `y` for a unit change in `x`)

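In higher dimensions, with several independent variables, this equation becomes:

$y = w^\top x + b$

Here $x$ is the vector of independent variables, $w$ is a vector of weights (one per variable), and $b$ is the intercept. The goal is to estimate the values of $w$ and $b$ that best fit the data.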
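As a minimal sketch of the multi-feature case, the prediction for each sample is the dot product of its feature vector with the weights, plus the intercept. The feature matrix, weights, and intercept below are made-up values for illustration only; they are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical example: 3 samples, 2 features each
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

w = np.array([0.5, -1.0])  # assumed weights, one per feature
b = 2.0                    # assumed intercept

# Each prediction is (row of X) . w + b, computed for all rows at once
y_pred = X @ w + b
print(y_pred)
```

Stacking the samples as rows of `X` lets a single matrix-vector product produce all predictions.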

## Ordinary Least Squares (OLS) Estimation

The most common method for estimating the coefficients (`w` and `b`) in linear regression is Ordinary Least Squares (OLS). For simple linear regression, it finds the values of `w` and `b` that minimize the residual sum of squares (RSS):

$RSS = \sum_i (y_i - b - w x_i)^2$

Setting the partial derivatives of the RSS with respect to `b` and `w` to zero gives:

$\frac{\partial RSS}{\partial b} = -2 \sum_i (y_i - b - w x_i) = 0$

$\frac{\partial RSS}{\partial w} = -2 \sum_i x_i (y_i - b - w x_i) = 0$

Solving these equations simultaneously yields the estimated coefficients `w` and `b`:

$w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

$b = \bar{y} - w \bar{x}$

Where:

- `x̄` is the mean of the independent variable `x`
- `ȳ` is the mean of the dependent variable `y`

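These closed-form formulas apply to simple linear regression (a single independent variable). They can be computed directly from the data, without iterative optimization, and give the best-fit line.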
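The closed-form estimates can be sketched in a few lines of NumPy. The helper name `fit_simple_ols` and the toy data are assumptions made for illustration; the data is constructed to lie exactly on a known line so the result is easy to check.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (w, b) for y ~ w*x + b using the closed-form OLS formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-deviations over sum of squared deviations of x
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean)
    b = y_mean - w * x_mean
    return w, b

# Hypothetical toy data generated from y = 2x + 1 (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = fit_simple_ols(x, y)
print(w, b)
```

Because the toy data has no noise, the recovered slope and intercept should match the generating line. With several features, as in the dataset below, a library solver such as scikit-learn's `LinearRegression` is used instead.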

In [10]:
from sklearn.linear_model import LinearRegression 
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

In [11]:
data = pd.read_csv("clean_data.csv")
data.shape

(313, 29)

In [12]:
X = data.drop('target', axis=1).to_numpy()
y = data['target'].to_numpy()

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [24]:
linear_model.coef_

array([ 1.96828058e+00, -8.14145255e-01,  5.14536158e-01, -3.40757986e-01,
       -1.39201015e-01, -1.08507647e-01, -4.54437456e-01,  1.33219416e+00,
       -2.99596597e+00,  2.21555489e-01, -2.74702463e-01,  1.85857457e-01,
        1.28302078e+14,  1.28302078e+14,  3.22247653e+14,  4.19933824e+14,
        6.72251494e+14,  1.69199727e+14,  3.59082168e+14,  2.25526377e+14,
        2.25526377e+14,  4.64574541e+14, -7.74720306e+13, -6.61834127e+14,
       -6.84551749e+14, -4.58330714e+14, -2.02991277e+14, -1.72118678e+14])

In [22]:
linear_model.intercept_

66.8315490625

In [20]:
X.shape

(313, 28)

### Performance Metrics
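- Mean Absolute Error (MAE): the average absolute difference between predicted and actual values; lower is better.
- R² Score: the proportion of variance in the dependent variable that the model explains; 1 indicates a perfect fit.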
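Both metrics can be computed by hand, which makes their definitions concrete. The sketch below uses made-up `y_true` and `y_pred` arrays rather than the notebook's test split, and the helper names `mae` and `r2` are chosen here for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    """R2 score: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

mae_value = mae(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(mae_value, r2_value)
```

In practice, `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.r2_score` provide the same quantities and could be applied to `y_test` and `linear_model.predict(X_test)`.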