# Why Linear Regression?

- Sometimes we need to predict a continuous quantity (e.g. the price of a house, number of people in a house) based on relevant factors (e.g. square footage, number of bedrooms, location)

- Linear regression is a supervised machine learning method that learns from example data how to make such predictions

# Mathematical Setup

- X<sub>1</sub>, X<sub>2</sub>, ... X<sub>n</sub> - Features (i.e. factors, e.g. car manufacturer, car age, number of doors)

- Y - Label (i.e. the quantity being predicted, e.g., car price)

### Relationship between features and label

- Linear Regression Assumption - Label varies linearly with feature(s)

- <strong>Y = θ<sub>0</sub> + θ<sub>1</sub>X<sub>1</sub> + θ<sub>2</sub>X<sub>2</sub> + ... + θ<sub>n</sub>X<sub>n</sub> + ε</strong>

    - θ<sub>i</sub> - Parameter(s): θ<sub>0</sub> is the intercept; each remaining θ<sub>i</sub> is the change in the label per unit change in feature X<sub>i</sub>, with the other features held fixed
    
    - ε - Random Zero-Mean Term: <u>Cannot be predicted</u> exactly; it models the variation in the label that the features do not explain
    
- Goal: Estimate θ<sub>0</sub>, θ<sub>1</sub>, θ<sub>2</sub>, ... θ<sub>n</sub> <u>as closely as possible</u> from observed examples of features and labels
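To make the setup concrete, here is a minimal sketch that draws synthetic data from the linear model above and then tries to recover the parameters. The helper names `simulate_linear_data` and `estimate_theta`, the Gaussian features and noise, and the use of ordinary least squares (via `np.linalg.lstsq`) as the estimation method are all assumptions made for illustration; the notes above do not yet say how θ should be found.

```python
import numpy as np

def simulate_linear_data(theta, n_samples, noise_std, seed=0):
    """Draw (X, y) from Y = theta_0 + theta_1*X_1 + ... + theta_n*X_n + eps.

    Hypothetical helper: features and noise are Gaussian by assumption.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    n_features = len(theta) - 1
    X = rng.normal(size=(n_samples, n_features))
    # eps is the zero-mean random term that cannot be predicted exactly
    eps = rng.normal(loc=0.0, scale=noise_std, size=n_samples)
    y = theta[0] + X @ theta[1:] + eps
    return X, y

def estimate_theta(X, y):
    """Estimate theta (intercept first) by ordinary least squares (an assumed method)."""
    # Prepend a column of ones so the first coefficient is the intercept
    X_aug = np.column_stack([np.ones(len(X)), X])
    theta_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return theta_hat

true_theta = np.array([2.0, -1.0, 0.5])
X, y = simulate_linear_data(true_theta, n_samples=500, noise_std=0.1)
theta_hat = estimate_theta(X, y)
print("true theta:     ", true_theta)
print("estimated theta:", theta_hat)
```

Because of the noise term, the estimate is close to the true parameters but not identical, which is what "as closely as possible" means in practice.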

### A more concise representation

- Let X and θ be vectors in R<sup>n+1</sup>

    - X = <X<sub>0</sub>, X<sub>1</sub>, X<sub>2</sub>, ... X<sub>n</sub>> where X<sub>0</sub> = 1
    
    - θ = <θ<sub>0</sub>, θ<sub>1</sub>, θ<sub>2</sub>, ... θ<sub>n</sub>>

- Then, <strong>Y = θ<sup>T</sup>X + ε</strong>
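The vector form can be checked numerically. The sketch below uses made-up values for θ and the features, plus a hypothetical helper `add_bias_feature`, to show that prepending X<sub>0</sub> = 1 makes the dot product θ<sup>T</sup>X equal to the expanded sum term by term (the noise term ε is left out, since it is not part of the prediction).

```python
import numpy as np

def add_bias_feature(x):
    """Prepend X_0 = 1 to a feature vector so theta_0 acts as the intercept."""
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

# Illustrative values only: theta = <theta_0, theta_1, theta_2>
theta = np.array([1.0, 2.0, 3.0])
x = [4.0, 5.0]  # raw features X_1, X_2

X = add_bias_feature(x)  # <1, X_1, X_2>

# Expanded form: theta_0 + theta_1*X_1 + theta_2*X_2
long_form = theta[0] + theta[1] * x[0] + theta[2] * x[1]

# Concise form: theta^T X
vector_form = theta @ X

print(long_form, vector_form)
```

Both expressions give the same number, so the concise form is only a change of notation, not a different model.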

In [3]:
from sklearn.datasets import load_diabetes

In [34]:
diabetes = load_diabetes(as_frame=True)
print(diabetes)

{'data':           age       sex       bmi        bp        s1        s2        s3  \
0    0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1   -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2    0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3   -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4    0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   
..        ...       ...       ...       ...       ...       ...       ...   
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674   
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674   
439  0.041708  0.050680 -0.015906  0.017293 -0.037344 -0.013840 -0.024993   
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674   
441 -0.045472 -0.044642 -0.073030 -0.081413  0.083740  0.027809  0.173816   

           s4        s5        s6  
0   -0.002592  0.019907 -0.017