# Why Linear Regression?

- Sometimes we need to predict a continuous quantity (e.g. the price of a house, number of people in a house) based on relevant factors (e.g. square footage, number of bedrooms, location)

- Linear regression is a machine learning algorithm that learns from example data how to make such predictions

# Mathematical Setup

- X<sub>1</sub>, X<sub>2</sub>, ... X<sub>n</sub> - Features (i.e. factors, e.g. car manufacturer, car age, number of doors)

- Y - Label (i.e. the quantity being predicted, e.g., car price)

### Relationship between features and label

- Linear Regression Assumption - Label varies linearly with feature(s)

- <strong>Y = θ<sub>0</sub> + θ<sub>1</sub>X<sub>1</sub> + θ<sub>2</sub>X<sub>2</sub> + ... + θ<sub>n</sub>X<sub>n</sub> + ε</strong>

    - θ<sub>i</sub> - Parameter(s): Gives the exact linear relationship of the label with each feature
    
    - ε - Random Zero-Mean Term: <u>Cannot be predicted</u> exactly; usually included to model uncertainty (noise) in the label
    
- Goal: Estimate θ<sub>0</sub>, θ<sub>1</sub>, θ<sub>2</sub>, ... θ<sub>n</sub> from data <u>as closely as possible</u>
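One common way to make "as closely as possible" concrete is ordinary least squares: pick the θ that minimizes the sum of squared differences between the observed labels and the linear prediction. The sketch below simulates data from known parameters and then recovers them with `np.linalg.lstsq`. The values in `theta_true`, the noise scale, and the sample size are illustrative assumptions, not taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth parameters: theta_0 (intercept), theta_1, theta_2
theta_true = np.array([2.0, -1.0, 0.5])

n_samples = 1000
features = rng.normal(size=(n_samples, 2))      # columns play the role of X_1, X_2
noise = rng.normal(scale=0.1, size=n_samples)   # epsilon: zero-mean random term

# A leading column of ones lets theta_0 be estimated like any other parameter
X = np.column_stack([np.ones(n_samples), features])
y = X @ theta_true + noise

# Least-squares estimate: minimizes ||X @ theta - y||^2
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)
```

Because of the noise term, the estimate should be close to `theta_true` but not exactly equal to it.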

### A more concise representation

- Let X and θ be vectors in R<sup>n+1</sup>

    - X = <X<sub>0</sub>, X<sub>1</sub>, X<sub>2</sub>, ... X<sub>n</sub>> where X<sub>0</sub> = 1
    
    - θ = <θ<sub>0</sub>, θ<sub>1</sub>, θ<sub>2</sub>, ... θ<sub>n</sub>>

- Then, <strong>Y = θ<sup>T</sup>X + ε</strong>
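As a small worked example of the vector form, the snippet below evaluates θ<sup>T</sup>X for a single example without the noise term and checks it against the expanded sum. The numbers in `theta` and `features` are made up for illustration.

```python
import numpy as np

# Hypothetical parameters theta_0, theta_1, theta_2 and one example's features X_1, X_2
theta = np.array([1.0, 2.0, 3.0])
features = np.array([4.0, 5.0])

# Prepend X_0 = 1 so that theta_0 acts as the intercept
x = np.concatenate(([1.0], features))

# Vector form: theta^T X = 1*1 + 2*4 + 3*5 = 24
y_vector_form = theta @ x

# Expanded form: theta_0 + theta_1 * X_1 + theta_2 * X_2
y_expanded = theta[0] + theta[1] * features[0] + theta[2] * features[1]

print(y_vector_form, y_expanded)
```

Setting X<sub>0</sub> = 1 is what allows the intercept θ<sub>0</sub> to be folded into a single dot product.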

In [2]:
from sklearn.datasets import load_diabetes

In [5]:
diabetes = load_diabetes(as_frame=True)
diabetes.keys()

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])

In [9]:
df_X = diabetes['data']
df_X

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


In [10]:
df_y = diabetes['target']
df_y

0      151.0
1       75.0
2      141.0
3      206.0
4      135.0
       ...  
437    178.0
438    104.0
439    132.0
440    220.0
441     57.0
Name: target, Length: 442, dtype: float64

In [14]:
# Work on a copy so that df_X keeps only the feature columns
df = df_X.copy()
df['target'] = df_y
df

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0
...,...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207,178.0
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485,104.0
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491,132.0
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930,220.0


In [15]:
df_train = df.sample(frac=0.8, random_state=42)
df_test = df.drop(df_train.index)
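For comparison, scikit-learn provides `train_test_split`, which performs a similar random split in one call. The sketch below is an alternative to the `sample`/`drop` approach, not a reproduction of it: even with the same `random_state` it will generally select different rows, and the split sizes may differ by a row because of rounding. The names `df_alt`, `df_train_alt`, and `df_test_alt` are introduced here only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# 'frame' holds the ten feature columns plus the 'target' column
df_alt = load_diabetes(as_frame=True)['frame']

# Hold out 20% of the rows for testing
df_train_alt, df_test_alt = train_test_split(df_alt, test_size=0.2, random_state=42)

print(df_train_alt.shape, df_test_alt.shape)
```

Either way, the important property is that no row appears in both the training and the test set.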

In [16]:
df_train

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
287,0.045341,-0.044642,-0.006206,-0.015999,0.125019,0.125198,0.019187,0.034309,0.032432,-0.005220,219.0
211,0.092564,-0.044642,0.036907,0.021872,-0.024960,-0.016658,0.000779,-0.039493,-0.022517,-0.021788,70.0
72,0.063504,0.050680,-0.004050,-0.012556,0.103003,0.048790,0.056003,-0.002592,0.084492,-0.017646,202.0
321,0.096197,-0.044642,0.051996,0.079265,0.054845,0.036577,-0.076536,0.141322,0.098648,0.061054,230.0
73,0.012648,0.050680,-0.020218,-0.002228,0.038334,0.053174,-0.006584,0.034309,-0.005142,-0.009362,111.0
...,...,...,...,...,...,...,...,...,...,...,...
279,0.038076,0.050680,-0.024529,0.004658,-0.026336,-0.026366,0.015505,-0.039493,-0.015999,-0.025930,200.0
382,0.048974,-0.044642,0.060618,-0.022885,-0.023584,-0.072712,-0.043401,-0.002592,0.104136,0.036201,132.0
337,0.019913,0.050680,-0.012673,0.070072,-0.011201,0.007141,-0.039719,0.034309,0.005386,0.003064,91.0
236,0.027178,-0.044642,0.006728,0.035644,0.079612,0.070710,0.015505,0.034309,0.040673,0.011349,67.0


In [17]:
df_test

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
13,0.005383,0.050680,-0.001895,0.008101,-0.004321,-0.015719,-0.002903,-0.002592,0.038394,-0.013504,185.0
20,-0.049105,-0.044642,-0.056863,-0.043542,-0.045599,-0.043276,0.000779,-0.039493,-0.011897,0.015491,68.0
21,-0.085430,0.050680,-0.022373,0.001215,-0.037344,-0.026366,0.015505,-0.039493,-0.072133,-0.017646,49.0
34,0.016281,-0.044642,-0.063330,-0.057313,-0.057983,-0.048912,0.008142,-0.039493,-0.059471,-0.067351,65.0
...,...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078,66.0
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059,49.0
435,-0.012780,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.038460,-0.038357,64.0
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485,104.0


In [25]:
X_train = df_train.drop('target', axis=1)
y_train = df_train['target'].copy()

X_test = df_test.drop('target', axis=1)
y_test = df_test['target'].copy()

In [26]:
X_train

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
287,0.045341,-0.044642,-0.006206,-0.015999,0.125019,0.125198,0.019187,0.034309,0.032432,-0.005220
211,0.092564,-0.044642,0.036907,0.021872,-0.024960,-0.016658,0.000779,-0.039493,-0.022517,-0.021788
72,0.063504,0.050680,-0.004050,-0.012556,0.103003,0.048790,0.056003,-0.002592,0.084492,-0.017646
321,0.096197,-0.044642,0.051996,0.079265,0.054845,0.036577,-0.076536,0.141322,0.098648,0.061054
73,0.012648,0.050680,-0.020218,-0.002228,0.038334,0.053174,-0.006584,0.034309,-0.005142,-0.009362
...,...,...,...,...,...,...,...,...,...,...
279,0.038076,0.050680,-0.024529,0.004658,-0.026336,-0.026366,0.015505,-0.039493,-0.015999,-0.025930
382,0.048974,-0.044642,0.060618,-0.022885,-0.023584,-0.072712,-0.043401,-0.002592,0.104136,0.036201
337,0.019913,0.050680,-0.012673,0.070072,-0.011201,0.007141,-0.039719,0.034309,0.005386,0.003064
236,0.027178,-0.044642,0.006728,0.035644,0.079612,0.070710,0.015505,0.034309,0.040673,0.011349


In [27]:
y_train

287    219.0
211     70.0
72     202.0
321    230.0
73     111.0
       ...  
279    200.0
382    132.0
337     91.0
236     67.0
207    202.0
Name: target, Length: 354, dtype: float64

In [28]:
X_test

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
13,0.005383,0.050680,-0.001895,0.008101,-0.004321,-0.015719,-0.002903,-0.002592,0.038394,-0.013504
20,-0.049105,-0.044642,-0.056863,-0.043542,-0.045599,-0.043276,0.000779,-0.039493,-0.011897,0.015491
21,-0.085430,0.050680,-0.022373,0.001215,-0.037344,-0.026366,0.015505,-0.039493,-0.072133,-0.017646
34,0.016281,-0.044642,-0.063330,-0.057313,-0.057983,-0.048912,0.008142,-0.039493,-0.059471,-0.067351
...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
435,-0.012780,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.038460,-0.038357
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485


In [29]:
y_test

1       75.0
13     185.0
20      68.0
21      49.0
34      65.0
       ...  
431     66.0
434     49.0
435     64.0
438    104.0
441     57.0
Name: target, Length: 88, dtype: float64

In [32]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X_train.values, y_train.values)

In [38]:
X_train.values.shape

(354, 10)

In [39]:
y_train.values.shape

(354,)

In [33]:
reg.coef_

array([    5.33090312,  -235.78389174,   500.4856749 ,   275.28785623,
       -1016.96672502,   637.33668394,   186.67545657,   210.32623651,
         878.43250988,    82.29802939])
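Note that `coef_` holds only θ<sub>1</sub> ... θ<sub>n</sub>; scikit-learn stores the intercept θ<sub>0</sub> separately in `intercept_`. The sketch below checks that `intercept_ + X @ coef_` matches `predict`. For brevity it fits on the full dataset rather than on the training split used in this notebook, so its coefficients will differ from the ones printed above; `reg_demo` is a name used only here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Fit on the full dataset purely to illustrate the fitted attributes
X, y = load_diabetes(return_X_y=True)
reg_demo = LinearRegression().fit(X, y)

# theta_0 is intercept_, theta_1..theta_n are coef_
manual_pred = reg_demo.intercept_ + X @ reg_demo.coef_

# The manual computation agrees with the library's predict method
print(np.allclose(manual_pred, reg_demo.predict(X)))
```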

In [34]:
y_pred = reg.predict(X_test.values)
y_pred

array([ 65.87391936, 169.28287714, 125.29118268,  81.69541875,
        80.37925658,  80.84728701,  71.21187325, 195.58393257,
       159.34244492, 133.4828985 , 154.70502382, 135.47655198,
       106.0213285 , 233.60749315, 175.31534097, 155.2583345 ,
        85.68191598, 192.70633346, 142.04212113, 149.6857362 ,
       122.56178913,  72.12080848, 204.10739587, 242.80130958,
       140.3270011 , 187.18056883, 146.84282909,  85.61051117,
       230.24905558,  58.04460448, 219.75071236, 172.82565377,
        62.51496526, 152.15128882, 118.2082381 , 199.84819586,
       135.82998418,  69.00768643, 232.04709334, 152.17868682,
       101.68975916, 220.23886344, 180.17301355, 120.21533877,
       101.23381546, 274.18539673, 184.02923128, 105.9213372 ,
       138.57348615,  44.71541092, 123.35531056,  91.5075508 ,
       191.66729518, 195.20431419, 186.44552363, 172.71155438,
       156.34736883, 229.37454273, 125.14356238,  88.26898104,
       141.76396946, 216.35204341, 113.27045066, 196.93

In [37]:
reg.score(X_train.values, y_train.values)

0.540551453247162
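The score above is R<sup>2</sup> computed on the training data, so it describes how well the model fits examples it has already seen. A minimal sketch of evaluating on the held-out test set follows; it rebuilds the same split so that it runs on its own, and it uses `r2_score` and `mean_squared_error` from `sklearn.metrics`. No output values are shown here because they are not part of the original run.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Rebuild the same 80/20 split used earlier in the notebook
df = load_diabetes(as_frame=True)['frame']
df_train = df.sample(frac=0.8, random_state=42)
df_test = df.drop(df_train.index)

X_train, y_train = df_train.drop('target', axis=1), df_train['target']
X_test, y_test = df_test.drop('target', axis=1), df_test['target']

reg = LinearRegression().fit(X_train.values, y_train.values)
y_pred = reg.predict(X_test.values)

# R^2 and mean squared error on data the model did not see during fitting
test_r2 = r2_score(y_test.values, y_pred)
test_mse = mean_squared_error(y_test.values, y_pred)
print(test_r2, test_mse)
```

`r2_score(y_test, y_pred)` is the same quantity that `reg.score(X_test.values, y_test.values)` returns.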