# 📊 Linear Regression
> Linear regression learns from **labelled datasets** and **fits the linear function that best maps the input features to the output, which can then be used for prediction on new datasets**. It assumes a linear relationship between the input and the output, meaning the output changes at a constant rate as the input changes. With a single input, this relationship is a straight line.
---

- `Simple Linear Regression` : 
    - **Output value depends on a single input value**
    - **Y = c1 + c2X**

- `Multiple Linear Regression` :
    - **Output value depends on multiple input values**
    - **Y = c1 + c2X1 + c3X2 + c4X3 + ...**
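
To make the two formulas concrete, here is a minimal sketch on synthetic, noise-free data. The coefficient values (2, 3, 1.5, -0.5) and the variable names are made up for illustration. After fitting, `intercept_` plays the role of `c1` and `coef_` holds the remaining coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: Y = c1 + c2*X, with illustrative c1=2, c2=3
X_simple = rng.uniform(0, 10, size=(50, 1))
y_simple = 2 + 3 * X_simple[:, 0]
simple = LinearRegression().fit(X_simple, y_simple)

# Multiple linear regression: Y = c1 + c2*X1 + c3*X2, with illustrative c1=2, c2=1.5, c3=-0.5
X_multi = rng.uniform(0, 10, size=(50, 2))
y_multi = 2 + 1.5 * X_multi[:, 0] - 0.5 * X_multi[:, 1]
multi = LinearRegression().fit(X_multi, y_multi)

# Because the data has no noise, the fitted values match the chosen coefficients
print(simple.intercept_, simple.coef_)
print(multi.intercept_, multi.coef_)
```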

In [3]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from seaborn import load_dataset

In [4]:
data = load_dataset('tips')
data.head()

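| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |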


In [5]:
X = data.drop(columns=['tip'])
Y = data['tip']

In [6]:
data['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [10]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

In [17]:
transformer = ColumnTransformer(transformers=[
    ('trf1', OrdinalEncoder(), ['time', 'sex', 'smoker', 'day']),
], remainder='passthrough')

In [19]:
X_transformed = transformer.fit_transform(X=X)

In [20]:
X_train, X_test, Y_train,  Y_test = train_test_split(X_transformed, Y, test_size=0.2)
X_test.shape, X_train.shape, Y_test.shape, Y_train.shape

((49, 6), (195, 6), (49,), (195,))
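
Because `train_test_split` shuffles randomly, rerunning the cell above produces a different split and therefore different scores. Passing `random_state` fixes the shuffle. In this sketch the value 42 is arbitrary and the arrays are placeholders, not the tips data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # placeholder features
y_demo = np.arange(10)                 # placeholder target

# The same random_state gives an identical split on every call
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(np.array_equal(a_test, b_test))  # True
```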

In [24]:
X_train

array([[ 0.  ,  0.  ,  1.  ,  2.  , 18.15,  3.  ],
       [ 1.  ,  1.  ,  1.  ,  3.  , 16.  ,  2.  ],
       [ 1.  ,  0.  ,  0.  ,  3.  , 12.48,  2.  ],
       ...,
       [ 0.  ,  1.  ,  1.  ,  1.  , 50.81,  3.  ],
       [ 0.  ,  1.  ,  1.  ,  2.  , 45.35,  3.  ],
       [ 0.  ,  1.  ,  0.  ,  2.  , 14.07,  2.  ]], shape=(195, 6))

In [25]:
model = LinearRegression()
model.fit(X_train, Y_train)


| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [26]:
Y_preds = model.predict(X_test)
Y_preds

array([2.89672018, 4.57354842, 2.87608228, 1.42069975, 2.64778406,
       2.63210737, 1.49308545, 1.98616799, 4.7919797 , 3.51105062,
       1.94033014, 2.1442911 , 3.24499003, 1.68811983, 3.43519121,
       3.15602093, 3.08481766, 2.63132688, 2.70567492, 1.81453629,
       2.46090232, 2.47740059, 2.64061868, 2.9318312 , 2.82965933,
       3.63766609, 2.14693406, 1.89485395, 2.41899675, 2.76375041,
       2.38612764, 1.99405726, 3.98123728, 4.73164959, 3.06455978,
       4.30630459, 3.89801488, 2.08189837, 3.66210744, 2.9545991 ,
       2.193626  , 2.61798454, 4.00673146, 1.4267471 , 1.85188339,
       3.54085013, 3.33626298, 3.33110166, 2.87316341])

In [27]:
Y_test

106    4.06
187    2.00
191    4.19
222    1.92
115    3.50
140    3.50
92     1.00
220    2.20
112    4.00
72     3.14
53     1.56
27     2.00
113    2.55
195    1.44
35     3.60
63     3.76
46     5.00
20     4.08
134    3.25
226    2.00
171    3.16
93     4.30
41     2.54
104    4.08
55     3.51
185    5.00
133    2.00
148    1.73
105    1.64
17     3.71
32     3.00
51     2.60
77     4.00
180    3.68
65     3.15
216    3.00
239    5.92
70     1.97
114    4.00
89     3.00
161    2.50
0      1.01
173    3.18
111    1.00
30     1.45
240    2.00
107    4.29
227    3.00
21     2.75
Name: tip, dtype: float64

## **METRICS FOR EVALUATION**

In [28]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

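##### **MAE**
- Robust to outliers, because every error contributes in proportion to its size
- Easy to interpret: it is in the same units as the target
- Not differentiable at zero error, and its gradient has constant magnitude everywhere else, which makes it less convenient as a loss function for gradient-based optimization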

In [29]:
mean_absolute_error(Y_test, Y_preds)

0.735473085283227

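##### **MSE**
- Sensitive to outliers, because errors are squared
- Harder to interpret: it is in squared units of the target
- Smooth and differentiable everywhere, which makes it a convenient loss function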

In [30]:
mean_squared_error(Y_test, Y_preds)

0.8895892599739417
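
The difference in outlier sensitivity between the two metrics can be checked by hand. The sketch below uses made-up numbers and two small helper functions (`mae`, `mse`) written for illustration; they follow the textbook definitions and are not part of scikit-learn.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of absolute errors
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # Mean of squared errors
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.5, 2.5, 4.5, 4.5])           # every error is 0.5
y_pred_outlier = np.array([2.5, 2.5, 4.5, 15.0])  # last error is 10

print(mae(y_true, y_pred), mse(y_true, y_pred))                  # 0.5 0.25
print(mae(y_true, y_pred_outlier), mse(y_true, y_pred_outlier))  # 2.875 25.1875
```

A single large error multiplies MAE by about 6 here, while MSE grows by a factor of about 100.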

#### **R2 SCORE**
- Measures how much better the fitted line is than a baseline that always predicts the mean of the target
- Can be read as the fraction of the variance in the target that the features jointly explain through the fitted line

In [31]:
r2_score(Y_test, Y_preds)

0.2928565718773366
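
The comparison against the mean line can be written out directly as 1 minus the ratio of the model's squared error to the mean line's squared error. The helper `r2_manual` below is a hand-written sketch of that definition on made-up numbers, not the scikit-learn implementation.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # squared error of the predictions
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # squared error of the mean line
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

print(r2_manual(y_true, y_true))                       # perfect predictions: 1.0
print(r2_manual(y_true, np.full(4, y_true.mean())))    # always predicting the mean: 0.0
print(r2_manual(y_true, y_true[::-1]))                 # worse than the mean line: -3.0
```

The last case shows that the score can be negative when predictions are worse than simply predicting the mean.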

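#### **ADJUSTED R2 SCORE**
- Adding columns, even useless ones, never lowers the R2 score on the training data, which makes the plain score misleading when comparing models
- Adjusted R2 corrects for this by penalizing the number of features `k` relative to the number of samples `n`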

In [32]:
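r2 = r2_score(Y_test, Y_preds)
n = 49  # number of samples in the test set
k = 6   # number of features the model was trained on (X_test has 6 columns)

adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
round(adj_r2, 4)

0.1918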
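
The claim that useless columns inflate R2 can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the data is random, the helper names `fit_r2` and `adjusted_r2` are made up, and the fit uses NumPy least squares scored on the training data itself.

```python
import numpy as np

def fit_r2(X, y):
    # Ordinary least squares with an intercept column, scored on the training data
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, k):
    # Same formula as in the cell above: n samples, k features
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=(n, 1))
y = 2 + 3 * x[:, 0] + rng.normal(size=n)  # one real predictor plus noise
junk = rng.normal(size=(n, 5))            # five columns unrelated to y
X_big = np.hstack([x, junk])

r2_small, r2_big = fit_r2(x, y), fit_r2(X_big, y)
print(r2_small, adjusted_r2(r2_small, n, 1))
print(r2_big, adjusted_r2(r2_big, n, 6))
```

On the training data the six-column fit can never score a lower plain R2 than the one-column fit, while the adjusted score applies a penalty that grows with `k`.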

### ***POWER OF R2***
---

In [48]:
# Column 4 is total_bill: the four encoded columns come first, then the passthrough columns
X_price_train = X_train[:, 4].reshape(-1, 1)
X_price_test = X_test[:, 4].reshape(-1, 1)
X_price_test.shape, X_price_train.shape

((49, 1), (195, 1))

In [49]:
np.array(X_price_train)

array([[18.15],
       [16.  ],
       [12.48],
       [18.35],
       [19.65],
       [11.87],
       [24.01],
       [18.28],
       [18.64],
       [23.1 ],
       [15.42],
       [32.83],
       [11.02],
       [18.43],
       [29.93],
       [17.81],
       [20.27],
       [17.07],
       [16.04],
       [10.65],
       [34.81],
       [10.33],
       [24.08],
       [15.38],
       [26.59],
       [12.6 ],
       [28.55],
       [13.94],
       [23.17],
       [ 9.68],
       [ 7.25],
       [32.9 ],
       [18.69],
       [11.38],
       [31.27],
       [19.44],
       [12.43],
       [14.52],
       [21.5 ],
       [30.4 ],
       [24.27],
       [22.75],
       [21.7 ],
       [12.74],
       [17.89],
       [17.59],
       [ 8.77],
       [13.16],
       [40.55],
       [20.76],
       [17.51],
       [16.4 ],
       [16.31],
       [17.92],
       [ 8.52],
       [14.73],
       [13.27],
       [20.65],
       [25.29],
       [22.49],
       [32.68],
       [48.33],
       ...

In [50]:
model2 = LinearRegression()
model2.fit(X_price_train, Y_train)

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [51]:
Y_preds = model2.predict(X_price_test)

In [52]:
Y_preds

array([3.05257846, 4.12851736, 2.97919446, 1.76727933, 2.70940035,
       2.72666718, 1.46187239, 2.15362449, 4.94977063, 3.74001385,
       1.91404732, 2.21082084, 3.4259735 , 1.65720333, 3.43784444,
       2.81515964, 3.24035516, 2.77523011, 2.81192211, 1.93023497,
       2.54752389, 2.60256189, 2.725588  , 3.09898304, 2.94466081,
       3.07416199, 2.16441625, 1.8967805 , 2.49896095, 2.59932436,
       2.46658565, 1.9518185 , 3.77670585, 4.58069229, 3.00833222,
       3.87922761, 3.97419513, 2.13851602, 3.61590856, 3.12488328,
       2.20758331, 2.67486671, 4.27852289, 1.62374886, 1.87195944,
       3.77454749, 3.56194973, 3.04826175, 3.03099493])

In [54]:
r2_score(Y_test, Y_preds)

0.32399479738708825

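- On this test split, `total_bill` alone explains about 32.4% of the variance in `tip` (R2 ≈ 0.324), slightly more than the six-feature model above (R2 ≈ 0.293)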