## Regression
 * Searches for relationships among variables
 * Finds the function that maps the independent variables (inputs/features) to the dependent variable (output/response)
 * The dependent variable is usually continuous and unbounded

# Linear Regression

$y = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_r x_r + \varepsilon$, with $x_0 = 1$

$y$ - output

$\theta_0, \ldots, \theta_r$ - parameters

$x_0, \ldots, x_r$ - features/inputs

$\varepsilon$ - error
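
The model above can be sketched directly in NumPy. This is a minimal illustration that assumes the parameters are estimated by ordinary least squares; the names `X`, `theta`, and `residuals` are illustrative, and the data are the univariate example used later in this notebook.

```python
import numpy as np

# Univariate data used later in this notebook
x = np.array([5, 15, 25, 45, 37, 67, 55], dtype=float)
y = np.array([110, 220, 310, 560, 412, 732, 640], dtype=float)

# Design matrix: the first column is x0 = 1, so theta[0] acts as the intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of theta (minimises the sum of squared errors)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The model: y = X @ theta + error
y_hat = X @ theta
residuals = y - y_hat

print(theta)
```

Here `theta[0]` plays the role of the intercept and `theta[1]` the slope, so the result can be compared with `model.intercept_` and `model.coef_` in the scikit-learn cells.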

## Regression performance

* The coefficient of determination, denoted $R^2$, tells you what proportion of the variation in $y$ is explained by its dependence on $\mathbf{x}$ under the particular regression model
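
As a rough sketch, $R^2$ can be computed by hand as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total sum of squares around the mean of $y$. The example below assumes a straight-line fit obtained with `np.polyfit` and reuses the univariate data from this notebook; the variable names are illustrative.

```python
import numpy as np

x = np.array([5, 15, 25, 45, 37, 67, 55], dtype=float)
y = np.array([110, 220, 310, 560, 412, 732, 640], dtype=float)

# Fit a straight line by least squares (np.polyfit returns [slope, intercept])
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x

ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r_sq = 1 - ss_res / ss_tot

print(r_sq)
```

The value can be checked against `model.score(x, y)` in the univariate section.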

## Types of regressions

* Simple linear regression (univariate: one input feature)
* Multiple linear regression (several input features; called multivariate in this notebook)
* Polynomial regression (introduces powers and interaction terms, e.g. $x_1^2$, $x_1 x_2$, as in $f(x_1, x_2) = \theta_0 + \theta_1 x_1^2 + \theta_2 x_1 x_2$)
    * The model is still linear in the parameters, so it is solved as a linear problem in which $x_1^2$ is a distinct feature separate from $x_1$
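
The last point can be sketched as follows. The inputs and the parameter values (2, 3, 0.5) are made up for illustration and do not come from this notebook; the targets are noise-free, so a plain linear least-squares solve on the engineered columns should recover the parameters.

```python
import numpy as np

# Hypothetical inputs, chosen only for illustration
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Noise-free targets from f(x1, x2) = 2 + 3*x1^2 + 0.5*x1*x2 (made-up parameters)
y = 2.0 + 3.0 * x1**2 + 0.5 * x1 * x2

# Treat x1^2 and x1*x2 as ordinary features next to the constant column
X = np.column_stack([np.ones_like(x1), x1**2, x1 * x2])

# Ordinary linear least squares on the engineered features
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(theta, 6))
```

Nothing nonlinear is solved here: the nonlinearity lives entirely in how the feature columns are built.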

## Implementation in Python

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression

### Univariate

In [27]:
x = np.array([5, 15, 25, 45, 37, 67, 55]).reshape((-1,1))
y = np.array([110, 220, 310, 560, 412, 732, 640])

In [28]:
x.shape, y.shape

((7, 1), (7,))

In [29]:
model = LinearRegression().fit(x,y)

In [30]:
r_sq = model.score(x,y)
print(f'R square  (coefficient of determination): {r_sq}')

R square  (coefficient of determination): 0.9911353350742254


In [31]:
print(f'intercept (𝜃0): {model.intercept_}')
print(f'slope (𝜃1): {model.coef_}')

intercept (𝜃0): 59.802734374999886
slope (𝜃1): [10.30273438]


In [32]:
y_pred = model.predict(x)
y_pred

array([111.31640625, 214.34375   , 317.37109375, 523.42578125,
       441.00390625, 750.0859375 , 626.453125  ])

In [33]:
y_pred = model.intercept_ + model.coef_*x
y_pred

array([[111.31640625],
       [214.34375   ],
       [317.37109375],
       [523.42578125],
       [441.00390625],
       [750.0859375 ],
       [626.453125  ]])
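
Note that the manual formula returns a column of shape `(7, 1)`, whereas `model.predict(x)` returns a flat array of shape `(7,)`. The sketch below shows where the difference comes from, with the intercept and slope copied from the printed output above; the names `y_2d` and `y_1d` are illustrative.

```python
import numpy as np

x = np.array([5, 15, 25, 45, 37, 67, 55]).reshape((-1, 1))  # shape (7, 1)

# Values copied from the fitted model's printed output above
intercept = 59.802734375
coef = np.array([10.30273438])  # shape (1,)

y_2d = intercept + coef * x  # broadcasting keeps the column shape: (7, 1)
y_1d = intercept + x @ coef  # matrix-vector product gives shape (7,)

print(y_2d.shape, y_1d.shape)
```

Using `x @ coef` gives an array that can be compared element by element with the output of `predict`.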

In [34]:
y_pred_new = model.predict(np.arange(10).reshape(-1,1))
y_pred_new

array([ 59.80273437,  70.10546875,  80.40820312,  90.7109375 ,
       101.01367187, 111.31640625, 121.61914062, 131.921875  ,
       142.22460937, 152.52734375])

### Multivariate Linear Regression

In [61]:
x = np.array([[5,45,78], [15,37,38], [25,47,29], [45,34,89], [37,26,18], [67,17,49], [55,16,18]])
y = np.array([110, 220, 310, 560, 412, 732, 640])

In [62]:
model = LinearRegression().fit(x,y)

r_sq = model.score(x,y)
print(f'R square  (coefficient of determination): {r_sq}')

R square  (coefficient of determination): 0.993871363946136


In [63]:
print(f'intercept (𝜃0): {model.intercept_}')
print(f'coefficients (𝜃): {model.coef_}')

intercept (𝜃0): 32.22988777391873
coefficients (𝜃): [10.45292976  0.09564294  0.42124935]


In [64]:
y_pred = model.predict(x)
y_pred

array([121.65591833, 208.57009829, 310.2645811 , 543.35477916,
       429.05749357, 754.84332975, 616.2537998 ])

In [45]:
y_pred = model.intercept_ + np.sum(model.coef_*x,axis =1)
y_pred

array([121.65591833, 208.57009829, 310.2645811 , 543.35477916,
       429.05749357, 754.84332975, 616.2537998 ])

### Polynomial Regression

In [46]:
from sklearn.preprocessing import PolynomialFeatures

In [47]:
x = np.array([5, 15, 25, 45, 37, 67, 55]).reshape((-1,1))
y = np.array([110, 220, 310, 560, 412, 732, 640])

#### include_bias = False in Polynomial feature transformer

In [58]:
transformer = PolynomialFeatures(degree=2, include_bias=False)
transformer.fit(x)
x_ = transformer.transform(x)
print(x_)

model = LinearRegression().fit(x_,y)

r_sq = model.score(x_,y)
print(f'R square  (coefficient of determination): {r_sq}')

print(f'intercept (𝜃0): {model.intercept_}')
print(f'coefficients (𝜃): {model.coef_}')

[[   5.   25.]
 [  15.  225.]
 [  25.  625.]
 [  45. 2025.]
 [  37. 1369.]
 [  67. 4489.]
 [  55. 3025.]]
R square  (coefficient of determination): 0.9914155330813881
intercept (𝜃0): 51.645334294410304
coefficients (𝜃): [ 1.09827006e+01 -9.52302405e-03]


#### include_bias = True in Polynomial feature transformer

In [57]:
transformer = PolynomialFeatures(degree=2, include_bias=True)
transformer.fit(x)
x_ = transformer.transform(x)
print(x_)

model = LinearRegression().fit(x_,y)

r_sq = model.score(x_,y)
print(f'R square  (coefficient of determination): {r_sq}')

print(f'intercept (𝜃0): {model.intercept_}')
print(f'coefficients (𝜃): {model.coef_}')

[[1.000e+00 5.000e+00 2.500e+01]
 [1.000e+00 1.500e+01 2.250e+02]
 [1.000e+00 2.500e+01 6.250e+02]
 [1.000e+00 4.500e+01 2.025e+03]
 [1.000e+00 3.700e+01 1.369e+03]
 [1.000e+00 6.700e+01 4.489e+03]
 [1.000e+00 5.500e+01 3.025e+03]]
R square  (coefficient of determination): 0.991415533081388
intercept (𝜃0): 51.64533429440519
coefficients (𝜃): [ 0.00000000e+00  1.09827006e+01 -9.52302405e-03]
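
With `include_bias=True` the transformer adds a column of ones, and `LinearRegression` also fits its own intercept by default, which is consistent with the coefficient of 0 reported for that column above. A possible alternative, sketched below, is to pass `fit_intercept=False` so that the ones column carries the intercept; the names `with_intercept` and `no_intercept` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([5, 15, 25, 45, 37, 67, 55]).reshape((-1, 1))
y = np.array([110, 220, 310, 560, 412, 732, 640])

# Columns: [1, x, x^2]; the leading ones column is the bias feature
x_ = PolynomialFeatures(degree=2, include_bias=True).fit_transform(x)

# Default: the model fits its own intercept in addition to the ones column
with_intercept = LinearRegression().fit(x_, y)

# Alternative: no separate intercept, so the ones column carries it
no_intercept = LinearRegression(fit_intercept=False).fit(x_, y)

print(with_intercept.intercept_, with_intercept.coef_)
print(no_intercept.intercept_, no_intercept.coef_)
```

In the second model the first entry of `coef_` should take the value that the first model reports as `intercept_`, while the remaining coefficients stay the same.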


# Advanced Linear Regression with statsmodels

In [66]:
import statsmodels.api as sm

In [67]:
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)

In [70]:
x = sm.add_constant(x)  # prepends a column of ones (the bias/intercept term)
x

array([[ 1.,  0.,  1.],
       [ 1.,  5.,  1.],
       [ 1., 15.,  2.],
       [ 1., 25.,  5.],
       [ 1., 35., 11.],
       [ 1., 45., 15.],
       [ 1., 55., 34.],
       [ 1., 60., 35.]])
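
`sm.OLS` does not add an intercept on its own, which is why the column of ones is needed. A rough NumPy equivalent of what `sm.add_constant` does here (a sketch, not the statsmodels implementation) is:

```python
import numpy as np

x = np.array([[0, 1], [5, 1], [15, 2], [25, 5],
              [35, 11], [45, 15], [55, 34], [60, 35]])

# Prepend a column of ones so the first coefficient plays the role of the intercept
x_const = np.column_stack([np.ones(len(x)), x])

print(x_const)
```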

In [77]:
model = sm.OLS(y,x)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.862
Model:                            OLS   Adj. R-squared:                  0.806
Method:                 Least Squares   F-statistic:                     15.56
Date:                Sun, 08 Aug 2021   Prob (F-statistic):            0.00713
Time:                        12:43:04   Log-Likelihood:                -24.316
No. Observations:                   8   AIC:                             54.63
Df Residuals:                       5   BIC:                             54.87
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.5226      4.431      1.246      0.2



In [79]:
results.fittedvalues, y

(array([ 5.77760476,  8.012953  , 12.73867497, 17.9744479 , 23.97529728,
        29.4660957 , 38.78227633, 41.27265006]),
 array([ 4,  5, 20, 14, 32, 22, 38, 43]))

In [81]:
results.predict(sm.add_constant(np.arange(10).reshape((-1,2)))), np.arange(10).reshape((-1,2))

(array([ 5.77760476,  7.18179502,  8.58598528,  9.99017554, 11.3943658 ]),
 array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]))

# Interpreting Linear Regression results using the statsmodels API

In [82]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas

In [83]:
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()

Unnamed: 0,Lottery,Literacy,Wealth,Region
0,41,37,73,E
1,38,51,22,N
2,66,13,61,C
3,80,46,76,E
4,79,69,83,E


In [86]:
len(df)

85

In [90]:
df.Region.value_counts()

C    17
S    17
W    17
N    17
E    17
Name: Region, dtype: int64

## Terms in summary:

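* Df Residuals: $n - k - 1$ (number of observations minus number of predictor variables minus 1)
* Covariance Type: how the covariance of the coefficient estimates is computed; `nonrobust` is the default and assumes homoscedastic errors
* R-squared: the most important term; the proportion of the variation in the dependent variable that the model explains
    * Property of linear regression: adding more variables never reduces R-squared on the data used for fitting; it stays the same or increases
* Adjusted R-squared: penalizes R-squared for variables that do not contribute to the fit
* F-statistic: checks the statistical significance of the entire group of variables by testing the null hypothesis that all coefficients other than the intercept are zero
* Log-Likelihood: the log of the likelihood of the data under the fitted model; higher is better, and AIC and BIC are computed from it
* AIC and BIC: used for feature selection; lower values indicate a better trade-off between fit and model complexity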
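
Several of these terms can be recomputed from the numbers in the first summary table above (8 observations, 2 predictors). This is a sketch using the standard OLS formulas; the inputs are the rounded values printed in that table, so the results agree with it only up to rounding.

```python
import math

# Values read from the first summary table above
n = 8              # No. Observations
k = 2              # Df Model (predictors, intercept not counted)
r_sq = 0.862       # R-squared (rounded in the table)
log_lik = -24.316  # Log-Likelihood

# Df Residuals = n - k - 1
df_resid = n - k - 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_sq = 1 - (1 - r_sq) * (n - 1) / df_resid

# F-statistic: explained variation per predictor over unexplained variation per residual df
f_stat = (r_sq / k) / ((1 - r_sq) / df_resid)

# AIC and BIC count every estimated coefficient, including the intercept
n_params = k + 1
aic = 2 * n_params - 2 * log_lik
bic = n_params * math.log(n) - 2 * log_lik

print(df_resid, round(adj_r_sq, 3), round(f_stat, 2), round(aic, 2), round(bic, 2))
```

AIC and BIC come out at the table's 54.63 and 54.87, while adjusted R-squared and the F-statistic differ slightly in the last digits because the R-squared input is rounded to three decimals.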


In [94]:
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
# formula = 'dependent_variable ~ combination of independent variables'
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Sun, 08 Aug 2021   Prob (F-statistic):           1.07e-05
Time:                        12:58:19   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      

In [85]:
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)

Intercept         38.651655
C(Region)[T.E]   -15.427785
C(Region)[T.N]   -10.016961
C(Region)[T.S]    -4.548257
C(Region)[T.W]   -10.091276
Literacy          -0.185819
Wealth             0.451475
dtype: float64
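
The formula interface encodes `Region` as dummy (treatment-coded) variables, with the first level, `C`, as the reference; that is why only four region coefficients appear and why `Df Model` is 6 (two numeric variables plus four dummies). The sketch below rebuilds a prediction by hand from the parameters printed above; `predict_lottery` is a hypothetical helper, not part of statsmodels.

```python
# Parameters copied from res.params above
params = {
    "Intercept": 38.651655,
    "C(Region)[T.E]": -15.427785,
    "C(Region)[T.N]": -10.016961,
    "C(Region)[T.S]": -4.548257,
    "C(Region)[T.W]": -10.091276,
    "Literacy": -0.185819,
    "Wealth": 0.451475,
}


def predict_lottery(literacy, wealth, region):
    """Manual prediction; region 'C' is the reference level and adds nothing."""
    region_effect = params.get(f"C(Region)[T.{region}]", 0.0)
    return (params["Intercept"] + region_effect
            + params["Literacy"] * literacy
            + params["Wealth"] * wealth)


# First row of df.head(): Literacy=37, Wealth=73, Region='E'
print(predict_lottery(37, 73, "E"))
```

Each region coefficient is therefore an offset relative to region `C`, with `Literacy` and `Wealth` held fixed.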
