## Section 3: Linear Regression with StatsModels

## Simple Linear Regression

### Import relevant libraries

In [1]:
%load_ext lab_black
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

sns.set()

### Load the data

In [2]:
data = pd.read_csv("reference/1.01. Simple linear regression.csv")

In [3]:
data

Unnamed: 0,SAT,GPA
0,1714,2.40
1,1664,2.52
2,1760,2.54
3,1685,2.74
4,1693,2.83
...,...,...
79,1936,3.71
80,1810,3.71
81,1987,3.73
82,1962,3.76


In [4]:
data.describe()

Unnamed: 0,SAT,GPA
count,84.0,84.0
mean,1845.27381,3.330238
std,104.530661,0.271617
min,1634.0,2.4
25%,1772.0,3.19
50%,1846.0,3.38
75%,1934.0,3.5025
max,2050.0,3.81


### Create your first regression

#### Define the dependent and the independent variables

In [5]:
y = data["GPA"]
x1 = data["SAT"]

#### Explore the data

In [6]:
plt.scatter(x1, y)
plt.xlabel("SAT", fontsize=20)
plt.ylabel("GPA", fontsize=20)
plt.show()

#### Regression itself

In [7]:
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
results.summary()

Dep. Variable:,GPA,R-squared:,0.406
Model:,OLS,Adj. R-squared:,0.399
Method:,Least Squares,F-statistic:,56.05
Date:,"Sat, 01 Aug 2020",Prob (F-statistic):,7.2e-11
Time:,00:03:04,Log-Likelihood:,12.672
No. Observations:,84,AIC:,-21.34
Df Residuals:,82,BIC:,-16.48
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.2750,0.409,0.673,0.503,-0.538,1.088
SAT,0.0017,0.000,7.487,0.000,0.001,0.002

Omnibus:,12.839,Durbin-Watson:,0.95
Prob(Omnibus):,0.002,Jarque-Bera (JB):,16.155
Skew:,-0.722,Prob(JB):,0.00031
Kurtosis:,4.59,Cond. No.,32900.0


In [30]:
plt.scatter(x1, y)
plt.xlabel("SAT", fontsize=20)
plt.ylabel("GPA", fontsize=20)

# Rounded coefficients from the summary table: SAT (slope) = 0.0017, const (intercept) = 0.2750
yhat = 0.0017 * x1 + 0.2750
fig = plt.plot(x1, yhat, lw=4, c="orange", label="regression line")

plt.show()

In [9]:
0.0017 * (1850)

3.145
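The expression above multiplies only the slope by an SAT score of 1850. A full prediction from the fitted line also adds the intercept. Here is a minimal sketch using the rounded coefficients from the summary table; the helper name `predict_gpa` is made up for illustration and is not part of the original notebook.

```python
# Rounded coefficients copied from the regression summary table.
const = 0.2750  # intercept
slope = 0.0017  # coefficient on SAT


def predict_gpa(sat):
    """Predicted GPA from the fitted line: const + slope * SAT."""
    return const + slope * sat


predicted = predict_gpa(1850)
print(round(predicted, 3))
```

Because the coefficients are rounded to four decimals, this will differ slightly from what the fitted model itself would predict.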

Notes:
OLS - Ordinary Least Squares
* minimizes SSE    
* lower error => better explanatory power
* lowest error => best explanatory power
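To make "minimizes SSE" concrete, here is a small sketch on a hypothetical toy dataset (not the SAT/GPA file). It computes the closed-form OLS slope and intercept for a single predictor, then checks that nudging the slope in either direction increases the SSE. The variable and function names are illustrative only.

```python
# Hypothetical toy data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def sse(slope, intercept):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Closed-form OLS estimates for simple (one-predictor) regression.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

best = sse(slope, intercept)
print(slope, intercept, best)

# Any other line through the same data has a larger SSE.
print(sse(slope + 0.1, intercept) > best)
print(sse(slope - 0.1, intercept) > best)
```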

Other regression methods:    
* Generalized least squares
* Maximum likelihood estimation
* Bayesian regression
* Kernel regression
* Gaussian process regression

In [10]:
945 / 1245

0.7590361445783133

### SST (TSS)
SST - Sum of Squares Total 
It is a measure of the total variability of the dataset.
#### $\sum_{i=1}^n (y_{i} - \overline y)^2$

In [21]:
# Use a loop name other than y so it is not confused with the dependent variable y
SST = sum([(gpa - data["GPA"].mean()) ** 2 for gpa in data["GPA"]])
f"SST = {SST}"

'SST = 6.1233952380952354'

Another notation is TSS, or total sum of squares.

In [12]:
results.centered_tss

6.123395238095238

### SSR (ESS)
SSR - Sum of Squares due to Regression 
It is the sum of the squared differences between the predicted values and the mean of the dependent variable.   
Think of it as a measure that describes how well our line fits the data.  
It measures the variability explained by the line.
#### $\sum_{i=1}^n (\hat y_{i} - \overline y)^2$

In [13]:
SSR = sum([(yhat - data["GPA"].mean()) ** 2 for yhat in results.fittedvalues])
f"SSR = {SSR}"

'SSR = 2.486122438514722'

Another common notation is ESS or explained sum of squares

In [14]:
results.ess

2.4861224385147356

### SSE (RSS)
SSE - Sum of Squares Error 
It is the sum of the squared differences between the observed values and the predicted values.
The smaller the error, the better the estimation power of the regression.
It measures the variability left unexplained by the regression.
#### $\sum_{i=1}^n e_i^2$

In [15]:
SSE = sum((data["GPA"] - results.fittedvalues) ** 2)
f"SSE = {SSE}"

'SSE = 3.637272799580503'

In [16]:
# SSE = SST - SSR
f"SSE = {SST - SSR}"

'SSE = 3.6372727995805136'

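It is also known as RSS or residual sum of squares.    
Residual as in: remaining or unexplained.     
The notation gets confusing because some people denote it as SSR, which makes it unclear whether they mean the Sum of Squares due to Regression or the Sum of Squared Residuals.    
statsmodels follows the second convention: `results.ssr` is the sum of squared residuals (our SSE), while `results.ess` is the explained sum of squares (our SSR).    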

In [17]:
results.ssr

3.6372727995805025
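The three quantities are tied together by the decomposition SST = SSR + SSE, which holds for an OLS fit that includes an intercept. The sketch below checks this on a hypothetical toy dataset (not the SAT/GPA file), fitting the line with the closed-form formulas for a single predictor. All names are illustrative.

```python
# Hypothetical toy data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Closed-form OLS fit with an intercept.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean
fitted = [intercept + slope * x for x in xs]

sst = sum((y - y_mean) ** 2 for y in ys)  # total variability
ssr = sum((f - y_mean) ** 2 for f in fitted)  # explained by the line
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # left unexplained

print(sst, ssr, sse)
print(abs(sst - (ssr + sse)) < 1e-9)
```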

### $R^2 = \frac{SSR}{SST}$    

R-squared is a relative measure. For an OLS model with an intercept, it takes values ranging from 0 to 1.    
An R-squared of zero means our regression line explains none of the variability of the data.    
An R-squared of one means our regression line explains the entire variability.

In [18]:
f"R^2 = {SSR/SST}"

'R^2 = 0.4060039147967956'

In [19]:
results.rsquared

0.40600391479679765
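Since SST = SSR + SSE for this model, R-squared can equivalently be written as 1 - SSE/SST. The short check below plugs in the sums of squares printed earlier in this notebook. The variable names are illustrative, and the values are copied from the outputs above rather than recomputed from the data.

```python
# Values copied from the notebook outputs above.
SST = 6.1233952380952354
SSR = 2.486122438514722
SSE = 3.637272799580503

r2_explained = SSR / SST  # share of variability explained
r2_residual = 1 - SSE / SST  # one minus the unexplained share

print(r2_explained)
print(r2_residual)
print(abs(r2_explained - r2_residual) < 1e-9)
```

The two forms agree up to floating-point rounding, and both match `results.rsquared` to the precision shown.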