## OLS Regression

OLS regression stands for **"Ordinary Least Squares"** regression, which is a type of statistical analysis that helps us understand how one variable changes along with another.

Imagine you have a set of data that contains two variables, like the amount of **time** someone **studies** and the **grade** they get on a test. OLS regression helps us figure out if there's a **relationship** between these two variables by **drawing a line** through the data that best fits the pattern.

This line is called the **"regression line,"** and its slope tells us how much the grade on the test changes, on average, for every extra hour someone studies. OLS regression calculates this line by finding the one that minimizes the sum of the squared vertical distances (the residuals) between each data point and the line.

So, in short, OLS regression is a way to figure out **if two things are related**, and **how much** they're related by drawing a line that best fits the data.
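To make the idea of a best-fitting line concrete, here is a minimal sketch of the closed-form OLS solution for a single predictor. The `hours` and `grades` values are made up for illustration and are not part of this notebook's data; only NumPy is assumed.

```python
import numpy as np

# Hypothetical toy data: hours studied and the resulting test grade
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grades = np.array([52.0, 58.0, 61.0, 70.0, 74.0])

# Closed-form OLS estimates for one predictor:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_dev = hours - hours.mean()
y_dev = grades - grades.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = grades.mean() - slope * hours.mean()

# The quantity OLS minimizes: the sum of squared residuals
residuals = grades - (intercept + slope * hours)
sse = np.sum(residuals ** 2)

print(f"grade = {intercept:.2f} + {slope:.2f} * hours (SSE = {sse:.2f})")
```

Any other choice of slope and intercept gives a larger sum of squared residuals on this data, which is what "least squares" refers to.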

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns  # https://seaborn.pydata.org/tutorial.html
import statsmodels.api as sm
from sklearn import linear_model

## Seaborn has datasets too

In [2]:
# Conveniently, it returns a pandas dataframe
iris = sns.load_dataset("iris")
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:
X = iris[["petal_length"]]  # predictor
y = iris["petal_width"]  # response

X.shape, y.shape

((150, 1), (150,))

## Statsmodels gives R-like statistical output

In [12]:
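# OLS linear regression
# statsmodels takes (y, X), the reverse of scikit-learn's fit(X, y).
# sm.OLS does not add an intercept on its own; it only uses the columns present in X.
model = sm.OLS(y, X)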

results = model.fit()

print("\npetal_width vs:\npetal_length\n", results.summary())  # Looks better, maybe
# results.summary()  # But I want the formatting


petal_width vs:
petal_length
                             OLS Regression Results                            
Dep. Variable:            petal_width   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     629.8
Date:                Wed, 03 May 2023   Prob (F-statistic):           1.54e-90
Time:                        10:18:53   Log-Likelihood:                 46.705
No. Observations:                 150   AIC:                            -83.41
Df Residuals:                     145   BIC:                            -68.36
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            

## R-squared (uncentered)

"R-squared (uncentered)" refers to a variant of the R-squared statistic that is calculated using the **uncentered total sum of squares** instead of the **centered total sum of squares.** 

The R-squared statistic is a measure of **how much of the variability** in the response variable (i.e., the dependent variable) is explained by the regression model. For a model that includes an intercept, it is a number between 0 and 1, with **higher values** indicating a closer fit between the model and the data.

The **standard R-squared statistic** (also known as the "centered" R-squared) is calculated using the centered total sum of squares, which is the sum of squared deviations of the observed response values from their mean.

This means that the R-squared value represents the proportion of the total variability in the response variable that is explained by the regression model after accounting for the mean.

In contrast, the **uncentered R-squared** uses the uncentered total sum of squares, which is the sum of squared deviations of the observed response values from 0 (i.e., **without subtracting the mean**).

This means that the uncentered R-squared value represents the proportion of the total variability in the response variable that is explained by the regression model without accounting for the mean.

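statsmodels reports the uncentered version, labeled "R-squared (uncentered)", when `X` contains no constant column, because a model without an intercept is forced through the origin rather than compared against the mean. The centered R-squared is the one more commonly used, and the two are not directly comparable: the uncentered total sum of squares is never smaller than the centered one, so for the same residuals the uncentered value is never lower. In general, it is important to interpret R-squared in the context of the specific dataset and research question at hand.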
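The two definitions can be compared numerically. The sketch below uses NumPy only and made-up data (not the iris measurements): it fits a line through the origin, then computes R-squared with both the centered and the uncentered total sum of squares. The variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical data whose response sits far from zero
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.2, 10.1, 10.4, 10.3, 10.5])

# OLS through the origin (no intercept): beta = sum(x*y) / sum(x*x)
beta = np.sum(x * y) / np.sum(x * x)
ssr = np.sum((y - beta * x) ** 2)  # sum of squared residuals

tss_centered = np.sum((y - y.mean()) ** 2)  # deviations from the mean
tss_uncentered = np.sum(y ** 2)             # deviations from zero

r2_centered = 1 - ssr / tss_centered
r2_uncentered = 1 - ssr / tss_uncentered

print(f"centered R^2:   {r2_centered:.3f}")
print(f"uncentered R^2: {r2_uncentered:.3f}")
```

On data like this the uncentered value looks respectable while the centered value is strongly negative, because a line forced through the origin fits worse than simply predicting the mean. This is why the two numbers should not be compared with each other.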


## Where is the intercept info?

In the code below, the intercept is handled by the **last column of the `X`** matrix, which is generated by the `np.vander()` function call.

Called as **`np.vander(X, 2)`**, the function returns a two-column matrix of decreasing powers of the input: `X**1` followed by `X**0`, which is a **column of ones**.

The **first column** of this new matrix is the original `X` variable (i.e., **petal length** in this case).

The **second column**, the column of ones, represents the **intercept.**

After passing `y` and `X` to `sm.OLS()` and fitting the model, the summary printed by `print(results.summary())` includes **information** about the intercept: its estimated coefficient, standard error, t-value, and p-value.

You can find the intercept estimate under the **"coef"** column of the "OLS Regression Results" table, in the row labeled **"const"**. The slope for petal length is in the row labeled **"x1"**.
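The column order produced by `np.vander()` is easy to check on a tiny array. This is a small sketch using three sample values; the `increasing=True` call is shown only for contrast and is not used elsewhere in this notebook.

```python
import numpy as np

x = np.array([1.4, 1.3, 1.5])

# Default: decreasing powers, so the columns are [x**1, x**0]
X_default = np.vander(x, 2)

# increasing=True reverses the order, so the columns are [x**0, x**1]
X_increasing = np.vander(x, 2, increasing=True)

print(X_default)
print(X_increasing)
```

With the default ordering the ones column comes last, which is why the intercept appears after the slope in the fitted coefficients.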

In [5]:
X = iris["petal_length"]  # float64, shape (150,)

X = np.vander(X, 2)  # Shape (150, 2): columns are [petal_length, 1]; the ones column is the intercept

y = iris["petal_width"]  # float64, shape (150,)

model = sm.OLS(y, X)

results = model.fit()

# results.summary()
print("\npetal_width vs:\nx1, const\n", results.summary())


petal_width vs:
x1, const
                             OLS Regression Results                            
Dep. Variable:            petal_width   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.927
Method:                 Least Squares   F-statistic:                     1882.
Date:                Wed, 03 May 2023   Prob (F-statistic):           4.68e-86
Time:                        10:16:45   Log-Likelihood:                 24.796
No. Observations:                 150   AIC:                            -45.59
Df Residuals:                     148   BIC:                            -39.57
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.4158   

### petal_width = 0.4158 * petal_length - 0.36

## Multiple Linear regression

More than one predictor.

In [6]:
X = iris[["petal_length", "sepal_length"]]  # predictors
y = iris["petal_width"]

In [7]:
X = iris[["petal_length", "sepal_length"]]
X = sm.add_constant(X)  # another way to add a constant column for an intercept

y = iris["petal_width"]

model = sm.OLS(y, X)  # note the argument order: y first, then X
results = model.fit()

# results.summary()
print("\npetal_width vs:\nconst, petal_length, sepal_length\n", results.summary())


petal_width vs:
const, petal_length, sepal_length
                             OLS Regression Results                            
Dep. Variable:            petal_width   R-squared:                       0.929
Model:                            OLS   Adj. R-squared:                  0.928
Method:                 Least Squares   F-statistic:                     962.1
Date:                Wed, 03 May 2023   Prob (F-statistic):           3.60e-85
Time:                        10:16:45   Log-Likelihood:                 26.792
No. Observations:                 150   AIC:                            -47.58
Df Residuals:                     147   BIC:                            -38.55
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------

## Use categorical variables

In [8]:
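# One-hot encode species: one 0/1 indicator column per category
# dtype=int keeps the columns as 0/1 (newer pandas versions default to booleans)
dummies = pd.get_dummies(iris["species"], dtype=int)

# Add the indicator columns to the original dataframe
iris = pd.concat([iris, dummies], axis=1)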

iris.head()

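| | sepal_length | sepal_width | petal_length | petal_width | species | setosa | versicolor | virginica |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | 0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |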


## AIC ⚡️ BIC

### You would be inclined to choose the model that had the lower AIC or BIC value.

**AIC:** "Akaike Information Criterion"

**BIC:** "Bayesian Information Criterion"

They are both statistical measures that help in model selection when comparing multiple regression models.

Both AIC and BIC provide a way to **compare the goodness-of-fit** of different models that have different numbers of parameters.

These measures take into account both the **quality of the fit** of the model to the data (i.e., **how well** it explains the data), as well as the **complexity** of the model (i.e., **how many** parameters it has). 

AIC and BIC are calculated based on the **log-likelihood function** of the model and the number of parameters in the model.

A **lower value** of AIC or BIC indicates a **better trade-off** between fit and complexity, and among models fitted to the same data the one with the lowest AIC or BIC is typically preferred. Only the differences between models matter; the absolute values (which can be negative, as in this notebook) have no meaning on their own.

In the statsmodels summary table, AIC and BIC are listed in the right-hand column of the top block, just below the **log-likelihood**. They can be used to compare different models fitted to the same data and to decide which one is the most appropriate for that dataset.
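Both criteria can be reproduced by hand from the log-likelihood. The sketch below uses the standard definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), with the log-likelihood values printed in this notebook's three model summaries. The parameter counts `k` are taken as "Df Model" plus one for the intercept, and the function and dictionary names are illustrative assumptions.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

n_obs = 150  # "No. Observations" in every summary

# model label -> (log-likelihood, number of estimated parameters)
models = {
    "petal_length": (24.796, 2),
    "petal_length + sepal_length": (26.792, 3),
    "petal_length + sepal_length + species": (46.705, 5),
}

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs))
    for name, (ll, k) in models.items()
}

for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.2f}, BIC={b:.2f}")

best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

Note how the BIC penalty, `k * ln(n)`, grows faster than the AIC penalty, `2k`, once `n` exceeds about 7, so BIC favors smaller models more strongly.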

In [9]:
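# Note: the three species dummies always sum to 1, which duplicates the constant
# column, so only five of the six columns are linearly independent
# (hence "Df Model: 4" in the summary: five independent columns minus the constant).
X = iris[["petal_length", "sepal_length", "setosa", "versicolor", "virginica"]]
X = sm.add_constant(X)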

y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()

print("\npetal_width vs:\nconst, petal_length, sepal_length, setosa...?\n", results.summary())
# results.summary()


petal_width vs:
const, petal_length, sepal_length, setosa...?
                             OLS Regression Results                            
Dep. Variable:            petal_width   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     629.8
Date:                Wed, 03 May 2023   Prob (F-statistic):           1.54e-90
Time:                        10:16:45   Log-Likelihood:                 46.705
No. Observations:                 150   AIC:                            -83.41
Df Residuals:                     145   BIC:                            -68.36
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------

In [11]:
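# Fit the linear model using sklearn (argument order here is X first, then y)
# X still contains the "const" column from sm.add_constant(); LinearRegression
# fits its own intercept, so that column's coefficient comes out as 0.
model = linear_model.LinearRegression()

results = model.fit(X, y)  # fit() returns the fitted estimator itself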

# Print the coefficients
print("\nintercept:\n", results.intercept_)
print("\ncoef:\n", results.coef_)


intercept:
 0.3376683161818015

coef:
 [ 0.          0.23192122 -0.00169337 -0.42226013  0.01039913  0.411861  ]


## Results

When you fit a linear regression model with scikit-learn's `model.fit()`, the method returns the fitted estimator itself, so `results` here is the same object as `model`. It exposes several attributes that describe the fitted model.

If you print `results.intercept_`, you will see the estimated intercept of the linear regression model. The intercept is the predicted value of the dependent variable when all independent variables are equal to zero. For a model with a single predictor, it is the point where the regression line crosses the y-axis.

If you print `results.coef_`, you will see the estimated coefficients of the independent variables, in the same order as the columns of `X`. Each coefficient represents the change in the dependent variable associated with a one-unit change in the corresponding independent variable, while holding all other variables constant. In the output above, the first coefficient is 0 because `X` still contains the constant column added by `sm.add_constant()`, which is redundant with the intercept that `LinearRegression` fits by itself.

For example, if you fit a linear regression model with two independent variables, `x1` and `x2`, and print `results.intercept_` and `results.coef_`, you will see the estimated intercept and coefficients for the model. The intercept will represent the predicted value of the dependent variable when both `x1` and `x2` are equal to zero. The coefficients will represent the estimated change in the dependent variable associated with a one-unit increase in `x1` or `x2`, while holding the other variable constant.
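The two-variable case can be sketched on synthetic data where the true intercept and coefficients are known. To stay self-contained, this example uses NumPy's `np.linalg.lstsq` in place of scikit-learn. The data, the variable names, and the `predict` helper are assumptions made for illustration.

```python
import numpy as np

# Synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix with an explicit column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef = beta[0], beta[1:]

def predict(a, b):
    """Predicted y for x1 = a and x2 = b (hypothetical helper)."""
    return intercept + coef[0] * a + coef[1] * b

# Raising x1 by one unit while holding x2 fixed changes the prediction by coef[0]
effect_of_x1 = predict(4.0, 7.0) - predict(3.0, 7.0)

print("intercept:", intercept)
print("coef:", coef)
print("effect of +1 in x1:", effect_of_x1)
```

Because the data contain no noise, the fit recovers the generating values, and `predict(0, 0)` equals the intercept, matching the interpretation described above.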