### Contents

1. Linear regression
2. Linear regression with sklearn
3. Evaluation metrics
4. Visualization

# Introduction to Regression

Regression, in the context of statistics and machine learning, refers to a set of techniques used to model the relationship between a dependent variable (target) and one or more independent variables (features).

The goal of regression analysis is to understand the nature of the relationship and make predictions based on the given data. It is a supervised learning approach, where the algorithm learns from labeled training data to predict the outcome for new, unseen data.


Regression problems can be broadly categorized into two types:

* **Linear Regression**: Assumes a linear relationship between the independent variables and the dependent variable.
* **Nonlinear Regression**: Assumes a nonlinear relationship between the variables. Example: polynomial regression.

In both cases, the term “regression” is used because the goal is to model or predict a continuous outcome, as opposed to classification, where the goal is to predict a categorical outcome.



In **Simple Linear Regression** (one independent variable), the term “best fit line” refers to a straight line.

**Notation for the Linear Regression model**.


We will have a response variable $\textbf{y}$ that depends linearly on several covariates $\textbf{x}_i$:


$$ \textbf{y}  =  a_1 \textbf{x}_1  + \dots + a_m \textbf{x}_{m} $$

The $a_i$ terms will be the *parameters* of the model or *coefficients*.

If we write it in matrix form:

$$ \textbf{y}  = X \textbf{w}$$

Where $$ \textbf{y} = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right), 
 X = \left( \begin{array}{ccc} x_{11} & \dots & x_{1m} \\ x_{21} & \dots & x_{2m}\\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nm} \end{array} \right),
 \textbf{w} = \left( \begin{array}{c} a_1 \\ a_2 \\ \vdots \\ a_m \end{array} \right) $$
 
In a **simple linear regression** model that only depends on one variable, we will have:

$$ \textbf{y}  =  a_0+ a_1 \textbf{x}_1 $$

Here the parameter $a_0$ is called the constant term or intercept: the point where the line crosses the ordinate (y) axis.

If we have a **multivariate/multiple linear regression**, we will have:
$$ \textbf{y} = a_1 \textbf{x}_1 + \dots + a_m \textbf{x}_m = X \textbf{w} $$
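
As a minimal sketch of the matrix form above, the following snippet evaluates $\textbf{y} = X\textbf{w}$ with NumPy. The numbers in `X_demo` and `w_demo` are made up for illustration and are not taken from any dataset in this notebook. It also shows one common convention (an assumption here, not something stated above): the intercept $a_0$ can be folded into $\textbf{w}$ by prepending a column of ones to $X$.

```python
import numpy as np

# Hypothetical design matrix: n = 3 observations, m = 2 features
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Hypothetical coefficients a_1, a_2
w_demo = np.array([0.5, -1.0])

# Matrix form of the model: each row of X gives one prediction
y_demo = X_demo @ w_demo

# To include an intercept a_0, prepend a column of ones to X
a0_demo = 2.0
X_with_ones = np.column_stack([np.ones(len(X_demo)), X_demo])
w_with_intercept = np.concatenate([[a0_demo], w_demo])
y_with_intercept = X_with_ones @ w_with_intercept

print(y_demo)
print(y_with_intercept)
```

With this convention, the simple and the multiple regression models share the same expression $X\textbf{w}$.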


# 1. Simple Linear Regression

Let's take a look at some examples in graphic form.

We want to predict the weight of a person, using height as the independent variable. To do so, we will create a Simple Linear Regression model.

### Height - weight  dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

import warnings
warnings.filterwarnings("ignore")

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("Data/height-weight-regression.csv")
df.head()

In [None]:
df.drop(columns=["Index"], inplace=True)

In [None]:
df

<div style="background-color: #f0f0f0; padding: 25px; border-radius: 5px; margin-top: 25px;">

Plot the data
    
</div>

In [None]:
## matplotlib and seaborn for visualizations (seaborn is based on matplotlib)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2,'font.family': [u'times']})

plt.rc('font', size=12) 
plt.rc('figure', figsize = (12, 5))

In [None]:
n_samples = 200

x =  df.Height_Inches[:n_samples]
y = df.Weight_Pounds[:n_samples]

plt.plot(x, y, "o", alpha=0.3, color='blue')  # alpha sets the opacity of the markers
# Add axis labels
plt.xlabel("Height (Inches)")
plt.ylabel("Weight (Pounds)")

### Let's try to find a regression model that fits these data points

$$ \textbf{y}  =  a_0+ a_1 \textbf{x}_1 $$

There is clearly some correlation between the two variables, which a model of this form can capture. For example, with hand-picked parameters:

In [None]:

# Setting Model Parameters. Here, a0 and a1 represent the intercept and slope of the line, respectively. 
a0 = 10
a1 = 1.5

# Creating the Model Predictions:
model=[a0 + a1*x for x in np.arange(64,75)] 

#Plotting the Data Points and Model:
plt.plot(x,y, "o", alpha=0.3, color='blue')
plt.plot(np.arange(64,75), model,'r')

# Drawing Error Lines for Each Point:
# zip(x, y) pairs the items of both iterables by position: (x[0], y[0]), (x[1], y[1]), etc.
for xi, yi in zip(x,y):
    plt.plot([xi]*2, [yi, a0 + a1*xi], "k:", linewidth=0.75) # plot each error as a black (k) dotted (":") vertical line

plt.xlabel("Height (Inches)")
plt.ylabel("Weight (Pounds)")
plt.show()

### **But which model best fits these data?** 

How do we find the parameters or coefficients of the following equation?

$$ \textbf{y}  =  a_0+ a_1 \textbf{x}_1 $$

In [None]:
plt.plot(x, y, "o", alpha=0.3, color='blue')

# Setting Model Parameters for 3 Models
model1=[0 + 2*x for x in np.arange(64,75)]
model2=[-220 + 5*x for x in np.arange(64,75)]
model3=[200 - 1*x for x in np.arange(64,75)]

# Let's plot the 3 models
plt.plot(np.arange(64,75), model1,'r', label='Model1')
plt.plot(np.arange(64,75), model2,'g', label='Model2')
plt.plot(np.arange(64,75), model3,'y', label='Model3')
plt.legend()

The objective will always be to **minimize** the sum of the squared distances between the real points ($y_j$) and the values predicted by the function ($\widehat{y}_j$):

$$\widehat{\textbf{y}} = a_0+a_1 \textbf{x}$$

If we have the data $(\textbf{x},\textbf{y})$, we want to minimize:

$$ ||a_0 + a_1 \textbf{x} -  \textbf{y} ||^2_2 = \sum_{j=1}^n (a_0+a_1 x_{j} -  y_j )^2,$$ 

This expression is known as **sum of squared errors of prediction (SSE)**.

The easiest way to find these two parameters is to use the OLS (*Ordinary Least Squares*) method, which chooses the $a_0$ and $a_1$ that minimize the SSE.
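
For a single feature, the SSE minimizer has a well-known closed form: $\widehat{a}_1 = \frac{\sum_j (x_j-\bar{x})(y_j-\bar{y})}{\sum_j (x_j-\bar{x})^2}$ and $\widehat{a}_0 = \bar{y} - \widehat{a}_1\bar{x}$. The sketch below applies these formulas to a small made-up sample; the values in `x_small` and `y_small` are hypothetical, not rows of the height-weight dataset, and the helper `sse_line` is introduced only for this illustration.

```python
import numpy as np

# Hypothetical sample (not from the height-weight dataset)
x_small = np.array([64.0, 66.0, 68.0, 70.0, 72.0])       # heights in inches
y_small = np.array([120.0, 128.0, 131.0, 142.0, 149.0])  # weights in pounds

def sse_line(a0, a1, x, y):
    """Sum of squared errors of the line a0 + a1 * x."""
    return np.sum((a0 + a1 * x - y) ** 2)

# Closed-form OLS estimates for one feature
x_mean, y_mean = x_small.mean(), y_small.mean()
a1_hat = np.sum((x_small - x_mean) * (y_small - y_mean)) / np.sum((x_small - x_mean) ** 2)
a0_hat = y_mean - a1_hat * x_mean

print("a0:", a0_hat, "a1:", a1_hat)
print("SSE at the OLS solution:", sse_line(a0_hat, a1_hat, x_small, y_small))
# Moving away from the OLS solution can only increase the SSE
print("SSE with a slightly larger slope:", sse_line(a0_hat, a1_hat + 0.1, x_small, y_small))
```

Any other pair of parameters, such as the perturbed slope in the last line, gives a larger SSE on the same data.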


Norm and Euclidean norm (article in Catalan): https://ca.wikipedia.org/wiki/Norma_(matem%C3%A0tiques)

###  1.1 Ordinary Least Squares (OLS)

Let's see an example:

**SciPy library**

SciPy is an open-source Python library used for scientific and technical computing. Built on top of the NumPy library, SciPy provides a wide range of mathematical, scientific, and engineering functionality. It is widely used for tasks that require high-performance numerical computations, including optimization, integration, interpolation, eigenvalue problems, algebraic equations, and statistics.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html


![Example Image](Figures/lambda.jpg)


In [None]:

#Choose the number of samples to run this example
n_samples = 500

x =  df.Height_Inches[:n_samples]
y = df.Weight_Pounds[:n_samples]


In [None]:
from scipy.optimize import fmin
from scipy.optimize import minimize


# Function to minimize
# lambda a, x, y: a short way to define a function without a def statement.
    # This lambda function takes three arguments: a (the parameters [a0, a1]), x, and y.
sse = lambda a, x, y: np.sum((y - a[0] - a[1]*x) ** 2) # sum of squared errors, the function that minimize() will minimize


# Initial guess for the parameters [intercept, slope]
initial_guess = [0, 1]

result_sse = minimize(sse, initial_guess, args=(x, y))

# Get the optimal parameters
intercept, slope = result_sse.x


print("Optimal a0 (intercept): ", intercept)
print("Optimal a1 (slope): ", slope)


print('\nSSE = ', np.sum((y - intercept - slope*x) ** 2))
print('\nFinal model:\n',round(intercept,2),'+',round(slope,2),'x')

In [None]:
result_sse.x

In [None]:
# Plot the data and the linear regression line
plt.plot(x, y, 'ro')  #'ro' red circles
plt.plot([min(x), max(x)], [intercept + slope * min(x), intercept + slope * max(x)], alpha=0.8, label="Regression line SSE")  # Line plot
plt.legend()
plt.xlabel("Height (Inches)")
plt.ylabel("Weight (Pounds)")
plt.title("Ordinary Least Squares (OLS) Regression")

for xi, yi in zip(x,y):
    plt.plot([xi]*2, [yi, intercept + slope*xi], "k:", linewidth=0.75) # show errors
plt.xlim(min(x)-1,  max(x)+1); plt.ylim( min(y)-1,  max(y)+1)


### We could also minimize other quantities, such as the **sum of the absolute values of the differences**. 

In [None]:

# Function to minimize: the sum of ABSOLUTE errors
sabs = lambda a, x, y: np.sum(np.abs(y - a[0] - a[1]*x)) ## the function to minimize has changed

# Initial guess for the parameters [intercept, slope]
initial_guess = [0, 1]

result_sabs = minimize(sabs, initial_guess, args=(x, y))

# Get the optimal parameters
intercept, slope = result_sabs.x


print("Optimal a0 (intercept): ", intercept)
print("Optimal a1 (slope): ", slope)


print('\nSSE = ', np.sum((y - intercept - slope*x) ** 2))
print('\nAbsolute Errors = ', np.sum(np.abs(y - intercept - slope*x)))
print('\nFinal model:\n',round(intercept,2),'+',round(slope,2),'x')



Let's plot the final model

In [None]:
result_sse.x

In [None]:

# Plot the data and the linear regression line
plt.plot(x, y, 'ro')  #'ro' red circles
plt.plot([min(x), max(x)], [intercept + slope * min(x), intercept + slope * max(x)], alpha=0.8, label="Regression line SABS")  # Line plot
plt.plot([min(x), max(x)], [result_sse.x[0] + result_sse.x[1] * min(x), result_sse.x[0] + result_sse.x[1] * max(x)], alpha=0.8, label="Regression line SSE", linewidth=1)  
plt.legend()
plt.xlabel("Height (Inches)")
plt.ylabel("Weight (Pounds)")
plt.title("Sum of Absolute Errors vs. Ordinary Least Squares (OLS) Regression")

for xi, yi in zip(x,y):
    plt.plot([xi]*2, [yi, intercept + slope*xi], "k:", linewidth=0.75) # show errors
    

plt.xlim(min(x)-1,  max(x)+1); plt.ylim( min(y)-1,  max(y)+1)


In this case, distant values are penalized less than with the SSE, because each error counts in proportion to its size rather than to its square.
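
One way to see this effect is with the simplest possible model, a single constant $c$ fitted to a sample. Under squared errors the best constant is the mean, while under absolute errors it is the median, which is far less sensitive to outliers. The sketch below checks this on a made-up sample (the `values` array is hypothetical) using a brute-force grid search, chosen here only for transparency rather than efficiency.

```python
import numpy as np

# Hypothetical sample with one clear outlier (100)
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constants c to try as the "model"
candidates = np.linspace(0, 100, 1001)  # step of 0.1

# Loss of each candidate under both criteria
squared_loss = [np.sum((values - c) ** 2) for c in candidates]
absolute_loss = [np.sum(np.abs(values - c)) for c in candidates]

best_squared = candidates[np.argmin(squared_loss)]
best_absolute = candidates[np.argmin(absolute_loss)]

print("Best constant under squared errors:", best_squared, "| mean:", values.mean())
print("Best constant under absolute errors:", best_absolute, "| median:", np.median(values))
```

The outlier pulls the squared-error solution far from the bulk of the data, while the absolute-error solution stays with the majority of the points.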

Advantages of OLS:

+ Computationally cheap for small datasets. For larger datasets, the matrix inverse used in the closed-form solution increases computation time.
+ Easy to interpret

And the model obtained is:

$$\widehat{\textbf{y}} = \widehat{a}_0+\widehat{a}_1 \textbf{x}$$

Hats indicate that these are estimated values.

# 2. Linear Regression with Sklearn

Luckily, we don't have to develop these algorithms ourselves from scratch. That's what machine learning libraries are already made for!

For example, let's see how easy it is to create a linear regression model in Sklearn, using the height-weight data loaded above:

In [None]:
from sklearn.model_selection import train_test_split

X = x.values.reshape(-1, 1)  # sklearn expects a 2D array of shape (n_samples, n_features)
# y (the target) can stay one-dimensional

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 850)

In [None]:
X_train.shape

In [None]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression() #uses ordinary least squares regression

# train
regr.fit(X_train, y_train) # Fit linear model

# Check model
print('a1 : \n', regr.coef_) #\n for new line
print('a0 : \n', regr.intercept_)

In [None]:

# Plot the data and the linear regression line
plt.plot(x, y, 'ro', alpha = 0.5)  #'ro' red circles
plt.plot([min(x), max(x)], [intercept + slope * min(x), intercept + slope * max(x)], alpha=0.8, label="Regression line SABS")  # Line plot
plt.plot([min(x), max(x)], [result_sse.x[0] + result_sse.x[1] * min(x), result_sse.x[0] + result_sse.x[1] * max(x)], alpha=0.8, label="Regression line SSE", linewidth=1)
plt.plot([min(x), max(x)], [regr.intercept_ + regr.coef_[0] * min(x), regr.intercept_ + regr.coef_[0] * max(x)], alpha=0.8, label="Regression ScikitLearn", linewidth=1)  # coef_ is an array with one entry per feature
plt.legend()
plt.xlabel("Height (Inches)")
plt.ylabel("Weight (Pounds)")
plt.title("Ordinary Least Squares (OLS) Regression")


for xi, yi in zip(x,y):
    plt.plot([xi]*2, [yi, intercept + slope*xi], "k:", linewidth=0.75) # show errors



Once the model is obtained with Scikit Learn, we can also make predictions directly:

In [None]:
y_pred = regr.predict(X_test)
y_pred

# 3. Evaluation metrics

The model obtained can be evaluated by calculating the **mean squared error** ($MSE$) and the **coefficient of determination** $R^2$.

The MSE is calculated as:

$$MSE=\frac{1}{n} \sum_{i=1}^n (\widehat{y}_i-y_i)^2$$

The coefficient $R^2$ is defined as follows:

$$R^2 = 1 - \frac{u}{v}$$

where $u$ is the sum of the squares of the errors, $$u=\sum_{i=1}^n (y_i - \widehat{y}_i)^2,$$ with $y_i$ the observed values and $\widehat{y}_i$ the predicted values,

and $v$ is the total sum of squares, $$v=\sum_{i=1}^n (y_i - \bar{y})^2,$$ where $\bar{y}$ is the mean of the observed data.
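
To make these definitions concrete, here is a minimal sketch that computes the MSE, its square root (RMSE), and $R^2$ by hand with NumPy, following the formulas for $u$ and $v$ given above. The arrays `y_obs` and `y_hat` are made-up values used only for illustration, not outputs of the model trained earlier.

```python
import numpy as np

# Hypothetical observed and predicted values
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Mean squared error and its square root
mse = np.mean((y_hat - y_obs) ** 2)
rmse = np.sqrt(mse)

# Coefficient of determination: R^2 = 1 - u / v
u = np.sum((y_obs - y_hat) ** 2)         # sum of squared errors
v = np.sum((y_obs - y_obs.mean()) ** 2)  # squared deviations from the mean
r2 = 1 - u / v

print("MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

An $R^2$ close to 1 means the predictions explain most of the variation of the observed values around their mean.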

In Sklearn, we could do:

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print('Mean squared error (MSE): %.2f'
      % mean_squared_error(y_test, y_pred))

print('Root Mean squared error (RMSE): %.2f'
      % np.sqrt(mean_squared_error(y_test, y_pred)))

# The coefficient of determination: 1 is perfect prediction
print('R2_score: %.2f'
      % r2_score(y_test, y_pred))

In [None]:
residual = y_test-y_pred
residual

In [None]:

# Plot the data and the linear regression line

plt.scatter(residual.index, residual, alpha=0.8, label="", linewidth=1)  

# Draw a solid line at y=0
plt.axhline(0, color='red', linestyle='-', linewidth=1)

# Add labels and title (optional)
plt.title('Residual Plot')
plt.xlabel('Index')
plt.ylabel('Residuals')

plt.show()


# 4. Visualization

We can use Seaborn's ``lmplot()`` function to visualize linear relationships in multidimensional datasets. The input must be a *Pandas* DataFrame.

## Example:  Macroeconomic dataset


It contains U.S. macroeconomic data from 1947 to 1962, specifically focusing on factors that may influence employment numbers.

* Year: The year the data was collected.
* Employed: The number of people employed (in thousands).
* GNP.deflator: The gross national product implicit price deflator (1954 = 100).
* GNP: The gross national product (in millions of dollars).
* Unemployed: The number of people unemployed (in thousands).
* Armed.Forces: The size of the armed forces (in thousands).
* Population: The population of the U.S. (in thousands).

In [None]:
import pandas as pd
import seaborn as sns
# Read data
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/longley.csv', index_col=0)
df.head()
df.describe()

In [None]:
df

Macroeconomic data from 1947 to 1962.

We want to predict ``Employed`` as the response $\textbf{y}$, using ``GNP`` as the predictor $\textbf{x}$.



``lmplot`` is a function from the ``seaborn`` library that stands for "Linear Model Plot". It's a versatile function for visualizing data along with a linear regression model, which can be either a simple linear regression or a more complex polynomial regression, depending on how you use it.

In [None]:
sns.lmplot(x="GNP", y="Employed", data=df, aspect=2)

In [None]:
sns.lmplot(x="GNP", y="Population", data=df, aspect=2)

In [None]:
sns.lmplot(x="Armed.Forces", y="Unemployed", data=df, aspect=2)

We see that there are parts that are not very "linear".

For this we can use polynomial regression.

## Polynomial Regression




Although it is called *linear* regression, we can also fit non-linear functions. The regression is linear in its parameters, not necessarily in its predictors. If nonlinear transformations $\phi$ are applied to the predictors, the model becomes nonlinear in the predictors while remaining linear in the coefficients $a_i$:

$$ \textbf{y} = a_1 \phi(\textbf{x}_1) + \dots + a_m \phi(\textbf{x}_m) $$

When the transformations are powers of a predictor, this technique is known as *Polynomial Regression*. The higher the degree of the polynomial, the more complex the model can be (watch out for overfitting and computation time!).

For example, a cubic model:

$$y_i \approx a_0 + a_1 x_i + a_2 x_i^2 + a_3 x_i^3$$
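
As a minimal sketch of how a cubic model is still a linear least-squares problem, the snippet below builds the columns $1, x, x^2, x^3$ and solves for the coefficients with NumPy. The data are synthetic and noise-free, generated from hypothetical coefficients in `true_coefs` so that the fit can be checked; they are unrelated to the macroeconomic dataset.

```python
import numpy as np

# Hypothetical cubic: y = 1 + 2x - 0.5x^2 + 0.25x^3 (noise-free)
true_coefs = np.array([1.0, 2.0, -0.5, 0.25])  # a_0, a_1, a_2, a_3
x_poly = np.linspace(-3, 3, 20)
y_poly = (true_coefs[0]
          + true_coefs[1] * x_poly
          + true_coefs[2] * x_poly**2
          + true_coefs[3] * x_poly**3)

# Polynomial features: one column per power 1, x, x^2, x^3
features = np.vander(x_poly, N=4, increasing=True)

# Ordinary least squares on the transformed features
fitted_coefs, *_ = np.linalg.lstsq(features, y_poly, rcond=None)

print("Recovered coefficients:", np.round(fitted_coefs, 6))
```

The model is nonlinear in $x$, but the unknowns $a_0, \dots, a_3$ enter linearly, so the same least-squares machinery applies.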

In [None]:
# Increase the order to estimate a polynomial regression
sns.lmplot(x="Armed.Forces", y="Unemployed", data=df, order=2, aspect=2)
sns.lmplot(x="Armed.Forces", y="Unemployed", data=df, order=3, aspect=2)
sns.lmplot(x="Armed.Forces", y="Unemployed", data=df, order=4, aspect=2)