<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Mathematics Basics

**With `NumPy`**

&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | [training@tpq.io](mailto:training@tpq.io) | [@dyjh](http://twitter.com/dyjh)

See also `14_math_basics.ipynb`.

## Simple Linear Regression

From Wikipedia (https://en.wikipedia.org/wiki/Simple_linear_regression):

> In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

Assume that a sample data set is given by $(y_i, x_i)_{i=1}^n$.

The goal of simple linear regression is to best approximate the data in linear functional form as follows:

$\hat{y}_i = \alpha + \beta x_i \approx y_i$

Or with residuals $\epsilon_i$ as follows:

$y_i = \alpha + \beta x_i + \epsilon_i$

This leads to the (least-squares) optimization problem:

$\min_{\alpha, \beta} \sum_{i=1}^n(y_i - \hat{y}_i)^2$

The optimal solution is given by

$\beta=\frac{\mathbf{Cov}(x, y)}{\mathbf{Var}(x)}$

and

$\alpha = \bar{y} - \beta \bar{x}$

with $\bar{y}$ being the mean value of the $y_i$.
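
As a brief sketch of where these expressions come from (a standard argument, not part of the original text): setting the partial derivatives of the least-squares objective to zero gives the first-order conditions

$\frac{\partial}{\partial \alpha}: \; -2\sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0 \;\Rightarrow\; \alpha = \bar{y} - \beta \bar{x}$

$\frac{\partial}{\partial \beta}: \; -2\sum_{i=1}^n x_i (y_i - \alpha - \beta x_i) = 0$

Substituting $\alpha$ into the second condition and using $\sum_i (x_i - \bar{x}) = 0$ and $\sum_i (y_i - \bar{y}) = 0$ yields

$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

Dividing numerator and denominator by $n$ gives the ratio of covariance to variance stated above.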

The variance of $x$ is given by

$\mathbf{Var}(x) = \frac{1}{n}\sum_i \left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)$

with $\bar{x}$ being the mean value of the $x_i$.

The covariance of $x$ and $y$ is given by:

$\mathbf{Cov}(x, y) = \frac{1}{n}\sum_i \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$
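
The two formulas can be checked by hand on a tiny sample. The following is a minimal sketch; the arrays `x_demo` and `y_demo` are arbitrary illustrative values, not data from this notebook. It also shows that `ndarray.var()` uses the $\frac{1}{n}$ normalization by default, while `np.cov` needs `bias=True` to match.

```python
import numpy as np

# Arbitrary illustrative sample (assumption: not taken from the notebook).
x_demo = np.array([1.0, 2.0, 4.0, 7.0])
y_demo = np.array([2.0, 3.0, 5.0, 9.0])

n = len(x_demo)
x_bar = x_demo.mean()
y_bar = y_demo.mean()

# Variance and covariance with the 1/n normalization used in the formulas.
var_x = ((x_demo - x_bar) * (x_demo - x_bar)).sum() / n
cov_xy = ((x_demo - x_bar) * (y_demo - y_bar)).sum() / n

print(var_x)   # 5.25
print(cov_xy)  # 6.125

# NumPy equivalents: var() defaults to 1/n, np.cov() needs bias=True for 1/n.
print(np.isclose(var_x, x_demo.var()))
print(np.isclose(cov_xy, np.cov(x_demo, y_demo, bias=True)[0, 1]))
```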

### Deterministic Sample Data 1 

In [None]:
!git clone https://github.com/tpq-classes/mathematics_basics.git
import sys
sys.path.append('mathematics_basics')


In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

In [None]:
x = np.linspace(0, 10, 1001)

In [None]:
y = 3 + x / 2

In [None]:
plt.plot(x, y);

In [None]:
# np.cov?

In [None]:
np.cov(x, y, bias=True)  # covariance matrix

In [None]:
x.var()

In [None]:
y.var()

In [None]:
def beta(x, y):
    # slope: Cov(x, y) / Var(x); bias=True matches the 1/n normalization of x.var()
    return np.cov(x, y, bias=True)[0, 1] / x.var()

In [None]:
b = beta(x, y)

In [None]:
b

In [None]:
def alpha(xn, yn):
    # intercept: mean(y) - beta * mean(x)
    return yn.mean() - beta(xn, yn) * xn.mean()

In [None]:
a = alpha(x, y)

In [None]:
a
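
As a cross-check, the closed-form estimates can be compared with `np.polyfit`, which fits a degree-1 polynomial by least squares. This is a minimal sketch; the names with the `_chk` suffix are introduced here only to avoid overwriting the notebook's variables.

```python
import numpy as np

# Same deterministic data as above, under separate (hypothetical) names.
x_chk = np.linspace(0, 10, 1001)
y_chk = 3 + x_chk / 2

# Closed-form estimates from the covariance/variance formulas.
b_chk = np.cov(x_chk, y_chk, bias=True)[0, 1] / x_chk.var()
a_chk = y_chk.mean() - b_chk * x_chk.mean()

# np.polyfit returns coefficients from highest to lowest degree: [slope, intercept].
slope, intercept = np.polyfit(x_chk, y_chk, deg=1)

print(np.allclose([b_chk, a_chk], [slope, intercept]))
print(np.allclose([b_chk, a_chk], [0.5, 3.0]))
```

Since the data lie exactly on the line $y = 3 + x / 2$, both approaches should recover the slope and intercept up to floating-point error.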

### Deterministic Sample Data 2

In [None]:
x = np.linspace(0, 10, 1001)

In [None]:
y = 3 + x ** 3 / 2

In [None]:
plt.plot(x, y);

In [None]:
b = beta(x, y)

In [None]:
b

In [None]:
a = alpha(x, y)

In [None]:
a

In [None]:
plt.plot(x, y)
plt.plot(x, a + b * x, 'r--');
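
The dashed line misses the cubic data in a systematic way. A minimal sketch of the residuals makes this visible without a plot (names with the `_cub` suffix are introduced here for illustration): by the first-order conditions of least squares, the residuals sum to zero and are orthogonal to $x$, yet they are positive at both ends of the interval and negative in the middle.

```python
import numpy as np

# Cubic data as in this section, under separate (hypothetical) names.
x_cub = np.linspace(0, 10, 1001)
y_cub = 3 + x_cub ** 3 / 2

b_cub = np.cov(x_cub, y_cub, bias=True)[0, 1] / x_cub.var()
a_cub = y_cub.mean() - b_cub * x_cub.mean()

# Residuals of the linear fit.
resid = y_cub - (a_cub + b_cub * x_cub)

# Least-squares residuals sum to zero and are orthogonal to x ...
print(np.isclose(resid.sum(), 0, atol=1e-5))
print(np.isclose((resid * x_cub).sum(), 0, atol=1e-5))

# ... but their sign pattern reveals the nonlinearity the line cannot capture.
print(resid[0] > 0, resid[len(resid) // 2] < 0, resid[-1] > 0)
```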

## Coefficient of Determination

From Wikipedia (https://en.wikipedia.org/wiki/Coefficient_of_determination):

> In statistics, the coefficient of determination, denoted $R^2$ or $r^2$ and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).<br>It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

The total sum of squares is defined as:

$SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2}$

The residual sum of squares is defined as:

$SS_{\text{res}}=\sum _{i}(y_{i}-\hat{y}_{i})^{2}=\sum _{i}\epsilon_{i}^{2}$

The coefficient of determination is then given by:

$R^{2}=1-\frac{SS_{\text{res}}}{SS_{\text{tot}}}$

> In the best case, the modeled values exactly match the observed values, which results in $SS_{\text{res}}=0$ and $R^{2}=1$. A baseline model which always predicts $\bar {y}$ will have $R^{2}=0$. Models that have worse predictions than this baseline will have a negative $R^{2}$.
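
The three cases in the quote can be reproduced on a tiny example. This is a minimal sketch; the helper `r_squared` and the array `y_obs` are illustrative names and values introduced here, not part of the notebook.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """R^2 = 1 - SS_res / SS_tot, following the definitions above."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_obs - y_pred) ** 2).sum()
    ss_tot = ((y_obs - y_obs.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

y_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect predictions: SS_res = 0.
r2_perfect = r_squared(y_obs, y_obs)
# Baseline model that always predicts the mean: SS_res = SS_tot.
r2_baseline = r_squared(y_obs, np.full_like(y_obs, y_obs.mean()))
# Predictions in reversed order are worse than the baseline.
r2_worse = r_squared(y_obs, y_obs[::-1])

print(r2_perfect)   # 1.0
print(r2_baseline)  # 0.0
print(r2_worse)     # -3.0
```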

In [None]:
def sstot(yn):
    mu = yn.mean()
    return ((yn - mu) ** 2).sum()  # use the argument yn, not the global y

In [None]:
def ssres(yn, yn_):
    return ((yn - yn_) ** 2).sum()

In [None]:
def R2(yn, yn_):
    return 1 - ssres(yn, yn_) / sstot(yn)

In [None]:
R2(y, a + b * x)

## Correlation Coefficient

From Wikipedia (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient):

> In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/) ― also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ― is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation.<br><br>
> $r =\frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2} \sum\left(y_{i}-\bar{y}\right)^{2}}}$<br>
> * $r = $ correlation coefficient<br>
> * $x_{i} = $ values of the x-variable in a sample<br>
> * $\bar{x} = $ mean of the values of the x-variable<br>
> * $y_{i} = $ values of the y-variable in a sample<br>
> * $\bar{y} = $ mean of the values of the y-variable
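
The formula translates almost literally into NumPy. The following minimal sketch (the helper name `pearson_r` and the `_r` suffixed arrays are introduced here for illustration) compares the hand-written version with `np.corrcoef`.

```python
import numpy as np

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed directly from the formula."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    dx = xs - xs.mean()
    dy = ys - ys.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cubic data as in "Deterministic Sample Data 2", under separate names.
x_r = np.linspace(0, 10, 1001)
y_r = 3 + x_r ** 3 / 2

r_manual = pearson_r(x_r, y_r)
r_numpy = np.corrcoef(x_r, y_r)[0, 1]

print(np.isclose(r_manual, r_numpy))
```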


In [None]:
np.corrcoef(y, a + b * x)  # correlation matrix

In [None]:
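def R2(yn, yn_):
    # squared correlation; equals 1 - SS_res / SS_tot only when yn_ comes
    # from a least-squares linear fit with intercept
    c = np.corrcoef(yn, yn_)[0, 1]
    return c ** 2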

In [None]:
R2(y, a + b * x)
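
The squared correlation and $1 - SS_{\text{res}} / SS_{\text{tot}}$ agree for a least-squares line with intercept, but not for arbitrary predictions. The sketch below illustrates this under stated assumptions: `r2_ss`, `r2_corr`, and the shifted predictions `y_off` are names and values introduced here for illustration only.

```python
import numpy as np

def r2_ss(y_obs, y_pred):
    # Definition via sums of squares.
    return 1 - ((y_obs - y_pred) ** 2).sum() / ((y_obs - y_obs.mean()) ** 2).sum()

def r2_corr(y_obs, y_pred):
    # Definition via the squared correlation coefficient.
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2

x_eq = np.linspace(0, 10, 1001)
y_eq = 3 + x_eq ** 3 / 2

b_eq = np.cov(x_eq, y_eq, bias=True)[0, 1] / x_eq.var()
a_eq = y_eq.mean() - b_eq * x_eq.mean()

y_fit = a_eq + b_eq * x_eq  # least-squares predictions
y_off = y_fit + 50          # same line shifted upwards (no longer least-squares)

# For the least-squares fit both definitions coincide.
print(np.isclose(r2_ss(y_eq, y_fit), r2_corr(y_eq, y_fit)))

# Correlation ignores the shift, the sum-of-squares definition does not.
print(np.isclose(r2_corr(y_eq, y_off), r2_corr(y_eq, y_fit)))
print(r2_ss(y_eq, y_off) < r2_corr(y_eq, y_off))
```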

## Random Data

### Random Sample Data 1

In [None]:
from numpy.random import default_rng

In [None]:
rng = default_rng(100)

In [None]:
y = 3 + x / 2 + rng.normal(0, 0.2, len(x))

In [None]:
plt.plot(x, y, 'b.');

In [None]:
b = beta(x, y)

In [None]:
b

In [None]:
a = alpha(x, y)

In [None]:
a
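
With noisy data the estimates only approximate the true parameters 3 and 0.5. As a minimal sketch, the same fit can be written as a linear least-squares problem and solved with `np.linalg.lstsq`; the design matrix `A` and the suffixed names are introduced here for illustration.

```python
import numpy as np
from numpy.random import default_rng

# Re-create noisy linear data under separate (hypothetical) names.
rng_chk = default_rng(100)
x_rnd = np.linspace(0, 10, 1001)
y_rnd = 3 + x_rnd / 2 + rng_chk.normal(0, 0.2, len(x_rnd))

# Design matrix: a column of ones for the intercept, then x for the slope.
A = np.column_stack([np.ones_like(x_rnd), x_rnd])
(a_ls, b_ls), *_ = np.linalg.lstsq(A, y_rnd, rcond=None)

# Closed-form estimates from the covariance/variance formulas.
b_cf = np.cov(x_rnd, y_rnd, bias=True)[0, 1] / x_rnd.var()
a_cf = y_rnd.mean() - b_cf * x_rnd.mean()

# Both routes solve the same optimization problem.
print(np.allclose([a_ls, b_ls], [a_cf, b_cf]))
# The estimates are close to, but not exactly, the true values 3 and 0.5.
print(abs(a_ls - 3) < 0.1, abs(b_ls - 0.5) < 0.1)
```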

In [None]:
plt.plot(x, y, 'b.')
plt.plot(x, a + b * x, 'r--');

In [None]:
R2(y, a + b * x)

### Random Sample Data 2

In [None]:
y = 3 + np.sqrt(x) + rng.normal(0, 0.2, len(x))

In [None]:
plt.plot(x, y, 'b.');

In [None]:
b = beta(x, y)

In [None]:
b

In [None]:
a = alpha(x, y)

In [None]:
a

In [None]:
plt.plot(x, y, 'b.')
plt.plot(x, a + b * x, 'r--');

In [None]:
R2(y, a + b * x)

### Random Sample Data 3

In [None]:
x = np.linspace(0, 10, 1001)

In [None]:
y = 3 + np.sin(x) + rng.normal(0, 0.2, len(x))

In [None]:
plt.plot(x, y, 'b.');

In [None]:
b = beta(x, y)

In [None]:
b

In [None]:
a = alpha(x, y)

In [None]:
a

In [None]:
plt.plot(x, y, 'b.')
plt.plot(x, a + b * x, 'r--');

In [None]:
R2(y, a + b * x)
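
To compare the three random samples side by side, the steps can be wrapped in a loop. This is a minimal sketch under stated assumptions: the helpers `fit_line` and `r_squared` and the `_cmp` names are introduced here, and the random draws differ from those in the cells above, so the printed values will not match them exactly.

```python
import numpy as np
from numpy.random import default_rng

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    slope = np.cov(xs, ys, bias=True)[0, 1] / xs.var()
    return ys.mean() - slope * xs.mean(), slope

def r_squared(y_obs, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    return 1 - ((y_obs - y_pred) ** 2).sum() / ((y_obs - y_obs.mean()) ** 2).sum()

rng_cmp = default_rng(100)
x_cmp = np.linspace(0, 10, 1001)

# Noise-free signals used in the three random samples.
signals = {
    'linear': 3 + x_cmp / 2,
    'sqrt': 3 + np.sqrt(x_cmp),
    'sin': 3 + np.sin(x_cmp),
}

scores = {}
for name, signal in signals.items():
    y_cmp = signal + rng_cmp.normal(0, 0.2, len(x_cmp))
    a_cmp, b_cmp = fit_line(x_cmp, y_cmp)
    scores[name] = r_squared(y_cmp, a_cmp + b_cmp * x_cmp)
    print(f'{name:>6}: R2 = {scores[name]:.3f}')
```

The further the underlying relationship departs from a straight line, the smaller the share of variance that the linear model explains.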
