## Linear Regression 

Linear regression is a well-known machine learning technique that fits a function to input data with known outputs. The data set can be written as a matrix $X \in \mathbb{R}^{n \times d}$: there are $n$ data points, each of dimension $d$. We are also given ground truth outputs $y \in \mathbb{R}^{n}$.

We wish to find a function $f$ such that $f(x_{i}) \approx y_{i}$ for $i = 1, \dots, n$, where $x_{i}$ is the $i$-th data point (the $i$-th row of $X$).

When it comes to linear regression, the function looks as follows:

$ \hat{y} = X \Theta$

where the hat on $y$ indicates predicted outputs, as opposed to the given outputs $y$. The matrix $X$ is the given data set, and $\Theta \in \mathbb{R}^{d}$ is a vector of weights, one per feature. Our goal is to find the weights that best fit our data.

In matrix form, we can see this even more clearly as

\begin{equation}
\begin{bmatrix}
y_{1} \\
y_{2} \\
\vdots \\
y_{n} \\
\end{bmatrix}
 = 
\begin{bmatrix}
x_{1}^{T} \\
x_{2}^{T} \\
\vdots \\
x_{n}^{T} 
\end{bmatrix}
\begin{bmatrix}
\theta_{1} \\
\theta_{2} \\
\vdots \\
\theta_{d} 
\end{bmatrix}
\end{equation}
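
The shapes in this product can be checked with a few lines of NumPy. This is only a small sketch: the sizes, the seed, and the variable names are made up for illustration, and $\Theta$ is taken to have one entry per feature ($d$ entries).

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this illustration
n, d = 5, 3                      # small sizes, assumed for readability

X = rng.standard_normal((n, d))  # n data points, each of dimension d
theta = rng.standard_normal(d)   # one weight per feature

y_hat = X @ theta                # predictions, one per data point

# Entry i of the product is the inner product x_i^T theta.
row_wise = np.array([X[i] @ theta for i in range(n)])

print(y_hat.shape)                   # (5,)
print(np.allclose(y_hat, row_wise))  # True
```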

In general, there is no exact solution to this linear system: with more data points than weights ($n > d$), the system is overdetermined. One way forward is to define a loss function and minimize it. A common choice for linear regression is the squared error between predicted outputs and ground truth outputs.

$L(\Theta) = \sum_{i=1}^{n} (x_{i}^T \Theta - y_{i})^{2} = ||X\Theta - y||^{2}_{2}$

This simply asks, for each data vector and output, how far the prediction is from the actual output value. The question now is how to minimize this loss function; that is, we want to solve for the $\Theta$ that makes it smallest.
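
As a quick check that the sum over data points and the squared norm describe the same quantity, here is a minimal sketch. The helper name `squared_error_loss`, the sizes, and the random data are assumptions made for this example. Note that every row's residual uses the full weight vector.

```python
import numpy as np

def squared_error_loss(X, theta, y):
    """Sum of squared residuals, ||X theta - y||_2^2."""
    residual = X @ theta - y
    return residual @ residual

rng = np.random.default_rng(1)   # arbitrary seed
X = rng.standard_normal((6, 2))
theta = rng.standard_normal(2)
y = rng.standard_normal(6)

# Sum form: one squared residual per data point.
loss_sum = sum((X[i] @ theta - y[i]) ** 2 for i in range(len(y)))

# Norm form: squared Euclidean norm of the residual vector.
loss_norm = np.linalg.norm(X @ theta - y) ** 2

print(np.isclose(loss_sum, loss_norm))
print(np.isclose(squared_error_loss(X, theta, y), loss_norm))
```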

We can use some help from vector calculus to reduce the loss function to a solvable state. 

A minimizer $\Theta$ of the loss must be a point where the gradient of the loss is equal to zero. 

$\nabla{L(\Theta^{*})} = 0$, where we've denoted here $\Theta^{*}$ as the minimizing set of weight values. If our loss function is what is termed convex, we know that any point where the gradient vanishes is a global minimum. For non-convex functions, a local minimum is not guaranteed to be the global minimum. Let's leave that issue for a later notebook. 
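
For the squared error loss used here, convexity can be checked directly. The following is a short sketch using the standard second-order condition (a twice-differentiable function is convex when its Hessian is positive semidefinite everywhere), which this notebook does not otherwise cover:

\begin{equation}
\nabla^{2} L(\Theta) = 2 X^{T} X, \qquad v^{T} (2 X^{T} X) v = 2 ||X v||_{2}^{2} \ge 0 \quad \text{for all } v \in \mathbb{R}^{d}
\end{equation}

The Hessian is positive semidefinite for any data matrix $X$, so the loss is convex and a point with zero gradient is a global minimizer.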

First, let us expand our loss function so we can easily take the gradient of it. 

\begin{equation}
\begin{aligned}
L(\Theta) &= ||X\Theta - y ||_{2}^{2} \\
&= (X \Theta - y)^{T} (X \Theta - y) \\
&= (\Theta^{T} X^{T} - y^{T})(X \Theta - y) \\
&= \Theta^{T} X^{T} X \Theta - 2 \Theta^{T} X^{T} y + y^{T} y
\end{aligned}
\end{equation}

Note, we have made use of the fact that $\Theta^{T} X^{T} y = y^{T} X \Theta$, as they are both scalar values, and the transpose of a scalar is the same scalar value. 

So we now have reorganized our loss function to look as follows: 

$L(\Theta) = \Theta^{T} X^{T} X \Theta - 2 \Theta^{T} X^{T} y + y^{T} y$. 

We want to take the gradient of this function with respect to the variable $\Theta$. 

From vector calculus, we have: 
$\nabla_{x} (a^T x) = a$ and $\nabla_{x} (x^T A x) = (A + A^{T}) x$. 

Applying this to our problem, we find the gradient of the loss to be: 

$\nabla L(\Theta) = \nabla (\Theta^{T} X^{T} X \Theta) - 2 \nabla (\Theta^{T} X^{T} y) + \nabla(y^{T} y)$

The first term uses $A = X^{T} X$, which is symmetric, so $(A + A^{T}) \Theta = 2 X^{T} X \Theta$. The second term uses $a = X^{T} y$. The last term does not depend on $\Theta$, so its gradient is zero. This gives

$\nabla L(\Theta) = 2 X^{T} X \Theta  - 2 X^{T} y$
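
One way to gain confidence in this formula is to compare it against a finite-difference approximation of the gradient. The sketch below does that on random data. The function names, the step size `eps`, and the problem sizes are choices made for this example, not part of the derivation.

```python
import numpy as np

def loss(X, theta, y):
    """Squared error loss ||X theta - y||_2^2."""
    r = X @ theta - y
    return r @ r

def analytic_gradient(X, theta, y):
    """Gradient from the derivation: 2 X^T X theta - 2 X^T y."""
    return 2 * X.T @ X @ theta - 2 * X.T @ y

def numerical_gradient(X, theta, y, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (loss(X, theta + step, y) - loss(X, theta - step, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)   # arbitrary seed
X = rng.standard_normal((50, 4))
theta = rng.standard_normal(4)
y = rng.standard_normal(50)

gap = np.linalg.norm(analytic_gradient(X, theta, y) - numerical_gradient(X, theta, y))
print(gap)   # should be tiny compared with the size of the gradient itself
```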

We want this value to be equal to zero to find our minimizing value for $\Theta$. 

Re-arranging, we have 

$2 X^{T} X \Theta  - 2 X^{T} y = 0$ which leads to 

\begin{equation}
X^{T} X \Theta = X^{T} y
\end{equation}


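To solve this linear system, we can use many different approaches. When the columns of $X$ are linearly independent, the matrix $X^T X$ is symmetric positive definite, so a variety of matrix solving techniques apply, including Cholesky factorization. Do NOT explicitly form the inverse of the matrix: it costs more and is numerically less accurate than solving the system directly. NumPy offers built-in options such as `np.linalg.solve` for the system above and `np.linalg.lstsq`, which works on $X$ and $y$ directly. 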
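
To make the solver options concrete, here is a hedged sketch comparing three NumPy routes on made-up, well-conditioned data: a general solve of the system, a least-squares solve on $X$ itself, and a Cholesky factorization followed by two solves. The data, seed, and variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
X = rng.standard_normal((200, 5))
y = rng.standard_normal(200)

# Option 1: solve X^T X theta = X^T y with a general linear solver.
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2: least-squares solver applied to X directly (X^T X is never formed).
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Option 3: Cholesky factorization X^T X = C C^T, then two solves.
# np.linalg.solve is a general solver and does not exploit the triangular
# structure; it is used here only to keep the sketch NumPy-only.
C = np.linalg.cholesky(X.T @ X)
z = np.linalg.solve(C, X.T @ y)        # C z = X^T y
theta_chol = np.linalg.solve(C.T, z)   # C^T theta = z

print(np.allclose(theta_solve, theta_lstsq))
print(np.allclose(theta_solve, theta_chol))
```

On well-conditioned data like this, all three routes agree to within rounding. The least-squares route is generally the safer choice when $X$ is close to rank deficient, since forming $X^T X$ squares the condition number.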

That is all. Now let us make up some data and see if we can fit a linear regression model. 


In [2]:
import numpy as np 

In [9]:
#Make up a random data matrix with n data samples and d dimensions
n = 1000
d = 20 

X = np.random.randn(n,d)

#Make up a theta vector that we know ahead of time! We will use this to create a ground truth y data set, 
#and then solve to see if we get the same value back
theta = np.random.randn(d) 

#True data 
y_true = np.dot(X,theta)

#Now find theta given y_true and X. We should get back the same theta vector we already had. 

left_hand_matrix = np.dot(np.transpose(X), X) 
right_hand_matrix = np.dot(np.transpose(X), y_true)

theta_guess = np.linalg.solve(left_hand_matrix,right_hand_matrix)

#Let's compare the difference of the given theta and the guessed theta. 

error = np.linalg.norm(theta-theta_guess)

In [10]:
print(error)

2.61353233938522e-15


We see that our error is on the order of machine precision. This example is simply a sanity check: the data has no noise, so the weights are recovered exactly up to rounding. Try this on a real-life data set to see how good a fit you can get. Let's write a clean function that takes a data set $X$ and ground truth $y$ and gives us optimal weights. 

In [11]:
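def linear_regression(X,y):
    #Form both sides of the normal equations X^T X theta = X^T y
    left_hand_matrix = np.dot(X.T, X)
    right_hand_matrix = np.dot(X.T, y) 
    
    #Solve the linear system rather than inverting the matrix
    theta = np.linalg.solve(left_hand_matrix, right_hand_matrix)
    
    return theta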

There it is, that simple!
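
As a usage sketch, a function like this can be tried on data with a little noise added, which is closer to what a real data set looks like. The weights, noise level, and sizes below are made up for the demo, and the function is restated so the cell runs on its own.

```python
import numpy as np

def linear_regression(X, y):
    """Solve the normal equations X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(4)             # arbitrary seed
n, d = 500, 3
X = rng.standard_normal((n, d))
theta_true = np.array([1.5, -2.0, 0.5])    # made-up weights for the demo
noise = 0.1 * rng.standard_normal(n)       # assumed noise level
y = X @ theta_true + noise

theta_hat = linear_regression(X, y)
residual = y - X @ theta_hat

print(theta_hat)                           # close to theta_true, but not equal
print(np.linalg.norm(X.T @ residual))      # near zero
```

With noise, the recovered weights are no longer exact, but the residual is still orthogonal to the columns of $X$ up to rounding, which is just the condition $X^{T} X \Theta = X^{T} y$ restated.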