# Univariate Linear Regression

For the programming notebook this week we'll build all the components for a univariate linear regression model. It is a very simple regression model, but many of the ideas here are a part of more complex regression models too, so understanding them in this simple context will be good practice.

To start, let's load some of the tools and data we'll need. *Matplotlib* will allow us to graphically plot the data and the results, making everything a little easier to interpret. *Numpy* is a library that is used a lot in machine learning when building models, but here we'll just use the function `loadtxt` as an easy way to load in the data file. The function `train_test_split` from *scikit-learn* will once again be used to split the data into *70%* training and *30%* testing.

In [None]:
# The following line makes sure that matplotlib understands that we want the plots
# to be shown inside the notebook as an image
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

data = np.loadtxt('linear_data.csv', delimiter=',', dtype=float)

training_data, testing_data = train_test_split(data, train_size=0.7)

print("First 10 rows of the training data:\n")
print(training_data[:10])

## Numpy Arrays

The training set is now stored as a *Numpy* data matrix, with every row corresponding to a single training example of one x-value and one y-value. A data matrix is very similar to a *list of lists*, but the matrix format has some advantages, which we'll get into in later assignments. For now, we're just going to convert this data back to two simple lists (technically *Numpy* arrays, but you can use them exactly like you would a list).
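As a small illustration of how selecting a column from such a matrix works, here is a sketch on a made-up 3 by 2 array. The values and the names `example`, `x_col` and `y_col` are invented for this example and are not taken from `linear_data.csv`.

```python
import numpy as np

# Hypothetical data matrix: each row is one (x, y) pair
example = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

x_col = example[:, 0]  # all rows, first column
y_col = example[:, 1]  # all rows, second column

print(x_col.tolist())  # [1.0, 2.0, 3.0]
print(y_col.tolist())  # [10.0, 20.0, 30.0]

# Values at the same index still come from the same row
print(x_col[1], y_col[1])  # 2.0 20.0
```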

In [None]:
x_training = training_data[:, 0]
y_training = training_data[:, 1]

x_testing = testing_data[:, 0]
y_testing = testing_data[:, 1]

print("\nTraining x-values")
print(x_training)

print("\nTraining y-values")
print(y_training)


## Plotting the data

The left column of the training data matrix is now converted to a list of x-values and the right column is converted to a list of y-values. It is important to note that, even though we split the data into two lists, the indexes of these two lists are linked. So, for example, the x-value at index 5 and the y-value at index 5 together form a single training pair.

This is a simple artificial data set that was generated to be suitable for linear regression, but to make it more concrete you could still view this as the housing prediction problem from the theory videos. The x-values would then be square meters and the y-values the price of the sold house in thousands of euros. This data would then be a set of records of past sales and we're trying to predict the price of new houses based on just the number of square meters they each contain.

These pairs of values, at the same index in the x-values and the y-values lists, together form a sample of data for the sale of a single house, with its surface area and price. Viewing them as plotted points instead of lists of numbers would therefore probably be more informative. Use the *matplotlib* functions [plt.plot()](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html) and [plt.show()](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.show.html) to create a plot for the training data. Make sure to plot **points** and not a line, as these are individual (shuffled) samples.

In [None]:
# YOUR CODE HERE


## Linear model

Next, we'll define the simple linear model we'll use to try and predict the y-values. A linear equation is defined as an equation of the form:

$$y=ax+b$$

*Note:* In the theory videos Andrew uses $\theta_0$ and $\theta_1$ as the model parameters, but we'll stick with $a$ and $b$ for now, just to make sure the distinction between the two is very clear. The underlying model is of course identical, and we'll still use a $\theta$ parameter notation in later assignments.

Start by implementing the function `linear_model()`, which takes an input `x` and the model parameters `a` and `b` and returns the prediction for `y` (this should be very straightforward). Next, make a function `linear_model_list()` which takes a list of x-values and applies the `linear_model()` function to every x-value in the list and returns a new list of predicted y-values.
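The pattern of one function for a single value, and a second function that applies it to every element of a list, can be sketched on an unrelated toy example. The functions `double()` and `double_list()` below are hypothetical and only illustrate the structure; they are not part of the assignment.

```python
def double(value):
    # Hypothetical helper: works on a single value
    return 2 * value

def double_list(values):
    # Apply double() to every element and collect the results in a new list
    return [double(v) for v in values]

print(double_list([1, 2, 3]))  # [2, 4, 6]
```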

### Plotting a model

Now, let's plot to see what such a model would look like. This model is a linear function and so by definition it can make predictions for *any* possible x-value. If we want to plot this line, the best thing we can do is just sample a lot of points on the x-axis, compute the corresponding y-values for each of them and plot all of these points, but connect every point together with a straight line.

*Side note: This might seem like a strange way to plot a linear function, but this is actually how any function is plotted on a computer. This also includes curved functions, like parabolas or sine waves, so it will be good practice to use the same approach here. Just take a whole lot of points, compute the output value for each of them and let the computer "connect the dots".*

Based on the plot you made before, make an approximate guess for what you think $a$ and $b$ should be. Make a range of x-values using [np.linspace()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html) over a sensible interval (again, take a look at the plot from the previous cell for some hints). Compute the predicted y-values for those sampled x-values and your guesses for $a$ and $b$ using `linear_model_list()`.
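To get a feel for what `np.linspace()` returns, here is a minimal sketch that samples a few points and evaluates a parabola at each of them. The interval, the number of samples and the function are arbitrary choices for this illustration, not values suggested for the assignment.

```python
import numpy as np

# 5 evenly spaced samples from 0 to 2, both endpoints included
xs = np.linspace(0, 2, 5)
print(xs.tolist())  # [0.0, 0.5, 1.0, 1.5, 2.0]

# Evaluate a curved function (a parabola) at every sample
ys = xs ** 2
print(ys.tolist())  # [0.0, 0.25, 1.0, 2.25, 4.0]
```

With only 5 samples a plotted parabola would look angular; with a few hundred samples the straight segments between points become too short to notice.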

Plot the training data as points, exactly as you did before, but don't use `plt.show()` yet. Next, plot the sampled x-values and the predicted y-values as a **line** and using a different color (see the [plot documentation](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html) for details). Finally, show the results. You should end up with a plot that has both your predictive line, and the training data.

Try at least 3 different values for $a$ and $b$ in your plot and see how well the predicted line fits the data.

#### Which values for $a$ and $b$ did you try and which one seemed to fit the data the best?

*Your answer goes here.*


In [None]:
def linear_model(x, a, b):
    # YOUR CODE HERE
    

def linear_model_list(x_list, a, b):
    # YOUR CODE HERE
    


# YOUR CODE HERE


## Cost function

Now, we'll define the cost function for this linear model:

$$J(a, b) = \frac{1}{2m}\sum^m_{i=1} ((a x^i + b) - y^i)^2$$

Note that this function only depends on the model parameters $a$ and $b$, and not the data vectors $\mathbf{x}$ and $\mathbf{y}$. This training data is considered to be *constant* within the cost function, as the data will not change at all. The only thing that changes the cost of the model for a specific problem is changing the model parameters $a$ and $b$.
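As a worked example of this formula, take a made-up data set of $m = 2$ points, $(1, 2)$ and $(2, 4)$, and the guess $a = 1$, $b = 0$. These numbers are invented purely to show the arithmetic:

$$J(1, 0) = \frac{1}{2 \cdot 2}\left(((1 \cdot 1 + 0) - 2)^2 + ((1 \cdot 2 + 0) - 4)^2\right) = \frac{1}{4}(1 + 4) = \frac{5}{4}$$

For the guess $a = 2$, $b = 0$ the line passes exactly through both points, so both squared errors are zero and $J(2, 0) = 0$.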

In order to write this as a proper function in code, it *does* of course also need the x-values and y-values of the training data as input. Define the function `linear_cost`, which takes model parameters `a` and `b`, and a data set in the form of lists of `x_values` and `y_values`, and computes the model cost for that data set. Reuse your functions from before as much as possible.

Apply this function to your guesses for $a$ and $b$ based on the plots of the training data before. Print the parameters and resulting cost for each guess. Order the guesses from highest cost to lowest cost.

#### Did the ordering of the costs for each guess correspond with your expectations? Explain your answer.

*Your answer goes here.*


In [None]:
def linear_cost(a, b, x_values, y_values):
    # YOUR CODE HERE
    

# YOUR CODE HERE


## 3D plot of the cost surface

Now, let's take a further look at this cost function and construct a full 3D plot. Below is the code to plot the complete cost surface. The code takes 100 samples for possible values of $a$ and 100 samples for possible values of $b$ and then computes the cost for *every possible combination*. Here you can start to get a hint of why *Numpy* is used so often in machine learning, as we can do this complex operation in only a couple of lines of code.

The code then simply plots these computed results and connects the dots in the same way as for the line, but now forming a 3-dimensional surface. The x-axis and y-axis are the values for $a$ and $b$ respectively, with the z-axis containing the corresponding cost for that combination of $a$ and $b$. The scale for $a$ is from 0 to 5 and for $b$ is from -100 to 200, as this is my own estimate of the range of sensible value combinations that will *definitely* contain the minimum of the cost function.

Make sure you understand this 3-d plot and how it relates to the line and point plots from earlier, before moving on. Understanding how this surface relates to the estimates made by the model will be critical for the rest of the assignment.
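If the `np.meshgrid` step in the cell below is unclear, this small sketch shows how it builds *every possible combination* from two short arrays. The sample values and variable names are made up for the illustration.

```python
import numpy as np

a_samples = np.array([0, 1, 2])
b_samples = np.array([10, 20])

# A repeats a_samples along each row, B repeats b_samples along each column
A, B = np.meshgrid(a_samples, b_samples)
print(A.shape)     # (2, 3)
print(A.tolist())  # [[0, 1, 2], [0, 1, 2]]
print(B.tolist())  # [[10, 10, 10], [20, 20, 20]]

# Flattening both grids and pairing them up lists all 3 * 2 combinations
pairs = list(zip(np.ravel(A).tolist(), np.ravel(B).tolist()))
print(pairs)  # [(0, 10), (1, 10), (2, 10), (0, 20), (1, 20), (2, 20)]
```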

In [None]:
from mpl_toolkits.mplot3d import Axes3D

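# Create the figure first and add 3D axes to it; recent Matplotlib versions
# no longer attach axes created with Axes3D(fig) to the figure automatically
fig = plt.figure(figsize=(16, 12))
ax = fig.add_subplot(projection='3d')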
plt.rc('axes', labelsize=24)

a = np.linspace(0, 5, 100)
b = np.linspace(-100, 200, 100)

X, Y = np.meshgrid(a, b)
cost_map = np.vectorize(linear_cost, excluded=(2, 3))
Z = cost_map(np.ravel(X), np.ravel(Y), x_training, y_training).reshape(X.shape)

ax.plot_wireframe(X, Y, Z)
ax.set_xlabel('a')
ax.set_ylabel('b')
ax.set_zlabel('Cost')
plt.show()

## Finding the minimum of the cost function

The goal in linear regression is to try and find the model parameters $a$ and $b$ that result in the lowest possible cost on the training data, i.e. the lowest point on the surface plotted above. Given that we've just computed the cost for a lot of different combinations of $a$ and $b$, we could just select the combination with the lowest cost from this plot. This approach has two problems:

* This will really only give us quite a coarse approximation, as we've taken 100 samples for both $a$ and $b$, so the space between samples is large and ideally we'd have a more precise solution.
* It usually isn't feasible to compute all combinations in this manner (which is part of the reason why we're using this simple generated data set) and this approach won't work for larger data sets.

So instead we'll use a *gradient descent* approach, where we start with a random value for $a$ and $b$ and iteratively follow the gradient down to the minimum value. For this we'll need to be able to compute what the gradient of the cost function is at a specific point $a, b$. If we can obtain the partial derivative of the cost function with respect to $a$ and $b$, then we can easily compute the gradient at any point.

### LaTeX

The next few assignments in this notebook will require you to write some equations in *Markdown* cells. There is a language called *LaTeX* that is used in almost all scientific disciplines to write equations, which you can write directly in a *Markdown* cell. We'll only need a couple of simple tools, listed below:

* Equations on their own line should be surrounded by `$$` on both sides (a single `$` on both sides puts an equation inside a sentence)
* `\frac{1}{2}` makes a fraction of 1 over 2: $\frac{1}{2}$
* `\sum` makes a sum symbol: $\sum$
* `^` makes the next character superscript
* `_` makes the next character subscript
* `\partial` makes the partial derivative symbol: $\partial$

If you *run* a markdown cell you can see the rendered equations.
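As an example of how these tools combine, this is the *LaTeX* source for the cost function and for the partial derivative notation used in this notebook. Note that curly braces group several characters together, so `_{i=1}` puts all of `i=1` in the subscript.

```latex
$$J(a, b) = \frac{1}{2m}\sum^m_{i=1} ((a x^i + b) - y^i)^2$$

$$\frac{\partial}{\partial b} J(a, b)$$
```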

## Partial derivative of $J(a, b)$ w.r.t. $b$

For this assignment you should work out what the partial derivative of the cost function is with respect to $b$. This will already give us one half of the gradient of the cost function, and it is a slightly easier derivative to start with.

Apply the rules of differentiation to the equation below and simplify the result as much as possible. You should label every step with the name of the rule you use to determine the next equation. Continue to apply rules until you can no longer simplify. The first step has already been given as an example:

$$\frac{\partial}{\partial b} J(a, b)$$

Substitute the definition

$$\frac{\partial}{\partial b} \frac{1}{2m}\sum^m_{i=1} ((a x^i + b) - y^i)^2$$

***Continue with the next rule here***



## First half of the gradient

With this equation you should now be able to compute the $b$ component of the gradient of the cost function at a specific point $a, b$. Fill in the function `b_gradient()` to compute the equation you obtained for this at the previous step. We will check the correctness of the function (and the equation) at later steps, so for now implement the function and continue with the next step.

In [None]:
def b_gradient(a, b, x_values, y_values):
    # YOUR CODE HERE
    

## Partial derivative of $J(a, b)$ w.r.t. $a$

For this assignment you should work out what the partial derivative of the cost function is with respect to $a$. It will be very similar to the steps you just completed for $b$, and together they will define the complete gradient of the cost function.

Apply the rules of differentiation to the equation below and simplify the result as much as possible. You should label every step with the name of the rule you use to determine the next equation. Continue to apply rules until you can no longer simplify. The first step has already been given as an example:

$$\frac{\partial}{\partial a} J(a, b)$$

Substitute the definition

$$\frac{\partial}{\partial a} \frac{1}{2m}\sum^m_{i=1} ((a x^i + b) - y^i)^2$$

***Continue with the next rule here***



## Second half of the gradient

With this equation you should now be able to compute the $a$ component of the gradient of the cost function at a specific point $a, b$. Fill in the function `a_gradient()` to compute the equation you obtained for this at the previous step. We will check the correctness of the function (and the equation) at later steps, so for now implement the function and continue with the next step.

In [None]:
def a_gradient(a, b, x_values, y_values):
    # YOUR CODE HERE
    


## Approximating the gradient

Instead of this exact computation, we can also just make a numeric approximation of the gradient. For this we can use the old *rate of change* definition:

$$\frac{f(x + \epsilon) - f(x)}{\epsilon}$$

If $\epsilon$ is small enough, then this *difference quotient* should approximate the derivative of the function $f$ at $x$. We can use this same approximation for the partial derivative of the cost function with respect to $a$ and $b$:

$$\frac{\partial}{\partial a} J(a, b) \approx \frac{J(a + \epsilon,\ b) - J(a, b)}{\epsilon}$$

$$\frac{\partial}{\partial b} J(a, b) \approx \frac{J(a,\ b + \epsilon) - J(a, b)}{\epsilon}$$

We can use this numerical approximation to check the analytical gradient; if they both come out to about the same value, then it is much more likely that the analytical gradient is correct (which can be especially tricky to get right for larger models). We can then use the more exact, and faster, analytical gradient when optimizing the model, while having some certainty it is correct.
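To see the difference quotient at work outside of the cost function, here is a minimal sketch on a toy function whose derivative is known exactly. The function `f()` and the helper `difference_quotient()` are made up for this illustration.

```python
def f(x):
    # Toy function with the known derivative f'(x) = 2x
    return x ** 2

def difference_quotient(func, x, eps=10**-6):
    # Rate of change of func over a small step eps starting at x
    return (func(x + eps) - func(x)) / eps

approx = difference_quotient(f, 3.0)
exact = 2 * 3.0

# The error is small but not zero; for this function it is roughly the size of eps
print(abs(approx - exact))
```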

Implement the functions `a_gradient_approx()` and `b_gradient_approx()`, which should approximate the partial derivative of the cost function w.r.t $a$ and $b$ respectively, for a set of `x_values` and `y_values`, at the point $a,b$. The default value for epsilon is set to $\epsilon = 10^{-6}$.


In [None]:
def a_gradient_approx(a, b, x_values, y_values, eps=10**-6):
    # YOUR CODE HERE
    

def b_gradient_approx(a, b, x_values, y_values, eps=10**-6):
    # YOUR CODE HERE
    

## Checking the gradient

Now that we have functions for both the analytical and the numerical gradient, let's write some code to compare them. First, write the function `check_gradient()`, which compares the results of `a_gradient()` and `a_gradient_approx()`, and of `b_gradient()` and `b_gradient_approx()`, for a set of `x_values` and `y_values`, at the point $a,b$. The function should check the absolute difference between each approximation and its analytical counterpart, for which you can use the built-in function [abs](https://docs.python.org/3/library/functions.html#abs). If this absolute difference is greater than some threshold value for either half of the gradient, the function should print out what the approximation error was and return `False`. If all approximations were within the threshold range, the function should return `True`.

This function will only check the gradient at a specific point $a,b$ of the cost surface for these `x_values` and `y_values`. So next, write the function `check_gradient_loop()`, which will try `iterations` different random points $a,b$ and check the gradient for each of them. Use the built-in [random.uniform](https://docs.python.org/3/library/random.html#random.uniform) distribution to generate random values for $a$ between `a_min` and `a_max` and for $b$ between `b_min` and `b_max`. Your function should use `check_gradient()` for this, which will already print an error message if the gradient approximations differ by more than `thres`. If all `iterations` checks pass, the function should print a message saying all gradients seem correct.

Call the `check_gradient_loop()` function at the end of the cell and make sure the analytical gradients seem correct.
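The same idea of comparing an exact derivative to its approximation at random points can be sketched for a single-variable toy function. Everything in this sketch (`cube()`, `cube_derivative()`, `derivative_matches()`, the range of -5 to 5 and the threshold) is an assumption chosen for the illustration; the functions for the assignment work on the cost function and have two parameters instead of one.

```python
import random

def cube(x):
    return x ** 3

def cube_derivative(x):
    # Exact derivative of x^3
    return 3 * x ** 2

def derivative_matches(x, thres, eps=10**-6):
    # Compare the difference quotient with the exact derivative at x
    approx = (cube(x + eps) - cube(x)) / eps
    error = abs(approx - cube_derivative(x))
    if error > thres:
        print("Approximation error", error, "at x =", x)
        return False
    return True

all_ok = True
for _ in range(1000):
    x = random.uniform(-5, 5)  # random point in the chosen range
    if not derivative_matches(x, thres=10**-3):
        all_ok = False
        break

if all_ok:
    print("All derivatives seem correct")
```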

### Sanity check 1

For a large enough value of `thres`, all these checks should pass, even if the gradient is actually incorrect. Conversely, for a small enough value of `thres` some tests will always fail, as the accuracy of the gradient approximation necessarily depends on the size of $\epsilon$. Start with larger values of `thres` and systematically decrease it until you get an approximation error.

#### At what value of thres did you get an approximation error? Does this seem like a reasonable limit for the gradient approximation?

*Your answer goes here.*

In [None]:
import random

def check_gradient(a, b, x_values, y_values, thres):
    # YOUR CODE HERE
    

def check_gradient_loop(a_min, a_max, b_min, b_max, x_values, y_values, thres, iterations=10**5):
    # YOUR CODE HERE
    


# YOUR CODE HERE


## Gradient Descent

If you're convinced that the gradient functions are correct, we can start to actually build *gradient descent*. Obtaining the correct gradient for the cost function can be the tricky part of this algorithm, but with that done, the descent part of the algorithm should be pretty straightforward.

All you really need to do is pick a starting point $a,b$ and repeatedly compute the gradient and move in the *opposite* direction, i.e. the gradient will "point up" and you want to move down. Then you just need a learning rate parameter $\alpha$ to control how big each step down is, and another threshold value to determine if your algorithm has *converged* and you've reached the minimum.

One step of gradient descent for $a$ then just becomes

$$a = a - \alpha \frac{\partial J(a,b)}{\partial a}$$

And for $b$

$$b = b - \alpha \frac{\partial J(a,b)}{\partial b}$$

The *minus* sign ensures the move is in the opposite direction, and $\alpha$ scales the size of each gradient step. Note that these gradient terms should both be computed *before* you actually update the parameters $a$ and $b$, as you want the changes to happen *simultaneously*.
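A single simultaneous update can be sketched on a toy function of two parameters, $g(p, q) = p^2 + pq + q^2$, where each partial derivative depends on both parameters. The function, the names `grad_p()` and `grad_q()`, and the values used are all made up for this illustration.

```python
def grad_p(p, q):
    # Partial derivative of g(p, q) = p^2 + p*q + q^2 with respect to p
    return 2 * p + q

def grad_q(p, q):
    # Partial derivative of g with respect to q
    return p + 2 * q

alpha = 0.25
p, q = 1.0, 1.0

# Compute both gradient terms first, using the old values of p and q
step_p = grad_p(p, q)
step_q = grad_q(p, q)

# Only then update both parameters
p = p - alpha * step_p
q = q - alpha * step_q
print(p, q)  # 0.25 0.25
```

If `p` were updated before computing `grad_q(p, q)`, the second gradient would be evaluated at a mix of new and old values, which is no longer the gradient at the current point.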

Because this cost function is actually *convex*, it really doesn't matter where we start with $a$ and $b$, as all points should eventually lead to the same *global minimum*. We'll get back to what this means exactly, and in what cases you don't have these nice guarantees, in later assignments. For now, it is good to know that the starting point shouldn't matter. We could have the algorithm randomly start in some range, as with `check_gradient_loop()`, but then we'd need minimum and maximum values for both parameters, so the easy solution for now is just to have them both start at 0.

Now we already have all the elements to do one step of gradient descent. The only remaining element is really to just repeat this in a loop and compute the cost at every step. If we know the cost of the current step and the previous step, then we can compute the difference between the two, and see how much progress the algorithm made that step. This gives us two important pieces of information:

1. If the difference becomes small enough, then we know the algorithm has *converged* and we can stop the loop.
2. If the difference ever becomes *negative* (taking the difference as the previous cost minus the current cost), then the cost is increasing instead of decreasing. This means the algorithm is *diverging* and the learning rate must be too large.

If your code ever encounters situation *2*, the function should print an error message to inform the user. For situation *1* we'll also use a threshold value to determine if the difference is small enough, the default value for which is now set to $10^{-6}$. Note that you could also call the function `check_gradient()` at every step of gradient descent, but this would be quite a bit slower. Also, as the function above takes a large random sample of the gradients, we can be pretty confident that this part should work correctly. You may add a call to `check_gradient()` in your loop, but this is not required.
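The two stopping situations can be sketched on a toy problem with a single parameter $w$ and cost $(w - 2)^2$, which has its minimum at $w = 2$. The names `toy_cost()`, `toy_gradient()` and `toy_descent()` are hypothetical, and the `max_steps` limit is an extra safeguard added for this sketch; it is not part of the assignment description.

```python
def toy_cost(w):
    return (w - 2) ** 2

def toy_gradient(w):
    # Derivative of (w - 2)^2 with respect to w
    return 2 * (w - 2)

def toy_descent(alpha, w=0.0, thres=10**-6, max_steps=10000):
    previous_cost = toy_cost(w)
    for _ in range(max_steps):
        w = w - alpha * toy_gradient(w)
        current_cost = toy_cost(w)
        # Progress made this step: positive when the cost went down
        difference = previous_cost - current_cost
        if difference < 0:
            print("Cost increased, alpha is probably too large")
            return w, "diverged"
        if difference < thres:
            return w, "converged"
        previous_cost = current_cost
    return w, "max steps reached"

print(toy_descent(alpha=0.1))     # ends close to w = 2 with status "converged"
print(toy_descent(alpha=1.5)[1])  # diverged
```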

Write the function `gradient_descent()` according to the description above. The function should return the converged parameters $a,b$ when completed.

When your function is done, call the function using the training data to see what your computed best estimates for $a,b$ are. Print the $a$ and $b$ you found, and then print the cost on the training data and finally, also print the cost on the testing data.

### Sanity check 2

For a large enough value of $\alpha$, the cost will always *diverge* at some point. Conversely, if $\alpha$ is very small, then the algorithm will still *converge*, but the large number of steps required means that this will take a long time. Start with a value for $\alpha$ that does *diverge* and systematically decrease it until you get a value that does *converge*.

#### At what value of alpha did your algorithm first converge? Does this seem like a reasonable value for the learning rate to you? Why, or why not?

*Your answer goes here.*

In [None]:
def gradient_descent(x_values, y_values, alpha, a_start=0, b_start=0, thres=10**-6):
    # YOUR CODE HERE
    
    return (a, b)
        

# YOUR CODE HERE


## Plotting the best fit

Copy your code to plot the training data and the estimated line from before. This time, also add the testing data to the plot, making sure to use a different color point for training and testing, so you can distinguish the two.

Then, use the values your `gradient_descent()` found for $a$ and $b$ and use the function `linear_model_list()` to show the complete prediction line for that combination of $a$ and $b$.

#### Does it look like gradient descent solved this prediction problem correctly? Does this match what you expected based on the results you got for the costs of training and testing data? Why, or why not?

*Your answer goes here.*

In [None]:
# YOUR CODE HERE
