# Linear regression from scratch
The goal of this exercise is to implement a simple linear regression between two 1-dimensional variables from scratch using numpy.

Let's say that we want to predict the values of $n$ points $y = \{y_i\}_i$ from the values of $x = \{x_i\}_i$ with a simple linear equation:

\\[ \hat{y_i} = b_1 \cdot x_i + b_0 \\]

Where $\hat{y_i}$ is the predicted value of $y_i$ knowing $x_i$, $b_1$ is called the slope and $b_0$ the intercept.

The coefficients of the simple linear regression between $x$ and $y$ that minimize the mean squared error between $y$ and $\hat{y}$ are given by:


\\[ b_1 = \frac{cov_{xy}}{var_{x}}\\]
and 
\\[ b_0 = \bar{y} - b_1 \cdot \bar{x} \\]

With:
- $\bar{y} = \sum_{i=1}^n{y_i} / n$ the empirical mean of $y$
- $\bar{x} = \sum_{i=1}^n{x_i} / n$ the empirical mean of $x$
- $cov_{xy} = \sum_{i=1}^n{(x_i -\bar{x})(y_i -\bar{y})} / n$, the empirical covariance between $x$ and $y$
- $var_{x} = \sum_{i=1}^n{(x_i -\bar{x})^2} / n$, the empirical variance of $x$
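
As a quick illustration of these formulas, here is a minimal worked example on four made-up points that lie exactly on the line $y = 2x + 1$. The names `x_toy` and `y_toy` are hypothetical and are not part of `helpers`; the recovered coefficients should be the slope 2 and the intercept 1.

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1 (values made up for illustration)
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])
n = len(x_toy)

# Empirical means
x_bar = x_toy.sum() / n  # 2.5
y_bar = y_toy.sum() / n  # 6.0

# Empirical covariance and variance (both divided by n)
cov_xy = ((x_toy - x_bar) * (y_toy - y_bar)).sum() / n  # 2.5
var_x = ((x_toy - x_bar) ** 2).sum() / n                # 1.25

# Regression coefficients
b_1 = cov_xy / var_x       # slope
b_0 = y_bar - b_1 * x_bar  # intercept
print(b_1, b_0)  # 2.0 1.0
```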


## Helper functions
A few helper functions for data loading and visualization are available in helpers.py:

In [None]:
import helpers
# Test the import by running a dummy function
helpers.print_hello()

In [None]:
x = helpers.data_linear_regression[:, 0]
y = helpers.data_linear_regression[:, 1]

helpers.data_linear_regression  # The data we will be using here

In [None]:
fig, ax = helpers.plot_linear_regression(x, y, slope=1., intercept=0.)

## Implementation
Fill in the following functions to obtain your implementation of a simple linear regression **without using loops**. You can define additional functions if you need to:

### The very lazy solution (which is what you would do in real life)
You could directly use [scipy's](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html) (or [scikit-learn's](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)) linear regression tool:
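
For reference, the scikit-learn route could look like the sketch below. It assumes scikit-learn is installed and uses synthetic, noise-free data (`x_demo`, `y_demo` are hypothetical names) rather than the exercise data. Note that `LinearRegression.fit` expects a 2-dimensional input of shape `(n_samples, n_features)`, hence the `reshape`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic noise-free data on the line y = 2x + 1 (for illustration only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo = 2.0 * x_demo + 1.0

# scikit-learn expects a 2-D feature array, so reshape x to (n_samples, 1)
model = LinearRegression()
model.fit(x_demo.reshape(-1, 1), y_demo)

slope_sk = model.coef_[0]        # one coefficient per feature
intercept_sk = model.intercept_  # scalar intercept
print(slope_sk, intercept_sk)
```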

In [None]:
import scipy.stats
# Compute the coefficients
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x,y)
# Visualization of the output
fig, ax = helpers.plot_linear_regression(x, y, slope=slope, intercept=intercept)

### The semi lazy solution
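The covariance matrix between x and y can be computed directly using [np.cov](https://numpy.org/doc/stable/reference/generated/numpy.cov.html). By default `np.cov` normalizes by $n - 1$ rather than $n$, but this factor cancels in the ratio $cov_{xy} / var_{x}$, so the slope is unchanged.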
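
A small check of the normalization, on synthetic data (`x_demo` and `y_demo` are hypothetical names, not the exercise data): `np.cov` divides by $n - 1$ by default and by $n$ when `bias=True`, and the sketch below compares the slope obtained with each convention.

```python
import numpy as np

# Synthetic noisy data around the line y = 3x (for illustration only)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=100)
y_demo = 3.0 * x_demo + rng.normal(size=100)

cov_unbiased = np.cov(x_demo, y_demo)           # default: divides by n - 1
cov_biased = np.cov(x_demo, y_demo, bias=True)  # divides by n

# The normalization factor cancels in the ratio cov_xy / var_x
slope_unbiased = cov_unbiased[0, 1] / cov_unbiased[0, 0]
slope_biased = cov_biased[0, 1] / cov_biased[0, 0]
print(np.isclose(slope_unbiased, slope_biased))  # True
```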

In [None]:
import numpy as np

def compute_slope(x, y):
    "Return the slope for a 1-dimensional linear regression"
    cov_matrix = np.cov(x, y)  # covariance matrix for x and y
    slope = cov_matrix[0, 1] / cov_matrix[0, 0]
    return slope
    

def compute_linear_regression(x, y):
    "Return the slope and the intercept for a 1-dimensional linear regression"
    slope = compute_slope(x, y)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

In [None]:
# Visualize the output:
slope, intercept = compute_linear_regression(x, y)

fig, ax = helpers.plot_linear_regression(x, y, slope=slope, intercept=intercept)

### The solution for the bravest
We can also compute the coefficients using only the basic operations available in numpy:

In [None]:
import numpy as np

def mean1D(arr):
    "Return the mean of a 1-Dimensional array"
    return arr.sum() / len(arr)


def cov1D(x, y):
    """Compute the covariance between two 1-dimensional arrays.

    N.B. cov1D(x, x) returns the variance of x.
    """
    # Precomputation of the mean of the inputs
    x_mean = mean1D(x)
    y_mean = mean1D(y)
    # Compute the covariance using its definition
    return mean1D((x - x_mean) * (y - y_mean))


def compute_slope(x, y):
    "Return the slope for a 1-dimensional linear regression"
    slope = cov1D(x, y) / cov1D(x, x)
    return slope
    

def compute_linear_regression(x, y):
    "Return the slope and the intercept for a 1-dimensional linear regression"
    slope = compute_slope(x, y)
    intercept = mean1D(y) - slope * mean1D(x)
    return slope, intercept

In [None]:
# Visualize the output:
slope, intercept = compute_linear_regression(x, y)

fig, ax = helpers.plot_linear_regression(x, y, slope=slope, intercept=intercept)

## To go further
Additional tasks if you want to go further:
- Compute the mean squared error between the true values of y and the predicted values
- Generalize the code to use a multidimensional x input (see the formula [here](https://www.hackerearth.com/practice/machine-learning/linear-regression/multivariate-linear-regression-1/tutorial/))
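
One possible starting point for both tasks is sketched below. It is only a sketch under stated assumptions: the function names `mean_squared_error` and `fit_multivariate` are hypothetical, the data is synthetic, and the multidimensional fit relies on `np.linalg.lstsq` instead of deriving the closed-form solution by hand.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Return the mean of the squared residuals between two 1-dimensional arrays."""
    return np.mean((y_true - y_pred) ** 2)

def fit_multivariate(X, y):
    """Least-squares fit of y ~ X @ coefs + intercept for X of shape (n, p)."""
    # Prepend a column of ones so the intercept is estimated with the other coefficients
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:], beta[0]

# Synthetic noise-free data with two input features (for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.5

coefs, intercept_demo = fit_multivariate(X_demo, y_demo)
y_pred_demo = X_demo @ coefs + intercept_demo
mse_demo = mean_squared_error(y_demo, y_pred_demo)
print(coefs, intercept_demo, mse_demo)
```

Since the synthetic data contains no noise, the recovered coefficients should match the ones used to generate it and the mean squared error should be numerically zero. With a single input column, the same function reduces to the 1-dimensional regression above.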