# Linear regression

## Theory

Linear regression fits a straight line (one feature), a plane (two features), or a hyperplane (more than two features) to the data. The hypothesis is a linear combination of the features $\left(\pmb{x}_i\right)$, weighted by the parameters $\pmb{\theta}$.

$$
h_{\pmb{\theta}} \left( \pmb{x}_i \right) \coloneqq \pmb{\theta}^T \pmb{x}_i
$$
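
As a small illustration, the hypothesis can be evaluated for one example or for all examples at once with a matrix product. The values of `theta` and `X` below are made up for this sketch. The leading column of ones in `X` is an assumed convention that lets $\theta^{(0)}$ act as the intercept.

```python
import numpy as np

# Hypothetical parameters: intercept 1.0 and slope 2.0
theta = np.array([1.0, 2.0])

# Three made-up examples, one per row; the first column is the constant feature 1
X = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 2.0],
])

# h_theta(x_i) = theta^T x_i for a single example (here the second row)
h_single = theta @ X[1]

# The same hypothesis evaluated for every example at once
h_all = X @ theta
```

With these numbers the predictions lie on the line $y = 1 + 2x$.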

A randomly chosen $\pmb{\theta}$ will most likely produce predictions that are far from the true values $y_i$; this difference is the error. The goal of linear regression is to find the parameters that minimize this error. To do so, a cost function must first be chosen that defines how the error is computed. A commonly used cost function in linear regression is the mean squared error (MSE), where $N$ is the number of training examples, $\pmb{X}$ is the matrix whose rows are the training examples, and $\pmb{y}$ is the vector of true values:

$$ 
J \left(\pmb{\theta}\right) \coloneqq \frac{1}{N} \sum_{i=1}^N \left( h_{\pmb{\theta}} \left( \pmb{x}_i \right) - y_i \right )^2 = \frac{1}{N} \left(\pmb{X\theta} - \pmb{y}\right)^T \left(\pmb{X\theta} - \pmb{y}\right)
$$
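
The two forms of the cost function can be checked against each other numerically. The following is a minimal sketch with made-up data; the function names `mse_sum` and `mse_matrix` are hypothetical and chosen only for this example.

```python
import numpy as np

def mse_sum(theta, X, y):
    """MSE written as an explicit sum over the training examples."""
    n = len(y)
    return sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y)) / n

def mse_matrix(theta, X, y):
    """MSE written in matrix form: (X theta - y)^T (X theta - y) / N."""
    residual = X @ theta - y
    return residual @ residual / len(y)

# Made-up data that satisfies y = 1 + 2x exactly
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# With theta = 0 every prediction is 0, so the cost is (1 + 9 + 25) / 3
cost_zero_theta = mse_matrix(np.zeros(2), X, y)

# With the parameters that generated the data the cost is zero
cost_true_theta = mse_matrix(np.array([1.0, 2.0]), X, y)
```

Both functions return the same value for any `theta`; the matrix form is usually preferred in NumPy because it avoids a Python-level loop.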

The cost function can be minimized numerically with the gradient descent algorithm. Gradient descent is iterative: the model parameters are updated once per epoch, where one epoch is a single pass of every training example through the hypothesis. Each parameter $\theta^{(j)}$ is updated by subtracting the partial derivative of the cost function with respect to that parameter, scaled by the learning rate $\alpha$.

$$
\theta^{(j)} \leftarrow \theta^{(j)} - \alpha \frac{\partial J \left(\pmb{\theta}\right)}{\partial \theta^{(j)}}
$$
$$
\frac{\partial J \left(\pmb{\theta}\right)}{\partial \theta^{(j)}} = \frac{2}{N} \sum_{i=1}^N \left( h_{\pmb{\theta}} \left( \pmb{x}_i \right) - y_i \right ) x_i^{(j)}
$$
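
Stacking the partial derivatives for all $j$ gives the gradient $\frac{2}{N} \pmb{X}^T \left(\pmb{X\theta} - \pmb{y}\right)$. The sketch below computes this vectorized gradient, compares it with a finite-difference approximation, and applies the update rule a few times. The data, the learning rate `alpha = 0.1`, the number of steps, and all function names are assumptions made for illustration.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error in matrix form."""
    residual = X @ theta - y
    return residual @ residual / len(y)

def mse_gradient(theta, X, y):
    """Vectorized gradient of the MSE: (2/N) X^T (X theta - y)."""
    return 2.0 / len(y) * X.T @ (X @ theta - y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (mse(theta + step, X, y) - mse(theta - step, X, y)) / (2 * eps)
    return grad

# Made-up data that satisfies y = 1 + 2x exactly
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = np.zeros(2)
alpha = 0.1  # hypothetical learning rate, small enough for this data

# Apply the update rule to all parameters simultaneously and track the cost
costs = [mse(theta, X, y)]
for _ in range(50):
    theta = theta - alpha * mse_gradient(theta, X, y)
    costs.append(mse(theta, X, y))
```

For this data and learning rate the recorded cost decreases with every step. A learning rate that is too large for the data makes the cost grow instead, so it has to be chosen with the scale of the features in mind.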

## Implementation

In [7]:
import numpy as np
import plotly.graph_objects as go
import random

In [16]:
def create_dataset(datapoints, variance, correlation=None, step=2):
    """
    Create a random dataset

    :param datapoints: number of datapoints
    :param variance: maximum absolute value of the integer noise added to each y value
    :param correlation: either 'pos' or 'neg' (default is None)
    :param step: determines the slope of the correlation (default is 2)
    """
    val = 1
    points = []

    for x in range(datapoints):
        # randint includes both endpoints, so the noise is symmetric around 0
        y = val + random.randint(-variance, variance)
        points.append((1, x, y))
        if correlation == "pos":
            val += step
        elif correlation == "neg":
            val -= step

    # A float array keeps the NumPy operations below vectorized
    return np.asarray(points, dtype=float)

In [34]:
def compute_MSE(theta, points):
    """
    Compute the mean squared error (MSE)

    :param theta: regression parameters
    :param points: datapoints
    """
    
    # Make a matrix where every row is a training example
    X = points[:, :-1]
    # Make a vector of all true values
    y = points[:, -1]

    return np.mean((X@theta - y)**2)

In [29]:
def gradient_descent(
    points, learning_rate, num_iterations, threshold=1e-3
):
    """
    Use gradient descent to optimize regression parameters theta in order to find the best straight 
    line for the given points

    :param points: points to fit the line
    :param learning_rate: learning rate used in the algorithm
    :param num_iterations: maximum number of iterations
    :param threshold: minimum difference between two sequential mean squared 
        error values (default is 1e-3)
    """

    # Init values
    theta = np.zeros(len(points[0])-1)
    m = len(points)
    iteration = 0

    J = compute_MSE(theta, points)
    prev_J = np.inf

    # Loop until convergence or maximum number of iterations is reached
    while iteration < num_iterations and abs(J - prev_J) > threshold:
        
        new_thetas = np.zeros(len(theta))
        prev_J = J

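        # Compute the new thetas from the gradient (2/m) * sum((h(x) - y) * x_j),
        # using the old theta for every component (simultaneous update)
        for j, theta_j in enumerate(theta):
            new_thetas[j] = theta_j - 2 * learning_rate/m * np.sum([(theta @ x - y)*x[j] for *x, y in points])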

        theta = new_thetas
        # Compute new MSE
        J = compute_MSE(theta, points)
        iteration += 1

    return theta

In [32]:
def fit(points):
    """
    Fit a straight line to the given points using gradient descent and plot it

    :param points: datapoints to fit a straight line
    """

    # hyperparameters
    learning_rate = 0.001

    # Optimize theta (intercept and slope) using gradient descent
    theta = gradient_descent(
        points, learning_rate, num_iterations=1000
    )

    # Plot the data and the regression line
    xs, ys = points[:, :-1], points[:, -1]
    regression_line = [theta @ x for x in xs]
    
    fig = go.Figure()

    fig.add_traces([
        go.Scatter(x=[x[1] for x in xs], y=ys, mode='markers'),
        go.Scatter(x=[x[1] for x in xs], y=regression_line, mode='lines', name=f"{theta[1]:.3f}x + {theta[0]:.3f}")
    ])

    fig.update_layout(autosize=False, width=700, height=500, margin=dict(l=60, r=50, b=70, t=30),
        xaxis_title='x', yaxis_title='y')

    fig.show()

In [35]:
points = create_dataset(datapoints=40, variance=1, correlation='neg')
fit(points)

In [56]:
points = create_dataset(datapoints=40, variance=1, correlation='pos')
fit(points)

In [57]:
points = create_dataset(datapoints=40, variance=10, correlation='neg')
fit(points)

In [59]:
points = create_dataset(datapoints=40, variance=10)
fit(points)

Source: https://www.coursera.org/specializations/machine-learning-introduction