# Machine Learning Programming Exercise 1

## 1 Write a function to generate an m+1 dimensional data set, of size n, consisting of m continuous independent variables (X) and one dependent variable (Y) defined as

$y_i = x_i\beta + e$  

where,

- e is Gaussian noise with mean 0 and standard deviation σ, representing the unexplained variation in Y
- β is a random vector of dimensionality m + 1, representing the coefficients of the linear relationship between X and Y, and  
- $\forall i \in [1, n], x_{i0} = 1$

The function should take the following parameters:  

- σ: The spread of noise in the output variable
- n: The size of the data set   
- m: The number of independent variables

Output from the function should be:  

- X: An n × (m + 1) numpy array of independent variable values (with a 1 in the first column)  
- Y: The n × 1 numpy array of output values
- β: The random coefficients used to generate Y from X



In [None]:
import numpy as np

def generate_dataset(datasetSize, no_of_ind_var, sigma):
    """Generate a synthetic linear-regression data set.

    Args:
        datasetSize (int): Number of samples (n).
        no_of_ind_var (int): Number of independent variables (m).
        sigma (float or int): Standard deviation of the Gaussian error term.

    Returns:
        X (2D array): Shape (datasetSize, no_of_ind_var + 1); the first column is all ones.
        Y (2D array): Shape (datasetSize, 1).
        beta (2D array): Shape (no_of_ind_var + 1, 1).
    """
    # Gaussian noise with mean 0 and standard deviation sigma
    e = np.random.normal(0, sigma, (datasetSize, 1))
    # Random coefficients, including the intercept
    beta = np.random.rand(no_of_ind_var + 1, 1)

    # Independent variables drawn uniformly from [0, 1), preceded by a column of ones
    X = np.random.rand(datasetSize, no_of_ind_var)
    X = np.hstack((np.ones((datasetSize, 1)), X))

    Y = X @ beta + e
    return X, Y, beta

print(generate_dataset(10, 3, 1))
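
Because the generator above draws from NumPy's global random state, each call returns different data. A minimal sketch of a reproducible variant follows. The name `generate_dataset_seeded` and the `seed` parameter are hypothetical additions, and the argument order (σ, n, m) follows the parameter list in the exercise text, not the cell above.

```python
import numpy as np

def generate_dataset_seeded(sigma, n, m, seed=None):
    """Hypothetical seeded variant; argument order follows the exercise (sigma, n, m)."""
    rng = np.random.default_rng(seed)  # local generator, leaves global state untouched
    beta = rng.random((m + 1, 1))      # random coefficients, intercept first
    X = np.hstack((np.ones((n, 1)), rng.random((n, m))))  # x_i0 = 1 for every row
    e = rng.normal(0.0, sigma, size=(n, 1))  # Gaussian noise, mean 0, std sigma
    Y = X @ beta + e
    return X, Y, beta

X, Y, beta = generate_dataset_seeded(sigma=0.5, n=10, m=3, seed=0)
print(X.shape, Y.shape, beta.shape)  # (10, 4) (10, 1) (4, 1)
```

Setting `sigma=0` gives noise-free data, which is a convenient way to check that `Y` equals `X @ beta` exactly.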
    

## 2 Write a function that learns the parameters of a linear regression line given inputs  

- X: An n × m numpy array of independent variable values  
- Y: The n × 1 numpy array of output values
- k: the number of iterations (epochs)   
- τ: the threshold on change in Cost function value from the previous to current iteration

The function should implement the Gradient Descent algorithm as discussed in class: it initialises β with random values and then updates them in each iteration by moving in the direction defined by the partial derivative of the cost function with respect to each coefficient. The function should use only one loop, which ends after a fixed number of iterations (k) or once the change in cost function value falls below the threshold (τ).  

The output should be an m + 1 dimensional vector of coefficients and the final cost function value.  
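
For reference, one common formulation of the quantities described above is sketched here. It assumes a squared-error cost scaled by 1/(2n) and a learning rate α. Neither is fixed by the exercise text, so treat both as assumptions.

$$J(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i\beta\right)^2$$

$$\frac{\partial J}{\partial \beta_j} = -\frac{1}{n}\sum_{i=1}^{n} x_{ij}\left(y_i - x_i\beta\right), \qquad \nabla_\beta J = -\frac{1}{n}X^\top\left(Y - X\beta\right)$$

$$\beta \leftarrow \beta - \alpha\,\nabla_\beta J$$

Iteration stops after k updates or once $|J_{\text{prev}} - J_{\text{curr}}| < \tau$, whichever comes first.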






In [None]:
import numpy as np

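def linear_regress(X, Y, k, tau, alpha=0.1):
    """Fit linear-regression coefficients by batch gradient descent.

    X is the n x m array of independent variables WITHOUT a column of ones
    (the bias column is added here). k is the maximum number of iterations and
    tau is the threshold on the change in cost between iterations. alpha is the
    learning rate; its default of 0.1 is an assumption, not part of the exercise.
    """
    n, m = X.shape
    X = np.hstack((np.ones((n, 1)), X))  # Add the bias column
    beta = np.random.rand(m + 1, 1)  # Initialise coefficients randomly
    cost = float("inf")  # Cost from the previous iteration
    for _ in range(k):
        error = Y - X @ beta
        cost_new = (1 / (2 * n)) * np.sum(error ** 2)
        if np.abs(cost - cost_new) < tau:
            break
        cost = cost_new
        gradient = -(1 / n) * X.T @ error  # Partial derivatives of the cost w.r.t. beta
        beta = beta - alpha * gradient

    # Report the cost of the final coefficients
    cost = (1 / (2 * n)) * np.sum((Y - X @ beta) ** 2)
    return beta.T, cost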
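
One way to sanity-check the gradient descent result is to compare it with a closed-form least-squares fit. The sketch below uses `np.linalg.lstsq`. The helper name `least_squares_reference`, the coefficients in `true_beta`, and the noise level are all made up for illustration.

```python
import numpy as np

def least_squares_reference(X, Y):
    """Hypothetical helper: closed-form fit to compare against gradient descent.

    X is the n x m array WITHOUT the column of ones; it is added here.
    """
    n = X.shape[0]
    X1 = np.hstack((np.ones((n, 1)), X))            # add the bias column
    beta, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # minimises ||X1 @ beta - Y||^2
    cost = np.sum((Y - X1 @ beta) ** 2) / (2 * n)   # same 1/(2n) cost as above
    return beta.T, cost

# Small synthetic check with made-up coefficients and noise level
rng = np.random.default_rng(0)
X = rng.random((200, 3))
true_beta = np.array([[1.0], [2.0], [-1.0], [0.5]])
Y = np.hstack((np.ones((200, 1)), X)) @ true_beta + rng.normal(0, 0.1, (200, 1))

beta_ref, cost_ref = least_squares_reference(X, Y)
print(beta_ref, cost_ref)
```

With enough iterations and a suitable step size, the coefficients returned by `linear_regress(X, Y, k, tau)` should approach `beta_ref`, and its cost should approach `cost_ref`.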
