# Exam goal
This lab simulates time and momentum data with the goal of predicting force. The force, defined as $F = 5t + 0.002p$ (with $t$ the time and $p$ the momentum), is added as a label. Random noise has also been added to the force values to mimic a real-world scenario. Assume you do not know a priori the proportionality constants that relate time and momentum to force in this specific case, and use linear regression to predict the force from the available data.

The goal of the exercise is to show that feature scaling helps improve the performance of a regression model. Since the course has introduced z-score normalization, you should use it to show that a model's predictions on the test data improve when the model is trained and tested on normalized features.

In the cell below, the dataset is generated and then split into separate training and testing samples, which you have to use throughout the exercise.

In [16]:
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Generate synthetic data for time, momentum, and force
np.random.seed(42)
num_samples = 100
time = np.random.uniform(0, 1, num_samples)
momentum = np.random.uniform(0, 1000, num_samples)
force = 5 * time + 0.002 * momentum + np.random.normal(0, 1, num_samples)

# Split the dataset into training (X_train, y_train) and testing (X_test, y_test) sets
X_ = np.column_stack((time, momentum))
y_ = force
X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.8, random_state=42)

In the cell below, implement the cost function for linear regression, including a regularization term controlled by lambda_.
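
For reference, one common form of the regularized squared-error cost is written out here. The $\frac{1}{2m}$ prefactors and the choice to leave $b$ out of the penalty are conventions assumed for this sketch; the exercise text does not fix them, so adapt the expression to the convention used in your course.

$$
J(\mathbf{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\mathbf{w}\cdot\mathbf{x}_i + b - y_i\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2
$$

Here $m$ is the number of examples, $n$ the number of features, $\mathbf{x}_i$ the feature vector of example $i$, and $\lambda$ corresponds to the lambda_ argument.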

In [17]:
def compute_cost(X, y, w, b, lambda_ = 1):
    """
    Computes the cost function for linear regression over all examples
    Args:
      X (ndarray (m,n)): data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """

In the cell below, implement the gradient computation for the linear regression model you defined above. In this simple exercise, you don't need to include a regularization term in the gradient computation.
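
As an illustration of the shapes involved, here is a minimal vectorized sketch of an unregularized gradient. It assumes the $\frac{1}{2m}$ squared-error convention, and the name compute_gradient_sketch and the toy data are hypothetical, not part of the lab. It is one possible way to write the computation, not the expected solution.

```python
import numpy as np

def compute_gradient_sketch(X, y, w, b):
    """Gradient of the unregularized squared-error cost (1/(2m) convention assumed)."""
    m = X.shape[0]
    err = X @ w + b - y        # residuals, shape (m,)
    dj_dw = X.T @ err / m      # gradient w.r.t. w, shape (n,)
    dj_db = err.sum() / m      # gradient w.r.t. b, scalar
    # Returned as (dj_db, dj_dw) to match how gradient_descent unpacks the result
    return dj_db, dj_dw

# Toy check on made-up data: y = 2 * x exactly, starting from w = 0, b = 0
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])
dj_db_demo, dj_dw_demo = compute_gradient_sketch(X_demo, y_demo, np.zeros(1), 0.0)
print(dj_db_demo, dj_dw_demo)
```

Both gradients come out negative on this toy data, so a gradient descent step would increase w and b, moving the prediction toward the targets.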

In [None]:
def compute_gradient(X, y, w, b): 
    """
    Computes the gradient for linear regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
    Returns:
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
    """

In the cell below, the gradient descent algorithm is provided to you. Make sure your implementations of the cost and gradient functions are compatible with the code below; in particular, it unpacks the output of compute_gradient as dj_db, dj_dw, in that order.

In [None]:
import copy, math
def gradient_descent(X, y, w_in, b_in, alpha, num_iters, lambda_):
    """
    Performs batch gradient descent
    
    Args:
      X (ndarray (m,n))  : Data, m examples with n features
      y (ndarray (m,))   : target values
      w_in (ndarray (n,)): Initial values of model parameters  
      b_in (scalar)      : Initial values of model parameter
      alpha (float)      : Learning rate
      num_iters (scalar) : number of iterations to run gradient descent
      lambda_ (scalar)   : regularization parameter for the cost function
      
    Returns:
      w (ndarray (n,))   : Updated values of parameters
      b (scalar)         : Updated value of parameter 
      J_history (list)   : Cost recorded at each iteration
    """
    J_history = []
    w = copy.deepcopy(w_in)
    b = b_in
    
    for i in range(num_iters):
        # Calculate the gradient at the current parameters
        dj_db, dj_dw = compute_gradient(X, y, w, b)

        # Update Parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
      
        # Save cost J at each iteration (history capped at 100000 entries)
        if i < 100000:
            J_history.append( compute_cost(X, y, w, b, lambda_) )

        # Print the cost at 10 evenly spaced intervals, or at every iteration if num_iters < 10
        if i% math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]}   ")
        
    return w, b, J_history

In the cell below, initialize the model parameters appropriately and call the gradient descent function for 1000 iterations. The regularization parameter is fixed to lambda_ = 0.1 and given to you, while you have to find a value of the learning rate alpha_ between $10^{-8}$ and $10^{-4}$ that ensures the gradient descent algorithm converges.
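
The narrow range for alpha_ can be motivated with a standard result: for a quadratic cost, fixed-step gradient descent diverges once the learning rate exceeds $2/\lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of the cost's Hessian. The sketch below estimates that bound for the raw features. It assumes the $\frac{1}{2m}$ squared-error cost with no regularization in the gradient, and for simplicity it uses all 100 simulated samples rather than the training split, so the exact numbers for X_train will differ somewhat.

```python
import numpy as np

# Recreate the raw features with the same seed as the data-generation cell
np.random.seed(42)
num_samples = 100
time = np.random.uniform(0, 1, num_samples)
momentum = np.random.uniform(0, 1000, num_samples)

# Append a column of ones so the bias b is treated like a third parameter
A = np.column_stack((time, momentum, np.ones(num_samples)))

# Hessian of the unregularized cost (1 / (2m)) * sum((A @ theta - y) ** 2)
H = A.T @ A / num_samples
lam_max = np.linalg.eigvalsh(H).max()

# Fixed-step gradient descent on this quadratic diverges for alpha > 2 / lam_max
alpha_bound = 2.0 / lam_max
print(f"largest Hessian eigenvalue: {lam_max:.3e}")
print(f"learning-rate upper bound : {alpha_bound:.3e}")
```

Because momentum values reach up to 1000, the largest eigenvalue is dominated by the mean of the squared momentum, which pushes the usable learning rate down by several orders of magnitude.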

In [None]:
# training and testing prior to feature scaling
w_tmp  = 
b_tmp  = 
# find an alpha_ value between 10^-8 and 10^-4 that ensures convergence for the gradient descent
alpha_ = 
# you do not need to change lambda_ or iters
lambda_ = 0.1
iters = 1000

# call the gradient descent function here below on the training dataset, using the arguments you have just initialized

Complete the cell below so that "y_pred" stores the model predictions for the testing dataset. 

In [None]:
y_pred = 

# Mean Squared Error
The mean squared error (MSE) can be used to evaluate the performance of a regression model. It is defined as $\frac{1}{m}\sum_{i=1}^{m}(\mathrm{pred}_i-y_i)^2$, where $m$ is the total number of samples, $\mathrm{pred}_i$ is the model prediction for an individual sample $i$, and $y_i$ is the actual label of that sample. Scikit-learn already provides a function for it, which can be imported as:
```
from sklearn.metrics import mean_squared_error
```
Run the cell below, which uses "y_pred" that you computed above, to obtain the MSE for your trained model. 
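
As a cross-check on the definition, the same quantity can be computed directly with NumPy. The sketch below uses a hypothetical helper, mse_by_hand, and made-up toy values rather than the lab's y_test and y_pred; its result should agree with mean_squared_error called with default arguments on the same inputs.

```python
import numpy as np

def mse_by_hand(y_true, y_hat):
    """Mean of the squared differences between predictions and labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y_hat - y_true) ** 2)

# Toy values (not from the lab data): the three errors are 0, 0 and 2
mse_demo = mse_by_hand([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(mse_demo)
```

A single large error dominates the result because each difference is squared before averaging.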

In [None]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (predictions against test data): {mse}")

In the cell below, define a function that implements the $z$-score normalization for the model features.

In [21]:
def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    
    Args:
      X (ndarray (m,n))     : input data, m examples, n features
      
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
    """


In the cell below, use the $z$-score normalization you implemented above to obtain the scaled training ("X_train_norm") and testing ("X_test_norm") datasets.
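
One point worth some care: test features are commonly scaled with the mean and standard deviation computed on the training set, so that both sets pass through the same transformation. The lab does not state whether it expects this or a separate normalization of each set. The sketch below illustrates the shared-statistics variant with two hypothetical helpers, zscore_fit and zscore_apply, and made-up toy arrays; none of these names are part of the lab.

```python
import numpy as np

def zscore_fit(X):
    """Return the per-column mean and standard deviation of X (hypothetical helper)."""
    return X.mean(axis=0), X.std(axis=0)

def zscore_apply(X, mu, sigma):
    """Scale X column by column using previously computed statistics."""
    return (X - mu) / sigma

# Toy stand-ins for the lab's X_train / X_test (values are made up)
X_train_demo = np.array([[0.1, 100.0], [0.5, 500.0], [0.9, 900.0]])
X_test_demo = np.array([[0.3, 700.0]])

mu, sigma = zscore_fit(X_train_demo)                      # statistics from training data only
X_train_demo_norm = zscore_apply(X_train_demo, mu, sigma)
X_test_demo_norm = zscore_apply(X_test_demo, mu, sigma)   # reuse the training statistics
print(X_train_demo_norm)
print(X_test_demo_norm)
```

After this transformation the training columns have zero mean and unit standard deviation, while the test columns are only approximately standardized, which is expected.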

In [None]:
X_train_norm = 
X_test_norm = 

In the cell below, call the gradient descent function for 1000 iterations using the scaled training dataset ("X_train_norm"). The regularization parameter is fixed to lambda_ = 0.1 and given to you. **The learning rate has also been changed to alpha_ = 0.01 and is given to you**: you don't need to change its value. This is because feature scaling helps convergence and allows you to use a larger learning rate.
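
The effect of scaling on convergence can be made concrete by comparing the Hessian of the squared-error cost before and after $z$-score normalization. The sketch below is an illustration under stated assumptions: the $\frac{1}{2m}$ cost convention, no regularization, a hypothetical helper hessian_with_bias, and all 100 simulated samples instead of the training split.

```python
import numpy as np

def hessian_with_bias(X):
    """Hessian of the unregularized squared-error cost, bias included as a ones column."""
    A = np.column_stack((X, np.ones(X.shape[0])))
    return A.T @ A / X.shape[0]

# Same seed and feature ranges as the data-generation cell
np.random.seed(42)
time = np.random.uniform(0, 1, 100)
momentum = np.random.uniform(0, 1000, 100)
X_raw = np.column_stack((time, momentum))
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # z-score per column

results = {}
for name, X in (("raw", X_raw), ("scaled", X_scaled)):
    eig = np.linalg.eigvalsh(hessian_with_bias(X))
    results[name] = (eig.max(), eig.max() / eig.min())
    print(f"{name:>6}: largest eigenvalue {results[name][0]:.3e}, "
          f"condition number {results[name][1]:.3e}")
```

With standardized columns the largest eigenvalue cannot exceed 2, so a learning rate of 0.01 sits well inside the stable range, and the much smaller condition number means all parameters converge at comparable speeds.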

In [None]:
# training and testing after feature scaling
w_tmp  = 
b_tmp  = 
# you don't need to change alpha_, lambda_, or iters
alpha_ = 0.01 
lambda_ = 0.1
iters = 1000

# call the gradient descent function here below on the training dataset, using scaled features and the arguments you have just initialized

Complete the cell below so that "y_pred_xnorm" stores the model predictions for the testing dataset using the **scaled** features you have previously computed. 

In [None]:
y_pred_xnorm = 

Finally, run the cell below to obtain the MSE for your model after feature scaling. You should obtain a value noticeably smaller than the one you obtained previously by training the model on the original features, before $z$-score normalization.

In [None]:
mse_norm = mean_squared_error(y_test, y_pred_xnorm)
print(f"Mean Squared Error (predictions against test data, scaled features): {mse_norm}")