The patient IDs were removed from this version of the data, leaving 384 input features, which were put in each of the ```X_...``` arrays. The corresponding CT scan slice locations have been put in the ```y_...``` arrays. We shifted and scaled the ```y_...``` location values for the version of the data that you are using. The shift and scaling were chosen to make the training locations have zero mean and unit variance. The first 73 patients were put in the ```_train``` arrays, the next 12 in the ```_val``` arrays, and the final 12 in the ```_test``` arrays. Please use this training, validation, and test split as given. **Do not shuffle the data further in this assignment.**

## Task 1: Get Started

In [None]:
import numpy as np
data = np.load('ct_data.npz')
X_train = data['X_train']; X_val = data['X_val']; X_test = data['X_test']
y_train = data['y_train']; y_val = data['y_val']; y_test = data['y_test']

Verify that (up to numerical rounding errors) the mean of the training positions in ```y_train``` is zero. The mean of the 5,785 positions in the ```y_val``` array is not zero. Report its mean with a “standard error”, temporarily assuming that each entry is independent. For comparison, also report the mean, with a standard error, of the first 5,785 entries in ```y_train```. Explain how your results demonstrate that these standard error bars do not reliably indicate what the average of locations in future CT slice data will be. Why are standard error bars misleading here?

In [None]:
# Calculate mean and standard error for y_train
y_train_mean = np.mean(y_train)
y_train_std_error = np.std(y_train, ddof=1) / np.sqrt(len(y_train))

# Calculate mean and standard error for the first 5,785 entries in y_train
y_train_sample_mean = np.mean(y_train[:5785])
y_train_sample_std_error = np.std(y_train[:5785], ddof=1) / np.sqrt(5785)

# Calculate mean and standard error for y_val
y_val_mean = np.mean(y_val)
y_val_std_error = np.std(y_val, ddof=1) / np.sqrt(len(y_val))

y_train_mean, y_train_std_error, y_train_sample_mean, y_train_sample_std_error, y_val_mean, y_val_std_error
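
The question above turns on the independence assumption behind the standard error. As a rough illustration only, the sketch below builds synthetic data (not the CT data) in which values come in groups that share an offset, loosely mimicking slices that come from the same patient. The helper name `mean_and_standard_error`, the group sizes, and the noise levels are all assumptions made for this example.

```python
import numpy as np

def mean_and_standard_error(values):
    """Return the sample mean and its standard error, assuming independent entries."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(values.size)
    return mean, sem

# Synthetic, hypothetical data: 20 groups of 500 values, each group with its own offset
rng = np.random.default_rng(0)
num_groups, group_size = 20, 500
offsets = rng.normal(size=num_groups)
grouped = np.repeat(offsets, group_size) + 0.1 * rng.normal(size=num_groups * group_size)

# Two consecutive blocks of 2,000 values (four whole groups each)
first_mean, first_sem = mean_and_standard_error(grouped[:2000])
second_mean, second_sem = mean_and_standard_error(grouped[2000:4000])

print(f"first block:  {first_mean:.3f} +/- {first_sem:.3f}")
print(f"second block: {second_mean:.3f} +/- {second_sem:.3f}")
print(f"gap between block means: {abs(first_mean - second_mean):.3f}")
```

In a setup like this, each block contains only a handful of independent offsets, so the gap between block means is typically much larger than the error bars computed under the independence assumption.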

Some of the input features are constants: they take on the same value for every training example. Identify these features, and remove them from the input matrices in the training, validation, and testing sets.

Moreover, some of the input features are duplicates: some of the columns in the training set are identical. For each training set column, discard any later columns that are identical. Discard the same columns from the validation and testing sets.

**Use these modified input arrays for the rest of the assignment.** Keep the names of the arrays the same (X_train, etc.), so we know what they’re called. You should not duplicate the code from this part in future questions. We will assume it has been run, and that the modified data are available.

**Warning: As in the real world, mistakes at this stage would invalidate all of your results. We strongly recommend checking your code, for example on small test examples where you can see what it’s doing.**

Report which columns of the X_... arrays you remove at each of the two stages. Report these as 0-based indexes. (For the second stage, you might report indexes in the original array, or after you did the first stage. It doesn’t matter, as long as your code is clear and correct.)

In [None]:
# Step 1: Identify and remove constant columns
constant_columns = [i for i in range(X_train.shape[1]) if np.all(X_train[:, i] == X_train[0, i])]
X_train = np.delete(X_train, constant_columns, axis=1)
X_val = np.delete(X_val, constant_columns, axis=1)
X_test = np.delete(X_test, constant_columns, axis=1)

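# Step 2: Identify and remove duplicate columns, keeping the first occurrence
# (these indexes refer to the arrays after the constant columns were removed)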
_, unique_indices = np.unique(X_train, axis=1, return_index=True)
duplicate_columns = [i for i in range(X_train.shape[1]) if i not in unique_indices]
X_train = np.delete(X_train, duplicate_columns, axis=1)
X_val = np.delete(X_val, duplicate_columns, axis=1)
X_test = np.delete(X_test, duplicate_columns, axis=1)

# Report columns removed in each stage
print("Constant columns removed:", constant_columns)
print("Duplicate columns removed:", duplicate_columns)
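
Following the warning about checking code on small test examples, here is a minimal sketch of such a check on a hand-made array where the right answer is easy to see. The helper names `find_constant_columns` and `find_later_duplicate_columns` and the toy array are hypothetical, introduced only for this example. The duplicate finder deliberately uses a different method (comparing raw column bytes) so that it can cross-check an `np.unique`-based approach.

```python
import numpy as np

def find_constant_columns(X):
    """Indexes of columns whose entries all equal the first row's entry."""
    return [j for j in range(X.shape[1]) if np.all(X[:, j] == X[0, j])]

def find_later_duplicate_columns(X):
    """Indexes of columns that exactly repeat an earlier column."""
    seen = set()
    duplicates = []
    for j in range(X.shape[1]):
        key = X[:, j].tobytes()  # exact byte-level comparison of the column
        if key in seen:
            duplicates.append(j)
        else:
            seen.add(key)
    return duplicates

# Toy array: column 1 is constant, column 3 repeats column 0, column 4 repeats column 2
X_toy = np.array([[1., 5., 2., 1., 2.],
                  [3., 5., 4., 3., 4.],
                  [0., 5., 7., 0., 7.]])

const_cols = find_constant_columns(X_toy)
X_step1 = np.delete(X_toy, const_cols, axis=1)

dup_cols = find_later_duplicate_columns(X_step1)  # indexes after the first stage
X_clean = np.delete(X_step1, dup_cols, axis=1)

# Cross-check against the first-occurrence indexes reported by np.unique
_, first_idx = np.unique(X_step1, axis=1, return_index=True)
unique_based = sorted(set(range(X_step1.shape[1])) - set(first_idx.tolist()))

print("constant:", const_cols)
print("duplicates (after stage 1):", dup_cols)
print("np.unique agrees:", unique_based == dup_cols)
```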

## Task 2: Linear Regression Baseline

Using ```numpy.linalg.lstsq```, write a short function ```fit_linreg(X, yy, alpha)``` that fits the linear regression model
$$f(\mathbf{x};\mathbf{w},b) = \mathbf{w}^\top\mathbf{x} + b,$$
by minimizing the cost function:
$$E(\mathbf{w}, b) = \alpha\mathbf{w}^\top\mathbf{w} + \sum_n (f(\mathbf{x}^{(n)};\mathbf{w},b) - y^{(n)})^2,$$
with regularization constant $\alpha$. As discussed in the lecture materials, fitting a bias parameter $b$ and incorporating the regularization constant can both be achieved by augmenting the original data arrays. Use a data augmentation approach that maintains the numerical stability of the underlying ```lstsq``` solver, rather than a ‘normal equations’ approach. You should only regularize the weights $\mathbf{w}$ and not the bias $b$.
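
One way to set up the augmentation is sketched here, assuming $X$ is the $N\times D$ matrix whose rows are the inputs $\mathbf{x}^{(n)\top}$ and $\mathbf{y}$ stacks the targets $y^{(n)}$. The tilde symbols are notation introduced only for this sketch:

$$\tilde{X} = \begin{bmatrix} X & \mathbf{1}_N \\ \sqrt{\alpha}\,I_D & \mathbf{0}_D \end{bmatrix}, \qquad \tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_D \end{bmatrix},$$

$$\left\|\tilde{\mathbf{y}} - \tilde{X}\begin{bmatrix}\mathbf{w}\\ b\end{bmatrix}\right\|^2 = \sum_n (\mathbf{w}^\top\mathbf{x}^{(n)} + b - y^{(n)})^2 + \alpha\mathbf{w}^\top\mathbf{w} = E(\mathbf{w}, b).$$

The zero block under the column of ones keeps $b$ out of the penalty, so an ordinary least-squares fit to $(\tilde{X}, \tilde{\mathbf{y}})$ minimizes the regularized cost.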

(In the lecture materials we used $\lambda$ for the regularization constant, matching Murphy and others. However, lambda is a reserved word in Python, so we swapped to ```alpha``` for our code.)

Use your function to fit weights and a bias to ```X_train``` and ```y_train```. Use $\alpha = 30$.

We can fit the same model with a gradient-based optimizer. The support code has a function ```fit_linreg_gradopt```, which you should look at and try.

Report the root-mean-square errors (RMSE) on the training and validation sets for the parameters fitted using both your ```fit_linreg``` and the provided ```fit_linreg_gradopt```. Do you get exactly the same results? Why or why not?

In [None]:
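def fit_linreg(X, yy, alpha):
    num_examples, num_features = X.shape

    # Append a column of ones so the last fitted parameter is the bias b.
    # Use the argument X here, not the global X_train.
    phi = np.concatenate([X, np.ones((num_examples, 1))], axis=1)

    # Extra rows sqrt(alpha) * I penalize the weights; zeroing the last
    # diagonal entry leaves the bias unregularized.
    identity_matrix = np.eye(num_features + 1)
    identity_matrix[-1, -1] = 0
    phi = np.concatenate([phi, np.sqrt(alpha) * identity_matrix])

    # Targets for the extra regularization rows are zero
    Y = np.concatenate([yy, np.zeros(num_features + 1)])

    w = np.linalg.lstsq(phi, Y[:, np.newaxis], rcond=None)[0]

    # Split the solution into the weight vector and the bias
    return w[:-1, 0], w[-1, 0]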

def calculate_rmse(X, yy, w, b):
    predictions = X @ w + b
    return np.sqrt(np.mean((predictions - yy) ** 2))
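
A fit like this can be sanity-checked on a small synthetic problem, where solving the regularized normal equations directly is harmless. The sketch below is self-contained: `fit_linreg_sketch`, the toy data, and the true weights are all made up for this check. The normal-equations solve is used only as a reference on well-conditioned toy data, not as the recommended fitting method.

```python
import numpy as np

def fit_linreg_sketch(X, yy, alpha):
    """Regularized least squares via data augmentation; the bias is not penalized."""
    N, D = X.shape
    top = np.hstack([X, np.ones((N, 1))])                        # data rows, plus bias column
    bottom = np.hstack([np.sqrt(alpha) * np.eye(D), np.zeros((D, 1))])  # penalty rows
    X_aug = np.vstack([top, bottom])
    y_aug = np.concatenate([yy, np.zeros(D)])
    params = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
    return params[:-1], params[-1]

# Small synthetic regression problem with known weights and bias
rng = np.random.default_rng(0)
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.3
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ true_w + true_b + 0.01 * rng.normal(size=50)

alpha = 30.0
w_fit, b_fit = fit_linreg_sketch(X_toy, y_toy, alpha)

# Reference solution: regularized normal equations with no penalty on the bias
Phi = np.hstack([X_toy, np.ones((50, 1))])
penalty = alpha * np.eye(4)
penalty[-1, -1] = 0.0
params_ne = np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y_toy)

print("max abs difference:", np.max(np.abs(np.append(w_fit, b_fit) - params_ne)))
```

With `alpha=0` the same function reduces to ordinary least squares, which gives a second quick check against the known weights.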

In [None]:
# use lstsq
w, b = fit_linreg(X_train, y_train, alpha=30)

rmse_train_lst = calculate_rmse(X_train, y_train, w, b)
rmse_val_lst = calculate_rmse(X_val, y_val, w, b)
rmse_train_lst, rmse_val_lst

In [None]:
# Fit the same model with the gradient-based optimizer from the support code
from support_code import fit_linreg_gradopt
ww, bb = fit_linreg_gradopt(X_train, y_train, alpha=30)

rmse_train_gd = calculate_rmse(X_train, y_train, ww, bb)
rmse_val_gd = calculate_rmse(X_val, y_val, ww, bb)
rmse_train_gd, rmse_val_gd

## Task 3: Invented classification tasks