The patient IDs were removed from this version of the data, leaving 384 input features which were put in each of the ```“X_...”``` arrays. The corresponding CT scan slice location has been put in the ```“y_...”``` arrays. We shifted and scaled the ```“y_...”``` location values for the version of the data that you are using. The shift and scale were chosen to make the training locations have zero mean and unit variance. The first 73 patients were put in the ```_train``` arrays, the next 12 in the ```_val``` arrays, and the final 12 in the ```_test``` arrays. Please use this training, validation, test split as given. **Do not shuffle the data further in this assignment.**
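
To make the shift-and-scale step concrete, here is a minimal sketch on synthetic numbers. The raw values, the array names, and the use of the population standard deviation are all assumptions for illustration; the original preprocessing code is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw slice locations (arbitrary units), not the real CT data
raw_train = rng.uniform(0.0, 100.0, size=1000)
raw_val = rng.uniform(0.0, 100.0, size=200)

# The shift and scale are computed from the training targets only
shift = raw_train.mean()
scale = raw_train.std()

y_train_demo = (raw_train - shift) / scale
y_val_demo = (raw_val - shift) / scale  # the same constants are reused

# Training targets end up with zero mean and unit variance (up to rounding)
print(y_train_demo.mean(), y_train_demo.var())
# Validation targets generally do not
print(y_val_demo.mean(), y_val_demo.var())
```

Because the constants come from the training set alone, the validation and test targets are expected to have a mean that is merely close to, not equal to, zero.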

## Task 1: Get Started

In [1]:
import numpy as np
data = np.load('ct_data.npz')
X_train = data['X_train']; X_val = data['X_val']; X_test = data['X_test']
y_train = data['y_train']; y_val = data['y_val']; y_test = data['y_test']

Verify that (up to numerical rounding errors) the mean of the training positions in ```y_train``` is zero. The mean of the 5,785 positions in the ```y_val``` array is not zero. Report its mean with a “standard error”, temporarily assuming that each entry is independent. For comparison, also report the mean with a standard error of the first 5,785 entries in the ```y_train```. Explain how your results demonstrate that these standard error bars do not reliably indicate what the average of locations in future CT slice data will be. Why are standard error bars misleading here?

In [2]:
# Calculate mean and standard error for y_train
y_train_mean = np.mean(y_train)
y_train_std_error = np.std(y_train, ddof=1) / np.sqrt(len(y_train))

# Calculate mean and standard error for the first 5,785 entries in y_train
y_train_sample_mean = np.mean(y_train[:5785])
y_train_sample_std_error = np.std(y_train[:5785], ddof=1) / np.sqrt(5785)

# Calculate mean and standard error for y_val
y_val_mean = np.mean(y_val)
y_val_std_error = np.std(y_val, ddof=1) / np.sqrt(len(y_val))

y_train_mean, y_train_std_error, y_train_sample_mean, y_train_sample_std_error, y_val_mean, y_val_std_error

(-9.13868774539957e-15,
 0.0049535309340638205,
 -0.44247687859693674,
 0.011927303389170828,
 -0.2160085093241599,
 0.01290449880016868)
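
The reported means can be compared in units of their own standard errors. The sketch below only reuses the numbers printed above; the variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the output above
train_sub_mean, train_sub_se = -0.44247687859693674, 0.011927303389170828
val_mean, val_se = -0.2160085093241599, 0.01290449880016868

# Distance of the first-5,785 training mean from the full training mean (zero)
z_train_sub = abs(train_sub_mean - 0.0) / train_sub_se

# Distance between the two 5,785-sample means, in combined standard errors
z_gap = abs(val_mean - train_sub_mean) / np.hypot(train_sub_se, val_se)

print(f"{z_train_sub:.1f} standard errors from zero")
print(f"{z_gap:.1f} combined standard errors apart")
```

Both ratios come out well above 10 (roughly 37 and 13), so the error bars are far too narrow to describe how much the mean varies between subsets. A plausible explanation is that slices from the same patient are strongly correlated, which violates the independence assumption: the effective number of independent observations is probably closer to the number of patients than to the number of slices.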

Some of the input features are constants: they take on the same value for every training example. Identify these features, and remove them from the input matrices in the training, validation, and testing sets.

Moreover, some of the input features are duplicates: some of the columns in the training set are identical. For each training set column, discard any later columns that are identical. Discard the same columns from the validation and testing sets.

**Use these modified input arrays for the rest of the assignment.** Keep the names of the arrays the same (X_train, etc.), so we know what they’re called. You should not duplicate the code from this part in future questions. We will assume it has been run, and that the modified data are available.

**Warning: As in the real world, mistakes at this stage would invalidate all of your results. We strongly recommend checking your code, for example on small test examples where you can see what it’s doing.**

Report which columns of the X_... arrays you remove at each of the two stages. Report these as 0-based indexes. (For the second stage, you might report indexes in the original array, or after you did the first stage. It doesn’t matter, as long as your code is clear and correct.)

In [3]:
# Step 1: Identify and remove constant columns
constant_columns = [i for i in range(X_train.shape[1]) if np.all(X_train[:, i] == X_train[0, i])]
X_train = np.delete(X_train, constant_columns, axis=1)
X_val = np.delete(X_val, constant_columns, axis=1)
X_test = np.delete(X_test, constant_columns, axis=1)

# Step 2: Identify and remove duplicate columns
_, unique_indices = np.unique(X_train, axis=1, return_index=True)
duplicate_columns = [i for i in range(X_train.shape[1]) if i not in unique_indices]
X_train = np.delete(X_train, duplicate_columns, axis=1)
X_val = np.delete(X_val, duplicate_columns, axis=1)
X_test = np.delete(X_test, duplicate_columns, axis=1)

# Report columns removed in each stage
print("Constant columns removed:", constant_columns)
print("Duplicate columns removed:", duplicate_columns)

Constant columns removed: [59, 69, 179, 189, 351]
Duplicate columns removed: [76, 77, 185, 195, 283, 354]
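
Following the warning above, the two removal stages can be checked on a small array where the answer is obvious. This is a sketch: the helper names `find_constant_columns` and `find_later_duplicate_columns` are hypothetical and simply wrap the same logic as the cell above.

```python
import numpy as np

def find_constant_columns(X):
    """Indexes of columns whose entries all equal the first row's entry."""
    return [i for i in range(X.shape[1]) if np.all(X[:, i] == X[0, i])]

def find_later_duplicate_columns(X):
    """Indexes of columns that repeat an earlier column exactly."""
    # return_index gives the first occurrence of each unique column
    _, first_idx = np.unique(X, axis=1, return_index=True)
    keep = set(first_idx.tolist())
    return [i for i in range(X.shape[1]) if i not in keep]

# Columns 1 and 4 are constant; column 2 duplicates column 0
X_toy = np.array([[1., 5., 1., 2., 5.],
                  [2., 5., 2., 3., 5.],
                  [3., 5., 3., 5., 5.]])

const_cols = find_constant_columns(X_toy)
X_step1 = np.delete(X_toy, const_cols, axis=1)

# Indexes here refer to the array after the first stage
dup_cols = find_later_duplicate_columns(X_step1)
X_step2 = np.delete(X_step1, dup_cols, axis=1)

print("Constant columns:", const_cols)
print("Duplicate columns (after stage 1):", dup_cols)
print(X_step2)
```

As in the cell above, the duplicate indexes are reported relative to the array after the constant columns have been removed.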


## Task 2: Linear Regression Baseline
Using ```numpy.linalg.lstsq```, write a short function “fit_linreg(X, yy, alpha)” that fits the linear regression model
$$f(\mathbf{x};\mathbf{w},b) = \mathbf{w}^\top\mathbf{x} + b,$$
by minimizing the cost function:
$$E(\mathbf{w}, b) = \alpha\mathbf{w}^\top\mathbf{w} + \sum_n (f(\mathbf{x}^{(n)};\mathbf{w},b) - y^{(n)})^2,$$
with regularization constant $\alpha$. As discussed in the lecture materials, fitting a bias parameter $b$ and incorporating the regularization constant can both be achieved by augmenting the original data arrays. Use a data augmentation approach that maintains the numerical stability of the underlying ```lstsq``` solver, rather than a ‘normal equations’ approach. You should only regularize the weights $\mathbf{w}$ and not the bias $b$.
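
One way to write the augmentation explicitly is sketched below. The symbols $\tilde{\Phi}$, $\tilde{\mathbf{y}}$, and $I_D$ (the $D\times D$ identity) are notation introduced here and may differ from the lecture materials. With $X$ the $N\times D$ design matrix and $\mathbf{y}$ the vector of targets:

$$\tilde{\Phi} = \begin{bmatrix} X & \mathbf{1} \\ \sqrt{\alpha}\, I_D & \mathbf{0} \end{bmatrix}, \qquad \tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix},$$

$$\left\| \tilde{\Phi}\begin{bmatrix}\mathbf{w}\\ b\end{bmatrix} - \tilde{\mathbf{y}} \right\|^2 = \sum_n \left(\mathbf{w}^\top\mathbf{x}^{(n)} + b - y^{(n)}\right)^2 + \alpha\,\mathbf{w}^\top\mathbf{w} = E(\mathbf{w}, b).$$

The top block reproduces the squared errors, and the bottom block contributes $\alpha\,\mathbf{w}^\top\mathbf{w}$. The zero column under $\mathbf{1}$ is what keeps the bias out of the penalty, so an ordinary least-squares solve on $(\tilde{\Phi}, \tilde{\mathbf{y}})$ minimizes the regularized cost.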

(In the lecture materials we used $\lambda$ for the regularization constant, matching Murphy and others. However, lambda is a reserved word in Python, so we swapped to ```alpha``` for our code.)

Use your function to fit weights and a bias to ```X_train``` and ```y_train```. Use $\alpha = 30$.

We can fit the same model with a gradient-based optimizer. The support code has a function ```fit_linreg_gradopt```, which you should look at and try.

Report the root-mean-square errors (RMSE) on the training and validation sets for the parameters fitted using both your ```fit_linreg``` and the provided ```fit_linreg_gradopt```. Do you get exactly the same results? Why or why not?

In [4]:
X_train_num, features_num = X_train.shape
X_train_num, features_num

(40754, 373)

In [5]:
phi = np.concatenate([X_train, np.zeros((X_train_num, 1))], axis=1)
phi.shape, phi[:, -1]

((40754, 374), array([0., 0., 0., ..., 0., 0., 0.]))

In [6]:
# Without matrix operations: appending one row at a time is computationally expensive
"""
for i in range(features_num + 1):
    add_row = np.zeros(features_num + 1)
    add_row[i] = 1
    phi = np.concatenate([phi, add_row[np.newaxis, :]])
phi[-1, -1] = 0
"""

# With matrix operations: build the regularization block from an identity matrix
identity_matrix = np.eye(features_num + 1)
identity_matrix[-1, -1] = 0

phi = np.concatenate([phi, identity_matrix])

identity_matrix, phi[-1, -1], phi.shape

(array([[1., 0., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 1., 0., 0.],
        [0., 0., 0., ..., 0., 1., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 0.0,
 (41128, 374))

In [7]:
# Check array shapes before concatenating
y_train.shape, np.zeros((features_num + 1)).shape, np.zeros((features_num + 1, 1)).shape

((40754,), (374,), (374, 1))

In [8]:
Y = np.concatenate([y_train, np.zeros(features_num + 1)])
Y.shape, Y[:, np.newaxis].shape, phi.shape

((41128,), (41128, 1), (41128, 374))

In [9]:
w = np.linalg.lstsq(phi, Y[:, np.newaxis])
w


(array([[-5.14849854e-02],
        [-1.08631266e-01],
        [ 8.22456499e-02],
        [ 2.83872801e-01],
        [ 2.60394488e-01],
        [ 1.19494049e-01],
        [ 1.68744858e-02],
        [ 2.35944393e-01],
        [-3.20385336e-01],
        [-1.73865238e-02],
        [-5.03429199e-02],
        [ 5.29468044e-02],
        [-2.44884609e-02],
        [-1.77291987e-03],
        [-1.04068325e-02],
        [ 4.06039770e-02],
        [ 2.52409523e-02],
        [ 2.23186640e-02],
        [-1.22311809e-01],
        [ 1.68718904e-01],
        [ 7.94975956e-02],
        [ 2.74324075e-02],
        [-4.27183236e-02],
        [-1.69503524e-03],
        [-6.15091084e-02],
        [-4.89240453e-02],
        [-1.31386948e-02],
        [-6.05110555e-01],
        [ 4.90115638e-01],
        [ 2.26251471e-01],
        [-1.11229878e-02],
        [ 3.12659183e-03],
        [-5.45543385e-03],
        [-8.01311822e-03],
        [ 6.67414373e-02],
        [-1.82856145e-01],
        [-1.97597401e-01],
 

In [10]:
def fit_linreg(X, yy, alpha):
    N, D = X.shape

    # Augment the inputs with a column of ones so the last parameter is the bias
    phi = np.concatenate([X, np.ones((N, 1))], axis=1)
    # Append sqrt(alpha) * I rows to penalize the weights; the last diagonal
    # entry is zeroed so the bias is not regularized
    identity_matrix = np.eye(D + 1)
    identity_matrix[-1, -1] = 0
    phi = np.concatenate([phi, np.sqrt(alpha) * identity_matrix])

    # Pad the targets with zeros to match the extra rows
    Y = np.concatenate([yy, np.zeros(D + 1)])

    w = np.linalg.lstsq(phi, Y[:, np.newaxis], rcond=None)[0]

    # Split the solution into weights and bias
    return w[:-1, 0], w[-1, 0]

def calculate_rmse(X, yy, w, b):
    predictions = X @ w + b
    return np.sqrt(np.mean((predictions - yy) ** 2))
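
A quick sanity check for this kind of function is to compare it against the normal equations on a small, well-conditioned toy problem, where both approaches should agree. The sketch below is self-contained: `fit_linreg_sketch` is a hypothetical stand-in that follows the same augmentation idea, and the toy data are synthetic.

```python
import numpy as np

def fit_linreg_sketch(X, yy, alpha):
    """Ridge fit via data augmentation; the bias is not regularized."""
    N, D = X.shape
    Phi = np.concatenate([X, np.ones((N, 1))], axis=1)
    reg = np.sqrt(alpha) * np.eye(D + 1)
    reg[-1, -1] = 0  # no penalty on the bias
    Phi_aug = np.concatenate([Phi, reg], axis=0)
    y_aug = np.concatenate([yy, np.zeros(D + 1)])
    params = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)[0]
    return params[:-1], params[-1]

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.7 + 0.1 * rng.normal(size=50)
alpha = 30.0

w_fit, b_fit = fit_linreg_sketch(X_toy, y_toy, alpha)

# Reference solution from the normal equations (fine on a tiny problem,
# but less numerically stable in general)
Phi = np.concatenate([X_toy, np.ones((50, 1))], axis=1)
R = alpha * np.eye(4)
R[-1, -1] = 0
ref = np.linalg.solve(Phi.T @ Phi + R, Phi.T @ y_toy)

print(np.allclose(np.append(w_fit, b_fit), ref))
```

If the two solutions disagree on a problem this small, the augmentation (for example the bias column or the zeroed diagonal entry) is the first place to look.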

In [11]:
# use lstsq
w, b = fit_linreg(X_train, y_train, alpha=30)

rmse_train_lst = calculate_rmse(X_train, y_train, w, b)
rmse_val_lst = calculate_rmse(X_val, y_val, w, b)
rmse_train_lst, rmse_val_lst

(0.3567565397204054, 0.4230521968394695)

In [12]:
# use grad
from support_code import *
ww, bb = fit_linreg_gradopt(X_train, y_train, alpha=30)

rmse_train_gd = calculate_rmse(X_train, y_train, ww, bb)
rmse_val_gd = calculate_rmse(X_val, y_val, ww, bb)
rmse_train_gd, rmse_val_gd

(0.3567556103401202, 0.42305510586203865)
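
The two sets of RMSEs can be compared directly. This sketch only reuses the numbers printed above.

```python
# RMSE values copied from the two outputs above: (train, validation)
rmse_lstsq = (0.3567565397204054, 0.4230521968394695)
rmse_gradopt = (0.3567556103401202, 0.42305510586203865)

for name, a, b in zip(("train", "val"), rmse_lstsq, rmse_gradopt):
    print(f"{name}: difference = {a - b:.2e}, relative = {abs(a - b) / a:.2e}")
```

The results agree to about five decimal places but are not identical. A plausible reason, assuming the optimizer in the support code stops at a convergence tolerance, is that ```fit_linreg_gradopt``` returns parameters close to, but not exactly at, the minimum that ```lstsq``` finds directly.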

## Task 3: Invented classification tasks

It is often easier to work with binary data than real-valued data: we don’t have to think so hard about how the values might be distributed, and how we might process them. We will invent some binary classification tasks, and fit these.

We will pick 20 positions within the range of training positions, and use each of these to define a classification task:
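
A minimal sketch of that construction, mirroring the threshold code used in ```logreg_k``` later in this notebook. Synthetic targets (`y_demo`, an assumption for illustration) stand in for ```y_train```.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = rng.normal(size=500)  # stand-in for y_train

K = 20  # number of thresholded classification problems
mx = np.max(y_demo); mn = np.min(y_demo); hh = (mx - mn) / (K + 1)
# K evenly spaced thresholds strictly inside the range of the targets
thresholds = np.linspace(mn + hh, mx - hh, num=K, endpoint=True)

# One column of binary labels per threshold, via broadcasting: shape (N, K)
labels = y_demo[:, None] > thresholds[None, :]

print(thresholds.shape, labels.shape)
print(labels.mean(axis=0))  # fraction of positive labels per task
```

Each column of `labels` defines one binary task. The fraction of positive labels shrinks as the threshold rises.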

The logistic regression cost function and gradients are provided with the assignment in the function ```logreg_cost```. It is analogous to the ```linreg_cost``` function for least-squares regression, which is used by the ```fit_linreg_gradopt``` function that you used earlier.

Fit logistic regression to each of the 20 classification tasks above with $\alpha=30$.

Given a feature vector, we can now obtain 20 different probabilities, the predictions of the 20 logistic regression models. Transform both the training and validation input matrices into new matrices with 20 columns, containing the probabilities from the 20 logistic regression models. You don’t need to loop over the rows of ```X_train``` or ```X_val```, you can use array-based operations to make the logistic regression predictions for every datapoint at once.
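
The array-based prediction can be sketched as a single matrix product followed by a sigmoid. Here the 20 fitted weight vectors are assumed to be stacked as columns of a matrix `W_demo` with biases in `b_demo`. These names, the helper `predict_probs`, and the random values are hypothetical and not part of the support code.

```python
import numpy as np

def predict_probs(X, W, b):
    """Sigmoid of X @ W + b. X is (N, D), W is (D, K), b is (K,); returns (N, K)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

rng = np.random.default_rng(0)
N, D, K = 6, 4, 20
X_demo = rng.normal(size=(N, D))   # stand-in for X_train or X_val
W_demo = rng.normal(size=(D, K))   # hypothetical: column k holds the weights of model k
b_demo = rng.normal(size=K)        # hypothetical: entry k holds the bias of model k

P = predict_probs(X_demo, W_demo, b_demo)
print(P.shape)  # one row per datapoint, one probability per model
```

Applying the same fitted `W` and `b` to both ```X_train``` and ```X_val``` gives the two transformed 20-column matrices without looping over rows.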

In [13]:
# Quick check of boolean-mask assignment on a small array
pred = np.array([[1, 2, 3, 4, 5, 6, 7]])

pred = np.concatenate([pred, pred])
print(pred)
pred[pred > 4] = 100
pred[pred <= 4] = 0
print(pred)

[[1 2 3 4 5 6 7]
 [1 2 3 4 5 6 7]]
[[  0   0   0   0 100 100 100]
 [  0   0   0   0 100 100 100]]


In [14]:
def fit_logreg_gradopt(X, yy, alpha):
    """
    fit a regularized logistic regression model with gradient opt

         ww, bb = fit_logreg_gradopt(X, yy, alpha)

     Find weights and bias by using a gradient-based optimizer
     (minimize_list) to improve the regularized logistic regression
     cost evaluated by logreg_cost from the support code.

     Inputs:
             X N,D design matrix of input features
            yy N,  binary class labels
         alpha     scalar regularization constant

     Outputs:
            ww D,  fitted weights
            bb     scalar fitted bias
    """
    D = X.shape[1]
    args = (X, yy, alpha)
    init = (np.zeros(D), np.array(0))
    ww, bb = minimize_list(logreg_cost, init, args)
    return ww, bb

In [15]:
def logreg_k(X, yy, K, alpha=30):
    mx = np.max(yy); mn = np.min(yy); hh = (mx-mn)/(K+1)
    thresholds = np.linspace(mn+hh, mx-hh, num=K, endpoint=True)
    
    # Concatenation approach: repeatedly concatenating is inefficient
    # X_train_new = np.array([])
    # Preallocating the output array is more efficient
    X_train_new = np.zeros((X.shape[0], K))
    
    for kk in range(K):
        # get binary training labels based on thresholds[kk]
        labels = yy > thresholds[kk]
        
        # fit logistic regression to these labels
        ww, bb = fit_logreg_gradopt(X, labels, alpha)
        pred_term = X @ ww + bb
        pred = 1 / (1 + np.exp(-pred_term))
        
        # transform to binary
        # pred[pred >= 0.5] = 1
        # pred[pred < 0.5] = 0
        # A better way:
        pred = np.where(pred >= 0.5, 1, 0)
        
        X_train_new[:, kk] = pred
        
        
        # # Concatenation approach: concatenate logreg outputs together
        # pred = pred.reshape(-1, 1)
        # if X_train_new.shape[0] == 0:
        #     X_train_new = pred
        # else:
        #     X_train_new = np.concatenate([X_train_new, pred], axis=1)
            
    return X_train_new

In [16]:
K = 20 # number of thresholded classification problems to fit

# Transform both the training and validation input matrices into new matrices with 20 columns
X_train_new = logreg_k(X_train, y_train, K)
X_val_new = logreg_k(X_val, y_val, K)
np.sum(X_train_new, axis=0), np.sum(X_val_new, axis=0), X_train_new, X_val_new

(array([39998., 38403., 36886., 35169., 33434., 29258., 25547., 22683.,
        20093., 17598., 15203., 12862., 11142.,  9252.,  7715.,  6239.,
         4671.,  3004.,  1368.,    64.]),
 array([5774., 5523., 5229., 4928., 4615., 4035., 3510., 3031., 2681.,
        2256., 1898., 1533., 1195., 1059.,  884.,  762.,  598.,  458.,
         283.,  120.]),
 array([[1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.],
        ...,
        [1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.]]),
 array([[1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.],
        ...,
        [1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.],
        [1., 1., 1., ..., 0., 0., 0.]]))

Fit a regularized linear regression model ($\alpha=30$) to your transformed 20-dimensional training set. Report the training and validation root mean square errors (RMSE) of this model.

In [17]:
from support_code import *
ww, bb = fit_linreg_gradopt(X_train_new, y_train, alpha=30)

rmse_train_gd = calculate_rmse(X_train_new, y_train, ww, bb)
rmse_val_gd = calculate_rmse(X_val_new, y_val, ww, bb)
rmse_train_gd, rmse_val_gd

(0.12290480292993665, 0.16884510101812197)