# Linear Algebra and Optimization for Machine Learning - Project

In [None]:
# imports
import numpy as np
import pandas as pd

# set seed
np.random.seed(40)


### 1.

(a) Generate a $300 \times 20$ data matrix $X$, where each entry is uniformly random.  
Generate an outcome vector $y$, which is a linear combination of the columns of $X$ with uniformly random weights, and some Gaussian noise added to each entry of $y$.

In [2]:
# generate Xij ~ Unif([0, 1]), X.shape = (300, 20)

# shape
m = 300
n = 20

X = np.random.uniform(size = (m, n))

# generate random noise eps, eps.shape = (300,)
eps = np.random.normal(size = m)

# uniformly random weights weight, weight.shape = (20,)
weight = np.random.uniform(size = n)

# y = X weight + eps, y.shape = (300,)
y = X @ weight + eps

# print shape of the data
print("The shape of the data is: ", X.shape)

The shape of the data is:  (300, 20)


(b) Write a function to divide the data set into a train and test set.

In [3]:
# function for splitting into training and testing datasets
def train_test_split(X, y, p: float):
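    '''
    Split (X, y) into a training set and a test set.

    (X, y): datapoints
    p: fraction of the data that goes into the training set

    The split is deterministic: the first ceil(p * m) rows form the
    training set and the remaining rows form the test set (no shuffling).

    returns: X_train, X_test, y_train, y_test
    '''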

    m = X.shape[0]
    n = X.shape[1]

    split = int(np.ceil(p * m))

    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    return X_train, X_test, y_train, y_test

In [4]:
# Usage
p = 0.7
X_train, X_test, y_train, y_test = train_test_split(X, y, p)
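
The docstring of `train_test_split` notes that the split is not random. A minimal sketch of a shuffled variant is given below; the function name `shuffled_train_test_split`, the `seed` argument, and the toy data are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def shuffled_train_test_split(X, y, p, seed=0):
    """Hypothetical variant of train_test_split that shuffles the rows first."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    perm = rng.permutation(m)        # random ordering of the row indices
    split = int(np.ceil(p * m))      # same split rule as train_test_split
    train_idx, test_idx = perm[:split], perm[split:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# toy data (not the notebook's X, y): row i is [2i, 2i + 1] with label i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
Xtr, Xte, ytr, yte = shuffled_train_test_split(X_demo, y_demo, p=0.5)
print(Xtr.shape, Xte.shape)
```

Because the same index permutation is applied to `X` and `y`, each row stays paired with its label. For the i.i.d. synthetic data used here, shuffling should not change the conclusions; it matters when the rows have some ordering.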

(c) Write functions for OLS and Ridge regression and apply this to your synthetic data set. Discuss the performance on train and test sets.

In [5]:
def evaluate(X, y, w):
    MSE = np.mean((X @ w - y)**2)
    return MSE

# closed-form OLS via the normal equations; assumes X_train.T @ X_train is invertible
def OLS(X_train, y_train):
    # w = (X^T X)^{-1} X^T y
    w = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train

    return w

def ridge(X_train, y_train, lamb):
    n = X_train.shape[1]
    
    w = np.linalg.inv(X_train.T @ X_train + lamb * np.eye(n)) @ X_train.T @ y_train

    return w

# Find the lambda on a fixed grid that minimizes the MSE on the evaluation set.
# Note: the notebook passes the test set here, so the ridge test MSE reported
# afterwards is optimistically biased.
def find_opt_lam(X_train, y_train, X_test, y_test):
    min_perf = float('inf')
    opt_lam = None
    for lamb in np.arange(0.01, 100, 0.01):
        w_ridge = ridge(X_train, y_train, lamb)
        perf_ridge_te = evaluate(X_test, y_test, w_ridge)

        if perf_ridge_te < min_perf:
            min_perf = perf_ridge_te
            opt_lam = lamb

    return opt_lam


In [6]:
# example usage
opt_lam = find_opt_lam(X_train, y_train, X_test, y_test)
w_OLS = OLS(X_train, y_train)
w_ridge = ridge(X_train, y_train, opt_lam)

# performance
perf_OLS_tr, perf_OLS_te = evaluate(X_train, y_train, w_OLS), evaluate(X_test, y_test, w_OLS)
perf_ridge_tr, perf_ridge_te = evaluate(X_train, y_train, w_ridge), evaluate(X_test, y_test, w_ridge)

print(f"OLS train MSE: {np.round(perf_OLS_tr, 4)}")
print(f"OLS test MSE: {np.round(perf_OLS_te, 4)}")
print("-" * 35)
print(f"ridge train MSE: {np.round(perf_ridge_tr, 4)}")
print(f"ridge test MSE: {np.round(perf_ridge_te, 4)}")

OLS train MSE: 0.8433
OLS test MSE: 0.8403
-----------------------------------
ridge train MSE: 0.9179
ridge test MSE: 0.7458


### Question 1 (c) — Performance Discussion

We compare **Ordinary Least Squares (OLS)** and **Ridge Regression** on the synthetic dataset.

| Model | Train MSE | Test MSE | Observation |
|:--|:--:|:--:|:--|
| **OLS** | 0.8433 | 0.8403 | Train and test error nearly equal → no sign of overfitting |
| **Ridge** | 0.9179 | 0.7458 | Higher train error, lower test error (with λ tuned on the test set) |

The OLS model achieves the lowest error on the training set, as it must, since it minimizes the training MSE directly.  

Its test MSE (0.8403) is almost identical to its train MSE (0.8433), so with 210 training points and only 20 features there is no sign of overfitting.  

Ridge regression introduces an ℓ₂-penalty that shrinks the coefficients, trading a bit of bias for reduced variance.  

As expected, the training MSE increases (0.8433 → 0.9179), while the test MSE decreases (0.8403 → 0.7458).  

This is consistent with the bias–variance trade-off, but the gain should be read with care: `find_opt_lam` selects λ by minimizing the test MSE, so the ridge test error is optimistically biased.  

The improvement is modest, which suggests the dataset is not strongly affected by multicollinearity, yet regularization still provides a small gain on this split.
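
The shrinkage effect of the ℓ₂-penalty can be checked numerically. The sketch below uses small stand-in data (not the notebook's `X`, `y`) and a hypothetical helper `ridge_closed_form` that solves the same linear system as `ridge`, but with `np.linalg.solve` instead of an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(100, 5))
y_demo = X_demo @ rng.uniform(size=5) + rng.normal(size=100)

def ridge_closed_form(X, y, lamb):
    """Solve (X^T X + lamb * I) w = X^T y without forming the inverse."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lamb * np.eye(n), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms, train_mse = [], []
for lamb in lambdas:
    w = ridge_closed_form(X_demo, y_demo, lamb)
    norms.append(np.linalg.norm(w))                       # size of the coefficients
    train_mse.append(np.mean((X_demo @ w - y_demo) ** 2)) # fit on the training data

for lamb, nw, err in zip(lambdas, norms, train_mse):
    print(f"lambda={lamb:6.1f}  ||w||={nw:.4f}  train MSE={err:.4f}")
```

As λ grows, the coefficient norm decreases and the training MSE increases; whether the test MSE improves depends on the data.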


(d) Create a data matrix with many multicollinearities by adding a large number (say, 200) of columns to X that are linear combinations of the original 20 columns with some Gaussian noise added to each entry. Run OLS and Ridge regression and discuss the performance on train and test sets. Is it hard to find a good value for $\lambda$?

In [7]:
def add_multicolinearity(X, num):
    n_entries = X.shape[0]
    n_feats = X.shape[1]

    new_cols = []
    
    for i in range(num):
        # uniformly random weights w, w.shape = (n_feats, )
        w = np.random.uniform(size = n_feats)

        # generate random noise eps, eps.shape = (n_entries,)
        eps = np.random.normal(size = n_entries)

        # linear combination of features + noise
        X_col = X @ w + eps
        new_cols.append(X_col.reshape(-1, 1))

    # concatenate the X with new columns
    X_new = np.hstack([X] + new_cols)
    return X_new

In [8]:
X_new = add_multicolinearity(X, 200)
print("The multicolinearity data shape is: ", X_new.shape)

X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y, p)

# find optimal lambda
opt_lam = find_opt_lam(X_train_new, y_train_new, X_test_new, y_test_new)
print("The optimal lambda is: ", opt_lam)

# example usage
w_OLS = OLS(X_train_new, y_train_new)
w_ridge = ridge(X_train_new, y_train_new, opt_lam)

# performance
perf_OLS_tr, perf_OLS_te = evaluate(X_train_new, y_train_new, w_OLS), evaluate(X_test_new, y_test_new, w_OLS)
perf_ridge_tr, perf_ridge_te = evaluate(X_train_new, y_train_new, w_ridge), evaluate(X_test_new, y_test_new, w_ridge)

print(f"OLS train MSE: {np.round(perf_OLS_tr, 4)}")
print(f"OLS test MSE: {np.round(perf_OLS_te, 4)}")
print("-" * 35)
print(f"ridge train MSE: {np.round(perf_ridge_tr, 4)}")
print(f"ridge test MSE: {np.round(perf_ridge_te, 4)}")


The multicolinearity data shape is:  (300, 220)
The optimal lambda is:  99.99000000000001
OLS train MSE: 654.288
OLS test MSE: 3768.0954
-----------------------------------
ridge train MSE: 0.3292
ridge test MSE: 1.5194


### Question 1 (d) — Performance Discussion

After adding 200 multicollinear columns, the dataset became highly redundant. The OLS model performed very poorly, with a train MSE of 654.288 and a test MSE of 3768.0954. The training split now has 210 rows but 220 columns, so $X^TX$ is singular rather than merely ill-conditioned; inverting it numerically yields large and unreliable coefficient estimates, which explains why even the training error is huge.

In contrast, ridge regression performed much better, with a train MSE of 0.3292 and a test MSE of 1.5194. The L2 regularization term makes $X^TX + \lambda I$ invertible and stabilizes the solution by shrinking correlated coefficients, which reduces variance. The selected lambda of 99.99 is the largest value in the search grid (0.01 to 99.99), so the true optimum may lie beyond it. In that sense a good value for $\lambda$ is not pinned down by this search: strong regularization is clearly needed, but a wider, logarithmically spaced grid would be required to locate the optimum.
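
The rank argument can be illustrated with a small experiment. The sketch below uses a random stand-in matrix `A` with the same shape as the training split in (d) (210 rows, 220 columns); it is not the notebook's data, and `np.linalg.lstsq` is shown as one possible alternative to the explicit inverse, not as the method used above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 210, 220            # more columns than rows, as in (d)
A = rng.normal(size=(n_rows, n_cols))
b = rng.normal(size=n_rows)

# The rank of A (and hence of A^T A) is at most the number of rows,
# so the 220 x 220 matrix A^T A cannot be invertible.
rank = np.linalg.matrix_rank(A)
print("rank of A:", rank, "columns:", n_cols)

# lstsq uses an SVD and returns the minimum-norm least-squares solution,
# which stays well defined even when A^T A is singular.
w_min_norm, *_ = np.linalg.lstsq(A, b, rcond=None)
train_mse = np.mean((A @ w_min_norm - b) ** 2)
print("train MSE of the minimum-norm solution:", train_mse)
```

With more unknowns than equations, the minimum-norm solution fits the training data essentially exactly. That is the behavior one would expect from OLS in this regime, in contrast to the very large training MSE produced by `np.linalg.inv` on a singular matrix.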


(e) Now instead of adding multicollinearities, add many superficial feature columns to X which have no relation to the outcome vector y. Again run OLS and Ridge regression and discuss the performance on train and test sets.

In [9]:
def add_superficial(X, num):
    n_entries = X.shape[0]

    sup_feats = np.random.uniform(size = (n_entries, num))
    print(sup_feats.shape)

    # concatenate the X with new columns
    X_new = np.hstack([X, sup_feats])
    return X_new

In [10]:
X_new = add_superficial(X, 200)
print("The superficial data shape is: ", X_new.shape)

X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y, p)

# find optimal lambda
opt_lam = find_opt_lam(X_train_new, y_train_new, X_test_new, y_test_new)
print("The optimal lambda is: ", opt_lam)

# example usage
w_OLS = OLS(X_train_new, y_train_new)
w_ridge = ridge(X_train_new, y_train_new, opt_lam)

# performance
perf_OLS_tr, perf_OLS_te = evaluate(X_train_new, y_train_new, w_OLS), evaluate(X_test_new, y_test_new, w_OLS)
perf_ridge_tr, perf_ridge_te = evaluate(X_train_new, y_train_new, w_ridge), evaluate(X_test_new, y_test_new, w_ridge)

print(f"OLS train MSE: {np.round(perf_OLS_tr, 4)}")
print(f"OLS test MSE: {np.round(perf_OLS_te, 4)}")
print("-" * 35)
print(f"ridge train MSE: {np.round(perf_ridge_tr, 4)}")
print(f"ridge test MSE: {np.round(perf_ridge_te, 4)}")

(300, 200)
The superficial data shape is:  (300, 220)
The optimal lambda is:  16.12
OLS train MSE: 1091.7916
OLS test MSE: 2042.4501
-----------------------------------
ridge train MSE: 0.5033
ridge test MSE: 1.0068


### Question 1 (e) — Performance Discussion

When adding many superficial features unrelated to the outcome, the OLS model again performed poorly, with a train MSE of 1091.7916 and a test MSE of 2042.4501. As in (d), the training split has more columns (220) than rows (210), so $X^TX$ is singular and the inverse-based solution is numerically unreliable. The irrelevant features also give OLS enough freedom to fit noise, resulting in poor generalization.

Ridge regression handled the situation much better, with a train MSE of 0.5033 and a test MSE of 1.0068. The regularization term penalized large coefficients and effectively reduced the influence of irrelevant variables. The optimal lambda of 16.12 lies well inside the search grid, which suggests that moderate regularization was sufficient to control overfitting.
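
Since `find_opt_lam` selects lambda by minimizing the test MSE, the reported ridge test errors are optimistic. One common alternative is to hold out part of the training data for validation. The sketch below is an assumption-laden illustration: `ridge_fit`, `mse`, `select_lambda`, the logarithmic grid, and the stand-in data are all hypothetical and not part of the notebook.

```python
import numpy as np

def ridge_fit(X, y, lamb):
    """Ridge solution of (X^T X + lamb * I) w = X^T y, for lamb > 0."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lamb * np.eye(n), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

def select_lambda(X_train, y_train, val_fraction=0.25, lambdas=None):
    """Pick lambda on a validation slice carved out of the training data."""
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 25)   # assumed grid: 1e-3 ... 1e3
    split = int(np.ceil((1 - val_fraction) * X_train.shape[0]))
    X_fit, X_val = X_train[:split], X_train[split:]
    y_fit, y_val = y_train[:split], y_train[split:]
    errors = [mse(X_val, y_val, ridge_fit(X_fit, y_fit, lam)) for lam in lambdas]
    return lambdas[int(np.argmin(errors))]

# stand-in data: 5 informative columns and 35 irrelevant ones
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(210, 40))
y_demo = X_demo[:, :5] @ rng.uniform(size=5) + rng.normal(size=210)
best_lam = select_lambda(X_demo, y_demo)
print("selected lambda:", best_lam)
```

The test set is then used only once, to report the error of the model fitted with the selected lambda.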


### 2.

(a) Implement functions for logistic regression and hinge-loss classification.

In [11]:
# Evaluation for the classifiers: predictions are thresholded to {-1, +1},
# so each misclassification contributes (+-2)^2 = 4 and the MSE equals
# 4 * (misclassification rate).
def evaluate_logistic(X, y, w, b):
    y_pred = logistic_predict(X, w, b)
    mse = np.mean((y - y_pred)**2)
    return mse

def evaluate_hinge(X, y, w, b):
    y_pred = hinge_predict(X, w, b)
    mse = np.mean((y - y_pred)**2)
    return mse

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, w, b, threshold=0.5):
    p = sigmoid(X @ w + b) # this is the probability of the data point being in class 1 (so between 0 and 1)
    return np.where(p >= threshold, 1, -1)

def hinge_predict(X, w, b):
    return np.where(X @ w + b >= 0, 1, -1)

def train_logistic_gd(X, y, lr=0.01, num_iter=1000):
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0

    for i in range(num_iter):
        p = sigmoid(-y * (X @ w + b))
        grad_w = -(X.T @ (p * y)) / n
        grad_b = -np.sum(p * y) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_hinge_gd(X, y, C=1.0, lr=0.01, num_iter=1000):
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0

    for i in range(num_iter):
        scores = X @ w + b
        viol = y * scores < 1

        grad_w = w - C * (X[viol].T @ y[viol])
        grad_b = -C * np.sum(y[viol])

        # update
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
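
The gradient used in `train_logistic_gd` corresponds to the mean logistic loss $\frac{1}{n}\sum_i \log(1 + e^{-y_i(x_i^T w + b)})$. A quick finite-difference check of that gradient is sketched below; the helper names `logistic_loss` and `logistic_grad_w` and the random test data are assumptions made for this check.

```python
import numpy as np

def logistic_loss(X, y, w, b):
    # mean of log(1 + exp(-y * (Xw + b))), computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -y * (X @ w + b)))

def logistic_grad_w(X, y, w, b):
    # same expression as in train_logistic_gd: p = sigmoid(-y * score)
    p = 1.0 / (1.0 + np.exp(y * (X @ w + b)))
    return -(X.T @ (p * y)) / X.shape[0]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = np.where(rng.normal(size=50) >= 0, 1.0, -1.0)
w0 = rng.normal(size=4)
b0 = 0.3

# central finite differences, one coordinate of w at a time
h = 1e-6
numeric = np.zeros(4)
for j in range(4):
    e = np.zeros(4)
    e[j] = h
    loss_plus = logistic_loss(X_demo, y_demo, w0 + e, b0)
    loss_minus = logistic_loss(X_demo, y_demo, w0 - e, b0)
    numeric[j] = (loss_plus - loss_minus) / (2 * h)

analytic = logistic_grad_w(X_demo, y_demo, w0, b0)
print("max abs difference:", np.max(np.abs(numeric - analytic)))
```

If the analytic and numeric gradients agree to several decimal places, the update rule is consistent with the loss it is meant to minimize.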


(b) Create a random data matrix $X$ and construct an output vector $y$ by generating a random weight vector $w$ and setting $y_i = \text{sign}(x^T_i w)$, where $x^T_i$ is the $i$-th row of $X$. Use a test/train split and check the performance of OLS, Ridge regression, logistic regression and hinge-loss classification for binary classification. Do you see a large difference in performance between these methods?

In [24]:
# create a random data matrix X
n = 300
d = 20

X = np.random.normal(size = (n, d))
w = np.random.normal(size = d)
eps = np.random.normal(size = n)
y = np.sign(X @ w + eps)

p = 0.7
X_train, X_test, y_train, y_test = train_test_split(X, y, p)

# get optimal lambda
opt_lam = find_opt_lam(X_train, y_train, X_test, y_test)

# perform OLS
w_OLS = OLS(X_train, y_train)
perf_OLS_tr, perf_OLS_te = evaluate(X_train, y_train, w_OLS), evaluate(X_test, y_test, w_OLS)

# perform Ridge
w_ridge = ridge(X_train, y_train, opt_lam)
perf_ridge_tr, perf_ridge_te = evaluate(X_train, y_train, w_ridge), evaluate(X_test, y_test, w_ridge)

# perform logistic regression
w_log, b_log = train_logistic_gd(X_train, y_train)
perf_log_tr, perf_log_te = evaluate_logistic(X_train, y_train, w_log, b_log), evaluate_logistic(X_test, y_test, w_log, b_log)

# perform hinge loss
w_hinge, b_hinge = train_hinge_gd(X_train, y_train)
perf_hinge_tr, perf_hinge_te = evaluate_hinge(X_train, y_train, w_hinge, b_hinge), evaluate_hinge(X_test, y_test, w_hinge, b_hinge)

# print all performances
print(f"OLS train MSE: {np.round(perf_OLS_tr, 4)}")
print(f"OLS test MSE: {np.round(perf_OLS_te, 4)}")
print("-" * 35)
print(f"ridge train MSE: {np.round(perf_ridge_tr, 4)}")
print(f"ridge test MSE: {np.round(perf_ridge_te, 4)}")
print("-" * 35)
print(f"logistic train MSE: {np.round(perf_log_tr, 4)}")
print(f"logistic test MSE: {np.round(perf_log_te, 4)}")
print("-" * 35)
print(f"hinge train MSE: {np.round(perf_hinge_tr, 4)}")
print(f"hinge test MSE: {np.round(perf_hinge_te, 4)}")

OLS train MSE: 0.4251
OLS test MSE: 0.4667
-----------------------------------
ridge train MSE: 0.428
ridge test MSE: 0.4633
-----------------------------------
logistic train MSE: 0.3429
logistic test MSE: 0.4889
-----------------------------------
hinge train MSE: 0.2667
hinge test MSE: 0.4444
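
The four numbers above are not on the same scale: the OLS and ridge MSE are computed on real-valued predictions, while the logistic and hinge MSE are computed on thresholded predictions in $\{-1, 1\}$, where each error contributes $(\pm 2)^2 = 4$. A minimal sketch of a common scale is given below, using a hypothetical `accuracy` helper and hand-made toy scores rather than the notebook's models.

```python
import numpy as np

def accuracy(y_true, scores):
    """Fraction of labels matched after thresholding real-valued scores at 0."""
    y_pred = np.where(scores >= 0, 1, -1)
    return np.mean(y_pred == y_true)

# toy labels and real-valued scores (e.g. X @ w for OLS, X @ w + b for hinge)
y_true = np.array([1, -1, 1, 1, -1, -1, 1, -1, 1, 1])
scores = np.array([0.8, -0.3, -0.1, 1.2, 0.4, -2.0, 0.5, -0.7, 0.9, 0.2])

y_pred = np.where(scores >= 0, 1, -1)
acc = accuracy(y_true, scores)
mse_pm1 = np.mean((y_true - y_pred) ** 2)   # MSE on thresholded predictions

print("accuracy:", acc)
print("MSE on +-1 predictions:", mse_pm1, "= 4 * (1 - accuracy)")
```

By this relation, the hinge test MSE of 0.4444 reported above corresponds to a misclassification rate of about 0.111. Applying the same thresholding to `X_test @ w_OLS` and `X_test @ w_ridge` would make all four methods directly comparable.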


(c) Now create a data set $(X, y)$ for binary classification (with $X \in \mathbb{R}^{n \times d}$ and $y \in \{-1, 1\}^n$) such that, given a test/train split, OLS and Ridge perform very badly but logistic regression and hinge-loss classification perform well. What kind of properties of your data set are responsible for this?

In [30]:
# create a random data matrix X
n = 300
d = 20

X = np.random.normal(size = (n, d))
w = np.random.normal(size = d)
eps = np.random.geometric(p = 0.01, size = n)
y = np.sign(X @ w + eps)

p = 0.7
X_train, X_test, y_train, y_test = train_test_split(X, y, p)

# get optimal lambda
opt_lam = find_opt_lam(X_train, y_train, X_test, y_test)

# perform OLS
w_OLS = OLS(X_train, y_train)
perf_OLS_tr, perf_OLS_te = evaluate(X_train, y_train, w_OLS), evaluate(X_test, y_test, w_OLS)

# perform Ridge
w_ridge = ridge(X_train, y_train, opt_lam)
perf_ridge_tr, perf_ridge_te = evaluate(X_train, y_train, w_ridge), evaluate(X_test, y_test, w_ridge)

# perform logistic regression
w_log, b_log = train_logistic_gd(X_train, y_train)
perf_log_tr, perf_log_te = evaluate_logistic(X_train, y_train, w_log, b_log), evaluate_logistic(X_test, y_test, w_log, b_log)

# perform hinge loss
w_hinge, b_hinge = train_hinge_gd(X_train, y_train)
perf_hinge_tr, perf_hinge_te = evaluate_hinge(X_train, y_train, w_hinge, b_hinge), evaluate_hinge(X_test, y_test, w_hinge, b_hinge)

# print all performances
print(f"OLS train MSE: {np.round(perf_OLS_tr, 4)}")
print(f"OLS test MSE: {np.round(perf_OLS_te, 4)}")
print("-" * 35)
print(f"ridge train MSE: {np.round(perf_ridge_tr, 4)}")
print(f"ridge test MSE: {np.round(perf_ridge_te, 4)}")
print("-" * 35)
print(f"logistic train MSE: {np.round(perf_log_tr, 4)}")
print(f"logistic test MSE: {np.round(perf_log_te, 4)}")
print("-" * 35)
print(f"hinge train MSE: {np.round(perf_hinge_tr, 4)}")
print(f"hinge test MSE: {np.round(perf_hinge_te, 4)}")

OLS train MSE: 0.9498
OLS test MSE: 1.0619
-----------------------------------
ridge train MSE: 0.9547
ridge test MSE: 1.0302
-----------------------------------
logistic train MSE: 0.0571
logistic test MSE: 0.0444
-----------------------------------
hinge train MSE: 0.0381
hinge test MSE: 0.0444


### Question 2 (c) — Discussion
We created a dataset whose noise follows a geometric distribution. This resulted in poor performance for OLS and Ridge, since these methods treat $y$ as a continuous variable and fit a linear function to it by least squares. However, $y$ is binary, and the geometric noise (with $p = 0.01$ it is always at least 1 and has mean 100) shifts $Xw$ strongly in the positive direction, which violates the assumption of zero-mean, normally distributed errors. Our OLS and Ridge implementations also have no intercept term, so they cannot absorb this shift.

Logistic regression and hinge-loss classification, on the other hand, optimize a classification loss rather than the squared prediction error, and both learn a bias $b$. They make fewer assumptions about the noise: the geometric noise offsets `X @ w`, but this can still yield a clear separation for the classifiers.
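
The role of the missing intercept can be probed with a small experiment. The sketch below regenerates data in the same way as the cell above (with its own seed, so the numbers will differ from the reported ones) and compares in-sample least-squares fits with and without an all-ones column. The helper `ols_mse` is hypothetical and uses `np.linalg.lstsq` rather than the notebook's `OLS`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 20
X_demo = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
offset = rng.geometric(p=0.01, size=n)     # always >= 1, mean 100
y_demo = np.sign(X_demo @ w_true + offset)

def ols_mse(X, y):
    """In-sample MSE of the least-squares fit of y on the columns of X."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ w - y) ** 2)

# appending a column of ones lets the linear model learn an intercept
X_with_bias = np.hstack([X_demo, np.ones((n, 1))])

mse_no_bias = ols_mse(X_demo, y_demo)
mse_with_bias = ols_mse(X_with_bias, y_demo)
print("fraction of +1 labels:", np.mean(y_demo == 1))
print("MSE without intercept:", mse_no_bias)
print("MSE with intercept:   ", mse_with_bias)
```

If most labels are $+1$, a model without an intercept and with zero-mean features can only predict values near 0 on average, whereas the model with an intercept can shift its predictions toward the dominant class.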