3. Implement a program to train a binary logistic regression model using mini-batch SGD. Use the logistic regression model we derived in class, corresponding to Equation (4.90) from the textbook, and where the feature transformation $\phi$ is the identity function.

The program should include the following hyperparameters:
- Batch size
- Fixed learning rate
- Maximum number of iterations

The logistic regression model we derived in class, corresponding to Equation (4.90) from the textbook, gives the cross-entropy loss:
 $$E(\mathbf{w}) = -\sum_{n=1}^N \left\{t_n \log \left(\sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))\right) + (1 - t_n) \log \left(1 - \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))\right) \right\} = \sum_{n=1}^N E_n(\mathbf{w})$$

 $$\Rightarrow \hspace{0.07cm} E_n(\mathbf{w}) = - t_n \log \left(\sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))\right) - (1 - t_n) \log \left(1 - \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))\right)$$

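Computing the gradient of $E_n(\mathbf{w})$ first requires the derivative of the logistic sigmoid $\sigma(z) = (1 + \exp(-z))^{-1}$; the gradient then follows from the chain rule: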

\begin{align*}
\frac{d \sigma}{d z}
& = \frac{d}{dz} (1 + \exp(-z))^{-1} = -(1 + \exp(-z))^{-2} (-\exp(-z)) \\
& = \frac{\exp(-z)}{(1 + \exp(-z))^2} = \frac{1}{1 + \exp(-z)} \left(\frac{\exp(-z)}{1 + \exp(-z)}\right) \\
& = \frac{1}{1 + \exp(-z)} \left(\frac{1 + \exp(-z)}{1 + \exp(-z)} - \frac{1}{1 + \exp(-z)}\right) \\
& = \frac{1}{1 + \exp(-z)} \left(1 - \frac{1}{1 + \exp(-z)}\right) \\
& = \sigma(z) (1 - \sigma(z))
\end{align*}

\begin{align*}
\Rightarrow \nabla E_n(\mathbf{w})
& = - t_n \left(\frac{\sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) \left(1 - \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))\right) \boldsymbol{\phi}(\mathbf{x}_n)}{\sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))}\right) - (1 - t_n) \left(\frac{-\sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) \left(1 - \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))\right) \boldsymbol{\phi}(\mathbf{x}_n)}{1 - \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))}\right) \\
& = - t_n \left(1 - \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))\right) \boldsymbol{\phi}(\mathbf{x}_n) + (1 - t_n) \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) \boldsymbol{\phi}(\mathbf{x}_n) \\
& = \left(-t_n + t_n \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) \right) \boldsymbol{\phi}(\mathbf{x}_n) + \left(\sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) - t_n \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) \right) \boldsymbol{\phi}(\mathbf{x}_n) \\
& = \left(-t_n + t_n \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) + \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) - t_n \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) \right) \boldsymbol{\phi}(\mathbf{x}_n) \\
& = \left(\sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) - t_n \right) \boldsymbol{\phi}(\mathbf{x}_n)
\end{align*}
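
As a sanity check on the derivation above, the closed-form gradient can be compared against a central finite-difference approximation. The following is a minimal sketch on synthetic data; the helper names (`loss_n`, `grad_n`, `numerical_grad`) and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def loss_n(w, phi_n, t_n):
    # Per-sample cross-entropy E_n(w)
    y = sigmoid(w @ phi_n)
    return -t_n * np.log(y) - (1 - t_n) * np.log(1 - y)

def grad_n(w, phi_n, t_n):
    # Closed form derived above: (sigma(w^T phi_n) - t_n) * phi_n
    return (sigmoid(w @ phi_n) - t_n) * phi_n

def numerical_grad(w, phi_n, t_n, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (loss_n(w + step, phi_n, t_n) - loss_n(w - step, phi_n, t_n)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w = rng.normal(size=4)
phi_n = rng.normal(size=4)

# Check both possible target values
max_abs_diff = max(
    np.max(np.abs(grad_n(w, phi_n, t_n) - numerical_grad(w, phi_n, t_n)))
    for t_n in (0, 1)
)
print("Largest gap between closed-form and numerical gradient:", max_abs_diff)
```

If the derivation is right, the reported gap should be close to the finite-difference error, far smaller than the gradient entries themselves.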

Stochastic gradient descent (SGD) makes an update to the weight vector based on one data point at a time, so that
\begin{equation}
\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n (\mathbf{w}^{(\tau)}).
\tag*{(5.43)}
\end{equation}

Thus, for mini-batch SGD with a fixed learning rate $\eta$, where $B_k$ denotes the set of sample indices in the mini-batch drawn at iteration $k$, the update rule is $$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \eta \sum_{n \in B_k} \nabla E_n (\mathbf{w}^{(k)}).$$
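
Because this rule sums the per-sample gradients instead of averaging them, the size of each step grows with the batch size. The sketch below, on synthetic data with an arbitrarily chosen batch, illustrates that one summed update with learning rate $\eta$ matches one averaged update with learning rate $\eta |B_k|$. All variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
phi = rng.normal(size=(8, 3))       # 8 synthetic samples with 3 features each
t = rng.integers(0, 2, size=8)      # random binary targets
w = rng.normal(size=3)
eta = 0.1

batch = np.array([1, 4, 6])         # an arbitrary mini-batch B_k
residual = sigmoid(phi[batch] @ w) - t[batch]

grad_sum = phi[batch].T @ residual  # summed gradient, as in the rule above
grad_mean = grad_sum / batch.size   # averaged variant, for comparison

w_sum = w - eta * grad_sum
w_mean = w - (eta * batch.size) * grad_mean
print("Updates agree:", np.allclose(w_sum, w_mean))
```

One consequence is that, when comparing batch sizes at a fixed $\eta$, larger batches also take proportionally larger steps.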

In [None]:
import numpy as np

def log_sigmoid(z):
  # Logistic sigmoid sigma(z) = 1 / (1 + exp(-z)); despite the name, no logarithm is taken
  return 1 / (1 + np.exp(-z))

def log_regression(x, t, initial_w, batch_size, learning_rate, max_iterations):
  # Work on a float copy so the caller's initial_w is not modified in place
  w = np.array(initial_w, dtype=float)
  for k in range(max_iterations):
    # Randomly shuffle the samples (rows) together with their targets
    indices = np.random.permutation(x.shape[0])
    x = x[indices]
    t = t[indices]
    # Prepend a column of ones so that w[0] acts as the bias term
    design_mat = np.hstack((np.ones((x.shape[0], 1)), x))

    # Randomly select a mini-batch B_k (without replacement)
    b_k = np.random.choice(x.shape[0], batch_size, replace=False)

    # Sum the per-sample gradients (sigma(w^T phi_n) - t_n) * phi_n over B_k
    phi_b = design_mat[b_k]
    grad = phi_b.T @ (log_sigmoid(phi_b @ w) - t[b_k])

    # Update w
    w -= learning_rate * grad

  return w
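
To illustrate how such a training loop behaves, here is a self-contained sketch that runs a compact variant of the same update rule on synthetic, well-separated 2-D data. The function `minibatch_sgd`, the toy data, and the hyperparameter values are assumptions chosen for illustration and are not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd(x, t, w0, batch_size, learning_rate, max_iterations, rng):
    # Hypothetical compact variant: identity features plus a bias column
    phi = np.hstack((np.ones((x.shape[0], 1)), x))
    w = np.array(w0, dtype=float)
    for _ in range(max_iterations):
        # Draw a mini-batch without replacement and apply the summed-gradient update
        batch = rng.choice(phi.shape[0], size=batch_size, replace=False)
        grad = phi[batch].T @ (sigmoid(phi[batch] @ w) - t[batch])
        w -= learning_rate * grad
    return w

rng = np.random.default_rng(0)
# Two Gaussian blobs centred at (-2, -2) and (2, 2); synthetic, for illustration only
x0 = rng.normal(loc=-2.0, scale=1.0, size=(100, 2))
x1 = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
x_toy = np.vstack((x0, x1))
t_toy = np.concatenate((np.zeros(100), np.ones(100)))

w_toy = minibatch_sgd(x_toy, t_toy, rng.normal(size=3), batch_size=20,
                      learning_rate=0.01, max_iterations=500, rng=rng)

# Threshold the predicted probabilities at 0.5 to obtain class labels
phi_toy = np.hstack((np.ones((200, 1)), x_toy))
toy_accuracy = np.mean((sigmoid(phi_toy @ w_toy) >= 0.5) == t_toy)
print("Training accuracy on the toy data:", toy_accuracy)
```

Passing a `numpy.random.Generator` explicitly, as done here, is one way to keep runs reproducible without relying on the global seed.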


4. In this problem, you will run a logistic regression model for classification on a breast cancer dataset.

(a) Download the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository or scikit-learn’s built-in datasets.

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load and return the breast cancer wisconsin dataset from scikit-learn
breast_cancer = load_breast_cancer(as_frame=True)
x = breast_cancer.data  # (data matrix)
t = breast_cancer.target  # (classification target)

(b) Split the dataset into train, validation, and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Randomly split the data into train, validation, and test sets (75%-15%-10%)
# Split x and t into 75% training and 25% validation/test
x_train, x_val_test, t_train, t_val_test = train_test_split(x, t,
                                                            test_size=0.25,
                                                            random_state=1)
# Split the validation/test sets into 60% validation and 40% test
x_val, x_test, t_val, t_test = train_test_split(x_val_test, t_val_test,
                                                test_size=0.4, random_state=1)

(c) Report the size of each class in your training (+ validation) set.

In [None]:
import numpy as np

print("Size of class 0 in training set:", t_train.value_counts()[0],
      "\nSize of class 1 in training set:", t_train.value_counts()[1],
      "\n\nSize of class 0 in validation set:", t_val.value_counts()[0],
      "\nSize of class 1 in validation set:", t_val.value_counts()[1],
      "\n\nSize of class 0 in test set:", t_test.value_counts()[0],
      "\nSize of class 1 in test set:", t_test.value_counts()[1])

Size of class 0 in training set: 157 
Size of class 1 in training set: 269 

Size of class 0 in validation set: 32 
Size of class 1 in validation set: 53 

Size of class 0 in test set: 23 
Size of class 1 in test set: 35


(d) Train a binary logistic regression model using your implementation from problem 3. Initialize the model weights randomly, sampling from a standard Gaussian distribution. Experiment with different choices of fixed learning rate and batch size.

In [None]:
## Experiment with different choices of fixed learning rate and batch size,
## using the validation set to pick the best combination:
# Set seed for reproducibility
np.random.seed(1)

# Initialize w randomly by sampling from a standard Gaussian distribution
initial_w = np.random.normal(0, 1, x.shape[1] + 1)

# Normalize all the features of x_train and x_val
x_train = (x_train - np.mean(x_train, axis=0)) / np.std(x_train, axis=0)
x_val = (x_val - np.mean(x_val, axis=0)) / np.std(x_val, axis=0)

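def evaluate_performance_on_validation_set(w_train, x_val, t_val):
  x_val_design = np.hstack((np.ones((x_val.shape[0], 1)), x_val))
  # Predicted probabilities of class 1
  predicted_prob_val = log_sigmoid(x_val_design @ w_train)
  # Convert to binary class labels before comparing with the targets
  predicted_t_val = (predicted_prob_val >= 0.5).astype(int)
  accuracy = np.mean(predicted_t_val == t_val)
  return accuracy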

learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
batch_sizes = [30, 60, 90, 120, 150]

best_learning_rate = 0
best_batch_size = 0
best_accuracy = 0

for learning_rate in learning_rates:
  for batch_size in batch_sizes:
    w_train = log_regression(x_train.values, t_train.values, initial_w.copy(),
                             batch_size, learning_rate, max_iterations=100)
    accuracy = evaluate_performance_on_validation_set(w_train, x_val.values, t_val.values)
    if accuracy > best_accuracy:
      best_accuracy = accuracy
      best_learning_rate = learning_rate
      best_batch_size = batch_size

best_learning_rate, best_batch_size

(0.1, 120)
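
Each split in this problem is standardized with its own mean and standard deviation. A common alternative is to compute the statistics on the training split only and reuse them for the validation and test splits, so that every split passes through the same transformation. The sketch below shows that pattern on synthetic arrays; `fit_standardizer` and `apply_standardizer` are hypothetical helper names.

```python
import numpy as np

def fit_standardizer(x_fit):
    # Column-wise mean and (population) standard deviation of the fitting split
    return x_fit.mean(axis=0), x_fit.std(axis=0)

def apply_standardizer(x_new, mean, std):
    # Reuse the fitted statistics on any other split
    return (x_new - mean) / std

rng = np.random.default_rng(0)
x_fit = rng.normal(loc=5.0, scale=3.0, size=(200, 4))    # stand-in for a training split
x_other = rng.normal(loc=5.0, scale=3.0, size=(50, 4))   # stand-in for a validation or test split

mean, std = fit_standardizer(x_fit)
x_fit_std = apply_standardizer(x_fit, mean, std)
x_other_std = apply_standardizer(x_other, mean, std)

print("Fitting split mean after scaling:", x_fit_std.mean(axis=0).round(3))
print("Other split mean after scaling:", x_other_std.mean(axis=0).round(3))
```

With this approach the held-out split is generally not exactly zero-mean and unit-variance, which is expected: it is scaled by the training statistics, not by its own.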

In [None]:
## Using the best choice of fixed learning rate and batch size:
# Set seed for reproducibility
np.random.seed(1)

# Initialize w randomly by sampling from a standard Gaussian distribution
initial_w = np.random.normal(0, 1, x.shape[1] + 1)

# Normalize all the features of x_train
x_train = (x_train - np.mean(x_train, axis=0)) / np.std(x_train, axis=0)

# Train a binary logistic regression model by using my implementation from Problem 3
w_train = log_regression(x_train.values, t_train.values, initial_w,
                         best_batch_size, best_learning_rate, max_iterations=100)

(e) Use the trained model to report the performance of the model on the test set. For evaluation metrics, use accuracy, precision, recall, and F1-score.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Normalize all the features of x_test
x_test = (x_test - np.mean(x_test, axis=0)) / np.std(x_test, axis=0)

# Use the trained model on the test set
x_test_design = np.hstack((np.ones((x_test.shape[0], 1)), x_test))
predicted_t_test = log_sigmoid(x_test_design @ w_train)

# Convert to binary class labels
predicted_t_test = (predicted_t_test >= 0.5).astype(int)

# Report the model's performance
print("Accuracy:", accuracy_score(t_test, predicted_t_test),
      "\nPrecision:", precision_score(t_test, predicted_t_test),
      "\nRecall:", recall_score(t_test, predicted_t_test),
      "\nF1-Score:", f1_score(t_test, predicted_t_test))

Accuracy: 0.9827586206896551 
Precision: 0.9722222222222222 
Recall: 1.0 
F1-Score: 0.9859154929577465
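
To make the four metrics concrete, the sketch below computes them from confusion counts with class 1 as the positive class, which is the default for scikit-learn's `precision_score`, `recall_score`, and `f1_score`. The label arrays are constructed for illustration so that they match the counts implied by the output above (23 class-0 and 35 class-1 test samples, with a single class-0 sample predicted as class 1); `binary_metrics` is a hypothetical helper and does not handle empty denominators.

```python
import numpy as np

def binary_metrics(t_true, t_pred):
    # Confusion counts with class 1 treated as the positive class
    t_true = np.asarray(t_true)
    t_pred = np.asarray(t_pred)
    tp = np.sum((t_pred == 1) & (t_true == 1))
    fp = np.sum((t_pred == 1) & (t_true == 0))
    fn = np.sum((t_pred == 0) & (t_true == 1))
    tn = np.sum((t_pred == 0) & (t_true == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative labels: 23 samples of class 0 followed by 35 samples of class 1,
# with exactly one class-0 sample predicted as class 1
t_true = np.array([0] * 23 + [1] * 35)
t_pred = np.array([1] + [0] * 22 + [1] * 35)

accuracy, precision, recall, f1 = binary_metrics(t_true, t_pred)
print("Accuracy:", accuracy, "\nPrecision:", precision,
      "\nRecall:", recall, "\nF1-Score:", f1)
```

In terms of counts, these labels give accuracy 57/58, precision 35/36, recall 35/35, and F1-score 70/71.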


(f) Summarize your findings.

I found that the trained binary logistic regression model performed very well on the test set of the Wisconsin Breast Cancer dataset on all four evaluation metrics (accuracy, precision, recall, and F1-score). The dataset's features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. The target variable, diagnosis, has two classes: malignant and benign. In scikit-learn's copy of the dataset, malignant is encoded as 0 and benign as 1, and `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default, so the reported precision, recall, and F1-score describe the benign class.

An accuracy of approximately 98.3% means that the model classified 57 of the 58 test samples correctly. A precision of approximately 97.2% means that 35 of the 36 samples predicted as benign were actually benign, and a recall of 100% means that all 35 benign samples were identified as benign. The single error is therefore a malignant sample predicted as benign, which is arguably the more costly kind of mistake in a diagnostic setting. An F1-score of approximately 98.6% reflects both high precision and high recall. Overall, these results suggest that the model generalizes well to unseen data, although the test set is small (58 samples) and the hyperparameters were chosen from a single train/validation split, so the estimates carry some uncertainty.