
## 3. Program to train a binary logistic regression model using mini-batch SGD.

In [None]:
import numpy as np

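def lr_minibatchSGD(X, y, learning_rate=0.01, batch_size=32, max_iter=1000):
    """Train a binary logistic regression model (no intercept term) with mini-batch SGD.

    max_iter is the number of passes over the shuffled data; returns the weight vector.
    """
    def sigmoid(z):
        # clip the logits so np.exp cannot overflow for large-magnitude inputs
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def cross_entropy_loss(y_true, y_pred, eps=1e-12):
        # available for monitoring; not used by the update below
        y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    num_samples, num_features = X.shape
    w = np.zeros(num_features)  # initial weights

    for iteration in range(max_iter):
        # reshuffle the sample order at the start of every pass
        indices = np.arange(num_samples)
        np.random.shuffle(indices)

        # mini-batch SGD
        for start in range(0, num_samples, batch_size):
            batch_indices = indices[start:start + batch_size]
            # select the batch
            X_batch = X[batch_indices]
            y_batch = y[batch_indices]

            # predict
            linear_model = np.dot(X_batch, w)
            y_pred = sigmoid(linear_model)

            # calculate the gradient, averaged over the actual batch length
            # (the last batch of a pass can be shorter than batch_size)
            err = y_pred - y_batch
            gradient = np.dot(X_batch.T, err) / len(batch_indices)

            # update weights
            w -= learning_rate * gradient

    return w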
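
For reference, the inner-loop update corresponds to a gradient step on the mean cross-entropy loss of one mini-batch. The sketch below uses notation introduced here rather than taken from the code: $B$ is the set of indices in the batch, $\sigma$ is the sigmoid, and $\eta$ stands for `learning_rate`. $|B|$ is the number of samples actually in the batch, which equals `batch_size` except possibly for the last batch of a pass.

```latex
\begin{aligned}
\hat{y}_i &= \sigma(w^\top x_i) = \frac{1}{1 + e^{-w^\top x_i}}, \\
L_B(w) &= -\frac{1}{|B|} \sum_{i \in B} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right], \\
\nabla_w L_B(w) &= \frac{1}{|B|} \sum_{i \in B} (\hat{y}_i - y_i)\, x_i
                 = \frac{1}{|B|} X_B^\top (\hat{y}_B - y_B), \\
w &\leftarrow w - \eta \, \nabla_w L_B(w).
\end{aligned}
```

The third line is what `np.dot(X_batch.T, err)` computes before the division, with `err` playing the role of $\hat{y}_B - y_B$.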


## 4. Run LR model for classification on a breast cancer data set.

a). Load the Wisconsin Breast Cancer dataset (bundled with scikit-learn).

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
bc = load_breast_cancer()

b). Split the dataset into train, validation, and test sets.

In [None]:
X = bc.data
y = bc.target
# split data: train/validation/test = 75%/15%/10%
X_train, X_other, y_train, y_other = train_test_split(X, y, train_size=0.75, random_state=42)
# 40% of the held-out 25% becomes the test set (10% overall); the rest is validation (15%)
X_val, X_test, y_val, y_test = train_test_split(X_other, y_other, test_size=0.4, random_state=42)
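
The two-stage split can be checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds a fractional `train_size` down and a fractional `test_size` up; that rounding rule is an assumption of this example, not something stated in the notebook. The variable names are illustrative.

```python
import math

n_samples = 569  # number of samples in scikit-learn's breast cancer dataset

# first split: train_size=0.75 (assumed to round down)
n_train = math.floor(0.75 * n_samples)
n_other = n_samples - n_train

# second split of the remainder: test_size=0.4 (assumed to round up)
n_test = math.ceil(0.4 * n_other)
n_val = n_other - n_test

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_samples:.1%})")
```

Under these assumptions the test set has 58 samples, which agrees with the support column of the classification report in part e).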

c). Report the size of each class in the training (+ validation) set.

In [None]:
count1 = int(y_train.sum() + y_val.sum())  # labels are 0/1, so the sum counts class 1
count0 = (y_train.size + y_val.size) - count1
print(f"Class 0: {count0}; Class 1: {count1}")

Class 0: 191; Class 1: 320
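
An alternative way to get per-class counts is `np.bincount`, which tallies every label value in one call. A minimal sketch on a hypothetical toy label array (the `labels` values are made up for illustration and stand in for the concatenated training and validation labels):

```python
import numpy as np

# hypothetical toy labels standing in for np.concatenate([y_train, y_val])
labels = np.array([0, 1, 1, 0, 1, 1, 1])

# counts[k] is the number of samples with label k; minlength=2 keeps both
# entries present even if one class happens to be absent
counts = np.bincount(labels, minlength=2)
print(f"Class 0: {counts[0]}; Class 1: {counts[1]}")
```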


d). Train the binary LR model.

In [None]:
weights = lr_minibatchSGD(X_train, y_train, learning_rate=0.01, batch_size=10, max_iter=1000)
print("Trained Weights:", weights)


Trained Weights: [ 2.21871603e+01 -1.88603352e+01  7.97604289e+01  4.21492695e+00
 -1.47805096e-01 -1.60049099e+00 -2.41275463e+00 -9.55374883e-01
 -2.49776550e-01 -2.02521039e-02  3.12011686e-01  5.59777914e-01
 -4.82374982e+00 -2.38971891e+01 -3.33956557e-02 -4.35216372e-01
 -5.67484944e-01 -1.28310058e-01 -9.22341163e-02 -3.83030354e-02
  2.29951067e+01 -4.81254250e+01  3.53101443e+01 -1.56258885e+01
 -4.38248902e-01 -5.46404297e+00 -6.83971581e+00 -1.83039512e+00
 -1.27069965e+00 -4.09695821e-01]
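
The trained weights above range in magnitude from roughly 0.02 to 80, which often happens when features on very different scales are passed to SGD unscaled. One common remedy, not used in the run above, is to standardize each feature with statistics computed on the training set only. The sketch below is a minimal illustration under that assumption; `standardize` and the toy arrays are hypothetical names introduced here. Because the model has no intercept term, centered features would normally be paired with an added constant column.

```python
import numpy as np

def standardize(X_fit, *X_apply):
    """Scale features to zero mean and unit variance using statistics of X_fit only."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return [(X - mean) / std for X in (X_fit, *X_apply)]

# hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[1000.0, 0.1], scale=[200.0, 0.02], size=(100, 2))
X_te = rng.normal(loc=[1000.0, 0.1], scale=[200.0, 0.02], size=(20, 2))

# the held-out array is transformed with the training mean and std
X_tr_s, X_te_s = standardize(X_tr, X_te)
print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
```

Reusing the training statistics for the validation and test sets keeps information from held-out data out of the preprocessing step.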


e). Report performance on the test set using accuracy, precision, recall, and F1-score.

In [None]:
def sigmoid(z):
    # clip the logits so np.exp cannot overflow for large-magnitude inputs
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def predict(X, w):
    probs = sigmoid(np.dot(X, w))
    return np.where(probs >= 0.5, 1, 0)  # classify with a probability threshold of 0.5

y_pred_test = predict(X_test, weights)

print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.88      1.00      0.93        21
           1       1.00      0.92      0.96        37

    accuracy                           0.95        58
   macro avg       0.94      0.96      0.95        58
weighted avg       0.95      0.95      0.95        58




f). Summarize findings.

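The model reaches 95% accuracy on the test set: it classifies 55 of the 58 test samples correctly. In scikit-learn's encoding of this dataset, class 0 is malignant and class 1 is benign. Class 1 has higher precision than class 0 (1.00 vs. 0.88): every sample predicted as class 1 really is class 1, while a few samples predicted as class 0 are actually class 1. Recall for class 0 is 1.00, so all 21 class 0 cases were identified and none was misclassified as class 1. Recall for class 1 is 0.92, so about 8% of the actual class 1 cases (3 of 37) were missed and labeled class 0. The F1-scores (0.93 for class 0 and 0.96 for class 1) show a good balance between precision and recall in both classes. Overall, the model performs well on this split, and its only errors are benign cases flagged as malignant rather than malignant cases missed. The test set is small (58 samples), so these figures should be read as estimates.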
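
The figures in the classification report can be reproduced by hand from confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are inferred from the rounded report (support of 21 and 37, recall of 1.00 and 0.92) and should be treated as an assumption. `confusion` and `class_metrics` are hypothetical names used only for this sketch.

```python
# Counts inferred from the rounded classification report (an assumption):
# all 21 actual class 0 samples predicted as 0; 34 of 37 class 1 samples predicted as 1.
confusion = {
    (0, 0): 21, (0, 1): 0,   # key is (actual, predicted), value is the count
    (1, 0): 3,  (1, 1): 34,
}

def class_metrics(confusion, cls):
    """Return (precision, recall, f1) for one class from a confusion dict."""
    tp = confusion[(cls, cls)]
    predicted = sum(n for (_, pred), n in confusion.items() if pred == cls)
    actual = sum(n for (act, _), n in confusion.items() if act == cls)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

correct = sum(n for (act, pred), n in confusion.items() if act == pred)
accuracy = correct / sum(confusion.values())
print(f"accuracy: {accuracy:.2f}")
for cls in (0, 1):
    p, r, f = class_metrics(confusion, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Working from the counts makes the asymmetry explicit: the three errors are all class 1 samples predicted as class 0, which lowers precision for class 0 and recall for class 1 while leaving the other two figures at 1.00.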