<h1 style="color:blue;">Tutorial 2 Coding Task: Logistic Regression on the Breast Cancer Dataset</h1>

In this tutorial, we load the breast cancer dataset from `sklearn.datasets` and classify tumors using logistic regression, with and without normalization, reporting the average test performance over 10 experiments. We then apply ridge and lasso regularization to see whether the performance improves.

<h1 style="color:red;">Instructions</h1>

- Progress cell-by-cell.
- Look for the **<a style="color:red;">Execute</a>** blocks: code for the <a style="color:green;">green</a> tasks is already written, and you are expected to write code to carry out the remaining tasks.
- Check exercises 2.1, 2.2, and 2.3 in the Week 2 Ed lesson for help completing the tasks.
- After completing all the tasks, write your observations at the end.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import load_breast_cancer

import matplotlib.pyplot as plt
import seaborn as sns

<h2 style="color:blue;">1) Load the dataset</h2>

Data description
    
The `load_breast_cancer` dataset in scikit-learn is a classic binary‐classification benchmark. It contains 569 tumor samples described by 30 real-valued features—computed from digitized images of fine needle aspirate (FNA) of breast masses—such as radius, texture, perimeter, area, and smoothness (each measured as mean, standard error, and “worst”/largest value). The task is to predict whether a tumor is malignant (212 samples) or benign (357 samples). Data shape:   

- X: array of shape (569, 30)  
- y: array of shape (569,), values in {0 = malignant, 1 = benign}
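
As a quick sanity check of the description above, the following sketch counts the samples per label and pairs each label with its name. The names `demo`, `X_demo`, `y_demo`, and `counts` are chosen here for illustration only, so that they do not overwrite the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Illustrative names only; the notebook itself uses `data`, `X`, and `y`
demo = load_breast_cancer()
X_demo, y_demo = demo.data, demo.target

# Count how many samples carry each integer label (label 0 first, then label 1)
counts = np.bincount(y_demo)

# target_names is ordered by label, so position k holds the name of label k
for label, (name, count) in enumerate(zip(demo.target_names, counts)):
    print(f"label {label} = {name}: {count} samples")
```

The counts should agree with the 212 malignant and 357 benign samples quoted in the description.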

<h3 style="color:red;">Execute:</h3>

- <a style="color:green;">Load the dataset to create X and y</a>
- Print the shapes of X and y
- Print the first 5 samples 

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Write your code to print the shapes of X and y here


# Write your code to print the first 5 samples



<h2 style="color:blue;">2) Exploratory Data Analysis</h2>
<h3 style="color:red;">Execute</h3>

- <a style="color:green;">Load the data into a dataframe</a>
- Compute the correlation between the columns in the data (features + target) 
- Plot that correlation matrix as a heatmap using `seaborn`

In [None]:
# load into DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# compute correlation matrix (including target)


# optional: plot a heatmap



<h3 style="color:red;">Execute</h3>

- <a style="color:green;">Plot a histogram of sample counts per class</a>
- <a style="color:green;">Pick a feature and plot its distribution per class</a>
- Change the following code to pick a different feature of your choice and plot its distribution per class

In [None]:
# identify the classes
df['class']  = df['target'].map({0:'malignant', 1:'benign'})

# Histogram of sample counts per class
plt.figure(figsize=(4,3))
sns.countplot(x='class', data=df, hue='class', palette='Set2', legend=False)
plt.title("Class distribution")
plt.ylabel("Number of samples")
plt.show()

# pick a feature, e.g. 'mean radius'
feat = 'mean radius'
plt.figure(figsize=(6,4))
sns.histplot(df, x=feat, hue='class',
             element='step', stat='density',
             common_norm=False, palette='Set1')
plt.title(f"Distribution of {feat} by class")
plt.show()

<h2 style="color:blue;">3) Split the Data into 80/20 and Fit with 5-Fold Cross-Validation</h2>

<h3 style="color:red;">Execute</h3>

- <a style="color:green;">Split the dataset into 80% training and 20% testing data using `train_test_split`</a>
- <a style="color:green;">Fit a softmax regression model using 5-fold cross-validation</a>
- Fit the model `clf` using `clf.fit` on the training data
- Evaluate the model on the test data, that is, compute `y_pred`
- Print test accuracy and confusion matrix

**Note:** *We compute `cv_scores` before fitting the model because `cross_val_score` does its own fitting and scoring inside each fold: you pass it an unfitted estimator, and it fits a copy on the training portion of every fold behind the scenes. The final `.fit()` call is only there to produce the final model that you evaluate on the test data.*
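
The point of the note can be checked directly. The sketch below is a minimal illustration, not part of the exercise: it assumes the default `lbfgs` solver with a large `max_iter` (a different setup from the `saga` model used in this section) and uses illustrative names such as `demo_clf`. It tests for the `coef_` attribute, which scikit-learn only sets once an estimator has been fitted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative names only, so the notebook's own variables are not overwritten
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Assumption: default lbfgs solver; a ConvergenceWarning on unscaled data is harmless here
demo_clf = LogisticRegression(max_iter=10000)

# cross_val_score fits copies of demo_clf, one per fold
scores = cross_val_score(demo_clf, X_tr, y_tr, cv=5)
fitted_after_cv = hasattr(demo_clf, "coef_")

# Only an explicit fit() call trains demo_clf itself
demo_clf.fit(X_tr, y_tr)
fitted_after_fit = hasattr(demo_clf, "coef_")

print("Number of fold scores:", len(scores))
print("Fitted after cross_val_score:", fitted_after_cv)
print("Fitted after fit():", fitted_after_fit)
```

Because the estimator passed to `cross_val_score` is left untouched, calling `predict` on it before `.fit()` would raise an error.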

In [None]:
# Split the data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train softmax regression model with 5-fold cross-validation
clf = LogisticRegression(solver='saga', penalty=None, tol=1e-3, max_iter=1000)
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

print("Cross-validation scores:", cv_scores)
print("Mean CV accuracy:", np.mean(cv_scores))

# Fit the model on the training data


# Evaluate on the test data



<h2 style="color:blue;">3) Compute average test performance without normalization</h2>

We conduct 10 experiments without normalization and with different random seeds to compute the average performance on test data.

In [None]:
# Conduct 10 experiments with different random seeds
test_accuracies = []

for seed in range(10):
    np.random.seed(seed)

    # Write your code here





    

print("Test accuracies without normalization:", test_accuracies)
print("\nMean test accuracy without normalization:", np.mean(test_accuracies))

<h2 style="color:blue;">4) Normalize the data to check if we get performance improvement</h2>

<h3 style="color:red;">Execute</h3>

- Conduct 10 experiments with different random seeds. Normalize the dataset in each experiment using:
```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

**Note:** Here is what happens under the hood.
When you call  
```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
```
two things happen for each feature column $j$ of $X$:

1. *Fit*  
   Computes the sample mean  
     $$\mu_j = \frac{1}{n_{\rm train}}\sum_{i=1}^{n_{\rm train}}X_{i,j}$$
   Computes the standard deviation (with divisor $n_{\rm train}$, not $n_{\rm train}-1$)  
     $$\sigma_j = \sqrt{\frac{1}{n_{\rm train}}\sum_{i=1}^{n_{\rm train}}(X_{i,j}-\mu_j)^2}$$

2. *Transform* 
   Replaces each training value by  
     
     $$X'_{i,j} \;=\;\frac{X_{i,j}-\mu_j}{\sigma_j},$$
     so that the transformed column has zero mean and unit variance over the training set.

When you later do  
```python
X_test = scaler.transform(X_test)
```
it uses the same $\mu_j$ and $\sigma_j$ learned from the training data to standardize the test features:
$$
X'_{{\rm test},i,j}
=\frac{X_{{\rm test},i,j}-\mu_j}{\sigma_j}.
$$
This ensures your model sees test‐set features on the *exact* scale it was trained on.
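
The formulas above can be checked against `StandardScaler` with a few lines of NumPy. This is a minimal sketch under the assumption of a single 80/20 split with `random_state=0`; the names `mu`, `sigma`, `X_tr_manual`, and `demo_scaler` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative names only, so the notebook's own variables are not overwritten
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit: per-feature mean and standard deviation from the training data only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # NumPy's default ddof=0 divides by n_train, as in the formula

# Transform: apply the same mu and sigma to both the training and the test data
X_tr_manual = (X_tr - mu) / sigma
X_te_manual = (X_te - mu) / sigma

# The same two steps with scikit-learn
demo_scaler = StandardScaler()
X_tr_sk = demo_scaler.fit_transform(X_tr)
X_te_sk = demo_scaler.transform(X_te)

print("Train matches:", np.allclose(X_tr_manual, X_tr_sk))
print("Test matches:", np.allclose(X_te_manual, X_te_sk))
print("Largest |train mean| after scaling:", np.abs(X_tr_sk.mean(axis=0)).max())
```

Note that only the training columns end up with exactly zero mean and unit variance; the test columns are close to that but not exact, because $\mu_j$ and $\sigma_j$ come from the training data.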

In [None]:
# Conduct 10 experiments with different random seeds
test_accuracies = []

for seed in range(10):
    np.random.seed(seed)


    # Write your code here






print("Test accuracies with normalization:", test_accuracies)
print("\nMean test accuracy with normalization:", np.mean(test_accuracies))

<h2 style="color:blue;">5) Ridge and Lasso Regularization to Improve Performance</h2>

We now apply ridge and lasso regularization to the softmax regression model in addition to normalization. Again, conduct 10 experiments and compute the average accuracy.

<h3 style="color:red;">Execute</h3>

- For the ridge regularization, change the penalty to l2 using `penalty = 'l2'`
- For the lasso regularization, change the penalty to l1 using `penalty = 'l1'`
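
To see what the two penalties do to the fitted weights, the sketch below fits both variants on a single standardized split and counts the coefficients that are exactly zero. It is an illustration under stated assumptions, not the 10-experiment solution: one split with `random_state=0`, the default regularization strength `C=1.0`, and illustrative names such as `ridge_demo` and `lasso_demo`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative names only, so the notebook's own variables are not overwritten
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Standardize using statistics from the training data only
demo_scaler = StandardScaler()
X_tr = demo_scaler.fit_transform(X_tr)
X_te = demo_scaler.transform(X_te)

# Same solver settings as earlier in the notebook; only the penalty changes
ridge_demo = LogisticRegression(solver='saga', penalty='l2', tol=1e-3, max_iter=1000)
lasso_demo = LogisticRegression(solver='saga', penalty='l1', tol=1e-3, max_iter=1000)

for name, model in [("ridge (l2)", ridge_demo), ("lasso (l1)", lasso_demo)]:
    model.fit(X_tr, y_tr)
    n_zero = int(np.sum(model.coef_ == 0))  # weights driven exactly to zero
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}, "
          f"zero coefficients = {n_zero} of {model.coef_.size}")
```

Comparing the two zero-coefficient counts is one way to ground an observation about how l1 and l2 penalties differ.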

In [None]:
test_ridge_accuracies = []
test_lasso_accuracies = []

for seed in range(10):
    np.random.seed(seed)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    
    # Write your code here












print("Mean test accuracy with ridge penalty and normalization:", np.mean(test_ridge_accuracies))
print("Mean test accuracy with lasso penalty and normalization:", np.mean(test_lasso_accuracies))

<h2 style="color:blue;">6) Write at least four of your observations here</h2>

- Obs 1:
- Obs 2:
- Obs 3:
- Obs 4: