# Support Vector Classification
This exercise uses the Wisconsin breast cancer dataset (https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). Features are computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image. 

With all features, the two classes are close to linearly separable, as the first part shows with an SVC using a linear kernel.

If the dimensions are reduced, though, the classification problem becomes harder. For demonstration purposes, we will first apply a PCA and then rerun the classification with a linear and an RBF kernel.

#### Preparation

Load libraries and dataset (via sklearn for convenience)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
)

Load data:

In [None]:
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)

# Convert numeric labels to meaningful class names
y = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})

print("Feature matrix shape:", X.shape)
print("\nClass distribution:")
print(y.value_counts())

Train/test split: later, test how your predictions change if you leave out `stratify`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y  # ensures similar class proportions in test and training data
)
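To see what `stratify` does, the following sketch compares the share of malignant cases in the test set with and without stratification. It is only an illustration; the helper `malignant_share` is a made-up name and not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})


def malignant_share(labels):
    """Fraction of samples labelled "Malignant" (hypothetical helper)."""
    return (labels == "Malignant").mean()


# Same split settings as above, once with and once without stratification
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Full dataset:      {malignant_share(y):.3f}")
print(f"Stratified test:   {malignant_share(y_test_strat):.3f}")
print(f"Unstratified test: {malignant_share(y_test_plain):.3f}")
```

With stratification the test share should stay very close to the share in the full dataset; without it, the share depends on the random seed.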

## Part 1: Linear classification

Use a linear kernel in the SVC, optimise the hyperparameters using a Pipeline and a GridSearchCV, then run the classification and plot a confusion matrix to see how the classification compares to the RF model used last time!

Build the Pipeline: a scaler (vitally important for SVMs, because features with large numeric ranges would otherwise dominate the kernel) and the SVC.

In [None]:
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(
        kernel="linear",
        class_weight="balanced",  # "balanced" can help if classes are imbalanced
        probability=True,  # needed for predict_proba, which the metrics later rely on
    )),
])
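As a rough illustration of why the scaler matters, the sketch below compares the spread of two features before and after standardization. The two columns are picked arbitrarily for this example, and the scaler is fitted on the full data only for demonstration (inside the pipeline it is fitted on the training folds).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Two features on very different scales (arbitrary choice for illustration)
cols = ["mean area", "mean smoothness"]
raw_std = X[cols].std(ddof=0)

# After standardization every feature has mean 0 and unit variance
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
scaled_std = X_scaled[cols].std(ddof=0)

print("Raw standard deviations:")
print(raw_std)
print("\nScaled standard deviations:")
print(scaled_std)
```

Without scaling, "mean area" would contribute far more to the kernel values than "mean smoothness", simply because of its units.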

Define parameters for GridSearch:

In [None]:
param_grid = {
    "svc__C": [0.1, 1, 10, 100], # be sure to always use the same prefix as in the pipeline with two underscores!
    # 'svc__gamma': [0.01, 0.1, 1, 10]
}
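If you are unsure which keys are valid in `param_grid`, the pipeline can list them. A small sketch, using a reduced version of the pipeline above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear")),
])

# Every tunable parameter is exposed as "<step name>__<parameter name>"
svc_params = sorted(name for name in pipe.get_params() if name.startswith("svc__"))
print(svc_params)

# A key without the step prefix is rejected by the pipeline
try:
    pipe.set_params(C=10)
except ValueError as err:
    print("Rejected:", str(err)[:60], "...")
```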

In [None]:
grid = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

best_linear = grid.best_estimator_

print("Best parameters:", grid.best_params_)
print("Best CV ROC-AUC:", grid.best_score_)
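Beyond the single best score, `cv_results_` holds the scores of all candidates. The sketch below repeats the search in a self-contained form and tabulates the results. As an assumption to keep it quick, `probability=True`, `n_jobs` and `verbose` are left out here; the `roc_auc` scorer can work from `decision_function` instead.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear", class_weight="balanced")),
])
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100]}, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)

# One row per candidate value of C, sorted from best to worst
cols = ["param_svc__C", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid.cv_results_)[cols].sort_values("rank_test_score")
print(results.to_string(index=False))
```

Looking at `std_test_score` next to `mean_test_score` helps to judge whether the differences between the values of C are meaningful at all.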

Calculate confusion matrix:

In [None]:
ConfusionMatrixDisplay.from_estimator(best_linear, X_test, y_test)
plt.title("Linear kernel - Confusion Matrix")
plt.show()
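To read the plot correctly, it helps to know how the matrix is laid out. The sketch below uses a handful of made-up labels (not results from the dataset) and fixes the class order with `labels`; by default the classes are ordered alphabetically, so "Benign" comes first in the plot above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration only
y_true = ["Malignant", "Malignant", "Benign", "Benign", "Benign", "Malignant"]
y_pred = ["Malignant", "Benign", "Benign", "Benign", "Malignant", "Malignant"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
labels = ["Malignant", "Benign"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[2 1]
#  [1 2]]

# With "Malignant" as the positive class:
tp, fn = cm[0]  # malignant cases found / missed
fp, tn = cm[1]  # benign cases flagged as malignant / correctly cleared
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```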

For future reference: Other metrics to evaluate the performance:

In [None]:
# Define which class is the positive class
POS = "Malignant"

# Pick the probability column corresponding to the positive class
pos_idx = np.where(best_linear.classes_ == POS)[0][0]

y_pred = best_linear.predict(X_test)
y_proba = best_linear.predict_proba(X_test)[:, pos_idx]

# Metrics (pos_label is passed wherever the function accepts it).
# roc_auc_score has no pos_label, so binarize y_test against POS explicitly.
roc  = roc_auc_score(y_test == POS, y_proba)
acc  = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, pos_label=POS)
rec  = recall_score(y_test, y_pred, pos_label=POS)
f1   = f1_score(y_test, y_pred, pos_label=POS)


print(f"Test Accuracy:  {acc:.3f}")
print(f"Test Precision: {prec:.3f}")
print(f"Test Recall:    {rec:.3f}")
print(f"Test F1:        {f1:.3f}")
print(f"Test ROC-AUC:   {roc:.3f}")
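`roc_auc_score` has no `pos_label` argument, so the scores passed to it must belong to the class that is treated as positive. The toy sketch below (labels and probabilities are made up) shows what happens when the wrong probability column is used.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

POS = "Malignant"

# Toy labels and scores, made up for illustration only
y_true = np.array(["Malignant", "Benign", "Malignant", "Benign"])
p_malignant = np.array([0.9, 0.2, 0.6, 0.4])  # predicted P(Malignant)
p_benign = 1 - p_malignant                    # predicted P(Benign)

# Binarizing against POS makes the positive class explicit
auc_right = roc_auc_score(y_true == POS, p_malignant)
auc_wrong = roc_auc_score(y_true == POS, p_benign)

print(f"Scores of the positive class: AUC = {auc_right:.2f}")
print(f"Scores of the other class:    AUC = {auc_wrong:.2f}")
```

An AUC far below 0.5 on real data is therefore often a sign that the probability column and the positive class do not match.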




For reference: Plot ROC curve.

In [None]:

RocCurveDisplay.from_estimator(best_linear, X_test, y_test)
plt.title("ROC Curve (SVM)")
plt.show()


## Part 2: SVC on reduced dimensions

For demonstration purposes, only an arbitrary slice of the columns of X is used. Additionally, reduce the dimensions with PCA (as a step in the Pipeline!) and compare the performance of the best linear and the best RBF kernel!

Run a classification using an RF as well, with the same pipeline and only the model swapped.
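One possible way to swap only the model is a small factory function that builds the pipeline around any classifier. This is a sketch under assumptions: `make_pipe`, the step name `"model"`, `n_components=2` and the RF settings are choices made for this example, not prescribed by the exercise. It uses the same column slice as this part.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names).iloc[:, 15:21]
y = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


def make_pipe(model, n_components=2):
    """Scaler -> PCA -> model; n_components=2 is an assumed value."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n_components)),
        ("model", model),
    ])


models = {
    "SVC (linear)": SVC(kernel="linear"),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
scores = {}
for name, model in models.items():
    scores[name] = make_pipe(model).fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

The scaler is not needed by the RF itself, but keeping it makes the PCA step comparable across models.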

For each, display the confusion matrix.

Then check how the choice of slice affects the different models. If there is still some time left, try other kernels as well.

In [None]:
X = pd.DataFrame(data.data, columns=data.feature_names).iloc[:, 15:21]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y  # ensures similar class proportions in test and training data
)
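Before fixing the number of PCA components, it can help to look at how much variance each component explains on the sliced training data. A minimal sketch (the name `reducer` is made up here):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names).iloc[:, 15:21]
y = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit scaler and PCA on the training data only, keeping all components
reducer = Pipeline([("scaler", StandardScaler()), ("pca", PCA())])
reducer.fit(X_train)

ratios = reducer.named_steps["pca"].explained_variance_ratio_
for i, (r, c) in enumerate(zip(ratios, ratios.cumsum()), start=1):
    print(f"PC{i}: {r:.3f} of the variance (cumulative {c:.3f})")
```

Note that a component with little variance can still carry class information, so explained variance is only a rough guide for classification.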

#### Linear kernel
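
A possible starting point for the linear kernel, written as a self-contained sketch. The number of PCA components (2) is an assumption, the name `grid_linear` is made up, and the confusion matrix is printed as a table instead of plotted to keep the example short.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names).iloc[:, 15:21]
y = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),  # assumed number of components
    ("svc", SVC(kernel="linear", class_weight="balanced")),
])
param_grid = {"svc__C": [0.1, 1, 10, 100]}

grid_linear = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid_linear.fit(X_train, y_train)

print("Best parameters:", grid_linear.best_params_)
print(f"Best CV ROC-AUC: {grid_linear.best_score_:.3f}")

# Rows: true class, columns: predicted class
labels = ["Malignant", "Benign"]
cm = confusion_matrix(y_test, grid_linear.predict(X_test), labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))
```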

#### RBF kernel
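
For the RBF kernel, `gamma` joins `C` in the grid. The sketch below reuses the gamma values that are commented out in the Part 1 grid; the number of PCA components (2) is again an assumption and `grid_rbf` is a made-up name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names).iloc[:, 15:21]
y = pd.Series(data.target).map({0: "Malignant", 1: "Benign"})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),  # assumed number of components
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}

grid_rbf = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid_rbf.fit(X_train, y_train)

print("Best parameters:", grid_rbf.best_params_)
print(f"Best CV ROC-AUC: {grid_rbf.best_score_:.3f}")

# Rows: true class, columns: predicted class
labels = ["Malignant", "Benign"]
cm = confusion_matrix(y_test, grid_rbf.predict(X_test), labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))
```

Comparing the best CV score and the confusion matrix with those of the linear kernel shows whether the non-linear decision boundary helps on the reduced data.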