# Classification Pipeline (Breast Cancer Dataset)

## Introduction

**Logistic Regression**

The goal of this pipeline is to build a supervised machine learning model that can **predict a binary outcome** — in this case, whether a tumour is **malignant or benign**.


**Classification pipelines are used widely in financial services to make critical, data-driven decisions. Real-world examples include:**

- **Fraud detection** – classifying transactions as fraudulent or legitimate
- **Credit default prediction** – predicting whether a customer will repay or default on a loan
- **Customer churn prediction** – identifying whether a customer is likely to leave
- **Risk tiering** – assigning customers to risk bands (low, medium, high), which is a multi-class rather than a binary task

## Step 1: Imports

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

## Step 2: Load the Dataset
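We'll use the Wisconsin breast cancer dataset bundled with scikit-learn, which contains features computed from digitised images of cell nuclei. In this dataset, a target of `0` means malignant and `1` means benign.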

In [None]:
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target
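
Before modelling, it can help to check what was loaded. The snippet below is a minimal sketch that reloads the same dataset and prints its shape, label names, and class counts; the variable names simply mirror the cell above.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X = data.data      # pandas DataFrame of numeric features
y = data.target    # pandas Series of 0/1 labels

# Shape is (number of samples, number of features)
print(X.shape)

# Label names are indexed by target value: position 0 is class 0, position 1 is class 1
print(data.target_names.tolist())

# Number of samples in each class
print(y.value_counts())
```

Note that class `1` is the benign class here, so "positive" in the metrics later on refers to benign tumours unless the labels are remapped.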

## Why Do We Split Into Train and Test Sets?

In supervised machine learning, the aim is to build a model that performs well not only on the data it has seen (the training data) but also on new, unseen data. Splitting the data into training and test sets is how we check this.



### Training Set
- Used to train the model by learning patterns and relationships in the data.
- The model "fits" its parameters based on this subset.

### Test Set
- Set aside and never shown to the model during training.
- Used only to evaluate performance and simulate how the model would behave in the real world.



### What Happens If You Don’t Split?
- The model may simply memorise the training data (overfitting).
- You are highly likely to get a false sense of accuracy, because the model is being tested on data it has already seen.
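
The gap between training and test performance can be illustrated with a model that memorises easily. The sketch below swaps in an unpruned `DecisionTreeClassifier` purely for illustration (it is not the model used in this notebook) and compares accuracy on the data it was trained on with accuracy on held-out data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustration only: an unpruned tree keeps splitting until it fits
# the training rows as closely as it can
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Scoring on the training data rewards memorisation ...
train_accuracy = tree.score(X_train, y_train)
# ... whereas the held-out test data shows how well the tree generalises
test_accuracy = tree.score(X_test, y_test)

print(f"Accuracy on training data: {train_accuracy:.3f}")
print(f"Accuracy on held-out test data: {test_accuracy:.3f}")
```

If only the first number were reported, the model would look better than it really is.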



### Best Practices:
- Use `train_test_split()` from `sklearn.model_selection`
- Common split ratio: 70–80% train, 20–30% test
- Set a `random_state` to ensure results are reproducible
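
As a rough illustration of these practices, the sketch below runs the same split twice with a fixed `random_state` and checks that the same rows land in the test set. The `stratify=y` argument is an optional addition that is not used elsewhere in this notebook; it keeps the class proportions similar in both subsets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Two identical calls with the same random_state
# (stratify=y is an assumption added for this example)
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The same rows end up in the test set both times
same_rows = X_test_a.index.equals(X_test_b.index)
print(same_rows)

# Sizes of the 80/20 split
print(len(X_train_a), len(X_test_a))

# Share of class 1 in the full data, the training set, and the test set
print(round(y.mean(), 3), round(y_train_a.mean(), 3), round(y_test_a.mean(), 3))
```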



### Real-World Examples
- Train a model on past customer behaviour (e.g., 2022 data)
- Test it on how well it predicts outcomes (e.g., defaults, fraud) in 2023

This split helps ensure the model isn't just learning the past — it’s learning patterns that generalise to the future.
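
A time-based split like the one described above can be sketched with pandas. The table, its column names, and its values are hypothetical and exist only to show the idea: train on the earlier year, test on the later one.

```python
import pandas as pd

# Hypothetical loan records; column names and values are made up for illustration
loans = pd.DataFrame({
    "year": [2022, 2022, 2022, 2023, 2023],
    "income": [30_000, 52_000, 41_000, 75_000, 38_000],
    "defaulted": [0, 0, 1, 0, 1],
})

# Train on the past, test on the later period
train = loans[loans["year"] == 2022]
test = loans[loans["year"] == 2023]

X_train, y_train = train[["income"]], train["defaulted"]
X_test, y_test = test[["income"]], test["defaulted"]

print(len(X_train), len(X_test))
```

Unlike a random split, this keeps every test row strictly later than every training row, which is closer to how the model would be used in production.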


In [12]:
# Step 3 - Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Best Practice for Scaling in Classification Pipelines

### When is Scaling Necessary?

Scaling is important when using models that are sensitive to the magnitude of input features. This includes:

- Logistic Regression
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Neural Networks

It is particularly important when the features in the dataset have different units or ranges (e.g., income measured in thousands, and age measured in years).



### Why Scale?

- Without scaling, features with larger numerical ranges can disproportionately influence the model.
- Scaling can improve optimisation performance, especially for models that rely on gradient-based methods (such as logistic regression).
- It also improves the interpretability of model coefficients by placing features on a comparable scale.
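
To see why this matters for the dataset used here, the sketch below compares the ranges of two of its columns, `mean area` and `mean smoothness`. These two were picked only as an example of features on very different scales.

```python
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data

# Two features measured on very different scales
columns = ["mean area", "mean smoothness"]
ranges = X[columns].agg(["min", "max"])

print(ranges)
```

Without scaling, a one-unit change in `mean area` and a one-unit change in `mean smoothness` mean very different things, yet a regularised linear model treats their coefficients alike.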



### Recommended Scalers

| Scaler            | When to Use It |
|-------------------|----------------|
| `StandardScaler`  | The default choice for linear models. Standardises features to have a mean of 0 and a standard deviation of 1. |
| `MinMaxScaler`    | Appropriate when features must be scaled to a fixed range, typically [0, 1]. Useful for models like neural networks. |
| No Scaling        | Only acceptable if all features in the dataset are already on similar scales (rare in practice). |
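
The difference between the two scalers can be seen on a tiny made-up column. This is a minimal sketch; the input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single feature with one comparatively large value (arbitrary example data)
values = np.array([[1.0], [2.0], [3.0], [10.0]])

# StandardScaler: subtract the mean, divide by the standard deviation
standardised = StandardScaler().fit_transform(values)

# MinMaxScaler: map the minimum to 0 and the maximum to 1
min_maxed = MinMaxScaler().fit_transform(values)

print(standardised.ravel())
print(min_maxed.ravel())
```

After `StandardScaler` the column has mean 0 and standard deviation 1, whereas after `MinMaxScaler` it spans exactly [0, 1].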



### Scaling and Categorical Features


- One-hot encoded or binary categorical variables should not be scaled.
- Scaling should be applied only to continuous numerical features.
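
One way to scale only the continuous columns is scikit-learn's `ColumnTransformer`. The breast cancer dataset has no categorical features, so the sketch below uses a small hypothetical table; the column names `income`, `age`, and `is_homeowner` are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: two continuous columns and one binary flag
customers = pd.DataFrame({
    "income": [30_000.0, 52_000.0, 75_000.0, 41_000.0],
    "age": [25.0, 40.0, 58.0, 33.0],
    "is_homeowner": [0, 1, 1, 0],
})

# Scale only the continuous columns; pass the binary flag through unchanged
preprocessor = ColumnTransformer(
    transformers=[("scale_numeric", StandardScaler(), ["income", "age"])],
    remainder="passthrough",
)

# Output columns: scaled income, scaled age, then the untouched flag
transformed = preprocessor.fit_transform(customers)
print(transformed)
```

A `ColumnTransformer` like this can itself be used as the first step of a `Pipeline` in place of a bare scaler.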


### Implementation Guidance

Scaling should be implemented using `StandardScaler()` as a step in a `Pipeline`, placed before the model. The pipeline then fits the scaler on the training data only and reuses those learned statistics to transform the test data, which keeps preprocessing consistent and prevents information from the test set leaking into training.
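
The leakage point can be checked directly. The sketch below rebuilds the same kind of pipeline used in this notebook and confirms that the scaler's stored means match the training set means, which shows that the test rows played no part in fitting it.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)

# The scaler inside the fitted pipeline has only ever seen X_train
scaler = pipeline.named_steps["scaler"]
uses_train_stats = np.allclose(scaler.mean_, X_train.mean().to_numpy())
print(uses_train_stats)

# At prediction time the pipeline reuses those training statistics on X_test
test_accuracy = pipeline.score(X_test, y_test)
print(round(test_accuracy, 3))
```

Fitting a scaler on the full dataset before splitting would instead let test set statistics influence training.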


## Steps 4–6: Build, Fit, and Evaluate a Classification Pipeline

### Step 4: Build the Pipeline

A pipeline is a modular workflow that chains together multiple steps (e.g., preprocessing and modelling) into a single object. This approach ensures:

- Each step is applied in a fixed, repeatable order
- Preprocessing (e.g., scaling) is applied consistently to both training and test sets
- The code is easier to maintain and update

In this case, the pipeline contains two steps:
1. `'scaler'`: Standardises the numerical features using `StandardScaler`, which centres each feature at mean 0 and scales it to unit variance.
2. `'classifier'`: Applies `LogisticRegression`, a linear model suitable for binary classification tasks.



In [13]:
# Step 4 - Build the pipeline: apply the scaler, then choose the classifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200))
])

# Step 5 - Fit the pipeline on the training data only
pipeline.fit(X_train, y_train)
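
Once the pipeline is fitted, its individual steps can be inspected through `named_steps`. The sketch below is self-contained, so it rebuilds and refits the same pipeline, then lists the five features with the largest absolute coefficients. Showing five is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)

# One coefficient per (standardised) feature, labelled by column name
classifier = pipeline.named_steps["classifier"]
coefficients = pd.Series(classifier.coef_[0], index=X.columns)

# Order by absolute size and keep the five largest
order = coefficients.abs().sort_values(ascending=False).index
top_five = coefficients.reindex(order).head(5)
print(top_five)
```

Because the features were standardised, the coefficient sizes are roughly comparable across features. A positive coefficient pushes the prediction towards class `1` (benign) and a negative one towards class `0` (malignant).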

## Step 6: Evaluate the Model

Once the model is trained, it should be evaluated on the test set to assess how well it generalises to unseen data.

We use two tools in particular:

### 1. `classification_report`

This provides key classification metrics:

- **Precision**: The proportion of positive predictions that were actually correct.
- **Recall** (also called Sensitivity or True Positive Rate): The proportion of actual positives that were correctly predicted.
- **F1 Score**: The harmonic mean of precision and recall. Useful when you need to balance the two (e.g., when classes are imbalanced).
- **Accuracy**: The overall proportion of correct predictions across all classes.

Precision, recall, and F1 are reported for each class (here, 0 and 1), along with their macro and weighted averages. Accuracy is reported once for the whole test set.

This report gives a balanced view of performance, especially when accuracy alone may be misleading.
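
The definitions above can be worked through by hand. The sketch below uses the four counts from the confusion matrix printed by the evaluation cell at the end of this section, treating class 1 as the positive class, and reproduces the class 1 row of the report.

```python
# Counts from the confusion matrix [[TN, FP], [FN, TP]] with class 1 as positive
tn, fp, fn, tp = 41, 2, 1, 70

# Of everything predicted positive, how much really was positive?
precision = tp / (tp + fp)

# Of everything that really was positive, how much did we find?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Share of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.97 recall=0.99 f1=0.98 accuracy=0.97
```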


### 2. `confusion_matrix`

The confusion matrix presents the number of correct and incorrect predictions in a 2x2 table format:

|                        | Predicted Negative (0) | Predicted Positive (1) |
|------------------------|------------------------|-------------------------|
| Actual Negative (0)    | True Negative (TN)     | False Positive (FP)     |
| Actual Positive (1)    | False Negative (FN)    | True Positive (TP)      |

**Interpretation**:
- **TP (True Positive)**: Correctly predicted positive class
- **TN (True Negative)**: Correctly predicted negative class
- **FP (False Positive)**: Incorrectly predicted as positive (Type I error)
- **FN (False Negative)**: Missed actual positive (Type II error)

This matrix helps identify the type and frequency of classification errors, which is especially important in high-stakes environments (e.g., fraud detection, credit risk).
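
For a binary problem, the four cells can be unpacked from scikit-learn's matrix in the same row-by-row order as the table above. The labels in this sketch are a small made-up example, not output from the model.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for six samples
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes,
# so flattening row by row gives TN, FP, FN, TP
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = (int(count) for count in matrix.ravel())

print(matrix)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
# TN=2 FP=1 FN=1 TP=2
```

In the breast cancer dataset, class `1` is benign, so a "false positive" in this layout is a malignant tumour predicted as benign. It is worth checking which class counts as positive before deciding which error type is more costly.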



Understanding both the classification report and confusion matrix provides a comprehensive picture of model performance, beyond just the accuracy score.


In [14]:
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

[[41  2]
 [ 1 70]]
