# Case study: Dimensionality Reduction using PCA for Breast Cancer Classification 🏥

### 🏢 Business Problem Context

The Breast Cancer dataset has 30 features (e.g., radius, texture, smoothness of cell nuclei).

Many of these features are correlated and redundant.

Training directly on all 30 features may increase computation and risk of overfitting.

PCA reduces the 30 features to 10 principal components while retaining most of the variance in the data.

Then, we use Logistic Regression on these 10 components to classify tumors as benign or malignant.
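The redundancy claim can be checked directly from the data. The snippet below is a minimal sketch, assuming a Pearson correlation cutoff of 0.9 as the definition of "highly correlated"; the cutoff and the choice of `mean radius` and `mean perimeter` as the example pair are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data
names = list(data.feature_names)

# Pearson correlation between features (columns); result has shape (30, 30)
corr = np.corrcoef(X, rowvar=False)

# Correlation between two size-related features
i, j = names.index("mean radius"), names.index("mean perimeter")
print(f"corr(mean radius, mean perimeter) = {corr[i, j]:.3f}")

# Count distinct feature pairs whose absolute correlation exceeds the cutoff
threshold = 0.9  # illustrative cutoff (assumption)
upper = np.triu_indices_from(corr, k=1)  # indices above the diagonal
n_high = int(np.sum(np.abs(corr[upper]) > threshold))
print(f"Feature pairs with |r| > {threshold}: {n_high} of {len(upper[0])}")
```

Strongly correlated pairs like these carry overlapping information, which is what PCA compresses into fewer components.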

In [19]:
from sklearn.datasets import load_breast_cancer

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

import numpy as np

In [20]:
# Step 1: Load dataset

data = load_breast_cancer()  # built-in dataset shipped with sklearn.datasets for learning purposes
X, y = data.data, data.target
# Breast Cancer dataset: 569 samples, 30 features, binary target (0=malignant, 1=benign)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [17]:
# Step 2: Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Stratified split ensures same class proportion in train and test sets

X_train.shape, X_test.shape

((398, 30), (171, 30))
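To confirm what the stratified split does, the class proportions of each subset can be compared with those of the full dataset. This is a small sketch; `benign_fraction` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

def benign_fraction(labels):
    """Fraction of samples labelled 1 (benign). Hypothetical helper."""
    return float(np.mean(labels == 1))

# With stratify=y the three fractions should be nearly identical
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: n={len(labels):3d}, benign fraction={benign_fraction(labels):.3f}")
```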

In [25]:
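# View the training records. Note: X_train[:-5] returns every row except the last 5
# (hence the truncated output below); use X_train[:5] to view only the first 5 records.
X_train[:-5]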

array([[1.162e+01, 1.818e+01, 7.638e+01, ..., 1.416e-01, 2.660e-01,
        9.270e-02],
       [1.120e+01, 2.937e+01, 7.067e+01, ..., 0.000e+00, 1.566e-01,
        5.905e-02],
       [1.057e+01, 1.832e+01, 6.682e+01, ..., 2.222e-02, 2.699e-01,
        6.736e-02],
       ...,
       [1.108e+01, 1.471e+01, 7.021e+01, ..., 4.306e-02, 1.902e-01,
        7.313e-02],
       [2.059e+01, 2.124e+01, 1.378e+02, ..., 2.113e-01, 2.480e-01,
        8.999e-02],
       [1.944e+01, 1.882e+01, 1.281e+02, ..., 2.060e-01, 3.266e-01,
        9.009e-02]])

In [12]:
y_train[:5]

array([1, 1, 1, 1, 1])

In [26]:
# Step 3: Define pipeline (chaining steps together)
pipe = Pipeline([
    ("scaler", StandardScaler()),         # Standardize features (mean=0, variance=1)
    ("pca", PCA(n_components=10)),        # Reduce 30 features → 10 principal components
    ("clf", LogisticRegression(max_iter=500))  # Logistic Regression classifier
])
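The scaler comes before PCA in the pipeline because PCA is sensitive to feature scale: columns with large numeric ranges dominate the variance. The sketch below compares the explained variance ratios with and without standardization. It fits on the full dataset purely for illustration, which is an assumption made to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA on raw features: large-scale columns (e.g. areas) dominate the variance
raw_ratio = PCA(n_components=3).fit(X).explained_variance_ratio_

# PCA on standardized features: every column contributes on an equal footing
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=3).fit(X_scaled).explained_variance_ratio_

print("Raw    :", np.round(raw_ratio, 3))
print("Scaled :", np.round(scaled_ratio, 3))
```

On the raw data the first component should absorb almost all of the variance, which mostly reflects units rather than structure; after scaling the variance is spread over several components.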


In [27]:

# Step 4: Train pipeline (scaler and PCA are fit on the training data only, then the classifier is trained on the 10 components)
pipe.fit(X_train, y_train)
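Because the scaler and PCA live inside the pipeline, the whole chain can also be cross-validated: each fold refits all three steps on its own training portion, so no information leaks from the held-out portion. A minimal sketch, assuming 5 folds (the fold count is an illustrative choice, not something the notebook specifies):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(max_iter=500)),
])

# Each fold clones the pipeline and refits scaler, PCA and classifier from scratch
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("Fold accuracies:", cv_scores.round(3))
print("Mean CV accuracy:", round(float(cv_scores.mean()), 3))
```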

In [15]:
# Step 5: Evaluate model performance
print("Test Accuracy with PCA:", pipe.score(X_test, y_test))
# Measures classification accuracy using reduced features

Test Accuracy with PCA: 0.9707602339181286
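Accuracy alone does not show which class the errors fall in, and for tumor classification a missed malignant case is usually the costlier mistake. The sketch below rebuilds the same pipeline and prints a confusion matrix and per-class report. It is an optional extension of the evaluation step, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(max_iter=500)),
]).fit(X_train, y_train)

y_pred = pipe.predict(X_test)

# Rows are true classes, columns are predicted classes (0=malignant, 1=benign)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_test, y_pred, target_names=data.target_names))
```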


In [28]:
# Step 6: Check how much variance is retained by 10 components

pca_model = pipe.named_steps["pca"]

print("Explained Variance (10 comps):", np.sum(pca_model.explained_variance_ratio_))

# Fraction of the total variance preserved after dimensionality reduction (0.95 means 95%)


Explained Variance (10 comps): 0.9561577528297447
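The choice of 10 components can be examined by looking at the cumulative explained variance across all 30 components. The following sketch assumes a 95% variance target, which is an illustrative threshold. It also shows that `PCA` accepts a float `n_components` (with `svd_solver="full"`) to pick the component count automatically.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize, then fit PCA with all components kept
X_train_scaled = StandardScaler().fit_transform(X_train)
pca_full = PCA().fit(X_train_scaled)

# Running total of variance explained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k in (1, 2, 5, 10, 15):
    print(f"{k:2d} components: {cumulative[k - 1]:.3f}")

# Smallest number of components reaching the target (illustrative 95% threshold)
target = 0.95
n_for_target = int(np.searchsorted(cumulative, target) + 1)
print("Components needed for 95% variance:", n_for_target)

# Alternative: let PCA select the count from the variance target directly
pca_auto = PCA(n_components=target, svd_solver="full").fit(X_train_scaled)
print("PCA-selected component count:", pca_auto.n_components_)
```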


### Observation

By reducing 30 features to just 10 principal components, we still retained about 95.6% of the original variance.

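The Logistic Regression model achieved ~97% accuracy on the test set. Whether this matches a model trained on all 30 features is not measured above; checking it requires fitting the same pipeline without the PCA step.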

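This shows how PCA can simplify high-dimensional datasets without losing much predictive power, reducing the number of inputs the classifier has to handle. Note that each principal component is a linear combination of all 30 original features, so the reduced model is not necessarily easier to interpret.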
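The comparison against the full feature set can be checked by training the same classifier with and without the PCA step on the same split. This is a minimal sketch; the pipeline names `with_pca` and `baseline` are introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Same pipeline as in the case study: scale, reduce to 10 components, classify
with_pca = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(max_iter=500)),
])

# Baseline: identical, but the classifier sees all 30 standardized features
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])

acc_pca = with_pca.fit(X_train, y_train).score(X_test, y_test)
acc_all = baseline.fit(X_train, y_train).score(X_test, y_test)

print(f"Test accuracy, 10 principal components: {acc_pca:.4f}")
print(f"Test accuracy, all 30 features:         {acc_all:.4f}")
```

A single split gives a noisy estimate on 171 test samples, so small differences between the two numbers should not be over-interpreted.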