# Dimensionality Reduction with PCA

Dimensionality reduction is a fundamental technique in machine learning and data science that reduces the number of features (or variables) in a dataset while retaining as much meaningful information as possible.

High-dimensional data can be computationally expensive, prone to overfitting, and difficult to interpret. Dimensionality reduction helps address these challenges by simplifying the data without significantly sacrificing its structure or patterns.

In this notebook, we focus on **Principal Component Analysis (PCA)**, one of the most widely used techniques for dimensionality reduction. PCA transforms the data into a new set of uncorrelated features, called principal components. Each component is a linear combination of the original features, and the components are ordered by how much of the data's variance they capture.
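
To make the idea concrete, the following is a minimal NumPy sketch of PCA via an eigendecomposition of the covariance matrix. It is an illustration only, not how scikit-learn implements PCA internally; the function name `pca_sketch` and the toy data are assumptions made up for this example.

```python
import numpy as np

def pca_sketch(data, n_components):
    """Project `data` onto its top `n_components` principal components."""
    # Center each feature so variance is measured around the mean
    centered = data - data.mean(axis=0)
    # Covariance matrix of the features (columns)
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the directions with the largest eigenvalues (largest variance)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    explained = eigvals[order] / eigvals.sum()
    return centered @ components, explained

rng = np.random.default_rng(0)
# Toy data: two strongly correlated features plus one small independent noise feature
base = rng.normal(size=(200, 1))
toy = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    0.1 * rng.normal(size=(200, 1)),
])

projected, explained_ratio = pca_sketch(toy, n_components=2)
print(projected.shape)
print(explained_ratio)
```

Because the first two toy features are almost perfectly correlated, a single component should account for nearly all of the variance, which is the kind of redundancy PCA is designed to exploit.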

We will use the Breast Cancer dataset to:
1. Train a machine learning model (Random Forest classifier) on the original dataset with 30 features.
2. Apply PCA to reduce the dataset dimensions.
3. Train the same model on the reduced dataset.
4. Compare the performance of the model before and after applying PCA.

By the end of this notebook, you will:
- Understand how PCA simplifies a dataset.
- Learn how dimensionality reduction impacts model performance.
- See a practical implementation of PCA with Python.

Let's get started!


### Step 1: Import Libraries and Load Dataset

In this step, we import the necessary libraries and load the Breast Cancer dataset. This dataset contains 30 features describing tumor characteristics (e.g., radius, texture). We will split the data into `X` (features) and `Y` (target variable) to prepare for modeling.


In [None]:
# -----------------------------------------------------------
# 1: Import libraries, load the dataset, and create X and Y
# -----------------------------------------------------------

from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset
lbc = load_breast_cancer()

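# Features (X) and Target (Y)
X = pd.DataFrame(lbc['data'], columns=lbc['feature_names'])
# Keep the target one-dimensional (a Series rather than a one-column DataFrame)
# so that scikit-learn does not warn about a column-vector y during fit
Y = pd.Series(lbc['target'], name='type')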


### Step 2: Train Model Without PCA

Here, we train a Random Forest classifier on the original dataset with all 30 features. This gives us a baseline for the model's performance before applying PCA. We split the data into training and testing sets (70% / 30%, stratified by class), train the classifier, and evaluate it with a confusion matrix and accuracy.


In [None]:
# -----------------------------------------------------------
# 2: Perform Prediction Without PCA
# -----------------------------------------------------------

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234, stratify=Y)

# Train a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rfc1 = RandomForestClassifier(random_state=1234)
rfc1.fit(X_train, Y_train)

# Make predictions
Y_predict1 = rfc1.predict(X_test)

# Evaluate the model
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y_test, Y_predict1)
score1 = rfc1.score(X_test, Y_test)

print("Confusion Matrix (Without PCA):")
print(cm1)
print(f"Accuracy (Without PCA): {score1:.4f}")


Confusion Matrix (Without PCA):
[[ 61   3]
 [  0 107]]
Accuracy (Without PCA): 0.9825


### Step 3: Normalize Data

Before applying PCA, we standardize the data (often loosely called normalization) so that every feature has zero mean and unit variance. This matters because PCA looks for directions of maximum variance: without standardization, features measured on larger scales, such as the area measurements, would dominate the components regardless of how informative they are.
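
The effect of scale can be checked directly. The sketch below fits PCA to the raw and to the standardized Breast Cancer features and compares how much variance the first component absorbs in each case. The variable names (`features`, `raw_ratio`, `scaled_ratio`) are chosen for this illustration and are not used elsewhere in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = load_breast_cancer()['data']

# PCA on raw features: large-magnitude columns (e.g. the area features) dominate
raw_ratio = PCA(n_components=2).fit(features).explained_variance_ratio_

# PCA on standardized features: every column contributes on the same scale
scaled_features = StandardScaler().fit_transform(features)
scaled_ratio = PCA(n_components=2).fit(scaled_features).explained_variance_ratio_

print("First component share, raw data:   ", round(raw_ratio[0], 4))
print("First component share, scaled data:", round(scaled_ratio[0], 4))
```

On the raw data the first component is expected to take a much larger share of the variance, simply because a few features have much larger numeric ranges than the rest.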


In [None]:
# -----------------------------------------------------------
# 3: Standardize the features before PCA
# -----------------------------------------------------------

# Standardize the features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


### Step 4: Apply PCA

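Using PCA, we reduce the dataset from 30 features to 5 principal components. The leading components point along the directions of greatest variance, so we keep the dominant patterns in the data while discarding directions that carry little variance, many of which are redundant because the original features are correlated. How much variance the 5 components actually retain can be read from `pca.explained_variance_ratio_` after fitting.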


In [None]:
# Apply PCA and reduce dimensions to 5 principal components
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)
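
As a quick check on the choice of 5 components, the sketch below prints the share of variance each component explains and the running total. It refits its own scaler and PCA so that it runs on its own; `features_scaled` and `pca_check` are names introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features_scaled = StandardScaler().fit_transform(load_breast_cancer()['data'])
pca_check = PCA(n_components=5).fit(features_scaled)

# Share of total variance captured by each component, and the running total
ratios = pca_check.explained_variance_ratio_
cumulative = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")
```

If the cumulative share is lower than desired, `n_components` can be increased. `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it keeps as many components as are needed to reach that fraction of explained variance.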


### Step 5: Train Model After PCA

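After reducing the dimensions with PCA, we retrain the Random Forest classifier on the transformed dataset (5 principal components), using the same split settings and random seed as before so the two runs are comparable. Note that the scaler and PCA were fit on the full dataset before splitting, so the test rows influenced the transformation; the resulting score may therefore be slightly optimistic.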


In [None]:
# -----------------------------------------------------------
# 5: Perform Prediction After PCA
# -----------------------------------------------------------

# Split the reduced dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X_pca, Y, test_size=0.3, random_state=1234, stratify=Y)

# Train a Random Forest Classifier on PCA-transformed data
rfc2 = RandomForestClassifier(random_state=1234)
rfc2.fit(X_train, Y_train)

# Make predictions
Y_predict2 = rfc2.predict(X_test)

# Evaluate the model
cm2 = confusion_matrix(Y_test, Y_predict2)
score2 = rfc2.score(X_test, Y_test)

print("Confusion Matrix (With PCA):")
print(cm2)
print(f"Accuracy (With PCA): {score2:.4f}")



Confusion Matrix (With PCA):
[[ 61   3]
 [  1 106]]
Accuracy (With PCA): 0.9766
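
The cells above fit `StandardScaler` and `PCA` on the full dataset before splitting, which lets information from the test rows leak into the preprocessing. One common way to avoid this is to wrap the steps in a scikit-learn pipeline so that they are fit on the training split only. The following is a sketch of that alternative, not the approach used in this notebook; names such as `pipe`, `X_tr`, and `pipe_score` are assumptions for illustration, and its accuracy is not expected to match the figures printed above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_all, y_all = data['data'], data['target']

# Split first, so the test rows never influence the scaler or the PCA
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=1234, stratify=y_all
)

# Scaling, PCA, and the classifier are all fit on the training split only
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(random_state=1234),
)
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler and PCA to the test rows
pipe_score = pipe.score(X_te, y_te)
print(f"Accuracy (pipeline, fit on training data only): {pipe_score:.4f}")
```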


### Step 6: Compare Results

In this final step, we print the two accuracy scores side by side to see how much performance changed after reducing the data from 30 features to 5 principal components.


In [None]:
# -----------------------------------------------------------
# 6: Compare Results
# -----------------------------------------------------------

print("Comparison of Results:")
print(f"Accuracy Without PCA: {score1:.4f}")
print(f"Accuracy With PCA: {score2:.4f}")


Comparison of Results:
Accuracy Without PCA: 0.9825
Accuracy With PCA: 0.9766
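
A single train/test split gives only one estimate, and here the two scores differ by a single test sample. To get a rough sense of whether such a gap is meaningful, both setups can be compared with cross-validation. This is a hedged sketch rather than part of the original analysis; the 5-fold setting and the names `cv`, `plain`, and `reduced` are assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_all, y_all = data['data'], data['target']

# The same folds are used for both models so the comparison is like for like
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

plain = RandomForestClassifier(random_state=1234)
reduced = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(random_state=1234),
)

scores_plain = cross_val_score(plain, X_all, y_all, cv=cv)
scores_reduced = cross_val_score(reduced, X_all, y_all, cv=cv)

print(f"Without PCA: {scores_plain.mean():.4f} +/- {scores_plain.std():.4f}")
print(f"With PCA:    {scores_reduced.mean():.4f} +/- {scores_reduced.std():.4f}")
```

If the spread across folds is larger than the gap between the two means, the difference seen on a single split should not be over-interpreted.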


### Conclusion

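This notebook demonstrates how PCA can simplify a dataset for machine learning. Reducing the Breast Cancer data from 30 features to 5 principal components lowered test accuracy only slightly, from 0.9825 to 0.9766, which corresponds to one additional misclassified sample out of 171. In general, dimensionality reduction can lower computational cost and may help limit overfitting on high-dimensional data, although this notebook does not measure either effect, and principal components are harder to interpret than the original features. PCA remains a widely used technique for handling real-world data with many features.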
