# 🚀 Principal Component Analysis (PCA) - Day 33

Welcome to Day 33 of the **100 Days of Data Science & AI** series! Today, we are diving deep into **Principal Component Analysis (PCA)**, one of the most powerful and widely used techniques for dimensionality reduction and data visualization.

---

## 🧐 What is PCA?

Principal Component Analysis (PCA) is an **unsupervised machine learning** algorithm used for dimensionality reduction. It projects a large set of (often correlated) variables onto a smaller set of uncorrelated variables, the principal components, chosen so that they retain as much of the original variance as possible.

### Why do we need it?
1. **Curse of Dimensionality**: High-dimensional datasets are often hard to visualize and can lead to overfitting.
2. **Visualization**: Reducing data to 2D or 3D allows us to see patterns that were hidden in higher dimensions.
3. **Speed**: Models train faster and use less memory when they work with fewer features.
4. **Noise Reduction**: Dropping the low-variance components can filter out noise, provided the signal of interest lies in the high-variance directions.

---

## 🛠️ The Mechanics of PCA

1. **Standardization**: PCA is sensitive to the scale of each feature, so features are usually rescaled to have a mean of 0 and a variance of 1.
2. **Covariance Matrix**: We compute the covariance matrix, which measures how each pair of features varies together around their means.
3. **Eigen-decomposition**: We compute the **Eigenvectors** (directions of variance) and **Eigenvalues** (amount of variance along each direction) of the covariance matrix.
4. **Principal Components**: These are the new, uncorrelated variables. PC1 captures the most variance, PC2 the second most, and so on.
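
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally; the synthetic data, the variable names, and the choice of `k = 2` are all assumptions made for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]  # make feature 1 strongly correlated with feature 0

# 1. Standardization: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (shape 4 x 4).
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition. eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order, so we re-sort them descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Principal components: project the data onto the top-k eigenvectors.
k = 2
scores = X_std @ eigvecs[:, :k]
explained_ratio = eigvals / eigvals.sum()

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Projected shape:", scores.shape)
```

The columns of `scores` are uncorrelated with each other, which is the property that makes PC1, PC2, and so on non-redundant.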

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

sns.set(style="whitegrid", palette="muted")

## 📂 Loading the Real-World Dataset

We use the **Breast Cancer Wisconsin (Diagnostic)** dataset. It contains **30 features** representing characteristics of cell nuclei from breast cancer biopsies. Our goal is to reduce these 30 features into a few Principal Components.

In [None]:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print(f"Dataset shape: {df.shape}")
df.head()

## ⚖️ Step 1: Standardization

PCA looks for the directions of maximum variance. If one feature ranges over 1-100 and another over 0-1, the first has a far larger variance purely because of its units, so it would dominate the principal components regardless of how informative it actually is.

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
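
To make the effect of scaling concrete, here is a small sketch on synthetic data (not the breast cancer dataset). Two features carry the same underlying signal but live on very different scales; the helper `first_pc` and all numeric constants are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
shared = rng.normal(size=n)  # common signal driving both features

# Same signal, very different units (assumed scales for illustration).
big = 50.0 + 10.0 * (shared + 0.3 * rng.normal(size=n))   # spread of roughly tens of units
small = 0.5 + 0.1 * (shared + 0.3 * rng.normal(size=n))   # spread of roughly tenths of a unit
X = np.column_stack([big, small])

def first_pc(data):
    """Return the eigenvector of the covariance matrix with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

pc1_raw = first_pc(X)

# Standardize manually: zero mean and unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = first_pc(X_std)

print("PC1 loadings, raw data:         ", np.round(np.abs(pc1_raw), 3))
print("PC1 loadings, standardized data:", np.round(np.abs(pc1_std), 3))
```

On the raw data, PC1 points almost entirely along the large-scale feature. After standardization, both features receive equal weight, which is what we would expect when they carry the same signal.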

## 🧱 Step 2: Applying PCA

We will reduce the 30 dimensions into 3 for visualization, and then analyze how much information each component retains.

In [None]:
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2', 'PC3'])
pca_df['Target'] = y
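
One way to see what "information retained" means is to project the data down and then map it back. The sketch below does this with a plain SVD on synthetic data, as a stand-in for the standardized feature matrix; the data, the shapes, and `k = 2` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a standardized feature matrix: 300 samples, 6 correlated features.
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.2 * rng.normal(size=(300, 6))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

k = 2
scores = X_std @ Vt[:k].T   # project to k dimensions
X_back = scores @ Vt[:k]    # map back to the original 6-dimensional space

# The relative squared reconstruction error equals the variance share
# of the components that were discarded.
lost = np.sum((X_std - X_back) ** 2) / np.sum(X_std ** 2)
print(f"Variance kept by {k} components: {explained_ratio[:k].sum():.3f}")
print(f"Relative reconstruction error:  {lost:.3f}")
```

The two printed numbers add up to 1. With scikit-learn, `pca.inverse_transform(X_pca)` performs the analogous back-projection.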

## 📊 Step 3: Advanced Visualizations

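### 1. 📉 Scree Plot (Explained Variance)
This plot shows the fraction of the total variance captured by each principal component, along with the running total. Because the model above was fitted with `n_components=3`, only the first three components appear; fitting `PCA()` without a limit would show all 30.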

In [None]:
exp_var_pca = pca.explained_variance_ratio_
cum_exp_var = np.cumsum(exp_var_pca)  # running total of the explained variance ratios
pc_index = range(1, len(exp_var_pca) + 1)  # 1-based, so PC1 is plotted at x = 1

plt.figure(figsize=(10, 5))
plt.bar(pc_index, exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(pc_index, cum_exp_var, where='mid', label='Cumulative explained variance')
plt.xticks(pc_index)
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.title('Scree Plot - Captured Information')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
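
The cumulative curve is also a practical way to decide how many components to keep. The sketch below picks the smallest number of components that reaches a target share of variance. It uses synthetic data, and the 95% threshold is a hypothetical choice, not a recommendation from this notebook.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic stand-in: 400 samples, 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.3 * rng.normal(size=(400, 10))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]
ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

threshold = 0.95  # hypothetical target; choose what suits the task
# First position where the cumulative ratio reaches the threshold, converted to a count.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print(f"Components needed for {threshold:.0%}: {n_keep}")
```

scikit-learn offers a shortcut for the same idea: passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.95)`) asks it to keep enough components to explain at least that share of the variance.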

### 2. 🔵 2D Projection
Even with just two components, the malignant and benign tumors form two largely distinct clusters, with some overlap where they meet. In this dataset, `Target` 0 is malignant and 1 is benign.

In [None]:
plt.figure(figsize=(10, 7))
sns.scatterplot(x='PC1', y='PC2', hue='Target', data=pca_df, palette='viridis', alpha=0.7)
plt.title('2D PCA Projection of Breast Cancer Dataset')
plt.xlabel(f'PC1 ({exp_var_pca[0]:.2%} Variance)')
plt.ylabel(f'PC2 ({exp_var_pca[1]:.2%} Variance)')
plt.show()

### 3. 🧊 3D Projection
Adding PC3 raises the cumulative explained variance to a little over 70% for this dataset, so the 3D view shows structure that the 2D plot leaves out.

In [None]:
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pca_df['PC1'], pca_df['PC2'], pca_df['PC3'], c=y, cmap='viridis', alpha=0.6)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('3D PCA Visualization')
plt.show()

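### 4. 🗺️ Feature Loadings (Heatmap of Components)
Each row of `pca.components_` holds the loadings (weights) of the original features on one principal component. The heatmap shows these loadings for the first two components; a larger absolute value means the feature contributes more strongly to that component.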

In [None]:
plt.figure(figsize=(12, 6))
sns.heatmap(pca.components_[:2], 
            yticklabels=['PC1', 'PC2'], 
            xticklabels=data.feature_names, 
            cmap='coolwarm')
plt.title('Feature Contribution to Principal Components')
plt.show()
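
A heatmap with 30 columns can be hard to read, so it may help to list the strongest features per component as text. The sketch below ranks features by absolute loading. The feature names and loading values are made up for illustration (they are not the fitted values from this notebook), and `top_features` is a hypothetical helper.

```python
import numpy as np

# Hypothetical loadings for 2 components over 5 features
# (the rows play the role of pca.components_).
feature_names = np.array(["radius", "texture", "perimeter", "smoothness", "symmetry"])
components = np.array([
    [0.55, 0.10, 0.56, 0.40, 0.46],
    [-0.30, 0.80, -0.28, 0.35, 0.26],
])

def top_features(loadings, names, n=3):
    """Return the n (name, loading) pairs with the largest absolute loading, strongest first."""
    order = np.argsort(np.abs(loadings))[::-1][:n]
    return [(str(names[i]), float(loadings[i])) for i in order]

for idx, row in enumerate(components, start=1):
    print(f"PC{idx}:", top_features(row, feature_names))
```

To apply this to the fitted model, one would pass a row of `pca.components_` and `data.feature_names` instead of the made-up arrays. The sign of a loading only indicates direction along the component, so ranking by absolute value is the usual choice.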

---

## 🔹 Key Takeaways

✔ **Dimensionality Reduction**: We successfully reduced 30 features into 2-3 components while retaining the majority of the variance (information).

✔ **Standardization is Critical**: Without scaling the breast cancer features, the PCA directions would be dominated by features with the largest raw numerical ranges.

✔ **Visualization Insights**: PCA projected the data in a way that allows us to visually differentiate between tumor types, which would be impossible in a 30-column table.

## 📌 Meta
Author: Tharun Naik Ramavath
Series: 100 Days of Data Science & AI
Day: 33
Platform: LinkedIn
Notebook: Google Colab