# Using sklearn's PCA - The Professional Way

## Introduction

Now that we understand PCA from scratch, let's learn to use sklearn's optimized implementation.

### What You'll Learn
1. How to use `sklearn.decomposition.PCA`
2. Key parameters and their effects
3. Important attributes after fitting
4. How to determine the optimal number of components
5. Best practices for real-world applications

### Why sklearn?
- **Optimized**: Uses efficient SVD algorithm
- **Well-tested**: Industry standard
- **Feature-rich**: Many useful options
- **Integrated**: Works seamlessly with sklearn pipelines

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.set_printoptions(precision=3, suppress=True)

print("✓ Libraries imported successfully!")

## 1. Basic Usage

In [None]:
# Create simple 2D data
np.random.seed(42)
X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1]
])

print("Original Data:")
print(X)
print(f"\nShape: {X.shape}")

In [None]:
# Create and fit PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("\nPCA Results:")
print(f"Transformed data shape: {X_pca.shape}")
print(f"\nExplained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")
print(f"\nComponents (eigenvectors):\n{pca.components_}")
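
To make the link to the from-scratch version concrete, here is a minimal sketch showing that, with default settings (`whiten=False`), `fit_transform` amounts to centering the data and projecting it onto the rows of `components_`. The name `X_manual` is introduced here for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same toy data as in the cell above
X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
    [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Manual projection: subtract the per-feature mean, then project onto the principal axes
X_manual = (X - pca.mean_) @ pca.components_.T

print("Matches sklearn output:", np.allclose(X_pca, X_manual))
```

Note that sklearn centers the data but does not scale it, so standardization remains a separate preprocessing step.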

## 2. Key Parameters

### n_components
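The most important parameter: it sets the dimensionality of the output. It accepts an integer (number of components to keep), a float between 0 and 1 (fraction of variance to retain), or `None` (keep all components).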

In [None]:
# Test different n_components
print("Testing different n_components:\n")

# Keep all components
pca_all = PCA()
pca_all.fit(X)
print(f"n_components=None: {pca_all.n_components_} components")
print(f"  Variance: {pca_all.explained_variance_ratio_}")

# Keep 1 component
pca_1 = PCA(n_components=1)
pca_1.fit(X)
print(f"\nn_components=1: {pca_1.n_components_} component")
print(f"  Variance: {pca_1.explained_variance_ratio_}")

# Keep 95% variance
pca_95 = PCA(n_components=0.95)
pca_95.fit(X)
print(f"\nn_components=0.95: {pca_95.n_components_} components selected")
print(f"  Variance: {pca_95.explained_variance_ratio_}")
print(f"  Total: {pca_95.explained_variance_ratio_.sum():.4f}")
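
The float form of `n_components` can be checked by hand. The sketch below uses the standardized Iris data that ships with sklearn (an assumption made so that there are more than two components to choose from) and compares sklearn's choice with the smallest `k` whose cumulative explained variance ratio exceeds the threshold. The names `cumulative`, `k_manual`, and `selected` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)

# Cumulative explained variance ratio with all components kept
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)

selected = []
for threshold in (0.70, 0.90, 0.99):
    pca_t = PCA(n_components=threshold).fit(X_demo)
    # Smallest k whose cumulative ratio exceeds the threshold
    k_manual = int(np.argmax(cumulative > threshold)) + 1
    selected.append((threshold, int(pca_t.n_components_), k_manual))
    print(f"threshold={threshold:.2f}: sklearn keeps {pca_t.n_components_}, manual count {k_manual}")
```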

## 3. Important Attributes

In [None]:
# Explore PCA attributes
pca = PCA(n_components=2)
pca.fit(X)

print("Important PCA Attributes:\n")
print("=" * 60)

print(f"\n1. components_ (shape {pca.components_.shape}):")
print("   Principal axes in feature space")
print(pca.components_)

print(f"\n2. explained_variance_ (shape {pca.explained_variance_.shape}):")
print("   Amount of variance explained by each component")
print(pca.explained_variance_)

print(f"\n3. explained_variance_ratio_ (shape {pca.explained_variance_ratio_.shape}):")
print("   Fraction of total variance explained by each component")
print(pca.explained_variance_ratio_)

print(f"\n4. singular_values_ (shape {pca.singular_values_.shape}):")
print("   Singular values from SVD")
print(pca.singular_values_)

print(f"\n5. mean_ (shape {pca.mean_.shape}):")
print("   Per-feature mean")
print(pca.mean_)

print(f"\n6. n_components_: {pca.n_components_}")
# n_features_ was removed in recent scikit-learn releases; n_features_in_ is the replacement
print(f"7. n_features_in_: {pca.n_features_in_}")
print(f"8. n_samples_: {pca.n_samples_}")
print(f"9. noise_variance_: {pca.noise_variance_}")
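
Several of these attributes are tied together by simple identities. The following sketch checks three of them on the toy data; `var_from_svd` and `total_var` are names introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
    [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
])
pca = PCA(n_components=2).fit(X)
n_samples = X.shape[0]

# explained_variance_ follows from the singular values (sample variance, ddof=1)
var_from_svd = pca.singular_values_ ** 2 / (n_samples - 1)
print("variance from singular values:", np.allclose(var_from_svd, pca.explained_variance_))

# explained_variance_ratio_ divides by the total variance of the original features
total_var = np.var(X, axis=0, ddof=1).sum()
print("ratio from total variance:   ",
      np.allclose(pca.explained_variance_ / total_var, pca.explained_variance_ratio_))

# The rows of components_ form an orthonormal set
print("orthonormal components:      ",
      np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))
```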

## 4. Determining Optimal Number of Components

Three common approaches:
1. Explained variance threshold (e.g., 95%)
2. Scree plot (elbow method)
3. Domain knowledge

In [None]:
# Load a richer dataset for demonstration
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names

print("Iris Dataset:")
print(f"Samples: {X_iris.shape[0]}")
print(f"Features: {X_iris.shape[1]}")
print(f"Feature names: {feature_names}")

# Standardize features
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)

print("\n✓ Data standardized (important for PCA!)")
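
To see why scaling matters, the sketch below fits PCA on the raw and on the standardized Iris features and prints both sets of explained variance ratios. On raw data the feature with the largest variance tends to dominate the first component. The names `X_raw`, `X_std`, `ratio_raw`, and `ratio_std` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)

ratio_raw = PCA().fit(X_raw).explained_variance_ratio_
ratio_std = PCA().fit(X_std).explained_variance_ratio_

print("Per-feature variance (raw):      ", X_raw.var(axis=0).round(3))
print("Explained variance ratio (raw):  ", ratio_raw.round(3))
print("Explained variance ratio (scaled):", ratio_std.round(3))
```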

In [None]:
# Fit PCA with all components
pca_full = PCA()
X_iris_pca = pca_full.fit_transform(X_iris_scaled)

print("PCA Results on Iris:")
print(f"\nExplained variance ratio by component:")
for i, var in enumerate(pca_full.explained_variance_ratio_, 1):
    print(f"  PC{i}: {var:.4f} ({var*100:.2f}%)")

cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
print(f"\nCumulative explained variance:")
for i, var in enumerate(cumulative_var, 1):
    print(f"  First {i} PC(s): {var:.4f} ({var*100:.2f}%)")

In [None]:
# Create scree plot and cumulative variance plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Scree plot
ax1.bar(range(1, len(pca_full.explained_variance_ratio_) + 1),
       pca_full.explained_variance_ratio_,
       alpha=0.7, color='steelblue', edgecolor='black', linewidth=2)
ax1.plot(range(1, len(pca_full.explained_variance_ratio_) + 1),
        pca_full.explained_variance_ratio_,
        'ro-', linewidth=2, markersize=8)
ax1.set_xlabel('Principal Component', fontsize=12)
ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
ax1.set_title('Scree Plot', fontsize=14, fontweight='bold')
ax1.set_xticks(range(1, len(pca_full.explained_variance_ratio_) + 1))
ax1.grid(axis='y', alpha=0.3)

# Cumulative variance plot
ax2.plot(range(1, len(cumulative_var) + 1), cumulative_var,
        'o-', linewidth=2, markersize=8, color='darkgreen')
ax2.axhline(y=0.95, color='r', linestyle='--', linewidth=2, label='95% threshold')
ax2.axhline(y=0.90, color='orange', linestyle='--', linewidth=2, label='90% threshold')
ax2.fill_between(range(1, len(cumulative_var) + 1), 0, cumulative_var,
                alpha=0.2, color='green')
ax2.set_xlabel('Number of Components', fontsize=12)
ax2.set_ylabel('Cumulative Explained Variance', fontsize=12)
ax2.set_title('Cumulative Explained Variance', fontsize=14, fontweight='bold')
ax2.set_xticks(range(1, len(cumulative_var) + 1))
ax2.legend(fontsize=10)
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Recommendation
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"\n💡 Recommendation: Use {n_95} components for 95% variance retention")

## 5. Visualization in PCA Space

In [None]:
# Reduce to 2D for visualization
pca_2d = PCA(n_components=2)
X_iris_2d = pca_2d.fit_transform(X_iris_scaled)

# Create scatter plot
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green']
target_names = iris.target_names

for i, color, target_name in zip([0, 1, 2], colors, target_names):
    plt.scatter(X_iris_2d[y_iris == i, 0],
               X_iris_2d[y_iris == i, 1],
               color=color, alpha=0.7, s=80,
               edgecolors='k', linewidths=0.5,
               label=target_name)

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%} variance)', fontsize=13)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%} variance)', fontsize=13)
plt.title('Iris Dataset in PCA Space', fontsize=15, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Setosa is clearly separated in PCA space; versicolor and virginica overlap slightly.")
print(f"   PC1 and PC2 together explain {(pca_2d.explained_variance_ratio_.sum())*100:.1f}% variance")

## 6. Inverse Transform

In [None]:
# Reconstruct data from 2 components
X_reconstructed = pca_2d.inverse_transform(X_iris_2d)

# Calculate reconstruction error
mse = np.mean((X_iris_scaled - X_reconstructed) ** 2)
print(f"Reconstruction MSE (2 components): {mse:.6f}")

# Compare original vs reconstructed for first sample
print(f"\nFirst sample comparison:")
print(f"Original (scaled): {X_iris_scaled[0]}")
print(f"Reconstructed:     {X_reconstructed[0]}")
print(f"Difference:        {X_iris_scaled[0] - X_reconstructed[0]}")
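
As a rough illustration of what `inverse_transform` does with default settings (`whiten=False`), the sketch below rebuilds the data by hand and then tracks how the reconstruction error shrinks as more components are kept. The names `X_scaled`, `Z`, `X_back_manual`, and `errors` are introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Manual inverse: map scores back to feature space and add the mean again
pca_k = PCA(n_components=2).fit(X_scaled)
Z = pca_k.transform(X_scaled)
X_back_manual = Z @ pca_k.components_ + pca_k.mean_
print("Manual inverse matches:", np.allclose(X_back_manual, pca_k.inverse_transform(Z)))

# Reconstruction MSE for every possible number of components
errors = []
for k in range(1, X_scaled.shape[1] + 1):
    model = PCA(n_components=k).fit(X_scaled)
    X_back = model.inverse_transform(model.transform(X_scaled))
    errors.append(np.mean((X_scaled - X_back) ** 2))
    print(f"k={k}: reconstruction MSE = {errors[-1]:.6f}")
```

With all components kept, the reconstruction is exact up to floating-point error, since no information has been discarded.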

## 7. Feature Importance in PCs

In [None]:
# Analyze component loadings
components_df = pd.DataFrame(
    pca_2d.components_.T,
    columns=['PC1', 'PC2'],
    index=feature_names
)

print("Component Loadings:")
print(components_df)
print("\nInterpretation:")
print("- Larger absolute values = feature contributes more to that PC")
print("- Sign indicates direction of contribution")

In [None]:
# Visualize loadings
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

# PC1 loadings
ax1.barh(feature_names, components_df['PC1'], color='steelblue', alpha=0.7, edgecolor='black')
ax1.set_xlabel('Loading Value', fontsize=12)
ax1.set_title('PC1 Feature Loadings', fontsize=13, fontweight='bold')
ax1.axvline(0, color='black', linewidth=0.8)
ax1.grid(axis='x', alpha=0.3)

# PC2 loadings
ax2.barh(feature_names, components_df['PC2'], color='coral', alpha=0.7, edgecolor='black')
ax2.set_xlabel('Loading Value', fontsize=12)
ax2.set_title('PC2 Feature Loadings', fontsize=13, fontweight='bold')
ax2.axvline(0, color='black', linewidth=0.8)
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 This shows which original features contribute most to each PC")
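
Terminology varies here: some texts reserve "loadings" for the rows of `components_` scaled by the square root of `explained_variance_`. For standardized inputs, those scaled values are approximately the correlations between each original feature and each component score. The sketch below computes both under that assumption; `scaled_loadings` and `corr` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)
scores = pca.transform(X_scaled)

# Scale each unit-length axis by the standard deviation of its component
scaled_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Empirical correlation between every original feature and every PC score
corr = np.array([
    [np.corrcoef(X_scaled[:, j], scores[:, i])[0, 1] for i in range(2)]
    for j in range(X_scaled.shape[1])
])

loadings_df = pd.DataFrame(scaled_loadings, columns=["PC1", "PC2"], index=iris.feature_names)
print(loadings_df.round(3))
print("Max |scaled loading - correlation|:", round(float(np.abs(scaled_loadings - corr).max()), 4))
```

The small remaining gap comes from `StandardScaler` using the population standard deviation while PCA uses the sample variance.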

## 8. Best Practices

In [None]:
print("PCA Best Practices:\n")
print("=" * 60)
print("\n1. ALWAYS standardize features before PCA")
print("   - Use StandardScaler for features with different scales")
print("   - PCA is sensitive to feature magnitudes")

print("\n2. Choose n_components wisely:")
print("   - Start with 95% variance threshold")
print("   - Use scree plot to find 'elbow'")
print("   - Consider computational constraints")

print("\n3. Interpret results:")
print("   - Examine component loadings")
print("   - Visualize in PC space")
print("   - Check if separations make domain sense")

print("\n4. Handle missing values:")
print("   - Impute before PCA (PCA can't handle NaN)")
print("   - Or use specialized missing data PCA")

print("\n5. Use in pipelines:")
print("   - Integrate with sklearn Pipeline")
print("   - Combine with other preprocessing")
print("   - Useful for hyperparameter tuning")

print("\n6. Watch for:")
print("   - Outliers (can dominate PCs)")
print("   - Non-linear relationships (consider kernel PCA)")
print("   - Categorical variables (encode properly first)")
print("=" * 60)
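
Points 4 and 5 can be combined in one sketch: a `Pipeline` that imputes missing values, standardizes, applies PCA, and feeds a classifier, with `n_components` tuned by cross-validation. The artificially removed values, the `LogisticRegression` classifier, and the step names are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Knock out about 5% of the values to simulate missing data (illustration only)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # PCA cannot handle NaN
    ("scale", StandardScaler()),                  # put features on a common scale
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),   # hypothetical downstream model
])

# Treat the number of components as a hyperparameter
search = GridSearchCV(pipe, {"pca__n_components": [1, 2, 3, 4]}, cv=5)
search.fit(X_missing, y)

print("Best n_components:", search.best_params_["pca__n_components"])
print(f"Best CV accuracy:  {search.best_score_:.3f}")
```

Because every step lives inside the pipeline, the imputer, scaler, and PCA are refit on each training fold, which avoids leaking information from the validation fold.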

## Key Takeaways

### What You Learned

1. **sklearn PCA is easy**: Just fit and transform!
2. **Key parameters**: `n_components` (most important)
3. **Important attributes**: `components_`, `explained_variance_ratio_`
4. **Determining components**: Use scree plot or variance threshold
5. **Always standardize**: Critical for meaningful PCA
6. **Interpret carefully**: Look at loadings and visualizations

### When to Use PCA

- **Visualization**: Reduce to 2D/3D for plotting
- **Noise reduction**: Keep signal, discard noise
- **Speed up models**: Fewer features = faster training
- **Feature engineering**: Create meaningful combinations
- **Multicollinearity**: Replace correlated features with uncorrelated components

### Next Steps

In the next notebook:
- Compare our implementation with sklearn
- Verify they produce matching results (up to the sign of each component)
- Understand performance differences

---

**Great job!** You now know how to use sklearn's PCA like a pro.

Continue to: `comparison_scratch_vs_sklearn.ipynb`