# Module 1: Introduction to Scikit-Learn

## Part 6: Principal Component Analysis (PCA)

In this part, we will explore Principal Component Analysis (PCA), a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving the most important information. PCA is widely used for feature extraction and visualization. 

### 6.1 Understanding Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique that transforms a dataset into a new coordinate system whose axes are called the principal components. The principal components are linear combinations of the original features, ordered by the amount of variance they capture in the data. The first principal component (PC1) explains the most variance, followed by PC2, PC3, and so on. PCA identifies the directions in which the data varies the most and projects the data onto those directions, which minimizes the information lost in the projection.

The key idea behind PCA is to reduce the dimensionality of the data while preserving as much information as possible. It achieves this by finding a set of orthogonal axes (principal components), each a linear combination of the original features, that explain the maximum variance in the data. Because the components are orthogonal, the coordinates of the data along different components are uncorrelated.
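The ideas above can be sketched directly in NumPy. This is a minimal illustration, not scikit-learn's implementation: it assumes a small synthetic dataset (the `latent` and `mixing` arrays are made up for the example) and computes the principal components from an eigendecomposition of the covariance matrix.

```python
import numpy as np

# Synthetic, correlated 3-feature data (an assumption made only for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -0.5]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# 1. Center the data so every feature has zero mean.
X_centered = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix (symmetric, so eigh applies).
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. eigh returns eigenvalues in ascending order; reorder so PC1 comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order].T  # one principal component per row

# 4. Project the centered data onto the principal components.
scores = X_centered @ components.T

# Fraction of the total variance captured by each component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_variance_ratio)

# The components are orthonormal, and the projected coordinates are uncorrelated.
print("Orthonormal components:", np.allclose(components @ components.T, np.eye(3)))
print("Uncorrelated scores:", np.allclose(np.cov(scores, rowvar=False), np.diag(eigenvalues)))
```

Each row of `components` holds the weights of one linear combination of the original features, and the covariance matrix of `scores` is diagonal, which is what "uncorrelated" means here.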

### 6.2 Training and Evaluation

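To apply PCA, we need a dataset with numerical features. The algorithm computes the principal components from the data itself, typically through an eigendecomposition of the covariance matrix or a singular value decomposition (SVD) of the centered data, and then applies the resulting linear transformation. Each principal component is a linear combination of the original features, chosen to maximize the explained variance while remaining orthogonal to the components before it.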

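PCA assumes that the data is centered (zero mean) and is sensitive to the scale of the features, so features measured on larger scales can dominate the components. Scikit-learn's PCA centers the data automatically but does not rescale it, so it is common to standardize the features to zero mean and unit variance with StandardScaler first. MinMaxScaler can also bring features to a comparable range, although it does not produce unit variance.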
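To see why scaling matters, the following sketch compares PCA on raw and standardized features. As an assumption for illustration only, one Iris feature is converted from centimeters to millimeters (the `X_mm` array), which changes nothing about the flowers but inflates that feature's variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Hypothetical change of units: express sepal length in mm instead of cm.
X_mm = X.copy()
X_mm[:, 0] *= 10

# PCA on the raw features: the feature with the largest scale dominates PC1.
raw_ratio = PCA(n_components=2).fit(X_mm).explained_variance_ratio_

# PCA after standardization: every feature contributes with unit variance.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
scaled_pca.fit(X_mm)
scaled_ratio = scaled_pca.named_steps['pca'].explained_variance_ratio_

print("Explained variance ratio without scaling:", raw_ratio)
print("Explained variance ratio with StandardScaler:", scaled_ratio)
```

Without scaling, the result depends on the units chosen for each feature. With standardization, it does not, which is usually the behavior we want when features are measured on different scales.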

Choosing the number of principal components (dimensionality) involves balancing dimension reduction and information preservation. One common approach is to look at the cumulative explained variance ratio and choose the smallest number of components that captures a significant portion of the variance (e.g., 95% or 99%). The cumulative explained variance ratio measures how much of the total variance in the dataset is explained by the first k principal components together; it is the running sum of the individual explained variance ratios.
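The selection rule described above can be sketched as follows. The 95% threshold and the use of standardized Iris features are assumptions chosen for illustration; the same steps apply to any numerical dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components to inspect the full variance spectrum.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative explained variance ratio:", cumulative)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed:", n_components)

# scikit-learn can apply a similar rule itself: a float between 0 and 1 is
# interpreted as the fraction of variance to keep (with svd_solver='full').
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X_scaled)
print("Components chosen by PCA:", pca_auto.n_components_)
```

Passing the threshold directly to `PCA` is convenient in pipelines, while computing `cumulative` by hand is useful when you want to plot the curve and inspect the trade-off.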

Once trained, we can use the PCA model to transform new, unseen data points into the reduced dimensional space. The transformed data points will have fewer dimensions, as we choose to keep only a subset of the principal components.
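As a sketch of this workflow, the example below fits the scaler and PCA on one part of the Iris data and then transforms held-out samples that played no role in fitting. The 80/20 split and the variable names (`reducer`, `X_new`) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_new = train_test_split(X, test_size=0.2, random_state=0)

# Learn the scaling parameters and principal components from training data only.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
reducer.fit(X_train)

# Project unseen samples into the reduced 2D space.
X_new_reduced = reducer.transform(X_new)
print(X_new.shape, "->", X_new_reduced.shape)

# Map back to the original feature space to see how much detail was lost.
X_new_restored = reducer.inverse_transform(X_new_reduced)
reconstruction_mse = np.mean((X_new - X_new_restored) ** 2)
print("Mean squared reconstruction error:", reconstruction_mse)
```

Calling `fit` only on the training data and `transform` on new data keeps information from unseen samples out of the learned components.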

PCA does not require explicit evaluation since its primary goal is dimensionality reduction. It is often used as a preprocessing step to improve the performance of machine learning models or for data visualization.
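When PCA is used as a preprocessing step, one way to judge it indirectly is to evaluate the downstream model with and without it. The sketch below assumes a logistic regression classifier and 5-fold cross-validation on Iris; both choices are illustrative and are not prescribed by this module.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Same classifier, with and without a PCA step in front of it.
with_pca = make_pipeline(StandardScaler(), PCA(n_components=2),
                         LogisticRegression(max_iter=1000))
without_pca = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validation refits the scaler and PCA inside every fold,
# so the held-out fold never influences the learned components.
scores_with = cross_val_score(with_pca, X, y, cv=5)
scores_without = cross_val_score(without_pca, X, y, cv=5)

print("Mean accuracy with PCA:   ", scores_with.mean())
print("Mean accuracy without PCA:", scores_without.mean())
```

Whether PCA helps depends on the dataset and the model, so a comparison like this is worth running rather than assuming an improvement.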

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older Matplotlib to enable 3D axes)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the Iris dataset: 150 samples with 4 numerical features and 3 classes.
iris = load_iris()
X = iris.data
y = iris.target

# Left panel: 3D scatter of the first three of the four original features.
fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(121, projection='3d')
ax1.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', s=60)
ax1.set_title('Original Data (first 3 of 4 features)')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_zlabel('Feature 3')

# Share of the total variance carried by each original feature (all 4 features).
explained_variance_ratio_original = np.var(X, axis=0) / np.sum(np.var(X, axis=0))
cumulative_variance_ratio_original = np.cumsum(explained_variance_ratio_original)

# Fit PCA on all 4 features and project the data onto the first 2 components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Right panel: 2D scatter in principal-component space.
ax2 = fig.add_subplot(122)
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', s=60)
ax2.set_title('Data After PCA (2D)')
ax2.set_xlabel('Principal Component 1')
ax2.set_ylabel('Principal Component 2')
plt.tight_layout()
plt.show()

# Cumulative share of the total variance explained by PC1 and by PC1 + PC2.
explained_variance_ratio_pca = pca.explained_variance_ratio_
cumulative_variance_ratio_pca = np.cumsum(explained_variance_ratio_pca)
print("Cumulative Variance Ratio (Original 4 Features):", cumulative_variance_ratio_original)
print("Cumulative Explained Variance Ratio (Data After PCA):", cumulative_variance_ratio_pca)
```

This example applies Principal Component Analysis (PCA) to visualize the four-dimensional Iris dataset. We first plot three of the four original features in 3D and then apply PCA to reduce the data to 2D while retaining most of the variance. The cumulative share of variance is computed both for the original features and for the principal components, which shows how much of the total variance the two retained components preserve. PCA serves as a powerful tool for dimensionality reduction and visualization, aiding in data analysis and feature extraction tasks.

### 6.3 Summary

Principal Component Analysis (PCA) is a fundamental technique for dimensionality reduction in data analysis and machine learning. By identifying the principal components that capture the most variance in the data, PCA helps reduce data dimensionality while retaining crucial information. This process can improve computational efficiency, reduce noise, and aid in visualization and feature extraction. Careful consideration of the number of principal components to retain is necessary to strike a balance between dimension reduction and information preservation.