Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement Principal Component Analysis (PCA) #12596

Merged
merged 1 commit into from
Mar 2, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions DIRECTORY.md
Original file line number Diff line number Diff line change
@@ -395,6 +395,7 @@
* [Minimum Tickets Cost](dynamic_programming/minimum_tickets_cost.py)
* [Optimal Binary Search Tree](dynamic_programming/optimal_binary_search_tree.py)
* [Palindrome Partitioning](dynamic_programming/palindrome_partitioning.py)
* [Range Sum Query](dynamic_programming/range_sum_query.py)
* [Regex Match](dynamic_programming/regex_match.py)
* [Rod Cutting](dynamic_programming/rod_cutting.py)
* [Smith Waterman](dynamic_programming/smith_waterman.py)
@@ -608,6 +609,7 @@
* [Mfcc](machine_learning/mfcc.py)
* [Multilayer Perceptron Classifier](machine_learning/multilayer_perceptron_classifier.py)
* [Polynomial Regression](machine_learning/polynomial_regression.py)
* [Principle Component Analysis](machine_learning/principle_component_analysis.py)
* [Scoring Functions](machine_learning/scoring_functions.py)
* [Self Organizing Map](machine_learning/self_organizing_map.py)
* [Sequential Minimum Optimization](machine_learning/sequential_minimum_optimization.py)
85 changes: 85 additions & 0 deletions machine_learning/principle_component_analysis.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
"""
Principal Component Analysis (PCA) is a dimensionality reduction technique
used in machine learning. It transforms high-dimensional data into a lower-dimensional
representation while retaining as much variance as possible.

This implementation follows best practices, including:
- Standardizing the dataset.
- Computing principal components using Singular Value Decomposition (SVD).
- Returning transformed data and explained variance ratio.
"""

import doctest

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def collect_dataset() -> tuple[np.ndarray, np.ndarray]:
"""
Collects the dataset (Iris dataset) and returns feature matrix and target values.

:return: Tuple containing feature matrix (X) and target labels (y)

Example:
>>> X, y = collect_dataset()
>>> X.shape
(150, 4)
>>> y.shape
(150,)
"""
data = load_iris()
return np.array(data.data), np.array(data.target)


def apply_pca(data_x: np.ndarray, n_components: int) -> tuple[np.ndarray, np.ndarray]:
"""
Applies Principal Component Analysis (PCA) to reduce dimensionality.

:param data_x: Original dataset (features)
:param n_components: Number of principal components to retain
:return: Tuple containing transformed dataset and explained variance ratio

Example:
>>> X, _ = collect_dataset()
>>> transformed_X, variance = apply_pca(X, 2)
>>> transformed_X.shape
(150, 2)
>>> len(variance) == 2
True
"""
# Standardizing the dataset
scaler = StandardScaler()
data_x_scaled = scaler.fit_transform(data_x)

# Applying PCA
pca = PCA(n_components=n_components)
principal_components = pca.fit_transform(data_x_scaled)

return principal_components, pca.explained_variance_ratio_


def main() -> None:
"""
Driver function to execute PCA and display results.
"""
data_x, data_y = collect_dataset()

# Number of principal components to retain
n_components = 2

# Apply PCA
transformed_data, variance_ratio = apply_pca(data_x, n_components)

print("Transformed Dataset (First 5 rows):")
print(transformed_data[:5])

print("\nExplained Variance Ratio:")
print(variance_ratio)


if __name__ == "__main__":
doctest.testmod()
main()