
# Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms the data into a new coordinate system in which the first coordinate (the first principal component) captures the greatest variance of any projection of the data, the second coordinate captures the second greatest variance, and so on.

## Theoretical Background

PCA involves the following steps:

1. **Standardize the Data**: Center the data by subtracting the mean of each feature and scale to unit variance.
2. **Covariance Matrix**: Compute the covariance matrix of the standardized data.
3. **Eigenvalues and Eigenvectors**: Calculate the eigenvalues and eigenvectors of the covariance matrix.
4. **Principal Components**: Sort the eigenvectors by decreasing eigenvalues and select the top k eigenvectors.
5. **Transform Data**: Project the standardized data onto the selected eigenvectors to get the principal components.

Mathematically, PCA can be represented as:

\[ X_{new} = X \cdot W \]

where \( X \) is the standardized data (one row per sample, one column per feature), \( W \) is the matrix whose columns are the selected eigenvectors, and \( X_{new} \) is the transformed data.
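The five steps and the projection formula can be sketched in plain NumPy. This is a minimal illustration rather than a reference implementation: the helper name `pca_project` and the synthetic data are hypothetical, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    # Step 1: standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features (columns)
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigendecomposition; eigh returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: indices of the k largest eigenvalues, in decreasing order
    order = np.argsort(eigenvalues)[::-1][:k]
    W = eigenvectors[:, order]
    # Step 5: X_new = X . W
    return X_std @ W, eigenvalues[order]

# Hypothetical data: 200 samples, 3 features, two of them strongly correlated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

X_new, top_eigenvalues = pca_project(X, k=2)
print("Shape of projected data:", X_new.shape)
print("Top eigenvalues:", top_eigenvalues)
```

The sample variance of each column of `X_new` equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by the variance they capture.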



## Hands-on Example

Let's walk through a hands-on example using Python and the scikit-learn library.

### Step 1: Import Libraries

First, we need to import the necessary libraries.



```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

# For this example, we will use the Iris dataset
from sklearn.datasets import load_iris
```



### Step 2: Load and Standardize the Data

We will use the Iris dataset (150 samples, 4 features) and standardize each feature to zero mean and unit variance.



```python
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Standardize the data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Convert to DataFrame for easier handling
df = pd.DataFrame(X_std, columns=feature_names)
df['target'] = y
```
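As a quick sanity check, the standardized features should have zero mean and unit standard deviation. The sketch below reloads the data so that it runs on its own, and compares `StandardScaler` against a manual computation, assuming the population standard deviation (NumPy's default `ddof=0`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_std = StandardScaler().fit_transform(X)

# Per-feature mean and standard deviation after scaling
means_ok = np.allclose(X_std.mean(axis=0), 0.0)
stds_ok = np.allclose(X_std.std(axis=0), 1.0)

# Manual standardization with the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
matches_manual = np.allclose(X_std, X_manual)

print("Zero mean:", means_ok)
print("Unit standard deviation:", stds_ok)
print("Matches manual standardization:", matches_manual)
```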



### Step 3: Covariance Matrix and Eigenvalues

Compute the covariance matrix and find the eigenvalues and eigenvectors.



```python
# Compute the covariance matrix
cov_matrix = np.cov(X_std.T)

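# Compute the eigenvalues and eigenvectors.
# The covariance matrix is symmetric, so eigh is the appropriate routine;
# it guarantees real eigenvalues and orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)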

# Sort the eigenvalues and corresponding eigenvectors
sorted_idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_idx]
eigenvectors = eigenvectors[:, sorted_idx]

print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
```
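It can help to confirm that the decomposition behaves as expected. The following sketch, which recomputes the covariance matrix so that it is self-contained, checks three properties: each eigenpair satisfies \( C v = \lambda v \), the eigenvectors are orthonormal, and the eigenvalues sum to the trace of the covariance matrix (the total variance).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
cov_matrix = np.cov(X_std.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Column j of `eigenvectors` pairs with eigenvalues[j]; broadcasting scales each column
residual = cov_matrix @ eigenvectors - eigenvectors * eigenvalues
max_residual = np.abs(residual).max()

# Eigenvectors of a symmetric matrix form an orthonormal basis
is_orthonormal = np.allclose(eigenvectors.T @ eigenvectors, np.eye(4))

# The eigenvalues sum to the total variance (trace of the covariance matrix)
trace_matches = np.isclose(eigenvalues.sum(), np.trace(cov_matrix))

print("Max residual of C v - lambda v:", max_residual)
print("Orthonormal eigenvectors:", is_orthonormal)
print("Sum of eigenvalues equals trace:", trace_matches)
```

Note that the trace is slightly above 4 here, because `StandardScaler` divides by the population standard deviation while `np.cov` uses the sample (n - 1) normalization.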



### Step 4: PCA with Scikit-learn

Perform PCA using the scikit-learn library and transform the data.



```python
# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Create a DataFrame with the principal components
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['target'] = y

# Explained variance
print("Explained variance ratio:", pca.explained_variance_ratio_)
```
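The scikit-learn result can be related back to the eigendecomposition route. The sketch below is self-contained and assumes the same standardized Iris data: it compares the explained variance ratio with the normalized eigenvalues, and compares the projections up to sign, since the sign of each principal component is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Manual route: eigendecomposition of the covariance matrix, sorted descending
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
manual_ratio = eigenvalues / eigenvalues.sum()
X_manual = X_std @ eigenvectors[:, :2]

# scikit-learn route
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

ratios_match = np.allclose(manual_ratio[:2], pca.explained_variance_ratio_)
# Compare absolute values because each component is defined only up to sign
projections_match = np.allclose(np.abs(X_manual), np.abs(X_pca))
cumulative = np.cumsum(manual_ratio)

print("Explained variance ratios match:", ratios_match)
print("Projections match up to sign:", projections_match)
print("Cumulative explained variance:", cumulative)
```

The cumulative sum is one common way to choose k: keep the smallest number of components whose cumulative explained variance exceeds a chosen threshold.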



### Step 5: Visualize the Result

Visualize the first two principal components.



```python
# Plot the first two principal components
plt.figure(figsize=(10, 7))
sns.scatterplot(x='PC1', y='PC2', hue='target', data=df_pca, palette='viridis', s=100)
plt.title('PCA of Iris Dataset')
plt.show()
```
