## PCA Exercises and Solutions

### Exercise 1: Data Preprocessing for PCA


1. Load the provided dataset `data_pca.csv` into a Pandas DataFrame.
2. Check for missing values and handle them appropriately.
3. Standardize the dataset using `StandardScaler` from `sklearn.preprocessing`.
    

In [None]:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Step 1: Load the dataset
df = pd.read_csv('data_pca.csv')

# Step 2: Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

# Handle missing values (e.g., fill numeric columns with their column mean)
df = df.fillna(df.mean(numeric_only=True))

# Step 3: Standardize the dataset
scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)

# Convert back to DataFrame for convenience
df_standardized = pd.DataFrame(standardized_data, columns=df.columns)

print("Standardized Data:")
print(df_standardized.head())
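
A quick sanity check after standardizing is to confirm that every column has mean close to 0 and standard deviation close to 1. The sketch below uses a small synthetic DataFrame (`demo`, with made-up columns `a`, `b`, `c`) as a hypothetical stand-in for `data_pca.csv`, since the real file's columns are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for data_pca.csv: three numeric columns on different scales
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(10, 2, size=100),
    "b": rng.normal(0, 50, size=100),
    "c": rng.normal(-3, 0.1, size=100),
})
demo.loc[[3, 17], "b"] = np.nan  # inject a few missing values

# Impute with column means, then standardize
demo = demo.fillna(demo.mean(numeric_only=True))
scaled = StandardScaler().fit_transform(demo)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print("Column means:", np.round(col_means, 6))
print("Column stds: ", np.round(col_stds, 6))
```

If the means are not near 0 or the standard deviations not near 1, the usual culprits are leftover missing values or a non-numeric column that slipped into the input.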
    

### Exercise 2: Applying PCA


1. Perform PCA on the standardized data to reduce the dimensionality to 2 principal components.
2. Print the explained variance ratio for each component.
3. Create a scatter plot of the data points in the reduced 2D space.
    

In [None]:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Step 1: Apply PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(standardized_data)

# Step 2: Print explained variance ratio
explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance)

# Step 3: Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(pca_data[:, 0], pca_data[:, 1], alpha=0.7, edgecolors='b')
plt.title('PCA - 2 Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
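
To interpret the two axes of the scatter plot, it can help to look at the loadings, i.e. how much each original feature contributes to each component. The following is a minimal sketch on synthetic data (the DataFrame `demo` and its columns `x1`, `x2`, `x3` are invented for illustration), not on `data_pca.csv` itself.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: x1 and x2 are strongly correlated, x3 is independent noise
rng = np.random.default_rng(1)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": base + rng.normal(scale=0.1, size=200),
    "x2": base + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

scaled = StandardScaler().fit_transform(demo)
pca_demo = PCA(n_components=2).fit(scaled)

# components_ has shape (n_components, n_features); transpose so rows are features
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=demo.columns,
    columns=["PC1", "PC2"],
)
print(loadings.round(3))

total_ratio = pca_demo.explained_variance_ratio_.sum()
print("Variance retained by 2 components:", round(total_ratio, 3))
```

Each row of `components_` is a unit-length direction in the original feature space, so large absolute loadings mark the features that dominate a component. The sign of a component is arbitrary and may flip between runs or library versions.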
    

### Exercise 3: Choosing the Number of Components


1. Perform PCA without specifying the number of components (use all components).
2. Plot the cumulative explained variance ratio and use it to find the smallest number of components that retains at least 95% of the variance.
    

In [None]:

# Step 1: Perform PCA with all components
pca_full = PCA()
pca_full.fit(standardized_data)

# Step 2: Plot cumulative explained variance ratio
cumulative_variance = pca_full.explained_variance_ratio_.cumsum()

plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Cumulative Explained Variance')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.axhline(y=0.95, color='r', linestyle='-')
# Anchor the label at x=1 (the first component) so it stays inside the axes
plt.text(1, 0.96, '95% Threshold', color='red', fontsize=12)
plt.grid()
plt.show()
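
Reading the threshold off the plot works, but the same number can also be computed directly. The sketch below shows two ways: locating the first index where the cumulative ratio reaches 0.95, and passing a float to `n_components`, which scikit-learn interprets as a target fraction of variance. As an assumption, it uses the bundled `digits` dataset as example input, because `data_pca.csv` is not available here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example input (assumption): the digits dataset stands in for data_pca.csv
X_demo = StandardScaler().fit_transform(load_digits().data)

# Option 1: fit all components and find the first one crossing the threshold
pca_all = PCA().fit(X_demo)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
n_95 = int(np.argmax(cumulative >= 0.95)) + 1  # +1 converts index to count

# Option 2: let PCA choose the number of components for 95% variance
pca_95 = PCA(n_components=0.95).fit(X_demo)

print("Components needed (manual):", n_95)
print("Components chosen by PCA:  ", pca_95.n_components_)
```

The two approaches should agree; the float form is convenient inside a pipeline, while the manual form makes the trade-off visible alongside the plot.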
    

### Exercise 4: PCA in Practice


1. Use PCA to reduce the dimensionality of a high-dimensional dataset (e.g., `digits` dataset from `sklearn.datasets`).
2. Visualize the first two principal components of the data, using a color code to distinguish between classes.
    

In [None]:

from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Step 1: Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Step 2: Standardize the dataset
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Step 3: Apply PCA
pca_digits = PCA(n_components=2)
X_pca = pca_digits.fit_transform(X_standardized)

# Step 4: Visualize the PCA result
plt.figure(figsize=(8, 6))
# Use a qualitative colormap: digit classes are categories, not a continuous scale
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)
plt.title('PCA of Digits Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(scatter, label='Digit Class', ticks=range(10))
plt.grid()
plt.show()
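
A 2D projection of 64-dimensional data necessarily discards information, so it is worth checking how much. The sketch below, offered as one possible check rather than part of the exercise, reports the variance retained by two components and the reconstruction error after mapping the projection back with `inverse_transform`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_digits().data)

# Project to 2 components, then map back to the original 64 dimensions
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_demo)
X_back = pca_2d.inverse_transform(X_2d)

retained = pca_2d.explained_variance_ratio_.sum()
mse = np.mean((X_demo - X_back) ** 2)

print(f"Variance retained by 2 components: {retained:.3f}")
print(f"Mean squared reconstruction error: {mse:.3f}")
```

A low retained fraction means the scatter plot shows only part of the class structure, so overlapping clusters in 2D are not necessarily inseparable in the full feature space.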
    