
# PCA and LDA

Exam Registration Number: 24011103063

Name: Tata Naga Sripranav



![PCA vs LDA](https://sebastianraschka.com/images/blog/2014/linear-discriminant-analysis/lda_1.png)

## PCA on Images
Perform PCA on image data and study reconstruction quality for

$$ k=2,4,8,16,64,256 $$

### Exercise 1

- Select one image randomly

- Convert the image into a vector of size $4096 \times 1$

- Normalize pixel values to $[0,1]$

- Find the covariance matrix and perform PCA

- Draw a scree plot and check for an elbow shape

- Reconstruct the image from the top $k$ principal components (work out how)

- Display all reconstructions in a grid and relate them to the scree plot
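
The covariance and scree-plot steps above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's solution: the helper names `explained_variance_ratio` and `components_needed`, the 95% threshold, and the synthetic 20-feature data (three strong directions plus small noise) are all assumptions chosen so the elbow is easy to see.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    cov = np.cov(X, rowvar=False)            # np.cov centers the columns itself
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off
    return eigvals / eigvals.sum()

def components_needed(ratio, threshold=0.95):
    """Smallest k whose cumulative explained variance reaches the threshold."""
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic stand-in for image vectors (assumption): 3 latent directions + noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

ratio = explained_variance_ratio(X)
k95 = components_needed(ratio)
print("leading ratios:", np.round(ratio[:5], 4))
print("components for 95% variance:", k95)
```

Plotting `ratio` against the component number gives the scree plot; the elbow is where the values drop to the noise floor. For the $4096$-dimensional face vectors the covariance matrix is $4096 \times 4096$, so the same idea applies but costs more to compute.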

You will use the Olivetti faces dataset. Let $n$ be the total number of images in the dataset, let $x$ be the original image, and let $x'$ be the image reconstructed using $k$ principal components. The MSE is then given by:

$$MSE=\frac{1}{n}||x-x'||^2$$

- Plot MSE vs k
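
The reconstruction and MSE-vs-$k$ steps can be sketched as follows. This is a hedged example under stated assumptions: the function name `reconstruct_with_k_components` is hypothetical, the data are random $8 \times 8$ stand-in images rather than the Olivetti faces, and the squared error is averaged over the pixels of one image, which is one possible choice of normalizing constant. It uses the SVD of the centered data instead of forming the covariance matrix explicitly; both give the same principal axes.

```python
import numpy as np

def reconstruct_with_k_components(X, k):
    """Project rows of X onto the top-k principal axes and map them back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]             # shape (k, d)
    scores = Xc @ components.T      # compressed representation, shape (n, k)
    return scores @ components + mean

# Stand-in data (assumption): 100 random 8x8 grayscale images with values 0..255
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 8, 8)).astype(float)
X = images.reshape(len(images), -1) / 255.0   # flatten to 64-vectors in [0, 1]

x = X[0]                                      # the image under study
mse_by_k = {}
for k in (2, 4, 8, 16, 64):
    x_rec = reconstruct_with_k_components(X, k)[0]
    mse_by_k[k] = float(np.mean((x - x_rec) ** 2))

for k, mse in mse_by_k.items():
    print(f"k={k:>2}  MSE={mse:.6f}")
```

Because the top-$k$ subspaces are nested, the error cannot increase as $k$ grows, and it vanishes once all components are kept.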

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

# 1. Load and Preprocess
# fetch_olivetti_faces automatically normalizes pixel values to [0, 1]
faces_data = fetch_olivetti_faces()
X = faces_data.data  # shape (400, 4096)

# Select one image randomly for reconstruction
random_idx = np.random.randint(0, 400)
original_face = X[random_idx]

# 2. Perform PCA and calculate Scree Plot data
# PCA() keeps min(n_samples, n_features) = 400 components here, which accounts for all the variance
pca_full = PCA().fit(X)
explained_variance = pca_full.explained_variance_ratio_

# 3. Study Reconstruction Quality for specific k values
k_values = [2, 4, 8, 16, 64, 256]
reconstructions = []
mse_values = []

for k in k_values:
    pca = PCA(n_components=k)
    # Step: Transform (Compress) and then Inverse Transform (Reconstruct)
    X_compressed = pca.fit_transform(X)
    X_reconstructed = pca.inverse_transform(X_compressed)

    # Save the specific random image reconstruction
    reconstructions.append(X_reconstructed[random_idx])

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(X[random_idx], X_reconstructed[random_idx])
    mse_values.append(mse)

# --- PLOTTING ---

# Plot 1: Cumulative explained variance (scree-style view; look for where the curve flattens)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(np.arange(1, 301), np.cumsum(explained_variance[:300]))  # first 300 of 400 components
plt.title('Cumulative Explained Variance (Elbow Method)')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')

# Plot 2: MSE vs k
plt.subplot(1, 2, 2)
plt.plot(k_values, mse_values, marker='o', color='red')
plt.title('MSE vs k')
plt.xlabel('k (Components)')
plt.ylabel('Mean Squared Error')
plt.show()

# Plot 3: Grid of Reconstructions
fig, axes = plt.subplots(1, len(k_values) + 1, figsize=(20, 5))
axes[0].imshow(original_face.reshape(64, 64), cmap='gray')
axes[0].set_title("Original")
axes[0].axis('off')

for i, k in enumerate(k_values):
    axes[i+1].imshow(reconstructions[i].reshape(64, 64), cmap='gray')
    axes[i+1].set_title(f"k={k}")
    axes[i+1].axis('off')

plt.show()

# Comparison of PCA vs LDA

## Dataset 1

The dataset is the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

[Dataset Link](https://archive.ics.uci.edu/dataset/109/wine)

### Exercise 2

- Perform feature engineering on the above dataset.
- Model 1: Train a logistic regression model on the dataset.
- Model 2: Apply PCA to the dataset and train logistic regression on the transformed data.
- Model 3: Apply LDA to the dataset and train logistic regression on the transformed data.
- Compare train and test accuracy for Models 1, 2 and 3.
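
To see how the LDA projection in Model 3 differs from PCA, here is a minimal NumPy sketch of Fisher's discriminant directions. It is an illustration under assumptions, not scikit-learn's implementation: the helpers `scatter_matrices` and `lda_directions` are hypothetical names, and the data are a synthetic three-class, five-feature stand-in for the wine measurements. It also shows why a three-class problem yields at most two discriminant directions: the between-class scatter matrix has rank at most $C-1$.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return the within-class (Sw) and between-class (Sb) scatter matrices."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    return Sw, Sb

def lda_directions(X, y, n_components):
    """Top eigenvectors of Sw^{-1} Sb, which maximize class separation."""
    Sw, Sb = scatter_matrices(X, y)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]], eigvals.real[order]

# Synthetic stand-in (assumption): 3 classes, 5 features, 40 samples per class
rng = np.random.default_rng(0)
class_means = rng.normal(scale=3.0, size=(3, 5))
X = np.vstack([rng.normal(loc=m, size=(40, 5)) for m in class_means])
y = np.repeat(np.arange(3), 40)

W, eigvals = lda_directions(X, y, n_components=2)
X_lda = X @ W                                  # projected data, shape (120, 2)

_, Sb = scatter_matrices(X, y)
rank_Sb = np.linalg.matrix_rank(Sb, tol=1e-8 * np.linalg.norm(Sb))
print("rank of Sb:", rank_Sb)
print("projected shape:", X_lda.shape)
```

Unlike PCA, these directions depend on the labels `y`: they are chosen to spread the class means apart relative to the within-class spread, not to capture the most overall variance.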

In [None]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load the dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. FEATURE SCALING (Crucial for PCA/LDA)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# --- APPROACH A: PCA + Logistic Regression ---
# Compressing 13 features down to 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

classifier_pca = LogisticRegression()
classifier_pca.fit(X_train_pca, y_train)
y_pred_pca = classifier_pca.predict(X_test_pca)

# --- APPROACH B: LDA + Logistic Regression ---
# Compressing 13 features down to 2 components (Max is C-1, so 3-1=2)
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train) # LDA needs y_train!
X_test_lda = lda.transform(X_test)

classifier_lda = LogisticRegression()
classifier_lda.fit(X_train_lda, y_train)
y_pred_lda = classifier_lda.predict(X_test_lda)

# --- APPROACH C: Just Direct Logistic Regression ---
classifier_reg = LogisticRegression()
classifier_reg.fit(X_train, y_train)
y_pred = classifier_reg.predict(X_test)

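# 4. Results comparison: train and test accuracy for each model
models = {
    "Logistic regression only": (classifier_reg, X_train, X_test),
    "PCA + logistic regression": (classifier_pca, X_train_pca, X_test_pca),
    "LDA + logistic regression": (classifier_lda, X_train_lda, X_test_lda),
}
for name, (clf, X_tr, X_te) in models.items():
    train_acc = accuracy_score(y_train, clf.predict(X_tr))
    test_acc = accuracy_score(y_test, clf.predict(X_te))
    print(f"{name}: train accuracy {train_acc:.2%}, test accuracy {test_acc:.2%}")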