# **Assignment 4: KNN & PCA**

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?


#  **K-Nearest Neighbors (KNN)**

KNN is a **supervised machine-learning algorithm** used for **classification** and **regression**. It predicts outputs based on the idea that **similar data points are close to each other** in feature space.

It is a **lazy learner**, meaning it does not build a model during training. Instead, it stores the training data and makes predictions only when needed.


#  **How KNN Works**

For a new input point:

1. Compute its distance to all points in the training dataset.
2. Select the **K closest points** (neighbors).
3. Make a prediction using those neighbors.

Common distance metrics:

* Euclidean (most common)
* Manhattan
* Minkowski
* Cosine distance (1 - cosine similarity)
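
The metrics listed above can be sketched in plain Python. This is a minimal illustration, not a library API: the function names `minkowski`, `euclidean`, `manhattan`, and `cosine_distance` are assumptions chosen for readability, and the cosine version assumes neither vector is all zeros.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0
```

Euclidean and Manhattan are both special cases of Minkowski, which is why scikit-learn exposes Euclidean distance as `metric="minkowski", p=2`.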


#  **KNN for Classification**

Steps:

1. Find the **K nearest neighbors** of the new point.
2. Each neighbor "votes" for its class label.
3. The class with the **majority vote** becomes the prediction.

Example:
If K = 5 and among the neighbors 3 are Class A and 2 are Class B → predicted class = **A**.
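
The voting procedure can be sketched from scratch as follows. The helper `knn_classify` and the toy points are made up for illustration; in practice `KNeighborsClassifier` from scikit-learn does this work.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy data (made up): three class-A points near the origin, two class-B points
# slightly farther away, and one distant class-B point.
points = [(0, 0), (1, 0), (0, 1), (2, 2), (2, 3), (9, 9)]
labels = ["A", "A", "A", "B", "B", "B"]

# The 5 nearest neighbors of (1, 1) are 3 x A and 2 x B, so the vote goes to A.
print(knn_classify(points, labels, query=(1, 1), k=5))  # A
```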


#  **KNN for Regression**

Steps:

1. Find the **K nearest neighbors**.
2. Instead of voting, take the **average value** (or weighted average) of their numerical labels.

Example:
Neighbor values = 10, 12, 14 → prediction = (10 + 12 + 14) / 3 = **12**.

**Weighted regression:** closer neighbors get more influence, for example by weighting each neighbor by 1/distance.
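
Both averaging schemes fit in one small function. This is a sketch under stated assumptions: `knn_regress` is a hypothetical helper that receives the targets (and optionally the distances) of neighbors that have already been found, and the distances in the second call are invented.

```python
def knn_regress(neighbor_values, neighbor_distances=None):
    """Average the neighbors' targets; weight by 1/distance if distances are given."""
    if neighbor_distances is None:
        return sum(neighbor_values) / len(neighbor_values)
    # A small epsilon avoids division by zero when a neighbor coincides with the query.
    weights = [1.0 / (d + 1e-12) for d in neighbor_distances]
    weighted_sum = sum(w * v for w, v in zip(weights, neighbor_values))
    return weighted_sum / sum(weights)

values = [10, 12, 14]
print(knn_regress(values))                    # 12.0
print(knn_regress(values, [1.0, 2.0, 4.0]))   # below 12: pulled toward the closest neighbor
```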


#  **Choosing the Value of K**

* **Small K** → more complex, risk of overfitting.
* **Large K** → smoother, risk of underfitting.
* Typically selected using cross-validation.
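
One way to pick K by cross-validation is sketched below. The candidate list, the 5-fold setting, and the use of the Wine dataset are assumptions for illustration. Scaling sits inside the pipeline so that it is re-fit on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # With an integer cv and a classifier, scikit-learn uses stratified folds.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best k:", best_k)
```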


# **Strengths of KNN**

* Very simple, easy to understand.
* Works well for small/medium datasets.
* Handles non-linear decision boundaries.
* Almost no training time, since fitting only stores the data.



#  **Weaknesses of KNN**

* Slow prediction on large datasets (computes many distances).
* Sensitive to irrelevant features and differences in scale → requires normalization.
* Performs poorly in very high-dimensional spaces (curse of dimensionality).





---

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?


#  **What Is the Curse of Dimensionality?**

The **Curse of Dimensionality** refers to a set of problems that arise when data has **too many features (dimensions)**. As the number of dimensions increases:

* Data becomes **sparse** (spread out).
* Distances between points become **less meaningful**.
* Models that rely on distance or neighborhood relationships become less effective.

In simple terms:
**High-dimensional space makes it harder to find "nearby" points.**


#  **Why Does This Happen?**

When you add more dimensions:

1. **Space grows exponentially**, but the amount of data does not.

   * You would need exponentially more data to cover the space effectively.

2. **All points become far apart.**

   * The relative difference between the distances to the nearest and farthest neighbors shrinks.

3. **Distance metrics stop working well.**

   * Euclidean distance becomes less discriminative.
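
The shrinking gap between nearest and farthest neighbors can be observed with a small simulation. This is an illustrative sketch on uniform random data, and `relative_contrast` is a hypothetical helper: it measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, value in contrast.items():
    print(f"{d:>4} dims: relative contrast = {value:.2f}")
```

The contrast should fall sharply as the number of dimensions grows, which is the sense in which "nearest" stops being meaningfully nearer.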


#  **How the Curse of Dimensionality Affects KNN**

KNN heavily depends on **distance** to find the closest neighbors. In high dimensions, this becomes problematic.

### **1. Distances lose meaning**

When dimensionality is high:

* Nearest neighbors are **not much closer** than farthest neighbors.
* KNN struggles to identify truly similar points.

This leads to **poor classification and regression accuracy**.


### **2. Increased risk of overfitting**

With many irrelevant features:

* Noise dominates genuine patterns.
* KNN may pick misleading neighbors because distance is distorted.


### **3. Computational cost increases**

More dimensions mean:

* More distance calculations
* Each distance computation becomes more expensive
* Prediction time becomes slow

Since KNN is a lazy learner, this slows down the algorithm significantly.


### **4. Need for feature scaling and dimensionality reduction**

To combat high-dimensional issues, techniques like:

* **PCA (Principal Component Analysis)**
* **Feature selection (e.g., removing irrelevant features)**
* **Normalization/standardization**

are often applied before using KNN.




---

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?


#  **What Is Principal Component Analysis (PCA)?**

**PCA** is an **unsupervised dimensionality reduction technique** that transforms high-dimensional data into a smaller set of **new variables** called **principal components**.

These components:

* Are **linear combinations** of the original features
* Capture the **maximum variance** in the data
* Are **uncorrelated** with each other
* Are ordered:

  * 1st component → most variance
  * 2nd component → second-most variance
  * ...and so on

PCA is mainly used to:

* Reduce dimensionality (to combat the curse of dimensionality)
* Remove noise
* Visualize high-dimensional data
* Speed up algorithms that struggle with many features


#  **How PCA Works (Conceptually)**

1. Standardize the data.
2. Compute the covariance matrix.
3. Find its eigenvalues and eigenvectors.
4. Sort components by variance explained.
5. Select the top *k* components.
6. Transform data into the new component space.

The transformed features are **not original features**, but new axes that best describe the data's structure.
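
The six conceptual steps map almost line for line onto NumPy. The sketch below is for illustration only: `pca_from_scratch` is a hypothetical helper and the data is synthetic, built so that four features come from two latent ones. `sklearn.decomposition.PCA` is the usual tool in practice.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Standardize, build the covariance matrix, eigen-decompose, sort, select, project."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)          # 1. standardize
    cov = np.cov(X_std, rowvar=False)                     # 2. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)       # 3. eigen-decomposition
    order = np.argsort(eigenvalues)[::-1]                 # 4. sort by variance, descending
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    components = eigenvectors[:, :k]                      # 5. keep the top k directions
    X_projected = X_std @ components                      # 6. transform the data
    explained_ratio = eigenvalues / eigenvalues.sum()
    return X_projected, explained_ratio

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Synthetic data (an assumption): each latent feature appears twice, once with noise.
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=200),
])

X_2d, ratio = pca_from_scratch(X, k=2)
print(X_2d.shape)        # (200, 2)
print(ratio.round(3))
```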


#  **How PCA Differs from Feature Selection**

PCA is a **feature extraction** method, not a selection method.

| Aspect               | PCA (Feature Extraction)                                                   | Feature Selection                             |
| -------------------- | -------------------------------------------------------------------------- | --------------------------------------------- |
| **What it does**     | Creates new features (principal components) by combining original features | Chooses a subset of the existing features     |
| **Output features**  | New, transformed, uncorrelated components                                  | Original features only                        |
| **Interpretability** | Low: components are combinations of variables                              | High: uses actual features                    |
| **Type**             | Unsupervised                                                               | Usually supervised (based on target variable) |
| **Goal**             | Reduce dimensionality while preserving variance                            | Remove irrelevant or redundant features       |
| **Changes data?**    | Yes: transforms it                                                         | No: keeps original feature meanings           |


#  Example

Suppose you have 10 correlated features.

* **PCA** may create 3 new components that capture 95% of the variance.
* **Feature selection** may choose only 3 of the original 10 features.

Both reduce dimensionality, but in very different ways.
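
The contrast can be made concrete with scikit-learn. This sketch assumes the Wine dataset (13 features rather than the 10 in the example above) and uses `SelectKBest` with the ANOVA F-test as one possible selection method among many.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature extraction: 3 new axes, each a mix of all 13 original features.
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# Feature selection: keep 3 of the original columns (supervised, uses y).
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
kept = [data.feature_names[i] for i in selector.get_support(indices=True)]

print(X_pca.shape, X_selected.shape)   # (178, 3) (178, 3)
print("Selected original features:", kept)
```

Both outputs have three columns, but only the selected ones still carry their original feature names.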




---

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?


# **What Are Eigenvalues and Eigenvectors in PCA?**

In PCA, we compute the **covariance matrix** of the data and then find its **eigenvalues** and **eigenvectors**.

### **Eigenvectors**

* Define the **directions** of the new feature axes (principal components).
* Each eigenvector represents a direction in which the data varies.
* They are orthogonal to each other, so the resulting components are uncorrelated.

### **Eigenvalues**

* Tell us **how much variance** each eigenvector (component) captures.
* Larger eigenvalue → more important principal component.
* They allow us to rank components by significance.

In short:

* **Eigenvectors = directions of maximum variance**
* **Eigenvalues = amount of variance in those directions**
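
The defining relationship (covariance matrix times eigenvector equals eigenvalue times eigenvector) can be checked numerically. The 2×2 matrix below is a made-up stand-in for a covariance matrix.

```python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix (illustrative values).
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

for value, vector in zip(eigenvalues, eigenvectors.T):
    # Defining property: cov @ v equals lambda * v.
    assert np.allclose(cov @ vector, value * vector)
    print(f"eigenvalue {value:.1f}, direction {vector.round(3)}")

# The eigenvector columns are orthonormal, so V^T V is the identity.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))  # True
```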


#  **Why Are Eigenvalues and Eigenvectors Important in PCA?**

### **1. They determine the principal components**

PCA picks the top *k* eigenvectors (based on largest eigenvalues).
These become the new axes after transformation.


### **2. They help reduce dimensionality**

Eigenvalues indicate how much information (variance) each component contains.

Example:
If the first 2 components capture 95% of total variance, we can reduce from, say, 20 dimensions to 2 while losing only about 5% of the variance.


### **3. They ensure uncorrelated components**

Because eigenvectors of the covariance matrix are orthogonal:

* Principal components do not overlap in the information they describe.
* Models often perform better with uncorrelated features.


### **4. They help remove noise**

Small eigenvalues correspond to directions with very little variance, often noise.

By removing components with small eigenvalues, PCA:

* Simplifies data
* Enhances signal
* Reduces overfitting


#  Example

Imagine your data lies mostly along a diagonal line on a 2D plane:

1. PCA finds the eigenvector aligned with this line → **1st principal component**.
2. The eigenvalue for this direction is large → lots of variance.
3. A second eigenvector perpendicular to the line captures little variance (small eigenvalue).

So you keep the first component and drop the second → dimensionality reduced from 2D to 1D.
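
The diagonal-line example can be reproduced with synthetic data. The noise level and sample size below are arbitrary assumptions; the point is only that the leading eigenvector lines up with the diagonal.

```python
import numpy as np

rng = np.random.default_rng(42)
t = rng.normal(size=500)
# Points scattered tightly around the line y = x (synthetic, for illustration).
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))

X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

first_pc = eigenvectors[:, -1]          # eigh sorts ascending, so the last is the largest
share = eigenvalues[-1] / eigenvalues.sum()
X_1d = X_centered @ first_pc            # project from 2D down to 1D

print("First PC direction:", first_pc.round(2))   # close to +/-(0.71, 0.71)
print(f"Variance captured by PC1: {share:.1%}")
```

The sign of an eigenvector is arbitrary, so the direction may come out as (0.71, 0.71) or (-0.71, -0.71). Both describe the same axis.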


---

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?


# **How KNN and PCA Complement Each Other**

KNN and PCA are commonly used together because PCA helps address several weaknesses of KNN. The combination improves both **accuracy** and **efficiency**.


#  **1. PCA reduces dimensionality → makes KNN more effective**

KNN suffers from the **curse of dimensionality** because distances become less informative when there are too many features.

PCA:

* Reduces the number of features
* Removes noise
* Keeps the most informative directions of variance

This makes distance-based algorithms like KNN **more reliable**.


#  **2. PCA removes correlated and redundant features**

If many features are correlated:

* KNN may overemphasize those dimensions
* Distances can become distorted

PCA transforms the data into **uncorrelated (orthogonal)** components, which leads to more meaningful distance calculations.

This directly improves KNN performance.
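
The decorrelation claim can be verified numerically. The sketch below uses synthetic data with two strongly correlated features (an assumption for illustration) and checks that the covariance matrix of the rotated data is diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
# Two strongly correlated features plus one independent feature (synthetic).
X = np.column_stack([x, 2 * x + 0.2 * rng.normal(size=300), rng.normal(size=300)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

_, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ eigenvectors                      # rotate the data onto the principal axes

corr_before = np.corrcoef(X, rowvar=False)
cov_after = np.cov(Z, rowvar=False)
off_diagonal = cov_after - np.diag(np.diag(cov_after))

print("Correlation of features 0 and 1 before PCA:", corr_before[0, 1].round(3))
print("Largest off-diagonal covariance after PCA:", np.abs(off_diagonal).max())
```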

#  **3. PCA reduces noise → KNN makes better neighbor choices**

KNN has no internal mechanism for ignoring noise, since it relies entirely on raw distances.

PCA removes components with very small variance (often noise), helping KNN:

* Avoid misleading neighbors
* Improve generalization
* Reduce overfitting


#  **4. PCA improves KNN speed**

KNN is slow at prediction time because it must compute distances to all points.

Reducing dimensions via PCA:

* Decreases computational cost
* Makes KNN faster
* Reduces memory usage

This is especially important for large datasets.


#  **5. PCA enables better visualization before applying KNN**

Using PCA to project high-dimensional data into 2D or 3D helps you:

* Visualize class separation
* Detect clusters
* Identify whether KNN is appropriate

This supports better model design.


#  **6. PCA + KNN is a common ML pipeline**

A typical pipeline looks like:

1. **Scale the data** (important for both PCA and KNN)
2. **Apply PCA** to reduce to *k* principal components
3. **Use KNN** on the transformed feature space

This pipeline often provides:

* Higher accuracy
* Better generalization
* Faster prediction
* More stable distance metrics
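
The three steps can be chained with scikit-learn's `Pipeline`, which fits the scaler and PCA on the training data only and reuses them at prediction time. The step names, the 2 components, and `n_neighbors=5` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scale", StandardScaler()),                    # 1. scale the data
    ("pca", PCA(n_components=2)),                   # 2. reduce to k components
    ("knn", KNeighborsClassifier(n_neighbors=5)),   # 3. classify in the reduced space
])

pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```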





---

In [3]:
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

data = load_wine()        # loads dataset directly from sklearn
X = data.data             # features
y = data.target           # labels

# Split into training/testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)

accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)
print("Accuracy WITHOUT Scaling:", accuracy_no_scale)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)

accuracy_with_scale = accuracy_score(y_test, y_pred_scaled)
print("Accuracy WITH Scaling:", accuracy_with_scale)




Accuracy WITHOUT Scaling: 0.7222222222222222
Accuracy WITH Scaling: 0.9444444444444444


----


In [4]:
# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.


from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_wine()
X = data.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

print("Explained Variance Ratio of Each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")


Explained Variance Ratio of Each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


----


In [5]:
# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.


from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)

accuracy_original = accuracy_score(y_test, y_pred_original)
print("Accuracy on ORIGINAL scaled dataset:", accuracy_original)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

accuracy_pca = accuracy_score(y_test, y_pred_pca)
print("Accuracy on PCA (2 components) dataset:", accuracy_pca)



Accuracy on ORIGINAL scaled dataset: 0.9444444444444444
Accuracy on PCA (2 components) dataset: 0.9444444444444444


---

In [6]:
# Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import pandas as pd

data = load_wine()
X = data.data
y = data.target
target_names = data.target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

def evaluate_knn(metric_name, **knn_kwargs):
    knn = KNeighborsClassifier(**knn_kwargs)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n--- KNN (metric = {metric_name}) ---")
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=target_names, digits=4))
    print("Confusion Matrix:")
    print(pd.DataFrame(confusion_matrix(y_test, y_pred),
                       index=[f"true_{t}" for t in target_names],
                       columns=[f"pred_{t}" for t in target_names]))
    return acc

acc_euclidean = evaluate_knn("euclidean", n_neighbors=5, metric="minkowski", p=2)

acc_manhattan = evaluate_knn("manhattan", n_neighbors=5, metric="manhattan")

print("\nSummary of accuracies:")
print(f"Euclidean (p=2) : {acc_euclidean:.4f}")
print(f"Manhattan (L1)   : {acc_manhattan:.4f}")

ks = [1,3,5,7,9,11]
results = []
for k in ks:
    knn_e = KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2).fit(X_train_scaled, y_train)
    knn_m = KNeighborsClassifier(n_neighbors=k, metric="manhattan").fit(X_train_scaled, y_train)
    acc_e = accuracy_score(y_test, knn_e.predict(X_test_scaled))
    acc_m = accuracy_score(y_test, knn_m.predict(X_test_scaled))
    results.append((k, acc_e, acc_m))

print("\nAccuracy by k:")
print(pd.DataFrame(results, columns=["k", "euclidean_acc", "manhattan_acc"]).set_index("k"))




--- KNN (metric = euclidean) ---
Accuracy: 0.9444
Classification Report:
              precision    recall  f1-score   support

     class_0     1.0000    1.0000    1.0000        18
     class_1     1.0000    0.8571    0.9231        21
     class_2     0.8333    1.0000    0.9091        15

    accuracy                         0.9444        54
   macro avg     0.9444    0.9524    0.9441        54
weighted avg     0.9537    0.9444    0.9448        54

Confusion Matrix:
              pred_class_0  pred_class_1  pred_class_2
true_class_0            18             0             0
true_class_1             0            18             3
true_class_2             0             0            15

--- KNN (metric = manhattan) ---
Accuracy: 0.9815
Classification Report:
              precision    recall  f1-score   support

     class_0     1.0000    1.0000    1.0000        18
     class_1     1.0000    0.9524    0.9756        21
     class_2     0.9375    1.0000    0.9677        15

    accuracy   

---

Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:

*   Use PCA to reduce dimensionality
*   Decide how many components to keep
*  Use KNN for classification post-dimensionality reduction
*  Evaluate the model
* Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data



You are working with a **high-dimensional gene expression dataset**, where the number of features (genes) is extremely large, and the number of samples (patients) is small. This causes **overfitting** and poor generalization in traditional machine-learning models.

Below is how to design a **PCA → KNN classification pipeline** and justify it as a robust real-world biomedical approach.


# **1. Using PCA to Reduce Dimensionality**

Gene expression datasets often contain **thousands of genes**. Many of these:

* Are correlated
* Contain noise
* Provide redundant information

To reduce dimensionality:

1. **Standardize the data** (PCA is sensitive to feature scale)
2. Apply PCA to transform the original features into a smaller number of **principal components**
3. These components capture the majority of biological signal while reducing noise

PCA helps combat:

* Overfitting
* Noise sensitivity
* The curse of dimensionality


# **2. Deciding How Many Components to Keep**

We select the number of principal components using:

### **Explained Variance Ratio**

Choose enough components to capture **90–95% of total variance**, ensuring minimal information loss.

### **Scree / Elbow Plot**

Pick the point where adding more components yields diminishing returns.

###  **Cross-validation**

Evaluate classification accuracy for different numbers of components and choose the number that performs best.

This ensures that PCA keeps only the **most meaningful biological patterns**.
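
The explained-variance rule reduces to a cumulative sum. As a small worked example, the sketch below reuses the ratios printed for the Wine dataset in Question 7 as a stand-in for gene expression data. `components_for` is a hypothetical helper.

```python
from itertools import accumulate

# Explained variance ratios as printed for the Wine dataset in Question 7.
ratios = [0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
          0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080]

def components_for(threshold, ratios):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    return len(ratios)

print(components_for(0.90, ratios))  # 8
print(components_for(0.95, ratios))  # 10
```

In scikit-learn, passing a float such as `PCA(n_components=0.95)` applies the same rule automatically.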


# **3. Using KNN for Classification After PCA**

After PCA transforms the dataset:

* The data becomes **lower-dimensional, noise-reduced, and uncorrelated**.
* KNN becomes more effective because distances are now meaningful.
* Overfitting risk decreases because the model works with far fewer, less noisy dimensions.

Steps:

1. Train a **KNN classifier** on the PCA-transformed features.
2. Tune **k** using cross-validation.
3. Predict cancer type for new patients.

KNN is simple, interpretable, and benefits greatly from PCA.


# **4. Model Evaluation**

Use:

### **Stratified train–test split or cross-validation**

Ensures all cancer types are represented.

###  **Metrics**

* Accuracy
* Precision, recall, F1-score
* Confusion matrix
* ROC-AUC (one-vs-rest for multiclass)

### **Repeated cross-validation**

Because sample sizes are small, repeating K-fold CV yields more stable estimates.
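
A repeated stratified cross-validation run might look like the sketch below. The data is synthetic (`make_classification` with few samples and many features as a rough stand-in for gene expression data), and the number of components, `n_neighbors`, and the fold settings are assumptions, not tuned values. Keeping scaling and PCA inside the pipeline means they are re-fit on each training fold, so no test-fold information leaks into preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
X, y = make_classification(
    n_samples=90, n_features=500, n_informative=15, n_redundant=50,
    n_classes=3, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# 5 folds repeated 3 times gives 15 scores in total.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1_macro")

print(f"Macro F1: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```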


# **5. Justification to Stakeholders (Real-World Biomedical Context)**

This PCA → KNN pipeline is a strong choice because:

###  **Addresses overfitting in high-dimensional biomedical data**

PCA extracts the core biological signal and removes noise.

###  **Produces stable, reproducible results**

Projecting onto a few high-variance components reduces the model's variance, which helps it avoid learning patient-specific noise.

###  **Improves interpretability**

The loadings of each principal component can be mapped back to gene groups, allowing domain experts to study which pathways drive classification.

###  **Computationally efficient**

KNN on reduced PCA features is much faster and more scalable.

###  **Widely used in genomics**

Dimensionality reduction + distance-based methods are standard in cancer subtype analysis, clustering, and biomarker discovery.

###  **Transparent & trusted**

Unlike black-box deep learning, PCA + KNN is interpretable and easy to validate in clinical settings.







In [7]:
# Question 10: PCA + KNN Pipeline for High-Dimensional Data
# The Wine dataset is used here as a small stand-in for gene expression data.


from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

data = load_wine()
X = data.data
y = data.target

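# Caveat: the scaler and PCA are fit on the full dataset before the split below,
# so test-set statistics leak into preprocessing. For a stricter estimate, fit
# both on the training split only (for example inside a sklearn Pipeline).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# A float n_components keeps the fewest components that explain about 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)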

print("Number of PCA components retained:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)


y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy after PCA + KNN:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Number of PCA components retained: 10
Explained variance ratio: [0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019]

Accuracy after PCA + KNN: 0.9629629629629629

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      0.90      0.95        21
           2       0.88      1.00      0.94        15

    accuracy                           0.96        54
   macro avg       0.96      0.97      0.96        54
weighted avg       0.97      0.96      0.96        54

