# KNN & PCA | Assignment

# Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**Definition:**  
K-Nearest Neighbors (KNN) is a **non-parametric, instance-based learning algorithm** used for both classification and regression.  
It predicts the output of a new data point based on the **majority label (classification)** or **average value (regression)** of its `K` nearest neighbors in the training dataset.

---

**How it works:**

### 1. Classification:
1. Choose a value of `K` (number of neighbors to consider).  
2. Calculate the **distance** between the new data point and all training points (commonly Euclidean distance).  
3. Identify the `K` nearest neighbors.  
4. Assign the **most frequent class** among the neighbors to the new data point.  

**Example:**  
- New point → Find 5 nearest neighbors → Class counts: {Class A: 3, Class B: 2} → Predict **Class A**.

### 2. Regression:
1. Same as classification: find `K` nearest neighbors using distance.  
2. Take the **average (or weighted average) of the neighbors’ target values**.  
3. Assign this as the predicted value for the new point.

**Example:**  
- New point → Find 3 nearest neighbors → Target values: [5.2, 4.8, 6.0] → Predict **(5.2 + 4.8 + 6.0)/3 = 5.33**.
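The two procedures above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k):
    """Predict the most frequent label among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Predict the mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k


# Toy classification data (hypothetical): three "A" points, three "B" points.
X_cls = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (5.0, 5.0)]
y_cls = ["A", "A", "A", "B", "B", "B"]
print("Predicted class:", knn_classify(X_cls, y_cls, query=(1.5, 1.5), k=5))

# Toy regression data (hypothetical): the three nearest targets are 5.2, 4.8, 6.0.
X_reg = [(1.0,), (2.0,), (3.0,), (10.0,)]
y_reg = [5.2, 4.8, 6.0, 20.0]
print("Predicted value:", round(knn_regress(X_reg, y_reg, query=(2.0,), k=3), 2))
```

With `k=5`, the classification query sees three "A" neighbors and two "B" neighbors, mirroring the vote in the example above.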

---

**Key Points:**
- **Non-parametric:** No assumption about data distribution.  
- **Lazy learner:** No explicit model fitting; training only stores the data (optionally in an index such as a KD-tree), and the distance computations happen at prediction time.  
- **Distance metric matters:** Common choices – Euclidean, Manhattan, Minkowski.  
- **Choice of K:** Small K → sensitive to noise; Large K → smoother predictions but may miss local patterns.
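To make the distance metrics listed above concrete, here is a small sketch that evaluates them on one pair of points. Manhattan and Euclidean distance are the Minkowski distance with `p=1` and `p=2`; the function name `minkowski` and the sample points are chosen only for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)
cubic = minkowski(a, b, p=3)      # (3^3 + 4^3)^(1/3)

print(f"Manhattan (p=1): {manhattan:.3f}")
print(f"Euclidean (p=2): {euclidean:.3f}")
print(f"Minkowski (p=3): {cubic:.3f}")
```

For the same pair of points the distance shrinks as `p` grows, so the choice of metric can change which training points count as "nearest".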

---
# Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

**Definition:**  
The **Curse of Dimensionality** refers to the set of problems that arise when working with **high-dimensional data** (many features).  
As the number of dimensions increases, the **volume of the feature space grows exponentially**, making the data points **sparser**.  

---

**Effect on K-Nearest Neighbors (KNN):**

1. **Distance Metrics Become Less Meaningful:**
   - KNN relies on distance (e.g., Euclidean) to find neighbors.
   - In high dimensions, **all points tend to become almost equidistant**.
   - This makes it difficult to identify truly "nearest" neighbors.

2. **Increased Computational Cost:**
   - More dimensions → more calculations for distance → slower prediction.

3. **Overfitting Risk:**
   - Sparse high-dimensional data may cause KNN to **fit noise rather than patterns**.
   - Small K values become unstable; predictions become less reliable.

4. **Reduced Predictive Power:**
   - High dimensionality can **dilute the effect of relevant features**.
   - KNN performance often drops unless dimensionality reduction is applied.
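The "almost equidistant" effect can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast `(max - min) / min` of the distances from one query point; the function name, sample size, and dimensions are illustrative choices.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min over distances from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


contrasts = {d: relative_contrast(n_points=200, n_dims=d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.3f}")
```

A contrast close to zero means the farthest point is barely farther away than the nearest one, which is the situation in which "nearest neighbor" stops being informative.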

---

**Mitigation Strategies:**
- **Feature Selection:** Keep only the most relevant features.  
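- **Dimensionality Reduction:** Use PCA or a similar technique. t-SNE is mainly a visualization tool and cannot project unseen points in its standard form, so it is rarely suitable as a preprocessing step for KNN.  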
- **Increase Training Data:** More samples can help reduce sparsity effects.  
- **Distance Weighting:** Weight closer neighbors more heavily to reduce noise impact.
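As one illustration of the last strategy, scikit-learn's `KNeighborsClassifier` accepts `weights="distance"`, which weights each neighbor's vote by the inverse of its distance. The sketch below compares it with the default uniform vote on the Wine dataset; the dataset, `n_neighbors=5`, and 5-fold cross-validation are assumptions made for the example, and the difference between the two settings is not guaranteed to be large.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

results = {}
for weights in ("uniform", "distance"):
    # Scaling sits inside the pipeline so each fold is scaled on its own training split.
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, weights=weights),
    )
    results[weights] = cross_val_score(model, X, y, cv=5).mean()

for weights, acc in results.items():
    print(f"weights={weights!r}: mean CV accuracy = {acc:.3f}")
```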

---
# Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Definition of PCA:**  
Principal Component Analysis (PCA) is an **unsupervised dimensionality reduction technique** that transforms the original correlated features into a **new set of uncorrelated features** called **principal components (PCs)**.  
- Each PC is a linear combination of the original features.  
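- PCs are ordered by variance: the first PC captures the most variance, and each later PC captures the most remaining variance while staying orthogonal to the earlier ones.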

**How PCA works:**
1. Standardize the dataset (mean=0, variance=1).  
2. Compute the **covariance matrix** of the features.  
3. Calculate **eigenvectors and eigenvalues** of the covariance matrix.  
4. Sort eigenvectors by decreasing eigenvalues → top eigenvectors become principal components.  
5. Project the original data onto these principal components to reduce dimensionality.
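The five steps can be followed literally with NumPy. This is a sketch on synthetic data (three features, two of them strongly correlated), not how a library such as scikit-learn implements PCA internally; the data and variable names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): 200 samples, 3 features, the first two correlated.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Standardize each feature to mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features (columns).
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; eigh is intended for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenpairs by decreasing eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Project onto the top-k principal components.
k = 2
X_reduced = X_std @ eigenvectors[:, :k]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 4))
print("Reduced shape:", X_reduced.shape)
```

Because the first two features carry nearly the same signal, most of their joint variance ends up in a single component.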

---

**Difference from Feature Selection:**

| Aspect               | PCA                                      | Feature Selection                     |
|----------------------|-----------------------------------------|--------------------------------------|
| Type                 | Feature **extraction**                  | Feature **selection**                 |
| Output               | New features (principal components)     | Subset of original features           |
| Goal                 | Reduce dimensionality while preserving **variance** | Keep most **informative/relevant features** |
| Method               | Linear combination of all features      | Evaluate features using metrics (e.g., correlation, mutual information, model importance) |
| Interpretability     | Less interpretable (PCs are combinations)| More interpretable (original features)|

**Key Point:**  
- PCA transforms data into a new space (features are combinations).  
- Feature selection chooses a **subset of existing features** without changing them.
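The contrast can be seen by reducing the Wine data to two columns in both ways. The sketch below uses scikit-learn's `SelectKBest` with the ANOVA F-score as one possible selection method; that scoring choice and `k=2` are assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 2 of the original 13 columns, unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)
print("Selected original features:", [data.feature_names[i] for i in kept])

# Feature extraction: build 2 new columns, each a mix of all 13 features.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("PC1 loadings over all features:", pca.components_[0].round(2))
```

Both outputs have two columns, but the selected columns are still named features such as those in `data.feature_names`, while each principal component has a loading on every original feature.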

---
# Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?


**Definition:**

- **Eigenvectors:** Directions in the feature space, obtained from the covariance matrix of the data.  
  - In PCA, each eigenvector represents a **principal component**; the leading ones point along the directions of greatest variance.  
  - Together they define the axes of the new coordinate system.

- **Eigenvalues:** Scalars paired with the eigenvectors; each eigenvalue equals **the variance of the data** along its principal component.  
  - Larger eigenvalues → more variance explained along that eigenvector.  

---

**Why are they important in PCA?**

1. **Determine Principal Components:**
   - Eigenvectors define the new coordinate system (principal components) for the data.
   
2. **Measure Importance of Components:**
   - Eigenvalues show **how much information (variance)** each component carries.
   - Components with small eigenvalues can often be discarded to reduce dimensionality without losing much information.

3. **Dimensionality Reduction:**
   - By selecting the top-k eigenvectors with largest eigenvalues, PCA projects data onto a lower-dimensional space while **retaining most of the variance**.

---

**Example Conceptually:**
- Original 3D data → PCA finds eigenvectors: `[v1, v2, v3]`  
- Eigenvalues: `[5.0, 2.0, 0.1]`  
- Project data onto top 2 eigenvectors (`v1` and `v2`) → reduces to 2D while keeping (5.0 + 2.0) / 7.1 ≈ 98.6% of the variance.
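The retained-variance figure for this example follows directly from the eigenvalues. A short check in plain Python, using the eigenvalues from the example:

```python
eigenvalues = [5.0, 2.0, 0.1]
total = sum(eigenvalues)

# Share of total variance carried by each component.
ratios = [ev / total for ev in eigenvalues]

# Running total: variance retained when keeping the first k components.
cumulative = []
running = 0.0
for r in ratios:
    running += r
    cumulative.append(running)

for k, c in enumerate(cumulative, start=1):
    print(f"top {k} component(s): {c:.1%} of variance")
```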

---
# Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

**Combining KNN and PCA:**

1. **Challenge with KNN in High Dimensions:**
   - KNN relies on distance metrics (e.g., Euclidean distance).  
   - High-dimensional data suffers from the **Curse of Dimensionality** → distances become less meaningful, KNN performance drops.

2. **Role of PCA:**
   - PCA reduces dimensionality by projecting data onto **principal components** that retain most of the variance.  
   - Dropping the low-variance components discards directions that often carry mostly noise and **reduces sparsity** in high-dimensional space.

3. **Pipeline Benefits:**
   - **Step 1:** Apply PCA → compress features to lower-dimensional space.  
   - **Step 2:** Apply KNN → distances now computed on fewer, more informative dimensions.  
   - **Advantages:**  
     - Faster computation for KNN.  
     - More robust predictions due to reduced noise.  
     - Mitigates the Curse of Dimensionality.  

---

**Example Conceptually:**
- Original dataset: 50 features → PCA reduces to 10 principal components  
- KNN is applied on these 10 components instead of 50 features → faster distance computations and often comparable or better classification/regression performance.

**Summary:**  
- PCA acts as a **preprocessing step** that improves KNN efficiency and can improve its accuracy, especially on high-dimensional datasets.
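One way to express this two-step idea is a scikit-learn `Pipeline`, sketched below on the Wine dataset. The number of components (5), `n_neighbors=5`, and 5-fold cross-validation are illustrative choices, not values prescribed by the text above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> compress with PCA -> classify with KNN, as a single estimator.
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),  # illustrative choice
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pca_knn, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}")
```

Wrapping the steps in one pipeline means the scaler and PCA are refit on each training fold, so the held-out fold never influences the transformation.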

---
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.


In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_wine()
X, y = data.data, data.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ------------------------------
# 3. KNN without feature scaling
# ------------------------------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ------------------------------
# 4. KNN with feature scaling
# ------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# ------------------------------
# 5. Compare accuracies
# ------------------------------
print("KNN Accuracy without scaling:", acc_no_scaling)
print("KNN Accuracy with scaling:", acc_scaled)


KNN Accuracy without scaling: 0.7222222222222222
KNN Accuracy with scaling: 0.9444444444444444


# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.


In [2]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Load dataset
data = load_wine()
X = data.data

# 2. Standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Train PCA model (all 13 components are kept by default)
pca = PCA()
pca.fit(X_scaled)

# 4. Print explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
for i, ratio in enumerate(explained_variance_ratio, start=1):
    print(f"PC{i}: {ratio:.4f}")

# Optional: cumulative variance
cumulative_variance = np.cumsum(explained_variance_ratio)
print("\nCumulative Variance Explained:", cumulative_variance)


PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080

Cumulative Variance Explained: [0.36198848 0.55406338 0.66529969 0.73598999 0.80162293 0.85098116
 0.89336795 0.92017544 0.94239698 0.96169717 0.97906553 0.99204785
 1.        ]


# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.


In [3]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_wine()
X, y = data.data, data.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 3. Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. KNN on original dataset
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

# 5. PCA transformation (retain top 2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# 6. KNN on PCA-transformed dataset
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# 7. Compare accuracies
print("KNN Accuracy on Original Dataset:", acc_original)
print("KNN Accuracy on PCA-Transformed Dataset (2 PCs):", acc_pca)


KNN Accuracy on Original Dataset: 0.9444444444444444
KNN Accuracy on PCA-Transformed Dataset (2 PCs): 0.9444444444444444


# Question 9: Train a KNN Classifier with different distance metrics (Euclidean, Manhattan) on the scaled Wine dataset and compare the results.

In [4]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_wine()
X, y = data.data, data.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 3. Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ------------------------------
# 4. KNN with Euclidean distance
# ------------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ------------------------------
# 5. KNN with Manhattan distance
# ------------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# ------------------------------
# 6. Compare results
# ------------------------------
print("KNN Accuracy with Euclidean distance:", acc_euclidean)
print("KNN Accuracy with Manhattan distance:", acc_manhattan)


KNN Accuracy with Euclidean distance: 0.9444444444444444
KNN Accuracy with Manhattan distance: 0.9814814814814815


# Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:
- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

**Scenario:**  
- Dataset: High-dimensional gene expression data  
- Problem: Large number of features (thousands of genes) but small number of patient samples → risk of **overfitting**.  
- Goal: Classify patients into cancer types using KNN.

---

## Step-by-Step Approach:

### 1. Use PCA to Reduce Dimensionality
- Apply **Principal Component Analysis (PCA)** to transform the original high-dimensional features into a **smaller set of uncorrelated principal components**.  
- Benefits:  
  - Reduces noise and redundancy in features.  
  - Helps KNN perform better by mitigating the Curse of Dimensionality.

### 2. Decide How Many Components to Keep
- Use the **explained variance ratio** from PCA.  
- Keep the minimum number of components that **capture ~90-95% of the total variance**.  
- This balances dimensionality reduction with retaining essential information.
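The variance-threshold rule can be applied in two equivalent ways, sketched below. The Wine dataset stands in for the gene expression data (which is not available here), and the 95% threshold is one value from the range mentioned above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: Wine instead of a real gene expression matrix.
X_scaled = StandardScaler().fit_transform(load_wine().data)

threshold = 0.95

# Option A: fit all components, then read the cumulative variance curve.
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= threshold) + 1)
print(f"Components needed for {threshold:.0%} variance: {n_keep}")

# Option B: pass the threshold as a float and let PCA pick the count.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)
print("Chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```

In a real project the PCA would be fitted on the training split only, so the chosen number of components does not depend on held-out samples.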

### 3. Use KNN for Classification Post-Dimensionality Reduction
- Standardize the data before PCA, fitting the scaler and PCA on the training data only and applying the fitted transforms to the test data.  
- Fit KNN on the **PCA-transformed training set**.  
- Use a distance metric (Euclidean or Manhattan) to classify patients based on nearest neighbors in the reduced feature space.

### 4. Evaluate the Model
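- Split the dataset into **train and test sets** or, preferably with so few samples, use **stratified cross-validation**, since a single small test set gives a noisy estimate.  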
- Evaluate using metrics appropriate for multi-class classification:  
  - **Accuracy**  
  - **F1-score (macro or weighted)**  
  - **Confusion matrix** to see misclassification patterns.  
  - Optional: ROC-AUC for one-vs-rest classification.
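A possible evaluation loop is sketched below. Because the real gene expression data is not available here, `make_classification` generates a synthetic stand-in with few samples and many features; the sample count, feature count, `n_components=10`, and `n_neighbors=5` are all assumptions made for the illustration, and the resulting scores say nothing about real biomedical data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression: few samples, many features.
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=30,
    n_classes=3, random_state=0,
)

# Scaling and PCA live inside the pipeline, so each fold fits them
# on its own training split only (no leakage from held-out samples).
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
macro_f1 = f1_score(y, y_pred, average="macro")
print("Confusion matrix:\n", cm)
print(f"Macro F1: {macro_f1:.3f}")
```

Each sample is predicted exactly once by a model that never saw it during fitting, so the confusion matrix covers the whole dataset.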

### 5. Justify this Pipeline to Stakeholders
- **Dimensionality Reduction:** PCA reduces thousands of genes to a manageable number of components → lowers the risk of overfitting.  
- **Robust Classification:** KNN is simple, interpretable, and works well on the transformed low-dimensional space.  
- **Reproducibility:** Pipeline is systematic and can be applied to new patient samples.  
- **Biomedical Relevance:** Captures major variance patterns in gene expression while filtering noise, improving generalization.  

**Conclusion:**  
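- Using PCA + KNN forms a **robust, easy-to-explain pipeline** for high-dimensional biomedical data, limiting overfitting while retaining predictive power. The principal components themselves are combinations of genes, so they are less directly interpretable than individual features.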