Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

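 - K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based learning algorithm. It builds no explicit model: it stores the training data and defers all computation to prediction time.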

**How it works**

1. Choose a value of K (number of neighbors).

2. Compute the distance between the test point and all training points.

3. Select the K nearest neighbors.

4. Make a prediction from the targets (labels or values) of those K neighbors.

**Classification**

Output = majority class among K neighbors.

Example: If 3 out of 5 neighbors belong to Class A → predict Class A.

**Regression**

Output = average of the neighbors' target values (optionally weighted, for example by inverse distance).
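
The steps above can be sketched from scratch in a few lines. This is a minimal illustration, not a production implementation: the function name `knn_predict`, the toy arrays, and the choice of Euclidean distance with an unweighted vote or mean are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for a single query point from its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbor targets
    return float(np.mean(neighbor_targets))


# Toy data (made up for illustration): two well-separated groups on a line
X_toy = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])
y_class = np.array(["A", "A", "A", "B", "B", "B"])
y_value = np.array([10.0, 11.0, 12.0, 50.0, 52.0, 54.0])

label = knn_predict(X_toy, y_class, np.array([1.8]), k=3)
value = knn_predict(X_toy, y_value, np.array([1.8]), k=3, task="regression")
print(label, value)
```

The same neighbor search serves both tasks; only the final aggregation step (vote versus mean) changes.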

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

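 - The Curse of Dimensionality refers to problems that arise as the number of features increases: the volume of the feature space grows exponentially, so a fixed number of samples becomes increasingly sparse.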

**Effect on KNN**

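Distances between points become less meaningful

The nearest and farthest neighbors become almost equally distant

Accuracy tends to drop, because the "nearest" neighbors are no longer clearly more similar than other points

Computation becomes more expensive, since every distance is computed over more features

KNN often performs poorly in high-dimensional spaces, so dimensionality reduction or feature selection is usually worthwhile before applying it.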
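
The claim that neighbors become almost equally distant can be checked numerically. The sketch below is an illustration under assumed settings (uniform random points in the unit cube, 500 points, a fixed seed); `relative_contrast` is a name introduced here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

A large contrast means the nearest neighbor is much closer than the farthest point; as the number of dimensions grows, the contrast shrinks toward zero and the notion of "nearest" loses its discriminating power.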

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

 - Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique.

**What PCA does**

Transforms original features into new orthogonal features (principal components)

Captures maximum variance with fewer dimensions

**PCA vs. feature selection**

| Feature Selection                   | PCA                           |
| ----------------------------------- | ----------------------------- |
| Selects subset of original features | Creates new features          |
| Keeps interpretability              | Loses direct interpretability |
| Supervised or unsupervised          | Unsupervised                  |
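
The contrast in the table can be made concrete on the Wine data. This is a sketch under assumptions: `SelectKBest` with `f_classif` stands in for feature selection in general (many other selectors exist), and keeping 2 features or components is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature selection: keep 2 of the original 13 columns (this selector uses the labels)
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# PCA: build 2 new columns, each a weighted mix of all 13 features (labels are ignored)
pca = PCA(n_components=2).fit(X_scaled)
print("PC1 weights over all 13 features:", pca.components_[0].round(2))
```

The selected features keep their original names and meaning, while each principal component is a linear combination of every input feature.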


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

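 - **Eigenvectors** → Eigenvectors of the data's covariance matrix. They give the directions of the principal components, with the first pointing along the direction of maximum variance

 - **Eigenvalues** → Amount of variance captured along each eigenvector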

**Why important**

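Eigenvectors define the axes of the new feature space

Eigenvalues help decide how many components to keep: each eigenvalue divided by the sum of all eigenvalues is that component's explained variance ratio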
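
The link between eigen-decomposition and PCA can be verified directly. The sketch below assumes the Wine data as a convenient example and standardized inputs; it computes the covariance eigenpairs with NumPy and compares them with what scikit-learn's `PCA` reports.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance carried by each direction
variance_share = eigenvalues / eigenvalues.sum()

# Cross-check against scikit-learn's PCA
pca = PCA().fit(X_std)
print(np.allclose(eigenvalues, pca.explained_variance_))
print(np.allclose(variance_share, pca.explained_variance_ratio_))
```

Eigenvectors are only defined up to sign, so a column of `eigenvectors` may be the negative of the matching row in `pca.components_`.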

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?
Dataset: Use the Wine Dataset from sklearn.datasets.load_wine().

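 - PCA reduces dimensionality, which mitigates (but does not eliminate) the curse of dimensionality

 - On the reduced data, KNN is typically:

   - Faster, since each distance is computed over fewer features

   - Often as accurate or more accurate, though dropping components can also discard useful signal

   - Less affected by noise carried in low-variance directions

Because KNN relies entirely on distances, removing noisy and redundant directions with PCA can make those distances more informative. Features should be standardized before PCA so that large-scale features do not dominate the components.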
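
One way to combine the two steps is a scikit-learn pipeline. This is a sketch: `n_components=2`, `n_neighbors=5`, and 5-fold cross-validation are illustrative assumptions, not tuned or prescribed values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> project onto 2 principal components -> classify with 5 neighbors
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# The scaler and PCA are refit on the training part of each fold,
# so no information from the held-out part leaks into preprocessing
scores = cross_val_score(pca_knn, X, y, cv=5)
print(f"Mean 5-fold accuracy: {scores.mean():.3f}")
```

Wrapping the steps in one estimator keeps preprocessing and classification consistent between training and prediction.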

Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

| Model                   | Accuracy   |
| ----------------------- | ---------- |
| KNN without scaling     | **72.22%** |
| KNN with StandardScaler | **94.44%** |

**Conclusion**

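KNN is highly sensitive to feature scale because distances are dominated by the features with the largest numeric ranges. In the Wine data, `proline` ranges from the hundreds to over a thousand, while several other features stay below 10. Standardizing puts all features on a comparable footing and improves accuracy substantially here.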
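
A comparison of this kind could be run as follows. The split (70/30, stratified, `random_state=42`) and `n_neighbors=5` are assumptions, since the settings behind the table are not stated; the numbers printed may therefore differ from those reported above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
# Assumed split: 70/30, stratified by class, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# KNN on the raw features
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc_raw = knn_raw.score(X_test, y_test)

# KNN on standardized features; the scaler is fit on the training split only
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
acc_scaled = knn_scaled.score(X_test, y_test)

print(f"Without scaling: {acc_raw:.4f}")
print(f"With scaling:    {acc_scaled:.4f}")
```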

Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

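 - The first few components capture most of the variance

 - Example insight (on the standardized data):

   - PC1 captures the most variance, roughly 36%

   - PC1 and PC2 together capture a little over half of the total variance, roughly 55%

This suggests the Wine features are correlated enough for PCA to compress them, although two components alone do not retain all of the information.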
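
The explained variance ratios can be printed with a few lines. The sketch assumes the features are standardized first and that PCA is fit on the full dataset rather than a training split; both are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep all 13 components so every ratio is reported
pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.4f} (cumulative {c:.4f})")
```

The cumulative column is what is usually read off when choosing how many components to retain.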

Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

| Dataset            | Accuracy   |
| ------------------ | ---------- |
| Original (scaled)  | **94.44%** |
| PCA (2 components) | **94.44%** |

**Observation**

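Same accuracy with 2 features instead of 13

Faster distance computation. The matching accuracy suggests the first two components retain most of the class-separating structure, although a single split cannot show that generalization is better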
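
The comparison can be reproduced along these lines. The split and `n_neighbors=5` are assumptions, as the original settings are not given, so the printed accuracies need not match the table.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: all 13 standardized features
full = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# Reduced: standardize, then keep only the top 2 principal components
reduced = make_pipeline(
    StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5)
)

acc_full = full.fit(X_train, y_train).score(X_test, y_test)
acc_reduced = reduced.fit(X_train, y_train).score(X_test, y_test)
n_features_after_pca = reduced.named_steps["pca"].n_components_

print(f"Scaled, 13 features: {acc_full:.4f}")
print(f"PCA, {n_features_after_pca} components:  {acc_reduced:.4f}")
```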

Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

| Distance Metric | Accuracy   |
| --------------- | ---------- |
| Euclidean       | **94.44%** |
| Manhattan       | **98.15%** |

**Conclusion**

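Manhattan distance scores higher on this split. The Wine dataset has only 178 samples, however, so a gap of a few percentage points amounts to a handful of test points

Distance metric choice can noticeably affect KNN, so metrics are best compared with cross-validation before drawing firm conclusions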
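
A metric comparison might look like the sketch below. The split, `n_neighbors=5`, and the added 5-fold cross-validation are assumptions for illustration; the cross-validated mean is included because a single small test split can exaggerate differences.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

test_accuracy = {}
cv_accuracy = {}
for metric in ("euclidean", "manhattan"):
    model = make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5, metric=metric)
    )
    # Accuracy on one held-out split
    test_accuracy[metric] = model.fit(X_train, y_train).score(X_test, y_test)
    # Mean accuracy over 5 stratified folds, a steadier basis for comparison
    cv_accuracy[metric] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{metric:>9}: test={test_accuracy[metric]:.4f}, cv={cv_accuracy[metric]:.4f}")
```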

Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.

Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data


 - Step-by-step solution

**1. Use PCA**

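Standardize the gene expression features, then reduce thousands of genes to a much smaller set of components

Remove noise and redundancy from correlated genes

Fit the scaler and PCA on training data only, so no information from held-out patients leaks into preprocessing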

**2. Decide number of components**

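Use the explained variance ratio

Keep enough components to explain roughly 90–95% of the variance (a common rule of thumb, not a guarantee of good classification)

Inspect a scree plot or cumulative variance curve; with few samples, the number of components is capped by the number of samples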

**3. Apply KNN**

Train KNN on PCA-reduced data

Choose optimal K via cross-validation

**4. Evaluate the model**

Accuracy, Precision, Recall, F1-score

Confusion Matrix

Stratified cross-validation for robustness, with scaling and PCA refit inside each fold to avoid leakage

**5. Justification to stakeholders**

Prevents overfitting

Handles small-sample, high-feature biomedical data

Improves generalization

Computationally efficient

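Interpretable with caveats: principal components are mixtures of genes, so component loadings must be inspected to relate predictions back to specific genes, and clinical reliability still requires validation on independent patient cohorts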
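
The five steps can be sketched end to end as below. Everything specific here is an assumption for illustration: the data is a synthetic stand-in generated with `make_classification` (not real gene expression data), and the sample count, grid values, and `f1_macro` scoring are hypothetical choices. The scores it prints say nothing about performance on real data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 120 "patients", 2000 "genes", 3 classes
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=30, n_redundant=50,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(svd_solver="full")),
    ("knn", KNeighborsClassifier()),
])

# A float n_components keeps enough components to reach that share of variance
param_grid = {
    "pca__n_components": [0.90, 0.95],
    "knn__n_neighbors": [3, 5, 7, 9],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
search.fit(X_train, y_train)  # scaler and PCA are refit inside every fold

# Final check on data never seen during tuning
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(search.best_params_)
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```

Tuning the variance threshold and K together inside one cross-validated search, then reporting metrics on a held-out test set, is what supports the overfitting argument made to stakeholders.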