**Question 1:** What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?


**Answer:** K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that makes predictions based on the “closeness” of data points to each other.
It’s simple, non-parametric (makes no assumptions about the data distribution), and works for both classification and regression.

**How it Works**

*The basic idea:*

Look at the K nearest points (neighbors) to the input sample and make a prediction based on them.

**Steps:**

1. Choose K – the number of nearest neighbors to consider.

2. Measure distance – usually Euclidean, but Manhattan, Minkowski, or others can be used.

3. Find the K nearest points from the training data.

4. Make prediction:

**Classification:** Majority vote among the neighbors (most common class wins).

**Regression:** Average (or weighted average) of the neighbors’ target values.
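The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how a library such as scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 2: Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))


# Toy data, invented for illustration only
X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]]
labels = ["apple", "apple", "apple", "orange", "orange"]
prices = [100.0, 120.0, 110.0, 300.0, 320.0]

label = knn_predict(X, labels, [1.1, 1.0], k=3)
price = knn_predict(X, prices, [1.1, 1.0], k=3, task="regression")
print(label, price)
```

Step 1 (choosing K) is the `k` argument; in practice it is usually tuned with cross-validation rather than fixed by hand.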

**KNN for Classification**

- Example: Predict if a fruit is an apple or orange.

- If K=5 and 3 of the 5 nearest neighbors are apples while 2 are oranges → Predict apple.

- Decision boundaries are often non-linear and adapt to data distribution.

**KNN for Regression**

- Example: Predict a house price based on nearby house prices.

- If K=4 and neighbor prices are: 100, 120, 110, 130 → Predicted price = (100+120+110+130)/4 = 115.

- Can use weighted KNN, where closer neighbors have more influence.
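To make the difference between a plain and a weighted average concrete, here is a small sketch that reuses the four neighbor prices from the example. The distances are hypothetical values chosen for illustration, and inverse-distance weighting is just one common weighting scheme.

```python
import numpy as np

prices = np.array([100.0, 120.0, 110.0, 130.0])  # neighbor targets from the example
distances = np.array([1.0, 2.0, 2.0, 4.0])       # hypothetical distances to the query

# Plain KNN regression: every neighbor counts equally
uniform_pred = prices.mean()

# Weighted KNN regression: weight each neighbor by 1 / distance
weights = 1.0 / distances
weighted_pred = np.sum(weights * prices) / np.sum(weights)

print(f"Uniform average:  {uniform_pred:.1f}")
print(f"Weighted average: {weighted_pred:.1f}")
```

Because the closest neighbor has the lowest price in this made-up setup, the weighted prediction is pulled below the plain average.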

**Question 2:** What is the Curse of Dimensionality and how does it affect KNN
performance?

**Answer:** The Curse of Dimensionality refers to the set of problems that occur when data has too many features (dimensions).
In high-dimensional spaces, distances and densities behave differently than our intuition from low dimensions (like 2D or 3D).

**Why it Happens**

- Data points become sparse — you need exponentially more samples to cover the space.

- Distances become less meaningful — nearest and farthest points start having almost the same distance from a query point.

- Volume grows fast — even small increases in dimensionality make the space huge.
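The "distances become less meaningful" point can be checked empirically. The sketch below draws uniform random points (an assumption made purely for illustration) and measures the relative contrast, (farthest − nearest) / nearest, between a query point and the sample as the number of dimensions grows. The helper name `relative_contrast` is made up for this example.

```python
import numpy as np


def relative_contrast(n_dims, n_points=1000, seed=0):
    """(max distance - min distance) / min distance from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()


contrasts = {n_dims: relative_contrast(n_dims) for n_dims in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print(f"{n_dims:>4} dims: relative contrast = {contrast:.2f}")
```

A contrast close to zero means the nearest and farthest points are almost equally far away, which is exactly the situation in which "nearest neighbor" stops carrying much information.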

**Impact on KNN**

Since KNN relies on distance to find nearest neighbors:

1. Distance loses discrimination power

- In high dimensions, all points tend to be nearly equidistant.

- Makes it hard for KNN to correctly identify “close” points.

2. More data needed

- To get a dense enough sample for meaningful neighbors, the amount of training data must grow exponentially with the number of features.

3. Overfitting risk

- With many irrelevant features, KNN can be misled because distance is computed over all dimensions.

**Mitigating the Curse for KNN**

- Feature selection: Remove irrelevant features.

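- Dimensionality reduction: Use PCA or a similar method that can project new samples. (t-SNE is mainly a visualization tool; in its standard form it cannot transform unseen points, so it is a poor fit as a preprocessing step for KNN.)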

- Distance metric choice: Sometimes cosine similarity works better than Euclidean in high dimensions.

- Normalize features: Prevents large-scale features from dominating distance.


**Question 3:** What is Principal Component Analysis (PCA)? How is it different from
feature selection?

**Answer:** Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms your original features into a new set of uncorrelated variables called principal components.

These components are ordered so that:

- First component captures the maximum variance in the data.

- Second component captures the next largest variance, orthogonal to the first.

- And so on for the remaining components.

**PCA vs Feature Selection**

| **PCA** (Feature Extraction)                                                          | **Feature Selection**                                        |
| ------------------------------------------------------------------------------------- | ------------------------------------------------------------ |
| Creates **new features** (principal components) as combinations of original features. | Keeps a subset of the **original features**.                 |
| Aims to capture maximum variance with fewer dimensions.                               | Aims to keep only the most relevant features for prediction. |
| Transforms data into a new space (orthogonal axes).                                   | No transformation — just drops less important features.      |
| Components may not have a clear interpretation.                                       | Retained features maintain original meaning.                 |
| Example: 100 features → 10 principal components.                                      | Example: 100 features → keep top 10 original ones.           |
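The contrast in the table can be seen directly in code. The sketch below uses the Wine dataset and scikit-learn's `SelectKBest` with the ANOVA F-score as one possible feature-selection method (an arbitrary choice for illustration); keeping 3 features or components is also arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keep 3 of the original columns (uses the labels y)
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
kept = selector.get_support(indices=True)

# Feature extraction: build 3 new columns as linear combinations of all 13 (ignores y)
pca = PCA(n_components=3).fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Selected original columns:", kept)
print("Selected columns unchanged:", np.allclose(X_selected, X_scaled[:, kept]))
print("PCA loadings shape (components x original features):", pca.components_.shape)
```

The selected columns are literal copies of original features, while every PCA component has a loading on all 13 original features.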


**Question 4:** What are eigenvalues and eigenvectors in PCA, and why are they
important?


**Answer:** In PCA, eigenvalues and eigenvectors come from the covariance matrix of the data, and they are the mathematical backbone of how PCA finds the best directions to represent the data.

**1. Eigenvectors in PCA**

- Each eigenvector points in a direction in the feature space.

- In PCA, eigenvectors are the principal component directions.

- They tell us where in the high-dimensional space the data varies the most.

**Example:**
If the first eigenvector is [0.7, 0.7] in 2D (approximately the unit vector [0.707, 0.707]), the first principal component lies along the diagonal and is influenced equally by both features.

**2. Eigenvalues in PCA**

- Each eigenvalue is a number associated with an eigenvector.

- It tells us how much variance the data has in that direction.

- Larger eigenvalue → more variance captured by that principal component.

**Example:**
If eigenvalue = 5 for PC1 and 2 for PC2, PC1 explains more variance than PC2.

**3. Why They’re Important in PCA**

**1. Ranking components**

- PCA sorts eigenvectors by their eigenvalues (largest to smallest).

- This lets us choose the top k components that capture most variance.

**2. Dimensionality reduction**

- By keeping only eigenvectors with large eigenvalues, we compress data while retaining most information.

**3. Variance explanation**

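- Each eigenvalue divided by the sum of all eigenvalues gives that component's explained variance ratio. For example, if eigenvalues 5 and 2 were the only two, PC1 would explain 5/7 ≈ 71% of the variance.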
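As a sanity check on the points above, the following sketch computes the eigen-decomposition of the covariance matrix by hand with NumPy and compares it with scikit-learn's `PCA` on the standardized Wine data. It is an illustration of the relationship, not a description of how scikit-learn computes PCA internally.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the features (columns)
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # re-sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the direction of PC(i+1)

# Explained variance ratio = eigenvalue / sum of eigenvalues
explained_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X)
print("Ratios match sklearn:", np.allclose(explained_ratio, pca.explained_variance_ratio_))
print("Eigenvectors are orthonormal:",
      np.allclose(eigenvectors.T @ eigenvectors, np.eye(X.shape[1])))
```

An eigenvector's sign is arbitrary, so a hand-computed direction may be the negative of the corresponding row in `pca.components_`; both describe the same axis.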

**Question 5:** How do KNN and PCA complement each other when applied in a single
pipeline?

**Answer:** KNN and PCA can work very well together in a machine learning pipeline, because each solves a problem the other has.

***Why They Complement Each Other***

- KNN’s weakness: Struggles in high-dimensional spaces due to the curse of dimensionality — distances become less meaningful.

- PCA’s strength: Reduces the number of dimensions while keeping most of the variance (important information).

**By applying PCA before KNN:**

- Fewer dimensions → Distance calculations in KNN become more reliable.

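- Less noise → PCA does not drop original features, but discarding low-variance components removes much of the redundancy between correlated features and often some noise.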

- Faster predictions → KNN has to compare fewer features per query point.



**Question 6:** Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

**Answer:**

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 1️⃣ Without Feature Scaling
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# 2️⃣ With Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = knn_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaling:.4f}")


Accuracy without scaling: 0.8056
Accuracy with scaling:    0.9722


**Question 7:** Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.


**Answer:**

In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
wine = load_wine()
X = wine.data

# Scale features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.4f}")


Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


**Question 8:** Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

**Answer:**

In [3]:
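from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Same split as in Question 6 so the accuracies are comparable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN on all 13 scaled original features
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
acc_original = accuracy_score(y_test, knn_original.predict(X_test_scaled))

# Keep the top 2 principal components (PCA fit on the training split only)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN on the 2 principal components
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Accuracy (original scaled features):   {acc_original:.4f}")
print(f"Accuracy (top 2 principal components): {acc_pca:.4f}")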


**Question 9:** Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.


**Answer:**

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

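# Scale features
# Note: fitting the scaler on all rows before splitting lets test-set statistics
# influence training; fit it on the training split only for an unbiased estimate.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)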

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# KNN with Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print(f"Accuracy (Euclidean): {acc_euclidean:.4f}")
print(f"Accuracy (Manhattan): {acc_manhattan:.4f}")


Accuracy (Euclidean): 0.9722
Accuracy (Manhattan): 1.0000


**Question 10:** You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

**Answer:**

**1) Use PCA to reduce dimensionality**

***Preprocessing (inside CV only):***

- Optional: log/variance-stabilizing transform if counts; filter obviously uninformative genes (e.g., near-zero variance) to reduce noise.

- Standardize features (z-score): gene expression scales differ; PCA and KNN both need it.

- PCA fit only on the training fold to avoid leakage, then transform train and test folds.

- Why PCA here? It produces orthogonal components that capture the dominant structure, reducing noise and collinearity typical of expression data.

**2) Decide how many components to keep**

**Use two complementary criteria:**

- Model-free variance view: keep adding components until the cumulative explained variance reaches a chosen threshold (e.g., somewhere in the 90–99% range).

- Model-based CV view: treat n_components as a hyperparameter and pick what maximizes cross-validated performance (balanced accuracy/ROC-AUC) while remaining compact.

- Also check the scree plot elbow and stop when adding components yields diminishing returns.

- Sanity check stability: ensure results are similar across multiple CV repeats.
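The variance-based criterion can be sketched as follows. The Wine data stands in for a gene-expression matrix, the 95% threshold is an arbitrary choice, and PCA is fit on the full dataset here only to keep the example short; in the actual workflow this would happen inside each training fold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)  # placeholder data

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # arbitrary example threshold
# Index of the first component at which the cumulative variance reaches the threshold
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%} variance: {n_keep}")

# scikit-learn can apply the same rule directly when n_components is a float in (0, 1)
auto_pca = PCA(n_components=threshold, svd_solver="full").fit(X)
print("Chosen by PCA(n_components=0.95):", auto_pca.n_components_)
```

Passing the float threshold as `pca__n_components` in a grid search is one way to combine the variance view with the model-based CV view.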

**3) KNN for classification after PCA**

- Use a Pipeline to chain: StandardScaler → PCA → KNN, so all steps are refit only on training data in each fold.

**Hyperparameters to search:**

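- n_components: e.g., [5, 10, 20, 30, 40, 50] (or by variance thresholds). Each value must not exceed min(n_samples, n_features) of the training fold, otherwise the PCA fit fails.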

- n_neighbors (k): e.g., odd integers 1–31.

- metric: "euclidean" and "manhattan".

- weights: "uniform" vs "distance" (distance weighting gives closer neighbors more say, which can help when classes overlap or are imbalanced).

Because KNN’s cost scales with features, PCA’s reduction keeps inference fast and distances meaningful.

**4) Evaluate the model (properly, with small n)**

- Nested, stratified cross-validation (outer CV for honest performance; inner CV for tuning):

  - Outer: e.g., Stratified 5-fold × 5 repeats (or leave-one-out if n is very small).

  - Inner: grid or randomized search over the hyperparameters above.

- Metrics (multi-class):

  - Balanced accuracy (handles class imbalance better than raw accuracy).

  - Macro ROC-AUC (one-vs-rest) and macro F1 for discrimination, plus log-loss to assess the quality of the predicted probabilities.

- Report mean ± std across outer folds and 95% CIs via bootstrap of fold scores.

**Robustness checks:**

- Permutation test (shuffle labels) to confirm signal isn’t spurious.

- Sensitivity analysis: vary k, metric, and n_components slightly; performance should be stable.

- If classes are highly imbalanced, use repeated stratified CV, and apply any resampling (e.g., SMOTE) strictly inside the training folds, never on the whole dataset.
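The permutation test mentioned above is available in scikit-learn as `permutation_test_score`. The sketch below is a minimal illustration on the Wine placeholder data; the fixed hyperparameters (5 components, k=5) and the small number of permutations are assumptions made to keep it quick, and a real analysis would typically use many more permutations.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # placeholder for the gene-expression data

# Whole pipeline is refit inside every fold, so scaling and PCA never see test rows
pipe = make_pipeline(
    StandardScaler(), PCA(n_components=5), KNeighborsClassifier(n_neighbors=5)
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Score on the real labels, then on label-shuffled copies of the data
score, perm_scores, p_value = permutation_test_score(
    pipe, X, y, scoring="balanced_accuracy", cv=cv, n_permutations=30, random_state=0
)

print(f"Score on real labels:          {score:.3f}")
print(f"Mean score on shuffled labels: {perm_scores.mean():.3f}")
print(f"Permutation p-value:           {p_value:.3f}")
```

If the real-label score is not clearly above the shuffled-label scores, the apparent signal may be an artifact of the evaluation rather than of the data.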

**5) How to justify this pipeline to stakeholders**

- Generalization in p ≫ n settings: PCA compresses thousands of noisy, correlated genes into a smaller set of stable signals, reducing overfitting risk that plagues high-dimensional biology.

**Transparency & Interpretability:**

- KNN is simple and doesn’t extrapolate beyond observed data—clinically reassuring.

- You can examine PCA loadings to identify top genes driving each component, enabling pathway/gene-set enrichment analyses for biological plausibility.

**Methodological rigor:**

- Leakage-free Pipeline + nested CV = honest performance estimates.

- Permutation tests and CIs convey statistical confidence, not just point estimates.

**Operational practicality:**

- Computationally light; easy to re-train as new patient data arrives.

- Works well when sample sizes are limited and model simplicity is preferred for validation/QA.

**Clinical robustness:**

- Balanced accuracy and macro metrics ensure minority cancer types aren’t ignored.

- Distance-weighted voting gives the closest neighbors more influence, which helps with borderline samples.

In [5]:
from sklearn.datasets import load_wine  # placeholder; replace with your gene-expression matrix X, labels y
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
import numpy as np

# X, y = your_data  # shape: (n_samples, n_genes)
data = load_wine()  # placeholder; swap out with your dataset
X, y = data.data, data.target

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(svd_solver="full", random_state=42)),
    ("knn", KNeighborsClassifier())
])

# PCA cannot return more components than min(n_samples, n_features) of the data it is
# fitted on, so drop candidates that would make the fit fail. 0.6 * n_samples is a
# rough, conservative bound on the inner training-fold size (5-fold inside 5-fold
# leaves about 64% of the rows).
max_components = min(X.shape[1], int(0.6 * X.shape[0]))
n_components_grid = [n for n in [5, 10, 20, 30, 40, 50] if n <= max_components]

param_grid = {
    "pca__n_components": n_components_grid,
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15, 21, 31],
    "knn__metric": ["euclidean", "manhattan"],
    "knn__weights": ["uniform", "distance"],
}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Primary scorer: balanced accuracy
grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=inner_cv,
    n_jobs=-1,
    refit=True
)

# Nested CV estimate
outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    grid.fit(X[train_idx], y[train_idx])           # tune on inner CV of the training fold
    score = balanced_accuracy_score(y[test_idx], grid.predict(X[test_idx]))
    outer_scores.append(score)

print(f"Balanced accuracy (nested CV): {np.mean(outer_scores):.3f} ± {np.std(outer_scores):.3f}")
print("Best params on last outer split:", grid.best_params_)

# Optional: macro ROC-AUC (needs probability estimates; KNN supports predict_proba)
def macro_roc_auc(estimator, X_val, y_val):
    prob = estimator.predict_proba(X_val)
    # Columns of prob follow estimator.classes_, so look up each class by position
    aurocs = []
    for col, c in enumerate(estimator.classes_):
        y_true = (y_val == c).astype(int)
        aurocs.append(roc_auc_score(y_true, prob[:, col]))  # one-vs-rest AUC
    return np.mean(aurocs)

# In-sample sanity check only (replace with proper nested CV if you need to report it)
grid.fit(X, y)
print("Macro ROC-AUC (in-sample, for sanity check):", macro_roc_auc(grid.best_estimator_, X, y))


Balanced accuracy (nested CV): 0.967 ± 0.011
Best params on last outer split: {'knn__metric': 'euclidean', 'knn__n_neighbors': 3, 'knn__weights': 'uniform', 'pca__n_components': 10}
Macro ROC-AUC (in-sample, for sanity check): 1.0

