Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

Answer:
K-Nearest Neighbors (KNN) is a non-parametric, instance-based machine learning algorithm. It does not build an explicit model; instead, it stores all training data and makes predictions based on similarity.

How it works:

Choose a number of neighbors k.

Compute the distance (Euclidean, Manhattan, etc.) between the new data point and all training points.

Select the k closest points (neighbors).

Predict:

Classification: Take a majority vote among the neighbors' classes.

Regression: Take the mean (or distance-weighted mean) of the neighbors' values.
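The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict`, its `task` parameter, and the toy data are all made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one query point with a brute-force k-nearest-neighbors search."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the neighbors' class labels
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: unweighted mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (hypothetical, for illustration only)
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]

pred_class = knn_predict(X_toy, labels, [1.1, 1.0], k=3)
pred_value = knn_predict(X_toy, values, [5.0, 5.0], k=3, task="regression")
print(pred_class, pred_value)
```

Note that all the work happens at prediction time: there is no training step beyond storing the data.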

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Answer:
The Curse of Dimensionality refers to the set of problems that arise when data has a large number of features (dimensions) relative to the number of samples.

As the number of features increases:

Data points become sparse.

Distances lose meaning (nearest and farthest points become almost equally distant).

Models require exponentially more data to generalize well.

Effect on KNN:

Distance metrics become unreliable.

Irrelevant features can dominate the distance measure.

As a result, classification and regression accuracy degrade.
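The loss of distance contrast can be seen in a small simulation. The sketch below assumes points drawn uniformly at random from the unit hypercube; the helper `relative_contrast` and the chosen dimensions are illustrative, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"{dim:>5} dimensions: relative contrast = {c:.2f}")
```

The relative contrast shrinks as the dimension grows, so the "nearest" neighbor is barely closer than the farthest point, which is why KNN's notion of a neighborhood weakens.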

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

Answer:
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique. It projects data onto new axes called principal components that maximize variance.

Key ideas:

First principal component: direction of maximum variance.

Next components: orthogonal directions with decreasing variance.

Helps reduce dimensionality while retaining most of the variance in the data.

PCA vs. Feature Selection:

PCA: Creates new features (linear combinations of original features).

Feature selection: Keeps a subset of original features without transformation.
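The difference can be made concrete on the Wine dataset. In this sketch, `SelectKBest` with `f_classif` is just one possible feature selection method, chosen for illustration (it is supervised, unlike PCA); the choice of two features or components is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# PCA: each new feature is a weighted mix of all 13 original features
pca_demo = PCA(n_components=2).fit(X_std)
print("PCA loadings shape:", pca_demo.components_.shape)

# Feature selection: two original columns are kept unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, wine.target)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)
```

Each row of `components_` holds one weight per original feature, whereas the selector simply returns the names of columns that already existed.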

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

Answer:

Eigenvectors (of the data's covariance matrix): Directions (axes) along which the data varies → principal components.

Eigenvalues: Amount of variance explained along the corresponding eigenvector.

Importance in PCA:

Sort eigenvalues (largest to smallest) → decide how many components to keep.

Eigenvectors with top eigenvalues form the reduced feature space.
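As a sketch of how this works numerically, the following computes the eigen-decomposition of the covariance matrix directly and cross-checks it against scikit-learn's `PCA`. It assumes standardized Wine data; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance explained by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# Project onto the top 2 eigenvectors to get a 2-D representation
X_reduced = X_std @ eigenvectors[:, :2]

# Cross-check against scikit-learn's PCA
sk_ratio = PCA().fit(X_std).explained_variance_ratio_
print(np.allclose(explained_ratio, sk_ratio))
```

The signs of individual eigenvectors may differ from those in `PCA.components_`, since an eigenvector is only defined up to a sign flip; the explained variance ratios are unaffected.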

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

Answer:

KNN relies on distances → affected by irrelevant/noisy features.

PCA reduces dimensionality and noise → keeps only meaningful directions.

Pipeline:

Step 1: StandardScaler (standardize each feature to zero mean and unit variance).

Step 2: PCA (reduce dimensionality).

Step 3: KNN (classification/regression).
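A minimal sketch of this three-step pipeline on the Wine dataset is shown below. The number of components (5) and neighbors (5) are arbitrary illustrative choices, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)

knn_pca_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),  # 5 is an arbitrary illustrative choice
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Each CV fold refits the scaler and PCA on its own training portion only
cv_scores = cross_val_score(knn_pca_pipe, X_w, y_w, cv=5)
print("Mean CV accuracy:", cv_scores.mean())
```

Wrapping the steps in a `Pipeline` keeps the preprocessing inside each cross-validation fold, so the held-out data never influences the scaler or the PCA projection.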

In [3]:
#Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42, stratify=y)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, knn.predict(X_test))

# With scaling
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)
acc_scale = accuracy_score(y_test, pipe.predict(X_test))

print("Accuracy without scaling:", acc_no_scale)
print("Accuracy with scaling:", acc_scale)


Accuracy without scaling: 0.7777777777777778
Accuracy with scaling: 0.9333333333333333


In [5]:
#Question 7: Train a PCA model on the Wine dataset and print explained variance ratio.


from sklearn.decomposition import PCA
import numpy as np

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA()
pca.fit(X_scaled)

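# Round to 4 decimals (np.round defaults to 0 decimals, which prints all zeros)
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 4))

Explained variance ratio: [0.362  0.1921 0.1112 0.0707 0.0656 0.0494 0.0424 0.0268 0.0222 0.0193
 0.0174 0.013  0.008 ]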


In [6]:
#Question 8: Train a KNN Classifier on PCA-transformed dataset (top 2 components). Compare accuracy.


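# Note: X_scaled comes from a scaler fitted on the full dataset, and PCA is
# also fitted before the split, so the test rows influence both transforms.
pca2 = PCA(n_components=2)
X_pca = pca2.fit_transform(X_scaled)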

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_pca, y,
                                                            test_size=0.25, random_state=42, stratify=y)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_p, y_train_p)
acc_pca2 = accuracy_score(y_test_p, knn_pca.predict(X_test_p))

print("Accuracy with PCA (2 components):", acc_pca2)
print("Accuracy with original scaled data:", acc_scale)

Accuracy with PCA (2 components): 0.9333333333333333
Accuracy with original scaled data: 0.9333333333333333
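A stricter variant of this comparison fits the scaler and PCA on the training split only, then applies them to the test split. The sketch below does that with a `Pipeline`, reusing the same split settings as Question 6; the name `acc_pca2_no_leak` is introduced here for illustration, and no output is shown because it has not been run as part of this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scaler and PCA are fitted on the training split only
pipe_pca2 = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe_pca2.fit(X_train, y_train)

acc_pca2_no_leak = accuracy_score(y_test, pipe_pca2.predict(X_test))
print("Accuracy with PCA (2 components, fitted on train only):", acc_pca2_no_leak)
```

On a dataset this small the difference from the cell above may be minor, but fitting transforms on the training data alone is the safer habit.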


In [7]:
#Question 9: Train a KNN Classifier with different distance metrics.


# X_train/X_test are the unscaled splits from Question 6
for metric in ["euclidean", "manhattan"]:
    model = KNeighborsClassifier(n_neighbors=5, metric=metric)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Accuracy with {metric}: {acc:.4f}")


Accuracy with euclidean: 0.7778
Accuracy with manhattan: 0.8000
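The two metrics compared above are both special cases of the Minkowski distance, with p=2 (Euclidean) and p=1 (Manhattan). A small worked example, using two made-up vectors and an illustrative helper named `minkowski`:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum |a_i - b_i|^p)^(1/p)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manhattan = minkowski(a, b, p=1)  # |−3| + |−4| + |0| = 7
euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16 + 0) = 5
print("Manhattan:", manhattan, "Euclidean:", euclidean)
```

Because Euclidean distance squares each difference, a single large per-feature gap weighs more heavily than it does under Manhattan distance, which is one reason the two metrics can rank neighbors differently.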


Question 10: High-dimensional gene expression dataset scenario.

Approach & Justification:

PCA for dimensionality reduction:

Reduces thousands of gene features to the top components that together capture about 95% of the variance.

Removes noise and redundancy.

Decide number of components:

Use cumulative explained variance (e.g., keep components until ≥95% variance).

Apply KNN after PCA:

Distance-based classification works better in reduced space.

Evaluate model:

Use cross-validation and test accuracy.

Compare different k values for KNN.

Justification to stakeholders:

Reduces overfitting (small samples, many features).

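Improves computational efficiency; the components can be examined through their gene loadings, although they are less directly interpretable than individual genes.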

Provides a robust pipeline for real-world biomedical data.
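The cumulative explained variance rule mentioned above can be sketched as follows. The Wine dataset stands in for gene expression data here purely for illustration, and the 0.95 threshold is the example value from the text.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_wine().data)

# Fit all components, then accumulate their variance shares
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95
n_keep = int(np.argmax(cumulative >= threshold) + 1)
print(f"Keep {n_keep} of {len(cumulative)} components "
      f"({cumulative[n_keep - 1]:.1%} of variance)")
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make an equivalent choice automatically.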




In [2]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Synthetic high-dimensional dataset
Xg, yg = make_classification(
    n_samples=240, n_features=2000,
    n_informative=50, n_classes=3, random_state=42
)

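# Scale features
# Note: the scaler and PCA below are fitted on all samples before
# cross-validation, so each validation fold has influenced the transform.
Xg_scaled = StandardScaler().fit_transform(Xg)

# PCA (keep enough components to explain 95% of the variance)
pca_g = PCA(n_components=0.95)
Xg_pca = pca_g.fit_transform(Xg_scaled)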

# Evaluate KNN with cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(model, Xg_pca, yg, cv=cv)

print("PCA components kept:", pca_g.n_components_)
print("Cross-validation accuracy:", scores.mean())


PCA components kept: 216
Cross-validation accuracy: 0.3625
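To cover the "compare different k values" step and keep preprocessing inside each fold, one option is a `Pipeline` searched with `GridSearchCV`. This is a sketch on the same synthetic data; the candidate k values are arbitrary, and no output is shown because this cell was not part of the original run.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic high-dimensional dataset as above
Xg, yg = make_classification(
    n_samples=240, n_features=2000,
    n_informative=50, n_classes=3, random_state=42
)

# Scaler and PCA are refitted on the training part of every fold
gene_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

param_grid = {"knn__n_neighbors": [3, 5, 7, 11]}  # illustrative candidates
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(gene_pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(Xg, yg)

print("Best k:", search.best_params_["knn__n_neighbors"])
print("Best cross-validation accuracy:", search.best_score_)
```

With only 50 informative features out of 2000, the accuracy on this synthetic data is expected to stay modest; the point of the sketch is the evaluation procedure, not the score.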
