1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
-> K-Nearest Neighbors (KNN) is a supervised, instance-based learning algorithm. It stores the training data and makes a prediction for a new input by finding the K closest training points under a distance metric such as Euclidean distance.

Classification: predicts the majority class among the K neighbors

Regression: predicts the average (or distance-weighted average) of the neighbors’ target values
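
The two prediction rules can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn implements KNN: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and the regression uses an unweighted average.

```python
import math
from collections import Counter

def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to query (Euclidean)."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    return [target for _, target in ranked[:k]]

def knn_classify(train_X, train_y, query, k=3):
    # Majority vote among the k nearest labels
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    # Plain (unweighted) average of the k nearest target values
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)

# Toy data, invented for illustration only
toy_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
toy_labels = ["a", "a", "a", "b", "b", "b"]
toy_values = [10.0, 12.0, 11.0, 50.0, 55.0, 60.0]

print(knn_classify(toy_X, toy_labels, (1.2, 1.5)))  # nearest three are all "a"
print(knn_regress(toy_X, toy_values, (6.4, 7.0)))   # mean of 50, 55, 60
```

The same neighbor search feeds both tasks; only the final aggregation step (vote vs. average) differs.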

2: What is the Curse of Dimensionality and how does it affect KNN performance?
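-> The Curse of Dimensionality refers to the problems that arise as the number of features increases. The data becomes sparse and the distances between points become nearly equal, so the "nearest" neighbors are barely closer than the farthest points and distance-based models like KNN lose effectiveness.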
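
A small simulation can make this effect visible. The sketch below assumes uniformly random points in a unit hypercube (an illustrative setup, not the Wine data) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast` is hypothetical.

```python
import numpy as np

def relative_contrast(n_features, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of features grows
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

When this ratio approaches zero, all points look almost equally far away, which is why the choice of "nearest" neighbors becomes unreliable.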

3: What is Principal Component Analysis (PCA)? How is it different from feature selection?
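-> PCA is an unsupervised dimensionality reduction technique that transforms features into new orthogonal components capturing maximum variance. Feature selection keeps a subset of the original features unchanged, whereas PCA builds new features, each a linear combination of all the original ones, so the components no longer map to a single original feature.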
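
The difference can be checked directly on the Wine dataset. This is a sketch under one assumption: `SelectKBest` with `f_classif` is used here only as an example of a feature selection method (it is supervised and is not part of the original notebook); any other selector would illustrate the same point.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keep 2 of the 13 original columns, values unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)

# PCA: build 2 new columns, each a weighted mix of all 13 original columns
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X_scaled)

print("selected column indices:", kept)
print("selected columns equal originals:", np.allclose(X_selected, X_scaled[:, kept]))
print("PCA loadings shape (components x features):", pca.components_.shape)
```

The selected columns are copies of original features, while each row of `pca.components_` holds one weight per original feature.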

4: What are eigenvalues and eigenvectors in PCA, and why are they important?
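-> Eigenvectors of the feature covariance matrix → directions of maximum variance (the principal components)

Eigenvalues → amount of variance along each of those directions

They matter because sorting the eigenvectors by eigenvalue ranks the components, which shows how many to keep and how much variance is retained.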
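
As a rough sketch of the idea, PCA can be reproduced with an eigen-decomposition of the covariance matrix in NumPy. The data below is synthetic and only illustrative; scikit-learn's `PCA` uses an SVD internally rather than this exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along one direction (illustrative only)
data = rng.normal(size=(300, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
projected = centered @ eigenvectors  # coordinates along each principal direction

print("eigenvalues:", eigenvalues)
print("explained variance ratio:", explained_variance_ratio)
```

The variance of each projected column equals the matching eigenvalue, which is why the eigenvalues can be read as "variance captured per component".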


5: How do KNN and PCA complement each other when applied in a single pipeline?
-> PCA reduces dimensionality → KNN often performs better and faster.

Problems with KNN

Sensitive to high dimensions
Computationally heavy

PCA helps by

Reducing noise
Making distances more meaningful
Lowering computation

End Result

Higher accuracy
Better generalization
Faster predictions
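
One possible way to combine the two steps is a scikit-learn pipeline, sketched below on the Wine dataset. The choices of 2 components, 5 neighbors and 5-fold cross-validation are assumptions made for illustration. Wrapping the steps in a pipeline means the scaler and PCA are refitted on the training part of each fold only.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce to 2 components -> classify with 5 nearest neighbors
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validated accuracy; each fold fits the whole pipeline from scratch
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```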




In [1]:
# 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred)

# With scaling: fit the scaler on the training split only,
# then apply the same transformation to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_scaled)

acc_without_scaling, acc_with_scaling


(0.7407407407407407, 0.9629629629629629)

In [2]:
# 7: Train a PCA model and print explained variance ratio.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

pca.explained_variance_ratio_


array([0.36198848, 0.1920749 , 0.11123631, 0.0706903 , 0.06563294,
       0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
       0.01736836, 0.01298233, 0.00795215])

In [3]:
# 8: Train KNN on PCA-transformed data (top 2 components).

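# Note: X_scaled was standardized on the full dataset and PCA is also fitted
# on it before the split, so test rows influence both transforms. For a
# stricter estimate, fit the scaler and PCA on the training split only.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Same test_size and random_state as In [1], so the same rows land in each split
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42
)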

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy_score(y_test, y_pred)


0.9814814814814815

In [4]:
# 9: Compare KNN with different distance metrics.

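# Uses the scaled 13-feature split from In [1]. y_train / y_test were reassigned
# in In [3], but with the same random_state, so they still match these rows.
for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    print(metric, accuracy_score(y_test, y_pred))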


euclidean 0.9629629629629629
manhattan 0.9629629629629629


10: High-Dimensional Gene Expression Dataset – PCA + KNN Pipeline
-> (Conceptual + Justification)

Step 1: PCA

Remove noise
Handle multicollinearity
Reduce overfitting

Step 2: Choose components
Scree plot (look for the elbow)
Cumulative explained variance ≥ 95%
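
As a worked example of the 95% rule, the sketch below reuses the `explained_variance_ratio_` values printed by cell In [2] for the Wine dataset (not gene expression data) and counts how many components are needed. The variable names are illustrative.

```python
import numpy as np

# Explained variance ratios copied from the output of cell In [2] (Wine dataset)
ratios = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
                   0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
                   0.01736836, 0.01298233, 0.00795215])

cumulative = np.cumsum(ratios)
# First position where the running total reaches 0.95, converted to a count
n_components = int(np.argmax(cumulative >= 0.95)) + 1

print(n_components)  # 10 (nine components reach about 0.942, ten reach about 0.962)
```

In scikit-learn the same threshold can be requested directly by passing a fraction, for example `PCA(n_components=0.95)`.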

Step 3: Apply KNN
After PCA, distances are computed in a lower-dimensional, less noisy space, so they become more meaningful

Step 4: Evaluation

Cross-validation
Accuracy, F1-score
Confusion matrix

Step 5: Stakeholder Justification
Robust against overfitting
Simple pipeline in which each step can be explained
Industry-accepted approach
Computationally efficient
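
The steps above could be sketched as follows. No real gene expression data is available here, so the example assumes a synthetic stand-in from `make_classification` with far more features than samples; the sample and feature counts, the number of neighbors and the fold count are all illustrative assumptions. The scree plot is omitted and the 95% variance rule is applied through `PCA(n_components=0.95)`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X_genes, y_genes = make_classification(
    n_samples=120, n_features=2000, n_informative=20, n_redundant=50,
    n_classes=2, random_state=0,
)

# Steps 1-3: scale, keep components explaining 95% of the variance, then KNN
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

# Step 4: stratified cross-validation; scaler and PCA are refitted inside each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_cv_pred = cross_val_predict(pipeline, X_genes, y_genes, cv=cv)

print("accuracy:", accuracy_score(y_genes, y_cv_pred))
print("F1-score:", f1_score(y_genes, y_cv_pred))
print(confusion_matrix(y_genes, y_cv_pred))
```

Keeping the preprocessing inside the pipeline is a deliberate choice: it prevents information from the held-out fold from leaking into the scaler or the PCA fit, which matters most when samples are scarce.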