1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
   - K-Nearest Neighbors (KNN) is a supervised, instance-based (lazy learning) algorithm that makes predictions based on the K closest training samples to a new data point, using a distance metric like Euclidean distance.

   - KNN for Classification
   - Find the K nearest neighbors
   - Take a majority vote
   - The class with the highest count becomes the prediction
   - Example: If K=5 and neighbors are [A, A, B, A, B] → output = A

   - KNN for Regression
   - Find the K nearest neighbors
   - Predict the average (or weighted average) of their target values
   - Example: targets [10, 12, 9, 11, 13] → output = 55 / 5 = 11
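
The two procedures above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements KNN; the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy 1-D data are hypothetical, chosen so that the five nearest neighbors reproduce the labels and targets from the examples above.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query` (Euclidean distance)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k=5):
    """Classification: majority vote among the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=5):
    """Regression: unweighted mean of the targets of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)


# Hypothetical toy data: five points near the query and two far away
X = [(1.0,), (1.2,), (1.4,), (0.8,), (1.6,), (9.0,), (9.5,)]
labels = ["A", "A", "B", "A", "B", "B", "B"]
targets = [10, 12, 9, 11, 13, 40, 42]
query = (1.1,)

print(knn_classify(X, labels, query, k=5))  # neighbors vote A, A, B, A, B
print(knn_regress(X, targets, query, k=5))  # mean of 10, 12, 9, 11, 13
```

A weighted variant would replace the plain mean (or vote count) with weights such as 1 / distance, so that closer neighbors count more.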

2. What is the Curse of Dimensionality and how does it affect KNN performance?
   - The Curse of Dimensionality refers to problems that arise when the number of features (dimensions) becomes very large.

   - Effect on KNN
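   - In high dimensions, data become sparse: with a fixed number of samples, points lie far apart from each other
   - Distance measures become less meaningful (the nearest and farthest neighbors end up at almost the same distance)
   - KNN struggles to find “truly close” neighbors
   - Model accuracy tends to decrease and prediction becomes more expensive, since each distance computation scales with the number of features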
   - So, KNN performs best when features are few and meaningful, or after dimensionality reduction (like PCA).
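
The "all distances look similar" effect can be observed with a small experiment. The sketch below is an illustration under assumed settings (uniform random points in the unit hypercube, 500 points, a fixed seed); the function name `relative_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (max - min) / min over distances from a random query to random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(point, query))
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrast.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

The printed contrast should shrink as the number of dimensions grows, meaning the "nearest" neighbor is barely closer than the farthest one.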

3. What is Principal Component Analysis (PCA)? How is it different from feature selection?
   - PCA is an unsupervised dimensionality reduction technique that transforms the original features into new features called principal components.

   - What PCA does
   - Finds orthogonal directions of maximum variance in the data
   - Projects the data onto the top few of these directions, keeping most of the variance

   - PCA vs Feature Selection
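   - PCA creates new features (components), while feature selection keeps a subset of the original features.
   - Each PCA component is a linear combination of all original features; feature selection chooses a subset of the existing columns.
   - PCA is unsupervised (it ignores the target); feature selection can be supervised (for example, ranking features by their relationship with the target).
   - PCA can suppress noise by dropping low-variance components, but its components are harder to interpret; feature selection preserves the interpretability of the original features.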
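
The contrast can be seen side by side on the Wine data. This is a minimal sketch: `SelectKBest` with the ANOVA F-score (`f_classif`) is just one possible supervised selection method, picked here as an assumption for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: uses the labels y and keeps 2 of the 13 original columns
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)
mask = selector.get_support()
selected_names = [name for name, keep in zip(wine.feature_names, mask) if keep]
X_selected = selector.transform(X_scaled)

# PCA: ignores y and builds 2 new columns, each a weighted mix of all 13 features
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Selected original features:", selected_names)
print("Shape of PCA weight matrix (components x features):", pca.components_.shape)
```

The selected columns still carry their original names, while each row of `pca.components_` holds one weight per original feature.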

4. What are eigenvalues and eigenvectors in PCA, and why are they important?
   - In PCA, we compute the covariance matrix of the (centered) dataset and find its eigenvectors and eigenvalues.
   - Eigenvectors of the covariance matrix give the directions (axes) of maximum variance; these are the principal components.
   - Each eigenvalue is the amount of variance captured along its eigenvector.

   - Why important?
   - Eigenvectors decide the new coordinate system
   - Eigenvalues decide which components are most useful
   - Higher eigenvalue → more variance captured by that component
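
As a rough check of this relationship, the eigen-decomposition can be done by hand with NumPy and compared against scikit-learn's PCA. This is a sketch on the standardized Wine data; scikit-learn itself computes PCA through a different numerical route, so the comparison is on results, not on implementation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the features (13 x 13); columns are variables
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # re-sort to descending variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the i-th principal direction

# Share of total variance captured by each eigenvector
variance_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X_scaled)
print("Manual ratios match PCA:", np.allclose(variance_ratio, pca.explained_variance_ratio_))
```

The eigenvectors define the new axes, and sorting by eigenvalue ranks those axes by how much variance each one explains.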

5. How do KNN and PCA complement each other when applied in a single pipeline?
   - KNN depends heavily on distance calculations. PCA helps by reducing dimensions and removing noise.

   - Benefits of PCA + KNN
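   - Mitigates the curse of dimensionality
   - Improves distance quality → better neighbors
   - Faster prediction (each distance is computed over fewer dimensions)
   - Can improve accuracy on noisy, high-dimensional datasets, though this is not guaranteed: keeping too few components can discard useful signal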

In [1]:
# 6. Use the Wine Dataset from sklearn.datasets.load_wine().
# Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

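# Baseline: KNN on raw features (distances are dominated by large-scale features such as proline)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same model on standardized features
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaled)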

print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling:", acc_scaling)

Accuracy without scaling: 0.8055555555555556
Accuracy with scaling: 0.9722222222222222


In [2]:
# 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:")
print(pca.explained_variance_ratio_)

Explained variance ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]


In [3]:
# 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset. 

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize first so that no single large-scale feature dominates the variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline: KNN on all 13 standardized features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred)

# Fit PCA on the training set only, then project both sets onto the top 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# KNN on the 2-dimensional PCA projection
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on original scaled data:", acc_original)
print("Accuracy on PCA (2 components):", acc_pca)

Accuracy on original scaled data: 0.9722222222222222
Accuracy on PCA (2 components): 0.9166666666666666


In [4]:
# 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results. 

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Euclidean (L2) distance: square root of the summed squared differences
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_e = knn_euclidean.predict(X_test_scaled)
acc_e = accuracy_score(y_test, y_pred_e)

# Manhattan (L1) distance: sum of the absolute differences
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_m = knn_manhattan.predict(X_test_scaled)
acc_m = accuracy_score(y_test, y_pred_m)

print("Accuracy with Euclidean:", acc_e)
print("Accuracy with Manhattan:", acc_m)

Accuracy with Euclidean: 0.9722222222222222
Accuracy with Manhattan: 1.0


10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. 
Due to the large number of features and a small number of samples, traditional models overfit. 
Explain how you would: 
● Use PCA to reduce dimensionality 
● Decide how many components to keep 
● Use KNN for classification post-dimensionality reduction 
● Evaluate the model 
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

    - Gene expression datasets usually have:
    - Thousands of features (genes)
    - Very few samples
    - This causes overfitting and unstable models.

    - Step 1: Use PCA to reduce dimensionality
    - Standardize features first
    - Apply PCA to compress thousands of genes into fewer components
    - Removes noise and redundancy

    - Step 2: Decide how many components to keep
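    - Use:
    - The cumulative explained variance ratio
    - Keep enough components to retain 95%–99% of the variance
    - Example: PCA(n_components=0.95) keeps the smallest number of components whose cumulative explained variance reaches 95%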
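
The component count can also be read off the cumulative explained variance directly. The sketch below uses the standardized Wine data from the earlier questions as a stand-in, since no gene expression data is available here; the 0.95 threshold is the example value from the step above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Fit PCA with all components, then accumulate the explained variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95
# Index of the first component at which the cumulative ratio reaches the threshold
n_keep = int(np.argmax(cumulative >= threshold)) + 1

# Passing a fraction lets scikit-learn pick the component count itself
pca_95 = PCA(n_components=threshold).fit(X_scaled)

print("Components needed (manual):", n_keep)
print("Components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```

On the Wine data both approaches give 10 components, consistent with the explained variance ratios printed for Q7 (the first nine sum to about 0.942, the first ten to about 0.962).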

    - Step 3: Use KNN after PCA
    - Train KNN on reduced feature space
    - Use cross-validation to choose best K
    - Distances are more meaningful in the lower-dimensional space

    - Step 4: Evaluate the model
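    - Use:
    - A held-out test set plus stratified cross-validation, with scaling and PCA fitted inside each fold to avoid data leakage
    - Accuracy, Precision, Recall, F1-score
    - Confusion matrix
    - If the dataset is imbalanced → prefer macro-averaged F1, balanced accuracy, or ROC-AUC over plain accuracy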
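
These evaluation steps might look like the following. This is a sketch on synthetic data: `make_classification` with 120 samples, 500 features, and imbalanced class weights is an assumed stand-in for a real gene expression matrix, and K=5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: few samples, many features, three imbalanced classes
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=15,
    n_classes=3, weights=[0.6, 0.25, 0.15], random_state=0,
)

# Scaler and PCA are pipeline steps, so each fold fits them on its own training part
model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)                  # rows: true class, columns: predicted
macro_f1 = f1_score(y, y_pred, average="macro")   # weights every class equally

print(cm)
print("Macro F1:", round(macro_f1, 3))
```

Macro-averaged F1 treats a rare cancer type as equally important as a common one, which plain accuracy does not.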

    - Step 5: Justify pipeline to stakeholders
    - Explain that:
    - PCA reduces noise and lowers the risk of overfitting when genes far outnumber samples
    - KNN becomes faster and often more accurate in lower dimensions
    - Pipeline is interpretable at a high level (variance retained, similar patients as neighbors)
    - Cross-validation gives a more reliable estimate of performance on unseen real-world biomedical data

In [None]:
# Example Code for Q10
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

# X_gene = gene expression matrix (samples x genes)
# y = cancer type labels
# Synthetic stand-in data so the cell runs end to end; replace with the real dataset.
X_gene, y = make_classification(
    n_samples=100, n_features=2000, n_informative=20, n_classes=3, random_state=42
)

# Hold out a test set first so it plays no part in model selection
X_train, X_test, y_train, y_test = train_test_split(
    X_gene, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling and PCA sit inside the pipeline, so they are refit on each CV training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier())
])

param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__metric": ["euclidean", "manhattan"]
}

# Tune K and the distance metric with 5-fold cross-validation on the training set only
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
print("Best CV Score:", grid.best_score_)

# Evaluate the refit best model on the held-out test set
y_pred = grid.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
