Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?


Ans]

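KNN is a supervised, instance-based algorithm. It stores the training data and, for a new point, finds the k closest training points using a distance measure such as Euclidean distance.

- Classification: predicts the majority class among the k neighbors (vote).

- Regression: predicts the average of the k neighbors' target values.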
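
The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point using its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels
        labels, counts = np.unique(neighbor_targets, return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: average of the neighbors' target values
    return neighbor_targets.mean()

# Toy data (made up for illustration): two small clusters
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.8, 1.6])
pred_class = knn_predict(X_train, y_class, query, k=3, task="classification")
pred_value = knn_predict(X_train, y_reg, query, k=3, task="regression")
print(pred_class, pred_value)
```

The query lies inside the first cluster, so its three nearest neighbors all have class 0 and targets 10, 11 and 12; the vote gives class 0 and the average gives 11.0.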

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?


Ans]

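The Curse of Dimensionality means:

- When the number of features (dimensions) increases, the data becomes sparse: points end up far apart, and the nearest and farthest neighbors become almost equally distant.

- A dimension is a feature/column. Example: age, salary, height, weight, etc.

Effect on KNN:

- KNN relies on distances, so when all distances look similar the "nearest" neighbors are no longer meaningfully similar and accuracy tends to drop.

- More features also mean more computation per distance and more training data needed to cover the space.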
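
A small experiment can make this visible. The sketch below assumes random points drawn uniformly from a unit cube (an arbitrary choice for illustration) and compares the nearest and farthest distance from one query point as the number of features grows; the helper name `nearest_to_farthest_ratio` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of the nearest to the farthest distance from one query point."""
    points = rng.random((n_points, n_features))   # uniform in the unit cube
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>4} features: nearest/farthest = {r:.3f}")
```

A ratio close to 0 means the nearest neighbor is much closer than the farthest one; a ratio close to 1 means they are almost equally far away, which is the situation that hurts KNN.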

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?


Ans]

Principal Component Analysis (PCA) is a dimensionality reduction technique.

- It reduces the number of features while keeping most of the important information.

PCA creates new features called principal components. Each principal component is a linear combination of the original features.

Feature Selection means:

- Choosing the best features from the original data.

- It does not create new features

- It only removes unnecessary ones

Example:

- Original features: Age, Height, Weight, Name

- Remove Name

- Keep Age, Height, Weight
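
The contrast can be sketched with NumPy on made-up data. Two assumptions here: a constant column (`constant_id`) stands in for an uninformative feature such as Name, and a simple "drop zero-variance columns" rule stands in for feature selection. PCA is computed by hand with an SVD rather than with a library call.

```python
import numpy as np

rng = np.random.default_rng(42)
feature_names = ["age", "height", "weight", "constant_id"]  # hypothetical columns
n = 50
age = rng.normal(40, 10, n)
height = rng.normal(170, 8, n)
weight = 0.9 * height - 90 + rng.normal(0, 5, n)  # correlated with height
constant_id = np.ones(n)                          # carries no information
X = np.column_stack([age, height, weight, constant_id])

# Feature selection: drop zero-variance columns, keep the rest unchanged
keep = X.var(axis=0) > 0
X_selected = X[:, keep]
selected_names = [name for name, kept in zip(feature_names, keep) if kept]

# PCA via SVD on centered data: each new feature mixes the original columns
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ Vt[:2].T   # first two principal components

print("Selected features:", selected_names)
print("PC1 weights on original features:", np.round(Vt[0], 3))
```

The selected columns are exact copies of original columns, while each principal component is a weighted mix of several of them and has no single original name.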

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Ans]

Eigenvectors of the data's covariance matrix show the directions along which the data spreads the most (variance).

In PCA:

- Each eigenvector = a new axis (principal component)

- They show how data is oriented

Simple meaning:
- Eigenvectors decide the direction of new features.


Eigenvalues show how important each eigenvector is.

- Large eigenvalue → more information

- Small eigenvalue → less information

Simple meaning:
- Eigenvalues tell how much data variance is captured.

Why Are They Important in PCA?

PCA works by:

1. Finding eigenvectors and eigenvalues of the covariance matrix of the centered data

2. Sorting them by eigenvalue (largest to smallest)

3. Keeping only the top k eigenvectors and projecting the data onto them

So:

- Eigenvectors → new features

- Eigenvalues → which features to keep
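
The three steps above can be sketched directly with NumPy. The data is synthetic and the choice of keeping one component is arbitrary; this is an illustration of the idea, not a replacement for `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data stretched along one direction (illustrative only)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

# 1. Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigen-decomposition (eigh is intended for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Keep the top component and project the data onto it
top_k = 1
X_projected = X_centered @ eigenvectors[:, :top_k]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_ratio)
```

The variance of the projected data equals the largest eigenvalue, which is why eigenvalues can be read as "how much variance each component captures".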

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

Ans]
Basic Idea

- PCA is used first, then KNN is applied.

Why?

- KNN works on distance

- Too many features = poor distance calculation

- PCA reduces features and noise

Step-by-Step Pipeline (Simple)

1. PCA

- Reduces number of features

- Removes noise

- Keeps important information

2. KNN

- Finds nearest neighbors

- Often works faster and can be more accurate, because distances are computed on fewer, less noisy features
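
One way to chain these steps is scikit-learn's `make_pipeline`. The sketch below adds a `StandardScaler` in front (PCA and KNN are both scale-sensitive) and uses 5 components, which is an arbitrary illustrative value rather than a tuned one.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale -> PCA -> KNN; every step is fitted on the training data only
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),                  # 5 is an arbitrary illustrative choice
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Wrapping the steps in one pipeline object means the test data is only transformed, never used for fitting the scaler or PCA.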

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.


Ans]

In [None]:


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------- Without Feature Scaling --------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -------- With Feature Scaling --------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Compare both accuracies
print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling:", acc_scaled)


Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

Ans]

In [None]:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
wine = load_wine()
X = wine.data

# Feature scaling (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio of each principal component
explained_variance = pca.explained_variance_ratio_

for i, ratio in enumerate(explained_variance, start=1):
    print(f"PC{i}: {ratio:.4f}")
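
The explained variance ratios are often summed cumulatively to decide how many components to keep. A minimal sketch, assuming a 95% cutoff (a common but arbitrary threshold):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_scaled)

# Running total of variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # illustrative cutoff, not a fixed rule
# First position where the running total reaches the threshold (1-based count)
n_components = int(np.argmax(cumulative >= threshold)) + 1

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 95% variance:", n_components)
```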


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.


Ans]

In [None]:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---------- Original Dataset (with scaling) ----------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

# ---------- PCA (Top 2 Components) ----------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

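# Compare both accuracies
print("Accuracy on original (scaled) features:", acc_original)
print("Accuracy on top 2 principal components:", acc_pca)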


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

Ans]

In [None]:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -------- KNN with Euclidean distance --------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -------- KNN with Manhattan distance --------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

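# Compare both accuracies
print("Accuracy with Euclidean distance:", acc_euclidean)
print("Accuracy with Manhattan distance:", acc_manhattan)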
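
For reference, the two metrics differ only in how per-feature differences are combined. A tiny NumPy sketch on two made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the sum of squared differences (straight-line distance)
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of absolute differences (distance along the axes)
manhattan = np.sum(np.abs(a - b))

print(euclidean, manhattan)
```

The per-feature differences here are 3, 4 and 0, so the Euclidean distance is 5.0 and the Manhattan distance is 7.0.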


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

Ans]

In [None]:
# Import libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix

# -------------------------------
# 1. Create high-dimensional data
# -------------------------------
# Synthetic stand-in for gene expression data: few samples, many features
X, y = make_classification(
    n_samples=100,
    n_features=1000,   # High-dimensional (genes)
    n_informative=50,
    n_classes=3,
    random_state=42
)

print("Original shape:", X.shape)

# -------------------------------
# 2. Train-test split
# -------------------------------
# Split first so the scaler and PCA never see the test samples (no data leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# -------------------------------
# 3. Standardize + PCA + KNN in one pipeline
# -------------------------------
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # Keep 95% variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

pca = pipeline.named_steps["pca"]
print("Components kept by PCA:", pca.n_components_)
print("Explained variance:", np.sum(pca.explained_variance_ratio_))

# -------------------------------
# 4. Prediction and Evaluation
# -------------------------------
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("\nAccuracy:", accuracy)
print("\nConfusion Matrix:\n", conf_matrix)

# -------------------------------
# 5. Cross-validation
# -------------------------------
# The whole pipeline is refit inside each fold, so scaling and PCA
# are learned from that fold's training part only
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print("\nCross-validation accuracy:", cv_scores.mean())
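
Instead of fixing the number of components through a variance threshold, the number of components and the number of neighbors could also be chosen by cross-validated grid search. The sketch below is one possible setup on the same kind of synthetic data; the candidate values in `param_grid` are arbitrary illustrative choices, not recommendations for real gene expression data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small, high-dimensional dataset
X, y = make_classification(
    n_samples=100, n_features=1000, n_informative=50,
    n_classes=3, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Candidate settings to compare (illustrative values)
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Because the scaler and PCA live inside the pipeline, each fold fits them on its own training part, so the reported score is not inflated by information from the held-out samples.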
