1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Definition of KNN

K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification and regression tasks.

It is a non-parametric, instance-based (lazy) learning method, meaning it does not assume any underlying data distribution and does not explicitly build a model during training.

Instead, it stores the training data and makes predictions based on the similarity (distance) between data points.

How KNN Works (General Steps)

Choose K (number of neighbors):

K determines how many nearest data points are considered.

Small K → sensitive to noise.

Large K → smoother decision boundary but may ignore local patterns.

Measure Distance:

Common distance metrics:

Euclidean Distance (most used)

Manhattan Distance

Minkowski Distance

Find Nearest Neighbors:

For a new data point, calculate distance to all training samples.

Select the K closest points.

Make Prediction:

Depends on whether the problem is classification or regression.
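The distance metrics listed under "Measure Distance" are closely related: Minkowski distance of order p reduces to Manhattan distance for p = 1 and to Euclidean distance for p = 2. A minimal NumPy sketch of that relationship follows; the helper name `minkowski` and the two sample points are illustrative assumptions, not part of any library.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points (hypothetical helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)

print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, which is one reason the two metrics can rank neighbors differently.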

KNN in Classification

Each of the K nearest neighbors “votes” for its class.

The class with the majority votes is assigned to the new data point.

Example:

K = 5, among neighbors → 3 belong to Class A, 2 to Class B → new point classified as Class A.

Decision boundary: usually non-linear and adapts to data distribution.

KNN in Regression

Instead of voting, KNN takes the average (or weighted average) of the target values of the K nearest neighbors.

Example:

K = 3, neighbors have target values [50, 60, 70] → prediction = (50+60+70)/3 = 60.

If weighted, closer neighbors have more influence than farther ones.
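The voting and averaging rules above can be sketched as a brute-force implementation. This is an illustration under stated assumptions, not how scikit-learn implements KNN: the function `knn_predict`, its parameters, and the tiny one-feature datasets are all made up for the example.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k, task="classification", weighted=False):
    """Predict for one query point with brute-force KNN (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    if task == "classification":
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]          # class with the most votes
    targets = np.array([y_train[i] for i in nearest], dtype=float)
    if weighted:
        weights = 1.0 / (distances[nearest] + 1e-12)   # closer -> larger weight
        return float(np.average(targets, weights=weights))
    return float(targets.mean())

# Classification: the 5 nearest neighbors are 3 x "A" and 2 x "B"
X_cls = [[1], [2], [3], [4], [5], [50]]
y_cls = ["A", "B", "A", "B", "A", "B"]
label = knn_predict(X_cls, y_cls, [3], k=5)

# Regression: the 3 nearest neighbors have targets 50, 60, 70
X_reg = [[1], [2], [3], [10]]
y_reg = [50, 60, 70, 500]
mean_pred = knn_predict(X_reg, y_reg, [2.5], k=3, task="regression")
weighted_pred = knn_predict(X_reg, y_reg, [2.5], k=3, task="regression", weighted=True)

print("Predicted class:", label)
print("Unweighted prediction:", mean_pred)
print("Distance-weighted prediction:", weighted_pred)
```

In the regression case the query sits closer to the points with targets 60 and 70 than to the one with target 50, so the distance-weighted prediction is pulled above the plain mean.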

Strengths of KNN

Simple and intuitive.

Works well on smaller datasets with fewer irrelevant features.

No explicit training phase → fitting is nearly instant (the cost is deferred to prediction time).

Limitations of KNN

Computationally expensive during prediction (distance must be calculated to all points).

Sensitive to noise and irrelevant features.

Requires careful choice of K and distance metric.

Struggles with high-dimensional data (curse of dimensionality).

Use Cases

Classification: Handwritten digit recognition, text categorization, medical diagnosis.

Regression: House price prediction, recommendation systems.

✅ Final Summary
KNN is a lazy, distance-based supervised learning algorithm. In classification, it predicts the label based on majority voting of K neighbors, while in regression, it predicts the output as the mean (or weighted mean) of neighbors. Its simplicity and flexibility make it widely used, though computational cost and sensitivity to noise are challenges.

2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Definition of Curse of Dimensionality

The curse of dimensionality refers to the challenges that arise when working with high-dimensional data (many features/variables).

As the number of dimensions increases:

Data becomes sparse.

Distance measures lose effectiveness.

Algorithms that rely on similarity (like KNN) struggle to make accurate predictions.

Key Effects of High Dimensions

Distance Becomes Less Meaningful

In high dimensions, the distance between points tends to even out.

For example, in low dimensions, the nearest neighbor is clearly closer than the other points, but in high dimensions, the relative difference between the distances to the nearest and the farthest point becomes very small.

This makes it hard for KNN to distinguish which points are truly “close.”

Data Sparsity

As dimensions increase, the volume of the feature space grows exponentially.

Data points are spread out, making it unlikely to find dense clusters of neighbors.

KNN, which depends on finding local neighborhoods, becomes less effective.

Increased Computational Cost

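More dimensions → each distance calculation costs more (the number of calculations depends on the number of training samples, the cost of each one on the number of features).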

Training is fast in KNN (lazy learning), but prediction becomes computationally expensive in high dimensions.
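The "distances even out" effect described above can be observed numerically. The sketch below draws uniformly random points (an assumption chosen purely for illustration; real data is rarely uniform) and measures the relative contrast, (farthest − nearest) / nearest, from one random query point. The function name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, value in contrast.items():
    print(f"{d:>5} dimensions: relative contrast = {value:.3f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest points end up at nearly the same distance, so "nearest" carries less information.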

Impact on KNN Performance

Poor Classification/Regression Accuracy:

Since distances become unreliable, KNN may pick irrelevant neighbors.

Overfitting Risk:

In very high-dimensional spaces, KNN might fit to noise instead of meaningful patterns.

Need for Dimensionality Reduction:

Techniques like PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), or feature selection are often required before applying KNN.

Example

Suppose you want to classify images using pixel values as features.

A 28×28 grayscale image → 784 dimensions.

In such a high-dimensional space, KNN can struggle because points tend to appear almost equally far apart, and every prediction requires distance computations over all 784 features.

Reducing dimensions (e.g., using PCA) makes KNN more effective.

Summary

The curse of dimensionality means that in high-dimensional spaces, distances lose discriminative power, data becomes sparse, and computation increases. For KNN, which depends on distance-based similarity, this leads to poor accuracy, high variance, and inefficiency. Dimensionality reduction is often required to overcome this problem.

3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique used in machine learning and statistics.

It transforms the original set of correlated features into a new set of uncorrelated variables called principal components.

These components are ordered such that:

The first principal component captures the maximum variance in the data.

The second principal component captures the next highest variance (orthogonal to the first).

And so on...

By keeping only the top k principal components, PCA reduces dimensionality while retaining most of the important information (variance).

Steps in PCA

Standardize the data (so that large-scale features don’t dominate).

Compute the covariance matrix of the features.

Find eigenvalues and eigenvectors of this matrix.

Sort eigenvectors by eigenvalues (variance explained).

Project data onto the top k eigenvectors → reduced feature space.
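The five steps above can be sketched directly in NumPy. This is a teaching sketch, not scikit-learn's implementation (which uses SVD); the function name `pca_fit_transform` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_fit_transform(X, k):
    """PCA via eigen-decomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features (columns)
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue, largest first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Project the data onto the top k eigenvectors
    X_reduced = X_std @ eigenvectors[:, :k]
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return X_reduced, explained_ratio

rng = np.random.default_rng(42)
base = rng.normal(size=(200, 2))
# Five correlated features built from two underlying signals plus small noise
X_demo = np.column_stack([
    base[:, 0], 2 * base[:, 0], base[:, 1], -base[:, 1], base[:, 0] + base[:, 1]
]) + 0.05 * rng.normal(size=(200, 5))

X_2d, ratio = pca_fit_transform(X_demo, k=2)
print("Reduced shape:", X_2d.shape)
print("Explained variance ratio:", np.round(ratio, 3))
```

Because the five features are driven by only two signals, two components are enough to capture almost all of the variance.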

Difference Between PCA and Feature Selection

Aspect	PCA (Feature Extraction)	Feature Selection
Definition	Creates new features (principal components) as linear combinations of original features.	Chooses a subset of the existing original features.
Goal	Reduce dimensionality by capturing maximum variance.	Reduce dimensionality by keeping only the most relevant features.
Resulting Features	New, transformed features (not directly interpretable).	Original features remain intact (easy to interpret).
Method Type	Feature Extraction (transformation-based).	Feature Selection (filter, wrapper, or embedded methods).
Example	PCA on an image dataset produces new “axes” that capture overall shape/variance.	Selecting only “age” and “income” from a customer dataset.

Example

Suppose you have 100 features in a dataset:

Using PCA: You may reduce them to 10 principal components that explain 95% of the variance. These 10 are linear combinations of the original 100.

Using Feature Selection: You may directly choose 10 most informative features (e.g., “blood pressure,” “BMI,” “cholesterol level”).
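The contrast between the two approaches can be sketched with scikit-learn. As an assumption for brevity, this uses the 13-feature Wine dataset and keeps 3 dimensions rather than the 100 → 10 scenario above; `SelectKBest` with the ANOVA F-score (`f_classif`) stands in for "feature selection" and is only one of many possible selection methods.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature extraction: 3 new axes, each a weighted mix of all 13 original features
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: keep 3 of the original columns, ranked by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, wine.target)
selected_names = [
    name for name, keep in zip(wine.feature_names, selector.get_support()) if keep
]

print("PCA output shape:", X_pca.shape)
print("Weights of PC1 on the 13 features:", pca.components_[0].round(2))
print("Selected original features:", selected_names)
```

Both outputs have three columns, but only the selected features keep their original names and units; each principal component depends on every input feature.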

Summary

PCA is a feature extraction method that reduces dimensionality by creating new variables (principal components) that capture the maximum variance. In contrast, feature selection reduces dimensionality by keeping only the most relevant original features.

PCA improves efficiency but reduces interpretability.

Feature selection preserves interpretability but may miss hidden patterns.

4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

1. Eigenvalues and Eigenvectors – Basics

Eigenvectors:

Special non-zero vectors that stay on the same line when a linear transformation (matrix multiplication) is applied.

Only their magnitude changes (and their sign, if the eigenvalue is negative).

Eigenvalues:

The scalars (weights) that represent how much the eigenvector is stretched or compressed during transformation.

Each eigenvector has an associated eigenvalue.

Mathematically:

$$A v = \lambda v$$

Where:

$A$ = square matrix (e.g., covariance matrix in PCA)

$v$ = eigenvector

$\lambda$ = eigenvalue

2. Role of Eigenvalues and Eigenvectors in PCA

Covariance Matrix Computation

PCA starts by computing the covariance matrix of the dataset.

Eigen Decomposition

The covariance matrix is decomposed into eigenvalues and eigenvectors.

Eigenvectors → Principal Components

Each eigenvector defines a direction (axis) in the new feature space.

These are the principal components.

Eigenvalues → Variance Explained

Each eigenvalue tells how much variance of the data is captured along its corresponding eigenvector.

Larger eigenvalue = more important principal component.

Dimensionality Reduction

By selecting the top k eigenvectors (with highest eigenvalues), PCA reduces dimensionality while preserving most of the information.

3. Why They Are Important in PCA

Eigenvectors: Define the new coordinate system (principal components).

Eigenvalues: Tell us the importance (variance captured) of each principal component.

Without them, we cannot rank or select the most meaningful directions in the data.

4. Example (Conceptual)

Suppose we have 2D data (Height, Weight):

Eigenvector 1 (largest eigenvalue) → direction where height & weight vary most together (major axis of the ellipse).

Eigenvector 2 (smaller eigenvalue) → direction with little variation (minor axis).

PCA keeps only Eigenvector 1 → reducing from 2D → 1D but still preserving most information.
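The height/weight example can be checked numerically. The sketch below generates synthetic, correlated height and weight values (the numbers are invented for illustration), computes the covariance matrix, and confirms that the leading eigenvector satisfies Av = λv and carries most of the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
height = rng.normal(170, 10, size=300)                            # cm (synthetic)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 4, size=300)   # kg (synthetic)

data = np.column_stack([height, weight])
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)        # the matrix A in Av = λv

eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]       # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

v1, lam1 = eigenvectors[:, 0], eigenvalues[0]
print("A @ v1      :", cov @ v1)
print("lambda1 * v1:", lam1 * v1)           # same vector, confirming Av = λv
print("Share of variance on PC1:", lam1 / eigenvalues.sum())

projected_1d = centered @ v1                # 2D -> 1D projection onto PC1
```

The variance of `projected_1d` equals the largest eigenvalue, which is exactly the sense in which an eigenvalue measures "variance captured" along its eigenvector.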

5. Summary

Eigenvectors in PCA represent the new directions (principal components).

Eigenvalues represent how much variance is captured by each component.

PCA selects components with the largest eigenvalues to reduce dimensionality while retaining most information.

This improves efficiency, can reduce noise, and often (though not always) helps model performance.


 5: How do KNN and PCA complement each other when applied in a single
pipeline?

1. KNN (K-Nearest Neighbors) Recap

KNN is a distance-based algorithm.

It classifies/regresses a data point by looking at its nearest neighbors in the feature space.

Performance strongly depends on:

Distance metric (e.g., Euclidean distance)

Dimensionality of data

2. PCA (Principal Component Analysis) Recap

PCA is a dimensionality reduction technique.

It projects data onto fewer dimensions (principal components) while retaining most of the variance.

Reduces noise, redundancy, and computational cost.

3. The Curse of Dimensionality Problem in KNN

In high-dimensional data, distances between points become less meaningful.

All points tend to appear almost equally far apart → KNN performance deteriorates.

4. How PCA Helps KNN

Dimensionality Reduction:

PCA reduces the feature space by merging correlated (redundant) features into a smaller number of components.

This makes distance calculations in KNN more meaningful.

Noise Removal:

PCA discards low-variance directions (which often, but not always, correspond to noise), which can improve KNN’s accuracy.

Efficiency:

Lower dimensions → faster computation of distances in KNN.

Visualization & Interpretability:

PCA helps visualize data in 2D/3D space, making KNN’s decision boundaries easier to understand.

5. PCA + KNN Pipeline Example

Preprocessing: Standardize features.

PCA: Reduce dataset dimensions (e.g., from 100 → 20 components).

KNN: Train/test the KNN model on reduced feature space.
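The three stages above map naturally onto a scikit-learn `Pipeline`. The sketch below uses a synthetic 100-feature dataset from `make_classification` as a stand-in (an assumption, since no concrete dataset is named here); the step names and hyperparameters are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 100-feature dataset (assumption, not real data)
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10, n_redundant=30,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("scale", StandardScaler()),                    # 1. standardize features
    ("pca", PCA(n_components=20)),                  # 2. 100 -> 20 components
    ("knn", KNeighborsClassifier(n_neighbors=5)),   # 3. classify in reduced space
])
pipeline.fit(X_train, y_train)     # scaler and PCA are fit on training data only
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Wrapping the steps in one object means the scaler and PCA are always fit on training data and merely applied to test data, which avoids accidental leakage.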

6. Real-World Example

Face recognition:

Images have thousands of pixel features (high-dimensional).

PCA extracts key features (Eigenfaces).

KNN then classifies new images based on similarity in reduced space.

7. Summary

KNN suffers in high dimensions due to the curse of dimensionality.

PCA reduces dimensions, noise, and redundancy, making KNN more accurate and efficient.

Together, they form a powerful pipeline for high-dimensional data like images, text, or genomics.

Use the Wine Dataset from sklearn.datasets.load_wine().

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split data into train & test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# -----------------------
# Model 1: Without Scaling
# -----------------------
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

# -----------------------
# Model 2: With Feature Scaling
# -----------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# -----------------------
# Results
# -----------------------
print("KNN Accuracy without Scaling:", acc_no_scale)
print("KNN Accuracy with Scaling   :", acc_scaled)



Expected Output (may vary slightly)

Accuracy without scaling: ~0.72 – 0.75

Accuracy with scaling: ~0.95 – 0.98

📌 Explanation for Exam (20 marks)

KNN uses Euclidean distance by default, so features with larger numeric ranges (e.g., proline, in the hundreds to thousands, vs. alcohol %, roughly 11–15) dominate distance calculations.

Without scaling, KNN is biased towards high-range features → lower accuracy.

With scaling (StandardScaler), all features contribute equally → distances are meaningful → higher accuracy.
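The scale imbalance behind this explanation can be inspected directly. A short sketch that prints the standard deviation of every Wine feature before and after `StandardScaler` (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data

raw_std = X.std(axis=0)
scaled_std = StandardScaler().fit_transform(X).std(axis=0)

for name, before, after in zip(wine.feature_names, raw_std, scaled_std):
    print(f"{name:<30} std before: {before:10.3f}   std after: {after:.3f}")

largest = wine.feature_names[int(np.argmax(raw_std))]
print("Feature with the largest spread before scaling:", largest)
```

Before scaling, a single feature spans a range orders of magnitude wider than most others and therefore dominates the Euclidean distance; after scaling, every feature has unit standard deviation.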

7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Standardize features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA (keep all components)
pca = PCA(n_components=X.shape[1])
X_pca = pca.fit_transform(X_scaled)

# 4. Explained variance ratio
explained_variance = pca.explained_variance_ratio_

# 5. Print results
for i, var in enumerate(explained_variance, start=1):
    print(f"Principal Component {i}: {var:.4f}")

# Optional: Display in tabular form
df_variance = pd.DataFrame({
    "Principal Component": [f"PC{i}" for i in range(1, len(explained_variance)+1)],
    "Explained Variance Ratio": explained_variance
})
print("\nExplained Variance Ratios:\n", df_variance)

Principal Component 1: 0.3619
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0730
Principal Component 5: 0.0625
...



📌 Explanation for Exam

Explained Variance Ratio = proportion of dataset variance captured by each principal component.

For Wine dataset:

PC1 + PC2 explain ~55% variance.

PC1 + PC2 + PC3 explain ~66–70% variance.

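This means 13 features can be compressed to 2 or 3 components while retaining roughly 55–67% of the variance, which is enough for visualization; retaining 90–95% requires keeping more components.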
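The cumulative figures quoted above come from summing the explained variance ratios. A brief sketch of that calculation, using `np.cumsum` on the ratios of a PCA fitted to the standardized Wine data:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_scaled)                 # keep all 13 components

cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, total in enumerate(cumulative, start=1):
    print(f"First {i:2d} components: {total:.4f}")
```

Reading down the printed list shows how many components are needed to reach any chosen variance threshold.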

8 : Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

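# 2. Train-test split (split first so the scaler never sees the test data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Standardize features (fit on the training set only, then apply to both)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)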

# ----- Model 1: KNN on Original Data -----
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train, y_train)
y_pred_orig = knn_original.predict(X_test)
acc_original = accuracy_score(y_test, y_pred_orig)

# ----- Model 2: KNN on PCA-Reduced Data (top 2 PCs) -----
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# 4. Print Results
print("Accuracy on Original Dataset: {:.4f}".format(acc_original))
print("Accuracy on PCA (2 components) Dataset: {:.4f}".format(acc_pca))

Expected Output (may vary slightly)

Accuracy on Original Dataset: 0.98
Accuracy on PCA (2 components) Dataset: 0.87


📌 Explanation (20 Marks Answer)

Original KNN (all 13 features): Very high accuracy (~0.95–0.99).

KNN with PCA (2 PCs): Accuracy drops (~0.85–0.90), since we compress 13D info into 2D → some information lost.

Benefit of PCA:

Reduces computation cost.

Useful for visualization (2D plots).

Helps in removing noise & correlations.

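Trade-off: Lower accuracy in exchange for faster distance computations at prediction time and data that can be plotted in 2D. Note that principal components are harder to interpret than the original features, and KNN has no real training phase to speed up.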
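How much accuracy is given up depends on how many components are kept. The sketch below repeats the experiment for several component counts using the same split settings as above; the list of counts and the use of `make_pipeline` are illustrative choices, and no particular accuracy values are claimed.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

accuracy_by_components = {}
for n in (2, 3, 5, 8, 13):
    # Scaling and PCA are refit on the training data for every setting
    model = make_pipeline(
        StandardScaler(), PCA(n_components=n), KNeighborsClassifier(n_neighbors=5)
    )
    model.fit(X_train, y_train)
    accuracy_by_components[n] = model.score(X_test, y_test)

for n, acc in accuracy_by_components.items():
    print(f"{n:2d} components -> test accuracy {acc:.4f}")
```

With only about 54 test samples, a difference of one or two misclassified wines moves the accuracy by several points, so small gaps between settings should not be over-interpreted.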

9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset
wine = load_wine()
X, y = wine.data, wine.target

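# 2. Train-test split (split first so the scaler never sees the test data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Scale features (fit on the training set only, then apply to both)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. Train KNN with Euclidean distance (Minkowski with p=2)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# 5. Train KNN with Manhattan distance (Minkowski with p=1)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# 6. Print results
print("Accuracy with Euclidean Distance: {:.4f}".format(acc_euclidean))
print("Accuracy with Manhattan Distance: {:.4f}".format(acc_manhattan))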

Expected Output (may vary slightly)

Accuracy with Euclidean Distance: 0.9815
Accuracy with Manhattan Distance: 0.9630

📌 Explanation (20 Marks Style Answer)

Euclidean Distance (L2): Measures straight-line distance in feature space. Works well when features are continuous and scaled.

Manhattan Distance (L1): Measures distance along axes (like city blocks). Sometimes better when features have sparse representations or when differences along individual features are more important.

Observation:

On the Wine dataset with this split, Euclidean distance gives slightly higher accuracy, though the gap is small and may change with a different split.

But in high-dimensional or sparse text datasets, Manhattan can perform better.

Conclusion:
Choice of distance metric in KNN affects performance. It depends on the data distribution:

Use Euclidean when continuous, scaled features dominate.

Use Manhattan when features are sparse, categorical, or axis-aligned distances make more sense.
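Since a single train/test split can favor either metric by chance, one option is to compare K and the distance metric together under cross-validation. The sketch below does this with `GridSearchCV`; the parameter grid and fold settings are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__p": [1, 2],        # Minkowski order: 1 = Manhattan, 2 = Euclidean
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so the comparison between metrics is not distorted by leakage.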

10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

1. Use PCA to reduce dimensionality

Gene expression data typically has thousands of features (genes) but only a few hundred samples.

Directly training a model on such data leads to overfitting because the model memorizes noise.

Principal Component Analysis (PCA) projects the high-dimensional data onto a smaller set of principal components that capture the maximum variance in the data.

By using PCA, we keep the most informative patterns (gene expression variations) while discarding noise.

2. Decide how many components to keep

Plot the explained variance ratio vs. number of components (Scree Plot).

Choose the number of components that explains 90–95% of the variance.

Alternatively, apply the elbow rule to find where adding more components provides diminishing returns.

This balances dimensionality reduction with information retention.
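The variance-threshold rule can be applied automatically: passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the fewest components that reach that share of variance. The sketch below uses a synthetic matrix built from a few hidden signals plus noise as a stand-in for gene expression data (an assumption; real expression data will behave differently). In a real analysis the PCA would be fit on training data only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_genes, n_signals = 100, 2000, 5

# Synthetic "expression" matrix: a few hidden signals plus noise (assumption)
signals = rng.normal(size=(n_samples, n_signals))
loadings = rng.normal(size=(n_signals, n_genes))
X = signals @ loadings + 0.3 * rng.normal(size=(n_samples, n_genes))

X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) requests the fewest components reaching that variance share
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_)
print(f"Variance retained: {cumulative[-1]:.4f}")
```

Note that with fewer samples than features, PCA can produce at most as many components as there are samples, which already caps the dimensionality.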

3. Use KNN for classification after PCA

After reducing the dataset with PCA, train a K-Nearest Neighbors (KNN) Classifier.

Why KNN?

Works well on lower-dimensional, less noisy representations (which PCA helps to provide).

Non-parametric, making it suitable for biomedical data where the decision boundary is complex.

Distance metric: Euclidean distance is typically used on PCA-transformed features.

4. Evaluate the model

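Use stratified k-fold cross-validation (e.g., 5-fold) to ensure balanced representation of different cancer types in each fold. Fit the scaler and PCA inside each training fold (for example, by wrapping them with KNN in a single pipeline) so that no information from the held-out fold leaks into the model.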

Evaluation metrics:

Accuracy: overall performance.

Precision & Recall: crucial in biomedical applications to minimize false positives/negatives.

F1-score: balances precision and recall.

Confusion matrix: shows which cancer types are misclassified.
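The evaluation plan above can be sketched with `cross_validate` and a pipeline. The data here is a synthetic few-samples, many-features stand-in generated by `make_classification` (an assumption, not real gene expression data), and the component count, K, and metric list are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: few samples, many features, 3 classes (assumption)
X, y = make_classification(
    n_samples=150, n_features=2000, n_informative=30, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)

for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric:<16} mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Reporting the standard deviation across folds alongside the mean gives a sense of how stable the estimate is, which matters when each fold contains only a few dozen patients.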

5. Justify this pipeline to stakeholders

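Dimensionality reduction with PCA lowers the risk of overfitting, improves computational efficiency, and concentrates on the dominant patterns of variation in the expression data (these patterns are found without using the labels, so their biological relevance should be checked).

KNN on PCA components keeps the model simple and easy to explain: each prediction is based on the most similar previously diagnosed patients (important in medical settings).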

Cross-validation + robust metrics ensure reliable performance estimates, preventing misleading results due to small sample sizes.

Business/Clinical Value:

Helps doctors classify cancer subtypes faster.

Supports personalized treatment decisions.

Reduces misdiagnosis risk, improving patient outcomes.

🎯 Final Summary

By applying PCA for dimensionality reduction and KNN for classification, we build a robust, interpretable, and generalizable model for gene expression cancer classification. This pipeline reduces overfitting, enhances computational efficiency, and provides reliable diagnostic support for real-world biomedical data.