Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

- K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks. It is a lazy learning method because it does not build an explicit model during training; instead, it stores the training dataset and defers all computation to prediction time.

- How KNN Works:

1. Choose a value of K (number of nearest neighbors to consider).

2. For a new data point:

- Calculate the distance (commonly Euclidean distance) between the new point and all training points.

- Select the K nearest neighbors (points with the smallest distances).

- Make a prediction based on these neighbors.

- KNN in Classification:

- Each of the K neighbors "votes" for its class.

- The class with the majority votes is assigned to the new data point.

- Example:
- If K = 5 and, among the 5 neighbors, 3 belong to Class A and 2 belong to Class B, then the new point is classified as Class A.

- KNN in Regression:

- Instead of voting, KNN takes the average (or weighted average) of the target values of the K nearest neighbors.

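- With a plain average, the predicted value is the mean of the neighbors’ values; with a weighted average, closer neighbors typically count more (for example, weights proportional to 1/distance).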

- Example:
- If K = 3 and the neighbors have values 10, 12, and 14, the predicted value = (10 + 12 + 14) / 3 = 12.
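
The steps above can be sketched in plain Python. This is a minimal illustration rather than how any library implements KNN: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for the example, and it assumes Euclidean distance and unweighted neighbors. The toy data is chosen to reproduce the two worked examples (3 votes for A vs. 2 for B, and the mean of 10, 12, 14).

```python
import math
from collections import Counter


def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query`."""
    # Sort (point, target) pairs by Euclidean distance to the query.
    pairs = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    return [target for _, target in pairs[:k]]


def knn_classify(train_X, train_y, query, k):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Unweighted mean of the k nearest neighbors' target values."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)


# Classification: the 5 nearest neighbors of 1.1 are three A's and two B's.
class_X = [(1.0,), (1.2,), (0.8,), (1.5,), (1.6,), (5.0,), (5.2,)]
class_y = ["A", "A", "A", "B", "B", "B", "B"]
print(knn_classify(class_X, class_y, (1.1,), k=5))  # A

# Regression: the 3 nearest neighbors of 2.0 have targets 12, 10 and 14.
reg_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
reg_y = [10.0, 12.0, 14.0, 50.0]
print(knn_regress(reg_X, reg_y, (2.0,), k=3))  # 12.0
```

Note that all the work happens inside the prediction functions; there is no separate training step, which is what "lazy learning" refers to.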

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

- The Curse of Dimensionality refers to the problems that occur when data has a very large number of features (dimensions). As the number of dimensions increases, data points become sparse and the concept of "closeness" (distance between points) becomes less meaningful.

- How it affects KNN:

1. Distance becomes less reliable:

- In high dimensions, the distances from a point to its nearest and farthest neighbors become almost the same.

- This makes it hard for KNN to identify truly "nearest" neighbors.

2. Increased computational cost:

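- More features = costlier distance calculations, since the time to compute each distance grows with the number of features.

- Because KNN compares the query against every training point, prediction becomes very slow for large, high-dimensional datasets.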

3. Risk of overfitting:

- With too many irrelevant features, KNN may give wrong predictions because noise dominates useful signals.

4. Need for more data:

- In higher dimensions, a huge amount of data is required to cover the feature space properly, otherwise KNN struggles.

- Example:

- Imagine you have 2D data (Height, Weight). Distances are easy to measure.

- If you add 100+ irrelevant features (like random numbers), then every point looks "far" from every other point, and KNN fails to find meaningful neighbors.
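
The loss of contrast between near and far points can be checked with a small simulation. This is a sketch under stated assumptions: the points are drawn uniformly at random from the unit hypercube, and the helper name `nearest_to_farthest_ratio` is invented for the example. A ratio close to 0 means the nearest point is much closer than the farthest one; a ratio close to 1 means all points are at nearly the same distance.

```python
import numpy as np


def nearest_to_farthest_ratio(n_dims, n_points=500, seed=0):
    """Ratio of the smallest to the largest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


for n_dims in (2, 10, 100, 1000):
    ratio = nearest_to_farthest_ratio(n_dims)
    print(f"{n_dims:>5} dimensions: nearest/farthest = {ratio:.3f}")
```

The ratio should climb toward 1 as the number of dimensions grows, which is the effect described in point 1 above.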

Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

- Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and statistics.

- It transforms the original features into a new set of uncorrelated features called principal components.

- These principal components are linear combinations of the original features and are ordered such that:

- The first component captures the maximum variance in the data.

- The second component captures the maximum remaining variance, and so on.


| **Aspect**           | **PCA (Dimensionality Reduction)**                                                | **Feature Selection**                                    |
| -------------------- | --------------------------------------------------------------------------------- | -------------------------------------------------------- |
| **Method**           | Creates new features (principal components) as combinations of original features. | Selects a subset of the original features.               |
| **Goal**             | Reduce dimensionality while keeping maximum variance.                             | Keep only the most relevant features.                    |
| **Interpretability** | New features are hard to interpret.                                               | Original features are kept, so easy to interpret.        |
| **Type**             | Feature transformation technique.                                                 | Feature elimination/selection technique.                 |
| **Use case**         | Useful when features are highly correlated.                                       | Useful when only some features contribute to prediction. |
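
The difference in the "Method" row can be seen directly in code. The sketch below uses the Wine dataset that appears in the later questions; choosing `SelectKBest` with `f_classif` as the feature selection method is an assumption made for illustration, since the text above does not name a specific selector.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# PCA: each new column is a linear combination of all 13 original features.
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("PCA loadings shape:", pca.components_.shape)  # (2, 13)

# Feature selection: the output is two of the original columns, left unchanged.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
X_selected = selector.transform(X_scaled)
kept_indices = selector.get_support(indices=True)
kept_names = [wine.feature_names[i] for i in kept_indices]
print("Selected original features:", kept_names)
```

Both outputs have two columns, but only the selected features keep their original names and meaning.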


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

- Eigenvalues and Eigenvectors (Basics):

- Eigenvectors of the data's covariance matrix give the directions of the principal components. They define the axes of the new feature space, and the leading eigenvector is the direction along which the data varies the most.

- Eigenvalues represent the amount of variance captured by each eigenvector (principal component). A higher eigenvalue means that direction captures more information from the data.

- Why are they important in PCA?

1. Identify Principal Components:

- Eigenvectors define the principal components (new transformed features).

- For example, PC1 = eigenvector with the largest eigenvalue.

2. Rank Components by Importance:

- Eigenvalues tell how much variance each component explains.

- We can select only the top k components with the highest eigenvalues to reduce dimensionality.

3. Data Compression:

- By keeping only components with large eigenvalues, we reduce features while retaining most of the useful information.

4. Noise Reduction:

- Components with very small eigenvalues contribute little variance (often noise), so they can be discarded.
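
The roles of eigenvalues and eigenvectors can be illustrated by doing PCA by hand with NumPy. This is a minimal sketch on made-up two-feature data (the second feature is a noisy multiple of the first), not a replacement for a library implementation, which typically uses an SVD instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: the second feature is strongly correlated with the first.
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=200)])

# Center the data and compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # re-sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i is the i-th principal direction

# Each eigenvalue's share of the total is that component's explained variance ratio.
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_variance_ratio)

# Dimensionality reduction: project onto the top eigenvector only (PC1).
projected = X_centered @ eigenvectors[:, :1]
print("Reduced shape:", projected.shape)
```

The variance of the projected data equals the largest eigenvalue, which is why ranking by eigenvalue ranks components by how much variance they keep.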

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

- K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) are often used together in machine learning pipelines because PCA addresses several of KNN’s weaknesses, while KNN provides a simple predictor on top of the compressed features.

- How they complement each other:

1. PCA reduces dimensionality → improves KNN performance:

- KNN suffers from the curse of dimensionality when data has many features.

- PCA reduces the number of dimensions while keeping most important information, making distance calculations in KNN more meaningful.

2. Noise reduction before KNN:

- PCA discards low-variance components (those with small eigenvalues), which often carry mostly noise.

- This helps KNN focus only on relevant patterns and avoid noisy distances.

3. Speed improvement:

- KNN requires computing distances from the test point to all training points.

- With fewer dimensions after PCA, these distance calculations become faster.

4. Better generalization:

- PCA can reduce overfitting by collapsing redundant, correlated features into fewer components.

- KNN then makes predictions based on a cleaner, compressed representation of the data.

- Example Pipeline:

1. Start with a dataset having 100 features.

2. Apply PCA → reduce to, say, 10 principal components (enough to retain about 95% of the variance).

3. Use KNN on these 10 components → predictions are faster and often more accurate, because distances are less affected by noise.
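
One way to wire these steps together is scikit-learn's `make_pipeline`, which chains scaling, PCA and KNN so that the scaler and PCA are fitted on the training data only. The sketch below is an illustration on the 13-feature Wine dataset rather than the 100-feature dataset of the example; `n_components=0.95` asks PCA to keep enough components to explain 95% of the variance.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale -> reduce dimensionality -> classify, all fitted in a single call.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)
pipeline.fit(X_train, y_train)

n_kept = pipeline.named_steps["pca"].n_components_
test_accuracy = pipeline.score(X_test, y_test)
print("Components kept:", n_kept)
print("Test accuracy:", test_accuracy)
```

Keeping the three steps in one object also means the same transformations are applied automatically at prediction time.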

Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.


In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = knn_scaling.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaling)

print("Accuracy without Scaling:", acc_no_scaling)
print("Accuracy with Scaling   :", acc_scaling)


Accuracy without Scaling: 0.7222222222222222
Accuracy with Scaling   : 0.9444444444444444
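
A likely reason for the gap is that the Wine features are measured on very different scales, so without scaling the Euclidean distance is dominated by whichever feature has the largest spread. The short check below is an added illustration (not part of the original cell) that lists the features with the largest standard deviations.

```python
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Rank features by standard deviation, largest first.
ranked = sorted(zip(wine.feature_names, stds), key=lambda pair: pair[1], reverse=True)
largest_name = ranked[0][0]

for name, std in ranked[:3]:
    print(f"{name:>20}: std = {std:.2f}")
print("Smallest std:", f"{ranked[-1][0]} = {ranked[-1][1]:.2f}")
```

After `StandardScaler`, every feature has unit variance, so each one contributes comparably to the distance.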


Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
X, y = wine.data, wine.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)


print("Explained Variance Ratio of each Principal Component:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio of each Principal Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
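
These ratios are easier to use once they are accumulated. The sketch below copies the printed values into a list and finds how many components are needed to reach 95% of the variance; the 95% threshold is a common convention assumed here, not something the question prescribes.

```python
from itertools import accumulate

# Explained variance ratios copied from the output above.
ratios = [0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
          0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
          0.00795215]

cumulative = list(accumulate(ratios))
for i, total in enumerate(cumulative, start=1):
    print(f"PC1..PC{i}: {total:.4f}")

# Smallest number of components whose cumulative ratio reaches 95%.
n_components_95 = next(i for i, total in enumerate(cumulative, start=1) if total >= 0.95)
print("Components needed for 95% variance:", n_components_95)  # 10
```

The first two components together explain only about 55% of the variance, while ten are needed to pass 95%.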


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.


In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy on Original Dataset (13 features):", acc_original)
print("Accuracy on PCA Dataset (2 components):", acc_pca)


Accuracy on Original Dataset (13 features): 0.9444444444444444
Accuracy on PCA Dataset (2 components): 1.0


Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.


In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)


knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)


print("Accuracy with Euclidean Distance:", acc_euclidean)
print("Accuracy with Manhattan Distance:", acc_manhattan)


Accuracy with Euclidean Distance: 0.9444444444444444
Accuracy with Manhattan Distance: 0.9444444444444444
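
For reference, the two metrics differ only in how per-feature differences are combined: Euclidean distance takes the square root of the sum of squared differences, while Manhattan distance sums the absolute differences. The helper functions below are written for illustration and are not scikit-learn's implementation.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

The two metrics can rank neighbors differently, but on this split they happen to give the same accuracy.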


Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

- In biomedical datasets (like gene expression), there are often thousands of features (genes) but only a few hundred samples (patients). This creates a high risk of overfitting. To address this, we use a PCA + KNN pipeline.

Step 1: Use PCA to Reduce Dimensionality

- PCA transforms the high-dimensional gene data into a smaller set of principal components.

- This reduces noise and correlation between features while retaining most of the variance (information).

Step 2: Decide How Many Components to Keep

- We look at the explained variance ratio of each component (for example, on a scree plot) and at its cumulative sum.

- We choose the smallest number of components that capture ~90–95% variance.

- This ensures a balance between information retention and model simplicity.

Step 3: Use KNN for Classification Post-PCA

- After dimensionality reduction, apply KNN.

- KNN works better in lower dimensions since distance metrics are more reliable.

Step 4: Evaluate the Model

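- Split the dataset into train and test sets, or use stratified cross-validation since samples are few. Fit the scaler and PCA on the training portion only so that no test information leaks into the model.

- Compute the accuracy score, along with a confusion matrix or F1-score, which are more informative when the cancer types are imbalanced.

- Compare performance with and without PCA.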

Step 5: Justify to Stakeholders

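- Biomedical data is noisy and high-dimensional → PCA discards low-variance directions, which are often noise.

- PCA components summarize the major patterns in gene activity, although each component mixes many genes and is harder to interpret than a single gene.

- KNN is simple, transparent, and non-parametric → a prediction can be explained by pointing to the most similar known patients.

- Evaluated on held-out data, this pipeline offers a robust solution with a lower risk of overfitting than a model trained on all genes.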

In [5]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# The Wine dataset is used here as a small stand-in for a gene expression dataset
wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Original features:", X.shape[1])
print("Reduced PCA components:", X_train_pca.shape[1])

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_pca, y_train)
y_pred = knn.predict(X_test_pca)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy after PCA + KNN:", accuracy)


Original features: 13
Reduced PCA components: 10
Accuracy after PCA + KNN: 0.9444444444444444
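
The Wine data has only 13 features, so it does not reproduce the "many features, few samples" setting of the question. The sketch below is a hedged variation: it uses `make_classification` to generate synthetic data (120 samples, 500 features, 3 classes, all chosen arbitrarily for illustration) and evaluates the same scale → PCA → KNN pipeline with stratified 5-fold cross-validation and macro F1. Because the pipeline is refitted inside each fold, the scaler and PCA never see the held-out samples.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
X, y = make_classification(
    n_samples=120,
    n_features=500,
    n_informative=20,
    n_classes=3,
    random_state=0,
)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components for 95% of the variance
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1_macro")

print("Macro F1 per fold:", scores.round(3))
print("Mean macro F1:", scores.mean().round(3))
```

The scores on synthetic data say nothing about real gene expression data; the point is the evaluation procedure, which reports a spread across folds instead of a single train/test number.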
