1. What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?
- K-Nearest Neighbors (KNN) is a simple, intuitive, and non-parametric supervised learning algorithm. Its core philosophy is "similarity": it assumes that similar data points exist in close proximity to each other.
How does KNN work?
1- Choose the value of K: Decide how many "neighbors" (nearby data points) the algorithm should look at.
2- Calculate Distance: When a new data point (query point) arrives, the algorithm calculates its distance from every single point in the training set.
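3- Find the K Nearest Neighbors: Sort the calculated distances and pick the K points with the smallest values.
4- Make the Prediction: For classification, assign the class that is most common among the K neighbors (majority vote). For regression, predict the average of the K neighbors' target values.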
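
The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how scikit-learn implements KNN. The function name `knn_predict`, its `task` parameter and the toy data are made up for this example. The last step uses the usual rule: majority vote for classification, mean of the neighbors' targets for regression.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Predict for one query point with brute-force KNN (illustrative helper)."""
    # Step 2: Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average the neighbors' target values
    return float(neighbor_targets.mean())


# Toy data: two well-separated groups of points
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.2, 1.4])
label = knn_predict(X_train, y_class, query, k=3)
value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(label)  # class shared by the three closest points
print(value)  # mean target of the three closest points
```

The same neighbors are used in both cases; only the way their targets are combined changes.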

2. What is the Curse of Dimensionality and how does it affect KNN
performance?
- The Curse of Dimensionality refers to a set of phenomena that occur when analyzing and organizing data in high-dimensional spaces (datasets with a large number of features) that do not happen in low-dimensional settings.

How does it affect KNN performance?
1- Distance Concentration (Distances Lose Meaning)
In high-dimensional space, the difference between the distance to the nearest neighbor and the distance to the farthest neighbor tends to become negligible.
2- Data Sparsity
To maintain the same density of data points as you add dimensions, you would need an exponential increase in the amount of data.
3- Increased Computational Complexity
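KNN is a "lazy learner": it does no real work at training time and computes distances to the training points at prediction time. Every added dimension makes each of those distance calculations more expensive, so prediction gets slower as the number of features grows.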
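
Distance concentration can be seen with a small experiment. The sketch below assumes uniformly random points in the unit cube, which is a simplification and not real data. The helper name `relative_contrast` is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, which means the "nearest" neighbor is barely closer than any other point.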

3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?
- Principal Component Analysis (PCA) is a statistical technique used to simplify complex datasets. It is one of the most widely used methods for dimensionality reduction: it transforms a large set of variables into a smaller one that still contains most of the variance of the original set.
How is PCA different from feature selection?
1- Fundamental Definition:-
Feature Selection: This is a process of choosing. You select a subset of the original variables and discard the rest. The features you keep remain exactly as they were.

PCA (Feature Extraction): This is a process of transformation. It combines the original variables into a set of entirely new, artificial variables (Principal Components).

2- Information Retention:-
Feature Selection: Information in the discarded features is completely lost. If you drop "Year of Birth" because it's redundant with "Age," that specific column is gone.

PCA: It attempts to compress information. Even if you reduce 10 variables down to 3 Principal Components, those 3 components still contain "bits" of information from all 10 original variables.

3- Mathematical Relationship:-
Feature Selection: Features are usually evaluated based on their relationship to the target variable (e.g., how well "Price" correlates with "Sales"), although some simple filters (such as dropping low-variance features) ignore the target.

PCA: Components are created based on variance within the features themselves. It is an unsupervised technique that doesn't care about the target variable; it only looks for where the most "spread" exists in the data.
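
The difference can be checked directly on the Wine dataset. This is a small sketch, assuming `SelectKBest` with the ANOVA F-score as the selection method and 3 features/components as an arbitrary choice; other selectors would work the same way.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keep the 3 original columns with the highest F-score (uses y)
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, y)
kept = selector.get_support(indices=True)
print("Selected original features:", [wine.feature_names[i] for i in kept])

# The selected columns are unchanged copies of the original columns
print(np.allclose(X_selected, X_scaled[:, kept]))

# PCA: build 3 new variables; y is never used
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Each component is a weighted mix of all 13 original features
print("Weights of PC1 on the original features:")
print(np.round(pca.components_[0], 2))
```

Both outputs have 3 columns, but only the selected ones can be read as original measurements.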

4. What are eigenvalues and eigenvectors in PCA, and why are they
important?
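- In Principal Component Analysis (PCA), eigenvalues and eigenvectors are computed from the covariance matrix of the features. Each eigenvector is a direction in feature space (a principal component), and its eigenvalue is the amount of variance the data has along that direction. They are the mathematical engines that allow the algorithm to reduce dimensionality while keeping as much of your data's "soul" (its variance) as possible.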
Why are they important:-
They allow PCA to perform three critical tasks:
Ranking Information: By sorting eigenvalues from largest to smallest, PCA automatically ranks your new features by how much "information" they provide.

Dimensionality Reduction: You can keep only the eigenvectors with the largest eigenvalues and discard the rest. This shrinks your dataset while retaining most of its variance.

Eliminating Correlation: Because the covariance matrix is symmetric, its eigenvectors are perpendicular (orthogonal) to each other. This means your new features (Principal Components) are uncorrelated, which removes the linear redundancy between the original features.
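
The three tasks above can be reproduced by hand with NumPy. This is a sketch on the standardized Wine data using `np.linalg.eigh`; scikit-learn's `PCA` uses an SVD internally, so the code below is a cross-check and not the library's own implementation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13, symmetric)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # ranking: largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # each column is one direction

# Share of the total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()
print(np.round(explained_ratio[:3], 4))

# Dimensionality reduction: project the data onto the top 2 directions
X_reduced = X_scaled @ eigenvectors[:, :2]

# The eigenvectors are orthonormal, so the projected features are uncorrelated
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(13)))
print(np.round(np.cov(X_reduced, rowvar=False), 4))

# Cross-check the eigenvalues against scikit-learn's PCA
pca = PCA().fit(X_scaled)
print(np.allclose(eigenvalues, pca.explained_variance_))
```

The off-diagonal entries of the last covariance matrix are zero (up to rounding), and its diagonal holds the two largest eigenvalues.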

5.  How do KNN and PCA complement each other when applied in a single
pipeline?
- When applied in a single machine learning pipeline, PCA and KNN act as a powerful duo where the strengths of one directly address the weaknesses of the other. PCA handles the data preparation and structural cleanup, while KNN focuses on the final decision-making.
Here is how they complement each other point-by-point:
1- Defeating the "Curse of Dimensionality"-
In high-dimensional space the data is sparse and distances lose contrast, so the neighbors KNN finds are less meaningful.

The Complement: PCA shrinks the "vast vacuum" of high-dimensional space into a much smaller, denser subspace. By reducing, for example, 100 features down to 5 principal components, PCA makes it more likely that the "nearest" neighbors KNN finds are truly similar.

2- Radical Speed Improvements-
KNN is computationally expensive because it must calculate the distance to every single point in the dataset for every prediction.

The Complement: PCA reduces the number of mathematical operations required for every distance calculation. Calculating distance across 3 dimensions (after PCA) is significantly faster than calculating it across 50 dimensions, making your model viable for larger datasets or real-time use.

3- Noise Reduction and Signal Boosting-
Raw data often contains "noisy" features that don't help with classification but still confuse KNN's distance calculations.

The Complement: PCA identifies the directions with the most variance (the signal) and discards the directions with very little variance (often the noise). This "cleans" the data before KNN ever sees it, often leading to higher accuracy.
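
In scikit-learn the two steps can be chained with `Pipeline`. The sketch below uses the Wine dataset with 2 components and K=5; both values are illustrative choices, not tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale -> compress -> classify, fitted as one object
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),                   # illustrative choice
    ("knn", KNeighborsClassifier(n_neighbors=5)),   # illustrative choice
])

# fit() learns the scaler and the PCA directions from the training rows only
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", test_accuracy)

# KNN now measures distances in 2 dimensions instead of 13
X_test_reduced = pipeline[:-1].transform(X_test)
print(X_test_reduced.shape)
```

Wrapping the steps in one object also keeps the test data out of the scaler and PCA fitting, which is easy to get wrong when the steps are run by hand.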



In [1]:
#6  Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score


wine = load_wine()
X = wine.data
y = wine.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


# Baseline: KNN on the raw features, whose scales differ widely
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)



# Standardize the features; fit the scaler on the training set only to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)

y_pred_scaling = knn_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)


print("Accuracy without feature scaling:", accuracy_no_scaling)
print("Accuracy with feature scaling   :", accuracy_scaling)



Accuracy without feature scaling: 0.7407407407407407
Accuracy with feature scaling   : 0.9629629629629629


In [2]:
#7 Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


wine = load_wine()
X = wine.data


# PCA is sensitive to feature scale, so standardize first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Keep all 13 components to see the full variance breakdown
pca = PCA()
X_pca = pca.fit_transform(X_scaled)


print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.4f}")


Explained Variance Ratio of each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080


In [3]:
#8 Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


wine = load_wine()
X = wine.data
y = wine.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


scaler_original = StandardScaler()
X_train_scaled = scaler_original.fit_transform(X_train)
X_test_scaled = scaler_original.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)

y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)


# Fit PCA on the scaled training data, then project both sets onto the top 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)


knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)

y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)


print("Accuracy on original dataset:", accuracy_original)
print("Accuracy on PCA-transformed dataset (2 components):", accuracy_pca)


Accuracy on original dataset: 0.9629629629629629
Accuracy on PCA-transformed dataset (2 components): 0.9814814814814815


In [4]:
#9  Train a KNN Classifier with different distance metrics (euclidean,manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


wine = load_wine()
X = wine.data
y = wine.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Euclidean (L2) distance: straight-line distance between two points
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)

y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)


# Manhattan (L1) distance: sum of absolute differences along each feature
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)

y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)


print("Accuracy with Euclidean distance:", accuracy_euclidean)
print("Accuracy with Manhattan distance:", accuracy_manhattan)


Accuracy with Euclidean distance: 0.9629629629629629
Accuracy with Manhattan distance: 0.9629629629629629


10. You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
(Include your Python code and output in the code box below.)

- 1- Using PCA to Reduce Dimensionality (Why & How):
Problem in gene expression data
-> Thousands of genes (features)
-> Very few patient samples
-> Leads to:
1- Overfitting
2- High variance
3- Poor generalization

Solution: PCA
-> PCA transforms correlated gene features into uncorrelated principal components
-> Keeps the directions with the most variance in the expression data
-> Reduces noise and redundancy
-> Makes distance-based models (like KNN) more reliable

2- Deciding How Many Components to Keep:
We choose components by:
-> Cumulative explained variance
-> Common rule of thumb:
Keep enough components to explain 90–95% of the variance
This usually preserves most of the signal while dropping low-variance noise
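
The cumulative-variance rule can be computed by hand. The sketch below uses the breast cancer dataset as a stand-in (an assumption, since it is not gene expression data) and fits on all rows for brevity; in a real analysis PCA should be fitted on the training data only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: 569 samples x 30 features (not real gene expression data)
X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then accumulate the explained variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95
# First position where the running total reaches the threshold (+1: counting from 1)
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(f"Components needed for {threshold:.0%} variance: {n_keep}")

# scikit-learn applies the same idea when n_components is a float between 0 and 1
pca_auto = PCA(n_components=threshold).fit(X_scaled)
print("Chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```

Plotting `cumulative` against the component number gives the usual "elbow" curve, which is an easy picture to show to stakeholders.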

3- Using KNN After PCA:
Why KNN?
-> Non-parametric (no strong assumptions)
-> Effective after dimensionality reduction
-> Works well once noise and redundancy are removed
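
The value of K still has to be chosen. One common option is a cross-validated grid search over the whole pipeline. The sketch below again uses the breast cancer dataset as a stand-in, and the candidate values in `param_grid` are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative; odd K avoids tied votes with two classes
# when all neighbors have equal weight
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 4))
```

Because the search runs on the pipeline, scaling and PCA are refitted inside each fold for every candidate K.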

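4- Model Evaluation Strategy:-
To ensure robustness:
-> Train/Test split (or stratified cross-validation, which is more stable with few samples)
-> Accuracy score
-> Confusion matrix and per-class precision/recall, since cancer types are often imbalanced
To avoid data leakage:
-> Fit the scaler and PCA on the training data only, then apply them to the test data
This gives:
-> A more reliable estimate of generalization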
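
A cross-validated evaluation with a confusion matrix could look like the sketch below. It uses the breast cancer dataset as a stand-in for gene expression data, and 5 folds and K=5 are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Scaler and PCA are refitted inside every fold, so held-out rows never influence them
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Each sample is predicted by a model that did not see it during fitting
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
print(cm)
print(classification_report(y, y_pred, target_names=data.target_names))
```

The confusion matrix shows which class is being missed, which a single accuracy number hides.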

5- Justification to Stakeholders (Medical / Research Teams):-
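Why this pipeline is robust for biomedical data:
-> Reduces overfitting in small-sample, high-dimensional data
-> Preserves most of the variance in the data
-> Allows patients to be visualized in 2 or 3 components, although each component mixes many genes and is harder to interpret than a single gene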
-> Computationally efficient
-> Widely used in genomics and clinical ML research


In [5]:
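# Code for question 10. The breast cancer dataset (30 features) is used here as a small stand-in for a high-dimensional gene expression dataset.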

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score




data = load_breast_cancer()
X = data.data
y = data.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# A float n_components keeps the fewest components whose cumulative explained variance exceeds 95%
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Original number of features:", X.shape[1])
print("Reduced number of features after PCA:", X_train_pca.shape[1])


knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)

y_pred = knn.predict(X_test_pca)


accuracy = accuracy_score(y_test, y_pred)
print("KNN Accuracy after PCA:", accuracy)


Original number of features: 30
Reduced number of features after PCA: 10
KNN Accuracy after PCA: 0.9649122807017544
