#Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
Answer:

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression. It works on the idea that similar data points are close to each other.

In KNN, when a new data point is given, the algorithm:

Calculates the distance between the new point and all training data points.

Selects the K closest neighbors.

Makes a prediction based on these neighbors.

In classification, the class with the majority vote among the K neighbors is assigned.

In regression, the average value of the K neighbors is used as the prediction.

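KNN is simple and intuitive, and it works well on small to medium-sized datasets. Because it stores the full training set and compares every new point against it, prediction becomes slow as the dataset grows.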
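
The three steps above can be sketched in a few lines of NumPy. This is only a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy arrays are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one query point with a brute-force nearest-neighbor search."""
    # Step 1: Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 3: majority vote (classification) or mean (regression).
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Toy data (hypothetical): two well-separated groups on a line.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

print(knn_predict(X_toy, y_class, np.array([2.5]), k=3))                     # 0
print(knn_predict(X_toy, y_value, np.array([2.5]), k=3, task="regression"))  # 2.0
```

For the query 2.5, the three nearest training points are 2, 3 and 1, so the vote gives class 0 and the average gives 2.0.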

#Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?
Answer:

The Curse of Dimensionality refers to problems that arise when the number of features (dimensions) becomes very large.

In high-dimensional data:

Distance between data points becomes less meaningful.

All points appear almost equally distant.

KNN struggles to find truly “nearest” neighbors.

This hurts KNN because it relies entirely on distance calculations. As the number of dimensions grows, accuracy tends to drop and each distance computation becomes more expensive.

Dimensionality reduction techniques such as PCA are often used to mitigate this problem.
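
The "all points appear almost equally distant" effect can be seen in a small experiment. The sketch below assumes uniformly random points in a unit hypercube, which is an artificial setting chosen only for illustration; `distance_contrast` is a hypothetical helper name. It compares the nearest and farthest distance from one query point as the number of dimensions grows.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Ratio of nearest to farthest distance from one query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

contrast = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, ratio in contrast.items():
    print(f"{d:>5} dimensions: nearest/farthest = {ratio:.3f}")
```

A ratio close to 0 means the nearest neighbor is much closer than the farthest point. A ratio close to 1 means the "nearest" neighbor is barely closer than anything else, which is the situation that makes KNN unreliable.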

#Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?
Answer:

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms original features into a smaller set of new features called principal components.

These components:

Are uncorrelated

Are ordered by the variance they capture: the first component captures the most, and each later one captures the most of what remains

Difference between PCA and Feature Selection:

| PCA | Feature Selection |
| --- | --- |
| Creates new features | Selects existing features |
| Uses transformation | No transformation |
| Reduces multicollinearity | Keeps original meaning |

PCA is most useful when features are highly correlated, because correlated features can be summarized by a few components with little loss of variance.
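
The difference can be made concrete on the Wine dataset. The sketch below uses `SelectKBest` with `f_classif` as one possible feature selection method (an assumption for illustration; the text does not name a specific method) and compares it with PCA, keeping three columns in each case.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 3 of the original columns, unchanged.
selector = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
kept = selector.get_support(indices=True)
print("Selected original features:", [data.feature_names[i] for i in kept])

# PCA: build 3 new columns, each a weighted mix of all 13 original features.
pca = PCA(n_components=3).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("Loadings shape (components x original features):", pca.components_.shape)

# Principal components are uncorrelated: off-diagonal correlations are ~0.
corr = np.corrcoef(X_pca, rowvar=False)
print(np.round(corr, 3))
```

The selected columns still carry their original names and meaning, while each PCA column is defined only by its loadings on all 13 features.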

#Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?
Answer:

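In PCA, eigenvectors and eigenvalues are computed from the covariance matrix of the data:

Eigenvectors represent the directions along which the data varies; the eigenvector with the largest eigenvalue is the direction of maximum variance.

Eigenvalues represent the amount of variance captured along each of those directions.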

Eigenvectors define the principal components, while eigenvalues tell us how important each component is.

Higher eigenvalue → more variance (information) captured
Lower eigenvalue → less variance captured, so a less useful component

They help decide which components to keep and which to discard.
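
As a rough check of this relationship, the eigenvalues and eigenvectors can be computed directly from the covariance matrix with NumPy and compared with scikit-learn's PCA. This sketch assumes the standardized Wine dataset used elsewhere in this notebook; eigenvectors are compared by absolute value because their sign is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13).
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # columns are the directions

# Each eigenvalue's share of the total is that component's variance ratio.
explained_ratio = eigenvalues / eigenvalues.sum()
print("Share of variance per component:", np.round(explained_ratio, 4))

# Cross-check against scikit-learn's PCA.
pca = PCA().fit(X_scaled)
print("Eigenvalues match:", np.allclose(eigenvalues, pca.explained_variance_))
print("Directions match: ", np.allclose(np.abs(eigenvectors.T),
                                        np.abs(pca.components_), atol=1e-6))
```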

#Question 5: How do KNN and PCA complement each other when applied in a single pipeline?
Answer:

PCA reduces the number of features and can remove noise, while KNN makes predictions based on distance.

When combined:

PCA reduces dimensionality → distances become more meaningful

KNN becomes faster and often more accurate

The risk of overfitting is reduced

This makes the PCA + KNN pipeline efficient for high-dimensional datasets.
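
One way to combine the two steps is scikit-learn's `make_pipeline`, sketched below on the Wine dataset. The choice of 2 components, 5 neighbors and 5-fold cross-validation is an assumption for illustration. Because the pipeline is refit inside every fold, the scaler and PCA only ever see training rows.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA are refit on the training part of every fold,
# so the held-out fold never influences the transformation.
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pca_knn, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```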


#Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# KNN without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc_without_scaling = accuracy_score(y_test, knn.predict(X_test))

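# Scaling: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN with scaling (refits the same estimator on the scaled data)
knn.fit(X_train_scaled, y_train)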
acc_with_scaling = accuracy_score(y_test, knn.predict(X_test_scaled))

print("Accuracy without scaling:", acc_without_scaling)
print("Accuracy with scaling:", acc_with_scaling)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
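
A likely reason for the gap between the two accuracies is that the Wine features live on very different scales, so the unscaled Euclidean distance is dominated by the features with the largest spread. The sketch below simply prints the standard deviation of each feature to make this visible.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
stds = data.data.std(axis=0)

# Print features from largest to smallest spread.
for idx in np.argsort(stds)[::-1]:
    print(f"{data.feature_names[idx]:<30} std = {stds[idx]:.2f}")

largest = data.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

After `StandardScaler`, every feature has unit standard deviation, so each one contributes comparably to the distance.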


#Question 7: Train a PCA model and print explained variance ratio

In [2]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Scale data
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA()
pca.fit(X_scaled)

print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
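
The individual ratios are easier to use once they are accumulated. Summing the values printed above, the first two components account for about 55% of the variance, and eight components are needed to pass 90%. A short sketch of that calculation, using `np.cumsum` and `np.searchsorted`:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
for n, total in enumerate(cumulative, start=1):
    print(f"{n:>2} components: {total:.4f}")

# Smallest number of components whose cumulative ratio reaches each threshold.
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components for 90%:", n_90)
print("Components for 95%:", n_95)
```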


#Question 8: Train KNN on PCA-transformed data (top 2 components)

In [3]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

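# Scaling
# Note: the scaler and PCA below are fit on the full dataset before the split,
# so the test rows influence the transformation and the reported accuracy may
# be slightly optimistic. Fitting them on X_train only avoids this.
X_scaled = StandardScaler().fit_transform(X)

# PCA (2 components)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)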

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42
)

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = accuracy_score(y_test, knn.predict(X_test))

print("Accuracy with PCA (2 components):", accuracy)


Accuracy with PCA (2 components): 0.9814814814814815


#Question 9: KNN with different distance metrics

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

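# Scale (fit on the full dataset before splitting; fitting on X_train only
# would keep the test rows out of the scaling statistics)
X_scaled = StandardScaler().fit_transform(X)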

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# Euclidean distance (n_neighbors is left at its default of 5)
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_euclidean.fit(X_train, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test))

# Manhattan
knn_manhattan = KNeighborsClassifier(metric='manhattan')
knn_manhattan.fit(X_train, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test))

print("Euclidean Accuracy:", acc_euclidean)
print("Manhattan Accuracy:", acc_manhattan)


Euclidean Accuracy: 0.9629629629629629
Manhattan Accuracy: 0.9629629629629629
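
The two metrics are both special cases of the Minkowski distance, which is also the default metric of `KNeighborsClassifier` (with `p=2`). A small hand-written sketch on two made-up points shows how they differ; the helper `minkowski` is defined here only for illustration.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print("Manhattan (p=1):", minkowski(a, b, 1))   # 7.0
print("Euclidean (p=2):", minkowski(a, b, 2))   # 5.0
```

Manhattan distance adds up the per-feature differences (3 + 4), while Euclidean distance measures the straight line between the points. The two metrics can rank neighbors differently, although on this test split they happened to give the same accuracy.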


#Question 10: High-dimensional Gene Expression Dataset (Conceptual)
Answer:

PCA for Dimensionality Reduction
PCA compresses thousands of gene features into a much smaller set of principal components while preserving most of the variance. This matters because such datasets typically have far more features than samples.

Choosing Number of Components
Use the explained variance ratio and keep the smallest number of components whose cumulative explained variance reaches about 90–95%.

KNN after PCA
Apply KNN on the reduced features, which can improve accuracy and reduce overfitting.

Evaluation
Use cross-validation and report accuracy, precision and recall, and ROC-AUC.

Business Justification

Reduces noise and overfitting

Improves prediction reliability

Makes the model computationally efficient; components can still be interpreted through the genes that load most heavily on them

This pipeline is a practical and reasonably robust approach to biomedical data analysis.
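
The steps above could be put together roughly as follows. No gene expression data is included in this notebook, so the sketch uses `make_classification` to generate a synthetic stand-in with few samples and many features; the sample count, feature count and number of neighbors are all assumptions for illustration. Passing a float to `PCA(n_components=0.95)` asks scikit-learn to keep enough components to explain about 95% of the variance.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=30,
    n_redundant=0, random_state=0,
)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),               # keep ~95% of the variance
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

print("Accuracy per fold:", results["test_accuracy"])
print("ROC-AUC per fold: ", results["test_roc_auc"])
```

Keeping the scaler and PCA inside the pipeline means they are refit on each training fold, so the cross-validated scores are not inflated by information from the held-out samples.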