**Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

> K-Nearest Neighbors (KNN) is a supervised learning algorithm that predicts the label or value of a new data point from its "k" closest neighbors in the training data, where closeness is measured by a distance metric such as Euclidean distance. For classification, it assigns the new point to the most frequent class among its k neighbors; for regression, it predicts the average of the neighbors' values.
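
The voting and averaging steps can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy arrays are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point x_new from its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors (ties go to the first label seen)
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

x_new = np.array([1.1, 1.0])
label = knn_predict(X_train, y_class, x_new, k=3, task="classification")
value = knn_predict(X_train, y_value, x_new, k=3, task="regression")
print(label, value)  # 0 11.0
```

The query point sits inside the first cluster, so all three neighbors have class 0 and the regression prediction is the mean of 10, 12, and 11.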


**Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

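> As the number of features grows, the data become sparse and distances between points become less informative, so model performance degrades and the risk of overfitting and spurious correlations rises. KNN is hit especially hard because it relies entirely on distances: in high dimensions the nearest and farthest neighbors end up almost equally far away. Common remedies are feature selection and feature extraction (for example, PCA).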
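
One way to see the effect is to measure how much farther the farthest point is than the nearest one as the dimension grows. The sketch below uses uniformly random points, which is an assumption chosen for illustration; the helper `relative_contrast` is a made-up name.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# The contrast shrinks as features are added: neighbors become hard to tell apart
contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>4} features: relative contrast = {c:.2f}")
```

When the contrast is close to zero, "nearest" carries little information, which is why KNN tends to struggle on raw high-dimensional data.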


**Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

> PCA (Principal Component Analysis) is a dimensionality reduction technique used in data analysis and machine learning.
>
> It reduces the number of features in a dataset while keeping most of the important information.
>
> It transforms the original features into new features (principal components) that are uncorrelated with each other, and the first few components capture most of the variance in the original data.
>
> Feature selection, by contrast, keeps a subset of the original features and removes irrelevant or redundant ones, whereas PCA builds new features as linear combinations of all the original ones.
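
The difference can be seen side by side on the Wine dataset. The sketch below uses `SelectKBest` with the ANOVA F-score as one possible feature selection method; that choice, and keeping two features, are assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
feature_names = list(wine.feature_names)

# Feature selection: keeps 2 of the original 13 columns unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, wine.target)
kept = [feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept)

# PCA: builds 2 new columns, each a weighted mix of all 13 originals
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("PCA loadings shape:", pca.components_.shape)  # (2, 13)
```

Both results have two columns, but the selected columns are still named, measurable quantities, while each PCA column has a weight for every original feature.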


**Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**

> Eigenvectors are the new axes of the data that represent directions of maximum variance, while eigenvalues are the scalar values indicating the amount of variance captured by each corresponding eigenvector.
>
>They allow PCA to reduce data dimensionality by selecting the eigenvectors with the largest eigenvalues, effectively retaining the most significant patterns and information in the dataset.
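
A minimal NumPy sketch of this idea follows. It assumes PCA is computed from the eigendecomposition of the covariance matrix; the toy data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data: 200 samples, 3 features; the second is strongly correlated with the first
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     2.0 * x1 + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

# Center the data and form the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue's share of the total is that component's explained variance ratio
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_variance_ratio, 3))

# Project onto the top 2 eigenvectors (the first 2 principal components)
X_reduced = X_centered @ eigenvectors[:, :2]
```

The variance of each projected column equals the corresponding eigenvalue, which is why ranking eigenvectors by eigenvalue ranks directions by how much variance they capture.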

**Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

> KNN and PCA complement each other: PCA first reduces the number of features, and KNN then classifies using the reduced data. By discarding low-variance directions, which often carry noise, PCA can make KNN faster and frequently more accurate with fewer features.
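
A single scikit-learn `Pipeline` can chain the two steps so that scaling and PCA are fit on the training data only. The sketch below is one possible setup; the step names, two components, and `k=5` are assumptions for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale -> reduce with PCA -> classify with KNN, all in one estimator
pca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pca_knn.fit(X_train, y_train)
test_accuracy = pca_knn.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Calling `fit` on the pipeline fits each step in order, and `score` applies the already-fitted scaler and PCA to the test set before KNN predicts.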


**Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**

In [1]:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load the Wine dataset
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Case 1: Classifier WITHOUT feature scaling

knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)

# Make predictions and evaluate accuracy
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)


# Case 2: Classifier WITH feature scaling (StandardScaler)

scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the KNN classifier on scaled data
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

# Make predictions and evaluate accuracy
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy without scaling: {accuracy_unscaled:.2f}")
print(f"Accuracy with scaling: {accuracy_scaled:.2f}")

if accuracy_scaled > accuracy_unscaled:
    print("\nConclusion: Feature scaling significantly improved the model's accuracy.")
elif accuracy_scaled < accuracy_unscaled:
    print("\nConclusion: Scaling slightly decreased accuracy. This can sometimes occur, though it is less common for KNN.")
else:
    print("\nConclusion: Feature scaling had no effect on the model's accuracy.")

Accuracy without scaling: 0.72
Accuracy with scaling: 0.94

Conclusion: Feature scaling significantly improved the model's accuracy.


**Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**

In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Instantiate PCA and fit it to the scaled data
# You can choose the number of components, or let it choose all of them
pca = PCA()
pca.fit(X_scaled)
# Access the explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Print the ratio for each component
for i, ratio in enumerate(explained_variance):
    print(f"Principal Component {i+1}: {ratio:.2f}")

Principal Component 1: 0.36
Principal Component 2: 0.19
Principal Component 3: 0.11
Principal Component 4: 0.07
Principal Component 5: 0.07
Principal Component 6: 0.05
Principal Component 7: 0.04
Principal Component 8: 0.03
Principal Component 9: 0.02
Principal Component 10: 0.02
Principal Component 11: 0.02
Principal Component 12: 0.01
Principal Component 13: 0.01


**Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# 1. Load and Prepare Data
# Note: this cell uses the Iris dataset, not the Wine dataset from the earlier questions
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train KNN on Original Data
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train, y_train)
y_pred_original = knn_original.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)
print(f"Accuracy on original dataset: {accuracy_original:.2f}")

# 3. Perform PCA and Train KNN on Transformed Data
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"Accuracy on PCA-transformed dataset (2 components): {accuracy_pca:.2f}")

# 4. Compare Accuracies
if accuracy_pca > accuracy_original:
    print("KNN on PCA-transformed data achieved higher accuracy.")
elif accuracy_pca < accuracy_original:
    print("KNN on original data achieved higher accuracy.")
else:
    print("KNN on both original and PCA-transformed data achieved similar accuracy.")

Accuracy on original dataset: 1.00
Accuracy on PCA-transformed dataset (2 components): 1.00
KNN on both original and PCA-transformed data achieved similar accuracy.


**Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.**

In [4]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load the Wine dataset
wine_data = load_wine()
X = wine_data.data
y = wine_data.target

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train a KNN classifier with Euclidean distance

knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)

# 5. Train a KNN classifier with Manhattan distance

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)

# 6. Evaluate and compare the models
print("--- KNN with Euclidean Distance ---")
euclidean_accuracy = accuracy_score(y_test, y_pred_euclidean)
print(f"Accuracy: {euclidean_accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_euclidean, target_names=wine_data.target_names))

print("\n--- KNN with Manhattan Distance ---")
manhattan_accuracy = accuracy_score(y_test, y_pred_manhattan)
print(f"Accuracy: {manhattan_accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_manhattan, target_names=wine_data.target_names))



--- KNN with Euclidean Distance ---
Accuracy: 0.94

Classification Report:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        18
     class_1       1.00      0.86      0.92        21
     class_2       0.83      1.00      0.91        15

    accuracy                           0.94        54
   macro avg       0.94      0.95      0.94        54
weighted avg       0.95      0.94      0.94        54


--- KNN with Manhattan Distance ---
Accuracy: 0.98

Classification Report:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        18
     class_1       1.00      0.95      0.98        21
     class_2       0.94      1.00      0.97        15

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54



**Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:**
* Use PCA to reduce dimensionality
* Decide how many components to keep
* Use KNN for classification post-dimensionality reduction
* Evaluate the model
* Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data.

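>* Standardize the features, then apply PCA to project the data onto a small number of principal components that capture most of the variance.
>* Keep enough components to explain about 95% of the variance, checking the cumulative explained variance (or a scree plot) to confirm the choice.
>* Train KNN on the PCA-transformed data; it works well after PCA because noise and redundancy are reduced.
>* Evaluate with stratified k-fold cross-validation using accuracy, F1-score, and ROC-AUC.
>* PCA+KNN is simple to explain, reduces overfitting, and handles small-sample, high-dimensional biomedical data efficiently.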
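
The 95% rule can be checked by hand from the cumulative explained variance. The sketch below uses `make_classification` as a synthetic stand-in for gene expression data; the sample and feature counts are assumptions, not real patient data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: few samples, many features (hypothetical sizes)
X, y = make_classification(
    n_samples=100, n_features=500, n_informative=20, n_classes=3, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Components needed for 95% variance: {n_keep} of {len(cumulative)}")
```

Plotting `cumulative` against the component index gives the usual curve for judging whether 95% is a sensible cutoff or whether far fewer components already capture most of the structure.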



In [5]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, make_scorer

# Assumes X (gene expression features) and y (cancer types) are already defined,
# e.g. X, y = load_your_data()

# 1. Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Apply PCA
pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)

# 3. KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# 4. Evaluate with stratified k-fold CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_acc = cross_val_score(knn, X_pca, y, cv=cv, scoring='accuracy')
scores_f1 = cross_val_score(knn, X_pca, y, cv=cv, scoring='f1_weighted')

print(f"Mean Accuracy: {scores_acc.mean():.3f}")
print(f"Mean F1-score: {scores_f1.mean():.3f}")
print(f"Number of PCA Components: {pca.n_components_}")

# 5. Fit final model on all data
knn.fit(X_pca, y)


Mean Accuracy: 0.966
Mean F1-score: 0.966
Number of PCA Components: 10
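
The cell above fits the scaler and PCA on all samples before cross-validation, so each validation fold influences the preprocessing. A possible variant wraps all three steps in a pipeline so they are refit inside every fold. The sketch below again uses synthetic data from `make_classification` as a stand-in (an assumption), and adds one-vs-rest ROC-AUC to the metrics.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data (assumption, not real patient data)
X, y = make_classification(
    n_samples=120, n_features=300, n_informative=15, n_classes=3, random_state=0
)

# Scaler and PCA are refit on the training part of every fold
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    KNeighborsClassifier(n_neighbors=5),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "f1_weighted", "roc_auc_ovr"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    scores = results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives stakeholders a sense of how stable the estimate is, which matters when the number of samples is small.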
