1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
- K-Nearest Neighbors (KNN) is a simple, non-parametric, lazy-learning supervised algorithm: it stores the training data instead of fitting an explicit model. To predict for a new data point, it finds the k training points closest to it under a distance metric (for example, Euclidean distance). For classification it predicts the majority label among those k neighbors; for regression it predicts the average of their target values.
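
The voting and averaging steps can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn implements KNN: the helper names (k_nearest, knn_classify, knn_regress) and the toy data are made up for this example, and the search is a brute-force scan over all training points.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to query (Euclidean)."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    distances.sort()
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean of the k nearest target values."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Toy data (made up for illustration): two well-separated clusters in 2-D
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

query = (1.8, 1.6)
predicted_label = knn_classify(points, labels, query, k=3)
predicted_value = knn_regress(points, values, query, k=3)
print(predicted_label, predicted_value)
```

The query sits inside the first cluster, so its three nearest neighbors all carry label "A" and the regression prediction is the mean of their three target values.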

2. What is the Curse of Dimensionality and how does it affect KNN
performance?
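- The Curse of Dimensionality refers to the exponential growth in volume that comes with adding extra dimensions (features) to data. A fixed number of samples becomes increasingly sparse, and distance metrics become less meaningful because the nearest and farthest points end up at almost the same distance. KNN relies entirely on distances, so in high dimensions its "nearest" neighbors are no longer reliably similar to the query point: accuracy drops, and each prediction also costs more because every distance involves more features. Typical remedies are feature selection or dimensionality reduction (such as PCA) before applying KNN.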
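
The loss of distance contrast can be observed with a small experiment. The sketch below is an illustration under assumptions of my own: uniformly random points in the unit hypercube, 500 points per dimension setting, a fixed seed, and a hypothetical helper relative_contrast that measures how much farther the farthest point is than the nearest one.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min over distances from one random query to random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, contrast in contrast_by_dim.items():
    print(f"dims={d:5d}  relative contrast={contrast:.3f}")
```

A large contrast means the nearest neighbor is much closer than the farthest point; as the number of dimensions grows, the contrast shrinks toward zero and "nearest" carries less information.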

3. What is Principal Component Analysis (PCA)? How is it different from
feature selection?
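- Principal Component Analysis is an unsupervised dimensionality reduction technique that transforms a large set of variables into a smaller set of uncorrelated variables called "principal components" while retaining as much of the data's variance as possible. It differs from feature selection in what it outputs: feature selection keeps a subset of the original features unchanged and discards the rest, whereas PCA builds new features, each a linear combination of all the original ones. Selected features therefore keep their original meaning, while principal components are harder to interpret.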
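
The contrast can be made concrete on the Wine dataset. The sketch below is only illustrative: SelectKBest with the f_classif score is one possible feature selection method chosen for this example, and keeping 2 features or 2 components is an arbitrary assumption.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature selection: keep 2 of the 13 original columns (supervised, uses y)
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)

# PCA: build 2 new columns, each a weighted mix of the 13 originals (unsupervised)
pca = PCA(n_components=2).fit(X_scaled)
print("Shape of PCA loadings (components x original features):", pca.components_.shape)
nonzero_loadings = (abs(pca.components_) > 1e-6).sum(axis=1)
print("Original features contributing to each component:", nonzero_loadings)
```

The selector returns names of existing columns, while each PCA component is described by a row of 13 loadings, one per original feature.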

4. What are eigenvalues and eigenvectors in PCA, and why are they
important?
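- In PCA, the eigenvectors of the data's covariance matrix represent the directions of maximum variance in the data (the principal components), while the eigenvalues are scalars indicating how much variance is captured along each of those directions. They are important because sorting the eigenvectors by eigenvalue ranks the components: the ones with the largest eigenvalues are kept, and the ratio of each eigenvalue to the sum of all eigenvalues gives the explained variance ratio used to decide how many components to retain.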
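
The relationship between eigenvalues and variance can be checked numerically. This is a minimal NumPy sketch on synthetic data; the mixing matrix, sample size, and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data (illustrative only), stretched mostly along one direction
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X_centered = X - X.mean(axis=0)

# Covariance matrix and its eigendecomposition (eigh is meant for symmetric matrices)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder to descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Projecting onto the first eigenvector gives a variance equal to the first eigenvalue
projected = X_centered @ eigenvectors[:, 0]
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_variance_ratio)
print("Variance of projection onto PC1:", projected.var(ddof=1))
```

The last printed value matches the first eigenvalue, which is the sense in which an eigenvalue "measures" the variance along its eigenvector.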

5. How do KNN and PCA complement each other when applied in a single
pipeline?
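- In a single pipeline, PCA acts as a preprocessing step for KNN: it compresses the features into a few uncorrelated components and discards low-variance directions that are often noise. KNN then computes distances in this smaller space, which lowers prediction cost and can make the distances more meaningful. Because PCA is unsupervised, the gain is not guaranteed: dropping components can also remove class-relevant information, so the number of components should be validated. Both steps are sensitive to feature scale, so the features should be standardized first.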
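
One way to combine the steps is scikit-learn's Pipeline, which refits the scaler and PCA inside each cross-validation fold so no test-fold information leaks into preprocessing. The sketch below is an illustration under assumptions: 5 components, 5 neighbors, and 5-fold cross-validation are arbitrary choices, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce dimensionality -> classify, fitted as a single estimator
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),          # assumed value, should be validated
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.4f}")
```

The number of components and k can be tuned together with GridSearchCV by addressing them as pca__n_components and knn__n_neighbors.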

In [None]:
#6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)

knn_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("--- KNN Classifier Accuracy Comparison ---")
print(f"Accuracy without feature scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with feature scaling:    {accuracy_scaled:.4f}")

if accuracy_scaled > accuracy_no_scaling:
    print("\nConclusion: Feature scaling significantly improved model accuracy.")
elif accuracy_no_scaling > accuracy_scaled:
    print("\nConclusion: Accuracy without scaling was higher (uncommon for KNN).")
else:
    print("\nConclusion: Feature scaling had no impact on the model accuracy in this case.")

--- KNN Classifier Accuracy Comparison ---
Accuracy without feature scaling: 0.7407
Accuracy with feature scaling:    0.9630

Conclusion: Feature scaling significantly improved model accuracy.


7. Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.
- Training a PCA model on the Wine dataset involves standardizing the 13 features, fitting PCA, and inspecting explained_variance_ratio_. The first few principal components carry most of the variance: in the output below, PC1 and PC2 together explain about 55%, the first three about 67%, and five components are needed to pass 80%.

In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
X = wine.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=None)
pca.fit(X_scaled)
print("Explained variance ratio of each principal component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i + 1}: {ratio:.4f}")
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print("\nCumulative explained variance:")
for i, cum_ratio in enumerate(cumulative_variance):
    print(f"PC{i + 1}: {cum_ratio:.4f}")

Explained variance ratio of each principal component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080

Cumulative explained variance:
PC1: 0.3620
PC2: 0.5541
PC3: 0.6653
PC4: 0.7360
PC5: 0.8016
PC6: 0.8510
PC7: 0.8934
PC8: 0.9202
PC9: 0.9424
PC10: 0.9617
PC11: 0.9791
PC12: 0.9920
PC13: 1.0000


In [None]:
#8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

# Same split as in question 6 so the accuracies are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only to avoid leaking test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline: KNN on all 13 scaled features
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
accuracy_original = accuracy_score(y_test, knn_original.predict(X_test_scaled))

# PCA fitted on the training split, keeping the top 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
accuracy_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print("--- KNN Accuracy: Original vs. PCA (2 components) ---")
print(f"Accuracy on original scaled features: {accuracy_original:.4f}")
print(f"Accuracy on top 2 PCA components:     {accuracy_pca:.4f}")


In [None]:
#9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

print("--- Results with Euclidean Distance (L2) ---")
print(f"Accuracy: {accuracy_euclidean:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_euclidean))
print("-" * 40)

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("--- Results with Manhattan Distance (L1) ---")
print(f"Accuracy: {accuracy_manhattan:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_manhattan))
print("-" * 40)

print("--- Comparison Summary ---")
print(f"Euclidean Distance Accuracy: {accuracy_euclidean:.4f}")
print(f"Manhattan Distance Accuracy: {accuracy_manhattan:.4f}")


--- Results with Euclidean Distance (L2) ---
Accuracy: 0.9630
Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        19
           1       1.00      0.90      0.95        21
           2       0.93      1.00      0.97        14

    accuracy                           0.96        54
   macro avg       0.96      0.97      0.96        54
weighted avg       0.97      0.96      0.96        54

----------------------------------------
--- Results with Manhattan Distance (L1) ---
Accuracy: 0.9630
Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        19
           1       1.00      0.90      0.95        21
           2       0.93      1.00      0.97        14

    accuracy                           0.96        54
   macro avg       0.96      0.97      0.96        54
weighted avg       0.97      0.96      0.96        54

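----------------------------------------
--- Comparison Summary ---
Euclidean Distance Accuracy: 0.9630
Manhattan Distance Accuracy: 0.9630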

In [None]:
#10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit.
