Theory Questions:

1. What is K-Nearest Neighbors (KNN) and how does it work?

KNN is a supervised machine learning algorithm used for classification and regression. It works by storing the training data and predicting the output based on the majority label (for classification) or average (for regression) of the K closest data points (neighbors) in the feature space, using a distance metric like Euclidean distance.
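
The prediction step described above can be sketched from scratch. This is a minimal illustration assuming NumPy arrays and Euclidean distance; the function name knn_predict and the toy data are hypothetical and not part of any library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the neighbours' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbours' targets
    return float(np.mean(y_train[nearest]))

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
value = knn_predict(X_toy, y_toy.astype(float), np.array([5.0, 5.0]),
                    k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", value)
```

Note that there is no training step beyond keeping X_train and y_train around; all the work happens at query time.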


2. What is the difference between KNN Classification and KNN Regression?

KNN Classification predicts the class label based on majority voting among the K nearest neighbors.
KNN Regression predicts a continuous value by averaging the outputs of the K nearest neighbors.




3. What is the role of the distance metric in KNN?

The distance metric determines how "close" data points are. Common metrics:
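Euclidean Distance (straight-line distance; the usual default for continuous data)
Manhattan Distance (sum of absolute differences along each feature)
Minkowski Distance (generalizes both: p = 1 gives Manhattan, p = 2 gives Euclidean)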

The accuracy of KNN heavily depends on the choice of an appropriate distance metric.
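
The three metrics are closely related: Minkowski distance of order p reduces to Manhattan distance at p = 1 and to Euclidean distance at p = 2. A small sketch, assuming NumPy; the helper name minkowski_distance is illustrative.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski_distance(a, b, p=1)   # sum of absolute differences
euclidean = minkowski_distance(a, b, p=2)   # straight-line distance
print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
```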

4. What is the Curse of Dimensionality in KNN?

As the number of dimensions (features) increases, the data becomes sparse, and all points start appearing equidistant. This degrades the performance of KNN, as distance metrics lose their effectiveness.
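
One way to see this effect is to measure how much farther the farthest point is than the nearest one. The sketch below assumes uniformly random points in the unit hypercube; the helper name relative_contrast and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(max - min) / min of distances from one random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return float((d.max() - d.min()) / d.min())

contrast_low = relative_contrast(2)
contrast_high = relative_contrast(1000)
print("Relative contrast in 2 dimensions:", contrast_low)
print("Relative contrast in 1000 dimensions:", contrast_high)
```

A contrast close to zero means the nearest and farthest points are almost equally far away, so "nearest neighbor" carries little information.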

5. How can we choose the best value of K in KNN?

Use cross-validation to test various values of K.
A small K can be noisy and overfit.
A large K can smooth out predictions but may underfit.
For binary classification, an odd K avoids tied votes; with more than two classes, ties remain possible even when K is odd.
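
A minimal sketch of choosing K by cross-validation, assuming the Iris dataset and 5-fold CV. The candidate range of K and the variable names are illustrative choices, and scaling is placed inside the pipeline so each fold is scaled using only its own training part.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 16, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 folds for this value of K
    cv_scores[k] = cross_val_score(model, X_cv, y_cv, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Mean CV accuracy per K:", cv_scores)
print("Best K:", best_k)
```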

6. What are KD Tree and Ball Tree in KNN?

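KD Tree: A space-partitioning binary tree that organizes points in k-dimensional space using axis-aligned splits (here k is the number of dimensions, not the number of neighbors).
Ball Tree: A binary tree where each node represents an n-dimensional ball (hypersphere) enclosing a subset of the points. It does not rely on axis-aligned splits and tends to cope better with higher-dimensional data.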

7. When should you use KD Tree vs. Ball Tree?

Use KD Tree for low-dimensional data (usually < 20 dimensions).
Use Ball Tree for higher-dimensional data or when KD Tree performance degrades. In very high dimensions, both structures may be no faster than brute-force search.
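
Both structures are exact search indexes, so they return the same neighbors and differ only in speed and memory. A short sketch using scikit-learn's KDTree and BallTree classes on hypothetical random data:

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))      # hypothetical low-dimensional data
query = points[:1]                  # query with the first stored point

kd = KDTree(points, leaf_size=30)
ball = BallTree(points, leaf_size=30)

# Each query returns (distances, indices), sorted by distance
kd_dist, kd_idx = kd.query(query, k=5)
ball_dist, ball_idx = ball.query(query, k=5)

print("KD Tree neighbours:  ", kd_idx[0])
print("Ball Tree neighbours:", ball_idx[0])
```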

8. What are the disadvantages of KNN?

Computationally expensive at prediction time.

Sensitive to irrelevant features and feature scaling.

Poor performance in high dimensions.

Doesn’t work well with missing values.


9. How does feature scaling affect KNN?

Feature scaling (e.g., normalization or standardization) is critical for KNN. Since it relies on distance, features with larger scales can dominate the distance metric and skew results.
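
A small illustration of how an unscaled feature can dominate the distance. The data are hypothetical (age in years, income in dollars) and the helper nearest_to_first is only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is age in years, column 1 is income in dollars
people = np.array([[25.0,  50_000.0],
                   [60.0,  51_000.0],   # very different age, similar income
                   [27.0,  56_000.0],   # similar age, somewhat different income
                   [40.0, 120_000.0],
                   [35.0,  20_000.0]])

def nearest_to_first(X):
    """Row index of the point closest to row 0 under Euclidean distance."""
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(d)) + 1

raw_neighbor = nearest_to_first(people)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(people))
print("Nearest to person 0 on raw features:   ", raw_neighbor)
print("Nearest to person 0 on scaled features:", scaled_neighbor)
```

On the raw features the income column decides the neighbor almost alone; after standardization both columns contribute and the nearest neighbor changes.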

10. What is PCA (Principal Component Analysis)?

PCA is a dimensionality reduction technique that transforms the original features into a smaller set of uncorrelated variables (principal components), capturing the maximum variance.

11. How does PCA work?

Standardize the data.

Compute the covariance matrix.

Find eigenvectors and eigenvalues.

Sort eigenvectors by descending eigenvalues.

Select the top k eigenvectors to form the new feature space.
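
The five steps above can be sketched with NumPy and checked against scikit-learn. This is an illustrative sketch on the Iris data, not how scikit-learn implements PCA internally; variable names such as W and X_projected are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: standardize
X_std = StandardScaler().fit_transform(load_iris().data)

# Step 2: covariance matrix (features are columns)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigen-decomposition; eigh is meant for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: sort by descending eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 5: keep the top-k eigenvectors and project the data onto them
k = 2
W = eigenvectors[:, :k]
X_projected = X_std @ W

manual_ratio = eigenvalues[:k] / eigenvalues.sum()
sklearn_ratio = PCA(n_components=k).fit(X_std).explained_variance_ratio_
print("Manual explained variance ratio: ", manual_ratio)
print("sklearn explained variance ratio:", sklearn_ratio)
```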



12. What is the geometric intuition behind PCA?

PCA finds the directions (axes) in the feature space where the data varies the most and projects the data onto those directions to reduce dimensions while retaining maximum variance.

13. What is the difference between Feature Selection and Feature Extraction?

Feature Selection: Selects a subset of original features (e.g., removing irrelevant features).
Feature Extraction: Transforms data into a new feature space (e.g., PCA).

14. What are Eigenvalues and Eigenvectors in PCA?

Eigenvectors of the covariance matrix define the directions of the principal components; the one with the largest eigenvalue is the direction of maximum variance.
Eigenvalues represent the magnitude of variance in each direction (importance of each component).

15. How do you decide the number of components to keep in PCA?

Use:
Scree plot (elbow method)
Cumulative explained variance (e.g., choose components that explain 95% of the variance)
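
A sketch of the cumulative-variance rule on the standardized Wine data. The 95% threshold is the example value from above; scikit-learn's PCA also accepts a float between 0 and 1 as n_components and then picks the number of components itself.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine_std = StandardScaler().fit_transform(load_wine().data)

# Manual route: smallest number of components whose cumulative ratio reaches 95%
cumulative = np.cumsum(PCA().fit(X_wine_std).explained_variance_ratio_)
n_manual = int(np.searchsorted(cumulative, 0.95) + 1)

# Shortcut: a float in (0, 1) asks scikit-learn to choose the count
pca_95 = PCA(n_components=0.95).fit(X_wine_std)

print("Components for 95% variance (manual): ", n_manual)
print("Components for 95% variance (sklearn):", pca_95.n_components_)
```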

16. Can PCA be used for classification?

Yes, but only as a preprocessing step: PCA reduces dimensionality, which can speed up a downstream classifier and sometimes improve its accuracy. PCA itself is not a classification algorithm.

17. What are the limitations of PCA?

Assumes linear relationships.

Components may not be interpretable.

Sensitive to scaling.

Doesn’t consider class labels (unsupervised).


18. How do KNN and PCA complement each other?

PCA reduces dimensions and noise, which helps improve KNN performance by mitigating the curse of dimensionality and making distance metrics more meaningful.

19. How does KNN handle missing values in a dataset?

KNN doesn’t inherently handle missing values. You need to preprocess the data:
Impute missing values (mean, median, or using KNN imputation).
Remove rows/columns with missing data.


20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?

PCA is unsupervised and does not use class labels; LDA is supervised and requires them.
PCA maximizes variance; LDA maximizes class separability (between-class variance relative to within-class variance).
PCA creates principal components; LDA creates discriminant components, at most (number of classes - 1) of them.
PCA is a general-purpose feature extraction method; LDA is suited to dimensionality reduction ahead of classification.
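
The difference in how the two are called makes the supervised/unsupervised split concrete. A minimal sketch on Iris, assuming scikit-learn's PCA and LinearDiscriminantAnalysis:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)

# PCA ignores the labels entirely
X_pca_2d = PCA(n_components=2).fit_transform(X_iris)

# LDA needs the labels; with 3 classes it can produce at most 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda_2d = lda.fit_transform(X_iris, y_iris)

print("PCA output shape:", X_pca_2d.shape)
print("LDA output shape:", X_lda_2d.shape)
```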






Practical Questions:

In [None]:
#21. Train a KNN Classifier on the Iris dataset and print model accuracy

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))






In [None]:
#22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)

from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

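# Separate names keep the Iris split from #21 available to the classifier cells that follow
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=20, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_reg_train, y_reg_train)
y_reg_pred = knn_reg.predict(X_reg_test)

print("Mean Squared Error:", mean_squared_error(y_reg_test, y_reg_pred))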


In [None]:
#23. Train a KNN Classifier using different distance metrics and compare accuracy

knn_euclidean = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')

knn_euclidean.fit(X_train, y_train)
knn_manhattan.fit(X_train, y_train)

acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test))
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test))

print("Euclidean Accuracy:", acc_euclidean)
print("Manhattan Accuracy:", acc_manhattan)


In [None]:
#24. Train a KNN Classifier with different values of K and visualize decision boundaries

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from mlxtend.plotting import plot_decision_regions

X, y = make_classification(n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=42)

for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    plt.figure()
    plot_decision_regions(X, y, clf=knn, legend=2)
    plt.title(f"K = {k}")
    plt.show()


In [None]:
#25. Apply Feature Scaling before training a KNN model and compare results

from sklearn.preprocessing import StandardScaler

# Without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=3)
knn_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print("Unscaled Accuracy:", acc_unscaled)
print("Scaled Accuracy:", acc_scaled)


In [None]:
#26. Train a PCA model on synthetic data and print explained variance ratio

from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=5, random_state=42)
pca = PCA(n_components=5)
pca.fit(X)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)


In [None]:
#27. Apply PCA before training a KNN Classifier and compare accuracy

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print("Accuracy without PCA:", acc_scaled)
print("Accuracy with PCA:", acc_pca)


In [None]:
#28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'metric': ['euclidean', 'manhattan']
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)


In [None]:
#29. Train a KNN Classifier and check the number of misclassified samples

y_pred = knn_scaled.predict(X_test_scaled)
misclassified = (y_test != y_pred).sum()

print("Number of Misclassified Samples:", misclassified)


In [None]:
#30. Train a PCA model and visualize the cumulative explained variance

import numpy as np

pca = PCA()
pca.fit(X_train_scaled)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative_variance)+1), cumulative_variance, marker='o')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Cumulative Explained Variance")
plt.grid(True)
plt.show()


In [None]:
#31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy.
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')

knn_uniform.fit(X_train_scaled, y_train)
knn_distance.fit(X_train_scaled, y_train)

acc_uniform = accuracy_score(y_test, knn_uniform.predict(X_test_scaled))
acc_distance = accuracy_score(y_test, knn_distance.predict(X_test_scaled))

print("Uniform Weights Accuracy:", acc_uniform)
print("Distance Weights Accuracy:", acc_distance)


In [None]:
#32. Train a KNN Regressor and analyze the effect of different K values on performance.
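# Regenerate the synthetic regression data so the regressor is not fit on classification labels
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=20, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

for k in [1, 3, 5, 7, 9]:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_reg_train, y_reg_train)
    mse = mean_squared_error(y_reg_test, model.predict(X_reg_test))
    print(f"K={k}, MSE={mse}")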


In [None]:
#33. Implement KNN Imputation for handling missing values in a dataset.
from sklearn.impute import KNNImputer
import numpy as np

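# Blank out roughly 10% of individual cells; wiping entire rows would leave those rows
# with no observed features from which to measure neighbor distances
rng = np.random.default_rng(42)
X_missing = X.copy()
X_missing[rng.random(X_missing.shape) < 0.1] = np.nan
imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X_missing)

print("Missing values before imputation:", int(np.isnan(X_missing).sum()))
print("Missing values after imputation:", int(np.isnan(X_imputed).sum()))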


In [None]:
#34. Train a PCA model and visualize the data projection onto the first two principal components.
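# Standardize the full Iris feature matrix before projecting it
X_scaled = StandardScaler().fit_transform(iris.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis')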
plt.title("PCA Projection (First 2 Components)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar()
plt.show()


In [None]:
#35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance.
for algo in ['kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(algorithm=algo)
    knn.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test_scaled))
    print(f"{algo.upper()} Accuracy: {acc}")


In [None]:
#36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot.
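X_hd = make_classification(n_samples=500, n_features=50, random_state=42)[0]
pca_hd = PCA().fit(X_hd)
# A scree plot shows the variance of each component individually, not the running total
components = range(1, len(pca_hd.explained_variance_ratio_) + 1)
plt.plot(components, pca_hd.explained_variance_ratio_, marker='o')
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")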
plt.grid(True)
plt.show()


In [None]:
#37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score.
from sklearn.metrics import classification_report

knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred))


In [None]:
#38. Train a PCA model and analyze the effect of different numbers of components on accuracy.
for n in [1, 2, 3, 4]:
    # Fit PCA once on the training data, then reuse it to transform the test data
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)
    knn = KNeighborsClassifier()
    knn.fit(X_train_pca, y_train)
    print(f"Components: {n}, Accuracy: {accuracy_score(y_test, knn.predict(X_test_pca))}")


In [None]:
#39. Train a KNN Classifier with different leaf_size values and compare accuracy.
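# leaf_size only tunes the tree's speed and memory use, so accuracy is expected to be identical;
# a tree algorithm is set explicitly because the default 'auto' may choose brute force
for size in [10, 20, 30, 50]:
    knn = KNeighborsClassifier(algorithm='kd_tree', leaf_size=size)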
    knn.fit(X_train_scaled, y_train)
    print(f"Leaf Size: {size}, Accuracy: {accuracy_score(y_test, knn.predict(X_test_scaled))}")


In [None]:
#40. Train a PCA model and visualize how data points are transformed before and after PCA.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
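# Standardize the full Iris feature matrix; the left panel shows its first two features
X_scaled = StandardScaler().fit_transform(iris.data)
ax1.scatter(X_scaled[:, 0], X_scaled[:, 1], c=iris.target)
ax1.set_title("Original Features (first two, standardized)")
X_pca = PCA(n_components=2).fit_transform(X_scaled)
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target)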
ax2.set_title("PCA Transformed")
plt.show()


In [None]:
#41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report.
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
print(classification_report(y_test, knn.predict(X_test_scaled)))


In [None]:
#42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error.
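# Synthetic multi-feature regression data (with a single feature the two metrics would coincide)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=20, random_state=42)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn.fit(X_reg_train, y_reg_train)
    mse = mean_squared_error(y_reg_test, knn.predict(X_reg_test))
    print(f"{metric.capitalize()} MSE: {mse}")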


In [None]:
#43. Train a KNN Classifier and evaluate using ROC-AUC score.
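from sklearn.metrics import roc_auc_score

# Fit a dedicated classifier on the scaled Wine split; predict_proba exists only on classifiers
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_scaled, y_train)
y_score = knn_clf.predict_proba(X_test_scaled)  # shape: (n_samples, n_classes)

# With multi_class='ovr', roc_auc_score accepts the integer class labels directly
print("ROC-AUC Score:", roc_auc_score(y_test, y_score, multi_class='ovr'))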


In [None]:
#44. Train a PCA model and visualize the variance captured by each principal component.
X_scaled = StandardScaler().fit_transform(X)  # standardize the full Wine feature matrix
pca = PCA()
pca.fit(X_scaled)
plt.bar(range(1, len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_)
plt.title("Variance by Principal Component")
plt.xlabel("Component")
plt.ylabel("Variance Ratio")
plt.show()


In [None]:
#45. Train a KNN Classifier and perform feature selection before training.
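from sklearn.feature_selection import SelectKBest, f_classif

# Split first, then fit the scaler and selector on the training data only to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

selector = SelectKBest(score_func=f_classif, k=2)
X_train_sel = selector.fit_transform(X_train_scaled, y_train)
X_test_sel = selector.transform(X_test_scaled)

knn = KNeighborsClassifier()
knn.fit(X_train_sel, y_train)

print("Accuracy with Selected Features:", accuracy_score(y_test, knn.predict(X_test_sel)))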


In [None]:
#46. Train a PCA model and visualize the data reconstruction error after reducing dimensions.
X_scaled = StandardScaler().fit_transform(X)  # standardize the full Wine feature matrix
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_error = np.mean((X_scaled - X_reconstructed)**2)
print("Reconstruction Error:", reconstruction_error)


In [None]:
#47. Train a KNN Classifier and visualize the decision boundary.
from mlxtend.plotting import plot_decision_regions

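# Standardize the Wine features and keep the first two so the boundary can be drawn in 2-D
X_scaled = StandardScaler().fit_transform(X)
X_2d, y_2d = X_scaled[:, :2], y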
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_2d, y_2d)

plt.figure()
plot_decision_regions(X_2d, y_2d, clf=knn, legend=2)
plt.title("KNN Decision Boundary")
plt.show()


In [None]:
#48. Train a PCA model and analyze the effect of different numbers of components on data variance.
X_scaled = StandardScaler().fit_transform(X)  # standardize the full Wine feature matrix
pca = PCA().fit(X_scaled)
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.title("Cumulative Variance by PCA Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance")
plt.grid(True)
plt.show()
