<a href="https://colab.research.google.com/github/thepersonuadmire/KNNandPCA/blob/main/KNN_%26_PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Theoretical

1. What is K-Nearest Neighbors (KNN) and how does it work?


K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for classification and regression. It works by finding the 'K' closest training examples in the feature space to a given test instance and making predictions based on the majority class (for classification) or the average (for regression) of those neighbors.
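The voting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for `KNeighborsClassifier`; the helper name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # query near the first cluster
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # query near the second cluster
```

For regression, the last line of the helper would return the mean of `y_train[nearest]` instead of the most common label.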

2. What is the difference between KNN Classification and KNN Regression?


In KNN Classification, the algorithm predicts the class label of a data point based on the majority class of its K nearest neighbors. In KNN Regression, it predicts a continuous value by averaging the values of the K nearest neighbors.

3. What is the role of the distance metric in KNN?


The distance metric (e.g., Euclidean, Manhattan) determines how the distance between data points is calculated. It influences the selection of neighbors and, consequently, the predictions made by the KNN algorithm.
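To make the effect on neighbor selection concrete, the sketch below compares the two metrics on hand-picked points. The points and helper names are illustrative assumptions; they are chosen so that the nearest neighbor of the query changes with the metric.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block distance: sum of the absolute differences
    return np.sum(np.abs(a - b))

query = np.array([0.0, 0.0])
p = np.array([3.0, 3.0])  # moderate difference on both axes
q = np.array([0.0, 5.0])  # large difference on a single axis

d_euc = (euclidean(query, p), euclidean(query, q))
d_man = (manhattan(query, p), manhattan(query, q))

print("Euclidean: p =", round(d_euc[0], 3), " q =", d_euc[1])  # p is the closer point
print("Manhattan: p =", d_man[0], " q =", d_man[1])            # q is the closer point
```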

4. What is the Curse of Dimensionality in KNN?


The Curse of Dimensionality refers to the phenomenon where the feature space becomes increasingly sparse as the number of dimensions increases. This sparsity makes it difficult for KNN to find meaningful neighbors, leading to poor performance in high-dimensional spaces.
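One way to see this is to measure how the nearest and farthest distances from a query compare as dimensions are added. The sketch below assumes uniformly random points in a unit hypercube, which is only a stand-in for real data; the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the smallest to the largest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>4} dims: nearest/farthest = {r:.3f}")
```

As the ratio approaches 1, the "nearest" neighbor is barely closer than the farthest point, so the neighborhood carries little information.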

5. How can we choose the best value of K in KNN?


The best value of K can be chosen using techniques like cross-validation. Typically, a smaller K can lead to a more complex model (overfitting), while a larger K can smooth out the decision boundary (underfitting). A common approach is to test multiple values of K and select the one that yields the best validation accuracy.
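A minimal sketch of that search, assuming the Iris dataset, 5-fold cross-validation, and odd values of K from 1 to 15 (all three are choices made for this illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate K
cv_scores = {}
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"Best K: {best_k} (mean CV accuracy {cv_scores[best_k]:.3f})")
```

Odd values are a common convention because they reduce the chance of tied votes in binary problems.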

6. What are KD Tree and Ball Tree in KNN?


KD Tree and Ball Tree are data structures that organize points in a k-dimensional space so that a nearest-neighbor search does not have to compare the query against every training point. A KD Tree recursively splits the space with axis-aligned hyperplanes, while a Ball Tree groups points into nested hyperspheres.
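Both structures are available directly in scikit-learn as `KDTree` and `BallTree`. The sketch below builds each one on random points (an assumption made only for illustration) and checks that they return the same neighbors, since both perform exact searches.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
points = rng.random((200, 3))  # 200 random points in 3-D
query = points[:1]             # use the first point as the query

kd = KDTree(points, leaf_size=20)
ball = BallTree(points, leaf_size=20)

# Each query returns (distances, indices) of the 3 nearest points, sorted by distance
kd_dist, kd_idx = kd.query(query, k=3)
ball_dist, ball_idx = ball.query(query, k=3)

print("KD Tree neighbors:  ", kd_idx[0])
print("Ball Tree neighbors:", ball_idx[0])
```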

7. When should you use KD Tree vs. Ball Tree?


KD Tree is generally more efficient for low-dimensional data, while Ball Tree can be more effective for high-dimensional data due to its ability to handle non-uniform distributions better.

8. What are the disadvantages of KNN?


Disadvantages of KNN include high computational cost during prediction (a brute-force search computes the distance to every training sample), the need to keep the whole training set in memory, sensitivity to irrelevant features and to feature scale, and poor performance in high-dimensional spaces due to the Curse of Dimensionality.

9. How does feature scaling affect KNN?


Feature scaling is crucial for KNN because the algorithm relies on distance calculations. If features are on different scales, those with larger ranges can disproportionately influence the distance metric, leading to biased predictions.
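The sketch below shows the effect on a tiny, made-up table of ages and incomes. The data and the helper `nearest_to_first` are hypothetical; the point is only that the nearest neighbor of the first row changes once both columns are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [age in years, income in dollars]
X = np.array([[25, 50_000],
              [60, 50_500],
              [26, 58_000],
              [45, 52_000]], dtype=float)

def nearest_to_first(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_unscaled = nearest_to_first(X)
nearest_scaled = nearest_to_first(StandardScaler().fit_transform(X))

print("Nearest to row 0 without scaling:", nearest_unscaled)  # income dominates the distance
print("Nearest to row 0 with scaling:   ", nearest_scaled)    # age and income both contribute
```

Without scaling, the 35-year age gap to row 1 is invisible next to the income differences, so row 1 is picked purely because its income is closest.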

10. What is PCA (Principal Component Analysis)?


PCA is a dimensionality reduction technique that transforms a dataset into a set of orthogonal (uncorrelated) components, capturing the maximum variance in the data with fewer dimensions.

11. How does PCA work?


PCA works by centering the data, computing its covariance matrix, finding the eigenvalues and eigenvectors of that matrix, and then projecting the centered data onto the eigenvectors corresponding to the largest eigenvalues.
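These steps can be written out directly with NumPy. This is a sketch on synthetic data (the mixing matrix and noise level are arbitrary assumptions); scikit-learn's `PCA` uses an SVD-based implementation rather than this explicit eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data driven by 2 latent factors plus a little noise (illustrative only)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 1.0, 0.5],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# 1. Center the data
X_centered = X - X.mean(axis=0)
# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)
# 3. Eigendecomposition (eigh returns eigenvalues in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# 4. Project onto the top 2 eigenvectors
X_projected = X_centered @ eigenvectors[:, :2]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The variance of each projected column equals the corresponding eigenvalue, which is why eigenvalues are read as "variance captured".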

12. What is the geometric intuition behind PCA?


Geometrically, PCA identifies the directions (principal components) in which the data varies the most and projects the data onto these directions, effectively reducing dimensionality while preserving variance.

13. What is the difference between Feature Selection and Feature Extraction?


Feature Selection involves selecting a subset of the original features based on their importance, while Feature Extraction creates new features from the original ones (e.g., PCA).

14. What are Eigenvalues and Eigenvectors in PCA?


Eigenvalues represent the amount of variance captured by each principal component, while eigenvectors indicate the direction of these components in the feature space.

15. How do you decide the number of components to keep in PCA?


The number of components can be decided by examining the explained variance ratio and choosing a threshold (e.g., 95% of total variance) or using a scree plot to identify the "elbow" point.
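The variance-threshold rule can be applied by hand from the cumulative explained variance, or by passing a fraction as `n_components` to scikit-learn's `PCA`. The sketch below assumes the standardized Wine dataset and a 95% threshold, both chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Manual rule: smallest number of components whose cumulative ratio reaches 95%
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Equivalent shortcut: a float in (0, 1) asks PCA to pick the count itself
pca_95 = PCA(n_components=0.95).fit(X)

print("Components chosen manually:", n_manual)
print("Components chosen by PCA:  ", pca_95.n_components_)
```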

16. Can PCA be used for classification?


PCA itself is not a classification method, but it can be used as a preprocessing step to reduce dimensionality before applying classification algorithms.

17. What are the limitations of PCA?


Limitations of PCA include its sensitivity to outliers and to feature scale, the assumption that the important structure is linear (so nonlinear relationships may be missed), and the reduced interpretability of components that mix the original features.

18. How do KNN and PCA complement each other?


PCA can reduce the dimensionality of the dataset, which makes KNN's distance computations cheaper and can improve its accuracy by discarding low-variance directions that often carry mostly noise.
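A common way to combine the two is a scikit-learn pipeline, so that scaling and PCA are fit only on the training folds. The sketch below assumes the Wine dataset, 5 components, and K=5; these values are illustrative rather than tuned.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce to 5 principal components -> classify with KNN
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```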

19. How does KNN handle missing values in a dataset?


KNN does not inherently handle missing values. However, techniques like KNN imputation can be used to estimate missing values based on the nearest neighbors.

20. What are the key differences between PCA and Linear Discriminant Analysis (LDA)?

PCA is an unsupervised method that focuses on maximizing variance, while LDA is a supervised method that aims to maximize class separability. LDA uses class labels to find the optimal projection, whereas PCA does not.
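The difference shows up directly in the scikit-learn API: `PCA` is fit on the features alone, while `LinearDiscriminantAnalysis` also needs the labels. A minimal sketch on the Iris dataset (chosen here only as an example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, ignores y
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, uses y; at most (n_classes - 1) components are available
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print("PCA projection shape:", X_pca.shape)
print("LDA projection shape:", X_lda.shape)
```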

# Practical

21. Train a KNN Classifier on the Iris dataset and print model accuracy.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f'Model Accuracy: {accuracy:.2f}')


22. Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE).


In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)  # fixed seed for reproducible data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')

23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
knn_euclidean = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
knn_euclidean.fit(X_train, y_train)
knn_manhattan.fit(X_train, y_train)
accuracy_euclidean = knn_euclidean.score(X_test, y_test)
accuracy_manhattan = knn_manhattan.score(X_test, y_test)
print(f'Accuracy (Euclidean): {accuracy_euclidean:.2f}')
print(f'Accuracy (Manhattan): {accuracy_manhattan:.2f}')


24. Train a KNN Classifier with different values of K and visualize decision boundaries.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features for visualization
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a mesh grid for plotting
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor='k', marker='o', label='Train')
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor='k', marker='x', label='Test')
    plt.title(f'Decision Boundary for K={k}')
    plt.legend()
    plt.show()


25. Apply Feature Scaling before training a KNN model and compare results with unscaled data.


In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Train KNN Classifier without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=3)
knn_unscaled.fit(X_train, y_train)
accuracy_unscaled = knn_unscaled.score(X_test, y_test)

# 2. Apply feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Train KNN Classifier with scaling
knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_train_scaled, y_train)
accuracy_scaled = knn_scaled.score(X_test_scaled, y_test)

# Print the results
print(f'Accuracy without scaling: {accuracy_unscaled:.2f}')
print(f'Accuracy with scaling: {accuracy_scaled:.2f}')

26. Train a PCA model on synthetic data and print the explained variance ratio for each component.


In [None]:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=42)
pca = PCA()
pca.fit(X)
explained_variance_ratio = pca.explained_variance_ratio_
print(f'Explained Variance Ratio: {explained_variance_ratio}')


27. Apply PCA before training a KNN Classifier and compare accuracy with and without PCA.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Without PCA
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
accuracy_no_pca = knn.score(X_test, y_test)

# With PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
knn.fit(X_train_pca, y_train)
accuracy_with_pca = knn.score(X_test_pca, y_test)

print(f'Accuracy without PCA: {accuracy_no_pca:.2f}')
print(f'Accuracy with PCA: {accuracy_with_pca:.2f}')


28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
param_grid = {'n_neighbors': [1, 3, 5, 7, 9], 'metric': ['euclidean', 'manhattan']}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(f'Best parameters: {best_params}')


29. Train a KNN Classifier and check the number of misclassified samples.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
misclassified_samples = (y_test != y_pred).sum()
print(f'Number of misclassified samples: {misclassified_samples}')


30. Train a PCA model and visualize the cumulative explained variance.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA()
pca.fit(iris.data)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

plt.figure()
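# Start the x-axis at 1 so it matches the number of components kept
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')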
plt.title('Cumulative Explained Variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid()
plt.show()


31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
knn_uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform')
knn_distance = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_uniform.fit(X_train, y_train)
knn_distance.fit(X_train, y_train)
accuracy_uniform = knn_uniform.score(X_test, y_test)
accuracy_distance = knn_distance.score(X_test, y_test)
print(f'Accuracy (Uniform Weights): {accuracy_uniform:.2f}')
print(f'Accuracy (Distance Weights): {accuracy_distance:.2f}')


32. Train a KNN Regressor and analyze the effect of different K values on performance.


In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)  # fixed seed for reproducible data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
k_values = [1, 3, 5, 7, 9]
for k in k_values:
    knn_regressor = KNeighborsRegressor(n_neighbors=k)
    knn_regressor.fit(X_train, y_train)
    y_pred = knn_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean Squared Error for K={k}: {mse:.2f}')


33. Implement KNN Imputation for handling missing values in a dataset.


In [None]:
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer
import numpy as np

iris = load_iris()
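X = iris.data.copy()  # copy so the loaded dataset is not modified in place
# Introduce missing values in the first feature of every 10th sample
# (blanking whole rows would leave the imputer no observed features to compare on)
X[::10, 0] = np.nan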
imputer = KNNImputer(n_neighbors=3)
X_imputed = imputer.fit_transform(X)
print(f'Imputed Data:\n{X_imputed}')


34. Train a PCA model and visualize the data projection onto the first two principal components.


In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, edgecolor='k', cmap='viridis')
plt.title('PCA Projection of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()


35. Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance.


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
knn_kd_tree = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
knn_ball_tree = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree')
knn_kd_tree.fit(X_train, y_train)
knn_ball_tree.fit(X_train, y_train)
accuracy_kd_tree = knn_kd_tree.score(X_test, y_test)
accuracy_ball_tree = knn_ball_tree.score(X_test, y_test)
print(f'Accuracy (KD Tree): {accuracy_kd_tree:.2f}')
print(f'Accuracy (Ball Tree): {accuracy_ball_tree:.2f}')


36. Train a PCA model on a high-dimensional dataset and visualize the Scree plot.


In [None]:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, _ = make_classification(n_samples=100, n_features=20, n_informative=10, random_state=42)
pca = PCA()
pca.fit(X)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.grid()
plt.show()

37. Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score.


In [None]:
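from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)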
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))


38. Train a PCA model and analyze the effect of different numbers of components on accuracy.


In [None]:
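# Split once, then fit PCA on the training data only so no test information leaks in
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
accuracies = []
for n in range(1, X.shape[1]+1):
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    knn.fit(X_train_pca, y_train)
    acc = knn.score(X_test_pca, y_test)
    accuracies.append(acc)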

plt.plot(range(1, X.shape[1]+1), accuracies)
plt.xlabel("Number of PCA Components")
plt.ylabel("Accuracy")
plt.title("Accuracy vs PCA Components")
plt.show()


39. Train a KNN Classifier with different leaf_size values and compare accuracy.


In [None]:
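# leaf_size tunes the speed and memory use of tree-based searches; the neighbors found
# are the same, so accuracy is expected to be identical across values
leaf_sizes = [5, 10, 20, 30, 50]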
for leaf in leaf_sizes:
    knn = KNeighborsClassifier(leaf_size=leaf)
    knn.fit(X_train, y_train)
    print(f"Leaf Size: {leaf}, Accuracy: {knn.score(X_test, y_test)}")


40. Train a PCA model and visualize how data points are transformed before and after PCA.


In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("Original Data")

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.title("After PCA (2D)")
plt.show()


41. Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report.


In [None]:
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report

data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))


42. Train a KNN Regressor and analyze the effect of different distance metrics on prediction error.


In [None]:
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=42)  # fixed seed for reproducible data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for p in [1, 2]:  # Manhattan and Euclidean
    knr = KNeighborsRegressor(p=p)
    knr.fit(X_train, y_train)
    y_pred = knr.predict(X_test)
    print(f"p={p}, MSE={mean_squared_error(y_test, y_pred):.3f}")


43. Train a KNN Classifier and evaluate using ROC-AUC score.


In [None]:
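from sklearn.metrics import roc_auc_score

# Build a binary classification problem and refit; knn and X_test were last used on other data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn = KNeighborsClassifier().fit(X_train, y_train)
y_score = knn.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
print("ROC AUC:", roc_auc_score(y_test, y_score))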


44. Train a PCA model and visualize the variance captured by each principal component.


In [None]:
pca = PCA().fit(X)
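# Plot each component's own share of the variance (not the cumulative sum)
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.bar(components, pca.explained_variance_ratio_)
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Variance Explained by Each Principal Component")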
plt.grid(True)
plt.show()


45. Train a KNN Classifier and perform feature selection before training.


In [None]:
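from sklearn.feature_selection import SelectKBest, f_classif

# Use a classification dataset: f_classif and KNeighborsClassifier need discrete class labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 features with the highest ANOVA F-scores
X_new = selector.fit_transform(X, y)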
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=42)

knn.fit(X_train, y_train)
print("Accuracy after feature selection:", knn.score(X_test, y_test))


46. Train a PCA model and visualize the data reconstruction error after reducing dimensions.


In [None]:
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
X_reconstructed = pca.inverse_transform(X_reduced)
reconstruction_error = np.mean((X - X_reconstructed)**2, axis=1)

plt.hist(reconstruction_error, bins=30)
plt.title("PCA Reconstruction Error")
plt.xlabel("Reconstruction Error")
plt.ylabel("Frequency")
plt.show()


47. Train a KNN Classifier and visualize the decision boundary.


In [None]:
from matplotlib.colors import ListedColormap

X_vis, y_vis = make_classification(n_samples=300, n_features=2, n_redundant=0, random_state=42)
knn.fit(X_vis, y_vis)

h = .02
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAFFAA']))
plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis, edgecolor='k')
plt.title("KNN Decision Boundary")
plt.show()


48. Train a PCA model and analyze the effect of different numbers of components on data variance.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load data
data = load_wine()
X = data.data
feature_names = data.feature_names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA without limiting components
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Plot the explained variance
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(explained_variance)+1), cumulative_variance, marker='o', linestyle='--', color='b')
plt.title('PCA - Cumulative Explained Variance')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle=':')  # optional line to show 95% threshold
plt.show()
