### 1  What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
   - K-Nearest Neighbors (KNN) is a simple supervised learning algorithm that predicts the label or value of a new data point from its k closest neighbors in the feature space, on the premise that similar points lie near each other. For classification, it assigns the class that wins a plurality vote among the labels of the k nearest points. For regression, it averages the target values of the k nearest neighbors, optionally weighting closer neighbors more heavily.
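
A minimal sketch of both modes in plain Python may make the voting and averaging steps concrete. The helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are illustrative assumptions, and neighbors are weighted equally here:

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Plurality vote over the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Unweighted mean of the target values of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)

# Toy data: two well-separated clusters (hypothetical values).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 54.0]

print(knn_classify(points, labels, (1.2, 1.4)))   # a
print(knn_regress(points, values, (8.6, 8.8)))    # 52.0
```

The only difference between the two modes is the final aggregation step; the neighbor search is shared.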


### 2 What is the Curse of Dimensionality and how does it affect KNN performance?
   - The Curse of Dimensionality refers to the fact that, as the number of features (dimensions) grows, data becomes sparse and the distances between points become nearly uniform, so they carry less information. KNN is hit especially hard because it relies entirely on distance: the nearest and farthest neighbors end up almost equally far away, which makes truly "close" points hard to find. The result is poor generalization (overfitting), and the amount of data needed to cover the space grows exponentially with the number of dimensions.
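
The loss of distance contrast can be observed with a small experiment. The sketch below assumes uniformly random points in the unit hypercube; the function name `distance_contrast` is hypothetical. It reports the ratio between the nearest and the farthest distance from a random query point:

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query.

    A ratio near 0 means some points are much closer than others;
    a ratio near 1 means all points are almost equally far away.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return dists.min() / dists.max()

for d in (2, 10, 100, 1000):
    print(f"{d:>5} dimensions: nearest/farthest = {distance_contrast(d):.3f}")
```

As the number of dimensions grows, the ratio moves toward 1, which is the "distances lose meaning" effect described above.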


### 3  What is Principal Component Analysis (PCA)? How is it different from feature selection?
   - Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated variables called Principal Components (PCs). By identifying the directions (axes) along which the data has the highest variance, PCA compresses the data while preserving as much information (defined statistically as variance) as possible. The primary difference from feature selection lies in how the number of features is reduced: feature selection keeps a subset of the original variables, while PCA creates entirely new ones.

| Aspect | Feature Selection | PCA (Feature Extraction) |
| --- | --- | --- |
| Output | A subset of the original variables. | Entirely new "artificial" variables (PCs). |
| Data Integrity | Retains original features and their physical meaning. | Combines variables; new components are often hard to interpret. |
| Approach | Discards "unimportant" columns entirely. | Projects all original data into a lower-dimensional space. |
| Information | Information in discarded features is lost. | Tries to preserve the "signal" from all features in the new components. |
| Use Case | Best when interpretability of specific variables is critical. | Best for removing noise, handling multicollinearity, or visualizing complex data. |
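
The contrast can be seen directly in scikit-learn. This is a sketch under assumptions: `SelectKBest` with `f_classif` is just one possible feature-selection method, chosen here for illustration, and `k=2` is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Feature selection: keeps 2 of the original 13 columns unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_std, y)
kept = selector.get_support(indices=True)

# PCA: builds 2 new columns, each a linear combination of all 13 features.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print("Indices of selected original columns:", kept)
print("Selected data equals original columns:", np.allclose(X_selected, X_std[:, kept]))
print("PCA loadings shape (components x original features):", pca.components_.shape)
```

Both outputs have two columns, but only the selected ones can still be read as original measurements.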

### 4  What are eigenvalues and eigenvectors in PCA, and why are they important?
   - In PCA, the eigenvectors of the data's covariance matrix define the new axes (Principal Components), pointing in the directions of maximum variance, while the corresponding eigenvalues are scalars giving the amount of variance along each axis. They are crucial for dimensionality reduction: ranking components by eigenvalue lets us keep those that carry the most variance and drop the rest, simplifying the data while retaining most of its variability.
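
To illustrate the relationship, here is a small NumPy sketch on synthetic correlated data (the data and variable names are assumptions for illustration). It decomposes the covariance matrix and checks that the variance of the data projected onto each eigenvector equals the matching eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data with correlated columns.
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=500)])

# Center the data and compute its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order; sort them descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project onto the eigenvectors (the principal axes).
projected = centered @ eigenvectors
explained_ratio = eigenvalues / eigenvalues.sum()

print("Eigenvalues:             ", eigenvalues)
print("Variance after projection:", projected.var(axis=0, ddof=1))
print("Explained variance ratio: ", explained_ratio)
```

The explained variance ratio is simply each eigenvalue divided by the sum of all eigenvalues.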

### 5  How do KNN and PCA complement each other when applied in a single pipeline?
   - In a single pipeline, Principal Component Analysis (PCA) acts as a preprocessing step for the K-Nearest Neighbors (KNN) algorithm. It addresses the "curse of dimensionality" and can improve efficiency and, often, accuracy.

**How PCA Complements KNN**

*   **Dimensionality Reduction:** KNN performance often suffers in high-dimensional spaces because distance calculations become less meaningful and more expensive (the "curse of dimensionality"). PCA transforms the data into a lower-dimensional space by retaining only the principal components that explain the most variance, discarding redundant or noisy directions.
*   **Improved Computational Efficiency:** With fewer dimensions to process, the computational cost and memory requirements of KNN's distance calculations are reduced. Because KNN defers almost all of its work to prediction time, the main gain is faster queries, which is especially beneficial for large datasets such as image pixel data.
*   **Enhanced Accuracy:** By removing noise and concentrating on the high-variance directions, PCA can help KNN generalize better to unseen data, often leading to improved classification or regression accuracy compared to using the raw, high-dimensional data.
*   **Reduced Multicollinearity:** PCA transforms correlated features into a set of new, uncorrelated components (orthogonal directions). This helps because the basic Euclidean distance used in KNN is sensitive to correlated features, which effectively count the same information more than once.

**The Pipeline in Practice**

In a typical machine learning pipeline, PCA is implemented as a transformation step before the KNN model is trained:

1. **Preprocessing/Standardization:** The data is usually standardized (mean-centered and scaled) before PCA is applied, as PCA is sensitive to the scale of the features.
2. **PCA Transformation:** PCA is applied to the preprocessed data to reduce its dimensionality, with the number of principal components chosen to retain a sufficient share of the original variance (e.g., 95%).
3. **KNN Model:** The KNN algorithm is then applied to the new, lower-dimensional data to perform the final classification or regression task.
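
The three steps can be sketched with scikit-learn's `Pipeline`. The step names (`"scale"`, `"pca"`, `"knn"`), the 95% variance target, and `k=5` are illustrative choices, not a recommendation:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),                          # 1. standardize
    ("pca", PCA(n_components=0.95, svd_solver="full")),   # 2. keep ~95% of variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),         # 3. classify
])

# fit() learns the scaler and PCA on the training data only,
# then applies the same transformations when scoring the test data.
pipe.fit(X_train, y_train)

n_kept = pipe.named_steps["pca"].n_components_
test_accuracy = pipe.score(X_test, y_test)
print(f"Components kept: {n_kept}")
print(f"Test accuracy:   {test_accuracy:.4f}")
```

Wrapping the steps in one pipeline keeps the scaler and PCA from ever seeing the test data during fitting.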

### 6  Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

In [2]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score

# 1. Load the Dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names
target_names = wine.target_names

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# --- Scenario 1: Without Feature Scaling ---
print("--- Without Feature Scaling ---")
knn_unscaled = KNeighborsClassifier(n_neighbors=5) # Using k=5 as an example
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print(f"Accuracy (Unscaled): {accuracy_unscaled:.4f}")

# --- Scenario 2: With Feature Scaling (StandardScaler) ---
print("\n--- With StandardScaler ---")
scaler_std = StandardScaler()
X_train_scaled_std = scaler_std.fit_transform(X_train)
X_test_scaled_std = scaler_std.transform(X_test)

knn_scaled_std = KNeighborsClassifier(n_neighbors=5)
knn_scaled_std.fit(X_train_scaled_std, y_train)
y_pred_scaled_std = knn_scaled_std.predict(X_test_scaled_std)
accuracy_scaled_std = accuracy_score(y_test, y_pred_scaled_std)
print(f"Accuracy (Standard Scaled): {accuracy_scaled_std:.4f}")

# --- Scenario 3: With Feature Scaling (MinMaxScaler) ---
print("\n--- With MinMaxScaler ---")
scaler_minmax = MinMaxScaler(feature_range=(0, 1))
X_train_scaled_minmax = scaler_minmax.fit_transform(X_train)
X_test_scaled_minmax = scaler_minmax.transform(X_test)

knn_scaled_minmax = KNeighborsClassifier(n_neighbors=5)
knn_scaled_minmax.fit(X_train_scaled_minmax, y_train)
y_pred_scaled_minmax = knn_scaled_minmax.predict(X_test_scaled_minmax)
accuracy_scaled_minmax = accuracy_score(y_test, y_pred_scaled_minmax)
print(f"Accuracy (MinMax Scaled): {accuracy_scaled_minmax:.4f}")

# --- Comparison ---
print("\n--- Comparison ---")
print(f"Unscaled Accuracy: {accuracy_unscaled:.4f}")
print(f"Standard Scaled Accuracy: {accuracy_scaled_std:.4f}")
print(f"MinMax Scaled Accuracy: {accuracy_scaled_minmax:.4f}")

--- Without Feature Scaling ---
Accuracy (Unscaled): 0.7222

--- With StandardScaler ---
Accuracy (Standard Scaled): 0.9444

--- With MinMaxScaler ---
Accuracy (MinMax Scaled): 0.9630

--- Comparison ---
Unscaled Accuracy: 0.7222
Standard Scaled Accuracy: 0.9444
MinMax Scaled Accuracy: 0.9630
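
One plausible reason for the gap between the unscaled and scaled runs is that the Wine features live on very different scales, so a few large-valued columns dominate the Euclidean distance. A minimal sketch to check this, using only `load_wine` and NumPy (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# List the three features with the largest spread.
ranked = sorted(zip(wine.feature_names, stds), key=lambda pair: -pair[1])
for name, std in ranked[:3]:
    print(f"{name:<12} std = {std:.2f}")

largest = wine.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

After standardization every feature has unit variance, so no single column can dominate the neighbor search.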


### 7  Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.


In [3]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Standardize the data before applying PCA
scaler_pca = StandardScaler()
X_scaled_for_pca = scaler_pca.fit_transform(X)

# Train a PCA model
pca = PCA()
pca.fit(X_scaled_for_pca)

# Print the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Principal Component {i+1}: {ratio:.4f}")

Explained Variance Ratio of each Principal Component:
Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080


This output shows the proportion of variance in the original dataset that is captured by each principal component. You can decide how many components to keep based on the cumulative explained variance you want to achieve (e.g., 95% of the variance).
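
As a sketch of that decision, the ratios printed above can be accumulated to find the smallest number of components that reaches a target. The 95% target is the example threshold from the text, and the values are copied from the output above:

```python
import numpy as np

# Explained variance ratios as printed above (rounded to 4 decimals).
ratios = np.array([0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
                   0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080])

cumulative = np.cumsum(ratios)
target = 0.95

# First position where the cumulative variance reaches the target (1-based count).
n_keep = int(np.searchsorted(cumulative, target) + 1)

for i, total in enumerate(cumulative, start=1):
    print(f"Top {i:>2} components: {total:.4f}")
print(f"Components needed for {target:.0%} variance: {n_keep}")
```

With these values, the first two components cover about 55% of the variance and ten components are needed to pass 95%.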

### 8  Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

# Task
To address the task, I'll first perform PCA to reduce the dimensionality of the Wine dataset to its top two principal components. Then, I'll train a K-Nearest Neighbors (KNN) classifier using this PCA-transformed data. Finally, I'll evaluate the KNN model's accuracy on the transformed test set and compare it with the accuracies obtained from the unscaled, Standard Scaled, and MinMax Scaled datasets.

First, I'll apply PCA to the `X_train_scaled_std` and `X_test_scaled_std` data to retain the top 2 principal components. The full workflow, which the cells below run step by step, is:
```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Retain Top 2 Principal Components
pca_2_components = PCA(n_components=2)
X_train_pca = pca_2_components.fit_transform(X_train_scaled_std)
X_test_pca = pca_2_components.transform(X_test_scaled_std)

print("Explained Variance Ratio of the Top 2 Principal Components:")
for i, ratio in enumerate(pca_2_components.explained_variance_ratio_):
    print(f"Principal Component {i+1}: {ratio:.4f}")
print(f"Cumulative Explained Variance: {pca_2_components.explained_variance_ratio_.sum():.4f}")

# Train KNN on PCA Data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)

# Evaluate KNN on PCA Data
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"\nAccuracy (PCA-transformed with top 2 components): {accuracy_pca:.4f}")

# Compare Accuracies
print("\n--- Final Accuracy Comparison ---")
print(f"Unscaled Accuracy: {accuracy_unscaled:.4f}")
print(f"Standard Scaled Accuracy: {accuracy_scaled_std:.4f}")
print(f"MinMax Scaled Accuracy: {accuracy_scaled_minmax:.4f}")
print(f"PCA-transformed (top 2 components) Accuracy: {accuracy_pca:.4f}")

# Summarize the impact of PCA
print("\n--- Impact of PCA on KNN Performance ---")
if accuracy_pca > accuracy_unscaled and accuracy_pca > accuracy_scaled_std and accuracy_pca > accuracy_scaled_minmax:
    print("PCA with top 2 components achieved the highest accuracy among all methods tested.")
elif accuracy_pca > accuracy_unscaled:
    print("PCA-transformed data significantly improved KNN accuracy compared to unscaled data.")
else:
    print("PCA-transformed data did not improve KNN accuracy compared to unscaled data.")

print("PCA helped reduce dimensionality while retaining a significant portion of the variance (approx. 55.41% with 2 components), which resulted in a competitive accuracy for the KNN classifier.")
print("While not surpassing the accuracies achieved by Standard Scaler or MinMax Scaler alone, PCA still provides a strong performance, especially considering the significant reduction in features (from 13 to 2). This demonstrates PCA's ability to extract meaningful information and mitigate the curse of dimensionality for KNN.")
```

## Retain Top 2 Principal Components

### Subtask:
Apply PCA to the standardized training and testing datasets to reduce dimensionality to the top 2 principal components.


**Reasoning**:
The subtask requires applying PCA to the standardized training and testing datasets to reduce dimensionality to the top 2 principal components. This involves creating a PCA instance, fitting it to the scaled training data, and then transforming both the scaled training and testing data.



In [4]:
from sklearn.decomposition import PCA

# Create a PCA instance with n_components=2
pca = PCA(n_components=2)

# Apply fit_transform to X_train_scaled_std to get X_train_pca
X_train_pca = pca.fit_transform(X_train_scaled_std)

# Apply transform to X_test_scaled_std to get X_test_pca
X_test_pca = pca.transform(X_test_scaled_std)

print("Shape of X_train_pca:", X_train_pca.shape)
print("Shape of X_test_pca:", X_test_pca.shape)

Shape of X_train_pca: (124, 2)
Shape of X_test_pca: (54, 2)


### 9  Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.


## Train KNN on PCA Data

### Subtask:
Train a K-Nearest Neighbors (KNN) classifier using the PCA-transformed training data.


**Reasoning**:
The subtask is to train a K-Nearest Neighbors (KNN) classifier using the PCA-transformed training data. I will import the necessary class, instantiate the model, and then fit it to the PCA-transformed training data and labels.



In [5]:
from sklearn.neighbors import KNeighborsClassifier

# Instantiate a KNeighborsClassifier object with n_neighbors=5
knn_pca = KNeighborsClassifier(n_neighbors=5)

# Fit the KNN model to the PCA-transformed training features and training labels
knn_pca.fit(X_train_pca, y_train)

print("KNN model trained on PCA-transformed data.")

KNN model trained on PCA-transformed data.


In [None]:
### 10

## Evaluate KNN on PCA Data

### Subtask:
Predict the target values for the PCA-transformed test data using the trained KNN classifier and calculate the accuracy score.


**Reasoning**:
The subtask requires making predictions on the PCA-transformed test data using the trained KNN classifier and then calculating the accuracy score. I will use the `predict` method of the `knn_pca` model and the `accuracy_score` function to achieve this.



In [6]:
from sklearn.metrics import accuracy_score

# 1. Use the knn_pca model to make predictions on the X_test_pca dataset
y_pred_pca = knn_pca.predict(X_test_pca)

# 2. Calculate the accuracy of these predictions
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# 3. Print the calculated accuracy score
print(f"Accuracy (PCA-transformed with top 2 components): {accuracy_pca:.4f}")

Accuracy (PCA-transformed with top 2 components): 0.9444


## Compare Accuracy with Original Dataset

### Subtask:
Compare the accuracy of the KNN classifier on the PCA-transformed dataset with the accuracies obtained from the unscaled, Standard Scaled, and MinMax Scaled datasets.

**Reasoning**:
I will print the four accuracies side by side and then summarize how the PCA-based result compares with the unscaled and scaled baselines.



In [7]:
print("\n--- Final Accuracy Comparison ---")
print(f"Unscaled Accuracy: {accuracy_unscaled:.4f}")
print(f"Standard Scaled Accuracy: {accuracy_scaled_std:.4f}")
print(f"MinMax Scaled Accuracy: {accuracy_scaled_minmax:.4f}")
print(f"PCA-transformed (top 2 components) Accuracy: {accuracy_pca:.4f}")

# Summarize the impact of PCA
print("\n--- Impact of PCA on KNN Performance ---")
if accuracy_pca > accuracy_unscaled and accuracy_pca > accuracy_scaled_std and accuracy_pca > accuracy_scaled_minmax:
    print("PCA with top 2 components achieved the highest accuracy among all methods tested.")
elif accuracy_pca > accuracy_unscaled:
    print("PCA-transformed data significantly improved KNN accuracy compared to unscaled data.")
else:
    print("PCA-transformed data did not improve KNN accuracy compared to unscaled data.")

print("PCA helped reduce dimensionality while retaining a significant portion of the variance (approx. 55.41% with 2 components), which resulted in a competitive accuracy for the KNN classifier.")
print("While not surpassing the accuracies achieved by Standard Scaler or MinMax Scaler alone, PCA still provides a strong performance, especially considering the significant reduction in features (from 13 to 2). This demonstrates PCA's ability to extract meaningful information and mitigate the curse of dimensionality for KNN.")


--- Final Accuracy Comparison ---
Unscaled Accuracy: 0.7222
Standard Scaled Accuracy: 0.9444
MinMax Scaled Accuracy: 0.9630
PCA-transformed (top 2 components) Accuracy: 0.9444

--- Impact of PCA on KNN Performance ---
PCA-transformed data significantly improved KNN accuracy compared to unscaled data.
PCA helped reduce dimensionality while retaining a significant portion of the variance (approx. 55.41% with 2 components), which resulted in a competitive accuracy for the KNN classifier.
While not surpassing the accuracies achieved by Standard Scaler or MinMax Scaler alone, PCA still provides a strong performance, especially considering the significant reduction in features (from 13 to 2). This demonstrates PCA's ability to extract meaningful information and mitigate the curse of dimensionality for KNN.


## Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

### Subtask:
Train KNN classifiers with 'euclidean' and 'manhattan' distance metrics on the Standard Scaled Wine dataset and compare their accuracies.


**Reasoning**:
I will instantiate two `KNeighborsClassifier` models, one with 'euclidean' and another with 'manhattan' distance metrics, train them on the `X_train_scaled_std` data, make predictions on `X_test_scaled_std`, and then calculate and print their respective accuracy scores.



In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Instantiate and train KNN with Euclidean metric
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled_std, y_train)

# 2. Make predictions and calculate accuracy for Euclidean
y_pred_euclidean = knn_euclidean.predict(X_test_scaled_std)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)
print(f"Accuracy (Euclidean Distance): {accuracy_euclidean:.4f}")

# 3. Instantiate and train KNN with Manhattan metric
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled_std, y_train)

# 4. Make predictions and calculate accuracy for Manhattan
y_pred_manhattan = knn_manhattan.predict(X_test_scaled_std)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)
print(f"Accuracy (Manhattan Distance): {accuracy_manhattan:.4f}")

# 5. Print both accuracy scores for comparison
print("\n--- Comparison of Distance Metrics ---")
print(f"Euclidean Accuracy: {accuracy_euclidean:.4f}")
print(f"Manhattan Accuracy: {accuracy_manhattan:.4f}")

Accuracy (Euclidean Distance): 0.9444
Accuracy (Manhattan Distance): 0.9815

--- Comparison of Distance Metrics ---
Euclidean Accuracy: 0.9444
Manhattan Accuracy: 0.9815
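
For reference, the two metrics differ only in how per-feature differences are combined. A minimal plain-Python sketch (the function names are illustrative; scikit-learn computes these internally):

```python
def euclidean(a, b):
    """Square root of the sum of squared per-feature differences (L2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute per-feature differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Because Euclidean distance squares each difference, a single large gap in one feature weighs more heavily than it does under Manhattan distance, which can change which points count as nearest.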


## Summary:

### Q&A
*   **How does the KNN model's accuracy on PCA-transformed data compare to other scaling methods?**
    The KNN model achieved an accuracy of 0.9444 with PCA-transformed data (top 2 components). This is a large improvement over the unscaled data's accuracy of 0.7222 and identical to the Standard Scaled data's accuracy of 0.9444. It was slightly lower than the MinMax Scaled data's accuracy of 0.9630.
*   **Which distance metric (Euclidean or Manhattan) performed better for KNN on the scaled Wine dataset?**
    The Manhattan distance metric achieved a higher accuracy of 0.9815, outperforming the Euclidean distance metric which yielded an accuracy of 0.9444 on the Standard Scaled Wine dataset.

### Data Analysis Key Findings
*   Principal Component Analysis (PCA) successfully reduced the dimensionality of the Wine dataset from 13 features to 2 principal components for both training and testing sets.
*   The top 2 principal components collectively explained approximately 55.41% of the total variance in the dataset (0.3620 + 0.1921, from the PCA fitted on the full standardized data).
*   A K-Nearest Neighbors (KNN) classifier trained on this 2-component PCA-transformed data achieved an accuracy of 0.9444.
*   Comparing across different data preparations, the accuracies were:
    *   Unscaled: 0.7222
    *   Standard Scaled: 0.9444
    *   MinMax Scaled: 0.9630
    *   PCA-transformed (top 2 components): 0.9444
*   When evaluating KNN on Standard Scaled data with different distance metrics, the Manhattan distance achieved an accuracy of 0.9815, which was higher than the Euclidean distance's accuracy of 0.9444.
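
Since the test set holds 54 samples, each accuracy corresponds to a whole number of correct predictions. A small sketch of that arithmetic, using the accuracies reported above (the dictionary keys are illustrative labels):

```python
n_test = 54  # size of the test split, from the shape of X_test_pca

accuracies = {
    "Unscaled": 0.7222,
    "Standard Scaled": 0.9444,
    "MinMax Scaled": 0.9630,
    "PCA (top 2 components)": 0.9444,
    "Standard Scaled + Manhattan": 0.9815,
}

# Convert each rounded accuracy back to a count of correct predictions.
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

for name, count in correct.items():
    print(f"{name:<28} {count}/{n_test} correct")
```

Seen this way, the scaled configurations differ from one another by only one or two test samples.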

### Insights or Next Steps
*   PCA reduced dimensionality by about 85% (from 13 to 2 features) while matching the classification accuracy obtained with Standard Scaling on all 13 features (0.9444). This suggests that, for this dataset, the top two principal components capture most of the information needed for classification, even though they explain only about 55% of the variance.
*   The choice of distance metric affected KNN performance on this split: Manhattan distance reached 0.9815 versus 0.9444 for Euclidean distance on the Standard Scaled data, a difference of two test samples out of 54. This hints that Manhattan distance may suit this dataset's feature distribution better, but the gap is small enough that it should be confirmed with cross-validation. Further investigation into other distance metrics or hyperparameter tuning for KNN with Manhattan distance could lead to even better results.
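
One way to carry out that follow-up is a cross-validated grid search over `k` and the metric. This is a sketch under assumptions: the grid values, the 5-fold setting, and the step names are arbitrary illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling lives inside the pipeline so each CV fold is scaled independently.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:      ", search.best_params_)
print(f"Best CV accuracy:      {search.best_score_:.4f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.4f}")
```

Averaging over several folds gives a steadier comparison between metrics than a single 54-sample test split.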
