## k-Nearest Neighbors
- non-parametric, instance-based ML algorithm
    - uses a *distance metric* (e.g. Euclidean distance) to find the training samples closest to a query point
- used for both classification and regression tasks:
    - **classification: the prediction is the majority class among the k nearest neighbors**
    - **regression: the prediction is the average of the target values of the k nearest neighbors**
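
As a minimal sketch of the two prediction rules, assume the k = 5 nearest neighbors of a query point have already been found. The labels and target values below are hypothetical toy data, not taken from any dataset in these notes.

```python
from collections import Counter

import numpy as np

# Hypothetical toy data: class labels and target values of the k = 5 nearest neighbors
neighbor_labels = np.array([1, 0, 1, 1, 2])
neighbor_targets = np.array([2.0, 3.0, 4.0, 5.0, 6.0])

# Classification: the most frequent class among the neighbors
predicted_class = Counter(neighbor_labels.tolist()).most_common(1)[0][0]

# Regression: the mean of the neighbors' target values
predicted_value = neighbor_targets.mean()

print(predicted_class)  # 1
print(predicted_value)  # 4.0
```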

#### Steps
1. Feature scaling
2. Calculate distances
3. Identify *k* nearest neighbors
4. Make predictions
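
The four steps above can be sketched from scratch with NumPy. This is an illustrative implementation under simple assumptions (Euclidean distance, standardization with training statistics, ties broken in favor of the smallest label); the function name `knn_predict` and the toy data are hypothetical, and this is not how scikit-learn implements k-NN internally.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each row of X_query by majority vote among its k nearest training points."""
    # Step 1: feature scaling (standardize using training statistics only)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    X_train_s = (X_train - mean) / std
    X_query_s = (X_query - mean) / std

    predictions = []
    for x in X_query_s:
        # Step 2: Euclidean distance from x to every training point
        distances = np.sqrt(((X_train_s - x) ** 2).sum(axis=1))
        # Step 3: indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Step 4: majority vote over the neighbors' labels
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[1.2, 1.4], [8.7, 8.6]])

print(knn_predict(X_train, y_train, X_query, k=3))  # [0 1]
```

The loop computes every distance explicitly, which keeps the steps visible but scales poorly with the size of the training set.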

#### Optimal value of *k*
- Small k: 
    - (+) capture local variations in data
    - (-) high sensitivity to noise
- Large k:
    - (+) smoother decision boundaries
    - (-) can miss finer details
- Common practices
    - use **cross-validation** to determine the optimal value of *k*
    - a common starting point: $k = \sqrt{n}$ (n: number of training samples)
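
One possible way to apply both practices is sketched below, using 5-fold cross-validation on the Iris dataset. The range of candidate values (1 to 20), the number of folds, and the use of the full dataset size as *n* are assumptions made for illustration. The scaler is placed inside a pipeline so that it is refit on the training part of each fold.

```python
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Rule-of-thumb starting point: k is roughly sqrt(n); here n = 150 gives k = 12
k_start = round(math.sqrt(len(X)))

# Mean 5-fold cross-validated accuracy for each candidate k (assumed range: 1..20)
cv_scores = {}
for k in range(1, 21):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"sqrt(n) starting point: k={k_start}")
print(f"Best k by 5-fold CV: {best_k} (mean accuracy {cv_scores[best_k]:.3f})")
```

When several values of k tie on mean accuracy, `max` returns the smallest of them, so the reported "best" value should be read together with the full set of scores.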

#### Limitations
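- computationally expensive at prediction time: each query has to be compared against the stored training samples
- dependent on feature scaling: features with larger numeric ranges dominate the distance
- not robust to imbalanced data: the majority class tends to dominate the vote among the neighbors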
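
The dependence on feature scaling can be illustrated with a small sketch. The three points and the query below are hypothetical: the first feature has values in the thousands and the second lies between 0 and 1, so the raw Euclidean distance is driven almost entirely by the first feature. The helper `nearest_index` is also hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: feature 1 is in the thousands, feature 2 is in [0, 1]
X = np.array([[1000.0, 0.10],
              [1100.0, 0.90],
              [2000.0, 0.15]])
query = np.array([[1090.0, 0.12]])

def nearest_index(points, q):
    """Return the index of the row in points closest to q (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Without scaling, the large-range feature decides the nearest neighbor
raw_nearest = nearest_index(X, query)

# After standardization, both features contribute on a comparable scale
scaler = StandardScaler().fit(X)
scaled_nearest = nearest_index(scaler.transform(X), scaler.transform(query))

print(raw_nearest)     # 1
print(scaled_nearest)  # 0
```

The nearest neighbor changes once the features are standardized, which is why scaling is listed as the first step of the algorithm.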

### Exercise 1: Implement k-NN for a classification task (Iris), experimenting with different values of k

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [3]:
# Load the Iris dataset
data = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Scale the features (k-NN is distance-based, so scaling is usually recommended);
# the scaler is fitted on the training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Experiment with different values of k
for k in range(1, 11):
    # Initialize k-NN classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # Fit the model on the training data
    knn.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = knn.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy for k={k}: {accuracy:.2f}')

Accuracy for k=1: 1.00
Accuracy for k=2: 1.00
Accuracy for k=3: 1.00
Accuracy for k=4: 1.00
Accuracy for k=5: 1.00
Accuracy for k=6: 1.00
Accuracy for k=7: 1.00
Accuracy for k=8: 1.00
Accuracy for k=9: 1.00
Accuracy for k=10: 1.00


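-> Every k from 1 to 10 reaches an accuracy of 1.00 on this 30-sample test split, so the split does not distinguish between them. k=1 is used in the next exercise, but cross-validation would give a more reliable choice of k.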

### Exercise 2: Compare k-NN results to logistic regression

In [8]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a logistic regression model
logistic_model = LogisticRegression(max_iter=200)
logistic_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = logistic_model.predict(X_test)

# Compute accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f'Logistic Regression Accuracy: {accuracy_lr:.2f}')

# Evaluate k-NN with k=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

# Compute accuracy for k-NN
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f'k-NN (k=1) Accuracy: {accuracy_knn:.2f}')

print("\n")

# Detailed comparison
print("Classification Report for Logistic Regression:\n", classification_report(y_test, y_pred_lr, target_names=data.target_names))
print("Classification Report for k-NN (k=1):\n", classification_report(y_test, y_pred_knn, target_names=data.target_names))



Logistic Regression Accuracy: 1.00
k-NN (k=1) Accuracy: 1.00


Classification Report for Logistic Regression:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Classification Report for k-NN (k=1):
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

