# K-Nearest Neighbors

## K-Nearest Neighbors (KNN): Hyperparameter Tuning

**Hyperparameters to tune**:

-   **`n_neighbors`**: The number of neighbors to consider when making a
    prediction.
-   **`weights`**: How the neighbors’ votes are weighted: `'uniform'`
    counts every neighbor equally, `'distance'` gives closer neighbors
    more say.
-   **`metric`**: The distance metric (e.g., `'euclidean'`,
    `'manhattan'`, `'minkowski'`).
-   **`algorithm`**: The algorithm used to compute the nearest neighbors
    (`'auto'`, `'ball_tree'`, `'kd_tree'`, `'brute'`).
-   **`p`**: Power parameter for the Minkowski distance metric: `p=1`
    gives Manhattan distance and `p=2` (the default) gives Euclidean
    distance.
-   **`leaf_size`**: Affects the speed of tree-based algorithms (only
    for `ball_tree` or `kd_tree`).
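
To make the link between `metric` and `p` concrete, here is a minimal sketch that checks whether Minkowski with `p=1` and `p=2` returns the same neighbor distances as `'manhattan'` and `'euclidean'`. The Iris data is only a convenient stand-in, and the helper `neighbor_distances` is a made-up name for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
queries = X[:10]  # a few points to look up neighbors for


def neighbor_distances(**knn_kwargs):
    """Fit a 5-NN model and return the neighbor distances for `queries`."""
    knn = KNeighborsClassifier(n_neighbors=5, **knn_kwargs).fit(X, y)
    distances, _ = knn.kneighbors(queries)
    return distances


# Minkowski with p=1 should reproduce Manhattan, and p=2 should reproduce Euclidean
p1_matches_manhattan = np.allclose(
    neighbor_distances(metric="minkowski", p=1),
    neighbor_distances(metric="manhattan"),
)
p2_matches_euclidean = np.allclose(
    neighbor_distances(metric="minkowski", p=2),
    neighbor_distances(metric="euclidean"),
)
print(p1_matches_manhattan, p2_matches_euclidean)
```

In practice this means tuning `p` over `[1, 2]` with the default metric already covers both Manhattan and Euclidean distance, so listing them again under `metric` would duplicate work.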

**Tuning Method**:

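-   **Grid Search**: Exhaustively evaluate every combination of the
    candidate values you list for `n_neighbors`, `weights`, `metric`,
    etc.
-   **Randomized Search**: Sample a fixed number of combinations from
    wider ranges, which is cheaper when the search space is large.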
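
As a rough sketch of the randomized option, the snippet below samples 15 settings instead of trying every combination. The dataset (scikit-learn's built-in breast cancer data), the ranges, and `n_iter=15` are all assumptions chosen for illustration, not recommended values.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_distributions = {
    "kneighborsclassifier__n_neighbors": randint(1, 31),  # integers 1..30
    "kneighborsclassifier__weights": ["uniform", "distance"],
    "kneighborsclassifier__p": [1, 2],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=15,        # number of sampled settings (illustrative)
    cv=5,
    random_state=0,   # makes the sampled settings reproducible
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```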

**Technique**:

-   **GridSearchCV** or **RandomizedSearchCV** from
    `sklearn.model_selection`.
-   **Cross-validation**: Use K-fold cross-validation to pick the
    values that generalize well. (The K in K-fold is the number of
    folds, not `n_neighbors`.)
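
Putting the pieces above together, a grid search with cross-validation might look like the sketch below. The grid values, the 5-fold setup, and the use of Iris are assumptions for illustration; the scaler sits inside a `Pipeline` so each fold is scaled using only its own training portion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set that the search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Keys are "<step name>__<parameter name>"
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

The grid here has 5 × 2 × 2 = 20 combinations, each fit 5 times, which is cheap for KNN on a small dataset but grows quickly as more parameters are added.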

## Advantages

| Good Stuff                                 | Why It Rocks                              |
|------------------------------------|------------------------------------|
| **Almost no training time**                | `fit` just stores the data (or an index)  |
| **Simple & intuitive**                     | Neighbors vote — that’s it                |
| **Works with any number of classes**       | Binary, multiclass — no problem           |
| **Adapts to data**                         | Complex boundaries? It just works         |
| **No assumptions about data distribution** | Unlike Naive Bayes or Logistic Regression |
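
One way to see the "adapts to data" row in action is to compare KNN with a linear model on data whose classes are not linearly separable. The sketch below uses the synthetic `make_moons` dataset; the sample size, noise level, and `n_neighbors=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Two interleaving half-circles: a curved class boundary
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# 5-fold cross-validated accuracy for each model
knn_acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()
logreg_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print(f"KNN accuracy:                 {knn_acc:.3f}")
print(f"Logistic Regression accuracy: {logreg_acc:.3f}")
```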

------------------------------------------------------------------------

## Disadvantages

| Bad Stuff                            | Why It Hurts                                     |
|------------------------------------|------------------------------------|
| **Slow at prediction time**          | Brute force scans all training points per query  |
| **Memory hog**                       | Stores entire dataset in RAM                     |
| **Sensitive to irrelevant features** | Select features well; noise distorts distances   |
| **Needs scaling (standardization)**  | Because distances are affected by scale          |
| **K selection is tricky**            | Too low? Noisy. Too high? Biased.                |
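
The scaling and K-selection rows can be checked empirically. The sketch below, which assumes scikit-learn's built-in Wine dataset as a stand-in (its features sit on very different scales), compares cross-validated accuracy with and without standardization for a few arbitrary values of `n_neighbors`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

results = {}  # k -> (accuracy without scaling, accuracy with scaling)
for k in (1, 5, 15, 45):
    raw_model = KNeighborsClassifier(n_neighbors=k)
    scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = (
        cross_val_score(raw_model, X, y, cv=5).mean(),
        cross_val_score(scaled_model, X, y, cv=5).mean(),
    )
    print(f"k={k:>2}  raw={results[k][0]:.3f}  scaled={results[k][1]:.3f}")
```

On this dataset the scaled pipeline should come out ahead at every `k`, while the gap between different `k` values shows why `n_neighbors` is worth tuning rather than guessing.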

## Worked Examples

-   Iris

    ``` python
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load data
    iris = load_iris()
    X = iris.data
    y = iris.target

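    # Train-test split first, so the scaler never sees the test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Scale: fit on the training set only, then apply to both sets
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)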

    # Model
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    # Predict & Evaluate
    y_pred = model.predict(X_test)
    print(f"Iris Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
    plt.title("Confusion Matrix - Iris")
    plt.show()
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
    ```

-   Breast Cancer Wisconsin

    ``` python
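    # Reuses the imports from the Iris example
    df = pd.read_csv('breast_cancer.csv')
    # Drop the 'id' column and any 'Unnamed' columns; anchoring the pattern
    # avoids accidentally dropping features that merely contain "id"
    df = df.loc[:, ~df.columns.str.contains(r'^Unnamed|^id$', case=False)].copy()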
    df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

    X = df.drop('diagnosis', axis=1)
    y = df['diagnosis']

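    # Split first, then fit the scaler on the training set only to avoid leakage
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)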

    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print(f"Breast Cancer Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
    plt.title("Confusion Matrix - Breast Cancer")
    plt.show()
    print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))
    ```

-   Titanic

    ``` python
    df = pd.read_csv('titanic.csv')

    # Sample preprocessing (tweak as needed)
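    # Select columns, drop rows with missing values, and take an explicit copy
    # so the assignment below does not trigger SettingWithCopyWarning
    df = df[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna().copy()
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})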

    X = df.drop('Survived', axis=1)
    y = df['Survived']

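    # Split first, then fit the scaler on the training set only to avoid leakage
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)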

    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print(f"Titanic Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues', xticklabels=['Died', 'Survived'], yticklabels=['Died', 'Survived'])
    plt.title("Confusion Matrix - Titanic")
    plt.show()
    print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))
    ```