## **KNN for Anomaly Detection**

## Step 1: Import the Required Libraries and Load the Data

- Import **pandas**, **NumPy**, and **matplotlib.pyplot**, plus the **NearestNeighbors** class from **sklearn.neighbors**
- Load the **iris** dataset and create a DataFrame with only **sepal_length** and **sepal_width** columns


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
df = data[["sepal_length", "sepal_width"]]
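
If the URL above is unreachable, a similar DataFrame can be built from the copy of iris bundled with scikit-learn. This is only a sketch and not the notebook's own loading step: the name `df_offline` is hypothetical, the columns are renamed here to match the names used above, and a few values in scikit-learn's copy may differ slightly from the CSV.

```python
from sklearn.datasets import load_iris

# load_iris(as_frame=True) returns a Bunch whose .frame attribute is a DataFrame
iris = load_iris(as_frame=True).frame

# scikit-learn names the columns "sepal length (cm)" etc.;
# rename them to the names this notebook uses (an assumption made for consistency)
df_offline = iris.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
})[["sepal_length", "sepal_width"]]

print(df_offline.shape)
```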

## Step 2: Plot the Input Data

- Create a scatter plot of the input data


In [None]:
plt.scatter(df["sepal_length"], df["sepal_width"])

__Observations:__
- The scatter plot shows sepal width against sepal length for every observation.
- Some points appear to sit apart from the rest and may be anomalies, so let's use k-nearest neighbors (KNN) distances to identify them.

## Step 3: Instantiate and Fit the Nearest Neighbors Model

- Convert the DataFrame to a NumPy array for the model
- Instantiate the NearestNeighbors model with 3 neighbors
- Fit the model to the input data

In [None]:
X = df.values

In [None]:
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(X)
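
To make concrete what a nearest-neighbors query computes, here is a brute-force sketch in plain NumPy on a four-point toy array. The helper `knn_distances` and the `toy` data are hypothetical and not part of scikit-learn; the sketch assumes Euclidean distance and queries the same points it searches, so each point finds itself first at distance 0.

```python
import numpy as np

def knn_distances(X, k):
    """Brute-force Euclidean distances to the k nearest points, self included."""
    # Pairwise differences via broadcasting: shape (n, n, d)
    diffs = X[:, None, :] - X[None, :, :]
    dist_matrix = np.sqrt((diffs ** 2).sum(axis=2))
    # For each row, keep the positions of the k smallest distances
    order = np.argsort(dist_matrix, axis=1)[:, :k]
    return np.take_along_axis(dist_matrix, order, axis=1), order

# Three points close together and one far away
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
toy_distances, toy_indices = knn_distances(toy, k=3)

print(toy_distances)               # first column is all zeros (each point matches itself)
print(toy_distances.mean(axis=1))  # the far point has by far the largest mean distance
```

The brute-force version costs O(n²) memory, which is fine for a toy array but is one reason to prefer a fitted `NearestNeighbors` model on larger data.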

## Step 4: Calculate the Mean Distances and Determine the Cutoff Value

- Get the distances and indices of the k nearest neighbors of each observation from the fitted model
- Calculate the mean of the k distances for each observation
- Plot the mean distances
- Choose a cutoff value for outliers by inspecting the plot (here, a mean distance > 0.15)


In [None]:
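# One row per observation: distances to, and row positions of, its 3 nearest neighbors
distances, indexes = nbrs.kneighbors(X)
# Plot the mean neighbor distance of each observation against its row position
plt.plot(distances.mean(axis=1))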

__Observations:__
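- The plot shows, for each observation (by row position), the mean distance to its k nearest neighbors.
- Because the model is queried with the same data it was fit on, the first distance in each row is 0 (the observation itself or an exact duplicate), so each mean includes that zero.
- Any observation whose mean distance exceeds 0.15 is considered an anomaly.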
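
The 0.15 cutoff is read off the plot by eye. As one possible alternative, and an assumption rather than part of this notebook's method, the cutoff can be derived from the scores themselves, for example as an upper quantile. The helper `suggest_cutoff`, the sample scores, and the 0.8 quantile below are all hypothetical.

```python
import numpy as np

def suggest_cutoff(mean_distances, quantile=0.95):
    """Return the given quantile of the mean distances as a candidate cutoff."""
    return float(np.quantile(mean_distances, quantile))

# Made-up mean distances: mostly small, with two clearly larger values
example_scores = np.array([0.05, 0.06, 0.07, 0.05, 0.08,
                           0.06, 0.30, 0.07, 0.05, 0.40])

cutoff = suggest_cutoff(example_scores, quantile=0.8)
flagged = np.flatnonzero(example_scores > cutoff)
print(cutoff, flagged)
```

A quantile rule always flags a fixed fraction of the data, so it suits cases where the expected share of anomalies is roughly known; a visual cutoff suits cases where the plot shows a clear gap.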


In [None]:
# Row positions of observations whose mean neighbor distance exceeds the 0.15 cutoff
outlier_index = np.where(distances.mean(axis=1) > 0.15)
outlier_index

__Observation:__
- The output is a tuple holding one array: the row positions of the observations flagged as anomalies.
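
The shape of this output is easy to misread, so here is a small sketch with made-up scores. When `np.where` is called with only a condition, it returns a tuple with one index array per dimension; for 1-D input, that is a one-element tuple. The variable names are hypothetical.

```python
import numpy as np

sample_scores = np.array([0.05, 0.20, 0.10, 0.30])

result = np.where(sample_scores > 0.15)
print(type(result), len(result))  # a tuple of length 1 for 1-D input

# The row positions themselves are the first (and only) element
rows = result[0]

# np.flatnonzero returns the same positions directly as an array
same_rows = np.flatnonzero(sample_scores > 0.15)
print(rows, same_rows)
```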

## Step 5: Filter and Plot the Outlier Values

- Filter the outlier values from the original data
- Plot the original data and the outlier values in different colors


In [None]:
outlier_values = df.iloc[outlier_index]
outlier_values

__Observation:__
- These are the sepal length and sepal width values of the observations flagged as anomalies.

In [None]:
# All observations in blue (larger markers), with the flagged outliers overlaid in red
plt.scatter(df["sepal_length"], df["sepal_width"], color="b", s=65)
plt.scatter(outlier_values["sepal_length"], outlier_values["sepal_width"], color="r")

__Observations:__
- The scatter plot shows the anomalies identified from the k-nearest-neighbors distances.
- The anomalies are drawn in red on top of the full dataset in blue.
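
The steps above can be gathered into one function, sketched here on synthetic data rather than iris. The function `knn_outliers` and its `exclude_self` flag are hypothetical. With `exclude_self=True`, the sketch calls `kneighbors()` with no argument, in which case scikit-learn does not count a training point as its own neighbor. The 0.15 default mirrors the cutoff used above and would likely need recalibrating for other data or when the self-distance is excluded.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outliers(X, k=3, cutoff=0.15, exclude_self=False):
    """Flag rows whose mean distance to their k nearest neighbors exceeds cutoff."""
    model = NearestNeighbors(n_neighbors=k).fit(X)
    if exclude_self:
        # No argument: neighbors of the training points, query point not included
        distances, _ = model.kneighbors()
    else:
        # Same call as in the notebook: the nearest "neighbor" is at distance 0
        distances, _ = model.kneighbors(X)
    scores = distances.mean(axis=1)
    return np.flatnonzero(scores > cutoff), scores

# Synthetic data: a 5x5 grid with spacing 0.1, plus one distant point at (3, 3)
ticks = np.arange(5) * 0.1
grid = np.array([[x, y] for x in ticks for y in ticks])
points = np.vstack([grid, [[3.0, 3.0]]])

idx_incl, scores_incl = knn_outliers(points)
idx_excl, scores_excl = knn_outliers(points, exclude_self=True)
print(idx_incl, idx_excl)
```

Excluding the self-distance raises every score, since the zero is replaced by the next-nearest distance. The ranking of the points often stays similar, but the cutoff is not directly comparable between the two variants.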

